

Simple Linear Regression
Regression expresses a relation used to predict one variable, called the response variable (or "dependent" variable), from another variable, called the predictor (or "independent") variable.
A sample question: This table represents a sample of ten trucks; for each, we have the age in years and the annual maintenance cost. We want to find a linear equation, using this information, which most closely describes ("predicts") the [average] maintenance cost of a truck, based on its age. [Notice how the language tells us the "predictor" is age and the "response" ("predicted") variable is cost.]
Truck number   Age (years)   Cost ($ thousands)
     1             1               3.
     2             2               3.
     3             2               4.
     4             2               5.
     5             2               5.
     6             3               5.
     7             4               7.
     8             4               8.
     9             5               7.
    10             5               9.
For this course, we will focus on linear regression, in which we look for the linear equation y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_k x_k which best fits our data, and will begin with "simple" linear regression, using one predictor. Thus we will have data in pairs (two different variables, a value of the predictor and a value of the response, observed on the same individual or sampling unit) and we will be looking for the equation y = b_0 + b_1 x which best describes the relation; later we will look at tests to decide whether we have evidence for a similar relation in the population, and at how to use the relation to make predictions.
The linear regression model (the theory we are using)
Our calculations (and decision of what is a "better" or "worse" fit) are based on the following model (assumptions about the population): There are two random variables X and Y. For each possible value x of X there is a probability distribution of values of Y (written Y|x) which fits the following conditions:
1. The mean of Y|x is a linear function of x: β_0 + β_1 x (the population regression line).
2. The standard deviation of Y|x is the same value σ for every x.
3. The distribution of Y|x is normal for every x.
4. The observations are independent of one another.
The regression equation
For any linear equation ŷ = b_0 + b_1 x, each data point (x_i, y_i) gives a "predicted value" ŷ_i = b_0 + b_1 x_i, and there is a residual y_i − ŷ_i which gives the error (the difference between the actual value for that point and the prediction for the point). The "line of best fit in the sense of least squares", or the "regression line for predicting y based on x", or the "OLS [short for "ordinary least squares"] line for y based on x", is the ŷ = b_0 + b_1 x for which the total of the squares of the residuals
(y_1 − ŷ_1)² + (y_2 − ŷ_2)² + ... + (y_n − ŷ_n)²  [ = (y_1 − (b_0 + b_1 x_1))² + (y_2 − (b_0 + b_1 x_2))² + ... + (y_n − (b_0 + b_1 x_n))² ]
is smallest. [If our model is correct, this is the method that will most often bring us closest to the "real" population line.] Fortunately, some work with calculus (already done for us by some nice people years ago) gives us the following equations (the "normal equations") for b_0 and b_1:
Σ y_i = n b_0 + b_1 Σ x_i
Σ x_i y_i = b_0 Σ x_i + b_1 Σ x_i²
which we can rewrite [after some clever algebra] as
slope = b_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² = (Σ x_i y_i − n x̄ ȳ) / Σ(x_i − x̄)²
intercept = b_0 = ȳ − b_1 x̄
We would not calculate these by hand but would use a statistics package (in Minitab: Stat > Regression > Regression; see the online Minitab handbook. "Response" and "Predictor" are columns containing values of those variables, y and x respectively) or a calculator (Stat > Calc > LinReg(a+bx) or Stat > Calc > LinReg(ax+b), it doesn't matter, on the TI-8x family, with the list holding the predictor first and the list holding the response second; see the online "Using your calculator for statistics" pages). The intercept formula tells us that the regression line always goes through the point (x̄, ȳ), which seems reasonable and is useful for a "cheap check". The slope formula can be rewritten (once we know about the correlation coefficient r)
in the form b_1 = r (s_y / s_x). It can be shown (though we will not prove) that if our model is correct, b_0 and b_1 are unbiased, consistent estimators of β_0 and β_1 (averaging the b_0 values obtained from all possible samples of size n gives β_0; similarly for b_1). This will be very useful when we want to carry out tests and make estimates for the population values of slope and intercept.
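To make the formulas concrete, here is a short Python sketch of the same computation (Python is not one of the course tools; this is just for illustration). The ages come from the table above, but the cost values are hypothetical stand-ins, since the costs in the table are truncated in this copy.

```python
# Least-squares slope and intercept computed directly from the formulas above.
# Ages are from the truck table; the costs are hypothetical illustrative values.
x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]                       # age (years)
y = [3.1, 3.5, 4.2, 5.0, 5.4, 5.9, 7.0, 8.1, 7.8, 9.4]   # cost ($ thousands), hypothetical

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# slope: b1 = sum of (xi - xbar)(yi - ybar) over sum of (xi - xbar)^2
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
# intercept: b0 = ybar - b1*xbar, so the line passes through (xbar, ybar)
b0 = ybar - b1 * xbar

print(f"regression line: yhat = {b0:.3f} + {b1:.3f} x")
# "Cheap check": plugging xbar into the fitted line should return ybar exactly.
```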
Coefficient of determination r²:
Looking at the sum of squares (of deviations from the mean) for y, that is, SST = Σ(y_i − ȳ)², we have the same division into pieces that we used for ANOVA, but SSR, the sum of squares for the regression [the sum of squared deviations of the predictions from ȳ], takes the place of SSTR (the sum of squares between groups; here the "groups" are defined by the x values), and SSE is the sum of the squares of the residuals (the residuals are the "error of prediction" values). So SST = SSE + SSR, that is,
Σ(y_i − ȳ)² = Σ(y_i − ŷ_i)² + Σ(ŷ_i − ȳ)²
The coefficient of determination measures the proportion of SST (the variation in y) that corresponds to (is explained by) the relation to x:
r² = SSR / SST = 1 − SSE / SST
The value of r² tells us how well the data (from the sample) fit a linear model. We need the appropriate hypothesis test (described below under "Testing the regression coefficients") to decide whether the fit is good enough to be convincing about the population.
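Continuing the hypothetical example from the sketch above, this computes the three sums of squares and checks the decomposition numerically:

```python
# Sum-of-squares decomposition SST = SSE + SSR, and r^2 = SSR/SST.
# Same hypothetical data (and fitted line) as the previous sketch.
x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.1, 3.5, 4.2, 5.0, 5.4, 5.9, 7.0, 8.1, 7.8, 9.4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]                        # predicted values
SST = sum((yi - ybar) ** 2 for yi in y)                  # total variation in y
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))     # residual ("error") SS
SSR = sum((yh - ybar) ** 2 for yh in yhat)               # regression ("explained") SS

print(f"SST = {SST:.3f}, SSE + SSR = {SSE + SSR:.3f}")   # should agree
print(f"r^2 = SSR/SST = {SSR / SST:.3f}")
```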
Correlation:
The [sample] correlation coefficient r is given by r = (sign of b_1) √(r²). (Computation from the data:
r = [1/(n − 1)] Σ [(x_i − x̄)/s_x] [(y_i − ȳ)/s_y] .)
It measures the extent to which the data points follow a straight line, and will have a value ranging from −1 (perfect match to a line with negative slope) through 0 (the points do not follow a line, though they might follow some other curve nicely) to +1 (perfect match to a line with positive slope). Like the variance, it is hard to interpret directly; the coefficient of determination r² has a more concrete meaning [but loses the information about the sign]. The correlation in the population is called ρ.
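A sketch of the computational formula on the same hypothetical data; the result should carry the sign of b_1, and its square should match the SSR/SST value above.

```python
# Sample correlation r as the sum (divided by n - 1) of products of standardized values.
from statistics import mean, stdev

x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.1, 3.5, 4.2, 5.0, 5.4, 5.9, 7.0, 8.1, 7.8, 9.4]
n = len(x)
xbar, ybar, sx, sy = mean(x), mean(y), stdev(x), stdev(y)

r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy) for xi, yi in zip(x, y)) / (n - 1)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")   # r^2 matches SSR/SST
```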
Standard error of the residuals ("σ" for the regression):
σ is estimated (from the sample) by
s_e = √( SSE / (n − 2) ) = √( Σ(y_i − ŷ_i)² / (n − 2) ) = √( Σ(y_i − (b_0 + b_1 x_i))² / (n − 2) )
This is the value Minitab and your text call "s". It is the [sample] standard deviation of the residuals; its square is the MSE. Note the n − 2 denominator: there are n − 2 (not n − 1) degrees of freedom, because we are using both b_0 and b_1 in calculating the residuals.
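A sketch of s_e on the same hypothetical data (note the n − 2 in the denominator):

```python
# Standard error of the residuals: se = sqrt(SSE / (n - 2)).
from math import sqrt

x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.1, 3.5, 4.2, 5.0, 5.4, 5.9, 7.0, 8.1, 7.8, 9.4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = sqrt(SSE / (n - 2))      # n - 2 degrees of freedom; se**2 is the MSE
print(f"se = {se:.3f}")       # the value Minitab calls "s"
```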
Repeat of the assumptions of linear regression [these will matter for tests, confidence intervals, etc.]: for each value x of X, the distribution of Y|x is normal, with mean β_0 + β_1 x and the same standard deviation σ for every x, and the observations are independent of one another.
Testing the regression coefficients – is there evidence that there is a linear relation? If there is no linear relation between X and Y, then the [population] regression coefficient β_1 is 0. Thus, to decide "is there a linear relation?" our test is
H_0: β_1 = 0 [no linear relationship between the variables – values of X are not useful for linear prediction of values of Y]
H_a: β_1 ≠ 0 [some linear relation between X and Y]
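The notes stop here at the hypotheses. As background (this formula is not given above, so treat it as an assumption about where the notes are headed), the usual test statistic is t = b_1 / SE(b_1), where SE(b_1) = s_e / √(Σ(x_i − x̄)²), compared against the t distribution with n − 2 degrees of freedom. A sketch on the same hypothetical data:

```python
# t statistic for H0: beta1 = 0, using the standard formula
#   t = b1 / (se / sqrt(Sxx)),   df = n - 2
# (background knowledge, not taken from the notes above).
from math import sqrt

x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.1, 3.5, 4.2, 5.0, 5.4, 5.9, 7.0, 8.1, 7.8, 9.4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
se = sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

t = b1 / (se / sqrt(Sxx))
print(f"t = {t:.2f} with {n - 2} degrees of freedom")
# Compare |t| with the critical value from a t table (df = n - 2),
# or use a statistics package to get the two-sided p-value.
```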