




















An analysis appropriate for a quantitative outcome and a single quantitative ex- planatory variable.
When we are examining the relationship between a quantitative outcome and a single quantitative explanatory variable, simple linear regression is the most commonly considered analysis method. (The "simple" part tells us we are only considering a single explanatory variable.) In linear regression we usually have many different values of the explanatory variable, and we usually assume that values between the observed values of the explanatory variable are also possible values of the explanatory variable. We postulate a linear relationship between the population mean of the outcome and the value of the explanatory variable. If we let Y be some outcome, and x be some explanatory variable, then we can express the structural model using the equation
E(Y|x) = β0 + β1x
where E(), which is read "expected value of", indicates a population mean; Y|x, which is read "Y given x", indicates that we are looking at the possible values of Y when x is restricted to some single value; β0, read "beta zero", is the intercept parameter; and β1, read "beta one", is the slope parameter. A common term for any parameter or parameter estimate used in an equation for predicting Y from x is coefficient. Often the "1" subscript in β1 is replaced by the name of the explanatory variable or some abbreviation of it.
So the structural model says that for each value of x the population mean of Y (over all of the subjects who have that particular value x for their explanatory variable) can be calculated using the simple linear expression β0 + β1x. Of course we cannot make the calculation exactly, in practice, because the two parameters are unknown "secrets of nature". In practice, we make estimates of the parameters and substitute the estimates into the equation.
In real life we know that although the equation makes a prediction of the true mean of the outcome for any fixed value of the explanatory variable, it would be unwise to use extrapolation to make predictions outside of the range of x values that we have available for study. On the other hand it is reasonable to interpolate, i.e., to make predictions for unobserved x values in between the observed x values. The structural model is essentially the assumption of “linearity”, at least within the range of the observed explanatory data.
It is important to realize that the "linear" in "linear regression" does not imply that only linear relationships can be studied. Technically it only says that the betas themselves must not appear in a transformed form, i.e., the model must be linear in the parameters. It is OK to transform x or Y, and that allows many non-linear relationships to be represented on a new scale that makes the relationship linear.
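For example, an exponential relationship E(Y|x) = a e^(bx) becomes a straight line after taking logs of Y. Here is a rough Python sketch with made-up parameter values (this chapter's own analyses use SPSS, not Python); it fits an ordinary least-squares line to log(Y):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up curved relationship: Y = 2 * exp(0.3 x) with multiplicative noise
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.1, size=x.size)

# A straight-line fit to log(Y) vs. x is an ordinary linear regression on the new scale
fit = stats.linregress(x, np.log(y))
print(f"intercept on log scale: {fit.intercept:.2f}  (log(2) = {np.log(2):.2f})")
print(f"slope on log scale:     {fit.slope:.2f}  (true value 0.3)")
```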
The structural model underlying a linear regression analysis is that the explanatory and outcome variables are linearly related such that the population mean of the outcome for any x value is β 0 + β 1 x.
The error model that we use is that for each particular x, if we have or could collect many subjects with that x value, their distribution around the population mean is Gaussian with a spread, say σ², that is the same value for each value of x (and corresponding population mean of y). Of course, the value of σ² is an unknown parameter, and we can make an estimate of it from the data. The error model described so far includes not only the assumptions of "Normality" and "equal variance", but also the assumption of "fixed-x". The "fixed-x" assumption is that the explanatory variable is measured without error. Sometimes this is possible, e.g., if it is a count, such as the number of legs on an insect, but usually there is some error in the measurement of the explanatory variable. In practice, we consider the fixed-x assumption to be reasonably met if the measurement error in x is small relative to the range of the x values studied.

The final part of the error model is the assumption of independent errors: one subject's error tells us nothing about any other subject's error. This assumption is most often broken when observations are collected over time, e.g., by repeatedly measuring the same variable each time; serial correlation is then extremely likely. Breaking the assumption of independent errors does not indicate that no analysis is possible, only that linear regression is an inappropriate analysis. Other methods such as time series methods or mixed models are appropriate when errors are correlated.
The worst case of breaking the independent errors assumption in regression is when the observations are repeated measurements on the same experimental unit (subject).
Before going into the details of linear regression, it is worth thinking about the variable types for the explanatory and outcome variables and the relationship of ANOVA to linear regression. For both ANOVA and linear regression we assume a Normal distribution of the outcome for each value of the explanatory variable. (It is equivalent to say that all of the errors are Normally distributed.) Implicitly this indicates that the outcome should be a continuous quantitative variable. Practically speaking, real measurements are rounded and therefore some of their continuous nature is not available to us. If we round too much, the variable is essentially discrete and, with too much rounding, can no longer be approximated by the smooth Gaussian curve. Fortunately regression and ANOVA are both quite robust to deviations from the Normality assumption, and it is OK to use discrete or continuous outcomes that have at least a moderate number of different values, e.g., 10 or more. It can even be reasonable in some circumstances to use regression or ANOVA when the outcome is ordinal with a fairly small number of levels.
The explanatory variable in ANOVA is categorical and nominal. Imagine we are studying the effects of a drug on some outcome and we first do an experiment comparing control (no drug) vs. drug (at a particular concentration). Regression and ANOVA would give equivalent conclusions about the effect of drug on the outcome, but regression seems inappropriate. Two related reasons are that there is no way to check the appropriateness of the linearity assumption, and that after a regression analysis it is appropriate to interpolate between the x (dose) values, and that is inappropriate here.
Now consider another experiment with 0, 50 and 100 mg of drug. Now ANOVA and regression give different answers because ANOVA makes no assumptions about the relationships of the three population means, but regression assumes a linear relationship. If the truth is linearity, the regression will have a bit more power
than ANOVA. If the truth is non-linearity, regression will make inappropriate predictions, but at least regression will have a chance to detect the non-linearity. ANOVA also loses some power because it incorrectly treats the doses as nominal when they are at least ordinal. As the number of doses increases, it is more and more appropriate to use regression instead of ANOVA, and we will be able to better detect any non-linearity and correct for it, e.g., with a data transformation.

[Figure 9.1: Mnemonic for the simple regression model. The plot shows x on the horizontal axis and Y on the vertical axis, with Normal curves of equal spread centered on a straight line.]
Figure 9.1 shows a way to think about and remember most of the regression model assumptions. The four little Normal curves represent the Normally dis- tributed outcomes (Y values) at each of four fixed x values. The fact that the four Normal curves have the same spreads represents the equal variance assump- tion. And the fact that the four means of the Normal curves fall along a straight line represents the linearity assumption. Only the fifth assumption of independent errors is not shown on this mnemonic plot.
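To make the assumptions concrete, here is a small Python sketch (arbitrary made-up parameter values, not the corn data) that simulates data satisfying the structural and error models exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary "secrets of nature", chosen only for illustration
beta0, beta1, sigma = 3.0, 1.5, 2.0

# Fixed-x: the explanatory values are set exactly, with no measurement error
x = np.repeat([2.0, 4.0, 6.0, 8.0], 25)

# For each x, Y is Normal with mean beta0 + beta1*x and the same spread sigma
# (linearity, Normality, equal variance); the errors are drawn independently
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
```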
Consider a corn experiment in which pots of corn plants are grown with different amounts of nitrogen fertilizer added, ranging from 0 to 100 mg per pot, and the final plant weight in grams is recorded as the outcome.

[Figure 9.2: Scatterplot of corn data. Soil Nitrogen (mg/pot), 0 to 100, on the horizontal axis vs. Final Weight (gm), roughly 100 to 600, on the vertical axis.]
EDA, in the form of a scatterplot, is shown in figure 9.2. We want to use EDA to check that the assumptions are reasonable before trying a regression analysis. We can see that the assumption of linearity seems plausible because we can imagine a straight line from bottom left to top right going through the center of the points. Also the assumption of equal spread is plausible because for any narrow range of nitrogen values (horizontally), the spread of weight values (vertically) is fairly similar. These assumptions should only be doubted at this stage if they are drastically broken. The assumption of Normality is not something that human beings can test by looking at a scatterplot. But if we noticed, for instance, that there were only two possible outcomes in the whole experiment, we could reject the idea that the distribution of weights is Normal at each nitrogen level.
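A scatterplot like figure 9.2 takes only a few lines in Python; the file name and column names below are hypothetical stand-ins for the corn data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical file and column names standing in for the corn data
corn = pd.read_csv("corn.csv")        # assumed columns: nitrogen, weight

plt.scatter(corn["nitrogen"], corn["weight"])
plt.xlabel("Soil Nitrogen (mg/pot)")
plt.ylabel("Final Weight (gm)")
plt.show()
```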
The assumption of fixed-x cannot be seen in the data. Usually we just think about the way the explanatory variable is measured and judge whether or not it is measured precisely (with small spread). Here, it is not too hard to measure the amount of nitrogen fertilizer added to each pot, so we accept the assumption of
fixed-x. In some cases, we can actually perform repeated measurements of x on the same case to see the spread of x and then do the same thing for y at each of a few values, then reject the fixed-x assumption if the ratio of x to y variance is larger than, e.g., around 0.1.
The assumption of independent error is usually not visible in the data and must be judged by the way the experiment was run. But if serial correlation is suspected, there are tests such as the Durbin-Watson test that can be used to detect such correlation.
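If the observations have a natural time order, a rough check for serial correlation can be made in Python with the Durbin-Watson statistic from statsmodels (the data below are simulated; values near 2 are consistent with independent errors):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = np.arange(50, dtype=float)                   # observations in time order
y = 5.0 + 0.8 * x + rng.normal(0, 3, size=50)    # simulated, independent errors

model = sm.OLS(y, sm.add_constant(x)).fit()      # add_constant supplies the intercept column
dw = durbin_watson(model.resid)                  # ~2 when errors are uncorrelated;
print(f"Durbin-Watson statistic: {dw:.2f}")      # values near 0 or 4 suggest serial correlation
```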
Once we make an initial judgement that linear regression is not a stupid thing to do for our data, based on plausibility of the model after examining our EDA, we perform the linear regression analysis, then further verify the model assumptions with residual checking.
The basic regression analysis uses fairly simple formulas to get estimates of the parameters β0, β1, and σ². These estimates can be derived from either of two basic approaches which lead to identical results. We will not discuss the more complicated maximum likelihood approach here. The least squares approach is fairly straightforward. It says that we should choose as the best-fit line the line that minimizes the sum of the squared residuals, where the residuals are the vertical distances from individual points to the best-fit "regression" line.
The principle is shown in figure 9.3. The plot shows a simple example with four data points. The diagonal line shown in black is close to, but not equal to the “best-fit” line.
Any line can be characterized by its intercept and slope. The intercept is the y value when x equals zero, which is 1.0 in the example. Be sure to look carefully at the x-axis scale; if it does not start at zero, you might read off the intercept incorrectly. The slope is the change in y for a one-unit change in x. Because the line is straight, you can read this off anywhere. Also, an equivalent definition is the change in y divided by the change in x for any segment of the line. In the figure, a segment of the line is marked with a small right triangle. The vertical change is 2 units and the horizontal change is 1 unit, therefore the slope is 2/1=2. Using b 0 for the intercept and b 1 for the slope, the equation of the line is y = b 0 + b 1 x.
By plugging different values for x into this equation we can find the corresponding y values that are on the line drawn. For any given b0 and b1 we get a potential best-fit line, and the vertical distances of the points from the line are called the residuals. We can use the symbol ŷi, pronounced "y hat sub i", where "sub" means subscript, to indicate the fitted or predicted value of outcome y for subject i. (Some people also use y′i, "y-prime sub i".) For subject i, who has explanatory variable xi, the prediction is ŷi = b0 + b1xi and the residual is yi − ŷi. The least squares principle says that the best-fit line is the one with the smallest sum of squared residuals. It is interesting to note that the sum of the residuals (not squared) is zero for the least-squares best-fit line.
In practice, we don’t really try every possible line. Instead we use calculus to find the values of b 0 and b 1 that give the minimum sum of squared residuals. You don’t need to memorize or use these equations, but here they are in case you are interested.
$$ b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x} $$

Also, the best estimate of σ² is

$$ s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}. $$
Whenever we ask a computer to perform simple linear regression, it uses these equations to find the best fit line, then shows us the parameter estimates. Sometimes the symbols β̂0 and β̂1 are used instead of b0 and b1. Even though these symbols have Greek letters in them, the "hat" over the beta tells us that we are dealing with statistics, not parameters.
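The closed-form formulas above are easy to verify directly. This Python sketch (simulated data with roughly corn-like numbers, not the actual corn data set) computes b1, b0, and s² by hand and checks them against scipy's built-in routine:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=24)
y = 95 + 5.3 * x + rng.normal(0, 40, size=x.size)   # simulated, corn-like numbers

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x                        # fitted values
resid = y - yhat                          # residuals; they sum to (essentially) zero
s2 = np.sum(resid ** 2) / (len(x) - 2)    # estimate of sigma^2

fit = stats.linregress(x, y)              # built-in routine should agree
print(f"b1 = {b1:.3f} vs {fit.slope:.3f},  b0 = {b0:.2f} vs {fit.intercept:.2f},  s^2 = {s2:.1f}")
```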
Here are the derivations of the coefficient estimates. SSR indicates sum of squared residuals, the quantity to minimize.
$$ SSR = \sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 \tag{9.1} $$
$$ = \sum_{i=1}^{n}\left(y_i^2 - 2y_i(\beta_0 + \beta_1 x_i) + \beta_0^2 + 2\beta_0\beta_1 x_i + \beta_1^2 x_i^2\right) \tag{9.2} $$

$$ \frac{\partial SSR}{\partial \beta_0} = \sum_{i=1}^{n}\left(-2y_i + 2\beta_0 + 2\beta_1 x_i\right) \tag{9.3} $$
$$ 0 = \sum_{i=1}^{n}\left(-y_i + \hat\beta_0 + \hat\beta_1 x_i\right) \tag{9.4} $$
$$ 0 = -n\bar{y} + n\hat\beta_0 + \hat\beta_1 n\bar{x} \tag{9.5} $$
$$ \hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} \tag{9.6} $$

$$ \frac{\partial SSR}{\partial \beta_1} = \sum_{i=1}^{n}\left(-2x_i y_i + 2\beta_0 x_i + 2\beta_1 x_i^2\right) \tag{9.7} $$
$$ 0 = -\sum_{i=1}^{n} x_i y_i + \hat\beta_0\sum_{i=1}^{n} x_i + \hat\beta_1\sum_{i=1}^{n} x_i^2 \tag{9.8} $$
$$ 0 = -\sum_{i=1}^{n} x_i y_i + (\bar{y} - \hat\beta_1\bar{x})\sum_{i=1}^{n} x_i + \hat\beta_1\sum_{i=1}^{n} x_i^2 \tag{9.9} $$
$$ \hat\beta_1 = \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i(x_i - \bar{x})} $$
A little algebra shows that this formula for β̂1 is equivalent to the one shown above, because $c\sum_{i=1}^{n}(z_i - \bar{z}) = c \cdot 0 = 0$ for any constant c and variable z. In multiple regression, the matrix formula for the coefficient estimates is $(X'X)^{-1}X'y$, where X is the matrix with all ones in the first column (for the intercept) and the values of the explanatory variables in subsequent columns.
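In Python the matrix formula can be evaluated directly, although a least-squares solver is normally used for numerical stability; a sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=30)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=x.size)   # simulated data

X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept, then x

# Stable least-squares solution of the same problem as (X'X)^{-1} X'y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                             # [b0, b1]

# Direct (less numerically stable) evaluation of the textbook formula
print(np.linalg.inv(X.T @ X) @ X.T @ y)
```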
Because the intercept and slope estimates are statistics, they have sampling distributions, and these are determined by the true values of β 0 , β 1 , and σ^2 , as well as the positions of the x values and the number of subjects at each x value. If the model assumptions are correct, the sampling distributions of the intercept and slope estimates both have means equal to the true values, β 0 and β 1 , and are Normally distributed with variances that can be calculated according to fairly simple formulas which involve the x values and σ^2.
In practice, we have to estimate σ² with s². This has two consequences. First, we talk about the standard errors of the sampling distributions of each of the betas rather than their standard deviations; and second, the sampling distributions used for inference about the betas are t-distributions with n − 2 degrees of freedom rather than Normal distributions.
                 Unstandardized Coefficients                      95% Confidence Interval for B
                 B        Std. Error   t        Sig.      Lower Bound   Upper Bound
(Constant)       94.821   18.116       4.682    .000      47.251        122.
Nitrogen added   5.269    .299         17.610   .000      4.684         5.

Table 9.1: Regression results for the corn experiment.
The number 94.821 is the intercept estimate (labeled "Constant" in the SPSS output). It can be written as b0, β̂0, or rarely B0, but β0 is incorrect, because the parameter value β0 is a fixed, unknown "secret of nature". (Usually we should just say that b0 equals 94.8 because the original data and most experimental data have at most 3 significant figures.)
The number 5.269 is the slope estimate, estimated slope coefficient, slope estimate for nitrogen added, or coefficient estimate for nitrogen added, and can be written as b1, β̂1, or rarely B1, but β1 is incorrect. Sometimes symbols such as βnitrogen or βN for the parameter and bnitrogen or bN for the estimate will be used as better, more meaningful names, especially when dealing with multiple explanatory variables in multiple (as opposed to simple) regression.
To the right of the intercept and slope coefficients you will find their standard errors. As usual, standard errors are estimated standard deviations of the corresponding sampling distributions. For example, the SE of 0.299 for BN gives an idea of the scale of the variability of the estimate BN, which is 5.269 here but will vary with a standard deviation of approximately 0.299 around the true, unknown value of βN if we repeat the whole experiment many times. The two t-statistics are calculated by all computer programs using the default null hypotheses of H0: βj = 0 according to the general t-statistic formula
$$ t_j = \frac{b_j - \text{hypothesized value of } \beta_j}{SE(b_j)} $$
Then the computer uses the null sampling distributions of the t-statistics, i.e., the t-distribution with n − 2 df, to compute the 2-sided p-values as the areas under the null sampling distribution more extreme (farther from zero) than the coefficient estimates for this experiment. SPSS reports this as "Sig.", and as usual gives the misleading output ".000" when the p-value is really "< 0.0005".
In simple regression the p-value for the null hypothesis H0: β1 = 0 comes from the t-test for b1. If applicable, a similar test is made for β0.
SPSS also gives Standardized Coefficients (not shown here). These are the coefficient estimates obtained when both the explanatory and outcome variables are converted to so-called Z-scores by subtracting their means then dividing by their standard deviations. Under these conditions the intercept estimate is zero, so it is not shown. The main use of standardized coefficients is to allow comparison of the importance of different explanatory variables in multiple regression by showing the comparative effects of changing the explanatory variables by one standard deviation instead of by one unit of measurement. I rarely use standardized coefficients.
The output above also shows the "95% Confidence Interval for B" which is generated in SPSS by clicking "Confidence Intervals" under the "Statistics" button. In the given example we can say "we are 95% confident that βN is between 4.68 and 5.89." More exactly, we know that using the method of construction of coefficient estimates and confidence intervals detailed above, and if the assumptions of regression are met, then each time we perform an experiment in this setting we will get a different confidence interval (center and width), and out of many confidence intervals 95% of them will contain βN and 5% of them will not.
The confidence interval for β 1 gives a meaningful measure of the loca- tion of the parameter and our uncertainty about that location, regard- less of whether or not the null hypothesis is true. This also applies to β 0.
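These quantities can be reproduced by hand using the standard formula SE(b1) = s/√(Σ(xi − x̄)²), which is not shown in the text above but follows from the sampling distribution of b1. A Python sketch on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 24
x = rng.uniform(0, 100, size=n)
y = 95 + 5.3 * x + rng.normal(0, 40, size=n)     # simulated, corn-like numbers

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)  # estimate of sigma^2

se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))
t1 = (b1 - 0) / se_b1                            # t statistic for H0: beta1 = 0
p1 = 2 * stats.t.sf(abs(t1), df=n - 2)           # two-sided p-value

tcrit = stats.t.ppf(0.975, df=n - 2)             # multiplier for a 95% CI
lo, hi = b1 - tcrit * se_b1, b1 + tcrit * se_b1
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t1:.2f}, p = {p1:.4g}, 95% CI = ({lo:.2f}, {hi:.2f})")
```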
It is very important that you learn to correctly and completely interpret the coefficient estimates. From E(Y|x) = β0 + β1x we can see that b0 represents our estimate of the mean outcome when x = 0. Before making an interpretation of b0, check whether x = 0 is within (or at least near) the range of the x values actually studied; if it is not, the intercept is an extrapolation and should not be given a substantive interpretation, although it is still needed for computing fitted values.

For a randomized experiment, the slope is interpreted in terms of causation. Be sure to substitute association if you are looking at an observational study. The general meaning of a slope coefficient is the change in Y caused by a one-unit increase in x. It is very important to know in what units x is measured, so that the meaning of a one-unit increase can be clearly expressed. For the corn experiment, the slope is the change in mean corn plant weight (in grams) caused by a one mg increase in nitrogen added per pot. If a one-unit change is not substantively meaningful, the effect of a larger change should be used in the interpretation. For the corn example we could say that a 10 mg increase in nitrogen added causes a 52.7 gram increase in plant weight on average. We can also interpret the CI for β1 in the corn experiment by saying that we are 95% confident that the change in mean plant weight caused by a 10 mg increase in nitrogen is 46.8 to 58.9 gm.
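The rescaling is just multiplication of the per-mg estimate and its confidence limits from Table 9.1 by 10:

$$ 10 \times 5.269 \approx 52.7 \text{ gm}, \qquad 10 \times (4.68,\ 5.89) = (46.8,\ 58.9) \text{ gm}. $$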
Be sure to pay attention to the sign of b1. If it is positive then b1 represents the increase in outcome caused by each one-unit increase in the explanatory variable. If b1 is negative, then each one-unit increase in the explanatory variable is associated with a fall in outcome of magnitude equal to the absolute value of b1.

A significant p-value indicates that we should reject the null hypothesis that β1 = 0. We can express this as evidence that plant weight is affected by changes in nitrogen added. If the null hypothesis is retained, we should express this as having no good evidence that nitrogen added affects plant weight. Particularly in the case where we retain the null hypothesis, the interpretation of the CI for β1 is better than simply relying on the general meaning of retain.

The interpretation of b1 is the change (increase or decrease depending on the sign) in the average outcome when the explanatory variable increases by one unit. This should always be stated in terms of the actual variables of the study. Retention of the null hypothesis H0: β1 = 0 indicates no evidence that a change in x is associated with (or causes for a randomized experiment) a change in y. Rejection indicates that changes in x cause changes in y (assuming a randomized experiment).
Every regression analysis should include a residual analysis as a further check on the adequacy of the chosen regression model. Remember that there is a residual value for each data point, and that it is computed as the (signed) difference yi − ŷi. A positive residual indicates a data point higher than expected, and a negative residual indicates a point lower than expected.
A residual is the deviation of an outcome from the predicted mean value for all subjects with the same value for the explanatory variable.
A plot of all residuals on the y-axis vs. the predicted values on the x-axis, called a residual vs. fit plot, is a good way to check the linearity and equal variance assumptions. A quantile-normal plot of all of the residuals is a good way to check the Normality assumption. As mentioned above, the fixed-x assumption cannot be checked with residual analysis (or any other data analysis). Serial correlation can be checked with special residual analyses, but is not visible on the two standard residual plots. The other types of correlated errors are not detected by standard residual analyses.
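Both standard residual plots are easy to produce in Python (SPSS and other packages make equivalent plots); the data here are simulated for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(0, 100, size=60)
y = 95 + 5.3 * x + rng.normal(0, 40, size=x.size)   # simulated data

slope, intercept, *_ = stats.linregress(x, y)
fitted = intercept + slope * x
resid = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.scatter(fitted, resid)                 # residual vs. fit plot
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted value")
ax1.set_ylabel("Residual")

stats.probplot(resid, dist="norm", plot=ax2)   # quantile-normal (Q-Q) plot of residuals
plt.tight_layout()
plt.show()
```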
To analyze a residual vs. fit plot, such as any of the examples shown in figure 9.4, you should mentally divide it up into about 5 to 10 vertical stripes. Then each stripe represents all of the residuals for a number of subjects who have similar predicted values. For simple regression, when there is only a single explanatory variable, similar predicted values correspond to similar values of the explanatory variable. But be careful: if the slope is negative, low x values are on the right. (Note that sometimes the x-axis is set to be the values of the explanatory variable, in which case each stripe directly represents subjects with similar x values.)
To check the linearity assumption, consider that for each x value, if the mean of Y falls on a straight line, then the residuals have a mean of zero. If we incorrectly fit a straight line to a curve, then some or most of the predicted means are incorrect, and this causes the residuals for at least specific ranges of x (or the predicted Y) to be non-zero on average. Specifically, if the data follow a simple curve, we will tend to have either a pattern of high then low then high residuals or the reverse. So the technique used to detect non-linearity in a residual vs. fit plot is to find the approximate mean of the residuals in each vertical band and check whether these band means stay near zero across the plot or instead follow a curve.
[Figure 9.5: Sample residual vs. fit plots for testing equal variance. Four panels (A, B, C, D), each plotting Residual against Fitted value.]
Applying this idea to figure 9.4, you would doubt the linearity assumption for the data that produced plots C and D. With practice you will get better at reading these plots.
To detect unequal spread, we use the vertical bands in a different way. Ideally the vertical spread of residual values is equal in each vertical band. This takes practice to judge in light of the expected variability of individual points, especially when there are few points per band. The main idea is to realize that the minimum and maximum residual in any set of data is not very robust, and tends to vary a lot from sample to sample. We need to estimate a more robust measure of spread such as the IQR. This can be done by eyeballing the middle 50% of the data. Eyeballing the middle 60 or 80% of the data is also a reasonable way to test the equal variance assumption.
Figure 9.5 shows four residual vs. fit plots, each of which shows good linearity. The red horizontal lines mark the central 60% of the residuals. Plots A and B show no evidence of unequal variance; the red lines are a similar distance apart in each band. In plot C you can see that the red lines increase in distance apart as you move from left to right. This indicates unequal variance, with greater variance at high predicted values (high x values if the slope is positive). Plot D shows a pattern with unequal variance in which the smallest variance is in the middle of the range of predicted values, with larger variance at both ends. Again, this takes practice, but you should at least recognize obvious patterns like those shown in plots C and D. And you should avoid over-reading the slight variations seen in plots A and B.
The residual vs. fit plot can be used to detect non-linearity and/or unequal variance.
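The vertical-band idea can also be checked numerically: cut the fitted values into bands and compute each band's residual mean (for linearity) and IQR (for equal spread). A rough Python sketch, not a formal test:

```python
import numpy as np

def band_check(fitted, resid, n_bands=6):
    """Summarize residuals within vertical bands of a residual vs. fit plot."""
    edges = np.quantile(fitted, np.linspace(0, 1, n_bands + 1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        r = resid[(fitted >= lo) & (fitted <= hi)]
        iqr = np.subtract(*np.percentile(r, [75, 25]))
        # Band means far from zero suggest non-linearity; IQRs that trend
        # up or down across bands suggest unequal variance.
        print(f"band [{lo:6.1f}, {hi:6.1f}]  mean={r.mean():7.2f}  IQR={iqr:7.2f}")

# Example with simulated data meeting the assumptions:
rng = np.random.default_rng(7)
x = rng.uniform(0, 100, 120)
y = 95 + 5.3 * x + rng.normal(0, 40, 120)
b1, b0 = np.polyfit(x, y, 1)              # slope, intercept
band_check(b0 + b1 * x, y - (b0 + b1 * x))
```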
The check of normality can be done with a quantile normal plot as seen in figure 9.6. Plot A shows no problem with Normality of the residuals because the points show a random scatter around the reference line (see section 4.3.4). Plot B is also consistent with Normality, perhaps showing slight skew to the left. Plot C shows definite skew to the right, because at both ends we see that several points are higher than expected. Plot D shows a severe low outlier as well as heavy tails (positive kurtosis) because the low values are too low and the high values are too high.
A quantile normal plot of the residuals of a regression analysis can be used to detect non-Normality.
No model perfectly represents the real world. It is worth learning how far we can “bend” the assumptions without breaking the value of a regression analysis.
If the linearity assumption is violated more than a fairly small amount, the regression loses its meaning. The most obvious way this happens is in the interpretation of b1. We interpret b1 as the change in the mean of Y for a one-unit increase in x.