
































Topic 6 - Multiple Regression Analysis and Model Building (2011)
Study notes for Statistics 2 - Intermediate, Carleton University
OBJECTIVES: This lecture introduces Multiple Linear Regression Analysis, which helps determine the
manner in which a change in any one or more of a set of independent variables impacts a dependent
variable. It also introduces model building, or stepwise regression, using both forward selection and
backward elimination; the assumptions required for multiple linear regression; and how to diagnose
and compare models and test whether the required assumptions hold.
1. The Multiple Linear Regression Model
1a. The Model and its Assumptions
1b. Estimating the Regression Model - Ordinary Least Squares
1c. Least Squares Regression Properties, the Sum Of Squares, and Standard Errors
1d. Testing the Significance of Slope Coefficients vs the Overall Regression
2. Model Building
2a. Steps in Model Building
2b. Variations in the Multiple Linear Regression Model
2c. What is a “Good” Model?
2d. Stepwise Regression: Forward Selection and Backwards Elimination
3. Testing the OLS Assumptions
3a. Preliminary “Tests” of Assumptions on Residuals/Error Terms
3b. Multicollinearity and the Variance Inflation Factor (VIF)
4. Common Errors
5a. Short Answer Questions
5b. Long Answer Questions
5c. Solutions to Short Answer Questions
Multiple linear regression is a multivariate statistical method that examines the linear relationship
between two or more independent variables and a dependent variable. It helps answer questions like:
To what extent do age, height, bone density, diet, and exercise predict a person's body mass
index?
What factors predict economic growth?
To what extent do university, program of study, GPA, sex, age, work experience, and recent
economic growth predict post-education earnings?
What factors predict the Canadian exchange rate in the next few months?
It is used when the dependent variable is a continuous (interval or ratio) variable, and shows how
changes in the independent variables explain or predict changes in the dependent variable.
Population regression equation: y = β0 + β1x1 + β2x2 + … + βkxk + ε
Sample (or estimated) regression equation: ŷ = b0 + b1x1 + b2x2 + … + bkxk
Error term: e = y - ŷ
Meaning:
Vectors of n observations:
y = value of the dependent variable.
xi = value of the ith independent variable; i = 1, 2, …, k.
ε = population error term.
ŷ = estimated or predicted y value.
e = y - ŷ = sample error or residual.
Scalars:
β0 = population regression intercept.
βi = population regression slope coefficient associated with the ith variable, xi. This is the
amount by which y changes for every one-unit change in xi.
b0 = sample intercept = unbiased estimate of the regression intercept.
bi = sample slope coefficient associated with xi.
n = sample size; k = number of slope coefficients (k = 1 for the simple linear regression model).
The error term, ε, is the random (error) component of the model.
The term β0 + β1x1 + β2x2 + … + βkxk is the linear component.
The sample error vector, e, estimates the population error, ε.
Since there are n observations for (y, x1, x2, …, xk), we also write (yj, x1j, x2j, …, xkj), j = 1, 2, …, n, so:
Population regression equation: yj = β0 + β1x1j + β2x2j + … + βkxkj + εj
Sample (or estimated) regression equation: ŷj = b0 + b1x1j + b2x2j + … + bkxkj
Error term: ej = yj - ŷj
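As a quick illustration of these definitions, the following sketch (the coefficient and data values are made up, not taken from the course examples) computes a predicted value and residual for one observation:

```python
import numpy as np

# Hypothetical estimated coefficients: b0 (intercept), b1, b2
b = np.array([2.0, 0.5, -1.3])

# One observation of the independent variables (x1, x2) and the dependent variable y
x = np.array([4.0, 1.5])
y = 6.1

# Predicted value: y_hat = b0 + b1*x1 + b2*x2
y_hat = b[0] + b[1:] @ x

# Residual (sample error): e = y - y_hat
e = y - y_hat
print(y_hat, e)  # 2.05 and 4.05
```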
As in all previous topics, multiple linear regression models require that certain assumptions hold in order
for the estimated model to be valid and good. These can be summarized in four assumptions, the first
three of which are also necessary for a simple linear regression model:
1. For any values of (x1, x2, …, xk), ε ~ iid N(0, σ²): individual values of the error term, ε, are
independent, normally distributed, with mean 0 and constant variance, σ².
The assumption of constant variance is known as the assumption of homoskedasticity.
When variances are not constant, heteroskedasticity exists.
2. The error term, ε, is uncorrelated with (x1, x2, …, xk), the set of independent variables.
This is also expressed mathematically as ∑xiei = 0, for i = 1, 2, …, k.
3. The population regression model is linear in the coefficients β0, β1, β2, …, βk. Technically, the population regression
model connects the values E(y | (x1, x2, …, xk)) and x1, x2, …, xk with a k-dimensional plane.
4. The independent variables, (x1, x2, …, xk), are independent of each other.
Failure of an assumption can result in a sample model that is not the best estimate, or perhaps not even
a good estimate, for the population model. This means the sample model may not actually explain the
data well, may have biased estimates, and may not provide good forecasts or predictions of future
values of the dependent variable.
Advanced econometrics courses look at regression methods that relax these assumptions. For example:
The first relaxation examined is when error terms are heteroskedastic.
A Time Series Analysis course examines when independent variables are correlated with the
error term, and how this is taken into account.
Logistic Regression models are used when dependent variables are dummy variables, taking the
value 0 (for failure) or 1 (for success), instead of being a continuous (ratio or interval) variable.
Though we don’t look at these extensions in this course, we do examine the impact when independent
variables have a linear relationship with each other, which results in what is called “multicollinearity”.
We also look briefly at how to use residual analysis as preliminary “tests” to examine the validity of
some of the assumptions on the residual/error term.
The Least Squares Criterion is a criterion for estimating the population regression equation that
minimizes the sum of squared residuals, ∑e² = ∑(y - ŷ)².
Ordinary Least Squares (OLS) is commonly used to describe this estimation method when the
population regression model is linear, error terms are iid N(0, σ²), and independent variables are
independent of each other and of the error term.
Equations to estimate regression coefficients: Estimating multiple linear regression models requires
knowledge of linear (matrix) algebra, which is not a pre-requisite for Econ 2202. For this reason,
EXCEL output is used in this course to estimate the sample model and conduct tests.
Meaning of Regression Coefficients: The intercept is the value of the dependent variable when the
independent variables are all zero. The slope coefficient, bi, can be positive or negative, and
measures the average estimated change in the dependent variable, y, for a unit change in the
independent variable, xi.
Writing multiple linear regression in matrix notation: we can write y = β0 + β1x1 + β2x2 + … + βkxk + ε,
or yj = β0 + β1x1j + β2x2j + … + βkxkj + εj, in matrix notation as follows:
y = Xβ + ε, where
y = (y1, y2, y3, …, yn)ᵀ is the n×1 vector of observations of the dependent variable;
β = (β0, β1, β2, …, βk)ᵀ is the (k+1)×1 vector of regression coefficients;
ε = (ε1, ε2, ε3, …, εn)ᵀ is the n×1 vector of error terms;
X is the n×(k+1) design matrix whose jth row is (1, x1j, x2j, …, xkj), the jth observation of the independent variables.
The vectors y and ε each contain the n observations, and each column of X after the first contains the n
observations of one independent variable. The first column of X is a column of "1"s, to allow for a
constant intercept. Multiplying out row j of y = Xβ + ε gives
yj = β0 + β1x1j + β2x2j + … + βkxkj + εj, for j = 1, 2, …, n.
OLS Estimates, b, for β: Multiple linear regression coefficients are estimated using the equation
below. As in simple linear regression, Ordinary Least Squares (OLS) provides the calculations for the
sample coefficients b0, b1, b2, …, bk. OLS minimizes the sum of squared residuals, ∑ei².
b = (XᵀX)⁻¹Xᵀy
where b = (b0, b1, b2, …, bk)ᵀ is the vector of estimated coefficients, Xᵀ is the transpose of X,
y = (y1, y2, …, yn)ᵀ is the vector of observations of the dependent variable, and e = (e1, e2, …, en)ᵀ is the vector of residuals.
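The formula b = (XᵀX)⁻¹Xᵀy can be computed directly once the design matrix X (with its leading column of 1s) is built. A minimal sketch with made-up data (the variable names are illustrative, not from the course examples); numpy's solve is used in place of an explicit inverse:

```python
import numpy as np

# Made-up data: n = 5 observations, k = 2 independent variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.0])

# Design matrix X: first column of 1s for the intercept, then one column per variable
X = np.column_stack([np.ones(len(y)), x1, x2])

# OLS estimates: b = (X'X)^(-1) X'y, solved as a linear system for numerical stability
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # array of [b0, b1, b2]
```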
Proving the OLS estimator is unbiased:
E(b) = E[ (XᵀX)⁻¹Xᵀy ] = E[ (XᵀX)⁻¹Xᵀ(Xβ + ε) ] = β + (XᵀX)⁻¹Xᵀ E(ε) = β, since E(ε) = 0.
The residuals and sums of squares are:
e = y - ŷ
SST = ∑(y - ȳ)²
SSR = ∑(ŷ - ȳ)²
SSE = ∑(y - ŷ)² = ∑(y - b0 - b1x1 - b2x2 - … - bkxk)²
Some important standard deviations or standard errors are given by:
sε = √( SSE / (n - k - 1) ), the standard error of the estimate; and
s_bi, the standard error of the slope coefficient bi.
Since s_bi cannot be calculated without matrix algebra when k ≥ 2, EXCEL output determines it.
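Continuing the sketch above, the sums of squares and standard errors can be computed from the fitted values. The coefficient standard error s_bi is taken here as sε times the square root of the ith diagonal element of (XᵀX)⁻¹, the standard OLS formula (an assumption on my part, since the notes defer this calculation to EXCEL):

```python
# Continues from the OLS sketch above (X, y, b already defined)
y_hat = X @ b
e = y - y_hat

n, kp1 = X.shape          # kp1 = k + 1 (intercept plus k slopes)
k = kp1 - 1

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
SSE = np.sum(e ** 2)                   # error (residual) sum of squares

s_eps = np.sqrt(SSE / (n - k - 1))     # standard error of the estimate

# Standard error of each coefficient: s_eps * sqrt of the diagonal of (X'X)^(-1)
XtX_inv = np.linalg.inv(X.T @ X)
s_b = s_eps * np.sqrt(np.diag(XtX_inv))
print(SST, SSR, SSE, s_eps, s_b)
```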
A test of significance of the overall multiple linear regression uses analysis of variance. Similar to
ANOVA in Topic 3, this test is a right-tailed F-test. In a multiple linear regression model, the F statistic is
the ratio of SSR and SSE, each divided by their respective degrees of freedom: F0 = (SSR/k) / (SSE/(n-k-1)).
This test examines, in effect, if the variation in the dependent variable explained by the model (SSR) is
significantly larger than the unexplained variation (SST – SSR = SSE).
Like the F tests for simple linear regression, the F test for multiple linear regression requires the error
terms, εi, to be independent with mean 0 and a constant variance, σ².
Ho: Regression is not significant; Ha: Regression is significant, or
Ho: β1 = β2 = … = βk = 0; Ha: not all population slope coefficients are zero
The test statistic, which has (k, n-k-1) degrees of freedom, is:
F0 = MSR / MSE = (SSR/k) / (SSE/(n-k-1)); df = (k, n-k-1)
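A sketch of the overall F test, continuing the quantities computed above; scipy's F distribution supplies the right-tailed p-value (the "Significance F" shown in EXCEL output):

```python
from scipy import stats

# Continues from the sketches above (SSR, SSE, n, k already defined)
MSR = SSR / k
MSE = SSE / (n - k - 1)
F0 = MSR / MSE

# Right-tailed p-value with (k, n-k-1) degrees of freedom
p_value = stats.f.sf(F0, k, n - k - 1)

# Reject Ho (regression not significant) if p_value < alpha, e.g. alpha = 0.05
print(F0, p_value)
```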
A test of significance of a single slope coefficient uses a two-tailed t-test, as in the case of simple linear
regression, and the test statistic, with n-k-1 degrees of freedom, is given by:
t0 = (bi - 0) / s_bi = bi / s_bi
This test statistic is identical to that used for a simple linear regression model. The key difference lies in
the fact that s_bi can be calculated using a simple calculator for simple linear regression, while for a
multiple linear regression, statistical software may be necessary due to the need to use matrix algebra.
To test if the impact of xi on y is significant, that is, if changes in xi matter in explaining changes in y, we
test the significance of the slope coefficient, βi:
Ho: βi = 0; Ha: βi ≠ 0; or
Ho: variable i (or xi) does not matter; Ha: variable i (or xi) matters
It is possible to have a statistically significant multiple regression model with one or more insignificant
slope coefficients. This means the regression equation includes one or more independent variables that
don't significantly explain changes in the dependent variable, while the remaining independent variables do.
EXAMPLE 6.2: Ten branches of a national chain store were randomly selected to
evaluate the effects of population and income on sales in each sales district. Data
labels mean: y = weekly sales in millions of dollars; x1 = population in hundreds of
thousands; and x2 = weekly family income in millions of dollars. Output follows.
a. State the population and estimated sample regression equations.
b. Test the significance of the equation at the five-percent significance level.
c. Test the significance of each slope coefficient.
Regression Statistics
Multiple R          0.
R Square            0.
Adjusted R Square   0.
Standard Error      0.
Observations        10

ANOVA        df   SS    MS   F   Significance F
Regression   2    55.
Residual     7    4.
Total        9    59.

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    -8.3195        1.
x1           0.7095         0.
x2           0.6987         0.
Regression equations: The EXCEL output produces three tables: 1) Regression Statistics; 2) ANOVA; and
3) Regression Estimates and Standard Errors.
For Example 6.2, the third table is:
             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -8.3195        1.9375           -4.2940   0.0036    -12.9009    -3.
x1           0.7095         0.0862           8.2280    0.0001    0.5056      0.
x2           0.6987         0.0974           7.1716    0.0002    0.4683      0.
From this table, it is possible to write the estimated sample regression equation, and hence, the sample
regression and population regression equations. It also allows for testing of significance of the individual
slope coefficients, and provides confidence intervals for the slope coefficients. When running the
regression, the significance level (and confidence level) can be chosen by the user:
Population regression equation: y = β0 + β1x1 + β2x2 + ε
Sample regression equation: ŷ = b0 + b1x1 + b2x2
Estimated sample regression equation: ŷ = -8.3195 + 0.7095x1 + 0.6987x2
To interpret the equations:
In the estimated equation, the intercept, b0 = -8.3195, is negative, so average sales are
-$8.3195 million when population = 0 and family income = 0.
The slope coefficient associated with population is b1 = 0.7095. Since x1 is measured in hundreds
of thousands and y in millions of dollars, for every 100,000 increase in population, weekly sales increase by $709,500.
The slope coefficient associated with weekly family income is b2 = 0.6987. Since x2 is measured in
$millions, for every $1M increase in total weekly family income, weekly sales rise by $698,700.
Though the unit of measure for the dependent and independent variables was unimportant
when we ran the regression, it is important for interpreting the results properly.
Table 2: ANOVA Table
This table is used to test the significance of the overall regression model. Its EXCEL format is:

ANOVA        df      SS    MS                  F            Significance F
Regression   k       SSR   MSR = SSR/k         F0=MSR/MSE   p-value = Pr(F > F0)
Residual     n-k-1   SSE   MSE = SSE/(n-k-1)
Total        n-1     SST

For Example 6.2:

ANOVA        df   SS        MS        F         Significance F
Regression   2    55.1489   27.5745   43.3648   0.
Residual     7    4.4511    0.
Total        9    59.
4. Create a correlation matrix, and test the strength of linear relationships between the variables. If
there is no significant linear relationship between an independent variable and the dependent
variable, the independent variable should typically not be included in your model. For those
independent variables with significant linear relationships with the dependent variable, sort them
from the strongest linear relationship (the absolute value of the correlation coefficient) to the
weakest. EXCEL creates such a matrix quite easily, as we will see in class. Since it is not always clear
which correlations are significant, and testing them takes time, stepwise regression is one way to
build a model that implicitly tests for the significance of correlations, without doing so explicitly.
5. Test for multicollinearity (as well as other OLS assumptions). If the correlation coefficient
between two independent variables seems high in absolute terms, there may be multicollinearity ,
where one independent variable is highly correlated or linearly dependent on another. The third
section of this topic discusses how to test for multicollinearity. Multicollinearity violates the OLS
assumptions, so independent variables that create this should be excluded from the model.
A multiple linear regression model must be linear in its coefficients, but not necessarily in its
independent variables. Provided the dependent variable, y, is a continuous variable (ratio or interval),
or discrete with a very large number of possible outcomes, the independent variables can take many forms.
1. Qualitative Independent Variables: Independent variables can be qualitative, provided they are
recoded to a numeric value. For example, we might want to see if sex (variable 1), age (variable 2)
and marital status (variable 3) make a difference to annual after-tax income (y). Sex and marital
status are both qualitative.
Convert these variables to numeric, and then you can use the multiple linear regression model. For
example, if the model includes xi = marital status, we can let xi = 0 if single, xi = 1 if widowed,
xi = 2 if separated or divorced, and xi = 3 if married.
2. Dummy Variables (or Indicator Variables): These are a very specific type of qualitative variable
which take the value 1 when the observation has the quality of interest, and the value 0 otherwise.
For example, in a regression where xi represents sex, we can let xi = 1 if male and xi = 0 if female. In
this case, the Dummy or Indicator Variable indicates sex. (A short coding sketch after this list shows
how dummy, transformed, and interaction variables become columns of the design matrix.)
3. Non-linear relationships: Suppose you have the following population regression model:
y = β0 + β1x1 + β2x2 + β3x3 + … + βkxk³ + ε
You might think that this model is non-linear. In fact, that is not true. If we used the letter "v" in our
equation for all the x's, we could write:
y = β0 + β1v1 + β2v2 + β3v3 + … + βkvk + ε
which is a multiple linear regression model, where v1 = x1, v2 = x2, …, and vk = xk³. This means that
independent variables can be transformations of something measured, such as log(income), age²,
and so on. The same logic does not apply to the coefficients. If the model used βk², for example, we
could solve for bk², but not the value of bk, because there could be multiple roots. More importantly,
bk would no longer be a BLUE estimate for βk (BLUE = best, linear, unbiased estimate).
4. Interaction effects: Suppose you have the following population regression model:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Your third slope coefficient is an interaction effect between x1 and x2. In fact, this is very much like
a Two-Way ANOVA with replication. This remains a multiple linear regression model. Just think, if I
gave you x3 instead and didn’t tell you that x3=x1*x2, wouldn’t you think the model was linear?
Individual variables are considered “basic” terms, while interactions are “interaction terms”. A
composite model is a model that has both the basic terms and interaction terms. This gets a little
more complicated if we started with three basic terms – then we have several possibilities for
interactions (x1 and x2; x1 and x3; x2 and x3; and x1, x2 and x3).
5. Time: Suppose you have the following population regression model:
y = β0 + β1x1 + β2x2 + β3t + ε
where the variable, t, is some measure of time (day, month, year, etc.).
6. Time series: Suppose you have the following population regression model:
yt = β0 + β1x1 + β2x2 + β3yt-1 + ε
Now one of your independent variables is a past value of the dependent variable.
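As a sketch of how several of these variations (a dummy variable, a squared term, and an interaction term) become ordinary columns in the design matrix, using made-up data and illustrative variable names:

```python
import numpy as np
import pandas as pd

# Made-up data: income (y), age, and sex for six people
df = pd.DataFrame({
    "income": [42.0, 55.0, 61.0, 48.0, 70.0, 66.0],
    "age":    [25,   32,   41,   29,   50,   45],
    "sex":    ["F",  "M",  "F",  "M",  "M",  "F"],
})

# Dummy (indicator) variable: 1 if male, 0 otherwise
df["male"] = (df["sex"] == "M").astype(int)

# Non-linear transformation of a measured variable: age squared
df["age_sq"] = df["age"] ** 2

# Interaction term: age * male
df["age_x_male"] = df["age"] * df["male"]

# The model is still linear in the coefficients, so OLS applies exactly as before
X = np.column_stack([np.ones(len(df)), df["age"], df["male"], df["age_sq"], df["age_x_male"]])
y = df["income"].to_numpy()
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)
```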
Several statistics are important in determining if a multiple linear regression model is "good", or in choosing the
"better" of two or more models.
The first step, before any comparison, is to ensure OLS assumptions hold, and the model is significant.
Examine the population regression model to see if it is linear in the coefficients.
Use the regression output to determine if the model is significant.
Run a correlation matrix, and conduct the VIF test (if needed) to determine if multicollinearity is
present (a VIF sketch appears below).
Eliminate from your comparisons any model that is non-linear in coefficients, insignificant, or shows the
presence of multicollinearity.
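The VIF test itself is covered in the third section of this topic; as a preview, here is a sketch using the standard definition VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing xi on the remaining independent variables. The rule of thumb of flagging VIF values above 5 or 10 is a common convention, not something stated in these notes:

```python
import numpy as np

def vif(X_vars: np.ndarray) -> np.ndarray:
    """Variance Inflation Factor for each column of X_vars (independent variables only,
    no intercept column). VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on the remaining columns (with an intercept)."""
    n, k = X_vars.shape
    vifs = np.empty(k)
    for i in range(k):
        yi = X_vars[:, i]
        others = np.delete(X_vars, i, axis=1)
        Xi = np.column_stack([np.ones(n), others])
        bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        resid = yi - Xi @ bi
        r2 = 1.0 - resid @ resid / np.sum((yi - yi.mean()) ** 2)
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs

# Example with made-up columns: x3 is nearly a linear combination of x1, so its VIF is large
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=50)
print(vif(np.column_stack([x1, x2, x3])))
```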
The next step is to use the following four measures to determine the "best" model out of the set that
meet all required conditions: 1) the Coefficient of Determination (R-squared); 2) the Adjusted R-squared;
3) the Mean Squared Error (MSE); and 4) the test statistic, F0. The order of importance is,
typically, 2), 3), 1) and 4).
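A sketch of these comparison statistics computed from the sums of squares; the adjusted R-squared formula used here is the standard 1 - [SSE/(n-k-1)] / [SST/(n-1)], and the illustrative numbers are the SSR and SSE from Example 6.2's ANOVA table:

```python
def model_fit_stats(SSR: float, SSE: float, n: int, k: int) -> dict:
    """Fit statistics for a multiple linear regression with k slope coefficients."""
    SST = SSR + SSE
    r2 = SSR / SST                                      # coefficient of determination
    adj_r2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))  # adjusted R-squared
    mse = SSE / (n - k - 1)                             # mean squared error
    f0 = (SSR / k) / mse                                # overall F statistic
    return {"R2": r2, "Adj R2": adj_r2, "MSE": mse, "F0": f0}

# Example 6.2 values: F0 comes out to about 43.36, matching the ANOVA output above
print(model_fit_stats(SSR=55.1489, SSE=4.4511, n=10, k=2))
```

When comparing candidate models, the one with the higher adjusted R-squared (and lower MSE) is usually preferred, since plain R-squared never decreases when more variables are added.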
             Coefficients   Standard Error   t Stat    P-value
Intercept    3580.3279      1182.0066        3.0290    0.
temperature  15.2339        6.5278           2.3337    0.
production   0.0861         0.0226           3.8100    0.
days         -110.6611      60.6128          -1.8257   0.
Model 4: dependent variable = days
R Square            0.
Adjusted R Square   0.
Standard Error      1.
Observations        17

ANOVA        df   SS        MS       F        Significance F
Regression   1    6.5567    6.5567   3.5533   0.
Residual     15   27.6786   1.
Total        16   34.

             Coefficients   Standard Error   t Stat    P-value
Intercept    18.3976        1.6631           11.0620   0.
temperature  0.0474         0.0251           1.8850    0.
y      x1     x2      x3   x4
3067 58.8 7107 21 129
2828 65.2 6373 22 141
2891 70.9 6796 22 153
2994 77.4 9208 20 166
3082 79.3 14792 25 193
3898 81.0 14564 23 189
3502 71.9 11964 20 175
3060 63.9 13526 23 186
3211 54.5 12656 20 190
3286 39.5 14119 20 187
3542 44.5 16691 22 195
3125 43.6 14571 19 206
3022 56.0 13619 22 198
2922 64.7 14575 22 192
3950 73.0 14556 21 191
4488 78.9 18573 21 200
3295 79.4 15618 22 200
a. Note that p-value = Significance F.
Model 1: y regressed on x1, x2, x3, x4; p-value = 0.
Model 2: x4 regressed on x1, x2, x3; p-value = 0.
Model 3: y regressed on x1, x2, x3; p-value = 0.
Model 4: x̂3 = 18.3976 + 0.0474x1; p-value = 0.
(http://www.theatlantic.com/business/archive/2009/04/department-of-awful-statistics/7182/#)
The lesson from this graph is the importance of selecting only appropriate
variables in the potential set of explanatory or independent variables.
This is a vital first step in model building, and the selection of variables
should be guided by sound theory and established practice. Once this is
done, you are ready to start model building.
The second step before applying stepwise regression involves calculation
of a Correlation Matrix, preferably using statistical software. This matrix is
key in the next steps. Consider Example 6.4.
EXAMPLE 6.4: In a study of the achievement of executives, a consulting
firm for management efficiency administered three psychological tests to
20 randomly selected executives from firms with annual profits over $
billion, in order to construct a multiple regression model. In this model,
the variables and data available are given below:
y = current annual salary in thousands of dollars;
x1 = test score on vitality & drive;
x2 = test score on numerical & verbal reasoning ability;
x3 = test score on sociability & leadership.
Before selecting the most appropriate model, calculate the correlation matrix, and determine which
score is the most highly correlated with current annual salary, and which is least correlated. Do you
have reason to believe the relationship is causal? Explain.
SOLUTION: To construct this correlation matrix in EXCEL, first copy and paste the table into EXCEL.
Next, in Tools , click on Data Analysis. Choose Correlation in the first pop-up box, then push the OK
button. When the second pop-up box appears, include the entire table in the Input Range , check Labels
in First Row , then push the OK button.
y     x1   x2   x3
85 16 78 17
82 17 65 11
120 20 80 14
95 22 74 6
125 24 82 29
145 32 77 25
158 32 95 36
140 30 80 28
138 33 75 26
135 34 77 24
175 37 82 50
190 37 82 85
175 39 82 68
185 38 83 71
150 38 94 15
200 40 92 66
210 43 90 92
225 44 89 95
250 45 98 85
245 45 90 95
The correlation matrix from EXCEL is lower triangular, since correlation(xi, xj) = correlation(xj, xi). The
diagonal elements are correlations of variables with themselves, which always equal 1.
The correlation matrix below shows the highest correlation (r = 0.9366) between salary and the test score on
vitality & drive (x1), and the lowest (r = 0.7560) between salary and the test score on numerical & verbal
reasoning ability (x2). Though test scores may not explain differences in salary, they may be proxy
measurements of the qualities that explain salary differences.
      y          x1         x2        x3
y     1
x1    0.936598   1
x2    0.755978   0.708663   1
x3    0.924352   0.828226   0.58567   1
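The same matrix can be reproduced outside EXCEL; a sketch using pandas, with the column names taken from Example 6.4 (only the first five rows of the data are retyped here, so the printed values differ from the full matrix above):

```python
import pandas as pd

# df holds the Example 6.4 data: columns y, x1, x2, x3, one row per executive.
# With the full 20 rows, df.corr() reproduces the EXCEL output (e.g. corr(y, x1) = 0.9366).
df = pd.DataFrame({
    "y":  [85, 82, 120, 95, 125],
    "x1": [16, 17, 20, 22, 24],
    "x2": [78, 65, 80, 74, 82],
    "x3": [17, 11, 14, 6, 29],
})

corr = df.corr()          # symmetric correlation matrix with 1s on the diagonal
print(corr.round(4))

# The variable most highly correlated with y is the natural starting point for forward selection
print(corr["y"].drop("y").abs().idxmax())
```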
Now, let’s model build. Forward Selection builds a model starting with a simple linear regression, where
the explanatory variable is the one most highly correlated with the dependent variable. In subsequent
regressions, explanatory variables are added, one at a time. If the additional variable results in a better
model, the new model is kept, and the process repeated. If the additional variable results in a “worse”
model, the new variable is dropped, the previous model is kept, and the process stops.
With Backward Elimination, the model begins with all explanatory variables. Variables are dropped, one
at a time, starting with the variable least correlated with the dependent variable, till the model selected
is better than all previous ones, and better than the next one in the queue.
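A sketch of forward selection, using adjusted R-squared as the single comparison criterion for simplicity (the notes compare models on several measures); this variant tries every remaining candidate at each step and keeps the one that improves adjusted R-squared the most:

```python
import numpy as np

def adj_r2(X_cols: np.ndarray, y: np.ndarray) -> float:
    """Adjusted R-squared of an OLS regression of y on the given columns (plus an intercept)."""
    n, k = X_cols.shape
    X = np.column_stack([np.ones(n), X_cols])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    SSE = np.sum((y - X @ b) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return 1 - (SSE / (n - k - 1)) / (SST / (n - 1))

def forward_selection(candidates: dict, y: np.ndarray) -> list:
    """Add the candidate variable that most improves adjusted R-squared; stop when none does."""
    selected, remaining = [], dict(candidates)
    best = -np.inf
    while remaining:
        scores = {name: adj_r2(np.column_stack([*(candidates[s] for s in selected), col]), y)
                  for name, col in remaining.items()}
        name, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:       # no remaining candidate improves the model: stop
            break
        selected.append(name)
        best = score
        del remaining[name]
    return selected

# Usage with the Example 6.4 data, assuming arrays y, x1, x2, x3 have been loaded:
# print(forward_selection({"x1": x1, "x2": x2, "x3": x3}, y))
```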
Forward Selection: Specify a simple linear regression model with the explanatory variable most highly
correlated with the dependent variable. If this regression is not significant, the
explanatory variables are unsatisfactory in explaining your dependent variable. STOP.