TOPIC 6: MULTIPLE REGRESSION ANALYSIS AND MODEL BUILDING
OBJECTIVES: This lecture introduces Multiple Linear Regression Analysis, which helps determine the
manner in which a change in any one or more of a set of independent variable impacts a dependent
variable. It also introduces model building or stepwise regression, using both forward selection and
backward elimination, the assumptions required for multiple linear regression, and how to diagnose
and compare models, and test if the required assumptions hold or not.
CONTENT
1. The Multiple Linear Regression Model
1a. The Model and its Assumptions
1b. Estimating the Regression Model - Ordinary Least Squares
1c. Least Squares Regression Properties, the Sum Of Squares, and Standard Errors
1d. Testing the Significance of Slope Coefficients vs the Overall Regression
2. Model Building
2a. Steps in Model Building
2b. Variations in the Multiple Linear Regression Model
2c. What is a “Good” Model?
2d. Stepwise Regression: Forward Selection and Backwards Elimination
3. Testing the OLS Assumptions
3a. Preliminary “Tests” of Assumptions on Residuals/Error Terms
3b. Multicollinearity and the Variance Inflation Factor (VIF)
4. Common Errors
5. Problems
5a. Short Answer Questions
5b. Long Answer Questions
5c. Solutions to Short Answer Questions
ECON 2202, Topic 6, © S. Dubey, 2011

1. THE MULTIPLE LINEAR REGRESSION MODEL

Multiple linear regression is a multivariate statistical method that examines the linear relationship between two or more independent variables and a dependent variable. It helps answer questions like:

 To what extent do age, height, bone density, diet, and exercise predict a person's body mass index?
 What factors predict economic growth?
 To what extent do university, program of study, GPA, sex, age, work experience, and recent economic growth predict post-education earnings?
 What factors predict the Canadian exchange rate in the next few months?

It is used when the dependent variable is a continuous (interval or ratio) variable, and it explains how changes in the independent variables explain or predict changes in the dependent variable.

1 A. THE MODEL AND ITS ASSUMPTIONS

Population regression equation: y = β0 + β1x1 + β2x2 + … + βkxk + ε
Sample (or estimated) regression equation: ŷ = b0 + b1x1 + b2x2 + … + bkxk
Error term: e = y - ŷ

Meaning:

Vectors of n observations:
y = value of the dependent variable.
xi = value of the ith independent variable; i = 1, 2, …, k.
ε = population error term.
ŷ = estimated or predicted y value.
e = y - ŷ = sample error or residual.

Scalars:
β0 = population regression intercept.
βi = population regression slope coefficient associated with the ith variable, xi. This is the amount by which y changes for every one-unit change in xi.
b0 = sample intercept = unbiased estimate of the regression intercept.
bi = sample slope coefficient associated with xi.
n = sample size; k = number of slope coefficients (k = 1 for the simple linear regression model).

 The error term, ε, is the random (error) component of the model.
 The term β0 + β1x1 + β2x2 + … + βkxk is the linear component.
 The sample error vector, e, estimates the population error, ε.

Since there are n observations for (y, x1, x2, …, xk), we also write (yj, x1j, x2j, …, xkj), j = 1, 2, …, n, so

Population regression equation: yj = β0 + β1x1j + β2x2j + … + βkxkj + εj
Sample (or estimated) regression equation: ŷj = b0 + b1x1j + b2x2j + … + bkxkj
Error term: ej = yj - ŷj

Let's see how Robert J. Barro applies multiple regression analysis to identifying the determinants of economic growth in his August 1996 paper, "Determinants of Economic Growth: A Cross-Country Empirical Study" (NBER Working Paper 5698). Barro used data from 100 countries from 1960 to

MODEL ASSUMPTIONS:

As in earlier topics, multiple linear regression models require that certain assumptions hold in order for the estimated model to be valid and good. These can be summarized in four assumptions, the first three of which are also necessary for a simple linear regression model. (A short simulation sketch after this list illustrates the first assumption.)

1. For any value of (x1, x2, …, xk), ε ~ iid N(0, σ²): individual values of the error term, ε, are independent and normally distributed, with mean 0 and constant variance, σ².
 The assumption of constant variance is known as the assumption of homoskedasticity.
 When variances are not constant, heteroskedasticity exists.
2. Error terms are independent of (x1, x2, …, xk), the set of independent variables.
 This is also expressed mathematically as ∑xiei = 0, for i = 1, 2, …, k.
3. The model is linear in its coefficients, β0, β1, β2, …, βk. Technically, the population regression model connects the values E(y | x1, x2, …, xk) and x1, x2, …, xk with a k-dimensional plane.
4. The independent variables, (x1, x2, …, xk), are independent of each other.
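The first assumption can be made concrete with a short simulation. The sketch below uses Python rather than the EXCEL used in this course, and every coefficient, the sample size, and the error variance are made up purely for illustration; it simply generates data whose error term satisfies ε ~ iid N(0, σ²) and is independent of the x's.

```python
# Illustrative simulation of assumption 1 (hypothetical values throughout).
import numpy as np

rng = np.random.default_rng(0)

n = 200                                 # sample size (made up)
beta = np.array([4.0, 1.5, -2.0])       # hypothetical beta0, beta1, beta2
sigma = 0.5                             # hypothetical error standard deviation

x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
eps = rng.normal(0.0, sigma, n)         # iid N(0, sigma^2), drawn independently of x1 and x2

# Population regression model: y = beta0 + beta1*x1 + beta2*x2 + eps
y = beta[0] + beta[1] * x1 + beta[2] * x2 + eps

print("mean of eps (should be near 0):", eps.mean())
print("sample corr(x1, eps) (should be near 0):", np.corrcoef(x1, eps)[0, 1])
```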

WHEN ASSUMPTIONS FAIL:

Failure of an assumption can result in a sample model that is not the best estimate, or perhaps not even a good estimate, of the population model. This means the sample model may not actually explain the data well, may have biased estimates, and may not provide good forecasts or predictions of future values of the dependent variable.

Advanced econometrics courses look at regression methods that relax these assumptions. For example:

 The first relaxation examined is when error terms are heteroskedastic.
 A Time Series Analysis course examines when independent variables are correlated with the error term, and how this is taken into account.
 Logistic Regression models are used when the dependent variable is a dummy variable, taking the value 0 (for failure) or 1 (for success), instead of being a continuous (ratio or interval) variable.

Though we don't look at these extensions in this course, we do examine the impact when independent variables have a linear relationship with each other, which results in what is called "multicollinearity". We also look briefly at how to use residual analysis as preliminary "tests" to examine the validity of some of the assumptions on the residual/error term.

1 B. ESTIMATING THE REGRESSION COEFFICIENTS (ORDINARY LEAST SQUARES)

The Least Squares Criterion is a criterion for estimating the population regression equation that minimizes the sum of squared residuals, ∑e² = ∑(y - ŷ)².

Ordinary Least Squares (OLS) is commonly used to describe this estimation method when the population regression model is linear, error terms are iid N(0, σ²), and independent variables are independent of each other and of the error term.

Equations to estimate regression coefficients: Deriving these requires Linear (Matrix) Algebra, which is not a pre-requisite for Econ 2202, so the derivations are not covered for multiple linear regression models in this course. For this reason, EXCEL output is used to estimate the sample model and conduct tests.

Meaning of regression coefficients: The intercept is the value of the dependent variable when the independent variables are all zero. The slope coefficient, bi, can be positive or negative, and measures the average estimated change in the dependent variable, y, for a unit change in the independent variable, xi.

Writing multiple linear regression in matrix notation: we can write y = β0 + β1x1 + β2x2 + … + βkxk + ε, or yj = β0 + β1x1j + β2x2j + … + βkxkj + εj, in matrix notation as follows:

y = Xβ + ε, where

y = (y1, y2, y3, …, yn)ᵀ and ε = (ε1, ε2, ε3, …, εn)ᵀ are n×1 vectors, β = (β0, β1, β2, …, βk)ᵀ is a (k+1)×1 vector, and X is the n×(k+1) matrix whose jth row is (1, x1j, x2j, …, xkj):

    | 1  x11  x21  …  xk1 |
    | 1  x12  x22  …  xk2 |
X = | 1  x13  x23  …  xk3 |
    | …   …    …       …  |
    | 1  x1n  x2n  …  xkn |

The vectors y and ε each contain the n observations, and each column in X also contains n observations. The first column of X is a vector of "1"s, to allow for a constant intercept. Multiplying out row j of Xβ + ε gives

yj = β0 + β1x1j + β2x2j + … + βkxkj + εj, for j = 1, 2, …, n.

OLS Estimates, b, for β: Multiple linear regression coefficients are estimated using the equation below. As in simple linear regression, Ordinary Least Squares (OLS) provides the calculations for the sample coefficients b0, b1, b2, …, bk. OLS minimizes the sum of squared residuals, ∑ei².

b = (XᵀX)⁻¹ Xᵀ y

where b = (b0, b1, b2, …, bk)ᵀ is the (k+1)×1 vector of sample coefficients, Xᵀ is the transpose of X, y = (y1, y2, y3, …, yn)ᵀ is the vector of observations on the dependent variable, and e = (e1, e2, e3, …, en)ᵀ is the vector of residuals.
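The OLS formula can be checked numerically. The following is a minimal Python sketch (not part of the course, which relies on EXCEL output), using a small made-up data set, of b = (XᵀX)⁻¹Xᵀy.

```python
# Illustrative OLS by the matrix formula; all data values are hypothetical.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

n = len(y)
X = np.column_stack([np.ones(n), x1, x2])   # first column of 1s for the intercept

# b = (X'X)^(-1) X'y
b = np.linalg.inv(X.T @ X) @ (X.T @ y)

y_hat = X @ b          # fitted values
e = y - y_hat          # residuals

print("b0, b1, b2 =", b)
print("sum of residuals (should be ~0):", e.sum())
```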

Proving the OLS estimator is unbiased:

E(b) = E[(XᵀX)⁻¹Xᵀy] = E[(XᵀX)⁻¹Xᵀ(Xβ + ε)]
     = E[(XᵀX)⁻¹XᵀXβ] + E[(XᵀX)⁻¹Xᵀε]
     = β + (XᵀX)⁻¹Xᵀ E(ε)
     = β,

since (XᵀX)⁻¹XᵀX = I and E(ε) = 0.

 SUM OF ERRORS: ∑e = ∑(y - ŷ) = 0

 TOTAL SUM OF SQUARES: SST = ∑(y - ȳ)²

 SUM OF SQUARES REGRESSION: SSR = ∑(ŷ - ȳ)²

 SUM OF SQUARES ERROR: SSE = ∑(y - ŷ)² = ∑(y - b0 - b1x1 - b2x2 - … - bkxk)²

Some important standard deviations or standard errors are given by:

a. Population standard error of the estimate of y: σε
b. Sample standard deviation of the estimate of y: sε = √(SSE / (n - k - 1)) = √MSE
c. Population standard deviation of the slope coefficient, βi: σbi
d. Sample standard deviation of the slope coefficient: sbi

Since sbi cannot be calculated without matrix algebra when k ≥ 2, EXCEL output determines it.
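As an illustration of these quantities, the sketch below (Python, reusing the made-up data from the OLS sketch above) computes SST, SSR, SSE and the sample standard error of the estimate, and confirms that SST = SSR + SSE.

```python
# Illustrative sums of squares and standard error of the estimate (hypothetical data).
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.inv(X.T @ X) @ (X.T @ y)
y_hat = X @ b

SST = np.sum((y - y.mean()) ** 2)        # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)    # sum of squares regression
SSE = np.sum((y - y_hat) ** 2)           # sum of squares error

s_e = np.sqrt(SSE / (n - k - 1))         # sample standard error of the estimate, sqrt(MSE)

print("SST =", SST, " SSR + SSE =", SSR + SSE)   # the two should agree
print("s_e =", s_e)
```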

1 D. TESTING THE SIGNIFICANCE OF SLOPE COEFFICIENTS VS THE OVERALL REGRESSION

A test of significance of the overall multiple linear regression uses analysis of variance. Similar to ANOVA in Topic 3, this test is a right-tailed F-test. In a multiple linear regression model, the F statistic is the ratio of SSR and SSE, each divided by their respective degrees of freedom: F0 = (SSR/k)/(SSE/(n-k-1)). This test examines, in effect, whether the variation in the dependent variable explained by the model (SSR) is significantly larger than the unexplained variation (SST - SSR = SSE).

Like the F tests for simple linear regression, the F test for multiple linear regression requires the error terms, εi, to be independent with mean 0 and a constant variance, σ². The hypotheses tested are:

Ho: Regression is not significant, Ha: Regression is significant, or
Ho: β1 = β2 = … = βk = 0, Ha: not all population slope coefficients are zero

The test statistic, which has (k, n-k-1) degrees of freedom, is:

F0 = (SSR/k) / (SSE/(n-k-1)) = MSR / MSE; df = (k, n-k-1)

A test of significance of a single slope coefficient uses a two-tailed t-test, as in the case of simple linear regression, and the test statistic, with n-k-1 degrees of freedom, is given by:

t0 = (bi - βi) / sbi, which under Ho: βi = 0 reduces to t0 = bi / sbi; df = n - k - 1

This test statistic is identical to that used for a simple linear regression model. The key difference lies in the fact that sbi can be calculated using a simple calculator for simple linear regression, while for a multiple linear regression, statistical software may be necessary due to the need to use matrix algebra.

To test if the impact of xi on y is significant, that is, if changes in xi matter in explaining changes in y, we test the significance of the slope coefficient, βi:

Ho: βi = 0, Ha: βi ≠ 0; or
Ho: variable i (or xi) does not matter, Ha: variable i (or xi) matters

It is possible to have a statistically significant multiple regression model with one or more insignificant slope coefficients. This means the regression equation includes one or more independent variables that don't significantly explain changes in the dependent variable, while the remaining independent variables do.
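Both tests can be reproduced outside EXCEL. The sketch below is an illustrative Python version on the same made-up data as the earlier sketches; scipy.stats supplies the F and t tail probabilities, and sbi is taken from the standard formula for the estimated variance of b, the diagonal of MSE·(XᵀX)⁻¹.

```python
# Illustrative overall F test and individual t tests (hypothetical data).
import numpy as np
from scipy import stats

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ (X.T @ y)
y_hat = X @ b

SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

# Overall significance: F0 = (SSR/k) / (SSE/(n-k-1)), df = (k, n-k-1)
MSR, MSE = SSR / k, SSE / (n - k - 1)
F0 = MSR / MSE
p_F = stats.f.sf(F0, k, n - k - 1)

# Individual slopes: t0 = b_i / s_bi, df = n-k-1, with s_bi from diag(MSE * (X'X)^-1)
s_b = np.sqrt(MSE * np.diag(XtX_inv))
t0 = b / s_b
p_t = 2 * stats.t.sf(np.abs(t0), n - k - 1)

print("F0 =", F0, "p-value =", p_F)
print("t stats =", t0, "p-values =", p_t)
```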

EXAMPLE 6.2: Ten branches of a national chain store were randomly selected to evaluate the effects of population and income on sales in each sales district. Data labels mean: y = weekly sales in millions of dollars; x1 = population in hundreds of thousands; and x2 = weekly family income in millions of dollars. Output follows.

a. State the population and estimated sample regression equations.
b. Test the significance of the equation at the five-percent significance level.
c. Test the significance of each slope coefficient.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.
R Square            0.
Adjusted R Square   0.
Standard Error      0.
Observations        10

ANOVA        df   SS    MS   F   Significance F
Regression   2    55.
Residual     7    4.
Total        9    59.

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    -8.3195        1.
x1           0.7095         0.
x2           0.6987         0.

Regression equations: The EXCEL output produces three tables: 1) Regression Statistics; 2) ANOVA; and 3) regression estimates and standard errors. This lecture explains these three tables, in reverse order.

Table 3: Regression Estimates and Standard Errors:

As an example, Example 6.1 provided the table:

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -8.3195        1.9375           -4.2940   0.0036    -12.9009    -3.
x1           0.7095         0.0862           8.2280    0.0001    0.5056      0.
x2           0.6987         0.0974           7.1716    0.0002    0.4683      0.

From this table, it is possible to write the estimated sample regression equation, and hence, the sample regression and population regression equations. It also allows for testing of significance of the individual slope coefficients, and provides confidence intervals for the slope coefficients. When running the regression, the significance level (and confidence level) can be chosen by the user:

Population regression equation: y = β0 + β1x1 + β2x2 + ε

Sample regression equation: ŷ = b0 + b1x1 + b2x2

Estimated sample regression equation: ŷ = -8.3195 + 0.7095x1 + 0.6987x2

To interpret the equations:

 In the estimated equation, the intercept, b0 = -8.3195, is negative, so sales = -$83,195, on average, when population = 0 and family income = 0.
 The slope coefficient associated with population is b1 = 0.7095. Since x1 is measured in hundreds of thousands, for every 100,000 increase in population, weekly sales increase by $7,095.
 The slope coefficient associated with weekly family income is b2 = 0.6987. Since x2 is measured in $millions, for every $1M increase in total weekly family income, weekly sales rise by $6,987.
 Though the unit of measure for the dependent and independent variables was unimportant when we ran the regression, it is important for interpreting the results properly. (A short sketch after these points applies the estimated equation to a hypothetical district.)
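As a quick arithmetic check of the interpretation above, the snippet below plugs hypothetical district values (chosen purely for illustration) into the estimated equation ŷ = -8.3195 + 0.7095x1 + 0.6987x2.

```python
# Illustrative prediction from the Example 6.2 estimated equation.
b0, b1, b2 = -8.3195, 0.7095, 0.6987

x1 = 20.0   # hypothetical population, in hundreds of thousands
x2 = 25.0   # hypothetical weekly family income, in millions of dollars

y_hat = b0 + b1 * x1 + b2 * x2
print("predicted weekly sales (in the units of y):", round(y_hat, 4))
```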

Table 2: ANOVA Table

This table is used to test the significance of the overall regression model. Its EXCEL format is:

ANOVA        df       SS    MS                  F              Significance F
Regression   k        SSR   MSR = SSR/k         F0 = MSR/MSE   p-value = Pr(F > F0)
Residual     n-k-1    SSE   MSE = SSE/(n-k-1)
Total        n-1      SST

The degrees of freedom column appears before the SS column, unlike the ANOVA tables used to test equality of population means. In Example 6.1, the table was:

ANOVA        df   SS       MS       F        Significance F
Regression   2    55.1489  27.5745  43.3648  0.
Residual     7    4.4511   0.
Total        9    59.

From this table, since the p-value (Significance F) = 0.0001, we reject H0 at any significance level α ≥ 0.0001 (including the usual 5% level) and conclude the regression is significant.

4. Create a correlation matrix, and test the strength of linear relationships between the variables. If there is no significant linear relationship between an independent variable and the dependent variable, the independent variable should typically not be included in your model. For those independent variables with significant linear relationships with the dependent variable, sort them from the strongest linear relationship (the absolute value of the correlation coefficient) to the weakest. EXCEL creates such a matrix quite easily, as we will see in class. Since it is not always clear which correlations are significant, and testing them takes time, stepwise regression is one way to build a model that implicitly tests for the significance of correlations, without doing so explicitly.

5. Test for multicollinearity (as well as other OLS assumptions). If the correlation coefficient between two independent variables seems high in absolute terms, there may be multicollinearity, where one independent variable is highly correlated with, or linearly dependent on, another. The third section of this topic discusses how to test for multicollinearity. Multicollinearity violates the OLS assumptions, so independent variables that create it should be excluded from the model. (A short VIF sketch follows this step.)
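Step 5 refers to the Variance Inflation Factor, treated further in section 3b. As a rough sketch of the idea (Python, made-up data; the course itself reads VIFs from EXCEL output), VIFj = 1/(1 - Rj²), where Rj² comes from regressing xj on the remaining independent variables.

```python
# Illustrative VIF computation on deliberately correlated, made-up data.
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (X already includes a column of 1s)."""
    b = np.linalg.inv(X.T @ X) @ (X.T @ y)
    y_hat = X @ b
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 * 0.9 + rng.normal(scale=0.3, size=50)   # deliberately correlated with x1
x3 = rng.normal(size=50)
X_vars = np.column_stack([x1, x2, x3])

n, k = X_vars.shape
for j in range(k):
    others = np.delete(X_vars, j, axis=1)
    Xj = np.column_stack([np.ones(n), others])
    vif = 1.0 / (1.0 - r_squared(X_vars[:, j], Xj))
    print(f"VIF for x{j + 1}: {vif:.2f}")   # rule of thumb: VIF above roughly 5-10 signals multicollinearity
```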

2 B. VARIATIONS IN THE MULTIPLE LINEAR REGRESSION MODEL

A multiple linear regression model must be linear in its coefficients, but not necessarily in its independent variables. Provided the dependent variable, y, is a continuous variable (ratio or interval, or discrete with a very large number of possible outcomes), the independent variables can take many forms. (A short sketch after item 6 below combines several of these variations in one design matrix.)

1. Qualitative Independent Variables: Independent variables can be qualitative, provided they are recoded to numeric values. For example, we might want to see if sex (variable 1), age (variable 2) and marital status (variable 3) make a difference to annual after-tax income (y). Sex and marital status are both qualitative.

Convert these variables to numeric values, and then you can use the multiple linear regression model. For example, if the model includes xi = marital status, we can let xi = 0 if single, xi = 1 if widowed, xi = 2 if separated or divorced, and xi = 3 if married.

2. Dummy Variables (or Indicator Variables): These are a very specific type of qualitative variable which take the value 1 when the observation has the quality of interest, and the value 0 otherwise. For example, in a regression where xi represents sex, we can let xi = 1 if male and xi = 0 if female. In this case, the Dummy or Indicator Variable indicates sex.

3. Non-linear relationships: Suppose you have the following population regression model:

y = β0 + β1x1 + β2x2 + β3x3 + … + βkxk³ + ε

You might think that this model is non-linear. In fact, that is not true. If we used the letter "v" in our equation for all the x's, we could write:

y = β0 + β1v1 + β2v2 + β3v3 + … + βkvk + ε

which is a multiple linear regression model, where v1 = x1, v2 = x2, …, vk-1 = xk-1, and vk = xk³. This means that independent variables can be transformations of something measured, such as log(income), age², and so on. The same logic does not apply to the coefficients. If the model used βk², for example, we could solve for bk², but not the value of bk, because there could be multiple roots. More importantly, bk would no longer be a BLUE estimate for βk (BLUE = best, linear, unbiased estimate).

4. Interaction effects: Suppose you have the following population regression model:

y = β0 + β1x1 + β2x2 + β3x1x2 + ε

Your third slope coefficient is an interaction effect between x1 and x2. In fact, this is very much like a Two-Way ANOVA with replication. This remains a multiple linear regression model. Just think: if I gave you x3 instead and didn't tell you that x3 = x1·x2, wouldn't you think the model was linear?

Individual variables are considered "basic" terms, while interactions are "interaction terms". A composite model is a model that has both the basic terms and interaction terms. This gets a little more complicated if we started with three basic terms; then we have several possibilities for interactions (x1 and x2; x1 and x3; x2 and x3; and x1, x2 and x3).

5. Time: Suppose you have the following population regression model:

y = β0 + β1x1 + β2x2 + β3t + ε

where the variable, t, is some measure of time (day, month, year, etc.).

6. Time series: Suppose you have the following population regression model:

yt = β0 + β1x1 + β2x2 + β3yt-1 + ε

Now one of your independent variables is a past value of the dependent variable.
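The sketch below (Python, entirely made-up data and coefficients) combines several of the variations above (a dummy variable, a transformed term, an interaction term, and a time trend) in one design matrix, to emphasize that the model remains linear in its coefficients and the same OLS formula applies.

```python
# Illustrative design matrix mixing dummy, transformed, interaction, and time-trend terms.
import numpy as np

rng = np.random.default_rng(2)
n = 30

age    = rng.uniform(20, 60, n)
income = rng.uniform(30, 120, n)               # hypothetical, in $000s
male   = rng.integers(0, 2, n)                 # dummy variable: 1 = male, 0 = female
t      = np.arange(1, n + 1)                   # time index (e.g., month)

X = np.column_stack([
    np.ones(n),        # intercept
    age,               # basic term
    age ** 2,          # transformed term: still linear in the coefficients
    male,              # dummy (indicator) variable
    income,            # basic term
    male * income,     # interaction term
    t,                 # time trend
])

# Hypothetical dependent variable, just so the regression can be run end to end.
y = 5 + 0.8 * age - 0.005 * age**2 + 3 * male + 0.2 * income + rng.normal(0, 1, n)

b = np.linalg.inv(X.T @ X) @ (X.T @ y)
print("estimated coefficients:", np.round(b, 3))
```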

2 C. WHAT IS A "GOOD" MODEL?

Several statistics are important in determining if a multiple linear model is "good", or in choosing the "better" of two or more models.

The first step, before any comparison, is to ensure the OLS assumptions hold and the model is significant.

 Examine the population regression model to see if it is linear in the coefficients.
 Use the regression output to determine if the model is significant.
 Run a correlation matrix, and conduct the VIF test (if needed) to determine if multicollinearity is present.

Eliminate from your comparisons any model that is non-linear in coefficients, insignificant, or shows the presence of multicollinearity.

The next step is to use the following four measures to determine the "best" model out of the set that meet all required conditions: 1) the Coefficient of Determination (R-squared); 2) the Adjusted R-squared; 3) the Mean Squared Error (MSE); and 4) the test statistic, F0. The order of importance is, typically, 2), 3), 1) and 4).
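For a sense of how these measures are computed, here is an illustrative Python sketch on made-up data; in this course the values are read straight from the EXCEL regression output rather than computed by hand. The adjusted R-squared formula penalizes extra variables that add little explanatory power, which is why it ranks first among the comparison measures.

```python
# Illustrative comparison measures (R^2, adjusted R^2, MSE, F0) for candidate models.
import numpy as np

def model_summary(y, X_vars):
    n, k = X_vars.shape
    X = np.column_stack([np.ones(n), X_vars])
    b = np.linalg.inv(X.T @ X) @ (X.T @ y)
    y_hat = X @ b
    SSE = np.sum((y - y_hat) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    SSR = SST - SSE
    r2 = SSR / SST
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra variables
    MSE = SSE / (n - k - 1)
    F0 = (SSR / k) / MSE
    return r2, adj_r2, MSE, F0

rng = np.random.default_rng(3)
x1, x2, x3 = rng.normal(size=(3, 40))
y = 2 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 1, 40)   # x3 is irrelevant by construction

print("model with x1, x2:     ", model_summary(y, np.column_stack([x1, x2])))
print("model with x1, x2, x3: ", model_summary(y, np.column_stack([x1, x2, x3])))
```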

Model 1: dependent variable = water usage

R Square            0.
Adjusted R Square   0.
Standard Error      248.
Observations        17

ANOVA        df   SS            MS           F       Significance F
Regression   4    2448834.0099  612208.5025  9.8770  0.
Residual     12   743797.5195   61983.
Total        16   3192631.

             Coefficients  Standard Error  t Stat   P-value
Intercept    6360.3373     1314.3916       4.8390   0.
temperature  13.8689       5.1598          2.6879   0.
production   0.2117        0.0455          4.6484   0.
days         -126.6904     48.0223         -2.6382  0.
workers      -21.8180      7.2845          -2.9951  0.

Model 2: dependent variable = workers

R Square            0.
Adjusted R Square   0.
Standard Error      9.
Observations        17

ANOVA        df   SS         MS         F        Significance F
Regression   3    6572.3931  2190.7977  24.3823  0.
Residual     13   1168.0774  89.
Total        16   7740.

             Coefficients  Standard Error  t Stat   P-value
Intercept    127.4184      35.4334         3.5960   0.
temperature  -0.0626       0.1957          -0.3197  0.
production   0.0058        0.0007          8.4898   0.
days         -0.7347       1.8170          -0.4043  0.

Model 3: dependent variable = water usage

Multiple R          0.
R Square            0.
Adjusted R Square   0.
Standard Error      316.
Observations        17

ANOVA        df   SS            MS           F       Significance F
Regression   3    1892801.6613  630933.8871  6.3102  0.
Residual     13   1299829.8681  99986.
Total        16   3192631.

             Coefficients  Standard Error  t Stat   P-value
Intercept    3580.3279     1182.0066       3.0290   0.
temperature  15.2339       6.5278          2.3337   0.
production   0.0861        0.0226          3.8100   0.
days         -110.6611     60.6128         -1.8257  0.

Model 4: dependent variable = days

R Square            0.
Adjusted R Square   0.
Standard Error      1.
Observations        17

ANOVA        df   SS       MS      F       Significance F
Regression   1    6.5567   6.5567  3.5533  0.
Residual     15   27.6786  1.
Total        16   34.

             Coefficients  Standard Error  t Stat    P-value
Intercept    18.3976       1.6631          11.0620   0.
temperature  0.0474        0.0251          1.8850    0.

y     x1    x2     x3  x4
3067  58.8  7107   21  129
2828  65.2  6373   22  141
2891  70.9  6796   22  153
2994  77.4  9208   20  166
3082  79.3  14792  25  193
3898  81.0  14564  23  189
3502  71.9  11964  20  175
3060  63.9  13526  23  186
3211  54.5  12656  20  190
3286  39.5  14119  20  187
3542  44.5  16691  22  195
3125  43.6  14571  19  206
3022  56.0  13619  22  198
2922  64.7  14575  22  192
3950  73.0  14556  21  191
4488  78.9  18573  21  200
3295  79.4  15618  22  200

SOLUTION:

a. Note that p-value = Significance F.

Model 1: y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε, p-value = 0.
Model 2: x4 = β0 + β1x1 + β2x2 + β3x3 + ε, p-value = 0.
Model 3: y = β0 + β1x1 + β2x2 + β3x3 + ε, p-value = 0.
Model 4: x̂3 = 18.3976 + 0.0474x1, p-value = 0.

(http://www.theatlantic.com/business/archive/2009/04/department-of-awful-statistics/7182/#)

The lesson from this graph is the importance of selecting only appropriate variables in the potential set of explanatory or independent variables. This is a vital first step in model building, and the selection of variables should be guided by sound theory and established practice. Once this is done, you are ready to start model building.

The second step before applying stepwise regression involves calculation of a Correlation Matrix, preferably using statistical software. This matrix is key in the next steps. Consider Example 6.4.

EXAMPLE 6.4: In a study of the achievement of executives, a consulting firm for management efficiency administered three psychological tests to 20 randomly selected executives from firms with annual profits over $ billion, in order to construct a multiple regression model. In this model, the variables and data available are given below:

y = current annual salary in thousands of dollars;
x1 = test score on vitality & drive;
x2 = test score on numerical & verbal reasoning ability;
x3 = test score on sociability & leadership.

Before selecting the most appropriate model, calculate the correlation matrix, and determine which score is the most highly correlated with current annual salary, and which is least correlated. Do you have reason to believe the relationship is causal? Explain.

SOLUTION: To construct this correlation matrix in EXCEL, first copy and paste the table into EXCEL. Next, in Tools, click on Data Analysis. Choose Correlation in the first pop-up box, then push the OK button. When the second pop-up box appears, include the entire table in the Input Range, check Labels in First Row, then push the OK button.

y    x1   x2   x3
85   16   78   17
82   17   65   11
120  20   80   14
95   22   74   6
125  24   82   29
145  32   77   25
158  32   95   36
140  30   80   28
138  33   75   26
135  34   77   24
175  37   82   50
190  37   82   85
175  39   82   68
185  38   83   71
150  38   94   15
200  40   92   66
210  43   90   92
225  44   89   95
250  45   98   85
245  45   90   95

The correlation matrix from EXCEL shows only the lower triangle, since correlation(xi, xj) = correlation(xj, xi). The diagonal elements are correlations of variables with themselves, which always equal 1.

The correlation matrix below shows the highest correlation (r = 0.9366) between salary and the test score on vitality & drive (x1), and the lowest (r = 0.7560) between salary and the test score on numerical & verbal reasoning ability (x2). Though test scores may not explain differences in salary, they may be proxy measurements of the qualities that explain salary differences.

      y         x1        x2       x3
y     1
x1    0.936598  1
x2    0.755978  0.708663  1
x3    0.924352  0.828226  0.58567  1
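The same matrix can be reproduced outside EXCEL; a small Python sketch using the Example 6.4 data and numpy's corrcoef is shown below (illustrative only, since the course uses the EXCEL Data Analysis tool).

```python
# Illustrative correlation matrix for the Example 6.4 data (y, x1, x2, x3).
import numpy as np

y  = np.array([85, 82, 120, 95, 125, 145, 158, 140, 138, 135,
               175, 190, 175, 185, 150, 200, 210, 225, 250, 245])
x1 = np.array([16, 17, 20, 22, 24, 32, 32, 30, 33, 34,
               37, 37, 39, 38, 38, 40, 43, 44, 45, 45])
x2 = np.array([78, 65, 80, 74, 82, 77, 95, 80, 75, 77,
               82, 82, 82, 83, 94, 92, 90, 89, 98, 90])
x3 = np.array([17, 11, 14, 6, 29, 25, 36, 28, 26, 24,
               50, 85, 68, 71, 15, 66, 92, 95, 85, 95])

corr = np.corrcoef(np.vstack([y, x1, x2, x3]))   # 4x4 symmetric matrix
labels = ["y", "x1", "x2", "x3"]
for label, row in zip(labels, corr):
    print(label, np.round(row, 4))
```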

Now, let's model build. Forward Selection builds a model starting with a simple linear regression, where the explanatory variable is the one most highly correlated with the dependent variable. In subsequent regressions, explanatory variables are added, one at a time. If the additional variable results in a better model, the new model is kept, and the process repeated. If the additional variable results in a "worse" model, the new variable is dropped, the previous model is kept, and the process stops.

With Backward Elimination, the model begins with all explanatory variables. Variables are dropped, one at a time, starting with the variable least correlated with the dependent variable, until the model selected is better than all previous ones, and better than the next one in the queue. A short sketch of the forward-selection idea follows.
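As a rough illustration of the forward-selection idea just described (not the notes' EXCEL-based procedure), the Python sketch below adds candidate variables in order of their correlation with y and keeps each one only if a chosen "better model" criterion (adjusted R-squared here, one of the measures from section 2c) improves; all data are made up.

```python
# Illustrative forward selection using adjusted R^2 as the "better model" criterion.
import numpy as np

def adj_r2(y, X_vars):
    n, k = X_vars.shape
    X = np.column_stack([np.ones(n), X_vars])
    b = np.linalg.inv(X.T @ X) @ (X.T @ y)
    SSE = np.sum((y - X @ b) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return 1 - (SSE / (n - k - 1)) / (SST / (n - 1))

rng = np.random.default_rng(4)
X_all = rng.normal(size=(60, 4))                       # candidate variables x1..x4
y = 3 + 2 * X_all[:, 0] + 1 * X_all[:, 2] + rng.normal(0, 1, 60)

# Order candidates by |correlation with y|, strongest first.
order = np.argsort([-abs(np.corrcoef(X_all[:, j], y)[0, 1]) for j in range(4)])

selected = [order[0]]                                  # start with the most correlated variable
best = adj_r2(y, X_all[:, selected])
for j in order[1:]:
    trial = adj_r2(y, X_all[:, selected + [j]])
    if trial > best:                                   # "better" model: keep the new variable
        selected.append(j)
        best = trial
    else:                                              # "worse" model: drop it and stop
        break

print("selected variables (0-based indices):", selected, " adjusted R^2:", round(best, 3))
```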

Forward Selection: Specify a simple linear regression model with the explanatory variable most highly correlated with the dependent variable.

1. Conduct an F-test to determine if the explanatory variable is significant. If it is not, your set of explanatory variables is unsatisfactory in explaining your dependent variable. STOP.