




Material Type: Exam; Class: Analysis of Experiments; Subject: Statistics; University: University of Connecticut; Term: Unknown 1989;
Output 4.8 R-Square Selection for the BOQ Data

The REG Procedure
Model: MODEL1
R-Square Selection Method

Models for Dependent Variable: MANH

(The listing shows, for each subset model, the number of variables in the model, R-Square, C(p), and the variables in the model. Among the one-variable models, OCCUP has C(p) = 21.7606, followed by ROOMS (101.8743), CHECKIN (182.8633), and CAP (207.9875). The best subsets of each larger size all contain OCCUP and CHECKIN, building up through OCCUP CHECKIN CAP and OCCUP CHECKIN COMMON CAP to the full seven-variable model OCCUP CHECKIN HOURS COMMON WINGS CAP ROOMS. The R-Square values are not reproduced here.)
110  4.4 Variable Selection □ Chapter 4
Some additional options that are useful for problems with many variables are as follows:
INCLUDE=n specifies that the first n independent variables in the MODEL statement are to be included in all subset models.
START=n specifies that only subsets of n or more variables are to be considered. This number includes the variables specified by the INCLUDE option.
STOP=n specifies that subsets of no more than n variables are to be considered. This number also includes the variables specified by the INCLUDE option.
B specifies that the regression coefficients are printed for each subset model. This option should be used sparingly as it will produce a large amount of output. A more efficient way of getting coefficient estimates is to use information from Output 4.8 to choose interesting models and obtain the coefficient estimates by repeated MODEL statements or, interactively, with ADD and DELETE statements.
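To see what the RSQUARE method is doing, the all-subsets search can be sketched outside SAS. The following Python fragment is an illustration only; the data, the seed, and the helper r_square are our own inventions, not part of PROC REG. It enumerates every subset of the candidate variables, computes R-Square for each, and reports the best subset of each size, just as the RSQUARE listing does.

```python
# Illustrative sketch (not SAS): the brute-force all-subsets search
# performed by the RSQUARE selection method. Data and column indices
# are synthetic stand-ins, not the BOQ data.
from itertools import combinations
import numpy as np

def r_square(X, y, cols):
    """R-Square of y regressed on the listed columns of X, intercept included."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
# y depends on columns 0 and 2 only; the other columns are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=40)

# Best subset of each size, ranked by R-Square, as RSQUARE would list them.
for size in range(1, 5):
    best = max(combinations(range(4), size), key=lambda c: r_square(X, y, c))
    print(size, best, round(r_square(X, y, best), 4))
```

With only a handful of candidate variables the enumeration is trivial; the point of the options above (INCLUDE=, START=, STOP=) is to keep this search manageable when the number of candidates is large.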
An examination of the R-Square values in Output 4.8 does not reveal any obvious choices for selecting a most useful subset model. A number of other statistics have been developed to aid in making these choices and 12 of these are available as additional options with the RSQUARE option. Among these, the most frequently used is the C(P) statistic, proposed by Mallows (1973). This statistic is a measure of total squared error for a subset model containing p independent variables. The total squared error is a measure of the error variance plus the bias introduced by not including variables in a model. It may, therefore, indicate when variable selection is deleting too many variables. The C(P) statistic is computed as follows:
C(p) = (SSE(p) / MSE) - (N - 2p) + 2
where
MSE is the error mean square for the full model (or some other estimate of pure error)
SSE(p) is the error sum of squares for the subset model containing p independent variables (not including the intercept)^5
N is the total sample size.
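The computation can be checked with a small sketch. The following Python function is our own (not a SAS feature); it simply encodes the formula above, and the numbers in the check are hypothetical.

```python
# Mallows' C(p) as defined in the text, with p the number of independent
# variables in the subset model (intercept excluded).
def mallows_cp(sse_p, mse_full, n, p):
    """C(p) = SSE(p)/MSE - (N - 2p) + 2."""
    return sse_p / mse_full - (n - 2 * p) + 2

# Consistency check with hypothetical numbers: for the full model with
# k variables, SSE(k)/MSE equals the error df, n - k - 1, so C(p) must
# come out to exactly k + 1, i.e., p + 1.
n, k, mse = 25, 7, 100.0
sse_full = (n - k - 1) * mse        # 1700.0 by construction
print(mallows_cp(sse_full, mse, n, k))  # 8.0, i.e., p + 1 for p = 7
```

This check also explains the (p+1) reference line used in the discussion that follows: the full model always sits exactly on it.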
For any given number of selected variables, larger C(p) values indicate equations with larger error mean squares. For any subset model for which C(p) > (p+1), there is evidence of bias due to the deletion of important variables from the model. On the other hand, if there are values of C(p) < (p+1), the full model is said to be overspecified; that is, it contains too many variables.
Mallows recommends that C(p) be plotted against p, and further recommends selecting that subset size where the minimum C(p) first approaches (p+1), starting from the full model. The magnitudes of differences in the C(p) statistic between the optimum and near-optimum models for each subset size are also of interest.
5 In the original presentation of the C(p) statistic (Mallows 1973), the intercept coefficient is also considered as a candidate for selection, so that in that presentation the number of variables in the model is one more than what is defined here and results in the +1 elements in the equations. As implied in the discussion of the COLLIN option, allowing the deletion of the intercept is not normally useful.
You can examine the resulting data set using PROC PRINT. Output 4.10 shows the results.
Output 4.10 Output Data Set from R-Square Selection
(Output 4.10, not fully reproduced here, lists the 19 observations of the output data set, one per subset model. Each observation contains the model label (_MODEL_), _TYPE_ (PARMS), the dependent variable name (_DEPVAR_ = MANH), _RMSE_, the Intercept, the estimated coefficients stored under the names of the independent variables (OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP, ROOMS), and the selection statistics _IN_ (the number of variables in the model, running from 1 to 7), _P_ (equal to _IN_ + 1), _EDF_ (the error degrees of freedom, 23 down to 17), _RSQ_, and _CP_.)
As you can see, this data set contains a number of statistics. The estimated coefficients are identified by the names of the independent variables. For the C(p) plot, plot the variable CP against IN for the C(p) values and P against IN for the reference line:

proc gplot;
   symbol1 v='C' c=black;
   symbol2 v=star c=black l=1 i=join;
   plot cp*in=1 p*in=2 / overlay vaxis=0 to 15 by 5;
run;
The plot, which is not reproduced here, will look like the one in Output 4.9 except for the line joining the reference points.
Multicollinearity: Detection and Remedial Measures □ 4.4 Variable Selection  113
The pattern of C(P) values for this example is quite typical for situations where multicollinearity is serious. Starting with IN=7, they initially become smaller than (p+1), as fewer variables are included, but eventually start to increase. In other words, the residual mean square initially decreases as variables are deleted. In this plot, there is a definite "corner" at IN=3, where the C(P) values increase rapidly with smaller subset sizes. Hence, a model with three variables appears to be a good choice.
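The corner rule just described can be stated mechanically. The sketch below is a Python illustration with a hypothetical C(p) pattern shaped like the one in this example; the function and its inputs are our own, not SAS output.

```python
def pick_subset_size(cp_by_size):
    """Smallest subset size whose minimum C(p) is still at or below p + 1.
    Starting from the full model and deleting variables, stop just before
    C(p) jumps above the p + 1 reference line (the 'corner')."""
    sizes = sorted(cp_by_size, reverse=True)      # full model first
    choice = sizes[0]
    for p in sizes:
        if cp_by_size[p] <= p + 1:
            choice = p                            # still essentially unbiased
        else:
            break                                 # corner reached: bias evident
    return choice

# Hypothetical minimum C(p) per subset size, patterned after the BOQ
# example: flat near p + 1 down to three variables, then a sharp rise.
cp = {7: 8.0, 6: 6.3, 5: 5.2, 4: 4.9, 3: 3.9, 2: 40.0, 1: 150.0}
print(pick_subset_size(cp))  # 3
```

This is only a codified version of the eyeball test; in practice you would still inspect the plot, since a gradual drift above p + 1 is less decisive than a sharp corner.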
A second criterion is the difference in C(p) between the optimum and second optimum subset for each subset size. In the above models with four or more variables, these differences are very small, which implies that the multicollinearity allows interchange of variables without affecting the fit of the model. However, that difference is larger with three variables, implying that the degree of multicollinearity has decreased. You can now estimate the best three-variable model using the information from Output 4.8 as follows:

proc reg data=boq;
   model manh = occup checkin cap / vif;
run;
Results of the regression appear in Output 4.11.
Output 4.11 Regression for BOQ Data, Best Three-Variable Model
The REG Procedure
Model: MODEL1
Dependent Variable: MANH

                        Analysis of Variance
                               Sum of          Mean
Source             DF         Squares        Square    F Value    Pr > F
Model               3        87359949      29119983     406.22    <.0001
Error              20         1433710         71686
Corrected Total    23        88793659

Root MSE          267.74149    R-Square    0.9839
Dependent Mean   2050.00708    Adj R-Sq    0.9814

                        Parameter Estimates
                  Parameter     Standard                            Variance
Variable    DF     Estimate        Error    t Value    Pr > |t|    Inflation
Intercept    1    207.86486     78.28539       2.66      0.0152            0
OCCUP        1     20.67163      1.75123      11.80      <.0001      8.21908
CHECKIN      1
CAP          1     -3.45397

(The remaining entries in the listing are not reproduced here.)
Note: Since this model has been specified by the data, the p values cannot be used literally but are useful for determining the relative importance of the variables.
It does appear that more turnovers (CHECKIN) require additional manpower. The negative partial coefficient for CAP may reflect lower man-hours for a larger proportion of vacant rooms.
A number of other statistics are available to assist in choosing subset models. Some are relatively obvious, such as the residual mean square or standard deviations, while others are related to R-Square, with some providing adjustments for degrees of freedom or scaling preferences. They are all essentially equivalent, although some have different theoretical justification. A number of these are available as options with the selection methods described in this section. Keywords and literature references for these options are provided in the PROC REG chapter in the SAS/STAT User's Guide.
Backward elimination (BACKWARD or B) begins by computing the regression with all independent variables specified in the MODEL statement. The procedure deletes from that model the variable whose coefficient has the largest p value (smallest partial F value). The resulting equation is examined for the variable now contributing the least, which is then deleted, and so on. The procedure stops when all coefficients remaining in the model are statistically significant at a level specified by the user (see SLS specification later in this section). With this method, once a variable has been deleted, it is deleted permanently.
Stepwise selection (STEPWISE) begins like forward selection, but after a variable has been added to the model, the resulting equation is examined to see if any coefficient has a sufficiently large p value (see SLS specification later in this section) to suggest that a variable should be dropped. This procedure continues until no additions or deletions are indicated according to significance levels (SLE and SLS) chosen by the user.
Maximum R-Square improvement (MAXR) begins by selecting one- and two-variable models as in forward selection. At that point, the procedure examines all possible pairwise interchanges with the variables not in the model. Among all interchanges that increase R^2 , that interchange resulting in the largest increase in R^2 is implemented. This process is repeated until no pairwise interchange improves the model. At this point, a third variable is selected as in forward selection. Then the interchanging process is repeated, and so on. This method usually requires more computer time than the other three, but it also tends to have a better chance of finding more nearly optimum models. In addition, the maximum R-Square improvement method often produces a larger number of equations, a feature that may be of some value in the final evaluation process.
Minimum R-Square improvement (MINR) is similar to maximum R-Square improvement, except that interchanges are implemented for those with minimum improvement. Since interchanges are not implemented when R^2 is decreased, the final results are quite similar to those of the maximum R-Square improvement method, except that a larger number of equations may be examined.
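The step logic of forward selection can be sketched in a few lines. The Python below is an illustration, not the PROC REG implementation: it uses a fixed F-to-enter cutoff in place of the SLE significance level, and the data and helper names are our own.

```python
# Illustrative sketch of forward selection (not PROC REG): at each step,
# add the candidate variable with the largest partial F; stop when no
# candidate exceeds the F-to-enter cutoff, which stands in here for the
# SLE significance level used by PROC REG.
import numpy as np

def r_square(X, y, cols):
    """R-Square of y on the listed columns of X, intercept included."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, f_in=4.0):
    n, k = X.shape
    chosen = []
    while len(chosen) < k:
        r2_old = r_square(X, y, chosen) if chosen else 0.0
        df_err = n - len(chosen) - 2          # error df after adding one variable
        best_j, best_f = None, f_in
        for j in [j for j in range(k) if j not in chosen]:
            r2_new = r_square(X, y, chosen + [j])
            f = (r2_new - r2_old) * df_err / (1.0 - r2_new)  # partial F, 1 num. df
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:                    # no candidate reaches F-to-enter
            break
        chosen.append(best_j)
    return chosen

# Synthetic stand-in data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=60)
print(forward_select(X, y))
```

Backward elimination and stepwise selection are variations on the same loop: backward starts from the full model and deletes the smallest partial F, and stepwise rechecks the variables already in the model after each addition.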
A number of options are available to provide greater control over the selection procedures. One option available for all procedures is INCLUDE=n, which specifies that the first n independent variables in the MODEL statement are to be kept in the model at all times.
For the FORWARD, BACKWARD, and STEPWISE methods, you may specify desired significance levels for stopping the addition or elimination of variables as follows:
SLE = .xxx specifies the significance level for stopping the addition of variables in the forward selection mode. If not specified, the default is 0.50 for FORWARD and 0.15 for STEPWISE.
SLS = .xxx specifies the significance level for stopping the backward elimination mode. If not specified, the default is 0.10 for BACKWARD and 0.15 for STEPWISE.
The smallest permissible value for SLS is 0.0001, which almost always ensures that the final equation obtained by backward elimination contains only one variable, while the maximum SLE of 0.99 usually includes all variables when forward selection has stopped. It is again important to note that, since variable selection is an exploratory rather than confirmatory analysis, the SLE and SLS values do not have the usual interpretation as probabilities of erroneously rejecting the null hypothesis of the nonexistence of the coefficients in any selected model.
MAXR and MINR selection do not use significance levels, but you may specify the starting and stopping of these procedures with these options:
START=s specifies that the interchanging procedure starts with the first s variables in the MODEL statement.
STOP=s specifies that the procedure stops when the best s-variable model has been found.
For MAXR and MINR, the INCLUDE option works as it does with RSQUARE. That is, if the INCLUDE option is used in addition to the START or STOP option, the INCLUDE option overrides the START or STOP specification. For example, if you use START=3 and INCLUDE=2, the selection starts with the first three variables, but the first two may not be candidates for deletion.
Because the outputs from the step-type selection procedures are quite lengthy, they are illustrated here only with the SELECTION=FORWARD option using the redefined variables of the BOQ data. Use these SAS statements:
proc reg data=boq;
   model relman = relocc relcheck relcom relwings relcap hours rooms
         / selection=forward;
run;
The results appear in Output 4.12.
Output 4.12 Forward Selection for Redefined Variables
BOQ DATA, MYERS P 145
OMIT OBS 23

The REG Procedure
Model: MODEL1
Dependent Variable: RELMAN

Forward Selection: Step 1

Variable RELCAP Entered: R-Square = 0.1849 and C(p) = 3.

(The step 1 listing continues with the analysis of variance for this one-variable model, with 1 model, 22 error, and 23 corrected total degrees of freedom, followed by the parameter estimates with their standard errors, Type II sums of squares, and F tests: the Intercept has F = 15.16, and RELCAP, with an estimated coefficient of 329.8, has F = 4.99. Bounds on condition number: 1, 1.)

Continued