
Multicollinearity: Detection and Remedial Measures

4.4 Variable Selection

4.4.1 The R-Square Selection Method

Guaranteed optimum subsets are obtained by using the MODEL statement option

SELECTION=RSQUARE in PROC REG. You can use this method for the BOQ data as follows:

proc reg data=boq;
   model manh = occup checkin hours common wings cap rooms /
         selection=rsquare best=4 cp;
run;

Two additional options are specified here as follows:

BEST=4 specifies that only the best four (smallest error mean square) models for each subset

size are to be printed. This option prevents excessive output.

CP specifies the printing of the Mallows C(P) statistic (denoted by C(P) in the output)

for each subset. This is the most popular of several statistics used to aid selection of

a final model (see Section 4.4.2).

The results appear in Output 4.8.

Output 4.8 Regression for BOQ Data Using R-SQUARE Selection

                            The REG Procedure
                              Model: MODEL1
                        R-Square Selection Method
                   Models for Dependent Variable: MANH

  Number in
    Model    R-Square        C(p)    Variables in Model
      1        0.9619     21.7606    OCCUP
      1        0.8888    101.8743    ROOMS
      1        0.8149    182.8633    CHECKIN
      1        0.7920    207.9875    CAP
      2        0.9765      7.8089    OCCUP CHECKIN
      2        0.9645     20.8704    OCCUP CAP
      2        0.9634     22.1085    OCCUP ROOMS
      2        0.9629     22.7211    OCCUP WINGS
      3        0.9839      1.7003    OCCUP CHECKIN CAP
      3        0.9803      5.6299    OCCUP CHECKIN ROOMS
      3        0.9767      9.5345    OCCUP CHECKIN COMMON
      3        0.9766      9.6912    OCCUP CHECKIN WINGS
      4        0.9851      2.3204    OCCUP CHECKIN COMMON CAP
      4        0.9845      2.9396    OCCUP CHECKIN CAP ROOMS
      4        0.9839      3.6414    OCCUP CHECKIN COMMON ROOMS
      4        0.9839      3.6868    OCCUP CHECKIN WINGS CAP
      5        0.9854      4.0091    OCCUP CHECKIN COMMON WINGS CAP
      5        0.9851      4.3159    OCCUP CHECKIN HOURS COMMON CAP
      5        0.9851      4.3195    OCCUP CHECKIN COMMON CAP ROOMS
      5        0.9846      4.8286    OCCUP CHECKIN WINGS CAP ROOMS
      6        0.9854      6.0024    OCCUP CHECKIN HOURS COMMON WINGS CAP
      6        0.9854      6.0059    OCCUP CHECKIN COMMON WINGS CAP ROOMS
      6        0.9851      6.3146    OCCUP CHECKIN HOURS COMMON CAP ROOMS
      6        0.9847      6.8219    OCCUP CHECKIN HOURS WINGS CAP ROOMS
      7        0.9854      8.0000    OCCUP CHECKIN HOURS COMMON WINGS CAP ROOMS

The output from the SELECTION=RSQUARE option provides, for each subset size, the variables

included in the four best models, listed in order of decreasing R-Square, along with the R-Square

and C(P) statistics. You can see that R-Square remains virtually unchanged down to subsets of

size three, with all four selections having nearly equal R-Square values. Additional considerations

in this selection are presented in Section 4.4.2.


Some additional options that are useful for problems with many variables are as follows:

INCLUDE=n specifies that the first n independent variables in the MODEL statement are to be included in all subset models.

START=n specifies that only subsets of n or more variables are to be considered. This number includes the variables specified by the INCLUDE option.

STOP=n specifies that subsets of no more than n variables are to be considered. This number also includes the variables specified by the INCLUDE option.

B specifies that the regression coefficients are printed for each subset model. This option should be used sparingly as it will produce a large amount of output. A more efficient way of getting coefficient estimates is to use information from Output 4.8 to choose interesting models and obtain the coefficient estimates by repeated MODEL statements or, interactively, with ADD and DELETE statements.
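For example, the following sketch (the option values here are illustrative, not part of the original example) keeps the first two variables in every subset, restricts the search to subsets of three to five variables, and prints the coefficients of each subset model:

proc reg data=boq;
   /* INCLUDE=2 forces occup and checkin (the first two variables in the
      MODEL statement) into every subset; START=3 and STOP=5 limit the
      subset sizes considered; B prints the regression coefficients */
   model manh = occup checkin hours common wings cap rooms /
         selection=rsquare best=4 cp include=2 start=3 stop=5 b;
run;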

4.4.2 Choosing Useful Models

An examination of the R-Square values in Output 4.8 does not reveal any obvious choices for selecting a most useful subset model. A number of other statistics have been developed to aid in making these choices and 12 of these are available as additional options with the RSQUARE option. Among these, the most frequently used is the C(P) statistic, proposed by Mallows (1973). This statistic is a measure of total squared error for a subset model containing p independent variables. The total squared error is a measure of the error variance plus the bias introduced by not including variables in a model. It may, therefore, indicate when variable selection is deleting too many variables. The C(P) statistic is computed as follows:

C(P) = SSE(p)/MSE - (N - 2p) + 2

where

MSE is the error mean square for the full model (or some other estimate of pure error)

SSE(p) is the error sum of squares for the subset model containing p independent variables (not including the intercept)^5

N is the total sample size.

For any given number of selected variables, larger C(P) values indicate equations with larger error mean squares. For any subset model for which C(P) > (p+1), there is evidence of bias due to the deletion of important variables from the model. On the other hand, if there are values of C(P) < (p+1), the full model is said to be overspecified; that is, it contains too many variables.
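As a quick check of the formula as written above, consider the full seven-variable model on the last line of Output 4.8. There SSE(p) is the error sum of squares of the full model itself, so SSE(p)/MSE is simply the full model's error degrees of freedom, N - 8 (seven variables plus the intercept), and C(P) = (N - 8) - (N - 2(7)) + 2 = 8, which equals (p+1) for p = 7 and matches the printed value 8.0000. The full model therefore always sits exactly on the C(P) = (p+1) reference line.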

Mallows recommends that C(P) be plotted against p, and further recommends selecting that subset size where the minimum C(P) first approaches (p+1), starting from the full model. The magnitudes of differences in the C(P) statistic between the optimum and near-optimum models for each subset size are also of interest.

^5 In the original presentation of the C(P) statistic (Mallows 1973), the intercept coefficient is also considered as a candidate for selection, so that in that presentation the number of variables in the model is one more than what is defined here and results in the +1 elements in the equations. As implied in the discussion of the COLLIN option, allowing the deletion of the intercept is not normally useful.

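The statements that created the output data set discussed next appear on a page not included in this excerpt. A minimal sketch of one way to produce such a data set (the OUTEST= data set name and the BEST= value here are illustrative assumptions, not taken from the original) is:

proc reg data=boq outest=rsqstats;
   /* OUTEST= writes one observation per model kept by the selection;
      with a selection method and the CP option, the selection statistics
      shown in Output 4.10 (IN, P, EDF, RSQ, and CP) are added to the
      data set. BEST=3 is used here only for illustration. */
   model manh = occup checkin hours common wings cap rooms /
         selection=rsquare best=3 cp;
run;

proc print data=rsqstats;
run;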

You can examine the resulting data set using PROC PRINT. Output 4.10 shows the results.

Output 4.10 Output Data Set from R-Square Selection

[Output 4.10 is not legible in this copy. The printed data set contains 19 observations, one for each model retained by the R-Square selection, with the model name, _TYPE_ (PARMS), the dependent variable (MANH), the root mean squared error, the estimated coefficients identified by the names of the variables in each model (Intercept, OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP, ROOMS), and the selection statistics IN, P, EDF, RSQ, and CP.]

As you can see, this data set contains a number of statistics. The estimated coefficients are identified by the names of the independent variables. For the C(p) plot, plot the variable CP against IN for the C(P) values and P against IN for the reference line:

proc gplot;
   symbol1 v='C' c=black;
   symbol2 v=star c=black l=1 i=join;
   plot cp*in=1 p*in=2 / overlay vaxis=0 to 15 by 5;

run;

The plot, which is not reproduced here, will look like the one in Output 4.9 except for the line joining the reference points.


The pattern of C(P) values for this example is quite typical for situations where multicollinearity is serious. Starting with IN=7, they initially become smaller than (p+1), as fewer variables are included, but eventually start to increase. In other words, the residual mean square initially decreases as variables are deleted. In this plot, there is a definite "corner" at IN=3, where the C(P) values increase rapidly with smaller subset sizes. Hence, a model with three variables appears to be a good choice.

A second criterion is the difference in C(P) between the optimum and second optimum subset for each subset size. In the above models with four or more variables, these differences are very small, which implies that the multicollinearity allows interchange of variables without affecting the fit of the model. However, that difference is larger with three variables, implying that the degree of multicollinearity has decreased. You can now estimate the best three-variable model using the information from Output 4.8 as follows:

proc reg data=boq;
   model manh = occup checkin cap / vif;

run;

Results of the regression appear in Output 4.11.

Output 4.11 Regression for BOQ Data, Best Three-Variable Model

                            The REG Procedure
                              Model: MODEL1
                        Dependent Variable: MANH

                          Analysis of Variance

                                  Sum of         Mean
Source                  DF       Squares       Square    F Value    Pr > F
Model                    3      87359949     29119983     406.22    <.0001
Error                   20       1433710        71686
Corrected Total         23      88793659

Root MSE            267.74149    R-Square    0.9839
Dependent Mean     2050.00708    Adj R-Sq    0.9814
Coeff Var

                          Parameter Estimates

                  Parameter     Standard                          Variance
Variable    DF     Estimate        Error    t Value   Pr > |t|   Inflation
Intercept    1    207.86486     78.28539       2.66     0.0152           0
OCCUP        1     20.67163      1.75123      11.80     <.0001     8.21908
CHECKIN      1                   1.29366       4.89     <.0001     4.04615
CAP          1     -3.45397      1.14110      -3.03     0.0067     7.63825

(Entries left blank above are not legible in this copy.)

Note: Since this model has been specified by the data, the p values cannot be used literally but are useful for determining the relative importance of the variables.

It does appear that more turnovers (CHECKIN) require additional manpower. The negative partial coefficient for CAP may reflect lower man-hours for a larger proportion of vacant rooms.

A number of other statistics are available to assist in choosing subset models. Some are relatively obvious, such as the residual mean square or standard deviations, while others are related to R-Square, with some providing adjustments for degrees of freedom or scaling preferences. They are all essentially equivalent, although some have different theoretical justification. A number of these are available as options with the selection methods described in this section. Keywords and literature references for these options are provided in the PROC REG chapter in the SAS/STAT User's Guide.
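For example, several of these statistics can be requested directly as MODEL statement options alongside the RSQUARE selection. The keywords chosen here (ADJRSQ, AIC, SBC) are only a few common ones, shown as an illustration:

proc reg data=boq;
   /* print adjusted R-square, AIC, and SBC for each subset
      in addition to C(p) */
   model manh = occup checkin hours common wings cap rooms /
         selection=rsquare best=4 cp adjrsq aic sbc;
run;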


Backward elimination (BACKWARD or B) begins by computing the regression with all independent variables specified in the MODEL statement. The procedure deletes from that model the variable whose coefficient has the largest p value (smallest partial F value). The resulting equation is examined for the variable now contributing the least, which is then deleted, and so on. The procedure stops when all coefficients remaining in the model are statistically significant at a level specified by the user (see SLS specification later in this section). With this method, once a variable has been deleted, it is deleted permanently.

Stepwise selection (STEPWISE) begins like forward selection, but after a variable has been added to the model, the resulting equation is examined to see if any coefficient has a sufficiently large p value (see SLS specification later in this section) to suggest that a variable should be dropped. This procedure continues until no additions or deletions are indicated according to significance levels (SLE and SLS) chosen by the user.

Maximum R-Square improvement (MAXR) begins by selecting one- and two-variable models as in forward selection. At that point, the procedure examines all possible pairwise interchanges with the variables not in the model. Among all interchanges that increase R^2, that interchange resulting in the largest increase in R^2 is implemented. This process is repeated until no pairwise interchange improves the model. At this point, a third variable is selected as in forward selection. Then the interchanging process is repeated, and so on. This method usually requires more computer time than the other three, but it also tends to have a better chance of finding more nearly optimum models. In addition, the maximum R-Square improvement method often produces a larger number of equations, a feature that may be of some value in the final evaluation process.

Minimum R-Square improvement (MINR) is similar to maximum R-Square improvement, except that interchanges are implemented for those with minimum improvement. Since interchanges are not implemented when R^2 is decreased, the final results are quite similar to those of the maximum R-Square improvement method, except that a larger number of equations may be examined.

A number of options are available to provide greater control over the selection procedures. One option available for all procedures is INCLUDE=n, which specifies that the first n independent variables in the MODEL statement are to be kept in the model at all times.

For the FORWARD, BACKWARD, and STEPWISE methods, you may specify desired significance levels for stopping the addition or elimination of variables as follows:

SLE = .xxx specifies the significance level for stopping the addition of variables in the forward selection mode. If not specified, the default is 0.50 for FORWARD and 0.15 for STEPWISE.

SLS = .xxx specifies the significance level for stopping the backward elimination mode. If not specified, the default is 0.10 for BACKWARD and 0.15 for STEPWISE.
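For example, a backward elimination and a stepwise run for the BOQ data might be specified as follows (a sketch; the significance levels shown are illustrative):

proc reg data=boq;
   /* backward elimination: variables are removed until every remaining
      coefficient is significant at the SLS level */
   model manh = occup checkin hours common wings cap rooms /
         selection=backward sls=0.10;

   /* stepwise selection: SLE controls entry and SLS controls removal */
   model manh = occup checkin hours common wings cap rooms /
         selection=stepwise sle=0.15 sls=0.15;
run;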


The smallest permissible value for SLS is 0.0001, which almost always ensures that the final equation obtained by backward elimination contains only one variable, while the maximum SLE of 0.99 usually includes all variables when forward selection has stopped. It is again important to note that, since variable selection is an exploratory rather than confirmatory analysis, the SLE and SLS values do not have the usual interpretation as probabilities of erroneously rejecting the null hypothesis of the nonexistence of the coefficients in any selected model.

MAXR and MINR selection do not use significance levels, but you may specify the starting and stopping of these procedures with these options:

START=s specifies that the interchanging procedure starts with the first s variables in the MODEL statement.

STOP=s specifies that the procedure stops when the best s-variable model has been found.

For MAXR and MINR, the INCLUDE option works as it does with RSQUARE. That is, if the INCLUDE option is used in addition to the START or STOP option, the INCLUDE option overrides the START or STOP specification. For example, if you use START=3 and INCLUDE=2, the selection starts with the first three variables, but the first two may not be candidates for deletion.
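For example, the situation just described could be written as follows (a sketch using the BOQ variables for illustration):

proc reg data=boq;
   /* MAXR search starting from a three-variable model; occup and checkin
      (the first two variables listed) are kept in every model and are
      never candidates for deletion */
   model manh = occup checkin hours common wings cap rooms /
         selection=maxr start=3 include=2;
run;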

Because the outputs from the step-type selection procedures are quite lengthy, they are illustrated here only with the SELECTION=FORWARD option using the redefined variables of the BOQ data. Use these SAS statements:

proc reg data=boq;
   model relman = relocc relcheck relcom relwings relcap hours rooms /
         selection=f;
run;

The results appear in Output 4.12.

Output 4.12 Forward Selection for Redefined Variables

[Output 4.12 is only partially legible in this copy. It is titled "BOQ DATA, MYERS P 145" with a note that observation 23 is omitted, and shows Step 1 of the forward selection for dependent variable RELMAN: the variable RELCAP is entered with R-Square = 0.1849, followed by the analysis-of-variance table (1 model, 22 error, and 23 corrected total degrees of freedom), the parameter estimates for the intercept and RELCAP with their standard errors, Type II sums of squares, and F tests, and the bounds on the condition number (1, 1). The remaining steps of the output are not reproduced in this excerpt.]