




Material Type: Exam; Class: Analysis of Experiments; Subject: Statistics; University: University of Connecticut; Term: Unknown 1989;
Output 4.8 R-Square Selection for the BOQ Data

The REG Procedure
Model: MODEL1
R-Square Selection Method

Models for Dependent Variable: MANH

(The listing shows, for each subset model, the number of variables in the model, R-Square, C(p), and the variables in the model. Among the one-variable models, OCCUP has C(p) = 21.7606, followed by ROOMS (101.8743), CHECKIN (182.8633), and CAP (207.9875). The best subsets of each larger size all contain OCCUP and CHECKIN, building up through OCCUP CHECKIN CAP and OCCUP CHECKIN COMMON CAP to the full seven-variable model OCCUP CHECKIN HOURS COMMON WINGS CAP ROOMS. The R-Square values are not reproduced here.)
110  4.4 Variable Selection □ Chapter 4
Some additional options that are useful for problems with many variables are as follows:
INCLUDE=n specifies that the first n independent variables in the MODEL statement are to be included in all subset models.
START=n specifies that only subsets of n or more variables are to be considered. This number includes the variables specified by the INCLUDE option.
STOP=n specifies that subsets of no more than n variables are to be considered. This number also includes the variables specified by the INCLUDE option.
B specifies that the regression coefficients are printed for each subset model. This option should be used sparingly as it will produce a large amount of output. A more efficient way of getting coefficient estimates is to use information from Output 4.8 to choose interesting models and obtain the coefficient estimates by repeated MODEL statements or, interactively, with ADD and DELETE statements.
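To see what the RSQUARE method is doing, the all-subsets search can be sketched outside SAS. The following Python fragment is an illustration only; the data, the seed, and the helper r_square are our own inventions, not part of PROC REG. It enumerates every subset of the candidate variables, computes R-Square for each, and reports the best subset of each size, just as the RSQUARE listing does.

```python
# Illustrative sketch (not SAS): the brute-force all-subsets search
# performed by the RSQUARE selection method. Data and column indices
# are synthetic stand-ins, not the BOQ data.
from itertools import combinations
import numpy as np

def r_square(X, y, cols):
    """R-Square of y regressed on the listed columns of X, intercept included."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
# y depends on columns 0 and 2 only; the other columns are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=40)

# Best subset of each size, ranked by R-Square, as RSQUARE would list them.
for size in range(1, 5):
    best = max(combinations(range(4), size), key=lambda c: r_square(X, y, c))
    print(size, best, round(r_square(X, y, best), 4))
```

With only a handful of candidate variables the enumeration is trivial; the point of the options above (INCLUDE=, START=, STOP=) is to keep this search manageable when the number of candidates is large.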
An examination of the R-Square values in Output 4.8 does not reveal any obvious choices for selecting a most useful subset model. A number of other statistics have been developed to aid in making these choices and 12 of these are available as additional options with the RSQUARE option. Among these, the most frequently used is the C(P) statistic, proposed by Mallows (1973). This statistic is a measure of total squared error for a subset model containing p independent variables. The total squared error is a measure of the error variance plus the bias introduced by not including variables in a model. It may, therefore, indicate when variable selection is deleting too many variables. The C(P) statistic is computed as follows:
C(p) = (SSE(p) / MSE) - (N - 2p) + 2
where
MSE is the error mean square for the full model (or some other estimate of pure error)
SSE(p) is the error sum of squares for the subset model containing p independent variables (not including the intercept)^5
N is the total sample size.
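The computation can be checked with a small sketch. The following Python function is our own (not a SAS feature); it simply encodes the formula above, and the numbers in the check are hypothetical.

```python
# Mallows' C(p) as defined in the text, with p the number of independent
# variables in the subset model (intercept excluded).
def mallows_cp(sse_p, mse_full, n, p):
    """C(p) = SSE(p)/MSE - (N - 2p) + 2."""
    return sse_p / mse_full - (n - 2 * p) + 2

# Consistency check with hypothetical numbers: for the full model with
# k variables, SSE(k)/MSE equals the error df, n - k - 1, so C(p) must
# come out to exactly k + 1, i.e., p + 1.
n, k, mse = 25, 7, 100.0
sse_full = (n - k - 1) * mse        # 1700.0 by construction
print(mallows_cp(sse_full, mse, n, k))  # 8.0, i.e., p + 1 for p = 7
```

This check also explains the (p+1) reference line used in the discussion that follows: the full model always sits exactly on it.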
For any given number of selected variables, larger C(p) values indicate equations with larger error mean squares. For any subset model for which C(p) > (p+1), there is evidence of bias due to the deletion of important variables from the model. On the other hand, if there are values of C(p) < (p+1), the full model is said to be overspecified; that is, it contains too many variables.
Mallows recommends that C(p) be plotted against p, and further recommends selecting that subset size where the minimum C(p) first approaches (p+1), starting from the full model. The magnitudes of differences in the C(p) statistic between the optimum and near-optimum models for each subset size are also of interest.
5 In the original presentation of the C(p) statistic (Mallows 1973), the intercept coefficient is also considered as a candidate for selection, so that in that presentation the number of variables in the model is one more than what is defined here and results in the +1 elements in the equations. As implied in the discussion of the COLLIN option, allowing the deletion of the intercept is not normally useful.
You can examine the resulting data set using PROC PRINT. Output 4.10 shows the results.
Output 4.10 Output Data Set from R-Square Selection
(Output 4.10, not fully reproduced here, lists the 19 observations of the output data set, one per subset model. Each observation contains the model label (_MODEL_), _TYPE_ (PARMS), the dependent variable name (_DEPVAR_ = MANH), _RMSE_, the Intercept, the estimated coefficients stored under the names of the independent variables (OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP, ROOMS), and the selection statistics _IN_ (the number of variables in the model, running from 1 to 7), _P_ (equal to _IN_ + 1), _EDF_ (the error degrees of freedom, 23 down to 17), _RSQ_, and _CP_.)
As you can see, this data set contains a number of statistics. The estimated coefficients are identified by the names of the independent variables. For the C(p) plot, plot the variable CP against IN for the C(p) values and P against IN for the reference line:

proc gplot;
   symbol1 v='C' c=black;
   symbol2 v=star c=black l=1 i=join;
   plot cp*in=1 p*in=2 / overlay vaxis=0 to 15 by 5;
run;
The plot, which is not reproduced here, will look like the one in Output 4.9 except for the line joining the reference points.
Multicollinearity: Detection and Remedial Measures □ 4.4 Variable Selection  113
The pattern of C(P) values for this example is quite typical for situations where multicollinearity is serious. Starting with IN=7, they initially become smaller than (p+1), as fewer variables are included, but eventually start to increase. In other words, the residual mean square initially decreases as variables are deleted. In this plot, there is a definite "corner" at IN=3, where the C(P) values increase rapidly with smaller subset sizes. Hence, a model with three variables appears to be a good choice.
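The corner rule just described can be stated mechanically. The sketch below is a Python illustration with a hypothetical C(p) pattern shaped like the one in this example; the function and its inputs are our own, not SAS output.

```python
def pick_subset_size(cp_by_size):
    """Smallest subset size whose minimum C(p) is still at or below p + 1.
    Starting from the full model and deleting variables, stop just before
    C(p) jumps above the p + 1 reference line (the 'corner')."""
    sizes = sorted(cp_by_size, reverse=True)      # full model first
    choice = sizes[0]
    for p in sizes:
        if cp_by_size[p] <= p + 1:
            choice = p                            # still essentially unbiased
        else:
            break                                 # corner reached: bias evident
    return choice

# Hypothetical minimum C(p) per subset size, patterned after the BOQ
# example: flat near p + 1 down to three variables, then a sharp rise.
cp = {7: 8.0, 6: 6.3, 5: 5.2, 4: 4.9, 3: 3.9, 2: 40.0, 1: 150.0}
print(pick_subset_size(cp))  # 3
```

This is only a codified version of the eyeball test; in practice you would still inspect the plot, since a gradual drift above p + 1 is less decisive than a sharp corner.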
A second criterion is the difference in C(p) between the optimum and second optimum subset for each subset size. In the above models with four or more variables, these differences are very small, which implies that the multicollinearity allows interchange of variables without affecting the fit of the model. However, that difference is larger with three variables, implying that the degree of multicollinearity has decreased. You can now estimate the best three-variable model using the information from Output 4.8 as follows:

proc reg data=boq;
   model manh = occup checkin cap / vif;
run;
Results of the regression appear in Output 4.11.
Output 4.11 Regression for BOQ Data, Best Three-Variable Model
The REG Procedure
Model: MODEL1
Dependent Variable: MANH

                        Analysis of Variance
                               Sum of          Mean
Source             DF         Squares        Square    F Value    Pr > F
Model               3        87359949      29119983     406.22    <.0001
Error              20         1433710         71686
Corrected Total    23        88793659

Root MSE          267.74149    R-Square    0.9839
Dependent Mean   2050.00708    Adj R-Sq    0.9814

                        Parameter Estimates
                  Parameter     Standard                            Variance
Variable    DF     Estimate        Error    t Value    Pr > |t|    Inflation
Intercept    1    207.86486     78.28539       2.66      0.0152            0
OCCUP        1     20.67163      1.75123      11.80      <.0001      8.21908
CHECKIN      1
CAP          1     -3.45397

(The remaining entries in the listing are not reproduced here.)
Note: Since this model has been specified by the data, the p values cannot be used literally but are useful for determining the relative importance of the variables.
It does appear that more turnovers (CHECKIN) require additional manpower. The negative partial coefficient for CAP may reflect lower man-hours for a larger proportion of vacant rooms.
A number of other statistics are available to assist in choosing subset models. Some are relatively obvious, such as the residual mean square or standard deviations, while others are related to R-Square, with some providing adjustments for degrees of freedom or scaling preferences. They are all essentially equivalent, although some have different theoretical justification. A number of these are available as options with the selection methods described in this section. Keywords and literature references for these options are provided in the PROC REG chapter in the SAS/STAT User's Guide.
Backward elimination (BACKWARD or B) begins by computing the regression with all independent variables specified in the MODEL statement. The procedure deletes from that model the variable whose coefficient has the largest p value (smallest partial F value). The resulting equation is examined for the variable now contributing the least, which is then deleted, and so on. The procedure stops when all coefficients remaining in the model are statistically significant at a level specified by the user (see SLS specification later in this section). With this method, once a variable has been deleted, it is deleted permanently.
Stepwise selection (STEPWISE) begins like forward selection, but after a variable has been added to the model, the resulting equation is examined to see if any coefficient has a sufficiently large p value (see SLS specification later in this section) to suggest that a variable should be dropped. This procedure continues until no additions or deletions are indicated according to significance levels (SLE and SLS) chosen by the user.
Maximum R-Square improvement (MAXR) begins by selecting one- and two-variable models as in forward selection. At that point, the procedure examines all possible pairwise interchanges with the variables not in the model. Among all interchanges that increase R^2 , that interchange resulting in the largest increase in R^2 is implemented. This process is repeated until no pairwise interchange improves the model. At this point, a third variable is selected as in forward selection. Then the interchanging process is repeated, and so on. This method usually requires more computer time than the other three, but it also tends to have a better chance of finding more nearly optimum models. In addition, the maximum R-Square improvement method often produces a larger number of equations, a feature that may be of some value in the final evaluation process.
Minimum R-Square improvement (MINR) is similar to maximum R-Square improvement, except that interchanges are implemented for those with minimum improvement. Since interchanges are not implemented when R^2 is decreased, the final results are quite similar to those of the maximum R-Square improvement method, except that a larger number of equations may be examined.
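The step logic of forward selection can be sketched in a few lines. The Python below is an illustration, not the PROC REG implementation: it uses a fixed F-to-enter cutoff in place of the SLE significance level, and the data and helper names are our own.

```python
# Illustrative sketch of forward selection (not PROC REG): at each step,
# add the candidate variable with the largest partial F; stop when no
# candidate exceeds the F-to-enter cutoff, which stands in here for the
# SLE significance level used by PROC REG.
import numpy as np

def r_square(X, y, cols):
    """R-Square of y on the listed columns of X, intercept included."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, f_in=4.0):
    n, k = X.shape
    chosen = []
    while len(chosen) < k:
        r2_old = r_square(X, y, chosen) if chosen else 0.0
        df_err = n - len(chosen) - 2          # error df after adding one variable
        best_j, best_f = None, f_in
        for j in [j for j in range(k) if j not in chosen]:
            r2_new = r_square(X, y, chosen + [j])
            f = (r2_new - r2_old) * df_err / (1.0 - r2_new)  # partial F, 1 num. df
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:                    # no candidate reaches F-to-enter
            break
        chosen.append(best_j)
    return chosen

# Synthetic stand-in data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=60)
print(forward_select(X, y))
```

Backward elimination and stepwise selection are variations on the same loop: backward starts from the full model and deletes the smallest partial F, and stepwise rechecks the variables already in the model after each addition.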
A number of options are available to provide greater control over the selection procedures. One option available for all procedures is INCLUDE=n, which specifies that the first n independent variables in the MODEL statement are to be kept in the model at all times.
For the FORWARD, BACKWARD, and STEPWISE methods, you may specify desired significance levels for stopping the addition or elimination of variables as follows:
SLE = .xxx specifies the significance level for stopping the addition of variables in the forward selection mode. If not specified, the default is 0.50 for FORWARD and 0.15 for STEPWISE.
SLS = .xxx specifies the significance level for stopping the backward elimination mode. If not specified, the default is 0.10 for BACKWARD and 0.15 for STEPWISE.
The smallest permissible value for SLS is 0.0001, which almost always ensures that the final equation obtained by backward elimination contains only one variable, while the maximum SLE of 0.99 usually includes all variables when forward selection has stopped. It is again important to note that, since variable selection is an exploratory rather than confirmatory analysis, the SLE and SLS values do not have the usual interpretation as probabilities of erroneously rejecting the null hypothesis of the nonexistence of the coefficients in any selected model.
MAXR and MINR selection do not use significance levels, but you may specify the starting and stopping of these procedures with these options:
START=s specifies that the interchanging procedure starts with the first s variables in the MODEL statement.
STOP=s specifies that the procedure stops when the best s-variable model has been found.
For MAXR and MINR, the INCLUDE option works as it does with RSQUARE. That is, if the INCLUDE option is used in addition to the START or STOP option, the INCLUDE option overrides the START or STOP specification. For example, if you use START=3 and INCLUDE=2, the selection starts with the first three variables, but the first two may not be candidates for deletion.
Because the outputs from the step-type selection procedures are quite lengthy, they are illustrated here only with the SELECTION=FORWARD option using the redefined variables of the BOQ data. Use these SAS statements:
proc reg data=boq;
   model relman = relocc relcheck relcom relwings relcap hours rooms
         / selection=forward;
run;
The results appear in Output 4.12.
Output 4.12 Forward Selection for Redefined Variables
BOQ DATA, MYERS P 145
OMIT OBS 23

The REG Procedure
Model: MODEL1
Dependent Variable: RELMAN

Forward Selection: Step 1

Variable RELCAP Entered: R-Square = 0.1849 and C(p) = 3.

(The step 1 listing continues with the analysis of variance for this one-variable model, with 1 model, 22 error, and 23 corrected total degrees of freedom, followed by the parameter estimates with their standard errors, Type II sums of squares, and F tests: the Intercept has F = 15.16, and RELCAP, with an estimated coefficient of 329.8, has F = 4.99. Bounds on condition number: 1, 1.)

Continued