
Degrees of Freedom and Model Search

Ryan J. Tibshirani

Abstract

Degrees of freedom is a fundamental concept in statistical modeling, as it provides a quantitative description of the amount of fitting performed by a given procedure. But, despite this fundamental role in statistics, its behavior is not completely well-understood, even in somewhat basic settings. For example, it may seem intuitively obvious that the best subset selection fit with subset size k has degrees of freedom larger than k, but this has not been formally verified, nor has it been precisely studied. At large, the current paper is motivated by this problem, and we derive an exact expression for the degrees of freedom of best subset selection in a restricted setting (orthogonal predictor variables). Along the way, we develop a concept that we name "search degrees of freedom"; intuitively, for adaptive regression procedures that perform variable selection, this is a part of the (total) degrees of freedom that we attribute entirely to the model selection mechanism. Finally, we establish a modest extension of Stein's formula to cover discontinuous functions, and discuss its potential role in degrees of freedom and search degrees of freedom calculations.

Keywords: degrees of freedom, model search, lasso, best subset selection, Stein's formula

1 Introduction

Suppose that we are given observations y ∈ R^n from the model

y = μ + ε,  with E(ε) = 0, Cov(ε) = σ^2 I,   (1)

where μ ∈ R^n is some fixed, true mean parameter of interest, and ε ∈ R^n are uncorrelated errors, with zero mean and common marginal variance σ^2 > 0. For a function f : R^n → R^n, thought of as a procedure for producing fitted values, μ̂ = f(y), recall that the degrees of freedom of f is defined as (Efron 1986, Hastie & Tibshirani 1990):

df(f) = (1/σ^2) ∑_{i=1}^{n} Cov(f_i(y), y_i).   (2)

Intuitively, the quantity df(f) reflects the effective number of parameters used by f in producing the fitted output μ̂. Consider linear regression, for example, where f(y) is the least squares fit of y onto predictor variables x_1, ..., x_p ∈ R^n: for this procedure f, our intuition gives the right answer, as its degrees of freedom is simply p, the number of estimated regression coefficients.¹ This, e.g., leads to an unbiased estimate of the risk of the linear regression fit, via Mallows's C_p criterion (Mallows 1973).

In general, characterizations of degrees of freedom are highly relevant for purposes like model comparisons and model selection; see, e.g., Efron (1986), Hastie & Tibshirani (1990), Tibshirani & Taylor (2012), and Section 1.2, for more motivation. Unfortunately, however, counting degrees of freedom can become quite complicated for nonlinear, adaptive procedures. (By nonlinear, we mean f being nonlinear as a function of y.) Even for many basic adaptive procedures, explicit answers are not known.

¹This is assuming linear independence of x_1, ..., x_p; in general, it is the dimension of span{x_1, ..., x_p}.
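As an illustrative aside, definition (2) is easy to check by simulation for the least squares fit. The following is a minimal Monte Carlo sketch (in Python, with NumPy); the dimensions, random design, and repetition count are illustrative choices only.

```python
# Monte Carlo check of definition (2): df of least squares equals p.
# A minimal sketch; n, p, sigma, and the random design X are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, reps = 50, 5, 1.0, 20000

X = rng.normal(size=(n, p))                 # fixed design
mu = X @ rng.normal(size=p)                 # fixed true mean
H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix of least squares

fits, ys = np.empty((reps, n)), np.empty((reps, n))
for r in range(reps):
    y = mu + sigma * rng.normal(size=n)     # draw from model (1)
    ys[r], fits[r] = y, H @ y

# df(f) = (1/sigma^2) * sum_i Cov(f_i(y), y_i), estimated by sample covariances
cov_sum = sum(np.cov(fits[:, i], ys[:, i])[0, 1] for i in range(n))
print("Monte Carlo df:", cov_sum / sigma**2, "  exact:", p)   # should be close to p
```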

A good example is best subset selection, in which, for a fixed integer k, we regress on the subset of x_1, ..., x_p of size at most k giving the best linear fit of y (as measured by the residual sum of squares). Is the degrees of freedom here larger than k? It seems that the answer should be "yes", because even though there are k coefficients in the final linear model, the variables in this model were chosen adaptively (based on the data). And if the answer is indeed "yes", then the natural follow-up question is: how much larger is it? That is, how many effective parameters does it "cost" to search through the space of candidate models? The goal of this paper is to investigate these questions, and related ones.

1.1 A motivating example

We begin by raising an interesting point: though it seems certain that a procedure like best subset selection would suffer an inflation of degrees of freedom, not all adaptive regression procedures do. In particular, the lasso (Tibshirani 1996, Chen et al. 1998), which also performs variable selection in the linear model setting, presents a very different story in terms of its degrees of freedom. Stacking the predictor variables x_1, ..., x_p along the columns of a matrix X ∈ R^{n×p}, the lasso estimate can be expressed as:

β̂^lasso = argmin_{β ∈ R^p} ‖y − Xβ‖_2^2 + λ‖β‖_1,   (3)

where λ ≥ 0 is a tuning parameter, controlling the level of sparsity. Though not strictly necessary for our discussion, we assume for simplicity that X has columns in general position, which ensures uniqueness of the lasso solution β̂^lasso (see, e.g., Tibshirani (2013)). We will write A^lasso ⊆ {1, ..., p} to denote the indices of nonzero coefficients in β̂^lasso, called the support or active set of β̂^lasso, also expressed as A^lasso = supp(β̂^lasso). The lasso admits a simple formula for its degrees of freedom.

Theorem 1 (Zou et al. 2007, Tibshirani & Taylor 2012). Provided that the variables (columns) in X are in general position, the lasso fit μ̂^lasso = Xβ̂^lasso has degrees of freedom

df(μ̂^lasso) = E|A^lasso|,

where |A^lasso| is the size of the lasso active set A^lasso = supp(β̂^lasso). The above expectation assumes that X and λ are fixed, and is taken over the sampling distribution y ∼ N(μ, σ^2 I).

In other words, the degrees of freedom of the lasso fit is the number of selected variables, in expectation. This is somewhat remarkable because, as with subset selection, the lasso uses the data to choose which variables to put in the model. So how can its degrees of freedom be equal to the (average) number of selected variables, and not more? The key realization is that the lasso shrinks the coefficients of these variables towards zero, instead of performing a full least squares fit. This shrinkage is due to the ℓ_1 penalty that appears in (3). Amazingly, the "surplus" from adaptively building the model is exactly accounted for by the "deficit" from shrinking the coefficients, so that altogether (in expectation), the degrees of freedom is simply the number of variables in the model.
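As a concrete check of Theorem 1, consider the orthogonal case X = I, where the lasso fit reduces to soft thresholding of y at a level determined by λ. The sketch below estimates df via the covariance definition (2) and compares it to the average active set size; the threshold level, mean vector, and simulation sizes are illustrative assumptions.

```python
# Soft thresholding = lasso fit when X = I (the threshold t corresponds to the
# tuning parameter lambda). A sketch checking Theorem 1: df should equal E|A|.
# The values of n, mu, sigma, and t below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, sigma, t, reps = 100, 1.0, 1.5, 20000
mu = np.concatenate([np.full(10, 3.0), np.zeros(n - 10)])   # a sparse true mean

def soft(y, t):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

ys = mu + sigma * rng.normal(size=(reps, n))
fits = soft(ys, t)

cov_sum = sum(np.cov(fits[:, i], ys[:, i])[0, 1] for i in range(n))
df_mc = cov_sum / sigma**2
E_active = np.mean(np.sum(np.abs(ys) > t, axis=1))
print("Monte Carlo df:", df_mc, "  E|A|:", E_active)        # the two should agree
```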

Remark 1. An analogous result holds for an entirely arbitrary predictor matrix X (not necessarily having columns in general position), see Tibshirani & Taylor (2012); analogous results also exist for the generalized lasso problem (special cases of which are the fused lasso and trend filtering), see Tibshirani & Taylor (2011, 2012).

Figure 1 shows an empirical comparison between the degrees of freedom of the lasso and best subset selection fits, for a simple example with n = 20, p = 10. The predictor variables were set up to have a block correlation structure, in that variables 1 through 4 had high pairwise correlation (between 0.6 and 0.9), variables 5 through 10 also had high pairwise correlation (between 0.6 and 0.9), and the two blocks were uncorrelated with each other. The outcome y was drawn by adding

abbreviate A = A^subset, then it is not hard to see that

β̂^subset_A = (X_A^T X_A)^{-1} X_A^T y,

i.e., the active coefficients are given by least squares on the active variables X_A (the submatrix of X formed by taking the columns in A). Therefore, like the lasso, best subset selection chooses an active set of variables adaptively, but unlike the lasso, it fits their coefficients without shrinkage, using ordinary least squares. It pays for the "surplus" of covariance from the adaptive model search, as well as the usual amount from least squares estimation, resulting in a total degrees of freedom much larger than |A| (or rather, E|A|).

A clarifying note: simulations along the lines of that in Figure 1 can be found throughout the literature, and we do not mean to claim originality here (e.g., see Figure 4 of Tibshirani & Knight (1999) for an early example, and Figure 2 of Janson et al. (2013) for a recent example). This simulation is instead simply meant to motivate the work that follows, as an aim of this paper is to examine the observed phenomenon in Figure 1 more formally.

1.2 Degrees of freedom and optimism

Degrees of freedom is closely connected to the concept of optimism, and so alternatively, we could have motivated the study of the covariance term on the right-hand side in (2) from the perspective of the optimism, rather than the complexity, of a fitting procedure. Assuming only that y is drawn from the model in (1), and that y′ is an independent copy of y (i.e., an independent draw from (1)), it is straightforward to show that for any fitting procedure f,

E‖y′ − f(y)‖_2^2 − E‖y − f(y)‖_2^2 = 2σ^2 · df(f).   (5)

The quantity on the left-hand side above is called the optimism of f, i.e., the difference in the mean squared test error and mean squared training error. The identity in (5) shows that (for uncorrelated, homoskedastic regression errors as in (1)) the optimism of f is just a positive constant times its degrees of freedom; in other words, fitting procedures with a higher degrees of freedom will have a higher optimism. Hence, from the example in the last section, we know that when they are tuned to have the same (expected) number of variables in the fitted model, best subset selection will produce a training error that is generally far more optimistic than that produced by the lasso.
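The identity (5) is also easy to verify numerically. The sketch below computes the optimism of least squares from an independent copy y′ and compares it to 2σ^2·p; the dimensions and random design are, again, illustrative choices only.

```python
# A sketch checking the optimism identity (5) for least squares: the gap between
# mean squared test error (on an independent copy y') and training error should
# be 2*sigma^2*df = 2*sigma^2*p. Dimensions and design are illustrative.
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma, reps = 40, 4, 1.0, 50000
X = rng.normal(size=(n, p))
mu = X @ rng.normal(size=p)
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix

gap = 0.0
for _ in range(reps):
    y  = mu + sigma * rng.normal(size=n)     # training response
    y2 = mu + sigma * rng.normal(size=n)     # independent copy y'
    fit = H @ y
    gap += np.sum((y2 - fit) ** 2) - np.sum((y - fit) ** 2)

print("optimism:", gap / reps, "  2*sigma^2*p:", 2 * sigma**2 * p)
```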

1.3 Lagrange versus constrained problem forms

Recall that we defined the subset selection estimator using the Lagrange form optimization problem (4), instead of the (perhaps more typical) constrained form definition

β̂^subset ∈ argmin_{β ∈ R^p} ‖y − Xβ‖_2^2  subject to  ‖β‖_0 ≤ k.   (6)

There are several points now worth making. First, these are nonconvex optimization problems, and so the Lagrange and constrained forms (4) and (6) of subset selection are generally not equivalent. In fact, for all λ, solutions of (4) are solutions of (6) for some choice of k, but the reverse is not true. Second, even in situations in which the Lagrange and constrained forms of a particular optimization problem are equivalent (e.g., this is true under strong duality, and so it is true for most convex problems, under very weak conditions), there is a difference between studying the degrees of freedom of an estimator defined in one problem form versus the other. This is because the map from the Lagrange parameter in one form to the constraint bound in the other generically depends on y, i.e., it is a random mapping (Kaufman & Rosset (2013) discuss this for ridge regression and the lasso). Lastly, in this paper, we focus on the Lagrange form (4) of subset selection because we find this problem is easier to analyze mathematically. For example, in Lagrange form with X = I, the ith component of the subset selection fit β̂^subset_i depends on y_i only (and is given by hard thresholding), for each i = 1, ..., n; in constrained form with X = I, each β̂^subset_i is a function of the order statistics of |y_1|, ..., |y_n|, and hence depends on the whole sample.

Given the general spirit of our paper, it is important to recall the relevant work of Ye (1998), who studied degrees of freedom for special cases of best subset selection in constrained form. In one such special case (orthogonal predictors with null underlying signal), the author derived a simple expression for degrees of freedom as the sum of the k largest order statistics from a sample of n independent χ²_1 random variables. This indeed establishes that, in this particular special case, the constrained form of best subset selection with k active variables has degrees of freedom larger than k. It does not, however, imply any results about the Lagrange case, for the reasons explained above.
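To make the structural contrast between the two problem forms concrete when X = I, the sketch below implements both fits directly: the Lagrange-form fit is component-wise hard thresholding (each fitted value depends only on its own y_i), while the constrained-form fit keeps the k largest |y_i| and therefore depends on the order statistics of the whole sample. The threshold level and k below are arbitrary illustrative choices.

```python
# With X = I, the Lagrange-form subset selection fit is component-wise hard
# thresholding, while the constrained form keeps the k largest |y_i| and so
# depends on the whole sample. A small illustrative sketch.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=8)

def lagrange_fit(y, t):
    # hard thresholding at a level t (t is determined by lambda in (4))
    return y * (np.abs(y) >= t)

def constrained_fit(y, k):
    # keep the k components largest in absolute value, zero out the rest
    keep = np.argsort(np.abs(y))[-k:]
    fit = np.zeros_like(y)
    fit[keep] = y[keep]
    return fit

print(lagrange_fit(y, t=1.0))      # changing y_j never affects component i != j
print(constrained_fit(y, k=3))     # changing any y_j can change which i survive
```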

1.4 Assumptions, notation, and outline

Throughout this work, we will assume the model

y = μ + ε,  ε ∼ N(0, σ^2 I).   (7)

Note that this is stronger than the model in (1), since we are assuming a normal error distribution. While the model in (1) is sufficient to define the notion of degrees of freedom in general, we actually require normality for the calculations to come—specifically, Lemma 1 (on the degrees of freedom of hard thresholding), and all results in Section 5 (on extending Stein's formula), rely on the normal error model. Beyond this running assumption, we will make any additional assumptions clear when needed.

In terms of notation, we write M^+ to denote the (Moore-Penrose) pseudoinverse of a matrix M, with M^+ = (M^T M)^+ M^T for rectangular matrices M, and we write M_S to denote the submatrix of M whose columns correspond to the set of indices S. We write φ for the standard normal density function and Φ for the standard normal cumulative distribution function.

Finally, here is an outline for the rest of this article. In Section 2, we derive an explicit formula for the degrees of freedom of the best subset selection fit, under orthogonal predictors X. We also introduce the notion of search degrees of freedom for subset selection, and study its characteristics in various settings. In Section 3, we define search degrees of freedom for generic adaptive regression procedures, including the lasso and ridge regression as special cases. Section 4 returns to considering best subset selection, this time with general predictor variables X. Because exact formulae for the degrees of freedom and search degrees of freedom of best subset selection are not available in the general X case, we turn to simulation to investigate these quantities. We also examine the search degrees of freedom of the lasso across the same simulated examples (as its analytic calculation is again intractable for general X). Section 5 casts all of this work on degrees of freedom (and search degrees of freedom) in a different light, by deriving an extension of Stein's formula. Stein's formula is a powerful tool that can be used to compute the degrees of freedom of continuous and almost differentiable fitting procedures; our extension covers functions that have "well-behaved" points of discontinuity, in some sense. This extended version of Stein's formula offers an alternative proof of the exact result in Section 2 (the orthogonal X case), and potentially, provides a perspective from which we can formally understand the empirical findings in Section 4 (the general X case). In Section 6, we conclude with some discussion.

2 Best subset selection with an orthogonal X

In the special case that X ∈ R^{n×p} is orthogonal, i.e., X has orthonormal columns, we can compute the degrees of freedom of the best subset selection fit directly.

Figure 2: An example with n = p = 100, X = I, and μ = 0. The left panel plots the curves df(μ̂^subset), sdf(μ̂^subset), and E|A^subset| as functions of λ, drawn as blue, red, and black lines, respectively. The right panel plots the same quantities with respect to E|A^subset|.

2.2 Example: null signal

We consider first the case of a null underlying signal, i.e., μ = 0. The best subset selection search degrees of freedom (10), as a function of λ, becomes

sdf(μ̂^subset) = 2p · (√(2λ)/σ) · φ(√(2λ)/σ).   (11)

In Figure 2, we plot the quantities df(μ̂^subset), sdf(μ̂^subset), and E|A^subset| as functions of λ, for a simple example with n = p = 100, underlying signal μ = 0, noise variance σ^2 = 1, and predictor matrix X = I, the 100 × 100 identity matrix. We emphasize that this figure was produced without any random draws or simulations, and the plotted curves are exactly as prescribed by Theorem 2 (recall that E|A^subset| also has an explicit form in terms of λ, given in the proof of Lemma 1). In the left panel, we can see that the search degrees of freedom curve is maximized at approximately λ = 0.5, and achieves a maximum value of nearly 50. That is, when λ = 0.5, best subset selection spends nearly 50 (extra) parameters searching through the space of models!

It is perhaps more natural to parametrize the curves in terms of the expected number of active variables E|A^subset| (instead of λ), as displayed in the right panel of Figure 2. This parametrization reveals something interesting: the search degrees of freedom curve is maximized at roughly E|A^subset| = 31.7. In other words, searching is most costly when there are approximately 31.7 variables in the model. This is a bit counterintuitive, because there are more subsets of size 50 than any other size, that is, the function

F(k) = (p choose k),   k = 1, 2, ..., p,

is maximized at k = p/2 = 50. Hence we might believe that searching through subsets of variables is most costly when E|A^subset| = 50, because in this case the search space is largest. Instead, the maximum actually occurs at about E|A^subset| = 31.7. Given the simple form (11) of the search degrees of freedom curve in the null signal case, we can verify this observation analytically: direct calculation shows that the right-hand side in (11) is maximized at λ = σ^2/2, which, when plugged

into the formula for the expected number of selected variables in the null case,

E|A^subset| = 2pΦ(−√(2λ)/σ),

yields E|A^subset| = 2Φ(−1)p ≈ 0.317p.

Although this calculation may have been reassuring, the intuitive question remains: why is the 31.7 variable model associated with the highest cost of model searching (over, say, the 50 variable model)? At this point, we cannot offer a truly satisfying intuitive answer, but we will attempt an explanation nonetheless. Recall that search degrees of freedom measures the additional amount of covariance in (2) that we attribute to searching through the space of models—additional from the baseline amount E|A^subset|, which comes from estimating the coefficients in the selected model. The shape of the search degrees of freedom curve, when μ = 0, tells us that there is more covariance to be gained when the selected model has 31.7 variables than when it has 50 variables. As the size of the selected subset k increases from 0 to 50, note that:

  1. the number of subsets of size k increases, which means that there are more opportunities to decrease the training error, and so the total degrees of freedom (optimism) increases;
  2. trivially, the baseline amount of fitting also increases, as this baseline is just k, the degrees of freedom (optimism) of a fixed model on k variables.

Search degrees of freedom is the difference between these two quantities (i.e., total minus baseline degrees of freedom), and as it turns out, the two are optimally balanced at approximately k = 31.7 (at exactly k = 2Φ(−1)p) in the null signal case.
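The closed-form curves of this section are easy to reproduce numerically. The sketch below evaluates (11) and the expression for E|A^subset| above over a grid of λ values, recovering the maximizer λ = σ^2/2 and the associated E|A^subset| ≈ 0.317p; only the grid itself is an arbitrary choice.

```python
# Evaluating the closed-form curves of this section (null signal, X = I):
# sdf(lambda) from (11) and E|A^subset| = 2p * Phi(-sqrt(2*lambda)/sigma).
# Only the lambda grid below is an arbitrary choice.
import numpy as np
from scipy.stats import norm

p, sigma = 100, 1.0
lam = np.linspace(1e-4, 7.0, 100001)
t = np.sqrt(2.0 * lam) / sigma

sdf = 2.0 * p * t * norm.pdf(t)     # search degrees of freedom, eq. (11)
EA = 2.0 * p * norm.cdf(-t)         # expected number of selected variables
total_df = EA + sdf                 # total df = baseline + search

i = np.argmax(sdf)
print("sdf maximized at lambda =", lam[i], "(analytic value sigma^2/2 = 0.5)")
print("E|A| at the maximizer =", EA[i], "vs 2*Phi(-1)*p =", 2 * norm.cdf(-1) * p)
print("maximum sdf =", sdf[i])      # just under 50, as described in the text
print("total df at the maximizer =", total_df[i])
```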

2.3 Example: sparse signal

Now we consider the case in which μ = Xβ*, for some sparse coefficient vector β* ∈ R^p. We let A* = supp(β*) denote the true support set, and k* = |A*| the true number of nonzero coefficients, assumed to be small. The search degrees of freedom curve in (10) is

sdf(μ̂^subset) = (√(2λ)/σ) ∑_{i∈A*} [φ((√(2λ) − β*_i)/σ) + φ((√(2λ) + β*_i)/σ)] + 2(p − k*) · (√(2λ)/σ) · φ(√(2λ)/σ).   (12)

When the nonzero coefficients β*_i are moderate (not very large), the curve in (12) acts much like the search degrees of freedom curve (11) in the null case. Otherwise, it can behave very differently. We therefore examine two different sparse setups by example, having low and high signal-to-noise ratios. See Figure 3. In both setups, we take n = p = 100, σ^2 = 1, X = I, and μ = Xβ*, with

β*_i = ρ for i = 1, ..., 10, and β*_i = 0 for i = 11, ..., 100.

The left panel uses ρ = 1, and the right uses ρ = 8. We plot the total degrees of freedom and search degrees of freedom of subset selection as a function of the expected number of selected variables (and note, as before, that these plots are produced by mathematical formulae, not by simulation). The curves in the left panel, i.e., in the low signal-to-noise ratio case, appear extremely similar to those in the null signal case (right panel of Figure 2). The search degrees of freedom curve peaks when the expected number of selected variables is about E|A^subset| = 31.9, and its peak height is again just short of 50.

Meanwhile, in the high signal-to-noise ratio case, i.e., the right panel of Figure 3, the behavior is very different. The search degrees of freedom curve is bimodal, and is basically zero when the expected number of selected variables is 10. The intuition: with such a high signal-to-noise ratio in

Figure 4: An example with n = p = 100, X = I, and μ = Xβ* with β*_i = ρ, i = 1, ..., p. The left panel corresponds to ρ = 1 (low signal-to-noise regime) and the right to ρ = 8 (high signal-to-noise regime).

closed-form expressions, they are not derived from simulation). We can see that the low signal-to-noise ratio case, in the left panel, yields a set of curves quite similar to those from the null signal case, in the right panel of Figure 2. One difference is that the search degrees of freedom curve has a higher maximum (its value is about 56, versus 48 in the null signal case), and the location of this maximum is further to the left (occurring at about E|A^subset| = 29.4, versus E|A^subset| = 31.7 in the former case).

On the other hand, the right panel of the figure shows the high signal-to-noise ratio case, where the total degrees of freedom curve is now nonmonotone, and reaches its maximum at an expected number of selected variables (very nearly) E|A^subset| = 50. The search degrees of freedom curve itself peaks much later than it does in the other cases, at approximately E|A^subset| = 45.2. Another striking difference is the sheer magnitude of the degrees of freedom curves: at 50 selected variables on average, the total degrees of freedom of the best subset selection fit is well over 300. Mathematically, this makes sense, as the search degrees of freedom curve in (14) is increasing in |β*_i|.

Furthermore, we can liken the degrees of freedom curves in the right panel of Figure 4 to those in a small portion of the plot in the right panel of Figure 3, namely, the portion corresponding to E|A^subset| ≤ 10. The two sets of curves here appear similar in shape. This is intuitively explained by the fact that, in the high signal-to-noise ratio regime, subset selection over a dense true model is similar to subset selection over a sparse true model, provided that we constrain our attention in the latter case to subsets of size less than or equal to the true model size (since under this constraint, the truly irrelevant variables in the sparse model do not play much of a role).

3 Search degrees of freedom for general procedures

Here we extend the notion of search degrees of freedom to general adaptive regression procedures. Given an outcome y ∈ R^n and predictors X ∈ R^{n×p}, we consider a fitting procedure f : R^n → R^n of the form

f(y) = Xβ̂^(f),

for some estimated coefficients β̂^(f) ∈ R^p. Clearly, the lasso and best subset selection are two examples of such a fitting procedure, with the coefficients as in (3) and (4), respectively. We denote A^(f) = supp(β̂^(f)), the support set of the estimated coefficients under f. The overall complexity of f is measured by its degrees of freedom, as defined in (2) (just as it is for all fitting procedures), but we may also be interested in a degree of complexity associated solely with its model selection component—i.e., we might ask: how many effective parameters does f spend in simply selecting the active set A^(f)? We propose to address this question by developing a notion of search degrees of freedom for f, in a way that generalizes the notion considered in the last section specifically for subset selection.

Abbreviating A = A^(f), we first define a modified procedure f̃ that returns the least squares fit on the active set A,

f̃(y) = P_A y,

where P_A = X_A(X_A^T X_A)^+ X_A^T is the projection onto the span of active predictors X_A (note the use of the pseudoinverse, as X_A need not have full column rank, depending on the nature of the procedure f). We now define the search degrees of freedom of f as

sdf(f) = df(f̃) − E[rank(X_A)]
       = (1/σ^2) ∑_{i=1}^{n} Cov((P_A y)_i, y_i) − E[rank(X_A)].   (15)

The intuition behind this definition: by construction, f̃ and f are identical in their selection of the active set A, and only differ in how they estimate the nonzero coefficients once A has been chosen, with f̃ using least squares, and f using a possibly different mechanism. If A were fixed, then a least squares fit on X_A would use E[rank(X_A)] degrees of freedom, and so it seems reasonable to assign the leftover part, df(f̃) − E[rank(X_A)], as the degrees of freedom spent by f̃ in selecting A in the first place, i.e., the amount spent by f in selecting A in the first place. It may help to discuss some specific cases.
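Definition (15) suggests a direct Monte Carlo recipe: record the active set chosen by f on each simulated draw, refit by least squares on X_A, estimate the covariance term empirically, and subtract the average rank of X_A. The sketch below does this for a small brute-force best subset selection procedure of fixed size k (the constrained form, chosen here only because it is simple to code); the dimensions, design, true mean, and repetition count are illustrative assumptions.

```python
# A Monte Carlo sketch of definition (15): estimate sdf(f) for a procedure f by
# (i) recording its active set A on each draw, (ii) refitting least squares on
# X_A, and (iii) estimating the covariance term by simulation.
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p, k, sigma, reps = 30, 6, 3, 1.0, 5000
X = rng.normal(size=(n, p))
mu = X @ np.array([1.0, 1.0, 0, 0, 0, 0])

def active_set(y):
    # brute-force best subset of size k: smallest residual sum of squares
    best, best_rss = None, np.inf
    for S in itertools.combinations(range(p), k):
        XS = X[:, S]
        rss = np.sum((y - XS @ np.linalg.lstsq(XS, y, rcond=None)[0]) ** 2)
        if rss < best_rss:
            best, best_rss = S, rss
    return best

ys = mu + sigma * rng.normal(size=(reps, n))
fits, ranks = np.empty((reps, n)), np.empty(reps)
for r in range(reps):
    S = active_set(ys[r])
    XS = X[:, S]
    fits[r] = XS @ np.linalg.lstsq(XS, ys[r], rcond=None)[0]   # P_A y
    ranks[r] = np.linalg.matrix_rank(XS)

df_tilde = sum(np.cov(fits[:, i], ys[:, i])[0, 1] for i in range(n)) / sigma**2
print("sdf estimate:", df_tilde - ranks.mean())   # df(f~) - E[rank(X_A)], eq. (15)
```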

3.1 Best subset selection

When f is the best subset selection fit, we have f̃ = f, i.e., subset selection already performs least squares on the set of selected variables A. Therefore,

sdf(f) = df(f) − E|A|,   (16)

where we have also used the fact that X_A must have linearly independent columns with best subset selection (otherwise, we could strictly decrease the ℓ_0 penalty in (4) while keeping the squared error loss unchanged). This matches our definition (10) of search degrees of freedom for subset selection in the orthogonal X case—it is the total degrees of freedom minus the expected number of selected variables, with the total being explicitly computable for orthogonal predictors, as we showed in the last section. The same expression (16) holds for any fitting procedure f that uses least squares to estimate the coefficients in its selected model, because then f̃ = f. (Note that, in full generality, E|A| should be replaced again by E[rank(X_A)] in case X_A need not have full column rank.) An example of another such procedure is forward stepwise regression.

3.2 Ridge regression

For ridge regression, the active model is A = {1, ..., p} for any draw of the outcome y, which means that the modified procedure f̃ is just the full regression fit on X, and

sdf(f) = E[rank(X)] − E[rank(X)] = 0.

degrees of freedom of the lasso (i.e., the difference between the green curve and the diagonal) is smaller than the search degrees of freedom of subset selection (the difference between the red curve and the diagonal).

Figure 5: The same simulation setup as in Figure 1, but now including the relaxed lasso degrees of freedom on the left panel, in green. (The relaxed lasso is the fitting procedure that performs least squares on the lasso active set.) We can see that the relaxed lasso has a smaller degrees of freedom than subset selection, as a function of their (respective) average number of selected variables. Hence, the lasso exhibits a smaller search degrees of freedom than subset selection, in this example.

This discrepancy between the search degrees of freedom of the lasso and subset selection, for correlated variables X, stands in contrast to the orthogonal case, where the two quantities were proven to be equal (subject to the appropriate parametrization). Further simulations with correlated predictors show that, for the most part, this discrepancy persists across a variety of cases; consult Figure 6 and the accompanying caption text for details. However, it is important to note that this phenomenon is not universal, and in some instances (particularly, when the computed active set is small, and the true signal is dense) the search degrees of freedom of the lasso can grow quite large and compete with that of subset selection. Hence, we can see that the two quantities do not always obey a simple ordering, and the simulations presented here call for a more formal understanding of their relationship. Unfortunately, this is not an easy task, since direct calculation of the relevant quantities—the degrees of freedom of best subset selection and the relaxed lasso—is not tractable for a general X. In cases such as these, one usually turns to Stein’s formula as an alternative for calculating degrees of freedom; e.g., the result in Theorem 1 is derived using Stein’s formula. But Stein’s formula only applies to continuous (and almost differentiable) fitting procedures f = f (y), and neither the best subset selection nor the relaxed lasso fit is continuous in y. The next section, therefore, is focused on extending Stein’s result to discontinuous functions.

[Figure 6 shows three rows of panels (null, sparse, and dense signal); in each row, the left panel plots degrees of freedom and the right panel plots search degrees of freedom, against the average number of nonzero coefficients, for the lasso, subset selection, and the relaxed lasso.]

Figure 6: A set of simulation results with n = 30, p = 16 (we are confined to such a small setup because of the exponential computational complexity of subset selection). The rows of X were drawn i.i.d. from N(0, Σ), where Σ is block diagonal with two equal sized (8 × 8) blocks Σ_1, Σ_2. All diagonal entries of Σ_1, Σ_2 were set to 1, and the off-diagonal entries were drawn uniformly between 0.4 and 0.9. We considered three cases for the true mean μ = Xβ*: null (β* = 0), sparse (β* is supported on 3 variables in the first block and 1 in the second, with all nonzero components equal to 1), and dense (β* has all components equal to 1). In all cases, we drew y around μ with independent standard normal noise, for a total of 100 repetitions. Overall, the search degrees of freedom of subset selection appears to be larger than that of the lasso, but at times the latter can rival the former in magnitude, especially for small active sets, and in the dense signal case.

5.1 An extension of Stein’s univariate lemma

We consider functions f : R → R that are absolutely continuous on a partition of R. Formally:

Definition 1. We say that a function f : R → R is piecewise absolutely continuous, or p-absolutely continuous, if there exist points δ_1 < δ_2 < ... < δ_m such that f is absolutely continuous on each one of the open intervals (−∞, δ_1), (δ_1, δ_2), ..., (δ_m, ∞).

For a p-absolutely continuous function f, we write D(f) = {δ_1, ..., δ_m} for its discontinuity set. Furthermore, note that such a function f has a derivative f′ almost everywhere (because it has a derivative almost everywhere on each of the intervals (−∞, δ_1), (δ_1, δ_2), ..., (δ_m, ∞)). We will simply refer to f′ as its derivative. Finally, we use the following helpful notation for one-sided limits,

f(x)+ = lim_{t↓x} f(t)   and   f(x)− = lim_{t↑x} f(t).

We now have the following extension of Stein’s univariate lemma, Lemma 2.

Lemma 4. Let Z ∼ N(0, 1). Let f : R → R be p-absolutely continuous, and have a discontinuity set D(f) = {δ_1, ..., δ_m}. Let f′ be its derivative, and assume that E|f′(Z)| < ∞. Then

E[Zf(Z)] = E[f′(Z)] + ∑_{k=1}^{m} φ(δ_k) [f(δ_k)+ − f(δ_k)−].

The proof is similar to Stein’s proof of Lemma 2, and is left to the appendix, for readability. It is straightforward to extend this result to a nonstandard normal distribution.

Corollary 1. Let X ∼ N(μ, σ^2). Let h : R → R be p-absolutely continuous, with discontinuity set D(h) = {δ_1, ..., δ_m}, and derivative h′ satisfying E|h′(X)| < ∞. Then

(1/σ^2) E[(X − μ)h(X)] = E[h′(X)] + (1/σ) ∑_{k=1}^{m} φ((δ_k − μ)/σ) [h(δ_k)+ − h(δ_k)−].
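As a quick numerical sanity check of Corollary 1, the sketch below compares both sides by Monte Carlo for a simple discontinuous function h(x) = x + c·1{x ≥ δ}, which has derivative 1 almost everywhere and a single jump of size c at δ; the particular h, μ, σ, δ, and c are our own illustrative choices.

```python
# A Monte Carlo sanity check of Corollary 1 for h(x) = x + c*1{x >= delta}:
# h'(x) = 1 a.e., with one jump of size c at delta. All values are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu, sigma, delta, c, reps = 0.3, 1.2, 0.7, 2.0, 2_000_000

X = mu + sigma * rng.normal(size=reps)
h = X + c * (X >= delta)

lhs = np.mean((X - mu) * h) / sigma**2                 # (1/sigma^2) E[(X - mu) h(X)]
rhs = 1.0 + (1.0 / sigma) * norm.pdf((delta - mu) / sigma) * c
print(lhs, rhs)                                        # the two should agree closely
```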

With this extension, we can immediately say something about degrees of freedom, though only in a somewhat restricted setting. Suppose that f : R^n → R^n provides the fit μ̂ = f(y), and that f is actually univariate in each coordinate,

f(y) = (f_1(y_1), ..., f_n(y_n)).

Suppose also that each coordinate function f_i : R → R is p-absolutely continuous. We can apply Corollary 1 with X = y_i and h = f_i, and sum over i to give

df(f) = (1/σ^2) ∑_{i=1}^{n} Cov(f_i(y_i), y_i)
      = ∑_{i=1}^{n} E[f_i′(y_i)] + (1/σ) ∑_{i=1}^{n} ∑_{δ∈D(f_i)} φ((δ − μ_i)/σ) [f_i(δ)+ − f_i(δ)−].   (20)

The above expression provides an alternative way of proving the result on the degrees of freedom of hard thresholding, which was given in Lemma 1, the critical lemma for deriving the degrees of freedom of both best subset selection and the relaxed lasso for orthogonal predictors, Theorems 2 and 3. We step through this proof next.

Alternate proof of Lemma 1. For f(y) = H_t(y), the ith coordinate function is

f_i(y_i) = [H_t(y)]_i = y_i · 1{|y_i| ≥ t},

which has a discontinuity set D(f_i) = {−t, t}. The second term in (20) is hence

(1/σ) ∑_{i=1}^{n} [φ((t − μ_i)/σ) · (t − 0) + φ((−t − μ_i)/σ) · (0 − (−t))] = (t/σ) ∑_{i=1}^{n} [φ((t − μ_i)/σ) + φ((t + μ_i)/σ)],

while the first term is simply ∑_{i=1}^{n} E[1{|y_i| ≥ t}] = E|A_t|.

Adding these together gives

df(H_t) = E|A_t| + (t/σ) ∑_{i=1}^{n} [φ((t − μ_i)/σ) + φ((t + μ_i)/σ)],

precisely the conclusion of Lemma 1.
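The conclusion of Lemma 1 can also be checked by simulation: estimate df(H_t) from the covariance definition (2) and compare it to the closed-form expression just derived. In the sketch below, the values of n, μ, σ, and t are illustrative assumptions.

```python
# Monte Carlo check of the hard thresholding degrees of freedom formula derived
# above (Lemma 1). The values of n, mu, sigma, and t are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, sigma, t, reps = 50, 1.0, 1.5, 40000
mu = np.concatenate([np.full(5, 2.0), np.zeros(n - 5)])

ys = mu + sigma * rng.normal(size=(reps, n))
fits = ys * (np.abs(ys) >= t)                      # hard thresholding H_t(y)

df_mc = sum(np.cov(fits[:, i], ys[:, i])[0, 1] for i in range(n)) / sigma**2
df_formula = (np.mean(np.sum(np.abs(ys) >= t, axis=1))     # E|A_t|, estimated
              + (t / sigma) * np.sum(norm.pdf((t - mu) / sigma)
                                     + norm.pdf((t + mu) / sigma)))
print("Monte Carlo df:", df_mc, "  formula:", df_formula)
```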

5.2 An extension of Stein’s multivariate lemma

The degrees of freedom result (20) applies to functions f for which the ith component function f_i depends only on the ith component of the input, f_i(y) = f_i(y_i), for i = 1, ..., n. Using this result, we could compute the degrees of freedom of the best subset selection and relaxed lasso fits in the orthogonal predictor matrix case. Generally speaking, however, we cannot use this result outside of the orthogonal setting, due to the requirement on f that f_i(y) = f_i(y_i), i = 1, ..., n. Therefore, in the hope of understanding degrees of freedom for procedures like best subset selection and the relaxed lasso in a broader context, we derive an extension of Stein's multivariate lemma.

Stein's multivariate lemma, Lemma 3, is concerned with functions g : R^n → R that are continuous and almost differentiable. Loosely speaking, the concept of almost differentiability is really a statement about absolute continuity. In words, a function is said to be almost differentiable if it is absolutely continuous on almost every line parallel to the coordinate axes (this notion is different from, but equivalent to, that given by Stein). Before translating this mathematically, we introduce some notation. Let us write x = (x_i, x_{−i}) to emphasize that x ∈ R^n is determined by its ith component x_i ∈ R and the other n − 1 components x_{−i} ∈ R^{n−1}. For g : R^n → R, we let g(·, x_{−i}) denote g as a function of the ith component alone, with all other components fixed at the value x_{−i}. We can now formally define almost differentiability:

Definition 2. We say that a function g : R^n → R is almost differentiable if for every i = 1, ..., n and Lebesgue almost every x_{−i} ∈ R^{n−1}, the function g(·, x_{−i}) : R → R is absolutely continuous.

Similar to the univariate case, we propose a relaxed continuity condition. Namely:

Definition 3. We say that a function g : R^n → R is p-almost differentiable if for every i = 1, ..., n and Lebesgue almost every x_{−i} ∈ R^{n−1}, the function g(·, x_{−i}) : R → R is p-absolutely continuous.

Note that a function g that is p-almost differentiable has partial derivatives almost everywhere, and we write the collection as ∇g = (∂g/∂x_1, ..., ∂g/∂x_n).² Also, when dealing with g(·, x_{−i}), the function g restricted to its ith variable with all others fixed at x_{−i}, we write its one-sided limits as

g(x_i, x_{−i})+ = lim_{t↓x_i} g(t, x_{−i})   and   g(x_i, x_{−i})− = lim_{t↑x_i} g(t, x_{−i}).

We are now ready to present our extension of Stein’s multivariate lemma.

²Of course, this does not necessarily mean that g has a well-defined gradient, and so, cumbersome as it may read, we are careful about referring to ∇g as the vector of partial derivatives, instead of the gradient.

As f for the relaxed lasso and subset selection is the locally linear projection map f(y) = P_A y, almost everywhere in y, the first term ∑_{i=1}^{n} E[∂f_i(y)/∂y_i] in (22) is simply E|A|. The second term, then, exactly coincides with the search degrees of freedom of these procedures. (Recall that the same breakdown occurred when using the univariate Stein extension to derive the degrees of freedom of hard thresholding, in Section 5.1.) This suggests a couple of potential insights into degrees of freedom and search degrees of freedom that may be gleaned from the extended Stein formula (22), which we discuss below.

  • Positivity of search degrees of freedom. If one could show that

f_i(δ, y_{−i})+ − f_i(δ, y_{−i})− > 0   (23)

for each discontinuity point δ ∈ D(f_i(·, y_{−i})), almost every y_{−i} ∈ R^{n−1}, and each i = 1, ..., n, then this would imply that the second term in (22) is positive. For the relaxed lasso and subset selection fits, this would mean that the search degrees of freedom term is always positive, i.e., the total degrees of freedom of these procedures is always larger than the (expected) number of selected variables. In words, the condition in (23) says that the ith fitted value, at a point of discontinuity, can only increase as the ith component of y increases. Note that this is a sufficient but not necessary condition for positivity of search degrees of freedom.

  • Search degrees of freedom and discontinuities. The fact that the second term in (22) gives the search degrees of freedom of the best subset selection and the relaxed lasso fits tells us that the search degrees of freedom of a procedure is intimately related to its discontinuities over y. At a high level: the greater the number of discontinuities, the greater the magnitude of these discontinuities, and the closer they occur to the true mean μ, the greater the search degrees of freedom. This may provide some help in understanding the apparent (empirical) differences in search degrees of freedom between the relaxed lasso and best subset selection fits under correlated setups, as seen in Section 4. The particular discontinuities of concern in (22) arise from fixing all but the ith component of the outcome at y_{−i}, and examining the ith fitted value f_i(·, y_{−i}) as a function of its ith argument. One might expect that this function f_i(·, y_{−i}) generally exhibits more points of discontinuity for best subset selection compared to the relaxed lasso, due to the more complicated boundaries of the elements U_i in the active-set-determining decomposition described above (these boundaries are piecewise quadratic for best subset selection, and piecewise linear for the relaxed lasso). This is in line with the general trend of subset selection displaying a larger search degrees of freedom than the relaxed lasso. But, as demonstrated in Figure 6, something changes for large values of λ (small active sets, on average), and for μ = Xβ* with a sparse or (especially) dense true coefficient vector β*; we saw that the search degrees of freedom of both the relaxed lasso and best subset selection fits can grow very large in these cases. Matching search degrees of freedom to the second term in (22), therefore, we infer that both fits must experience major discontinuities here (and these are somehow comparable overall, when measured in number, magnitude, and proximity to μ). This makes sense, especially when we think of taking λ large enough so that these procedures are forced to select an active set that is strictly contained in the true support A* = supp(β*); different values of y, quite close to μ = Xβ*, will make different subsets of A* look more or less appealing according to the criteria in (3), (4).

5.4 Connection to Theorem 2 of Hansen & Sokol (2014)

After completing this work, we discovered the independent and concurrent work of Hansen & Sokol (2014). These authors propose an interesting and completely different geometric approach to studying the degrees of freedom of a metric projection estimator

f(y) ∈ argmin_{u∈K} ‖y − u‖_2^2,

where the set K ⊆ R^n can be nonconvex. Their Theorem 2 gives a decomposition for degrees of freedom that possesses an intriguing tie to ours in (22). Namely, these authors show that the degrees of freedom of any metric projection estimator f can be expressed as its expected divergence plus an "extra" term, this term being the integral of the normal density with respect to a singular measure (dependent on f). Equating this with our expression in (22), we see that the two forms of "extra" terms must match—i.e., our second term in (22), defined by a sum over the discontinuities of the projection f, must be equal to their integral. This has an immediate implication for the projection operator onto the ℓ_0 ball of radius k, i.e., the best subset selection estimator in constrained form: the search degrees of freedom here must be nonnegative (as the integral of a density with respect to a measure is always nonnegative). The decomposition of Hansen & Sokol (2014) hence elegantly proves that the best subset selection fit, constrained to have k active variables, attains a degrees of freedom larger than or equal to k.

However, as far as we can tell, their Theorem 2 does not apply to best subset selection in Lagrange form, the estimator considered in our paper, since it is limited to metric projection estimators. To be clear, our extension of Stein's formula in (22) is not restricted to any particular form of fitting procedure f (though we do require the regularity conditions in (21)). We find the connections between our work and theirs fascinating, and hope to understand them more deeply in the future.

6 Discussion

In this work, we explored the degrees of freedom of best subset selection and the relaxed lasso (the procedure that performs least squares on the active set returned by the lasso). We derived exact expressions for the degrees of freedom of these fitting procedures with orthogonal predictors X, and investigated by simulation their degrees of freedom for correlated predictors.

We introduced a new concept, search degrees of freedom, which intuitively measures the amount of degrees of freedom expended by an adaptive regression procedure in merely constructing an active set of variables (i.e., not counting the degrees of freedom attributed to estimating the active coefficients). Search degrees of freedom has a precise definition for any regression procedure. For subset selection and the relaxed lasso, this reduces to the (total) degrees of freedom minus the expected number of active variables; for the lasso, we simply equate its search degrees of freedom with that of the relaxed lasso, since these two procedures have the exact same search step.

The last section of this paper derived an extension of Stein's formula for discontinuous functions. This was motivated by the hope that such a formula could provide an alternative lens through which we could view degrees of freedom for discontinuous fitting procedures like subset selection and the relaxed lasso. The application of this formula to these fitting procedures is not easy, and our grasp of the implications of this formula for degrees of freedom is only preliminary. There is much work to be done, but we are hopeful that our extension of Stein's result will prove useful for understanding degrees of freedom and search degrees of freedom, and potentially, for other purposes as well.

Acknowledgements

The idea for this paper was inspired by a conversation with Jacob Bien. We thank Rob Tibshirani for helpful feedback and encouragement, and the editors and referees who read this paper and gave many useful comments and references.