Econometrica, Vol. 65, No. 6 (November, 1997), 1335-

ESTIMATION OF A PANEL DATA SAMPLE SELECTION MODEL

BY EKATERINI KYRIAZIDOU
We consider the problem of estimation in a panel data sample selection model, where both the selection and the regression equation of interest contain unobservable individual-specific effects. We propose a two-step estimation procedure, which "differences out" both the sample selection effect and the unobservable individual effect from the equation of interest. In the first step, the unknown coefficients of the "selection" equation are consistently estimated. The estimates are then used to estimate the regression equation of interest. The estimator proposed in this paper is consistent and asymptotically normal, with a rate of convergence that can be made arbitrarily close to n^(-1/2), depending on the strength of certain smoothness assumptions. The finite sample properties of the estimator are investigated in a small Monte Carlo simulation.

KEYWORDS: Sample selection, panel data, individual-specific effects.
SAMPLE SELECTION IS A PROBLEM frequently encountered in applied research. It arises as a result of either self-selection by the individuals under investigation, or sample selection decisions made by data analysts. A classic example, studied in the seminal work of Gronau (1974) and Heckman (1976), is female labor supply, where hours worked are observed only for those women who decide to participate in the labor force. Failure to account for sample selection is well known to lead to inconsistent estimation of the behavioral parameters of interest, as these are confounded with parameters that determine the probability of entry into the sample. In recent years a vast amount of econometric literature has been devoted to the problem of controlling for sample selectivity. The research however has almost exclusively focused on the cross-sectional data case. See Powell (1994) for a review of this literature and for references. In contrast, this paper focuses on the case where the researcher has panel or longitudinal data available.² Sample selectivity is as acute a problem in panel as in cross section data. In addition, panel data sets are commonly characterized by nonrandomly missing observations due to sample attrition.
¹ This paper is based on Chapter 1 of my thesis completed at Northwestern University, Evanston, Illinois. I wish to thank my thesis advisor Bo Honoré for invaluable help and support during this project. Many individuals, among them a co-editor and two anonymous referees, have offered useful comments and suggestions for which I am very grateful. Joel Horowitz kindly provided a computer program used in this study. An earlier version of the paper was presented at the North American Summer Meetings of the Econometric Society, June, 1994. Financial support from NSF through Grant No. SES-9210037 to Bo Honoré is gratefully acknowledged. All remaining errors are my responsibility. An Appendix which contains a proof of a theorem not included in the paper may be obtained at the world wide web site: http://www.spc.uchicago.edu/E-Kyriazidou.
² Obviously, the analysis is similar for any kind of data that have a group structure.
The most typical concern in empirical work using panel data has been the presence of unobserved heterogeneity. Heterogeneity across economic agents may arise for example as a result of different preferences, endowments, or attributes. These permanent individual characteristics are commonly unobservable, or may simply not be measurable due to their qualitative nature. Failure to account for such individual-specific effects may result in biased and inconsistent estimates of the parameters of interest. In linear panel data models, these unobserved effects may be "differenced" out, using the familiar "within" ("fixed-effects") approach. This method is generally not applicable in limited dependent variable models. Exceptions include the discrete choice model studied by Rasch (1960, 1961), Andersen (1970), and Manski (1987), and the censored and truncated regression models (Honoré (1992, 1993)). See also Chamberlain (1984), and Hsiao (1986) for a discussion of panel data methods.

The simultaneous presence of sample selectivity and unobserved heterogeneity has been noted in empirical work (as for example in Hausman and Wise (1979), Nijman and Verbeek (1992), and Rosholm and Smith (1994)). Given the pervasiveness of either problem in panel data studies, it appears highly desirable to be able to control for both of them simultaneously. The present paper is a step in this direction. In particular, we consider the problem of estimating a panel data model where both the sample selection rule, assumed to follow a binary response model, and the (linear) regression equation of interest contain additive permanent unobservable individual-specific effects that may depend on the observable explanatory variables in an arbitrary way.
In this type 2 Tobit model (in the terminology of Amemiya (1985)), sample selectivity induces a fundamental nonlinearity in the equation of interest with respect to the unobserved characteristics, which, in contrast to linear panel data models, cannot be "differenced away." This is because the sample selection effect, which enters additively in the main equation, is a (generally unknown) nonlinear function of both the observed time-varying regressors and the unobservable individual effects of the selection equation, and is therefore not constant over time. Furthermore, even if one were willing to specify the distribution of the underlying time-varying errors (for example normal) in order to estimate the model by maximum likelihood, the presence of unobservable effects in the selection rule would require that the researcher also specify a functional form for their statistical dependence on the observed variables. Apart from being nonrobust to distributional misspecification, this fully parametric "random effects" approach is also computationally cumbersome, as it requires multiple numerical integration over both the unobservable effects and the entire length of the panel. Heckman's (1976, 1979) two-step correction, although computationally much more tractable, also requires full specification of the underlying distributions of the unobservables, and is therefore susceptible to inconsistencies due to misspecification. Thus, the results of this paper will be important even if the distribution of the individual effects is the only nuisance parameter in the model.
The first step of the proposed estimation method requires that the discrete choice selection equation be estimated consistently and at a sufficiently fast rate. To this end, we propose using a "smoothed" version of Manski's (1987) conditional maximum score estimator,³ which follows the approach taken by Horowitz (1992) for estimating cross section discrete choice models. Under appropriate assumptions, stronger than those in Manski (1987), the smoothed estimator improves on the rate of convergence of the original estimator, and also allows standard statistical inference. Furthermore, it dispenses with parametric assumptions on the distribution of the errors, required for example by the conditional maximum likelihood estimator proposed by Rasch (1960, 1961) and Andersen (1970).

Although our analysis is based on the assumption of a censored panel, with only two observations per individual, it easily generalizes to the case of a longer and possibly unbalanced panel, and may be also modified to accommodate truncated samples, in which case estimation of the selection equation is infeasible. Extensions of our estimation method to cover these situations are discussed at the end of the next section.

The paper is organized as follows. Section 2 describes the model and motivates the proposed estimation procedure. Section 3 states the assumptions and derives the asymptotic properties of the estimator. Section 4 presents the results of a Monte Carlo study investigating the small sample performance of the proposed estimator. Section 5 offers conclusions and suggests topics for future research. The proofs of theorems and lemmata are given in the Appendix.
We consider the following model:

(2.1)  y*_it = x*_it β + α*_i + ε*_it  (i = 1, ..., n; t = 1, 2),

(2.2)  d_it = 1{w_it γ + η_i − u_it ≥ 0},

where β and γ are unknown parameter vectors which we wish to estimate,⁴ x*_it and w_it are vectors of explanatory variables (with possibly common elements), α*_i and η_i are unobservable time-invariant individual-specific effects,⁵ ε*_it and u_it are unobserved disturbances (not necessarily independent of each other), while y*_it is a latent variable whose observability depends on the outcome of the indicator
³ The smoothed conditional maximum score estimator for binary response panel data models, along with its asymptotic properties and necessary assumptions, is presented in an earlier version of this paper (Kyriazidou (1994)). See also Charlier, Melenberg, and van Soest (1995).
⁴ Obviously constants cannot be identified in either equation, since they would be absorbed in the individual effects.
⁵ These will be treated as nuisance parameters and will not be estimated. Our analysis also applies to the case where α*_i = η_i.
variable d_it ∈ {0,1}. In particular, it is assumed that, while (d_it, w_it) is always observed, (y*_it, x*_it) is observed only⁶ if d_it = 1. In other words, the "selection" variable d_it determines whether the (i,t)-th observation in equation (2.1) is censored or not. Thus, our problem is to estimate β and γ from a sample consisting of quadruples (d_it, w_it, y_it, x_it). We will denote the vector of (observed and unobserved) explanatory variables and individual effects by ζ_i.
Without the "fixed effects" α*_i and η_i, our model becomes a panel data version of the well known sample selection model considered in the literature, and could be estimated by any of the existing methods. Without sample selectivity, that is with d_it = 1 for all i and t, equation (2.1) is the standard panel data linear regression model.

In our setup, it is possible to estimate γ in the discrete choice "selection" equation (2.2) using either the conditional maximum likelihood approach proposed by Rasch (1960, 1961) and Andersen (1970), or the conditional maximum score method proposed by Manski (1987). On the other hand, estimation of β based on the main equation of interest (2.1) is confronted with two problems: first, the presence of the unobservable effect α_it = d_it·α*_i; and second and more fundamental, the potential "endogeneity" of the regressors x_it = d_it·x*_it, which arises from their dependence on the selection variable d_it, and which may result in "selection bias."

The first problem is easily solved by noting that for those observations that have d_i1 = d_i2 = 1, time differencing will eliminate the effect α*_i from equation (2.1). This is analogous to the "fixed-effects" approach taken in linear panel data models. In general though, application of standard methods, e.g., OLS, on this first-differenced subsample will yield inconsistent estimates of β, due to sample selectivity. This may be seen from the population regression function for the first-differenced subsample:
E(y_i1 − y_i2 | d_i1 = 1, d_i2 = 1, ζ_i) = (x*_i1 − x*_i2)β + E(ε*_i1 − ε*_i2 | d_i1 = 1, d_i2 = 1, ζ_i).

In general, there is no reason to expect that E(Δε*_i | d_i1 = 1, d_i2 = 1, ζ_i) = 0, or that E(ε*_i1 | d_i1 = 1, d_i2 = 1, ζ_i) = E(ε*_i2 | d_i1 = 1, d_i2 = 1, ζ_i). In particular, for each time period the sample selection effect depends on the (generally unknown) joint conditional distribution of (ε*_it, u_i1, u_i2), which may differ across individuals, as well as over time for the same individual:

E(ε*_i1 | d_i1 = 1, d_i2 = 1, ζ_i)
  = E(ε*_i1 | u_i1 ≤ w_i1 γ + η_i, u_i2 ≤ w_i2 γ + η_i, ζ_i)
  = Λ(w_i1 γ + η_i, w_i2 γ + η_i; F_i1(ε*_i1, u_i1, u_i2 | ζ_i))
  ≡ Λ_i1(w_i1 γ + η_i, w_i2 γ + η_i, ζ_i).
⁶ Obviously, the analysis carries through to the case where x*_it is always observed, which is the case most commonly treated in the literature.
The above discussion, which presumes knowledge of the true γ, suggests estimating β from those pairs of observations that have d_i1 = d_i2 = 1 and Δw_i γ ≡ (w_i1 − w_i2)γ = 0, by an estimator of the form

β̂ = [ Σ_{i=1}^n Δx'_i Δx_i ψ_i ]^(-1) [ Σ_{i=1}^n Δx'_i Δy_i ψ_i ],  where ψ_i = d_i1 d_i2 1{Δw_i γ = 0}.

Under appropriate regularity conditions, this estimator will be consistent and root-n asymptotically normal. An obvious requirement is that Pr(Δw_i γ = 0) > 0, which may be satisfied for example when all the random variables in w_it are discrete, or in experimental cases where the distribution of w_it is in the control of the researcher, situations that are rare in economic applications.

Of course, this estimation scheme cannot be directly implemented since γ is unknown. Furthermore, as argued above, it may be the case that ψ_i = 0 (i.e., Δw_i γ ≠ 0) for all individuals in our sample. Notice though that, if Λ is a sufficiently smooth function, pairs of observations for which the difference Δw_i γ is close to zero should also have ΔΛ_i ≈ 0, and the preceding arguments would hold approximately. We therefore propose the following two-step estimation procedure, which is in the spirit of Powell (1987), and Ahn and Powell (1993): In the first step, γ is consistently estimated based on equation (2.2) alone. In the second step, the
estimated coefficients γ̂_n are used to select those pairs of observations for which w_i1 γ̂_n and w_i2 γ̂_n are "close." Specifically, we propose

(2.3)  β̂_n = [ Σ_{i=1}^n Δx'_i Δx_i Ψ̂_in ]^(-1) [ Σ_{i=1}^n Δx'_i Δy_i Ψ̂_in ],

where Ψ̂_in is a weight that declines to zero as the magnitude of the difference |w_i1 γ̂_n − w_i2 γ̂_n| increases. We choose "kernel" weights of the form:

(2.4)  Ψ̂_in = (1/h_n) K(Δw_i γ̂_n / h_n) d_i1 d_i2,

where K is a "kernel density" function, and h_n is a sequence of "bandwidths" which tends to zero as n → ∞. Thus, for a fixed (nonzero) magnitude of the difference |Δw_i γ̂_n|, the weight Ψ̂_in shrinks as the sample size increases, while for a fixed n, a larger |Δw_i γ̂_n| corresponds to a smaller weight.

It is interesting to note that the arguments used in estimating the main regression equation may be modified to accommodate the case of a truncated sample, that is when we only observe those individuals that have d_it = 1 for all time periods. Recall that our method for eliminating the sample selection effect from equation (2.1') is based on the fact that, under certain distributional assumptions, Δw_i γ = 0 implies ΔΛ_i = 0. However, Δw_i = 0 also implies ΔΛ_i = 0. In other words, we might dispense altogether with the first step of estimating γ, and match instead on Δw_i directly; this
would suggest using the weights Ψ_in = (1/h_n^q) K(Δw_i / h_n), where q is the dimension of w_it. Although this approach would imply a slower rate of convergence for the resulting estimator, this estimation scheme may be used for estimating β from a truncated sample, in which case estimation of the selection equation is infeasible. An obvious drawback in this method is that, in order to consistently estimate the entire parameter vector β, it requires that x*_it and w_it do not contain any elements in common.

The above analysis extends naturally to the case of a longer (and possibly unbalanced) panel, that is when T_i ≥ 2. Then β could be estimated from those observations that have d_it = d_is = 1, and for which w_it γ̂_n and w_is γ̂_n are "close," for all s, t = 1, ..., T_i. The estimator is of the form

β̂_n = [ Σ_{i=1}^n Σ_{s &lt; t} (x_it − x_is)'(x_it − x_is) Ψ̂_ist,n ]^(-1) [ Σ_{i=1}^n Σ_{s &lt; t} (x_it − x_is)'(y_it − y_is) Ψ̂_ist,n ],

where

Ψ̂_ist,n = (1/h_n) K((w_it − w_is) γ̂_n / h_n) d_it d_is.
In the following section we derive the asymptotic properties of our proposed estimator for the main equation of interest, under the assumption that γ has been consistently estimated. At the end of the section, we examine the applicability of existing estimators for obtaining first-step estimates of the selection equation.
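For concreteness, the second step of equations (2.3) and (2.4) can be sketched in code. The sketch below is illustrative only: it takes a first-step estimate γ̂ as given, uses a Gaussian kernel (any kernel density would do), and all variable names are hypothetical rather than taken from the paper.

```python
import numpy as np

def kernel_weights(dw_gamma, h):
    # Psi_hat_in = (1/h) K(dw_i @ gamma_hat / h), with K a Gaussian kernel density
    return np.exp(-0.5 * (dw_gamma / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def second_step_beta(dx, dy, dw_gamma, d1, d2, h):
    """Kernel-weighted OLS on first differences, using only individuals
    observed in both periods and with dw_i @ gamma_hat close to zero."""
    psi = kernel_weights(dw_gamma, h) * d1 * d2
    xtx = (dx * psi[:, None]).T @ dx
    xty = (dx * psi[:, None]).T @ dy
    return np.linalg.solve(xtx, xty)

# toy check with no selection effect (everyone observed, Lambda = 0):
rng = np.random.default_rng(0)
n = 2000
beta_true = np.array([1.0, -0.5])
dx = rng.normal(size=(n, 2))                     # differenced regressors
dy = dx @ beta_true + 0.1 * rng.normal(size=n)   # differenced outcome
dw_gamma = rng.normal(size=n)                    # index differences from a hypothetical first step
ones = np.ones(n)
beta_hat = second_step_beta(dx, dy, dw_gamma, ones, ones, h=1.0)
```

Here the weighting is harmless because there is no selection term; in the model above it is precisely the downweighting of pairs with Δw_i γ̂ far from zero that removes the sample selection effect ΔΛ_i in the limit.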
3.1. Asymptotic Properties of the Estimator

The derivation of the large sample properties of β̂_n of equations (2.3) and (2.4) proceeds in two steps. First, the asymptotic behavior of the infeasible estimator which uses the true γ in the construction of the kernel weights, denoted by β̃_n, is analyzed. Then the large sample behavior of the difference (β̂_n − β̃_n) is investigated. It will be useful to define the scalar index W_i = Δw_i γ and its estimated counterpart Ŵ_i = Δw_i γ̂_n, along with the following quantities:
S_xx = (1/n) Σ_{i=1}^n (1/h_n) K(W_i/h_n) Δx'_i Δx_i Φ_i,
S_xλ = (1/n) Σ_{i=1}^n (1/h_n) K(W_i/h_n) Δx'_i ΔΛ_i Φ_i,
S_xε = (1/n) Σ_{i=1}^n (1/h_n) K(W_i/h_n) Δx'_i Δε_i Φ_i,

where Φ_i ≡ d_i1 d_i2.
1344 EKATERINI KYRIAZIDOU
ASSUMPTION R5: The unknown function⁹ Λ(w_1 γ + η, w_2 γ + η, ζ) ≡ E(ε*_1 | d_1 = 1, d_2 = 1, ζ) ≡ E(ε*_1 | u_1 ≤ w_1 γ + η, u_2 ≤ w_2 γ + η, ζ) satisfies:

Λ(s_t, s_s, ζ) − Λ(s_s, s_t, ζ) = Λ̃·(s_t − s_s)  for t, s = 1, 2,

where Λ̃ is a function of (s_1, s_2, ζ), i.e., Λ̃ = Λ̃(s_1, s_2, ζ), which is bounded¹⁰ on its support.
This assumption is crucial to our analysis. It will be satisfied, for example, if Λ is continuously differentiable with respect to its first two arguments, with bounded first-order partial derivatives (as, for example, when the errors are jointly normally distributed), in which case we may apply the multivariate mean-value theorem:

Λ(s_1, s_2, ζ) − Λ(s_2, s_1, ζ) = [Λ^(1)(ζ*) − Λ^(2)(ζ*)]·(s_1 − s_2).

Here Λ^(j) (j = 1, 2) denotes the first-order partial derivative of Λ with respect to its first and second argument respectively, and ζ* lies on the line segment connecting (w_1 γ + η, w_2 γ + η, ζ) and (w_2 γ + η, w_1 γ + η, ζ). Thus, in this case, Λ̃ = Λ^(1)(ζ*) − Λ^(2)(ζ*), and by assumption will be bounded.
ASSUMPTION R6: (a) x*_t and ε*_t have bounded 4 + 2δ moments conditional on W, for some δ ∈ (0,1). (b) E(Δx'Δx Φ | W) and E(Δx'Δx (Δε)² Φ | W) are continuous at W = 0 and do not vanish. (c) E(Δx' Λ̃ Φ | W) is almost everywhere r times continuously differentiable as a function of W, and has bounded derivatives.
ASSUMPTION R8: h_n → 0 and nh_n → ∞ as n → ∞.
From our analysis in Section 2, it is easy to see that Assumptions R1-R4 would suffice to identify β for known γ. An identification scheme in the spirit of our discussion in Section 2 would obviously require positive probability mass of W at zero, as well as nonsingularity of the matrix Σ_xx, imposed by Assumption R3, analogous to the familiar full rank assumption. The continuity of the distribution of the index W, imposed in Assumption R4, is a regularity condition, common in kernel estimation of density and regression functions; it implies, however, that Pr(W = 0) = 0, which makes the exact-matching scheme of Section 2 infeasible, even if γ were known.
⁹ Notice that by Assumption R1, the functional form of Λ is the same over time for the same individual, while by Assumption R2, it is also the same across individuals.
¹⁰ In principle, we could dispense with the assumption that Λ̃ is bounded, by assuming that it has finite fourth moment conditional on W.
Since our estimation scheme is based on pairs of observations for which the index W_i is close to zero, regularity conditions on the behavior of the model in a neighborhood of W = 0 are also needed. These are imposed by Assumptions R4-R8. Notice, in particular, Assumption R5, which imposes a Lipschitz continuity property on the selection correction function Λ(·); it is easy to see that simple continuity will not be sufficient to obtain the asymptotic results below. Furthermore, similarly to kernel density and regression estimation, a high order of differentiability r for certain functions of the index W, along with the appropriate choice of the kernel function and the bandwidth sequence, imply a faster rate of convergence. Achieving this faster rate requires the use of a higher-order bias-reducing kernel, which by Assumption R7(e) is required to be negative in part of its domain.

The next lemma establishes the asymptotic properties of the infeasible estimator β̃_n.
LEMMA 1: Let Assumptions R1-R8 hold. Define

Σ_xx = f_W(0)·E(Δx'Δx Φ | W = 0),
Σ = f_W(0)·E(Δx'Δx (Δε)² Φ | W = 0)·∫ K(v)² dv,

and let Σ_xλ denote the leading term of the expansion of E(S_xλ), evaluated at W = 0. Then:
(a) S_xx →p Σ_xx.
(b) √(nh_n)·S_xε →d N(0, Σ).
(c) If nh_n^(2(r+1)+1) → ∞, then (i) h_n^(-(r+1))·S_xλ →p Σ_xλ and (ii) h_n^(-(r+1))·S_xε →p 0.
The asymptotic properties of β̃_n easily follow from the previous Lemma: if nh_n^(2(r+1)+1) → λ ∈ [0, ∞), then

√(nh_n)·(β̃_n − β) →d N(√λ·Σ_xx^(-1) Σ_xλ, Σ_xx^(-1) Σ Σ_xx^(-1)).
In order to derive the asymptotic properties of the feasible estimator β̂_n, we will make the following additional assumptions:
ASSUMPTION R9: In addition to the conditions of Assumption R7, the kernel function satisfies: (a) K(v) is three times continuously differentiable with bounded derivatives, and (b) ∫|K'(v)| dv, ∫|K''(v)| dv, ∫|v² K'(v)| dv, and ∫|v² K''(v)| dv are finite.
Thus, in the limit, the fact that we are using γ̂_n to estimate β does not affect the asymptotic distribution of β̂_n. The lower bound on μ imposed by Assumption R12 is the key for this result to hold. In words, this bound implies that β is estimated at a rate slower than γ. Indeed, from Theorem 1, the rate of convergence of β̂_n is (nh_n)^(-1/2) = n^(-(1-μ)/2), which is obviously slower than the rate at which γ̂_n converges, so that √(nh_n)·(γ̂_n − γ) = o_p(1).

When instead √(nh_n)·(γ̂_n − γ) = O_p(1), we obtain the following asymptotic representation, which may be easily derived from the analysis of Lemma 2(b) in the Appendix:

√(nh_n)·(β̂_n − β̃_n) = Σ_xx^(-1)·Ω·√(nh_n)·(γ̂_n − γ) + o_p(1),

where

Ω = plim_{n→∞} (1/n) Σ_{i=1}^n (1/h_n²) K'(Ŵ_i/h_n) Δx'_i Δw_i ΔΛ_i Φ_i,
provided that E(Δx'Δw ΔΛ Φ | W) is continuous at W = 0 and vK'(v) → 0 as |v| → ∞. Asymptotic normality of β̂_n may still be established if √(nh_n)·(γ̂_n − γ) has an asymptotic representation of the form √(nh_n)·(γ̂_n − γ) = (1/√(nh_n))·Σ_i ψ(Δw_i, Δd_i; γ) + o_p(1).¹²

At first glance it looks attractive to eliminate the asymptotic bias of β̂_n by choosing h_n so that λ = lim nh_n^(2(r+1)+1) = 0, or equivalently by setting μ > 1/(2(r+1)+1). This choice, however, slows the estimator down. Indeed, the rate of convergence in distribution of β̂_n is maximized by making μ as small as possible, that is by setting μ = 1/(2(r+1)+1), in which case the estimator converges at a rate that can be arbitrarily close to n^(-1/2) (for r large enough), provided also that γ is estimated fast enough, that is at a rate faster than n^(-(r+1)/(2(r+1)+1)). Although the proposed estimator is then asymptotically biased, it is possible to eliminate the asymptotic bias while maintaining the maximal rate of convergence, in the manner suggested by Bierens (1987).
COROLLARY: Let β̂_n be the estimator with window width h_n = h·n^(-1/(2(r+1)+1)), and β̂_{n,δ} the estimator with window width h_{n,δ} = h·n^(-δ/(2(r+1)+1)), where δ ∈ (0,1). Define

β̂*_n = (β̂_n − n^(-(1-δ)(r+1)/(2(r+1)+1))·β̂_{n,δ}) / (1 − n^(-(1-δ)(r+1)/(2(r+1)+1))).

Then n^((r+1)/(2(r+1)+1))·(β̂*_n − β) →d N(0, h^(-1)·Σ_xx^(-1) Σ Σ_xx^(-1)).

¹² We can also derive an asymptotic representation for β̂_n in the case where γ is estimated at a rate n^(-p) that is slower than 1/√(nh_n). In this case we obtain n^p·(β̂_n − β) = Σ_xx^(-1)·Ω·n^p·(γ̂_n − γ) + o_p(1), which implies that β̂_n converges at the same rate as γ̂_n, which is slower than the "optimal" rate obtained for the infeasible estimator β̃_n, that is when γ is known.
3.2. Bandwidth Selection

In order to compute β̂*_n, or β̂_n, in an application, one needs to choose the kernel function K, and to assign a numerical value to the bandwidth parameter h_n. Results on kernel density and regression function estimation suggest that the asymptotic performance of the estimator will likely be more sensitive to the choice of the window width than to the choice of the kernel. Furthermore, the asymptotic normality result of the Corollary above shows that the variance of the limiting distribution depends crucially on the choice of the constant h. We will thus focus here on the problem of bandwidth selection. Bierens (1987) discusses the construction of high order bias-reducing kernels. For a given order of differentiability r, and a given sample size n, the results above suggest a bandwidth sequence of the form h_n = h·n^(-1/(2(r+1)+1)).
So the problem of bandwidth selection reduces to the problem of choosing the constant h. A natural way to proceed (see Horowitz (1992) and Härdle (1990)) is to choose h so as to minimize some measure of the "distance" of the estimator from the true value, based on the asymptotic result of Theorem 1. Consider for example minimizing the asymptotic mean squared error of the estimator, defined as:

MSE(h) = h^(2(r+1))·(Σ_xx^(-1) Σ_xλ)' A (Σ_xx^(-1) Σ_xλ) + h^(-1)·trace[A·Σ_xx^(-1) Σ Σ_xx^(-1)],

for any nonstochastic positive semidefinite matrix A that satisfies Σ_xλ' Σ_xx^(-1) A Σ_xx^(-1) Σ_xλ ≠ 0. It is straightforward to show that the MSE is minimized by setting

(3.2.1)  h = h* = [ trace[A·Σ_xx^(-1) Σ Σ_xx^(-1)] / (2(r+1)·Σ_xλ' Σ_xx^(-1) A Σ_xx^(-1) Σ_xλ) ]^(1/(2(r+1)+1)).

This last expression suggests that we may construct a consistent estimate of h* if consistent estimates of Σ_xλ, Σ_xx, and Σ are available. By part (a) of Lemmata 1 and 2, S_xx consistently estimates Σ_xx for any h_n that satisfies h_n → 0 and nh_n → ∞.
THEOREM 2:¹³ Assume that Assumptions R1-R12 hold. (a) Let β̂_n be a consistent estimator of β based on h_n = h·n^(-1/(2(r+1)+1)), and define ε̂_it = y_it − x_it β̂_n.
¹³ The proof of Theorem 2 is omitted here to conserve space. It is available at the author's world wide web page.
Consistency of β̂_n obtains if h_n^(-1)·(γ̂_n − γ) = o_p(1), for any h_n that satisfies Assumption R8,¹⁴ a condition
satisfied by the conditional maximum likelihood estimator proposed by Rasch (1960, 1961) and Andersen (1970), which is consistent and root-n asymptotically normal, under the assumption that the errors in the selection equation are white noise with a logistic distribution and independent of the regressors and the individual effects. In fact, as Chamberlain (1992) has shown, if the support of the predictor variables in the selection equation is bounded, then identification of γ is possible only in the logistic case. Furthermore, even if the support is unbounded, in which case γ may be identified and thus consistently estimated, consistent estimation at rate n^(-1/2) is possible only in the logistic case. As is well known though, if the distribution of the errors is misspecified, the conditional maximum likelihood approach will in general produce inconsistent estimators.

Another possible choice for estimating γ is the conditional maximum score estimator, proposed by Manski (1987). Under fairly weak distributional assumptions, this estimator consistently estimates γ up to scale. However, the results of Cavanagh (1987), and Kim and Pollard (1990) for the maximum score estimator proposed by Manski (1975, 1985) for the cross section binary response model, namely that it converges at the slow rate of n^(-1/3) to a non-normal random variable, suggest that these properties carry through to its panel data analog, the conditional maximum score estimator. Thus, if (γ̂_n − γ) = O_p(n^(-1/3)), it is possible to consistently estimate β by choosing h_n to satisfy n^(1/3)·h_n → ∞. In this case though, the analysis for obtaining the asymptotic distribution of β̂_n is not applicable.

It is possible, however, to modify Manski's conditional maximum score estimator and obtain control over both its rate of convergence and its limiting distribution, by imposing sufficient smoothness on the distribution of the errors and the explanatory variables in the selection equation.
Specifically, following the approach taken by Horowitz (1992) for estimating the cross section binary response model, we can construct a "smoothed conditional maximum score" estimator, which under weak (but stronger than Manski's) assumptions, is consistent and asymptotically normally distributed, with a rate of convergence that can be arbitrarily close to n^(-1/2), depending on the amount of smoothness
¹⁴ Consistency of β̂_n may be established under the weaker restriction that h_n^(-1)·||γ̂_n − γ||² = o_p(1). The proof of Lemma 2(a) would then have to be modified, by taking a third instead of a first order Taylor series expansion. This modification does not alter the basic restriction for obtaining an asymptotic distribution for β̂_n which does not depend on the estimation of γ in the first step, namely that γ has to be estimated at a faster rate than β. Notice that in this case, the upper bound on μ in Assumption R12 would have to be replaced by (6p − 1)/7. However, this modification would affect the proof of Theorem 2, which would become unnecessarily complicated and long.
we are willing to assume for the underlying distributions. This estimator is considered in an earlier version of the paper (Kyriazidou (1994)) and also in Charlier et al. (1995).
In this section we illustrate certain finite sample properties of the proposed estimator. The Monte Carlo results presented here are in no sense representative of the estimator's sampling behavior since only one experimental design is considered. Further, there is little justification for the choice of the particular design, except that it is simple to set up and that, in the absence of sample selectivity, ordinary least squares on the first differences would perform quite well. The simulation study of this section is intended more as an investigation of the sensitivity of the estimator to the choice of bandwidth, the order of the kernel, the proposed asymptotic bias correction, the first step estimation method, the performance in practice of the proposed plug-in method for estimating the bandwidth constant, and finally the practical usefulness of the proposed covariance matrix estimator in testing hypotheses about the main regression equation coefficients.

Data for the Monte Carlo experiments are generated according to the model:
d_it = 1{w_1,it γ_1 + w_2,it γ_2 + η_i − u_it ≥ 0},
y_it = x_it β_0 + α_i + ε_it,  observed if d_it = 1,

where β_0 = 1, γ_1 = γ_2 = 1, w_1,it and w_2,it are independent N(−1,1) variables, η_i = (w_1,i1 + w_1,i2)/2 + 2ξ_1i, with ξ_1i an independent variable distributed uniformly over the interval (0,1), u_it is logistically distributed normalized to have variance equal to 1, x_it = w_2,it, α_i = (w_2,i1 + w_2,i2)/2 + ξ_2i, with ξ_2i an independent N(0,2) variable, and ε_it = 0.8·ξ_3,it + 0.6·u_it, with ξ_3,it an independent standard normal variable. All data are generated i.i.d. across individuals and over time.

This design implies that Pr(d_1 + d_2 = 1) = 0.37, and Pr(d_1 = d_2 = 1) = 0.31, so that approximately 37 percent of each sample is used in the first step estimation of the selection equation and approximately 31 percent in the second step. Each Monte Carlo experiment is performed 1000 times, while the same pseudorandom number sequences are used for each one of three different sample sizes n: 250, 1000, and 4000.

Table I presents the finite sample properties of the "naive" estimator, denoted by β̂_naive, that ignores sample selectivity and is therefore inconsistent. This estimator is obtained by applying OLS on the first differences using only those individuals that are selected into the sample both time periods, i.e. those that have d_i1 = d_i2 = 1. This estimator may be viewed as a limiting case of our proposed estimator with bandwidth equal to infinity. Panel A reports the estimated mean bias and root mean squared error (RMSE) for this estimator over 1000 replications for different sample sizes n. As the estimator may not have a finite mean or variance in any finite sample, we also report its median bias.
The kernel is a second order bias-reducing kernel, and the bandwidth sequence is h_n = h·n^(-1/(2(r+1)+1)) = h·n^(-1/5) with h = 1. The panels on the right-hand side present the results for β̂*_n, the estimator of the Corollary of Theorem 1 which corrects for asymptotic bias, where we use δ = 0.1.

Going from top to bottom of Table II, Panel A reports the results for the proposed estimator using the true γ in the construction of the kernel weights.¹⁵ In Panel B, γ is estimated by conditional logit, denoted by γ̂_L, which in this case will be consistent since all of the assumptions underlying the approach hold in our Monte Carlo design. In Panel C, γ is estimated using the conditional maximum score estimator,¹⁶ denoted by γ̂_CMS, and in Panels D and E we use the smoothed conditional maximum score estimator, denoted by γ̂_SCMS. In Panel D, γ is estimated at a rate faster than β, while in Panel E both β and γ are estimated at the same rate.¹⁷

From Table II we see that the proposed estimator is less biased than the "naive" OLS estimator both with and without the asymptotic bias correction. Furthermore, this bias decreases with sample size since the estimator is consistent, at rate slower than n^(-1/2), as predicted by the asymptotic theory. This may be seen by the fact that the RMSE decreases by less than half when we quadruple the sample size. Notice that the results do not change substantially whether we use the true γ or we estimate it for the construction of the kernel weights, except when the smoothed maximum score approach is used. In the latter case (Panels D and E), the estimator is significantly more biased, although its RMSE is lower than in the other panels. This may be due to the relatively large finite sample bias of the smoothed maximum score estimates (see also Horowitz (1992)), which may be thought of as increasing the effective window
(^15) In the construction of the kernel weights of both the infeasible estimator β̂_n of Panel A and the feasible estimators of Panels B-E, the norm of γ is set equal to one, so that the results across panels are comparable.
(^16) The CMS estimates are computed by maximizing the objective function (1/n) Σ_{i=1}^n Δd_i 1{Δw_{1i} g_1 + Δw_{2i} g_2 ≥ 0} (see also equation (7) in Manski (1987)) over g_1 = sin(g) and g_2 = cos(g), with g ranging over a 2,000-point equispaced grid from 0 to 2π.
(^17) The SCMS estimates are computed by maximizing
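The grid search of footnote 16 is easy to vectorize. Below is a sketch under our own names; only the "switchers" (Δd_i = ±1) contribute to the objective, and the simulated example (a logit selection rule with no fixed effect, true direction (sin g, cos g) = (1/√2, 1/√2)) is ours, not the paper's design.

```python
import numpy as np

def cms_grid_search(dd, dw1, dw2, n_grid=2000):
    """Conditional maximum score by grid search: maximize
    sum_i dd_i * 1{dw1_i*g1 + dw2_i*g2 >= 0} over (g1, g2) = (sin g, cos g),
    with g on an equispaced grid over [0, 2*pi)."""
    g = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    g1, g2 = np.sin(g), np.cos(g)
    indicator = (np.outer(dw1, g1) + np.outer(dw2, g2)) >= 0.0
    score = (dd[:, None] * indicator).sum(axis=0)  # objective up to the 1/n factor
    j = int(np.argmax(score))
    return g1[j], g2[j]

# Hypothetical check with true gamma = (1, 1), i.e. direction (0.707, 0.707).
rng = np.random.default_rng(3)
n = 4000
w1, w2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
d = (w1 + w2 - rng.logistic(size=(n, 2)) >= 0).astype(int)
dd = d[:, 1] - d[:, 0]
g1_hat, g2_hat = cms_grid_search(dd, w1[:, 1] - w1[:, 0], w2[:, 1] - w2[:, 0])
print(g1_hat, g2_hat)
```

The slow n^{-1/3} rate of the (unsmoothed) maximum score estimator is visible in such simulations: the recovered direction is roughly right but noisy even at moderate n.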
(1/n) Σ_{i=1}^n Δd_i L((Δw_{1i} g_1 + Δw_{2i} g_2)/σ_n)

over all g ∈ R^2 that have |g_2| = 1 and g_1 in a compact subset of R, by the method of fast simulated annealing. Joel Horowitz kindly provided the optimization routine. In Panel D, we set L(v) = K_4(v) of Horowitz (1992, page 516), which implies that the estimator, denoted by γ̂_{SCMS,4}, converges in distribution at rate n^{-4/9} (faster than the rate of β̂, which in the case of a second-order kernel is n^{-2/5}), so that the asymptotic theory of Section 3.1 is valid. In Panel E, we use L(v) = Φ(v), where Φ is the standard normal cumulative distribution function. In this case the estimator, denoted by γ̂_{SCMS,2}, converges in distribution at the same rate as β̂_n, namely n^{-2/5}. The SCMS estimates used in the construction of the kernel weights are corrected for asymptotic bias using δ = 0.1 and are obtained by the two-stage "plug-in" procedure, where in the first stage the bandwidth sequence is σ_n = 0.5·n^{-1/(2m+1)} (m = 2 or 4), while the second stage uses the estimated optimal constant in the construction of the bandwidth. For details, see Horowitz (1992) and Kyriazidou (1994).
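The smoothing idea in footnote 17 replaces the indicator in the CMS objective with a smooth distribution-like function L, which is what allows the faster convergence rates quoted above. A minimal sketch, assuming L = Φ as in Panel E, with a coarse angular grid standing in for the simulated-annealing search and a simulated logit selection rule of our own (true direction (1/√2, 1/√2)); all names are ours.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def Phi(v):
    """Standard normal cdf, the choice of L used in Panel E."""
    return 0.5 * (1.0 + _erf(np.asarray(v, dtype=float) / np.sqrt(2.0)))

def scms_objective(dd, dw1, dw2, g1, g2, sigma):
    """Smoothed CMS objective: (1/n) sum_i dd_i * L((dw1_i*g1 + dw2_i*g2)/sigma)."""
    return float(np.mean(dd * Phi((dw1 * g1 + dw2 * g2) / sigma)))

rng = np.random.default_rng(4)
n = 4000
w1, w2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
d = (w1 + w2 - rng.logistic(size=(n, 2)) >= 0).astype(int)
dd = d[:, 1] - d[:, 0]
dw1, dw2 = w1[:, 1] - w1[:, 0], w2[:, 1] - w2[:, 0]

sigma_n = 0.5 * n ** (-1.0 / 5.0)  # first-stage bandwidth with m = 2
angles = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
vals = [scms_objective(dd, dw1, dw2, np.sin(a), np.cos(a), sigma_n) for a in angles]
best = angles[int(np.argmax(vals))]
print(np.sin(best), np.cos(best))
```

Because the objective is now smooth in g, gradient-based or annealing-type optimizers can be used in place of the grid, which is the point of the smoothed estimator.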
sides of Table II, we see that the asymptotic bias correction does decrease the estimated (mean and median) bias of the estimator; it invariably, however, increases its variability.

In Table III we investigate the sensitivity of the (infeasible) estimator with respect to the choice of the bandwidth constant and the choice of the kernel function. Panels A and B present the results for β̂_n and β̃_n using a bandwidth constant h equal to 0.5 and 3, respectively, and a second-order bias-reducing kernel. As expected, the estimator's bias increases as we increase the bandwidth, while the RMSE decreases. The increase in both mean and median bias appears quite large, which indicates that point estimates may be quite sensitive to the choice of bandwidth. In order to give a sense of the precision with which these biases are estimated, we provide at the bottom of Table III their estimated standard errors for the two sets of experiments that use 0.5 and 3 as the bandwidth constant (Panels A and B).^18 In Panels C and D we use a fourth- and a sixth-order bias-reducing kernel^19 and set h_n = n^{-1/(2(r+1)+1)}, with r = 3 and r = 5, respectively. A comparison of Panels II-A, III-C, and III-D suggests that the use of higher-order kernels speeds up the rate of convergence of the estimator, although there does not appear to be much gain from increasing the order of the kernel from four to six.

Table IV explores the properties of the proposed estimator when the "plug-in" method described in Section 3.2 is used. The specification is the same as in Table II. Comparing Panels A-D in Tables II and IV, we see that the bias of the estimates increases when the optimal bandwidth constant ĥ* is used, while their RMSE decreases (except in Panel IV-D). This is because, in general, ĥ* is larger than the initial constant (here the initial bandwidth constant is set equal to one^20).
Table V displays the mean of ĥ* across the 1000 replications for different specifications of the initial constant, for the case of the infeasible estimator. We find that the means of the estimates are increasing in the initial bandwidth constant (although this is not necessarily true in all 1000 samples). Our finding may be explained by the asymptotic bias term being, in general, poorly estimated in the particular Monte Carlo design used in this study. Indeed, we find that, for the sample sizes considered here, the estimated asymptotic bias of the estimator decreases with the bandwidth constant h, contrary to the asymptotic
(^18) To estimate the standard errors of the median bias we need to estimate the estimator's density. This is done using a normal kernel and the rule-of-thumb bandwidth suggested by Silverman (1986, equation 3.28).
(^19) The fourth-order kernel is K_4(v) = (1/√(2π))[1.1 exp(-v^2/2) - (0.1/√11) exp(-v^2/(2·11))], and the sixth-order kernel is K_6(v) = (1/√(2π))[1.5 exp(-v^2/2) + (0.1/3) exp(-v^2/(2·9)) - (0.6/2) exp(-v^2/(2·4))]. See Bierens (1987).
(^20) We chose the initial h equal to one, as the mean squared error of the distribution of the (infeasible) estimator in the 1000 replications was found to be minimized in that neighborhood when a rough search over a 10-point grid from 0.5 to 10 was performed for a sample size n = 100,000.
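The bias-reducing property of these higher-order kernels can be checked numerically. With the constants as reconstructed here (a mixture of N(0,1) and N(0,11) densities for the fourth-order kernel, and of N(0,1), N(0,9), and N(0,4) densities for the sixth-order one), each kernel integrates to one while its second moment (and, for the sixth-order kernel, also its fourth moment) vanishes:

```python
import numpy as np

def phi(v, s):
    """Normal density with standard deviation s."""
    return np.exp(-v ** 2 / (2.0 * s ** 2)) / (np.sqrt(2.0 * np.pi) * s)

def k4(v):
    # Fourth-order kernel: 1.1*N(0,1) - 0.1*N(0,11); 1.1 - 0.1 = 1 and
    # 1.1*1 - 0.1*11 = 0, so the mass is 1 and the second moment is 0.
    return 1.1 * phi(v, 1.0) - 0.1 * phi(v, np.sqrt(11.0))

def k6(v):
    # Sixth-order kernel: 1.5*N(0,1) + 0.1*N(0,9) - 0.6*N(0,4); the weights
    # also kill the fourth moment: 3*(1.5*1 + 0.1*81 - 0.6*16) = 0.
    return 1.5 * phi(v, 1.0) + 0.1 * phi(v, 3.0) - 0.6 * phi(v, 2.0)

# Crude but accurate numerical moments on a fine grid.
v = np.linspace(-40.0, 40.0, 400001)
dv = v[1] - v[0]
for name, k in [("K4", k4), ("K6", k6)]:
    m0 = np.sum(k(v)) * dv
    m2 = np.sum(v ** 2 * k(v)) * dv
    m4 = np.sum(v ** 4 * k(v)) * dv
    print(f"{name}: mass={m0:.6f}, 2nd moment={m2:.6f}, 4th moment={m4:.6f}")
```

This is the Bierens (1987) construction: a linear combination of normal densities with different scales, with the weights chosen so that the low-order moments cancel.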