Econometrica, Vol. 65, No. 6 (November, 1997), 1335-

ESTIMATION OF A PANEL DATA SAMPLE SELECTION MODEL

BY EKATERINI KYRIAZIDOU
We consider the problem of estimation in a panel data sample selection model, where both the selection and the regression equation of interest contain unobservable individual-specific effects. We propose a two-step estimation procedure, which "differences out" both the sample selection effect and the unobservable individual effect from the equation of interest. In the first step, the unknown coefficients of the "selection" equation are consistently estimated. The estimates are then used to estimate the regression equation of interest. The estimator proposed in this paper is consistent and asymptotically normal, with a rate of convergence that can be made arbitrarily close to n^(-1/2), depending on the strength of certain smoothness assumptions. The finite sample properties of the estimator are investigated in a small Monte Carlo simulation.

KEYWORDS: Sample selection, panel data, individual-specific effects.
SAMPLE SELECTION IS A PROBLEM frequently encountered in applied research. It arises as a result of either self-selection by the individuals under investigation, or sample selection decisions made by data analysts. A classic example, studied in the seminal work of Gronau (1974) and Heckman (1976), is female labor supply, where hours worked are observed only for those women who decide to participate in the labor force. Failure to account for sample selection is well known to lead to inconsistent estimation of the behavioral parameters of interest, as these are confounded with parameters that determine the probability of entry into the sample. In recent years a vast amount of econometric literature has been devoted to the problem of controlling for sample selectivity. The research however has almost exclusively focused on the cross-sectional data case. See Powell (1994) for a review of this literature and for references. In contrast, this paper focuses on the case where the researcher has panel or longitudinal data available.² Sample selectivity is as acute a problem in panel as in cross section data. In addition, panel data sets are commonly characterized by nonrandomly missing observations due to sample attrition.
¹ This paper is based on Chapter 1 of my thesis completed at Northwestern University, Evanston, Illinois. I wish to thank my thesis advisor Bo Honoré for invaluable help and support during this project. Many individuals, among them a co-editor and two anonymous referees, have offered useful comments and suggestions for which I am very grateful. Joel Horowitz kindly provided a computer program used in this study. An earlier version of the paper was presented at the North American Summer Meetings of the Econometric Society, June, 1994. Financial support from NSF through Grant No. SES-9210037 to Bo Honoré is gratefully acknowledged. All remaining errors are my responsibility. An Appendix which contains a proof of a theorem not included in the paper may be obtained at the world wide web site: http://www.spc.uchicago.edu/E-Kyriazidou.
² Obviously, the analysis is similar for any kind of data that have a group structure.
The most typical concern in empirical work using panel data has been the presence of unobserved heterogeneity. Heterogeneity across economic agents may arise for example as a result of different preferences, endowments, or attributes. These permanent individual characteristics are commonly unobservable, or may simply not be measurable due to their qualitative nature. Failure to account for such individual-specific effects may result in biased and inconsistent estimates of the parameters of interest. In linear panel data models, these unobserved effects may be "differenced" out, using the familiar "within" ("fixed-effects") approach. This method is generally not applicable in limited dependent variable models. Exceptions include the discrete choice model studied by Rasch (1960, 1961), Andersen (1970), and Manski (1987), and the censored and truncated regression models (Honoré (1992, 1993)). See also Chamberlain (1984), and Hsiao (1986) for a discussion of panel data methods.

The simultaneous presence of sample selectivity and unobserved heterogeneity has been noted in empirical work (as for example in Hausman and Wise (1979), Nijman and Verbeek (1992), and Rosholm and Smith (1994)). Given the pervasiveness of either problem in panel data studies, it appears highly desirable to be able to control for both of them simultaneously. The present paper is a step in this direction. In particular, we consider the problem of estimating a panel data model where both the sample selection rule, assumed to follow a binary response model, and the (linear) regression equation of interest contain additive permanent unobservable individual-specific effects that may depend on the observable explanatory variables in an arbitrary way.
In this type 2 Tobit model (in the terminology of Amemiya (1985)), sample selectivity induces a fundamental nonlinearity in the equation of interest with respect to the unobserved characteristics, which, in contrast to linear panel data models, cannot be "differenced away." This is because the sample selection effect, which enters additively in the main equation, is a (generally unknown) nonlinear function of both the observed time-varying regressors and the unobservable individual effects of the selection equation, and is therefore not constant over time. Furthermore, even if one were willing to specify the distribution of the underlying time-varying errors (for example normal) in order to estimate the model by maximum likelihood, the presence of unobservable effects in the selection rule would require that the researcher also specify a functional form for their statistical dependence on the observed variables. Apart from being nonrobust to distributional misspecification, this fully parametric "random effects" approach is also computationally cumbersome, as it requires multiple numerical integration over both the unobservable effects and the entire length of the panel. Heckman's (1976, 1979) two-step correction, although computationally much more tractable, also requires full specification of the underlying distributions of the unobservables, and is therefore susceptible to inconsistencies due to misspecification. Thus, the results of this paper will be important even if the distribution of the individual effects is the only nuisance parameter in the model.
The first step of the proposed estimation method requires that the discrete choice selection equation be estimated consistently and at a sufficiently fast rate. To this end, we propose using a "smoothed" version of Manski's (1987) conditional maximum score estimator,³ which follows the approach taken by Horowitz (1992) for estimating cross section discrete choice models. Under appropriate assumptions, stronger than those in Manski (1987), the smoothed estimator improves on the rate of convergence of the original estimator, and also allows standard statistical inference. Furthermore, it dispenses with parametric assumptions on the distribution of the errors, required for example by the conditional maximum likelihood estimator proposed by Rasch (1960, 1961) and Andersen (1970).

Although our analysis is based on the assumption of a censored panel, with only two observations per individual, it easily generalizes to the case of a longer and possibly unbalanced panel, and may be also modified to accommodate truncated samples, in which case estimation of the selection equation is infeasible. Extensions of our estimation method to cover these situations are discussed at the end of the next section.

The paper is organized as follows. Section 2 describes the model and motivates the proposed estimation procedure. Section 3 states the assumptions and derives the asymptotic properties of the estimator. Section 4 presents the results of a Monte Carlo study investigating the small sample performance of the proposed estimator. Section 5 offers conclusions and suggests topics for future research. The proofs of theorems and lemmata are given in the Appendix.
We consider the following model:

(2.1)  y*_it = x*_it β + α*_i + ε*_it  (i = 1, ..., n; t = 1, 2),

(2.2)  d_it = 1{w_it γ + η_i − u_it ≥ 0},

where β and γ are unknown parameter vectors which we wish to estimate,⁴ x*_it and w_it are vectors of explanatory variables (with possibly common elements), α*_i and η_i are unobservable time-invariant individual-specific effects,⁵ ε*_it and u_it are unobserved disturbances (not necessarily independent of each other), while y*_it is a latent variable whose observability depends on the outcome of the indicator
³ The smoothed conditional maximum score estimator for binary response panel data models, along with its asymptotic properties and necessary assumptions, is presented in an earlier version of this paper (Kyriazidou (1994)). See also Charlier, Melenberg, and van Soest (1995).
⁴ Obviously constants cannot be identified in either equation, since they would be absorbed in the individual effects.
⁵ These will be treated as nuisance parameters and will not be estimated. Our analysis also applies to the case where α*_i = η_i.
variable d_it ∈ {0,1}. In particular, it is assumed that, while (d_it, w_it) is always observed, (y*_it, x*_it) is observed only⁶ if d_it = 1. In other words, the "selection" variable d_it determines whether the (i,t)-th observation in equation (2.1) is censored or not. Thus, our problem is to estimate β and γ from a sample consisting of quadruples (d_it, w_it, y_it, x_it). We will denote the vector of (observed and unobserved) explanatory variables and individual effects by ζ_i.
Without the "fixed effects" α*_i and η_i, our model becomes a panel data version of the well known sample selection model considered in the literature, and could be estimated by any of the existing methods. Without sample selectivity, that is with d_it = 1 for all i and t, equation (2.1) is the standard panel data linear regression model.

In our setup, it is possible to estimate γ in the discrete choice "selection" equation (2.2) using either the conditional maximum likelihood approach proposed by Rasch (1960, 1961) and Andersen (1970), or the conditional maximum score method proposed by Manski (1987). On the other hand, estimation of β based on the main equation of interest (2.1) is confronted with two problems: first, the presence of the unobservable effect α_it = d_it·α*_i; and second and more fundamental, the potential "endogeneity" of the regressors x_it = d_it·x*_it, which arises from their dependence on the selection variable d_it, and which may result in "selection bias."

The first problem is easily solved by noting that for those observations that have d_i1 = d_i2 = 1, time differencing will eliminate the effect α*_i from equation (2.1). This is analogous to the "fixed-effects" approach taken in linear panel data models. In general though, application of standard methods, e.g., OLS, on this first-differenced subsample will yield inconsistent estimates of β, due to sample selectivity. This may be seen from the population regression function for the first-differenced subsample:
E(y_i1 − y_i2 | d_i1 = 1, d_i2 = 1, ζ_i) = (x*_i1 − x*_i2)β + E(ε*_i1 − ε*_i2 | d_i1 = 1, d_i2 = 1, ζ_i).

In general, there is no reason to expect that E(Δε*_i | d_i1 = 1, d_i2 = 1, ζ_i) = 0, or that E(ε*_i1 | d_i1 = 1, d_i2 = 1, ζ_i) = E(ε*_i2 | d_i1 = 1, d_i2 = 1, ζ_i). In particular, for each time period the sample selection effect depends on the (generally unknown) joint conditional distribution of (ε*_it, u_i1, u_i2), which may differ across individuals, as well as over time for the same individual:

E(ε*_i1 | d_i1 = 1, d_i2 = 1, ζ_i)
  = E(ε*_i1 | u_i1 ≤ w_i1 γ + η_i, u_i2 ≤ w_i2 γ + η_i, ζ_i)
  = Λ(w_i1 γ + η_i, w_i2 γ + η_i; F_i1(ε*_i1, u_i1, u_i2 | ζ_i))
  ≡ Λ_i1(w_i1 γ + η_i, w_i2 γ + η_i, ζ_i).
⁶ Obviously, the analysis carries through to the case where x*_it is always observed, which is the case most commonly treated in the literature.
The above discussion, which presumes knowledge of the true γ, suggests estimating β from those pairs of observations that have d_i1 = d_i2 = 1 and Δw_i γ ≡ (w_i1 − w_i2)γ = 0, by an estimator of the form

β̂ = [ Σ_{i=1}^n Δx'_i Δx_i ψ_i ]^(-1) [ Σ_{i=1}^n Δx'_i Δy_i ψ_i ],  where ψ_i = d_i1 d_i2 1{Δw_i γ = 0}.

Under appropriate regularity conditions, this estimator will be consistent and root-n asymptotically normal. An obvious requirement is that Pr(Δw_i γ = 0) > 0, which may be satisfied for example when all the random variables in w_it are discrete, or in experimental cases where the distribution of w_it is in the control of the researcher, situations that are rare in economic applications.

Of course, this estimation scheme cannot be directly implemented since γ is unknown. Furthermore, as argued above, it may be the case that ψ_i = 0 (i.e., Δw_i γ ≠ 0) for all individuals in our sample. Notice though that, if Λ is a sufficiently smooth function, pairs of observations for which the difference Δw_i γ is close to zero should also have ΔΛ_i ≈ 0, and the preceding arguments would hold approximately. We therefore propose the following two-step estimation procedure, which is in the spirit of Powell (1987), and Ahn and Powell (1993): In the first step, γ is consistently estimated based on equation (2.2) alone. In the second step, the
estimated coefficients γ̂_n are used to select those pairs of observations for which w_i1 γ̂_n and w_i2 γ̂_n are "close." Specifically, we propose

(2.3)  β̂_n = [ Σ_{i=1}^n Δx'_i Δx_i Ψ̂_in ]^(-1) [ Σ_{i=1}^n Δx'_i Δy_i Ψ̂_in ],

where Ψ̂_in is a weight that declines to zero as the magnitude of the difference |w_i1 γ̂_n − w_i2 γ̂_n| increases. We choose "kernel" weights of the form:

(2.4)  Ψ̂_in = (1/h_n) K(Δw_i γ̂_n / h_n) d_i1 d_i2,

where K is a "kernel density" function, and h_n is a sequence of "bandwidths" which tends to zero as n → ∞. Thus, for a fixed (nonzero) magnitude of the difference |Δw_i γ̂_n|, the weight Ψ̂_in shrinks as the sample size increases, while for a fixed n, a larger |Δw_i γ̂_n| corresponds to a smaller weight.

It is interesting to note that the arguments used in estimating the main regression equation may be modified to accommodate the case of a truncated sample, that is when we only observe those individuals that have d_it = 1 for all time periods. Recall that our method for eliminating the sample selection effect from equation (2.1') is based on the fact that, under certain distributional assumptions, Δw_i γ = 0 implies ΔΛ_i = 0. However, Δw_i = 0 also implies ΔΛ_i = 0. In other words, we might dispense altogether with the first step of estimating γ, and match instead on Δw_i directly; this
would suggest using the weights Ψ_in = (1/h_n^q) K(Δw_i / h_n), where q is the dimension of w_it. Although this approach would imply a slower rate of convergence for the resulting estimator, this estimation scheme may be used for estimating β from a truncated sample, in which case estimation of the selection equation is infeasible. An obvious drawback in this method is that, in order to consistently estimate the entire parameter vector β, it requires that x*_it and w_it do not contain any elements in common.

The above analysis extends naturally to the case of a longer (and possibly unbalanced) panel, that is when T_i ≥ 2. Then β could be estimated from those observations that have d_it = d_is = 1, and for which w_it γ̂_n and w_is γ̂_n are "close," for all s, t = 1, ..., T_i. The estimator is of the form

β̂_n = [ Σ_{i=1}^n Σ_{s &lt; t} (x_it − x_is)'(x_it − x_is) Ψ̂_ist,n ]^(-1) [ Σ_{i=1}^n Σ_{s &lt; t} (x_it − x_is)'(y_it − y_is) Ψ̂_ist,n ],

where

Ψ̂_ist,n = (1/h_n) K((w_it − w_is) γ̂_n / h_n) d_it d_is.
In the following section we derive the asymptotic properties of our proposed estimator for the main equation of interest, under the assumption that γ has been consistently estimated. At the end of the section, we examine the applicability of existing estimators for obtaining first-step estimates of the selection equation.
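For concreteness, the second step of equations (2.3) and (2.4) can be sketched in code. The sketch below is illustrative only: it takes a first-step estimate γ̂ as given, uses a Gaussian kernel (any kernel density would do), and all variable names are hypothetical rather than taken from the paper.

```python
import numpy as np

def kernel_weights(dw_gamma, h):
    # Psi_hat_in = (1/h) K(dw_i @ gamma_hat / h), with K a Gaussian kernel density
    return np.exp(-0.5 * (dw_gamma / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def second_step_beta(dx, dy, dw_gamma, d1, d2, h):
    """Kernel-weighted OLS on first differences, using only individuals
    observed in both periods and with dw_i @ gamma_hat close to zero."""
    psi = kernel_weights(dw_gamma, h) * d1 * d2
    xtx = (dx * psi[:, None]).T @ dx
    xty = (dx * psi[:, None]).T @ dy
    return np.linalg.solve(xtx, xty)

# toy check with no selection effect (everyone observed, Lambda = 0):
rng = np.random.default_rng(0)
n = 2000
beta_true = np.array([1.0, -0.5])
dx = rng.normal(size=(n, 2))                     # differenced regressors
dy = dx @ beta_true + 0.1 * rng.normal(size=n)   # differenced outcome
dw_gamma = rng.normal(size=n)                    # index differences from a hypothetical first step
ones = np.ones(n)
beta_hat = second_step_beta(dx, dy, dw_gamma, ones, ones, h=1.0)
```

Here the weighting is harmless because there is no selection term; in the model above it is precisely the downweighting of pairs with Δw_i γ̂ far from zero that removes the sample selection effect ΔΛ_i in the limit.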
3.1. Asymptotic Properties of the Estimator

The derivation of the large sample properties of β̂_n of equations (2.3) and (2.4) proceeds in two steps. First, the asymptotic behavior of the infeasible estimator which uses the true γ in the construction of the kernel weights, denoted by β̃_n, is analyzed. Then the large sample behavior of the difference (β̂_n − β̃_n) is investigated. It will be useful to define the scalar index W_i = Δw_i γ and its estimated counterpart Ŵ_i = Δw_i γ̂_n, along with the following quantities:
S_xx = (1/n) Σ_{i=1}^n (1/h_n) K(W_i/h_n) Δx'_i Δx_i Φ_i,
S_xλ = (1/n) Σ_{i=1}^n (1/h_n) K(W_i/h_n) Δx'_i ΔΛ_i Φ_i,
S_xε = (1/n) Σ_{i=1}^n (1/h_n) K(W_i/h_n) Δx'_i Δε_i Φ_i,

where Φ_i ≡ d_i1 d_i2.
1344 EKATERINI KYRIAZIDOU
ASSUMPTION R5: The unknown function⁹ Λ(w_1 γ + η, w_2 γ + η, ζ) ≡ E(ε*_1 | d_1 = 1, d_2 = 1, ζ) ≡ E(ε*_1 | u_1 ≤ w_1 γ + η, u_2 ≤ w_2 γ + η, ζ) satisfies:

Λ(s_t, s_s, ζ) − Λ(s_s, s_t, ζ) = Λ̃·(s_t − s_s)  for t, s = 1, 2,

where Λ̃ is a function of (s_1, s_2, ζ), i.e., Λ̃ = Λ̃(s_1, s_2, ζ), which is bounded¹⁰ on its support.
This assumption is crucial to our analysis. It will be satisfied, for example, if Λ is continuously differentiable with respect to its first two arguments, with bounded first-order partial derivatives (as, for example, when the errors are jointly normally distributed), in which case we may apply the multivariate mean-value theorem:

Λ(s_1, s_2, ζ) − Λ(s_2, s_1, ζ) = [Λ^(1)(ζ*) − Λ^(2)(ζ*)]·(s_1 − s_2).

Here Λ^(j) (j = 1, 2) denotes the first-order partial derivative of Λ with respect to its first and second argument respectively, and ζ* lies on the line segment connecting (w_1 γ + η, w_2 γ + η, ζ) and (w_2 γ + η, w_1 γ + η, ζ). Thus, in this case, Λ̃ = Λ^(1)(ζ*) − Λ^(2)(ζ*), and by assumption will be bounded.
ASSUMPTION R6: (a) x*_t and ε*_t have bounded 4 + 2δ moments conditional on W, for some δ ∈ (0,1). (b) E(Δx'Δx Φ | W) and E(Δx'Δx (Δε)² Φ | W) are continuous at W = 0 and do not vanish. (c) E(Δx' Λ̃ Φ | W) is almost everywhere r times continuously differentiable as a function of W, and has bounded derivatives.
ASSUMPTION R8: h_n → 0 and nh_n → ∞ as n → ∞.
From our analysis in Section 2, it is easy to see that Assumptions R1-R4 would suffice to identify β for known γ. An identification scheme in the spirit of our discussion in Section 2 would obviously require positive probability mass of W at zero, as well as nonsingularity of the matrix Σ_xx, imposed by Assumption R3, analogous to the familiar full rank assumption. The continuity of the distribution of the index W, imposed in Assumption R4, is a regularity condition, common in kernel estimation of density and regression functions; it implies, however, that Pr(W = 0) = 0, which makes the exact-matching scheme of Section 2 infeasible, even if γ were known.
⁹ Notice that by Assumption R1, the functional form of Λ is the same over time for the same individual, while by Assumption R2, it is also the same across individuals.
¹⁰ In principle, we could dispense with the assumption that Λ̃ is bounded, by assuming that it has finite fourth moment conditional on W.
Since our estimation scheme is based on pairs of observations for which the index W_i is close to zero, regularity conditions on the behavior of the model in a neighborhood of W = 0 are also needed. These are imposed by Assumptions R4-R8. Notice, in particular, Assumption R5, which imposes a Lipschitz continuity property on the selection correction function Λ(·); it is easy to see that simple continuity will not be sufficient to obtain the asymptotic results below. Furthermore, similarly to kernel density and regression estimation, a high order of differentiability r for certain functions of the index W, along with the appropriate choice of the kernel function and the bandwidth sequence, imply a faster rate of convergence. Achieving this faster rate requires the use of a higher-order bias-reducing kernel, which by Assumption R7(e) is required to be negative in part of its domain.

The next lemma establishes the asymptotic properties of the infeasible estimator β̃_n.
LEMMA 1: Let Assumptions R1-R8 hold. Define

Σ_xx = f_W(0)·E(Δx'Δx Φ | W = 0),
Σ = f_W(0)·E(Δx'Δx (Δε)² Φ | W = 0)·∫ K(v)² dv,

and let Σ_xλ denote the leading term of the expansion of E(S_xλ), evaluated at W = 0. Then:
(a) S_xx →p Σ_xx.
(b) √(nh_n)·S_xε →d N(0, Σ).
(c) If nh_n^(2(r+1)+1) → ∞, then (i) h_n^(-(r+1))·S_xλ →p Σ_xλ and (ii) h_n^(-(r+1))·S_xε →p 0.
The asymptotic properties of β̃_n easily follow from the previous Lemma: if nh_n^(2(r+1)+1) → λ ∈ [0, ∞), then

√(nh_n)·(β̃_n − β) →d N(√λ·Σ_xx^(-1) Σ_xλ, Σ_xx^(-1) Σ Σ_xx^(-1)).
In order to derive the asymptotic properties of the feasible estimator β̂_n, we will make the following additional assumptions:
ASSUMPTION R9: In addition to the conditions of Assumption R7, the kernel function satisfies: (a) K(v) is three times continuously differentiable with bounded derivatives, and (b) ∫|K'(v)| dv, ∫|K''(v)| dv, ∫|v² K'(v)| dv, and ∫|v² K''(v)| dv are finite.
Thus, in the limit, the fact that we are using γ̂_n to estimate β does not affect the asymptotic distribution of β̂_n. The lower bound on μ imposed by Assumption R12 is the key for this result to hold. In words, this bound implies that β is estimated at a rate slower than γ. Indeed, from Theorem 1, the rate of convergence of β̂_n is (nh_n)^(-1/2) = n^(-(1-μ)/2), which is obviously slower than the rate at which γ̂_n converges, so that √(nh_n)·(γ̂_n − γ) = o_p(1).

When instead √(nh_n)·(γ̂_n − γ) = O_p(1), we obtain the following asymptotic representation, which may be easily derived from the analysis of Lemma 2(b) in the Appendix:

√(nh_n)·(β̂_n − β̃_n) = Σ_xx^(-1)·Ω·√(nh_n)·(γ̂_n − γ) + o_p(1),

where

Ω = plim_{n→∞} (1/n) Σ_{i=1}^n (1/h_n²) K'(Ŵ_i/h_n) Δx'_i Δw_i ΔΛ_i Φ_i,
provided that E(Δx'Δw ΔΛ Φ | W) is continuous at W = 0 and vK'(v) → 0 as |v| → ∞. Asymptotic normality of β̂_n may still be established if √(nh_n)·(γ̂_n − γ) has an asymptotic representation of the form √(nh_n)·(γ̂_n − γ) = (1/√(nh_n))·Σ_i ψ(Δw_i, Δd_i; γ) + o_p(1).¹²

At first glance it looks attractive to eliminate the asymptotic bias of β̂_n by choosing h_n so that λ = lim nh_n^(2(r+1)+1) = 0, or equivalently by setting μ > 1/(2(r+1)+1). This choice, however, slows the estimator down. Indeed, the rate of convergence in distribution of β̂_n is maximized by making μ as small as possible, that is by setting μ = 1/(2(r+1)+1), in which case the estimator converges at a rate that can be arbitrarily close to n^(-1/2) (for r large enough), provided also that γ is estimated fast enough, that is at a rate faster than n^(-(r+1)/(2(r+1)+1)). Although the proposed estimator is then asymptotically biased, it is possible to eliminate the asymptotic bias while maintaining the maximal rate of convergence, in the manner suggested by Bierens (1987).
COROLLARY: Let β̂_n be the estimator with window width h_n = h·n^(-1/(2(r+1)+1)), and β̂_{n,δ} the estimator with window width h_{n,δ} = h·n^(-δ/(2(r+1)+1)), where δ ∈ (0,1). Define

β̂*_n = (β̂_n − n^(-(1-δ)(r+1)/(2(r+1)+1))·β̂_{n,δ}) / (1 − n^(-(1-δ)(r+1)/(2(r+1)+1))).

Then n^((r+1)/(2(r+1)+1))·(β̂*_n − β) →d N(0, h^(-1)·Σ_xx^(-1) Σ Σ_xx^(-1)).

¹² We can also derive an asymptotic representation for β̂_n in the case where γ is estimated at a rate n^(-p) that is slower than 1/√(nh_n). In this case we obtain n^p·(β̂_n − β) = Σ_xx^(-1)·Ω·n^p·(γ̂_n − γ) + o_p(1), which implies that β̂_n converges at the same rate as γ̂_n, which is slower than the "optimal" rate obtained for the infeasible estimator β̃_n, that is when γ is known.
3.2. Bandwidth Selection

In order to compute β̂*_n, or β̂_n, in an application, one needs to choose the kernel function K, and to assign a numerical value to the bandwidth parameter h_n. Results on kernel density and regression function estimation suggest that the asymptotic performance of the estimator will likely be more sensitive to the choice of the window width than to the choice of the kernel. Furthermore, the asymptotic normality result of the Corollary above shows that the variance of the limiting distribution depends crucially on the choice of the constant h. We will thus focus here on the problem of bandwidth selection. Bierens (1987) discusses the construction of high order bias-reducing kernels. For a given order of differentiability r, and a given sample size n, the results above suggest a bandwidth sequence of the form h_n = h·n^(-1/(2(r+1)+1)).
So the problem of bandwidth selection reduces to the problem of choosing the constant h. A natural way to proceed (see Horowitz (1992) and Härdle (1990)) is to choose h so as to minimize some measure of the "distance" of the estimator from the true value, based on the asymptotic result of Theorem 1. Consider for example minimizing the asymptotic mean squared error of the estimator, defined as:

MSE(h) = h^(2(r+1))·(Σ_xx^(-1) Σ_xλ)' A (Σ_xx^(-1) Σ_xλ) + h^(-1)·trace[A·Σ_xx^(-1) Σ Σ_xx^(-1)],

for any nonstochastic positive semidefinite matrix A that satisfies Σ_xλ' Σ_xx^(-1) A Σ_xx^(-1) Σ_xλ ≠ 0. It is straightforward to show that the MSE is minimized by setting

(3.2.1)  h = h* = [ trace[A·Σ_xx^(-1) Σ Σ_xx^(-1)] / (2(r+1)·Σ_xλ' Σ_xx^(-1) A Σ_xx^(-1) Σ_xλ) ]^(1/(2(r+1)+1)).

This last expression suggests that we may construct a consistent estimate of h* if consistent estimates of Σ_xλ, Σ_xx, and Σ are available. By part (a) of Lemmata 1 and 2, S_xx consistently estimates Σ_xx for any h_n that satisfies h_n → 0 and nh_n → ∞.
THEOREM 2:¹³ Assume that Assumptions R1-R12 hold. (a) Let β̂_n be a consistent estimator of β based on h_n = h·n^(-1/(2(r+1)+1)), and define ε̂_it = y_it − x_it β̂_n.
¹³ The proof of Theorem 2 is omitted here to conserve space. It is available at the author's world wide web page.
Consistency of β̂_n obtains if h_n^(-1)·(γ̂_n − γ) = o_p(1), for any h_n that satisfies Assumption R8,¹⁴ a condition
satisfied by the conditional maximum likelihood estimator proposed by Rasch (1960, 1961) and Andersen (1970), which is consistent and root-n asymptotically normal, under the assumption that the errors in the selection equation are white noise with a logistic distribution and independent of the regressors and the individual effects. In fact, as Chamberlain (1992) has shown, if the support of the predictor variables in the selection equation is bounded, then identification of γ is possible only in the logistic case. Furthermore, even if the support is unbounded, in which case γ may be identified and thus consistently estimated, consistent estimation at rate n^(-1/2) is possible only in the logistic case. As is well known though, if the distribution of the errors is misspecified, the conditional maximum likelihood approach will in general produce inconsistent estimators.

Another possible choice for estimating γ is the conditional maximum score estimator, proposed by Manski (1987). Under fairly weak distributional assumptions, this estimator consistently estimates γ up to scale. However, the results of Cavanagh (1987), and Kim and Pollard (1990) for the maximum score estimator proposed by Manski (1975, 1985) for the cross section binary response model, namely that it converges at the slow rate of n^(-1/3) to a non-normal random variable, suggest that these properties carry through to its panel data analog, the conditional maximum score estimator. Thus, if (γ̂_n − γ) = O_p(n^(-1/3)), it is possible to consistently estimate β by choosing h_n to satisfy n^(1/3)·h_n → ∞. In this case though, the analysis for obtaining the asymptotic distribution of β̂_n is not applicable.

It is possible, however, to modify Manski's conditional maximum score estimator and obtain control over both its rate of convergence and its limiting distribution, by imposing sufficient smoothness on the distribution of the errors and the explanatory variables in the selection equation.
Specifically, following the approach taken by Horowitz (1992) for estimating the cross section binary response model, we can construct a "smoothed conditional maximum score" estimator, which under weak (but stronger than Manski's) assumptions, is consistent and asymptotically normally distributed, with a rate of convergence that can be arbitrarily close to n^(-1/2), depending on the amount of smoothness
¹⁴ Consistency of β̂_n may be established under the weaker restriction that h_n^(-1)·||γ̂_n − γ||² = o_p(1). The proof of Lemma 2(a) would then have to be modified, by taking a third instead of a first order Taylor series expansion. This modification does not alter the basic restriction for obtaining an asymptotic distribution for β̂_n which does not depend on the estimation of γ in the first step, namely that γ has to be estimated at a faster rate than β. Notice that in this case, the upper bound on μ in Assumption R12 would have to be replaced by (6p − 1)/7. However, this modification would affect the proof of Theorem 2, which would become unnecessarily complicated and long.
we are willing to assume for the underlying distributions. This estimator is considered in an earlier version of the paper (Kyriazidou (1994)) and also in Charlier et al. (1995).
In this section we illustrate certain finite sample properties of the proposed estimator. The Monte Carlo results presented here are in no sense representative of the estimator's sampling behavior since only one experimental design is considered. Further, there is little justification for the choice of the particular design, except that it is simple to set up and that, in the absence of sample selectivity, ordinary least squares on the first differences would perform quite well. The simulation study of this section is intended more as an investigation of the sensitivity of the estimator to the choice of bandwidth, the order of the kernel, the proposed asymptotic bias correction, the first step estimation method, the performance in practice of the proposed plug-in method for estimating the bandwidth constant, and finally the practical usefulness of the proposed covariance matrix estimator in testing hypotheses about the main regression equation coefficients.

Data for the Monte Carlo experiments are generated according to the model:
d_it = 1{w_1,it γ_1 + w_2,it γ_2 + η_i − u_it ≥ 0},
y_it = x_it β_0 + α_i + ε_it,  observed if d_it = 1,

where β_0 = 1, γ_1 = γ_2 = 1, w_1,it and w_2,it are independent N(−1,1) variables, η_i = (w_1,i1 + w_1,i2)/2 + 2ξ_1i, with ξ_1i an independent variable distributed uniformly over the interval (0,1), u_it is logistically distributed normalized to have variance equal to 1, x_it = w_2,it, α_i = (w_2,i1 + w_2,i2)/2 + ξ_2i, with ξ_2i an independent N(0,2) variable, and ε_it = 0.8·ξ_3,it + 0.6·u_it, with ξ_3,it an independent standard normal variable. All data are generated i.i.d. across individuals and over time.

This design implies that Pr(d_1 + d_2 = 1) = 0.37, and Pr(d_1 = d_2 = 1) = 0.31, so that approximately 37 percent of each sample is used in the first step estimation of the selection equation and approximately 31 percent in the second step. Each Monte Carlo experiment is performed 1000 times, while the same pseudorandom number sequences are used for each one of three different sample sizes n: 250, 1000, and 4000.

Table I presents the finite sample properties of the "naive" estimator, denoted by β̂_naive, that ignores sample selectivity and is therefore inconsistent. This estimator is obtained by applying OLS on the first differences using only those individuals that are selected into the sample both time periods, i.e. those that have d_i1 = d_i2 = 1. This estimator may be viewed as a limiting case of our proposed estimator with bandwidth equal to infinity. Panel A reports the estimated mean bias and root mean squared error (RMSE) for this estimator over 1000 replications for different sample sizes n. As the estimator may not have a finite mean or variance in any finite sample, we also report its median bias.
The kernel is a second order bias-reducing kernel, and the bandwidth sequence is h_n = h·n^(-1/(2(r+1)+1)) = h·n^(-1/5) with h = 1. The panels on the right-hand side present the results for β̂*_n, the estimator of the Corollary of Theorem 1 which corrects for asymptotic bias, where we use δ = 0.1.

Going from top to bottom of Table II, Panel A reports the results for the proposed estimator using the true γ in the construction of the kernel weights.¹⁵ In Panel B, γ is estimated by conditional logit, denoted by γ̂_L, which in this case will be consistent since all of the assumptions underlying the approach hold in our Monte Carlo design. In Panel C, γ is estimated using the conditional maximum score estimator,¹⁶ denoted by γ̂_CMS, and in Panels D and E we use the smoothed conditional maximum score estimator, denoted by γ̂_SCMS. In Panel D, γ is estimated at a rate faster than β, while in Panel E both β and γ are estimated at the same rate.¹⁷

From Table II we see that the proposed estimator is less biased than the "naive" OLS estimator both with and without the asymptotic bias correction. Furthermore, this bias decreases with sample size since the estimator is consistent, at rate slower than n^(-1/2), as predicted by the asymptotic theory. This may be seen by the fact that the RMSE decreases by less than half when we quadruple the sample size. Notice that the results do not change substantially whether we use the true γ or we estimate it for the construction of the kernel weights, except when the smoothed maximum score approach is used. In the latter case (Panels D and E), the estimator is significantly more biased, although its RMSE is lower than in the other panels. This may be due to the relatively large finite sample bias of the smoothed maximum score estimates (see also Horowitz (1992)), which may be thought of as increasing the effective window
(^15) In the construction of the kernel weights of both the infeasible estimator β̂_n of Panel A and the feasible estimators of Panels B-E, the norm of γ is set equal to one, so that the results across panels are comparable.
(^16) The CMS estimates are computed by maximizing the objective function (1/n) Σ_{i=1}^n Δd_i 1{Δw_{1i} g_1 + Δw_{2i} g_2 ≥ 0} (see also equation (7) in Manski (1987)) over g_1 = sin(g) and g_2 = cos(g), with g ranging over a 2,000-point equispaced grid from 0 to 2π.
(^17) The SCMS estimates are computed by maximizing
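The grid search of footnote 16 is easy to vectorize. Below is a sketch under our own names; only the "switchers" (Δd_i = ±1) contribute to the objective, and the simulated example (a logit selection rule with no fixed effect, true direction (sin g, cos g) = (1/√2, 1/√2)) is ours, not the paper's design.

```python
import numpy as np

def cms_grid_search(dd, dw1, dw2, n_grid=2000):
    """Conditional maximum score by grid search: maximize
    sum_i dd_i * 1{dw1_i*g1 + dw2_i*g2 >= 0} over (g1, g2) = (sin g, cos g),
    with g on an equispaced grid over [0, 2*pi)."""
    g = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    g1, g2 = np.sin(g), np.cos(g)
    indicator = (np.outer(dw1, g1) + np.outer(dw2, g2)) >= 0.0
    score = (dd[:, None] * indicator).sum(axis=0)  # objective up to the 1/n factor
    j = int(np.argmax(score))
    return g1[j], g2[j]

# Hypothetical check with true gamma = (1, 1), i.e. direction (0.707, 0.707).
rng = np.random.default_rng(3)
n = 4000
w1, w2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
d = (w1 + w2 - rng.logistic(size=(n, 2)) >= 0).astype(int)
dd = d[:, 1] - d[:, 0]
g1_hat, g2_hat = cms_grid_search(dd, w1[:, 1] - w1[:, 0], w2[:, 1] - w2[:, 0])
print(g1_hat, g2_hat)
```

The slow n^{-1/3} rate of the (unsmoothed) maximum score estimator is visible in such simulations: the recovered direction is roughly right but noisy even at moderate n.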
(1/n) Σ_{i=1}^n Δd_i L((Δw_{1i} g_1 + Δw_{2i} g_2)/σ_n)

over all g ∈ R^2 that have |g_2| = 1 and g_1 in a compact subset of R, by the method of fast simulated annealing. Joel Horowitz kindly provided the optimization routine. In Panel D, we set L(v) = K_4(v) of Horowitz (1992, page 516), which implies that the estimator, denoted by γ̂_{SCMS,4}, converges in distribution at rate n^{-4/9} (faster than the rate of β̂, which in the case of a second-order kernel is n^{-2/5}), so that the asymptotic theory of Section 3.1 is valid. In Panel E, we use L(v) = Φ(v), where Φ is the standard normal cumulative distribution function. In this case the estimator, denoted by γ̂_{SCMS,2}, converges in distribution at the same rate as β̂_n, namely n^{-2/5}. The SCMS estimates used in the construction of the kernel weights are corrected for asymptotic bias using δ = 0.1 and are obtained by the two-stage "plug-in" procedure, where in the first stage the bandwidth sequence is σ_n = 0.5·n^{-1/(2m+1)} (m = 2 or 4), while the second stage uses the estimated optimal constant in the construction of the bandwidth. For details, see Horowitz (1992) and Kyriazidou (1994).
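The smoothing idea in footnote 17 replaces the indicator in the CMS objective with a smooth distribution-like function L, which is what allows the faster convergence rates quoted above. A minimal sketch, assuming L = Φ as in Panel E, with a coarse angular grid standing in for the simulated-annealing search and a simulated logit selection rule of our own (true direction (1/√2, 1/√2)); all names are ours.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def Phi(v):
    """Standard normal cdf, the choice of L used in Panel E."""
    return 0.5 * (1.0 + _erf(np.asarray(v, dtype=float) / np.sqrt(2.0)))

def scms_objective(dd, dw1, dw2, g1, g2, sigma):
    """Smoothed CMS objective: (1/n) sum_i dd_i * L((dw1_i*g1 + dw2_i*g2)/sigma)."""
    return float(np.mean(dd * Phi((dw1 * g1 + dw2 * g2) / sigma)))

rng = np.random.default_rng(4)
n = 4000
w1, w2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
d = (w1 + w2 - rng.logistic(size=(n, 2)) >= 0).astype(int)
dd = d[:, 1] - d[:, 0]
dw1, dw2 = w1[:, 1] - w1[:, 0], w2[:, 1] - w2[:, 0]

sigma_n = 0.5 * n ** (-1.0 / 5.0)  # first-stage bandwidth with m = 2
angles = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
vals = [scms_objective(dd, dw1, dw2, np.sin(a), np.cos(a), sigma_n) for a in angles]
best = angles[int(np.argmax(vals))]
print(np.sin(best), np.cos(best))
```

Because the objective is now smooth in g, gradient-based or annealing-type optimizers can be used in place of the grid, which is the point of the smoothed estimator.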
sides of Table II, we see that the asymptotic bias correction does decrease the estimated (mean and median) bias of the estimator; it invariably, however, increases its variability.

In Table III we investigate the sensitivity of the (infeasible) estimator with respect to the choice of the bandwidth constant and the choice of the kernel function. Panels A and B present the results for β̂_n and β̃_n using a bandwidth constant h equal to 0.5 and 3, respectively, and a second-order bias-reducing kernel. As expected, the estimator's bias increases as we increase the bandwidth, while the RMSE decreases. The increase in both mean and median bias appears quite large, which indicates that point estimates may be quite sensitive to the choice of bandwidth. In order to give a sense of the precision with which these biases are estimated, we provide at the bottom of Table III their estimated standard errors for the two sets of experiments that use 0.5 and 3 as the bandwidth constant (Panels A and B).^18 In Panels C and D we use a fourth- and a sixth-order bias-reducing kernel^19 and set h_n = n^{-1/(2(r+1)+1)}, with r = 3 and r = 5, respectively. A comparison of Panels II-A, III-C, and III-D suggests that the use of higher-order kernels speeds up the rate of convergence of the estimator, although there does not appear to be much gain from increasing the order of the kernel from four to six.

Table IV explores the properties of the proposed estimator when the "plug-in" method described in Section 3.2 is used. The specification is the same as in Table II. Comparing Panels A-D in Tables II and IV, we see that the bias of the estimates increases when the optimal bandwidth constant ĥ* is used, while their RMSE decreases (except in Panel IV-D). This is because, in general, ĥ* is larger than the initial constant (here the initial bandwidth constant is set equal to one^20).
Table V displays the mean of ĥ* across the 1000 replications for different specifications of the initial constant, for the case of the infeasible estimator. We find that the means of the estimates are increasing in the initial bandwidth constant (although this is not necessarily true in all 1000 samples). Our finding may be explained by the asymptotic bias term being, in general, poorly estimated in the particular Monte Carlo design used in this study. Indeed, we find that, for the sample sizes considered here, the estimated asymptotic bias of the estimator decreases with the bandwidth constant h, contrary to the asymptotic
(^18) To estimate the standard errors of the median bias we need to estimate the estimator's density. This is done using a normal kernel and the rule-of-thumb bandwidth suggested by Silverman (1986, equation 3.28).
(^19) The fourth-order kernel is K_4(v) = (1/√(2π))[1.1 exp(-v^2/2) - (0.1/√11) exp(-v^2/(2·11))], and the sixth-order kernel is K_6(v) = (1/√(2π))[1.5 exp(-v^2/2) + (0.1/3) exp(-v^2/(2·9)) - (0.6/2) exp(-v^2/(2·4))]. See Bierens (1987).
(^20) We chose the initial h equal to one, as the mean squared error of the distribution of the (infeasible) estimator in the 1000 replications was found to be minimized in that neighborhood when a rough search over a 10-point grid from 0.5 to 10 was performed for a sample size n = 100,000.
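The bias-reducing property of these higher-order kernels can be checked numerically. With the constants as reconstructed here (a mixture of N(0,1) and N(0,11) densities for the fourth-order kernel, and of N(0,1), N(0,9), and N(0,4) densities for the sixth-order one), each kernel integrates to one while its second moment (and, for the sixth-order kernel, also its fourth moment) vanishes:

```python
import numpy as np

def phi(v, s):
    """Normal density with standard deviation s."""
    return np.exp(-v ** 2 / (2.0 * s ** 2)) / (np.sqrt(2.0 * np.pi) * s)

def k4(v):
    # Fourth-order kernel: 1.1*N(0,1) - 0.1*N(0,11); 1.1 - 0.1 = 1 and
    # 1.1*1 - 0.1*11 = 0, so the mass is 1 and the second moment is 0.
    return 1.1 * phi(v, 1.0) - 0.1 * phi(v, np.sqrt(11.0))

def k6(v):
    # Sixth-order kernel: 1.5*N(0,1) + 0.1*N(0,9) - 0.6*N(0,4); the weights
    # also kill the fourth moment: 3*(1.5*1 + 0.1*81 - 0.6*16) = 0.
    return 1.5 * phi(v, 1.0) + 0.1 * phi(v, 3.0) - 0.6 * phi(v, 2.0)

# Crude but accurate numerical moments on a fine grid.
v = np.linspace(-40.0, 40.0, 400001)
dv = v[1] - v[0]
for name, k in [("K4", k4), ("K6", k6)]:
    m0 = np.sum(k(v)) * dv
    m2 = np.sum(v ** 2 * k(v)) * dv
    m4 = np.sum(v ** 4 * k(v)) * dv
    print(f"{name}: mass={m0:.6f}, 2nd moment={m2:.6f}, 4th moment={m4:.6f}")
```

This is the Bierens (1987) construction: a linear combination of normal densities with different scales, with the weights chosen so that the low-order moments cancel.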