McFadden Chapter 6. Estimation
135
________________________________________________________________________
Chapter 6. ESTIMATION
1. Desirable Properties of Estimators
Consider data xthat comes from a DGP which has a density f(x,θ). In most
o
initial applications, we will think of xas a simple random sample of size n,
x= (x ,...,x ) drawn from a population in which x has a density
f
(x,θ), so that the
1n o
DGP density is f(x,θ)=
f
(x ,θ)...
f
(x ,θ). However, the notation f(x,θ) can
1o no o
also cover more complicated DGP, such as time-series data sets. Suppose that θis
o
unknown, but one knows that this DGP is contained in a family with densities f(x,θ)
indexed by θ. Let Xdenote the domain of x, and Θdenote the domain of θ. In the
case of a simple random sample where an observation x is a point in a space
X
, one
n
has X=
X
. The statistical inference task is to estimate θ. In Chapter 5, we saw
o
that an estimator T(x)ofθwas desirable from a Bayesian point of view if T()
o
minimized the expected cost of mistakes. For typical cost functions where the larger
the mistake, the larger the cost, Bayes estimators will try to get "close" to the
true parameter value. That is, the Bayes procedure will seek estimators whose
probability densities are concentrated tightly around the true θ. Classical
o
statistical procedures lack the expected cost criterion for choosing estimators, but
also seek estimators whose probability densities are concentrated around the true θ.
o
Listed below are some of the properties that are deemed desirable for
classical estimators. Classical statistics often proceeds by developing some
candidate estimators, and then using some of these properties to choose among the
candidates. It is often not possible to achieve all of these properties at the same
time, and sometimes they can even be incompatible. Some of the properties are
defined relative to a
class
of candidate estimators, a set of possible T() that we
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f

Partial preview of the text

Download Statistical Inference: Sufficiency, Ancillarity, and Estimation and more Essays (high school) Economics in PDF only on Docsity!


Chapter 6. ESTIMATION

1. Desirable Properties of Estimators

Consider data x that comes from a DGP which has a density f(x,θ_o). In most initial applications, we will think of x as a simple random sample of size n, x = (x_1,...,x_n), drawn from a population in which an observation x has a density f(x,θ_o), so that the DGP density is f(x,θ_o) = f(x_1,θ_o)⋅...⋅f(x_n,θ_o). However, the notation f(x,θ_o) can also cover more complicated DGPs, such as time-series data sets. Suppose that θ_o is unknown, but one knows that this DGP is contained in a family with densities f(x,θ) indexed by θ. Let X denote the domain of x, and Θ denote the domain of θ. In the case of a simple random sample where an observation x is a point in a space 𝒳, one has X = 𝒳^n. The statistical inference task is to estimate θ_o.

In Chapter 5, we saw that an estimator T(x) of θ_o was desirable from a Bayesian point of view if T(⋅) minimized the expected cost of mistakes. For typical cost functions where the larger the mistake, the larger the cost, Bayes estimators will try to get "close" to the true parameter value. That is, the Bayes procedure will seek estimators whose probability densities are concentrated tightly around the true θ_o. Classical statistical procedures lack the expected cost criterion for choosing estimators, but also seek estimators whose probability densities are concentrated around the true θ_o.

Listed below are some of the properties that are deemed desirable for classical estimators. Classical statistics often proceeds by developing some candidate estimators, and then using some of these properties to choose among the candidates. It is often not possible to achieve all of these properties at the same time, and sometimes they can even be incompatible. Some of the properties are defined relative to a class of candidate estimators, a set of possible T(⋅) that we will denote by T. The density of an estimator T(⋅) will be denoted ψ(t,θ_o), or when it is necessary to index the estimator, ψ_T(t,θ_o).

Sufficiency. Suppose there is a one-to-one transformation from the data x into variables (y,z).¹ Then the DGP density f(x,θ) can be described in terms of the density of (y,z), which we might denote g(y,z,θ) and write as the product of the marginal density of y and the conditional density of z given y, g(y,z,θ) = g_1(y,θ)⋅g_2(z|y,θ).² Note that in general both the marginal and the conditional densities depend on θ. The variables y are said to be sufficient for θ if the conditional distribution of z given y is independent of θ; i.e., g_2(z|y,θ) = g_2(z|y). In this case, all of the information in the sample about θ is summarized in y, and once you know y, knowing z tells you nothing more about θ. (One way to convince yourself of this is to form the posterior density of θ, given y and z, for any prior. You will find that this posterior density, which is a complete description of what you believe about θ, does not depend on z when y is sufficient.) Sufficiency of y is equivalent to the factorization g(y,z,θ) = g_1(y,θ)⋅g_2(z|y) of the density into one term depending only on y and θ and a second term depending only on z and y. This characterization is useful for identifying sufficient statistics. An implication of sufficiency is that there is no reason to consider estimators T(x) that depend on x except through the sufficient statistics. Then, the

¹ This is a known transformation, so it cannot depend on the unknown θ.

² The relationship of the density f(x,θ) and the density g(y,z,θ) comes from the rules for transforming random variables; see Chapter 3.6. Let x = x(y,z) denote the inverse of the one-to-one transformation from x to (y,z), and let J denote the Jacobian of this mapping; i.e., the determinant of the array of derivatives of x(y,z) with respect to its arguments, signed so that it is positive. Then g(y,z,θ) = f(x(y,z),θ)⋅J. The Jacobian J does not depend on θ, so g(y,z,θ) factors into a term depending only on y and θ and a term independent of θ if and only if f(x(y,z),θ) factors in the same way.
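As a concrete illustration of the factorization criterion above, consider a Bernoulli(θ) sample, where the sum y = x_1+...+x_n is sufficient: conditional on y, the arrangement of successes carries no information about θ. The following is a minimal simulation sketch in Python (the Bernoulli population, the sample size, and the helper name conditional_given_sum are illustrative assumptions, not taken from the notes):

```python
import numpy as np

def conditional_given_sum(theta, n=5, k=2, reps=200_000, seed=0):
    """Estimate P(x_1 = 1 | x_1 + ... + x_n = k) for Bernoulli(theta) data.

    If y = x_1 + ... + x_n is sufficient for theta, this conditional
    probability cannot depend on theta (here it equals k/n exactly).
    """
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, theta, size=(reps, n))
    keep = x.sum(axis=1) == k        # condition on the value of the sufficient statistic
    return x[keep, 0].mean()         # empirical P(x_1 = 1 | y = k)

for theta in (0.3, 0.7):
    print(theta, conditional_given_sum(theta))   # both values are close to k/n = 0.4
```

The two printed conditional probabilities agree even though the two samples were generated with very different values of θ, which is exactly the defining property of a sufficient statistic.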


estimator to functions of such a y is not as useful as knowing that one only needs to look at functions of a sum of the x's.

Ancillarity. As in the discussion of sufficiency, suppose there is a one-to-one transformation from the data x into variables (y,z). Then the DGP density can be written as the product of the marginal density of y and the conditional density of z given y, g_1(y,θ)⋅g_2(z|y,θ). Both g_1 and g_2 depend in general on θ. The data y are ancillary to θ if g_1 does not depend on θ. In this case, all the information about θ that is contained in the data is contained in the conditional distribution of z given y. This implies that the search for an estimator for θ can be confined to ones derived from the conditional density of z given y. Ancillarity provides useful restrictions when g_2(z|y,θ) depends only on a low-dimensional part of y, or when this density is independent of unknown nuisance parameters that enter the marginal density of y.

An example where ancillarity is useful arises in data x = (x_1,...,x_n) where the x_i are independent observations from an exponential density λ⋅e^{−λx_i} and the sample size n is random with a Poisson density γ^{n−1}⋅e^{−γ}/(n−1)! for n = 1,2,.... The DGP density is then λ^n⋅e^{−λ(x_1+...+x_n)}⋅γ^{n−1}⋅e^{−γ}/(n−1)!. This density factors into the term λ^n⋅y^{n−1}⋅e^{−λy}, with y = x_1+...+x_n, which is proportional to the conditional density of y given n, times a term that is a function of n, y, and γ, but not of λ. Then, the principle of ancillarity says that to make inferences on λ, one should condition on n and not be concerned with the nuisance parameter γ that enters the marginal density of n.

Admissibility. An estimator T(⋅) from a class of estimators T is admissible relative to T if there is no second estimator T′(⋅) in T with the property that E_{x,θ_o}(T′(x) − θ_o)² ≤ E_{x,θ_o}(T(x) − θ_o)² for all θ_o, with the inequality strict for at least one θ_o. This is the same as the definition of admissibility in statistical decision theory, but with the cost of a mistake defined as mean squared error (MSE), the square of the difference between the estimate and the true value of θ. An inadmissible estimator is undesirable because there is an identified alternative estimator that is more closely clustered around the true parameter value. A limitation of admissibility is that there will often be many admissible estimators, and this criterion does not choose between them.

Unbiasedness. An estimator T(⋅) is unbiased for θ_o if E_{x,θ_o}T(x) ≡ θ_o for all θ_o; i.e., θ_o ≡ ∫_{−∞}^{+∞} T(x)⋅f(x,θ_o)dx. An estimator with this property is "centered" around the true parameter value, and will not systematically be too high or too low.

Efficiency. An estimator T(⋅) is efficient relative to an estimator T′(⋅) if E_{x,θ_o}(T(x) − θ_o)² ≤ E_{x,θ_o}(T′(x) − θ_o)² for all θ_o. The estimator T(⋅) is efficient relative to a class of estimators T if it is efficient relative to T′(⋅) for every T′(⋅) in T. An efficient estimator provides estimates that are most closely clustered around the true value of θ, by the squared distance measure, among all the estimators in T. The limitation of efficiency is that for many problems and classes of estimators T, there will be no efficient estimator, in that one cannot satisfy the required inequality uniformly for all θ_o. The following theorem establishes an efficiency result for estimators that are functions of sufficient statistics:

Blackwell Theorem. If T′(⋅) is any estimator of θ_o from data x, and y is a sufficient statistic, then there exists an estimator T(⋅) that is a function solely of the sufficient statistic and that is efficient relative to T′(⋅). If T′(⋅) is unbiased, then so is T(⋅). If an unbiased estimator T(⋅) is uncorrelated with every


bound, then we can be sure that it is MVUE. However, the converse is not true: there may be a MVUE whose variance is still larger than this lower bound; i.e., the lower bound may be unattainable.

Cramer-Rao Bound. Suppose a simple random sample x = (x_1,...,x_n), with f(x,θ_o) the density of an observation x. Assume that log f(x,θ) is twice continuously differentiable in θ, and that this function and its derivatives are bounded in magnitude by a function that is independent of θ and has a finite integral in x. Suppose an estimator T(x) has E_{x,θ}T(x) ≡ θ + μ(θ), and that the bias μ(θ) is differentiable. Then, the variance of T(x) satisfies

V_{x,θ}(T(x)) ≥ (1 + μ′(θ))² / (n⋅E_{x,θ}[(∇_θ log f(x,θ))²]).

If the estimator is unbiased, so μ(θ) ≡ 0, this bound is

V_{x,θ}(T(x)) ≥ 1 / (n⋅E_{x,θ}[(∇_θ log f(x,θ))²]).

The expression E_{x,θ}[(∇_θ log f(x,θ))²] is termed the Fisher information contained in an observation; the Cramer-Rao bound then states that the variance of an unbiased estimator is at least as large as the reciprocal of the Fisher information in the sample, which is n times the information in an observation. To demonstrate this result, let L(x,θ) = Σ_{i=1}^{n} log f(x_i,θ), so that the DGP density is f(x,θ) = e^{L(x,θ)}. By construction,

1 ≡ ∫_{−∞}^{+∞} e^{L(x,θ)} dx   and   θ + μ(θ) ≡ ∫_{−∞}^{+∞} T(x)⋅e^{L(x,θ)} dx.

Differentiate each integral with respect to θ to get

0 ≡ ∫_{−∞}^{+∞} ∇_θL(x,θ)⋅e^{L(x,θ)} dx   and   1 + μ′(θ) ≡ ∫_{−∞}^{+∞} T(x)⋅∇_θL(x,θ)⋅e^{L(x,θ)} dx.


Combine these to get an expression for the covariance of T and ∇_θL,

1 + μ′(θ) ≡ ∫_{−∞}^{+∞} [T(x) − θ]⋅∇_θL(x,θ)⋅e^{L(x,θ)} dx.

Now, any covariance has the property that its square is no greater than the product of the variances of its terms; this is the Cauchy-Schwarz inequality. In this case, the inequality can be written

(1 + μ′(θ))² = ( ∫_{−∞}^{+∞} [T(x) − θ]⋅∇_θL(x,θ)⋅e^{L(x,θ)} dx )² ≤ V_{x,θ}(T(x))⋅E_{x,θ}[(∇_θL(x,θ))²].

Dividing both sides by the Fisher information in the sample, which is simply the variance of the sample score E_{x,θ}[(∇_θL(x,θ))²], gives the Cramer-Rao bound.

Invariance. In some settings, one would expect that a change in a problem should not alter an estimate of a parameter, or should alter it in a specific way. Generically, these are called invariance properties of an estimator. For example, when estimating a parameter from data obtained by a simple random sample, the estimate should not depend on the indexing of the observations in the sample; i.e., T(x_1,...,x_n) should be invariant under permutations of the observations. Sometimes a parameter enters a DGP in such a way that there is a simple relationship between shifts in the parameter and the shifts one would expect to observe in the data. For example, suppose the density of an observation is of the form f(x_i,θ) ≡ h(x_i − θ); in this case, θ is called a location parameter. If the true value of θ shifts up by an amount ∆, one would expect observations on average to shift up by an amount ∆. If T(x_1,...,x_n) is an estimator of θ_o in this problem, a reasonable property to impose on T(⋅) is that T(x_1 + ∆,...,x_n + ∆) = T(x_1,...,x_n) + ∆.
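Returning to the Cramer-Rao bound, a quick numerical check is possible in the simplest case. For a N(θ,σ²) population with σ known, the per-observation Fisher information for θ is 1/σ², so the bound for an unbiased estimator is σ²/n, which the sample mean attains exactly. A minimal simulation sketch in Python (the particular θ, σ, sample size, and replication count are arbitrary illustrative choices):

```python
import numpy as np

# Cramer-Rao check for the mean of a N(theta, sigma^2) population with sigma known.
# Per-observation Fisher information is 1/sigma^2, so the bound for an unbiased
# estimator is sigma^2 / n; the sample mean is unbiased and attains it.
rng = np.random.default_rng(0)
theta, sigma, n, reps = 2.0, 3.0, 50, 100_000

xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)

print("simulated Var(xbar):", xbar.var())
print("Cramer-Rao bound   :", sigma**2 / n)    # the two numbers should nearly coincide
```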


correspond to approximations that are likely to be relevant to a specific problem. There is no ambiguity when one is drawing simple random samples from an infinite population. However, if one samples from a finite population, a finite sequence of samples of increasing size will terminate in a complete census of the population. While one could imagine sampling with replacement and drawing samples that are larger than the population, it is not obvious why estimators that have some reasonable properties in this limit are necessarily appropriate for the finite population. Put another way, it is not obvious that this limit provides a good approximation to the finite sample.

The issue of the appropriate asymptotic limit is particularly acute for time series. One can imagine extending observations indefinitely through time. This may provide approximations that are appropriate in some situations for some purposes, but not for others. For example, if one is trying to estimate the timing of a particular event, a local feature of the time series, it is questionable that extending the time series indefinitely into the past and future leads to a good approximation to the statistical properties of the estimator of the time of the event. Other ways of thinking of increasing sample sizes for time series, such as sampling from more and more "parallel" universes, or sampling at shorter and shorter intervals, have their own idiosyncrasies that make them questionable as useful approximations.

The second major issue is how the sequence of estimators associated with various sample sizes is defined. A conceptualization introduced in Chapter 5 defines an estimator to be a functional of the empirical CDF of the data, T(F_n). Then, it is natural to think of T(F(⋅,θ_o)) as the limit of this sequence of estimators, and the Glivenko-Cantelli theorem stated in Chapter 5.1 establishes an approximation property: the estimator T(F_n) converges almost surely to T(F(⋅,θ_o)), as long as T(⋅) satisfies a continuity property at F(⋅,θ_o). It is particularly important to avoid reliance on asymptotic arguments when it is clear that the asymptotic approximation is irrelevant to the behavior of the estimator in the range of sample sizes actually encountered. Consider an estimation procedure which says "Ignore the data and estimate θ_o to be zero in all samples of size less than 10 billion, and for larger samples employ some computationally complex but statistically sound estimator." This procedure may technically have good asymptotic properties, but this approximation obviously tells you nothing about the behavior of the estimator in economic sample sizes of a few thousand observations.
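The Glivenko-Cantelli approximation invoked above can also be checked directly: the sup distance between the empirical CDF F_n and the population CDF shrinks as n grows. A minimal sketch in Python with NumPy and SciPy (the standard normal population and the sample sizes are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Sup distance between the empirical CDF F_n and the true N(0,1) CDF,
# evaluated at the order statistics, for increasing sample sizes.
rng = np.random.default_rng(0)

for n in (50, 500, 5_000, 50_000):
    x = np.sort(rng.standard_normal(n))
    ecdf_hi = np.arange(1, n + 1) / n      # F_n at and just after each order statistic
    ecdf_lo = np.arange(0, n) / n          # F_n just before each order statistic
    d_n = max(np.max(ecdf_hi - norm.cdf(x)), np.max(norm.cdf(x) - ecdf_lo))
    print(n, round(d_n, 4))                # the Kolmogorov distance shrinks toward 0
```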

Consistency. A sequence of estimators T_n(x) = T_n(x_1,...,x_n) for samples of size n is consistent for θ_o if the probability that T_n is more than a distance ε > 0 from θ_o goes to zero as n increases; i.e., lim_{n→∞} P(|T_n(x_1,...,x_n) − θ_o| > ε) = 0. In the terminology of Chapter 4, this is weak convergence or convergence in probability, written T_n(x_1,...,x_n) →_p θ_o. One can also talk about strong consistency, which holds when lim_{n→∞} P(sup_{n′≥n} |T_{n′}(x_1,...,x_{n′}) − θ_o| > ε) = 0 and corresponds to almost sure convergence, T_n(x_1,...,x_n) →_as θ_o.

Asymptotic Normality. A sequence of estimators T_n(⋅) for samples of size n is consistent asymptotically normal (CAN) for θ_o if there exists a sequence r_n of scaling constants such that r_n → +∞ and r_n⋅(T_n(x_n) − θ_o) converges in distribution to a normally distributed random variable with some mean μ = μ(θ_o) and variance σ² = σ²(θ_o).³ The mean μ is termed the asymptotic bias, and σ² is termed the

³ If Ψ_n(t) is the CDF of T_n(x_n), then the CDF of Q_n = r_n⋅(T_n(x_n) − θ_o) is Ψ_n(θ_o + q/r_n). From Chapter 4, r_n⋅(T_n(x_n) − θ_o) →_d Z with Z ~ N(μ,σ²) if for each q the CDF of Q_n satisfies lim_{n→∞} |Ψ_n(θ_o + q/r_n) − Φ((q−μ)/σ)| = 0. This is the conventional definition of convergence in distribution, with the continuity of the


distribution to a density that does not depend on θ. Then, there is a large sample rationale for concentrating on estimators that depend only on y.
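To see consistency and asymptotic normality at work in the simplest case, the sketch below (Python with NumPy; the exponential population, replication counts, and sample sizes are illustrative choices) tracks the sample mean T_n = x̄_n, which collapses onto θ_o while √n⋅(T_n − θ_o) keeps a stable, approximately normal spread:

```python
import numpy as np

# Sample mean of an Exponential population with mean theta_o = 1: it is consistent,
# and sqrt(n)*(T_n - theta_o) has a stable spread (variance near 1) as n grows.
rng = np.random.default_rng(0)
theta_o, reps = 1.0, 2_000

for n in (10, 100, 1_000, 10_000):
    t_n = rng.exponential(theta_o, size=(reps, n)).mean(axis=1)
    print(n,
          "P(|T_n - theta_o| > 0.1) =", round(np.mean(np.abs(t_n - theta_o) > 0.1), 3),
          " Var(sqrt(n)*(T_n - theta_o)) =", round(n * t_n.var(), 3))
```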

2. General Estimation Criteria

It is useful to have some general methods of generating estimators that, as a consequence of their construction, will have some desirable statistical properties. Such estimators may prove adequate in themselves, or may form a starting point for refinements that improve statistical properties. We introduce several such methods:

Analogy Estimators. Suppose one is interested in a feature of a target population that can be described as a functional of its CDF F(⋅), such as its mean, median, or variance, and write this feature as θ_o = μ(F). An analogy estimator exploits the similarity of a population and of a simple random sample drawn from this population, and forms the estimator T(x) = μ(F_n), where μ is the functional that produces the target population feature and F_n is the empirical distribution function. For example, a sample mean will be an analogy estimator for a population mean.

Moment Estimators. Population moments will depend on the parameter indexing the underlying DGP. This is true for ordinary moments such as means, variances, and covariances, as well as more complicated moments involving data transformations, such as quantiles. Let m(x) denote a function of an observation and E_{x,θ_o}m(x) = γ(θ_o) denote the population moment formed by taking the expectation of m(x). In a sample x = (x_1,...,x_n), the idea of a moment estimator is to form the sample moment (1/n)⋅Σ_{i=1}^{n} m(x_i) ≡ E_n m(x), and then to use the analogy of the population and sample moments to form the approximation E_n m(x) ≈ E_{x,θ_o}m(x) = γ(θ_o).⁴ The moment estimator T(x) solves E_n m(x) = γ(T(x)). When the number of moment conditions equals the number of parameters, an exact solution is normally obtainable, and T(x) is termed a classical method of moments estimator. When the number of moment conditions exceeds the number of parameters, it is not possible in general to find T(x) that sets them all to zero at once. In this case, one may form a number of linear combinations of the moments equal to the number of parameters to be estimated, and find T(x) that sets these linear combinations to zero. The linear combinations in turn may be derived starting from some metric that provides a measure of the distance of the moments from zero, with T(x) interpreted as the minimizer of this metric. This is called generalized method of moments estimation.

Maximum Likelihood Estimators. Consider the DGP density f(x,θ) for a given sample as a function of θ. The maximum likelihood estimator of the unknown true value θ_o is the function θ = T(x) that maximizes f(x,θ). The intuition behind this estimator is that if we guess a value for θ that is far away from the true θ_o, then the probability law for this θ would be very unlikely to produce the data that are actually observed, whereas if we guess a value for θ that is near the true θ_o, then the probability law for this θ would be likely to produce the observed data. Then, the T(x) which maximizes this likelihood, as measured by the probability law itself, should be close to the true θ_o. The maximum likelihood estimator plays a central role in classical statistics, and can be motivated solely in terms of its desirable classical statistical properties in large samples.

⁴ The sample average of a function m(x) of an observation can also be interpreted as its expectation with respect to the empirical distribution of the sample; we use the notation E_n m(x) to denote this empirical expectation.
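As a small illustration of the analogy and method-of-moments ideas, the sketch below (Python with NumPy; the Gamma population, the choice of the first two moments, and the helper name are illustrative assumptions) solves the sample analogs of E x = k⋅b and Var x = k⋅b² for a Gamma(shape k, scale b) population:

```python
import numpy as np

def gamma_method_of_moments(x):
    """Classical method-of-moments estimator for a Gamma(shape k, scale b) sample.

    Matches the population moments E x = k*b and Var x = k*b**2 to their
    sample analogs and solves for (k, b).
    """
    m1 = x.mean()       # sample analog of E x
    v = x.var()         # sample analog of Var x (expectation under the empirical CDF)
    return m1**2 / v, v / m1    # (k_hat, b_hat)

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=5_000)
print(gamma_method_of_moments(x))   # roughly (2.0, 1.5)
```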


will estimate the parameters μ and σ² using the maximum likelihood method, and establish some of the statistical properties of these estimators. The first-order conditions for maximizing L(x,μ,σ²) in μ and σ² are

0 = Σ_{i=1}^{n} (x_i − μ)/σ²  ⟹  μ̂ = x̄ ≡ (1/n)⋅Σ_{i=1}^{n} x_i,

0 = −n/(2σ²) + Σ_{i=1}^{n} (x_i − μ)²/(2σ⁴)  ⟹  σ̂² = (1/n)⋅Σ_{i=1}^{n} (x_i − x̄)².

The maximum likelihood estimator of μ is then the sample mean, and the maximum likelihood estimator of σ² is the sample variance. Define s² = σ̂²⋅n/(n−1), the sample variance with a sample size correction. The following results summarize the properties of these estimators:

(1) (x̄, s²) are joint minimal sufficient statistics for (μ, σ²).
(2) x̄ is an unbiased estimator for μ, and s² is an unbiased estimator for σ².
(3) x̄ is a Minimum Variance Unbiased Estimator (MVUE) for μ; s² is MVUE for σ².
(4) x̄ is normally distributed with mean μ and variance σ²/n.
(5) (n−1)s²/σ² has a chi-square distribution with n−1 degrees of freedom.
(6) x̄ and s² are statistically independent.
(7) √n⋅(x̄ − μ)/s has a Student's-T distribution with n−1 degrees of freedom.
(8) n⋅(x̄ − μ)²/s² has an F-distribution with 1 and n−1 degrees of freedom.

The following paragraphs comment on these properties and prove them.

Consider the sufficiency property (1). Factor the log likelihood function as

L(x,μ,σ²) = −(n/2)⋅log(2π) − (n/2)⋅log σ² − (1/(2σ²))⋅Σ_{i=1}^{n} (x_i − x̄ + x̄ − μ)²
          = −(n/2)⋅log(2π) − (n/2)⋅log σ² − (1/(2σ²))⋅Σ_{i=1}^{n} (x_i − x̄)² − (n/(2σ²))⋅(x̄ − μ)²
          = −(n/2)⋅log(2π) − (n/2)⋅log σ² − (1/2)⋅(n−1)s²/σ² − (n/2)⋅(x̄ − μ)²/σ².

This implies that x̄ and s² are jointly sufficient for μ and σ². Because the dimension of (x̄, s²) is the same as the dimension of (μ,σ²), they are obviously minimal sufficient statistics.

The expectation of x̄ is E x̄ = (1/n)⋅Σ_{i=1}^{n} E x_i = μ, since the expectation of each observation is μ. Hence x̄ is unbiased. To establish the expectation of s², first form the n×n matrix

M = I − 11′/n =
  [ 1−1/n    −1/n   ...   −1/n  ]
  [ −1/n    1−1/n   ...   −1/n  ]
  [  ...      ...   ...    ...  ]
  [ −1/n     −1/n   ...  1−1/n  ],

where I is the identity matrix and 1 is an n×1 vector of ones. This matrix is idempotent, with M² = M, and its trace satisfies

tr(M) = tr(I) − tr(11′/n) = n − 1′1/n = n − 1.

Let Z′ = (x_1 − μ,...,x_n − μ) denote the vector of deviations of observations from the population mean. Then Z′M = (x_1 − x̄,...,x_n − x̄) and s² = Z′M⋅MZ/(n−1) = Z′MZ/(n−1). Therefore, since with independent observations one has E ZZ′ = σ²I, one obtains

E s² = E(Z′MZ)/(n−1) = E tr(Z′MZ)/(n−1) = E tr(MZZ′)/(n−1) = tr(M⋅E(ZZ′))/(n−1) = σ²⋅tr(M)/(n−1) = σ².

Hence, s² is unbiased.
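A quick simulation check of this unbiasedness result (Python with NumPy; the particular μ, σ, sample size, and replication count are arbitrary): across many replications the average of s² should be close to σ², while the average of the uncorrected maximum likelihood estimator σ̂² should be close to σ²⋅(n−1)/n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)       # s^2, the sample variance with the n-1 correction
sigma2_hat = x.var(axis=1)       # the maximum likelihood estimator, divisor n

print("average s^2        :", round(s2.mean(), 4), "  target sigma^2         :", sigma**2)
print("average sigma_hat^2:", round(sigma2_hat.mean(), 4), "  target sigma^2*(n-1)/n:", sigma**2 * (n - 1) / n)
```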


Next consider the distribution of x̄. We use the fact that linear transformations of multivariate normal random vectors are again multivariate normal: if Z ~ N(μ,Ω) and W = CZ, then W ~ N(Cμ, CΩC′). This result holds even if Z and W are of different dimensions, or C is of less than full rank. (If the rank of CΩC′ is less than full, then the random variable has all its density concentrated on a subspace.) Now x̄ = Cx, where C = (1/n,...,1/n) and x = (x_1,...,x_n)′, and x is multivariate normal with mean 1⋅μ and covariance matrix σ²I, where 1 is an n×1 vector of ones and I is the n×n identity matrix. Therefore, x̄ ~ N(μC1, σ²CC′) = N(μ, σ²/n).

Next consider the distribution of s². We will need the following fact about statistical distributions: the sum of the squares of K independent standard normal random variables has a chi-square distribution with K degrees of freedom. (To prove this, first show that it holds for K = 1 by finding the density of the square of a standard normal random variable and noting that it coincides with the density of χ²_1. Then use the rules for moment generating functions to see that the sum of K independent χ²_1 random variables is χ²_K.) We also need the matrix result that any idempotent matrix M of dimension n and rank r can be written as M = WW′, where W is n×r and column-orthonormal (i.e., W′W = I_r). (To prove this, write M in terms of its singular value decomposition, and apply the conditions M = M′ and M⋅M = M.) Consider M = I − 11′/n, which has rank n−1 = tr(M), and the linear transformation

[ x̄ − μ ]  =  [ (1/n)1′ ] (x − 1⋅μ)  ≡  C(x − 1⋅μ).
[   u   ]     [    M    ]

The result of this transformation is then multivariate normal,

[ x̄ − μ ]  ~  N(0, σ²CC′).
[   u   ]

But

CC′ = [ 1/n   0 ]
      [  0    M ],

so that x̄ − μ and u are uncorrelated, hence (for joint normals) independent. Then x̄ is independent of any function of u, and specifically of s² = u′u/(n−1). The distribution of s² is obtained by noting that u′u = ε′Mε = ε′WW′ε, where ε = (x − 1⋅μ). But z = W′ε ~ N(0, σ²I_{n−1}), by the matrix result above for idempotent matrices. Hence, (n−1)s²/σ² = u′u/σ² = z′z/σ² is the sum of squares of n−1 independent standard normal random variates, so that it is distributed χ²_{n−1}. The results that √n⋅(x̄ − μ)/s has a Student's-T distribution with n−1 degrees of freedom, and that n⋅(x̄ − μ)²/s² has an F-distribution with 1 and n−1 degrees of freedom, follow from properties of distributions related to the normal; see Chapter 3.9.
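Results (5)-(7) are also easy to spot-check by simulation. The sketch below (Python with NumPy and SciPy; μ, σ, n, and the replication count are illustrative choices) compares simulated quantiles of (n−1)s²/σ² and √n⋅(x̄ − μ)/s with the χ²_{n−1} and Student's-T(n−1) reference distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

q = (n - 1) * s2 / sigma**2                    # claimed chi-square with n-1 d.f.
t = np.sqrt(n) * (xbar - mu) / np.sqrt(s2)     # claimed Student's-T with n-1 d.f.

for p in (0.5, 0.9, 0.99):
    print(p,
          round(np.quantile(q, p), 3), "vs", round(stats.chi2.ppf(p, df=n - 1), 3), "|",
          round(np.quantile(t, p), 3), "vs", round(stats.t.ppf(p, df=n - 1), 3))
```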

4. Large Sample Properties of Maximum Likelihood Estimates

This section provides a brief and informal introduction to the statistical properties of maximum likelihood estimators and similar estimation methods in large samples. Consider a simple random sample x = (x_1,...,x_n) from a population in which the density of an observation is f(x,θ_o). The DGP density or likelihood of the sample is then f(x,θ) = f(x_1,θ)⋅...⋅f(x_n,θ), with θ_o the true value of θ. The log likelihood of an observation is l(x,θ) = log f(x,θ), and the log likelihood of the sample is L_n(x,θ) = (1/n)⋅Σ_{i=1}^{n} l(x_i,θ).⁵ The maximum likelihood estimator T_n(x) is a value of θ which maximizes L_n(x,θ). The first-order condition for this maximum is that the sample score,

⁵ For the purposes of this section, it will be convenient to scale the sample log likelihood by 1/n so that it is an average of the log likelihoods of the individual observations. Obviously one can go from this definition to a definition of the sample log likelihood without scaling simply by multiplying by n.
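As a concrete illustration of the estimator just defined, the sketch below (Python with NumPy and SciPy; the exponential model, data size, and optimizer settings are illustrative assumptions) computes T_n(x) by numerically maximizing the average log likelihood L_n(x,θ), and compares it with the closed-form solution 1/x̄ for this model:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mle_exponential_rate(x):
    """Numerically maximize L_n(x, theta) = (1/n)*sum(log f(x_i, theta))
    for the exponential model f(x, theta) = theta*exp(-theta*x), theta > 0.
    The maximizer should agree with the closed form T_n(x) = 1/xbar."""
    def neg_avg_loglik(theta):
        return -np.mean(np.log(theta) - theta * x)
    result = minimize_scalar(neg_avg_loglik, bounds=(1e-8, 100.0), method="bounded")
    return result.x

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=2_000)    # true theta_o = 2.5
print(mle_exponential_rate(x), 1 / x.mean())      # numerical MLE vs closed form
```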