







Abstract. The Cramér-Rao Inequality provides a lower bound for the variance of an unbiased estimator of a parameter. It allows us to conclude that an unbiased estimator is a minimum variance unbiased estimator for a parameter. In these notes we prove the Cramér-Rao Inequality and examine some applications. We conclude with a discussion of a probability distribution for which the Cramér-Rao Inequality provides no useful information.
Contents

1 Description of the Problem
2 The Cramér-Rao Inequality
3 Examples and Exercises
A Interchanging Integration and Differentiation
B The Cauchy-Schwarz Inequality
C The Exponential Density
D When the Cramér-Rao Inequality Provides No Information
1 Description of the Problem

Point estimation is the use of a statistic to estimate the value of some parameter of a population having a particular type of density. The statistic we use is called the point estimator and its value is the point estimate. A desirable property for a point estimator Θ̂ of a parameter θ is that the expected value of Θ̂ is θ. If Θ̂ is a random variable with density f and values θ̂, this is equivalent to saying

\int_{-\infty}^{\infty} \hat{\theta}\, f(\hat{\theta})\, d\hat{\theta} = \theta.
An estimator having this property is said to be unbiased. Often in the process of making a point estimate, we must choose among several unbiased estimators for a given parameter. Thus we need to consider additional criteria to select one of the estimators for use. For example, suppose that X₁, X₂, ..., X_m are a random sample from a normal population with mean μ and variance σ², where m = 2n + 1 is an odd integer. Let the density of this population be f(x; μ, σ²). Suppose we wish to estimate the mean, μ, of this population. It is well known that both the sample mean and the sample median are unbiased estimators of the mean (cf. [MM]). Often, we will take the unbiased estimator having the smallest variance. The variance of Θ̂ is, as for any random variable, the second moment about the mean:

\mathrm{var}(\hat{\Theta}) = \int_{-\infty}^{\infty} \left(\hat{\theta} - \mu_{\hat{\Theta}}\right)^{2} f(\hat{\theta})\, d\hat{\theta}.
Here, μ_{Θ̂} is the mean of the random variable Θ̂, which is θ in the case of an unbiased estimator. Choosing the estimator with the smaller variance is a natural thing to do, but it is by no means the only possible choice. If two estimators have the same expected value, then while their average values will be equal, the estimator with the greater variance will have larger fluctuations about this common value. An estimator with a smaller variance is said to be relatively more efficient, because its values tend to be concentrated more closely about the correct value of the parameter; thus it allows us to be more confident that our estimate will be as close to the actual value as we would like. Furthermore, the quantity

\frac{\mathrm{var}(\hat{\Theta}_1)}{\mathrm{var}(\hat{\Theta}_2)}
is used as a measure of the efficiency of Θ̂₂ relative to Θ̂₁ [MM]. We hope to maximize efficiency by minimizing variance. In our example, the sample mean has variance σ²/m = σ²/(2n + 1). If the population median is μ̃, that is, if μ̃ satisfies

\int_{-\infty}^{\tilde{\mu}} f(x; \mu, \sigma^2)\, dx = \frac{1}{2},

then, according to [MM], the sampling distribution of the sample median is approximately normal with mean μ̃ and variance

\frac{1}{8n \cdot f(\tilde{\mu})^{2}}.

Since the normal distribution of our example is symmetric, we must have μ̃ = μ, which makes it easy to show that f(μ̃) = 1/\sqrt{2\pi\sigma^2}. The variance of the sample median is therefore πσ²/(4n). Certainly, in our example, the sample mean has the smaller variance of the two estimators, but we would like to know whether an estimator with smaller variance exists. More precisely, it would be very useful to have a lower bound on the variance of an unbiased estimator. Clearly, the variance must be non-negative¹, but it would be useful to have a less trivial lower bound. The Cramér-Rao Inequality is a theorem that provides such a bound under very general conditions. It does not, however, provide any assurance that an estimator exists that attains the minimum variance allowed by this bound.

¹It is possible for the variance of an estimator to be zero. Consider the following case: we always estimate the mean to be 0, no matter what sample values we observe. This is a terrific estimate if the mean happens to be 0, and a poor estimate otherwise. Note, however, that the variance of our estimator is zero!
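The comparison above is easy to check numerically. The following is a minimal simulation sketch (the values of μ, σ and the sample size are arbitrary choices, and NumPy is assumed to be available) that estimates the variances of the sample mean and the sample median for a normal population and compares them with σ²/m and πσ²/(4n).

import numpy as np

# Monte Carlo comparison of the sample mean and the sample median as
# estimators of the mean of a normal population.
rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0          # population parameters (arbitrary choices)
n = 50
m = 2 * n + 1                 # odd sample size, as in the text
trials = 20000

samples = rng.normal(mu, sigma, size=(trials, m))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("var(sample mean)  :", means.var(),   "  theory:", sigma**2 / m)
print("var(sample median):", medians.var(), "  theory:", np.pi * sigma**2 / (4 * n))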
2 The Cramér-Rao Inequality

The Cramér-Rao Inequality provides us with a lower bound on the variance of an unbiased estimator for a parameter.
Cramér-Rao Inequality. Let f(x; θ) be a probability density with continuous parameter θ. Let X₁, ..., Xₙ be independent random variables with density f(x; θ), and let Θ̂(X₁, ..., Xₙ) be an unbiased estimator of θ. Assume that f(x; θ) satisfies two conditions: first, that

\frac{\partial}{\partial \theta}\left[\int\cdots\int \hat{\Theta}(x_1,\dots,x_n) \prod_{i=1}^{n} f(x_i;\theta)\, dx_i\right] = \int\cdots\int \hat{\Theta}(x_1,\dots,x_n)\, \frac{\partial \prod_{i=1}^{n} f(x_i;\theta)}{\partial \theta}\, dx_1 \cdots dx_n,   (2.1)

and second, that differentiation with respect to θ may likewise be passed inside \int f(x;\theta)\, dx. Conditions under which these interchanges hold are reproduced from [HH] in Appendix A. Then

\mathrm{var}(\hat{\Theta}) \ \ge\ \frac{1}{n\, E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^{2}\right]},   (2.2)

where E denotes the expected value with respect to the probability density function f(x; θ).
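To see what the bound says in a familiar case, the following minimal sketch (parameter values arbitrary; NumPy assumed available) estimates the Fisher information E[(∂ log f(x; θ)/∂θ)²] by Monte Carlo for a normal density with unknown mean μ and known σ, and compares the resulting bound 1/(nE[·]) with the simulated variances of the sample mean and sample median.

import numpy as np

# Monte Carlo check of the Cramer-Rao bound for estimating the mean mu of a
# N(mu, sigma^2) population with sigma known.
rng = np.random.default_rng(1)
mu, sigma, n, trials = 1.5, 2.0, 40, 20000

# Score of one observation: d/dmu log f(x; mu) = (x - mu) / sigma^2.
x = rng.normal(mu, sigma, size=10**6)
info = np.mean(((x - mu) / sigma**2) ** 2)   # estimates E[score^2] = 1/sigma^2
crlb = 1.0 / (n * info)                      # Cramer-Rao lower bound

samples = rng.normal(mu, sigma, size=(trials, n))
print("Cramer-Rao lower bound:", crlb)       # approximately sigma^2 / n
print("var(sample mean)      :", samples.mean(axis=1).var())
print("var(sample median)    :", np.median(samples, axis=1).var())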
Proof. We prove the theorem as in [CaBe]. Let Θ̂(X⃗) = Θ̂(X₁, ..., Xₙ). We assume that our estimator depends only on the sample values X₁, ..., Xₙ and is independent of θ. Since Θ̂(X⃗) is unbiased as an estimator for θ, we have E[Θ̂] = θ. From this we have

0 = E[\hat{\Theta} - \theta] = \int\cdots\int \left(\hat{\Theta}(x_1,\dots,x_n) - \theta\right) f(x_1;\theta) \cdots f(x_n;\theta)\, dx_1 \cdots dx_n.
Differentiating both sides of the last equation with respect to θ and using condition (2.1) to interchange the differentiation and the integration, we obtain

1 = \int\cdots\int \left(\hat{\Theta}(\vec{x}) - \theta\right) \left(\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\right) \varphi(\vec{x};\theta)\, d\vec{x},   (2.7)

where we write \varphi(\vec{x};\theta) = \prod_{i=1}^{n} f(x_i;\theta) and d\vec{x} = dx_1 \cdots dx_n. We square both sides of (2.7), obtaining

1 = \left(\int\cdots\int \left(\hat{\Theta}(\vec{x}) - \theta\right) \varphi(\vec{x};\theta)^{1/2} \cdot \varphi(\vec{x};\theta)^{1/2} \sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\, d\vec{x}\right)^{2}.   (2.8)
We now apply the Cauchy-Schwarz Inequality (Appendix B) to (2.8). Thus

1 \le \int\cdots\int \left(\hat{\Theta}(\vec{x}) - \theta\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x} \ \cdot\ \int\cdots\int \left(\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x}.   (2.9)

There are two multiple integrals to evaluate on the right hand side. The first multiple integral is just the definition of the variance of the estimator Θ̂, which we denote by var(Θ̂). Thus (2.9) becomes

1 \le \mathrm{var}(\hat{\Theta}) \cdot \int\cdots\int \left(\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x}.   (2.10)
To finish the proof of the Cramér-Rao Inequality, it suffices to show

\int\cdots\int \left(\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x} \ =\ n\, E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^{2}\right].   (2.11)
This is because, once (2.11) is established, simple division yields the Cramér-Rao Inequality from (2.10). We now prove (2.11). We have

\int\cdots\int \left(\sum_{i=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x} = \int\cdots\int \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\, \frac{\partial \log f(x_j;\theta)}{\partial \theta}\, \varphi(\vec{x};\theta)\, d\vec{x} = I_1 + I_2,

where

I_1 = \int\cdots\int \sum_{i=1}^{n} \left(\frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x}, \qquad I_2 = \int\cdots\int \sum_{\substack{1 \le i,j \le n \\ i \ne j}} \frac{\partial \log f(x_i;\theta)}{\partial \theta}\, \frac{\partial \log f(x_j;\theta)}{\partial \theta}\, \varphi(\vec{x};\theta)\, d\vec{x}.   (2.13)
The proof is completed by showing I₁ = n E[(∂ log f(x;θ)/∂θ)²] and I₂ = 0. We have

I_1 = \int\cdots\int \sum_{i=1}^{n} \left(\frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} \varphi(\vec{x};\theta)\, d\vec{x}
    = \sum_{i=1}^{n} \int \left(\frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} f(x_i;\theta)\, dx_i \ \cdot \prod_{\substack{\ell = 1 \\ \ell \ne i}}^{n} \int f(x_\ell;\theta)\, dx_\ell
    = \sum_{i=1}^{n} \int \left(\frac{\partial \log f(x_i;\theta)}{\partial \theta}\right)^{2} f(x_i;\theta)\, dx_i \cdot 1^{\,n-1}
    = n\, E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^{2}\right].

In the above calculation, we used the fact that each f(x_ℓ; θ) is a probability density, and therefore integrates to one. In the final expected value, x_i is a dummy variable, and we may denote these n identical expected values with a common symbol. We now turn to the analysis of I₂. In obvious notation, we may write

I_2 = \sum_{\substack{1 \le i,j \le n \\ i \ne j}} I_2(i,j).   (2.15)
To show I₂ = 0 it suffices to show each I₂(i, j) = 0, which we now proceed to do. Note
I_2(i,j) = \int\cdots\int \frac{\partial \log f(x_i;\theta)}{\partial \theta}\, \frac{\partial \log f(x_j;\theta)}{\partial \theta}\, \varphi(\vec{x};\theta)\, d\vec{x}
         = \int \frac{\partial \log f(x_i;\theta)}{\partial \theta}\, f(x_i;\theta)\, dx_i \ \cdot\ \int \frac{\partial \log f(x_j;\theta)}{\partial \theta}\, f(x_j;\theta)\, dx_j \ \cdot \prod_{\substack{\ell = 1 \\ \ell \ne i,j}}^{n} \int f(x_\ell;\theta)\, dx_\ell
         = \int \frac{\partial \log f(x_i;\theta)}{\partial \theta}\, f(x_i;\theta)\, dx_i \ \cdot\ \int \frac{\partial \log f(x_j;\theta)}{\partial \theta}\, f(x_j;\theta)\, dx_j \cdot 1^{\,n-2}
         = E\left[\frac{\partial \log f(x_i;\theta)}{\partial \theta}\right] \cdot E\left[\frac{\partial \log f(x_j;\theta)}{\partial \theta}\right];
however, each of these two expected values is zero! To see this, note that since f(x; θ) is a probability density,

1 = \int f(x;\theta)\, dx.   (2.17)

If we differentiate both sides of (2.17) with respect to θ, we find

0 = \int \frac{\partial f(x;\theta)}{\partial \theta}\, dx = \int \frac{1}{f(x;\theta)}\, \frac{\partial f(x;\theta)}{\partial \theta}\, f(x;\theta)\, dx = \int \frac{\partial \log f(x;\theta)}{\partial \theta}\, f(x;\theta)\, dx = E\left[\frac{\partial \log f(x;\theta)}{\partial \theta}\right].

This shows I₂(i, j) = 0, which completes the proof.
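The two facts used at the end of the proof, namely that the score ∂ log f(x;θ)/∂θ has mean zero and that E[(Σᵢ ∂ log f(Xᵢ;θ)/∂θ)²] = n E[(∂ log f(X;θ)/∂θ)²], can be checked numerically. The following minimal sketch does so for an exponential density f(x;θ) = (1/θ)e^{-x/θ} (the choice of density, θ, n and the simulation sizes are arbitrary; NumPy assumed available).

import numpy as np

# Numerical check of two identities used in the proof, for an exponential
# density f(x; theta) = (1/theta) * exp(-x/theta):
#   E[score(X)] = 0,  where score(x) = d/dtheta log f(x; theta)
#   E[(sum of n scores)^2] = n * E[score(X)^2]          (equation (2.11))
rng = np.random.default_rng(2)
theta, n, trials = 2.0, 5, 200000

def score(x, theta):
    # d/dtheta log f(x; theta) = -1/theta + x/theta^2
    return -1.0 / theta + x / theta**2

x = rng.exponential(theta, size=10**6)
print("E[score]            :", score(x, theta).mean())        # approximately 0

xs = rng.exponential(theta, size=(trials, n))
total = score(xs, theta).sum(axis=1)
print("E[(sum of scores)^2]:", (total**2).mean())
print("n * E[score^2]      :", n * (score(x, theta)**2).mean())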
An estimator for which equality holds in (2.2) is called a minimum variance unbiased estimator or simply a best unbiased estimator. The expected value in the Cramér-Rao Inequality is called the information number or the Fisher information of the sample. We notice that the theorem makes no statement about whether equality holds for any particular estimator Θ̂. Indeed, in Appendix D, we give an example in which the information is infinite, and the bound provided is therefore var(Θ̂) ≥ 0, which is trivial.
3 Examples and Exercises

Example 3.1. We first consider estimating the parameter of an exponential population based on a sample of size m = 2n + 1. This population has density

f(x;\theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-x/\theta} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}
We consider two estimators, one based on the sample mean and the other on the sample median. We know from the Central Limit Theorem that for large m, the sample mean will have an approximately normal distribution whose mean is θ, the population mean, and whose variance is θ²/m = θ²/(2n + 1), where θ² is the variance computed from the exponential density (the mean and variance are computed in Appendix C). For large n, the sample median Y_{n+1} has approximately a normal distribution with mean equal to μ̃, the population median, and variance 1/(8n · f(μ̃)²) [MM]. By definition, the population median satisfies

\int_{-\infty}^{\tilde{\mu}} f(x;\theta)\, dx = \int_{0}^{\tilde{\mu}} \frac{1}{\theta}\, e^{-x/\theta}\, dx = \frac{1}{2}.

Solving gives μ̃ = θ log 2, so that f(μ̃) = 1/(2θ) and the variance of the sample median is approximately 4θ²/(8n) = θ²/(2n). An estimator of θ based on the sample median is therefore Y_{n+1}/log 2, whose variance is approximately θ²/(2n log² 2) ≈ 1.04 θ²/n, roughly twice the variance θ²/(2n + 1) of the sample mean; of these two estimators, the sample mean is the more efficient.
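The variances appearing in Example 3.1 can be checked by simulation. The sketch below (values of θ and n arbitrary; NumPy assumed available) rescales the sample median by 1/log 2 so that both statistics estimate θ.

import numpy as np

# Simulation for Example 3.1: exponential population with parameter theta.
# The sample mean estimates theta directly; the sample median estimates the
# population median theta*log(2), so it is divided by log(2) before comparing.
rng = np.random.default_rng(3)
theta, n, trials = 2.0, 50, 20000
m = 2 * n + 1

samples = rng.exponential(theta, size=(trials, m))
mean_est = samples.mean(axis=1)
median_est = np.median(samples, axis=1) / np.log(2)

print("var(sample mean estimator)  :", mean_est.var(),
      "  theory:", theta**2 / m)
print("var(median-based estimator) :", median_est.var(),
      "  theory:", theta**2 / (2 * n * np.log(2)**2))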
Example 3.2. Consider a uniform population with density

f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 \le x \le \theta, \\ 0 & \text{otherwise}, \end{cases}

and a sample consisting of a single observation X. Since E[X] = θ/2, the statistic Θ̂(X) = 2X is an unbiased estimator of θ. We check condition (2.1). On the left side, we have

\frac{\partial}{\partial \theta} \int \hat{\Theta}(x)\, f(x;\theta)\, dx = \frac{\partial}{\partial \theta} \int_{0}^{\theta} 2x \cdot \frac{1}{\theta}\, dx = \frac{\partial}{\partial \theta}\, \theta = 1.

On the right side, we have

\int \hat{\Theta}(x)\, \frac{\partial f(x;\theta)}{\partial \theta}\, dx = \int_{0}^{\theta} 2x\, \frac{\partial}{\partial \theta}\left(\frac{1}{\theta}\right) dx = -\frac{1}{\theta^{2}} \int_{0}^{\theta} 2x\, dx = -1.   (3.27)
It is therefore clear that condition (2.1) does not hold, so we cannot assume that the Cramér-Rao Inequality holds. Indeed, we will show that it does not. We first compute the information of the sample. Since log f(x;θ) = log(1/θ) = −log θ for 0 ≤ x ≤ θ and the sample consists of a single observation,

n\, E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^{2}\right] = E\left[\left(\frac{\partial \log(1/\theta)}{\partial \theta}\right)^{2}\right] = \left(\frac{\partial(-\log\theta)}{\partial \theta}\right)^{2} = \left(-\frac{1}{\theta}\right)^{2} = \frac{1}{\theta^{2}}.
Therefore, if it applied, the Cramér-Rao Inequality would tell us that var(Θ̂) ≥ θ². We now compute the variance of Θ̂ = 2X:

\mathrm{var}(2X) = E\left[(2X)^{2}\right] - \left(E[2X]\right)^{2} = \int_{0}^{\theta} (2x)^{2} \cdot \frac{1}{\theta}\, dx - \theta^{2} = \frac{4\theta^{2}}{3} - \theta^{2} = \frac{\theta^{2}}{3}.
We therefore see that the Cramér-Rao Inequality need not hold when condition (2.1) is not satisfied. We note that this example has the property that the region in which the density function is nonzero depends on the parameter that we are estimating. In such cases we must be particularly careful, as condition (2.1) will often fail.
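The conclusion of Example 3.2 is easy to confirm by simulation. The minimal sketch below (value of θ arbitrary; NumPy assumed available) estimates the mean and variance of 2X for a single uniform observation and compares the variance with θ²/3 and with the inapplicable bound θ².

import numpy as np

# Simulation for Example 3.2: X ~ Uniform(0, theta) and the unbiased estimator 2X.
# Its variance theta^2/3 falls below the (inapplicable) Cramer-Rao bound theta^2,
# because condition (2.1) fails for this density.
rng = np.random.default_rng(4)
theta, trials = 3.0, 10**6

x = rng.uniform(0.0, theta, size=trials)
est = 2 * x
print("mean of 2X         :", est.mean(), "  (theta =", theta, ")")
print("var of 2X          :", est.var(),  "  theory: theta^2/3 =", theta**2 / 3)
print("inapplicable bound :", theta**2)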
Exercise 3.3. Show that the sample mean is a minimum variance unbiased estimator for the mean of a normal population.
Exercise 3.4. Let X be a random variable with a binomial distribution with parameters n and θ. Is

n \cdot \frac{X}{n}\left(1 - \frac{X}{n}\right)

a minimum variance unbiased estimator for the variance of X?
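One way to begin exploring Exercise 3.4 is numerically. The sketch below (values of n and θ arbitrary; NumPy assumed available) estimates the expectation of the proposed statistic by simulation, which can then be compared with var(X) = nθ(1 − θ) before turning to the Cramér-Rao bound.

import numpy as np

# Exploratory check for Exercise 3.4: estimate E[ n*(X/n)*(1 - X/n) ] by
# simulation and compare with var(X) = n*theta*(1 - theta), X ~ Binomial(n, theta).
rng = np.random.default_rng(5)
n, theta, trials = 20, 0.3, 10**6

x = rng.binomial(n, theta, size=trials)
t = n * (x / n) * (1 - x / n)
print("Monte Carlo E[T]       :", t.mean())
print("var(X) = n*th*(1 - th) :", n * theta * (1 - theta))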
A Interchanging Integration and Differentiation

Theorem A.1 (Differentiating under the integral sign). Let f(t, x) : ℝ^{n+1} → ℝ be a function such that for each fixed t the integral

F(t) = \int_{\mathbb{R}^n} f(t, x)\, dx_1 \cdots dx_n   (A.30)

exists. Suppose that ∂f/∂t exists for all x³, and that there is a continuous Riemann integrable function⁴ g(x) such that

\left|\frac{f(s, x) - f(t, x)}{s - t}\right| \le g(x)   (A.31)

for all s ≠ t. Then F is differentiable, and

\frac{dF}{dt} = \int_{\mathbb{R}^n} \frac{\partial f}{\partial t}(t, x)\, dx_1 \cdots dx_n.   (A.32)

³Technically, all we need is that ∂f/∂t exists for almost all x, i.e., except for a set of measure zero.
⁴This condition can be weakened; it suffices for g(x) to be a Lebesgue integrable function.
The above statement is modified from that of Theorem 4.11.22 of [HH]. See page 518 of [HH] for a proof. We have stated a slightly weaker version (and commented in the footnotes on the most general statement) because these weaker cases often suffice for our applications.
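Theorem A.1 is easy to illustrate numerically for a concrete integrand. The sketch below (the integrand e^{-t x²} on [0, 1], the value of t and the step sizes are arbitrary choices; NumPy assumed available) compares a finite-difference approximation of dF/dt with the integral of ∂f/∂t.

import numpy as np

# Numerical illustration of differentiating under the integral sign with
# F(t) = integral_0^1 exp(-t * x^2) dx, approximated by a Riemann sum.
x = np.linspace(0.0, 1.0, 200001)
dx = x[1] - x[0]

def F(t):
    return np.sum(np.exp(-t * x**2)) * dx

t, h = 0.7, 1e-5
finite_difference = (F(t + h) - F(t - h)) / (2 * h)
integral_of_dfdt = np.sum(-x**2 * np.exp(-t * x**2)) * dx

print("finite difference of F(t) :", finite_difference)
print("integral of df/dt over x  :", integral_of_dfdt)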
Exercise A.2. It is not always the case that one can interchange orders of operations. We saw in Example 3.2 a case where we cannot interchange the integration and the differentiation. We now give an example which shows that we cannot always interchange orders of integration. For simplicity, we give a sequence a_{m,n} such that \sum_m \left(\sum_n a_{m,n}\right) \ne \sum_n \left(\sum_m a_{m,n}\right). For m, n ≥ 0 let

a_{m,n} = \begin{cases} 1 & \text{if } n = m, \\ -1 & \text{if } n = m + 1, \\ 0 & \text{otherwise}. \end{cases}

Show that the two different orders of summation yield different answers. The reason the Fubini Theorem is not applicable here is that \sum_n \sum_m |a_{m,n}| = \infty.
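The two iterated sums in Exercise A.2 can be evaluated directly, since for each fixed value of the outer index only finitely many terms are nonzero. The following short sketch (the truncation point is an arbitrary choice) computes both orders of summation.

# Exercise A.2, computed numerically: the two iterated sums of a_{m,n} differ.
# The inner ranges are chosen large enough to cover every nonzero term.
def a(m, n):
    if n == m:
        return 1
    if n == m + 1:
        return -1
    return 0

N = 100
sum_m_then_n = sum(sum(a(m, n) for n in range(N + 2)) for m in range(N))
sum_n_then_m = sum(sum(a(m, n) for m in range(N + 2)) for n in range(N))
print(sum_m_then_n, sum_n_then_m)   # prints 0 1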
B The Cauchy-Schwarz Inequality

The Cauchy-Schwarz Inequality is a general result from linear algebra pertaining to inner product spaces. Here we will consider only an application to Riemann integrable functions. For a more thorough treatment of the general form of the inequality, we refer the reader to Chapter 8 of [HK].

Cauchy-Schwarz Inequality. Let f, g be Riemann integrable real-valued functions on ℝⁿ. Then

\left(\int\cdots\int f(x_1,\dots,x_n)\, g(x_1,\dots,x_n)\, dx_1 \cdots dx_n\right)^{2} \le \int\cdots\int f(x_1,\dots,x_n)^{2}\, dx_1 \cdots dx_n \ \cdot\ \int\cdots\int g(x_1,\dots,x_n)^{2}\, dx_1 \cdots dx_n.

Proof. The proof given here is a special case of that given in [HK] (page 377). For notational convenience, we define

I(f, g) = \int\cdots\int f(x_1,\dots,x_n)\, g(x_1,\dots,x_n)\, dx_1 \cdots dx_n.

The statement of the theorem is then I(f, g)² ≤ I(f, f) I(g, g).
The following basic properties of I are consequences of standard properties of the integral, and we leave it as an exercise for the reader to show that they hold:

1. I(f, g) = I(g, f);
2. I(cf, g) = c I(f, g) for any constant c;
3. I(f + h, g) = I(f, g) + I(h, g);
4. I(f, f) ≥ 0.

In the case that I(f, f) = 0 we must also have I(f, g) = 0, so the inequality holds in this case. Otherwise, we let

h = g - \frac{I(g, f)}{I(f, f)} \cdot f.
We consider I(h, h), noting by property 4 above that this number must be nonnegative. Using the properties verified by the reader, we have

0 \le I(h, h) = I\left(g - \frac{I(g, f)}{I(f, f)} \cdot f,\ g - \frac{I(g, f)}{I(f, f)} \cdot f\right)
  = I(g, g) - \frac{I(g, f)}{I(f, f)}\, I(f, g) - \frac{I(g, f)}{I(f, f)}\, I(g, f) + \frac{I(g, f)^{2}}{I(f, f)^{2}}\, I(f, f)
  = I(g, g) - \frac{I(g, f)^{2}}{I(f, f)}.

It thus follows that

I(f, g)^{2} \le I(f, f)\, I(g, g).   (B.35)
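As a quick numerical sanity check of the integral form of the inequality, the sketch below (the interval [0, 1], the functions sin(3x) and e^{-x}, and the grid size are arbitrary choices; NumPy assumed available) compares the two sides using simple Riemann sums.

import numpy as np

# Numerical check of the integral Cauchy-Schwarz inequality on [0, 1].
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
f = np.sin(3 * x)
g = np.exp(-x)

lhs = (np.sum(f * g) * dx) ** 2
rhs = (np.sum(f**2) * dx) * (np.sum(g**2) * dx)
print("I(f,g)^2      :", lhs)
print("I(f,f)*I(g,g) :", rhs)       # lhs <= rhs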
D When the Cramér-Rao Inequality Provides No Information

Consider

f(x;\theta) = \begin{cases} \dfrac{a_\theta}{x^{\theta} \log^{3} x} & \text{if } x \ge e, \\ 0 & \text{otherwise}, \end{cases}

where a_θ is chosen so that f(x; θ) is a probability density function. Thus

1 = \int_{e}^{\infty} \frac{a_\theta\, dx}{x^{\theta} \log^{3} x}.
We chose to have log³ x in the denominator to ensure that the above integral converges, as does log x times the integrand; however, the expected value appearing in (2.2) will not converge. For example, the integral of 1/(x log x) diverges (its antiderivative looks like log log x) but the integral of 1/(x log² x) converges (its antiderivative looks like 1/log x); see pages 62–63 of [Rud] for more on closely related series and integrals where one converges and the other does not.

This distribution is close to the Pareto distribution (or a power law). Pareto distributions are very useful in describing many natural phenomena; see for example [DM, Ne, NM]. The inclusion of the factor log⁻³ x allows us to have the exponent of x in the density function equal 1 and have the density function defined for arbitrarily large x; it is also needed in order to apply the Dominated Convergence Theorem to justify some of the arguments below. If we remove the logarithmic factor, then we obtain a probability distribution only if the density vanishes for large x. As log³ x is a very slowly varying function, our distribution f(x; θ) may be of use in modeling data from an unbounded distribution where one wants to allow a power law with exponent 1, but cannot, as the resulting probability integral would diverge. Such a situation occurs frequently in the Benford's Law literature; see [Hi, Rai] for more details.

We study the variance bounds for unbiased estimators Θ̂ of θ, and in particular we show that when θ = 1 the Cramér-Rao Inequality yields a useless bound. Note that it is not uncommon for the variance of an unbiased estimator to depend on the value of the parameter being estimated. For example, consider the uniform distribution on [0, θ]. Let X̄ denote the sample mean of n independent observations, and let Y_n = max_{1 ≤ i ≤ n} X_i be the largest observation. The expected values of 2X̄ and ((n+1)/n)·Y_n are both θ (implying each is an unbiased estimator for θ); however, var(2X̄) = θ²/(3n) and var(((n+1)/n)·Y_n) = θ²/(n(n+1)) both depend on θ, the parameter being estimated (see, for example, page 324 of [MM] for these calculations).
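These two variances for the uniform distribution are easy to confirm by simulation. The sketch below (values of θ and n arbitrary; NumPy assumed available) compares the empirical variances of the two unbiased estimators with θ²/(3n) and θ²/(n(n+1)).

import numpy as np

# Simulation of two unbiased estimators of theta for Uniform(0, theta):
# 2 * (sample mean) and ((n+1)/n) * (sample maximum).
rng = np.random.default_rng(6)
theta, n, trials = 5.0, 10, 200000

xs = rng.uniform(0.0, theta, size=(trials, n))
est_mean = 2 * xs.mean(axis=1)
est_max = (n + 1) / n * xs.max(axis=1)

print("var(2*Xbar)        :", est_mean.var(), "  theory:", theta**2 / (3 * n))
print("var((n+1)/n * Y_n) :", est_max.var(),  "  theory:", theta**2 / (n * (n + 1)))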
Lemma D.1. As a function of θ ∈ [1, ∞), a_θ is a strictly increasing function and a₁ = 2. It has a one-sided derivative at θ = 1, and da_θ/dθ|_{θ=1} ∈ (0, ∞).
Proof. We have

\frac{1}{a_\theta} = \int_{e}^{\infty} \frac{dx}{x^{\theta} \log^{3} x}.   (D.46)

When θ = 1 we have

\frac{1}{a_1} = \int_{e}^{\infty} \frac{dx}{x \log^{3} x},

which is clearly positive and finite. In fact, a₁ = 2 because the integral is

\int_{e}^{\infty} \frac{dx}{x \log^{3} x} = \int_{e}^{\infty} \log^{-3} x\, \frac{d \log x}{dx}\, dx = \left. -\frac{1}{2 \log^{2} x} \right|_{e}^{\infty} = \frac{1}{2};

though all we need below is that a₁ is finite and non-zero, we have chosen to start integrating at e to make a₁ easy to compute. It is clear that a_θ is strictly increasing with θ, as the integral in (D.46) is strictly decreasing with increasing θ (because the integrand is decreasing with increasing θ). We are left with determining the one-sided derivative of a_θ at θ = 1, as the derivative at any other point is handled similarly (but with easier convergence arguments). It is technically easier to study the derivative of 1/a_θ, as
\frac{d}{d\theta} \frac{1}{a_\theta} = -\frac{1}{a_\theta^{2}}\, \frac{da_\theta}{d\theta}   (D.48)

and

\frac{1}{a_\theta} = \int_{e}^{\infty} \frac{dx}{x^{\theta} \log^{3} x}.

The reason we consider the derivative of 1/a_θ is that this avoids having to take the derivative of the reciprocal of an integral. As a₁ is finite and non-zero, it is easy to pass from the derivative of 1/a_θ at θ = 1 to da_θ/dθ|_{θ=1}. Thus we have
\left. \frac{d}{d\theta} \frac{1}{a_\theta} \right|_{\theta=1} = \lim_{h \to 0^{+}} \frac{1}{h} \left[ \int_{e}^{\infty} \frac{dx}{x^{1+h} \log^{3} x} - \int_{e}^{\infty} \frac{dx}{x \log^{3} x} \right] = \lim_{h \to 0^{+}} \int_{e}^{\infty} \frac{1 - x^{h}}{h\, x^{h}}\, \frac{dx}{x \log^{3} x}.

We want to interchange the integration with respect to x and the limit with respect to h above. This interchange is permissible by the Dominated Convergence Theorem (see Appendix D.3 for details of the justification). Note

\lim_{h \to 0^{+}} \frac{1 - x^{h}}{h\, x^{h}} = -\log x;   (D.51)

one way to see this is to use the fact that the limit of a product is the product of the limits, and then use L'Hôpital's rule, writing x^{h} as e^{h \log x}. Therefore

\left. \frac{d}{d\theta} \frac{1}{a_\theta} \right|_{\theta=1} = -\int_{e}^{\infty} \frac{dx}{x \log^{2} x};   (D.52)

as this is finite and non-zero, this completes the proof and shows da_θ/dθ|_{θ=1} ∈ (0, ∞).
Remark D.2. We see now why we chose f(x;θ) = a_θ/(x^θ log³ x) instead of f(x;θ) = a_θ/(x^θ log² x). If we only had two factors of log x in the denominator, then the one-sided derivative of a_θ at θ = 1 would be infinite.
Remark D.3. Though the actual value of da_θ/dθ|_{θ=1} does not matter, we can compute it quite easily. By (D.52) we have

\left. \frac{d}{d\theta} \frac{1}{a_\theta} \right|_{\theta=1} = -\int_{e}^{\infty} \frac{dx}{x \log^{2} x} = -\int_{e}^{\infty} \log^{-2} x\, \frac{d \log x}{dx}\, dx = \left. \frac{1}{\log x} \right|_{e}^{\infty} = -1.

Thus by (D.48), and the fact that a₁ = 2 (Lemma D.1), we have

\left. \frac{da_\theta}{d\theta} \right|_{\theta=1} = -a_1^{2} \cdot \left. \frac{d}{d\theta} \frac{1}{a_\theta} \right|_{\theta=1} = 4.
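Lemma D.1 and Remark D.3 can be checked numerically. The sketch below (using the substitution u = log x to make the improper integrals easier for the quadrature routine, a one-sided step h chosen arbitrarily, and assuming NumPy and SciPy are available) estimates a₁ and the one-sided derivative of a_θ at θ = 1.

import numpy as np
from scipy.integrate import quad

# Numerical check: a_1 = 2 and d a_theta / d theta at theta = 1 is about 4.
def inv_a(theta):
    # 1/a_theta = int_e^inf dx / (x^theta log^3 x) = int_1^inf e^{(1-theta)u} / u^3 du
    val, _ = quad(lambda u: np.exp((1.0 - theta) * u) / u**3, 1.0, np.inf)
    return val

a1 = 1.0 / inv_a(1.0)
h = 1e-4
one_sided_derivative = (1.0 / inv_a(1.0 + h) - a1) / h

print("a_1                        :", a1)                     # close to 2
print("d a_theta/d theta at 1 (+) :", one_sided_derivative)   # close to 4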
We now compute the expected value E[(∂ log f(x;θ)/∂θ)²]; showing it is infinite when θ = 1 completes the proof of our main result. Note

\log f(x;\theta) = \log a_\theta - \theta \log x + \log\left(\log^{-3} x\right), \qquad \frac{\partial \log f(x;\theta)}{\partial \theta} = \frac{1}{a_\theta}\, \frac{da_\theta}{d\theta} - \log x.   (D.55)

By Lemma D.1 we know that da_θ/dθ is finite for each θ ≥ 1. Thus

E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^{2}\right] = \int_{e}^{\infty} \left(\frac{1}{a_\theta}\, \frac{da_\theta}{d\theta} - \log x\right)^{2} \frac{a_\theta\, dx}{x^{\theta} \log^{3} x}.

If θ > 1 then the expectation is finite and non-zero. We are left with the interesting case when θ = 1. As da_θ/dθ|_{θ=1} is finite and non-zero, for x sufficiently large (say x ≥ x₁ for some x₁, though by Remark D.3 we see that we may take any x₁ ≥ e⁴) we have

\left| \frac{1}{a_1}\, \left. \frac{da_\theta}{d\theta} \right|_{\theta=1} \right| \le \frac{\log x}{2}.

For such x the integrand above is at least (log² x/4) · a₁/(x log³ x) = a₁/(4 x log x), and since

\int_{x_1}^{\infty} \frac{a_1\, dx}{4\, x \log x} = \infty,

the information is infinite when θ = 1. The Cramér-Rao Inequality therefore gives only the trivial bound var(Θ̂) ≥ 0.
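The divergence can also be seen numerically: truncating the information integral for θ = 1 at larger and larger cutoffs produces values that grow without bound (roughly like a constant times log log of the cutoff). The sketch below (using the substitution u = log x, the values a₁ = 2 and a₁⁻¹ da_θ/dθ|_{θ=1} = 2 from Lemma D.1 and Remark D.3, and assuming NumPy and SciPy are available) evaluates a few truncations.

import numpy as np
from scipy.integrate import quad

# Truncated information integral at theta = 1, after substituting u = log x:
#   int_1^{log X} (c - u)^2 * a1 / u^3 du,  with a1 = 2 and c = 2.
a1, c = 2.0, 2.0

def truncated_information(log_cutoff):
    val, _ = quad(lambda u: (c - u)**2 * a1 / u**3, 1.0, log_cutoff)
    return val

for log_cutoff in (10, 100, 1000, 10000):
    print("log(cutoff) =", log_cutoff,
          " -> truncated information:", round(truncated_information(log_cutoff), 3))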
References

[CaBe] G. Casella and R. Berger, Statistical Inference, 2nd edition, Duxbury Advanced Series, Pacific Grove, CA, 2002.
[DM] D. Devoto and S. Martinez, Truncated Pareto Law and oresize distribution of ground rocks, Mathematical Geology 30 (1998), no. 6, 661–673.
[Hi] T. Hill, A statistical derivation of the significant-digit law, Statistical Science 10 (1996), 354–363.
[HK] K. Hoffman and R. Kunze, Linear Algebra, second edition, Prentice-Hall, Englewood Cliffs, NJ, 1971.
[HH] J. H. Hubbard and B. B. Hubbard, Vector Calculus, Linear Algebra, and Differential Forms, second edition, Prentice Hall, Upper Saddle River, NJ, 2002.
[MM] I. Miller and M. Miller, John E. Freund’s Mathematical Statistics with Applications, seventh edition, Prentice Hall, 2004.
[Ne] M. E. J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics 46 (2005), no. 5, 323–351.
[NM] M. Nigrini and S. J. Miller, Benford’s Law applied to hydrology data – results and relevance to other geophysical data, preprint.
[Rai] R. A. Raimi, The first digit problem, Amer. Math. Monthly 83 (1976), no. 7, 521–538.
[Rud] W. Rudin, Principles of Mathematical Analysis, third edition, International Series in Pure and Applied Mathematics, McGraw-Hill Inc., New York, 1976.
[SS] E. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces, Princeton University Press, Princeton, NJ, 2005.