Lecture Notes on Advanced Statistical Theory¹
Ryan Martin
Department of Mathematics, Statistics, and Computer Science
University of Illinois at Chicago
www.math.uic.edu/~rgmartin
January 10, 2016

¹These notes are meant to supplement the lectures for Stat 511 at UIC given by the author. The accompanying textbook for the course is Keener's Theoretical Statistics, Springer, 2010, and is referred to frequently throughout these notes. The author makes no guarantees that these notes are free of typos or other, more serious errors.



Contents

  • 1 Introduction and Preparations
    • 1.1 Introduction
    • 1.2 Mathematical preliminaries
      • 1.2.1 Measure and integration
      • 1.2.2 Basic group theory
      • 1.2.3 Convex sets and functions
    • 1.3 Probability
      • 1.3.1 Measure-theoretic formulation
      • 1.3.2 Conditional distributions
      • 1.3.3 Jensen’s inequality
      • 1.3.4 A concentration inequality
      • 1.3.5 The “fundamental theorem of statistics”
      • 1.3.6 Parametric families of distributions
    • 1.4 Conceptual preliminaries
      • 1.4.1 Ingredients of a statistical inference problem
      • 1.4.2 Reasoning from sample to population
    • 1.5 Exercises
  • 2 Exponential Families, Sufficiency, and Information
    • 2.1 Introduction
    • 2.2 Exponential families of distributions
    • 2.3 Sufficient statistics
      • 2.3.1 Definition and the factorization theorem
      • 2.3.2 Minimal sufficient statistics
      • 2.3.3 Ancillary and complete statistics
    • 2.4 Fisher information
      • 2.4.1 Definition
      • 2.4.2 Sufficiency and information
      • 2.4.3 Cramér–Rao inequality
      • 2.4.4 Other measures of information
    • 2.5 Conditioning
    • 2.6 Discussion
      • 2.6.1 Generalized linear models
      • 2.6.2 A bit more about conditioning
    • 2.7 Exercises
  • 3 Likelihood and Likelihood-based Methods
    • 3.1 Introduction
    • 3.2 Likelihood function
    • 3.3 Likelihood-based methods and first-order theory
      • 3.3.1 Maximum likelihood estimation
      • 3.3.2 Likelihood ratio tests
    • 3.4 Cautions concerning the first-order theory
    • 3.5 Alternatives to the first-order theory
      • 3.5.1 Bootstrap
      • 3.5.2 Monte Carlo and plausibility functions
    • 3.6 On advanced likelihood theory
      • 3.6.1 Overview
      • 3.6.2 “Modified” likelihood
      • 3.6.3 Asymptotic expansions
    • 3.7 A bit about computation
      • 3.7.1 Optimization
      • 3.7.2 Monte Carlo integration
    • 3.8 Discussion
    • 3.9 Exercises
  • 4 Bayesian Inference
    • 4.1 Introduction
    • 4.2 Bayesian analysis
      • 4.2.1 Basic setup of a Bayesian inference problem
      • 4.2.2 Bayes’s theorem
      • 4.2.3 Inference
      • 4.2.4 Marginalization
    • 4.3 Some examples
    • 4.4 Motivations for the Bayesian approach
      • 4.4.1 Some miscellaneous motivations
      • 4.4.2 Exchangeability and de Finetti's theorem
    • 4.5 Choice of priors
      • 4.5.1 Prior elicitation
      • 4.5.2 Convenient priors
      • 4.5.3 Many candidate priors and robust Bayes
      • 4.5.4 Objective or non-informative priors
    • 4.6 Bayesian large-sample theory
      • 4.6.1 Setup
      • 4.6.2 Laplace approximation
      • 4.6.3 Bernstein–von Mises theorem
    • 4.7 Concluding remarks
      • 4.7.1 Lots more details on Bayesian inference
      • 4.7.2 On Bayes and the likelihood principle
      • 4.7.3 On the “Bayesian” label
      • 4.7.4 On “objectivity”
      • 4.7.5 On the role of probability in statistical inference
    • 4.8 Exercises
  • 5 Statistical Decision Theory
    • 5.1 Introduction
    • 5.2 Admissibility
    • 5.3 Minimizing a “global” measure of risk
      • 5.3.1 Minimizing average risk
      • 5.3.2 Minimizing maximum risk
    • 5.4 Minimizing risk under constraints
      • 5.4.1 Unbiasedness constraints
      • 5.4.2 Equivariance constraints
      • 5.4.3 Type I error constraints
    • 5.5 Complete class theorems
    • 5.6 On minimax estimation of a normal mean
    • 5.7 Exercises
  • 6 More Asymptotic Theory (incomplete!)
    • 6.1 Introduction
    • 6.2 M- and Z-estimators
      • 6.2.1 Definition and examples
      • 6.2.2 Consistency
      • 6.2.3 Rates of convergence
      • 6.2.4 Asymptotic normality
    • 6.3 More on asymptotic normality and optimality
      • 6.3.1 Introduction
      • 6.3.2 Hodges’s provocative example
      • 6.3.3 Differentiability in quadratic mean
      • 6.3.4 Contiguity
      • 6.3.5 Local asymptotic normality
      • 6.3.6 On asymptotic optimality
    • 6.4 More Bayesian asymptotics
      • 6.4.1 Consistency
      • 6.4.2 Convergence rates
      • 6.4.3 Bernstein–von Mises theorem, revisited
    • 6.5 Concluding remarks
    • 6.6 Exercises

Chapter 1

Introduction and Preparations

1.1 Introduction

Stat 511 is a first course in advanced statistical theory. This first set of notes is intended to set the stage for the material that is the core of the course. In particular, these notes define the notation we shall use throughout, and also set the conceptual and mathematical level we will be working at. Naturally, both the conceptual and mathematical level will be higher than in an intermediate course, such as Stat 411 at UIC.

On the mathematical side, real analysis and, in particular, measure theory, is very important in probability and statistics. Indeed, measure theory is the foundation on which modern probability is built and, by the close connection between probability and statistics, it is natural that measure theory also permeates the statistics literature. Measure theory itself can be very abstract and difficult. I am not an expert in measure theory, and I don't expect you to be an expert either. But, in general, to read and understand research papers in statistical theory, one should at least be familiar with the basic terminology and results of measure theory. My presentation here is meant to introduce you to these basics, so that we have a working measure-theoretic vocabulary moving forward to our main focus in the course. Keener (2010), the course textbook, also takes a similar approach to its measure theory presentation. Besides measure theory, I will also give some brief introduction to group theory and convex sets/functions. The remainder of this first set of notes concerns the transitions from measure theory to probability and from probability to statistics.

On the conceptual side, besides being able to apply theory to particular examples, I hope to communicate why such theory was developed; that is, not only do I want you to be familiar with results and techniques, but I hope you can understand the motivation behind these developments.
Along these lines, in this chapter, I will discuss the basic ingredients of a statistical inference problem, along with some discussion about statistical reasoning, addressing the fundamental question: how to reason from sample to population? Surprisingly, there's no fully satisfactory answer to this question.

A measure μ is finite if μ(X) is a finite number. Probability measures (see Section 1.3.1) are special finite measures where μ(X) = 1. A measure μ is said to be σ-finite if there exists a sequence of sets {Ai} ⊂ A such that

⋃_{i=1}^∞ Ai = X and μ(Ai) < ∞ for each i.

Example 1.2. Let X be a countable set and A the class of all subsets of X; then clearly A is a σ-algebra. Define μ according to the rule

μ(A) = number of points in A, A ∈ A.

Then μ is a σ-finite measure which is referred to as counting measure.

Example 1.3. Let X be a subset of d-dimensional Euclidean space Rd. Take A to be the smallest σ-algebra that contains the collection of open rectangles

A = {(x1, ..., xd) : ai < xi < bi, i = 1, ..., d}, ai < bi.

Then A is the Borel σ-algebra on X, which contains all open and closed sets in X; but there are subsets of X that do not belong to A! Then the (unique) measure μ, defined by

μ(A) = ∏_{i=1}^d (bi − ai), for rectangles A ∈ A,

is called Lebesgue measure, and it's σ-finite.

Next we consider integration of a real-valued function f with respect to a measure μ on (X, A). This more general definition of integral satisfies most of the familiar properties from calculus, such as linearity, monotonicity, etc. But the calculus integral is defined only for a class of functions which is generally too small for our applications. The class of functions of interest are those which are measurable. In particular, a real-valued function f is measurable if and only if, for every real number a, the set {x : f(x) ≤ a} is in A. If A is a measurable set, then the indicator function IA(x), which equals 1 when x ∈ A and 0 otherwise, is measurable. More generally, a simple function

s(x) = ∑_{k=1}^K ak IAk(x)

is measurable provided that A1, ..., AK ∈ A. Continuous f are also usually measurable. The integral of a non-negative simple function s with respect to μ is defined as

∫ s dμ = ∑_{k=1}^K ak μ(Ak). (1.1)
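To make (1.1) concrete, here is a minimal numerical sketch (in Python, my illustration, not part of the notes), taking μ to be counting measure on a finite set, so that μ(A) is just the number of points in A:

```python
# Sketch (not from the notes): the integral (1.1) of a non-negative
# simple function s = sum_k a_k * I_{A_k} with respect to counting
# measure mu, for which mu(A) = |A|.

def simple_integral(coeffs, sets):
    """Integral of s = sum_k coeffs[k] * I_{sets[k]} wrt counting measure."""
    return sum(a * len(A) for a, A in zip(coeffs, sets))

# s = 2*I_{1,2} + 5*I_{3}: the integral is 2*2 + 5*1 = 9
value = simple_integral([2.0, 5.0], [{1, 2}, {3}])
print(value)  # 9.0
```

The general integral below is then built up from these simple-function integrals.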

Take a non-decreasing sequence of non-negative simple functions {sn} and define

f(x) = lim_{n→∞} sn(x). (1.2)

It can be shown that f defined in (1.2) is measurable. Then the integral of f with respect to μ is defined as

∫ f dμ = lim_{n→∞} ∫ sn dμ,

the limit of the simple function integrals. It turns out that the left-hand side does not depend on the particular sequence {sn}, so it's unique. In fact, an equivalent definition for the integral of a non-negative f is

∫ f dμ = sup_{0 ≤ s ≤ f, s simple} ∫ s dμ. (1.3)

For a general measurable function f which may take negative values, define

f+(x) = max{f(x), 0} and f−(x) = −min{f(x), 0}.

Both the positive part f+ and the negative part f− are non-negative, and f = f+ − f−. The integral of f with respect to μ is defined as

∫ f dμ = ∫ f+ dμ − ∫ f− dμ,

where the two integrals on the right-hand side are defined through (1.3). In general, a measurable function f is said to be μ-integrable, or just integrable, if ∫ f+ dμ and ∫ f− dμ are both finite.

Example 1.4 (Counting measure). If X = {x1, x2, ...} and μ is counting measure, then

∫ f dμ = ∑_{i=1}^∞ f(xi).

Example 1.5 (Lebesgue measure). If X is a Euclidean space and μ is Lebesgue measure, then ∫ f dμ exists and is equal to the usual Riemann integral of f from calculus whenever the latter exists. But the Lebesgue integral exists for f which are not Riemann integrable.
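A quick numerical illustration of Examples 1.4 and 1.5 (a Python sketch, my addition): for f(x) = e^{−x}, the integral with respect to counting measure on {1, 2, ...} is the sum ∑ e^{−i} = 1/(e − 1), while with respect to Lebesgue measure on (0, ∞) it agrees with the improper Riemann integral ∫₀^∞ e^{−x} dx = 1.

```python
import math

# Integral of f(x) = exp(-x) with respect to counting measure on
# {1, 2, ...}: just a sum (Example 1.4), truncated where the tail
# is negligible.  The exact value is 1/(e - 1).
counting_integral = sum(math.exp(-i) for i in range(1, 60))

# Integral of the same f with respect to Lebesgue measure on (0, ∞)
# (Example 1.5): here it agrees with the Riemann integral, which we
# approximate by a midpoint rule on [0, 40] (tail beyond 40 ~ e^{-40}).
n, b = 40_000, 40.0
h = b / n
lebesgue_integral = sum(math.exp(-(k + 0.5) * h) * h for k in range(n))

assert abs(counting_integral - 1 / (math.e - 1)) < 1e-12
assert abs(lebesgue_integral - 1.0) < 1e-6
```

Same integrand, two very different measures: one integral is a series, the other an area.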

Next we list some important results from analysis, related to integrals. The first two have to do with interchange of limits¹ and integration, which is often important in statistical problems. The first is relatively weak, but is used in the proof of the second.

Theorem 1.1 (Fatou's lemma). Given {fn}, non-negative and measurable,

∫ (lim inf_{n→∞} fn) dμ ≤ lim inf_{n→∞} ∫ fn dμ.

The opposite inequality holds for lim sup, provided that |fn| ≤ g for integrable g.

¹Recall the notions of "lim sup" and "lim inf" from analysis. For example, if xn is a sequence of real numbers, then lim sup_{n→∞} xn = inf_n sup_{k≥n} xk and, intuitively, this is the largest accumulation point of the sequence; similarly, lim inf_{n→∞} xn is the smallest accumulation point, and if the largest and smallest accumulation points are equal, then the sequence converges and the common accumulation point is the limit. Also, if fn is a sequence of real-valued functions, then we can define lim sup fn and lim inf fn by applying the previous definitions pointwise.

it is clear that the only way there can be fewer than two real roots is if b² − 4ac ≤ 0. Using the definitions of a, b, and c we find that

(∫ f g dμ)² − ∫ f² dμ · ∫ g² dμ ≤ 0,

and from this the result follows immediately. A different proof, based on Jensen's inequality, is given in Example 1.8.

The next result defines “double-integrals” and shows that, under certain conditions, the order of integration does not matter. Fudging a little bit on the details, for two measure spaces (X, A, μ) and (Y, B, ν), define the product space

(X × Y, A ⊗ B, μ × ν),

where X × Y is the usual set of ordered pairs (x, y), A ⊗ B is the smallest σ-algebra that contains all the sets A × B for A ∈ A and B ∈ B, and μ × ν is the product measure defined as

(μ × ν)(A × B) = μ(A)ν(B).

This concept is important for us because independent probability distributions induce a product measure. Fubini’s theorem is a powerful result that allows certain integrals over the product to be done one dimension at a time.

Theorem 1.4 (Fubini). Let f(x, y) be a non-negative measurable function on X × Y. Then

∫_X [ ∫_Y f(x, y) dν(y) ] dμ(x) = ∫_Y [ ∫_X f(x, y) dμ(x) ] dν(y). (1.4)

The common value above is the double integral, written ∫_{X×Y} f d(μ × ν).

Our last result has something to do with constructing new measures from old. It also allows us to generalize the familiar notion of probability densities which, in turn, will make our lives easier when discussing the general statistical inference problem. Suppose f is a non-negative² measurable function. Then

ν(A) = ∫_A f dμ (1.5)

defines a new measure ν on (X, A). An important property is that μ(A) = 0 implies ν(A) = 0; the terminology is that ν is absolutely continuous with respect to μ, or ν is dominated by μ, written ν ≪ μ. But it turns out that, if ν ≪ μ, then there exists f such that (1.5) holds. This is the famous Radon–Nikodym theorem.

Theorem 1.5 (Radon–Nikodym). Suppose ν ≪ μ. Then there exists a non-negative μ-integrable function f, unique modulo μ-null sets, such that (1.5) holds. The function f, often written as f = dν/dμ, is the Radon–Nikodym derivative of ν with respect to μ.

²f can take negative values, but then the measure is a signed measure.

We'll see later that, in statistical problems, the Radon–Nikodym derivative is the familiar density or, perhaps, a likelihood ratio. The Radon–Nikodym theorem also formalizes the idea of change-of-variables in integration. For example, suppose that μ and ν are σ-finite measures defined on X, such that ν ≪ μ, so that there exists a unique Radon–Nikodym derivative f = dν/dμ. Then, for a ν-integrable function ϕ, we have

∫ ϕ dν = ∫ ϕ f dμ;

symbolically this makes sense: dν = (dν/dμ) dμ.
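A small numerical sketch of the change-of-variables formula (Python, my addition, not from the notes): take μ to be Lebesgue measure on (0, ∞) and ν the measure whose Radon–Nikodym derivative is f(x) = e^{−x}, i.e., the standard exponential distribution. Then for ϕ(x) = x, ∫ ϕ dν = ∫ ϕ f dμ = ∫₀^∞ x e^{−x} dx = 1.

```python
import math

def dnu_dmu(x):
    # Radon–Nikodym derivative f = dν/dμ: standard exponential density
    return math.exp(-x)

def integral_wrt_mu(g, a=0.0, b=50.0, n=200_000):
    # Midpoint-rule approximation of the Lebesgue integral of g on [a, b]
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

phi = lambda x: x

# ∫ ϕ dν computed as ∫ ϕ (dν/dμ) dμ; the exact value is Γ(2) = 1
lhs = integral_wrt_mu(lambda x: phi(x) * dnu_dmu(x))
assert abs(lhs - 1.0) < 1e-5
```

The same routine with g = dν/dμ alone returns ν(X) ≈ 1, confirming that ν is a probability measure.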

1.2.2 Basic group theory

An important mathematical object is that of a group, a set of elements together with a certain operation having a particular structure. Our particular interest (Section 1.3.6) is in groups of transformations and how they interact with probability distributions. Here we set some very basic terminology and understanding of groups. A course on abstract algebra would cover these concepts, and much more.

Definition 1.2. A group is a set G together with a binary operation ·, such that:

  • (closure) for each g1, g2 ∈ G, g1 · g2 ∈ G;
  • (identity) there exists e ∈ G such that e · g = g for all g ∈ G;
  • (inverse) for each g ∈ G, there exists g⁻¹ ∈ G such that g⁻¹ · g = e;
  • (associative) for each g1, g2, g3 ∈ G, g1 · (g2 · g3) = (g1 · g2) · g3.

The element e is called the identity, and the element g⁻¹ is called the inverse of g. The group G is called abelian, or commutative, if g1 · g2 = g2 · g1 for all g1, g2 ∈ G.

Some basic examples of groups include (Z, +), (R, +), and (R \ {0}, ×); the latter requires that the origin be removed since 0 has no multiplicative inverse. These three groups are abelian. The general linear group of dimension m, consisting of all m × m non-singular matrices, is a group under matrix multiplication; this is not an abelian group. Some simple properties of groups are given in Exercise 10. We are primarily interested in groups of transformations. Let X be a space (e.g., a sample space) and consider a collection G of functions g, mapping X to itself. Consider the operation ◦ of function composition. The identity element e is the function e(x) = x for all x ∈ X. If we require that (G, ◦) be a group with identity e, then each g ∈ G is a one-to-one function. To see this, take any g ∈ G and take x1, x2 ∈ X such that g(x1) = g(x2). Left composition by g⁻¹ gives e(x1) = e(x2) and, consequently, x1 = x2; therefore, g is one-to-one. Some examples of groups of transformations are:

  • For X = Rm, define the map gc(x) = x + c, a shift of the vector x by a vector c. Then G = {gc : c ∈ Rm} is an abelian group of transformations.
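The shift group above can be checked mechanically; here is a quick sketch (Python, my illustration, not part of the notes), where each gc acts componentwise and composition satisfies gc ◦ gd = g_{c+d}:

```python
# Sketch: the shift group G = {g_c : c in R^m} on X = R^m, with
# g_c(x) = x + c and composition (g_c ∘ g_d)(x) = x + (c + d).

def g(c):
    """The transformation g_c, acting componentwise on a tuple x."""
    return lambda x: tuple(xi + ci for xi, ci in zip(x, c))

def compose(f1, f2):
    return lambda x: f1(f2(x))

x = (1.0, 2.0)
c, d = (3.0, -1.0), (0.5, 0.5)

# closure and commutativity: g_c ∘ g_d = g_d ∘ g_c = g_{c+d}
assert compose(g(c), g(d))(x) == compose(g(d), g(c))(x) == (4.5, 1.5)

# identity g_0 and inverse g_{-c}
e = g((0.0, 0.0))
c_inv = tuple(-ci for ci in c)
assert compose(g(c_inv), g(c))(x) == e(x) == x
```

Commutativity here is special to the shift group; the matrix groups mentioned above fail it.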

Convexity is important in optimization problems (maximum likelihood, least squares, etc.) as it relates to existence and uniqueness of global optima. For example, if the criterion (loss) function to be minimized is convex and a local minimum exists, then convexity guarantees that it's a global minimum. "Convex" can be used as an adjective for sets, not just functions. A set C, in a linear space, is convex if, for any points x and y in C, the convex combination ax + (1 − a)y, for a ∈ [0, 1], is also a point in C. In other words, a convex set C contains the line segments connecting all pairs of points in C. Examples of convex sets are intervals of numbers, circles in the plane, and balls/ellipses in higher dimensions. There is a connection between convex sets and convex functions: if f is a convex real-valued function, then, for any real t, the set Ct = {x : f(x) ≤ t} is convex (see Exercise 15). There will be some applications of convex sets in the later chapters.³
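The sublevel-set fact is easy to see in one dimension. The following sketch (Python, my addition) checks, for the convex f(x) = x², that convex combinations of points in Ct = {x : f(x) ≤ t} stay in Ct:

```python
# Sketch: sublevel sets of a convex function are convex.  For convex f
# and x, y with f(x) <= t and f(y) <= t,
#   f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) <= t.

f = lambda x: x * x  # a convex function
t = 4.0              # C_t = {x : x^2 <= 4} = [-2, 2]

xs = [x / 10 for x in range(-20, 21)]  # a grid of points in C_t
for x in xs:
    for y in xs:
        for a in (0.0, 0.25, 0.5, 0.75, 1.0):
            z = a * x + (1 - a) * y
            assert f(z) <= t + 1e-12  # convex combination stays in C_t
```

The converse fails: a function whose sublevel sets are all convex (a quasi-convex function) need not be convex.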

1.3 Probability

1.3.1 Measure-theoretic formulation

It turns out that mathematical probability is just a special case of the measure theory presented above. Our probabilities are finite measures, our random variables are measurable functions, our expected values are integrals. Start with an essentially arbitrary measurable space (Ω, F), and introduce a probability measure P; that is, P(Ω) = 1. Then (Ω, F, P) is called a probability space. The idea is that Ω contains all possible outcomes of the random experiment. Consider, for example, the heights example in Section 1.4.1. Suppose we plan to sample a single UIC student at random from the population of students. Then Ω consists of all students, and exactly one of these students will be the one that's observed. The measure P will encode the underlying sampling scheme. But in this example, it's not the particular student chosen that's of interest: we want to know the student's height, which is a measurement or characteristic of the sampled student. How do we account for this? A random variable X is nothing but a measurable function from Ω to another space X. It's important to understand that X, as a mapping, is not random; instead, X is a function of a randomly chosen element ω in Ω. So when we discuss the probability that X satisfies such and such a property, we're actually thinking about the probability (or measure) of the set of ω's for which X(ω) satisfies the particular property. To make this more precise we write

P(X ∈ A) = P{ω : X(ω) ∈ A} = P(X⁻¹(A)).

To simplify notation, etc., we will often ignore the underlying probability space, and work simply with the probability measure PX(·) = P(X⁻¹(·)). This is what we're familiar with from basic probability and statistics; the statement X ∼ N(0, 1) means simply that the probability

³E.g., the parameter space for natural exponential families is convex; Anderson's lemma, which is used to prove minimaxity in normal mean problems, among other things, involves convex sets; etc.

measure induced on R by the mapping X is a standard normal distribution. When there is no possibility of confusion, we will drop the "X" subscript and simply write P for PX. When PX, a measure on the X-space X, is dominated by a σ-finite measure μ, the Radon–Nikodym theorem says there is a density dPX/dμ = pX, and

PX(A) = ∫_A pX dμ.

This is the familiar case we're used to; when μ is counting measure, pX is a probability mass function and, when μ is Lebesgue measure, pX is a probability density function. One of the benefits of the measure-theoretic formulation is that we do not have to handle these two important cases separately. Let ϕ be a real-valued measurable function defined on X. Then the expected value of ϕ(X) is

EX{ϕ(X)} = ∫_X ϕ(x) dPX(x) = ∫_X ϕ(x) pX(x) dμ(x),

the latter expression holding only when PX ≪ μ for a σ-finite measure μ on X. The usual properties of expected value (e.g., linearity) hold in this more general case; the same tools we use in measure theory to study properties of integrals of measurable functions are useful for deriving such things. In these notes, it will be assumed you are familiar with all the basic probability calculations defined and used in basic probability and statistics courses, such as Stat 401 and Stat 411 at UIC. In particular, you are expected to know the common distributions (e.g., normal, binomial, Poisson, gamma, uniform, etc.) and how to calculate expectations for these and other distributions. Moreover, I will assume you are familiar with some basic operations involving random vectors (e.g., covariance matrices) and some simple linear algebra stuff. Keener (2010), Sections 1.7 and 1.8, introduces these concepts and notations. In probability and statistics, product spaces are especially important. The reason, as we alluded to before, is that independence of random variables is connected with product spaces and, in particular, product measures. If X1, ..., Xn are iid PX, then their joint distribution is the product measure

PX1 × PX2 × · · · × PXn = PX × PX × · · · × PX = PX^n.

The first expression holds with only "independence"; the second requires "identically distributed"; the last is just short-hand notation for the middle one. When we talk about convergence theorems, such as the law of large numbers, we say something like: for an infinite sequence of random variables X1, X2, ..., some event happens with probability 1. But what is the measure being referenced here? In the iid case, it turns out that it's an infinite product measure, written as PX^∞. We'll have more to say about this when the time comes.

Here I use the more standard notation for conditional probability. The law of total probability then allows us to write

P(Y ∈ B) = ∫ P(Y ∈ B | X = x) pX(x) dμ(x),

in other words, marginal probabilities for Y may be obtained by taking expectation of the conditional probabilities. More generally, for any ν-integrable function ϕ, we may write the conditional expectation

E{ϕ(Y) | X = x} = ∫ ϕ(y) pY|X(y|x) dν(y).

We may evaluate the above expectation for any x, so we actually have defined a (μ-measurable) function, say, g(x) = E(Y | X = x); here I took ϕ(y) = y for simplicity. Now, g(X) is a random variable, to be denoted by E(Y | X), and we can ask about its mean, variance, etc. The corresponding version of the law of total probability for conditional expectations is

E(Y ) = E{E(Y | X)}. (1.6)

This formula is called smoothing in Keener (2010) but I would probably call it a law of iterated expectation. This is actually a very powerful result that can simplify lots of calculations; Keener (2010) uses this a lot. There are versions of iterated expectation for higher moments, e.g.,

V(Y) = V{E(Y | X)} + E{V(Y | X)}, (1.7)
C(X, Y) = E{C(X, Y | Z)} + C{E(X | Z), E(Y | Z)}, (1.8)

where V(Y | X) is the conditional variance, i.e., the variance of Y relative to its conditional distribution and, similarly, C(X, Y | Z) is the conditional covariance of X and Y. As a final word about conditional distributions, it is worth mentioning that conditional distributions are particularly useful in the specification of complex models. Indeed, it can be difficult to specify a meaningful joint distribution for a collection of random variables in a given application. However, it is often possible to write down a series of conditional distributions that, together, specify a meaningful joint distribution. That is, we can simplify the modeling step by working with several lower-dimensional conditional distributions. This is particularly useful for specifying prior distributions for unknown parameters in a Bayesian analysis; we will discuss this more later.
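Identities (1.6) and (1.7) can be verified exactly on a small discrete joint distribution; here is a sketch (Python, my addition, with a made-up pmf) computing both sides:

```python
# Sketch: check E(Y) = E{E(Y|X)} and V(Y) = V{E(Y|X)} + E{V(Y|X)}
# on a small discrete joint distribution p(x, y).

joint = {  # p(x, y) on {0,1} x {0,1,2}; values are arbitrary
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.25, (1, 1): 0.05, (1, 2): 0.30,
}
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}

def cond_moments(x):
    """Conditional mean and variance of Y given X = x."""
    m1 = sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]
    m2 = sum(y * y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]
    return m1, m2 - m1 ** 2

ey = sum(y * p for (_, y), p in joint.items())
vy = sum(y * y * p for (_, y), p in joint.items()) - ey ** 2

e_cond_mean = sum(p_x[x] * cond_moments(x)[0] for x in p_x)
v_cond_mean = sum(p_x[x] * (cond_moments(x)[0] - ey) ** 2 for x in p_x)
e_cond_var = sum(p_x[x] * cond_moments(x)[1] for x in p_x)

assert abs(ey - e_cond_mean) < 1e-9                   # identity (1.6)
assert abs(vy - (v_cond_mean + e_cond_var)) < 1e-9    # identity (1.7)
```

Both identities hold exactly; the tolerances only absorb floating-point rounding.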

1.3.3 Jensen’s inequality

Convex sets and functions appear quite frequently in statistics and probability applications, so it can help to see some applications. The first result, relating the expectation of a convex function to the function of the expectation, should be familiar.

Theorem 1.6 (Jensen’s inequality). Suppose ϕ is a convex function on an open interval X ⊆ R, and X is a random variable taking values in X. Then

ϕ[E(X)] ≤ E[ϕ(X)].

If ϕ is strictly convex, then equality holds if and only if X is constant.

Proof. First, take x0 to be any fixed point in X. Then there exists a linear function ℓ(x) = c(x − x0) + ϕ(x0), through the point (x0, ϕ(x0)), such that ℓ(x) ≤ ϕ(x) for all x. To prove our claim, take x0 = E(X), and note that

ϕ(X) ≥ c[X − E(X)] + ϕ[E(X)].

Taking expectations on both sides gives the result.

Jensen's inequality can be used to confirm, e.g., that E(1/X) ≥ 1/E(X) and E[log X] ≤ log E(X) for a positive random variable X, and that E(X²) ≥ E(X)² in general. An interesting consequence is the following.

Example 1.7 (Kullback–Leibler divergence). Let f and g be two probability density functions dominated by a σ-finite measure μ. The Kullback–Leibler divergence of g from f is defined as

Ef{log[f(X)/g(X)]} = ∫ log(f/g) f dμ.

It follows from Jensen's inequality that

Ef{log[f(X)/g(X)]} = −Ef{log[g(X)/f(X)]} ≥ −log Ef[g(X)/f(X)] = −log ∫ (g/f) f dμ = 0.

That is, the Kullback–Leibler divergence is non-negative for all f and g. Moreover, it equals zero if and only if f = g (μ-almost everywhere). Therefore, the Kullback–Leibler divergence acts like a distance measure between two density functions. While it's not a metric in a mathematical sense⁵, it has a lot of statistical applications. See Exercise 23.
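A quick numerical check of Example 1.7 (a Python sketch, my addition), taking μ to be counting measure so that f and g are probability mass functions on a common finite support:

```python
import math

def kl_divergence(f, g):
    """KL divergence of g from f, for pmfs given as aligned lists."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

f = [0.5, 0.4, 0.1]
g = [0.2, 0.3, 0.5]

assert kl_divergence(f, g) >= 0.0   # non-negativity, from Jensen
assert kl_divergence(f, f) == 0.0   # zero when the densities agree
# not symmetric, hence not a metric:
assert kl_divergence(f, g) != kl_divergence(g, f)
```

The asymmetry displayed in the last line is exactly why the text calls it a divergence rather than a distance.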

Example 1.8 (Another proof of Cauchy–Schwarz). Recall that f² and g² are μ-measurable functions. If ∫ g² dμ is infinite, then there is nothing to prove, so suppose otherwise. Then p = g²/∫ g² dμ is a probability density on X. Moreover,

( ∫ f g dμ / ∫ g² dμ )² = ( ∫ (f/g) p dμ )² ≤ ∫ (f/g)² p dμ = ∫ f² dμ / ∫ g² dμ,

where the inequality follows from Theorem 1.6. Rearranging terms one gets

( ∫ f g dμ )² ≤ ∫ f² dμ · ∫ g² dμ,

which is the desired result.

⁵It's not symmetric and does not satisfy the triangle inequality.

It is easy to verify that h′′′(z) = 0 iff z = log((1 − c)/c). Plugging this z value into h′′ gives 1/4, and this is the global maximum. Therefore, h′′(z) ≤ 1/4 for all z > 0. Now, for some z0 ∈ (0, ζ), there is a second-order Taylor approximation of h(ζ) around 0:

h(ζ) = h(0) + h′(0)ζ + h′′(z0) ζ²/2 ≤ ζ²/8 = t²(b − a)²/8,

using h(0) = h′(0) = 0 and the bound h′′ ≤ 1/4. Plug this bound in to get MX(t) ≤ e^{h(ζ)} ≤ e^{t²(b−a)²/8}.

Lemma 1.2 (Chernoff). For any random variable X, P(X > ε) ≤ inf_{t>0} e^{−tε} E(e^{tX}).

Proof. See Exercise 26.

Now we are ready for the main result, Hoeffding’s inequality. The proof combines the results in the two previous lemmas.

Theorem 1.7 (Hoeffding's inequality). Let Y1, Y2, ... be independent random variables, with P(a ≤ Yi ≤ b) = 1 and mean μ. Then

P(|Ȳn − μ| > ε) ≤ 2 e^{−2nε²/(b−a)²}.

Proof. We can take μ = 0, without loss of generality, by working with Xi = Yi − μ. Of course, Xi is still bounded, and the length of the bounding interval is still b − a. Write

P(|X̄n| > ε) = P(X̄n > ε) + P(−X̄n > ε).

Start with the first term on the right-hand side. Using Lemma 1.2,

P(X̄n > ε) = P(X1 + · · · + Xn > nε) ≤ inf_{t>0} e^{−tnε} MX(t)^n,

where MX(t) is the moment generating function of X1. By Lemma 1.1, we have

P(X̄n > ε) ≤ inf_{t>0} e^{−tnε} e^{nt²(b−a)²/8}.

The minimizer, over t > 0, of the right-hand side is t = 4ε/(b − a)², so we get

P(X̄n > ε) ≤ e^{−2nε²/(b−a)²}.

To complete the proof, apply the same argument to P(−X̄n > ε), obtain the same bound as above, then sum the two bounds together.
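A simulation sketch (Python, my addition, not part of the notes) comparing the empirical tail probability of a bounded sample mean against the bound 2e^{−2nε²/(b−a)²} from Theorem 1.7, using Uniform(0, 1) data so a = 0, b = 1, μ = 1/2:

```python
import math
import random

random.seed(0)

n, eps, trials = 50, 0.2, 2000
a, b, mu = 0.0, 1.0, 0.5  # Uniform(0, 1) has mean 1/2

# empirical estimate of P(|X̄_n − μ| > ε)
exceed = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - mu) > eps:
        exceed += 1
empirical = exceed / trials

hoeffding = 2 * math.exp(-2 * n * eps ** 2 / (b - a) ** 2)
assert empirical <= hoeffding  # bound holds; here it equals 2e^{-4} ≈ 0.037
```

As usual with such universal bounds, the empirical tail probability is far below the Hoeffding bound; the bound's value is its exponential decay in n, valid for any bounded distribution.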

There are lots of other kinds of concentration inequalities, most are more general than Hoeffding’s inequality above. Exercise 28 walks you through a concentration inequality for normal random variables and a corresponding strong law. Modern work on concentration inequalities deals with more advanced kinds of random quantities, e.g., random functions or stochastic processes. The next subsection gives a special case of such a result.

1.3.5 The “fundamental theorem of statistics”

Consider the problem where X1, ..., Xn are iid with common distribution function F on the real line; for simplicity, let's assume throughout that F is everywhere continuous. Of course, if we knew F, then, at least in principle, we would know everything about the distribution of the random variables. It should also be clear, at least intuitively, that, if n is large, then we would have seen "all the possible values" of a random variable X ∼ F, in their relative frequencies, and so it should be possible to learn F from a long enough sequence of data. The result below, called the Glivenko–Cantelli theorem or, by some, the fundamental theorem of statistics, demonstrates that our intuition is correct.

First we need a definition. Given X1, ..., Xn iid ∼ F, we want to construct an estimator F̂n of F. A natural choice is the "empirical distribution function":

F̂n(x) = (1/n) ∑_{i=1}^n I_{(−∞,x]}(Xi), x ∈ R,

that is, F̂n(x) is just the proportion of the sample with values not exceeding x. It is a simple consequence of Hoeffding's inequality above (paired with the Borel–Cantelli lemma) that F̂n(x) converges almost surely to F(x) for each x. The Glivenko–Cantelli theorem says that F̂n converges to F not just pointwise, but uniformly.
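The empirical distribution function, and the uniform distance it is compared to F with, can be sketched as follows (Python, my addition), using Uniform(0, 1) data so that F(x) = x on [0, 1]:

```python
import bisect
import random

random.seed(1)

def ecdf(sample):
    """Empirical distribution function of the sample."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n  # #{X_i <= x} / n

def sup_dist(n, grid=200):
    """Grid approximation of sup_x |F̂_n(x) - F(x)| for n Uniform(0,1) draws."""
    Fn = ecdf([random.random() for _ in range(n)])
    return max(abs(Fn(j / grid) - j / grid) for j in range(grid + 1))

d_small, d_large = sup_dist(50), sup_dist(5000)
print(d_small, d_large)
```

With a larger sample the sup distance is typically far smaller, which is the uniform convergence the Glivenko–Cantelli theorem guarantees almost surely.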

Theorem 1.8 (Glivenko–Cantelli). Given X1, ..., Xn iid ∼ F, where F is everywhere continuous on R, let F̂n be the empirical distribution function as defined above. Set

‖F̂n − F‖∞ := sup_x |F̂n(x) − F(x)|.

Then ‖F̂n − F‖∞ converges to zero almost surely.

Proof. Our goal is to show that, for any ε > 0,

lim sup_n sup_x |F̂n(x) − F(x)| ≤ ε, almost surely.

To start, given (arbitrary) ε > 0, let −∞ = t1 < t2 < · · · < tJ = ∞ be a partition of R such that

F(t_{j+1}^−) − F(tj) ≤ ε, j = 1, ..., J − 1.

Exercise 29 demonstrates the existence of such a partition. Then, for any x, there exists j such that tj ≤ x < tj+1 and, by monotonicity,

F̂n(tj) ≤ F̂n(x) ≤ F̂n(t_{j+1}^−) and F(tj) ≤ F(x) ≤ F(t_{j+1}^−).

This implies that

F̂n(tj) − F(t_{j+1}^−) ≤ F̂n(x) − F(x) ≤ F̂n(t_{j+1}^−) − F(tj).