





Information and Coding Theory Winter 2021
Lecturer: Madhur Tulsiani
We will conclude our discussion of minimax rates with a final example: estimating the mean when we are additionally promised that the mean is a sparse vector. Consider the set of normal distributions where the mean has at most one non-zero coordinate.
Π = { N(μ, I_d) | μ ∈ ℝ^d, ‖μ‖₀ ≤ 1 }
Let θ(P) = E_{x∼P}[x] be the mean, and let ℓ(θ̂, θ) = ‖θ̂ − θ‖₂² as before. From the previous
examples, it seems like the empirical mean estimator is always the best one, and the role of information theory is primarily for proving lower bounds. However, it can also serve as a guide for the right bound to aim for. For this problem, it will be much easier to prove a lower bound. We will then show an estimator which matches this bound.
Let V = {e₁, ..., e_d} be the set of standard basis vectors in ℝ^d. Consider the set of distributions P_v = N(√2·δ·v, I_d) for all v ∈ V. Note that the means μ_v = √2·δ·v satisfy ‖μ_{v₁} − μ_{v₂}‖ = 2δ for all v₁ ≠ v₂. Using the bound from the previous lecture, we get
M_n(Π, ℓ) ≥ δ² · ( 1 − (n · E_{v₁,v₂∈V}[D(P_{v₁}‖P_{v₂})] + 1) / log|V| )
          ≥ δ² · ( 1 − (n · 4δ²/(2 ln 2) + 1) / log d )
          ≥ c · (log d)/n

for an appropriate constant c > 0, using the choice δ² = c′ · (log d)/n. We will now show that this lower bound is actually tight.
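The KL term above comes from the standard formula for the divergence between two Gaussians with identity covariance (measured in bits, hence the factor ln 2), applied to means at distance 2δ:

```latex
D\big(N(\mu_{v_1}, I_d) \,\|\, N(\mu_{v_2}, I_d)\big)
  \;=\; \frac{\|\mu_{v_1} - \mu_{v_2}\|_2^2}{2\ln 2}
  \;=\; \frac{(2\delta)^2}{2\ln 2}
  \;=\; \frac{4\delta^2}{2\ln 2}.
```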
The optimal estimator for the above problem in fact extends the characterization of the mean as the minimizer of the total squared distance from the sample points. Recall the following.
Exercise 1.1. Let x₁, ..., x_n ∈ ℝ^d. Then the empirical mean η = (1/n) · ∑_{i=1}^n x_i satisfies

∑_{i=1}^n ‖x_i − η‖₂² = inf_{v ∈ ℝ^d} ∑_{i=1}^n ‖x_i − v‖₂²
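Exercise 1.1 is easy to spot-check numerically; the following plain-Python sketch (with arbitrarily chosen sample points) compares the empirical mean against a few perturbations of it:

```python
# Check that the empirical mean minimizes the total squared distance
# to the sample points (Exercise 1.1), on a small hand-picked example.

def sq_dist(x, v):
    return sum((xi - vi) ** 2 for xi, vi in zip(x, v))

def total_sq_dist(xs, v):
    return sum(sq_dist(x, v) for x in xs)

xs = [(0.0, 1.0), (2.0, 3.0), (4.0, -1.0)]  # arbitrary points in R^2
n, d = len(xs), 2
eta = tuple(sum(x[j] for x in xs) / n for j in range(d))  # empirical mean

# The empirical mean should beat any perturbation of itself.
for delta in [(0.1, 0.0), (0.0, -0.2), (0.5, 0.5)]:
    v = tuple(e + dl for e, dl in zip(eta, delta))
    assert total_sq_dist(xs, eta) <= total_sq_dist(xs, v)

print(eta)  # (2.0, 1.0)
```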
Given a sequence of samples x = (x₁, ..., x_n), let η denote the empirical mean

η := (1/n) · ∑_{i=1}^n x_i.
As we saw above, the empirical mean is the minimizer of the total squared distance. However, it is not sparse. We take our estimator μ̂ to consist only of the largest entry (in absolute value) of η, setting all other entries to zero, i.e.,
μ̂_j := { η_j   if j = argmax_{k∈[d]} |η_k|
       { 0     otherwise
Note that the above definition does not make sense if the coordinate maximizing |η_k| is not unique. In such a case, we arbitrarily pick one of the maximizing coordinates. Check that this definition is a constrained version of the above characterization of the empirical mean: while the empirical mean η minimizes the average squared distance from the sample points over all of ℝ^d, the estimator above is the minimizer over all sparse vectors.
Exercise 1.2. Check that for μ̂ defined as above,

∑_{i=1}^n ‖x_i − μ̂‖₂² = inf_{‖v‖₀ ≤ 1} { ∑_{i=1}^n ‖x_i − v‖₂² }
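A minimal Python sketch of the estimator μ̂ (the sample points and the smallest-index tie-break are arbitrary choices):

```python
# Sparse mean estimator: keep only the largest-magnitude coordinate
# of the empirical mean, zeroing out all others (ties broken by index).

def empirical_mean(xs):
    n, d = len(xs), len(xs[0])
    return [sum(x[j] for x in xs) / n for j in range(d)]

def sparse_estimate(xs):
    eta = empirical_mean(xs)
    j = max(range(len(eta)), key=lambda k: abs(eta[k]))  # argmax_k |eta_k|
    return [eta[k] if k == j else 0.0 for k in range(len(eta))]

xs = [[0.1, 2.0, -0.3], [-0.1, 2.0, 0.3]]  # two samples in R^3
print(sparse_estimate(xs))  # [0.0, 2.0, 0.0]
```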
While we will use the above estimator, the operation of picking the largest coordinate does not combine well with analytic expressions such as expectations. For this reason, we will use the empirical mean η as an intermediate object in the analysis. We need the following basic properties.
Proposition 1.3. Let x ∼ (N(μ, I_d))^n be a sequence of n independent samples, and let η = (1/n) · ∑_{i=1}^n x_i be the empirical mean. Then η − μ is distributed according to the Gaussian distribution N(0, (1/n) · I_d).
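Proposition 1.3 can be spot-checked by simulation; the small sketch below (with arbitrary sample sizes and mean) estimates the variance of one coordinate of η − μ, which should be close to 1/n:

```python
import random
import statistics

# Sanity check of Proposition 1.3: a coordinate of eta - mu should have
# variance about 1/n. The sample sizes and mean below are arbitrary.
random.seed(1)

n, trials = 25, 4000
mu = 3.0  # one coordinate suffices, since the coordinates are independent
devs = [sum(random.gauss(mu, 1.0) for _ in range(n)) / n - mu
        for _ in range(trials)]
var = statistics.pvariance(devs)
print(abs(var - 1.0 / n) < 0.01)  # variance is close to 1/n = 0.04
```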
and we are done. So let's assume μ̂₁ = 0 and μ̂_j ≠ 0 for some j > 1. Since we must have μ̂_j = η_j in this case, we have
|μ₁| + |η_j| = |(μ − μ̂)₁| + |(μ − μ̂)_j| ≥ ‖μ − μ̂‖₂ ≥ t.
Also, since η_j must be the largest coordinate in absolute value, we have

|η_j| ≥ |η₁| ≥ |μ₁| − |μ₁ − η₁|.
Adding the above inequalities gives

|μ₁ − η₁| + 2·|η_j| = |μ₁ − η₁| + 2·|μ_j − η_j| ≥ t.
Hence, either |μ₁ − η₁| ≥ t/3 or |μ_j − η_j| ≥ t/3, which is what we wanted to prove.
We can now finish the computation of the expected loss, using the above tail bound. Since each coordinate of η − μ is distributed as N(0, 1/n), the Gaussian tail bound gives P[|η_k − μ_k| ≥ t/3] ≤ 2·exp(−nt²/18), and a union bound over the d coordinates yields P[‖μ − μ̂‖₂ ≥ t] ≤ 2d·exp(−nt²/18). Using s = t² in this bound, we can write it as
P[ ‖μ − μ̂‖₂² ≥ s ] ≤ 2d · exp(−ns/18).
This yields the following bound.
Claim 1.6. For the estimator μ̂ as above,

E_{x∼(N(μ,I_d))^n} [ ‖μ − μ̂(x)‖₂² ] ≤ C · (log d)/n

for an absolute constant C.
Proof: We use the fact that for a non-negative random variable Z, E[Z] = ∫₀^∞ P[Z ≥ s] ds. Using this, we get

E_{x∼(N(μ,I_d))^n} [ ‖μ − μ̂(x)‖₂² ] = ∫₀^∞ P[ ‖μ − μ̂‖₂² ≥ s ] ds
  = ∫₀^u P[ ‖μ − μ̂‖₂² ≥ s ] ds + ∫_u^∞ P[ ‖μ − μ̂‖₂² ≥ s ] ds
  ≤ ∫₀^u 1 ds + ∫_u^∞ 2d · exp(−ns/18) ds
  = u + (36d/n) · exp(−nu/18).

Choosing u = c · (log d)/n for an appropriate constant c then finishes the proof.
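As a sanity check, not part of the proof, one can simulate the estimator and confirm that its average squared error stays on the order of (log d)/n; the dimension, sample size, placement of the mean, and the generous constant 30 below are all arbitrary choices:

```python
import math
import random

# Monte Carlo sanity check: simulate the sparse mean estimator on samples
# from N(mu, I_d) and check the squared error is O((log d)/n).
random.seed(0)

def trial(n, d):
    scale = 2.0 * math.sqrt(math.log(d) / n)  # mean of this order (assumption)
    mu = [scale] + [0.0] * (d - 1)
    # empirical mean of n samples from N(mu, I_d)
    eta = [mu[j] + sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
           for j in range(d)]
    j = max(range(d), key=lambda k: abs(eta[k]))  # largest coordinate of eta
    mu_hat = [eta[k] if k == j else 0.0 for k in range(d)]
    return sum((a - b) ** 2 for a, b in zip(mu, mu_hat))

n, d = 200, 50
avg_err = sum(trial(n, d) for _ in range(50)) / 50
print(avg_err <= 30 * math.log(d) / n)  # True: error is O((log d)/n)
```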
We will now talk more about finding the distribution in a set Π that minimizes D(P‖Q) for a fixed distribution Q. We encountered this problem when discussing Sanov's theorem and hypothesis testing, and will now discuss its properties in some detail. When Q is the uniform distribution on X, we have

D(P‖Q) = log|X| − H(P)

Hence, in this case P* is a distribution that maximizes entropy. In general, when the given information does not uniquely determine a distribution, we choose the P* that maximizes entropy. This can be thought of as picking P* in the set of distributions Π subject to the least amount of additional assumptions. This is sometimes called the Maximum Entropy Principle. In this lecture, we will characterize the distributions obtained by minimizing KL-divergence (or maximizing entropy).
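The identity D(P‖Q) = log|X| − H(P) for uniform Q is straightforward to verify numerically (logs base 2, with an arbitrary example distribution):

```python
import math

# Verify D(P || Uniform) = log|X| - H(P) on a small example (logs base 2).

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]  # arbitrary distribution on 4 symbols
u = [0.25] * 4                 # uniform distribution on the same alphabet
lhs = kl(p, u)
rhs = math.log2(4) - entropy(p)
assert abs(lhs - rhs) < 1e-12
print(lhs)  # 0.25
```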
For a closed convex set Π, such a minimizer is called the I-projection of Q onto Π.

Definition 2.1. Let Π be a closed convex set of distributions over X. In addition, assume that Supp(Q) = X. Then

Proj_Π(Q) := arg min_{P∈Π} D(P‖Q).
Note that the assumption Supp(Q) = X above is without loss of generality, since D(P‖Q) = ∞ for any P such that Supp(P) ⊈ Supp(Q). Use the (strict) convexity of KL-divergence to check the following.
Exercise 2.2. For a closed, convex set Π, the projection P∗^ = ProjΠ(Q) exists and is unique.
It is immediate from the definition that if P ∈ Π, then D(P‖Q) ≥ D(P*‖Q). In fact, P* tells us more: it also tells us how "far" P is from Q in KL-divergence.
Theorem 2.3. Let P* = Proj_Π(Q). Then, for all P ∈ Π,

Supp(P) ⊆ Supp(P*)
D(P‖Q) ≥ D(P‖P*) + D(P*‖Q)
Proof: Define P_t = tP + (1−t)P*, where t ∈ [0,1]. By minimality of P*, it is clear that D(P_t‖Q) − D(P*‖Q) ≥ 0. By the mean value theorem, we also have that for some t′ ∈ [0,t],

(1/t) · (D(P_t‖Q) − D(P*‖Q)) = [ d/dt D(P_t‖Q) ]_{t=t′}

Since t′ → 0 as t → 0, we get

lim_{t↓0} d/dt D(P_t‖Q) ≥ 0.
Next, we show how to compute and characterize I-projections for some special sets of distributions.
Definition 2.5. For any given real-valued functions f₁, f₂, ..., f_k on X and α₁, α₂, ..., α_k ∈ ℝ, the set

L = { P | ∑_{x∈X} p(x) · f_i(x) = E_{x∼P}[f_i(x)] = α_i, ∀i ∈ [k] }

is called a linear family of distributions.
We show that for linear families, the inequality proved above is in fact tight. Moreover, the projection P* lies in the interior of the polytope defining L.
Lemma 2.6. Let L be a linear family given by

∑_{x∈X} p(x) · f_i(x) = α_i,  i ∈ [k]

and ⋃_{P∈L} Supp(P) = X. Then the I-projection P* = Proj_L(Q) of Q onto L satisfies, for all P ∈ L, the Pythagorean identity

D(P‖Q) = D(P‖P*) + D(P*‖Q)
Proof: Recall that Supp(P) ⊆ Supp(P*) and p_t(x) = t·p(x) + (1−t)·p*(x). Since the conditions defining L are linear, we have that for all t ∈ ℝ and all i ∈ [k],

∑_{x∈X} p_t(x) · f_i(x) = t · ∑_{x∈X} p(x) · f_i(x) + (1−t) · ∑_{x∈X} p*(x) · f_i(x) = α_i
However, we may not have p_t(x) ≥ 0 for all t < 0. We find a β > 0 such that for t ∈ [−β, 0],

p_t(x) ≥ 0 ⇔ t · (p(x) − p*(x)) ≥ −p*(x)

Note that the above inequality clearly holds if p(x) − p*(x) < 0. Now choose β such that

β = min_{x : p(x)−p*(x) > 0} { p*(x) / (p(x) − p*(x)) }
Notice that β > 0 since Supp(P*) ⊇ ⋃_{P∈L} Supp(P).
The above implies that [d/dt D(P_t‖Q)]_{t=0} = 0 by the minimality of P*, since t = 0 is now an interior point of the interval [−β, 1] over which D(P_t‖Q) is minimized at t = 0. This in turn implies the equality D(P‖Q) = D(P‖P*) + D(P*‖Q).
The above can also be used to show that the I-projection onto L is of a special form. To describe this, we define the following family of distributions.
Definition 2.7. Let Q be a given distribution. For any given functions g₁, g₂, ..., g_k on X, the set

E_Q(g₁, ..., g_k) := { P | ∃λ₁, ..., λ_k ∈ ℝ such that ∀x ∈ X, p(x) = c · q(x) · exp( ∑_{i=1}^k λ_i g_i(x) ) }

(where c is a normalizing constant) is called an exponential family of distributions.
We will show that P* = Proj_L(Q) ∈ E_Q(f₁, ..., f_k). We prove this for a linear family defined by a single constraint; the proof for families with multiple constraints is identical. Let f : X → ℝ and let L be defined by

∑_{x∈X} p(x) · f(x) = E_{x∼P}[f(x)] = α
The projection P∗^ is the optimal solution to the convex program
minimize    D(P‖Q)
subject to  ∑_{x∈X} p(x) · f(x) = α
            ∑_{x∈X} p(x) = 1
            p(x) ≥ 0  ∀x ∈ X.
For λ₀, λ₁ ∈ ℝ, we write the Lagrangian as

Λ(P; λ₀, λ₁) = D(P‖Q) + λ₀ · ( ∑_x p(x) − 1 ) + λ₁ · ( ∑_x p(x) · f(x) − α )
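As a hedged illustration of the single-constraint case (the finite alphabet, Q, f, and α below are arbitrary choices): the projection has the exponential-family form p*(x) ∝ q(x) · 2^{λ f(x)}, and λ can be found by a simple bisection so that E_{P*}[f] = α. The Pythagorean identity of Lemma 2.6 can then be checked numerically:

```python
import math

# Sketch: I-projection of Q onto L = {P : E_P[f] = alpha} on a finite
# alphabet. P* has the exponential-family form p*(x) ~ q(x) * 2**(lam*f(x));
# we find lam by bisection, since E_{P_lam}[f] is increasing in lam.

def tilt(q, f, lam):
    # Exponential tilting of q by f; returns the normalized distribution.
    w = [qi * 2.0 ** (lam * fi) for qi, fi in zip(q, f)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_f(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def project(q, f, alpha, lo=-50.0, hi=50.0, iters=200):
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_f(tilt(q, f, mid), f) < alpha:
            lo = mid
        else:
            hi = mid
    return tilt(q, f, (lo + hi) / 2.0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.7, 0.2, 0.1]   # base distribution Q (arbitrary)
f = [0.0, 1.0, 2.0]   # constraint function f (arbitrary)
alpha = 1.0           # target expectation (arbitrary)

p_star = project(q, f, alpha)
assert abs(mean_f(p_star, f) - alpha) < 1e-9   # p* lies in L

# Pythagorean identity for another member P of the linear family L:
p = [0.3, 0.4, 0.3]   # E_p[f] = 0.4 + 0.6 = 1.0, so p is in L
assert abs(kl(p, q) - (kl(p, p_star) + kl(p_star, q))) < 1e-9
```

The bisection works because tilting by a larger λ shifts mass toward larger values of f, so the constraint expectation is monotone in λ.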