Information and Coding Theory Winter 2021
Lecture 10: February 11, 2021
Lecturer: Madhur Tulsiani

1 Sparse mean estimation

We conclude our discussion of minimax rates with a final example of estimating the mean, given the additional condition that the mean is a sparse vector. Consider the set of normal distributions whose mean has at most one non-zero coordinate:

Π = { N(μ, I_d) : μ ∈ R^d, ‖μ‖_0 ≤ 1 }.

Let θ(P) = E_{x∼P}[x] be the mean, and let ℓ(θ̂, θ) = ‖θ̂ − θ‖_2^2 as before. From the previous examples, it seems like the empirical mean estimator is always the best one, and the role of information theory is primarily in proving lower bounds. However, it can also serve as a guide for the right bound to aim for. For this problem, it will be much easier to prove a lower bound. We will then show an estimator which matches this bound.

1.1 Lower bound

Let V = {e_1, . . . , e_d} be the set of standard basis vectors in R^d. Consider the set of distributions P_v = N(√2·δ·v, I_d) for all v ∈ V. Note that the means μ_v = √2·δ·v satisfy ‖μ_{v_1} − μ_{v_2}‖ = 2δ for all v_1 ≠ v_2. Using the bound from the previous lecture, we get

M_n(Π, ℓ) ≥ δ^2 · ( 1 − (n · E_{v_1,v_2 ∈ V}[D(P_{v_1} ‖ P_{v_2})] + 1) / log|V| )
          ≥ δ^2 · ( 1 − (n · 4δ^2/(2 ln 2) + 1) / log d )
          ≥ c · (log d)/n,

for an appropriate constant c > 0, using a choice of δ^2 = c′ · (log d)/n. We will now show that this lower bound is actually tight.

1.2 Upper bound

The optimal estimator for the above problem actually extends the definition of the mean as the minimizer of the total squared distance from the sample points. Recall the following.

Exercise 1.1. Let x_1, . . . , x_n ∈ R^d. Then the empirical mean η = (1/n) · ∑_{i=1}^n x_i satisfies

∑_{i=1}^n ‖x_i − η‖_2^2 = inf_{v ∈ R^d} ∑_{i=1}^n ‖x_i − v‖_2^2 .
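The exercise follows from a standard expansion around η (spelled out here for completeness; this intermediate step is not in the notes):

```latex
\sum_{i=1}^{n} \|x_i - v\|_2^2
  = \sum_{i=1}^{n} \|x_i - \eta\|_2^2
    + 2\Big\langle \sum_{i=1}^{n}(x_i - \eta),\; \eta - v \Big\rangle
    + n\,\|\eta - v\|_2^2
  = \sum_{i=1}^{n} \|x_i - \eta\|_2^2 + n\,\|\eta - v\|_2^2,
```

since ∑_{i=1}^n (x_i − η) = 0 by the definition of η; the right-hand side is therefore minimized exactly at v = η.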

Given a sequence of samples x = (x_1, . . . , x_n), let η denote the empirical mean

η := (1/n) · ∑_{i=1}^n x_i .

As we saw above, the empirical mean is the minimizer of the average squared distance. However, it is not sparse. We take our estimator μ̂ to consist only of the largest entry (in absolute value) of η, and set all other entries to zero, i.e.,

μ̂_j := η_j if j = argmax_{k∈[d]} |η_k|, and μ̂_j := 0 otherwise.

Note that the above definition does not make sense if the coordinate maximizing |η_k| is not unique. In such a case, we arbitrarily pick one of the maximizing coordinates. Check that this definition is a constrained version of the above characterization of the empirical mean: while the empirical mean η is the minimizer, over all of R^d, of the average squared distance from the sample points, the estimator above is the minimizer over all sparse vectors.

Exercise 1.2. Check that for μ̂ defined as above,

∑_{i=1}^n ‖x_i − μ̂‖_2^2 = inf_{‖v‖_0 ≤ 1} { ∑_{i=1}^n ‖x_i − v‖_2^2 } .
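Exercise 1.2 can also be checked numerically. The following is a minimal Python sketch (not from the notes; all function names are illustrative): it computes the hard-thresholded estimator μ̂ and compares its total squared distance against a brute-force search over 1-sparse candidates.

```python
import random

def empirical_mean(xs):
    # Coordinate-wise average of the sample points.
    n, d = len(xs), len(xs[0])
    return [sum(x[j] for x in xs) / n for j in range(d)]

def sparse_estimate(xs):
    # Keep only the largest coordinate (in absolute value) of the
    # empirical mean; set every other coordinate to zero.
    eta = empirical_mean(xs)
    j_star = max(range(len(eta)), key=lambda j: abs(eta[j]))
    return [eta[j] if j == j_star else 0.0 for j in range(len(eta))]

def total_sq_dist(xs, v):
    return sum(sum((xi - vi) ** 2 for xi, vi in zip(x, v)) for x in xs)

random.seed(0)
d, n = 5, 40
mu = [2.0] + [0.0] * (d - 1)                      # 1-sparse true mean
xs = [[m + random.gauss(0, 1) for m in mu] for _ in range(n)]

mu_hat = sparse_estimate(xs)

# Brute-force the infimum over 1-sparse v: for v = t * e_j, the optimal
# t is eta_j (Exercise 1.1 applied to coordinate j), so trying each j suffices.
eta = empirical_mean(xs)
candidates = [[eta[j] if k == j else 0.0 for k in range(d)] for j in range(d)]
best = min(total_sq_dist(xs, v) for v in candidates)

assert abs(total_sq_dist(xs, mu_hat) - best) < 1e-9
```

The final assertion holds because μ̂ is itself one of the 1-sparse candidates, and picking the largest |η_j| minimizes the residual among them.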

While we will use the above estimator, the operation of picking the largest coordinate does not combine well with analytic expressions such as expectations. For this reason, we will use the empirical mean η as an intermediate object in the analysis. We need the following basic properties.

Proposition 1.3. Let x ∼ (N(μ, I_d))^n be a sequence of n independent samples, and let η = (1/n) · ∑_{i=1}^n x_i be the empirical mean. Then η − μ is distributed according to the Gaussian distribution N(0, (1/n) · I_d).

and we are done. So let us assume μ̂_1 = 0 and μ̂_j ≠ 0 for some j > 1. Since we must have μ̂_j = η_j in this case, we have

|μ_1| + |η_j| = |(μ − μ̂)_1| + |(μ − μ̂)_j| ≥ ‖μ − μ̂‖_2 ≥ t.

Also, since η_j must be the largest coordinate in absolute value, we have

|η_j| ≥ |η_1| ≥ |μ_1| − |μ_1 − η_1|.

Adding the above inequalities gives

|μ_1 − η_1| + 2·|η_j| = |μ_1 − η_1| + 2·|μ_j − η_j| ≥ t.

Hence, either |μ_1 − η_1| ≥ t/3 or |μ_j − η_j| ≥ t/3, which is what we wanted to prove.

We can now finish the computation of the expected loss, using the above tail bound. Using s = t^2 in the above bound, we can write it as

P[ ‖μ − μ̂‖_2^2 ≥ s ] ≤ 2d · exp(−ns/18).

This yields the following bound.

Claim 1.6. For the estimator μ̂ as above,

E_{x ∼ (N(μ, I_d))^n} [ ‖μ − μ̂(x)‖_2^2 ] = O( (log d)/n ).

Proof: We use the fact that for a non-negative random variable Z, E[Z] = ∫_0^∞ P[Z ≥ s] ds. Using this, we get

E_{x ∼ (N(μ, I_d))^n} [ ‖μ − μ̂(x)‖_2^2 ] = ∫_0^∞ P[ ‖μ − μ̂‖_2^2 ≥ s ] ds
  = ∫_0^u P[ ‖μ − μ̂‖_2^2 ≥ s ] ds + ∫_u^∞ P[ ‖μ − μ̂‖_2^2 ≥ s ] ds
  ≤ ∫_0^u 1 ds + ∫_u^∞ 2d · exp(−ns/18) ds
  = u + (36d/n) · exp(−nu/18).

Choosing u = c · (log d)/n for an appropriate constant c then finishes the proof.
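As a sanity check on Claim 1.6 (not part of the notes; the parameters and the constant factor below are illustrative), the expected loss can be estimated by simulation and compared against the (log d)/n scale:

```python
import math
import random

def sparse_estimate(xs):
    # Empirical mean, hard-thresholded to its largest-magnitude coordinate.
    n, d = len(xs), len(xs[0])
    eta = [sum(x[j] for x in xs) / n for j in range(d)]
    j_star = max(range(d), key=lambda j: abs(eta[j]))
    return [eta[j] if j == j_star else 0.0 for j in range(d)]

def expected_loss(n, d, trials, rng):
    # Monte Carlo estimate of E ||mu - mu_hat||_2^2 for a 1-sparse mean.
    mu = [1.0] + [0.0] * (d - 1)
    total = 0.0
    for _ in range(trials):
        xs = [[m + rng.gauss(0, 1) for m in mu] for _ in range(n)]
        mu_hat = sparse_estimate(xs)
        total += sum((a - b) ** 2 for a, b in zip(mu, mu_hat))
    return total / trials

n, d = 200, 16
loss = expected_loss(n, d, trials=200, rng=random.Random(1))

# Claim 1.6 predicts a loss of order log(d)/n; the Monte Carlo estimate
# should be within a small constant factor of that scale.
assert loss < 20 * math.log(d) / n
```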

2 I-Projections and applications

We will now talk more about finding the distribution P∗ in a set Π that minimizes D(P‖Q) for a fixed distribution Q. We encountered this when discussing Sanov's theorem and hypothesis testing, and will now discuss its properties in some detail. Suppose Q is the uniform distribution on X. Then we also have

D(P‖Q) = log|X| − H(P).

Hence, in this case P∗ is a distribution that maximizes entropy. In general, when the given information does not uniquely determine a distribution, we choose the P∗ that maximizes entropy. This can be thought of as picking P∗ in the set of distributions Π subject to the least amount of additional assumptions. This is sometimes called the Maximum Entropy Principle. In this lecture, we will characterize the distributions obtained by minimizing KL-divergence (or maximizing entropy).
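For completeness, the identity D(P‖Q) = log|X| − H(P) follows in one line from the definition of KL-divergence when q(x) = 1/|X| for all x:

```latex
D(P\|Q) = \sum_{x \in \mathcal{X}} p(x)\log\frac{p(x)}{1/|\mathcal{X}|}
        = \log|\mathcal{X}| + \sum_{x \in \mathcal{X}} p(x)\log p(x)
        = \log|\mathcal{X}| - H(P),
```

so minimizing D(P‖Q) over Π is equivalent to maximizing H(P) over Π.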

For a closed convex set Π, such a minimizer P∗ is called the I-projection of Q onto Π.

Definition 2.1. Let Π be a closed convex set of distributions over X, and assume that Supp(Q) = X. Then

P∗ = Proj_Π(Q) := argmin_{P ∈ Π} D(P‖Q).

Note that the assumption Supp(Q) = X above is without loss of generality, since D(P‖Q) = ∞ for any P such that Supp(P) ⊈ Supp(Q). Use the (strict) convexity of KL-divergence to check the following.

Exercise 2.2. For a closed, convex set Π, the projection P∗ = Proj_Π(Q) exists and is unique.

It is immediate from the definition that if P ∈ Π, then D(P‖Q) ≥ D(P∗‖Q). In fact, P∗ tells us more: it also tells us how "far" P is from Q in KL-divergence.

Theorem 2.3. Let P∗ = Proj_Π(Q). Then, for all P ∈ Π,

Supp(P) ⊆ Supp(P∗)  and  D(P‖Q) ≥ D(P‖P∗) + D(P∗‖Q).

Proof: Define P_t = tP + (1 − t)P∗, where t ∈ [0, 1]. By minimality of P∗, it is clear that D(P_t‖Q) − D(P∗‖Q) ≥ 0. By the mean value theorem, we also have

(1/t) · ( D(P_t‖Q) − D(P∗‖Q) ) = [ d/dt D(P_t‖Q) ]_{t=t′}  for some t′ ∈ [0, t].

Since t′ → 0 as t → 0, we get

lim_{t↓0} d/dt D(P_t‖Q) ≥ 0.

1. Show that P∗ = ( 1 with prob. 1/ , 0 with prob. 1/ ).
2. Show that D(P‖Q) > D(P‖P∗) + D(P∗‖Q) for the above example.

Next, we show how to compute and characterize I-projections for some special sets of distributions.

2.1 Linear families and I-projections

Definition 2.5. For any given real-valued functions f_1, f_2, . . . , f_k on X and α_1, α_2, . . . , α_k ∈ R, the set

L = { P | ∑_{x∈X} p(x) · f_i(x) = E_{x∼P}[f_i(x)] = α_i, ∀i ∈ [k] }

is called a linear family of distributions.

We show that for linear families, the inequality proved above is in fact tight. Moreover, the projection P∗ lies in the interior of the polytope defining L.

Lemma 2.6. Let L be a linear family given by

L = { P : ∑_{x∈X} p(x) · f_i(x) = α_i, ∀i ∈ [k] },

with ⋃_{P∈L} Supp(P) = X, and let P∗ = Proj_L(Q). Then, for all P ∈ L:

1. There exists β > 0 such that for t ∈ [−β, 0], P_t = tP + (1 − t)P∗ ∈ L.
2. D(P‖Q) = D(P‖P∗) + D(P∗‖Q).

That is, the I-projection P∗ of Q onto L satisfies the Pythagorean identity.

Proof: Recall that Supp(P) ⊆ Supp(P∗) and p_t(x) = t · p(x) + (1 − t) · p∗(x). Since the conditions defining L are linear, we have that for all t ∈ R and all i ∈ [k],

∑_{x∈X} p_t(x) · f_i(x) = t · ∑_{x∈X} p(x) · f_i(x) + (1 − t) · ∑_{x∈X} p∗(x) · f_i(x) = α_i.

However, we may not have p_t(x) ≥ 0 for all t < 0. We find a β > 0 such that for t ∈ [−β, 0],

p_t(x) ≥ 0  ⇔  t · (p(x) − p∗(x)) ≥ −p∗(x).

Note that the above inequality clearly holds if p(x) − p∗(x) < 0. Now choose β such that

β = min_{x : p(x) − p∗(x) > 0} { p∗(x) / (p(x) − p∗(x)) }.

Notice that β > 0 since Supp(P∗) ⊇ ⋃_{P∈L} Supp(P).

The above implies that d/dt D(P_t‖Q)|_{t=0} = 0 by the minimality of P∗, which in turn implies the equality D(P‖Q) = D(P‖P∗) + D(P∗‖Q).

The above can also be used to show that the I-projection onto L is of a special form. To describe this, we define the following family of distributions.

Definition 2.7. Let Q be a given distribution. For any given functions g_1, g_2, . . . , g_k on X, the set

E_Q(g_1, . . . , g_k) := { P | ∃ λ_1, . . . , λ_k ∈ R such that ∀x ∈ X, p(x) = c · q(x) · exp( ∑_{i=1}^k λ_i · g_i(x) ) }

(where c is a normalizing constant) is called an exponential family of distributions.

We will show that P∗ = Proj_L(Q) ∈ E_Q(f_1, . . . , f_k). We prove this for a linear family defined by a single constraint; the proof for families with multiple constraints is identical. Let f : X → R and let L be defined as

L = { P | ∑_{x∈X} p(x) · f(x) = E_{x∼P}[f(x)] = α }.

The projection P∗ is the optimal solution to the convex program

minimize   D(P‖Q)
subject to ∑_{x∈X} p(x) · f(x) = α
           ∑_{x∈X} p(x) = 1
           p(x) ≥ 0  ∀x ∈ X.

For λ_0, λ_1 ∈ R, we write the Lagrangian as

Λ(P; λ_0, λ_1) = D(P‖Q) + λ_0 · ( ∑_x p(x) − 1 ) + λ_1 · ( ∑_x p(x) · f(x) − α ).
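Setting the gradient of this Lagrangian to zero forces p(x) ∝ q(x) · exp(λ_1 · f(x)), i.e., P∗ ∈ E_Q(f). A minimal numeric sketch in Python (the die example and all names below are illustrative, not from the notes): computing Proj_L(Q) for Q uniform on {1, . . . , 6} under the single constraint E[X] = 4.5, by bisection on the multiplier λ.

```python
import math

def i_projection(support, q, f, alpha, lo=-50.0, hi=50.0, iters=100):
    """I-projection of Q onto L = {P : E_P[f] = alpha}, using the
    exponential-family form p(x) = c * q(x) * exp(lam * f(x))."""
    def tilt(lam):
        w = [qx * math.exp(lam * f(x)) for x, qx in zip(support, q)]
        z = sum(w)                      # normalizing constant 1/c
        return [wx / z for wx in w]

    def mean(lam):
        return sum(px * f(x) for x, px in zip(support, tilt(lam)))

    # E_{P_lam}[f] is increasing in lam, so bisection finds the multiplier.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return tilt((lo + hi) / 2)

support = [1, 2, 3, 4, 5, 6]
q = [1 / 6] * 6          # Q uniform, so the I-projection maximizes entropy
p_star = i_projection(support, q, lambda x: x, alpha=4.5)

assert abs(sum(px * x for px, x in zip(p_star, support)) - 4.5) < 1e-6
```

Since Q is uniform here, this P∗ is also the maximum-entropy distribution on {1, . . . , 6} with mean 4.5, matching the Maximum Entropy Principle discussed above.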