





Information and Coding Theory Winter 2021
Lecturer: Madhur Tulsiani
We will conclude our discussion of minimax rates with a final example: estimating the mean when we are additionally promised that the mean is a sparse vector. Consider the set of normal distributions where the mean has at most one non-zero coordinate.
Π = { N(μ, I_d) | μ ∈ ℝ^d, ‖μ‖₀ ≤ 1 }
Let θ(P) = E_{x∼P}[x] be the mean, and let ℓ(θ̂, θ) = ‖θ̂ − θ‖₂² as before. From the previous
examples, it seems like the empirical mean estimator is always the best one, and the role of information theory is primarily for proving lower bounds. However, it can also serve as a guide for the right bound to aim for. For this problem, it will be much easier to prove a lower bound. We will then show an estimator which matches this bound.
Let V = {e₁, ..., e_d} be the set of standard basis vectors in ℝ^d. Consider the set of distributions P_v = N(√2·δ·v, I_d) for all v ∈ V. Note that the means μ_v = √2·δ·v satisfy ‖μ_{v₁} − μ_{v₂}‖ = 2δ for all v₁ ≠ v₂. Using the bound from the previous lecture, we get
M_n(Π, ℓ) ≥ δ² · ( 1 − (n · E_{v₁,v₂∈V}[D(P_{v₁}‖P_{v₂})] + 1) / log|V| )
          ≥ δ² · ( 1 − (n · 4δ²/(2 ln 2) + 1) / log d )
          ≥ c · (log d)/n

for an appropriate constant c > 0, using the choice δ² = c′ · (log d)/n. We will now show that this lower bound is actually tight.
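The KL term above comes from the standard formula for the divergence between two Gaussians with identity covariance (measured in bits, hence the factor ln 2), applied to means at distance 2δ:

```latex
D\big(N(\mu_{v_1}, I_d) \,\|\, N(\mu_{v_2}, I_d)\big)
  \;=\; \frac{\|\mu_{v_1} - \mu_{v_2}\|_2^2}{2\ln 2}
  \;=\; \frac{(2\delta)^2}{2\ln 2}
  \;=\; \frac{4\delta^2}{2\ln 2}.
```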
The optimal estimator for the above problem in fact extends the characterization of the mean as the minimizer of the total squared distance from the sample points. Recall the following.
Exercise 1.1. Let x₁, ..., x_n ∈ ℝ^d. Then the empirical mean η = (1/n) · ∑_{i=1}^n x_i satisfies

∑_{i=1}^n ‖x_i − η‖₂² = inf_{v ∈ ℝ^d} ∑_{i=1}^n ‖x_i − v‖₂²
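Exercise 1.1 is easy to spot-check numerically; the following plain-Python sketch (with arbitrarily chosen sample points) compares the empirical mean against a few perturbations of it:

```python
# Check that the empirical mean minimizes the total squared distance
# to the sample points (Exercise 1.1), on a small hand-picked example.

def sq_dist(x, v):
    return sum((xi - vi) ** 2 for xi, vi in zip(x, v))

def total_sq_dist(xs, v):
    return sum(sq_dist(x, v) for x in xs)

xs = [(0.0, 1.0), (2.0, 3.0), (4.0, -1.0)]  # arbitrary points in R^2
n, d = len(xs), 2
eta = tuple(sum(x[j] for x in xs) / n for j in range(d))  # empirical mean

# The empirical mean should beat any perturbation of itself.
for delta in [(0.1, 0.0), (0.0, -0.2), (0.5, 0.5)]:
    v = tuple(e + dl for e, dl in zip(eta, delta))
    assert total_sq_dist(xs, eta) <= total_sq_dist(xs, v)

print(eta)  # (2.0, 1.0)
```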
Given a sequence of samples x = (x₁, ..., x_n), let η denote the empirical mean

η := (1/n) · ∑_{i=1}^n x_i.
As we saw above, the empirical mean is the minimizer of the total squared distance. However, it is not sparse. We take our estimator μ̂ to consist only of the largest entry (in absolute value) of η, setting all other entries to zero, i.e.,
μ̂_j := { η_j   if j = argmax_{k∈[d]} |η_k|
       { 0     otherwise
Note that the above definition does not make sense if the coordinate maximizing |η_k| is not unique. In such a case, we arbitrarily pick one of the maximizing coordinates. Check that this definition is a constrained version of the above characterization of the empirical mean: while the empirical mean η minimizes the average squared distance from the sample points over all of ℝ^d, the estimator above is the minimizer over all sparse vectors.
Exercise 1.2. Check that for μ̂ defined as above,

∑_{i=1}^n ‖x_i − μ̂‖₂² = inf_{‖v‖₀ ≤ 1} { ∑_{i=1}^n ‖x_i − v‖₂² }
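A minimal Python sketch of the estimator μ̂ (the sample points and the smallest-index tie-break are arbitrary choices):

```python
# Sparse mean estimator: keep only the largest-magnitude coordinate
# of the empirical mean, zeroing out all others (ties broken by index).

def empirical_mean(xs):
    n, d = len(xs), len(xs[0])
    return [sum(x[j] for x in xs) / n for j in range(d)]

def sparse_estimate(xs):
    eta = empirical_mean(xs)
    j = max(range(len(eta)), key=lambda k: abs(eta[k]))  # argmax_k |eta_k|
    return [eta[k] if k == j else 0.0 for k in range(len(eta))]

xs = [[0.1, 2.0, -0.3], [-0.1, 2.0, 0.3]]  # two samples in R^3
print(sparse_estimate(xs))  # [0.0, 2.0, 0.0]
```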
While we will use the above estimator, the operation of picking the largest coordinate does not combine well with analytic expressions such as expectations. For this reason, we will use the empirical mean η as an intermediate object in the analysis. We need the following basic properties.
Proposition 1.3. Let x ∼ (N(μ, I_d))^n be a sequence of n independent samples, and let η = (1/n) · ∑_{i=1}^n x_i be the empirical mean. Then η − μ is distributed according to the Gaussian distribution N(0, (1/n) · I_d).
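Proposition 1.3 can be spot-checked by simulation; the small sketch below (with arbitrary sample sizes and mean) estimates the variance of one coordinate of η − μ, which should be close to 1/n:

```python
import random
import statistics

# Sanity check of Proposition 1.3: a coordinate of eta - mu should have
# variance about 1/n. The sample sizes and mean below are arbitrary.
random.seed(1)

n, trials = 25, 4000
mu = 3.0  # one coordinate suffices, since the coordinates are independent
devs = [sum(random.gauss(mu, 1.0) for _ in range(n)) / n - mu
        for _ in range(trials)]
var = statistics.pvariance(devs)
print(abs(var - 1.0 / n) < 0.01)  # variance is close to 1/n = 0.04
```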
and we are done. So let's assume μ̂₁ = 0 and μ̂_j ≠ 0 for some j > 1. Since we must have μ̂_j = η_j in this case, we have
|μ₁| + |η_j| = |(μ − μ̂)₁| + |(μ − μ̂)_j| ≥ ‖μ − μ̂‖₂ ≥ t.
Also, since η_j must be the largest coordinate in absolute value, we have

|η_j| ≥ |η₁| ≥ |μ₁| − |μ₁ − η₁|.
Adding the above inequalities gives

|μ₁ − η₁| + 2·|η_j| = |μ₁ − η₁| + 2·|μ_j − η_j| ≥ t.
Hence, either |μ₁ − η₁| ≥ t/3 or |μ_j − η_j| ≥ t/3, which is what we wanted to prove.
We can now finish the computation of the expected loss, using the above tail bound. Since each coordinate of η − μ is distributed as N(0, 1/n), the Gaussian tail bound gives P[|η_k − μ_k| ≥ t/3] ≤ 2·exp(−nt²/18), and a union bound over the d coordinates yields P[‖μ − μ̂‖₂ ≥ t] ≤ 2d·exp(−nt²/18). Using s = t² in this bound, we can write it as
P[ ‖μ − μ̂‖₂² ≥ s ] ≤ 2d · exp(−ns/18).
This yields the following bound.
Claim 1.6. For the estimator μ̂ as above,

E_{x∼(N(μ,I_d))^n} [ ‖μ − μ̂(x)‖₂² ] ≤ C · (log d)/n

for an absolute constant C.
Proof: We use the fact that for a non-negative random variable Z, E[Z] = ∫₀^∞ P[Z ≥ s] ds. Using this, we get

E_{x∼(N(μ,I_d))^n} [ ‖μ − μ̂(x)‖₂² ] = ∫₀^∞ P[ ‖μ − μ̂‖₂² ≥ s ] ds
  = ∫₀^u P[ ‖μ − μ̂‖₂² ≥ s ] ds + ∫_u^∞ P[ ‖μ − μ̂‖₂² ≥ s ] ds
  ≤ ∫₀^u 1 ds + ∫_u^∞ 2d · exp(−ns/18) ds
  = u + (36d/n) · exp(−nu/18).

Choosing u = c · (log d)/n for an appropriate constant c then finishes the proof.
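As a sanity check, not part of the proof, one can simulate the estimator and confirm that its average squared error stays on the order of (log d)/n; the dimension, sample size, placement of the mean, and the generous constant 30 below are all arbitrary choices:

```python
import math
import random

# Monte Carlo sanity check: simulate the sparse mean estimator on samples
# from N(mu, I_d) and check the squared error is O((log d)/n).
random.seed(0)

def trial(n, d):
    scale = 2.0 * math.sqrt(math.log(d) / n)  # mean of this order (assumption)
    mu = [scale] + [0.0] * (d - 1)
    # empirical mean of n samples from N(mu, I_d)
    eta = [mu[j] + sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
           for j in range(d)]
    j = max(range(d), key=lambda k: abs(eta[k]))  # largest coordinate of eta
    mu_hat = [eta[k] if k == j else 0.0 for k in range(d)]
    return sum((a - b) ** 2 for a, b in zip(mu, mu_hat))

n, d = 200, 50
avg_err = sum(trial(n, d) for _ in range(50)) / 50
print(avg_err <= 30 * math.log(d) / n)  # True: error is O((log d)/n)
```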
We will now talk more about finding the distribution in a set Π that minimizes D(P‖Q) for a fixed distribution Q. We encountered this problem when discussing Sanov's theorem and hypothesis testing, and will now discuss its properties in some detail. When Q is the uniform distribution on X, we have

D(P‖Q) = log|X| − H(P)

Hence, in this case P* is a distribution that maximizes entropy. In general, when the given information does not uniquely determine a distribution, we choose the P* that maximizes entropy. This can be thought of as picking P* in the set of distributions Π subject to the least amount of additional assumptions. This is sometimes called the Maximum Entropy Principle. In this lecture, we will characterize the distributions obtained by minimizing KL-divergence (or maximizing entropy).
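The identity D(P‖Q) = log|X| − H(P) for uniform Q is straightforward to verify numerically (logs base 2, with an arbitrary example distribution):

```python
import math

# Verify D(P || Uniform) = log|X| - H(P) on a small example (logs base 2).

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]  # arbitrary distribution on 4 symbols
u = [0.25] * 4                 # uniform distribution on the same alphabet
lhs = kl(p, u)
rhs = math.log2(4) - entropy(p)
assert abs(lhs - rhs) < 1e-12
print(lhs)  # 0.25
```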
For a closed convex set Π, such a minimizer is called the I-projection of Q onto Π.

Definition 2.1. Let Π be a closed convex set of distributions over X. In addition, assume that Supp(Q) = X. Then

Proj_Π(Q) := arg min_{P∈Π} D(P‖Q).
Note that the assumption Supp(Q) = X above is without loss of generality, since D(P‖Q) = ∞ for any P such that Supp(P) ⊈ Supp(Q). Use the (strict) convexity of KL-divergence to check the following.
Exercise 2.2. For a closed, convex set Π, the projection P∗^ = ProjΠ(Q) exists and is unique.
It is immediate from the definition that if P ∈ Π, then D(P‖Q) ≥ D(P*‖Q). In fact, P* tells us more: it also tells us how "far" P is from Q in KL-divergence.
Theorem 2.3. Let P* = Proj_Π(Q). Then, for all P ∈ Π,

Supp(P) ⊆ Supp(P*)
D(P‖Q) ≥ D(P‖P*) + D(P*‖Q)
Proof: Define P_t = tP + (1−t)P*, where t ∈ [0,1]. By minimality of P*, it is clear that D(P_t‖Q) − D(P*‖Q) ≥ 0. By the mean value theorem, we also have that for some t′ ∈ [0,t],

(1/t) · (D(P_t‖Q) − D(P*‖Q)) = [ d/dt D(P_t‖Q) ]_{t=t′}

Since t′ → 0 as t → 0, we get

lim_{t↓0} d/dt D(P_t‖Q) ≥ 0.
Next, we show how to compute and characterize I-projections for some special sets of distributions.
Definition 2.5. For any given real-valued functions f₁, f₂, ..., f_k on X and α₁, α₂, ..., α_k ∈ ℝ, the set

L = { P | ∑_{x∈X} p(x) · f_i(x) = E_{x∼P}[f_i(x)] = α_i, ∀i ∈ [k] }

is called a linear family of distributions.
We show that for linear families, the inequality proved above is in fact tight. Moreover, the projection P* lies in the interior of the polytope defining L.
Lemma 2.6. Let L be a linear family given by

∑_{x∈X} p(x) · f_i(x) = α_i,  i ∈ [k]

and ⋃_{P∈L} Supp(P) = X. Then the I-projection P* = Proj_L(Q) of Q onto L satisfies, for all P ∈ L, the Pythagorean identity

D(P‖Q) = D(P‖P*) + D(P*‖Q)
Proof: Recall that Supp(P) ⊆ Supp(P*) and p_t(x) = t·p(x) + (1−t)·p*(x). Since the conditions defining L are linear, we have that for all t ∈ ℝ and all i ∈ [k],

∑_{x∈X} p_t(x) · f_i(x) = t · ∑_{x∈X} p(x) · f_i(x) + (1−t) · ∑_{x∈X} p*(x) · f_i(x) = α_i
However, we may not have p_t(x) ≥ 0 for all t < 0. We find a β > 0 such that for t ∈ [−β, 0],

p_t(x) ≥ 0 ⇔ t · (p(x) − p*(x)) ≥ −p*(x)

Note that the above inequality clearly holds if p(x) − p*(x) < 0. Now choose β such that

β = min_{x : p(x)−p*(x) > 0} { p*(x) / (p(x) − p*(x)) }
Notice that β > 0 since Supp(P*) ⊇ ⋃_{P∈L} Supp(P).
The above implies that [d/dt D(P_t‖Q)]_{t=0} = 0 by the minimality of P*, since t = 0 is now an interior point of the interval [−β, 1] over which D(P_t‖Q) is minimized at t = 0. This in turn implies the equality D(P‖Q) = D(P‖P*) + D(P*‖Q).
The above can also be used to show that the I-projection onto L is of a special form. To describe this, we define the following family of distributions.
Definition 2.7. Let Q be a given distribution. For any given functions g₁, g₂, ..., g_k on X, the set

E_Q(g₁, ..., g_k) := { P | ∃λ₁, ..., λ_k ∈ ℝ such that ∀x ∈ X, p(x) = c · q(x) · exp( ∑_{i=1}^k λ_i g_i(x) ) }

(where c is a normalizing constant) is called an exponential family of distributions.
We will show that P* = Proj_L(Q) ∈ E_Q(f₁, ..., f_k). We prove this for a linear family defined by a single constraint; the proof for families with multiple constraints is identical. Let f : X → ℝ and let L be defined by

∑_{x∈X} p(x) · f(x) = E_{x∼P}[f(x)] = α
The projection P∗^ is the optimal solution to the convex program
minimize    D(P‖Q)
subject to  ∑_{x∈X} p(x) · f(x) = α
            ∑_{x∈X} p(x) = 1
            p(x) ≥ 0  ∀x ∈ X.
For λ₀, λ₁ ∈ ℝ, we write the Lagrangian as

Λ(P; λ₀, λ₁) = D(P‖Q) + λ₀ · ( ∑_x p(x) − 1 ) + λ₁ · ( ∑_x p(x) · f(x) − α )
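As a hedged illustration of the single-constraint case (the finite alphabet, Q, f, and α below are arbitrary choices): the projection has the exponential-family form p*(x) ∝ q(x) · 2^{λ f(x)}, and λ can be found by a simple bisection so that E_{P*}[f] = α. The Pythagorean identity of Lemma 2.6 can then be checked numerically:

```python
import math

# Sketch: I-projection of Q onto L = {P : E_P[f] = alpha} on a finite
# alphabet. P* has the exponential-family form p*(x) ~ q(x) * 2**(lam*f(x));
# we find lam by bisection, since E_{P_lam}[f] is increasing in lam.

def tilt(q, f, lam):
    # Exponential tilting of q by f; returns the normalized distribution.
    w = [qi * 2.0 ** (lam * fi) for qi, fi in zip(q, f)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_f(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def project(q, f, alpha, lo=-50.0, hi=50.0, iters=200):
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_f(tilt(q, f, mid), f) < alpha:
            lo = mid
        else:
            hi = mid
    return tilt(q, f, (lo + hi) / 2.0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.7, 0.2, 0.1]   # base distribution Q (arbitrary)
f = [0.0, 1.0, 2.0]   # constraint function f (arbitrary)
alpha = 1.0           # target expectation (arbitrary)

p_star = project(q, f, alpha)
assert abs(mean_f(p_star, f) - alpha) < 1e-9   # p* lies in L

# Pythagorean identity for another member P of the linear family L:
p = [0.3, 0.4, 0.3]   # E_p[f] = 0.4 + 0.6 = 1.0, so p is in L
assert abs(kl(p, q) - (kl(p, p_star) + kl(p_star, q))) < 1e-9
```

The bisection works because tilting by a larger λ shifts mass toward larger values of f, so the constraint expectation is monotone in λ.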