EECS 598: Statistical Learning Theory, Winter 2014, Topic 5

Vapnik-Chervonenkis Theory

Lecturer: Clayton Scott    Scribes: Srinagesh Sharma, Scott Reed, Petter Nilsson

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Introduction

Let's say we are given training data $\{(X_i, Y_i)\}_{i=1}^{n}$ drawn independently from the distribution $P_{XY}$ on $\mathcal{X} \times \mathcal{Y}$, and a set of classifiers $\mathcal{H}$. In the previous lecture, we saw that performance guarantees for empirical risk minimization over $\mathcal{H}$ follow from uniform deviation bounds of the form
$$\Pr\left( \sup_{h \in \mathcal{H}} \big| \widehat{R}_n(h) - R(h) \big| \ge \epsilon \right) \le \delta,$$
where $\widehat{R}_n(h) := \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{h(X_i) \ne Y_i\}}$ is the empirical risk and $R(h)$ is the true risk. We also established such a bound for finite $\mathcal{H}$. In these notes we turn our attention to the setting where $\mathcal{H}$ is infinite, and possibly uncountable. This will lead us to an interesting notion of the capacity of $\mathcal{H}$ known as the Vapnik-Chervonenkis dimension.
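As a concrete warm-up (this example is an addition to the notes, not part of the original), the following Python sketch estimates the empirical risk $\widehat{R}_n(h)$ of a fixed classifier on simulated data and compares it with a Monte Carlo approximation of the true risk $R(h)$; the distribution $P_{XY}$, the classifier, and the sample sizes are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # A fixed, hypothetical threshold classifier on the real line.
    return (x > 0.5).astype(int)

def sample(n):
    # Hypothetical P_XY: X ~ Uniform[0,1], Y = 1{X > 0.3} with 10% label noise.
    x = rng.uniform(0, 1, n)
    y = (x > 0.3).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - y, y)

# Empirical risk on a training sample of size n = 200.
x_train, y_train = sample(200)
emp_risk = np.mean(h(x_train) != y_train)

# "True" risk approximated on a very large independent sample.
x_big, y_big = sample(1_000_000)
true_risk = np.mean(h(x_big) != y_big)

print(f"empirical risk R_n(h) ~ {emp_risk:.3f}")
print(f"true risk      R(h)   ~ {true_risk:.3f}")
```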

2 VC Theorem

Let $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$. For $x_1, x_2, \ldots, x_n \in \mathcal{X}$ denote
$$N_{\mathcal{H}}(x_1, \ldots, x_n) := \left|\left\{ (h(x_1), \ldots, h(x_n)) : h \in \mathcal{H} \right\}\right|.$$
Clearly $N_{\mathcal{H}}(x_1, \ldots, x_n) \le 2^n$. The $n$th shatter coefficient is defined as
$$S_{\mathcal{H}}(n) := \max_{x_1, \ldots, x_n \in \mathcal{X}} N_{\mathcal{H}}(x_1, \ldots, x_n).$$
If $S_{\mathcal{H}}(n) = 2^n$, then there exist $x_1, \ldots, x_n$ such that
$$N_{\mathcal{H}}(x_1, \ldots, x_n) = 2^n,$$
and we say that $\mathcal{H}$ shatters $x_1, \ldots, x_n$.

Note. The shatter coefficient is sometimes called the growth function in the literature.

The VC dimension of $\mathcal{H}$ is defined as
$$V_{\mathcal{H}} := \max\{ n \mid S_{\mathcal{H}}(n) = 2^n \}.$$
If $S_{\mathcal{H}}(n) = 2^n$ for all $n$, then $V_{\mathcal{H}} := \infty$.

Remark. To show $V_{\mathcal{H}} = V$ we must show that there exists at least one set of $V$ points $x_1, \ldots, x_V$ that can be shattered by $\mathcal{H}$, and that no set of $V + 1$ points can be shattered by $\mathcal{H}$.

The VC dimension and the shatter coefficient relate to the following uniform deviation bound.

Theorem 1. For any $n \ge 1$ and $\epsilon > 0$,
$$\Pr\left( \sup_{h \in \mathcal{H}} \big| \widehat{R}_n(h) - R(h) \big| \ge \epsilon \right) \le 8\, S_{\mathcal{H}}(n)\, e^{-n\epsilon^2/32}, \tag{1}$$
where the probability is with respect to the draw of the training data.

We will show below that $S_{\mathcal{H}}(n) \le (n+1)^{V_{\mathcal{H}}}$. Therefore if $V_{\mathcal{H}}$ is finite then the right hand side of equation (1) is dominated by the exponential and will go to zero as $n \to \infty$. Similar to the case $|\mathcal{H}| < \infty$, we also have a performance guarantee for ERM when $V_{\mathcal{H}} < \infty$.

Corollary 1. If $\widehat{h}_n$ is an empirical risk minimizer (ERM) over $\mathcal{H}$, then
$$\Pr\left( R(\widehat{h}_n) - R^*_{\mathcal{H}} \ge \epsilon \right) \le 8\, S_{\mathcal{H}}(n)\, e^{-n\epsilon^2/128},$$
where the probability is with respect to the draw of the training data. Equivalently, with probability greater than $1 - \delta$,
$$R(\widehat{h}_n) \le R^*_{\mathcal{H}} + \sqrt{\frac{128\left[\log\!\left(\frac{8\, S_{\mathcal{H}}(n)}{\delta}\right)\right]}{n}} \le R^*_{\mathcal{H}} + \sqrt{\frac{128\left[V_{\mathcal{H}} \log(n+1) + \log\!\left(\frac{8}{\delta}\right)\right]}{n}}.$$

Proof. Follows from Theorem 1 using an argument like that when $|\mathcal{H}| < \infty$.
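To get a feel for the rates in Corollary 1, the following sketch (an addition to the notes; the VC dimension $V_{\mathcal{H}} = 3$ and confidence level $\delta = 0.05$ are arbitrary example values) evaluates the second excess-risk bound for several sample sizes, illustrating the $O\big(\sqrt{V_{\mathcal{H}} \log n / n}\big)$ decay.

```python
import numpy as np

def erm_excess_risk_bound(n, vc_dim, delta):
    # Second bound in Corollary 1: sqrt(128 [V_H log(n+1) + log(8/delta)] / n).
    return np.sqrt(128.0 * (vc_dim * np.log(n + 1) + np.log(8.0 / delta)) / n)

for n in [10**3, 10**4, 10**5, 10**6, 10**7]:
    bound = erm_excess_risk_bound(n, vc_dim=3, delta=0.05)
    print(f"n = {n:>8d}:  R(h_n) - R*_H <= {bound:.4f} with prob. >= 0.95")
```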

Corollary 2. If $V_{\mathcal{H}} < \infty$ then $\mathcal{H}$ is uniformly learnable by ERM.

The version of the VC inequality given in Theorem 1 is proved in [1]. That reference also contains a broader discussion of VC theory than is presented here. We will prove the VC inequality later (with perhaps different constants) after discussing Rademacher complexity.

2.1 VC Classes

A VC class is a set of classifiers $\mathcal{H}$ with $V_{\mathcal{H}} < \infty$. We will now consider some examples of $\mathcal{H}$ for which $V_{\mathcal{H}}$ can be established or at least bounded.

Example. Consider the set of classifiers
$$\mathcal{H} = \left\{ \mathbf{1}_{\{x \in R\}} \,\middle|\, R = \prod_{i=1}^{d} [a_i, b_i],\ a_i < b_i \right\}.$$
Let $d = 1$. Given one point, we can always assign it a one or a zero. Therefore $V_{\mathcal{H}} \ge 1$. For two points there are four possible assignments and they can be shattered using $\mathcal{H}$. However, given 3 points, the following assignment cannot be realized by any $h \in \mathcal{H}$. Therefore $N_{\mathcal{H}}(x_1, x_2, x_3) < 8$, and so $V_{\mathcal{H}} = 2$.

Figure 1: This classification does not belong to $N_{\mathcal{H}}(x_1, \ldots, x_n)$ when $d = 1$.
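The $d = 1$ case can be checked by brute force. The sketch below (an addition to the notes; the particular points and the tolerance `eps` are arbitrary) enumerates every labeling of a point set realizable by interval classifiers $\mathbf{1}_{\{a \le x \le b\}}$: two points yield all $4$ labelings, while three points yield only $7 < 8$, consistent with $V_{\mathcal{H}} = 2$.

```python
import itertools
import numpy as np

def interval_labelings(points, eps=1e-6):
    # Distinct labelings of `points` realizable by h(x) = 1{a <= x <= b}.
    # Candidate endpoints just below/above each point suffice to produce
    # every achievable labeling, assuming the points are distinct.
    pts = np.asarray(points, dtype=float)
    cands = np.concatenate([pts - eps, pts + eps])
    labelings = {tuple(np.zeros(len(pts), dtype=int))}  # empty interval
    for a, b in itertools.product(cands, repeat=2):
        if a <= b:
            labelings.add(tuple(((pts >= a) & (pts <= b)).astype(int)))
    return labelings

print(len(interval_labelings([0.2, 0.7])))       # 4 = 2^2: two points are shattered
print(len(interval_labelings([0.1, 0.5, 0.9])))  # 7 < 2^3: (1, 0, 1) is not realizable
```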

For $d = 2$, the four points in Fig. 2 can be shattered by $\mathcal{H}$. Therefore $V_{\mathcal{H}}$ is at least 4. For $n = 5$, there is a maximum and minimum point in each dimension. Consider a set of $\le 4$ points achieving these ...

As a sanity check, this results in the bound
$$\Pr\left( \sup_{h \in \mathcal{H}} \big| \widehat{R}_n(h) - R(h) \big| \ge \epsilon \right) \le 8\, |\mathcal{H}|\, e^{-n\epsilon^2/32},$$
which is basically the same as what was derived previously except for larger constants.

The following result lets us bound the VC dimension of a broad family of classes.

Lemma 1. Let $\mathcal{F}$ be an $m$-dimensional vector space of real-valued functions. Then
$$\mathcal{H} = \left\{ \mathbf{1}_{\{f(x) \ge 0\}} \,\middle|\, f \in \mathcal{F} \right\}$$
has $V_{\mathcal{H}} \le m$.

Proof. Suppose $\mathcal{H}$ shatters $m + 1$ points, say $x_1, \ldots, x_{m+1}$. Define the linear mapping $L : \mathcal{F} \to \mathbb{R}^{m+1}$,
$$L(f) = (f(x_1), \ldots, f(x_{m+1}))^T.$$
Since $\dim(\mathcal{F}) = m$ we have $\dim(L(\mathcal{F})) \le m$, where $L(\mathcal{F})$ denotes the image of $\mathcal{F}$. By the projection theorem,
$$\mathbb{R}^{m+1} = L(\mathcal{F}) \oplus L(\mathcal{F})^{\perp},$$
where $\oplus$ denotes the direct sum. Therefore
$$\dim(L(\mathcal{F})^{\perp}) \ge 1,$$
and so there exists $\gamma \ne 0$, $\gamma \in \mathbb{R}^{m+1}$, such that
$$\gamma^T L(f) = 0 \quad \forall f \in \mathcal{F}.$$
Thus, $\forall f \in \mathcal{F}$,
$$\sum_{i=1}^{m+1} \gamma_i f(x_i) = 0,$$
or equivalently,
$$\sum_{i : \gamma_i \ge 0} \gamma_i f(x_i) = \sum_{i : \gamma_i < 0} -\gamma_i f(x_i).$$
We will assume that at least one $\gamma_i < 0$. If not, replace $\gamma$ by $-\gamma$. Since $\mathcal{H}$ shatters $x_1, \ldots, x_{m+1}$, let $h = \mathbf{1}_{\{f(x) \ge 0\}}$ be such that
$$h(x_i) = 1 \iff \gamma_i \ge 0.$$
For the corresponding $f$,
$$f(x_i) \ge 0 \iff \gamma_i \ge 0.$$
This implies that $\sum_{i : \gamma_i \ge 0} \gamma_i f(x_i) \ge 0$ and $\sum_{i : \gamma_i < 0} -\gamma_i f(x_i) < 0$, which is a contradiction. Therefore, $V_{\mathcal{H}} \le m$.

Let’s apply this result to the class of linear classifiers.

Example. Let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{F} = \left\{ f \mid f(x) = w^T x + b,\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}$. Then $\mathcal{H}$ is the set of all linear classifiers,
$$\mathcal{H} = \left\{ \mathbf{1}_{\{w^T x + b \ge 0\}} \,\middle|\, w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}.$$
Since $\dim(\mathcal{F}) = d + 1$, we deduce from the lemma that $V_{\mathcal{H}} \le d + 1$. In fact, this bound is achieved, so that $V_{\mathcal{H}} = d + 1$. For $d = 2$, $\mathcal{H}$ shatters the vertices of any nondegenerate triangle; for $d = 3$, $\mathcal{H}$ shatters the vertices of a tetrahedron; and for general $d$, $\mathcal{H}$ shatters the zero vector along with the standard basis in $\mathbb{R}^d$.
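The claim for $d = 2$ can be verified directly. Because the vertices of a nondegenerate triangle are affinely independent, the $3 \times 3$ system $w^T x_i + b = s_i$ is solvable for every sign pattern $s \in \{-1, +1\}^3$, and the resulting linear classifier realizes that labeling. The sketch below (an addition; the particular triangle is arbitrary) checks all $2^3 = 8$ labelings.

```python
import itertools
import numpy as np

# Vertices of an arbitrary nondegenerate triangle in R^2.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
A = np.hstack([X, np.ones((3, 1))])  # rows [x_i^T, 1]; invertible by affine independence

for labels in itertools.product([0, 1], repeat=3):
    s = np.where(np.array(labels) == 1, 1.0, -1.0)  # target signs
    wb = np.linalg.solve(A, s)                       # solve w^T x_i + b = s_i exactly
    w, b = wb[:2], wb[2]
    realized = (X @ w + b >= 0).astype(int)
    assert tuple(realized) == labels

print("All 8 labelings realized: V_H >= 3 = d + 1 for linear classifiers in R^2.")
```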

3 Sauer’s Lemma

This is a bound on the shatter coefficient that was proved independently by Vapnik and Chervonenkis (1971), Sauer (1972), and Shelah (1972).

Theorem 2. Let $V = V_{\mathcal{H}} < \infty$. For all $n \ge 1$,
$$S_{\mathcal{H}}(n) \le \sum_{i=0}^{V} \binom{n}{i}.$$

Before proving this theorem we consider several corollaries.

Corollary 3. If $V < \infty$, then $\forall n \ge 1$,
$$S_{\mathcal{H}}(n) \le (n+1)^V.$$

Proof. By the binomial theorem,
$$(n+1)^V = \sum_{i=0}^{V} \binom{V}{i} n^i = \sum_{i=0}^{V} \frac{V!}{(V-i)!\,i!}\, n^i \ge \sum_{i=0}^{V} \frac{n^i}{i!} \ge \sum_{i=0}^{V} \frac{n!}{(n-i)!\,i!} = \sum_{i=0}^{V} \binom{n}{i},$$
where the two inequalities use $\frac{V!}{(V-i)!} \ge 1$ and $n^i \ge n(n-1)\cdots(n-i+1)$.

Corollary 4. $\forall n \ge V$,
$$S_{\mathcal{H}}(n) \le \left(\frac{ne}{V}\right)^V.$$

Proof. If $\frac{V}{n} \le 1$, then
$$\left(\frac{V}{n}\right)^V \sum_{i=0}^{V} \binom{n}{i} \le \sum_{i=0}^{V} \left(\frac{V}{n}\right)^i \binom{n}{i} \le \sum_{i=0}^{n} \left(\frac{V}{n}\right)^i \binom{n}{i} = \left(1 + \frac{V}{n}\right)^n \le e^V.$$
Therefore
$$\sum_{i=0}^{V} \binom{n}{i} \le \left(\frac{n}{V}\right)^V e^V = \left(\frac{ne}{V}\right)^V.$$

Corollary 5. If $V > 2$, then $\forall n \ge V$,
$$S_{\mathcal{H}}(n) \le n^V.$$

Proof. If $V > 2$, then $\frac{e}{V} < 1$, so the statement holds by Corollary 4.
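A quick numerical sanity check of Theorem 2's right-hand side against Corollaries 3 and 4 (an added sketch; the value $V = 5$ and the grid of $n$ are arbitrary):

```python
import math

def sauer_sum(n, V):
    # Right-hand side of Sauer's lemma: sum_{i=0}^{V} C(n, i).
    return sum(math.comb(n, i) for i in range(V + 1))

V = 5
for n in [5, 10, 50, 200]:
    s = sauer_sum(n, V)
    poly = (n + 1) ** V                 # Corollary 3
    euler = (n * math.e / V) ** V       # Corollary 4 (valid here since n >= V)
    assert s <= poly and s <= euler
    print(f"n={n:>4d}  sum C(n,i) = {s:>11d}  (n+1)^V = {poly:>12d}  (ne/V)^V = {euler:>14.1f}")
```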

Proof of Sauer's Lemma. For $n \le V$,
$$\sum_{i=0}^{V} \binom{n}{i} \ge \sum_{i=0}^{n} \binom{n}{i} = 2^n = S_{\mathcal{H}}(n).$$

4 VC Theory for Sets

For a class of sets $\mathcal{G} \subseteq 2^{\mathcal{X}}$, the quantities $N_{\mathcal{G}}(x_1, \ldots, x_n)$, $S_{\mathcal{G}}(n)$, and $V_{\mathcal{G}}$ are defined in analogy to our definitions for classifiers. Indeed, sets and binary classifiers are equivalent via
$$G \mapsto h_G(x) = \mathbf{1}_{\{x \in G\}},$$
$$h \mapsto G_h = \{x : h(x) = 1\}.$$
This gives a VC theorem for sets.

Corollary 6. If $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} Q$, then for any $\mathcal{G}$, $n \ge 1$, $\epsilon > 0$,
$$\Pr\left( \sup_{G \in \mathcal{G}} \big| \widehat{Q}(G) - Q(G) \big| \ge \epsilon \right) \le 8\, S_{\mathcal{G}}(n)\, e^{-n\epsilon^2/32},$$
where
$$\widehat{Q}(G) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in G\}}.$$

Proof. Define $P_{XY}$ on $\mathcal{X} \times \{0,1\}$ such that $P_{XY}(Y = 0) = 1$, $P_{X|Y=0} = Q$, and $P_{X|Y=1}$ is arbitrary. Then
$$\begin{aligned} R(h) &= P_{XY}(h(X) \ne Y) \\ &= P_{XY}(Y = 1)\, P_{X|Y=1}(h(X) = 0) + P_{XY}(Y = 0)\, P_{X|Y=0}(h(X) = 1) \\ &= Q(G_h). \end{aligned}$$
Similarly, $\widehat{R}_n(h) = \widehat{Q}(G_h)$, and so
$$\sup_{h \in \mathcal{H}} \big| \widehat{R}_n(h) - R(h) \big| = \sup_{G \in \mathcal{G}} \big| \widehat{Q}(G) - Q(G) \big|,$$
where $\mathcal{H}$ is defined in terms of $\mathcal{G}$ via $G \mapsto h_G = \mathbf{1}_{\{x \in G\}}$. Finally, note that $S_{\mathcal{G}}(n) = S_{\mathcal{H}}(n)$.

As an application, consider $\mathcal{X} = \mathbb{R}$ and $\mathcal{G} = \{(-\infty, t] : t \in \mathbb{R}\}$. Then $S_{\mathcal{G}}(n) = n + 1$ (see Fig. 4).


Figure 5: Different ways to intersect sets with points.

Let $X \sim Q$. Denote $G_t = (-\infty, t]$. Then
$$Q(G_t) = \Pr(X \le t) =: F(t) \quad \text{(CDF)},$$
$$\widehat{Q}(G_t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}} =: \widehat{F}_n(t) \quad \text{(empirical CDF)}.$$

Corollary 7. For all $Q$, $n \ge 1$, $\epsilon > 0$,
$$\Pr\Big( \underbrace{\sup_{t \in \mathbb{R}} \big| \widehat{F}_n(t) - F(t) \big|}_{\|\widehat{F}_n - F\|_\infty} \ge \epsilon \Big) \le 8 (n+1)\, e^{-n\epsilon^2/32}.$$

This is known as the Dvoretzky-Kiefer-Wolfowitz inequality. Tighter versions exist [3, 2].
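To see Corollary 7 at work, the sketch below (an addition; the choice $Q = \mathrm{Uniform}[0,1]$, the sample sizes, and $\delta = 0.05$ are arbitrary) computes $\sup_t |\widehat{F}_n(t) - F(t)|$ exactly from the order statistics and compares it with the deviation level $\epsilon$ obtained by setting the right-hand side of Corollary 7 equal to $\delta$, namely $\epsilon = \sqrt{32 \log(8(n+1)/\delta)/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation_uniform(n):
    # Exact sup_t |F_n(t) - F(t)| for X_i ~ Uniform[0,1], where F(t) = t.
    x = np.sort(rng.uniform(0, 1, n))
    i = np.arange(1, n + 1)
    # The supremum is attained at (or just before) an order statistic.
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

def corollary7_epsilon(n, delta):
    # Solve 8 (n+1) exp(-n eps^2 / 32) = delta for eps.
    return np.sqrt(32.0 * np.log(8.0 * (n + 1) / delta) / n)

for n in [100, 1_000, 10_000, 100_000]:
    obs = sup_deviation_uniform(n)
    eps = corollary7_epsilon(n, delta=0.05)
    print(f"n={n:>7d}  observed sup|F_n - F| = {obs:.4f}   Corollary 7 epsilon = {eps:.4f}")
```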

5 Monotone Layers and Convex Sets

Earlier, we obtained convergence guarantees for an empirical risk minimizing classifier when the VC dimension of the classifier set $\mathcal{H}$ was finite. The purpose of this section is to obtain such guarantees also in cases of infinite VC dimension. This requires assumptions about the underlying distribution; in other words, the result will not be distribution free, as was possible before. In particular, we will assume that the samples are drawn from a distribution that has a bounded density with compact support. This material is based on a similar result in Chapter 13 of [1], where a weaker assumption on the distribution of $X$ is made.

In the following, we consider the feature space $\mathcal{X} = \mathbb{R}^2$ and the following two families of classifiers:
$$\mathcal{C} = \left\{ \mathbf{1}_{\{x \in C\}} \mid C \text{ is convex} \right\},$$
$$\mathcal{L} = \left\{ \mathbf{1}_{\{x \in L\}} \mid L = \{(x_1, x_2) \in \mathbb{R}^2 : x_2 \le \psi(x_1)\} \text{ for non-increasing } \psi : \mathbb{R} \to \mathbb{R} \right\}.$$
The classifiers in $\mathcal{L}$ are called monotone layers. Both families have infinite VC dimension. The fact that the VC dimension of $\mathcal{C}$ is infinite has been shown earlier. To see that $\mathcal{L}$ also has infinite VC dimension, it suffices to see that any set of points placed decreasingly can be shattered by $\mathcal{L}$, as shown in Figure 6.

Figure 6: A non-increasing function shatters a set of non-increasingly placed points.

In the proof of the VC theorem, it is shown in an intermediate step that
$$\Pr\left( \sup_{h \in \mathcal{H}} \big| \widehat{R}_n(h) - R(h) \big| \ge \varepsilon \right) \le 8\, \mathbb{E}\{ N_{\mathcal{H}}(X_1, \ldots, X_n) \}\, e^{-n\varepsilon^2/32}.$$
Here, $N_{\mathcal{H}}(X_1, \ldots, X_n)$ is the number of possible labelings of $X_1, \ldots, X_n$ by $\mathcal{H}$. We will now bound this quantity.

Theorem 3. If $X$ has a density $f$ such that $\|f\|_\infty < \infty$ and $\mathrm{supp}(f)$ is bounded, then for $\mathcal{H} = \mathcal{C}$ or $\mathcal{H} = \mathcal{L}$,
$$\mathbb{E}\{ N_{\mathcal{H}}(X_1, \ldots, X_n) \} \le e^{c\sqrt{n}}$$
for a constant $c$.

Remark. In the theorem statement, $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$ is the sup norm. The support of a function is defined as $\mathrm{supp}(f) = \overline{\{x \in \mathcal{X} : f(x) > 0\}}$, where the line denotes the set closure.

Corollary 8. If $\widehat{h}_n$ is an empirical risk minimizer (ERM) over $\mathcal{H} = \mathcal{C}$ or $\mathcal{H} = \mathcal{L}$, then for all $\varepsilon > 0$,
$$\Pr\left( R(\widehat{h}_n) - R^*_{\mathcal{H}} \ge \varepsilon \right) \le 8\, e^{c\sqrt{n} - n\varepsilon^2/128}.$$
Equivalently, with probability at least $1 - \delta$,
$$R(\widehat{h}_n) - R^*_{\mathcal{H}} \le \sqrt{\frac{128\left[ c\sqrt{n} + \log(8/\delta) \right]}{n}} = O(n^{-1/4}).$$

... outside $c(\psi)$ will be uniquely labeled by $\psi$; therefore only cells in $c(\psi)$ contribute to $N_{\mathcal{L}}$, with obviously at most $2^{N_i}$ ways to assign labels to the $N_i$ points in $C_i$.

We now bound the number of terms in the sum and the product separately.

1. It is possible to encode $c(\psi)$ with $2m$ bits. One such encoding can be done using a bit vector
$$(r_1, \ldots, r_{m-1}, c_1, \ldots, c_{m-1}, b_0, b_1) \in \{0,1\}^{2m}.$$
Here the bits are defined as follows:

   • $r_i = 1$ iff the path turns right at row $i$, $i = 1, \ldots, m-1$.
   • $c_i = 1$ iff the path turns down at column $i$, $i = 1, \ldots, m-1$.
   • $b_0 = 1$ iff the first turn is right (as opposed to down).
   • $b_1 = 1$ iff the last turn is right (as opposed to down).

   This suffices because paths must alternate down and right turns. Consequently, there are at most $2^{2m}$ possible $c(\psi)$.

2. We bound the expected value of $\prod_{C_i \in c(\psi)} 2^{N_i}$ using the moment-generating function (MGF) of a multinomial:
$$\mathbb{E}\left[ \prod_{C_i \in c(\psi)} 2^{N_i} \right] = \mathbb{E}\left[ 2^{\sum_{C_i \in c(\psi)} N_i} \right] = \mathbb{E}\left[ \exp\Big( \ln(2) \sum_{C_i \in c(\psi)} N_i \Big) \right] = \left( 1 + \sum_{C_i \in c(\psi)} p_i \right)^n \le \exp\left( n \sum_{C_i \in c(\psi)} p_i \right).$$
The last equality follows from the formula for the multinomial MGF, while the last bound follows from
$$\left( 1 + \frac{x}{n} \right)^n = \sum_{i=0}^{n} \binom{n}{i} \left( \frac{x}{n} \right)^i = \sum_{i=0}^{n} \frac{n!}{i!(n-i)!} \left( \frac{x}{n} \right)^i \le \sum_{i=0}^{n} \frac{x^i}{i!} \le e^x.$$

Now, since the volume of each cell is $(2r/m)^2$ and there are at most $2m$ cells in a path,
$$\sum_{C_i \in c(\psi)} p_i = \sum_{C_i \in c(\psi)} \int_{C_i} f(x)\, dx \le 2m \cdot \|f\|_\infty \left( \frac{2r}{m} \right)^2 = \frac{8r^2}{m} \|f\|_\infty.$$
By combining 1 and 2, it follows that
$$\mathbb{E}\{ N_{\mathcal{L}}(X_1, \ldots, X_n) \} \le 2^{2m}\, e^{8 n r^2 \|f\|_\infty / m},$$
a bound which holds for all choices of $m$. By tuning this parameter as $m \sim \sqrt{n}$, the final bound becomes
$$\mathbb{E}\{ N_{\mathcal{L}}(X_1, \ldots, X_n) \} \le e^{c\sqrt{n}}$$
for a constant $c$ depending only on $r$ and $\|f\|_\infty$.
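The trade-off behind the choice $m \sim \sqrt{n}$ can be made explicit: the logarithm of the bound $2^{2m} e^{8 n r^2 \|f\|_\infty / m}$ is $2m \ln 2 + 8 n r^2 \|f\|_\infty / m$, which is minimized at $m^* = \sqrt{8 n r^2 \|f\|_\infty / (2 \ln 2)} \propto \sqrt{n}$. The sketch below (an addition; the values of $n$, $r$, and $\|f\|_\infty$ are arbitrary) compares the calculus minimizer with a grid search over integer $m$.

```python
import numpy as np

n, r, f_sup = 10_000, 1.0, 1.0          # hypothetical values
C = 8 * n * r**2 * f_sup

def log_bound(m):
    # log of 2^{2m} * exp(8 n r^2 ||f||_inf / m)
    return 2 * m * np.log(2) + C / m

m_star = np.sqrt(C / (2 * np.log(2)))   # calculus minimizer, proportional to sqrt(n)
ms = np.arange(1, 2001)
m_grid = ms[np.argmin(log_bound(ms))]   # grid search over integer m

print(f"m* (analytic) = {m_star:.1f}, m (grid) = {m_grid}, sqrt(n) = {np.sqrt(n):.1f}")
print(f"minimal log-bound ~ {log_bound(m_star):.1f}, i.e. c ~ {log_bound(m_star)/np.sqrt(n):.2f} in exp(c*sqrt(n))")
```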

Exercises

1. Determine the sample complexity $N(\epsilon, \delta)$ for ERM over a class $\mathcal{H}$ with VC dimension $V_{\mathcal{H}} < \infty$.

2. Show that the VC Theorem for sets implies the VC Theorem for classifiers. Hint: Consider sets of the form $G' = (G \times \{0\}) \cup (G^c \times \{1\}) \subset \mathcal{X} \times \mathcal{Y}$, where $G^c$ denotes the complement.

3. Let $\mathcal{G}_1$ and $\mathcal{G}_2$ denote two classes of sets.

   (a) For $\mathcal{G}_1 \cap \mathcal{G}_2 := \{ G_1 \cap G_2 \mid G_1 \in \mathcal{G}_1,\ G_2 \in \mathcal{G}_2 \}$, show $S_{\mathcal{G}_1 \cap \mathcal{G}_2}(n) \le S_{\mathcal{G}_1}(n)\, S_{\mathcal{G}_2}(n)$.

   (b) For $\mathcal{G}_1 \cup \mathcal{G}_2 := \{ G_1 \cup G_2 \mid G_1 \in \mathcal{G}_1,\ G_2 \in \mathcal{G}_2 \}$, show $S_{\mathcal{G}_1 \cup \mathcal{G}_2}(n) \le S_{\mathcal{G}_1}(n)\, S_{\mathcal{G}_2}(n)$.

4. Show that the following classes have finite VC dimension by exhibiting an explicit upper bound on the VC dimension.

   (a) $\mathcal{X} = \mathbb{R}^d$, $\mathcal{H} = \{ \mathbf{1}_{\{f(x) \ge 0\}} \mid f \text{ is an inhomogeneous quadratic polynomial} \}$.

   (b) $\mathcal{X} = \mathbb{R}^d$, $\mathcal{H} = \{ \mathbf{1}_{\{x \in C\}} \mid C \text{ is a sphere (including boundary and interior)} \}$.

   (c) $\mathcal{X} = \mathbb{R}^2$, $\mathcal{H} = \{ \mathbf{1}_{\{x \in P_k\}} \mid P_k \text{ is a convex polygon containing at most } k \text{ sides} \}$.

   (d) $\mathcal{X} = \mathbb{R}^d$, $\mathcal{H} = \{ \mathbf{1}_{\{x \in R_k\}} \mid R_k \text{ is a union of at most } k \text{ rectangles} \}$.

5. Prove Theorem 3 for $\mathcal{H} = \mathcal{C}$.

References

[1] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.

[2] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation, Springer, 2001.

[3] P. Massart, "The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality," Annals of Probability, vol. 18, pp. 1269-1283, 1990.