






EECS 598: Statistical Learning Theory, Winter 2014 Topic 5
Lecturer: Clayton Scott Scribe: Srinagesh Sharma, Scott Reed, Petter Nilsson
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
Let's say we are given training data $\{(X_i, Y_i)\}_{i=1}^{n}$ drawn independently from a distribution $P_{XY}$ on $\mathcal{X} \times \mathcal{Y}$, and a set of classifiers H. In the previous lecture, we saw that performance guarantees for empirical risk minimization over H follow from uniform deviation bounds of the form
$$\Pr\left( \sup_{h \in H} \left| \hat{R}_n(h) - R(h) \right| \ge \epsilon \right) \le \delta,$$
where $\hat{R}_n(h) := \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{h(X_i) \ne Y_i\}}$ is the empirical risk and $R(h)$ is the true risk. We also established such a bound for finite H. In these notes we turn our attention to the setting where H is infinite, and possibly uncountable. This will lead us to an interesting notion of the capacity of H known as the Vapnik-Chervonenkis dimension.
Let $H \subseteq \{0,1\}^{\mathcal{X}}$. For $x_1, x_2, \ldots, x_n \in \mathcal{X}$ denote
$$N_H(x_1, \ldots, x_n) := \left|\{(h(x_1), \ldots, h(x_n)) : h \in H\}\right|.$$
Clearly $N_H(x_1, \ldots, x_n) \le 2^n$. The $n$th shatter coefficient is defined as
$$S_H(n) := \max_{x_1, \ldots, x_n \in \mathcal{X}} N_H(x_1, \ldots, x_n).$$
If $S_H(n) = 2^n$, then there exist $x_1, \ldots, x_n$ such that $N_H(x_1, \ldots, x_n) = 2^n$, and we say that H shatters $x_1, \ldots, x_n$.
Note. The shatter coefficient is sometimes called the growth function in the literature.
The VC dimension of H is defined as
$$V_H := \max\{n \mid S_H(n) = 2^n\}.$$
If $S_H(n) = 2^n$ for all $n$, then $V_H := \infty$.
Remark. To show $V_H = V$ we must show that there exists at least one set of $V$ points $x_1, \ldots, x_V$ that can be shattered by H, and that no set of $V + 1$ points can be shattered by H.
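These definitions can be explored by brute force. The following Python sketch computes $N_H$ and (a lower bound on) $S_H(n)$ for a small illustrative class, namely threshold classifiers $h_t(x) = \mathbf{1}_{\{x \ge t\}}$ on a finite grid of thresholds; the classifiers and candidate points are arbitrary choices made here for illustration, and maximizing over a finite pool of points only lower-bounds the true shatter coefficient, which is a supremum over all of $\mathcal{X}$.

```python
from itertools import combinations

def N_H(classifiers, points):
    """Number of distinct labelings of `points` realized by the class."""
    return len({tuple(h(x) for x in points) for h in classifiers})

def S_H_lower(classifiers, pool, n):
    """Lower bound on the n-th shatter coefficient: max of N_H over n-point subsets of a finite pool."""
    return max(N_H(classifiers, pts) for pts in combinations(pool, n))

# Illustrative class: threshold classifiers h_t(x) = 1{x >= t}.
thresholds = [lambda x, t=t: int(x >= t) for t in (-2, -1, 0, 1, 2, 3)]
pool = (-1.5, -0.5, 0.5, 1.5, 2.5)

for n in (1, 2, 3):
    print(n, S_H_lower(thresholds, pool, n), 2 ** n)
# Output: 1 2 2 / 2 3 4 / 3 4 8.  One point can be labeled both ways, but the
# labeling (1, 0) of x_1 < x_2 is never realized, so S_H(2) = 3 < 4 and V_H = 1.
```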
The VC dimension and the shatter coefficient relate to the following uniform deviation bound.
Theorem 1. For any $n \ge 1$ and $\epsilon > 0$,
$$\Pr\left( \sup_{h \in H} \left| \hat{R}_n(h) - R(h) \right| \ge \epsilon \right) \le 8\, S_H(n)\, e^{-n\epsilon^2/32}, \qquad (1)$$
where the probability is with respect to the draw of the training data.
We will show below that $S_H(n) \le (n+1)^{V_H}$. Therefore if $V_H$ is finite then the right hand side of equation (1) is dominated by the exponential and will go to zero as $n \to \infty$. Similar to the case $|H| < \infty$, we also have a performance guarantee for ERM when $V_H < \infty$.
Corollary 1. If $\hat{h}_n$ is an empirical risk minimizer (ERM) over H, then
$$\Pr\left( R(\hat{h}_n) - R^*_H \ge \epsilon \right) \le 8\, S_H(n)\, e^{-n\epsilon^2/128},$$
where the probability is with respect to the draw of the training data. Equivalently, with probability greater than $1 - \delta$,
$$R(\hat{h}_n) \le R^*_H + \sqrt{\frac{128\left[\log\left(\tfrac{8 S_H(n)}{\delta}\right)\right]}{n}} \le R^*_H + \sqrt{\frac{128\left[V_H \log(n+1) + \log\left(\tfrac{8}{\delta}\right)\right]}{n}}.$$
Proof. Follows from Theorem 1 using an argument like that when |H| < ∞.
Corollary 2. If VH < ∞ then H is uniformly learnable by ERM.
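To get a feel for the rate in Corollary 1, the following sketch simply evaluates the excess-risk bound $\sqrt{128[V_H \log(n+1) + \log(8/\delta)]/n}$ for a few sample sizes; the values $V_H = 3$ and $\delta = 0.05$ are arbitrary illustrative choices, not from the notes.

```python
import math

def vc_excess_risk_bound(n, vc_dim, delta):
    """Excess risk bound from Corollary 1, using S_H(n) <= (n + 1)^{V_H}."""
    return math.sqrt(128 * (vc_dim * math.log(n + 1) + math.log(8 / delta)) / n)

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(vc_excess_risk_bound(n, vc_dim=3, delta=0.05), 3))
# The bound decays like sqrt(V_H log(n) / n); with these constants it only
# drops below 1 once n is in the thousands.
```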
The version of the VC inequality given in Theorem 1 is proved in [1]. That reference also contains a
broader discussion of VC theory than is presented here. We will prove the VC inequality later (with perhaps
different constants) after discussing Rademacher complexity.
A VC class is a set of classifiers H with VH < ∞. We will now consider some examples of H for which VH
can be established or at least bounded.
Example. Consider the set of classifiers
$$H = \left\{ \mathbf{1}_{\left\{x \in \prod_{i=1}^{d} [a_i, b_i]\right\}} : a_i < b_i \right\},$$
the indicators of axis-aligned rectangles in $\mathbb{R}^d$. Let $d = 1$. Given one point, we can always assign it a one or a zero, so $V_H \ge 1$. For two points there are four possible assignments, and all of them can be realized by H, so the two points are shattered. However, given 3 points $x_1 < x_2 < x_3$, the following assignment cannot be realized by any $h \in H$. Therefore $N_H(x_1, x_2, x_3) < 8$, and so $V_H = 2$.
Figure 1: This labeling is not realized by any $h \in H$ when $d = 1$.
For $d = 2$, the four points in Fig. 2 can be shattered by H, so $V_H \ge 4$. Now consider any $n = 5$ points. There is a maximum and a minimum point in each dimension; consider a subset of at most 4 points achieving these extremes. Any rectangle containing this subset contains the smallest axis-aligned rectangle enclosing all five points, and hence contains the remaining point(s) as well. The labeling that assigns 1 to the extreme points and 0 to a remaining point therefore cannot be realized, so no 5 points can be shattered and $V_H = 4$. More generally, $V_H = 2d$ for axis-aligned rectangles in $\mathbb{R}^d$.
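The argument above can be verified by brute force. In the sketch below (an illustration, not part of the notes), a labeling is realizable by an axis-aligned rectangle if and only if the bounding box of the points labeled 1 contains no point labeled 0; the particular point configurations are arbitrary choices.

```python
from itertools import product

def rectangle_shatters(points):
    """Check whether axis-aligned rectangles realize every labeling of `points`."""
    for labels in product((0, 1), repeat=len(points)):
        ones = [p for p, y in zip(points, labels) if y == 1]
        zeros = [p for p, y in zip(points, labels) if y == 0]
        if not ones:
            continue  # a rectangle disjoint from all the points realizes the all-zeros labeling
        lo = [min(c) for c in zip(*ones)]   # bounding box of the points labeled 1
        hi = [max(c) for c in zip(*ones)]   # (enlarge slightly if degenerate, to satisfy a_i < b_i)
        if any(all(lo[j] <= p[j] <= hi[j] for j in range(len(p))) for p in zeros):
            return False                     # this labeling cannot be realized
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rectangle_shatters(diamond))              # True: V_H >= 4 for d = 2
print(rectangle_shatters(diamond + [(0, 0)]))   # False: the fifth point lies in the bounding box of the other four
```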
As a sanity check, note that for finite H we always have $S_H(n) \le |H|$, so Theorem 1 yields the bound
$$\Pr\left( \sup_{h \in H} \left| \hat{R}_n(h) - R(h) \right| \ge \epsilon \right) \le 8\, |H|\, e^{-n\epsilon^2/32},$$
which is basically the same as what was derived previously, except for larger constants.
The following result lets us bound the VC dimension of a broad family of classes.
Lemma 1. Let $\mathcal{F}$ be an $m$-dimensional vector space of real-valued functions on $\mathcal{X}$. Then
$$H = \left\{ \mathbf{1}_{\{f(x) \ge 0\}} : f \in \mathcal{F} \right\}$$
has $V_H \le m$.
Proof. Suppose H shatters $m + 1$ points, say $x_1, \ldots, x_{m+1}$. Define the linear mapping $L : \mathcal{F} \to \mathbb{R}^{m+1}$,
$$L(f) = (f(x_1), \ldots, f(x_{m+1}))^T.$$
Since $\dim(\mathcal{F}) = m$ we have $\dim(L(\mathcal{F})) \le m$, where $L(\mathcal{F})$ denotes the image of $\mathcal{F}$. By the projection theorem,
$$\mathbb{R}^{m+1} = L(\mathcal{F}) \oplus L(\mathcal{F})^{\perp},$$
where $\oplus$ denotes the direct sum. Therefore $\dim(L(\mathcal{F})^{\perp}) \ge 1$, and so there exists $\gamma \ne 0$, $\gamma \in \mathbb{R}^{m+1}$, such that
$$\gamma^T L(f) = 0 \quad \forall f \in \mathcal{F}.$$
Thus, for all $f \in \mathcal{F}$,
$$\sum_{i=1}^{m+1} \gamma_i f(x_i) = 0,$$
or equivalently,
$$\sum_{i : \gamma_i \ge 0} \gamma_i f(x_i) = \sum_{i : \gamma_i < 0} -\gamma_i f(x_i).$$
We may assume that at least one $\gamma_i < 0$; if not, replace $\gamma$ by $-\gamma$. Since H shatters $x_1, \ldots, x_{m+1}$, there is an $h = \mathbf{1}_{\{f(x) \ge 0\}} \in H$ such that
$$h(x_i) = 1 \iff \gamma_i \ge 0.$$
For the corresponding $f$,
$$f(x_i) \ge 0 \iff \gamma_i \ge 0.$$
This implies that $\sum_{i : \gamma_i \ge 0} \gamma_i f(x_i) \ge 0$ while $\sum_{i : \gamma_i < 0} -\gamma_i f(x_i) < 0$, which is a contradiction. Therefore $V_H \le m$.
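The key step of the proof, that the evaluation map of an $m$-dimensional function space at $m + 1$ points always has a nonzero vector orthogonal to its image, is easy to illustrate numerically. In the sketch below, $\mathcal{F}$ is taken to be the polynomials of degree less than $m$ and the evaluation points are arbitrary; both choices are illustrative, not from the notes.

```python
import numpy as np

m = 4
xs = np.array([-1.3, -0.2, 0.5, 1.1, 2.7])      # m + 1 evaluation points
# Rows are L(f) for the monomial basis f(x) = x^k, k = 0, ..., m - 1.
A = np.vstack([xs ** k for k in range(m)])
# A has rank at most m < m + 1, so its null space is nontrivial; the last
# right singular vector gives a gamma with gamma^T L(f) = 0 for all f in F.
gamma = np.linalg.svd(A)[2][-1]
print(np.allclose(A @ gamma, 0.0))              # True
```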
Let’s apply this result to the class of linear classifiers.
Example. Let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{F} = \{ f : f(x) = w^T x + b,\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \}$. Then H is the set of all linear classifiers,
$$H = \left\{ \mathbf{1}_{\{w^T x + b \ge 0\}} : w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}.$$
Since $\dim(\mathcal{F}) = d + 1$, we deduce from the lemma that $V_H \le d + 1$. In fact, this bound is achieved, so that $V_H = d + 1$: for $d = 2$, H shatters the vertices of any nondegenerate triangle; for $d = 3$, H shatters the vertices of a tetrahedron; and for general $d$, H shatters the zero vector together with the standard basis in $\mathbb{R}^d$.
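The last claim can be checked directly: for every labeling of $\{0, e_1, \ldots, e_d\}$ there is an explicit $(w, b)$ realizing it. The construction used in the sketch below ($b = \pm 1$ for the origin, $w_i = \pm 2$ for $e_i$) is one simple choice made here for illustration, not the only one.

```python
from itertools import product

def shatters_basis_plus_origin(d):
    """Verify that 1{w.x + b >= 0} realizes every labeling of {0, e_1, ..., e_d}."""
    points = [[0.0] * d] + [[float(j == i) for j in range(d)] for i in range(d)]
    for labels in product((0, 1), repeat=d + 1):
        b = 1.0 if labels[0] else -1.0                              # origin: b >= 0 iff its label is 1
        w = [2.0 if labels[i + 1] else -2.0 for i in range(d)]      # e_i: w_i + b has the required sign
        realized = tuple(int(sum(wj * xj for wj, xj in zip(w, x)) + b >= 0) for x in points)
        if realized != labels:
            return False
    return True

print(shatters_basis_plus_origin(5))  # True, so V_H >= d + 1 and hence V_H = d + 1
```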
The following bound on the shatter coefficient was proved independently by Vapnik and Chervonenkis (1971), Sauer (1972), and Shelah (1972).
Theorem 2. Let $V = V_H < \infty$. For all $n \ge 1$,
$$S_H(n) \le \sum_{i=0}^{V} \binom{n}{i}.$$
Before proving this theorem we consider several corollaries:
Corollary 3. If $V < \infty$, then for all $n \ge 1$,
$$S_H(n) \le (n+1)^V.$$
Proof. By the binomial theorem,
$$(n+1)^V = \sum_{i=0}^{V} \binom{V}{i} n^i = \sum_{i=0}^{V} \frac{V!}{(V-i)!\, i!}\, n^i \ge \sum_{i=0}^{V} \frac{n^i}{i!} \ge \sum_{i=0}^{V} \frac{n!}{(n-i)!\, i!} = \sum_{i=0}^{V} \binom{n}{i},$$
and the result follows from Theorem 2.
Corollary 4. For all $n \ge V$,
$$S_H(n) \le \left(\frac{ne}{V}\right)^V.$$
Proof. If $\frac{V}{n} \le 1$ then
$$\left(\frac{V}{n}\right)^V \sum_{i=0}^{V} \binom{n}{i} \le \sum_{i=0}^{V} \binom{n}{i} \left(\frac{V}{n}\right)^i \le \sum_{i=0}^{n} \binom{n}{i} \left(\frac{V}{n}\right)^i = \left(1 + \frac{V}{n}\right)^n \le e^V.$$
Therefore
$$\sum_{i=0}^{V} \binom{n}{i} \le \left(\frac{ne}{V}\right)^V.$$
Corollary 5. If $V > 2$, then for all $n \ge V$,
$$S_H(n) \le n^V.$$
Proof. If $V > 2$, then $\frac{e}{V} < 1$, so the statement holds by Corollary 4.
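Before the proof, here is a quick numerical sanity check of the three relaxations; the parameter $V = 5$ and the sample sizes are arbitrary illustrative values. For $V > 2$ and $n \ge V$ the quantities should satisfy $\sum_{i \le V} \binom{n}{i} \le (ne/V)^V \le n^V \le (n+1)^V$.

```python
from math import comb, e

V = 5
print("n, Sauer sum, (ne/V)^V, n^V, (n+1)^V")
for n in (5, 10, 50, 200):
    sauer = sum(comb(n, i) for i in range(V + 1))
    print(n, sauer, round((n * e / V) ** V), n ** V, (n + 1) ** V)
```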
Proof of Sauer's Lemma. For $n \le V$,
$$\sum_{i=0}^{V} \binom{n}{i} = \sum_{i=0}^{n} \binom{n}{i} = 2^n = S_H(n),$$
so the bound holds in this case. The main work is the case $n > V$; see [1] for a complete proof.
The same theory applies to collections of sets. For a collection $\mathcal{G}$ of subsets of $\mathcal{X}$ and points $x_1, \ldots, x_n \in \mathcal{X}$, define
$$N_{\mathcal{G}}(x_1, \ldots, x_n) := \left|\{\{x_1, \ldots, x_n\} \cap G : G \in \mathcal{G}\}\right|, \qquad S_{\mathcal{G}}(n) := \max_{x_1, \ldots, x_n \in \mathcal{X}} N_{\mathcal{G}}(x_1, \ldots, x_n),$$
in analogy to our definitions for classifiers. Indeed, sets and binary classifiers are equivalent via
$$G \mapsto h_G(x) = \mathbf{1}_{\{x \in G\}}, \qquad h \mapsto G_h = \{x : h(x) = 1\}.$$
This gives a VC theorem for sets.
Corollary 6. If $X_1, \ldots, X_n \stackrel{iid}{\sim} Q$, then for any $\mathcal{G}$, $n \ge 1$, $\epsilon > 0$,
$$\Pr\left( \sup_{G \in \mathcal{G}} \left| \hat{Q}_n(G) - Q(G) \right| \ge \epsilon \right) \le 8\, S_{\mathcal{G}}(n)\, e^{-n\epsilon^2/32},$$
where $\hat{Q}_n(G) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in G\}}$.
Proof. Define $P_{XY}$ on $\mathcal{X} \times \{0,1\}$ such that $P_{XY}(Y = 0) = 1$, $P_{X|Y=0} = Q$, and $P_{X|Y=1}$ is arbitrary. Then
$$R(h) = P_{XY}(h(X) \ne Y) = P_{XY}(Y = 1) P_{X|Y=1}(h(X) = 0) + P_{XY}(Y = 0) P_{X|Y=0}(h(X) = 1) = Q(G_h).$$
Similarly, $\hat{R}_n(h) = \hat{Q}_n(G_h)$, and so
$$\sup_{h \in H} \left| \hat{R}_n(h) - R(h) \right| = \sup_{G \in \mathcal{G}} \left| \hat{Q}_n(G) - Q(G) \right|,$$
where H is defined in terms of $\mathcal{G}$ via $G \mapsto h_G = \mathbf{1}_{\{x \in G\}}$. Finally, note that $S_{\mathcal{G}}(n) = S_H(n)$, so the result follows from Theorem 1.
As an application, consider $\mathcal{X} = \mathbb{R}$ and $\mathcal{G} = \{(-\infty, t] : t \in \mathbb{R}\}$. Then $S_{\mathcal{G}}(n) = n + 1$ (see Fig. 4).
Figure 5: Different ways to intersect sets with points.
Let $X \sim Q$. Denote $G_t = (-\infty, t]$. Then
$$Q(G_t) = \Pr(X \le t) =: F(t) \quad \text{(CDF)},$$
$$\hat{Q}_n(G_t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}} =: \hat{F}_n(t) \quad \text{(empirical CDF)}.$$
Corollary 7. For all $Q$, $n \ge 1$, $\epsilon > 0$,
$$\Pr\left( \underbrace{\sup_{t \in \mathbb{R}} \left| \hat{F}_n(t) - F(t) \right|}_{\|\hat{F}_n - F\|_\infty} \ge \epsilon \right) \le 8 (n+1) e^{-n\epsilon^2/32}.$$
This is known as the Dvoretzky-Kiefer-Wolfowitz inequality. Tighter versions exist [3, 2].
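The bound can be compared with simulation. In the sketch below, $Q$ is taken to be the uniform distribution on $[0, 1]$ (an arbitrary illustrative choice, so that $F(t) = t$), and the empirical frequency of $\|\hat{F}_n - F\|_\infty \ge \epsilon$ is compared with $8(n+1)e^{-n\epsilon^2/32}$. With these constants the bound is quite loose at moderate $n$, consistent with the remark that tighter versions exist.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 500, 0.1, 2000

def sup_cdf_deviation(x):
    """||F_n - F||_inf for samples from Uniform[0, 1], where F(t) = t."""
    x = np.sort(x)
    i = np.arange(1, x.size + 1)
    # The supremum is attained at (or just before) one of the order statistics.
    return max(np.max(np.abs(i / x.size - x)), np.max(np.abs((i - 1) / x.size - x)))

devs = np.array([sup_cdf_deviation(rng.uniform(size=n)) for _ in range(trials)])
print("empirical P(||F_n - F||_inf >= eps):", np.mean(devs >= eps))
print("bound 8(n + 1) exp(-n eps^2 / 32):  ", 8 * (n + 1) * np.exp(-n * eps**2 / 32))
```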
Earlier, we obtained convergence guarantees for an empirical risk minimizing classifier when the VC dimen-
sion of the classifier set H was finite. The purpose of this section is to obtain such guarantees also in cases
of infinite VC dimension. This requires assumptions about the underlying distribution. In other words the
result will not be distribution free, as was possible before. In particular, we will assume that the samples
are drawn from a distribution that has a bounded density with compact support. This material is based on
a similar result in Chapter 13 of [1], where a weaker assumption on the distribution of X is made.
In the following, we consider the feature space $\mathcal{X} = \mathbb{R}^2$ and the following two families of classifiers:
$$\mathcal{C} = \left\{ \mathbf{1}_{\{x \in C\}} \mid C \subseteq \mathbb{R}^2 \text{ is convex} \right\},$$
$$\mathcal{L} = \left\{ \mathbf{1}_{\{x \in L\}} \mid L = \{(x_1, x_2) \in \mathbb{R}^2 : x_2 \le \psi(x_1)\} \text{ for some non-increasing } \psi : \mathbb{R} \to \mathbb{R} \right\}.$$
The classifiers in $\mathcal{L}$ are called monotone layers. Both families have infinite VC dimension. The fact that the VC dimension of $\mathcal{C}$ is infinite was shown earlier. To see that $\mathcal{L}$ also has infinite VC dimension, it suffices to note that any set of points placed along a strictly decreasing curve can be shattered by $\mathcal{L}$, as shown in Figure 6.
Figure 6: A non-increasing function shatters a set of non-increasingly placed points.
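This can also be verified by brute force. In the sketch below (an illustration with arbitrarily chosen points), for each labeling we construct the values of a non-increasing $\psi$ at the sample points: $\psi(x_i) = y_i$ when the point is labeled 1, and a value strictly between $y_{i+1}$ and $y_i$ otherwise.

```python
from itertools import product

def layers_shatter_decreasing(n):
    """Check that monotone layers realize every labeling of n points on a decreasing curve."""
    ys = [float(n - i) for i in range(n)]            # y_1 > y_2 > ... > y_n (x_i increasing)
    for labels in product((0, 1), repeat=n):
        # psi at the sample points: equal to y_i if labeled 1, slightly below y_i otherwise.
        psi = [ys[i] if labels[i] else ys[i] - 0.5 for i in range(n)]
        nonincreasing = all(psi[i] >= psi[i + 1] for i in range(n - 1))
        realized = tuple(int(ys[i] <= psi[i]) for i in range(n))
        if not nonincreasing or realized != labels:
            return False
    return True

print(layers_shatter_decreasing(6))  # True for every n, so V_L is infinite
```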
In the proof of the VC theorem, it is shown in an intermediate step that
$$\Pr\left( \sup_{h \in H} \left| \hat{R}_n(h) - R(h) \right| \ge \epsilon \right) \le 8\, \mathbb{E}\{N_H(X_1, \ldots, X_n)\}\, e^{-n\epsilon^2/32}.$$
Here, NH(X 1 ,... , Xn) is the number of possible labelings of X 1 ,... , Xn by H. We will now bound this
quantity.
Theorem 3. If X has a density $f$ such that $\|f\|_\infty < \infty$ and $\mathrm{supp}(f)$ is bounded, then for $H = \mathcal{C}$ or $H = \mathcal{L}$,
$$\mathbb{E}\{N_H(X_1, \ldots, X_n)\} \le e^{c\sqrt{n}}$$
for a constant $c$.
Remark. In the theorem statement, $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$ is the sup norm. The support of a function is defined as $\mathrm{supp}(f) = \overline{\{x \in \mathcal{X} : f(x) > 0\}}$, where the overline denotes the set closure.
Corollary 8. If $\hat{h}_n$ is an empirical risk minimizer (ERM) over $H = \mathcal{C}$ or $H = \mathcal{L}$, then for all $\epsilon > 0$,
$$\Pr\left( R(\hat{h}_n) - R^*_H \ge \epsilon \right) \le 8\, e^{c\sqrt{n} - n\epsilon^2/128}.$$
Equivalently, with probability at least $1 - \delta$,
$$R(\hat{h}_n) - R^*_H \le \sqrt{\frac{128\left[c\sqrt{n} + \log(8/\delta)\right]}{n}} = O(n^{-1/4}).$$
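The $n^{-1/4}$ rate is easy to see numerically: multiplying the bound by $n^{1/4}$ approaches the constant $\sqrt{128c}$. The values $c = 2$ and $\delta = 0.05$ in the sketch below are arbitrary illustrative choices.

```python
import math

c, delta = 2.0, 0.05
for n in (10**3, 10**4, 10**5, 10**6, 10**7):
    bound = math.sqrt(128 * (c * math.sqrt(n) + math.log(8 / delta)) / n)
    print(n, round(bound, 4), round(bound * n ** 0.25, 3))
# The last column tends to sqrt(128 * c) = 16 as n grows.
```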
Proof of Theorem 3 for $H = \mathcal{L}$. Suppose $\mathrm{supp}(f)$ is contained in a square of side length $2r$, and partition this square into an $m \times m$ grid of cells $\{C_i\}$. For a non-increasing $\psi$, let $c(\psi)$ denote the set of cells intersected by the graph of $\psi$, and let $N_i$ be the number of samples among $X_1, \ldots, X_n$ falling in cell $C_i$. Samples in cells outside $c(\psi)$ are uniquely labeled by $\psi$; therefore only cells in $c(\psi)$ contribute to $N_{\mathcal{L}}$, with at most $2^{N_i}$ ways to assign labels to the $N_i$ points in $C_i$. Hence
$$N_{\mathcal{L}}(X_1, \ldots, X_n) \le \sum_{c} \prod_{C_i \in c} 2^{N_i},$$
where the sum is over the possible values $c$ of $c(\psi)$. We now bound the number of terms in the sum and the product separately.
1. We first bound the number of possible values of $c(\psi)$. Since $\psi$ is non-increasing, the cells in $c(\psi)$ form a staircase path through the grid, and such a path can be encoded by a string of $2m$ bits
$$(r_1, \ldots, r_{m-1}, c_1, \ldots, c_{m-1}, b_0, b_1) \in \{0,1\}^{2m}.$$
This encoding suffices because the path must alternate down and right moves. Consequently, there are at most $2^{2m}$ possible values of $c(\psi)$.
2. We bound $\mathbb{E}\left\{\prod_{C_i \in c(\psi)} 2^{N_i}\right\}$ using the moment-generating function (MGF) of a multinomial:
$$\mathbb{E}\left\{\prod_{C_i \in c(\psi)} 2^{N_i}\right\} = \mathbb{E}\left\{2^{\sum_{C_i \in c(\psi)} N_i}\right\} = \mathbb{E}\left\{\exp\left(\ln(2) \sum_{C_i \in c(\psi)} N_i\right)\right\} = \left(1 + \sum_{C_i \in c(\psi)} p_i\right)^{n} \le \exp\left(n \sum_{C_i \in c(\psi)} p_i\right),$$
where $p_i := Q(C_i)$ is the probability of cell $C_i$.
The last equality follows from the formula for the multinomial MGF, while the last bound follows from
$$\left(1 + \frac{x}{n}\right)^{n} = \sum_{i=0}^{n} \binom{n}{i} \left(\frac{x}{n}\right)^{i} = \sum_{i=0}^{n} \frac{n!}{i!(n-i)!} \left(\frac{x}{n}\right)^{i} \le \sum_{i=0}^{n} \frac{x^i}{i!} \le e^x.$$
Now, since the volume of each cell is $(2r/m)^2$ and there are at most $2m$ cells in a path,
$$\sum_{C_i \in c(\psi)} p_i = \sum_{C_i \in c(\psi)} \int_{C_i} f(x)\, dx \le 2m \cdot \|f\|_\infty \left(\frac{2r}{m}\right)^2 = \frac{8 r^2}{m} \|f\|_\infty.$$
By combining 1 and 2, it follows that
$$\mathbb{E}\{N_{\mathcal{L}}(X_1, \ldots, X_n)\} \le 2^{2m} e^{8 n r^2 \|f\|_\infty / m},$$
a bound which holds for all choices of $m$. By tuning this parameter as $m \sim \sqrt{n}$, the final bound becomes
$$\mathbb{E}\{N_{\mathcal{L}}(X_1, \ldots, X_n)\} \le e^{c\sqrt{n}}$$
for a constant $c$ depending only on $r$ and $\|f\|_\infty$.
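The tuning step can be made concrete by minimizing the logarithm of the bound, $2m\ln 2 + 8 n r^2 \|f\|_\infty / m$, over $m$. The sketch below uses the arbitrary illustrative values $r = 1$ and $\|f\|_\infty = 1$; the minimizer grows like $\sqrt{n}$, and the minimum value is proportional to $\sqrt{n}$, which is exactly the $e^{c\sqrt{n}}$ form of the final bound.

```python
import math

def log_bound(m, n, r=1.0, f_sup=1.0):
    """log of 2^{2m} * exp(8 n r^2 ||f||_inf / m)."""
    return 2 * m * math.log(2) + 8 * n * r**2 * f_sup / m

for n in (10**2, 10**4, 10**6):
    best_m = min(range(1, 20 * int(math.sqrt(n)) + 1), key=lambda m: log_bound(m, n))
    print(n, best_m, round(best_m / math.sqrt(n), 2), round(log_bound(best_m, n) / math.sqrt(n), 2))
# best_m / sqrt(n) is roughly constant (about 2.4) and log_bound / sqrt(n)
# approaches 2 * sqrt(16 ln 2) ~= 6.66, i.e. an exponent of the form c * sqrt(n).
```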
form $G' = (G \times \{0\}) \cup (G^c \times \{1\}) \subseteq \mathcal{X} \times \mathcal{Y}$, where $G^c$ denotes the complement of $G$.
(a) For $\mathcal{G}_1 \cap \mathcal{G}_2 := \{G_1 \cap G_2 \mid G_1 \in \mathcal{G}_1,\ G_2 \in \mathcal{G}_2\}$, show $S_{\mathcal{G}_1 \cap \mathcal{G}_2}(n) \le S_{\mathcal{G}_1}(n)\, S_{\mathcal{G}_2}(n)$.
(b) For $\mathcal{G}_1 \cup \mathcal{G}_2 := \{G_1 \cup G_2 \mid G_1 \in \mathcal{G}_1,\ G_2 \in \mathcal{G}_2\}$, show $S_{\mathcal{G}_1 \cup \mathcal{G}_2}(n) \le S_{\mathcal{G}_1}(n)\, S_{\mathcal{G}_2}(n)$.
VC dimension. Determine or bound the VC dimension of each of the following classes.
(a) $\mathcal{X} = \mathbb{R}^d$, $H = \{\mathbf{1}_{\{f(x) \ge 0\}} \mid f \text{ is an inhomogeneous quadratic polynomial}\}$.
(b) $\mathcal{X} = \mathbb{R}^d$, $H = \{\mathbf{1}_{\{x \in C\}} \mid C \text{ is a sphere (including boundary and interior)}\}$.
(c) $\mathcal{X} = \mathbb{R}^2$, $H = \{\mathbf{1}_{\{x \in P_k\}} \mid P_k \text{ is a convex polygon with at most } k \text{ sides}\}$.
(d) $\mathcal{X} = \mathbb{R}^d$, $H = \{\mathbf{1}_{\{x \in R_k\}} \mid R_k \text{ is a union of at most } k \text{ rectangles}\}$.
[1] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[2] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation, Springer, 2001.
[3] P. Massart, “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality,” Annals of Probability,
vol. 18, pp. 1269-1283, 1990.