















A lecture script from TTIC 31020: Introduction to Machine Learning, covering nonparametric methods. The lecture begins by comparing parametric and nonparametric methods, explaining that nonparametric methods keep the training data and use it directly as parameters. It then introduces kernel density estimation, a nonparametric method for probability density estimation, and nearest neighbor classifiers. The document also discusses the choice of kernel width and the famous Cover and Hart result on the nearest neighbor classifier.
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
November 10, 2010
So far, we have seen parametric methods.
Is the SVM classifier
  sign( w_0 + ∑_{i: α_i > 0} α_i y_i K(x_i, x) )
parametric?
In nonparametric methods the training examples are explicitly used as parameters.
The problem of probability density estimation: infer p(x_0) given a set of examples x_1, ..., x_N.
Parametric estimation: assume a parametric form p(x; θ)
Estimate θ using ML, MAP etc.
The idea behind nonparametric estimation: directly evaluate how dense the vicinity of x_0 is.
Consider a kernel that is also a pdf, e.g. the Gaussian
  K(x_0, x_i) = N(x_0 − x_i; 0, σ²I).
Estimator:
  p̂(x_0) = (1/N) ∑_{i=1}^{N} K(x_0, x_i).
An example's contribution depends on the distance from x_0.
  p̂(x_0) = (1/N) ∑_{i=1}^{N} N(x_0 − x_i; 0, σ²I)
[Figure: kernel density estimates of a 1-D sample for σ = 1, σ = 0.4, and σ = 0.05]
Choice of the kernel width σ is crucial.
When σ is sufficiently small, the role of x_i that are far from x_0 vanishes.
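As a concrete illustration of the estimator above, here is a minimal NumPy sketch; the function name, the toy 1-D sample, and the query point are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def kde_estimate(x0, X, sigma):
    """p_hat(x0) = (1/N) * sum_i N(x0 - x_i; 0, sigma^2 I)."""
    N, d = X.shape
    sq_dists = np.sum((x0 - X) ** 2, axis=1)          # ||x0 - x_i||^2 for each example
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)    # Gaussian normalizing constant
    return np.mean(np.exp(-sq_dists / (2.0 * sigma ** 2)) / norm)

# Toy 1-D sample; the estimate at x0 = 0 changes substantially with the kernel width.
rng = np.random.default_rng(0)
X = rng.normal(loc=-2.0, scale=1.5, size=(300, 1))
for sigma in (1.0, 0.4, 0.05):
    print(f"sigma={sigma}: p_hat(0) = {kde_estimate(np.array([0.0]), X, sigma):.4f}")
```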
We can make this explicit by ignoring the kernel, and simply expressing inference in terms of neighbors' labels.
Example: nearest neighbor classification.
  i_NN = argmin_i ‖x_0 − x_i‖,
and x_0 is assigned the label y_{i_NN}.
No parametric/probabilistic assumptions whatsoever!
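A minimal sketch of this rule in NumPy (the function and array names are mine, not from the lecture):

```python
import numpy as np

def nn_classify(x0, X, y):
    """1-NN rule: predict the label of the training example nearest to x0."""
    dists = np.linalg.norm(X - x0, axis=1)   # ||x0 - x_i|| for every training example
    i_nn = np.argmin(dists)                  # i_NN = argmin_i ||x0 - x_i||
    return y[i_nn]                           # predicted label y_{i_NN}
```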
Let R_N be the expected risk of the NN classifier with N training examples drawn from p(x, y).
A famous result due to Cover and Hart '67: under mild assumptions on p(x, y), the asymptotic risk of the NN classifier, R_∞ = lim_{N→∞} R_N, satisfies
  R* ≤ R_∞ ≤ 2 R*(1 − R*),
where R* is the Bayes risk.
Less famous result (Cover, ’68): the rate of convergence to the bound can be arbitrarily slow!
Nonetheless, in practice NN is often very accurate – and slow.
Take 16×16 grayscale images (8-bit) of handwritten digits.
Use Euclidean distance in raw pixel space, k = 7.
Classification error (leave-one-out): 4.85%.
Examples: [digit images omitted]
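A sketch of how such a leave-one-out evaluation might be run in NumPy; the array names (X as an N×256 matrix of flattened pixel vectors, y as integer labels) and the helper function are assumptions, not the lecture's code.

```python
import numpy as np

def loo_knn_error(X, y, k=7):
    """Leave-one-out error of a k-NN classifier with Euclidean distance in pixel space."""
    N = X.shape[0]
    # Pairwise squared Euclidean distances between all examples (O(N^2) memory).
    sq_norms = np.sum(X ** 2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dists, np.inf)          # exclude each point from its own neighbor set
    errors = 0
    for i in range(N):
        nbrs = np.argsort(dists[i])[:k]      # indices of the k nearest other examples
        votes = np.bincount(y[nbrs])         # majority vote over the neighbors' labels
        if votes.argmax() != y[i]:
            errors += 1
    return errors / N
```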
We can use k > 1 nearest neighbors ⇒ k-NN classifier
What about regression? Simplest k-NN regression: let x′_1, ..., x′_k be the k nearest neighbors of x_0 and y′_1, ..., y′_k their labels; predict
  ŷ(x_0) = (1/k) ∑_{j=1}^{k} y′_j.
What kind of functions can we estimate in this way?
What is the effect of k?
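A minimal sketch of this averaging rule (names are illustrative assumptions). Note that the prediction only changes when the neighbor set changes, so the estimated function is piecewise constant in x_0; larger k gives smoother, but coarser, estimates.

```python
import numpy as np

def knn_regress(x0, X, y, k=5):
    """k-NN regression: average the labels of the k nearest training points."""
    dists = np.linalg.norm(X - x0, axis=1)   # distances to all training examples
    nbrs = np.argsort(dists)[:k]             # indices of the k nearest neighbors
    return y[nbrs].mean()                    # piecewise-constant prediction
```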
NN induces a Voronoi tessellation of the space: each cell consists of the points closer to one training example than to any other.
Idea 2: bring back the parameters.
Fit a (simple) parametric model to the neighbors of x_0.
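One possible instance of this idea, sketched below, is a local linear fit: a least-squares linear model fit to the k nearest neighbors of x_0 and evaluated at x_0. The function name and the choice of a linear model are assumptions for illustration.

```python
import numpy as np

def local_linear_predict(x0, X, y, k=10):
    """Fit a linear model to the k nearest neighbors of x0 and predict at x0."""
    dists = np.linalg.norm(X - x0, axis=1)
    nbrs = np.argsort(dists)[:k]                      # neighborhood of x0
    Xn = np.hstack([np.ones((k, 1)), X[nbrs]])        # add intercept column
    w, *_ = np.linalg.lstsq(Xn, y[nbrs], rcond=None)  # least-squares fit on the neighbors
    return np.concatenate([[1.0], x0]) @ w            # evaluate the local model at x0
```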