
Lecture 20: Nonparametric methods

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 10, 2010

Review


Parametric vs. nonparametric methods

So far, we have seen parametric methods

  • Learning = inferring (fitting) parameters.

Is the SVM classifier

sign( w0 + Σ_{i: αi > 0} αi yi K(xi, x) )

parametric?

  • In general, we cannot summarize it in a simple parametric form.
  • Need to keep around some (possibly all!) of the training data.
  • The Lagrange multipliers α are a kind of parameter.

In nonparametric methods the training examples are explicitly used as parameters.

Nonparametric density estimation

The problem of probability density estimation: infer p(x0) given a set of samples x1, ..., xN.

Parametric estimation: assume a parametric form p(x; θ)

Estimate θ using ML, MAP etc.

  • We have seen examples for Gaussian and Bernoulli densities.

The idea behind nonparametric estimation: directly evaluate how dense the vicinity of x0 is.

Kernel density estimation

Consider a kernel that is also a pdf.

  • e.g., the Gaussian kernel K(x0, xi) = N(x0 − xi; 0, σ²I).

Estimator:

p̂(x0) = (1/N) Σ_{i=1}^{N} K(x0, xi).


An example’s contribution depends on the distance from x0.
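A minimal NumPy sketch of this estimator for 1-D data with a Gaussian kernel; the function name and the synthetic sample are illustrative, not from the lecture.

    import numpy as np

    def gaussian_kde_estimate(x0, X, sigma):
        """p_hat(x0) = (1/N) * sum_i N(x0 - x_i; 0, sigma^2), for 1-D data."""
        d = x0 - np.asarray(X, dtype=float)          # x0 - x_i for every training point
        k = np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return k.mean()                              # average of the kernel contributions

    # Illustrative 1-D sample (roughly in the range plotted on the slide)
    rng = np.random.default_rng(0)
    X = np.concatenate([rng.normal(-3, 1, 100), rng.normal(1, 0.5, 50)])
    print(gaussian_kde_estimate(0.0, X, sigma=0.4))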

Choice of kernel width

p̂(x0) = (1/N) Σ_{i=1}^{N} N(x0 − xi; 0, σ²I)

[Figure: kernel density estimates of the same 1-D sample with σ = 1, σ = 0.4, and σ = 0.05]

Choice of the kernel width σ is crucial.

  • Similar to the overfitting effect in supervised learning!
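A small self-contained check of this effect at a single query point, using the three bandwidths shown above; the sample is synthetic and chosen only for illustration.

    import numpy as np

    def kde(x0, X, sigma):
        # Gaussian kernel density estimate at x0
        d = x0 - np.asarray(X, dtype=float)
        return np.mean(np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))

    rng = np.random.default_rng(0)
    X = rng.normal(-2.0, 1.0, size=200)              # illustrative 1-D sample

    for sigma in (1.0, 0.4, 0.05):
        # Large sigma oversmooths; tiny sigma puts a narrow spike on every training point.
        print(f"sigma={sigma}: p_hat(-2.0) = {kde(-2.0, X, sigma):.3f}")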

Nearest neighbor methods

When σ is sufficiently small, the contribution of the xi that are far from x0 vanishes.

  • The result depends only on the nearest neighbors of x0.

We can make this explicit by ignoring the kernel and simply expressing inference in terms of the neighbors’ labels.

Example: nearest neighbor classification.

  • Training data (x1, y1), ..., (xN, yN) are simply stored.
  • Given x0, let iNN = argmin_i ‖x0 − xi‖.
  • Nearest neighbor prediction: ŷ0 = y_{iNN}.

No parametric/probabilistic assumptions whatsoever!
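A direct NumPy sketch of this rule; the array names and the toy data are placeholders.

    import numpy as np

    def nn_predict(x0, X, y):
        """1-NN prediction: return the label of the training point closest to x0."""
        dists = np.linalg.norm(X - x0, axis=1)   # ||x0 - x_i|| for all i
        i_nn = np.argmin(dists)                  # index of the nearest neighbor
        return y[i_nn]

    # Toy 2-D example
    X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    y = np.array([0, 0, 1])
    print(nn_predict(np.array([4.0, 4.5]), X, y))   # -> 1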

Nearest neighbor classifier

Let R_N be the expected risk of the NN classifier with N training examples drawn from p(x, y).

A famous result due to Cover and Hart ’67: under mild assumptions on p(x, y), the asymptotic risk of the NN classifier, R∞ = lim_{N→∞} R_N, satisfies

R∗ ≤ R∞ ≤ 2R∗(1 − R∗),

where R∗ is the Bayes risk.
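A quick numeric illustration of the upper bound; the Bayes-risk values below are arbitrary, chosen only to show the "at most about twice the Bayes risk" behavior.

    # Asymptotic NN risk is at most 2 * R_star * (1 - R_star), i.e. less than 2x the Bayes risk.
    for r_star in (0.01, 0.05, 0.10):
        upper = 2 * r_star * (1 - r_star)
        print(f"R* = {r_star:.2f}  ->  R_inf <= {upper:.4f}")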

Less famous result (Cover, ’68): the rate of convergence to the bound can be arbitrarily slow!

Nonetheless, in practice NN is often very accurate – and slow.

Example: k-NN for handwritten digits

Take 16×16 grayscale images (8-bit) of handwritten digits.

Use Euclidean distance in raw pixel space, k = 7.

Classification error (leave-one-out): 4.85%.

Examples: [figure omitted]
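A sketch of this experiment in NumPy, assuming the digits are already available as flattened pixel vectors; the variable names and data loading are hypothetical, since the lecture does not give code.

    import numpy as np
    from collections import Counter

    def loo_knn_error(X, y, k=7):
        """Leave-one-out error of k-NN with Euclidean distance in raw pixel space."""
        n = X.shape[0]
        sq = np.sum(X ** 2, axis=1)
        D = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
        np.fill_diagonal(D, np.inf)                   # exclude each point from its own neighbors
        errors = 0
        for i in range(n):
            nn = np.argsort(D[i])[:k]                 # indices of the k nearest neighbors
            vote = Counter(y[nn]).most_common(1)[0][0]
            errors += (vote != y[i])
        return errors / n

    # Hypothetical usage: images of shape (n, 16, 16) flattened to 256-dimensional vectors
    # X = images.reshape(len(images), -1).astype(float); y = labels
    # print(loo_knn_error(X, y, k=7))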

Nearest neighbor: extensions

We can use k > 1 nearest neighbors ⇒ k-NN classifier

  • Label for x0 predicted by majority voting among its k nearest neighbors.

What about regression? Simplest k-NN regression: let x′1, ..., x′k be the neighbors of x0 and y′1, ..., y′k their labels.

  • Predict ŷ = (1/k) Σ_{j=1}^{k} y′j. What kind of functions can we estimate in this way?

What is the effect of k?
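A minimal k-NN regression sketch in NumPy; the function name and the toy sine data are illustrative.

    import numpy as np

    def knn_regress(x0, X, y, k=5):
        """Predict y_hat at x0 as the average label of the k nearest training points."""
        dists = np.linalg.norm(X - x0, axis=1)
        nn = np.argsort(dists)[:k]               # indices of the k nearest neighbors
        return y[nn].mean()                      # y_hat = (1/k) * sum_j y'_j

    # Toy 1-D example: noisy samples of y = sin(x)
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
    print(knn_regress(np.array([0.5]), X, y, k=5))

Note that this estimator is piecewise constant: the prediction only changes when the set of k nearest neighbors changes.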

Geometry of nearest neighbor

NN induces a Voronoi tessellation of the space: [figure omitted]
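For a 2-D training set, these cells can be computed with SciPy; a sketch with random points, plotting omitted.

    import numpy as np
    from scipy.spatial import Voronoi

    rng = np.random.default_rng(2)
    points = rng.uniform(0, 1, size=(10, 2))     # 10 training points in the plane
    vor = Voronoi(points)

    # Each training point owns one Voronoi cell: the set of locations that the
    # 1-NN rule would assign to that point's label.
    print(len(vor.point_region), "cells for", len(points), "points")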

Parametric locally weighted regression

Idea 2: bring back the parameters.

Fit a (simple) parametric model to the neighbors of x0.

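A sketch of one common instantiation of this idea: locally weighted linear regression, where a line is fit by weighted least squares with Gaussian weights centered at x0. The choice of kernel weights (rather than a hard k-neighbor cutoff) and all names here are my assumptions, not specified by the slide.

    import numpy as np

    def locally_weighted_linear(x0, X, y, sigma=0.5):
        """Fit a weighted least-squares line around x0 and return its prediction at x0."""
        x = np.asarray(X, dtype=float).ravel()
        w = np.exp(-0.5 * ((x - x0) / sigma) ** 2)       # Gaussian weights by distance to x0
        A = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
        sw = np.sqrt(w)
        # Weighted least squares via rescaling rows by sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        return coef[0] + coef[1] * x0

    # Toy example: noisy samples of y = sin(x)
    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, size=100)
    y = np.sin(X) + 0.1 * rng.normal(size=100)
    print(locally_weighted_linear(0.5, X, y))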