Lecture 24: Feature selection, multilayer networks

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 19, 2010

Review

PCA is the solution to the minimum-residual projection problem.

Let the rows of the N × d matrix X be the data points.

Construct the d × d data covariance matrix S = (1/N) XᵀX.

Let φ_1, …, φ_d be the orthonormal eigenvectors of S corresponding to the eigenvalues λ_1 ≥ … ≥ λ_d.

The optimal k-dimensional linear subspace is given by

Φ = [φ_1, …, φ_k].
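To make the review concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the sample covariance (added for illustration; the random data, the centering step, and the choice of k are assumptions, not part of the lecture):

```python
import numpy as np

def pca_subspace(X, k):
    """Return the top-k principal directions of the rows of X (N x d)."""
    Xc = X - X.mean(axis=0)                # center the data
    S = (Xc.T @ Xc) / X.shape[0]           # d x d covariance matrix S = (1/N) X^T X
    lam, Phi = np.linalg.eigh(S)           # eigenvalues ascending, columns are eigenvectors
    order = np.argsort(lam)[::-1]          # reorder eigenvalues in decreasing order
    return Phi[:, order[:k]], lam[order]   # d x k basis [phi_1, ..., phi_k], sorted eigenvalues

# Example: project 5-dimensional data onto the top-2 PCA subspace
X = np.random.randn(200, 5)
Phi_k, lam = pca_subspace(X, k=2)
Z = (X - X.mean(axis=0)) @ Phi_k           # N x 2 representation, z_j = phi_j^T x
```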

PCA and classification

A very common methodology: perform PCA on all the data and learn a classifier in the low-dimensional space.

Tempting: it may turn a computationally infeasible problem into a practical one.

Careful! The direction of largest variance need not be the most discriminative direction.

[Figure: two-class example comparing the PCA subspace with the LDA subspace; projected class densities are shown for Class +1, Class −1, and the total.]

Probabilistic PCA

Probabilistic PCA is a method of fitting a constrained Gaussian (“pancake”) whose covariance is

\[
\Phi \,\mathrm{diag}\!\left(\lambda_1, \ldots, \lambda_k, \sigma^2, \ldots, \sigma^2\right) \Phi^T
\]

ML estimate for the noise variance σ²:

\[
\sigma^2 = \frac{1}{d-k} \sum_{j=k+1}^{d} \lambda_j
\]
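A tiny added sketch of this ML noise-variance estimate, assuming the eigenvalues of the data covariance are already available:

```python
import numpy as np

def ppca_noise_variance(eigenvalues, k):
    """ML estimate sigma^2 = (1/(d-k)) * sum_{j=k+1}^{d} lambda_j.

    `eigenvalues` are the eigenvalues of the data covariance; the first k
    (largest) are kept, the remaining d-k are averaged.
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # ensure decreasing order
    d = lam.shape[0]
    return lam[k:].sum() / (d - k)

# Example: d = 5 eigenvalues, keep k = 2 components
print(ppca_noise_variance([4.0, 2.0, 0.5, 0.3, 0.2], k=2))    # (0.5+0.3+0.2)/3 ≈ 0.333
```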

Linear subspaces vs. manifolds

The linearity assumption constrains the type of subspaces we can find.

A more general formulation: the data lie on a hidden manifold.

One possible method: kernel PCA (a sketch follows below).

A very active area of research...
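As an illustration only (not from the lecture), a kernel PCA sketch using scikit-learn's KernelPCA; the toy dataset, the RBF kernel, and the gamma value are assumptions:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy nonlinear data: points near a circle (a 1-D manifold embedded in 2-D)
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=300)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))

# Kernel PCA with an RBF kernel can capture structure that linear PCA cannot
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X)   # nonlinear low-dimensional coordinates
```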

Feature selection

Suppose we are considering a finite number of features (or basis functions), x = [x_1, …, x_d]ᵀ.

We are interested in selecting a subset of these features, x_{s_1}, …, x_{s_k}, that leads to the best classification or regression performance.

We have already seen this: lasso regularization (see the sketch below).

PCA is more like “feature generation”:

  • z_j = φ_jᵀ x is a linear combination of all of x_1, …, x_d.
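A minimal added sketch of lasso-based feature selection with scikit-learn's Lasso; the synthetic data and the regularization strength alpha are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 3 of 20 features actually matter
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.standard_normal(200)

# The L1 penalty drives most coefficients exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of features with nonzero weight
print("selected features:", selected)    # typically [0, 1, 2]
```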

Mutual information

Mutual information between the random variables X and Y is defined as the reduction in entropy (uncertainty) of X given Y:

\[
\begin{aligned}
I(X; Y) &\triangleq H(X) - H(X \mid Y) \\
&= -\sum_x \underbrace{p(x)}_{=\sum_y p(x,y)} \log p(x) \;+\; \sum_x \sum_y p(x, y) \log p(x \mid y) \\
&= -\sum_x \sum_y p(x, y) \log p(x) \;+\; \sum_x \sum_y p(x, y) \log p(x \mid y) \\
&= \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} \\
&= \sum_{x, y} p(x, y) \log \frac{p(x \mid y)\, p(y)}{p(x)\, p(y)} \\
&= D_{\mathrm{KL}}\bigl( p(x, y) \,\|\, p(x)\, p(y) \bigr).
\end{aligned}
\]
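As an added, concrete example, mutual information computed directly from an empirical joint distribution; the joint table for a binary feature and binary label is made up for illustration (values in nats):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ], in nats."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # avoid 0 * log 0 terms
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Joint distribution p(x, y): binary feature x (rows), binary label y (columns)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))   # > 0, since x and y are dependent
```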

Max-MI feature selection: classification

We can evaluate the MI between the class label y and a feature x_j:

\[
I(x_j; y) = \sum_{y \in \mathcal{Y}} \sum_{x_j} p(x_j, y) \log \frac{p(x_j \mid y)\, p(y)}{p(x_j)\, p(y)}
\]

This requires estimating p(y) (easy), and p(x_j) and p(x_j | y) (may be hard).

Sanity check: for a binary classification problem, I(x_j; y) ≤ 1 for any feature x_j.
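A hedged sketch of this filter approach, using scikit-learn's mutual_info_classif as the MI estimator and SelectKBest to keep the top-k features; the synthetic dataset and the value of k are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic binary classification problem: 20 features, only a few informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# Score every feature by estimated I(x_j; y) and keep the k highest-scoring ones
selector = SelectKBest(score_func=mutual_info_classif, k=4).fit(X, y)
print("MI scores:", np.round(selector.scores_, 3))
print("selected feature indices:", selector.get_support(indices=True))
```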

Filter methods: shortcomings

How many features to include? Where to place the threshold?