



















TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
November 19, 2010
PCA is the solution to minimum-residual projection.
Let rows of X be the data points (N × d matrix).
Construct the d × d data covariance matrix S = (1/N) X^T X;
Let φ_1, ..., φ_d be the orthonormal eigenvectors of S corresponding to the eigenvalues λ_1 ≥ ... ≥ λ_d.
The optimal k-dim linear subspace is given by
Φ = [φ_1, ..., φ_k].
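A minimal numpy sketch of this recipe (the centering step and the function name are my additions, not from the slides):

```python
import numpy as np

def pca_basis(X, k):
    """Top-k principal directions of X (rows are data points), as on the slide."""
    Xc = X - X.mean(axis=0)          # center the data (the slide assumes centered X)
    S = (Xc.T @ Xc) / X.shape[0]     # d x d covariance S = (1/N) X^T X
    lam, V = np.linalg.eigh(S)       # symmetric eigendecomposition, ascending eigenvalues
    order = np.argsort(lam)[::-1]    # reorder so that lambda_1 >= ... >= lambda_d
    return V[:, order[:k]]           # Phi = [phi_1, ..., phi_k], a d x k matrix

# Projection onto the optimal k-dim subspace: Z = (X - X.mean(0)) @ pca_basis(X, k)
```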
A very common methodology: perform PCA on all data and learn a classifier in the low-dimensional space.
Tempting: may turn a computationally infeasible problem into a practical one.
Careful! The direction of largest variance need not be the most discriminative direction.
[Figure: two-class 2-D data with the PCA subspace and the LDA subspace overlaid; the projected class-conditional densities (Class +1, Class −1, Total) show the PCA direction is far less discriminative.]
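To make the warning concrete, a hedged scikit-learn sketch (not part of the lecture): synthetic data where the largest-variance axis carries no class information, so the PCA-then-classifier pipeline does much worse than a classifier on the raw features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(0.0, 5.0, size=n),        # axis 0: large variance, uninformative
    rng.normal(0.0, 0.3, size=n) + y,    # axis 1: small variance, discriminative
])

pca_then_clf = make_pipeline(PCA(n_components=1), LogisticRegression())
plain_clf = LogisticRegression()

print(cross_val_score(pca_then_clf, X, y).mean())  # near chance: PCA kept the wrong axis
print(cross_val_score(plain_clf, X, y).mean())     # much higher accuracy on raw features
```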
Probabilistic PCA is a method of fitting a constrained Gaussian (“pancake”), whose covariance (in its eigenbasis) is
diag(λ_1, ..., λ_k, σ^2, ..., σ^2),
i.e. the top k eigenvalues followed by d − k copies of the noise variance σ^2.
ML estimate for the noise variance σ^2:
σ^2 = (1 / (d − k)) ∑_{j=k+1}^{d} λ_j
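A short numpy sketch of this estimate, i.e. the average of the d − k discarded eigenvalues (function and argument names are mine):

```python
import numpy as np

def ppca_noise_variance(eigenvalues, k):
    """ML estimate of sigma^2: average of the discarded eigenvalues lambda_{k+1}, ..., lambda_d."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # lambda_1 >= ... >= lambda_d
    d = lam.shape[0]
    return lam[k:].sum() / (d - k)
```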
Linearity assumption constrains the type of subspaces we can find.
A general formulation: a hidden manifold.
One possible method: kernel PCA
Very active area of research...
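As a rough illustration only, a sketch of kernel PCA using scikit-learn's KernelPCA (the placeholder data, the RBF kernel, and its width gamma are arbitrary choices here, not from the lecture):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.default_rng(0).normal(size=(200, 5))   # placeholder data, N x d

# Nonlinear embedding via an RBF kernel; gamma (the kernel width) needs tuning in practice.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)    # N x 2 coordinates on the learned nonlinear subspace
print(Z.shape)
```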
Suppose we are considering a finite number of features (or basis functions): x = [x_1, ..., x_d]^T.
We are interested in selecting a subset of these features, x_{s_1}, ..., x_{s_k}, that leads to the best classification or regression performance.
We have already seen this: lasso regularization.
PCA: more like “feature generation”
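For the regression case, a minimal scikit-learn sketch of lasso-based selection (the synthetic data, the value of alpha, and the variable names are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                                     # 20 candidate features
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(scale=0.1, size=200)    # only two matter

# L1 regularization drives most coefficients exactly to zero; the survivors
# are the selected features. alpha controls how aggressively we prune.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(selected)    # typically the informative features [2, 7]
```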
Mutual Information between the random variables X and Y is defined as the reduction in entropy (uncertainty) of X given Y:
I(X; Y) ≜ H(X) − H(X|Y)
= − ∑_x p(x) log p(x) + ∑_x ∑_y p(x, y) log p(x | y)      [using p(x) = ∑_y p(x, y)]
= − ∑_x ∑_y p(x, y) log p(x) + ∑_x ∑_y p(x, y) log p(x | y)
= ∑_{x,y} p(x, y) log [ p(x | y) / p(x) ]
= ∑_{x,y} p(x, y) log [ p(x | y) p(y) / (p(x) p(y)) ]
= ∑_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
= D_KL( p(x, y) || p(x) p(y) ).
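Since I(X; Y) is just an expectation over the joint distribution, it can be computed directly from a probability table. A minimal numpy sketch (the function name and the base-2 logarithm, which measures MI in bits, are my choices):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = D_KL(p(x,y) || p(x)p(y)) for a joint distribution given as a table.

    p_xy[i, j] = p(X = i, Y = j); entries must sum to 1.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = p_xy > 0                          # 0 log 0 = 0 by convention
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# Perfectly dependent binary variables give I = 1 bit:
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))   # -> 1.0
```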
We can evaluate the MI between the class label y and a feature x_j:
I(x_j; y) = ∑_{y ∈ Y} ∑_{x_j} p(x_j, y) log [ p(x_j | y) p(y) / (p(x_j) p(y)) ]
This requires estimating p(y) (easy), p(x_j) and p(x_j | y) (may be hard).
Sanity check: for a binary classification problem, I(x_j; y) ≤ 1 bit for any feature x_j, since I(x_j; y) ≤ H(y).
How many features to include? Where to place the threshold?
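A hedged sketch of the max-MI filter step, scoring each feature by an estimate of I(x_j; y) obtained by discretizing x_j into histogram bins (the bin count, the estimator, and all names are my assumptions; where to place the threshold on the resulting ranking is left open, as above):

```python
import numpy as np

def mi_score(xj, y, n_bins=10):
    """Estimate I(x_j; y) in bits by binning x_j and counting a joint histogram with y."""
    classes, y_idx = np.unique(y, return_inverse=True)
    edges = np.histogram_bin_edges(xj, bins=n_bins)[1:-1]   # inner bin edges
    x_idx = np.digitize(xj, edges)                          # bin index in 0..n_bins-1
    p_xy = np.zeros((n_bins, classes.size))
    np.add.at(p_xy, (x_idx, y_idx), 1.0)                    # joint counts
    p_xy /= p_xy.sum()                                      # empirical p(x_j, y)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# Rank features by MI with the label; choosing how many to keep is a separate decision:
# scores = [mi_score(X[:, j], y) for j in range(X.shape[1])]
# ranking = np.argsort(scores)[::-1]
```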