Topics: kernels, the kernel trick, Mercer's kernels, popular kernels, the radial basis function (RBF) kernel, slack variables and the non-separable case, SVM regression and the penalized loss minimizer, nonlinear features and nonlinear mappings, logistic regression, stepwise regression and greedy assembly, boosting.
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 25, 2010
We start with the max-margin objective

argmax_{w, w_0} (1/‖w‖) · min_i y_i (w^T x_i + w_0)
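As a small numeric illustration (my own toy example, not from the slides), here is a minimal Python/NumPy sketch that evaluates this objective, the geometric margin, for a candidate (w, w_0):

```python
import numpy as np

# Toy 2D data: rows of X are examples x_i, labels y_i in {-1, +1}
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def geometric_margin(w, w0, X, y):
    """(1 / ||w||) * min_i y_i (w^T x_i + w_0) -- the quantity being maximized."""
    return np.min(y * (X @ w + w0)) / np.linalg.norm(w)

print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))
```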
In the linearly separable case, we get a quadratic program in the dual variables α:

max_α  Σ_{i=1}^N α_i − (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j

subject to  Σ_{i=1}^N α_i y_i = 0,  α_i ≥ 0 for all i = 1, ..., N.
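One way to see this dual in action is to hand it to a generic QP solver. The sketch below is my own; it assumes the third-party cvxopt package and reuses the toy X, y from the sketch above. It minimizes the negated dual objective (1/2) α^T P α − Σ_i α_i with P_ij = y_i y_j x_i^T x_j, under the same constraints:

```python
import numpy as np
from cvxopt import matrix, solvers  # generic QP solver (third-party; an assumption here)

solvers.options['show_progress'] = False

N = X.shape[0]
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j x_i^T x_j
q = matrix(-np.ones(N))                  # linear term of the negated dual
G = matrix(-np.eye(N))                   # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, N))              # equality constraint: sum_i alpha_i y_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol['x']).ravel()       # dual variables alpha_i
```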
Solving it for α, we get the SVM classifier

ŷ = sign( ŵ_0 + Σ_{i: α_i > 0} α_i y_i x_i^T x )
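Continuing the same sketch, the classifier can be read off from α; ŵ_0 is recovered from the support vectors (those with α_i clearly above zero):

```python
sv = alpha > 1e-6                       # support vectors: alpha_i > 0 (up to numerical noise)
w = (alpha[sv] * y[sv]) @ X[sv]         # w = sum_i alpha_i y_i x_i (linear case only)
w0 = np.mean(y[sv] - X[sv] @ w)         # y_i (w^T x_i + w_0) = 1 on support vectors

def predict(x):
    """sign( w_0 + sum_{alpha_i > 0} alpha_i y_i x_i^T x )"""
    return np.sign(w0 + np.sum(alpha[sv] * y[sv] * (X[sv] @ x)))

print(predict(np.array([1.5, 2.5])))    # expect +1 for the toy data above
```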
Kernel trick and SVMs
Boosting
As with logistic regression, we can move to nonlinear classifiers by mapping the data into a nonlinear feature space, e.g.

φ : [x_1, x_2]^T → [x_1^2, √2 x_1 x_2, x_2^2]^T

Consider the mapping

φ : [x_1, x_2]^T → [1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2]^T.
The (linear) SVM classifier in the feature space:

ŷ = sign( ŵ_0 + Σ_{i: α_i > 0} α_i y_i φ(x_i)^T φ(x) )
The dot product in the feature space: φ(x)^T φ(z) = (1 + x^T z)^2.
We defined a non-linear mapping into feature space

φ : [x_1, x_2]^T → [1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2]^T

and saw that φ(x)^T φ(z) = K(x, z) using the kernel

K(x, z) = (1 + x^T z)^2.
I.e., we can calculate dot products in the feature space implicitly, without ever writing the feature expansion!
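A quick numeric check of this identity (a sketch of mine; the function names are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2]."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def K(x, z):
    """The same dot product computed implicitly: (1 + x^T z)^2."""
    return (1.0 + x @ z) ** 2

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(z), K(x, z))   # the two values agree
```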
What kind of function K is a valid kernel, i.e. such that there exists a feature map φ(x) for which K(x, z) = φ(x)^T φ(z)?
Theorem due to Mercer (1930s): K must be symmetric, and for any set of points x_1, ..., x_N the N × N Gram matrix

K(x_1, x_1)  K(x_1, x_2)  ⋯  K(x_1, x_N)
    ⋮            ⋮        ⋱      ⋮
K(x_N, x_1)  K(x_N, x_2)  ⋯  K(x_N, x_N)

must be positive semidefinite.
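Mercer's condition can be checked empirically on any finite sample (a necessary condition, not a proof): build the Gram matrix and confirm it has no significantly negative eigenvalue. A minimal sketch of mine, using the polynomial kernel above:

```python
import numpy as np

def gram(K, X):
    """N x N matrix of pairwise kernel values K(x_i, x_j)."""
    return np.array([[K(xi, xj) for xj in X] for xi in X])

X = np.random.randn(20, 2)
K_poly = lambda x, z: (1.0 + x @ z) ** 2
eigvals = np.linalg.eigvalsh(gram(K_poly, X))     # eigvalsh: the matrix is symmetric
print(eigvals.min() >= -1e-9)                     # True: positive semidefinite (numerically)
```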
The linear kernel: K(x, z) = x^T z.
This leads to the original, linear SVM.
The polynomial kernel:
K(x, z; c, d) = (c + x^T z)^d.
We can write the expansion explicitly, by concatenating powers up to d and multiplying by appropriate weights.
The radial basis function (RBF) kernel:

K(x, z; σ) = exp( −‖x − z‖^2 / σ^2 )
The RBF kernel is a measure of similarity between two examples.
What is the role of parameter σ?
Consider σ → 0.
Data are linearly separable in the (infinite-dimensional) feature space.
We don’t need to explicitly compute dot products in that feature space – instead we simply evaluate the RBF kernel.
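The effect of σ is easy to see on the Gram matrix itself. A small sketch (mine): as σ → 0 the matrix approaches the identity, i.e. every example is similar only to itself, while for very large σ all entries approach 1.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix of K(x, z; sigma) = exp(-||x - z||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

X = np.random.randn(5, 2)
print(np.round(rbf_gram(X, 0.01), 2))    # ~ identity matrix (sigma -> 0)
print(np.round(rbf_gram(X, 100.0), 2))   # ~ all-ones matrix (large sigma)
```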
SVM regression. The key ideas: the ε-insensitive loss and the ε-tube around the regression function.
[Figures: the ε-insensitive loss L(z) as a function of z, and the ε-tube between y − ε and y + ε around y(x), with slack variables ξ, ξ̃ > 0 for points outside the tube.]
Two sets of slack variables:

y_i ≤ f(x_i) + ε + ξ_i,   y_i ≥ f(x_i) − ε − ξ̃_i,   ξ_i ≥ 0,  ξ̃_i ≥ 0.

Optimization: min C Σ_i (ξ_i + ξ̃_i) + (1/2) ‖w‖^2, subject to the constraints above.
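The ε-insensitive loss itself is a one-liner; a minimal sketch (mine), with ε = 0.5 chosen arbitrarily:

```python
import numpy as np

def eps_insensitive_loss(z, eps=0.5):
    """L(z) = 0 for |z| <= eps, and |z| - eps otherwise."""
    return np.maximum(0.0, np.abs(z) - eps)

print(eps_insensitive_loss(np.linspace(-2.0, 2.0, 9)))   # flat (zero) inside the eps-tube
```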
Two main ideas: the maximum margin and the kernel trick.
Complexity of classifier depends on the number of SVs.
One of the most successful ML techniques!
A crucial component: good QP solver.
Recommended off-the-shelf package: SVMlight http://svmlight.joachims.org
Can perform stepwise selection for any classifier of the form

ŷ(x) = f( Σ_{j=1}^d w_j φ_j(x) )
For instance, logistic regression: start with a single feature, σ(w_1 x_{j1}^{d1}) − 1/2, then add a second one, σ(w_1 x_{j1}^{d1} + w_2 x_{j2}^{d2}) − 1/2, and so on.
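A minimal sketch of this greedy assembly (mine): it uses scikit-learn's LogisticRegression as the base fitter and refits all weights at every step, a common simplification of the scheme above; the candidate pool of raw coordinates and their squares is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)

candidates = np.hstack([X, X ** 2])      # candidate features phi_j(x): x_j and x_j^2
selected = []                            # indices of greedily chosen features

for step in range(3):
    best_j, best_acc = None, -np.inf
    for j in range(candidates.shape[1]):
        if j in selected:
            continue
        feats = candidates[:, selected + [j]]
        acc = LogisticRegression(max_iter=1000).fit(feats, y).score(feats, y)
        if acc > best_acc:
            best_j, best_acc = j, acc
    selected.append(best_j)
    print(f"step {step + 1}: added feature {best_j}, train accuracy {best_acc:.3f}")
```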