1 Administrivia

  • In HW 2, when I say “true concept”, what I mean is “plot the underlying sine curve”. The idea is that there’s a true process (the sine wave) and you’re measuring it with a noisy sensor (the N(0, 1) term). You’re simply trying to reconstruct the underlying process as well as possible. You should end up with a bunch of plots that have the same sine wave and same noisy data points, but a variety of different estimates for the curve.
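A minimal sketch of the setup described above, assuming numpy/matplotlib and treating the range, sample count, and estimator as placeholders (they are not the HW 2 spec):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 50))        # sample locations (placeholder range)
    y = np.sin(x) + rng.normal(0, 1, size=x.shape)    # noisy measurements: sine + N(0, 1)

    xs = np.linspace(0, 2 * np.pi, 500)
    plt.plot(xs, np.sin(xs), label="true concept (sine curve)")
    plt.scatter(x, y, s=10, label="noisy data")
    # each estimator you try adds one more curve, e.g.:
    # plt.plot(xs, my_estimate(xs), label="estimate")   # my_estimate is hypothetical
    plt.legend()
    plt.show()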

2 Linear Discriminator Functions, Cont’d: SVMs

  • Recall that we formulated our maximum margin problem last time as: “find the W that maximizes the minimum distance to any point in the data”, which we formalized as follows:
  • We know that the (signed) distance from the hyperplane W to a point X is given by

\mathrm{dist}(X, W) = \frac{W^T X}{\|\hat{W}\|}

  • Therefore for any point in your data, Xi, with label yi, we have that

d_i = y_i \frac{W^T X_i}{\|\hat{W}\|}

which gives the absolute distance of X_i to W. (Note that the signed distance is < 0 only for y_i = −1.)

  • Thus, for separable data, there exists some W for which y_i W^T X_i > 0 (and hence d_i > 0) for all i.
  • Let C be the minimal distance between the hyperplane and the data over the whole data set:

C = \min_i \; y_i \frac{W^T X_i}{\|\hat{W}\|}
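As a concrete check of these definitions, here is a small numpy sketch (toy data and W chosen arbitrarily for illustration, with the bias folded into W so the hyperplane is W^T x = 0 and ‖Ŵ‖ is taken as ‖W‖):

    import numpy as np

    # toy, linearly separable 2-D data (illustration only)
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    W = np.array([1.0, 1.0])                 # some candidate hyperplane W^T x = 0

    d = y * (X @ W) / np.linalg.norm(W)      # signed distances d_i = y_i W^T X_i / ||W||
    C = d.min()                              # the margin: distance to the closest point
    print(d, C)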

  • Now C is the measure of the margin we’re looking for – it measures the closest distance between any data point and the hyperplane. So we’d like to find the W that maximizes C. Or, in other words,

\max_W \; C

\text{subject to } y_i \frac{W^T X_i}{\|\hat{W}\|} \ge C \quad \forall i

  • Which we can easily rewrite as:

\max_W \; C

\text{subject to } y_i W^T X_i \ge \|\hat{W}\|\, C \quad \forall i

  • Well... That’s very nice, but how do you solve that for the W that you want?
  • For those who’re familiar with linear programming/linear opt, you might recognize the form of the system that I wrote above. It looks like the general “maximize foo, subject to bar” form that we use for writing linear programs.
  • It turns out that this system is actually a quadratic program (QP). (This isn’t immediately obvious, but we’ll get to it in a second.) For the moment, all you really need to know is: if you can write down the problem you want to solve as a QP, there’re off-the-shelf software routines that you can just plug in that will solve it for you. It’s pretty black-box, and I won’t go into any more detail on how the QP solver works (way, way beyond the scope of this class). Suffice it to say that it’s enough just to be able to write your problem as a QP.
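For instance, “plug it into an off-the-shelf QP solver” looks roughly like the following, using cvxopt as one example of such a package (the two-variable problem below is a throwaway QP, not yet our SVM; once we massage the margin problem into this shape, it goes through the same call):

    from cvxopt import matrix, solvers

    # minimize (1/2) x^T P x + q^T x   subject to   G x <= h
    P = matrix([[2.0, 0.0], [0.0, 2.0]])     # quadratic term: x1^2 + x2^2
    q = matrix([-2.0, -4.0])                 # linear term
    G = matrix([[-1.0, 0.0], [0.0, -1.0]])   # constraints: -x1 <= 0, -x2 <= 0
    h = matrix([0.0, 0.0])

    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h)
    print(sol['x'])                          # minimizer, here approximately (1, 2)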
  • But our system still isn’t in the right form for a QP. That’s going to require something of the form:

\max_W \; \text{(some function of } W^2\text{)}

\text{subject to (a bunch of linear constraints)}

But what we have is something of the form

\max_W \; \min_i \; W^T X_i

\text{subject to (a bunch of linear constraints)}

  • The trick is to notice that when the margin is maximized, it exactly touches (at least) one point on each side of the hyperplane. Also, because rescaling W doesn’t change the hyperplane, we can pick a scale for W such that W^T X = ±1 at the margins. This changes the “subject to” conditions we gave above to:

\text{subject to } y_i W^T X_i \ge 1 \quad \forall i
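A quick numerical illustration of the rescaling argument, reusing the toy data from the sketch above (bias again folded into W): dividing W by min_i y_i W^T X_i leaves the decision boundary unchanged but puts the closest point(s) exactly at y_i W^T X_i = 1.

    import numpy as np

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    W = np.array([1.0, 1.0])

    s = (y * (X @ W)).min()                  # min_i y_i W^T X_i  (> 0 since separable)
    W_scaled = W / s                         # same hyperplane, new scale

    print(y * (X @ W_scaled))                # the closest point now sits exactly at 1
    print(np.array_equal(np.sign(X @ W), np.sign(X @ W_scaled)))  # predictions unchanged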

  • At those points exactly on the margin, we have that:

\hat{W}^T X_i + w_0 = +1 \quad for points “above” the plane
\hat{W}^T X_j + w_0 = -1 \quad for points “below” the plane

The first equation defines a plane (w_0 − 1)/‖Ŵ‖ units from the origin; the second defines a plane (w_0 + 1)/‖Ŵ‖ units from the origin. Subtracting, we find that the margin C is 2/‖Ŵ‖ units wide. Therefore, we can maximize the margin simply by maximizing 2/‖Ŵ‖ or, equivalently, minimizing Ŵ^T Ŵ.
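Spelling out the subtraction with the plane offsets as given:

\frac{w_0 + 1}{\|\hat{W}\|} - \frac{w_0 - 1}{\|\hat{W}\|} = \frac{2}{\|\hat{W}\|}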

  • Thus, we can write our final form as:

\min_W \; \hat{W}^T \hat{W}

\text{subject to } y_i W^T X_i \ge 1 \quad \forall i

(No, it’s not obvious, and I told you I wouldn’t prove it to you. Go look up the Burges tutorial on SVMs if you really want to understand where that came from... )
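The expression the next bullet refers to is presumably the dual of this QP, which this preview skips; for reference, the standard dual for the biased hyperplane Ŵ^T X + w_0 (as derived in the Burges tutorial) is

\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, X_i^T X_j

\text{subject to } \alpha_i \ge 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0

Note that the data enter this expression only through the dot products X_i^T X_j.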

  • The important thing about that expression is that it’s written in terms of a dot product between the Xi. That is, if you know the value of the dot product, you don’t actually need to know what the Xi themselves are!
  • Why is this useful? Well, suppose that we take our original X_i, which live in R^d, and project them into some higher-dimensional space R^k via a nonlinear transform Φ(X) (k ≫ d).
  • It may be that k is so large that Φ(X) is painful to manipulate directly (we have examples where k = ∞).
  • But suppose that we have a convenient way to compute the product K(X_i, X_j) = Φ(X_i)^T Φ(X_j). Now we could just plug in K in place of the dot product above and carry through the same solution. It turns out to work just fine.
  • Which begs the question: if you can’t handle Φ, how can you get K?
  • Well, often K is easy to represent, even if Φ is hard. E.g.,

K(X_i, X_j) = \exp\!\left( -\frac{(X_i - X_j)^T (X_i - X_j)}{2\sigma^2} \right)

which you should recognize as, essentially, being a Gaussian. That’s a K equivalent to an infinite-dimensional Φ (again, I won’t prove this to you). Another example is:

K(X_i, X_j) = (X_i^T X_j)^p

for some integer p > 1. It turns out that if your original X_i ∈ R^d, then this corresponds to a Φ that lives in a space of dimension \binom{d+p-1}{p}. If you’re looking at, say, 16 × 16 images (d = 256) and p = 4 (a degree-4 polynomial expansion), then Φ maps into a 183,181,376-dimensional space. Wow!
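As a closing sketch (assuming numpy and scipy, with random placeholder data standing in for real images), here is how these two kernels are computed in practice, plus a check of the dimension count quoted above:

    import numpy as np
    from scipy.special import comb

    def gaussian_kernel(xi, xj, sigma=1.0):
        # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
        diff = xi - xj
        return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

    def poly_kernel(xi, xj, p=4):
        # K(xi, xj) = (xi^T xj)^p
        return (xi @ xj) ** p

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 256))            # 5 "flattened 16x16 images" (placeholder data)
    K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
    print(K.shape)                           # (5, 5): we never build Phi explicitly

    # implicit feature-space dimension for the degree-p polynomial kernel:
    d, p = 256, 4
    print(comb(d + p - 1, p, exact=True))    # 183181376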