1 Administrivia

  • In HW 2, when I say “true concept”, what I mean is “plot the underlying sine curve”. The idea is that there’s a true process (the sine wave) and you’re measuring it with a noisy sensor (the N(0, 1) term). You’re simply trying to reconstruct the underlying process as well as possible. You should end up with a bunch of plots that have the same sine wave and same noisy data points, but a variety of different estimates for the curve.
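A minimal sketch of the setup described above, assuming numpy/matplotlib and treating the range, sample count, and estimator as placeholders (they are not the HW 2 spec):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 50))        # sample locations (placeholder range)
    y = np.sin(x) + rng.normal(0, 1, size=x.shape)    # noisy measurements: sine + N(0, 1)

    xs = np.linspace(0, 2 * np.pi, 500)
    plt.plot(xs, np.sin(xs), label="true concept (sine curve)")
    plt.scatter(x, y, s=10, label="noisy data")
    # each estimator you try adds one more curve, e.g.:
    # plt.plot(xs, my_estimate(xs), label="estimate")   # my_estimate is hypothetical
    plt.legend()
    plt.show()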

2 Linear Discriminator Functions, Cont’d: SVMs

  • Recall that we formulated our maximum margin problem last time as: “find the W that maximizes the minimum distance to any point in the data”, which we formalized as follows:
  • We know that the (signed) distance from the hyperplane W to a point X is given by

\mathrm{dist}(X, W) = \frac{W^T X}{\|\hat{W}\|}

  • Therefore for any point in your data, Xi, with label yi, we have that

d_i = y_i \frac{W^T X_i}{\|\hat{W}\|}

which gives the absolute distance of X_i to W. (Note that the signed distance is < 0 only for y_i = −1.)

  • Thus, for separable data, there exists some W for which y_i W^T X_i > 0 (and hence d_i > 0) for all i.
  • Let C be the minimal distance between the hyperplane and the data over the whole data set:

C = \min_i \; y_i \frac{W^T X_i}{\|\hat{W}\|}
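As a concrete check of these definitions, here is a small numpy sketch (toy data and W chosen arbitrarily for illustration, with the bias folded into W so the hyperplane is W^T x = 0 and ‖Ŵ‖ is taken as ‖W‖):

    import numpy as np

    # toy, linearly separable 2-D data (illustration only)
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    W = np.array([1.0, 1.0])                 # some candidate hyperplane W^T x = 0

    d = y * (X @ W) / np.linalg.norm(W)      # signed distances d_i = y_i W^T X_i / ||W||
    C = d.min()                              # the margin: distance to the closest point
    print(d, C)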

  • Now C is the measure of the margin we’re looking for – it measures the closest distance between any data point and the hyperplane. So we’d like to find the W that maximizes C. Or, in other words,

\max_W \; C

\text{subject to } y_i \frac{W^T X_i}{\|\hat{W}\|} \ge C \quad \forall i

  • Which we can easily rewrite as:

\max_W \; C

\text{subject to } y_i W^T X_i \ge \|\hat{W}\|\, C \quad \forall i

  • Well... That’s very nice, but how do you solve that for the W that you want?
  • For those who’re familiar with linear programming/linear opt, you might recognize the form of the system that I wrote above. It looks like the general “maximize foo, subject to bar” form that we use for writing linear programs.
  • It turns out that this system is actually a quadratic program (QP). (This isn’t immediately obvious, but we’ll get to it in a second.) For the moment, all you really need to know is: if you can write down the problem you want to solve as a QP, there’re off-the-shelf software routines that you can just plug in that will solve it for you. It’s pretty black-box, and I won’t go into any more detail on how the QP solver works (way, way beyond the scope of this class). Suffice it to say that it’s enough just to be able to write your problem as a QP.
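For instance, “plug it into an off-the-shelf QP solver” looks roughly like the following, using cvxopt as one example of such a package (the two-variable problem below is a throwaway QP, not yet our SVM; once we massage the margin problem into this shape, it goes through the same call):

    from cvxopt import matrix, solvers

    # minimize (1/2) x^T P x + q^T x   subject to   G x <= h
    P = matrix([[2.0, 0.0], [0.0, 2.0]])     # quadratic term: x1^2 + x2^2
    q = matrix([-2.0, -4.0])                 # linear term
    G = matrix([[-1.0, 0.0], [0.0, -1.0]])   # constraints: -x1 <= 0, -x2 <= 0
    h = matrix([0.0, 0.0])

    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h)
    print(sol['x'])                          # minimizer, here approximately (1, 2)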
  • But our system still isn’t in the right form for a QP. That’s going to require something of the form:

\max_W \; \text{(some function of } W^2\text{)}

\text{subject to (a bunch of linear constraints)}

But what we have is something of the form

\max_W \; \min_i \; W^T X_i

\text{subject to (a bunch of linear constraints)}

  • The trick is to notice that when the margin is maximized, it exactly touches (at least) one point on each side of the hyperplane. Also, because rescaling W doesn’t change the hyperplane, we can pick a scale for W such that W^T X = ±1 at the margins. This changes the “subject to” conditions we gave above to:

\text{subject to } y_i W^T X_i \ge 1 \quad \forall i
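A quick numerical illustration of the rescaling argument, reusing the toy data from the sketch above (bias again folded into W): dividing W by min_i y_i W^T X_i leaves the decision boundary unchanged but puts the closest point(s) exactly at y_i W^T X_i = 1.

    import numpy as np

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    W = np.array([1.0, 1.0])

    s = (y * (X @ W)).min()                  # min_i y_i W^T X_i  (> 0 since separable)
    W_scaled = W / s                         # same hyperplane, new scale

    print(y * (X @ W_scaled))                # the closest point now sits exactly at 1
    print(np.array_equal(np.sign(X @ W), np.sign(X @ W_scaled)))  # predictions unchanged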

  • At those points exactly on the margin, we have that:

\hat{W}^T X_i + w_0 = +1 \quad for points “above” the plane
\hat{W}^T X_j + w_0 = -1 \quad for points “below” the plane

The first equation defines a plane (w_0 − 1)/‖Ŵ‖ units from the origin; the second defines a plane (w_0 + 1)/‖Ŵ‖ units from the origin. Subtracting, we find that the margin C is 2/‖Ŵ‖ units wide. Therefore, we can maximize the margin simply by maximizing 2/‖Ŵ‖ or, equivalently, minimizing Ŵ^T Ŵ.
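Spelling out the subtraction with the plane offsets as given:

\frac{w_0 + 1}{\|\hat{W}\|} - \frac{w_0 - 1}{\|\hat{W}\|} = \frac{2}{\|\hat{W}\|}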

  • Thus, we can write our final form as:

\min_W \; \hat{W}^T \hat{W}

\text{subject to } y_i W^T X_i \ge 1 \quad \forall i

(No, it’s not obvious, and I told you I wouldn’t prove it to you. Go look up the Burges tutorial on SVMs if you really want to understand where that came from... )
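The expression the next bullet refers to is presumably the dual of this QP, which this preview skips; for reference, the standard dual for the biased hyperplane Ŵ^T X + w_0 (as derived in the Burges tutorial) is

\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, X_i^T X_j

\text{subject to } \alpha_i \ge 0 \;\; \forall i, \qquad \sum_i \alpha_i y_i = 0

Note that the data enter this expression only through the dot products X_i^T X_j.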

  • The important thing about that expression is that it’s written in terms of a dot product between the Xi. That is, if you know the value of the dot product, you don’t actually need to know what the Xi themselves are!
  • Why is this useful? Well, suppose that we take our original X_i, which live in R^d, and project them into some higher-dimensional space R^k via a nonlinear transform Φ(X) (k ≫ d).
  • It may be that k is so large that Φ(X) is painful to manipulate directly (we have examples where k = ∞).
  • But suppose that we have a convenient way to compute the product K(X_i, X_j) = Φ(X_i)^T Φ(X_j). Now we could just plug in K in place of the dot product above and carry through the same solution. It turns out to work just fine.
  • Which begs the question: if you can’t handle Φ, how can you get K?
  • Well, often K is easy to represent, even if Φ is hard. E.g.,

K(X_i, X_j) = \exp\!\left( -\frac{(X_i - X_j)^T (X_i - X_j)}{2\sigma^2} \right)

which you should recognize as, essentially, being a Gaussian. That’s a K equivalent to an infinite-dimensional Φ (again, I won’t prove this to you). Another example is:

K(X_i, X_j) = (X_i^T X_j)^p

for some integer p > 1. It turns out that if your original X_i ∈ R^d, then this corresponds to a Φ that lives in a space of dimension \binom{d+p-1}{p}. If you’re looking at, say, 16 × 16 images (d = 256) and p = 4 (a degree-4 polynomial expansion), then Φ maps into a 183,181,376-dimensional space. Wow!
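As a closing sketch (assuming numpy and scipy, with random placeholder data standing in for real images), here is how these two kernels are computed in practice, plus a check of the dimension count quoted above:

    import numpy as np
    from scipy.special import comb

    def gaussian_kernel(xi, xj, sigma=1.0):
        # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
        diff = xi - xj
        return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

    def poly_kernel(xi, xj, p=4):
        # K(xi, xj) = (xi^T xj)^p
        return (xi @ xj) ** p

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 256))            # 5 "flattened 16x16 images" (placeholder data)
    K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
    print(K.shape)                           # (5, 5): we never build Phi explicitly

    # implicit feature-space dimension for the degree-p polynomial kernel:
    d, p = 256, 4
    print(comb(d + p - 1, p, exact=True))    # 183181376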