TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 15, 2010
Decision boundary set by $\hat{w}^T x + w_0 = 0$.
$$\hat{y} = h(x) = \mathrm{sign}(w_0 + w^T x)$$
Classifying using a linear decision boundary effectively reduces the data dimension to 1.
Need to find $w$ (direction) and $w_0$ (location) of the boundary.
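A minimal sketch of this rule (not from the slides; the weights and points below are made-up examples) showing how the linear score $w_0 + w^T x$ reduces each point to a single number whose sign is the prediction:

```python
# Hypothetical example: classify points by the sign of the linear score w0 + w^T x.
import numpy as np

w = np.array([2.0, -1.0])    # assumed direction of the boundary
w0 = 0.5                     # assumed offset (location)

X = np.array([[1.0, 0.0],    # a few made-up 2-D points
              [0.0, 3.0],
              [-1.0, -1.0]])

scores = w0 + X @ w          # each point reduced to one scalar score
y_hat = np.sign(scores)      # predicted label from the sign of the score
print(scores)                # [ 2.5 -2.5 -0.5]
print(y_hat)                 # [ 1. -1. -1.]
```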
Want to minimize the expected zero/one loss for a classifier $h : \mathcal{X} \to \mathcal{Y}$, which for $(x, y)$ is
$$L(h(x), y) = \begin{cases} 0 & \text{if } h(x) = y, \\ 1 & \text{if } h(x) \neq y. \end{cases}$$
The risk (expected loss) of a C-way classifier $h(x)$:
$$R(h) = \mathbb{E}_{x,y}\left[L(h(x), y)\right] = \int_x \sum_{c=1}^{C} L(h(x), c)\, p(x, y = c)\, dx = \int_x \left[\sum_{c=1}^{C} L(h(x), c)\, p(y = c \mid x)\right] p(x)\, dx$$
Clearly, it's enough to minimize the conditional risk for any $x$:
$$R(h \mid x) = \sum_{c=1}^{C} L(h(x), c)\, p(y = c \mid x) = 0 \cdot p(y = h(x) \mid x) + 1 \cdot \sum_{c \neq h(x)} p(y = c \mid x) = 1 - p(y = h(x) \mid x).$$
To minimize the conditional risk given $x$, the classifier must decide
$$h(x) = \arg\max_c\, p(y = c \mid x).$$
This is the best possible classifier in terms of generalization, i.e., the expected misclassification rate on new examples.
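A small numerical sketch of this argument (hypothetical posterior values, not from the slides): under 0/1 loss the conditional risk of deciding class $k$ is $1 - p(y = k \mid x)$, so picking the most probable class minimizes it.

```python
# Hypothetical posterior p(y = c | x) for a 3-class problem.
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])

# Conditional risk of deciding class k under 0/1 loss: 1 - p(y = k | x).
cond_risk = 1.0 - posterior             # [0.8, 0.5, 0.7]

h_star = np.argmax(posterior)           # Bayes-optimal decision: class 1
assert h_star == np.argmin(cond_risk)   # the same class minimizes the conditional risk
print(cond_risk, h_star)
```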
The optimal rule $h(x) = \arg\max_c p(y = c \mid x)$ is equivalent to
$$h(x) = c^* \;\Leftrightarrow\; \frac{p(y = c^* \mid x)}{p(y = c \mid x)} \ge 1 \;\; \forall c \;\Leftrightarrow\; \log \frac{p(y = c^* \mid x)}{p(y = c \mid x)} \ge 0 \;\; \forall c.$$
For the binary case,
$$h(x) = 1 \;\Leftrightarrow\; \log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \ge 0.$$
We can model the (unknown) decision boundary directly:
$$\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = w_0 + w^T x = 0.$$
Since $p(y = 1 \mid x) = 1 - p(y = 0 \mid x)$, we have (after exponentiating):
$$\frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)} = \exp(w_0 + w^T x) = 1$$
$$\frac{1}{p(y = 1 \mid x)} = 1 + \exp(-w_0 - w^T x) = 2 \;\Rightarrow\; p(y = 1 \mid x) = \frac{1}{1 + \exp(-w_0 - w^T x)}$$
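A quick numerical check of this algebra (arbitrary illustrative values): the odds $p/(1-p)$ recovered from $p = 1/(1 + \exp(-w_0 - w^T x))$ do equal $\exp(w_0 + w^T x)$.

```python
# Check the odds identity p / (1 - p) = exp(w0 + w^T x) for arbitrary scores.
import numpy as np

t = np.array([-2.0, 0.0, 1.5])     # arbitrary values of w0 + w^T x
p = 1.0 / (1.0 + np.exp(-t))       # p(y = 1 | x) under the logistic model

print(p / (1.0 - p))               # [0.135..., 1.0, 4.481...]
print(np.exp(t))                   # matches exp(w0 + w^T x)
```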
The logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$:
For any $x$, $0 \le \sigma(x) \le 1$; monotonic, with $\sigma(-\infty) = 0$, $\sigma(+\infty) = 1$, $\sigma(0) = 1/2$.
To shift the crossing to an arbitrary $z$: $\sigma(x - z)$.
To change the "slope": $\sigma(ax)$.
[Figure: plots of $\sigma(x)$, $\sigma(x-2)$, $\sigma(2x)$, $\sigma(0.5x+1)$]
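A quick sketch of these properties (illustrative values only, not from the slides):

```python
# The logistic (sigmoid) function and the shifted/rescaled variants mentioned above.
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
print(sigma(0.0))        # 0.5: the curve crosses 1/2 at x = 0
print(sigma(x))          # increases monotonically from near 0 to near 1
print(sigma(x - 2.0))    # sigma(x - z): crossing shifted to x = 2
print(sigma(2.0 * x))    # sigma(a x): steeper "slope" at the crossing
```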
What if $x \in \mathbb{R}^d$, $x = [x_1 \ldots x_d]^T$?
$\sigma(w_0 + w^T x)$ is a scalar function of the scalar variable $w_0 + w^T x$:
the direction of $w$ determines the orientation; $w_0$ determines the location; $\|w\|$ determines the slope.
$$p(y = 1 \mid x) = \sigma(w_0 + w^T x) = 1/2 \;\Leftrightarrow\; w_0 + w^T x = 0$$
With the linear logistic model we get a linear decision boundary.
[Figure: the decision boundary $w_0 + w^T x = 0$ and the weight vector $w$]
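A final sketch tying this together (the weights below are made up): with the linear logistic model, thresholding $p(y = 1 \mid x)$ at $1/2$ is the same as thresholding the log-odds $w_0 + w^T x$ at $0$, which is exactly the linear decision boundary.

```python
# Hypothetical linear logistic model in 2-D.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([1.0, -2.0])              # assumed weight vector
w0 = 0.5                               # assumed bias

X = np.array([[0.0, 0.25],             # this point lies exactly on the boundary
              [2.0, 1.0],
              [-1.0, 1.0]])

log_odds = w0 + X @ w                  # w0 + w^T x
p1 = sigma(log_odds)                   # p(y = 1 | x) under the logistic model

print(log_odds)                        # [ 0.   0.5 -2.5]
print(p1)                              # [0.5, ~0.62, ~0.08]
print((p1 > 0.5) == (log_odds > 0))    # both decision rules agree
```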