



























TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 18, 2010
Logistic model:

\[
\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = w_0 + w^T x
\;\Rightarrow\;
p(y = 1 \mid x) = \frac{1}{1 + \exp(-w_0 - w^T x)},
\]

with the decision boundary given by w_0 + w^T x = 0.
Maximum likelihood = minimum log-loss:

\[
\operatorname*{argmax}_{w, w_0} \sum_{i=1}^{N} \Bigl[ y_i \log \sigma(w_0 + w^T x_i) + (1 - y_i) \log\bigl(1 - \sigma(w_0 + w^T x_i)\bigr) \Bigr].
\]
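As a concrete reference, here is a minimal sketch of the sigmoid and of this log-likelihood objective in Python/NumPy (my own illustration, not from the lecture; the names sigmoid and log_likelihood are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z)).
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def log_likelihood(w0, w, X, y, eps=1e-12):
    # Sum over i of  y_i log sigma(w0 + w^T x_i) + (1 - y_i) log(1 - sigma(w0 + w^T x_i)).
    p = sigmoid(w0 + X @ w)          # X: (N, d), w: (d,), y in {0, 1}
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```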
Today: talk more about logistic regression, and start large margin classification.
Recall that what we really want to minimize is the 0/1 loss. Instead, we are minimizing the log-loss:

\[
\operatorname*{argmax}_{w} \sum_{i=1}^{N} \log p(y_i \mid x_i; w)
\;=\;
\operatorname*{argmin}_{w} \; -\sum_{i=1}^{N} \log p(y_i \mid x_i; w).
\]
This is a surrogate loss; we work with it since it is not computationally feasible to optimize the 0/1 loss directly.
[Figure: losses L(yf(x), 1) plotted as a function of the margin yf(x): the 0/1 loss, the log-loss, and the squared error.]
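For reference, these curves can be written in terms of the margin yf(x) with f(x) = w_0 + w^T x; this assumes the y ∈ {−1, +1} label convention used in the plot (my notation, not verbatim from the slide):

```latex
% Losses as functions of the margin y f(x), assuming y in {-1, +1}:
\begin{aligned}
L_{0/1}\bigl(yf(x)\bigr)         &= \mathbf{1}\bigl[\, yf(x) \le 0 \,\bigr], \\
L_{\log}\bigl(yf(x)\bigr)        &= \log\bigl(1 + e^{-yf(x)}\bigr), \\
L_{\mathrm{sq}}\bigl(yf(x)\bigr) &= \bigl(1 - yf(x)\bigr)^{2}.
\end{aligned}
```

The log-loss here is just the negative log-likelihood of the logistic model rewritten in the ±1 label convention.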
As with regression, we can extend this framework to arbitrary features (basis functions):

\[
p(y = 1 \mid x) = \sigma\bigl(w_0 + w_1 \phi_1(x) + \dots + w_m \phi_m(x)\bigr).
\]
Example: quadratic logistic regression in 2D,

\[
p(y = 1 \mid x) = \sigma\bigl(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2\bigr).
\]

The decision boundary is

\[
w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 = 0,
\]

i.e. it's a quadratic decision boundary.
[Figure: two panels, "Linear" and "Quadratic", showing the corresponding decision boundaries on 2D data; both axes range from −6 to 6.]
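A minimal sketch of this construction (my own illustration, not the lecture's code): expand each 2D input into the quadratic basis above and pass a linear function of the expanded features through the sigmoid. The helper names and the example weights are hypothetical.

```python
import numpy as np

def quadratic_features(X):
    # Map 2D inputs (x1, x2) to the basis [x1, x2, x1^2, x2^2] used above.
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2])

def predict_proba(w0, w, X):
    # p(y = 1 | x) = sigma(w0 + w^T phi(x)) with the quadratic basis.
    z = w0 + quadratic_features(X) @ w
    return 1.0 / (1.0 + np.exp(-z))

# The decision boundary {x : w0 + w^T phi(x) = 0} is a conic section in (x1, x2).
w0, w = -4.0, np.array([0.0, 0.0, 1.0, 1.0])   # boundary: x1^2 + x2^2 = 4, a circle
print(predict_proba(w0, w, np.array([[0.0, 0.0], [3.0, 0.0]])))  # approx. [0.018, 0.993]
```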
We will look at a 2D example, and assume w_0 = 0, i.e. our model will be p̂(y = 1 | x) = σ(w_1 x_1 + w_2 x_2).
[Figure: contour plot of log p(X | w) as a function of (w_1, w_2), with high and low regions marked; both axes range from −3 to 3.]
A line αw in the (w_1, w_2) space corresponds to a set of parallel decision boundaries of the form αw^T x = 0.
The sign of α determines the direction.
We can get the same decision boundary with an infinite number of settings for w.
When the data are separable by w_0 + αw^T x = 0, what's the best choice for α?
p(y = 1 | x) = σ(w_0 + αw^T x).
With α → ∞, we have p(y_i | x_i; w_0, αw) → 1.
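A quick numeric sketch of this effect (my own, purely illustrative): on a linearly separable toy set, the log-likelihood keeps increasing as a separating w is scaled by larger α, so maximum likelihood pushes the weights toward infinity.

```python
import numpy as np

# Toy separable data in 2D (w0 = 0): positives have x1 > 0, negatives x1 < 0.
X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.0, 1.0]])
y = np.array([1, 1, 0, 0])
w = np.array([1.0, 0.0])               # a separating direction

def log_lik(alpha):
    p = 1.0 / (1.0 + np.exp(-(X @ (alpha * w))))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

for alpha in [1, 10, 100]:
    print(alpha, log_lik(alpha))        # the log-likelihood approaches 0 from below as alpha grows
```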
Similar problem in Lecture 1: given the observations H, H, H, H, do we believe the ML estimate that μ = p(X = H) = 1?

Solution: introduce a prior over μ, use Bayes' rule,

\[
p(\mu \mid X) = \frac{p(X \mid \mu)\, p(\mu)}{p(X)},
\]

and obtain the MAP estimate

\[
\hat{\mu}_{\mathrm{MAP}}
= \operatorname*{argmax}_{\mu} \log p(\mu \mid X)
= \operatorname*{argmax}_{\mu} \bigl\{ \log p(X \mid \mu) + \log p(\mu) \bigr\}.
\]

Usually we have a prior that favors values of μ away from 0 and 1.
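As one concrete instance of such a prior (an illustrative assumption here, not necessarily the prior used in Lecture 1), a Beta(a, b) prior with a, b > 1 puts low density near 0 and 1 and gives a closed-form MAP estimate:

```latex
% Assumed example: Beta(a, b) prior, p(mu) \propto mu^{a-1} (1 - mu)^{b-1}, with a, b > 1.
% For n_H heads out of n tosses, maximizing log p(X | mu) + log p(mu) gives
\hat{\mu}_{\mathrm{MAP}} = \frac{n_H + a - 1}{n + a + b - 2}.
% E.g. for H, H, H, H and a Beta(2, 2) prior:
% \hat{\mu}_{\mathrm{MAP}} = (4 + 1)/(4 + 2) = 5/6, pulled away from the ML estimate of 1.
```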
Intuition: similar to the coin toss experiment, we may have some belief about the value of w before seeing any data.
A possible prior that captures that belief:
\[
p(w) = \mathcal{N}\bigl(w;\, \mathbf{0},\, \sigma^2 I\bigr).
\]

In the 2D case (again, ignoring w_0) this means

\[
p(w_1, w_2) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{w_1^2 + w_2^2}{2\sigma^2} \right).
\]
[Figure: contour plots over (w_1, w_2), axes from −3 to 3, of the log-likelihood log p(X | w), the log-prior log p(w; σ), and the penalized log-likelihood log p̃(X, w; σ).]
This is our objective function, and we can find its peak by gradient ascent (gradient descent on its negative), as before:

\[
\log \tilde{p}(X, w; \sigma) = \sum_{i=1}^{N} \log p(y_i \mid x_i; w) \;-\; \frac{1}{2\sigma^2} \|w\|^2.
\]
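A compact sketch of maximizing this penalized log-likelihood by gradient ascent (my own illustration, not the lecture's code; the learning rate, iteration count, toy data, and the name fit_map_logistic are arbitrary choices, and w_0 is left unpenalized to match the "ignoring w_0" convention above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map_logistic(X, y, sigma2=1.0, lr=0.01, n_iters=2000):
    # Gradient ascent on  sum_i log p(y_i | x_i; w0, w) - ||w||^2 / (2 sigma^2).
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(w0 + X @ w)              # current p(y = 1 | x_i)
        residual = y - p                     # gradient of the likelihood term
        w0 += lr * residual.sum()            # the bias is not penalized
        w += lr * (X.T @ residual - w / sigma2)
    return w0, w

# Smaller sigma^2 = stronger pull toward w = 0 = smaller ||w||, less confident predictions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # linearly separable labels
for s2 in [1.0, 0.5, 0.1]:
    w0, w = fit_map_logistic(X, y, sigma2=s2)
    print(s2, round(np.linalg.norm(w), 3))        # the norm shrinks as sigma^2 decreases
```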
[Figure: the effect of regularization, three panels with σ² = 1, σ² = 0.5, and σ² = 0.1; both axes range from −3 to 3.]