TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 8, 2010 (revised October 11, 2010)
$$
E_{p(x,y)}\!\left[(y - \hat{w}_0 - \hat{w}_1 x)^2\right]
= \underbrace{E_{p(x,y)}\!\left[(y - w^*_0 - w^*_1 x)^2\right]}_{\text{structural error}}
+ \underbrace{E_{p(x,y)}\!\left[(w^*_0 + w^*_1 x - \hat{w}_0 - \hat{w}_1 x)^2\right]}_{\text{estimation error}}
$$

best regression: $f^*(x) = E[y \mid x]$
best linear regression: $w^*$, the parameters of the best linear predictor
estimate from the training data: $\hat{w}$
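A small Monte Carlo sketch of the structural/estimation error decomposition above; the data-generating distribution, sample sizes, and helper names are illustrative assumptions, not from the slides:

```python
# Sketch: estimate structural vs. estimation error for a linear fit
# to data whose true regression function is nonlinear.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw (x, y) with an assumed nonlinear true regression E[y|x] = sin(2x)."""
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(2 * x) + 0.3 * rng.normal(size=n)
    return x, y

def fit_linear(x, y):
    """Least squares fit of y ~ w0 + w1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Approximate the *best* linear predictor w* with a very large sample.
w_star = fit_linear(*sample(1_000_000))

# Fit w_hat on a small training set.
w_hat = fit_linear(*sample(30))

# Estimate the expectations on a large held-out sample.
x, y = sample(200_000)
pred_star = w_star[0] + w_star[1] * x
pred_hat = w_hat[0] + w_hat[1] * x

total = np.mean((y - pred_hat) ** 2)          # expected loss of the estimate
structural = np.mean((y - pred_star) ** 2)    # structural error
estimation = np.mean((pred_star - pred_hat) ** 2)  # estimation error
print(total, structural + estimation)         # approximately equal
```

With a small training set the estimation error can dominate; as the training set grows it shrinks toward zero, while the structural error stays fixed.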
More on overfitting
Model complexity
Model selection; cross-validation
Estimation theory; bias-variance tradeoff
Polynomial regression of degree m:

$$f(x; \mathbf{w}) = w_0 + \sum_{j=1}^{m} w_j x^j.$$

Define $\tilde{x} = [1, x, x^2, \ldots, x^m]^T$.

Then $f(x; \mathbf{w}) = \mathbf{w}^T \tilde{x}$ and we are back to the familiar simple linear regression. The least squares solution:

$$\hat{\mathbf{w}} = \big(\tilde{X}^T \tilde{X}\big)^{-1} \tilde{X}^T \mathbf{y},
\qquad \text{where }
\tilde{X} = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^m \\
1 & x_2 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^m
\end{bmatrix}$$
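As a concrete sketch (not code from the lecture), the fit can be computed with NumPy; the function names and the use of `np.linalg.lstsq` instead of the explicit inverse are my choices:

```python
import numpy as np

def design_matrix(x, m):
    """Rows are x_tilde = [1, x_i, x_i^2, ..., x_i^m]."""
    return np.vander(x, N=m + 1, increasing=True)

def fit_poly(x, y, m):
    """Least squares solution w_hat for a degree-m polynomial."""
    X_tilde = design_matrix(x, m)
    # Solve the least squares problem directly; numerically this is
    # preferable to forming (X^T X)^{-1} X^T y explicitly.
    w_hat, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
    return w_hat

def predict(w, x):
    """Evaluate f(x; w) = w^T x_tilde."""
    return design_matrix(x, len(w) - 1) @ w
```

For example, `fit_poly(x, y, m=3)` returns the coefficients of a cubic fit, as in the figures below.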
Data drawn from a 3rd order model:

[Figure: the same data set fit with polynomials of degree m = 1, m = 3, and m = 5.]
The basic idea: if a model overfits (is too sensitive to the data), it will be unstable, i.e. removing part of the data will change the fit significantly.

We can hold out part of the data, fit the model to the rest, and then test on the held-out set.

What are the problems with this approach?
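A minimal holdout sketch, assuming the illustrative polynomial-regression setting above (the data, split fraction, and candidate degrees are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 60)
y = 0.5 * x**3 - x + rng.normal(scale=3.0, size=60)   # data from a 3rd order model

# Hold out ~25% of the data; fit on the rest.
perm = rng.permutation(len(x))
n_train = int(0.75 * len(x))
train, held_out = perm[:n_train], perm[n_train:]

for m in (1, 3, 5):
    w = np.polyfit(x[train], y[train], deg=m)          # least squares fit on the training part
    mse = np.mean((y[held_out] - np.polyval(w, x[held_out])) ** 2)
    print(f"m = {m}: held-out MSE = {mse:.2f}")
```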
The improved holdout method: k-fold cross-validation. Split the data into k roughly equal folds; for each fold, fit the model to the remaining k − 1 folds, test on the held-out fold, and average the k test errors.
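A sketch of k-fold cross-validation in the same illustrative setting (the choice k = 5 and the helper name are assumptions):

```python
import numpy as np

def kfold_mse(x, y, m, k=5, seed=0):
    """Average held-out squared error of a degree-m polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        w = np.polyfit(x[train], y[train], deg=m)
        errors.append(np.mean((y[held_out] - np.polyval(w, x[held_out])) ** 2))
    return np.mean(errors)

# e.g. pick the degree with the smallest cross-validated error:
# best_m = min((1, 3, 5), key=lambda m: kfold_mse(x, y, m))
```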
An extreme case: leave-one-out cross-validation
$$\hat{L}_{\mathrm{cv}} = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - f(x_i; \hat{\mathbf{w}}_{-i})\big)^2$$

where $\hat{\mathbf{w}}_{-i}$ is fit to all the data but the i-th example.
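A direct implementation sketch of $\hat{L}_{\mathrm{cv}}$ (leave-one-out is just k-fold with k = N; the polynomial fit is again illustrative):

```python
import numpy as np

def loocv_mse(x, y, m):
    """(1/N) * sum_i (y_i - f(x_i; w_hat_{-i}))^2 for a degree-m polynomial."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i                 # all the data but the i-th example
        w_minus_i = np.polyfit(x[mask], y[mask], deg=m)
        errors.append((y[i] - np.polyval(w_minus_i, x[i])) ** 2)
    return np.mean(errors)
```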
[Figure: fits with polynomial degree m = 3 (left) and m = 5 (right).]
This is a very good estimate, although it is expensive to compute.
Cross-validation provides some means of dealing with overfitting.
What is the source of overfitting? Why do some models overfit more than others?
We can try to get some insight by thinking about the estimation process for model parameters
An estimator $\hat{\theta}$ of a parameter $\theta$ is a function that, for data $X = \{x_1, \ldots, x_N\}$, produces an estimate (value) $\hat{\theta}$.

Examples:
the ML estimator for a Gaussian mean, given $X$, produces an estimate (vector) $\hat{\mu}$;
the ML estimator for linear regression parameters $\mathbf{w}$ under a Gaussian noise model.

The estimate $\hat{\theta}$ is a random variable, since it is based on a randomly drawn set $X$.

We can talk about $E\big[\hat{\theta}(X)\big]$ and $\mathrm{var}\big(\hat{\theta}(X)\big)$. (When $\theta$ is a vector, we have a covariance matrix $\mathrm{Cov}(\hat{\theta})$.)
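A tiny simulation of this for the ML estimator of a Gaussian mean (the true mean, variance, N, and number of repetitions are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, N = 2.0, 1.0, 25

# Each repetition draws a fresh data set X and computes mu_hat(X) = sample mean.
mu_hats = np.array([rng.normal(mu_true, sigma, N).mean() for _ in range(10_000)])

print(mu_hats.mean())   # close to E[mu_hat] = mu_true (the estimator is unbiased)
print(mu_hats.var())    # close to var(mu_hat) = sigma^2 / N
```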