

Lecture 6: Bias, variance and overfitting

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

October 8, 2010 (revised October 11, 2010)

Review: error decomposition

E_{p(x,y)}\left[(y - \hat{w}_0 - \hat{w}_1 x)^2\right]
  = \underbrace{E_{p(x,y)}\left[(y - w_0^* - w_1^* x)^2\right]}_{\text{structural error}}
  + \underbrace{E_{p(x,y)}\left[(w_0^* + w_1^* x - \hat{w}_0 - \hat{w}_1 x)^2\right]}_{\text{estimation error}}

best regression: f^* = E[y \mid x]; best linear regression: w^*; estimate: \hat{w}

w^*: parameters of the best linear predictor
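One way to make the two terms concrete is a Monte Carlo sketch: pick a known distribution p(x, y), approximate the best linear predictor w^* with a fit on a very large sample, fit ŵ on a small training set, and estimate both expectations on fresh data. Everything below (the distribution, noise level, and sample sizes) is made up for illustration and is not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    # Hypothetical data distribution p(x, y): y = sin(x) + Gaussian noise
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + rng.normal(scale=0.3, size=n)
    return x, y

# Stand-in for the best linear predictor w*: least-squares fit on a very large sample
x_big, y_big = draw(200_000)
w_star = np.polyfit(x_big, y_big, deg=1)

# Estimate w_hat from a small training set
x_tr, y_tr = draw(20)
w_hat = np.polyfit(x_tr, y_tr, deg=1)

# Monte Carlo estimates of the two error terms on fresh data
x_te, y_te = draw(200_000)
structural = np.mean((y_te - np.polyval(w_star, x_te)) ** 2)
estimation = np.mean((np.polyval(w_star, x_te) - np.polyval(w_hat, x_te)) ** 2)
print(f"structural error ~ {structural:.3f}, estimation error ~ {estimation:.3f}")
```

The structural term stays fixed no matter how much training data we use; only the estimation term shrinks as the training set grows.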

Plan for today

  • More on overfitting
  • Model complexity
  • Model selection; cross-validation
  • Estimation theory; bias-variance tradeoff

Reminder: polynomial regression

f(x; w) = w_0 + \sum_{j=1}^{m} w_j x^j.

Define \tilde{x} = [1, x, x^2, \ldots, x^m]^T.

Then f(x; w) = w^T \tilde{x} and we are back to the familiar simple linear regression. The least squares solution:

\hat{w} = \left(\tilde{X}^T \tilde{X}\right)^{-1} \tilde{X}^T y,
\quad \text{where} \quad
\tilde{X} =
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^m \\
1 & x_2 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \vdots &        & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^m
\end{bmatrix}
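The least-squares solution above translates into a few lines of NumPy. A minimal sketch (the helper names design_matrix, fit_polynomial, and predict are mine, not from the lecture), which builds \tilde{X} and solves for ŵ with a least-squares solver rather than forming the inverse explicitly:

```python
import numpy as np

def design_matrix(x, m):
    """Rows [1, x_i, x_i^2, ..., x_i^m] -- the matrix X-tilde above."""
    return np.vander(x, N=m + 1, increasing=True)

def fit_polynomial(x, y, m):
    """Least-squares solution w_hat = (X^T X)^{-1} X^T y, solved via lstsq for stability."""
    w_hat, *_ = np.linalg.lstsq(design_matrix(x, m), y, rcond=None)
    return w_hat

def predict(x, w_hat):
    """Evaluate f(x; w) = w^T x-tilde at each point in x."""
    return design_matrix(x, len(w_hat) - 1) @ w_hat
```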

Model complexity and overfitting

Data drawn from 3rd order model:

[Figure: least-squares polynomial fits of order m = 1, m = 3, and m = 5 to the same data]
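To see the same effect numerically rather than in a plot, here is a small simulation sketch. The cubic coefficients, noise level, and sample size are made up for illustration; NumPy's built-in polyfit plays the role of the least-squares fit from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3rd-order model with additive Gaussian noise
x = rng.uniform(-5, 5, size=20)
y = 0.1 * x**3 - 0.2 * x**2 + x + 0.5 + rng.normal(scale=2.0, size=x.shape)

for m in (1, 3, 5):
    w_hat = np.polyfit(x, y, deg=m)                      # least-squares polynomial fit
    train_mse = np.mean((y - np.polyval(w_hat, x)) ** 2)  # error on the training data itself
    print(f"m = {m}: training MSE = {train_mse:.3f}")
```

The training error always decreases as m grows, but the high-order fit starts chasing the noise rather than the underlying 3rd-order model.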

Cross-validation

The basic idea: if a model overfits (is too sensitive to the data) it will be unstable, i.e., removing part of the data will change the fit significantly.

We can hold out part of the data, fit the model to the rest, and then test on the held-out set.

What are the problems of this approach?


  • If the held-out set is too small, we are susceptible to chance.
  • If it's too large, we get an overly pessimistic estimate (training on too little data).

Cross-validation

The improved holdout method: k-fold cross-validation

  • Partition data into k roughly equal parts;
  • Train on all but the j-th part, test on the j-th part.

[Diagram: data x_1, ..., x_N partitioned into k folds, with a different fold held out each time]


An extreme case: leave-one-out cross-validation

\hat{L}_{\mathrm{cv}} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(x_i; \hat{w}_{-i})\right)^2

where \hat{w}_{-i} is fit to all the data but the i-th example.
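A minimal sketch of both estimators, using NumPy's polyfit/polyval as the underlying least-squares fit. The function names and the choice of polynomial models are illustrative, not from the lecture:

```python
import numpy as np

def loo_cv(x, y, m):
    """Leave-one-out CV estimate L_cv for an m-th order polynomial fit."""
    N = len(x)
    errs = []
    for i in range(N):
        mask = np.arange(N) != i
        w_minus_i = np.polyfit(x[mask], y[mask], deg=m)   # fit to all data but the i-th example
        errs.append((y[i] - np.polyval(w_minus_i, x[i])) ** 2)
    return np.mean(errs)

def kfold_cv(x, y, m, k=5, seed=0):
    """k-fold variant: partition into k roughly equal parts; test on each part in turn."""
    idx = np.random.default_rng(seed).permutation(len(x))
    errs = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        w_hat = np.polyfit(x[train_idx], y[train_idx], deg=m)
        errs.append(np.mean((y[test_idx] - np.polyval(w_hat, x[test_idx])) ** 2))
    return np.mean(errs)
```

On data like the simulated cubic above, comparing loo_cv(x, y, m=3) against loo_cv(x, y, m=5) should favor the 3rd-order model, even though m = 5 has the lower training error.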

Cross-validation: example

[Figure: cross-validation example comparing fits of order m = 3 and m = 5]

This is a very good estimate, although expensive to compute

  • Need to run N estimation problems each on N − 1 examples!
  • An important research area: devising tricks for efficiently computing cross-validation estimates (by taking advantage of overlap between folds).

Understanding overfitting

Cross-validation provides some means of dealing with overfitting.

What is the source of overfitting? Why do some models overfit more than others?

We can try to get some insight by thinking about the estimation process for model parameters

A bit of estimation theory

An estimator θ̂ of a parameter θ is a function that, for data X = {x_1, ..., x_N}, produces an estimate (value) θ̂.

Examples: the ML estimator for a Gaussian mean, given X, produces an estimate (vector) μ̂; the ML estimator for linear regression parameters w under a Gaussian noise model produces ŵ.

The estimate θ̂ is a random variable since it is based on a randomly drawn set X.

We can talk about E[θ̂] and var(θ̂), taken over the random draw of X. (When θ is a vector, we have Cov(θ̂).)

  • Analysis done assuming that the data is distributed according to p(x; θ)!
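Since θ̂ is a random variable, E[θ̂] and var(θ̂) can be approximated by repeatedly drawing data sets from p(x; θ) and recomputing the estimate. A minimal simulation sketch for the Gaussian example (the parameter values, sample size, and number of trials are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

theta_mu, theta_sigma = 2.0, 3.0    # hypothetical "true" Gaussian parameters
N, trials = 10, 5000                # small N makes the effects easy to see

mu_hats, var_hats = [], []
for _ in range(trials):
    X = rng.normal(theta_mu, theta_sigma, size=N)   # a fresh data set drawn from p(x; theta)
    mu_hats.append(X.mean())                        # ML estimate of the mean
    var_hats.append(X.var())                        # ML estimate of the variance (divides by N)

print("E[mu_hat]   ~", np.mean(mu_hats))    # close to 2.0: the ML mean estimator is unbiased
print("var(mu_hat) ~", np.var(mu_hats))     # close to sigma^2 / N = 0.9
print("E[var_hat]  ~", np.mean(var_hats))   # close to (N-1)/N * sigma^2 = 8.1, not 9: biased
```

The simulated E[μ̂] matches the true mean, while the ML variance estimate is systematically low for small N; this is the kind of bias the next part of the lecture analyzes.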