Understanding Bias, Variance, Overfitting, and p-Hacking in ML

These lecture notes cover bias, variance, overfitting, and p-hacking in statistical modeling and machine learning. They explain how these issues affect a model's predictive accuracy and why cross-validation is the standard tool for model selection, closing with an extended example in R using wage data.

What you will learn

  • How does overfitting affect the accuracy of a model?
  • What is the difference between bias and variance in statistical modeling?
  • What is p-hacking, and how can it impact the results of a statistical analysis?



glm(Cat ~ ., data = df[df$cat + df$fox == 1, c(1,3)], family = binomial)  # Cat vs. Fox

Then for each new animal we encounter of unknown species, we call predict() three times, again yielding three estimated conditional probabilities. Say in the first one, Cat “wins,” i.e. the conditional probability is less than 0.5. Say Dog wins in the second, and Cat wins in the third. Since Cat had the most wins, we predict Cat.
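To make the voting procedure concrete, here is a minimal AVA sketch (not the author's code) that fits all pairwise logit models and predicts by majority vote. It assumes a hypothetical data frame animals with dummy response columns Cat, Dog and Fox and a single predictor column wt.

classes <- c("Cat", "Dog", "Fox")
pairs <- combn(classes, 2, simplify = FALSE)   # all C(3,2) = 3 class pairs

# fit one logit model per pair, using only the rows belonging to that pair
fits <- lapply(pairs, function(pr) {
   sub <- animals[animals[[pr[1]]] + animals[[pr[2]]] == 1, ]
   glm(as.formula(paste(pr[1], "~ wt")), data = sub, family = binomial)
})

# predict one new case by majority vote over the pairwise models
predictAVA <- function(newcase) {
   votes <- sapply(seq_along(pairs), function(i) {
      p <- predict(fits[[i]], newcase, type = "response")  # P(first class of the pair)
      if (p > 0.5) pairs[[i]][1] else pairs[[i]][2]
   })
   names(which.max(table(votes)))   # the class with the most wins
}

predictAVA(data.frame(wt = 3.2))   # returns the name of the winning class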

Comparison

At first, OVA seems much better than AVA. If we have m levels, AVA means running C(m, 2) = O(m^2) pairwise logit models, rather than m for OVA. However, that is somewhat compensated by the fact that each pairwise model will be based on less data, and some analysts contend that AVA can have better accuracy. It remains a bit of a controversy.

3.4 Bias, Variance, Overfitting and p-Hacking

By far the most vexing issue in statistics and machine learning is that of overfitting.

3.4.1 What Is Overfitting?

Suppose we have just one predictor, and n data points. If we fit a polynomial model of degree n − 1, the resulting curve will pass through all n points, a “perfect” fit. For instance:

> x <- rnorm(6)
> y <- rnorm(6)   # unrelated to x!
> df <- data.frame(x, y)
> df$x2 <- x^2
> df$x3 <- x^3
> df$x4 <- x^4
> df$x5 <- x^5
> df
           x          y         x2           x3
1 -1.1855131  0.2881291 1.40544120 -1.666168894
2 -1.7838769 -2.0741740 3.18221664 -5.676682627
3 -0.7124510 -0.4253678 0.50758640 -0.361630431
4  0.1676111 -0.1949265 0.02809348  0.004708779
5  1.2462926 -0.7348481 1.55324535  1.935798245
6  0.3741604  1.9521667 0.13999601  0.052380963
            x4           x5
1 1.975265e+00 -2.341702414
2 1.012650e+01 -18.
3 2.576440e-01 -0.
4 7.892437e-04 0.
5 2.412571e+00 3.
6 1.959888e-02 0.

> lmo <- lm(y ~ ., data = df)
> lmo

Call:
lm(formula = y ~ ., data = df)

Coefficients:
(Intercept)            x           x2           x3
    -1.3127       4.7632      11.4809           0.
         x4           x5
    -6.9685          -2.

> lmo$fitted.values
         1          2          3          4          5
 0.2881291 -2.0741740 -0.4253678 -0.1949265 -0.7348481
         6
 1.9521667
> y
[1]  0.2881291 -2.0741740 -0.4253678 -0.1949265 -0.7348481
[6]  1.9521667

Yes, we “predicted” y perfectly, even though there was no relation between the response and predictor variables. Clearly that “perfect fit” is illusory, mere “noise fitting.” Our ability to predict future cases would not be good. This is overfitting.

Let’s take a closer look, in an RS context. Say we believe (3.14) is a good model for the setting described in that section, i.e. men becoming more liberal raters as they age but women becoming more conservative. If we omit the interaction term, then we will underpredict older men and overpredict older women. This biases our ratings.
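As a rough sketch of that point (the model (3.14) itself is not reproduced here, and the data frame ratings with columns rating, age and gender is hypothetical), compare a fit that allows the age trend to differ by gender with one that does not:

withInter <- lm(rating ~ age + gender + age:gender, data = ratings)  # age trend may differ by gender
noInter   <- lm(rating ~ age + gender, data = ratings)               # forces a common age trend;
                                                                     # biased for older raters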

On the other hand, we need to worry about sampling variance. Consider the case of opinion polls during an election campaign, in which the goal is to estimate p, the proportion of voters who will vote for Candidate Jones. If we use too small a sample size, say 50, our results will probably be inaccurate. This is due to sampling instability: Two pollsters, each randomly sampling 50 people, will sample different sets of people, thus each having different values of p̂, their sample estimates of p. For a sample of size 50, it is likely that their two values of p̂ will be substantially different from each other, whereas if the sample size were 5000, the two estimates would likely be close to each other.
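A quick simulation sketch (with the true p arbitrarily set to 0.52) shows the effect of sample size on the variability of p̂:

set.seed(1111)                               # arbitrary seed, for reproducibility
rbinom(2, size = 50, prob = 0.52) / 50       # two polls of size 50: often far apart
rbinom(2, size = 5000, prob = 0.52) / 5000   # two polls of size 5000: much closer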


The remedy is cross-validation. We randomly partition the data into a training set and a test set, fit our model on the training set only, and then pretend we do not know the “Y” values in the test set. We predict those values from our fitted model and the “X” values (i.e. the predictors) in the test set. We then “unpretend,” and check how well those predictions worked.

The test set is “fresh, new” data, since we called lm() or whatever only on the training set. Thus we are avoiding the “noise fitting” problem. We can try several candidate models — e.g. different sets of predictor variables or different numbers of nearest neighbors — then choose the one that best predicts the test data.

Since the training set/test set partitioning is random, we should perform the partitioning several times, thus assessing the performance of each of our candidate models several times, to see whether a clear pattern emerges.

(Note carefully that after fitting the model via cross-validation, we then use the full data for later prediction. Splitting the data for cross-validation was just a temporary device for model selection.)

Cross-validation is essentially the standard for model selection, and it works well if we only try a few models. Problems can occur if we try many models, as seen in the next section.
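Here is a minimal sketch of the procedure (not the author's code; the data frame dat, its response column y and the candidate formulas are hypothetical), using mean absolute prediction error to compare candidate models over several random splits:

xvalMAPE <- function(form, dat, nreps = 10, trainFrac = 0.8) {
   n <- nrow(dat)
   mape <- numeric(nreps)
   for (i in 1:nreps) {
      idx <- sample(n, round(trainFrac * n))        # random training/test split
      fit <- lm(form, data = dat[idx, ])            # fit on the training set only
      preds <- predict(fit, newdata = dat[-idx, ])  # predict the held-out test set
      mape[i] <- mean(abs(preds - dat$y[-idx]))     # mean absolute prediction error
   }
   mean(mape)
}

# compare candidate models, then refit the winner on the full data for actual use
xvalMAPE(y ~ x1, dat)
xvalMAPE(y ~ x1 + x2 + x1:x2, dat)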

3.4.3 The Problem of P-hacking

The (rather recent) term p-hacking refers to the following abuse of statistics.^13

Say we have 250 pennies, and we wish to determine whether any are unbalanced, i.e. have probability of heads different from 0.5. We do so by tossing each coin 100 times. If we get fewer than 40 heads or more than 60, we decide this coin is unbalanced.^14 The problem is that, even if all the coins are perfectly balanced, we eventually will have one that has fewer than 40 or greater than 60 heads, just by accident. We will then falsely declare this coin to be unbalanced.
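A small simulation sketch (not from the text) makes the point: even when all 250 coins are perfectly fair, some will land outside the 40-60 range just by chance.

set.seed(2022)                                  # arbitrary seed
heads <- rbinom(250, size = 100, prob = 0.5)    # 100 tosses for each of 250 fair coins
sum(heads < 40 | heads > 60)                    # number of coins falsely flagged as unbalanced

Typically several of the 250 coins get flagged, even though every one of them is fair.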

Or, to give a somewhat frivolous example that still will make the point, say we are investigating whether there is any genetic component to a person’s sense of humor. Is there a Humor gene? There are many, many genes to consider. Testing each one for relation to sense of humor is like checking each penny for being unbalanced: Even if there is no Humor gene, then eventually, just by accident, we’ll stumble upon one that seems to be related to humor.^15

Though the above is not about prediction, it has big implications for the prediction realm. In ML there are various datasets on which analysts engage in contests, vying for the honor of developing the model with the highest prediction accuracy, say for classification of images. If there is a large number of analysts competing for the prize, then even if all the analysts have models of equal accuracy, it is likely that one will score substantially higher than the others, just due to an accident of sampling variation. This is true in spite of the fact that they all are using the same sample; it may be that the “winning” analyst’s model happens to do especially well on the given data, and may not be so good on another sample from the same population. So, when some researcher sets a new record on a famous ML dataset, it may be that the researcher really has found a better prediction model, or it may be that the model merely looks better, due to p-hacking.

(^13) The term abuse here will not necessarily connote intent; it may occur out of ignorance of the problem.
(^14) For those who know statistics: this gives us a Type I error rate of about 0.05, the standard used by most people.
(^15) For those with background in statistics, the reason this is called “p-hacking” is that the researcher may form a significance test for each gene, computing a p-value for each test. Since under the null hypothesis we have a 5% chance of getting a “significant” p-value for any given gene, the probability of having at least one significant result out of the thousands of tests is quite high, even if the null hypothesis is true in all cases. There are techniques, called multiple inference or multiple comparison methods, for avoiding p-hacking when performing statistical inference. See for example Multiple Comparisons: Theory and Methods, Jason Hsu, 1996, CRC.

The same is true for your own analyses. If you try a large number of models, the “winning” one may actually not be better than all the others.

3.5 Extended Example

Let’s illustrate this on the dataset prgeng, assembled from the 2000 US census. It consists of wage and other information on programmers and engineers in Silicon Valley. This dataset is included in the R polyreg package, which fits polynomial models as we saw in Section 3.3.5.1 above.^16

getPE()   # produces data frame pe
pe1 <- pe[, c(1,2,4,6,7,12:16,3)]   # choose some predictors
head(pe1)
       age sex wkswrkd ms phd occ1 occ2 occ3 occ4 occ5
1 50.30082   0      52  0   0    0    0    1    0    0
2 41.10139   1      20  0   0    0    1    0    0    0
3 24.67374   0      52  0   0    0    0    1    0    0
4 50.19951   1      52  0   0    1    0    0    0    0
5 51.18112   0       1  0   0    1    0    0    0    0
6 57.70413   1       0  0   0    1    0    0    0    0
  wageinc
1   75000
2   12300
3   15400
4       0
5     160
6       0

By the way, note the dummy variables. We have just two levels for education, so anyone with just a bachelor’s degree or less, or a professional degree, will be “other,” coded by ms and phd both being 0.

(^16) Available from github.com/matloff.


[1] 23974.

Warning message:
In predict.lm(object$fit, plm.newdata) :
  prediction from a rank-deficient fit may be misleading

About the same. We may now be in the range in which sampling variation dominates small differences in predictive power. Keep in mind that we have sampling variation here both from the random training/test set splitting and from the fact that the full dataset should itself be considered a sample from a population.

Note the ominous warning. R found that the matrix A′A in (3.11) was close to being of non-full rank, thus nearly singular (noninvertible). Now p = 226.

This improvement continued until degree 7, when things got dramatically worse:

deg   MAPE       num. terms
5     24340.85    371
6     24554.34    551
7     36463.61    767
8     74296.09   1019

Under the “√n Rule of Thumb,” things would begin to deteriorate when the number of terms got past about 140, but we were able to go considerably further.
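For reference, that rule compares the number of terms p to √n; a quick check (assuming pe1 is the data frame loaded earlier) is:

sqrt(nrow(pe1))   # roughly 140 for this dataset, per the rule of thumb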

At degree 7, though, the perils of overfitting really caught up with us.