



glm(Cat ~ ., data = df[df$cat + df$fox == 1, c(1,3)], family = binomial)  # cat vs. fox
Then for each new animal we encounter of unknown species, we call predict() three times, again yielding three estimated conditional probabilities. Say in the first one, Cat “wins,” i.e. the conditional probability of Cat is greater than 0.5. Say Dog wins in the second, and Cat wins in the third. Since Cat had the most wins, we predict Cat.
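To make the voting concrete, here is a minimal sketch, using hypothetical fitted pairwise logit models glmCatDog, glmDogFox and glmCatFox, and a one-row data frame newx of predictor values for the new animal (these names are illustrative, not from the original code):

# each pairwise model estimates P(first-named species) within its two classes
pCatDog <- as.numeric(predict(glmCatDog, newx, type = 'response'))
pDogFox <- as.numeric(predict(glmDogFox, newx, type = 'response'))
pCatFox <- as.numeric(predict(glmCatFox, newx, type = 'response'))
# tally the pairwise "wins" for each species
votes <- c(cat = (pCatDog > 0.5) + (pCatFox > 0.5),
           dog = (pCatDog <= 0.5) + (pDogFox > 0.5),
           fox = (pDogFox <= 0.5) + (pCatFox <= 0.5))
names(which.max(votes))  # predict the species with the most wins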
Comparison
At first, OVA seems much better than AVA: with m levels, AVA means running C(m, 2) = O(m^2) pairwise logit models, rather than just m for OVA. However, this is somewhat compensated for by the fact that each pairwise model is based on less data, and some analysts contend that AVA can have better accuracy. It remains a bit of a controversy.
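For instance, with m = 20 classes, AVA requires C(20, 2) = 190 pairwise models, versus just 20 models for OVA:

choose(20, 2)  # 190 pairwise models under AVA, vs. 20 under OVA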
By far the most vexing issue in statistics and machine learning is that of overfitting.
Suppose we have just one predictor, and n data points. If we fit a polynomial model of degree n − 1, the resulting curve will pass through all n points, a “perfect” fit. For instance:
x <- rnorm(6)
y <- rnorm(6)  # unrelated to x!
df <- data.frame(x, y)
df$x2 <- x^2
df$x3 <- x^3
df$x4 <- x^4
df$x5 <- x^5
df[ , c('x','y')]
           x          y
1 -1.1855131  0.2881291
2 -1.7838769 -2.0741740
3 -0.7124510 -0.4253678
4  0.1676111 -0.1949265
5  1.2462926 -0.7348481
6  0.3741604  1.9521667
lmo <- lm(y ~ ., data = df)  # 6 coefficients fit to 6 data points
lmo$fitted.values
        1          2          3          4          5          6
0.2881291 -2.0741740 -0.4253678 -0.1949265 -0.7348481  1.9521667
y
[1]  0.2881291 -2.0741740 -0.4253678 -0.1949265 -0.7348481  1.9521667
Yes, we “predicted” y perfectly, even though there was no relation between the response and predictor variables. Clearly that “perfect fit” is illusory, mere “noise fitting.” Our ability to predict future cases would not be good. This is overfitting.
Let’s take a closer look, in an RS context. Say we believe (3.14) is a good model for the setting described in that section, i.e. men becoming more liberal raters as they age but women becoming more conservative. If we omit the interaction term, then we will underpredict older men and overpredict older women. This biases our ratings.
On the other hand, we need to worry about sampling variance. Consider the case of opinion polls during an election campaign, in which the goal is to estimate p, the proportion of voters who will vote for Candidate Jones. If we use too small a sample size, say 50, our results will probably be inaccurate. This is due to sampling instability: Two pollsters, each randomly sampling 50 people, will sample different sets of people, thus each having different values of p̂, their sample estimates of p. For a sample of size 50, it is likely that their two values of p̂ will be substantially different from each other, whereas if the sample size were 5000, the two estimates would likely be close to each other, and thus close to the true value of p.
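A quick simulation conveys the difference (illustrative only; the true p is taken to be 0.52 here):

set.seed(9999)
rbinom(2, 50, 0.52) / 50      # two polls of size 50: the estimates are typically far apart
rbinom(2, 5000, 0.52) / 5000  # two polls of size 5000: the estimates are typically very close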
The standard remedy is cross-validation. We randomly partition the data into a training set and a test set, fit the candidate model only to the training set, and then temporarily “pretend” we do not know the “Y” values (i.e. the response) in the test set, predicting those values from our fitted model and the “X” values (i.e. the predictors) in the test set. We then “unpretend,” and check how well those predictions worked.
The test set is “fresh, new” data, since we called lm() or whatever only on the training set. Thus we are avoiding the “noise fitting” problem. We can try several candidate models — e.g. different sets of predictor variables or different numbers of nearest neighbors — then choose the one that best predicts the test data.
Since the training set/test set partitioning is random, we should perform the partitioning several times, thus assessing the performance of each of our candidate models several times, to see whether a clear pattern emerges.
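For concreteness, here is a minimal sketch of one such random split, for a hypothetical data frame dat whose response column is y (the names are illustrative):

set.seed(9999)
n <- nrow(dat)
testIdxs <- sample(1:n, round(0.2 * n))  # hold out 20% of the rows as the test set
trainSet <- dat[-testIdxs, ]
testSet <- dat[testIdxs, ]
lmout <- lm(y ~ ., data = trainSet)      # fit only on the training set
preds <- predict(lmout, testSet)         # predict the held-out cases
mean(abs(preds - testSet$y))             # mean absolute prediction error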
(Note carefully that after fitting the model via cross-validation, we then use the full data for later prediction. Splitting the data for cross-validation was just a temporary device for model selection.)
Cross-validation is essentially the standard for model selection, and it works well if we only try a few models. Problems can occur if we try many models, as seen in the next section.
The (rather recent) term p-hacking refers to the following abuse of statistics.^13
Say we have 250 pennies, and we wish to determine whether any are unbalanced, i.e. have probability of heads different from 0.5. We do so by tossing each coin 100 times. If we get fewer than 40 heads or more than 60, we decide this coin is unbalanced.^14 The problem is that, even if all the coins are perfectly balanced, we eventually will have one that has fewer than 40 or greater than 60 heads, just by accident. We will then falsely declare this coin to be unbalanced.
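The chance of at least one such false finding is easy to compute (an illustrative calculation, based on the per-coin criterion above):

alpha <- pbinom(39, 100, 0.5) + (1 - pbinom(60, 100, 0.5))  # P(fewer than 40 or more than 60 heads) for a fair coin
1 - (1 - alpha)^250  # P(at least one false "unbalanced" verdict among 250 fair coins); essentially 1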
Or, to give a somewhat frivolous example that still will make the point, say we are investigating whether there is any genetic component to a person’s sense of humor. Is there a Humor gene? There are many, many genes to consider. Testing each one for relation to sense of humor is like checking each penny for being unbalanced: Even if there is no Humor gene, then eventually, just by accident, we’ll stumble upon one that seems to be related to humor.^15
Though the above is not about prediction, it has big implications for the prediction realm. In ML there are various datasets on which analysts engage in contests, vying for the honor of developing the model with the highest prediction accuracy, say for classification of images. If there is a large number of analysts competing for the prize, then even if all the analysts have models of equal accuracy, it is likely that one will appear substantially better than the others, just due to an accident of sampling variation. This is true in spite of the fact that they all are using the same sample; it may be that the “winning” analyst’s model happens to do especially well on the given data, and may not be so good on another sample from the same population. So, when some researcher sets a new record on a famous ML dataset, it may be that the researcher really has found a better prediction model — or it may be that it merely looks better, due to p-hacking.

(^13) The term abuse here does not necessarily connote intent. It may occur out of ignorance of the problem.
(^14) For those who know statistics: This gives us a Type I error rate of about 0.05, the standard used by most people.
(^15) For those with background in statistics, the reason this is called “p-hacking” is that the researcher may form a significance test for each gene, computing a p-value for each test. Since under the null hypothesis we have a 5% chance of getting a “significant” p-value for any given gene, the probability of having at least one significant result out of the thousands of tests is quite high, even if the null hypothesis is true in all cases. There are techniques, called multiple inference or multiple comparison methods, to avoid p-hacking in performing statistical inference. See for example Multiple Comparisons: Theory and Methods, Jason Hsu, 1996, CRC.
The same is true for your own analyses. If you try a large number of models, the “winning” one may actually not be better than all the others.
Let’s illustrate this on the dataset prgeng, assembled from the 2000 US census. It consists of wage and other information on programmers and engineers in Silicon Valley. This dataset is included in the R polyreg package, which fits polynomial models as we saw in Section 3.3.5.1 above.^16
getPE()  # produces data frame pe
pe1 <- pe[ , c(1,2,4,6,7,12:16,3)]  # choose some predictors
head(pe1)
       age sex wkswrkd ms phd occ1 occ2 occ3 occ4 occ5 wageinc
1 50.30082   0      52  0   0    0    0    1    0    0   75000
2 41.10139   1      20  0   0    0    1    0    0    0   12300
3 24.67374   0      52  0   0    0    0    1    0    0   15400
4 50.19951   1      52  0   0    1    0    0    0    0       0
5 51.18112   0       1  0   0    1    0    0    0    0     160
6 57.70413   1       0  0   0    1    0    0    0    0       0
By the way, note the dummy variables. We have just two levels for education, so anyone with just a bachelor’s degree or less, or a professional degree, will be “other,” coded by ms and phd both being 0.
(^16) Available from github.com/matloff.
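The MAPE (mean absolute prediction error) figures discussed below come from repeated training/test runs at each polynomial degree. Here is a rough sketch of one such run, assuming polyreg’s polyFit(), which expects the response in the last column of its input data frame, and its associated predict() method:

library(polyreg)
set.seed(9999)
n <- nrow(pe1)
testIdxs <- sample(1:n, 1000)             # random holdout set
trainSet <- pe1[-testIdxs, ]
testSet <- pe1[testIdxs, ]
pfout <- polyFit(trainSet, deg = 2)       # fit a degree-2 polynomial model
preds <- predict(pfout, testSet[ , -11])  # predict wageinc for the held-out cases
mean(abs(preds - testSet$wageinc))        # MAPE at this degree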
Warning message:
In predict.lm(object$fit, plm.newdata) :
  prediction from a rank-deficient fit may be misleading
About the same. We may now be in the range in which sampling variation dominates small differences in predictive power. Keep in mind that we have sampling variation here both from the random training/test set splitting and from the fact that the full dataset should itself be regarded as a sample from a population.
Note the ominous warning. R found that the matrix A′A in (3.11) was close to nonfull-rank, thus nearly singular (noninvertible). Now p = 226.
This improvement continued until degree 7, when things got dramatically worse:
deg      MAPE  num. terms
  5  24340.85         371
  6  24554.34         551
  7  36463.61         767
  8  74296.09        1019
Under the “√n Rule of Thumb,” things would begin to deteriorate when the number of terms got past about 140, but we were able to go considerably further.
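The threshold comes from the size of the dataset:

sqrt(nrow(pe1))  # the rule-of-thumb limit on the number of terms, about 140 here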
At degree 7, though, the perils of overfitting really caught up with us.