Statistics Consulting Cheat Sheet
October 1, Kris Sankaran

Contents

- 1 What this guide is for
- 2 Hypothesis testing
- 2.1 (One-sample, Two-sample, and Paired) t-tests
- 2.2 Difference in proportions
- 2.3 Contingency tables
- 2.3.1 χ^2 tests
- 2.3.2 Fisher’s Exact test
- 2.3.3 Cochran-Mantel-Haenszel test
- 2.3.4 McNemar’s test
- 2.4 Nonparametric tests
- 2.4.1 Rank-based tests
- 2.4.2 Permutation tests
- 2.4.3 Bootstrap tests
- 2.4.4 Kolmogorov-Smirnov
- 2.5 Power analysis
- 2.5.1 Analytical
- 2.5.2 Computational
- 3 Elementary estimation
- 3.1 Classical confidence intervals
- 3.2 Bootstrap confidence intervals
- 4 (Generalized) Linear Models
- 4.1 Linear regression
- 4.2 Diagnostics
- 4.3 Logistic regression
- 4.4 Poisson regression
- 4.5 Pseudo-Poisson and Negative Binomial regression
- 4.6 Loglinear models
- 4.7 Multinomial regression
- 4.8 Ordinal regression
- 5 Inference in linear models (and other more complex settings)
- 5.1 (Generalized) Linear Models and ANOVA
- 5.2 Multiple testing
- 5.2.1 Alternative error metrics
- 5.2.2 Procedures
- 5.3 Causality
- 5.3.1 Propensity score matching
- 6 Regression variants
- 6.1 Random effects and hierarchical models
- 6.2 Curve-fitting
- 6.2.1 Kernel-based
- 6.2.2 Splines
- 6.3 Regularization
- 6.3.1 Ridge, Lasso, and Elastic Net
- 6.3.2 Structured regularization
- 6.4 Time series models
- 6.4.1 ARMA models
- 6.4.2 Hidden Markov Models
- 6.4.3 State-space models
- 6.5 Spatiotemporal models
- 6.6 Survival analysis
- 7 Model selection
- 7.1 AIC / BIC
- 7.2 Stepwise selection
- 7.3 Lasso
- 8 Unsupervised methods
- 8.1 Clustering
- 8.2 Low-dimensional representations
- 8.2.1 Principal Components Analysis
- 8.2.2 Factor analysis
- 8.2.3 Distance based methods
- 8.3 Networks
- 8.4 Mixture modeling
- 9 Data preparation
- 9.1 Missing data
- 9.2 Transformations
- 9.3 Reshaping
There are two kinds of errors we can make: (1) accidentally falsify the null hypothesis when it is actually true (a false positive / type I error), and (2) fail to falsify it when it is actually false (a false negative / type II error).
For this analysis paradigm to work, a few points are necessary.
- We need to be able to articulate the sampling behavior of the system under the null hypothesis.
- We need to be able to quantitatively measure discrepancies from the null. Ideally we would be able to measure these discrepancies in a way that makes as few errors as possible – this is the motivation behind optimality theory.
While testing is fundamental to much of science, and to a lot of our work as consultants, there are some limitations we should always keep in mind,
- Often, describing the null can be complicated by particular structure present within a problem (e.g., the need to control for values of other variables). This motivates inference through modeling, which is reviewed below.
- Practical significance is not the same as statistical significance. A p-value should never be the final goal of a statistical analysis – it should be used to complement figures / confidence intervals / follow-up analysis^1 that provide a sense of the effect size.
2.1 (One-sample, Two-sample, and Paired) t-tests
If I had to make a bet for which test was used the most on any given day, I'd bet it's the t-test. There are actually several variations, which are used to interrogate different null hypotheses, but the statistic that is used to test the null is similar across scenarios.
- The one-sample t-test is used to measure whether the mean of a sample is far from a preconceived population mean.
- The two-sample t-test is used to measure whether the difference in sample means between two groups is large enough to substantiate a rejection of the null hypothesis that the population means are the same across the two groups.
What needs to be true for these t-tests to be valid?
- Sampling needs to be independent and identically distributed (i.i.d.), and in the two-sample setting, the two groups need to be independent of each other. If this is not the case, you can try pairing or developing richer models, see below.

(^1) E.g., studying contributions from individual terms in a χ^2 test.
Figure 1: Pairing makes it possible to see the effect of treatment in this toy example. The points represent a value for patients (say, white blood cell count) measured at the beginning and end of an experiment. In general, the treatment leads to increases in counts on a per-person basis. However, the inter-individual variation is very large – looking at the difference between before and after without the lines joining pairs, we wouldn't think there is much of a difference. Pairing makes sure the effect of the treatment is not swamped by the variation between people, by controlling for each person's white blood cell count at baseline.
- In the two-sample case, depending on the sample sizes and population variances within groups, you would need to use different estimates of the standard error.
- If the sample size is large enough, we don't need to assume normality in the population(s) under investigation. This is because the central limit theorem kicks in and makes the sample means approximately normal. In the small-sample setting, however, you would need normality of the raw data for the t-test to be appropriate. Otherwise, you should use a nonparametric test, see below.
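As a point of reference, here is a minimal sketch of the three t-test variants using scipy on simulated data; the variable names, sample sizes, and effect sizes are invented for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=5.0, scale=2.0, size=30)          # e.g., baseline white blood cell counts
after = before + rng.normal(loc=0.5, scale=0.5, size=30)  # small per-person shift after treatment

# One-sample: is the mean of `after` far from a preconceived population mean of 5?
t1, p1 = stats.ttest_1samp(after, popmean=5.0)

# Two-sample (Welch's version, which does not assume equal variances across groups)
t2, p2 = stats.ttest_ind(before, after, equal_var=False)

# Paired: tests whether the per-person differences are centered around zero
t3, p3 = stats.ttest_rel(before, after)

print(p1, p2, p3)
```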
Pairing is a useful device for making the t-test applicable in a setting where individual level variation would otherwise dominate effects coming from treatment vs. control. See Figure 1 for a toy example of this behavior.
- Instead of testing the difference in means between two groups, test for whether the per-individual differences are centered around zero.
- For example, in Darwin's Zea Mays data, a treatment and control plant are put in each pot. Since there might be a pot-level effect in the growth of the plants, it's better to look at the per-pot difference (the differences are i.i.d.).
Pairing is related to a few other common statistical ideas,
        A1     A2     total
B1      n11    n12    n1.
B2      n21    n22    n2.
total   n.1    n.2    n..

Table 1: The basic representation of a 2 × 2 contingency table.
- Odds-Ratio: This is (p11 p22) / (p12 p21). It's referred to in many tests, but I find it useful to transform back to relative risk whenever a result is stated in terms of odds ratios.
- A cancer study
- Effectiveness of venom vaccine
- Comparing subcategories and time series
- Family communication of genetic disease
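To make the odds ratio / relative risk distinction concrete, here is a minimal sketch with numpy; the counts are hypothetical and arranged as in Table 1.

```python
import numpy as np

# Hypothetical counts arranged as in Table 1: rows are B1/B2, columns are A1/A2
table = np.array([[30, 70],
                  [15, 85]])

n11, n12 = table[0]
n21, n22 = table[1]

odds_ratio = (n11 * n22) / (n12 * n21)
# relative risk of A1, comparing row B1 to row B2
relative_risk = (n11 / (n11 + n12)) / (n21 / (n21 + n22))
print(odds_ratio, relative_risk)
```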
2.3.1 χ^2 tests
The χ^2 test is often used to study whether or not two categorical variables in a contingency table are related. More formally, it assesses the plausibility of the null hypothesis of independence,
H0: p_{ij} = p_{i+} p_{+j}
The two most common statistics used to evaluate discrepancies are the Pearson and likelihood-ratio χ^2 statistics, which measure the deviation from the expected counts under the null,
- Pearson: Look at the squared absolute difference between the observed and expected counts, using

  Σ_{i,j} (n_{ij} − μ̂_{ij})^2 / μ̂_{ij}

- Likelihood-ratio: Look at the logged relative difference between the observed and expected counts, using

  2 Σ_{i,j} n_{ij} log(n_{ij} / μ̂_{ij})
Under the null hypothesis, and assuming large enough sample sizes, these are both χ^2 distributed, with degrees of freedom determined by the number of levels in each categorical variable. A useful follow-up step when the null is rejected is to see which cell(s) contributed the most to the χ^2 statistic. These are sometimes called Pearson residuals.
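A minimal sketch with scipy, assuming a hypothetical 2 × 2 table of counts; the Pearson residuals at the end show which cells drive the statistic.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows and columns index the two categorical variables
table = np.array([[30, 70],
                  [15, 85]])

# correction=False gives the plain Pearson statistic written above
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_residuals = (table - expected) / np.sqrt(expected)
print(chi2, p, dof)
print(pearson_residuals)
```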
2.3.2 Fisher’s Exact test
Fisher’s Exact test is an alternative to the χ^2 test that is useful when the counts within the contingency table are small and the χ^2 approximation is not necessarily reliable.
- It tests the same null hypothesis of independence as the χ^2 -test
- Under that null, and assuming a binomial sampling mechanism (conditioning on the row and column totals), the count in the top-left cell can be shown to follow a hypergeometric distribution (and this cell determines the counts in all other cells).
- This can be used to determine the probability of seeing tables with as much or more extreme departures from independence.
- There is a generalization to I × J tables, based on the multiple hypergeometric distribution.
- The most famous example used to explain this test is the Lady Tasting Tea.
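A minimal sketch with scipy, using a hypothetical small-count table where the χ^2 approximation would be shaky; scipy also reports the sample odds ratio.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 2 table with small counts
table = np.array([[3, 1],
                  [1, 3]])

odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```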
2.3.3 Cochran-Mantel-Haenszel test

The Cochran-Mantel-Haenszel test is a variant of the exact test that applies when samples have been stratified across K groups, yielding K separate 2 × 2 contingency tables^3.
- The null hypothesis to which this test applies is that, in each of the K strata, there is no association between rows and columns.
- The test statistic consists of pooling deviations from expected counts across all K strata, where the expected counts are defined conditional on the margins (they are the means and variances under a hypergeometric distribution),

  ( Σ_{k=1}^{K} (n_{11k} − E[n_{11k}]) )^2 / Σ_{k=1}^{K} Var(n_{11k})
Some related past problems,
- Mantel-Haenszel χ^2 test

(^3) These are sometimes called partial tables.
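A minimal sketch of this pooled statistic with numpy, using two hypothetical strata; it is compared against a χ^2 distribution with one degree of freedom, and no continuity correction is applied.

```python
import numpy as np
from scipy import stats

# Hypothetical stratified data: K = 2 strata, each a 2 x 2 table [[n11, n12], [n21, n22]]
strata = [np.array([[10, 20], [15, 15]]),
          np.array([[8, 12], [20, 10]])]

numerator, denominator = 0.0, 0.0
for tab in strata:
    n = tab.sum()
    row1, col1 = tab[0].sum(), tab[:, 0].sum()
    expected_11 = row1 * col1 / n                                    # E[n_11k] given the margins
    var_11 = row1 * (n - row1) * col1 * (n - col1) / (n ** 2 * (n - 1))
    numerator += tab[0, 0] - expected_11
    denominator += var_11

cmh = numerator ** 2 / denominator
p_value = stats.chi2.sf(cmh, df=1)
print(cmh, p_value)
```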
H0: P(X > Y) = P(Y > X),
which is a strictly stronger condition than equality in the means of X and Y. For this reason, care needs to be taken to interpret a rejection in this and other rank-based tests – the rejection could have been due to any difference in the distributions (for example, in the variances), and not just a difference in the means.
- The procedure does the following: (1) combine the two groups of data into one pooled set, (2) rank the elements in this pooled set, and (3) see whether the ranks in one group are systematically larger than in the other. If there is such a discrepancy, reject the null hypothesis.
- Sign test
- This is an alternative to the paired t-test, when data are paired between the two groups (think of a change-from-baseline experiment).
- The null hypothesis is that the differences between paired measurements are symmetrically distributed around 0.
- The procedure first computes the sign of the difference between all pairs. It then computes the number of times a positive sign occurs and compares it with a Bin(n, 1/2) distribution, which is how we'd expect this quantity to be distributed under the null hypothesis.
- Since this test only requires a measurement of the sign of the difference between pairs, it can be applied in settings where there is no numerical data (for example, data in a survey might consist of “likes” and “dislikes” before and after a treatment).
- Signed-rank test
- In the case that it is possible to measure the size of the difference between pairs (not just their sign), it is often possible to improve the power of the sign test, using the signed-rank test.
- Instead of simply calculating the sign of the difference between pairs, we compute a measure of the size of the difference between pairs. For example, in numerical data, we could just use |x_i^after − x_i^before|.
- At this point, order the difference scores from largest to smallest, and see whether one group is systematically overrepresented among the larger scores^5. In this case, reject the null hypothesis.
(^5) The threshold is typically tabulated, or a more generally applicable normal approximation can be applied.

- Evaluating results of a training program
- Nonparametric tests for mean / variance
- t-test vs. Mann-Whitney
- Trial comparison for walking and stopping
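A minimal sketch of the rank-based tests above with scipy on simulated data (the group names, sizes, and effect sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=40)
group_b = rng.normal(0.5, 1.0, size=40)

# Rank-sum (Mann-Whitney) test for two independent groups
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Paired setting: before/after measurements on the same units
before = rng.normal(5.0, 1.0, size=30)
after = before + rng.normal(0.3, 0.5, size=30)

# Signed-rank test on the per-pair differences
w_stat, p_wilcoxon = stats.wilcoxon(after - before)

# Sign test: compare the number of positive differences to Bin(n, 1/2)
n_positive = int(np.sum(after - before > 0))
p_sign = stats.binomtest(n_positive, n=len(before), p=0.5).pvalue

print(p_mw, p_wilcoxon, p_sign)
```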
2.4.2 Permutation tests
Permutation tests are a kind of computationally intensive test that can be used quite generally. The typical setting in which it applies has two groups between which we believe there is some difference. The way we measure this difference might be more complicated than a simple difference in means, so no closed-form distribution under the null may be available. The basic idea of the permutation test is that we can randomly create artificial groups in the data, so there will be no systematic differences between the groups. Then, computing the statistic on these many artificial sets of data gives us an approximation to the null distribution of that statistic. Comparing the value of that statistic on the observed data with this approximate null can give us a p-value. See Figure 2 for a representation of this idea.
- More formally, the null hypothesis tested by permutation tests is that the group labels are exchangeable in the formal statistical sense^6
- For the same reason that caution needs to be exercised when interpreting rejections in the Mann-Whitney test, it's important to be aware that a permutation test can reject the null for reasons other than a simple difference in means.
- The statistic you use in a permutation test can be whatever you want, and the test will be valid. Of course, the power of the test will depend crucially on whether the statistic is tailored to the type of departures from the null which actually exist.
- The permutation p-value of a test statistic is obtained by making a histogram of the statistic under all the different relabelings, placing the observed value of the statistic on that histogram, and looking at the fraction of the histogram which is more extreme than the value calculated from the real data. See Figure 2.
- A famous application of this method is to Darwin's Zea Mays data^7. In this experiment, Darwin planted Zea Mays that had been treated in two different ways (self vs. cross-fertilized). In each pot, he planted two of each plant, and he made sure to put one of each type in each pot, to control for potential pot-level effects. He then looked to see how high the plants grew. The test statistic was then the standardized difference in means, and this was computed many times after randomly relabeling the treatment assignments.

(^6) The distribution is invariant under permutations.
(^7) R.A. Fisher also used this dataset to explain the paired t-test.
Figure 3: To compute a p-value for a permutation test, refer to the permutation null distribution. Here, the histogram provides the value of the test statistic under many permutations of the group labeling – this approximates how the test statistic is distributed under the null hypothesis. The value of the test statistic in the observed data is the vertical bar. The fraction of the area of this histogram that has a more extreme value of this statistic is the p-value, and it exactly corresponds to the usual interpretation of p-values as the probability under the null that I observe a test statistic that is as or more extreme.
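To make the relabeling idea concrete, here is a minimal sketch of a permutation test for a difference in means with numpy; the data and number of permutations are invented for illustration, and any other statistic could be swapped into `test_statistic`.

```python
import numpy as np

rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, size=25)
group_b = rng.normal(0.7, 1.0, size=25)

def test_statistic(x, y):
    # any statistic can be used; here, the difference in means
    return np.mean(x) - np.mean(y)

observed = test_statistic(group_a, group_b)
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)            # randomly relabel the groups
    perm_stats[i] = test_statistic(shuffled[:n_a], shuffled[n_a:])

# two-sided p-value: fraction of relabelings at least as extreme as the observed statistic
p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
print(observed, p_value)
```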
making a comparison of the real data to this approximate null distribution, as in permutation tests.
- As in permutation tests, this procedure will always be valid, but will only be powerful if the test-statistic is attuned to the actual structure of departures from the null.
- The trickiest part of this approach is typically describing an appropriate scheme for sampling under the null. This means we need to estimate F̂_0 among a class of CDFs F′ consistent with the null hypothesis.
- For example, in a two-sample difference of means testing situation, to sample from the null, we center each group by subtracting away the mean, so that H 0 actually holds, and then we simulate new data by sampling with replacement from this pooled, centered histogram.
2.4.4 Kolmogorov-Smirnov
The Kolmogorov-Smirnov (KS) test is a test for either (1) comparing two groups of real-valued measurements or (2) evaluating the goodness-of-fit of a collection of real-valued data to a prespecified reference distribution.
- In its two-sample variant, the empirical CDFs (ECDFs) for each group are calculated. The discrepancy is measured by the largest absolute gap between the two ECDFs. This is visually represented in Figure 4.
- The distribution of this gap under the null hypothesis that the two groups have the same ECDF was calculated using an asymptotic approximation, and this is used to provide p-values.
Figure 4: The motivating idea of the two-sample and goodness-of-fit variants of the KS test. In the 2-sample test variant, the two colors represent two different empirical CDFs. The largest vertical gap between these CDFs is labeled by the black bar, and this defines the KS statistic. Under the null that the two groups have the same CDF, this statistic has a known distribution, which is used in the test. In the goodness-of-fit variant, the pink line now represents the true CDF for the reference population. This test sees whether the observed empirical CDF (blue line) is consistent with this reference CDF, again by measuring the largest gap between the pair.
- In the goodness-of-fit variant, all that changes is that one of the ECDFs is replaced with the known CDF for the reference distribution.
- Reference for KS test
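A minimal sketch of both KS variants with scipy on simulated data (the samples and reference distribution are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample_a = rng.normal(0.0, 1.0, size=100)
sample_b = rng.normal(0.0, 1.5, size=100)

# Two-sample KS test: largest gap between the two empirical CDFs
ks_2s, p_2s = stats.ks_2samp(sample_a, sample_b)

# Goodness-of-fit variant: compare one sample against a reference CDF (standard normal here)
ks_gof, p_gof = stats.kstest(sample_a, "norm")
print(p_2s, p_gof)
```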
2.5 Power analysis
Before performing an experiment, it is important to get a rough sense of how many samples will need to be collected in order for the follow-up analysis to have a chance at detecting the phenomena of interest. This general exercise is called a power analysis, and it often comes up in consulting sessions because many grant agencies require that a power analysis be conducted before agreeing to provide funding.
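A minimal analytical sketch, assuming a two-sample t-test design and using statsmodels; the effect size, power, and significance level below are placeholders a client would need to justify. A computational version would instead simulate data at candidate sample sizes and record how often the planned test rejects.

```python
from statsmodels.stats.power import TTestIndPower

# How many samples per group are needed to detect a standardized effect size of 0.5
# with 80% power at a 5% significance level?
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   alternative="two-sided")
print(n_per_group)   # roughly 64 per group under these assumptions
```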
3 Elementary estimation
While testing declares that a parameter θ cannot plausibly lie in the subset H0 of parameters for which the null is true, estimation provides an estimate θ̂ of θ, typically accompanied by a confidence interval, which summarizes the precision of that estimate. In most multivariate situations, estimation requires the technology of linear modeling. However, in simpler settings, it is often possible to construct a confidence interval based on parametric theory or the bootstrap.
3.1 Classical confidence intervals
Classical confidence intervals are based on rich parametric theory. Formally, a confidence interval for a parameter θ is a random (data-dependent) interval [L(X), U(X)] such that, under data X generated with parameter θ, that interval would contain θ some prespecified percentage of the time (usually 90, 95, or 99%).
- The most common confidence interval is based on the Central Limit Theorem. Suppose data x_1, ..., x_n are sampled i.i.d. from a distribution with mean θ and variance σ^2. Then the fact that

  √n (x̄_n − θ) ≈ N(0, σ^2)

  for large n means that

  [ x̄_n − z_{1−α/2} σ/√n , x̄_n + z_{1−α/2} σ/√n ],

  where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution, is a (1 − α)% confidence interval for θ.
- Since proportions can be thought of as averages of indicator variables (1 if present, 0 if not), which have Bernoulli means p and variances p(1 − p), the same reasoning gives confidence intervals for proportions.
- For the same reason that we might prefer a t-test to a z-test^9 , we may sometimes prefer using a t-quantile instead.
- If a confidence interval is known for a parameter, then it's easy to construct an approximate interval for any smooth function of that parameter using the delta method. For example, this is commonly used to calculate confidence intervals for log-odds ratios.
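A minimal sketch of the normal- and t-quantile intervals for a mean, on simulated data (the sample and coverage level are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(10.0, 3.0, size=50)

alpha = 0.05
xbar = np.mean(x)
se = np.std(x, ddof=1) / np.sqrt(len(x))

# Normal-quantile interval (appropriate for large n)
z = stats.norm.ppf(1 - alpha / 2)
ci_normal = (xbar - z * se, xbar + z * se)

# t-quantile interval, preferred when n is small
t = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)
ci_t = (xbar - t * se, xbar + t * se)
print(ci_normal, ci_t)
```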
3.2 Bootstrap confidence intervals
There are situations for which the central limit theorem might not apply and no theoretical analysis can provide a valid confidence interval. For example, we might have defined a parameter of interest that is not a simple function of means of the data. In many of these cases, it may nonetheless be possible to use the bootstrap.
(^9) Sample sizes too small to put faith in the central limit theorem.
- The main idea of the bootstrap is the “plug-in principle.” Suppose our goal is to calculate the variance of some statistic t(X_{1:n}) under the true sampling distribution F for the X_i. That is, we want to know^10 Var_F(θ̂(X_{1:n})), but this is unknown, since we don't actually know F. However, we can plug in F̂ for F, which gives Var_F̂(θ̂(X_{1:n})). This too is unknown, but we can sample from F̂ to approximate it – the more samples from F̂ we make, the better our estimate of Var_F̂(θ̂(X_{1:n})). This pair of approximations (plugging in F̂ for F, and then simulating to approximate the variance under F̂) gives a usable approximation v̂(θ̂) of Var_F(θ̂(X_{1:n})). The square root of this quantity is usually called the bootstrap estimate of standard error.

(^10) We usually want this so we can calculate a confidence interval, see the next bullet.
- The bootstrap estimate of standard error can be used to construct confidence intervals,

  [ θ̂(X_{1:n}) − z_{1−α/2} √v̂(θ̂) , θ̂(X_{1:n}) + z_{1−α/2} √v̂(θ̂) ]
- Since sampling from Fˆ is the same as sampling from the original data with replacement, the bootstrap is often explained in terms of resampling the original data.
- A variant of the above procedure skips the calculation of a variance estimator and instead simply reports the upper and lower α percentiles of the samples of θ̂(X*_{1:n}) for X*_i ∼ F̂. This is sometimes called the percentile bootstrap, and it is the one more commonly encountered in practice.
- In consulting situations, the bootstrap gives very general flexibility in defining statistics on which to do inference – you can do inference on parameters that might be motivated by specific statistical structure or domain knowledge.
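A minimal sketch of both the standard-error and percentile bootstrap intervals with numpy; the data, statistic (the median), and number of resamples are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)

def statistic(sample):
    # any statistic works; here, the median (no simple closed-form CI)
    return np.median(sample)

n_boot = 5_000
boot_stats = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(x, size=len(x), replace=True)   # sampling from F-hat
    boot_stats[b] = statistic(resample)

se_boot = np.std(boot_stats, ddof=1)                      # bootstrap standard error
ci_normal = (statistic(x) - 1.96 * se_boot, statistic(x) + 1.96 * se_boot)
ci_percentile = tuple(np.percentile(boot_stats, [2.5, 97.5]))  # percentile bootstrap
print(ci_normal, ci_percentile)
```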
4 (Generalized) Linear Models
Linear models provide the basis for most inference in multivariate settings. We won't even begin to try to cover this topic comprehensively – there are entire course sequences that only cover linear models. But we'll try to highlight the main regression-related ideas that are useful to know during consulting. This section is focused more on the big picture of linear regression and when we might want to use it in consulting. We defer a discussion of inference in linear models to Section 5.
When is linear regression useful in consulting?
- In a consulting setting, regression is useful for understanding the association between two variables, controlling for many others. This is basically a rephrasing of point (2) above, but it's the essential interpretation of linear regression coefficients, and it's this interpretation that many research studies are going after.
- Sometimes a client might originally come with a testing problem, but might want help extending it to account for additional structure or covariates. In this setting, it can often be useful to propose a linear model instead: it still allows inference, but it becomes much easier to encode more complex structure.
What are some common regression tricks useful in consulting?
- Adding interactions: People will often ask about adding interactions in their regression, but usually from an intuition about the non-quantitative meaning of the word “interaction.” It's important to clarify the quantitative meaning: including an interaction term between X_1 and X_2 in a regression of these variables onto Y means that the slope of the relationship between X_1 and Y will be different depending on the value of X_2. For example, if X_2 can only take on two values (say, A and B), then the relationship between X_1 and Y will be linear with slope β_{1A} in the case that X_2 is A and β_{1B} otherwise^11. When X_2 is continuous, then there is a continuum of slopes depending on the value of X_2: β_1 + β_{1×2} X_2. See Figure 5 for a visual interpretation of interactions, and the sketch after this list for how this looks in code.
- Introducing basis functions: The linearity assumption is not as restrictive as it seems, if you can cleverly apply basis functions. Basis functions are functions like polynomials (or splines, or wavelets, or trees...) which you can mix together to approximate more complicated functions (see Figure). Linear mixing can be done with linear models. To see why this is potentially useful, suppose we want to use time as a predictor in a model (e.g., where Y is the number of species j present in the sample), but that species population doesn't just increase or decrease linearly over time (instead, it's some smooth curve). Here, you can introduce a spline basis associated with time and then use a linear regression of the response onto these basis functions. The fitted coefficients will define a mean function relating time and the response. A simple polynomial version appears in the sketch after this list.
- Derived features: Related to the construction of basis functions, it's often possible to enrich a linear model by deriving new features that you imagine might be related to Y. The fact that you can do regression onto variables that aren't just the ones that were collected originally might not be obvious to your client. For example, if you were trying to predict whether someone will have a disease^12 based on time series of some lab tests, you can construct new variables corresponding to the “slope at the beginning,” or “slope at the end,” or max, or min, ... across the time series. Of course, deciding which variables might actually be relevant for the regression will depend on domain knowledge.

(^11) In terms of the regression coefficients β_1 for the main effect of X_1 on Y and β_{1×2} for the interaction between X_1 and X_2, this is expressed as β_{1A} = β_1 and β_{1B} = β_1 + β_{1×2}.

Figure 5: In the simplest setting, an interaction between a continuous and a binary variable leads to two different slopes for the continuous variable. Here, we are showing the scatterplot of (x_{i1}, y_i) pairs observed in the data. We suppose there is a binary variable that has also been measured, denoted x_{i2}, and we shade in each point according to its value for this binary variable. Apparently, the relationship between x_1 and y depends on the value of x_2 (when in the pink group, the slope is smaller). This can exactly be captured by introducing an interaction term between x_1 and x_2. In cases where x_2 is not binary, we would have a continuum of slopes between x_1 and y – one for each value of x_2.
- One trick – introducing random effects – is so common that it gets its own section. Basically, it's useful whenever you have a lot of levels for a particular categorical variable.
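A minimal sketch of the interaction and basis-function tricks using the statsmodels formula interface; the data frame, variable names, and the polynomial basis (standing in for splines) are all invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.choice(["A", "B"], size=n),   # a binary covariate
    "t": rng.uniform(0, 10, size=n),        # e.g., time
})
# simulated response: a different x1 slope in each x2 group, plus a smooth trend in t
df["y"] = (1.0 + np.where(df["x2"] == "A", 0.5, 2.0) * df["x1"]
           + np.sin(df["t"]) + rng.normal(scale=0.5, size=n))

# Interaction: the slope of x1 is allowed to differ across levels of x2
fit_interaction = smf.ols("y ~ x1 * x2", data=df).fit()

# Basis functions: a polynomial basis in t lets a linear model capture a nonlinear trend
fit_basis = smf.ols("y ~ x1 + t + I(t**2) + I(t**3)", data=df).fit()

print(fit_interaction.params)
print(fit_basis.params)
```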
Some examples where regression was used in past sessions,
- Family communication genetic disease
- Stereotype threat in virtual reality
- Fish gonad regression
- UV exposure and birth weight
- Land use and forest cover
- Molecular & cellular physiology

(^12) In this case, the response is binary, so you would probably use logistic regression, but the basic idea of derived variables should still be clear.