








Odds ratios are an alternative way to think about probabilities. The odds in favor of an event with probability p are p/(1 − p). The odds ratio in favor of an ...
Bret Hanlon and Bret Larget
Department of Statistics University of Wisconsin—Madison
October 4–6, 2011
Contingency Tables 1 / 56
Case Study Example 9.3 beginning on page 213 of the text describes an experiment in which fish are placed in a large tank for a period of time and some are eaten by large birds of prey. The fish are categorized by their level of parasitic infection, either uninfected, lightly infected, or highly infected. It is to the parasites’ advantage to be in a fish that is eaten, as this provides an opportunity to infect the bird in the parasites’ next stage of life. The observed proportions of fish eaten are quite different among the categories.
               Uninfected   Lightly Infected   Highly Infected   Total
Eaten                   1                 10                37      48
Not eaten              49                 35                 9      93
Total                  50                 45                46     141
The proportions of eaten fish are, respectively, 1/50 = 0.02, 10/45 = 0.222, and 37/46 = 0.804.
Contingency Tables Case Study Infected Fish and Predation 2 / 56
[Figure: stacked bar graph of frequency (0 to 50) for the Uninfected, Lightly Infected, and Highly Infected groups, each bar split into Eaten and Not eaten.]
Contingency Tables Case Study Graphics 3 / 56
A stacked bar graph shows:
- the sample sizes in each sample; and
- the number of observations of each type within each sample.
This plot makes it easy to compare sample sizes among samples and counts within samples, but the comparison of estimates of conditional probabilities among samples is less clear.
Contingency Tables Case Study Graphics 4 / 56
[Figure: mosaic plot of relative frequency for the Uninfected, Lightly Infected, and Highly Infected groups, each bar split into Eaten and Not eaten.]
Contingency Tables Case Study Graphics 5 / 56
A mosaic plot replaces absolute frequencies (counts) with relative frequencies within each sample. This plot makes comparisons of estimated conditional probabilities very clear. The cost is that the sample size information is lost.
Contingency Tables Case Study Graphics 6 / 56
In the setting of the experiment, we observe a difference between the proportions of fish eaten in the lightly and highly infected groups. A point estimate of this difference is $37/46 - 10/45 = 0.804 - 0.222 = 0.582$.
How can we quantify uncertainty in this estimate?
Contingency Tables Case Study Estimation 7 / 56
A confidence interval for a difference in proportions $p_1 - p_2$ is based on the sampling distribution of the difference in sample proportions. If the two samples are independent,
$$E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2$$
$$\mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}$$
If both samples are large enough (depending on how close the proportions are to 0 or 1), this sampling distribution is approximately normal.
Contingency Tables Case Study Estimation 8 / 56
[Figure: histogram of the odds (horizontal axis, 0 to 1000) against percent of total (vertical axis, 0 to 80).]
Contingency Tables Case Study Odds Ratios 13 / 56
[Figure: histogram of log(odds) (horizontal axis, 2 to 6) against percent of total (vertical axis, 0 to 15).]
Contingency Tables Case Study Odds Ratios 14 / 56
The sampling distribution of the odds ratio is very skewed to the right. The sampling distribution of the log odds ratio is fairly symmetric and bell-shaped. We will use the normal approximation for the log odds ratio and then translate back. The standard error of the log odds ratio can be estimated as
$$\mathrm{SE}\big(\ln(\text{odds ratio})\big) = \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}$$
Contingency Tables Case Study Odds Ratios 15 / 56
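A 95% confidence interval for the odds ratio is
$$\exp\left(\ln \widehat{\mathrm{OR}} - 1.96\,\mathrm{SE}\right) < \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} < \exp\left(\ln \widehat{\mathrm{OR}} + 1.96\,\mathrm{SE}\right)$$
where $\widehat{\mathrm{OR}} = \dfrac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}$ and
$$\mathrm{SE} = \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}$$
is the estimated standard error of the log odds ratio. Note the equivalent expression:
$$\widehat{\mathrm{OR}} \exp\left(-1.96\,\mathrm{SE}\right) < \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} < \widehat{\mathrm{OR}} \exp\left(+1.96\,\mathrm{SE}\right)$$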
Contingency Tables Case Study Odds Ratios 16 / 56
The estimated odds of being eaten in the highly infected group are 37/9 = 4.111. The estimated odds of being eaten in the lightly infected group are 10/35 = 0.286. The estimated odds ratio is 14.389 and its natural logarithm is 2.666. The estimated SE of the log odds ratio is
$$\sqrt{\frac{1}{37} + \frac{1}{9} + \frac{1}{10} + \frac{1}{35}} = 0.516$$
The interval endpoints are $e^{2.666 - 1.96(0.516)} = 5.23$ and $e^{2.666 + 1.96(0.516)} = 39.6$. The 95% confidence interval is $5.23 < \mathrm{OR} < 39.6$.
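The arithmetic of this interval can be checked with a short Python sketch. The function name `odds_ratio_ci` and its arguments are illustrative choices, not from the slides; the normal quantile comes from the standard library rather than the rounded value 1.96.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(x1, n1, x2, n2, level=0.95):
    """Return (odds ratio, SE of log odds ratio, lower, upper).

    x1 of n1 and x2 of n2 are the success counts in the two groups.
    The interval is built on the log scale and then exponentiated.
    """
    odds1 = x1 / (n1 - x1)
    odds2 = x2 / (n2 - x2)
    or_hat = odds1 / odds2
    se_log = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, se_log, lower, upper

# Highly infected: 37 of 46 eaten; lightly infected: 10 of 45 eaten.
or_hat, se_log, lower, upper = odds_ratio_ci(37, 46, 10, 45)
print(f"OR = {or_hat:.3f}, SE = {se_log:.3f}, CI = ({lower:.2f}, {upper:.1f})")
```

Working on the log scale keeps the normal approximation reasonable; the resulting interval is not symmetric around the estimated odds ratio.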
Contingency Tables Case Study Odds Ratios 17 / 56
In the experimental setting of the infected fish case study, we are 95% confident that the odds of being eaten in the highly infected group are between 5.2 and 39.6 times the odds in the lightly infected group.
Contingency Tables Case Study Odds Ratios 18 / 56
Case Study Example 9.4 on page 220 describes an experiment. In Costa Rica, the vampire bat Desmodus rotundus feeds on the blood of domestic cattle. If the bats respond to a hormonal signal, cows in estrous (in heat) may be bitten with a different probability than cows not in estrous. (The researcher could tell the difference by harnessing painted sponges to the undersides of bulls who would leave their mark during the night.)
                      In estrous   Not in estrous   Total
Bitten by a bat               15                6      21
Not bitten by a bat            7              322     329
Total                         22              328     350
The proportion of bitten cows among those in estrous is 15/22 = 0.682 while the proportion of bitten cows among those not in estrous is 6/328 = 0.018.
Contingency Tables Case Study Vampire Bats 19 / 56
Find a 95% confidence interval for the difference in probabilities of being bitten by a vampire bat between cows in estrous and those not.
In the study setting in Costa Rica, we are 95% confident that the probability that a cow in estrous is bitten by a vampire bat is larger than the probability of a cow not in estrous being bitten by an amount between 0.468 and 0.859.
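The intermediate calculation is not part of this extract. A minimal Python sketch, assuming the plain normal-approximation interval with the variance formula from the estimation slide and sample proportions plugged in, gives the same endpoints. The function name `diff_proportions_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, level=0.95):
    """Normal-approximation interval for p1 - p2 (assumed method).

    Returns (point estimate, lower, upper).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Estimated standard error: plug sample proportions into Var(p1_hat - p2_hat).
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# In estrous: 15 of 22 bitten; not in estrous: 6 of 328 bitten.
diff, lower, upper = diff_proportions_ci(15, 22, 6, 328)
print(f"difference = {diff:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```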
Contingency Tables Case Study Vampire Bats 20 / 56
The χ^2 test of independence compares the observed counts in the table with the expected values of those counts under the null distribution. The test statistic measures discrepancy between observed and expected counts. If the discrepancy is larger than expected (from a random chance model), then there is evidence against the null hypothesis of independence.
Contingency Tables Hypothesis Testing Test of Independence 25 / 56
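$$X^2 = \sum_{i \in \text{rows}} \sum_{j \in \text{columns}} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed count in row $i$ and column $j$; $E_{ij} = \dfrac{(\text{row sum } i)(\text{column sum } j)}{(\text{table sum})}$ is the expected count in row $i$ and column $j$.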
Contingency Tables Hypothesis Testing Test of Independence 26 / 56
               Uninfected   Lightly Infected   Highly Infected   Total
Eaten                   1                 10                37      48
Not eaten              49                 35                 9      93
Total                  50                 45                46     141
Explain expected counts in reference to the example: Calculations and estimates assume independence (the null hypothesis). The observed proportion getting eaten is 48/141. The observed proportion that are uninfected is 50/141. Probabilities of these events are estimated by their observed proportions. Under independence,
P(eaten ∩ uninfected) = P(eaten)P(uninfected)
Contingency Tables Hypothesis Testing Test of Independence 27 / 56
Plugging in observed proportions as estimates, this is
$$P(\text{eaten} \cap \text{uninfected}) \approx \frac{48}{141} \times \frac{50}{141}$$
Under the null hypothesis, the observed count in each cell is a binomial random variable with $n = 141$ and $p$ estimated as above as a product of marginal proportions: $O_{ij} \sim \mathrm{Binomial}(n, p_{ij})$, where $n$ is the total number of observations in the table and $p_{ij}$ is the estimated probability for the cell in row $i$ and column $j$. The expected value of this random variable is $E_{ij} = np_{ij}$, or, for the eaten and uninfected cell,
$$E_{ij} = 141 \times \frac{48}{141} \times \frac{50}{141} = \frac{48 \times 50}{141}$$
In general,
$$E_{ij} = \frac{(\text{row sum } i)(\text{column sum } j)}{(\text{table sum})}$$
Contingency Tables Hypothesis Testing Test of Independence 28 / 56
Observed Counts:
               Uninfected   Lightly Infected   Highly Infected   Total
Eaten                   1                 10                37      48
Not eaten              49                 35                 9      93
Total                  50                 45                46     141

Expected Counts:
               Uninfected   Lightly Infected   Highly Infected   Total
Eaten                  17               15.3              15.7      48
Not eaten              33               29.7              30.3      93
Total                  50               45                46       141
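The expected counts follow directly from the margins. A small Python sketch (the helper name `expected_counts` is an illustrative choice) rebuilds them from the observed table:

```python
# Observed counts: rows are (eaten, not eaten),
# columns are (uninfected, lightly infected, highly infected).
observed = [
    [1, 10, 37],
    [49, 35, 9],
]

def expected_counts(table):
    """Expected counts under independence: row sum * column sum / table sum."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

The expected table has the same row and column totals as the observed table, which is a quick sanity check on the calculation.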
Contingency Tables Hypothesis Testing Test of Independence 29 / 56
$$X^2 = \sum_{i \in \text{rows}} \sum_{j \in \text{columns}} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
The sum is over all cells in the table. If there are some cells where the observed counts and expected counts differ by a lot, the test statistic will be large. If all observed counts are close to expected counts, then the test statistic will be small.
Contingency Tables Hypothesis Testing Test of Independence 30 / 56
The sampling distribution of the test statistic under the null hypothesis of independence can be estimated using simulation. For large enough samples (no more than 20% of expected counts < 5), the χ^2 distribution with (r − 1)(c − 1) degrees of freedom is a good approximation. This is the distribution of a sum of (r − 1)(c − 1) squared independent standard normal random variables (which we will see next week). The expected value of the test statistic is (r − 1)(c − 1). The p-value is the area to the right of the test statistic under a χ^2 distribution with (r − 1)(c − 1) degrees of freedom.
Contingency Tables Hypothesis Testing Test of Independence 31 / 56
In the example, r = 2 and c = 3 so there are (2 − 1)(3 − 1) = 2 degrees of freedom. The test statistic of 69.8 is much larger than 2. The p-value is about $6.6 \times 10^{-16}$.
[Figure: density of the χ^2(2) distribution, with χ^2 on the horizontal axis (0 to 60) and density on the vertical axis.]
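The test statistic for the fish table can be reproduced with a few lines of Python. This is a sketch using only the standard library; the function name `chi_square_statistic` is an illustrative choice. The tail area uses the closed form exp(-x/2), which is valid only because there are exactly 2 degrees of freedom here; other degrees of freedom need a general χ^2 tail routine.

```python
import math

# Rows: eaten, not eaten. Columns: uninfected, lightly, highly infected.
observed = [[1, 10, 37], [49, 35, 9]]

def chi_square_statistic(table):
    """Sum of (O - E)^2 / E over all cells, with E from the margins."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total
            stat += (o - e) ** 2 / e
    return stat

stat = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Special case: a chi-square distribution with 2 degrees of freedom is
# exponential with mean 2, so its upper tail area is exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"X^2 = {stat:.1f}, df = {df}, p-value = {p_value:.2g}")
```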
Contingency Tables Hypothesis Testing Test of Independence 32 / 56
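$$G = 2\left(\sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)\right) = 2\left(O_{11} \ln\left(\frac{O_{11}}{E_{11}}\right) + \cdots + O_{rc} \ln\left(\frac{O_{rc}}{E_{rc}}\right)\right)$$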
The p-value is approximately $1.2 \times 10^{-17}$. Compare G to the χ^2 test of independence test statistic value of 69.8.
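A short Python sketch of the G statistic for the fish table follows. The function name `g_statistic` is illustrative, and the tail area again relies on the df = 2 special case where the χ^2 upper tail equals exp(-x/2).

```python
import math

# Rows: eaten, not eaten. Columns: uninfected, lightly, highly infected.
observed = [[1, 10, 37], [49, 35, 9]]

def g_statistic(table):
    """G = 2 * sum of O * ln(O / E) over all cells with O > 0."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            if o > 0:  # a cell with O = 0 contributes nothing to the sum
                e = row_sums[i] * col_sums[j] / total
                g += o * math.log(o / e)
    return 2 * g

g = g_statistic(observed)
p_value = math.exp(-g / 2)  # valid only for 2 degrees of freedom
print(f"G = {g:.1f}, p-value = {p_value:.2g}")
```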
Contingency Tables Hypothesis Testing Case Study 37 / 56
There is overwhelming evidence ($G = 77.9$, $n = 141$, df $= 2$, $p < 10^{-16}$, G-test) that infection status is not independent of being eaten for fish under these experimental conditions.
Contingency Tables Hypothesis Testing Case Study 38 / 56
Fisher’s exact test is based on an alternative probability model for 2 × 2 tables. Think of one factor as an outcome and the other as designating groups. Fisher’s exact test imagines the 2 × 2 tables if the groups of the same size had been randomly created with sampling without replacement rather than using the factor to form the groups. The p-value is the probability of selecting any table at least as extreme as the actual table. Sampling without replacement is described by the hypergeometric distribution.
Contingency Tables Hypothesis Testing Fisher’s Exact Test 39 / 56
The hypergeometric distribution is similar to the binomial distribution in that it counts the number of successes in a sample of size n, but the sample is made without replacement from a finite population, and so separate trials are not independent. The p-value from Fisher’s exact test is computed by summing hypergeometric probabilities.
Contingency Tables Hypothesis Testing Fisher’s Exact Test 40 / 56
A bucket contains r red balls and w white balls and n balls are sampled without replacement. X counts the number of red balls in the sample and we say that X has a hypergeometric distribution with parameters r , w , and n (notation varies by source and there is no general consensus).
$$P(X = k) = \frac{\binom{r}{k}\binom{w}{n-k}}{\binom{r+w}{n}}, \qquad \max\{0, n - w\} \le k \le \min\{r, n\}$$
The numerator counts the number of samples with exactly k red balls and n − k white balls and the denominator counts the number of samples of n balls from the total r + w.
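The formula translates directly into code. Below is a minimal Python sketch using `math.comb`; the function name `hypergeom_pmf` and the small bucket in the example are illustrative choices, not from the slides.

```python
from math import comb

def hypergeom_pmf(k, r, w, n):
    """P(X = k): k red balls among n drawn without replacement
    from a bucket of r red and w white balls."""
    if k < max(0, n - w) or k > min(r, n):
        return 0.0  # outside the support
    return comb(r, k) * comb(w, n - k) / comb(r + w, n)

# Toy example: 3 red and 2 white balls, draw 2.
probs = [hypergeom_pmf(k, 3, 2, 2) for k in range(3)]
print(probs)       # probabilities of 0, 1, 2 red balls
print(sum(probs))  # the probabilities over the support sum to 1
```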
Contingency Tables Hypothesis Testing Fisher’s Exact Test 41 / 56
                      In estrous   Not in estrous   Total
Bitten by a bat               15                6      21
Not bitten by a bat            7              322     329
Total                         22              328     350
Here are other tables with even more extreme differences in proportions of being bitten, but with the same marginal totals.
16     5
 6   323
Contingency Tables Hypothesis Testing Fisher’s Exact Test 42 / 56
                      In estrous   Not in estrous   Total
Bitten by a bat                X           21 − X      21
Not bitten by a bat       22 − X          307 + X     329
Total                         22              328     350

The p-value calculation focuses on any single cell; here the top left. Imagine the 21 bitten cows as red balls and the 329 cows not bitten as white balls. Sample 22 without replacement at random, and let X be the number of red balls in the sample. The probability of having exactly x red balls in the sample is
$$P(X = x) = \frac{\binom{21}{x}\binom{329}{22-x}}{\binom{350}{22}}$$
as there are $\binom{21}{x}$ ways to pick which x red balls are sampled, $\binom{329}{22-x}$ ways to pick which 22 − x white balls are sampled, and $\binom{350}{22}$ total ways to choose 22 balls from 350.
Contingency Tables Hypothesis Testing Fisher’s Exact Test 43 / 56
The actual grouping of cows by estrous status has X = 15. The p-value is the probability X ≥ 15.
$$P(X \ge 15) = \sum_{x=15}^{21} \frac{\binom{21}{x}\binom{329}{22-x}}{\binom{350}{22}}$$
This calculation is tedious by hand, but can be done in R using the dhyper() function.
sum(dhyper(15:21, 21, 329, 22))
[1] 1.004713e-
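For readers without R at hand, the same tail sum can be sketched in Python with exact integer arithmetic. The function name `fisher_upper_tail` is a hypothetical helper; it covers only the one-sided sum P(X ≥ 15) described on this slide, not a general two-sided Fisher test.

```python
from fractions import Fraction
from math import comb

def fisher_upper_tail(x_obs, red, white, n):
    """P(X >= x_obs) when n balls are drawn without replacement
    from `red` red and `white` white balls (hypergeometric upper tail)."""
    total = comb(red + white, n)
    largest = min(red, n)  # largest possible number of red balls drawn
    numerator = sum(
        comb(red, x) * comb(white, n - x) for x in range(x_obs, largest + 1)
    )
    # Exact rational arithmetic avoids any loss of precision in the tiny tail.
    return Fraction(numerator, total)

# 21 bitten cows (red), 329 not bitten (white), 22 cows in estrous sampled.
p_value = float(fisher_upper_tail(15, 21, 329, 22))
print(p_value)
```

The binomial coefficients involved are very large integers, so the sum is kept exact and converted to a float only at the end.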
Contingency Tables Hypothesis Testing Fisher’s Exact Test 44 / 56
The remaining slides contain details about the G-test that will not be included for homework or exam.
Contingency Tables Appendix 49 / 56
In a likelihood ratio test, the null hypothesis assumes a likelihood model with $k_0$ free parameters which is a special case of the alternative hypothesis likelihood model with $k_1$ free parameters. The two likelihood models are maximized with likelihoods $L_0$ and $L_1$ respectively. The test statistic is $G = 2(\ln L_1 - \ln L_0)$ which, for large enough samples, has approximately a $\chi^2(k_1 - k_0)$ distribution when the null hypothesis is true.
Contingency Tables Appendix 50 / 56
The multinomial distribution is a generalization of the binomial distribution where there is an independent sample of size $n$ and each outcome is in one of $k$ categories with probabilities $p_i$ for the $i$th category ($\sum_{i=1}^{k} p_i = 1$). The probability that there are $x_i$ outcomes of type $i$ for $i = 1, 2, \ldots, k$ is
$$\binom{n}{x_1, \ldots, x_k} p_1^{x_1} \cdots p_k^{x_k}$$
where
$$\binom{n}{x_1, \ldots, x_k} = \frac{n!}{x_1! \cdots x_k!}$$
is called a multinomial coefficient and $x_1 + \cdots + x_k = n$.
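A brief Python sketch of this probability, with the binomial as the $k = 2$ special case. The function name `multinomial_pmf` and the example inputs are illustrative only.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing counts (x_1, ..., x_k) in n = sum(counts)
    independent trials with category probabilities (p_1, ..., p_k)."""
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)  # exact: each partial quotient is an integer
    return coef * prod(p ** x for p, x in zip(probs, counts))

# With two categories this is the binomial: n = 4, one success, p = 0.5.
binomial_case = multinomial_pmf([1, 3], [0.5, 0.5])

# Three equally likely categories, one outcome of each type.
three_way = multinomial_pmf([1, 1, 1], [1 / 3, 1 / 3, 1 / 3])
print(binomial_case, three_way)
```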
Contingency Tables Appendix 51 / 56
In the binomial distribution, we can rethink the parameters for fixed $n$ with probabilities $p_1$ and $p_2$ for the two categories with $p_1 + p_2 = 1$, so there is only one free parameter. (If you know $p_1$, you also know $p_2$.) The maximum likelihood estimates are $\hat{p}_1 = x_1/n$ and $\hat{p}_2 = x_2/n = 1 - \hat{p}_1 = (n - x_1)/n$. For more categories, the maximum likelihood estimates are $\hat{p}_i = x_i/n$ for $i = 1, \ldots, k$. The maximum likelihood is then
$$L = \binom{n}{x_1, \ldots, x_k} \left(\frac{x_1}{n}\right)^{x_1} \times \cdots \times \left(\frac{x_k}{n}\right)^{x_k}$$
and the maximum log-likelihood is
$$\ln L = \ln \binom{n}{x_1, \ldots, x_k} + \sum_{i=1}^{k} x_i \ln\left(\frac{x_i}{n}\right)$$
Contingency Tables Appendix 52 / 56
The observed outcomes $\{O_{ij}\}$ in a contingency table with $r$ rows and $c$ columns are jointly modeled with a multinomial distribution with parameters $\{p_{ij}\}$ for $i = 1, \ldots, r$ and $j = 1, \ldots, c$. There are $rc$ probabilities.
Contingency Tables Appendix 53 / 56
Under the alternative hypothesis of no independence, the only restriction on the probabilities is that they sum to one, so there are $k_1 = rc - 1$ free parameters. The maximum likelihood estimates are
$$\hat{p}_{ij} = \frac{O_{ij}}{n}$$
The maximum log-likelihood is
$$\ln L_1 = \ln \binom{n}{O_{11}, \ldots, O_{rc}} + \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \ln\left(\frac{O_{ij}}{n}\right)$$
Contingency Tables Appendix 54 / 56
Under the null hypothesis of independence, $p_{ij} = p_{i\cdot} \times p_{\cdot j}$ for all $i$ and $j$, where there are $r - 1$ free parameters for the row factor and $c - 1$ free parameters for the column factor, for a total of $k_0 = r + c - 2$. The maximum likelihood estimates are
$$\hat{p}_{i\cdot} = \frac{\text{sum of observations in row } i}{n}$$
for the row probabilities and
$$\hat{p}_{\cdot j} = \frac{\text{sum of observations in column } j}{n}$$
for the column probabilities. The maximum likelihood estimate for $p_{ij}$ is $\hat{p}_{ij} = \hat{p}_{i\cdot}\hat{p}_{\cdot j} = E_{ij}/n$. The maximum log-likelihood is
$$\ln L_0 = \ln \binom{n}{O_{11}, \ldots, O_{rc}} + \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \ln\left(\frac{E_{ij}}{n}\right)$$
Contingency Tables Appendix 55 / 56
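The test statistic is $G = 2(\ln L_1 - \ln L_0)$ which equals
$$2\left(\sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \left(\ln\left(\frac{O_{ij}}{n}\right) - \ln\left(\frac{E_{ij}}{n}\right)\right)\right)$$
which can be simplified to
$$2\left(\sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)\right)$$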
The difference in the number of free parameters is
$$(rc - 1) - (r + c - 2) = rc - r - c + 1 = (r - 1)(c - 1)$$
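The derivation can be checked numerically on the infected fish table. The Python sketch below computes both maximized log-likelihoods (with the multinomial coefficient evaluated through `math.lgamma`) and compares $2(\ln L_1 - \ln L_0)$ with the simplified form. All variable and function names are illustrative choices.

```python
import math

# Infected fish table. Rows: eaten, not eaten. Columns: infection level.
observed = [[1, 10, 37], [49, 35, 9]]
r, c = len(observed), len(observed[0])
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
n = sum(row_sums)

def log_multinomial_coef(counts):
    """ln( n! / (x_1! ... x_k!) ), using lgamma(x + 1) = ln(x!)."""
    total = sum(counts)
    return math.lgamma(total + 1) - sum(math.lgamma(x + 1) for x in counts)

cells = [(i, j, observed[i][j]) for i in range(r) for j in range(c)]
log_coef = log_multinomial_coef([o for _, _, o in cells])

# Alternative: p_ij estimated by O_ij / n.
ln_L1 = log_coef + sum(o * math.log(o / n) for _, _, o in cells if o > 0)
# Null (independence): p_ij estimated by E_ij / n = (row sum)(column sum) / n^2.
ln_L0 = log_coef + sum(
    o * math.log(row_sums[i] * col_sums[j] / n ** 2) for i, j, o in cells
)

g_from_likelihoods = 2 * (ln_L1 - ln_L0)
g_simplified = 2 * sum(
    o * math.log(o / (row_sums[i] * col_sums[j] / n)) for i, j, o in cells if o > 0
)
df = (r * c - 1) - (r + c - 2)
print(g_from_likelihoods, g_simplified, df)
```

The multinomial coefficient is the same under both hypotheses, so it cancels in the difference of log-likelihoods; it is included here only to mirror the formulas.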
Contingency Tables Appendix 56 / 56