Chapter 12: Chi Square Tests - Lecture Notes | MATH 241, Study notes of Mathematics

Material Type: Notes; Class: Statistical Applications; Subject: Mathematics; University: Saint Mary's College; Term: Spring 2009;

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009

koofers-user-zcy-1


CHI SQUARE TESTS for goodness of fit and independence (Chapter 12)

The chi-square statistic can be used for tests on distributions, but it must be used with frequency counts (i.e., the number of observations that fall into certain categories). We use $f_i$ to represent the actual frequency for category $i$ (the number of observations in the actual data that are in category $i$) and $e_i$ to represent the expected frequency if $H_0$ is true (the number of observations for category $i$ predicted by $H_0$ for a sample of this size).

Our test statistic is (in all cases)

$$\chi^2 = \sum_i \frac{(f_i - e_i)^2}{e_i} \quad \text{OR} \quad \chi^2 = \sum_{i,j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

(The total, over all categories, of (actual minus expected) squared over expected. Categories may be based on one variable, as in the first formula, or on two variables, as in the second formula.)

NOTE: Expected cell frequency must be at least 5 in order to use the chi-square distribution (rows or columns may be combined to accomplish this).
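The one-variable form of the statistic can be sketched in a few lines of Python. This is only an illustration: the function names `chi_square_statistic` and `all_expected_at_least` are made up here, and the data are the grade counts and claimed percentages from Example 1 in the Examples section.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


def all_expected_at_least(expected, minimum=5):
    """Check the rule that every expected count is at least `minimum`."""
    return all(e >= minimum for e in expected)


# Grade counts and claimed proportions from Example 1 (A, B, C, D, F).
observed = [8, 7, 28, 11, 9]
claimed = [0.10, 0.20, 0.40, 0.20, 0.10]
n = sum(observed)                      # 63 grades in the sample
expected = [p * n for p in claimed]    # expected counts are not rounded

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 3), all_expected_at_least(expected))
```

With these numbers the statistic comes out to roughly 4.62, and every expected count is above 5, so no categories need to be combined.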

Goodness of Fit [One variable — one row of categories]

The issue is to determine whether a particular probability distribution might reasonably describe the population from which the sample was drawn. Our test is always

$H_0$: The data come from a population with the distribution stated

$H_a$: The data come from a population which does not fit that distribution

The test statistic is given by

$$\text{sample } \chi^2 = \sum_i \frac{(f_i - e_i)^2}{e_i}$$

with df = #categories − 1 − (number of parameters estimated from data).

In general, the expected frequency for category $i$ is $P(X = i) \times n$ ($n$ = sample size) and is not rounded to a whole number. ($P(X = i)$ comes from the distribution we are testing for.)

Critical values for the distribution are given in table 3 on p.923 [the same table used for inference on $\sigma^2$], but we are only interested in small upper-tail areas [columns further to the right].

Decision method: We will reject $H_0$ and conclude the proposed distribution does not fit if our sample $\chi^2 > \chi^2_\alpha$ with df = #categories − 1 − (number of parameters estimated from data).
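The decision method can be sketched as follows, using the data from Example 2 in the Examples section (census percentages as the hypothesized distribution, level 0.05). The function name `goodness_of_fit` is hypothetical, and the critical values are copied from a standard chi-square table for an upper-tail area of 0.05; they stand in for the textbook table and are an assumption of this sketch.

```python
# Upper-tail critical values of the chi-square distribution at alpha = 0.05,
# copied from a standard table (assumed here in place of the textbook table).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def goodness_of_fit(observed, probabilities, estimated_params=0):
    """Return (chi2, df, reject) for a goodness-of-fit test at alpha = 0.05."""
    n = sum(observed)
    expected = [p * n for p in probabilities]          # P(X = i) * n, unrounded
    chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    df = len(observed) - 1 - estimated_params
    return chi2, df, chi2 > CRITICAL_05[df]


# Example 2: employees observed vs. census proportions (white, black, Hispanic).
chi2, df, reject = goodness_of_fit([160, 13, 27], [0.70, 0.12, 0.18])
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject}")
```

No parameters are estimated from the data here, so df = 3 − 1 = 2. The statistic (about 10.15) exceeds the table value 5.991, so under these assumptions $H_0$ would be rejected at the 0.05 level.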

Independence Test and Contingency Tables [Two variables or two populations making a table of categories]

Events $A$ and $B$ are independent if $P(A \mid B) = P(A)$ [which is equivalent to $P(A \text{ and } B) = P(A)P(B)$]. Two variables are independent if knowing the value of one does not change the probability distribution of the other. (All events that can be described with one are independent of all events that can be described with the other.)

In the contingency table (laying out all the possible combinations of values for the variables, that is, all “contingencies”), independence means that the probability of any cell can be found as the product of marginal probabilities: $P(X = A \text{ and } Y = B) = P(X = A) \times P(Y = B)$. That is, the probability of column one is the same for every row, the probability of column two is the same for every row, etc., and the probability of row 1 is the same for every column, etc. Thus the expected count $e_{ij}$ for the cell in row $i$, column $j$ is given by

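$$e_{ij} = P(\text{row } i) \times P(\text{column } j) \times \text{sample size} = \frac{\#\,\text{row } i}{\text{sample size}} \times \frac{\#\,\text{column } j}{\text{sample size}} \times \text{sample size} = \frac{\#\,\text{row } i \times \#\,\text{column } j}{\text{sample size}}$$

(Here # row $i$ is the total observed count in row $i$ and # column $j$ is the total observed count in column $j$.)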
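As a worked instance of this formula, take the cell for $10 dinners paid in cash from the table in Example 4, where the row total is 400, the column total is 730, and the sample size is 1500:

$$e_{11} = \frac{400 \times 730}{1500} \approx 194.67$$

The observed count in that cell is 200, so its contribution to the statistic is $(200 - 194.67)^2 / 194.67 \approx 0.146$.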

The issue is to determine whether the two variables (determining the rows and columns, respectively) are independent. Test is always

$H_0$: The two variables are independent

$H_a$: The two variables are not independent

The test statistic is

$$\chi^2 = \sum_{i,j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

df = (#rows − 1) × (#columns − 1)

Decision method: We will reject $H_0$ and conclude the variables are not independent if our sample $\chi^2 > \chi^2_\alpha$ with df = (#rows − 1) × (#columns − 1). That is, we reject the null hypothesis only if the test statistic is “big”.
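A minimal sketch of the whole independence test, assuming a level of 0.05 and using the cell counts from Example 4 (dinner price vs. method of payment). The function name `independence_test` is made up for this illustration, the row and column totals are recomputed from the cells, and the critical values are copied from a standard chi-square table.

```python
# Upper-tail critical values at alpha = 0.05 from a standard chi-square table
# (an assumption of this sketch, standing in for the textbook table).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 6: 12.592}


def independence_test(table):
    """Return (chi2, df, reject) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n    # expected count e_ij
            chi2 += (f - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, chi2 > CRITICAL_05[df]


# Cell counts from Example 4: rows are dinner prices $10, $12, $14, $16;
# columns are cash, bank credit card, Diner's Club card.
payments = [
    [200, 130, 70],
    [220, 180, 100],
    [190, 130, 80],
    [120, 60, 20],
]
chi2, df, reject = independence_test(payments)
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {reject}")
```

With 4 rows and 3 columns, df = 3 × 2 = 6. The statistic is about 18.84, which is larger than the 0.05 table value of 12.592, so under these assumptions the data suggest payment method and price are related.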

MINITAB: [for contingency table] Enter the observed frequencies in adjacent columns, keeping the entries in order (so you copy the table of observed values). Choose Stat>Tables, then choose Chi-Square Test (Table in Worksheet), and enter the appropriate columns (containing the table) in the Columns Containing Table box.

Equality of proportions

The chi-square test for equality of several proportions (which is the extension of the two-sample test on proportions) is most easily treated as a special case of the test of independence. We have two rows (for “yes” and “no”) and one column for each population. Test is always

$H_0$: The proportions (of “yes”) are the same in all populations

$H_a$: The proportions are not the same in all populations

Test statistic

$$\chi^2 = \sum_{i,j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

df = #columns − 1 because #rows = 2.
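To see how the two-row layout works in practice, the sketch below builds the “yes”/“no” table from the number of “yes” responses in each sample and then applies the independence statistic. The function name `proportions_chi_square` and the data (three samples of 100 with 30, 40 and 50 “yes” responses) are hypothetical and do not come from the notes.

```python
def proportions_chi_square(yes_counts, sample_sizes):
    """Chi-square statistic and df for H0: all populations share one proportion."""
    no_counts = [n - y for y, n in zip(yes_counts, sample_sizes)]
    table = [list(yes_counts), no_counts]     # two rows, one column per population
    row_totals = [sum(row) for row in table]
    col_totals = list(sample_sizes)           # each column total is a sample size
    total = sum(sample_sizes)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(col_totals)):
            e = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - e) ** 2 / e
    df = len(col_totals) - 1                  # (2 - 1) * (#columns - 1)
    return chi2, df


# Hypothetical data, not from the notes: "yes" counts in three samples of 100.
chi2, df = proportions_chi_square([30, 40, 50], [100, 100, 100])
print(f"chi2 = {chi2:.3f}, df = {df}")
```

Here the pooled proportion of “yes” is 120/300, so each expected “yes” count is 40 and each expected “no” count is 60. The statistic is about 8.33 with df = 2, which would exceed the usual 0.05 table value of 5.991.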

Examples

  1. Professor Frump claims that he grades on a curve; that is, 10% of students receive A’s, 20% B’s, 40% C’s, 20% D’s and 10% F’s. [This is the classic “grading on a curve”, which assumes a normal distribution of “success” and puts C at the mean.] A student who doubts this collects a sample of 63 grades from Frump’s classes and finds 8 A’s, 7 B’s, 28 C’s, 11 D’s and 9 F’s. Does this indicate that Prof. Frump’s grades are not distributed as he claims?
  2. Associated Investors has been accused of engaging in prejudicial hiring practices. According to the most recent census, the percentages of whites, blacks, and Hispanics in the community where Associated is located are 70%, 12%, and 18%, respectively. If a random sample of 200 of Associated’s employees revealed that 160 were white, 13 were black, and 27 were Hispanic, what, at the 0.05 level, can be concluded about the distribution of Associated’s employees?
  3. Does the number of reservation cancellations on United flight 568 fit a Poisson distribution?

     $H_0$: The number of cancellations fits a Poisson distribution, with mean matching the mean of our sample

     $H_a$: The number of cancellations does not fit a Poisson distribution, or the mean does not match the mean of our sample

     Data: The first two columns of the table represent the sample (the second column is “observed frequency”), while the last two are computed assuming $H_0$ is true (the last column is “expected frequency”).

     | Number of cancellations | Number of days observed (in 90 days) | Probability assuming Poisson distribution, λ = 2.6 | Expected frequency (in 90 days) |
     |---|---|---|---|
     | 0 | 9 | .074 | 6. |
     | 1 | 17 | .193 | 17. |
     | 2 | 25 | .251 | 22. |
     | 3 | 15 | .218 | 19. |
     | 4 | 11 | .141 | 12. |
     | 5 | 7 | .074 | 6. |
     | 6 | 2 | .032 | 2. |
     | 7 | 2 | .012 | 1. |
     | 8 | 2 | .004 | 0. |
     | 9 | 0 | .001 | 0. |

     Degrees of freedom = n − 1 − (number of parameters estimated from data), with n = number of categories. For Poisson we estimate one parameter (λ), so df = n − 2.

     NOTE: Expected cell frequency must be at least 5. If it is less than 5 for some category, we must combine categories to get the expected count up to 5.
  4. Is payment method independent of [or is it related to] the cost of a meal?

     $H_0$: The variables (price category and payment method) are independent (i.e., method of payment is independent of price).

     $H_a$: The variables are dependent [there is some relationship between them].

     Actual data: DINNER PRICE VS. METHOD OF PAYMENT

     | Dinner Price | Cash | Bank Credit Card | Diner’s Club Card | Total |
     |---|---|---|---|---|
     | $10 | 200 | 130 | 70 | 400 |
     | $12 | 220 | 180 | 100 | 500 |
     | $14 | 190 | 130 | 80 | 400 |
     | $16 | 120 | 60 | 20 | 200 |
     | TOTAL | 730 | 500 | 270 | 1500 |
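The category-combining step described in Example 3 can be sketched as follows. The helper `pool_small_tail` is a hypothetical name, and it only merges categories from the upper end, which is enough here because only the high cancellation counts have small expected frequencies. Expected counts are computed as 90 times the probabilities listed in the table for λ = 2.6.

```python
def pool_small_tail(observed, expected, minimum=5):
    """Merge categories from the upper end until the last expected count >= minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs      # fold the last category into its neighbour
        exp[-1] += last_exp
    return obs, exp


# Example 3: observed days and Poisson(2.6) probabilities for 0..9 cancellations.
observed = [9, 17, 25, 15, 11, 7, 2, 2, 2, 0]
probs = [.074, .193, .251, .218, .141, .074, .032, .012, .004, .001]
n = sum(observed)                              # 90 days
expected = [p * n for p in probs]

obs, exp = pool_small_tail(observed, expected)
chi2 = sum((f - e) ** 2 / e for f, e in zip(obs, exp))
df = len(obs) - 1 - 1                          # one estimated parameter (lambda)
print(obs, [round(e, 2) for e in exp], round(chi2, 3), df)
```

Categories 5 through 9 end up combined into a single “5 or more” category, leaving 6 categories and df = 6 − 2 = 4. The statistic is roughly 2.74; if a 0.05 level is assumed, the standard table value for df = 4 is 9.488, so the Poisson model would not be rejected.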