
Chapter 5

The Goodness of Fit Test

5.1 Dice, Computers and Genetics

The CM of casting a die was introduced in Chapter 1. We assumed that the six possible outcomes of this CM are equally likely; i.e. we assumed the ELC. Later I mentioned that I own two round-cornered dice and I suspect that the ELC is not reasonable for either of them. How can we decide whether to believe in the ELC for a die?

In this chapter we will learn about the (Chi-Squared) Goodness of Fit Test. This test was developed circa 1900 by Karl Pearson (1857–1936), in part to investigate theories of genetic inheritance. While I cannot give you an exact reference, sometime in the 1990s Scientific American (or a similarly themed journal—sorry) published an issue devoted to 'The 20 greatest scientific discoveries of the 20th Century.' Next to such obvious entries as the jet engine and the splitting of the atom was... the test of this chapter! This was a curious inclusion for at least two reasons.

  1. Whereas it is true that modern statisticians do not condemn this test, they don’t use it that much.
  2. With all the wondrous discoveries of those one hundred years, I can’t imagine putting any statistical method on the list! Many of my more zealous colleagues might disagree with my last statement, but I would be truly amazed if any of them selected the Goodness of Fit Test as our main contribution.

When I read this issue of the journal, my sense was that there were two reasons for including our test. First, the test is important historically because it provided a confirmation that Mendel's 'genes' made sense. This was important because genes provided a mechanism for Darwin's work. (I am not a biologist and indeed understand the subject poorly, but as I understand things, Darwin provided no mechanism for natural selection.) Second, I suspect that the editors wanted to take the most inclusive view of science for the issue. Hence, even Statistics received attention.

In any event, unless you work for a casino or are interested in gambling (sorry, I refuse to call it 'gaming'), you might think that the study of dice is a bit frivolous. Well, as mentioned above, there are applications to genetics. But why do I mention computers in this section title?

Well, we hear all the time about computer models that help us learn about the world. There are computer models for the climate, the mutation of species or viruses, and so on. These computer models typically include CMs, and at some point in the analysis the computer programmer will simulate the operations of these various CMs by using a program called a random number generator. For example, a random number generator might promise to select a digit at random (this implies the ELC in this setting) from 0, 1, 2, ..., 9. But how does the programmer know that the program works as advertised? The test of this chapter can be used to investigate this issue.
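To make this concrete, here is a minimal sketch of such an investigation, assuming Python with the NumPy and SciPy libraries (neither is part of these notes; scipy.stats.chisquare computes the test statistic and P-value that we develop later in this chapter):

```python
# Sketch: does a random digit generator really satisfy the ELC on 0, 1, ..., 9?
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng()            # the generator under investigation
n = 10_000
digits = rng.integers(0, 10, size=n)     # n digits that are claimed to be ELC
O = np.bincount(digits, minlength=10)    # observed frequency of each digit
E = np.full(10, n / 10)                  # expected frequency under the ELC

stat, p_value = chisquare(O, E)          # the Goodness of Fit Test of this chapter
print(stat, p_value)                     # a tiny P-value casts doubt on the ELC
```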

5.2 The Chi-Squared Curves

In Chapter 2 we learned about the family of normal curves. Also in Chapter 2 we learned about the family of binomial distributions. A binomial is characterized by the values of two parameters: the number of trials n and the probability of success on any trial p. In Chapter 4 we learned about the family of Poisson distributions. A Poisson is characterized by the value of a single parameter θ. In this section we will learn about a family of curves called the Chi-Squared curves. (Note: Many people call these the Chi-Square curves—that is, no 'd' at the end of Square—but this has always annoyed me. For example, when I read the equation 3^2 = 9, I say, 'Three squared equals nine.' I would never say, 'Three square equals nine.' Three square sounds like I am talking about meals!)

This might be a good time to tell/remind you that χ is the lower case Greek letter chi, where 'ch' is pronounced as a 'k' and the 'i' is a long i. My word processor does not include an upper case chi because it looks just like an upper case 'ex'; i.e. 'X' can be either of two letters; hopefully the context will make it clear which is meant! A helpful reminder is that X is always 'ex' whereas X^2 is usually chi-squared. It is only rarely that statisticians square an 'ex'; for example, why would anyone want to know the square of the number of heads I get when tossing a coin?

A Chi-Squared curve is characterized by the value of one parameter, called its degrees of freedom (df). The degrees of freedom can be any positive integer: 1, 2, 3, .... Our symbol for this curve will be χ^2(df). For example, χ^2(5) is the Chi-Squared curve with df = 5. I will talk about a 'Chi-Squared random variable.' Such a random variable will be denoted by X^2 and takes on values χ^2. (Similar to how a binomial random variable X takes on values x.) Further, this terminology implies that we use a Chi-Squared curve to calculate probabilities for X^2. Following our now standard notation, we write this as:

X^2 ∼ χ^2(df).

By the way, as the symbol X^2 suggests, a Chi-Squared random variable can never take on a negative number for its value; indeed, this is why we have the squared in the notation, as a reminder that negatives are impossible.

On our course webpage there are links to both a table and a calculator for Chi-Squared curves. We will use only the calculator in this course; I provide the table in case you are interested in it. At this time, I want you to go to the calculator. When you call up the calculator you will find the default screen. You will see a curve with the area to the right of 10 shaded blue. Below the curve are three boxes. Reading from these boxes we learn that the area under a χ^2(10) curve to the right of 10 is 0.4405.
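If you prefer software to the web calculator, here is a small sketch, assuming SciPy (an assumption; the course materials themselves use only the web calculator), that reproduces its computations:

```python
from scipy.stats import chi2

# Area under the chi-squared(10) curve to the right of 10 (the calculator's
# default display); sf is the survival function, i.e. the right-tail area.
print(chi2.sf(10, df=10))       # 0.4405
# The reverse question: which value has area 0.05 to its right?
print(chi2.isf(0.05, df=10))    # 18.307
```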

5.3 The Hypotheses

We assume that we have a CM that can be operated repeatedly and, when so operated, yields i.i.d. trials. Whether the outcomes are categories or numbers, we assign numbers to each outcome: 1, 2, ..., k or 0, 1, 2, ..., (k − 1) if there are a finite number of possible outcomes, or 0, 1, 2, ... if there is a sequence of possible outcomes. Note that for the finite case there are k possible outcomes. The probability of outcome i is denoted by pi. So far, this is all quite routine.

The Goodness of Fit Test is used when we have a theory about the values of the pi's and we want to evaluate whether or not the theory is reasonable. Here are some examples.

  1. Dice. The CM is the casting of a die. The possible outcomes are 1, 2, ..., 6. I might entertain the theory that the die is balanced; i.e. I might assume the ELC.
  2. Genetics. This is just one of many related problems that arise in genetics. Individual snapdragon (Antirrhinum majus) plants can be red-, pink- or white-flowered. Self-pollination of pink-flowered plants can yield any of these three colors. The CM is one self-pollination with outcome 1 (red), 2 (pink) or 3 (white). Repeated operations are obtained by repeated self-pollinations. These repeated operations are assumed to yield i.i.d. trials. A Mendelian genetic model states that p1 = 0.25, p2 = 0.50 and p3 = 0.25.
  3. Bernoulli Trials. Carol likes to shoot free throws. Every day she attempts 10 free throws. As we saw in Chapter 2, if we assume that her individual shots are BT, then her total number of successes on a day has a Bin(10, p) distribution. We will learn how to use her data from many days to test the binomial model. We will learn how to do this for both of the cases: p is known and p is unknown.
  4. Poisson Process. Every day David counts the number of successes in a fixed location over the same one-hour period of time. As discussed in Chapter 4, David might assume that the number of successes each day has a Poisson(θ) distribution. Given data from many days, we will learn how to test this assumption for both of the cases: θ is known and θ is unknown.

We will restrict attention to the situation in which the CM has a finite number of possible outcomes until the last section of this chapter; i.e. we will return to the Poisson Process in the last section.

As mentioned above, the test of this chapter is relevant whenever we have a theory that specifies the values of the pi's. The procedure we will learn is an example of a test of (statistical) hypotheses. Below I will introduce you to the features of a test of hypotheses, with special attention paid to the Goodness of Fit Test of this chapter.

The first feature is that every test has two hypotheses, denoted by H0 and H1. The first of these is called the null hypothesis and the second is called the alternative hypothesis. Because of its name, many texts denote the alternative hypothesis by Ha, but we will stick with H1. Each hypothesis is a conjecture about reality. These conjectures do not overlap; i.e. they cannot both be true. Curiously, it is possible that neither is true, although standard analyses tend to ignore this possibility. (Well, perhaps ignore is too strong a word, but in my experience analysts do not like to dwell on this possibility.)

For the Goodness of Fit test, the null hypothesis states that our theory about the probabilities is correct. The alternative hypothesis states that our theory is incorrect. This might sound confusing, but in any particular situation it is quite simple. For example, for our die study,

  • H0: p1 = p2 = p3 = p4 = p5 = p6 = 1/6.
  • H1: Not H0; i.e. at least one of the pi's does not equal 1/6.

For our snapdragon example,

  • H0: p1 (red) = 0.25, p2 (pink) = 0.50, p3 (white) = 0.25.
  • H1: Not H0.

In general, let pi0 denote the theory's value of pi. This makes the hypotheses:

  • H0: pi = pi0 for all i.
  • H1: Not H0; i.e. pi ≠ pi0 for at least one i.

The hypotheses must be selected before data are collected. This should never be a problem because the hypotheses are derived from questions of scientific interest, which exist before we collect data.

Every test of hypotheses begins with the assumption that the null hypothesis is correct. There are two reasons for this: one philosophical, one practical. The philosophical reason is often described as Occam's razor, which states, roughly, that we prefer a simpler model for the world unless the simple model proves to be seriously inadequate. (See Wikipedia for more details.) In the current example, it is simpler to assume the die is balanced than to assume it is not. (If it is not balanced, we need to learn about its six probabilities and need a reason why they are not all the same.) Similarly, a Mendelian genetic model provides a simple way to explain inheritance of traits. If it is incorrect, another (more complicated) model needs to be found. The practical reason is that we need to assume the null hypothesis is true in order to obtain useful math results.

Thus, a test of hypotheses can be described, briefly, as follows. We specify our hypotheses. We assume the null hypothesis is true. We collect and analyze data. Based on our analysis we select one of two options:

  • Stop assuming the null hypothesis is correct; this is referred to as rejecting the null hypothesis.
  • Continue to assume the null hypothesis is correct; this is referred to as failing to reject the null hypothesis.

Statisticians (among others) find it insightful to list all the possible consequences of selecting an option. In particular, for a test of hypotheses, we find the following 2 × 2 (read 2 by 2) table to be very helpful.

Decision             H0 is true         H1 is true
Reject H0            Type 1 error       Correct decision
Fail to reject H0    Correct decision   Type 2 error

The mean of the probability distribution of O1 is np1. Now p1 is unknown, which would be a huge problem except remember that we are assuming the null hypothesis is correct. With this assumption, p1 = p10, which is a known number. The mean of O1 becomes np10, which we can easily compute. This argument for outcome '1' can be extended to the other outcomes; the result is that the mean of the probability distribution for each Oi is npi0; again, these are all easily computable numbers.

Dating back to the gambling origins of probability theory, the mean was called the expected value. Because this is an old test (over 100 years old, as mentioned earlier) this older terminology is reflected in our notation and we denote the mean of Oi by Ei. Thus, Ei = npi0. At this point, it might help to introduce two specific examples.
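As a tiny illustration of Ei = npi0, here is a sketch, assuming NumPy, using the Mendelian snapdragon theory of Section 5.3 with n = 240 self-pollinations (the sample size that appears in Example 2 below):

```python
import numpy as np

n = 240
p0 = np.array([0.25, 0.50, 0.25])   # null hypothesis values for red, pink, white
E = n * p0                          # expected frequencies E_i = n * p_i0
print(E)                            # [ 60. 120.  60.]
```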

Example 1: An Electronic Die. My statistical software package, Minitab, claims to have a random number generator that can simulate a balanced die. I decided to investigate this claim. I had my computer generate 600 trials from its so-called balanced die. My observed and expected frequencies are in the table below.

Outcome   1     2     3     4     5     6
Oi        91    95    110   95    111   98
Ei        100   100   100   100   100   100

Example 2: Hypothetical Snapdragon Flowers. (Aside: I was disappointed when I searched the web for genetic data. I found sites that talked at length about how the Goodness of Fit Test is so important in genetics and then the example was... tossing a coin!) I will modify some data from published sources and hope that my modification avoids any lawsuits for infringement of copyrights! Suppose that George grows n = 240 snapdragons and obtains the data summarized in the following display.

Outcome   1 = Red   2 = Pink   3 = White
Oi        53        124        63
Ei        60        120        60

I am now reminded of the words of Yogi Berra: You can observe a lot by just watching. Please look at the O's and their corresponding E's. They do not all agree. (In fact, none of them agree.) This is not surprising; I simulated 10,000 data sets as I did above—600 casts of an i.i.d. die with the ELC true—and never obtained data for which all the O's equaled 100. In other words, the data almost always contain some evidence in support of the alternative hypothesis, even when the null hypothesis is true. Read this last sentence again.

Notice that I talk of evidence in support of the alternative. This is how statisticians talk. We never say evidence in support of or against the null. We never say evidence against the alternative. We say evidence in support of the alternative because we are already assuming the null is correct and are looking to see whether there is evidence in support of the alternative. Well, as I said, there is almost always some evidence in support of the alternative; what we are really looking to do is to determine whether the evidence in support of the alternative is sufficiently strong to convince us to reject the null hypothesis.

Let’s look at our data again. We have six O’s and six E’s for the die example and three of each for the flowers. Any discrepancy between an O and its E provides evidence in support of the alternative. In other words, I compare the O’s and the E’s to see whether they agree, almost agree, disagree somewhat, and so on. In mathematics a common way to compare two numbers is to subtract one from the other, and we do that here. In particular, for each possible outcome we compare the O and the E by calculating (O − E) and placing these values in our table for the die:

Outcome   1     2     3     4     5     6
Oi        91    95    110   95    111   98
Ei        100   100   100   100   100   100
Oi − Ei   −9    −5    10    −5    11    −2

Below is this table for the flowers:

Outcome   1 = Red   2 = Pink   3 = White
Oi        53        124        63
Ei        60        120        60
Oi − Ei   −7        4          3

If an O − E is 0, then there is no evidence in support of the alternative. We want to treat an O − E of, say, −10 as the same evidence as an O − E of +10. We can do this by taking the absolute value of O − E, but it turns out to be better to square the value. Thus, the values of (Oi − Ei)^2 are added to our table, first for the die and then for the flowers.

Outcome        1     2     3     4     5     6
Oi             91    95    110   95    111   98
Ei             100   100   100   100   100   100
Oi − Ei        −9    −5    10    −5    11    −2
(Oi − Ei)^2    81    25    100   25    121   4

Outcome        1 = Red   2 = Pink   3 = White
Oi             53        124        63
Ei             60        120        60
Oi − Ei        −7        4          3
(Oi − Ei)^2    49        16         9

Finally, we need to adjust for sample size because the values of (O − E)^2, even for a balanced die, will tend to be larger the more often we cast the die. We adjust for sample size by dividing each (O − E)^2 by E, which we add to our table, first for the die and then for the flowers. Also, we sum the values of (O − E)^2/E and call the total χ^2.

Outcome           1      2      3      4      5      6      Total
Oi                91     95     110    95     111    98     600
Ei                100    100    100    100    100    100    600
Oi − Ei           −9     −5     10     −5     11     −2     0
(Oi − Ei)^2       81     25     100    25     121    4
(Oi − Ei)^2/Ei    0.81   0.25   1.00   0.25   1.21   0.04   χ^2 = 3.56

Outcome           1 = Red   2 = Pink   3 = White   Total
Oi                53        124        63          240
Ei                60        120        60          240
Oi − Ei           −7        4          3           0
(Oi − Ei)^2       49        16         9
(Oi − Ei)^2/Ei    0.817     0.133      0.150       χ^2 = 1.100
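The arithmetic in these two tables is easy to automate; here is a sketch, assuming NumPy, that reproduces both values of χ^2:

```python
import numpy as np

def chi_squared(O, E):
    """Return the sum of (O - E)^2 / E over all possible outcomes."""
    O, E = np.asarray(O, dtype=float), np.asarray(E, dtype=float)
    return ((O - E) ** 2 / E).sum()

print(chi_squared([91, 95, 110, 95, 111, 98], [100] * 6))   # die: 3.56
print(chi_squared([53, 124, 63], [60, 120, 60]))            # flowers: 1.100
```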

http://www.stat.tamu.edu/~west/applets/chisqdemo.html

Once at the site, we type 5 in the degrees of freedom box, 0.05 in the lower right box and click on compute. For our snapdragon data, k = 3. For α = 0.05, the critical region is

χ^2 ≥ χ^2_0.05(2) = 5.991.

We now evaluate our data. For the die data, χ^2 = 3.56 is not in the critical region, so we do not reject the null hypothesis. For the snapdragons, χ^2 = 1.100 is not in the critical region, so we do not reject the null hypothesis. Let me do a few more examples.
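Here is a sketch, assuming SciPy, that reproduces the critical values and both decisions (isf, the inverse survival function, returns the value with area α to its right):

```python
from scipy.stats import chi2

alpha = 0.05
crit_die = chi2.isf(alpha, df=5)      # 11.070, for the die (k = 6, df = 5)
crit_flower = chi2.isf(alpha, df=2)   # 5.991, for the snapdragons (k = 3, df = 2)

print(3.56 >= crit_die)               # False: do not reject H0 for the die
print(1.100 >= crit_flower)           # False: do not reject H0 for the snapdragons
```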

Example 3: Casting my round-cornered blue die. In the lecture examples for Chapter 1, I presented data for 1,000 casts of my blue round-cornered die. I will do the Goodness of Fit Test for these data; computations are in the table below.

Outcome           1         2        3         4        5        6         Total
Oi                212       148      109       140      152      239       1000
Ei                166.7     166.7    166.7     166.7    166.7    166.7
Oi − Ei           45.3      −18.7    −57.7     −26.7    −14.7    72.3
(Oi − Ei)^2       2052.09   349.69   3329.29   712.89   216.09   5227.29
(Oi − Ei)^2/Ei    12.31     2.10     19.97     4.28     1.30     31.36     χ^2 = 71.32

If I again use α = 0.05, my critical region is χ^2 ≥ 11.07. For these data, χ^2 = 71.32 is in the critical region, so my decision is to reject the null hypothesis and conclude that my die is not balanced.

Example 4: Hypothetical fair-coin tosser. Bert plans to toss his favorite coin four times every day for n = 160 days. He is convinced that the coin is fair (i.e. the probability of a head is 0.5), but he wonders about memory/independence. If there is independence, then he has BT and the total number of heads on any given day will follow the Bin(4, 0.50) distribution. It is the appropriateness of this binomial that Bert wants to investigate.

The possible values of the number of heads on a day are, of course, 0, 1, 2, 3 and 4. You can check that if the binomial model is correct, then these outcomes have probabilities 1/16, 4/16, 6/16, 4/16 and 1/16, respectively. Bert's null is that these binomial probabilities are correct and his alternative is that they are not. Bert collects his data and obtains the numbers shown in the table below, which also presents all of his computations.

Outcome           0      1       2    3    4    Total
Oi                13     37      60   40   10   160
Ei                10     40      60   40   10   160
Oi − Ei           3      −3      0    0    0    0
(Oi − Ei)^2       9      9       0    0    0
(Oi − Ei)^2/Ei    0.90   0.225   0    0    0    χ^2 = 1.125

Because k = 5, df = 4; for α = 0.10 the critical region is

χ^2 ≥ χ^2_0.10(4) = 7.779.

Our χ^2 = 1.125 is not in the critical region, so the null hypothesis is not rejected.
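Bert's entire analysis can be reproduced in a few lines; a sketch assuming SciPy (scipy.stats.chisquare uses df = k − 1 by default, which is correct here because p = 0.50 was specified in the null hypothesis, not estimated from the data):

```python
import numpy as np
from scipy.stats import binom, chisquare

O = np.array([13, 37, 60, 40, 10])          # observed counts of 0, 1, 2, 3, 4 heads
E = binom.pmf(np.arange(5), 4, 0.5) * 160   # expected counts: 10, 40, 60, 40, 10
stat, p_value = chisquare(O, E)             # df = k - 1 = 4
print(stat, p_value)                        # 1.125 and 0.8903
```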

Example 5: Hypothetical free throw shooter. Imagine a basketball player named Shack. He is getting old, doesn’t run too well and has always been a poor free throw shooter. He decides to work on the last of these as follows:

Five times per day for the next 80 days he will shoot four free throws and count the number of successes that he achieves.

Thus, Shack will collect n = 400 numerical values, with each value being one of: 0, 1, 2, 3 or 4. Shack wants to use the Goodness of Fit Test to test whether these 400 values behave as if they come from a binomial distribution.

Note the difference between Examples 4 and 5. In Example 4, the analysis assumed that p = 0.50. For this analysis we assume that p is unknown. Our null hypothesis is that the probabilities follow the binomial distribution for m = 4 trials for some p. The alternative is that the binomial is not correct, for any value of p. First, we need to look at the data Shack collects. His O's are below.

Outcome   0    1     2     3    4
Oi        25   118   139   93   25

In order to proceed we need to use our data to estimate p. Shack shoots a total of 400(4) = 1600 free throws. From the table above, he obtains:

25(0) + 118(1) + 139(2) + 93(3) + 25(4) = 775 successes.

Thus, we estimate p by p̂ = 775/1600 = 0.4844. We calculate our E's using the Bin(4, 0.4844) distribution. The E's (probabilities times 400) for 0, 1, 2, 3 and 4 are: 28.3, 106.2, 149.7, 93.8 and 22.0, respectively. I will add these to our data and complete the computations of χ^2:

Outcome           0       1        2        3       4       Total
Oi                25      118      139      93      25      400
Ei                28.3    106.2    149.7    93.8    22.0    400
Oi − Ei           −3.3    11.8     −10.7    −0.8    3.0
(Oi − Ei)^2       10.89   139.24   114.49   0.64    9.00
(Oi − Ei)^2/Ei    0.385   1.311    0.765    0.007   0.409   χ^2 = 2.877

Next, we need the critical region. Well, we need the following fact. Let j denote the number of parameters that we must estimate in order to obtain the E’s. For the current example, j = 1.

Given that the null hypothesis is true, the sampling distribution of X^2 is approximated by the Chi-Squared curve with df = k − j − 1.
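Here is a sketch of Shack's complete analysis, assuming SciPy; the ddof argument of scipy.stats.chisquare subtracts the j estimated parameters, so the P-value comes from the χ^2(k − j − 1) curve:

```python
import numpy as np
from scipy.stats import binom, chisquare

O = np.array([25, 118, 139, 93, 25])         # counts of 0..4 successes over 400 days
n, m = O.sum(), 4                            # n = 400 days of m = 4 free throws
p_hat = (O * np.arange(5)).sum() / (n * m)   # 775 / 1600 = 0.4844
E = binom.pmf(np.arange(5), m, p_hat) * n    # 28.3, 106.2, 149.7, 93.8, 22.0

stat, p_value = chisquare(O, E, ddof=1)      # j = 1, so df = 5 - 1 - 1 = 3
print(stat, p_value)                         # 2.877 and 0.411
```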

The Attained Significance Level helps to reduce substantially the seriousness of these complaints. Recall Example 1, my electronic die. The observed value of the test statistic was χ^2 = 3.56. Recall also that I used α = 0.05 to obtain the critical region χ^2 ≥ 11.07. But suppose I had chosen for my critical region: χ^2 ≥ 3.56.

What would we conclude? Well, first it looks awfully suspicious to have just happened to choose a c for my critical region that exactly matches the observed value of χ^2. Let's ignore that for the moment. If I go to my Chi-Squared calculator, I find that the area under the χ^2(5) curve to the right of 3.56 is 0.6143. Thus, if I had selected α = 0.6143 then my critical region would have been χ^2 ≥ 3.56 and I would have just barely rejected the null. If I pick any α larger than 0.6143 then I would have a c smaller than 3.56 and I would reject the null; if I pick any α smaller than 0.6143 then I would have a c larger than 3.56 and I would fail to reject the null. Thus, I would reject the null if, and only if, my α ≥ 0.6143. In words, 0.6143 is the smallest α for which the null hypothesis would be rejected. This number, 0.6143, is the P-value for these data. Below are the P-values for the other four Examples above.

  • Example 2: Snapdragons. The observed χ^2 = 1.100 with df = 2. From the calculator, the area under χ^2(2) to the right of 1.100 is 0.5769. Thus, the P-value is 0.5769. We would reject the null if, and only if, our α ≥ 0.5769.
  • Example 3: Round-cornered blue die. The observed χ^2 = 71.32 with df = 5. From the calculator, the area under χ^2(5) to the right of 71.32 is 0.0000. Thus, the P-value is 0.0000. We would reject the null if, and only if, our α ≥ 0.0000. FYI, according to my computer software package the P-value is one in ten trillion.
  • Example 4: Fair-coin tosser. The observed χ^2 = 1.125 with df = 4. From the calculator, the area under χ^2(4) to the right of 1.125 is 0.8903. Thus, the P-value is 0.8903.
  • Example 5: Shack's free throws. The observed χ^2 = 2.877 with df = 3. From the calculator, the area under χ^2(3) to the right of 2.877 is 0.411. Thus, the P-value is 0.411.

The approach I have described earlier for a test of hypotheses is sometimes called the classical approach. It can be viewed as quite rigid: every analysis must end with a decision to reject or not. This reflects mathematics in two ways. First, every math problem ends in a solution and then we go on to the next math problem. The solution here is to reject or not. Second, for academic researchers who want to publish research papers, the rigidity of the classical approach is helpful for proving theorems and obtaining other mathematical results.

But science is much more dynamic than math. One hundred years ago many (most?) scientists believed the space between planets in our solar system was filled with 'ether.' If space = ether was a math result, well, then there would be ether. But space = ether was, presumably, a useful scientific theory until it was replaced by a better (more correct) one.

Before I offer an alternative to the classical approach, let me remind you of what the P-value does for us. If we decide to use the rigid 'reject or fail to reject' approach to tests of hypotheses, the P-value has the virtue of removing, to some extent, the arbitrariness of the choice of α in the

following way. By reporting the P-value the researcher allows the consumer to apply his/her own choice of α to the decision making process.

As I said above, science is more dynamic than math. Scientists may be less interested in a 'carved in stone' decision and more interested in evaluating the strength of the evidence in the data. The second interpretation of the P-value, given below, helps with this. As mentioned on page 59, the larger the value of χ^2, the stronger the evidence in support of the alternative. Thus, the P-value is the probability of the researcher obtaining the actual evidence or even stronger evidence. Remember, the probability is computed under the assumption that the null is correct. This is a little tricky: the smaller the P-value, the stronger the evidence in support of the alternative. For example, if one gets a P-value of 0.0001, this means that the probability of getting such strong (or stronger) evidence is one in ten-thousand. In other words, it is unlikely; thus, the evidence one has is very strong.

This second interpretation of the P-value helps sort out the problems with Researchers A, B, C and D introduced on page 62. With the help of the Chi-Squared calculator, we can obtain the following P-values for these researchers; recall that df = 5 for all of them.

Researcher   χ^2     P-value
A            2.02    0.8464
B            11.05   0.0504
C            11.08   0.0498
D            51.00   0.0000

If you go and reread my complaints about the reject/fail to reject approach to analysis given earlier, you will see that the P-value does a good job of answering them.

5.6 Some Loose Ends

We have been using the Chi-Squared curve to compute probabilities because, in the limit, it works. That is, using the Chi-Squared curve is an approximation. Is the approximation any good? To answer this, I first note that, as a practical matter, it is impossible to calculate exact probabilities. (I know, some people say 'Impossible is nothing,' but not for this!) We can, however, simulate the distribution of the test statistic X^2. In particular, I simulated 10,000 runs in which each run consisted of casting my balanced electronic die n = 600 times. Remember, for α = 0.05 the critical region is χ^2 ≥ 11.07, where the number 11.07 is obtained by using the Chi-Squared curve with df = 5 as an approximation to the sampling distribution of X^2. Each run consisted of:

  1. I had the computer generate (simulate) 600 casts of a balanced die.
  2. For the data just obtained, I calculated the value of χ^2 exactly as illustrated above.

Thus, each run resulted in a value of χ^2. I then sorted the 10,000 values of χ^2 and determined, by counting, that 489 of my simulated values were ≥ 11.07. Thus, the relative frequency of occurrence of χ^2 ≥ 11.07 was 0.0489. By the LLN, P(X^2 ≥ 11.07) is close to 0.0489. Thus, the Chi-Squared approximation appears to be quite good: the actual probability of landing in the critical region is close to the target value of α = 0.05.
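Here is a sketch of the simulation study just described, assuming NumPy (the seed is arbitrary, so your count of rejections will differ slightly from mine):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
runs, n = 10_000, 600
E = n / 6                                    # 100 casts expected per face under the ELC
rejections = 0
for _ in range(runs):
    casts = rng.integers(1, 7, size=n)       # one run: 600 casts of a balanced die
    O = np.bincount(casts, minlength=7)[1:]  # observed frequency of faces 1..6
    rejections += ((O - E) ** 2 / E).sum() >= 11.07
print(rejections / runs)                     # should be close to 0.05
```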

Remember that a test of hypotheses only tests some of what we assume. It turns out that we cannot test everything. For example, consider the Goodness of Fit Test for the ELC for an electronic die, like the data I provided with my Example 1. If we generate n = 600 casts and each side lands up exactly 100 times, it is correct to say that the Goodness of Fit Test finds no evidence in support of the alternative. But the Goodness of Fit Test does not examine the assumption of i.i.d. trials. Here are two extreme possibilities that the Goodness of Fit Test would not see.

  • Lack of independence. Suppose that the electronic die yields the sequence '1, 2, 3, 4, 5, 6' repeatedly. The trials are not independent, but our test of this chapter won't notice it.
  • Lack of i.d. Suppose that the first 100 casts yield all 1's; the next 100 casts yield all 2's; and so on. This would occur if the probabilities are changing but, again, the test of this chapter would not spot it.