Comparing Two Categorical Variables - Module 2 | POL 242, Exams of Political Science

Material Type: Exam; Professor: Toppen; Class: Scope and Methods; Subject: Political Science; University: Hope College; Term: Unknown 1989;

Uploaded on 08/07/2009

Module 2: Comparing Two Categorical Variables

In Module 1, we looked at describing a single variable; in Module 2, we will start to compare variables to one another and look for relationships between them. Before we make such a comparison, it is worth stressing the importance of random sampling and saying a few words about what makes a good hypothesis.

Random Sampling

Most of the data used by political scientists is a sample of the population they are trying to measure. A population is every single case of whatever the researcher wants to study. Usually it is not possible to obtain data on every member of a population, so a researcher uses a sample instead: a smaller set of cases drawn from the population. For a sample to be acceptable for statistical testing, it must be representative of the population as a whole. To help ensure that a sample is representative, researchers use a technique known as random sampling. The idea behind random sampling is to arrange all the members of the population in a list, randomly select a number of them, and obtain data from those selected. The key is that every member of the population must have an equal chance of being selected; if this weren't true, the sample wouldn't be truly random, and therefore would not be representative of the population. Most statistical tests, and all of those discussed in this guide, assume that the data you are testing come from a random sample and are truly representative of the population.

Suppose that a researcher wants to do a study on the American public. He obtains a list of the telephone numbers of every person in Alabama, Colorado, Indiana, Maine, and Oregon. He then randomly selects 1000 numbers and obtains the data he needs from them. Is this sample representative of his target population? No, of course not. Only people in those five states have a chance to be surveyed, whereas someone in Michigan has no chance at all. Therefore, the population for his sample is only the public in those five states, not the entire American public.

Now suppose that the same researcher is able to get a list of the phone numbers of everyone in America. To obtain his sample, he selects every 10,000th person on the list and surveys him or her. Is this an acceptable method for the researcher to use? Again, the answer is no. The method is not random: everyone on the list does not have an equal chance of being selected, because only every 10,000th person can ever be chosen. This selection is calculated rather than random.

The most popular way of getting a random sample is to use a table of random numbers (available in most statistics textbooks). Other methods of making selections random include rolling dice or using a computer's random-number generator. The data used in this guide can be assumed to come from random samples. GSS1998.dta and NES2000.dta are prime examples of random samples obtained by professional organizations. STATES.dta and WORLD.dta are examples of cases where it is possible to collect data on an entire population. Thus, we don't have to concern ourselves with whether these samples are representative.

Keep in mind that even representative samples contain some sampling error. The statistical tests that we will be exploring account for this error and let us draw conclusions about a relationship with a stated degree of certainty.
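The contrast between the two selection methods described above can be sketched in code. This is an illustrative Python sketch, not part of the guide; the function names and the stand-in phone list are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members so that every member has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def every_kth_sample(population, k):
    """Select every k-th member. This is calculated, not random: once k is
    fixed, most members have zero chance of ever being selected."""
    return population[::k]

phone_list = list(range(100_000))  # hypothetical stand-in for a national phone list

random_sample = simple_random_sample(phone_list, 1000, seed=42)
systematic = every_kth_sample(phone_list, 10_000)

print(len(random_sample))  # 1000 members, each drawn with equal probability
print(len(systematic))     # 10: only these 10 members could ever be chosen
```

Only the first method satisfies the equal-chance requirement that the statistical tests in this guide assume.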

A Good Hypothesis

Every empirical study starts with a good hypothesis. A hypothesis states the results that a researcher expects to obtain from testing. Usually a hypothesis has been well researched and involves a theory that the researcher hopes to support with data. A good hypothesis provides three important pieces of information about the study: the population, the variables involved, and the expected direction of the relationship. A rough skeleton of a good hypothesis is:

In comparing (insert population), those (cases) with a higher (independent variable) will have a (higher/lower) (dependent variable) than will those with a lower (independent variable).

An independent variable is the cause in the relationship, whereas the dependent variable is the effect of a change in the independent variable. On a graph, the x-axis represents the independent variable and the y-axis the dependent variable. A good way to distinguish them is to remember that the value of the dependent variable depends on the value of the independent variable. A few examples of good hypotheses are:

In comparing the states of the United States, those states with a higher high school graduation rate will have a higher voting rate than will those states with a lower high school graduation rate.

In comparing individuals in the United States, men are more likely to oppose same-sex marriages than are women.

Notice how the hypothesis is different in the second example to accommodate the fact that the independent variable is nominal. With some practice, you will learn to write good hypotheses that identify the population, variables, and direction of the relationship. When performing statistical tests, it is important to note that we do not test the hypothesis directly, but rather the null hypothesis. The null hypothesis is essentially the opposite of your hypothesis (your hypothesis is referred to as the alternative hypothesis and is represented by Ha). The null (represented by Ho) states that there is no relationship between the variables. In statistical testing, we seek not to prove the alternative, but rather to provide evidence against the null.

Comparing Two Categorical Variables

Now that we know how to judge whether our data come from a random sample and how to write a good hypothesis, we can begin to make comparisons between two variables. Depending on the level of measurement of the variables, there are different statistical tests we can employ to test for a relationship. In this module, you will learn to compare two categorical (nominal or ordinal) variables and see if they are related.

Figure 2.2: Gunlaw Opinions by Gender. Histograms of the percent who favor or oppose gun permits, graphed by respondent's sex (male and female).

Figure 2.3

Chi-Square Test of Significance

For two categorical variables, the statistics we need are found with a cross-tabulation procedure known as the chi-square test of significance. A cross-tab procedure creates a table of the frequencies of the data, with the values of the independent variable defining the columns and the values of the dependent variable defining the rows. Each cell of the table represents a particular combination of the variables' categories. Since these variables have two categories each, the table will have four cells (in addition to the row and column totals), one for each possible combination of the categories: men who favor gun laws, men who oppose gun laws, women who favor gun laws, and women who oppose gun laws.

Let's run this procedure so you can visualize the table. Go to Statistics>Summaries, tables & tests>Tables>Two-way tables with measures of association. Enter the independent variable (sex) in the column variable field and the dependent variable (gunlaw) in the row variable field. We want Stata to provide us with several pieces of data, so under Test statistics check the boxes next to Pearson's chi-squared, Kendall's tau-b, and Cramer's V. Under Cell contents, check Pearson's chi-squared, Within-column relative frequencies, and Expected frequencies. The window should now look like Figure 2.4. Click OK and view the output (Figure 2.5).
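The guide performs this procedure through Stata's menus. As a cross-check on what the cross-tab actually computes, here is a small Python sketch using hypothetical counts (not the GSS1998 figures): it builds the expected frequencies under the null hypothesis and sums the chi-square statistic cell by cell.

```python
# Hypothetical 2x2 table: columns = male, female; rows = favor, oppose.
# These counts are illustrative only, not the GSS1998 data used in the guide.
observed = [[60, 80],   # favor gun laws
            [40, 20]]   # oppose gun laws

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Under the null hypothesis, each cell's expected count is
# (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Each cell contributes (observed - expected)^2 / expected to the chi-square.
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

print(expected)              # [[70.0, 70.0], [30.0, 30.0]]
print(round(chi_square, 3))  # 9.524
```

The within-column relative frequencies Stata reports are just each cell's frequency divided by its column total.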

Figure 2.4

The key tells us that each cell contains four pieces of information: the frequency, the expected frequency, the chi-squared contribution, and the column percentage. Let's examine what these numbers mean by considering the first cell in the table (males who favor gun laws). The frequency is listed first and has a value of 599, which means that 599 of the people surveyed were male and favored gun laws. The last statistic listed is the column percentage (77.29), which means that 77.29% of the column (males, in this case) favor gun laws.

The other two pieces of data in the cell require a little more explanation. As we mentioned earlier, statistical analyses test the null hypothesis rather than the alternative hypothesis. In this example, the null hypothesis is that there is no relationship between "sex" and "gunlaw." Thus, we would expect the percentage of people who favor gun laws to be the same for both males and females. This is where the expected frequency comes in: it is the count each cell would contain if the null hypothesis were true, and the chi-squared contribution measures how far each cell's observed frequency departs from its expected frequency.

To find the significance of the relationship by hand, you need the chi-square value, the degrees of freedom, and a chi-square distribution table. The degrees of freedom for a table are given by (number of rows - 1) * (number of columns - 1). Since our table has two rows and two columns, its degrees of freedom = (2 - 1) * (2 - 1) = 1.

At the end of this guide is a chi-square distribution table. The values in the top row of this table (.25, .20, .15, etc.) are called critical values and serve as measures of statistical significance. To read the chart, find the row that corresponds to your degrees of freedom and read the values from left to right until they pass your chi-square value, then note which critical value that column falls under. For example, suppose we had a chi-square of 10 with 2 degrees of freedom. We read the second row of the table until the values jump from 9.21 (the .01 critical value) to 10.6 (the .005 critical value). We then select the larger of the two critical values, .01, to be cautious, and take that as our measure of statistical significance. In our sex-gunlaw table, the chi-square is 38.362 with one degree of freedom. This value is far off the right end of the chart, so we use the last available critical value (.0005).

Today, most statistical programs (including Stata) can provide the critical value of a chi-square for us. In fact, the programs go a step further and find the exact probability (rather than the static values on the chi-square distribution table). These values are commonly referred to as P-values (probability values). In our Stata output, the P-value for the sex-gunlaw relationship is 0.000, which isn't surprising given how far off the chart the chi-square value fell. P-values are the key to statistical significance.
P-values tell us how confident we can be that the relationship in the sample mirrors the population as a whole; in other words, how likely it is that the relationship in the sample happened "by chance." For a test to be considered statistically significant, the P-value has to be .05 or less. (This threshold is referred to as an alpha level. Some researchers require different alpha levels, but .05 is commonly accepted as the standard for statistical significance.) If we took many samples of the population, the P-value tells us the proportion of the time we would find a relationship as strong as the one we are observing if the null were true. By requiring the P-value to be .05 or less, we are saying that a relationship this strong would appear by chance only 5% of the time or less if the null were true, so the null hypothesis is likely to be wrong. In testing, we never actually prove our hypothesis; instead, we provide evidence against the null.

Thus, only tests with a P-value less than or equal to .05 are considered statistically significant. If a test has a P-value greater than .05, it is not statistically significant, and the results have very little use in the scientific community. (Prof. Toppen's article in the 1996 issue of Peace and Change, "Development and Human Rights: An Alternative Analysis," is an exception: his research found no relationship between development and human rights, whereas most other reports had found a significant relationship.)

Note: P-values are not specific to chi-square testing alone. They are used in most statistical procedures to assure the statistical significance of a relationship.
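The table lookup described above can also be checked numerically. The following Python sketch is not from the guide; it uses the closed-form chi-square tail probabilities that happen to exist for 1 and 2 degrees of freedom (a general-purpose routine would use a statistics library instead).

```python
from math import erfc, exp, sqrt

def chi2_pvalue(chi2, df):
    """P(chi-square > chi2), using closed forms for df = 1 and df = 2."""
    if df == 1:
        # A chi-square(1) variable is the square of a standard normal.
        return erfc(sqrt(chi2 / 2))
    if df == 2:
        # A chi-square(2) variable is exponential with mean 2.
        return exp(-chi2 / 2)
    raise ValueError("closed form implemented only for df = 1 or 2")

# The guide's worked example: chi-square = 10 with 2 degrees of freedom.
# The result lands between the .01 and .005 critical values, matching the table lookup.
print(round(chi2_pvalue(10, 2), 4))   # 0.0067

# The sex-gunlaw result: chi-square = 38.362 with 1 degree of freedom.
# The P-value is tiny, which Stata displays as 0.000.
print(chi2_pvalue(38.362, 1))
```

This also shows why exact P-values are more informative than the static critical values on the printed table.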

Measures of Association

After determining that a chi-square test is statistically significant, we need to describe the strength of the relationship it has identified. The strength of a relationship is given by statistics known as measures of association. The measures of association we are interested in for a chi-square test are Kendall's tau-b and Cramer's V, depending on the variables in the test.

Kendall's tau-b is used when both variables are ordinal and have the same number of categories (a "square" table). If the variables have a different number of categories, Kendall's tau-b changes to Kendall's tau-c in most statistical programs and books, but Stata will still refer to it as Kendall's tau-b. Kendall's tau statistics tell us both the direction and the strength of the association. The statistic has a magnitude between 0 and 1, though any magnitude over .5 is rare. Most statisticians consider a Kendall's tau value over .3 to be "strong" and anything below .1 to be "weak"; values between .1 and .3 are considered "moderate." The statistic can be either positive or negative, indicating the direction of the relationship. (However, the sign of the statistic can be reversed depending on how the variables are coded. Check your table to make sure that what is labeled a positive relationship really is positive before you report any values.)

Cramer's V is used when at least one of the variables is nominal. It measures only the strength, not the direction, of a relationship (because nominal variables have no direction). Since gender is a nominal variable, Cramer's V is appropriate for the sex-gunlaw relationship. Stata reports this statistic as -.1445, but we ignore the negative sign. Thus, there is a moderate relationship between gun law opinions and a person's gender.
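For reference, Cramer's V can be computed directly from the chi-square statistic. The sketch below is not from the guide; in particular, the sample size of 1837 is back-solved from the guide's reported chi-square (38.362) and V (.1445), since the guide does not report n directly.

```python
from math import sqrt

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))), ranging 0 to 1."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# For a 2x2 table, min(rows - 1, cols - 1) = 1, so V = sqrt(chi2 / n).
# n = 1837 is a back-solved assumption (roughly chi2 / V**2), not a figure from the guide.
v = cramers_v(38.362, 1837, 2, 2)
print(round(v, 4))  # 0.1445
```

By the thresholds given above, .1445 falls in the moderate (.1 to .3) range.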

Reporting the Results

A good report on your test will include the hypothesis, information on the variables you used, a P-value for statistical significance, the measure of association for any observed relationship, and an explanation in words. For the gunlaw-sex relationship, we might write:

In comparing individuals in the United States, females are more likely to favor laws requiring gun permits than are males. To test this hypothesis, we chose two variables from GSS1998.dta, "sex" and "gunlaw," and performed a chi-square test of significance on the data. "Sex" records a respondent's gender and "gunlaw" records whether or not the person supports gun laws. The P-value for this relationship is 0.000 (chi-square = 38.362 with 1 degree of freedom), and Cramer's V is .1445. Therefore, with great confidence, we can conclude that gender has a moderate relationship with gun law opinions; specifically, women are more likely to favor gun laws than are men.

Of course, there are other pieces of information you could add. For instance, it would be wise to report summary statistics on each variable and the percentage of each gender favoring gun laws. You might also include the histogram we created earlier and a copy of the chi-square table (you can easily copy and paste graphs into Microsoft Word, but you would have to recreate the table in Excel, because information doesn't copy cleanly out of the Results window in Stata). In short, the more information you include, the more valuable your report becomes.

Figure 2.7

Exercises for Module 2

  1. A researcher wants to study a possible relationship between gender and attitudes toward homosexual marriage among the people of the United States.
     a. Write a possible hypothesis for a research paper.
     b. What is the null hypothesis?
     c. Briefly explain how she could gather data suitable for statistical analysis.
  2. A researcher has two variables she is analyzing. One is named "school" and measures the amount of education a person has, from 1 (less than high school) to 4 (graduate school). The other variable is "aaschool," which measures a person's support for affirmative action in college admissions (1 means "oppose" and 2 means "support"). The researcher's hypothesis is: In comparing individuals, people with a higher level of education are more likely to support affirmative action in college admissions than are people with less education.
     a. Identify the dependent and independent variables.
     b. What is the appropriate measure of association for this relationship: Kendall's tau-b, tau-c, or Cramer's V?
  3. In GSS1998.dta, create a histogram and run a chi-square using "homosex" as the dependent variable and "income3" as the independent variable. Write up the results of the analysis. Include a hypothesis for the relationship and provide the P-value, chi-square value, and appropriate measure of association. Explain what the results mean "in plain English."