



STA138 Handout lecture notes very detailed
Typology: Lecture notes
Let us begin with a few simple examples about variables.
Example 1. A study was conducted to examine the distribution of educational level (years of schooling completed) among residents of a large city who are over 40 years old. A random sample of 200 individuals (over 40 years of age) was taken and their educational levels were recorded. [The variable is educational level.]
Example 2. Another analyst used the same data as in Example 1, and summarized the data into two categories - those with a high school degree (no college degree) and those with a college degree. It turns out that in the sample 71 had a college degree. [Years of schooling has now been binned into two categories.]
Example 3. Yet another analyst binned the data in Example 1 into three groups, and the counts are given in the following table.

Educational level    less than high school    high school    college
Count                         38                   91           71

In this case, years of schooling has been binned into three categories.
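The relationship between Examples 2 and 3 can be checked with a few lines of Python. This is only an illustrative sketch: the counts come from the table in Example 3, while the dictionary names and the label "no college" are assumptions introduced here.

```python
# Counts from the three-category table in Example 3 (sample of 200)
counts = {"less than high school": 38, "high school": 91, "college": 71}
n = sum(counts.values())

# Sample proportion falling in each ordered category
proportions = {level: c / n for level, c in counts.items()}

# Collapse to a binary variable as in Example 2: college degree or not
# (the label "no college" is ours, not the handout's)
binary = {"college": counts["college"], "no college": n - counts["college"]}

print("sample size:", n)
for level, p in proportions.items():
    print(f"{level}: {p:.3f}")
print("binary counts:", binary)
```

The collapsed table reproduces the 71 college graduates reported in Example 2, with the remaining 129 individuals in the other category.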
Example 4. In a large metropolitan area in California, a random sample of 400 adult residents was taken, and the counts are given below.

Race/Ethnicity    Anglo    Hispanic    Afro-American    Others
Count               87       117           101            95

Here the variable 'ethnicity' has 4 categories. We may be interested in testing if the four groups are equally present in this metropolitan area (i.e., each is 25%).
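One standard way to test the equal-shares hypothesis in Example 4 is Pearson's chi-square goodness-of-fit test. The handout has not yet said which test it will use, so the sketch below is an assumption about the method; the critical value 7.815 is the upper 5% point of the chi-square distribution with 3 degrees of freedom, taken from standard tables.

```python
# Observed counts from Example 4 (n = 400)
observed = {"Anglo": 87, "Hispanic": 117, "Afro-American": 101, "Others": 95}

n = sum(observed.values())
expected = n / len(observed)   # 100 per group if each share is 25%

# Pearson statistic: sum over categories of (observed - expected)^2 / expected
chi_sq = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

# Upper 5% point of chi-square with 3 df (standard table value)
CRITICAL_5PCT_DF3 = 7.815
reject = chi_sq > CRITICAL_5PCT_DF3

print(f"chi-square = {chi_sq:.2f} on {df} df; reject at 5% level: {reject}")
```

The statistic works out to about 4.84, which is below 7.815, so under this test the data do not contradict equal shares at the 5% level.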
Discussion
Note that in Example 1, the variable 'years of schooling' is quantitative, whereas in Examples 2 and 3 the variable 'educational level' is qualitative. In Example 1, we may use a plot such as a histogram in order to examine the distribution of years of schooling. In Example 2, we have a binary variable (or a dummy or indicator variable) as it has only two categories. In Example 3, the variable 'educational level' is qualitative with three categories, and there is an ordering of the categories. In Example 4, the variable 'ethnicity' has four categories, but there is no ordering in the categories.

Categorical Data Analysis deals with data where the variables are qualitative (i.e., categorical). This course is concerned with concepts, methods and analysis of those data sets where the variables are categorical. Here are some technical names for the qualitative variables listed above:
(i) Binary (Example 2).
(ii) Nominal (Example 4), categories are not ordered.
(iii) Ordinal (Example 3), categories are ordered.
Note that binning a quantitative variable leads to an ordinal qualitative variable. Statistical methods developed for ordinal variables should not be used for nominal variables.
More Examples for future discussion
Before we get into some necessary technical details, let us look at a few more examples which we will deal with later in the course.
Example 5. In order to investigate the association between smoking and lung cancer, 125 lung cancer patients (thought of as a random sample of lung cancer patients) and 150 controls (thought of as a random sample of people without lung cancer) were taken. Both samples were taken from the same large metropolitan area. The following table provides a summary of the counts. [Note that this data is from 1953.]

                  Smoking Habit
               Smoker    Nonsmoker    Total
Cancer           120          5        125
Cancer-free      126         24        150
Total            246         29        275

Note that we have two independent samples from the two populations (cancer patients and cancer-free residents), and the categorical variable is smoking habit (smoker and non-smoker). We may be interested in finding out the difference in the proportion of smokers in the cancer and control groups. Note that the researcher conducting the study decided what the sample sizes from the two populations would be, and thus the row totals are known in advance. [Technical note: this is a case of independent Binomial samples (also called product Binomial samples).]
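The difference in smoking proportions in Example 5 can be computed directly from the table. The sketch below also attaches a large-sample (Wald) 95% interval; that choice of interval is an assumption made here for illustration, not necessarily the method the course adopts, and with only 5 nonsmokers among the cancer patients the normal approximation is rough.

```python
import math

# Counts from Example 5: smokers and sample sizes in each group
smokers = {"cancer": 120, "cancer_free": 126}
sizes = {"cancer": 125, "cancer_free": 150}

# Sample proportion of smokers in each independent Binomial sample
p_cancer = smokers["cancer"] / sizes["cancer"]
p_control = smokers["cancer_free"] / sizes["cancer_free"]
diff = p_cancer - p_control

# Standard error of the difference of two independent proportions
se = math.sqrt(p_cancer * (1 - p_cancer) / sizes["cancer"]
               + p_control * (1 - p_control) / sizes["cancer_free"])

# Wald 95% interval (1.96 is the upper 2.5% point of the standard normal)
lower, upper = diff - 1.96 * se, diff + 1.96 * se

print(f"smokers among cancer patients: {p_cancer:.2f}")
print(f"smokers among controls:        {p_control:.2f}")
print(f"difference: {diff:.2f}, 95% Wald interval: ({lower:.3f}, {upper:.3f})")
```

The proportions are 0.96 and 0.84, a difference of 0.12, and the interval computed this way lies entirely above zero.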
Example 6. Is political ideology dependent on opinion on death penalty for murders? A random sample of 250 adults is taken in a state and the summary of counts is given below.

                                 Ideology
Supports death penalty    Liberal    Conservative    Total
Yes                          37           51           88
No                           76           86          162
Total                       113          137          250

Note that there are two qualitative variables: ideology and opinion on death penalty. It is of interest to test if ideology is independent of opinion on death penalty. Note that in this study, each person in the sample has been asked her/his political ideology and opinion on death penalty, thus the row totals (and column totals) were not known in advance. This is in contrast with Example 5, where the row totals were known in advance. [Technical note: this can be considered a case of Multinomial sampling scheme with four categories: Yes & Liberal, Yes & Conservative, No & Liberal, and No & Conservative.]
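A common way to test independence in a table like the one in Example 6 is Pearson's chi-square test, which compares each observed count with the count expected if the two variables were independent (row total times column total, divided by the sample size). The sketch below assumes that test; the critical value 3.841 is the upper 5% point of the chi-square distribution with 1 degree of freedom, from standard tables.

```python
# Observed counts from Example 6: rows = supports death penalty (Yes, No),
# columns = ideology (Liberal, Conservative)
table = [[37, 51],
         [76, 86]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count under independence: (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic summed over all four cells
chi_sq = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper 5% point of chi-square with 1 df (standard table value)
CRITICAL_5PCT_DF1 = 3.841
reject = chi_sq > CRITICAL_5PCT_DF1

print(f"chi-square = {chi_sq:.3f} on {df} df; reject at 5% level: {reject}")
```

The statistic is roughly 0.55, well below 3.841, so this test gives no evidence against independence at the 5% level.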
Example 7. We have the data on the verdicts in cases of convicted murderers in a certain state since 1980.

                    Death Penalty
Race of Victim      Yes      No     Total
White                45      85      130
Black                14     218      232
Total                59     303      362

We have two qualitative variables, each with two categories: victim's race and pronouncement of death penalty. It is of interest to investigate if the rate of death penalty is higher for white victims than for black victims. Note that unlike in Examples 5 and 6, neither the sample size nor the row and column totals are decided
schooling of young adults independent of parents' status (also known as test of independence). Though the sampling schemes in Examples 8 and 9 are different, the procedures for testing these hypotheses turn out to be the same.
Example 10. Suppose that we are working with some doctors on heart attack patients. The dependent variable is whether the patient has had a second heart attack within 1 year (yes = 1). We have two independent variables: one is whether the patient completed a treatment consisting of anger control practices (yes = 1), and the other is a score on a trait anxiety scale (a higher score means more anxious).

Person    2nd heart attack    Treatment of anger    Trait anxiety
   1             1                    1                  70
   2             1                    1                  80
   3             1                    1                  50
   4             1                    0                  60
   5             1                    0                  40
   6             1                    0                  65
   7             1                    0                  75
   8             1                    0                  80
   9             1                    0                  70
  10             1                    0                  60
  11             0                    1                  65
  12             0                    1                  50
  13             0                    1                  45
  14             0                    1                  35
  15             0                    1                  40
  16             0                    1                  50
  17             0                    0                  55
  18             0                    0                  45
  19             0                    0                  50
  20             0                    0                  60

A goal of the analysis would be to investigate (and model) how the probability of a second heart attack within a year of the first attack depends on trait anxiety and treatment of anger. Note that the response (dependent) variable is qualitative (binary), and the independent variables are treatment of anger (binary) and trait anxiety (quantitative). We will discuss this using what is known as a logistic regression model.
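To make the idea of a logistic regression model concrete, here is a minimal sketch that fits P(second attack) = 1 / (1 + exp(-(b0 + b1*treatment + b2*anxiety))) to the 20 patients in Example 10 by maximum likelihood. The fitting routine (plain Newton-Raphson with a small hand-written linear solver) and all function names are assumptions chosen to keep the example self-contained; in practice a statistical package would be used, and the course may present the fitting differently.

```python
import math

# (second_attack, anger_treatment, trait_anxiety) for persons 1..20 in Example 10
data = [
    (1, 1, 70), (1, 1, 80), (1, 1, 50), (1, 0, 60), (1, 0, 40),
    (1, 0, 65), (1, 0, 75), (1, 0, 80), (1, 0, 70), (1, 0, 60),
    (0, 1, 65), (0, 1, 50), (0, 1, 45), (0, 1, 35), (0, 1, 40),
    (0, 1, 50), (0, 0, 55), (0, 0, 45), (0, 0, 50), (0, 0, 60),
]
y = [row[0] for row in data]
# Design matrix columns: intercept, treatment indicator, anxiety score
X = [[1.0, float(row[1]), float(row[2])] for row in data]

def linear_predictor(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def prob(beta, x):
    """Modelled probability of a second heart attack for covariates x."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(beta, x)))

def log_lik(beta):
    """Bernoulli log-likelihood: sum of y*eta - log(1 + exp(eta))."""
    return sum(yi * linear_predictor(beta, x)
               - math.log1p(math.exp(linear_predictor(beta, x)))
               for x, yi in zip(X, y))

def score(beta):
    """Gradient of the log-likelihood: sum of (y - p) * x."""
    g = [0.0, 0.0, 0.0]
    for x, yi in zip(X, y):
        resid = yi - prob(beta, x)
        for j in range(3):
            g[j] += resid * x[j]
    return g

def information(beta):
    """Negative Hessian of the log-likelihood: sum of p(1-p) * x x^T."""
    h = [[0.0] * 3 for _ in range(3)]
    for x in X:
        p = prob(beta, x)
        w = p * (1.0 - p)
        for j in range(3):
            for k in range(3):
                h[j][k] += w * x[j] * x[k]
    return h

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

# Newton-Raphson: beta <- beta + information^{-1} * score, starting from zero
beta = [0.0, 0.0, 0.0]
for _ in range(50):
    step = solve(information(beta), score(beta))
    beta = [b + s for b, s in zip(beta, step)]
    if max(abs(s) for s in step) < 1e-10:
        break

print("intercept, treatment, anxiety coefficients:", [round(b, 3) for b in beta])
print("log-likelihood at the fit:", round(log_lik(beta), 3))
```

A positive anxiety coefficient means the modelled probability of a second attack rises with the anxiety score, holding treatment fixed; the sign of the treatment coefficient indicates whether anger treatment is associated with a lower or higher modelled probability at a given anxiety level.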
Example 11. (Effect of anti-epilepsy drug) In order to study the anti-epilepsy drug progabide, researchers randomly assigned 59 patients suffering from epileptic seizures to receive either progabide or a placebo, in addition to standard chemotherapy. The data below list the number of epileptic seizures in the 8 weeks prior to administration of the treatment, the number of seizures in the 8 weeks after the start of the treatment, the treatment indicator (control=0, progabide=1), and the age (in years) of each patient.

Treatment    Age    Pretreatment count    Posttreatment count
    0         31            11                    14
    0         30            11                    14
    0         25             6                    11
    0         36             8                    13
   ...

Here the main goal is to check if progabide is effective in reducing the incidence of epileptic seizures. This data will be analyzed using a Poisson regression model.