Understanding Correlation: Interpreting Scatterplots and Correlation Coefficients, Lecture notes of Statistics

An introduction to correlation, explaining the concept of an explanatory variable, displaying relationships through scatterplots, and interpreting correlation coefficients. It covers various relationships, including positive and negative associations, and provides examples of correlation in different contexts. The document also includes a caution about the difference between correlation and causation.

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

theeconomist1


3.1.1 Explanatory and Response Variables

We think that car weight helps explain accident deaths and that smoking influences life expectancy. In these relationships, the two variables play different roles. Accident death rate and life expectancy are the response variables of interest. Car weight and number of cigarettes smoked are the explanatory variables.

Response variable - A response variable measures an outcome of a study.

Explanatory variable - An explanatory variable may help explain or influence changes in a response variable.

You will often see explanatory variables called independent variables, and response variables called dependent variables. Because the words “independent” and “dependent” have other meanings in statistics, we won’t use them here.

Example – Linking SAT Math and Critical Reading Scores
Explanatory or Response

Julie asks, “Can I predict a state’s mean SAT Math score if I know its mean SAT Critical Reading score?” Jim wants to know how the mean SAT Math and Critical Reading scores this year in the 50 states are related to each other.

PROBLEM: For each student, identify the explanatory variable and the response variable if possible.

In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. However, other explanatory-response relationships don’t involve direct causation. There is no cause-and-effect relationship between SAT Math and Critical Reading scores. Because the scores are closely related, we can still use a state’s mean SAT Critical Reading score to predict its mean Math score.

CHECK YOUR UNDERSTANDING

Identify the explanatory and response variables in each setting.

  1. How does drinking beer affect the level of alcohol in our blood? The legal limit for driving in all states is 0.08%. In a study, adult volunteers drank different numbers of cans of beer. Thirty minutes later, a police officer measured their blood alcohol levels.
  2. The National Student Loan Survey provides data on the amount of debt for recent college graduates, their current income, and how stressed they feel about college debt. A sociologist looks at the data with the goal of using amount of debt and income to explain the stress caused by college debt.

Example – Heavy Backpacks
Making a Scatterplot

Ninth-grade students at the Webb Schools go on a backpacking trip each fall. Students are divided into hiking groups of size 8 by selecting names from a hat. Before leaving, students and their backpacks are weighed. Here are data from one hiking group in a recent year:

PROBLEM: Make a scatterplot of the relationship between body weight and pack weight.

SOLUTION: We follow the steps described earlier to make the scatterplot.

  1. Decide which variable should go on each axis. The weight a student can carry depends on his body weight. So we’ll use body weight as the explanatory variable (x axis) and backpack weight as the response variable (y axis).
  2. Label and scale your axes. We labeled the x axis “Body weight (lb)” and the y axis “Pack weight (lb).” Since the hikers’ body weights range from 103 to 187 pounds, we chose a horizontal scale starting at 100 pounds, with tick marks every 10 pounds. The hikers’ pack weights vary from 24 to 35 pounds, so we chose a vertical scale starting at 20 pounds, with tick marks every 2 pounds.
  3. Plot individual data values. The first student in the group weighs 120 pounds and his pack weighs 26 pounds. We plot this point directly above 120 on the horizontal axis and to the right of 26 on the vertical axis, as shown. For the second student in the group, who weighs 187 pounds, we add the corresponding point to the graph. By adding the points for the remaining six students in the group, we get the completed scatterplot.
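
The axis-scaling step can be sketched in a few lines of Python. This is only an illustration: the helper name `axis_ticks` is hypothetical, and the only numbers used are the ranges, starting values, and tick spacings quoted in the steps above, plus the first student's point (120, 26).

```python
def axis_ticks(start, data_max, step):
    """Return tick positions from `start` up to the first tick at or beyond `data_max`."""
    ticks = [start]
    while ticks[-1] < data_max:
        ticks.append(ticks[-1] + step)
    return ticks

# Ranges quoted in the example: body weights 103-187 lb, pack weights 24-35 lb.
x_ticks = axis_ticks(start=100, data_max=187, step=10)  # horizontal axis, body weight (lb)
y_ticks = axis_ticks(start=20, data_max=35, step=2)     # vertical axis, pack weight (lb)

# The first student's point (120, 26) should fall inside both scales.
first_student = (120, 26)
inside = (x_ticks[0] <= first_student[0] <= x_ticks[-1]
          and y_ticks[0] <= first_student[1] <= y_ticks[-1])

print(x_ticks)
print(y_ticks)
print(inside)
```

Starting each scale a little below the smallest data value, as the notes do, keeps every point away from the edge of the plot.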

3.1.3 Interpreting Scatterplots

To interpret a scatterplot, follow the basic strategy of data analysis from Chapter 1 and Chapter 2: look for patterns and important deviations from those patterns. Let’s take a close look at the scatterplot below. What do we see?

  • The graph shows a clear direction: the overall pattern moves from upper left to lower right. That is, states in which higher percents of high school graduates take the SAT tend to have lower mean SAT Math scores. We call this a negative association between the two variables.
  • The form of the relationship is slightly curved. More important, most states fall into one of two distinct clusters. In about half of the states, 25% or fewer graduates took the SAT. In the other half, more than 40% took the SAT.
  • The strength of a relationship in a scatterplot is determined by how closely the points follow a clear form. The overall relationship in the above figure is moderately strong: states with similar percents taking the SAT tend to have roughly similar mean SAT Math scores.
  • Two states stand out in the scatterplot: West Virginia at (19, 501) and Maine at (87, 466). These points can be described as outliers since they fall outside the overall pattern.

What explains the clusters? There are two widely used college entrance exams, the SAT and the American College Testing (ACT) exam. Each state favors one or the other. The ACT states cluster at the left of the above figure and the SAT states at the right. In ACT states, most students who take the SAT are applying to a selective college that prefers SAT scores. This select group of students has a higher mean score than the much larger group of students who take the SAT in SAT states.

How to Examine a Scatterplot

As in any graph of data, look for the overall pattern and for striking departures from that pattern.
  • You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
  • An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.

So far, we’ve seen relationships with two different directions. Backpack weight generally increases as body weight increases (positive association). The mean SAT score goes down as the percent of graduates taking the test increases (negative association). Let’s give a careful definition for these terms.

Positive Association - Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together.

Negative Association - Two variables have a negative association when above-average values of one tend to accompany below-average values of the other.

Of course, not all relationships have a clear direction that we can describe as a positive association or a negative association. The next example, however, illustrates a strong positive association with a simple and important form.

Example – The Endangered Manatee
Pulling it all together

Manatees are large, gentle, slow-moving creatures found along the coast of Florida. Many manatees are injured or killed by boats. The table below contains data on the number of boats registered in Florida (in thousands) and the number of manatees killed by boats for the years 1977 to 2007.

STATE: What is the relationship between the number of manatees killed and the number of registered boats?

PLAN: First, we’ll make a scatterplot with “boats registered” as the explanatory variable and “manatees killed” as the response variable. Then we’ll describe the direction, form, and strength of the relationship and identify any outliers.

DO: The figure below is our completed scatterplot. There is a positive association: more boats registered goes with more manatees killed. The form of the relationship is linear. That is, the overall pattern follows a straight line from lower left to upper right. The relationship is strong because the points don’t deviate greatly from a line. There are no obvious outliers.

CONCLUDE: As more boats are registered, the number of manatees killed by boats goes up linearly.

The previous example deserves an important caution: association does not imply causation. Although the scatterplot shows a strong linear relationship between the variables, we can’t conclude that the increase in manatee deaths was caused by the change in boat registrations. Always ask what other variables lurking in the background might contribute to the relationship between two variables. Because both boats registered and manatees killed are recorded year by year, any change in conditions over time might affect the relationship. For example, if boats in Florida have tended to go faster over the years, that might result in more manatees killed by the same number of boats.

3.1.4 Measuring Linear Association: Correlation

A scatterplot displays the direction, form, and strength of the relationship between two quantitative variables. Linear relationships are particularly important because a straight line is a simple pattern that is quite common. A linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line.

Unfortunately, our eyes are not good judges of how strong a linear relationship is. The two scatterplots in the figure below show the same data, but the graph on the right is drawn smaller in a large field. The right-hand graph seems to show a stronger linear relationship. Since it’s easy to be fooled by different scales or by the amount of space around the cloud of points in a scatterplot, we need to use a numerical measure to supplement the graph. Correlation is the measure we use.

Correlation - The correlation r measures the direction and strength of the linear relationship between two quantitative variables.

The correlation r is always a number between −1 and 1. Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r < 0 for a negative association. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either −1 or 1. The extreme values r = −1 and r = 1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.

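The figure below shows scatterplots that correspond to various values of r. To make the meaning of r clearer, the standard deviations of both variables in these plots are equal, and the horizontal and vertical scales are the same. The correlation describes the direction and strength of the linear relationship in each graph.

How to calculate correlation r

Suppose that we have data on variables x and y for n individuals. The values for the first individual are $x_1$ and $y_1$, the values for the second individual are $x_2$ and $y_2$, and so on. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values. The correlation r between x and y is

$$r = \frac{1}{n-1}\left[\left(\frac{x_1-\bar{x}}{s_x}\right)\left(\frac{y_1-\bar{y}}{s_y}\right) + \left(\frac{x_2-\bar{x}}{s_x}\right)\left(\frac{y_2-\bar{y}}{s_y}\right) + \cdots + \left(\frac{x_n-\bar{x}}{s_x}\right)\left(\frac{y_n-\bar{y}}{s_y}\right)\right]$$

or more compactly,

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$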
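
A minimal Python sketch of this calculation follows. It assumes the standard sample correlation: r is the sum of the products of the standardized x- and y-values, divided by n − 1, with standard deviations that also use the n − 1 divisor. The function name `correlation` and all data values are hypothetical, chosen only for illustration; none of them come from the notes.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [3, 5, 7, 9, 11])      # points exactly on a rising line
r_down = correlation(x, [10, 8, 6, 4, 2])    # points exactly on a falling line
r_scatter = correlation(x, [2, 1, 4, 3, 5])  # rising trend with some scatter

print(round(r_up, 6), round(r_down, 6), round(r_scatter, 6))
```

The two perfectly linear data sets sit at the extreme values of r, while the scattered data set gives a positive value strictly between 0 and 1.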

Be careful: a value of r close to 1 or −1 does not guarantee a linear relationship between two variables. A scatterplot with a clear curved form can have a correlation that’s near 1 or −1. Always plot your data!

What does correlation measure?

The calculator screen shots below provide more detail. At the left is a scatterplot of the hiker data with two lines added: a vertical line at the group’s mean body weight and a horizontal line at the mean pack weight of the group. Most of the points fall in the upper-right or lower-left “quadrants” of the graph. That is, hikers with above-average body weights in the group tend to have above-average pack weights, and hikers with below-average body weights tend to have pack weights that are below average. This confirms the positive association between the variables.

On the right is a scatterplot of the standardized scores. To get this graph, we transformed both the x- and the y-values by subtracting their mean and dividing by their standard deviation. Standardizing a data set converts the mean to 0 and the standard deviation to 1. That’s why the vertical and horizontal lines in the right-hand graph are both at 0. Notice that all the products of the standardized values will be positive except for one: the hiker who had slightly below-average weight in the group but a pack weight that was slightly above average. His point appears in the upper-left “quadrant” of both graphs.
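
Two claims from this passage can be checked numerically with a short Python sketch. The data are an assumption made for illustration (the curve y = x² for x = 1 to 10), not the hiker data, and the helper names `standardize` and `correlation` are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Sum of products of standardized values, divided by n - 1."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Illustrative curved data: the points lie exactly on y = x^2, not on a line.
x = list(range(1, 11))
y = [v * v for v in x]

# Claim 1: a clearly curved pattern can still have r close to 1.
r_curved = correlation(x, y)

# Claim 2: standardized values have mean 0 and standard deviation 1.
zx, zy = standardize(x), standardize(y)
z_mean, z_sd = mean(zx), stdev(zx)

# Points whose product of standardized values is negative lie in the
# upper-left or lower-right "quadrant" relative to the two means.
negative_products = sum(1 for a, b in zip(zx, zy) if a * b < 0)

print(round(r_curved, 3), round(z_mean, 6), round(z_sd, 6), negative_products)
```

Here r comes out above 0.97 even though no straight line describes the data, which is why the scatterplot has to be checked alongside the number.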

CHECK YOUR UNDERSTANDING

The scatterplots below show four sets of real data: (a) repeats the manatee plot; (b) shows the number of named tropical storms and the number predicted before the start of hurricane season each year between 1984 and 2007 by William Gray of Colorado State University; (c) plots the healing rate in micrometers (millionths of a meter) per hour for the two front limbs of several newts in an experiment; and (d) shows stock market performance in consecutive years over a 56-year period.

  1. For each graph, estimate the correlation r. Then interpret the value of r in context.
  2. The scatterplot in (b) contains an outlier: the disastrous 2005 season, which had 27 named storms, including Hurricane Katrina. What effect would removing this point have on the correlation? Explain.