Understanding Data Variability: An Introduction to Box Plots and Quartiles, Study notes of Data Analysis & Statistical Methods

Northwestern State University of Louisiana (NSU), Data Analysis & Statistical Methods

An introduction to data analysis using box plots and quartiles. It explains how to construct a box plot, identify the median and interquartile range, and detect outliers. The document also includes exercises to practice these concepts.

Typology: Study notes

2021/2022

Uploaded on 09/12/2022

yorket 🇺🇸

28 CHAPTER 1. INTRODUCTION TO DATA

1.6.5 Box plots, quartiles, and the median

A box plot summarizes a data set using five statistics while also plotting unusual observations. Figure 1.25 provides a vertical dot plot alongside a box plot of the num_char variable from the email50 data set.

[Figure 1.25 (image not reproduced). Vertical axis: Number of Characters (in thousands), 0 to 60. Labeled elements: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers.]

Figure 1.25: A vertical dot plot next to a labeled box plot for the number of characters in 50 emails. The median (6,890) splits the data into the bottom 50% and the top 50%, marked in the dot plot by horizontal dashes and open circles, respectively.

The first step in building a box plot is drawing a dark line denoting the median, which splits the data in half. Figure 1.25 shows 50% of the data falling below the median (dashes) and the other 50% falling above the median (open circles). There are 50 character counts in the data set (an even number), so the data are perfectly split into two groups of 25. We take the median in this case to be the average of the two observations closest to the 50th percentile: (6,768 + 7,012)/2 = 6,890. When there are an odd number of observations, there will be exactly one observation that splits the data into two halves, and in this case that observation is the median (no average needed).

Median: the number in the middle

If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.
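The median rule above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's own code: the function name `median` and the toy list are assumptions, and only the two middle values 6,768 and 7,012 come from the text.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one observation sits in the middle.
        return ordered[mid]
    # Even count: average the two observations in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


# The two middle observations quoted in the text for the 50 emails.
print(median([6768, 7012]))  # (6,768 + 7,012) / 2

# Hypothetical toy data with an odd number of observations.
print(median([3, 1, 2]))
```

Python's standard library offers `statistics.median`, which follows the same even/odd convention.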

The second step in building a box plot is drawing a rectangle to represent the middle 50% of the data. The total length of the box, shown vertically in Figure 1.25, is called the interquartile range (IQR, for short). It, like the standard deviation, is a measure of variability in data. The more variable the data, the larger the standard deviation and IQR. The two boundaries of the box are called the first quartile (the 25th percentile, i.e. 25% of the data fall below this value) and the third quartile (the 75th percentile), and these are often labeled Q1 and Q3, respectively.
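As a rough sketch, the quartiles and the IQR can be computed with Python's standard `statistics` module. The text does not say which percentile convention it uses, and software packages differ on small data sets, so the `method="inclusive"` choice, the helper name `quartiles_and_iqr`, and the toy data below are all assumptions made for illustration.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, median, Q3, IQR) for a sequence of numbers.

    Uses the 'inclusive' interpolation method; other conventions can
    give slightly different quartiles for the same data.
    """
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, med, q3, q3 - q1


# Hypothetical toy data, not the email50 data set.
toy = [1, 2, 3, 4, 5, 6, 7, 8]
q1, med, q3, iqr = quartiles_and_iqr(toy)
print(q1, med, q3, iqr)
```

The IQR is always the plain difference Q3 − Q1, whichever convention produced the two quartiles.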

1.6. EXAMINING NUMERICAL DATA 29

Interquartile range (IQR)

The IQR is the length of the box in a box plot. It is computed as

IQR = Q3 − Q1

where Q1 and Q3 are the 25th and 75th percentiles.

Exercise 1.30 What percent of the data fall between Q1 and the median? What percent is between the median and Q3? [34]

Extending out from the box, the whiskers attempt to capture the data outside of the box; however, their reach is never allowed to be more than 1.5 × IQR. [35] They capture everything within this reach. In Figure 1.25, the upper whisker does not extend to the last three points, which are beyond Q3 + 1.5 × IQR, and so it extends only to the last point below this limit. The lower whisker stops at the lowest value, 33, since there is no additional data to reach; the lower whisker's limit is not shown in the figure because the plot does not extend down to Q1 − 1.5 × IQR. In a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data.

Any observation that lies beyond the whiskers is labeled with a dot. The purpose of labeling these points – instead of just extending the whiskers to the minimum and maximum observed values – is to help identify any observations that appear to be unusually distant from the rest of the data. Unusually distant observations are called outliers. In this case, it would be reasonable to classify the emails with character counts of 41,623, 42,793, and 64,401 as outliers since they are numerically distant from most of the data.

Outliers are extreme

An outlier is an observation that appears extreme relative to the rest of the data.

TIP: Why it is important to look for outliers

Examination of data for possible outliers serves many useful purposes, including

- Identifying strong skew in the distribution.
- Identifying data collection or entry errors. For instance, we re-examined the email purported to have 64,401 characters to ensure this value was accurate.
- Providing insight into interesting properties of the data.

Exercise 1.31 The observation 64,401, a suspected outlier, was found to be an accurate observation. What would such an observation suggest about the nature of character counts in emails? [36]

Exercise 1.32 Using Figure 1.25, estimate the following values for num_char in the email50 data set: (a) Q1, (b) Q3, and (c) IQR. [37]

[34] Since Q1 and Q3 capture the middle 50% of the data and the median splits the data in the middle, 25% of the data fall between Q1 and the median, and another 25% fall between the median and Q3.

[35] While the choice of exactly 1.5 is arbitrary, it is the most commonly used value for box plots.

[36] That occasionally there may be very long emails.

[37] These visual estimates will vary a little from one person to the next: Q1 = 3,000, Q3 = 15,000, IQR = Q3 − Q1 = 12,000. (The true values: Q1 = 2,536, Q3 = 15,411, IQR = 12,875.)
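The 1.5 × IQR rule for whisker reach can be checked numerically against the values quoted in this section. The sketch below is an illustration only: the function names `whisker_limits` and `split_outliers` are hypothetical, and `sample` holds just the handful of character counts mentioned in the text, not the full email50 data set.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return the (lower, upper) maximum whisker reach: Q1 - k*IQR, Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, q1, q3, k=1.5):
    """Split values into those within whisker reach and suspected outliers."""
    low, high = whisker_limits(q1, q3, k)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inside, outliers


# True quartiles reported in the text for num_char.
Q1, Q3 = 2536, 15411
low, high = whisker_limits(Q1, Q3)
print(low, high)

# Only the observations quoted in the text (an incomplete sample).
sample = [33, 6768, 7012, 41623, 42793, 64401]
inside, outliers = split_outliers(sample, Q1, Q3)
print(outliers)
```

With these quartiles the lower limit is negative, which is consistent with the lower whisker stopping at the smallest observation, 33, and with the three largest counts being flagged as suspected outliers.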
