CHAPTER 3: Data Description, Lecture notes of Statistics

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

borich
borich 🇬🇧

Ch3: Data Description Santorico Page 68
CHAPTER 3: Data Description
You’ve tabulated and made pretty pictures.
Now what numbers do you use to summarize
your data?

You’ll find a link on our website to a data set with various measures for housing in the suburbs of Boston. It comes from a paper titled: “Hedonic prices and the demand for clean air.” I’ve given histograms here for a few of the variables.

  • What are some characteristics of the distributions that you might want to describe?
  • Do you think these measures might do a better job for some of these variables versus others? Why or why not?

Rules and Notation:

  • Let $x$ represent the variable for which we have sample data.
  • Let $n$ represent the number of observations in the sample (the sample size).
  • Let $N$ represent the number of observations in the population.
  • $\sum x$ represents the sum of all the data values of $x$.
  • $\sum x^2$ is the sum of the data values after squaring them.
  • $\left(\sum x\right)^2$ is the square of the sum of the data values.

General Rounding Rule: When doing a calculation, do not round until the final answer is calculated!

Rounding Rule of Thumb for Calculations from Raw Data: The final answer should be rounded to one more decimal place than that of the original data. You will see that this will be true for the mean, variance and standard deviation.

Section 3-1: Measures of Central Tendency

Measure: Mean
  • Description: The sum of the data values divided by the total number of values.
  • Statistic and Parameter: The sample mean is denoted by $\bar{x}$ and calculated using the formula $\bar{x} = \frac{\sum x}{n}$. The population mean is denoted by $\mu$ and is found with the formula $\mu = \frac{\sum x}{N}$. The mean should be rounded to one more decimal place than occurs in the raw data.
  • Notes and Insights: The mean is the balance point of the data. When the data are skewed, the mean is pulled in the direction of the longer tail. The mean is used in computing other statistics such as variance and standard deviation. The mean is highly affected by outliers and may not be an appropriate statistic to use when an outlier is present.

Measure: Median
  • Description: The middle number of the data set when the values are ordered from smallest to largest.
  • Statistic and Parameter: Arrange the data in order. If $n$ is odd, the median is the middle number. If $n$ is even, the median is the mean of the middle two numbers. We use the symbol MD for median.
  • Notes and Insights: The median is robust against outliers (less affected by them). The median is used when one must find the center value of a data set.

Measure: Mode
  • Description: The value that occurs most often in a data set.
  • Statistic and Parameter: This is where the “peaks” occur in a histogram.
      Unimodal – when a data set has only one mode
      Bimodal – when a data set has 2 modes
      Multimodal – when a data set has more than 2 modes
      No Mode – when no data value occurs more than once
  • Notes and Insights: The mode is used when the most typical case is desired.

GROUP WORK: (use appropriate notation)

  1. Find the Mean, Median and Midrange of the daily vehicle pass charge for five U.S. National Parks. The costs are $25, $15, $15, $20, and $25.
  2. Find the Mean, Median and Midrange of the numbers of water-line breaks per month in the last two winter seasons for the city of Brownsville, Minnesota: 2, 3, 6, 8, 4, 1.
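One way to check the group-work answers is a short Python sketch. It uses the standard-library `statistics` module for the mean and median, and assumes the usual definition of the midrange (the lowest value plus the highest value, divided by 2). The function and variable names are illustrative only.

```python
from statistics import mean, median


def midrange(values):
    """Midrange: halfway between the smallest and the largest value (assumed definition)."""
    return (min(values) + max(values)) / 2


def central_summary(values):
    """Return the three measures of center asked for in the group work."""
    return {
        "mean": mean(values),
        "median": median(values),
        "midrange": midrange(values),
    }


# Data taken from the two group-work questions.
park_fees = [25, 15, 15, 20, 25]
water_line_breaks = [2, 3, 6, 8, 4, 1]

for label, data in [("park fees", park_fees), ("water-line breaks", water_line_breaks)]:
    print(label, central_summary(data))
```

For the park fees all three measures coincide at 20; for the water-line breaks they differ slightly (mean 4, median 3.5, midrange 4.5).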

Find the mode of the following data sets:

  • Set 1: 12, 8, 14, 15, 11, 10, 5,
  • Set 2: 1, 2, 3,
  • Set 3: 1, 2, 3, 4, 1, 2,
  • Set 4: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4,
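The mode categories (unimodal, bimodal, multimodal, no mode) can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: `describe_modes` is a hypothetical helper, and it is applied to the two group-work data sets rather than to the sets listed above.

```python
from collections import Counter


def describe_modes(values):
    """Return (modes, label) using the categories from the notes."""
    if not values:
        raise ValueError("need at least one data value")
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No data value occurs more than once.
        return [], "no mode"
    modes = sorted(v for v, c in counts.items() if c == highest)
    labels = {1: "unimodal", 2: "bimodal"}
    return modes, labels.get(len(modes), "multimodal")


park_fees = [25, 15, 15, 20, 25]
water_line_breaks = [2, 3, 6, 8, 4, 1]

print(describe_modes(park_fees))          # two values tie for most frequent
print(describe_modes(water_line_breaks))  # every value occurs once
```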

Now consider which of these measures would be good representations of “central tendency” for the 3 variables from the Boston housing data set.

Per Capita Crime Rate By Town
  • $\bar{x}$ = 3.613524
  • MD = 0.256510
  • MR = 44.491260
  • Mode = 0.01501, 14. (each occurring twice)

Average Number Of Rooms Per Dwelling
  • $\bar{x}$ = 6.2846
  • MD = 6.2085
  • MR = 6.1705
  • Mode = (all occurring 3 times)

Pupil-Teacher Ratio By Town
  • $\bar{x}$ = 18.46
  • MD = 19.
  • MR = 17.30
  • Mode = (occurring 140 times; the next count closest occurred 34 times)

Notice how the statistics compare to each other for each variable, e.g., mean, median and midrange are all close to each other for the room variable. Why? Why is this not the case for the other variables?
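A small sketch can help with the "why". It takes the park-fee data from the group work and appends one hypothetical extreme value (250, invented purely for illustration) to show how a long right tail pulls the mean and midrange away from the median. The midrange is again assumed to be (lowest + highest) / 2.

```python
from statistics import mean, median


def midrange(values):
    """Midrange: halfway between the smallest and the largest value (assumed definition)."""
    return (min(values) + max(values)) / 2


park_fees = [25, 15, 15, 20, 25]
with_outlier = park_fees + [250]  # hypothetical outlier, not from the notes

for label, data in [("original", park_fees), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean={mean(data):.1f}  "
          f"median={median(data):.1f}  midrange={midrange(data):.1f}")
```

With the outlier added, the median barely moves, the mean more than doubles, and the midrange moves the most, because it depends only on the two extreme values.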

Location of Mean, Median, and Mode on Distribution Shapes


Ways to Measure Spread:

Measure: Range
  • Description: The difference between the largest and smallest observations.
  • Sample and Population: Denoted by $R$. $R = \text{high value} - \text{low value}$

Measure: Variance
  • Description: The average of the squares of the distance each value is from the mean.
  • Sample: The sample variance is an estimate of the population variance calculated from a sample. It is denoted by $s^2$. The formula to calculate the sample variance is
      $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
    or
      $s^2 = \frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$
  • Population: Population variance is denoted by $\sigma^2$. It is commonly used in statistics because it has nice theoretical properties. The formula for the population variance is
      $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
  • Notes: In practice, we don’t know the population values or parameters, so we cannot calculate $\sigma^2$ or $\sigma$. We end up calculating the variance and standard deviation of a sample. Be careful to notice the difference between $n - 1$ (sample) and $N$ (population) in the denominator.

Measure: Standard deviation
  • Description: The “typical” deviation from the sample mean.
  • Sample: The square root of the sample variance. It is denoted by $s$. The formula to calculate the sample standard deviation is
      $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
    or
      $s = \sqrt{\frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$
  • Population: The square root of the population variance. The symbol for the population standard deviation is $\sigma$. The formula for the population standard deviation is
      $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
  • Notes: The greater the spread of the data, the larger the value of $s$. $s = 0$ only when all observations take the same value. $s$ can be influenced by outliers because outliers influence the mean and because outliers have large deviations from the mean.
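The two expressions for the sample variance (the definitional form and the computational form) are algebraically the same. A short derivation, using only the notation already defined and the fact that $\bar{x} = \frac{\sum x}{n}$, might run as follows:

```latex
\begin{align*}
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - \frac{2\left(\sum x\right)^2}{n} + \frac{\left(\sum x\right)^2}{n} \\
  &= \sum x^2 - \frac{\left(\sum x\right)^2}{n}
   = \frac{n\sum x^2 - \left(\sum x\right)^2}{n}, \\[6pt]
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
  &= \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)}.
\end{align*}
```

The computational form needs only $\sum x$ and $\sum x^2$, which is convenient on a calculator, while the definitional form shows more directly what the variance measures.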

Steps for Calculating Sample Variance and Standard Deviation

  1. Calculate the sample mean $\bar{x}$.
  2. Calculate the deviation from the mean for every data value (data value – mean).
  3. Square all the values from #2 and find the sum.
  4. Divide the sum in #3 by $n - 1$. This calculation produces the sample variance.
  5. Take the square root of #4. This number produces the sample standard deviation.

The same (general) procedure applies for finding the population variance and standard deviation except we use the population mean $\mu$ and divide by $N$ instead of $n - 1$.
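The five steps can be sketched in Python as below. This is an illustrative sketch rather than a required method: the function names are made up, and the data are the water-line break counts from the earlier group work (2, 3, 6, 8, 4, 1). The computational formula is included as a cross-check.

```python
from math import sqrt


def sample_variance(values):
    """Sample variance following steps 1-4."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data values")
    x_bar = sum(values) / n                        # step 1: sample mean
    deviations = [x - x_bar for x in values]       # step 2: deviations from the mean
    sum_of_squares = sum(d ** 2 for d in deviations)  # step 3: square and sum
    return sum_of_squares / (n - 1)                # step 4: divide by n - 1


def sample_variance_shortcut(values):
    """Same quantity from the computational formula using sum(x) and sum(x^2)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


def sample_std_dev(values):
    """Step 5: square root of the sample variance."""
    return sqrt(sample_variance(values))


water_line_breaks = [2, 3, 6, 8, 4, 1]

s2 = sample_variance(water_line_breaks)
s = sample_std_dev(water_line_breaks)

# Rounding rule of thumb: one more decimal place than the raw data,
# applied only to the final answers.
print("variance:", round(s2, 1))
print("std dev: ", round(s, 1))
print("shortcut:", round(sample_variance_shortcut(water_line_breaks), 1))
```

For a population, the same steps apply with the population mean and a divisor of `N` instead of `n - 1`.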

And to check your answer, John’s statistics are: $\bar{x} = 185$, $s^2 = 13.6$, $s = 3.$

Uses of the Variance and Standard Deviation

  1. To determine the spread of data. The larger the variance or standard deviation, the more dispersed the data are.
  2. To compare the dispersion of two or more data sets and decide which is more spread out.
  3. To determine the consistency of a variable. E.g., the variation of nuts and bolts in manufacturing must be small.
  4. Frequently used in inferential statistics, as we will see later in the book.
  5. Empirical rule……

Example: Mothers’ Heights An article in 1903 published the heights of 1052 mothers. The sample mean was 62.484 inches and the standard deviation was 2.390 inches. Note the summary table below regarding the actual percentages and the empirical rule.
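As a rough sketch, the intervals the empirical rule refers to can be computed from the mean and standard deviation given in this example. The percentages used here (about 68%, 95% and 99.7% within 1, 2 and 3 standard deviations) are the usual statement of the rule for bell-shaped data and are an assumption of this sketch; the function name is illustrative.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3 with the usual approximate coverage."""
    approx_coverage = {1: 68.0, 2: 95.0, 3: 99.7}  # percent, for bell-shaped data
    return {
        k: (mean - k * sd, mean + k * sd, pct)
        for k, pct in approx_coverage.items()
    }


# Summary statistics from the Mothers' Heights example (inches).
intervals = empirical_rule_intervals(62.484, 2.390)

for k, (low, high, pct) in intervals.items():
    print(f"within {k} sd: {low:.3f} to {high:.3f} inches (about {pct}% expected)")
```

Comparing these expected percentages with the observed share of the 1052 heights in each interval is how one would judge how well the rule fits the data.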

Section 3-3: Measure of Position (some of…this section we need for use in Section 3-4)

Quartiles – values that divide the distribution into four groups, separated by Q1, Q2 (median), and Q3.

  • Q1 is the 25th percentile.
  • Q2 is the 50th percentile (the median).
  • Q3 is the 75th percentile.

Interquartile Range (IQR) – the difference between Q1 and Q3. This is the range of the middle 50% of the data.

  • IQR = Q3 − Q1
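A minimal sketch of quartiles and the IQR in Python follows. Several conventions exist for computing quartiles; this sketch assumes the common classroom method of taking Q1 and Q3 as the medians of the lower and upper halves of the ordered data (leaving out the middle value when n is odd). Other methods, including the default of `statistics.quantiles`, can give slightly different values. The function names are illustrative, and the data are the water-line break counts from the earlier group work.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = data[:half]
    # When n is odd, the middle value belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


water_line_breaks = [2, 3, 6, 8, 4, 1]

q1, q2, q3 = quartiles(water_line_breaks)
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3, " IQR =", iqr(water_line_breaks))
```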