Sampling Distribution: Understanding Random Samples and Statistic Calculation - Prof. Edit, Lab Reports of Statistics

This lab introduces the concept of sampling distribution: a random sample is drawn from a population to estimate population parameters such as the mean age and the proportion of smokers. It covers why random sampling is used, which statistics to calculate, and what sampling variability is. It also discusses the sampling distribution of the sample mean and sample proportion, and how they can be used to solve problems.

Typology: Lab Reports

2009/2010

Uploaded on 02/25/2010

koofers-user-80c
Math 1530 Lab: Introducing the idea of sampling distribution (Chapter 18)
Drawing a random sample IS a random experiment
Imagine you have a population of individuals and you will select a random sample of size n to ask them a few
questions, for example their age and whether they are or have been smokers at some point in their lives. Before
drawing the sample we know that n of them are going to be in the sample, but we don’t know exactly WHO is going
to be in the sample.
1. Why do we select a random sample? Population parameters.
We select a random sample when we want to know something about the population but we don’t have the time or
money to ask everybody in the population. The things we want to know about the population are called population
parameters; in this case they are the ‘mean age in the population’ and the ‘proportion of smokers in the population’.
2. What statistics to calculate from the sample?
Assume that you will take a sample of n individuals and ask them the questions
‘What is your age (in years)?’ and ‘Have you smoked more than 100 cigarettes in your life?’ (the official
definition of ‘being a smoker’), and that you want to summarize the data in the sample.
What type of variable is age? Quantitative or categorical? _____________________
What type of variable is ‘being a smoker’? Quantitative or categorical? ___________________
Considering the type of variable, which statistic do you consider appropriate to summarize the information in the
sample?
For age ________________________________ For smokers ____________________________
3. Taking samples and calculating statistics
As you can imagine, the mean age in the sample and the proportion of smokers in the sample depend on who
is in the sample. For simplicity, let’s assume that we have a population of 50 individuals and that you will
select a sample of 5 individuals. In real life we only know the answers to the questions for the
individuals in the sample, but here, just as an exercise, you can see below the age and smoking status of all 50
individuals in the population. This population is in the file agesmoke.mtw, available on our web page.
ID Age Smoker
1 34 NO
2 39 YES
3 37 NO
4 46 NO
5 31 NO
6 32 NO
7 36 YES
8 51 NO
9 93 YES
10 66 YES
11 50 YES
12 32 NO
13 31 YES
14 43 YES
15 24 NO
16 25 YES
17 43 NO
18 29 NO
19 31 NO
20 58 YES
21 76 YES
22 65 YES
23 39 YES
24 38 NO
25 37 YES
26 27 NO
27 38 YES
28 69 YES
29 68 NO
30 21 NO
31 82 NO
32 32 YES
33 23 NO
34 51 NO
35 45 NO
36 26 NO
37 35 NO
38 26 NO
39 35 NO
40 24 YES
41 25 YES
42 47 NO
43 45 NO
44 42 YES
45 81 NO
46 43 NO
47 39 NO
48 34 YES
49 71 NO
50 31 NO
Using the random digit table or Minitab, select two different samples of size 5 and report the
observations and the value of the statistics for each sample.
Sample 1
         Person 1  Person 2  Person 3  Person 4  Person 5  Value of the statistic
ID
Age                                                        Mean =
Smoker?                                                    Proportion =
Sample 2
         Person 1  Person 2  Person 3  Person 4  Person 5  Value of the statistic
ID
Age                                                        Mean =
Smoker?                                                    Proportion =
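The sampling step above can also be sketched in Python rather than Minitab. This is only an illustration: the names `AGES`, `SMOKER_IDS` and `draw_sample` are hypothetical, the seed is arbitrary, and the sketch assumes a simple random sample drawn without replacement. The data are copied from the population table above.

```python
import random

# Ages of the 50 individuals, in ID order (ID 1 first), copied from the table.
AGES = [34, 39, 37, 46, 31, 32, 36, 51, 93, 66,
        50, 32, 31, 43, 24, 25, 43, 29, 31, 58,
        76, 65, 39, 38, 37, 27, 38, 69, 68, 21,
        82, 32, 23, 51, 45, 26, 35, 26, 35, 24,
        25, 47, 45, 42, 81, 43, 39, 34, 71, 31]

# IDs of the individuals whose Smoker value is YES in the table.
SMOKER_IDS = {2, 7, 9, 10, 11, 13, 14, 16, 20, 21,
              22, 23, 25, 27, 28, 32, 40, 41, 44, 48}


def draw_sample(n, rng):
    """Draw n distinct IDs (assumption: sampling without replacement)
    and return the IDs, the sample mean age and the sample proportion."""
    ids = rng.sample(range(1, 51), n)
    ages = [AGES[i - 1] for i in ids]
    smokes = [1 if i in SMOKER_IDS else 0 for i in ids]
    return ids, sum(ages) / n, sum(smokes) / n


rng = random.Random(1530)  # arbitrary seed so the run is repeatable
for label in ("Sample 1", "Sample 2"):
    ids, mean_age, prop = draw_sample(5, rng)
    print(label, "IDs:", ids, "Mean =", mean_age, "Proportion =", prop)
```

Changing the seed gives different samples, which is exactly the behavior the rest of the lab explores.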
Notice something interesting about categorical variables with two possible answers (‘success’ or ‘failure’). In this
example the variable Smoker has two categories: YES and NO. In the samples above, replace YES by 1 and NO by
0. Call that new variable Y. Counting the number of YES responses is equivalent to adding the 1s and 0s corresponding to the

answers. For example, if the answers to the question ‘Have you smoked more than 100 cigarettes in your life?’ are YES, NO, YES, NO, NO, the values of Y would be 1, 0, 1, 0, 0, so

$$\hat{p} = \frac{\text{number of successes}}{n} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$$
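The identity above can be checked on the five example answers with a few lines of Python. This is a minimal sketch; the variable names are illustrative and not part of the lab.

```python
# The five example answers from the text.
answers = ["YES", "NO", "YES", "NO", "NO"]

# Recode YES as 1 and NO as 0 to obtain the variable Y.
y = [1 if a == "YES" else 0 for a in answers]

# Sample proportion: number of successes divided by n.
p_hat = answers.count("YES") / len(answers)

# Sample mean of the 0/1 variable Y.
y_bar = sum(y) / len(y)

print(y, p_hat, y_bar)
```

Both quantities are computed from the same count of successes, so they agree for any list of YES/NO answers.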

The sample proportion can also be understood as the sample mean of a variable that only takes the values 1 and 0 (for success and failure, respectively).

Below you see the distribution of age for the population. The population mean, 42.92, is marked with an arrow. Mark (on the X axis) the values of the sample means for the two samples you got. How far were the sample means from the population mean?

[Figure: histogram "Age (in years) of 50 individuals"; X axis: Age, from 15 to 95; Y axis: Frequency, from 0 to 15; the population mean is marked with an arrow]

We know that a proportion can only take values between 0 and 1. Below, on a line that goes from 0 to 1, we have marked the proportion of smokers in this small population (40% of the 50 individuals are or have been smokers). On the same graph, mark the proportion of smokers in the two samples you obtained.

[Figure: number line from 0 to 1 with the population proportion 0.4 marked]

4. Sampling Variability
In the samples you selected in the previous section, be aware of two things:
  1. The value of the statistic is not necessarily equal to the value of the parameter we want to estimate (actually, we would be VERY LUCKY if this happened), especially when the sample size is as small as the one we are working with (n=5).
  2. The values of the statistics were different for the two samples. Compare your values with the values obtained by the other students.
That IS SAMPLING VARIABILITY: THE VALUES OF THE STATISTICS DIFFER FROM SAMPLE TO SAMPLE. The statistics, such as the sample mean or sample proportion, are RANDOM VARIABLES because we don’t know exactly what value they will take until we select the sample.
5. Sampling Distribution of the sample mean and sample proportion
As for other random variables, we are interested in the probability distribution of the statistics (sample mean or sample proportion). That distribution is called the SAMPLING DISTRIBUTION; i.e., we want to know what values the sample mean or the sample proportion (of samples of size 5) can take, and with what probability.
Now, instead of taking 2 samples of size 5, we will take 1000 samples of size 5. Doing it by hand would be too time-consuming, but we can use the computer. Next you will see the results for 1000 random samples of size 5 taken from the population of 50 individuals. In the appendix you can see how these samples were generated with the computer, and you can generate your own samples if you wish.
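For readers without Minitab, the 1000-sample experiment can be approximated with the Python sketch below. It is not the lab's program: the function name `simulate` and the seed are hypothetical, and the sketch assumes sampling with replacement, as the appendix describes. The data are copied from the population table in section 3.

```python
import random
from collections import Counter
from statistics import mean

# Ages of the 50 individuals, in ID order, copied from the population table.
AGES = [34, 39, 37, 46, 31, 32, 36, 51, 93, 66,
        50, 32, 31, 43, 24, 25, 43, 29, 31, 58,
        76, 65, 39, 38, 37, 27, 38, 69, 68, 21,
        82, 32, 23, 51, 45, 26, 35, 26, 35, 24,
        25, 47, 45, 42, 81, 43, 39, 34, 71, 31]
SMOKER_IDS = {2, 7, 9, 10, 11, 13, 14, 16, 20, 21,
              22, 23, 25, 27, 28, 32, 40, 41, 44, 48}
# Y = 1 for smokers, 0 otherwise, in the same ID order as AGES.
Y = [1 if i in SMOKER_IDS else 0 for i in range(1, 51)]


def simulate(n, reps, seed):
    """Take `reps` samples of size n (with replacement) and return the
    list of sample means of age and the list of sample proportions."""
    rng = random.Random(seed)
    means, props = [], []
    for _ in range(reps):
        rows = rng.choices(range(50), k=n)  # row positions, may repeat
        means.append(sum(AGES[r] for r in rows) / n)
        props.append(sum(Y[r] for r in rows) / n)
    return means, props


means, props = simulate(n=5, reps=1000, seed=18)  # arbitrary seed
print("Average of the 1000 sample means:", round(mean(means), 2))
print("Relative frequency of each sample proportion:")
for value, count in sorted(Counter(props).items()):
    print(" ", value, count / len(props))
```

The table of relative frequencies is an empirical approximation of the sampling distribution of the sample proportion for n=5; a histogram of `means` plays the same role for the sample mean.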

6.1) Problem 13 on page 351 of Intro Stats by DeVeaux & Velleman says: “When a truckload of apples arrives at a packing plant, a random sample of 150 is selected and examined for bruises, discoloration and other defects. The whole truckload will be rejected if more than 5% of the sample is unsatisfactory. Suppose that in fact 8% of the apples on the truck do not meet the desired standard. What’s the probability that the shipment will be accepted anyway?”

So 8% of the apples in the truckload are not good, but we don’t know that, because we only examine a random sample of 150 apples and maybe, by chance, those are in better condition.

First, check whether the assumptions necessary to use the normal model are fulfilled.
a) Is n no larger than 10% of the population? In this case n=150, and it is reasonable to think that a whole truckload has more than 1500 apples. So the first condition is fulfilled.
b) Is np>10? Is n(1-p)>10? In this case n=150 and p=0.08, so np=12 and n(1-p)=150*0.92=138. So the second condition is fulfilled.

The distribution of the sample proportion can be assumed to be approximately normal with mean 0.08 and standard deviation

$$\sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.08)(0.92)}{150}} = 0.0221510$$

The question is P(accepting the shipment even when 8% of the apples in the truckload are not good) =

$$P(\hat{p} \le 0.05) = P\left(\frac{\hat{p} - 0.08}{0.0221510} \le \frac{0.05 - 0.08}{0.0221510}\right) = P\left(z \le \underline{\qquad\qquad}\right)$$
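The calculation above can be checked numerically with Python's standard library instead of the normal table or Minitab. This is a sketch under the same normal approximation; the variable names are illustrative, and the cutoff 0.05 comes from the rule that the truckload is rejected if more than 5% of the sample is unsatisfactory.

```python
from math import sqrt
from statistics import NormalDist

n = 150       # sample size
p = 0.08      # true proportion of unsatisfactory apples
q = 1 - p
cutoff = 0.05  # the shipment is accepted if the sample proportion is at most 5%

# Success/failure condition for the normal model.
print("np =", n * p, " nq =", n * q)

# Standard deviation of the sample proportion.
sd = sqrt(p * q / n)

# Standardize the cutoff, then evaluate the normal probability.
z = (cutoff - p) / sd
prob = NormalDist(mu=p, sigma=sd).cdf(cutoff)

print("sd =", round(sd, 7))
print("z =", round(z, 2))
print("P(p_hat <= 0.05) =", round(prob, 4))
```

The value printed for the probability should agree with the normal table up to the rounding of z.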

Sketch a normal distribution and shade in the area you want to find. Use the normal table (or Minitab) to find it.
Report that probability ________________________

6.2) Solve problem 21 on page 352 of Intro Stats by DeVeaux & Velleman. (In this case the duration of human pregnancies can be described by a normal model, so the distribution of the sample mean can be described by a normal model regardless of the sample size.) For other examples in which the variable does not have a normal distribution, you can still use the normal model for the sample mean (provided n is large enough) thanks to the Central Limit Theorem.

================================== Appendix ==================================
If you want to generate your own 1000 samples you can use the program samdismp.mtb below. (It can be downloaded from the web page, but you need to be careful that the extension of the file remains .mtb; for that, use the option ‘all files’ when downloading it, and not the option ‘web page’ or other.) The program has the following lines:

    sample 5 c2 c4 c5-c6;
    replace.
    let c7(k1)=mean(c5)
    let c8(k1)=mean(c6)
    let k1=k1+1

The program takes a random sample (with replacement) of size 5 from columns C2 and C4, where the values of age and Y (smoke: yes=1, no=0) are. The values of age and Y for the sample are placed in C5 and C6, respectively. The program calculates the mean of the sample (for age) and places it in C7. It also calculates the proportion of smokers in the sample and places that proportion in C8.
Before running the program, you need to initialize the counter k1 by typing at the MTB prompt: MTB> let k1=1
To execute the program in order to take the 1000 samples, from the menu click on FILE > OTHER FILES > run an executable. A window will appear: indicate the number of times you want to execute the program, and click on Select File to indicate the name of the program. You can browse to find the program samdismp.mtb. The sample means will appear in C7 and the sample proportions in C8.
You can later obtain histograms or tables for those variables. You can also change the sample size and observe what happens.
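To see what happens when the sample size changes, one option is to compare the spread of the sample means for several values of n. The Python sketch below is an illustration only: the helper `sd_of_sample_means`, the seed and the chosen sample sizes are assumptions, and sampling is done with replacement as in the Minitab program.

```python
import random
from statistics import pstdev

# Ages of the 50 individuals, in ID order, copied from the population table.
AGES = [34, 39, 37, 46, 31, 32, 36, 51, 93, 66,
        50, 32, 31, 43, 24, 25, 43, 29, 31, 58,
        76, 65, 39, 38, 37, 27, 38, 69, 68, 21,
        82, 32, 23, 51, 45, 26, 35, 26, 35, 24,
        25, 47, 45, 42, 81, 43, 39, 34, 71, 31]


def sd_of_sample_means(n, reps=1000, seed=0):
    """Standard deviation of `reps` sample means, each computed from a
    sample of size n drawn with replacement from AGES."""
    rng = random.Random(seed)
    means = [sum(rng.choices(AGES, k=n)) / n for _ in range(reps)]
    return pstdev(means)


# Hypothetical sample sizes chosen for illustration.
for n in (5, 10, 25):
    print("n =", n, " SD of sample means =", round(sd_of_sample_means(n), 2))
```

The sample means should cluster more tightly around the population mean as n grows, which is the pattern to look for in your own histograms.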