Sampling Distribution: Understanding Random Samples and Statistic Calculation - Prof. Edit | Lab Reports Statistics

Math 1530 Lab: Introducing the idea of a sampling distribution (Chapter 18)

Drawing a random sample IS a random experiment

Imagine you have a population of individuals and you will select a random sample of size n to ask them a few questions, for example their age and whether they are or have been smokers at some point in their life. Before drawing the sample, we know that n of them are going to be in the sample, but we don’t know exactly WHO is going to be in the sample.

1. Why do we select a random sample? Population parameters.

We select a random sample when we want to know something about the population but don’t have the time or money to ask everybody in the population. The things we want to know about the population are called parameters; in this case they are the ‘mean age in the population’ and the ‘proportion of smokers in the population’.

2. What statistics to calculate from the sample?

Assume that you will take a sample of n individuals and ask them the questions ‘What is your age (in years)?’ and ‘Have you smoked more than 100 cigarettes in your life?’ (the official definition of ‘being a smoker’), and that you want to summarize the data in the sample.

What type of variable is age? Quantitative or categorical? _____________________

What type of variable is ‘being a smoker’? Quantitative or categorical? ___________________

Considering the type of variable, which statistic do you consider appropriate to summarize the information in the sample?

For age ________________________________ For smokers ____________________________

3. Taking samples and calculating statistics

As you can imagine, the mean age in the sample and the proportion of smokers in the sample depend on who is in the sample. For simplicity, let’s assume that we have a population of 50 individuals and that you will select a sample of 5 individuals. In real life we only know the answers to the questions for the individuals in the sample, but here, just as an exercise, you can see below the age and smoking status of all 50 individuals in the population. This population is in the file agesmoke.mtw, available on our web page.

ID   Age   Smoker
 1   34    NO
 2   39    YES
 3   37    NO
 4   46    NO
 5   31    NO
 6   32    NO
 7   36    YES
 8   51    NO
 9   93    YES
10   66    YES
11   50    YES
12   32    NO
13   31    YES
14   43    YES
15   24    NO
16   25    YES
17   43    NO
18   29    NO
19   31    NO
20   58    YES
21   76    YES
22   65    YES
23   39    YES
24   38    NO
25   37    YES
26   27    NO
27   38    YES
28   69    YES
29   68    NO
30   21    NO
31   82    NO
32   32    YES
33   23    NO
34   51    NO
35   45    NO
36   26    NO
37   35    NO
38   26    NO
39   35    NO
40   24    YES
41   25    YES
42   47    NO
43   45    NO
44   42    YES
45   81    NO
46   43    NO
47   39    NO
48   34    YES
49   71    NO
50   31    NO

Using the random digit table or Minitab, select two different samples of size 5, and report the observations and the value of the statistic for each sample.

Sample 1

           Person 1   Person 2   Person 3   Person 4   Person 5   Value of the statistic
ID
Age                                                               Mean =
Smoker?                                                           Proportion =

Sample 2

           Person 1   Person 2   Person 3   Person 4   Person 5   Value of the statistic
ID
Age                                                               Mean =
Smoker?                                                           Proportion =
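For readers without Minitab or a random digit table, the selection step above can be sketched in Python. This is only an illustration, not the lab's own procedure: the names `AGES`, `SMOKER_IDS`, and `draw_sample` are made up for this sketch, the seed is arbitrary, and the sample is drawn without replacement so that the 5 people are distinct.

```python
import random

# Ages of the 50 individuals, in ID order (ID 1 is AGES[0]), copied from the table.
AGES = [34, 39, 37, 46, 31, 32, 36, 51, 93, 66,
        50, 32, 31, 43, 24, 25, 43, 29, 31, 58,
        76, 65, 39, 38, 37, 27, 38, 69, 68, 21,
        82, 32, 23, 51, 45, 26, 35, 26, 35, 24,
        25, 47, 45, 42, 81, 43, 39, 34, 71, 31]
# IDs whose Smoker entry is YES in the table.
SMOKER_IDS = {2, 7, 9, 10, 11, 13, 14, 16, 20, 21,
              22, 23, 25, 27, 28, 32, 40, 41, 44, 48}


def draw_sample(rng, n=5):
    """Pick n distinct IDs and return (ids, mean age, proportion of smokers)."""
    ids = sorted(rng.sample(range(1, 51), n))
    mean_age = sum(AGES[i - 1] for i in ids) / n
    prop_smokers = sum(i in SMOKER_IDS for i in ids) / n
    return ids, mean_age, prop_smokers


rng = random.Random(1530)  # arbitrary seed (an assumption) so the run is repeatable
sample_1 = draw_sample(rng)
sample_2 = draw_sample(rng)
for label, (ids, mean_age, prop) in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    print(label, "IDs:", ids, "mean age:", mean_age, "proportion of smokers:", prop)
```

Changing the seed gives different samples, and therefore different values of the two statistics.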

Notice something interesting about categorical variables with two possible answers (‘success’ or ‘failure’). In this example the variable Smoker has two categories: YES and NO. In the samples above, replace YES by 1 and NO by 0, and call that new variable Y. Counting the number of ‘yes’ answers is equivalent to adding the 1s and 0s corresponding to the


answers. For example, if the answers to the question ‘Have you smoked more than 100 cigarettes in your life?’ are YES, NO, YES, NO, NO, the values of Y would be 1, 0, 1, 0, 0. The sample proportion is then

$$\hat{p} = \frac{\text{number of successes}}{n} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$$
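The 0/1 coding can be checked in a few lines of Python, using the five example answers from the text. The variable names are illustrative only.

```python
# The five example answers to the smoking question, as given in the text.
answers = ["YES", "NO", "YES", "NO", "NO"]

# Code YES as 1 and NO as 0; this is the variable Y.
y = [1 if a == "YES" else 0 for a in answers]

# Sample proportion computed two ways: successes / n, and the mean of Y.
p_hat = answers.count("YES") / len(answers)
y_bar = sum(y) / len(y)

print("Y =", y)
print("successes / n =", p_hat)
print("mean of Y     =", y_bar)
```

Both calculations give the same number because adding the 1s and 0s is the same as counting the YES answers.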

The sample proportion can also be understood as the sample mean of a variable that only takes the values 1 and 0 (for success and failure, respectively).

Below you see the distribution of age for the population. The population mean, 42.92, is marked with an arrow. Mark (on the X axis) the values of the sample means for the two samples you got. How far were the sample means from the population mean?

[Figure: histogram "Age (in years) of 50 individuals"; x-axis: Age (15 to 95), y-axis: Frequency (0 to 15); the population mean is marked with an arrow.]

We know that a proportion can only take values between 0 and 1. Below, on a line that goes from 0 to 1, we have marked the proportion of smokers in this small population (40% of the 50 individuals are or have been smokers). On the same graph, mark the proportion of smokers in the two samples you obtained.

[Figure: number line from 0 to 1 with the population proportion of smokers marked.]
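The two population values quoted above (mean age 42.92 and 40% smokers) can be reproduced from the table of 50 individuals, and a sample can then be compared with them. The sketch below assumes a hypothetical sample with IDs 1, 10, 20, 30, and 40, chosen only for illustration; substitute the IDs you actually drew.

```python
# Ages of the 50 individuals, in ID order (ID 1 is AGES[0]), copied from the table.
AGES = [34, 39, 37, 46, 31, 32, 36, 51, 93, 66,
        50, 32, 31, 43, 24, 25, 43, 29, 31, 58,
        76, 65, 39, 38, 37, 27, 38, 69, 68, 21,
        82, 32, 23, 51, 45, 26, 35, 26, 35, 24,
        25, 47, 45, 42, 81, 43, 39, 34, 71, 31]
# IDs whose Smoker entry is YES in the table.
SMOKER_IDS = {2, 7, 9, 10, 11, 13, 14, 16, 20, 21,
              22, 23, 25, 27, 28, 32, 40, 41, 44, 48}

# Population parameters.
pop_mean = sum(AGES) / len(AGES)
pop_prop = len(SMOKER_IDS) / len(AGES)

# Hypothetical sample (an assumption for illustration, not a sample from the lab).
example_ids = [1, 10, 20, 30, 40]
sample_mean = sum(AGES[i - 1] for i in example_ids) / len(example_ids)
sample_prop = sum(i in SMOKER_IDS for i in example_ids) / len(example_ids)

print("population mean age:", pop_mean, " sample mean age:", sample_mean)
print("population proportion:", pop_prop, " sample proportion:", sample_prop)
print("distance of sample mean from population mean:", round(sample_mean - pop_mean, 2))
```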

4. Sampling Variability

In the samples you selected in the previous section, be aware of two things:

a) The value of the statistic is not necessarily equal to the value of the parameter we want to estimate (actually, we would be VERY LUCKY if this happened), especially when the sample size is as small as the one we are working with (n = 5).

b) The values of the statistics were different for the two samples. Compare your values with the values obtained by the other students. That IS SAMPLING VARIABILITY: THE VALUES OF THE STATISTICS DIFFER FROM SAMPLE TO SAMPLE. The statistics, such as the sample mean or the sample proportion, are RANDOM VARIABLES because we don’t know exactly what value they will take until we select the sample.

5. Sampling Distribution of the sample mean and sample proportion

As for other random variables, we are interested in the probability distribution of the statistics (sample mean or sample proportion). That distribution is called the SAMPLING DISTRIBUTION; i.e., we want to know what values the sample mean or the sample proportion (of samples of size 5) can take, and with what probability.

Now, instead of taking 2 samples of size 5, we will take 1000 samples of size 5. Doing this by hand would be too time-consuming, but we can use the computer. Next you will see the results for 1000 random samples of size 5 taken from the population of 50 individuals. In the appendix you can see how these samples were generated with the computer, and you can generate your own samples if you wish.
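For the sample proportion, the question "what values can it take, and with what probability" can also be answered exactly in this small example. The sketch below assumes the individuals are drawn with replacement (as the appendix program does), so the number of smokers in a sample of 5 follows a binomial distribution with p = 0.40; sampling without replacement would give slightly different probabilities. The name `dist` is illustrative.

```python
from math import comb

n = 5      # sample size used in this lab
p = 0.40   # proportion of smokers in the population of 50

# Exact sampling distribution of p-hat = k / n under sampling with replacement:
# P(k smokers in the sample) = C(n, k) * p^k * (1 - p)^(n - k)
dist = {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for value, prob in dist.items():
    print(f"p-hat = {value:.1f}   probability = {prob:.4f}")
```

With n = 5 the sample proportion can only take the six values 0, 0.2, 0.4, 0.6, 0.8, and 1, which is why a histogram of simulated proportions shows only a few bars.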

6.1) Problem 13 on page 351 of Intro Stats by DeVeaux & Velleman says: “When a truckload of apples arrives at a packing plant, a random sample of 150 is selected and examined for bruises, discoloration and other defects. The whole truckload will be rejected if more than 5% of the sample is unsatisfactory. Suppose that in fact 8% of the apples on the truck do not meet the desired standard. What’s the probability that the shipment will be accepted anyway?”

So 8% of the apples in the truckload are not good, but we don’t know that, because we only examine a random sample of 150 apples and maybe, by chance, those are in better condition.

First, check whether the assumptions necessary to use the normal model are fulfilled.

a) Is n no larger than 10% of the population? In this case n = 150, and it is reasonable to think that a whole truckload has more than 1500 apples, so the first condition is fulfilled.

b) Is np > 10? Is n(1-p) > 10? In this case n = 150 and p = 0.08, so np = 12 and n(1-p) = 150 × 0.92 = 138, so the second condition is fulfilled.

The distribution of the sample proportion can be assumed to be approximately normal with mean 0.08 and standard deviation

$$\sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.08)(0.92)}{150}} = 0.0221510$$

The question is P(accepting the shipment even when 8% of the apples in the truckload are not good), that is,

$$P(\hat{p} \le 0.05) = P\left(\frac{\hat{p} - 0.08}{0.0221510} \le \frac{0.05 - 0.08}{0.0221510}\right) = P(z \le -1.354)$$
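The arithmetic of the standardization can be checked with Python's standard library instead of a normal table. This is a sketch of the normal approximation described above, not an exact binomial calculation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 150        # apples examined
p = 0.08       # true proportion of unsatisfactory apples on the truck
cutoff = 0.05  # the truckload is accepted if the sample proportion is at most 5%

# Standard deviation of the sample proportion: sqrt(p * q / n), with q = 1 - p.
sd_p_hat = sqrt(p * (1 - p) / n)

# Standardize the cutoff, then look up the area to its left under the normal curve.
z = (cutoff - p) / sd_p_hat
prob_accept = NormalDist().cdf(z)

print(f"standard deviation = {sd_p_hat:.7f}")
print(f"z = {z:.3f}")
print(f"P(shipment accepted) = {prob_accept:.4f}")
```

The same three steps (standard deviation, z-score, area under the curve) apply to any sample proportion for which the two conditions hold.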

Sketch a normal distribution and shade in the area you want to find. Use the normal table (or Minitab) to find it.

Report that probability ________________________

6.2) Solve problem 21 on page 352 of Intro Stats by DeVeaux & Velleman. (In this case the duration of human pregnancies can be described by a normal model, so the distribution of the sample mean can be described by a normal model regardless of the sample size.) For other examples in which the variable does not have a normal distribution, you can still use the normal model for the sample mean (provided n is large enough) thanks to the Central Limit Theorem.

================================== Appendix ==================================

If you want to generate your own 1000 samples, you can use the program samdismp.mtb below. (It can be downloaded from the web page, but you need to be careful that the extension of the file remains .mtb; to ensure that, use the option ‘all files’ when downloading it, and not the option ‘web page’ or any other.)

The program has the following lines:

    sample 5 c2 c4 c5-c6;
    replace.
    let c7(k1)=mean(c5)
    let c8(k1)=mean(c6)
    let k1=k1+1

The program takes a random sample (with replacement) of size 5 from columns C2 and C4, where the values of age and Y (smoker: yes = 1, no = 0) are stored. The values of age and Y for the sample are placed in C5 and C6, respectively. The program calculates the mean of the sample (for age) and places it in C7. It also calculates the proportion of smokers in the sample and places that proportion in C8.

Before running the program, you need to initialize the counter k1 by typing at the MTB prompt:

    MTB> let k1=1

To execute the program in order to take the 1000 samples, click on FILE > OTHER FILES > run an executable in the menu. In the window that appears, indicate the number of times you want to execute the program, and click on Select File to indicate the name of the program. You can browse to find the program samdismp.mtb. The sample means will appear in C7 and the sample proportions in C8.

You can later obtain histograms or tables for those variables. You can also change the sample size and observe what happens.
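For readers who do not have Minitab, a rough Python analogue of the program is sketched below. It is not the course's program: the names `AGES`, `SMOKER_IDS`, `Y`, `sample_means`, and `sample_props` are illustrative, and the seed is an arbitrary assumption. Like the Minitab version, it draws each sample of 5 with replacement and stores the mean age and the proportion of smokers for each of the 1000 samples.

```python
import random

# Ages of the 50 individuals, in ID order (ID 1 is AGES[0]), copied from the table.
AGES = [34, 39, 37, 46, 31, 32, 36, 51, 93, 66,
        50, 32, 31, 43, 24, 25, 43, 29, 31, 58,
        76, 65, 39, 38, 37, 27, 38, 69, 68, 21,
        82, 32, 23, 51, 45, 26, 35, 26, 35, 24,
        25, 47, 45, 42, 81, 43, 39, 34, 71, 31]
# IDs whose Smoker entry is YES in the table.
SMOKER_IDS = {2, 7, 9, 10, 11, 13, 14, 16, 20, 21,
              22, 23, 25, 27, 28, 32, 40, 41, 44, 48}
# Y is the 0/1 version of Smoker (YES = 1, NO = 0), in ID order.
Y = [1 if i in SMOKER_IDS else 0 for i in range(1, 51)]

SAMPLE_SIZE = 5
N_SAMPLES = 1000
rng = random.Random(18)  # arbitrary seed (an assumption) so the run is repeatable

sample_means = []  # plays the role of column C7
sample_props = []  # plays the role of column C8
for _ in range(N_SAMPLES):
    # Draw 5 row positions with replacement, like "sample 5 ...; replace."
    rows = rng.choices(range(50), k=SAMPLE_SIZE)
    sample_means.append(sum(AGES[r] for r in rows) / SAMPLE_SIZE)
    sample_props.append(sum(Y[r] for r in rows) / SAMPLE_SIZE)

print("average of the 1000 sample means:      ", sum(sample_means) / N_SAMPLES)
print("average of the 1000 sample proportions:", sum(sample_props) / N_SAMPLES)
```

Changing `SAMPLE_SIZE` and rerunning shows how the spread of the sample means and sample proportions changes with the sample size.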
