




























































































WHAT IS STATISTICS? That science which enables us to draw conclusions about various phenomena on the basis of real data collected on a sample basis. A tool for data-based research. Also known as Quantitative Analysis. It has many applications in a wide variety of disciplines: Agriculture, Anthropology, Astronomy, Biology, Economics, Engineering, Environment, Geology, Genetics, Medicine, Physics, Psychology, Sociology, Zoology …. Virtually every single subject from Anthropology to Zoology …. A to
Typology: Lecture notes
Sampling Distribution of X̄
Mean and Standard Deviation of the Sampling Distribution of X̄
Sampling Distribution of p̂
uncertainty. It should of course be borne in mind that uncertainty does not imply ignorance; it refers to the incompleteness and the instability of the data available. In this sense, the word statistics is used in the singular. As it embodies more or less all stages of the general process of learning, sometimes called the scientific method, statistics is characterized as a science. Thus the word statistics used in the plural refers to a set of numerical information and, in the singular, denotes the science of basing decisions on numerical data. It should be noted that statistics as a subject is mathematical in character.
Thirdly, the word statistics refers to numerical quantities calculated from sample observations; a single quantity so calculated is called a statistic. The mean of a sample, for instance, is a statistic. The word statistics is plural when used in this sense.
Statistics is a discipline in its own right. It would therefore be desirable to know the characteristic features of statistics in order to appreciate and understand its general nature. Some of its important characteristics are given below:
Statistics deals with the behaviour of aggregates or large groups of data. It has nothing to do with what is happening to a particular individual or object of the aggregate.
Statistics deals with aggregates of observations of the same kind rather than isolated figures.
Statistics deals with variability that obscures underlying patterns. No two objects in this universe are exactly alike. If they were, there would have been no statistical problem.
Statistics deals with uncertainties as every process of getting observations whether controlled or uncontrolled, involves deficiencies or chance variation. That is why we have to talk in terms of probability.
Statistics deals with those characteristics or aspects of things which can be described numerically either by counts or by measurements.
Statistics deals with those aggregates which are subject to a number of random causes, e.g. the heights of persons are subject to a number of causes such as race, ancestry, age, diet, habits, climate and so forth.
Statistical laws are valid on the average or in the long run. There is no guarantee that a certain law will hold in all cases. Statistical inference is therefore made in the face of uncertainty.
Statistical results might be misleading or incorrect if sufficient care is not exercised in collecting, processing and interpreting the data, or if the statistical data are handled by a person who is not well versed in the subject matter of statistics.
As it is such an important area of knowledge, it is definitely useful to have a fairly good idea about the way in which it works, and this is exactly the purpose of this introductory course. The following points indicate some of the main functions of this science:
Statistics assists in summarizing the larger set of data in a form that is easily understandable.
Statistics assists in the efficient design of laboratory and field experiments as well as surveys.
Statistics assists in a sound and effective planning in any field of inquiry.
Statistics assists in drawing general conclusions and in making predictions of how much of a thing will happen under given conditions.
As stated earlier, Statistics is a discipline that finds application in the most diverse fields of activity. It is perhaps a subject that should be used by everybody. Statistical techniques, being powerful tools for analyzing numerical data, are used in almost every branch of learning. In all areas, statistical techniques are being increasingly used, and are developing very rapidly.
A modern administrator whether in public or private sector leans on statistical data to provide a factual basis for decision.
A politician uses statistics advantageously to lend support and credence to his arguments while elucidating the problems he handles. A businessman, an industrialist and a research worker all employ statistical methods in their work. Banks, insurance companies and governments all have their statistics departments.
A social scientist uses statistical methods in various areas of the socio-economic life of a nation. It is sometimes said that “a social scientist without an adequate understanding of statistics is often like the blind man groping in a dark room for a black cat that is not there”.
The word “data” appears in many contexts and frequently is used in ordinary conversation. Although the word carries something of an aura of scientific mystique, its meaning is quite simple and mundane. It is Latin for “those that are given” (the singular form is “datum”). Data may therefore be thought of as the results of observation.
Data are collected in many aspects of everyday life.
Statements given to a police officer or physician or psychologist during an interview are data.
So are the correct and incorrect answers given by a student on a final examination.
Almost any athletic event produces data.
The time required by a runner to complete a marathon,
The number of errors committed by a baseball team in nine innings of play.
And, of course, data are obtained in the course of scientific inquiry:
the positions of artifacts and fossils in an archaeological site,
The number of interactions between two members of an animal colony during a period of observation,
The spectral composition of light emitted by a star.
In statistics, an observation often means any sort of numerical recording of information, whether it is a physical measurement such as height or weight; a classification such as heads or tails, or an answer to a question such as yes or no.
A characteristic that varies with an individual or an object is called a variable. For example, age is a variable as it varies from person to person. A variable can assume a number of values. The given set of all possible values from which the variable takes on a value is called its Domain. If for a given problem, the domain of a variable contains only one value, then the variable is referred to as a constant.
Variables may be classified into quantitative and qualitative according to the form of the characteristic of interest. A variable is called a quantitative variable when a characteristic can be expressed numerically, such as age, weight, income or number of children. On the other hand, if the characteristic is non-numerical, such as education, sex, eye-colour, quality, intelligence, poverty, satisfaction, etc., the variable is referred to as a qualitative variable. A qualitative characteristic is also called an attribute. An individual or an object with such a characteristic can be counted or enumerated after having been assigned to one of several mutually exclusive classes or categories.
An error is said to be biased when the observed value is consistently and constantly higher or lower than the true value. Biased errors arise from the personal limitations of the observer, the imperfection in the instruments used or some other conditions which control the measurements. These errors are not revealed by repeating the measurements. They are cumulative in nature, that is, the greater the number of measurements, the greater would be the magnitude of the error. They are thus more troublesome. These errors are also called cumulative or systematic errors. An error, on the other hand, is said to be unbiased when the deviations, i.e. the excesses and defects, from the true value tend to occur equally often. Unbiased errors are revealed when measurements are repeated, and they tend to cancel out in the long run. These errors are therefore compensating and are also known as random errors or accidental errors.
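The contrast between biased and unbiased errors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value, the size of the bias, the normally distributed noise and the function name `simulate_measurements` are all hypothetical choices made for illustration and do not come from the lecture.

```python
import random
import statistics


def simulate_measurements(true_value, n, bias=0.0, noise_sd=0.0, seed=0):
    """Return n simulated readings: true value + constant bias + random noise.

    All parameters are illustrative assumptions, not values from the notes.
    """
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


TRUE_LENGTH = 100.0  # hypothetical true value being measured
N = 10_000           # number of repeated measurements

# Unbiased (random) errors only: excesses and defects occur about equally often.
random_only = simulate_measurements(TRUE_LENGTH, N, bias=0.0, noise_sd=2.0)

# Biased (systematic) error: every reading is shifted upwards by 1.5 units.
biased = simulate_measurements(TRUE_LENGTH, N, bias=1.5, noise_sd=2.0)

# Random errors tend to cancel out in the average ...
mean_error_random = statistics.mean(random_only) - TRUE_LENGTH
# ... whereas the bias survives averaging ...
mean_error_biased = statistics.mean(biased) - TRUE_LENGTH
# ... and the total error grows with the number of measurements.
total_error_biased = sum(biased) - TRUE_LENGTH * N

print(f"mean error, random only: {mean_error_random:.3f}")
print(f"mean error, biased:      {mean_error_biased:.3f}")
print(f"total error, biased:     {total_error_biased:.1f}")
```

Under these assumptions the average error of the unbiased readings stays close to zero, while the biased readings keep an average error close to the bias itself, which is why repeating the measurements does not reveal it.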
Steps involved in a Statistical Research-Project
Collection of Data:
Primary Data
Secondary Data
Sampling:
Concept of Sampling
Non-Random Versus Random Sampling
Simple Random Sampling
Other Types of Random Sampling

Topic and significance of the study
Objective of your study
Methodology for data-collection
Source of your data
Sampling methodology
Instrument for collecting data
As far as the objectives of your research are concerned, they should be stated in such a way that you are absolutely clear about the goal of your study --- EXACTLY WHAT IT IS THAT YOU ARE TRYING TO FIND OUT? As far as the methodology for DATA-COLLECTION is concerned, you need to consider:
Source of your data (the statistical population)
Sampling methodology
Instrument for collecting data
The most important part of statistical work is perhaps the collection of data. Statistical data are collected either by a COMPLETE enumeration of the whole field, called a CENSUS, which in many cases would be too costly and too time-consuming as it requires a large number of enumerators and supervisory staff, or by a PARTIAL enumeration associated with a SAMPLE, which saves much time and money.
Data that have been originally collected (raw data) and have not undergone any sort of statistical treatment, are called PRIMARY data. Data that have undergone any sort of treatment by statistical methods at least ONCE, i.e. the data that have been collected, classified, tabulated or presented in some form for a certain purpose, are called SECONDARY data.
One or more of the following methods are employed to collect primary data:
Direct Personal Investigation
Indirect Investigation
Collection through Questionnaires
Collection through Enumerators
Collection through Local Sources
In this method, an investigator collects the information personally from the individuals concerned. Since he interviews the informants himself, the information collected is generally considered quite accurate and complete. This method may prove very costly and time-consuming when the area to be covered is vast. However, it is useful for laboratory experiments or localized inquiries. Errors are likely to enter the results due to personal bias of the investigator.
Sometimes the direct sources do not exist or the informants hesitate to respond for some reason or other. In such a case, third parties or witnesses having information are interviewed. Moreover, due allowance is to be made for the personal bias. This method is useful when the information desired is complex or there is reluctance or indifference on the part of the informants. It can be adopted for extensive inquiries.
For example:
1) All the possible outcomes from the throw of a die: however long we throw the die and record the results, we could always continue to do so for a still longer period. This is a theoretical concept, one which has no existence in reality.
2) The number of ways in which a football team of 11 players can be selected from the 16 possible members named by the Club Manager.
We also need to differentiate between the sampled population and the target population. The sampled population is that from which a sample is chosen, whereas the population about which information is sought is called the target population. Thus our population will consist of the total number of students in all the colleges in the Punjab. Suppose that, on account of a shortage of resources or of time, we are able to conduct such a survey in only 5 colleges scattered throughout the province. In this case, the students of all the colleges will constitute the target population, whereas the students of those 5 colleges from which the sample of students will be selected will constitute the sampled population. From the above discussion regarding the population, you must have realized how important it is to have a very well-defined population.
The next question is: how will we draw a sample from our population? The answer is that, in order to draw a random sample from a finite population, the first thing that we need is the complete list of all the elements in our population. This list is technically called the FRAME.
A sampling frame is a complete list of all the elements in the population. For example: the complete list of the BCS students of Virtual University of Pakistan on February 15, 2003.
Speaking of the sampling frame, it must be kept in mind that, as far as possible, our frame should be free from various types of defects. It should:
not contain inaccurate elements,
not be incomplete,
be free from duplication, and
not be out of date.
Next, let’s talk about the sample that we are going to draw from this population. As you all know, a sample is only a part of a statistical population, and hence it can represent the population only to some extent. Of course, it is intuitively logical that the larger the sample, the more likely it is to represent the population. Obviously, the limiting case is that when the sample size tends to the population size, the sample will tend to be identical to the population. But, of course, in general, the sample is much smaller than the population. The point is that, in general, statistical sampling seeks to determine how accurate a description of the population the sample and its properties will provide. We may have to compromise on accuracy, but sampling has certain advantages because of which it has an extremely important place in data-based research studies.
1. Savings in time and money. Although the cost per unit in a sample is greater than in a complete investigation, the total cost will be less (because the sample will be so much smaller than the statistical population from which it has been drawn). A sample survey can be completed faster than a full investigation, so that variations from sample unit to sample unit over time will largely be eliminated. Also, the results can be processed and analyzed with increased speed and precision because there are fewer of them.
2. More detailed information may be obtained from each sample unit.
3. Possibility of follow-up. After detailed checking, queries and omissions can be followed up, a procedure which might prove impossible in a complete survey.
4. Sampling is the only feasible possibility where tests to destruction are undertaken or where the population is effectively infinite.
The next two important concepts that need to be considered are those of sampling and non-sampling errors.
The difference between the estimate derived from the sample (i.e. the statistic) and the true population value (i.e. the parameter) is technically called the sampling error. For example,
Sampling error arises due to the fact that a sample cannot exactly represent the population, even if it is drawn in a correct manner.
2. NON-SAMPLING ERROR
Besides sampling errors, there are certain errors which are not attributable to sampling but arise in the process of data collection, even if a complete count is carried out. The main sources of non-sampling errors are:
Defects in the sampling frame.
Faulty reporting of facts due to personal preferences.
Negligence or indifference of the investigators.
Non-response to mail questionnaires.
These (non-sampling) errors can be avoided through:
Following up the non-response.
Proper training of the investigators.
Correct manipulation of the collected information.
Let us now consider exactly what is meant by ‘non-response’. We can say that there are two types of non-response: partial non-response and total non-response. ‘Partial non-response’ implies that the respondent refuses to answer some of the questions. On the other hand, ‘total non-response’ implies that the respondent refuses to answer any of the questions. Of course, the problem of late returns and non-response of the kind that I have just mentioned occurs in the case of HUMAN populations. Although refusal of sample units to cooperate is encountered in interview surveys, it is far more of a problem in mail surveys. It is not uncommon to find the response rate to mail questionnaires as low as 15 or 20%.
The provision of INFORMATION ABOUT THE PURPOSE OF THE SURVEY helps in stimulating interest, thus increasing the chances of greater response, particularly if it can be shown that the work will be to the ADVANTAGE of the respondent IN THE LONG RUN. Similarly, the respondent will be encouraged to reply if a pre-paid and addressed ENVELOPE is sent out with the questionnaire. But in spite of these ways of reducing non-response, we are bound to have some amount of non-response. Hence, a decision has to be taken about how many RECALLS should be made. The term ‘recall’ implies that we approach the respondent more than once in order to persuade him to respond to our queries.
Another point worth considering is: how long should the process of data collection be continued? Obviously, no such process can be carried out for an indefinite period of time! In fact, the longer the time period over which the survey is conducted, the greater will be the potential VARIATIONS in attitudes and opinions of the respondents. Hence, a well-defined cut-off date generally needs to be established.
Let us now look at the various ways in which we can select a sample from our population. We begin by looking at the difference between non-random and RANDOM sampling.
First of all, what do we mean by non-random sampling?
‘Nonrandom sampling’ implies that kind of sampling in which the population units are drawn into the sample by using one’s personal judgment. This type of sampling is also known as purposive sampling. Within this category, one very important type of sampling is known as Quota Sampling.
In this type of sampling, the selection of the sampling unit from the population is no longer dictated by chance. A sampling frame is not used at all, and the choice of the actual sample units to be interviewed is left to the discretion of the interviewer. However, the interviewer is restricted by quota controls. For example, one particular interviewer may be told to interview ten married women between thirty and forty years of age living in town X, whose husbands are professional workers, and five unmarried professional women of the same age living in the same town. Quota sampling is often used in commercial surveys such as consumer market-research. Also, it is often used in public opinion polls.
There is no need to construct a frame. It is a very quick form of investigation. Cost reduction.
Actually, Random Number Tables are constructed according to certain mathematical principles so that each digit has the same chance of selection. Of course, nowadays randomness may be achieved electronically. Computers have all those programmes by which we can generate random numbers.
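As a rough illustration of generating random numbers electronically, the sketch below uses Python's standard `random` module to draw three-digit sampling numbers for a population of 1000 units. The function name `draw_sampling_numbers` and the seed value are assumptions made for this example. Note that `random.sample` draws without replacement, which corresponds to skipping any number that turns up a second time when reading a random number table.

```python
import random


def draw_sampling_numbers(population_size, sample_size, seed=None):
    """Draw distinct sampling numbers from 0 .. population_size - 1.

    Every number has the same chance of selection; passing a seed makes
    the draw reproducible (the seed itself is an arbitrary choice).
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(population_size), sample_size))


# Ten sampling numbers for a population of 1000 units numbered 000 to 999.
numbers = draw_sampling_numbers(1000, 10, seed=2003)

# Zero-padded labels in the same style as a three-digit random number table.
labels = [f"{n:03d}" for n in numbers]
print(labels)
```

Fixing the seed is a design choice that lets a sample be re-drawn and checked later; omitting it gives a fresh draw on every run.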
EXAMPLE
The following frequency table of distribution gives the ages of a population of 1000 teen-age college students in a particular country. Select a sample of 10 students using the random numbers table. Find the sample mean age and compare with the population mean age.
How will we proceed to select our sample of size 10 from this population of size 1000?
Student-Population of a College

Age (X)   No. of Students (f)
13        6
14        61
15        270
16        491
17        153
18        15
19        4
Total     1000
The first step is to allocate to each student in this population a sampling number. For this purpose, we will begin by constructing a column of cumulative frequencies.
Now that we have the cumulative frequency of each class, we are in a position to allocate the sampling numbers to all the values in a class. As the frequency as well as the cumulative frequency of the first class is 6, we allocate numbers 000 to 005 to the six students who belong to this class.
As the cumulative frequency of the second class is 67 while that of the first class was 6, therefore we allocate sampling numbers 006 to 066 to the 61 students who belong to this class.
Age (X)   No. of Students (f)   Cumulative Frequency (cf)
13        6                     6
14        61                    67
15        270                   337
16        491                   828
17        153                   981
18        15                    996
19        4                     1000
Total     1000

Age (X)   No. of Students (f)   cf     Sampling Numbers
13        6                     6      000 – 005
14        61                    67
15        270                   337
16        491                   828
17        153                   981
18        15                    996
19        4                     1000
Total     1000

Age (X)   No. of Students (f)   cf     Sampling Numbers
13        6                     6      000 – 005
14        61                    67     006 – 066
15        270                   337
16        491                   828
17        153                   981
18        15                    996
19        4                     1000
Total     1000
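The allocation of sampling numbers described above can be sketched in a few lines of Python. The ages and frequencies are taken from the example; the helper names `sampling_ranges` and `age_for_number` are hypothetical, and using a binary search over the cumulative frequencies is simply one convenient way to look up the class to which a sampling number belongs.

```python
from bisect import bisect_right
from itertools import accumulate

ages = [13, 14, 15, 16, 17, 18, 19]
freqs = [6, 61, 270, 491, 153, 15, 4]

# Cumulative frequencies: 6, 67, 337, 828, 981, 996, 1000
cum_freqs = list(accumulate(freqs))


def sampling_ranges(cum_freqs):
    """Return (first, last) sampling numbers for each class, starting at 0."""
    ranges = []
    start = 0
    for cf in cum_freqs:
        ranges.append((start, cf - 1))  # the class ends one below its cf
        start = cf                      # the next class begins at this cf
    return ranges


def age_for_number(number):
    """Return the age of the student holding a given sampling number."""
    if not 0 <= number < cum_freqs[-1]:
        raise ValueError("sampling number outside the population")
    # The first class whose cumulative frequency exceeds the number.
    return ages[bisect_right(cum_freqs, number)]


for age, (first, last) in zip(ages, sampling_ranges(cum_freqs)):
    print(f"{age}: {first:03d} - {last:03d}")
```

For instance, `age_for_number(5)` falls in the first class (age 13), while `age_for_number(6)` and `age_for_number(66)` both fall in the second class (age 14), in agreement with the ranges 000 – 005 and 006 – 066 in the tables.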
The age of each student in this class is 14 years; hence, obviously, the age of the 42nd student is also 14 years. This is how we are able to ascertain the ages of all the students who have been selected in our sampling. You will recall that in this example, our aim was to draw a sample from the population of college students, and to compare the sample’s mean age with the population mean age. The population mean age comes out to be 15.785 years.
The population mean age is:
$$\mu = \frac{\sum fX}{\sum f} = \frac{15785}{1000} = 15.785 \text{ years}$$
The above formula is a slightly modified form of the basic formula that you have known ever since your school days, i.e. the mean is equal to the sum of all the observations divided by the total number of observations. Next, we compute the sample mean age. Ages of students selected in the sample (in years): 14, 15, 16, 15, 16, 16, 17, 15, 16, 16. Adding the 10 values and dividing by 10, we obtain the sample mean age:
$$\bar{X} = \frac{\sum X}{n} = \frac{156}{10} = 15.6 \text{ years}$$
Comparing the sample mean age of 15.6 years with the population mean age of 15.785 years, we note that the difference is really quite slight. The sampling error, i.e. the sample mean minus the population mean, is equal to
$$15.6 - 15.785 = -0.185 \text{ years}$$
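The arithmetic of this example can be checked with a short script. It is a minimal sketch using only the figures given in the example; the variable names are chosen here for illustration, and the sampling error is taken as the sample mean minus the population mean.

```python
ages = [13, 14, 15, 16, 17, 18, 19]
freqs = [6, 61, 270, 491, 153, 15, 4]

# Population mean from the frequency table: sum of f*X divided by sum of f.
total_fx = sum(x * f for x, f in zip(ages, freqs))
population_mean = total_fx / sum(freqs)

# Sample mean: sum of the observations divided by their number.
sample_ages = [14, 15, 16, 15, 16, 16, 17, 15, 16, 16]
sample_mean = sum(sample_ages) / len(sample_ages)

# Sampling error: the statistic minus the parameter.
sampling_error = sample_mean - population_mean

print(f"population mean = {population_mean:.3f} years")
print(f"sample mean     = {sample_mean:.1f} years")
print(f"sampling error  = {sampling_error:.3f} years")
```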
And the reason for such a small error is that we have adopted the RANDOM sampling method. The basic advantage of random sampling is that the probability is very high that the sample will be a good representative of the population from which it has been drawn, and any quantity computed from the sample will be a good estimate of the corresponding quantity computed from the population! Actually, a sample is supposed to be a MINIATURE REPLICA of the population. As stated earlier, there are various other types of random sampling.
Stratified sampling (if the population is heterogeneous)
Systematic sampling (practically, more convenient than simple random sampling)
Cluster sampling (sometimes the sampling units exist in natural clusters)
Multi-stage sampling
All these designs rest upon random or quasi-random sampling. They are various forms of PROBABILITY sampling, that in which each sampling unit has a known (but not necessarily equal) probability of being selected. Because of this knowledge, there exist methods by which the precision and the reliability of the estimates can be calculated OBJECTIVELY. It should be realized that in practice, several sampling techniques are incorporated into each survey design, and only rarely will a simple random sample be used, or a multi-stage design be employed, without stratification. The point to remember is that whatever method be adopted, care should be exercised at every step so as to make the results as reliable as possible.
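Two of the designs listed above, systematic and stratified sampling, can be sketched as follows. These are simplified illustrations rather than full survey procedures: the function names, the frame of 1000 numbered units and the two strata are hypothetical, the systematic sample uses a fixed interval of N // n after a random start, and the stratified sample uses proportional allocation with rounding (so the total may differ slightly from the requested size when the proportions are not whole numbers).

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Select every k-th unit of the frame after a random start in 0 .. k-1."""
    k = len(frame) // sample_size  # sampling interval
    if k < 1:
        raise ValueError("sample size exceeds the size of the frame")
    rng = random.Random(seed)
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(sample_size)]


def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample within each stratum (proportional allocation)."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(sample_size * len(units) / total)  # share of this stratum
        sample[name] = rng.sample(units, n_h)
    return sample


# Hypothetical frame of 1000 units, split into two strata of 600 and 400.
frame = list(range(1000))
strata = {"A": frame[:600], "B": frame[600:]}

sys_sample = systematic_sample(frame, 10, seed=1)
strat_sample = stratified_sample(strata, 10, seed=1)

print(sys_sample)
print({name: sorted(units) for name, units in strat_sample.items()})
```

In both cases every unit has a known probability of selection, which is what makes these designs forms of probability sampling.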