Statistics Exam 1 Notes, Cheat Sheet of Statistics

Notes for first exam of Statistics Course

Typology: Cheat Sheet

2022/2023

Uploaded on 12/11/2023

carinna-saldana-pierard

  • If the computed probability (the p-value) is very small (e.g., < 0.05 or < 0.01), we have evidence against the null hypothesis, i.e., we would conclude a statistically significant treatment difference or treatment effect.
  • Distinguish experimental units from observational units: e.g., rats are the experimental units, with 12 observational units (measurements) per rat.
  • Continuous variables can take any value on the (positive) real line (e.g., weights, heights).
  • Discrete variables take non-negative integer values only (counts).
  • “Not true experimental randomization”: when individuals are merely randomly selected from each group, any mean difference detected between two breeds may not be due to inherent differences between the breeds themselves; there may be confounding factors that influence the mean differences. Only randomized studies provide valid inference on causation; observational studies can only infer association.

PARAMETER VS. STATISTICS
  • A population dataset (size N) contains data for all entities that are the target of inference, while a sample dataset (size n) contains a subset of the population.
  • N → population parameters; n → sample statistics.
  • Statistical inference is the process of using information (i.e., statistics) from a sample of data to derive a conclusion about characteristics (i.e., parameters) of the population.
  • Cumulative frequency – implies order or rank
  • Relative frequency – relative frequencies sum to 1 (or 100%).
  • Frequencies are given on the y-axis
  • Judicial inference / scope of inference – happens when you have access to a certain subset of experimental units (e.g., university research farm) that are supposed to be representative of a much larger population.
  • If the variables are not yet observed, they are random, and we generally use upper-case letters; if the variables are already observed or “realized”, we use lower-case letters.

MEASURES OF LOCATION
  • Sample mean ȳ = (Σᵢ₌₁ⁿ yᵢ)/n is a statistic, or estimator, of the population mean μ = (Σᵢ₌₁ᴺ Yᵢ)/N, the parameter that defines the true state of nature.
  • Statistical conclusions based on sample statistics are always subject to sampling error; two different samples from the same population could lead to different estimates of the population parameters.
  • Median – simply the middle value (50% of the data lies above/below); its position in the ordered data is 0.5(n + 1).
  • In data drawn randomly from a normally distributed population, the sample median should be fairly close to the sample mean.
  • Mode – the most frequently occurring value; robust to outliers, like the median.
  • Data symmetrical & unimodal: mode = median = mean (normal dist.).
  • Positive skewness: mode < median < mean (vice versa for negative skew).
  • The mean is outlier-sensitive; the geometric mean is more resistant to outliers.

MEASURES OF DISPERSION
  • The pth percentile (0 < p < 100), or quantile, is the value of a variable with p% of the values of the data distribution lying below it.
  • Range – difference between smallest and largest observation.
  • Interquartile range: Q1 (25th percentile) to Q3 (75th percentile); often a better descriptor than the range because it is outlier-resistant.
  • Variance: simply the mean of the squared deviations of the observations from the mean μ, i.e., σ² = Σ(yᵢ − μ)²/N.
  • An estimator is unbiased if, over repeated sampling, the average value of the statistic is equal to the parameter inferred upon. Mathematically, we represent this with the expectation E: E(s²) = σ².
  • Standard deviation (s): standard measure of the deviation of the observations from the mean… positive square root of the variance.
  • Coefficient of variation (CV): useful in comparing variation for variables measured in different units or having different mean responses.
  • CV = (s / ȳ) × 100% → an indication of relative variability (e.g., variation of mice weights vs. cattle weights); also unitless. Generally less than 50%. (s = standard deviation; ȳ = mean.)
  • A quantitative continuous variable should be roughly normally distributed. If not → transform the data!
  • When the ratio of the largest response value to the smallest reaches or exceeds one order of magnitude (a 10-fold difference), a transformation of Y is likely to be effective.
  • Log transformation – a suitable transformation when the CV is constant.
  • If a constant (a) is added to or subtracted from a random variable (y), the transformed variable (x = a + y) has the same variance but a different mean.
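A quick sketch of this shift property, using made-up numbers (the data and the constant are illustrative only, not from the notes):

```python
import statistics

# Hypothetical observations (illustrative values only)
y = [12.0, 15.5, 11.2, 14.8, 13.0]
a = 100  # constant added to every observation

x = [a + yi for yi in y]  # transformed variable x = a + y

# The mean shifts by exactly a ...
print(statistics.mean(x) - statistics.mean(y))
# ... but the variance is unchanged (the deviations from the mean are identical)
print(statistics.variance(y), statistics.variance(x))
```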

STEM AND LEAF PLOTS AND BOX PLOTS

  • A stem-and-leaf display presents a histogram-like picture of the data; one can easily re-sort the dataset and identify the highest and lowest observations.
  • Box-and-whisker plot – the interior box represents the interquartile range; the interior line in that box represents the median, and the crosshair symbol represents the mean. Each whisker extends up to 1.5 interquartile ranges; if a datapoint lies beyond the whiskers → outlier.

EMPIRICAL RULE: applies to normally distributed continuous response variables (symmetric/bell-shaped): 1 SD: 68%; 2 SD: 95%; 3 SD: 99.7%.

Using the range to approximate the SD: σ ≈ Range/6, since Range = (μ + 3σ) − (μ − 3σ) = 6σ.

Z-TRANSFORMATION AND T- (OR SAMPLE Z) TRANSFORMATION
  • It is common to standardize distributions with population mean μ and population variance σ² to allow hypothesis testing.
  • z = (y − μ)/σ; for normally distributed data → standard normal distribution.
  • The mean of the standard normal distribution is 0 and its variance is 1.
  • The z transformation can also be applied to samples, where we substitute the statistics for the parameters → t-transformation: t = (y − ȳ)/s.
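These transformations are easy to sketch in Python; the parameter values and sample below are hypothetical:

```python
from statistics import mean, stdev

mu, sigma = 500.0, 20.0   # hypothetical population parameters
y = 540.0

z = (y - mu) / sigma      # z-transformation: z = (y - mu) / sigma
y_back = mu + sigma * z   # back-transformation recovers y

# Sample version (t-transformation): substitute statistics for parameters
sample = [495.0, 510.0, 488.0, 505.0, 512.0]  # hypothetical sample
t = (y - mean(sample)) / stdev(sample)

print(z, y_back)   # 2.0 540.0
```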
  • Back-transformation: y = μ + σz, or y = ȳ + s·t.

PROBABILITY DISTRIBUTIONS
  • PMF (probability mass function): describes the probabilities for a discrete random variable.
  • Mean for the probability distribution of a discrete random variable: μ = Σₓ x·Pr(X = x).
  • Population variance σ²: σ² = Σₓ (x − μ)²·Pr(X = x).
  • Note that probabilities are always defined between 0 and 1:
    o If an event A is certain NOT to occur, then Pr(A) = 0.
    o If an event A is certain to occur, then Pr(A) = 1.
    o The probabilities of all mutually exclusive events sum to 1.
    o Two events A and B are mutually exclusive if Pr(A & B) = 0.
    o Two events that may occur simultaneously are independent if Pr(A) is not affected by Pr(B), and vice versa.
    o Probability of A & B occurring together = Pr(A)·Pr(B) if the events are independent.
  • Complement of A (Aᶜ): Pr(Aᶜ) = 1 − Pr(A).
  • Probability density functions (PDFs) – probability statements for continuous data; they describe the nature of the data in the population.
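The discrete mean and variance formulas can be sketched with a fair six-sided die (a hypothetical discrete uniform example):

```python
# PMF of a fair six-sided die: Pr(X = x) = 1/6 for x = 1..6
pmf = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())               # mean = Σ x·Pr(X = x)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # variance = Σ (x − μ)²·Pr(X = x)

print(mu, var)   # 3.5 and 35/12 ≈ 2.9167
```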
  • Discrete uniform distribution:
    o The mean and variance of a theoretical discrete distribution can be computed directly from its probability function, without raw data.

BINOMIAL DISTRIBUTION
  • The probability of getting y successes in n trials, where the probability of success on each trial is p, is Pr(Y = y) = C(n, y)·pʸ·(1 − p)ⁿ⁻ʸ, where C(n, y) = n!/(y!(n − y)!).
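A minimal sketch of this binomial probability in Python (the n and p values are arbitrary illustration choices):

```python
from math import comb

def binom_pmf(y, n, p):
    """Pr(Y = y) = C(n, y) * p**y * (1 - p)**(n - y)"""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

n, p = 10, 0.3  # hypothetical: 10 trials, success probability 0.3
probs = [binom_pmf(y, n, p) for y in range(n + 1)]

print(sum(probs))   # the pmf sums to 1 over y = 0..n
```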
  • For a binomial random variable, the population mean and variance are μ = np and σ² = np(1 − p).

CONTINUOUS UNIFORM DISTRIBUTION
  • We model a continuous random variable with a curve f(x), which represents the height of the curve at point x, called a probability density function (PDF).
  • For continuous random variables, probabilities are areas under the curve.
  • For any continuous probability distribution:
    o f(x) ≥ 0 for all x.
    o The area under the entire curve is equal to 1.
  • Several continuous probability distributions come up frequently: the Normal, Uniform, and Exponential distributions.

STANDARD NORMAL DISTRIBUTION
  • The distribution is symmetric about the mean; the mean is also the median.
  • See the areas under the curve corresponding to each standard deviation.
  • If X is a random variable that has a normal distribution, we write X ~ N(μ, σ²): “the random variable X is distributed normally with mean μ and variance σ².”
  • The standard normal distribution is a normal distribution with mean 0 and variance 1
  • We often represent random variables that have the standard normal distribution with the letter Z, and we will say: Z ~ N(0, 1).

STANDARDIZING NORMALLY DISTRIBUTED RANDOM VARIABLES
  • Suppose X is a normally distributed random variable with mean μ and standard deviation σ.
  • Finding the value x for a given percentile: x = μ + σz.
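Percentile lookups of this kind can be sketched with Python's stdlib `statistics.NormalDist` (the 500 kg mean and 20 kg SD mirror the heifer selection example in these notes):

```python
from statistics import NormalDist

herd = NormalDist(mu=500, sigma=20)  # heifer weights: mean 500 kg, SD 20 kg

# Top 10% cutoff = 90th percentile: x = mu + sigma * z(0.90)
cutoff = herd.inv_cdf(0.90)
print(round(cutoff, 1))   # ≈ 525.6, matching mu + 1.28 * sigma from the z-table
```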

For instance, what values of z define the middle 95% probability for Z? Here 1 − α = 0.95, so α = 0.05; hence we need z(α/2 = 0.025). From Table A.1, this is 1.96. That is, Pr(−1.96 ≤ Z ≤ 1.96) = 0.95. Also, Pr(−1 < Z < 1) = 1 − 2(0.1587) = 0.6826 ≈ 0.68.

Suppose we wish to select only the top 10% of heifers in a Brahman heifer population (mean 500 kg, SD 20 kg). What should be the cutoff point for selection? We need z such that Pr(Z > z(α = 0.10)) = 0.10. From Tables A.1 and A.1A, z(α = 0.10) = 1.28. Now remember that z = (y − μ)/σ; thus y = μ + σz = 500 + 1.28 × 20 = 525.6 kg.

SAMPLING DISTRIBUTION OF A MEAN

  • The mean of the sampling distribution of the sample mean is equal to the population mean.
  • The standard deviation of the sampling distribution of Ȳ (the standard error) is equal to σ/√n.
  • If the population is normally distributed, then Ȳ is also normally distributed.
  • For the mean of n observations, when standardizing, we use z = (ȳ − μ)/(σ/√n).

CENTRAL LIMIT THEOREM: the sample mean will be approximately normally distributed for large samples, regardless of the distribution from which we are sampling.

  • The distribution of the sample mean tends toward the normal distribution as n increases.
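This CLT behavior can be sketched by simulation from a decidedly non-normal (uniform) population; the seed, sample size, and replicate count below are arbitrary choices:

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    """Mean of n draws from a Uniform(0, 1) population."""
    return statistics.mean(random.uniform(0, 1) for _ in range(n))

means = [sample_mean(30) for _ in range(5000)]

# Sample means cluster near the population mean 0.5,
# with SD close to sigma / sqrt(n) = sqrt(1/12) / sqrt(30) ≈ 0.0527
print(statistics.mean(means), statistics.stdev(means))
```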
  • Rough guideline: the sample mean is considered approximately normally distributed if n > 30.
  • The CLT tells us that the usual z score involving the sample mean tends to the standard normal distribution as the sample size tends to infinity.
  • According to the z-table (Table A.1 of Freund et al., 2010), the middle 90% should lie within 3.5 ± 1.65 × 0.540 = [2.61, 4.39]; for 100,000 experiments: 5th percentile: 2.6; 95th: 4.

THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
  • The continuous normal distribution can be used to approximate the discrete binomial distribution.
  • The normal approximation is reasonable if both np > 10 and n(1 − p) > 10.
  • Recall, for a binomial random variable X: μ = np; σ² = np(1 − p).
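As a stdlib sketch of this approximation, using the calving numbers from these notes (n = 50 cows, p = 0.30 probability of assistance):

```python
from math import sqrt
from statistics import NormalDist

n, p = 50, 0.30
mu = n * p                  # expected assisted calvings: 15
sd = sqrt(n * p * (1 - p))  # sqrt(10.5) ≈ 3.24

# Approximate Pr(Y >= 20) with the normal distribution
z = (20 - mu) / sd
prob = 1 - NormalDist().cdf(z)
print(round(prob, 3))   # ≈ 0.061 (a z-table lookup with z rounded to 1.54 gives 0.062)
```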
  • To standardize, use Z = (x − μ)/σ → Z ~ N(0, 1).

Suppose the probability that any cow will require herdsperson assistance at calving is about 30%, i.e., p = 0.30. In a herd of 50 cows:
a) the expected number of assisted calvings is μ = np = 50 × 0.30 = 15 cows;
b) the variance of the number of assisted calvings is σ² = np(1 − p) = 10.5, for an SD of 3.24 cows;
c) the approximate probability that the farmer will have to assist 20 or more cows is Pr(Y ≥ 20) ≈ Pr(Z ≥ (20 − 15)/3.24) = Pr(Z ≥ 1.54) = 0.062.

SAMPLING DISTRIBUTION OF A VARIANCE
  • E(s²) = σ² → the sample variance is an unbiased estimator of the population variance; the shape of its sampling distribution depends on n.
  • CHI-SQUARE DISTRIBUTION: related to the standard normal distribution; it arises when normal random variables are squared. If a random variable Z has the standard normal distribution, then Z² has a χ² distribution with 1 DF.
  • If Z₁, Z₂, …, Zₖ are independent standard normal variables, then Z₁² + Z₂² + … + Zₖ² has a χ² distribution with k DF.
  • The mean is equal to k (the degrees of freedom), and the variance is equal to twice the DF (2k).
  • By the CLT, as the DF increase, the skewness decreases.
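These chi-square facts (sum of squared standard normals; mean k; variance 2k) can be sketched by simulation; k, the seed, and the replicate count are arbitrary:

```python
import random
import statistics

random.seed(1)
k = 5  # degrees of freedom

# A chi-square(k) variate is the sum of k squared standard normals
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(20000)]

print(statistics.mean(draws))      # close to k = 5
print(statistics.variance(draws))  # close to 2k = 10
```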
  • Probability statements for variances use the χ² statistic: χ² = (n − 1)s²/σ².

SAMPLING DISTRIBUTION OF SAMPLE MEAN WHEN VARIANCE IS NOT KNOWN
  • When the variance or standard deviation is unknown, or the sample size is small, we cannot use the normal/z distribution to compute probabilities for the sample mean → we use the t-distribution.
  • Suppose we draw a random sample of n observations from a normally distributed population: Z = (X̄ − μ)/(σ/√n) has the standard normal distribution.
  • When the population standard deviation is unknown, we use the sample standard deviation: t = (X̄ − μ)/(s/√n). The sample SD (s) is a statistic and varies from sample to sample → the result no longer has a standard normal distribution (it is not Z), so we call it the t distribution, with n − 1 DF.
  • The denominator of the sample variance (s²) is n − 1. The t statistic looks like the z statistic (which has a standard normal distribution), except we replace the population SD with the sample SD; we are estimating a parameter with a statistic, so there is greater variability.
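Computing the t statistic itself needs only the stdlib; the sample and the hypothesized mean below are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

x = [101.2, 98.7, 103.4, 99.1, 100.8]  # hypothetical small sample
mu0 = 100.0                            # hypothesized population mean

n = len(x)
t = (mean(x) - mu0) / (stdev(x) / sqrt(n))  # t statistic with n - 1 = 4 DF
print(round(t, 3))   # ≈ 0.762
```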
  • Use the t-table to obtain probabilities and critical values for t.

Sample questions:

What percentage of the observations lie within 2 SD of the mean (sample mean 98.3, SD 40.4)?
Low = 98.3 − 2(40.4) = 17.5 (17.51624 per RStudio); High = 98.3 + 2(40.4) = 179.1 (179.0393 per RStudio). Proportion of observations falling within ±2 SD: 94.444%.

From the boxplot and stem-and-leaf plot, we characterize the distribution of CK levels as roughly symmetric/normally distributed (the median is in the middle of the box and the whiskers are roughly equal on each side), with no skew. However, there is a potential outlier in stem 20.

A biologist made a certain pH measurement in each of 24 frogs; she calculated a mean of 7.373 and an SD of 0.129 for the original measurements. She then transformed the data by subtracting 7 from each observation and multiplying by 100 (for example, 7.43 was transformed to 43). What are the mean and standard deviation of the transformed data?
Mean of transformed data = (mean − 7) × 100 = (7.373 − 7) × 100 = 37.3. Standard deviation of transformed data = SD × 100 = 0.129 × 100 = 12.9.

The log transformation changed the spread and symmetry of the data: in stem-and-leaf plot 5.a the data are positively skewed, but in stem-and-leaf plot 5.b the distribution is more symmetric/normal.

Mean (μX) = Σₓ Pr(X = x)·x = (0 × 0.671) + (1 × 0.229) + (2 × 0.053) + …
Variance (σ²X) = Σₓ Pr(X = x)·(x − μ)² = [(0 − 0.498)² × 0.671] + [(1 − 0.498)² × 0.229] + …

What is the probability that any one child receives at most one diagnostic service?
Pr(Xᵢ ≤ 1) = Pr(Xᵢ = 0) + Pr(Xᵢ = 1) = 0.671 + 0.229 = 0.900, or a 90% probability.

What is the probability that Sally and Doug each require one diagnostic service while the other two children require none? By independence,
Pr(X₁ = 1 & X₂ = 1 & X₃ = 0 & X₄ = 0) = Pr(X₁ = 1) × Pr(X₂ = 1) × Pr(X₃ = 0) × Pr(X₄ = 0) = 0.229 × 0.229 × 0.671 × 0.671 ≈ 0.0236.

A man whose cholesterol level falls in the top 2.5% of the cholesterol levels of men aged 18–24 must be remeasured. z at the 97.5th percentile = 1.96, so the cutoff is y = μ + 1.96σ mg/100 mL.

What proportion of the population departs from the true mean body temperature by more than one and a half standard deviations?
Pr(Z > 1.5) = 1 − 0.9332 = 0.0668, or 6.68% (in one direction).

What proportion of all sample standard deviations would be greater than 12 mg/100 mL (n = 20)?
χ² = (n − 1)s²/σ² = (20 − 1)(12²)/9.25² = 31.98 → 0.025 < Pr(χ² > 31.98) < 0.05.

What two bounds define the middle 95% of the distribution of sample standard deviations?