Statistics Exam 1 Notes, Cheat Sheet of Statistics

Notes for first exam of Statistics Course

Typology: Cheat Sheet

2022/2023

Uploaded on 12/11/2023

carinna-saldana-pierard

  • If the computed probability (the p-value) is very small (e.g., < 0.05 or < 0.01), we have evidence against the null hypothesis, i.e., we would conclude a statistically significant treatment difference or treatment effect.
  • Distinguish experimental units from observational units: e.g., rats are the experimental units, with 12 observational units (measurements) per rat.
  • Continuous variables can take any value on the (positive) real line (e.g., weights, heights).
  • Discrete variables take non-negative integer values only (counts).
  • “Not true experimental randomization”: when individuals are merely randomly selected from each group, any mean difference detected between two breeds may not be due to inherent differences between the breeds themselves; there may be confounding factors that influence the mean differences. Only randomized studies provide valid inference on causation; observational studies can only infer association.

PARAMETER VS. STATISTICS
  • A population dataset (size N) contains data for all entities that are the target of inference, while a sample dataset (size n) contains a subset of the population.
  • N → population parameters; n → sample statistics.
  • Statistical inference is the process of using information (i.e., statistics) from a sample of data to derive a conclusion about characteristics (i.e., parameters) of the population.
  • Cumulative frequency – implies order or rank
  • Relative frequency – relative frequencies sum to 1 (or 100%).
  • Frequencies are given on the y-axis
  • Judicial inference / scope of inference – happens when you have access to a certain subset of experimental units (e.g., university research farm) that are supposed to be representative of a much larger population.
  • If the variables are not yet observed, they are random, and we generally use upper-case letters; if the variables are already observed or “realized”, we use lower-case letters.

MEASURES OF LOCATION
  • Sample mean ȳ = (Σᵢ₌₁ⁿ yᵢ)/n is a statistic, or estimator, of the population mean μ = (Σᵢ₌₁ᴺ Yᵢ)/N, the parameter that defines the true state of nature.
  • Statistical conclusions based on sample statistics are always subject to sampling error; two different samples from the same population could lead to different estimates of the population parameters.
  • Median – simply the middle value (50% of the data lies above/below); its position in the ordered data is 0.5(n + 1).
  • In data drawn randomly from a normally distributed population, the sample median should be fairly close to the sample mean.
  • Mode – the most frequently occurring value; robust to outliers, like the median.
  • Data symmetrical & unimodal: mode = median = mean (normal dist.).
  • Positive skewness: mode < median < mean (vice versa for negative skew).
  • The mean is outlier-sensitive; the geometric mean is more resistant to outliers.

MEASURES OF DISPERSION
  • The pth percentile (0 < p < 100), or quantile, is the value of a variable with p% of the values of the data distribution lying below it.
  • Range – difference between smallest and largest observation.
  • Interquartile range: Q1 (25th percentile) to Q3 (75th percentile); often a better descriptor than the range because it is outlier-resistant.
  • Variance: simply the mean of the squared deviations of the observations from the mean μ, i.e., σ² = Σ(yᵢ − μ)²/N.
  • An estimator is unbiased if, over repeated sampling, the average value of the statistic is equal to the parameter inferred upon. Mathematically, we represent this with the expectation E: E(s²) = σ².
  • Standard deviation (s): standard measure of the deviation of the observations from the mean… positive square root of the variance.
  • Coefficient of variation (CV): useful in comparing variation for variables measured in different units or having different mean responses.
  • CV = (s / ȳ) × 100% → an indication of relative variability (e.g., variation of mice weights vs. cattle weights); also unitless. Generally less than 50%. (s = standard deviation; ȳ = mean.)
  • A quantitative continuous variable should be roughly normally distributed. If not → transform the data!
  • When the ratio of the largest response value to the smallest reaches or exceeds one order of magnitude (a 10-fold difference), a transformation of Y is likely to be effective.
  • Log transformation – a suitable transformation when the CV is constant.
  • If a constant (a) is added to or subtracted from a random variable (y), the transformed variable (x = a + y) has the same variance but a different mean.
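A quick sketch of this shift property, using made-up numbers (the data and the constant are illustrative only, not from the notes):

```python
import statistics

# Hypothetical observations (illustrative values only)
y = [12.0, 15.5, 11.2, 14.8, 13.0]
a = 100  # constant added to every observation

x = [a + yi for yi in y]  # transformed variable x = a + y

# The mean shifts by exactly a ...
print(statistics.mean(x) - statistics.mean(y))
# ... but the variance is unchanged (the deviations from the mean are identical)
print(statistics.variance(y), statistics.variance(x))
```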

STEM AND LEAF PLOTS AND BOX PLOTS

  • A stem-and-leaf display presents a histogram-like picture of the data; one can easily re-sort the dataset and identify the highest and lowest observations.
  • Box-and-whisker plot – the interior box represents the interquartile range; the interior line in that box represents the median, and the crosshair symbol represents the mean. Each whisker extends up to 1.5 interquartile ranges; if a datapoint lies beyond the whiskers → outlier.

EMPIRICAL RULE: applies to normally distributed continuous response variables (symmetric/bell-shaped): 1 SD: 68%; 2 SD: 95%; 3 SD: 99.7%.

Using the range to approximate the SD: σ ≈ Range/6, since Range = (μ + 3σ) − (μ − 3σ) = 6σ.

Z-TRANSFORMATION AND T- (OR SAMPLE Z) TRANSFORMATION
  • It is common to standardize distributions with population mean μ and population variance σ² to allow hypothesis testing.
  • z = (y − μ)/σ; for normally distributed data → standard normal distribution.
  • The mean of the standard normal distribution is 0 and its variance is 1.
  • The z transformation can also be applied to samples, where we substitute the statistics for the parameters → t-transformation: t = (y − ȳ)/s.
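These transformations are easy to sketch in Python; the parameter values and sample below are hypothetical:

```python
from statistics import mean, stdev

mu, sigma = 500.0, 20.0   # hypothetical population parameters
y = 540.0

z = (y - mu) / sigma      # z-transformation: z = (y - mu) / sigma
y_back = mu + sigma * z   # back-transformation recovers y

# Sample version (t-transformation): substitute statistics for parameters
sample = [495.0, 510.0, 488.0, 505.0, 512.0]  # hypothetical sample
t = (y - mean(sample)) / stdev(sample)

print(z, y_back)   # 2.0 540.0
```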
  • Back-transformation: y = μ + σz, or y = ȳ + s·t.

PROBABILITY DISTRIBUTIONS
  • PMF (probability mass function): describes the probabilities for a discrete random variable.
  • Mean for the probability distribution of a discrete random variable: μ = Σₓ x·Pr(X = x).
  • Population variance σ²: σ² = Σₓ (x − μ)²·Pr(X = x).
  • Note that probabilities are always defined between 0 and 1:
    o If an event A is certain NOT to occur, then Pr(A) = 0.
    o If an event A is certain to occur, then Pr(A) = 1.
    o The probabilities of all mutually exclusive events sum to 1.
    o Two events A and B are mutually exclusive if Pr(A & B) = 0.
    o Two events that may occur simultaneously are independent if Pr(A) is not affected by Pr(B), and vice versa.
    o Probability of A & B occurring together = Pr(A)·Pr(B) if the events are independent.
  • Complement of A (Aᶜ): Pr(Aᶜ) = 1 − Pr(A).
  • Probability density functions (PDFs) – probability statements for continuous data; they describe the nature of the data in the population.
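The discrete mean and variance formulas can be sketched with a fair six-sided die (a hypothetical discrete uniform example):

```python
# PMF of a fair six-sided die: Pr(X = x) = 1/6 for x = 1..6
pmf = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())               # mean = Σ x·Pr(X = x)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # variance = Σ (x − μ)²·Pr(X = x)

print(mu, var)   # 3.5 and 35/12 ≈ 2.9167
```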
  • Discrete uniform distribution:
    o The mean and variance of a theoretical discrete distribution can be computed directly from its probability function, without raw data.

BINOMIAL DISTRIBUTION
  • The probability of getting y successes in n trials, where the probability of success on each trial is p, is Pr(Y = y) = C(n, y)·pʸ·(1 − p)ⁿ⁻ʸ, where C(n, y) = n!/(y!(n − y)!).
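A minimal sketch of this binomial probability in Python (the n and p values are arbitrary illustration choices):

```python
from math import comb

def binom_pmf(y, n, p):
    """Pr(Y = y) = C(n, y) * p**y * (1 - p)**(n - y)"""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

n, p = 10, 0.3  # hypothetical: 10 trials, success probability 0.3
probs = [binom_pmf(y, n, p) for y in range(n + 1)]

print(sum(probs))   # the pmf sums to 1 over y = 0..n
```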
  • For a binomial random variable, the population mean and variance are μ = np and σ² = np(1 − p).

CONTINUOUS UNIFORM DISTRIBUTION
  • We model a continuous random variable with a curve f(x), which represents the height of the curve at point x, called a probability density function (PDF).
  • For continuous random variables, probabilities are areas under the curve.
  • For any continuous probability distribution:
    o f(x) ≥ 0 for all x.
    o The area under the entire curve is equal to 1.
  • Several continuous probability distributions come up frequently: the Normal, Uniform, and Exponential distributions.

STANDARD NORMAL DISTRIBUTION
  • The distribution is symmetric about the mean; the mean is also the median.
  • See the areas under the curve corresponding to each standard deviation.
  • If X is a random variable that has a normal distribution, we write X ~ N(μ, σ²): “the random variable X is distributed normally with mean μ and variance σ².”
  • The standard normal distribution is a normal distribution with mean 0 and variance 1
  • We often represent random variables that have the standard normal distribution with the letter Z, and we will say: Z ~ N(0, 1).

STANDARDIZING NORMALLY DISTRIBUTED RANDOM VARIABLES
  • Suppose X is a normally distributed random variable with mean μ and standard deviation σ.
  • Finding the value x for a given percentile: x = μ + σz.
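Percentile lookups of this kind can be sketched with Python's stdlib `statistics.NormalDist` (the 500 kg mean and 20 kg SD mirror the heifer selection example in these notes):

```python
from statistics import NormalDist

herd = NormalDist(mu=500, sigma=20)  # heifer weights: mean 500 kg, SD 20 kg

# Top 10% cutoff = 90th percentile: x = mu + sigma * z(0.90)
cutoff = herd.inv_cdf(0.90)
print(round(cutoff, 1))   # ≈ 525.6, matching mu + 1.28 * sigma from the z-table
```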

For instance, what values of z define the middle 95% probability for Z? Here 1 − α = 0.95, so α = 0.05; hence we need z(α/2 = 0.025). From Table A.1, this is 1.96. That is, Pr(−1.96 ≤ Z ≤ 1.96) = 0.95. Also, Pr(−1 < Z < 1) = 1 − 2(0.1587) = 0.6826 ≈ 0.68.

Suppose we wish to select only the top 10% of heifers in a Brahman heifer population (mean 500 kg, SD 20 kg). What should be the cutoff point for selection? We need z such that Pr(Z > z(α = 0.10)) = 0.10. From Tables A.1 and A.1A, z(α = 0.10) = 1.28. Now remember that z = (y − μ)/σ; thus y = μ + σz = 500 + 1.28 × 20 = 525.6 kg.

SAMPLING DISTRIBUTION OF A MEAN

  • The mean of the sampling distribution of the sample mean is equal to the population mean.
  • The standard deviation of the sampling distribution of Ȳ (the standard error) is equal to σ/√n.
  • If the population is normally distributed, then Ȳ is also normally distributed.
  • For the mean of n observations, when standardizing, we use z = (ȳ − μ)/(σ/√n).

CENTRAL LIMIT THEOREM: the sample mean will be approximately normally distributed for large samples, regardless of the distribution from which we are sampling.

  • The distribution of the sample mean tends toward the normal distribution as n increases.
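This CLT behavior can be sketched by simulation from a decidedly non-normal (uniform) population; the seed, sample size, and replicate count below are arbitrary choices:

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    """Mean of n draws from a Uniform(0, 1) population."""
    return statistics.mean(random.uniform(0, 1) for _ in range(n))

means = [sample_mean(30) for _ in range(5000)]

# Sample means cluster near the population mean 0.5,
# with SD close to sigma / sqrt(n) = sqrt(1/12) / sqrt(30) ≈ 0.0527
print(statistics.mean(means), statistics.stdev(means))
```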
  • Rough guideline: the sample mean is considered approximately normally distributed if n > 30.
  • The CLT tells us that the usual z score involving the sample mean tends to the standard normal distribution as the sample size tends to infinity.
  • According to the z-table (Table A.1 of Freund et al., 2010), the middle 90% should lie within 3.5 ± 1.65 × 0.540 = [2.61, 4.39]; for 100,000 experiments: 5th percentile: 2.6; 95th: 4.

THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
  • The continuous normal distribution can be used to approximate the discrete binomial distribution.
  • The normal approximation is reasonable if both np > 10 and n(1 − p) > 10.
  • Recall, for a binomial random variable X: μ = np; σ² = np(1 − p).
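As a stdlib sketch of this approximation, using the calving numbers from these notes (n = 50 cows, p = 0.30 probability of assistance):

```python
from math import sqrt
from statistics import NormalDist

n, p = 50, 0.30
mu = n * p                  # expected assisted calvings: 15
sd = sqrt(n * p * (1 - p))  # sqrt(10.5) ≈ 3.24

# Approximate Pr(Y >= 20) with the normal distribution
z = (20 - mu) / sd
prob = 1 - NormalDist().cdf(z)
print(round(prob, 3))   # ≈ 0.061 (a z-table lookup with z rounded to 1.54 gives 0.062)
```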
  • To standardize, use Z = (x − μ)/σ → Z ~ N(0, 1).

Suppose the probability that any cow will require herdsperson assistance at calving is about 30%, i.e., p = 0.30. In a herd of 50 cows:
a) the expected number of assisted calvings is μ = np = 50 × 0.30 = 15 cows;
b) the variance of the number of assisted calvings is σ² = np(1 − p) = 10.5, for an SD of 3.24 cows;
c) the approximate probability that the farmer will have to assist 20 or more cows is Pr(Y ≥ 20) ≈ Pr(Z ≥ (20 − 15)/3.24) = Pr(Z ≥ 1.54) = 0.062.

SAMPLING DISTRIBUTION OF A VARIANCE
  • E(s²) = σ² → the sample variance is an unbiased estimator of the population variance; the shape of its sampling distribution depends on n.
  • CHI-SQUARE DISTRIBUTION: related to the standard normal distribution; it arises when normal random variables are squared. If a random variable Z has the standard normal distribution, then Z² has a χ² distribution with 1 DF.
  • If Z₁, Z₂, …, Zₖ are independent standard normal variables, then Z₁² + Z₂² + … + Zₖ² has a χ² distribution with k DF.
  • The mean is equal to k (the degrees of freedom), and the variance is equal to twice the DF (2k).
  • By the CLT, as the DF increase, the skewness decreases.
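These chi-square facts (sum of squared standard normals; mean k; variance 2k) can be sketched by simulation; k, the seed, and the replicate count are arbitrary:

```python
import random
import statistics

random.seed(1)
k = 5  # degrees of freedom

# A chi-square(k) variate is the sum of k squared standard normals
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(20000)]

print(statistics.mean(draws))      # close to k = 5
print(statistics.variance(draws))  # close to 2k = 10
```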
  • Probability statements for variances use the χ² statistic: χ² = (n − 1)s²/σ².

SAMPLING DISTRIBUTION OF SAMPLE MEAN WHEN VARIANCE IS NOT KNOWN
  • When the variance or standard deviation is unknown, or the sample size is small, we cannot use the normal/z distribution to compute probabilities for the sample mean → we use the t-distribution.
  • Suppose we draw a random sample of n observations from a normally distributed population: Z = (X̄ − μ)/(σ/√n) has the standard normal distribution.
  • When the population standard deviation is unknown, we use the sample standard deviation: t = (X̄ − μ)/(s/√n). The sample SD (s) is a statistic and varies from sample to sample → the result no longer has a standard normal distribution (it is not Z), so we call it the t distribution, with n − 1 DF.
  • The denominator of the sample variance (s²) is n − 1. The t statistic looks like the z statistic (which has a standard normal distribution), except we replace the population SD with the sample SD; we are estimating a parameter with a statistic, so there is greater variability.
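Computing the t statistic itself needs only the stdlib; the sample and the hypothesized mean below are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

x = [101.2, 98.7, 103.4, 99.1, 100.8]  # hypothetical small sample
mu0 = 100.0                            # hypothesized population mean

n = len(x)
t = (mean(x) - mu0) / (stdev(x) / sqrt(n))  # t statistic with n - 1 = 4 DF
print(round(t, 3))   # ≈ 0.762
```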
  • Use the t-table to obtain probabilities and critical values for t.

Sample questions:

What percentage of the observations lie within 2 SD of the mean (sample mean 98.3, SD 40.4)?
Low = 98.3 − 2(40.4) = 17.5 (17.51624 per RStudio); High = 98.3 + 2(40.4) = 179.1 (179.0393 per RStudio). Proportion of observations falling within ±2 SD: 94.444%.

From the boxplot and stem-and-leaf plot, we characterize the distribution of CK levels as roughly symmetric/normally distributed (the median is in the middle of the box and the whiskers are roughly equal on each side), with no skew. However, there is a potential outlier in stem 20.

A biologist made a certain pH measurement in each of 24 frogs; she calculated a mean of 7.373 and an SD of 0.129 for the original measurements. She then transformed the data by subtracting 7 from each observation and multiplying by 100 (for example, 7.43 was transformed to 43). What are the mean and standard deviation of the transformed data?
Mean of transformed data = (mean − 7) × 100 = (7.373 − 7) × 100 = 37.3. Standard deviation of transformed data = SD × 100 = 0.129 × 100 = 12.9.

The log transformation changed the spread and symmetry of the data: in stem-and-leaf plot 5.a the data are positively skewed, but in stem-and-leaf plot 5.b the distribution is more symmetric/normal.

Mean (μX) = Σₓ Pr(X = x)·x = (0 × 0.671) + (1 × 0.229) + (2 × 0.053) + …
Variance (σ²X) = Σₓ Pr(X = x)·(x − μ)² = [(0 − 0.498)² × 0.671] + [(1 − 0.498)² × 0.229] + …

What is the probability that any one child receives at most one diagnostic service?
Pr(Xᵢ ≤ 1) = Pr(Xᵢ = 0) + Pr(Xᵢ = 1) = 0.671 + 0.229 = 0.900, or a 90% probability.

What is the probability that Sally and Doug each require one diagnostic service while the other two children require none? By independence,
Pr(X₁ = 1 & X₂ = 1 & X₃ = 0 & X₄ = 0) = Pr(X₁ = 1) × Pr(X₂ = 1) × Pr(X₃ = 0) × Pr(X₄ = 0) = 0.229 × 0.229 × 0.671 × 0.671 ≈ 0.0236.

A man whose cholesterol level falls in the top 2.5% of the cholesterol levels of men aged 18–24 must be remeasured. z at the 97.5th percentile = 1.96, so the cutoff is y = μ + 1.96σ mg/100 mL.

What proportion of the population departs from the true mean body temperature by more than one and a half standard deviations?
Pr(Z > 1.5) = 1 − 0.9332 = 0.0668, or 6.68% (in one direction).

What proportion of all sample standard deviations would be greater than 12 mg/100 mL (n = 20)?
χ² = (n − 1)s²/σ² = (20 − 1)(12²)/9.25² = 31.98 → 0.025 < Pr(χ² > 31.98) < 0.05.

What two bounds define the middle 95% of the distribution of sample standard deviations?