• If the computed probability (the p-value) is very small (e.g., < 0.05 or < 0.01), then we have
evidence against the null hypothesis, i.e., we would conclude a statistically significant
treatment difference or treatment effect (see the t-test sketch at the end of this list).
• Experimental units: rats; 12 observational units (measurements) per rat.
• Continuous variables include those that can take any value along the positive
real line (e.g., weights, heights).
• Discrete variables take positive integer values only (e.g., counts).
• “Not true experimental randomization”: when individuals are randomly selected from
each group, any mean difference detected between two breeds may not be due
to inherent differences between the breeds themselves; there may be confounding factors
that influence the mean differences. Only randomized studies provide valid inference on
causation; observational studies can only infer association.
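A minimal sketch of how such a p-value might come out of a two-sample comparison, using scipy on made-up treatment data (all numbers below are hypothetical, for illustration only):

```python
from scipy import stats

# Hypothetical weight gains (kg) under control vs. treatment
control = [2.1, 1.9, 2.4, 2.0, 2.3, 1.8]
treated = [2.6, 2.8, 2.5, 2.9, 2.7, 2.4]

# Two-sample t-test of H0: no treatment difference
t_stat, p_value = stats.ttest_ind(treated, control)

# If p_value < 0.05 (or < 0.01), we have evidence against H0,
# i.e., a statistically significant treatment effect
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```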
PARAMETER VS. STATISTICS
• A population dataset (size N) contains data for all entities that are the target of inference,
while a sample dataset (size n) contains a subset of the population.
• Population (N) → parameters; sample (n) → statistics.
• Statistical inference is process of using information (i.e., statistics) from a sample of
data to derive a conclusion about characteristics (i.e., parameters) of the population.
• Cumulative frequency – implies order or rank.
• Relative frequencies – sum to 1 (or 100%).
• Frequencies are plotted on the y-axis.
• Judicial inference / scope of inference – arises when you only have access to a certain
subset of experimental units (e.g., a university research farm) that are assumed to be
representative of a much larger population.
• If the variables are not yet observed, they are random, and we generally use upper
case letters; if the variables are already observed or “realized”, we use lower case
letters.
MEASURES OF LOCATION
• Sample mean ȳ = (Σ yᵢ)/n, i = 1, …, n, is a statistic, i.e., an estimator of the population
mean µ = (Σ Yᵢ)/N, i = 1, …, N, the parameter that defines the true state of nature.
• Statistical conclusions based on sample statistics are always subject to sampling error; two
different samples from the same population can lead to different estimates of the
population parameters (illustrated in the simulation sketch at the end of this section).
• Median – the middle value (50% of observations lie above/below it); its position in the ordered data is 0.5(n + 1).
• In data drawn randomly from a normally distributed population, the sample median
should be fairly close to the sample mean.
• Mode – the most frequently occurring value; robust to outliers, like the median.
• If data are symmetric and unimodal, mode = median = mean (e.g., normal distribution).
• Positive skewness: mode < median < mean (and vice versa for negative skew).
• The mean is outlier-sensitive; the geometric mean is more resistant to outliers.
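A short simulation sketch of the points above: two random samples from the same (hypothetical) population give different sample means, and on positively skewed data the mean is pulled toward the tail while the median resists outliers. The population parameters and sample sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of N = 10,000 weights with true mean mu = 50
population = rng.normal(loc=50, scale=5, size=10_000)
mu = population.mean()                         # population parameter

# Two independent random samples of n = 20 -> different sample means (sampling error)
s1 = rng.choice(population, size=20, replace=False)
s2 = rng.choice(population, size=20, replace=False)
print(mu, s1.mean(), s2.mean())

# Positively skewed data: mean > median (mean is outlier-sensitive)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=1_000)
print(skewed.mean(), np.median(skewed))
```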
MEASURES OF DISPERSION
• The pth percentile (0 < p < 100), or quantile, is the value of a variable below which p% of the
values of the data distribution lie.
• Range – difference between smallest and largest observation.
• Interquartile range: Q1 (25th percentile) to Q3 (75th percentile); often a better
descriptor than the range because it is outlier resistant.
• Variance: simply the mean of the squared deviations of the observations from the
mean µ, i.e., σ² = Σ(Yᵢ − µ)²/N.
• An estimator is unbiased if, over repeated sampling, the average value of the statistic
is equal to the parameter inferred upon. Mathematically, we represent this with the
expectation E: E(s²) = σ², where s² is the sample variance with the n − 1 divisor
(see the simulation sketch at the end of this section).
• Standard deviation (s): standard measure of the deviation of the observations from
the mean… positive square root of the variance.
• Coefficient of variation (CV): useful for comparing variation across variables measured
in different units or having different mean responses.
• CV = (s / ȳ) × 100% → an indication of relative variability (e.g., variation of mice weights vs. cattle
weights); also unitless. Generally less than 50%. (s – standard deviation; ȳ – mean; see the
sketch at the end of this section.)
• A quantitative continuous response variable should be roughly normally distributed. If not →
transform the data!
• When the ratio of the largest response value to the smallest equals or exceeds one order of magnitude
(a 10-fold difference), a transformation of Y is likely to be effective.
• Log transformation – a suitable transformation when the CV is constant.
• If a constant a is added to or subtracted from a random variable y, the variance of the transformed
variable x = a + y is unchanged, but the mean shifts by a (see the transformation sketch below).
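A brief sketch of the dispersion measures above (percentiles/IQR, variance, standard deviation, CV) computed with numpy; the data values are made up for illustration:

```python
import numpy as np

# Hypothetical weights, with one large outlier
y = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 12.3, 5.0, 4.4, 5.2])

q1, med, q3 = np.percentile(y, [25, 50, 75])
data_range = y.max() - y.min()     # range: sensitive to the outlier
iqr = q3 - q1                      # interquartile range: outlier resistant

s2 = y.var(ddof=1)                 # sample variance (n - 1 divisor)
s = y.std(ddof=1)                  # standard deviation = sqrt(variance)
cv = s / y.mean() * 100            # coefficient of variation, unitless (%)

print(q1, med, q3, data_range, iqr, s2, s, cv)
```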
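A second sketch, checking the unbiasedness claim E(s²) = σ² by simulation and illustrating a log transformation and the shift-by-a-constant result; the population parameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2 = 4.0                                   # true population variance

# Average of s^2 (n - 1 divisor) over many repeated samples approximates sigma^2
s2_vals = [rng.normal(0, np.sqrt(sigma2), size=10).var(ddof=1)
           for _ in range(50_000)]
print(np.mean(s2_vals), sigma2)                # unbiasedness: ~4.0 vs 4.0

# Log transformation of a right-skewed response
y = rng.lognormal(mean=1.0, sigma=0.8, size=1_000)
log_y = np.log(y)                              # often closer to normal

# Adding a constant a shifts the mean but leaves the variance unchanged
a = 5.0
x = a + y
print(y.var(ddof=1), x.var(ddof=1))            # same variance
print(y.mean(), x.mean())                      # means differ by a
```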
STEM AND LEAF PLOTS AND BOX PLOTS
• A stem-and-leaf display presents a histogram-like picture of the data; one can easily re-sort the
dataset and identify the highest and lowest observations.
• Box-and-whisker plot – the interior box represents the interquartile range; the interior line in
that box represents the median and the crosshair symbol represents the mean. Each whisker
extends up to 1.5 interquartile ranges beyond the box. If a data point lies beyond the
whiskers → outlier.
EMPIRICAL RULE: applies to normally distributed continuous response variables
(symmetric/bell-shaped): 1 SD: 68%; 2 SD: 95%; 3 SD: 99.7%.
Using the range to approximate the SD: Range = (µ + 3σ) – (µ − 3σ) = 6σ, so σ ≈ Range/6.
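A quick simulation check of the empirical rule on normally distributed data (the values of µ and σ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 100.0, 15.0
y = rng.normal(mu, sigma, size=100_000)

# Proportion of observations within 1, 2, and 3 SD of the mean
for k in (1, 2, 3):
    prop = np.mean(np.abs(y - mu) <= k * sigma)
    print(f"within {k} SD: {prop:.3f}")   # ~0.68, ~0.95, ~0.997
```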
Z-TRANSFORMATION AND T-(OR SAMPLE Z) TRANSFORMATION
• It is common to standardize distributions with population mean µ and population
variance σ2 to allow hypothesis testing.
• z = (y - µ) / σ; for normally dist data → standard normal dist.
• The mean of standard normal dist is 0 and variance is 1
• The z-transformation can also be applied to samples, where we substitute the statistics for
the parameters → t-transformation: t = (y − ȳ) / s
• Back-transformation: y = µ + σz or y = ȳ + s·t
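A small sketch of the z- and t-transformations and their back-transformations (the values of µ, σ, y, and the sample are made up):

```python
import numpy as np

# Population standardization: z = (y - mu) / sigma
mu, sigma, y = 50.0, 5.0, 57.5
z = (y - mu) / sigma
print(z, mu + sigma * z)                 # back-transformation recovers y

# Sample standardization: t = (y - ybar) / s
sample = np.array([48.2, 51.0, 49.5, 53.1, 50.4])
ybar, s = sample.mean(), sample.std(ddof=1)
t = (y - ybar) / s
print(t, ybar + s * t)                   # back-transformation recovers y
```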
PROBABILITY DISTRIBUTIONS
• PMF: description of probabilities for a discrete random variable.
• Mean for the probability distribution of a discrete random variable: µ = Σ y·Pr(Y = y)
• Population variance σ²: σ² = Σ (y − µ)²·Pr(Y = y)
• Note that probabilities are always defined between 0 and 1:
o If an event A is certain NOT to occur, then Pr(A) = 0
o If an event A is certain to occur, then Pr(A) = 1
o The probabilities of all mutually exclusive (and exhaustive) events sum to 1
o Two events A and B are mutually exclusive if Pr(A & B) = 0
o Two events that may occur simultaneously are independent if Pr(A) is not affected
by the occurrence of B, and vice versa
o Probability of A & B occurring together = Pr(A) × Pr(B) if the events are independent
• Complement of A, or Aᶜ: Pr(Aᶜ) = 1 − Pr(A)
• Probability Density Functions (PDFs) – probability statements for continuous data;
describe the nature of the data in the population.
• Discrete uniform distribution:
o The mean and variance of a theoretical discrete distribution can be
computed directly from the PMF, without collecting any data.
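A sketch of computing the mean and variance of a theoretical discrete distribution directly from its PMF, using a fair six-sided die (a discrete uniform distribution) as the example:

```python
import numpy as np

y = np.arange(1, 7)                 # possible outcomes of a fair die
p = np.full(6, 1 / 6)               # discrete uniform PMF: Pr(Y = y) = 1/6

mu = np.sum(y * p)                       # mu = sum of y * Pr(Y = y)
sigma2 = np.sum((y - mu) ** 2 * p)       # sigma^2 = sum of (y - mu)^2 * Pr(Y = y)

print(p.sum())          # probabilities sum to 1
print(mu, sigma2)       # 3.5 and ~2.92, with no data collected
```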
BINOMIAL DISTRIBUTION
• The probability of getting y successes in n trials, where the probability of success on
each trial is p, is: Pr(Y = y) = [n! / (y!(n − y)!)] pʸ (1 − p)ⁿ⁻ʸ
• For a binomial random variable, the population mean and variance are: µ = np and σ² = np(1 − p)
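A short example of the binomial PMF, mean, and variance using scipy (the choices of n and p are arbitrary):

```python
from scipy import stats

n, p = 10, 0.3
binom = stats.binom(n, p)

print(binom.pmf(4))      # Pr(Y = 4) = C(10, 4) * 0.3^4 * 0.7^6 ~ 0.200
print(binom.mean())      # mu = n * p = 3.0
print(binom.var())       # sigma^2 = n * p * (1 - p) = 2.1
```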
CONTINUOUS UNIFORM DISTRIBUTION
• We model a continuous random variable with a curve f(x), which represents the height
of the curve at point x, called a probability density function (PDF).
• For continuous random variables, probabilities are areas under the curve.
• For any continuous probability distribution:
o f(x) ≥ 0 for all x
o the area under the entire curve is equal to 1
o Several continuous probability distributions come up frequently: the
Normal, Uniform, and Exponential distributions.
STANDARD NORMAL DISTRIBUTION
• The distribution is symmetric about the mean; the mean is also the median.
• Areas under the curve within 1, 2, and 3 standard deviations of the mean follow the
empirical rule (68%, 95%, 99.7%).
• If X is a random variable that has a normal distribution, we write X ~ N(µ, σ²):
“the random variable X is distributed normally with mean µ and variance σ²”
• The standard normal distribution is a normal distribution with mean 0 and variance 1
• We often represent random variables that have the standard normal distribution with
the letter Z, and we will say: Z ~ N (0,1)
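A brief sketch of probability statements for Z ~ N(0, 1) using scipy; each probability is an area under the standard normal curve:

```python
from scipy import stats

Z = stats.norm(0, 1)                   # standard normal: mean 0, variance 1

print(Z.cdf(1.96))                     # Pr(Z <= 1.96) ~ 0.975
print(Z.cdf(1) - Z.cdf(-1))            # Pr(-1 <= Z <= 1) ~ 0.68 (empirical rule)
print(1 - Z.cdf(2.5))                  # upper-tail probability Pr(Z > 2.5)
```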
STANDARDIZING NORMALLY DISTRIBUTED RANDOM VARIABLES
- Suppose X is a normally distributed random variable with mean µ and standard deviation σ;
then Z = (X − µ)/σ ~ N(0, 1).
- Finding the x percentile: x = µ + zσ, where z is the corresponding standard normal percentile.
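A sketch of standardizing X ~ N(µ, σ²) and of finding the x value for a given percentile via x = µ + zσ; the values of µ, σ, x, and the percentile are assumed for illustration:

```python
from scipy import stats

mu, sigma = 50.0, 5.0

# Standardize: Pr(X <= 57.5) = Pr(Z <= (57.5 - mu) / sigma)
x = 57.5
z = (x - mu) / sigma
print(z, stats.norm.cdf(z))

# Find the 95th percentile of X: x = mu + z_(0.95) * sigma
z95 = stats.norm.ppf(0.95)             # ~1.645
print(mu + z95 * sigma)                # ~58.2
```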