
Summary of basic probability theory, part 2

D. Joyce, Clark University

Math 218, Mathematical Statistics, Jan 2008

Expectation. The expected value E(X), also called the expectation or mean μX , of a random variable X is defined differently for the discrete and continuous cases. For a discrete random variable, it is a weighted average defined in terms of the probability mass function f as

E(X) = μ_X = ∑_x x f(x).

For a continuous random variable, it is defined in terms of the probability density function f as

E(X) = μ_X = ∫_{−∞}^{∞} x f(x) dx.
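For concreteness, here is a short Python sketch of both definitions; the particular pmf and the exponential density are arbitrary choices used only for illustration.

import numpy as np

# Discrete case: E(X) = sum_x x f(x) for a small made-up pmf.
values = np.array([0, 1, 2, 3])
pmf = np.array([0.1, 0.4, 0.3, 0.2])      # probabilities sum to 1
print(np.sum(values * pmf))               # 0(0.1) + 1(0.4) + 2(0.3) + 3(0.2) = 1.6

# Continuous case: E(X) = integral of x f(x) dx, here for an exponential
# density f(x) = lam e^(-lam x) on [0, infinity), whose mean is 1/lam.
lam = 2.0
x = np.linspace(0, 50, 200001)            # grid wide enough to cover the tail
fx = lam * np.exp(-lam * x)
print(np.trapz(x * fx, x))                # approximately 1/lam = 0.5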

There is a physical interpretation of this mean as a center of gravity.

Expectation is a linear operator. That means that the expectation of a sum or difference is the sum or difference of the expectations,

E(X + Y) = E(X) + E(Y),

and that’s true whether or not X and Y are independent, and also

E(cX) = c E(X)

where c is any constant. From these two properties it follows that

E(X − Y) = E(X) − E(Y),

and, more generally, expectation preserves linear combinations

E(∑_{i=1}^{n} c_i X_i) = ∑_{i=1}^{n} c_i E(X_i).
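A quick simulation makes the linearity claims concrete; the following sketch uses an arbitrary pair of deliberately dependent variables, so the agreement is not an artifact of independence.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X and Y are dependent: Y is built from X plus extra noise.
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = 3.0 * x + rng.normal(loc=0.5, scale=1.0, size=n)

# E(X + Y) = E(X) + E(Y), even though X and Y are dependent.
print(np.mean(x + y), np.mean(x) + np.mean(y))

# E(2X - 5Y) = 2 E(X) - 5 E(Y): expectation preserves linear combinations.
print(np.mean(2 * x - 5 * y), 2 * np.mean(x) - 5 * np.mean(y))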

Furthermore, when X and Y are independent, then E(XY) = E(X) E(Y), but that equation doesn’t usually hold when X and Y are not independent.

Variance and standard deviation. The variance of a random variable X is defined as

Var(X) = σ_X^2 = E((X − μ_X)^2) = E(X^2) − μ_X^2

where the last equality follows by expanding the square and applying linearity of expectation. Standard deviation, σ, is defined as the square root of the variance.

Here are a couple of properties of variance. First, if you multiply a random variable X by a constant c to get cX, the variance changes by a factor of the square of c, that is

Var(cX) = c^2 Var(X).

That’s the main reason why we take the square root of variance to normalize it: the standard deviation of cX is |c| times the standard deviation of X. Also, variance is translation invariant, that is, if you add a constant to a random variable, the variance doesn’t change:

Var(X + c) = Var(X).
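Numerically, both properties are easy to check on any sample; the gamma sample and the constants below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=500_000)   # any non-trivial sample works
c = 4.0

print(np.var(c * x), c**2 * np.var(x))   # Var(cX) = c^2 Var(X)
print(np.var(x + 7.5), np.var(x))        # Var(X + c) = Var(X), translation invariance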

In general, the variance of the sum of two random variables is not the sum of the variances of the two random variables. But it is when the two random variables are independent.

Moments, central moments, skewness, and kurtosis. The kth moment of a random variable X is defined as μ_k = E(X^k). Thus, the mean is the first moment, μ = μ_1, and the variance can

be found from the first and second moments, σ^2 = μ_2 − μ_1^2. The kth central moment is defined as E((X − μ)^k). Thus, the variance is the second central moment. The third central moment of the standardized random variable X^* = (X − μ)/σ,

β_3 = E((X^*)^3) = E((X − μ)^3) / σ^3

is called the skewness of X. A distribution that’s symmetric about its mean has 0 skewness. (In fact, all the odd central moments are 0 for a symmetric distribution.) But if it has a long tail to the right and a short one to the left, then it has positive skewness, and negative skewness in the opposite situation. The fourth central moment of X^*,

β_4 = E((X^*)^4) = E((X − μ)^4) / σ^4

is called kurtosis. A fairly flat distribution with long tails has a high kurtosis, while a short-tailed distribution has a low kurtosis. A bimodal distribution has a very low kurtosis. A normal distribution has a kurtosis of 3. (The word kurtosis was coined in the early 20th century from the Greek word for curvature.)
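Sample skewness and kurtosis can be computed directly from these definitions; in the sketch below, the normal and exponential samples are arbitrary test cases chosen because their skewness and kurtosis are known (0 and 3 for the normal, 2 and 9 for the exponential).

import numpy as np

def skewness(sample):
    # beta_3 = E((X*)^3), the third moment of the standardized variable
    z = (sample - sample.mean()) / sample.std()
    return np.mean(z ** 3)

def kurtosis(sample):
    # beta_4 = E((X*)^4), the fourth moment of the standardized variable
    z = (sample - sample.mean()) / sample.std()
    return np.mean(z ** 4)

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=1_000_000)
expo_sample = rng.exponential(size=1_000_000)

print(skewness(normal_sample), kurtosis(normal_sample))   # roughly 0 and 3
print(skewness(expo_sample), kurtosis(expo_sample))       # roughly 2 and 9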

The moment generating function. There is a clever way of organizing all the moments into one mathematical object, and that object is called the moment generating function. It’s a function m(t) of a new variable t defined by

m(t) = E(e^{tX}).

Since the exponential function e^t has the power series

e^t = ∑_{k=0}^{∞} t^k/k! = 1 + t + t^2/2! + ⋯ + t^k/k! + ⋯,

we can rewrite m(t) as follows

m(t) = E(e^{tX}) = 1 + μ_1 t + (μ_2/2!) t^2 + ⋯ + (μ_k/k!) t^k + ⋯.

That implies that m^{(k)}(0), the kth derivative of m(t) evaluated at t = 0, equals the kth moment μ_k of X. In other words, the moment generating function generates the moments of X by differentiation.

For discrete distributions, we can also compute the moment generating function directly in terms of the probability mass function f(x) = P(X = x) as

m(t) = E(e^{tX}) = ∑_x e^{tx} f(x).
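Both facts can be checked symbolically for a simple discrete distribution; the sketch below uses the sympy library and a fair six-sided die as an arbitrary example, building m(t) from the sum above and recovering the first two moments by differentiating at t = 0.

import sympy as sp

t = sp.symbols('t')

# Fair six-sided die: f(x) = 1/6 for x = 1, ..., 6.
m = sum(sp.Rational(1, 6) * sp.exp(t * x) for x in range(1, 7))

mu1 = sp.diff(m, t, 1).subs(t, 0)        # first moment (mean) = 7/2
mu2 = sp.diff(m, t, 2).subs(t, 0)        # second moment = 91/6
variance = sp.simplify(mu2 - mu1 ** 2)   # mu_2 - mu_1^2 = 35/12

print(mu1, mu2, variance)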

For continuous distributions, the moment generating function can be expressed in terms of the probability density function f as

m(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx.
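For a continuous check, the integral can be evaluated numerically; the exponential density below is an arbitrary example whose moment generating function is known in closed form to be lam/(lam − t) for t < lam. The sketch uses scipy for the integration.

import numpy as np
from scipy.integrate import quad

lam = 3.0   # rate of an exponential density f(x) = lam e^(-lam x), x >= 0

def mgf(t):
    # m(t) = integral over the support of e^(tx) f(x) dx
    integrand = lambda x: np.exp(t * x) * lam * np.exp(-lam * x)
    value, _ = quad(integrand, 0, np.inf)
    return value

for t in [0.0, 0.5, 1.0, 2.0]:
    print(t, mgf(t), lam / (lam - t))   # numerical integral vs. closed form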

The moment generating function enjoys the following properties.

Translation. If Y = X + a, then

m_Y(t) = e^{at} m_X(t).

Scaling. If Y = bX, then

m_Y(t) = m_X(bt).

Standardizing. From the last two properties, if X^* = (X − μ)/σ is the standardized random variable for X, then

m_{X^*}(t) = e^{−μt/σ} m_X(t/σ).

Convolution. If X and Y are independent variables, and Z = X + Y, then

m_Z(t) = m_X(t) m_Y(t).
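The convolution property can be verified for a small example; here, continuing with sympy, the MGF of the sum of two independent fair dice (computed from the 36 equally likely pairs) is compared against the product of the individual MGFs at a few values of t.

import sympy as sp

t = sp.symbols('t')
faces = range(1, 7)

# MGFs of two independent fair dice (identically distributed).
m_x = sum(sp.Rational(1, 6) * sp.exp(t * x) for x in faces)
m_y = m_x

# MGF of Z = X + Y computed directly from the joint distribution:
# each of the 36 pairs (x, y) has probability 1/36.
m_z = sum(sp.Rational(1, 36) * sp.exp(t * (x + y)) for x in faces for y in faces)

# For independent X and Y, m_Z(t) = m_X(t) m_Y(t).
for val in [0.0, 0.25, 1.0]:
    print(float(m_z.subs(t, val)), float((m_x * m_y).subs(t, val)))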

The primary use of moment generating functions is to develop the theory of probability. For instance, the easiest way to prove the central limit theorem is to use moment generating functions.

The median, quartiles, quantiles, and percentiles. The median of a distribution X, sometimes denoted μ̃, is the value such that P(X ≤ μ̃) = 1/2.
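The sample versions of these quantities are straightforward to compute; the sketch below uses numpy's built-in median, quantile, and percentile functions on an arbitrary exponential sample.

import numpy as np

rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=100_000)

print(np.median(sample))                        # sample median, near 2 ln 2 for this distribution
print(np.quantile(sample, [0.25, 0.5, 0.75]))   # quartiles
print(np.percentile(sample, 90))                # 90th percentile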