OLS in Matrix Form

1 The True Model

  • Let X be an n × k matrix where we have observations on k independent variables for n observations. Since our model will usually contain a constant term, one of the columns in the X matrix will contain only ones. This column should be treated exactly the same as any other column in the X matrix.
  • Let y be an n × 1 vector of observations on the dependent variable.
  • Let ε be an n × 1 vector of disturbances or errors.
  • Let β be a k × 1 vector of unknown population parameters that we want to estimate.

Our statistical model will essentially look something like the following:

$$
\underbrace{\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}}_{n \times 1}
=
\underbrace{\begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{k1} \\
1 & X_{12} & X_{22} & \cdots & X_{k2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{kn}
\end{bmatrix}}_{n \times k}
\underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}}_{k \times 1}
+
\underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}}_{n \times 1}
$$

This can be rewritten more simply as:

y = Xβ + ≤ (1)

This is assumed to be an accurate reflection of the real world. The model has a systematic com-

ponent (Xβ) and a stochastic component (≤). Our goal is to obtain estimates of the population

parameters in the β vector.

2 Criteria for Estimates

Our estimates of the population parameters are referred to as β̂. Recall that the criterion we use for obtaining our estimates is to find the estimator β̂ that minimizes the sum of squared residuals (∑ eᵢ² in scalar notation).¹ Why this criterion? Where does it come from?

The vector of residuals e is given by:

e = y − Xβ̂     (2)

¹ Make sure that you are always careful about distinguishing between disturbances (ε), which cannot be observed, and residuals (e), which can be observed. It is important to remember that ε ≠ e.

The sum of squared residuals (RSS) is e′e.²

$$
\underbrace{\begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}}_{1 \times n}
\underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}}_{n \times 1}
=
\underbrace{\begin{bmatrix} e_1 \times e_1 + e_2 \times e_2 + \cdots + e_n \times e_n \end{bmatrix}}_{1 \times 1}
$$

It should be obvious that we can write the sum of squared residuals as:

$$
\begin{aligned}
e'e &= (y - X\hat{\beta})'(y - X\hat{\beta}) \\
    &= y'y - \hat{\beta}'X'y - y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta} \\
    &= y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta} \qquad (4)
\end{aligned}
$$

where this development uses the fact that the transpose of a scalar is the scalar, i.e. y′Xβ̂ = (y′Xβ̂)′ = β̂′X′y.

To find the β̂ that minimizes the sum of squared residuals, we need to take the derivative of Eq. 4 with respect to β̂. This gives us the following equation:

$$
\frac{\partial e'e}{\partial \hat{\beta}} = -2X'y + 2X'X\hat{\beta} = 0 \qquad (5)
$$

To check that this is a minimum, we would take the derivative of this with respect to β̂ again. This gives us 2X′X. It is easy to see that, so long as X has full rank, this is a positive definite matrix (analogous to a positive real number) and hence a minimum.
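
Setting Eq. 5 to zero yields the normal equations X′Xβ̂ = X′y, whose solution β̂ = (X′X)⁻¹X′y is used repeatedly later in these notes. Here is a minimal numerical sketch of that computation; the simulated data and coefficient values are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # includes a constant column
beta_true = np.array([2.0, -1.0, 0.5])                            # illustrative values
y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations X'X beta-hat = X'y (avoids forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                   # close to beta_true

# Equivalent, and numerically preferable when X is ill-conditioned:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```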


² It is important to note that this is very different from ee′, the variance-covariance matrix of the residuals.

³ Here is a brief overview of matrix differentiation.

$$
\frac{\partial a'b}{\partial b} = \frac{\partial b'a}{\partial b} = a \qquad (6)
$$

when a and b are K × 1 vectors.

$$
\frac{\partial b'Ab}{\partial b} = 2Ab = 2b'A \qquad (7)
$$

when A is any symmetric matrix. Note that you can write the derivative as either 2Ab or 2b′A.

$$
\frac{\partial\, 2\hat{\beta}'X'y}{\partial \hat{\beta}} = \frac{\partial\, 2\hat{\beta}'(X'y)}{\partial \hat{\beta}} = 2X'y \qquad (8)
$$

and

$$
\frac{\partial \hat{\beta}'X'X\hat{\beta}}{\partial \hat{\beta}} = \frac{\partial \hat{\beta}'A\hat{\beta}}{\partial \hat{\beta}} = 2A\hat{\beta} = 2X'X\hat{\beta} \qquad (9)
$$

when X′X is a K × K matrix. For more information, see Greene (2003, 837-841) and Gujarati (2003, 925).
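
As a quick sanity check on these differentiation rules, the sketch below compares the analytic gradient from Eq. 7 with a central finite-difference approximation; the matrix, vector, and step size are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
b = rng.normal(size=K)
A = rng.normal(size=(K, K))
A = (A + A.T) / 2                      # symmetrize so the rule in Eq. 7 applies

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

analytic = 2 * A @ b                              # d(b'Ab)/db = 2Ab
numeric = num_grad(lambda v: v @ A @ v, b)
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```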

What does X′e look like?

$$
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
X_{k1} & X_{k2} & \cdots & X_{kn}
\end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}
=
\begin{bmatrix}
X_{11} \times e_1 + X_{12} \times e_2 + \cdots + X_{1n} \times e_n \\
X_{21} \times e_1 + X_{22} \times e_2 + \cdots + X_{2n} \times e_n \\
\vdots \\
X_{k1} \times e_1 + X_{k2} \times e_2 + \cdots + X_{kn} \times e_n
\end{bmatrix}
$$

From X′e = 0, we can derive a number of properties.

  1. The observed values of X are uncorrelated with the residuals.

X′e = 0 implies that for every column xₖ of X, xₖ′e = 0. In other words, each regressor has zero sample correlation with the residuals. Note that this does not mean that X is uncorrelated with the disturbances; we'll have to assume this.

If our regression includes a constant, then the following properties also hold.

  2. The sum of the residuals is zero.

If there is a constant, then the first column in X (i.e. X₁) will be a column of ones. This means that for the first element in the X′e vector (i.e. X₁₁ × e₁ + X₁₂ × e₂ + ... + X₁ₙ × eₙ) to be zero, it must be the case that ∑ eᵢ = 0.

  3. The sample mean of the residuals is zero.

This follows straightforwardly from the previous property, i.e. ē = ∑ eᵢ / n = 0.

  4. The regression hyperplane passes through the means of the observed values (X̄ and ȳ).

This follows from the fact that ē = 0. Recall that e = y − Xβ̂. Dividing by the number of observations, we get ē = ȳ − x̄β̂ = 0. This implies that ȳ = x̄β̂. This shows that the regression hyperplane goes through the point of means of the data.

  5. The predicted values of y are uncorrelated with the residuals.

The predicted values of y are equal to Xβ̂, i.e. ŷ = Xβ̂. From this we have

$$
\hat{y}'e = (X\hat{\beta})'e = \hat{\beta}'X'e = 0 \qquad (16)
$$

This last development takes account of the fact that X′e = 0.

  6. The mean of the predicted Y's for the sample will equal the mean of the observed Y's, i.e. the sample mean of ŷ equals ȳ.

These properties always hold true. You should be careful not to infer anything from the residuals about the disturbances. For example, you cannot infer that the sum of the disturbances is zero or that the mean of the disturbances is zero just because this is true of the residuals; it is true of the residuals only because we decided to minimize the sum of squared residuals.
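
To make these properties concrete, here is a small numerical illustration (simulated data with arbitrary coefficient values) checking that X′e, the residual mean, and ŷ′e are all zero up to floating-point error when a constant is included.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 3                              # n observations, k columns incl. the constant
X = np.column_stack([np.ones(n),           # constant term
                     rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])     # arbitrary values for the illustration
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
e = y - X @ beta_hat                           # residuals
y_hat = X @ beta_hat                           # predicted values

print(X.T @ e)                  # ~0 vector: regressors uncorrelated with residuals
print(e.mean())                 # ~0: residual mean is zero (constant included)
print(y_hat @ e)                # ~0: predictions uncorrelated with residuals
print(y_hat.mean(), y.mean())   # equal: mean of predicted Y equals mean of observed Y
```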

Note that we know nothing about β̂ except that it satisfies all of the properties discussed above. We need to make some assumptions about the true model in order to make any inferences regarding β (the true population parameters) from β̂ (our estimator of the true parameters). Recall that β̂ comes from our sample, but we want to learn about the true parameters.

4 The Gauss-Markov Assumptions

  1. y = Xβ + ε

This assumption states that there is a linear relationship between y and X.

  2. X is an n × k matrix of full rank.

This assumption states that there is no perfect multicollinearity. In other words, the columns of X are linearly independent. This assumption is known as the identification condition.

  3. E[ε|X] = 0

$$
E\begin{bmatrix} \varepsilon_1|X \\ \varepsilon_2|X \\ \vdots \\ \varepsilon_n|X \end{bmatrix}
= \begin{bmatrix} E(\varepsilon_1) \\ E(\varepsilon_2) \\ \vdots \\ E(\varepsilon_n) \end{bmatrix}
= 0
$$

This assumption, the zero conditional mean assumption, states that the disturbances average out to 0 for any value of X. Put differently, no observations of the independent variables convey any information about the expected value of the disturbance. The assumption implies that E(y) = Xβ. This is important since it essentially says that we get the mean function right.

  4. E(εε′|X) = σ²I

This captures the familiar assumption of homoskedasticity and no autocorrelation. To see why, start with the following:

$$
E(\varepsilon\varepsilon'|X) = E\left[
\begin{bmatrix} \varepsilon_1|X \\ \varepsilon_2|X \\ \vdots \\ \varepsilon_n|X \end{bmatrix}
\begin{bmatrix} \varepsilon_1|X & \varepsilon_2|X & \cdots & \varepsilon_n|X \end{bmatrix}
\right]
$$

5 The Gauss-Markov Theorem

The Gauss-Markov Theorem states that, conditional on assumptions 1-5, there will be no other linear and unbiased estimator of the β coefficients that has a smaller sampling variance. In other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE). How do we know this?

Proof that β̂ is an unbiased estimator of β.

We know from earlier that β̂ = (X′X)⁻¹X′y and that y = Xβ + ε. This means that

$$
\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'(X\beta + \varepsilon) \\
\hat{\beta} &= \beta + (X'X)^{-1}X'\varepsilon \qquad (24)
\end{aligned}
$$

since (X′X)⁻¹X′X = I. This shows immediately that OLS is unbiased so long as either (i) X is fixed (non-stochastic), so that we have:

$$
\begin{aligned}
E[\hat{\beta}] &= E[\beta] + E[(X'X)^{-1}X'\varepsilon] \\
&= \beta + (X'X)^{-1}X'E[\varepsilon] \qquad (25)
\end{aligned}
$$

where E[ε] = 0 by assumption, or (ii) X is stochastic but independent of ε, so that we have:

$$
\begin{aligned}
E[\hat{\beta}] &= E[\beta] + E[(X'X)^{-1}X'\varepsilon] \\
&= \beta + (X'X)^{-1}E[X'\varepsilon] \qquad (26)
\end{aligned}
$$

where E(X′ε) = 0.
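
One quick way to see unbiasedness at work is a small Monte Carlo sketch: holding X fixed across simulated samples with E[ε] = 0, the average of the OLS estimates should sit very close to the true β. The data-generating values below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed (non-stochastic) design
beta = np.array([0.5, 1.5, -2.0])                            # "true" parameters (illustrative)

estimates = []
for _ in range(5000):                       # repeated samples: same X, fresh disturbances
    eps = rng.normal(scale=2.0, size=n)     # E[eps] = 0, homoskedastic
    y = X @ beta + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))           # close to [0.5, 1.5, -2.0]
```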

Proof that β̂ is a linear estimator of β.

From Eq. 24, we have:

$$
\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon \qquad (27)
$$

Since we can write β̂ = β + Aε where A = (X′X)⁻¹X′, we can see that β̂ is a linear function of the disturbances. By the definition that we use, this makes it a linear estimator (see Greene (2003, 45)).

Proof that β̂ has minimal variance among all linear and unbiased estimators.

See Greene (2003, 46-47).

6 The Variance-Covariance Matrix of the OLS Estimates

We can derive the variance-covariance matrix of the OLS estimator, β̂.

$$
\begin{aligned}
E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] &= E[((X'X)^{-1}X'\varepsilon)((X'X)^{-1}X'\varepsilon)'] \\
&= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}] \qquad (28)
\end{aligned}
$$

where we take advantage of the fact that (AB)′ = B′A′, i.e. we can rewrite ((X′X)⁻¹X′ε)′ as ε′X(X′X)⁻¹. If we assume that X is non-stochastic, we get:

$$
E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1} \qquad (29)
$$

From Eq. 22, we have E[εε′] = σ²I. Thus, we have:

$$
\begin{aligned}
E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] &= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\
&= \sigma^2 I (X'X)^{-1}X'X(X'X)^{-1} \\
&= \sigma^2 (X'X)^{-1} \qquad (30)
\end{aligned}
$$

We estimate σ² with σ̂², where:

$$
\hat{\sigma}^2 = \frac{e'e}{n - k}
$$

To see the derivation of this, see Greene (2003, 49).

What does the variance-covariance matrix of the OLS estimator look like?

$$
E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] =
\begin{bmatrix}
\operatorname{var}(\hat{\beta}_1) & \operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) & \cdots & \operatorname{cov}(\hat{\beta}_1, \hat{\beta}_k) \\
\operatorname{cov}(\hat{\beta}_2, \hat{\beta}_1) & \operatorname{var}(\hat{\beta}_2) & \cdots & \operatorname{cov}(\hat{\beta}_2, \hat{\beta}_k) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{cov}(\hat{\beta}_k, \hat{\beta}_1) & \operatorname{cov}(\hat{\beta}_k, \hat{\beta}_2) & \cdots & \operatorname{var}(\hat{\beta}_k)
\end{bmatrix}
$$

As you can see, the standard errors of the β̂ are given by the square roots of the elements along the main diagonal of this matrix.
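
Putting Eq. 30 and the estimator σ̂² = e′e/(n − k) together, here is a minimal sketch of the classical variance-covariance matrix and the associated standard errors (simulated data; the variable names are just local choices for the example).

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, -1.2]) + rng.normal(scale=1.5, size=n)   # illustrative values

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

sigma2_hat = (e @ e) / (n - k)          # sigma^2-hat = e'e / (n - k)
vcov = sigma2_hat * XtX_inv             # sigma^2-hat (X'X)^{-1}, as in Eq. 30
se = np.sqrt(np.diag(vcov))             # standard errors from the main diagonal

print(beta_hat)
print(se)
```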

6.1 Hypothesis Testing

Recall Assumption 6 from earlier, which stated that ε|X ∼ N[0, σ²I]. I had stated that this assumption was not necessary for the Gauss-Markov Theorem but was crucial for testing inferences about β̂. Why? Without this assumption, we know nothing about the distribution of β̂. How does this assumption about the distribution of the disturbances tell us anything about the distribution of β̂? Well, we just saw in Eq. 27 that the OLS estimator is just a linear function of the disturbances. By assuming that the disturbances have a multivariate normal distribution, i.e.

$$
\varepsilon \sim N[0, \sigma^2 I] \qquad (33)
$$

assumption to make. Our OLS standard errors will be incorrect insofar as:

$$
X'E[\varepsilon\varepsilon']X \neq \sigma^2(X'X) \qquad (38)
$$

Note that our OLS standard errors may be too big or too small. So, what can we do if we suspect that there is heteroskedasticity?

Essentially, there are two options.

  1. Weighted Least Squares: To solve the problem, we just need to find something that is proportional to the variance. We might not know the variance for each observation, but if we know something about where it comes from, then we might know something that is proportional to it. In effect, we try to model the variance. Note that this only solves the problem of heteroskedasticity if we assume that we have modelled the variance correctly; we never know if this is true or not.

  2. Robust standard errors (White 1980): This method treats heteroskedasticity as a nuisance rather than something to be modelled.

How do robust standard errors work? We never observe the disturbances (ε), but we do observe the residuals (e). While each individual residual (eᵢ) is not going to be a very good estimator of the corresponding disturbance (εᵢ), White (1980) showed that X′ee′X is a consistent (but not unbiased) estimator of X′E[εε′]X.⁶

Thus, the variance-covariance matrix of the coefficient vector from the White estimator is:

$$
\operatorname{var\text{-}cov}(\hat{\beta}) = (X'X)^{-1}X'ee'X(X'X)^{-1} \qquad (39)
$$

rather than:

$$
\begin{aligned}
\operatorname{var\text{-}cov}(\hat{\beta}) &= (X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \\
&= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \qquad (40)
\end{aligned}
$$

from the normal OLS estimator.
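
Here is a rough sketch of how Eq. 39 is computed in practice. Note that the middle "meat" term is coded in its diagonal form X′diag(eᵢ²)X, i.e. ∑ᵢ eᵢ²xᵢxᵢ′, which is the usual way White's (HC0) estimator is implemented; the data and variable names are purely illustrative.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with classical and HC0 (White) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat

    # Classical: sigma^2-hat (X'X)^{-1}
    vcov_classical = (e @ e) / (n - k) * XtX_inv

    # White/HC0 sandwich: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
    meat = X.T @ (X * (e ** 2)[:, None])
    vcov_robust = XtX_inv @ meat @ XtX_inv

    return beta_hat, np.sqrt(np.diag(vcov_classical)), np.sqrt(np.diag(vcov_robust))

# Example with heteroskedastic disturbances (variance grows with x):
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 4, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + x, size=n)
print(ols_with_robust_se(X, y))
```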

White (1980) suggested that we could test for the presence of heteroskedasticity by examining the extent to which the OLS estimator diverges from his own estimator. White's test is to regress the squared residuals (eᵢ²) on the terms in X′X, i.e. on the squares and the cross-products of the independent variables. If the R² exceeds a critical value (nR² ∼ χ²ₖ), then heteroskedasticity causes problems. At that point, use the White estimator (assuming your sample is sufficiently large). Neal Beck suggests that, by and large, using the White estimator can do little harm and some good.
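
A rough sketch of this version of White's test follows. The auxiliary design uses the levels, squares, and pairwise cross-products of the non-constant regressors, and the degrees of freedom count the auxiliary regressors excluding the constant; these are common textbook choices, so check your own software's exact implementation before relying on it.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def white_test(X, e):
    """nR^2 statistic from regressing e^2 on levels, squares, and cross-products of X."""
    n = X.shape[0]
    regs = [X[:, j] for j in range(1, X.shape[1])]       # skip the constant column
    Z_cols = [np.ones(n)] + regs
    Z_cols += [r ** 2 for r in regs]
    Z_cols += [a * b for a, b in combinations(regs, 2)]
    Z = np.column_stack(Z_cols)

    e2 = e ** 2
    gamma = np.linalg.lstsq(Z, e2, rcond=None)[0]
    resid = e2 - Z @ gamma
    r2 = 1 - (resid @ resid) / ((e2 - e2.mean()) @ (e2 - e2.mean()))

    stat = n * r2
    df = Z.shape[1] - 1                                   # auxiliary terms, excluding constant
    p_value = 1 - stats.chi2.cdf(stat, df)
    return stat, p_value
```

In practice you would feed in the design matrix and the residuals from the OLS fit above and compare the p-value (or nR² against the chi-squared critical value) at your chosen significance level.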

⁶ It is worth remembering that X′ee′X is a consistent (but not unbiased) estimator of X′E[εε′]X; this means that robust standard errors are only appropriate when the sample is relatively large (say, greater than 100 degrees of freedom).

8 Partitioned Regression and the Frisch-Waugh-Lovell Theorem

Imagine that our true model is:

y = X₁β₁ + X₂β₂ + ε     (41)

In other words, there are two sets of independent variables. For example, X₁ might contain some independent variables (perhaps also the constant) whereas X₂ contains some other independent variables. The point is that X₁ and X₂ need not be two variables only. We will estimate:

y = X₁β̂₁ + X₂β̂₂ + e     (42)

Say we wanted to isolate the coefficients associated with X₂, i.e. β̂₂. The normal form equations will be:


$$
\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
=
\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}
\qquad (43)
$$

First, let's solve for β̂₁:

$$
\begin{aligned}
(X_1'X_1)\hat{\beta}_1 + (X_1'X_2)\hat{\beta}_2 &= X_1'y \\
(X_1'X_1)\hat{\beta}_1 &= X_1'y - (X_1'X_2)\hat{\beta}_2 \\
\hat{\beta}_1 &= (X_1'X_1)^{-1}X_1'y - (X_1'X_1)^{-1}X_1'X_2\hat{\beta}_2 \\
\hat{\beta}_1 &= (X_1'X_1)^{-1}X_1'(y - X_2\hat{\beta}_2) \qquad (44)
\end{aligned}
$$

8.1 Omitted Variable Bias

The solution shown in Eq. 44 is the set of OLS coefficients in the regression of y on X₁, i.e. (X₁′X₁)⁻¹X₁′y, minus a correction vector (X₁′X₁)⁻¹X₁′X₂β̂₂. This correction vector is the equation for omitted variable bias. The first part of the correction vector up to β̂₂, i.e. (X₁′X₁)⁻¹X₁′X₂, is just the set of coefficients from regressing each of the variables in X₂, one at a time, on all the variables in X₁, collected into a matrix. This will only be zero if the variables in X₁ are linearly unrelated (uncorrelated or orthogonal) to the variables in X₂. The correction vector will also be zero if β̂₂ = 0, i.e. if the X₂ variables have no impact on y. Thus, you can ignore all potential omitted variables that are either (i) unrelated to the included variables or (ii) unrelated to the dependent variable. Any omitted variables that do not meet these conditions will change your estimates of β̂₁ if they were to be included.
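
A small simulation sketch of this logic: when an omitted variable is correlated with an included one and affects y, the short-regression coefficient is pulled away from the full-regression value by exactly the correction term (X₁′X₁)⁻¹X₁′X₂β̂₂. All numbers below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)          # "omitted" variable, correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])       # short regression: omit x2
X_full = np.column_stack([np.ones(n), x1, x2])

b_short = np.linalg.lstsq(X1, y, rcond=None)[0]
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Correction term (X1'X1)^{-1} X1'x2 times the full-model beta2-hat
P12 = np.linalg.solve(X1.T @ X1, X1.T @ x2)
print(b_short[1])                      # roughly 2 + 0.7 * 3 = 4.1
print(b_full[1] + P12[1] * b_full[2])  # reproduces the short coefficient exactly (Eq. 44 logic)
```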

Greene (2003, 148) writes the omitted variable formula slightly differently. He has

E[b₁] = β₁ + P₁.₂β₂

where P₁.₂ = (X₁′X₁)⁻¹X₁′X₂ and b₁ is the coefficient vector of a regression omitting the X₂ variables.

⁷ To see this, compare with Eq. 10.

Now we insert this into (2) of Eq. 43. This gives us:

$$
\begin{aligned}
X_2'y &= X_2'X_1(X_1'X_1)^{-1}X_1'y - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\hat{\beta}_2 + X_2'X_2\hat{\beta}_2 \\
X_2'y - X_2'X_1(X_1'X_1)^{-1}X_1'y &= X_2'X_2\hat{\beta}_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\hat{\beta}_2 \\
X_2'y - X_2'X_1(X_1'X_1)^{-1}X_1'y &= [X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2]\hat{\beta}_2 \\
X_2'y - X_2'X_1(X_1'X_1)^{-1}X_1'y &= [(X_2' - X_2'X_1(X_1'X_1)^{-1}X_1')X_2]\hat{\beta}_2 \\
X_2'y - X_2'X_1(X_1'X_1)^{-1}X_1'y &= [X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2]\hat{\beta}_2 \\
(X_2' - X_2'X_1(X_1'X_1)^{-1}X_1')y &= [X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2]\hat{\beta}_2 \\
X_2'(I - X_1(X_1'X_1)^{-1}X_1')y &= [X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2]\hat{\beta}_2 \\
\hat{\beta}_2 &= [X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2]^{-1}X_2'(I - X_1(X_1'X_1)^{-1}X_1')y \\
&= (X_2'M_1X_2)^{-1}(X_2'M_1y) \qquad (51)
\end{aligned}
$$

Recall that M is the residual maker. In this case, M₁ makes residuals for regressions on the X₁ variables: M₁y is the vector of residuals from regressing y on the X₁ variables, and M₁X₂ is the matrix made up of the column-by-column residuals from regressing each variable (column) in X₂ on all the variables in X₁.

Because M is both idempotent and symmetric, we can rewrite Eq. 51 as

$$
\hat{\beta}_2 = (X_2^{*\prime}X_2^{*})^{-1}X_2^{*\prime}y^{*} \qquad (52)
$$

where X₂* = M₁X₂ and y* = M₁y.

From this it is easy to see that β̂₂ can be obtained by regressing y* on X₂* (you'll get good at spotting regressions, i.e. equations of the (X′X)⁻¹X′y form). The starred variables are just the residuals of the variables (y or X₂) after regressing them on the X₁ variables.

This leads to the Frisch-Waugh-Lovell Theorem: in the OLS regression of vector y on two sets of variables, X₁ and X₂, the subvector β̂₂ is the set of coefficients obtained when the residuals from a regression of y on X₁ alone are regressed on the set of residuals obtained when each column of X₂ is regressed on X₁.

We'll come back to the FWL Theorem when we look at fixed effects models.
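
A compact numerical check of Eqs. 51-52 and the FWL Theorem (simulated data; the residual maker M₁ is formed explicitly only because the example matrices are small).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])      # first block (incl. the constant)
X2 = rng.normal(size=(n, 2))                                 # second block
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Full regression: y on [X1, X2]
X = np.column_stack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# FWL route: residual maker M1, then regress M1 y on M1 X2
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
y_star, X2_star = M1 @ y, M1 @ X2
beta2_fwl = np.linalg.lstsq(X2_star, y_star, rcond=None)[0]

print(beta_full[2:])    # coefficients on X2 from the full regression
print(beta2_fwl)        # identical, up to floating-point error
```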

8.4 Example

Imagine we have the following model:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ε

If we regressed Y on X₁, X₂, and X₃, we would get β̂₁, β̂₂, and β̂₃. We could get these estimators differently. Say we partitioned the variables into (i) X₁ and (ii) X₂ and X₃.

Step 1: regress Y on X₁ and obtain the residuals (e1), i.e. M₁y.

Step 2: regress X₂ on X₁ and obtain the residuals (e2), i.e. the first column of M₁X₂.

Step 3: regress X₃ on X₁ and obtain the residuals (e3), i.e. the second column of M₁X₂.

Step 4: regress e1 on e2 and e3, i.e. regress M₁y on M₁X₂.

Step 5: the coefficient on e2 will be β̂₂ and the coefficient on e3 will be β̂₃.

Steps 2 and 3 are called partialing out or netting out the effect of X₁. For this reason, the coefficients in multiple regression are often called partial regression coefficients. This is what it means to say we are holding the X₁ variables constant in the regression.

So the difference between regressing Y on both X₁ and X₂ instead of on just X₂ is that in the first case we first regress both the dependent variable and all the X₂ variables separately on X₁ and then regress the residuals on each other, whereas in the second case we just regress y on the X₂ variables.
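
Following the five steps with simulated data (all coefficient values below are arbitrary), the partialled-out coefficients match those from the full multiple regression:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
Y = 1.0 + 0.7 * x1 + 2.0 * x2 - 1.5 * x3 + rng.normal(size=n)

X1_block = np.column_stack([np.ones(n), x1])   # block (i): constant and X1

def resid(target, Z):
    """Residuals from regressing `target` on the columns of Z."""
    return target - Z @ np.linalg.lstsq(Z, target, rcond=None)[0]

e1 = resid(Y, X1_block)    # Step 1
e2 = resid(x2, X1_block)   # Step 2
e3 = resid(x3, X1_block)   # Step 3

# Step 4: regress e1 on e2 and e3 (no constant needed; the residuals have mean ~0)
coefs = np.linalg.lstsq(np.column_stack([e2, e3]), e1, rcond=None)[0]
print(coefs)               # Step 5: these match the coefficients on X2 and X3 below

full = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2, x3]), Y, rcond=None)[0]
print(full[2:])            # beta2-hat and beta3-hat from the full regression
```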