
Lecture Notes on Simple Linear Regression | MATH 241, Study notes of Mathematics

Material Type: Notes; Class: Statistical Applications; Subject: Mathematics; University: Saint Mary's College; Term: Unknown 1990;

Uploaded on 08/05/2009

Simple Linear Regression
Regression expresses a relation used to predict one variable, called the response variable (the "dependent" variable, often called y), from other variables, called predictors (the "independent" variables, often called x1, x2, ..., xk), and provides us with an equation to make this prediction. The regression equation that we calculate is descriptive of the sample (like the sample mean and sample standard deviation); we use various inference methods to see the extent to which we believe the description carries over to the population from which the sample is drawn.
A sample question
This table represents a sample of ten trucks; for each, we have the age in years and the annual maintenance cost. We want to find a linear equation, using this information, which most closely describes ("predicts") the [average] maintenance cost of a truck, based on its age. [Notice how the language tells us the "predictor" is age and the "response" (the "predicted" variable) is cost.]
Truck    Age      Cost
number   (years)  ($ thousands)
 1        1        3.50
 2        2        3.70
 3        2        4.80
 4        2        5.20
 5        2        5.90
 6        3        5.50
 7        4        7.50
 8        4        8.00
 9        5        7.90
10        5        9.50
For this course, we will focus on linear regression, in which we look for the linear equation y = b0 + b1x1 + b2x2 + ... + bkxk which best fits our data, and will begin with "simple" linear regression, using one predictor.
Thus we will have data in pairs: two different variables (a value of the predictor and a value of the response) observed on the same individual (or sampling unit), and we will be looking for the equation y = b0 + b1x which best describes the relation; later we will look at tests to decide whether we have evidence for a similar relation in the population, and at how to use the relation to make predictions.
The linear regression model (the theory we are using)
Our calculations (and our decision of what is a "better" or "worse" fit) are based on the following model (assumptions about the population): There are two random variables X and Y. For each possible value x of X there is a probability distribution of values of Y (written Y|x) which fits the following conditions:
1. The mean of the Y's for a given x (called E(Y|x) or µY|x) is given by a linear equation E(Y|x) = β0 + β1x.
2. For each value of X, the values of Y|x are approximately normally distributed.
3. The standard deviations of the variables Y|x are all the same (for all values x of X). (This is the assumption of "homoscedasticity"; notice it is like the assumption in analysis of variance that all the populations have the same variance.)
Another way of saying all of this is that the values of Y are given by y = β0 + β1x + ε, where ε (the "error of prediction") is a random variable (representing the "random variation" of individuals) which is normally distributed and independent of X. [We will come back to these ideas.]
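To make the model concrete, here is a small Python sketch (not part of the original notes; the population values β0 = 2, β1 = 1.3, and σ = 0.75 are made up for illustration). It draws many values of Y|x for a fixed x and checks that their mean is close to β0 + β1x:

```python
import random

# Hypothetical population values, made up for illustration
# (the "true" beta0, beta1, and sigma are never known in practice).
beta0, beta1, sigma = 2.0, 1.3, 0.75

random.seed(1)

def draw_y(x):
    """One observation from Y|x: the line's mean plus a normal error."""
    return beta0 + beta1 * x + random.gauss(0, sigma)

# For a fixed x, the Y|x values scatter around E(Y|x) = beta0 + beta1 * x
sample = [draw_y(3) for _ in range(10000)]
mean_at_3 = sum(sample) / len(sample)
print(mean_at_3)  # close to beta0 + beta1 * 3 = 5.9
```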
The regression equation
For any linear equation y = b0 + b1x, each data point (xi, yi) gives a "predicted value" ŷi = b0 + b1xi, and there is a residual yi − ŷi which gives the error (the difference between the actual value for that point and the prediction for that point). The "line of best fit in the sense of least squares", or the "regression line for predicting y based on x", or the "OLS [ordinary least squares] line for y based on x", is the line ŷ = b0 + b1x for which the total of the squares of the residuals
(y1 − ŷ1)² + (y2 − ŷ2)² + ··· + (yn − ŷn)²  [ = (y1 − (b0 + b1x1))² + (y2 − (b0 + b1x2))² + ··· + (yn − (b0 + b1xn))² ]
is smallest. [If our model is correct, this is the method that will most often bring us closest to the "real" population line.]
Fortunately, some work with calculus (already done for us by some nice people years ago) gives us the following equations (the "normal equations") for b0 and b1:
Σyi = n·b0 + b1·Σxi
Σxiyi = b0·Σxi + b1·Σxi²
which we can rewrite [after some clever algebra] as
slope = b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = (Σxiyi − n·x̄·ȳ) / Σ(xi − x̄)²,  intercept = b0 = ȳ − b1·x̄
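As a quick check (a sketch, not part of the original notes), the slope and intercept formulas above can be applied to the truck data with a few lines of Python:

```python
# Truck data from the table above: age (years) and annual cost ($ thousands)
x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.50, 3.70, 4.80, 5.20, 5.90, 5.50, 7.50, 8.00, 7.90, 9.50]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Slope and intercept from the formulas above
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

print(b1, b0)  # roughly 1.3167 and 2.2, so cost-hat = 2.2 + 1.3167 * age
```

So each additional year of age is associated with roughly 1.32 thousand dollars more in predicted annual maintenance cost.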


We would not calculate these by hand but would use a statistics package (in Minitab: Stat > Regression > Regression; see the online Minitab handbook; "Response" and "Predictor" are columns containing values of those variables, y and x respectively) or a calculator (Stat > Calc > LinReg(a+bx) or Stat > Calc > LinReg(ax+b), it doesn't matter which, on the TI-8x family, with the list holding the predictor first and the list holding the response second; see the online "Using your calculator for statistics" pages).
The intercept formula tells us that the regression line always goes through the point (x̄, ȳ), which seems reasonable and is useful as a "cheap check". The slope formula can be rewritten (once we know about the correlation coefficient r) in the form b1 = r·(sy/sx), which fits nicely with the idea of slope as change in y over change in x.
It is a fact (which we shall not prove) that if our model is correct, b0 and b1 are unbiased, consistent estimators of β0 and β1 (averaging the b0 values obtained from all possible samples of size n gives β0; similarly for b1). This will be very useful when we want to carry out tests and make estimates for the population values of slope and intercept.
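The unbiasedness claim can be illustrated by simulation (a hypothetical sketch, not part of the original notes: the population values β0 = 2, β1 = 1.3, σ = 0.75 are made up, and the truck ages are reused as a fixed design):

```python
import random

random.seed(2)
# Hypothetical population values (made up) and a fixed set of x values.
beta0, beta1, sigma = 2.0, 1.3, 0.75
xs = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
xbar = sum(xs) / len(xs)
sxx = sum((xi - xbar) ** 2 for xi in xs)

def fit_slope():
    """Draw one sample from the model and return its fitted slope b1."""
    ys = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in xs]
    ybar = sum(ys) / len(ys)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx

# Averaging b1 over many samples recovers beta1 (unbiasedness)
b1s = [fit_slope() for _ in range(5000)]
print(sum(b1s) / len(b1s))  # close to beta1 = 1.3
```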

Coefficient of determination r²: Looking at the sum of squares (of deviations from the mean) for y (that is, SSy = Σ(yi − ȳ)², also called SST), we have the same division into pieces that we used for ANOVA, but SSR, the sum of squares for the regression [the sum of squared deviations of the predictions from the mean], takes the place of SSTR (the sum of squares between groups; here the "groups" are defined by the x values), and SSE is the sum of the squares of the residuals (the residuals are the "error of prediction" values). So SST = SSE + SSR, that is,
Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²
The coefficient of determination measures the proportion of SST (the variation in y) that corresponds to (is explained by) the relation to x:
r² = SSR/SST = 1 − SSE/SST
The value of r² tells us how well the data (from the sample) fit a linear model. We need the appropriate hypothesis test structure (described below under "Testing the regression coefficients") to decide whether this is good enough to be convincing about the population.
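For the truck data, the decomposition SST = SSE + SSR and the resulting r² can be verified numerically (a sketch, not part of the original notes):

```python
x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.50, 3.70, 4.80, 5.20, 5.90, 5.50, 7.50, 8.00, 7.90, 9.50]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]  # predicted values

sst = sum((b - ybar) ** 2 for b in y)             # total variation in y
sse = sum((b - h) ** 2 for b, h in zip(y, yhat))  # residual (error) variation
ssr = sum((h - ybar) ** 2 for h in yhat)          # variation explained by x

print(abs(sst - (sse + ssr)) < 1e-9)  # the decomposition holds
print(ssr / sst)                      # r^2, about 0.87
```

So about 87% of the sample variation in maintenance cost corresponds to the linear relation with age.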

Correlation:
The [sample] correlation coefficient r is given by r = (sign of b1)·√r²  (computation from the data: r = (1/(n − 1))·Σ[((xi − x̄)/sx)·((yi − ȳ)/sy)]).
It measures the extent to which the data points follow a straight line and will have a value ranging from −1 (perfect match to a line with negative slope) through 0 (the points do not follow a line, though they might follow some other curve nicely) to +1 (perfect match to a line with positive slope). Like the variance, it is hard to interpret; the coefficient of determination r² has a more concrete meaning [but loses the information about sign]. The correlation in the population is called ρ.
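Continuing the truck example (a sketch, not part of the original notes), the standardized-products formula gives r directly, and r·(sy/sx) reproduces the slope b1 as claimed above:

```python
import math

x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.50, 3.70, 4.80, 5.20, 5.90, 5.50, 7.50, 8.00, 7.90, 9.50]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))

# r from the standardized-products formula
r = sum(((a - xbar) / sx) * ((b - ybar) / sy) for a, b in zip(x, y)) / (n - 1)
print(r)            # about 0.934
print(r * sy / sx)  # about 1.3167, the slope b1 again
```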

Standard error of the residuals ("σ" for the regression): σ is estimated (from the sample) by
se = √( SSE / (n − 2) ) = √( Σ(yi − ŷi)² / (n − 2) ) = √( Σ(yi − (b0 + b1xi))² / (n − 2) )
This is the value Minitab and your text call "s". It is the [sample] standard deviation of the residuals; its square, se², is the MSE. Note the "n − 2" denominator: there are n − 2 (not n − 1) degrees of freedom because we are using both b0 and b1 in calculating the residuals.
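For the truck data, se works out as follows (a sketch, not part of the original notes):

```python
import math

x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.50, 3.70, 4.80, 5.20, 5.90, 5.50, 7.50, 8.00, 7.90, 9.50]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar

# Sum of squared residuals, then divide by n - 2 degrees of freedom
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
se = math.sqrt(sse / (n - 2))
print(se)  # about 0.755 (thousand dollars)
```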

Repeat of the assumptions of linear regression [these will matter for tests, confidence intervals, etc.]:

  1. The y values for each x are normally distributed.
  2. The point (x, µY|x) lies on the "true" regression line for each value x of X.
  3. The standard deviation of Y|x is the same (σ) for all x's.
  4. Successive observations are independent.

Testing the regression coefficients: is there evidence that there is a linear relation? If there is no linear relation between X and Y, then the [population] regression coefficient β1 is 0. Thus, to decide "Is there a linear relation?" our test is

H0: β1 = 0 [no linear relationship between the variables; values of X are not useful for linear prediction of values of Y]
Ha: β1 ≠ 0 [some linear relation between X and Y]
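A sketch of the resulting test statistic for the truck data (not part of the original notes; the slope's standard error formula se/√Σ(xi − x̄)² is the standard one, though it is not derived here):

```python
import math

x = [1, 2, 2, 2, 2, 3, 4, 4, 5, 5]
y = [3.50, 3.70, 4.80, 5.20, 5.90, 5.50, 7.50, 8.00, 7.90, 9.50]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
se = math.sqrt(sse / (n - 2))
se_b1 = se / math.sqrt(sxx)  # standard error of the slope estimate
t = (b1 - 0) / se_b1         # test statistic for H0: beta1 = 0
print(t)  # about 7.4, compared to a t distribution with n - 2 = 8 df
```

A t value of about 7.4 is far beyond any common critical value for 8 degrees of freedom, so data like these would give strong evidence of a linear relation.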