linear regression
DESCRIPTION
Linear Regression: the least squares regression model. A regression line describes how a response variable y changes as an explanatory variable x changes. We often use regression to predict the value of y for a given value of x. - PowerPoint presentation transcript
Linear Regression
The least squares regression model
Regression Line
A regression line is a line that describes how a response variable y changes as an explanatory variable x changes.
We often use regression to predict the value of y for a given x value.
Equation of a Regression Line
A regression line relating x to y has an equation of the form:

$\hat{y} = a + bx$

• $\hat{y}$ (read "y hat") is the predicted value of the response variable y for a given value of the explanatory variable x.
• b is the slope, the amount by which y is expected to change when x increases by one unit.
• a is the y-intercept, the predicted value of y when x = 0.
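A minimal sketch of evaluating a regression line in Python (the intercept and slope values below are made up for illustration):

```python
def predict(a, b, x):
    """Predicted response y-hat = a + b*x for explanatory value x."""
    return a + b * x

# Hypothetical line with intercept a = 2.0 and slope b = 0.5:
# each one-unit increase in x is expected to raise y by 0.5.
y_hat = predict(2.0, 0.5, 10)  # 7.0
```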
Prediction
Interpolation is the use of a regression line to predict within the range of known observations.
Extrapolation is the use of a regression line to predict outside the range of known observations.
• Predictions from extrapolation are often not accurate.
Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
• Residual = observed y − predicted y
• Residual = $y - \hat{y}$
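In code, a residual is just observed minus predicted, one per observation (the numbers below are invented for illustration):

```python
def residuals(ys, y_hats):
    """Residual = observed y - predicted y, one per observation."""
    return [y - yh for y, yh in zip(ys, y_hats)]

# Made-up observations and the line's predictions for them:
obs = [3.0, 5.0, 7.5]
pred = [2.8, 5.4, 7.1]
res = residuals(obs, pred)  # approximately [0.2, -0.4, 0.4]
```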
Least Squares Regression Line
The least squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.
Equation: $\hat{y} = a + bx$, where

$b = r\frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x}$
Other Calculations

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \qquad a = \bar{y} - b\bar{x}$
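The same line can also be computed from the raw sums, with no summary statistics needed; a sketch on invented, perfectly linear data:

```python
def least_squares_sums(xs, ys):
    """b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2); a = y_bar - b*x_bar."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Toy data following y = 1 + 2x exactly, so a = 1 and b = 2.
a2, b2 = least_squares_sums([1, 2, 3, 4], [3, 5, 7, 9])
```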
How well does a line fit the data?
Since residuals tell us how far the data fall from the regression line, they are a natural place to look when assessing the fit.
A residual plot is a scatterplot of the residuals against the explanatory variable.
How do residual plots help us assess the fit of the data?
A residual plot in effect turns the regression line horizontal.
Residual plots magnify the deviations of points from the line, making it easier to see unusual observations and patterns.
What we look for in residual plots
There should be NO obvious pattern:
• A curved pattern shows a nonlinear relationship.
• A megaphone pattern shows that the residuals grow with x.
The residuals should be relatively small:
• They represent the typical prediction error.
The Average Prediction Error
The standard deviation of the residuals (s):

$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
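A sketch of the residual standard deviation formula in code (the observed and predicted values are made up):

```python
import math

def residual_stdev(ys, y_hats):
    """s = sqrt( sum((y_i - y-hat_i)^2) / (n - 2) )."""
    n = len(ys)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    return math.sqrt(sse / (n - 2))

# Made-up observations and predictions; the squared residuals sum to 4
# and n - 2 = 2, so s = sqrt(2), about 1.414.
s = residual_stdev([1, 2, 3, 4], [1, 2, 3, 2])
```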
Home Example
We want to predict the price of a home in Arvada. A random sample of 10 homes for sale is taken (prices in thousands).
Make a prediction for the cost of the 11th house if we know its square footage is 1789 ft².

| Square ft | Price (thousands) | Square ft | Price (thousands) |
| --- | --- | --- | --- |
| 1429 | 201 | 1785 | 325 |
| 1982 | 333 | 2001 | 450 |
| 1359 | 205 | 1835 | 360 |
| 1761 | 370 | 1948 | 407 |
| 1883 | 454 | 1489 | 293 |
Well, here is what I would do:
I would make a scatter plot. Then I would find the linear regression line. Finally, I would use the regression line to predict the cost.
Here's what I found:
• $\hat{y} = -231.67 + 0.33x$
• r = .87
• r² = .76
Thus the price of the home would be $353.47 thousand.
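We can check this worked example by fitting the least squares line to the ten houses from the table, using the sum formulas from earlier in the deck (standard library only):

```python
sqft  = [1429, 1982, 1359, 1761, 1883, 1785, 2001, 1835, 1948, 1489]
price = [201, 333, 205, 370, 454, 325, 450, 360, 407, 293]  # thousands

x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)            # 339.8, the mean price

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), a = y_bar - b*x_bar
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
     / sum((x - x_bar) ** 2 for x in sqft))
a = y_bar - b * x_bar                      # roughly -231.7

# Predicted price of the 11th house at 1789 sq ft: about 353.5 thousand.
prediction = a + b * 1789
```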
What is this r² thing?
r² is the coefficient of determination.
• Yes, I know it is r squared, but why do we bother?
More House Example
Now I am going to change one small thing in our house example: we don't know the size of the 11th house. What would you predict the price to be now?
Without an x value, the best available prediction is the mean price: I would predict the price to be $339.8 thousand.
Not as good as our last prediction, but not bad.
Explained vs. Unexplained Variability
We would expect our linear regression model to predict the price better than the mean, but is it really that much different?
The sum of squared prediction errors if we use the mean is 70913.6.
• This is the sum of squares of TOTAL variation, SST.
The sum of squared residuals is 16754.6.
• This is the sum of squares of the ERROR, SSE.
How SST and SSE make r²
The ratio SSE/SST tells us the proportion of variation in y still unexplained.
• SSE/SST = 16754.6 / 70913.6 = .236
Thus 23.6% of the variation is unaccounted for in our model.
• The proportion accounted for by our model is 1 − .236 = .764.
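The bookkeeping on this slide, using the SSE and SST values quoted from the house example (a sketch):

```python
sse = 16754.6   # sum of squared residuals (from the slide)
sst = 70913.6   # total sum of squares about the mean (from the slide)

unexplained = sse / sst          # about .236 of the variation remains
r_squared = 1 - unexplained      # about .764 explained by the model
```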
HOLD ON. Wasn't r² = .76?
Yes, it was. In fact, we can calculate r² by finding the ratio SSE/SST and subtracting it from 1.
Thus, what r² tells us is the proportion of variability explained by the model.
So, finally:

$r^2 = 1 - \frac{SSE}{SST}$

where

$SSE = \sum (y_i - \hat{y}_i)^2 \qquad SST = \sum (y_i - \bar{y})^2$