chapter 7 - dougraffle.com · review in chapter 7, we saw: where do we go from here? scatterplots...

33
Chapter 7 Linear Regression D. Raffle 5/27/2015 Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h... 1 of 33 05/26/2015 02:08 AM

Upload: others

Post on 12-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Chapter 7Linear Regression

D. Raffle

5/27/2015

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

1 of 33 05/26/2015 02:08 AM

Page 2: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Review

In Chapter 7, we saw:

Where do we go from here?

Scatterplots can show us relationships between two numeric variables.

Correlation describes the strength and direction of linear relationshipsbetween two numeric variables.

We call the variable the explanatory variable, because it explains the variable.

We call the variable the response because it responds to changes in .

·

·

· X Y

· Y X

If correlation tells us how well the points fit around a line, what line do they fitaround?

How can we use this line to describe the relationship further?

Can we use it to make predictions?

·

·

·

2/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

2 of 33 05/26/2015 02:08 AM

Page 3: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

3/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

3 of 33 05/26/2015 02:08 AM

Page 4: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

What can we see?

What else might we want to do?

There is a fairly strong negative linear relationship between the weight of a carand its fuel efficiency

As we increase the weight of a car, the fuel efficiency tends to decrease.

· r = −0.87·

·

Describe the general trend with a line

Make a prediction of what the fuel efficiency of a car weighing 4500 lbs is.

·

·

4/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

4 of 33 05/26/2015 02:08 AM

Page 5: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

The Line of Best Fit

So how do we find the line the best describes the general trend?

The most commonly used method is called the least squares regression (LSR).

The least squares regression line is the line that comes closest to all of thepoints simultaneously.

At any value of the variable, the line is our prediction for the mean orexpected value of the variable.

This lets us analyze values of given that we know what is.

·

·

· XY

· Y X

5/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

5 of 33 05/26/2015 02:08 AM

Page 6: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

6/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

6 of 33 05/26/2015 02:08 AM

Page 7: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

The Linear Model

Defining Least Squares Regression:

LSR is a linear model

A mathematical model is just a formula that describes something in the realworld (e.g., Area = Base Height)

A statistical model does the same thing, but it accounts for variability oruncertainty (e.g., The Normal Model)

LSR calculates the best possible line to describe the overall trend between ourvariables from our data

·

·×

·

·

7/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

7 of 33 05/26/2015 02:08 AM

Page 8: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

The Least Squares Line

Recall from algebra that we can describe a line using the formula:

In statistics, we use the same method with different, but we label thingsdifferently:

What do the terms mean?

· y = mx + b

· = + xy b0 b1

is the slope

is the y-intercept

is our prediction for the mean of when

· b1 (m)· b0 (b)· y Y X = x

8/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

8 of 33 05/26/2015 02:08 AM

Page 9: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Algebra Review: Lines

Consider the line:

What do the coefficients tell us?

= 1 + 2xy

x = 1 + 2xy y

0 = 1 + 2(0)y 1

1 = 1 + 2(1)y 3

2 = 1 + 2(2)y 5

tells us the value of when

tells us that every time goes up by one unit, increases by

· = 1b0 y X = 0· = 2b1 X y 2

9/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

9 of 33 05/26/2015 02:08 AM

Page 10: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

For our car data set, the regression line is:

What does this tell us? Keep in mind that weight is in thousands of pounds.

This brings up an important note:

= 37.2851 − 5.3445wtmpg

For every added 1000 pounds of weight, fuel efficiency drops by milesper gallon

If a car weighs 0 lbs, it's predicted efficiency is mpg

· 5.3445

· 37.285

A car can't weigh 0 lbs.

In statistics, the y-intercept is often not interpretable, we just use it to drawthe line and make predictions.

·

·

10/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

10 of 33 05/26/2015 02:08 AM

Page 11: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

If a car weighed lbs, what would we expect its efficiency to be?

How do we interpret this?

= 37.2851 − 5.3445wtmpg

4500

· = 37.28511 − 5.3445(4.5)mpg

· = 37.285 − 24.05mpg

· = 13.23mpg

We predict that the average car weighing lbs gets mpg· 4500 13.23

11/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

11 of 33 05/26/2015 02:08 AM

Page 12: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Residuals

We said the the Least Squares Regression line doesn't hit every point in ourscatterplot, so how can we tell how well it does?

For each point , we predict the value

For every observation we have, we can see how far off we were by finding theresidual

For a given , the residual is:

This is the vertical distance between the line and the true value of .

If our estimate was too high, the residual will be negative

If our estimate was too low, the residual will be positive

· (x, y) (x, )y

·

· x y − y

· y

·

·

12/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

12 of 33 05/26/2015 02:08 AM

Page 13: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Residuals: Example

Let's pick on car in our data set, the Camaro Z28. This car gets mpg andweighs lbs.

So how'd we do?

13.33840

· = 37.2851 − 5.34455wtmpg

· = 37.2851 − 5.3445(3.84)mpg

· = 37.2851 − 20.5228mpg

· = 16.76mpg

We overestimated by 3.46 mpg

· mpg − = 13.3 − 16.76 = −3.46mpg

·

13/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

13 of 33 05/26/2015 02:08 AM

Page 14: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Residuals: Example

Let's pick another car, the Fiat 128. It's efficiency is mpg and it weights lbs.

So how'd we do?

32.42200

· = 37.2851 − 5.34455wtmpg

· = 37.2851 − 5.3445(2.2)mpg

· = 37.2851 − 11.76mpg

· = 25.53mpg

The model underestimated by 6.87 mpg

· mpg − = 32.4 − 25.53 = 6.87mpg

·

14/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

14 of 33 05/26/2015 02:08 AM

Page 15: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Residuals in Reverse

Imagine you bought a car that weighs pounds, and I told you the residualwas . What was the car's fuel efficiency?

2780−1.03

· mpg − = −1.03mpg

· = 37.2851 − 5.34455wtmpg

· = 37.2851 − 5.3445(2.78)mpg

· = 37.2851 − 14.86mpg

· = 22.43mpg

· mpg − 22.43 = −1.03· mpg = −1.03 + 22.43 = 21.4

15/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

15 of 33 05/26/2015 02:08 AM

Page 16: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Visualizing Residuals

16/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

16 of 33 05/26/2015 02:08 AM

Page 17: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

The Least Squares Line

We've seen how to check the prediction for a single value, but how do we knowwe did the best we could?

We call our line the Least Squares Regression line because it:

The LSR line comes the closest to all of the points simultaneously

It does this by finding the line which has the smallest residuals overall

Since negative residuals are just as important as positive ones, we squarethem to force them to be positive

Because of this, we focus on the squared residuals

·

·

·

·

Minimizes the sum of the squared residuals

This means that it minimizes the sum of the squared vertical distancesbetween the points and the line.

·

·

17/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

17 of 33 05/26/2015 02:08 AM

Page 18: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Relationship to Correlation

Recall:

Because of this,

Correlation is the strength of the linear relationship between and .

The direction of the relationship is indicated by the sign of

· X Y

· −1 ≤ r ≤ 1· r

The sign of will always match the sign of .· r b1

18/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

18 of 33 05/26/2015 02:08 AM

Page 19: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Measure of Fit:

In order to evaluate how good the model is, we need a measurement of how wellit fits. For this, we use the statistic.

So what is ?

R2

R2

is the fraction of the variability in the response variable exlained bythe variable.

Do changes in explain changes in ?

· R2 (Y )X

· X Y

R2

If we only have one variable,

If , knowing lets us perfectly predict

If , tells us nothing about

· X =R2 r2

· −1 ≤ r ≤ 1 → 0 ≤ ≤ 1R2

· = 1R2 X Y

· = 0R2 X Y

19/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

19 of 33 05/26/2015 02:08 AM

Page 20: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

What is the of our model that predicts fuel efficiency from weight?R2

, there is a strong negative correlation

So the weight of cars explains 76% of the variability in their fuel efficiency

This makes sense, obviously other properties (number of cylinders,transmission, design, etc.) will play a role in a car's fuel efficiency

· r = −0.87· = (−0.87R2 )2

· = 0.76R2

·

·

20/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

20 of 33 05/26/2015 02:08 AM

Page 21: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Units in Regression

In Correlation:

In Regression:

The correlation coefficient has no units

Changes in units (e.g., lbs kg) had no effect on

·

· → r

The slope is "rise over run", or "change in over change in "

The slope is measured in units of over units of

Changing units can significantly change our line

· Y X

· Y X

·

21/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

21 of 33 05/26/2015 02:08 AM

Page 22: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight: Changes in Units

So far, we've been measuring the weight in 1000s of pounds

What if we represented the weight directly in pounds?

This represents how much the fuel efficiency in mpg changes when weincrease weight by 1000 pounds

· = −5.3445b1

·

The same relationship needs to exist, so changing the weight by 1000 poundsstill needs to move mpg down by 5.3445

In order for this to be true, the slope needs to be divided by 1000

So changing increasing the weight by a single pound decreases mpg by0.0053445.

·

·

· = −5.3445/1000 = −0.0053445b1

·

22/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

22 of 33 05/26/2015 02:08 AM

Page 23: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Regression: Outliers

How do outliers affect regression?

If they're above or below the line (extreme in ):

If they fall along the line, but are extreme in

Y

They can affect the slope drastically

They can decrease

·

· R2

X

They can inflate and make us think the relationship is stronger than itreally is

· R2

23/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

23 of 33 05/26/2015 02:08 AM

Page 24: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

24/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

24 of 33 05/26/2015 02:08 AM

Page 25: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Fuel Efficiency vs. Weight

25/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

25 of 33 05/26/2015 02:08 AM

Page 26: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Horsepower vs. Engine Displacement

26/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

26 of 33 05/26/2015 02:08 AM

Page 27: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Horsepower vs. Engine Displacement

27/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

27 of 33 05/26/2015 02:08 AM

Page 28: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Using StatCrunch

To do regression in StatCrunch:

Stat Regression Simple Linear1.

X Variable Select your explanatory variable2.

Y Variable Select your response3.

→ →→→

28/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

28 of 33 05/26/2015 02:08 AM

Page 29: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

StatCrunch: Regression Results

29/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

29 of 33 05/26/2015 02:08 AM

Page 30: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

StatCrunch: Regression Results

30/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

30 of 33 05/26/2015 02:08 AM

Page 31: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Extending Regression: Multiple 's

Why do statisticians use instead of ?

X

= + xy b0 b1 y = mx + b

Because we can also try to use multiple variables to predict a single

For example, we could use mothers' and fathers' heights to find a model forthei childens' heights

This is called multiple regression

then takes all of the variables into account

· X Y

·

· = + (mother) + (father)child'sˆ b0 b1 b2

·

· R2 X

31/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

31 of 33 05/26/2015 02:08 AM

Page 32: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Extending Regression: Variable Types

Categorical Variables can be used in regression using what are called dummyvariables. Say we wanted to use gender to predict height.

If we had a model

This is often done in medical trials, especially in multiple regression.

Let if the person is male

Let if the person is female

· x = 0· x = 1

= 5.7ft − 0.25xh

if a person is male

if a person is female

· = 5.7ft − 0.25(0) = 5.7fth

· = 5.7ft − 0.25(1) = 5.45fth

32/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

32 of 33 05/26/2015 02:08 AM

Page 33: Chapter 7 - dougraffle.com · Review In Chapter 7, we saw: Where do we go from here? Scatterplots can show us relationships between two numeric variables. Correlation describes the

Summary

We can describe the linear relationship between two numeric variables withLeast Squares Regression

We get predictions for a value of the explanatory variable by plugging its valuein for in the line equation

The prediction is our estimation of the average response for that value of

A residual is the distance from the true value of we observed from theprediction

The Least Squares regression line minimizes the sum of the squared residuals

is the fraction (or percent) of the variability in the response explained bythe regression model

Outliers can have a significant effect on the slope of the line and

·

·x

· X

· Y

·

· R2

· R2

33/33

Chapter 7 http://stat.wvu.edu/~draffle/111/week2/ch7/ch7.h...

33 of 33 05/26/2015 02:08 AM