TRANSCRIPT
Linear Models
Fitting linear models by eye is open to criticism
since it is based on an individual preference. For
instance, it is not altogether clear whether lines A,
B, or C fit best in the scatterplot below.
[Figure: scatterplot with three candidate lines A, B, and C]
Elmhurst Aid
Let us consider a random sample of fifty students
in the 2011 freshman class of Elmhurst College in
Illinois, comparing family income and gift aid. Gift
aid is financial aid that is a gift, as opposed to a
loan.
Elmhurst Aid
A scatterplot of the data is shown below along with
two linear fits. The lines follow a negative trend in
the data; students with higher family incomes
tended to receive less gift aid from the university.
Elmhurst Aid
To determine which line is best, we begin by
thinking about what we mean by "best."
Mathematically, we want a line that has small
residuals.
Elmhurst Aid
A common practice is to choose the line that
minimizes the sum of the squared residuals:
e₁² + e₂² + ⋯ + eₙ²
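As a quick illustration, the sum of squared residuals can be computed directly for any candidate line. The data points and the two candidate lines below are made up for illustration; they are not the Elmhurst data.

```python
# Sum of squared residuals for a candidate line y = b0 + b1*x.
# Data points and line coefficients are hypothetical.

def sum_squared_residuals(points, b0, b1):
    """Return e1^2 + e2^2 + ... + en^2 for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
sse_a = sum_squared_residuals(points, 0.0, 2.0)  # candidate line A
sse_b = sum_squared_residuals(points, 1.0, 1.0)  # candidate line B
print(sse_a, sse_b)  # the line with the smaller sum fits better
```

Here line A has a much smaller sum of squared residuals than line B, so A is the better fit under this criterion.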
Fitting Lines
Stepping away from the Elmhurst data for a
moment, consider the following three data points.
(4, 12), (11, 15), (10, 27)
Example 1
For this data, consider three linear models, shown
below. Which model appears to best fit the data?
[Figure: three candidate linear models A, B, and C]
Least Squares Lines
Here's a look at the three models again, only with
the squared residuals showing. Notice that the best
fitting line is the one with the smallest possible
sum of squares. This is called the least squares
line.
[Figure: the three models with squared residuals drawn in]
Sum of Squares = 368.4 (Poor Fit); 185.6 (Good Fit); 83.6 (Best Fit)
Conditions for the
Least Squares Line
When fitting a least squares line, we generally require
the following:
• Linearity: The data should show a linear trend.
• Nearly normal residuals: Generally, watch out for
outliers or influential points.
• Constant variability: The variability of points
around the least squares line remains roughly
constant.
• Independent observations: Be cautious about
data collected sequentially in a time series. Such
data may have an underlying structure.
Constant Variability
Constant variability (homoscedasticity) basically
means that the variances along the line of best fit
remain similar as you move along the line.
Standard errors are potentially biased without this
condition.
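As a rough, hypothetical check of this condition, one can split the residuals into low-x and high-x halves and compare their spreads; a much larger spread in one half hints at heteroscedasticity. The data below are simulated so that the condition is deliberately violated.

```python
import random
import statistics

random.seed(0)

# Simulated heteroscedastic data: the noise grows with x, so the
# constant-variability condition is violated. True line: y = 2 + 0.5*x.
xs = [i / 10 for i in range(1, 201)]
ys = [2 + 0.5 * x + random.gauss(0, 0.1 * x) for x in xs]

# Residuals against the known true line (in practice, use the fitted line).
resid = [y - (2 + 0.5 * x) for x, y in zip(xs, ys)]

half = len(xs) // 2
sd_low = statistics.stdev(resid[:half])   # spread for small x
sd_high = statistics.stdev(resid[half:])  # spread for large x
print(sd_low, sd_high)  # sd_high comes out noticeably larger
```

In practice this comparison is usually done visually with a residual plot rather than numerically, but the idea is the same: the spread of the residuals should not change systematically along the line.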
Example 2
Should we have concerns about applying least
squares regression to the Elmhurst data?
Check
• Linearity
• Normality
• Constant variability
• Independence
For the Elmhurst data, we could write the equation of the least squares regression line as
Here the equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst. These two values, β0 and β1, are the parameters of the regression line.
Least Squares Lines
âid = β₀ + β₁ × family_income
Least Squares Lines
As before, the parameters are estimated using
observed data. In practice, this estimation is done
using a computer, in the same way that other point
estimates, like a sample mean, are computed using
a computer or calculator.
However, we can also find the parameter
estimates by applying two properties of the least
squares line:
Least Squares Lines
The slope of the least squares line can be
estimated by
where R is the correlation coefficient between the
two variables, and sx and sy are the sample
standard deviations of the explanatory variable
and response, respectively.
b₁ = R · (s_y / s_x)
Least Squares Lines
The y-intercept of the least squares line can be
estimated by
where x̄ and ȳ are the sample means of the
explanatory and response variables, and b₁ is the
estimated slope.
b₀ = ȳ − b₁x̄
Example 3
Below are the summary statistics for the Elmhurst
data. Use the table to find the least-squares line
for predicting aid based on family income.
ŷ = b₀ + b₁x,   where  b₁ = R · (s_y / s_x)  and  b₀ = ȳ − b₁x̄

          family income, in $1000s ("x")   gift aid, in $1000s ("y")
mean      101.8                            19.94
sd        63.2                             5.46

R = −0.499
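As a check, plugging the summary statistics into the two formulas reproduces the estimates; the intercept comes out slightly different from the full-data computer output (24.3193) because the summary values are already rounded.

```python
# Least squares slope and intercept from summary statistics
# (Elmhurst data: family income x and gift aid y, both in $1000s).
x_bar, y_bar = 101.8, 19.94  # sample means
s_x, s_y = 63.2, 5.46        # sample standard deviations
R = -0.499                   # correlation coefficient

b1 = R * s_y / s_x           # slope: b1 = R * (s_y / s_x)
b0 = y_bar - b1 * x_bar      # intercept: b0 = y_bar - b1 * x_bar
print(round(b1, 4), round(b0, 2))  # -0.0431 and 24.33
```

So each additional $1000 of family income is associated with about $43 less in predicted gift aid, and the fitted line is ŷ ≈ 24.3 − 0.0431x.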
Example 4
It was mentioned earlier that a computer is usually
used to compute the least squares line. A
summary table based on computer output is
shown below for the Elmhurst data. Explain the
results of each column.
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   24.3193     1.2915       18.83     0.0000
family_income −0.0431     0.0108       −3.98     0.0002
df = 48
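Each t value is the estimate divided by its standard error; a rough reproduction from the table's figures (tiny differences arise because the table's values are already rounded):

```python
# t value = estimate / standard error, from the Elmhurst computer output.
t_intercept = 24.3193 / 1.2915
t_slope = -0.0431 / 0.0108
print(round(t_intercept, 2), round(t_slope, 2))
# close to the reported 18.83 and -3.98; the slope's t differs in the
# last digit because its estimate and SE are rounded in the table
```

The Pr(>|t|) column then gives the two-sided p-value for each t statistic under a t distribution with df = 48.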
Example 5
Suppose a high school senior is considering
Elmhurst College. Can she simply use the linear
equation that we have estimated to calculate her
financial aid from the university?
Example 6
The slope and intercept estimates for the Elmhurst
data are –0.0431 and 24.3. What do these
numbers really mean?
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   24.3193     1.2915       18.83     0.0000
family_income −0.0431     0.0108       −3.98     0.0002
Interpreting Parameters
Estimated by Least Squares
The intercept describes the average outcome of y
if x = 0 and the linear model is valid all the way to
x = 0, which in many applications is not the case.
The slope describes the estimated average
difference in the y variable if the explanatory
variable x for a case happened to be one unit
larger.
Example 7
The graph below shows the number of
newspapers delivered and total pay for Leona's
newspaper job. What does the slope of this graph
represent?
Example 8
Colby put $100 in a savings account. The graph
below shows how the amount in the account would
increase over the next ten years. What do the
slope and y-intercept represent?
The linear equation below shows the cost, C, of a
hamburger with different numbers of toppings, t.
a) What is the y-intercept, and what does it mean?
b) What is the slope, and what does it mean?
c) If Jodi paid $3.50 for a hamburger, how many
toppings were on her hamburger?
Example 9
Ĉ = 1.90 + 0.40t
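Part (c) amounts to solving 3.50 = 1.90 + 0.40t for t; a quick check:

```python
# Solve cost = base + per_topping * t for the number of toppings t.
cost, base, per_topping = 3.50, 1.90, 0.40
t = (cost - base) / per_topping
print(t)  # 4 toppings
```

That is, a $3.50 hamburger had 4 toppings: $1.90 for the plain burger plus 4 × $0.40.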