TRANSCRIPT
Linear Models
Fitting linear models by eye is open to criticism
since it is based on an individual preference. For
instance, it is not altogether clear whether lines A,
B, or C fit best in the scatterplot below.
[Figure: scatterplot with three candidate lines A, B, and C]
Elmhurst Aid
Let us consider a random sample of fifty students
in the 2011 freshman class of Elmhurst College in
Illinois, comparing family income and gift aid. Gift
aid is financial aid that is a gift, as opposed to a
loan.
Elmhurst Aid
A scatterplot of the data is shown below along with
two linear fits. The lines follow a negative trend in
the data; students with higher family incomes
tended to receive less gift aid from the university.
Elmhurst Aid
To determine which line is best, we begin by
thinking about what we mean by "best."
Mathematically, we want a line that has small
residuals.
Elmhurst Aid
A common practice is to choose the line that
minimizes the sum of the squared residuals:
e₁² + e₂² + ⋯ + eₙ²
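As a quick illustration, the sum of squared residuals can be computed directly for any candidate line. The data points and the two candidate lines below are made up for illustration; they are not the Elmhurst data.

```python
# Sum of squared residuals for a candidate line y = b0 + b1*x.
# Data points and line coefficients are hypothetical.

def sum_squared_residuals(points, b0, b1):
    """Return e1^2 + e2^2 + ... + en^2 for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
sse_a = sum_squared_residuals(points, 0.0, 2.0)  # candidate line A
sse_b = sum_squared_residuals(points, 1.0, 1.0)  # candidate line B
print(sse_a, sse_b)  # the line with the smaller sum fits better
```

Here line A has a much smaller sum of squared residuals than line B, so A is the better fit under this criterion.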
Fitting Lines
Stepping away from the Elmhurst data for a
moment, consider the following three data points.
(4, 12), (11, 15), (10, 27)
Example 1
For this data, consider three linear models, shown
below. Which model appears to best fit the data?
[Figure: three candidate linear models A, B, and C]
Least Squares Lines
Here's a look at the three models again, only with
the squared residuals showing. Notice that the best
fitting line is the one with the smallest possible
sum of squares. This is called the least squares
line.
[Figure: the three models with squared residuals drawn in]
Sum of Squares = 368.4 (Poor Fit); 185.6 (Good Fit); 83.6 (Best Fit)
Conditions for the
Least Squares Line
When fitting a least squares line, we generally require
the following:
• Linearity: The data should show a linear trend.
• Nearly normal residuals: Generally, watch out for
outliers or influential points.
• Constant variability: The variability of points
around the least squares line remains roughly
constant.
• Independent observations: Be cautious about
data collected sequentially in a time series. Such
data may have an underlying structure.
Constant Variability
Constant variability (homoscedasticity) basically
means that the variances along the line of best fit
remain similar as you move along the line.
Standard errors are potentially biased without this
condition.
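As a rough, hypothetical check of this condition, one can split the residuals into low-x and high-x halves and compare their spreads; a much larger spread in one half hints at heteroscedasticity. The data below are simulated so that the condition is deliberately violated.

```python
import random
import statistics

random.seed(0)

# Simulated heteroscedastic data: the noise grows with x, so the
# constant-variability condition is violated. True line: y = 2 + 0.5*x.
xs = [i / 10 for i in range(1, 201)]
ys = [2 + 0.5 * x + random.gauss(0, 0.1 * x) for x in xs]

# Residuals against the known true line (in practice, use the fitted line).
resid = [y - (2 + 0.5 * x) for x, y in zip(xs, ys)]

half = len(xs) // 2
sd_low = statistics.stdev(resid[:half])   # spread for small x
sd_high = statistics.stdev(resid[half:])  # spread for large x
print(sd_low, sd_high)  # sd_high comes out noticeably larger
```

In practice this comparison is usually done visually with a residual plot rather than numerically, but the idea is the same: the spread of the residuals should not change systematically along the line.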
Example 2
Should we have concerns about applying least
squares regression to the Elmhurst data?
Check
• Linearity
• Normality
• Constant variability
• Independence
For the Elmhurst data, we could write the equation of the least squares regression line as
Here the equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst. These two values, β0 and β1, are the parameters of the regression line.
Least Squares Lines
âid = β₀ + β₁ × family_income
Least Squares Lines
As before, the parameters are estimated using
observed data. In practice, this estimation is done
using a computer, in the same way that other point
estimates, like a sample mean, are computed using
a computer or calculator.
However, we can also find the parameter
estimates by applying two properties of the least
squares line:
Least Squares Lines
The slope of the least squares line can be
estimated by
where R is the correlation coefficient between the
two variables, and sx and sy are the sample
standard deviations of the explanatory variable
and response, respectively.
b₁ = R · (s_y / s_x)
Least Squares Lines
The y-intercept of the least squares line can be
estimated by
where x̄ and ȳ are the sample means of the
explanatory and response variables, and b₁ is the
estimated slope.
b₀ = ȳ − b₁x̄
Example 3
Below are the summary statistics for the Elmhurst
data. Use the table to find the least-squares line
for predicting aid based on family income.
ŷ = b₀ + b₁x,   where  b₁ = R · (s_y / s_x)  and  b₀ = ȳ − b₁x̄

          family income, in $1000s ("x")   gift aid, in $1000s ("y")
mean      101.8                            19.94
sd        63.2                             5.46

R = −0.499
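As a check, plugging the summary statistics into the two formulas reproduces the estimates; the intercept comes out slightly different from the full-data computer output (24.3193) because the summary values are already rounded.

```python
# Least squares slope and intercept from summary statistics
# (Elmhurst data: family income x and gift aid y, both in $1000s).
x_bar, y_bar = 101.8, 19.94  # sample means
s_x, s_y = 63.2, 5.46        # sample standard deviations
R = -0.499                   # correlation coefficient

b1 = R * s_y / s_x           # slope: b1 = R * (s_y / s_x)
b0 = y_bar - b1 * x_bar      # intercept: b0 = y_bar - b1 * x_bar
print(round(b1, 4), round(b0, 2))  # -0.0431 and 24.33
```

So each additional $1000 of family income is associated with about $43 less in predicted gift aid, and the fitted line is ŷ ≈ 24.3 − 0.0431x.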
Example 4
It was mentioned earlier that a computer is usually
used to compute the least squares line. A
summary table based on computer output is
shown below for the Elmhurst data. Explain the
results of each column.
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   24.3193     1.2915       18.83     0.0000
family_income −0.0431     0.0108       −3.98     0.0002
df = 48
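Each t value is the estimate divided by its standard error; a rough reproduction from the table's figures (tiny differences arise because the table's values are already rounded):

```python
# t value = estimate / standard error, from the Elmhurst computer output.
t_intercept = 24.3193 / 1.2915
t_slope = -0.0431 / 0.0108
print(round(t_intercept, 2), round(t_slope, 2))
# close to the reported 18.83 and -3.98; the slope's t differs in the
# last digit because its estimate and SE are rounded in the table
```

The Pr(>|t|) column then gives the two-sided p-value for each t statistic under a t distribution with df = 48.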
Example 5
Suppose a high school senior is considering
Elmhurst College. Can she simply use the linear
equation that we have estimated to calculate her
financial aid from the university?
Example 6
The slope and intercept estimates for the Elmhurst
data are –0.0431 and 24.3. What do these
numbers really mean?
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   24.3193     1.2915       18.83     0.0000
family_income −0.0431     0.0108       −3.98     0.0002
Interpreting Parameters
Estimated by Least Squares
The intercept describes the average outcome of y
if x = 0 and the linear model is valid all the way to
x = 0, which in many applications is not the case.
The slope describes the estimated average
difference in the y variable if the explanatory
variable x for a case happened to be one unit
larger.
Example 7
The graph below shows the number of
newspapers delivered and total pay for Leona's
newspaper job. What does the slope of this graph
represent?
Example 8
Colby put $100 in a savings account. The graph
below shows how the amount in the account would
increase over the next ten years. What do the
slope and y-intercept represent?
The linear equation below shows the cost, C, of a
hamburger with different numbers of toppings, t.
a) What is the y-intercept, and what does it mean?
b) What is the slope, and what does it mean?
c) If Jodi paid $3.50 for a hamburger, how many
toppings were on her hamburger?
Example 9
Ĉ = 1.90 + 0.40t
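Part (c) amounts to solving 3.50 = 1.90 + 0.40t for t; a quick check:

```python
# Solve cost = base + per_topping * t for the number of toppings t.
cost, base, per_topping = 3.50, 1.90, 0.40
t = (cost - base) / per_topping
print(t)  # 4 toppings
```

That is, a $3.50 hamburger had 4 toppings: $1.90 for the plain burger plus 4 × $0.40.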