Lecture 5 Notes
Chapter 7. Linear Regression
Learning Outcomes
• Linear model
• Use the least-squares criterion to find the line of best fit
• Understand the differential roles of the response and explanatory variables
• Compute, graph and assess the residuals
• Perform model diagnostics using residual plots
Motivation for Using Least-Squares Procedure
Consider the following:
• We know (assume) that students’ course outcome is correlated with their hours of study.
• We know (as seen in the previous lecture) that percent of birth to teen moms is correlated with poverty rate.
So, we can calculate the correlation between the two quantitative variables in each case, but what if we want to answer the following questions:
• What is the expected course outcome for students who study 5 hours per week?
• What is the predicted percent of birth to teen moms when poverty rate is 21?
Correlation alone cannot answer these questions. We therefore need to set up and fit a model to the data, using a procedure called least squares, in order to better understand the relationship and to predict one variable from another variable in a study.
Relationship Between Percent of Birth to Teen Moms and Poverty Rate
• Let’s consider a data set on geographic socio-political areas in each state in the United States. Researchers are interested in investigating the relationship between poverty rate and percent of birth to teen moms. Moreover, they are interested in predicting percent of birth to teen moms from poverty rate.
If the poverty rate is 21, what percent of birth to teen moms can we expect?
• This correlation value tells us that there is a strong, positive association between poverty rate and percent of birth to teen moms.
• The strength of the relationship is part of the picture, but it does not tell us how to predict one variable from the other.
Least-Squares: The Line of Best Fit
• Idea: We want to predict percent of birth to teen moms from poverty rate.
Response variable: Percent of birth to teen moms
Explanatory variable: Poverty rate
• Find a linear model that gives the equation of a straight line through the data.
• A linear model summarizes the pattern, that is, how the variables are associated.
• Best line:
• It comes closer to all the points than any other line.
• Some of the points are above the line and some are below the line.
Linear Regression Model
• When the scatterplot suggests that a straight-line relationship is appropriate, we propose the linear regression model.
• A regression line is a straight line that describes how a response variable, $y$, changes as an explanatory variable, $x$, changes.
• A regression line is the best-fitting line, closest to all the points in the scatterplot, that relates a response variable, $y$, to an explanatory variable, $x$. Its equation has the form $E(y) = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are the regression coefficients of the linear regression function.
• $\beta_0$ (beta zero) is the y-intercept of the line: the value of $E(y)$ at $x = 0$, where the line intersects the y-axis.
• $\beta_1$ (beta one) is the slope of the line: the change (increase or decrease) in the mean response, in the units of $y$, for every one-unit increase in $x$.
• A positive slope implies that the mean response increases by $\beta_1$ for every one-unit increase in $x$.
• A negative slope implies that the mean response decreases by $|\beta_1|$ for every one-unit increase in $x$.
• Note: A simple linear regression model, $E(y) = \beta_0 + \beta_1 x$, has only one explanatory variable.
• A regression line is often called the least-squares regression line of $y$ on $x$, because this line makes the sum of squared errors (residual = observed value - predicted value) as small as possible (more about this later in this lecture).
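The least-squares idea can be illustrated numerically. The sketch below uses a small hypothetical data set (not the lecture's data) and the standard closed-form estimates; the helper names `sse` and `least_squares` are illustrative only. It checks that nudging the fitted intercept or slope never lowers the sum of squared errors.

```python
# Hypothetical toy data, NOT the lecture's poverty / teen-birth data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def sse(b0, b1, xs, ys):
    """Sum of squared errors for the line y_hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sxy / sxx            # algebraically equal to r * S_y / S_x
    b0 = y_bar - b1 * x_bar   # forces the line through (x_bar, y_bar)
    return b0, b1


b0, b1 = least_squares(x, y)
best = sse(b0, b1, x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {best:.3f}")

# Any small change to the intercept or slope makes the SSE larger.
for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print(f"shift ({d0:+.2f}, {d1:+.2f}): SSE = {sse(b0 + d0, b1 + d1, x, y):.3f}")
```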
Regression Equation: Least-Squares Estimates
• The regression function describes how the mean of the response variable changes with the values of the explanatory variable.
• The values of $\beta_0$ and $\beta_1$ are unknown. Thus, we use data to estimate $\beta_0$ and $\beta_1$ in the linear regression function.
• We use the linear regression model to fit our data.
• The equation of the least-squares regression line of $y$ on $x$ is $\hat{y} = b_0 + b_1 x$, with slope $b_1 = r \frac{S_y}{S_x}$ (recall: rise over run) and y-intercept $b_0 = \bar{y} - b_1 \bar{x}$ (this uses the fact that the point $(\bar{x}, \bar{y})$ lies on the line).
• $\hat{y} = b_0 + b_1 x$ is also called the least-squares prediction equation, fitted function, or regression equation.
• For a given $x$ value, $\hat{y} = b_0 + b_1 x$ estimates the mean of $y$ for all subjects (cases) in the population having that value of $x$.
• Predicted values are estimates made from the model: $\hat{y}$ values (for given $x$ values) are always on the regression line.
• We predict $y$ within the range of the observed $x$ values.
• We do not predict outside the range of the observed $x$ values (we do not extrapolate). Such predictions are often not accurate. Why? Because the regression equation is based only on the information in the data, and the linear pattern may not continue beyond the observed range.
What does the line tell us?
The line tells us the average percent of birth to teen moms we would expect for a given poverty rate.
The line suggests that Mississippi, with a poverty rate of 21, should have about 16 percent of birth to teen moms (a value on the line) when, in fact, it has 17.1.
Regression Equation: Calculate with Summary Statistics
The regression equation for predicting percent of birth to teen moms from poverty rate is $\hat{y} = b_0 + b_1 x$, with slope $b_1 = r \frac{S_y}{S_x}$ and y-intercept $b_0 = \bar{y} - b_1 \bar{x}$.
1. Find the estimated slope:
$b_1 = r \frac{S_y}{S_x} = 0.8454253 \times \frac{2.528571}{3.061984} = 0.698147964 \approx 0.70$
2. Find the estimated y-intercept:
$b_0 = \bar{y} - b_1 \bar{x} = 10.35882 - [0.698147964 (12.84902)] = 1.388302848 \approx 1.39$
Regression equation: $\hat{y} = 1.39 + 0.70x$
Or: Percent of birth to teen moms = 1.39 + 0.70 × poverty rate
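The two-step hand calculation can be reproduced in a few lines. This is a minimal sketch in Python (the lecture itself reads these values from R output); the variable names are illustrative, and the numbers are the summary statistics quoted in the worked example.

```python
# Summary statistics quoted in the worked example.
r = 0.8454253      # correlation between poverty rate and percent of birth to teen moms
s_y = 2.528571     # standard deviation of the response (percent of birth to teen moms)
s_x = 3.061984     # standard deviation of the explanatory variable (poverty rate)
y_bar = 10.35882   # mean of the response
x_bar = 12.84902   # mean of the explanatory variable

b1 = r * s_y / s_x         # step 1: estimated slope
b0 = y_bar - b1 * x_bar    # step 2: estimated y-intercept

print(f"slope b1 = {b1:.6f}")
print(f"intercept b0 = {b0:.6f}")
print(f"regression equation: y_hat = {b0:.2f} + {b1:.2f} x")
```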
Find the Regression Equation Using R
Regression equation: $\hat{y} = 1.39 + 0.70x$ (rounded to two decimal places)
Or: Percent of birth to teen moms = 1.39 + 0.70 × poverty rate
Interpretation of the Regression Coefficients
Interpretation of the estimated slope, 0.70:
When the poverty rate increases by 1, the mean percent of birth to teen moms is estimated to increase by 0.70.
Or: On average, the percent of birth to teen moms increases by 0.70 for every one-unit increase in the poverty rate.
Interpretation of the estimated y-intercept, 1.39:
When the poverty rate is zero, the mean percent of birth to teen moms is estimated to be 1.39.
However, since no state in this data set has a poverty rate of 0%, the y-intercept has no meaningful interpretation other than being the shift of the line.
Regression equation: $\hat{y} = 1.39 + 0.70x$
Or: Percent of birth to teen moms = 1.39 + 0.70 × poverty rate
Highest and the Lowest Poverty Rates
Predicting for Percent of Birth to Teen Moms from Poverty Rate
Regression equation: $\hat{y} = 1.39 + 0.70x$
• The highest poverty rate was for the state of Mississippi, with a value of 21.
What is the predicted percent of birth to teen moms?
$\hat{y} = 1.39 + 0.70(21) = 16.09$
• The lowest poverty rate was for the state of New Hampshire, with a value of 7.6.
What is the predicted percent of birth to teen moms?
$\hat{y} = 1.39 + 0.70(7.6) = 6.71$
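The two predictions above can be sketched as a small function. This is an illustrative Python sketch, not the lecture's R workflow: the function name `predict` and the range guard are assumptions, with the guard using the lowest (7.6) and highest (21) poverty rates quoted above to discourage extrapolation.

```python
# Rounded coefficients of the fitted line y_hat = 1.39 + 0.70 x.
B0, B1 = 1.39, 0.70
# Observed range of poverty rates (lowest and highest values quoted in the notes).
X_MIN, X_MAX = 7.6, 21.0


def predict(poverty_rate):
    """Predicted percent of birth to teen moms; refuses to extrapolate."""
    if not (X_MIN <= poverty_rate <= X_MAX):
        raise ValueError("poverty rate is outside the observed range; do not extrapolate")
    return B0 + B1 * poverty_rate


print(f"Mississippi (x = 21): {predict(21.0):.2f}")
print(f"New Hampshire (x = 7.6): {predict(7.6):.2f}")
```

Raising an error outside the observed range is only one possible design; a warning would also fit the advice not to extrapolate.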
Obtain Predicted Values in R
The table below shows the predicted values for percent of birth to teen moms given the poverty rates in the data.
Predicting with Mean X gives Mean Y
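That predicting at the mean of $x$ returns the mean of $y$ follows from $b_0 = \bar{y} - b_1 \bar{x}$, and it can be checked numerically. The Python sketch below uses the summary statistics from the worked example; variable names are illustrative. It also shows that the rounded equation gives only an approximate match.

```python
# Summary statistics from the worked example.
r, s_y, s_x = 0.8454253, 2.528571, 3.061984
x_bar, y_bar = 12.84902, 10.35882

b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# With unrounded coefficients, the prediction at x_bar equals y_bar
# (up to floating-point error).
y_hat_at_mean = b0 + b1 * x_bar
print(f"unrounded coefficients: {y_hat_at_mean:.5f} (mean of y = {y_bar})")

# With coefficients rounded to two decimals, the match is only approximate.
y_hat_rounded = 1.39 + 0.70 * x_bar
print(f"rounded coefficients:   {y_hat_rounded:.5f}")
```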
Residuals (Errors of Prediction)
• Residuals tell us what the model missed in explaining the response variable.
• A residual tells us how far the model’s prediction is from the observed value for a case (point). Residuals help us see whether the model makes sense.
• Residuals are the vertical distances between each observation in the scatterplot and the best-fitting line (the predicted values on the line).
• A residual is the difference between an observed value of the response and the value predicted by the regression line. That is, residual = observed y - predicted y = $y - \hat{y}$
• We denote a residual with the lower-case letter e: $e = y - \hat{y}$
• Some residuals are negative and some are positive.
• Some residuals are very close to zero, or exactly zero (when there is no error of prediction).
• The mean (and the sum) of the residuals is zero.
• The residual is positive when $y > \hat{y}$, which means that the model underestimated the observed value.
• The residual is negative when $y < \hat{y}$, which means that the model overestimated the observed value.
• Small residuals are desirable. Why? Because they indicate that the selected explanatory variable is doing a good job of predicting the response variable.
Obtain Residual Values (by hand calculation)
Regression equation: $\hat{y} = 1.39 + 0.70x$
• For the state of Mississippi, the poverty rate is 21 and the percent of birth to teenage mothers is 17.1. What is the residual for this observation?
The residual is the observed value ($y$) minus the predicted value ($\hat{y}$): $e = y - \hat{y}$
The predicted percent of birth to teenage mothers for a poverty rate of 21 is $\hat{y} = 1.39 + 0.70(21) = 16.09$
$e = 17.10 - 16.09 = 1.01$ (the residual is positive because $y > \hat{y}$; the model underestimated the percent of birth to teenage mothers for Mississippi).
• For the state of New Hampshire, the poverty rate is 7.6 and the percent of birth to teen moms is 6.6. What is the residual for this observation?
The residual is the observed value ($y$) minus the predicted value ($\hat{y}$): $e = y - \hat{y}$
The predicted percent of birth to teenage mothers for a poverty rate of 7.6 is $\hat{y} = 1.39 + 0.70(7.6) = 6.71$
$e = 6.6 - 6.71 = -0.11$ (the residual is negative because $y < \hat{y}$; the model overestimated the percent of birth to teenage mothers for New Hampshire).
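The two hand calculations can be repeated programmatically. The following Python sketch is illustrative only (the lecture obtains residuals in R); the function name `residual` and the dictionary layout are assumptions, while the data values are the two observations quoted above.

```python
# Rounded coefficients of the fitted line y_hat = 1.39 + 0.70 x.
B0, B1 = 1.39, 0.70

# (poverty rate, observed percent of birth to teen moms) for the two states above.
observations = {
    "Mississippi": (21.0, 17.1),
    "New Hampshire": (7.6, 6.6),
}


def residual(x, y):
    """Residual e = observed y minus predicted y_hat."""
    return y - (B0 + B1 * x)


for state, (x, y) in observations.items():
    e = residual(x, y)
    if e > 0:
        verdict = "model underestimated"
    elif e < 0:
        verdict = "model overestimated"
    else:
        verdict = "no prediction error"
    print(f"{state}: e = {e:.2f} ({verdict})")
```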
Obtain Residual Values in R
Obtain Summary Statistics for Residual Values in R
Estimated Standard Deviation of Regression Model, S
• If we add all the residual values, the sum is always zero.
• To find the standard deviation of the residuals, square all the residuals and then add these squared values:
$\sum e^2 = \sum (y - \hat{y})^2$, the sum of squared deviations of the $y$ values about their predicted values.
• This sum tells us how well the line we drew (on the scatterplot) fits the data. The smaller the sum, the better the fit.
• Divide $\sum e^2$ by $n - 2$, because in simple linear regression we estimate two parameters: $\beta_0$ and $\beta_1$.
• Thus, $S^2$ is the estimated variance of the distribution of $y$ values about the least-squares line $\hat{y}$: $S^2 = \frac{\sum (y - \hat{y})^2}{n - 2}$
Note: $\sigma^2$ is the variance of the distribution of $y$ values about the line of means $E(y)$.
• $S = \sqrt{S^2} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$ is the estimated standard deviation of the distribution of $y$ values about the least-squares line $\hat{y}$. $S$ is also called the estimated standard error of the regression model.
• $S$ is the estimated standard deviation of $y$ for any fixed value of $x$.
• $S$ measures the spread of the distribution of $y$ values about the least-squares line $\hat{y}$.
• We expect most (~95%) of the observed $y$ values to lie within $2S$ of their respective least-squares predicted values, $\hat{y}$.
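The steps for computing $S$ can be sketched on a small hypothetical data set, since the lecture's full data table is not reproduced in these notes. The Python code below fits the least-squares line, confirms that the residuals sum to zero, and divides the sum of squared residuals by $n - 2$; all variable names are illustrative.

```python
import math

# Hypothetical toy data, NOT the lecture's data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least-squares fit.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residuals e = y - y_hat; their sum is zero up to floating-point error.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"sum of residuals = {sum(residuals):.10f}")

# Sum of squared residuals, divided by n - 2 (two estimated parameters).
sum_sq = sum(e ** 2 for e in residuals)
s_squared = sum_sq / (n - 2)
s = math.sqrt(s_squared)
print(f"sum of squared residuals = {sum_sq:.4f}")
print(f"S^2 = {s_squared:.4f}, S = {s:.4f}")
```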
Obtain Standard Deviation of Residual Values in R
• We expect most (~95%) of the observed $y$ values to lie within $2S$ of their respective least-squares predicted values, $\hat{y}$.
• We expect most (~95%) of the observed percent of birth to teen moms values to lie within 2 × 1.364205 = 2.72841 of their respective least-squares predicted values.
• Moreover, we expect almost all of the observed percent of birth to teen moms values to lie within 3 × 1.364205 = 4.092615 of their respective least-squares predicted values.
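As a quick check of the $2S$ and $3S$ rule of thumb, the Python sketch below computes both band widths from the value of $S$ quoted above and tests whether the two states discussed earlier fall inside the $2S$ band. The function name `within_band` is illustrative, and the rounded coefficients 1.39 and 0.70 are used for the predictions.

```python
S = 1.364205          # estimated standard deviation of the regression model
B0, B1 = 1.39, 0.70   # rounded coefficients of the fitted line


def within_band(x, y, k):
    """True if the observed y lies within k * S of its predicted value."""
    return abs(y - (B0 + B1 * x)) <= k * S


print(f"2S = {2 * S:.5f}")
print(f"3S = {3 * S:.6f}")
print("Mississippi within 2S:", within_band(21.0, 17.1, 2))
print("New Hampshire within 2S:", within_band(7.6, 6.6, 2))
```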
Examining Residuals: Using Residual Plots
• When a regression model is appropriate, it should capture the underlying relationship.
• Nothing interesting should be left behind. We plot the residuals in the hope of finding nothing interesting.
• Residual plots help us assess the regression model assumptions.
• We check the model assumptions using the residuals. There are three assumptions to check:
1. Assumption of Normality:
Residuals are approximately normally distributed: check the histogram or boxplot of the residuals, or the normal probability plot of the residuals.
2. Assumption of Linearity:
Residuals have mean zero: check whether the residual points are randomly scattered around the zero line (the mean of the residuals). Use the plot of residuals versus the predictor or the fitted values (this is a scatterplot, except that here we do not want to see any obvious pattern, for example a curvature pattern).
3. Assumption of Equal Variance:
Residuals have constant variance: check whether the residual points are evenly spread out around the zero line. Use the plot of residuals versus the predictor or the fitted values (this is a scatterplot, except that here we do not want to see any obvious pattern such as a fanning or cone shape: an increasing dispersion as the fitted values increase).
Check the Assumption of Normality of Residuals
In checking this, we can use the histogram of the residuals.
The histogram of the residuals shows that the residuals are approximately normally distributed.
The residual points are within 3 standard deviations of their mean, zero.
We can also use the boxplot of the residuals.
The boxplot of the residuals shows that the residuals are approximately normally distributed.
The residual points are within 3 standard deviations of their mean, zero.
Finally, we can use the normal probability plot of the residuals.
The normal probability plot of the residuals shows that the residuals are approximately normally distributed.
The residual points are close to the straight line.
Checking the Assumption of Linearity and Constant Variances of Residuals
In checking these assumptions, we can use the plot of residuals versus the values of the explanatory variable.
• Residual points are randomly scattered around the zero line. Thus, the assumption of linearity is met.
• Residual points are evenly spread out around the zero line. No obvious pattern is observed. Thus, the assumption that the residuals have constant variance is met.
Note: We want to see NO pattern in the plot of residuals versus the values of the explanatory variable. Since there is none, we state that the model we proposed, the linear regression model, is an adequate fit to our data.
We can also use the plot of residuals versus fitted values.
• Residual points are randomly scattered around the zero line. Thus, the assumption of linearity is met.
• Residual points are evenly spread out around the zero line. No obvious pattern is observed. Thus, the assumption that the residuals have constant variance is met.
Note: We want to see NO pattern in the plot of residuals versus the fitted values. Since there is none, we state that the model we proposed, the linear regression model, is an adequate fit to our data.
Example of Simple Linear Regression Model NOT Correct
The curvature pattern suggests the need for a higher-order model or a transformation.
Example of Simple Linear Regression Model NOT Correct
The trend in dispersion: an increasing dispersion as the fitted values increase, in which case a transformation of the response (for example, taking the log or square root) may help.
Coefficient of Determination, $R^2$: The Variation Accounted for by the Model
• $R^2$ (R-squared) measures the usefulness of the entire regression model; it is a sample statistic that tells us how well the model fits the data.
• In simple linear regression, $R^2$ measures the proportion of variation in the response variable $y$ that is explained by the linear relationship with the explanatory variable $x$.
• That also means $R^2$ is the proportion of variation in the response variable $y$ that is accounted for by the explanatory variable $x$.
• In simple linear regression, we can also obtain $R^2$ by squaring the estimated correlation coefficient $r$: $R^2 = r^2$
• $R^2 = 0$ implies a lack of fit of the model to the data. That is, none of the variation in the data is captured by the model; all of it is left in the residuals.
• $R^2 = 1$ implies a perfect fit, with the model passing through every data point.
• In general, the larger the value of $R^2$, the better the model fits the data.
In Our Example: 𝑹𝟐 (Read from R output)
$R^2 = r^2$
$r = 0.8454253$
$r^2 = (0.8454253)^2 = 0.714744$
Interpretation:
About 71% of the variation in percent of birth to teen moms is explained by the linear relationship with poverty rate.
Or: About 71% of the variation in percent of birth to teen moms is accounted for by variation in poverty rate.
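The squaring step is easy to verify. A minimal Python check, using the correlation value quoted in this example (variable names are illustrative):

```python
r = 0.8454253          # estimated correlation from the example
r_squared = r ** 2     # in simple linear regression, R^2 = r^2

print(f"R^2 = {r_squared:.6f}")
print(f"About {r_squared:.0%} of the variation in the response is explained by the model")
```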
$R^2$: The Variation Accounted for by the Model; $1 - R^2$: The Variation Unaccounted for by the Model
• The variation in the residuals is the key to assessing how well the model fits the data. Let’s compare the variation of the response variable with the variation of the residuals.
• The percent of birth to teen moms has a standard deviation of 2.53%. The standard deviation of the residuals is 1.36%.
• If the correlation were 1.0 ($r = 1$) and the model predicted the percent of birth to teen moms values perfectly, the residuals would all be zero and have no variation. We could not possibly do any better than that.
• On the other hand, if the correlation were zero ($r = 0$), the model would simply predict 10.36 percent of birth to teenage mothers (the mean) for all poverty rate values, and the residuals would just be the observed values minus their mean ($e = y - \bar{y}$). These residuals would have the same variability as the original data because, as we know, just subtracting the mean does not change the spread.
• Compare the variability of the percent of birth to teenage mothers with the variability of the residuals from the regression.
• In the boxplots, we compare the deviations of the states’ percent of birth to teenage mothers from their mean of ~10.36 (on the left) with the deviations from the line of best fit, that is, the residuals (on the right). The variation left in the residuals is unaccounted for by the model, but it is less than the variation in the original data, as the side-by-side boxplots show.
• How well does the poverty rate regression model do? Look at the side-by-side boxplots.
• The variation in the residuals (on the right) is smaller than the variation in the data (on the left), but certainly bigger than zero. That is nice to know, but how much of the variation in the data is still left in the residuals? If you had to put a number between 0% and 100% on the fraction of the variation left in the residuals, what would you say?
In Our Example: $1 - R^2$ (Proportion Unexplained)
The squared correlation, $R^2$, gives the fraction of the data’s variation accounted for by the model, and $1 - R^2$ is the fraction of the original variation left in the residuals.
$1 - R^2$ is the proportion of the variability in the response that is not explained by the linear relationship with the predictor variable (left in the residuals).
In our example:
$1 - R^2 = 1 - 0.714744 = 0.285256$
Interpretation: About 29% of the variability in percent of birth to teenage mothers has been left in the residuals.
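The link between $1 - R^2$ and the two standard deviations quoted earlier (about 2.53 for the response and about 1.36 for the residuals) can be checked with a short Python sketch. The variable names are illustrative. The ratio of variances comes out close to, but not exactly equal to, $1 - R^2$; the small gap is consistent with $S$ using the divisor $n - 2$ while the sample standard deviation of $y$ uses $n - 1$.

```python
r_squared = 0.714744   # R^2 from the example
s_y = 2.528571         # standard deviation of percent of birth to teen moms
s_e = 1.364205         # estimated standard deviation of the residuals (S)

unexplained = 1 - r_squared
variance_ratio = s_e ** 2 / s_y ** 2   # residual variance relative to response variance

print(f"1 - R^2 = {unexplained:.6f}")
print(f"S^2 / S_y^2 = {variance_ratio:.4f}")
```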
Example: Finding $r$ from $R^2$
Suppose that the association between adult smokers % and ACT scores is investigated and the scatterplot is approximately linear.
The regression function was: Adult Smokers % = 45.348999 - 1.2272345 ACT
$R^2 = 0.20$
What is $r$ (the estimate of the correlation)?
$r = (\text{sign of slope}) \times \sqrt{R^2}$
The slope is negative, so $r = -\sqrt{0.20} \approx -0.45$
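The rule "r takes the sign of the slope" can be written as a small helper. This Python sketch is illustrative; the function name `correlation_from_r_squared` is an assumption, and the inputs are the values from the smokers and ACT example.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Recover r in simple linear regression: sqrt(R^2) with the sign of the slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# Values from the example: R^2 = 0.20 and a slope of -1.2272345.
r = correlation_from_r_squared(0.20, -1.2272345)
print(f"r = {r:.2f}")
```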
Steps in Doing Regression
• Start with a scatterplot.
• If the scatterplot does not look like a straight-line relationship, stop.
• Otherwise, you can calculate the correlation and also the intercept and slope of the regression line.
• Check whether the regression is OK by looking at plots of the residuals against the fitted values or the explanatory variable.
• If it is not OK, do not use regression. We cannot say that the explanatory variable is a useful predictor.
• Our aim:
We want a regression for which a straight line is appropriate, and we confirm that by looking at the scatterplot and the residual plots.