Unit 4: Regression Assumptions and Diagnostics (Class 11)
http://xkcd.com/539/
Unit 4 / Page 1 © Andrew Ho, Harvard Graduate School of Education
© Andrew Ho, Harvard Graduate School of Education
Where is Unit 4 in our 11-Unit Sequence?
Building a solid foundation:
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model
Mastering the subtleties:
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity
Adding additional predictors:
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects:
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects
Pulling it all together:
• Unit 11: Regression in practice. Common extensions.
Unit 4 / Page 2
In this unit, we’re going to learn about…
• Reprise of the assumptions required for OLS regression-based inference
• The four major types of model violations:
– Outliers
– Nonlinearity
– Heteroscedasticity
– Non-independence of errors
• Determining whether the regression assumptions hold—strategies and rationale
– Why residuals provide a powerful lens for evaluating regression assumptions
– Residuals as observations that “control for” or account for variables
– Raw vs. standardized/studentized residuals
– Leverage, outliers, and Cook’s Distance
– Residual plots: How to construct them and what to look for
– What should we do if we identify an outlier or other unusual observation?
• How would we summarize our results?
Unit 4 / Page 3 © Andrew Ho, Harvard Graduate School of Education
The linear regression model and its four assumptions
Assumption 1: At each value of $X$ there is a conditional distribution of $Y$ that is normal with mean $\mu_{Y|X}$ and SD $\sigma_{Y|X}$.
Assumption 2: The straight-line model is correct: the $\mu_{Y|X}$ fall on a line.
Assumption 3: Homoscedasticity: the conditional standard deviations, $\sigma_{Y|X}$, are equal across all $X$.
Assumption 4: Conditional independence: for any value of $X$, the $\epsilon$s are independent. They share no hidden common association. This cannot be visualized from the plot.
These four assumptions are often summarized in a shorthand: $\epsilon \overset{i.i.d.}{\sim} N(0, \sigma^2_{Y|X})$. This reads: the population residuals, $\epsilon$, are independent and identically normally distributed with mean 0 (values centered on the prediction) and a common variance $\sigma^2_{Y|X}$ (and thus a common SD).
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 4
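To make the shorthand concrete, here is a minimal Stata sketch (not from the original slides; the coefficients and seed are hypothetical) that simulates data satisfying all four assumptions: a straight-line conditional mean and independent, identically normal errors with constant SD.

/* Simulate data that satisfy the four assumptions (hypothetical example) */
clear
set seed 12345
set obs 200
generate x = 10*runiform()              // arbitrary predictor values
generate y = 2 + 0.5*x + rnormal(0, 1)  // linear mean, i.i.d. N(0, 1) errors
regress y x                             // estimates should land near 2 and 0.5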
© Andrew Ho, Harvard Graduate School of Education
Anscombe’s Quartet: Four datasets with identical summary statistics
How can we investigate the distributions of our residuals?
Unit 4 / Page 5
Diagnostic Plots: Residuals vs. Fitted Values
[Figures: the Anscombe y2-on-x2 scatterplot with its fitted line, the residuals plotted against x2, and the residuals plotted against the fitted values for the quartet’s datasets.]
To better visualize the residuals, we can subtract out the regression line. Instead of plotting y2 on x2, we can plot y2 − ŷ2 on x2:

/* Graph the usual linear fit */
graph twoway (scatter y2 x2) (lfit y2 x2)
/* Run the regression */
regress y2 x2
/* Save residuals to variable resid */
predict resid, residuals
/* Plot residuals on x2 */
scatter resid x2
/* An equivalent visual representation is obtained by plotting residuals
   on their predicted ("fitted") values, the yhats. */
predict yhat
scatter resid yhat
// Or, in one smooth move after regress…
rvfplot, name(rvf2)

For simple linear regression, since the ŷ values are linearly related to the x values, the plots look the same but just have a different horizontal axis. In practice, we use fitted values because it is hard to plot against multiple predictors at once when we move to multiple regression.
Unit 4 / Page 6© Andrew Ho, Harvard Graduate School of Education
Plots of residuals vs. fitted values for Anscombe’s Quartet
Your residuals will always sum to zero, and the correlation between your residuals and fitted values (or any $X$) will always be zero. RVF plots are a better lens through which to appreciate your residuals.
Unit 4 / Page 7© Andrew Ho, Harvard Graduate School of Education
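As a quick check of these two facts, a minimal sketch (assuming the Anscombe variables y2 and x2 used above are in memory):

quietly regress y2 x2
predict resid2, residuals      /* raw residuals */
predict yhat2, xb              /* fitted values */
summarize resid2               /* the mean (and sum) of the residuals is zero */
correlate resid2 yhat2 x2      /* residuals are uncorrelated with fits and with x2 */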
Zagat: Can the price of your dinner predict its quality?
n = 76 Boston restaurants classified as “New American” in the ’05–’06 guide.
Cost: Zagat surveyor’s average estimate of the price of dinner, one drink, and tip.
Rating: Average of the restaurant’s ratings for food, service, and decor. A 0–30 scale where 16–19 is good to very good, 20–25 is very good to excellent, and 26–30 is extraordinary to perfection.
Unit 4 / Page 8© Andrew Ho, Harvard Graduate School of Education
Scatterplot of Zagat rating on Zagat-reported cost
Unit 4 / Page 9© Andrew Ho, Harvard Graduate School of Education
© Andrew Ho, Harvard Graduate School of Education
Regression Results
61.6% of the variance in ratings is associated with (accounted for by/explained by) the cost.
The conditional standard deviation of ratings is 1.56 (the unconditional standard deviation was 2.5)
A dollar difference in cost predicts (is associated with) a .186-point difference in ratings.
The sample statistic of .186 is 10.9 estimated standard errors away from our null hypothesis of 0 slope. The probability of such a slope under the null hypothesis is very, very low. A plausible range for the “true” slope is between .15 and .22; 95% of these kinds of intervals are expected to contain the “true” slope.
The relationship is moderately strong and statistically significant, where a $10 difference in cost predicts a 1.86-point difference in the Zagat ratings.
Unit 4 / Page 10
An estimated linear regression model of rating on cost
Unit 4 / Page 11© Andrew Ho, Harvard Graduate School of Education
And, subtracting out the regression line…
$\widehat{rating} = 13.83 + 0.186 \cdot cost$
Plot of residuals vs. fitted values
[Figure: residuals vs. fitted values for the rating-on-cost regression.]
Unit 4 / Page 12© Andrew Ho, Harvard Graduate School of Education
We look for: 1) linearity 2) outliers 3) homoscedasticity 4) normality
© Andrew Ho, Harvard Graduate School of Education
Linearity: Eyeballing and lpoly (sparingly)
Unit 4 / Page 13
[Figure: residuals vs. fitted values with a local polynomial smooth overlaid (kernel = epanechnikov, degree = 0, bandwidth = .85).]
Responses to nonlinearity include polynomial or other nonlinear regression approaches (not covered here) and transformations (covered in Unit 5)
Ignoring or failing to address nonlinearity risks poor prediction, underestimation of $R^2$, and Type II error (missing a finding).
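One way to “eyeball with lpoly,” sketched under the assumption that resid and yhat were saved with predict as on page 6 (kernel and bandwidth are left at Stata’s defaults here):

/* Overlay a local polynomial smooth on the residual-vs-fitted plot; */
/* systematic bends away from the zero line suggest nonlinearity */
twoway (scatter resid yhat) ///
       (lpoly resid yhat, degree(0)), yline(0) legend(off)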
[Figure: residuals vs. fitted values with three large positive residuals labeled: 29 Newbury, Bristol, and Stanhope Grille.]
Outliers: Identifying and interpreting relatively large residuals
But residuals can’t identify outliers alone.
Positive residuals: underpredicted. High ratings after “controlling for,” “partialling out,” “regressing out,” or accounting for cost.
Unit 4 / Page 14© Andrew Ho, Harvard Graduate School of Education
Leverage: Extremity of an observation along predictor variables
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 15
This outlier has a large residual; this outlier does not.
We should consider both the discrepancy from the regression line and the leverage exerted on the regression line.
With many predictors, the leverage of an observation is a single expression of distance from many predictor means.
With simple linear regression, it looks somewhat familiar:
It is called $h$ because leverage is an element of the “hat matrix”: it puts the hat on the Y.
$h = \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}$
http://www.stat.sc.edu/~west/javahtml/Regression.html
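A sketch of how the leverage formula above can be reproduced in Stata for the Zagat regression; rating and cost are assumed variable names for the dataset described earlier.

quietly regress rating cost
predict lev, leverage                   /* h_i for each restaurant */

/* Rebuild h_i by hand: 1/n + (X_i - Xbar)^2 / sum_j (X_j - Xbar)^2 */
quietly summarize cost
generate lev_manual = 1/r(N) + (cost - r(mean))^2 / (r(Var)*(r(N) - 1))
list lev lev_manual in 1/5              /* the two columns should agree */

summarize lev
display r(sum)                          /* leverages sum to k + 1 = 2 */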
[Figure: leverage plotted against average estimated dinner cost ($); leverage is smallest for restaurants near the mean cost and grows for restaurants at the extremes.]
Visualizing Leverage: Look familiar?
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 16
Leverage values should be interpreted relatively. They are never more than 1, and the sum of all leverages is the number of predictors + 1 (here, 2).
The greater the leverage, the more a single observation can influence the regression line.
$h = \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}$
[Figure: the residuals-vs-fitted plot again, with 29 Newbury, Bristol, and Stanhope Grille labeled.]
Incorporating leverage… a little.
Unit 4 / Page 17© Andrew Ho, Harvard Graduate School of Education
To better interpret these residuals, we can “standardize” them: express them in terms of a kind of standard deviation unit that incorporates leverage.
[Figure: standardized residuals vs. fitted values, with 29 Newbury, Bristol, and Stanhope Grille labeled.]
Standardized/Studentized residuals: predict stdres, rstandard
Expresses residuals in terms of RMSE (conditional standard deviations) while inflating residuals with more leverage.
If residuals are homoscedastic and normally distributed about the regression line, we have a sense of how unexpected a standardized residual might be.
Unit 4 / Page 18© Andrew Ho, Harvard Graduate School of Education
Called “studentized” in honor of Gosset of the t-test (pseudonym: Student). Stata calls these “standardized residuals.” Standardized residuals usually refer to dividing by RMSE alone; the two are very similar for large sample sizes.
$e_{std(i)} = \dfrac{Y_i - \hat{Y}_i}{RMSE\sqrt{1 - h_i}}$
$d_i = \left(\dfrac{h_i}{1 - h_i}\right)\left(\dfrac{e_{std(i)}^2}{k + 1}\right)$
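A sketch that reproduces both formulas above in Stata and compares them with the built-in predict options (again assuming rating and cost as variable names, with k = 1 predictor):

quietly regress rating cost
predict ehat, residuals
predict hati, leverage
predict rstand, rstandard               /* Stata's "standardized" residuals */
predict cooks, cooksd                   /* Stata's Cook's distance */

/* e_std(i) = (Y_i - Yhat_i) / (RMSE * sqrt(1 - h_i)) */
generate rstand_manual = ehat / (e(rmse)*sqrt(1 - hati))

/* d_i = [h_i / (1 - h_i)] * [e_std(i)^2 / (k + 1)], with k + 1 = 2 */
generate cooks_manual = (hati/(1 - hati)) * (rstand_manual^2 / 2)

list rstand rstand_manual cooks cooks_manual in 1/5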
[Figure: leverage vs. normalized residual squared (lvr2plot) for the Boston restaurants, each labeled by name.]
Cook’s Distance: A metric incorporating both discrepancy and leverage
Unit 4 / Page 19© Andrew Ho, Harvard Graduate School of Education
• A go-to metric to address the question, did this single observation have a notable impact on my results? A measure of influence.
Recall: Cook’s Distance:
[Figure: scatterplot of Zagat rating on average estimated dinner cost, annotated to show leverage (horizontal extremity) and the squared standardized residual (vertical discrepancy).]
The reference lines on the lvr2plot mark averages, not benchmarks. The “normalized residual squared” axis is just the squared studentized residual over n.
$e_{std(i)} = \dfrac{Y_i - \hat{Y}_i}{RMSE\sqrt{1 - h_i}}$
[Figures: Cook’s distance vs. fitted values for every restaurant, each labeled by name, alongside the scatterplot of Zagat rating on average estimated dinner cost.]
Cook’s Distance: predict cooksd, cooksd
Unit 4 / Page 20
• Benchmarks: distances greater than 1 or 4/n are worth noting, although we acknowledge the impossibility of simple cutoffs.
Stata Demo:
. findit regpt
. regpt
Central Kitchen has notable influence on the regression line, with moderately high amounts of leverage and discrepancy.
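A small sketch of how one might flag observations against the 4/n benchmark in Stata; restaurant is an assumed name for the identifier variable, and the variable names rating and cost are assumed as before.

quietly regress rating cost
predict cd, cooksd                        /* Cook's distance for each observation */
quietly count if !missing(cd)
local cut = 4/r(N)                        /* the 4/n rough benchmark */
list restaurant cd if cd > `cut' & !missing(cd), noobs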
© Andrew Ho, Harvard Graduate School of Education
Key concepts for outlying observations
• A residual is the vertical distance from an observed point to the estimated regression line. OLS minimizes the sum of these squared residuals.
• Raw residuals are interpreted on the scale of Y. They are interpretable as Y above and beyond, “controlling” for, accounting for, or adjusting for X.
• Visualization of residuals can be improved by plotting residuals on fits (rvfplot).
• Leverage is a single measure of the distance of an observation from the means of one or more predictor variables and indicates potential to influence slopes.
• Standardized/studentized residuals are expressed in terms of RMSE, or conditional standard deviations. The latter is adjusted slightly by leverage. These afford probabilistic interpretations if the normality assumption holds.
• Cook’s distance incorporates both residuals (vertical discrepancy) and leverage in a single metric describing the influence of observations on the slope.
• The discrepancy and leverage of observations can be visualized by plotting leverage on squared studentized residuals (lvr2plot).
• An outlier is almost never removed absent an outright data entry error. Instead, its influence is described, and alternative models may be employed.
• Exclusion of an outlier usually requires an argument that it is not part of the population of interest.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 21
© Andrew Ho, Harvard Graduate School of Education
The Butterfly Ballot: The Palm Beach County Outlier
Unit 4 / Page 22
Although Democrats are listed second in the left-hand column, you vote Democratic by punching the third hole.
If you punch the second hole, you are voting for the Reform party (i.e., Pat Buchanan).
RQ: In the 2000 Presidential election, did Buchanan get more votes than we “would have expected” in Palm Beach County?
Of the nearly 6 million votes cast in Florida, the official tally has Bush beating Gore by 537 votes
[Figure: scatterplot of Buchanan votes in 2000 on Reform party registered voters in 2000 for Florida counties; Broward, Hillsborough, Palm Beach, and Pinellas are labeled.]
After univariate descriptives, our scatterplot.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 23
10 Nov 2000: “The Bush campaign claims that the number of votes for Buchanan in Palm Beach County is perfectly accurate. ‘New information has come to our attention that puts in perspective the results of the vote in Palm Beach County,’ Bush spokesman Ari Fleischer said on Thursday. ‘Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there.’” (Salon.com)
© Andrew Ho, Harvard Graduate School of Education
[Figure: residuals vs. fitted values from the regression of Buchanan votes on registered Reform party voters; Broward, Hillsborough, Palm Beach, and Pinellas are labeled.]
Regressing Buchanan’s votes on registered reform party voters
Unit 4 / Page 24
. regress bucvote reform

      Source |       SS           df       MS      Number of obs =      67
-------------+----------------------------------   F(1, 65)      =   81.35
       Model |  7412113.59         1  7412113.59   Prob > F      =  0.0000
    Residual |  5922476.32        65  91115.0202   R-squared     =  0.5559
-------------+----------------------------------   Adj R-squared =  0.5490
       Total |  13334589.9        66  202039.241   Root MSE      =  301.85

     bucvote |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      reform |   3.686713   .4087551     9.02   0.000     2.870372    4.503053
       _cons |   1.532519   46.60847     0.03   0.974    -91.55102    94.61606
Above and beyond, “controlling” for, adjusting for, accounting for… Some say, “after statistically controlling for…” Some of these phrasings are merely acceptable; others are safest.
Palm Beach has 2,163 more votes for Buchanan than predicted given the number of registered Reform Party voters—that is, more than the model predicts given the number of registered Reform Party voters.
[Figure: standardized residuals vs. fitted values, with Palm Beach and Pinellas labeled.]
Standardized/Studentized Residuals
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 25
Stata calls these standardized residuals; I (and others) call them studentized residuals.
Palm Beach County is almost 8 standard deviations above the regression line.
Here, standard deviations are RMSEs slightly adjusted to accentuate high-leverage residuals.
[Figure: leverage vs. normalized residual squared for all Florida counties, each labeled by county name.]
Leverage vs. Residual Squared (Discrepancy)
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 26
[Figure: Cook’s distance vs. fitted values; Broward, Palm Beach, and Pinellas are labeled, with Palm Beach far above the others.]
Cook’s Distance
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 27
Palm Beach County has an extreme influence on the regression line.
© Andrew Ho, Harvard Graduate School of Education
[Figure: scatterplot of Buchanan votes in 2000 on Reform party registered voters in 2000, used for the sensitivity analysis.]
Sensitivity Plots
Unit 4 / Page 28
The sensitivity analysis is one of the most useful tools that you can take from this class. As we begin to appreciate that there are many plausibly correct models, we investigate whether our inferences are robust across these plausible decisions. Also called a robustness check.
“Here are my results and my conclusions.”
“But have you considered …, …, and …?”
“We considered …, …, and …, and also …, …, and …
… and it makes no difference.
… or it does, and here is why.”
What to do with outlying observations (1)
• Visualize them.
– Scatterplots, residuals vs. fitted plots (rvfplot), standardized/studentized residuals, leverage vs. residuals-squared plots (lvr2plot), Cook’s distance.
• Describe their degree of discrepancy and influence in terms of the above and in context.
• Try to explain each outlying observation. What is the story of Central Kitchen? What is the story of Palm Beach County? What are the plausible rival stories that might account for the outlier?
• Severely influential or outlying observations must be documented for your audience.
• Conduct a sensitivity analysis: results with and without the outlier.
• Address whether the outlier affects your primary inferences.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 29
Comparison of regression models predicting Buchanan’s total vote (Florida Presidential Election data, 2000)

                      All Florida counties (n=67)   Without Palm Beach (n=66)
Estimated slope                3.69*                        2.45*
Estimated S.E.                (0.41)                       (0.12)
t statistic                    9.02                        20.18
R²                            55.6%                        86.4%

* p < .001
What to do with outlying observations (2)
• Conduct a full analysis without the outlying observation and retest assumptions.
• Sometimes eliminating an outlier reveals other outliers…
• Stata note: After running regress again without the outlier, use the predict command without it, too:
predict cooksd_1 if county != "PALM BEACH", cooksd
• In general, do not remove outliers unless demonstrating sensitivity. Deciding to proceed with a model without an outlier requires very strong substantive justification:
– Because of 1) problems with misleading ballots in this county and no other, 2) the absence of other explanations for the outlying observation, and 3) our interest in modeling the relationship between bucvote and reform in a population without misleading ballots, we proceed with a regression model excluding Palm Beach County.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 30
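One way to carry out the sensitivity comparison in Stata, sketched with the variable names shown in the regression output above (county is assumed to be a string variable holding county names):

/* Fit with all counties, then without Palm Beach, and compare */
quietly regress bucvote reform
estimates store all_counties

quietly regress bucvote reform if county != "PALM BEACH"
predict cooksd_1 if county != "PALM BEACH", cooksd   /* re-check diagnostics on the trimmed fit */
estimates store no_palm_beach

estimates table all_counties no_palm_beach, b(%9.3f) se stats(N r2)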
[Figures: with Palm Beach removed, residuals vs. fitted values and Cook’s distance vs. fitted values; Broward, Collier, Duval, Marion, Pinellas, and Polk are now the labeled counties.]
© Andrew Ho, Harvard Graduate School of Education
What to do with outlying observations (3)
• If outliers are part of your population of interest, they cannot be removed.
• If they are part of your population, then in spite of their influence, your estimated slope is still unbiased… your best guess over many samples.
• However, your standard error may be so large (sometimes you get the outliers, sometimes you don’t) that you may prefer a more stable model, even if it’s not unbiased.
• This leads to alternative approaches. One to remember: median regression, also called least absolute value (LAV) regression, where we model the conditional median and minimize the absolute deviations (instead of squared deviations).
– Recall that the median is more robust to outliers than the mean. In Stata, this is a subset of quantile regression under the qreg command; see the sketch below.
• As a default, don’t get fancy. Model in a familiar framework to best communicate your results.
• Sometimes, a simple transformation can draw in an outlier and let us preserve our familiar framework (next unit).
Unit 4 / Page 31
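A sketch of the median-regression alternative in Stata, using the Florida variables from earlier; qreg fits the conditional median (quantile 0.5) by default.

qreg bucvote reform        /* least-absolute-value (median) regression */
regress bucvote reform     /* OLS, for comparing the two estimated slopes */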
© Andrew Ho, Harvard Graduate School of Education
Heteroscedasticity: Blood pressure example
Unit 4 / Page 32
[Figure: scatterplot of diastolic blood pressure on age, ages roughly 20 to 60.]
© Andrew Ho, Harvard Graduate School of Education
[Figure: residuals vs. fitted values for the blood pressure regression; the spread of the residuals fans out as fitted values increase.]
Heteroscedasticity: Visual inspection
Unit 4 / Page 33
There is a clear increase in the conditional variance of blood pressure as age increases, but a linear relationship seems appropriate. The result is an unbiased slope and fine prediction, but also high standard errors and inaccurate conditional standard deviations.
When the conditional variance can be modeled as a function of the predictor, we can use a weighted least squares (WLS) approach that diminishes the “pull” of noisy observations.
And sometimes, when the relationship is nonlinear, transformations can reduce heteroscedasticity.
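A hedged sketch of these two responses in Stata; dbp is an assumed variable name for diastolic blood pressure, and the variance-proportional-to-age² weighting is purely illustrative, not a recommendation for these data.

quietly regress dbp age
estat hettest                            /* Breusch-Pagan test for heteroscedasticity */

/* WLS sketch: if residual variance grows roughly with age^2, down-weight */
/* the noisier (older) observations via analytic weights of 1/age^2 */
regress dbp age [aweight = 1/(age^2)]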
© Andrew Ho, Harvard Graduate School of Education
Normality of residuals: hist res, kden percent
• Residuals will always have a mean of zero but should also have a symmetrical normal distribution, both conditionally and unconditionally.
• Violations can lead to poor estimation properties, particularly in small samples, although without nonlinearity, outliers or heteroscedasticity, the estimates can be quite robust.
• When accompanied by nonlinearity and heteroscedasticity, the remedial approach is typically transformation.
Unit 4 / Page 34
[Figures: histograms of residuals with kernel density overlays for the Blood Pressure example and the Florida example (Palm Beach removed).]
http://www.people.vcu.edu/~rjohnson/regression/
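A sketch of how these residual histograms and related normality checks can be produced in Stata, assuming the residuals were saved to a variable named resid with predict:

histogram resid, percent kdensity normal   /* histogram with kernel density and normal overlays */
qnorm resid                                /* quantile-normal plot of the residuals */
swilk resid                                /* Shapiro-Wilk test, useful at these sample sizes */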
© Andrew Ho, Harvard Graduate School of Education
What are the takeaways from this unit?
Unit 4 / Page 35
• Inference from regression models requires adherence to important assumptions
– Independent and identically normally distributed residuals with mean 0
• Residuals are the tools we use to test our assumptions
– Linearity (Do residuals have a conditional mean of 0 in the observed range?)
– Outliers (Are residuals identically distributed? Normally distributed?)
– Homoscedasticity (Are residuals identically distributed?)
– Normality (Are residuals normally distributed?)
• Residuals can be interpreted as values above and beyond, accounting for, “controlling” for, or adjusting for X. Residuals are error that remains to be explained.
• Outlying observations must be documented and described. Particularly influential outliers may require sensitivity studies, and all actions taken must be clearly and honestly communicated.
• For each regression assumption, we can now:
– Communicate degrees of violation of assumptions.
– Articulate how and where we might expect a violation to threaten our inferences.
– Take remedial action to improve model fit or articulate threats to inferences.
© Andrew Ho, Harvard Graduate School of Education
Terms to Note
• Assumptions
• Cook’s distance
• Discrepancy
• Homoscedasticity
• Independence of observations
• Influence
• Leverage
• Leverage vs. residuals-squared plot
• Linearity
• Normality
• Outlier
• Residuals vs. fitted values plot
• Sensitivity studies and plots
• Standardized residuals
• Studentized residuals
Unit 4 / Page 36