Unit 4: Regression Assumptions and Diagnostics (Class 11)
http://xkcd.com/539/
Unit 4 / Page 1 © Andrew Ho, Harvard Graduate School of Education
© Andrew Ho, Harvard Graduate School of Education
Where is Unit 4 in our 11-Unit Sequence?
Building a solid foundation:
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model
Mastering the subtleties:
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity
Adding additional predictors:
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects:
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects
Pulling it all together:
• Unit 11: Regression in practice. Common extensions.
Unit 4 / Page 2
In this unit, we’re going to learn about…
• Reprise of the assumptions required for OLS regression-based inference
• The four major types of model violations:
– Outliers
– Nonlinearity
– Heteroscedasticity
– Non-independence of errors
• Determining whether the regression assumptions hold—strategies and rationale
– Why residuals provide a powerful lens for evaluating regression assumptions
– Residuals as observations that “control for” or account for variables
– Raw vs. standardized/studentized residuals
– Leverage, outliers, and Cook’s Distance
– Residual plots: How to construct them and what to look for
– What should we do if we identify an outlier or other unusual observation?
• How would we summarize our results?
Unit 4 / Page 3 © Andrew Ho, Harvard Graduate School of Education
The linear regression model and its four assumptions
Assumption 1: At each value of $X$ there is a conditional distribution of $Y$ that is normal with mean $\mu_{Y|X}$ and SD $\sigma_{Y|X}$.
Assumption 2: The straight-line model is correct: the $\mu_{Y|X}$ fall on a line.
Assumption 3: Homoscedasticity: the conditional standard deviations, $\sigma_{Y|X}$, are equal across all $X$.
Assumption 4: Conditional independence: for any value of $X$, the $\epsilon$s are independent. They share no hidden common association. This cannot be visualized from the plot.
These four assumptions are often summarized in a shorthand: $\epsilon \overset{i.i.d.}{\sim} N(0, \sigma^2_{Y|X})$. This reads: the population residuals, $\epsilon$, are independent and identically normally distributed with mean 0 (values centered on the prediction) and a common variance $\sigma^2_{Y|X}$ (and thus a common SD).
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 4
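To make the shorthand concrete, here is a minimal Stata sketch (not from the original slides; the coefficients and seed are hypothetical) that simulates data satisfying all four assumptions: a straight-line conditional mean and independent, identically normal errors with constant SD.

/* Simulate data that satisfy the four assumptions (hypothetical example) */
clear
set seed 12345
set obs 200
generate x = 10*runiform()              // arbitrary predictor values
generate y = 2 + 0.5*x + rnormal(0, 1)  // linear mean, i.i.d. N(0, 1) errors
regress y x                             // estimates should land near 2 and 0.5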
© Andrew Ho, Harvard Graduate School of Education
Anscombe’s Quartet: Four datasets with identical summary statistics
How can we investigate the distributions of our residuals?
Unit 4 / Page 5
Diagnostic Plots: Residuals vs. Fitted Values
[Figures: the Anscombe y2-on-x2 scatterplot with its fitted line, the residuals plotted against x2, and the residuals plotted against the fitted values for the quartet’s datasets.]
To better visualize the residuals, we can subtract out the regression line. Instead of plotting y2 on x2, we can plot y2 − ŷ2 on x2:

/* Graph the usual linear fit */
graph twoway (scatter y2 x2) (lfit y2 x2)
/* Run the regression */
regress y2 x2
/* Save residuals to variable resid */
predict resid, residuals
/* Plot residuals on x2 */
scatter resid x2
/* An equivalent visual representation is obtained by plotting residuals
   on their predicted ("fitted") values, the yhats. */
predict yhat
scatter resid yhat
// Or, in one smooth move after regress…
rvfplot, name(rvf2)

For simple linear regression, since the ŷ values are linearly related to the x values, the plots look the same but just have a different horizontal axis. In practice, we use fitted values because it is hard to plot against multiple predictors at once when we move to multiple regression.
Unit 4 / Page 6© Andrew Ho, Harvard Graduate School of Education
Plots of residuals vs. fitted values for Anscombe’s Quartet
Your residuals will always sum to zero, and the correlation between your residuals and fitted values (or any $X$) will always be zero. RVF plots are a better lens through which to appreciate your residuals.
Unit 4 / Page 7© Andrew Ho, Harvard Graduate School of Education
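As a quick check of these two facts, a minimal sketch (assuming the Anscombe variables y2 and x2 used above are in memory):

quietly regress y2 x2
predict resid2, residuals      /* raw residuals */
predict yhat2, xb              /* fitted values */
summarize resid2               /* the mean (and sum) of the residuals is zero */
correlate resid2 yhat2 x2      /* residuals are uncorrelated with fits and with x2 */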
Zagat: Can the price of your dinner predict its quality?
n = 76 Boston restaurants classified as “New American” in the ’05–’06 guide.
Cost: Zagat surveyor’s average estimate of the price of dinner, one drink, and tip.
Rating: Average of the restaurant’s ratings for food, service, and decor. A 0–30 scale where 16–19 is good to very good, 20–25 is very good to excellent, and 26–30 is extraordinary to perfection.
Unit 4 / Page 8© Andrew Ho, Harvard Graduate School of Education
Scatterplot of Zagat rating on Zagat-reported cost
Unit 4 / Page 9© Andrew Ho, Harvard Graduate School of Education
© Andrew Ho, Harvard Graduate School of Education
Regression Results
61.6% of the variance in ratings is associated with (accounted for by/explained by) the cost.
The conditional standard deviation of ratings is 1.56 (the unconditional standard deviation was 2.5)
A dollar difference in cost predicts (is associated with) a .186-point difference in ratings.
The sample statistic of .186 is 10.9 estimated standard errors away from our null hypothesis of 0 slope. The probability of such a slope under the null hypothesis is very, very low. A plausible range for the “true” slope is between .15 and .22; 95% of these kinds of intervals are expected to contain the “true” slope.
The relationship is moderately strong and statistically significant, where a $10 difference in cost predicts a 1.86-point difference in the Zagat ratings.
Unit 4 / Page 10
An estimated linear regression model of rating on cost
Unit 4 / Page 11© Andrew Ho, Harvard Graduate School of Education
And, subtracting out the regression line…
$\widehat{rating} = 13.83 + 0.186 \cdot cost$
Plot of residuals vs. fitted values
[Figure: residuals vs. fitted values for the rating-on-cost regression.]
Unit 4 / Page 12© Andrew Ho, Harvard Graduate School of Education
We look for: 1) linearity 2) outliers 3) homoscedasticity 4) normality
© Andrew Ho, Harvard Graduate School of Education
Linearity: Eyeballing and lpoly (sparingly)
Unit 4 / Page 13
[Figure: residuals vs. fitted values with a local polynomial smooth overlaid (kernel = epanechnikov, degree = 0, bandwidth = .85).]
Responses to nonlinearity include polynomial or other nonlinear regression approaches (not covered here) and transformations (covered in Unit 5)
Ignoring or failing to address nonlinearity risks poor prediction, underestimation of $R^2$, and Type II error (missing a finding).
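One way to “eyeball with lpoly,” sketched under the assumption that resid and yhat were saved with predict as on page 6 (kernel and bandwidth are left at Stata’s defaults here):

/* Overlay a local polynomial smooth on the residual-vs-fitted plot; */
/* systematic bends away from the zero line suggest nonlinearity */
twoway (scatter resid yhat) ///
       (lpoly resid yhat, degree(0)), yline(0) legend(off)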
[Figure: residuals vs. fitted values with three large positive residuals labeled: 29 Newbury, Bristol, and Stanhope Grille.]
Outliers: Identifying and interpreting relatively large residuals
But residuals can’t identify outliers alone.
Positive residuals: underpredicted. High ratings after “controlling for,” “partialling out,” “regressing out,” or accounting for cost.
Unit 4 / Page 14© Andrew Ho, Harvard Graduate School of Education
Leverage: Extremity of an observation along predictor variables
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 15
This outlier has a large residual; this outlier does not.
We should consider both the discrepancy from the regression line and the leverage exerted on the regression line.
With many predictors, the leverage of an observation is a single expression of distance from many predictor means.
With simple linear regression, it looks somewhat familiar:
It is called $h$ because leverage is an element of the “hat matrix”: it puts the hat on the Y.
$h = \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}$
http://www.stat.sc.edu/~west/javahtml/Regression.html
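A sketch of how the leverage formula above can be reproduced in Stata for the Zagat regression; rating and cost are assumed variable names for the dataset described earlier.

quietly regress rating cost
predict lev, leverage                   /* h_i for each restaurant */

/* Rebuild h_i by hand: 1/n + (X_i - Xbar)^2 / sum_j (X_j - Xbar)^2 */
quietly summarize cost
generate lev_manual = 1/r(N) + (cost - r(mean))^2 / (r(Var)*(r(N) - 1))
list lev lev_manual in 1/5              /* the two columns should agree */

summarize lev
display r(sum)                          /* leverages sum to k + 1 = 2 */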
[Figure: leverage plotted against average estimated dinner cost ($); leverage is smallest for restaurants near the mean cost and grows for restaurants at the extremes.]
Visualizing Leverage: Look familiar?
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 16
Leverage values should be interpreted relatively. They are never more than 1, and the sum of all leverages is the number of predictors + 1 (here, 2).
The greater the leverage, the more a single observation can influence the regression line.
$h = \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}$
[Figure: the residuals-vs-fitted plot again, with 29 Newbury, Bristol, and Stanhope Grille labeled.]
Incorporating leverage… a little.
Unit 4 / Page 17© Andrew Ho, Harvard Graduate School of Education
To better interpret these residuals, we can “standardize” them: express them in terms of a kind of standard deviation unit that incorporates leverage.
[Figure: standardized residuals vs. fitted values, with 29 Newbury, Bristol, and Stanhope Grille labeled.]
Standardized/Studentized residuals: predict stdres, rstandard
Expresses residuals in terms of RMSE (conditional standard deviations) while inflating residuals with more leverage.
If residuals are homoscedastic and normally distributed about the regression line, we have a sense of how unexpected a standardized residual might be.
Unit 4 / Page 18© Andrew Ho, Harvard Graduate School of Education
Called “studentized” in honor of Gosset of the t-test (pseudonym: Student). Stata calls these “standardized residuals.” Standardized residuals usually refer to dividing by RMSE alone; the two are very similar for large sample sizes.
$e_{std(i)} = \dfrac{Y_i - \hat{Y}_i}{RMSE\sqrt{1 - h_i}}$
$d_i = \left(\dfrac{h_i}{1 - h_i}\right)\left(\dfrac{e_{std(i)}^2}{k + 1}\right)$
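A sketch that reproduces both formulas above in Stata and compares them with the built-in predict options (again assuming rating and cost as variable names, with k = 1 predictor):

quietly regress rating cost
predict ehat, residuals
predict hati, leverage
predict rstand, rstandard               /* Stata's "standardized" residuals */
predict cooks, cooksd                   /* Stata's Cook's distance */

/* e_std(i) = (Y_i - Yhat_i) / (RMSE * sqrt(1 - h_i)) */
generate rstand_manual = ehat / (e(rmse)*sqrt(1 - hati))

/* d_i = [h_i / (1 - h_i)] * [e_std(i)^2 / (k + 1)], with k + 1 = 2 */
generate cooks_manual = (hati/(1 - hati)) * (rstand_manual^2 / 2)

list rstand rstand_manual cooks cooks_manual in 1/5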
[Figure: leverage vs. normalized residual squared (lvr2plot) for the Boston restaurants, each labeled by name.]
Cook’s Distance: A metric incorporating both discrepancy and leverage
Unit 4 / Page 19© Andrew Ho, Harvard Graduate School of Education
• A go-to metric to address the question, did this single observation have a notable impact on my results? A measure of influence.
Recall: Cook’s Distance:
[Figure: scatterplot of Zagat rating on average estimated dinner cost, annotated to show leverage (horizontal extremity) and the squared standardized residual (vertical discrepancy).]
The reference lines on the lvr2plot mark averages, not benchmarks. The “normalized residual squared” axis is just the squared studentized residual over n.
$e_{std(i)} = \dfrac{Y_i - \hat{Y}_i}{RMSE\sqrt{1 - h_i}}$
[Figures: Cook’s distance vs. fitted values for every restaurant, each labeled by name, alongside the scatterplot of Zagat rating on average estimated dinner cost.]
Cook’s Distance: predict cooksd, cooksd
Unit 4 / Page 20
• Benchmarks: distances greater than 1 or 4/n are worth noting, although we acknowledge the impossibility of simple cutoffs.
Stata Demo:
. findit regpt
. regpt
Central Kitchen has notable influence on the regression line, with moderately high amounts of leverage and discrepancy.
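A small sketch of how one might flag observations against the 4/n benchmark in Stata; restaurant is an assumed name for the identifier variable, and the variable names rating and cost are assumed as before.

quietly regress rating cost
predict cd, cooksd                        /* Cook's distance for each observation */
quietly count if !missing(cd)
local cut = 4/r(N)                        /* the 4/n rough benchmark */
list restaurant cd if cd > `cut' & !missing(cd), noobs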
© Andrew Ho, Harvard Graduate School of Education
Key concepts for outlying observations
• A residual is the vertical distance from an observed point to the estimated regression line. OLS minimizes the sum of these squared residuals.
• Raw residuals are interpreted on the scale of Y. They are interpretable as Y above and beyond, “controlling” for, accounting for, or adjusting for X.
• Visualization of residuals can be improved by plotting residuals on fits (rvfplot).
• Leverage is a single measure of the distance of an observation from the means of one or more predictor variables and indicates potential to influence slopes.
• Standardized/studentized residuals are expressed in terms of RMSE, or conditional standard deviations. The latter is adjusted slightly by leverage. These afford probabilistic interpretations if the normality assumption holds.
• Cook’s distance incorporates both residuals (vertical discrepancy) and leverage in a single metric describing the influence of observations on the slope.
• The discrepancy and leverage of observations can be visualized by plotting leverage on squared studentized residuals (lvr2plot).
• An outlier is almost never removed absent an outright data entry error. Instead, its influence is described, and alternative models may be employed.
• Exclusion of an outlier usually requires an argument that it is not part of the population of interest.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 21
© Andrew Ho, Harvard Graduate School of Education
The Butterfly Ballot: The Palm Beach County Outlier
Unit 4 / Page 22
Although Democrats are listed second in the left-hand column, you vote Democratic by punching the third hole.
If you punch the second hole, you are voting for the Reform party (i.e., Pat Buchanan).
RQ: In the 2000 Presidential election, did Buchanan get more votes than we “would have expected” in Palm Beach County?
Of the nearly 6 million votes cast in Florida, the official tally has Bush beating Gore by 537 votes
[Figure: scatterplot of Buchanan votes in 2000 on Reform party registered voters in 2000 for Florida counties; Broward, Hillsborough, Palm Beach, and Pinellas are labeled.]
After univariate descriptives, our scatterplot.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 23
10 Nov 2000: “The Bush campaign claims that the number of votes for Buchanan in Palm Beach County is perfectly accurate. ‘New information has come to our attention that puts in perspective the results of the vote in Palm Beach County,’ Bush spokesman Ari Fleischer said on Thursday. ‘Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there.’” (Salon.com)
© Andrew Ho, Harvard Graduate School of Education
[Figure: residuals vs. fitted values from the regression of Buchanan votes on registered Reform party voters; Broward, Hillsborough, Palm Beach, and Pinellas are labeled.]
Regressing Buchanan’s votes on registered reform party voters
Unit 4 / Page 24
. regress bucvote reform

      Source |       SS           df       MS      Number of obs =      67
-------------+----------------------------------   F(1, 65)      =   81.35
       Model |  7412113.59         1  7412113.59   Prob > F      =  0.0000
    Residual |  5922476.32        65  91115.0202   R-squared     =  0.5559
-------------+----------------------------------   Adj R-squared =  0.5490
       Total |  13334589.9        66  202039.241   Root MSE      =  301.85

     bucvote |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      reform |   3.686713   .4087551     9.02   0.000     2.870372    4.503053
       _cons |   1.532519   46.60847     0.03   0.974    -91.55102    94.61606
Above and beyond, “controlling” for, adjusting for, accounting for… Some say, “after statistically controlling for…” Some of these phrasings are merely acceptable; others are safest.
Palm Beach has 2,163 more votes for Buchanan than predicted given the number of registered Reform Party voters—that is, more than the model predicts given the number of registered Reform Party voters.
[Figure: standardized residuals vs. fitted values, with Palm Beach and Pinellas labeled.]
Standardized/Studentized Residuals
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 25
Stata calls these standardized residuals; I (and others) call them studentized residuals.
Palm Beach County is almost 8 standard deviations above the regression line.
Here, standard deviations are RMSEs slightly adjusted to accentuate high-leverage residuals.
[Figure: leverage vs. normalized residual squared for all Florida counties, each labeled by county name.]
Leverage vs. Residual Squared (Discrepancy)
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 26
[Figure: Cook’s distance vs. fitted values; Broward, Palm Beach, and Pinellas are labeled, with Palm Beach far above the others.]
Cook’s Distance
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 27
Palm Beach County has an extreme influence on the regression line.
© Andrew Ho, Harvard Graduate School of Education
[Figure: scatterplot of Buchanan votes in 2000 on Reform party registered voters in 2000, used for the sensitivity analysis.]
Sensitivity Plots
Unit 4 / Page 28
The sensitivity analysis is one of the most useful tools that you can take from this class. As we begin to appreciate that there are many plausibly correct models, we investigate whether our inferences are robust across these plausible decisions. Also called a robustness check.
“Here are my results and my conclusions.”
“But have you considered …, …, and …?”
“We considered …, …, and …, and also …, …, and …
… and it makes no difference.
… or it does, and here is why.”
What to do with outlying observations (1)
• Visualize them.
– Scatterplots, residuals vs. fitted plots (rvfplot), standardized/studentized residuals, leverage vs. residuals-squared plots (lvr2plot), Cook’s distance.
• Describe their degree of discrepancy and influence in terms of the above and in context.
• Try to explain each outlying observation. What is the story of Central Kitchen? What is the story of Palm Beach County? What are the plausible rival stories that might account for the outlier?
• Severely influential or outlying observations must be documented for your audience.
• Conduct a sensitivity analysis: results with and without the outlier.
• Address whether the outlier affects your primary inferences.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 29
Comparison of regression models predicting Buchanan’s total vote (Florida Presidential Election data, 2000)

                      All Florida counties (n=67)   Without Palm Beach (n=66)
Estimated slope                3.69*                        2.45*
Estimated S.E.                (0.41)                       (0.12)
t statistic                    9.02                        20.18
R²                            55.6%                        86.4%

* p < .001
What to do with outlying observations (2)
• Conduct a full analysis without the outlying observation and retest assumptions.
• Sometimes eliminating an outlier reveals other outliers…
• Stata note: After running regress again without the outlier, use the predict command without it, too:
predict cooksd_1 if county != "PALM BEACH", cooksd
• In general, do not remove outliers unless demonstrating sensitivity. Deciding to proceed with a model without an outlier requires very strong substantive justification:
– Because of 1) problems with misleading ballots in this county and no other, 2) the absence of other explanations for the outlying observation, and 3) our interest in modeling the relationship between bucvote and reform in a population without misleading ballots, we proceed with a regression model excluding Palm Beach County.
© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 30
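One way to carry out the sensitivity comparison in Stata, sketched with the variable names shown in the regression output above (county is assumed to be a string variable holding county names):

/* Fit with all counties, then without Palm Beach, and compare */
quietly regress bucvote reform
estimates store all_counties

quietly regress bucvote reform if county != "PALM BEACH"
predict cooksd_1 if county != "PALM BEACH", cooksd   /* re-check diagnostics on the trimmed fit */
estimates store no_palm_beach

estimates table all_counties no_palm_beach, b(%9.3f) se stats(N r2)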
[Figures: with Palm Beach removed, residuals vs. fitted values and Cook’s distance vs. fitted values; Broward, Collier, Duval, Marion, Pinellas, and Polk are now the labeled counties.]
© Andrew Ho, Harvard Graduate School of Education
What to do with outlying observations (3)
• If outliers are part of your population of interest, they cannot be removed.
• If they are part of your population, then in spite of their influence, your estimated slope is still unbiased… your best guess over many samples.
• However, your standard error may be so large (sometimes you get the outliers, sometimes you don’t) that you may prefer a more stable model, even if it’s not unbiased.
• This leads to alternative approaches. One to remember: median regression, also called least absolute value (LAV) regression, where we model the conditional median and minimize the absolute deviations (instead of squared deviations).
– Recall that the median is more robust to outliers than the mean. In Stata, this is a subset of quantile regression under the qreg command; see the sketch below.
• As a default, don’t get fancy. Model in a familiar framework to best communicate your results.
• Sometimes, a simple transformation can draw in an outlier and let us preserve our familiar framework (next unit).
Unit 4 / Page 31
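A sketch of the median-regression alternative in Stata, using the Florida variables from earlier; qreg fits the conditional median (quantile 0.5) by default.

qreg bucvote reform        /* least-absolute-value (median) regression */
regress bucvote reform     /* OLS, for comparing the two estimated slopes */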
© Andrew Ho, Harvard Graduate School of Education
Heteroscedasticity: Blood pressure example
Unit 4 / Page 32
[Figure: scatterplot of diastolic blood pressure on age, ages roughly 20 to 60.]
© Andrew Ho, Harvard Graduate School of Education
[Figure: residuals vs. fitted values for the blood pressure regression; the spread of the residuals fans out as fitted values increase.]
Heteroscedasticity: Visual inspection
Unit 4 / Page 33
There is a clear increase in the conditional variance of blood pressure as age increases, but a linear relationship seems appropriate. The result is an unbiased slope and fine prediction, but also high standard errors and inaccurate conditional standard deviations.
When the conditional variance can be modeled as a function of the predictor, we can use a weighted least squares (WLS) approach that diminishes the “pull” of noisy observations.
And sometimes, when the relationship is nonlinear, transformations can reduce heteroscedasticity.
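A hedged sketch of these two responses in Stata; dbp is an assumed variable name for diastolic blood pressure, and the variance-proportional-to-age² weighting is purely illustrative, not a recommendation for these data.

quietly regress dbp age
estat hettest                            /* Breusch-Pagan test for heteroscedasticity */

/* WLS sketch: if residual variance grows roughly with age^2, down-weight */
/* the noisier (older) observations via analytic weights of 1/age^2 */
regress dbp age [aweight = 1/(age^2)]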
© Andrew Ho, Harvard Graduate School of Education
Normality of residuals: hist res, kden percent
• Residuals will always have a mean of zero but should also have a symmetrical normal distribution, both conditionally and unconditionally.
• Violations can lead to poor estimation properties, particularly in small samples, although without nonlinearity, outliers or heteroscedasticity, the estimates can be quite robust.
• When accompanied by nonlinearity and heteroscedasticity, the remedial approach is typically transformation.
Unit 4 / Page 34
[Figures: histograms of residuals with kernel density overlays for the Blood Pressure example and the Florida example (Palm Beach removed).]
http://www.people.vcu.edu/~rjohnson/regression/
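A sketch of how these residual histograms and related normality checks can be produced in Stata, assuming the residuals were saved to a variable named resid with predict:

histogram resid, percent kdensity normal   /* histogram with kernel density and normal overlays */
qnorm resid                                /* quantile-normal plot of the residuals */
swilk resid                                /* Shapiro-Wilk test, useful at these sample sizes */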
© Andrew Ho, Harvard Graduate School of Education
What are the takeaways from this unit?
Unit 4 / Page 35
• Inference from regression models requires adherence to important assumptions
– Independent and identically normally distributed residuals with mean 0
• Residuals are the tools we use to test our assumptions
– Linearity (Do residuals have a conditional mean of 0 in the observed range?)
– Outliers (Are residuals identically distributed? Normally distributed?)
– Homoscedasticity (Are residuals identically distributed?)
– Normality (Are residuals normally distributed?)
• Residuals can be interpreted as values above and beyond, accounting for, “controlling” for, or adjusting for X. Residuals are error that remains to be explained.
• Outlying observations must be documented and described. Particularly influential outliers may require sensitivity studies, and all actions taken must be clearly and honestly communicated.
• For each regression assumption, we can now:
– Communicate degrees of violation of assumptions.
– Articulate how and where we might expect a violation to threaten our inferences.
– Take remedial action to improve model fit or articulate threats to inferences.
© Andrew Ho, Harvard Graduate School of Education
Terms to Note
• Assumptions
• Cook’s distance
• Discrepancy
• Homoscedasticity
• Independence of observations
• Influence
• Leverage
• Leverage vs. residuals-squared plot
• Linearity
• Normality
• Outlier
• Residuals vs. fitted values plot
• Sensitivity studies and plots
• Standardized residuals
• Studentized residuals
Unit 4 / Page 36