Regression III: Advanced Methods
Lecture 14: Non-normal Error Distribution and Nonlinear Functional Forms
Bill Jacoby, Michigan State University
http://polisci.msu.edu/jacoby/icpsr/regress3


Page 1

Lecture 14: Non-normal Error Distribution and Nonlinear Functional Forms

http://polisci.msu.edu/jacoby/icpsr/regress3

Regression III: Advanced Methods

Bill Jacoby, Michigan State University

Page 2

Goals of the lecture

• Discuss methods for detecting non-normality in errors and nonlinearity in functional form
  – Each of these reflects a problem with the specification of the model
• Discuss various ways that transformations can be used to remedy these problems
• Explore maximum likelihood methods that embed the linear model in a more general nonlinear model incorporating transformations as parameters

Page 3

Non-Normally Distributed Errors

• The least-squares fit is based on the conditional mean
  – The mean is not a good measure of centre for either a highly skewed distribution or a multi-modal distribution
• Non-normality does not produce bias in the coefficient estimates, but it does have two important consequences:
  – It poses problems for efficiency: the OLS standard errors are no longer the smallest. Weighted least squares (WLS) is more efficient
  – Standard errors can be biased: confidence intervals and significance tests may lead to wrong conclusions. Robust standard errors can compensate for this problem
• Transformations can often remedy the "heavy-tailed errors" problem
• Re-specification of the model (i.e., including a missing discrete predictor) can sometimes fix a multi-modal error problem
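As a numerical aside (not in the slides, which use R throughout), the contrast between classical and robust standard errors can be sketched with NumPy. The simulated data and the HC0 "sandwich" variant are my choices, not the slides':

```python
import numpy as np

# Simulated regression with heavy-tailed (t, 3 df) errors -- illustrative only
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

XtX_inv = np.linalg.inv(X.T @ X)
# Classical OLS variance estimate: s^2 (X'X)^-1
s2 = e @ e / (n - X.shape[1])
se_ols = np.sqrt(np.diag(s2 * XtX_inv))
# Robust HC0 "sandwich" estimate: (X'X)^-1 X' diag(e_i^2) X (X'X)^-1
meat = X.T @ (e[:, None] ** 2 * X)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print("OLS SEs:", se_ols, "robust SEs:", se_robust)
```

The slope estimate itself stays close to the true value; only the standard-error calculations differ, which is the point the slide makes about inference rather than bias.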

Page 4

Assessing skewness:
Quantile-comparison plots of the studentized residuals

• Highly skewed error distributions can compromise the interpretation of the least-squares fit
  – Quantile-comparison plots are a useful method for exploring for outliers and skewness in the residuals
• Recall that the sample distribution of the studentized residuals Ei* can be compared either to the quantiles of the unit-normal distribution, N(0,1), or to the t-distribution with n-k-2 df
• Transforming the dependent variable can often correct skewness in the error distribution
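The quantile comparison can be sketched numerically. This hedged Python illustration (simulated residuals in place of real ones, df chosen per the n-k-2 rule above) builds the quantile pairs that a quantile-comparison plot would display:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 100, 2                        # n observations, k predictors (illustrative)
e_star = rng.standard_t(df=n - k - 2, size=n)   # stand-in for studentized residuals

# Pair empirical quantiles with the t_{n-k-2} reference quantiles
order = np.sort(e_star)
p = (np.arange(1, n + 1) - 0.5) / n
q_t = stats.t.ppf(p, df=n - k - 2)

# When the residuals match the reference distribution, the points
# (q_t, order) fall close to a straight line
corr = np.corrcoef(q_t, order)[0, 1]
print(round(corr, 3))
```

Plotting `order` against `q_t` gives the quantile-comparison plot; skewness shows up as systematic curvature, and outliers as points far off the line in the tails.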

Page 5

Assessing Modality:
Density estimation of the errors (1)

• Density estimation is an effective way to assess modality
• Recall that density estimation estimates the probability density function of a variable from the sample by averaging and smoothing the histogram
  – Areas under the density curve up to a particular point represent probabilities
  – The total area under the density function is 1
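A minimal sketch of the idea, assuming a Gaussian kernel (the slides do not specify one); the sample is simulated:

```python
import numpy as np

def kde(sample, grid, h):
    """Kernel density estimate: average a Gaussian kernel of bandwidth h
    placed at each observation (a smoothed histogram)."""
    u = (grid[:, None] - sample[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.0, 500)
grid = np.linspace(-5, 5, 1001)
dens = kde(sample, grid, h=0.4)

# The area under the estimated density should be (approximately) 1
area = dens.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Bimodality in the errors would show up directly as two bumps in `dens`, which is what the next slides look for in the Inequality example.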

Page 6

Distribution of the Residuals:
Example: Inequality data

• The density estimate of the studentized residuals clearly shows a positive skew, and the possibility of a grouping of cases to the right
• We know from earlier that there are some unusual cases, so we shall remove them again before continuing

[Figure: density estimate of rstudent(Weakliem.model), roughly -2 to 4 on the x-axis; y-axis: probability density function]

Page 7

Studentized Residuals after removing Czech Republic and Slovakia

[Figures: density estimate of rstudent(Weakliem.model2) (y-axis: probability density function), and quantile-comparison plot of the studentized residuals from Weakliem.model2 against t quantiles]

Page 8

Nonlinearity

• The assumption that the average error, E(ε), is everywhere zero implies that the regression surface accurately reflects the dependency of Y on the X's
• We can see this as linearity in the broad sense
  – i.e., nonlinearity refers to a partial relationship between two variables that is not summarized by a straight line, but it could also refer to situations where two variables specified to have additive effects actually interact
• Violating this assumption implies that the model fails to account for a systematic pattern between Y and the X's
  – Often models characterized by this violation will still provide a useful approximation of the pattern in the data, but they can also be misleading
• It is impossible to view the regression surface directly when more than two predictors are specified, but we can employ partial-residual plots to assess nonlinearity

Page 9

Plots for assessing nonlinearity

• Scatterplot matrices are useful for a preliminary assessment of the relationships between several variables in a multiple regression model, but they can be misleading because they plot the marginal rather than partial relationships between Y and each X (i.e., they do not control for the other X's)
• Conditioning plots are better, but still have trouble if there are too many X's
• Partial-regression plots (or added-variable plots) are not very useful either, because they are unable to distinguish between monotone nonlinearity (which can often be corrected with a simple transformation) and non-monotone nonlinearity (which cannot be corrected with a transformation)
• Partial-residual plots (or "component-plus-residual plots"), however, can reveal both monotone and non-monotone nonlinearity

Page 10

Failure of Partial-Regression Plots
(Added-Variable Plots)

• The two plots in column (a) represent the scatterplot of Y and X and the plot of the residuals E against X for one regression; column (b) gives the same for another regression
• Notice that although (a) is characterised by non-monotone nonlinearity, this was not picked up in the simple residual plot, where the pattern is identical to that in (b), which is characterised by monotone nonlinearity
• Only (b) can be transformed to satisfy the linearity requirement

Page 11

Partial-Residual Plots (1)
(Component-Plus-Residual Plots)

• While not as suitable as partial-regression plots for revealing leverage and influence, partial-residual plots are more helpful for establishing violations of the linearity assumption
• The partial residual for the jth independent variable is:

  Ei(j) = Ei + BjXij

• This simply adds the linear component of the partial regression between Y and Xj (which may be characterised by a nonlinear component) to the least-squares residuals
• The "partial residuals" E(j) are plotted versus Xj, meaning that Bj is the slope of the simple regression of E(j) on Xj
• A nonparametric smooth helps assess whether the linear trend adequately captures the partial relationship between Y and Xj
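The definition can be verified numerically. Below is a small NumPy sketch (simulated data, illustrative names) showing that regressing the partial residuals on Xj recovers Bj exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(0, 4, n)
x2 = rng.uniform(0, 4, n)
y = 1 + x1**2 + 0.5 * x2 + rng.normal(0, 0.3, n)   # nonlinear in x1

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# Partial residual for x1: least-squares residual plus x1's linear component
e_partial = e + b[1] * x1

# Regressing the partial residuals on x1 recovers the multiple-regression
# slope b[1], because the residuals e are orthogonal to x1
slope = np.polyfit(x1, e_partial, 1)[0]
print(round(slope, 6), round(b[1], 6))
```

A scatterplot of `e_partial` against `x1` with a nonparametric smooth is exactly the component-plus-residual plot: here the smooth would bend away from the fitted line, exposing the quadratic term the linear model missed.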

Page 12

Partial-Residual Plots vs. Partial-Regression Plots

• Partial-regression plots (or added-variable plots) plot the residuals Y(1) from the regression of Y on all X's except X1 against the residuals from X1 regressed on all the other X's
  – In other words, they plot the relationship between Y and X1 that remains when the effects of X2, …, Xk are removed
  – Good for diagnosing outliers
• Partial-residual plots (or component-plus-residual plots) plot the "partial residuals" E(j) for each observation versus Xj
  – Basically, the partial residuals are the linear component from the partial regression plus the least-squares residuals
  – Good for diagnosing nonlinearity

Page 13

Example of partial-residual plots (1):
The Canadian Prestige data

Page 14

Example of partial-residual plots (2):
The Canadian Prestige data

[Figures: component+residual plots for income, education, and women; y-axis in each panel: Component+Residual(prestige)]

• The plot for income suggests a power transformation (of income) down the ladder of powers; for education, the departure from linearity doesn't appear to be very problematic; for %women there appears to be no relationship

Page 15

Testing for linearity using nested models and discrete data

• Because they divide the data into groups, discrete X's (or combinations of X's) facilitate straightforward tests of nonlinearity
• Essentially, the goal is to categorize an otherwise quantitative explanatory variable, include it in a model in place of the original variable, and compare the fit of the two models
• This is done within a nested-model framework, using F-tests to determine the adequacy of the linear fit
• Note the logic of such a test:
  – The quantitative X assumes that measurement intervals are equally sized and comparable
  – The categorical version of X makes no assumptions about the differences between the "dummied" categories

Page 16

Discrete data:
Lack-of-fit test for linearity

• Assume a model with one explanatory variable X that takes on 10 discrete values:

  (A)  Yi = α + βXi + εi

• We could refit this model treating X as a categorical variable, and thus employing a set of 9 dummy regressors:

  (B)  Yi = α + γ1Di1 + … + γ9Di9 + εi

• Model (A), which specifies a linear trend, is a special case of model (B), which can capture any pattern of relationship between E(Y) and X. In other words, model (A) is nested within model (B)
• We can then do an incremental F-test for linearity: if the linear model adequately captures the relationship, the difference between the two models will not be statistically significant
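The nested-model comparison can be sketched numerically; this is a minimal illustration with simulated data and NumPy/SciPy standing in for the slides' R code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
levels = np.arange(1, 11)                 # X takes on 10 discrete values
x = rng.choice(levels, size=400)
y = 2 + 0.8 * x + rng.normal(0, 1, 400)   # truly linear relationship

def rss(X, y):
    """Residual sum of squares from an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

# Model (A): linear trend (intercept + slope, 2 parameters)
XA = np.column_stack([np.ones_like(x, dtype=float), x])
# Model (B): X as categorical -> intercept + 9 dummy regressors
XB = np.column_stack([np.ones_like(x, dtype=float)] +
                     [(x == lv).astype(float) for lv in levels[1:]])

rss_a, rss_b = rss(XA, y), rss(XB, y)
df_num = XB.shape[1] - XA.shape[1]        # 10 - 2 = 8 extra parameters
df_den = len(y) - XB.shape[1]
F = ((rss_a - rss_b) / df_num) / (rss_b / df_den)
p = stats.f.sf(F, df_num, df_den)
print(round(F, 3), round(p, 3))
```

Because the simulated relationship really is linear, the incremental F statistic should usually be unremarkable; with a genuinely nonlinear pattern it would be large and its p-value small.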

Page 17

Test for linearity (1):
Example from the 1989 General Social Survey

• The data explore whether education level affects vocabulary test scores
• In the first model, education is treated as a continuous variable that takes on 20 different values
• This model is contrasted with a model that codes education as a set of m-1 = 19 dummy regressors

[Figure: scatterplot of vocabulary (0 to 10) against education (0 to 20)]

Page 18

R script for the discrete-data nonlinearity test:
1989 GSS example

Page 19

Example of test for linearity

• The linear-trend model is nested within the nonlinear model
• As we see here, the lack of fit for the linear trend is not statistically significant
• We can conclude, then, that a linear trend adequately captures the pattern in the data
• This general principle can be extended to compare linear trends with nonparametric regressions; we will do this later in the course

Page 20

Maximum Likelihood Methods for Transformations (1)

• Although the ad hoc methods for assessing nonlinearity discussed earlier are usually quite effective, there are more sophisticated statistical techniques based on maximum likelihood estimation
• Despite the underlying complexity of the statistical theory, these methods are quite simple to implement
• Basically, these methods embed the usual multiple-regression model in a more general model that contains a parameter for the transformation
• If several variables need to be transformed, several such parameters need to be included; as a result, these models are inherently nonlinear

Page 21

Maximum Likelihood Methods for Transformations (2)

• The transformation is indexed by a parameter λ (which may itself be a vector of several parameters) and estimated simultaneously with the usual regression parameters by maximizing the likelihood function L(λ, α, β1, …, βk, σε²), thus obtaining MLEs
• If λ = λ0 corresponds to no transformation, a likelihood-ratio test, Wald test, or score test of H0: λ = λ0 can assess whether a transformation is required

Page 22

Maximum Likelihood Methods:
Box-Cox transformation of Y

• The Box-Cox transformation of Y serves to normalize the error distribution, stabilize the error variance, and straighten the relationship of Y to the X's
• The general Box-Cox model is:

  Yi(λ) = α + β1Xi1 + … + βkXik + εi

  where Yi(λ) = (Yi^λ - 1)/λ for λ ≠ 0, and Yi(λ) = logeYi for λ = 0

• Note that all of the Yi must be positive. Although this precludes the possibility that they are normally distributed (because the normal distribution is unbounded), this is not usually problematic in practice unless many Y-values stack up near 0

Page 23

• A simple procedure for finding the MLE, λ-hat, is to evaluate the maximized loge L(α, β1, …, βk, σε² | λ) over a range of values of λ
  – A good range to start with is -2 to +2
  – If the maximum of the log-likelihood is not contained within this range, the range can be expanded
• To test H0: λ = 1 (i.e., that no transformation is needed) we use the likelihood-ratio statistic G² = 2[loge L(λ = λ-hat) - loge L(λ = 1)], which is asymptotically χ² with 1 degree of freedom
• When the constructed-variable (score-test) approach of the next slides is used instead, the suggested transformation is 1 minus the coefficient of the constructed variable
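The grid-evaluation procedure above can be sketched with SciPy's `boxcox_llf`; this is a hedged illustration on simulated log-normal data (for which the true power is λ = 0), not the slides' Ornstein example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Positive Y whose log is normal: the correct transformation is lambda = 0
y = np.exp(rng.normal(1.0, 0.4, 500))

# Profile log-likelihood of the Box-Cox parameter over the usual -2..+2 range
lams = np.linspace(-2, 2, 81)
ll = [stats.boxcox_llf(lam, y) for lam in lams]
lam_hat = lams[int(np.argmax(ll))]

# Likelihood-ratio statistic for H0: lambda = 1 (no transformation needed)
G2 = 2 * (max(ll) - stats.boxcox_llf(1.0, y))
p = stats.chi2.sf(G2, df=1)
print(round(lam_hat, 2), round(G2, 1), p < 0.05)
```

The grid maximizer should land near 0 here, and the likelihood-ratio test should firmly reject λ = 1, i.e. indicate that a (log) transformation is needed.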

Page 24

• An approximate score test for the Box-Cox transformation is based on the constructed variable

  Gi = Yi[loge(Yi/Ỹ) - 1], where Ỹ is the geometric mean of Y

• The augmented regression, including the constructed variable, is then:

  Yi = α + β1Xi1 + … + βkXik + φGi + εi

• A t-test of H0: φ = 0 assesses whether the transformation is needed

Page 25

Box-Cox Transformation:
Example: Ornstein data

Page 26

Box-Cox transformation:
An example using the Ornstein data

[Figure: added-variable plot of interlocks | others against box.cox.var(interlocks + 1) | others]

• The coefficient for the Box-Cox variable in the model is .69, suggesting that a transformation of Y to the power of approximately 1 - .69 = .31 is needed
• The added-variable plot (called a "constructed-variable plot") allows us to see that the choice of transformation was not influenced by only a few cases; it seems to be needed throughout most of the data

Page 27

Box-Tidwell Transformation of the X’s (1)

• Maximum likelihood can also be used to find appropriate transformations for the X's in order to linearize their relationship with Y
• Consider the model:

  Yi = α + β1Xi1^γ1 + … + βkXik^γk + εi

  where the errors are independently distributed as εi ~ N(0, σε²), and the Xij are positive
• Explicit in this model is a power transformation of each of the X's (of course, we would not want to transform all variables; e.g., it makes no sense to transform dummy variables)
• The Box-Tidwell model is a nonlinear model

Page 28

Box-Tidwell Transformation of the X’s

• The Box and Tidwell procedure yields a constructed-variable diagnostic in the following way:
  1. Regress Y on the X's to obtain A, B1, …, Bk
  2. Regress Y on the X's AND the constructed variables X1logeX1, …, XklogeXk to obtain A', B1', …, Bk' and D1, …, Dk
  3. The constructed variables are used to assess the need for a transformation of Xj by testing the null hypothesis H0: δj = 0, where δj is the population coefficient of XjlogeXj
  4. A preliminary estimate of the transformation parameter γj is given by:

     γj ≈ 1 + Dj/Bj

• This procedure is iterated until it converges on the MLE
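One pass of the constructed-variable procedure can be sketched as follows. The simulated one-predictor data, `gamma_true`, and all names are illustrative (the slides themselves use R's boxTidwell machinery):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
x = rng.uniform(1, 10, n)
gamma_true = 0.5                               # true power of x (illustrative)
y = 3 + 2 * x**gamma_true + rng.normal(0, 0.1, n)

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: regress Y on X to obtain the slope B
B = ols(np.column_stack([np.ones(n), x]), y)[1]
# Step 2: add the constructed variable X*log(X); its coefficient is D
D = ols(np.column_stack([np.ones(n), x, x * np.log(x)]), y)[2]
# Step 4: preliminary estimate of the transformation power
gamma_1 = 1 + D / B
print(round(gamma_1, 2))
```

Because the relationship is concave, D comes out negative and the preliminary estimate lands below 1, pointing down the ladder of powers; iterating (replacing x by x**gamma_1 and repeating) refines the estimate toward the MLE.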

Page 29

Box-Tidwell Transformation
Example: Prestige data

• A quadratic partial regression is included for women because we saw earlier that this was needed
• The statistically significant score tests indicate that transformations are needed for both variables
• The MLE of the power suggests that income should be transformed by a power of -.037 (essentially zero, indicating that a log of income would work well), and education by a power of 2.19 (indicating that education² would work well)

Page 30

Added-variable plots for the Box-Tidwell constructed variables (1)

Page 31

Added-Variable Plots for the Box-Tidwell constructed variables (2)

[Figures: added-variable plots of prestige | others against I(income * log(income)) | others and against I(education * log(education)) | others]

• The graphs here both provide general support for the transformations found from the Box-Tidwell model

Page 32

Summary (1)

• Nonconstant error variance also threatens efficiency and can produce biased standard errors. There are two common problems: (1) error variance that increases as Y increases; (2) a systematic relationship of the errors with an X
  – Residual plots (residuals against hat-values) allow us to visualize the pattern
  – Levene's test; score tests
  – WLS can correct the problem, but it is of little concern unless the largest error variance is at least 10 times as large as the smallest
• Non-normal errors threaten the efficiency of least-squares estimation. We can visually assess whether a transformation of Y is needed by looking at:
  – Quantile-comparison plots of the studentized residuals, for heavy tails
  – Density estimates of the errors, for multi-modality

Page 33

Summary (2)

• Nonlinearity is problematic because OLS will not adequately represent the pattern in the data; our coefficient estimates are meaningless
  – Partial-residual plots (also called component-plus-residual plots) help visualize nonlinearity in multiple regression
  – Discrete-model tests provide an intuitive way to test for linearity for a particular X
• Maximum likelihood methods provide sophisticated ways to determine required transformations:
  – Box-Cox transformation of Y: finds the best transformation to simultaneously stabilize the error variance, normalize the error distribution, and straighten the relationship between Y and the X's
  – Box-Tidwell transformation of the X's: linearizes the relationship of Y to the X's by finding transformations for the X's