Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Upload: beverly-mccoy

Post on 12-Jan-2016


Page 1: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Chapter 3: Diagnostics and Remedial Measures

Ayona Chatterjee

Spring 2008

Math 4813/5813

Page 2: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Validity of a regression model

• Any one of the following features of the model may not be appropriate:

– Linearity.

– Normality of error terms.

• Important to examine the aptness of a model before making inferences.

• Consider diagnostic tools to justify the appropriateness of a model.

• Suggest remedial techniques to fix deviations.

Page 3: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Let's recall: A Dot Plot

• A dotplot displays a dot for each observation along a number line. If there are multiple occurrences of an observation, or if observations are too close together, then dots will be stacked vertically. If there are too many points to fit vertically in the graph, then each dot may represent more than one point.

Page 4: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Stem and Leaf Diagram

• In a stem-and-leaf plot each data value is split into a "stem" and a "leaf".  The "leaf" is usually the last digit of the number and the other digits to the left of the "leaf" form the "stem". 
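The stem/leaf split described above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the sample values are hypothetical, and the sketch assumes non-negative whole numbers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: leaves}, using the last digit of each whole number as the leaf."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)  # e.g. 45 -> stem 4, leaf 5
        groups[stem].append(str(leaf))
    # Keep empty stems so that gaps in the data remain visible
    return {s: "".join(groups.get(s, [])) for s in range(min(groups), max(groups) + 1)}

# Hypothetical sample data, not taken from the slides
sample = [20, 30, 30, 45, 40, 62, 68, 85]
display = stem_and_leaf(sample)
for stem, leaves in display.items():
    print(f"{stem:>3} | {leaves}")
```

Empty stems are printed on purpose, since a gap between stems is itself a feature of the distribution.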

Page 5: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Box Plot

• Uses the minimum, the maximum and the quartiles to plot the data.

• Can draw conclusions about symmetry and outliers.
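As a rough sketch, the five numbers behind a box plot can be computed with Python's standard library. The helper names and the sample data are hypothetical, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here; it is not stated in the slides. Quartile definitions also differ slightly between software packages.

```python
import statistics

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def flag_outliers(data, k=1.5):
    """Points more than k * IQR beyond the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]

# Hypothetical sample data with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = five_number_summary(sample)
outliers = flag_outliers(sample)
print(summary, outliers)
```

A median that sits far from the middle of the box, or one whisker much longer than the other, suggests asymmetry.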

Page 6: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Time Series Plot

• Also called sequence plot.

• Used when data are collected in series over time.

• Used to draw inference about patterns with time.

• Seasonal or weekly effects.

Page 7: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Diagnostics for Predictor Variable

• Let us look at the Toluca Company example given in chapter 1.

• The predictor variable X was the lot size.

• A dot plot, time series plot, stem and leaf plot and box plot for the data were obtained.

Page 8: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

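[Figure: dot plot of Lot Size; time series plot of Lot Size against Run; box plot of Lot Size. The Lot Size axis runs from 20 to 120 and the Run axis from 5 to 20.]

Stem-and-leaf display of Lot Size (columns: cumulative count, stem, leaves):

1 2 0

4 3 000

6 4 05

8 5 00

11 6 000

(3) 7 555

10 8 00005

5 9

5 10 00

3 11 00

1 12 5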

Page 9: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Residuals

• A residual is the difference between the observed response and the predicted (fitted) response: $e_i = Y_i - \hat{Y}_i$.

• For the normal error regression model, we assume that the error term is normally distributed.

• If the model is appropriate for the data, this should be reflected in the residuals.

Page 10: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Departures from Model to be Studied by Residuals

1. The regression function is not linear.

2. The error terms do not have constant variance.

3. The error terms are not independent.

4. The model fits all but one or a few outliers.

5. The error terms are not normally distributed.

6. One or several important predictor(s) have been omitted from the model.

Page 11: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Diagnostics for Residuals

• Seven diagnostic plots to judge departure from the simple linear regression model:

– Plot of residuals against predictor variable.

• The plot should show a random scatter of points.

– Plot of absolute or squared residuals against X.

– Plot of residuals against the fitted values.

Page 12: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Diagnostics for Residuals

– Plot of residuals against time or other sequence.

• Should not display any trends.

– Plots of residuals against omitted predictor variables.

– Box plot of residuals.

– Normal probability plot of residuals.

• Should lie along a straight line.

Page 13: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Good Looking Plots

[Figure: example plots; horizontal axis labeled Predictor]

Page 14: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Nonlinearity of Regression Function

• An example studying the relation between maps distributed and bus ridership in eight cities. Here X is the number of bus transit maps distributed free to residents at the beginning of the test period, and Y is the increase during the test period in average daily bus ridership during nonpeak hours.

X Y

80 0.60

220 6.70

140 5.30

120 4.00

180 6.55

100 2.15

200 6.60

160 5.75

Page 15: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Plots

• Here a linear function appears to give a decent fit to the data set introduced in the previous slide. The regression equation obtained is

• $\hat{Y} = -1.82 + 0.0435 X$

[Figure: scatter plot of Y against X]
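The fitted equation for the bus ridership data can be checked with a short least squares sketch. The data are the eight (X, Y) pairs listed on the earlier slide; the function name `fit_line` and the variable names are hypothetical choices for this illustration.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line Y = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Bus transit data from the slides: maps distributed (X), ridership increase (Y)
maps = [80, 220, 140, 120, 180, 100, 200, 160]
ridership = [0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75]

b0, b1 = fit_line(maps, ridership)
print(f"Y-hat = {b0:.2f} + {b1:.4f} X")
```

Rounded to the precision shown on the slide, the estimates agree with the equation given above.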

Page 16: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Residual Plot

[Figure: residuals (RESI1) plotted against X]

• Here the departure from linearity is more visible, as the residuals depart from 0 in a systematic manner.

• The plot of residuals against the predictor is the preferred plot for judging linearity.
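A small sketch makes the systematic pattern concrete for the bus ridership data: after fitting the line, the residuals are listed in order of increasing X and only their signs are printed. The helper and variable names are hypothetical; the data are the eight pairs from the slides.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line Y = b0 + b1 * X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

maps = [80, 220, 140, 120, 180, 100, 200, 160]
ridership = [0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75]
b0, b1 = fit_line(maps, ridership)

# Residuals e = Y - Y-hat, ordered by increasing X
pairs = sorted(zip(maps, ridership))
residuals = [y - (b0 + b1 * x) for x, y in pairs]
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)
```

Residuals that are negative at both ends of the X range and positive in the middle are the kind of systematic departure from 0 that points to a curved regression function.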

Page 17: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Nonconstancy of error variance

• Here we have a plot of residuals against age from a study of the relation between blood pressure of adult women and their age. As age increases, the spread of the residuals increases. In many business, social science and biological science applications, departures from constancy of the error variance tend to take this “megaphone” form.

Page 18: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Nonconstancy of error variance

• The two other types of departure from constant error variance are when we have a curvilinear regression function or the error variance increases over time.

Page 19: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Presence of outliers

• Outliers are extreme observations and can be identified from box plots or dot plots.

• Another option is a scatter plot of the semi-studentized residuals:

$e_i^* = e_i / \sqrt{MSE}$

• A rough rule of thumb when the number of observations is large is to consider semi-studentized residuals with an absolute value of 4 or more as outliers.
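As an illustration, semi-studentized residuals (each residual divided by the square root of MSE) can be computed as below. The sketch reuses the bus ridership data from the earlier slides and assumes MSE = SSE / (n - 2) for simple linear regression; the function names are hypothetical. With only eight observations the rule of thumb for large samples is applied purely for demonstration.

```python
import math

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line Y = b0 + b1 * X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def semi_studentized(x, y):
    """Residuals divided by sqrt(MSE), with MSE = SSE / (n - 2)."""
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (len(x) - 2)
    return [e / math.sqrt(mse) for e in residuals]

maps = [80, 220, 140, 120, 180, 100, 200, 160]
ridership = [0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75]

e_star = semi_studentized(maps, ridership)
flagged = [(x, round(e, 2)) for x, e in zip(maps, e_star) if abs(e) >= 4]
print(max(abs(e) for e in e_star), flagged)
```

No observation in this data set comes close to the threshold of 4, so none is flagged.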

Page 20: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Example

Here we can see that the scatter plot appears to have one outlier and this is pulling the regression line upwards. Thus in the residual plot we have so many observations in the lower half of the plot.

Removing the outlier leads to a more uniformly linear scatter plot and better regression estimates.

Page 21: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Nonindependence of Error Terms

• For time series data it is advised to plot residuals against time order.

• This checks whether consecutive observations are independent of each other.

Page 22: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Nonnormality of error terms

• Large departures from normality are of concern.

• A normal probability plot of the residuals is one way to judge normality.
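One way to build a normal probability plot by hand is to pair each ordered residual with its expected value under normality. The sketch below assumes the plotting positions (k - 0.375) / (n + 0.25), a commonly used approximation that is not stated in the slides, and uses illustrative residual values. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def expected_normal_values(residuals, n_params=2):
    """Expected ordered residuals under normality: sqrt(MSE) * z[(k - 0.375) / (n + 0.25)]."""
    n = len(residuals)
    mse = sum(e * e for e in residuals) / (n - n_params)
    z = NormalDist()  # standard normal
    return [math.sqrt(mse) * z.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

def correlation(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Illustrative residuals (hypothetical values)
residuals = [-1.06, -0.38, 0.60, 1.03, 0.61, 0.54, -0.28, -1.05]
ordered = sorted(residuals)
expected = expected_normal_values(residuals)
r = correlation(ordered, expected)
print(r)
```

Plotting `ordered` against `expected` gives the normal probability plot; a correlation close to 1 corresponds to points lying near a straight line.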

Page 23: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Omission of Important Predictor Variables

• Residuals should be plotted against variables omitted from the model that may have important effects on the response.

• Example: a study of output Y against the age of workers X.

Page 24: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Example

• Let's work on the GPA data set.

– Plot a box plot for the ACT scores. Are there any noteworthy features in the plot?

– Prepare a dot plot of the residuals. What information does this plot provide?

– Plot the residuals against the fitted values. What departure from the regression model can be studied from this plot? What are your findings?

– Prepare a normal probability plot of the residuals and comment on it.

Page 25: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Overview of Remedial Measures

• If the linear regression model is not appropriate for your data set:

– Abandon the regression model and develop a new model.

– Employ some transformation on the data so that the regression model is appropriate for the transformed data.

Page 26: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Nonlinearity of Regression Function

• If the relation between X and Y is not linear, the following relations can be investigated:

– Quadratic regression function: $E\{Y\} = \beta_0 + \beta_1 X + \beta_2 X^2$

– Exponential regression function: $E\{Y\} = \beta_0 \beta_1^X$

Page 27: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Transformations for Nonlinear Relation

• To achieve linearity one can transform X or Y or both.

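• When the error terms are approximately normally distributed with reasonably constant variance, we transform X rather than Y, since a transformation on Y would change the shape of the error distribution.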

• The following slide has some suggested transformations.

Page 28: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Prototype Transformation of X

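[Figure: three prototype regression patterns, each with suggested transformations of X]

– $X' = \log_{10} X$ or $X' = \sqrt{X}$

– $X' = X^2$ or $X' = \exp(X)$

– $X' = 1/X$ or $X' = \exp(-X)$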

Page 29: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Example

• Data from an experiment on the effect of the number of days of training received X on performance Y in a battery of simulated sales situations are presented.

X Y

0.5 42.5

0.5 50.6

1 68.5

1 80.7

1.5 89.0

1.5 99.6

2 105.3

2 111.8

2.5 112.3

2.5 125.7

Page 30: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Need to transform the data

[Figure: scatter plot of Y against X]

Page 31: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Square root transformed X'

[Figure: scatter plot of Y against X', where $X' = \sqrt{X}$]
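The effect of the square root transformation on the sales training data can be sketched as follows. The ten (X, Y) pairs are the ones listed on the earlier slide; the helper names are hypothetical, and comparing R-squared before and after the transformation is just one convenient summary, not necessarily the check used in the course.

```python
import math

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line Y = b0 + b1 * X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def r_squared(x, y):
    """Coefficient of determination, 1 - SSE / SSTO, for the fitted line."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / ssto

# Sales training data from the slides: days of training (X), performance (Y)
days = [0.5, 0.5, 1, 1, 1.5, 1.5, 2, 2, 2.5, 2.5]
score = [42.5, 50.6, 68.5, 80.7, 89.0, 99.6, 105.3, 111.8, 112.3, 125.7]

days_sqrt = [math.sqrt(d) for d in days]  # X' = sqrt(X)
b0_sqrt, b1_sqrt = fit_line(days_sqrt, score)
r2_raw = r_squared(days, score)
r2_sqrt = r_squared(days_sqrt, score)
print(f"Y-hat = {b0_sqrt:.2f} + {b1_sqrt:.2f} X'")
print(f"R^2 raw = {r2_raw:.4f}, R^2 transformed = {r2_sqrt:.4f}")
```

Only X is transformed here, so the distribution of the responses, and hence of the error terms, is left unchanged.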

Page 32: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Results

[Figure: normal probability plot of the residuals against quantiles of the standard normal]

Page 33: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

[Figure: residuals plotted against X']

Page 34: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Transformation for Non-normality and Unequal Error variances

• Unequal error variances and non-normality often occur together.

• To fix this we shall transform Y, since we need to change the shape and spread of the distribution for Y.

• A simultaneous transformation on X may also be needed.

Page 35: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Prototype regression patterns

Transformations on Y

– $Y' = \sqrt{Y}$

– $Y' = \log_{10} Y$

– $Y' = 1/Y$

Page 36: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Example: Plasma Levels

• Using the data on plasma levels:

– Draw a scatter plot of plasma level against age and comment on it.

– Suggest a suitable transformation.

– Verify the validity of the transformation.

Page 37: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

[Figure: Scatterplot of Plasma level vs Age]

Page 38: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

[Figure: Scatterplot of Log10 Y vs Age]

Page 39: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

[Figure: Normal Probability Plot (response is Log10 Y), Percent against Residual]

[Figure: Scatterplot of RESI1 vs Age]

These plots support the appropriateness of the linear regression model for the transformed data.

Page 40: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Box-Cox transformation

• The Box-Cox procedure automatically identifies a transformation from the family of power transformations on Y.

• The family of power transformations is of the form:

$Y' = Y^{\lambda}$

– Here λ is a parameter to be determined from the data.
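For orientation, a few members of the power family $Y' = Y^{\lambda}$ are written out below. The case λ = 0 is a convention: the family is defined there as the natural logarithm. The particular λ values listed are chosen only for illustration.

```latex
\begin{align*}
\lambda = 2    &: \quad Y' = Y^{2} \\
\lambda = 0.5  &: \quad Y' = \sqrt{Y} \\
\lambda = 0    &: \quad Y' = \log_e Y \quad \text{(by definition)} \\
\lambda = -0.5 &: \quad Y' = \frac{1}{\sqrt{Y}} \\
\lambda = -1   &: \quad Y' = \frac{1}{Y}
\end{align*}
```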

Page 41: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

The new regression model

• The normal error regression model with the response variable a member of the family of power transformations described in the previous slide is:

$Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \varepsilon_i$

• Along with the regression coefficients we now need to estimate λ. In most cases the maximum likelihood estimate of λ is obtained by conducting a numerical search over a potential range of values for λ.

Page 42: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Calculations for λ.

• We standardize the responses so that the error magnitude does not depend on λ.

• Once the standardized observations Wi have been obtained for a given λ value, they are regressed on the predictor variable X.

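$$W_i = \begin{cases} K_1 (Y_i^{\lambda} - 1), & \lambda \neq 0 \\ K_2 (\log_e Y_i), & \lambda = 0 \end{cases}$$

where

$$K_2 = \left( \prod_{i=1}^{n} Y_i \right)^{1/n}, \qquad K_1 = \frac{1}{\lambda K_2^{\lambda - 1}}$$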

Page 43: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Example: Sales growth

• A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data are as follows, where X is the year (coded) and Y is sales in thousands of units. Answer the following questions.

Page 44: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

• Prepare a scatter plot of the data. Does a linear relation appear adequate?

• Use the Box-Cox procedure and standardization to find an appropriate power transformation of Y. Evaluate SSE for λ = 0.3, 0.4, 0.5, 0.6, 0.7. What transformation of Y is suggested?

Page 45: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

X Y

0 98

1 135

2 162

3 178

4 221

5 232

6 283

7 300

8 374

9 395

[Figure: Scatterplot of Y vs X]

Page 46: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

Box Cox calculations

lambda 0.3

K2 218.2605

K1 144.5977
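The constants above, and the SSE search over λ, can be reproduced with a short sketch. It uses the sales growth data from the slides and the standardization W = K1 (Y^λ - 1), with K2 the geometric mean of the Y values and K1 = 1 / (λ K2^(λ - 1)). The function and variable names are hypothetical, and a coarse grid search is only one way to carry out the numerical search.

```python
import math

def box_cox_standardize(y, lam):
    """Return (W, K1, K2) for the standardized Box-Cox responses at a given lambda."""
    n = len(y)
    k2 = math.exp(sum(math.log(v) for v in y) / n)  # geometric mean of Y
    if lam == 0:
        return [k2 * math.log(v) for v in y], None, k2
    k1 = 1.0 / (lam * k2 ** (lam - 1))
    return [k1 * (v ** lam - 1) for v in y], k1, k2

def sse_of_fit(x, w):
    """Error sum of squares from regressing w on x by least squares."""
    n = len(x)
    x_bar, w_bar = sum(x) / n, sum(w) / n
    sxw = sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxw / sxx
    b0 = w_bar - b1 * x_bar
    return sum((wi - (b0 + b1 * xi)) ** 2 for xi, wi in zip(x, w))

# Sales growth data from the slides: coded year (X), sales in thousands of units (Y)
year = list(range(10))
sales = [98, 135, 162, 178, 221, 232, 283, 300, 374, 395]

_, k1, k2 = box_cox_standardize(sales, 0.3)
print(f"lambda 0.3: K2 = {k2:.4f}, K1 = {k1:.4f}")

sse_by_lambda = {}
for lam in (0.3, 0.4, 0.5, 0.6, 0.7):
    w, _, _ = box_cox_standardize(sales, lam)
    sse_by_lambda[lam] = sse_of_fit(year, w)
best_lambda = min(sse_by_lambda, key=sse_by_lambda.get)
print(sse_by_lambda, best_lambda)
```

Because the responses are standardized, the SSE values for different λ are on a comparable scale, and the λ with the smallest SSE on the grid is the suggested power.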

Page 47: Chapter 3: Diagnostics and Remedial Measures Ayona Chatterjee Spring 2008 Math 4813/5813

• Thus the regression equation using a square-root transformation on Y will give

[Figure: Scatterplot of RESI1 vs X]