
15.6 Influence Analysis

In Sections 13.5 and 14.3, you used residual analysis to evaluate the regression assumptions. This section introduces several methods that measure the influence of individual observations:

• The hat matrix elements, $h_i$
• The Studentized deleted residuals, $t_i$
• Cook's distance statistic, $D_i$

Figure 15.16 presents the values of these statistics computed by Minitab for the OmniPower sales data.

FIGURE 15.16 Minitab worksheet containing computed values for the Studentized deleted residuals, the hat matrix elements, and Cook's distance statistics for the OmniPower sales data


The Hat Matrix Elements, $h_i$

In Section 13.8, $h_i$ was defined for the simple linear regression model when constructing the confidence interval estimate of the mean response. For multiple regression models, the equation for calculating the hat matrix diagonal elements, $h_i$, requires the use of matrix algebra and is beyond the scope of this text (see references 4, 5, and 7).

The hat matrix diagonal element for observation $i$, denoted $h_i$, reflects the possible influence of $X_i$ on the regression equation. If potentially influential observations are present, you may need to delete them from the model. In a regression model containing $k$ independent variables, Hoaglin and Welsch (see reference 5) suggest the following decision rule:

If $h_i > 2(k + 1)/n$, then $X_i$ is an influential observation and is a candidate for removal from the model.

For the OmniPower sales data, because $n = 34$ and $k = 2$, you flag any $h_i$ value greater than $2(2 + 1)/34 = 0.1765$. Referring to Figure 15.16, you see that none of the $h_i$ values are greater than 0.1765. Therefore, none of the observations are candidates for removal from the analysis.
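To make the rule concrete, here is a minimal Python sketch of the $h_i$ screen using statsmodels. Because the OmniPower worksheet is not bundled with this text, the data frame below is a synthetic stand-in with the same layout (Sales, Price, Promotion; $n = 34$); with the real data, the cutoff and flags would match Figure 15.16.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the OmniPower worksheet (same layout, n = 34);
# the real data would be loaded from the OmniPower file instead.
rng = np.random.default_rng(0)
n, k = 34, 2
data = pd.DataFrame({
    "Price": rng.uniform(50, 99, n),
    "Promotion": rng.uniform(200, 600, n),
})
data["Sales"] = (5000 - 50 * data["Price"]
                 + 4 * data["Promotion"] + rng.normal(0, 300, n))

# Fit Sales on Price and Promotion, as in Figure 15.16
X = sm.add_constant(data[["Price", "Promotion"]])
results = sm.OLS(data["Sales"], X).fit()

# Hat matrix diagonal elements (leverages), one h_i per observation
h = results.get_influence().hat_matrix_diag

# Hoaglin-Welsch rule: flag h_i > 2(k + 1)/n, which is 2(2 + 1)/34 = 0.1765 here
cutoff = 2 * (k + 1) / n
print(f"cutoff = {cutoff:.4f}, flagged rows: {np.flatnonzero(h > cutoff)}")
```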

The Studentized Deleted Residuals, $t_i$

Recall from Section 13.5 that a residual is the difference between the observed value of $Y$ and the predicted value of $Y$ [see Equation (13.14) on page 539]. Studentized residuals are the residuals divided by the standard error of the estimate $S_{YX}$ and adjusted for the distance from $\bar{X}$.

The Studentized deleted residual $t_i$, expressed as a $t$ statistic in Equation (15.10), measures the difference of each $Y_i$ from the value predicted by a model that includes all observations except observation $i$.

STUDENTIZED DELETED RESIDUAL

$$t_i = e_i \sqrt{\frac{n - k - 2}{SSE(1 - h_i) - e_i^2}} \tag{15.10}$$

where
$e_i$ = residual for observation $i$
$k$ = number of independent variables
$SSE$ = error sum of squares of the regression model fitted
$h_i$ = hat matrix diagonal element for observation $i$
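As a check on Equation (15.10), the sketch below computes $t_i$ directly from the residuals, leverages, and SSE of the fitted model (the `results`, `n`, and `k` objects from the leverage sketch above) and confirms that it agrees with the externally Studentized residuals that statsmodels produces.

```python
import numpy as np
from scipy import stats

# Residuals e_i, leverages h_i, and SSE from the fitted model above
infl = results.get_influence()
e = results.resid.to_numpy()
h = infl.hat_matrix_diag
sse = results.ssr                      # error sum of squares

# Equation (15.10): t_i = e_i * sqrt((n - k - 2) / (SSE(1 - h_i) - e_i^2))
t = e * np.sqrt((n - k - 2) / (sse * (1 - h) - e**2))

# statsmodels' externally Studentized residuals agree with Equation (15.10)
assert np.allclose(t, infl.resid_studentized_external)

# Flag |t_i| > t_{alpha/2} at alpha = 0.10 with n - k - 2 degrees of freedom
t_crit = stats.t.ppf(1 - 0.10 / 2, df=n - k - 2)   # 1.6973 for n = 34, k = 2
print(f"t cutoff = {t_crit:.4f}, flagged rows: {np.flatnonzero(np.abs(t) > t_crit)}")
```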

Hoaglin and Welsch (see reference 5) suggest that if $t_i < -t_{\alpha/2}$ or $t_i > t_{\alpha/2}$ (using a level of significance of 0.10), the observed and predicted values are so different that observation $i$ is highly influential on the regression equation and is a candidate for removal.

For the OmniPower sales data, $n = 34$ and $k = 2$. Thus, you flag any $t_i$ whose absolute value is greater than 1.6973 (see Table E.3). In Figure 15.16, $t_{14} = -3.08402$, $t_{15} = 2.20612$, and $t_{20} = 2.27527$ are highlighted. Thus, the 14th, 15th, and 20th observations may each have an adverse effect on the model. These observations were not previously flagged according to the $h_i$ criterion. Since $h_i$ and $t_i$ measure different aspects of influence, neither criterion is sufficient by itself. When $h_i$ is small, $t_i$ may be large. When $h_i$ is large, $t_i$ may be moderate or small because the observed $Y_i$ is consistent with the rest of the data.

Cook's Distance Statistic, $D_i$

Cook's distance statistic, $D_i$, based on both $h_i$ and the Studentized residual, is a third criterion for identifying influential observations. To decide whether an observation flagged by either the $h_i$ or $t_i$ criterion is unduly affecting the model, Cook and Weisberg (see reference 4) developed the $D_i$ statistic.


COOK'S $D_i$ STATISTIC

$$D_i = \frac{e_i^2}{k\,MSE}\left[\frac{h_i}{(1 - h_i)^2}\right] \tag{15.11}$$

where
$e_i$ = residual for observation $i$
$k$ = number of independent variables
$MSE$ = mean square error of the regression model fitted
$h_i$ = hat matrix diagonal element for observation $i$

Cook and Weisberg suggest that if $D_i > F_\alpha$ (the critical value of the $F$ distribution having $k + 1$ degrees of freedom in the numerator and $n - k - 1$ degrees of freedom in the denominator at a 0.50 level of significance), the observation is highly influential on the regression equation and is a candidate for removal.

Table 15.4 shows selected critical values for Cook's $D_i$ statistic.

TABLE 15.4 Selected Critical Values of F for Cook's $D_i$ Statistic ($\alpha = 0.50$)

Denominator        Numerator df = k + 1
df = n - k - 1     2     3     4     5     6     7     8     9    10    12    15    20
10              .743  .845  .899  .932  .954  .971  .983  .992  1.00  1.01  1.02  1.03
11              .739  .840  .893  .926  .948  .964  .977  .986  .994  1.01  1.02  1.03
12              .735  .835  .888  .921  .943  .959  .972  .981  .989  1.00  1.01  1.02
15              .726  .826  .878  .911  .933  .949  .960  .970  .977  .989  1.00  1.01
20              .718  .816  .868  .900  .922  .938  .950  .959  .966  .977  .989  1.00
24              .714  .812  .863  .895  .917  .932  .944  .953  .961  .972  .983  .994
30              .709  .807  .858  .890  .912  .927  .939  .948  .955  .966  .978  .989
40              .705  .802  .854  .885  .907  .922  .934  .943  .950  .961  .972  .983
60              .701  .798  .849  .880  .901  .917  .928  .937  .945  .956  .967  .978
120             .697  .793  .844  .875  .896  .912  .923  .932  .939  .950  .961  .972
∞               .693  .789  .839  .870  .891  .907  .918  .927  .934  .945  .956  .967

Source: Extracted from E. S. Pearson and H. O. Hartley, eds., Biometrika Tables for Statisticians, 3rd ed., 1966, by permission of the Biometrika Trustees.
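Because the entries in Table 15.4 are simply 0.50-level critical values of the $F$ distribution, they need not be read from a printed table. A one-line scipy sketch (illustrative, not part of the original text) reproduces any entry:

```python
from scipy import stats

# For example, numerator df = 3 and denominator df = 30 gives the .807
# table entry used for the OmniPower data below
print(round(stats.f.ppf(0.50, dfn=3, dfd=30), 3))   # 0.807
```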

For the OmniPower sales data, since $n = 34$ and $k = 2$, there are 3 degrees of freedom in the numerator and 31 degrees of freedom in the denominator. Thus, any $D_i > F_\alpha = 0.807$ is flagged. Referring to Figure 15.16, you see that none of the $D_i$ values exceed 0.187, well below this cutoff, and therefore no observations are identified as influential using the Cook's $D_i$ statistic.
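A minimal sketch of this screen, continuing from the fitted model above, computes Equation (15.11) directly. Note that statsmodels' built-in `cooks_distance` scales by the number of parameters ($k + 1$) rather than $k$, so its values differ slightly from Equation (15.11) as printed here.

```python
import numpy as np
from scipy import stats

# Equation (15.11) computed directly from the fitted model above
infl = results.get_influence()
e = results.resid.to_numpy()
h = infl.hat_matrix_diag
mse = results.mse_resid                # mean square error, SSE/(n - k - 1)

d = (e**2 / (k * mse)) * (h / (1 - h)**2)

# Flag D_i > F_{0.50} with k + 1 numerator and n - k - 1 denominator df
f_crit = stats.f.ppf(0.50, dfn=k + 1, dfd=n - k - 1)   # about 0.807 here
print(f"F cutoff = {f_crit:.3f}, flagged rows: {np.flatnonzero(d > f_crit)}")
```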

Overview

This section discussed three criteria for evaluating the influence of each observation on the multiple regression model. The various statistics did not lead to a consistent set of conclusions. According to both the $h_i$ and the $D_i$ criteria, none of the observations is a candidate for removal. Under such circumstances, most statisticians would conclude that there is insufficient evidence for the removal of any observation from the analysis.

In addition to the three criteria presented here, there are other measures of influence (see references 1 and 6). Although different statisticians seem to prefer particular measures, currently there is no consensus as to the “best” measure.
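Finally, a sketch that mirrors the Figure 15.16 worksheet by collecting all three statistics and their flags in one table, continuing from the sketches above; the Minitab-style storage column names (TRES1, HI1, COOK1) are used for illustration.

```python
import pandas as pd
from scipy import stats

# Gather the three influence statistics into one worksheet-style table
infl = results.get_influence()
worksheet = pd.DataFrame({
    "TRES1": infl.resid_studentized_external,   # Studentized deleted residuals t_i
    "HI1": infl.hat_matrix_diag,                # hat matrix elements h_i
    "COOK1": infl.cooks_distance[0],            # Cook's distance (statsmodels scaling)
})

# Apply the three screening rules side by side
worksheet["flag_h"] = worksheet["HI1"] > 2 * (k + 1) / n
worksheet["flag_t"] = worksheet["TRES1"].abs() > stats.t.ppf(0.95, df=n - k - 2)
worksheet["flag_D"] = worksheet["COOK1"] > stats.f.ppf(0.50, dfn=k + 1, dfd=n - k - 1)
print(worksheet.round(4))
```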


Problems for Section 15.6

APPLYING THE CONCEPTS

15.42 In Problem 14.4 on page 583, you used sales and number of orders to predict distribution costs at a mail-order catalog business (stored in Warecost). Perform an influence analysis on your results and determine whether any observations should be deleted from the analysis. If necessary, reanalyze the regression model after deleting these observations and compare your results.

15.43 In Problem 14.5 on page 583, you used horsepower and weight to predict gasoline mileage (stored in Auto2010). Perform an influence analysis on your results and determine whether any observations should be deleted from the analysis. If necessary, reanalyze the regression model after deleting these observations and compare your results.

15.44 In Problem 14.6 on page 583, you used the amount of radio advertising and newspaper advertising to predict sales (stored in Advertise). Perform an influence analysis on your results and determine whether any observations should be deleted from the analysis. If necessary, reanalyze the regression model after deleting these observations and compare your results.

15.45 In Problem 14.7 on page 584, you used the total staff present and remote hours to predict standby hours (stored in Standby). Perform an influence analysis on your results and determine whether any observations should be deleted from the analysis. If necessary, reanalyze the regression model after deleting these observations and compare your results.

15.46 In Problem 14.8 on page 584, you used the land area of the property and age in years to predict appraised value (stored in GlenCove). Perform an influence analysis on your results and determine whether any observations should be deleted from the analysis. If necessary, reanalyze the regression model after deleting these observations and compare your results.

REFERENCES

1. Andrews, D. F., and D. Pregibon, “Finding the Outliers That Matter,” Journal of the Royal Statistical Society 40 (Ser. B, 1978): 85–93.
2. Atkinson, A. C., “Robust and Diagnostic Regression Analysis,” Communications in Statistics 11 (1982): 2559–2572.
3. Belsley, D. A., E. Kuh, and R. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (New York: Wiley, 1980).
4. Cook, R. D., and S. Weisberg, Residuals and Influence in Regression (New York: Chapman and Hall, 1982).
5. Hoaglin, D. C., and R. Welsch, “The Hat Matrix in Regression and ANOVA,” The American Statistician 32 (1978): 17–22.
6. Hocking, R. R., “Developments in Linear Regression Methodology: 1959–1982,” Technometrics 25 (1983): 219–250.
7. Kutner, M., C. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models, 5th ed. (New York: McGraw-Hill/Irwin, 2005).
8. Minitab Release 16 (State College, PA: Minitab, Inc., 2010).


MG15.6 MINITAB GUIDE FOR INFLUENCE ANALYSIS

Use Regression to perform influence analysis. Use the “Interpreting the Regression Coefficients” instructions in Section MG14.1 (repeated below), replacing step 19 of those instructions with steps 19 through 22 listed below.

For example, to perform the Figure 15.16 analysis of the OmniPower sales data, open to the OmniPower worksheet. Select Stat ➔ Regression ➔ Regression. In the Regression dialog box:

1. Double-click C1 Sales in the variables list to add Sales to the Response box.
2. Double-click C2 Price in the variables list to add Price to the Predictors box.
3. Double-click C3 Promotion in the variables list to add Promotion to the Predictors box.
4. Click Graphs.

In the Regression - Graphs dialog box:

5. Click Regular and Individual Plots.
6. Check Histogram of residuals and clear all the other check boxes.
7. Click anywhere inside the Residuals versus the variables box.
8. Double-click C2 Price in the variables list to add Price in the Residuals versus the variables box.
9. Double-click C3 Promotion in the variables list to add Promotion in the Residuals versus the variables box.
10. Click OK.
11. Back in the Regression dialog box, click Results.

In the Regression - Results dialog box:

12. Click In addition, the full table of fits and residuals and then click OK.
13. Back in the Regression dialog box, click Options.

In the Regression - Options dialog box:

14. Check Fit Intercept.
15. Clear all the Display and Lack of Fit Test check boxes.
16. Enter 79 and 400 in the Prediction intervals for new observations box.
17. Enter 95 in the Confidence level box.
18. Click OK.
19. Back in the Regression dialog box, click Storage.

In the Regression - Storage dialog box:

20. Check Deleted t residuals, Hi (leverages), and Cook’s distance.
21. Click OK.
22. Back in the Regression dialog box, click OK.

EG15.6 EXCEL GUIDE FOR INFLUENCE ANALYSIS

There are no Excel Guide instructions for this section.

M15_BERE8380_12_SE_C15.6.qxd 2/21/11 8:21 PM Page 5