influential observations in regression

Influential Observations in Regression

Measurements on Heat Production as a Function of Body Mass and Work Effort

M. Greenwood (1918). “On the Efficiency of Muscular Work,” Proc. Roy. Soc. of London, Series B, Vol. 90, #627, pp. 199-214

Data Description

• Study involved Algerians accustomed to heavy labor. Experiment consisted of several hours on stationary bicycle.

• Dependent (Response) Variable:
  – Heat Production (Calories)

• Independent (Explanatory/Predictor) Variables:
  – Work Effort (Calories)
  – Body Mass (kg)

• Model:
  – $H = \beta_0 + \beta_W W + \beta_M M + \varepsilon$

Raw Data (Table III, p.203)

M      W      H         M      W      H
76.2   156.8  3398      64.8   137.5  3020
71.3   114.1  2988      60.2   129.7  2812
69.6   142.6  3048      72.4    97.1  2962
58.0   142.6  2781      68.9   129.4  3236
74.6   142.6  2912      70.1   129.4  3214
68.9   128.3  3135      70.8   161.7  3389
69.1   142.6  3261      66.5   129.4  2908
62.1   156.8  3030      66.7   137.5  3063
68.7   128.3  3139      71.2   129.4  2956
65.4   142.6  2996      72.4   145.6  3023
70.4   128.3  3248      69.3   129.4  3001
69.1   142.6  3117      67.4   145.6  2841
63.7   142.6  2891      69.6   161.7  3117
62.1   114.1  2667      66.2   121.3  2733
73.5   142.6  3403      74.5   121.3  2808
61.3   129.4  2999      67.7    97.1  2813
70.1   137.5  3318      57.5    97.1  2615
79.8   121.3  2989      70.4   113.2  2814
61.3   129.4  3936

(Left panel: observations 1-19; right panel: observations 20-37.)

Estimated Regression Coefficients

            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   1536.5098     584.4994        2.6288  0.0128   348.6642   2724.3554
W              6.1563       2.3659        2.6022  0.0136     1.3483     10.9643
M             10.1409       7.6826        1.3200  0.1957    -5.4720     25.7538

$\hat{H} = 1536.51 + 6.16\,W + 10.14\,M$

Note that we can conclude, controlling for the other factor:

As Work Effort increases, Heat Production increases (p = .0136)

There is no significant evidence that Heat Production increases as Body Mass increases (p = .1957)
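The coefficient table above can be reproduced from the raw data with ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the variable names (`DATA`, `X`, `beta`, `se`) are illustrative and do not come from the slides.

```python
import numpy as np

# (M, W, H) for observations 1..37, transcribed from the raw data table.
DATA = [
    (76.2, 156.8, 3398), (71.3, 114.1, 2988), (69.6, 142.6, 3048), (58.0, 142.6, 2781),
    (74.6, 142.6, 2912), (68.9, 128.3, 3135), (69.1, 142.6, 3261), (62.1, 156.8, 3030),
    (68.7, 128.3, 3139), (65.4, 142.6, 2996), (70.4, 128.3, 3248), (69.1, 142.6, 3117),
    (63.7, 142.6, 2891), (62.1, 114.1, 2667), (73.5, 142.6, 3403), (61.3, 129.4, 2999),
    (70.1, 137.5, 3318), (79.8, 121.3, 2989), (61.3, 129.4, 3936), (64.8, 137.5, 3020),
    (60.2, 129.7, 2812), (72.4, 97.1, 2962), (68.9, 129.4, 3236), (70.1, 129.4, 3214),
    (70.8, 161.7, 3389), (66.5, 129.4, 2908), (66.7, 137.5, 3063), (71.2, 129.4, 2956),
    (72.4, 145.6, 3023), (69.3, 129.4, 3001), (67.4, 145.6, 2841), (69.6, 161.7, 3117),
    (66.2, 121.3, 2733), (74.5, 121.3, 2808), (67.7, 97.1, 2813), (57.5, 97.1, 2615),
    (70.4, 113.2, 2814),
]

M, W, H = np.array(DATA, dtype=float).T
X = np.column_stack([np.ones_like(M), W, M])  # columns: intercept, W, M
n, p = X.shape

# Least squares estimates and their standard errors.
beta, *_ = np.linalg.lstsq(X, H, rcond=None)
resid = H - X @ beta
mse = resid @ resid / (n - p)
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))

for name, b, s in zip(["Intercept", "W", "M"], beta, se):
    print(f"{name:<9} {b:12.4f} {s:10.4f} {b / s:8.4f}")
```

The printed columns are the estimate, its standard error, and the t statistic, in the same row order as the table.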

Plot of Residuals versus Fitted Values

[Figure: residuals (vertical axis, -400 to 1200) plotted against fitted values (horizontal axis, 2500 to 3400); one point is annotated "Huge, Positive, Residual".]

Influential Measures (I) Note: n=37, p*=3 Parameters

• Standardized and Studentized Residuals: Outlier if $|r_i| > t_{.05/(2(37)),\,37-3} = 3.74$ or $|r_i^*| > t_{.05/(2(37)),\,37-4} = 3.75$

• Leverage (Hat) Values: Potentially influential wrt X-levels if $h_{ii} > \frac{2(3)}{37} = 0.162$

• DFFITS: Highly influential on OWN fitted value if $|DFFITS_i| > 2\sqrt{\frac{3}{37}} = 0.569$

• DFBETAS (One for each regression coefficient): Highly influential on coefficient if $|DFBETAS_{ij}| > \frac{2}{\sqrt{37}} = 0.329$

• Cook's D: Highly influential on group of coefficients if $D_i > F_{.50,\,3,\,35} = 0.804$

• Covariance Ratio: Highly influential on Std Errors of Reg Coeffs if $COVR_i$ is outside $1 \pm \frac{3(3)}{37} = (0.76,\ 1.24)$
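The rule-of-thumb cutoffs that depend only on n and p* are simple arithmetic. A small sketch using only the Python standard library is shown here; the dictionary keys are illustrative names, and the t- and F-based cutoffs are left out because they need distribution tables.

```python
import math

n, p_star = 37, 3  # sample size and number of regression parameters (from the slide)

cutoffs = {
    "leverage": 2 * p_star / n,            # hat values
    "dffits": 2 * math.sqrt(p_star / n),   # |DFFITS|
    "dfbetas": 2 / math.sqrt(n),           # |DFBETAS|
    "covratio_low": 1 - 3 * p_star / n,    # covariance ratio, lower bound
    "covratio_high": 1 + 3 * p_star / n,   # covariance ratio, upper bound
}

for name, value in cutoffs.items():
    print(f"{name:<14} {value:.3f}")
```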

Standardized / Studentized Residuals

Obs#  e(i)     r(i)     r*(i)       Obs#  e(i)     r(i)     r*(i)
1     123.44   0.5856   0.5799      20    -20.14   -0.0900  -0.0887
2     26.01    0.1184   0.1167      21    -133.47  -0.6145  -0.6087
3     -72.21   -0.3222  -0.3179     22    93.51    0.4546   0.4492
4     -221.58  -1.0582  -1.0601     23    204.15   0.9059   0.9034
5     -258.91  -1.1808  -1.1879     24    169.98   0.7557   0.7509
6     109.92   0.4880   0.4824      25    139.03   0.6487   0.6431
7     145.86   0.6505   0.6448      26    -99.51   -0.4420  -0.4367
8     -101.57  -0.4796  -0.4741     27    3.60     0.0160   0.0158
9     115.95   0.5146   0.5090      28    -99.17   -0.4424  -0.4371
10    -81.62   -0.3659  -0.3612     29    -144.07  -0.6506  -0.6449
11    207.71   0.9247   0.9226      30    -34.90   -0.1549  -0.1527
12    1.86     0.0083   0.0082      31    -275.37  -1.2335  -1.2433
13    -169.38  -0.7654  -0.7607     32    -120.80  -0.5626  -0.5568
14    -201.70  -0.9280  -0.9260     33    -221.60  -0.9905  -0.9903
15    243.24   1.1010   1.1045      34    -230.77  -1.0581  -1.0600
16    44.22    0.2016   0.1987      35    -7.83    -0.0373  -0.0368
17    224.12   0.9968   0.9967      36    -102.39  -0.5215  -0.5158
18    -103.52  -0.5067  -0.5011     37    -133.33  -0.6062  -0.6005
19    981.22   4.4726   6.8679
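The three columns e(i), r(i), and r*(i) can be computed from the hat matrix diagonal without refitting the model 37 times. The sketch below assumes NumPy and uses the usual textbook formulas for internally and externally studentized residuals; the names `e`, `h`, `r`, and `r_star` are illustrative.

```python
import numpy as np

# (M, W, H) for observations 1..37, transcribed from the raw data table.
DATA = [
    (76.2, 156.8, 3398), (71.3, 114.1, 2988), (69.6, 142.6, 3048), (58.0, 142.6, 2781),
    (74.6, 142.6, 2912), (68.9, 128.3, 3135), (69.1, 142.6, 3261), (62.1, 156.8, 3030),
    (68.7, 128.3, 3139), (65.4, 142.6, 2996), (70.4, 128.3, 3248), (69.1, 142.6, 3117),
    (63.7, 142.6, 2891), (62.1, 114.1, 2667), (73.5, 142.6, 3403), (61.3, 129.4, 2999),
    (70.1, 137.5, 3318), (79.8, 121.3, 2989), (61.3, 129.4, 3936), (64.8, 137.5, 3020),
    (60.2, 129.7, 2812), (72.4, 97.1, 2962), (68.9, 129.4, 3236), (70.1, 129.4, 3214),
    (70.8, 161.7, 3389), (66.5, 129.4, 2908), (66.7, 137.5, 3063), (71.2, 129.4, 2956),
    (72.4, 145.6, 3023), (69.3, 129.4, 3001), (67.4, 145.6, 2841), (69.6, 161.7, 3117),
    (66.2, 121.3, 2733), (74.5, 121.3, 2808), (67.7, 97.1, 2813), (57.5, 97.1, 2615),
    (70.4, 113.2, 2814),
]

M, W, H = np.array(DATA, dtype=float).T
X = np.column_stack([np.ones_like(M), W, M])  # columns: intercept, W, M
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ H
e = H - X @ beta                               # raw residuals e(i)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverages: diagonal of the hat matrix
sse = e @ e

# Standardized (internally studentized) residuals r(i).
r = e / np.sqrt(sse / (n - p) * (1 - h))
# Studentized (externally studentized, deleted) residuals r*(i).
r_star = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))

for i in np.argsort(-np.abs(r_star))[:3]:
    print(f"obs {i + 1:2d}: e = {e[i]:8.2f}, r = {r[i]:7.4f}, r* = {r_star[i]:7.4f}")
```

The loop prints the three observations with the largest |r*(i)|, which is a quick way to spot candidates for the outlier rule.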

Influential Measures (II)

Obs# df.b0 df.b(M) df.b(W) df.fit cov.r cook.d hat inf

1 -0.2080 0.1545 0.1423 0.2440 1.2488 2.02E-02 0.1505

2 0.0008 0.0151 -0.0243 0.0339 1.1846 3.95E-04 0.0779

3 0.0252 -0.0124 -0.0327 -0.0645 1.1283 1.42E-03 0.0395

4 -0.2907 0.4070 -0.1609 -0.4655 1.1799 7.20E-02 0.1616

5 0.2717 -0.2554 -0.1046 -0.3516 1.0491 4.07E-02 0.0806

6 0.0042 0.0143 -0.0219 0.0843 1.1036 2.43E-03 0.0296

7 -0.0418 0.0141 0.0674 0.1291 1.0956 5.65E-03 0.0385

8 -0.0353 0.1168 -0.1394 -0.1931 1.2493 1.27E-02 0.1422

9 0.0073 0.0116 -0.0228 0.0884 1.1006 2.67E-03 0.0293

10 -0.0153 0.0381 -0.0425 -0.0817 1.1361 2.28E-03 0.0487

11 -0.0319 0.0747 -0.0467 0.1760 1.0501 1.04E-02 0.0351

12 -0.0005 0.0002 0.0009 0.0016 1.1375 9.19E-07 0.0385

13 -0.0704 0.1258 -0.0945 -0.1984 1.1087 1.33E-02 0.0637

14 -0.2602 0.1802 0.1650 -0.3030 1.1211 3.07E-02 0.0967

15 -0.2151 0.1934 0.1007 0.2953 1.0509 2.89E-02 0.0667

16 0.0453 -0.0471 -0.0017 0.0585 1.1841 1.17E-03 0.0797

17 -0.0691 0.0609 0.0471 0.1853 1.0352 1.14E-02 0.0334

18 0.1503 -0.2257 0.0858 -0.2521 1.3397 2.17E-02 0.2020 *

19 1.5659 -1.6273 -0.0604 2.0203 0.0829 5.77E-01 0.0797 *

Influential Measures (III)

Obs# df.b0 df.b(M) df.b(W) df.fit cov.r cook.d hat inf

20 -0.0074 0.0107 -0.0058 -0.0190 1.1429 1.24E-04 0.0437

21 -0.1593 0.1696 0.0011 -0.2004 1.1723 1.36E-02 0.0978

22 0.0270 0.0890 -0.1893 0.2181 1.3271 1.62E-02 0.1908 *

23 0.0031 0.0257 -0.0306 0.1555 1.0465 8.10E-03 0.0288

24 -0.0234 0.0521 -0.0285 0.1379 1.0746 6.42E-03 0.0326

25 -0.1374 0.0406 0.2021 0.2392 1.1994 1.94E-02 0.1216

26 -0.0317 0.0233 0.0113 -0.0778 1.109 2.07E-03 0.0307

27 0.0005 -0.0009 0.0009 0.0029 1.1307 2.88E-06 0.0327

28 0.0275 -0.0469 0.0183 -0.0881 1.1186 2.65E-03 0.0391

29 0.1138 -0.0860 -0.0817 -0.1661 1.1232 9.36E-03 0.0622

30 0.0012 -0.0064 0.0054 -0.0267 1.1248 2.45E-04 0.0297

31 0.0372 0.0494 -0.1772 -0.2762 1.0004 2.50E-02 0.0470

32 0.0986 -0.0112 -0.1770 -0.2041 1.2063 1.42E-02 0.1184

33 -0.1189 0.0552 0.1097 -0.2100 1.0468 1.47E-02 0.0430

34 0.1308 -0.2493 0.1507 -0.3341 1.0875 3.71E-02 0.0904

35 -0.0075 -0.0008 0.0146 -0.0160 1.301 8.82E-05 0.1595 *

36 -0.2862 0.1937 0.1984 -0.3081 1.4485 3.23E-02 0.2629 *

37 -0.0225 -0.0592 0.1286 -0.1711 1.1445 9.94E-03 0.0751
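The influence measures in these two tables all follow from the residuals, the leverages, and the deleted-observation variance. The sketch below assumes NumPy and uses the standard closed-form expressions rather than whatever software produced the slides; note that its design matrix orders the columns as intercept, W, M, so the DFBETAS columns are in a different order from the table.

```python
import numpy as np

# (M, W, H) for observations 1..37, transcribed from the raw data table.
DATA = [
    (76.2, 156.8, 3398), (71.3, 114.1, 2988), (69.6, 142.6, 3048), (58.0, 142.6, 2781),
    (74.6, 142.6, 2912), (68.9, 128.3, 3135), (69.1, 142.6, 3261), (62.1, 156.8, 3030),
    (68.7, 128.3, 3139), (65.4, 142.6, 2996), (70.4, 128.3, 3248), (69.1, 142.6, 3117),
    (63.7, 142.6, 2891), (62.1, 114.1, 2667), (73.5, 142.6, 3403), (61.3, 129.4, 2999),
    (70.1, 137.5, 3318), (79.8, 121.3, 2989), (61.3, 129.4, 3936), (64.8, 137.5, 3020),
    (60.2, 129.7, 2812), (72.4, 97.1, 2962), (68.9, 129.4, 3236), (70.1, 129.4, 3214),
    (70.8, 161.7, 3389), (66.5, 129.4, 2908), (66.7, 137.5, 3063), (71.2, 129.4, 2956),
    (72.4, 145.6, 3023), (69.3, 129.4, 3001), (67.4, 145.6, 2841), (69.6, 161.7, 3117),
    (66.2, 121.3, 2733), (74.5, 121.3, 2808), (67.7, 97.1, 2813), (57.5, 97.1, 2615),
    (70.4, 113.2, 2814),
]

M, W, H = np.array(DATA, dtype=float).T
X = np.column_stack([np.ones_like(M), W, M])  # columns: intercept, W, M
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ H
e = H - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)       # leverages (hat values)
sse = e @ e
mse = sse / (n - p)
s2_del = (sse - e**2 / (1 - h)) / (n - p - 1)     # error variance with obs i deleted

r = e / np.sqrt(mse * (1 - h))                    # standardized residuals
r_star = e / np.sqrt(s2_del * (1 - h))            # studentized (deleted) residuals

dffits = r_star * np.sqrt(h / (1 - h))
cooks_d = r**2 * h / (p * (1 - h))
covratio = (s2_del / mse) ** p / (1 - h)

# DFBETAS: change in each coefficient when obs i is deleted, in standard error units.
delta = (XtX_inv @ X.T).T * (e / (1 - h))[:, None]             # row i is b - b(i)
dfbetas = delta / np.sqrt(s2_del[:, None] * np.diag(XtX_inv))  # columns: b0, W, M

i = 18  # observation #19 (zero-based index)
print(f"DFFITS = {dffits[i]:.4f}, Cook's D = {cooks_d[i]:.4f}, COVRATIO = {covratio[i]:.4f}")
print("DFBETAS (b0, W, M):", np.round(dfbetas[i], 4))
```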

Diagnosing Influential Observations

• Clearly, Observation #19 exerts a huge influence, although it has a small hat (leverage) value, so it must be near the center of the Mass/Work observations.

• On further comparison with the author’s original calculations provided in the paper, the mean and S.D. computed from these data are much too high for H (but exactly the same for M and W).

• Could the observation have been a “typo”?

• Try replacing H19=3936 with H19=2936

• Note: Do not do this arbitrarily; check your data sources in practice

Analysis with Corrected Data Point

            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   977.4254      376.0531        2.5992  0.0137   213.1935   1741.6572
W             6.2436        1.5221        4.1019  0.0002     3.1503      9.3370
M            17.7777        4.9428        3.5967  0.0010     7.7327     27.8227

$\hat{H} = 977.43 + 6.24\,W + 17.78\,M$

Note that both factors are now significant, and that the intercept (1536.51 to 977.43) and the body mass coefficient (10.14 to 17.78) have changed drastically.
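The effect of the single changed value can be checked by fitting the model twice, once with H19 = 3936 and once with H19 = 2936. The following is a minimal sketch assuming NumPy; the helper `fit_ols` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# (M, W, H) for observations 1..37, transcribed from the raw data table.
DATA = [
    (76.2, 156.8, 3398), (71.3, 114.1, 2988), (69.6, 142.6, 3048), (58.0, 142.6, 2781),
    (74.6, 142.6, 2912), (68.9, 128.3, 3135), (69.1, 142.6, 3261), (62.1, 156.8, 3030),
    (68.7, 128.3, 3139), (65.4, 142.6, 2996), (70.4, 128.3, 3248), (69.1, 142.6, 3117),
    (63.7, 142.6, 2891), (62.1, 114.1, 2667), (73.5, 142.6, 3403), (61.3, 129.4, 2999),
    (70.1, 137.5, 3318), (79.8, 121.3, 2989), (61.3, 129.4, 3936), (64.8, 137.5, 3020),
    (60.2, 129.7, 2812), (72.4, 97.1, 2962), (68.9, 129.4, 3236), (70.1, 129.4, 3214),
    (70.8, 161.7, 3389), (66.5, 129.4, 2908), (66.7, 137.5, 3063), (71.2, 129.4, 2956),
    (72.4, 145.6, 3023), (69.3, 129.4, 3001), (67.4, 145.6, 2841), (69.6, 161.7, 3117),
    (66.2, 121.3, 2733), (74.5, 121.3, 2808), (67.7, 97.1, 2813), (57.5, 97.1, 2615),
    (70.4, 113.2, 2814),
]


def fit_ols(X, y):
    """Return least squares coefficients and their standard errors (hypothetical helper)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se


M, W, H = np.array(DATA, dtype=float).T
X = np.column_stack([np.ones_like(M), W, M])  # columns: intercept, W, M

H_fixed = H.copy()
H_fixed[18] = 2936.0  # observation #19: 3936 replaced by the suspected intended value

beta_orig, se_orig = fit_ols(X, H)
beta_fixed, se_fixed = fit_ols(X, H_fixed)

for name, b0, b1, s1 in zip(["Intercept", "W", "M"], beta_orig, beta_fixed, se_fixed):
    print(f"{name:<9} original {b0:10.4f}   corrected {b1:10.4f}   t = {b1 / s1:6.4f}")
```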

Plot of Residuals versus Predicted Values

[Figure: residuals (vertical axis, -300 to 300) plotted against predicted values (horizontal axis, 2500 to 3400).]
