influential observations in regression
DESCRIPTION
Influential Observations in Regression. Measurements on Heat Production as a Function of Body Mass and Work Effort. M. Greenwood (1918). “On the Efficiency of Muscular Work,” Proc. Roy. Soc. of London, Series B, Vol. 90, #627, pp. 199-214.
Influential Observations in Regression
Measurements on Heat Production as a Function of Body Mass and Work Effort
M. Greenwood (1918). “On the Efficiency of Muscular Work,” Proc. Roy. Soc. of London, Series B, Vol. 90, #627, pp. 199-214
Data Description
• Study involved Algerians accustomed to heavy labor. Experiment consisted of several hours on stationary bicycle.
• Dependent (Response) Variable:
  – Heat Production (Calories)
• Independent (Explanatory/Predictor) Variables:
  – Work Effort (Calories)
  – Body Mass (kg)
• Model:
  – $H = \beta_0 + \beta_1 W + \beta_2 M + \varepsilon$
Raw Data (Table III, p.203)
M W H M W H
76.2 156.8 3398 64.8 137.5 3020
71.3 114.1 2988 60.2 129.7 2812
69.6 142.6 3048 72.4 97.1 2962
58.0 142.6 2781 68.9 129.4 3236
74.6 142.6 2912 70.1 129.4 3214
68.9 128.3 3135 70.8 161.7 3389
69.1 142.6 3261 66.5 129.4 2908
62.1 156.8 3030 66.7 137.5 3063
68.7 128.3 3139 71.2 129.4 2956
65.4 142.6 2996 72.4 145.6 3023
70.4 128.3 3248 69.3 129.4 3001
69.1 142.6 3117 67.4 145.6 2841
63.7 142.6 2891 69.6 161.7 3117
62.1 114.1 2667 66.2 121.3 2733
73.5 142.6 3403 74.5 121.3 2808
61.3 129.4 2999 67.7 97.1 2813
70.1 137.5 3318 57.5 97.1 2615
79.8 121.3 2989 70.4 113.2 2814
61.3 129.4 3936
Estimated Regression Coefficients
            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   1536.5098     584.4994        2.6288  0.0128   348.6642   2724.3554
W           6.1563        2.3659          2.6022  0.0136   1.3483     10.9643
M           10.1409       7.6826          1.3200  0.1957   -5.4720    25.7538
$\hat{H} = 1536.51 + 6.16W + 10.14M$
Note that we can conclude, controlling for the other factor:
As Work Effort increases, Heat Production increases (p = .0136)
There is no significant evidence that Heat Production increases as Body Mass increases (p = .1957)
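The t statistics and 95% limits in the coefficient table follow directly from the estimates and their standard errors. The sketch below redoes that arithmetic. The critical value 2.0322 (upper 2.5% point of a t distribution with 37 - 3 = 34 degrees of freedom) is supplied as an assumed constant rather than computed, and the names `COEFS`, `T_CRIT`, and `t_and_interval` are illustrative only.

```python
# Estimates and standard errors copied from the coefficient table.
COEFS = {
    "Intercept": (1536.5098, 584.4994),
    "W": (6.1563, 2.3659),
    "M": (10.1409, 7.6826),
}

# Assumption: upper 2.5% point of t with n - p* = 37 - 3 = 34 df.
T_CRIT = 2.0322


def t_and_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (t statistic, lower 95% limit, upper 95% limit)."""
    t_stat = estimate / std_error
    margin = t_crit * std_error
    return t_stat, estimate - margin, estimate + margin


for name, (est, se) in COEFS.items():
    t_stat, lo, hi = t_and_interval(est, se)
    print(f"{name:9s} t = {t_stat:7.4f}   95% CI = ({lo:10.4f}, {hi:10.4f})")
```

The interval for M spans zero, which carries the same information as its p-value of .1957.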
Plot of Residuals versus Fitted Values
[Figure: residuals (vertical axis, -400 to 1200) plotted against fitted values (horizontal axis, 2500 to 3400). One point is annotated "Huge, Positive, Residual".]
Influential Measures (I)
Note: n = 37, p* = 3 parameters
• Standardized and Studentized Residuals:
  – Outlier if $|r_i| > t_{.05/(2(37)),\,37-3} = 3.74$ or $|r_i^*| > t_{.05/(2(37)),\,37-4} = 3.75$
• Leverage (Hat) Values:
  – Potentially influential with respect to X-levels if $h_{ii} > 2(3)/37 = 0.162$
• DFFITS:
  – Highly influential on its OWN fitted value if $|DFFITS_i| > 2\sqrt{3/37} = 0.569$
• DFBETAS (one for each regression coefficient):
  – Highly influential on coefficient j if $|DFBETAS_{ij}| > 2/\sqrt{37} = 0.329$
• Cook's D:
  – Highly influential on the group of coefficients if $D_i > F_{.50,\,3,\,35} = 0.804$
• Covariance Ratio:
  – Highly influential on the standard errors of the regression coefficients if $COVR_i$ is outside $1 \pm 3(3)/37$, i.e. outside (0.76, 1.24)
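The closed-form cutoffs on this slide depend only on n = 37 and p* = 3, so they are easy to recompute. The following is a minimal sketch of that arithmetic; the function name `influence_cutoffs` is hypothetical, and the t and F cutoffs are left out because they need distribution quantiles rather than simple formulas.

```python
import math


def influence_cutoffs(n, p):
    """Rule-of-thumb cutoffs for n observations and p regression parameters."""
    return {
        "leverage": 2 * p / n,           # flag if hat value h_ii exceeds this
        "dffits": 2 * math.sqrt(p / n),  # flag if |DFFITS_i| exceeds this
        "dfbetas": 2 / math.sqrt(n),     # flag if |DFBETAS_ij| exceeds this
        # flag if COVR_i falls outside this (low, high) range
        "covratio": (1 - 3 * p / n, 1 + 3 * p / n),
    }


cutoffs = influence_cutoffs(n=37, p=3)
for name, value in cutoffs.items():
    print(name, value)
```

Rounded to the precision used on the slide, these come out to 0.162, 0.569, 0.329, and (0.76, 1.24).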
Standardized / Studentized Residuals
Obs# e(i) r(i) r*(i) Obs# e(i) r(i) r*(i)
1 123.44 0.5856 0.5799 20 -20.14 -0.0900 -0.0887
2 26.01 0.1184 0.1167 21 -133.47 -0.6145 -0.6087
3 -72.21 -0.3222 -0.3179 22 93.51 0.4546 0.4492
4 -221.58 -1.0582 -1.0601 23 204.15 0.9059 0.9034
5 -258.91 -1.1808 -1.1879 24 169.98 0.7557 0.7509
6 109.92 0.4880 0.4824 25 139.03 0.6487 0.6431
7 145.86 0.6505 0.6448 26 -99.51 -0.4420 -0.4367
8 -101.57 -0.4796 -0.4741 27 3.60 0.0160 0.0158
9 115.95 0.5146 0.5090 28 -99.17 -0.4424 -0.4371
10 -81.62 -0.3659 -0.3612 29 -144.07 -0.6506 -0.6449
11 207.71 0.9247 0.9226 30 -34.90 -0.1549 -0.1527
12 1.86 0.0083 0.0082 31 -275.37 -1.2335 -1.2433
13 -169.38 -0.7654 -0.7607 32 -120.80 -0.5626 -0.5568
14 -201.70 -0.9280 -0.9260 33 -221.60 -0.9905 -0.9903
15 243.24 1.1010 1.1045 34 -230.77 -1.0581 -1.0600
16 44.22 0.2016 0.1987 35 -7.83 -0.0373 -0.0368
17 224.12 0.9968 0.9967 36 -102.39 -0.5215 -0.5158
18 -103.52 -0.5067 -0.5011 37 -133.33 -0.6062 -0.6005
19 981.22 4.4726 6.8679
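The r*(i) column can be recovered from the r(i) column without refitting the model. The sketch below uses the standard identity between the internally studentized residual r and the deleted (externally studentized) residual r*; treating r(i) and r*(i) in the table as exactly those two quantities is an assumption here, and the function name is illustrative.

```python
import math


def studentized_deleted(r, n, p):
    """Deleted residual r* from the standardized residual r.

    Uses r* = r * sqrt((n - p - 1) / (n - p - r^2)), where n is the number
    of observations and p the number of regression parameters.
    """
    return r * math.sqrt((n - p - 1) / (n - p - r ** 2))


# Observation 19 (the large positive residual) and observation 1.
for obs, r in [(19, 4.4726), (1, 0.5856)]:
    print(obs, round(studentized_deleted(r, n=37, p=3), 4))
```

For small residuals the two versions are nearly identical; for observation 19 the deleted version is much larger because removing that point shrinks the error variance estimate.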
Influential Measures (II)
Obs# df.b0 df.b(M) df.b(W) df.fit cov.r cook.d hat inf
1 -0.2080 0.1545 0.1423 0.2440 1.2488 2.02E-02 0.1505
2 0.0008 0.0151 -0.0243 0.0339 1.1846 3.95E-04 0.0779
3 0.0252 -0.0124 -0.0327 -0.0645 1.1283 1.42E-03 0.0395
4 -0.2907 0.4070 -0.1609 -0.4655 1.1799 7.20E-02 0.1616
5 0.2717 -0.2554 -0.1046 -0.3516 1.0491 4.07E-02 0.0806
6 0.0042 0.0143 -0.0219 0.0843 1.1036 2.43E-03 0.0296
7 -0.0418 0.0141 0.0674 0.1291 1.0956 5.65E-03 0.0385
8 -0.0353 0.1168 -0.1394 -0.1931 1.2493 1.27E-02 0.1422
9 0.0073 0.0116 -0.0228 0.0884 1.1006 2.67E-03 0.0293
10 -0.0153 0.0381 -0.0425 -0.0817 1.1361 2.28E-03 0.0487
11 -0.0319 0.0747 -0.0467 0.1760 1.0501 1.04E-02 0.0351
12 -0.0005 0.0002 0.0009 0.0016 1.1375 9.19E-07 0.0385
13 -0.0704 0.1258 -0.0945 -0.1984 1.1087 1.33E-02 0.0637
14 -0.2602 0.1802 0.1650 -0.3030 1.1211 3.07E-02 0.0967
15 -0.2151 0.1934 0.1007 0.2953 1.0509 2.89E-02 0.0667
16 0.0453 -0.0471 -0.0017 0.0585 1.1841 1.17E-03 0.0797
17 -0.0691 0.0609 0.0471 0.1853 1.0352 1.14E-02 0.0334
18 0.1503 -0.2257 0.0858 -0.2521 1.3397 2.17E-02 0.2020 *
19 1.5659 -1.6273 -0.0604 2.0203 0.0829 5.77E-01 0.0797 *
Influential Measures (III)
Obs# df.b0 df.b(M) df.b(W) df.fit cov.r cook.d hat inf
20 -0.0074 0.0107 -0.0058 -0.0190 1.1429 1.24E-04 0.0437
21 -0.1593 0.1696 0.0011 -0.2004 1.1723 1.36E-02 0.0978
22 0.0270 0.0890 -0.1893 0.2181 1.3271 1.62E-02 0.1908 *
23 0.0031 0.0257 -0.0306 0.1555 1.0465 8.10E-03 0.0288
24 -0.0234 0.0521 -0.0285 0.1379 1.0746 6.42E-03 0.0326
25 -0.1374 0.0406 0.2021 0.2392 1.1994 1.94E-02 0.1216
26 -0.0317 0.0233 0.0113 -0.0778 1.109 2.07E-03 0.0307
27 0.0005 -0.0009 0.0009 0.0029 1.1307 2.88E-06 0.0327
28 0.0275 -0.0469 0.0183 -0.0881 1.1186 2.65E-03 0.0391
29 0.1138 -0.0860 -0.0817 -0.1661 1.1232 9.36E-03 0.0622
30 0.0012 -0.0064 0.0054 -0.0267 1.1248 2.45E-04 0.0297
31 0.0372 0.0494 -0.1772 -0.2762 1.0004 2.50E-02 0.0470
32 0.0986 -0.0112 -0.1770 -0.2041 1.2063 1.42E-02 0.1184
33 -0.1189 0.0552 0.1097 -0.2100 1.0468 1.47E-02 0.0430
34 0.1308 -0.2493 0.1507 -0.3341 1.0875 3.71E-02 0.0904
35 -0.0075 -0.0008 0.0146 -0.0160 1.301 8.82E-05 0.1595 *
36 -0.2862 0.1937 0.1984 -0.3081 1.4485 3.23E-02 0.2629 *
37 -0.0225 -0.0592 0.1286 -0.1711 1.1445 9.94E-03 0.0751
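Several columns of these tables are functions of the residuals and the hat value alone. As a rough check, the sketch below rebuilds df.fit, cook.d, and cov.r for a single observation from its r(i), r*(i), and hat value using the usual textbook formulas. That the tables were produced with exactly these formulas is an assumption, and the function name is hypothetical.

```python
import math


def influence_from_residuals(r, r_star, h, n, p):
    """Return (DFFITS, Cook's D, covariance ratio) for one observation.

    r      : standardized residual
    r_star : deleted (externally studentized) residual
    h      : hat (leverage) value
    """
    dffits = r_star * math.sqrt(h / (1 - h))
    cooks_d = (r ** 2 / p) * h / (1 - h)
    # Ratio of deleted to full error variance, raised to p, over (1 - h).
    cov_ratio = ((n - p - r ** 2) / (n - p - 1)) ** p / (1 - h)
    return dffits, cooks_d, cov_ratio


# Observation 19: r = 4.4726, r* = 6.8679, hat = 0.0797.
dffits, cooks_d, cov_ratio = influence_from_residuals(
    4.4726, 6.8679, 0.0797, n=37, p=3
)
print(dffits, cooks_d, cov_ratio)
```

Because the inputs are rounded to four decimals, the results agree with the tabled 2.0203, 5.77E-01, and 0.0829 only up to rounding.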
Diagnosing Influential Observations
• Clearly, Observation #19 exerts a huge influence, although it has a small hat (leverage) value, so it must be near the center of the Mass/Work observations.
• Upon further review of the author’s original calculations provided in the paper, the mean and S.D. of the tabled data are much too high for H (but exactly the same for M and W).
• Could the observation have been a “typo”?
• Try replacing H19 = 3936 with H19 = 2936
• Note: Do not do this arbitrarily; check your data sources in practice
Analysis with Corrected Data Point
            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   977.4254      376.0531        2.5992  0.0137   213.1935   1741.6572
W           6.2436        1.5221          4.1019  0.0002   3.1503     9.3370
M           17.7777       4.9428          3.5967  0.0010   7.7327     27.8227
$\hat{H} = 977.43 + 6.24W + 17.78M$
Note that both factors are significant, and that the intercept and body mass coefficients have changed drastically
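To see how much a single response value moves the fit, the comparison can be rerun from the raw data. The sketch below is a plain-Python least-squares fit through the normal equations, applied once to Table III as printed and once with H19 changed to 2936. The solver and the names `DATA`, `fit_ols`, `original`, and `corrected` are illustrative, not the software used for the slides.

```python
# (M, W, H) rows from Table III, in observation order 1..37.
DATA = [
    (76.2, 156.8, 3398), (71.3, 114.1, 2988), (69.6, 142.6, 3048),
    (58.0, 142.6, 2781), (74.6, 142.6, 2912), (68.9, 128.3, 3135),
    (69.1, 142.6, 3261), (62.1, 156.8, 3030), (68.7, 128.3, 3139),
    (65.4, 142.6, 2996), (70.4, 128.3, 3248), (69.1, 142.6, 3117),
    (63.7, 142.6, 2891), (62.1, 114.1, 2667), (73.5, 142.6, 3403),
    (61.3, 129.4, 2999), (70.1, 137.5, 3318), (79.8, 121.3, 2989),
    (61.3, 129.4, 3936),  # observation 19
    (64.8, 137.5, 3020), (60.2, 129.7, 2812), (72.4, 97.1, 2962),
    (68.9, 129.4, 3236), (70.1, 129.4, 3214), (70.8, 161.7, 3389),
    (66.5, 129.4, 2908), (66.7, 137.5, 3063), (71.2, 129.4, 2956),
    (72.4, 145.6, 3023), (69.3, 129.4, 3001), (67.4, 145.6, 2841),
    (69.6, 161.7, 3117), (66.2, 121.3, 2733), (74.5, 121.3, 2808),
    (67.7, 97.1, 2813), (57.5, 97.1, 2615), (70.4, 113.2, 2814),
]


def fit_ols(rows):
    """Return [intercept, W slope, M slope] from the normal equations."""
    X = [(1.0, w, m) for m, w, _ in rows]
    y = [h for _, _, h in rows]
    k = 3
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


original = fit_ols(DATA)

# Replace only the response of observation 19 (index 18).
corrected_rows = list(DATA)
m19, w19, _ = corrected_rows[18]
corrected_rows[18] = (m19, w19, 2936)
corrected = fit_ols(corrected_rows)

print("original :", [round(b, 4) for b in original])
print("corrected:", [round(b, 4) for b in corrected])
```

The W slope barely moves between the two fits, while the intercept and the M slope shift substantially, which is the pattern the DFBETAS values for observation 19 point to.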
Plot of Residuals versus Predicted Values
[Figure: residuals (vertical axis, -300 to 300) plotted against predicted values (horizontal axis, 2500 to 3400).]