
Week 4: The essentials of multiple regression (using Minitab output): ANOVA table, R², global F-test, residual plots, inference for individual parameters, prediction/confidence intervals. Interaction and polynomial models. The General Linear Model (GLM) and transforming to GLM (e.g. exponential models). Testing portions of a model via the Extra SS principle. (ch11) [pp57-80]

- Global F test
- Confidence intervals for the mean of Y
- Prediction intervals

(CSDATA in IPS data appendix)

Regression Analysis: gpa versus hsm, hse, satm

The regression equation is
gpa = 0.305 + 0.163 hsm + 0.0657 hse + 0.000747 satm

Predictor       Coef    SE Coef      T      P
Constant      0.3047     0.3918   0.78  0.438
hsm          0.16272    0.03586   4.54  0.000
hse          0.06572    0.03494   1.88  0.061
satm       0.0007467  0.0006120   1.22  0.224

S = 0.698805   R-Sq = 20.7%   R-Sq(adj) = 19.6%

Analysis of Variance
Source           DF        SS      MS      F      P
Regression        3   28.0305  9.3435  19.13  0.000
Residual Error  220  107.4322  0.4883
Total           223  135.4628

Predicted Values for New Observations
New Obs     Fit  SE Fit            95% CI            95% PI
1        2.9057  0.0764  (2.7551, 3.0563)  (1.5203, 4.2911)

Values of Predictors for New Observations
New Obs   hsm   hse  satm
1        10.0  8.00   600
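The same fit and intervals can be reproduced outside Minitab. A minimal sketch in Python (statsmodels), assuming the CSDATA variables are in a pandas DataFrame csdata with columns gpa, hsm, hse, satm (the file name below is only a placeholder):

import pandas as pd
import statsmodels.formula.api as smf

csdata = pd.read_csv("csdata.csv")   # placeholder file name for the CSDATA set
fit = smf.ols("gpa ~ hsm + hse + satm", data=csdata).fit()
print(fit.summary())                 # coefficients, t-tests, R-Sq, global F

# 95% CI for the mean of Y and 95% PI at hsm = 10, hse = 8, satm = 600
new = pd.DataFrame({"hsm": [10.0], "hse": [8.0], "satm": [600.0]})
print(fit.get_prediction(new).summary_frame(alpha=0.05))
# mean_ci_lower/upper give the CI for the mean response;
# obs_ci_lower/upper give the prediction interval for a new observation.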


Interaction models

Interaction Models with Two Quantitative Independent Variables

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε

Example: Case study (Peru, some data deleted)

Row   Fraction  Weight  Systol
  1   0.047619    71.0     170
  2   0.272727    56.5     120
  3   0.208333    56.0     125
  4   0.041667    61.0     148
  5   0.040000    65.0     140
 25   0.615385    74.0     128
 26   0.358974    72.0     134
 27   0.609756    62.5     112
 28   0.780488    68.0     128
 29   0.121951    63.4     134
 30   0.285714    68.0     128
 31   0.581395    69.0     140
 32   0.604651    73.0     138
 33   0.232558    64.0     118
 34   0.431818    65.0     110
 35   0.409091    71.0     142
 36   0.222222    60.2     134
 37   0.021277    55.0     116
 38   0.860000    70.0     132
 39   0.740741    87.0     152


[Scatterplots: Systol versus Fraction, and Systol versus Weight]

The regression equation is
Systol = 60.9 - 26.8 Fraction + 1.22 Weight

Predictor     Coef   StDev      T      P
Constant     60.90   14.28   4.26  0.000
Fraction   -26.767   7.218  -3.71  0.001
Weight      1.2169  0.2337   5.21  0.000

S = 9.777   R-Sq = 47.3%   R-Sq(adj) = 44.4%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  3090.1  1545.0  16.16  0.000
Residual Error  36  3441.4    95.6
Total           38  6531.4
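As a check on how the summary quantities relate to the ANOVA table: R-Sq = SS(Regression)/SS(Total) = 3090.1/6531.4 ≈ 0.473 (47.3%), and the global F statistic is MS(Regression)/MS(Error) = 1545.0/95.6 ≈ 16.2, matching the printed F = 16.16 on (2, 36) df.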


Interaction model

Regression Analysis
The regression equation is
Systol = 52.2 - 9.9 F + 1.36 W - 0.267 F*W

Predictor     Coef   StDev      T      P
Constant     52.22   34.17   1.53  0.135
F            -9.86   60.78  -0.16  0.872
W           1.3560  0.5501   2.47  0.019
F*W        -0.2672  0.9536  -0.28  0.781
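A minimal sketch of fitting the same interaction model in Python (statsmodels), assuming the Peru data have been loaded into a pandas DataFrame named peru with columns Systol, Fraction and Weight (names and file assumed here):

import pandas as pd
import statsmodels.formula.api as smf

peru = pd.read_csv("peru.csv")   # placeholder file name for the Peru data
# Fraction*Weight expands to Fraction + Weight + Fraction:Weight (the product term F*W)
inter = smf.ols("Systol ~ Fraction * Weight", data=peru).fit()
print(inter.summary())           # the Fraction:Weight row corresponds to the F*W line above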

The model for subjects with W=50 Systol = 52.2 - 9.9 F + 1.36 x 50 - 0.267 F*50

Systol = 52.2 - 9.9 F + 68 – 13.35 F

Systol = 120.2 – 23.25 F

The model for subjects with W=90 Systol = 52.2 - 9.9 F + 1.36 x 90 - 0.267 F*90

Systol = 52.2 - 9.9 F + 122.4 – 24.03F

Systol = 174.6 – 33.93 F

Exercise: Test whether the interaction is significant.


A quadratic (second order) model with a quantitative predictor

The quadratic model is given by

y = β0 + β1x + β2x² + ε,

with ε satisfying the usual assumptions. The term involving x² is called the quadratic (or second order) term.

Example (Carp data)
Y = Endogenous nitrogen excretion (ENE)
X = Body weight

Row  bodyweight   ENE
  1        11.7  15.3
  2        25.3   9.3
  3        90.2   6.5
  4       213.0   6.0
  5        10.2  15.7
  6        17.6  10.0
  7        32.6   8.6
  8        81.3   6.4
  9       141.5   5.6
 10       285.7   6.0


[Scatterplot: ENE versus body weight]

The regression equation is
ENE = 13.7 - 0.102 bodywt + 0.000273 bodywtsq

Predictor        Coef      StDev      T      P
Constant       13.713      1.306  10.50  0.000
bodywt       -0.10184    0.02881  -3.53  0.010
bodywtsq    0.0002735  0.0001016   2.69  0.031

S = 2.194   R-Sq = 73.7%   R-Sq(adj) = 66.2%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       2   94.659  47.329  9.83  0.009
Residual Error   7   33.705   4.815
Total            9  128.364
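A minimal sketch of the same quadratic fit in Python, using the ten carp observations listed above; the estimates should agree with the Minitab output up to rounding:

import numpy as np
import statsmodels.api as sm

bodywt = np.array([11.7, 25.3, 90.2, 213.0, 10.2, 17.6, 32.6, 81.3, 141.5, 285.7])
ene    = np.array([15.3,  9.3,  6.5,   6.0, 15.7, 10.0,  8.6,  6.4,   5.6,   6.0])

X = sm.add_constant(np.column_stack([bodywt, bodywt**2]))  # columns: 1, bodywt, bodywt**2
quad = sm.OLS(ene, X).fit()
print(quad.params)     # expect roughly 13.713, -0.10184, 0.0002735
print(quad.rsquared)   # expect roughly 0.737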


Interpretation of the estimated coefficients

- Interpretation of the estimated coefficients must be undertaken cautiously.

- The estimated y-intercept β̂0 can be meaningfully interpreted only if the range of the independent variable includes 0.

- β̂1 will not in general have a meaningful interpretation in the quadratic model.

- The sign of β̂2 indicates whether the curve is concave upward (β̂2 > 0) or concave downward (β̂2 < 0).


Testing whether the quadratic model is statistically useful

The usefulness of the quadratic model is assessed by testing the null hypothesis

H0: β1 = β2 = 0

against

Ha: at least one of β1, β2 is nonzero.
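For the carp example this is just the global F test printed in the ANOVA table: F = MS(Regression)/MS(Error) = 47.329/4.815 ≈ 9.83 on (2, 7) df, with P = 0.009, so H0 is rejected and the quadratic model is statistically useful.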


Minitab commands

[The content of these slides (Minitab command steps) was not captured in the text transcript.]

- Complete second order model in two predictors
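Written out here (with β's numbered to match the terms fitted in the example below), the complete second order model in two quantitative predictors x1 and x2 is

y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + ε.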


Testing a set of beta's (p72, course notes)

Example

Row    wt  distance  cost  wt*dist    wt**2  dist**2
  1  5.90        47   2.6    277.3  34.8100     2209
  2  3.20       145   3.9    464.0  10.2400    21025
  3  4.40       202   8.0    888.8  19.3600    40804
  4  6.60       160   9.2   1056.0  43.5600    25600
  5  0.75       280   4.4    210.0   0.5625    78400
  6  0.70        80   1.5     56.0   0.4900     6400
  7  6.50       240  14.5   1560.0  42.2500    57600
  8  4.50        53   1.9    238.5  20.2500     2809
  9  0.60       100   1.0     60.0   0.3600    10000
 10  7.50       190  14.0   1425.0  56.2500    36100
 11  5.10       240  11.0   1224.0  26.0100    57600
 12  2.40       209   5.0    501.6   5.7600    43681
 13  0.30       160   2.0     48.0   0.0900    25600
 14  6.20       115   6.0    713.0  38.4400    13225
 15  2.70        45   1.1    121.5   7.2900     2025
 16  3.50       250   8.0    875.0  12.2500    62500
 17  4.10        95   3.3    389.5  16.8100     9025
 18  8.10       160  12.1   1296.0  65.6100    25600
 19  7.00       260  15.5   1820.0  49.0000    67600
 20  1.10        90   1.7     99.0   1.2100     8100

a) Fit a complete second order model.


The regression equation is
cost = 0.827 - 0.609 wt + 0.00402 dist + 0.00733 wt*dist + 0.0898 wt**2 + 0.000015 dist**2

Predictor        Coef       StDev      T      P
Constant       0.8270      0.7023   1.18  0.259
wt            -0.6091      0.1799  -3.39  0.004
dist         0.004021    0.007998   0.50  0.623
wt*dist     0.0073271   0.0006374  11.49  0.000
wt**2         0.08975     0.02021   4.44  0.001
dist**2    0.00001507  0.00002243   0.67  0.513

S = 0.4428   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086

Source    DF   Seq SS
wt         1  270.553
dist       1  143.631
wt*dist    1   31.268
wt**2      1    3.800
dist**2    1    0.088

Test the hypothesis that the terms wt**2 and dist**2 can be dropped from the model.


The regression equation is
cost = - 0.141 + 0.019 wt + 0.00772 distance + 0.00780 wt*dist

Predictor       Coef      StDev      T      P
Constant     -0.1405     0.6481  -0.22  0.831
wt            0.0191     0.1582   0.12  0.905
distance    0.007721   0.003906   1.98  0.066
wt*dist    0.0077957  0.0008977   8.68  0.000

S = 0.6439   R-Sq = 98.5%   R-Sq(adj) = 98.3%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  445.45  148.48  358.15  0.000
Residual Error  16    6.63    0.41
Total           19  452.09
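A sketch of the Extra SS calculation using the two outputs above: dropping wt**2 and dist**2 raises SSE from 2.745 (14 df) to 6.63 (16 df), an extra sum of squares of 6.63 − 2.745 = 3.885 on 2 df (equivalently, the sequential SS for wt**2 and dist**2: 3.800 + 0.088 ≈ 3.89). The test statistic is F = (3.885/2)/(2.745/14) ≈ 1.94/0.196 ≈ 9.9 on (2, 14) df, which exceeds the 5% critical value (about 3.7), so the two quadratic terms should not both be dropped.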


- Sequential Sums of Squares

Regression Analysis: cost versus wt, distance, wt*dist, wt**2, dist**2

The regression equation is
cost = 0.827 - 0.609 wt + 0.00402 distance + 0.00733 wt*dist + 0.0898 wt**2 + 0.000015 dist**2

Predictor        Coef     SE Coef      T      P
Constant       0.8270      0.7023   1.18  0.259
wt            -0.6091      0.1799  -3.39  0.004
distance     0.004021    0.007998   0.50  0.623
wt*dist     0.0073271   0.0006374  11.49  0.000
wt**2         0.08975     0.02021   4.44  0.001
dist**2    0.00001507  0.00002243   0.67  0.513

S = 0.442778   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086

Source     DF   Seq SS
wt          1  270.553
distance    1  143.631
wt*dist     1   31.268
wt**2       1    3.800
dist**2     1    0.088

Regression Analysis: cost versus wt

The regression equation is
cost = 0.28 + 1.49 wt

Predictor    Coef  SE Coef     T      P
Constant    0.276    1.368  0.20  0.842
wt         1.4932   0.2883  5.18  0.000

S = 3.17571   R-Sq = 59.8%   R-Sq(adj) = 57.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  270.55  270.55  26.83  0.000
Residual Error  18  181.53   10.09
Total           19  452.09

Regression Analysis: cost versus wt, distance

The regression equation is
cost = - 4.67 + 1.29 wt + 0.0369 distance

Predictor      Coef   SE Coef      T      P
Constant    -4.6728    0.8911  -5.24  0.000
wt           1.2924    0.1378   9.38  0.000
distance   0.036936  0.004602   8.03  0.000

S = 1.49314   R-Sq = 91.6%   R-Sq(adj) = 90.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  414.18  207.09  92.89  0.000
Residual Error  17   37.90    2.23
Total           19  452.09

Regression Analysis: cost versus wt, distance, wt*dist

The regression equation is
cost = - 0.141 + 0.019 wt + 0.00772 distance + 0.00780 wt*dist

Predictor       Coef    SE Coef      T      P
Constant     -0.1405     0.6481  -0.22  0.831
wt            0.0191     0.1582   0.12  0.905
distance    0.007721   0.003906   1.98  0.066
wt*dist    0.0077957  0.0008977   8.68  0.000

S = 0.643880   R-Sq = 98.5%   R-Sq(adj) = 98.3%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  445.45  148.48  358.15  0.000
Residual Error  16    6.63    0.41
Total           19  452.09

Regression Analysis: cost versus wt, distance, wt*dist, wt**2

The regression equation is
cost = 0.475 - 0.578 wt + 0.00908 distance + 0.00726 wt*dist + 0.0867 wt**2

Predictor       Coef    SE Coef      T      P
Constant      0.4747     0.4585   1.04  0.317
wt           -0.5782     0.1707  -3.39  0.004
distance    0.009078   0.002654   3.42  0.004
wt*dist    0.0072587  0.0006176  11.75  0.000
wt**2        0.08674    0.01934   4.49  0.000

S = 0.434604   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       4  449.25  112.31  594.62  0.000
Residual Error  15    2.83    0.19
Total           19  452.09

Regression Analysis: cost versus wt, distance, wt*dist, wt**2, dist**2

The regression equation is
cost = 0.827 - 0.609 wt + 0.00402 distance + 0.00733 wt*dist + 0.0898 wt**2 + 0.000015 dist**2

Predictor        Coef     SE Coef      T      P
Constant       0.8270      0.7023   1.18  0.259
wt            -0.6091      0.1799  -3.39  0.004
distance     0.004021    0.007998   0.50  0.623
wt*dist     0.0073271   0.0006374  11.49  0.000
wt**2         0.08975     0.02021   4.44  0.001
dist**2    0.00001507  0.00002243   0.67  0.513

S = 0.442778   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086
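The sequential sums of squares are exactly the extra sums of squares gained as each term is added to those listed before it, and they accumulate to the regression SS of the full model: 270.553 + 143.631 + 31.268 + 3.800 + 0.088 = 449.340 ≈ 449.341. The running totals (270.55, 414.18, 445.45, 449.25, 449.34) match SS(Regression) in the nested fits printed above.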


Examples (STA221 Apr 98 Final Exam)

The regression equation is
Weight = 0.0265 - 0.0729 Diameter + 0.0628 diam**2

Predictor      Coef     StDev      T      P
Constant    0.02652   0.02133   1.24  0.240
Diameter   -0.07287   0.01553  -4.69  0.001
diam**2    0.062755  0.002609  24.06  0.000

S = 0.01117   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF       SS       MS        F      P
Regression       2  1.32300  0.66150  5299.38  0.000
Residual Error  11  0.00137  0.00012
Total           13  1.32437

Source     DF   Seq SS
Diameter    1  1.25077
diam**2     1  0.07223

The test of significance for the contribution of the second order term in diameter has an F-value of (to the nearest 50)

A) 7600 B) 5300 C) 2650 D) 600 E) 350
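(Hint, not part of the original exam paper: for the last term added, the partial F statistic equals the square of its t statistic, and with the sequential SS it can also be computed as Seq SS(diam**2) divided by the MS for residual error; both quantities appear in the output above.)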

The regression equation is
Weight = - 0.237 + 0.447 Diameter - 0.150 Height

Predictor      Coef    StDev      T      P
Constant   -0.23658  0.06340  -3.73  0.003
Diameter    0.44689  0.03921  11.40  0.000
Height     -0.15043  0.03622  -4.15  0.002

S = 0.05104   R-Sq = 97.8%   R-Sq(adj) = 97.4%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  1.29571  0.64786  248.65  0.000
Residual Error  11  0.02866  0.00261
Total           13  1.32437


The regression equation is
Weight = 0.0216 - 0.151 Diameter + 0.0467 Height + 0.0721 diam**2 - 0.00290 ht**2

Predictor       Coef     StDev      T      P
Constant     0.02156   0.03595   0.60  0.563
Diameter    -0.15141   0.04792  -3.16  0.012
Height       0.04666   0.03762   1.24  0.246
diam**2     0.072104  0.006467  11.15  0.000
ht**2      -0.002898  0.004179  -0.69  0.505

S = 0.01057   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF       SS       MS        F      P
Regression       4  1.32336  0.33084  2958.44  0.000
Residual Error   9  0.00101  0.00011
Total           13  1.32437

Source     DF   Seq SS
Diameter    1  1.25077
Height      1  0.04494
diam**2     1  0.02760
ht**2       1  0.00005

[Residual plots (response is Weight): residuals versus Weight; residuals versus Height; residuals versus the fitted values; normal probability plot of the residuals; histogram of the residuals]

The regression equation is
Weight = 0.117 + 0.0982 Diameter - 0.159 Height + 0.0513 diam*ht

Predictor      Coef    StDev      T      P
Constant    0.11742  0.08189   1.43  0.182
Diameter    0.09820  0.07567   1.30  0.224
Height     -0.15942  0.02090  -7.63  0.000
diam*ht     0.05133  0.01063   4.83  0.001

S = 0.02934   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       3  1.31577  0.43859  509.61  0.000
Residual Error  10  0.00861  0.00086
Total           13  1.32437

7) Which of the following are true?

I) If we test the extra contribution of both height and height squared to the model with only diameter and diameter squared, the calculated F-statistic would be less than 2.

II) If we test the extra contribution of adding both height squared and diameter squared to the first order model with just height and diameter, the calculated F-statistic is less than 200.

III) If we assume the appropriateness of the model with diameter, height and their product, we see that the effect on dry weight of an increase in diameter is not independent of the height of the trees.

IV) Residual plots indicate problems with the second order model containing diameter, height and their respective squares.
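As an illustration of the Extra SS principle for statement I (a sketch using the outputs above): the model with Diameter and diam**2 alone has SSE = 0.00137 on 11 df, and adding Height and ht**2 gives SSE = 0.00101 on 9 df, so F = ((0.00137 − 0.00101)/2)/(0.00101/9) ≈ 0.00018/0.000112 ≈ 1.6, which is indeed less than 2. Statement II can be checked in the same way from the first order fit with Diameter and Height.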


The General Linear Model (p78, course notes)

- a model of the form

  y = β0 + β1x1 + β2x2 + … + βkxk + ε,

  which is linear in the β's.

- no predictor is a linear function of the other predictors, i.e. terms such as x3 = 2x1 or x3 = x1 + x2 are not allowed in a general linear model,

- but the predictors can be nonlinear functions of other predictors, e.g. x2 = x1² or x3 = x1x2 are allowed, and so polynomial models are general linear models.
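A minimal illustration (not from the course notes) of why such models stay "linear": the columns of the design matrix may be nonlinear functions of the original predictors, but the fit is still ordinary least squares in the β's. Simulated data are used here purely for demonstration.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)
y = 2 + 0.5 * x1 - 0.3 * x2 + 0.1 * x1 * x2 + 0.05 * x1**2 + rng.normal(size=50)

# Design matrix with columns 1, x1, x2, x1*x2, x1**2: nonlinear in the x's,
# but the model is linear in the betas, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2])
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betahat)   # estimates of beta0 ... beta4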
