Lecture 4: Linear Models III. Olivier Missa, om502@york.ac.uk. Advanced Research Skills
TRANSCRIPT
Outline
"Refresher" on different types of model:
Multiple regression
Polynomial regression
Model building
Finding the "best" model.
Multiple Regression
Used when more than one continuous or discrete variable predicts a response variable.
Example: the seatpos dataset (38 car drivers)
> library(faraway)
> data(seatpos) ## from the faraway package
> attach(seatpos)
> names(seatpos)
[1] "Age"       "Weight"    "HtShoes"   "Ht"        "Seated"
[6] "Arm"       "Thigh"     "Leg"       "hipcenter"
> summary(lm(hipcenter ~ Age))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -192.9645    24.3015  -7.940 2.00e-09 ***
Age            0.7963     0.6331   1.258    0.217
...
Multiple R-squared: 0.0421, Adjusted R-squared: 0.01549
...
Multiple Regression
> summary(lm(hipcenter ~ Age))
        Estimate Std. Error t value Pr(>|t|)
Age       0.7963     0.6331   1.258    0.217
Multiple R-squared: 0.0421
> summary(lm(hipcenter ~ Weight))
        Estimate Std. Error t value Pr(>|t|)
Weight   -1.0674     0.2134  -5.002 1.49e-05 ***
Multiple R-squared: 0.41
> summary(lm(hipcenter ~ HtShoes))
        Estimate Std. Error t value Pr(>|t|)
HtShoes  -4.2621     0.5391  -7.907 2.21e-09 ***
Multiple R-squared: 0.6346
> summary(lm(hipcenter ~ Ht))
        Estimate Std. Error t value Pr(>|t|)
Ht       -4.2650     0.5351  -7.970 1.83e-09 ***
Multiple R-squared: 0.6383
> summary(lm(hipcenter ~ Seated))
        Estimate Std. Error t value Pr(>|t|)
Seated    -8.844      1.375  -6.432 1.84e-07 ***
Multiple R-squared: 0.5347
Multiple Regression
> summary(lm(hipcenter ~ Arm))
      Estimate Std. Error t value Pr(>|t|)
Arm    -10.351      2.391  -4.329 0.000114 ***
Multiple R-squared: 0.3423
> summary(lm(hipcenter ~ Thigh))
      Estimate Std. Error t value Pr(>|t|)
Thigh   -9.100      2.069  -4.398 9.29e-05 ***
Multiple R-squared: 0.3495
> summary(lm(hipcenter ~ Leg))
      Estimate Std. Error t value Pr(>|t|)
Leg    -13.795      1.801  -7.658 4.59e-09 ***
Multiple R-squared: 0.6196
[Figure: panels labelled Age, Ht, Leg]
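The one-predictor-at-a-time fits above can be collected in a single table rather than typed out one by one. The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in (an assumption, so that it runs without the faraway package), and the names response, predictors and single_fits are hypothetical.

```r
# Stand-in data: the built-in mtcars dataset (an assumption, not the lecture's seatpos data)
response   <- "mpg"
predictors <- setdiff(names(mtcars), response)

# Fit one simple regression per predictor; keep the slope, its p-value and R-squared
single_fits <- t(sapply(predictors, function(v) {
  fit <- summary(lm(reformulate(v, response), data = mtcars))
  c(slope     = fit$coefficients[2, "Estimate"],
    p.value   = fit$coefficients[2, "Pr(>|t|)"],
    r.squared = fit$r.squared)
}))

# Rank predictors by the variance each one explains on its own
single_fits <- single_fits[order(-single_fits[, "r.squared"]), ]
print(round(single_fits, 4))
```

With seatpos loaded, replacing "mpg" by "hipcenter" and mtcars by seatpos should give the same kind of ranking as the slides.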
Multiple Regression
> summary(mod <- lm(hipcenter ~ Ht + Leg))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  491.244     99.543   4.935 1.95e-05 ***
Ht            -2.565      1.268  -2.022   0.0509
Leg           -6.136      4.164  -1.473   0.1496
---
Residual standard error: 35.79 on 35 degrees of freedom
Multiple R-squared: 0.6594, Adjusted R-squared: 0.6399
F-statistic: 33.88 on 2 and 35 DF, p-value: 6.517e-09
> summary(lm(hipcenter ~ Ht))
    Estimate Std. Error t value Pr(>|t|)
Ht   -4.2650     0.5351  -7.970 1.83e-09 ***
Multiple R-squared: 0.6383
> summary(lm(hipcenter ~ Leg))
    Estimate Std. Error t value Pr(>|t|)
Leg  -13.795      1.801  -7.658 4.59e-09 ***
Multiple R-squared: 0.6196
slope parameters change
Still Significant
No Longer Significant
Multiple Regression
Beware of strong collinearity
> drop1(mod, test="F")
Single term deletions
Model:
hipcenter ~ Ht + Leg
       Df Sum of Sq   RSS AIC F value   Pr(F)
<none>              44835 275
Ht      1      5236 50071 277  4.0877 0.05089 .
Leg     1      2781 47616 275  2.1711 0.14957
> cor(Ht,Leg)
[1] 0.9097524
> plot(Ht ~ Leg)
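Why collinearity blurs the individual tests can be seen on simulated data. The sketch below is a hypothetical illustration, not the lecture's dataset: x1, x2 and y are invented variables, the seed is arbitrary, and the response is built from x1 only.

```r
set.seed(1)                              # arbitrary seed, for reproducibility
n  <- 38                                 # same sample size as the lecture's example
x1 <- rnorm(n, mean = 170, sd = 10)      # hypothetical predictor (think of a height)
x2 <- 0.25 * x1 + rnorm(n, sd = 1)       # second predictor, strongly tied to x1
y  <- -4 * x1 + rnorm(n, sd = 35)        # response depends on x1 only

print(cor(x1, x2))                       # the two predictors are strongly correlated

alone    <- summary(lm(y ~ x1))$coefficients
together <- summary(lm(y ~ x1 + x2))$coefficients

# Standard error of the x1 slope: fitted alone versus next to its near-duplicate
se_x1 <- c(alone = alone["x1", "Std. Error"], together = together["x1", "Std. Error"])
print(se_x1)
```

The slope of x1 is estimated much less precisely once x2 is in the model, which is the mechanism behind a predictor losing its significance when a correlated one is added.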
Multiple Regression
Beware of strong collinearity
> add1( lm(hipcenter~Ht), ~. +Age +Weight +HtShoes +Seated +Arm +Thigh +Leg, test="F")
Single term additions
Model:
hipcenter ~ Ht
        Df Sum of Sq   RSS AIC F value  Pr(F)
<none>               47616 275
Age      1      2354 45262 275  1.8199 0.1860
Weight   1       196 47420 277  0.1446 0.7061
HtShoes  1        26 47590 277  0.0189 0.8913
Seated   1       102 47514 277  0.0748 0.7861
Arm      1        76 47540 277  0.0558 0.8147
Thigh    1         5 47611 277  0.0034 0.9538
Leg      1      2781 44835 275  2.1711 0.1496
No added variable significantly improves the model.
Comparing models
How can we compare models on an equal footing (regardless of the number of parameters)?
The multiple R² can only increase as more variables enter a model (because the RSS can only decrease), so it is no good for comparing models:
$$ r^2_{mult} = 1 - \frac{RSS}{TSS} $$
The adjusted R² corrects for the different number of parameters to some extent:
$$ r^2_{adj} = 1 - \frac{RSS/(n-p)}{TSS/(n-1)} $$
where n is the number of observations and p the number of estimated parameters (intercept included).
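Both R² definitions can be checked against what summary() reports. This is a minimal sketch on the built-in mtcars data (a stand-in chosen as an assumption, not the lecture's data); p counts the intercept as a parameter.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data (assumption)

n   <- nrow(mtcars)
p   <- length(coef(fit))                  # estimated parameters, intercept included
RSS <- sum(residuals(fit)^2)              # residual sum of squares
TSS <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares

r2_mult <- 1 - RSS / TSS
r2_adj  <- 1 - (RSS / (n - p)) / (TSS / (n - 1))

print(c(by_hand = r2_mult, from_summary = summary(fit)$r.squared))
print(c(by_hand = r2_adj,  from_summary = summary(fit)$adj.r.squared))
```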
Akaike Information Criterion
Invented by Hirotugu Akaike in 1971, after a few weeks of sleepless nights stressing over a conference presentation.
The AIC, originally called 'An Information Criterion', penalizes the likelihood of a model according to the number of parameters being estimated.
$$ AIC = 2k - 2\ln(L) $$
k: number of parameters
L: maximized value of the likelihood function
The lower the AIC value, the better the model.
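The definition can be applied directly to a fitted model through logLik(). The sketch below uses the built-in mtcars data as a stand-in (an assumption). Note that for an lm fit R counts the error variance as one of the k parameters, in addition to the regression coefficients.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data (assumption)

ll <- logLik(fit)                         # maximised log-likelihood, ln(L)
k  <- attr(ll, "df")                      # parameters counted by R: coefficients + error variance

aic_by_hand <- 2 * k - 2 * as.numeric(ll)
print(c(by_hand = aic_by_hand, built_in = AIC(fit)))
```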
Akaike Information Criterion
When residuals are normally & independently distributed:
$$ AIC = 2k - 2\ln(L) $$
Exact expression
Simplified expression (not equal)
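The two expressions themselves did not survive in this transcript. As a hedged reconstruction from standard least-squares results (not necessarily the slide's exact notation): with normal, independent residuals the exact value is 2k + n[ln(2π RSS/n) + 1], and dropping the terms that are constant for a given dataset leaves n ln(RSS/n) + 2p, which is the quantity extractAIC() and step() report. The sketch below checks both on the built-in mtcars data (a stand-in, an assumption), then recomputes the starting AIC of the lecture's backward search from the RSS shown in that table.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data (assumption)
n   <- nrow(mtcars)
RSS <- sum(residuals(fit)^2)
p   <- length(coef(fit))                  # regression coefficients
k   <- p + 1                              # AIC() also counts the error variance

exact      <- 2 * k + n * (log(2 * pi * RSS / n) + 1)
simplified <- 2 * p + n * log(RSS / n)

print(c(exact = exact, AIC = AIC(fit)))
print(c(simplified = simplified, extractAIC = extractAIC(fit)[2]))

# Same simplified formula with the lecture's full seatpos model:
# n = 38 drivers, 9 coefficients, RSS = 41262 (rounded, as printed in the step() output)
seatpos_start <- 38 * log(41262 / 38) + 2 * 9
print(seatpos_start)
```

The two versions differ by a constant for a given dataset, so they rank models fitted to the same response identically even though their values are not equal.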
Akaike Information Criterion
May need to be corrected for small sample sizes:
$$ AIC_c = AIC + \frac{2k(k+1)}{n-k-1} $$
A variant, the BIC, 'Bayesian Information Criterion' (Schwarz Criterion), penalizes free parameters more strongly than AIC:
$$ BIC = k\ln(n) - 2\ln(L) $$
When residuals are normally & independently distributed:
$$ AIC = 2k - 2\ln(L) $$
Exact expression
Simplified expression (not equal)
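The small-sample correction and the BIC follow from the same ingredients as the AIC. A minimal sketch on the built-in mtcars data (a stand-in, an assumption); base R has BIC() to check against, while AICc is simply computed by hand here.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data (assumption)
ll  <- logLik(fit)
n   <- nobs(fit)                          # number of observations
k   <- attr(ll, "df")                     # parameters counted by R (coefficients + error variance)

aic  <- 2 * k - 2 * as.numeric(ll)
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
bic  <- k * log(n) - 2 * as.numeric(ll)       # penalty grows with ln(n)

print(c(AIC = aic, AICc = aicc, BIC = bic))
```

Since ln(n) exceeds 2 as soon as n is 8 or more, the BIC penalty per parameter is heavier than the AIC's for all but tiny datasets.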
Multiple Regression
Searching for the "best solution"
> mod <- lm(hipcenter ~. , data=seatpos) ## starts with everything
> s.mod <- step(mod) ## by default prunes variables out ('backward')
Start: AIC=283.62
hipcenter ~ Age +Weight +HtShoes +Ht +Seated +Arm +Thigh +Leg
          Df Sum of Sq   RSS AIC
- Ht       1         5 41267 282
- Weight   1         9 41271 282
- Seated   1        29 41290 282
- HtShoes  1       108 41370 282
- Arm      1       165 41427 282
- Thigh    1       263 41525 282
<none>                 41262 284
- Age      1      2632 43894 284
- Leg      1      2655 43917 284
Deletions are ranked in increasing order of AIC.
Multiple Regression
Searching for the "best solution"
Step: AIC=281.63 ## after removing Ht
hipcenter ~ Age +Weight +HtShoes +Seated +Arm +Thigh +Leg
          Df Sum of Sq   RSS AIC
- Weight   1        11 41278 280
- Seated   1        31 41297 280
- Arm      1       161 41427 280
- Thigh    1       269 41536 280
- HtShoes  1       972 42239 281
<none>                 41267 282
- Leg      1      2665 43931 282
- Age      1      2809 44075 282
Multiple Regression
Searching for the "best solution"
Step: AIC=279.64 ## after removing Ht & Weight
hipcenter ~ Age +HtShoes +Seated +Arm +Thigh +Leg
          Df Sum of Sq   RSS AIC
- Seated   1        35 41313 278
- Arm      1       156 41434 278
- Thigh    1       285 41563 278
- HtShoes  1       975 42253 279
<none>                 41278 280
- Leg      1      2661 43939 280
- Age      1      3012 44290 280
Multiple Regression
Searching for the "best solution"
Step: AIC=277.67 ## after removing Ht, Weight & Seated
hipcenter ~ Age +HtShoes +Arm +Thigh +Leg
          Df Sum of Sq   RSS AIC
- Arm      1       172 41485 276
- Thigh    1       345 41658 276
- HtShoes  1      1853 43166 277
<none>                 41313 278
- Leg      1      2871 44184 278
- Age      1      2977 44290 278
Step: AIC=275.83 ## after removing Arm as well
hipcenter ~ Age + HtShoes + Thigh + Leg
          Df Sum of Sq   RSS AIC
- Thigh    1       473 41958 274
<none>                 41485 276
- HtShoes  1      2341 43826 276
- Age      1      3501 44986 277
- Leg      1      3592 45077 277
Multiple Regression
Searching for the "best solution"
Step: AIC=274.26 ## after removing Thigh too
hipcenter ~ Age + HtShoes + Leg
          Df Sum of Sq   RSS AIC
<none>                 41958 274
- Age      1      3109 45067 275
- Leg      1      3476 45434 275
- HtShoes  1      4219 46176 276
> summary(s.mod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 456.2137   102.8078   4.438  9.1e-05 ***
Age           0.5998     0.3779   1.587   0.1217
HtShoes      -2.3023     1.2452  -1.849   0.0732 .
Leg          -6.8297     4.0693  -1.678   0.1024
---
Residual standard error: 35.13 on 34 degrees of freedom
Multiple R-squared: 0.6813, Adjusted R-squared: 0.6531
F-statistic: 24.22 on 3 and 34 DF, p-value: 1.437e-08
Multiple Regression
Searching for the "best solution"
> anova(s.mod)
Response: hipcenter
          Df Sum Sq Mean Sq F value    Pr(>F)
Age        1   5541    5541  4.4904   0.04147 *
HtShoes    1  80664   80664 65.3647 1.996e-09 ***
Leg        1   3476    3476  2.8169   0.10244
Residuals 34  41958    1234
> drop1(s.mod, test="F")
hipcenter ~ Age + HtShoes + Leg
        Df Sum of Sq   RSS AIC F value   Pr(F)
<none>               41958 274
Age      1      3109 45067 275  2.5192 0.12173
HtShoes  1      4219 46176 276  3.4185 0.07318 .
Leg      1      3476 45434 275  2.8169 0.10244
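The F values and p-values reported by drop1() for hipcenter ~ Age + HtShoes + Leg can be recomputed from the RSS and "Sum of Sq" columns alone. The sketch below copies the rounded numbers printed in that table, so the results agree with the slide only up to rounding; the variable names are hypothetical.

```r
# Values copied from the drop1() table for hipcenter ~ Age + HtShoes + Leg
rss_full <- 41958                 # RSS of the model with nothing dropped
df_resid <- 34                    # its residual degrees of freedom
ss_drop  <- c(Age = 3109, HtShoes = 4219, Leg = 3476)   # "Sum of Sq" column

# Each deletion frees one degree of freedom, so F = (extra RSS / 1) / (RSS_full / df_resid)
f_value <- ss_drop / (rss_full / df_resid)
p_value <- pf(f_value, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(cbind(F = f_value, p = p_value), 4))
```

Unlike the sequential anova() table, every term is tested here as if it were the last one entered, which is why Age and HtShoes look far weaker in drop1() than in anova().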
Multiple Regression
> plot(hipcenter ~ s.mod$fit)
> abline(0,1, col="red")
> null <- lm(hipcenter ~1, data=seatpos) ## starting from a null model
> s.mod <- step(null, ~. +Age+Weight+Ht+HtShoes+Seated+Arm+Thigh+Leg, direction="forward")
## intermediate steps removed
Step: AIC=274.24
hipcenter ~ Ht + Leg + Age
          Df Sum of Sq   RSS AIC
<none>                 41938 274
+ Thigh    1       373 41565 276
+ Arm      1       257 41681 276
+ Seated   1       121 41817 276
+ Weight   1        47 41891 276
+ HtShoes  1        13 41925 276
No guarantee that the forward & backward searches will find the same solution.
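The comparison between the two search directions can be scripted so that the selected terms sit side by side. This sketch runs on the built-in mtcars data (a stand-in, an assumption, so it works without the faraway package); trace = 0 silences the intermediate tables.

```r
full <- lm(mpg ~ ., data = mtcars)        # everything in (stand-in data, an assumption)
null <- lm(mpg ~ 1, data = mtcars)        # intercept only

# Backward: start from the full model and prune
backward <- step(full, trace = 0)

# Forward: start from the null model and add terms, up to the full model at most
forward <- step(null,
                scope = list(lower = formula(null), upper = formula(full)),
                direction = "forward", trace = 0)

print(sort(attr(terms(backward), "term.labels")))
print(sort(attr(terms(forward),  "term.labels")))
print(c(backward = extractAIC(backward)[2], forward = extractAIC(forward)[2]))
```

Each search is greedy: it only ever takes the single best step available, so the two directions may stop at different models with slightly different AIC values.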
Multiple Regression
Another example: how is ozone concentration in the atmosphere related to solar radiation, ambient temperature & wind speed?
> ozone.pollution <- read.table("ozone.data.txt", header=T)
> dim(ozone.pollution)
[1] 111 4
> names(ozone.pollution)
[1] "rad" "temp" "wind" "ozone"
> attach(ozone.pollution)
> pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2)
> model <- lm(ozone ~ ., data=ozone.pollution)
Multiple Regression
> summary(model)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.23208   23.04204  -2.788  0.00628 **
rad           0.05980    0.02318   2.580  0.01124 *
temp          1.65121    0.25341   6.516 2.43e-09 ***
wind         -3.33760    0.65384  -5.105 1.45e-06 ***
---
Residual standard error: 21.17 on 107 degrees of freedom
Multiple R-squared: 0.6062, Adjusted R-squared: 0.5952
F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16
> drop1(model, test="F")
ozone ~ rad + temp + wind
       Df Sum of Sq   RSS AIC F value     Pr(F)
<none>              47964 682
rad     1      2984 50948 686  6.6565   0.01124 *
temp    1     19032 66996 717 42.4567 2.429e-09 ***
wind    1     11680 59644 704 26.0567 1.450e-06 ***
Multiple Regression
> plot(ozone ~ model$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration")
> abline(0,1, col="red", lwd=2)
> shapiro.test(model$res)
Shapiro-Wilk normality test
data: model$res
W = 0.9173, p-value = 3.704e-06
> plot(model, which=1)
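The same residual check can be tried without the lecture's data file. The sketch below uses R's built-in airquality data as a stand-in (an assumption; ozone.data.txt is not reproduced here, and the numbers need not match the slide exactly).

```r
# Stand-in data: complete cases of the built-in airquality dataset (an assumption)
aq  <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])
fit <- lm(Ozone ~ Solar.R + Temp + Wind, data = aq)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit))
print(sw$statistic)
print(sw$p.value)

# A small p-value means the normality assumption is doubtful;
# plot(fit, which = 1) would then show whether curvature is part of the problem
```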
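Multiple Regression
Which predictor variable is non-linearly related to ozone?
> library(car)
> cr.plot(model, rad, pch=16, main="") ## component + residual plot; recent versions of car provide crPlot() / crPlots() instead
> cr.plot(model, temp, pch=16, main="")
> cr.plot(model, wind, pch=16, main="")
[Figure: component + residual plots, panels labelled rad, temp, wind]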
Polynomial Regression
When the trend between a predictor variable and the response variable is not linear, the curvature can be "captured" using polynomials of various degrees.
> model2 <- lm(ozone ~ poly(rad,2)+poly(temp,2)+poly(wind,2))
> summary(model2)
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      42.099      1.735  24.269  < 2e-16 ***
poly(rad, 2)1    65.085     19.379   3.359  0.00110 **
poly(rad, 2)2   -16.259     20.174  -0.806  0.42213
poly(temp, 2)1  142.708     23.502   6.072 2.09e-08 ***
poly(temp, 2)2   56.043     19.138   2.928  0.00419 **
poly(wind, 2)1 -125.800     21.391  -5.881 4.99e-08 ***
poly(wind, 2)2   88.636     19.199   4.617 1.12e-05 ***
---
Residual standard error: 18.28 on 104 degrees of freedom
Multiple R-squared: 0.7148, Adjusted R-squared: 0.6984
F-statistic: 43.44 on 6 and 104 DF, p-value: < 2.2e-16
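By default poly() builds orthogonal polynomial terms rather than raw powers. The sketch below, on the built-in airquality data (a stand-in, an assumption), shows that the fitted curve is the same either way, while the orthogonal columns avoid the near-perfect correlation between a variable and its own square.

```r
aq <- na.omit(airquality[, c("Ozone", "Temp")])   # stand-in data (assumption)

orth <- lm(Ozone ~ poly(Temp, 2), data = aq)      # orthogonal polynomial terms
raw  <- lm(Ozone ~ Temp + I(Temp^2), data = aq)   # raw powers

# Same fitted values: the two parameterisations describe the same quadratic curve
same_fit <- all.equal(unname(fitted(orth)), unname(fitted(raw)))
print(same_fit)

# Orthogonal columns are uncorrelated; raw powers are almost perfectly correlated
orth_cor <- cor(poly(aq$Temp, 2))[1, 2]
raw_cor  <- cor(aq$Temp, aq$Temp^2)
print(c(orthogonal = orth_cor, raw = raw_cor))
```

Only the coefficients and their individual tests differ between the two versions, which matters when reading the t values of a table like the one above.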
Polynomial Regression
But how many degrees should we choose?
> extractAIC(model2) ## Simplified Version
[1] 7.0000 651.8105
> model3 <- lm(ozone ~ rad +poly(temp,2) +poly(wind,2))
> extractAIC(model3)
[1] 6.0000 650.5016
> model4 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,2) )
> extractAIC(model4)
[1] 7.0000 648.0394
> model5 <- lm(ozone ~ rad+poly(temp,2)+poly(wind,3) )
> extractAIC(model5)
[1] 7.0000 651.6489
> model6 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,3) )
> extractAIC(model6)
[1] 8.0000 649.0149
> extractAIC(model) ## original, strictly linear model
[1] 4.0000 681.6233
Best model: model4 (lowest AIC, 648.04)
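Comparing several candidate degrees is easier with the AIC values gathered in one sorted table. A minimal sketch on the built-in airquality data (a stand-in, an assumption, so the values will not match the slide exactly); the list names are hypothetical labels.

```r
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])   # stand-in data (assumption)

# Candidate models, from strictly linear to cubic in both Temp and Wind
candidates <- list(
  linear      = Ozone ~ Solar.R + Temp + Wind,
  temp2_wind2 = Ozone ~ Solar.R + poly(Temp, 2) + poly(Wind, 2),
  temp3_wind2 = Ozone ~ Solar.R + poly(Temp, 3) + poly(Wind, 2),
  temp3_wind3 = Ozone ~ Solar.R + poly(Temp, 3) + poly(Wind, 3)
)

# extractAIC() returns the equivalent degrees of freedom and the simplified AIC
aic_table <- t(sapply(candidates, function(f) extractAIC(lm(f, data = aq))))
colnames(aic_table) <- c("edf", "AIC")

print(aic_table[order(aic_table[, "AIC"]), ])   # best (lowest AIC) first
```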
Polynomial Regression
> plot(model4, which=1)
> shapiro.test(model4$res)
Shapiro-Wilk normality test
data: model4$res
W = 0.9309, p-value = 2.267e-05
> plot(model4, which=2)
Polynomial Regression
Finding the best transformation of our response variable
> library(MASS)
> boxcox(model4, plotit=T)
> boxcox(model4, plotit=T, lambda=seq(0,1,by=.1) )
> new.ozone <- ozone^(1/3)
> mod4 <- lm(new.ozone ~ rad +poly(temp,3) +poly(wind,2) )
> extractAIC(mod4)
[1] 7.0000 -162.6569
> shapiro.test(mod4$res)
Shapiro-Wilk normality test
data: mod4$res
W = 0.9899, p-value = 0.5855
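Instead of reading the exponent off the Box-Cox plot, the profile can be queried numerically. The sketch below assumes the MASS package is installed (it ships with standard R distributions) and uses the built-in airquality data as a stand-in for the lecture's file; lambda_hat and ci are hypothetical names.

```r
library(MASS)

aq  <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])   # stand-in data (assumption)
fit <- lm(Ozone ~ Solar.R + Temp + Wind, data = aq)

# Profile log-likelihood over a grid of exponents, without drawing the plot
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)

lambda_hat <- bc$x[which.max(bc$y)]       # exponent with the highest log-likelihood

# Approximate 95% interval: exponents within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

print(lambda_hat)
print(ci)
```

A convenient exponent inside the interval, such as 1/3 or 1/2, is usually preferred over the exact maximum because it is easier to interpret.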
Polynomial Regression
> par(mfrow=c(2,2))
> plot(mod4)
> anova(mod4)
Analysis of Variance Table
Response: new.ozone
               Df Sum Sq Mean Sq F value    Pr(>F)
rad             1 15.531  15.531  71.467 1.859e-13 ***
poly(temp, 3)   3 40.947  13.649  62.804 < 2.2e-16 ***
poly(wind, 2)   2  8.129   4.065  18.703 1.152e-07 ***
Residuals     104 22.602   0.217
> summary(mod4)
Residual standard error: 0.4662 on 104 degrees of freedom
Multiple R-squared: 0.7408, Adjusted R-squared: 0.7259
F-statistic: 49.55 on 6 and 104 DF, p-value: < 2.2e-16
Polynomial Regression
> par(mfrow=c(1,1))
> plot(new.ozone ~ mod4$fitted)
> abline(0,1, col="red", lwd=2)
> cr.plot(mod4, rad, pch=16, main="")
> cr.plot(mod4, poly(temp,3), pch=16, main="")
> cr.plot(mod4, poly(wind,2), pch=16, main="")
[Figure: panels labelled rad, temp, wind]