lecture 3 linear models ii
DESCRIPTION
Advanced Research Skills. Lecture 3 Linear Models II. Olivier MISSA, [email protected]. Outline. "Refresher" on different types of model: Two-way Anova Ancova. Two-Way Anova. When your observations are categorized according to two different factors. - PowerPoint PPT PresentationTRANSCRIPT
![Page 2: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/2.jpg)
2
Outline
"Refresher" on different types of model:
Two-way Anova
Ancova
![Page 3: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/3.jpg)
3
When your observations are categorized according to two different factors.
Example: Blood calcium concentration in a population of birds,
10 males and 10 females randomly split in 2 groups,
and given either hormonal treatment or not.
Two-Way Anova
> dataset <- read.table("2wayAnova.csv", header=T, sep=",") > attach(dataset)
> names(dataset)[1] "plasma" "sex" "hormone"
> tapply(plasma, list(sex, hormone), length) no yesfemale 5 5male 5 5
balanced design equal replications among groupings
![Page 4: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/4.jpg)
4
Two-Way Anova
> stripchart(plasma ~ hormone, + col=c("orange","blue")
> h.ave <- tapply(plasma, hormone, mean)> h.ave no yes 13.50 30.15
> stripchart(plasma ~ sex, + col=c("red","green")
> s.ave <- tapply(plasma, sex, mean)> s.avefemale male 23.70 19.95
13.50
30.15
23.7019.95
![Page 5: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/5.jpg)
5
Two-Way Anova
> h.ave no yes 13.50 30.15 > gd.ave <- mean(plasma); gd.ave[1] 21.825
> summary(aov(plasma ~ hormone)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 56.501 5.883e-07 ***Residuals 18 441.59 24.53
> SS.h <- 10*(21.825-13.50)^2 + ## sum of squares due to 10*(21.825-30.15)^2 ## hormone treatment[1] 1386.112
> TSS <- sum((plasma-gd.ave)^2)[1] 1827.697
13.50
30.15
![Page 6: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/6.jpg)
6
Two-Way Anova
> s.avefemale male 23.70 19.95 > gd.ave[1] 21.825
> summary(aov(plasma ~ sex)) Df Sum Sq Mean Sq F value Pr(>F)sex 1 70.31 70.31 0.7202 0.4072Residuals 18 1757.39 97.63
> SS.s <- 10*(21.825-23.70)^2 + ## sum of squares due to sex 10*(21.825-19.95)^2[1] 70.3125
> TSS <- sum((plasma-gd.ave)^2)[1] 1827.697
23.7019.95
![Page 7: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/7.jpg)
7
Two-Way Anova
> summary(aov(plasma ~ hormone)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 56.501 5.883e-07 ***Residuals 18 441.59 24.53
> summary(aov(plasma ~ sex)) Df Sum Sq Mean Sq F value Pr(>F)sex 1 70.31 70.31 0.7202 0.4072Residuals 18 1757.39 97.63
> summary(aov(plasma ~ hormone + sex)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 63.4680 3.863e-07 ***sex 1 70.31 70.31 3.2195 0.09057 . Residuals 17 371.27 21.84
Analysing the two factors at the same time
can make a big difference to the outcome
![Page 8: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/8.jpg)
8
Two-Way AnovaIncluding an interaction term is important too
> summary(aov(plasma ~ hormone + sex)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 63.4680 3.863e-07 ***sex 1 70.31 70.31 3.2195 0.09057 . Residuals 17 371.27 21.84 > summary(aov(plasma ~ hormone * sex)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 60.5336 7.943e-07 ***sex 1 70.31 70.31 3.0706 0.09886 . hormone:sex 1 4.90 4.90 0.2140 0.64987 Residuals 16 366.37 22.90
![Page 9: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/9.jpg)
9
Two-Way AnovaWhat is the meaning of this interaction term ?
Assess whether the impact of the two factors
are mutually interdependent.
> gd.ave[1] 21.825
> h.diff <- h.ave - gd.ave no yes -8.325 8.325
> s.diff <- s.ave - gd.avefemale male 1.875 -1.875
> predicted <- predict(aov(plasma~hormone+sex)) 1 2 3 4 5 6 7 8 ... 15.375 15.375 15.375 15.375 15.375 11.625 11.625 11.625 ... 11 12 13 14 15 16 17 18 ... 32.025 32.025 32.025 32.025 32.025 28.275 28.275 28.275 ...
![Page 10: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/10.jpg)
10
Two-Way AnovaWhat is the meaning of the interaction term ?
Assess whether the impact of the two factors
are mutually interdependent.
> pred.ave <- tapply(predicted, list(sex, hormone), mean)> pred.ave no yesfemale 15.375 32.025male 11.625 28.275
> averages <- tapply(plasma, list(sex, hormone), mean)> averages no yesfemale 14.88 32.52male 12.12 27.78
> sum((averages-pred.ave)^2)*5[1] 4.9005 ## SS due to interaction
![Page 11: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/11.jpg)
11
Two-Way Anova
> averages no yesfemale 14.88 32.52male 12.12 27.78
> barplot(averages, beside=T, col=c("red","green"), ...)
> pred.ave no yesfemale 15.375 32.025male 11.625 28.275
> barplot(pred.ave, beside=T, col=c("red2","green3"), ...)
Graphical representations
Predicted averages
Observed averages
![Page 12: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/12.jpg)
12
Two-Way Anova
> interaction.plot(hormone, sex, plasma, col=c("red","green2"), ... )
> interaction.plot(sex, hormone, plasma, col=c("orange","blue"), ...)
Graphical representations
![Page 13: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/13.jpg)
13
Two-Way Anova as a linear model> mod <- lm(plasma ~ sex + hormone)
> summary(mod)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 15.38 1.81 8.495 1.60e-07 ***sexmale -3.75 2.09 -1.794 0.0906 . hormoneyes 16.65 2.09 7.967 3.86e-07 ***---Residual standard error: 4.673 on 17 degrees of freedomMultiple R-squared: 0.7969, Adjusted R-squared: 0.773 F-statistic: 33.34 on 2 and 17 DF, p-value: 1.307e-06
> pred.ave no yesfemale 15.375 32.025male 11.625 28.275
![Page 14: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/14.jpg)
14
Two-Way Anova as a linear model> mod2 <- lm(plasma ~ sex * hormone)
> summary(mod2)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 14.880 2.140 6.953 3.25e-06 ***sexmale -2.760 3.026 -0.912 0.375 hormoneyes 17.640 3.026 5.829 2.57e-05 ***sexmale:hormoneyes -1.980 4.280 -0.463 0.650 ---Residual standard error: 4.785 on 16 degrees of freedomMultiple R-squared: 0.7995, Adjusted R-squared: 0.762 F-statistic: 21.27 on 3 and 16 DF, p-value: 7.89e-06
> averages no yesfemale 14.88 32.52male 12.12 27.78
![Page 15: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/15.jpg)
15
Are the assumptions met ?1: Is the response variable continuous ?
2: Are the residuals normally distributed ?
> shapiro.test(mod$residuals)
Shapiro-Wilk normality test
data: mod$residuals W = 0.9788, p-value = 0.9172
> plot(mod, which=2) ## qqplot of std.residuals
Answer: YES !
YES !
![Page 16: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/16.jpg)
16
3a : Are the residuals independent and identically distributed?
> plot(mod, which=1) ## residuals vs fitted values.
> plot(mod, which=3) ## sqrt(abs(standardized(residuals)))
vs fitted values.
Answer: Female treated with hormone seem to vary more in their response
![Page 17: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/17.jpg)
17
When one predictor variable is continuous or discrete,
and the other predictor variable is categorical.
Example: Fruit production in a biennial plant,40 plants were allocated to two treatments, grazed and ungrazed.
The grazed plants were exposed to rabbits during the first two weeks of
stem elongation, then allowed to regrow protected by a fence.
> dataset2 <- read.table("ipomopsis.txt", header=T, sep="\t")
> attach(dataset2)
> names(dataset2)[1] "Root" "Fruit" "Grazing"
> str(dataset2)'data.frame': 40 obs. of 3 variables: $ Root : num 6.22 6.49 4.92 5.13 5.42 ... $ Fruit : num 59.8 61.0 14.7 19.3 34.2 ... $ Grazing: Factor w/ 2 levels "Grazed","Ungrazed": 2 2 2 ...
Ancova
![Page 18: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/18.jpg)
18
> summary(dataset2)
Root Fruit Grazing Min. : 4.43 Min. : 14.7 Grazed :20 1st Qu.: 6.08 1st Qu.: 41.1 Ungrazed:20 Median : 7.12 Median : 60.9 Mean : 7.18 Mean : 59.4 3rd Qu.: 8.51 3rd Qu.: 76.2 Max. :10.25 Max. :116.0
> stripchart(Fruit ~ Grazing, col=c("blue","green2"), ...)
> summary( lm(Fruit ~ Grazing) )Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 67.941 5.236 12.976 1.54e-15 ***GrazingUngrazed -17.060 7.404 -2.304 0.0268 * ---Residual standard error: 23.41 on 38 degrees of freedomMultiple R-squared: 0.1226, Adjusted R-squared: 0.09949 F-statistic: 5.309 on 1 and 38 DF, p-value: 0.02678
67.9
50.9
Ancova
![Page 19: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/19.jpg)
19
Ancova> plot(Fruit ~ Root, col=c("blue","green2")[as.numeric(Grazing)], ...)
## a few graphical functions omitted
> summary( lm(Fruit ~ Root * Grazing) )Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -125.173 12.811 -9.771 1.15e-11 ***Root 23.240 1.531 15.182 < 2e-16 ***GrazingUngrazed 30.806 16.842 1.829 0.0757 . Root:GrazingUngrazed 0.756 2.354 0.321 0.7500 ---Residual standard error: 6.831 on 36 degrees of freedomMultiple R-squared: 0.9293, Adjusted R-squared: 0.9234 F-statistic: 157.6 on 3 and 36 DF, p-value: < 2.2e-16
slope 23.24
slope 24.00
![Page 20: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/20.jpg)
20
Ancova> anova( lm(Fruit ~ Root * Grazing) )Analysis of Variance Table
Response: Fruit Df Sum Sq Mean Sq F value Pr(>F) Root 1 16795.0 16795.0 359.9681 < 2.2e-16 ***Grazing 1 5264.4 5264.4 112.8316 1.209e-12 ***Root:Grazing 1 4.8 4.8 0.1031 0.75 Residuals 36 1679.6 46.7 ---> anova( lm(Fruit ~ Grazing * Root) )Analysis of Variance Table
Response: Fruit Df Sum Sq Mean Sq F value Pr(>F) Grazing 1 2910.4 2910.4 62.3795 2.262e-09 ***Root 1 19148.9 19148.9 410.4201 < 2.2e-16 ***Grazing:Root 1 4.8 4.8 0.1031 0.75 Residuals 36 1679.6 46.7 ---
The variables order matters in the F table
![Page 21: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/21.jpg)
21
> drop1( lm(Fruit ~ Root * Grazing), test="F" )Single term deletions
Model:Fruit ~ Root * Grazing Df Sum of Sq RSS AIC F value Pr(F)<none> 1679.65 157.50 Root:Grazing 1 4.81 1684.46 155.61 0.1031 0.75
> drop1( lm(Fruit ~ Root + Grazing), test="F" )Single term deletions
Model:Fruit ~ Root + Grazing Df Sum of Sq RSS AIC F value Pr(F) <none> 1684.5 155.6 Root 1 19148.9 20833.4 254.2 420.62 < 2.2e-16 ***Grazing 1 5264.4 6948.8 210.3 115.63 6.107e-13 ***
Ancova A safer test of significance, dropping each term from the model
Akaike Information Criteria
![Page 22: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/22.jpg)
22
> mod3 <- lm(Fruit ~ Root + Grazing)> summary(mod3)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -127.829 9.664 -13.23 1.35e-15 ***Root 23.560 1.149 20.51 < 2e-16 ***GrazingUngrazed 36.103 3.357 10.75 6.11e-13 ***---Residual standard error: 6.747 on 37 degrees of freedomMultiple R-squared: 0.9291, Adjusted R-squared: 0.9252 F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16
> anova(mod3)Analysis of Variance Table
Response: Fruit Df Sum Sq Mean Sq F value Pr(>F) Root 1 16795.0 16795.0 368.91 < 2.2e-16 ***Grazing 1 5264.4 5264.4 115.63 6.107e-13 ***Residuals 37 1684.5 45.5
Ancova After simplifying the model, the p-values in the summary table agree with those in the F table
![Page 23: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/23.jpg)
23
Are the assumptions met ?1: Is the response variable continuous ?
2: Are the residuals normally distributed ?
> shapiro.test(mod3$residuals)
Shapiro-Wilk normality test
data: mod3$resid W = 0.9736, p-value = 0.4637
> plot(mod3, which=2) ## qqplot of std.residuals
Answer: YES !
YES !
![Page 24: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/24.jpg)
24
3a : Are the residuals independent and identically distributed?
> plot(mod3, which=1) ## residuals vs fitted values.
> plot(mod3, which=3) ## sqrt(abs(standardized(residuals)))
vs fitted values.
Answer: Looks OK
![Page 25: Lecture 3 Linear Models II](https://reader033.vdocuments.us/reader033/viewer/2022051218/56815a8b550346895dc80137/html5/thumbnails/25.jpg)
25
Any major influential points ?
> plot(mod3, which=5) ## Residuals vs Leverages.
Answer: No !