Lecture 6 Generalized Linear Models
Outline
Continue exploring the options available when the assumptions of classical linear models are untenable.
In this lecture: what can we do when observations are not continuous and the residuals are neither normally nor identically distributed?
Classical Linear Models
Defined by three assumptions:
(1) the response variable is continuous.
(2) the residuals (ε) are normally distributed and ...
(3) ... independently (3a) and identically distributed (3b).
Today, we will consider a range of options available when assumptions (1), (2) and/or (3b) are not met.
Non-continuous response variable
Many situations exist. The response variable could be:
(1) a count (number of individuals in a population, number of species in a community)
(2) a proportion (proportion "cured" after treatment, proportion of threatened species)
(3) a categorical variable (breeding/non-breeding, different phenotypes)
(4) a strictly positive value (especially time to success or time to failure)
( ... ) and so forth
Added difficulties
These types of non-continuous variables also tend to deviate from the assumptions of Normality (assumption #2) and Homoscedasticity (assumption #3b).
(1) A count variable often follows a Poisson distribution (where the variance increases linearly with the mean).
(2) A proportion often follows a Binomial distribution (where the variance reaches a maximum for intermediate values and a minimum at either end: 0% or 100%).
(3) A categorical variable tends to follow a Binomial distribution (when the variable has only two levels) or a Multinomial distribution (when the variable has more than two levels).
(4) Time to success/failure can follow an exponential distribution or an inverse Gaussian distribution (the latter having a variance increasing more quickly than the mean).
Fortunately

Many of these situations can be unified under a central framework, since all these distributions (and a few more) belong to the exponential family of distributions.

Canonical form:

$$f(y;\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}$$

Here $f$ is the probability density function (if $y$ is continuous) or the probability mass function (if $y$ is discrete), $\theta$ is the canonical (location) parameter and $\phi$ is the dispersion parameter.

Mean: $E(Y) = \mu = b'(\theta)$
Variance: $\mathrm{var}(Y) = b''(\theta)\,a(\phi)$
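The Normal distribution

Probability density function:

$$f(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$$

Canonical form:

$$f(y;\theta,\phi) = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{1}{2}\left[\frac{y^2}{\sigma^2} + \log(2\pi\sigma^2)\right]\right\}$$

Canonical (location) parameter: $\theta = \mu$, with $b(\theta) = \theta^2/2$
Dispersion parameter: $\phi = \sigma^2$, with $a(\phi) = \phi$
Mean: $E(Y) = b'(\theta) = \theta = \mu$
Variance: $\mathrm{var}(Y) = b''(\theta)\,a(\phi) = \sigma^2$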
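The Poisson distribution

Probability mass function:

$$f(y;\mu) = \frac{e^{-\mu}\mu^{y}}{y!}$$

Canonical form:

$$f(y;\theta,\phi) = \exp\left\{y\ln\mu - \mu - \ln y!\right\}$$

Canonical (location) parameter: $\theta = \ln\mu$, with $b(\theta) = \exp(\theta)$
Dispersion parameter: $\phi = 1$, with $a(\phi) = 1$
Mean: $E(Y) = b'(\theta) = \exp(\theta) = \mu$
Variance: $\mathrm{var}(Y) = b''(\theta)\,a(\phi) = \mu$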
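The Poisson canonical form can be checked numerically. The sketch below is illustrative only: the values of `mu` and `y` are arbitrary assumptions, not data from the lecture. It compares the canonical expression with R's built-in `dpois`.

```r
# Numerical check that the canonical form reproduces the Poisson pmf.
# mu and y are arbitrary illustrative values (assumptions).
mu    <- 3.5
y     <- 0:10
theta <- log(mu)                      # canonical parameter: theta = ln(mu)
b     <- exp(theta)                   # b(theta) = exp(theta)

# exp{ y * theta - b(theta) - ln(y!) }, using ln(y!) = lgamma(y + 1)
canonical <- exp(y * theta - b - lgamma(y + 1))

# Compare with the built-in Poisson probability mass function
poisson_check <- all.equal(canonical, dpois(y, lambda = mu))
poisson_check
```

Since $b'(\theta) = b''(\theta) = \exp(\theta)$, both the mean and the variance equal `mu` in this parameterisation.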
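The Binomial distribution

Probability mass function (writing $\pi$ for the probability of success):

$$f(y;n,\pi) = \binom{n}{y}\pi^{y}(1-\pi)^{n-y}$$

Canonical form:

$$f(y;\theta,\phi) = \exp\left\{y\ln\pi + (n-y)\ln(1-\pi) + \ln\binom{n}{y}\right\}$$

$$= \exp\left\{y\ln\frac{\pi}{1-\pi} + n\ln(1-\pi) + \ln\binom{n}{y}\right\}$$

Canonical (location) parameter: $\theta = \ln\dfrac{\pi}{1-\pi}$, with $b(\theta) = -n\ln(1-\pi) = n\log(1+\exp\theta)$
Dispersion parameter: $\phi = 1$, with $a(\phi) = 1$
Mean: $E(Y) = b'(\theta) = n\pi$
Variance: $\mathrm{var}(Y) = b''(\theta)\,a(\phi) = n\pi(1-\pi)$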
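The same kind of numerical check works for the Binomial canonical form. In this sketch, `n` and `prob` (the success probability) are arbitrary illustrative values, not taken from the lecture.

```r
# Numerical check that the canonical form reproduces the Binomial pmf.
# n and prob are arbitrary illustrative values (assumptions).
n     <- 12
prob  <- 0.3
y     <- 0:n
theta <- log(prob / (1 - prob))       # canonical parameter: the logit
b     <- n * log(1 + exp(theta))      # b(theta) = n log(1 + exp(theta))

# exp{ y * theta - b(theta) + ln choose(n, y) }
canonical <- exp(y * theta - b + lchoose(n, y))

# Compare with the built-in Binomial probability mass function
binomial_check <- all.equal(canonical, dbinom(y, size = n, prob = prob))
binomial_check
```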
Why is that remotely useful?
1) A single algorithm (maximum likelihood) will cope with all these situations.
2) Different types of variance can be accommodated:
When Var is constant -> Normal (Gaussian)
When Var increases linearly with the mean -> Poisson
When Var has a humped-back shape -> Binomial
When Var increases as the square of the mean -> Gamma (meaning the coefficient of variation remains constant)
When Var increases as the cube of the mean -> inverse Gaussian
3) Most types of data are thus effectively covered.
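These mean-variance relationships are stored in R's family objects, which expose a `variance` function. The sketch below tabulates them for a few arbitrary values of the mean (the `mu` values are assumptions chosen for illustration; they lie between 0 and 1 so that the binomial variance is defined).

```r
# Variance functions V(mu) of the standard GLM families in base R.
# The mu values are arbitrary illustrative choices.
mu <- c(0.2, 0.5, 0.8)

v <- data.frame(
  mu               = mu,
  gaussian         = gaussian()$variance(mu),          # constant
  poisson          = poisson()$variance(mu),           # mu
  binomial         = binomial()$variance(mu),          # mu * (1 - mu), humped
  Gamma            = Gamma()$variance(mu),             # mu^2
  inverse.gaussian = inverse.gaussian()$variance(mu)   # mu^3
)
v
```

The actual variance is this function multiplied by the dispersion parameter, which is why the Poisson and Binomial families (dispersion fixed at 1) leave no room for extra variability.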
Non-independent Observations
Two ways to cope with non-independent observations:
When the design is balanced ("equal sample size")
We can use factors to partition our observations into different "groups" and analyse them as an ANOVA or ANCOVA.
We already know how to do that (when factors are "crossed").
We just need to figure out how to cope with nested factors.
When the design is unbalanced ("uneven sample size")
Mixed-effects models are then called for.
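How does it work?
1) You need to specify the family of distribution to use.
2) You need to specify the link function.

$$g(E[y_i]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

The left-hand side applies the link function $g$ to the expected response; the right-hand side is the linear predictor.

For each type of variable, the "natural" link function to use is indicated by the canonical parameter.

| Family | Link |
|---|---|
| Normal | Identity |
| Poisson | Log |
| Binomial | Logit, $\ln\dfrac{\mu}{1-\mu}$ |
| Gamma | Inverse |
| Inv. Gaussian | Inverse square |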
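In R, each family object carries its default link, and for these five families the default is the canonical one. A minimal sketch using only base R; the test values 20 and 0.5 are arbitrary.

```r
# Default (canonical) links of the standard GLM families in base R.
fams  <- list(gaussian(), poisson(), binomial(), Gamma(), inverse.gaussian())
links <- sapply(fams, function(f) f$link)
names(links) <- sapply(fams, function(f) f$family)
links

# The link maps the mean to the linear-predictor scale, and linkinv maps it back.
eta       <- poisson()$linkfun(20)    # log(20)
back      <- poisson()$linkinv(eta)   # exp(log(20)) = 20
logit_mid <- binomial()$linkfun(0.5)  # logit of 0.5 is 0
```

R labels the inverse-square link of the inverse Gaussian family as `"1/mu^2"`.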
Count variable

This type of response variable often follows a Poisson distribution, with the variance increasing in direct proportion to the mean.
The family to use is Poisson and the canonical link is log.
Example: which environmental variables are associated with plant diversity on the Galapagos?

    > library(faraway)
    > data(gala)
    > names(gala)
    [1] "Species"   "Endemics"  "Area"      "Elevation" "Nearest"
    [6] "Scruz"     "Adjacent"
    > attach(gala)

Beware: some missing data in the original dataset have been filled in for convenience.
Johnson, M.P. & Raven, P.H. (1973) Science 179(4076): 893-895.
A summary of the dataset, followed by the model fit:

    > summary(gala)
        Species          Endemics          Area
     Min.   :  2.00   Min.   : 0.00   Min.   :   0.0100
     1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.2575
     Median : 42.00   Median :18.00   Median :   2.5900
     Mean   : 85.23   Mean   :26.10   Mean   : 261.7087
     3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.2375
     Max.   :444.00   Max.   :95.00   Max.   :4669.3200
       Elevation          Nearest          Scruz           Adjacent
     Min.   :  25.00   Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
     1st Qu.:  97.75   1st Qu.: 0.80   1st Qu.: 11.03   1st Qu.:   0.52
     Median : 192.00   Median : 3.05   Median : 46.65   Median :   2.59
     Mean   : 368.03   Mean   :10.06   Mean   : 56.98   Mean   : 261.10
     3rd Qu.: 435.25   3rd Qu.:10.03   3rd Qu.: 81.08   3rd Qu.:  59.24
     Max.   :1707.00   Max.   :47.40   Max.   :290.20   Max.   :4669.32
    > gala <- gala[,-2]  ## removing variable "Endemics"
    > modp <- glm(Species ~ ., family=poisson, data=gala)

By default, the link for a Poisson family is log.
The model summary:

    > summary(modp)
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
    Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
    Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
    Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
    Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
    Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
    ---
    (Dispersion parameter for poisson family taken to be 1)
        Null deviance: 3510.73  on 29  degrees of freedom
    Residual deviance:  716.85  on 24  degrees of freedom
    AIC: 889.68
    Number of Fisher Scoring iterations: 5

Only valid if the response variable does indeed follow a Poisson distribution (dispersion parameter equal to 1).
The residual deviance and its degrees of freedom need to be broadly similar.
The deviance is also called the G-statistic:

$$D = 2\sum_{i=1}^{n}\left[y_i\ln\left(y_i/\hat{\mu}_i\right) - \left(y_i - \hat{\mu}_i\right)\right]$$
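The Poisson deviance can be recomputed by hand and compared with what `glm` reports. The sketch below uses simulated counts, not the Galapagos data: the seed, sample size and coefficients are arbitrary assumptions. It adopts the usual convention that the term y ln(y/mu) is zero when y = 0.

```r
# Recompute the Poisson deviance by hand on simulated data.
# Seed, sample size and coefficients are arbitrary assumptions.
set.seed(1)
x <- runif(50)
y <- rpois(50, lambda = exp(0.5 + 1.2 * x))

fit    <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)                  # fitted means on the response scale

# y * ln(y / mu_hat), taken to be 0 when y == 0
log_term <- ifelse(y == 0, 0, y * log(y / mu_hat))
D <- 2 * sum(log_term - (y - mu_hat))

c(by_hand = D, from_glm = deviance(fit))
```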
Pearson's residuals can be used to estimate the dispersion:

    > (dp <- sum(residuals(modp, type="pearson")^2)/modp$df.residual)
    [1] 31.74914

This dispersion parameter ($\phi$) must be calculated:

$$\hat{\phi} = \frac{\chi^2}{n-p} = \frac{1}{n-p}\sum_i \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$

where $n-p$ is the residual degrees of freedom.

This suggests that the variance is about 31.7 times the mean.
In statistical terms this is called overdispersion.
In biological terms, it suggests that the counts are not independent from each other but are instead aggregated (i.e. clumped).
Typically, overdispersed count data follow a Negative Binomial distribution, which is not part of the exponential family of distributions when its shape parameter has to be estimated.
It won't be covered here, but it can be approximated as a quasi-Poisson (family="quasipoisson").
If you need it in your future work, you can also try glm.nb (in the MASS package).
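To see how the Pearson-based dispersion estimate relates to the quasi-Poisson family, the sketch below fits both families to simulated overdispersed counts. The data are drawn from a Negative Binomial with arbitrary, assumed parameters; this is not the Galapagos dataset.

```r
# Pearson dispersion estimate versus the quasi-Poisson dispersion.
# Simulated overdispersed counts; all parameters are arbitrary assumptions.
set.seed(42)
x <- runif(100)
y <- rnbinom(100, mu = exp(1 + x), size = 0.8)

fit_p  <- glm(y ~ x, family = poisson)
fit_qp <- glm(y ~ x, family = quasipoisson)

# Sum of squared Pearson residuals over the residual degrees of freedom
dp    <- sum(residuals(fit_p, type = "pearson")^2) / df.residual(fit_p)
dp_qp <- summary(fit_qp)$dispersion

c(by_hand = dp, quasipoisson = dp_qp)
```

The two estimates should agree closely, and the coefficients of the two fits are the same. Only the standard errors differ, since the quasi-Poisson summary scales them by the square root of the dispersion.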
The summary table can be adjusted with the dispersion parameter:

    > summary(modp, dispersion=dp)
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  3.1548079  0.2915897  10.819  < 2e-16 ***
    Area        -0.0005799  0.0001480  -3.918 8.95e-05 ***
    Elevation    0.0035406  0.0004925   7.189 6.53e-13 ***
    Nearest      0.0088256  0.0102621   0.860    0.390
    Scruz       -0.0057094  0.0035251  -1.620    0.105
    Adjacent    -0.0006630  0.0001653  -4.012 6.01e-05 ***
    ---
    (Dispersion parameter for poisson family taken to be 31.74914)
        Null deviance: 3510.73  on 29  degrees of freedom
    Residual deviance:  716.85  on 24  degrees of freedom
    AIC: 889.68

The adjusted standard errors and p-values can now be taken at face value.
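The adjustment amounts to multiplying each standard error by the square root of the dispersion. A quick arithmetic check on the intercept, using only figures printed in the two summary tables (the variable names are illustrative):

```r
# Rescaling a Poisson standard error by the estimated dispersion.
# All numbers are copied from the summary outputs shown in the lecture.
dp          <- 31.74914    # Pearson dispersion estimate
estimate    <- 3.155       # intercept estimate (as printed, rounded)
se_poisson  <- 5.175e-02   # intercept standard error with dispersion = 1

se_adjusted <- sqrt(dp) * se_poisson   # close to the 0.2915897 reported
z_adjusted  <- estimate / se_adjusted  # close to the 10.819 reported

c(se_adjusted = se_adjusted, z_adjusted = z_adjusted)
```

Small discrepancies in the last digits come from the rounding of the printed inputs.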
The drop1 function can be used to simplify the model. When you have overdispersed data, use test="F":

    > drop1(modp, test="F")
    Model:
    Species ~ Area + Elevation + Nearest + Scruz + Adjacent
              Df Deviance     AIC F value     Pr(F)
    <none>         716.85  889.68
    Area       1  1204.35 1375.18 16.3217 0.0004762 ***
    Elevation  1  2389.57 2560.40 56.0028 1.007e-07 ***
    Nearest    1   739.41  910.24  0.7555 0.3933572
    Scruz      1   813.62  984.45  3.2400 0.0844448 .
    Adjacent   1  1341.45 1512.29 20.9119 0.0001230 ***
    ---
    Warning message:
    In drop1.glm(modp, test = "F") : F test assumes 'quasipoisson' family

AIC values are dodgy when quasipoisson is used; it is safer to use the F-values and their p-values.
"Nearest" should probably be removed from the model.
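The F values in this table can be reproduced from the deviances alone. The sketch below assumes that each F statistic is the increase in deviance caused by dropping a term, divided by the residual deviance per degree of freedom of the full model. That reading is consistent with the printed output, but it is stated here as an assumption. All numbers are copied from the table.

```r
# Reproducing the drop1 F tests from the printed deviances.
dev_full <- 716.85    # residual deviance of the full model
df_res   <- 24        # its residual degrees of freedom

# Deviance of the model after dropping each term (1 df each)
dev_without <- c(Area = 1204.35, Elevation = 2389.57, Nearest = 739.41,
                 Scruz = 813.62, Adjacent = 1341.45)

scale_est <- dev_full / df_res                       # deviance-based scale
F_value   <- (dev_without - dev_full) / 1 / scale_est
p_value   <- pf(F_value, df1 = 1, df2 = df_res, lower.tail = FALSE)

cbind(F_value = round(F_value, 4), p_value = signif(p_value, 4))
```

Note that the scale used here (716.85 / 24, about 29.9) is based on the deviance, so it differs slightly from the Pearson-based estimate of 31.7.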
Removing "Nearest" and re-estimating the dispersion:

    > modp2 <- update(modp, ~. - Nearest)
    > (dp2 <- sum(residuals(modp2, type="pearson")^2)/modp2$df.residual)
    [1] 29.53501
    > summary(modp2, dispersion=dp2)
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  3.1599640  0.2805140  11.265  < 2e-16 ***
    Area        -0.0005978  0.0001396  -4.283 1.85e-05 ***
    Elevation    0.0035769  0.0004675   7.651 1.99e-14 ***
    Scruz       -0.0038565  0.0025216  -1.529    0.126
    Adjacent    -0.0007030  0.0001521  -4.621 3.82e-06 ***
    ---
    (Dispersion parameter for poisson family taken to be 29.53501)
        Null deviance: 3510.73  on 29  degrees of freedom
    Residual deviance:  739.41  on 25  degrees of freedom
    AIC: 910.24
Repeating drop1 on the reduced model:

    > drop1(modp2, test="F")
    Model:
    Species ~ Area + Elevation + Scruz + Adjacent
              Df Deviance     AIC F value     Pr(F)
    <none>         739.41  910.24
    Area       1  1290.08 1458.91 18.6184 0.0002200 ***
    Elevation  1  2525.09 2693.92 60.3749 3.981e-08 ***
    Scruz      1   818.74  987.57  2.6822 0.1140018
    Adjacent   1  1570.87 1739.70 28.1123 1.709e-05 ***
    ---
    Warning message:
    In drop1.glm(modp2, test = "F") : F test assumes 'quasipoisson' family
    > modp3 <- update(modp2, ~. - Scruz)
    > (dp3 <- sum(residuals(modp3, type="pearson")^2)/modp3$df.residual)
    [1] 30.08155
After removing "Scruz":

    > summary(modp3, dispersion=dp3)
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  2.9613109  0.2617588  11.313  < 2e-16 ***
    Area        -0.0005704  0.0001381  -4.129 3.64e-05 ***
    Elevation    0.0035891  0.0004721   7.602 2.91e-14 ***
    Adjacent    -0.0007508  0.0001524  -4.928 8.32e-07 ***
    ---
    (Dispersion parameter for poisson family taken to be 30.08155)
        Null deviance: 3510.73  on 29  degrees of freedom
    Residual deviance:  818.74  on 26  degrees of freedom
    AIC: 987.57

How good is the model? 1 - (Res. Dev. / Null Dev.) = 1 - (818.74 / 3510.73) = 76.68 %
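The proportion of deviance explained is simple arithmetic on the two deviances printed in the summary. A short check with illustrative variable names:

```r
# Proportion of deviance explained, from the printed summary of modp3.
null_dev <- 3510.73   # null deviance
res_dev  <- 818.74    # residual deviance

explained <- 1 - res_dev / null_dev
round(100 * explained, 2)   # percentage of deviance explained

# With a fitted glm object (hypothetically named fit), the same quantity is
# 1 - fit$deviance / fit$null.deviance
```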
Checking the Model

Plotting residuals vs fitted values (several options)

    > plot(residuals(modp3) ~ predict(modp3, type="response"), xlab=expression(hat(mu)), ylab="Deviance residuals")

Residuals are by default the deviance version; the fitted values are in the original response scale.
Deviance residuals against fitted values in the link scale (log for Poisson):

    > plot(residuals(modp3) ~ predict(modp3, type="link"), xlab=expression(hat(eta)), ylab="Deviance residuals")

This version is the clearest to inspect and shows a "good spread".
Response residuals against fitted values, both in the original response scale:

    > plot(residuals(modp3, type="response") ~ predict(modp3, type="response"), xlab=expression(hat(mu)), ylab="Response residuals")

This version is harder to read.
Do the residuals have the right distribution?

    > shapiro.test(residuals(modp3, type="deviance"))
            Shapiro-Wilk normality test
    data:  residuals(modp3, type = "deviance")
    W = 0.9811, p-value = 0.854