
Open-source tools for estimation and inference using generalized linear mixed models

Ben Bolker

McMaster University, Departments of Mathematics & Statistics and Biology

7 April 2011

Open-source GLMMs


Outline

1 Precursors
    Examples
    Definitions

2 GLMMs
    Estimation
    Inference: tests
    Inference: confidence intervals

3 Results
    Glycera
    Arabidopsis

4 Conclusions



Examples

Coral protection by symbionts

[Figure: number of blocks (y-axis, 0 to 10) against number of predation events (x-axis, 0 to 2), in panels by symbiont treatment: none, shrimp, crabs, both.]


Environmental stress: Glycera cell survival

[Figure: cell survival (scale 0.0 to 1.0) by H2S (0, 0.03, 0.1, 0.32) and copper (0, 33.3, 66.6, 133.3), in panels by osmolarity (Osm = 12.8, 22.4, 32, 41.6, 51.2) and oxygen level (normoxia, anoxia).]



Arabidopsis response to fertilization & clipping

[Figure: log(1 + fruit set) (y-axis, 0 to 5) for unclipped and clipped plants; panels: nutrient 1 and nutrient 8; color: genotype.]



Glossary: data

Fixed effects: predictors where interest is in specific levels

Random effects (RE): predictors where interest is in the distribution rather than the levels (e.g. blocks) (Gelman, 2005)

Crossed RE: multiple REs where levels of one occur in more than one level of another (e.g. block × year; cf. nested) (http://lme4.r-forge.r-project.org/book/; Pinheiro and Bates, 2000)



Data challenges

Estimation                       Computation     Inference
Small # RE levels (<5–6)         Large n         Small N (<40)
Overdispersion                   Multiple REs    Small n
Crossed REs                      Crossed REs
Spatial/temporal correlation
Unusual distributions
  (Gamma, neg. binom., ...)



Definitions

Generalized linear models

Distributions from exponential family (Poisson, binomial, Gaussian, Gamma, neg. binomial (known k), ...)

Means = linear functions of predictors on scale of link function (identity, log, logit, ...)

$Y \sim D(g^{-1}(X\beta), \phi)$

$\phi$ often set to 1 (Poisson, binomial) except for quasilikelihood approaches
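As a minimal sketch of this definition (simulated data, not one of the data sets in this talk; the names `x`, `y`, `beta` and `fit` are illustrative assumptions), a Poisson GLM with a log link can be simulated and fitted with base R:

```r
set.seed(101)
n <- 200
x <- runif(n)                          # a single continuous predictor
beta <- c(0.5, 1.2)                    # assumed "true" intercept and slope
eta <- beta[1] + beta[2] * x           # linear predictor, X %*% beta
y <- rpois(n, lambda = exp(eta))       # inverse link g^{-1} = exp; phi fixed at 1

# fit the GLM; estimates are reported on the link (log) scale
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)
```

The estimated coefficients should land near the assumed values of `beta`, up to sampling error.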



Generalized linear mixed models

Add random effects:

$Y \sim D(g^{-1}(X\beta + Zu), \phi)$

$u \sim \mathrm{MVN}(0, \Sigma)$

Synonyms: multilevel, hierarchical models
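To make the roles of X, Z and u concrete, here is a hedged base-R sketch that simulates from a Poisson GLMM with a random intercept per block. All numerical values and variable names are assumptions chosen for illustration, and Σ is taken to be a multiple of the identity.

```r
set.seed(101)
n_block <- 10
n_per <- 20
block <- factor(rep(seq_len(n_block), each = n_per))
x <- runif(n_block * n_per)

X <- model.matrix(~ x)                 # fixed-effect design matrix
Z <- model.matrix(~ 0 + block)         # RE design matrix: one indicator column per block
beta <- c(0.5, 1.2)                    # assumed fixed-effect parameters
sigma_b <- 0.7                         # assumed among-block SD, so Sigma = sigma_b^2 * I
u <- rnorm(n_block, mean = 0, sd = sigma_b)   # u ~ MVN(0, Sigma)

eta <- drop(X %*% beta + Z %*% u)      # linear predictor X beta + Z u
y <- rpois(length(eta), lambda = exp(eta))    # D = Poisson, g = log, phi = 1
```

With lme4 installed, data like these would typically be fitted with a call of the form `glmer(y ~ x + (1 | block), family = poisson)`.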



Marginal likelihood

Likelihood (Prob(data|parameters)) requires integrating over possible values of REs to get marginal likelihood, e.g.:

likelihood of $i$th obs. in block $j$ is $L(x_{ij} \mid \theta_j, \sigma^2_w)$

likelihood of a particular block mean $\theta_j$ is $L(\theta_j \mid 0, \sigma^2_b)$

marginal likelihood is $\int L(x_{ij} \mid \theta_j, \sigma^2_w) \, L(\theta_j \mid 0, \sigma^2_b) \, d\theta_j$

Balance (dispersion of RE around 0) with (dispersion of data conditional on RE)
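The integral can be checked numerically in the one case where it has a closed form. The sketch below assumes Gaussian data and a Gaussian block mean, with made-up observations and variances treated as known. It integrates the product of the two likelihoods over the block mean, taking the product over all observations in the block, and compares the result with the multivariate normal density implied by the model.

```r
x <- c(0.8, 1.3, 0.4, 1.1, 0.9)   # made-up observations from one block j
sigma_w <- 0.5                    # within-block SD (assumed known)
sigma_b <- 1.0                    # among-block SD (assumed known)

# integrand: prod_i L(x_ij | theta_j, sigma_w^2) * L(theta_j | 0, sigma_b^2)
integrand <- function(theta) {
  sapply(theta, function(th) {
    prod(dnorm(x, mean = th, sd = sigma_w)) * dnorm(th, mean = 0, sd = sigma_b)
  })
}
marg_numeric <- integrate(integrand, -Inf, Inf, rel.tol = 1e-8)$value

# closed form for the Gaussian case: x_j ~ MVN(0, sigma_w^2 I + sigma_b^2 J)
n <- length(x)
V <- sigma_w^2 * diag(n) + sigma_b^2 * matrix(1, n, n)
quad <- drop(t(x) %*% solve(V, x))
marg_exact <- exp(-0.5 * (n * log(2 * pi) + log(det(V)) + quad))

c(numeric = marg_numeric, exact = marg_exact)
```

For non-Gaussian responses no such closed form exists, which is why the approximations in the Estimation section are needed.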



Shrinkage

[Figure: Arabidopsis block estimates, mean(log) fruit set (y-axis) against genotype (x-axis, 0 to 25).]



RE examples

Coral symbionts: simple experimental blocks, RE affects intercept (overall probability of predation in block)

Glycera: applied to cells from 10 individuals, RE again affects intercept (cell survival prob.)

Arabidopsis: region (3 levels, treated as fixed) / population / genotype: affects intercept (overall fruit set) as well as treatment effects (nutrients, herbivory, interaction)



Estimation

Penalized quasi-likelihood (PQL)

alternate steps of estimating GLM using known RE variances to calculate weights; estimate LMMs given GLM fit (Breslow, 2004)

flexible (allows spatial/temporal correlations, crossed REs)

biased for small unit samples (e.g. counts < 5, binary or low-survival data)

widely used (SAS PROC GLIMMIX, R MASS:glmmPQL): in ≈ 90% of small-unit-sample cases


http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/glmmPQL.html

Laplace approximation

approximate marginal likelihood

for given β, θ (RE parameters), find conditional modes by penalized, iteratively reweighted least squares; then use second-order Taylor expansion around the conditional modes

more accurate than PQL

reasonably fast and flexible

lme4:glmer, glmmML, glmmADMB, R2ADMB (AD Model Builder)
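For a single scalar random effect the approximation can be written out as follows. This is a sketch in notation introduced here, not taken from the slides: $g(u)$ denotes the joint density of the data and the random effect for given β and θ, $\hat u$ its conditional mode, and $h$ the negative curvature of $\log g$ at the mode.

```latex
\log g(u) \approx \log g(\hat u) - \tfrac{1}{2}\, h\, (u - \hat u)^2,
\qquad
h = -\left.\frac{d^2 \log g(u)}{du^2}\right|_{u = \hat u}

L(\beta, \theta) = \int g(u)\, du
\approx g(\hat u) \int \exp\!\left(-\tfrac{1}{2}\, h\, (u - \hat u)^2\right) du
= g(\hat u)\, \sqrt{\frac{2\pi}{h}}
```

The first-order term vanishes because the expansion is taken at the mode, so only a Gaussian integral remains.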


http://stat.ethz.ch/R-manual/R-patched/library/lme4/html/glmer.html

http://cran.r-project.org/web/packages/glmmML

https://r-forge.r-project.org/projects/glmmADMB/

https://r-forge.r-project.org/projects/R2ADMB/

http://admb-project.org/

Adaptive Gauss-Hermite quadrature (AGQ)

as above, but compute additional terms in the integral (typically 8, but often up to 20)

most accurate

slowest, hence not flexible (2–3 RE at most, maybe only 1)

lme4:glmer, glmmML, repeated
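The following base-R sketch illustrates the idea for one block of binomial data with a single random intercept. It is not how lme4 or glmmML implement AGQ. The data, the "known" parameter values and the helper names (`joint`, `gauss_hermite`, `agq`) are assumptions made for illustration. Quadrature nodes are centered at the conditional mode and scaled by the curvature there, so that a one-node rule reproduces the Laplace approximation and extra nodes refine it.

```r
y     <- c(3, 5, 4)   # made-up numbers of successes in one block
size  <- 10           # trials per observation
beta0 <- -0.5         # fixed intercept, treated as known for this sketch
sigma <- 1            # RE standard deviation, treated as known

# joint density of data and random effect: f(y | u) * f(u)
joint <- function(u) {
  sapply(u, function(ui) {
    prod(dbinom(y, size, plogis(beta0 + ui))) * dnorm(ui, 0, sigma)
  })
}

# Gauss-Hermite nodes/weights for weight function exp(-x^2) (Golub-Welsch)
gauss_hermite <- function(n) {
  if (n == 1) return(list(x = 0, w = sqrt(pi)))
  i <- seq_len(n - 1)
  J <- matrix(0, n, n)
  J[cbind(i, i + 1)] <- sqrt(i / 2)
  J[cbind(i + 1, i)] <- sqrt(i / 2)
  e <- eigen(J, symmetric = TRUE)
  list(x = e$values, w = sqrt(pi) * e$vectors[1, ]^2)
}

# conditional mode of u and the (analytic) curvature of log(joint) there
u_hat <- optimize(function(u) log(joint(u)), c(-10, 10),
                  maximum = TRUE, tol = 1e-10)$maximum
p_hat <- plogis(beta0 + u_hat)
s_hat <- 1 / sqrt(length(y) * size * p_hat * (1 - p_hat) + 1 / sigma^2)

# adaptive rule: nodes shifted to u_hat and scaled by s_hat
agq <- function(n) {
  gh <- gauss_hermite(n)
  u <- u_hat + sqrt(2) * s_hat * gh$x
  sqrt(2) * s_hat * sum(gh$w * exp(gh$x^2) * joint(u))
}

brute <- integrate(joint, -Inf, Inf, rel.tol = 1e-10)$value
c(laplace = agq(1), agq8 = agq(8), agq20 = agq(20), integrate = brute)
```

With several random effects the number of evaluation points grows as a power of the number of nodes, which is the cost behind the "2–3 RE at most" remark above.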


http://stat.ethz.ch/R-manual/R-patched/library/lme4/html/glmer.html

http://cran.r-project.org/web/packages/glmmML

http://www.commanster.eu/rcode.html

Bayesian approaches

Bayesians have to do nasty integrals anyway (to normalize the posterior probability density)

various flavours of stochastic Bayesian computation (Gibbs sampling, MCMC, etc.)

generally slower but more flexible

solves many problems of assessing confidence intervals

must specify priors, assess convergence

specialized: glmmAK, MCMCglmm (Hadfield, 2010), INLA

general: glmmBUGS, R2WinBUGS, BRugs (WinBUGS/OpenBUGS), R2jags, rjags (JAGS)


http://cran.r-project.org/web/packages/glmmAK

http://cran.r-project.org/web/packages/MCMCglmm

http://www.r-inla.org/

http://cran.r-project.org/web/packages/glmmBUGS

http://cran.r-project.org/web/packages/R2WinBUGS

http://www.biostat.umn.edu/~brad/software/BRugs

http://www.openbugs.info/w/

http://cran.r-project.org/web/packages/R2jags

http://cran.r-project.org/web/packages/rjags

http://www-ice.iarc.fr/~martyn/software/jags/

Overdispersion (slight tangent)

Variance greater than expected from statistical model

Quasi-likelihood approaches: MASS:glmmPQL

Extended distributions (e.g. negative binomial): glmmADMB

Observation-level random effects (e.g. lognormal-Poisson): lme4
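One common quick check for overdispersion, sketched here on simulated data with a plain Poisson GLM rather than a GLMM, is the ratio of the Pearson chi-square statistic to the residual degrees of freedom. The data-generating values are assumptions chosen so that the counts are overdispersed.

```r
set.seed(101)
n <- 200
x <- runif(n)
mu <- exp(0.5 + 1.2 * x)
y <- rnbinom(n, mu = mu, size = 1)     # negative binomial: variance = mu + mu^2 / size

fit <- glm(y ~ x, family = poisson)    # deliberately ignores the extra variance

# Pearson chi-square / residual df; values well above 1 suggest overdispersion
disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
disp
```

For a mixed model the appropriate residual degrees of freedom are less clear-cut, so the same ratio is only a rough guide there.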


http://cran.r-project.org/web/packages/lme4

Comparison of coral symbiont results

[Figure: regression estimates (x-axis, −6 to 2) for Symbiont, Crab vs. Shrimp, and Added symbiont, compared across methods: GLM (fixed), GLM (pooled), PQL, Laplace, AGQ.]



Inference: tests

Wald tests [non-quadratic likelihood surfaces]

For OLS/linear models, likelihood surface is quadratic; only asymptotically true for GLM(M)s

Wald tests (e.g. typical results of summary) assume quadratic, based on curvature (information matrix)

always approximate, sometimes awful (Hauck-Donner effect)

do model comparison (F, score or likelihood ratio tests [LRT]) instead

But . . .



Conditional F tests [Uncertainty in scale parameters]

Model comparison: in general $-2 \log L = D = \sum \mathrm{deviance}_i / \phi$

Classical linear models: $\sum \mathrm{deviance}$ and $\phi$ are both $\chi^2$ distributed, so $D \sim F(\nu_1, \nu_2)$

Denominator degrees of freedom (df) ($\nu_2$) for complex (unbalanced, crossed, R-side effects) models? Approximations: Satterthwaite, Kenward-Roger (Kenward and Roger, 1997; Schaalje et al., 2002)

Is $D$ really $\sim F$ in these situations?

Scale parameters usually not estimated in GLMMs (Gamma, quasi-likelihood cases only).

But ...



Likelihood ratio tests [non-normality of likelihood]

What about cases where φ is specified (e.g. ≡ 1)?

in GLM(M) case, numerator is only asymptotically $\chi^2$ anyway

Bartlett corrections (Cordeiro et al., 1994; Cordeiro and Ferrari, 1998), higher-order asymptotics: cond [neither extended to GLMMs!]


http://cran.r-project.org/web/packages/cond

Tests of random effects [boundary problems]

LRT depends on null hypothesis being within the parameter's feasible range (Goldman and Whelan, 2000; Molenberghs and Verbeke, 2007)

violated e.g. by $H_0: \sigma^2 = 0$

In simple cases null distribution is a mixture of $\chi^2$ distributions (e.g. $0.5\chi^2_0 + 0.5\chi^2_1$; emdbook:dchibarsq)

ignoring this leads to conservative tests (e.g. true p-value = $\tfrac{1}{2}$ · nominal p-value)

simulation-based testing: RLRsim
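For the simple one-variance case, the mixture p-value can be computed directly in base R. The sketch below uses a made-up value of the LRT statistic and relies on the fact that the $\chi^2_0$ component is a point mass at zero, so it contributes nothing to the upper tail of a positive statistic.

```r
lrt_stat <- 2.9   # made-up LRT statistic, 2 * (logLik of full - logLik of reduced)

# naive p-value: chi-square with 1 df, ignoring the boundary
p_naive <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

# 50:50 mixture of chi^2_0 (point mass at 0) and chi^2_1
p_mix <- 0.5 * pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(naive = p_naive, mixture = p_mix)
```

The halving applies only to this simple case; with several variance parameters on the boundary the mixture weights differ, which is where simulation-based tests help.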


http://stat.ethz.ch/R-manual/R-patched/library/emdbook/html/dchibarsq.html

http://cran.r-project.org/web/packages/RLRsim

Information-theoretic approaches

Above issues apply, but less well understood (Greven, 2008; Greven and Kneib, 2010)

AIC is asymptotic

“corrected” AIC (AICc) (Hurvich and Tsai, 1989) derived for linear models, widely used but not tested elsewhere (Richards, 2005)

For comparing models with different REs, or for AICc, what is p?

AICcmodavg, MuMIn
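For reference, the usual small-sample correction can be written as a short base-R function. The helper name `aicc` is hypothetical, `k` plays the role of the parameter count p discussed above, and `n` is the number of observations. For mixed models neither count is unambiguous, which is exactly the difficulty raised here.

```r
# AICc = AIC + 2 k (k + 1) / (n - k - 1)
# k: number of estimated parameters; n: number of observations
aicc <- function(aic, k, n) {
  stopifnot(n - k - 1 > 0)             # the correction is undefined otherwise
  aic + 2 * k * (k + 1) / (n - k - 1)
}

# with AIC = 100, k = 3 and n = 20 the correction term is 2 * 3 * 4 / 16 = 1.5
aicc(aic = 100, k = 3, n = 20)
```

The correction shrinks toward zero as n grows relative to k, so AICc and AIC agree for large data sets.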


http://cran.r-project.org/web/packages/AICcmodavg

http://cran.r-project.org/web/packages/MuMIn

Parametric bootstrapping

fit null model to data

simulate “data” from null model

fit null and working model, compute likelihood difference

repeat to estimate null distribution

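> library(lme4)  # simulate() and refit() methods for fitted mixed models
> ## one parametric-bootstrap replicate of the likelihood-ratio statistic
> ## m0: fitted null model; m1: fitted working model
> pboot <- function(m0, m1) {
      s <- simulate(m0)[[1]]         # one simulated response from the null model
      L0 <- logLik(refit(m0, s))     # refit null model to the simulated response
      L1 <- logLik(refit(m1, s))     # refit working model to the same response
      2 * as.numeric(L1 - L0)        # likelihood-ratio statistic as a plain number
  }
> ## fm2 (null) and fm1 (working) are previously fitted models
> nulldist <- replicate(1000, pboot(fm2, fm1))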



Finite-sample problems

How far are we from “asymptopia”?

How much data (number of samples, number of RE levels)?

How many parameters (number of fixed-effect parameters, number of RE levels, number of RE parameters)?

Hope (#data) − (#parameters) ≫ 1, but if not?


http://www.aeaweb.org/articles.php?doi=10.1257/jep.24.2.31

Levels of focus

how many parameters does a RE take?

Somewhere between q and r (e.g., 1 and the number of levels for a variance) ... shrinkage

Conditional vs. marginal AIC

Similar issues with Deviance Information Criterion (Spiegelhalter et al., 2002)


Inference: confidence intervals

Wald tests

a sometimes-crude approximation

computationally easy, especially for many-parameter models

use Wald Z (assume “residual df” large)? Or t, guessing at the residual df?

Available from most packages
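As a hedged illustration of what a Wald Z interval is, the sketch below builds one by hand from estimates and standard errors. It uses a plain binomial GLM on simulated data, not a GLMM, so that it runs in base R; the same estimate ± quantile × standard error recipe applies to the fixed-effect table of a mixed-model fit.

```r
set.seed(101)
x <- runif(100)
y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.5 * x))
fit <- glm(y ~ x, family = binomial)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))           # from the curvature (information matrix)

# Wald Z interval: estimate +/- normal quantile * standard error
wald_z <- cbind(lower = est - qnorm(0.975) * se,
                upper = est + qnorm(0.975) * se)
wald_z
```

Base R's `confint.default(fit)` performs the same calculation; replacing `qnorm(0.975)` with `qt(0.975, df)` would give the t version once a residual df has been chosen.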



Profile confidence intervals

Tedious to program

Computationally challenging

Inherits finite-size sample problems from LRT

lme4a (in development/soon!)


http://lme4.r-project.org/

Bayesian posterior intervals

Marginal quantile or highest posterior density intervals

Computationally “free” with results of stochastic Bayesian computation

Easily extended to confidence intervals on predictions, etc.

Post hoc Markov chain Monte Carlo sampling available for some packages (glmmADMB, R2ADMB, eventually lme4a)


http://lme4.r-project.org/

Summary

Large data

    computation can be limiting
    asymptotics better

Small data

    RE variances may be poorly estimated / set to zero (informative priors can help)
    inference tricky



Glycera

Effect on survival

[Figure: estimated effects on survival (x-axis, −60 to 60) for Osm, Cu, H2S, Anoxia and all of their two-, three- and four-way interactions, compared across fits: MCMCglmm, glmer(OD:2), glmer(OD), glmmML, glmer.]



Effect on survival

[Figure: estimated effects on survival (x-axis, −20 to 30), grouped as main effects (H2S, Cu, Osm, Oxygen), 2-way interactions (Cu:H2S, Osm:Oxygen, Cu:Oxygen, Osm:H2S, H2S:Oxygen, Osm:Cu), 3-way interactions (Osm:Cu:H2S, Cu:H2S:Oxygen, Osm:H2S:Oxygen, Osm:Cu:Oxygen) and the 4-way interaction (Osm:Cu:H2S:Oxygen).]



Parametric bootstrap results

[Figure: inferred p value (y-axis) against true p value (x-axis), both from 0.02 to 0.08, in panels for Osm, Cu, H2S and Anoxia.]



Arabidopsis

Arabidopsis: AIC comparison of REs

[Figure: ∆AIC (x-axis, 0 to 6) for candidate RE structures: clip(gen) × clip(popu), clip(gen) × nut(popu), clip(gen) × int(popu), nut(gen) × clip(popu), nut(gen) × nut(popu), nut(gen) × int(popu), int(gen) × clip(popu), int(gen) × nut(popu), int(gen) × int(popu), int(popu), nointeract.]



Arabidopsis: fits with and without nutrient(genotype)

[Figure: regression estimates (x-axis, −1.0 to 1.5) for nutrient8, amdclipped, rack2, statusPetri.Plate, statusTransplant and nutrient8:amdclipped, from fits with and without nutrient(genotype).]


Primary tools

lme4: multiple/crossed REs, (profiling): fast

MCMCglmm: Bayesian, very flexible

glmmADMB: negative binomial, zero-inflated etc.

Most flexible: R2ADMB/AD Model Builder, R2WinBUGS/WinBUGS/R2jags/JAGS, INLA


http://cran.r-project.org/web/packages/lme4

http://cran.r-project.org/web/packages/glmmADMB

http://admb-project.org/

http://cran.r-project.org/web/packages/R2WinBUGS

http://cran.r-project.org/web/packages/R2jags

http://www-ice.iarc.fr/~martyn/software/jags/

Loose ends

Overdispersion and zero-inflation: MCMCglmm, glmmADMB

Spatial and temporal correlation (R-side effects): MASS:glmmPQL (sort of), GLMMarp, INLA; WinBUGS, AD Model Builder

Additive models: amer, gamm4, mgcv

Penalized methods (Jiang, 2008) (?)

Hierarchical GLMs: hglm, HGLMMM

Marginal models: geepack, gee


http://cran.r-project.org/web/packages/GLMMarp

http://cran.r-project.org/web/packages/amer

http://cran.r-project.org/web/packages/gamm4

http://cran.r-project.org/web/packages/mgcv

http://cran.r-project.org/web/packages/hglm

http://cran.r-project.org/web/packages/HGLMMM

http://cran.r-project.org/web/packages/geepack

http://cran.r-project.org/web/packages/gee

To be done

Many holes in knowledge (but what can be done?)

Faster algorithms, more parallel computation

Lots of implementation and clean-up

Benefits & costs of staying within the GLMM framework

Benefits & costs of diversity

More info: glmm.wikidot.com



Acknowledgements

Data: Josh Banta and Massimo Pigliucci (Arabidopsis); Adrian Stier and Sea McKeon (coral symbionts); Courtney Kagan, Jocelynn Ortega, David Julian (Glycera)

Co-authors: Mollie Brooks, Connie Clark, Shane Geange, John Poulsen, Hank Stevens, Jada White


References

Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1–22. Springer. ISBN 0387208623.

Cordeiro, G.M. and Ferrari, S.L.P., 1998. Journal of Statistical Planning and Inference, 71(1-2):261–269. ISSN 0378-3758. doi:10.1016/S0378-3758(98)00005-6.

Cordeiro, G.M., Paula, G.A., and Botter, D.A., 1994. International Statistical Review / Revue Internationale de Statistique, 62(2):257–274. ISSN 03067734. doi:10.2307/1403512.

Gelman, A., 2005. Annals of Statistics, 33(1):1–53. doi:10.1214/009053604000001048.

Goldman, N. and Whelan, S., 2000. Molecular Biology and Evolution, 17(6):975–978.

Greven, S., 2008. Non-Standard Problems in Inference for Additive and Linear Mixed Models. Cuvillier Verlag, Göttingen, Germany. ISBN 3867274916.

Greven, S. and Kneib, T., 2010. Biometrika, 97(4):773–789.

Hadfield, J.D., 2010. Journal of Statistical Software, 33(2):1–22. ISSN 1548-7660.

Hurvich, C.M. and Tsai, C., 1989. Biometrika, 76(2):297–307. doi:10.1093/biomet/76.2.297.

Jiang, J., 2008. The Annals of Statistics, 36(4):1669–1692. ISSN 0090-5364. doi:10.1214/07-AOS517.

Kenward, M.G. and Roger, J.H., 1997. Biometrics, 53(3):983–997.

Molenberghs, G. and Verbeke, G., 2007. The American Statistician, 61(1):22–27. doi:10.1198/000313007X171322.

Pinheiro, J.C. and Bates, D.M., 2000. Mixed-effects models in S and S-PLUS. Springer, New York. ISBN 0-387-98957-9.

Richards, S.A., 2005. Ecology, 86(10):2805–2814. doi:10.1890/05-0074.

Schaalje, G., McBride, J., and Fellingham, G., 2002. Journal of Agricultural, Biological & Environmental Statistics, 7(14):512–524.

Spiegelhalter, D.J., Best, N., et al., 2002. Journal of the Royal Statistical Society B, 64:583–640.

