
Generalized Linear Mixed Models (GLMMs)

An alternative to the marginal model (e.g., GEE) approach to handling within-cluster correlation is to include cluster-specific random effects in the linear predictor.

• In a linear model context, we have already seen that the incorporation of random effects into the model adds considerable flexibility. Features like dependence and multiple sources of heterogeneity, which the CLM cannot handle at all, can be modelled in a variety of ways with the LMM.

The inclusion of random effects in GLMs and other nonlinear models is therefore a desirable, natural extension. However, this extension presents formidable challenges to estimation and inference.

In addition, in general a GLM does not have an error term. In a LMM, the vector of error terms can be assumed to be multivariate normal with some var-cov matrix, and that var-cov matrix can then be given a covariance structure that accounts for serial correlation and/or measurement error.

• Since the GLMM has no error term, it is harder to build measurement error and, especially, serial correlation into this type of model. Random effects in the GLMM do allow us to account for shared characteristics, though.

201


We’ll consider GLMMs only for the clustered/longitudinal data case.

Suppose we have data from $n$ independent clusters (e.g., subjects) $\mathbf{y} = (\mathbf{y}_1^T, \ldots, \mathbf{y}_n^T)^T$, where $\mathbf{y}_i = (y_{i1}, \ldots, y_{it_i})^T$. Suppose that within a cluster the responses are correlated, but that the within-cluster correlation is due solely to shared characteristics; i.e., to cluster-specific random effects $\mathbf{b}_i$.

That is, suppose that conditional on $\mathbf{b}_i$ the responses within a cluster are independent:
$$y_{i1}, \ldots, y_{it_i} \mid \mathbf{b}_i \overset{\text{ind}}{\sim} f(y_{ij} \mid \mathbf{b}_i),$$
where the conditional density $f(y_{ij} \mid \mathbf{b}_i)$ has mean $E(y_{ij} \mid \mathbf{b}_i) = \mu^c_{ij}$ that follows a generalized linear model with random effects in the linear predictor:
$$g(\mu^c_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\beta} + \mathbf{z}_{ij}^T \mathbf{b}_i,$$
or, in vector form,
$$g(\boldsymbol{\mu}^c_i) = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{Z}_i \mathbf{b}_i,$$
where $\boldsymbol{\mu}^c_i = E(\mathbf{y}_i \mid \mathbf{b}_i)$, $\mathbf{X}_i$ is a $t_i \times p$ fixed effects design matrix with rows $\mathbf{x}_{i1}^T, \ldots, \mathbf{x}_{it_i}^T$, and $\mathbf{Z}_i$ is a $t_i \times q$ random effects design matrix constructed similarly from the $\mathbf{z}_{ij}^T$'s.

The $q \times 1$ vector of random effects $\mathbf{b}_i$ is assumed to have mean $\mathbf{0}$ and variance-covariance matrix $\mathbf{D}(\boldsymbol{\theta})$, and over the $n$ clusters
$$\mathbf{b}_1, \ldots, \mathbf{b}_n \overset{\text{iid}}{\sim} f(\mathbf{b}_i; \boldsymbol{\theta}).$$

Common choices for $f(\mathbf{b}_i; \boldsymbol{\theta})$ are

1. Normal.

2. The conjugate distribution for the conditional distribution of $y_{ij}$.

3. Alternatively, $f$ can be left unspecified, and several different nonparametric approaches are possible. However, we ignore this approach in the presentation given here.

202


Maximum Likelihood:

When computationally feasible, the preferred approach to estimation in the GLMM is ML (or REML, though REML is tougher to implement or even define in this context).

For the $i$th cluster, the contribution to the likelihood is the marginal density of $\mathbf{y}_i$, $f(\mathbf{y}_i; \boldsymbol{\beta}, \boldsymbol{\theta})$.

However, this quantity is not specified directly in the model. Instead we must compute the marginal density from the conditional density of $\mathbf{y}_i$ given $\mathbf{b}_i$ and the marginal density of $\mathbf{b}_i$.

First, the joint density of $\mathbf{y}_i, \mathbf{b}_i$ is
$$f(\mathbf{y}_i, \mathbf{b}_i; \boldsymbol{\beta}, \boldsymbol{\theta}) = f(\mathbf{b}_i; \boldsymbol{\theta}) f(\mathbf{y}_i \mid \mathbf{b}_i; \boldsymbol{\beta}) = f(\mathbf{b}_i; \boldsymbol{\theta}) \prod_{j=1}^{t_i} f(y_{ij} \mid \mathbf{b}_i; \boldsymbol{\beta}) \quad \text{(conditional independence)}.$$

Therefore, the marginal density of $\mathbf{y}_i$ is
$$f(\mathbf{y}_i; \boldsymbol{\beta}, \boldsymbol{\theta}) = \int f(\mathbf{y}_i, \mathbf{b}_i; \boldsymbol{\beta}, \boldsymbol{\theta})\, d\mathbf{b}_i = \int f(\mathbf{b}_i; \boldsymbol{\theta}) \prod_{j=1}^{t_i} f(y_{ij} \mid \mathbf{b}_i; \boldsymbol{\beta})\, d\mathbf{b}_i,$$
where the integral is with respect to $\mathbf{b}_i$ and is therefore of dimension $q = \dim(\mathbf{b}_i)$.

Because the clusters are independent, the full data likelihood is the product of such terms:
$$L(\boldsymbol{\beta}, \boldsymbol{\theta}; \mathbf{y}) = \prod_{i=1}^{n} \int f(\mathbf{b}_i; \boldsymbol{\theta}) \prod_{j=1}^{t_i} f(y_{ij} \mid \mathbf{b}_i; \boldsymbol{\beta})\, d\mathbf{b}_i. \quad (\spadesuit)$$

203


• In the LMM with normal random effects, the likelihood is of this form, but it simplifies because the integral can be evaluated in closed form due to the linearity of the model.

• For simple random intercept models ($\dim(\mathbf{b}_i) = 1$), or independent random effects models ($\dim(\mathbf{b}_i) > 1$, but with the components of $\mathbf{b}_i$ assumed independent), conjugate distribution specifications for $\mathbf{b}_i$ also lead to a closed-form expression for the marginal likelihood (negative binomial model, beta-binomial model).

• Unfortunately, for many models of interest (e.g., nonlinear models with Gaussian random effects) the integral in $L(\boldsymbol{\beta}, \boldsymbol{\theta}; \mathbf{y})$ must be evaluated numerically.

• The lack of a closed-form solution for $L(\boldsymbol{\beta}, \boldsymbol{\theta}; \mathbf{y})$ is what makes GLMMs and other nonlinear mixed-effects models especially challenging.

Evaluation of the (log)likelihood in a GLMM:

To deal with the integral with respect to the random effects' distribution in ($\spadesuit$), there are three main approaches:

1. Deterministic numerical integration (quadrature).

2. Monte Carlo integration.

3. Analytic approximation of the likelihood.

204


1. Numeric Integration (quadrature):

Integrals can be interpreted as limits of weighted sums. Quadrature approximates an integral by a finite weighted sum of the integrand evaluated at a set of values of the variable being integrated out.

Choices of the weights and locations are called quadrature rules, and different rules perform better for different types of integrands.

The two most common types of quadrature used for GLMMs are Gauss-Hermite quadrature (aka ordinary Gaussian quadrature) and adaptive Gaussian quadrature.

1.a. Gauss-Hermite quadrature (GHQ):

Gauss-Hermite quadrature rules are designed to evaluate integrals of the form
$$\int_{-\infty}^{+\infty} \exp(-x^2) f(x)\, dx, \quad (*)$$
where $f(\cdot)$ is a smooth function that can be well approximated by a polynomial.

In particular, GHQ gives an $R$-term finite sum approximation of the form
$$\int_{-\infty}^{+\infty} \exp(-x^2) f(x)\, dx \approx \sum_{r=1}^{R} p_r^* f(a_r^*),$$
where the quadrature points (aka abscissas), $a_r^*$, and the quadrature weights, $p_r^*$, are designed to provide an accurate approximation in the case where $f(\cdot)$ is a polynomial.

– The number of quadrature points $R$ can be chosen, with (roughly) increasing accuracy of the approximation as $R$ increases, up to a point of diminishing (and sometimes decreasing) returns.

– If $f(\cdot)$ is a polynomial of degree $2R - 1$ or less, then the $R$-point GHQ approximation is exact.

205


– Quadrature abscissas and weights are functions of the Hermite polynomials and therefore can be computed by software routines or obtained from tables in standard mathematical references.

GHQ is useful for GLMMs with normal random effects because the "weight function" $\exp(-x^2)$ in (*) is proportional to a normal density.

For illustration, suppose that we have a relatively simple GLMM with a random intercept only:
$$g(\mu^c_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\beta} + b_i, \qquad b_i \overset{\text{iid}}{\sim} N(0, \psi).$$

Then the likelihood contribution from the $i$th cluster is (cf. ($\spadesuit$))
$$\int_{-\infty}^{+\infty} f(b_i \mid \psi) \prod_j f(y_{ij} \mid b_i)\, db_i.$$
Changing the variable of integration to a standard normal variable, $u_i = b_i/\sqrt{\psi}$, this quantity becomes
$$\int_{-\infty}^{+\infty} \phi(u_i) \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_i)\, du_i, \quad (\heartsuit)$$
where $\phi(\cdot)$ is the standard normal density function
$$\phi(u_i) = \frac{1}{\sqrt{2\pi}} \exp(-u_i^2/2).$$
Applying GHQ gives
$$\int_{-\infty}^{+\infty} \phi(u_i) \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_i)\, du_i \approx \sum_{r=1}^{R} p_r \prod_j f(y_{ij} \mid \sqrt{\psi}\, a_r), \quad (\clubsuit)$$
where
$$p_r \equiv p_r^*/\sqrt{\pi}, \qquad a_r \equiv \sqrt{2}\, a_r^*.$$
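As a concrete illustration, here is a minimal R sketch of the GHQ approximation ($\clubsuit$) to one cluster's likelihood contribution, for a hypothetical Poisson model with log link. It assumes the statmod package for the Gauss-Hermite nodes and weights; the data and parameter values are invented, and a full implementation would sum the logs of such contributions over clusters and maximize over $(\boldsymbol{\beta}, \psi)$.

```r
library(statmod)  # gauss.quad() supplies Gauss-Hermite nodes and weights

# GHQ approximation (clubsuit) to one cluster's likelihood contribution:
# sum_r p_r * prod_j f(y_ij | sqrt(psi) * a_r), Poisson responses, log link.
ghq.lik.i <- function(y, eta.fixed, psi, R = 20) {
  gh <- gauss.quad(R, kind = "hermite")  # a_r^*, p_r^* for weight exp(-x^2)
  a <- sqrt(2) * gh$nodes                # a_r = sqrt(2) * a_r^*
  p <- gh$weights / sqrt(pi)             # p_r = p_r^* / sqrt(pi)
  contrib <- sapply(a, function(ar) prod(dpois(y, exp(eta.fixed + sqrt(psi) * ar))))
  sum(p * contrib)
}

# Hypothetical cluster: 4 counts, fixed-effect linear predictor x_ij' beta = 1.2
ghq.lik.i(y = c(3, 5, 2, 4), eta.fixed = rep(1.2, 4), psi = 0.5)
```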

206


• Extension of this GHQ approach to higher-dimensional random effects can be done via cartesian product quadrature (see, e.g., Skrondal & Rabe-Hesketh, 2004). Essentially, this involves replacing our $R$-term finite sum approximation with $q$ nested finite sum approximations. If the same number of quadrature points is used for each dimension, this results in an $R^q$-term sum to approximate the likelihood for a GLMM with a $q$-dimensional random effects vector $\mathbf{b}_i$.

• GHQ can work well with small $R$ for integrands that are well approximated by polynomials. However, for GLMMs $R < 10$ can be inaccurate, and $R \geq 20$ is often required.

• Even for large $R$, it may not be possible to approximate GLMM likelihoods accurately with GHQ. Problems are worse for large cluster sizes and large random effects variances.

• GHQ inaccuracies can also lead to multimodality of the likelihood.

Many of the problems of GHQ can be alleviated by adaptive Gaussian quadrature.

207


1.b. Adaptive Gaussian Quadrature (AGQ):

• In GHQ, the points at which the integrand is evaluated are fixed and do not depend on the particular characteristics (shape) of the function to be integrated.

AGQ solves many of the problems of GHQ by adapting the abscissas and weights to the function to be integrated. Essentially, AGQ shifts and rescales the quadrature points to lie under the peak of the integrand (see Fig. 6.2 from S & R-H).

208


The integrand in ($\heartsuit$),
$$\phi(u_i) \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_i),$$
is proportional to $f(u_i \mid \mathbf{y}_i)$, which we can think of as the posterior density of the (standardized) random effect given the observed data. This density can often be well approximated by a normal density $\phi(u_i; \mu_i, \tau_i^2)$ with some cluster-specific mean $\mu_i$ and variance $\tau_i^2$.

Instead of treating the prior density $\phi(u_i)$ as the weight function for GHQ (as in ($\clubsuit$)), we rewrite the integral as
$$\int_{-\infty}^{+\infty} \phi(u_i; \mu_i, \tau_i^2)\, \frac{\phi(u_i) \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_i)}{\phi(u_i; \mu_i, \tau_i^2)}\, du_i,$$
and treat the normal density that approximates the posterior density as the weight function for quadrature.

Changing the variable of integration from $u_i$ to $z_i = (u_i - \mu_i)/\tau_i$ and applying the standard GHQ rule yields
$$f(\mathbf{y}_i) = \int_{-\infty}^{+\infty} \frac{\phi(z_i)}{\tau_i}\, \frac{\phi(\tau_i z_i + \mu_i) \prod_j f(y_{ij} \mid \sqrt{\psi}(\tau_i z_i + \mu_i))}{\exp(-z_i^2/2)/\sqrt{2\pi\tau_i^2}}\, \tau_i\, dz_i \approx \sum_{r=1}^{R} p_r\, \frac{\phi(\tau_i a_r + \mu_i) \prod_j f(y_{ij} \mid \sqrt{\psi}(\tau_i a_r + \mu_i))}{\exp(-a_r^2/2)/\sqrt{2\pi\tau_i^2}} = \sum_{r=1}^{R} \pi_{ir} \prod_j f(y_{ij} \mid \sqrt{\psi}\, \alpha_{ir}),$$
where
$$\alpha_{ir} \equiv \tau_i a_r + \mu_i$$
are shifted and rescaled abscissas with corresponding weights
$$\pi_{ir} \equiv \sqrt{2\pi}\, \tau_i \exp(a_r^2/2)\, \phi(\tau_i a_r + \mu_i)\, p_r.$$
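Continuing the hypothetical Poisson/log-link sketch from the GHQ section, a minimal R version of this AGQ computation might look as follows, with $\mu_i$ and $\tau_i^2$ obtained numerically as the mode of the integrand in ($\heartsuit$) and the inverse negative Hessian of its log at that mode (cf. the bullets below):

```r
# AGQ approximation to one cluster's likelihood contribution:
# sum_r pi_ir * prod_j f(y_ij | sqrt(psi) * alpha_ir).
agq.lik.i <- function(y, eta.fixed, psi, R = 10) {
  # log of the integrand in (heartsuit)
  logh <- function(u) dnorm(u, log = TRUE) +
    sum(dpois(y, exp(eta.fixed + sqrt(psi) * u), log = TRUE))
  # mu_i = mode; tau_i^2 = inverse negative Hessian of logh at the mode
  opt <- optim(0, function(u) -logh(u), method = "BFGS", hessian = TRUE)
  mu.i <- opt$par
  tau.i <- sqrt(1 / opt$hessian[1, 1])
  gh <- statmod::gauss.quad(R, kind = "hermite")
  a <- sqrt(2) * gh$nodes
  p <- gh$weights / sqrt(pi)
  alpha <- tau.i * a + mu.i                                    # alpha_ir
  w <- sqrt(2 * pi) * tau.i * exp(a^2 / 2) * dnorm(alpha) * p  # pi_ir
  contrib <- sapply(alpha, function(al) prod(dpois(y, exp(eta.fixed + sqrt(psi) * al))))
  sum(w * contrib)
}
```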

209


• A drawback of this approach is that it requires computation of the mean and variance, $\mu_i$ and $\tau_i^2$, of the normal approximation to the posterior. This typically requires a nontrivial, often iterative, calculation to achieve the AGQ approximation to the likelihood.

– E.g., often $\mu_i$ and $\tau_i^2$ are taken to be the mode of the integrand in ($\heartsuit$) and the curvature at that mode, which must be computed by, for example, solving the gradient equation from this integrand (via Newton-Raphson) to get $\mu_i$, and computing the inverse negative Hessian of the log of this integrand to get $\tau_i^2$.

• In simulation studies, AGQ has been found to dramatically outperform GHQ with far fewer quadrature points. It is much more computationally intensive than GHQ (i.e., slower), but it is recommended over GHQ when feasible.

• AGQ is implemented in SAS's PROC NLMIXED as well as in the Stata program gllamm.

• A big drawback of quadrature methods is that they are computationally intensive. They are especially cumbersome for high-dimensional random effects, nested random effects, or crossed random effects. NLMIXED can handle bivariate random effects at one level of clustering. Gllamm can handle multilevel models via AGQ, but it is very slow.

210


2. Monte Carlo Integration

Notice that the contribution to the GLMM likelihood for the $i$th cluster is an expectation of $f(\mathbf{y}_i \mid \mathbf{b}_i; \boldsymbol{\beta})$ with respect to the distribution of $\mathbf{b}_i$ (cf. ($\spadesuit$)):
$$\int f(\mathbf{b}_i; \boldsymbol{\theta}) \underbrace{\prod_j f(y_{ij} \mid \mathbf{b}_i; \boldsymbol{\beta})}_{= f(\mathbf{y}_i \mid \mathbf{b}_i; \boldsymbol{\beta})}\, d\mathbf{b}_i = E\{f(\mathbf{y}_i \mid \mathbf{b}_i; \boldsymbol{\beta})\}.$$

There are many simulation-based methods to evaluate expectations that can be, and have been, applied to GLMMs. However, we discuss just two of these approaches: simple Monte Carlo integration and importance sampling.

2.a. Simple Monte Carlo Integration:

Suppose we want to evaluate an expectation of a function $h$ of a random variable (possibly a vector) $X$ that takes values in $\mathcal{X}$ and has p.d.f. $f_X$:
$$E\{h(X)\} = \int_{\mathcal{X}} h(x) f_X(x)\, dx.$$
A natural approach is to draw a sample $(x_1^*, x_2^*, \ldots, x_R^*)$ from $f_X$ and approximate this expectation by the sample mean
$$\bar{h}_R = \frac{1}{R} \sum_{r=1}^{R} h(x_r^*),$$
since $\bar{h}_R$ converges almost surely to $E\{h(X)\}$ by a SLLN.

211


In addition, if $h^2$ has finite expectation under $f_X$, a CLT applies, so that $\bar{h}_R$ is approximately normal when $R$ is large, with mean $E\{h(X)\}$ and variance that can be estimated by
$$\widehat{\text{var}}(\bar{h}_R) = \frac{1}{R(R-1)} \sum_{r=1}^{R} \{h(x_r^*) - \bar{h}_R\}^2.$$

• This result allows the error of the approximation to be assessed, an advantage relative to quadrature or analytic approximations.

• In addition, the precision of the approximation can be increased by increasing $R$.

To illustrate on our simple GLMM with random intercept, we seek ($\heartsuit$):
$$E\{f(\mathbf{y}_i \mid \sqrt{\psi}\, u_i)\} = \int_{-\infty}^{+\infty} \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_i)\, \phi(u_i)\, du_i \approx \frac{1}{R} \sum_{r=1}^{R} \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_r^*),$$
where $u_1^*, \ldots, u_R^*$ are random deviates from $N(0, 1)$.
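A minimal R sketch of this simple Monte Carlo approximation, in the same hypothetical Poisson/log-link setup as the quadrature sketches; unlike those, it also returns an estimate of its own approximation error:

```r
# Simple Monte Carlo approximation to one cluster's likelihood contribution,
# with its estimated Monte Carlo standard error, sd(contrib)/sqrt(R).
mc.lik.i <- function(y, eta.fixed, psi, R = 10000) {
  u <- rnorm(R)  # u_1^*, ..., u_R^* ~ N(0, 1)
  contrib <- sapply(u, function(ur) prod(dpois(y, exp(eta.fixed + sqrt(psi) * ur))))
  c(est = mean(contrib), se = sd(contrib) / sqrt(R))
}
```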

212


2.b. Importance Sampling:

Simple Monte Carlo integration can be improved by using importance sampling to decrease the sampling variance of the approximation. This approach is also useful relative to simple MC when it is difficult to sample from $f_X$ or $f_X$ is not smooth.

In importance sampling, we choose an importance density $g$ and then write
$$E\{h(X)\} = \int_{\mathcal{X}} g(x)\, \frac{h(x) f_X(x)}{g(x)}\, dx.$$

The importance density $g$ is chosen so that

i. it is easy to draw samples from $g$;

ii. the supports of $g$ and $f_X$ are the same;

iii. it is easy to evaluate $h(x) f_X(x)/g(x)$ given $x$; and

iv. $h(x) f_X(x)/g(x)$ is bounded and smooth in the parameters over the support of $X$.

Then the importance sampling approximation is given by
$$E\{h(X)\} \approx \frac{1}{R} \sum_{r=1}^{R} \frac{h(x_r^*) f_X(x_r^*)}{g(x_r^*)},$$
where $(x_1^*, x_2^*, \ldots, x_R^*)$ is a random sample from $g$.

213


Again returning to the GLMM with random intercept, an importance sampler can be constructed by using $\phi(u_i; \mu_i, \tau_i^2)$, the normal density approximation to the posterior density $f(u_i \mid \mathbf{y}_i)$.

That is, the representation from p. 209 of the integral we're after,
$$E\{f(\mathbf{y}_i \mid b_i)\} = \int_{-\infty}^{+\infty} \phi(u_i; \mu_i, \tau_i^2)\, \frac{\phi(u_i) \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_i)}{\phi(u_i; \mu_i, \tau_i^2)}\, du_i,$$
suggests drawing a sample $(u_1^*, \ldots, u_R^*)$ from $\phi(u_i; \mu_i, \tau_i^2)$ and using the approximation
$$E\{f(\mathbf{y}_i \mid b_i)\} \approx \frac{1}{R} \sum_{r=1}^{R} \frac{\phi(u_r^*) \prod_j f(y_{ij} \mid \sqrt{\psi}\, u_r^*)}{\phi(u_r^*; \mu_i, \tau_i^2)}.$$
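In R, continuing the same hypothetical example, this importance sampler is a small change to the simple Monte Carlo sketch: draw from $\phi(u; \mu_i, \tau_i^2)$ and reweight ($\mu_i$ and $\tau_i$ are assumed to have been computed as in the AGQ sketch):

```r
# Importance sampling approximation to one cluster's likelihood contribution.
is.lik.i <- function(y, eta.fixed, psi, mu.i, tau.i, R = 10000) {
  u <- rnorm(R, mu.i, tau.i)             # draws from the importance density
  w <- dnorm(u) / dnorm(u, mu.i, tau.i)  # phi(u) / phi(u; mu_i, tau_i^2)
  contrib <- sapply(u, function(ur) prod(dpois(y, exp(eta.fixed + sqrt(psi) * ur))))
  mean(w * contrib)
}
```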

• This method is akin to AGQ, which can be viewed as a deterministic version of importance sampling. Both methods require an extra computation of $\mu_i$ and $\tau_i^2$, the mean and variance of the normal approximation to the posterior distribution.

• Importance sampling is implemented in PROC NLMIXED.

214


3. Analytic Approximation of the Likelihood:

• Most of the methods that fall into this category are based upon taking a Laplace approximation to the integral involved in the GLMM loglikelihood.

For a unidimensional integral, the Laplace approximation can be written as
$$\int_{-\infty}^{+\infty} \exp\{f(x)\}\, dx \approx \int_{-\infty}^{+\infty} \exp\{f(\tilde{x}) - (x - \tilde{x})^2/(2\tilde{\sigma}^2)\}\, dx = \int_{-\infty}^{+\infty} \exp\{f(\tilde{x})\}\, \sqrt{2\pi}\, \tilde{\sigma}\, \phi(x; \tilde{x}, \tilde{\sigma}^2)\, dx = \exp\{f(\tilde{x})\}\, \sqrt{2\pi}\, \tilde{\sigma},$$
where $\phi(x; \tilde{x}, \tilde{\sigma}^2)$ is a normal density with mean $\tilde{x}$ and variance $\tilde{\sigma}^2$, $\tilde{x}$ is the mode of $f(x)$ (and hence of $\exp\{f(x)\}$), and
$$\tilde{\sigma}^2 = -\left( \left. \frac{\partial^2 f(x)}{\partial x^2} \right|_{x = \tilde{x}} \right)^{-1}.$$

• The approximation in the first line here is obtained by approximating $f(x)$ by a second-order Taylor expansion around its mode. You'll notice that the first derivative term drops out because it is evaluated at the mode, where $f'(\tilde{x}) = 0$.

215


For the GLMM with random intercept, the integrand in the likelihood contribution for the $i$th cluster (which plays the role of $\exp\{f(x)\}$) is
$$\phi(b_i; 0, \psi) \prod_j f(y_{ij} \mid b_i) = \exp\left[ \log\left\{ \phi(b_i; 0, \psi) \prod_j f(y_{ij} \mid b_i) \right\} \right].$$

• As mentioned before, this quantity is proportional to the posterior distribution $f(b_i \mid \mathbf{y}_i)$.

Therefore, in the Laplace approximation, the quantity playing the role of $\tilde{x}$ is the posterior mode
$$\tilde{b}_i = \arg\max_{b_i}\, \phi(b_i; 0, \psi) \prod_j f(y_{ij} \mid b_i),$$
and the curvature (inverse negative Hessian) of the integrand, $\tilde{\sigma}_i^2$, plays the role of $\tilde{\sigma}^2$.

Thus, the Laplace approximation to the loglikelihood contribution from the $i$th subject becomes
$$\log f(\mathbf{y}_i; \boldsymbol{\beta}, \psi) \approx \log(\sqrt{2\pi}\, \tilde{\sigma}_i) + \log \phi(\tilde{b}_i; 0, \psi) + \sum_j \log f(y_{ij} \mid \tilde{b}_i) = \log(\tilde{\sigma}_i/\sqrt{\psi}) - \tilde{b}_i^2/(2\psi) + \sum_j \log f(y_{ij} \mid \tilde{b}_i). \quad (**)$$

• This approximation is good whenever the posterior density of $b_i$ is approximately normal, which occurs for large cluster sizes.

• The approximation is also better when the conditional response distributions are more nearly normal; e.g., for conditionally Poisson responses with large(ish) means and conditionally binomial responses with large denominators, but not for binary responses.

216


• Note that the Laplace approximation replaces the likelihood with an analytic approximation.* Unlike quadrature and Monte Carlo integration methods, this approximation can't be made arbitrarily accurate by increasing $R$.

• Thus, GLMM fitting methods of this type do not yield true MLEs. They yield maximum "approximate likelihood" estimates, and these methods are often called approximate ML approaches. In contrast, quadrature and MC integration approaches yield (essentially) true MLEs.

• The most well-known and popular approximate ML approach related to Laplace approximations is penalized quasilikelihood, or PQL.

• In PQL, the first term in the Laplace approximation (**) is ignored, and parameter estimates are obtained by maximizing the remaining terms.

• The resulting estimators have been studied via simulation and found to work reasonably well in many situations, but they exhibit bias that can be large when the conditional response distribution is binary or otherwise highly non-normal and/or when the random effects' variances are large.

• Variants of PQL, such as second-order PQL, have been proposed to reduce this bias, but they are only partially successful. Other approaches based on higher-order Laplace approximations have also been proposed to improve on the bias exhibited by PQL.

• There are several different ways to derive the PQL approach, but a commonality in all is that the approximate likelihood is obtained by taking a Taylor series expansion around the posterior mode of $\mathbf{b}_i$. Alternatively, this Taylor series expansion can be taken about the mean of $\mathbf{b}_i$, which is $\mathbf{0}$.

* Unless the integrand is proportional to a normal density, in which case the Laplace approximation is exact.

217


• Taking the expansion around $\mathbf{0}$ results in what is often called marginal quasilikelihood (MQL).

• MQL is computationally less demanding than PQL, but tends not to work as well.

Example – Epileptic Seizure Data:

We return to the epilepsy data analyzed previously with GEE methods.

Recall our notation and model from that analysis: let $y_{hij}$ be the response at time $j$ ($j = 0, 1, 2, 3, 4$) for the $i$th subject in treatment group $h$ ($h = 1$ for placebo, $h = 2$ for progabide).

The marginal model we fit with GEE was
$$\log(\mu_{hij}) = \lambda_{hj} + \beta_{hj} \log(y_{hi0}), \qquad h = 1, 2; \quad i = 1, \ldots, n_h; \quad j = 1, 2, 3, 4, \quad (\dagger)$$
where
$$\text{var}(y_{hij}) = \phi \mu_{hij},$$
and we assumed some working correlation matrix for $\text{corr}(\mathbf{y}_{hi})$.

• This model was eventually simplified by letting $\beta_{hj} = \beta$ for all $h, j$.

218


In the GLMM approach we build a model for the conditional moments given a $q$-dimensional random effects vector specific to the $i$th subject.

Model M1: Gaussian random intercept model:
$$y_{hij} \mid b_{hi} \overset{\text{ind}}{\sim} \text{Poisson}(\mu^c_{hij}),$$
where
$$\log(\mu^c_{hij}) = \lambda_{hj} + b_{hi} + \beta_{hj} \log(y_{hi0}),$$
with $b_{11}, \ldots, b_{2,n_2} \overset{\text{iid}}{\sim} N(0, \theta_1)$.
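For readers working in R rather than SAS, a roughly equivalent fit of model M1 can be sketched with the lme4 package. The data frame and variable names below (epil with count y, treatment group trt, visit, logbase $= \log(y_{hi0})$, and subject id) are hypothetical stand-ins for the actual data set:

```r
library(lme4)
# Cell-means-style parameterization: an intercept lambda_hj and a slope
# beta_hj on logbase for each (treatment, visit) cell, plus a Gaussian
# random intercept per subject; ML via AGQ with 20 quadrature points.
epil$cell <- interaction(epil$trt, epil$visit)
m1 <- glmer(y ~ 0 + cell + cell:logbase + (1 | id),
            data = epil, family = poisson, nAGQ = 20)
summary(m1)
```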

• The SAS program epileps2.sas fits model M1 first with PROC GLIMMIX using an effects-type parameterization. This parameterization has the advantage of yielding an ANOVA table that gives main and interaction effects of the treatment factors on the intercept ($\lambda_{hj}$ above) and the slope ($\beta_{hj}$ above).

• PROC GLIMMIX implements ML via AGQ (method=quad), PQL (method=MSPL), marginal quasilikelihood (method=MMPL), and approximate ML via the Laplace approximation (method=laplace). In practice, both PQL and MQL are implemented by iteratively fitting linear mixed effects models to pseudo-data. The method=MSPL and method=MMPL options iteratively fit those linear mixed models via ML estimation. Alternatively, the iterative fitting can be done via REML. This is what is done in two other methods implemented in PROC GLIMMIX: method=RSPL (a REML-like version of PQL) and method=RMPL (a REML-like version of MQL). These are not true REML methods, and it is unclear what objective function is being optimized.

219


• A good reference for PQL and MQL, and a very good paper overall, is Breslow and Clayton (1993, JASA). Also worth reading is a paper by Wolfinger and O'Connell (1993), which came out at about the same time and independently of Breslow and Clayton (1993). The two papers discuss essentially the same methodology, but using different terminology. Wolfinger and O'Connell call the methods "pseudolikelihood" approaches, which is the terminology adopted by SAS in the method= option in PROC GLIMMIX.

• In model M1, we use ML estimation, which I strongly recommend over any of the other available approaches whenever it is feasible.

• The second call to PROC GLIMMIX refits model M1 with an equivalent cell-means-type parameterization (model M1a). This model is also fit using ML via AGQ.

• The third call to PROC GLIMMIX refits the model with a cell-means parameterization once again, but this time using PQL. In this model, because ML estimation is feasible, I do not recommend using PQL, but it used to be the case that PROC GLIMMIX did not implement ML estimation, so it was necessary to do it in PROC NLMIXED, which requires starting values. One use of PQL is to get starting values for the more computationally demanding ML estimation implemented in PROC NLMIXED. In addition, there are some models for which ML is not feasible due to high-dimensional random effects. In that case PQL, or one of the related approximate ML methods, may be necessary, so it is good to know when these methods work well.

• The data here consist of counts, with a mean count that is of moderate size. The simple average of the observed counts over all post-baseline measurement occasions is 8.26. PQL is not recommended for Poisson counts with means below about 5 (Breslow's recommendation) or 7 (McCulloch et al.'s recommendation). So, PQL can be expected to perform adequately here. In fact, its results are quite similar to those from ML.

220


• To illustrate the use of PROC NLMIXED, model M1 is refit with that procedure (using ML and AGQ) as model M1c. PROC NLMIXED is written to combine nonlinear models with mixed-effects models, so it's not surprising that its syntax mixes that of PROC NLIN and PROC MIXED.

• The PARMS statement is used to specify starting values.

• In the MODEL statement, the conditional distribution of the response is specified as normal, Poisson, binary, binomial, or general (a custom loglikelihood of the user's choice, coded by the user). The parameters of the conditional distribution are given as in common statistical notation (e.g., normal(mu, sigma2), where mu and sigma2 have been coded on separate programming statements).

• The RANDOM statement specifies the distribution of the random effects that appear in the conditional distribution given on the MODEL statement. The only distribution currently implemented is the normal.

• The programming statements can be any valid SAS programming statements, as could appear in a DATA step. In epileps2.sas the programming statements are used to construct the Poisson mean, expeta.

• By default, NLMIXED uses AGQ, and it chooses the number of quadrature points adaptively, by trying successively greater numbers of points until the difference between the AGQ approximations of the loglikelihood at the initial parameter estimates for $R$ and $R - 1$ quadrature points is small (less than a controllable tolerance).

• Alternatively, the number of quadrature points can be set by the user with the QPOINTS= option on the PROC NLMIXED statement.

– Setting QPOINTS=1 yields the Laplace approximation described on pp. 215–216.

• The NOAD option on the PROC NLMIXED statement allows ordinary (nonadaptive) GHQ.

221


• Alternatively, one can specify METHOD=isamp as an option on the PROC NLMIXED statement to use importance sampling instead of the default quadrature approach.

PROC NLMIXED is a very powerful procedure. It can fit

a. GLMMs based on standard distributional assumptions on the conditional distribution of the response given a (univariate or bivariate) normal random effect (e.g., normal, Poisson, binomial);

b. a class of nonlinear mixed effects models that generalizes GLMMs by allowing the mean response to be related to covariates and parameters via arbitrary nonlinear relationships (i.e., not necessarily through a link and linear predictor);

c. more general mixed models whose conditional response distribution (given a normal random effect) is not necessarily in the exponential family. This flexibility is achieved by allowing the user to code the conditional loglikelihood.

All of the above mixed models are fit by maximizing the loglikelihood, which is computed by integrating the normal random effects out of the conditional loglikelihood. In addition to several methods to do this integration, NLMIXED offers several different maximization algorithms from which the user can choose. By leaving random effects out of the model and omitting the RANDOM statement, PROC NLMIXED can also fit

d. a wide variety of fixed effects models, including GLMs, nonlinear models, etc., using ML.

• Several optimization methods are available, all of which are gradient methods related to the Newton-Raphson and Gauss-Newton methods, except for the Nelder-Mead simplex method, which does not require derivatives.

222


Back to the example:

• After illustrating some of the features of PROCs GLIMMIX and NLMIXED, we now get back to analyzing the data. In model M2, I switch back to PROC GLIMMIX and reduce the model by allowing equal slopes on logbase across treatments. That is, we fit model M2 as:
$$\log(\mu^c_{hij}) = \lambda_{hj} + b_{hi} + \beta_j \log(y_{hi0}).$$

• A LRT can be done to test $H_0: \beta_{hj} = \beta_j$ for all $h, j$ by comparing the loglikelihoods of the two models. This yields a test statistic of $1322.0 - 1314.3 = 7.7$. Comparing to a $\chi^2(4)$ distribution yields a p-value of $p = .1032$, so we fail to reject $H_0$, and we prefer model M2.

• Then in model M3, we reduce the model further to allow equal slopes across groups and time. That is, we fit model M3 as:
$$\log(\mu^c_{hij}) = \lambda_{hj} + b_{hi} + \beta \log(y_{hi0}).$$

• A LRT of model M2 versus model M3 yields $p = .0307$, so we should not reduce to model M3. In addition, the AIC criterion prefers model M2 (1348.0 for M2, 1350.9 for M3).

• Based on model M2, we test for a main effect of treatment and an interaction between treatment and time.* These results suggest that the mean post-baseline response (averaged over time) is significantly lower in the progabide group, with an estimated multiplicative effect of $\exp(.3398) = 1.40$ on the mean seizure count associated with being in the placebo group. That is, switching from the active treatment to the placebo group would be associated with a 40% increase in a person's seizure rate.

* Note that for a fixed baseline value, these contrasts do not depend on $\beta_j$.

223


• The final call to NLMIXED fits a modified version of model M3 with random subject-specific intercepts and slopes (on logbase). Given that model M3 did not fit as well as model M2, this model is not well motivated here, but it is included just to illustrate how to fit a model with bivariate random effects.

• Model M4 is as follows:
$$\log(\mu^c_{hij}) = \lambda_{hj} + b_{1hi} + (\beta + b_{2hi}) \log(y_{hi0}),$$
where $\mathbf{b}_{11}, \ldots, \mathbf{b}_{2,n_2} \overset{\text{iid}}{\sim} N(\mathbf{0}, \mathbf{D})$, with
$$\mathbf{D} = \begin{pmatrix} \theta_1 & \theta_2 \\ \theta_2 & \theta_3 \end{pmatrix} \qquad \text{and} \qquad \mathbf{b}_{hi} = \begin{pmatrix} b_{1hi} \\ b_{2hi} \end{pmatrix}.$$

– To illustrate the increase in computational burden associated with higher-dimensional random effects, note that model M3 took 0.10 seconds of CPU time to fit, whereas model M4 took 5.67 seconds (about a 57-fold increase).

224


Missing Data in Longitudinal Studies

A very common complication in longitudinal data analysis is the problem of missing data. Most longitudinal studies are designed to collect data on every subject at each of a fixed number or set of follow-up occasions. Thus there is much greater potential for missingness on each subject than in cross-sectional studies, and much greater complexity to the overall missing data possibilities.

• Data can be missing intermittently (e.g., a subject misses one or more follow-up appointments but then returns for data collection at later occasions).

• Alternatively, subjects can drop out of the study before its end. Permanent loss of subjects from the study is called attrition. In this case, data are missing from the dropout occasion forward through time.

225


Missing data have three important consequences for longitudinal analysis:

1. Imbalance. This is only a serious problem for methods of analysis that require a common set of measurement occasions, such as profile analysis, but not for mixed models and other more flexible methods.

2. Loss of information. Clearly, the loss of measurements that were originally designed to be collected in the study will result in a loss of precision in estimation and power for testing. This information loss is directly related to the amount of missing data, but also depends somewhat on the pattern of missingness.

– This consequence can be reduced by good methods of analysis, but of course not eliminated.

3. Bias. Under certain conditions missingness can introduce serious bias, invalidating statistical inferences drawn from a longitudinal study.

– This bias can be avoided with careful modeling and analysis of the missing data mechanism. However, this modeling is not always easy, and it often depends on assumptions that are unverifiable.

As an example, consider the Six Cities Study of Air Pollution and Health that we considered several times earlier in the course. In this study subjects were enrolled in first and second grade and followed via annual examinations of pulmonary function.

• The most common cause of attrition in this study was families moving out of the study region. If this occurred because a parent changed jobs, such a missing data mechanism can be safely assumed to be unrelated to the child's pulmonary function.

226


• However, if the family moved because of pulmonary health problems experienced by their children, then this mechanism is related to the response of interest in the study and must be accounted for appropriately in the analysis to avoid bias.

Missing Data Mechanisms:

In the presence of missing data, statistical analysis is typically performed under a set of assumptions about the missing data. The validity of the analysis depends upon whether these assumptions hold.

The missing data mechanism can be thought of as a probability model for the distribution of a set of indicator variables, which take the value 1 when an observation is observed and 0 otherwise.

Suppose that a subject with a complete set of responses has an $n \times 1$ response vector
$$\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{in})^T.$$

• This is the set of potential observations, some of which may be unobserved due to missingness.

Then let $\mathbf{R}_i$ be an $n \times 1$ vector of response indicators:
$$\mathbf{R}_i = (R_{i1}, \ldots, R_{in})^T, \qquad \text{where} \quad R_{ij} = \begin{cases} 1 & \text{if } Y_{ij} \text{ is observed} \\ 0 & \text{otherwise.} \end{cases}$$

• We assume there is an $n \times p$ matrix of covariates $\mathbf{X}_i$ associated with $\mathbf{Y}_i$ that is completely observed.

Let $\mathbf{Y}^O_i$ be the sub-vector of $\mathbf{Y}_i$ corresponding to observed values and $\mathbf{Y}^M_i$ be the sub-vector of missing observations.

227


• Note that although $\mathbf{Y}_i$ is not observed for all subjects, $\mathbf{R}_i$ is, and it forms a stratification of the sample into distinct missingness patterns.

228


A useful, but somewhat confusing, distinction among missing data mechanisms is given by the following terms describing how $\mathbf{R}_i$ is related to $\mathbf{Y}_i$:

(i) Missing completely at random (MCAR);

(ii) Missing at random (MAR); and

(iii) Not missing at random (NMAR) (note, though, that some authors use the acronym MNAR, for missing not at random).

• This nomenclature is easily confused. Adding to the problem is the fact that the distinctions between these types of missingness are somewhat subtle.

• However, these labels are useful, especially because they can be used to identify classes of data under which each method of analysis is valid.

(i) MCAR. Data missing completely at random means that the probability that responses are missing is unrelated either to the specific values that, in principle, should have been obtained or to the set of observed responses.

That is, under MCAR the missing data indicators $\mathbf{R}_i$ are assumed independent of both $\mathbf{Y}^O_i$ and $\mathbf{Y}^M_i$.

• Note that MCAR does not assume that missingness is independent of the explanatory variables. That is, it is possible to have covariate-dependent missingness under MCAR.

To represent MCAR mathematically, consider the simple situation in which the potential response is bivariate, $\mathbf{Y}_i = (Y_{i1}, Y_{i2})^T$, where we assume that $Y_{i1}$ is fully observed and $Y_{i2}$ is sometimes missing. Then $R_{i2}$ is the only necessary missingness indicator. In this situation, $Y_{i2}$ is MCAR iff
$$\Pr(R_{i2} = 1 \mid Y_{i1}, Y_{i2}, \mathbf{X}_i) = \Pr(R_{i2} = 1 \mid \mathbf{X}_i).$$

229


• Note that the fact that MCAR allows missingness to depend on $\mathbf{X}_i$ is powerful. To see why, suppose that both missingness and the response variable increase with time. In this case, $R_{ij}$ and $Y_{ij}$ would be correlated. However, if we are willing to assume that, conditional on time, $R_{ij}$ and $Y_{ij}$ are independent, then under a model that includes time as a covariate, the data would be MCAR.

• Thus, if a method of analysis is used that is valid only under MCAR, it is crucial to include all explanatory variables that are predictive of both the missingness $\mathbf{R}_i$ and the potential response $\mathbf{Y}_i$. Otherwise the missingness is not MCAR and the analysis is invalid.

• The essential feature of MCAR is that the observed data can be thought of as a random sample from the complete (potential) data.

• Therefore, "complete-case" analyses, which use only the data obtained after throwing out subjects with any missing data, are valid (although wasteful) under MCAR (e.g., profile analysis).

• In addition, "available-case" analyses, which use all of the observed data, are also valid under MCAR and less wasteful (e.g., mixed models, GEE models).

(ii) MAR. Data are said to be missing at random when the probability that the responses are missing depends on the set of observed responses, but not on the missing responses, conditional on the explanatory variables in the model. That is, data are MAR iff
$$\Pr(\mathbf{R}_i \mid \mathbf{Y}^O_i, \mathbf{Y}^M_i, \mathbf{X}_i) = \Pr(\mathbf{R}_i \mid \mathbf{Y}^O_i, \mathbf{X}_i).$$

230


• Note that MAR subsumes MCAR. That is, MCAR implies MAR, but not vice versa.

• An example of MAR is when subjects drop out of a study because their value of the response at the last measurement occasion falls below (or above) some threshold.

• E.g., at his last clinic visit, an HIV patient's CD4 count exceeds a limit, causing him to withdraw and to seek alternative therapy.

• Because MAR places a much weaker assumption on the missing data mechanism than MCAR, many authors advocate using methods that are valid under MAR by default, unless there are compelling reasons to justify an MCAR assumption.

• It is possible to test whether MCAR is a reasonable assumption relative to MAR.

• LMMs where the mean and variance-covariance structure are both modeled are valid under MAR, but only if all relevant explanatory variables are included in the mean model and the var-cov structure is modeled correctly.

• GEE-1 models require MCAR because they do not assume that the var-cov structure has been correctly specified. However, GEE-2 is valid under MAR (if its assumptions are correct!).

A closely related concept to MAR is ignorability. A missing data mechanism is said to be ignorable if (1) the missing data are MAR and (2) the parameters of the complete data generating model and the missingness model are distinct.

• In a likelihood context, if we assume (2), then
$$\text{ignorable} = \text{MAR} \cup \text{MCAR} = \text{MAR}.$$

231


(iii) NMAR. NMAR is the situation in which missingness is related to the unobserved data $\mathbf{Y}^M_i$ after taking into account the observed variables $\mathbf{X}_i$ and $\mathbf{Y}^O_i$.

That is, data are NMAR if whether or not the response is missing depends upon what the response would have been had the subject been measured.

• E.g., in a study of the symptoms of depression, if a patient fails to come to clinic visits whenever her depression gets severe, the missingness is NMAR.

• Another example of NMAR would be a dieting study in which subjects come in for weekly weigh-ins, but some subjects don't come after weeks in which they did not meet their weight-loss goal.

• Because the distinction between MAR and NMAR involves the unobserved data $\mathbf{Y}^M_i$, it is impossible to confirm or reject MAR vs. NMAR. It is possible to accept or reject a particular MAR model relative to an NMAR model, but such tests are model dependent.

• NMAR analyses are typically done in situations where there is strong suspicion that the data violate MAR, such as in the two examples given above.
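The three mechanisms are easy to tell apart in a simulation. Below is a small, entirely hypothetical R sketch for the bivariate case discussed under MCAR ($Y_{i1}$ always observed, $Y_{i2}$ possibly missing), with the probability of observing $Y_{i2}$ driven by $X_i$ only (MCAR), by the observed $Y_{i1}$ (MAR), or by the possibly-unobserved $Y_{i2}$ itself (NMAR):

```r
# Hypothetical illustration of MCAR, MAR, and NMAR missingness in Y2.
set.seed(1)
n <- 1000
X  <- rnorm(n)                  # fully observed covariate
Y1 <- X + rnorm(n)              # always observed
Y2 <- X + 0.5 * Y1 + rnorm(n)   # possibly missing
expit <- function(eta) 1 / (1 + exp(-eta))  # logistic observation model
R2.mcar <- rbinom(n, 1, expit(-1 + X))      # depends on X only
R2.mar  <- rbinom(n, 1, expit(-1 + Y1))     # depends on observed Y1
R2.nmar <- rbinom(n, 1, expit(-1 + Y2))     # depends on Y2 itself
# Under NMAR the observed Y2's are a biased sample of all Y2's:
mean(Y2[R2.nmar == 1]) - mean(Y2)           # noticeably positive
```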

232


Classes of Models for Dealing with NMAR Data:

Methods for handling non-random missing data mechanisms are an area of much research. The main classes of models useful for this purpose are selection models and pattern-mixture models.

1. Selection Models.

Originally, selection modeling involved two stages of analysis that are performed either sequentially or iteratively.

The first stage involves developing a predictive model for whether or not a subject drops out of the study, using variables measured prior to dropout (e.g., at baseline). This model produces a predicted dropout probability, or propensity, for each subject, which is then used in the second stage.

At the second stage, the observed data are modeled in terms of the relevant explanatory variables, with the dropout propensity included as an additional covariate.

• More recently, mixed effects selection models have become more popular.

Mixed effects selection models typically do not use dropout propensities to model the observed response, but they do involve two models to simultaneously describe the dropout mechanism and the longitudinal development of the observed response vector.

These two models are typically linked via subject-specific random effects that appear in both models. This implies an association between $\mathbf{R}_i$ and $\mathbf{Y}_i$ that is assumed to account for the NMAR nature of the missing data.

• A nice feature of these models is that some examples of them can be fit with sophisticated mixed model software such as PROC NLMIXED.

233


2. Pattern-mixture models. In the pattern-mixture modeling approach, subjects are stratified into groups based upon their missing data patterns. Then a model is built that allows the stratification variable defining the different missingness patterns to be brought into the analysis.

• A simple approach is to include the stratification factor in the linear predictor for the observed data, as well as interactions between this factor and other predictors of interest.

• E.g., the treatment effect and/or treatment-by-time interaction may be allowed to depend on the missingness pattern.

• Often fewer than the total number of possible, or even observed, missingness patterns are used to stratify the sample. E.g., in the simplest case we can just define an indicator for whether the subject drops out of the study at any time-point.

234


Example - Treatment Effects on Schizophrenia Severity

• This example comes from Hedeker and Gibbons (1997, 2006), and we'll use it to illustrate both pattern-mixture models and selection models for handling NMAR data.

• In this study, we compare severity of illness, measured on a 7-point Likert scale (1 = normal, ..., 7 = among the most extremely ill), and how it changes over time, between schizophrenics on placebo and those treated for their illness with one of three different drugs. We analyze these data with LMMs and develop both pattern-mixture and selection models that are in the LMM class.

• First, the pattern-mixture approach. See schizpm5b.sas and its output.

• The selection model approach is illustrated in the SAS program Schz2mods.sas. Both of these SAS programs are slight modifications of programs written by Don Hedeker, which are available on his website, http://tigger.uic.edu/∼hedeker/long.html

235


The EM Algorithm

The Expectation-Maximization, or EM, algorithm is a general-purpose, iterative technique for finding ML estimates in missing information problems.

By missing information problems, we mean a wide variety of situations in which computation of the MLE would be more easily accomplished if certain additional information were available. Situations like this include (but are not limited to)

1. fitting models to datasets impacted by missing data (e.g., from longitudinal studies affected by attrition, etc.);

2. mixed effects models, where the latent random effects are the missing information;

3. finite mixture models, where each observation is assumed to be generated from one of $G$ probability distributions, with respective probabilities $p_1, \ldots, p_G$, but which particular distribution each observation belongs to is unknown;

4. datasets affected by truncation or censoring. E.g., in surveys conducted at the mall, responses to the question "How many times have you visited the mall in the past 30 days?" are truncated at 0 (zeros are necessarily unobserved). An example of censored data is survival time following heart bypass surgery when the study is concluded at a time at which some patients remain alive. Such subjects' exact survival times are unknown; only a lower bound on each survival time is known.

• In some cases, the missing information is obvious (e.g., missing data in a longitudinal study). In other cases, the problem is cast as a missing information problem simply to facilitate the use of the EM algorithm (that is, for convenience), perhaps through a creative reconceptualization.

236


Observed data: $\mathbf{y}$, with density $g(\mathbf{y}; \boldsymbol{\psi})$ and loglikelihood $\ell(\boldsymbol{\psi}; \mathbf{y})$, where $\boldsymbol{\psi} \in \Omega$.

Missing data: $\mathbf{z}$.

Complete data: $\mathbf{x} = (\mathbf{y}, \mathbf{z})$, with density $g_c(\mathbf{x}; \boldsymbol{\psi})$ and loglikelihood $\ell_c(\boldsymbol{\psi}; \mathbf{x})$.

The EM algorithm approaches the problem of solving the observed data score equation $\partial \ell(\boldsymbol{\psi}; \mathbf{y})/\partial \boldsymbol{\psi} = \mathbf{0}$ by proceeding iteratively in terms of the complete data loglikelihood $\ell_c(\boldsymbol{\psi}; \mathbf{x})$. As $\ell_c$ is unobservable, it is replaced by its conditional expectation given what is observed: $\mathbf{y}$ and the current estimate of $\boldsymbol{\psi}$.

Let $\boldsymbol{\psi}^{(0)}$ be an initial value for $\boldsymbol{\psi}$. Then on the first iteration, the E-step calculates
$$Q(\boldsymbol{\psi} \mid \boldsymbol{\psi}^{(0)}) \equiv E_{\boldsymbol{\psi}^{(0)}}\{\ell_c(\boldsymbol{\psi}; \mathbf{x}) \mid \mathbf{y}\}.$$

• Here, the expectation is taken with respect to $g(\mathbf{z} \mid \mathbf{y}; \boldsymbol{\psi}^{(0)})$, the conditional density of the missing data given the observed data, evaluated at $\boldsymbol{\psi}^{(0)}$.

The M-step then maximizes this quantity with respect to $\boldsymbol{\psi}$ to obtain an improved estimate $\boldsymbol{\psi}^{(1)}$. This process is repeated until convergence.

237


EM Algorithm:

Step 0: Specify a starting value $\boldsymbol{\psi}^{(0)}$ and a convergence criterion. Set $h = 0$.

Step 1 (E-step): Calculate $Q(\boldsymbol{\psi} \mid \boldsymbol{\psi}^{(h)})$ as defined above. This requires evaluation of the conditional expectation of the unobservables given the observables.

Step 2 (M-step): Find the value $\boldsymbol{\psi}^{(h+1)}$ that maximizes $Q(\boldsymbol{\psi} \mid \boldsymbol{\psi}^{(h)})$. That is, find $\boldsymbol{\psi}^{(h+1)}$ such that
$$Q(\boldsymbol{\psi}^{(h+1)} \mid \boldsymbol{\psi}^{(h)}) \geq Q(\boldsymbol{\psi} \mid \boldsymbol{\psi}^{(h)})$$
for all $\boldsymbol{\psi} \in \Omega$.

Step 3: If the convergence criterion is met (e.g., if $\ell(\boldsymbol{\psi}^{(h+1)}) - \ell(\boldsymbol{\psi}^{(h)}) < \delta$, where $\delta$ is a preselected small constant), stop. Otherwise, set $h = h + 1$ and return to Step 1.

• It can be shown that an EM iteration cannot decrease the loglikelihood. That is,
$$\ell(\boldsymbol{\psi}^{(h+1)}) \geq \ell(\boldsymbol{\psi}^{(h)}), \quad h = 0, 1, 2, \ldots$$
Therefore, if the loglikelihood is bounded above, the EM algorithm will necessarily converge to a maximum, although it is possible for it to be a local maximum if the loglikelihood is not unimodal.

• In fact, the M-step need not be a true maximization. All that is necessary in the M-step is to find a value $\boldsymbol{\psi}^{(h+1)}$ that increases (but does not necessarily maximize) $Q(\boldsymbol{\psi} \mid \boldsymbol{\psi}^{(h)})$. That is, any $\boldsymbol{\psi}^{(h+1)}$ can be used that satisfies
$$Q(\boldsymbol{\psi}^{(h+1)} \mid \boldsymbol{\psi}^{(h)}) \geq Q(\boldsymbol{\psi}^{(h)} \mid \boldsymbol{\psi}^{(h)}).$$

– This relaxation of the M-step yields what is called a generalized EM algorithm.
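In code, the algorithm is just a loop around problem-specific E- and M-step routines. A generic R skeleton (estep(), mstep(), and loglik() are hypothetical placeholders to be supplied by the user; convergence is monitored via the observed data loglikelihood, assuming it can be evaluated):

```r
# Generic EM skeleton. estep() returns whatever conditional expectations the
# M-step needs; mstep() maximizes (or just increases) Q given them.
em <- function(psi0, estep, mstep, loglik, delta = 1e-8, maxit = 500) {
  psi <- psi0
  ll <- loglik(psi)
  for (h in 1:maxit) {
    psi.new <- mstep(estep(psi))
    ll.new <- loglik(psi.new)
    done <- (ll.new - ll < delta)  # ascent property: ll.new >= ll
    psi <- psi.new
    ll <- ll.new
    if (done) break
  }
  list(psi = psi, loglik = ll, iterations = h)
}
```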

238


Properties of the EM Algorithm:

1. It is numerically stable, with each iteration increasing the likelihood (the ascent property).

2. Under fairly general conditions, the EM algorithm has reliable global convergence. Excluding a very unlucky choice of $\boldsymbol{\psi}^{(0)}$ or local pathology of the loglikelihood surface, it almost always converges to a (possibly local) maximum.

3. It is typically easily implemented because, often by construction, the required expectation with respect to the missing data and (especially) the maximization of the complete data loglikelihood are simple, standard calculations that are more easily accomplished than those required when working with the observed data loglikelihood directly.

4. It is typically easy to program (a consequence of 3) and often requires less memory and storage space than standard ML calculations. Often by construction, the complete data have been chosen so that the M-step leads to closed-form, or at least standard, calculations, so existing algorithms can be taken advantage of.

5. The analytical work required is usually (but not always) relatively simple. In particular, the E-step expectation is often straightforward.

6. Because the calculations in each iteration are often closed-form or standard calculations, each iteration is typically computationally cheap (i.e., fast). This (partially) offsets the relatively large number of iterations typically required.

7. When the likelihood is easily evaluated, the ascent property can be monitored to examine convergence and to check for programming errors.

8. The EM algorithm can be used to predict values of the "missing" data.

239


Limitations/Drawbacks/Criticisms of the EM Algorithm:

a. Unlike NR or Fisher scoring, it does not automatically yield an estimate of $\text{var}(\hat{\boldsymbol{\psi}})$.

– However, methods have been developed to yield such an estimate with a minimum of additional calculation.

b. The EM algorithm can be slow, sometimes very slow, to converge.

– Again, many methods have been developed to deal with this issue, but these methods have drawbacks and/or limited efficacy, in general.

c. It does not guarantee convergence to a global maximum.

– However, few methods do. E.g., the same criticism applies to gradient methods such as NR.

d. The E-step can be analytically intractable.

– In many such cases numerical and Monte Carlo methods can help, or the complete data formulation can be altered to make the calculations more convenient.

• So, the EM algorithm is a very powerful and useful tool, and many of its deficiencies can be (at least partially) rectified. However, there are certainly many situations where it is not helpful.

240


Example - Genetic Linkage Data

Rao (1965) presents data in which a physical characteristic with 4 possible values is measured on 197 animals. The underlying genetics imply that
$$\mathbf{y} = (y_1, y_2, y_3, y_4)^T \sim \text{Mult}(n, \boldsymbol{\pi}), \quad \text{where } n = 197 \text{ and } \boldsymbol{\pi} = \left( \tfrac{2+\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4} \right)^T.$$

Observed data: $\mathbf{y} = (125, 18, 20, 34)^T$.

Observed data loglikelihood: $\ell(\theta; \mathbf{y}) = \log\{(2+\theta)^{y_1} (1-\theta)^{y_2+y_3} \theta^{y_4}\}$ (up to an additive constant).

Complete data formulation: Suppose $y_1$ consists of animals of two underlying genotypes, occurring with probabilities $2/4$ and $\theta/4$, and suppose we knew those underlying genotypes. That is, suppose we knew $x_1, x_2$ such that $x_1 + x_2 = y_1$, and let $x_3 = y_2$, $x_4 = y_3$, $x_5 = y_4$, so that
$$\mathbf{x} = (x_1, \ldots, x_5)^T \sim \text{Mult}(197, \tilde{\boldsymbol{\pi}}), \qquad \tilde{\boldsymbol{\pi}} = \left( \tfrac{2}{4}, \tfrac{\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4} \right)^T.$$

• Here the missing data is $z = x_2$ (it could be taken to be either $x_1$ or $x_2$), and $\mathbf{x}$ is the complete data vector.

Complete data loglikelihood: if $z$ were observed, we would work with
$$\ell_c(\theta; \mathbf{x}) = \log\{\theta^{x_2} (1-\theta)^{x_3} (1-\theta)^{x_4} \theta^{x_5}\} = \log\{\theta^{x_2+x_5} (1-\theta)^{x_3+x_4}\},$$
a binomial-form loglikelihood, which would yield
$$\hat{\theta} = \frac{x_2 + x_5}{x_2 + x_3 + x_4 + x_5}. \quad (*)$$

241


E-step:
$$\begin{aligned} Q(\theta \mid \theta^{(h)}) &= E_{\theta^{(h)}}\{\ell_c(\theta; \mathbf{x}) \mid \mathbf{y}\} = \int \ell_c(\theta; \mathbf{x})\, g(x_2 \mid \mathbf{y}; \theta^{(h)})\, dx_2 \\ &= E_{\theta^{(h)}}\{(x_2 + x_5) \log \theta + (x_3 + x_4) \log(1 - \theta)\} \\ &= \{E(x_2 \mid \mathbf{y}; \theta^{(h)}) + x_5\} \log \theta + (x_3 + x_4) \log(1 - \theta) \\ &= \log\left\{\theta^{E(x_2 \mid \mathbf{y}; \theta^{(h)}) + x_5} (1 - \theta)^{x_3 + x_4}\right\}. \end{aligned}$$

M-step: The $\theta^{(h+1)}$ that maximizes $Q(\theta \mid \theta^{(h)})$ is the solution to
$$0 = \frac{\partial}{\partial \theta} Q(\theta \mid \theta^{(h)}) = \frac{E(x_2 \mid \mathbf{y}; \theta^{(h)}) + x_5}{\theta} - \frac{x_3 + x_4}{1 - \theta}$$
$$\Rightarrow \quad \theta^{(h+1)} = \frac{E(x_2 \mid \mathbf{y}; \theta^{(h)}) + x_5}{E(x_2 \mid \mathbf{y}; \theta^{(h)}) + x_3 + x_4 + x_5}$$
(cf. (*)).

Since we know $x_1 + x_2 = y_1 = 125$, this implies
$$x_2 \mid \mathbf{y} \sim \text{Bin}\left(125, \frac{\theta/4}{\tfrac{1}{2} + \theta/4}\right) = \text{Bin}\left(125, \frac{\theta}{2 + \theta}\right)$$
$$\Rightarrow \quad E(x_2 \mid \mathbf{y}; \theta^{(h)}) = \frac{125\, \theta^{(h)}}{2 + \theta^{(h)}}$$
$$\Rightarrow \quad \theta^{(h+1)} = \frac{\frac{125\, \theta^{(h)}}{2 + \theta^{(h)}} + 34}{\frac{125\, \theta^{(h)}}{2 + \theta^{(h)}} + 18 + 20 + 34}.$$
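The resulting iteration is only a few lines of R; starting from $\theta^{(0)} = 0.5$ it converges rapidly to the MLE, $\hat{\theta} \approx 0.6268$ (the course file EMexamp.R, referenced below, presumably does something similar):

```r
# EM iterations for the genetic linkage example.
theta <- 0.5  # theta^(0)
for (h in 1:100) {
  Ex2 <- 125 * theta / (2 + theta)                # E-step: E(x2 | y; theta^(h))
  theta.new <- (Ex2 + 34) / (Ex2 + 18 + 20 + 34)  # M-step
  if (abs(theta.new - theta) < 1e-10) break
  theta <- theta.new
}
theta  # approximately 0.6268
```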

• See EMexamp.R

242


Application of EM to the LMM:

Consider our general LMM

y = Xβ + Zb+ ε,

whereb ∼ Nq(0,D), indep. of ε ∼ NN (0,R).

In this context, the natural choice for “missing data” for the EM algorithmis b, the vector of fixed effects. Therefore, in the general EM notation weset x = (yT ,bT )T , z = b and let ψ = (βT ,θT )T . The complete dataloglikelihood becomes

L_c(β, θ; x) = f(y|b) f(b).

According to the LMM,

y|b ∼ N_N(Xβ + Zb, R).

Therefore, letting e ≡ y − Xβ − Zb, we have

ℓ_c(ψ; x) = −(1/2) log|R| − (1/2) e^T R^{-1} e − (1/2) log|D| − (1/2) b^T D^{-1} b + const.

E-step: The E-step then consists of taking the conditional expected values

E_{ψ^{(h)}}(e^T R^{-1} e | y) = tr{R^{-1} E_{ψ^{(h)}}(ee^T | y)},

E_{ψ^{(h)}}(b^T D^{-1} b | y) = tr{D^{-1} E_{ψ^{(h)}}(bb^T | y)}.

To obtain these conditional expectations, first note that for any random vector u with (conditional) mean ν and (conditional) var-cov matrix Σ, the (conditional) expectation of uu^T is given by

E(uu^T) = νν^T + Σ.


From the definition of e and defining

b̂ ≡ E_{ψ^{(h)}}(b|y), V_b ≡ var_{ψ^{(h)}}(b|y),

we have

ê ≡ E_{ψ^{(h)}}(e|y) = y − Xβ − Zb̂,
V_e ≡ var_{ψ^{(h)}}(e|y) = Z V_b Z^T,

so

E_{ψ^{(h)}}(ee^T | y) = êê^T + Z V_b Z^T,
E_{ψ^{(h)}}(bb^T | y) = b̂b̂^T + V_b.

• Hence, the E-step reduces to finding b̂ and V_b.

Recall from our discussion of best predictors (BPs), BLPs, and BLUPs that BP = BLP under multivariate normality and that, conditional on the current values of the parameters (that is, treating them as known), BP = BLP = BLUP.

So we have (cf. p. 66)

b̂ = D^{(h)} Z^T (V^{(h)})^{-1} (y − Xβ^{(h)})

 = {Z^T (R^{(h)})^{-1} Z + (D^{(h)})^{-1}}^{-1} Z^T (R^{(h)})^{-1} (y − Xβ^{(h)}),

where V ≡ ZDZ^T + R and the (h) superscript indicates evaluation at the parameter estimates from the previous step of the algorithm. In addition, the second equality follows from the dimension reduction identities given below:

V^{-1} = (R + ZDZ^T)^{-1} = R^{-1} − R^{-1} Z {I + D Z^T R^{-1} Z}^{-1} D Z^T R^{-1}

 = R^{-1} − R^{-1} Z D {I + Z^T R^{-1} Z D}^{-1} Z^T R^{-1}

 = R^{-1} − R^{-1} Z {D^{-1} + Z^T R^{-1} Z}^{-1} Z^T R^{-1},

where the last equality holds assuming D is nonsingular.


The previous result could also have been obtained from classical results concerning the conditional distribution (including the mean and variance) of a subvector of a multivariate normal vector given the remaining subvector. This result also gives us V_b:

V_b = var_{ψ^{(h)}}(b) − cov_{ψ^{(h)}}(b, y) var_{ψ^{(h)}}^{-1}(y) cov_{ψ^{(h)}}(y, b)

 = D^{(h)} − D^{(h)} Z^T (V^{(h)})^{-1} Z D^{(h)}

 = {Z^T (R^{(h)})^{-1} Z + (D^{(h)})^{-1}}^{-1},

where, again, the last equality follows after application of the dimension reduction formulas and some algebraic simplification.

Combining these results we have

E_{ψ^{(h)}}(e^T R^{-1} e | y) = tr{R^{-1}(êê^T + Z V_b Z^T)}

 = ê^T R^{-1} ê + tr(Z^T R^{-1} Z V_b)

and

E_{ψ^{(h)}}(b^T D^{-1} b | y) = tr{D^{-1}(b̂b̂^T + V_b)}

 = b̂^T D^{-1} b̂ + tr(D^{-1} V_b),

which lead to the E-step Q function as follows:

Q(ψ|ψ^{(h)}) = E_{ψ^{(h)}}{ℓ_c(ψ; x) | y}

 = −(1/2) log|R(θ)| − (1/2) ê(β)^T R^{-1}(θ) ê(β)

 − (1/2) log|D(θ)| − (1/2) b̂^T D^{-1}(θ) b̂

 − (1/2) tr[{Z^T R^{-1}(θ) Z + D^{-1}(θ)}{Z^T (R^{(h)})^{-1} Z + (D^{(h)})^{-1}}^{-1}],

where ê(β) ≡ y − Xβ − Zb̂.


M-step: Notice that only the second term in the previous expression for Q involves β and, for fixed θ, it is a GLS criterion w.r.t. (y − Zb̂) − Xβ. Thus the M-step involves first solving

∂Q/∂θ = 0   (∗)

to yield θ^{(h+1)}, and then plugging it into the least squares formula

β^{(h+1)} = {X^T R^{-1}(θ^{(h+1)}) X}^{-1} X^T R^{-1}(θ^{(h+1)}) (y − Zb̂)

to yield β^{(h+1)}.

• There is no general closed-form solution for (∗), so this step typically involves an iterative calculation with, for example, a Newton-type algorithm.

• The EM algorithm for the LMM has the advantage of stability and good convergence properties, but it is slow. So, some software for fitting LMMs takes a hybrid approach, using several iterations of the EM algorithm to improve starting values and get in the neighborhood of the MLEs before switching to Newton-Raphson in order to reach the solution more quickly.

• The EM algorithm has been presented here for ML estimation, but with minor modification it can also be used for REML estimation.
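To make the E- and M-steps above concrete, here is a minimal R sketch of the ML version for the special case R = σ²I_N and D = σ_b²I_q, in which (∗) has a closed-form solution so no inner Newton iteration is needed (the function name, starting values, and tolerance are hypothetical choices):

    ## EM for ML estimation in the LMM, special case R = sigma2*I, D = sigma2b*I
    ## (minimal sketch; em.lmm and its defaults are hypothetical)
    em.lmm <- function(y, X, Z, sigma2 = 1, sigma2b = 1,
                       maxit = 1000, tol = 1e-8) {
      N <- length(y); q <- ncol(Z)
      beta <- drop(qr.solve(X, y))               # OLS starting value for beta
      for (h in seq_len(maxit)) {
        ## E-step: Vb = (Z'Z/sigma2 + I/sigma2b)^(-1),
        ##         bhat = Vb Z'(y - X beta)/sigma2
        Vb   <- solve(crossprod(Z) / sigma2 + diag(q) / sigma2b)
        bhat <- drop(Vb %*% crossprod(Z, y - X %*% beta)) / sigma2
        ## M-step: closed form here; GLS for beta reduces to OLS when R = sigma2*I
        beta.new    <- drop(qr.solve(X, y - Z %*% bhat))
        e           <- drop(y - X %*% beta.new - Z %*% bhat)
        sigma2.new  <- (sum(e^2) + sum(diag(crossprod(Z) %*% Vb))) / N
        sigma2b.new <- (sum(bhat^2) + sum(diag(Vb))) / q
        done <- max(abs(c(beta.new - beta, sigma2.new - sigma2,
                          sigma2b.new - sigma2b))) < tol
        beta <- beta.new; sigma2 <- sigma2.new; sigma2b <- sigma2b.new
        if (done) break
      }
      list(beta = beta, sigma2 = sigma2, sigma2b = sigma2b, iterations = h)
    }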


Application of EM to the GLMM:

Now consider the GLMM for clustered data. Let y_{i1}, ..., y_{it_i} | b_i ind∼ ED (in the exponential dispersion family) with conditional p.d.f. f(y_ij | b_i), where y_i has conditional mean µ_i^c that is related to covariates and random effects via

g(µ_i^c) = X_i β + Z_i D^{T/2} b_i,

where D^{T/2} = D^{T/2}(θ) is a lower-triangular Cholesky factor chosen so that var(D^{T/2} b_i) = D, a var-cov matrix that depends upon an unknown parameter θ. We assume

b_1, ..., b_n iid∼ N(0, I).

• Model parameter: δ = (β^T, θ^T)^T.

• Loglikelihood:

ℓ(δ; y) = Σ_{i=1}^n log{∫ ∏_{j=1}^{t_i} f(y_ij | b_i; δ) φ_q(b_i) db_i},

where φ_q(·) is the q-variate standard normal p.d.f. (That is, φ_q(b_i) = ∏_{k=1}^q φ(b_ik).)

Another way to avoid evaluating the intractable integral in ℓ(δ; y) is to use the EM algorithm.

Again, let b = (b_1^T, ..., b_n^T)^T be the missing data in this problem. This leads to the complete data loglikelihood

ℓ_c(δ; y, b) = log f(y|b; δ) + log f(b)

 = Σ_{i=1}^n {Σ_{j=1}^{t_i} log f(y_ij | b_i; δ) + Σ_{k=1}^q log φ(b_ik)}

 ≡ Σ_i ℓ_ci(δ; y_i, b_i)


E-step: Evaluate

Q(δ|δ^{(h)}) = E_{δ^{(h)}}{ℓ_c(δ; y, b) | y}

 = Σ_i ∫ ℓ_ci(δ; y_i, b_i) f(b_i | y_i; δ^{(h)}) db_i    (1)

Bayes’ Theorem tells us that

f(b_i | y_i; δ^{(h)}) = f(y_i | b_i; δ^{(h)}) φ_q(b_i) / ∫ f(y_i | b_i; δ^{(h)}) φ_q(b_i) db_i    (2)

so (1) becomes

Q(δ|δ^{(h)}) = Σ_i (1/A_i^{(h)}) ∫ {Σ_j log f(y_ij | b_i; δ) + Σ_{k=1}^q log φ(b_ik)} × ∏_j f(y_ij | b_i; δ^{(h)}) ∏_k φ(b_ik) db_i,

where A_i^{(h)} is the denominator of (2), which, notice, does not involve b_i (and therefore comes outside of the integral in the previous expression) and also does not depend on δ (it depends only on δ^{(h)}).

• Unlike in the LMM case, this E-step does not have a closed form expression, and we must resort to quadrature or Monte Carlo approximation of the integrals involved in Q. For example, using Monte Carlo integration, suppose we draw a large sample of size R from the distribution of b_i. That is, let b*_ri = (b*_ri1, ..., b*_riq)^T, r = 1, ..., R, be a random sample from N(0, I). Then

Q(δ|δ^{(h)}) ≈ Σ_i Σ_r c_ri^{(h)} {Σ_j log f(y_ij | b*_ri; δ) + Σ_k log φ(b*_rik)}


where

c_ri^{(h)} = ∏_j f(y_ij | b*_ri; δ^{(h)}) / Σ_{r′=1}^R ∏_j f(y_ij | b*_r′i; δ^{(h)}),

are weights that do not depend upon δ, the parameter to be estimated at the (h + 1)st step of the algorithm.

• Alternatively, approximations to Q based on OGQ, AGQ, or more sophisticated MC methods of integration can be used. Many of these have the same form as given above, differing only in the weights c_ri^{(h)} and the values of b*_ri (which would be abscissas, in the case of OGQ or AGQ).

M-step: Although the approximation to Q from the E-step looks nasty, it actually is not, and it is quite easy to maximize because it takes the form of a weighted exponential dispersion family loglikelihood. That is, apart from the addition of weights that do not depend upon the parameters, it is the loglikelihood of an ordinary fixed-effects GLM for an augmented or expanded dataset containing all of the y_ij's and b*_ri's stacked on top of one another.

• Thus, the M-step produces δ^{(h+1)} by fitting an ordinary fixed-effects GLM via some routine that allows user-specified weights (most GLM software does).
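As an illustration, here is a minimal R sketch of one such iteration for a random-intercept logistic GLMM, g(µ_ij^c) = x_ij^T β + σ_b b_i with b_i ∼ N(0, 1), so that σ_b plays the role of D^{T/2}(θ) and is estimated as the GLM coefficient on the column of b*_ri draws; the Σ_k log φ(b*_rik) terms drop out of the M-step since they do not involve δ. The function name and its inputs are hypothetical:

    ## One MC-EM iteration for a random-intercept logistic GLMM:
    ## g(mu) = X beta + sigma.b * b_i, with b_i ~ N(0, 1).
    ## Minimal sketch; mcem.step and its inputs (y, X, id) are hypothetical.
    mcem.step <- function(y, X, id, beta, sigma.b, R = 1000) {
      cl <- match(id, unique(id)); n <- max(cl)
      eta0 <- drop(X %*% beta)
      bstar <- matrix(rnorm(R * n), R, n)        # draws b*_ri ~ N(0, 1)
      ## E-step: c_ri propto prod_j f(y_ij | b*_ri; delta^(h)),
      ## normalized over r within each cluster i
      w <- matrix(NA_real_, R, n)
      for (i in seq_len(n)) {
        rows <- which(cl == i)
        ll <- sapply(bstar[, i], function(b)
          sum(dbinom(y[rows], 1, plogis(eta0[rows] + sigma.b * b), log = TRUE)))
        w[, i] <- exp(ll - max(ll)) / sum(exp(ll - max(ll)))
      }
      ## M-step: weighted fixed-effects GLM on the expanded data,
      ## pairing each y_ij with each draw b*_ri; sigma.b is the
      ## coefficient on the b* column
      idx  <- rep(seq_along(y), each = R)        # observation index per pseudo-obs
      rvec <- rep(seq_len(R), times = length(y)) # draw index per pseudo-obs
      bcol <- bstar[cbind(rvec, cl[idx])]
      wts  <- w[cbind(rvec, cl[idx])]
      fit  <- suppressWarnings(                  # glm warns about non-integer weights
        glm(y[idx] ~ 0 + X[idx, , drop = FALSE] + bcol,
            family = binomial, weights = wts))
      list(beta = unname(coef(fit)[seq_len(ncol(X))]),
           sigma.b = unname(coef(fit)["bcol"]))
    }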
