
STAT 8630, Mixed-Effect Models and Longitudinal Data Analysis — Lecture Notes

Introduction to Longitudinal Data

Terminology:

Longitudinal data consist of observations (i.e., measurements) taken repeatedly through time on a sample of experimental units (i.e., individuals, subjects).

• The experimental units or subjects can be human patients, animals, agricultural plots, etc.

• Typically, the terms “longitudinal data” and “longitudinal study” refer to situations in which data are collected through time under uncontrolled circumstances. E.g., subjects with torn ACLs in their knees are assigned to one of two methods of surgical repair and then followed through time (examined at 6, 12, 18, and 24 months for knee stability, say).

• Longitudinal data are to be contrasted with cross-sectional data. Cross-sectional data contain measurements on a sample of subjects at only one point in time.

Repeated measures: The terms “repeated measurements” or, more simply, “repeated measures” are sometimes used as rough synonyms for “longitudinal data”; however, there are sometimes slight differences in meaning for these terms.

• Repeated measures are also multiple measurements on each of several individuals, but they are not necessarily through time. E.g., measurements of chemical concentration in the leaves of a plant taken at different locations (low, medium, and high on the plant, say) can be regarded as repeated measures.


• In addition, repeated measures may occur across the levels of some controlled factor. E.g., crossover studies involve repeated measures. In a crossover study, subjects are assigned to multiple treatments (usually 2 or 3) sequentially. E.g., a two-period crossover experiment involves subjects who each get treatments A and B, some in the order AB, and others in the order BA.

Another rough synonym for longitudinal data is panel data.

• The term panel data is more common in econometrics, the term longitudinal data is most commonly used in biostatistics, and the term repeated measures most often arises in an agricultural context.

In all cases, however, we are referring to multiple measurements of essentially the same variable(s) on a given subject or unit of observation. We’ll often use the more generic term clustered data to refer to this situation.

Advantages and Disadvantages of Longitudinal Data:

Advantages:

1. Although time effects can be investigated in cross-sectional studies in which different subjects are examined at different time points, only longitudinal data give information on individual patterns of change.

2. Again, in contrast to cross-sectional studies involving multiple time points, longitudinal studies economize on subjects.

3. In investigating time effects in a longitudinal design or treatment effects in a crossover design, each subject can “serve as his or her own control.” That is, comparisons can be made within a subject rather than between subjects. This eliminates between-subjects sources of variability from the experimental error and makes inferences more efficient/powerful (think paired t-test versus two-sample t-test).


4. Since the same variables are measured repeatedly on the same subjects, the reliability of those measurements can be assessed and, purely from a measurement standpoint, reliability is higher.

Disadvantages:

1. For longitudinal or, more generally, clustered data, it is typically reasonable to assume independence across clusters, but repeated measures within a cluster are almost always correlated, which complicates the analysis.

2. Clustered data are often unbalanced or partially incomplete (involve missing data), which also complicates the analysis. For longitudinal data, this may be due to loss to follow-up (some subjects move away, die, miss appointments, etc.). For other types of clustered data, the cluster size may vary (e.g., familial data, where family size varies).

3. As a practical matter, methods and/or software may not exist or may be complex, so obtaining results and interpreting them may be difficult.


Data Structure for Clustered Data:

The general data structure of clustered data is given in the table below. Here, $y_{ij}$ represents the $j$th observation from the $i$th cluster, where $i = 1, \ldots, n$ and $j = 1, \ldots, t_i$.*

• Associated with each observation $y_{ij}$ we may have a $p \times 1$ vector of explanatory variables, or covariates, $x_{ij} = (x_{ij1}, x_{ij2}, \ldots, x_{ijp})^T$.

• In addition, to indicate missing data, we may sometimes define a missing value indicator:
$$\delta_{ij} = \begin{cases} 1, & \text{if } y_{ij} \text{ and } x_{ij} \text{ are observed,} \\ 0, & \text{otherwise.} \end{cases}$$

• We will often write the set of responses from the $i$th subject as a vector: $y_i = (y_{i1}, \ldots, y_{it_i})^T$. In addition, let $y = (y_1^T, \ldots, y_n^T)^T$ be the combined response vector from all subjects and time points, and let $N = \sum_{i=1}^n t_i$ be the total sample size.

* Note that our text uses slightly different notation, in which $n_i$ is the cluster size and $N$ is the number of clusters.
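
This layout is easy to hold in a long format. A minimal sketch in Python (the cluster sizes and response values below are made up for illustration, not from the notes):

```python
# Illustrative long-format layout for clustered data: y[i] holds the
# responses (y_i1, ..., y_it_i) for cluster i; cluster sizes t_i may differ.
# All values here are hypothetical.
y = {
    1: [5.1, 4.8, 5.3],          # cluster 1: t_1 = 3
    2: [6.0, 6.2],               # cluster 2: t_2 = 2
    3: [4.4, 4.9, 5.0, 5.2],     # cluster 3: t_3 = 4
}

n = len(y)                                 # number of clusters
t = {i: len(yi) for i, yi in y.items()}    # cluster sizes t_i
N = sum(t.values())                        # total sample size N = sum_i t_i
```

With these three clusters, $N = 3 + 2 + 4 = 9$, matching the notation above.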


Often subjects will be grouped somehow into treatment groups. In this case, we will need additional subscripts to index the groups. For example, the data layout below in Table 1.2 represents a one-way layout with $s$ groups with repeated measures over $t$ time points. Here, $y_{hij}$ represents the $j$th measurement on the $i$th subject from the $h$th group.

• Note that the $s$ groups might correspond to $s$ levels of a single treatment factor, or $s$ combinations of the levels of two or more factors. In the latter case, it may be convenient to introduce additional subscripts.

– E.g., for a two-way layout with repeated measures involving factors A and B, we might index the data as $y_{hkij}$ to represent the $j$th observation on the $i$th subject in the $(h,k)$th treatment (at the $h$th level of A and the $k$th level of B).


For the single group case we can drop the index $h$ from Table 1.2 to represent the data as follows:

For longitudinal data, the response variable can be laid out in a single column as in Table 1.1, or with one column per time point as in Tables 1.2 and 1.3. The two layouts suggest that such data can be conceptualized as univariate or multivariate data.

In fact, there are classical normal-theory approaches to analyzing continuous repeated measures data of each type:

— univariate methods (most notably the repeated measures analysis of variance); and

— multivariate methods (profile and growth curve analysis).


Repeated Measures ANOVA

The classical repeated measures analysis of variance (RM-ANOVA) situation is a one-way layout with repeated measures over $t$ time points. This situation is displayed in Table 1.2.

Here, there are $n_1, n_2, \ldots, n_s$ subjects, respectively, in the $s$ treatment groups. If measurements are taken at just a single time point, we have an (unbalanced) one-way layout. However, subjects are followed up over $t$ time points to yield a repeated measures design.

Example – Methemoglobin in Sheep:

An experiment was designed to study trends in methemoglobin (M) in sheep following treatment with 3 equally spaced levels of NO2 (factor A). Four sheep were assigned to each level, and each animal was measured at 6 sampling times (factor B), 5 of them following treatment. The response was log(M + 5).

The data from this experiment are as follows:

                        Sampling Time
NO2   Sheep     1      2      3      4      5      6
 1      1     2.197  2.442  2.542  2.241  1.960  1.988
 1      2     1.932  2.526  2.526  2.152  1.917  1.917
 1      3     1.946  2.251  2.501  1.988  1.686  1.841
 1      4     1.758  2.054  2.588  2.197  2.140  1.686
 2      5     2.230  3.086  3.357  3.219  2.827  2.534
 2      6     2.398  2.580  2.929  2.874  2.282  2.303
 2      7     2.054  3.243  3.653  3.811  3.816  3.227
 2      8     2.510  2.695  2.996  3.246  2.565  2.230
 3      9     2.140  3.896  4.246  4.461  4.418  4.331
 3     10     2.303  3.822  4.109  4.240  4.127  4.084
 3     11     2.175  2.907  3.086  2.827  2.493  2.230
 3     12     2.041  3.824  4.111  4.301  4.206  4.182


• Here $s = 3$, $n_1 = n_2 = n_3 = 4$, and $t = 6$.

The basic RM-ANOVA approach is based upon a similarity between a repeated measures design and a split-plot experimental design. The RM-ANOVA approach uses the split-plot model, with modifications to the split-plot analysis, if necessary, to account for differences between the two designs.

A split-plot experimental design is one in which (at least) two sizes of experimental unit are used. The larger experimental unit is known as the whole plot, and is randomized to some experimental design (a one-way layout, say).

The whole plot is then subdivided into smaller units known as split plots, which are then assigned to a second experimental design within each whole plot.

Example – Chocolate Cake:

An experiment was conducted to determine the effect of baking temperature on quality for three recipes of chocolate cake. Recipes I and II differed in that the chocolate was added at 40◦ C. and 60◦ C., respectively, while recipe III contained extra sugar. Six different baking temperatures were considered: 175◦ C., 185◦ C., 195◦ C., 205◦ C., 215◦ C., and 225◦ C. 45 batches of cake batter were prepared, using each of the 3 recipes 15 times in a completely random order. Each batch was large enough for 6 cakes, and the six baking temperatures were randomly assigned to the 6 cakes within each batch. One of several measurements of quality made on each cake was the breaking angle of the cake.


The data from this experiment are as follows:

• Here, there are 45 batches of cake, which occur in a balanced one-way layout. The batches are the whole plots, and recipe, with 3 levels, is the whole plot factor.

• Each batch is then split into 6 cakes, which are randomly assigned to one of 6 temperatures. The cakes are the split plots, and temperature is the split plot factor.


Let $y_{hij}$ represent the response at the $j$th level of the split-plot factor for the $i$th whole plot in the $h$th group. The model traditionally used for the split-plot design exemplified by the chocolate cake example is
$$y_{hij} = \mu + \alpha_h + e_{i(h)} + \beta_j + (\alpha\beta)_{hj} + \varepsilon_{hij}. \quad (*)$$
Here, $\mu$ is a grand mean, $\alpha_h$ is an effect for the $h$th level of the whole plot factor (e.g., recipe), $\beta_j$ is an effect for the $j$th level of the split-plot factor (temperature), and $(\alpha\beta)_{hj}$ is an interaction term for the whole and split plot factors. $e_{i(h)}$ is a random effect for the $i$th whole plot nested in the $h$th level of the whole plot factor, and $\varepsilon_{hij}$ is the overall error term.

One way to think about the split plot model is that it is the union of the model appropriate for the whole plots and the model appropriate for the split plots.

• The whole plots occur in a one-way layout, so the one-way layout model
$$\mu + \alpha_h + e_{i(h)}$$
is appropriate for batches.

• The split plots occur in a randomized complete block design, so the RCBD model
$$\mu + \underbrace{\mathrm{block}_{hi}}_{=\alpha_h + e_{i(h)}} + \beta_j + \varepsilon_{hij}$$
is appropriate for cakes.

• Putting these portions of the models together and adding an interaction term $(\alpha\beta)_{hj}$ to capture interactions between the whole and split plot factors leads to model (*).

• The random whole plot effect $e_{i(h)}$ can be thought of as the whole plot error term and $\varepsilon_{hij}$ as the split plot error term. Since there are two sizes of experimental unit, each with its own randomization, there are two error terms in the model.


In fact, the split plot model described above is an example of a linear mixed-effects model (LMM).

• It includes fixed effects for the whole plot factor (the $\alpha_h$’s for recipes), the split plot factor (the $\beta_j$’s for temperatures), and their interaction (the $(\alpha\beta)_{hj}$’s). These are the regression parameters of the model.

• It also includes random effects: $e_{i(h)}$, a whole plot (or batch) effect, in addition to the overall error term $\varepsilon_{hij}$, which is always present in any linear model and so is typically not categorized as a random effect, even though it is one.

• The term “mixed-effects model,” or sometimes simply “mixed model,” refers to the fact that the linear predictor of the model (the right side of the model equation, excluding the overall error term) includes both fixed and random effects.

Fixed vs. random effects: The effects in the model account for variability in the response across levels of treatment and design factors. The decision as to whether fixed effects or random effects should be used depends upon what the appropriate scope of generalization is.

• If it is appropriate to think of the levels of a factor as randomly drawn from, or otherwise representative of, a population to which we’d like to generalize, then random effects are suitable.

– Design or grouping factors are usually more appropriately modeled with random effects.

– E.g., blocks (sections of land) in an agricultural experiment, days when an experiment is conducted over several days, lab technician when measurements are taken by several technicians, subjects in a repeated measures design, locations or sites along a river when we desire to generalize to the entire river, etc.


• If, however, the specific levels of the factor are of interest in and of themselves, then fixed effects are more appropriate.

– Treatment factors are usually more appropriately modeled with fixed effects.

– E.g., in experiments to compare drugs, amounts of fertilizer, hybrids of corn, teaching techniques, or measurement devices, all of these factors are most appropriately modeled with fixed effects.

• A good litmus test for whether the level of some factor should be treated as fixed is to ask whether it would be of broad interest to report a mean for that level. For example, if I’m conducting an experiment in which each of four different classes of third grade students is taught with each of three methods of instruction (e.g., in a crossover design), then it will be of broad interest to report the mean response (level of learning, say) for a particular method of instruction, but not for a particular classroom of third graders.

– Here, fixed effects are appropriate for instruction method, random effects for class.

Since the whole plot error term represents random whole plot effects (batch effects), the $e_{i(h)}$’s are random effects. Therefore, we must make some assumptions about their distribution (and the distribution of the overall error term $\varepsilon_{hij}$) to complete the split-plot model. The following assumptions are typical:
$$y_{hij} = \mu + \alpha_h + \beta_j + (\alpha\beta)_{hj} + e_{i(h)} + \varepsilon_{hij} = \mu_{hj} + e_{i(h)} + \varepsilon_{hij},$$
where
$$\{e_{i(h)}\} \overset{iid}{\sim} N(0, \sigma_e^2), \qquad \{\varepsilon_{hij}\} \overset{iid}{\sim} N(0, \sigma^2), \qquad \mathrm{cov}(\varepsilon_{hij}, e_{i'(h')}) = 0 \ \text{for all } h, i, j, h', i'.$$


The chocolate cake experiment is an example of a balanced split plot design with whole plots arranged in a one-way layout. More complex split-plot designs are possible. E.g., whole plots are often arranged in a RCBD, split plots could be split once again to create split-split-plots, etc.

However, the classical analysis of all of these designs is relatively straightforward provided that the design is balanced.

• By “balanced” here, we mean that there is an equal number of replicates for each whole plot treatment (recipe), and the same set of subplot treatments (temperatures) was observed within each whole plot (batch).

The classical analysis of the balanced split plot model with whole plots in a one-way layout is based on the model and distributional assumptions given above and the following decomposition of the deviations $y_{hij} - \bar{y}_{\cdot\cdot\cdot}$ of each observation from the grand sample mean:
$$y_{hij} - \bar{y}_{\cdot\cdot\cdot} = (\bar{y}_{h\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{hi\cdot} - \bar{y}_{h\cdot\cdot}) + (\bar{y}_{\cdot\cdot j} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{h\cdot j} - \bar{y}_{h\cdot\cdot} - \bar{y}_{\cdot\cdot j} + \bar{y}_{\cdot\cdot\cdot}) + (y_{hij} - \bar{y}_{h\cdot j} - \bar{y}_{hi\cdot} + \bar{y}_{h\cdot\cdot}), \quad (*)$$
where $n = \sum_h n_h$, we assume $n_h$ is constant over $h$, and
$$\bar{y}_{\cdot\cdot\cdot} = (nt)^{-1} \sum_{h=1}^s \sum_{i=1}^{n_h} \sum_{j=1}^t y_{hij}, \qquad \bar{y}_{h\cdot\cdot} = (n_h t)^{-1} \sum_{i=1}^{n_h} \sum_{j=1}^t y_{hij},$$
$$\bar{y}_{\cdot\cdot j} = n^{-1} \sum_{h=1}^s \sum_{i=1}^{n_h} y_{hij}, \qquad \bar{y}_{h\cdot j} = n_h^{-1} \sum_{i=1}^{n_h} y_{hij}, \qquad \bar{y}_{hi\cdot} = t^{-1} \sum_{j=1}^t y_{hij}$$
are sample means over all observations ($\bar{y}_{\cdot\cdot\cdot}$), over observations in the $h$th whole plot treatment group ($\bar{y}_{h\cdot\cdot}$), etc.


This decomposition leads to the following analysis of variance:

Source of Variation    Sum of Squares    d.f.             E(MS)
Whole plot groups      SS_WPG            s − 1            σ² + tσ²_e + Q(α, αβ)
Whole plot error       SS_WPE            n − s            σ² + tσ²_e
Split plot groups      SS_SPG            t − 1            σ² + Q(β, αβ)
Interaction            SS_WPG×SPG        (s − 1)(t − 1)   σ² + Q(αβ)
Split plot error       SS_SPE            (n − s)(t − 1)   σ²
Total                  SS_T              nt − 1

The sums of squares in the ANOVA table are simply the sums, over all observations, of the squared terms in decomposition (*). That is,
$$SS_{WPG} = \sum_{h=1}^s \sum_{i=1}^{n_h} \sum_{j=1}^t (\bar{y}_{h\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2, \qquad SS_{WPE} = \sum_{h=1}^s \sum_{i=1}^{n_h} \sum_{j=1}^t (\bar{y}_{hi\cdot} - \bar{y}_{h\cdot\cdot})^2,$$
$$SS_{SPG} = \sum_{h=1}^s \sum_{i=1}^{n_h} \sum_{j=1}^t (\bar{y}_{\cdot\cdot j} - \bar{y}_{\cdot\cdot\cdot})^2, \qquad SS_{WPG\times SPG} = \sum_{h=1}^s \sum_{i=1}^{n_h} \sum_{j=1}^t (\bar{y}_{h\cdot j} - \bar{y}_{h\cdot\cdot} - \bar{y}_{\cdot\cdot j} + \bar{y}_{\cdot\cdot\cdot})^2,$$
$$SS_{SPE} = \sum_{h=1}^s \sum_{i=1}^{n_h} \sum_{j=1}^t (y_{hij} - \bar{y}_{h\cdot j} - \bar{y}_{hi\cdot} + \bar{y}_{h\cdot\cdot})^2.$$

In addition, the quantities $Q(\alpha, \alpha\beta)$, $Q(\beta, \alpha\beta)$, and $Q(\alpha\beta)$ are quadratic forms representing differences across whole plot groups, split plot groups, and the whole plot group × split plot group interaction, respectively.


That is, $Q(\alpha, \alpha\beta)$ is a sum of squares in the $\alpha_h$’s and $(\alpha\beta)_{hj}$’s that equals zero under the null hypothesis of no differences across whole plot groups (hypothesis (1) below). Similarly, $Q(\beta, \alpha\beta)$ and $Q(\alpha\beta)$ are sums of squares that are zero under no differences across split plot groups (hypothesis (2)), and under no interaction (hypothesis (3)), respectively.

• The $Q(\cdot)$ notation is often used because it is more convenient than writing the term out exactly. The exact forms of these terms are not important; it only matters that these terms are zero under the null hypothesis of no effect, and positive under the alternative.

F tests appropriate for the hypotheses of interest can be determined by examination of the expected mean squares.

Let $\mu_{hj} = E(y_{hij}) = \mu + \alpha_h + \beta_j + (\alpha\beta)_{hj}$ and let
$$\bar{\mu}_{h\cdot} = t^{-1} \sum_{j=1}^t \mu_{hj} = \mu + \alpha_h + \bar{\beta}_\cdot + \overline{(\alpha\beta)}_{h\cdot}$$
be the marginal mean for whole plot group $h$, and let
$$\bar{\mu}_{\cdot j} = s^{-1} \sum_{h=1}^s \mu_{hj} = \mu + \bar{\alpha}_\cdot + \beta_j + \overline{(\alpha\beta)}_{\cdot j}$$
be the marginal mean for the $j$th split plot group. Then the hypotheses of interest and their corresponding test statistics are as follows:

1. $H_1: \bar{\mu}_{1\cdot} = \cdots = \bar{\mu}_{s\cdot}$ (no main effect of the whole plot factor), which is tested with
$$F = \frac{MS_{WPG}}{MS_{WPE}} \sim F(s-1,\ n-s);$$


2. $H_2: \bar{\mu}_{\cdot 1} = \cdots = \bar{\mu}_{\cdot t}$ (no main effect of the split plot factor), which is tested with
$$F = \frac{MS_{SPG}}{MS_{SPE}} \sim F(t-1,\ (n-s)(t-1));$$

3. and $H_3: (\alpha\beta)_{hj} = 0$ for all $h, j$ (no interaction between whole plot and split plot factors), which is tested with
$$F = \frac{MS_{WPG\times SPG}}{MS_{SPE}} \sim F((s-1)(t-1),\ (n-s)(t-1)).$$
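
To make the decomposition and tests concrete, here is a sketch in Python (NumPy) that computes the split-plot sums of squares, mean squares, and the three F statistics for the sheep methemoglobin data given earlier. This is an illustrative re-implementation of the balanced-case formulas above, not the SAS analysis used in the course handouts; the variable names are mine.

```python
import numpy as np

# Sheep methemoglobin data from the notes: log(M+5), 12 sheep x 6 sampling times.
# Rows 0-3: NO2 level 1; rows 4-7: level 2; rows 8-11: level 3.
y = np.array([
    [2.197, 2.442, 2.542, 2.241, 1.960, 1.988],
    [1.932, 2.526, 2.526, 2.152, 1.917, 1.917],
    [1.946, 2.251, 2.501, 1.988, 1.686, 1.841],
    [1.758, 2.054, 2.588, 2.197, 2.140, 1.686],
    [2.230, 3.086, 3.357, 3.219, 2.827, 2.534],
    [2.398, 2.580, 2.929, 2.874, 2.282, 2.303],
    [2.054, 3.243, 3.653, 3.811, 3.816, 3.227],
    [2.510, 2.695, 2.996, 3.246, 2.565, 2.230],
    [2.140, 3.896, 4.246, 4.461, 4.418, 4.331],
    [2.303, 3.822, 4.109, 4.240, 4.127, 4.084],
    [2.175, 2.907, 3.086, 2.827, 2.493, 2.230],
    [2.041, 3.824, 4.111, 4.301, 4.206, 4.182],
])
s, nh, t = 3, 4, 6
n = s * nh
grp = np.repeat(np.arange(s), nh)            # whole plot group of each sheep

gm  = y.mean()                                          # grand mean y...
mh  = np.array([y[grp == h].mean() for h in range(s)])  # yh..
mhi = y.mean(axis=1)                                    # yhi.
mj  = y.mean(axis=0)                                    # y..j
mhj = np.array([y[grp == h].mean(axis=0) for h in range(s)])  # yh.j

# Sums of squares from the decomposition (*)
SSWPG = t * nh * np.sum((mh - gm) ** 2)
SSWPE = t * np.sum((mhi - mh[grp]) ** 2)
SSSPG = n * np.sum((mj - gm) ** 2)
SSint = nh * np.sum((mhj - mh[:, None] - mj[None, :] + gm) ** 2)
SSSPE = np.sum((y - mhj[grp] - mhi[:, None] + mh[grp, None]) ** 2)
SST   = np.sum((y - gm) ** 2)

# Mean squares and the three F statistics
MSWPG, MSWPE = SSWPG / (s - 1), SSWPE / (n - s)
MSSPG = SSSPG / (t - 1)
MSint = SSint / ((s - 1) * (t - 1))
MSSPE = SSSPE / ((n - s) * (t - 1))

F_wp  = MSWPG / MSWPE   # whole plot main effect, F(s-1, n-s)
F_sp  = MSSPG / MSSPE   # split plot main effect, F(t-1, (n-s)(t-1))
F_int = MSint / MSSPE   # interaction, F((s-1)(t-1), (n-s)(t-1))
```

Because the design is balanced, the five sums of squares add exactly to the total sum of squares, and the degrees of freedom add to $nt - 1$.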

Note that side conditions are often placed on the split plot model to avoid the complications introduced by having an overparameterized model and non-full-rank design matrix. The usual sum-to-zero side conditions are
$$\sum_h \alpha_h = \sum_j \beta_j = \sum_h (\alpha\beta)_{hj} = \sum_j (\alpha\beta)_{hj} = 0.$$

Such conditions are not strictly necessary to derive the F tests given above, but they do simplify things somewhat.

Without the side conditions, $H_1$ can be expressed in terms of the fixed effects in the model as
$$H_1: \alpha_1 + \overline{(\alpha\beta)}_{1\cdot} = \cdots = \alpha_s + \overline{(\alpha\beta)}_{s\cdot},$$
which reduces to
$$H_1: \alpha_1 = \cdots = \alpha_s = 0$$
under the sum-to-zero constraints. Note that under these constraints the $Q(\alpha, \alpha\beta)$ term in $E(MS_{WPG})$ reduces to $Q(\alpha) = \sum_{h=1}^s \alpha_h^2/(s-1)$, which, of course, is 0 under $H_1$ and $> 0$ otherwise.

• Similar comments apply to $H_2$, which under the sum-to-zero constraints is equivalent to $H_2: \beta_1 = \cdots = \beta_t = 0$, with $Q(\beta, \alpha\beta) = Q(\beta) = \sum_{j=1}^t \beta_j^2/(t-1)$.


Example — Chocolate Cake:

• See the handout labelled choccake.sas.

• In choccake.sas we use PROC MIXED to perform the analysis just described. A call to PROC GLM is also included which reproduces the basic PROC MIXED results (e.g., the ANOVA table and expected mean squares). However, PROC GLM is not designed for mixed models, and cannot, in general, be trusted to produce correct results for split plot models and other LMMs.

• Note that method=type3 requests the classical ANOVA-type analysis. This is not the default, which is a REML analysis.

• The basic results are that there is no significant interaction between recipe and baking temperature, there is no significant main effect of recipe, but there is a significant main effect of temperature.

– From the contrasts and profile plot, we see that mean breaking angle increases linearly with baking temperature.

• Note that the expected mean squares are printed at the bottom of p.1 and the top of p.2 of the output. These results agree with the ANOVA table given previously in these notes.

• Method of moments estimators of the variance components $\sigma^2$ and $\sigma_e^2$ are easily derived from the expressions for $E(MS_{SPE})$ and $E(MS_{WPE})$. Equating $MS_{SPE}$ with its expectation $\sigma^2$ yields the estimator
$$\hat{\sigma}^2 = MS_{SPE} = 20.4709.$$
Similarly, equating $MS_{WPE}$ with its expectation $\sigma^2 + t\sigma_e^2$ yields
$$\hat{\sigma}_e^2 = \frac{MS_{WPE} - MS_{SPE}}{t} = 41.8370.$$
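
This arithmetic is a one-liner to check. A sketch in Python, using the chocolate cake mean squares quoted in these notes ($MS_{SPE} = 20.4709$; $MS_{WPE} = 271.493$, the value that appears in the Satterthwaite example later in the notes):

```python
# Method-of-moments estimates of the variance components, using the
# chocolate cake mean squares quoted in these notes.
MS_SPE, MS_WPE, t = 20.4709, 271.493, 6

sigma2_hat = MS_SPE                   # since E(MS_SPE) = sigma^2
sigma2e_hat = (MS_WPE - MS_SPE) / t   # since E(MS_WPE) = sigma^2 + t*sigma_e^2
```

This reproduces $\hat{\sigma}^2 = 20.4709$ and $\hat{\sigma}_e^2 = 41.8370$.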


Estimation and Inference on Means in the Split-plot Model:

According to the split-plot model and assumptions given above,
$$\mathrm{var}(y_{hij}) = \mathrm{var}(\mu_{hj} + e_{i(h)} + \varepsilon_{hij}) = \mathrm{var}(e_{i(h)} + \varepsilon_{hij}) = \mathrm{var}(e_{i(h)}) + \mathrm{var}(\varepsilon_{hij}) + 2\underbrace{\mathrm{cov}(e_{i(h)}, \varepsilon_{hij})}_{=0} = \sigma_e^2 + \sigma^2.$$

• Because the variance of $y_{hij}$ is the sum of two components, $\sigma_e^2$ and $\sigma^2$, these quantities are often called variance components.

In addition,
$$\mathrm{cov}(y_{hij}, y_{hik}) = \mathrm{cov}(\mu_{hj} + e_{i(h)} + \varepsilon_{hij},\ \mu_{hk} + e_{i(h)} + \varepsilon_{hik}) = \mathrm{cov}(e_{i(h)} + \varepsilon_{hij},\ e_{i(h)} + \varepsilon_{hik})$$
$$= \mathrm{cov}(e_{i(h)}, e_{i(h)}) + \underbrace{\mathrm{cov}(e_{i(h)}, \varepsilon_{hik})}_{=0} + \underbrace{\mathrm{cov}(\varepsilon_{hij}, e_{i(h)})}_{=0} + \underbrace{\mathrm{cov}(\varepsilon_{hij}, \varepsilon_{hik})}_{=0} = \mathrm{var}(e_{i(h)}) = \sigma_e^2$$
for $j \neq k$, and
$$\mathrm{cov}(y_{hij}, y_{h'i'j'}) = 0 \quad \text{for } h \neq h' \text{ or } i \neq i'$$
(i.e., the covariance is 0 between subjects).

From these results we see that the correlation is zero between observations on different subjects, but
$$\mathrm{corr}(y_{hij}, y_{hik}) = \frac{\sigma_e^2}{\sigma^2 + \sigma_e^2} \equiv \rho, \qquad j \neq k.$$

• The within-subject correlation $\rho$ here is called the intra-class correlation.


The within subject covariance structure we have just described can be represented succinctly as
$$\mathrm{var}(y_{hi}) = \begin{pmatrix} \sigma^2 + \sigma_e^2 & \sigma_e^2 & \sigma_e^2 & \cdots & \sigma_e^2 \\ \sigma_e^2 & \sigma^2 + \sigma_e^2 & \sigma_e^2 & \cdots & \sigma_e^2 \\ \sigma_e^2 & \sigma_e^2 & \sigma^2 + \sigma_e^2 & \cdots & \sigma_e^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma_e^2 & \sigma_e^2 & \sigma_e^2 & \cdots & \sigma^2 + \sigma_e^2 \end{pmatrix} = (\sigma^2 + \sigma_e^2)\left[(1-\rho)I_t + \rho J_{tt}\right],$$
where $I_t$ is the $t \times t$ identity matrix and $J_{tt}$ is a $t \times t$ matrix of ones.

• This variance-covariance structure is called compound symmetry.
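
The equivalence of the entrywise form and the $(1-\rho)I_t + \rho J_{tt}$ form is easy to verify numerically. A sketch in Python with NumPy; the variance-component values here are illustrative, not taken from the notes:

```python
import numpy as np

# Compound symmetry two ways: entrywise (sigma^2 + sigma_e^2 on the diagonal,
# sigma_e^2 off it), and via (sigma^2 + sigma_e^2)[(1-rho) I_t + rho J_tt].
# The values of sigma2 and sigma2e are made up for illustration.
sigma2, sigma2e, t = 2.0, 1.5, 4
rho = sigma2e / (sigma2 + sigma2e)          # intra-class correlation

V_entry = sigma2 * np.eye(t) + sigma2e * np.ones((t, t))
V_cs = (sigma2 + sigma2e) * ((1 - rho) * np.eye(t) + rho * np.ones((t, t)))
```

The two matrices are identical, and each off-diagonal correlation equals $\rho$.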

Recall that for a linear model of the form
$$y = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad \mathrm{var}(\varepsilon) = \sigma^2 V,$$
where $V$ is a known positive definite matrix, the best linear unbiased estimator (BLUE) of a vector of estimable functions $C\beta$ is given by $C\hat{\beta}_{GLS}$, where $\hat{\beta}_{GLS}$ is a generalized least squares (GLS) estimator of the form
$$\hat{\beta}_{GLS} = (X^T V^{-1} X)^- X^T V^{-1} y.$$

• Under compound symmetry it is not hard to show that the GLS estimator is equivalent to the ordinary least squares (OLS) estimator $C\hat{\beta}_{OLS}$, where
$$\hat{\beta}_{OLS} = (X^T X)^- X^T y$$
(see Rencher, Example 7.8.1; or Graybill, Corollary 6.8.1.2).
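
This equivalence can be checked numerically. A sketch in Python with NumPy, for a single cluster with a compound-symmetric $V$ and a design matrix containing an intercept; the response values and design are illustrative only:

```python
import numpy as np

# Numerical check that GLS = OLS under compound symmetry when the design
# matrix contains an intercept column. Values are made up for illustration.
t, rho = 6, 0.4
V = (1 - rho) * np.eye(t) + rho * np.ones((t, t))   # compound symmetry (unit scale)

X = np.column_stack([np.ones(t), np.arange(t, dtype=float)])  # intercept + trend
y = np.array([1.2, 0.7, 1.9, 2.4, 2.1, 3.0])

Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The two coefficient vectors agree to machine precision, as the cited results guarantee.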


For balanced split-plot experiments, it is easy to show that the marginal means $\bar{\mu}_{h\cdot}$, $\bar{\mu}_{\cdot j}$ and the joint means $\mu_{hj}$ are all estimable, and have BLUEs
$$\hat{\bar{\mu}}_{h\cdot} = \bar{y}_{h\cdot\cdot}, \qquad \hat{\bar{\mu}}_{\cdot j} = \bar{y}_{\cdot\cdot j}, \qquad \hat{\mu}_{hj} = \bar{y}_{h\cdot j}.$$

Standard errors for these quantities are defined as the estimated standard deviations. Therefore, we need the variances of these estimators.

$$\mathrm{var}(\bar{y}_{h\cdot\cdot}) = \mathrm{var}\!\left(\frac{1}{n_h t}\sum_i j_t^T y_{hi}\right) = \frac{1}{(n_h t)^2}\sum_i j_t^T\,\mathrm{var}(y_{hi})\,j_t = \frac{1}{(n_h t)^2}\sum_i \left[t(\sigma^2 + \sigma_e^2) + (t^2 - t)\sigma_e^2\right]$$
$$= \frac{1}{(n_h t)^2}\sum_i \left[t\sigma^2 + t^2\sigma_e^2\right] = \frac{n_h t}{(n_h t)^2}(\sigma^2 + t\sigma_e^2) = \frac{1}{n_h t}(\sigma^2 + t\sigma_e^2),$$
where $j_t$ denotes the $t \times 1$ vector of ones.

In addition,
$$\mathrm{var}(\bar{y}_{\cdot\cdot j}) = \mathrm{var}\!\left(\frac{1}{s n_h}\sum_h \sum_i y_{hij}\right) = \frac{1}{(s n_h)^2}\sum_h \sum_i \mathrm{var}(y_{hij}) = \frac{1}{(s n_h)^2}\sum_h \sum_i (\sigma^2 + \sigma_e^2) = \frac{\sigma^2 + \sigma_e^2}{s n_h},$$
and, similarly,
$$\mathrm{var}(\bar{y}_{h\cdot j}) = \mathrm{var}\!\left(\frac{1}{n_h}\sum_i y_{hij}\right) = \frac{n_h}{n_h^2}\,\mathrm{var}(y_{hij}) = \frac{1}{n_h}(\sigma^2 + \sigma_e^2).$$


In the case of $\bar{y}_{h\cdot\cdot}$, its variance is easy to estimate because $E(MS_{WPE}) = \sigma^2 + t\sigma_e^2$. So,
$$\mathrm{s.e.}(\bar{y}_{h\cdot\cdot}) = \sqrt{\widehat{\mathrm{var}}(\bar{y}_{h\cdot\cdot})} = \sqrt{\frac{MS_{WPE}}{n_h t}}.$$
However, $\mathrm{var}(\bar{y}_{\cdot\cdot j})$ and $\mathrm{var}(\bar{y}_{h\cdot j})$ both involve $\sigma^2 + \sigma_e^2$, which is not the expected value of any mean square in the ANOVA.

We do have estimates of $\sigma^2$ and $\sigma_e^2$ individually, though; namely, $MS_{SPE}$ and $(MS_{WPE} - MS_{SPE})/t$ (the method of moments estimators derived above). So,
$$\mathrm{s.e.}(\bar{y}_{\cdot\cdot j}) = \sqrt{\frac{\widehat{\sigma^2 + \sigma_e^2}}{s n_h}}, \qquad \mathrm{s.e.}(\bar{y}_{h\cdot j}) = \sqrt{\frac{\widehat{\sigma^2 + \sigma_e^2}}{n_h}},$$
where
$$\widehat{\sigma^2 + \sigma_e^2} = MS_{SPE} + \frac{MS_{WPE} - MS_{SPE}}{t} = \frac{t-1}{t} MS_{SPE} + \frac{1}{t} MS_{WPE}.$$
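
The two forms of this combined estimator are algebraically identical, which is easy to confirm numerically. A sketch in Python using the chocolate cake mean squares quoted in these notes ($MS_{SPE} = 20.4709$, $MS_{WPE} = 271.493$, $t = 6$ temperatures, $s = 3$ recipes, $n_h = 15$ batches per recipe):

```python
import math

# Combined estimator of sigma^2 + sigma_e^2, evaluated both ways with the
# chocolate cake mean squares quoted in these notes.
MS_SPE, MS_WPE, t, s, nh = 20.4709, 271.493, 6, 3, 15

est1 = MS_SPE + (MS_WPE - MS_SPE) / t
est2 = (t - 1) / t * MS_SPE + MS_WPE / t

se_split_mean = math.sqrt(est1 / (s * nh))   # s.e. of a split plot marginal mean
se_joint_mean = math.sqrt(est1 / nh)         # s.e. of a joint mean
```

Both forms give $\widehat{\sigma^2 + \sigma_e^2} = 20.4709 + 41.8370 \approx 62.31$.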


CIs and contrasts for marginal means for whole plot factor:

Confidence intervals and hypothesis tests on $\bar{\mu}_{h\cdot}$ are based on the pivotal quantity
$$t = \frac{\bar{y}_{h\cdot\cdot} - \bar{\mu}_{h\cdot}}{\mathrm{s.e.}(\bar{y}_{h\cdot\cdot})} = \frac{\bar{y}_{h\cdot\cdot} - \bar{\mu}_{h\cdot}}{\sqrt{MS_{WPE}/(n_h t)}} \sim t(\mathrm{d.f.}_{WPE}) = t(n-s).$$

This leads to a $100(1-\alpha)\%$ CI for $\bar{\mu}_{h\cdot}$ given by
$$\bar{y}_{h\cdot\cdot} \pm t_{1-\alpha/2}(\mathrm{d.f.}_{WPE})\,\mathrm{s.e.}(\bar{y}_{h\cdot\cdot}) = \bar{y}_{h\cdot\cdot} \pm t_{1-\alpha/2}(n-s)\sqrt{\frac{MS_{WPE}}{n_h t}}.$$

For a contrast $\psi = \sum_h c_h \bar{\mu}_{h\cdot}$ with sample estimator $\hat{C} = \sum_h c_h \bar{y}_{h\cdot\cdot}$, we test $H_0: \psi = 0$ via $t$ or $F$ tests.

The appropriate $t$ test statistic is
$$t = \frac{|\hat{C}|}{\mathrm{s.e.}(\hat{C})}, \quad \text{which we compare to } t_{1-\alpha/2}(\mathrm{d.f.}_{WPE})$$
for an $\alpha$-level test. Equivalently, we can use an $F$ test:
$$F = t^2, \quad \text{which we compare to } F_{1-\alpha}(1, \mathrm{d.f.}_{WPE}).$$

The standard error for a contrast in the whole plot groups is given by
$$\mathrm{s.e.}(\hat{C}) = \sqrt{\widehat{\mathrm{var}}\!\left(\sum_h c_h \bar{y}_{h\cdot\cdot}\right)} = \sqrt{\sum_h c_h^2\,\widehat{\mathrm{var}}(\bar{y}_{h\cdot\cdot})} = \sqrt{\frac{MS_{WPE}}{n_h t}\sum_h c_h^2}.$$

A $100(1-\alpha)\%$ CI for $\psi$ is given by
$$\hat{C} \pm t_{1-\alpha/2}(\mathrm{d.f.}_{WPE})\,\mathrm{s.e.}(\hat{C}).$$


CIs and contrasts for marginal means for split plot factor:

Confidence intervals and hypothesis tests on $\bar{\mu}_{\cdot j}$ are based on the pivotal quantity
$$t = \frac{\bar{y}_{\cdot\cdot j} - \bar{\mu}_{\cdot j}}{\mathrm{s.e.}(\bar{y}_{\cdot\cdot j})} = \frac{\bar{y}_{\cdot\cdot j} - \bar{\mu}_{\cdot j}}{\sqrt{[(t-1)MS_{SPE} + MS_{WPE}]/(s n_h t)}}.$$

• However, now this quantity is not distributed exactly as a Student's $t$!

Why?

Because the denominator doesn't involve a single mean square (a $\chi^2$ divided by its d.f.), but instead involves a linear combination of mean squares.

What's the distribution of a linear combination of independent mean squares in normally distributed random variables?

An approximate answer is given by Satterthwaite's formula. Satterthwaite showed that a linear combination of independent mean squares of the form $MS = a_1 MS_1 + \cdots + a_k MS_k$ is approximately $\chi^2$ with approximate degrees of freedom given by
$$\mathrm{d.f.} = \frac{(MS)^2}{\dfrac{(a_1 MS_1)^2}{\mathrm{d.f.}_1} + \cdots + \dfrac{(a_k MS_k)^2}{\mathrm{d.f.}_k}},$$
where here $\mathrm{d.f.}_i$ is the d.f. associated with $MS_i$ and the $a_i$'s are constants.

In our case, $MS$ is $(t-1)MS_{SPE} + MS_{WPE}$, so we have
$$\nu = \frac{[(t-1)MS_{SPE} + MS_{WPE}]^2}{\dfrac{[(t-1)MS_{SPE}]^2}{\mathrm{d.f.}_{SPE}} + \dfrac{(MS_{WPE})^2}{\mathrm{d.f.}_{WPE}}}.$$
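
Satterthwaite's formula is simple to code. A sketch in Python, checked against the worked numbers for the chocolate cake example later in these notes ($t = 6$; $MS_{SPE} = 20.471$ on 210 d.f.; $MS_{WPE} = 271.493$ on 42 d.f.):

```python
# Satterthwaite's approximate d.f. for a linear combination of independent
# mean squares MS = a_1*MS_1 + ... + a_k*MS_k.
def satterthwaite_df(ms, df, a):
    total = sum(ai * mi for ai, mi in zip(a, ms))
    denom = sum((ai * mi) ** 2 / di for ai, mi, di in zip(a, ms, df))
    return total ** 2 / denom

t = 6
nu = satterthwaite_df(ms=[20.471, 271.493], df=[210, 42], a=[t - 1, 1.0])
```

With these inputs, `nu` reproduces the value $\nu = 77.4$ used in the example.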


Thus the pivotal quantity has an approximate $t$ distribution:
$$t = \frac{\bar{y}_{\cdot\cdot j} - \bar{\mu}_{\cdot j}}{\mathrm{s.e.}(\bar{y}_{\cdot\cdot j})} = \frac{\bar{y}_{\cdot\cdot j} - \bar{\mu}_{\cdot j}}{\sqrt{[(t-1)MS_{SPE} + MS_{WPE}]/(s n_h t)}} \,\dot\sim\, t(\nu).$$

This leads to an approximate $100(1-\alpha)\%$ CI for $\bar{\mu}_{\cdot j}$ given by
$$\bar{y}_{\cdot\cdot j} \pm t_{1-\alpha/2}(\nu)\,\mathrm{s.e.}(\bar{y}_{\cdot\cdot j}).$$

For a contrast $\psi = \sum_j c_j \bar{\mu}_{\cdot j}$, we have the sample estimator $\hat{C} = \sum_j c_j \bar{y}_{\cdot\cdot j}$. Despite the fact that $\mathrm{s.e.}(\bar{y}_{\cdot\cdot j})$ involves two mean squares, it turns out that $\mathrm{s.e.}(\hat{C})$ involves only one, so no Satterthwaite approximation is necessary for a contrast in the $\bar{\mu}_{\cdot j}$'s. To see this, note
$$\hat{C} = \sum_j c_j \bar{y}_{\cdot\cdot j} = \sum_j c_j \frac{1}{s n_h} \sum_h \sum_i y_{hij} = \sum_j c_j \frac{1}{s n_h} \sum_h \sum_i (\mu_{hj} + e_{i(h)} + \varepsilon_{hij})$$
$$= \sum_j c_j (\bar{\mu}_{\cdot j} + \bar{e}_{\cdot(\cdot)} + \bar{\varepsilon}_{\cdot\cdot j}) = \sum_j c_j \bar{\mu}_{\cdot j} + \bar{e}_{\cdot(\cdot)} \underbrace{\sum_j c_j}_{=0} + \sum_j c_j \bar{\varepsilon}_{\cdot\cdot j}.$$

So,
$$\mathrm{var}(\hat{C}) = \mathrm{var}\!\left(\sum_j c_j \bar{\varepsilon}_{\cdot\cdot j}\right) = \sum_j c_j^2\,\mathrm{var}(\bar{\varepsilon}_{\cdot\cdot j}) = \sum_j c_j^2 \frac{\sigma^2}{s n_h} = \frac{\sigma^2}{s n_h} \sum_j c_j^2.$$

Therefore, for this kind of contrast,
$$\mathrm{s.e.}(\hat{C}) = \sqrt{\frac{MS_{SPE}}{s n_h} \sum_j c_j^2}.$$

Based on this result, we can test $H_0: \psi = 0$ with an exact $t$ or $F$ test. Our test statistic is
$$t = \frac{|\hat{C}|}{\mathrm{s.e.}(\hat{C})}, \quad \text{which we compare to } t_{1-\alpha/2}(\mathrm{d.f.}_{SPE})$$
for an $\alpha$-level test. Equivalently, we can compute
$$F = t^2, \quad \text{which we compare to } F_{1-\alpha}(1, \mathrm{d.f.}_{SPE}).$$
A $100(1-\alpha)\%$ CI for $\psi$ is given by
$$\hat{C} \pm t_{1-\alpha/2}(\mathrm{d.f.}_{SPE})\,\mathrm{s.e.}(\hat{C}).$$


Joint Means:

As for marginal means of the split plot groups, the variance of the joint mean estimator $\bar{y}_{h\cdot j}$ involves $\sigma^2 + \sigma_e^2$, which we must estimate with $[(t-1)MS_{SPE} + MS_{WPE}]/t$ rather than just a single mean square.

Satterthwaite's formula yields
$$\frac{\bar{y}_{h\cdot j} - \mu_{hj}}{\mathrm{s.e.}(\bar{y}_{h\cdot j})} = \frac{\bar{y}_{h\cdot j} - \mu_{hj}}{\sqrt{[(t-1)MS_{SPE} + MS_{WPE}]/(n_h t)}} \,\dot\sim\, t(\nu).$$

This leads to an approximate $100(1-\alpha)\%$ CI for the joint mean $\mu_{hj}$ given by
$$\bar{y}_{h\cdot j} \pm t_{1-\alpha/2}(\nu)\,\mathrm{s.e.}(\bar{y}_{h\cdot j}).$$

Contrasts:

For a contrast in the joint means of the form $\psi = \sum_h \sum_j c_{hj}\mu_{hj}$, we estimate the contrast with $\hat{C} = \sum_h \sum_j c_{hj}\bar{y}_{h\cdot j}$, and we form test statistics in the usual way. That is, $t$ and $F$ statistics for $H_0: \psi = 0$ are given by
$$t = \frac{|\hat{C}|}{\mathrm{s.e.}(\hat{C})} \quad \text{and} \quad F = t^2,$$
respectively.

However, the formula for $\mathrm{s.e.}(\hat{C})$ and the distribution of the test statistics depend on the nature of the contrast in the joint means. For certain contrasts in the joint means, $\mathrm{s.e.}(\hat{C})$ involves only one mean square; for others, $\mathrm{s.e.}(\hat{C})$ involves two mean squares.

The former case is easy and yields an exact distribution; the latter case is harder because we need Satterthwaite's formula again.


Case 1 (the easy case): Contrasts across split plot groups but within a single whole plot group:

In this case, $\hat{C} = \sum_h \sum_j c_{hj}\bar{y}_{h\cdot j}$ has standard error
$$\mathrm{s.e.}(\hat{C}) = \sqrt{\frac{MS_{SPE}}{n_h} \sum_h \sum_j c_{hj}^2}.$$
Exact tests and confidence intervals for these contrasts are then based on the fact that the $t$ and $F$ test statistics are distributed as
$$t \sim t(\mathrm{d.f.}_{SPE}), \qquad F = t^2 \sim F(1, \mathrm{d.f.}_{SPE}).$$

Case 2 (the harder case): Contrasts involving more than one level of the whole plot factor:

In this case, $\hat{C} = \sum_h \sum_j c_{hj}\bar{y}_{h\cdot j}$ has standard error
$$\mathrm{s.e.}(\hat{C}) = \sqrt{\frac{(t-1)MS_{SPE} + MS_{WPE}}{n_h t} \sum_h \sum_j c_{hj}^2}.$$
The $t$ and $F$ statistics for contrasts of this type do not have exact $t$ and $F$ distributions. However, approximate tests and confidence intervals for these contrasts can be based on the approximations
$$t \,\dot\sim\, t(\nu), \qquad F \,\dot\sim\, F(1, \nu),$$
where, again, $\nu$ is given by Satterthwaite's formula above.


Example — Chocolate Cake (Continued)

• The DDFM=SATTERTH option on the MODEL statement tells SAS to use the Satterthwaite formula to obtain the correct degrees of freedom for F and t tests. These are denominator d.f. (hence DDFM) for F tests and d.f. for t tests.

• By default, PROC MIXED will use the "containment" method for computing these d.f. if a RANDOM statement is included in the call to PROC MIXED. This is a method which attempts to guess the right d.f. from the syntax of the MODEL statement (see the SAS documentation for details). However, the containment method can be wrong. Its advantage is that it takes fewer computational resources than the Satterthwaite method.

• As an example of the Satterthwaite method, consider estimation of µ11 − µ21, the difference between the mean response for cakes made with recipe 1, temperature 1 versus cakes made with recipe 2, temperature 1.

• This is a "case 2" scenario, so we must use Satterthwaite's formula. From the formula on the bottom of p.23,

$$\nu = \frac{[(t-1)MS_{SPE} + MS_{WPE}]^2}{\dfrac{[(t-1)MS_{SPE}]^2}{\text{d.f.}_{SPE}} + \dfrac{(MS_{WPE})^2}{\text{d.f.}_{WPE}}} = \frac{[(6-1)20.471 + 271.493]^2}{\dfrac{[(6-1)20.471]^2}{210} + \dfrac{(271.493)^2}{42}} = 77.4.$$

The estimate is ȳ_1·1 − ȳ_2·1 = 29.1333 − 26.8667 = 2.2667, with a standard error of

$$\sqrt{\frac{(t-1)MS_{SPE} + MS_{WPE}}{n_h t} \sum_h \sum_j c_{hj}^2} = \sqrt{\frac{(6-1)20.471 + 271.493}{15(6)}\,[1^2 + (-1)^2]} = 2.8823.$$


• So, an approximate 95% CI for µ11 − µ21 is

$$2.2667 \pm t_{.975}(77.4)(2.8823) = (-3.47,\ 8.01),$$

and an approximate .05-level t test of H0 : µ11 = µ21 compares

$$t = 2.2667/2.8823 = 0.79$$

to a t(77.4) distribution. Equivalently, an F test can be used where we compare F = t² = 0.79² = 0.62 to an F(1, 77.4) distribution. Both of these tests give a p-value of .4340, so we fail to reject H0 at level α = .05.
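The Satterthwaite degrees of freedom and the case-2 standard error above are easy to check numerically. A minimal Python sketch using the values from this example (the function name is ours):

```python
from math import sqrt

def satterthwaite_df(ms_spe, ms_wpe, df_spe, df_wpe, t):
    """Approximate d.f. for the combination (t-1)*MS_SPE + MS_WPE."""
    num = ((t - 1) * ms_spe + ms_wpe) ** 2
    den = ((t - 1) * ms_spe) ** 2 / df_spe + ms_wpe ** 2 / df_wpe
    return num / den

# Values from the chocolate cake example: t = 6 temperatures,
# n_h = 15 replicate batches per recipe
ms_spe, ms_wpe, t, n_h = 20.471, 271.493, 6, 15
nu = satterthwaite_df(ms_spe, ms_wpe, df_spe=210, df_wpe=42, t=t)

# s.e. of the "case 2" contrast mu_11 - mu_21 (coefficients +1, -1)
se = sqrt(((t - 1) * ms_spe + ms_wpe) / (n_h * t) * (1**2 + (-1)**2))

print(round(nu, 1), round(se, 4))  # 77.4 2.8823
```

The printed values reproduce the hand computations above.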

We have presented the simplest split plot design, in which whole plots occur in a one-way layout. Other, more complicated split plot designs are possible. In practice, the most common split plot design is one in which the whole plots occur in a randomized complete block design.

• For example, suppose the chocolate cake experiment was conducted over 15 days, and the batches were blocked by day. That is, on day one 1 batch from each recipe was prepared and the resulting cakes baked. On day two 1 more batch from each recipe was prepared, etc.

In such a situation, the appropriate model becomes

$$y_{hij} = \mu + \alpha_h + b_i + e_{hi} + \beta_j + (\alpha\beta)_{hj} + \varepsilon_{hij},$$

where y_hij is the response on the jth split plot in the hth whole plot group in the ith whole plot block.

Here, e_hi and ε_hij are the whole plot and split plot error terms, respectively, with similar assumptions as in the previous model. In addition, b_i represents a whole plot block effect. If considered random, which is typically most appropriate, the b_i's are assumed i.i.d. N(0, σ²_b) and independent of the e_hi's and the ε_hij's.

• Rather than treat this model in any detail, we move on to the repeated measures ANOVA (RM-ANOVA). This example will be covered by the general theory of LMMs, which we are working toward.


RM-ANOVA:

The repeated measures ANOVA (RM-ANOVA) is based upon the similarity between repeated measures designs and split-plot designs.

Consider again the Methemoglobin in Sheep example. This is a one-way layout with repeated measures over 6 time points. It has much the same structure as the chocolate cake example. This suggests using the same analysis.

• However, there is one important difference: In the RM design, the "split-plot" factor, time, is not randomized!

Instead, each experimental unit is measured under "time 1" first, "time 2" second, etc. This means that observations are subject to serial correlation as well as shared-characteristics, or clustering-type, correlation.

Recall that in the split-plot model, y_hi, the vector of observations on the (h, i)th whole plot, had the compound symmetry variance-covariance structure:

$$\text{var}(\mathbf{y}_{hi}) = (\sigma^2_e + \sigma^2)\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix},$$

where ρ = σ²_e/(σ² + σ²_e).

• Often, this seems an inappropriate variance-covariance structure for repeated measures. Typically, we would expect observations taken close together in time to be more highly correlated than observations taken far apart.

• That is, we often expect a decaying correlation structure through time, rather than constant correlation through time.


Sphericity:

It turns out that compound symmetry is a sufficient but not necessary condition for the F tests from the split-plot analysis to be valid for the RM-ANOVA design.

A more general condition, known as sphericity, is necessary and sufficient. Sphericity can be expressed in several different, but equivalent, ways. In particular, sphericity is equivalent to

1. the variances of all pairwise differences between repeated measures are equal; that is, var(y_hij − y_hik) is constant for all j, k.

2. ϵ = 1, where

$$\epsilon = \frac{t^2(\bar\sigma_{jj} - \bar\sigma_{\cdot\cdot})^2}{(t-1)\left(\sum_{j=1}^t \sum_{j'=1}^t \sigma_{jj'}^2 \;-\; 2t\sum_{j=1}^t \bar\sigma_{j\cdot}^2 \;+\; t^2\bar\sigma_{\cdot\cdot}^2\right)},$$

where

σ̄·· = mean of all elements of Σ ≡ var(y_hi),
σ̄_jj = mean of the elements on the main diagonal of Σ,
σ̄_j· = mean of the elements in row j of Σ.

• Since compound symmetry means that var(y_hij) and cov(y_hij, y_hij′) are constant for all i, j, j′, it is clear that

$$\text{var}(y_{hij} - y_{hij'}) = \text{var}(y_{hij}) + \text{var}(y_{hij'}) - 2\,\text{cov}(y_{hij}, y_{hij'})$$

is constant for all j, j′. Therefore, compound symmetry is a special case of sphericity.

• Sphericity is a more general (weaker) condition than compound symmetry, and mathematically, it is all that is needed. However, it is difficult to envision a realistic form for Σ in which sphericity would hold and compound symmetry would not.
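The ϵ defined above can be computed directly from any covariance matrix. A small pure-Python sketch (the function name is ours); for a compound-symmetric Σ it returns exactly 1, illustrating that compound symmetry implies sphericity:

```python
def box_epsilon(sigma):
    """Epsilon for a t x t covariance matrix given as a list of rows."""
    t = len(sigma)
    overall = sum(sum(row) for row in sigma) / t**2      # mean of all elements
    diag = sum(sigma[j][j] for j in range(t)) / t        # mean of main diagonal
    row_means = [sum(row) / t for row in sigma]          # mean of each row
    num = t**2 * (diag - overall) ** 2
    den = (t - 1) * (sum(x**2 for row in sigma for x in row)
                     - 2 * t * sum(m**2 for m in row_means)
                     + t**2 * overall**2)
    return num / den

# Compound symmetry (variance 2.0, covariance 0.5) satisfies sphericity:
cs = [[2.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
print(box_epsilon(cs))  # 1.0
```

For non-spherical matrices the function returns a value between 1/(t − 1) and 1, in line with the bounds discussed below.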


Mauchly has proposed a test for sphericity. This test is of limited practical use for several reasons:

• Low power in small samples.

• In large samples, the test is likely to reject sphericity even when non-sphericity has little effect on the validity of the split-plot F tests.

• Sensitive to departures from normality.

• Very sensitive to outliers.

It can be shown that sphericity holds when ϵ = 1 and maximum non-sphericity holds when ϵ = 1/(t − 1).

Under non-sphericity, it can be shown that the F test statistics for the repeated measures factor (usually time) and interactions involving the repeated measures factor have approximate F distributions where the degrees of freedom are the usual degrees of freedom multiplied by ϵ.

• Therefore, the "fix" of the split-plot analysis is to multiply the numerator and denominator degrees of freedom by ϵ for all F tests on time and on interactions involving time.

Two estimators of ϵ are commonly used: Greenhouse-Geisser and Huynh-Feldt. ϵ̂_GG is simply ϵ computed on the sample variance-covariance matrix S rather than Σ, the population (true) variance-covariance matrix. ϵ̂_HF is defined as

$$\hat\epsilon_{HF} = \min\left\{1,\; \underbrace{\frac{n_\cdot(t-1)\hat\epsilon_{GG} - 2}{(t-1)\left(\text{d.f.}_{WPE} - (t-1)\hat\epsilon_{GG}\right)}}_{\text{value given by SAS}}\right\},$$

where n· = the total number of "whole plots".

• Note that ϵ ≤ 1, so the value given by SAS should always be rounded down to 1, if it is > 1.
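The ϵ̂_HF formula can be checked against the sheep example discussed below, where SAS reports ϵ̂_GG = .2610 and ϵ̂_HF = .3551. A short Python sketch; here we assume n· = 12 sheep and d.f._WPE = 12 − 3 = 9 (our reading of that design), which reproduces the reported value up to the rounding of ϵ̂_GG:

```python
def huynh_feldt(eps_gg, n_dot, t, df_wpe):
    """Huynh-Feldt estimate computed from the Greenhouse-Geisser estimate."""
    raw = (n_dot * (t - 1) * eps_gg - 2) / ((t - 1) * (df_wpe - (t - 1) * eps_gg))
    return min(1.0, raw)  # epsilon <= 1, so truncate values above 1

# Sheep example: 12 sheep ("whole plots"), s = 3 groups, t = 6 times
print(round(huynh_feldt(0.2610, n_dot=12, t=6, df_wpe=9), 3))  # 0.355
```

SAS's .3551 comes from the unrounded ϵ̂_GG, so the agreement here is to three decimal places.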


Which ϵ? For true ϵ ≤ 0.5 (greater non-sphericity), ϵ̂_GG is better. For true ϵ ≥ 0.75 (less non-sphericity), ϵ̂_HF is better. In practice, we don't know the true value of ϵ, so often it is hard to say which is better.

• ϵ̂_GG tends to give a larger adjustment (makes it harder to reject H0) than ϵ̂_HF, so if we desire to be conservative (slow to reject), then we should use ϵ̂_GG.

If we have a program like SAS that can compute ϵ̂_GG and ϵ̂_HF and corresponding adjusted p-values easily, then we should always go ahead and do the adjustment for non-sphericity.

• There is no down-side here, because if the data are spherical, then we should get ϵ̂_GG = ϵ̂_HF = 1, and the adjustment will end up not altering the split-plot analysis at all (which is what we would want). If the data are non-spherical, then an appropriate adjustment will be done.

• If we want to avoid computing ϵ, then a very conservative approach is to use the adjustment for maximum non-sphericity. That is, multiply the numerator and denominator d.f. of the F tests by (t − 1)⁻¹.

• Alternatively, use the Greenhouse-Geisser algorithm (see Davis, §5.3.2).


Example — Methemoglobin in Sheep:

• See sheep1.sas. In this SAS program, PROC MIXED is used first to perform the split-plot analysis exactly as in the chocolate cake example, and then the REPEATED statement in PROC GLM is used to reproduce this split-plot analysis as an RM-ANOVA analysis. That is, both PROCs fit the model

$$y_{hij} = \mu_{hj} + e_{i(h)} + \varepsilon_{hij},$$

where µ_hj = µ + α_h + β_j + (αβ)_hj, {e_i(h)} iid∼ N(0, σ²_e), {ε_hij} iid∼ N(0, σ²), and the e_i(h)'s and ε_hij's are independent of each other.

• The two sets of results are basically the same, but PROC GLM gives Mauchly's test and Greenhouse-Geisser and Huynh-Feldt adjusted p-values for tests involving time.

• For the basic ANOVA table and F tests, both procedures give the correct RM-ANOVA results. However, PROC GLM will still not give the correct inferences on means and contrasts.

• PROC MIXED would give the correct inferences under the assumption of sphericity if we were to add ESTIMATE and CONTRAST statements to the program in sheep1.sas. In addition, PROC MIXED can be made to "fix" the split-plot analysis for non-sphericity, but with a more sophisticated fix than that which we've discussed and that which is implemented in PROC GLM with the REPEATED statement. We'll get to this later.

• The basic results of the analysis are as follows:

– According to Mauchly's test (the one labelled "Orthogonal Components" on p.8), there is significant evidence of non-sphericity, so we should use adjusted p-values for tests on time and no2×time. The estimates of ϵ are ϵ̂_GG = .2610 and ϵ̂_HF = .3551, which are both pretty far from 1, indicating non-sphericity.


– At level α = .05, we reject the hypothesis of no interaction with either adjustment (HF or GG, see p.10). The nature of the interaction can be seen in the profile plot on p.5. From the profile plot it appears that we should make inferences on time separately within each of the no2 groups, but it does seem meaningful to test for a main effect of no2.

– The main effect test of no2 is significant (p = .0026, see p.9) and needs no adjustment. It is clear that the mean response is higher (at least after time 1) for increasing levels of no2.

– The main effect of time is significant, but, as noted above, the pattern over time seems to be different enough from one no2 group to the next that it really would be more appropriate to compare times separately within each no2 group.

– A natural set of contrasts to examine here would be linear and nonlinear contrasts over time separately in each no2 group. E.g., for no2 group 1, assuming that the times were equally spaced, these contrasts would look like this in SAS:

contrast 'linear time, no2=1'
  time -5 -3 -1 1 3 5
  no2*time -5 -3 -1 1 3 5  0 0 0 0 0 0  0 0 0 0 0 0;

contrast 'nonlinear time, no2=1'
  time 5 -1 -4 -4 -1 5
  no2*time 5 -1 -4 -4 -1 5  0 0 0 0 0 0  0 0 0 0 0 0,
  time -5 7 4 -4 -7 5
  no2*time -5 7 4 -4 -7 5  0 0 0 0 0 0  0 0 0 0 0 0,
  time 1 -3 2 2 -3 1
  no2*time 1 -3 2 2 -3 1  0 0 0 0 0 0  0 0 0 0 0 0,
  time -1 5 -10 10 -5 1
  no2*time -1 5 -10 10 -5 1  0 0 0 0 0 0  0 0 0 0 0 0;


Multivariate Methods for Repeated Measures

The Multivariate Linear Model:

Suppose we have a t-component response vector y_i = (y_i1, ..., y_it)ᵀ on the ith of n subjects, and suppose that y_ij is generated from the linear model

$$y_{ij} = \mathbf{x}_i^T\boldsymbol{\beta}_j + \varepsilon_{ij}, \qquad i = 1,\ldots,n,\ j = 1,\ldots,t,$$

where x_i = (x_i1, ..., x_ip)ᵀ is a vector of p explanatory variables specific to the ith subject (but constant over the t components of the response), and β_j = (β_1j, ..., β_pj)ᵀ is a vector of unknown parameters specific to the jth component of the response.

Let ε_i = (ε_i1, ..., ε_it)ᵀ denote the vector of error terms for the ith subject. We assume that the t components of the response are correlated within a given subject, so we assume

$$\boldsymbol{\varepsilon}_i \sim N_t(\mathbf{0}, \Sigma).$$

• In applications to repeated measures data, the t components of the response vector correspond to the response measured at t distinct time points. So, Σ describes the variance-covariance structure through time.

• To ensure that Σ is positive definite, we assume p ≤ n − t.

We also assume independence between subjects, so

$$\boldsymbol{\varepsilon} \equiv \begin{pmatrix} \boldsymbol{\varepsilon}_1 \\ \vdots \\ \boldsymbol{\varepsilon}_n \end{pmatrix} \sim N_{nt}(\mathbf{0}, I_n \otimes \Sigma).$$


• Here, ⊗ denotes the Kronecker (aka direct) product. W ⊗ Z, the Kronecker product of matrices W_{a×b} and Z_{c×d}, is the ac × bd matrix

$$W \otimes Z = \begin{pmatrix} w_{11}Z & w_{12}Z & \cdots & w_{1b}Z \\ w_{21}Z & w_{22}Z & \cdots & w_{2b}Z \\ \vdots & \vdots & \ddots & \vdots \\ w_{a1}Z & w_{a2}Z & \cdots & w_{ab}Z \end{pmatrix}.$$

Thus, I_n ⊗ Σ is the nt × nt block-diagonal matrix

$$\begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix}.$$

• Thus, according to this model, y_1, ..., y_n are independent random vectors with y_i ∼ N_t(µ_i, Σ), where µ_i = (µ_i1, ..., µ_it)ᵀ and µ_ij = x_iᵀβ_j.

To express this model in matrix terms, let

$$Y = \begin{pmatrix} \mathbf{y}_1^T \\ \vdots \\ \mathbf{y}_n^T \end{pmatrix} = \begin{pmatrix} y_{11} & \cdots & y_{1t} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nt} \end{pmatrix}, \qquad X = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix},$$

where X is of rank p ≤ (n − t). Also, let

$$B = (\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_t) = \begin{pmatrix} \beta_{11} & \cdots & \beta_{1t} \\ \vdots & \ddots & \vdots \\ \beta_{p1} & \cdots & \beta_{pt} \end{pmatrix}, \qquad E = \begin{pmatrix} \boldsymbol{\varepsilon}_1^T \\ \vdots \\ \boldsymbol{\varepsilon}_n^T \end{pmatrix} = \begin{pmatrix} \varepsilon_{11} & \cdots & \varepsilon_{1t} \\ \vdots & \ddots & \vdots \\ \varepsilon_{n1} & \cdots & \varepsilon_{nt} \end{pmatrix}.$$

Then the multivariate linear model takes the form

$$Y = XB + E, \qquad \text{vec}(E^T) \sim N_{nt}(\mathbf{0}, I_n \otimes \Sigma). \tag{*}$$


Estimation:

The maximum likelihood and ordinary least squares estimator of B is

$$\hat{B} = (X^TX)^{-1}X^TY.$$

• Note that B̂ is equal to (β̂_1, ..., β̂_t), where β̂_j = (XᵀX)⁻¹Xᵀu_j is the usual least squares estimator based just on u_j, the jth column of Y.

• B̂ is the BLUE.
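The column-by-column property of B̂ is easy to verify numerically. A small pure-Python sketch with a toy design (n = 3, p = 2, t = 2; the helper names are ours):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# n = 3 subjects, p = 2 (intercept + one covariate), t = 2 response components
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
Y = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]

XtX_inv = inv2(matmul(transpose(X), X))
B_hat = matmul(XtX_inv, matmul(transpose(X), Y))  # p x t coefficient matrix

# Column j of B_hat is the OLS fit of the j-th column of Y on X:
print([[round(v, 6) for v in row] for row in B_hat])  # [[1.0, 4.0], [1.0, 2.0]]
```

The first column of Y has intercept 1 and slope 1; the second has intercept 4 and slope 2 — exactly the two columns of B̂.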

That the OLS estimator is the BLUE can be seen by writing the multivariate model as a univariate linear model. Let vec(M) denote the vector formed by stacking the columns of its matrix argument M. Then the multivariate linear model (*) can be written in univariate form as

$$\text{vec}(Y) = (I_t \otimes X)\text{vec}(B) + \text{vec}(E).$$

It is easily seen that the error term has moments E{vec(E)} = 0 and var{vec(E)} = Σ ⊗ I_n. Therefore, this model has the form of a GLS model, which implies that the GLS estimator would be BLUE.

However, the same theorem we alluded to in claiming OLS to yield BLUEs in the split-plot model (see p.19) applies here. That theorem says that in the univariate linear model

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad E(\boldsymbol{\varepsilon}) = \mathbf{0}, \qquad \text{var}(\boldsymbol{\varepsilon}) = \sigma^2 V,$$

the OLS estimator of β is BLUE iff C(VX) ⊂ C(X) (see Graybill, Thm 6.8.1, or Christensen, Thm 10.4.5).

This condition is easy to show here because it is a property of Kronecker products that (A ⊗ B)(C ⊗ D) = AC ⊗ BD for suitably conformable matrices. Therefore,

$$(\Sigma \otimes I_n)(I_t \otimes X) = \Sigma \otimes X = (I_t \otimes X)(\Sigma \otimes I_p).$$

So,

$$C([\Sigma \otimes I_n][I_t \otimes X]) = C([I_t \otimes X][\Sigma \otimes I_p]) \subset C(I_t \otimes X).$$
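The mixed-product property used in this argument can be checked numerically. A small pure-Python sketch (the function names are ours):

```python
def kron(W, Z):
    """Kronecker product of matrices given as lists of rows."""
    c, d = len(Z), len(Z[0])
    return [[W[i // c][j // d] * Z[i % c][j % d]
             for j in range(len(W[0]) * d)]
            for i in range(len(W) * c)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 1.0]]
C = [[2.0, 0.0], [1.0, 1.0]]
D = [[1.0, 2.0], [0.0, 1.0]]

# (A ⊗ B)(C ⊗ D) = AC ⊗ BD
lhs = matmul(kron(A, B), kron(C, D))
rhs = kron(matmul(A, C), matmul(B, D))
print(lhs == rhs)  # True
```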


The MLE of Σ is

$$\hat{\Sigma} = \frac{1}{n}(Y - X\hat{B})^T(Y - X\hat{B}).$$

However, this estimator is biased. An unbiased estimator of Σ is

$$S = \frac{1}{n-p}(Y - X\hat{B})^T(Y - X\hat{B}).$$

Estimation of linear functions of B is often of interest. Let ψ = aᵀBc, where a and c are p × 1 and t × 1 vectors of constants, respectively.

• Note that a operates as a contrast within time points; c operates as a contrast across time points.

The BLUE of ψ is

$$\hat{\psi} = \mathbf{a}^T\hat{B}\mathbf{c},$$

and it has variance

$$\text{var}(\hat{\psi}) = (\mathbf{c}^T\Sigma\mathbf{c})[\mathbf{a}^T(X^TX)^{-1}\mathbf{a}].$$

Hypothesis Testing:

Most hypotheses of interest in the multivariate linear model can be expressed as

$$H_0: ABC = D, \tag{†}$$

where

A (a × p) has rank a ≤ p and operates across subjects (w/in time),
C (t × c) has rank c ≤ t and operates across time (w/in subjects),
D (a × c) is a matrix of constants; often, D = 0_{a×c}.


• This framework is very general. E.g., setting A = I, D = 0 yields contrasts in time; C = I, D = 0 yields contrasts across subjects; A = I, C = I, D = 0 yields the hypothesis B = 0; etc.

There are four tests commonly used to test hypotheses of this form, and all of the test statistics are defined in terms of the hypothesis SSCP matrix and the error, or residual, SSCP matrix.

• Here, SSCP stands for sum of squares and cross-products. An SSCP is the multivariate analog of a sum of squares in the univariate linear model. Note that it is a matrix, not a scalar.

Recall that in the univariate linear model,

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 I_n),$$

an F test statistic for the hypothesis H0 : A_{a×p} β_{p×1} = d_{a×1} was given by

$$F = \frac{(A\hat{\boldsymbol{\beta}} - \mathbf{d})^T[A(X^TX)^{-1}A^T]^{-1}(A\hat{\boldsymbol{\beta}} - \mathbf{d})}{\mathbf{y}^T\mathbf{y} - \hat{\boldsymbol{\beta}}^TX^TX\hat{\boldsymbol{\beta}}} \cdot \frac{n-p}{a} = \frac{SSH}{SSE} \cdot \frac{n-p}{a}.$$

In the multivariate context, SSH becomes a matrix, the hypothesis SSCP, given by

$$Q_h = (A\hat{B}C - D)^T[A(X^TX)^{-1}A^T]^{-1}(A\hat{B}C - D),$$

and SSE becomes a matrix, the error SSCP, given by

$$Q_e = C^T[Y^TY - \hat{B}^T(X^TX)\hat{B}]C.$$

• Since these quantities are matrices in the multivariate context, we can no longer compare them as simply as in the univariate F test. That is, we cannot simply take their ratio. Even computing Q_hQ_e⁻¹ is not an option, because the result is not a scalar test statistic.


How do we compare the "sizes" of Q_h and Q_e with a scalar quantity?

Several answers have been put forward, leading to several different test statistics:

• Roy's test: based on the largest eigenvalue of Q_hQ_e⁻¹.

• Lawley and Hotelling's test: test statistic is tr(Q_hQ_e⁻¹).

• Pillai's test: test statistic is a function of tr[Q_h(Q_h + Q_e)⁻¹].

• Wilks' likelihood ratio test: Wilks' test statistic depends on Λ = |Q_e|/|Q_h + Q_e| and is equivalent to the LRT.

Unfortunately, none of these tests is "best" in all situations. However, all of these tests are approximately equivalent in large samples, and differ very little in power for small samples.

• Because LRTs have good properties in general, we will confine attention to Wilks' test.

There is no general result that gives the exact distribution of Wilks' test statistic. That is, the exact reference distribution for Wilks' test is not known, in general.

However, for some multivariate ANOVA (MANOVA) models that arise in profile analysis, there is an equivalence between Wilks' test and an F test, where the exact reference distribution is an F distribution.


In particular, in a one-way MANOVA model for comparing s groups (treatments) based upon a t-variate response, exact distributions are available for the following cases:

Case            Test Statistic                                 Distribution
t = 1, s ≥ 2    [(n−s)/(s−1)] · [(1−Λ)/Λ]                      F(s−1, n−s)
t = 2, s ≥ 2    [(n−s−1)/(s−1)] · [(1−√Λ)/√Λ]                  F(2(s−1), 2(n−s−1))
t ≥ 1, s = 2    [(n−t−1)/t] · [(1−Λ)/Λ]                        F(t, n−t−1)
t ≥ 1, s = 3    [(n−t−2)/t] · [(1−√Λ)/√Λ]                      F(2t, 2(n−t−2))

where n = Σ_{h=1}^s n_h is the total number of subjects.

• In other cases we rely on approximations to obtain p-values for Wilks' Lambda.

• A large-sample approximation due to Bartlett gives the following rejection rule to obtain an approximate α-level test: Reject H0 if

$$-\left(n - 1 - \frac{t+s}{2}\right)\log(\Lambda) > \chi^2_\alpha(t(s-1)).$$

• Other approximations are available to obtain approximate p-values when the total sample size n is small. These approximations are implemented in SAS and other computer programs and are quite good even for small sample sizes.


Profile Analysis:

We maintain the same notation and set-up: suppose we have repeated measures at t time points from s groups of subjects. Let n_h = the number of subjects in group h, and let n = Σ_{h=1}^s n_h. Let y_hij denote the observation at time j for subject i in group h.

We assume the response vectors y_hi = (y_hi1, ..., y_hit)ᵀ are independent, with

$$\mathbf{y}_{hi} \sim N(\boldsymbol{\mu}_h, \Sigma), \qquad \text{where } \boldsymbol{\mu}_h = \begin{pmatrix} \mu_{h1} \\ \vdots \\ \mu_{ht} \end{pmatrix},$$

and µ_hj = E(y_hij).

The profile analysis model is

$$y_{hij} = \mu_{hj} + \varepsilon_{hij}, \qquad \text{where } \boldsymbol{\varepsilon}_{hi} = \begin{pmatrix} \varepsilon_{hi1} \\ \vdots \\ \varepsilon_{hit} \end{pmatrix} \sim N(\mathbf{0}, \Sigma).$$

In terms of the multivariate general linear model, Y = XB + E, or

$$\begin{pmatrix} \mathbf{y}_{11}^T \\ \vdots \\ \mathbf{y}_{1n_1}^T \\ \mathbf{y}_{21}^T \\ \vdots \\ \mathbf{y}_{2n_2}^T \\ \vdots \\ \mathbf{y}_{s1}^T \\ \vdots \\ \mathbf{y}_{sn_s}^T \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \mu_{11} & \cdots & \mu_{1t} \\ \mu_{21} & \cdots & \mu_{2t} \\ \vdots & \ddots & \vdots \\ \mu_{s1} & \cdots & \mu_{st} \end{pmatrix} + \begin{pmatrix} \boldsymbol{\varepsilon}_{11}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{1n_1}^T \\ \boldsymbol{\varepsilon}_{21}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{2n_2}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{s1}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{sn_s}^T \end{pmatrix}.$$


Three general hypotheses are of interest in profile analysis:

H01: the mean profiles (over time) for the s groups are parallel (i.e., no group×time interaction);
H02: no differences among groups;
H03: no differences among time points.

• Note that H01 should be tested first, because the result of this test affects what form the other two hypotheses should take (H02 and H03 have been expressed in a purposely vague way here).

If H01 is accepted, then, under the assumption that no interaction is present, it is appropriate to test for no difference between groups by comparing the mean response in each group averaged over all time points, and it is appropriate to test no difference across time points by comparing the mean response at each time, averaged over groups.

If, however, we reject H01, then it may be more appropriate to test hypotheses of the form

H04: no difference among groups within some subset of the measurement occasions;
H05: no difference among time points in a particular group, or subset of groups;
H06: no difference within some subset of measurement occasions in a particular group or subset of groups.


Test of parallelism:

The hypothesis of parallelism is

$$H_{01}: \begin{pmatrix} \mu_{11} - \mu_{12} \\ \mu_{12} - \mu_{13} \\ \vdots \\ \mu_{1,t-1} - \mu_{1t} \end{pmatrix} = \begin{pmatrix} \mu_{21} - \mu_{22} \\ \mu_{22} - \mu_{23} \\ \vdots \\ \mu_{2,t-1} - \mu_{2t} \end{pmatrix} = \cdots = \begin{pmatrix} \mu_{s1} - \mu_{s2} \\ \mu_{s2} - \mu_{s3} \\ \vdots \\ \mu_{s,t-1} - \mu_{st} \end{pmatrix}.$$

• Testing this hypothesis is equivalent to conducting a one-way multivariate analysis of variance (MANOVA) on the t − 1 differences between adjacent time points from each subject.

In terms of the general form of the hypothesis, H01 can be expressed as ABC = D where

$$A_{(s-1)\times s} = (I_{s-1}, -\mathbf{j}_{s-1}), \qquad C_{t\times(t-1)} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -1 & 1 & \cdots & 0 \\ 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ 0 & 0 & \cdots & -1 \end{pmatrix}, \qquad D_{(s-1)\times(t-1)} = \mathbf{0}_{(s-1)\times(t-1)}.$$
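A quick numerical check of these matrices: for any B whose rows (group profiles) are vertical shifts of one another, ABC = 0. A pure-Python sketch (function names ours; s = 3 groups, t = 4 times):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def parallelism_AC(s, t):
    """A = (I_{s-1}, -j_{s-1}); C = the t x (t-1) adjacent-difference matrix."""
    A = [[1.0 if j == i else (-1.0 if j == s - 1 else 0.0) for j in range(s)]
         for i in range(s - 1)]
    C = [[1.0 if i == j else (-1.0 if i == j + 1 else 0.0) for j in range(t - 1)]
         for i in range(t)]
    return A, C

# Parallel profiles: each group's mean curve is a vertical shift of the first
base = [1.0, 3.0, 2.0, 5.0]
B = [[m + shift for m in base] for shift in (0.0, 2.0, -1.0)]  # s = 3, t = 4

A, C = parallelism_AC(s=3, t=4)
print(matmul(A, matmul(B, C)))  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

BC computes the adjacent-time differences within each group, and A then contrasts those difference vectors across groups, so ABC vanishes exactly under parallelism.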


Test of no difference among groups:

Depending on the result of the test of H01, two tests of no difference among groups are possible.

First, if H01 is accepted, then we would test for differences across groups averaging over (or, equivalently, summing over) time points. In this case H02 takes the form H02a : ABC = D where

$$A_{(s-1)\times s} = (I_{s-1}, -\mathbf{j}_{s-1}), \qquad C_{t\times 1} = \mathbf{j}_t, \qquad D_{(s-1)\times 1} = \mathbf{0}_{(s-1)\times 1}.$$

• This test is equivalent to doing a one-way ANOVA on the totals (or means) across time, for each subject.

Second, if H01 is rejected, we would not want to assume parallelism in testing across groups. In this case the null hypothesis is

$$H_{02b}: \begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \vdots \\ \mu_{1t} \end{pmatrix} = \begin{pmatrix} \mu_{21} \\ \mu_{22} \\ \vdots \\ \mu_{2t} \end{pmatrix} = \cdots = \begin{pmatrix} \mu_{s1} \\ \mu_{s2} \\ \vdots \\ \mu_{st} \end{pmatrix},$$

or H02b : ABC = D, where

$$A_{(s-1)\times s} = (I_{s-1}, -\mathbf{j}_{s-1}), \qquad C_{t\times t} = I_t, \qquad D_{(s-1)\times t} = \mathbf{0}_{(s-1)\times t}.$$

• This is the one-way MANOVA test on the vector of means at each time point.


Test of no difference among time points:

Similar to testing no difference among groups, the appropriate test here depends upon the result of testing H01. If H01 is accepted, we will typically want to test no difference across time points, averaging (or, equivalently, summing) across groups. This hypothesis is H03a : ABC = D where

$$A_{1\times s} = \mathbf{j}_s^T \text{ or } \tfrac{1}{s}\mathbf{j}_s^T, \qquad C_{t\times(t-1)} = \begin{pmatrix} I_{t-1} \\ -\mathbf{j}_{t-1}^T \end{pmatrix}, \qquad D_{1\times(t-1)} = \mathbf{0}_{1\times(t-1)}.$$

If H01 is rejected, then we can compare time points without assuming parallelism. That is, we can test the hypothesis

$$H_{03b}: \begin{pmatrix} \mu_{11} \\ \mu_{21} \\ \vdots \\ \mu_{s1} \end{pmatrix} = \begin{pmatrix} \mu_{12} \\ \mu_{22} \\ \vdots \\ \mu_{s2} \end{pmatrix} = \cdots = \begin{pmatrix} \mu_{1t} \\ \mu_{2t} \\ \vdots \\ \mu_{st} \end{pmatrix}.$$

This hypothesis can also be written in the form ABC = D where

$$A_{s\times s} = I_s, \qquad C_{t\times(t-1)} = \begin{pmatrix} I_{t-1} \\ -\mathbf{j}_{t-1}^T \end{pmatrix}, \qquad D_{s\times(t-1)} = \mathbf{0}_{s\times(t-1)}.$$


Example — Methemoglobin in Sheep (again):

Recall that there are t = 6 measurements through time on each of n1 = n2 = n3 = 4 sheep in NO2 groups 1, 2 and 3 (s = 3). The profile analysis model for these data is

$$\begin{pmatrix} \mathbf{y}_{11}^T \\ \vdots \\ \mathbf{y}_{14}^T \\ \mathbf{y}_{21}^T \\ \vdots \\ \mathbf{y}_{24}^T \\ \mathbf{y}_{31}^T \\ \vdots \\ \mathbf{y}_{34}^T \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \mu_{11} & \mu_{12} & \cdots & \mu_{16} \\ \mu_{21} & \mu_{22} & \cdots & \mu_{26} \\ \mu_{31} & \mu_{32} & \cdots & \mu_{36} \end{pmatrix} + \begin{pmatrix} \boldsymbol{\varepsilon}_{11}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{14}^T \\ \boldsymbol{\varepsilon}_{21}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{24}^T \\ \boldsymbol{\varepsilon}_{31}^T \\ \vdots \\ \boldsymbol{\varepsilon}_{34}^T \end{pmatrix},$$

or Y = XB + E, where Y and E are 12 × 6 matrices, X is 12 × 3, and B is 3 × 6.

The hypothesis of parallelism is H01 : ABC = 0 where

$$A = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 & -1 \end{pmatrix}.$$

• See handout sheep2.sas. In this handout, PROC GLM is used to fit the MANOVA model and to illustrate how to test the profile analysis hypotheses.


• In the first call to PROC GLM, we test for parallelism. Note that A is specified with the CONTRAST statement, the transpose of C is specified with the MANOVA statement, and D is assumed to be equal to 0.

• According to Wilks' test, the hypothesis of parallelism is rejected at α = .05 (p = .0164).

• Given that H01 is rejected, we should compare groups and times without assuming parallelism, and/or compare times within each group separately and compare groups within each time separately. However, for illustration purposes, I've given the tests of H02a, H02b, H03a, and H03b in sheep2.sas.

– The hypothesis of no group effect assuming parallelism (H02a) is rejected (p = .0026),
– the hypothesis of no group effect without assuming parallelism (H02b) is rejected (p = .0350),
– the hypothesis of no time effect assuming parallelism (H03a) is rejected (p < .0001), and
– the hypothesis of no time effect without assuming parallelism (H03b) is rejected (p < .0001).

• Finally, I tested for no time effect in group 1 only. This hypothesis was also rejected (p = .0006).

• All of these results are consistent with the profile plot obtained in sheep1.sas.


Growth Curve Analysis:

We have seen that repeated measures of a single variable (methemoglobin, say) over time can be analyzed with multivariate methods (e.g., MANOVA) by regarding each time-specific measurement of the variable as a distinct variable.

• E.g., if we measure methemoglobin at 10, 20, 30, 40, 50, and 60 minutes after treatment for each subject in each of three treatment groups, we can compare the groups with a MANOVA based on t = 6 variables: methemoglobin at 10 min., methemoglobin at 20 min., ..., methemoglobin at 60 min.

• Such an approach does not recognize any ordering of the repeated measurements and fits no model to describe time trends or growth curves.

• In fact, repeated measurements through time are naturally ordered.

In this case, it may be of interest to characterize trends over time using low-order polynomials (e.g., linear or quadratic curves in time).

By modelling the time trend, we hope to summarize the mean response at the t time points with q < t parameters, rather than allowing for t separate time-specific means.

• The use of polynomials to describe time trend within the context of a multivariate linear model is known as growth curve analysis, and is usually attributed to Potthoff and Roy.

• Not to be confused with the use of nonlinear models of growth (e.g., Richards' model, von Bertalanffy's model, etc.).

• This approach is seldom used these days, so we will not discuss it further. The use of polynomials in time to describe patterns of change is still common, but this is more commonly done in the framework of linear mixed models these days.


Linear Mixed Effects Models (LMMs)

There are several disadvantages/limitations to multivariate methods (profile analysis, growth curves) for longitudinal data analysis.

• The methods assume the same set of measurement times for each subject, so they cannot easily handle missing data, varying measurement times, or varying cluster size (number of repeated measures).

• They cannot handle time-varying covariates easily.

• The models make no assumptions on the within-subject variance-covariance matrix. This makes these methods broadly valid, but not powerful.

• Multivariate methods don't model the sources of heterogeneity/correlation in the design generating the data. There is no quantification of heterogeneity, and little flexibility to model multiple sources of heterogeneity and correlation.

A much more flexible class of models is the class of linear mixed effects models (LMMs).

We have already seen examples of LMMs: the split-plot model (chocolate cake example) and the RM-ANOVA model (methemoglobin in sheep example).

• In these cases, a cluster-specific random effect (the whole plot error term) was included to model whole plot to whole plot or subject to subject variability and to imply correlation within a whole plot/subject.

In general, the inclusion of random effects into the linear model allows for modeling (and quantification) of multiple sources of heterogeneity and complex patterns of correlation.

Further flexibility is achieved in this class of models by also letting the error term have a general, non-spherical variance-covariance matrix.

The result is a very rich, flexible and useful class of models.


Some Simple LMMs:

The one-way random effects model — Railway Rails:

(See Pinheiro and Bates, §1.1) The data displayed below are from an experiment conducted to measure longitudinal (lengthwise) stress in railway rails. Six rails were chosen at random and tested three times each by measuring the time it took for a certain type of ultrasonic wave to travel the length of the rail.

[Figure: dotplot of zero-force travel time (nanoseconds), ranging from about 40 to 100, by rail; rails ordered 2, 5, 1, 6, 3, 4.]

Clearly, these data are grouped, or clustered, by rail. This clustering has two closely related implications:

1. (within-cluster correlation) we should expect that observations from the same rail will be more similar to one another than observations from different rails; and

2. (between-cluster heterogeneity) we should expect that the mean response will vary from rail to rail in addition to varying from one measurement to the next.

These ideas are really flip-sides of the same coin.


Although it is fairly obvious that clustering by rail must be incorporated in the modeling of these data somehow, we first consider a naive approach.

The primary interest here is in measuring the mean travel time. Therefore, we might naively consider the model

$$y_{ij} = \mu + \varepsilon_{ij}, \qquad i = 1,\ldots,6,\ j = 1,\ldots,3,$$

where y_ij is the travel time for the jth trial on the ith rail, and we assume ε_11, ..., ε_63 iid∼ N(0, σ²).

Here, µ is the mean travel time, which we wish to estimate. Its ML/OLS estimate is ȳ·· = 66.5 and the MSE is s² = 23.645².

However, an examination of the residuals from this model plotted separately by rail reveals the inadequacy of the model:

[Figure: boxplots of raw residuals (about −40 to 20) by rail for the simple mean model; rails 2, 5, 1, 6, 3, 4.]


Clearly, the mean response is changing from rail to rail. Therefore, we consider a one-way ANOVA model:

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}. \tag{*}$$

Here, µ is a grand mean across the rails included in the experiment, and α_i is an effect up or down from the grand mean specific to the ith rail. Alternatively, we could define µ_i = µ + α_i as the mean response for the ith rail and reparameterize this model as

$$y_{ij} = \mu_i + \varepsilon_{ij}.$$

The OLS estimates of the parameters of this model are µ̂_i = ȳ_i·, giving (µ̂_1, ..., µ̂_6) = (54.00, 31.67, 84.67, 96.00, 50.00, 82.67) and s² = 4.02². The residual plot looks much better:

[Figure: Boxplots of raw residuals by rail, one-way fixed effects model. y-axis: residuals for one-way fixed effects model (roughly −6 to 6); x-axis: Rail No. (2, 5, 1, 6, 3, 4).]


However, there are still drawbacks to this one-way fixed effects model:

– It only models the specific sample of rails used in the experiment, while the main interest is in the population of rails from which these rails were drawn.

– It does not produce an estimate of the rail-to-rail variability in travel time, which is a quantity of significant interest in the study.

– The number of parameters increases linearly with the number of rails used in the experiment.

These deficiencies are overcome by the one-way random effects model.

To motivate this model, consider again the one-way fixed effects model. Model (∗) can be written as

yij = µ + (µi − µ) + εij

where, under the usual constraint Σi αi = 0, (µi − µ) has mean 0 when averaged over the groups (rails).

The one-way random effects model replaces the fixed parameter (µi − µ) with a random effect bi, a random variable specific to the ith rail, which is assumed to have mean 0 and an unknown variance σ2b. This yields the model

yij = µ + bi + εij ,

where b1, . . . , b6 are independent random variables, each with mean 0 and variance σ2b. Often the bi's are assumed normal, and they are usually assumed independent of the εij's. Thus we have

b1, . . . , bn iid∼ N(0, σ2b), independent of ε11, . . . , εntn iid∼ N(0, σ2),

where n is the number of rails and ti is the number of observations on the ith rail.


– Note that now the interpretation of µ changes from the mean over the 6 rails included in the experiment (fixed effects model) to the mean over the population of all rails from which the six rails were sampled.

– In addition, we are not estimating µi, the mean response for a single rail, which is not of interest. Instead we are estimating the population mean µ and the variance from rail to rail in the population, σ2b.

– That is, now our scope of inference is the population of rails, rather than the six rails included in the study.

– In addition, we can estimate the rail-to-rail variability σ2b; and

– The number of parameters no longer increases with the number of rails tested in the experiment.

The one-way random effects model is really a simplified version of the split-plot model, and it implies a similar variance-covariance structure. It is easy to show that for the one-way random effects model

var(yij) = σ2b + σ2,
cov(yij , yij′) = σ2b, for j ≠ j′,
corr(yij , yij′) = ρ ≡ σ2b/(σ2b + σ2), for j ≠ j′, and
cov(yij , yi′j′) = 0, for i ≠ i′.

That is, if yi = (yi1, . . . , yiti)T, then y1, . . . , yn are independent, with

var(yi) = (σ2b + σ2) ×
[ 1  ρ  · · ·  ρ
  ρ  1  · · ·  ρ
  ...
  ρ  ρ  · · ·  1 ]

(cf. the split-plot var-cov structure on p.29).
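This compound-symmetry structure is easy to check directly. The following is a minimal numpy sketch (not from the notes; the variance values are illustrative, taken to match the REML estimates quoted below for the rail data):

```python
import numpy as np

# Build var(y_i) for one rail with t_i = 3 trials, using assumed values
# sigma2_b = 24.805**2 and sigma2 = 4.021**2 (illustrative only).
sigma2_b, sigma2 = 24.805**2, 4.021**2
t_i = 3
V_i = sigma2_b * np.ones((t_i, t_i)) + sigma2 * np.eye(t_i)  # σ2b·J + σ2·I

rho = sigma2_b / (sigma2_b + sigma2)   # intraclass correlation
# Every diagonal entry is σ2b + σ2 and every off-diagonal entry is σ2b,
# so V_i = (σ2b + σ2) × (compound-symmetry correlation matrix).
corr = V_i / (sigma2_b + sigma2)
```

With these values ρ is close to 1, reflecting that most of the variability in the rail data is between rails rather than within them.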


– In the rails example, the one-way random effects model again leads to a BLUE of y·· = 66.5 for µ.

– The restricted maximum likelihood (REML) estimators of σ2 and σ2b coincide with the method-of-moments type estimators we derived in the split-plot model. These estimates are σ̂2b = (24.805)2 and σ̂2 = (4.021)2.

The randomized complete block model — Stool Example:

In the last example, the data were grouped by rail, and we were interested in only one treatment (there was only one experimental condition under which the travel time along the rail was measured).

Often, several treatments are of interest and the data are grouped. In a randomized complete block design (RCBD), each of s treatments is observed in each of n blocks.

As an example, consider the data displayed below. These data come from an experiment to compare the ergonomics of four different stool designs. n = 9 subjects were asked to sit in each of s = 4 stools. The response measured was the amount of effort required to stand up.

[Figure: Dotplot of effort required to arise (Borg scale, roughly 8–14) by subject (ordered 8, 5, 4, 9, 6, 3, 7, 1, 2 on the vertical axis), with plotting symbols T1–T4 marking the four stool types.]


Let yij be the response for the jth stool type tested by the ith subject.

The classical fixed effects model for the RCBD assumes

yij = µ + αj + βi + εij
    = µj + βi + εij ,   i = 1, . . . , n, j = 1, . . . , s,

where ε11, . . . , εns iid∼ N(0, σ2).

Here, µj is the mean response for the jth stool type, which can be broken apart into a grand mean µ and a stool-type effect αj; βi is a fixed subject effect.

Again, the scope of inference for this model is the set of 9 subjects used in this experiment. If we wish to generalize to the population from which the 9 subjects in this experiment were drawn, a more appropriate model would consider the subject effects to be random.

The RCBD model with random subject effects is

yij = µj + bi + εij ,

where

b1, . . . , bn iid∼ N(0, σ2b), independent of ε11, . . . , εns iid∼ N(0, σ2).

An equivalent representation is

yi = Xiβ + Zibi + εi, i = 1, . . . , n,

where

yi = (yi1, . . . , yis)T ,  Xi = Is,  β = (µ1, . . . , µs)T ,  Zi = js = (1, . . . , 1)T ,  εi = (εi1, . . . , εis)T .


From this model representation it is clear that the variance-covariance structure here is quite similar to that in the one-way random effects and split-plot models. In particular,

cov(yi, yi′) = cov(Xiβ + Zibi + εi, Xi′β + Zi′bi′ + εi′) = 0, for i ≠ i′,

var(yi) = var(Xiβ + Zibi + εi) = var(Zibi + εi)
        = Zi var(bi) ZTi + var(εi)   (var(bi) = σ2b, var(εi) = σ2Is)
        = σ2b Js,s + σ2 Is
        =
[ σ2 + σ2b    σ2b      · · ·    σ2b
  σ2b      σ2 + σ2b    · · ·    σ2b
  ...
  σ2b         σ2b      · · ·    σ2 + σ2b ]

It is often stated that whether block effects are assumed random or fixed does not affect the analysis of the RCBD. This is not completely true. It is true that whether or not blocks are treated as random does not affect the ANOVA F test for treatments. The ANOVA table for the RCBD with random block effects is

Source of    Sum of                    d.f.           Mean                  E(MS)                             F
Variation    Squares                                  Square

Treat's      n Σj (y·j − y··)2        s − 1          SSTrt/(s−1)           σ2 + n Σj (µj − µ·)2/(s−1)        MSTrt/MSE
Blocks       s Σi (yi· − y··)2        n − 1          SSBlocks/(n−1)        σ2 + sσ2b
Error        SSE (by subtr.)          (s−1)(n−1)     SSE/[(s−1)(n−1)]      σ2
Total        Σi Σj (yij − y··)2       sn − 1

– This table is identical to that with blocks fixed, except for the expected MS for blocks. The F tests for the two situations are identical.


However, there are important differences in the analysis of the two designs. These differences affect inferences on treatment means.

For instance, in the fixed block effects model, the variance of a treatment mean is

var(y·j) = var{ n−1 Σi (µj + βi + εij) } = var(ε·j) = σ2/n,

whereas in the random block effects model

var(y·j) = var{ n−1 Σi (µj + bi + εij) } = var(b· + ε·j)
         = var(b·) + var(ε·j) = σ2b/n + σ2/n = (σ2 + σ2b)/n.

From the expressions for expected MS, method-of-moments (aka anova) estimators for σ2 and σ2b are easily derived (cf. p.17, the analogous results for the split-plot model):

σ̂2 = MSE,   σ̂2b = (MSBlocks − MSE)/s.

This leads to a standard error of

s.e.(y·j) = √{ var̂(y·j) } = √{ [MSBlocks + (s − 1)MSE] / (ns) }

in the random block effects model and a standard error of

s.e.(y·j) = √{ MSE/n }

in the fixed block effects model.
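These anova estimators and the two standard errors can be computed by hand from the RCBD sums of squares. The following numpy sketch (not from the notes; the data are simulated with made-up values of σb and σ) illustrates the arithmetic:

```python
import numpy as np

# Simulate a balanced RCBD with n blocks and s treatments (values assumed).
rng = np.random.default_rng(1)
n, s = 9, 4
sigma_b, sigma = 1.5, 0.8
b = rng.normal(0, sigma_b, size=n)           # random block (subject) effects
mu = np.array([10.0, 11.0, 9.5, 10.5])       # treatment means (assumed)
y = mu[None, :] + b[:, None] + rng.normal(0, sigma, size=(n, s))

ybar_i = y.mean(axis=1)                      # block means
ybar_j = y.mean(axis=0)                      # treatment means
ybar = y.mean()
SSTrt = n * ((ybar_j - ybar) ** 2).sum()
SSBlk = s * ((ybar_i - ybar) ** 2).sum()
SSE = ((y - ybar) ** 2).sum() - SSTrt - SSBlk
MSE = SSE / ((s - 1) * (n - 1))
MSBlk = SSBlk / (n - 1)

sigma2_hat = MSE                             # σ̂2 = MSE
sigma2_b_hat = (MSBlk - MSE) / s             # σ̂2b = (MSBlocks − MSE)/s
se_random = np.sqrt((MSBlk + (s - 1) * MSE) / (n * s))  # random-blocks s.e.
se_fixed = np.sqrt(MSE / n)                             # fixed-blocks s.e.
```

Note that se_random² is algebraically identical to (σ̂2b + σ̂2)/n, matching the variance expression derived above.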

– See stool1.sas. Note that the s.e.'s on LSMEANS are larger for the random blocks model. This makes sense, since the scope of inference for this model is broader.


The General LMM — Theory:

In general, we can write the linear mixed model as

y = Xβ + Zb+ ε, (1)

where X and Z are known matrices (the model or design matrices for the fixed and random effects, respectively), β is a vector of unknown fixed effects (parameters), b is a vector of random effects, and ε is a vector of random error terms.

We assume for the random vectors b and ε that

E(b) = 0, var(b) = D,

E(ε) = 0, var(ε) = R,

and cov(b, ε) = 0.

• For statistical inference and for likelihood-based estimation we must add distributional assumptions on b and ε. We make the usual assumptions that

b ∼ N(0, D),  ε ∼ N(0, R).

• Notice that the variance-covariance matrices D and R are not assumed to be known, and are of general form. In special cases we will assume spherical errors (R = σ2In) and/or special forms for D.

For example, in the RCBD model with random block effects, suppose there are n = 3 random blocks and s = 2 treatments. Then

yij = µj + bi + εij

can be written in the general form (1) as follows:

[ y11 ]   [ 1 0 ]           [ 1 0 0 ]           [ ε11 ]
[ y12 ]   [ 0 1 ]           [ 1 0 0 ]           [ ε12 ]
[ y21 ] = [ 1 0 ] ( µ1 )  + [ 0 1 0 ] ( b1 )  + [ ε21 ]
[ y22 ]   [ 0 1 ] ( µ2 )    [ 0 1 0 ] ( b2 )    [ ε22 ]
[ y31 ]   [ 1 0 ]           [ 0 0 1 ] ( b3 )    [ ε31 ]
[ y32 ]   [ 0 1 ]           [ 0 0 1 ]           [ ε32 ]
   y    =    X       β    +     Z        b    +    ε
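For balanced designs like this one, the design matrices X and Z have a Kronecker-product structure. A minimal numpy sketch (not from the notes) builds the matrices of this 3-block, 2-treatment example:

```python
import numpy as np

# Design matrices for the n = 3 block, s = 2 treatment RCBD example.
n, s = 3, 2
X = np.kron(np.ones((n, 1)), np.eye(s))   # stacks I_s once per block: 6 x 2
Z = np.kron(np.eye(n), np.ones((s, 1)))   # one indicator column per block: 6 x 3
```

The rows of X cycle through the treatment indicators within each block, while each column of Z flags the two observations belonging to one block, exactly as in the display above.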


For estimation of the fixed effects β, the mixed model can be written as a generalized least-squares model. Define

V ≡ var(y) = var(Xβ + Zb+ ε) = var(Zb+ ε) = ZDZT +R.

• We assume that V is nonsingular.

Then model (1) is equivalent to

y = Xβ + ζ,  E(ζ) = 0,  var(ζ) = V,

or

y ∼ Nn(Xβ, V).

If V were known (at least up to a multiplicative constant), then our results on GLS estimation would apply here, and we would obtain β̂ as a solution to the equation

XTV−1Xβ = XTV−1y

and then the BLUE of any estimable function cTβ would be given by cT β̂.

However, V is typically unknown. In that case, suppose we have an estimator V̂ of V. Then a natural approach for estimating cTβ is to treat V̂ as the true value V and then use the (estimated) GLS estimator

cT β̂ = cT (XT V̂−1X)−XT V̂−1y.   (∗)

• If V̂ is "close to" V, then cT β̂ should be close to the BLUE of cTβ. However, the corresponding standard errors based on var̂(cT β̂) = cT (XT V̂−1X)−c are known to be somewhat underestimated, because they don't account for the error in estimating V by V̂.

• We will see that the estimator defined by (∗) is the ML (REML) estimator of β when V̂ is the ML (REML) estimator of V.
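The estimated GLS computation in (∗) is a one-liner in matrix form. A minimal numpy sketch (not from the notes; X, y, and the full-column-rank assumption are illustrative):

```python
import numpy as np

def gls(X, y, Vhat):
    """Estimated GLS: solve X'V^-1 X beta = X'V^-1 y.

    X is assumed to have full column rank here, so the generalized
    inverse in (*) reduces to an ordinary inverse."""
    Vinv = np.linalg.inv(Vhat)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Sanity check: with Vhat = I, GLS reduces to ordinary least squares.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)
beta_gls = gls(X, y, np.eye(20))
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

In practice one would plug in an ML or REML estimate V̂, as discussed below.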


Prediction of b:

Before considering our specific problem of predicting b in the LMM, we need to know a little bit about prediction of random variables in general.

Suppose we have random variables y1, . . . , yn from which we'd like to predict the random variable y0. What is the best predictor of y0?

If we use the mean squared error of prediction as our criterion of optimality, then we can show that the best (minimum mse) predictor is

mbp(y) ≡ E(y0|y), where y = (y1, . . . , yn)T .

• Here the mean squared error of prediction for a predictor t(y) is defined to be E[{y0 − t(y)}2]. Note this is a criterion of optimality for a predictor t(y). Don't confuse this with the MSE of a fitted regression model.


This result is stated in the following theorem:

Theorem: Let mbp(y) = E(y0|y). Then, for any predictor t(y),

E[{y0 − t(y)}2] ≥ E[{y0 −mbp(y)}2].

Thus mbp(y) = E(y0|y) is the best predictor of y0 in the sense of minimizing the mean squared error of prediction.

Proof:

E[{y0 − t(y)}2] = E[{y0 − mbp(y) + mbp(y) − t(y)}2]
= E[{y0 − mbp(y)}2] + E[{mbp(y) − t(y)}2]
  + 2E[{y0 − mbp(y)}{mbp(y) − t(y)}].

Since both E[{y0 − mbp(y)}2] and E[{mbp(y) − t(y)}2] are nonnegative, it suffices to show that E[{y0 − mbp(y)}{mbp(y) − t(y)}] = 0. This is indeed the case because

E[{y0 − mbp(y)}{mbp(y) − t(y)}]
= E( E[{y0 − mbp(y)}{mbp(y) − t(y)} | y] )
= E( E[{y0 − mbp(y)} | y] {mbp(y) − t(y)} )
= E( {E(y0|y) − mbp(y)} {mbp(y) − t(y)} )   (E(y0|y) = mbp(y))
= E( 0 · {mbp(y) − t(y)} ) = 0.

• To form the best predictor E(y0|y) we, in general, require knowledge of the joint distribution of (y0, y1, . . . , yn)T, which may not be available.

• It requires substantially less information to form the best linear predictor of y0 based on y. For the best predictor in the class of linear predictors, we need only the means, variances, and covariances of y0 and y.


Limiting ourselves to the class of linear predictors, we seek a predictor of the form γ0 + yTγ, for some vector γ, that minimizes E{(y0 − γ0 − yTγ)2}.

Let µy0 = E(y0), σ2y0 = var(y0), µy = E(y), Vyy = var(y), and vyy0 = cov(y, y0).

Let γ∗ denote a solution to Vyyγ = vyy0. I.e., γ∗ = V−yy vyy0 (= V−1yy vyy0 in the case that Vyy is nonsingular). Then the following theorem holds:

Theorem: The function

mblp(y) ≡ µy0 + (γ∗)T (y − µy)

is a best linear predictor of y0 based on y.

Proof: Denote an arbitrary linear predictor as t(y) = γ0 + yTγ. Then

E[{y0 − t(y)}2] = E[{y0 − mblp(y) + mblp(y) − t(y)}2]
= E[{y0 − mblp(y)}2] + E[{mblp(y) − t(y)}2]
  + 2E[{y0 − mblp(y)}{mblp(y) − t(y)}].

Again, it suffices to show that E[{y0 − mblp(y)}{mblp(y) − t(y)}] = 0, because if this cross-product term is 0, then

E[{y0 − t(y)}2] = E[{y0 − mblp(y)}2] + E[{mblp(y) − t(y)}2].

To find the t(y) that minimizes the left-hand side (the mse criterion), observe that both terms on the right-hand side are nonnegative, the first term does not depend on t(y), and the second term is minimized when it is zero, which happens when t(y) = mblp(y).

So, it remains to show that E[{y0 − mblp(y)}{mblp(y) − t(y)}] = 0, which we leave as an exercise.


• In general, the best linear predictor and best predictor differ. However, in the special case in which (y0, y1, . . . , yn)T is multivariate normal, the best linear predictor and best predictor coincide.

• It can also be shown that the BLP is essentially unique, so that it makes sense to speak of the BLP.
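In the simplest (scalar) normal case, the coincidence of the BLP with the conditional mean can be checked directly. A minimal sketch (not from the notes; the means, variances, and covariance are made-up values):

```python
import numpy as np

# Bivariate normal (y0, y1) with assumed moments.
mu0, mu1 = 2.0, -1.0
var1, cov01 = 9.0, 3.6

def mblp(y1):
    """BLP of y0 from y1: mu_y0 + gamma*(y1 - mu_y1), gamma* = cov/var."""
    gamma_star = cov01 / var1     # solves V_yy gamma = v_yy0 (scalar case)
    return mu0 + gamma_star * (y1 - mu1)

def cond_mean(y1):
    """E(y0 | y1) for the bivariate normal: the familiar regression formula."""
    return mu0 + (cov01 / var1) * (y1 - mu1)

agree = [abs(mblp(v) - cond_mean(v)) < 1e-12 for v in (-3.0, 0.0, 2.5)]
```

At y1 = µy1 the BLP returns µy0, as it must, since the predictor is unbiased.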

Now let's return to the LMM context. Suppose y satisfies the LMM (1) from p.69. Then

µy = Xβ,  Vyy = ZDZT + R (assumed nonsingular),

so that the BLP of y0 is

µy0 + vy0y(ZDZT + R)−1(y − Xβ).

• However, this predictor is typically not of much use, because β is unknown. In addition, µy0, D, and R may be unknown as well.

For now, suppose that D, R, and vy0y (which is often a function of D and R) are known, but µy0 and β are not. Then, since the BLP is not available, a natural predictor to consider is

mblup(y) ≡ µ̂y0 + vy0y(ZDZT + R)−1(y − Xβ̂),

where µ̂y0 and Xβ̂ are BLUEs of µy0 and E(y) = Xβ, respectively.

• It can be shown that mblup(y) is the best linear unbiased predictor of y0 (see Christensen, Ch.12). That is, in the class of unbiased predictors that are linear in y, mblup(y) has the minimum mse of prediction.

Unbiasedness of a Predictor: A predictor t(y) of y0 is said to be unbiased if

E{t(y)} = E(y0).


In a LMM context, it is typically of interest to predict cTb, a linear combination of the vector of random effects, based upon y, the observed data vector.

• That is, we now let cTb play the role of y0 in our description of BLUP above.

Since E(cTb) = cTE(b) = 0, µy0 in mblup(y) becomes 0. In addition,

vy0y = cov(cTb, y) = cT cov(b, Xβ + Zb + ε)
     = cT cov(b, Zb + ε) = cT {cov(b, b)ZT + cov(b, ε)}   (cov(b, ε) = 0)
     = cT var(b)ZT = cTDZT .

Therefore, the BLUP of cTb is given by

cTDZTV−1(y − Xβ̂),

where Xβ̂ is the BLUE of Xβ and V = var(y) = ZDZT + R.

If we are interested in the BLUP of a vector of such functions, (cT1 b, . . . , cTr b)T = Cb, this result extends in the obvious way: the BLUP of Cb is given by

CDZTV−1(y − Xβ̂).

It is sometimes convenient to write this BLUP in the equivalent form

BLUP(Cb) = CDZTV−1(y − Xβ̂) = CDZTPy,   (†)

where

P = V−1 − V−1X(XTV−1X)−XTV−1.
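The equivalence of the two forms in (†) is easy to confirm numerically. A minimal sketch (not from the notes; the one-way layout and variance values are made up for illustration):

```python
import numpy as np

# One-way random effects layout: n_grp groups, t observations each.
rng = np.random.default_rng(2)
n_grp, t = 4, 3
N = n_grp * t
X = np.ones((N, 1))                         # intercept only
Z = np.kron(np.eye(n_grp), np.ones((t, 1)))
D = 2.0 * np.eye(n_grp)                     # var(b), assumed
R = 0.5 * np.eye(N)                         # var(eps), assumed
V = Z @ D @ Z.T + R
Vinv = np.linalg.inv(V)
y = rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # GLS BLUE
P = Vinv - Vinv @ X @ np.linalg.inv(X.T @ Vinv @ X) @ X.T @ Vinv
blup1 = D @ Z.T @ Vinv @ (y - X @ beta_hat)   # D Z' V^-1 (y - X beta_hat)
blup2 = D @ Z.T @ P @ y                       # D Z' P y
```

Both expressions produce the same vector of predicted random effects (here C = I).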


The “Mixed Model Equations”:

At this point we have seen that for D and R known (i.e., for var(y) = V = ZDZT + R known), a BLUE of β and the BLUP of b are

β̂ = (XTV−1X)−XTV−1y,   b̂ = DZTV−1(y − Xβ̂),

respectively.

In the classical linear model, the BLUE of β is obtained as the solution of the normal equations. In the LMM there is an analogous set of equations that yields the BLUE and BLUP of β and b. These equations are called the mixed model equations or, sometimes, Henderson's equations.

We now present the mixed model equations. We assume R and D are nonsingular, known matrices.

Recall the LMM:

y = Xβ + Zb + ε,  E(b) = E(ε) = 0,  var(b) = D,  var(ε) = R,  cov(b, ε) = 0.

If b were fixed instead of random, the normal equations (based on GLS) for the model would be

[ XT ]             [ β ]   [ XT ]
[ ZT ] R−1 (X, Z)  [ b ] = [ ZT ] R−1 y,

which may be written equivalently as

[ XTR−1X   XTR−1Z ] [ β ]   [ XTR−1y ]
[ ZTR−1X   ZTR−1Z ] [ b ] = [ ZTR−1y ].

Of course, in the mixed model b is random, which leads to a slightly different set of equations, known as the mixed model equations:

[ XTR−1X   XTR−1Z         ] [ β̂ ]   [ XTR−1y ]
[ ZTR−1X   D−1 + ZTR−1Z   ] [ b̂ ] = [ ZTR−1y ].   (∗)
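The mixed model equations can be assembled and solved directly, and — as the theorem below establishes — their solution reproduces the GLS BLUE and the BLUP. A minimal numpy sketch (not from the notes; the layout and variance values are made up):

```python
import numpy as np

# Small grouped layout with assumed D and R.
rng = np.random.default_rng(3)
n_grp, t = 5, 2
N = n_grp * t
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.kron(np.eye(n_grp), np.ones((t, 1)))
D = 1.7 * np.eye(n_grp)
R = 0.9 * np.eye(N)
y = rng.normal(size=N)

Rinv, Dinv = np.linalg.inv(R), np.linalg.inv(D)
# Henderson's equations (*): block coefficient matrix and right-hand side.
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Dinv + Z.T @ Rinv @ Z]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = np.linalg.solve(lhs, rhs)
beta_mme, b_mme = sol[:2], sol[2:]

# Compare with the GLS BLUE and the BLUP D Z' V^-1 (y - X beta_hat).
V = Z @ D @ Z.T + R
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
b_blup = D @ Z.T @ Vinv @ (y - X @ beta_gls)
```

A practical advantage of (∗) is that it only requires inverting R and D (often diagonal), never the full N × N matrix V.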


Theorem: If (β̂T, b̂T)T is a solution to the mixed model equations, then Xβ̂ is a BLUE of Xβ and b̂ is a BLUP of b.

Proof: Recall that the LMM is equivalent to the model

y = Xβ + ζ, E(ζ) = 0, var(ζ) = ZDZT +R ≡ V.

Therefore, Xβ̂ will be a BLUE of Xβ if β̂ is a solution to XTV−1Xβ = XTV−1y. It can be shown (see Theorem B.56 in Christensen, for example) that

V−1 = R−1 − R−1Z{D−1 + ZTR−1Z}−1ZTR−1.

If β̂ and b̂ are solutions, then the second row of (∗) gives

ZTR−1Xβ̂ + {D−1 + ZTR−1Z}b̂ = ZTR−1y
⇒ b̂ = {D−1 + ZTR−1Z}−1ZTR−1(y − Xβ̂).   (∗∗)

The first row of (∗) is

XTR−1Xβ̂ + XTR−1Zb̂ = XTR−1y.

Substituting the expression for b̂ gives

XTR−1Xβ̂ + XTR−1Z{D−1 + ZTR−1Z}−1ZTR−1(y − Xβ̂) = XTR−1y,

or

XTR−1Xβ̂ − XTR−1Z{D−1 + ZTR−1Z}−1ZTR−1Xβ̂
  = XTR−1y − XTR−1Z{D−1 + ZTR−1Z}−1ZTR−1y,

or

XT (R−1 − R−1Z{D−1 + ZTR−1Z}−1ZTR−1)Xβ̂ = XT (R−1 − R−1Z{D−1 + ZTR−1Z}−1ZTR−1)y,

where the matrix in parentheses on each side is just V−1. Thus, β̂ is a GLS solution, so that Xβ̂ is a BLUE.


Now to show b̂ is a BLUP: b̂ in (∗∗) can be rewritten as

b̂ = (D{D−1 + ZTR−1Z} − DZTR−1Z){D−1 + ZTR−1Z}−1ZTR−1(y − Xβ̂)
  = (DZTR−1 − DZTR−1Z{D−1 + ZTR−1Z}−1ZTR−1)(y − Xβ̂)
  = DZT (R−1 − R−1Z{D−1 + ZTR−1Z}−1ZTR−1)(y − Xβ̂)   (the parenthesized matrix is V−1)
  = DZTV−1(y − Xβ̂),

which is the BLUP of b by result (†) on p.75 (here I is playing the role of C, since we're interested in the BLUP of Ib = b).

Sampling Variance of BLUE and BLUP for V known:

Just as it is useful for inference on β to know the variance of our estimator β̂, it is useful to know the prediction variance of the BLUP.

For V known and Cβ a vector of estimable functions, the estimator Cβ̂ = C(XTV−1X)−XTV−1y has variance-covariance matrix

var(Cβ̂) = C(XTV−1X)−CT .


The analogous result for Cb̂ is as follows:

var(Cb̂) = var(CDZTPy) = CDZTPVPTZDTCT = CDZTPZDCT .

Here, we have used the fact that PVP = P and that P and D are symmetric.

If we are interested in the variance of the prediction error Cb̂ − Cb, then we have

var(Cb̂ − Cb) = C var(b̂ − b)CT
             = C{var(b̂) + var(b) − cov(b̂, b) − cov(b̂, b)T }CT ,

where

cov(b, b̂) = cov(b, DZTPy) = cov(b, y)PZD
          = cov(b, Xβ + Zb + ε)PZD = cov(b, Zb)PZD
          = cov(b, b)ZTPZD = DZTPZD = var(b̂).

Therefore,

var(Cb̂ − Cb) = C{var(b̂) + var(b) − var(b̂) − var(b̂)T }CT
             = C{D − DZTPZD}CT .

In addition, note that for C a matrix of constants such that Cβ is a vector of estimable functions,

cov(Cβ̂, b̂) = 0.


Maximum Likelihood Estimation:

• We have already seen that for known V, Cβ has BLUE Cβ̂, where β̂ = (XTV−1X)−XTV−1y, and b has BLUP DZTV−1(y − Xβ̂).

• These results do not depend upon any distributional assumption on b and ε.

• In addition, to this point we have concentrated on the case when V is known. We now relax that assumption to consider the V-unknown case.

• Note that b is not a parameter of the model, so while we may be interested in b, predicting b is not part of fitting the model.

• The unknown parameters of the LMM are β, D, and R.

• Typically, some structure is placed on D and R so that their forms are known, and they are assumed to be matrix functions of a relatively small number of parameters.

Let θ be the q × 1 vector of unknown parameters describing D and R, and hence V.

• We will often write these matrices as D(θ), R(θ), V(θ) to emphasize this dependence.

So, fitting the LMM involves estimating β and θ. After the model has been fit, it may also be of interest to predict b.

A unified framework for estimation of these parameters is provided by maximum likelihood, which requires that we make distributional assumptions on b and ε.

• Such assumptions will be necessary anyway for inference, so there is not much cost in making them at the estimation phase.


Suppose y = Xβ + Zb + ε, where b ∼ N(0, D(θ)), ε ∼ N(0, R(θ)), and b and ε are independent. Then

y ∼ N(Xβ, V(θ)), where V = ZDZT + R,

so the loglikelihood for (β, θ) is just the log of a multivariate normal density:

ℓ(β, θ; y) = −(n/2) log(2π) − (1/2) log{|V(θ)|} − (1/2)(y − Xβ)T {V(θ)}−1(y − Xβ).

The ML estimators of β and θ can be found by taking partial derivatives of ℓ(β, θ; y) with respect to β and the components of θ, setting the resulting functions equal to zero, and solving.

To take these partial derivatives, we need some results on matrix and vector differentiation. The following four results appear in Christensen (Plane Answers to Complex Questions) as Proposition 12.4.1, but can also be found in McCulloch et al., 2008, Appendix M, and other standard references.

1. ∂(Ax)/∂x = A.

2. ∂(xTAx)/∂x = 2xTA.

3. If A is a function of a scalar s,

∂A−1/∂s = −A−1 (∂A/∂s) A−1.

4. If A is a function of a scalar s,

∂ log |A| / ∂s = tr( A−1 ∂A/∂s ).
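Results 3 and 4 are easy to sanity-check numerically by central differences. A minimal sketch (not from the notes; A(s) = B + sC with made-up B and C, B positive definite so the inverse and determinant behave):

```python
import numpy as np

B = np.array([[2.0, 0.3], [0.3, 1.5]])
C = np.array([[0.5, 0.1], [0.1, 0.4]])
A = lambda s: B + s * C            # so dA/ds = C

s0, h = 0.7, 1e-6
# Result 3: d(A^-1)/ds = -A^-1 (dA/ds) A^-1.
dAinv_num = (np.linalg.inv(A(s0 + h)) - np.linalg.inv(A(s0 - h))) / (2 * h)
dAinv_formula = -np.linalg.inv(A(s0)) @ C @ np.linalg.inv(A(s0))

# Result 4: d(log|A|)/ds = tr(A^-1 dA/ds).
dlogdet_num = (np.log(np.linalg.det(A(s0 + h)))
               - np.log(np.linalg.det(A(s0 - h)))) / (2 * h)
dlogdet_formula = np.trace(np.linalg.inv(A(s0)) @ C)
```

These two identities are exactly the pieces needed to differentiate log|V(θ)| and {V(θ)}−1 in the loglikelihood below.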


Back to our problem. Recall the loglikelihood:

ℓ(β, θ; y) = −(n/2) log(2π) − (1/2) log{|V(θ)|} − (1/2)(y − Xβ)T {V(θ)}−1(y − Xβ).

Using the matrix and vector differentiation results above, we obtain the following partial derivatives:

∂ℓ/∂β = −βTXT {V(θ)}−1X + yT {V(θ)}−1X, and

∂ℓ/∂θj = −(1/2) tr( V−1 ∂V/∂θj ) + (1/2)(y − Xβ)T {V(θ)}−1 (∂V/∂θj) {V(θ)}−1(y − Xβ),
j = 1, . . . , q.

Setting these partials equal to zero, we get the following set of estimating equations, which can be solved to obtain the MLEs β̂ and θ̂:

XT {V(θ)}−1Xβ = XT {V(θ)}−1y

tr( V−1 ∂V/∂θj ) = (y − Xβ)T {V(θ)}−1 (∂V/∂θj) {V(θ)}−1(y − Xβ),   (♡)
j = 1, . . . , q.

• Although these equations do not, in general, have simple closed-form solutions, they can be solved simultaneously by any one of several numerical techniques (e.g., Newton-Raphson, the EM algorithm).


Instead of solving the equations (♡) simultaneously, an alternative method of maximizing ℓ(β, θ; y), which is often more convenient, is the method of profile (log)likelihood.

i. First, treat θ as fixed and maximize the loglikelihood ℓ(β, θ; y) with respect to β.

ii. Second, plug the estimator of β, call it β̂θ (a function of θ), back into the loglikelihood. This yields pℓ(θ; y) ≡ ℓ(β̂θ, θ; y), which is a function of θ only (called the profile loglikelihood for θ). Maximize pℓ(θ; y) with respect to θ to obtain the MLE θ̂.

iii. Finally, the MLE of β is obtained by plugging θ̂ into our estimator for β obtained in step i. That is, the MLE of β is β̂ = β̂θ̂.

• Notice that for fixed θ, maximizing ℓ(β, θ; y) with respect to β is equivalent to minimizing

(y − Xβ)T {V(θ)}−1(y − Xβ),

which is the GLS criterion. Therefore, step i gives

β̂θ = [XT {V(θ)}−1X]−XT {V(θ)}−1y.

The real work is done in step ii, where we obtain θ̂ by maximizing

pℓ(θ; y) = −(1/2)[ log{|V(θ)|} + (y − Xβ̂θ)T {V(θ)}−1(y − Xβ̂θ) ].

Once this step is accomplished, it is clear that the MLE of β will then be

β̂ = β̂θ̂ = (XT V̂−1X)−XT V̂−1y,

where V̂ = V(θ̂).
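The three profiling steps can be sketched in a few lines. The following (not from the notes; a crude grid search stands in for a real optimizer, and the data are simulated from a one-way random effects model with made-up parameters) maximizes pℓ(θ; y) over θ = (σ2b, σ2):

```python
import numpy as np

# Simulated one-way random effects data (assumed values).
rng = np.random.default_rng(4)
n_grp, t = 8, 4
N = n_grp * t
X = np.ones((N, 1))
Z = np.kron(np.eye(n_grp), np.ones((t, 1)))
b = rng.normal(0, np.sqrt(2.0), size=n_grp)
y = 5.0 + Z @ b + rng.normal(0, 1.0, size=N)

def profile_loglik(theta):
    """pl(theta; y) up to an additive constant; beta is profiled out by GLS."""
    s2b, s2 = theta
    V = s2b * (Z @ Z.T) + s2 * np.eye(N)
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # step i: beta_theta
    r = y - X @ beta
    sign, logdet = np.linalg.slogdet(V)
    return -0.5 * (logdet + r @ Vinv @ r)

# Step ii: maximize over a coarse grid (a real fit would use Newton-type steps).
grid = [(s2b, s2) for s2b in np.linspace(0.2, 6.0, 30)
        for s2 in np.linspace(0.2, 3.0, 15)]
theta_hat = max(grid, key=profile_loglik)
```

Step iii would then plug V(θ̂) back into the GLS formula to get β̂.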

• For a presentation of efficient computational methods for maximizing pℓ(θ; y), see Pinheiro and Bates (2000, §2.2) and McCulloch, Searle, and Neuhaus (2008, Ch. 14).


Variance Component Models:

While the parameterization of V through θ can, in the general LMM, take on a wide variety of forms, in the important subclass of LMMs known as variance component models, V(θ) has a specific simple form.

In variance component models, the levels of any particular random effect are assumed to be independent with the same variance. Different random effects are allowed different variances and are assumed independent. In addition, the errors are assumed to be homoscedastic, so that R = σ2In.

• The one-way random effects model, the RCB model, and the split-plot model are all examples of variance component models.

Another example is the model for an s × s Latin square design in which both blocking factors (rows and columns of the Latin square) are thought of as random. The appropriate model for yijk, the response for the ith treatment, jth row, kth column, would be

yijk = µ + αi + rj + ck + εijk,
r1, . . . , rs iid∼ N(0, σ2r),
c1, . . . , cs iid∼ N(0, σ2c),
ε ∼ N(0, σ2I).


In variance component models, Z can be partitioned into q − 1 submatrices (q = 2 in the one-way, RCB, and split-plot models; q = 3 in the LSD) as Z = (Z1, Z2, . . . , Zq−1), and b can be partitioned accordingly as b = (bT1, bT2, . . . , bTq−1)T .

Let m(i) denote the number of columns of Zi (= the number of elements of bi). We assume var(bi) = σ2i Im(i), i = 1, . . . , q − 1, and cov(bi, bj) = 0 for i ≠ j.

Then D takes on a block-diagonal structure as follows:

D =
[ σ21Im(1)      0        · · ·       0
     0       σ22Im(2)    · · ·       0
    ...
     0          0        · · ·    σ2q−1Im(q−1) ]

We also assume R = σ2qIn. Putting these assumptions together, the matrix V is assumed to be of the form

V = Σi=1,...,q−1 σ2i ZiZTi + σ2qIn = Σi=1,...,q σ2i ZiZTi ,   (♢)

where Zq ≡ In.

• E.g., in the LSD, suppose that the number of treatments = number of rows = number of columns = 3. Suppose the design and data are as follows:

A B C        y111 y212 y313
C A B        y321 y122 y223
B C A        y231 y332 y133


Then the model can be written as

[ y111 ]   [ 1 1 0 0 ]            [ 1 0 0 ]          [ 1 0 0 ]          [ ε111 ]
[ y212 ]   [ 1 0 1 0 ]            [ 1 0 0 ]          [ 0 1 0 ]          [ ε212 ]
[ y313 ]   [ 1 0 0 1 ] [ µ  ]     [ 1 0 0 ] [ r1 ]   [ 0 0 1 ] [ c1 ]   [ ε313 ]
[ y321 ] = [ 1 0 0 1 ] [ α1 ]  +  [ 0 1 0 ] [ r2 ] + [ 1 0 0 ] [ c2 ] + [ ε321 ]
[ y122 ]   [ 1 1 0 0 ] [ α2 ]     [ 0 1 0 ] [ r3 ]   [ 0 1 0 ] [ c3 ]   [ ε122 ]
[ y223 ]   [ 1 0 1 0 ] [ α3 ]     [ 0 1 0 ]          [ 0 0 1 ]          [ ε223 ]
[ y231 ]   [ 1 0 1 0 ]            [ 0 0 1 ]          [ 1 0 0 ]          [ ε231 ]
[ y332 ]   [ 1 0 0 1 ]            [ 0 0 1 ]          [ 0 1 0 ]          [ ε332 ]
[ y133 ]   [ 1 1 0 0 ]            [ 0 0 1 ]          [ 0 0 1 ]          [ ε133 ]

or y = Xβ + Z1b1 + Z2b2 + ε.

Here, R = σ2qI9 (i.e., Zq = I9), and D has the form

D = [ σ2rI3    0
        0    σ2cI3 ]

and

V = var(y) = var(Z1b1 + Z2b2 + ε)
  = Z1(σ2rI3)ZT1 + Z2(σ2cI3)ZT2 + Zq(σ2qI9)ZTq
  = σ2rZ1ZT1 + σ2cZ2ZT2 + σ2qZqZTq .
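With the observations ordered row by row as above, Z1 and Z2 again have Kronecker-product form, and V can be assembled directly from (♢). A minimal sketch (not from the notes; the variance component values are made up):

```python
import numpy as np

# V = s2r Z1 Z1' + s2c Z2 Z2' + s2 I9 for the 3 x 3 Latin square,
# observations ordered row by row (assumed variance values).
s2r, s2c, s2 = 2.0, 1.0, 0.5
Z1 = np.kron(np.eye(3), np.ones((3, 1)))   # row indicators, 9 x 3
Z2 = np.kron(np.ones((3, 1)), np.eye(3))   # column indicators, 9 x 3
V = s2r * Z1 @ Z1.T + s2c * Z2 @ Z2.T + s2 * np.eye(9)
```

Two observations sharing a row have covariance σ2r, sharing a column σ2c, and sharing neither 0, which the structure of V reflects entry by entry.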

In variance component models, θ = (σ21, σ22, . . . , σ2q)T and V(θ) = Σj=1,...,q θjZjZTj (cf. (♢)).

This leads to some simplification in the likelihood equations (♡). In particular,

∂V/∂θj = ZjZTj .


Information Matrix:

Under the general theory of maximum likelihood estimation, ML estimators are consistent and asymptotically normal, under suitable regularity conditions.

• For a discussion of the regularity conditions, see, for example, Cox and Hinkley (1974, Theoretical Statistics, Ch.9), or Seber and Wild (1989, Nonlinear Regression, Ch.12).

In particular, the asymptotic variance-covariance matrix of an MLE ϕ̂, defined as the maximizer of a loglikelihood function ℓ(ϕ), is I(ϕ)−1, where

I(ϕ) = −E( ∂2ℓ(ϕ) / ∂ϕ∂ϕT )

is known as the Fisher information matrix.

• In practice, we replace ϕ by ϕ̂ and use I(ϕ̂).

• Alternatively, the observed information matrix, or negative Hessian matrix,

−( ∂2ℓ(ϕ) / ∂ϕ∂ϕT ) evaluated at ϕ = ϕ̂,

can be used in place of I(ϕ̂) without changing the asymptotics.


In the LMM context, ϕ = (βT, θT)T and the loglikelihood is given at the top of p.73. The information matrix is given by

−E [ ∂2ℓ/∂β∂βT       ∂2ℓ/∂β∂θT  ]   [ XTV−1X       0                                     ]
   [ (∂2ℓ/∂β∂θT)T    ∂2ℓ/∂θ∂θT  ] = [    0    (1/2){ tr( V−1 (∂V/∂θj) V−1 (∂V/∂θk) ) }   ],

where { tr( V−1 (∂V/∂θj) V−1 (∂V/∂θk) ) } denotes the q × q matrix with the quantity inside the curly braces as its (j, k)th element (see §6.11.a.iii of McCulloch, Searle, and Neuhaus, 2008, for the details).

Inverting this matrix leads to the following asymptotic variance-covariance matrices for the MLE (β̂T, θ̂T)T:

avar(Xβ̂) = X(XTV−1X)−XT

avar(θ̂) = 2[ { tr( V−1 (∂V/∂θj) V−1 (∂V/∂θk) ) } ]−1

acov(Xβ̂, θ̂) = 0.

• Inference for ML in the LMM can now be based on standard asymptotic likelihood-based methods — Wald, score, and LR tests and CIs, and model selection criteria such as AIC and BIC — all can be formed in the usual way.

• E.g., the Wald test statistic for the hypothesis Cβ = d, where Cβ is a vector of estimable functions (i.e., C equals AX for some A), and where rank(C) = nrows(C) ≤ rank(X), is given by

(Cβ̂ − d)T [ C{XTV(θ̂)−1X}−CT ]−1 (Cβ̂ − d) ∼a χ2(nrows(C)).

• However, we will see that we can improve upon these asymptotic results with approximate F- and t-based inference that works in small and large samples.


Restricted Maximum Likelihood Estimation:

Recall that in simple problems, ML estimation of variances produces biased estimators.

• E.g., in a one-sample problem from a N(µ, σ2) distribution, the MLE of σ2 is (1/n) Σi (xi − x̄)2 rather than the unbiased estimator s2 = (1/(n − 1)) Σi (xi − x̄)2.

• Another example: in the CLM, the MLE of the error variance is (1/n)SSE rather than the unbiased estimator S2 = SSE/(n − rank(X)).

In estimating the variance, these ML estimators ignore the fact that parameters in the mean have been estimated.

• In the one-sample problem, one degree of freedom is used up in estimating µ with x̄, so the appropriate divisor is n − 1 (number of observations minus number of "non-redundant" parameters estimated) rather than n.

• In the CLM, we use up rank(X) degrees of freedom in estimating β. Therefore, the appropriate divisor is n − rank(X) rather than n.

We'd prefer to have a general "likelihood-based" method of estimation that produces estimators of variances that account for the degrees of freedom lost (information used up) in estimating parameters of the mean. Such a method is restricted maximum likelihood (REML) estimation (sometimes called residual maximum likelihood or marginal maximum likelihood).

• Note that the goal here is to improve upon the ML estimator of θ, not of β and θ. That is, REML is a method of estimating the variance-covariance parameters, not a method of estimating all of the parameters of the model.

– However, given a REML estimator of θ, it is obvious how the ML estimator of β should be formed.


In REML, parameter estimates of θ are obtained by maximizing that part of the likelihood which is invariant to Xβ.

• That is, we eliminate β from the loglikelihood by considering the loglikelihood of a set of linear combinations of y, known as error contrasts, whose distribution does not depend upon β, rather than the density of y itself.

Error Contrasts: A linear combination kTy is said to be an error contrast if E(kTy) = 0 for all β.

• It follows that kTy is an error contrast if and only if XTk = 0.

Let C(X) denote the column space of X and C(X)⊥ its orthogonal complement. Let PC(X)⊥ = I − PC(X) = I − X(XTX)−XT, and suppose rank(X) = s. Then each element of the vector PC(X)⊥y is an error contrast, because

E(PC(X)⊥y) = (I − PC(X))E(y) = (I − PC(X))Xβ = 0.

Note, however, that rank(PC(X)⊥) = n − s, while the dimension of PC(X)⊥ is n × n.

• Therefore, there are some redundancies among the elements of PC(X)⊥y.

A natural question arises:

How many essentially different (non-redundant) error contrasts can be included in a single set?


Linearly Independent Error Contrasts: Error contrasts k_1^T y, k_2^T y, . . . , k_m^T y are said to be linearly independent if k_1, . . . , k_m are linearly independent vectors.

Theorem: Any set of error contrasts contains at most n − rank(X) = n − s linearly independent error contrasts.

Theorem: Let K be an n × (n − s) matrix such that K^T K = I and K K^T = P_{C(X)⊥}. The (n − s) × 1 vector w defined by

w = K^T y

is a vector of n − s linearly independent error contrasts. (It is not the only such vector, however.)
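The construction above can be checked numerically. The sketch below (numpy only, with a made-up full-rank design matrix X) builds P_{C(X)⊥}, extracts a K from its unit eigenvectors, and verifies that K^T K = I, K K^T = P_{C(X)⊥}, and X^T K = 0, so the columns of K define n − s linearly independent error contrasts.

```python
# Numerical check of the error-contrast construction (illustrative sketch
# with a made-up design matrix X).
import numpy as np

rng = np.random.default_rng(0)
n, s = 8, 3
X = rng.standard_normal((n, s))                    # full-rank n x s design

# Projections onto C(X) and onto its orthogonal complement
P_X = X @ np.linalg.solve(X.T @ X, X.T)
P_perp = np.eye(n) - P_X

# Build K (n x (n-s)) from the eigenvectors of P_perp with eigenvalue 1,
# so that K^T K = I and K K^T = P_perp
eigvals, eigvecs = np.linalg.eigh(P_perp)
K = eigvecs[:, np.isclose(eigvals, 1.0)]

assert K.shape == (n, n - s)
assert np.allclose(K.T @ K, np.eye(n - s))         # orthonormal columns
assert np.allclose(K @ K.T, P_perp)                # K K^T = projector
assert np.allclose(X.T @ K, 0)                     # each column is an error contrast
print("K gives", n - s, "linearly independent error contrasts")
```

Any K with these two properties works; a different eigenvector ordering just gives a different (equally valid) set of error contrasts.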

The REML approach consists of applying ML estimation to w rather than y. If y ∼ N_n(Xβ, V(θ)) where V(θ) = Z D(θ) Z^T + R(θ), then

w ∼ N_{n−s}(0, K^T V(θ) K).

Therefore, the restricted loglikelihood for θ is just the log density of w:

ℓ_R(θ; y) = −((n − s)/2) log(2π) − (1/2) log |K^T V(θ) K| − (1/2) w^T (K^T V(θ) K)^{−1} w.

θ̂ is a REML estimate of θ if ℓ_R(θ; y) attains its maximum value at θ = θ̂.

• It can be shown that the maximizer of the restricted loglikelihood does not depend upon which vector of n − s linearly independent error contrasts is used to form w.

– That is, the REML estimator is well-defined.


It is preferable to express ℓ_R(θ; y) in terms of X and V (quantities that define our model) rather than in terms of K and V. Hence, the following result:

Theorem: The log-likelihood function associated with any vector of n − s linearly independent error contrasts is, apart from an additive constant that doesn’t depend on θ,

ℓ_R(θ; y) = −(1/2) log |V(θ)| − (1/2) (y − X β̂_θ)^T {V(θ)}^{−1} (y − X β̂_θ) − (1/2) log |X̃^T {V(θ)}^{−1} X̃|,   (♠)

where X̃ represents any n × s matrix such that C(X̃) = C(X), and β̂_θ is any solution to X^T {V(θ)}^{−1} X β = X^T {V(θ)}^{−1} y.

• Note that in ordinary ML, we obtain the profile likelihood for θ as

pℓ(θ; y) = −(1/2) log |V(θ)| − (1/2) (y − X β̂_θ)^T {V(θ)}^{−1} (y − X β̂_θ).

Notice that ℓ_R(θ; y) differs from pℓ(θ; y) only by the additional term −(1/2) log |X^T {V(θ)}^{−1} X|. This term serves as an adjustment, or penalty, for the estimation of β. Hence REML estimation is sometimes called a penalized likelihood method.

To obtain θ̂, the REML estimate of θ, we solve the estimating equations

∂ℓ_R(θ; y)/∂θ_i = 0,   i = 1, . . . , q.


In general LMMs, these estimating equations can be written

(1/2) (y − X β̂_θ)^T {V(θ)}^{−1} (∂V(θ)/∂θ_i) {V(θ)}^{−1} (y − X β̂_θ) − (1/2) tr{Q (∂V(θ)/∂θ_i)} = 0,

i = 1, . . . , q, where

Q = {V(θ)}^{−1} [I − X (X^T {V(θ)}^{−1} X)^− X^T {V(θ)}^{−1}].

In variance component models where θ = (σ²_1, σ²_2, . . . , σ²_q)^T, these estimating equations simplify based on ∂V(θ)/∂σ²_i = Z_i Z_i^T, i = 1, . . . , q.

• REML estimators are not, in general, unbiased. However, they typically have less bias than ML estimators of variance components.

• While it is not possible to make completely general recommendations concerning REML vs. ML estimation, it does appear that REML estimators perform better than ML estimators when s is large relative to n. I would recommend REML over ML for s > 4 or so.

• As previously mentioned, REML provides an estimator of θ; it says nothing about the estimation of β. However, ML says to estimate θ as

θ̂_ML = argmax_θ pℓ(θ; y)

and then estimate β as

β̂_ML = β̂_{θ̂_ML} = [X^T {V(θ̂_ML)}^{−1} X]^{−1} X^T {V(θ̂_ML)}^{−1} y.

In REML, we estimate θ by maximizing ℓ_R(θ; y) instead of pℓ(θ; y). Therefore, the obvious “REML estimator of β” is

β̂_REML = β̂_{θ̂_REML} = [X^T {V(θ̂_REML)}^{−1} X]^{−1} X^T {V(θ̂_REML)}^{−1} y,

where

θ̂_REML = argmax_θ ℓ_R(θ; y).


The asymptotic variance-covariance matrix of (β̂^T_REML, θ̂^T_REML)^T is given by the block-diagonal matrix

( (X^T V^{−1} X)^−                 0
        0           2 [{tr(P (∂V/∂θ_j) P (∂V/∂θ_k))}_{jk}]^{−1} ),

where

P = V^{−1} − V^{−1} X (X^T V^{−1} X)^− X^T V^{−1} = K (K^T V K)^{−1} K^T.

• As in ML estimation, this asymptotic var-cov matrix can be estimated by evaluating it at the REML estimates.

• Wald-based inference can then be done based on the estimated asymptotic var-cov matrix.

• Note, however, that the restricted loglikelihood cannot be treated as an ordinary loglikelihood. In particular, LRTs, AICs, and BICs based on the restricted loglikelihood objective function should not be used to select between models with different fixed-effects specifications.

• The restricted loglikelihood given by (♠) can be derived in a number of different ways. Harville (1974) and Laird and Ware (1982) use a Bayesian approach, while Patterson and Thompson (1971) use a more traditional frequentist approach.

• It can also be derived as a modified profile likelihood function (see Pawitan, §10.6 and Ch. 17). See also McCullagh and Nelder, §7.2, for connections to marginal and conditional likelihood.


Small-Sample Inference on the Fixed Effects:

As mentioned previously, ML/REML estimation provides a unified framework for estimation and inference in the LMM, and standard likelihood-based inference techniques for fixed effects are available (Wald tests, LR tests, etc.).

In addition, for many special cases of the LMM, such as anova models, exact (small and large sample) inference techniques are available for balanced data.

• E.g., in the one-way random effects model, or the RCB model, or the split-plot model, it is possible to obtain exact F tests to test treatment effects, do inference on treatment means, etc.

However, for unbalanced data, exact distributional results are not available, and in small samples, asymptotic variances can seriously underestimate the true variances of parameter estimators, compromising the validity of the asymptotic inferences.

• Therefore, large-sample techniques (e.g., conventional Wald and LR tests) are not recommended for inference on the fixed effects in LMMs, except in very large samples.

We now consider small sample inference methods which attempt to compensate for the underestimation of the sampling variance of the REML estimator of β in the LMM. The following presentation is based on Kenward and Roger (1997).


Recall that the REML estimator of β is

β̂ = Φ(θ̂) X^T V(θ̂)^{−1} y,

where

Φ(θ) = {X^T V(θ)^{−1} X}^{−1},

and θ̂ = θ̂_REML is the REML estimator of θ.

• For simplicity, we will assume that X is of full rank; the results presented here extend to the non-full-rank case as well.

Recall that the matrix Φ(θ) is the asymptotic variance-covariance matrix of β̂, and conventionally, its estimator

Φ̂ ≡ Φ(θ̂) = {X^T V(θ̂)^{−1} X}^{−1}

is used to quantify the precision of β̂.

There are two sources of bias in Φ̂ as a measure of the precision of β̂ in small samples:

1. Φ(θ) takes no account of the error introduced into β̂ by having to estimate θ, so it is an underestimate of var(β̂).

– Another way to say this is as follows: When θ is known, we know that var[{X^T V(θ)^{−1} X}^{−1} X^T V(θ)^{−1} y] = Φ(θ). Undoubtedly, plugging in θ̂ in place of θ introduces some extra variability (error), so var(β̂) must be greater than Φ(θ).

2. Φ̂ = Φ(θ̂) is a biased estimator of Φ(θ).


To correct these deficiencies, we write the variability in β̂ as the sum of two components:

var(β̂) = Φ + Λ,

where Λ represents the amount by which Φ = avar(β̂) underestimates var(β̂).

Using Taylor series expansions, it can be shown that Λ can be approximated by

Λ ≈ Φ { Σ_{i=1}^q Σ_{j=1}^q W_ij (S_ij − T_i Φ T_j) } Φ,

where

T_i = X^T (∂V^{−1}/∂θ_i) X,   S_ij = X^T (∂V^{−1}/∂θ_i) V (∂V^{−1}/∂θ_j) X,

and W_ij is the (i, j)th element of W = var(θ̂).

In addition, a Taylor series expansion about θ can be used to show that Φ̂ is biased as follows:

E(Φ̂) ≈ Φ − Λ + (1/2) Σ_{i=1}^q Σ_{j=1}^q W_ij Φ R_ij Φ,

where the last (double-sum) term is denoted (∗) and

R_ij = X^T V^{−1} (∂²V/∂θ_i ∂θ_j) V^{−1} X.

Since we want to estimate Φ + Λ, Kenward and Roger suggest an adjusted small sample var-cov matrix of β̂ given by

Φ̂_adj = Φ̂ + 2Λ̂ − (1/2) Σ_{i=1}^q Σ_{j=1}^q Ŵ_ij Φ̂ R̂_ij Φ̂,

where Ŵ_ij is the (i, j)th element of the estimated asymptotic var-cov matrix of θ̂_REML (given earlier), and V(θ̂) is substituted for V(θ) to form R̂_ij and Λ̂.

• Note that in variance component models, the term (∗) equals 0, so the adjusted estimator of var(β̂) simplifies to Φ̂_adj = Φ̂ + 2Λ̂.


Inference and Degrees of Freedom:

For a testable hypothesis of the form H_0 : Cβ = d, where C is c × p of full row rank, a reasonable test statistic for H_0 is given by

(1/c) (Cβ̂ − d)^T [v̂ar(Cβ̂)]^{−1} (Cβ̂ − d).   (∗∗)

When V is known, we have

var(Cβ̂) = C (X^T V^{−1} X)^− C^T,

or if V is known up to a multiplicative constant, i.e., if V = σ² W for W known, then

var(Cβ̂) = σ² C (X^T W^{−1} X)^− C^T,

which is estimated by

v̂ar(Cβ̂) = S² C (X^T W^{−1} X)^− C^T,

where

S² = (y − X β̂)^T W^{−1} (y − X β̂) / (n − rank(X))

is the MSE from the fitted model. In that case, (∗∗) becomes

F ≡ (Cβ̂ − d)^T [C (X^T W^{−1} X)^− C^T]^{−1} (Cβ̂ − d) / (c S²)

and, under H_0, we have

F ∼ F(c, n − rank(X)).
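As a concrete illustration of the exact case, the sketch below (toy data with W = I, so V = σ²I; numpy and scipy assumed available) computes the F statistic (∗∗) for H_0 : Cβ = d and refers it to F(c, n − rank(X)):

```python
# Exact F test of H0: C beta = d in the special case V = sigma^2 * I.
# The design, C, and data are all made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta = np.array([2.0, 0.0, 0.0])          # H0 below is actually true here
y = X @ beta + rng.standard_normal(n)

C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])           # H0: beta_2 = beta_3 = 0
d = np.zeros(2)
c = C.shape[0]

bhat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ bhat
S2 = resid @ resid / (n - p)              # MSE; rank(X) = p here
M = C @ np.linalg.solve(X.T @ X, C.T)     # C (X'X)^- C'
F = (C @ bhat - d) @ np.linalg.solve(M, C @ bhat - d) / (c * S2)
pval = stats.f.sf(F, c, n - p)            # F ~ F(c, n - p) under H0
print(F, pval)
```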


However, in the LMM when V is unknown, the estimation of var(Cβ̂) becomes more challenging, and no longer necessarily leads to an exact F statistic.

That is, when V is unknown, it still makes sense to use (∗∗) as a test statistic, but now we use the Kenward-Roger estimate Φ̂_adj to obtain v̂ar(Cβ̂). This leads to the test statistic

F ≡ (1/c) (Cβ̂ − d)^T [C Φ̂_adj C^T]^{−1} (Cβ̂ − d).

This quantity no longer has an exact F distribution, but similar to the Satterthwaite procedure, we can approximate the distribution of F by assuming that

λF .∼ F(c, d),   (†)

for some λ and d.

A λ and d that make this a good approximation can be found by equating the first and second moments of both sides of (†). This approach leads to

λ = d / {(d − 2) E(F)},   and   d = 4 + (c + 2)/(c ρ − 1),   where ρ = var(F) / [2{E(F)}²].

Formulas for E(F) and var(F) are given in Kenward and Roger (1997).

• This procedure for estimating λ and d gives approximate F statistics that will typically perform much better than asymptotic inference techniques.

• These F statistics are approximate, in general, but do reduce to the exact F statistics in those cases in which exact results are available (e.g., balanced anova models).


• The Kenward-Roger approach to inference is implemented in PROC MIXED with the DDFM=KENWARDROGER option on the MODEL statement.

• Alternatively, the DDFM=SATTERTH option implements a closely related Satterthwaite approximation to obtain denominator degrees of freedom for approximate F tests. In the Satterthwaite procedure, the unadjusted estimator Φ̂ is used instead of Kenward and Roger’s Φ̂_adj to form the F statistic. That is, v̂ar(Cβ̂) = C {X^T V(θ̂)^{−1} X}^− C^T is used in the Satterthwaite procedure.

• There is not yet consensus on which approach to small sample inference is, in general, preferable in LMMs. However, a recent simulation study (Schaalje et al., 2002) suggests that the K-R procedure may be superior, and it is what I recommend.

Inference on θ:

Often, the mean structure is of more interest when fitting a LMM than the variance-covariance structure. However, adequate modeling of the var-cov structure is important for model-based inference on the mean (on β).

• Overparameterization of the var-cov structure leads to inefficient estimation of β, low power, and potentially poor estimation of standard errors for β̂.

• On the other hand, an over-simplified var-cov structure can invalidate inferences on β.

In addition, the var-cov structure is sometimes of interest in and of itself, and correct modeling of the var-cov structure can be important when handling certain types of missing data.


The major obstacle to inference on θ is that θ parameterizes the variance-covariance matrices D and R, which must be positive definite.

Therefore, the parameter space for θ is constrained, which can strongly affect the distributional results typically used for inference.

• E.g., in variance component models, the elements of θ have the interpretation of variances, so they are necessarily non-negative. This strongly affects the adequacy of distributional approximations for θ̂.

Wald Tests:

Based on classical likelihood theory, θ̂_ML is asymptotically N(θ, avar(θ̂_ML)), and θ̂_REML is asymptotically N(θ, avar(θ̂_REML)), where the asymptotic var-cov matrices are those given previously for ML and REML, respectively.

In principle, therefore, Wald-based inference for a linear combination η = c^T θ could be based on the distributional result

(c^T θ̂ − η) / √(c^T âvar(θ̂) c) .∼ N(0, 1).   (†)

However, the adequacy of this normal approximation depends strongly on how close η is to the boundary of its parameter space.


E.g., suppose η = θ_j and θ_j is a variance component (or a diagonal element of D). Then if θ_j is close to zero, (†) becomes a poor approximation. In fact, (†) breaks down altogether if θ_j = 0, or, more generally, if η is on the boundary of its parameter space.

• So, if η is far from the boundary of its parameter space, (†) can be used to form Wald confidence intervals and hypothesis tests.

– Here, how far η must be from its boundary depends upon the sample size.

• However, (†) is useless for testing whether variance components are equal to zero (i.e., for testing whether a certain random effect is necessary in the model).

Likelihood Ratio Tests:

Similar comments apply to LR tests.

Let Θ denote the parameter space for θ. Then, according to classical likelihood theory, a hypothesis of the form

H_0 : θ ∈ Θ_0,

where Θ_0 is a subspace of Θ, can be tested with the LR statistic

2{ℓ(θ̂_ML) − ℓ(θ̂_0,ML)} .∼ χ²(dim(Θ) − dim(Θ_0)),

where θ̂_ML and θ̂_0,ML are the MLEs over Θ and Θ_0, respectively.

• Similarly, if the null hypothesis does not involve β, then a restricted LR test can be done based on

2{ℓ_R(θ̂_REML) − ℓ_R(θ̂_0,REML)} .∼ χ²(dim(Θ) − dim(Θ_0)).

However, the regularity conditions that establish these results (Wilks’ Theorem) assume that θ is not on the boundary of its parameter space. So, once again, standard asymptotic theory does not apply to LR testing for the significance of variance components.


In selecting an appropriate variance-covariance specification for our model, such hypotheses often arise.

• E.g., testing whether a certain random effect is necessary in the model is equivalent to testing whether its associated variance component is zero.

Therefore, how can we go about building the var-cov structure of our model?

One possible answer is to use the same Wald and LR test statistics, but use their proper reference distributions under the null hypothesis.

• That is, when the hypothesis places the parameter on the boundary of its parameter space, the usual normal and chi-square limiting distributions are no longer correct, but that does not mean the test statistics themselves are no longer appropriate. So, use the same statistics, but with the right reference distribution.

Unfortunately, figuring out what the right reference distribution is can be hard, and sometimes even if the distribution is known, obtaining critical values or p-values from that distribution can be hard.

• The only truly simple case is when the error variance-covariance matrix is of the form R = σ²I and we are testing a null model that includes i.i.d. random effects that are each q-variate versus an alternative model with i.i.d. random effects that are each (q + 1)-variate, where under both the null and alternative each random effect vector has a general, unstructured variance-covariance matrix.

– In this case, the null distribution of the LRT and restricted LRT is a 50:50 mixture of a χ²(q) and a χ²(q + 1) distribution.


• E.g., suppose we are choosing between a model with no random effects versus a model with a cluster-specific random effect, b_i, where

b_1, . . . , b_n iid∼ N(0, σ²_b).

Then to test H_0 : σ²_b = 0, the LR and restricted LR tests both have null distribution that is a 50:50 mixture of a χ²(0) and a χ²(1) distribution, where χ²(0) denotes the chi-square distribution with 0 d.f., which is 0 with probability 1.

– In this case, the correct p-value for H_0 is exactly one-half the value based on a χ²(1) distribution.

• As a second example, suppose we are choosing between a model with a cluster-specific random effect, b_i, where

b_1, . . . , b_n iid∼ N(0, σ²_b),

versus a model with a bivariate cluster-specific random effect, b_i, where

b_1, . . . , b_n iid∼ N(0, D),   where D = ( θ11  θ12
                                            θ12  θ22 ).

Then the LR and restricted LR tests both have null distribution that is a 50:50 mixture of a χ²(1) and a χ²(2).

• A table of critical values of 50:50 mixtures of χ²(q) and χ²(q + 1) is given in the back of our text, and can be used for testing hypotheses on variance components in this special case.

• However, more generally (e.g., when R ≠ σ²I, or when comparing more complex random effects structures), the null distribution of the LRT can be difficult to find and use.

– In such cases, our textbook authors suggest testing the hypotheses using the standard asymptotics of the LRT, but using α = 0.1 instead of α = 0.05.
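Computing p-values from these 50:50 mixtures is straightforward. The helper below is an illustrative sketch (scipy assumed available; `mixture_pvalue` is a hypothetical name, and the LRT statistics fed to it are made up):

```python
# P-values from a 50:50 mixture of chi2(q) and chi2(q+1) null distributions.
from scipy.stats import chi2

def mixture_pvalue(lrt, q):
    """P-value under a 50:50 mixture of chi2(q) and chi2(q+1).

    For q = 0, chi2(0) is a point mass at zero, so for lrt > 0 the
    p-value is exactly half the chi2(1) tail probability.
    """
    p_lower = 0.0 if q == 0 else chi2.sf(lrt, q)
    return 0.5 * (p_lower + chi2.sf(lrt, q + 1))

# Testing H0: sigma_b^2 = 0 (mixture of chi2(0) and chi2(1)):
print(mixture_pvalue(3.84, 0))   # half of the chi2(1) p-value of ~0.05
# Univariate vs bivariate random effect (mixture of chi2(1) and chi2(2)):
print(mixture_pvalue(3.84, 1))
```

Note how halving the naive χ²(1) p-value makes the test less conservative: an LRT statistic of 3.84 is significant at α = 0.025, not just α = 0.05.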


• Alternatively, if we are not interested in inference on the variance-covariance structure per se, but simply want to choose an adequate var-cov structure so that we can get valid conclusions in our inferences on the mean, then we may choose not to do formal hypothesis tests on the variance-covariance structure of the model, but simply select an adequate var-cov structure via a model selection criterion, such as AIC.

Model Selection Criteria:

Consider testing some hypothesis H_0 versus an alternative H_A, where these hypotheses correspond to nested models.

Let ℓ_0 and ℓ_A denote the loglikelihood function evaluated at the MLE under the null and alternative models, respectively. Further, let #ϕ_0 and #ϕ_A denote the number of free parameters under H_0 and H_A.

Then the LR test rejects H_0 if ℓ_A − ℓ_0 is large in comparison to the difference in d.f. of the two models to be compared. That is, it rejects if

ℓ_A − ℓ_0 > F(#ϕ_A) − F(#ϕ_0),

or equivalently, if

ℓ_A − F(#ϕ_A) > ℓ_0 − F(#ϕ_0),

for an appropriate function F.

• For example, for an α-level LR test under standard regularity conditions (i.e., when the null hypothesis doesn’t place the parameter on the boundary of the parameter space), F is a function such that

F(#ϕ_A) − F(#ϕ_0) = (1/2) χ²_{1−α}(#ϕ_A − #ϕ_0).

• This procedure can only be considered a formal hypothesis test if the null and alternative correspond to nested models and if F(#ϕ_A) − F(#ϕ_0) gives the appropriate critical value from the reference distribution of ℓ_A − ℓ_0.


However, there is no reason why the above procedure could not be used as a rule of thumb for comparing any two (not necessarily nested) models, or why we couldn’t consider other functions F(·) for choosing between models.

• Some commonly used functions for F are given in Table 6.7 of Verbeke & Molenberghs (2000, p. 74), reproduced below.

• These functions yield different choices of information criteria, of which AIC and BIC are by far the most common.

• The basic idea in all of these criteria is to compare models based on their maximized loglikelihood values, but to penalize for the use of too many parameters (penalize model complexity).

• The model with the largest value of whichever criterion is chosen is the winner, but note that sometimes AIC and BIC are defined as −2 times the definition given here, so that smallest is best.

• Note that for discriminating between variance-covariance structures, ℓ_R may be used in place of ℓ, as long as we replace n by n∗ = n − p, the number of linearly independent error contrasts used to form ℓ_R.

• Which criterion performs best is not an easy question to answer, and depends upon the nature of the data, models, and the purpose to which the models are to be put.


• AIC tends to penalize less for model complexity than BIC, so it tends to err on the side of overspecified models, whereas BIC errs on the side of underspecification.

– Underspecification tends to lead to bias, overspecification to inefficiency (increased variance).

• For choosing an adequate variance-covariance structure when interest centers on the mean, the consequences of underspecification are more dire. Therefore I do not recommend BIC.

• For selecting the variance-covariance structure in mixed models, AIC is commonly used. However, AIC is an estimator of the expected Kullback discrepancy between the true model and a fitted model, and as such it is known to be downwardly biased by an amount that disappears asymptotically, but can be substantial in small samples or whenever p is large relative to the total sample size n.

• Therefore, a variety of alternatives to AIC have been proposed that attempt to correct this bias with the goal of better performance in small samples. One of the most popular and effective alternatives is the corrected AIC (AICc) of Sugiura (1978) and Hurvich and Tsai (1989, 1993, 1995):

AICc = −2ℓ(ϕ̂) + 2k ( n∗ / (n∗ − k − 1) )   vs.   AIC = −2ℓ(ϕ̂) + 2k,

where ϕ̂ is the MLE of the model parameter ϕ and k = #ϕ.

• AICc is recommended in many regression contexts (CLM, GLMs, time series analysis, nonlinear regression, etc.), but has not been formally justified in a mixed model context, especially for the selection of the variance-covariance/random effects structure. Therefore, it is not recommended for this purpose. Development of bias-corrected versions of the AIC for mixed models is an active area of research. At present, most of the methods that have been proposed and validated are computationally intensive and not implemented in standard software.
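The two criteria can be compared directly. The sketch below uses made-up maximized loglikelihoods and parameter counts for two hypothetical covariance structures; with these toy numbers the richer structure wins on AIC but loses on AICc, illustrating AICc’s stiffer small-sample penalty:

```python
# AIC vs AICc in the smaller-is-better (-2 loglik) form; the loglikelihoods,
# parameter counts, and n* below are all made up for illustration.
def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n_star):
    # AICc inflates the AIC penalty by the factor n*/(n* - k - 1)
    return -2.0 * loglik + 2.0 * k * n_star / (n_star - k - 1.0)

n_star = 20   # e.g., n* = n - p error contrasts for a REML-based comparison
models = {"compound symmetry": (-52.1, 2),   # (max loglik, # cov parameters)
          "unstructured":      (-43.0, 10)}
for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.2f}, AICc = {aicc(ll, k, n_star):.2f}")
```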


The Linear Mixed Model for Clustered Data:

So far, we have presented the LMM in its general form. However, application to longitudinal or other clustered data is among the most common uses of the LMM, so we now consider the LMM in that context.

Suppose we have data on n subjects, where y_i = (y_i1, . . . , y_it_i)^T is the t_i × 1 vector of observations available on the ith subject, i = 1, . . . , n. Then the LMM for longitudinal data is given by

y_i = X_i β + Z_i b_i + ε_i,   i = 1, . . . , n,

where X_i and Z_i are t_i × p and t_i × g design matrices for fixed effects β and random effects b_i, respectively, and ε_i is a vector of error terms.

As before, it is assumed that

b_i ∼ N(0, D(θ)),   ε_i ∼ N(0, R_i(θ)),

and b_1, . . . , b_n, ε_1, . . . , ε_n are independent.

• Note that var(ε_i) = R_i depends upon i through the dimension t_i of y_i (the cluster size), and may also depend on i by having a different form, or at least different parameter values, for different subsets of subjects.

• D and the dimension of b_i, however, are assumed to be the same for all i.


The specification above can be equivalently expressed as

y_i | b_i ∼ N(X_i β + Z_i b_i, R_i(θ)),   where b_i ∼ N(0, D(θ)).   (♣)

• (♣) is known as the hierarchical formulation of the model.

Note that the hierarchical model implies a marginal model. That is, the marginal density of y_i can be obtained from the conditional density of y_i | b_i and the marginal density of b_i through the relationship

f(y_i) = ∫ f(y_i | b_i) f(b_i) db_i.

Because both densities in the integrand are normal, the integral yields a normal, so that

y_i ∼ N(X_i β, V_i(θ)),   V_i(θ) = Z_i D(θ) Z_i^T + R_i(θ).

• That is, the hierarchical (conditionally specified) model implies a corresponding marginal model.

• Note, however, that for arbitrary V_i(θ) (V_i(θ) an arbitrary p.d. matrix), or even V_i(θ) p.d. and of the form Z_i D(θ) Z_i^T + R_i(θ) where D and R_i are not assumed p.d., the marginal model does not necessarily imply the hierarchical one.

– That is, there is a subtle distinction between the hierarchical and marginal formulations of the model. This distinction will become more important when we study generalized linear mixed models.
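The marginal covariance implied by the hierarchical formulation can be checked by simulation. The sketch below (made-up Z_i, D, and R_i; numpy only) simulates b_i and ε_i and compares the empirical covariance of y_i with Z_i D Z_i^T + R_i:

```python
# Monte Carlo check that the hierarchical model implies the marginal
# covariance V_i = Z_i D Z_i' + R_i. All matrices here are made up.
import numpy as np

rng = np.random.default_rng(3)
t_i, g = 4, 2
Z = np.column_stack([np.ones(t_i), np.arange(t_i)])  # random intercept + slope
D = np.array([[1.0, 0.3], [0.3, 0.5]])
R = 0.25 * np.eye(t_i)
V = Z @ D @ Z.T + R

reps = 200_000
b = rng.multivariate_normal(np.zeros(g), D, size=reps)
eps = rng.multivariate_normal(np.zeros(t_i), R, size=reps)
y = b @ Z.T + eps            # mean part X_i beta omitted; it does not affect cov(y)

print(np.max(np.abs(np.cov(y.T) - V)))   # small Monte Carlo error
```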
