Mixed-effects modeling with crossed random effects for subjects and items

R.H. Baayen a,*, D.J. Davidson b, D.M. Bates c

a University of Alberta, Edmonton, Department of Linguistics, Canada T6G 2E5
b Max Planck Institute for Psycholinguistics, P.O. Box 310, 6500 AH Nijmegen, The Netherlands
c University of Wisconsin, Madison, Department of Statistics, WI 53706-168, USA

Received 15 February 2007; revision received 13 December 2007
Available online 3 March 2008

Abstract

This paper provides an introduction to mixed-effects models for the analysis of repeated measurement data with subjects and items as crossed random effects. A worked-out example of how to use recent software for mixed-effects modeling is provided. Simulation studies illustrate the advantages offered by mixed-effects analyses compared to traditional analyses based on quasi-F tests, by-subjects analyses, combined by-subjects and by-items analyses, and random regression. Applications and possibilities across a range of domains of inquiry are discussed.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Mixed-effects models; Crossed random effects; Quasi-F; By-item; By-subject

Psycholinguists and other cognitive psychologists use convenience samples for their experiments, often based on participants within the local university community. When analyzing the data from these experiments, participants are treated as random variables, because the interest of most studies is not in experimental effects present only in the individuals who participated in the experiment, but rather in effects present in language users everywhere, either within the language studied, or in human language users in general. The differences between individuals due to genetic, developmental, environmental, social, political, or chance factors are modeled jointly by means of a participant random effect.

A similar logic applies to linguistic materials. Psycholinguists construct materials for the tasks that they employ by a variety of means, but most importantly, most materials in a single experiment do not exhaust all possible syllables, words, or sentences that could be found in a given language, and most choices of language to investigate do not exhaust the possible languages that an experimenter could investigate. In fact, two core principles of the structure of language, the arbitrary (and hence statistical) association between sound and meaning and the unbounded combination of finite lexical items, guarantee that a great many language materials must be a sample, rather than an exhaustive list. The space of possible words, and the space of possible sentences, is simply too large to be modeled by any other means. Just as we model human participants as random variables, we have to model the factors characterizing their speech as random variables as well.

Journal of Memory and Language 59 (2008) 390-412
doi:10.1016/j.jml.2007.12.005
0749-596X/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
* Corresponding author. Fax: +1 780 4920806.
E-mail addresses: [email protected] (R.H. Baayen), [email protected] (D.J. Davidson), [email protected] (D.M. Bates).
Available online at www.sciencedirect.com, www.elsevier.com/locate/jml



Clark (1973) illuminated this issue, sparked by the work of Coleman (1964), by showing how language researchers might generalize their results to the larger population of linguistic materials from which they sample by testing for statistical significance of experimental contrasts with participants and items analyses. Clark's oft-cited paper presented a technical solution to this modeling problem, based on statistical theory and computational methods available at the time (e.g., Winer, 1971). This solution involved computing a quasi-F statistic which, in the simplest-to-use form, could be approximated by the use of a combined minimum-F statistic derived from separate participants (F1) and items (F2) analyses. In the 30+ years since, statistical techniques have expanded the space of possible solutions to this problem, but these techniques have not yet been applied widely in the field of language and memory studies. The present paper discusses an alternative known as a mixed-effects model approach, based on maximum likelihood methods that are now in common use in many areas of science, medicine, and engineering (see, e.g., Faraway, 2006; Fielding & Goldstein, 2006; Gilmour, Thompson, & Cullis, 1995; Goldstein, 1995; Pinheiro & Bates, 2000; Snijders & Bosker, 1999).

Software for mixed-effects models is now widely available, in specialized commercial packages such as MLwiN (MLwiN, 2007) and ASReml (Gilmour, Gogel, Cullis, Welham, & Thompson, 2002), in general commercial packages such as SAS and SPSS (the mixed procedures), and in the open source statistical programming environment R (Bates, 2007). West, Welch, and Galecki (2007) provide a guide to mixed models for five different software packages.

In this paper, we introduce a relatively recent development in computational statistics, namely, the possibility to include subjects and items as crossed, independent, random effects, as opposed to hierarchical or multilevel models in which random effects are assumed to be nested. This distinction is sometimes absent in general treatments of these models, which tend to focus on nested models. The recent textbook by West et al. (2007), for instance, does not discuss models with crossed random effects, although it clearly distinguishes between nested and crossed random effects, and advises the reader to make use of the lmer() function in R, the software (developed by the third author) that we introduce in the present study, for the analysis of crossed data.

Traditional approaches to random effects modeling suffer from multiple drawbacks which can be eliminated by adopting mixed-effects linear models. These drawbacks include (a) deficiencies in statistical power related to the problems posed by repeated observations, (b) the lack of a flexible method of dealing with missing data, (c) disparate methods for treating continuous and categorical responses, as well as (d) unprincipled methods of modeling heteroskedasticity and non-spherical error variance (for either participants or items). Methods for estimating linear mixed-effects models have addressed each of these concerns, and offer a better approach than univariate ANOVA or ordinary least squares regression.

In what follows, we first introduce the concepts and formalism of mixed-effects modeling.

Mixed-effects model concepts and formalism

The concepts involved in a linear mixed-effects model will be introduced by tracing the data analysis path of a simple example. Assume an example data set with three participants s1, s2 and s3 who each saw three items w1, w2, w3 in a priming lexical decision task under both short and long SOA conditions. The design, the RTs and their constituent fixed and random effects components are shown in Table 1.

This table is divided into three sections. The left-most section lists subjects, items, the two levels of the SOA factor, and the reaction times for each combination of subject, item and SOA. This section represents the data available to the analyst. The remaining sections of the table list the effects of SOA and the properties of the subjects and items that underlie the RTs. Of these remaining sections, the middle section lists the fixed effects: the intercept (which is the same for all observations) and the effect of SOA (a 19 ms processing advantage for the short SOA condition). The right section of the table lists the random effects in the model. The first column in this section lists by-item adjustments to the intercept, and the second column lists by-subject adjustments to the intercept. The third column lists by-subject adjustments to the effect of SOA. For instance, for the first subject the effect of a short SOA is attenuated by 11 ms. The final column lists the by-observation noise. Note that in this example we did not include by-item adjustments to SOA, even though we could have done so. In the terminology of mixed-effects modeling, this data set is characterized by random intercepts for both subject and item, and by by-subject random slopes (but no by-item random slopes) for SOA.

Formally, this dataset is summarized in (1):

    y_{ij} = X_{ij}\beta + S_i s_i + W_j w_j + \epsilon_{ij}    (1)

The vector y_{ij} represents the responses of subject i to item j. In the present example, each of the vectors y_{ij} comprises two response latencies, one for the short and one for the long SOA. In (1), X_{ij} is the design matrix, consisting of an initial column of ones followed by columns representing factor contrasts and covariates. For the present example, the design matrix for each subject-item combination has the simple form

Table 1
Example data set with random intercepts for subject and item, and random slopes for subject

Subj  Item  SOA    RT     Fixed           Random
                          Int     SOA     ItemInt  SubInt  SubSOA   Res
s1    w1    Long   466    522.2     0      -28.3   -26.2     0      -2.0
s1    w2    Long   520    522.2     0       14.2   -26.2     0       9.8
s1    w3    Long   502    522.2     0       14.1   -26.2     0      -8.2
s1    w1    Short  475    522.2   -19      -28.3   -26.2    11      15.4
s1    w2    Short  494    522.2   -19       14.2   -26.2    11      -8.4
s1    w3    Short  490    522.2   -19       14.1   -26.2    11     -11.9
s2    w1    Long   516    522.2     0      -28.3    29.7     0      -7.4
s2    w2    Long   566    522.2     0       14.2    29.7     0       0.1
s2    w3    Long   577    522.2     0       14.1    29.7     0      11.5
s2    w1    Short  491    522.2   -19      -28.3    29.7   -12.5    -1.5
s2    w2    Short  544    522.2   -19       14.2    29.7   -12.5     8.9
s2    w3    Short  526    522.2   -19       14.1    29.7   -12.5    -8.2
s3    w1    Long   484    522.2     0      -28.3    -3.5     0      -6.3
s3    w2    Long   529    522.2     0       14.2    -3.5     0      -3.5
s3    w3    Long   539    522.2     0       14.1    -3.5     0       6.0
s3    w1    Short  470    522.2   -19      -28.3    -3.5    1.5     -2.9
s3    w2    Short  511    522.2   -19       14.2    -3.5    1.5     -4.6
s3    w3    Short  528    522.2   -19       14.1    -3.5    1.5     13.2

The first four columns of this table constitute the data normally available to the researcher. The remaining columns parse the RTs into the contributions from the fixed and random effects. Int: intercept; SOA: contrast effect for SOA; ItemInt: by-item adjustments to the intercept; SubInt: by-subject adjustments to the intercept; SubSOA: by-subject adjustments to the slope of the SOA contrast; Res: residual noise.

    X_{ij} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}    (2)

and is the same for all subjects i and items j. The design matrix is multiplied by the vector of population coefficients \beta. Here, this vector takes the form

    \beta = \begin{pmatrix} 522.2 \\ -19.0 \end{pmatrix}    (3)

where 522.2 is the coefficient for the intercept, and -19 the contrast for the short as opposed to the long SOA. The result of this multiplication is a vector that again is identical for each combination of subject and item:

    X_{ij}\beta = \begin{pmatrix} 522.2 \\ 503.2 \end{pmatrix}    (4)

It provides the group means for the long and short SOA. These group means constitute the model's best guess about the expected latencies for the population, i.e., for unseen subjects and unseen items.
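The arithmetic of Eqs. (2)-(4) can be checked with a few lines of code. (Python is used here purely for illustration; the paper itself works in R.)

```python
# Multiply the design matrix X_ij of Eq. (2) by the coefficient
# vector beta of Eq. (3) to obtain the group means of Eq. (4).

def matvec(X, v):
    """Plain matrix-vector product for small lists of lists."""
    return [sum(x * b for x, b in zip(row, v)) for row in X]

X_ij = [[1, 0],   # long SOA: intercept only
        [1, 1]]   # short SOA: intercept + SOA contrast
beta = [522.2, -19.0]

# Group means for the long and short SOA conditions.
print([round(v, 1) for v in matvec(X_ij, beta)])  # [522.2, 503.2]
```

The second row illustrates why a treatment contrast is convenient here: the short-SOA mean is simply the intercept plus the contrast, 522.2 - 19 = 503.2.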

The next two terms in Eq. (1) serve to make the model's predictions more precise for the subjects and items actually examined in the experiment. First consider the random effects structure for Subject. The S_i matrix (in this example) is a full copy of the X_{ij} matrix. It is multiplied with a vector specifying for subject i the adjustments that are required for this subject to the intercept and to the SOA contrast coefficient. For the first subject in Table 1,

    S_1 s_1 = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} -26.2 \\ 11.0 \end{pmatrix} = \begin{pmatrix} -26.2 \\ -15.2 \end{pmatrix},    (5)

which tells us, first, that for this subject the intercept has to be adjusted downwards by 26.2 ms for both the long and the short SOA (the subject is a fast responder) and, second, that in the short SOA condition the effect of SOA for this subject is attenuated by 11.0 ms. Combined with the adjustment for the intercept that also applies to the short SOA condition, the net outcome for an arbitrary item in the short SOA condition is -15.2 ms for this subject.

Further precision is obtained by bringing the item random effect into the model. The W_j matrix is again a copy of the design matrix X_{ij}. In the present example, only the first column, the column for the intercept, is retained. This is because in this particular constructed data set the effect of SOA does not vary systematically with item. The vector w_j therefore contains one element only for each item j. This element specifies the adjustment made to the population intercept to calibrate the expected values for the specific processing costs associated with this individual item. For item 1 in our example, this adjustment is -28.3, indicating that compared to the population average, this particular item elicited shorter latencies, for both SOA conditions, across all subjects:

    W_1 w_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} (-28.3) = \begin{pmatrix} -28.3 \\ -28.3 \end{pmatrix}    (6)

The model specification (1) has as its last term the vector of residual errors \epsilon_{ij}, which in our running example has two elements for each combination of subject and item, one error for each SOA.

For subject 1, Eq. (1) formalizes the following vector of sums,

    y_1 = \begin{pmatrix} 522.2 + 0 \\ 522.2 + 0 \\ 522.2 + 0 \\ 522.2 - 19 \\ 522.2 - 19 \\ 522.2 - 19 \end{pmatrix} + \begin{pmatrix} -26.2 + 0 - 28.3 - 2.0 \\ -26.2 + 0 + 14.2 + 9.8 \\ -26.2 + 0 + 14.1 - 8.2 \\ -26.2 + 11 - 28.3 + 15.4 \\ -26.2 + 11 + 14.2 - 8.4 \\ -26.2 + 11 + 14.1 - 11.9 \end{pmatrix}    (7)

which we can rearrange in the form of a composite intercept, followed by a composite effect of SOA, followed by the residual error:

    y_1 = \begin{pmatrix} (522.2 - 26.2 - 28.3) + 0 - 2.0 \\ (522.2 - 26.2 + 14.2) + 0 + 9.8 \\ (522.2 - 26.2 + 14.1) + 0 - 8.2 \\ (522.2 - 26.2 - 28.3) + (-19 + 11) + 15.4 \\ (522.2 - 26.2 + 14.2) + (-19 + 11) - 8.4 \\ (522.2 - 26.2 + 14.1) + (-19 + 11) - 11.9 \end{pmatrix}    (8)

In this equation for y_1 the presence of by-subject random slopes for SOA and the absence of by-item random slopes for SOA is clearly visible.

The subject matrix S and the item matrix W can be combined into a single matrix, often written as Z, and the subject and item random effects s and w can likewise be combined into a single vector, generally referenced as b, leading to the general formulation

    y = X\beta + Zb + \epsilon.    (9)

To complete the model specification, we need to be precise about the random effects structure of our data. Recall that a random variable is defined as a normal variate with zero mean and unknown standard deviation. Sample estimates (derived straightforwardly from Table 1) for the standard deviations of the four random effects in our example are \hat\sigma_{s,int} = 28.11 for the by-subject adjustments to the intercepts, \hat\sigma_{s,soa} = 9.65 for the by-subject adjustments to the contrast coefficient for SOA, \hat\sigma_i = 24.50 for the by-item adjustments to the intercept, and \hat\sigma_\epsilon = 8.55 for the residual error. Because the random slopes and intercepts are pairwise tied to the same observational units, they may be correlated. For our data, \hat\rho(s,int; s,soa) = -0.71. These four random effects parameters complete the specification of the quantitative structure of our dataset. We can now present the full formal specification of the corresponding mixed-effects model,

    y = X\beta + Zb + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I), \quad b \sim N(0, \sigma^2 \Sigma), \quad b \perp \epsilon,    (10)

where \Sigma is the relative variance-covariance matrix for the random effects. The symbol \perp indicates independence of random variables and N denotes the multivariate normal (Gaussian) distribution. We say that matrix \Sigma is the relative variance-covariance matrix of the random effects in the sense that it is the variance of b relative to \sigma^2, the scalar variance of the per-observation noise term \epsilon. The variance-covariance specification of the model is an important tool to capture non-independence (asphericity) between observations.

Hypotheses about the structure of the variance-covariance matrix can be tested by means of likelihood ratio tests. Thus, we can formally test whether a random effect for items is required and whether the presence of the parameter \sigma_i in the model specification is actually justified. Similarly, we can inquire whether a parameter for the covariance of the by-subject slopes and intercepts contributes significantly to the model's goodness of fit. We note that in this approach it is an empirical question whether random effects for item or subject are required in the model.

When a mixed-effects model is fitted to a data set, its set of estimated parameters includes the coefficients for the fixed effects on the one hand, and the standard deviations and correlations for the random effects on the other hand. The individual values of the adjustments made to intercepts and slopes are calculated once the random-effects parameters have been estimated. Formally, these adjustments, referenced as Best Linear Unbiased Predictors (or BLUPs), are not parameters of the model.
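The additive decomposition in Table 1 and Eqs. (7)-(8) can be verified numerically. The short script below (Python, for illustration) checks subject s1's six rows and one of the sample standard deviations quoted in the text; the component values are taken from Table 1.

```python
import math

# Rows of Table 1 for subject s1 (long SOAs first, then short):
# (RT, intercept, SOA contrast, ItemInt, SubInt, SubSOA, Res)
rows_s1 = [
    (466, 522.2,   0, -28.3, -26.2,  0,  -2.0),
    (520, 522.2,   0,  14.2, -26.2,  0,   9.8),
    (502, 522.2,   0,  14.1, -26.2,  0,  -8.2),
    (475, 522.2, -19, -28.3, -26.2, 11,  15.4),
    (494, 522.2, -19,  14.2, -26.2, 11,  -8.4),
    (490, 522.2, -19,  14.1, -26.2, 11, -11.9),
]

# Each RT is the sum of its fixed and random components
# (up to rounding of the RTs to whole milliseconds).
for rt, *components in rows_s1:
    assert abs(rt - sum(components)) <= 0.5

# The by-subject intercept adjustments (-26.2, 29.7, -3.5) have a
# sample standard deviation of 28.11, as reported in the text.
sub_int = [-26.2, 29.7, -3.5]
sd = math.sqrt(sum(x ** 2 for x in sub_int) / (len(sub_int) - 1))
print(round(sd, 2))  # 28.11
```

The same kind of check reproduces the other sample estimates from the remaining columns of Table 1.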

Data analysis

We illustrate mixed-effects modeling with R, an open-source language and environment for statistical computing (R Development Core Team, 2007), freely available at http://cran.r-project.org. The lme4 package (Bates, 2005; Bates & Sarkar, 2007) offers fast and reliable algorithms for parameter estimation (see also West et al., 2007:14) as well as tools for evaluating the model (using Markov chain Monte Carlo sampling, as explained below).

Input data for R should have the structure of the first block in Table 1, together with an initial header line specifying column names. The data for the first subject should therefore be structured as follows, using what is known as the long data format:

  Subj Item SOA   RT
1 s1   w1   short 475
2 s1   w2   short 494
3 s1   w3   short 490
4 s1   w1   long  466
5 s1   w2   long  520
6 s1   w3   long  502

We load the data, here simply an ASCII text file, into R with

> priming = read.table("ourdata.txt", header = TRUE)

SPSS data files (if brought into the long format within SPSS) can be loaded with read.spss, and csv tables (in long format) are loaded with read.csv. We fit the model of Eq. (10) to the data with

> priming.lmer = lmer(RT ~ SOA + (1|Item) + (1 + SOA|Subj), data = priming)

The dependent variable, RT, appears to the left of the tilde operator (~), which is read as "depends on" or "is a function of". The main effect of SOA, our fixed effect, is specified to the right of the tilde. The random intercept for Item is specified with (1|Item), which is read as a random effect introducing adjustments to the intercept (denoted by 1) conditional on, or grouped by, Item. The random effects for Subject are specified as (1 + SOA|Subj). This notation indicates, first of all, that we introduce by-subject adjustments to the intercept (again denoted by 1) as well as by-subject adjustments to SOA. In other words, this model includes by-subject and by-item random intercepts, and by-subject random slopes for SOA. This notation also indicates that the variances for the two levels of SOA are allowed to be different. In other words, it models potential by-subject heteroskedasticity with respect to SOA. Finally, this specification includes a parameter estimating the correlation \hat\rho(s,int; s,soa) of the by-subject random effects for slope and intercept.

A summary of the model is obtained with

> summary(priming.lmer)
Linear mixed-effects model fit by REML
Formula: RT ~ SOA + (1|Item) + (1 + SOA|Subj)
   Data: priming
   AIC   BIC logLik MLdeviance REMLdeviance
 150.0 155.4 -69.02      151.4        138.0
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Item     (Intercept) 613.73   24.774
 Subj     (Intercept) 803.07   28.338
          SOAshort    136.46   11.682  -1.000
 Residual             102.28   10.113
number of obs: 18, groups: Item, 3; Subj, 3
Fixed effects:
            Estimate Std. Error t value
(Intercept)  522.111     21.992  23.741
SOAshort     -18.889      8.259  -2.287

The summary first mentions that the model is fitted using restricted maximum likelihood estimation (REML), a modification of maximum likelihood estimation that is more precise for mixed-effects modeling. Maximum likelihood estimation seeks to find those parameter values that, given the data and our choice of model, make the model's predicted values most similar to the observed values. Discussion of the technical details of model fitting is beyond the scope of this paper. However, in the Appendix we provide some indication of the kind of issues involved.

The summary proceeds with repeating the model specification, and then lists various measures of goodness of fit. The remainder of the summary contains two subtables, one for the random effects, and one for the fixed effects.

The subtable for the fixed effects shows that the estimates for the intercept and the contrast coefficient for SOA are right on target: 522.11 for the intercept (compare 522.2 in Table 1), and -18.89 (compare -19.0). For each coefficient, its standard error and t-statistic are listed.
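Because this toy design is fully balanced, the fixed-effect estimates coincide here with the condition means of the raw RTs in Table 1, which makes for a quick sanity check. (Python for illustration; the RTs are the first block of Table 1.)

```python
# RTs from Table 1, grouped by SOA condition (subjects s1, s2, s3).
long_rts  = [466, 520, 502, 516, 566, 577, 484, 529, 539]
short_rts = [475, 494, 490, 491, 544, 526, 470, 511, 528]

# Intercept: mean of the (reference) long condition.
intercept = sum(long_rts) / len(long_rts)
# SOAshort contrast: short-condition mean minus long-condition mean.
soa_contrast = sum(short_rts) / len(short_rts) - intercept

print(round(intercept, 3))     # 522.111
print(round(soa_contrast, 3))  # -18.889
```

With unbalanced data this equivalence no longer holds, which is one reason model-based estimation is preferable to working with cell means.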

Turning to the subtable of random effects, we observe that the first column lists the main grouping factors: Item, Subj, Residual. The second column specifies whether the random effect concerns the intercept or a slope. The third column reports the variances, and the fourth column the square roots of these variances, i.e., the corresponding standard deviations. The sample standard deviations calculated above on the basis of Table 1 compare well with the model estimates, as shown in Table 2.

Table 2
Comparison of sample estimates and model estimates for the data of Table 1

Parameter               Sample   Model
\hat\sigma_{s,int}       28.1    28.338
\hat\sigma_{s,soa}       9.65    11.681
\hat\sigma_i             24.5    24.774
\hat\sigma_\epsilon      8.55    10.113
\hat\rho(s,int; s,soa)  -0.71    -1.00

The high correlation of the intercept and slope for the subject random effects (-1.00) indicates that the model has been overparameterized. We first simplify the model by removing the correlation parameter and by assuming homoskedasticity for the subjects with respect to the SOA conditions, as follows:

> priming.lmer1 = lmer(RT ~ SOA + (1|Item) + (1|Subj) + (1|SOA:Subj), data = priming)
> print(priming.lmer1, corr = FALSE)
Random effects:
 Groups   Name        Variance Std.Dev.
 SOA:Subj (Intercept)  34.039   5.8343
 Subj     (Intercept) 489.487  22.1243
 Item     (Intercept) 625.623  25.0125
 Residual             119.715  10.9414
number of obs: 18, groups: SOA:Subj, 6; Subj, 3; Item, 3
Fixed effects:
            Estimate Std. Error t value
(Intercept)  522.111     19.909   26.23
SOAshort     -18.889      7.021   -2.69

(Here and in the examples to follow, we abbreviate the R output.) The variance for the by-subject adjustments to SOA is small relative to the observation noise (Residual), and potentially redundant, so we further simplify to a model with only random intercepts for subject:

> priming.lmer2 = lmer(RT ~ SOA + (1|Item) + (1|Subj), data = priming)

In order to verify that this most simple model is justified, we carry out a likelihood ratio test (see, e.g., Pinheiro & Bates, 2000, p. 83) that compares the most specific model priming.lmer2 (which sets \rho to the specific value of zero and assumes homoskedasticity) with the more general model priming.lmer (which does not restrict \rho a priori and explicitly models heteroskedasticity). The likelihood of the more general model (Lg) should be greater than the likelihood of the more specific model (Ls), and hence the likelihood ratio test statistic 2 log(Lg/Ls) > 0. If g is the number of parameters for the general model, and s the number of parameters for the restricted model, then the asymptotic distribution of the likelihood ratio test statistic, under the null hypothesis that the restricted model is sufficient, follows a chi-squared distribution with g - s degrees of freedom. In R, the likelihood ratio test is carried out with the anova function:

> anova(priming.lmer, priming.lmer2)
              Df     AIC     BIC  logLik  Chisq Chi Df Pr(>Chisq)
priming.lmer2  4 162.353 165.914 -77.176
priming.lmer   6 163.397 168.740 -75.699 2.9555      2     0.2282

The value listed under Chisq equals twice the difference between the log-likelihood (listed under logLik) for priming.lmer and that of priming.lmer2. The degrees of freedom for the chi-squared distribution, 2, is the difference between the numbers of parameters in the two models (listed under Df). It is clear that the removal of the parameter for the correlation together with the parameter for by-subject random slopes for SOA is justified (\chi^2_2 = 2.96, p = 0.228). The summary of the simplified model

> print(priming.lmer2, corr = FALSE)
Random effects:
 Groups   Name        Variance Std.Dev.
 Item     (Intercept) 607.72   24.652
 Subj     (Intercept) 499.22   22.343
 Residual             137.35   11.720
number of obs: 18, groups: Item, 3; Subj, 3
Fixed effects:
            Estimate Std. Error t value
(Intercept)  522.111     19.602  26.636
SOAshort     -18.889      5.525  -3.419

lists only random intercepts for subject and item, as desired.

The reader may have noted that summaries for model objects fitted with lmer list standard errors and t-statistics for the fixed effects, but no p-values. This is not without reason.
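The p-value of the likelihood ratio test above, by contrast, is well-defined, and can even be reproduced by hand: for 2 degrees of freedom the upper-tail probability of the chi-squared distribution has the closed form exp(-x/2). A quick check (Python, for illustration, using the logLik values from the anova output):

```python
import math

# Log-likelihoods from the anova() comparison of the two models.
loglik_general  = -75.699   # priming.lmer  (6 parameters)
loglik_specific = -77.176   # priming.lmer2 (4 parameters)

# Likelihood ratio statistic: 2 log(Lg/Ls) = 2 (log Lg - log Ls).
chisq = 2 * (loglik_general - loglik_specific)

# For df = 2 the chi-squared survival function is exp(-x/2).
p_value = math.exp(-chisq / 2)

print(round(chisq, 2))    # 2.95
print(round(p_value, 3))  # 0.228
```

For other degrees of freedom no such simple closed form exists, and one would use a chi-squared distribution function instead.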

With many statistical modeling techniques we can derive exact distributions for certain statistics calculated from the data and use these distributions to perform hypothesis tests on the parameters, or to create confidence intervals or confidence regions for the values of these parameters. The general class of linear models fit by ordinary least squares is the prime example of such a well-behaved class of statistical models for which we can derive exact results, subject to certain assumptions on the distribution of the responses (normal, constant variance and independent disturbance terms). This general paradigm provides many of the standard techniques of modern applied statistics, including t-tests and analysis of variance decompositions, as well as confidence intervals based on t-distributions. It is tempting to believe that all statistical techniques should provide such neatly packaged results, but they don't.

Inferences regarding the fixed-effects parameters are more complicated in a linear mixed-effects model than in a linear model with fixed effects only. In a model with only fixed effects we estimate these parameters and one other parameter, which is the variance of the noise that infects each observation and that we assume to be independent and identically distributed (i.i.d.) with a normal (Gaussian) distribution. The initial work by William Gossett (who wrote under the pseudonym of Student) on the effect of estimating the variance of the disturbances on the estimates of precision of the sample mean, leading to the t-distribution, and later generalizations by Sir Ronald Fisher, providing the analysis of variance, were turning points in 20th century statistics.

When mixed-effects models were first examined, in the days when the computing tools were considerably less sophisticated than at present, many approximations were used, based on analogy to fixed-effects analysis of variance. For example, variance components were often estimated by calculating certain mean squares and equating the observed mean square to the corresponding expected mean square. There is no underlying objective, such as the log-likelihood or the log-restricted-likelihood, that is being optimized by such estimates. They are simply assumed to be desirable because of the analogy to the results in the analysis of variance. Furthermore, the theoretical derivations and corresponding calculations become formidable in the presence of multiple factors, such as both subject and item, associated with random effects, or in the presence of unbalanced data.

Fortunately, it is now possible to evaluate the maximum likelihood or the REML estimates of the parameters in mixed-effects models reasonably easily and quickly, even for complex models fit to very large observational data sets. However, the temptation to perform hypothesis tests using t-distributions or F-distributions based on certain approximations of the degrees of freedom in these distributions persists. An exact calculation can be derived for certain models with a comparatively simple structure applied to exactly balanced data sets, such as occur in textbooks. In real-world studies the data often end up unbalanced, especially in observational studies, but even in designed experiments where missing data can and do occur, and the models can be quite complicated. The simple formulas for the degrees of freedom for inferences based on t or F-distributions do not apply in such cases. In fact, the pivotal quantities for such hypothesis tests do not even have t or F-distributions in such cases, so trying to determine the correct value of the degrees of freedom to apply is meaningless. There are many approximations in use for hypothesis tests in mixed models: the MIXED procedure in SAS offers 6 different calculations of degrees of freedom in certain tests, each leading to different p-values, but none of them is correct.

It is not even obvious how to count the number of parameters in a mixed-effects model. Suppose we have 1000 subjects, each exposed to 200 items chosen from a pool of 10,000 potential items. If we model the effect of subject and item as independent random effects we add two variance components to the model. At the estimated parameter values we can evaluate 1000 predictors of the random effects for subject and 10,000 predictors of the random effects for item. Did we only add two parameters to the model when we incorporated these 11,000 random effects? Or should we say that we added several thousand parameters that are adjusted to help explain the observed variation in the data? It is overly simplistic to say that thousands of random effects amount to only two parameters. However, because of the shrinkage effect in the evaluation of the random effects, each random effect does not represent an independent parameter.
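The shrinkage effect can be made concrete with a small sketch. In the simplest case of a single random intercept, the predicted random effect for a subject is the subject's observed mean deviation pulled toward zero by a factor that depends on the variance components and on how much data the subject contributed. The numbers below are illustrative, not taken from the paper:

```python
# Shrinkage of a random-effect prediction (BLUP) in a one-way
# random-effects model: the subject's observed mean deviation is
# pulled toward zero by a factor depending on the variance components.
# All numbers below are illustrative only.

def shrunken_effect(mean_dev, n_obs, var_subject, var_resid):
    """Predicted random effect for a subject with n_obs observations
    whose observed mean deviates from the grand mean by mean_dev."""
    shrink = var_subject / (var_subject + var_resid / n_obs)
    return shrink * mean_dev

# A subject observed on many items is shrunk very little...
many = shrunken_effect(mean_dev=40.0, n_obs=200, var_subject=900.0, var_resid=2500.0)
# ...while a subject observed on few items is shrunk substantially.
few = shrunken_effect(mean_dev=40.0, n_obs=2, var_subject=900.0, var_resid=2500.0)

print(round(many, 1), round(few, 1))
```

Because each predicted effect is a compromise between the subject's own data and the population distribution, the 11,000 random effects behave like far fewer than 11,000 free parameters, yet like more than two.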

Fortunately, we can avoid this issue of counting parameters or, more generally, the issue of approximating degrees of freedom. Recall that the original purpose of the t and F-distributions is to take into account the imprecision in the estimate of the variance of the random disturbances when formulating inferences regarding the fixed-effects parameters. We can approach this problem in the more general context with Markov chain Monte Carlo (MCMC) simulations. In MCMC simulations we sample from conditional distributions of parameter subsets in a cycle, thus allowing the variation in one parameter subset, such as the variance of the random disturbances or the variances and covariances of random effects, to be reflected in the variation of other parameter subsets, such as the fixed effects. This is what the t and F-distributions accomplish in the case of models with fixed effects only. Crucially, the MCMC technique applies to more general models and to data sets with arbitrary structure.

Informally, we can conceive of Markov chain Monte Carlo (MCMC) sampling from the posterior distribution of the parameters (see, e.g., Andrieu, de Freitas, Doucet, & Jordan, 2003, for a general introduction to MCMC) as a random walk in parameter space. Each mixed-effects model is associated with a parameter vector, which can be divided into three subsets,

1. the variance, σ², of the per-observation noise term,
2. the parameters that determine the variance-covariance matrix of the random effects, and
3. the random effects b̂ and the fixed effects β̂.

Conditional on the other two subsets and on the data, we can sample directly from the posterior distribution of the remaining subset. For the first subset we sample from a chi-squared distribution conditional on the current residuals. The prior for the variances and covariances of the random effects is chosen so that for the second subset we sample from a Wishart distribution. Finally, conditional on the first two subsets and on the data, the sampling for the third subset is from a multivariate normal distribution. The details are less important than the fact that these are well-accepted non-informative priors for these parameters. Starting from the REML estimates of the parameters in the model we cycle through these steps many times to generate a sample from the posterior distribution of the parameters. The mcmcsamp function produces such a sample, for which we plot the estimated densities on a log scale.

> mcmc = mcmcsamp(priming.lmer2, n = 50000)
> densityplot(mcmc, plot.points = FALSE)
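The cycle of conditional draws just described can be illustrated with a toy Gibbs sampler for the simplest possible case: a single random intercept per subject and a known grand mean of zero. This is a sketch of the idea only, written in Python under those simplifying assumptions; it is not the sampler implemented in lme4, which handles fixed effects and crossed random effects jointly:

```python
import numpy as np

# Simulated data: 8 subjects x 25 observations, one random intercept per
# subject, grand mean assumed known and equal to zero (a simplification).
rng = np.random.default_rng(1)
n_subj, n_per = 8, 25
true_sigma, true_sigma_b = 1.0, 2.0
b_true = rng.normal(0.0, true_sigma_b, n_subj)
y = b_true[:, None] + rng.normal(0.0, true_sigma, (n_subj, n_per))

def gibbs(y, n_iter=4000, seed=2):
    rng = np.random.default_rng(seed)
    n_subj, n_per = y.shape
    b = y.mean(axis=1).copy()              # start from the subject means
    draws = []
    for _ in range(n_iter):
        # Subset 1: residual variance | b, from a scaled inverse chi-squared.
        resid = y - b[:, None]
        sigma2 = (resid ** 2).sum() / rng.chisquare(y.size)
        # Subset 2: random-effects variance | b; the Wishart distribution
        # reduces to a scaled inverse chi-squared in this scalar case.
        sigma2_b = (b ** 2).sum() / rng.chisquare(n_subj)
        # Subset 3: random effects | variances, from independent normals;
        # the conditional mean shows the same shrinkage as the BLUPs.
        prec = n_per / sigma2 + 1.0 / sigma2_b
        b = rng.normal((y.sum(axis=1) / sigma2) / prec, np.sqrt(1.0 / prec))
        draws.append((sigma2, sigma2_b))
    return np.array(draws)

draws = gibbs(y)
post_sigma2 = draws[1000:, 0].mean()       # posterior mean, after burn-in
post_sigma2_b = draws[1000:, 1].mean()
```

With the fixed seeds above, the posterior means recover the order of magnitude of the generating variances, and the draws for the random-effects variance show the right-skewed distribution discussed below for Fig. 1.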

The resulting plot is shown in Fig. 1. We can see that the posterior density of the fixed-effects parameters is reasonably symmetric and close to a normal (Gaussian) distribution, which is generally the case for such parameters. After we have checked this we can evaluate p-values from the sample with an ancillary function defined in the languageR package, which takes a fitted model as input and generates by default 10,000 samples from the posterior distribution:

[Fig. 1. Empirical density estimates for the Markov chain Monte Carlo sample for the posterior distribution of the parameters in the model for the priming data with random intercepts only (priming.lmer2). From top to bottom: log σ²(subject intercept), log σ²(item intercept), log σ², β(SOA), β(intercept).]

> mcmc = pvals.fnc(priming.lmer2, nsim = 10000)
> mcmc$fixed
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)   522.11   521.80     435.45    616.641 0.0012   0.0000
SOAshort      -18.89   -18.81     -32.09     -6.533 0.0088   0.0035

We obtain p-values for only the first two parameters (the fixed effects). The first two columns show that the model estimates and the mean estimate across MCMC samples are highly similar, as expected. The next two columns show the upper and lower 95% highest posterior density intervals (see below). The final two columns show p-values based on the posterior distribution (pMCMC) and on the t-distribution respectively. The degrees of freedom used for the t-distribution by pvals.fnc() is an upper bound: the number of observations minus the number of fixed-effects parameters. As a consequence, p-values calculated with these degrees of freedom will be anti-conservative for small samples.¹

¹ For data sets characteristic of studies of memory and language, which typically comprise many hundreds or thousands of observations, the particular value of the number of degrees of freedom is not much of an issue. Whereas the difference between 12 and 15 degrees of freedom may have important consequences for the evaluation of significance associated with a t statistic obtained for a small data set, the difference between 612 and 615 degrees of freedom has no noticeable consequences. For such large numbers of degrees of freedom, the t distribution has converged, for all practical purposes, to the standard normal distribution. For large data sets, significance at the 5% level in a two-tailed test for the fixed-effects coefficients can therefore be gauged informally by checking the summary for whether the absolute value of the t-statistic exceeds 2.

The distributions of the log-transformed variance parameters are also reasonably symmetric, although some rightward skewing is visible in Fig. 1. Without the log transformation, this skewing would be much more pronounced: the untransformed distributions would not be approximated well by a normal distribution with mean equal to the estimate and standard deviation equal to a standard error. That the distribution of the variance parameters is not symmetric should not come as a surprise. The use of a χ² distribution for a variance estimate is taught in most introductory statistics courses. As Box and Tiao (1992) emphasize, the logarithm of the variance is a more natural scale on which to assume symmetry.

For each of the panels in Fig. 1 we calculate a Bayesian highest posterior density (HPD) confidence interval. For each parameter the HPD interval is constructed from the empirical cumulative distribution function of the sample as the shortest interval for which the difference in the empirical cumulative distribution function values of the endpoints is the nominal probability. In other words, the intervals are calculated to have 95% probability content. There are many such intervals. The HPD intervals are the shortest intervals with the given probability content. Because they are created from a sample, these intervals are not deterministic: taking another sample gives slightly different values. The HPD intervals for the fixed effects in the present example are listed in the output of pvals.fnc(), as illustrated above. The standard 95% confidence intervals for the fixed-effects parameters, calculated according to b̂ᵢ ± t(α/2) s(b̂ᵢ) with the upper bound for the degrees of freedom (18 − 2 = 16), are narrower:

> coefs[, 1] + qt(0.975, 16) * outer(coefs[, 2], c(-1, 1))
                 [,1]       [,2]
(Intercept) 479.90683 564.315397
SOAshort    -33.77293  -4.004845

For small data sets such as the example data considered here, they give rise to less conservative inferences that may be incorrect and should be avoided.

The HPD intervals for the random effects can be obtained from the mcmc object obtained with pvals.fnc() as follows:

> mcmc$random
          MCMCmean HPD95lower HPD95upper
sigma        12.76      7.947      21.55
Item.(In)    27.54      6.379     140.96
Subj.(In)    32.62      9.820     133.47

It is worth noting that the variances for the random-effects parameters may get close to zero but will never actually be zero. Generally it would not make sense to test a hypothesis of the form H0: σ² = 0 versus HA: σ² > 0 for these parameters. Neither inverting the HPD interval nor using the empirical cumulative distribution function from the MCMC sample evaluated at zero works, because the value 0 cannot occur in the MCMC sample. Using the estimate of the variance (or the standard deviation) and a standard error to create a z statistic is, in our opinion, nonsense, because we know that the distribution of the parameter estimates is not symmetric and does not converge to a normal distribution. We therefore recommend likelihood ratio tests for evaluating whether including a random-effects parameter is justified. As illustrated above, we fit a model with and without the variance component and compare the quality of the fits. The likelihood ratio is a reasonable test statistic for the comparison, but we note that the asymptotic reference distribution of a χ² does not apply because the parameter value being tested is on the boundary. Therefore, the p-value computed using the χ² reference distribution is conservative for variance parameters. For correlation parameters, which can be both positive or negative, this caveat does not apply.
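The HPD construction described above, the shortest window whose empirical probability content is 95%, can be stated directly on a sample. The sketch below (Python rather than the paper's R) also shows why such intervals matter for skewed posteriors like variances: the shortest window is narrower than the equal-tailed interval:

```python
import numpy as np

def hpd_interval(sample, prob=0.95):
    """Shortest interval containing prob of the sampled values: slide a
    window of fixed probability content along the sorted sample and keep
    the narrowest one."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    k = int(np.ceil(prob * n))            # number of points to cover
    widths = x[k - 1:] - x[: n - k + 1]   # width of every candidate window
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

# A right-skewed sample, standing in for the posterior of a variance.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.75, size=50_000)

lo, hi = hpd_interval(sample)
eq_lo, eq_hi = np.quantile(sample, [0.025, 0.975])
# The HPD interval hugs the bulk of the mass; the equal-tailed interval
# is wider because it must cut off exactly 2.5% in the long right tail.
```

Because the interval is computed from a finite sample, rerunning with a different seed shifts the endpoints slightly, exactly as noted in the text.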

Key advantages of mixed-effects modeling

An important new possibility offered by mixed-effects modeling is to bring effects that unfold during the course of an experiment into account, and to consider other potentially relevant covariates as well.

There are several kinds of longitudinal effects that one may wish to consider. First, there are effects of learning or fatigue. In chronometric experiments, for instance, some subjects start out with very short response latencies, but as the experiment progresses, they find that they cannot keep up their fast initial pace, and their latencies progressively increase. Other subjects start out cautiously, and progressively tune in to the task and respond more and more quickly. By means of counterbalancing, adverse effects of learning and fatigue can be neutralized, in the sense that the risk of confounding these effects with critical predictors is reduced. However, the effects themselves are not brought into the statistical model, and consequently experimental noise remains in the data, rendering more difficult the detection of significance for the predictors of interest when subsets of subjects are exposed to the same lists of items.

Second, in chronometric paradigms, the response to a target trial is heavily influenced by how the preceding trials were processed. In lexical decision, for instance, the reaction time to the preceding word in the experiment is one of the best predictors for the target latency, with effect sizes that may exceed that of the word frequency effect. Often, this predictivity extends from the immediately preceding trial to several additional preceding trials. This major source of experimental noise should be brought under statistical control, at the risk of failing to detect otherwise significant effects.

Third, qualitative properties of preceding trials should be brought under statistical control as well. Here, one can think of whether the response to the preceding trial in a lexical decision task was correct or incorrect, whether the preceding item was a word or a nonword, a noun or a verb, and so on.

Fourth, in tasks using long-distance priming, longitudinal effects are manipulated on purpose. Yet the statistical methodology of the past decades allowed priming effects to be evaluated only after averaging over subjects or items. However, the details of how a specific prime was processed by a specific subject may be revealing about how that subject processes the associated target presented later in the experiment.

Because mixed-effects models do not require prior averaging, they offer the possibility of bringing all these kinds of longitudinal effects straightforwardly into the statistical model. In what follows, we illustrate this advantage for a long-distance priming experiment reported in de Vaan, Schreuder, and Baayen (2007). Their lexical decision experiment used long-term priming (with 39 trials intervening between prime and target) to probe budding frequency effects for morphologically complex neologisms. Neologisms were preceded by two kinds of prime, the neologism itself (identity priming) or its base word (base priming). The data are available in the languageR package in the CRAN archives (http://cran.r-project.org, see Baayen, 2008, for further documentation on this package) under the name primingHeidPrevRT. After attaching this data set we fit an initial model with Subject and Word as random effects and priming Condition as fixed-effect factor.

> attach(primingHeidPrevRT)
> print(lmer(RT ~ Condition + (1|Word) + (1|Subject)), corr = FALSE)
Random effects:
 Groups   Name        Variance  Std.Dev.
 Word     (Intercept) 0.0034119 0.058412
 Subject  (Intercept) 0.0408438 0.202098
 Residual             0.0440838 0.209962
number of obs: 832, groups: Word, 40; Subject, 26

Fixed effects:
              Estimate Std. Error t value
(Intercept)    6.60297    0.04215  156.66
Conditionheid  0.03127    0.01467    2.13

The positive contrast coefficient for Condition and t > 2 in the summary suggests that long-distance identity priming would lead to significantly longer response latencies compared to base priming.
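With 832 observations and two fixed-effects parameters, the upper bound on the degrees of freedom is 830, large enough that the t distribution is effectively standard normal. The two-tailed p-value for the Condition contrast can therefore be gauged with a normal approximation (a Python sketch of this back-of-the-envelope check):

```python
import math

def two_tailed_p_normal(estimate, se):
    """Two-tailed p-value from the standard-normal approximation that the
    t distribution approaches for large degrees of freedom."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2.0))

# Condition contrast from the initial priming model: 0.03127 (SE 0.01467).
p = two_tailed_p_normal(0.03127, 0.01467)   # |t| = 2.13, p roughly 0.033
```

This is exactly the informal |t| > 2 criterion: a value just past 2 corresponds to a p-value just under 0.05.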

However, this counterintuitive inhibitory priming effect is no longer significant when the decision latency at the preceding trial (RTmin1) is brought into the model,

> print(lmer(RT ~ log(RTmin1) + Condition + (1|Word) + (1|Subject)), corr = FALSE)
Random effects:
 Groups   Name        Variance  Std.Dev.
 Word     (Intercept) 0.0034623 0.058841
 Subject  (Intercept) 0.0334773 0.182968
 Residual             0.0436753 0.208986
number of obs: 832, groups: Word, 40; Subject, 26

Fixed effects:
              Estimate Std. Error t value
(Intercept)    5.80465    0.22298  26.032
log(RTmin1)    0.12125    0.03337   3.633
Conditionheid  0.02785    0.01463   1.903

The contrast coefficient for Condition changes sign when accuracy and response latency to the prime itself, 40 trials back in the experiment, are taken into account.

> print(lmer(RT ~ log(RTmin1) + ResponseToPrime * RTtoPrime + Condition + (1|Word) + (1|Subject)), corr = FALSE)
Random effects:
 Groups   Name        Variance  Std.Dev.
 Word     (Intercept) 0.0013963 0.037367
 Subject  (Intercept) 0.0235948 0.153606
 Residual             0.0422885 0.205642
number of obs: 832, groups: Word, 40; Subject, 26

Fixed effects:
                                    Estimate Std. Error t value
(Intercept)                          4.32436    0.31520  13.720
log(RTmin1)                          0.11834    0.03251   3.640
ResponseToPrimeincorrect             1.45482    0.40525   3.590
RTtoPrime                            0.22764    0.03594   6.334
Conditionheid                       -0.02657    0.01618  -1.642
ResponseToPrimeincorrect:RTtoPrime  -0.20250    0.06056  -3.344

The table of coefficients reveals that if the prime had elicited a nonword response and the target a word response, response latencies to the target were slowed by some 100 ms, compared to when the prime elicited a word response. For such trials, the response latency to the prime was not predictive for the target. By contrast, the reaction times to primes that were accepted as words were significantly correlated with the reaction time to the corresponding targets.

After addition of log Base Frequency as covariate and trimming of atypical outliers,

> priming.lmer = lmer(RT ~ log(RTmin1) + ResponseToPrime * RTtoPrime + BaseFrequency + Condition + (1|Word) + (1|Subject))
> print(update(priming.lmer, subset = abs(scale(resid(priming.lmer))) < 2.5), cor = FALSE)
Random effects:
 Groups   Name        Variance   Std.Dev.
 Word     (Intercept) 0.00049959 0.022351
 Subject  (Intercept) 0.02400262 0.154928
 Residual             0.03340644 0.182774
number of obs: 815, groups: Word, 40; Subject, 26

Fixed effects:
                                    Estimate Std. Error t value
(Intercept)                         4.388722   0.287621  15.259
log(RTmin1)                         0.103738   0.029344   3.535
ResponseToPrimeincorrect            1.560777   0.358609   4.352
RTtoPrime                           0.236411   0.032183   7.346
BaseFrequency                      -0.009157   0.003590  -2.551
Conditionheid                      -0.038306   0.014497  -2.642
ResponseToPrimeincorrect:RTtoPrime -0.216665   0.053628  -4.040

we observe significant facilitation from long-distance identity priming. The latency to the preceding trial has a large effect size: with a 400 ms difference between the smallest and largest predictor values, the corresponding difference for the frequency effect was only 50 ms. For a follow-up experiment using self-paced reading of continuous text, latencies were likewise codetermined by the reading latencies to the words preceding in the discourse, as well as by the reading latency for the prime. Traditional averaging procedures applied to these data would either report a null effect (for self-paced reading) or would lead to a completely wrong interpretation of the data (lexical decision). Mixed-effects modeling allows us to avoid these pitfalls, and makes it possible to obtain substantially improved insight into the structure of one's experimental data.

Some common designs

Having illustrated the important analytical advantages offered by mixed-effects modeling with crossed random effects for subjects and items, we now turn to consider how mixed-effects modeling compares to traditional analysis of variance and random regression. Raaijmakers, Schrijnemakers, and Gremmen (1999) discuss two common factorial experimental designs and their analyses. In this section, we first report simulation studies using their designs, and compare the performance of current standards with the performance of mixed-effects models. Simulations were run in R (version 2.4.0) (R Development Core Team, 2007) using the lme4 package of Bates and Sarkar (2007) (see also Bates, 2005). The code for the simulations is available in the languageR package in the CRAN archives (http://cran.r-project.org, see Baayen, 2008). We then illustrate the robustness of mixed-effects modeling to missing data for a split-plot design, and then pit mixed-effects regression against random regression, as proposed by Lorch and Myers (1990).

A design traditionally requiring quasi-F ratios

A constructed dataset discussed by Raaijmakers et al. (1999) comprises 64 observations with 8 subjects and 8 items. Items are nested under treatment: 4 items are presented with a short SOA, and 4 with a long SOA. Subjects are crossed with item. A quasi-F test, the test recommended by Raaijmakers et al. (1999), based on the mean squares in the mean squares decomposition shown in Table 3, shows that the effect of SOA is not significant (F(1.025, 9.346) = 1.702, p = 0.224). It is noteworthy that the model fits 64 data points with the help of 72 parameters, 6 of which are inestimable.

Table 3
Mean squares decomposition for the data exemplifying the use of quasi-F ratios in Raaijmakers et al. (1999)

              Df   Sum Sq  Mean Sq
SOA            1   8032.6   8032.6
Item           6  22174.5   3695.7
Subject        7  26251.6   3750.2
SOA*Subject    7   7586.7   1083.8
Item*Subject  42   4208.8    100.2
Residuals      0      0.0

The present data set is available in the languageR package as quasif. We fit a mixed-effects model to the data with

> quasif = lmer(RT ~ SOA + (1|Item) + (1 + SOA|Subject), data = quasif)

and inspect the estimated parameters with

> summary(quasif)
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Item     (Intercept)  448.29  21.173
 Subject  (Intercept)  861.99  29.360
          SOAshort     502.65  22.420   0.813
 Residual              100.31  10.016
number of obs: 64, groups: Item, 8; Subject, 8

Fixed effects:
            Estimate Std. Error t value
(Intercept)   540.91      14.93   36.23
SOAshort       22.41      17.12    1.31

The model summary lists four random effects: random intercepts for participants and for items, by-participant random slopes for SOA, and the residual error. Each random effect is paired with an estimate of the standard deviation that characterizes the spread of the random effects for the slopes and intercepts. Because the by-participant BLUPs for slopes and intercepts are paired observations, the model specification that we used here allows for these two random variables to be correlated. The estimate of this correlation (r = 0.813) is the final parameter of the present mixed-effects model.

The small t-value for the contrast coefficient for SOA shows that this predictor is not significant. This is clear as well from the summary of the fixed effects produced by pvals.fnc (available in the languageR package), which lists the estimates, their MCMC means, the corresponding HPD intervals, the two-tailed MCMC probability, and the two-tailed probability derived from the t-test using, as mentioned above, the upper bound for the degrees of freedom.

> pvals.fnc(quasif, nsim = 10000)

            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)   540.91   540.85     498.58     583.50 0.0001   0.0000
SOAshort       22.41    22.38     -32.88      76.29 0.3638   0.1956

The p-value for the t-test obtained with the mixed-effects model is slightly smaller than that produced by the quasi-F test. However, for the present small data set the MCMC p-value is to be used, as the p-value with the above-mentioned upper bound for the degrees of freedom is anticonservative. To see this, consider Table 4, which summarizes Type I error rate and power across simulated data sets, 1000 with and 1000 without an effect of SOA. The number of simulation runs is kept small on purpose: these simulations are provided to illustrate only main trends in power and error rate.

For each simulated data set, five analyses were conducted: a mixed-effects analysis with the anticonservative p-value based on the t-test and the appropriate p-value based on 10,000 MCMC samples generated from the posterior distribution of the parameters of the fitted mixed-effects model, a quasi-F test, a by-participant analysis, a by-item analysis, and an analysis that accepted the effect of SOA to be significant only if both the F1 and the F2 test were significant (F1 + F2, compare Forster & Dickinson, 1976). This anticonservatism of the t-test is clearly visible in Table 4.

The only procedures with nominal Type I error rates are the quasi-F test and the mixed-effects model with MCMC sampling. For data sets with few observations, the quasi-F test emerges as a good choice with somewhat greater power.
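The quasi-F value reported above can be reproduced from the mean squares in Table 3: the statistic combines the mean squares as F' = (MS_SOA + MS_Item×Subject) / (MS_Item + MS_SOA×Subject), with Satterthwaite-type effective degrees of freedom for numerator and denominator. A quick check (Python; computing the p-value of 0.224 would additionally require the F distribution function):

```python
# Degrees of freedom and mean squares from Table 3.
ms = {"SOA": (1, 8032.6), "Item": (6, 3695.7),
      "SOAxSubject": (7, 1083.8), "ItemxSubject": (42, 100.2)}

def satterthwaite_df(pairs):
    """Effective degrees of freedom for a sum of mean squares."""
    total = sum(m for _, m in pairs)
    return total ** 2 / sum(m ** 2 / df for df, m in pairs)

num = [ms["SOA"], ms["ItemxSubject"]]       # numerator: MS_SOA + MS_IxS
den = [ms["Item"], ms["SOAxSubject"]]       # denominator: MS_I + MS_SxS

f_quasi = sum(m for _, m in num) / sum(m for _, m in den)
df1, df2 = satterthwaite_df(num), satterthwaite_df(den)
print(round(f_quasi, 3), round(df1, 3), round(df2, 3))   # 1.702 1.025 9.346
```

The fractional degrees of freedom (1.025 and 9.346) are exactly the values that make the quasi-F test awkward to teach and to generalize, which is part of the motivation for the mixed-effects alternative.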

Table 4
Proportions (for 1000 simulation runs) of significant treatment effects for mixed-effects models (lmer), quasi-F tests, by-participant and by-item analyses, and the combined F1 and F2 test, for simulated models with and without a treatment effect for a data set with 8 subjects and 8 items

                          lmer: p(t)  lmer: p(MCMC)  quasi-F  By-subject  By-item  F1+F2
Without treatment effect
  a = 0.05                     0.088          0.032    0.055       0.310    0.081  0.079
  a = 0.01                     0.031          0.000    0.005       0.158    0.014  0.009
With treatment effect
  a = 0.05                                    0.16     0.23
  a = 0.01                                    0.04     0.09

Markov chain Monte Carlo estimates of significance are denoted by MCMC. Power is tabulated only for models with nominal Type 1 error rates. Too high Type 1 error rates are shown in bold.

Table 5
Proportions (for 1000 simulation runs) of significant treatment effects for mixed-effects models (lmer), quasi-F tests, by-participant and by-item analyses, and the combined F1 and F2 test, for simulated models with and without a treatment effect for 20 subjects and 40 items

                          lmer: p(t)  lmer: p(MCMC)  quasi-F  Subject  Item  F1+F2
Without treatment effect
  a = 0.05                     0.055          0.027    0.052     0.238 0.102  0.099
  a = 0.01                     0.013          0.001    0.009     0.120 0.036  0.036
With treatment effect
  a = 0.05                     0.823          0.681    0.809
  a = 0.01                     0.618          0.392    0.587

Power is tabulated only for models with nominal Type 1 error rates. Too high Type 1 error rates are shown in bold.

Most psycholinguistic experiments yield much larger numbers of data points than in the present example. Table 5 summarizes a second series of simulations in which we increased the number of subjects to 20 and the number of items to 40. As expected, the Type I error rates for the mixed-effects models evaluated with tests based on p-values using the t-test are now in accordance with the nominal levels, and power is perhaps slightly larger than the power of the quasi-F test. Evaluation using MCMC sampling is conservative for this specific fully balanced example. Depending on the costs of a Type I error, the greater power of the t-test may offset its slight anti-conservatism. In our experience, the difference between the two p-values becomes very small for data sets with thousands instead of hundreds of observations. In analyses where MCMC-based evaluation and t-based evaluation yield a very similar verdict across coefficients, exceptional disagreement, with MCMC sampling suggesting clear non-significance and the t-test suggesting significance, is a diagnostic of an unstable and suspect parameter. This is often confirmed by inspection of the parameter's posterior density.

It should be kept in mind that real-life experiments are characterized by missing data. Whereas the quasi-F test is known to be vulnerable to missing data, mixed-effects models are robust in this respect. For instance, in 1000 simulation runs (without an effect of SOA) in which 20% of the datapoints are randomly deleted before the analyses are performed, the quasi-F test emerges as slightly conservative (Type 1 error rate: 0.045 for a = 0.05, 0.006 for a = 0.01), whereas the mixed-effects model using the t-test is on target (Type 1 error rate: 0.052 for a = 0.05, 0.010 for a = 0.01). Power is slightly greater for the mixed analysis evaluating probability using the t-test (a = 0.05: 0.84 versus 0.81 for the quasi-F test; a = 0.01: 0.57 versus 0.54). See also, e.g., Pinheiro and Bates (2000).

    A Latin Square design

Another design discussed by Raaijmakers and colleagues is the Latin Square. They discuss a second constructed data set, with 12 words divided over 3 lists with 4 words each. These lists were rotated over participants, such that a given participant was exposed to a list for only one of three SOA conditions. There were 3 groups of 4 participants; each group of participants was exposed to unique combinations of list and SOA. Raaijmakers and colleagues recommend a by-participant analysis that proceeds on the basis of means obtained by averaging over the words in the lists. An analysis of variance is performed on the resulting data set which lists, for each participant, three means, one for each SOA condition. This gives rise to the ANOVA decomposition shown in Table 6. The F test compares the mean squares for SOA with the mean squares of the interaction of SOA by List, and indicates that the effect of SOA is not statistically significant (F(2,2) = 1.15, p = 0.465). As the interaction of SOA by List is not significant, Raaijmakers et al. (1999) pool the interaction with the residual error. This results in a pooled error term with 20 degrees of freedom, an F-value of 0.896, and a slightly reduced p-value of 0.42.

A mixed-effects analysis of the same data set (available as latinsquare in the languageR package) obviates the need for prior averaging. We fit a sequence of models, decreasing the complexity of the random-effects structure step by step.

Table 6
Mean squares decomposition for the data with a Latin Square design in Raaijmakers et al. (1999)

               Df  Sum Sq  Mean Sq
Group           2    1696      848
SOA             2      46       23
List            2    3116     1558
Group*Subject   9   47305     5256
SOA*List        2      40       20
Residuals      18     527       29

> latinsquare.lmer1 = lmer2(RT ~ SOA + (1|Word) + (1|Subject) + (1|Group) + (1 + SOA|List), data = latinsquare)
> latinsquare.lmer2 = lmer2(RT ~ SOA + (1|Word) + (1|Subject) + (1|Group) + (1|List), data = latinsquare)
> latinsquare.lmer3 = lmer2(RT ~ SOA + (1|Word) + (1|Subject) + (1|Group), data = latinsquare)
> latinsquare.lmer4 = lmer2(RT ~ SOA + (1|Word) + (1|Subject), data = latinsquare)
> latinsquare.lmer5 = lmer2(RT ~ SOA + (1|Subject), data = latinsquare)
> anova(latinsquare.lmer1, latinsquare.lmer2, latinsquare.lmer3, latinsquare.lmer4, latinsquare.lmer5)

                    Df     AIC     BIC  logLik     Chisq Chi Df Pr(>Chisq)
latinsquare.lmer5.p  4 1423.41 1435.29 -707.71
latinsquare.lmer4.p  5 1186.82 1201.67 -588.41    238.59      1  < 2.2e-16
latinsquare.lmer3.p  6 1188.82 1206.64 -588.41     0.001      1      0.975
latinsquare.lmer2.p  7 1190.82 1211.61 -588.41 1.379e-06      1      0.999
latinsquare.lmer1.p 12 1201.11 1236.75 -588.55      0.00      5      1.000

The likelihood ratio tests show that the model with Subject and Word as random effects has the right level of complexity for this data set.

> summary(latinsquare.lmer4)
Random effects:
 Groups   Name        Variance Std.Dev.
 Word     (Intercept)  754.542 27.4689
 Subject  (Intercept) 1476.820 38.4294
 Residual               96.566  9.8268
number of obs: 144, groups: Word, 12; Subject, 12

Fixed effects:
            Estimate Std. Error t value
(Intercept) 533.9583    13.7098   38.95
SOAmedium     2.1250     2.0059    1.06
SOAshort      0.4583     2.0059    0.23

The summary of this model lists the three random effects and the corresponding parameters: the variances (and standard deviations) for the random intercepts for subjects and items, and for the residual error. The fixed-effects part of the model provides estimates for the intercept and for the contrasts for medium and short SOA compared to the reference level, long SOA.

> pvals.fnc(latinsquare.lmer4, nsim = 10000)

             Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)  533.9583 534.0570    503.249    561.828 0.0001   0.0000
SOAmedium      2.1250   2.1258     -1.925      5.956 0.2934   0.2912
SOAshort       0.4583   0.4086     -4.331      3.589 0.8446   0.8196

Inspection of the corresponding p-values shows that the p-value based on the t-test and that based on MCMC sampling are very similar, and the same holds for the p-value produced by the F-test for the factor SOA (F(2,141) = 0.944, p = 0.386) and the corresponding p-value calculated from the MCMC samples (p = 0.391).
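The largest chi-squared value in this kind of model comparison can be recovered from the printed information criteria alone, since AIC = 2·Df − 2·logLik and the likelihood ratio statistic is twice the difference in log-likelihoods. A Python sketch, using the AIC values 1423.41 (Df 4, Subject intercepts only) and 1186.82 (Df 5, Subject and Word intercepts) and the chi-squared value 238.59 as printed by anova() above:

```python
def loglik_from_aic(aic, df):
    """AIC = 2*df - 2*logLik, hence logLik = (2*df - aic) / 2."""
    return (2 * df - aic) / 2

# Dropping versus keeping the random intercepts for Word.
ll_without_word = loglik_from_aic(1423.41, 4)
ll_with_word = loglik_from_aic(1186.82, 5)
chisq = 2 * (ll_with_word - ll_without_word)
print(round(chisq, 2))   # 238.59, on 1 degree of freedom
```

The check makes the logic of the sequence explicit: each step asks whether the extra variance component buys enough log-likelihood to justify its degree(s) of freedom.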

Table 7
Proportions (out of 1000 simulation runs) of significant F-tests for a Latin Square design with mixed-effects models (lmer) and a by-subject analysis (F1)

                       Without SOA                     With SOA
           lmer: p(F)  lmer: p(MCMC)  F1   lmer: p(F)  lmer: p(MCMC)  F1
α = 0.05  With     0.055  0.053  0.052        0.262  0.257  0.092
α = 0.01  With     0.011  0.011  0.010        0.082  0.080  0.020
α = 0.05  Without  0.038  0.036  0.043        0.249  0.239  0.215
α = 0.01  Without  0.010  0.009  0.006        0.094  0.091  0.065

The upper part reports simulations in which the F1 analysis includes the interaction of List by SOA (With); the lower part reports simulations in which for the F1 analysis this interaction is absent (Without).

Table 8
Type I error rate and power for mixed-effects modeling of 1000 simulated data sets with a split-plot design, for the full data set and a data set with 20% missing data

           Type I error rate                 Power
           Full           Missing           Full           Missing
α = 0.05   0.046 (0.046)  0.035 (0.031)     0.999 (0.999)  0.995 (0.993)
α = 0.01   0.013 (0.011)  0.009 (0.007)     0.993 (0.993)  0.985 (0.982)

Evaluations based on Markov chain Monte Carlo sampling are given in parentheses.

The mixed-effects analysis has slightly superior power compared to the F1 analysis proposed by Raaijmakers et al. (1999), as illustrated in Table 7, which lists Type I error rate and power for 1000 simulation runs without and with an effect of SOA. Simulated datasets were constructed using the parameters given by latinsquare.lmer4. The upper half of Table 7 shows power and Type I error rate for the situation in which the F1 analysis includes the interaction of SOA by List; the lower half reports the case in which this interaction is pooled with the residual error. Even for the most powerful test suggested by Raaijmakers et al. (1999), the mixed-effects analysis emerges with slightly better power, while maintaining the nominal Type I error rate.

Further pooling of non-explanatory parameters in the F1 approach may be expected to lead to further convergence of power. The key point that we emphasize here is that the mixed-effects approach obtains this power without prior averaging. As a consequence, it is only the mixed-effects approach that affords the possibility of bringing predictors for longitudinal effects and inter-trial dependencies into the model. Likewise, the possibility of bringing covariates gauging properties of the individual words into the model is restricted to the mixed-effects analysis.

A split-plot design

Another design often encountered in psycholinguistic studies is the split-plot design. Priming studies often make use of a counterbalancing procedure with two sets of materials. Words are primed by a related prime in List A and by an unrelated prime in List B, and vice versa. Different subjects are tested on each list. This is a split-plot design, in the sense that the factor List is between subjects and the factor Priming within subjects. The following example presents an analysis of an artificial dataset (dat, available as splitplot in the languageR package) with 20 subjects and 40 items. A series of likelihood ratio tests on a sequence of models with decreasingly complex random-effects structures shows that a model with random intercepts for subject and item suffices.

> dat.lmer1 = lmer(RT ~ list * priming + (1 + priming | subjects) + (1 + list | items), data = dat)
> dat.lmer2 = lmer(RT ~ list * priming + (1 + priming | subjects) + (1 | items), data = dat)
> dat.lmer3 = lmer(RT ~ list * priming + (1 | subjects) + (1 | items), data = dat)
> dat.lmer4 = lmer(RT ~ list * priming + (1 | subjects), data = dat)
> anova(dat.lmer1, dat.lmer2, dat.lmer3, dat.lmer4)
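For readers without the languageR package at hand, a dataset with the same structure can be simulated directly. The sketch below (in Python rather than R) uses the generating parameters reported in the text (σ_item = 20, σ_subject = 50, σ_residual = 80, and fixed effects 400, 30, 18.5 and 0); the particular assignment of item halves to priming conditions within each list is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_item = 20, 40

subj_int = rng.normal(0, 50, n_subj)      # by-subject random intercepts
item_int = rng.normal(0, 20, n_item)      # by-item random intercepts
list_b = np.repeat([0, 1], n_subj // 2)   # List is between subjects

# counterbalancing: which items are unprimed flips between List A and List B
first_half = np.arange(n_item) < n_item // 2
unprimed = (first_half[None, :] == (list_b == 1)[:, None]).astype(int)

rt = (400                                     # grand mean
      + 18.5 * list_b[:, None]                # main effect of list
      + 30.0 * unprimed                       # unprimed responses ~30 ms slower
      + subj_int[:, None] + item_int[None, :]
      + rng.normal(0, 80, (n_subj, n_item)))  # residual noise

print(rt.shape)  # (20, 40): 800 observations, as in the splitplot data
```

Reshaping rt to long format, with subject, item, list, and priming columns, yields a data frame suitable for the model comparison shown above.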

                 Df    AIC    BIC  logLik   Chisq Chi Df Pr(>Chisq)
dat.lmer4.p   5 9429.0 9452.4 -4709.5
dat.lmer3.p   6 9415.0 9443.1 -4701.5 15.9912      1  6.364e-05
dat.lmer2.p   8 9418.8 9456.3 -4701.4  0.1190      2     0.9423
dat.lmer1.p  10 9419.5 9466.3 -4699.7  3.3912      2     0.1835

> print(dat.lmer3, corr = FALSE)
Random effects:
 Groups   Name        Variance Std.Dev.
 items    (Intercept)  447.15   21.146
 subjects (Intercept) 2123.82   46.085
 Residual             6729.24   82.032
Number of obs: 800, groups: items, 40; subjects, 20

Fixed effects:
                          Estimate Std. Error t value
(Intercept)                362.658     16.382  22.137
listlistB                   18.243     23.168   0.787
primingunprimed             31.975     10.583   3.021
listlistB:primingunprimed    6.318     17.704   0.357

The estimates are close to the parameters that generated the simulated data: σ_item = 20, σ_subject = 50, σ = 80, β_int = 400, β_priming = 30, β_list = 18.5, β_list:priming = 0.

Table 8 lists power and Type I error rate with respect to the priming effect for 1000 simulation runs with a mixed-effects model, run once with the full data set, and once with 20% of the data points randomly deleted, using the same parameters that generated the above data set. It is clear that with the low level of by-observation noise, the presence of a priming effect is almost always detected. Power decreases only slightly for the case with missing data. Even though power is at ceiling, the Type I error rate is in accordance with the nominal levels. Note the similarity between evaluation of significance based on the (anticonservative) t-test and evaluation based on Markov chain Monte Carlo sampling. This example illustrates the robustness of mixed-effects models with respect to missing data: the present results were obtained without any data pruning and without any form of imputation.
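The likelihood ratio statistics and p-values in the anova() output above can be verified by hand: each Chisq is twice the difference in log-likelihood between adjacent models, and for one and two degrees of freedom the χ² survival function has a simple closed form. A quick consistency check (our own arithmetic):

```python
import math

def chisq_sf(x, df):
    """Chi-squared survival function; closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only df = 1 or 2 are handled in this sketch")

# Chisq = 2 * (logLik_larger - logLik_smaller); compare with Pr(>Chisq)
print(chisq_sf(15.9912, 1))  # ~6.364e-05
print(chisq_sf(0.1190, 2))   # ~0.9422 (table: 0.9423, from unrounded values)
print(chisq_sf(3.3912, 2))   # ~0.1835
```

The large drop from dat.lmer4 to dat.lmer3 confirms that the item random intercepts are needed, while the two non-significant comparisons show that random slopes for priming and list add nothing.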

    A multiple regression design

Multiple regression designs with subjects and items, and with predictors that are tied to the items (e.g., frequency and length for items that are words), have traditionally been analyzed in two ways. One approach aggregates over subjects to obtain item means, and then proceeds with standard ordinary least squares regression. We refer to this as by-item regression. Another approach, advocated by Lorch and Myers (1990), is to fit separate regression models to the data sets elicited from the individual participants. The significance of a given predictor is assessed by means of a one-sample t-test applied to the coefficients of this predictor in the individual regression models. We refer to this procedure as by-participant regression. It is also known under the name of random regression. (From our perspective, these procedures combine precise and imprecise information on an equal footing.) Some studies report both by-item and by-participant regression models (e.g., Alegre & Gordon, 1999).

The by-participant regression is widely regarded as superior to the by-item regression. However, the by-participant regression does not take item variability into account. To see this, compare an experiment in which each participant responds to the same set of words to an experiment in which each participant responds to a different set of words. When the same lexical predictors are used in both experiments, the by-participant analysis proceeds in exactly the same way for both. But whereas this approach is correct for the second experiment, it ignores a systematic source of variation in the case of the first experiment.

A simulation study illustrates that ignoring item variability that is actually present in the data may lead to unacceptably high Type I error rates. In this simulation study, we considered three predictors, X, Y and Z, tied to 20 items, each of which was presented to 10 participants. In one set of simulation runs, these predictors had beta weights 2, 6 and 4. In a second set of simulation runs, the beta weight for Z was set to zero. We were interested in the power and Type I error rates for Z for by-participant and for by-item regression, and for two different mixed-effects models. The first mixed-effects model that we considered included crossed random effects for participant and item with random intercepts only. This model reflected exactly the structure implemented in the simulated data. A second mixed-effects model ignored the item structure in the data, and included only participant as a random effect. This model is the mixed-effects analogue of the by-participant regression.

Table 9 reports the proportions of simulation runs (on a total of 1000 runs) in which the coefficients of the regression model were reported as significantly different from zero at the 5% and 1% significance levels.

Table 9
Proportion of simulation runs (out of 1000) in which the coefficients for the intercept and the predictors X, Y and Z are reported as significantly different from zero according to four multiple regression models
[table body not recoverable from the source]

lmer: mixed-effects regression with crossed random effects for subject and item; lmerS: mixed-effects model with subject as random effect; Subj: by-subject regression; Item: by-item regression.

The upper part of Table 9 reports the proportions for simulated data in which an effect of Z was absent, with β_Z = 0. The lower part of the table lists the corresponding proportions for simulations in which Z was present (β_Z = 4). The bolded numbers in the upper part of the table highlight the very high Type I error rates for models that ignore by-item variability that is actually present in the data. The only models that come close to the nominal Type I error rates are the mixed-effects model with crossed random effects for subject and item, and the by-item regression. The lower half of Table 9 shows that of these two models, the power of the mixed-effects model is consistently greater than that of the by-item regression. (The greater power of the by-subject models, shown in grey, is irrelevant given their unacceptably high Type I error rates.)

Of the two mixed-effects models, it is only the model with crossed random effects that provides correct estimates of the standard deviations characterizing the random effects, as shown in Table 10. When the item random effect is ignored (lmerS), the standard deviation of the residual error is overestimated substantially, and the standard deviation for the subject random effect is slightly underestimated.

We note that for real datasets, mixed-effects regression offers the possibility to include not only item-bound predictors, but also predictors tied to the subjects, as well as predictors capturing inter-trial dependencies and longitudinal effects.

    Further issues

Some authors, e.g., Quené and Van den Bergh (2004), have argued that in experiments with subjects and items, items should be analyzed as nested under subjects. The nesting of items under participants creates a hierarchical mixed-effects model. Nesting is argued to be justified on the grounds that items may vary in familiarity across participants. For instance, if items are words, then lexical familiarity is known to vary considerably across occupations (see, e.g., Gardner, Rothkopf, Lapan, & Lafferty, 1987). Technically, however, nesting amounts to the strong assumption that there need not be any commonality at all for a given item across participants.

This strong assumption is justified only when the predictors in the regression model are treatments administered to items that otherwise do not vary on dimensions that might in any way affect the outcome of the experiment. For many linguistic items, predictors are intrinsically bound to the items. For instance, when items are words, predictors such as word frequency and word length are not treatments administered to items. Instead, these predictors gauge aspects of a word's lexical properties. Furthermore, for many current studies it is unlikely that they fully exhaust all properties that co-determine lexical processing. In these circumstances, it is highly likely that there is a non-negligible residue of item-bound properties that are not brought into the model formulation. Hence a random effect for word should be considered seriously. Fortunately, mixed-effects models allow the researcher to explicitly test whether a random effect for Item is required, by means of a likelihood ratio test comparing a model with and without a random effect for item. In our experience, such tests almost invariably show that a random effect for item is required, and the resulting models provide a tighter fit to the data.

Table 10
Actual and estimated standard deviations for simulated regression data

        Item   Subject  Residual
Data    40     80       50
lmer    39.35  77.22    49.84
lmerS   —      76.74    62.05
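The direction of the bias in Table 10 follows from a simple variance decomposition (our own arithmetic, not part of the original analysis): when the item effect is dropped from the model, its variance has largely to be absorbed by the residual term, which predicts a residual standard deviation of about √(40² + 50²).

```python
import math

sigma_item, sigma_resid = 40.0, 50.0
# ignoring the item effect pools its variance with the residual variance
pooled = math.hypot(sigma_item, sigma_resid)
print(round(pooled, 2))  # 64.03
```

This is in the vicinity of the 62.05 reported for lmerS; the value is not exact because part of the item variability is redistributed among the remaining model terms.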

Mixed-effects regression with crossed random effects for participants and items has further advantages to offer. One advantage is shrinkage estimates for the BLUPs (the subject- and item-specific adjustments to intercepts and slopes), which allow enhanced prediction for these items and subjects (see, e.g., Baayen, 2008, for further discussion). Another important advantage is the possibility to include simultaneously predictors that are tied to the items (e.g., frequency, length) and predictors that are tied to participants (e.g., handedness, age, gender). Mixed-effects models have also been extended to generalized linear models and can hence be used efficiently to model binary response data such as accuracy in lexical decision (see Jaeger, this volume).

To conclude, we briefly address the question of the extent to which an effect observed to be significant in a mixed-effects analysis generalizes across both subjects and items (see Forster, this issue). The traditional interpretation of the F1 (by-subject) and F2 (by-item) analyses is that significance in the F1 analysis would indicate that the effect is significant for all subjects, and that the F2 analysis would indicate that the effect holds for all items. We believe this interpretation is incorrect. In fact, even if we replace the F1+F2 procedure by a mixed-effects model, the inference that the effect would generalize across all subjects and items remains incorrect. The fixed-effect coefficients in a mixed-effects model are estimates of the intercept, slopes (for numeric predictors) or contrasts (for factors) in the population for the average, unknown subject and the average, unknown item. Individual subjects and items may have intercepts and slopes that diverge considerably from the population means. For ease of exposition, we distinguish three possible states of affairs for what in the traditional terminology would be described as an Effect by Item interaction.

First, it is conceivable that the BLUPs for a given fixed-effect coefficient, when added to that coefficient, never change its sign. In this situation, the effect indeed generalizes across all subjects (or items) sampled in the experiment. Other things being equal, the partial effect of the predictor quantified by this coefficient will be highly significant.

Second, situations arise in which adding the BLUPs to a fixed coefficient results in a majority of by-subject (or by-item) coefficients that have the same sign as the population estimate, in combination with a relatively small minority of by-subject (or by-item) coefficients with the opposite sign. The partial effect represented by the population coefficient will still be significant, but there will be less reason for surprise. The effect generalizes to a majority, but not to all subjects or items. Nevertheless, we can be confident about the magnitude and sign of the effect on average, for unknown subjects or items, if the subjects and items are representative of the population from which they are sampled.

Third, the by-subject (or by-item) coefficients obtained by taking the BLUPs into account may result in a set of coefficients with roughly equal numbers of coefficients that are positive and coefficients that are negative. In this situation, the main effect (for a numeric predictor or a binary contrast) will not be significant, in contradistinction to the significance of the random effect for the slopes or contrasts at issue. In this situation, there is a real and potentially important effect, but averaged across subjects or items, it cancels out to zero.

In the field of memory and language, experiments that do not yield a significant main effect are generally considered to have failed. However, an experiment resulting in this third state of affairs may constitute a positive step forward for our understanding of language and language processing. Consider, by way of example, a pharmaceutical company developing a new medicine, and suppose this medicine has adverse side effects for some, but highly beneficial effects for other patients, patients for which it is an effective life-saver. The company could decide not to market the medicine because there is no main effect. However, they can actually make substantial profit by bringing it on the market with warnings for adverse side effects and proper distributional controls.

Returning to our own field, we know that no two brains are the same, and that different brains have different developmental histories. Although in the initial stages of research the available technology may only reveal the most robust main effects, the more our research advances, the more likely it will become that we will be able to observe systematic individual differences. Ultimately, we will need to bring these individual differences into our theories. Mixed-effects models have been developed to capture individual differences in a principled way, while at the same time allowing generalizations across populations. Instead of discarding individual differences across subjects and items as an uninteresting and disappointing nuisance, we should embrace them. It is not to the advantage of scientific progress if systematic variation is systematically ignored.

Hierarchical models in developmental and educational psychology

Thus far, we have focussed on designs with crossed random effects for subjects and items. In educational and developmental research, designs with nested random effects are often used, such as the natural hierarchy formed by students nested within a classroom (Goldstein, 1987). Such designs can also be handled by mixed-effects models, which are then often referred to as hierarchical linear models or multilevel models.

Studies in educational settings are often focused on learning over time, and techniques developed for this type of data often attempt to characterize how individuals' performance or knowledge changes over time, termed the analysis of growth curves (Goldstein, 1987, 1995; Goldstein et al., 1993; Nuttall, Goldstein, Prosser, & Rasbash, 1989; Willett, Singer, & Martin, 1998). Examples of this include the assessment of different teaching techniques on students' performance (Aitkin, Anderson, & Hinde, 1981), and the comparison of the effectiveness of different schools (Aitkin & Longford, 1986). Goldstein et al. (1993) used multilevel techniques to study the differences between schools and students when adjusting for pre-existing differences when students entered classes. For a methodological discussion of the use of these models, see the collection of articles in the Summer 1995 special issue of the Journal of Educational and Behavioral Statistics on hierarchical linear models, e.g., Kreft (1995). Singer (1998) provides a practical introduction to multilevel models including demonstration code, and Collins (2006) provides a recent overview of issues in longitudinal data analysis involving these models. Finally, Fielding and Goldstein (2006) provide a comprehensive overview of multilevel and cross-classified models applied to education research, including a brief software review. West et al. (2007) provide a comprehensive software review for nested mixed-effects models.

These types of models are also applicable to psycholinguistic research, especially in studies of developmental change. Individual speakers from a language community are often members of a hierarchy, e.g., language:dialect:family:speaker, and many studies focus on learning or language acquisition, and thus analysis of change or development is important. Huttenlocher, Haight, Bryk, and Seltzer (1991) used multilevel models to assess the influence of parental or caregiver speech on vocabulary growth, for example. Boyle and Willms (2001) provide an introduction to the use of multilevel models to study developmental change, with an emphasis on growth curve modeling and discrete outcomes. Raudenbush (2001) reviews techniques for analyzing longitudinal designs in which repeated measures are used. Recently, Misangyi, LePine, Algina, and Goeddeke (2006) compared repeated measures regression to multivariate ANOVA (MANOVA) and multilevel analysis in research designs typical for organizational and behavioral research, and concluded that multilevel analysis can provide results equivalent to MANOVA, and that in cases where specific assumptions about variance-covariance structures could be made, or in cases where missing values were present, multilevel modeling is a better and in some cases a necessary analysis strategy (see also Kreft & de Leeuw, 1998 and Snijders & Bosker, 1999).

Finally, a vast body of work in educational psychology concerns test construction and the selection of test items (Lord & Novick, 1968). Although it is beyond the scope of this article to review this work, it should be noted that work within generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) has been concerned with the problem of crossed subject and item factors using random effects models (Schroeder & Hakstian, 1990). For a recent application of the software considered here to item response theory, see Doran, Bates, Bliese, and Dowling (2007), and for the application of hierarchical models to joint response type and response time measures, see Fox, Entink, and van der Linden (2007).

Mixed-effects models in neuroimaging

In neuroimaging, two-level or mixed-effects models are now a standard analysis technique (Friston et al., 2002a, 2002b; Worsley et al., 2002), and are used in conjunction with Gaussian Random Field theory to make inferences about activity patterns in very large data sets (voxels from fMRI scans). These techniques are formally comparable to the techniques that are advocated in this paper (Friston, Stephan, Lund, Morcom, & Kiebel, 2005). Interestingly, however, the treatment of stimuli as random effects has not been widely addressed in the imaging and physiological community until recently (Bedny, Aguirre, & Thompson-Schill, 2007).

In imaging studies that compare experimental conditions, for example, statistical parameter maps (SPM; Friston et al., 1995) are calculated based on successively recorded time series for the different experimental conditions. A hypothesized hemodynamic response function is convolved with a function that encodes the experimental design matrix, and this forms a regressor for each of the time series in each voxel. Significant parameters for the regressors are taken as evidence of activity in the voxels that exhibit greater or less activity than is expected based on the null hypothesis of no activity difference between conditions. The logic behind these tests is that a rejection of the null hypothesis for a region is evidence for a difference in activity in that region.

Neuroimaging designs are often similar to cognitive psychology designs, but the dimension of the response variable is much larger and the nature of the response has different statistical properties. However, this is not crucial for the application of mixed-effects models. In fact, it shows that the technique can scale to problems that involve very large datasets.

A prototypical case of a fixed-effects analysis in fMRI would test whether an image contrast is statistically significant within a single subject over trials. This would be analogous to a psychophysics experiment using only a few participants, or a patient case study. For a random-effects analysis, the parameters calculated from the single participants are used in a mixed model to test whether a contrast is significant over participants, in order to test whether the contrast reflects a difference in the population from which the participants were sampled. This is analogous to how cognitive psychology experimenters treat mean RTs, for example. A common analysis strategy is to calculate a single parameter for each participant in an RT study, and then analyze these data in (what in the neuroimaging community is called) a random-effects analysis.
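This two-stage "summary statistics" strategy can be sketched in a few lines (Python; all parameter values here are invented for illustration): stage one reduces each participant's trials to a single contrast estimate, stage two submits those estimates to an ordinary one-sample t-test.

```python
import numpy as np

rng = np.random.default_rng(7)
n_subj, n_trials = 16, 50

# level 1: a within-subject contrast estimated per participant
true_effect = 5.0                                     # population-level contrast
subj_effect = true_effect + rng.normal(0, 3, n_subj)  # between-subject variation
betas = np.array([
    rng.normal(mu, 20, n_trials).mean()               # per-subject estimate over trials
    for mu in subj_effect
])

# level 2: one-sample t-test on the per-subject estimates
t = betas.mean() / (betas.std(ddof=1) / np.sqrt(n_subj))
print(round(t, 2))  # group-level t statistic with n_subj - 1 df
```

Note that the standard error at level 2 automatically reflects both trial noise and between-subject variability, which is what licenses the inference to the population of participants.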

The estimation methods used to calculate the statistical parameters of these models include Maximum Likelihood and Restricted Maximum Likelihood, just as in the application of the multilevel models used in education research described earlier. One reason that these techniques are used is to account for correlation between successive measurements in the imaging time series. These corrections are similar to corrections familiar to psychologists for non-sphericity (Greenhouse & Geisser, 1959).

Similar analysis concerns are present within electrophysiology. In the past, journal policy in psychophysiological research has dealt with the problems posed by repeated measures experimental designs by suggesting that researchers adopt statistical procedures that take into account the correlated data obtained from these designs (Jennings, 1987; Vasey & Thayer, 1987). Mixed-effects models are less commonly applied in psychophysiological research, as the most common techniques are the traditional univariate ANOVA with adjustments or multivariate ANOVA (Dien & Santuzzi, 2004), but some researchers have advocated them to deal with repeated measures data. For example, Bagiella, Sloan, and Heitjan (2000) suggest that mixed-effects models have advantages over more traditional techniques for EEG data analysis.

The current practice of psychophysiologists and neuroimaging researchers typically ignores the issue of whether linguistic materials should be modeled with fixed- or random-effect models. Thus, while there are techniques available for modeling stimuli as random effects, it is not yet current practice in neuroimaging and psychophysiology to do so. This represents a tremendous opportunity for methodological development in language-related imaging experiments, as psycholinguists have considerable experience in modeling stimulus characteristics.

Cognitive psychologists and neuroscientists might reasonably assume that the language-as-a-fixed-effect debate is only a concern when linguistic materials are used, given that most discussion to date has taken place in the context of linguistically-motivated experiments. This assumption is too narrow, however, because naturalistic stimuli from many domains are drawn from populations.

Consider a researcher interested in the electrophysiology of face perception. She designs an experiment to test whether an ERP component such as the N170 in response to faces has a different amplitude in one of two face conditions, normal and scrambled form. She obtains a set of images from a database, arranges them according to her experimental design, and proceeds to present each picture in a face-detection EEG experiment, analogous to the way that a psycholinguist would present words and non-words to a participant in a lexical decision experiment. The images presented in this experiment would be a sample of all possible human faces. It is not controversial that human participants are to be modeled as a random variable in psychological experiments. Pictures of human faces are images of a random variable, presented as stimuli. Thus, it should be no source of controversy that naturalistic face stimuli are also a random variable, and sh[…]tion and assign them to experimental condition. However, stimuli may have stochastic characteristics whether or not they are randomly sampled. Participants have stochastic characteristics as well, whether they are randomly sampled or not. Therefore, the present debate about the best way to model random effects of stimuli is wider than previously has been appreciated, and should be seen as part of the debate over the use of naturalistic stimuli in sensory neurophysiology as well (Felsen & Dan, 2005; Rust & Movshon, 2005).

Concluding remarks

We have described the advantages that mixed-effects models with crossed random effects for subject and item offer to the analysis of experimental data.

The most important advantage of mixed-effects models is that they allow the researcher to simultaneously consider all factors that potentially contribute to the understanding of the structure of the data. These factors comprise not only standard fixed-effects factors typically manipulated in psycholinguistic experiments, but also covariates bound to the items (e.g., frequency, complexity) and the subjects (e.g., age, sex). Furthermore, local dependencies between the successive trials in an experiment can be brought into the model, and the effects of prior exposure to related or identical stimuli (as in long-distance priming) can be taken into account as well. (For applications in eye-movement research, see Kliegl, 2007, and Kliegl, Risse, & Laubrock, 2007.) Mixed-effects models may offer substantially enhanced insight into how subjects are performing in the course