Three-level meta-analysis of dependent effect sizes

Wim Van den Noortgate & José Antonio López-López & Fulgencio Marín-Martínez & Julio Sánchez-Meca

W. Van den Noortgate, Faculty of Psychology and Educational Sciences, University of Leuven, Vesaliusstraat 2, 3000 Leuven, Belgium; e-mail: [email protected]
J. A. López-López, F. Marín-Martínez, and J. Sánchez-Meca, Universidad de Murcia, Murcia, Spain

Published online: 9 October 2012
© Psychonomic Society, Inc. 2012
Behav Res (2013) 45:576–594. DOI 10.3758/s13428-012-0261-6

Abstract Although dependence in effect sizes is ubiquitous, commonly used meta-analytic methods assume independent effect sizes. We describe and illustrate three-level extensions of a mixed effects meta-analytic model that accounts for various sources of dependence within and across studies, because multilevel extensions of meta-analytic models still are not well known. We also present a three-level model for the common case where, within studies, multiple effect sizes are calculated using the same sample. Whereas this approach is relatively simple and does not require imputing values for the unknown sampling covariances, it has hardly been used, and its performance has not been empirically investigated. Therefore, we set up a simulation study, showing that also in this situation, a three-level approach yields valid results: Estimates of the treatment effects and the corresponding standard errors are unbiased.

Keywords Multilevel · Meta-analysis · Multiple outcomes · Dependence

Three-level meta-analysis of dependent effect sizes

Meta-analysis refers to the statistical integration of the quantitative results of several studies. For instance, if several experimental studies examine the effect of student coaching on students' achievement, a meta-analysis can be used to investigate how large the effect is on average, whether this effect varies over studies, and whether we can find factors that are related to the size of the effect (for an introduction to meta-analysis, see Borenstein, Hedges, Higgins, & Rothstein, 2009). Most often, the effects that were observed in the primary studies are not exactly the same as the ones predicted on the basis of the meta-analytic model. Most meta-analytic methods assume that these deviations of the observed effect sizes from their expected values are independent. That is, it is assumed that one specific observed effect size does not give information about the direction or size of the deviation of another observed effect from the value we would expect on the basis of the meta-analytic model. Nevertheless, in practice, there are numerous sources of effect size dependence, and therefore, meta-analysts are very often confronted with dependent effect sizes. For instance, if several studies were done by the same research group, it is not unlikely that the effect sizes from this research group are more similar than effect sizes from two different research groups, because the effect sizes might be influenced by common factors, including the way the dependent and independent variables were operationalized, the characteristics of the studied population, the way subjects were sampled, and the observers or interviewers who collected the data (Cooper, 2009). In the same way, study results from the same country might be more similar than study results from different countries, again inducing dependence in observed effects.

Besides dependence over studies, there also might be dependence within studies. For instance, a specific effect can be investigated for several samples in a single study. In this situation, factors such as the ones described above possibly vary less within the same study than over studies, possibly resulting in dependence.

The dependence within studies is most obvious if effect sizes are based on a common sample (Becker, 2000; Gleser & Olkin, 1994). Studies often do not use one single outcome variable but, instead, report the treatment effect on different variables. These can refer to different but related constructs (such as anxiety and depression), different aspects of a construct (such as subscales of a psychological test), different operationalizations of the same construct (such as two instruments for anxiety), or repeated measures of the same construct (e.g., a posttest and follow-up measures). Another example of overlap in the samples is the case where multiple (independent) treatment groups are compared with a common control group.

Stevens and Taylor (2009) discussed another source of dependence, one that is not intrinsic to the design or the data but is related to the way the effect sizes are calculated and occurs when mean differences from independent groups are standardized using a common within-group standard deviation estimate—more specifically, the root MSE resulting from an ANOVA for a between-subjects factor with more than two levels.

Dependent effect sizes are less informative than independent effect sizes. Suppose that two outcome variables are perfectly correlated. Essentially, this means that both outcomes refer to the same latent variable and that effect sizes calculated for both outcomes will give exactly the same information. If, in a meta-analysis, both of these effect sizes are included as independent effect sizes, the same information therefore is used twice. In general, when outcome variables are correlated, information regarding one outcome overlaps with information yielded by the other outcome. By "inflating" the available information in this way, we will overestimate confidence in the results of the meta-analysis. In statistical terms, standard errors of the parameter estimates are likely to be underestimated, resulting in confidence intervals that are too narrow and too many incorrect rejections of the null hypotheses (Becker, 2000). There is also another problem when ordinary meta-analytic methods are used to combine dependent effect sizes: Studies or groups of studies with multiple effect sizes will have a larger influence on the results of the meta-analysis than will studies reporting only one effect size, potentially resulting in biased estimates. For both reasons, it is important that the dependence between effect sizes be taken into account in our meta-analysis.

Yet the dependence over studies and, especially, the dependence within studies including independent samples are often overlooked, although recently, Stevens and Taylor (2009) and Hedges, Tipton, and Johnson (2010) discussed the use of random effects for dealing with this kind of dependence. Some authors even have explicitly stated that if effect sizes are based on independent samples, effect sizes can be regarded as independent even if they stem from the same study (e.g., Littell, Corcoran, & Pillai, 2008).

The dependence within studies due to overlapping samples has received much more attention. Becker (2000) and Littell et al. (2008) have given an overview of approaches to accounting for this kind of dependence, as well as of the corresponding problems or limitations. In general, the approaches boil down to three types: ignoring dependence, avoiding dependence, and modeling dependence. For the reasons described above, ignoring dependence is in principle not appropriate. Yet if only one or two studies in a large set of studies based more than one effect size on the same sample, treating the effect sizes as independent will probably not substantially influence the results of the meta-analysis. In any case, if a researcher decides to treat multiple effect sizes from the same studies as independent, sensitivity analyses are recommended, comparing the results of the analysis ignoring the dependence with one or more alternatives (Becker, 2000).

A second strategy is to avoid dependence, which means that we restrict our analysis to one effect size per study. If more or less the same outcome variables are measured in each study, separate meta-analyses can be performed for each type of outcome. This approach, too, is not without problems or disadvantages. For instance, Rosa-Alcázar, Sánchez-Meca, Gómez-Conesa, and Marín-Martínez (2008), investigating the effect of psychological treatment of obsessive–compulsive disorder, performed five separate meta-analyses for five types of outcome variables. Yet they found that only 7 of the 24 studies reported the effect of psychological treatment on social adjustment, a number that is too small for accurately estimating the between-study variance. Performing separate analyses can even become unfeasible if there are a lot of different outcomes and, especially, if the chosen outcomes vary a lot over studies. Moreover, when separate meta-analyses are performed for different outcomes, testing differences between outcomes in the mean treatment effect or in the moderating effects of study characteristics is not straightforward. Finally, each separate meta-analysis uses only a subset of the data, which in principle results in less accurate estimates and less power in statistical tests (Gleser & Olkin, 2009).

Another way of avoiding dependence is taking not the individual effect sizes as the units of analysis but, rather, the samples (i.e., using one effect size per sample), the studies, or even research groups (Cooper, 2009). To obtain one effect size per higher level unit, one of the observed effect sizes from that unit can be selected, either randomly or because the chosen effect size refers to the outcome variable that is of most interest from a substantive point of view. A common way to reduce multiple effect sizes per higher level unit (e.g., per study) is to average them, an approach that makes sense especially if the outcomes refer to the same construct. However, by simply averaging effect sizes within studies, the variance between effect sizes is artificially reduced, and informative differences between outcomes get lost (Becker, 2000; Cheung & Chan, 2008).


The third and most complex strategy is to model the dependence. Raudenbush, Becker, and Kalaian (1988) proposed a multivariate model for analyzing multivariate effect size data. This model was extended by Kalaian and Raudenbush (1996) to a multivariate mixed model, allowing one to model variation between and within studies. The model, which will be presented later, assumes that effect sizes within the same study are possibly correlated. Whereas an ordinary meta-analysis uses sampling variance estimates of the effect sizes to approximate the optimal weights for the effect sizes and to estimate unknown parameters and corresponding standard errors, a multivariate meta-analysis uses the estimated sampling covariance matrix of the multivariate effect sizes. The multivariate approach has several advantages: The treatment effect is estimated for each outcome, all available information is used in one single analysis, it is possible to test contrasts between treatment effects of outcomes (e.g., whether the effect on a first outcome is the same as the effect on two other outcomes), and, similarly, it is possible to test differential moderating effects of study characteristics. The approach, however, is more complex to use, because it is not always implemented in software for meta-analysis. A major disadvantage of the approach is that for estimating the covariance matrix of the effect sizes, the correlations between the outcomes are also needed. If the results of standardized tests are used, correlations reported in the test manuals might give an idea about these correlations, but otherwise estimates of these correlations are difficult to obtain, because only rarely are estimates reported by the primary studies, and the raw data that could be used to estimate the correlations are typically not available to the meta-analyst. Therefore, multivariate meta-analyses are seldom used or are used only on a subset of outcomes for which relatively accurate intercorrelation estimates are available. In addition, the multivariate approach, as well as the approach of separate meta-analyses for each outcome variable, is feasible only if there are only a few different outcomes.

In this article, we describe and evaluate an alternative approach to accounting for dependence within and over studies: the use of multilevel models. First, we briefly introduce multilevel models and their application for meta-analysis. Next, we give some examples of meta-analyses using three-level models, accounting for different kinds of dependencies. In one of the examples, a three-level model is used to account for sampling covariation. Because applying the multilevel approach in this situation is not straightforward and has not been validated before, we explore its performance with an in-depth discussion of the analysis results for a simulated data set and by presenting the results of an extensive simulation study. The article closes with a discussion and conclusions.

Multilevel meta-analysis

In social and behavioral sciences, data often have a clustered structure. For instance, if in educational research, first a set of schools is sampled from a population of schools and, in a second stage, students are sampled from the selected schools, the data are clustered: The students participating in the study can be grouped according to the schools they belong to, as illustrated in Fig. 1.

This structure can induce dependence in the data: Students from the same school are, in general, more alike than students from different schools—for instance, due to (self-)selection effects or to effects schools have on their students. Therefore, school membership has to be taken into account when statistical analyses that assume independent residuals are performed.

Multilevel models have been developed to deal with such grouped data (Goldstein, 1987; Raudenbush, 1988). A simple two-level model (with a within- and a between-group level) includes a regression equation (Eq. 1) regressing the dependent variable Y on a predictor X, describing the variation over units (referred to with index i) within groups (index j):

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij} \quad (1)$$

The regression coefficients also get an index j, meaning that they are allowed to vary over groups. This variation at the group level is described using additional regression equations, possibly including group characteristics as predictors. Equations 2 and 3 at this second level regress the within-group intercepts and slopes, respectively, on a group characteristic Z:

$$\beta_{0j} = \gamma_{00} + \gamma_{01} Z_j + u_{0j} \quad (2)$$

$$\beta_{1j} = \gamma_{10} + \gamma_{11} Z_j + u_{1j} \quad (3)$$

The residuals at each level are typically assumed to be identically and independently normally distributed with zero means, and to be independent over levels, that is,

$$e_{ij} \sim N(0, \sigma_e^2) \quad \text{and} \quad \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{u_0}^2 & \sigma_{u_0 u_1} \\ \sigma_{u_0 u_1} & \sigma_{u_1}^2 \end{pmatrix} \right) \quad (4)$$
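To make Eqs. 1–4 concrete, here is a minimal simulation sketch (Python with NumPy; all sizes and parameter values are illustrative assumptions, not taken from the article, and the two group-level residuals are drawn independently, i.e., $\sigma_{u_0 u_1} = 0$, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes and parameters (assumptions, not from the article)
J, n = 50, 20                            # 50 groups of 20 units each
g00, g01, g10, g11 = 0.0, 0.3, 0.5, 0.2  # the gammas of Eqs. 2-3
sd_e, sd_u0, sd_u1 = 1.0, 0.4, 0.3       # residual SDs at both levels

groups = []
for j in range(J):
    Zj = rng.normal()                                  # group characteristic Z
    b0j = g00 + g01 * Zj + rng.normal(0, sd_u0)        # Eq. 2
    b1j = g10 + g11 * Zj + rng.normal(0, sd_u1)        # Eq. 3
    X = rng.normal(size=n)                             # unit-level predictor
    Y = b0j + b1j * X + rng.normal(0, sd_e, size=n)    # Eq. 1
    groups.append((Zj, X, Y))
```

Fitting such a model back to data is typically done with a mixed model routine (the authors use SAS Proc Mixed; in Python, statsmodels' MixedLM is one option).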

Fig. 1 A clustered structure in educational research: students (s) nested within schools (S1–S6)


Especially when there is a large number of groups, the multilevel model is a very economic model, because regardless of the number of groups that are included, the only parameters to be estimated are the regression coefficients at the group level (the γ's) and the (co)variances of the residuals at the first and second levels ($\sigma_e^2$, $\sigma_{u_0}^2$, $\sigma_{u_0 u_1}$, $\sigma_{u_1}^2$). If one is interested in the individual level 1 regression coefficients (the β's), these coefficients can be estimated afterward using empirical Bayes techniques. By borrowing strength from the whole data set, empirical Bayes estimates are more efficient than estimates based on the scores from the specific group alone (Raudenbush & Bryk, 2002).

An important strength of the multilevel model is its remarkable flexibility, allowing one, for instance, to define a third level describing the variation of the level 2 regression coefficients over groups of groups, to include additional covariates at each of the levels, to define more complex covariance structures, such as an autocorrelation structure, or to use a nonidentity link function for modeling discrete dependent variables, such as proportions or counts. Thanks to their flexibility, multilevel or mixed models have become popular in several fields, including education, psychology, economics, and the biomedical sciences, for several kinds of nesting (students in schools, repeated measurement occasions within subjects, children in families, patients in doctors, etc.).

One application of multilevel models is meta-analysis (Hox, 2002; Raudenbush & Bryk, 2002). In a meta-analysis, we indeed have a similar nested data structure: Study subjects are nested within studies. Scores can vary over subjects from the same study (level 1), but there might also be differences between studies (level 2). For instance, if we want to combine the results of a set of treatment effectiveness studies comparing a control and a treatment group, we define a predictor at the subject level ($X_{ij}$ of Eq. 1) as a dummy treatment indicator, having value 1 if subject i of study j belongs to the treatment group and 0 if the subject belongs to the control group. In this way, $\beta_{0j}$ of Eq. 1 is equal to the expected value of a subject belonging to the control group in study j, whereas $\beta_{1j}$ refers to the increase of the expected value if a study subject belongs to the treatment group and, therefore, can be interpreted as the treatment effect. One or more study characteristics can be included in the study-level regression equations (Eqs. 2 and 3) in an attempt to explain possible variation over studies in the expected baseline levels, $\beta_{0j}$, or in the treatment effects, $\beta_{1j}$. If a study characteristic is found to explain or describe the treatment effects, the study characteristic has the role of a moderator variable.

There are, however, three major differences between an ordinary multilevel analysis and a typical meta-analysis. A first difference is that meta-analyses often combine results of studies in which the dependent variable is not always measured on the same scale, requiring a standardization of the data. To compensate for multiplicative factors, the scores of each study can be divided by (an estimate of) the within-study standard deviation $\sigma_{e_j}$. More generally, linearly equatable scales can be made comparable by dividing the deviation of each score from the mean by the standard deviation. Note that whereas, for the unstandardized scores, $\beta_{1j}$ is equal to the difference in expected values for the control and treatment conditions, by standardizing scores in either way, $\beta_{1j}$ becomes the standardized mean difference that was proposed by Cohen (1988) and has become a popular effect size metric in social and behavioral research.

A second difference between ordinary multilevel analyses and meta-analyses is that a meta-analyst often does not have all raw data but, rather, depends on results reported in the form of summary statistics, test statistics, or effect size values. Fortunately, it is often possible to convert the results of each study to a common effect size metric and combine these effect sizes using a multilevel model. For instance, if, for each study, we can obtain a standardized mean difference, which is an estimate of the population standardized mean difference $\beta_{1j}$, Eq. 1 is adapted as follows:

$$\hat{\beta}_{1j} = \beta_{1j} + r_j \quad (5)$$

The residual $r_j$ of Eq. 5 summarizes the effect of the residuals $e_{ij}$ of all individual subjects in study j on the observed treatment effect. Differences over studies between treatment effects can be described in the same way as for raw data, regressing the $\beta_{1j}$ on one or more study characteristics to investigate their moderating effects (Eq. 3). By substitution, we can write the two-level meta-analytic model (Eqs. 5 and 3) in one equation: $\hat{\beta}_{1j} = \gamma_{10} + \gamma_{11} Z_j + u_{1j} + r_j$. If this second-level equation does not include any predictor, the multilevel meta-analytic model simplifies to the meta-analytic random effects model described by DerSimonian and Laird (1986) and Hedges and Olkin (1985), with $\hat{\beta}_{1j}$ referring to an effect size estimate, $\gamma_{10}$ to the overall effect, and $\sigma_u^2$ to the systematic between-study variance. By including predictor variables, we assume that the variation of the treatment effect over studies might be partly explained by moderating study characteristics, whereas the other part remains unexplained or random. Therefore, the multilevel meta-analytic model is also called a mixed meta-analytic model (Borenstein et al., 2009).
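As a concrete illustration of this random effects model, the following minimal sketch estimates the overall effect with the DerSimonian and Laird (1986) method of moments. It was written for this text (the analyses reported below instead use REML in SAS), so treat it as a sketch rather than the authors' implementation:

```python
import numpy as np

def dersimonian_laird(d, v):
    """Random effects meta-analysis (Eqs. 5 and 3 without predictors),
    estimated with the DerSimonian-Laird method of moments.
    d : observed effect sizes; v : their known sampling variances."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1.0 / v                                    # fixed effect weights
    d_fix = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - d_fix) ** 2)               # homogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(d) - 1)) / c)        # between-study variance
    w_star = 1.0 / (v + tau2)                      # random effects weights
    gamma10 = np.sum(w_star * d) / np.sum(w_star)  # overall effect estimate
    se = np.sqrt(1.0 / np.sum(w_star))             # its standard error
    return gamma10, se, tau2
```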

A third difference between raw data multilevel analyses and effect size meta-analyses is that for most commonly used effect size metrics, (a good approximation of) the sampling variance $\sigma_{r_j}^2$ can be calculated prior to performing the meta-analysis, and therefore, the variance of the residuals at the subject level does not have to be estimated in the meta-analysis anymore (Raudenbush & Bryk, 2002). We want to remark that the multilevel model (Eqs. 5 and 3) is applicable for other measures of effect size (such as log odds ratios or Fisher's Z-transformed correlation coefficients), as long as their sampling distributions are approximately normal, with a variance that can be estimated, and the same effect size metric is used for all studies.
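For instance, for the standardized mean difference, a common large-sample approximation of the sampling variance (Hedges & Olkin, 1985) can be computed directly from the group data. A small sketch (the function name is ours):

```python
import numpy as np

def smd_and_variance(y_t, y_c):
    """Standardized mean difference and the usual large-sample
    approximation of its sampling variance (Hedges & Olkin, 1985)."""
    y_t, y_c = np.asarray(y_t, float), np.asarray(y_c, float)
    n_t, n_c = len(y_t), len(y_c)
    # Pooled within-group standard deviation
    s_pooled = np.sqrt(((n_t - 1) * y_t.var(ddof=1) +
                        (n_c - 1) * y_c.var(ddof=1)) / (n_t + n_c - 2))
    d = (y_t.mean() - y_c.mean()) / s_pooled
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return d, var_d
```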

Because a multilevel analysis of the raw data is similar to a two- or multistage analysis in which the subject-level regression coefficients (the β's from Eq. 1) are estimated and used as the dependent variable in new regression analyses, analyzing raw data or analyzing effect sizes (which can be regarded as standardized regression coefficients at the subject level) will give essentially the same results. An advantage of the analysis of the raw data (in the medical sciences known as individual patient data), however, is that it allows for the incorporation of covariates at the subject level—that is, in Eq. 1—as discussed by Higgins, Whitehead, Turner, Omar, and Thompson (2001). Moreover, using maximum likelihood estimation procedures, as is common for multilevel models, or using traditional estimation procedures for meta-analysis, such as the method of moments of DerSimonian and Laird (1986), will give very similar results, even if the normality assumption that is made for the maximum likelihood procedure is violated (López-López, Viechtbauer, Sánchez-Meca, & Marín-Martínez, 2010; Van den Noortgate & Onghena, 2003b). The major strength of using the multilevel modeling framework for meta-analysis is, however, its flexibility. In the following section, we will discuss one extension of the two-level meta-analytic model that is only very rarely used in practice: the inclusion of an additional level to account for dependence in the effect sizes.

Three-level meta-analyses for dependent effect sizes

Suppose that several teams performed more than one study, possibly resulting in dependent study results. The equation at the first level—that is, the subject level—states that due to sampling variation, the observed effect size from study j from team k possibly deviates from the "true" treatment effect for that study:

$$\hat{\beta}_{1jk} = \beta_{1jk} + r_{jk} \quad \text{with} \quad r_{jk} \sim N(0, \sigma_{r_{jk}}^2) \quad (6)$$

Because the sampling variance, $\sigma_{r_{jk}}^2$, depends on the study—especially on the study size—it also gets an index for the study and the team. Additional equations state that the true treatment effect $\beta_{1jk}$ can vary randomly over studies from the same team around a team-specific mean effect ($\theta_{10k}$ from Eq. 7, the second-level model) and that this team-specific mean effect, in turn, can vary randomly over teams around an overall mean effect ($\gamma_{100}$ from Eq. 8, the third-level model):

$$\beta_{1jk} = \theta_{10k} + u_{1jk} \quad \text{with} \quad u_{1jk} \sim N(0, \sigma_u^2) \quad (7)$$

$$\theta_{10k} = \gamma_{100} + v_{10k} \quad \text{with} \quad v_{10k} \sim N(0, \sigma_v^2) \quad (8)$$

In order to try to explain this variation over studies and research teams, characteristics of these studies and teams can be included as predictors at the respective levels. It is even possible to allow the effect of a study characteristic to vary (partly) randomly over research teams.

The use of the three-level model is illustrated below using three examples from our own research. For all three analyses, use was made of the restricted maximum likelihood estimation procedure implemented in Proc Mixed from SAS (Littell, Milliken, Stroup, Wolfinger, & Schabenberger, 2006). Code for running the analyses can be obtained from the authors.
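The authors' SAS code is not reproduced here, but the intercept-only model of Eqs. 6–8 is compact enough to estimate directly. The sketch below (our illustration, not the authors' implementation) profiles out $\gamma_{100}$ by generalized least squares and maximizes the restricted likelihood over the two variance components, treating the level 1 sampling variances as known:

```python
import numpy as np
from scipy.optimize import minimize

def _gls_pieces(d, v, clusters, s2_u, s2_v):
    """log|V|, X'V^-1 X, the GLS estimate of the overall effect, and the
    residual quadratic form, for an intercept-only three-level model."""
    logdet, xvx, xvy = 0.0, 0.0, 0.0
    inverses = []
    for idx in clusters:
        # Marginal covariance of the effect sizes within one cluster:
        # diag(sampling variances + sigma2_u), plus sigma2_v on every entry
        Vk = np.diag(v[idx] + s2_u) + s2_v
        Vk_inv = np.linalg.inv(Vk)
        logdet += np.linalg.slogdet(Vk)[1]
        ones = np.ones(len(idx))
        xvx += ones @ Vk_inv @ ones
        xvy += ones @ Vk_inv @ d[idx]
        inverses.append(Vk_inv)
    gamma = xvy / xvx
    quad = sum((d[idx] - gamma) @ Vi @ (d[idx] - gamma)
               for idx, Vi in zip(clusters, inverses))
    return logdet, xvx, gamma, quad

def three_level_meta(d, v, cluster_id):
    """REML estimation for Eqs. 6-8: d = effect sizes, v = their known
    sampling variances, cluster_id = study (or team) membership."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    cluster_id = np.asarray(cluster_id)
    clusters = [np.where(cluster_id == c)[0] for c in np.unique(cluster_id)]

    def neg_reml(log_s2):  # negative restricted log-likelihood, up to a constant
        s2_u, s2_v = np.exp(log_s2)
        logdet, xvx, _, quad = _gls_pieces(d, v, clusters, s2_u, s2_v)
        return 0.5 * (logdet + np.log(xvx) + quad)

    fit = minimize(neg_reml, x0=np.log([0.05, 0.05]), method="Nelder-Mead")
    s2_u, s2_v = np.exp(fit.x)
    _, xvx, gamma, _ = _gls_pieces(d, v, clusters, s2_u, s2_v)
    return gamma, np.sqrt(1.0 / xvx), s2_u, s2_v  # effect, SE, variances
```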

Example 1: studies with multiple experiments

Van den Bussche, Van den Noortgate, and Reynvoet (2009) performed a meta-analysis of the results of 54 studies, to assess the magnitude of psychological subliminal priming effects and to explore what factors are associated with the strength of the priming effects. In each study, subjects performed a number of tasks (e.g., indicating whether a number is larger or smaller than 5) after being given a congruent stimulus (e.g., a number larger than 5 is preceded by a number larger than 5) or an incongruent stimulus (e.g., a number larger than 5 is preceded by a number smaller than 5). Because the meta-analysis focused on subliminal priming, only studies were included that presented primes below the threshold for conscious perception—for instance, by presenting stimuli for very short durations. A priming effect is manifested as faster responses on congruent than on incongruent trials. For each subject, the priming effect can be estimated as the mean reaction time on incongruent trials minus the mean reaction time on congruent trials. The outcome used to quantify the observed priming effect in a primary study was defined as the mean difference over subjects, divided by the standard deviation of the differences.

In most of the 54 studies, more than one experiment was conducted, 156 in total, and these experiments differed from each other in one or more factors. Experiments from the same study cannot be regarded as independent, because they were done by the same researchers, and often with more similar research groups or more similar stimuli or procedures than experiments from different studies. Therefore, we used a three-level model to analyze the data. We have a set of studies (level 3), experiments nested within studies (level 2), and a sample of subjects for each experiment (level 1) (see Fig. 2). There are, therefore, three sources of variance: differences between study population effects, differences between the population effects of experiments from the same study, and, finally, sampling variance.

Two separate meta-analyses were performed on a set of 23 studies (88 experiments) containing semantic categorization tasks and a set of 32 studies (68 experiments) containing lexical decision and naming tasks. Using a three-level extension of the common meta-analytic random effects model—that is, a model without predictors at the experiment and study levels—resulted for the first set of studies in a mean effect estimate of 0.80, but significant variance was found between studies, as well as between experiments within studies. Next, we investigated the moderating effect of several characteristics of the experiments and studies and found that, with only two predictors (prime novelty and category size), the variance at the experiment and study levels was reduced by 87% and 40%, respectively. For the second set of studies, a mean effect of 0.47 was found, and effects were found to vary over studies, but not over experiments within studies. With two predictors (prime duration and target set size), the variance over and within studies was reduced by 99% and 44%, respectively.

Example 2: single-subject experimental designs

In single-subject experimental designs (SSEDs), subjects are measured repeatedly under different conditions—for instance, during a baseline phase without a treatment, a treatment phase, and a second baseline phase. The effect of the treatment for a subject can be evaluated by comparing the mean score of the subject during the baseline phase(s) with the mean during the treatment phase(s). To explore generalizability, SSED studies often include more than 1 subject, resulting in hierarchical data: Measurements are grouped or "nested" in subjects. If we have several SSED studies, the subjects in turn are grouped in studies, meaning that three hierarchical levels can be distinguished (Fig. 3).

In SSED studies, data are typically reported using graphical displays, and conclusions are based on a visual analysis of these plots. The tradition of graphically presenting data means that the raw data can often be retrieved from the research reports, permitting one to combine the data of several subjects, using an ordinary two-level model to analyze data of one SSED study or an ordinary three-level model to combine the results of several studies (Van den Noortgate & Onghena, 2003a). This approach was illustrated by Van den Noortgate and Onghena (2008), reanalyzing the meta-analytic data set of Shogren, Fagella-Luby, Bae, and Wehmeyer (2004), consisting of the results of 30 study subjects from 13 SSED studies that evaluated the impact of providing choice-making opportunities on problem behavior. In some of these studies, the intervention allowed subjects to choose the order of the tasks, whereas in other studies, subjects could make a choice between two tasks. Problem behavior refers to, for instance, aggressive behavior, destructive behavior, and off-task behavior. To compare results over studies, scores within studies were standardized by dividing the scores for each subject by the within-condition standard deviation.

Results of the three-level meta-analysis revealed that, on average, the amount of problem behavior during the baseline phase was about 2.72 points, whereas the average standardized mean difference between phases was −1.72. This 63% reduction of problem behavior was statistically significant, z = 4.91, p < .0001. Estimates of the between-study (co)variance of the baseline level and the choice effect were small and statistically not significant. Within studies, however, there was significant variation over subjects, both in the baseline levels and in the choice effects. There was also some evidence for a negative covariation between baseline level and effect, suggesting that the intervention has a smaller effect for subjects with a relatively high baseline level of problem behavior, but this covariation was not statistically significant at the .05 significance level. Van den Noortgate and Onghena (2008) explored whether the variation in baseline levels and in the effect sizes could be explained with a characteristic of the subjects (age) and a characteristic of the study (the kind of choice that was given), but neither of the two characteristics was found to have an effect and to reduce the variation.

Although this meta-analysis is atypical, in that raw data were analyzed rather than effect sizes, we found that combining effect sizes (more specifically, standardized regression coefficients for each subject) for this example gave almost identical parameter estimates and corresponding standard errors for the mean treatment effect, the variance of the treatment effect over subjects and studies, and the moderating effects of both predictors.

Fig. 2 Three-level hierarchical structure in the priming study meta-analysis: participants (p) nested within experiments (E), nested within studies

Fig. 3 Three-level hierarchical structure in a set of SSED studies: measurements (m) nested within subjects (S), nested within studies

Example 3: multiple outcomes per sample

Geeraert, Van den Noortgate, Grietens, and Onghena (2004) performed a meta-analysis of 40 studies evaluating the effect of early prevention programs for families with young children at risk for physical child abuse and neglect.

Whereas the ultimate goal of each of the evaluated early support programs was to reduce child abuse and neglect, evaluating a reduction of child abuse and neglect due to the program is not straightforward, because parents try to hide these aspects. We found that there were large differences between studies in the criteria used to evaluate the effect of the programs. Some studies used direct reports of child abuse or neglect by child protective services; some studies used indirect indications of child abuse and neglect, such as reports of hospitalization, the frequency of medical emergency service visits, contacts with youth protection services, and out-of-home placements. A lot of studies also evaluated the reduction of risk, looking at, for instance, the well-being of the child, parent–child interaction characteristics, and social support. Most studies calculated the effect for more than one outcome variable. The number of effect sizes per study varied from 1 to 52, with an average of almost 15 effect sizes per study. A complicating factor for this analysis of dependent effect sizes, therefore, was that a lot of very different outcome variables were used in the studies. Even after categorizing the outcomes in broad categories, we ended up with 11 types of outcomes. Moreover, because the criteria do not refer to well-studied standardized measurement instruments, we did not have a clue about the correlations between these (types of) outcomes. In this situation, the use of a multivariate model is not feasible.

Therefore, Geeraert et al. (2004) used a three-level model, modeling the sampling variation for each effect size (level 1), variation over outcomes within a study (level 2), and variation over studies (level 3), as shown in Fig. 4.

Geeraert et al. (2004) found a small but statistically significant overall standardized mean difference, equal to 0.29, z = 6.59, p < .001, but also strong evidence for systematic differences between studies and within studies. Because a large part of the studies evaluated one specific program (the Healthy Families America program), an indicator was included to evaluate whether the effect for this intervention differed from the effect of other programs, but the difference was not statistically significant. A set of ten dummy variables was used to explore whether the type of outcome explained part of the variance within and between studies, but an omnibus test revealed that there were no differences between the 11 outcome categories.

Whereas in the first two examples, the use of three-level models for meta-analysis is straightforward (for other examples with separate effect sizes for different subsamples, as in our first example, see Marsh, Bornmann, Mutz, Daniel, & O'Mara, 2009; Thompson, Turner, & Warn, 2001), this is not the case for the third example, in which several outcomes are based on the same sample. Indeed, it is clear from Fig. 4 that the structure assumed for the analysis does not correspond to the real data structure. Whereas, in reality, we have samples (level 1) nested in studies (level 2), with multiple effect sizes for each sample, the structure assumed in the three-level analysis implies that each sample corresponds to one outcome—in other words, that outcomes are calculated for independent samples. Therefore, the multilevel approach in principle incorrectly assumes that the resulting effect size measures are independent at the sample level. Yet it might be expected that the dependence between outcomes from the same study is taken into account by using an intermediate level of outcomes within studies. Although the multilevel approach is appealing, especially in situations in which we have a lot of outcome variables and/or the correlations between outcomes are unknown, it has not been validated yet. In the remainder of the article, we will evaluate the three-level approach by comparing its results with those of a multivariate model and a traditional two-level model, by means of a simulation study.

An evaluation of three-level meta-analyses with sampling covariation

"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful" (Box & Draper, 1987, p. 74). In this section, we will evaluate the use of a meta-analytic model that incorrectly assumes independent samples at the first level but includes an intermediate outcome-within-study level. To illustrate how parameter estimates are affected, we start with one large simulated data set and next look at the results of an extensive simulation study. All data were simulated using SAS and were analyzed using the restricted maximum likelihood procedure implemented in SAS Proc Mixed (Littell et al., 2006).

Although the focus of this article is on the meta-analysis of effect sizes, we do not simulate effect sizes directly but, rather, simulate raw data and calculate for each study a standardized mean difference. In this way, we can compare the results of an ordinary multilevel raw data analysis with those of a meta-analysis on effect sizes. Moreover, whereas the use of a multivariate model for effect sizes is difficult in practice because of a lack of information about the sampling covariation, a comparison of the results of a three-level analysis with those of a multivariate analysis on the raw data can give further insight into the three-level approach.

Fig. 4 Three-level hierarchical structure assumed for the child abuse meta-analysis: participants (p) nested within outcomes (O), nested within studies

Understanding the three-level analysis

In order to better understand the interpretation of the results of a three-level analysis for effect sizes based on the same sample, one data set was simulated and analyzed, and the results of the (two-level) multivariate model and of the (univariate) three-level model were compared. Data were simulated using the following bivariate model, with $Y_{ijk}$ referring to the value of outcome j (j = 1, 2) for study subject i from study k:

$$\begin{aligned} Y_{i1k} &= \beta_{01k} + \beta_{11k}(\text{Treatment})_{ik} + e_{i1k} \\ Y_{i2k} &= \beta_{02k} + \beta_{12k}(\text{Treatment})_{ik} + e_{i2k} \end{aligned} \quad \text{with} \quad \begin{pmatrix} e_{i1k} \\ e_{i2k} \end{pmatrix} \sim N\left(0, \begin{pmatrix} \sigma_{e_1}^2 & \sigma_{e_1 e_2} \\ \sigma_{e_1 e_2} & \sigma_{e_2}^2 \end{pmatrix}\right) \quad (9)$$

$$\begin{aligned} \beta_{01k} &= \gamma_{010} + w_{01k} \quad \text{and} \quad \beta_{11k} = \gamma_{110} + w_{11k} \\ \beta_{02k} &= \gamma_{020} + w_{02k} \quad \text{and} \quad \beta_{12k} = \gamma_{120} + w_{12k} \end{aligned} \quad \text{with} \quad \mathbf{w}_k \sim N(0, \Omega_w) \quad (10)$$

Because we have two outcome variables that possibly are related, the covariance matrix at the subject level (level 1) is a 2 × 2 matrix, with a positive value for $\sigma_{e_2 e_1}$ indicating that if, in a study, a subject is sampled who has a high score on $Y_1$, this person is also likely to have a high score on $Y_2$. At the study level, we have a random residual for each outcome for the expected value in the control condition ($w_{01k}$ and $w_{02k}$), as well as a random residual for each outcome for the treatment effect ($w_{11k}$ and $w_{12k}$), and therefore, the covariance matrix at this level is a 4 × 4 matrix ($\Omega_w$). The discussion and illustration of what happens if we use a three-level model ignoring the sampling covariance, instead of a two-level multivariate model accounting for it, will be split up into three steps. First, we discuss what happens to the parameter estimates if the mean effect is estimated, rather than the effect for each outcome variable separately. Second, we look at the effect of ignoring the sampling covariance (i.e., the covariance at the first level, the subject level), and finally, we discuss the use of a three-level model.

The parameter values that were used to simulate data for 300 studies (K = 300), each with two groups of size 50 (n = 50), are given in the second column of Table 1.
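For reference, the generating model is easy to reproduce. The sketch below draws data from Eqs. 9 and 10 using the Table 1 population values; setting all four cross-covariances between intercepts and treatment effects in $\Omega_w$ to −0.025 is our reading of the note under Table 1:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 300, 50                        # 300 studies, two groups of 50 each

gamma_0 = np.array([0.0, 0.0])        # control condition means (outcomes 1, 2)
gamma_1 = np.array([0.1, 0.3])        # treatment effects (outcomes 1, 2)
Sigma_e = np.array([[1.0, 0.8],       # subject level covariance matrix (Eq. 9)
                    [0.8, 1.0]])
Omega_w = np.array([                  # study level covariance of (w01, w02, w11, w12)
    [ 0.100,  0.050, -0.025, -0.025],
    [ 0.050,  0.100, -0.025, -0.025],
    [-0.025, -0.025,  0.100,  0.020],
    [-0.025, -0.025,  0.020,  0.100]])

studies = []
for k in range(K):
    w = rng.multivariate_normal(np.zeros(4), Omega_w)   # Eq. 10
    b0, b1 = gamma_0 + w[:2], gamma_1 + w[2:]
    for treat in (0, 1):
        e = rng.multivariate_normal(np.zeros(2), Sigma_e, size=n)
        Y = b0 + b1 * treat + e       # n subjects x 2 outcomes (Eq. 9)
        studies.append((k, treat, Y))
```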

Model 1

The first model we used for analyzing the simulated data (model 1 from Table 1) is the multivariate model that we used to generate the data (Eqs. 9 and 10). The parameter estimates are close to the parameter values used to generate the data.

Model 2

Sometimes, we might be interested not in the effect for each outcome separately but, rather, in the mean effect over outcomes. In the second model, we did not estimate the mean intercept and the mean effect for each outcome variable separately but, instead, estimated one overall intercept ($\gamma_{00}$, with $\beta_{01k} = \gamma_{00} + w_{01k}$ and $\beta_{02k} = \gamma_{00} + w_{02k}$) and one overall effect ($\gamma_{10}$, with $\beta_{11k} = \gamma_{10} + w_{11k}$ and $\beta_{12k} = \gamma_{10} + w_{12k}$). As a result (compare models 1 and 2), the between-study variance of the treatment effects increases (because, in general, the deviations of the treatment effects from the overall treatment effect become larger), but the between-study covariance decreases (because deviations within studies are less similar than before: one deviation increases, the other decreases). The estimated treatment effect now is in the middle of the previously estimated treatment effects for both outcomes.

Model 3

In the third model, we assume $\begin{pmatrix} e_{i1k} \\ e_{i2k} \end{pmatrix} \sim N(0, \sigma_e^2 \mathbf{I})$. We thus estimated only one variance parameter for both outcomes at the subject level (which is not a real restriction because, if the within-study variance cannot be assumed the same, scores are typically standardized). But more important, we assumed that the covariance is zero, whereas in fact, in the simulated data set, it is 0.79. Table 1 shows that as a result, the study level covariances between intercepts and between treatment effects become larger. This can be understood as follows. In the multivariate model, the intercept $\beta_{0jk}$ refers to the expected value (or population mean) for outcome j in the control condition of study k, and this expected value varies over studies. Due to the fact that within a study only a sample of subjects is used, rather than the whole population, the observed mean is not necessarily the same as the expected value. More specifically, the sampling variance of the observed mean in the control condition for an outcome j is equal to $\sigma_{e_j}^2 / n$. Because the residuals at both levels are independent, the total variance in the observed control group means is equal to the sum of the variance over studies in the expected control group level and the sampling variance: $\sigma_{w_{0j}}^2 + \sigma_{e_j}^2 / n$. Similarly, the sampling covariance between the means for two outcomes is equal to $\sigma_{e_j e_{j'}} / n$, and the total covariance is $\sigma_{w_{0j} w_{0j'}} + \sigma_{e_j e_{j'}} / n$. When the full multivariate model 1 is used, the observed variation and covariation in the intercepts is split up over the two levels. The results of the analysis show that if the covariance at the first level is not explicitly included in the model (model 3), the estimated sampling covariance in the control condition study means is added to the estimated covariance at the between-study level: $0.055 = 0.040 + 0.790/50$.

The covariance in the treatment effects is affected in a similar way. The effect of the treatment dummy variable, $\beta_{1jk}$, refers to the difference in the expected values for the experimental and the control groups for outcome j in study k. The observed mean difference again varies, due to sampling variation and due to between-study variation. More specifically,

$$\bar{Y}_{\cdot jk|\text{exp}} - \bar{Y}_{\cdot jk|\text{ctl}} = \left(\beta_{0jk} + \beta_{1jk} + \bar{e}_{\cdot jk|\text{exp}}\right) - \left(\beta_{0jk} + \bar{e}_{\cdot jk|\text{ctl}}\right) = \gamma_{1j0} + w_{1jk} + \bar{e}_{\cdot jk|\text{exp}} - \bar{e}_{\cdot jk|\text{ctl}} \quad (11)$$

where $\bar{Y}_{\cdot jk|\text{exp}}$ and $\bar{Y}_{\cdot jk|\text{ctl}}$ refer to the observed means in study k for outcome j in the experimental and control conditions and $\bar{e}_{\cdot jk|\text{exp}}$ and $\bar{e}_{\cdot jk|\text{ctl}}$ refer to the mean residuals in both conditions. Because residuals from the subject and study levels are independent and the residuals in both groups are independent, it is easy to show that the covariation between the mean differences for both outcomes is equal to $\sigma_{w_{1j} w_{1j'}} + 2\sigma_{e_j e_{j'}}/n$. Again, results of the example show that ignoring the sampling covariance (the covariance at the subject level) has as a consequence that the estimated sampling covariation in the coefficients is added to the estimated between-study covariance: $0.038 = 0.006 + 2 \times 0.790/50$.

We conclude that, by ignoring the sampling covariation at the subject level, the covariation is still accounted for by overestimating the study level covariation by the same amount. Therefore, all other parameters and standard errors remain unchanged.

Table 1 Parameter values and estimates (and standard errors) for a two-level multivariate data set

Fixed effects
  Intercept: model 2 = −0.019 (0.020); model 3 = −0.019 (0.020)
    Outcome 1: parameter value 0.000; model 1 = 0.002 (0.022)
    Outcome 2: parameter value 0.000; model 1 = −0.009 (0.026)
  Treatment effect: model 2 = 0.209 (0.021); model 3 = 0.209 (0.021)
    Outcome 1: parameter value 0.100; model 1 = 0.109 (0.025)
    Outcome 2: parameter value 0.300; model 1 = 0.310 (0.025)

(Co)variances (each 2 × 2 matrix written as [var(outcome 1); cov, var(outcome 2)])
  Level 2 (study), intercepts: parameter values [0.100; 0.050, 0.100]; model 1 = [0.080; 0.040, 0.112]; model 2 = [0.080; 0.040, 0.111]; model 3 = [0.080; 0.055, 0.111]
  Level 2 (study), treatment effects: parameter values [0.100; 0.020, 0.100]; model 1 = [0.089; 0.016, 0.090]; model 2 = [0.098; 0.006, 0.100]; model 3 = [0.098; 0.038, 0.100]
  Level 1 (subjects): parameter values [1.000; 0.800, 1.000]; model 1 = [0.990; 0.790, 0.994]; model 2 = [0.990; 0.790, 0.994]; model 3 = 0.992

Note. Standard errors of the estimates are given in parentheses. Model 1 is the multivariate generating model; model 2 includes no outcome indicator; model 3 additionally omits the level 1 covariance. To simplify the table, at the study level, only the covariance between intercepts and the covariance between treatment effects is given, not the covariances between intercepts and treatment effects. Population values for these covariances were −0.025.

Model 4

In a third step, we look at the results of the three-level analysis, with samples, outcomes, and studies as the units at the respective levels. More specifically, we analyze the data using the following model:

$$Y_{ijk} = \beta_{0jk} + \beta_{1jk}(\text{Treatment})_{ik} + e_{ijk} \quad \text{with} \quad e_{ijk} \sim N(0, \sigma_e^2)$$

$$\begin{cases} \beta_{0jk} = \theta_{0k} + v_{0jk} \\ \beta_{1jk} = \theta_{1k} + v_{1jk} \end{cases} \quad \text{with} \quad \begin{pmatrix} v_{0jk} \\ v_{1jk} \end{pmatrix} \sim N\left(0, \begin{pmatrix} \sigma_{v_0}^2 & \sigma_{v_0 v_1} \\ \sigma_{v_0 v_1} & \sigma_{v_1}^2 \end{pmatrix}\right)$$

$$\begin{cases} \theta_{0k} = \gamma_{00} + u_{0k} \\ \theta_{1k} = \gamma_{10} + u_{1k} \end{cases} \quad \text{with} \quad \begin{pmatrix} u_{0k} \\ u_{1k} \end{pmatrix} \sim N\left(0, \begin{pmatrix} \sigma_{u_0}^2 & \sigma_{u_0 u_1} \\ \sigma_{u_0 u_1} & \sigma_{u_1}^2 \end{pmatrix}\right) \quad (12)$$

Equation 12 states that the intercept and treatment effect of outcome j and study k can (co)vary over outcomes within studies, as well as over studies. Whereas $\beta_{0jk}$ and $\beta_{1jk}$ refer to the control group level and treatment effect for outcome j in study k, $\theta_{0k}$ and $\theta_{1k}$ refer to the mean control group level and treatment effect (over outcomes) in study k, and $\gamma_{00}$ and $\gamma_{10}$ to the overall control group level and effect (over outcomes and studies). Parameter estimates for Eq. 12 are given in Table 2 (model 4).

A comparison of the results of the three-level model (model 4 from Table 2) and the multivariate model without a covariance at the first level (model 3 from Table 1) reveals that in the three-level analysis, the variance in the intercepts from the multivariate model is split up into variance over studies and variance over outcomes from the same study: 0.055 + 0.040 = 0.095, which is in the middle of the between-study variances of the intercepts for both outcomes in the multivariate model (0.080 and 0.111). Similarly, the variance over studies of both treatment effects (0.098 and 0.100) from the multivariate model is split up in the three-level model into variance over studies and variance over outcomes: 0.038 + 0.061 = 0.099.

The part going to the between-study level is equal to the covariance between the outcomes: 0.055 for the intercepts, 0.038 for the treatment effects. This is not surprising: In a multilevel analysis with units within groups, the variance between groups is equal to the covariation within groups (Snijders & Bosker, 1999). Intuitively, this also makes sense: The larger the covariation between outcomes, the smaller the differences are within studies, and the larger the differences are between study means. If outcomes do vary over studies but do not covary, we expect to find differences between outcomes in a study, but on average, there will be no difference between studies.
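This identity can be verified in one line from the model. In the notation of the three-level effect size model (study residual $u_{1k}$, outcome residual $v_{1jk}$, sampling residual $r_{jk}$), two effect sizes from the same study share only the study-level residual, so, assuming the outcome-level and sampling residuals are independent across outcomes,

$$\operatorname{Cov}\!\left(\hat{\beta}_{1jk},\, \hat{\beta}_{1j'k}\right) = \operatorname{Cov}\!\left(u_{1k} + v_{1jk} + r_{jk},\; u_{1k} + v_{1j'k} + r_{j'k}\right) = \sigma_{u_1}^2 \qquad (j \neq j').$$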

Importantly, the mean effect estimate and the corresponding standard error from the three-level model (model 4) are exactly the same as those of the multivariate model (model 3, but also model 2).

Model 5

Finally, we summarized, for each outcome and study, the data in a standardized mean difference, estimated the corresponding sampling variance (Hedges & Olkin, 1985), and analyzed these effect sizes using a three-level meta-analytic model (Eqs. 6, 7, and 8). Comparing the results (see Table 2, model 5) with those of model 4 illustrates that the meta-analysis of the effect sizes gives almost identical parameter estimates as the analysis of the raw data.
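In terms of the sketches given earlier, model 5 amounts to chaining them: compute one standardized mean difference per outcome and study from the raw data, then feed the results to the three-level REML routine. The glue below is schematic; iter_study_outcomes is a hypothetical helper that pairs the treatment and control scores of each outcome within each study:

```python
# Schematic glue code, reusing smd_and_variance() and three_level_meta()
# from the sketches above; iter_study_outcomes() is hypothetical.
es, var_es, study_id = [], [], []
for k, outcome, y_treat, y_ctrl in iter_study_outcomes(studies):
    d, v = smd_and_variance(y_treat, y_ctrl)
    es.append(d)
    var_es.append(v)
    study_id.append(k)

gamma, se, s2_outcome, s2_study = three_level_meta(es, var_es, study_id)
```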

Simulation study

In order to evaluate the three-level modeling approach in a more systematic way, we simulated a number of data sets in the same way as the example data set. The following parameters were varied: the number of outcomes (J = 2 or 5), the covariance between outcomes at the subject level ($\sigma_{e_j e_{j'}}$ = 0, 0.4, or 0.8), the covariance in treatment effects between outcomes at the study level ($\sigma_{w_{1j} w_{1j'}}$ = 0, 0.02, or 0.04), the overall mean effect size (0, 0.20, or 0.40), the deviation of the outcome effects from the overall mean effect (all two or five outcomes have the same effect vs. deviations of −0.20 and +0.20 for outcomes 1 and 2, or of −0.20, −0.10, 0, 0.10, and 0.20 for the five outcomes), the number of studies (K = 30 or 60), and the group sizes (n = 25 or 50). Values for the mean effects were chosen to be representative of the small and moderate effects commonly found in the social and behavioral sciences (Cohen, 1988).

Table 2 Parameter estimates (and standard errors) using three-level models

Fixed effects
  Intercept: model 4 = 0.003 (0.021)
  Treatment effect: model 4 = 0.210 (0.021); model 5 = 0.207 (0.021)

(Co)variances (each 2 × 2 matrix written as [var(intercept); cov, var(treatment effect)])
  Level 3 (study): model 4 = [0.055; −0.035, 0.038]; model 5 = 0.035
  Level 2 (outcomes within studies): model 4 = [0.040; 0.012, 0.061]; model 5 = 0.061
  Level 1 (subjects): model 4 = 0.992; model 5 = *

Note. Model 4 is the three-level model for the raw data; model 5 is the three-level model for the effect sizes, for which only the treatment effect is modeled.
* For the effect size analysis, the sampling variance, $\sigma_{r_{jk}}^2$, depends on the study and is estimated before the actual meta-analysis is performed.

The value used for the between-study variance in effect sizes (0.10) results in a realistic ratio with the sampling variance of the observed effect sizes (about 0.08 for studies with n = 25 and 0.04 for studies with n = 50) and avoids the possibility that the effects of correlation at either level become ignorable (Riley, 2009). Covariances at the subject level were chosen to cover the range of possible values of correlation coefficients (the covariances correspond with correlation coefficients of 0, .4, and .8). Whereas a high correlation at the first level is not unlikely in reality (e.g., when outcomes refer to different instruments for the same construct), correlation coefficients at the second level are likely to be small (Lipsey & Wilson, 2001). Therefore, we used relatively small covariances at this level, corresponding to correlation coefficients of 0, .20, and .40. Differences between the effects of the outcomes were chosen to be relatively large (e.g., for two outcomes and an overall mean of .20, the outcome-specific mean effects are equal to 0 and .40), to be sure that a possible effect on the performance of the different models would be visible.

In total, we therefore have 2 × 3 × 3 × 3 × 2 × 2 × 2 = 432 combinations. For each combination, we simulated 1,000 data sets, 432,000 in total. Each data set was analyzed using the multivariate model, but including only one parameter for the treatment effects for all outcomes (model 2 from Table 1); with a three-level model (model 4 from Table 2); and with a traditional meta-analytic model that ignored the dependence of outcomes within studies (Eqs. 5 and 3)—that is, a two-level model without the third level, the study level. In addition, we summarized the data for each outcome within each study by calculating standardized mean differences, using the simulated raw data, and analyzed the resulting effect sizes with three-level and two-level models equivalent to the raw data models.
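Because the design is fully crossed, the condition list can be written down explicitly (a sketch; the factor labels are ours):

```python
from itertools import product

# The seven manipulated factors of the simulation design
conditions = list(product(
    (2, 5),                 # J: number of outcomes
    (0.0, 0.4, 0.8),        # subject level covariance between outcomes
    (0.0, 0.02, 0.04),      # study level covariance of treatment effects
    (0.0, 0.20, 0.40),      # overall mean effect size
    ("equal", "varying"),   # outcome effects equal vs. deviating by +-0.20
    (30, 60),               # K: number of studies
    (25, 50),               # n: group size
))
assert len(conditions) == 432   # 2 x 3 x 3 x 3 x 2 x 2 x 2
```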

We want to stress that if we have the raw data, the multivariate model can be used, because the sampling covariance can be estimated using the raw scores, and there is no reason to use the three-level approach unless the number of outcomes is very large and we are simply interested in the mean effect. Yet, if we have only effect sizes, often not enough qualitative information is available regarding the sampling covariances, and a multivariate model is not applicable.

    Results of the simulation study

    A first important result is that for all conditions, param-eter estimates regarding the treatment effects (meaneffects and standard errors, as well as variance components)

    are almost identical when performing multilevel meta-analyses of the effect sizes, rather than performing mul-tilevel analyses on the raw data. Therefore, only the rawdata results will be discussed, but conclusions equallyapply to the meta-analyses of effect sizes. Moreover,because we are interested in the treatment effects—morespecifically, the average effect and the variation in treat-ment effects—we will not discuss the results for theintercepts (referring to the level of performance in the controlconditions).

Second, regarding the mean effects, we did not find important bias in any of the conditions (with absolute bias estimates never exceeding 0.006), and therefore, bias estimates are not further discussed here. We also found that the estimates of the mean effects are very similar for each of the models, with correlations of .99 and more. Estimates of the overall effects were even exactly the same for the univariate two- and three-level models. As was expected, the MSE of the estimates decreases with an increasing number of studies, an increase in the size of studies, and a decreasing covariance at the subject or study level. Bias and MSE estimates can be found in the Appendix.

Of major interest in our simulation study are the estimated standard errors for the mean effects, because these are used to test the mean effects and to construct confidence intervals. We evaluated the standard error estimates in two ways. First, standard errors corresponding to a parameter estimate refer to the standard deviation of the sampling distribution of the parameter estimator. Because we simulated a large number of data sets for each condition, it can be assumed that the standard deviation of the parameter estimates within each condition is a good approximation of the standard deviation of the sampling distribution and, therefore, can be used as a criterion to evaluate the standard error estimates. Figure 5 shows the mean standard error estimates and the standard deviations of the estimates for n = 25, K = 50, a study-level covariance of .02, a mean effect of 0.4, and two outcome variables (trends are similar for other conditions). When we have a common outcome effect (Fig. 5a), the standard deviation of the estimates is slightly larger when the multivariate model is used, as compared with using the multilevel model. This is because, in the multivariate model, more parameters are estimated using the same amount of data and, therefore, the estimates fluctuate more. Nevertheless, the mean standard error estimates are slightly smaller for the multivariate model than for the three-level model. In sum, whereas the standard errors for the three-level model are accurate, for the multivariate model they are somewhat too small. It is also clear from the graph that the variation in parameter estimates increases with an increasing covariance: If outcomes covary


substantially, a large deviation of the scores on one outcome in a study is less likely to be compensated for by the scores on the other outcome. Standard errors for both the multivariate approach and the three-level approach follow this increasing trend. This is not the case if a two-level model is used: Standard errors do not depend on the sampling covariation. Therefore, whereas standard errors for this model are only slightly too small if there is no sampling covariation (due to between-study covariation of the outcomes), their negative bias becomes much more pronounced with increasing sampling covariance.

If the effect is not the same for each outcome but the same models are used to estimate the mean effect (Fig. 5b), estimates for the multivariate model vary much more. Yet standard errors of the multivariate and three-level models remain the same. Whereas standard errors of the three-level model are still relatively accurate, with only a small positive bias when there is no sampling covariance and a small negative bias if there is a strong sampling covariance, for the multivariate model they are severely negatively biased. The results suggest that the multivariate model does not work well for estimating the mean effect. The standard errors for the two-level model are larger than when there is no variation between outcomes (and can even be positively biased if there is no sampling covariation), but again they do not increase with increasing sampling covariance, resulting in negative bias if there is large sampling covariance.

A second way to evaluate the standard error estimates is by looking at the coverage proportion of the confidence intervals calculated using these standard error estimates. Table 3 gives the coverage proportions for 90 % confidence intervals for two and for five outcomes.1 A dimension that is not included in the tables (but does not affect the coverage proportion) is the mean effect: In each condition, the coverage proportion is calculated over the three parameter values of the mean effect.

In line with the evaluation of the standard errors by comparing them with the standard deviation of the parameter estimates, we see that the coverage proportions for the 90 % confidence intervals are accurate for the three-level models in almost all combinations. Only if there is no covariation at either level and there are differences between outcomes is the coverage proportion slightly too high.

For the multivariate model, coverage proportions are, in general, slightly too small if the mean effect is the same for all outcomes, again suggesting that the standard errors are too small. When there are real differences between outcomes in the treatment effects (right side of the table), the coverage proportions are much too small for all combinations, suggesting that the multivariate model does not work well for estimating the mean effect. Also, for the two-level model, coverage proportions often are too small, unless outcomes do not covary at either level. As could be expected, conclusions are similar but more pronounced if we have five outcomes instead of two outcomes per study, resulting in much too small coverage proportions for the traditional two-level model.

If the mean effect is zero, the coverage proportion also refers to the proportion of confidence intervals including zero and, therefore, to the probability of not making a type I error. If the dependence is ignored, for some conditions the proportion of type I errors is very substantially inflated (up to 36 %, instead of the nominal 10 % significance level; i.e., when there is substantial covariation at each level and we have five outcomes with a common mean effect).

Fig. 5  Standard deviation (SD) of the overall effect estimates and mean standard error estimates (SE) for the multivariate model, the three-level model, and the two-level model ignoring dependence, plotted against the sampling covariance. Left graph (a): same effect for all outcomes; right graph (b): outcome-specific effects

1 For the data with five outcomes, we estimated only the meta-analytic models for combining effect sizes because raw data analyses, and especially the multivariate analysis, were very time consuming. Therefore, Table 3 does not include results for the multivariate model for five outcomes.

Table 3  Coverage proportions of the 90 % confidence intervals for the multivariate and the univariate three- and two-level models

Common effect

                          σu = 0                σu = 0.02             σu = 0.04
                          K = 30     K = 60     K = 30     K = 60     K = 30     K = 60
J   σe    Model           n25   n50  n25   n50  n25   n50  n25   n50  n25   n50  n25   n50
2   0     Multiv.         .87   .87  .88   .87  .86   .87  .88   .88  .86   .85  .89   .89
          3-level         .91   .92  .91   .90  .90   .91  .90   .90  .90   .89  .91   .91
          2-level         .89   .91  .89   .89  .88   .88  .87   .88  .86   .84  .87   .86
    0.4   Multiv.         .88   .87  .89   .89  .87   .85  .88   .89  .86   .86  .89   .88
          3-level         .91   .91  .90   .91  .90   .89  .89   .91  .89   .89  .90   .89
          2-level         .88   .88  .86   .89  .85   .84  .85   .86  .83   .83  .84   .84
    0.8   Multiv.         .87   .86  .87   .88  .87   .86  .89   .88  .86   .86  .88   .87
          3-level         .89   .90  .89   .89  .90   .89  .90   .90  .89   .89  .89   .89
          2-level         .83   .85  .83   .86  .82   .83  .83   .84  .80   .80  .80   .81
5   0     3-level         .92   .91  .90   .92  .90   .89  .89   .91  .90   .89  .90   .90
          2-level         .90   .89  .89   .90  .82   .81  .82   .82  .78   .73  .77   .75
    0.4   3-level         .90   .89  .89   .89  .89   .89  .90   .89  .91   .89  .90   .90
          2-level         .79   .81  .77   .82  .74   .75  .74   .74  .70   .68  .69   .69
    0.8   3-level         .90   .89  .89   .89  .90   .89  .90   .90  .90   .90  .90   .90
          2-level         .70   .75  .71   .74  .68   .69  .67   .70  .63   .65  .65   .64

Outcome-specific effect

                          σu = 0                σu = 0.02             σu = 0.04
                          K = 30     K = 60     K = 30     K = 60     K = 30     K = 60
J   σe    Model           n25   n50  n25   n50  n25   n50  n25   n50  n25   n50  n25   n50
2   0     Multiv.         .79   .76  .81   .78  .77   .75  .80   .78  .77   .74  .78   .75
          3-level         .94   .94  .93   .93  .93   .93  .92   .93  .92   .91  .91   .91
          2-level         .93   .93  .93   .93  .91   .92  .92   .92  .90   .89  .89   .90
    0.4   Multiv.         .78   .75  .81   .76  .77   .75  .77   .77  .73   .72  .76   .74
          3-level         .91   .93  .91   .92  .91   .92  .89   .92  .90   .89  .89   .91
          2-level         .90   .92  .91   .91  .89   .90  .88   .90  .87   .86  .86   .89
    0.8   Multiv.         .76   .75  .78   .77  .74   .71  .76   .76  .72   .68  .74   .71
          3-level         .90   .92  .90   .90  .90   .90  .90   .91  .90   .89  .90   .91
          2-level         .89   .91  .88   .90  .87   .88  .86   .90  .86   .87  .86   .87
5   0     3-level         .92   .93  .92   .92  .91   .90  .91   .91  .89   .89  .90   .90
          2-level         .91   .92  .91   .91  .86   .83  .85   .85  .79   .76  .78   .77
    0.4   3-level         .90   .90  .89   .90  .90   .89  .90   .89  .89   .90  .90   .90
          2-level         .81   .85  .82   .86  .76   .77  .77   .77  .71   .72  .73   .72
    0.8   3-level         .89   .89  .90   .89  .90   .88  .90   .89  .90   .89  .91   .91
          2-level         .73   .78  .73   .79  .68   .72  .69   .73  .65   .68  .67   .69

Note: σu = study-level covariance between outcomes; σe = subject-level covariance between outcomes; J = number of outcomes per study; K = number of studies; n = sample size. The multivariate model was not estimated for J = 5 (see Footnote 1).

The models give not only mean effect and corresponding standard error estimates, but also estimates of the between-study (and, for the three-level model, the between-outcome) variance. In our simulation study, we saw the same patterns as in the elaborated simulated example discussed above: In the three-level model, the total variance between effect sizes is split up into two parts, one referring to the variation between outcomes within a study, one referring to the variation between studies. Figure 6 shows this split-up for the case in which the sample size n = 25, the number of outcomes is two, the treatment effect is common to all outcomes, and the study-level covariance between outcomes is equal to 0 (left graph) and 0.04 (right graph). As was discussed above, the variance between studies refers to the total covariation between outcomes, σu11u12 + 2σe1e2/n, which for the simulated conditions is equal to 0.00, 0.032, and 0.064 (left graph) and 0.040, 0.072, and 0.104 (right graph), depending on the sampling covariance. Figure 6 shows that in the three-level model, the variance estimate at the study level is an unbiased estimate of the total covariance between outcomes. Only in the condition where the total covariance between outcomes is equal to 0.104 is the estimated study-level variance slightly smaller. More specifically, the study-level variance does not exceed the between-study variance of the model used to simulate the data (0.100). Furthermore, Fig. 6 also shows that in the three-level model, the variance between studies (0.100) is split up into a variance at the study level and a variance at the outcome level, although the sum of both variance estimates is typically slightly smaller. This negative bias is especially visible if there is no covariance: The total variance is now only 0.090, instead of 0.100. Figure 6 also shows that the estimate of the between-study variance when using the two-level model (ignoring the dependence) is always almost equal to the between-study variance of the multivariate model (0.10). Similar patterns are found for other conditions.

Fig. 6  Median variance estimates (three-level model: between-study and between-outcome level; two-level model: between-outcome level) if the covariance at the study level is zero (left) or 0.04 (right), plotted against the sampling covariance (n = 25)

    Conclusions

In this article, we described the use of three-level models for dealing with dependent outcomes in meta-analysis and illustrated this approach using several real data examples. Whereas most traditional approaches for combining dependent effect sizes involve choosing a proper unit of analysis (Cooper, 2009), multilevel models were developed exactly "for the analysis of data sets comprising several types of unit of analysis" (Snijders, 2003, p. 674). Three-level models are suitable when studies are clustered—for instance, in research groups or countries—or when, within studies, we have several effect sizes calculated on independent groups, two situations in which dependencies are typically ignored. Sometimes, it is even explicitly stated that as long as there is no overlap between samples, multiple estimates from the same studies are truly independent (Littell et al., 2008). Our simulation study nonetheless shows that ignoring covariance at the study level also might result in biased standard errors and confidence interval coverage proportions, even if samples are independent.

An important conclusion is that, although the multilevel model we proposed for dealing with multiple outcomes within the same study in principle assumes no sampling covariation (or independent samples), our simulation study suggests that using an intermediate level of outcomes within studies succeeds in accounting for the sampling covariance in an accurate way, yielding appropriate standard errors and interval estimates for the effects. This is true for both the three-level model for raw data and the three-level model for analyzing effect sizes, which gave very similar results. In this article, we mainly presented the results for the analysis of raw data, because in this way we could compare the results with those of a multivariate approach. Whereas a multivariate approach, in principle, is also possible for the analysis of effect sizes, estimating the sampling covariance is often not possible without having the raw data, making the multivariate approach not always feasible in practice. The simulation study also showed that using a random effects meta-analytic model assuming independent effect sizes might result in flawed inferences. This is especially true if the outcome variables are more correlated and the number of outcomes per study is higher. For instance, coverage proportions of the 90 % confidence intervals for the mean effects were typically between .65 and .75 if the number of outcomes was five, with intercorrelations of .80. This result is not unexpected: In general, the underestimation of the standard errors when clustered data are treated as independent data depends on the intraclass correlation (ICC) and the cluster size.


For a two-level structure, for instance, Kish (1965) showed that the design effect—that is, the efficiency loss when cluster sampling is used for estimating a population mean rather than using simple random sampling—is equal to 1 + (n − 1)ICC, with n being the cluster size. If the size of the clusters does not vary too much over clusters, the design effect is approximated well by replacing n with the average cluster size.

Although the link with meta-analysis was already made early in the development of multilevel analysis theory (Raudenbush & Bryk, 1985), meta-analysts seem to have failed to take advantage of the power of multilevel models for meta-analysis. Three-level analyses, for instance, are still very uncommon. We nevertheless see several important advantages to using a multilevel approach.

First, the multilevel model is a very flexible model. Whereas in meta-analysis the mixed effects meta-analytic model with a random study effect is often regarded as the ultimate model, because it can account for moderator variables without making the strong assumption that these moderator variables explain all systematic variation between studies, the multilevel model is even more general than this two-level model, allowing one to define additional levels. In this way, it is possible to account for several sources of dependence at the same time. In our simulation study, we showed that a three-level model can account for sampling covariance, but the model can easily be extended with an additional upper level to account for a possible nesting of studies. If we have multiple outcomes based on the same sample, a multilevel model without predictors can be used to estimate the mean effect over all samples. Yet we can also include a characteristic of the outcomes as a predictor variable to explore whether the effect depends on the type of outcome. We can even include an outcome indicator as a predictor (as a set of dummy variables) to estimate the mean effect for each outcome, as is done in a typical multivariate model. Moreover, if we are interested in the treatment effect of each specific outcome, we could also perform separate meta-analyses, although an advantage of performing a multilevel meta-analysis with an outcome indicator over separate meta-analyses is that we can perform an omnibus test of differential mean effects—that is, a test of the effect of the outcome indicator. Moreover, contrasts can be estimated and tested—for instance, for evaluating whether the treatment effect is the same for two specific outcomes or for evaluating whether the effect for outcome one differs from the effect of outcomes four and five, tests that are less straightforward when performing separate meta-analyses for each outcome. The model can also easily be extended by including other covariates at each of the levels, in an attempt to explain variation at that level. An interaction term (or contrasts) can be used to estimate and test differential moderating effects of a predictor for various outcomes or for various types of outcomes. An important property of the multilevel model is that it does not require that the number of outcomes be the same for each study: If one study reports the effect for one outcome and another study the effect for five outcomes, all effects are used in the analysis. We further want to note that although, in our examples and discussion, we used standardized mean differences as the effect size metric, the multilevel meta-analytic model is equally appropriate for combining other commonly used metrics, such as odds ratios or Pearson's correlation coefficients. The approach assumes, as do most meta-analytic approaches, only that the sampling distribution of the metric is normal; therefore, a normalizing transformation of the effect sizes, such as the logarithmic transformation of odds ratios or the Fisher's Z transformation of correlations, might be required.

A second strength of the multilevel approach is that it is a relatively simple and intuitive way to account for dependencies. This is partly because multilevel models are discussed in several excellent handbooks (e.g., Hox, 2002; Raudenbush & Bryk, 2002) and software is widely available to use them—more specifically, specialized software, such as MLwiN and HLM, or general (statistical) software, such as SAS. Although, depending on the situation, other approaches may be very helpful, these other approaches sometimes oversimplify the problem of dependency (e.g., by simply ignoring the dependency), inefficiently use only part of the data for each analysis, or are complex to implement because of a lack of necessary information, as described above.

A third strength of the approach is that multilevel models automatically account for the hierarchical structure in the data. If, for instance, one study results in 20 effect size estimates, this study will not contribute 20 times as much to the estimation of the mean effect, as compared with a study reporting only 1 effect size. Rather, this study is regarded as only one study yielding information about one study-specific mean effect in the distribution of study mean effects. The exact weight of each study will depend on the dependence between effect sizes from the same study: The smaller this dependence, the less the weight given to each of the individual effect sizes depends on the number of effect sizes reported in the study.

Specifically for accounting for sampling covariance, we see three major advantages of the multilevel approach for analyzing multivariate effect size data, as compared with a truly multivariate meta-analytic model. First, the multilevel approach does not require that the sampling covariance of the effect size estimates be "known" in advance. The covariation rather is taken into account by using the between-study variance as a "stand in" for this covariance. This is an important advantage, because the problem that, often, no or little information about the covariances is available is exactly the reason why multivariate meta-analyses are only seldom used. A second advantage is that, because the sampling covariance need not be known in advance, the multilevel model is also more applicable for metrics for which the formula for calculating the sampling covariance has a very complex form, gives biased estimates, or is unknown (see Becker, 2000, for a discussion of the multivariate distribution of commonly used effect sizes).

Finally, in the example of Geeraert et al. (2004), we saw that especially for dependent variables that are difficult to operationalize, there can be much variation between studies in the outcome variables used. Whereas this makes the multivariate model practically infeasible, the three-level model assuming a distribution of outcomes within studies is still easy to use. We recognize, however, that although the possibility of making estimates of the mean effect over outcomes is attractive, using the type of outcome or an outcome indicator as a predictor variable is to be preferred, for both conceptual and statistical reasons: If the effect varies over outcomes, the mean effect is less informative because effects might obscure each other. In addition, if one type of outcome is more often reported in the set of studies, the effect of this outcome will have a stronger influence on the average effect, possibly inducing bias. An example is where studies are less likely to report effect sizes for outcomes that are less affected by the treatment, resulting in a positive reporting bias. Moreover, the simulated data illustrate that including a moderator variable reduces standard errors, as well as the bias in the standard errors.

Although results of the simulation study are very promising, we are aware of some limitations of the study. Conclusions are, in principle, restricted to the conditions for which data were generated. More specifically, we generated data only for a relatively large set of studies (K = 30 or 60). It is known from the literature on multilevel analysis (e.g., Maas & Hox, 2005) or multilevel meta-analysis (e.g., Van den Noortgate & Onghena, 2003b; Viechtbauer, 2005) that when (restricted) maximum likelihood procedures are used, smaller numbers of units at the highest level (in this case, the study level) might result in underestimated standard errors and, therefore, inflated type I error rates for testing the regression coefficients (in this case, the overall effect size and/or the moderator effects), but especially in biased estimates of the variance at the highest level and of the corresponding standard error. Because, in the multilevel literature, 30 units is often regarded as the smallest acceptable number of units at the highest level (e.g., Kreft & De Leeuw, 1998), we did not simulate data sets with less than 30 studies.

Furthermore, we simulated only balanced data sets, with the same sample size and the same number of outcomes for each study. In practice, it often occurs that part of the studies report only one effect size, whereas another part of the studies report two or more effect sizes. In line with the design effect described by Kish (1965), we expect that also in this situation, the multilevel approach will outperform the approach treating all effect sizes as independent, with a benefit that depends on the average number of effect sizes reported per study. Moreover, in our simulation design, we varied the size of the between-study variance of the treatment effect, but we assumed that this heterogeneity variance is equal for each outcome variable. We also varied the between-study covariance in the treatment effects—a positive covariance meaning that if, in a study, the treatment effect of an outcome is relatively large, the treatment effect of the other outcome variable also is likely to be large—but assumed a common covariation for all pairs of outcomes. These limitations can also explain why the three-level model performed even slightly better than the multivariate model when the effect was the same for all outcomes, although the multivariate model was used to generate the data: Whereas the multivariate model in which each outcome is regarded as a separate dependent variable includes a separate between-study variance parameter for each outcome and a separate between-study covariance parameter for each pair of outcomes, the three-level model assumes a common variance and covariance, yielding a more parsimonious model to estimate with the same amount of data and, therefore, more stable estimates. Whereas these assumptions are common in simulation studies (e.g., Hedges et al., 2010), they are often violated in practice. For instance, if two outcomes are measures of the same construct variable (e.g., two depression scales), the dependence is likely to be larger than for two outcomes referring to two different construct variables (e.g., a depression and an anxiety scale). Another situation where the covariation between outcomes can depend on the pair of outcomes is where studies include more than one (independent) sample and multiple outcomes were calculated for each sample: Outcomes from independent samples from the same study still are dependent, but less dependent than outcomes based on the same sample. We are currently investigating the performance of the proposed three-level approach by means of a simulation study set up in much the same way as the present study, for situations in which the number of outcomes varies over studies, if the between-study variance depends on the outcome, if the covariance between outcomes varies over pairs of outcomes, if sample sizes vary over studies, and if outcome effects are randomly sampled.

Author Note  Wim Van den Noortgate, Faculty of Psychology and Educational Sciences, ITEC-IBBT, University of Leuven, Leuven, Belgium; José Antonio López-López, Faculty of Psychology, Universidad de Murcia, Spain; Fulgencio Marín-Martínez, Universidad de Murcia, Spain; and Julio Sánchez-Meca, Faculty of Psychology, Universidad de Murcia, Spain.


Appendix

Table 4  Bias (×10,000) for the overall effect estimate for the multivariate and the univariate three- and two-level models

[The layout of Table 4 mirrors that of Table 5 below; the individual cell values are too garbled in this copy to be recovered reliably. Consistent with the text, all absolute bias values are small, never exceeding 0.006 on the effect size scale (60 in the ×10,000 units of the table), and the bias estimates of the univariate two- and three-level models coincide.]

Table 5  MSE (×10,000) for the overall effect estimate for the multivariate and the univariate three- and two-level models

Common effect

                          σu = 0                σu = 0.02             σu = 0.04
                          K = 30     K = 60     K = 30     K = 60     K = 30     K = 60
J   σe    Model           n25   n50  n25   n50  n25   n50  n25   n50  n25   n50  n25   n50
2   0     Multiv.         33    26   15    13   37    28   18    14   40    35   19    16
          3-level         31    23   15    12   33    26   17    13   37    33   18    15
          2-level         31    23   15    12   33    26   17    13   37    33   18    15
    0.4   Multiv.         37    29   19    13   41    34   20    15   46    36   21    17
          3-level         35    27   18    13   39    32   20    14   44    34   21    16
          2-level         35    27   18    13   39    32   20    14   44    34   21    16
    0.8   Multiv.         45    32   23    15   47    36   23    17   51    40   25    19
          3-level         42    30   22    15   44    33   22    16   48    38   25    18
          2-level         42    30   22    15   44    33   22    16   48    38   25    18
5   0     3-level         11    10    6     4   17    15    9     7   21    21   11    10
          2-level         11    10    6     4   17    15    9     7   21    21   11    10
    0.4   3-level         20    14   10     7   25    20   12    10   29    25   15    12
          2-level         20    14   10     7   25    20   12    10   29    25   15    12
    0.8   3-level         29    19   14     9   34    25   17    12   40    28   19    14
          2-level         29    19   14     9   34    25   17    12   40    28   19    14

Outcome-specific effect

                          σu = 0                σu = 0.02             σu = 0.04
                          K = 30     K = 60     K = 30     K = 60     K = 30     K = 60
J   σe    Model           n25   n50  n25   n50  n25   n50  n25   n50  n25   n50  n25   n50
2   0     Multiv.         49    43   24    20   59    48   27    23   65    60   32    29
          3-level         30    24   15    12   33    27   16    13   36    32   18    15
          2-level         30    24   15    12   33    27   16    13   36    32   18    15
    0.4   Multiv.         62    48   28    24   68    57   34    27   87    73   40    35
          3-level         36    26   18    14   38    30   20    15   42    36   22    16
          2-level         36    26   18    14   38    30   20    15   42    36   22    16
    0.8   Multiv.         75    55   35    26   88    71   41    30   110   92   50    41
          3-level         40    30   21    15   45    34   22    15   46    38   24    18
          2-level         40    30   21    15   45    34   22    15   46    38   24    18
5   0     3-level         12     9    6     5   16    15    8     7   23    20   12    10
          2-level         12     9    6     5   16    15    8     7   23    20   12    10
    0.4   3-level         20    14   10     7   25    19   13    10   31    25   15    12
          2-level         20    14   10     7   25    19   13    10   31    25   15    12
    0.8   3-level         29    19   15     9   33    25   17    12   40    29   19    13
          2-level         29    19   15     9   33    25   17    12   40    29   19    13

Note: Column and row labels as in Table 3. The multivariate model was not estimated for J = 5 (see Footnote 1).
