
Educational Measurement: Issues and Practice, Spring 2011, Vol. 30, No. 1, pp. 13–22

The Effect of Ignoring Classroom-Level Variance in Estimating the Generalizability of School Mean Scores

Xin Wei, SRI International, and Edward Haertel, Stanford University

Contemporary educational accountability systems, including state-level systems prescribed under No Child Left Behind as well as those envisioned under the “Race to the Top” comprehensive assessment competition, rely on school-level summaries of student test scores. The precision of these score summaries is almost always evaluated using models that ignore the classroom-level clustering of students within schools. This paper reports balanced and unbalanced generalizability analyses investigating the consequences of ignoring variation at the level of classrooms within schools when analyzing the reliability of such school-level accountability measures. Results show that the reliability of school means cannot be determined accurately when classroom-level effects are ignored. Failure to take between-classroom variance into account biases generalizability (G) coefficient estimates downward and standard errors (SEs) upward if classroom-level effects are regarded as fixed, and biases G-coefficient estimates upward and SEs downward if they are regarded as random. These biases become more severe as the difference between the school-level intraclass correlation (ICC) and the class-level ICC increases. School-accountability systems should be designed so that classroom (or teacher) level variation can be taken into consideration when quantifying the precision of school rankings, and statistical models for school mean score reliability should incorporate this information.

Keywords: generalizability theory, reliability, school-level scores

As a condition for receipt of Title I funds, each state is required to test all public school students in Grades 3 through 8 annually in reading and mathematics. Under the No Child Left Behind (NCLB) Act, districts and schools are expected to meet annual measurable objectives (AMOs) stated in terms of the percentage of students at or above proficiency levels in reading and mathematics. The school as well as specified student subgroups within the school must meet each of these objectives (or safe-harbor provisions) to satisfy the NCLB requirement for “Adequate Yearly Progress” (AYP). School-level summaries of test scores, for all students and designated subgroups, are also envisioned under the comprehensive assessment systems now being developed under the “Race to the Top” competition. Because of their high-stakes nature and central role in these incentive systems, we need to closely examine the reliability of these subgroup- and school-level scores.

The importance under NCLB of assessing the reliability and standard errors (SEs) of school means (or of estimated percentages at or above cut scores defining “proficient” performance) needs no defense. In recent years, accurate estimation of SEs has become more important than ever, as “margin of error” adjustments are increasingly used in determining whether schools or student subgroups have met AMOs or “safe harbor” provisions.

Xin Wei, SRI International, 333 Ravenswood Avenue, BS169, Menlo Park, CA 94025-3493; [email protected]. Edward Haertel, School of Education, Stanford University, 485 Lasuen Mall, Stanford, CA 94305-3096.

These adjustments are statistically questionable (Rogosa, 2005, pp. 161–167), but nonetheless widely used. It cannot as yet be known whether they will continue to be used following the pending Elementary and Secondary Education Act (ESEA) reauthorization. Because the “margin of error” adjustment in effect changes the cut score by some function of the (school- or subgroup-level) SE of measurement, inaccurate calculation of SEs directly affects AYP determinations.

Accurate estimation of reliability and SEs for school mean scores is complicated by the organization of students into classrooms within schools. Investigating the class/teacher effect supplies the major motive for this study. Cronbach, Linn, Brennan, and Haertel (1997) presented a theoretical argument for taking account of classroom-level variation when analyzing the generalizability of school-level scores, but the authors were unable to locate any published research studies examining how school-level score reliability estimates are affected when class (or teacher) variance is omitted from generalizability calculations. For example, these effects are ignored in Yen and Ferrara’s (1997) comprehensive report on Maryland’s accountability system. As further evidence of this general neglect, the Standards and Assessments Peer Review Guidance (U.S. Department of Education, 2009, p. 44), which sets forth detailed criteria for state accountability plans under NCLB, calls for “evidence of generalizability for all relevant sources, such as variability of groups, internal consistency of item responses, variability among schools, consistency from form to form of the test, and interrater consistency in scoring,”



but makes no mention of variance associated with classrooms within schools.

In this study, we used both analytical demonstration and actual multilevel student reading achievement data to examine the distortions in generalizability estimates when class effects are ignored. The first part of this paper presents illustrative calculations for a range of balanced-design scenarios varying by class size, number of classes per school, and the relative magnitudes of variance components for different score effects. These calculations are equally relevant to considerations of reliability and precision for scale scores or for percent-above-cut measures such as percent proficient. The second part of the paper analyzes actual multilevel student reading achievement scale scores from a sample of 18 schools for which information on the assignment of students to classes was available.

Debates on School-Level Scores

A common belief is that aggregate-level scores must be more reliable than individual-level scores. This belief follows from the observation that random noise at the individual level tends to cancel out at the aggregate level (Ingelhart, 1985; Jones & Norrander, 1996). However, more and more researchers are pointing out that school-level scores may not be as reliable as they appear, and that uncritical inferences from school means to judgments of school effectiveness are highly problematical (Brennan, 1995; Schechtman & Yitzhaki, 2009). The influence of random error associated with the measurement of individual students is greatly diminished when scores are averaged, but when the object of measurement shifts from individual students to school means, if the intended inference concerns an enduring characteristic of the school (as opposed to a historical report on a particular group of students), then students within schools are properly regarded as random, not fixed, effects. Thus, a new source of error, namely student sampling, comes into play. As Brennan (1995) explains, it is not necessarily true that group mean scores are more reliable than individual scores. He describes various conditions under which group-level reliability may be lower than individual-level reliability, especially when tests are long (implying high individual-level reliability) or when numbers of students within schools are small.

It is generally recognized that the sampling of students is the major source of uncertainty in estimates of school means (Cronbach et al., 1997; Kane & Staiger, 2002; Yen, 1997). This source of variation is much more important than the measurement error affecting individual students’ scores. The variance of student scores within schools is generally much larger than the variance of school means. Coleman et al. (1966) found that between-school variance in student test scores represented only 10–15% of the total score variance, although this proportion appears to have increased significantly since that time (e.g., Hedges & Hedberg, 2007). This within-school variance in students’ scores limits the precision of school mean scores.

A factor largely ignored in these discussions has been the grouping of students into classes within schools. In almost all cases, students within schools are treated as a simple random sample from some hypothetical population. It is more accurate to regard students as nested in classes or associated with individual teachers, which are in turn nested in schools. Ignoring aspects of a multilevel data structure results in biased estimation of regression coefficients and variance components (Kim & Frees, 2006). In particular, ignoring an intermediate level causes an overestimate of the variance and unstable regression coefficient estimates belonging to the levels just above and just below the level omitted (Opdenakker & Van Damme, 2000). Although ignoring variables at any level will bias the results, omitting effects of lower-level variables may yield more severely biased estimates of regression coefficients and variance components than omission of higher-level variables (Kim & Frees, 2006). The contribution of this paper is to examine the effects of the nesting of students within classes within schools, in the context of models where the school is the unit of analysis.

Nesting Students in Classes and Schools

Schools are complex, multilevel, and interactive organizations. Students are typically instructed in classroom groups by a single teacher, although pull-out, team teaching, and other models are also used. Typical high schools, as well as many middle schools, use a departmentalized organization in which students may move from classroom to classroom throughout the school day. For purposes of estimating school-level means, it may not be feasible to capture all of these details in a statistical model. However, especially at the elementary school level, a much closer approximation to reality can be attained by treating students as nested within classes and classes as nested in schools, as opposed to the common practice of ignoring classroom-level groupings altogether and simply modeling students as nested within schools. Thus, a challenge for evaluating the precision of school-level mean scores is to separate the variances associated with schools, with classes, and with students. Researchers use generalizability theory to address the reliability (dependability) of a measurement by correctly accounting for multiple sources of variation (Brennan, 2001a; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991).

Variance Components

A generalizability study design (p:c:s) can be used to estimate the reliability of school-level scores, in which students (denoted p, as in pupils) are nested within classes (c) within schools (s) (Cronbach et al., 1997). There are three variance components in this design. The school effect s is the difference between the mean universe score for a hypothetical population of students from which those attending the school are sampled and the mean universe score for all students across all schools. The variance of these school score effects is the variance component for school, denoted σ²_s. The c:s score effects are the differences between classroom-level mean universe scores and the corresponding school universe score. The p:c:s effects are the differences between student universe scores and their class universe scores. Variance components for these effects are defined similarly. An individual student’s observed score may be decomposed as follows:

$$
\begin{aligned}
X_{pcs} = \; & \mu && \text{[grand mean]} \\
 & + (\mu_s - \mu) && \text{[school effect]} \\
 & + (\mu_{c:s} - \mu_s) && \text{[class effect]} \\
 & + (\mu_{p:c:s} - \mu_{c:s}) && \text{[person effect]} \\
 & + (X_{pcs} - \mu_{p:c:s}) && \text{[residual effect]} \qquad (1)
\end{aligned}
$$
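A minimal simulation sketch may make this decomposition concrete. It is our illustration, not part of the original article: the variance values, sample sizes, and variable names are assumptions chosen only so that the score effects in Equation 1 are easy to see.

```python
# Hypothetical illustration (not the authors' code): simulate scores that follow
# the decomposition in Equation 1, X_pcs = mu + school + class + person/residual.
import numpy as np

rng = np.random.default_rng(0)

n_s, n_c, n_p = 1000, 5, 20            # schools, classes per school, pupils per class (assumed)
mu = 200.0                             # grand mean (arbitrary)
sig2_s, sig2_cs, sig2_pcse = 0.1, 0.1, 0.8   # assumed variance components

school_eff = rng.normal(0, np.sqrt(sig2_s),  size=(n_s, 1, 1))        # (mu_s - mu)
class_eff  = rng.normal(0, np.sqrt(sig2_cs), size=(n_s, n_c, 1))      # (mu_c:s - mu_s)
pupil_eff  = rng.normal(0, np.sqrt(sig2_pcse), size=(n_s, n_c, n_p))  # person + residual, pooled

X = mu + school_eff + class_eff + pupil_eff   # observed scores, shape (n_s, n_c, n_p)

# The observed-score variance approximates the sum of the components (Equation 2 below),
# which is 1.0 under this scaling.
print(X.var(), sig2_s + sig2_cs + sig2_pcse)
```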



Table 1. ANOVA Table for p:c:s Balanced Design

Effect | Sum of Squares | df | Mean Square | Expected Mean Square
s      | SS_s           | n_s − 1           | SS_s / df_s         | σ²_p:c:s,e + n_p σ²_c:s + n_p n_c σ²_s
c:s    | SS_c:s         | (n_c − 1) n_s     | SS_c:s / df_c:s     | σ²_p:c:s,e + n_p σ²_c:s
p:c:s  | SS_p:c:s       | (n_p − 1) n_c n_s | SS_p:c:s / df_p:c:s | σ²_p:c:s,e

Table 2. ANOVA Table for p:s Balanced Design

Effect | Sum of Squares             | df                            | Mean Square     | Expected Mean Square
s      | SS_s                       | n_s − 1                       | SS_s / df_s     | σ²_p:c:s,e + n_p σ²_c:s + n_p n_c σ²_s
p:s    | SS_p:s = SS_c:s + SS_p:c:s | N_p − n_s = n_c n_s n_p − n_s | SS_p:s / df_p:s | σ²_p:s,e = σ²_p:c:s,e + [(n_c n_p − n_p)/(n_c n_p − 1)] σ²_c:s

Note. N_p is the total number of students across all schools.

The expected value of the observed-score variance is equal to the sum of the variance components for these four effects, although estimates of the last two are typically confounded, and so they are pooled:

$$\sigma^2(X_{pcs}) = \sigma^2_s + \sigma^2_{c:s} + \sigma^2_{p:c:s} + \sigma^2_e, \quad\text{or}\quad \sigma^2(X_{pcs}) = \sigma^2_s + \sigma^2_{c:s} + \sigma^2_{p:c:s,e}. \qquad (2)$$

With a balanced design (the number of students equal across classes and the number of classes equal across schools), it is straightforward to estimate these variance components via analysis of variance (ANOVA). Table 1 displays the ANOVA table for the p:c:s design. To estimate the variance components, we set the mean squares (MS) equal to the expected mean squares (EMS) and solve for σ²_s, σ²_c:s, and σ²_p:c:s,e.

Typically, however, the classroom level is ignored, and the data are treated as if they originated from a p:s design. Table 2 provides the ANOVA table for this design. The sum of squares (SS), df, MS, EMS, and variance components in the p:s design (Table 2) can be expressed in terms of the corresponding statistics from the p:c:s design (Table 1). Note that SS_s in the p:s design is equal to SS_s in the p:c:s design. SS_p:s in the p:s design is equal to the total of SS_c:s and SS_p:c:s in the p:c:s design. The expression for σ²_p:s,e in terms of σ²_p:c:s,e and σ²_c:s may be obtained by multiplying the corresponding EMS expressions in Table 1 by their degrees of freedom to obtain expected values for sums of squares, then adding these together and dividing by the pooled degrees of freedom. The population parameter σ²_p:s,e = σ²_p:c:s,e + σ²_c:s, but, as shown in Table 2, the estimate of σ²_p:s,e obtained from the (incorrect) ANOVA model ignoring the grouping of students within classrooms is an unequally weighted average of the (unbiased) estimates of σ²_p:c:s,e and σ²_c:s obtained from the (correct) ANOVA model shown in Table 1.
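The following sketch is our addition, not the authors' code: it applies the MS = EMS logic of Tables 1 and 2 to a simulated balanced array (the data-generating step, sample sizes, and variance values are assumptions), so the unequal weighting of σ²_c:s in the p:s estimate can be seen directly.

```python
# Illustrative method-of-moments estimation for a balanced p:c:s design,
# followed by the p:s analysis that ignores the class grouping.
import numpy as np

rng = np.random.default_rng(1)
n_s, n_c, n_p = 100, 5, 20                      # assumed design sizes
sig2_s, sig2_cs, sig2_pcse = 0.1, 0.1, 0.8      # assumed true components

X = (rng.normal(0, np.sqrt(sig2_s),  (n_s, 1, 1)) +
     rng.normal(0, np.sqrt(sig2_cs), (n_s, n_c, 1)) +
     rng.normal(0, np.sqrt(sig2_pcse), (n_s, n_c, n_p)))

grand = X.mean()
school_means = X.mean(axis=(1, 2))              # one mean per school
class_means = X.mean(axis=2)                    # one mean per class

# Sums of squares and mean squares for the p:c:s design (Table 1).
ss_s   = n_p * n_c * ((school_means - grand) ** 2).sum()
ss_cs  = n_p * ((class_means - school_means[:, None]) ** 2).sum()
ss_pcs = ((X - class_means[:, :, None]) ** 2).sum()
ms_s   = ss_s / (n_s - 1)
ms_cs  = ss_cs / ((n_c - 1) * n_s)
ms_pcs = ss_pcs / ((n_p - 1) * n_c * n_s)

# Set MS equal to EMS (Table 1) and solve for the three components.
est_pcse = ms_pcs
est_cs   = (ms_cs - ms_pcs) / n_p
est_s    = (ms_s - ms_cs) / (n_p * n_c)
print("p:c:s estimates:", est_s, est_cs, est_pcse)

# p:s analysis (Table 2): pool SS_c:s and SS_p:c:s, ignoring classes.
ms_ps    = (ss_cs + ss_pcs) / (n_c * n_s * n_p - n_s)
est_ps_e = ms_ps                         # ~ sig2_pcse + (n_c*n_p - n_p)/(n_c*n_p - 1) * sig2_cs
est_s_ps = (ms_s - ms_ps) / (n_p * n_c)  # larger in expectation than est_s when sig2_cs > 0
print("p:s estimates:", est_s_ps, est_ps_e)
```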

Fixed versus Random Teacher/Classroom Effect and G Coefficient

In comparison with reliability at the individual score level, reliability for aggregate measures such as class or school means is relatively neglected in the research literature (O’Brien, 1990). Generalizability studies (G-studies) examining the reliability of class means were reviewed by Kane and Brennan (1977), who focused on a “split plot” design with students nested in classes and items crossed with students. Variations of this design with various facets treated as fixed versus random were also used by Brennan (1975) and by Kane, Gillmore, and Crooks (1976). In this context, with classroom means as the objects of measurement, Kane and Brennan (1977) considered four possible reliability coefficients: infinite universes of students and of items, infinite universe of students and fixed items, fixed students and infinite universe of items, and fixed students and fixed items. They suggested the most appropriate coefficient estimate is from infinite universes of students and items. At the school level, multiple test forms have also been included as a facet in G-study designs (e.g., Yen, 1997; Yen & Ferrara, 1997). In contemporary assessment applications, item response theory (IRT) is used to estimate student scale scores and to place alternate test forms on a common scale. Because IRT scaling accounts for form-to-form variation in item difficulty, there is no need to include item or form effects in the analyses presented here. Note that there is no analog in these earlier papers to the predominant practice of simply ignoring the organization of students within classrooms when estimating school effects. None of the studies cited here considered the problem of estimating school means accounting for a variance component defined at the classroom level.

Education-accountability systems employ students’ test scores aggregated to the school level to measure school quality, and thereby treat schools as the objects of measurement. School effects can be thought of as a combination of effects of teachers, lesson plans, school environment, nonacademic activities, and so on, all of which may change over time (Yen, 1997). Student-level effects are typically regarded as random, implying that σ²_p:c:s,e contributes to the error in school mean scores (Cronbach et al., 1997). The students providing test scores are treated as a random sample from a hypothetically infinite population of students who might have been enrolled in a given class in a given school at a given point in time (Haertel, 2006).¹

As explained by Cronbach et al. (1997), the model that treats students as being sampled from an infinite population is generally preferred, but does represent a choice among alternatives:

Alternatively, the population may be limited to the actual student body, the MN pupils in the school and grade this year. A similar distinction is to be made regarding classes: the number of classes M may be finite or infinite, and so may the number of pupils per class N. (p. 391)

By a parallel argument, classes within schools may be treated as either a random or a fixed facet. Classes might be treated as random if the classroom-level effect is attributed to the unique configuration of teacher and students interacting



in a classroom over the course of a single year. Classes might be treated as fixed if the classroom-level effect is attributed to a particular teacher present in the school year after year. The truth is probably somewhere in between. Thus, the fixed- and random-effect models bracket a range of plausible possibilities.

Whether classes/teachers is a fixed or a random effect determines the formula for and the magnitude of the G coefficient for school-level scores, as well as the SE of school means. The G coefficient is a “reliability-like” coefficient (Brennan, 1983, pp. xiii, 5, 17). It is the ratio of the universe score variance to the expected observed score variance (Brennan, 1983, 2001a; Shavelson & Webb, 1991). For balanced designs, when the universe of generalization consists of a random p facet and a random c facet (denoted p:cR:s), the reliability of school-level scores or G coefficient is

$$\rho^2_{p:cR:s} = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_{c:s}/n_c + \sigma^2_{p:c:s,e}/(n_p n_c)}, \qquad (3)$$

where σ²_s is the school variance component (i.e., the variance of the school effects), σ²_c:s is the variance component for classes within schools, σ²_p:c:s,e is the variance component for students within classes within schools, n_c is the number of classes per school, n_p is the number of students per class, and n_c n_p is the number of students per school. Under this model, the SE of the school means is

$$SE_{p:cR:s} = \sqrt{\sigma^2_{c:s}/n_c + \sigma^2_{p:c:s,e}/(n_p n_c)}. \qquad (4)$$

With a fixed c facet (denoted p:cF:s), the G coefficient is

$$\rho^2_{p:cF:s} = \frac{\sigma^2_s + \sigma^2_{c:s}/n_c}{\sigma^2_s + \sigma^2_{c:s}/n_c + \sigma^2_{p:c:s,e}/(n_p n_c)}. \qquad (5)$$

Under this model, the SE of school means is

$$SE_{p:cF:s} = \sqrt{\sigma^2_{p:c:s,e}/(n_p n_c)}. \qquad (6)$$

The G coefficient for the model ignoring the variance of classes within schools altogether (denoted p:s) is

$$\rho^2_{p:s} = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_{p:s,e}/(n_p n_c)}. \qquad (7)$$

For this model, the SE of school means is

$$SE_{p:s} = \sqrt{\sigma^2_{p:s,e}/(n_p n_c)}. \qquad (8)$$

Given the same nonzero variance components, it is readily shown that ρ²_p:cF:s > ρ²_p:s > ρ²_p:cR:s and SE_p:cF:s < SE_p:s < SE_p:cR:s. Thus, if σ²_c:s > 0, the most commonly used reliability and SE statistics are biased either upward or downward, depending on assumptions and/or intended inferences.
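A small sketch of Equations 3–8, again our addition rather than the authors' code, is given below. The function name and the illustrative component values are ours; the p:s entry uses the population relation σ²_p:s,e = σ²_p:c:s,e + σ²_c:s stated earlier.

```python
# Hedged illustration: G coefficients and SEs of school means under the class-random
# (p:cR:s), class-fixed (p:cF:s), and class-ignored (p:s) models, per Equations 3-8.
from math import sqrt

def g_and_se(sig2_s, sig2_cs, sig2_pcse, n_c, n_p):
    """Return {model: (rho2, se)} given variance components and design sizes."""
    err_cr = sig2_cs / n_c + sig2_pcse / (n_p * n_c)   # error variance, classes random
    err_cf = sig2_pcse / (n_p * n_c)                   # error variance, classes fixed
    err_ps = (sig2_pcse + sig2_cs) / (n_p * n_c)       # error variance, classes ignored
    return {
        "p:cR:s": (sig2_s / (sig2_s + err_cr), sqrt(err_cr)),                   # Eqs. 3, 4
        "p:cF:s": ((sig2_s + sig2_cs / n_c) / (sig2_s + err_cr), sqrt(err_cf)), # Eqs. 5, 6
        "p:s":    (sig2_s / (sig2_s + err_ps), sqrt(err_ps)),                   # Eqs. 7, 8
    }

# With nonzero sigma^2_c:s, rho^2 is ordered p:cF:s > p:s > p:cR:s and the SEs the reverse.
print(g_and_se(sig2_s=0.1, sig2_cs=0.1, sig2_pcse=0.8, n_c=2, n_p=10))
```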

G Analyses for Balanced Design

Intraclass Correlation (ICC) and G Coefficient

The ICC coefficient measures the relatedness of the students within a group, such as a school or a classroom. It is the ratio of the variance component due to schools or classrooms to the total variance for individual students. The ICC may be regarded as a special case of a G coefficient for a single observation, such as a one-item test (Shrout & Fleiss, 1979) or, in the present context, an estimate of a class or school mean based on a score for just one student. The ICC for the students within schools is conventionally defined as

$$ICC_s = \sigma^2_s / (\sigma^2_s + \sigma^2_{c:s} + \sigma^2_{p:c:s,e}). \qquad (9)$$

The ICC for the students within classes over schools is defined as

$$ICC_c = (\sigma^2_s + \sigma^2_{c:s}) / (\sigma^2_s + \sigma^2_{c:s} + \sigma^2_{p:c:s,e}). \qquad (10)$$

The ICC_c reflects two sources of variance: the variance among schools and the variance among classes within schools. Note that the ICC_c is always greater than or equal to the ICC_s, with equality obtaining only if σ²_c:s = 0. ICC_s is algebraically equivalent to ρ²_p:cR:s, Equation 3, when n_c = n_p = 1. ICC_c is equivalent to ρ²_p:cF:s, Equation 5, when n_c = n_p = 1. Note that in contrast to the ICCs, G coefficients pertain to average scores over collections of observations, and therefore vary depending on n_c and n_p. The extensive literature describing ICCs in real settings (Gulliford, Ukoumunne, & Chinn, 1999; Hedges & Hedberg, 2007; Murray & Blitstein, 2003; Murray, Varnell, & Blitstein, 2004; Verma & Lee, 1996) provides a good starting point for understanding G coefficients under different fixed/random school-effect models.
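A brief numerical check of Equations 9 and 10 and of the single-observation equivalence is sketched below; it is our illustration, and the component values are the same hypothetical ones used earlier.

```python
# Our sketch: ICCs as single-observation G coefficients (Equations 9 and 10).
sig2_s, sig2_cs, sig2_pcse = 0.1, 0.1, 0.8
total = sig2_s + sig2_cs + sig2_pcse

icc_s = sig2_s / total                 # Eq. 9  -> 0.10
icc_c = (sig2_s + sig2_cs) / total     # Eq. 10 -> 0.20

# With n_c = n_p = 1, Equation 3 reduces to ICC_s and Equation 5 to ICC_c.
rho2_cr_single = sig2_s / (sig2_s + sig2_cs / 1 + sig2_pcse / (1 * 1))
rho2_cf_single = (sig2_s + sig2_cs / 1) / (sig2_s + sig2_cs / 1 + sig2_pcse / (1 * 1))
print(icc_s, rho2_cr_single)   # both 0.10
print(icc_c, rho2_cf_single)   # both 0.20
```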

Analytical Results

Hedges and Hedberg (2007) have reported that the average unadjusted (not controlling for covariates) ICC_s is about .22 in national samples, with lower values where samples of schools are restricted with respect to socioeconomic level.² This result means that the variance among schools constitutes 22% of the total variance in student achievement outcomes. For illustrative purposes, we chose ICC_s values to be .10 and ICC_c to be .10, .20, .30, or .50, or ICC_s values to be .20 and ICC_c to be .20, .30, or .50, to represent the range of actual ICC values that might be observed in real education settings. The percent of extra variance explained by classes within schools is (ICC_c − ICC_s), which is, for example, .10 when ICC_s = .10 and ICC_c = .20. The analyses represent student test scores at a single grade level, because state accountability systems generally use different tests for each grade, and pooling across distinct tests would introduce irrelevant complications. We chose two or five classes per school and 10 or 20 students per class for these analyses.

Table 3 presents illustrative calculations showing the effect of fixed versus random treatment of classes within schools for these plausible ranges of parameter values. The first four columns in Table 3 list the values of ICC_s, ICC_c, n_c, and n_p. The next three columns of Table 3 provide the specified variance components for schools, classes within schools, and students within classes within schools. The last six columns present



the G coefficients and SEs for the following three models: (1) classes as a random facet (p:cR:s); (2) classes as a fixed facet (p:cF:s); and (3) classes ignored (p:s). For ease of interpretation, variance components are scaled so that σ²_s + σ²_c:s + σ²_p:c:s,e = 1. Thus, the variance component values shown may be read as the proportions of variance attributable to the corresponding score effects.³ The variance component for schools (σ²_s) is equal to ICC_s. The difference between ICC_c and ICC_s is the variance component for classes within schools (σ²_c:s). Values of ρ²_p:cR:s, ρ²_p:cF:s, and ρ²_p:s are calculated by using Equations 3, 5, and 7 based on the values in the previous columns. Note that ρ²_p:s depends on ICC_s and the total number of students in a school (n_p n_c), but not on σ²_c:s. Note also that these illustrations employed balanced designs for convenience in exposition.
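The short sketch below, our addition rather than the authors' code, reproduces one row of Table 3 from its ICC values, using the scaling σ²_s + σ²_c:s + σ²_p:c:s,e = 1 just described. The row chosen (ICC_s = .1, ICC_c = .2, n_c = 2, n_p = 10) is from the published table; everything else is illustrative.

```python
# Our sketch: derive variance components from ICC_s and ICC_c, then apply Equations 3-8.
from math import sqrt

icc_s, icc_c, n_c, n_p = 0.1, 0.2, 2, 10   # a row of Table 3

sig2_s = icc_s              # school component equals ICC_s under this scaling
sig2_cs = icc_c - icc_s     # class-within-school component
sig2_pcse = 1.0 - icc_c     # pupil-within-class component

rho2_cr = sig2_s / (sig2_s + sig2_cs / n_c + sig2_pcse / (n_p * n_c))                    # .53
rho2_cf = (sig2_s + sig2_cs / n_c) / (sig2_s + sig2_cs / n_c + sig2_pcse / (n_p * n_c))  # .79
rho2_ps = sig2_s / (sig2_s + (sig2_cs + sig2_pcse) / (n_p * n_c))                        # .69
se_cr = sqrt(sig2_cs / n_c + sig2_pcse / (n_p * n_c))                                    # .30
se_cf = sqrt(sig2_pcse / (n_p * n_c))                                                    # .20
se_ps = sqrt((sig2_cs + sig2_pcse) / (n_p * n_c))                                        # .21

print([round(v, 2) for v in (rho2_cr, rho2_cf, rho2_ps, se_cr, se_cf, se_ps)])
```

Under the chosen scaling, the SEs are expressed in units of the standard deviation of individual student scores (see Note 3).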

Figure 1 displays the magnitudes of G coefficients under the three models for the combinations of parameter values shown in the first sixteen rows of Table 3 (i.e., those with ICC_s = .1). Note that whenever σ²_c:s = 0, ρ²_p:cR:s, ρ²_p:cF:s, and ρ²_p:s are all equal. This is represented by the single line plotted in Figure 1 with “×” markers. As shown in Table 3, this line also represents the calculated value of ρ²_p:s regardless of the magnitude of σ²_c:s, implying that reliability coefficients estimated according to the usual p:s model are accurate only when classroom-level variance within schools is negligible. In real education settings, the assumption that σ²_c:s = 0 (i.e., that all classes within any given school have equal means except for random variation at the student level) is not realistic, which implies that the near-universal application of the p:s design to calculate reliability of school-level scores is incorrect.

The solid lines above the “σ²_c:s = 0” line are the G coefficients for p:cF:s models, while the dashed lines below the “σ²_c:s = 0” line are the G coefficients for p:cR:s models. As Figure 1 clearly shows, the reliability of school means is poorly determined when ignored class effects are even modestly greater than zero. The bias due to ignoring class-level variance becomes more and more severe as the discrepancy between ICC_s and ICC_c increases. Ignoring class variance overestimates the G coefficients when the correct model is taken to be p:cR:s. G coefficients decrease, and the bias due to ignoring class-level variance becomes more positive, as ICC_c increases for p:cR:s models. If we assume p:cR:s is the accurate model, then when class variances are very large (ICC_c = .5) and school variances = .1 (line “class random, σ²_c:s = .4”), the failure to model class-level effects results in biased reliability estimates in the range of .69 to .92 (p:s model), which falsely indicates that school-level scores are reliable. The G coefficients of p:s models were on average .40 larger than the correct G coefficients (ρ²_p:cR:s). Note that G coefficients increase as the number of students increases, and that G coefficients increase dramatically as the number of classes per school increases.

In an opposite fashion, ignoring class variance leads to underestimates of G coefficients when the correct model is taken to be p:cF:s. The G coefficients of school effects increase, and the bias due to ignoring class variance becomes more negative, as the class-level ICC increases. For example, the G coefficients for p:s were on average .07 smaller than the correct G coefficients when ICC_s = .1 and ICC_c = .5 (line “class fixed, σ²_c:s = .4”). The G coefficient increases as the number of classes per school increases and as the number of students per class increases for p:cF:s models.

Table 3. Variance Components and G Coefficients of p:cR:s Model, p:cF:s Model, and p:s Model for Different Combinations of ICC_s, ICC_c, n_c, and n_p for Balanced Design

ICC_s | ICC_c | n_c | n_p | σ²_s | σ²_c:s | σ²_p:c:s,e | ρ²_p:cR:s | ρ²_p:cF:s | ρ²_p:s | SE_p:cR:s | SE_p:cF:s | SE_p:s
.1 | .1 | 2 | 10 | .1 | 0  | .9 | .69 | .69 | .69 | .21 | .21 | .21
.1 | .1 | 2 | 20 | .1 | 0  | .9 | .82 | .82 | .82 | .15 | .15 | .15
.1 | .1 | 5 | 10 | .1 | 0  | .9 | .85 | .85 | .85 | .13 | .13 | .13
.1 | .1 | 5 | 20 | .1 | 0  | .9 | .92 | .92 | .92 | .09 | .09 | .09
.1 | .2 | 2 | 10 | .1 | .1 | .8 | .53 | .79 | .69 | .30 | .20 | .21
.1 | .2 | 2 | 20 | .1 | .1 | .8 | .59 | .88 | .82 | .26 | .14 | .15
.1 | .2 | 5 | 10 | .1 | .1 | .8 | .74 | .88 | .85 | .19 | .13 | .13
.1 | .2 | 5 | 20 | .1 | .1 | .8 | .78 | .94 | .92 | .17 | .09 | .09
.1 | .3 | 2 | 10 | .1 | .2 | .7 | .43 | .85 | .69 | .37 | .19 | .21
.1 | .3 | 2 | 20 | .1 | .2 | .7 | .46 | .92 | .82 | .34 | .13 | .15
.1 | .3 | 5 | 10 | .1 | .2 | .7 | .65 | .91 | .85 | .23 | .12 | .13
.1 | .3 | 5 | 20 | .1 | .2 | .7 | .68 | .95 | .92 | .22 | .08 | .09
.1 | .5 | 2 | 10 | .1 | .4 | .5 | .31 | .92 | .69 | .47 | .16 | .21
.1 | .5 | 2 | 20 | .1 | .4 | .5 | .32 | .96 | .82 | .46 | .11 | .15
.1 | .5 | 5 | 10 | .1 | .4 | .5 | .53 | .95 | .85 | .30 | .10 | .13
.1 | .5 | 5 | 20 | .1 | .4 | .5 | .54 | .97 | .92 | .29 | .07 | .09
.2 | .2 | 2 | 10 | .2 | 0  | .8 | .83 | .83 | .83 | .20 | .20 | .20
.2 | .2 | 2 | 20 | .2 | 0  | .8 | .91 | .91 | .91 | .14 | .14 | .14
.2 | .2 | 5 | 10 | .2 | 0  | .8 | .93 | .93 | .93 | .13 | .13 | .13
.2 | .2 | 5 | 20 | .2 | 0  | .8 | .96 | .96 | .96 | .09 | .09 | .09
.2 | .3 | 2 | 10 | .2 | .1 | .7 | .70 | .88 | .83 | .29 | .19 | .20
.2 | .3 | 2 | 20 | .2 | .1 | .7 | .75 | .93 | .91 | .26 | .13 | .14
.2 | .3 | 5 | 10 | .2 | .1 | .7 | .85 | .94 | .93 | .18 | .12 | .13
.2 | .3 | 5 | 20 | .2 | .1 | .7 | .88 | .97 | .96 | .16 | .08 | .09
.2 | .5 | 2 | 10 | .2 | .3 | .5 | .53 | .93 | .83 | .42 | .16 | .20
.2 | .5 | 2 | 20 | .2 | .3 | .5 | .55 | .97 | .91 | .40 | .11 | .14
.2 | .5 | 5 | 10 | .2 | .3 | .5 | .74 | .96 | .93 | .26 | .10 | .13
.2 | .5 | 5 | 20 | .2 | .3 | .5 | .75 | .98 | .96 | .25 | .07 | .09



[Figure 1: line plot of G coefficients (vertical axis, 0 to 1.0) against the number of classes and number of students per class (horizontal axis: 2,10; 2,20; 5,10; 5,20), with separate lines for the class-fixed and class-random models at σ²_c:s = .1, .2, and .4, and a single line for σ²_c:s = 0.]

FIGURE 1. G coefficients when class within school is fixed, random, or ignored, with ICC_s = .1 and increasing values of σ²_c:s.

Although more attention is sometimes paid to reliability and G coefficients, the SE of measurement is arguably a more useful statistic. In the context of school accountability, it may be less important whether schools can be rank-ordered consistently (the essence of the question addressed by G coefficients) than how precisely each school’s performance is determined (the essence of the question addressed by SEs). Reporting and interpretation of SEs has been recommended by various authors (Cronbach et al., 1997; Brennan, 2001a; Feldt & Brennan, 1989; Haertel, 2006). Figure 2 presents the comparison of the SEs associated with school mean scores under the three models. As with Figure 1, the “σ²_c:s = 0” line represents agreement among the three models when classroom-level effects are null. The solid lines below the “σ²_c:s = 0” line are SE_p:cF:s, while the dashed lines above the “σ²_c:s = 0” line are SE_p:cR:s. SEs of school mean scores for p:s models are larger than those for the p:cF:s models but smaller than those for the p:cR:s models for a given ICC_s, ICC_c, n_p, and n_c combination. In general, for a given model, SEs are larger when n_p and n_c decrease.

Student Reading Achievement Example for Unbalanced Design

The preceding calculations have illustrated the potential variation in estimates of G coefficients and SEs under alternative assumptions concerning classroom-level effects. We next turn to empirical findings to ascertain the actual magnitudes of these effects using a representative data set for which information on the grouping of students into classes was available. The data are taken from a study of a reading intervention. This data set has a hierarchical structure. Students are nested in classes (or teachers), which in turn are nested within schools. However, data from the real education setting were messier than the theoretical calculations, because neither the numbers of students in classes nor the numbers of classes in schools were equal.



[Figure 2: line plot of standard errors for school means (vertical axis, 0 to .50) against the number of classes and number of students per class (horizontal axis: 2,10; 2,20; 5,10; 5,20), with separate lines for the class-fixed and class-random models at σ²_c:s = .1, .2, and .4, and a single line for σ²_c:s = 0.]

FIGURE 2. SEs of school means when class within school is fixed, random, or ignored, with ICC_s = .1 and increasing values of σ²_c:s.

There were 18 schools in the data set, which together included 106 classes, which together included 2,382 students. The number of classes within schools ranged from three to nine, and the number of students per class ranged from 10 to 33. The outcome measures were student-level test scores on the Northwest Evaluation Association (NWEA) test in reading.⁴ Student scale scores had a mean of 199.05 and a standard deviation of 15.64.

urGENOVA (Brennan, 2001b) was used to estimate random-effects variance components for unbalanced designs for the student reading achievement data set. The program syntax for these analyses, which provides counts of the numbers of students within classrooms within schools, is presented in Appendix A (for the p:c:s designs) and Appendix B (for the p:s design).⁵ Random-effects variance components are estimated using an analogous-ANOVA procedure. For unbalanced nesting designs, this procedure is also called Henderson’s (1953) Method 1 or 3, or the method of moments (MM) (Brennan, 2001b). As an alternative, SAS PROC VARCOMP and SAS PROC MIXED were also used to develop restricted maximum likelihood (REML) estimates of the variance components. The REML method is sometimes judged preferable to the ANOVA method because under suitable assumptions it produces more accurate estimates and it is more flexible (Beaumont, 1991; Bowen & Huang, 1990). Both MM and REML variance component estimates for the p:cR:s, p:cF:s, and p:s models are shown in Table 4.

To calculate G coefficients and SEs for these unbalanced designs, harmonic means of the numbers of classes within schools (n_c = 5.30) and of the numbers of students within classes (n_p = 21.26) were substituted for n_c and n_p in Equations 3, 4, 5, and 6.
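As a small check, our own and not part of the article, the two harmonic means can be recomputed from nothing but the class counts listed in Appendix A (below); the list layout and function name here are ours.

```python
# Our verification sketch: harmonic means of classes per school and students per class,
# using the counts printed in Appendix A.
classes_per_school = [5, 9, 7, 5, 5, 6, 7, 5, 9, 3, 9, 4, 6, 6, 3, 6, 4, 7]
students_per_class = [
    25, 22, 23, 21, 17,
    22, 23, 20, 27, 19, 19, 25, 22, 24,
    29, 26, 22, 22, 24, 22, 24,
    17, 24, 29, 12, 21,
    17, 19, 18, 19, 26,
    21, 20, 22, 17, 20, 18,
    20, 26, 23, 14, 18, 25, 21,
    28, 29, 24, 30, 27,
    25, 28, 27, 32, 19, 31, 27, 27, 31,
    21, 18, 14,
    20, 23, 24, 22, 21, 23, 22, 23, 21,
    17, 24, 31, 18,
    15, 16, 22, 21, 24, 26,
    30, 26, 28, 32, 33, 31,
    29, 34, 30,
    15, 17, 20, 14, 12, 17,
    19, 19, 14, 18,
    24, 22, 23, 24, 10, 21, 24,
]

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

print(f"{harmonic_mean(classes_per_school):.2f}")   # 5.30
print(f"{harmonic_mean(students_per_class):.2f}")   # 21.26
```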

Table 4 reports that the G coefficients for the p:cR:s and p:cF:s models were .79 and .97, respectively, and the G coefficient for the p:s model was .95. The SE_p:cR:s, SE_p:cF:s, and SE_p:s are roughly 3, 1.24, and 1.35, respectively.



Table 4. Variance Components and G Coefficients and SEs by MM and REML Estimates for p:cR:s, p:cF:s, and p:s Model under Unbalanced Design

Class Effects Model (p:c:s analysis)
Effect | MS      | df   | MM VC  | REML VC
s      | 5665.15 | 17   | 34.83  | 36.38
c:s    | 1075.82 | 88   | 40.37  | 41.99
p:c:s  | 171.93  | 2276 | 171.93 | 172.05

ρ²_p:cR:s = .79 (MM), .79 (REML);  SE_p:cR:s = 3.02 (MM), 3.07 (REML)
ρ²_p:cF:s = .97 (MM), .97 (REML);  SE_p:cF:s = 1.24 (MM), 1.24 (REML)

Ignore Class Model (p:s analysis)
Effect | MS      | df   | MM VC  | REML VC
s      | 5665.15 | 17   | 41.60  | 41.95
p:s    | 205.58  | 2364 | 205.58 | 205.58

ρ²_p:s = .95 (MM), .95 (REML);  SE_p:s = 1.35 (MM), 1.35 (REML)

As predicted, the commonly used p:s calculation yielded a value much higher than that of the p:cR:s model and slightly smaller than that of the p:cF:s model. Conversely, the SE_p:s is larger than SE_p:cF:s, but smaller than SE_p:cR:s.

If the variance component estimates in Table 4 were scaled to sum to 1, they would be .14, .16, and .70 for σ²_s, σ²_c:s, and σ²_p:c:s,e, respectively, yielding intraclass correlations of ICC_s = .14 and ICC_c = .30. These values, along with n_c = 5.30 and n_p = 21.26, are near the midpoints of the ranges of values displayed in Table 3. Consistent with the analytical results shown in Table 3 for the balanced design, these empirical results for the unbalanced case confirm that omitting class effects produces serious bias in G coefficients, because statistical dependence among scores of students in the same classroom is ignored.
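The sketch below, again our addition, plugs the MM variance component estimates from Table 4 and the harmonic means n_c = 5.30 and n_p = 21.26 into Equations 3–8; rounding to two decimals approximately reproduces the published G coefficients, SEs, and rescaled ICCs.

```python
# Our sketch: recompute the unbalanced-design results from Table 4's MM estimates.
from math import sqrt

sig2_s, sig2_cs, sig2_pcse = 34.83, 40.37, 171.93   # MM estimates, p:c:s analysis (Table 4)
sig2_s_ps, sig2_ps_e = 41.60, 205.58                # MM estimates, p:s analysis (Table 4)
n_c, n_p = 5.30, 21.26                              # harmonic means

err_cr = sig2_cs / n_c + sig2_pcse / (n_p * n_c)
err_cf = sig2_pcse / (n_p * n_c)
err_ps = sig2_ps_e / (n_p * n_c)

print(sig2_s / (sig2_s + err_cr), sqrt(err_cr))                    # ~.79, ~3.02
print((sig2_s + sig2_cs / n_c) / (sig2_s + err_cr), sqrt(err_cf))  # ~.97, ~1.24
print(sig2_s_ps / (sig2_s_ps + err_ps), sqrt(err_ps))              # ~.95, ~1.35

# Rescaling the p:c:s components to sum to 1 gives ICC_s ~ .14 and ICC_c ~ .30.
total = sig2_s + sig2_cs + sig2_pcse
print(sig2_s / total, (sig2_s + sig2_cs) / total)
```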

Discussion and Conclusion

NCLB legislation requires that “‘adequate yearly progress’ shall be defined by the State in a manner that . . . is statistically valid and reliable” (20 USC 6311(b)(2)(C)). Moreover, many state accountability systems rely on estimated SEs of school means (and within-school student-subgroup means) for “margin of error” calculations that may determine whether a school is judged to have made “Adequate Yearly Progress.” It is known that the most commonly used statistical model for determining the reliability of school means, the p:s model that ignores classroom-level effects, gives an overly optimistic picture of score reliability under the plausible assumptions that the classroom effects are nonzero and that they vary from year to year. In order to find out how big the distortions are that arise from this common practice, we contrasted this simple model with two alternatives, one treating classes as a random facet (p:cR:s design) and the other treating classes as fixed (p:cF:s design).

If classroom effects are primarily due to the unique dynamic of teacher-student interactions in a given year, then the class random-effect model is the right one to use. Our G analyses, both demonstrations and empirical results, showed that ignoring the existence of substantial within-school heterogeneity in class means was a critical problem. The p:s approach substantially overstates reliability (and understates SEs) if the correct model is assumed to be p:cR:s. Bias becomes more severe as the within-school classroom-level variance increases.

Historically, the p:s model has been the best available, because information on the assignment of students to classes within schools was not systematically collected at the state level. However, there is increasing pressure on states to track the assignment of students to individual teachers,⁶ and as a consequence, states’ student data systems are rapidly improving. As it becomes more feasible to incorporate classroom-level effects into reliability calculations, the models presented in this paper may find wider application.

Despite the additional knowledge gained from this study, we acknowledge that this study does not address the effect of other within-school organizational variables on the reliability of school mean scores. For example, factors such as team teaching, pull-out programs for special education students and English language learners, and student mobility may affect the reliability of school mean scores. These more subtle effects are beyond the scope of this study.

Appendix A
urGENOVA Syntax for p:cR:s or p:cF:s Design

GSTUDY p:c:s
OPTIONS NREC 12 "*.out" EMS ET NOBANNER SECI.8
EFFECT * s 18
EFFECT c:s 5 9 7 5 5 6 7 5 9 3 9 4 6 6 3 6 4 7
EFFECT p:c:s 25 22 23 21 17
22 23 20 27 19 19 25 22 24
29 26 22 22 24 22 24
17 24 29 12 21
17 19 18 19 26
21 20 22 17 20 18
20 26 23 14 18 25 21
28 29 24 30 27
25 28 27 32 19 31 27 27 31
21 18 14
20 23 24 22 21 23 22 23 21
17 24 31 18
15 16 22 21 24 26
30 26 28 32 33 31
29 34 30
15 17 20 14 12 17
19 19 14 18
24 22 23 24 10 21 24
FORMAT 0 0
PROCESS "data.txt"

Appendix B
urGENOVA Syntax for p:s Design

GSTUDY p:c:s
OPTIONS NREC 12 "*.out" EMS ET NOBANNER SECI.8
EFFECT * s 18
EFFECT p:s 108 201 169 103 99 118 147 138 247 53 199 90 124 180 93 95 70 148
FORMAT 0 0
PROCESS "data.txt"

Acknowledgments

This research is supported by a grant from the American Educational Research Association, which receives funds for its “AERA Grants Program” from the National Science Foundation and National Center for Education Statistics of the Institute of Education Sciences (U.S. Department of Education) under NSF Grant No. DRL-0634035. Additional support was provided by the Stanford Graduate Fellowship. Opinions reflect those of the authors and do not necessarily reflect those of the granting agencies. The authors thank the anonymous reviewers and the editor for their helpful comments.

Notes

1. Inasmuch as the students within a school in a given year are not, in fact, a random sample from an infinite population, there is no formal statistical warrant for this model. That said, the student random-effects model is the best available for inferring an enduring property of a school from test data obtained at a single point in time.

2. This value is markedly higher than that reported by Coleman et al. (1966), 40 years earlier, possibly suggesting a trend over time in socioeconomic stratification.

3. G coefficients are unaffected by the scale chosen, but SEs are affected, because SEs are in the same metric as observed scores. The chosen scaling of the variance components makes the variance of observed scores equal to 1. Thus, SEs are in effect expressed in units of the standard deviation of individual student scores.

4. For this example, continuous scale scores are analyzed. Following Yen (1997), we could have dichotomized these scores to create a new student-level score representing “proficient” versus “not proficient.” As Yen (1997, p. 13) notes, this choice would discard potentially useful information. Given the serious deficiencies in PAC measures as distributional summaries (Holland, 2002; Ho, 2008), we did not pursue this option. However, the method of analysis would be the same with a dichotomized outcome as with scale scores.

5. Actual student scores are not presented. The 2,382 student scores were read from the “data.txt” files referenced at the end of each job stream.

6. For example, the phase-2 application process for Race to the Top funding awards points for state plans that “provide teachers and principals with data on student growth for their students, classes, and schools” (U.S. Department of Education, 2010, p. 34).

References

Beaumont, C. (1991). Comparison of Henderson’s Method I and restricted maximum likelihood estimation of genetic parameters of reproductive traits. Poultry Science, 70(7), 1462–1470.

Bowen, J., & Huang, M. (1990). A comparison of maximum likelihood with method of moment procedures for separating individual and group effects. Journal of Personality and Social Psychology, 58(1), 90–94.

Brennan, R. L. (1975). The calculation of reliability from a split-plot factorial design. Educational and Psychological Measurement, 35, 779–788.

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Brennan, R. L. (1995). The conventional wisdom about group mean scores. Journal of Educational Measurement, 32, 385–396.

Brennan, R. L. (2001a). Generalizability theory. New York: Springer-Verlag.

Brennan, R. L. (2001b). urGENOVA (Version 2.1) [Computer software and manual]. Iowa City, IA: University of Iowa. Retrieved March 8, 2008, from http://www.education.uiowa.edu/casma/computer_programs.htm#genova.

Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., et al. (1966). Equality of educational opportunity. Washington, DC: US Government Printing Office.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373–399.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education and Macmillan.

Gulliford, M. C., Ukoumunne, O. C., & Chinn, S. (1999). Components of variance and intraclass correlations for the design of community-based surveys and intervention studies. Data from the Health Survey for England 1994. American Journal of Epidemiology, 149, 876–883.

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education/Praeger.

Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29(1), 60–87.

Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics, 9, 227–252.

Ho, A. D. (2008). The problem with “Proficiency”: Limitations of statistics and policy under No Child Left Behind. Educational Researcher, 37(6), 351–360.

Holland, P. (2002). Two measures of change in the gaps between the CDFs of test-score distributions. Journal of Educational and Behavioral Statistics, 27, 3–17.

Ingelhart, R. (1985). Aggregate stability and individual-level flux in mass belief systems: The level of analysis paradox. American Political Science Review, 79, 97–116.

Jones, B. S., & Norrander, B. (1996). The reliability of aggregated public opinion measures. American Journal of Political Science, 40(1), 295–309.

Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47(2), 267–292.

Kane, M. T., Gillmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of class means. Journal of Educational Measurement, 13(3), 171–183.

Kane, T. J., & Staiger, D. O. (2002). The promise and pitfalls of using imprecise school accountability measures. The Journal of Economic Perspectives, 16(4), 91–114.

Kim, J.-S., & Frees, E. W. (2006). Omitted variables in multilevel models. Psychometrika, 71(4), 659–690.

Murray, D. M., & Blitstein, J. L. (2003). Methods to reduce the impact of intraclass correlation in group randomized trials. Evaluation Review, 27, 79–103.

Murray, D. M., Varnell, S. P., & Blitstein, J. L. (2004). Design and analysis of group-randomized trials: Review of recent methodological developments. American Journal of Public Health, 94, 423–432.

No Child Left Behind Act of 2001, Pub. L. No. 107–110, 115 Stat. 1425 (2002). Retrieved March 13, 2008, from http://www.ed.gov/policy/elsec/leg/esea02/index.html.

O’Brien, R. M. (1990). Estimating the reliability of aggregate-level variables based on individual-level characteristics. Sociological Methods & Research, 18, 473–504.

Opdenakker, M., & Van Damme, J. (2000). The importance of identifying levels in multilevel analysis: An illustration of the effects of ignoring the top and intermediate levels of school effectiveness research. School Effectiveness and School Improvement, 11, 103–130.

Rogosa, D. R. (2005). Statistical misunderstandings of the properties of school scores and school accountability. In J. L. Herman & E. H. Haertel (Eds.), Uses and misuses of data for educational accountability and improvement (The 104th yearbook of the National Society for the Study of Education, Part 2, pp. 147–174). Malden, MA: Blackwell.

Schechtman, E., & Yitzhaki, S. (2009). Ranking groups’ abilities: Is it always reliable? International Journal of Testing, 9(3), 195–214.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage Publications.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.

U.S. Department of Education. (2009). Standards and assessments peer review guidance: Information and examples for meeting requirements of the No Child Left Behind Act of 2001. Washington, DC: Office of Elementary and Secondary Education. Retrieved June 18, 2010, from http://www2.ed.gov/policy/elsec/guid/saaprguidance.pdf.

U.S. Department of Education. (2010). Race to the Top application for phase 2 funding (CFDA Number: 84.395A). Washington, DC: Office of Elementary and Secondary Education. Retrieved June 18, 2010, from http://www2.ed.gov/programs/racetothetop/phase2-application.doc.

Verma, V., & Lee, T. (1996). An analysis of sampling errors for demographic and health surveys. International Statistical Review, 64, 265–294.

Yen, W. M. (1997). The technical quality of performance assessments: Standard errors of percents of pupils reaching standards. Educational Measurement: Issues and Practice, 16(3), 5–15.

Yen, W. M., & Ferrara, S. (1997). The Maryland School Performance Assessment Program: Performance assessment with psychometric quality suitable for high-stakes usage. Educational and Psychological Measurement, 57(1), 60–84.
