7/30/2019 MET_Ensuring Fair and Reliable Measures_Practitioner Brief
MET project

Ensuring Fair and Reliable Measures of Effective Teaching

Culminating Findings from the MET Project's Three-Year Study

POLICY AND PRACTITIONER BRIEF
ABOUT THIS REPORT: This non-technical research brief for policymakers and practitioners summarizes recent analyses from the Measures of Effective Teaching (MET) project on identifying effective teaching while accounting for differences among teachers' students, on combining measures into composites, and on assuring reliable classroom observations.1

Readers who wish to explore the technical aspects of these analyses may go to www.metproject.org to find the three companion research reports: Have We Identified Effective Teachers? by Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger; A Composite Estimator of Effective Teaching by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood; and The Reliability of Classroom Observations by School Personnel by Andrew D. Ho and Thomas J. Kane.
Earlier MET project briefs and research reports, also on the website, include:

Working with Teachers to Develop Fair and Reliable Measures of Teaching (2010). A white paper describing the rationale for and components of the MET project's study of multiple measures of effective teaching.

Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project (2010). A research report and non-technical policy brief with the same title on analysis of student-perception surveys and student achievement gain measures.

Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains (2012). A research report and policy/practitioner brief with the same title with initial findings on the reliability of classroom observations and implications for combining measures of teaching.

Asking Students about Teaching: Student Perception Surveys and Their Implementation (2012). A non-technical brief for policymakers and practitioners on the qualities of well-designed student surveys and implications for their implementation for teacher feedback and evaluation.
January 20
ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.
The approximately 3,000 MET project teachers who volunteered to open up their classrooms for this work are from the following districts: the Charlotte-Mecklenburg Schools, the Dallas Independent Schools, the Denver Public Schools, the Hillsborough County Public Schools, the Memphis Public Schools, the New York City Schools, and the Pittsburgh Public Schools.

Partners include representatives of the following institutions and organizations: American Institutes for Research, Cambridge Education, University of Chicago, The Danielson Group, Dartmouth College, Educational Testing Service, Empirical Education, Harvard University, National Board for Professional Teaching Standards, National Math and Science Initiative, New Teacher Center, University of Michigan, RAND, Rutgers University, University of Southern California, Stanford University, Teachscape, University of Texas, University of Virginia, University of Washington, and Westat.
[Map: MET Project Teachers, by district: Denver, Memphis, Hillsborough County, Dallas, Charlotte-Mecklenburg, Pittsburgh, New York City]
Contents

Executive Summary
Can Measures of Effective Teaching Identify Teachers Who Better Help Students Learn?
How Much Weight Should Be Placed on Each Measure of Effective Teaching?
How Can Teachers Be Assured Trustworthy Results from Classroom Observations?
What We Know Now
Endnotes
The Questions

Can measures of effective teaching identify teachers who better help students learn?

Despite decades of research suggesting that teachers are the most important in-school factor affecting student learning, an underlying question remains unanswered: Are seemingly more effective teachers truly better than other teachers at improving student learning, or do they simply have better students?

Ultimately, the only way to resolve that question was by randomly assigning students to teachers to see if teachers previously identified as more effective actually caused those students to learn more. That is what we did for a subset of MET project teachers. Based on data we collected during the 2009–10 school year, we produced estimates of teaching effectiveness for each teacher. We adjusted our estimates to account for student differences in prior test scores, demographics, and other traits. We then randomly assigned a classroom of students to each participating teacher for 2010–11.

Following the 2010–11 school year we asked two questions: First, did students actually learn more when randomly assigned to the teachers who seemed more effective when we evaluated them the prior year? And, second, did the magnitude of the difference in student outcomes following random assignment correspond with expectations?
How much weight should be placed on each measure of effective teaching?

While using multiple measures to provide feedback to teachers, many states and districts also are combining measures into a single index to support decisionmaking. To date, there has been little empirical evidence to inform how systems might weight each measure within a composite to support improvements in teacher effectiveness. To help fill that void, we tasked a group of our research partners to use data from MET project teachers to build and compare composites using different weights and different outcomes.
How can teachers be assured trustworthy results from classroom observations?

Our last report on classroom observations prompted numerous questions from practitioners about how to best use resources to produce quality information for feedback on classroom practice. For example: How many observers are needed to achieve sufficient reliability from a given number of observations? Do all observations need to be the same length to have confidence in the results? And what is the value of adding observers from outside a teacher's own school?

To help answer these questions, we designed a study in which administrators and peer observers produced more than 3,000 scores for lessons taught by teachers within one MET project partner school district.
Key findings from those analyses:

1. Effective teaching can be measured. We collected measures of teaching during 2009–10. We adjusted those measures for the backgrounds and prior achievement of the students in each class. But, without random assignment, we had no way to know if the adjustments we made were sufficient to discern the markers of effective teaching from the unmeasured aspects of students' backgrounds.

Feedback and evaluation systems depend on trustworthy information about teaching effectiveness to support improvement in teachers' practice and better outcomes for students.
In fact, we learned that the adjusted measures did identify teachers who produced higher (and lower) average student achievement gains following random assignment in 2010–11. The data show that we can identify groups of teachers who are more effective in helping students learn. Moreover, the magnitude of the achievement gains that teachers generated was consistent with expectations.

In addition, we found that more effective teachers not only caused students to perform better on state tests, but they also caused students to score higher on other, more cognitively challenging assessments in math and English.
2. Balanced weights indicate multiple aspects of effective teaching. A composite with weights between 33 percent and 50 percent assigned to state test scores demonstrated the best mix of low volatility from year to year and ability to predict student gains on multiple assessments. The composite that best indicated improvement on state tests heavily weighted teachers' prior student achievement gains based on those same tests. But composites that assigned 33 percent to 50 percent of the weight to state tests did nearly as well and were somewhat better at predicting student learning on more cognitively challenging assessments.

Multiple measures also produce more consistent ratings than student achievement measures alone. Estimates of teachers' effectiveness are more stable from year to year when they combine classroom observations, student surveys, and measures of student achievement gains than when they are based solely on the latter.
3. Adding a second observer increases reliability significantly more than having the same observer score an additional lesson. Teachers' observation scores vary more from observer to observer than from lesson to lesson. Given the same total number of observations, including the perspectives of two or more observers per teacher greatly enhances reliability. Our study of video-based observation scoring also revealed that:

a. Additional shorter observations can increase reliability. Our analysis suggests that having additional observers watch just part of a lesson may be a cost-effective way to boost reliability by including additional perspectives.

b. Although school administrators rate their own teachers somewhat higher than do outside observers, how they rank their teachers' practice is very similar, and teachers' own administrators actually discern bigger differences in teaching practice, which increases reliability.

c. Adding observations by observers from outside a teacher's school to those carried out by a teacher's own administrator can provide an ongoing check against in-school bias. This could be done for a sample of teachers rather than all, as we said in Gathering Feedback for Teaching.
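The arithmetic behind finding 3 can be made concrete with a simple variance-components calculation: the error in a teacher's average score has an observer component, which shrinks only as observers are added, and a lesson component, which shrinks with the total number of observations. The variance shares below are made-up round numbers for illustration, not the study's estimates:

```python
def reliability(n_observers: int, lessons_per_observer: int,
                var_teacher: float = 0.40, var_observer: float = 0.35,
                var_lesson: float = 0.25) -> float:
    """Reliability of a teacher's mean observation score:
    true (teacher) variance over true-plus-error variance."""
    n_total = n_observers * lessons_per_observer
    # Observer error averages over observers; lesson error over all observations
    error = var_observer / n_observers + var_lesson / n_total
    return var_teacher / (var_teacher + error)

# Same total of four observations, allocated two ways:
print(reliability(1, 4))  # one observer scores four lessons
print(reliability(4, 1))  # four observers score one lesson each: higher reliability
```

Because the observer component never shrinks when a single observer scores more lessons, spreading the same four observations across four observers yields the larger gain, which is the pattern the study reports.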
The following pages further explain these findings and the analyses that produced them.
Can Measures of Effective Teaching Identify Teachers Who Better Help Students Learn?2

By definition, teaching is effective when it enables student learning. But identifying effective teaching is complicated by the fact that teachers often have very different students. Students start the year with different achievement levels and different needs. Moreover, some teachers tend to get particular types of students year after year (that is, they tend to get higher-performing or lower-performing ones). This is why so-called value-added measures attempt to account for differences in the measurable characteristics of a teacher's students, such as prior test scores and poverty.
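The adjustment behind a value-added measure is, mechanically, a regression: predict end-of-year scores from measurable student characteristics, then credit each teacher with the average leftover. A minimal sketch on simulated data; the variable names, coefficients, and data are our illustrative assumptions, not MET project estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_teachers = 1000, 50

# Simulated roster: each student has a teacher, a prior score, and a poverty flag
teacher = rng.integers(0, n_teachers, n_students)
true_effect = rng.normal(0.0, 0.15, n_teachers)   # hypothetical teacher effects (SD units)
prior = rng.normal(0.0, 1.0, n_students)          # prior-year test score
poverty = rng.integers(0, 2, n_students)          # e.g., a subsidized-lunch indicator
score = (0.7 * prior - 0.1 * poverty
         + true_effect[teacher] + rng.normal(0.0, 0.5, n_students))

# Step 1: regress end-of-year scores on the measurable student characteristics
X = np.column_stack([np.ones(n_students), prior, poverty])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
residual = score - X @ beta

# Step 2: a teacher's value-added estimate is the mean residual of her students
value_added = np.array([residual[teacher == t].mean() for t in range(n_teachers)])
print(np.corrcoef(value_added, true_effect)[0, 1])  # estimates track the simulated effects
```

The brief's central question is whether such adjusted estimates still reflect what the regression cannot see; in this simulation nothing is hidden, which is exactly why the MET project needed random assignment with real students.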
However, students differ in other ways, such as behavior and parental involvement, which we typically cannot account for in determining teaching effectiveness. If those unaccounted-for differences also affect student learning, then what seems like effective teaching may actually reflect unmeasured characteristics of a teacher's students. The only way to know if measures of teaching truly identify effective teaching, and not some unmeasured student characteristics, is by randomly assigning teachers to students. So we did.
In 2009–10, we measured teachers' effectiveness using a combined measure comprising teachers' classroom observation results, student perception survey responses, and student achievement gains adjusted for student characteristics, such as prior performance and demographics. The following year (2010–11), we randomly assigned different rosters of students to two or more MET project teachers who taught the same grade and subject in the same school. Principals created rosters and the RAND Corporation assigned them randomly to teachers (see Figure 1). Our aim was to determine if the students who were randomly assigned to teachers who previously had been identified as more effective actually performed better at the end of the 2010–11 school year.3
They did. On average, the 2009–10 composite measure of effective teaching accurately predicted 2010–11 student performance. The research confirmed that, as a group, teachers previously identified as more effective caused students to learn more. Groups of teachers who had been identified as less effective caused students to learn less. We can say they caused more (or less) student learning because when we randomly assigned teachers to students during the second year, we could be confident that any subsequent differences in achievement were being driven by the teachers, not by the unmeasured characteristics of their students. In addition, the magnitude of the gains they caused was consistent with our expectations.

Teachers previously identified as more effective caused students to learn more. Groups of teachers who had been identified as less effective caused students to learn less.
Figure 2 illustrates just how well the measures of effective teaching predicted student achievement following random assignment. The diagonal line represents perfect prediction. Dots above the diagonal line indicate groups of teachers whose student outcomes following random assignment were better than predicted. Dots below the line indicate groups of teachers whose student outcomes following random assignment were worse than predicted. Each dot represents 5 percent of the teachers in the analysis, sorted based on their predicted impact on student achievement.4

Figure 1
Putting Measures of Effective Teaching to the Test with Random Assignment

Do measures of teaching really identify teachers who help students learn more, or do seemingly more effective teachers just get better students? To find out, the MET project orchestrated a large-scale experiment with MET project teachers to see if teachers identified as more effective than their peers would have greater student achievement gains even with students who were assigned randomly.

To do so, the MET project first estimated teachers' effectiveness using multiple measures from the 2009–10 school year. As is common in schools, some teachers had been assigned students with stronger prior achievement than others. In assessing each teacher's practice that year, the project controlled for students' prior achievement and demographic characteristics. But there may have been other differences among students as well. So for the following school year (2010–11), principals created rosters of students for each class in the study, and then researchers randomly assigned each roster to a participating teacher from among those who could teach the class.

At the end of the 2010–11 school year, MET project analysts checked to see if students taught by teachers identified as more effective than their colleagues actually had greater achievement gains than students taught by teachers identified as less effective. They also checked to see how well actual student achievement gains for teachers matched predicted gains.
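The dots in Figure 2 follow a standard binned-scatter recipe: sort teachers by predicted impact, cut the sorted list into 5 percent groups, and compare each group's mean predicted gain with its mean actual gain. A sketch on simulated values; the data and magnitudes are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 400
predicted = rng.normal(0.0, 0.10, n_teachers)            # predicted impact (SD units)
actual = predicted + rng.normal(0.0, 0.08, n_teachers)   # actual gain after random assignment

# Sort by predicted impact and split into twenty 5 percent groups
order = np.argsort(predicted)
groups = np.array_split(order, 20)
dots = np.array([(predicted[g].mean(), actual[g].mean()) for g in groups])

# Dots hugging the 45-degree line mean the predictions were accurate on average
for p, a in dots[:3]:
    print(f"predicted {p:+.3f}  actual {a:+.3f}")
```

Averaging within groups smooths away individual-teacher noise, which is why a dot can sit on the line even though individual predictions within the group miss in both directions.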
As seen in Figure 2, in both math and English language arts (ELA), the groups of teachers with greater predicted impacts on student achievement generally had greater actual impacts on student achievement following random assignment. Further, the actual impacts are approximately in line with the predicted impacts.5 We also found that teachers who we identified as being effective in promoting achievement on the state tests also generated larger gains on the supplemental tests administered in spring 2011.

Based on our analysis, we can unambiguously say that school systems should account for the prior test scores of students. When we removed this control, we wound up predicting much larger differences in achievement than actually occurred, indicating that student assignment biased the results. However, our analysis could not shed as much light on the need to control for demographics or peer effects, that is, the average prior achievement and demographics of each student's classmates. Although we included those
Figure 2
Effectiveness Measures Identify Teachers Who Help Students Learn More

[Charts: Actual and Predicted Achievement of Randomized Classrooms, one panel for Math and one for English Language Arts. Each plots actual achievement against predicted achievement, in standard deviations, with a dashed "Actual = Predicted" line.]

These charts compare the actual 2010–11 school year achievement gains for randomly assigned classrooms with the results that were predicted based on the earlier measures of teaching effectiveness. Each dot represents the combination of actual and estimated student performance for 5 percent of the teachers in the study, grouped by the teachers' estimated effectiveness. The dashed line shows where the dots would be if the actual and predicted gains matched perfectly.

On average, students of teachers with higher teacher effectiveness estimates outperformed students of teachers with lower teacher effectiveness estimates. Moreover, the magnitude of students' actual gains largely corresponded with gains predicted by their effectiveness measured the previous year. Both the actual and predicted achievement are reported relative to the mean in the randomization block. That is, a zero on either axis implies that the value was no different from the mean for the small group of teachers in a grade, subject, and school within which class lists were randomized.

Impacts are reported in student-level standard deviations. A .25 standard deviation difference is roughly equivalent to a year of schooling. The predicted impacts are adjusted downward to account for incomplete compliance with randomization.
controls, we cannot determine from our evidence whether school systems should include them. Our results were ambiguous on that score.

To avoid over-interpretation of these results, we hasten to add two caveats: First, a prediction can be correct on average but still be subject to measurement error. Our predictions of students' achievement following random assignment were correct on average, but within every group there were some teachers whose students performed better than predicted and some whose students performed worse. Second, we could not, as a practical matter, randomly assign students or teachers to a different school site. As a result, our study does not allow us to investigate bias in teacher effectiveness measures arising from student sorting between different schools.6
Nonetheless, our analysis should give heart to those who have invested considerable effort to develop practices and policies to measure and support effective teaching. Through this large-scale study involving random assignment of teachers to students, we are confident that we can identify groups of teachers who are comparatively more effective than their peers in helping students learn. Great teaching does make a difference.

We can unambiguously say that school systems should adjust their achievement gain measures to account for the prior test scores of students. When we removed this control, we wound up predicting much larger differences in achievement than actually occurred.
How Much Weight Should Be Placed on Each Measure of Effective Teaching?7

Teaching is too complex for any single measure of performance to capture it accurately. Identifying great teachers requires multiple measures. While states and districts embrace multiple measures for targeted feedback, many also are combining measures into a single index, or composite. An index or composite can be a useful summary of complex information to support decisionmaking. The challenge is to combine measures in ways that support effective teaching while avoiding such unintended consequences as too narrow a focus on one aspect of effective teaching.
To date, there has been little empirical evidence to suggest a rationale for particular weights. The MET project's report Gathering Feedback for Teaching showed that equally weighting three measures, including achievement gains, did a better job predicting teachers' success (across several student outcomes) than teachers' years of experience and master's degrees. But that work did not attempt to determine optimal weights for composite measures.

Over the past year, a team of MET project researchers from the RAND Corporation and Dartmouth College used MET project data to compare differently weighted composites and study the implications of different weighting schemes for different outcomes. As in the Gathering Feedback for Teaching report, these composites included student achievement gains based on state assessments, classroom observations, and student surveys. The researchers estimated the ability of variously weighted composites to produce consistent results and accurately forecast teachers' impact on student achievement gains on different types of tests.

The goal was not to suggest a specific set of weights but to illustrate the trade-offs involved when choosing weights. Assigning significant weight to one measure might yield the best predictor of future performance on that measure. But heavily weighting a single measure may incentivize teachers to focus too narrowly on a single aspect
of effective teaching and neglect its other important aspects. For example, a singular focus on state tests could displace gains on other harder-to-measure outcomes. Moreover, if the goal is for students to meet a broader set of learning objectives than are measured by a state's tests, then too heavily weighting that test could make it harder to identify teachers who are producing other valued outcomes.
Composites Compared

The research team compared four different weighting models, illustrated in Figure 3: (Model 1) the best predictor of state achievement test gains (with weights calculated to maximize the ability to predict teachers' student achievement gains on state tests, resulting in 65+ percent of the weight being placed on the student achievement gains across grades and subjects); (Model 2) a composite that assigned 50 percent of the weight to students' state achievement test gains; (Model 3) a composite that applied equal weights to each measure; and (Model 4) one that gave 50 percent to observation ratings and 25 percent each to achievement gains and student surveys. The best-predictor weights shown for Model 1 in Figure 3 were calculated to predict gains on state ELA tests at the middle school level; that model assigns a whopping 81 percent of the weight to prior gains on the same tests (best-predictor weights for other grades and subjects are in Table 1).
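Mechanically, each composite is a weighted sum of standardized component scores. The following sketch applies the Figure 3 weights; the helper function and the simulated inputs are our illustration, not the project's actual estimation procedure:

```python
import numpy as np

def composite(state_gain, observations, surveys, weights):
    """Weighted sum of standardized measures.
    `weights` = (state tests, observations, surveys), summing to 1."""
    z = lambda x: (x - x.mean()) / x.std()   # put each measure on a common scale
    w_state, w_obs, w_survey = weights
    return w_state * z(state_gain) + w_obs * z(observations) + w_survey * z(surveys)

# The four weighting schemes in Figure 3 (state tests, observations, student surveys)
MODELS = {
    1: (0.81, 0.02, 0.17),       # best predictor, middle school ELA
    2: (0.50, 0.25, 0.25),       # 50% state test results
    3: (1 / 3, 1 / 3, 1 / 3),    # equal weights
    4: (0.25, 0.50, 0.25),       # 50% observations
}

rng = np.random.default_rng(2)
state, obs, survey = rng.normal(size=(3, 100))   # stand-ins for real teacher-level measures
scores = {m: composite(state, obs, survey, w) for m, w in MODELS.items()}
```

Standardizing before weighting matters: without it, whichever measure happens to have the largest variance would dominate the index regardless of the nominal weights.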
Figure 4 compares the different weighting schemes on three criteria, using middle school ELA as an example (see Table 1 for other grades and subjects). The first is predicting teachers' student achievement gains on state assessments. A correlation of 1.0 would indicate perfect accuracy in
Heavily weighting a single measure may incentivize teachers to focus too narrowly on a single aspect of effective teaching and neglect its other important aspects. ... [If] the goal is for students to meet a broader set of learning objectives than are measured by a state's tests, then too heavily weighting that test could make it harder to identify teachers who are producing other valued outcomes.
Figure 3
Four Ways to Weight

[Pie charts: four composite weighting models combining achievement gains on state tests, classroom observations, and student surveys. Model 1 (weighted for maximum accuracy in predicting gains on state tests*): 81% state tests, 2% observations, 17% student surveys. Model 2 (50% weight on state test results): 50% state tests, 25% observations, 25% student surveys. Model 3 (equal weights): 33% each. Model 4 (50% weight on observations): 25% state tests, 50% observations, 25% student surveys.]

*Weights shown for Model 1 were calculated to best predict gains on state tests for middle school English language arts. Similar best-predictor weights for other grades and subjects are in Table 1.

These charts illustrate four ways to construct a composite measure of effective teaching. Each model uses different weights but includes the same components: student achievement gains on the state tests, student perception surveys, and classroom observations. Model 1 uses the weights that would best predict a teacher's impact on state test scores. Across grades and subjects, the best predictor model assigns 65 percent or more of the weight to a teacher's prior state test gains. Models 2–4 are not based on maximizing any particular outcome. They approximate different weighting schemes used by states and districts, with each model placing progressively less weight on student achievement gains on state tests.
predicting teachers' student achievement gains on state tests. By definition, the best composite in this regard is Model 1, the model weighted for maximizing accuracy on state test results. Models 2–4 show the effect of reducing weights on student achievement gains on state tests for middle school ELA. As shown for middle school ELA, reducing weights on student achievement gains decreases the power to predict future student achievement gains on state tests from 0.69 to 0.63 with Model 2; to 0.53 with Model 3; and to 0.43 with Model 4. Other grades and subjects showed similar patterns, as indicated in Table 1.

While it is true that the state tests are limited and that schools should value other outcomes, observations and student surveys may not be more correlated with those other outcomes than the state tests. As a result, we set out to test the strength of each model's correlation with another set of test outcomes. The middle set of bars in Figure 4 compares the four models (see Figure 3), each using state test results to measure achievement gains, on how well they would predict teachers' student achievement gains on supplemental tests that were administered in MET project teachers' classrooms: the SAT 9 Open-Ended Reading Assessment (SAT 9 OE) and the Balanced Assessment in Mathematics (BAM).
Figure 4
Trade-Offs from Different Weighting Schemes: Middle School English Language Arts

[Bar charts comparing Models 1–4 on three criteria. Correlation with state test gains: 0.69, 0.63, 0.53, 0.43. Correlation with higher-order tests: 0.29, 0.34, 0.33, 0.32. Reliability: 0.51, 0.66, 0.76, 0.75.]

These bars compare the four weighting schemes in Figure 3 on three criteria: accuracy in predicting teachers' achievement gains on state tests; accuracy in predicting student achievement gains on supplemental assessments designed to test higher-order thinking skills; and reliability, reflecting the year-to-year stability of teachers' results. Shown are the results for middle school ELA (see Table 1 for results for other grades and subjects).

As indicated, Model 2 (50 percent state test results) and Model 3 (33 percent state tests) achieve much of the same predictive power as Model 1 (the best predictor of state test results) in anticipating teachers' future state test results. Model 4 (50 percent observation) is considerably less predictive. However, the figures also illustrate two other trade-offs. Models 2 and 3 also are somewhat better than Model 1 at predicting gains on the tests of higher-order thinking skills (for all but elementary school math). Across most grades and subjects, Model 1 was the least reliable.
Increasing Accuracy, Reducing Mistakes

When high-stakes decisions must be made, can these measures support them? Undoubtedly, that question will be repeated in school board meetings and in faculty break rooms around the country in the coming years.

The answer is yes, not because the measures are perfect (they are not), but because the combined measure is better on virtually every dimension than the measures in use now. There is no way to avoid the stakes attached to every hiring, retention, and pay decision. And deciding not to make a change is, after all, a decision. No measure is perfect, but better information should support better decisions.

In our report Gathering Feedback for Teaching, we compared the equally weighted measure (Model 3 in Figures 3 and 4) to two indicators that are almost universally used for pay or retention decisions today: teaching experience and possession of a master's degree. On every student outcome, including the state tests, supplemental tests, and students' self-reported level of effort and enjoyment in class, the teachers who excelled on the composite measure had better outcomes than those with high levels of teaching experience or a master's degree.

In addition, many districts currently require classroom observations, but they do not include student surveys or achievement gains. We tested whether observations alone are enough. Even with four full classroom observations (two by one observer and two by another), conducted by observers trained and certified by the Educational Testing Service, the observation-only model performed far worse than any of our multiple measures composites. (The correlations comparable to those in Figure 5 would have been .14 and .25 with the state tests and test of higher-order skills.)

Still, it is fair to ask, what might be done to reduce error? Many steps have been discussed in this and other reports from the project:

First, if any type of student data is to be used, either from tests or from student surveys, school systems should give teachers a chance to correct errors in their student rosters.

Second, classroom observers should not only be trained on the instrument. They should first demonstrate their accuracy by scoring videos or observing a class with a master observer.

Third, observations should be done by more than one observer. A principal's observation is not enough. To ensure reliability, it is important to involve at least one other observer, either from inside or outside the school.

Fourth, if multiple years of data on student achievement gains, observations, and student surveys are available, they should be used. For novice teachers and for systems implementing teacher evaluations for the first time, there may be only a single year available. We have demonstrated that a single year contains information worth acting on. But the information would be even better if it included multiple years. When multiple years of data are available they should be averaged (although some systems may choose to weight recent years more heavily).
While covering less material than state tests, the SAT 9 OE and BAM assessments include more cognitively challenging items that require writing, analysis, and application of concepts, and they are meant to assess higher-order thinking skills. Sample items released by the assessment consortia for the new Common Core State Standards assessments are more similar to the items on these supplemental tests than to the ones on the state assessments. Shown in Figure 4 is the effect of reducing the weight on state test gains in predicting gains on these other assessments, again for middle school ELA. For most grades and subjects, Model 2 and Model 3 (50 percent state test and equal weights for all three measures) best predicted teachers' student achievement gains on these supplemental assessments, with little difference between the two models. The one exception was elementary school math, where Model 1 (best predictor) was best.
Table 1

CALCULATED WEIGHTS FOR MAXIMUM ACCURACY IN PREDICTING GAINS ON STATE TESTS

             English Language Arts                      Math
             State Tests  Observations  Surveys        State Tests  Observations  Surveys
Elementary   65%          9%            25%            85%          5%            11%
Middle       81%          2%            17%            91%          4%            5%

RELIABILITY AND ACCURACY OF DIFFERENT WEIGHTING SCHEMES

English Language Arts
                                       Weighted for   50% state   Equal     50%
                                       max accuracy   test        weights   observations
Elementary   Reliability               0.42           0.46        0.50      0.49
             Corr. with state test     0.61           0.59        0.53      0.45
             Corr. with higher-order   0.35           0.37        0.37      0.35
Middle       Reliability               0.51           0.66        0.76      0.75
             Corr. with state test     0.69           0.63        0.53      0.43
             Corr. with higher-order   0.29           0.34        0.33      0.32

Math
                                       Weighted for   50% state   Equal     50%
                                       max accuracy   test        weights   observations
Elementary   Reliability               0.52           0.57        0.57      0.55
             Corr. with state test     0.72           0.65        0.54      0.46
             Corr. with higher-order   0.31           0.29        0.25      0.20
Middle       Reliability               0.86           0.88        0.88      0.83
             Corr. with state test     0.92           0.84        0.73      0.65
             Corr. with higher-order   0.38           0.44        0.45      0.45

The third set of bars in Figure 4 compares composites on their reliability, that is, the extent to which the composite would produce consistent results for the same teachers from year to year (on a scale from 0 to 1.0, with 1.0 representing perfect consistency and no volatility). Again, results shown are for middle school ELA. Across all grades and subjects, the most reliable composites were either Model 2 (50 percent state test) or Model 3 (equal weights).
For all but middle school math, the least reliable composite was Model 1 (best predictor). Model 4 (50 percent observations) was somewhat less reliable than Model 3 (equal weights) for all grades and subjects. Although not shown, student achievement gains on state tests by themselves are less stable than all of the composites, with one exception: Model 4 (50 percent observations) is slightly less stable than achievement gains alone for middle school math.
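Each weighting scheme compared above reduces to a weighted sum of a teacher's standardized scores on the three measures. The sketch below is illustrative: the teacher's scores are hypothetical, Model 1 borrows the middle school ELA weights from Table 1, and the brief does not say how Model 2 splits its remaining 50 percent, so an even split between observations and surveys is assumed here.

```python
# Hypothetical standardized scores on each measure for one teacher.
measures = {"state_test_gain": 0.30, "observation": 0.10, "student_survey": 0.20}

# Weighting models discussed in the brief. Model 1 uses the middle school
# ELA weights from Table 1; Model 2's non-state-test split is an assumption.
models = {
    "Model 1 (max state-test accuracy)": {"state_test_gain": 0.81, "observation": 0.02, "student_survey": 0.17},
    "Model 2 (50% state test)":          {"state_test_gain": 0.50, "observation": 0.25, "student_survey": 0.25},
    "Model 3 (equal weights)":           {"state_test_gain": 1 / 3, "observation": 1 / 3, "student_survey": 1 / 3},
    "Model 4 (50% observations)":        {"state_test_gain": 0.25, "observation": 0.50, "student_survey": 0.25},
}

def composite(scores, weights):
    """Weighted sum of a teacher's standardized measure scores."""
    return sum(weights[m] * scores[m] for m in scores)

for name, weights in models.items():
    print(f"{name}: {composite(measures, weights):.3f}")
```

Shifting weight among the models changes only these multipliers; the trade-offs in Table 1 come from how reliable and predictive each underlying measure is, not from the arithmetic itself.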
General Implications

The intent of this analysis was not to recommend an ideal set of weights to use in every circumstance. Rather, our goal was to describe the trade-offs among different approaches.8
If the goal is to predict gains on state tests, then the composites that put 65 percent or more of the weight on the student achievement gains on those tests will generally show the greatest accuracy. However, reducing the weight on the state test achievement gain measures to 50 percent or 33 percent generates two positive trade-offs: it increases stability (lessens volatility from year to year), and it somewhat increases the correlation with tests other than the state tests.

However, it is possible to go too far. Lowering the weight on state test achievement gains below 33 percent, raising the weight on observations to 50 percent, and including student surveys at 25 percent is counterproductive. It not only lowers the
correlation with state achievement gains; it can also lower reliability and the correlation with other types of testing outcomes.

Ultimately, states, local education authorities, and other stakeholders need to decide how to weight the measures in a composite. Our data suggest that assigning 50 percent or 33 percent of the weight to state test results maintains considerable predictive power, increases reliability, and potentially avoids the unintended negative consequences of assigning too-heavy weights to a single measure. Removing too much weight from state tests, however, may not be a good idea, given the lower predictive power and reliability of Model 4 (25 percent state tests). In short, there is a range of reasonable weights for a composite of multiple measures.
Validity and Content Knowledge for Teaching
Teachers shouldn't be asked to expend effort to improve something that doesn't help them achieve better outcomes for their students. If a measure is to be included in formal evaluation, then it should be shown that teachers who perform better on that measure are generally more effective in improving student outcomes. This test for validity has been central to the MET project's analyses. Measures that have passed this test include high-quality classroom observations, well-designed student-perception surveys, and teachers' prior records of student achievement gains on state tests.
Over the past year, MET project researchers have investigated another type of measure, called the Content Knowledge for Teaching (CKT) tests. These are meant to assess teachers' understanding of how students acquire and understand subject-specific skills and concepts in math and ELA. Developed by the Educational Testing Service and researchers at the University of Michigan, these tests are among the newest measures of teaching included in the MET project's analyses. Mostly multiple choice, the questions ask how best to represent ideas to students, assess student understanding, and determine sources of students' confusion.
The CKT tests studied by the MET project did not pass our test for validity. MET project teachers who performed better on the CKT tests were not substantively more effective in improving student achievement on the outcomes we measured. This was true whether student achievement was measured using state tests or the supplemental assessments of higher-order thinking skills. For this reason, the MET project did not include CKT results within its composite measure of effective teaching.
These results, however, speak to the validity of the current measure, still early in its development, in predicting achievement gains on particular student assessments, not to the importance of content-specific pedagogical knowledge. CKT as a concept remains promising. The teachers with higher CKT scores did seem to have somewhat higher scores on two subject-based classroom observation instruments: the Mathematical Quality of Instruction (MQI) and the Protocol for Language Arts Teacher Observations (PLATO). Moreover, the MET project's last report suggested that some content-specific observation instruments were better than cross-subject ones at identifying teachers who were more effective in improving student achievement in ELA and math. Researchers will continue to develop measures for assessing teachers' content-specific teaching knowledge and to validate them as states create new assessments aligned to the Common Core State Standards. Once such measures have been shown to be substantively related to achievement gains among a teacher's students, they should be considered for inclusion as part of a composite measure of effective teaching.
How Can Teachers Be Assured Trustworthy Results from Classroom Observations?9

Classroom observations can be powerful tools for professional growth. But for observations to be of value, they must reliably reflect what teachers do throughout the year, as opposed to the subjective impressions of a particular observer or some unusual aspect of a particular lesson. Teachers need to know they are being observed by the right people, with the right skills, and a sufficient number of times to produce trustworthy results. Given this, the challenge for school systems is to make the best use of resources to provide teachers with high-quality feedback to improve their practice.
The MET project's report Gathering Feedback for Teaching showed the importance of averaging together multiple observations from multiple observers to boost reliability. Reliability represents the extent to which results reflect consistent aspects of a teacher's practice, as opposed to other factors such as observer judgment. We also stressed that observers must be well trained and assessed for accuracy before they score teachers' lessons.
But there were many practical questions the MET project couldn't answer in its previous study. Among them:

Can school administrators reliably assess the practice of teachers in their schools?

Can additional observations by external observers not familiar with a teacher increase reliability?

Must all observations involve viewing the entire lesson, or can partial lessons be used to increase reliability?

And, what is the incremental benefit of adding additional lessons and additional observers?
These questions came from our partners, teachers, and administrators in urban school districts. In response, with the help of a partner district, the Hillsborough County (Fla.) Public Schools, the MET project added a study of classroom observation reliability. This study engaged district administrators and teacher experts to observe video-recorded lessons of 67 Hillsborough County teachers who agreed to participate.

Hillsborough County's Classroom Observation Instrument

Like many school districts, Hillsborough County uses an evaluation instrument adapted from the Framework for Teaching, developed by Charlotte Danielson. The framework defines four levels of performance for specific competencies in four domains of practice. Two of those domains pertain to activities outside the classroom: Planning and Preparation, and Professional Responsibility. Observers rated teachers on the 10 competencies in the framework's two classroom-focused domains, as shown:

Domain 2: The Classroom Environment
Creating an Environment of Respect and Rapport
Establishing a Culture of Learning
Managing Classroom Procedures
Managing Student Behavior
Organizing Physical Space

Domain 3: Instruction
Communicating with Students
Using Discussion and Questioning Techniques
Engaging Students in Learning
Using Assessment in Instruction
Demonstrating Flexibility and Responsiveness
Comparison of Ratings

Two types of observers took part in the study: Fifty-three were school-based administrators, either principals or assistant principals, and 76 were peer observers. The latter are district-based positions filled by teachers on leave from the classroom who are responsible for observing and providing feedback to teachers in multiple schools. In Hillsborough County's evaluation system, teachers are observed multiple times, formally and informally, by their administrators and by peer observers. Administrators and peers are trained and certified in the district's observation instrument, which is based on Charlotte Danielson's Framework for Teaching.
These observers each rated 24 lessons for us and produced more than 3,000 ratings that we could use to investigate our questions. MET project researchers were able to calculate reliability for many combinations of observers (administrator and peer), lessons (from 1 to 4), and observation duration (full lesson or 15 minutes). We were able to compare differences in the ratings given to teachers' lessons by their own and unknown administrators, and between administrators and peers.
Effects on Reliability

Figure 5 graphically represents many of the key findings from our analyses of those ratings. Shown are the estimated reliabilities for results from a given set of classroom observations. Reliability is expressed on a scale from 0 to 1. A higher number indicates that results are more attributable to the particular teacher, as opposed to other factors such as the particular observer or lesson. When results for the same teachers vary from lesson to lesson or from observer to observer, averaging teachers' ratings across multiple lessons or observers decreases the amount of error due to such factors, and it increases reliability.
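The logic of that last sentence can be sketched with a toy variance model. In this sketch, which is a simplification and not the MET project's actual estimator, reliability is the share of score variance attributable to the teacher, and averaging over independent lessons and observers divides those error components by their counts. The variance components below are hypothetical.

```python
def reliability(var_teacher, var_lesson, var_observer, n_lessons, n_observers):
    """Illustrative variance model: averaging over independent lessons and
    observers shrinks each error component by its count."""
    error = var_lesson / n_lessons + var_observer / n_observers
    return var_teacher / (var_teacher + error)

# Hypothetical variance components, scaled so that one lesson scored by
# one observer has reliability 0.50.
vt, vl, vo = 1.0, 0.5, 0.5
print(reliability(vt, vl, vo, 1, 1))  # one lesson, one observer
print(reliability(vt, vl, vo, 2, 1))  # second lesson, same observer
print(reliability(vt, vl, vo, 2, 2))  # second lesson, different observer
```

Under these hypothetical components the pattern mirrors Figure 5: a second lesson scored by the same observer helps a little, while having a different observer score the second lesson helps considerably more.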
Adding lessons and observers increases the reliability of classroom observations. In our estimates, if a teacher's results are based on two lessons, having the second lesson scored by a second observer can boost reliability significantly. This is shown in Figure 5: When the same administrator observes a second lesson, reliability increases from .51 to .58, but when the second lesson is observed by a different administrator from the same school, reliability increases more than twice as much, from .51 to .67. Whenever a given number of lessons was split among multiple observers, the reliability was greater than that achieved by a single observer. In other words, for the same total number of observations, incorporating additional observers increases reliability.
Of course, it would be a problem if school administrators and peer observers produced vastly different results for the same teachers. But we didn't find that to be the case. Although administrators gave higher scores to their own teachers, their rankings of their own teachers were similar to those produced by peer observers and administrators from other schools. This implies that administrators are seeing the same things in the videos that others do, and they are not being swayed by personal biases.

Figure 5: There Are Many Roads to Reliability. These bars show how the number of observations and observers affects reliability. Reliability represents the extent to which the variation in results reflects consistent aspects of a teacher's practice, as opposed to other factors such as differing observer judgments. Different colors represent different categories of observers: a lesson observed by the teacher's own administrator, a lesson observed by a peer observer, or three 15-minute lessons observed by three additional peer observers, each configuration equal to about 45 minutes of observation time. "A" and "B" denote different observers of the same type; the "A" and "B" in column three show that ratings were averaged from two different own-school observers. A solid circle indicates one observation of that duration, while a circle split into three indicates three 15-minute observations by three observers. Reliability rises from .51 for a single lesson scored by the teacher's own administrator, and reliabilities of .66 to .72 can be achieved in multiple ways, with different combinations of numbers of observers and observations. (For example, one observation by a teacher's administrator combined with three short, 15-minute observations, each by a different observer, would produce a reliability of .67.)
If additional observations by additional observers are important, how can the time for those added observations be divided up to maximize the use of limited resources while assuring trustworthy results? This is an increasingly relevant question as more school systems make use of video in providing teachers with feedback on their practice. Assuming multiple videos for a teacher exist, an observer could use the same amount of time to watch one full lesson or two or three partial lessons. But to consider the latter, one would want to know whether partial-lesson observations increase reliability.

Our analysis from Hillsborough County showed that observations based on the first 15 minutes of lessons were about 60 percent as reliable as full-lesson observations, while requiring one-third as much observer time. Therefore,
one way to increase reliability is to expose a given teacher's practice to multiple perspectives. Having three different observers each observe for 15 minutes may be a more economical way to improve reliability than having one additional observer sit in for 45 minutes. Our results also suggest that it is important to have at least one or two full-length observations, given that some aspects of teaching scored on the Framework for Teaching (Danielson's instrument) were frequently not observed during the first 15 minutes of class.
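As a back-of-the-envelope check on that trade-off, the Spearman-Brown formula gives the reliability of an average of several parallel ratings. The sketch below assumes the three 15-minute observations are independent and equally reliable, and it borrows the .51 single-observation figure and the 60 percent figure from the text; the MET project's own estimates were computed differently, so this is an illustration, not a reproduction of the study's numbers.

```python
def spearman_brown(r_single, k):
    """Reliability of the average of k parallel, independent ratings,
    each with single-rating reliability r_single (Spearman-Brown formula)."""
    return k * r_single / (1 + (k - 1) * r_single)

# From the text: one full lesson scored by one observer ~ .51 reliability,
# and a 15-minute observation is about 60 percent as reliable as a full one.
r_full = 0.51
r_short = 0.6 * r_full

# Two ways to spend the same 45-minute observer-time budget.
print(f"one 45-minute observation:    {r_full:.2f}")
print(f"three 15-minute observations: {spearman_brown(r_short, 3):.2f}")
```

Under these assumptions, three short observations by different observers edge out a single full-length one for the same 45 minutes, consistent with the paragraph above.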
Together, these results provide a range of scenarios for achieving reliable classroom observations. There is a point where both additional observers and additional observations do little to reduce error. Reliability above 0.65 can be achieved with several configurations (see Figure 5).
Implications for Districts

Ultimately, districts must decide how to allocate time and resources to classroom observations. The answers to the questions of how many lessons, of what duration, and conducted by whom are informed by reliability considerations, as well as by other relevant factors, such as novice teacher status, prior effectiveness ratings, and a district's overall professional development strategy.
What We Know Now

In three years we have learned a lot about how multiple measures can identify effective teaching and the contribution that teachers can make to student learning. The goal is for such measures to inform state and district efforts to support improvements in teaching to benefit all students. Many of these lessons have already been put into practice as school systems eagerly seek out evidence-based guidance. Only a few years ago the norm for teacher evaluation was to assign satisfactory ratings to nearly all teachers evaluated while providing virtually no useful information to improve practice.10 Among the significant lessons learned through the MET project and the work of its partners:
Student perception surveys and classroom observations can provide meaningful feedback to teachers. They also can help system leaders prioritize their investments in professional development to target the biggest gaps between teachers' actual practice and the expectations for effective teaching.

Implementing specific procedures in evaluation systems can increase trust in the data and the results. These include rigorous training and certification of observers; observation of multiple lessons by different observers; and, in the case of student surveys, the assurance of student confidentiality.

Each measure adds something of value. Classroom observations provide rich feedback on practice. Student perception surveys provide a reliable indicator of the learning environment and give voice to the intended beneficiaries of instruction. Student learning gains (adjusted to account for differences among students) can help identify groups of teachers who, by virtue of their instruction, are helping students learn more.

A balanced approach is most sensible when assigning weights to form a composite measure. Compared with schemes that heavily weight one measure, those that assign 33 percent to 50 percent of the weight to student achievement gains achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than those measured by a single test.

There is great potential in using video for teacher feedback and for the training and assessment of observers. The advances made in this technology have been significant, resulting in lower costs, greater ease of use, and better quality.
The Work Ahead

As we move forward, MET project teachers are supporting the transition from research to practice. More than 300 teachers are helping the project build a video library of practice for use in professional development. They will record more than 50 lessons each by the end of this school year and make these lessons available to states, school districts, and other organizations committed to improving effective teaching. This will allow countless educators to analyze instruction and see examples of great teaching in action.

Furthermore, the unprecedented data collected by the MET project over the past three years are being made available to the larger research community to carry out additional analyses, which will increase knowledge of what constitutes effective teaching and how to support it. MET project partners already are tapping those data for new studies on observer training, combining student surveys and observations, and other practical concerns. Finally, commercially available video-based tools for observer training and certification now exist, using the lessons learned from the MET project's studies.
Many of the future lessons regarding teacher feedback and evaluation systems must necessarily come from the field, as states and districts innovate, assess the results, and make needed adjustments. This will be a significant undertaking, as systems work to better support great teaching. Thanks to the hard work of MET project partners, we have a solid foundation on which to build.
Endnotes

1. The lead authors of this brief are Steven Cantrell, Chief Research Officer at the Bill & Melinda Gates Foundation, and Thomas J. Kane, Professor of Education and Economics at the Harvard Graduate School of Education and principal investigator of the Measures of Effective Teaching (MET) project. Lead authors of the related research papers are Thomas J. Kane (Harvard), Daniel F. McCaffrey (RAND), and Douglas O. Staiger (Dartmouth). Essential support came from Jeff Archer, Sarah Buhayar, Alejandro Ganimian, Andrew Ho, Kerri Kerr, Erin McGoldrick, and David Parker. KSA-Plus Communications provided design and editorial assistance.

2. This section summarizes the analyses and key findings from the research report Have We Identified Effective Teachers? by Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. Readers who want to review the full set of findings can download that report at www.metproject.org.

3. As expected, not every student on a randomly assigned roster stayed in the classroom of the intended teacher. Fortunately, we could track those students. We estimated the effects of teachers on student achievement using a statistical technique commonly used in randomized trials called instrumental variables.

4. These predictions, as well as the average achievement outcomes, are reported relative to the average among participating teachers in the same school, grade, and subject.

5. Readers may notice that some of the differences in Figure 2 are smaller than the differences reported in earlier MET reports. Due to non-compliance (students not remaining with their randomly assigned teacher), only about 30 percent of the randomly assigned difference in teacher effectiveness translated into differences in the effectiveness of students' actual teachers. The estimates in Figure 2 are adjusted for non-compliance. If all the students had remained with their randomly assigned teachers, we would have predicted impacts roughly three times as big. Our results imply that, without non-compliance, we would have expected to see differences just as large as those included in earlier reports.

6. Other researchers have studied natural movements of teachers between schools (as opposed to randomly assigned transfers) and found no evidence of bias in estimated teacher effectiveness between schools. See Raj Chetty, John Friedman, and Jonah E. Rockoff, "The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood," working paper no. 17699, National Bureau of Economic Research, December 2011.

7. The findings highlighted in this summary and the technical details of the methods that produced them are explained in detail in the research paper A Composite Estimator of Effective Teaching, by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood. A copy may be found at www.metproject.org.

8. Different student assessments, observation protocols, and student survey instruments would likely yield somewhat different amounts of reliability and accuracy. Moreover, measures used for evaluation may produce different results than seen in the MET project, which attached no stakes to the measures it administered in the classrooms of its volunteer teachers.

9. This section summarizes key analyses and findings from the report The Reliability of Classroom Observations by School Personnel by Andrew D. Ho and Thomas J. Kane. Readers who want to review the full set of findings and methods for the analyses can download that report at www.metproject.org. The MET project acknowledges the hard work of Danni Greenberg Resnick and David Steele, of the Hillsborough County Public Schools, and the work of the teachers, administrators, and peer observers who participated in this study.

10. Weisberg, D., et al. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. Brooklyn: The New Teacher Project.
© 2013 Bill & Melinda Gates Foundation. All Rights Reserved. Bill & Melinda Gates Foundation is a registered trademark in the United States and other countries.

Bill & Melinda Gates Foundation

Guided by the belief that every life has equal value, the Bill & Melinda Gates Foundation works to help all people lead healthy, productive lives. In developing countries, it focuses on improving people's health and giving them the chance to lift themselves out of hunger and extreme poverty. In the United States, it seeks to ensure that all people, especially those with the fewest resources, have access to the opportunities they need to succeed in school and life. Based in Seattle, Washington, the foundation is led by CEO Jeff Raikes and Co-chair William H. Gates Sr., under the direction of Bill and Melinda Gates and Warren Buffett.

For more information on the U.S. Program, which works primarily to improve high school and postsecondary education, please visit www.gatesfoundation.org.
www.gatesfoundation.org