7/30/2019 MET_Ensuring Fair and Reliable Measures_Practitioner Brief
MET project

Ensuring Fair and Reliable Measures of Effective Teaching

Culminating Findings from the MET Project's Three-Year Study

POLICY AND PRACTITIONER BRIEF
ABOUT THIS REPORT: This non-technical research brief for policymakers and practitioners summarizes recent analyses from the Measures of Effective Teaching (MET) project on identifying effective teaching while accounting for differences among teachers' students, on combining measures into composites, and on assuring reliable classroom observations.1

Readers who wish to explore the technical aspects of these analyses may go to www.metproject.org to find the three companion research reports: Have We Identified Effective Teachers? by Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger; A Composite Estimator of Effective Teaching by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood; and The Reliability of Classroom Observations by School Personnel by Andrew D. Ho and Thomas J. Kane.
Earlier MET project briefs and research reports, also on the website, include:

Working with Teachers to Develop Fair and Reliable Measures of Teaching (2010). A white paper describing the rationale for and components of the MET project's study of multiple measures of effective teaching.

Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project (2010). A research report and non-technical policy brief with the same title on analysis of student-perception surveys and student achievement gain measures.

Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains (2012). A research report and policy/practitioner brief with the same title with initial findings on the reliability of classroom observations and implications for combining measures of teaching.

Asking Students about Teaching: Student Perception Surveys and Their Implementation (2012). A non-technical brief for policymakers and practitioners on the qualities of well-designed student surveys and implications for their implementation for teacher feedback and evaluation.
January 20
ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.
The approximately 3,000 MET project teachers who volunteered to open up their classrooms for this work are from the following districts: the Charlotte-Mecklenburg Schools, the Dallas Independent Schools, the Denver Public Schools, the Hillsborough County Public Schools, the Memphis Public Schools, the New York City Schools, and the Pittsburgh Public Schools.

Partners include representatives of the following institutions and organizations: American Institutes for Research, Cambridge Education, University of Chicago, The Danielson Group, Dartmouth College, Educational Testing Service, Empirical Education, Harvard University, National Board for Professional Teaching Standards, National Math and Science Initiative, New Teacher Center, University of Michigan, RAND, Rutgers University, University of Southern California, Stanford University, Teachscape, University of Texas, University of Virginia, University of Washington, and Westat.
[Map: MET Project Teachers, by district: Denver, Memphis, Hillsborough County, Dallas, Charlotte-Mecklenburg, Pittsburgh, New York City]
Contents

Executive Summary
Can Measures of Effective Teaching Identify Teachers Who Better Help Students Learn?
How Much Weight Should Be Placed on Each Measure of Effective Teaching?
How Can Teachers Be Assured Trustworthy Results from Classroom Observations?
What We Know Now
Endnotes
The Questions

Can measures of effective teaching identify teachers who better help students learn?

Despite decades of research suggesting that teachers are the most important in-school factor affecting student learning, an underlying question remains unanswered: Are seemingly more effective teachers truly better than other teachers at improving student learning, or do they simply have better students?

Ultimately, the only way to resolve that question was by randomly assigning students to teachers to see if teachers previously identified as more effective actually caused those students to learn more. That is what we did for a subset of MET project teachers. Based on data we collected during the 2009–10 school year, we produced estimates of teaching effectiveness for each teacher. We adjusted our estimates to account for student differences in prior test scores, demographics, and other traits. We then randomly assigned a classroom of students to each participating teacher for 2010–11.

Following the 2010–11 school year we asked two questions: First, did students actually learn more when randomly assigned to the teachers who seemed more effective when we evaluated them the prior year? And, second, did the magnitude of the difference in student outcomes following random assignment correspond with expectations?
How much weight should be placed on each measure of effective teaching?

While using multiple measures to provide feedback to teachers, many states and districts also are combining measures into a single index to support decisionmaking. To date, there has been little empirical evidence to inform how systems might weight each measure within a composite to support improvements in teacher effectiveness. To help fill that void, we tasked a group of our research partners to use data from MET project teachers to build and compare composites using different weights and different outcomes.
How can teachers be assured trustworthy results from classroom observations?

Our last report on classroom observations prompted numerous questions from practitioners about how to best use resources to produce quality information for feedback on classroom practice. For example: How many observers are needed to achieve sufficient reliability from a given number of observations? Do all observations need to be the same length to have confidence in the results? And what is the value of adding observers from outside a teacher's own school?

To help answer these questions, we designed a study in which administrators and peer observers produced more than 3,000 scores for lessons taught by teachers within one MET project partner school district.
Key findings from those analyses:

1. Effective teaching can be measured. We collected measures of teaching during 2009–10. We adjusted those measures for the backgrounds and prior achievement of the students in each class. But, without random assignment, we had no way to know if the adjustments we made were sufficient to discern the markers of effective teaching from the unmeasured aspects of students' backgrounds.

Feedback and evaluation systems depend on trustworthy information about teaching effectiveness to support improvement in teachers' practice and better outcomes for students.
In fact, we learned that the adjusted measures did identify teachers who produced higher (and lower) average student achievement gains following random assignment in 2010–11. The data show that we can identify groups of teachers who are more effective in helping students learn. Moreover, the magnitude of the achievement gains that teachers generated was consistent with expectations.

In addition, we found that more effective teachers not only caused students to perform better on state tests, but they also caused students to score higher on other, more cognitively challenging assessments in math and English.
2. Balanced weights indicate multiple aspects of effective teaching. A composite with weights between 33 percent and 50 percent assigned to state test scores demonstrated the best mix of low volatility from year to year and ability to predict student gains on multiple assessments. The composite that best indicated improvement on state tests heavily weighted teachers' prior student achievement gains based on those same tests. But composites that assigned 33 percent to 50 percent of the weight to state tests did nearly as well and were somewhat better at predicting student learning on more cognitively challenging assessments.

Multiple measures also produce more consistent ratings than student achievement measures alone. Estimates of teachers' effectiveness are more stable from year to year when they combine classroom observations, student surveys, and measures of student achievement gains than when they are based solely on the latter.
3. Adding a second observer increases reliability significantly more than having the same observer score an additional lesson. Teachers' observation scores vary more from observer to observer than from lesson to lesson. Given the same total number of observations, including the perspectives of two or more observers per teacher greatly enhances reliability. Our study of video-based observation scoring also revealed that:

a. Additional shorter observations can increase reliability. Our analysis suggests that having additional observers watch just part of a lesson may be a cost-effective way to boost reliability by including additional perspectives.

b. Although school administrators rate their own teachers somewhat higher than do outside observers, how they rank their teachers' practice is very similar, and teachers' own administrators actually discern bigger differences in teaching practice, which increases reliability.

c. Adding observations by observers from outside a teacher's school to those carried out by a teacher's own administrator can provide an ongoing check against in-school bias. This could be done for a sample of teachers rather than all, as we said in Gathering Feedback for Teaching.
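The arithmetic behind finding 3 can be made concrete with a simple variance-components calculation: the error in a teacher's average score has an observer component, which shrinks only as observers are added, and a lesson component, which shrinks with the total number of observations. The variance shares below are made-up round numbers for illustration, not the study's estimates:

```python
def reliability(n_observers: int, lessons_per_observer: int,
                var_teacher: float = 0.40, var_observer: float = 0.35,
                var_lesson: float = 0.25) -> float:
    """Reliability of a teacher's mean observation score:
    true (teacher) variance over true-plus-error variance."""
    n_total = n_observers * lessons_per_observer
    # Observer error averages over observers; lesson error over all observations
    error = var_observer / n_observers + var_lesson / n_total
    return var_teacher / (var_teacher + error)

# Same total of four observations, allocated two ways:
print(reliability(1, 4))  # one observer scores four lessons
print(reliability(4, 1))  # four observers score one lesson each: higher reliability
```

Because the observer component never shrinks when a single observer scores more lessons, spreading the same four observations across four observers yields the larger gain, which is the pattern the study reports.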
The following pages further explain these findings and the analyses that produced them.
Can Measures of Effective Teaching Identify Teachers Who Better Help Students Learn?2

By definition, teaching is effective when it enables student learning. But identifying effective teaching is complicated by the fact that teachers often have very different students. Students start the year with different achievement levels and different needs. Moreover, some teachers tend to get particular types of students year after year (that is, they tend to get higher-performing or lower-performing ones). This is why so-called value-added measures attempt to account for differences in the measurable characteristics of a teacher's students, such as prior test scores and poverty.
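The adjustment behind a value-added measure is, mechanically, a regression: predict end-of-year scores from measurable student characteristics, then credit each teacher with the average leftover. A minimal sketch on simulated data; the variable names, coefficients, and data are our illustrative assumptions, not MET project estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_teachers = 1000, 50

# Simulated roster: each student has a teacher, a prior score, and a poverty flag
teacher = rng.integers(0, n_teachers, n_students)
true_effect = rng.normal(0.0, 0.15, n_teachers)   # hypothetical teacher effects (SD units)
prior = rng.normal(0.0, 1.0, n_students)          # prior-year test score
poverty = rng.integers(0, 2, n_students)          # e.g., a subsidized-lunch indicator
score = (0.7 * prior - 0.1 * poverty
         + true_effect[teacher] + rng.normal(0.0, 0.5, n_students))

# Step 1: regress end-of-year scores on the measurable student characteristics
X = np.column_stack([np.ones(n_students), prior, poverty])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
residual = score - X @ beta

# Step 2: a teacher's value-added estimate is the mean residual of her students
value_added = np.array([residual[teacher == t].mean() for t in range(n_teachers)])
print(np.corrcoef(value_added, true_effect)[0, 1])  # estimates track the simulated effects
```

The brief's central question is whether such adjusted estimates still reflect what the regression cannot see; in this simulation nothing is hidden, which is exactly why the MET project needed random assignment with real students.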
However, students differ in other ways, such as behavior and parental involvement, which we typically cannot account for in determining teaching effectiveness. If those unaccounted-for differences also affect student learning, then what seems like effective teaching may actually reflect unmeasured characteristics of a teacher's students. The only way to know if measures of teaching truly identify effective teaching, and not some unmeasured student characteristics, is by randomly assigning teachers to students. So we did.
In 2009–10, we measured teachers' effectiveness using a combined measure comprising teachers' classroom observation results, student perception survey responses, and student achievement gains adjusted for student characteristics, such as prior performance and demographics. The following year (2010–11), we randomly assigned different rosters of students to two or more MET project teachers who taught the same grade and subject in the same school. Principals created rosters and the RAND Corporation assigned them randomly to teachers (see Figure 1). Our aim was to determine if the students who were randomly assigned to teachers who previously had been identified as more effective actually performed better at the end of the 2010–11 school year.3
They did. On average, the 2009–10 composite measure of effective teaching accurately predicted 2010–11 student performance. The research confirmed that, as a group, teachers previously identified as more effective caused students to learn more. Groups of teachers who had been identified as less effective caused students to learn less. We can say they caused more (or less) student learning because when we randomly assigned teachers to students during the second year, we could be confident that any subsequent differences in achievement were being driven by the teachers, not by the unmeasured characteristics of their students. In addition, the magnitude of the gains they caused was consistent with our expectations.

Teachers previously identified as more effective caused students to learn more. Groups of teachers who had been identified as less effective caused students to learn less.
Figure 2 illustrates just how well the measures of effective teaching predicted student achievement following random assignment. The diagonal line represents perfect prediction. Dots above the diagonal line indicate groups of teachers whose student outcomes following random assignment were better than predicted. Dots below the line indicate groups of teachers whose student outcomes following random assignment were worse than predicted. Each dot represents 5 percent of the teachers in the analysis, sorted based on their predicted impact on student achievement.4

Figure 1
Putting Measures of Effective Teaching to the Test with Random Assignment

Do measures of teaching really identify teachers who help students learn more, or do seemingly more effective teachers just get better students? To find out, the MET project orchestrated a large-scale experiment with MET project teachers to see if teachers identified as more effective than their peers would have greater student achievement gains even with students who were assigned randomly.

To do so, the MET project first estimated teachers' effectiveness using multiple measures from the 2009–10 school year. As is common in schools, some teachers had been assigned students with stronger prior achievement than others. In assessing each teacher's practice that year, the project controlled for students' prior achievement and demographic characteristics. But there may have been other differences among students as well. So for the following school year (2010–11), principals created rosters of students for each class in the study, and then researchers randomly assigned each roster to a participating teacher from among those who could teach the class.

At the end of the 2010–11 school year, MET project analysts checked to see if students taught by teachers identified as more effective than their colleagues actually had greater achievement gains than students taught by teachers identified as less effective. They also checked to see how well actual student achievement gains for teachers matched predicted gains.
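The dots in Figure 2 follow a standard binned-scatter recipe: sort teachers by predicted impact, cut the sorted list into 5 percent groups, and compare each group's mean predicted gain with its mean actual gain. A sketch on simulated values; the data and magnitudes are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 400
predicted = rng.normal(0.0, 0.10, n_teachers)            # predicted impact (SD units)
actual = predicted + rng.normal(0.0, 0.08, n_teachers)   # actual gain after random assignment

# Sort by predicted impact and split into twenty 5 percent groups
order = np.argsort(predicted)
groups = np.array_split(order, 20)
dots = np.array([(predicted[g].mean(), actual[g].mean()) for g in groups])

# Dots hugging the 45-degree line mean the predictions were accurate on average
for p, a in dots[:3]:
    print(f"predicted {p:+.3f}  actual {a:+.3f}")
```

Averaging within groups smooths away individual-teacher noise, which is why a dot can sit on the line even though individual predictions within the group miss in both directions.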
As seen in Figure 2, in both math and English language arts (ELA), the groups of teachers with greater predicted impacts on student achievement generally had greater actual impacts on student achievement following random assignment. Further, the actual impacts are approximately in line with the predicted impacts.5 We also found that teachers who we identified as being effective in promoting achievement on the state tests also generated larger gains on the supplemental tests administered in spring 2011.

Based on our analysis, we can unambiguously say that school systems should account for the prior test scores of students. When we removed this control, we wound up predicting much larger differences in achievement than actually occurred, indicating that student assignment biased the results. However, our analysis could not shed as much light on the need to control for demographics or peer effects, that is, the average prior achievement and demographics of each student's classmates. Although we included those
Figure 2
Effectiveness Measures Identify Teachers Who Help Students Learn More

[Charts: Actual and Predicted Achievement of Randomized Classrooms, one panel for Math and one for English Language Arts. Each plots actual achievement against predicted achievement, in standard deviations, with a dashed "Actual = Predicted" line.]

These charts compare the actual 2010–11 school year achievement gains for randomly assigned classrooms with the results that were predicted based on the earlier measures of teaching effectiveness. Each dot represents the combination of actual and estimated student performance for 5 percent of the teachers in the study, grouped by the teachers' estimated effectiveness. The dashed line shows where the dots would be if the actual and predicted gains matched perfectly.

On average, students of teachers with higher teacher effectiveness estimates outperformed students of teachers with lower teacher effectiveness estimates. Moreover, the magnitude of students' actual gains largely corresponded with gains predicted by their effectiveness measured the previous year. Both the actual and predicted achievement are reported relative to the mean in the randomization block. That is, a zero on either axis implies that the value was no different from the mean for the small group of teachers in a grade, subject, and school within which class lists were randomized.

Impacts are reported in student-level standard deviations. A .25 standard deviation difference is roughly equivalent to a year of schooling. The predicted impacts are adjusted downward to account for incomplete compliance with randomization.
controls, we cannot determine from our evidence whether school systems should include them. Our results were ambiguous on that score.

To avoid over-interpretation of these results, we hasten to add two caveats: First, a prediction can be correct on average but still be subject to measurement error. Our predictions of students' achievement following random assignment were correct on average, but within every group there were some teachers whose students performed better than predicted and some whose students performed worse. Second, we could not, as a practical matter, randomly assign students or teachers to a different school site. As a result, our study does not allow us to investigate bias in teacher effectiveness measures arising from student sorting between different schools.6
Nonetheless, our analysis should give heart to those who have invested considerable effort to develop practices and policies to measure and support effective teaching. Through this large-scale study involving random assignment of teachers to students, we are confident that we can identify groups of teachers who are comparatively more effective than their peers in helping students learn. Great teaching does make a difference.

We can unambiguously say that school systems should adjust their achievement gain measures to account for the prior test scores of students. When we removed this control, we wound up predicting much larger differences in achievement than actually occurred.
How Much Weight Should Be Placed on Each Measure of Effective Teaching?7

Teaching is too complex for any single measure of performance to capture it accurately. Identifying great teachers requires multiple measures. While states and districts embrace multiple measures for targeted feedback, many also are combining measures into a single index, or composite. An index or composite can be a useful summary of complex information to support decisionmaking. The challenge is to combine measures in ways that support effective teaching while avoiding such unintended consequences as too narrow a focus on one aspect of effective teaching.
To date, there has been little empirical evidence to suggest a rationale for particular weights. The MET project's report Gathering Feedback for Teaching showed that equally weighting three measures, including achievement gains, did a better job predicting teachers' success (across several student outcomes) than teachers' years of experience and master's degrees. But that work did not attempt to determine optimal weights for composite measures.

Over the past year, a team of MET project researchers from the RAND Corporation and Dartmouth College used MET project data to compare differently weighted composites and study the implications of different weighting schemes for different outcomes. As in the Gathering Feedback for Teaching report, these composites included student achievement gains based on state assessments, classroom observations, and student surveys. The researchers estimated the ability of variously weighted composites to produce consistent results and accurately forecast teachers' impact on student achievement gains on different types of tests.

The goal was not to suggest a specific set of weights but to illustrate the trade-offs involved when choosing weights. Assigning significant weight to one measure might yield the best predictor of future performance on that measure. But heavily weighting a single measure may incentivize teachers to focus too narrowly on a single aspect
of effective teaching and neglect its other important aspects. For example, a singular focus on state tests could displace gains on other harder-to-measure outcomes. Moreover, if the goal is for students to meet a broader set of learning objectives than are measured by a state's tests, then too heavily weighting that test could make it harder to identify teachers who are producing other valued outcomes.
Composites Compared

The research team compared four different weighting models, illustrated in Figure 3: (Model 1) the best predictor of state achievement test gains (with weights calculated to maximize the ability to predict teachers' student achievement gains on state tests, resulting in 65+ percent of the weight being placed on the student achievement gains across grades and subjects); (Model 2) a composite that assigned 50 percent of the weight to students' state achievement test gains; (Model 3) a composite that applied equal weights to each measure; and (Model 4) one that gave 50 percent to observation ratings and 25 percent each to achievement gains and student surveys. The best-predictor weights shown for Model 1 in Figure 3 were calculated to predict gains on state ELA tests at the middle school level; that model assigns a whopping 81 percent of the weight to prior gains on the same tests (best-predictor weights for other grades and subjects are in Table 1).
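Mechanically, each composite is a weighted sum of standardized component scores. The following sketch applies the Figure 3 weights; the helper function and the simulated inputs are our illustration, not the project's actual estimation procedure:

```python
import numpy as np

def composite(state_gain, observations, surveys, weights):
    """Weighted sum of standardized measures.
    `weights` = (state tests, observations, surveys), summing to 1."""
    z = lambda x: (x - x.mean()) / x.std()   # put each measure on a common scale
    w_state, w_obs, w_survey = weights
    return w_state * z(state_gain) + w_obs * z(observations) + w_survey * z(surveys)

# The four weighting schemes in Figure 3 (state tests, observations, student surveys)
MODELS = {
    1: (0.81, 0.02, 0.17),       # best predictor, middle school ELA
    2: (0.50, 0.25, 0.25),       # 50% state test results
    3: (1 / 3, 1 / 3, 1 / 3),    # equal weights
    4: (0.25, 0.50, 0.25),       # 50% observations
}

rng = np.random.default_rng(2)
state, obs, survey = rng.normal(size=(3, 100))   # stand-ins for real teacher-level measures
scores = {m: composite(state, obs, survey, w) for m, w in MODELS.items()}
```

Standardizing before weighting matters: without it, whichever measure happens to have the largest variance would dominate the index regardless of the nominal weights.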
Figure 4 compares the different weighting schemes on three criteria, using middle school ELA as an example (see Table 1 for other grades and subjects). The first is predicting teachers' student achievement gains on state assessments. A correlation of 1.0 would indicate perfect accuracy in
Heavily weighting a single measure may incentivize teachers to focus too narrowly on a single aspect of effective teaching and neglect its other important aspects. ... [If] the goal is for students to meet a broader set of learning objectives than are measured by a state's tests, then too heavily weighting that test could make it harder to identify teachers who are producing other valued outcomes.
Figure 3
Four Ways to Weight

[Pie charts: four composite weighting models combining achievement gains on state tests, classroom observations, and student surveys. Model 1 (weighted for maximum accuracy in predicting gains on state tests*): 81% state tests, 2% observations, 17% student surveys. Model 2 (50% weight on state test results): 50% state tests, 25% observations, 25% student surveys. Model 3 (equal weights): 33% each. Model 4 (50% weight on observations): 25% state tests, 50% observations, 25% student surveys.]

*Weights shown for Model 1 were calculated to best predict gains on state tests for middle school English language arts. Similar best-predictor weights for other grades and subjects are in Table 1.

These charts illustrate four ways to construct a composite measure of effective teaching. Each model uses different weights but includes the same components: student achievement gains on the state tests, student perception surveys, and classroom observations. Model 1 uses the weights that would best predict a teacher's impact on state test scores. Across grades and subjects, the best predictor model assigns 65 percent or more of the weight to a teacher's prior state test gains. Models 2–4 are not based on maximizing any particular outcome. They approximate different weighting schemes used by states and districts, with each model placing progressively less weight on student achievement gains on state tests.
predicting teachers' student achievement gains on state tests. By definition, the best composite in this regard is Model 1, the model weighted for maximizing accuracy on state test results. Models 2–4 show the effect of reducing weights on student achievement gains on state tests for middle school ELA. As shown for middle school ELA, reducing weights on student achievement gains decreases the power to predict future student achievement gains on state tests from 0.69 to 0.63 with Model 2; to 0.53 with Model 3; and to 0.43 with Model 4. Other grades and subjects showed similar patterns, as indicated in Table 1.

While it is true that the state tests are limited and that schools should value other outcomes, observations and student surveys may not be more correlated with those other outcomes than the state tests. As a result, we set out to test the strength of each model's correlation with another set of test outcomes. The middle set of bars in Figure 4 compares the four models (see Figure 3), each using state test results to measure achievement gains, on how well they would predict teachers' student achievement gains on supplemental tests that were administered in MET project teachers' classrooms: the SAT 9 Open-Ended Reading Assessment (SAT 9 OE) and the Balanced Assessment in Mathematics (BAM).
Figure 4
Trade-Offs from Different Weighting Schemes: Middle School English Language Arts

[Bar charts comparing Models 1–4 on three criteria. Correlation with state test gains: 0.69, 0.63, 0.53, 0.43. Correlation with higher-order tests: 0.29, 0.34, 0.33, 0.32. Reliability: 0.51, 0.66, 0.76, 0.75.]

These bars compare the four weighting schemes in Figure 3 on three criteria: accuracy in predicting teachers' achievement gains on state tests; accuracy in predicting student achievement gains on supplemental assessments designed to test higher-order thinking skills; and reliability, reflecting the year-to-year stability of teachers' results. Shown are the results for middle school ELA (see Table 1 for results for other grades and subjects).

As indicated, Model 2 (50 percent state test results) and Model 3 (33 percent state tests) achieve much of the same predictive power as Model 1 (the best predictor of state test results) in anticipating teachers' future state test results. Model 4 (50 percent observation) is considerably less predictive. However, the figures also illustrate two other trade-offs. Models 2 and 3 also are somewhat better than Model 1 at predicting gains on the tests of higher-order thinking skills (for all but elementary school math). Across most grades and subjects, Model 1 was the least reliable.
Increasing Accuracy, Reducing Mistakes

When high-stakes decisions must be made, can these measures support them? Undoubtedly, that question will be repeated in school board meetings and in faculty break rooms around the country in the coming years.

The answer is yes, not because the measures are perfect (they are not), but because the combined measure is better on virtually every dimension than the measures in use now. There is no way to avoid the stakes attached to every hiring, retention, and pay decision. And deciding not to make a change is, after all, a decision. No measure is perfect, but better information should support better decisions.

In our report Gathering Feedback for Teaching, we compared the equally weighted measure (Model 3 in Figures 3 and 4) to two indicators that are almost universally used for pay or retention decisions today: teaching experience and possession of a master's degree. On every student outcome, including the state tests, supplemental tests, and students' self-reported level of effort and enjoyment in class, the teachers who excelled on the composite measure had better outcomes than those with high levels of teaching experience or a master's degree.

In addition, many districts currently require classroom observations, but they do not include student surveys or achievement gains. We tested whether observations alone are enough. Even with four full classroom observations (two by one observer and two by another), conducted by observers trained and certified by the Educational Testing Service, the observation-only model performed far worse than any of our multiple measures composites. (The correlations comparable to those in Figure 5 would have been .14 and .25 with the state tests and test of higher-order skills.)

Still, it is fair to ask, what might be done to reduce error? Many steps have been discussed in this and other reports from the project:

First, if any type of student data is to be used, either from tests or from student surveys, school systems should give teachers a chance to correct errors in their student rosters.

Second, classroom observers should not only be trained on the instrument. They should first demonstrate their accuracy by scoring videos or observing a class with a master observer.

Third, observations should be done by more than one observer. A principal's observation is not enough. To ensure reliability, it is important to involve at least one other observer, either from inside or outside the school.

Fourth, if multiple years of data on student achievement gains, observations, and student surveys are available, they should be used. For novice teachers and for systems implementing teacher evaluations for the first time, there may be only a single year available. We have demonstrated that a single year contains information worth acting on. But the information would be even better if it included multiple years. When multiple years of data are available they should be averaged (although some systems may choose to weight recent years more heavily).
While covering less material than state tests, the SAT 9 OE and BAM assessments include more cognitively challenging items that require writing, analysis, and application of concepts, and they are meant to assess higher-order thinking skills. Sample items released by the assessment consortia for the new Common Core State Standards assessments are more similar to the items on these supplemental tests than to the ones on the state assessments. Shown in Figure 4 is the effect of reducing the weight on state test gains in predicting gains on these other assessments, again for middle school ELA. For most grades and subjects, Model 2 and Model 3 (50 percent state test and equal weights for all three measures) best predicted teachers' student achievement gains on these supplemental assessments, with little difference between the two models. The one exception was elementary school math, where Model 1 (best predictor) was best.
Table 1

CALCULATED WEIGHTS FOR MAXIMUM ACCURACY IN PREDICTING GAINS ON STATE TESTS

             English Language Arts                      Math
             State Tests  Observations  Surveys        State Tests  Observations  Surveys
Elementary   65%          9%            25%            85%          5%            11%
Middle       81%          2%            17%            91%          4%            5%

RELIABILITY AND ACCURACY OF DIFFERENT WEIGHTING SCHEMES

English Language Arts
                                       Weighted for   50% state   Equal     50%
                                       max accuracy   test        weights   observations
Elementary   Reliability               0.42           0.46        0.50      0.49
             Corr. with state test     0.61           0.59        0.53      0.45
             Corr. with higher-order   0.35           0.37        0.37      0.35
Middle       Reliability               0.51           0.66        0.76      0.75
             Corr. with state test     0.69           0.63        0.53      0.43
             Corr. with higher-order   0.29           0.34        0.33      0.32

Math
                                       Weighted for   50% state   Equal     50%
                                       max accuracy   test        weights   observations
Elementary   Reliability               0.52           0.57        0.57      0.55
             Corr. with state test     0.72           0.65        0.54      0.46
             Corr. with higher-order   0.31           0.29        0.25      0.20
Middle       Reliability               0.86           0.88        0.88      0.83
             Corr. with state test     0.92           0.84        0.73      0.65
             Corr. with higher-order   0.38           0.44        0.45      0.45

The third set of bars in Figure 4 compares composites on their reliability, that is, the extent to which the composite would produce consistent results for the same teachers from year to year (on a scale from 0 to 1.0, with 1.0 representing perfect consistency and no volatility). Again, results shown are for middle school ELA. Across all grades and subjects, the most reliable composites were either Model 2 (50 percent state test) or Model 3 (equal weights).
For all but middle school math, the least reliable composite was Model 1 (best predictor). Model 4 (50 percent observations) was somewhat less reliable than Model 3 (equal weights) for all grades and subjects. Although not shown, student achievement gains on state tests by themselves are less stable than all of the composites, with one exception: Model 4 (50 percent observations) is slightly less stable than achievement gains alone for middle school math.
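Each weighting scheme compared above reduces to a weighted sum of a teacher's standardized scores on the three measures. The sketch below is illustrative: the teacher's scores are hypothetical, Model 1 borrows the middle school ELA weights from Table 1, and the brief does not say how Model 2 splits its remaining 50 percent, so an even split between observations and surveys is assumed here.

```python
# Hypothetical standardized scores on each measure for one teacher.
measures = {"state_test_gain": 0.30, "observation": 0.10, "student_survey": 0.20}

# Weighting models discussed in the brief. Model 1 uses the middle school
# ELA weights from Table 1; Model 2's non-state-test split is an assumption.
models = {
    "Model 1 (max state-test accuracy)": {"state_test_gain": 0.81, "observation": 0.02, "student_survey": 0.17},
    "Model 2 (50% state test)":          {"state_test_gain": 0.50, "observation": 0.25, "student_survey": 0.25},
    "Model 3 (equal weights)":           {"state_test_gain": 1 / 3, "observation": 1 / 3, "student_survey": 1 / 3},
    "Model 4 (50% observations)":        {"state_test_gain": 0.25, "observation": 0.50, "student_survey": 0.25},
}

def composite(scores, weights):
    """Weighted sum of a teacher's standardized measure scores."""
    return sum(weights[m] * scores[m] for m in scores)

for name, weights in models.items():
    print(f"{name}: {composite(measures, weights):.3f}")
```

Shifting weight among the models changes only these multipliers; the trade-offs in Table 1 come from how reliable and predictive each underlying measure is, not from the arithmetic itself.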
General Implications

The intent of this analysis was not to recommend an ideal set of weights to use in every circumstance. Rather, our goal was to describe the trade-offs among different approaches.8
If the goal is to predict gains on state tests, then the composites that put 65 percent or more of the weight on the student achievement gains on those tests will generally show the greatest accuracy. However, reducing the weight on the state test achievement gain measures to 50 percent or 33 percent generates two positive trade-offs: it increases stability (lessens volatility from year to year), and it somewhat increases the correlation with tests other than the state tests.

However, it is possible to go too far. Lowering the weight on state test achievement gains below 33 percent, raising the weight on observations to 50 percent, and including student surveys at 25 percent is counterproductive. It not only lowers the
correlation with state achievement gains; it can also lower reliability and the correlation with other types of testing outcomes.

Ultimately, states, local education authorities, and other stakeholders need to decide how to weight the measures in a composite. Our data suggest that assigning 50 percent or 33 percent of the weight to state test results maintains considerable predictive power, increases reliability, and potentially avoids the unintended negative consequences of assigning too-heavy weights to a single measure. Removing too much weight from state tests, however, may not be a good idea, given the lower predictive power and reliability of Model 4 (25 percent state tests). In short, there is a range of reasonable weights for a composite of multiple measures.
Validity and Content Knowledge for Teaching
Teachers shouldn't be asked to expend effort to improve something that doesn't help them achieve better outcomes for their students. If a measure is to be included in formal evaluation, then it should be shown that teachers who perform better on that measure are generally more effective in improving student outcomes. This test for validity has been central to the MET project's analyses. Measures that have passed this test include high-quality classroom observations, well-designed student-perception surveys, and teachers' prior records of student achievement gains on state tests.
Over the past year, MET project researchers have investigated another type of measure, called the Content Knowledge for Teaching (CKT) tests. These are meant to assess teachers' understanding of how students acquire and understand subject-specific skills and concepts in math and ELA. Developed by the Educational Testing Service and researchers at the University of Michigan, these tests are among the newest measures of teaching included in the MET project's analyses. Mostly multiple choice, the questions ask how best to represent ideas to students, assess student understanding, and determine sources of students' confusion.
The CKT tests studied by the MET project did not pass our test for validity. MET project teachers who performed better on the CKT tests were not substantively more effective in improving student achievement on the outcomes we measured. This was true whether student achievement was measured using state tests or the supplemental assessments of higher-order thinking skills. For this reason, the MET project did not include CKT results within its composite measure of effective teaching.
These results, however, speak to the validity of the current measure, still early in its development, in predicting achievement gains on particular student assessments, not to the importance of content-specific pedagogical knowledge. CKT as a concept remains promising. The teachers with higher CKT scores did seem to have somewhat higher scores on two subject-based classroom observation instruments: the Mathematical Quality of Instruction (MQI) and the Protocol for Language Arts Teacher Observations (PLATO). Moreover, the MET project's last report suggested that some content-specific observation instruments were better than cross-subject ones at identifying teachers who were more effective in improving student achievement in ELA and math. Researchers will continue to develop measures for assessing teachers' content-specific teaching knowledge and to validate them as states create new assessments aligned to the Common Core State Standards. Once such measures have been shown to be substantively related to achievement gains among a teacher's students, they should be considered for inclusion as part of a composite measure of effective teaching.
How Can Teachers Be Assured Trustworthy Results from Classroom Observations?9

Classroom observations can be powerful tools for professional growth. But for observations to be of value, they must reliably reflect what teachers do throughout the year, as opposed to the subjective impressions of a particular observer or some unusual aspect of a particular lesson. Teachers need to know they are being observed by the right people, with the right skills, and a sufficient number of times to produce trustworthy results. Given this, the challenge for school systems is to make the best use of resources to provide teachers with high-quality feedback to improve their practice.
The MET project's report Gathering Feedback for Teaching showed the importance of averaging together multiple observations from multiple observers to boost reliability. Reliability represents the extent to which results reflect consistent aspects of a teacher's practice, as opposed to other factors such as observer judgment. We also stressed that observers must be well trained and assessed for accuracy before they score teachers' lessons.
But there were many practical questions the MET project couldn't answer in its previous study. Among them:

Can school administrators reliably assess the practice of teachers in their schools?

Can additional observations by external observers not familiar with a teacher increase reliability?

Must all observations involve viewing the entire lesson, or can partial lessons be used to increase reliability?

And, what is the incremental benefit of adding additional lessons and additional observers?
These questions came from our partners, teachers, and administrators in urban school districts. In response, with the help of a partner district, the Hillsborough County (Fla.) Public Schools, the MET project added a study of classroom observation reliability. This study engaged district administrators and teacher experts to observe video-recorded lessons of 67 Hillsborough County teachers who agreed to participate.

Hillsborough County's Classroom Observation Instrument

Like many school districts, Hillsborough County uses an evaluation instrument adapted from the Framework for Teaching, developed by Charlotte Danielson. The framework defines four levels of performance for specific competencies in four domains of practice. Two of those domains pertain to activities outside the classroom: Planning and Preparation, and Professional Responsibility. Observers rated teachers on the 10 competencies in the framework's two classroom-focused domains, as shown:

Domain 2: The Classroom Environment
Creating an Environment of Respect and Rapport
Establishing a Culture of Learning
Managing Classroom Procedures
Managing Student Behavior
Organizing Physical Space

Domain 3: Instruction
Communicating with Students
Using Discussion and Questioning Techniques
Engaging Students in Learning
Using Assessment in Instruction
Demonstrating Flexibility and Responsiveness
Comparison of Ratings

Two types of observers took part in the study: Fifty-three were school-based administrators, either principals or assistant principals, and 76 were peer observers. The latter are district-based positions filled by teachers on leave from the classroom who are responsible for observing and providing feedback to teachers in multiple schools. In Hillsborough County's evaluation system, teachers are observed multiple times, formally and informally, by their administrators and by peer observers. Administrators and peers are trained and certified in the district's observation instrument, which is based on Charlotte Danielson's Framework for Teaching.
These observers each rated 24 lessons for us and produced more than 3,000 ratings that we could use to investigate our questions. MET project researchers were able to calculate reliability for many combinations of observers (administrator and peer), lessons (from 1 to 4), and observation duration (full lesson or 15 minutes). We were able to compare differences in the ratings given to teachers' lessons by their own and unknown administrators, and between administrators and peers.
Effects on Reliability

Figure 5 graphically represents many of the key findings from our analyses of those ratings. Shown are the estimated reliabilities for results from a given set of classroom observations. Reliability is expressed on a scale from 0 to 1. A higher number indicates that results are more attributable to the particular teacher, as opposed to other factors such as the particular observer or lesson. When results for the same teachers vary from lesson to lesson or from observer to observer, averaging teachers' ratings across multiple lessons or observers decreases the amount of error due to such factors, and it increases reliability.
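The logic of that last sentence can be sketched with a toy variance model. In this sketch, which is a simplification and not the MET project's actual estimator, reliability is the share of score variance attributable to the teacher, and averaging over independent lessons and observers divides those error components by their counts. The variance components below are hypothetical.

```python
def reliability(var_teacher, var_lesson, var_observer, n_lessons, n_observers):
    """Illustrative variance model: averaging over independent lessons and
    observers shrinks each error component by its count."""
    error = var_lesson / n_lessons + var_observer / n_observers
    return var_teacher / (var_teacher + error)

# Hypothetical variance components, scaled so that one lesson scored by
# one observer has reliability 0.50.
vt, vl, vo = 1.0, 0.5, 0.5
print(reliability(vt, vl, vo, 1, 1))  # one lesson, one observer
print(reliability(vt, vl, vo, 2, 1))  # second lesson, same observer
print(reliability(vt, vl, vo, 2, 2))  # second lesson, different observer
```

Under these hypothetical components the pattern mirrors Figure 5: a second lesson scored by the same observer helps a little, while having a different observer score the second lesson helps considerably more.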
Adding lessons and observers increases the reliability of classroom observations. In our estimates, if a teacher's results are based on two lessons, having the second lesson scored by a second observer can boost reliability significantly. This is shown in Figure 5: When the same administrator observes a second lesson, reliability increases from .51 to .58, but when the second lesson is observed by a different administrator from the same school, reliability increases more than twice as much, from .51 to .67. Whenever a given number of lessons was split among multiple observers, the reliability was greater than that achieved by a single observer. In other words, for the same total number of observations, incorporating additional observers increases reliability.
Of course, it would be a problem if school administrators and peer observers produced vastly different results for the same teachers. But we didn't find that to be the case. Although administrators gave higher scores to their own teachers, their rankings of their own teachers were similar to those produced by peer observers and administrators from other schools. This implies that administrators are seeing the same things in the videos that others do, and they are not being swayed by personal biases.

Figure 5: There Are Many Roads to Reliability. These bars show how the number of observations and observers affects reliability. Reliability represents the extent to which the variation in results reflects consistent aspects of a teacher's practice, as opposed to other factors such as differing observer judgments. Different colors represent different categories of observers: a lesson observed by the teacher's own administrator, a lesson observed by a peer observer, or three 15-minute lessons observed by three additional peer observers, each configuration equal to about 45 minutes of observation time. "A" and "B" denote different observers of the same type; the "A" and "B" in column three show that ratings were averaged from two different own-school observers. A solid circle indicates one observation of that duration, while a circle split into three indicates three 15-minute observations by three observers. Reliability rises from .51 for a single lesson scored by the teacher's own administrator, and reliabilities of .66 to .72 can be achieved in multiple ways, with different combinations of numbers of observers and observations. (For example, one observation by a teacher's administrator combined with three short, 15-minute observations, each by a different observer, would produce a reliability of .67.)
If additional observations by additional observers are important, how can the time for those added observations be divided up to maximize the use of limited resources while assuring trustworthy results? This is an increasingly relevant question as more school systems make use of video in providing teachers with feedback on their practice. Assuming multiple videos for a teacher exist, an observer could use the same amount of time to watch one full lesson or two or three partial lessons. But to consider the latter, one would want to know whether partial-lesson observations increase reliability.

Our analysis from Hillsborough County showed that observations based on the first 15 minutes of lessons were about 60 percent as reliable as full-lesson observations, while requiring one-third as much observer time. Therefore,
one way to increase reliability is to expose a given teacher's practice to multiple perspectives. Having three different observers each observe for 15 minutes may be a more economical way to improve reliability than having one additional observer sit in for 45 minutes. Our results also suggest that it is important to have at least one or two full-length observations, given that some aspects of teaching scored on the Framework for Teaching (Danielson's instrument) were frequently not observed during the first 15 minutes of class.
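As a back-of-the-envelope check on that trade-off, the Spearman-Brown formula gives the reliability of an average of several parallel ratings. The sketch below assumes the three 15-minute observations are independent and equally reliable, and it borrows the .51 single-observation figure and the 60 percent figure from the text; the MET project's own estimates were computed differently, so this is an illustration, not a reproduction of the study's numbers.

```python
def spearman_brown(r_single, k):
    """Reliability of the average of k parallel, independent ratings,
    each with single-rating reliability r_single (Spearman-Brown formula)."""
    return k * r_single / (1 + (k - 1) * r_single)

# From the text: one full lesson scored by one observer ~ .51 reliability,
# and a 15-minute observation is about 60 percent as reliable as a full one.
r_full = 0.51
r_short = 0.6 * r_full

# Two ways to spend the same 45-minute observer-time budget.
print(f"one 45-minute observation:    {r_full:.2f}")
print(f"three 15-minute observations: {spearman_brown(r_short, 3):.2f}")
```

Under these assumptions, three short observations by different observers edge out a single full-length one for the same 45 minutes, consistent with the paragraph above.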
Together, these results provide a range of scenarios for achieving reliable classroom observations. There is a point where both additional observers and additional observations do little to reduce error. Reliability above 0.65 can be achieved with several configurations (see Figure 5).
Implications for Districts

Ultimately, districts must decide how to allocate time and resources to classroom observations. The answers to the questions of how many lessons, of what duration, and conducted by whom are informed by reliability considerations, as well as by other relevant factors, such as novice teacher status, prior effectiveness ratings, and a district's overall professional development strategy.
What We Know Now

In three years we have learned a lot about how multiple measures can identify effective teaching and the contribution that teachers can make to student learning. The goal is for such measures to inform state and district efforts to support improvements in teaching to benefit all students. Many of these lessons have already been put into practice as school systems eagerly seek out evidence-based guidance. Only a few years ago the norm for teacher evaluation was to assign satisfactory ratings to nearly all teachers evaluated while providing virtually no useful information to improve practice.10 Among the significant lessons learned through the MET project and the work of its partners:
Student perception surveys and classroom observations can provide meaningful feedback to teachers. They also can help system leaders prioritize their investments in professional development to target the biggest gaps between teachers' actual practice and the expectations for effective teaching.

Implementing specific procedures in evaluation systems can increase trust in the data and the results. These include rigorous training and certification of observers; observation of multiple lessons by different observers; and, in the case of student surveys, the assurance of student confidentiality.

Each measure adds something of value. Classroom observations provide rich feedback on practice. Student perception surveys provide a reliable indicator of the learning environment and give voice to the intended beneficiaries of instruction. Student learning gains (adjusted to account for differences among students) can help identify groups of teachers who, by virtue of their instruction, are helping students learn more.

A balanced approach is most sensible when assigning weights to form a composite measure. Compared with schemes that heavily weight one measure, those that assign 33 percent to 50 percent of the weight to student achievement gains achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than those measured by a single test.

There is great potential in using video for teacher feedback and for the training and assessment of observers. The advances made in this technology have been significant, resulting in lower costs, greater ease of use, and better quality.
The Work Ahead

As we move forward, MET project teachers are supporting the transition from research to practice. More than 300 teachers are helping the project build a video library of practice for use in professional development. They will record more than 50 lessons each by the end of this school year and make these lessons available to states, school districts, and other organizations committed to improving effective teaching. This will allow countless educators to analyze instruction and see examples of great teaching in action.

Furthermore, the unprecedented data collected by the MET project over the past three years are being made available to the larger research community to carry out additional analyses, which will increase knowledge of what constitutes effective teaching and how to support it. MET project partners already are tapping those data for new studies on observer training, combining student surveys and observations, and other practical concerns. Finally, commercially available video-based tools for observer training and certification now exist, using the lessons learned from the MET project's studies.
Many of the future lessons regarding teacher feedback and evaluation systems must necessarily come from the field, as states and districts innovate, assess the results, and make needed adjustments. This will be a significant undertaking, as systems work to better support great teaching. Thanks to the hard work of MET project partners, we have a solid foundation on which to build.
Endnotes

1. The lead authors of this brief are Steven Cantrell, Chief Research Officer at the Bill & Melinda Gates Foundation, and Thomas J. Kane, Professor of Education and Economics at the Harvard Graduate School of Education and principal investigator of the Measures of Effective Teaching (MET) project. Lead authors of the related research papers are Thomas J. Kane (Harvard), Daniel F. McCaffrey (RAND), and Douglas O. Staiger (Dartmouth). Essential support came from Jeff Archer, Sarah Buhayar, Alejandro Ganimian, Andrew Ho, Kerri Kerr, Erin McGoldrick, and David Parker. KSA-Plus Communications provided design and editorial assistance.

2. This section summarizes the analyses and key findings from the research report Have We Identified Effective Teachers? by Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. Readers who want to review the full set of findings can download that report at www.metproject.org.

3. As expected, not every student on a randomly assigned roster stayed in the classroom of the intended teacher. Fortunately, we could track those students. We estimated the effects of teachers on student achievement using a statistical technique commonly used in randomized trials called instrumental variables.

4. These predictions, as well as the average achievement outcomes, are reported relative to the average among participating teachers in the same school, grade, and subject.

5. Readers may notice that some of the differences in Figure 2 are smaller than the differences reported in earlier MET reports. Due to non-compliance (students not remaining with their randomly assigned teacher), only about 30 percent of the randomly assigned difference in teacher effectiveness translated into differences in the effectiveness of students' actual teachers. The estimates in Figure 2 are adjusted for non-compliance. If all the students had remained with their randomly assigned teachers, we would have predicted impacts roughly three times as big. Our results imply that, without non-compliance, we would have expected to see differences just as large as those included in earlier reports.

6. Other researchers have studied natural movements of teachers between schools (as opposed to randomly assigned transfers) and found no evidence of bias in estimated teacher effectiveness between schools. See Raj Chetty, John Friedman, and Jonah E. Rockoff, "The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood," working paper no. 17699, National Bureau of Economic Research, December 2011.

7. The findings highlighted in this summary and the technical details of the methods that produced them are explained in detail in the research paper A Composite Estimator of Effective Teaching, by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood. A copy may be found at www.metproject.org.

8. Different student assessments, observation protocols, and student survey instruments would likely yield somewhat different amounts of reliability and accuracy. Moreover, measures used for evaluation may produce different results than seen in the MET project, which attached no stakes to the measures it administered in the classrooms of its volunteer teachers.

9. This section summarizes key analyses and findings from the report The Reliability of Classroom Observations by School Personnel by Andrew D. Ho and Thomas J. Kane. Readers who want to review the full set of findings and methods for the analyses can download that report at www.metproject.org. The MET project acknowledges the hard work of Danni Greenberg Resnick and David Steele, of the Hillsborough County Public Schools, and the work of the teachers, administrators, and peer observers who participated in this study.

10. Weisberg, D., et al. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. Brooklyn: The New Teacher Project.
© 2013 Bill & Melinda Gates Foundation. All Rights Reserved. Bill & Melinda Gates Foundation is a registered trademark in the United States and other countries.

Bill & Melinda Gates Foundation

Guided by the belief that every life has equal value, the Bill & Melinda Gates Foundation works to help all people lead healthy, productive lives. In developing countries, it focuses on improving people's health and giving them the chance to lift themselves out of hunger and extreme poverty. In the United States, it seeks to ensure that all people, especially those with the fewest resources, have access to the opportunities they need to succeed in school and life. Based in Seattle, Washington, the foundation is led by CEO Jeff Raikes and Co-chair William H. Gates Sr., under the direction of Bill and Melinda Gates and Warren Buffett.

For more information on the U.S. Program, which works primarily to improve high school and postsecondary education, please visit www.gatesfoundation.org.
www.gatesfoundation.org