Graduate School of Education The Quantitative Side of Teacher Evaluation in New Jersey Bruce D. Baker


Page 1

Graduate School of Education

The Quantitative Side of Teacher Evaluation in New Jersey

Bruce D. Baker

Page 2

Modern Teacher Evaluation Policies: Making Certain Distinctions with Uncertain Information

• First, the modern teacher evaluation template requires that objective measures of student achievement growth be considered in a weighting system of parallel components.
– Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility), that is, varied weight.

• Second, the modern teacher evaluation template requires that teachers be placed into effectiveness categories by assigning arbitrary numerical cutoffs to the aggregated weighted evaluation components.
– That is, a teacher at the 25th percentile or lower when combining all evaluation components might be assigned a rating of "ineffective," whereas the teacher at the 26th percentile might be labeled effective.

• Third, the modern teacher evaluation template places inflexible timelines on the conditions for removal of tenure.
– Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rated effective).
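The parallel-weighting template above can be sketched in a few lines. The component names and weights here are hypothetical placeholders, not the actual TEACHNJ regulations, which varied by year and teacher type; the point is that a fixed-weight sum treats every component as equally valid and reliable, differing only in importance.

```python
# Minimal sketch of a weighted-composite evaluation score (hypothetical
# component names and weights, all on a 1-4 scale).
def composite_score(components, weights):
    """Combine evaluation components by fixed weights.

    Treating components as interchangeable except for weight implicitly
    assumes equal validity and reliability across all of them.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(components[k] * weights[k] for k in weights)

# Hypothetical teacher: strong observation scores, weak/noisy growth score.
weights = {"observation": 0.55, "growth": 0.30, "goals": 0.15}
scores = {"observation": 3.6, "growth": 2.0, "goals": 3.5}
print(round(composite_score(scores, weights), 3))  # 3.105
```

Note that a single noisy component (here, "growth") moves the composite, and therefore the category assignment, even when every other component is stable.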

Page 3

Due Process Concerns

• Attribution/Validity
– Face
• There are many practical challenges, including whether teachers should be held responsible for summer learning in an annual assessment model, or how to parse the influence of teacher teams and/or teachers with assistants.
• SGPs have their own face validity problem, in the author's own words. One cannot reasonably evaluate someone on a measure not attributable to them.
– Statistical
• There may be, and likely will be with an SGP, significant persistent bias – that is, other stuff affecting growth such as student assignment, classroom conditions, class sizes, etc. – which renders the resulting estimate NOT attributable to the teacher.

• Reliability
– Lack of reliability of measures, jumping around from year to year, suggests also that the measures are not a valid representation of actual teacher quality.

• Arbitrary/Unjustifiable Decisions
– Cut-points imposed throughout the system make invalid assumptions regarding the statistics – that a 1-point differential is meaningful.
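A tiny simulation (not Baker's analysis; the cutoff and noise level are assumed for illustration) shows why a sharp cut-point through noisy data is arbitrary: a teacher whose true composite sits exactly at the boundary gets each label essentially at random.

```python
# Simulated cut-point problem: labels near the boundary are coin flips.
import random

random.seed(0)
CUTOFF = 2.65    # hypothetical "ineffective" boundary on a 1-4 scale
NOISE_SD = 0.25  # assumed measurement error in the composite score

def rating(true_score):
    """Label a teacher from one noisy observation of their true score."""
    observed = true_score + random.gauss(0, NOISE_SD)
    return "ineffective" if observed < CUTOFF else "effective"

# A teacher whose true score equals the cutoff: the label is ~50/50.
labels = [rating(CUTOFF) for _ in range(10_000)]
share_ineffective = labels.count("ineffective") / len(labels)
print(round(share_ineffective, 2))  # about 0.5
```

The same arithmetic applies at any boundary: the narrower the gap between a teacher's true score and the cutoff relative to the measurement noise, the closer the classification is to chance.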

Page 4

Legal Parallels

• NJ Anti-Bullying Statute
– Damned if you do, damned if you don't!
– The bullying statute forces district officials to make distinctions/apply labels which may not be justifiably applied ("bully" or "bullying behavior").
– Application of these labels leads to damages to the individual to whom they are applied (liberty interests/property interests).
– Result?
• Due process challenges
• District backtracking (avoidance)

• Measuring good teaching is no more precise an endeavor than measuring bullying! (and state mandates cannot make it so!)

Page 5

TEACH NJ Issues

Arbitrary weightings: assumes equal validity, but varied importance

Page 6

TEACH NJ Issues

Arbitrary Reduction/Categorization of (forced) normally distributed data

Page 7

Statistically inappropriate to declare +.49 SD categorically different from +.50 SD! [especially with noisy data]

Page 8

Arbitrary, statistically unjustifiable collapsing to a 4-point scale, then multiplied by an arbitrary weight.

Assumes all parts equally valid & reliable, but of varied importance (weight).

Page 9

• At best, this applies to 20 to 30% of teachers
– At the middle school level! (which would be the highest)

[Chart: staffing categories, Average NJ Middle School, 2008–2011 Fall Staffing Reports: Art, Business, Elementary Generalist, English/LAL, Family/Consumer Sci, Health/PE, Industrial Arts, Math, Music, Science, Social Studies, Special Ed, Support Services, Vocational Ed, World Language]

1. Requires differential contracts by staffing type.
2. Some states/school districts "resolve" this problem by assigning all other teachers the average of those rated:
   1. Significant "attribution" concern (due process)
   2. Induces absurd practices
3. This problem undermines "reform" arguments that in cases of RIF, quality, not seniority, should prevail, because the supposed "quality" measures apply only to those positions least likely to be reduced.

Page 10

Usefulness of Teacher Effect Measures

• Estimating teacher effects
– Basic attribution problems
• Seasonality
• Spill-over
– Stability issues
• Decomposing the signal and the noise
• SGP and VAM
– Attribution?
– False signals in NJ SGP data
• Debunking Disinformation

Page 11

A little stat-geeking

• What's in a growth or VAM estimate?
– The largest part is random noise… that is, if we look from year to year, across the same teachers, estimates jump around a lot, or vary a lot in "unexplained" and seemingly unpredictable ways.
– The other two parts are:
• False Signal, or predictable patterns that are predictable not as a function of anything the teacher is doing, but as a function of other stuff outside the teacher's control that happens to have predictable influence
– Student sort, classroom conditions, summer experiences, test form/scale, and starting position of students on that scale.
• True Signal, or that piece of the predictability of change in test score from time 1 to time 2 that might fairly be attributed to the role of the teacher in the classroom.

Page 12

Distilling Signal from Noise

[Diagram: Total Variation splits into (a) unknown & seemingly unpredictable error (random), and (b) predictable variation (stable component), which is itself either attributable to the teacher or attributable to other persistent attributes – difficult if not implausible to accurately parse]

Making high-stakes personnel decisions on the basis of either Noise or False Signal is problematic! [& that may be the majority of the variation]

Page 13

The SGP Difference!

Page 14

SGPs & New Jersey

• Student Growth Percentiles are not designed for inferring teacher influence on student outcomes.
• Student Growth Percentiles do not (even try to) control for various factors outside of the teacher's control.
• Student Growth Percentiles are not backed by research on estimating teacher effectiveness. By contrast, research on SGPs has shown them to be poor at isolating teacher influence.
• New Jersey's Student Growth Percentile measures, at the school level, are significantly statistically biased with respect to student population characteristics and average performance level.

Page 15

In the author's words… Damian Betebenner:

“Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.”

http://www.ednewscolorado.org/voices/student-growth-percentiles-and-shoe-leather

Page 16

Playing Semantics…

• When pressed on the point that SGPs are not designed for attributing student gains to their teachers, those defending their use in teacher evaluation will often say…
– "SGPs are a good measure of student growth, and shouldn't teachers be accountable for student growth?"

• Let's be clear here: one cannot be held accountable for something that is not rightly attributable to them!

Page 17

The Bottom Line?

• Policymakers seem to be moving forward on implementation of policies that display complete disregard for basic statistical principles –
– that one simply cannot draw precise conclusions (and thus make definitive decisions) based on imprecise information.
• One can't draw a strict cut point through messy data. The same applies to high-stakes cut scores for kids.
– That one cannot make assertions about the accuracy of the position of any one point among thousands, based on the loose patterns we find in these types of data.

• Good data-informed decision making requires a deep, nuanced understanding of statistics and measures: what they mean… and most importantly WHAT THEY DON'T! (and can't possibly)

Page 18

Reasonable Alternatives?

• To the extent these data can produce some true signal amidst the false signal and noise, central office data teams in large districts might be able to use several (not just one) rich but varied models to screen for variations that warrant further exploration.
• This screening approach, much like high-error-rate rapid diagnostic tests, might tell us where to focus some additional energy (that is, classroom and/or school observation).
• We may then find that the signal was false, or that it really does tell us something, either about how we've mismatched teachers and assignments, or about the preparedness of some teachers.
• But the initial screening information should NEVER dictate the final decision (as it will under Toxic Trifecta models).
• And if we find that the data-driven analysis more often sends us down inefficient pathways, we might decide it's just not worth it.

But this cannot be achieved by centralized policy or through contractual agreements. Unfortunately, current policies and recent contractual agreements prohibit thoughtful, efficient strategies!

[Flow: Screening → Observation → Validation (or NOT)]

& Questions?
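The screening idea above, flag only where several different models agree, then send the flag to observation rather than straight to a personnel decision, can be sketched as follows. The model names, teacher IDs, and threshold are hypothetical; this is one plausible agreement rule, not a prescribed procedure.

```python
# Sketch of multi-model agreement screening: a teacher is flagged for
# follow-up observation only if EVERY model places them in the bottom tail.
def screen(estimates_by_model, flag_quantile=0.1):
    """Return teacher ids flagged as low-scoring by every model."""
    flagged_sets = []
    for model, scores in estimates_by_model.items():
        ranked = sorted(scores.values())
        cutoff = ranked[max(0, int(flag_quantile * len(ranked)) - 1)]
        flagged_sets.append({t for t, s in scores.items() if s <= cutoff})
    # Intersection = agreement across models; disagreement means no flag.
    return set.intersection(*flagged_sets)

# Hypothetical estimates for four teachers under two different models.
models = {
    "sgp":  {"t1": -0.9, "t2": 0.1, "t3": -0.8, "t4": 0.5},
    "vam1": {"t1": -0.7, "t2": -0.6, "t3": 0.2, "t4": 0.4},
}
print(screen(models, flag_quantile=0.25))  # {'t1'}
```

Note that "t3" is flagged by the SGP-like model but not the VAM-like one, so it generates no flag: requiring agreement is exactly what keeps any one model's false signal from driving the outcome, and even the surviving flag is only a prompt for observation.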

Page 19

Page 20

New York City Examples

Page 21

[Scatterplot: Value Added 2009-10 (y) vs. Value Added 2008-09 (x); teachers classified Bad/Good, with categories Good to Bad, Bad to Good, Average, Other. Correlation = .327]

English Language Arts, Grades 4 to 8: 9 to 15% (of those who were "good" or were "bad" in the previous year) move all the way from good to bad or bad to good. 20 to 35% who were "bad" stayed "bad" & 20 to 35% who were "good" stayed "good." And this is between the two years that show the highest correlation for ELA.
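The transition rates quoted above can be reproduced approximately with simulated data: draw two years of scores with the same correlation as the ELA figure (r = .33, assumed bivariate normal) and count how often a top-20% ("good") teacher falls to the bottom 20% ("bad") or repeats. This is simulated data, not the actual NYC teacher file.

```python
# Simulated two-year ratings with year-to-year correlation 0.33 (ELA-like).
import random

random.seed(2)
R = 0.33          # assumed correlation, matching the ELA figure
N = 20_000
pairs = []
for _ in range(N):
    z1 = random.gauss(0, 1)
    # Standard construction of a second normal with correlation R to z1.
    z2 = R * z1 + (1 - R**2) ** 0.5 * random.gauss(0, 1)
    pairs.append((z1, z2))

TOP = 0.8416      # standard-normal cutoff for the top/bottom 20%
good1 = [p for p in pairs if p[0] > TOP]
good_to_bad = sum(p[1] < -TOP for p in good1) / len(good1)
good_stay = sum(p[1] > TOP for p in good1) / len(good1)
print(round(good_to_bad, 2), round(good_stay, 2))
```

With r = .33, roughly 8-10% of "good" teachers land all the way in "bad" the next year and only about a third repeat, which is the same order of magnitude as the slide's 9-15% and 20-35% figures, i.e., the observed churn is exactly what that correlation implies.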

Page 22

[Scatterplot: Value Added 2009-10 (y) vs. Value Added 2008-09 (x); teachers classified Bad/Good, with categories Good to Bad, Bad to Good, Average, Other. Correlation = .5046]

Mathematics, Grades 4 to 8: For math, only about 7% of teachers jump all the way from being bad to good or good to bad (of those who were "good" or "bad" the previous year), and about 30 to 50% who were good remain good, or who were bad, remain bad.

Page 23

But is the signal we find real or false?

• Math – likelihood of being labeled "good"
– 15% less likely to be good in a school with a higher attendance rate
– 7.3% less likely to be good for each 1-student increase in school average class size
– 6.5% more likely to be good for each additional 1% proficient in math

• Math – likelihood of being repeatedly labeled "good"
– 19% less likely to be sequentially good in a school with a higher attendance rate (gr 4 to 8)
– 6% less likely to be sequentially good in a school with 1 additional student per class (gr 4 to 8)
– 7.9% more likely to be sequentially good in a school with a 1% higher math proficiency rate

• Math flip side – likelihood of being labeled "bad"
– 14% more likely to be bad in a school with a higher attendance rate
– 7.9% more likely to be sequentially bad for each additional student in average class size (gr 4 to 8)

Page 24

About those “Great” Irreplaceable Teachers!

Page 25

Figure 1 – Who is irreplaceable in 2006-07 after being irreplaceable in 2005-06?

[Scatterplot: percentile 2006-07 (y, 0–100) vs. percentile 2005-06 (x, 0–100); categories: OK to Stinky Teachers, Awesome Teachers; annotations: Awesomeness, Awesome x 2]

Important tangent: note how spreading data into percentiles makes the pattern messier!

Page 26

Figure 2 – Among those 2005-06 Irreplaceables, how do they reshuffle between 2006-07 & 2007-08?

[Scatterplot: percentile 2007-08 (y, 0–100) vs. percentile 2006-07 (x, 0–100); categories: OK to Stinky Teachers, Awesome Teachers; annotation: Awesome x 3]

Page 27

Figure 3 – How many of those teachers who were totally awesome in 2007-08 were still totally awesome in 2008-09?

[Scatterplot: percentile 2008-09 (y, 0–100) vs. percentile 2007-08 (x, 0–100); categories: OK to Stinky Teachers, Awesome Teachers; annotation: Awesome x 4? [but may have dropped out one prior year]]

Page 28

Figure 4 – How many of those teachers who were totally awesome in 2008-09 were still totally awesome in 2009-10?

[Scatterplot: percentile 2009-10 (y, 0–100) vs. percentile 2008-09 (x, 0–100); categories: OK to Stinky Teachers, Awesome Teachers]

Page 29

Persistently Irreplaceable?

Of the thousands of teachers for whom ratings exist for each year in NYC, there are 14 in math and 5 in ELA that stay in the top 20% for each year! Sure hope they don't leave!
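A back-of-envelope check puts those counts in perspective: if the ratings were pure noise, how many teachers would land in the top 20% in every one of five years by chance alone? The cohort size of 10,000 below is a hypothetical round number, not the actual NYC count.

```python
# Expected number of teachers in the top 20% for n_years in a row
# under pure chance (independent years, no true persistence).
def expected_lucky(n_teachers, top_share=0.2, n_years=5):
    return n_teachers * top_share ** n_years

print(expected_lucky(10_000))  # 3.2
```

With 0.2^5 = 0.00032, chance alone would produce about 3 "persistently irreplaceable" teachers per 10,000 rated every year, so the observed handfuls are barely distinguishable from luck.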

Page 30

Distilling Signal & Noise in New Jersey MGPs

Page 31

[Scatterplot: Median SGP Math (y, 0–80) vs. % Proficient 7th Grade Math (x, 0–100); schools including 7th grade. Correlation = .54]

New Jersey SGPs & Performance Level: Is it really true that the most effective teachers are in the schools that already have high proficiency rates?

Strong FALSE signal (bias)
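Quick arithmetic on the figure above: a correlation of .54 between a school's median SGP and its prior proficiency rate means proficiency alone accounts for roughly 29% of the variance in the "growth" measure before any teacher effectiveness enters the picture.

```python
# Shared variance implied by the school-level correlation in the figure.
r = 0.54
shared_variance = r ** 2  # coefficient of determination, r-squared
print(round(shared_variance, 2))  # 0.29
```

For a measure whose defenders claim it "fully takes into account" starting position, any sizable r-squared against prior performance is a red flag: that share of the variation is predictable without knowing anything about the teacher.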

Page 32

[Scatterplot: Median SGP Language Arts (y, 20–80) vs. % Proficient 7th Grade Language Arts (x, 0–100); schools including 7th grade. Correlation = .57]

New Jersey SGPs & Performance Level: Is it really true that the most effective teachers are in the schools that already have high proficiency rates?

Strong FALSE signal (bias)

Page 33

[Scatterplot: Language Arts MGP (y, 30–70) vs. % Black or Hispanic (x, 0–1); grades 06-08 schools. Correlation = −.4755]

Middle School MGP Racial Bias: Is it really true that the most effective teachers are in the schools that serve the fewest minority students?

Strong FALSE signal (bias)

Page 34

[Scatterplot: Math MGP (y, 20–80) vs. % Black or Hispanic (x, 0–1); grades 06-08 schools. Correlation = −.3260]

Middle School MGP Racial Bias: Is it really true that the most effective teachers are in the schools that serve the fewest minority students?

Strong FALSE signal (bias)

Page 35

Okay… so is it really true that the most effective teachers are in the schools that serve the fewest non-proficient special education students?

Significant FALSE signal (bias)

Page 36

And what if the underlying measures are junk?

Ceilings, Floors and Growth Possibilities?

Page 37

[Histogram: Percent of students by Math 2011 scale score (100–300), Grade 4 Math]

One CANNOT create variation where there is none!
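A toy illustration of the ceiling problem, using a hypothetical classroom (these are made-up scores, not NJASK data): when most students already sit at the test's maximum, the prior-year scores have almost no spread, and "growth relative to academic peers who started at the same place" has nothing to work with.

```python
# Ceiling effect in miniature: near-zero spread in prior scores leaves
# a growth model with no variation to rank.
import statistics

MAX_SCORE = 300
# Hypothetical classroom where most students topped out last year.
prior = [300, 300, 300, 298, 300, 299, 300, 300]

spread = statistics.stdev(prior)
print(round(spread, 2))
# With spread this small, any "growth" differences among these students
# are dominated by measurement error: one cannot create variation
# where there is none.
```

The same logic applies at the floor of the scale, which is why the cohort scatterplots that follow flatten out at the score boundaries.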

Page 38

[Histogram: Percent of students by Math 2011 scale score (100–300), Grade 7 Math]

One CANNOT create variation where there is none!

Page 39

[Scatterplot: Math 2011 (y, 100–300) vs. Math 2010 (x, 100–300); Sample DFG FG District. NJASK Math Grade 3 to 4 Cohort]

Page 40

[Scatterplot: Math 2012 (y, 150–300) vs. Math 2011 (x, 100–300); Sample DFG FG District. NJASK Math Grade 4 to 5 Cohort]

Page 41

[Scatterplot: Math 2011 (y, 100–300) vs. Math 2010 (x, 150–300); Sample DFG FG District. NJASK Math Grade 6 to 7 Cohort]

Page 42

[Scatterplot: Math 2012 (y, 100–300) vs. Math 2011 (x, 100–300); Sample DFG FG District. NJASK Math Grade 7 to 8 Cohort]

Page 43

Debunking Disinformation

Page 44

Misrepresentations

• NJ Commissioner Christopher Cerf explained:
– "You are looking at the progress students make and that fully takes into account socio-economic status," Cerf said. "By focusing on the starting point, it equalizes for things like special education and poverty and so on."[17] (emphasis added)

• Why this statement is untrue:
– First, comparisons of individual students don't actually explain what happens when a group of students is aggregated to their teacher and the teacher is assigned the median student's growth score to represent his/her effectiveness, where teachers don't all have an evenly distributed mix of kids who started at similar points (relative to other teachers). So, in one sense, this statement doesn't even address the issue.
– Second, this statement is simply factually incorrect, even regarding the individual student. The statement is not supported by research on estimating teacher effects, which largely finds that sufficiently precise student, classroom and school level factors do relate to variations not only in initial performance level but also in performance gains.

[17] http://www.wnyc.org/articles/new-jersey-news/2013/mar/18/everything-you-need-know-about-students-baked-their-test-scores-new-jersy-education-officials-say/

Page 45

Further research on this point…

• Two recent working papers compare SGP and VAM estimates for teacher and school evaluation, and both raise concerns about the face validity and statistical properties of SGPs.
– Goldhaber and Walch (2012) conclude: "For the purpose of starting conversations about student achievement, SGPs might be a useful tool, but one might wish to use a different methodology for rewarding teacher performance or making high-stakes teacher selection decisions" (p. 30).[6]
– Ehlert and colleagues (2012) note: "Although SGPs are currently employed for this purpose by several states, we argue that they (a) cannot be used for causal inference (nor were they designed to be used as such) and (b) are the least successful of the three models [Student Growth Percentiles, One-Step VAM & Two-Step VAM] in leveling the playing field across schools" (p. 23).[7]

[6] Goldhaber, D., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. University of Washington Bothell, Center for Education Data & Research. CEDR Working Paper 2012-6.
[7] Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper #80.

Page 46

Misrepresentations

• "The Christie administration cites its own research to back up its plans, the most favored being the recent Measures of Effective Teaching (MET) project funded by the Gates Foundation, which tracked 3,000 teachers over three years and found that student achievement measures in general are a critical component in determining a teacher's effectiveness."[23]

• The Gates Foundation MET project did not study the use of Student Growth Percentile models. Rather, the Gates Foundation MET project studied the use of value-added models, applying those models under the direction of leading researchers in the field, testing their effects on fall to spring gains, and on alternative forms of assessments. Even with these more thoroughly vetted value-added models, the Gates MET project uncovered, though largely ignored, numerous serious concerns regarding the use of value-added metrics. External reviewers of the Gates MET project reports pointed out that while the MET researchers maintained their support for the method, the actual findings of their report cast serious doubt on its usefulness.[24]

[23] http://www.njspotlight.com/stories/13/03/18/fine-print-overview-of-measures-for-tracking-growth/
[24] Rothstein, J. (2011). Review of "Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project." Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/thinktank/review-learning-about-teaching. [accessed 2-may-13]

Page 47

Misrepresentations

• But… even though these ratings look unstable from year to year, they are about as stable as baseball batting averages from year to year, and clearly batting average is a "good" statistic for making baseball decisions?

• Not so, say the baseball stat geeks:
– Not surprisingly, Batting Average comes in at about the same consistency for hitters as ERA for pitchers. One reason why BA is so inconsistent is that it is highly correlated with Batting Average on Balls in Play (BABIP) – .79 – and BABIP only has a year-to-year correlation of .35.
– Descriptive statistics like OBP and SLG fare much better, coming in at .62 and .63 respectively. When many argue that OBP is a better statistic than BA, it is for a number of reasons, but one is that it's more reliable in terms of identifying a hitter's true skill, since it correlates more year-to-year.

http://www.beyondtheboxscore.com/2011/9/1/2393318/what-hitting-metrics-are-consistent-year-to-year

Put simply, VAM estimates ARE about as useful as batting average – NOT VERY!
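The batting-average analogy can be quantified with a simulation (simulated bivariate-normal data, not actual baseball or teacher records): given a measure's year-to-year correlation, how often does a "top-quarter" performer in year 1 repeat in year 2?

```python
# What year-to-year correlations of .35 (BA-like) vs .62 (OBP-like) imply
# for repeat classification; chance alone would give 0.25.
import random

random.seed(4)

def repeat_rate(r, n=20_000, top_share=0.25):
    """Share of year-1 top performers who are also top performers in year 2."""
    pairs = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = r * x + (1 - r**2) ** 0.5 * random.gauss(0, 1)
        pairs.append((x, y))
    cut1 = sorted(p[0] for p in pairs)[int((1 - top_share) * n)]
    cut2 = sorted(p[1] for p in pairs)[int((1 - top_share) * n)]
    top1 = [p for p in pairs if p[0] > cut1]
    return sum(p[1] > cut2 for p in top1) / len(top1)

ba_like = repeat_rate(0.35)
obp_like = repeat_rate(0.62)
print(round(ba_like, 2), round(obp_like, 2))
```

A BA-like measure barely beats the 25% chance baseline, which is the sense in which a VAM estimate with similar stability is "not very" useful for high-stakes rankings.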

Page 48

About the Chetty Study…

• The Chetty, Friedman and Rockoff studies suggest that variation from many years ago, absent high-stakes assessment, in NYC, across classrooms of kids, is associated with small wage differences for those kids at age 30 (thus arguing that teaching quality matters, as measured by variation in classroom-level student gains). But the study has no direct implications for what might work in hiring, retaining and compensating teachers.

• The presence of variation across thousands of teachers, even if loosely correlated with other stuff, provides little basis for identifying any one single teacher as persistently good or bad.