TRANSCRIPT
VALUE-ADDED:
A BRIEF OVERVIEW AND
IMPLICATIONS FOR
EDUCATOR EVALUATION
Nandita Gawade – Value-Added Research Center 6/18/13
Introduction
Value-Added Research Center
Rob Meyer – Center Director
3 Main lines of work
Education statistics and data analysis
Professional development and technical assistance
Technical policy analysis and development
Districts and States Working with VARC
[Map: partner districts include Minneapolis, Milwaukee, Racine, Chicago, Madison, Tulsa, Atlanta, New York City, Los Angeles, Hillsborough County, and Collier County; partner states include North Dakota, South Dakota, Minnesota, Wisconsin, Illinois, New York, and California.]
VARC Design Process:
Continuous Improvement
Objective
• Valid and fair comparisons of teachers serving different student populations
Model Co-Build
• Full disclosure: no black-box
• Model informed by technical and consequential validity
Output
• Productivity estimates (contribution to student academic growth)
• Data formatting
Stakeholder Feedback
• Model refinement
• New objectives
National Landscape
Large-scale top-down policies/grants
Original No Child Left Behind
Race to the Top
Teacher Incentive Fund
ESEA Waivers (NCLB tweak)
Relatively low quality assessment at the state level
Relatively low quality data systems
National Landscape Looking Ahead
Common Core standards
Common Core assessment
Much more advanced assessment and measurement
NWEA MAP and Common Core assessments pushing on
the depth-of-knowledge frontier
AI scoring of essays advancing faster than anticipated
Data systems seem to be lagging behind
How do Policy Makers React?
Some states and districts have moved to implement
new metrics
RTTT states have especially felt this
“bleeding edge”
Almost all others are in the process of
implementation (or will be) because of waivers
Challenge: implement the best policies using the
best tools available
Head in sand not allowed
A Fact
Fact: your friendly neighborhood education policy
maker does not sleep much these days
A Policy Framework
Many technical decisions
Many policy goals
No perfect answers
Researchers like us tend to muddy the waters
Where to start?
The ground rules are that we need the following:
Accuracy
  Criterion validity
  Technical (causal) validity
  Reliability (precision)
Consequential validity
Transparency
Technical validity
Technical validity measures the degree to which the
statistical model and data used in the model (for
example, student outcomes, student characteristics,
and student-classroom-teacher linkages) provide
consistent (unbiased) estimates of performance
using the available student outcomes/assessments
Requires development of a quasi-experimental
model that captures (to the extent possible) the
structural factors of the phenomenon at hand
Consequential validity
Consequential validity addresses the incentives and
decisions that are triggered by the design and use
of policy measures and systems
Consequential Validity: Uses and
Decisions
Parental choice of schools
Teachers' willingness to teach in given schools
Identification of master teachers
Identification of teachers for professional development
Performance-based compensation
Provision of supplemental services
Avoid bubble effects: incentives to deploy resources to particular students as an artifact of the statistical measure (statistics based on means rather than medians are affected by all students)
Transparency
Transparency addresses the consequences of
simplicity versus complexity in the design (and
clarity of explanation) of models and reports
A simple example
Consider the consequences of using attainment measures for accountability (NCLB)
Technical validity:
Valid if we believe we can create proficiency cut points on assessments
Consequential validity:
No context – why would a principal or teacher want to be judged on factors outside their control?
May drive the best talent away from the schools that need the most help
Student Growth Models
Current policy frameworks tend to require some
form of student growth model
Gain models, value-added models (several types),
student growth percentile models
All attempt to measure the same thing: assessment
achievement of kids conditional on starting point
Policy then “rates” teachers and schools based on
the metric
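As a concrete, purely illustrative sketch of "achievement conditional on starting point," the code below fits an OLS value-added regression with teacher indicators on synthetic data. This is a minimal sketch, not VARC's operational model; all names, parameters, and effect sizes are assumptions.

```python
# Minimal sketch of a value-added regression on synthetic data: posttest on
# pretest plus teacher indicators. Real models add demographics, multiple
# priors, measurement-error corrections, and shrinkage.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, class_size = 50, 25
teacher = np.repeat(np.arange(n_teachers), class_size)
effect = rng.normal(scale=0.2, size=n_teachers)   # true teacher effects
pre = rng.normal(size=teacher.size)
post = 0.7 * pre + effect[teacher] + rng.normal(scale=0.6, size=teacher.size)

df = pd.DataFrame({"post": post, "pre": pre, "teacher": teacher})
fit = smf.ols("post ~ pre + C(teacher)", data=df).fit()
# Each C(teacher) coefficient is that teacher's estimated contribution to
# growth, relative to the omitted reference teacher.
print(fit.params.filter(like="C(teacher)").head())
```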
Model Intuition
[Figure: gain model intuition (single-subject model). The starting student achievement scale score, Spring 2012 (Math), is compared with the actual student achievement scale score, Spring 2013 (Math); the difference is the gain.]
Value-Added Models
[Figure: value-added intuition (single-subject model). The actual Spring 2013 (Math) achievement scale score is compared with the predicted student achievement based on observationally similar students with the same Spring 2012 (Math) starting score; the difference is the value-added.]
Value-Added Visually with
Multiple Prior Tests
[Figure: value-added with multiple prior tests (multiple-subject model). Predicted Spring 2013 (Math) achievement is based on starting achievement scale scores across multiple subjects in Spring 2012; value-added is again the gap between actual and predicted achievement.]
What about SGP?
Same concept!
Student Growth Percentiles are a specific set of
assumptions on a regression model:
Spline relationship between post and pre
Quantile regression instead of mean regression
One can replicate almost everything about an SGP
model with more traditional models
More on the technical validity, consequential validity,
and transparency concerns later
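As a rough illustration of that claim, the sketch below approximates an SGP with off-the-shelf quantile regression on a B-spline of the pretest, using synthetic data. The operational SGP methodology differs (its own spline choices, multiple prior scores), so the details here are assumptions.

```python
# Rough SGP approximation: quantile regressions of posttest on a B-spline
# of pretest (synthetic data; spline details and quantile grid assumed).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
pre = rng.normal(size=n)
post = 0.7 * pre + rng.normal(scale=0.7, size=n)
df = pd.DataFrame({"pre": pre, "post": post})

quantiles = np.arange(0.05, 1.0, 0.05)
# One quantile regression per conditional percentile of the posttest.
fits = [smf.quantreg("post ~ bs(pre, df=4)", df).fit(q=q) for q in quantiles]
pred = np.column_stack([f.predict(df) for f in fits])

# A student's growth percentile ~ share of conditional quantile curves
# lying below the observed posttest score.
df["sgp"] = (pred < df["post"].values[:, None]).mean(axis=1) * 100
print(df.head())
```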
Posttest-Pretest Relationship
First step of modeling student growth for any
purpose
This is the most important part!
Options:
Flexible (estimated relationship) or not
Linear or nonlinear
Multiple pretests or single pretest
Flexible or Inflexible
Inflexible relationship means the policy maker “knows” the relationship from outside information
Gain models, certain layered value-added models
Technical validity:
Teacher or school models correct if the relationship assumption is correct
Consequential validity:
Too strong a relationship (most common)
Classrooms with high pretest scores are penalized (and vice versa)
Consequential validity of strong
inflexible assumption
[Figure: plotting Spring 2013 (Math) against Spring 2012 (Math) achievement. The prediction based on the data yields the true value-added; the prediction based on the assumed (too strong) relationship yields a different estimated VA.]
Posttest-Pretest Relationship:
Consequences
Staff are disincentivized from teaching high-achieving
students
Recent research by Reckase and Wooldridge
suggests choosing an inflexible relationship is not
optimal
What about estimated (data driven) relationships?
Estimated test relationships
Typical of value-added models used by the
research community
SGP could be considered a semiparametric
approach to estimating the relationship
Technical validity: if estimated properly no
assumption on the relationship needed
Consequential validity: neutralizes the incentives to
teach certain types of kids
Measurement error
Not so fast
The presence of measurement error in the
assessments causes a naïve estimation to fail
Creates a relationship that is too weak and thus
closer to attainment models
Classrooms that have high pretest kids will be more
likely to have high value added
Effects of measurement error (linear)
[Figure: standardized posttest vs. pretest (z-scores, axes from -2 to 2), showing four lines: a strong relationship (slope 1), the true relationship, the relationship estimated with measurement error (attenuated), and no relationship (attainment, slope 0).]
Measurement error correction
In some cases we can correct the estimated
relationship
Linear relationship most understood
Currently no way to correct for this in SGP
Recent research by Akram and Meyer suggests a
nonlinear mean regression model (such as the ones
used by MET) can be corrected
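A minimal sketch of the linear case on synthetic data: classical measurement error attenuates the estimated slope, and dividing by the pretest reliability (assumed known here for the example) recovers it.

```python
# Minimal sketch of attenuation bias and a reliability-based correction for
# the linear case (classical errors-in-variables; the pretest reliability is
# assumed known here).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_pre = rng.normal(size=n)
obs_pre = true_pre + rng.normal(scale=0.5, size=n)     # pretest with error
post = 0.8 * true_pre + rng.normal(scale=0.6, size=n)  # true slope = 0.8

naive_slope = np.cov(obs_pre, post)[0, 1] / np.var(obs_pre)
reliability = np.var(true_pre) / np.var(obs_pre)       # ~ 0.8 by construction
corrected_slope = naive_slope / reliability

print(f"naive slope     {naive_slope:.2f}   (attenuated toward attainment)")
print(f"corrected slope {corrected_slope:.2f}   (recovers the true 0.80)")
```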
Consequential validity expanded
Consider the consequences of controlling for prior achievement and other predictors – switching from measurement of attainment (as in NCLB) to growth
Technical validity:
Positive – causal classroom estimates are more accurate
Consequential validity:
Positive if we want to be able to compare schools/teachers on a level playing field
Negative if controlling for prior achievement and other predictors inevitably leads to reduced expectations
Consequential Validity
[Figure: single-subject model. Class 1's actual Spring 2013 (Math) achievement is compared to its predicted achievement, yielding its value-added; Class 2 (low SES) starts below the proficiency line, raising the question of what the expectation should be.]
Consequential Validity
[Figure: single-subject model, continued. Class 3 (very low SES) starts so far below proficiency that even the teacher effect of the best teacher in the district does not close the gap to the proficiency expectation. How do we make this part up? Who makes it up?]
Key Point: the Power of Two
Decisions need to be informed by:
Measure of school/classroom or teacher performance
Measures of student achievement
Actual average student achievement
Student achievement target (e.g., proficiency status)
Options
Use only information on student attainment (NCLB)
Use only information on value-added performance
Use both pieces of data to inform decisions
Peer effects (classroom information)
Some models (including some we use) use classroom average demographics to capture peer effects
Technical validity: peer effect literature suggests this is not necessarily correct (error of commission)
Over control
Eliminates the ability to estimate selection across a state or district
Technical validity: also not correct to omit (error of omission)
Misspecified model
We need a one-armed economist!
Peer effects
Now we have a policy decision
Consequential validity:
Accountability for HR purposes – gives an absolute fair shot to every teacher in every situation
Possibly too fair!!
Accountability for parental choice issues – hides the fact that some schools are in fact worse
May need 2 estimates
Ehlert, Koedel, Parsons and Podgursky discuss this issue and claim using class averages is a superior policy
Again – depends on use!
Endogenous covariates
Some policies would like attendance or suspensions included in the model
Technical validity: these factors may improve the precision of a prediction
Consequential validity:
We would be controlling for at least part of a teacher’s or school’s effect on student achievement
Does an effective teacher develop strategies to increase attendance?
Can a teacher/school control attendance 100%?
Small Sample Size
With naïve estimation, small schools and classrooms
would be falsely overrepresented in the highest and
lowest Value-Added categories.
Technical validity:
Shrinkage improves the accuracy and precision of
Value-Added estimates by adjusting for the wider
variance that occurs simply as a result of teaching
fewer students.
Shrinkage increases the stability of Value-Added
estimates from year to year.
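A minimal sketch of the shrinkage idea on synthetic data (the variance components are assumed known here, while operational models estimate them): noisy estimates from small classes get pulled hardest toward the mean.

```python
# Minimal sketch of empirical-Bayes shrinkage (synthetic data; variance
# components assumed known for the example).
import numpy as np

rng = np.random.default_rng(2)
sigma_teacher = 0.20   # assumed sd of true teacher effects
sigma_student = 0.80   # assumed student-level residual sd

class_sizes = np.array([5, 12, 25, 60])
true_effects = rng.normal(scale=sigma_teacher, size=class_sizes.size)

for n, true in zip(class_sizes, true_effects):
    raw = true + rng.normal(scale=sigma_student / np.sqrt(n))  # noisy estimate
    # Shrinkage factor = signal variance / (signal + noise variance):
    # the noisier the raw estimate, the harder it is pulled toward zero.
    shrink = sigma_teacher**2 / (sigma_teacher**2 + sigma_student**2 / n)
    print(f"n={n:3d}  raw={raw:+.3f}  shrunken={shrink * raw:+.3f}  factor={shrink:.2f}")
```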
Small Sample Size
Consequential Validity:
Small classrooms will now show little variance and will
almost always appear average
May be better than false extremes
Be careful how this integrates into policy
State Models
Technically similar to a district model
Consequentially very different
Imagine 2 school districts (Happyville ISD and Sadville ISD)
For the same salary, all teachers prefer to teach in Happyville
Sadville ISD can only hire teachers that were not accepted at Happyville ISD
If we make VA part of accountability across the state we will see higher VA on average at Happyville ISD
State Models
We could give Sadville a break and develop a model that removes the differences across districts
From a parent's perspective, we have eliminated useful information about where they should buy a house
Important – this DOES NOT change the fact that the teachers at Sadville ISD are less effective
Sadville ISD still has to serve its students – should we penalize it then for market conditions it cannot control?
Integration With Other Metrics
Many policy systems attempt to “weight” different statistical measures
Technical validity: one can always average numbers – but the interpretation of the composite is entirely unclear
How should a teacher with a high observation score and low VA compare to one who is middle on both?
Consequential Validity
Depending on how the point system is set up gaming is possible
Rankings focus attention on the highest-variance component, not the highest-weighted one
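The sketch below illustrates that variance point on synthetic data: with a 70/30 nominal weighting, the lower-weighted but higher-variance component still dominates the composite ranking. All numbers are assumed for the example.

```python
# Synthetic illustration: composite rankings follow the highest-variance
# component, not the nominal weights.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
obs = rng.normal(loc=3.0, scale=0.2, size=n)  # observation scores: low variance
va = rng.normal(loc=0.0, scale=1.0, size=n)   # value-added: high variance

composite = 0.7 * obs + 0.3 * va  # nominal weights: 70% observation, 30% VA

def ranks(x):
    return x.argsort().argsort()

print("rank corr with observation:", np.corrcoef(ranks(composite), ranks(obs))[0, 1].round(2))
print("rank corr with value-added:", np.corrcoef(ranks(composite), ranks(va))[0, 1].round(2))
# Despite its 30% weight, VA dominates the ranking because its variance is larger.
```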
Transparency: LAUSD
Los Angeles Unified School District recently
implemented a value-added metric (AGT, Academic Growth over Time)
Process:
Technical co-build
Advisory panel
Slow rollout
Thoughtful model selection (and rejection)
Massive communication effort
No accountability decisions until model build complete
Model Criteria
Student coverage
Reliability
Stability
Exogenous variance explained
Teacher variance
Model Diagnostics Criteria

Statistic                 Low Value   High Value
Within-Teacher R-Sq       0.5         0.8
Reliability               0.5         0.9
Estimate of Teacher SD    0.1         0.3
Some models failed
ELA 10 and 11 failed to differentiate between
teachers (but worked at the school level)
Science 5 included a curriculum that spanned
multiple teachers, and no adequate model could be
developed for that data
A note on statistical noise
Standard errors can be created for all statistical
estimates
It is inappropriate to present the point estimate of a
student growth model as reality
This is especially important when integrating these
metrics into policy use
High-noise (and thus high-instability) estimates
should only be used for low-stakes decisions – ignoring
that instability may open one up to lawsuits
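A minimal illustration on synthetic numbers (the true effect and standard error are assumed for the example): the same teacher's annual estimate, with its 95% confidence interval, can bounce across performance categories purely from noise.

```python
# Minimal sketch: with realistic noise, one teacher's annual value-added
# estimate can flip categories year to year even though the true effect
# never changes (all numbers synthetic/assumed).
import numpy as np

rng = np.random.default_rng(4)
true_va = 0.10   # assumed true teacher effect
se = 0.12        # assumed standard error of an annual estimate

for year, est in enumerate(true_va + rng.normal(scale=se, size=5), start=1):
    lo, hi = est - 1.96 * se, est + 1.96 * se
    label = "above average" if lo > 0 else "below average" if hi < 0 else "average"
    print(f"year {year}: estimate {est:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}] -> {label}")
```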
Contact Information
Nandita G Gawade
Value-Added Research Center
University of Wisconsin - Madison
Session Evaluation
http://f8s.co/18tc