TRANSCRIPT
VALUE-ADDED:
A BRIEF OVERVIEW AND
IMPLICATIONS FOR
EDUCATOR EVALUATION
Nandita Gawade – Value-Added Research Center 6/18/13
Introduction
Value-Added Research Center
Rob Meyer – Center Director
3 Main lines of work
Education statistics and data analysis
Professional development and technical assistance
Technical policy analysis and development
Districts and States Working with VARC
[Map: partner districts include Minneapolis, Milwaukee, Racine, Chicago, Madison, Tulsa, Atlanta, New York City, Los Angeles, Hillsborough County, and Collier County; partner states include North Dakota, South Dakota, Minnesota, Wisconsin, Illinois, New York, and California.]
VARC Design Process:
Continuous Improvement
Objective
• Valid and fair comparisons of teachers serving different student populations
Model Co-Build
• Full disclosure: no black-box
• Model informed by technical and consequential validity
Output
• Productivity estimates (contribution to student academic growth)
• Data formatting
Stakeholder Feedback
• Model refinement
• New objectives
National Landscape
Large-scale top-down policies/grants
Original No Child Left Behind
Race to the Top
Teacher Incentive Fund
ESEA Waivers (NCLB tweak)
Relatively low quality assessment at the state level
Relatively low quality data systems
National Landscape Looking Ahead
Common Core standards
Common Core assessment
Much more advanced assessment and measurement
NWEA MAP and Common Core assessments pushing on
the depth-of-knowledge frontier
AI scoring of essays advancing faster than anticipated
Data systems seem to be lagging behind
How do Policy Makers React?
Some states and districts have moved to implement
new metrics
RTTT states have especially felt this
“bleeding edge”
Almost all others are in the process of
implementation (or will be) because of waivers
Challenge: implement the best policies using the
best tools available
Head in sand not allowed
A Fact
Fact: your friendly neighborhood education policy
maker does not sleep much these days
A Policy Framework
Many technical decisions
Many policy goals
No perfect answers
Researchers like us tend to muddy the waters
Where to start?
The ground rules are that we need the following:
Accuracy
  Criterion validity
  Technical (causal) validity
  Reliability (precision)
Consequential validity
Transparency
Technical validity
Technical validity measures the degree to which the
statistical model and data used in the model (for
example, student outcomes, student characteristics,
and student-classroom-teacher linkages) provide
consistent (unbiased) estimates of performance
using the available student outcomes/assessments
Requires development of a quasi-experimental
model that captures (to the extent possible) the
structural factors of the phenomenon at hand
Consequential validity
Consequential validity addresses the incentives and
decisions that are triggered by the design and use
of policy measures and systems
Consequential Validity: Uses and
Decisions
Parental choice of schools
Teachers' willingness to teach in given schools
Identification of master teachers
Identification of teachers for professional development
Performance-based compensation
Provision of supplemental services
Avoid bubble effects: incentives to deploy resources to particular students as an artifact of the statistical measure (statistics based on means rather than medians are affected by all students)
Transparency
Transparency addresses the consequences of
simplicity versus complexity in the design (and
clarity of explanation) of models and reports
A simple example
Consider the consequences of using attainment measures for accountability (NCLB)
Technical validity:
Valid if we believe we can create proficiency cut points on assessments
Consequential validity:
No context – why would a principal or teacher want to be judged on factors outside their control?
May drive the best talent away from the schools that need the most help
Student Growth Models
Current policy frameworks tend to require some
form of student growth model
Gain models, value-added models (several types),
student growth percentile models
All attempt to measure the same thing: assessment
achievement of kids conditional on starting point
Policy then “rates” teachers and schools based on
the metric
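As a concrete, purely illustrative sketch of "achievement conditional on starting point," the code below fits an OLS value-added regression with teacher indicators on synthetic data. This is a minimal sketch, not VARC's operational model; all names, parameters, and effect sizes are assumptions.

```python
# Minimal sketch of a value-added regression on synthetic data: posttest on
# pretest plus teacher indicators. Real models add demographics, multiple
# priors, measurement-error corrections, and shrinkage.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, class_size = 50, 25
teacher = np.repeat(np.arange(n_teachers), class_size)
effect = rng.normal(scale=0.2, size=n_teachers)   # true teacher effects
pre = rng.normal(size=teacher.size)
post = 0.7 * pre + effect[teacher] + rng.normal(scale=0.6, size=teacher.size)

df = pd.DataFrame({"post": post, "pre": pre, "teacher": teacher})
fit = smf.ols("post ~ pre + C(teacher)", data=df).fit()
# Each C(teacher) coefficient is that teacher's estimated contribution to
# growth, relative to the omitted reference teacher.
print(fit.params.filter(like="C(teacher)").head())
```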
Model Intuition
[Figure: gain model intuition (single-subject model). The starting student achievement scale score, Spring 2012 (Math), is compared with the actual student achievement scale score, Spring 2013 (Math); the difference is the gain.]
Value-Added Models
[Figure: value-added intuition (single-subject model). The actual Spring 2013 (Math) achievement scale score is compared with the predicted student achievement based on observationally similar students with the same Spring 2012 (Math) starting score; the difference is the value-added.]
Value-Added Visually with
Multiple Prior Tests
[Figure: value-added with multiple prior tests (multiple-subject model). Predicted Spring 2013 (Math) achievement is based on starting achievement scale scores across multiple subjects in Spring 2012; value-added is again the gap between actual and predicted achievement.]
What about SGP?
Same concept!
Student Growth Percentiles are a specific set of
assumptions on a regression model:
Spline relationship between post and pre
Quantile regression instead of mean regression
One can replicate almost everything about an SGP
model with more traditional models
More on the technical validity, consequential validity,
and transparency concerns later
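As a rough illustration of that claim, the sketch below approximates an SGP with off-the-shelf quantile regression on a B-spline of the pretest, using synthetic data. The operational SGP methodology differs (its own spline choices, multiple prior scores), so the details here are assumptions.

```python
# Rough SGP approximation: quantile regressions of posttest on a B-spline
# of pretest (synthetic data; spline details and quantile grid assumed).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
pre = rng.normal(size=n)
post = 0.7 * pre + rng.normal(scale=0.7, size=n)
df = pd.DataFrame({"pre": pre, "post": post})

quantiles = np.arange(0.05, 1.0, 0.05)
# One quantile regression per conditional percentile of the posttest.
fits = [smf.quantreg("post ~ bs(pre, df=4)", df).fit(q=q) for q in quantiles]
pred = np.column_stack([f.predict(df) for f in fits])

# A student's growth percentile ~ share of conditional quantile curves
# lying below the observed posttest score.
df["sgp"] = (pred < df["post"].values[:, None]).mean(axis=1) * 100
print(df.head())
```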
Posttest-Pretest Relationship
First step of modeling student growth for any
purpose
This is the most important part!
Options:
Flexible (estimated relationship) or not
Linear or nonlinear
Multiple pretests or single pretest
Flexible or Inflexible
Inflexible relationship means the policy maker “knows” the relationship from outside information
Gain models, certain layered value-added models
Technical validity:
Teacher or school models correct if the relationship assumption is correct
Consequential validity:
Too strong a relationship (most common)
Classrooms with high pretest scores are penalized (and vice versa)
Consequential validity of strong
inflexible assumption
[Figure: plotting Spring 2013 (Math) against Spring 2012 (Math) achievement. The prediction based on the data yields the true value-added; the prediction based on the assumed (too strong) relationship yields a different estimated VA.]
Posttest-Pretest Relationship:
Consequences
Staff are disincentivized from teaching high-achieving
students
Recent research by Reckase and Wooldridge
suggests choosing an inflexible relationship is not
optimal
What about estimated (data driven) relationships?
Estimated test relationships
Typical of value-added models used by the
research community
SGP could be considered a semiparametric
approach to estimating the relationship
Technical validity: if estimated properly no
assumption on the relationship needed
Consequential validity: neutralizes the incentives to
teach certain types of kids
Measurement error
Not so fast
The presence of measurement error in the
assessments causes a naïve estimation to fail
Creates a relationship that is too weak and thus
closer to attainment models
Classrooms that have high pretest kids will be more
likely to have high value added
Effects of measurement error (linear)
[Figure: standardized posttest vs. pretest (z-scores, axes from -2 to 2), showing four lines: a strong relationship (slope 1), the true relationship, the relationship estimated with measurement error (attenuated), and no relationship (attainment, slope 0).]
Measurement error correction
In some cases we can correct the estimated
relationship
Linear relationship most understood
Currently no way to correct for this in SGP
Recent research by Akram and Meyer suggests a
nonlinear mean regression model (such as the ones
used by MET) can be corrected
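A minimal sketch of the linear case on synthetic data: classical measurement error attenuates the estimated slope, and dividing by the pretest reliability (assumed known here for the example) recovers it.

```python
# Minimal sketch of attenuation bias and a reliability-based correction for
# the linear case (classical errors-in-variables; the pretest reliability is
# assumed known here).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_pre = rng.normal(size=n)
obs_pre = true_pre + rng.normal(scale=0.5, size=n)     # pretest with error
post = 0.8 * true_pre + rng.normal(scale=0.6, size=n)  # true slope = 0.8

naive_slope = np.cov(obs_pre, post)[0, 1] / np.var(obs_pre)
reliability = np.var(true_pre) / np.var(obs_pre)       # ~ 0.8 by construction
corrected_slope = naive_slope / reliability

print(f"naive slope     {naive_slope:.2f}   (attenuated toward attainment)")
print(f"corrected slope {corrected_slope:.2f}   (recovers the true 0.80)")
```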
Consequential validity expanded
Consider the consequences of controlling for prior achievement and other predictors – switching from measurement of attainment (as in NCLB) to growth
Technical validity:
Positive – causal classroom estimates are more accurate
Consequential validity:
Positive if we want to be able to compare schools/teachers on a level playing field
Negative if controlling for prior achievement and other predictors inevitably leads to reduced expectations
Consequential Validity
[Figure: single-subject model. Class 1's actual Spring 2013 (Math) achievement is compared to its predicted achievement, yielding its value-added; Class 2 (low SES) starts below the proficiency line, raising the question of what the expectation should be.]
Consequential Validity
[Figure: single-subject model, continued. Class 3 (very low SES) starts so far below proficiency that even the teacher effect of the best teacher in the district does not close the gap to the proficiency expectation. How do we make this part up? Who makes it up?]
Key Point: the Power of Two
Decisions need to be informed by:
Measure of school/classroom or teacher performance
Measures of student achievement
Actual average student achievement
Student achievement target (e.g., proficiency status)
Options
Use only information on student attainment (NCLB)
Use only information on value-added performance
Use both pieces of data to inform decisions
Peer effects (classroom information)
Some models (including some we use) use classroom average demographics to capture peer effects
Technical validity: peer effect literature suggests this is not necessarily correct (error of commission)
Over control
Eliminates the ability to estimate selection across a state or district
Technical validity: also not correct to omit (error of omission)
Misspecified model
We need a one-armed economist!
Peer effects
Now we have a policy decision
Consequential validity:
Accountability for HR purposes – gives an absolute fair shot to every teacher in every situation
Possibly too fair!!
Accountability for parental choice issues – hides the fact that some schools are in fact worse
May need 2 estimates
Ehlert, Koedel, Parsons and Podgursky discuss this issue and claim using class averages is a superior policy
Again – depends on use!
Endogenous covariates
Some policies would like attendance or suspensions included in the model
Technical validity: these factors may improve the precision of a prediction
Consequential validity:
We would be controlling for at least part of a teacher’s or school’s effect on student achievement
Does an effective teacher develop strategies to increase attendance?
Can a teacher/school control attendance 100%?
Small Sample Size
With naïve estimation, small schools and classrooms
would be falsely overrepresented in the highest and
lowest Value-Added categories.
Technical validity:
Shrinkage improves the accuracy and precision of
Value-Added estimates by adjusting for the wider
variance that occurs simply as a result of teaching
fewer students.
Shrinkage increases the stability of Value-Added
estimates from year to year.
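A minimal sketch of the shrinkage idea on synthetic data (the variance components are assumed known here, while operational models estimate them): noisy estimates from small classes get pulled hardest toward the mean.

```python
# Minimal sketch of empirical-Bayes shrinkage (synthetic data; variance
# components assumed known for the example).
import numpy as np

rng = np.random.default_rng(2)
sigma_teacher = 0.20   # assumed sd of true teacher effects
sigma_student = 0.80   # assumed student-level residual sd

class_sizes = np.array([5, 12, 25, 60])
true_effects = rng.normal(scale=sigma_teacher, size=class_sizes.size)

for n, true in zip(class_sizes, true_effects):
    raw = true + rng.normal(scale=sigma_student / np.sqrt(n))  # noisy estimate
    # Shrinkage factor = signal variance / (signal + noise variance):
    # the noisier the raw estimate, the harder it is pulled toward zero.
    shrink = sigma_teacher**2 / (sigma_teacher**2 + sigma_student**2 / n)
    print(f"n={n:3d}  raw={raw:+.3f}  shrunken={shrink * raw:+.3f}  factor={shrink:.2f}")
```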
Small Sample Size
Consequential Validity:
Small classrooms will now show little variance and will
almost always appear average
May be better than false extremes
Be careful how this integrates into policy
State Models
Technically similar to a district model
Consequentially very different
Imagine 2 school districts (Happyville ISD and Sadville ISD)
For the same salary, all teachers prefer to teach in Happyville
Sadville ISD can only hire teachers that were not accepted at Happyville ISD
If we make VA part of accountability across the state we will see higher VA on average at Happyville ISD
State Models
We could give Sadville a break and develop a model that removes the differences across districts
From a parent's perspective, we have eliminated useful information about where they should buy a house
Important – this DOES NOT change the fact that the teachers at Sadville ISD are less effective
Sadville ISD still has to serve its students – should we penalize it then for market conditions it cannot control?
Integration With Other Metrics
Many policy systems attempt to “weight” different statistical measures
Technical validity: one can always average numbers – but the interpretation of the composite is entirely unclear
How should a teacher with a high observation score and low VA compare to one who is middle on both?
Consequential Validity
Depending on how the point system is set up gaming is possible
Rankings focus attention on the highest-variance component, not the highest-weighted one
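The sketch below illustrates that variance point on synthetic data: with a 70/30 nominal weighting, the lower-weighted but higher-variance component still dominates the composite ranking. All numbers are assumed for the example.

```python
# Synthetic illustration: composite rankings follow the highest-variance
# component, not the nominal weights.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
obs = rng.normal(loc=3.0, scale=0.2, size=n)  # observation scores: low variance
va = rng.normal(loc=0.0, scale=1.0, size=n)   # value-added: high variance

composite = 0.7 * obs + 0.3 * va  # nominal weights: 70% observation, 30% VA

def ranks(x):
    return x.argsort().argsort()

print("rank corr with observation:", np.corrcoef(ranks(composite), ranks(obs))[0, 1].round(2))
print("rank corr with value-added:", np.corrcoef(ranks(composite), ranks(va))[0, 1].round(2))
# Despite its 30% weight, VA dominates the ranking because its variance is larger.
```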
Transparency: LAUSD
Los Angeles Unified School District recently
implemented a value-added metric (AGT, Academic Growth over Time)
Process:
Technical co-build
Advisory panel
Slow rollout
Thoughtful model selection (and rejection)
Massive communication effort
No accountability decisions until model build complete
Model Criteria
Student coverage
Reliability
Stability
Exogenous variance explained
Teacher variance
Model Diagnostics Criteria

Statistic                 Low Value   High Value
Within-Teacher R-Sq       0.5         0.8
Reliability               0.5         0.9
Estimate of Teacher SD    0.1         0.3
Some models failed
ELA 10 and 11 failed to differentiate between
teachers (but worked at the school level)
Science 5 included a curriculum that spanned
multiple teachers, and no adequate model could be
developed for that data
A note on statistical noise
Standard errors can be created for all statistical
estimates
It is inappropriate to present the point estimate of a
student growth model as reality
This is especially important when integrating these
metrics into policy use
High-noise (and thus high-instability) estimates
should only be used for low-stakes decisions – ignoring
that instability may open one up to lawsuits
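A minimal illustration on synthetic numbers (the true effect and standard error are assumed for the example): the same teacher's annual estimate, with its 95% confidence interval, can bounce across performance categories purely from noise.

```python
# Minimal sketch: with realistic noise, one teacher's annual value-added
# estimate can flip categories year to year even though the true effect
# never changes (all numbers synthetic/assumed).
import numpy as np

rng = np.random.default_rng(4)
true_va = 0.10   # assumed true teacher effect
se = 0.12        # assumed standard error of an annual estimate

for year, est in enumerate(true_va + rng.normal(scale=se, size=5), start=1):
    lo, hi = est - 1.96 * se, est + 1.96 * se
    label = "above average" if lo > 0 else "below average" if hi < 0 else "average"
    print(f"year {year}: estimate {est:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}] -> {label}")
```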
Contact Information
Nandita G Gawade
Value-Added Research Center
University of Wisconsin - Madison
Session Evaluation
http://f8s.co/18tc