The Evaluation of Teacher and School Effectiveness Using Growth Models and Value-Added Modeling:
Hope Versus Reality
Robert W. Lissitz
University of Maryland
Maryland Assessment Research Center for Education Success
http://marces.org/Completed.htm
THANK YOU
First, I want to thank…
• The creators of this symposium
• Burcu Kaniskan
• The State of Maryland
• MARCES:
• Laura Reiner, Yuan Zhang, Xiaoshu Zhu, and Dr. Bill Schafer
• Drs. Xiaodong Hou and Ying Li
• Yong Luo, Matt Griffin, Tiago Calico, and Christy Lewis
PREVIEW
• Overview of the Literature:
• Reliability
• Validity
• Application of VAM to real data
• Direction of VAM in the future
INTRODUCTION
• The federal government is asking psychometricians to help
make decisions
• Race to the Top: evaluating teachers and schools
• Earlier: No Child Left Behind (“Race to the Middle”) repealed
the law of individual differences
• The government wants a system that
• Pressures educational administrations to do the right thing
• Combats the teachers’ unions perceived as obstacles
• Seems to assume that teachers don’t want to teach effectively
RACE TO THE MIDDLE
• Value-added modeling (VAM) is a system that we hope
can determine the effectiveness of some mechanism
• Usually teachers or schools
• Most popular models include
• Simple regression
• Transitions between performance levels in adjacent grades
• Mixed effects or multilevel regression models (teacher or school as level 2 effect)
• Models students’ performance over or under expectation,
aggregated by their teacher or school (usually normative)
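To make "performance over or under expectation, aggregated by teacher" concrete, here is a minimal sketch of the simplest regression variant: predict the current score from one prior score, then average each teacher's residuals. The function names and the six records are hypothetical illustrations, not data or code from the study.

```python
from collections import defaultdict
from statistics import mean


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def teacher_effects(records):
    """records: (teacher_id, prior_score, current_score) tuples.

    Returns {teacher_id: mean residual}, i.e. how far a teacher's
    students landed above or below the score expected from their prior score.
    """
    records = list(records)
    intercept, slope = ols_fit([r[1] for r in records], [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {teacher: mean(values) for teacher, values in residuals.items()}


# Hypothetical records: two teachers whose students have the same prior scores.
records = [
    ("T1", 400, 420), ("T1", 410, 432), ("T1", 390, 409),
    ("T2", 400, 405), ("T2", 410, 412), ("T2", 390, 396),
]
effects = teacher_effects(records)
print(effects)
```

The estimate is normative: it is zero on average across all students, so one teacher can only look effective relative to the others in the same regression.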
INTRODUCTION AND HISTORY: WHAT IS VAM?
• Nonrandom assignment of students to teachers
• Past effects or nuisance variables not controlled by use of prior
performance level
• Bias reduced using multiple prior measures, but not eliminated
• Teachers are advantaged when their class was unsuccessful last year
• “Dynamic” interaction between students and teachers
• Association between teacher effectiveness and student characteristics
• Effects may have different influence on students of different ability
• Testing is selective
• Many teachers with subjects not tested
• Memphis, TN – VAM does not apply to 70% of teachers
INTRODUCTION: VAM CHALLENGES – CRITICISM
Think of the reliability of VAM as a generalizability problem. Are inferences you draw from one
situation true in another situation?
RELIABILITY: GENERALIZABILITY
RELIABILITY
Mandeville (1988):
• School effectiveness estimates were stable in the 0.34 to 0.66 range of correlations
• Large differences across grade level and subject matter
McCaffrey (2009):
• Teacher effect estimates one year apart had correlations around 0.2 to 0.3
• Teaching itself may not be a stable phenomenon
• Variability may be due to actual performance changes from year to year; instability may be intractable
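The stability figures quoted here are correlations between two sets of estimates for the same teachers or schools. A minimal sketch of that computation follows; the teacher IDs and effect estimates are hypothetical, and a real analysis would involve far more teachers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical effect estimates for the same teachers in two adjacent years.
year1 = {"T1": 0.30, "T2": -0.10, "T3": 0.05, "T4": -0.25, "T5": 0.12}
year2 = {"T1": 0.02, "T2": 0.08, "T3": -0.15, "T4": -0.05, "T5": 0.20}

# Only teachers with an estimate in both years can be compared.
teachers = sorted(set(year1) & set(year2))
stability = pearson([year1[t] for t in teachers], [year2[t] for t in teachers])
print(round(stability, 2))
```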
STABILITY OVER A ONE-YEAR PERIOD
Sass (2008) and Newton, et al (2010):
• Estimates of teacher effectiveness from test-retest
assessments over a short time period
• Correlations in the range of 0.6
• Results may indicate a real phenomenon, but modest
RELIABILITY: STABILITY OVER A SHORT PERIOD OF TIME
Mandeville & Anderson (1987) and others (Rockoff, 2004;
Newton, et al, 2010):
• Effectiveness fluctuates across grade and subject matter
• Stability, though modest, found more often with math
courses, less often with reading courses
• Does success depend on what class you are assigned
rather than your ability? To some extent it does.
• Serious issues of fairness and comparability
RELIABILITY: STABILITY ACROSS GRADE AND SUBJECT
Newton, et al (2010):
• Students who are less advantaged, ESL, or on a lower track can
have a negative impact on teacher effect estimates
• Perception that entire school is good or bad is very popular, but generally untrue
• Different grades and different subjects get different evaluations
• Bottom line:
• Rankings or groupings of schools or teachers (e.g., quintiles) are not highly stable.
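One way to check how stable quintile groupings are is to count how many teachers stay in the same quintile from one year to the next. The sketch below assumes rank-based quintiles and uses hypothetical scores; it is an illustration, not the procedure used in the studies cited.

```python
def quintiles(scores):
    """Map each id to a quintile 1..5 by rank (1 = lowest fifth)."""
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    return {key: (rank * 5) // n + 1 for rank, key in enumerate(ranked)}


def share_same_quintile(first_year, second_year):
    """Fraction of ids that fall in the same quintile in both years."""
    common = sorted(set(first_year) & set(second_year))
    q1 = quintiles({key: first_year[key] for key in common})
    q2 = quintiles({key: second_year[key] for key in common})
    return sum(q1[key] == q2[key] for key in common) / len(common)


# Hypothetical scores for ten teachers; year 2 reverses the year 1 ranking.
year1 = {f"T{i}": float(i) for i in range(10)}
year2 = {f"T{i}": float(9 - i) for i in range(10)}

print(share_same_quintile(year1, year1))  # identical rankings
print(share_same_quintile(year1, year2))  # reversed rankings
```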
RELIABILITY: STABILITY AT THE SCHOOL LEVEL
Sass (2008):
• Top quintile and bottom quintile seem the most stable
• Correlation of teacher effectiveness in those groups was 0.48 across comparable exams over a short time
• Time extended to a year between tests: correlation dropped to 0.27
Papay (2011):
• Three different tests
• Rank order correlations of teacher effectiveness across time ranged from 0.15 to 0.58 across different tests
• Test timing and measurement error have effects
RELIABILITY: STABILITY ACROSS TEST FORMS
Tekwe, et al (2004):
• Compared four similar regression models
• Unless such models involve different variables, results tend
to be similar
Dawes (1979):
• Linear composites seem to be pretty much the same
regardless of how one gets the weights
Hill, et al (2011):
• A big convergent validity problem
RELIABILITY: STABILITY ACROSS STATISTICAL MODELS
• 30-60% of variation is due to sampling error
• In part due to small numbers of students as the basis of
effectiveness estimates
• Regression to the mean
• Class sizes vary within a school or district
• Classroom measures based on fewer students tend toward the mean
• Bayes estimates in multilevel modeling introduce bias that is a
function of sample size
• Other occupations: Lack of consistency of performance is typical of
complex professions – baseball players, stock investors…
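The dependence on sample size can be seen in the textbook shrinkage estimate for a random-intercept model, where a class mean is pulled toward the grand mean with weight tau^2 / (tau^2 + sigma^2 / n). The sketch below uses that standard formula; the variance values are hypothetical, and this is not necessarily the estimator used by any particular VAM system.

```python
def shrinkage_weight(n_students, var_between, var_within):
    """Weight on the class's own mean: tau^2 / (tau^2 + sigma^2 / n)."""
    return var_between / (var_between + var_within / n_students)


def shrunken_effect(raw_mean, n_students, var_between, var_within, grand_mean=0.0):
    """Pull a raw class mean toward the grand mean; small classes are pulled more."""
    weight = shrinkage_weight(n_students, var_between, var_within)
    return grand_mean + weight * (raw_mean - grand_mean)


# Hypothetical variances: between-teacher 0.04, within-class 1.0.
for n in (10, 25, 100):
    print(n, round(shrunken_effect(0.5, n, 0.04, 1.0), 3))
```

Two teachers with the same raw class mean therefore receive different estimates when their class sizes differ, which is the bias the slide refers to.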
RELIABILITY: SOURCES OF UNRELIABILITY
• Years of experience, advanced degrees, certification, licensure,
school quality, etc. have low relationship (if any) to teacher
effectiveness
• National Board little better than a coin flip (Sanders and Wright,
2008)
• Knowledge of mathematics positively correlated with teaching
mathematics effectively
• VAM estimates provide better measures of teacher impact on
student test scores than measures on teacher’s job application
• Having trouble isolating teaching factors that relate to VAM
VALIDITY: JOB APPLICATIONS AS PREDICTIVE MEASURES
Reliability is the easy thing to study – Validity is much harder
Goe, et al (2008):
• Context for evaluation
• To draw valid conclusions, teachers should be compared to other
teachers who:
• Teach similar courses
• In same grade
• In a similar context
• Assessed by same or similar examination
• Similar student characteristics
VALIDITY: TRIANGULATION OF MULTIPLE INDICATORS
• Student ability is correlated with growth and status
• Gifted students learn at a faster rate
• Gifted students and their teachers have an
advantage
• Interaction between student ability and teachers’
opportunity to be effective
VALIDITY: COMPARABILITY
Rubin (2004):
• Missing data are not missing at random
• Missing in a way that confounds results and
complicates inferences
• We do not have a clear idea what our hypothesis is
• Multiple operational definitions of growth, but no
developmental science for the phenomenon
• No standardization for effectiveness
VALIDITY: CAUSALITY, RESEARCH DESIGN, AND THEORY
• Without carefully controlled experiments, we cannot isolate teacher effects
• Students have multiple teachers and other influences
• Effect of prior performance and experience
• What do we even mean by teachers have a causal effect?
• How do teachers and schools impart their supposed effect?
• How is it internalized by the student?
• Lord’s paradox
• ANCOVA does not lead to unambiguous interpretations
• We do not know what optimal teacher decision-making is
VALIDITY: CAUSALITY, RESEARCH DESIGN, AND THEORY
Are teachers the most important factor determining student
achievement? NO.
• Nye, et al (2004): 11% of variation in student gains
explained by teacher effects
• Rockoff (2004):
• Teacher effects 5.0-6.4%
• School effects 2.7-6.1%
• Student fixed effects 59-68%
VALIDITY: WHY SHOULD WE CARE?
Importance of classroom context
• Kennedy (2010), etc.:
• Situational factors influence teacher success
• Time on task, materials, work assignments
• Might add: controlling behavioral issues; mainstreaming only
students who are willing and able to be non-disruptive
• Technical assistance with teaching (computers, etc.)
New teacher’s Goal: Maximize the context for learning
VALIDITY: WHY SHOULD WE CARE?
New paradigm? – different orientation toward the teaching-learning process
• Teacher optimizes the context of the learning environment
• Adding to motivation
• Preventing disruption
• Providing opportunity for enhanced learning engagement
• Use of assistive teaching devices (computers) will change teacher’s role
• Develop a learning science
• Current paradigm emphasizes immediate generality and immediate usage, with questionable validity
• Instead, create laboratory for education science
VALIDITY: WHY SHOULD WE CARE?
• The MARCES Center has studied 11 of the simplest models that might be applied
• The full VAM report and the full text supporting this presentation can be accessed at
http://marces.org/Completed.htm
OUR STUDY: COMPARING MODELS USING REAL DATA
• We obtained 3 years of data on the same students, linked to their teachers
• Students divided into four cohorts (N ≈ 5000 per cohort):
• Cohort 1: 3rd, 4th, 5th grades
• Cohort 2: 4th, 5th, 6th grades
• Cohort 3: 5th, 6th, 7th grades
• Cohort 4: 6th, 7th, 8th grades
• Math and reading data from yearly spring state assessment (2008-2010)
• No vertical scale
• Horizontally equated from year to year
• VAM models chosen for comparison do not require vertical scaling
• Nine models compare growth from first to second year
• Two models compare growth from first and second to third year
OUR STUDY: MODELS
Variable  Label
QRG1      Quantile regression with one predictor
QRG2      Quantile regression with two predictors
ConD      Deciles conditional on deciles
ConZ      Z scores conditional on deciles
OLS1      Ordinary least squares with one predictor
OLS2      Ordinary least squares with two predictors
OLSS      Ordinary least squares using spline scores
DIFS      Difference between spline scores
TRSG      Transition model with values reflecting both status and growth
TRUG      Transition model reflecting upward growth only
TRUD      Transition model reflecting upward and downward change
Model families:
• Quantile regression conditional on prior year(s) – Betebenner using percentiles
• Simplification using deciles of students
• Simplification using conditional deciles of z-scores (effect size) – Thum
• Least squares regression predicted by prior year(s)
• Models using spline scores to create vertical scale – Schafer
• Transition models
TRANSITION MODELS: Performance Levels
TRSG rewards students for maintaining previous status and for growth within and across performance levels
• Reward increases with higher performance level status
TRUD values reflect growth as well as decreased performance, but not status
TRUG rewards students only for growth and does not punish for regressing

Value Table for TRSG
      B1   B2   B3   P1   P2   P3   A1   A2   A3
B1     9   11   13   15   17   19   21   23   25
B2     8   10   12   14   16   18   20   22   24
B3     7    9   11   13   15   17   19   21   23
P1     6    8   10   12   14   16   18   20   22
P2     5    7    9   11   13   15   17   19   21
P3     4    6    8   10   12   14   16   18   20
A1     3    5    7    9   11   13   15   17   19
A2     2    4    6    8   10   12   14   16   18
A3     1    3    5    7    9   11   13   15   17

Value Table for TRUD
      B1   B2   B3   P1   P2   P3   A1   A2   A3
B1     0  0.5    1  1.5    2  2.5    3  3.5    4
B2    -1    0  0.5    1  1.5    2  2.5    3    4
B3    -1   -1    0  0.5    1  1.5    2  2.5    3
P1    -2   -1   -1    0  0.5    1  1.5    2    3
P2    -2   -2   -1   -1    0  0.5    1  1.5    2
P3    -3   -2   -2   -1   -1    0  0.5    1    2
A1    -3   -3   -2   -2   -1   -1    0  0.5    1
A2    -4   -3   -3   -2   -2   -1   -1    0    1
A3    -4   -4   -3   -3   -2   -2   -1   -1    0

Value Table for TRUG
      B1   B2   B3   P1   P2   P3   A1   A2   A3
B1     1    3    4    4    4    4    4    4    4
B2     0    1    3    4    4    4    4    4    4
B3     0    0    1    3    3    4    4    4    4
P1     0    0    0    1    2    3    4    4    4
P2     0    0    0    0    1    2    4    4    4
P3     0    0    0    0    0    1    3    3    4
A1     0    0    0    0    0    0    1    2    3
A2     0    0    0    0    0    0    0    1    2
A3     0    0    0    0    0    0    0    0    1
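A value table turns each student's pair of performance levels into a number, and a teacher or school score is then an aggregate of those numbers. The sketch below uses the TRUD values from the slide and assumes that rows are the earlier-year level and columns the later-year level (the slide does not label the axes; this reading is inferred from the signs). Averaging is also an assumption, and the three students are hypothetical.

```python
LEVELS = ["B1", "B2", "B3", "P1", "P2", "P3", "A1", "A2", "A3"]

# TRUD value table from the slide. Assumed layout: row = earlier-year level,
# column = later-year level.
TRUD = [
    [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4],
    [-1, 0, 0.5, 1, 1.5, 2, 2.5, 3, 4],
    [-1, -1, 0, 0.5, 1, 1.5, 2, 2.5, 3],
    [-2, -1, -1, 0, 0.5, 1, 1.5, 2, 3],
    [-2, -2, -1, -1, 0, 0.5, 1, 1.5, 2],
    [-3, -2, -2, -1, -1, 0, 0.5, 1, 2],
    [-3, -3, -2, -2, -1, -1, 0, 0.5, 1],
    [-4, -3, -3, -2, -2, -1, -1, 0, 1],
    [-4, -4, -3, -3, -2, -2, -1, -1, 0],
]


def transition_value(table, prior_level, current_level):
    """Look up the value for moving from prior_level to current_level."""
    return table[LEVELS.index(prior_level)][LEVELS.index(current_level)]


def mean_transition_score(table, transitions):
    """Average the table values over (prior_level, current_level) pairs."""
    values = [transition_value(table, prior, current) for prior, current in transitions]
    return sum(values) / len(values)


# Hypothetical class: one student moves up, one stays put, one moves down.
classroom = [("B3", "P1"), ("P2", "P2"), ("A1", "P3")]
score = mean_transition_score(TRUD, classroom)
print(score)
```

Swapping in the TRUG or TRSG table changes only the lookup values, which is why the choice of table is a policy decision about how to weigh status against growth.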
• Factor analysis of student growth from these models
intercorrelated growth in year 1-2 and replicated for years 2-3
• One dimension accounts for largest percentage of variance
• Great deal of noise in results
• Over 80% of variance undefined by first dimension
• Results of factor analysis essentially the same for each pair of
years, for each cohort and for each content area
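The share of variance carried by the first dimension can be approximated from the largest eigenvalue of the correlation matrix, which is what a scree plot displays. The sketch below is a principal-components-style simplification rather than the factor analysis actually run in the study; it finds the largest eigenvalue by power iteration, and the 3 x 3 correlation matrix is hypothetical.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric matrix by power iteration.

    Starts from a vector of ones, which works when that vector is not
    orthogonal to the dominant eigenvector (true for the matrices used here).
    """
    n = len(matrix)
    vector = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
        # Rayleigh quotient v'Mv for the unit-length vector v.
        value = sum(
            vector[i] * sum(matrix[i][j] * vector[j] for j in range(n))
            for i in range(n)
        )
    return value


def first_dimension_share(corr):
    """Largest eigenvalue divided by the number of variables (the trace)."""
    return largest_eigenvalue(corr) / len(corr)


# Hypothetical correlations among growth scores from three models.
corr = [
    [1.00, 0.15, 0.10],
    [0.15, 1.00, 0.05],
    [0.10, 0.05, 1.00],
]
share = first_dimension_share(corr)
print(round(share, 3))
```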
OUR STUDY: INTER-CORRELATION OF STUDENT GROWTH SCORES FROM EACH MODEL AND THEIR DIMENSIONALITY
[Figure: Example scree plot for Math 2008-2009, Cohort 1]
OUR STUDY: THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING
[Figure: Growth Score Correlation between subjects, Year 2008-2009. Y-axis: growth score correlation (0.00 to 0.45). X-axis: model (QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD). Series: Cohort 1, Cohort 2, Cohort 3, Cohort 4.]
OUR STUDY: THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3)
[Figure: Growth Score Correlation across Years. Y-axis: growth score correlation (-0.50 to 0.40). X-axis: model (QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD). Series: Math, Reading.]
OUR STUDY: TEACHER EFFECTIVENESS: RELIABILITY
[Figure, two panels: Year to Year Reliability of Teacher Effectiveness, Math and Reading. Y-axis: reliability (0.00 to 1.00). X-axis: model (QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD). Series: Grade 5, Grade 6, Grade 7.]
OUR STUDY: SCHOOL EFFECTIVENESS: RELIABILITY
[Figure, two panels: Year to Year Reliability of School Effectiveness, Math and Reading. Y-axis: reliability (-0.30 to 1.00). X-axis: model (QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD). Series: Grade 5, Grade 6, Grade 7.]
OUR STUDY: COMPARISON BETWEEN SCHOOL AND TEACHER EFFECTIVENESS
Levels of Effectiveness, 2008-2009 (Results are similar in 2009-2010)
[Figure: Y-axis: effectiveness (0.00 to 0.60). X-axis: model (QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD). Series: School_Math, School_Reading, Teacher_Math, Teacher_Reading. Label: Math Cohort 1 in Year 2008-2009.]
OUR STUDY: METHODOLOGICAL ISSUES
THE MODEL YOU USE CAN MAKE A DIFFERENCE
• Decide how to balance status against growth
• No standardization for the modeling of VAM
• Traditional qualitative approaches used by principals are
not likely to be an improvement on VAM
• Using either approach for high stakes testing and
decision-making seems premature
• Combining two procedures that are not highly valid will not
necessarily result in a more valid system
OUR CONCLUSIONS
INTERACTIONS SHOULD BE MODELED
• All students do NOT react the same way
• Teachers are NOT the same over time
• Many differences exist within a school
OUR CONCLUSIONS
CONTEXT EFFECTS SHOULD BE STUDIED
• Teacher’s role should be changed
• Need to create a learning science
• Context may add to the modest results for teachers and schools
CHANGE IN INSTRUCTION INVOLVING SUPPORTIVE TECHNOLOGY
• Paradigm shift in education may be closer than we think
• Cognitive scientists, computer scientists, econometricians, engineers,
and neuroscientists are beginning to study education
• Field can be expected to change as these researchers and
their students become more involved
• Teacher’s decision-making becoming more systematic
• Radical changes for the better are expected
OUR CONCLUSIONS
VAM FOR HIGH STAKES
• Right now, I do not encourage using VAM for high stakes
applications
• Might use VAM for initial screening, then follow-up
• It makes a difference which VAM model we implement
• Choose the model based on policy decisions that capture
the goals, values and intent of the school system
• Factors not in teacher’s control will have an effect
OUR CONCLUSIONS
RELATE VAM TO WHAT TEACHERS ARE DOING
• Create causal models and explore with experiments
• Effective teaching requires good measurement; it presents a great challenge and is a worthy goal…
OUR CONCLUSIONS
INTERESTED IN IMPLEMENTING A VAM?
• Read Finlay and Manavi (2008) and others first
• Practical political issues of using VAM in schools involve
unions, federal government, state government, special education advocates… and the list goes on and on…
Questions?
Robert W. Lissitz
University of Maryland
Maryland Assessment Research Center for Education Success
Visit http://marces.org to find references, the full text of this talk, and a comparison of value-added
models. There will be a MARCES conference on VAM (October 18 & 19).