
The Evaluation of Teacher and School Effectiveness Using Growth Models and Value-Added Modeling: Hope Versus Reality

Robert W. Lissitz

University of Maryland

Maryland Assessment Research Center for Education Success

http://marces.org/Completed.htm

THANK YOU

First, I want to thank…

• The creators of this symposium

• Burcu Kaniskan

• The State of Maryland

• MARCES:

• Laura Reiner, Yuan Zhang, Xiaoshu Zhu, and Dr. Bill Schafer

• Drs. Xiaodong Hou and Ying Li

• Yong Luo, Matt Griffin, Tiago Calico, and Christy Lewis

PREVIEW

• Overview of the Literature:

• Reliability

• Validity

• Application of VAM to real data

• Direction of VAM in the future

INTRODUCTION

• The federal government is asking psychometricians to help

make decisions

• Race to the Top evaluating teachers and schools

• Earlier: No Child Left Behind (“Race to the Middle”) repealed

the law of individual differences

• The government wants a system that

• Pressures educational administrations to do the right thing

• Combats the teachers’ unions perceived as obstacles

• Seems to assume that teachers don’t want to teach effectively

RACE TO THE MIDDLE

• Value-added modeling (VAM) is a system that we hope

can determine the effectiveness of some mechanism

• Usually teachers or schools

• Most popular models include

• Simple regression, Transitions between performance

levels in adjacent grades, Mixed effects or multilevel

regression models (Teacher or school as level 2 effect)

• Models students’ performance over or under expectation,

aggregated by their teacher or school (usually normative)
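
A minimal sketch of the "performance over or under expectation" idea, assuming the simplest of the models listed (a one-predictor least-squares regression of this year's score on last year's). The records, scores, and function names are hypothetical. Each student's residual is the amount above or below expectation, and a teacher's estimate is the mean residual of that teacher's students. This is an illustration only, not the procedure of any particular VAM system.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def teacher_value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples."""
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    intercept, slope = fit_line(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        # Residual = observed score minus the score expected from prior performance.
        residuals[teacher].append(c - (intercept + slope * p))
    # Aggregate students' residuals by teacher (a normative, relative measure).
    return {t: mean(v) for t, v in residuals.items()}


# Hypothetical data: two teachers with three students each.
records = [
    ("A", 400, 420), ("A", 410, 432), ("A", 390, 409),
    ("B", 400, 405), ("B", 410, 414), ("B", 390, 396),
]
effects = teacher_value_added(records)
print(effects)
```

Because the estimate is normative, the two teachers' effects sum to zero here: one teacher can only look effective relative to the other.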

INTRODUCTION AND HISTORY: WHAT IS VAM?

• Nonrandom assignment of students to teachers

• Past effects or nuisance variables not controlled by use of prior

performance level

• Bias reduced using multiple prior measures, but not eliminated

• A teacher is advantaged if the class was unsuccessful last year

• “Dynamic” interaction between students and teachers

• Association between teacher effectiveness and student characteristics

• Effects may have different influence on students of different ability

• Testing is selective

• Many teachers with subjects not tested

• Memphis, TN – VAM does not apply to 70% of teachers

INTRODUCTION: VAM CHALLENGES – CRITICISM

Think of the reliability of VAM as a generalizability problem. Are inferences you draw from one

situation true in another situation?

RELIABILITY: GENERALIZABILITY

RELIABILITY

Mandeville (1988):

• Stability correlations for school effectiveness estimates ranged from 0.34 to 0.66

• Large differences across grade level and subject matter

McCaffrey (2009):

• Teacher effect estimates one year apart had correlations around 0.2 to 0.3

• Teaching itself may not be a stable phenomenon

• Variability may be due to actual performance changes from year to year; instability may be intractable
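
The stability figures quoted here are correlations between effectiveness estimates for the same teachers (or schools) at two points in time. A minimal sketch of that calculation, using hypothetical estimates for five teachers rather than data from any of the cited studies:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of estimates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical effect estimates for the same five teachers in two consecutive years.
year1 = [1.0, 2.0, 3.0, 4.0, 5.0]
year2 = [2.0, 1.0, 4.0, 3.0, 5.0]
stability = pearson(year1, year2)
print(round(stability, 2))
```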

STABILITY OVER A ONE-YEAR PERIOD

Sass (2008) and Newton et al. (2010):

• Estimates of teacher effectiveness from test-retest

assessments over a short time period

• Correlations in the range of 0.6

• Results may indicate a real, but modest, phenomenon

RELIABILITY: STABILITY OVER A SHORT PERIOD OF TIME

Mandeville & Anderson (1987) and others (Rockoff, 2004; Newton et al., 2010):

• Effectiveness fluctuates across grade and subject matter

• Stability, though modest, found more often with math

courses, less often with reading courses

• Does success depend on what class you are assigned

rather than your ability? To some extent it does.

• Serious issues of fairness and comparability

RELIABILITY: STABILITY ACROSS GRADE AND SUBJECT

Newton et al. (2010):

• Students who are less advantaged, ESL, or on a lower track can

have a negative impact on teacher effect estimates

• Perception that entire school is good or bad is very popular, but generally untrue

• Different grades and different subjects get different evaluations

• Bottom line:

• Rankings or groupings of schools or teachers (e.g., quintiles) are not highly stable.
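
One way to look at the stability of groupings is to assign quintiles in each of two years and count how many teachers stay in the same quintile. The sketch below assumes hypothetical teacher IDs and scores, and the rank-based quintile rule is an illustrative choice, not one taken from the cited studies.

```python
def quintiles(scores):
    """Map each teacher to a quintile, 1 (lowest) to 5 (highest), by rank."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {teacher: (rank * 5) // n + 1 for rank, teacher in enumerate(ordered)}


def same_quintile_rate(year1, year2):
    """Share of teachers present in both years who keep the same quintile."""
    q1, q2 = quintiles(year1), quintiles(year2)
    common = q1.keys() & q2.keys()
    return sum(q1[t] == q2[t] for t in common) / len(common)


# Hypothetical effectiveness scores for ten teachers in two years.
year1 = {f"T{i}": float(i) for i in range(10)}
year2 = {"T0": 1.0, "T1": 0.0, "T2": 4.0, "T3": 3.0, "T4": 2.0,
         "T5": 5.0, "T6": 6.0, "T7": 9.0, "T8": 8.0, "T9": 7.0}
rate = same_quintile_rate(year1, year2)
print(rate)
```

In this toy example the scores barely move, yet four of the ten teachers change quintile, which is the kind of instability the bottom line describes.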

RELIABILITY: STABILITY AT THE SCHOOL LEVEL

Sass (2008):

• Top quintile and bottom quintile seem the most stable

• Correlation of teacher effectiveness in those groups was 0.48 across comparable exams over a short time

• Time extended to a year between tests: correlation dropped to 0.27

Papay (2011):

• Three different tests

• Rank order correlations of teacher effectiveness across time ranged from 0.15 to 0.58 across different tests

• Test timing and measurement error have effects

RELIABILITY: STABILITY ACROSS TEST FORMS

Tekwe et al. (2004):

• Compared four similar regression models

• Unless such models involve different variables, results tend

to be similar

Dawes (1979):

• Linear composites seem to be pretty much the same

regardless of how one gets the weights

Hill et al. (2011):

• A big convergent validity problem

RELIABILITY: STABILITY ACROSS STATISTICAL MODELS

• 30-60% of variation is due to sampling error

• In part due to small numbers of students as the basis of

effectiveness estimates

• Regression to the mean

• Class sizes vary within a school or district

• Classroom measures based on fewer students tend toward the mean

• Bayes estimates in multilevel modeling introduce bias that is a function of sample size

• Other occupations: Lack of consistency of performance is typical of

complex professions – baseball players, stock investors…
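
The dependence of shrinkage on sample size can be sketched with the usual empirical Bayes form, in which a classroom's raw mean deviation is multiplied by a reliability weight that grows with class size. The variance components and class sizes below are hypothetical, and the function is a simplified stand-in for what a full multilevel model would estimate.

```python
def shrunken_effect(raw_mean, n_students, between_var, within_var):
    """Shrink a classroom's raw mean deviation toward the overall mean (zero).

    between_var: assumed variance of true teacher effects
    within_var:  assumed variance of student scores within a classroom
    """
    reliability = between_var / (between_var + within_var / n_students)
    return reliability * raw_mean


# Same raw advantage of 10 points, but different class sizes (hypothetical values).
small_class = shrunken_effect(10.0, n_students=5, between_var=4.0, within_var=100.0)
large_class = shrunken_effect(10.0, n_students=50, between_var=4.0, within_var=100.0)
print(round(small_class, 2), round(large_class, 2))
```

With these assumed values the small class keeps about one sixth of its raw advantage and the large class about two thirds, so the estimate for the smaller class is pulled much closer to the mean.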

RELIABILITY: SOURCES OF UNRELIABILITY

• Years of experience, advanced degrees, certification, licensure,

school quality, etc. have low relationship (if any) to teacher

effectiveness

• National Board certification is little better than a coin flip (Sanders and Wright, 2008)

• Knowledge of mathematics positively correlated with teaching

mathematics effectively

• VAM estimates provide better measures of teacher impact on

student test scores than measures on teacher’s job application

• Having trouble isolating teaching factors that relate to VAM

VALIDITY: JOB APPLICATIONS AS PREDICTIVE MEASURES

Reliability is the easy thing to study – Validity is much harder

Goe et al. (2008):

• Context for evaluation

• To draw valid conclusions, teachers should be compared to other

teachers who:

• Teach similar courses

• In same grade

• In a similar context

• Assessed by same or similar examination

• Similar student characteristics

VALIDITY: TRIANGULATION OF MULTIPLE INDICATORS

• Student ability is correlated with growth and status

• Gifted students learn at a faster rate

• Gifted students and their teachers have an

advantage

• Interaction between student ability and teachers’

opportunity to be effective

VALIDITY: COMPARABILITY

Rubin (2004):

• Missing data are not missing at random

• Missing in a way that confounds results and

complicates inferences

• We do not have a clear idea what our hypothesis is

• Multiple operational definitions of growth, but no

developmental science for the phenomenon

• No standardization for effectiveness

VALIDITY: CAUSALITY, RESEARCH DESIGN, AND THEORY

• Without carefully controlled experiments, we cannot isolate teacher effects

• Students have multiple teachers and other influences

• Effect of prior performance and experience

• What do we even mean when we say teachers have a causal effect?

• How do teachers and schools impart their supposed effect?

• How is it internalized by the student?

• Lord’s paradox

• ANCOVA does not lead to unambiguous interpretations

• We do not know what optimal teacher decision-making is
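
Lord's paradox can be illustrated with a toy data set in which two groups start at different levels and neither group's mean changes. A gain-score analysis reports no difference between the groups, while a covariance adjustment (ANCOVA with a common within-group slope) reports a sizable one. All numbers and function names below are invented for the illustration.

```python
from statistics import mean


def mean_gain(pre, post):
    """Average post-minus-pre change for one group."""
    return mean(b - a for a, b in zip(pre, post))


def pooled_within_slope(groups):
    """Common regression slope of post on pre, pooled within groups."""
    sxy = sxx = 0.0
    for pre, post in groups:
        mx, my = mean(pre), mean(post)
        sxy += sum((a - mx) * (b - my) for a, b in zip(pre, post))
        sxx += sum((a - mx) ** 2 for a in pre)
    return sxy / sxx


def ancova_group_difference(group_a, group_b):
    """Difference in post means (B minus A) after adjusting for pre scores."""
    slope = pooled_within_slope([group_a, group_b])
    post_diff = mean(group_b[1]) - mean(group_a[1])
    pre_diff = mean(group_b[0]) - mean(group_a[0])
    return post_diff - slope * pre_diff


# Toy (pre, post) scores: each group's mean is unchanged from pre to post.
group_a = ([40, 50, 60], [45, 50, 55])
group_b = ([60, 70, 80], [65, 70, 75])

print(mean_gain(*group_a), mean_gain(*group_b))   # both groups: zero mean gain
print(ancova_group_difference(group_a, group_b))  # adjusted difference is not zero
```

Both analyses are arithmetically correct; they answer different questions, which is why the choice between them cannot be settled by the data alone.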

VALIDITY: CAUSALITY, RESEARCH DESIGN, AND THEORY

Are teachers the most important factor determining student

achievement? NO.

• Nye et al. (2004): 11% of variation in student gains

explained by teacher effects

• Rockoff (2004): Teacher effects 5.0-6.4%

School effects 2.7-6.1%

Student fixed effects 59-68%

VALIDITY: WHY SHOULD WE CARE?

Importance of classroom context

• Kennedy (2010), etc.:

• Situational factors influence teacher success

• Time on task, materials, work assignments

Might add: controlling behavioral issues; mainstreaming only students who are willing and able to be non-disruptive

• Technical assistance with teaching (computers…)

New goal for teachers: maximize the context for learning


New paradigm? – a different orientation toward the teaching-learning process

• Teacher optimizes the context of the learning environment

• Adding to motivation

• Preventing disruption

• Providing opportunity for enhanced learning engagement

• Use of assistive teaching devices (computers) will change teacher’s role

• Develop a learning science

• Current paradigm emphasizes immediate generality and immediate usage, with questionable validity

• Instead, create laboratory for education science


• The MARCES Center has studied 11 of the simplest models that might be applied

• The full VAM report and the full text supporting this presentation can be accessed at

http://marces.org/Completed.htm

OUR STUDY: COMPARING MODELS USING REAL DATA

• We obtained 3 years of data on the same students, linked to their teachers

• Students divided into four cohorts: (N ≈ 5000 per cohort)

OUR STUDY: COMPARING MODELS USING REAL DATA

Cohort 1: 3rd, 4th, 5th grades

Cohort 2: 4th, 5th, 6th grades

Cohort 3: 5th, 6th, 7th grades

Cohort 4: 6th, 7th, 8th grades

• Math and reading data from yearly spring state assessment (2008-2010)

• No vertical scale

• Horizontally equated from year to year

• VAM models chosen for comparison do not require vertical scaling

• Nine models compare growth from first to second year

• Two models compare growth from first and second to third year

OUR STUDY: MODELS

Variable  Label
QRG1      Quantile regression with one predictor
QRG2      Quantile regression with two predictors
ConD      Deciles conditional on deciles
ConZ      Z scores conditional on deciles
OLS1      Ordinary least squares with one predictor
OLS2      Ordinary least squares with two predictors
OLSS      Ordinary least squares using spline scores
DIFS      Difference between spline scores
TRSG      Transition model with values reflecting both status and growth
TRUG      Transition model reflecting upward growth only
TRUD      Transition model reflecting upward and downward change

Quantile regression conditional on prior year(s) – Betebenner using percentiles

Simplification using conditional deciles of z-scores (effect size) - Thum

Least squares regression predicted by prior year(s)

Models using spline scores to create vertical scale - Schafer

Transition models

Simplification using deciles of students











Value Table for TRUD

        B1    B2    B3    P1    P2    P3    A1    A2    A3
B1       0   0.5     1   1.5     2   2.5     3   3.5     4
B2      -1     0   0.5     1   1.5     2   2.5     3     4
B3      -1    -1     0   0.5     1   1.5     2   2.5     3
P1      -2    -1    -1     0   0.5     1   1.5     2     3
P2      -2    -2    -1    -1     0   0.5     1   1.5     2
P3      -3    -2    -2    -1    -1     0   0.5     1     2
A1      -3    -3    -2    -2    -1    -1     0   0.5     1
A2      -4    -3    -3    -2    -2    -1    -1     0     1
A3      -4    -4    -3    -3    -2    -2    -1    -1     0

Value Table for TRUG

      B1  B2  B3  P1  P2  P3  A1  A2  A3
B1     1   3   4   4   4   4   4   4   4
B2     0   1   3   4   4   4   4   4   4
B3     0   0   1   3   3   4   4   4   4
P1     0   0   0   1   2   3   4   4   4
P2     0   0   0   0   1   2   4   4   4
P3     0   0   0   0   0   1   3   3   4
A1     0   0   0   0   0   0   1   2   3
A2     0   0   0   0   0   0   0   1   2
A3     0   0   0   0   0   0   0   0   1

TRSG rewards students for maintaining previous status and for growth within and across performance levels

• Reward increases with higher performance level status

TRANSITION MODELS: Performance Levels


TRUD values reflect growth as well as decreased performance, but not status

TRUG rewards students only for growth and does not punish for regressing

Value Table for TRSG

      B1  B2  B3  P1  P2  P3  A1  A2  A3
B1     9  11  13  15  17  19  21  23  25
B2     8  10  12  14  16  18  20  22  24
B3     7   9  11  13  15  17  19  21  23
P1     6   8  10  12  14  16  18  20  22
P2     5   7   9  11  13  15  17  19  21
P3     4   6   8  10  12  14  16  18  20
A1     3   5   7   9  11  13  15  17  19
A2     2   4   6   8  10  12  14  16  18
A3     1   3   5   7   9  11  13  15  17
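
A minimal sketch of how a transition model turns a value table into a classroom score, using the TRUG values from the slide. It assumes that rows are the prior-year performance level and columns are the current-year level (a reading of the table, not something the slide states) and that a teacher's score is the mean value across students. The classroom data and function names are hypothetical.

```python
LEVELS = ["B1", "B2", "B3", "P1", "P2", "P3", "A1", "A2", "A3"]

# TRUG values as given on the slide; row = prior-year level (assumed),
# column = current-year level (assumed).
TRUG = [
    [1, 3, 4, 4, 4, 4, 4, 4, 4],
    [0, 1, 3, 4, 4, 4, 4, 4, 4],
    [0, 0, 1, 3, 3, 4, 4, 4, 4],
    [0, 0, 0, 1, 2, 3, 4, 4, 4],
    [0, 0, 0, 0, 1, 2, 4, 4, 4],
    [0, 0, 0, 0, 0, 1, 3, 3, 4],
    [0, 0, 0, 0, 0, 0, 1, 2, 3],
    [0, 0, 0, 0, 0, 0, 0, 1, 2],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
]


def transition_value(table, prior_level, current_level):
    """Look up the value awarded for moving from prior_level to current_level."""
    return table[LEVELS.index(prior_level)][LEVELS.index(current_level)]


def classroom_mean_value(table, transitions):
    """Average the transition values of a teacher's students."""
    values = [transition_value(table, p, c) for p, c in transitions]
    return sum(values) / len(values)


# Hypothetical classroom: one student moves up, one holds steady, one drops.
classroom = [("B3", "P1"), ("P2", "P2"), ("A1", "P3")]
score = classroom_mean_value(TRUG, classroom)
print(round(score, 3))
```

Swapping in the TRUD or TRSG table changes only the lookup values, which is how the choice of table encodes a policy decision about penalizing decline or rewarding status.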


• Factor analysis of the intercorrelated student growth scores from these models for years 1-2, replicated for years 2-3

• One dimension accounts for largest percentage of variance

• Great deal of noise in results

• Over 80% of variance undefined by first dimension

• Results of factor analysis essentially the same for each pair of

years, for each cohort and for each content area
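
The share of variance carried by the first dimension can be approximated from the correlation matrix of the models' growth scores. The sketch below uses the largest eigenvalue of that matrix, found by power iteration; this is a principal-components shortcut rather than a full factor analysis, and the 3 × 3 matrix is a made-up example, not the study's data.

```python
from math import sqrt


def first_dimension_share(corr, iterations=500):
    """Largest eigenvalue of a correlation matrix divided by its size.

    Uses power iteration, which assumes the starting vector is not
    orthogonal to the leading eigenvector (true for typical matrices of
    mostly positive correlations).
    """
    p = len(corr)
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient with the unit-length vector v gives the eigenvalue.
    cv = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return eigenvalue / p


# Hypothetical correlations among three growth measures.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
share = first_dimension_share(corr)
print(round(share, 3))
```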

OUR STUDY: INTER-CORRELATION OF STUDENT GROWTH SCORES FROM EACH MODEL AND THEIR DIMENSIONALITY

Example: Scree Plot for Math 2008-2009, Cohort 1

OUR STUDY: INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY

OUR STUDY: THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING

[Figure: Growth Score Correlation between subjects, Year 2008-2009. Series: Cohort 1, Cohort 2, Cohort 3, Cohort 4. Models shown: QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD. Vertical axis: growth score correlation, 0.00 to 0.45.]

OUR STUDY: THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3)

[Figure: Growth Score Correlation across Years. Series: Math, Reading. Models shown: QRG1, ConD, ConZ, OLS1, OLSS, DIFS, TRSG, TRUG, TRUD. Vertical axis: growth score correlation, -0.50 to 0.40.]

OUR STUDY: TEACHER EFFECTIVENESS: RELIABILITY

[Figure: Year to Year Reliability of Teacher Effectiveness, two panels (Math, Reading). Series: Grade 5, Grade 6, Grade 7. Vertical axis: reliability, ticks from 0.10 to 1.00.]

OUR STUDY: SCHOOL EFFECTIVENESS: RELIABILITY

[Figure: Year to Year Reliability of School Effectiveness, two panels (Math, Reading). Series: Grade 5, Grade 6, Grade 7. Vertical axis: reliability, ticks from -0.20 to 1.00.]

Levels of Effectiveness

2008-2009 (Results are similar in 2009-2010)

OUR STUDY: COMPARISON BETWEEN SCHOOL AND TEACHER EFFECTIVENESS

[Figure: Series: School_Math, School_Reading, Teacher_Math, Teacher_Reading. Vertical axis: effectiveness, ticks from 0.10 to 0.60. Label: Math Cohort 1 in Year 2008-2009.]

OUR STUDY: METHODOLOGICAL ISSUES

THE MODEL YOU USE CAN MAKE A DIFFERENCE

• Decide how to balance status against growth

• No standardization for the modeling of VAM

• Traditional qualitative approaches used by principals are

not likely to be an improvement on VAM

• Using either approach for high stakes testing and

decision-making seems premature

• Combining two procedures that are not highly valid will not

necessarily result in a more valid system

OUR CONCLUSIONS

INTERACTIONS SHOULD BE MODELED

• All students do NOT react the same way

• Teachers are NOT the same over time

• Many differences exist within a school

OUR CONCLUSIONS

CONTEXT EFFECTS SHOULD BE STUDIED

• Teacher’s role should be changed

• Need to create a learning science

• Context may add to the modest results for teachers and schools

CHANGE IN INSTRUCTION INVOLVING SUPPORTIVE TECHNOLOGY

• Paradigm shift in education may be closer than we think

• Cognitive scientists, computer scientists, econometricians, engineers, and neuroscientists are beginning to study education

• Field can be expected to change as these researchers and

their students become more involved

• Teacher’s decision-making becoming more systematic

• Radical changes for the better are expected

OUR CONCLUSIONS

VAM FOR HIGH STAKES

• Right now, I do not encourage using VAM for high stakes

applications

• Might use VAM for initial screening, then follow-up

• It makes a difference which VAM model we implement

• Choose the model based on policy decisions that capture

the goals, values and intent of the school system

• Factors not in teacher’s control will have an effect

OUR CONCLUSIONS

RELATE VAM TO WHAT TEACHERS ARE DOING

• Create causal models and explore with experiments

• Effective teaching requires good measurement; it presents a great challenge and is a worthy goal…

OUR CONCLUSIONS

INTERESTED IN IMPLEMENTING A VAM?

• Read Finlay and Manavi (2008) and others first

• Practical political issues of using VAM in schools involve unions, the federal government, state government, special education advocates… and the list goes on and on…

Questions?

Robert W. Lissitz

University of Maryland

Maryland Assessment Research Center for Education Success

Visit http://marces.org to find references, the full text of this talk, and a comparison of value-added models. There will also be a MARCES conference on VAM (October 18 & 19).
