Validity in an Era of Accountability
Daniel Koretz
CRESST/ Harvard Graduate School of Education
Two types of issues raised by testing for accountability (TFA)
Behavioral issues
Arise from behavioral responses to testing other than those that improve learning
Non-behavioral issues
Not stemming from behavioral responses to testing
Some non-behavioral issues raised by TFA
Error:
Sampling error in aggregate statistics used for accountability
Error in PACs
Error in value-added estimates for teachers or schools
Reporting issues:
Choice of aggregate reporting metric
Issues raised by standards-based reporting
Causal inference: ascertaining effectiveness of programs, teachers, and schools
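The sampling-error item above can be illustrated with a minimal sketch of how the standard error of a school mean depends on the number of tested students. The student-level standard deviation of 25 scale-score points and the cohort sizes are illustrative assumptions, not figures from the talk, and the formula assumes simple random sampling of students.

```python
import math


def standard_error_of_mean(student_sd: float, n_students: int) -> float:
    """Standard error of a school mean, assuming simple random sampling."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return student_sd / math.sqrt(n_students)


# Hypothetical values, chosen only for illustration.
ASSUMED_STUDENT_SD = 25.0

for n in (25, 100, 400):
    se = standard_error_of_mean(ASSUMED_STUDENT_SD, n)
    print(f"n={n:4d}  SE of school mean = {se:.2f} points")
```

Under these assumptions, a school testing 25 students has a standard error of 5 points, so a year-to-year change of a few points may be hard to distinguish from sampling noise.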
Behavioral issues raised by TFA
“Right-hand side gaming”: affects who is tested
Exclusion
Reclassification of students
Retention in grade
“Left-hand side gaming”: affects the scores of tested students
Inappropriate test preparation and score inflation
Examples of Bob Linn’s work on TFA
Reasonableness and robustness of performance standards
Causal inference from test scores
Inconsistencies among performance metrics
Reliability of aggregate performance estimates
Numerous aspects of accountability system design, e.g., AYP
Score inflation from high stakes
Where does research on TFA fit in the measurement field?
“Traditional psychometrics plus”
Veneer of research responding to TFA
Considerable amount on non-behavioral issues
Much less on behavioral issues
Practice has not changed sufficiently
Some areas extended, e.g., treatment of reliability
Impact of work on non-behavioral issues limited by client demands (e.g., standards)
Behavioral issues have been largely ignored
Why worry about behavioral issues?
Research shows a major threat to validity: bias of .50-.75 SD
Bias is inconsistent in size across schools
Little is known about the distribution of the bias
Cannot evaluate overall improvement
Kids are left behind, despite illusion of progress
Cannot evaluate relative improvement
to identify schools for reward, corrective action, emulation
One example of the threat to validity: grade 4 KIRIS reading
                        KIRIS   NAEP
Gain in scale scores    18.8    -1
Standardized gain       0.76    -0.03
Source: Hambleton et al., 1995
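As a rough reading of these figures, the sketch below computes the gap between the two standardized gains and back-calculates the KIRIS standard deviation implied by the reported numbers. Treating NAEP as an audit measure of the same domain is an assumption made here for illustration, and the implied SD is approximate because the reported values are rounded.

```python
def standardized_gain(raw_gain: float, sd: float) -> float:
    """Raw scale-score gain expressed in standard deviation units."""
    return raw_gain / sd


# Values reported on the slide (Hambleton et al., 1995).
KIRIS_RAW_GAIN = 18.8
KIRIS_STD_GAIN = 0.76
NAEP_STD_GAIN = -0.03

# Gap between the high-stakes gain and the gain on the second test.
# Reading this gap as inflation assumes both tests target the same domain.
gap_in_sd = KIRIS_STD_GAIN - NAEP_STD_GAIN

# SD implied by the reported KIRIS figures (approximate, due to rounding).
implied_kiris_sd = KIRIS_RAW_GAIN / KIRIS_STD_GAIN

print(f"Gap between standardized gains: {gap_in_sd:.2f} SD")
print(f"Implied KIRIS SD: {implied_kiris_sd:.1f} scale-score points")
```

The gap of about 0.79 SD sits close to the .50-.75 SD range of bias mentioned earlier in the talk.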
How are scores inflated?
RHS gaming:
Obvious: exclude low-scoring kids
LHS gaming:
Reallocation
Coaching
Cheating
Two characteristics of tests that underlie LHS gaming
Tests are like polls: small samples of a larger domain
Even if well aligned, tests omit relevant content
Scores only matter if they represent the domain
Tests have recurrences
In details of content (included and excluded)
In forms of presentation
In scoring rubrics
Reallocation
Shifting instructional resources among substantive areas
Within subject
Between subjects
Results in reallocating achievement
Within subjects, can lead to either meaningful change or inflation
Coaching
Focuses on details of the test
Unimportant substantive details
Non-substantive details, such as item formats and scoring rubrics
Includes test-taking tricks (e.g., process of elimination, plugging in)
Two ways that validity is undermined
Coaching and cheating: performance on measured elements is biased upward
Test-taking tricks
“Teaching to the rubric”
Focusing on details of presentation
Reallocation: Performance on individual elements is accurately measured but no longer represents domain
If deemphasized material matters for inference
Biased estimates of element-level performance (Princeton Review’s Cracking the MCAS)
Plugging in: “Rather than doing a problem like this in your head or trying to solve it algebraically, the easiest and fastest way to solve it is to plug in a number for x.”
Process of elimination: “Sometimes the best way to solve a problem is to figure out what the…wrong answers are and eliminate them…. It’s often easier to identify the wrong answers than to find the correct one.”
Pythagorean theorem: “Popular Pythagorean ratios include the 3:4:5 (and its multiples) and the 5:12:13 (and its multiples).”
Coaching or cheating?
The… review sheet…reads in part: “The average amount that each band member must raise is a function of the number of band members, b, with the rule f(b)=12000/b.”
The question on the actual test reads in part: “The average amount each cheerleader must pay is a function of the number of cheerleaders, n, with the rule f(n)=420/n.”
Source: Strauss, V., The Washington Post, July 10, 2001, p. A09
Homework
Download technical report for your state test
Find section on validity
Look for discussion of evidence relevant to these threats to validity
Why traditional validation is insufficient for TFA
Cross-sectional and generally correlational
Insensitive to changes in levels of performance
Assumes stability in relationships between tested and untested aspects of performance
Ignores omissions and recurrences in tests
Ignores behavioral responses to high-stakes testing
What needs to be done?
More research on TFA
Changes to the practice of measurement
Expanded approach to validation
New approaches to test design in response to issues of incentives and accountability
Possible changes to ‘operational’ procedures, such as linking
Additional research needed
More research on methods to disentangle inflation from meaningful gains
More research exploring the extent and distribution of inflation, e.g., across types of schools or students
More research exploring the variables shaping incentives in TFA, e.g.,
Characteristics of tests
Measures of performance employed
Rate of expected change
Evaluations of new designs for tests and accountability systems
Options for changes in test design
To better estimate true gain and to create better incentives (less incentive to narrow or coach)
1. Maximize breadth of coverage (matrix sample?)
2. Minimize unnecessary repetition, e.g., repetition of:
Details of content
Styles of presentation
Non-substantive task demands
3. Build in audit items to better estimate real gains
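One way to picture the audit-item option is to compare the gain on reused operational items with the gain on audit items that were not open to coaching. The sketch below does this with made-up proportion-correct values; it assumes the two item sets measure the same domain at comparable difficulty, which a real design would need to establish rather than assume.

```python
from statistics import mean


def mean_gain(year1, year2):
    """Change in mean proportion correct from year 1 to year 2."""
    return mean(year2) - mean(year1)


# Hypothetical proportion-correct values per item, for illustration only.
reused_year1 = [0.50, 0.55, 0.60]
reused_year2 = [0.70, 0.75, 0.80]
audit_year1 = [0.50, 0.55, 0.60]
audit_year2 = [0.55, 0.60, 0.65]

reused_gain = mean_gain(reused_year1, reused_year2)
audit_gain = mean_gain(audit_year1, audit_year2)

# Gain on reused items not matched on audit items: a rough inflation signal.
gap = reused_gain - audit_gain

print(f"Gain on reused items: {reused_gain:.2f}")
print(f"Gain on audit items:  {audit_gain:.2f}")
print(f"Gap:                  {gap:.2f}")
```

With these invented numbers, the audit items suggest a real gain of about 0.05 against a reported gain of about 0.20. The size of the gap in practice would depend on how the audit items are sampled and linked.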
Expanded approach to validation
Cannot stop with initial quality of tests and inferences
Must consider validity of inferences about gains after stakes have been imposed
Will require expanded and more routine auditing of gains
Should be treated as a core aspect of validity, e.g.,
In tech reports and texts