
Validity in an Era of Accountability

Daniel Koretz

CRESST/ Harvard Graduate School of Education



2

Two types of issues raised by testing for accountability (TFA)

Behavioral issues

Arise from behavioral responses to testing other than those that improve learning

Non-behavioral issues

Not stemming from behavioral responses to testing


3

Some non-behavioral issues raised by TFA

Error:

Sampling error in aggregate statistics used for accountability

Error in PACs

Error in value-added estimates for teachers or schools

Reporting issues:

Choice of aggregate reporting metric

Issues raised by standards-based reporting

Causal inference: ascertaining effectiveness of programs, teachers, and schools
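
As a rough illustration of the sampling-error point on this slide, the standard error of a school mean shrinks with the square root of the number of tested students, so small schools get much noisier aggregate scores. The sketch below assumes simple random sampling and a hypothetical student-level SD of 25 scale points; both are assumptions for illustration, not figures from this talk.

```python
import math


def se_of_mean(sd: float, n: int) -> float:
    """Standard error of a group mean under simple random sampling."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sd / math.sqrt(n)


# Hypothetical student-level SD in scale-score points (an assumption).
STUDENT_SD = 25.0

small_school = se_of_mean(STUDENT_SD, 25)    # 25 tested students
large_school = se_of_mean(STUDENT_SD, 400)   # 400 tested students

print(f"SE with  25 students: {small_school:.2f}")
print(f"SE with 400 students: {large_school:.2f}")
```

Under these assumptions the small school's mean is four times as noisy as the large school's, which is one reason year-to-year changes in small schools are hard to interpret.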


4

Behavioral issues raised by TFA

“Right-hand side gaming:” affects who is tested

Exclusion

Reclassification of students

Retention in grade

“Left-hand side gaming:” affects the scores of tested students

Inappropriate test preparation and score inflation


5

Examples of Bob Linn’s work on TFA

Reasonableness and robustness of performance standards

Causal inference from test scores

Inconsistencies among performance metrics

Reliability of aggregate performance estimates

Numerous aspects of accountability system design, e.g., AYP

Score inflation from high stakes


6

Where does research on TFA fit in the measurement field?

“Traditional psychometrics plus”

Veneer of research responding to TFA

Considerable amount on non-behavioral issues

Much less on behavioral issues

Practice has not changed sufficiently

Some areas extended, e.g., treatment of reliability

Impact of work on non-behavioral issues limited by client demands (e.g., standards)

Behavioral issues have been largely ignored


7

Why worry about behavioral issues?

Research shows a major threat to validity: bias of .50-.75 SD

Bias is inconsistent in size across schools

Little is known about the distribution of the bias

Cannot evaluate overall improvement

Kids are left behind, despite illusion of progress

Cannot evaluate relative improvement

to identify schools for reward, corrective action, emulation
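
A small sketch of why inconsistent bias undermines relative comparisons: if inflation differs across schools, ranking schools by observed gain can reverse the ranking by true gain. The three schools and all numbers below are hypothetical, chosen only to make the point visible.

```python
# Hypothetical true gains and score inflation, in SD units (illustrative only).
schools = {
    "A": {"true_gain": 0.30, "inflation": 0.00},
    "B": {"true_gain": 0.10, "inflation": 0.60},
    "C": {"true_gain": 0.20, "inflation": 0.25},
}


def observed_gain(name: str) -> float:
    """Observed gain is true gain plus whatever inflation the school has."""
    return schools[name]["true_gain"] + schools[name]["inflation"]


def rank_by(key_fn) -> list:
    """Return school names ordered from largest to smallest value."""
    return sorted(schools, key=key_fn, reverse=True)


true_rank = rank_by(lambda name: schools[name]["true_gain"])
observed_rank = rank_by(observed_gain)

print("Ranked by true gain:    ", true_rank)
print("Ranked by observed gain:", observed_rank)
```

In this made-up case the school with the smallest true gain appears to be the top performer, so it would be the one singled out for reward or emulation.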


8

One example of the threat to validity: grade 4 KIRIS reading

                         KIRIS     NAEP
Gain in scale scores     18.8      -1
Standardized gain        0.76      -0.03

Source: Hambleton et al., 1995
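
The arithmetic behind the KIRIS and NAEP figures above can be sketched as follows. A standardized gain is a scale-score gain divided by a standard deviation. The SD itself is not given on the slide, so the value below is only what the two KIRIS figures imply. Reading the KIRIS-NAEP gap as a rough indicator of possible inflation is an interpretation and assumes NAEP serves as a reasonable audit of the same domain.

```python
def standardized_gain(scale_gain: float, sd: float) -> float:
    """Express a scale-score gain in standard deviation units."""
    return scale_gain / sd


# Figures reported on the slide.
kiris_scale_gain = 18.8
kiris_std_gain = 0.76
naep_std_gain = -0.03

# SD implied by the two KIRIS figures (derived here, not reported on the slide).
implied_sd = kiris_scale_gain / kiris_std_gain

# Gap between the two standardized gains, in SD units.
gap = kiris_std_gain - naep_std_gain

print(f"Implied KIRIS SD: {implied_sd:.1f} scale points")
print(f"KIRIS minus NAEP standardized gain: {gap:.2f} SD")
```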


9

How are scores inflated?

RHS gaming:

Obvious: exclude low-scoring kids

LHS gaming:

Reallocation

Coaching

Cheating


10

Two characteristics of tests that underlie LHS gaming

Tests are like polls: small samples of a larger domain

Even if well aligned, tests omit relevant content

Scores only matter if they represent the domain

Tests have recurrences

In details of content (included and excluded)

In forms of presentation

In scoring rubrics


11

Reallocation

Shifting instructional resources among substantive areas

Within subject

Between subjects

Results in reallocating achievement

Within subjects, can lead to either meaningful change or inflation


12

Coaching

Focuses on details of the test

Unimportant substantive details

Non-substantive details, such as item formats and scoring rubrics

Includes test-taking tricks (e.g., POE, plug-in)


13

Two ways that validity is undermined

Coaching and cheating: performance on measured elements is biased upward

Test-taking tricks

“Teaching to the rubric”

Focusing on details of presentation

Reallocation: Performance on individual elements is accurately measured but no longer represents domain

If deemphasized material matters for inference
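
The reallocation case can be sketched with a toy domain. Each element's performance is measured accurately, but because the test samples only some elements, the gain on the test overstates the gain on the domain. The six elements, the choice of tested elements, and all proportions below are hypothetical.

```python
# Hypothetical proportion correct on six elements of a domain, before and
# after instruction is reallocated toward the tested elements.
before = {"e1": 0.50, "e2": 0.50, "e3": 0.50, "e4": 0.50, "e5": 0.50, "e6": 0.50}
after = {"e1": 0.80, "e2": 0.80, "e3": 0.80, "e4": 0.40, "e5": 0.40, "e6": 0.40}

# The test samples only half of the domain (an assumption for illustration).
tested = ["e1", "e2", "e3"]


def mean(values) -> float:
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Gain as seen through the test: tested elements only.
test_gain = mean(after[e] for e in tested) - mean(before[e] for e in tested)

# Gain on the whole domain the test is meant to represent.
domain_gain = mean(after.values()) - mean(before.values())

print(f"Gain on tested elements: {test_gain:.2f}")
print(f"Gain on full domain:     {domain_gain:.2f}")
```

Here the test shows a gain three times the size of the domain gain, even though no individual element is mismeasured.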


14

Biased estimates of element-level performance (Princeton Review’s Cracking the MCAS)

Plugging in:

“Rather than doing a problem like this in your head or trying to solve it algebraically, the easiest and fastest way to solve it is to plug in a number for x.”

Process of elimination:

“Sometimes the best way to solve a problem is to figure out what the…wrong answers are and eliminate them….It’s often easier to identify the wrong answers than to find the correct one.”

Pythagorean theorem:

“Popular Pythagorean ratios include the 3:4:5 (and its multiples) and the 5:12:13 (and its multiples).”
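
To see how a trick like process of elimination can raise scores without any gain in the tested skill, consider the chance of guessing correctly on a multiple-choice item. The sketch below assumes the student guesses uniformly among the options that remain and eliminates only wrong answers; the four-option item is a hypothetical example.

```python
from fractions import Fraction


def chance_correct(n_options: int, n_eliminated: int) -> Fraction:
    """Probability of a correct uniform guess among the remaining options,
    assuming every eliminated option was a wrong answer."""
    if not 0 <= n_eliminated < n_options:
        raise ValueError("must leave at least one option and eliminate >= 0")
    return Fraction(1, n_options - n_eliminated)


# Hypothetical four-option item.
blind_guess = chance_correct(4, 0)        # no options ruled out
after_elimination = chance_correct(4, 2)  # two wrong answers ruled out

print("Blind guess:      ", blind_guess)
print("After elimination:", after_elimination)
```

Under these assumptions, ruling out two wrong answers doubles the expected score on the item even if the student still cannot solve the problem.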


15

Coaching or cheating?

The… review sheet…reads in part: “The average amount that each band member must raise is a function of the number of band members, b, with the rule f(b)=12000/b.”

The question on the actual test reads in part: “The average amount each cheerleader must pay is a function of the number of cheerleaders, n, with the rule f(n)=420/n.”

Source: Strauss, V., The Washington Post, July 10, 2001, p. A09


16

Homework

Download technical report for your state test

Find section on validity

Look for discussion of evidence relevant to these threats to validity


17

Why traditional validation is insufficient for TFA

Cross-sectional and generally correlational

Insensitive to changes in levels of performance

Assumes stability in relationships between tested and untested aspects of performance

Ignores omissions and recurrences in tests

Ignores behavioral responses to high-stakes testing


18

What needs to be done?

More research on TFA

Changes to the practice of measurement

Expanded approach to validation

New approaches to test design in response to issues of incentives and accountability

Possible changes to ‘operational’ procedures, such as linking


19

Additional research needed

More research on methods to disentangle inflation from meaningful gains

More research exploring the extent and distribution of inflation, e.g., across types of schools or students

More research exploring the variables shaping incentives in TFA, e.g.,

Characteristics of tests

Measures of performance employed

Rate of expected change

Evaluations of new designs for tests and accountability systems


20

Options for changes in test design

To better estimate true gain and to create better incentives (less incentive to narrow or coach)

1. Maximize breadth of coverage (matrix sample?)

2. Minimize unnecessary repetition, e.g., repetition of:

Details of content

Styles of presentation

Non-substantive task demands

3. Build in audit items to better estimate real gains
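
One way the audit-item idea might be used is sketched below: compare the gain on recurring items with the gain on audit items, and treat the difference as a rough estimate of inflation. This assumes the audit items cover the same domain at comparable difficulty and are not predictable from earlier forms. The item categories and all proportions are hypothetical.

```python
# Hypothetical proportion correct by item type in two successive years.
year1 = {"recurring": 0.55, "audit": 0.54}
year2 = {"recurring": 0.70, "audit": 0.58}


def gain(item_type: str) -> float:
    """Year-over-year change in proportion correct for one item type."""
    return year2[item_type] - year1[item_type]


recurring_gain = gain("recurring")
audit_gain = gain("audit")

# Rough inflation estimate: gain that does not generalize to audit items.
inflation_estimate = recurring_gain - audit_gain

print(f"Gain on recurring items: {recurring_gain:.2f}")
print(f"Gain on audit items:     {audit_gain:.2f}")
print(f"Estimated inflation:     {inflation_estimate:.2f}")
```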


21

Expanded approach to validation

Cannot stop with initial quality of tests and inferences

Must consider validity of inferences about gains after stakes have been imposed

Will require expanded and more routine auditing of gains

Should be treated as a core aspect of validity, e.g.,

In tech reports and texts
