Searching for DIF, Drift, and Growth Among the Deck Chairs -or- Why we can’t make sense of measures of educational achievement Presentation at the 2008 Conference of The Fordham Council on Applied Psychometrics June 26, 2008 By Eliot R. Long Data Research Services Brooklyn, NY 11231 Copyright 2008 by Eliot R. Long. All rights reserved.


Searching for DIF, Drift, and Growth Among the Deck Chairs

-or-

Why we can’t make sense of measures of educational achievement

Presentation at the 2008 Conference of The Fordham Council on Applied Psychometrics
June 26, 2008

By Eliot R. Long
Data Research Services
Brooklyn, NY 11231

Copyright 2008 by Eliot R. Long. All rights reserved.


Finding Meaning In the Difference Between Two Test Scores Exhibit 2.

No Child Left Behind requires a reliable measurement of change through the comparison of test scores.

Yet, schools experience erratic, inexplicable variations in measures of achievement gains.

“This volatility results in some schools being recognized as outstanding and other schools identified as in need of improvement simply as the result of random fluctuations.”

Robert L. Linn and Carolyn Haug (Spring 2002). Stability of school-building accountability scores and gains. Educational Evaluation and Policy Analysis, 24(1), 29-36.


Guessing on tests Exhibit 3.

Findings from 90+ years of research on guessing
- Guessing affects few test items
- Modest effect on test score reliability
- Encouraged guessing may improve fairness

Textbook recommendations
- Encourage test-takers to guess, if necessary, to answer all questions
  e.g. Mehrens, W. A. & Lehmann, I. J. (1991), Measurement and Evaluation in Education and Psychology, 464-465.

School practice
- Informal policy (not written) to encourage guessing: “If it’s blank, it’s wrong”

Are these research findings appropriate for current, high stakes testing?

Is the informal practice of encouraged guessing consistent with standardized testing?


A Norms Review Exhibit 4.

The following exhibits are based on four separate research projects, each including the development of group response pattern norms:

- Classroom groups, grades 3-7 in a northeast urban school district
  15,825 classrooms, 391,078 students

- School groups, grade 3 statewide in Midwest
  2,317 schools, 140,203 students

- Nationwide sample, grade 4
  A test section of the 2002 NAEP Reading
  36,314 students

- Job applicant groups across the U.S.
  87 employers, 447 employer groups, 32,458 job applicants


Percent Correct & Test Completion Exhibit 5.

Teacher Administered Tests (“If it’s blank, it’s wrong.”)

                                                      Pct.       Pct. Attp.
Test-Takers                                           Correct    All Questions
Northeast Urban School District – Reading Tests, 1999-2001
  Grade 3                                             68.6%      97.4%
  Grade 4                                             74.7%      96.7%
  Grade 5                                             65.5%      94.0%
  Grade 6                                             67.4%      93.1%
  Grade 7                                             71.0%      96.4%
Midwest Statewide – Math Test, 2001
  Grade 3                                             63.5%      97.4%

Non-teacher Administered Tests (No encouraged guessing)

                                                      Pct.       Pct. Attp.
Test-Takers                                           Correct    All Questions
Independent Proctor Administered – NAEP Reading 2002
  Grade 4                                             67.6%      60.9%
Employer Administered, 1996-1999
  Verbal Skills, Job Applicants                       82.0%      44.0%
  Quantitative Skills, Job Applicants                 75.2%      28.2%


Test Completion: A Teacher/Proctor Effect Exhibit 6.

Answers left blank are concentrated by classroom:
- 15.6% of all classrooms account for 77.6% of all answers left blank.
- 5.6% of all classrooms account for 48.0% of all answers left blank.

Grade 5 Reading: 45 items, 4-alternative multiple-choice

‘Low Blanks’ classes have fewer than 26 answers left blank; ‘High Blanks’ classes have 26 or more.

            ------- All Classes -------   - ‘Low Blanks’ -    ------------ ‘High Blanks’ -------------
Class                 Blanks    Pct. Attp.           Blanks              Blanks     Pct. of All  Pct. of All
Standing    Classes   Per Class All Ques.  Classes   Per Class  Classes  Per Class  Classes      Blanks
4th Q.      617       1.8       97.3%      613       1.6        4        34.3       0.6%         12.0%
3rd Q.      620       4.5       94.9%      599       3.2        21       43.3       3.4%         32.5%
2nd Q.      619       6.1       93.1%      580       3.7        39       42.0       6.3%         43.5%
1st Q.      619       10.4      90.1%      544       4.3        75       54.6       12.1%        63.8%
All         2,475     5.7       94.0%      2,336     3.1        139      48.8       5.6%         48.0%

                  All Classes   ‘Low Blanks’   ‘High Blanks’
Pct. Correct      65.5%         65.9%          59.3%
Pct. Attp. All    94.0%         95.1%          74.0%
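The concentration of blanks can be cross-checked from the 'All' row of the table. The following is a minimal Python sketch, not part of the original presentation; the variable names are illustrative, and because the inputs are the rounded published values the result agrees with the slide only to rounding.

```python
# Values copied from the 'All' row of the Exhibit 6 table (rounded as published).
all_classes = 2475
blanks_per_class_all = 5.7
high_classes = 139
blanks_per_class_high = 48.8

# Approximate totals of answers left blank, rebuilt from the per-class averages.
total_blanks = all_classes * blanks_per_class_all
high_blanks = high_classes * blanks_per_class_high

share_of_classes = high_classes / all_classes
share_of_blanks = high_blanks / total_blanks

print(f"High-blanks classes: {share_of_classes:.1%} of classes, "
      f"{share_of_blanks:.1%} of all blanks")
```

The two shares come out at roughly 5.6% of classrooms and roughly 48% of blanks, in line with the figures stated on the slide.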


Tale of Two Classes Exhibit 7.

Two classrooms at the same class average score, with and without encouraged guessing.

                      The Norm of Classroom
                      Test Administration      The Exception
                      (Low Blanks Class)       (High Blanks Class)
Students (n)          21                       21
Answers left blank    3                        199
Pct. Blank            0.3%                     21.1%
RS Avg.               19.4                     19.4
SD                    4.3                      7.9
KR-20                 .53                      .89

[Figure: Student Scores: Number Correct and Number Attempted, High Blanks Class, KR-20 = .89. X-axis: Grade 5 Reading - Number Correct Score, one tick per student: 9 11 11 13 14 14 15 15 16 16 16 16 19 23 23 24 24 25 30 33 41. Y-axis: Number Correct - Number Attempted (0 to 45). Series: Number Correct, Number Attempted, Forecast Number Attempted. Regression estimate (r = .679, n = 72): Number Attempted = 20.2 + (0.67*Number Correct)]

[Figure: Student Scores: Number Correct and Number Attempted, Low Blanks Class, KR-20 = .53. X-axis: Grade 5 Reading - Number Correct Score, one tick per student: 12 13 14 15 16 16 17 18 19 19 20 20 21 21 21 21 22 24 25 25 29. Y-axis: Number Correct - Number Attempted (0 to 45). Series: Number Correct, Number Attempted, Forecast Number Attempted. Regression estimate: Number Attempted = 20.2 + (0.67*Number Correct)]
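As a rough check on the two classes, the sketch below recomputes the class averages and blank rates in Python. It assumes that the tick labels on each chart's horizontal axis are the individual students' number correct scores, which is an interpretation of the slide rather than something it states; the function name `summarize` is illustrative. Under that assumption both classes average 19.4, while the spread of scores is much wider in the high-blanks class (the recomputed SDs agree with the slide only approximately).

```python
from statistics import mean, stdev

ITEMS = 45  # Grade 5 Reading test length

# Scores read from the chart axes (assumed to be one value per student).
low_blanks_scores = [12, 13, 14, 15, 16, 16, 17, 18, 19, 19, 20,
                     20, 21, 21, 21, 21, 22, 24, 25, 25, 29]
high_blanks_scores = [9, 11, 11, 13, 14, 14, 15, 15, 16, 16, 16,
                      16, 19, 23, 23, 24, 24, 25, 30, 33, 41]

def summarize(scores, blanks):
    """Return class mean, sample SD, and share of all answers left blank."""
    return mean(scores), stdev(scores), blanks / (len(scores) * ITEMS)

low = summarize(low_blanks_scores, blanks=3)
high = summarize(high_blanks_scores, blanks=199)

for label, (avg, sd, pct_blank) in (("Low blanks", low), ("High blanks", high)):
    print(f"{label}: mean={avg:.1f} SD={sd:.1f} blank={pct_blank:.1%}")
```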


Correlation: Number Correct - Number Attempted Exhibit 8.

                                  All Students             Students with Blanks >= 5
Teacher Administered
  Grade 5 Reading                 r = .153  n = 66,320     r = .527  n = 1,094
  Grade 5 Math                    r = .110  n = 69,413     r = .549  n = 238
  Grade 6 Reading                 r = .162  n = 62,524     r = .583  n = 658
  Grade 7 Reading                 r = .202  n = 58,915     r = .597  n = 1,416

Independent Test Administrator
  NAEP Grade 4 Reading            r = .608  n = 36,314

Employer Administered, Job Applicants
  Test of Verbal Skills           r = .717  n = 32,458
  Test of Quantitative Skills     r = .581  n = 31,629

Hovland and Wonderlic (1939), Adult workers & students
  Otis Test of Mental Ability, 4 test forms & 2 time limits
                                  r = .608 to .723  n = 125 to 2,274


Location of Answers Left Blank Exhibit 9.

Recommendations to encourage guessing presume that most answers left blank are imbedded; that is, they represent questions that are addressed and, for some reason, skipped.

Our norms reveal that most blanks are trailing; that is, they represent questions that are not reached during the time limit.

Position of Blanks               Imbedded   Trailing
Grade 5 Reading                  22.3%      77.7%
NAEP Grade 4 Reading             15.8%      84.2%
Job Applicant Verbal Skills      5.2%       94.8%

Teachers must significantly change students’ test work behavior to achieve answers to ‘not reached’ questions. How?
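The imbedded/trailing distinction can be made concrete with a small Python sketch. This is an illustration of the definitions above, not the procedure used to build the norms; the function name `count_blanks` and the sample answer sheet are hypothetical. A blank counts as trailing if no answered item follows it, and as imbedded otherwise.

```python
def count_blanks(answers):
    """Split blanks into imbedded (skipped) and trailing (not reached).

    `answers` is one test-taker's responses in item order, with None for a blank.
    """
    # Index of the last item that received an answer (-1 if none did).
    last_answered = max(
        (i for i, a in enumerate(answers) if a is not None), default=-1
    )
    # Blanks before the last answered item were passed over, so they are imbedded.
    imbedded = sum(1 for a in answers[:last_answered + 1] if a is None)
    # Everything after the last answered item is trailing.
    trailing = len(answers) - (last_answered + 1)
    return imbedded, trailing

# Hypothetical 10-item answer sheet: one skipped item, three not reached.
sheet = ["A", "C", None, "B", "D", "A", "C", None, None, None]
print(count_blanks(sheet))  # (1, 3)
```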


Test Score Reliability (KR-20) by Classroom Exhibit 10.

Teacher involvement in their students’ test work behavior to encourage guessing is entrepreneurial, often undermining test score reliability.

50+ Answers Left Blank: 42 classrooms at and below average, likely to have little encouragement to guess.

No Answers Left Blank: 330 classrooms at and below average, likely to have extensive encouragement to guess.

[Figure: Test Reliability (K-R 20) by Class Average Number Correct Score, classes with 50 or more answers left blank, n = 42. X-axis: Grade 5 Reading - Class Average Number Correct Score (15.6 to 28.7). Y-axis: K-R 20 - Test Reliability - Internal Consistency (0.00 to 1.00). Series: Forecast for Blanks >= 50, Observed for Blanks >= 50. Average K-R 20 = .82; RS forecast K-R 20: r = .013; constant .824, slope -.0003]

[Figure: Test Reliability (K-R 20) by Class Average Number Correct Score, classes with no answers left blank, n = 330. X-axis: Grade 5 Reading - Class Average Number Correct Score (15.6 to 28.7). Y-axis: K-R 20 - Test Reliability - Internal Consistency (0.00 to 1.00). Series: Forecast for Blanks = 0, Observed for Blanks = 0. Average K-R 20 = .75; RS forecast K-R 20: r = .339; constant .364, slope .0153]
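For readers unfamiliar with the statistic, the following Python sketch shows the standard Kuder-Richardson 20 computation for one classroom. It is a generic illustration, not the author's code: it assumes blank answers are scored 0, uses the population variance of total scores, and runs on a small hypothetical response matrix rather than the study data.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a list of per-student lists of 0/1 item scores.

    KR-20 = (k / (k - 1)) * (1 - sum(p_j * q_j) / variance of total scores),
    where p_j is the share of students answering item j correctly.
    """
    n_students = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    if k < 2 or var_total == 0:
        raise ValueError("KR-20 needs at least 2 items and some score variance")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_students
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical class of 5 students on a 4-item test (1 = correct, 0 = wrong or blank).
toy_class = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(toy_class), 3))  # 0.8
```

Answers that are correct by chance add item variance that is unrelated to the total score, which is one way random guessing can pull this coefficient down.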


The Volume of Teacher Encouraged Guessing Exhibit 11.

Parsing Grade 5 number correct scores (Grade 5 Reading: 45 items, 4 answer alternatives):

The traditional correction-for-guessing:

S = R – W/(n-1)

where
S = True Score
R = Number Right
W = Number Wrong
n = Number of Answer Choices

For the number correct score at the minimum for Basic (R = 18; RS 18 is the minimum scale score for ‘Basic’, just passing), with all 45 items attempted so that W = 45 – 18 = 27:

S = 18 – 27/(4-1) = 18 – 9 = 9

Result: Half of the number correct score is due to random guessing.
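The arithmetic on this slide can be reproduced with a few lines of Python. This is only a restatement of the traditional formula; the function name `corrected_score` is illustrative, and the number wrong assumes every one of the 45 items was attempted.

```python
def corrected_score(right, wrong, choices):
    """Traditional correction for guessing: S = R - W/(n-1)."""
    return right - wrong / (choices - 1)

items, choices = 45, 4
right = 18                 # minimum number correct for 'Basic'
wrong = items - right      # assumes every item was attempted, so W = 27
s = corrected_score(right, wrong, choices)
print(s, s / right)        # 9.0 0.5
```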


Success rate: A norms approach Exhibit 12.

The traditional correction-for-guessing formula assumes that 100% of skills-based answers are correct. A regression of median percent correct on number attempted for test-takers who leave 5+ answers blank finds a variable rate of success:

Regression of Median Pct. Correct on Number Attempted

Test-Takers          Number    Data Pts   R squared   Constant   Slope
Grade 5 Reading      1,449     7*         .699        0.321      0.0091
Grade 6 Reading      1,486     7*         .877        0.416      0.0065
Grade 7 Reading      1,269     7*         .703        0.468      0.0040
Job Applicants       15,650    25**       .905        0.465      0.0094

For job applicants: Percent Correct = 0.465 + 0.0094*As

where As represents the number of questions answered based on the test-taker’s skills.

*  Number attempted ranges: Up to 15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45
** Number attempted: 21 through 45
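To see what these fitted lines imply, the sketch below evaluates each regression at a few values of number attempted. It is a Python illustration using only the constants and slopes from the table; the dictionary `FITS` and the function `median_pct_correct` are hypothetical names, and values outside the fitted ranges noted in the footnotes would be extrapolations.

```python
# (constant, slope) pairs copied from the Exhibit 12 table.
FITS = {
    "Grade 5 Reading": (0.321, 0.0091),
    "Grade 6 Reading": (0.416, 0.0065),
    "Grade 7 Reading": (0.468, 0.0040),
    "Job Applicants": (0.465, 0.0094),
}

def median_pct_correct(group, attempted):
    """Fitted median proportion correct for a given number of items attempted."""
    constant, slope = FITS[group]
    return constant + slope * attempted

for group in FITS:
    fitted = [round(median_pct_correct(group, a), 3) for a in (25, 35, 45)]
    print(f"{group}: {fitted}")
```

In every group the fitted success rate rises with the number attempted, which is the point of contrast with the traditional formula's fixed 100% assumption.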


Add norms to the traditional formula = Empirical Approach Exhibit 13.

Traditional formula:

S = R – W/(n-1)
or
R = S + W/(n-1)
    (skills + guessing)

Empirical formula:

R = Pct. Correct*As + (At – As)/n
or
R = 0.0094*As^2 + 0.465*As + (At – As)/n
    (skills: first two terms; guessing: last term)

Note:
W = (At – As)*((n-1)/n)
At = Total attempts = 45
As = Skill based attempts

Solution: Substitute 45 for At and 18 for R, find As = 17.7

For a score of 18:

18 = (0.0094*17.7^2) + (0.465*17.7) + ((45-17.7)/4)
   = 2.945 + 8.23 + 6.825
18 = 11.175 + 6.825
     (skills + guessing)

Results:
- 39% (17.7/45) of answers are attempted based on skills
- 61% of answers are guessed due to teacher encouragement
- 38% of the observed score is based on encouraged random guessing
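The solution step amounts to solving a quadratic in As. A minimal Python sketch follows, assuming the job-applicant coefficients shown on the slide and taking the positive root; the function name `skill_attempts` and its parameter names are illustrative, not the author's.

```python
import math

def skill_attempts(right, attempts=45, choices=4, constant=0.465, slope=0.0094):
    """Solve R = slope*As^2 + constant*As + (At - As)/n for As (positive root)."""
    a = slope
    b = constant - 1 / choices      # As terms: constant*As - As/n
    c = attempts / choices - right  # constant terms: At/n - R
    disc = b * b - 4 * a * c
    return (-b + math.sqrt(disc)) / (2 * a)

a_s = skill_attempts(18)
skills = 0.0094 * a_s**2 + 0.465 * a_s   # correct answers attributed to skills
guessing = (45 - a_s) / 4                # correct answers attributed to guessing

print(f"As = {a_s:.1f}, skills = {skills:.1f}, guessing = {guessing:.1f}")
print(f"share of answers based on skills = {a_s / 45:.0%}")
print(f"share of score from guessing = {guessing / 18:.0%}")
```

With R = 18 and At = 45 this gives As of about 17.7, with about 39% of answers based on skills and about 38% of the score from guessing, matching the results listed on the slide.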


Observed and Estimated True Scores Exhibit 14.

Grade 5 Reading Test: Distribution of Observed and Estimated True Skills

Application of the ‘empirical’ parsing formula to the full distribution of Grade 5 scores*.

Student Distribution      Mean        SD
Observed                  29.1        7.8
Est. True                 26.4        9.4
Change                    +10.2%      -16.5%

Classroom Distribution    Avg. Mean   Avg. SD
Observed                  29.1        5.9
Est. True                 26.0        7.4
Change                    +11.6%      -19.9%

* Random guessing outcomes are forecast by the binomial distribution and moderated by the variation in the volume of guessing with student skill level. The actual percent guessed correct is lower than expected among lower observed scores and higher than expected among higher observed scores.

[Figure: Distribution of Number Correct Scores: Estimated Skills Based and Observed Number Correct Scores. X-axis: Grade 5 Reading Test - Number Correct Score (0 to 45). Y-axis: Frequency - Percent of All Students (0% to 5%). Series: Est. True Score Distribution, Observed Score Distribution]
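The footnote's binomial forecast can be illustrated for a single student. The Python sketch below covers only the plain binomial step, with a 1-in-4 chance per guess; it does not attempt the moderation by skill level that the footnote describes, and the choice of 27 guessed answers is simply a rounded version of the 27.3 guesses estimated for a score of 18.

```python
from math import comb, sqrt

def guess_pmf(guessed, p=0.25):
    """Binomial probabilities of getting 0..guessed random guesses correct."""
    return [comb(guessed, k) * p**k * (1 - p)**(guessed - k)
            for k in range(guessed + 1)]

guessed = 27                              # illustrative number of guessed answers
pmf = guess_pmf(guessed)
expected = guessed * 0.25                 # mean of the binomial
sd = sqrt(guessed * 0.25 * 0.75)          # standard deviation of the binomial

print(f"expected correct guesses = {expected:.2f}, SD = {sd:.2f}")
print(f"chance of 10 or more correct guesses = {sum(pmf[10:]):.1%}")
```

Even with the expected gain fixed near 6.75 points, the spread of about 2.25 points per student shows how much of an individual score can vary by chance alone.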


Volume of Encouraged Guessing by Performance Level Exhibit 15.

Contribution of Encouraged Guessing to Student Scores

Student Averages by Performance Level, Grade 5 Reading

                                                             Estimates for Random Guessing
                                     Average     Average     Pct. of     Pct. of
Student                   Pct. of    Number      Number      Number      All
Performance Level         Students   Correct     Attempted   Correct     Answers
Level 4 Advanced          6.2%       42.1        45.0        0.0%        0.0%
Level 3 Proficient        48.4%      34.1        44.9        5.3%        15.4%
Level 2 Basic             37.1%      23.6        44.6        19.0%       42.8%
Level 1 Below Basic       8.3%       14.4        44.1        44.1%       69.8%

All Students              100.0%     29.1        44.8        10.2%       26.7%


Gain in class average test score & Attenuation of class distribution due to guessing Exhibit 16.

Encouraged guessing adds disproportionately to lower performing students’ scores, compressing the distribution of scores with respect to the distribution of achievement.

Grade 5 Reading: Classroom Averages

Quartile Standing                                       Parsed for Guessing:
of Class            Number of   ---- Observed ----      Est. True Scores        Pct.       Pct.
Average Score       Classes     RS Avg.    SD Avg.      RS Avg.    SD Avg.      RS Gain    SD Atten.
4th Q.              619         35.7       4.8          34.3       6.0          4.0%       -19.4%
3rd Q.              620         30.6       6.2          27.9       7.7          9.5%       -20.1%
2nd Q.              619         27.0       6.5          23.5       8.1          15.0%      -20.2%
1st Q.              619         22.9       6.1          18.4       7.6          24.7%      -19.6%

All                 2,477       29.1       5.9          26.0       7.4          11.6%      -19.9%


Encouraged guessing creates a test score modulator Exhibit 17.

Changes in skill and guessing move in opposite directions, offsetting in the total score.

Comparison of First Test and Second Test Scores

                                ----- Test Answers -----
Test                Observed    Based on     Based on      Guessing
Administration      Total       Skills       Guessing      Contribution

1st Test Admin.
  Correct           18          11.2         6.8           37.8%
  Attempts          45          17.7         27.3          60.7%
  Pct. Correct      40.0%       63.3%        25.0%

2nd Test Admin.
  Correct           20          13.8         6.2           31.0%
  Attempts          45          20.1         24.9          55.3%
  Pct. Correct      44.4%       68.7%        25.0%

Gain                2           2.6          -0.6
Pct. Gain           11.1%       23.2%

52% of true gain masked by guessing.

[Figure: True vs. Observed Gains at min. score for passing. Y-axis: Percent Gain or Decline (-30% to 30%). Bars: Observed, True]


Estimated Gain Masked by Guessing Exhibit 18.

The ‘empirical’ formula may be applied to first and second tests at each score level.

Hypothetical Gains Parsed for Guessing Effects

              Number Correct        Pct.        Pct.         Pct.
Percentile    First     Second      Observed    Est. True    True Gain
Standing      Score     Score       Gain        Gain         Masked
90%           39.0      42.9        10.0%       10.9%        8.3%
80%           36.0      39.6        10.0%       13.3%        24.9%
70%           34.0      37.4        10.0%       13.2%        24.2%
60%           32.0      35.2        10.0%       13.8%        27.3%
50%           30.0      33.0        10.0%       13.4%        25.6%
40%           27.0      29.7        10.0%       15.2%        34.1%
30%           25.0      27.5        10.0%       15.8%        36.7%
20%           22.0      24.2        10.0%       16.5%        39.2%
10%           18.0      19.8        10.0%       20.6%        51.5%
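The last column appears to be the share of the estimated true gain that does not show up in the observed gain, that is, (true gain – observed gain) / true gain. That reading is an inference from the published numbers, not a formula stated on the slide; the Python sketch below checks it against the table, with `share_masked` as an illustrative name. Because the inputs are rounded, the recomputed values match the table only to within a few tenths of a point.

```python
def share_masked(observed_gain, true_gain):
    """Fraction of the estimated true gain not reflected in the observed gain."""
    return (true_gain - observed_gain) / true_gain

# (percentile, est. true gain %, published % masked); observed gain is 10% in every row.
rows = [
    (90, 10.9, 8.3), (80, 13.3, 24.9), (70, 13.2, 24.2),
    (60, 13.8, 27.3), (50, 13.4, 25.6), (40, 15.2, 34.1),
    (30, 15.8, 36.7), (20, 16.5, 39.2), (10, 20.6, 51.5),
]

for percentile, true_gain, published in rows:
    computed = share_masked(10.0, true_gain)
    print(f"{percentile}th percentile: computed {computed:.1%}, published {published}%")

# Same calculation for the Exhibit 17 example (11.1% observed, 23.2% estimated true gain).
print(f"Exhibit 17: {share_masked(11.1, 23.2):.0%}")
```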


Findings of a Norms Review Exhibit 19.

The informal practice of teacher encouraged guessing to complete all test answers has the following effects:

1. High volume of non-skills based test answers

The volume of test answers that result from teacher encouragement is very high: 26% of all answers for students at the school district average and 50% or more among students most at risk of failing.

2. Teacher involvement lowers test score reliability

Teacher involvement is unstructured, varying from classroom to classroom and from student to student, creating widely varying and generally lower test score reliability.

3. Guessed correct answers reduce the range of measurement

Added guessing increases among lower performing students, raising their scores more than higher performing students and therefore narrowing the range of measurement by ~20%.


Findings of a Norms Review Continued Exhibit 19 cont.

4. Guessing creates a test score modulator

Changes in student achievement will cause changes in the volume of guessing, in the opposite, offsetting direction, modulating observed scores. This modulating effect masks variations in gain, by as much as 50% or more among low performing students.

Teacher encouraged guessing narrows the window onto student achievement gains, while reducing both the range and reliability of the measurement that can be observed. As a consequence, non-skills related variation may predominate, misdirecting test score interpretation and education policy.