
DOCUMENT RESUME

ED 110 494  TM 004 767

AUTHOR: Phillips, Donald L.; And Others
TITLE: Stability of Nominal Categories Over Readers, Over Time.
INSTITUTION: Education Commission of the States, Denver, Colo. National Assessment of Educational Progress.
PUB DATE: [Mar 75]
NOTE: 19p.; Paper presented at the Annual Meeting of the American Educational Research Association (Washington, D.C., March 30-April 3, 1975)
EDRS PRICE: MF-$0.76 HC-$1.58 PLUS POSTAGE
DESCRIPTORS: Analysis of Variance; Elementary Secondary Education; *Essay Tests; *Reliability; *Scoring; Test Bias; *Testing Problems; Test Results; Time

ABSTRACT
The consistency across time and readers of the scoring of National Assessment of Educational Progress (NAEP) open-ended exercises was examined. The procedure studied is a nominal categorical scoring. Ten readers independently read 28 sample responses to each of 12 open-ended exercises at three different times. All ten readers agreed on their assignment on about 75 percent of the sample responses. About 89 percent of the time a reader agreed on the category assignment from one reading to another. (Author)



STABILITY OF NOMINAL CATEGORIES
OVER READERS, OVER TIME

Donald L. Phillips
Nancy W. Burton
and
Alex M. Pearson

March, 1975

National Assessment of Educational Progress
The Education Commission of the States

Paper Presented at the American Educational Research Association 1975 Annual Meeting, Session no. 6.09


Stability of Nominal Categories over Readers, over Time

Donald L. Phillips
Nancy W. Burton
and
Alex M. Pearson

INTRODUCTION

National Assessment of Educational Progress (NAEP) is a census-like assessment project which collects data on a national probability sample. NAEP has collected data in ten different curriculum areas from four different age classes (9-year-olds, 13-year-olds, 17-year-olds, and young adults 26-35 years old). The major purpose of NAEP is to measure changes across time in performance on objectives-referenced exercises. Many of the exercises NAEP uses in its assessment process are open-ended and must be hand scored. NAEP's hand scoring does not generally consist of assigning responses to points on an ordinal scale. Instead responses are almost always assigned to nominal (descriptive) categories. These nominal categories are, however, classifiable as acceptable or unacceptable.

Because NAEP's objective is to measure changes over time in performance, it is important that hand scoring not depend heavily on the scorer or the time of the scoring. It is known that when essays are scored for quality on ordinal scales, the scores vary with the context in which the papers are read (Coffman, 1971). If these findings are true also for NAEP's data, measurement of change would require that all responses from all points in time be read in the same context and time. If NAEP can show that its semi-professional scoring is consistent across time and scorers, then perhaps change can be measured on these exercises without all responses being re-read each time a change measure is made.

METHODS

The study was designed to answer several questions:

1. To what extent does the score a response receives depend upon the scorer who scores it?

Thanks are due to Janet Bailey for her careful computations.



2. To what extent does the score depend on the time (within the two- to three-month scoring session) when the response is scored?

For this study sample responses were selected from the actual response data from the Writing and Career and Occupational Development (COD) assessments done in 1973-1974. Three exercises from COD and two exercises from Writing were selected at ages 9, 13, and 17. Three exercises from COD were selected for adults. (See Table 1.) For each exercise one sample response was selected arbitrarily from each of 28 administration units spread throughout the country. Each sample response was assigned a number (1-28), and photocopies of the samples were made.

Table 1.

Age     Content Area   Exercise Number   (NAEP Number)   Parts Analyzed
9       Writing        1                 (0-201002)      5
                       2                 (0-201012)      1
        COD            3                 (2-301034)      2
                       4                 (2-302015)      6
                       5                 (2-402002)      3
13      Writing        6                 (0-201018)      3
                       7                 (0-202007)      1?
        COD            8                 (2-102025)      3
                       9                 (2-302015)      6
                       10                (2-306012)      3
17      Writing        11                (0-201018)      3
                       12                (0-301008)      4
        COD            13                (2-102025)      3
                       14                (2-306006)      5
                       15                (2-306012)      3
Adult   COD            16                (2-302005)      10
                       17                (2-306009)      12
                       18                (2-306012)      3

? Entry not fully legible in this reproduction.


The readers were members of the professional scoring department at Measurement Research Center in Iowa. All of these scorers had at least bachelor's degrees; most had teaching experience. The experimental papers were scored as a part of the normal COD and Writing scoring, which involved approximately 100,000 student responses per age. All scorers were trained together on the use of up to 40 different scoring guides before the scoring for each age began. The scoring guides consist of a descriptive title for each category, illustrated by up to twenty sample responses.

After scoring normal assessment responses for two to three weeks, each scorer was given in random order sets of photocopies of each sample response. Each scorer independently read each response and recorded the score on a separate sheet. The sets of sample responses were then collected and given new random orders. The sets were presented again to the scorers for rescoring when about one half of all of the data for an age class had been scored, and again immediately after the scoring for the age class was completed. The scoring for each age took from two to three months to complete.


ANALYSES

The major analysis problem was that most exercises had nominal scoring categories. Many conventional summary statistics, however, require ordinal data. We finally defined several percent-of-agreement summaries that made sense to us and were based on the raw data.

The first answers the question:

(1) What is the probability that all of the scorers, or all scorers but one, will agree on a random paper?

A second type of percent of agreement focuses not on the agreement among scorers, but on the agreement within scorers over the three scorings. It answers the question:

(2) What is the probability that a random scorer will assign the same category at least twice, or all three times?*

*For"age 9 the sample responses were only scored twice, sincethe entire scoring session took only six weeks.


Further analysis required transformation of the data to an ordinal scale. Since the scorers are believed to be competent judges, the score assigned by most scorers to a given response was assumed to be the true score for that response. The most common score variable was defined as presence or absence of this true score. (The most common score variable is denoted by MCS.)

For the MCS (most common score) variable, one further percentage was computed. It was a more general percent of agreement than the percent of agreement on responses (1) or the percent of agreement over time (2), defined above. This overall percent of agreement answers the question:

(3) What is the probability that a random scorer, on a random response, at any one of three times, will assign the true category?

All three percents of agreement are discussed below and presented in Attachments 1-4.
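To make the three definitions concrete, here is a minimal computational sketch. It assumes a hypothetical array layout (scores[s, t, r] holding the nominal category scorer s assigned to response r at reading t) and is not NAEP's original program; in particular, computing among-scorer agreement and the most common score from the first reading is an illustrative choice.

```python
import numpy as np

def mode_count(values):
    """Return (most common value, its count) among nominal category codes."""
    vals, counts = np.unique(values, return_counts=True)
    return vals[np.argmax(counts)], counts.max()

def agreement_summaries(scores):
    # scores[s, t, r]: category assigned by scorer s at reading t to response r
    n_scorers, n_times, n_resp = scores.shape

    # (1) Among-scorer agreement: all scorers, or all scorers but one,
    # assign the same category to a response (here, at the first reading).
    tops = [mode_count(scores[:, 0, r])[1] for r in range(n_resp)]
    p_all = np.mean([c == n_scorers for c in tops])
    p_all_but_one = np.mean([c >= n_scorers - 1 for c in tops])

    # (2) Within-scorer agreement over readings: the same category is
    # assigned at least twice, or at every reading.
    reps = [mode_count(scores[s, :, r])[1]
            for s in range(n_scorers) for r in range(n_resp)]
    p_at_least_twice = np.mean([c >= 2 for c in reps])
    p_every_time = np.mean([c == n_times for c in reps])

    # (3) Overall agreement with the most common score (MCS): the category
    # assigned by most scorers is taken as the true score for the response.
    mcs = np.array([mode_count(scores[:, 0, r])[0] for r in range(n_resp)])
    p_overall = (scores == mcs[None, None, :]).mean()

    return p_all, p_all_but_one, p_at_least_twice, p_every_time, p_overall

# Example: 10 scorers, 3 readings, 28 responses, 4 nominal category codes.
rng = np.random.default_rng(0)
scores = rng.choice(4, size=(10, 3, 28))
print(agreement_summaries(scores))
```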

The MCS variable was also used to compute a repeated-measures analysis of variance, with responses and scorers as random factors, and times and (where applicable) exercise parts* as fixed factors. These Responses x Scorers x Times x Parts analyses were meant to answer the questions:

Do different scorers vary in their ability to assign true scores?

Does time affect scorers' ability to assign true scores?

Does the exercise part* affect that ability?

Does the specific response affect that ability?

*Exercises have parts for several reasons. Parts may be two aspects of the same task (such as scores for spelling and punctuation), or they may be two attempts at the same task (as when respondents are asked to give two reasons for something).
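As a rough modern illustration of this design (not the original BMD 08V analysis), the MCS indicator can be laid out in long format and fed to a repeated-measures ANOVA. The sketch below omits the parts factor for brevity, and statsmodels' AnovaRM treats scorers and times as fixed within-factors, whereas the study treated scorers as random and used quasi-F ratios, so the resulting F-tests are only approximations.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
n_resp, n_scorers, n_times = 28, 10, 3   # dimensions used in the study

# Long format: one row per (response, scorer, time) with the binary MCS
# indicator (1 = this scorer assigned the most common, "true" category).
rows = [{"response": r, "scorer": f"s{s}", "time": f"t{t}",
         "mcs": float(rng.random() < 0.9)}
        for r in range(n_resp)
        for s in range(n_scorers)
        for t in range(n_times)]
df = pd.DataFrame(rows)

# Responses play the role of subjects; scorers and times are within factors.
print(AnovaRM(df, depvar="mcs", subject="response",
              within=["scorer", "time"]).fit())
```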

A second ordinal variable was created by collapsing the data into acceptable and unacceptable categories. The A/U (acceptable/unacceptable) variable was also used in the Respondents x Scorers x Times x Parts analysis of variance design, but the questions to be answered differed. One would expect both respondent and part scores to vary in acceptability--to vary, that is, in difficulty. The questions to be answered by this analysis are:

Do scorers differ in their assignment of acceptable scores?

Does time of scoring affect the assignment of acceptable scores?

Or, in other words, do either scorers or time affect the difficulty of an open-ended measure?

The analyses of variance for both MCS and A/U are presented in Attachments 5-8 below.

One final analysis of the A/U variable was made, based on the analysis of variance data. This analysis is related to the intra-class correlation or Cronbach's alpha. It differs, however, in that it is based on a multi-factor design. It involves estimating the generalizability of the scoring from a ratio of relevant components of variance.* (For a general discussion of the technique, see Stanley, 1971.)

Specifically, the among-respondents component of variance is taken as an estimate of variance in the population of the ability to perform, or not perform, the exercise. That is, it is taken as an estimate of the variance of the population true score. That variance component is divided by the sum of the variance components judged to be relevant to the actual (as opposed to the experimental) scoring situation. The resulting ratio can be interpreted as a reliability estimate: a ratio of true variance to total variance.

Those components of variance determined to be relevant were:

(1) among persons: the estimate of true variance.

(2) among times: since a normal hand scoring takes two to three months to complete.**

(3) among scorers: since NAEP data are based on sums of items scored by different scorers.

(4) all interactions of the above factors.

*Expected Mean Squares were constructed by the BMD 08V analysis of variance program (Dixon, 1973).

**See Glass and Hakstian (1968) for a critical discussion of including fixed effects in an analysis of this type. We felt justified, since our selection of times would result in a maximum variance due to time, and thus create a conservative estimate of reliability.

Note that the variance due to parts was omitted. Variance among parts might be construed as part of the true variance. It was nevertheless not included, because parts were a fixed factor (see note above) and thus might seriously bias the ratio. These variance ratios are presented in Attachments 1-4.
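Putting the pieces together, the coefficient is simply the among-respondents component divided by the sum of the relevant components. A minimal sketch follows, with hypothetical component values (the study's actual estimates came from the BMD 08V expected mean squares):

```python
def generalizability(var_responses, var_times, var_scorers, var_interactions):
    """Ratio of true (among-response) variance to the total variance judged
    relevant to the operational scoring situation."""
    return var_responses / (var_responses + var_times
                            + var_scorers + var_interactions)

# Hypothetical component estimates chosen only to illustrate the arithmetic.
print(generalizability(var_responses=0.160, var_times=0.002,
                       var_scorers=0.004, var_interactions=0.034))  # 0.80
```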

RESULTS AND DISCUSSION

Percent Agreement

The data were initially analyzed by calculating, for each exercise, the percentage of the sample responses on which all scorers agreed upon the category assignments. These percentages showed considerable variation across exercises, ranging in value from 58.4% to 93.8%, with the mean percentage being 76.5%. The overall mean for agreement of all but one scorer was 86.3%.

Next the categories assigned to the responses by each reader across all readings were compared. The average percent of reader agreement was calculated for each exercise. These percentages varied from 82.6% to 98.7%, with a mean of 90.4%.

The overall agreement, based on presence or absence of the most common score, ranged from 88.8% to 99.5%, with an average of 94.1%.

All the percentages seem to be high enough to indicate that NAEP hand scoring is not heavily dependent upon the scorer. The percentages are displayed in Attachments 1-4.

There is a slight advantage for COD over Writing exercises on all three indices. Since the average advantage across ages 9, 13 and 17 is no greater than 5% on any index, we conclude that the scoring is essentially equivalent for both subjects. Indeed, since the COD exercises studied cover topics appropriate to mathematics and citizenship as well as career education, we are tempted to generalize to all NAEP scoring (except art, literature, and music!).


Analysis of Variance - MCS

If NAEP's scoring procedure were perfectly generalizable over scorers, times, respondents, and different exercise parts, there would be no variation at all in assignment of the most common score. We would, therefore, prefer to find no significant analysis of variance effects. The worst possible result would be to find large effects for scorers or times. That would mean that the baseline NAEP data would have to be rescored every time NAEP wanted to measure change or a state or local assessment wanted to compare its results with the NAEP baseline. Even if the expense of such re-scoring were tolerable, it would be extremely inelegant for NAEP's baseline, criterion results to change for every different comparison.

Inspection of Attachments 5-8, "Probability Levels Associated with the F-Ratios for the Analysis of Variance," for the MCS results shows two strong and consistent effects. These are the effects for responses and for the responses by exercise parts interaction. It is not surprising--though regrettable--that the consistency of the scoring depends on how people respond and what they are responding to.

There do not appear to be any consistent effects for times or for scorers, but the scorers by times interactions appear more often than one would like. To evaluate the importance of these effects, components of variance were estimated and, from these, the percentage of total variance was calculated for each effect. This analysis showed that approximately 17% of the variance (over all exercises) could be attributed to responses and parts combined, and less than 1% could be attributed to scorers and times combined. Thus the effects, even for responses and parts, are small, though statistically stable.
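The bookkeeping behind that statement can be sketched as follows, with hypothetical component estimates chosen only to mirror the magnitudes reported here: each effect's share of total variance is its estimated component divided by the sum over all components.

```python
# Hypothetical variance-component estimates for one exercise (illustration
# only, not the study's actual values).
components = {"responses": 0.0090, "parts": 0.0020,
              "responses x parts": 0.0060, "scorers": 0.0005,
              "times": 0.0003, "residual": 0.0822}
total = sum(components.values())
for effect, var in components.items():
    print(f"{effect:>18s}: {100 * var / total:5.1f}% of total variance")
```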

Analysis of Variance A/U .:.

In contrast to the MCS analysis, one would expect largevariations among responses and parts for the acceptable/unacceptable variable. In fact, the variance among responses

\tm

is--as mentioned above -an estimate of the variance amongrue scores in the sample. However, as with the MCS, varianceong scorers or times is a strong blow at the generalizability

ofthe scoring procedure.

\\

The results of the U/ALanalyses of variance are summarized in

Attaohtents 5-8. Again, the only strong and consistent effectsare\for responses and the responses by parts interaction.

0 TheSe two effects- combined account for over 73% of the varianceacross all exercises. In contrast, the two effects account


for only 17% of the variance in the MCS variable, above. Thus, for the A/U variable, the effect of responses and parts is both stable and large.

The effects for scorers and times are negligible.

Components of Variance Estimates of Generalizability

Components of variance for the A/U variable were also used to compute an estimate of the ratio of true variance to total variance in the ability to perform the exercise acceptably. The results are displayed in column (4) of Attachments 1-4. Note that this coefficient is affected by the lack of variance among responses on very easy (or very difficult) exercises. In particular, exercises 9 and 18--which were answered correctly by 99% of respondents--have a coefficient of less than .35 simply because there was almost no variation among responses. Exercises 8, 13, and 15 also were answered correctly by more than 90% of the respondents.

The median percentage of variance accounted for was .80. This is a conservative estimate of the generalizability of NAEP scoring, since it is based on single exercises, answered by only 28 respondents; since variations due to scorers and times are included in the error variance; and since 5 of the 18 exercises included were extremely easy. While the question always remains of how reliable is "reliable enough," the present investigators were extremely pleased with a median coefficient of .80. We feel that good evidence now exists that NAEP scoring procedures will generalize to other times--for change measures--and other scorers--for local comparisons.


Attachment 1: Percents of Agreements -- Age 9

                       ---(1) Responses---          (2) Times   (3)       (4) Ratio of Variance Components
Subject    Exercise    All 10     At Least 9        Both        Overall   (Generalizability of Acceptable/
                       Scorers    Scorers           Readings              Unacceptable Scoring)

Writing    1           65.7       77.2              87.0        88.8      .65
Writing    2           67.2       ?5.0              92.4        89.3      .82
COD        3           93.8       96.4              98.7        98.4      .99
COD        4           74.2       ?6.4              90.1        92.6      .90
COD        5           58.4       ?8.0              85.7        89.0      .49

Note: At age 9 the sample responses were scored only twice, so the Times column reports agreement at both readings. A leading "?" marks a digit that is not legible in this reproduction.


Attachment 2: Percents of Agreements -- Age 13

                       ----------(1) Responses----------        ----(2) Times----        (3)       (4) Ratio of Variance Components
Subject    Exercise    All 10     At Least 9    At Least 8      All 3    At Least 2      Overall   (Generalizability of Acceptable/
                       Scorers    Scorers       Scorers         Times    Times                     Unacceptable Scoring)

Writing    6           64.3       80.1          85.3            82.9     98.2            91.3      .91
Writing    7           71.4       82.4          86.9            86.8     98.9            92.6      .80
COD        8           68.6       79.4          85.7            82.6     97.9            92.6      .62
COD        9           91.5       92.6          98.6            98.7     100.0           99.5      .33
COD        10          72.6       84.5          89.3            86.8     98.9            94.0      .85


Attachment 3: Percents of Agreements -- Age 17

                       ---(1) Responses---         ----(2) Times----        (3)       (4) Ratio of Variance Components
Subject    Exercise    All 10     At Least 9       All 3    At Least 2      Overall   (Generalizability of Acceptable/
                       Scorers    Scorers          Times    Times                     Unacceptable Scoring)

Writing    11          88.4       93.8             86.9     99.5            97.3      .83
Writing    12          76.8       90.5             92.9     100.0           95.5      .75
COD        13          72.6       84.1             86.9     99.5            93.1      .61
COD        14          72.9       84.3             89.4     99.0            93.2      .87
COD        15          92.1       95.6             96.7     100.0           98.2      .65

Note: The reproduction of this table appears to be incomplete; the original header also lists an "at least 8 scorers" column for which no values are legible.


Attachment 4: Percents of Agreements -- Adults

                       ----------(1) Responses----------        ----(2) Times----        (3)       (4) Ratio of Variance Components
Subject    Exercise    All 10     At Least 9    At Least 8      All 3    At Least 2      Overall   (Generalizability of Acceptable/
                       Scorers    Scorers       Scorers         Times    Times                     Unacceptable Scoring)

COD        16*         --         93.0          97.2            97.5     99.9            98.4      .93
COD        17          62.8       82.3          91.7            88.0     99.6            92.9      .80
COD        18          90.1       94.5          94.5            96.6     99.6            96.9      .29

*For exercise 16 only 9 scorers scored the sample responses.


ATTACHMENT 5

Probability Levels Associated with the F-Ratios for the Age 9 Analysis of Variance

[In the original, this table lists, for each of exercises 1-5 (see Table 1), the probability levels associated with the F-ratios under the Most Common Score variable and under the Acceptable/Unacceptable variable, for the effects R (Responses), T (Times), S (Scorers), P (Parts) and their interactions (R x T, R x S, T x S, R x P, T x P, S x P, R x T x S, R x T x P, R x S x P, T x S x P, R x T x S x P). The individual probability entries are not reliably legible in this reproduction.]

* Indicates a probability level of greater than .25.
** The F-statistic computed for these effects was F' (see Winer, 1971).


ATTACHMENT 6

Probability Levels Associated with the F-Ratios for the Age 13 Analysis of Variance

[In the original, this table lists, for each of exercises 6-10 (see Table 1), the probability levels associated with the F-ratios under the Most Common Score variable and under the Acceptable/Unacceptable variable, for the effects R (Responses), T (Times), S (Scorers), P (Parts) and their interactions. The individual probability entries are not reliably legible in this reproduction.]

* Indicates a probability level of greater than .25.
** The F-statistic computed for these effects was F' (see Winer, 1971).


ATTACHMENT 7

Probability Levels Associated with the F-Ratios for the Age 17 Analysis of Variance

[In the original, this table lists, for each of exercises 11-15 (see Table 1), the probability levels associated with the F-ratios under the Most Common Score variable and under the Acceptable/Unacceptable variable, for the effects R (Responses), T (Times), S (Scorers), P (Parts) and their interactions. The individual probability entries are not reliably legible in this reproduction.]

* Indicates a probability level of greater than .25.
** The F-statistic computed for these effects was F' (see Winer, 1971).


ATTACHMENT 8

Probability Levels Associated with the F-Ratios for the Adult Analysis of Variance

[In the original, this table lists, for each of exercises 16-18 (see Table 1), the probability levels associated with the F-ratios under the Most Common Score variable and under the Acceptable/Unacceptable variable, for the effects R (Responses), T (Times), S (Scorers), P (Parts) and their interactions. The individual probability entries are not reliably legible in this reproduction.]

* Indicates a probability level of greater than .25.
** The F-statistic computed for these effects was F' (see Winer, 1971).


References

Coffman, W.E. "Essay Examinations." In R.L. Thorndike, ed., Educational Measurement (2nd Ed.). Washington, D.C.: American Council on Education, 1971.

Dixon, W.J. Biomedical Computer Programs. Berkeley and Los Angeles, California: University of California Press, 1973, p. 699.

Glass, G.V. and Hakstian, A.R. "Measures of Association in Comparative Experiments: Their Development and Interpretation." Research Paper No. 14. Boulder, Colo.: Laboratory of Educational Research, University of Colorado, 1968.

Stanley, J.C. "Reliability." In R.L. Thorndike, ed., Educational Measurement (2nd Ed.). Washington, D.C.: American Council on Education, 1971.

Winer, B.J. Statistical Principles in Experimental Design (2nd Ed.). New York: McGraw-Hill, 1971.
