degrees of incorrectness in computer adaptive language testing tsopanoglou, a., ypsilandis, g.s.,...

DEGREES OF INCORRECTNESS IN

COMPUTER ADAPTIVE LANGUAGE TESTING

Tsopanoglou, A., Ypsilandis, G.S., Mouti, A.

Dept of Italian Language and Literature

Aristotle University of Thessaloniki

This Talk

Multiple-Choice and CAT

The Study

The Results

The Proposal

CAT from OUP and Univ of Cambridge

Multiple-Choice: Dichotomous

STEM - QUESTION

CORRECT ANSWER

1ST DISTRACTOR

2ND DISTRACTOR

3RD DISTRACTOR

Multiple-Choice, e.g.

Who was the Prime Minister of the UK in the year 1988?

CORRECT ANSWER ------ Margaret Thatcher

1ST DISTRACTOR ------ John Major

2ND DISTRACTOR ------ Elton John

3RD DISTRACTOR ------ Liverpool FC

Multiple-Choice Polychotomous Pattern

STEM - QUESTION

CORRECT ANSWER

1ST DISTRACTOR ------ VERY LIKABLE / VERY SUITABLE

2ND DISTRACTOR ------ LIKABLE / SUITABLE

3RD DISTRACTOR ------ IRRELEVANT / TOTALLY WRONG

The Procedure

Collection of MC questions from a CAT

Completion of test on paper by Test Takers

Test Correction in Traditional Mode

Test Correction in Experimental Mode

Listing of Test Items by Exp. Examiner

Analysis of Results

Conclusion

Rating of Test Items by Expert Examiner (Native)

CLEAR - DICHOTOMOUS

• Three Wrong – 1 Correct = 54

• All Wrong = 6

• 1 Correct – 1/2 Likable = 5

• TOTAL = 65 (81%)

PATTERNED - POLYCHOTOMOUS

• 2 Correct rest Wrong = 7

• Entire Experimental Pattern = 1

• 1 Correct – 1/2 Very Likable = 7

• TOTAL = 15 (19%)

Comparing Scoring Procedures: 1. Traditional Mode, 2. Experimental Mode, and 3. Negative Scoring

Subject         1    2    3    4    5    6    7    8    9
Traditional    86%  68%  79%  75%  36%  49%  63%  46%  55%
Experimental   88%  69%  81%  76%  40%  51%  66%  49%  55%
Negative       84%  56%  74%  66%  12%  28%  36%  26%  33%

Subject        10   11   12   13   14   15   16   17   18
Traditional    72%  51%  56%  91%  35%  51%  51%  83%  53%
Experimental   76%  56%  59%  91%  43%  55%  54%  84%  57%
Negative       62%  40%  45%  88%  19%  35%  34%  76%  40%

Tr-Exp Pearson r = 0.995488

Tr-Neg Pearson r = 0.979421

Exp-Neg Pearson r = 0.985956
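The correlations above can be approximated from the percentage scores shown on the comparison slide. The sketch below is a minimal Python illustration, assuming the slide's values are read as three rows (traditional, experimental, negative) for 18 subjects in slide order; the list names and the `pearson_r` helper are not from the talk. Because the slide shows rounded percentages, the results will only be close to the reported coefficients, not identical.

```python
from math import sqrt

# Percentage scores per subject, transcribed from the slide.
# Assumption: row order is traditional, experimental, negative scoring.
traditional = [86, 68, 79, 75, 36, 49, 63, 46, 55, 72, 51, 56, 91, 35, 51, 51, 83, 53]
experimental = [88, 69, 81, 76, 40, 51, 66, 49, 55, 76, 56, 59, 91, 43, 55, 54, 84, 57]
negative = [84, 56, 74, 66, 12, 28, 36, 26, 33, 62, 40, 45, 88, 19, 35, 34, 76, 40]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_tr_exp = pearson_r(traditional, experimental)
r_tr_neg = pearson_r(traditional, negative)
r_exp_neg = pearson_r(experimental, negative)

print(f"Tr-Exp  r = {r_tr_exp:.4f}")
print(f"Tr-Neg  r = {r_tr_neg:.4f}")
print(f"Exp-Neg r = {r_exp_neg:.4f}")
```

All three scoring modes rank the subjects very similarly, which is why the interesting differences appear only in the borderline cases discussed next.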

Qualitative Analysis

• 5.5% of subjects: 48%-49% with the Tr. Method, >=50% with the Exp. Method

• 16.6% of subjects: 51% with the Tr. Method, secure with the Exp. Method

• 89% of subjects: higher score with the Exp. Method

• 11% of subjects: same score with both methods
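These shares can be re-derived from the traditional and experimental percentages on the comparison slide. The following Python sketch assumes 18 subjects in slide order and treats "secure" as a 51% traditional score that rises under the experimental method; both the variable names and that reading of "secure" are assumptions, not the authors' definitions. The slide appears to truncate 1/18 and 3/18 to 5.5% and 16.6%, so rounded output may differ in the last digit.

```python
# Percentage scores per subject, transcribed from the slide (assumed order).
traditional = [86, 68, 79, 75, 36, 49, 63, 46, 55, 72, 51, 56, 91, 35, 51, 51, 83, 53]
experimental = [88, 69, 81, 76, 40, 51, 66, 49, 55, 76, 56, 59, 91, 43, 55, 54, 84, 57]

pairs = list(zip(traditional, experimental))
n = len(pairs)

# Subjects whose score rises or stays the same under the experimental method.
higher = sum(1 for t, e in pairs if e > t)
same = sum(1 for t, e in pairs if e == t)

# Borderline fails (48%-49%) that reach at least 50% experimentally.
lifted = sum(1 for t, e in pairs if 48 <= t <= 49 and e >= 50)

# Marginal passes (51%) that move further above the pass mark.
secured = sum(1 for t, e in pairs if t == 51 and e > t)


def share(count):
    """Share of all subjects, in percent, rounded to one decimal."""
    return round(100 * count / n, 1)


for label, count in [("higher", higher), ("same", same),
                     ("lifted to >=50%", lifted), ("secured from 51%", secured)]:
    print(f"{label}: {count} of {n} subjects ({share(count)}%)")
```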

More … Qualitative Analysis

• All subjects scored less when corrected with negative scoring

• 5 of the subjects would score considerably (>=20 percentage points) lower under negative scoring than under traditional or experimental scoring

• Of the 5 subjects in the above category, 5 scored <=55%
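The effect of negative scoring can be checked against the percentages on the comparison slide. This Python sketch compares traditional and negative scores, reading "considerably lower" as a drop of at least 20 percentage points; that threshold reading, the subject numbering (slide order), and the variable names are assumptions made for illustration.

```python
# Percentage scores per subject, transcribed from the slide (assumed order).
traditional = [86, 68, 79, 75, 36, 49, 63, 46, 55, 72, 51, 56, 91, 35, 51, 51, 83, 53]
negative = [84, 56, 74, 66, 12, 28, 36, 26, 33, 62, 40, 45, 88, 19, 35, 34, 76, 40]

# Drop in percentage points when moving from traditional to negative scoring.
drops = [t - n for t, n in zip(traditional, negative)]

# True if every subject scores lower under negative scoring.
all_lower = all(d > 0 for d in drops)

# 1-based subject numbers whose score falls by 20 points or more.
big_drop_subjects = [i + 1 for i, d in enumerate(drops) if d >= 20]

print("all subjects lower:", all_lower)
print("subjects with a drop of >= 20 points:", big_drop_subjects)
```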

Correlation of Answers of Experimental Pattern

Pearson r = -0.310272. This indicates a tendency for those who score high to select few totally wrong answers.

Pearson r = -0.53534. This indicates that those who score high select few very likable answers.

Pearson r = 0.35228. This indicates a tendency for those who select very likable answers to also select irrelevant ones.

[Chart: correct, very likable, likable and irrelevant answers for each of the 18 subjects; vertical axis from 0 to 80]

Conclusions

• Reward intermediate levels of performance, particularly when very suitable items are selected

• Increase reliability and efficiency in scoring compared to dichotomous scoring (in agreement with Bennett, Wongwiwatthananukit, and Popovich (2000))

• Make score outcomes more reflective of student knowledge compared to dichotomous scoring

• This would give us a more precise and individualized language proficiency measurement.

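The Proposal

• Partial Credit Scoring for the (2-1-0-0) framework: 2a + 1b

• For those who wish to include negative scoring, in Maple:

    languagetest := proc(A, B, G, D)
      local a, b, g, d, e;
      # a, b, g, d: presumably the counts of correct, very likable,
      # likable and irrelevant answers
      a := A: b := B: g := G: d := D:
      if a < 40 then
        return 0   # read here as "score is 0"; the slide shows a bare 0
      elif b > 20 then
        e := b - 20:
        d := d - e:
      fi:
      2*a + 1*b - d
    end: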
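As a cross-check of the proposal, the same scoring can be sketched in Python. This is one reading of the slide, not the authors' implementation: it assumes a, b, g and d are the counts of correct, very likable, likable and irrelevant answers, that a test with fewer than 40 correct answers scores 0, and that very likable answers beyond 20 reduce the penalty d as in the Maple fragment. The function names, dictionary keys and example counts are hypothetical.

```python
# Weights of the (2-1-0-0) framework named on the slide.
WEIGHTS = {"correct": 2, "very_likable": 1, "likable": 0, "irrelevant": 0}


def partial_credit_score(counts):
    """Plain partial credit: 2 per correct answer, 1 per very likable one."""
    return sum(weight * counts.get(kind, 0) for kind, weight in WEIGHTS.items())


def partial_credit_with_penalty(counts, min_correct=40, very_likable_cap=20):
    """Partial credit with negative scoring, following the Maple fragment.

    Assumptions: fewer than `min_correct` correct answers gives 0, and each
    very likable answer above `very_likable_cap` lowers the penalty by 1.
    """
    a = counts.get("correct", 0)
    b = counts.get("very_likable", 0)
    d = counts.get("irrelevant", 0)
    if a < min_correct:
        return 0
    if b > very_likable_cap:
        d -= b - very_likable_cap
    return 2 * a + 1 * b - d


# Hypothetical answer profile for an 80-item test.
example = {"correct": 50, "very_likable": 15, "likable": 10, "irrelevant": 5}
print(partial_credit_score(example))
print(partial_credit_with_penalty(example))
```

Keeping the thresholds as parameters makes it easy to try other cut-offs than the 40 and 20 that appear in the slide's code.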

The Proposal

• In CAT it is possible to:

1. Offer more items of the same level when a very likable / suitable distractor is selected

2. Shift to an in-between level before shifting level

3. Use the code presented in the previous slide at the end of the test before offering the final verdict
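The first two points could be sketched as a level-selection rule. The Python below is purely illustrative and rests on assumptions the slides do not state: levels are numbers from 1 to 6, an "in-between level" is a half step, a correct answer moves up, and a likable or irrelevant answer moves down. The function name `next_level` and all parameter values are hypothetical.

```python
HALF_STEP = 0.5  # assumed size of an "in-between" level
ANSWER_KINDS = {"correct", "very_likable", "likable", "irrelevant"}


def next_level(level, answer, min_level=1.0, max_level=6.0):
    """Pick the difficulty level of the next item from the last answer."""
    if answer not in ANSWER_KINDS:
        raise ValueError(f"unknown answer kind: {answer!r}")
    if answer == "very_likable":
        # Near miss: stay at the same level and offer another item.
        return level
    # Move by half a level first instead of jumping a whole level.
    step = HALF_STEP if answer == "correct" else -HALF_STEP
    return min(max_level, max(min_level, level + step))


level = 3.0
for answer in ["correct", "very_likable", "irrelevant", "likable"]:
    level = next_level(level, answer)
    print(answer, "->", level)
```

The final verdict would still come from the scoring procedure applied to the full set of answers, as the third point suggests.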

Future Hypothesis

• Does awareness of partial credit scoring increase test takers’ responsibility in answering the items?

• Does it change test takers’ attitude towards more responsible responses?

Concluding Remark

• Bachman (1990:280) - Spolsky (1981) address the ethical considerations of test use, questioning whether language testers have enough evidence to be sure of the decisions made on the basis of test scores.

• We would like to think that the proposed scoring technique provides supportive evidence in this direction.
