DEGREES OF INCORRECTNESS IN
COMPUTER ADAPTIVE LANGUAGE TESTING
Tsopanoglou, A., Ypsilandis, G.S., Mouti, A.
Dept of Italian Language and Literature
Aristotle University of Thessaloniki
This Talk
Multiple-Choice and CAT
The Study
The Results
The Proposal
CAT from OUP and Univ of Cambridge
Multiple-Choice: Dichotomous
STEM - QUESTION
CORRECT ANSWER
1ST DISTRACTOR
2ND DISTRACTOR
3RD DISTRACTOR
Multiple-Choice, e.g.
Who was the Prime Minister of the UK in the year 1988?
CORRECT ANSWER ------ Margaret Thatcher
1ST DISTRACTOR ------ John Major
2ND DISTRACTOR ------ Elton John
3RD DISTRACTOR ------ Liverpool FC
Multiple-Choice Polychotomous Pattern
STEM - QUESTION
CORRECT ANSWER
1ST DISTRACTOR ------ VERY LIKABLE / VERY SUITABLE
2ND DISTRACTOR ------ LIKABLE / SUITABLE
3RD DISTRACTOR ------ IRRELEVANT / TOTALLY WRONG
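A minimal Python sketch of how one item could carry both patterns at once. The `Grade` and `Item` names are hypothetical, and grading the Prime Minister distractors this way is an assumption made for illustration only; the slides present that item as a dichotomous example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Grade(Enum):
    """Degrees of (in)correctness in the polychotomous pattern."""
    CORRECT = "correct"
    VERY_LIKABLE = "very likable"
    LIKABLE = "likable"
    IRRELEVANT = "irrelevant"


@dataclass
class Item:
    stem: str
    options: Dict[str, Grade]  # option text -> grade assigned by the examiner

    def dichotomous(self, choice: str) -> int:
        """Traditional scoring: 1 for the correct answer, 0 for any distractor."""
        return 1 if self.options[choice] is Grade.CORRECT else 0

    def polychotomous(self, choice: str) -> Grade:
        """Experimental scoring keeps the degree of incorrectness."""
        return self.options[choice]


# Assumed grading of the example item (not stated in the slides).
item = Item(
    stem="Who was the Prime Minister of the UK in the year 1988?",
    options={
        "Margaret Thatcher": Grade.CORRECT,
        "John Major": Grade.VERY_LIKABLE,
        "Elton John": Grade.LIKABLE,
        "Liverpool FC": Grade.IRRELEVANT,
    },
)
```

Under dichotomous scoring every distractor collapses to 0, while the polychotomous view still tells "John Major" apart from "Liverpool FC".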
The Procedure
Collection of MC questions from a CAT
Completion of test on paper by Test Takers
Test Correction in Traditional Mode
Test Correction in Experimental Mode
Listing of Test Items by Expert Examiner
Analysis of Results
Conclusion
Rating of Test Items by Expert Examiner (Native)
CLEAR - DICHOTOMOUS
• Three Wrong – 1 Correct = 54
• All Wrong = 6
• 1 Correct – 1/2 Likable = 5
• TOTAL = 65 (81%)
PATTERNED - POLYCHOTOMOUS
• 2 Correct rest Wrong = 7
• Entire Experimental Pattern = 1
• 1 Correct – 1/2 Very Likable = 7
• TOTAL = 15 (19%)
Comparing Scoring Procedures: 1. Traditional Mode, 2. Experimental Mode, and 3. Negative Scoring

Subject        1    2    3    4    5    6    7    8    9
Traditional   86%  68%  79%  75%  36%  49%  63%  46%  55%
Experimental  88%  69%  81%  76%  40%  51%  66%  49%  55%
Negative      84%  56%  74%  66%  12%  28%  36%  26%  33%

Subject       10   11   12   13   14   15   16   17   18
Traditional   72%  51%  56%  91%  35%  51%  51%  83%  53%
Experimental  76%  56%  59%  91%  43%  55%  54%  84%  57%
Negative      62%  40%  45%  88%  19%  35%  34%  76%  40%
Tr-Exp Pearson r = 0.995488
Tr-Neg Pearson r = 0.979421
Exp-Neg Pearson r = 0.985956
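The correlations above can be recomputed from the score table with a plain Pearson formula. The sketch below assumes the three rows of percentages are, in order, traditional, experimental and negative scoring for the same 18 subjects; the variable and function names are illustrative.

```python
from math import sqrt

# Scores (%) per subject, read off the score table; row order is an assumption.
traditional  = [86, 68, 79, 75, 36, 49, 63, 46, 55, 72, 51, 56, 91, 35, 51, 51, 83, 53]
experimental = [88, 69, 81, 76, 40, 51, 66, 49, 55, 76, 56, 59, 91, 43, 55, 54, 84, 57]
negative     = [84, 56, 74, 66, 12, 28, 36, 26, 33, 62, 40, 45, 88, 19, 35, 34, 76, 40]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_tr_exp = pearson_r(traditional, experimental)
r_tr_neg = pearson_r(traditional, negative)
r_exp_neg = pearson_r(experimental, negative)
print(f"Tr-Exp {r_tr_exp:.6f}  Tr-Neg {r_tr_neg:.6f}  Exp-Neg {r_exp_neg:.6f}")
```

If the row-order assumption holds, the printed values should land close to the reported coefficients; the table is rounded to whole percentages, so small differences in the last digits are to be expected.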
Qualitative Analysis
• 5.5% of subjects: 48%-49% with the Tr. Method, >=50% with the Exp. Method
• 16.6% of subjects: 51% with the Tr. Method, secure with the Exp. Method
• 89% of subjects: higher with the Exp. Method
• 11% of subjects: same with both methods
More … Qualitative Analysis
• All subjects scored less when corrected with negative scoring
• 5 of those corrected with negative scoring would score considerably (>=20 percentage points) lower than with traditional or experimental scoring
• All 5 subjects in the above category scored <=55%
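The qualitative figures can be checked directly against the score table. This is a sketch under the same assumption as before (rows are traditional, experimental, negative; 18 subjects), with a pass mark of 50% taken from the ">=50%" remark and the 20-point drop measured against traditional scoring.

```python
# Scores (%) per subject, read off the score table; row order is an assumption.
traditional  = [86, 68, 79, 75, 36, 49, 63, 46, 55, 72, 51, 56, 91, 35, 51, 51, 83, 53]
experimental = [88, 69, 81, 76, 40, 51, 66, 49, 55, 76, 56, 59, 91, 43, 55, 54, 84, 57]
negative     = [84, 56, 74, 66, 12, 28, 36, 26, 33, 62, 40, 45, 88, 19, 35, 34, 76, 40]

n = len(traditional)
pairs = list(zip(traditional, experimental, negative))

higher = sum(e > t for t, e, _ in pairs)          # experimental above traditional
same = sum(e == t for t, e, _ in pairs)           # identical under both methods
crossed = sum(t < 50 <= e for t, e, _ in pairs)   # fail -> pass under experimental
borderline = sum(t == 51 for t, _, _ in pairs)    # 51% under traditional
big_drop = sum(t - g >= 20 for t, _, g in pairs)  # >= 20 points lower when negative
all_lower = all(g < t and g < e for t, e, g in pairs)

for label, count in [("higher", higher), ("same", same), ("crossed 50%", crossed),
                     ("51% traditional", borderline), ("drop >= 20", big_drop)]:
    print(f"{label}: {count} of {n} ({100 * count / n:.1f}%)")
print("all lower with negative scoring:", all_lower)
```

On these data the counts come out at 16, 2, 1, 3 and 5 of 18, which is consistent with the percentages on the slides.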
Correlation of Answers of Experimental Pattern
Pearson r = -0.310272. This indicates a tendency for those who score high to select few totally wrong answers.
Pearson r = -0.53534. This indicates that those who score high select few very likable answers.
Pearson r = 0.35228. This indicates a tendency for those who select very likable answers to also select irrelevant ones.
[Chart: number of correct, very likable, likable and irrelevant answers per subject; x-axis subjects 1-18, y-axis 0-80]
Conclusions
Polychotomous scoring would:
• reward intermediate levels of performance, particularly when very suitable items are selected
• increase reliability and efficiency in scoring compared to dichotomous scoring (in agreement with Bennett, Wongwiwatthananukit, and Popovich (2000))
• make score outcomes more reflective of student knowledge compared to dichotomous scoring
• This would give us a more precise and individualized language proficiency measurement.
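The Proposal
• Partial Credit Scoring for the (2-1-0-0) framework: 2a + 1b
• For those who wish to include negative scoring, in Maple language:

    test := proc(A, B, G, D)
      # a, b, g, d: presumably the counts of correct, very likable,
      # likable and irrelevant answers
      local a, b, g, d, e;
      a := A; b := B; g := G; d := D;
      if a < 40 then
        return 0;        # fewer than 40 correct answers: score 0
      elif b > 20 then
        e := b - 20;     # very likable answers beyond 20 ...
        d := d - e;      # ... reduce the penalty for irrelevant ones
      fi;
      2*a + 1*b - d;
    end: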
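For readers without Maple, the same scoring rule can be sketched in Python. This is one reading of the slide, not a definitive implementation: it assumes a, b, g and d are the counts of correct, very likable, likable and irrelevant answers, and that a test with fewer than 40 correct answers scores 0. The function names are illustrative.

```python
def partial_credit(correct: int, very_likable: int) -> int:
    """(2-1-0-0) framework: 2 points per correct answer, 1 per very likable one."""
    return 2 * correct + 1 * very_likable


def partial_credit_negative(correct: int, very_likable: int,
                            likable: int, irrelevant: int) -> int:
    """Partial credit with a penalty for irrelevant answers (assumed reading)."""
    if correct < 40:
        return 0  # assumed: below 40 correct answers the score is 0
    penalty = irrelevant
    if very_likable > 20:
        # Very likable answers beyond 20 offset part of the penalty.
        penalty -= very_likable - 20
    # Likable answers carry no weight in the (2-1-0-0) framework.
    return 2 * correct + 1 * very_likable - penalty


print(partial_credit(50, 10))
print(partial_credit_negative(50, 10, 10, 10))
```

Keeping the penalty in a separate function leaves the plain 2a + 1b score available for those who do not wish to use negative scoring.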
The Proposal
• In CAT it is possible to:
1. Offer more items of the same level when a very likable / suitable distractor is selected
2. Shift to an in-between level before shifting level
3. Use the code presented in the previous slide at the end of the test before offering the final verdict
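The first two options can be sketched as a simple routing rule. Everything concrete here is an assumption for illustration: the level scale from 1 to 6, the half-level step standing in for the "in-between level", and the choice to move up after a correct answer and down after a likable or irrelevant one.

```python
LOWEST, HIGHEST = 1.0, 6.0  # hypothetical level scale
HALF_STEP = 0.5             # hypothetical "in-between level"


def next_level(level: float, grade: str) -> float:
    """Pick the level of the next item from the grade of the last answer.

    grade is one of "correct", "very likable", "likable", "irrelevant".
    """
    if grade == "very likable":
        # Option 1: stay put and offer more items of the same level.
        return level
    # Option 2: move by half a level before committing to a full shift.
    step = HALF_STEP if grade == "correct" else -HALF_STEP
    return min(max(level + step, LOWEST), HIGHEST)


level = 3.0
for grade in ["correct", "very likable", "irrelevant", "correct"]:
    level = next_level(level, grade)
    print(grade, "->", level)
```

The final score would still be computed once, at the end of the test, as the third option describes.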
Future Hypothesis
• Does awareness of partial credit scoring increase test takers’ responsibility in answering the items?
• Does it change test takers’ attitude towards more responsible responses?
Concluding Remark
• Bachman (1990:280) and Spolsky (1981) address the ethical considerations of test use, questioning whether language testers have enough evidence to be sure of the decisions made on the basis of test scores.
• We would like to think that the proposed scoring technique provides supportive evidence in this direction.