Facets to Adjust for Rater Discrepancies
TRANSCRIPT
-
Investigating the Effect of Raters' L1 Background on Writing Assessment
A Presentation for IJAS
Paris, France
April 8, 2013
by
Farah Bahrouni, Sultan Qaboos University (SQU)
OMAN
[email protected]
-
"Confusion is the beginning of learning." (Socrates, 469-399 BC)
"If we knew what we were doing, we wouldn't call it research." (Albert Einstein)
These two quotations might explain why I am here!
-
Outline:
1) Claim
2) Study
   Data collection
   Analysis: FACETS & One-Way ANOVA
   Results
3) Conclusion
   Implication & Significance
-
1. Claim
Research has established that writing assessment can by no means be objective.
Studies have probed possible reasons extensively. Weigle (1994: 23-24) grouped sources of raters' disagreement into three categories:
within the text: prompt, writer's background & ability
within the rater: physical & psychological conditions
within the rating context: when, where & under what conditions the rating is done
She adds that interactions among these sources are also possible: "A rater from a certain background may react to a text written in a certain style differently from the way a rater from a different background would" (p. 24).
-
Bachman (1990) refers to the above sources as "potential sources of measurement error" and categorizes them into three groups:
test method factors (e.g. raters, prompt type, etc.)
personal attributes (e.g. test taker's cognitive style, knowledge of particular content, etc.)
random factors (e.g. fatigue, time of day, etc.)
Most of the other studies revolve around these points with respect to their different contexts.
The claim in this study is that L1, which has been neglected to a great extent, is a significant source of discrepancy between raters that should be studied thoroughly on its own.
-
Quantitative data collection
20 ESL teachers from 4 different language backgrounds (5 native speakers, 5 Arabs sharing the students' mother tongue, 5 Indians, and 5 Russians) scored 3 essays written by 3 Omani university students. All raters are experienced ESL/EFL teachers and have taught in the Omani context for a minimum of 2 years.
Analysis:
-
2. Analysis:
2.1 Vertical rule
2.2 Data collection (II)
-
2.2 Data collection (II)
Write:
1) construct definitions based on Bachman & Palmer's (1996) communicative approach
2) definitions of performance levels based on LOs (learning outcomes), Ts' (teachers') responses, and the 65 studied reports
Analysis: FACETS + One-Way ANOVA
Piloting: 5 teachers scored 10 samples twice, using rating scales RS1 and RS2 (a data-layout sketch follows below).
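As a rough illustration of how such pilot data can be organised (not the study's actual scoring sheets), the sketch below arranges ratings in the long one-row-per-observation layout that many-facet Rasch programs such as FACETS and a one-way ANOVA both work from. All identifiers and the placeholder score are assumptions; only the dimensions come from the study: 5 raters x 10 samples gives the 50 observations per category shown in the FACETS reports on the next slide.

```python
# A sketch of one plausible layout for the pilot ratings, NOT the study's
# actual data: 5 raters x 10 samples x 4 rating categories, i.e. 50
# observations per category, matching the "Obsvd Count" column below.
import pandas as pd

categories = ["CONT", "ORG", "Lge Use", "SCES"]   # category names from the FACETS reports
raters = [f"R{i}" for i in range(1, 6)]           # 5 pilot teachers (hypothetical IDs)
samples = [f"S{j}" for j in range(1, 11)]         # 10 scripts (hypothetical IDs)

rows = []
for rater in raters:
    for sample in samples:
        for cat in categories:
            rows.append({
                "rater": rater,
                "sample": sample,
                "category": cat,
                "score": 3,  # placeholder band score; real values come from the scoring sheets
            })
long_df = pd.DataFrame(rows)

# Total score per rater per sample: the unit the One-Way ANOVA tables analyse.
totals = long_df.groupby(["rater", "sample"], sort=False)["score"].sum().reset_index()
print(totals.head())
```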
-
Results from FACETS:
RS1 Category Measurement Report (arranged by MN).
----------------------------------------------------------------------------------------------
| Obsvd  Obsvd  Obsvd    Fair-M  |  Model        |  Infit       Outfit     | Estim. |            |
| Score  Count  Average  Average | Measure  S.E. | MnSq  ZStd  MnSq  ZStd  | Discrm | N Category |
----------------------------------------------------------------------------------------------
|  179     50     3.6     3.65   |   -.28   .25  | 1.27   1.1  1.31   1.3  |   .74  | 1 CONT     |
|  175     50     3.5     3.58   |   -.04   .24  | 1.01    .0  1.01    .1  |   .96  | 2 ORG      |
|  174     50     3.5     3.56   |    .02   .24  | 1.07    .3   .94   -.2  |  1.06  | 4 SCES     |
|  169     50     3.4     3.47   |    .30   .23  |  .67  -1.6   .75  -1.1  |  1.26  | 3 Lge Use  |
----------------------------------------------------------------------------------------------
Model, Sample: RMSE .24  Adj (True) S.D. .00  Separation .00  Reliability .00
Model, Fixed (all same) chi-square: 3.0  d.f.: 3  significance (probability): .40
----------------------------------------------------------------------------------------------

RS2 Category Measurement Report (arranged by MN).
----------------------------------------------------------------------------------------------
| Obsvd  Obsvd  Obsvd    Fair-M  |  Model        |  Infit       Outfit     | Estim. |            |
| Score  Count  Average  Average | Measure  S.E. | MnSq  ZStd  MnSq  ZStd  | Discrm | N Category |
----------------------------------------------------------------------------------------------
|  187     50     3.7     3.77   |   -.71   .26  |  .85   -.5   .89   -.4  |  1.11  | 2 ORG      |
|  184     50     3.7     3.71   |   -.51   .26  |  .51  -2.4   .53  -2.4  |  1.48  | 1 CONT     |
|  164     50     3.3     3.39   |    .58   .21  | 2.11   3.5  2.05   3.3  |   .47  | 4 SCES     |
|  163     50     3.3     3.38   |    .63   .21  |  .73  -1.1   .89   -.3  |   .90  | 3 Lge Use  |
----------------------------------------------------------------------------------------------
Model, Sample: RMSE .24  Adj (True) S.D. .66  Separation 2.82  Reliability .89
Model, Fixed (all same) chi-square: 26.7  d.f.: 3  significance (probability): .00
----------------------------------------------------------------------------------------------
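The summary line under each report can be sanity-checked by hand. As a minimal sketch (not the FACETS source code), assuming the standard Rasch definitions of separation and reliability, the RS2 figures follow from the four category measures and their standard errors:

```python
# Re-deriving the RS2 summary statistics from the table above, assuming the
# standard Rasch formulas: true variance = observed variance minus mean error
# variance; separation G = true SD / RMSE; reliability = G^2 / (1 + G^2).
import math

measures = [-0.71, -0.51, 0.58, 0.63]   # RS2 category measures (logits), from the table
ses      = [0.26, 0.26, 0.21, 0.21]     # their model standard errors

n = len(measures)
mean = sum(measures) / n
obs_var = sum((m - mean) ** 2 for m in measures) / (n - 1)   # observed (sample) variance
rmse_sq = sum(se ** 2 for se in ses) / n                     # mean error variance
rmse = math.sqrt(rmse_sq)                                    # "RMSE"

true_sd = math.sqrt(max(obs_var - rmse_sq, 0.0))             # "Adj (True) S.D."
separation = true_sd / rmse                                  # spread in error units
reliability = separation ** 2 / (1 + separation ** 2)

print(f"RMSE {rmse:.2f}  True SD {true_sd:.2f}  "
      f"Separation {separation:.2f}  Reliability {reliability:.2f}")
# Prints: RMSE 0.24  True SD 0.67  Separation 2.82  Reliability 0.89
# (True SD differs from the report's .66 only through rounding of the inputs.)
```

Running the same calculation on the RS1 measures gives an observed spread no larger than the measurement error, so true SD, separation, and reliability all floor at .00: RS1's categories cannot be statistically distinguished, while RS2 separates them reliably.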
-
Results from One-Way ANOVA

5 Raters ANOVA: Rater Total scores
                              Sum of Squares   df   Mean Square     F      Sig.
RS1 TOTAL   Between Groups        136.32        4      34.08       2.17    .088
            Within Groups         708.10       45      15.74
            Total                 844.42       49
RS2 TOTAL   Between Groups         88.00        4      22.00       1.43    .239
            Within Groups         692.00       45      15.38
            Total                 780.00       49

5 Raters ANOVA: Samples Total scores
                              Sum of Squares   df   Mean Square     F      Sig.
RS1 TOTAL   Between Groups        379.22        9      42.136      3.62    .002
            Within Groups         465.20       40      11.63
            Total                 844.42       49
RS2 TOTAL   Between Groups        484.40        9      53.822      7.28    .000
            Within Groups         295.60       40       7.39
            Total                 780.00       49
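For readers who want to reproduce this kind of table, a sketch using SciPy's one-way ANOVA follows. The score lists are fabricated placeholders, not the study's data; only the grouping logic mirrors the tables above (5 rater groups of 10 totals each, hence df between = 4 and df within = 45).

```python
# A sketch of the grouping behind the "by rater" tables above: each list holds
# one rater's total scores on the 10 pilot samples. Numbers are fabricated
# placeholders, not the study's scores.
from scipy import stats

scores_by_rater = [
    [68, 72, 65, 70, 74, 61, 69, 73, 66, 71],   # rater 1's totals on 10 samples
    [64, 70, 62, 68, 71, 60, 66, 69, 63, 67],   # rater 2
    [70, 75, 66, 72, 76, 64, 71, 74, 68, 73],   # rater 3
    [66, 71, 63, 69, 72, 62, 67, 70, 64, 68],   # rater 4
    [69, 74, 65, 71, 75, 63, 70, 72, 66, 70],   # rater 5
]

f_stat, p_value = stats.f_oneway(*scores_by_rater)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A non-significant result (as for RS2 by rater: F = 1.43, p = .239) means the
# five raters' mean totals do not differ reliably; grouping by sample instead
# (10 groups of 5) should come out significant if the scale discriminates scripts.
```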
-
3. Implication & significance:
Analysis indicates that RS2 functions more effectively than RS1.
Ts' involvement in defining what they think should be assessed in sts' writing & in describing the levels of performance (what labels such as Excellent, Good, or Poor stand for) helped Ts reach a more common understanding of the language aspects being assessed and a shared interpretation of the score descriptions.
The rating scales I have developed are home-made, based on LOs and tailored to the needs of the FPE, and therefore of the LC. They can be generalised to any similar multi-cultural context to produce a less personalised and more institutionalised, objective assessment of students' writing performance.
-
REFERENCES
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan Publishers Limited.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between Second Language Acquisition and Language Testing Research. Cambridge: CUP.
Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.
Fulcher, G. (2010). Practical Language Testing. Hodder Education, an Hachette UK company.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex Publishing Corporation.
Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.
North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. P. Lang.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.
Thank you