Facets to Adjust for Rater Discrepancies
TRANSCRIPT
-
Investigating the Effect of Raters' L1 Background on Writing Assessment
A Presentation for IJAS
Paris, France
April 8, 2013
by
Farah Bahrouni, Sultan Qaboos University (SQU)
OMAN
[email protected]
-
"Confusion is the beginning of learning." (Socrates, 469-399 BC)
"If we knew what we were doing, we wouldn't call it research." (Albert Einstein)
These two quotations might explain why I am here!
-
Outline:
1) Claim
2) Study
   Data collection
   Analysis: FACETS & One-Way ANOVA
   Results
3) Conclusion
   Implication & Significance
-
1. Claim
Research has established that writing assessment can by no means be objective.
Studies have probed possible reasons extensively. Weigle (1994: 23-24) grouped sources of raters' disagreement into three categories:
within the text: prompt, writer's background & ability
within the rater: physical & psychological conditions
within the rating context: when, where & under what conditions the rating is done
She adds that interactions among these sources are also possible: "A rater from a certain background may react to a text written in a certain style differently from the way a rater from a different background would" (p. 24).
-
Bachman (1990) refers to the above sources as "potential sources of measurement error" and categorizes them into three groups:
test method factors (e.g. raters, prompt type, etc.)
personal attributes (e.g. test taker's cognitive style, knowledge of particular content, etc.)
random factors (e.g. fatigue, time of day, etc.)
Most of the other studies revolve around these points with respect to their different contexts.
The claim in this study is that L1, which has been neglected to a great extent, is a significant source of discrepancy between raters that should be studied thoroughly on its own.
-
Quantitative data collection
20 ESL teachers from 4 different language backgrounds (5 native speakers, 5 Arabs sharing the students' mother tongue, 5 Indians, and 5 Russians) scored 3 essays written by 3 Omani university students. All raters are experienced ESL/EFL teachers and have taught in the Omani context for a minimum of 2 years.
Analysis:
-
2. Analysis:
2.1 Vertical rule
2.2 Data collection (II)
-
2.2 Data collection (II)
Write:
1) construct definitions based on Bachman & Palmer's (1996) communicative approach
2) definitions of performance levels based on LOs (learning outcomes), Ts' (teachers') responses, and the 65 studied reports
Analysis: FACETS + One-Way ANOVA
Piloting: 5 teachers scored 10 samples twice, using rating scales RS1 and RS2 (a data-layout sketch follows below).
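As a rough illustration of how such pilot data can be organised (not the study's actual scoring sheets), the sketch below arranges ratings in the long one-row-per-observation layout that many-facet Rasch programs such as FACETS and a one-way ANOVA both work from. All identifiers and the placeholder score are assumptions; only the dimensions come from the study: 5 raters x 10 samples gives the 50 observations per category shown in the FACETS reports on the next slide.

```python
# A sketch of one plausible layout for the pilot ratings, NOT the study's
# actual data: 5 raters x 10 samples x 4 rating categories, i.e. 50
# observations per category, matching the "Obsvd Count" column below.
import pandas as pd

categories = ["CONT", "ORG", "Lge Use", "SCES"]   # category names from the FACETS reports
raters = [f"R{i}" for i in range(1, 6)]           # 5 pilot teachers (hypothetical IDs)
samples = [f"S{j}" for j in range(1, 11)]         # 10 scripts (hypothetical IDs)

rows = []
for rater in raters:
    for sample in samples:
        for cat in categories:
            rows.append({
                "rater": rater,
                "sample": sample,
                "category": cat,
                "score": 3,  # placeholder band score; real values come from the scoring sheets
            })
long_df = pd.DataFrame(rows)

# Total score per rater per sample: the unit the One-Way ANOVA tables analyse.
totals = long_df.groupby(["rater", "sample"], sort=False)["score"].sum().reset_index()
print(totals.head())
```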
-
Results from FACETS:
RS1 Category Measurement Report (arranged by MN).
----------------------------------------------------------------------------------------------
| Obsvd  Obsvd  Obsvd    Fair-M  |  Model        |  Infit       Outfit     | Estim. |            |
| Score  Count  Average  Average | Measure  S.E. | MnSq  ZStd  MnSq  ZStd  | Discrm | N Category |
----------------------------------------------------------------------------------------------
|  179     50     3.6     3.65   |   -.28   .25  | 1.27   1.1  1.31   1.3  |   .74  | 1 CONT     |
|  175     50     3.5     3.58   |   -.04   .24  | 1.01    .0  1.01    .1  |   .96  | 2 ORG      |
|  174     50     3.5     3.56   |    .02   .24  | 1.07    .3   .94   -.2  |  1.06  | 4 SCES     |
|  169     50     3.4     3.47   |    .30   .23  |  .67  -1.6   .75  -1.1  |  1.26  | 3 Lge Use  |
----------------------------------------------------------------------------------------------
Model, Sample: RMSE .24  Adj (True) S.D. .00  Separation .00  Reliability .00
Model, Fixed (all same) chi-square: 3.0  d.f.: 3  significance (probability): .40
----------------------------------------------------------------------------------------------

RS2 Category Measurement Report (arranged by MN).
----------------------------------------------------------------------------------------------
| Obsvd  Obsvd  Obsvd    Fair-M  |  Model        |  Infit       Outfit     | Estim. |            |
| Score  Count  Average  Average | Measure  S.E. | MnSq  ZStd  MnSq  ZStd  | Discrm | N Category |
----------------------------------------------------------------------------------------------
|  187     50     3.7     3.77   |   -.71   .26  |  .85   -.5   .89   -.4  |  1.11  | 2 ORG      |
|  184     50     3.7     3.71   |   -.51   .26  |  .51  -2.4   .53  -2.4  |  1.48  | 1 CONT     |
|  164     50     3.3     3.39   |    .58   .21  | 2.11   3.5  2.05   3.3  |   .47  | 4 SCES     |
|  163     50     3.3     3.38   |    .63   .21  |  .73  -1.1   .89   -.3  |   .90  | 3 Lge Use  |
----------------------------------------------------------------------------------------------
Model, Sample: RMSE .24  Adj (True) S.D. .66  Separation 2.82  Reliability .89
Model, Fixed (all same) chi-square: 26.7  d.f.: 3  significance (probability): .00
----------------------------------------------------------------------------------------------
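The summary line under each report can be sanity-checked by hand. As a minimal sketch (not the FACETS source code), assuming the standard Rasch definitions of separation and reliability, the RS2 figures follow from the four category measures and their standard errors:

```python
# Re-deriving the RS2 summary statistics from the table above, assuming the
# standard Rasch formulas: true variance = observed variance minus mean error
# variance; separation G = true SD / RMSE; reliability = G^2 / (1 + G^2).
import math

measures = [-0.71, -0.51, 0.58, 0.63]   # RS2 category measures (logits), from the table
ses      = [0.26, 0.26, 0.21, 0.21]     # their model standard errors

n = len(measures)
mean = sum(measures) / n
obs_var = sum((m - mean) ** 2 for m in measures) / (n - 1)   # observed (sample) variance
rmse_sq = sum(se ** 2 for se in ses) / n                     # mean error variance
rmse = math.sqrt(rmse_sq)                                    # "RMSE"

true_sd = math.sqrt(max(obs_var - rmse_sq, 0.0))             # "Adj (True) S.D."
separation = true_sd / rmse                                  # spread in error units
reliability = separation ** 2 / (1 + separation ** 2)

print(f"RMSE {rmse:.2f}  True SD {true_sd:.2f}  "
      f"Separation {separation:.2f}  Reliability {reliability:.2f}")
# Prints: RMSE 0.24  True SD 0.67  Separation 2.82  Reliability 0.89
# (True SD differs from the report's .66 only through rounding of the inputs.)
```

Running the same calculation on the RS1 measures gives an observed spread no larger than the measurement error, so true SD, separation, and reliability all floor at .00: RS1's categories cannot be statistically distinguished, while RS2 separates them reliably.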
-
Results from One-Way ANOVA

5 Raters ANOVA: Rater Total scores
                              Sum of Squares   df   Mean Square     F      Sig.
RS1 TOTAL   Between Groups        136.32        4      34.08       2.17    .088
            Within Groups         708.10       45      15.74
            Total                 844.42       49
RS2 TOTAL   Between Groups         88.00        4      22.00       1.43    .239
            Within Groups         692.00       45      15.38
            Total                 780.00       49

5 Raters ANOVA: Samples Total scores
                              Sum of Squares   df   Mean Square     F      Sig.
RS1 TOTAL   Between Groups        379.22        9      42.136      3.62    .002
            Within Groups         465.20       40      11.63
            Total                 844.42       49
RS2 TOTAL   Between Groups        484.40        9      53.822      7.28    .000
            Within Groups         295.60       40       7.39
            Total                 780.00       49
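For readers who want to reproduce this kind of table, a sketch using SciPy's one-way ANOVA follows. The score lists are fabricated placeholders, not the study's data; only the grouping logic mirrors the tables above (5 rater groups of 10 totals each, hence df between = 4 and df within = 45).

```python
# A sketch of the grouping behind the "by rater" tables above: each list holds
# one rater's total scores on the 10 pilot samples. Numbers are fabricated
# placeholders, not the study's scores.
from scipy import stats

scores_by_rater = [
    [68, 72, 65, 70, 74, 61, 69, 73, 66, 71],   # rater 1's totals on 10 samples
    [64, 70, 62, 68, 71, 60, 66, 69, 63, 67],   # rater 2
    [70, 75, 66, 72, 76, 64, 71, 74, 68, 73],   # rater 3
    [66, 71, 63, 69, 72, 62, 67, 70, 64, 68],   # rater 4
    [69, 74, 65, 71, 75, 63, 70, 72, 66, 70],   # rater 5
]

f_stat, p_value = stats.f_oneway(*scores_by_rater)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A non-significant result (as for RS2 by rater: F = 1.43, p = .239) means the
# five raters' mean totals do not differ reliably; grouping by sample instead
# (10 groups of 5) should come out significant if the scale discriminates scripts.
```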
-
3. Implication & significance:
Analysis indicates that RS2 functions more effectively than RS1.
Ts' involvement in defining what they think should be assessed in sts' writing & in describing the levels of performance (what labels such as Excellent, Good, or Poor stand for) helped Ts reach a more common understanding of the language aspects being assessed and a shared interpretation of the score descriptions.
The rating scales I have developed are home-made, based on LOs and tailored to the needs of the FPE, and therefore of the LC. They can be generalised to any similar multi-cultural context to produce a less personalised and more institutionalised, objective assessment of students' writing performance.
-
REFERENCES
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan Publishers Limited.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between Second Language Acquisition and Language Testing Research. Cambridge: CUP.
Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.
Fulcher, G. (2010). Practical Language Testing. Hodder Education, an Hachette UK company.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex Publishing Corporation.
Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.
North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. P. Lang.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.
Thank you