Facets to Adjust for Rater Discrepancies


TRANSCRIPT

  • Slide 1/12

    Investigating the Effect of Raters' L1 Background on
    Writing Assessment

    A Presentation for IJAS, Paris, France
    April 8, 2013

    by
    Farah Bahrouni, Sultan Qaboos University (SQU), OMAN
    [email protected]
  • Slide 2/12

    "Confusion is the beginning of learning."
    Socrates (469-399 BC)

    "If we knew what we were doing, we wouldn't call it research."
    Albert Einstein

    These two quotations might explain why I am here!

  • Slide 3/12

    Outline:

    1) Claim

    2) Study

    Data collection

    Analysis: FACETS & One-Way ANOVA

    Results

    3) Conclusion

    Implication & Significance

  • Slide 4/12

    1. Claim

    Research has established that writing assessment can by no means
    be objective.

    Studies have probed possible reasons extensively. Weigle (1994:
    23-24) grouped sources of raters' disagreement into three
    categories:

    within the text: prompt, writer's background & ability
    within the rater: physical & psychological conditions
    within the rating context: when, where & under what conditions
    the rating is done

    She adds that interactions among these sources are also possible:
    "A rater from a certain background may react to a text written in
    a certain style differently from the way a rater from a different
    background would" (p. 24).

  • Slide 5/12

    Bachman (1990) refers to the above sources as "potential sources
    of measurement error" and categorizes them into three groups:

    test method factors (e.g. raters, prompt type, etc.)
    personal attributes (e.g. the test taker's cognitive style,
    knowledge of particular content, etc.)
    random factors (e.g. fatigue, time of day, etc.)

    Most of the other studies revolve around these points with
    respect to their different contexts. The claim in this study is
    that L1, which has been neglected to a great extent, is a
    significant source of discrepancy between raters and should be
    studied thoroughly on its own.

  • Slide 6/12

    Quantitative Data Collection

    20 ESL teachers from 4 different language backgrounds (5 native
    speakers, 5 Arabs sharing the students' mother tongue, 5 Indians,
    and 5 Russians) scored 3 essays written by 3 Omani university
    students. All raters are experienced ESL/EFL teachers and have
    taught in the Omani context for a minimum of 2 years.
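    Laid out for analysis, this is a fully crossed design: every
    rater scores every essay. Below is a minimal sketch of the
    long-format table such an analysis consumes; the rater labels,
    group names and empty score column are hypothetical placeholders,
    not the study's data.

    # Sketch of the fully crossed rating design in long format:
    # 20 raters from 4 L1 groups each score the same 3 essays.
    import itertools
    import pandas as pd

    l1_groups = ["NativeEnglish", "Arabic", "Indian", "Russian"]
    raters = [(f"{g[:2].upper()}{i}", g) for g in l1_groups
              for i in range(1, 6)]          # 5 raters per L1 group
    essays = ["E1", "E2", "E3"]

    rows = [{"rater": r, "l1": g, "essay": e, "score": None}  # scores TBD
            for (r, g), e in itertools.product(raters, essays)]
    ratings = pd.DataFrame(rows)
    print(len(ratings))    # 20 raters x 3 essays = 60 ratings
    print(ratings.head())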

  • Slide 7/12

    2. Analysis:

    2.1 Vertical rule

    2.2 Data collection (II)
  • Slide 8/12

    2.2 Data collection (II)

    Write:

    1) construct definitions based on Bachman & Palmer's (1996)
    communicative approach

    2) definitions of performance levels based on LOs, teachers' (Ts)
    responses and the 65 studied reports

    Analysis: FACETS + One-Way ANOVA

    Piloting: 5 teachers scored 10 samples twice, once with each
    rating scale, giving score sets RS1 and RS2.
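    For reference, FACETS estimates a many-facet Rasch model. A
    minimal sketch of its rating-scale form for this design, written
    with the conventional symbols rather than anything taken from the
    slides:

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k

    where B_n is the ability of writer n, D_i the difficulty of
    rating category i (CONT, ORG, Lge use, SCES), C_j the severity of
    rater j, and F_k the step up from score k-1 to k. The category
    measurement reports on the next slide are the D_i estimates with
    their fit statistics.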
  • Slide 9/12

    Results from FACETS:

    RS1 Category Measurement Report (arranged by MN)
    ------------------------------------------------------------------------------------------
    | Obsvd  Obsvd  Obsvd    Fair-M |  Model       |  Infit      Outfit     |Estim.|            |
    | Score  Count  Average  Average| Measure S.E. | MnSq ZStd   MnSq ZStd  |Discrm| N Category |
    ------------------------------------------------------------------------------------------
    |  179    50     3.6      3.65  |  -.28   .25  | 1.27  1.1   1.31  1.3  |  .74 | 1 CONT     |
    |  175    50     3.5      3.58  |  -.04   .24  | 1.01   .0   1.01   .1  |  .96 | 2 ORG      |
    |  174    50     3.5      3.56  |   .02   .24  | 1.07   .3    .94  -.2  | 1.06 | 4 SCES     |
    |  169    50     3.4      3.47  |   .30   .23  |  .67 -1.6    .75 -1.1  | 1.26 | 3 Lge use  |
    ------------------------------------------------------------------------------------------
    Model, Sample: RMSE .24  Adj (True) S.D. .00  Separation .00  Reliability .00
    Model, Fixed (all same) chi-square: 3.0  d.f.: 3  significance (probability): .40

    RS2 Category Measurement Report (arranged by MN)
    ------------------------------------------------------------------------------------------
    | Obsvd  Obsvd  Obsvd    Fair-M |  Model       |  Infit      Outfit     |Estim.|            |
    | Score  Count  Average  Average| Measure S.E. | MnSq ZStd   MnSq ZStd  |Discrm| N Category |
    ------------------------------------------------------------------------------------------
    |  187    50     3.7      3.77  |  -.71   .26  |  .85  -.5    .89  -.4  | 1.11 | 2 ORG      |
    |  184    50     3.7      3.71  |  -.51   .26  |  .51 -2.4    .53 -2.4  | 1.48 | 1 CONT     |
    |  164    50     3.3      3.39  |   .58   .21  | 2.11  3.5   2.05  3.3  |  .47 | 4 SCES     |
    |  163    50     3.3      3.38  |   .63   .21  |  .73 -1.1    .89  -.3  |  .90 | 3 Lge use  |
    ------------------------------------------------------------------------------------------
    Model, Sample: RMSE .24  Adj (True) S.D. .66  Separation 2.82  Reliability .89
    Model, Fixed (all same) chi-square: 26.7  d.f.: 3  significance (probability): .00
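    As a check on the summary lines, separation and reliability
    follow directly from the printed RMSE and adjusted (true) S.D.,
    using the standard Rasch formulas (small differences are due to
    FACETS working with unrounded internal values). For RS2:

    G = \frac{SD_{true}}{RMSE} \approx \frac{.66}{.24} \approx 2.8

    R = \frac{G^2}{1+G^2} \approx \frac{2.82^2}{1+2.82^2} \approx .89

    For RS1 the adjusted S.D. is .00, so separation and reliability
    are both .00: the four categories are statistically
    indistinguishable under RS1 but clearly separated under RS2.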

  • Slide 10/12

    Results from One-Way ANOVA

    5 Raters ANOVA: Rater total scores

                                 Sum of    df   Mean      F     Sig.
                                 Squares        Square
    RS1 TOTAL   Between Groups   136.32     4   34.08    2.17   .088
                Within Groups    708.1     45   15.74
                Total            844.42    49
    RS2 TOTAL   Between Groups    88        4   22       1.43   .239
                Within Groups    692       45   15.38
                Total            780       49

    5 Raters ANOVA: Samples total scores

                                 Sum of    df   Mean      F     Sig.
                                 Squares        Square
    RS1 TOTAL   Between Groups   379.22     9   42.136   3.62   .002
                Within Groups    465.2     40   11.63
                Total            844.42    49
    RS2 TOTAL   Between Groups   484.4      9   53.822   7.28   .000
                Within Groups    295.6     40    7.39
                Total            780       49
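    A minimal sketch of how these F ratios could be reproduced once
    the pilot scores are loaded as a 10 x 5 array (samples x raters);
    the random array below is a hypothetical stand-in for the RS1/RS2
    spreadsheet data, so it will not reproduce the values above.

    # One-way ANOVAs over total scores, mirroring the two tables above.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # rows = 10 writing samples, columns = 5 raters (hypothetical scores)
    scores = rng.integers(10, 26, size=(10, 5))

    # "Samples" ANOVA: each of the 10 samples is a group of 5 ratings
    f_s, p_s = stats.f_oneway(*scores)       # unpacks the 10 rows
    # "Raters" ANOVA: each of the 5 raters is a group of 10 ratings
    f_r, p_r = stats.f_oneway(*scores.T)     # unpacks the 5 columns

    print(f"Samples: F = {f_s:.2f}, p = {p_s:.3f}")   # df (9, 40)
    print(f"Raters:  F = {f_r:.2f}, p = {p_r:.3f}")   # df (4, 45)

    The pattern to look for, and the one RS2 shows above, is a
    significant samples effect (F = 7.28, p < .001) alongside a
    non-significant raters effect (F = 1.43, p = .239): score
    variance comes from the quality of the writing, not from who
    rated it.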

  • Slide 11/12

    3. Implication & significance:

    Analysis indicates that RS2 functions more effectively than RS1.

    Teachers' (Ts) involvement in defining what they think should be
    assessed in students' writing, and in describing the levels of
    performance (what labels such as "Excellent", "Good" or "Poor"
    stand for), helped Ts reach a more common understanding of the
    language aspects being assessed and a shared interpretation of
    the score descriptions.

    The rating scales I have developed are "home-made", based on LOs
    and tailored to FPE (and therefore LC) needs. They can be
    generalised to any similar multi-cultural context to produce a
    less personalized, more institutionalized and objective
    assessment of students' writing performance.

  • Slide 12/12

    REFERENCES

    Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan Publishers Limited.

    Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

    Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

    Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

    Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces Between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press.

    Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.

    Fulcher, G. (2010). Practical Language Testing. London: Hodder Education, an Hachette UK company.

    Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.

    Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex Publishing Corporation.

    Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.

    North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. P. Lang.

    North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.

    North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.

    Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.

    Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.

    Thank you