Work Smart – Reducing Human Effort in Short-Answer Grading

Margot Mieskes, Hochschule Darmstadt
Ulrike Padó, Hochschule für Technik Stuttgart

Introduction Machine Grading Experiments Discussion

Introduction

- Testing is an integral part of (language) teaching

- Specifically in focus: tests with Short-Answer Questions (SAQs) for language or content assessment

Short-Answer Questions

Example from the CREE corpus, Meurers et al. (2011b): Reading Comprehension

- Read the text “Television and Children”

- Question: How is violence portrayed in cartoons according to the article?

- Student Answer: There are underlying themes of justice and punishment, that is, the “bad guys” do not usually win.

- Grader 1: correct, grader 2: correct

Our Goal

- Grading SAQs takes time, especially for large cohorts or frequently repeated tests

- Reduce human grading effort!

- Specifically: Machines do some of the work and humans step in where machines fail

- Determine likely machine failure through measures like Accuracy and Fleiss’ κ

- This means not every student answer will be human-graded

- Appropriate for placement testing etc., where the overall grade is reported

Outline of Talk

- Machine Graders: Data, features and evaluation

- Study 1: Human performance

- Study 2: Combining machine graders

Data Sets

Language  Corpus                             #Questions/#Answers
EN        ASAP (www.kaggle.com/c/asap-sas)   5/8182
EN        SEB (Dzikovska et al., 2013)       135/4969
EN        Beetle (Dzikovska et al., 2013)    47/3941
EN        Mohler (Mohler et al., 2011)       81/2273
EN        CREE (Meurers et al., 2011a)       61/566
GER       CREG (Meurers et al., 2011b)       85/543
GER       CSSAG (Pado and Kiefer, 2015)      31/1926

Models and Features

- Three learning algorithms: Random Forest, Support Vector Machine, Decision Tree

- Features designed to cover the feature types used in the literature: N-grams, text similarity measures, dependency parses and deep semantic representations, textual entailment (Pado, 2016)
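The slides do not spell out how the individual features are computed. As a rough illustration of the N-gram feature type only, the following sketch measures how many word n-grams of a reference answer reappear in a student answer. The function names, the whitespace tokenisation and the use of a single reference answer are assumptions made for this example, not the authors' feature set.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(student, reference, n=1):
    """Fraction of reference n-grams that also occur in the student answer.

    Hypothetical feature: lower-cased, whitespace-tokenised text only.
    """
    ref = Counter(ngrams(reference.lower().split(), n))
    stu = Counter(ngrams(student.lower().split(), n))
    if not ref:
        return 0.0
    shared = sum((ref & stu).values())  # multiset intersection of n-grams
    return shared / sum(ref.values())
```

A feature vector for a classifier could then hold this value for several n, next to similarity, parse-based and entailment features.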

Evaluation Measures

- Comparing predictions and gold annotation
  - Accuracy: What percentage of the answers has been labelled correctly?

- Comparing parallel annotations: Fleiss’ κ
  - Do the annotators agree more than they would by chance?
  - Compare multiple annotators, for multiple target grades, down to the individual answer
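Both measures can be sketched in a few lines. The code below follows the standard definition of Fleiss' κ, (P_obs - P_exp) / (1 - P_exp), where P_obs is the mean share of agreeing annotator pairs per answer and P_exp is the agreement expected from the overall label distribution. The function names and the input layout (one list of labels per answer) are assumptions for illustration.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Share of answers whose predicted label equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of answers, each a list of n labels
    (one per annotator); every answer must have the same n >= 2."""
    n = len(ratings[0])
    if n < 2 or any(len(item) != n for item in ratings):
        raise ValueError("every answer needs the same number (>= 2) of labels")
    counts = [Counter(item) for item in ratings]
    # Observed agreement: mean share of agreeing annotator pairs per answer.
    p_obs = sum(
        (sum(c * c for c in cnt.values()) - n) / (n * (n - 1)) for cnt in counts
    ) / len(ratings)
    # Chance agreement from the overall label distribution.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_exp = sum((t / (len(ratings) * n)) ** 2 for t in totals.values())
    if p_exp == 1.0:
        return float("nan")  # only one label was ever used: kappa is undefined
    return (p_obs - p_exp) / (1 - p_exp)
```

The per-answer term inside `p_obs` is what allows agreement to be inspected down to the individual answer.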

Human Performance

Measure    ASAP  CREG  CSSAG  Mohler
Human Acc  93.7  85.8  89.9   83.5
Human κ    0.82  0.64  0.54   0.41

- Easiest case: Correct-incorrect decision

- Doubly-annotated corpora show large variation between high-volume and ad-hoc testing

- Accuracies around 85% have been accepted: ∼15% error

Machine Ensembles

Our Goal

- Reduce human grading effort!

- Ideally, at the same error levels as before

- Idea: Machines do some of the work and humans step in where machines fail

Strategy

- Train several classifiers and collect their predictions: ensemble learning

- As long as each ensemble learner is better than chance and the learners do not all make the same errors, the ensemble can be expected to improve over the individual learners

- Also, now there are multiple annotations! Use κ to determine ensemble agreement

- Assumption: The better the ensemble agreement on a prediction, the more reliable the prediction is

- Human checks of the machine labels are only needed for unreliable predictions
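The strategy above can be sketched as a vote over the classifiers' labels plus a routing rule. The slides do not state the exact combination rule, so strict majority voting, the status names `"full"`, `"partial"` and `"none"`, and the `review_partial` switch are assumptions made for this example.

```python
from collections import Counter


def ensemble_decision(votes):
    """Combine one label per classifier into (label, status).

    status is "full" (all classifiers agree), "partial" (a strict
    majority agrees) or "none" (no strict majority; label is None).
    """
    label, top = Counter(votes).most_common(1)[0]
    if top == len(votes):
        return label, "full"
    if top > len(votes) / 2:
        return label, "partial"
    return None, "none"


def needs_human(status, review_partial=True):
    """Route to a human when the ensemble has no label and, optionally,
    whenever agreement is less than full."""
    return status == "none" or (review_partial and status == "partial")
```

With three classifiers and two grades a strict majority always exists, so the `"none"` status can only arise when there are more than two grade levels.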

Verifying the Assumption

Classes  ASAP  CREE  CREG  CSSAG  Mohler  Beetle  SEB
Binary   10%   15%   12%   24%    9%      17%     25%
Multi    18%   –     –     38%    30%     –       –

- Percentage of incorrect predictions made in full agreement

- For most corpora, decisions made in full agreement are as reliable as human annotators

- The task is noticeably harder for more than two grade levels
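The quantity in the table, the share of wrong labels among predictions made in full agreement, could be computed along the following lines. The function name and the input layout (one list of classifier votes per answer, plus gold labels) are assumptions for illustration.

```python
def full_agreement_error(votes_per_answer, gold):
    """Share of wrong labels among answers where all classifiers agree.

    Returns None when no answer was labelled in full agreement.
    """
    unanimous = [
        (votes[0], g)
        for votes, g in zip(votes_per_answer, gold)
        if len(set(votes)) == 1
    ]
    if not unanimous:
        return None
    return sum(label != g for label, g in unanimous) / len(unanimous)
```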

Identifying Unreliable Predictions

- Clearly: Any answers the ensemble couldn’t label (no agreement; multiclass case only)

- Next: Any answers the ensemble didn’t label in full agreement

Effort and Remaining Error: Binary Case

Condition  Measure  ASAP  CREE  CREG  CSSAG  Mohler  Beetle  SEB
NAonly     Effort   0     0     0     0      0       0       0
NAonly     Error    16%   15%   16%   29%    11%     23%     30%
allPartA   Effort   20%   19%   12%   27%    7%      24%     28%
allPartA   Error    8%    9%    9%    17%    8%      13%     18%

- Binary case: Pass-fail decision

- First: Any answers the ensemble couldn’t label (none here!)

- Next: Any answers the ensemble didn’t label in full agreement: Remaining error below human levels at 20-30% of answers graded
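One plausible reading of the two measures is that effort is the share of answers handed to a human and remaining error is the share of all answers that still carry a wrong machine label afterwards. The sketch below assumes that a human-reviewed answer always ends up with its gold label; that assumption, the function name and the argument layout are not taken from the slides. For ASAP the reading fits the reported numbers (80% of answers kept with 10% full-agreement error gives 8%), although not every corpus matches it exactly.

```python
def effort_and_error(machine_labels, reviewed, gold):
    """Return (effort, remaining_error) as fractions of all answers.

    machine_labels: ensemble label per answer (None if it had no label)
    reviewed:       True where a human re-grades the answer
    gold:           reference label per answer
    Assumption: a human-reviewed answer always ends up with its gold label.
    """
    total = len(gold)
    if total == 0 or not (len(machine_labels) == len(reviewed) == total):
        raise ValueError("need three non-empty lists of equal length")
    effort = sum(reviewed) / total
    # Only answers left to the machine can still be wrong.
    wrong = sum(
        1 for m, r, g in zip(machine_labels, reviewed, gold) if not r and m != g
    )
    return effort, wrong / total
```

Setting `reviewed` to all `False` reproduces the zero-effort baseline, and marking every answer without full agreement as `True` corresponds to the second row group.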

Effort and Remaining Error: Multiclass Case

Condition  Measure  ASAP  CSSAG  Mohler
NAonly     Effort   7%    4%     9%
NAonly     Error    28%   44%    41%
allPartA   Effort   39%   50%    59%
allPartA   Error    11%   19%    15%

- Multiclass case: 5 to 10-way decision

- First: Revise answers the ensemble couldn’t label – more work clearly needed

- Second: Revise cases of partial agreement

- Acceptable error levels, but more manual work than in the binary case

Did we reach our goal?

- Human effort can be reduced while error levels remain stable

- Our approach works better for the learner corpora: Binary decisions, reliable machine learners

- For the multiclass case, similar efficiency as reported in Horbach et al. (2014) (60% effort saved, 15% remaining error); much better for pass-fail

What else to consider?

- When planning ensemble-supported grading:
  - Know your requirements

- When creating corpora:
  - Few classes
  - Well-trained annotators
  - Size matters (somewhat)

- Future work:
  - Run a user study: Get feedback on usefulness and usability
