na-rae han (university of pittsburgh), joel tetreault (ets), soo-hwa lee (chungdahm learning, inc.),...


Na-Rae Han (University of Pittsburgh), Joel Tetreault (ETS),

Soo-Hwa Lee (Chungdahm Learning, Inc.), Jin-Young Ha (Kangwon University)

May 19 2010, LREC 2010

Using an Error-Annotated Learner Corpus to Develop an ESL/EFL Error Correction System


Objective

A feedback tool for detecting and correcting preposition errors:
- I wait /for you. (<NULL,p>: omitted prep)
- So I go to/ home quickly. (<p,NULL>: extraneous prep)
- Adult give money at/on birthday. (<p1,p2>: selection error)

Why preposition errors?
- Preposition usage is one of the most difficult aspects of English for non-native speakers
- 18% of sentences from ESL essays contain a preposition error (Dalgish, 1985)
- 8-10% of all prepositions in TOEFL essays are used incorrectly (Tetreault and Chodorow, 2008)


Diagnosing L2 Errors

Statistical modeling on large corpora. But what kind?

1. General corpora composed of well-edited texts by native speakers (“native speaker corpora”)
   - Currently the dominant approach
2. Error-annotated learner corpora, consisting of texts written by ESL learners
   - Our approach


Our Learner Corpus

Chungdahm English Learner Corpus
- A collection of English essays written by Korean-speaking students of Chungdahm Institute, operated in S. Korea
- 130,754,000 words in 861,481 essays, written on 1,545 prompts
- Over 6.6 million error annotations in 4 categories: grammar, strategy, style, substance
- Non-exhaustive error marking (more on this later)


The Preposition Data Set

Our preposition data set
- The 11 “preposition” types: NULL, about, at, by, for, from, in, of, on, to, with
- These represent 99% of student error tokens in the data

Text set consists of:
- 20.5 mil words
- 117,665 preposition errors
- 1,104,752 preposition non-errors
- Preposition error rate as marked in the data: 9.6%
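The 9.6% figure follows directly from the two counts above. A minimal check, assuming the rate is errors divided by all annotated preposition contexts (the function name is only illustrative):

```python
# Counts as reported on the slide.
PREP_ERRORS = 117_665
PREP_NON_ERRORS = 1_104_752


def marked_error_rate(errors: int, non_errors: int) -> float:
    """Share of preposition contexts that were marked as errors."""
    return errors / (errors + non_errors)


rate = marked_error_rate(PREP_ERRORS, PREP_NON_ERRORS)
print(f"{rate:.1%}")  # 9.6%
```

Note that this is the rate as marked by tutors; because the marking is non-exhaustive, the true rate is presumably higher.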


Method

- Cast error correction as a classification problem
- Train an 11-way Maximum Entropy classifier on preposition events extracted from the Chungdahm corpus
- A preposition annotation is represented as <s,c> (s: student’s prep choice, c: correct preposition)
  - s and c range over: { NULL, about, at, by, for, from, in, of, on, to, with }
  - s≠c for prep errors; s=c for non-errors
- A preposition event consists of:
  - Outcome (prediction target): c
  - Contextual features extracted from the immediate contexts surrounding preposition tokens, including the student’s original preposition choice (i.e., s)
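The <s,c> notation maps onto the three error types from the Objective slide. A small sketch of that mapping; the function name and the returned strings are illustrative labels, not part of the original system:

```python
# The 11 outcome labels listed on the data-set slide.
LABELS = ["NULL", "about", "at", "by", "for", "from", "in", "of", "on", "to", "with"]


def error_type(s: str, c: str) -> str:
    """Classify an <s,c> annotation (s: student's choice, c: correct choice)."""
    if s not in LABELS or c not in LABELS:
        raise ValueError(f"unknown label in <{s},{c}>")
    if s == c:
        return "non-error"   # s = c
    if s == "NULL":
        return "omission"    # <NULL,p>
    if c == "NULL":
        return "extraneous"  # <p,NULL>
    return "selection"       # <p1,p2>


print(error_type("NULL", "for"))  # I wait /for you.
print(error_type("to", "NULL"))   # So I go to/ home quickly.
print(error_type("at", "on"))     # Adult give money at/on birthday.
```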


Preposition Context

- Student prep choice + 3 words to left and right
- MOD: Head of the phrase modified by the prep phrase
- ARG: Noun argument of the preposition
- MOD and ARG are identified using the Stanford Parser

Example text and annotation:

Snow   is   falling   there   at   the   winter   .
       -3   -2        -1      s    +1    +2       +3
            MOD                          ARG

<s,c>: <at,in>


Event Representation

Represented as an event:
- Outcome: in
- Features (24 total):

name         value
s            at
wd-1         there
wd+1         the
MOD          falling
ARG          winter
MOD_ARG      falling_winter
MOD_s_ARG    falling_at_winter
3GRAM        there_at_the
5GRAM        falling_there_at_the_winter
...          ...
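The nine features shown above can be derived mechanically from the token sequence once MOD and ARG are known. The sketch below is an illustration only: `extract_features` is a hypothetical helper, MOD and ARG are passed in by hand rather than taken from a parser, the empty-string padding at sentence boundaries is an assumption, and the remaining features of the 24 are not reconstructed. Omission contexts (s = NULL) are not handled.

```python
def extract_features(tokens, prep_index, mod, arg):
    """Build the features shown on the slide for one preposition token."""
    s = tokens[prep_index]

    def ngram(size):
        # `size` words on each side of the preposition, joined with "_".
        lo = max(0, prep_index - size)
        return "_".join(tokens[lo:prep_index + size + 1])

    has_left = prep_index > 0
    has_right = prep_index + 1 < len(tokens)
    return {
        "s": s,
        "wd-1": tokens[prep_index - 1] if has_left else "",   # padding is assumed
        "wd+1": tokens[prep_index + 1] if has_right else "",  # padding is assumed
        "MOD": mod,
        "ARG": arg,
        "MOD_ARG": f"{mod}_{arg}",
        "MOD_s_ARG": f"{mod}_{s}_{arg}",
        "3GRAM": ngram(1),
        "5GRAM": ngram(2),
    }


tokens = ["Snow", "is", "falling", "there", "at", "the", "winter", "."]
event = {
    "outcome": "in",  # c, the correct preposition
    "features": extract_features(tokens, 4, mod="falling", arg="winter"),
}
for name, value in event["features"].items():
    print(name, value)
```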


Training and Testing

Training set: 978,000 events
- The rest is set aside for evaluation and development

Creating an evaluation set for testing
- Error annotation in the Chungdahm corpus is not exhaustive: many student errors are left unmarked by tutors
- This necessitates creating a re-annotated evaluation set
- 1,000 preposition contexts annotated by 3 trained annotators
- Inter-annotator agreement (0.860~0.910), kappa (0.662~0.804)
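The agreement and kappa figures are given as ranges, which suggests they were computed per annotator pair. The slide does not say which kappa variant was used; the sketch below assumes Cohen's kappa for two annotators, and the label sequences are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments by two annotators on four preposition contexts.
annotator_1 = ["in", "in", "at", "on"]
annotator_2 = ["in", "at", "at", "on"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Raw agreement here is 0.75, but kappa is lower because some of that agreement would occur by chance, which is the same pattern as in the reported ranges.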


Evaluation Results

11-way classification:
- works as an error correction (multi-outcome decision) model
- can be backed off to an error detection (binary decision) model

Omission errors (I wait /for you.)

<NULL,p>            accuracy
error correction    0.833

*Error detection is trivial for this type

Extraneous prep errors (So I go to/ home quickly.)

<p,NULL>            precision   recall
error correction    0.87        0.043
detection only      1.00        0.049

Selection errors (Adult give money at/on birthday.)

<p1,p2>             precision   recall
error correction    0.817       0.132
detection only      0.933       0.148
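The difference between the "error correction" and "detection only" rows can be made concrete. The sketch below assumes the usual definitions, which the slides do not spell out: the model flags an error when its prediction differs from the student's choice; a flag counts as a correct detection if the context really is an error, and as a correct correction only if the prediction also matches the gold preposition. The events are made up.

```python
def precision_recall(true_positives, flagged, actual_errors):
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_errors if actual_errors else 0.0
    return precision, recall


def evaluate(events):
    """events: (s, gold, predicted) triples for preposition contexts."""
    events = list(events)
    # The model flags an error whenever it disagrees with the student.
    flagged = [e for e in events if e[2] != e[0]]
    actual_errors = sum(1 for s, gold, _ in events if s != gold)
    detected = sum(1 for s, gold, _ in flagged if s != gold)
    corrected = sum(1 for _, gold, pred in flagged if pred == gold)
    return {
        "detection": precision_recall(detected, len(flagged), actual_errors),
        "correction": precision_recall(corrected, len(flagged), actual_errors),
    }


# Hypothetical events, not taken from the evaluation set.
events = [
    ("at", "in", "in"),   # error, flagged, right correction
    ("at", "on", "in"),   # error, flagged, wrong correction
    ("to", "to", "to"),   # non-error, left alone
    ("in", "at", "in"),   # error, missed
    ("of", "of", "for"),  # non-error, false alarm
]
scores = evaluate(events)
print(scores)
```

Under these definitions every correct correction is also a correct detection, so detection scores can never be lower than correction scores, as in the tables above.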


Related Work

Chodorow et al. (2007)
- Error detection model targeting 34 prepositions
- Trained on San Jose Mercury News + Lexile data
- 0.88 (precision) 0.16 (recall) for detecting selection errors

Gamon et al. (2008)
- Error detection and correction model of 13 prepositions
- One classifier to determine whether a preposition/article should be present; another for correct choice; an additional filter
- Trained on MS Encarta data, tested on Chinese learner writing
- 80% precision; recall not reported

Izumi et al. (2003, 2004)
- Trained on Standard Speaking Test Corpus (Japanese)
- 56 speakers, 6,216 sentences
- 25% precision and 7% recall on 13 grammatical error types


Comparison: Native-Corpus-Trained Models

Question: Will models trained on native-speaker-produced texts outperform our model?

The advantage of native corpora: They are plentiful.
- We allowed these models to have a larger training size.

Experimental setup:
- Build models on native corpora, using varying training set sizes (1mil – 5mil)
- Data: the Lexile Corpus, 7th and 8th grade reading levels
- A comparable feature set was employed


Learner Model vs. Native Models

Testing results on learner data (replacement errors only):

<p1,p2>                 error correction       error detection only
                        precision   recall     precision   recall
Learner (about 1 mil)   0.817       0.132      0.933       0.148
N-1mil                  0.416       0.106      0.536       0.132
N-2mil                  0.416       0.116      0.586       0.142
N-3mil                  0.453       0.099      0.594       0.126
N-4mil                  0.462       0.125      0.583       0.153
N-5mil                  0.484       0.121      0.605       0.147

- Learner model outperforms all native models
- Native models: performance gain with larger size insignificant beyond 2-3mil point
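To read the size of the gap off the table, one can compare the learner model's precision with the best precision reached by any native model. A small sketch using the table values; the dictionary layout and function name are illustrative:

```python
# (correction precision, correction recall, detection precision, detection recall)
RESULTS = {
    "Learner": (0.817, 0.132, 0.933, 0.148),
    "N-1mil": (0.416, 0.106, 0.536, 0.132),
    "N-2mil": (0.416, 0.116, 0.586, 0.142),
    "N-3mil": (0.453, 0.099, 0.594, 0.126),
    "N-4mil": (0.462, 0.125, 0.583, 0.153),
    "N-5mil": (0.484, 0.121, 0.605, 0.147),
}


def gap_to_best_native(results, column):
    """Learner score minus the best native score in the given column."""
    learner = results["Learner"][column]
    best_native = max(v[column] for name, v in results.items() if name != "Learner")
    return learner - best_native


print(round(gap_to_best_native(RESULTS, 0), 3))  # correction precision gap
print(round(gap_to_best_native(RESULTS, 2), 3))  # detection precision gap
```

The precision gap is roughly 0.33 in both settings, while the recall columns are much closer to each other.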


What Does This Prove?

Are the native models flawed? Bad feature set?
- No. In-set testing (against held-out native text) shows performance levels comparable to those in published studies

Could some of the performance gaps be due to genre differences?
- Highly likely. However, 7th-8th grade reading materials were the closest match we could find to student essays.

In sum: the native models’ advantage of a larger training size does not outweigh the learner model’s advantages: genre/text similarity and error annotation


Discussion: Learner language vs. native corpora

Modeling on native corpora:
- Produces a one-size-fits-all model of “native” English
- More generic & universally applicable?

Modeling on a learner corpus:
- Produces a model specific to the particular learner language
- Can it be applied to the language of other learner groups?
  - e.g., French citizens? Japanese-speaking English learners?

Combining two approaches:
- A system with specific models for different L1 backgrounds
- Plus a back-off “generic” model, built on native corpora
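The combined system proposed above could be organized as a simple dispatch on the learner's L1. This is only a sketch of the idea: the slides describe no implementation, and `make_corrector` and the stub models are hypothetical stand-ins for trained classifiers.

```python
from typing import Callable, Dict, Optional

Event = Dict[str, str]
Model = Callable[[Event], str]  # maps an event's features to a preposition


def make_corrector(l1_models: Dict[str, Model], generic_model: Model):
    """Use the L1-specific model when one exists, else the generic back-off."""

    def correct(event: Event, l1: Optional[str] = None) -> str:
        model = l1_models.get(l1, generic_model)
        return model(event)

    return correct


# Stub models standing in for trained classifiers (illustrative only).
def korean_learner_model(event: Event) -> str:
    return "in"


def native_corpus_model(event: Event) -> str:
    return event["s"]  # this stub simply keeps the student's choice


correct = make_corrector({"Korean": korean_learner_model}, native_corpus_model)
print(correct({"s": "at"}, l1="Korean"))    # handled by the L1-specific model
print(correct({"s": "at"}, l1="Japanese"))  # falls back to the generic model
```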


Discussion: The Problem of Partial Error Annotation

Partial error annotation problem:
- 57% of replacement errors and 85% of extraneous prepositions are left unmarked by Chungdahm tutors
- As a result, training data includes conflicting evidence

Our model’s low recall and high precision are affected by this:
- Model assumes a lower-than-true error rate
- Model has to reconcile conflicting sets of evidence
- When the model does flag an error, it does so with high confidence and accuracy

Solution?
- Bootstrapping, relabeling of unannotated errors
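The unmarked percentages imply how far the marked error counts understate the true ones. A back-of-the-envelope sketch, assuming the percentages apply uniformly across the training data (an assumption, not something the slides state):

```python
# Fractions of true errors left unmarked by tutors, as reported on the slide.
UNMARKED = {"replacement": 0.57, "extraneous": 0.85}


def true_errors_per_marked(unmarked_fraction: float) -> float:
    """If marked = true * (1 - u), then true / marked = 1 / (1 - u)."""
    if not 0 <= unmarked_fraction < 1:
        raise ValueError("unmarked fraction must be in [0, 1)")
    return 1.0 / (1.0 - unmarked_fraction)


for kind, fraction in UNMARKED.items():
    ratio = true_errors_per_marked(fraction)
    print(f"{kind}: about {ratio:.1f} true errors per marked error")
```

Under that assumption, each marked replacement error stands for roughly 2.3 actual ones and each marked extraneous preposition for roughly 6.7, which is consistent with the model assuming a lower-than-true error rate.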


Conclusions

- As language instruction turns digital, more and more (partially) error-annotated learner corpora like the Chungdahm corpus will become available
- Building a direct model of L2 errors, whenever such a corpus is available, offers an advantage over models based on native corpora, despite the partial annotation problem (if any)
- Exhaustive annotation is not necessary for learner-corpus-trained models to outperform standard native-text-trained models with much larger training data sets
