1
Na-Rae Han (University of Pittsburgh), Joel Tetreault (ETS),
Soo-Hwa Lee (Chungdahm Learning, Inc.), Jin-Young Ha (Kangwon University)
May 19 2010, LREC 2010
Using an Error-Annotated Learner Corpus to Develop an ESL/EFL Error Correction System
2
Objective
A feedback tool for detecting and correcting preposition errors
- I wait /for you. (<NULL,p>: omitted prep)
- So I go to/ home quickly. (<p,NULL>: extraneous prep)
- Adult give money at/on birthday. (<p1,p2>: selection error)
Why preposition errors?
- Preposition usage is one of the most difficult aspects of English for non-native speakers
- 18% of sentences from ESL essays contain a preposition error (Dalgish, 1985)
- 8-10% of all prepositions in TOEFL essays are used incorrectly (Tetreault and Chodorow, 2008)
3
Diagnosing L2 Errors
Statistical modeling on large corpora. But what kind?
1. General corpora composed of well-edited texts by native speakers (“native speaker corpora”)
Currently dominant approach
2. Error-annotated learner corpora: consist of texts written by ESL learners
Our approach
4
Our Learner Corpus
Chungdahm English Learner Corpus
- A collection of English essays written by Korean-speaking students of Chungdahm Institute, operated in S. Korea
- 130,754,000 words in 861,481 essays, written on 1,545 prompts
- Over 6.6 million error annotations in 4 categories: grammar, strategy, style, substance
- Non-exhaustive error marking (more on this later)
5
The Preposition Data Set
Our preposition data set
- The 11 “preposition” types: NULL, about, at, by, for, from, in, of, on, to, with
- These represent 99% of student error tokens in the data
- The text set consists of 20.5 million words
- 117,665 preposition errors
- 1,104,752 preposition non-errors
- Preposition error rate as marked in the data: 9.6%
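The 9.6% figure follows directly from the two counts on this slide. A quick check in Python (the variable names are ours, not the authors'):

```python
# Counts as reported on the slide.
errors = 117_665
non_errors = 1_104_752

# Every annotated preposition context is either an error or a non-error.
total = errors + non_errors
error_rate = errors / total

print(f"{total:,} preposition contexts, marked error rate {error_rate:.1%}")
```

Note that this is the rate as marked by tutors; because marking is non-exhaustive, the true rate is presumably higher.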
6
Method
- Cast error correction as a classification problem
- Train an 11-way Maximum Entropy classifier on preposition events extracted from the Chungdahm corpus
- A preposition annotation is represented as <s,c> (s: student’s prep choice, c: correct preposition), where s and c range over: { NULL, about, at, by, for, from, in, of, on, to, with }
  - s≠c for prep errors; s=c for non-errors
- A preposition event consists of:
  - Outcome (prediction target): c
  - Contextual features extracted from the immediate contexts surrounding preposition tokens, including the student’s original preposition choice (i.e., s)
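For reference, the standard form of a maximum entropy (multinomial logistic) classifier is given below. The slides do not spell out the model equations, so the symbols here are our notation: x is the feature context of an event, f_i are binary feature functions, λ_i are learned weights, and C is the set of 11 outcomes.

```latex
P(c \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, c)\right)}
                   {\sum_{c' \in C} \exp\left(\sum_i \lambda_i f_i(x, c')\right)},
\qquad
\hat{c} = \arg\max_{c \in C} P(c \mid x)
```

Under this reading, the system would propose a correction whenever the predicted outcome differs from the student's choice s. Any confidence threshold applied on top of that is not described in the slides.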
7
Preposition Context
- Student prep choice + 3 words to left and right
- MOD: Head of the phrase modified by the prep phrase
- ARG: Noun argument of the preposition
- MOD and ARG identified using Stanford Parser
Example text and annotation:
Snow  is  falling  there  at  the  winter  .
      -3  -2       -1     s   +1   +2      +3
          MOD                      ARG
<s,c>: <at,in>
8
Event Representation
Represented as an event:
Outcome: in
Features (24 total):
name        value
s           at
wd-1        there
wd+1        the
MOD         falling
ARG         winter
MOD_ARG     falling_winter
MOD_s_ARG   falling_at_winter
3GRAM       there_at_the
5GRAM       falling_there_at_the_winter
...         ...
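The event above can be reproduced with a small feature extractor. This is a minimal sketch, not the authors' code: `extract_event` and the `<PAD>` token are hypothetical, only the nine features listed on the slide are built (the remaining 15 are not specified), MOD and ARG are passed in by hand rather than read from a parser, and the student's preposition is assumed to be an overt token (the NULL case is not handled).

```python
def extract_event(tokens, prep_index, correct, mod, arg):
    """Build one preposition event: outcome c plus a subset of the slide's features."""
    s = tokens[prep_index]  # student's preposition choice

    def word(offset):
        # Token at a relative offset from the preposition; pad outside the sentence.
        i = prep_index + offset
        return tokens[i] if 0 <= i < len(tokens) else "<PAD>"

    features = {
        "s": s,
        "wd-1": word(-1),
        "wd+1": word(+1),
        "MOD": mod,
        "ARG": arg,
        "MOD_ARG": f"{mod}_{arg}",
        "MOD_s_ARG": f"{mod}_{s}_{arg}",
        "3GRAM": "_".join(word(o) for o in (-1, 0, 1)),
        "5GRAM": "_".join(word(o) for o in (-2, -1, 0, 1, 2)),
    }
    return {"outcome": correct, "features": features}


tokens = "Snow is falling there at the winter .".split()
# MOD and ARG are supplied manually here; the slides obtain them from the Stanford Parser.
event = extract_event(tokens, prep_index=4, correct="in", mod="falling", arg="winter")
for name, value in event["features"].items():
    print(f"{name:10} {value}")
```

Each name/value pair would typically be turned into one binary indicator feature for the classifier.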
9
Training and Testing
- Training set: 978,000 events
- The rest is set aside for evaluation and development
Creating an evaluation set for testing
- Error annotation in the Chungdahm corpus is not exhaustive: many student errors are left unmarked by tutors
- This necessitates creating a re-annotated evaluation set
- 1,000 preposition contexts annotated by 3 trained annotators
- Inter-annotator agreement: 0.860~0.910; kappa: 0.662~0.804
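The agreement and kappa ranges presumably come from comparing annotators pairwise; the slides do not say which kappa variant was used. Assuming Cohen's kappa, a minimal sketch looks like this (the labels below are toy data for illustration, not from the corpus):

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy judgments on ten preposition contexts.
annotator_1 = ["ok", "ok", "error", "ok", "error", "ok", "ok", "ok", "error", "ok"]
annotator_2 = ["ok", "ok", "error", "ok", "ok", "ok", "ok", "ok", "error", "ok"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement because it discounts the agreement two annotators would reach by chance, which is substantial when most contexts are non-errors.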
10
Evaluation Results
11-way classification
- works as an error correction (multi-outcome decision) model
- can be backed off to an error detection (binary decision) model
Omission errors (I wait /for you.)
<NULL,p>           accuracy
error correction   0.833
*Error detection is trivial for this type
Extraneous prep errors (So I go to/ home quickly.)
<p,NULL>           precision   recall
error correction   0.87        0.043
detection only     1.00        0.049
Selection errors (Adult give money at/on birthday.)
<p1,p2>            precision   recall
error correction   0.817       0.132
detection only     0.933       0.148
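The difference between the "error correction" and "detection only" rows can be illustrated with a small scorer. This is a sketch under our own assumptions about how the two settings are scored (the slides do not give the exact definitions): the model flags a context when its prediction differs from the student's choice; a flag counts as a detection hit if the context is truly an error, and as a correction hit only if the predicted preposition also matches the gold correction. The events are toy data.

```python
def precision_recall(events, mode):
    """events: list of (s, gold_c, predicted_c) triples; mode: 'detection' or 'correction'."""
    flagged = [e for e in events if e[2] != e[0]]      # model disagrees with the student
    gold_errors = [e for e in events if e[1] != e[0]]  # annotators disagree with the student
    if mode == "detection":
        hits = [(s, c, p) for (s, c, p) in flagged if c != s]
    elif mode == "correction":
        hits = [(s, c, p) for (s, c, p) in flagged if c != s and p == c]
    else:
        raise ValueError(f"unknown mode: {mode}")
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(gold_errors) if gold_errors else 0.0
    return precision, recall


toy_events = [
    ("at", "in", "in"),     # error, flagged, right correction
    ("at", "on", "in"),     # error, flagged, wrong correction
    ("to", "to", "to"),     # non-error, left alone
    ("in", "on", "in"),     # error, missed
    ("for", "for", "to"),   # non-error, false alarm
    ("of", "of", "of"),     # non-error, left alone
]
detection = precision_recall(toy_events, "detection")
correction = precision_recall(toy_events, "correction")
print("detection only   P=%.3f R=%.3f" % detection)
print("error correction P=%.3f R=%.3f" % correction)
```

Because every correction hit is also a detection hit, detection-only scores can never be lower than correction scores on the same predictions, which matches the pattern in the tables above.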
11
Related Work
Chodorow et al. (2007)
- Error detection model targeting 34 prepositions
- Trained on San Jose Mercury News + Lexile data
- 0.88 (precision), 0.16 (recall) for detecting selection errors
Gamon et al. (2008)
- Error detection and correction model of 13 prepositions
- One classifier to determine whether a preposition/article should be present; another for correct choice; an additional filter
- Trained on MS Encarta data, tested on Chinese learner writing
- 80% precision; recall not reported
Izumi et al. (2003, 2004)
- Trained on Standard Speaking Test Corpus (Japanese)
- 56 speakers, 6,216 sentences
- 25% precision and 7% recall on 13 grammatical error types
12
Comparison: Native-Corpus-Trained Models
Question: Will models trained on native-speaker-produced texts outperform our model?
The advantage of native corpora: They are plentiful.
- We allowed these models to have a larger training size.
Experimental setup:
- Build models on native corpora, using varying training set sizes (1mil – 5mil)
- Data: the Lexile Corpus, 7th and 8th grade reading levels
- A comparable feature set was employed
13
Learner Model vs. Native Models
Testing results on learner data (replacement errors only):
- Learner model outperforms all native models
- Native models: performance gain with larger size is insignificant beyond the 2-3mil point
<p1,p2>                error correction       error detection only
                       precision   recall     precision   recall
Learner (about 1mil)   0.817       0.132      0.933       0.148
N-1mil                 0.416       0.106      0.536       0.132
N-2mil                 0.416       0.116      0.586       0.142
N-3mil                 0.453       0.099      0.594       0.126
N-4mil                 0.462       0.125      0.583       0.153
N-5mil                 0.484       0.121      0.605       0.147
14
What Does This Prove?
Are the native models flawed? Bad feature set?
- No. In-set testing (against held-out native text) shows performance levels comparable to those in published studies
Could some of the performance gaps be due to genre differences?
- Highly likely. However, 7th-8th grade reading materials were the closest match we could find to student essays.
In sum:
- The native models’ advantage of a larger training size does not outweigh the learner model’s advantages: genre/text similarity and error annotation
15
Discussion: Learner language vs. native corpora
Modeling on native corpora:
- Produces a one-size-fits-all model of “native” English
- More generic & universally applicable?
Modeling on a learner corpus:
- Produces a model specific to the particular learner language
- Can it be applied to the language of other learner groups? E.g., French citizens? Japanese-speaking English learners?
Combining two approaches:
- A system with specific models for different L1 backgrounds
- Plus a back-off “generic” model, built on native corpora
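The combined system proposed above could be organized as a simple dispatch with a fallback. This is only an illustrative sketch: `make_corrector` and the stand-in models are hypothetical, and the slides do not describe an implementation.

```python
def make_corrector(l1_models, generic_model):
    """Route each event to the model for the learner's L1, else to the generic model."""
    def correct(features, l1):
        model = l1_models.get(l1, generic_model)
        return model(features)
    return correct


# Stand-in models for illustration only: each maps a feature dict to a predicted preposition.
def korean_learner_model(features):
    return "in"            # pretend a learner-corpus-trained model proposes "in"


def generic_native_model(features):
    return features["s"]   # pretend the native-corpus model keeps the student's choice


correct = make_corrector({"Korean": korean_learner_model}, generic_native_model)
print(correct({"s": "at"}, "Korean"))  # handled by the L1-specific model
print(correct({"s": "at"}, "French"))  # no French model, so the generic model is used
```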
16
Discussion: The Problem of Partial Error Annotation
Partial error annotation problem:
- 57% of replacement errors and 85% of extraneous prepositions are unchecked by Chungdahm tutors
- Training data includes conflicting evidence.
Our model’s low recall and high precision are affected by it
- Model assumes a lower-than-true error rate
- Model has to reconcile conflicting sets of evidence
- When the model does flag an error, it does so with high confidence and accuracy
Solution?
- Bootstrapping, relabeling of unannotated errors
17
Conclusions
- As language instruction turns digital, more and more (partially) error-annotated learner corpora like the Chungdahm corpus will become available
- Building a direct model of L2 errors, whenever such a corpus is available, offers an advantage over models based on native corpora, despite the partial annotation problem (if any)
- Exhaustive annotation is not necessary for learner-corpus-trained models to outperform standard native-text-trained models with much larger training data sets