The Detection and Correction of Real-word Errors in Dyslexic Text Jenny Pedler School of Computer Science & Information Systems Birkbeck College Research Presentation July 2002

Page 1

The Detection and Correction of Real-word Errors in Dyslexic Text

Jenny Pedler

School of Computer Science & Information Systems

Birkbeck College

Research Presentation July 2002

Page 2

Real-word errors

• However I gave (have) no idea what it represents.

• Now go to the macros button as shown bellow (below).

• Fred is away form (from) the 5-5-89 and this leaves us vary (very) exposed.

• there is no evidence on Bill Gates that I have herd (heard) of

Page 3

Work completed to date

Dyslexic error corpus

Investigation of possible approaches

Syntactic anomaly

Confusion set

Part-of-speech tag collocation experiment

Dictionary update

Page 4

Confusion Sets

{their, there, they're}  {form, from}

{weather, whether}  {were, where, we're}

{loose, lose}  {collage, college}

{their, there}  {form, from}

{weather, whether}  {were, where}  {loose, lose}
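A minimal sketch of how confusion sets like those above might be used to flag candidate words in a sentence. The names `CONFUSION_SETS`, `ALTERNATIVES` and `candidates` are illustrative assumptions, not part of the presentation, and the whitespace tokenisation is deliberately naive.

```python
# Confusion sets copied from the slide; everything else is a hypothetical sketch.
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"form", "from"},
    {"weather", "whether"},
    {"were", "where", "we're"},
    {"loose", "lose"},
    {"collage", "college"},
]

# Map each member word to the other members of its set.
ALTERNATIVES = {}
for members in CONFUSION_SETS:
    for word in members:
        ALTERNATIVES[word] = sorted(members - {word})


def candidates(sentence):
    """Return (position, word, alternatives) for every confusion-set word."""
    found = []
    for pos, token in enumerate(sentence.lower().split()):
        word = token.strip(".,;:!?()")
        if word in ALTERNATIVES:
            found.append((pos, word, ALTERNATIVES[word]))
    return found


flagged = candidates("Fred is away form the office")
```

Every word that belongs to a confusion set is flagged, whether or not it is an error; deciding between the word and its alternatives is left to a later step.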

Page 5

Calculating word|tag probabilities

• Count occurrences of each (tag, word) pair

• Calculate probability for immediately preceding and succeeding tags

P(tp|w), P(ts|w)

• Use Bayes' rule to calculate probability of word occurring given the tag

P(w|tp), P(w|ts)

• Store for use at run-time

Page 6

Using Bayes’ Rule

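\[
P(w_i \mid t_j) = \frac{P(t_j \mid w_i)\, P(w_i)}{P(t_j)}, \qquad \text{where } 1 \le i \le n,\; 1 \le j \le m
\]

\[
P(t_j \mid w_i) = \frac{|t_j, w_i|}{|w_i|}, \qquad
P(w_i) = \frac{|w_i|}{\sum_{k=1}^{n} |w_k|}
\]

Here |w_i| is the number of occurrences of word w_i.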

{w1,....,wn} Set of words

{t1,....,tm} Set of tags

|tj, wi| the number of occurrences of word wi collocating with tag tj
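The calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `word_given_tag` and the toy data are hypothetical, each word occurrence is assumed to contribute exactly one (tag, word) pair, and P(t) is obtained by summing P(t|w) P(w) over the word set, which the slides do not spell out.

```python
from collections import Counter, defaultdict


def word_given_tag(pairs):
    """Estimate P(w|t) from (tag, word) collocations using Bayes' rule."""
    pairs = list(pairs)
    pair_counts = Counter(pairs)                     # |t_j, w_i|
    word_counts = Counter(w for _, w in pairs)       # |w_i|
    total_words = sum(word_counts.values())

    # P(w_i) = |w_i| / sum_k |w_k|
    p_word = {w: c / total_words for w, c in word_counts.items()}
    # P(t_j | w_i) = |t_j, w_i| / |w_i|
    p_tag_given_word = {
        (t, w): c / word_counts[w] for (t, w), c in pair_counts.items()
    }
    # Assumption: P(t_j) = sum_i P(t_j | w_i) P(w_i)
    p_tag = defaultdict(float)
    for (t, w), p in p_tag_given_word.items():
        p_tag[t] += p * p_word[w]

    # Bayes' rule: P(w_i | t_j) = P(t_j | w_i) P(w_i) / P(t_j)
    return {
        (w, t): p_tag_given_word[(t, w)] * p_word[w] / p_tag[t]
        for (t, w) in pair_counts
    }


# Toy counts, invented purely for illustration.
toy = [("PRP", "their")] * 3 + [("PRP", "there")] + [("AJ0", "there")] * 2
probs = word_given_tag(toy)
```

The same routine would be run once for preceding tags and once for succeeding tags, with the resulting tables stored for use at run-time.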

Page 7

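Example: unlike their adult

Tags shown for the preceding word (unlike): AJ0, PRP
Tags shown for the succeeding word (adult): NN1, AJ0

                      there       their
P(w|tp)               0.005832    0.175066
P(w|ts)               0.002079    0.195774
P(w|tp) * P(w|ts)     0.000012    0.034723

Other per-tag values shown for the preceding position: 0.005774, 0.001514

Other per-tag values shown for the succeeding position: 0.084908, 0.001805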
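The decision step in this example can be sketched as follows. The helper `choose_word` is a hypothetical name, and the rule of picking the word with the largest product is an assumption read off the example rather than a confirmed description of the system. The probabilities are the ones shown for "there" and "their".

```python
def choose_word(scores):
    """Pick the confusion-set word with the largest P(w|tp) * P(w|ts).

    scores maps each candidate word to a (P(w|tp), P(w|ts)) pair.
    """
    products = {w: p_prev * p_next for w, (p_prev, p_next) in scores.items()}
    best = max(products, key=products.get)
    return best, products


# Values taken from the "unlike their adult" example.
scores = {
    "there": (0.005832, 0.002079),
    "their": (0.175066, 0.195774),
}
best, products = choose_word(scores)
```

If the chosen word matches the word in the text, it is accepted as correct; otherwise the chosen word would be proposed as a correction.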

Page 8

Initial Results

Correct usage
  Total   Accept as correct*   Propose correction
  470     389 (83%)            81

Error usage
  Total   Propose correction*   Accept as correct   Prop. incor.
  112     92 (82%)              17                  3
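The two percentages follow directly from the counts; a short check, with `rate` as a hypothetical helper name:

```python
def rate(hits, total):
    """Percentage of hits, rounded to the nearest whole number."""
    return round(100 * hits / total)


correct_rate = rate(389, 470)   # correct usages accepted as correct
error_rate = rate(92, 112)      # errors for which the right correction was proposed
```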

Page 9

Modifications

• Reduced tagset

• Combined probabilities

Page 10

Problems

• Target not in confusion set

the lose (loss) of   {loose, lose}

• Errors in the immediate context

grauate form (from) harved (graduate from Harvard)

in their teems (in their teens)

• Probabilities based on rare uses of a word

Page 11

Dictionary Update

• CUV2 – 70,000+ entries

• More precise word-frequency information

• Part-of-speech tags corresponding to BNC

• Additional entries: words occurring frequently in BNC but not in CUV2

Page 12

Further work

• Word collocation

weather: hot, wet, dry, warm, severe, heavy, adverse, warmer, windy, better

collage: paper, sticking, colourful, sound, brand, blue, postmodern, hessian, marble, cloth
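One way collocate lists like these could be gathered is by counting the words that appear within a small window of the target word. The sketch below is an assumption about the general technique, not the method used in this work; `collocates`, the window size and the toy sentences are all hypothetical, and a real version would also filter out function words.

```python
from collections import Counter


def collocates(sentences, target, window=2):
    """Count words occurring within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every neighbour in the window except the target itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


# Toy sentences, invented purely for illustration.
toy = ["the hot weather continued", "wet weather is forecast", "a paper collage"]
weather_counts = collocates(toy, "weather")
```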

• Increase the number of confusion sets

• Final testing