The Detection and Correction of Real-word Errors
in Dyslexic Text
Jenny Pedler
School of Computer Science & Information Systems
Birkbeck College
Research Presentation July 2002
Real-word errors
• However I gave (have) no idea what it represents.
• Now go to the macros button as shown bellow (below).
• Fred is away form (from) the 5-5- 89 and this leaves us vary (very) exposed.
• there is no evidence on Bill Gates that I have herd (heard) of
Work completed to date
Dyslexic error corpus
Investigation of possible approaches
Syntactic anomaly
Confusion set
Part-of-speech tag collocation experiment
Dictionary update
Confusion Sets
{their, there, they're} {form, from}
{weather, whether} {were, where, we're}
{loose, lose} {collage, college}
{their, there} {form, from}
{weather, whether} {were, where} {loose, lose}
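A confusion set can be held as a plain set of words, with an index from each member to its set. The sketch below uses the six sets listed on this slide; the helper name `alternatives` and the lower-casing of input are assumptions made for illustration, not part of the presentation.

```python
# Confusion sets as listed on the slide.
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"form", "from"},
    {"weather", "whether"},
    {"were", "where", "we're"},
    {"loose", "lose"},
    {"collage", "college"},
]

# Index each member word to the set that contains it.
WORD_TO_SET = {word: cset for cset in CONFUSION_SETS for word in cset}


def alternatives(word):
    """Return the other members of the word's confusion set, sorted.

    Hypothetical helper: returns an empty list for words in no set.
    """
    key = word.lower()
    cset = WORD_TO_SET.get(key, set())
    return sorted(cset - {key})


print(alternatives("form"))   # ['from']
print(alternatives("There"))  # ['their', "they're"]
```

Only words that appear in some set are ever checked, which is why a target outside the set cannot be proposed as a correction.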
Calculating word|tag probabilities
• Count occurrences of each (tag, word) pair
• Calculate the probability of the immediately preceding and succeeding tags given the word: P(tp|w), P(ts|w)
• Use Bayes' rule to calculate the probability of the word occurring given the tag: P(w|tp), P(w|ts)
• Store for use at run-time
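The counting step can be sketched as follows. This is a minimal illustration, assuming the corpus is available as a list of (word, tag) pairs in text order; the function names and the toy sentence are invented for the example, and the tags merely imitate BNC-style labels.

```python
from collections import Counter, defaultdict


def neighbour_tag_counts(tagged, targets):
    """Count the tags immediately before and after each target word.

    tagged:  list of (word, tag) pairs in text order (assumed input format)
    targets: set of lower-case words to collect counts for
    """
    prev_counts = defaultdict(Counter)
    next_counts = defaultdict(Counter)
    for i, (word, _tag) in enumerate(tagged):
        w = word.lower()
        if w not in targets:
            continue
        if i > 0:
            prev_counts[w][tagged[i - 1][1]] += 1
        if i + 1 < len(tagged):
            next_counts[w][tagged[i + 1][1]] += 1
    return prev_counts, next_counts


def tag_given_word(counts):
    """Convert per-word tag counts into relative frequencies P(t|w)."""
    return {
        w: {t: c / sum(tag_counts.values()) for t, c in tag_counts.items()}
        for w, tag_counts in counts.items()
    }


# Toy tagged text, invented for illustration only.
toy = [
    ("unlike", "PRP"), ("their", "DPS"), ("adult", "AJ0"), ("friends", "NN2"),
    ("in", "PRP"), ("their", "DPS"), ("teens", "NN2"),
]
prev_c, next_c = neighbour_tag_counts(toy, {"their", "there"})
p_prev = tag_given_word(prev_c)  # P(tp|w)
p_next = tag_given_word(next_c)  # P(ts|w)
```

In this toy text "their" is preceded by PRP both times, so P(PRP|their) is 1.0, while the succeeding tag is split evenly between AJ0 and NN2.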
Using Bayes’ Rule
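$$P(w_i \mid t_j) = \frac{P(t_j \mid w_i)\,P(w_i)}{P(t_j)}, \quad \text{where } 1 \le i \le n,\ 1 \le j \le m$$

$$P(t_j \mid w_i) = \frac{|t_j, w_i|}{|w_i|}, \qquad P(w_i) = \frac{|w_i|}{\sum_{k=1}^{n} |w_k|}$$

{w1, ..., wn} Set of words
{t1, ..., tm} Set of tags
|tj, wi| the number of occurrences of word wi collocating with tag tj
|wi| the number of occurrences of word wi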
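As a rough illustration of Bayes' rule applied to the counts |tj, wi|, the sketch below turns collocation counts into P(w|t). It assumes that P(t) is obtained by summing P(t|w)P(w) over the word set, and the function name and all counts are invented for the example. With only two words in the toy set, the resulting probabilities are far larger than anything a full vocabulary would give.

```python
def word_given_tag(pair_counts, word_counts):
    """Return P(w|t) for every tag, using Bayes' rule on raw counts.

    pair_counts[(tag, word)] plays the role of |t_j, w_i|
    word_counts[word]        plays the role of |w_i|
    """
    total = sum(word_counts.values())
    p_word = {w: c / total for w, c in word_counts.items()}  # P(w)
    tags = {t for t, _w in pair_counts}
    result = {}
    for t in tags:
        # Numerator of Bayes' rule for each word: P(t|w) * P(w)
        joint = {
            w: (pair_counts.get((t, w), 0) / word_counts[w]) * p_word[w]
            for w in word_counts
        }
        p_tag = sum(joint.values())  # P(t), summed over the word set (assumption)
        result[t] = {w: (j / p_tag if p_tag else 0.0) for w, j in joint.items()}
    return result


# Invented counts, for illustration only.
word_counts = {"their": 200, "there": 300}
pair_counts = {
    ("PRP", "their"): 60, ("PRP", "there"): 15,
    ("AJ0", "their"): 2, ("AJ0", "there"): 6,
}
p_w_given_t = word_given_tag(pair_counts, word_counts)
```

Here P(their|PRP) works out to 0.12 / (0.12 + 0.03) = 0.8, and the values for each tag sum to 1 over the word set.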
unlike their adult

P(w|tp), tags shown: AJ0, PRP
  Values shown: 0.005774, 0.005832, 0.001514, 0.175066
  Values used: there 0.005832, their 0.175066

P(w|ts), tags shown: NN1, AJ0
  Values shown: 0.002079, 0.195774, 0.084908, 0.001805
  Values used: there 0.002079, their 0.195774

P(w|tp) * P(w|ts)
  there 0.000012
  their 0.034723
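The decision step for "unlike their adult" can be sketched as multiplying the two probabilities for each candidate and keeping the larger product. The sketch reads the example as using P(w|tp) = 0.005832 and P(w|ts) = 0.002079 for "there", and 0.175066 and 0.195774 for "their"; that reading, and the function name, are assumptions. Multiplying the two quoted values for "their" gives roughly 0.0343, slightly different from the figure on the slide, and the ranking is the same either way.

```python
def choose_word(p_prev, p_next):
    """Score each candidate by P(w|tp) * P(w|ts) and return the best one."""
    scores = {w: p_prev[w] * p_next[w] for w in p_prev}
    best = max(scores, key=scores.get)
    return best, scores


# Values read from the worked example (assignment to words is an assumption).
p_w_given_tp = {"there": 0.005832, "their": 0.175066}
p_w_given_ts = {"there": 0.002079, "their": 0.195774}

best, scores = choose_word(p_w_given_tp, p_w_given_ts)
print(best)  # their
```

Because "their" scores several thousand times higher than "there", the word in the text is accepted as correct in this context.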
Initial Results

Correct usage
  Total                 470
  Accept as correct*    389 (83%)
  Propose correction     81

Error usage
  Total                 112
  Propose correction*    92 (82%)
  Accept as correct      17
  Prop. incor.            3
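The percentages in the initial results follow directly from the raw counts. The short check below recomputes them; the variable names are chosen for the example, and the false-alarm figure at the end is derived from the same counts rather than quoted from the slide.

```python
# Counts from the initial results.
correct_total, correct_accepted, correct_flagged = 470, 389, 81
error_total, error_corrected, error_missed, error_wrong = 112, 92, 17, 3


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


accepted_pct = pct(correct_accepted, correct_total)   # correct usages accepted
corrected_pct = pct(error_corrected, error_total)     # errors given the right correction
false_alarm_pct = pct(correct_flagged, correct_total)  # correct usages wrongly flagged

print(accepted_pct, corrected_pct, false_alarm_pct)  # 83 82 17
```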
Modifications
• Reduced tagset
• Combined probabilities
Problems
• Target not in confusion set
  the lose (loss) of {loose, lose}
• Errors in the immediate context
  grauate form (from) harved (graduate from Harvard)
  in their teems (in their teens)
• Probabilities based on rare uses of a word
Dictionary Update
• CUV2 – 70,000+ entries
• More precise word-frequency information
• Part-of-speech tags corresponding to BNC
• Additional entries: words occurring frequently in BNC but not in CUV2
Further work
• Word collocation
  weather: hot, wet, dry, warm, severe, heavy, adverse, warmer, windy, better
  collage: paper, sticking, colourful, sound, brand, blue, postmodern, hessian, marble, cloth
• Increase the number of confusion sets
• Final testing
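One simple way the word-collocation lists could be put to use is to count how many of a candidate's collocates appear in the surrounding words. The slides do not say how collocations would be scored, so the overlap count, the function name, and the sample context below are all assumptions made for illustration.

```python
# Collocate lists as given for the word-collocation item.
COLLOCATES = {
    "weather": {"hot", "wet", "dry", "warm", "severe", "heavy",
                "adverse", "warmer", "windy", "better"},
    "collage": {"paper", "sticking", "colourful", "sound", "brand",
                "blue", "postmodern", "hessian", "marble", "cloth"},
}


def collocate_hits(candidate, context_words):
    """Count context words that are known collocates of the candidate.

    Hypothetical scoring: a plain overlap count, not a probability.
    """
    context = {w.lower() for w in context_words}
    return len(COLLOCATES.get(candidate, set()) & context)


context = "the hot dry summer".split()
weather_hits = collocate_hits("weather", context)
collage_hits = collocate_hits("collage", context)
print(weather_hits, collage_hits)  # 2 0
```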