robust error detection: a hybrid approach combining unsupervised error detection and linguistic...
Post on 19-Dec-2015
221 views
TRANSCRIPT
![Page 1: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/1.jpg)
Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge
Johnny Bigert and Ola Knutsson
Royal Institute of TechnologyStockholm, Sweden
![Page 2: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/2.jpg)
Detection of context-sensitive spelling errors Identification of less-frequent
grammatical constructions in the face of sparse data
Hybrid method Unsupervised error detection Linguistic knowledge used for phrase
transformations
![Page 3: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/3.jpg)
Properties Find difficult error types in
unrestricted text (spelling errors resulting in an existing word etc.)
No prior knowledge required, i.e. no classification of errors or confusion sets
![Page 4: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/4.jpg)
A first approach
Algorithm:
for each position i in the streamif the frequency of (ti-1 ti ti+1) is
lowreport error to the user
report no error
![Page 5: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/5.jpg)
Sparse data
Problems: Data sparseness for trigram
statistics Phrase and clause boundaries may
produce almost any trigram
![Page 6: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/6.jpg)
Sparse dataExample: ”It is every manager's task to…” ”It is every” is tagged (pn.neu.sin.def.sub/obj,
vb.prs.akt, dt.utr/neu.sin.ind) and has a frequency of zero
Probable cause: out of a million words in the corpus, only 709 have been assigned the tag (dt.utr/neu.sin.ind)
![Page 7: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/7.jpg)
Sparse data
We try to replace ”It is every manager's task to…”
with ”It is a manager's task to…”
![Page 8: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/8.jpg)
Sparse data ”It is every” is tagged
(pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr/neu.sin.ind) and had a frequency of 0
”It is a” is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr.sin.ind) and have a frequency of 231
(dt.utr/neu.sin.ind) had a frequency of 709 (dt.utr.sin.ind) has a frequency 19112
![Page 9: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/9.jpg)
Tag replacements
When replacing a tag: All tags are not suitable as
replacements All replacements are not equally
appropriate… …and thus, we require a penalty or
probability for the replacement
![Page 10: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/10.jpg)
Tag replacements
To be considered: Manual work to create the
probabilities for each tag set and language
The probabilities are difficult to estimate manually
Automatic estimation of the probabilities (other paper)
![Page 11: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/11.jpg)
Tag replacements
Examples of replacement probabilities:
Mannen var glad. (The man was happy.)
Mannen är glad. (The man is happy.)
100% vb.prt.akt.kop vb.prt.akt.kop
74% vb.prt.akt.kop vb.prs.akt.kop
50% vb.prt.akt.kop vb.prt.akt____
48% vb.prt.akt.kop vb.prt.sfo
![Page 12: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/12.jpg)
Tag replacements
Examples of replacement probabilities:
Mannen talar med de anställda. (The man talks to the employees.)
Mannen talar med våra anställda. (The man talks to our employees.)
100% dt.utr/neu.plu.def dt.utr/neu.plu.def
44% dt.utr/neu.plu.def dt.utr/neu.plu.ind/def
42% dt.utr/neu.plu.def ps.utr/neu.plu.def
41% dt.utr/neu.plu.def jj.pos.utr/neu.plu.ind.nom
![Page 13: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/13.jpg)
Weighted trigrams
Replacing (t1 t2 t3) with (r1 r2 r3): f = freq(r1 r2 r3) · penalty
penalty = Pr[replace t1 with r1] · Pr[replace t2 with r2] · Pr[replace t3 with r3]
![Page 14: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/14.jpg)
Weighted trigrams
Replacement of tags: Calculate f for all representatives
for t1 , t2 and t3 (typically 3 · 3 · 3 of them)
The weighted frequency is the sum of the penalized frequencies
![Page 15: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/15.jpg)
Algorithm
Algorithm:
for each position i in the streamif weighted freq for (ti-1 ti ti+1) is low
report error to the userreport no error
![Page 16: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/16.jpg)
An improved algorithm Problems with sparse data Phrase and clause boundaries may
produce almost any trigram Use clauses as the unit for error
detection to avoid clause boundaries
![Page 17: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/17.jpg)
Phrase transformations We identify phrases to transform
rare constructions to those more frequent
Replacing the phrase with its head Removing phrases (e.g. AdvP, PP)
![Page 18: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/18.jpg)
Phrase transformations
Example: Alla hundar som är bruna är lyckliga
(All dogs that are brown are happy)
Hundarna är lyckliga
(The dogs are happy)
NP
NP
![Page 19: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/19.jpg)
Phrase transformations Den bruna (jj.sin) hunden (the brown
dog) De bruna (jj.plu) hundarna (the brown
dogs)
![Page 20: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/20.jpg)
Phrase transformations
The same example with a tagging error: Alla hundar som är bruna (jj.sin) är lyckliga
(All dogs that are brown are happy)
Robust NP detection yield
Hundarna är lyckliga
(The dogs are happy)
NP
NP
![Page 21: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/21.jpg)
Results
Error types found: context-sensitive spelling errors split compounds spelling errors verb chain errors
![Page 22: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/22.jpg)
Comparison between probabilistic methods The unsupervised method has a
good error capacity but also a high rate of false alarms
The introduction of linguistic knowledge dramtically reduces the number of false alarms
![Page 23: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/23.jpg)
Future work The error detection method is not
only restricted to part-of-speech tags - we consider adopting the method to phrase n-grams
Error classification Generation of correction
suggestions
![Page 24: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/24.jpg)
Summing up Detection of context-sensitive
spelling errors Combining an unsupervised error
detection method with robust shallow parsing
![Page 25: Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3e5503460f94a17895/html5/thumbnails/25.jpg)
Internal Evaluation POS-tagger: 96.4% NP-recognition: P=83.1% and
R=79.5% Clause boundary recognition:
P=81.4% and 86.6%