DATeCH 2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion
DESCRIPTION
Presentation of the paper "Correcting Noisy OCR: Context Beats Confusion" by John Evershed and Kent Fitch at DATeCH 2014. #digidays
TRANSCRIPT
Automatic OCR correction http://overproof.projectcomputing.com
Correcting noisy OCR
- Context beats Confusion
[ presentation viewable at http://goo.gl/n85gR6 ]
who are we?
● Australian software company
● developers John and Kent
● we put theory into practice
main target - newspapers
● the first draft of history
● popular if made available
● usually poorly digitized
● too extensive for full human correction
goals
● run on commodity cloud server
● optimal for noisy text
● at least 1000 words/sec
● correct at least 50% of errors
division of labour
[Diagram labels: MANAGER, TRIAGE, CORE; bad, good; models, models]
snippets for the core
● prefer triaged good words at start/end
● column aware
● some easy corrections applied
● some suggestions supplied
● bag of topic words available
● surrounding noise level indicated
error contexts
● spell: vowals or consonnants
● type: you jit teh wrng key
● OCR: roprcroiitativcs cf thc Coveriuient
● random: anygh<eg 0at7happen
confusion cost matrix
93: w ← w
155: e ← e
3750: c ← e
4451: m ← rn
6652: rn ← m
11065: E ← m
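The following is a minimal sketch of how a confusion cost table like the one above could be used to score an OCR word against a candidate correction. It is not the authors' implementation. It assumes that each entry reads "OCR output ← intended text" (the same direction as the generated-suggestion examples later in the deck), that lower cost means a more likely confusion, and that segments are at most two characters long. `MATCH_COST`, `UNKNOWN_COST` and `MAX_SEG` are hypothetical values chosen only for illustration.

```python
from functools import lru_cache

# Costs copied from the slide, keyed as (ocr_segment, intended_segment).
CONFUSION_COST = {
    ("w", "w"): 93,
    ("e", "e"): 155,
    ("c", "e"): 3750,
    ("m", "rn"): 4451,
    ("rn", "m"): 6652,
    ("E", "m"): 11065,
}
MATCH_COST = 150      # assumed cost for an unchanged character not in the table
UNKNOWN_COST = 12000  # assumed cost for any confusion not in the table
MAX_SEG = 2           # assumed maximum segment length on either side


def segment_cost(ocr_seg, true_seg):
    """Cost of the OCR engine emitting ocr_seg where true_seg was printed."""
    if (ocr_seg, true_seg) in CONFUSION_COST:
        return CONFUSION_COST[(ocr_seg, true_seg)]
    if ocr_seg == true_seg and len(ocr_seg) == 1:
        return MATCH_COST
    return UNKNOWN_COST


def error_cost(ocr, truth):
    """Cheapest way to segment and align ocr against truth (dynamic programming)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(ocr) and j == len(truth):
            return 0
        costs = []
        for di in range(MAX_SEG + 1):
            for dj in range(MAX_SEG + 1):
                if di == 0 and dj == 0:
                    continue  # a step must consume something
                if i + di > len(ocr) or j + dj > len(truth):
                    continue
                step = segment_cost(ocr[i:i + di], truth[j:j + dj])
                costs.append(step + best(i + di, j + dj))
        return min(costs)

    return best(0, 0)


print(error_cost("rn", "m"))   # uses the single "rn ← m" entry
print(error_cost("wc", "we"))  # "w ← w" followed by "c ← e"
```

Allowing multi-character segments is what lets a single table entry cover confusions such as "rn" read for "m", which a character-by-character edit distance would count as two separate edits.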
word cost (e.g. rnorniny|morning)
language cost
● lexicon frequency
● entity list
● rare word list
● character 4-gram
error cost
● edit sum
● visual correlation
● generator hint
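As a rough illustration of combining a language cost with an error cost, the sketch below scores candidates for the slide's example "rnorniny". It is a simplification under stated assumptions, not the system described in the talk: the language cost uses only a lexicon frequency (negative log probability), the error cost is a plain Levenshtein distance times a fixed penalty, and the counts in `LEXICON_COUNTS` as well as `EDIT_PENALTY` and `OOV_COST` are made up.

```python
import math

# Hypothetical lexicon counts; a real lexicon would come from a large corpus.
LEXICON_COUNTS = {"morning": 5000, "mourning": 300, "mooring": 40}
TOTAL = sum(LEXICON_COUNTS.values())
EDIT_PENALTY = 2.0  # assumed cost per character edit
OOV_COST = 20.0     # assumed language cost for a word missing from the lexicon


def edit_distance(a, b):
    """Classic Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def language_cost(word):
    count = LEXICON_COUNTS.get(word)
    return OOV_COST if count is None else -math.log(count / TOTAL)


def word_cost(ocr, candidate):
    """Total cost: how unlikely the word is, plus how far it is from the OCR text."""
    return language_cost(candidate) + EDIT_PENALTY * edit_distance(ocr, candidate)


def best_candidate(ocr, candidates):
    return min(candidates, key=lambda c: word_cost(ocr, c))


print(best_candidate("rnorniny", ["morning", "mourning", "mooring", "rnorniny"]))
```

Keeping the OCR word itself in the candidate list means a correction is made only when some lexicon word is cheaper overall than leaving the text unchanged.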
word character confusion
m o r n i n g
r n o r n i n y
visual correlation
suggestion methods
● gift
● common, cached
● language
● entities
● split/join
● generated (magic)
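Of the suggestion methods listed, split/join is the easiest to sketch. The example below assumes a simple set-based lexicon and proposes a split only when both halves are known words, and a join only when the concatenation of two adjacent tokens is a known word. The `LEXICON` contents and both function names are hypothetical.

```python
# Hypothetical toy lexicon.
LEXICON = {"the", "government", "of", "to", "day", "today", "in", "deed", "indeed"}


def split_suggestions(token, lexicon=LEXICON):
    """Propose ways to split one token into two known words."""
    return [(token[:k], token[k:])
            for k in range(1, len(token))
            if token[:k] in lexicon and token[k:] in lexicon]


def join_suggestions(tokens, lexicon=LEXICON):
    """Propose joining adjacent tokens; returns (index of first token, joined word)."""
    return [(i, tokens[i] + tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in lexicon]


print(split_suggestions("ofthe"))
print(join_suggestions(["in", "deed", "of", "the"]))
```

In a fuller system these proposals would presumably be scored alongside the other suggestions rather than applied outright, since "today" and "to day" can both be valid depending on context.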
searching for gold (A*)
[Diagram: search tree of candidate characters; node labels include h, l, i, n, e, r, c, o, b, u, 1]
purple nodes: working priority queue
red nodes: output priority queue
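The slide's two queues suggest a best-first search that expands partial words from a working queue and emits completed words to an output queue. Below is a small A* sketch in that spirit, not the authors' generator. It assumes a toy table of confusion alternatives (`ALTERNATIVES`), a fixed cost for keeping a character (`KEEP_COST`), and a lexicon used to prune prefixes that cannot lead to a word. All names and costs are hypothetical. The heuristic is the number of unread OCR characters times the cheapest possible cost per character, which never overestimates the remaining cost.

```python
import heapq

# Hypothetical confusion alternatives: OCR segment -> [(intended segment, cost)].
ALTERNATIVES = {
    "rn": [("m", 3.0)],
    "y": [("g", 4.0)],
    "c": [("e", 2.0)],
    "l": [("i", 2.5)],
}
KEEP_COST = 0.5  # assumed cost of accepting an OCR character unchanged
LEXICON = {"morning", "mourning", "moaning", "the"}
PREFIXES = {w[:k] for w in LEXICON for k in range(len(w) + 1)}

# Lower bound on the cost of consuming one OCR character (keeps the heuristic admissible).
MIN_PER_CHAR = min([KEEP_COST] + [cost / len(seg)
                                  for seg, alts in ALTERNATIVES.items()
                                  for _, cost in alts])


def expansions(ocr, pos):
    """Yield (emitted text, cost, OCR characters consumed) for position pos."""
    yield ocr[pos], KEEP_COST, 1
    for seg, alts in ALTERNATIVES.items():
        if ocr.startswith(seg, pos):
            for intended, cost in alts:
                yield intended, cost, len(seg)


def a_star_suggestions(ocr, limit=3):
    """Return up to `limit` lexicon words as (cost, word), cheapest first."""
    working = [(len(ocr) * MIN_PER_CHAR, 0.0, 0, "")]  # (f, g, pos, prefix)
    output = []
    seen = set()
    while working and len(output) < limit:
        _, g, pos, prefix = heapq.heappop(working)
        if (pos, prefix) in seen:
            continue
        seen.add((pos, prefix))
        if pos == len(ocr):
            if prefix in LEXICON:
                output.append((g, prefix))
            continue
        for text, cost, used in expansions(ocr, pos):
            candidate = prefix + text
            if candidate in PREFIXES:  # prune paths that cannot reach a word
                g2 = g + cost
                h = (len(ocr) - pos - used) * MIN_PER_CHAR
                heapq.heappush(working, (g2 + h, g2, pos + used, candidate))
    return output


print(a_star_suggestions("rnorniny"))
print(a_star_suggestions("thc"))
```

Because states leave the working queue in order of estimated total cost, completed words reach the output in increasing cost order, so the search can stop as soon as enough suggestions have been produced.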
amazing generated suggestions
Parhumuitar} ← Parliamentary
I.iulwuvB ← Railways
Itegtniont ← Regiment
niltfltory ← adultery
uj.rccu.eut ← agreement
couniutfc.o ← committee
cnuipuii ← company
dctoimiuatJOu ← determination
uiidcrtkikcr’a ← undertaker’s
selecting best combination
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
bohavlour: behaviour, behavour, behavior, Behaviour, behaviours, behaving
abonf: about, above, along, been
am: am, an, a, in, as
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
disgrie: disgrace, disagree, disguise, desire, degree, disease
[NOTE: word joins and splits are also supported]
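Choosing one candidate per position so that the whole sequence is cheapest can be sketched as a Viterbi-style dynamic program. The example below is an illustration under assumptions, not the method from the paper: it uses only per-word costs and a bigram context cost, and every number in it (the word costs, the `bigram_cost` table and `default_bigram`) is invented for the example.

```python
def best_combination(candidates, bigram_cost, default_bigram=5.0):
    """Pick one word per position minimising word costs plus bigram context costs.

    candidates: list (one entry per position) of lists of (word, word_cost).
    bigram_cost: dict mapping (previous_word, word) to a context cost.
    """
    # best[word] = (total cost of the cheapest path ending in word, that path)
    best = {word: (cost, [word]) for word, cost in candidates[0]}
    for column in candidates[1:]:
        current = {}
        for word, cost in column:
            total, path = min(
                ((prev_total + bigram_cost.get((prev_word, word), default_bigram) + cost,
                  prev_path)
                 for prev_word, (prev_total, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            current[word] = (total, path + [word])
        best = current
    total, path = min(best.values(), key=lambda item: item[0])
    return path, total


# Hypothetical costs: "unsightly" is the cheaper word on its own,
# but the context cost favours "unseemly behaviour".
candidates = [
    [("unsightly", 1.0), ("unseemly", 1.5)],
    [("behaviour", 1.0), ("behaving", 2.0)],
]
bigram_cost = {("unseemly", "behaviour"): 0.5}

print(best_combination(candidates, bigram_cost))
```

In this toy case the context term overrides the per-word ranking, which is the sense in which context can beat confusion.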
training
● 5-grams - subset selection
● corpus 1,2,3-grams - statistical build
● extra word lists - easy
● error model - bootstrap or new pairs
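The statistical build of corpus 1-, 2- and 3-grams can be pictured as simple counting over tokenised text. This is only a sketch of that counting step, assuming whitespace tokenisation; the function name `ngram_counts` is hypothetical, and the subset selection of 5-grams mentioned above is not covered.

```python
from collections import Counter


def ngram_counts(tokens, max_n=3):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)
print(counts[1].most_common(1))
print(counts[2][("the", "cat")])
```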
testing
● 65,000 words of ground truth, including foreign (US) newspapers
● all measures exceeded goal:
○ search errors (article word types)
○ read errors (article word tokens)
○ entropy weighted term errors
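The slides do not define these measures precisely, so the following is only one plausible reading. "Search" quality is taken as the fraction of distinct ground-truth words (types) that still appear somewhere in the OCR text of the article, and "read" quality as the fraction of ground-truth word occurrences (tokens) with no matching OCR occurrence. Both functions compare bags of words and ignore word order, which is a simplifying assumption. The entropy-weighted measure is not sketched.

```python
from collections import Counter


def type_recall(truth_tokens, ocr_tokens):
    """Fraction of distinct ground-truth words that can be found in the OCR text."""
    truth_types = set(truth_tokens)
    return len(truth_types & set(ocr_tokens)) / len(truth_types)


def token_error_rate(truth_tokens, ocr_tokens):
    """Fraction of ground-truth word occurrences without a matching OCR occurrence."""
    truth, ocr = Counter(truth_tokens), Counter(ocr_tokens)
    matched = sum((truth & ocr).values())  # multiset intersection
    return 1 - matched / sum(truth.values())


truth = "the cat sat on the mat".split()
ocr = "the cat sat on thc mat".split()
print(type_recall(truth, ocr))       # every word type is still searchable
print(token_error_rate(truth, ocr))  # but one of six tokens is misread
```

The toy example shows why the two measures can differ: a word that is misread once but correct elsewhere in the article still counts as found for search, while it remains an error for a reader.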
SMH sample
Measure              Before   After   Change
Recall               83.8%    94.1%   recall misses reduced 63.3%
Raw Error Rate       18.5%    5.5%    errors reduced 70.1%
Weighted Error Rate  16.2%    6.7%    weighted errors reduced 59.4%
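The "reduced" figures read as relative reductions, that is, (before - after) / before. The sketch below recomputes them from the rounded before/after percentages for the SMH sample. The function name is hypothetical, and treating "recall misses" as 100% minus recall is an assumption.

```python
def relative_reduction(before, after):
    """Percentage by which a rate has fallen relative to its starting value."""
    return 100.0 * (before - after) / before


smh = {
    "recall misses": (100.0 - 83.8, 100.0 - 94.1),  # assumed: misses = 100% - recall
    "raw errors": (18.5, 5.5),
    "weighted errors": (16.2, 6.7),
}
for name, (before, after) in smh.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% reduction")
```

Recomputing from the rounded percentages lands within about a point of the figures on the slide. The small differences are presumably because the slide's reductions were calculated from unrounded counts.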
Questions?
Presentation viewable at http://goo.gl/n85gR6
National Library of Australia’s TROVE
● 1.4m distinct visitors/month
● 16m pageviews/month
● 80% of usage is old newspapers
○ 13m pages, over 600 titles
○ 85k lines corrected/day
Even this massive volunteer effort cannot keep up
● < 2% of errors have been corrected
● % corrected is declining
● Hence searching is unreliable, and OCR’ed text is hard to read and reuse
● Trove’s accuracy is “typical”
159 randomly selected news articles from The Sydney Morning Herald
47.4K words hand-corrected to ground truth
SMH sample
Measure                Before   After   Change
Recall                 83.8%    94.1%   recall misses reduced 63.3%
False positive recall  26.7%    9.1%    false positives reduced 65.8%
Raw Error Rate         18.5%    5.5%    errors reduced 70.1%
Weighted Error Rate    16.2%    6.7%    weighted errors reduced 59.4%
49 randomly selected news articles from LoC Chronicling America
18.1K words hand-corrected to ground truth
LOC sample
Measure                Before   After   Change
Recall                 84.0%    93.1%   recall misses reduced 56.6%
False positive recall  23.6%    8.8%    false positives reduced 62.8%
Raw Error Rate         19.1%    6.4%    errors reduced 66.7%
Weighted Error Rate    16.0%    7.7%    weighted errors reduced 51.8%