bl demo day - july2011 - (7) ocr profiler and post-correction

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. TR5 Profiler and Post-Correction System Ludwig-Maximilians-Universität München Centrum für Informations- und Sprachverarbeitung

Upload: impact-centre-of-competence

Post on 18-Jun-2015

DESCRIPTION

Jesse de Does gives a presentation on the LMU OCR Profiler and Post-Correction system, delivered at the British Library Demo Day on 12 July 2011.

TRANSCRIPT

Page 1: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

TR5 Profiler and Post-Correction System Ludwig-Maximilians-Universität München

Centrum für Informations- und Sprachverarbeitung

Page 2: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

TR5 Post-Correction System

User interface for easy postcorrection of historical OCR'd documents

– Stand-alone user interface
– Innovative language technology enables identification and presentation of recognition errors and efficient correction

Page 3: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Customizable user interface

Freely rearrangeable interface elements:

– OCR with image snippets
– Complete image
– Correction candidates / special functions

Screenshot labels: OCR and image fragments; Correction candidates, special functions; Complete image; Font size

Page 4: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

View: OCR and image clippings

Word-by-word presentation of recognized text and image clippings.

Comparison of text and image follows reading order and is much easier than side-by-side presentation of image and text.

Page 5: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

View: Original image

– For difficult cases
– When word segmentation by OCR fails
– Current word is highlighted

Page 6: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Word by word correction of text

– Correction by manual text entry
– Choosing correction candidates
– Faster correction thanks to candidates proposed by the postcorrection system

Page 7: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Batch correction: efficient postcorrection

Batch correction
– Several occurrences of an identical word

Page 8: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Batch correction: efficient postcorrection

Batch correction
– classes of systematic errors
– errors where the correction candidate has a high degree of certainty
– further possibilities
  – Frequent errors
  – For instance, location names
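As a rough illustration of the batch-correction idea, the following Python sketch groups identical OCR tokens and applies one confirmed correction to all of their occurrences. The function names and the toy token list are assumptions made for this example; they are not the interface of the post-correction system.

```python
from collections import defaultdict


def group_occurrences(tokens):
    """Map each distinct OCR token to the positions where it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return dict(positions)


def batch_correct(tokens, wrong, corrected):
    """Return a copy of tokens with every occurrence of `wrong` replaced."""
    return [corrected if token == wrong else token for token in tokens]


# Toy OCR output (hypothetical): "iheil" is a misrecognized "theil".
ocr_tokens = ["der", "iheil", "und", "der", "iheil"]
occurrences = group_occurrences(ocr_tokens)
fixed = batch_correct(ocr_tokens, "iheil", "theil")
```

Here one user decision on "iheil" fixes both occurrences at once, which is where the efficiency gain of batch correction comes from.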

Page 9: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Postcorrection system: Evaluation

User experiment with 14 individual instances

Result: Error correction thanks to text and error profiling is 2.7 times faster

Ulrich Reffle, 4 July 2011

Page 10: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Correction system

Page 11: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Correction system

Page 12: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Why another postcorrection system?

Targets more specialist audience

Thanks to underlying language technology:

– Historical variants are recognized and not marked as errors, even when not in historical lexicon
– Historical variants are proposed as correction candidates
– Typical error patterns are exploited
– Ranking of correction candidates

Page 13: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Underlying language technology

Lexica and language models help in dealing with orthographical variants and unknown words.

– Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
– Approximate search in “hypothetical lexica”
– An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors

Page 14: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Text and error profiles

Text profile
– Coverage of lexica
– Typical variant patterns
→ Targeted selection of lexica
→ Better language models
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Recall and precision in IR

Error profile
– Estimate of error rate
– Typical OCR errors
→ Better modeling of error channel
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Treatment of systematic errors

Page 15: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Underlying logic: Dual noisy channel model

Interpretation of OCR output tokens as result of two “noisy channels”

modern word u → (patterns) → historical variant v → (OCR errors) → OCR result w

Given an OCR token w, give possible interpretations of w in terms of
• “underlying” modern word u (IR!)
• correct historical word v and its derivation from u via “patterns”
• OCR errors garbling v into w
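The two channels can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not the LMU implementation: the lexicon and pattern lists are toy data, each channel applies at most one rewrite per word, and no probabilities are attached. All function and variable names are hypothetical.

```python
def apply_once(word, patterns):
    """Yield (result, pattern) for the unchanged word and for each single rewrite."""
    yield word, None
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)


def interpretations(w, modern_lexicon, variant_patterns, ocr_errors):
    """List (u, v, variant pattern, OCR error) tuples that explain the OCR token w."""
    found = []
    for u in modern_lexicon:
        # First channel: modern word u -> historical variant v.
        for v, pattern in apply_once(u, variant_patterns):
            # Second channel: historical variant v -> OCR result.
            for candidate, error in apply_once(v, ocr_errors):
                if candidate == w:
                    found.append((u, v, pattern, error))
    return found


# Toy data (assumed for illustration).
lexicon = ["teil", "heil"]
variant_patterns = [("t", "th")]
ocr_errors = [("t", "i")]

garbled = interpretations("iheil", lexicon, variant_patterns, ocr_errors)
clean = interpretations("theil", lexicon, variant_patterns, ocr_errors)
```

For the token "iheil" the sketch finds the modern word "teil", the historical variant "theil" (pattern t→th) and the OCR error t→i; for "theil" it finds the same modern word and variant with no OCR error.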

Page 16: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Historical variant and OCR error patterns

Historical variants: teil → theil

OCR error patterns: theil → iheil

Page 17: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’

Absolute frequency: Pattern was found 120 times in the current document.

Page 18: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Local view: interpretations of tokens

– Local view: “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.

Occurrence of spelling variant “i→y”:

Occurrence of OCR error “i→y”:

Page 19: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Global view: pattern frequencies

– Global view: Increment counters to estimate (relative) frequencies.

Occurrences of spelling variant “i→y”: +0.999771

Occurrences of OCR error “i→y”: +0.000224948
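The fractional increments shown above suggest that each token distributes a total weight of about 1 over its competing interpretations. A minimal Python sketch of such a counter update follows; the scores and names are assumed for illustration and do not reproduce the profiler's actual scoring.

```python
from collections import Counter


def add_fractional_counts(counters, scored_interpretations):
    """Add each interpretation's share of the token's total score to its counter."""
    total = sum(score for _, score in scored_interpretations)
    for label, score in scored_interpretations:
        counters[label] += score / total
    return counters


# One token with two competing readings (hypothetical scores).
counters = Counter()
add_fractional_counts(
    counters,
    [("variant i->y", 3.0), ("ocr error i->y", 1.0)],
)
```

With these toy scores the variant counter grows by 0.75 and the OCR error counter by 0.25, so an ambiguous token never counts more than once in total.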

Page 20: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Computation of profile: initialization

OCR result: w0, w1, w2, w3, …

Initial global profile: non-specific model with probabilities for
• Words
• Variant patterns
• Error

Page 21: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Computation of profile: global to local

OCR result: w0, w1, w2, w3, …

Initial global profile: non-specific model with probabilities for
• Words
• Variant patterns
• Error

Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … → … → …

Page 22: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Computation of profile: local to global

OCR result: w0, w1, w2, w3, …

Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … → … → …

Global profile: improved model with probabilities for
• Words
• Variant patterns
• Error

Page 23: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Computation of profile: iteration

OCR result: w0, w1, w2, w3, …

Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … → … → …

Global profile: improved model with probabilities for
• Words
• Variant patterns
• Error
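The alternation between local and global profile resembles an expectation-maximization style loop. The Python sketch below is a generic illustration of that idea under simplifying assumptions (each interpretation is reduced to a single pattern label, and the starting profile is uniform); it is not the profiler's actual algorithm, and all names are hypothetical.

```python
def estimate_profile(token_candidates, rounds=10):
    """Iteratively re-estimate pattern probabilities from ambiguous tokens.

    token_candidates holds, for each token, the pattern labels of its
    possible interpretations.
    """
    labels = sorted({label for cands in token_candidates for label in cands})
    # Initialization: non-specific (uniform) global profile.
    profile = {label: 1.0 / len(labels) for label in labels}
    for _ in range(rounds):
        counts = {label: 0.0 for label in labels}
        # Global to local: weight each token's interpretations by the profile.
        for cands in token_candidates:
            total = sum(profile[label] for label in cands)
            for label in cands:
                counts[label] += profile[label] / total
        # Local to global: renormalize the counters into an improved profile.
        grand_total = sum(counts.values())
        profile = {label: counts[label] / grand_total for label in labels}
    return profile


# Three unambiguous tokens and one ambiguous token (toy data).
tokens = [["hist:i->y"], ["hist:i->y"], ["hist:i->y"], ["hist:i->y", "ocr:i->y"]]
profile = estimate_profile(tokens)
```

In this toy run the unambiguous tokens pull the ambiguous one towards the historical-variant reading, so the estimated probability of the OCR error pattern shrinks with every round.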

Page 24: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Profiler Evaluation

Measure the quality

1. of global profiles

2. of OCR error detection

Challenges

– Measures not obvious
– Good evaluation data is difficult to gather
– Results need interpretation

Page 25: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Evaluation: Measures

(1) Global Profiles

Percentage of matches for the first 10 patterns in the ranked output lists

Two Values: Historical Patterns, OCR Patterns

(2) OCR Error Detection

Precision and Recall for the OCR errors detected by the Profiler

(3) Indirect evaluation

(For instance, by means of the postcorrection system)
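Measures (1) and (2) can be sketched as follows. Precision and recall use their standard definitions; the top-k match is one plausible reading of "percentage of matches for the first 10 patterns" and is an assumption, as are all function names.

```python
def precision_recall(detected, actual):
    """Standard precision and recall of detected items against the true items."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_match(ranked_predicted, ranked_reference, k=10):
    """Share of the first k predicted patterns found among the first k reference patterns."""
    top_predicted = ranked_predicted[:k]
    top_reference = set(ranked_reference[:k])
    if not top_predicted:
        return 0.0
    hits = sum(pattern in top_reference for pattern in top_predicted)
    return hits / len(top_predicted)


# Toy data: token positions flagged as errors vs. true error positions.
precision, recall = precision_recall([1, 2, 3, 4], [2, 3, 5])
match = top_k_match(["t->th", "i->y", "u->v"], ["i->y", "t->th", "e->ä"], k=3)
```

In the toy data two of the four flagged positions are real errors (precision 0.5) and two of the three real errors are found (recall 2/3).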

Page 26: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Evaluation: Data preparation

(1) Deep Evaluation:

For each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.

++ fully accurate
-- manual work

(2) Shallow Evaluation:

The OCR’ed document is automatically aligned with its re-typed ground truth. For each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth.

++ no manual work
-- not completely accurate
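The automatic alignment used in the shallow evaluation could look roughly like the sketch below, which uses `difflib.SequenceMatcher` from the Python standard library on token lists. The source does not say which alignment method was used, so the choice of `difflib` and the helper name `align_tokens` are assumptions.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens; an unmatched side becomes None."""
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    pairs = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ocr_part = ocr_tokens[i1:i2]
        truth_part = truth_tokens[j1:j2]
        # Pad the shorter side so insertions and deletions stay visible.
        for k in range(max(len(ocr_part), len(truth_part))):
            pairs.append((
                ocr_part[k] if k < len(ocr_part) else None,
                truth_part[k] if k < len(truth_part) else None,
            ))
    return pairs


# Toy example (hypothetical tokens).
ocr = ["der", "iheil", "und", "das"]
truth = ["der", "theil", "und", "das"]
pairs = align_tokens(ocr, truth)
errors = [(o, t) for o, t in pairs if o != t]
```

Every pair whose two sides differ is a candidate OCR error with its ground-truth reading. Alignment mistakes at this step are one reason why the shallow evaluation is not completely accurate.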

Page 27: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Evaluation: Data

Deep: Eckartshausen, 100 pages; Briefkunst, 40 pages

Shallow: 5 books each from the 16th, 17th and 18th century

Page 28: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Evaluation: Eckartshausen

(1) historical patterns
matches, first 10: 70%
precision, all: 68%
recall, all: 73%

(2) OCR patterns
matches, first 6: 67%
precision, all: 59%
recall, all: 19%

(3) OCR error detection
precision: 86%
recall: 46%

Page 29: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Graphical Evaluation: Eckartshausen

Page 30: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Graphical Evaluation: diacritics

[Plot with two series: Hist. Var. and OCR]

Page 31: BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction

Shallow Evaluation Results

Measure                                        16th century    17th century    18th century
HIST patterns, first 10                        60%             74%             78%
OCR patterns, first 10                         48%             70%             50%
Error detection precision                      95%             92%             81%
Error detection recall                         49%             43%             45%
Content word errors                            64%             44%             16%
Easy interactive correction per 10,000 words   ≈ 3000 words    ≈ 1892 words    ≈ 720 words
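The last row appears to follow from multiplying the number of words by the content word error rate and the error detection recall. This relation is inferred from the numbers and is not stated on the slide. A short Python check, with a hypothetical helper name:

```python
def easy_corrections(words, content_error_pct, recall_pct):
    """Words that are both erroneous content words and detected by the profiler."""
    return words * content_error_pct * recall_pct / 10_000


# (content word errors %, error detection recall %) per century, from the table.
figures = {"16th": (64, 49), "17th": (44, 43), "18th": (16, 45)}
estimates = {
    century: easy_corrections(10_000, error_pct, recall_pct)
    for century, (error_pct, recall_pct) in figures.items()
}
```

The sketch reproduces 1892 and 720 words exactly and gives 3136 words for the 16th century, which the slide rounds to ≈ 3000.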