A Cost-Efficient Approach to Correct OCR Errors in Large Document Collections
Deepayan Das, Jerin Philip, Minesh Mathew and C.V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad
Digital Library
A digital repository for books, accessible to people around the world
Digital Libraries
Popular digital libraries include:
Project                  #Books
Google Books Project     25 million (as of 2015)
Project Gutenberg        60,000
Million Books Project    1.5 million
Digital Libraries
● Easy access to millions of books and articles.
● Less cost in maintenance and support.
● Supports search and indexing.
Digital Libraries
Figure: digitization workflow. Scanning centers, OCR, annotator proofreads the text, access to millions of people.
OCRs are not always 100% accurate
● OCR is sensitive to the quality of document images.
● Degradations can result in words being misclassified.
Word images with their OCR predictions and ground truth (images not reproduced):
OCR prediction    Ground truth
Lord Cauning      Lord Canning
Cawnporo          Cawnpore
Dolhi             Delhi
rnorning          Morning
Information Retrieval on OCR text
OCR errors lead to differences in the ranking of the retrieved documents.
Post-processing for Large Document Collection
Project Gutenberg: G. B. Newby and C. Franks. “Distributed Proofreading.” JCDL, 2003.
Google Books: Von Ahn et al. “reCAPTCHA: Human-Based Character Recognition via Web Security Measures.” Science, 2008.
Motivation
● OCR makes consistent errors throughout a document collection.
Figure: word images and their corresponding predictions in a collection. Predictions shown: Juiiet (repeated), Qucen (repeated), and Camiing, Canning, Caniiing, Caiiing.
Motivation
● Books and collections have a finite vocabulary that repeats throughout the text.
A small subset of distinct words can cover more than 50% of the total words in a collection.
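The coverage claim can be checked on any tokenized collection. The following is a minimal sketch, assuming the collection is already available as a list of word tokens; the function name and the toy token list are hypothetical and do not come from the slides.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.5):
    """Return how many of the most frequent distinct words cover `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy token list (hypothetical data, not from the collections in the slides).
tokens = "the king and the queen and the lord of the city".split()
needed = words_needed_for_coverage(tokens, target=0.5)
print(needed, "of", len(set(tokens)), "distinct words cover at least 50% of the tokens")
```

On a real collection the same function gives the size of the high-frequency subset that a batch correction scheme would target first.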
Motivation
● Grouping and correcting high-frequency words can lead to a significant gain in word accuracy.
t-SNE Image Embedding
Maaten, Laurens van der and Hinton, Geoffrey. “Visualizing data using t-SNE”. JMLR, 2008
Reverse Annotation
Sankar et al. “Probabilistic Reverse Annotation For Large Scale Image Annotation.” CVPR, 2007.
Fusing Word Clusters
Rasagna et al. “Robust Recognition of Documents by Fusing Results of Word Clusters.” ICDAR, 2009.
Abdulkader and Casey. “Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment.” ICDAR, 2009.
Automatic Error Correction
Figure: character majority voting selects a cluster representative, which is propagated to all cluster elements.
● Word images are clustered in a feature space.
● A cluster representative is chosen for each cluster.
○ Rasagna et al. use character majority voting, where the most frequent character is taken at each time step.
Automatic Error Correction
● The voted label is propagated to all the cluster elements.
Figure: a cluster of error words with the label “thousand”. The two incorrect labels, “housan” and “thusiasn”, can be corrected with the method described above.
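Character majority voting over a cluster can be sketched as follows. This is a simplified illustration, not the implementation of Rasagna et al.: it assumes the predictions are compared position by position without any alignment step, and the cluster contents beyond the three labels shown in the figure are hypothetical.

```python
from collections import Counter

PAD = ""  # stands in for positions beyond the end of a shorter prediction


def character_majority_vote(predictions):
    """Pick the most frequent character at each position across all predictions."""
    if not predictions:
        return ""
    length = max(len(p) for p in predictions)
    voted = []
    for i in range(length):
        column = [p[i] if i < len(p) else PAD for p in predictions]
        winner, _ = Counter(column).most_common(1)[0]
        voted.append(winner)
    return "".join(voted)


# Assumed cluster: three correct predictions plus the two errors from the figure.
cluster = ["thousand", "thousand", "housan", "thusiasn", "thousand"]
representative = character_majority_vote(cluster)
# Label propagation: every cluster element receives the voted label.
corrected = [representative] * len(cluster)
print(representative)
```

Position-wise voting works here because the correct label is in the majority at every position; it degrades when insertions or deletions shift the characters, which is one reason an alignment step would be needed in practice.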
Automatic Error Correction
Figure: nearest neighbours of the image embedding for the word “money”. The error word (highlighted in red) can be corrected using character majority voting. Predictions shown: money (repeated), aoney, more.
Cluster Impurities
● Clusters cannot be expected to be completely homogeneous.
● Character majority voting can lead to error propagation.
Figure: the “money” cluster with an impurity. Predictions shown: money (repeated), “money,”, aoney, more.
Can a better clustering algorithm help?
MST on word predictions
● We further partition the clusters using a minimum spanning tree (MST).
● The nodes are the predictions.
● The edit distances between the predictions form the edge weights.
Minimum spanning tree
Figure: fully connected graph, minimum spanning tree, forest of individual trees.
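The partition step can be sketched as below. The slides do not say how the tree is cut into a forest, so the threshold `max_edge` is an assumption, as are the function names. The sketch runs Kruskal's algorithm and stops at the threshold, which yields the same components as building the full MST and then removing every edge heavier than the threshold.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def mst_partition(predictions, max_edge=1):
    """Split a cluster into sub-clusters joined only by edges of weight <= max_edge."""
    parent = list(range(len(predictions)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (edit_distance(predictions[i], predictions[j]), i, j)
        for i, j in combinations(range(len(predictions)), 2)
    )
    for weight, i, j in edges:
        if weight > max_edge:
            break  # all remaining edges are heavier and would be cut anyway
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    groups = {}
    for idx, pred in enumerate(predictions):
        groups.setdefault(find(idx), []).append(pred)
    return list(groups.values())


# Predictions similar to those shown for the "money" cluster.
cluster = ["money", "money,", "money!", "aoney", "more,"]
parts = mst_partition(cluster, max_edge=1)
print(parts)
```

With a threshold of one edit, the near-identical predictions stay together while the impurity "more," ends up in its own tree.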
MST on word predictions
Figure: cluster partition using MST. Predictions shown: “money!”, “money”, “money,”, “aoney”, “more,”.
What happens when all the predictions in a cluster are wrong?
Manual correction
Figure: a cluster with erroneous OCR predictions.
Automatic error correction by label propagation cannot achieve perfect word accuracy when clusters are not homogeneous.
Human vs Machine
27
● Humans accurate but slow.● High cost ● Machines fast but inaccurate.
● Error propagation.● A human will be needed to rectify
the mistakes incurred by the machine.
● Will lead to high cost.
Human Machine Collaboration
Proposed Method
A human should verify each cluster by:
1. Picking the cluster representative.
2. Choosing the cluster elements to which the label should be propagated.
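The two verification steps can be sketched as a single function. This is only an illustration under assumed inputs: the annotator's choices arrive as a representative string and a list of selected indices, and the function name and the example cluster are hypothetical.

```python
def propagate_verified_label(predictions, representative, selected):
    """Overwrite only the human-selected cluster elements with the representative label."""
    chosen = set(selected)
    return [representative if i in chosen else pred
            for i, pred in enumerate(predictions)]


# Hypothetical cluster: index 4 is an impurity that the annotator leaves untouched.
cluster = ["money", "aoney", "money", "moncy", "more"]
corrected = propagate_verified_label(cluster, "money", selected=[0, 1, 2, 3])
print(corrected)
```

Because the label reaches only the elements the annotator selected, an impurity in the cluster is not overwritten with a wrong label.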
Pipeline
Figure: pipeline. Components shown: CRNN-OCR [1] (word predictions), HWNet [2] (image features), error detection, clustering, error clusters (example: “falling,”), manual correction.
[1] Shi et al. “An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition.” PAMI, 2017.
[2] Krishnan et al. “HWNet v2: An Efficient Word Image Representation for Handwritten Documents.” IJDAR, 2018.
Implementation Details
Edit Actions
● Full typing (no dictionary involved)
● Type + select from dropdown (static dictionary)
● Type + select from dropdown (growing dictionary)
Datasets
● Fully annotated
○ English
■ 19 books
■ ~2.5k pages
○ Hindi
■ 32 books
■ ~5k pages
Figure: sample word images from the fully annotated dataset.
Evaluation Protocol
● For the fully annotated dataset:
○ Cost is measured in seconds spent by a human correcting a book.
○ We refer to it as the cost of correction (C).
○ We measure the cost of each method relative to full typing.
Cost of correction
C = w_1 c_t + w_2 c_d + w_3 c_v
c_t = typing cost
c_d = selection cost
c_v = verification cost
w_1 = number of error words that need typing
w_2 = number of error words whose correct alternative is present in the dictionary
w_3 = number of words that are correct but wrongly flagged as errors
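The cost formula is simple to evaluate. The sketch below uses hypothetical counts and per-word costs, since the slides give no numeric values. The full-typing baseline is also an assumption: every error word is typed, none is selected from a dictionary, and wrongly flagged words still incur the verification cost.

```python
def correction_cost(w1, w2, w3, ct, cd, cv):
    """Cost of correction C = w1*ct + w2*cd + w3*cv, in seconds."""
    return w1 * ct + w2 * cd + w3 * cv


# Hypothetical per-word costs in seconds and word counts (not from the slides).
ct, cd, cv = 5.0, 2.0, 1.0
w1, w2, w3 = 100, 300, 50

cost_with_dictionary = correction_cost(w1, w2, w3, ct, cd, cv)
# Assumed full-typing baseline: all w1 + w2 error words are typed.
cost_full_typing = correction_cost(w1 + w2, 0, w3, ct, cd, cv)
relative_cost = cost_with_dictionary / cost_full_typing
print(f"relative cost: {relative_cost:.2f}")
```

The relative cost falls as more error words move from the typing term to the cheaper selection term, which is the effect a dictionary is meant to produce.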
Results
Relative cost
Figure: relative cost of correction with respect to full typing when no clustering is involved.
Relative cost
Figure: cost of correction across different clustering techniques for automatic label selection and propagation, shown alongside the cost without clustering.
Qualitative Results
Figure: qualitative results of k-means + MST clustering. Images relevant to the cluster are marked correct, while the non-relevant ones (false positives) are crossed out.
Results: Fully Annotated Data (English)
Comparison between automatic and manual correction
● No false error propagation.
● Reduces cost of correction.
Scope for automatic error correction
Clustering on Large Scale Dataset
● We cluster on images from 100 unannotated books.
● Testing is done on 200 annotated pages.
● We use character majority voting (CMV) for label assignment to erroneous predictions.
● We compare the performance of CRNN and Tesseract OCR.
Results
Automatic error correction rectifies errors more accurately as the size of the collection increases.
● The gain in word accuracy is much larger for CRNN-OCR than for Tesseract OCR.
Clustering on Large Scale Dataset
We observe that as the size of the collection increases, CMV becomes better at picking the correct cluster representative, which is subsequently propagated to all the cluster elements.
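The gain in word accuracy can be measured as sketched below. The slides do not define the metric, so this assumes word accuracy is the fraction of words whose prediction exactly matches the ground truth; the function name and the example words are hypothetical.

```python
def word_accuracy(predictions, ground_truth):
    """Fraction of words whose prediction exactly matches the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Hypothetical example: one of two errors is fixed by automatic correction.
truth = ["Delhi", "Cawnpore", "Morning", "Delhi"]
before = ["Dolhi", "Cawnpore", "rnorning", "Delhi"]
after = ["Delhi", "Cawnpore", "rnorning", "Delhi"]
gain = word_accuracy(after, truth) - word_accuracy(before, truth)
print(f"gain in word accuracy: {gain:.2f}")
```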
Conclusion
● We proposed a cost-efficient batch correction scheme for reducing OCR errors.
● We also demonstrated how our approach can be scaled effectively to larger collections.
Future Work
● Active learning techniques to find clusters or subclusters that need post-processing.
● Adapting the recognizer to a collection, not just the post-processing module.
Thank You