A Cost Efficient Approach to Correct OCR Errors in Large Document Collections
Deepayan Das, Jerin Philip, Minesh Mathew and C.V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad
Digital Library
A digital repository for books, accessible to people around the world
Digital Libraries
Popular digital libraries include:
| Project | #Books |
| --- | --- |
| Google Books Project | 25 million (as of 2015) |
| Project Gutenberg | 60,000 |
| Million Books Project | 1.5 million |
Digital Libraries
● Easy access to millions of books and articles.
● Lower maintenance and support costs.
● Supports search and indexing.
Digital Libraries
Fig. Workflow: Scanning centers → OCR → Annotator proofreads the text → Access to millions of people
OCRs are not always 100% accurate
OCR
● OCR is sensitive to the quality of document images.
● Degradations can result in words being misclassified.
| OCR prediction | Ground truth |
| --- | --- |
| Lord Cauning | Lord Canning |
| Cawnporo | Cawnpore |
| Dolhi | Delhi |
| rnorning | Morning |
Information Retrieval on OCR text
OCR errors change the ranking of the retrieved documents.
Post-processing for Large Document Collections
● Project Gutenberg: G. B. Newby and C. Franks. “Distributed proofreading.” JCDL, 2003.
● Google Books: von Ahn et al. “reCAPTCHA: Human-based character recognition.” Science, 2008.
Motivation
● OCR makes consistent errors throughout a document collection.
Fig. Word images and their corresponding predictions in a collection. Predictions shown: “Juiiet” (repeated); “Qucen” (repeated); “Camiing”, “Canning”, “Caniiing”, “Caiiing”.
Motivation
● Books/collections have a finite vocabulary that repeats throughout the book.
A small subset of words can cover more than 50% of the total words in a collection.
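The coverage claim above can be checked on any tokenized text. The sketch below is only an illustration: the function name `words_needed_for_coverage` and the toy sentence are assumptions, not part of the original work.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.5):
    """Count how many of the most frequent distinct words cover `target` of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for rank, (_word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy text (hypothetical): 14 tokens, 8 distinct words.
tokens = "the queen and the lord met the queen in delhi and the lord left".split()
print(words_needed_for_coverage(tokens), "of", len(set(tokens)), "distinct words cover 50% of tokens")
```

On a real collection the same count would be taken over the OCR output or the ground truth text.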
Motivation
● Grouping and correcting high-frequency words can lead to a significant gain in word accuracy.
t-SNE Image Embedding
Maaten, Laurens van der and Hinton, Geoffrey. “Visualizing data using t-SNE”. JMLR, 2008
Reverse Annotation
Sankar et al. “Probabilistic Reverse Annotation For Large Scale Image Annotation.” CVPR, 2007.
Fusing Word Clusters
Rasagna et al. “Robust Recognition of Documents by Fusing Results of Word Clusters.” ICDAR, 2009.
Abdulkader and Casey. “Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment.” ICDAR, 2009.
Automatic Error Correction
Character Majority Voting
● Word images are clustered in a feature space.
● A cluster representative is chosen for each cluster.
○ Rasagna et al. use character majority voting, where the most frequent character is taken at each time step.
Fig. Cluster representative, propagated to all cluster elements.
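A minimal sketch of character majority voting, under a simplifying assumption: predictions are compared position by position without any alignment, and shorter predictions are padded with a hypothetical `PAD` symbol. The function name is illustrative and this is not the implementation of Rasagna et al.

```python
from collections import Counter
from itertools import zip_longest

PAD = "\0"  # hypothetical padding symbol for predictions shorter than the longest one


def character_majority_vote(predictions):
    """Take the most frequent character at each position across all predictions."""
    voted = []
    for column in zip_longest(*predictions, fillvalue=PAD):
        char, _count = Counter(column).most_common(1)[0]
        voted.append(char)
    # Positions where padding won the vote contribute no character.
    return "".join(voted).replace(PAD, "")


cluster = ["money", "money", "money", "aoney", "more"]
label = character_majority_vote(cluster)
corrected = [label] * len(cluster)  # the voted label is propagated to every element
print(label)
```

Without alignment, an inserted or deleted character shifts every later position, so a practical version would likely align the predictions first. Every element, including an unrelated word such as “more”, receives the voted label.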
Automatic Error Correction
● Voted label is propagated to all the cluster elements.
Fig. A cluster of word images with the label “thousand”. The two incorrect predictions, “housan” and “thusiasn”, can be corrected with the method above.
Automatic Error Correction
Fig. Nearest neighbours of the image embedding for the word “money”. Predictions shown: “money” (repeated), “aoney”, “more”. The error word (highlighted in red) can be corrected using character majority voting.
Cluster Impurities
● Clusters cannot be completely homogeneous.
● Character majority voting can lead to error propagation.
Fig. A cluster of “money” word images whose predictions include “money,”, “aoney” and “more”; an impurity is marked.
Can a better clustering algorithm help?
MST on word predictions
● We further partition the clusters using a minimum spanning tree (MST).
● The nodes are the predictions.
● The edit distances between the predictions form the edge weights.
Minimum spanning tree
Fig. Fully connected graph → Minimum spanning tree → Forest of individual trees
MST on word predictions
Fig. Cluster partition using MST. Predictions in the cluster: “money!”, “money”, “money,”, “money,”, “money”, “money”, “aoney”, “more,”.
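The partition step can be sketched as follows. The slides do not say how the spanning tree is split into a forest, so cutting every tree edge heavier than a threshold (`max_edge`, a hypothetical parameter) is an assumption, as are all function names.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def mst_partition(predictions, max_edge=1):
    """Build an MST over predictions, cut heavy edges, return the resulting groups."""
    n = len(predictions)
    edges = sorted((edit_distance(predictions[i], predictions[j]), i, j)
                   for i, j in combinations(range(n), 2))
    # Kruskal's algorithm: keep an edge only if it joins two separate components.
    parent = list(range(n))
    mst = []
    for weight, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            mst.append((weight, i, j))
    # Assumed cutting rule: drop MST edges heavier than max_edge to obtain a forest.
    parent = list(range(n))
    for weight, i, j in mst:
        if weight <= max_edge:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for idx, word in enumerate(predictions):
        groups.setdefault(find(parent, idx), []).append(word)
    return list(groups.values())


predictions = ["money!", "money", "money,", "money,", "money", "money", "aoney", "more,"]
parts = mst_partition(predictions, max_edge=1)
print(parts)
```

With this threshold the impurity “more,” ends up in its own tree, while the punctuation and single-character variants stay grouped with “money”. A threshold of 0 would instead keep only identical predictions together.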
What happens when all the predictions in a cluster are wrong?
Manual correction
Fig. A cluster with erroneous OCR predictions.
Automatic error correction by label propagation will not achieve perfect word accuracy when clusters are not homogeneous.
Human vs Machine
● Humans: accurate but slow.
○ High cost.
● Machines: fast but inaccurate.
○ Error propagation.
○ A human will be needed to rectify the mistakes made by the machine.
○ Will lead to high cost.
Human Machine Collaboration
Proposed Method
A human should verify each cluster by:
1. Picking the cluster representative.
2. Choosing the cluster elements to which the label should be propagated.
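The two verification steps can be sketched as a selective propagation. The function name and the index-based selection are assumptions made for illustration; the slides do not describe the annotation interface.

```python
def propagate_label(predictions, label, selected):
    """Assign the human-chosen label only to the human-selected cluster elements."""
    chosen = set(selected)
    return [label if i in chosen else pred for i, pred in enumerate(predictions)]


cluster = ["money", "aoney", "more", "money"]
# Step 1: the human picks "money" as the representative.
# Step 2: the human selects elements 0, 1 and 3, leaving out the impurity "more".
corrected = propagate_label(cluster, "money", selected=[0, 1, 3])
print(corrected)
```

Because the impurity is left out of the selection, it keeps its own prediction instead of receiving a wrong label.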
Pipeline
Fig. Pipeline. CRNN-OCR [1] gives word predictions and HWNet [2] gives image features; the remaining stages shown are error detection, clustering, error clusters (example word: “falling,”) and manual correction.
1. Shi et al. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.” PAMI, 2017.
2. Krishnan et al. “HWNet v2: An Efficient Word Image Representation for Handwritten Documents.” IJDAR, 2018.
Implementation Details
Edit Actions
● Full Typing (no dictionary involved)
● Type + Select from dropdown (Static Dictionary)
● Type + Select from dropdown (Growing Dictionary)
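One way to picture the difference between the static and growing dictionaries is the sketch below. The class, its methods and the prefix-matching rule are all hypothetical; the slides do not specify how suggestions are generated or how the dictionary grows.

```python
class SuggestionDictionary:
    """Dropdown suggestions by prefix; a growing dictionary remembers typed corrections."""

    def __init__(self, words, growing=False):
        self.words = set(words)
        self.growing = growing

    def suggest(self, prefix, limit=5):
        """Return up to `limit` known words that start with the typed prefix."""
        return sorted(w for w in self.words if w.startswith(prefix))[:limit]

    def correct(self, typed):
        """Record a fully typed correction; only a growing dictionary stores it."""
        if self.growing:
            self.words.add(typed)
        return typed


static = SuggestionDictionary(["Delhi", "Canning"])
growing = SuggestionDictionary(["Delhi", "Canning"], growing=True)
for dictionary in (static, growing):
    dictionary.correct("Cawnpore")  # typed in full the first time it is met

print(static.suggest("Caw"), growing.suggest("Caw"))
```

Under this assumption, a word typed once can be selected from the dropdown the next time it occurs, which is where a growing dictionary would save effort on repeated errors.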
Datasets
● Fully Annotated
○ English
■ 19 books
■ ~2.5k pages
○ Hindi
■ 32 books
■ ~5k pages
Fig. Sample word images from the Fully Annotated dataset.
Evaluation Protocol
● For Fully Annotated
○ We measure the time, in seconds, that a human spends correcting a book.
○ We refer to it as the cost of correction (C).
○ We measure the cost of each method relative to Full Typing.
Cost of correction
$C = w_1 c_t + w_2 c_d + w_3 c_v$
$c_t$ = typing cost
$c_d$ = selection cost
$c_v$ = verification cost
$w_1$ = number of error words that need typing
$w_2$ = number of error words whose correct alternative is present in the dictionary
$w_3$ = number of words that are correct but wrongly flagged as errors
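A worked example of the cost formula follows. All numbers are hypothetical: the per-word costs in seconds and the word counts are invented for illustration, and treating Full Typing as the case where every error word is typed ($w_2 = 0$) is an assumption.

```python
def correction_cost(w1, w2, w3, ct, cd, cv):
    """C = w1*ct + w2*cd + w3*cv, in the same time unit as ct, cd and cv."""
    return w1 * ct + w2 * cd + w3 * cv


# Hypothetical per-word costs in seconds: typing, dropdown selection, verification.
ct, cd, cv = 5.0, 2.0, 1.0

# Assumed baseline: all 100 error words are typed, 20 correct words are only verified.
full_typing = correction_cost(w1=100, w2=0, w3=20, ct=ct, cd=cd, cv=cv)

# With a dictionary, 60 of the 100 error words are selected instead of typed.
with_dictionary = correction_cost(w1=40, w2=60, w3=20, ct=ct, cd=cd, cv=cv)

relative_cost = with_dictionary / full_typing
print(full_typing, with_dictionary, round(relative_cost, 3))
```

A relative cost below 1 means the method needs less human time than the assumed Full Typing baseline.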
Results
Relative cost
Fig. Relative cost of correction with respect to full typing when no clustering is involved.
Relative cost
Fig. Cost of correction across different clustering techniques for automatic label selection and propagation. The plot also marks the cost without clustering.
Qualitative Results
Fig. Qualitative results of k-means + MST clustering. Images relevant to the cluster are marked correct, while the non-relevant ones (false positives) are crossed out.
Results: Fully Annotated Data (English)
Comparison between automatic and manual correction
● No false error propagation.
● Reduces cost of correction.
Scope for automatic error correction
Clustering on Large Scale Dataset
● We cluster word images from 100 unannotated books.
● Testing is done on 200 annotated pages.
● We use character majority voting (CMV) to assign labels to erroneous predictions.
Results
● Comparison of the performance of CRNN and Tesseract OCR.
● Gain in word accuracy: CRNN-OCR >> Tesseract OCR.
Automatic error correction rectifies errors more accurately as the size of the collection increases.
Clustering on Large Scale Dataset
We observe that as the size of the collection increases, CMV becomes better at picking the correct cluster representative, which is subsequently propagated to all the cluster elements.
Conclusion
● We proposed a cost-efficient batch correction scheme for reducing OCR errors.
● We also demonstrated that our approach scales effectively to larger collections.
Future Work
● Active learning techniques to find clusters/subclusters that need post-processing.
● Adapting the recognizer to a collection, not just the post-processing module.
Thank You