TRANSCRIPT
IIIT Hyderabad · UMass Amherst
Robust Recognition of Documents by Fusing Results of Word Clusters
Venkat Rasagna1, Anand Kumar1, C. V. Jawahar1, R. Manmatha2
1 Center for Visual Information Technology, IIIT Hyderabad
2 Center for Intelligent Information Retrieval, UMass Amherst
Introduction
• Recognition of books and collections.
• Recognition of words is crucial to Information Retrieval.
• Dictionaries and post-processors are not feasible in many languages.
Motivation
• Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels.
• Word accuracies are far lower than component accuracies.
• Word accuracy falls as the number of components in the word grows.
• Using a language model for post-processing is challenging:
  – High entropy, large vocabulary (e.g., Telugu).
  – Language processing modules are still emerging.
[Plot: component accuracy vs. word accuracy (0-100%) as the number of components per word grows]
Is it possible to make use of multiple occurrences of the same word to improve OCR performance?
[Figure: Recognize → Parse example]
Average word length =
Component Accuracy = 9 / 12 = 75%
Word Accuracy = 25%
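The gap between component and word accuracy follows from requiring every component in a word to be correct. A minimal sketch, assuming independent component errors (the numbers below are illustrative, not from the paper):

```python
# If each component is recognized independently with probability p,
# a word with n components is fully correct with probability p**n.
# This is why word accuracy falls quickly as words get longer.

def expected_word_accuracy(p: float, n: int) -> float:
    """Expected word accuracy under independent component errors."""
    return p ** n

# With 75% component accuracy, a 4-component word is fully
# correct only about 32% of the time.
print(round(expected_word_accuracy(0.75, 4), 4))  # 0.3164
```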
Overview
• Multiple occurrences of the same word appear throughout the text.
• Words are degraded independently.
• OCR output differs for the same word at different instances.
• Goal: cluster the word images, OCR each instance, and fuse the outputs.
[Diagram: text → multiple occurrences of a word → cluster → OCR on each instance → fused output]
Related Work
Scripts studied include Malayalam, Bangla, Tamil, and Hindi.
– U. Pal, B. Chaudhuri, Pattern Recognition, 2004
– A. Negi et al., ICDAR, 2001
– C. V. Jawahar et al., ICDAR, 2003
– K. S. Sesh Kumar et al., ICDAR, 2007
– P. Xiu and H. S. Baird, DRR XV, 2008
– N. V. Neeba, C. V. Jawahar, ICPR, 2008
– T. M. Rath et al., IJDAR, 2007
– T. M. Rath et al., CVPR, 2003
– Anand Kumar et al., ACCV, 2007
– H. Tao, J. Hull, Document Analysis and Information Retrieval, 1995
• Character recognition in Indian languages is still an unsolved problem.
• Telugu is one of the most complex scripts.
• Recognition of a whole book has received some attention recently.
• Word images can be matched efficiently for retrieval.
• This work uses word image clusters to improve OCR accuracy.
Conventional Recognition Process
Scanned Images → Preprocessing → Segmentation and Word Detection → Recognizer (Feature Extraction → Classification) → Text (UNICODE)

Proposed Recognition Process
Scanned Images → Preprocessing → Segmentation and Word Detection → Word-level Feature Extraction → Word Grouping (Clustering) → Word Groups → OCR → Combining OCR Results → Text (UNICODE)
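The proposed flow can be sketched as below. Everything here is an illustrative stand-in, not the authors' code: `run_ocr` is simulated, `fuse_outputs` is a simple whole-word majority vote, and the cluster ids are assumed to be precomputed (e.g., by the LSH step described later).

```python
from collections import Counter, defaultdict

def run_ocr(word_image):
    # Stand-in OCR: in this sketch a "word image" is already a string
    # carrying simulated recognition errors.
    return word_image

def fuse_outputs(ocr_strings):
    # Simplest possible fusion: majority vote over whole-word outputs.
    return Counter(ocr_strings).most_common(1)[0][0]

def recognize_book(word_images, cluster_ids):
    # Group instances by cluster, OCR each instance, fuse per cluster,
    # then emit the fused result for every original occurrence.
    clusters = defaultdict(list)
    for img, cid in zip(word_images, cluster_ids):
        clusters[cid].append(img)
    fused = {cid: fuse_outputs([run_ocr(w) for w in imgs])
             for cid, imgs in clusters.items()}
    return [fused[cid] for cid in cluster_ids]

# Five noisy instances of the same word, all in one cluster:
words = ["robust", "rodust", "robust", "robusl", "robust"]
print(recognize_book(words, [0, 0, 0, 0, 0]))  # ['robust'] * 5
```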
Locality Sensitive Hashing (LSH)
• LSH goal ("r-Near Neighbour"): for any query q, return a point p ∈ P such that ||p - q|| ≤ r (if it exists).
• LSH has been used for:
  – Data mining: Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000
  – Information retrieval: A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, Piotr Indyk, 2006
  – Document image search: Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007
Character Majority Voting
[Figure: word cluster → OCR output components → final output]
• Algorithm [TODO]
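The slide leaves the algorithm as [TODO]; one plausible sketch of character majority voting, assuming the OCR label sequences in a cluster are already of equal length (unequal lengths need alignment first, as in the DTW variant on the next slide):

```python
from collections import Counter

def character_majority_vote(ocr_outputs):
    """Fuse equal-length OCR label sequences by voting per position."""
    length = len(ocr_outputs[0])
    assert all(len(o) == length for o in ocr_outputs), "sequences must align"
    fused = []
    for pos in range(length):
        # Collect the label every instance produced at this position
        # and keep the most frequent one.
        votes = Counter(o[pos] for o in ocr_outputs)
        fused.append(votes.most_common(1)[0][0])
    return "".join(fused)

# Three noisy readings of the same 5-component word:
print(character_majority_vote(["hello", "hallo", "hellq"]))  # hello
```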
Dynamic Programming [1,2]
[Figure: word images 1-4 with their OCR outputs; alignment via dynamic programming; voting for word 1 after aligning; DTW output vs. CMV output for word 1]
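The alignment step can be illustrated with a textbook edit-distance dynamic program over two label sequences; this stands in for the paper's DTW formulation and is not necessarily the authors' exact cost function:

```python
def align(a: str, b: str):
    """Edit-distance DP table plus traceback of aligned label pairs."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Trace back to recover the aligned label pairs ('-' marks a gap).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)):
            pairs.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((a[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", b[j - 1])); j -= 1
    return dp[n][m], pairs[::-1]

dist, pairs = align("robust", "rodust")
print(dist)  # 1
```

Once all words in a cluster are aligned to a common reference, the per-position voting of CMV can be applied to the aligned columns.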
Results
• The word generation process makes correct annotations available for evaluating performance.
Dataset | Component Accuracy        | Word Accuracy
        | OCR    CMV    DTW         | OCR    CMV    DTW
SF1     | 98.30  98.30  98.30       | 95.50  95.50  95.50
SF2     | 94.82  98.04  98.19       | 85.24  94.97  95.28
SF3     | 85.83  95.78  97.90       | 67.51  88.31  94.15
SF4     | 79.38  87.82  92.19       | 51.90  78.81  85.20
• 5000 clusters
• 20 variations
• Degraded dataset
More Details
Results
• Word Accuracy vs. No. of Words
  – Adding more words makes the dataset more ambiguous.
  – Algorithm performance increases with the number of words, then saturates.
• Word Accuracy vs. Word Length
  – Word accuracy decreases as the word length increases.
  – Using the cluster information helps regain good word accuracy.
Results

Book  Size    No. of    Length  Avg   No. of  Symbol Accuracy         Word Accuracy
              Clusters  Range   WL    Words   OCR    CMV    DTW       OCR    CMV    DTW
B1    Short   676       2-3     2.45  3778    90.64  91.61  91.66     80.56  82.39  82.45
B1    Medium  994       4-5     4.43  5161    90.78  92.35  92.42     73.34  79.14  80.53
B1    Long    690       6-16    7.31  4587    89.98  92.15  92.31     58.64  72.34  74.82
B1    ALL
B2    ALL
B3    ALL
B4    ALL
• For a small increase in component accuracy, there is a large improvement in word accuracy.
• The improvement is highest for long words.
• Relative improvement of 12% for words which occur at least twice.
Analysis
• Cuts and merges
• CMV vs. DTW
• Wrong word in the cluster
• Cases that can't be handled
[Figure: example word images with their OCR, CMV, and DTW outputs]
Conclusion & Future Work
• A new framework is proposed for OCRing an entire book.
• A word recognition technique that exploits document-level constraints is shown.
• An efficient clustering algorithm speeds up the process.
• Word-level accuracy improves from 70.37% to 79.12%.
• The technique can also be used for other languages.
• Future work: handle unique words by creating clusters over parts of words.
LSH Algorithm

Algorithm: Word Image Clustering
Require: Word images Wj and features Fj, j = 1,...,n
Ensure: Word image clusters O
for each i = 1,...,l do
  for each j = 1,...,n do
    Compute hash bucket I = gi(Fj)
    Store word image Wj in bucket I of hash table Ti
  end for
end for
k = 1
for each i = 1,...,n with Wi unmarked do
  Query hash tables for word Wi to get cluster Ok
  Mark word Wi with k
  k = k + 1
end for
Back
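The clustering pseudocode above might be implemented as follows. The random-hyperplane hash family and the parameters `l` and `k` are illustrative assumptions; the paper does not specify its hash functions here.

```python
import random
from collections import defaultdict

def make_hash_fn(dim, k, rng):
    # One g_i: concatenation of k random-hyperplane sign bits
    # (an illustrative LSH family for vector features).
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    def g(f):
        return tuple(int(sum(p_d * f_d for p_d, f_d in zip(p, f)) >= 0)
                     for p in planes)
    return g

def lsh_cluster(features, l=4, k=8, seed=0):
    rng = random.Random(seed)
    dim = len(features[0])
    # Build l hash tables, storing each word index in its bucket g_i(F_j).
    tables = []
    for _ in range(l):
        g = make_hash_fn(dim, k, rng)
        table = defaultdict(list)
        for j, f in enumerate(features):
            table[g(f)].append(j)
        tables.append((g, table))
    # Sweep unmarked words: query all tables, mark collisions as one cluster.
    labels, next_k = [None] * len(features), 0
    for i, f in enumerate(features):
        if labels[i] is not None:
            continue
        neighbours = set()
        for g, table in tables:
            neighbours.update(table[g(f)])
        for j in neighbours:
            if labels[j] is None:
                labels[j] = next_k
        next_k += 1
    return labels

# Two identical feature pairs land in the same buckets; the pairs
# point in opposite directions, so they never collide.
feats = [[1.0, 0.1], [1.0, 0.1], [-1.0, -0.1], [-1.0, -0.1]]
print(lsh_cluster(feats))  # [0, 0, 1, 1]
```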
Word Error Correction

Algorithm: Word Error Correction
Require: Cluster C of words Wi, i = 1,...,n
Ensure: Clusters O of correct words
for each i = 1,...,n do
  for each j = 1,...,n do
    if j != i then
      Align words Wi and Wj
      Record errors Ek, k = 1,...,m in Wi
      Record possible corrections Gk for Ek
    end if
  end for
  Correct Ek if probability pk of correction Gk is maximum
  O <- O U Wi
end for
Back
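A minimal sketch of the correction pseudocode above, simplified to equal-length words within a cluster (real clusters would need the alignment step first); the majority threshold used here is an assumption:

```python
from collections import Counter

def correct_cluster(words):
    """For each word, treat positions that disagree with the other words
    as errors E_k, and apply the most frequent character among the other
    words as the correction G_k when it has majority support."""
    corrected = []
    for i, w in enumerate(words):
        others = [v for j, v in enumerate(words) if j != i]
        chars = list(w)
        for k in range(len(w)):
            votes = Counter(v[k] for v in others)   # candidate corrections G_k
            best, count = votes.most_common(1)[0]
            # Only correct when the candidate clearly dominates (p_k maximal).
            if best != chars[k] and count > len(others) / 2:
                chars[k] = best
        corrected.append("".join(chars))
    return corrected

print(correct_cluster(["hello", "hallo", "hellq"]))
# ['hello', 'hello', 'hello']
```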
• Dataset
  – 5000 clusters with 20 images of the same word at different font sizes and resolutions.
  – Words were generated using ImageMagick.
  – Words were degraded with the Kanungo degradation model to approximate real data.
  – The SF1, SF2, SF3, and SF4 datasets were degraded with 0%, 10%, 20%, and 30% noise respectively.
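A simplified sketch of Kanungo-style degradation, where a pixel's flip probability decays with its distance to the nearest pixel of the opposite colour plus a constant floor. This is a reduced version for illustration: the full model uses separate foreground/background parameters and a final morphological closing, and the parameter values below are invented.

```python
import math
import random

def degrade(image, a0=1.0, a=1.0, eta=0.0, seed=0):
    """Flip each binary pixel with probability a0*exp(-a*d^2) + eta,
    where d is the distance to the nearest opposite-colour pixel.
    Brute-force distance search; fine for tiny illustrative images."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])

    def dist_to_opposite(y, x):
        best = float("inf")
        for yy in range(h):
            for xx in range(w):
                if image[yy][xx] != image[y][x]:
                    best = min(best, math.hypot(yy - y, xx - x))
        return best

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            d = dist_to_opposite(y, x)
            p_flip = a0 * math.exp(-a * d * d) + eta
            row.append(1 - image[y][x] if rng.random() < p_flip else image[y][x])
        out.append(row)
    return out
```

Raising `a0` or `eta` (as for SF2-SF4) produces progressively noisier word images near, and away from, the character boundaries.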