TRANSCRIPT
IIIT Hyderabad · UMass Amherst
Robust Recognition of Documents by Fusing Results of Word Clusters
Venkat Rasagna1, Anand Kumar1, C. V. Jawahar1, R. Manmatha2
1 Center for Visual Information Technology, IIIT Hyderabad
2 Center for Intelligent Information Retrieval, UMass Amherst
Introduction
• Recognition of books and collections.
• Recognition of words is crucial to Information Retrieval.
• Dictionaries and post-processors are not feasible in many languages.
Motivation
• Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels.
• Word accuracies are far lower than component accuracies.
• Word accuracy falls as the number of components in the word grows.
• Using a language model for post-processing is challenging:
  – High entropy, large vocabulary (e.g., Telugu).
  – Language processing modules are still emerging.
[Plot: component accuracy vs. word accuracy (0-100%) as the number of components per word grows]
Is it possible to make use of multiple occurrences of the same word to improve OCR performance?
[Figure: Recognize → Parse example]
Average word length =
Component Accuracy = 9 / 12 = 75%
Word Accuracy = 25%
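The gap between component and word accuracy follows from requiring every component in a word to be correct. A minimal sketch, assuming independent component errors (the numbers below are illustrative, not from the paper):

```python
# If each component is recognized independently with probability p,
# a word with n components is fully correct with probability p**n.
# This is why word accuracy falls quickly as words get longer.

def expected_word_accuracy(p: float, n: int) -> float:
    """Expected word accuracy under independent component errors."""
    return p ** n

# With 75% component accuracy, a 4-component word is fully
# correct only about 32% of the time.
print(round(expected_word_accuracy(0.75, 4), 4))  # 0.3164
```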
Overview
• Multiple occurrences of the same word appear throughout the text.
• Words are degraded independently.
• OCR output differs for the same word at different instances.
• Goal: cluster the word images, OCR each instance, and fuse the outputs.
[Diagram: text → multiple occurrences of a word → cluster → OCR on each instance → fused output]
Related Work
Scripts studied include Malayalam, Bangla, Tamil, and Hindi.
– U. Pal, B. Chaudhuri, Pattern Recognition, 2004
– A. Negi et al., ICDAR, 2001
– C. V. Jawahar et al., ICDAR, 2003
– K. S. Sesh Kumar et al., ICDAR, 2007
– P. Xiu and H. S. Baird, DRR XV, 2008
– N. V. Neeba, C. V. Jawahar, ICPR, 2008
– T. M. Rath et al., IJDAR, 2007
– T. M. Rath et al., CVPR, 2003
– Anand Kumar et al., ACCV, 2007
– H. Tao, J. Hull, Document Analysis and Information Retrieval, 1995
• Character recognition in Indian languages is still an unsolved problem.
• Telugu is one of the most complex scripts.
• Recognition of a whole book has received some attention recently.
• Word images can be matched efficiently for retrieval.
• This work uses word image clusters to improve OCR accuracy.
Conventional Recognition Process
Scanned Images → Preprocessing → Segmentation and Word Detection → Recognizer (Feature Extraction → Classification) → Text (UNICODE)

Proposed Recognition Process
Scanned Images → Preprocessing → Segmentation and Word Detection → Word-level Feature Extraction → Word Grouping (Clustering) → Word Groups → OCR → Combining OCR Results → Text (UNICODE)
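The proposed flow can be sketched as below. Everything here is an illustrative stand-in, not the authors' code: `run_ocr` is simulated, `fuse_outputs` is a simple whole-word majority vote, and the cluster ids are assumed to be precomputed (e.g., by the LSH step described later).

```python
from collections import Counter, defaultdict

def run_ocr(word_image):
    # Stand-in OCR: in this sketch a "word image" is already a string
    # carrying simulated recognition errors.
    return word_image

def fuse_outputs(ocr_strings):
    # Simplest possible fusion: majority vote over whole-word outputs.
    return Counter(ocr_strings).most_common(1)[0][0]

def recognize_book(word_images, cluster_ids):
    # Group instances by cluster, OCR each instance, fuse per cluster,
    # then emit the fused result for every original occurrence.
    clusters = defaultdict(list)
    for img, cid in zip(word_images, cluster_ids):
        clusters[cid].append(img)
    fused = {cid: fuse_outputs([run_ocr(w) for w in imgs])
             for cid, imgs in clusters.items()}
    return [fused[cid] for cid in cluster_ids]

# Five noisy instances of the same word, all in one cluster:
words = ["robust", "rodust", "robust", "robusl", "robust"]
print(recognize_book(words, [0, 0, 0, 0, 0]))  # ['robust'] * 5
```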
Locality Sensitive Hashing (LSH)
• LSH goal ("r-Near Neighbour"): for any query q, return a point p ∈ P such that ||p - q|| ≤ r (if it exists).
• LSH has been used for:
  – Data mining: Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000
  – Information retrieval: A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, Piotr Indyk, 2006
  – Document image search: Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007
Character Majority Voting
[Figure: word cluster → OCR output components → final output]
• Algorithm [TODO]
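The slide leaves the algorithm as [TODO]; one plausible sketch of character majority voting, assuming the OCR label sequences in a cluster are already of equal length (unequal lengths need alignment first, as in the DTW variant on the next slide):

```python
from collections import Counter

def character_majority_vote(ocr_outputs):
    """Fuse equal-length OCR label sequences by voting per position."""
    length = len(ocr_outputs[0])
    assert all(len(o) == length for o in ocr_outputs), "sequences must align"
    fused = []
    for pos in range(length):
        # Collect the label every instance produced at this position
        # and keep the most frequent one.
        votes = Counter(o[pos] for o in ocr_outputs)
        fused.append(votes.most_common(1)[0][0])
    return "".join(fused)

# Three noisy readings of the same 5-component word:
print(character_majority_vote(["hello", "hallo", "hellq"]))  # hello
```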
Dynamic Programming [1,2]
[Figure: word images 1-4 with their OCR outputs; alignment via dynamic programming; voting for word 1 after aligning; DTW output vs. CMV output for word 1]
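The alignment step can be illustrated with a textbook edit-distance dynamic program over two label sequences; this stands in for the paper's DTW formulation and is not necessarily the authors' exact cost function:

```python
def align(a: str, b: str):
    """Edit-distance DP table plus traceback of aligned label pairs."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Trace back to recover the aligned label pairs ('-' marks a gap).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)):
            pairs.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((a[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", b[j - 1])); j -= 1
    return dp[n][m], pairs[::-1]

dist, pairs = align("robust", "rodust")
print(dist)  # 1
```

Once all words in a cluster are aligned to a common reference, the per-position voting of CMV can be applied to the aligned columns.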
Results
• The word generation process makes correct annotations available for evaluating performance.
Dataset | Component Accuracy        | Word Accuracy
        | OCR    CMV    DTW         | OCR    CMV    DTW
SF1     | 98.30  98.30  98.30       | 95.50  95.50  95.50
SF2     | 94.82  98.04  98.19       | 85.24  94.97  95.28
SF3     | 85.83  95.78  97.90       | 67.51  88.31  94.15
SF4     | 79.38  87.82  92.19       | 51.90  78.81  85.20
• 5000 clusters
• 20 variations
• Degraded dataset
More Details
Results
• Word Accuracy vs. No. of Words
  – Adding more words makes the dataset more ambiguous.
  – Algorithm performance increases with the number of words, then saturates.
• Word Accuracy vs. Word Length
  – Word accuracy decreases as the word length increases.
  – Using the cluster information helps regain good word accuracy.
Results

Book  Size    No. of    Length  Avg   No. of  Symbol Accuracy         Word Accuracy
              Clusters  Range   WL    Words   OCR    CMV    DTW       OCR    CMV    DTW
B1    Short   676       2-3     2.45  3778    90.64  91.61  91.66     80.56  82.39  82.45
B1    Medium  994       4-5     4.43  5161    90.78  92.35  92.42     73.34  79.14  80.53
B1    Long    690       6-16    7.31  4587    89.98  92.15  92.31     58.64  72.34  74.82
B1    ALL
B2    ALL
B3    ALL
B4    ALL
• For a small increase in component accuracy, there is a large improvement in word accuracy.
• The improvement is highest for long words.
• Relative improvement of 12% for words which occur at least twice.
Analysis
• Cuts and merges
• CMV vs. DTW
• Wrong word in the cluster
• Cases that can't be handled
[Figure: example word images with their OCR, CMV, and DTW outputs]
Conclusion & Future Work
• A new framework is proposed for OCRing an entire book.
• A word recognition technique that exploits document-level constraints is shown.
• An efficient clustering algorithm speeds up the process.
• Word-level accuracy improves from 70.37% to 79.12%.
• The technique can also be used for other languages.
• Future work: handle unique words by creating clusters over parts of words.
LSH Algorithm

Algorithm: Word Image Clustering
Require: Word images Wj and features Fj, j = 1,...,n
Ensure: Word image clusters O
for each i = 1,...,l do
  for each j = 1,...,n do
    Compute hash bucket I = gi(Fj)
    Store word image Wj in bucket I of hash table Ti
  end for
end for
k = 1
for each i = 1,...,n with Wi unmarked do
  Query hash tables for word Wi to get cluster Ok
  Mark word Wi with k
  k = k + 1
end for
Back
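The clustering pseudocode above might be implemented as follows. The random-hyperplane hash family and the parameters `l` and `k` are illustrative assumptions; the paper does not specify its hash functions here.

```python
import random
from collections import defaultdict

def make_hash_fn(dim, k, rng):
    # One g_i: concatenation of k random-hyperplane sign bits
    # (an illustrative LSH family for vector features).
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    def g(f):
        return tuple(int(sum(p_d * f_d for p_d, f_d in zip(p, f)) >= 0)
                     for p in planes)
    return g

def lsh_cluster(features, l=4, k=8, seed=0):
    rng = random.Random(seed)
    dim = len(features[0])
    # Build l hash tables, storing each word index in its bucket g_i(F_j).
    tables = []
    for _ in range(l):
        g = make_hash_fn(dim, k, rng)
        table = defaultdict(list)
        for j, f in enumerate(features):
            table[g(f)].append(j)
        tables.append((g, table))
    # Sweep unmarked words: query all tables, mark collisions as one cluster.
    labels, next_k = [None] * len(features), 0
    for i, f in enumerate(features):
        if labels[i] is not None:
            continue
        neighbours = set()
        for g, table in tables:
            neighbours.update(table[g(f)])
        for j in neighbours:
            if labels[j] is None:
                labels[j] = next_k
        next_k += 1
    return labels

# Two identical feature pairs land in the same buckets; the pairs
# point in opposite directions, so they never collide.
feats = [[1.0, 0.1], [1.0, 0.1], [-1.0, -0.1], [-1.0, -0.1]]
print(lsh_cluster(feats))  # [0, 0, 1, 1]
```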
Word Error Correction

Algorithm: Word Error Correction
Require: Cluster C of words Wi, i = 1,...,n
Ensure: Clusters O of correct words
for each i = 1,...,n do
  for each j = 1,...,n do
    if j != i then
      Align words Wi and Wj
      Record errors Ek, k = 1,...,m in Wi
      Record possible corrections Gk for Ek
    end if
  end for
  Correct Ek if probability pk of correction Gk is maximum
  O <- O U Wi
end for
Back
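A minimal sketch of the correction pseudocode above, simplified to equal-length words within a cluster (real clusters would need the alignment step first); the majority threshold used here is an assumption:

```python
from collections import Counter

def correct_cluster(words):
    """For each word, treat positions that disagree with the other words
    as errors E_k, and apply the most frequent character among the other
    words as the correction G_k when it has majority support."""
    corrected = []
    for i, w in enumerate(words):
        others = [v for j, v in enumerate(words) if j != i]
        chars = list(w)
        for k in range(len(w)):
            votes = Counter(v[k] for v in others)   # candidate corrections G_k
            best, count = votes.most_common(1)[0]
            # Only correct when the candidate clearly dominates (p_k maximal).
            if best != chars[k] and count > len(others) / 2:
                chars[k] = best
        corrected.append("".join(chars))
    return corrected

print(correct_cluster(["hello", "hallo", "hellq"]))
# ['hello', 'hello', 'hello']
```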
• Dataset
  – 5000 clusters with 20 images of the same word at different font sizes and resolutions.
  – Words were generated using ImageMagick.
  – Words were degraded with the Kanungo degradation model to approximate real data.
  – The SF1, SF2, SF3, and SF4 datasets were degraded with 0%, 10%, 20%, and 30% noise respectively.
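A simplified sketch of Kanungo-style degradation, where a pixel's flip probability decays with its distance to the nearest pixel of the opposite colour plus a constant floor. This is a reduced version for illustration: the full model uses separate foreground/background parameters and a final morphological closing, and the parameter values below are invented.

```python
import math
import random

def degrade(image, a0=1.0, a=1.0, eta=0.0, seed=0):
    """Flip each binary pixel with probability a0*exp(-a*d^2) + eta,
    where d is the distance to the nearest opposite-colour pixel.
    Brute-force distance search; fine for tiny illustrative images."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])

    def dist_to_opposite(y, x):
        best = float("inf")
        for yy in range(h):
            for xx in range(w):
                if image[yy][xx] != image[y][x]:
                    best = min(best, math.hypot(yy - y, xx - x))
        return best

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            d = dist_to_opposite(y, x)
            p_flip = a0 * math.exp(-a * d * d) + eta
            row.append(1 - image[y][x] if rng.random() < p_flip else image[y][x])
        out.append(row)
    return out
```

Raising `a0` or `eta` (as for SF2-SF4) produces progressively noisier word images near, and away from, the character boundaries.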