Automatic Ground Truth Generation of Camera Captured Documents Using Document Image Retrieval

Sheraz Ahmed, Koichi Kise, Masakazu Iwamura, Marcus Liwicki, and Andreas Dengel

Problem to be tackled: OCR for camera-captured documents

Camera capture is convenient and useful, but OCR performance is poor.

[Table: OCR responses for camera-captured words. Columns: camera-captured word, ground truth, Tesseract, GOCR, OCRopus. Ground-truth words shown: otherwise, recognises, Legislative, Percent, construction. The engine outputs shown are largely unreadable strings.]

Camera-captured words suffer from blur, perspective distortion, illumination change, and so on.

Quantity improves quality
A large quantity of data improves the quality of recognition.

[Figure: recognition rate plotted against dataset size. A larger dataset covers a wider variety of fonts and distortions.]
Large-scale datasets are demanded.

Existing datasets on camera-captured text
Type      Dataset                     Size
Document  IUPR Dataset                100 pages (word-level groundtruth is unavailable)
Scene     Street View House Numbers   630,000 numerals
Scene     NEOCR                       5,238 words
Scene     Chars74k                    74,107 characters
Not usable for OCR training

Limitations of existing datasets
Only numerals
Too small
Different tendencies from text in document images

Purpose
To develop a method to easily create a large dataset
Successfully groundtruthed one million word images with 99.98% accuracy!

A way to create a dataset

[Figure: a captured image is cropped into word images (for example "This", "is", "National"), which then need groundtruthing. The groundtruthing step is the problematic one.]

Groundtruthing is problematic
Automatic groundtruthing is not reliable.
Manual groundtruthing is laborious and costly.
Goal: reliable automatic groundtruthing

Idea
Use text information embedded in PDF files

[Figure: a PDF file, its printed document, and the captured document image. The text information comes from the PDF file.]

How do we fit the text information into the captured document image?

Fitting text information into captured document image
For scanned document images: similarity transformation [Beusekom, DAS2008]

For camera-captured document images: perspective transformation, or affine transformation (approximately)
The method for scanned document images is not applicable to the camera-captured case, and no method exists for it.

Locally Likely Arrangement Hashing (LLAH)
Find the region corresponding to the captured one from 20M pages in real time
Captured image (query) and search result: corresponding page and corresponding region
DB: 20M pages
Time: 49 ms/query
Accuracy: 99.2%
Pose is estimated simultaneously
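A minimal sketch of how an estimated pose could be used, assuming the pose is available as a 3x3 perspective (homography) matrix. The slides do not state how the pose is represented, so the matrix values, the box coordinates, and the helper names apply_homography and map_box are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]


def apply_homography(h: Matrix3, point: Point) -> Point:
    """Map a 2D point through a 3x3 projective transformation."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return (u / w, v / w)


def map_box(h: Matrix3, box: Tuple[float, float, float, float]) -> List[Point]:
    """Map the four corners of an axis-aligned box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return [apply_homography(h, c) for c in corners]


# Hypothetical pose: scale by 2, shift by (10, 5), mild perspective along x.
H_EXAMPLE = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 5.0],
    [0.001, 0.0, 1.0],
]

# Hypothetical word box from the PDF page, mapped into the captured image.
quad = map_box(H_EXAMPLE, (100.0, 50.0, 200.0, 70.0))
```

When the last row of the matrix is (0, 0, 1), the mapping reduces to an affine transformation, which is the approximation mentioned for the camera-captured case.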

Proposed procedure (1): Document level matching
The captured image (query) is matched against the DB.
DB: features of digital document images, based on LLAH
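A minimal sketch of the kind of geometric invariant that LLAH-style features rely on, assuming the affine case: the ratio of the areas of two triangles formed from four points does not change under an affine transformation. This shows only the invariance property. The full method, as far as we understand it, discretizes such values over combinations of neighboring feature points and hashes them, which is not reproduced here. The points and the transformation are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Unsigned area of triangle abc via the 2D cross product."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return abs(cross) / 2.0


def area_ratio(a: Point, b: Point, c: Point, d: Point) -> float:
    """Ratio area(a, c, d) / area(a, b, c); unchanged by affine maps."""
    return triangle_area(a, c, d) / triangle_area(a, b, c)


def affine(p: Point, m: Sequence[Sequence[float]], t: Point) -> Point:
    """Apply the affine map p -> m * p + t."""
    return (
        m[0][0] * p[0] + m[0][1] * p[1] + t[0],
        m[1][0] * p[0] + m[1][1] * p[1] + t[1],
    )


# Hypothetical feature points (for example, word centroids on a page).
A, B, C, D = (0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 5.0)

# Hypothetical non-singular affine distortion standing in for a camera view.
M = [[2.0, 1.0], [0.5, 3.0]]
T = (7.0, -2.0)

ratio_before = area_ratio(A, B, C, D)
ratio_after = area_ratio(*(affine(p, M, T) for p in (A, B, C, D)))
```

Both ratios agree up to floating-point error, because an affine map scales every area by the same factor.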

Proposed procedure (2): Part level processing
Cropped retrieved image
Transformed captured image
Overlapped image
This is not the end of the procedure

Displacement of text remains

Proposed procedure (3): Word level processing
Cropped retrieved image
Transformed captured image
Overlapped bounding boxes
Find the closest bounding boxes and select perfectly aligned ones only
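A minimal sketch of the word level step, assuming that "closest" means the smallest distance between box centers and that "perfectly aligned" can be approximated by a high intersection-over-union. Both criteria, the 0.8 threshold, and the function name label_words are assumptions for illustration; the slides do not give the actual test.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_words(
    pdf_words: List[Tuple[Box, str]],
    captured_boxes: List[Box],
    min_iou: float = 0.8,  # hypothetical alignment threshold
) -> List[Tuple[Box, str]]:
    """Give each captured box the text of its closest PDF box, if well aligned."""
    labelled: List[Tuple[Box, str]] = []
    if not pdf_words:
        return labelled
    for cap in captured_boxes:
        cx, cy = center(cap)

        def dist2(item: Tuple[Box, str]) -> float:
            px, py = center(item[0])
            return (px - cx) ** 2 + (py - cy) ** 2

        ref_box, text = min(pdf_words, key=dist2)
        # Poorly aligned pairs are dropped rather than risk a wrong label.
        if iou(cap, ref_box) >= min_iou:
            labelled.append((cap, text))
    return labelled


# Hypothetical boxes: PDF words with text, and boxes found in the captured image.
pdf_words = [((0, 0, 40, 10), "This"), ((50, 0, 70, 10), "is"), ((80, 0, 150, 10), "National")]
captured = [(1, 0, 41, 10), (51, 0, 70, 10), (95, 0, 150, 10)]
result = label_words(pdf_words, captured)
```

In this example the third captured box overlaps "National" too loosely and is discarded, which mirrors the idea of keeping only well-aligned words at the cost of some coverage.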

Dataset creation
Document images were captured with a few different cameras.
Documents include proceedings, books, magazines and articles.
Word and character images were automatically groundtruthed.
[Figure: obtained degraded word images]

[Figure: obtained character images]

Evaluation
50,000 word images were randomly selected from one million images.
Manual counting revealed that the accuracy was 99.98%.
The errors were mainly caused by wrong alignment of bounding boxes.

Contribution
A fully automatic groundtruthing method for word and character images in camera-captured documents is proposed.
One million word images were groundtruthed.
Accuracy: 99.98%

Amazingly high for a fully automated method
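A short worked calculation of what the reported accuracy implies, assuming the 99.98% figure is exact rather than rounded. The error counts below are derived from that assumption and are not reported in the slides.

```python
SAMPLE_SIZE = 50_000       # word images checked manually
DATASET_SIZE = 1_000_000   # word images groundtruthed in total
ERRORS_PER_10K = 2         # 99.98% accuracy corresponds to 2 errors per 10,000

# Integer arithmetic avoids floating-point rounding in the counts.
errors_in_sample = SAMPLE_SIZE * ERRORS_PER_10K // 10_000
expected_errors_in_dataset = DATASET_SIZE * ERRORS_PER_10K // 10_000

print(errors_in_sample, expected_errors_in_dataset)
```

Under this assumption, about 10 of the 50,000 checked words were wrong, and roughly 200 wrong labels would be expected across the full one million if the sample is representative.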

Workaround of groundtruthing
Synthetic approach with degradation models [Ishida, ICDAR2005] [Tsuji, KJPR2008]
It is questionable whether this represents real degradation.

[Figure label: Degradation]

Words at border
Partially missing
Can increase confusion between characters
Marked with special flag
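A minimal sketch of how border words could be flagged, assuming a word counts as "at border" when its bounding box touches or comes within a small margin of the image edge. The slides do not say how such words are detected, so the margin parameter and the function name touches_border are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def touches_border(box: Box, width: float, height: float, margin: float = 0.0) -> bool:
    """Return True if the box lies within `margin` pixels of any image edge."""
    x0, y0, x1, y1 = box
    return (
        x0 <= margin
        or y0 <= margin
        or x1 >= width - margin
        or y1 >= height - margin
    )


# Hypothetical 640x480 captured image with one interior and one border word.
interior_flag = touches_border((100, 100, 180, 120), 640, 480)
border_flag = touches_border((600, 200, 640, 220), 640, 480)
```

Keeping such words in the dataset with a flag, rather than dropping them, lets a user decide later whether partially missing words should be used for training.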

