impact final conference - asaf tzadok

IBM Labs in Haifa © 2011 IBM Corporation CONCERT COoperative eNgine for Correction of ExtRacted Text Asaf Tzadok Manager, Image and Document Analytics Group October 2011

Post on 19-Oct-2014






IBM Adaptive OCR engine and CONCERT (Cooperative Correction), including the library perspective


Page 1: IMPACT Final Conference - Asaf Tzadok

IBM Labs in Haifa © 2011 IBM Corporation

CONCERT: COoperative eNgine for Correction of ExtRacted Text

Asaf Tzadok

Manager, Image and Document Analytics Group

October 2011

Page 2: IMPACT Final Conference - Asaf Tzadok




At least an estimated 100 million books have been produced since Johann Gutenberg invented movable type in the 15th century.

A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing.

The digitization process involves:
- Scanning technologies
- OCR (Optical Character Recognition)
- Post-correction

OCR quality ranges between 50% and 90% word-level accuracy.

Post-correction is a must, but it is costly and time-consuming:
- ~1 euro per A5 page

Page 3: IMPACT Final Conference - Asaf Tzadok



Crowd Sourcing Projects

- Distributed Proofreaders (Project Gutenberg)
- National Library of Australia: Australian Newspaper Digitisation
- LDS Church: FamilySearch
- The National Library of Finland: Digitalkoot

All are purely volunteer-based crowdsourcing programs. It works!

Page 4: IMPACT Final Conference - Asaf Tzadok



Gutenberg Project – 1st Gen.

Page 5: IMPACT Final Conference - Asaf Tzadok



NLA – Australian Newspapers – 2nd Gen.

Page 6: IMPACT Final Conference - Asaf Tzadok



Collaborative Correction – State of the Art cont.

State-of-the-art systems, such as Project Gutenberg, simply show the page image and the OCR results to be corrected.

Drawbacks:
- Slow and unproductive process
- Prone to errors
- Hard to cross-check/merge
- Two passes are needed to ensure quality

Result: a complex, hard-to-track process = a lot of manual labor = limited public participation and contribution

Page 7: IMPACT Final Conference - Asaf Tzadok



DIGITALKOOT - Mole Games – 3rd Gen

Page 8: IMPACT Final Conference - Asaf Tzadok



Collaborative Correction – Games

- Wider and younger public participation
- Easy to cross-check
- Allows parallelism
- Fully scalable

Drawbacks:
- Low productivity factor
- Static process with a huge amount of work
- Limited use of the feedback from the users

Very long process to complete the digitization

Page 9: IMPACT Final Conference - Asaf Tzadok



Collaborative Correction – How does it work

A fully web-based collaborative correction system
- Avoids any installation on the client side
- Intuitive for use by the wider public

Call for participation (optional)
- Via the official website of the library
- Collection based

Volunteers keen on contributing to the preservation of their cultural heritage
- Top performers lists
- Library recognition awards
- Acknowledgements

Page 10: IMPACT Final Conference - Asaf Tzadok




Adaptive collaborative correction platform
- Uses the feedback from the users to improve productivity
- Fully connected to the Adaptive OCR Engine

Strong emphasis on productivity tools
- Reduce the time for verification/correction
- Patented smart-key approach
- Motivate volunteers

Separating the data entry process into several complementary tasks
- Optimized application dedicated to each task
- Break down the tasks into subtasks
- Make it suitable for parallel processing
- Online compilation

Digitization flow optimizations
- Hierarchical context level: character -> word -> page

Page 11: IMPACT Final Conference - Asaf Tzadok



CONCERT System Architecture



[Architecture diagram. Labels: OMNI Engine, Book Fonts, Book Optimized, Adaptive OCR, Quality Control, Dictionaries, High Quality, Web Users, Productivity Tools]



Page 12: IMPACT Final Conference - Asaf Tzadok



Adaptive OCR - Requirements

Consistent and reliable confidence level
- Important for quality assurance

No use of prior knowledge about the font
- Even unusual fonts can be handled

Good use of the feedback from the users
- Character and word level

Robust to distortion
- Page-level distortion and printing variations

Easy to migrate between books from the same publisher
- Continuous update

Not too slow
- Around 2-3 times slower than OMNI engines

Page 13: IMPACT Final Conference - Asaf Tzadok



Adaptive OCR – Technical Considerations

Pixel domain (template matching)
- Pros
  - Easy to implement
  - Scoring consistency
- Cons
  - Slow
  - Sensitive to small distortion

Feature domain
- Pros
  - Fast
  - Robust to small distortion
  - Using invariant features can improve robustness to distortion
- Cons
  - Inconsistent scoring mechanism
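
The pixel-domain trade-off can be illustrated with a minimal sketch. It assumes normalized cross-correlation as the template-matching score, which the slides do not specify; the function names `ncc` and `shift_right` and the 5x5 `TEMPLATE` glyph are hypothetical and chosen only for illustration.

```python
from math import sqrt

def ncc(a, b):
    """Normalized cross-correlation between two equal-sized grey images.

    Images are lists of rows of numbers; the score lies in [-1, 1].
    """
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) *
               sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0  # flat images carry no shape information

def shift_right(img, n=1):
    """Shift every row n >= 1 pixels to the right, padding with background (0)."""
    return [[0] * n + row[:-n] for row in img]

# Hypothetical 5x5 glyph: a one-pixel-wide vertical stroke.
TEMPLATE = [[0, 0, 1, 0, 0] for _ in range(5)]

same = ncc(TEMPLATE, TEMPLATE)
shifted = ncc(TEMPLATE, shift_right(TEMPLATE))
print(f"identical: {same:.2f}, shifted by one pixel: {shifted:.2f}")
```

The score is easy to compute and is always on the same scale, which matches the "scoring consistency" point. A one-pixel shift of a thin stroke already pushes the score from 1.0 to below zero, which is the sensitivity to small distortion listed under the cons.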

Page 14: IMPACT Final Conference - Asaf Tzadok



Adaptive OCR - Hybrid Approach

Page 15: IMPACT Final Conference - Asaf Tzadok



Distortion Example

Using hierarchical optic flow
- High-quality compensation for non-linear character warping
- Can overcome significant distortions

Page 16: IMPACT Final Conference - Asaf Tzadok



System flow

Character (Carpet) Session
- Fast validation of OCR results
- Every word with a rejected character is routed to the Word Session

Word Verifier Session
- Used for cases where contextual information is necessary
- A rejected word will be corrected in the Page Session

Page-level Session
- For final closure of the page
- When a view of the entire page is required for understanding
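
The escalation between the three sessions can be sketched as a small state machine. This is an illustrative reading of the flow, not the CONCERT implementation: the `Session` enum, `next_session`, and `trace` are hypothetical names, and the sketch assumes the word is the unit that travels between sessions.

```python
from enum import Enum

class Session(Enum):
    CHARACTER = "character"
    WORD = "word"
    PAGE = "page"
    DONE = "done"

# A rejection moves the item one context level up: character -> word -> page.
ESCALATION = {Session.CHARACTER: Session.WORD, Session.WORD: Session.PAGE}

def next_session(current, accepted):
    """Return where an item goes after a volunteer's verdict in `current`."""
    if current is Session.DONE:
        raise ValueError("item is already closed")
    # The Page Session is the final closure, so it always ends the flow.
    if accepted or current is Session.PAGE:
        return Session.DONE
    return ESCALATION[current]

def trace(verdicts):
    """Follow one item through the flow for a list of accept/reject verdicts."""
    path = [Session.CHARACTER]
    for accepted in verdicts:
        path.append(next_session(path[-1], accepted))
        if path[-1] is Session.DONE:
            break
    return path

# Rejected in the Character Session, rejected again as a word, closed on the page.
print([s.value for s in trace([False, False, True])])
```

Keeping the escalation in a single table makes the hierarchy explicit: most items stop at the cheapest level, and only the rejected ones pay for the wider context.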

Page 17: IMPACT Final Conference - Asaf Tzadok



Character Session

OCR results are analyzed:
- Very high confidence results don’t require verification.
- High confidence results may include some character recognition errors. Hence, the Character Session is used.
- Low confidence results may have been caused by segmentation errors. Hence, the Word Session is used.

For the Character Session, individual character images are extracted and grouped together based on the recognition results (i.e. all the “b” characters are grouped together in the same session).

Within a given session, all the characters are grouped based on their confidence.
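
A minimal sketch of this triage follows. The thresholds `VERY_HIGH` and `HIGH`, the `Glyph` record, and the `triage` function are assumptions made for illustration; the slides give no numeric values. Ordering each group by ascending confidence is one possible reading of "grouped based on their confidence".

```python
from collections import defaultdict
from typing import NamedTuple

class Glyph(NamedTuple):
    char: str          # character proposed by the OCR engine
    confidence: float  # engine confidence in [0, 1]
    word_id: int       # word the character image was cut from

VERY_HIGH = 0.98  # hypothetical threshold: accept without verification
HIGH = 0.80       # hypothetical threshold: verify in a Character Session

def triage(glyphs):
    """Split OCR results into accepted glyphs, carpets, and a word queue."""
    accepted = []
    carpets = defaultdict(list)  # recognized character -> glyphs to verify
    word_queue = set()           # ids of words sent to the Word Session
    for g in glyphs:
        if g.confidence >= VERY_HIGH:
            accepted.append(g)
        elif g.confidence >= HIGH:
            carpets[g.char].append(g)
        else:
            # Low confidence may mean a segmentation error: needs word context.
            word_queue.add(g.word_id)
    for group in carpets.values():
        group.sort(key=lambda g: g.confidence)  # least certain first (assumption)
    return accepted, dict(carpets), word_queue

glyphs = [Glyph("b", 0.99, 1), Glyph("b", 0.90, 2),
          Glyph("b", 0.85, 3), Glyph("d", 0.50, 4)]
accepted, carpets, word_queue = triage(glyphs)
print(len(accepted), [g.word_id for g in carpets["b"]], word_queue)
```

Showing many images of the same proposed character side by side lets a volunteer spot the odd one out, which is why the grouping key is the recognition result.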

Page 18: IMPACT Final Conference - Asaf Tzadok



Character Session

Page 19: IMPACT Final Conference - Asaf Tzadok



Character Session

Page 20: IMPACT Final Conference - Asaf Tzadok



Character Session

Page 21: IMPACT Final Conference - Asaf Tzadok



Word Session

Used for words
- Not in the dictionary
- Having low-confidence characters
- Having characters rejected in the Character Session

Shows
- Original word image
- Recognition results
- Possible spelling options

Words are ordered alphabetically
- Based on the recognition results, in lexicographic order
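
The selection and ordering rules above can be sketched as follows. The `OcrWord` record, the `threshold` value, and both function names are hypothetical. Spelling options are generated with the standard library's `difflib.get_close_matches`, which stands in for whatever suggestion mechanism CONCERT actually uses.

```python
import difflib
from typing import NamedTuple

class OcrWord(NamedTuple):
    text: str       # recognition result for the word
    min_conf: float # lowest character confidence within the word
    rejected: bool  # True if any character was rejected in the Character Session

def needs_word_session(word, dictionary, threshold=0.80):
    """Apply the three selection criteria; `threshold` is an assumed value."""
    return (word.text.lower() not in dictionary
            or word.min_conf < threshold
            or word.rejected)

def build_word_session(words, dictionary, max_options=3):
    """Return (recognition result, spelling options) pairs in lexicographic order."""
    vocab = sorted(dictionary)
    queue = []
    for w in words:
        if needs_word_session(w, dictionary):
            options = difflib.get_close_matches(w.text.lower(), vocab,
                                                n=max_options, cutoff=0.6)
            queue.append((w.text, options))
    queue.sort(key=lambda item: item[0].lower())
    return queue

dictionary = {"the", "then", "book"}
words = [OcrWord("tbe", 0.90, False),   # not in the dictionary
         OcrWord("book", 0.95, False),  # clean: skipped
         OcrWord("book", 0.60, False)]  # low-confidence character
for text, options in build_word_session(words, dictionary):
    print(text, options)
```

Sorting by the recognition result places identical and near-identical words next to each other, so a volunteer can confirm repeated cases quickly.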

Page 22: IMPACT Final Conference - Asaf Tzadok



Word Session – Before data entry

Page 23: IMPACT Final Conference - Asaf Tzadok



Word Session – After data entry

Page 24: IMPACT Final Conference - Asaf Tzadok



Word Session – Before data entry

Page 25: IMPACT Final Conference - Asaf Tzadok



Word Session – After data entry

Page 26: IMPACT Final Conference - Asaf Tzadok



Page Session

Used for correction of cases where word segmentation fails

Can be activated in one of 4 flavors
- Word view
- Line view
- Paragraph view
- Tagging view

The system can move automatically from one problematic word to the next

Page 27: IMPACT Final Conference - Asaf Tzadok



CONCERT - Page Session

Page 28: IMPACT Final Conference - Asaf Tzadok



Multilingual Support - English


Page 29: IMPACT Final Conference - Asaf Tzadok



Multilingual Support - French


Page 30: IMPACT Final Conference - Asaf Tzadok



Multilingual Support - German Gothic


Page 31: IMPACT Final Conference - Asaf Tzadok



Multilingual Support - Dutch 1789

Page 32: IMPACT Final Conference - Asaf Tzadok



Multilingual Support - Japanese

Page 33: IMPACT Final Conference - Asaf Tzadok



Hearst Newsreel Collection – Index Card

Page 34: IMPACT Final Conference - Asaf Tzadok



User Monitoring

Wide public participation may end up with data corruption by
- Malicious users
- Unqualified users

User rating and feedback motivate the use of the system

Three-way validation
- Good injection: characters/words with high confidence of being true
- Similar injection: characters/words which may look similar but are not identical
  - For example: ‘O’ injection in a ‘Q’ session
- Error injection: characters/words with high confidence of being false
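
One way the three injection types could feed a user rating is sketched below. The expected answers, the scoring rule, and the `min_score` and `min_items` values are assumptions made for illustration; the slides do not describe how the rating is computed.

```python
# Expected verdict per injection type (an interpretation of the slide):
# a known-correct item should be accepted, while a look-alike or a
# known-wrong item should be rejected.
EXPECTED_ACCEPT = {"good": True, "similar": False, "error": False}

def user_score(answers):
    """Fraction of injected items answered as expected.

    `answers` is a list of (injection kind, user accepted?) pairs.
    Returns None when the user has not seen any injected item yet.
    """
    if not answers:
        return None
    correct = sum(1 for kind, accepted in answers
                  if EXPECTED_ACCEPT[kind] == accepted)
    return correct / len(answers)

def is_trusted(answers, min_score=0.9, min_items=10):
    """Trust a user's work only after enough injected items (hypothetical rule)."""
    score = user_score(answers)
    return score is not None and len(answers) >= min_items and score >= min_score

answers = [("good", True), ("similar", False), ("error", True), ("good", True)]
print(user_score(answers), is_trusted(answers))
```

Because the injected items have known answers, the same check flags both malicious users and unqualified ones, and the score can also be shown back to the volunteer as feedback.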

Page 35: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots

Page 36: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots Cont.

Page 37: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots Cont.

Page 38: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots Cont.

Page 39: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots Cont.

Page 40: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots Cont.

Page 41: IMPACT Final Conference - Asaf Tzadok



User Monitoring – Screenshots Cont.

Page 42: IMPACT Final Conference - Asaf Tzadok




Page 43: IMPACT Final Conference - Asaf Tzadok



CONCERT in use

Hearst Newsreel Archive Collection
- First production use
- Tagging capabilities

Pilot in Japan for the Japanese Library
- Including customization for Japanese

1st-phase pilots in major libraries in Europe
- KB – National Library of the Netherlands
- BL – British Library
- BSB – Bavarian State Library

Page 44: IMPACT Final Conference - Asaf Tzadok



CONCERT Future Planning

Search over OCR
- Beyond transcription

Improve user feedback
- Online advisor
- Best performers list

Community building around content
- Integrate community tools within the platform

CONCERT games
- iPhone/iPad/Android/Desktop

E-book creation
- Fully digital transcription
- Using the original font as an option

Page distortion correction
- Fully integrate the word-based page distortion correction

Page 45: IMPACT Final Conference - Asaf Tzadok

IBM Labs in Haifa © 2011 IBM Corporation

Thank You!