bl demo day - july2011 - (5) ocr for impact part 2
DESCRIPTION
Niall Anderson outlines the IMPACT approach to adaptive OCR and Post Production including tools prepared by IBM CONCERT and experimental tools from: USAL, NCSR and UIBK. Delivered at BL Demo Day - 12th July 2011
TRANSCRIPT
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Niall Anderson, British Library
OCR and post-correction
Recap
OCR will produce its best results on material with the following characteristics:
• The layout of the text is simple, with no tables or illustrations;
• The text itself is in a modern, computer-generated typeface;
• The digital image preserves a high contrast between the text block and non-text detail (including blank space);
• The image has been created from a perfectly flat and straight scan (if a digital copy from an analogue source);
• The text of the analogue source is clear, well aligned and consistently presented;
• The basic material of the analogue source is undamaged;
• The text is in a single language;
• The image has been taken from the original physical source and not a degraded surrogate (such as microfilm).
Context for Post-Correction
CENL 2008 survey of its members:
– 350% increase in digitisation of historic texts in five years
– 8 million digital items in existence
– Era of mass digitisation of historical texts
British Library experience:
– OCR tools good but not good enough
– On average > 20% words lost
– Loss of letters, words, and “significant” words
Common characteristics of digital text images …
… and their effects on OCR
Collaborative correction: some prior and ongoing systems
www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf
The IMPACT view:
“Mass digitisation demands the creation of a new digitisation paradigm by mobilising the general public to help with large-scale digitisation efforts. Because state-of-the-art systems to enable public participation are limited, we intend to resolve this difficulty by adopting advanced tools that will facilitate volunteer participation.”
The IMPACT approach:
• CONCERT (Cooperative Engine for the Correction of Extracted Text): a data validation/correction application that is simple and intuitive enough to be attractive to untrained users and yet effective enough to ensure high productivity
• Featuring automatic data management that allows for the combination of results from several users at once
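The slides do not say how CONCERT combines results from several users. As a minimal sketch of one possible rule, the Python below applies a simple majority vote to the corrections submitted for a single item. The function name, the `min_agreement` threshold and the data layout are all assumptions made for illustration, not part of CONCERT.

```python
from collections import Counter

def merge_corrections(ocr_text, user_corrections, min_agreement=2):
    """Return the agreed text for one item (a character or a word).

    ocr_text: the original OCR result for the item.
    user_corrections: list of strings submitted by different users.
    min_agreement: hypothetical number of matching votes needed
                   before a correction replaces the OCR result.
    """
    if not user_corrections:
        return ocr_text
    # most_common(1) gives the single most frequent submission and its count
    candidate, votes = Counter(user_corrections).most_common(1)[0]
    return candidate if votes >= min_agreement else ocr_text
```

With three submissions `["the", "the", "tbe"]` for the OCR result `"tbe"`, this rule returns `"the"`; a single unconfirmed submission leaves the OCR result unchanged.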
System architecture
• Secure login
• Upload of books/volumes as image files or by URL
• Omni-OCR with language selection
• Download of compiled OCR metadata before or after key-in
System workflow:
Three stages:
• Character (carpet) session – for fast validation of OCR results
• Word session – in cases where contextual information is needed to validate characters
• Page-level session – in cases where full page view is needed to interpret results
Character session
Analysis of OCR results:
– High confidence results do not require verification
– However, some high confidence results may be misrecognitions
– Individual character images are extracted and grouped together based on recognition results
– User selects and submits suspicious characters
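The grouping step in the character session can be sketched roughly as follows. This is an illustration only: the tuple layout `(char_id, label, confidence)` and the choice to show the least confident characters first are assumptions, not details taken from CONCERT.

```python
from collections import defaultdict

def group_by_label(chars):
    """Group character records into one 'carpet' per recognised label.

    chars: iterable of (char_id, label, confidence) tuples
           (a hypothetical layout for the OCR output).
    Within each carpet the least confident characters come first.
    """
    carpets = defaultdict(list)
    for char_id, label, confidence in chars:
        carpets[label].append((confidence, char_id))
    # Sort each group by ascending confidence and keep only the ids
    return {label: [cid for _, cid in sorted(items)]
            for label, items in carpets.items()}

chars = [("c1", "e", 0.99), ("c2", "e", 0.62),
         ("c3", "c", 0.97), ("c4", "e", 0.95)]
carpets = group_by_label(chars)
```

A user scanning the "e" carpet would then pick out any image that is not an "e" and submit it as suspicious, which is what feeds the later sessions.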
Word session
• Shows words that contain low confidence characters
• Shows words that contain characters identified as suspicious in the Character Session
• Shows original OCR recognition results and possible spelling options
• Users can validate the appropriate spelling option or provide their own spelling
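How CONCERT generates its spelling options is not described in the slides. As a rough sketch, the Python below uses the standard library's `difflib.get_close_matches` against a small word list. The lexicon, the cutoff value and both function names are hypothetical.

```python
import difflib

# Hypothetical word list; a real system would use a full dictionary
LEXICON = ["the", "then", "they", "library", "liberty"]

def spelling_options(ocr_word, lexicon=LEXICON, limit=3):
    """Return up to `limit` lexicon words similar to the OCR result."""
    return difflib.get_close_matches(ocr_word.lower(), lexicon,
                                     n=limit, cutoff=0.6)

def resolve(ocr_word, user_choice=None):
    """Keep the user's spelling if one was given, else the OCR result."""
    return user_choice if user_choice else ocr_word

options = spelling_options("1ibrary")
```

The user's choice may be one of the suggested options or free text, which matches the two routes described on the slide.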
Page Session
• Used primarily where a segmentation failure has led to a word being misrecognised or not recognised at all
• Text can be shown in a variety of segmentation views: word, line, paragraph or tag
• System can be automated to move from one problematic word to the next
System demonstration
http://fue.onb.ac.at/impact/gwsw/vid/EE1_showcase.html
Screencast created by Gerd Zechmeister of the Austrian National Library
Development plan
• User/job monitoring
– Introduction of planned errors to validate operator progress
• Productivity and quality benchmarking
• Correction and validation system for “document dictionaries”
– Each corrected book produces a unique dictionary that can be reused on other works
– Will include general language and name dictionaries being produced by IMPACT
• Distributed workflow management tool and administrative session
• Greater number of output formats for OCR results and corrected results
• “Superkey” session
– Character validation by creation of “ideal” character patterns
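The "planned errors" idea under user/job monitoring can be illustrated with a short sketch: items with a known correct text are deliberately corrupted, mixed into an operator's queue, and the share the operator fixes is measured. The function name and dictionary layout below are assumptions; the slides give no implementation detail.

```python
def operator_accuracy(planted, responses):
    """Estimate how reliably an operator fixes deliberately planted errors.

    planted: dict mapping item_id -> correct text for corrupted items.
    responses: dict mapping item_id -> text the operator submitted.
    Returns the fraction of planted items the operator corrected,
    or None if the operator has not yet seen any planted item.
    """
    seen = [item_id for item_id in planted if item_id in responses]
    if not seen:
        return None
    fixed = sum(1 for item_id in seen
                if responses[item_id] == planted[item_id])
    return fixed / len(seen)

planted = {"w1": "the", "w2": "library"}
responses = {"w1": "the", "w2": "1ibrary", "w3": "volume"}
score = operator_accuracy(planted, responses)
```

Here the operator fixed one of the two planted items, so the estimate is 0.5; item `w3` is ordinary work and is ignored.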
Alternatives to OCR and post-correction: Word Spotting
• Alternative technique for indexing historical documents
• After word segmentation, relevant words are detected and highlighted
• Key words can be person and location names (e.g. taken from the Named Entities Registry)
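The slides do not say how segmented words are matched against key words. One approach found in the word-spotting literature compares a numeric profile of each word image with a template profile using dynamic time warping (DTW). The sketch below assumes that approach. The profiles are plain lists of numbers (for example, ink per pixel column), and the function names and the `max_distance` threshold are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def spot(template, word_profiles, max_distance=1.0):
    """Return ids of segmented words whose profile is close to the template."""
    return [word_id for word_id, profile in word_profiles.items()
            if dtw_distance(template, profile) <= max_distance]

template = [0, 2, 5, 2, 0]           # profile of the key word
word_profiles = {
    "w1": [0, 2, 2, 5, 2, 0],        # same shape, slightly stretched
    "w2": [5, 5, 0, 0, 5],           # different shape
}
hits = spot(template, word_profiles)
```

The words returned by `spot` are the ones that would be highlighted. Because DTW tolerates stretching, the same word printed slightly wider or narrower can still match.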