bl demo day - july2011 - (5) ocr for impact part 2

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Niall Anderson, British Library

OCR and post-correction

Upload: impact-centre-of-competence

Posted on 22-Nov-2014


DESCRIPTION

Niall Anderson outlines the IMPACT approach to adaptive OCR and post-production, including tools prepared by IBM (CONCERT) and experimental tools from USAL, NCSR and UIBK. Delivered at BL Demo Day, 12th July 2011.

TRANSCRIPT

Page 1: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Niall Anderson, British Library

OCR and post-correction

Page 2: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Recap

OCR will produce its best results on material with the following characteristics:

• The layout of the text is simple, with no tables or illustrations;

• The text itself is in a modern, computer-generated typeface;

• The digital image preserves a high contrast between the text block and non-text detail (including blank space)

• The image has been created from a perfectly flat and straight scan (if a digital copy from an analogue source)

• The text of the analogue source is clear, well aligned and consistently presented

• The basic material of the analogue source is undamaged

• The text is in a single language

• The image has been taken from the original physical source and not a degraded surrogate (such as microfilm)

Page 3: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Context for Post-Correction

CENL 2008 survey of its members:

– 350% increase in digitisation of historic texts in five years

– 8 million digital items in existence

– Era of mass digitisation of historical texts

British Library experience:

– OCR tools good but not good enough

– On average > 20% words lost

– Loss of letters, words, and “significant” words
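The "words lost" figure above is a word-level measure. As a rough illustration of how such a rate might be computed against a hand-corrected reference text, here is a minimal sketch. The function name `word_loss_rate` and the bag-of-words comparison are assumptions made for this example; the slides do not say how the British Library figure was measured.

```python
from collections import Counter


def word_loss_rate(ground_truth: str, ocr_text: str) -> float:
    """Fraction of ground-truth words with no matching word in the OCR output.

    Assumption: words are compared as a bag of lower-cased tokens, so word
    order is ignored. This is an illustration, not a standard metric.
    """
    truth = Counter(ground_truth.lower().split())
    ocr = Counter(ocr_text.lower().split())
    total = sum(truth.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps, per word, the smaller of the two counts.
    found = sum((truth & ocr).values())
    return (total - found) / total


# Hypothetical sample: two of the ten reference words are misrecognised.
truth = "the quick brown fox jumps over the lazy dog today"
ocr = "the quick brovvn fox jumps ouer the lazy dog today"
print(word_loss_rate(truth, ocr))
```

A measure like this treats every word equally; weighting "significant" words (names, places) more heavily would need an extra list of such words.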

Page 4: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Common characteristics of digital text images …

Page 5: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

… and their effects on OCR

Page 6: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Collaborative correction: some prior and ongoing systems

Page 7: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf

Page 8: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

The IMPACT view:

“Mass digitisation demands the creation of a new digitisation paradigm by mobilising the general public to help with large-scale digitisation efforts. Because state-of-the-art systems to enable public participation are limited, we intend to resolve this difficulty by adopting advanced tools that will facilitate volunteer participation.”

Page 9: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

The IMPACT approach:

• CONCERT (Cooperative Engine for the Correction of Extracted Text): a data validation/correction application that is simple and intuitive enough to be attractive to untrained users and yet effective enough to ensure high productivity

• Featuring automatic data management that allows for the combination of results from several users at once
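The slides do not describe how CONCERT combines results from several users. One simple possibility is majority voting, sketched below. The function name `combine_corrections` and the `min_agreement` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Optional


def combine_corrections(votes: list[str], min_agreement: int = 2) -> Optional[str]:
    """Return the most common correction if enough users agree, else None.

    Assumption: each vote is one user's proposed spelling for the same word.
    """
    if not votes:
        return None
    (winner, count), *rest = Counter(votes).most_common()
    # A tie for first place means there is no clear consensus.
    if rest and rest[0][1] == count:
        return None
    return winner if count >= min_agreement else None


# Three hypothetical users correct the same word; two of them agree.
print(combine_corrections(["publick", "publick", "public"]))
```

Returning `None` when there is no consensus leaves room to route the word to another user or to a later session rather than accepting a doubtful correction.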

Page 10: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

System architecture

• Secure login

• Upload of books/volumes as image files or by URL

• Omni-OCR with language selection

• Download of compiled OCR metadata before or after key-in

Page 11: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

System workflow

Three stages:

• Character (carpet) session – for fast validation of OCR results

• Word session – in cases where contextual information is needed to validate characters

• Page-level session – in cases where full page view is needed to interpret results

Page 12: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Character session

Analysis of OCR results:

– High confidence results do not require verification

– However, some high confidence results may be misrecognitions

– Individual character images are extracted and grouped together based on recognition results

– User selects and submits suspicious characters
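The grouping step of the character session can be sketched as follows: every extracted character image is placed in a "carpet" with all other images the engine read as the same character, so an odd one out is easy to spot. The `Glyph` record, its fields and `build_carpets` are assumptions for this example, not CONCERT's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Glyph:
    glyph_id: int      # hypothetical identifier of the cropped character image
    label: str         # character the OCR engine recognised
    confidence: float  # engine confidence, assumed to lie in [0, 1]


def build_carpets(glyphs: list[Glyph]) -> dict[str, list[Glyph]]:
    """Group glyphs by recognised label, least confident first in each group."""
    carpets: dict[str, list[Glyph]] = defaultdict(list)
    for glyph in glyphs:
        carpets[glyph.label].append(glyph)
    for group in carpets.values():
        group.sort(key=lambda g: g.confidence)
    return dict(carpets)


# Hypothetical OCR output for four character images.
glyphs = [
    Glyph(1, "e", 0.99),
    Glyph(2, "c", 0.97),
    Glyph(3, "e", 0.62),
    Glyph(4, "e", 0.95),
]
carpets = build_carpets(glyphs)
print([g.glyph_id for g in carpets["e"]])
```

Sorting by confidence is only a convenience here. Because some high confidence results may still be misrecognitions, the whole carpet stays visible to the user rather than being cut off at a threshold.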

Page 13: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Word session

• Shows words that contain low confidence characters

• Shows words that contain characters identified as suspicious in the Character Session

• Shows original OCR recognition results and possible spelling options

• Users can validate the appropriate spelling option or provide their own spelling
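A minimal sketch of the selection logic described for the word session: a word is queued for review if any of its characters has low confidence or was flagged earlier, and spelling options are drawn from a word list. The `OcrWord` record, the 0.8 threshold and the use of Python's `difflib` for suggestions are all assumptions for illustration; the slides do not say how CONCERT generates its spelling options.

```python
import difflib
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str                      # OCR recognition result for the word
    char_confidences: list[float]  # one confidence per character
    flagged: bool = False          # True if a character was marked suspicious


def needs_review(word: OcrWord, threshold: float = 0.8) -> bool:
    """A word is shown if it was flagged or has a low confidence character."""
    return word.flagged or any(c < threshold for c in word.char_confidences)


def spelling_options(word: OcrWord, lexicon: list[str], n: int = 3) -> list[str]:
    """Suggest up to n lexicon entries that are close to the OCR result."""
    return difflib.get_close_matches(word.text, lexicon, n=n, cutoff=0.6)


# Hypothetical lexicon and a word whose last character was poorly recognised.
lexicon = ["library", "liberty", "national"]
word = OcrWord("librarv", [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.4])
if needs_review(word):
    print(word.text, "->", spelling_options(word, lexicon))
```

For historical texts a modern word list would miss period spellings, which is where the language and name dictionaries produced by IMPACT would come in.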

Page 14: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Page Session

• Used primarily where a segmentation failure has led to a word being misrecognised or not recognised at all

• Text can be shown in a variety of segmentation views: word, line, paragraph or tag

• System can be automated to move from one problematic word to the next

Page 15: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

System demonstration

http://fue.onb.ac.at/impact/gwsw/vid/EE1_showcase.html

Screencast created by Gerd Zechmeister of the Austrian National Library

Page 16: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Development plan

• User/job monitoring

– Introduction of planned errors to validate operator progress

• Productivity and quality benchmarking

• Correction and validation system for “document dictionaries”

– Each corrected book produces a unique dictionary that can be reused on other works

– Will include general language and name dictionaries being produced by IMPACT

• Distributed workflow management tool and administrative session

• Greater number of output formats for OCR results and corrected results

• “Superkey” session

– Character validation by creation of “ideal” character patterns
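The "planned errors" idea from the monitoring item can be illustrated with a small sketch: known errors are planted in an operator's job, and the share of them that the operator flags gives a rough quality score. The function name and the use of integer item identifiers are assumptions for this example; the slides only state the intention.

```python
def seeded_error_catch_rate(seeded_ids: set[int], flagged_ids: set[int]) -> float:
    """Share of deliberately planted errors that the operator flagged."""
    if not seeded_ids:
        raise ValueError("no seeded errors to score against")
    return len(seeded_ids & flagged_ids) / len(seeded_ids)


# Hypothetical job: four planted errors, of which the operator caught three.
seeded = {10, 42, 77, 105}
flagged = {10, 42, 105, 300}
print(seeded_error_catch_rate(seeded, flagged))
```

Items the operator flagged that were not planted (300 in this sample) are ignored by this score; they are ordinary corrections and would be assessed separately.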

Page 17: BL Demo Day - July2011 - (5) OCR for IMPACT Part 2

Alternatives to OCR and post-correction: Word Spotting

Alternative technique for indexing historical documents

After word segmentation, relevant words are detected and highlighted

Key words can be person and location names (e.g. taken from the Named Entities Registry)
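The slide names word spotting without giving a method. One approach often used for it compares word images directly, for example by matching their column ink profiles with dynamic time warping (DTW). The sketch below assumes that approach and is not a description of the IMPACT tools; all function names and the tiny binary "images" are hypothetical.

```python
def column_profile(image: list[list[int]]) -> list[int]:
    """Count ink pixels (1s) in each column of a binary word image."""
    return [sum(column) for column in zip(*image)]


def dtw_distance(a: list[int], b: list[int]) -> float:
    """Dynamic time warping distance between two profiles."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def spot(query: list[list[int]], candidates: dict[str, list[list[int]]],
         max_distance: float) -> list[str]:
    """Return the names of candidate word images close to the query image."""
    query_profile = column_profile(query)
    return [name for name, image in candidates.items()
            if dtw_distance(query_profile, column_profile(image)) <= max_distance]


# Toy data: word_a is a horizontally stretched copy of the query shape.
query = [[1, 0, 1],
         [1, 1, 1]]
candidates = {
    "word_a": [[1, 1, 0, 1],
               [1, 1, 1, 1]],
    "word_b": [[0, 0, 0],
               [1, 0, 0]],
}
print(spot(query, candidates, max_distance=1))
```

Because DTW tolerates stretching along the writing direction, the wider `word_a` still matches the query, which is the property that makes this kind of matching attractive for historical print with uneven spacing.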