BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

IMPACT Interoperability and Evaluation Framework
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, British Library, 12 July 2011

Upload: impact-centre-of-competence

Post on 14-Jun-2015

DESCRIPTION

Slides from Clemens Neudecker's presentation on the IMPACT Interoperability and Evaluation Framework within the IMPACT project at the British Library Demo-day on the 12th July 2011.

TRANSCRIPT

Page 1: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

IMPACT Interoperability and Evaluation Framework
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, British Library, 12 July 2011

Page 2: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

OCR: A multitude of challenges…

I. OCR challenges (gothic fonts, bleed-through, warping, etc.)

Page 3: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

OCR: A multitude of challenges…

II. Language challenges (spelling variants, inflection, and many more!)

Example: historical variants of the Dutch word ‘wereld’ (world):

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
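One common way to cope with this kind of variation is a lexicon that maps attested historical forms to a modern lemma, so that a search for the modern word also finds the old spellings. The sketch below is illustrative only: the `VARIANTS` table holds a handful of forms from the list above, the sample sentence is made up, and `normalise` is a hypothetical helper rather than part of any IMPACT tool.

```python
# Hypothetical mini-lexicon: a few historical spellings from the slide,
# all mapped to the modern Dutch lemma 'wereld'.
VARIANTS = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def normalise(token: str) -> str:
    """Return the modern lemma for a known variant, else the token unchanged."""
    return VARIANTS.get(token.lower(), token)


# Made-up sample text; a query for the modern form should find both variants.
query = "wereld"
tokens = "De Waereld ende de werlt".split()
hits = [t for t in tokens if normalise(t) == query]
print(hits)  # ['Waereld', 'werlt']
```

A plain lookup table only covers forms that have already been collected; unseen variants would need approximate matching or rewrite rules on top of it.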

Page 4: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

And a multitude of solutions!

22 different ‘tools’ from diverse developers:

OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc. + 3rd party software!

“One ring to rule them all...”

IMPACT Interoperability Framework (IIF)

Page 5: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Main requirements

Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability

Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent

Page 6: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Architecture

IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine

IMPACT Evaluation Framework:
- Dataset: approx. 5 TB raw data (images, text files, metadata) and growing
- Ground truth transcriptions
- Evaluation modules

Page 7: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Components I: IIF Enterprise Service Bus

Receives (SOAP) requests from users and distributes the load to the available worker nodes.

Main effects: process parallelization, load distribution, fail-over
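The load distribution and fail-over behaviour described here can be sketched in a few lines. This is a toy model under stated assumptions, not the IIF implementation: workers are plain Python functions instead of SOAP endpoints, requests are handled one at a time (no real parallelism), and the `Dispatcher` class and the three `node_*` functions are hypothetical names.

```python
import itertools
from typing import Callable, List

Worker = Callable[[str], str]


class Dispatcher:
    """Toy stand-in for a service bus: round-robin over workers with fail-over."""

    def __init__(self, workers: List[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = workers
        self._next = itertools.cycle(range(len(workers)))

    def handle(self, request: str) -> str:
        # Start at the next worker in rotation and try each worker at most once.
        start = next(self._next)
        for offset in range(len(self._workers)):
            worker = self._workers[(start + offset) % len(self._workers)]
            try:
                return worker(request)
            except ConnectionError:
                continue  # fail over to the next node
        raise RuntimeError("no worker node available")


def node_a(request: str) -> str:
    return f"A:{request}"


def node_b(request: str) -> str:
    raise ConnectionError("node B is down")  # simulated outage


def node_c(request: str) -> str:
    return f"C:{request}"


bus = Dispatcher([node_a, node_b, node_c])
results = [bus.handle(f"page{i}") for i in range(4)]
print(results)  # ['A:page0', 'C:page1', 'C:page2', 'A:page3']
```

The second request is routed to node B first, fails, and is transparently retried on node C, which is the fail-over effect; the rotation itself is the load distribution.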

Page 8: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Framework integration

Easy to use generic command line wrapper (open source)
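The idea of a generic command line wrapper can be illustrated as follows: a tool is described once as a command template, and callers only supply parameters. This is a minimal sketch, not the IMPACT wrapper itself; `run_tool` and the template format are assumptions made for illustration, and the Python interpreter stands in for a real OCR or image-processing tool.

```python
import subprocess
import sys
from typing import Dict, List


def run_tool(template: List[str], params: Dict[str, str], timeout: float = 60.0) -> str:
    """Fill a command template with parameters, run it, and return its stdout."""
    command = [part.format(**params) for part in template]
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    return completed.stdout


# Stand-in "tool": the Python interpreter upper-casing its first argument.
template = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{text}"]
output = run_tool(template, {"text": "impact"})
print(output.strip())  # IMPACT
```

Passing the command as a list rather than a shell string avoids quoting problems, and `check=True` turns a non-zero exit code into an exception that a service layer could report back to the caller.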

Page 9: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Workflow development

OCR workflow = data pipeline

Building blocks = processing steps (nodes)

Integration = interaction between nodes (mashup)
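The view of a workflow as a data pipeline, where the output of one node feeds the next, can be sketched as plain function composition. The step names below (`binarise`, `segment`, `recognise`) are hypothetical placeholders that only tag a string; in the framework each node would be a web service, and Taverna rather than Python would wire them together.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[str], str]


def pipeline(steps: List[Step]) -> Step:
    """Chain steps so that the output of each node becomes the input of the next."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


# Placeholder nodes; real steps would call image-processing or OCR tools.
def binarise(doc: str) -> str:
    return doc + " -> binarised"


def segment(doc: str) -> str:
    return doc + " -> segmented"


def recognise(doc: str) -> str:
    return doc + " -> recognised"


ocr_workflow = pipeline([binarise, segment, recognise])
print(ocr_workflow("page.tif"))
# page.tif -> binarised -> segmented -> recognised
```

Because every node has the same shape, swapping one building block for an alternative only means changing one entry in the list.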

Page 10: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Workflow management

- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: project website
- API: SOAP/REST

Page 11: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Community

Web 2.0 style workflow registry

Community of experts

Sharing of resources

Knowledge exchange

A central meeting point for users and researchers

Page 12: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Components II: Dataset

Database and front end, hosted at the PRIMA research group at University of Salford, School of Computing, United Kingdom

- more than 500,000 images from Digital Libraries
- more than 50,000 ground truth representations
- up to 10,000 direct access calls per month
- 4 TB of space and growing

Page 13: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Dataset

Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities

Page 14: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Evaluation features

- Text based comparison of result with ground truth, using Levenshtein distance method
- Layout based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) Framework
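For the text based comparison, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn the OCR result into the ground truth. The sketch below uses the standard dynamic-programming formulation; the `character_accuracy` helper is an assumption showing one common way to turn the distance into a score, and the slides do not say which normalisation the IMPACT evaluation modules use.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr: str, truth: str) -> float:
    """Assumed convention: 1 - distance / len(truth), floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(ocr, truth) / len(truth))


print(levenshtein("werelt", "wereld"))   # 1
print(levenshtein("kitten", "sitting"))  # 3
print(character_accuracy("vvereld", "wereld"))
```

Keeping only two rows of the distance table keeps memory use proportional to the length of the shorter input line, which matters when whole pages are compared.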

Page 15: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

The PAGE Format Framework

Two-level architecture:
– root structure
– task specific sub-formats

Separate XML Schema definitions

Format identification via Namespaces

Mapping of
– dependencies
– process chains
– alternative processing steps

Linking via IDs

Processing results or ground truth (e.g. binarisation, dewarping, page content)

[Diagram: one PAGE root (XML) document linked to several PAGE gts (XML) documents]

Page 16: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Ground-Truthing Tools

Aletheia

FineReader

PAGE Exporter

GT Validator

GT Normalizer

Page 17: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Profile ‘Full Text Recognition’

Measure Weights      Region Type Weights
Merge                Text
Allowable Merge      Image
Split                Graphic
Allowable Split      Chart
Miss                 Table
Partial Miss         Separator
Misclassification    Maths
False Detection      Noise

Weight values in the order they appear on the slide (the assignment of values to rows is not recoverable from the transcript): 1.5, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5

Evaluation for general text recognition
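How a profile of this kind can influence a score is sketched below. Everything numeric here is an assumption: the weights are placeholders, not the values of the ‘Full Text Recognition’ profile, and multiplying the affected area by a measure weight and a region type weight is only one plausible combination rule, not necessarily the one used by the evaluation modules.

```python
from typing import Dict, List, Tuple

# Placeholder weights for illustration only; NOT the values from the slide.
MEASURE_WEIGHTS: Dict[str, float] = {
    "merge": 1.0,
    "allowable_merge": 0.0,
    "split": 1.0,
    "miss": 2.0,
}
REGION_WEIGHTS: Dict[str, float] = {"text": 1.0, "image": 0.5, "noise": 0.0}

# One detected error: (measure, region type, affected area in pixels).
Error = Tuple[str, str, float]


def weighted_error(errors: List[Error]) -> float:
    """Sum the error areas, scaled by measure weight and region type weight."""
    return sum(
        MEASURE_WEIGHTS[measure] * REGION_WEIGHTS[region] * area
        for measure, region, area in errors
    )


errors: List[Error] = [
    ("miss", "text", 100.0),            # 2.0 * 1.0 * 100 = 200
    ("merge", "image", 40.0),           # 1.0 * 0.5 * 40  = 20
    ("split", "noise", 30.0),           # 1.0 * 0.0 * 30  = 0
    ("allowable_merge", "text", 50.0),  # 0.0 * 1.0 * 50  = 0
]
print(weighted_error(errors))  # 220.0
```

A weight of zero removes an error class or region type from the score entirely, which is how a profile can focus the evaluation on what matters for a given use case such as text recognition.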

Page 18: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Measures – Segmentation Errors

[Figure: ground truth regions compared with the segmentation result, illustrating the error types Merge, Split, Miss, Partial Miss and Misclassification; region labels shown: Paragraph, Caption]
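The error types named on this slide can be approximated by comparing region overlaps. The sketch below is a deliberate simplification under stated assumptions: regions are axis-aligned boxes rather than polygons, any non-zero overlap counts, misclassification and allowable merges or splits are not modelled, and `classify_region` and `is_merge` are hypothetical helpers, not the PRIMA evaluation code.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def area(box: Box) -> int:
    return (box[2] - box[0]) * (box[3] - box[1])


def overlap(a: Box, b: Box) -> int:
    """Area of the intersection of two boxes, or 0 if they do not intersect."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0


def classify_region(truth: Box, results: List[Box]) -> str:
    """Label one ground-truth region as miss, split, partial miss or ok."""
    hits = [r for r in results if overlap(truth, r) > 0]
    if not hits:
        return "miss"
    if len(hits) > 1:
        return "split"
    if overlap(truth, hits[0]) < area(truth):
        return "partial miss"
    return "ok"


def is_merge(result: Box, truths: List[Box]) -> bool:
    """A result region that touches more than one ground-truth region."""
    return sum(1 for t in truths if overlap(result, t) > 0) > 1


paragraph: Box = (0, 0, 100, 50)
caption: Box = (0, 60, 100, 80)

print(classify_region(paragraph, []))                                  # miss
print(classify_region(paragraph, [(0, 0, 100, 30)]))                   # partial miss
print(classify_region(paragraph, [(0, 0, 100, 20), (0, 25, 100, 50)]))  # split
print(is_merge((0, 0, 100, 80), [paragraph, caption]))                 # True
```

Split and miss are judged from the point of view of a ground-truth region, while merge is judged from the point of view of a result region, which is why the two checks are separate functions.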

Page 19: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

OCR Accuracy

Page 20: BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework

Thank you! Questions?