IMPACT HPC Cloud Day




DESCRIPTION

Scalable and sustainable: OCR & document image analysis in the cloud. New Trends in Humanities Computing, HPC Cloud day, 4 October 2011, Amsterdam, Netherlands.

TRANSCRIPT

Page 1: IMPACT HPC Cloud Day

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

A sustainable infrastructure for large scale document image analysis

HPC Cloud day – 4 October

Page 2: IMPACT HPC Cloud Day


Background

IMPACT – Improving Access to Text (2008–2011)

Large-scale integrating research project, funded by the EC

– Consortium of 26 partners

– Coordinated by the National Library of the Netherlands (KB)

– EU funding: € 12 100 000 (FP7 ICT Work Programme)

– From 2012: sustainable Centre of Competence with alternative resources

Main objectives:

- Innovate OCR technology

- Capacity building in mass-digitisation

Page 3: IMPACT HPC Cloud Day


VVt Venetien den 1.Junij, Anno 1618.

DJgn i f paffato te S' aö'Jifeert mo?üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te /sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .metbeSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe

OCR: A multitude of challenges…

Page 4: IMPACT HPC Cloud Day


OCR: A multitude of challenges…

I. OCR challenges (gothic fonts, bleed-through, warping, etc.)

Page 5: IMPACT HPC Cloud Day


OCR: A multitude of challenges…

II. Language challenges (spelling variants, inflection, and many more!)

Example: historical variants of the Dutch word ‘wereld’ (world):

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
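To illustrate why such variants matter for search, here is a minimal sketch of a lexicon lookup that maps historical spellings to a modern lemma. The variants come from the list above; the dictionary layout and the function names `normalise` and `expand_query` are assumptions for illustration, not the IMPACT lexicon tools themselves.

```python
from typing import Dict, List

# A few variants copied from the slide; a real lexicon would be far larger.
VARIANTS: List[str] = ["werelt", "weerelt", "waereld", "vvereld", "werlt", "wareld"]

# Hypothetical lexicon layout: historical spelling -> modern lemma.
LEXICON: Dict[str, str] = {variant: "wereld" for variant in VARIANTS}


def normalise(token: str) -> str:
    """Return the modern lemma for a historical token, or the token unchanged."""
    return LEXICON.get(token.lower(), token)


def expand_query(lemma: str) -> List[str]:
    """List every spelling a full-text search should match for a modern lemma."""
    matches = {variant for variant, target in LEXICON.items() if target == lemma}
    return sorted(matches | {lemma})


if __name__ == "__main__":
    print(normalise("Weerelt"))
    print(expand_query("wereld"))
```

With a mapping like this, a query for the modern word can be expanded to all attested spellings, so that a search in OCR text also finds the historical forms.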

Page 6: IMPACT HPC Cloud Day


IMPACT Solutions

From a technical perspective:

More than 20 software toolkits for solving different problems, such as:

- OCR (C++, C#)

- Image Processing & Lexica (DLL)

- Command Line Tools (Win/Linux)

- Java, Ruby, PHP, Perl, etc.

IMPACT Interoperability Framework (IIF)

Page 7: IMPACT HPC Cloud Day


Architecture

IMPACT Interoperability Framework: Technologies

- Java 6

- Generic Web Service Wrapper

- Apache Maven

- Apache Tomcat

- Apache Axis2

- Apache Synapse

- Taverna Workflow Engine

IMPACT Interoperability Framework: Dataset

- PHP/MySQL database with a search frontend

- approx. 5 TB raw data (images, text files, metadata) and growing

Page 8: IMPACT HPC Cloud Day


How does it work?

1. Digitisation/OCR challenges are registered and tagged in the database

Example: warped text

2. The database contains a 99.95% correct result: the “ground truth”

Page 9: IMPACT HPC Cloud Day


How does it work?

3. A researcher develops a new method to tackle a problem

4. The research prototype is wrapped as a SOAP web service
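As a rough illustration of step 4, the sketch below shows the core of what a generic wrapper for a command-line prototype has to do: fill a command template with request parameters, run the tool, and collect the result. It is a minimal sketch under stated assumptions. The template syntax, the function names `build_command` and `run_tool`, and the tool name `ocr-tool` are hypothetical, and the SOAP layer itself is left out.

```python
import shlex
import subprocess
from typing import Dict, List


def build_command(template: str, params: Dict[str, str]) -> List[str]:
    """Fill a template such as 'ocr-tool -i {input} -o {output}' (hypothetical tool).

    The template is split first and filled afterwards, so a parameter value
    containing spaces still ends up as a single argument.
    """
    return [part.format(**params) for part in shlex.split(template)]


def run_tool(template: str, params: Dict[str, str], timeout: float = 60.0) -> dict:
    """Run the wrapped tool and return the fields a service response could carry."""
    completed = subprocess.run(
        build_command(template, params),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


if __name__ == "__main__":
    print(build_command("ocr-tool -i {input} -o {output}",
                        {"input": "page 1.tif", "output": "page1.txt"}))
```

Passing the command as an argument list rather than a shell string avoids quoting problems when file names come from user requests.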

Page 10: IMPACT HPC Cloud Day


How does it work?

5. Web service is integrated as a workflow module

6. Workflow module can be evaluated, based on the ground truth
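One common way to evaluate OCR output against ground truth is character accuracy based on edit distance. The sketch below assumes that metric; the slides do not say which measures the IMPACT evaluation actually uses, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters reproduced correctly, clamped to [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    print(character_accuracy("vverelt", "wereld"))
```

Running the same measure before and after a new workflow module is added gives a simple, repeatable comparison on identical input.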

Page 11: IMPACT HPC Cloud Day


Current setup

Enterprise Service Bus: receives requests from users and distributes the load to the available worker nodes (worker node = server with all services installed)

Main effects: process parallelization, load distribution, failover

Drawback: data is sent to worker nodes all around Europe, so a huge amount of data needs to travel over the network
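The load distribution and failover behaviour described above can be sketched as a round-robin dispatcher that moves on to the next node when one is unreachable. This is an illustrative model only: the `Dispatcher` class is hypothetical, workers are plain Python callables standing in for remote nodes, and it does not reflect how Apache Synapse is configured in the project.

```python
import itertools
from typing import Callable, List, Optional

Worker = Callable[[str], str]  # stand-in for a call to a remote worker node


class Dispatcher:
    """Round-robin load distribution with failover across worker nodes."""

    def __init__(self, workers: List[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker node is required")
        self._workers = workers
        self._indices = itertools.cycle(range(len(workers)))

    def submit(self, job: str) -> str:
        """Send the job to the next node; on failure, try each other node once."""
        last_error: Optional[Exception] = None
        for _ in range(len(self._workers)):
            worker = self._workers[next(self._indices)]
            try:
                return worker(job)
            except ConnectionError as error:
                last_error = error  # node unreachable: fail over to the next one
        raise RuntimeError("all worker nodes failed") from last_error


if __name__ == "__main__":
    def node_a(job: str) -> str:
        return "node-a processed " + job

    def node_b(job: str) -> str:
        raise ConnectionError("node-b is down")

    bus = Dispatcher([node_b, node_a])
    print(bus.submit("page-0001.tif"))
```

Note that the sketch only models where a job runs. The drawback named on the slide, shipping image data to distant nodes, is unaffected by the dispatch strategy.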

Page 12: IMPACT HPC Cloud Day


Proposed setup

Set up worker nodes on the HPC cloud (same location)

Advantage:

- Improve speed and availability for concurrent users

- Remove constraints for large-scale processing
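To get a feeling for why co-locating data and worker nodes matters, the back-of-the-envelope sketch below estimates how long it takes to move a dataset over a link of a given speed. The 5 TB figure is the dataset size mentioned earlier in the talk; the link speeds are illustrative assumptions, not measurements from the project, and protocol overhead is ignored.

```python
def transfer_hours(size_terabytes: float, megabits_per_second: float) -> float:
    """Idealised transfer time in hours (decimal terabytes, no protocol overhead)."""
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed link speeds, chosen purely for illustration.
    for label, speed in [("100 Mbit/s wide-area link", 100),
                         ("10 Gbit/s local link", 10_000)]:
        print(f"{label}: {transfer_hours(5, speed):.1f} h for 5 TB")
```

Under these assumed speeds the same 5 TB takes days over the slower link and about an hour over the faster one, which is the kind of gap the proposed setup aims to close.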

Page 13: IMPACT HPC Cloud Day


Benefits

- Scalable platform

- Availability of resources to a large number of users

- Enable research into scalable computing

- Consolidation of support and maintenance

- Various interfaces (web/local)