Collaborative Workflow Development and Experimentation in the Digital Humanities

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation Clemens Neudecker, KB @cneudecker Zeki Mustafa Dogan, SUB-DL Sven Schlarb, ÖNB @SvenSchlarb Juan Garcés, GCDH @juan_garces eHumanities Seminar 2012 University of Leipzig 10-10-2012

Upload: cneudecker

Post on 15-Jun-2015


DESCRIPTION

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities. 2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.

TRANSCRIPT

Page 1: Collaborative Workflow Development and Experimentation in the Digital Humanities

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation

Clemens Neudecker, KB @cneudecker

Zeki Mustafa Dogan, SUB-DL

Sven Schlarb, ÖNB @SvenSchlarb

Juan Garcés, GCDH @juan_garces

eHumanities Seminar 2012

University of Leipzig

10-10-2012


Idea

• Provide web-based versions of tools (web services)

• Package web services, data and documentation into ready-to-run “components” (encapsulation)

• Chain the components to create workflows via drag-and-drop operation

• Share and use workflows to re-run experiments and to demonstrate results


Background

• High degree of diversity in research topics, but also tools and frameworks being used

• Technical resources should be easy to use, well documented, accessible from anywhere

• Prevent re-inventing of the wheel


Requirements

• Interoperability = connect different resources

• Flexibility = easy to deploy and adapt

• Modularity = allow different combinations of tools

• Usability = simple to use for non-technical users

• Re-usability = easy to share with others

• Scalability = apt for large-scale processing

• Sustainability = resources simple to preserve

• Transparency = tools evaluated separately

• Distributed development and deployment


Interoperability Framework (IIF)

• Modules:

- Java Wrapper for command line tools

- Web Services (incl. format converters)

- Taverna Workflow Engine

- Client interfaces

- Repository connectors


Sources

https://github.com/impactcentre/interoperability-framework


IIF Command Line Wrapper

• Java project, builds using Maven2

• Creates a web service project from a given tool description (XML)

• Web service exposes SOAP & REST endpoints and Java API interface

• Requirements: command line call, no direct user interaction
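The wrapping idea can be illustrated with a minimal sketch. Python is used here only for brevity; the IIF wrapper itself is a Java project that generates a web service from an XML tool description. The dictionary below is a hypothetical stand-in for such a description and does not follow the IIF schema. The wrapped "tool" is a one-line Python command, chosen so that the example runs anywhere.

```python
import subprocess
import sys

# Hypothetical tool description (the real wrapper reads an XML file).
# "{input}" is a placeholder that is filled in for each call.
TOOL_DESCRIPTION = {
    "name": "upper",
    "command": [
        sys.executable,
        "-c",
        "import sys; print(sys.argv[1].upper())",
        "{input}",
    ],
}


def run_tool(description, **params):
    """Fill the command template and run the tool without user interaction."""
    argv = [part.format(**params) for part in description["command"]]
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


result = run_tool(TOOL_DESCRIPTION, input="ocr")
print(result)
```

The sketch reflects the two stated requirements: the tool is started by a single command line call, and it never prompts the user.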


IIF Web Services

• Web services are described by a WSDL

• Input/output data structures

• Data is referenced by URL

• Annotations

• Default values
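As a rough illustration of an input structure with URL references and default values, the sketch below models a request in Python. The class name, field names and defaults are assumptions made for this example; they are not taken from an IIF WSDL.

```python
from dataclasses import dataclass, asdict
from urllib.parse import urlparse


@dataclass
class OcrRequest:
    """Hypothetical input structure; all field names are illustrative."""
    input_url: str              # data is passed by reference (URL), not by value
    language: str = "eng"       # default used when the caller omits the field
    output_format: str = "txt"  # default used when the caller omits the field

    def __post_init__(self):
        # Reject anything that is not an absolute http(s) URL.
        parsed = urlparse(self.input_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {self.input_url!r}")


request = OcrRequest(input_url="http://example.org/images/00000001.jp2")
print(asdict(request))
```

Passing a URL instead of the file content keeps the request small, which matters when the referenced data are large page images.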


REST


SOAP


IIF Workflows

• What is a workflow? (Yahoo Pipes, etc.)

• Different kinds of workflows: for a single command, application, chain of processes

• Main benefit: Encapsulation, Reuse

• Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource

• Web 2.0 workflow registry: myExperiment
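The notion of a workflow as a ready-to-use "component" can be sketched as a small record holding the three parts named above. The class and the example URLs are hypothetical and only serve to make the bundling concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical bundle of the three parts named on the slide."""
    endpoint: str        # link to the web service endpoint
    sample_input: str    # URL of sample input data
    documentation: str   # short usage notes

    def is_ready_to_use(self) -> bool:
        # A component counts as ready only if every part is filled in.
        return all([self.endpoint, self.sample_input, self.documentation])


ocr = Component(
    endpoint="http://example.org/services/ocr?wsdl",
    sample_input="http://example.org/samples/page.tif",
    documentation="Runs OCR on one page image and returns plain text.",
)
print(ocr.is_ready_to_use())
```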


Why workflows?

• “In-silico experimentation”

• Good structuring of experiment setup:

– Challenge/Research question

– Dataset definition

– Processing with algorithms

– Evaluation/Provenance

– Presentation of results

• All this can be modelled into a workflow
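A minimal sketch of this idea, assuming that each processing step is a plain function: the steps are chained, and every intermediate result is recorded as simple provenance. The step names and functions are placeholders and have no counterpart in Taverna or the IIF.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def run_workflow(steps: List[Step], data: str):
    """Feed each step the previous step's output and record provenance."""
    provenance = []
    for name, func in steps:
        data = func(data)
        provenance.append((name, data))
    return data, provenance


# Placeholder steps standing in for real services or tools.
steps = [
    ("normalise", str.strip),
    ("lowercase", str.lower),
    ("tokenise", lambda text: " | ".join(text.split())),
]

result, provenance = run_workflow(steps, "  Digital Humanities  ")
print(result)
for name, output in provenance:
    print(f"{name}: {output!r}")
```

Because the step list is data, the same experiment can be re-run on another dataset or shared with others without changing the runner.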


Integration into Taverna

• Web Services (SOAP and REST)

• Command line tools (SH and SSH)

• Beanshells (can import Java libraries)

• R (statistics)

• Excel, CSV

• Additional service types can be added through dedicated plug-ins


Taverna flavours

• Workbench – local GUI client for Linux, Windows, OSX

• Command line tool – run workflows from the command line

• Server – Webapp with REST API and Java/Ruby client libs

• Web-Wf-Designer – Javascript version for designing workflows in a browser


Workbench


Webapp


Workflow registry


Client interfaces

• Web service client: create a simple HTML form from a given web service description

• Taverna client: create a simple HTML form from a given Taverna workflow description

• Integration into production and presentation environments via iframes
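How a form might be derived from a service description can be sketched as follows. The input here is a hypothetical list of (parameter name, default value) pairs; the actual clients read a web service description or a Taverna workflow description, which this example does not parse.

```python
from html import escape


def build_form(action_url, parameters):
    """Render a minimal HTML form with one text input per parameter."""
    lines = [f'<form action="{escape(action_url, quote=True)}" method="post">']
    for name, default in parameters:
        safe_name = escape(name, quote=True)
        safe_default = escape(default, quote=True)
        lines.append(
            f'  <label>{safe_name} '
            f'<input name="{safe_name}" value="{safe_default}"></label>'
        )
    lines.append('  <input type="submit" value="Run">')
    lines.append("</form>")
    return "\n".join(lines)


form = build_form(
    "http://example.org/service",
    [("input_url", ""), ("language", "eng")],
)
print(form)
```

A page generated this way can be embedded in another site with an iframe, which is the integration route named above.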


WS-client


T2-client


Repositories

• Accessible via web service API:

– Fedora Commons

– WebDAV

– PRImA


Architecture


Examples

• Use case 1: OCR (IMPACT)

• Start: Images (scanned documents)

• Processing: OCR, NLP, Evaluation

• Result: Full text, Entities, Sentiments


Examples

• Use case 2: Preservation (SCAPE)

• Start: Document collection preparation

• Processing: Hadoop, Hive

• Result: Statistics


Reading image metadata

• Figure: find lists the JP2 image files on the NAS, e.g. /NAS/Z119585409/00000001.jp2, /NAS/Z119585409/00000002.jp2, /NAS/Z119585409/00000003.jp2, …

• Components: Jp2PathCreator, HadoopStreamingExiftoolRead

• Output: one line per page with identifier and value, e.g. Z119585409/00000001 2345, Z119585409/00000002 2340, Z119585409/00000003 2543, …

• Data sizes shown: 1,4 GB and 1,2 GB

• 60.000 books, 24 Million pages: ~ 5 h + ~ 38 h = ~ 43 h
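The listing for this step pairs NAS paths such as /NAS/Z119585409/00000001.jp2 with records such as Z119585409/00000001 2345. A minimal sketch of how such a record could be derived from a path is given below. The function names are hypothetical, and the numeric value is passed in rather than read with Exiftool, so the example runs without the image files.

```python
from pathlib import PurePosixPath


def page_id(path: str) -> str:
    """Turn '/NAS/<book>/<page>.jp2' into '<book>/<page>'."""
    p = PurePosixPath(path)
    return f"{p.parent.name}/{p.stem}"


def format_record(path: str, value: int) -> str:
    """Build one output line: page identifier, a space, then the value."""
    return f"{page_id(path)} {value}"


record = format_record("/NAS/Z119585409/00000001.jp2", 2345)
print(record)
```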


Sequence file creation

• Figure: find lists the HTML files on the NAS, e.g. /NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, /NAS/Z119585409/00000709.html, …

• Components: HtmlPathCreator, SequenceFileCreator

• Records: Z119585409/00000707, Z119585409/00000708, Z119585409/00000709, Z119585409/00000710, Z119585409/00000711, Z119585409/00000712

• Data sizes shown: 1,4 GB and 997 GB (uncompressed)

• 60.000 books, 24 Million pages: ~ 5 h + ~ 24 h = ~ 29 h
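The stated run times can be turned into a rough throughput figure. The sketch below assumes that each total (about 43 h for reading image metadata, about 29 h for sequence file creation) covers all 24 million pages; the slides do not state this explicitly, so the results are only an order-of-magnitude estimate.

```python
PAGES = 24_000_000  # "24 Million pages" as given on the slides

# Approximate wall-clock hours per step, as given on the slides.
HOURS = {
    "reading image metadata": 43,   # ~ 5 h + ~ 38 h
    "sequence file creation": 29,   # ~ 5 h + ~ 24 h
}


def pages_per_second(pages: int, hours: float) -> float:
    """Average number of pages handled per second over the whole run."""
    return pages / (hours * 3600)


throughput = {step: pages_per_second(PAGES, h) for step, h in HOURS.items()}
for step, rate in throughput.items():
    print(f"{step}: about {rate:.0f} pages/s")
```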


HTML parsing

• Figure: HadoopAvBlockWidthMapReduce reads a SequenceFile and writes a Textfile

• Input records: Z119585409/00000001, Z119585409/00000002, Z119585409/00000003, Z119585409/00000004, Z119585409/00000005, ...

• Map: several values per page, e.g. Z119585409/00000001 2100, Z119585409/00000001 2200, Z119585409/00000001 2300, Z119585409/00000001 2400 (likewise for pages 00000002 to 00000005)

• Reduce: one value per page, e.g. Z119585409/00000001 2250 (likewise for pages 00000002 to 00000005)

• 60.000 books, 24 Million pages: ~ 6 h
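The example values for this step (2100, 2200, 2300 and 2400 in the map output, 2250 after the reduce) are consistent with an arithmetic mean, which also fits the component name HadoopAvBlockWidthMapReduce. The sketch below reproduces that pattern in plain Python. It is not the Hadoop job itself, and the function names are made up for this example.

```python
from collections import defaultdict


def map_page(page_id, block_widths):
    """Map phase: emit one (page id, width) pair per text block."""
    for width in block_widths:
        yield page_id, width


def reduce_average(pairs):
    """Reduce phase: group the pairs by page id and average the widths."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Sample values taken from the slide.
pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}

pairs = [pair for pid, widths in pages.items() for pair in map_page(pid, widths)]
averages = reduce_average(pairs)
print(averages)
```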


Analytic Queries

• Components: HiveLoadExifData & HiveLoadHocrData

• 60.000 books, 24 Million pages: ~ 6 h

• Table jp2width:

jid                  jwidth
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250

CREATE TABLE jp2width(jid STRING, jwidth INT)

• Table htmlwidth:

hid                  hwidth
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700

CREATE TABLE htmlwidth(hid STRING, hwidth INT)


Analytic Queries: HiveSelect

• 60.000 books, 24 Million pages: ~ 6 h

• Tables jp2width (jid, jwidth) and htmlwidth (hid, hwidth) are joined on the page identifier:

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

• Result:

jid                  jwidth  hwidth
Z119585409/00000001  2250    1870
Z119585409/00000002  2150    2100
Z119585409/00000003  2125    2015
Z119585409/00000004  2125    1350
Z119585409/00000005  2250    1700
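The join can be tried out locally with the five sample rows from the slide. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so the column types are SQLite's (TEXT, INTEGER) rather than Hive's (STRING, INT). An ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows copied from the slide.
conn.executemany(
    "INSERT INTO jp2width VALUES (?, ?)",
    [
        ("Z119585409/00000001", 2250),
        ("Z119585409/00000002", 2150),
        ("Z119585409/00000003", 2125),
        ("Z119585409/00000004", 2125),
        ("Z119585409/00000005", 2250),
    ],
)
conn.executemany(
    "INSERT INTO htmlwidth VALUES (?, ?)",
    [
        ("Z119585409/00000001", 1870),
        ("Z119585409/00000002", 2100),
        ("Z119585409/00000003", 2015),
        ("Z119585409/00000004", 1350),
        ("Z119585409/00000005", 1700),
    ],
)

rows = conn.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()

for row in rows:
    print(row)
```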


Examples

• Use case 3: Curation (GDZ)

• Start: Get documents from repository

• Processing: Enrichment (OCR, Entities, GeoNames)

• Result: Online presentation


ROPEN (= Resource Oriented Presentation ENvironment)


Scalability

• Multiple options:

- Service parallelization

- Cloud

- Grid

- Hadoop
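Of these options, service parallelization is the simplest to illustrate. The sketch below sends several independent requests at once with a thread pool. The function call_service is a hypothetical placeholder that returns a string instead of contacting a real endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def call_service(page_id: str) -> str:
    """Placeholder for a remote web service call (hypothetical)."""
    return f"processed:{page_id}"


page_ids = [f"Z119585409/{n:08d}" for n in range(1, 6)]

# Up to four requests run concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(call_service, page_ids))

print(results)
```

Threads suit this case because the work is dominated by waiting for network responses; CPU-heavy processing would instead be scaled on the server side, for example with Hadoop.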


Compatibility

• Taverna UIMA

• Taverna Galaxy

• Taverna Kepler

• Taverna Weblicht

• Taverna Seasr


But…

• Multi-layered approach increases complexity (debugging, maintenance)

• Diverse set of endpoints (OS, CPU, etc.)

• Multiple dependencies

• Shared responsibilities

• Authentication & Authorization

• Error handling / Fail-over / Monitoring


Demo(s)


Discussion

• Potential/use cases DH?

• Tools/features to make available?

• Questions, comments or remarks?


Thank you!