panacea wp3 the platform

32
PANACEA WP3 The Platform WP participants: UPF, ILC, ILSP, LG, DCU, ELDA Final Annual Review 19 th February 2013 Marc Poch, UPF ([email protected]) 1

Upload: pello

Post on 24-Feb-2016

54 views

Category:

Documents


0 download

DESCRIPTION

PANACEA WP3 The Platform. WP participants: UPF, ILC, ILSP, LG, DCU, ELDA Final Annual Review 19 th February 2013 Marc Poch, UPF ([email protected]). Summary. Objectives Platform components / Demo Achievements Functional platform - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: PANACEA WP3  The Platform

PANACEA WP3 The Platform

WP participants: UPF, ILC, ILSP, LG, DCU, ELDA

Final Annual Review19th February 2013

Marc Poch, UPF ([email protected])

1

Page 2: PANACEA WP3  The Platform

Summary

• Objectives• Platform components / Demo• Achievements

– Functional platform– Interoperability: Travelling Object, Common Interfaces,

format converters, etc.

– Scalability• WP7 Evaluation• Conclusions and future work

2

Page 3: PANACEA WP3  The Platform

Objectives Development of a platform (a space of interoperability

defined by standardized protocols and common interfaces) for the easy integration of a variety of software components, tools and methodologies deployed as web services to configure a factory for the automation of acquisition, processing and annotation of language resources.

3

WP3.1. (T1-T6) Architecture and design of the platform WP3.2 (T15-T30) Work Flow editor and engine WP3.3. (T7-T30) Common interfaces, middleware and temporal files, journaling, etc. WP3.4 (T15-T30) The Registry WP3.5 (T7-T30) Deployment of web services of the components supplied by WP4 to WP6

Page 4: PANACEA WP3  The Platform

4

Tools to be integrated

Web Service wrapper

The Registry

Common Interfaces

Format Converters

Workflow editor and

engine

Sharing workflows

From local toolsto

sharing workflows

Page 5: PANACEA WP3  The Platform

Clients: Java, Python, Perl, etc.

Platform tools and portals

6

JAX-WS, Axis, CXF,

etc.

Workflows Social NetworkRegistryWeb Services

Share tools(remotely run

distributed tools)

Share and find Web Services

Call / chain Web Services

Share and find workflows

SOAP or REST

Soaplab

Biocatalogue Tavernawww.taverna.org.uk

PANACEA Registry:registry.elda.org

PANACEA myExperiment:myexperiment.elda.org

myExperiment

PANACEA Platform: uses, adapts and improves myGrid tools for eScience (used in biology, social science, music, astronomy, multimedia and chemistry).

Page 6: PANACEA WP3  The Platform

Technological option:Web Services

SOAPLAB 2 (SOAP)

• Easy deployment of command line tools as WS. (Java, Python, C++, UIMA, etc. )

• Clients: Java, Python, Perl, Taverna, etc.

• No coding needed! Only metadata

• “Polling” techniques for long lasting tasks

• Web form to run the web services• URL input / output ready• PANACEA improvement for SOAP

messaging (network usage and memory)

• PANACEA limit multiple users

TAVERNA

BioCatalogue

Web Services

Workflow editor

Registry

Social network myExperiment

7

Page 7: PANACEA WP3  The Platform

Technological option:Registry

SOAPLAB 2 (SOAP)

• User friendly GUI• Free, open source, Continuously

maintained • Search function• Users rating (users feedback)• Service annotations and Language

Categorization (PANACEA)• Monitoring system (web service

status and data results)

TAVERNA

BioCatalogue

Web Services

Workflow editor

Registry

Social network myExperiment

8

Passe

d

Warning

Failed

Unche

cked

Page 8: PANACEA WP3  The Platform

Technological option:Taverna

SOAPLAB 2 (SOAP)

• User friendly GUI• Free and open source• Continuously maintained (v. 2.4) • SOAP and REST web services• Credentials manger (passwords,

certificates, etc.)• Multiple files processing (“lists”)• PANACEA Workflows, best

practises, videos, etc. :• Parallelization, Error recovery:

“retries”, Polling• PANACEA collaboration: bug fixing

and pre-release tests

TAVERNA

BioCatalogue

Web Services

Workflow editor

Registry

Social network myExperiment

9

Page 10: PANACEA WP3  The Platform

Demos• Previous Review:

– PANACEA Registry / PANACEA myExperiment– Run Web Services and Workflows– Design and merging of workflows in Taverna

• Final Review: Specific examples– Creation of a bilingual dictionary– Twitter NLP– Web cleaner and anonymizer– PANACEA Registry / PANACEA myExperiment

11

Page 11: PANACEA WP3  The Platform

Demos ICreation of a bilingual dictionary– http://myexperiment.elda.org/workflows/93

– Input: Pairs of Basic Xces Documents• English: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/1.xml

• French: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/191.xml

1. Sentence alignment: Hunalign (3rd party tool) Interoperability

2. PoS tagging: Treetagger (3rd party tool) Interoperability 3. Build phrase tables: Moses (3rd party tool) Interoperability

4. Bilingual dictionary extractor

Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_bilingual_dictionary_extraction_v01.mp4

12

Page 12: PANACEA WP3  The Platform

Demos IITwitter NLP + Registry(3rd party tool) • This web service is based on the Twitter NLP tool developed by

Noah's ARK group. • Noah's ARK group is Noah Smith's research group at the

Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

1. Search the WS in the Registry2. Check monitoring system3. Use web client with example data

13

Page 13: PANACEA WP3  The Platform

Demos IIIWeb cleaner and anonymizerhttp://myexperiment.elda.org/workflows/98

• Input: a list of URLs to process– Example: a web article from www.fifa.com

1. ILSP Web cleaner and text extractor WS2. UPF Anonymizer WS

– Internally calls Freeling NER WS (3rd party tool) Interoperability

14

Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_web_cleaner_and_anonymization_v01.mp4

Page 14: PANACEA WP3  The Platform

WP3 Achievements• Functional and Operational Platform

– Multiple tools, webs and features – Ready to use – Usability – Real Users

• Interoperability – Common Interfaces – Travelling Object – 3rd party tools Integration – Format converters

• Scalability – Web service scalability: long lasting tasks – Workflow design optimization: robustness – Machine resources: handling parallel requests

15

Page 15: PANACEA WP3  The Platform

Functional and Operational Platform

• PANACEA Registry – 157 web services PANACEA WS benefits: WS

are easy to deploy (low maintenance cost) – More than 1300 annotations Usability / Doc.– A cloud of 164 tags – Monitoring system: WS up and running 94.82%

since their deployment (97%) Availability• PANACEA myExperiment

– 74 shared workflows • Storage System Usability

16

Page 16: PANACEA WP3  The Platform

Functional and Operational Platform:

Tutorials and Documentation

17

• Tutorials • Specific and General tutorials • More than 12 videos Usability• Frequently Asked Questions

• Documentation • Registry annotations, tags and Categories • Common Interfaces documentation: xml, web, etc. • Travelling Objects documentation

Page 17: PANACEA WP3  The Platform

Functional and Operational Platform:

Users

18

• WP7 Validators• Linguatech (WP8)• Qualia (Business intelligence)

• CNGL (Centre for Next Generation Localisation)

• INCYTA (Translation)

• Master and Phd Students make use of the PANACEA platform

• http://ws02.iula.upf.edu/panacea/statistics/upf-statistics.html

Page 18: PANACEA WP3  The Platform

• Three levels of interoperability:– COMMUNICATION PROTOCOLS: Soap, Rest– DATA

– PARAMETERS

• Format N

Tool A

• Format M

Tool B

• Format L

Tool C

• Format N

Tool A

• emptyTool B

• emptyTool C

Interoperability

Tool B does not “understand” format N!All tools understand the previous format

Tool A

Tool B

ABCD

ABCD

Tool A

Tool B

YTQZ

ABCD

20

Page 19: PANACEA WP3  The Platform

Common Interface• A Common Interface (CI) defines the mandatory

parameters for every functionality:

PoS Tagger A

MANDATORY: inputlanguage

OPTIONALS:

Param A

PoS Tagger B

MANDATORY: inputlanguage

OPTIONALS:

PARAM 1PARAM 2

http://panacea-lr.eu/en/info-for-professionals/documents/http://registry.elda.org 21

Page 20: PANACEA WP3  The Platform

Travelling Object• The Travelling Object (TO) is the common data and

metadata format used in PANACEA to make components understand each other. (Interoperability)

• TO1 is the minimal common vertical in-line format used by the deployed tools since the first version of the platform using XCES standard

• TO2 GrAF standard: The Graph Annotation Format (Ide and Sudermam, 2007) is the XML serialization of LAF (ISO 24612, 2009)

• LMF for lexical resources• CONLL for parsers• Converters and adapted WS outputs 22

Page 21: PANACEA WP3  The Platform

Format Converters31 Format converters on the PANACEA Registry• Freeling to TO. CNR

http://registry.elda.org/services/207 • KAF to TO. CNR http://registry.elda.org/services/208• Basic Xces to txt. CNR http

://registry.elda.org/services/209 • PoS tag. (Freeling treetagger) to GrAF. UPF http://registry.elda.org/services/142 • Dependency parsing (Freeling) to GrAF. UPF http://registry.elda.org/services/197• Dependency CoNLL to GrAF. CNR http

://registry.elda.org/services/254 • Word doc to txt. UPF http://registry.elda.org/services/112 • In-house mwe to LMF. CNR http://registry.elda.org/services/296 • Pdf to text. UPF http://registry.elda.org/services/116 • Multi. encodings converter (ISO, UTF, etc.). UPF http://registry.elda.org/services/114 • Aligner to TO. DCU http://registry.elda.org/services/69 • Sentence alignment to TMX. DCU

http://registry.elda.org/services/219 • Treetagger to MOSES. DCU http://registry.elda.org/services/275 • UIMA to GrAF. ILSP http://registry.elda.org/services/182

• METASHARE metadata generators http://myexperiment.elda.org/workflows/96

23

Page 22: PANACEA WP3  The Platform

3rd party tools integration

• PANACEA WS wrapper (Soaplab) and the CI make it easy for WS Providers to integrate 3rd party tools.

• ILSP tools are UIMA tools UIMA• Freeling UPC• Treetagger University of Stuttgart• Twitter NLP Carnegie Mellon University• MALT Parser Uppsala University• DeSR Università di Pisa• MOSES / Giza++• DELiC4MT (MT evaluation) DCU• Berckeley tagger, parser, aligner Berkeley University California

24

Page 23: PANACEA WP3  The Platform

Web Services Scalability• Web services are being deployed using Soaplab 2.3.2:

– Service providers only need to use metadata (ACD) files Usability– Web client application to test WSs: Spinet Usability – PANACEA developers have been in contact with Soaplab developers Collaboration

– SOAP protocol standard Interoperability• WS can be called from Taverna or other workflow editors• WS can be called with many programming languages: Python,

Perl, Ruby, Java, etc.– Soaplab polling to avoid client timeouts Scalability– PANACEA Improvements Scalability

• Parallel request limit system • SOAP messaging optimization

25

Page 24: PANACEA WP3  The Platform

Workflows design optimization: Robustness

• Building workflows with Taverna– Version 2.4.2 Scalability– Polling (Soaplab) Scalability

• long lasting web service calls without timeouts– Retries Scalability– Parallelization Scalability– Tutorials and videos Usability

27

Page 25: PANACEA WP3  The Platform

Machine Resources: handling parallel requests

Parallelization level 3 (3 parallel request per service * 2 services = 6 concurrent requests)

Workflowname Freeling_tagging_for_crawled_data_with_output_downloadfile massive_freeling_for_crawled_data_v11_download.t2flowmyexp url http://myexperiment.elda.org/workflows/32Taverna 2.4.0 workbench

VM Cores RAM HDiula04 (UPF) 4 8 40GB (SAS)

WS parall. poll. int. poll. backoff poll. max int. retries ini. delay max factorWS1 python_preprocess + freeling_tagging

+ python_postprocessing3 2000 1 10000 2 5000 150000 20

WS2 postagger_to_xces_converter 3 2000 1 10000 2 5000 150000 20

corpus list file urls url example TokensMCv2 LAB_ES_list.sorted.txt 13188 http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml 61 M

Name Status Queued it. It. done It. w/error Average time/it.

Freeling_tagging_for_crawled_data_with_output_download

Finished - - - 5.2 h

download_dataUrl Finished 0 13188 0 31 msfreeling_tagging Finished 0 13188 5 4.2 spostagger_to_xces_converter Finished 0 13188 0 4.1 s

29

Page 26: PANACEA WP3  The Platform

Machine Resources: handling parallel requests

Parallelization level 10 (10 parallel request per service * 2 services = 20 concurrent requests)

Workflowname Freeling_tagging_for_crawled_data_with_output_downloadfile massive_freeling_for_crawled_data_v11_download.t2flowmyexp url http://myexperiment.elda.org/workflows/32Taverna 2.4.0 workbench

VM Cores RAM HDiula04 (UPF) 4 8 40GB (SAS)

WS parall. poll. int. poll. backoff poll. max int. retries ini. delay max factorWS1 python_preprocess + freeling_tagging

+ python_postprocessing10 2000 1 10000 2 5000 150000 20

WS2 postagger_to_xces_converter 10 2000 1 10000 2 5000 150000 20

corpus list file urls url example TokensMCv2 LAB_ES_list.sorted.txt 13188 http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml 61 M

Name Status Queued it. It. done It. w/error Average time/it.

Freeling_tagging_for_crawled_data_with_output_download

Finished - - - 2.2 h

download_dataUrl Finished 0 13188 0 29 msfreeling_tagging Finished 0 13188 5 5.9 spostagger_to_xces_converter Finished 0 13188 0 4.8 s

30

Page 27: PANACEA WP3  The Platform

Machine Resources: handling parallel requests

• From 1x to 10x experimenthttp://ws02.iula.upf.edu/panacea/examples/videos/Panacea_parallelization_scalability_v01.mp4

– Two Taverna instances running at the same time– 100 documents to be processed– 1 workflow with NO parallelization / the other with 10x– The same server: ws04 with 8GB RAM and 4 CPUs

• More resources > more parallel requests

31

Page 28: PANACEA WP3  The Platform

Machine Resources: handling parallel requests

• Conclusions:– PANACEA fulfils large data scalabilty goal Scalability– Requirements:

• Robust WS deployment: Soaplab (with Panacea improvements) or other robust framewoks.

• Taverna 2.4• Workflow design must follow the PANACEA massive data tutorial (retries,

polling, etc)

• The architecture is highly scalable: growth is just a matter of resources

Statistics

Typical Panacea server:• 2 - 4 cores• 4 - 8 GB RAM• 30 - 100 GB HDD• 100 Freeling WS parallel

requests

EMBL –EBI (European Bioinformatics Institute in Cambridge):• 200 Servers• 2000 cores• Server requests balancing Software, etc.More than 50000 Freeling WS parallel requests

32

Page 29: PANACEA WP3  The Platform

WP7 Evaluation

33

Page 30: PANACEA WP3  The Platform

Conclusions• Functional platform

– Web services software – Registry / myExperiment

• Usability for users and providers • Interoperability:

– Data formats – Common Interfaces

• Tutorials and Documentation • Scalability

34

Page 31: PANACEA WP3  The Platform

The future• Authentication Web Services Business opportunity

– Institutions and companies can sell their services and/or machine resources

• Automatically build workflows Usability and interoperability – Based on input data and user desired output, etc.

• Data Visualization tools / Widgets Usability

• Improve total throughput Scalability– With more machine resources we can achieve faster experiment results– Software optimization: task splitting and parallelization

• Publications with experiments Research – Researchers could link their publications to real experiments (WS, workflows, data.

etc.)– Fostering research making experiments easily replicable– Improved experiments: more data, more machine resources, faster results, etc.

35

Page 32: PANACEA WP3  The Platform

Thank you

Questions?

36