PANACEA WP3 The Platform
WP participants: UPF, ILC, ILSP, LG, DCU, ELDA
Final Annual Review, 19th February 2013
Marc Poch, UPF ([email protected])
Summary
• Objectives
• Platform components / Demo
• Achievements
  – Functional platform
  – Interoperability: Travelling Object, Common Interfaces, format converters, etc.
  – Scalability
• WP7 Evaluation
• Conclusions and future work
Objectives
Development of a platform (a space of interoperability defined by standardized protocols and common interfaces) for the easy integration of a variety of software components, tools and methodologies deployed as web services to configure a factory for the automation of acquisition, processing and annotation of language resources.
• WP3.1 (T1-T6) Architecture and design of the platform
• WP3.2 (T15-T30) Workflow editor and engine
• WP3.3 (T7-T30) Common interfaces, middleware and temporary files, journaling, etc.
• WP3.4 (T15-T30) The Registry
• WP3.5 (T7-T30) Deployment of web services of the components supplied by WP4 to WP6
From local tools to sharing workflows
• Tools to be integrated
• Web Service wrapper
• The Registry
• Common Interfaces
• Format Converters
• Workflow editor and engine
• Sharing workflows
• Clients: Java, Python, Perl, etc.
Platform tools and portals
• Web Services: share tools (remotely run distributed tools). SOAP or REST: Soaplab, JAX-WS, Axis, CXF, etc.
• Registry: share and find Web Services. BioCatalogue; PANACEA Registry: registry.elda.org
• Workflows: call / chain Web Services. Taverna: www.taverna.org.uk
• Social Network: share and find workflows. myExperiment; PANACEA myExperiment: myexperiment.elda.org
PANACEA Platform: uses, adapts and improves myGrid tools for eScience (used in biology, social science, music, astronomy, multimedia and chemistry).
Technological option: Web Services
Platform components: Web Services (Soaplab 2, SOAP), Registry (BioCatalogue), Workflow editor (Taverna), Social network (myExperiment)
Soaplab 2 (SOAP):
• Easy deployment of command-line tools as WS (Java, Python, C++, UIMA, etc.)
• Clients: Java, Python, Perl, Taverna, etc.
• No coding needed! Only metadata
• “Polling” techniques for long-lasting tasks
• Web form to run the web services
• URL input / output ready
• PANACEA improvement for SOAP messaging (network usage and memory)
• PANACEA limit for multiple users
Technological option: Registry
BioCatalogue:
• User-friendly GUI
• Free, open source, continuously maintained
• Search function
• User ratings (user feedback)
• Service annotations and Language Categorization (PANACEA)
• Monitoring system (web service status and data results)
  – Status labels: Passed, Warning, Failed, Unchecked
Technological option: Taverna
Taverna:
• User-friendly GUI
• Free and open source
• Continuously maintained (v. 2.4)
• SOAP and REST web services
• Credentials manager (passwords, certificates, etc.)
• Multiple-file processing (“lists”)
• PANACEA workflows, best practices, videos, etc.:
  – Parallelization, error recovery (“retries”), polling
• PANACEA collaboration: bug fixing and pre-release tests
Demos
• Previous Review:
  – PANACEA Registry / PANACEA myExperiment
  – Run Web Services and Workflows
  – Design and merging of workflows in Taverna
• Final Review: Specific examples
  – Creation of a bilingual dictionary
  – Twitter NLP
  – Web cleaner and anonymizer
  – PANACEA Registry / PANACEA myExperiment
Demos I: Creation of a bilingual dictionary
– http://myexperiment.elda.org/workflows/93
– Input: pairs of Basic Xces documents
  • English: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/1.xml
  • French: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/191.xml
1. Sentence alignment: Hunalign, 3rd party tool (Interoperability)
2. PoS tagging: Treetagger, 3rd party tool (Interoperability)
3. Build phrase tables: Moses, 3rd party tool (Interoperability)
4. Bilingual dictionary extractor
Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_bilingual_dictionary_extraction_v01.mp4
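The four steps of the bilingual dictionary workflow form a chain in which each step consumes the output of the previous one. The sketch below illustrates only that chaining idea in Python. The step functions are hypothetical placeholders that tag the data with a step name; they do not call Hunalign, Treetagger, Moses or any PANACEA web service.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[str], str]


def make_step(name: str) -> Step:
    """Return a placeholder step that wraps its input in the step name."""
    def step(data: str) -> str:
        return f"{name}({data})"
    return step


def run_chain(steps: List[Step], data: str) -> str:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)


# Hypothetical stand-ins for the four workflow steps.
CHAIN = [
    make_step("align"),
    make_step("pos_tag"),
    make_step("phrase_table"),
    make_step("extract_dictionary"),
]

if __name__ == "__main__":
    print(run_chain(CHAIN, "en_fr_pair"))
```

In a real workflow each placeholder would be replaced by a web service call, and the chaining itself would be drawn in the workflow editor rather than coded by hand.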
Demos II: Twitter NLP + Registry (3rd party tool)
• This web service is based on the Twitter NLP tool developed by Noah's ARK group.
• Noah's ARK group is Noah Smith's research group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
1. Search the WS in the Registry
2. Check monitoring system
3. Use web client with example data
Demos III: Web cleaner and anonymizer
http://myexperiment.elda.org/workflows/98
• Input: a list of URLs to process
  – Example: a web article from www.fifa.com
1. ILSP Web cleaner and text extractor WS
2. UPF Anonymizer WS
  – Internally calls Freeling NER WS, 3rd party tool (Interoperability)
Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_web_cleaner_and_anonymization_v01.mp4
WP3 Achievements
• Functional and Operational Platform
  – Multiple tools, web portals and features
  – Ready to use
  – Usability
  – Real Users
• Interoperability
  – Common Interfaces
  – Travelling Object
  – 3rd party tools integration
  – Format converters
• Scalability
  – Web service scalability: long-lasting tasks
  – Workflow design optimization: robustness
  – Machine resources: handling parallel requests
Functional and Operational Platform
• PANACEA Registry
  – 157 web services. PANACEA WS benefits: WS are easy to deploy (low maintenance cost)
  – More than 1300 annotations (Usability / Doc.)
  – A cloud of 164 tags
  – Monitoring system: WS up and running 94.82% since their deployment (97%) (Availability)
• PANACEA myExperiment
  – 74 shared workflows
• Storage System (Usability)
Functional and Operational Platform: Tutorials and Documentation
• Tutorials
  – Specific and general tutorials
  – More than 12 videos (Usability)
• Frequently Asked Questions
• Documentation
  – Registry annotations, tags and categories
  – Common Interfaces documentation: XML, web, etc.
  – Travelling Objects documentation
Functional and Operational Platform: Users
• WP7 Validators
• Linguatech (WP8)
• Qualia (Business intelligence)
• CNGL (Centre for Next Generation Localisation)
• INCYTA (Translation)
• Master's and PhD students make use of the PANACEA platform
• http://ws02.iula.upf.edu/panacea/statistics/upf-statistics.html
Interoperability
• Three levels of interoperability:
  – COMMUNICATION PROTOCOLS: SOAP, REST
  – DATA
  – PARAMETERS
[Figure: first, Tool A with format N, Tool B with format M and Tool C with format L: Tool B does not “understand” format N! Then a single format N shared by Tool A, Tool B and Tool C: all tools understand the previous format. A second pair of panels shows Tool A and Tool B with the labels ABCD / ABCD and YTQZ / ABCD.]
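One way to see why a shared data format helps is to count converters. The sketch below is an illustration only, not a figure from the project: it assumes every converter is one-directional and compares direct conversion between every pair of tool formats with conversion through a single common format. The function names are made up for this example.

```python
def pairwise_converters(n_formats: int) -> int:
    """Converters needed if each format converts directly to every other format."""
    return n_formats * (n_formats - 1)


def common_format_converters(n_formats: int) -> int:
    """Converters needed if each format converts to and from one common format."""
    return 2 * n_formats


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, pairwise_converters(n), common_format_converters(n))
```

Under these assumptions the two counts are equal for three formats, and the common-format approach needs fewer converters for every larger number of formats.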
Common Interface
• A Common Interface (CI) defines the mandatory parameters for every functionality:
  – PoS Tagger A. MANDATORY: input, language. OPTIONAL: Param A
  – PoS Tagger B. MANDATORY: input, language. OPTIONAL: PARAM 1, PARAM 2
http://panacea-lr.eu/en/info-for-professionals/documents/
http://registry.elda.org
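The idea of mandatory parameters shared by all tools of one functionality can be sketched as a simple check. The example assumes that the PoS tagger mandatory parameters are named `input` and `language`; the function names and the sample optional parameter are hypothetical and are not part of the PANACEA Common Interface documentation.

```python
from typing import Dict, Iterable, List

# Assumed mandatory parameter names for the PoS tagger functionality.
POS_TAGGER_MANDATORY = ("input", "language")


def missing_mandatory(params: Dict[str, str],
                      mandatory: Iterable[str] = POS_TAGGER_MANDATORY) -> List[str]:
    """Return the mandatory parameter names that are absent or empty."""
    return [name for name in mandatory if not params.get(name)]


def conforms(params: Dict[str, str],
             mandatory: Iterable[str] = POS_TAGGER_MANDATORY) -> bool:
    """A request conforms when no mandatory parameter is missing."""
    return not missing_mandatory(params, mandatory)


if __name__ == "__main__":
    request_a = {"input": "http://example.org/doc.txt", "language": "es", "param_a": "1"}
    request_b = {"input": "http://example.org/doc.txt"}
    print(conforms(request_a), missing_mandatory(request_b))
```

Optional parameters are ignored by the check, so two taggers with different optional parameters can both satisfy the same interface.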
Travelling Object
• The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other. (Interoperability)
• TO1 is the minimal common vertical in-line format, based on the XCES standard, used by the deployed tools since the first version of the platform
• TO2, GrAF standard: the Graph Annotation Format (Ide and Suderman, 2007) is the XML serialization of LAF (ISO 24612, 2009)
• LMF for lexical resources
• CoNLL for parsers
• Converters and adapted WS outputs
Format Converters
31 format converters on the PANACEA Registry
• Freeling to TO. CNR http://registry.elda.org/services/207
• KAF to TO. CNR http://registry.elda.org/services/208
• Basic Xces to txt. CNR http://registry.elda.org/services/209
• PoS tag. (Freeling treetagger) to GrAF. UPF http://registry.elda.org/services/142
• Dependency parsing (Freeling) to GrAF. UPF http://registry.elda.org/services/197
• Dependency CoNLL to GrAF. CNR http://registry.elda.org/services/254
• Word doc to txt. UPF http://registry.elda.org/services/112
• In-house mwe to LMF. CNR http://registry.elda.org/services/296
• Pdf to text. UPF http://registry.elda.org/services/116
• Multi. encodings converter (ISO, UTF, etc.). UPF http://registry.elda.org/services/114
• Aligner to TO. DCU http://registry.elda.org/services/69
• Sentence alignment to TMX. DCU http://registry.elda.org/services/219
• Treetagger to MOSES. DCU http://registry.elda.org/services/275
• UIMA to GrAF. ILSP http://registry.elda.org/services/182
• METASHARE metadata generators http://myexperiment.elda.org/workflows/96
3rd party tools integration
• PANACEA WS wrapper (Soaplab) and the CI make it easy for WS Providers to integrate 3rd party tools.
• ILSP tools are UIMA tools (UIMA)
• Freeling (UPC)
• Treetagger (University of Stuttgart)
• Twitter NLP (Carnegie Mellon University)
• MALT Parser (Uppsala University)
• DeSR (Università di Pisa)
• MOSES / Giza++
• DELiC4MT, MT evaluation (DCU)
• Berkeley tagger, parser, aligner (University of California, Berkeley)
Web Services Scalability
• Web services are being deployed using Soaplab 2.3.2:
  – Service providers only need to use metadata (ACD) files (Usability)
  – Web client application to test WSs: Spinet (Usability)
  – PANACEA developers have been in contact with Soaplab developers (Collaboration)
  – SOAP protocol standard (Interoperability)
    • WS can be called from Taverna or other workflow editors
    • WS can be called with many programming languages: Python, Perl, Ruby, Java, etc.
  – Soaplab polling to avoid client timeouts (Scalability)
  – PANACEA improvements (Scalability)
    • Parallel request limit system
    • SOAP messaging optimization
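The slides do not describe how the parallel request limit system works. As an illustration only, the sketch below caps the number of requests running at the same time with a semaphore and reports "busy" when the cap is reached. The class and function names are hypothetical, and whether the real system rejects or queues extra requests is not stated in the source.

```python
import threading
from typing import Callable


class ParallelRequestLimiter:
    """Allow at most `limit` requests to hold a slot at the same time."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        """Take a slot without waiting; return False if none is free."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Give a slot back after the request has finished."""
        self._slots.release()


def handle(limiter: ParallelRequestLimiter, work: Callable[[], str]) -> str:
    """Run work() if a slot is free; otherwise report that the limit is reached."""
    if not limiter.try_acquire():
        return "busy"
    try:
        return work()
    finally:
        limiter.release()


if __name__ == "__main__":
    limiter = ParallelRequestLimiter(limit=3)
    print(handle(limiter, lambda: "done"))
```

A client that receives "busy" could wait and try again, which fits the polling and retry settings used in the workflows.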
Workflows design optimization: Robustness
• Building workflows with Taverna
  – Version 2.4.2 (Scalability)
  – Polling (Soaplab) (Scalability)
    • long-lasting web service calls without timeouts
  – Retries (Scalability)
  – Parallelization (Scalability)
  – Tutorials and videos (Usability)
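Polling and retries can be sketched in a few lines. This is a minimal illustration, not Taverna's or Soaplab's implementation. The default polling values (2000, 1 and 10000) mirror the "poll. int.", "poll. backoff" and "poll. max int." settings listed in the parallelization tables of this deck, read here as milliseconds, which is an assumption. The function names and the `sleep` parameter are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until_done(is_done: Callable[[], bool],
                    interval_ms: float = 2000,
                    backoff: float = 1.0,
                    max_interval_ms: float = 10000,
                    max_polls: int = 1000,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Check is_done() repeatedly; return True once it succeeds, False if polls run out."""
    wait_ms = interval_ms
    for _ in range(max_polls):
        if is_done():
            return True
        sleep(wait_ms / 1000.0)
        # Grow the wait by the backoff factor, but never beyond the maximum interval.
        wait_ms = min(wait_ms * backoff, max_interval_ms)
    return False


def call_with_retries(call: Callable[[], T],
                      retries: int = 2,
                      delay_ms: float = 5000,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); after a failure, wait and try again up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(delay_ms / 1000.0)


if __name__ == "__main__":
    answers = iter([False, False, True])
    print(poll_until_done(lambda: next(answers), interval_ms=10))
```

Because the client asks for the job status in short separate requests, no single request has to stay open for the whole duration of a long-lasting task.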
Machine Resources: handling parallel requests
Parallelization level 3 (3 parallel requests per service * 2 services = 6 concurrent requests)

Workflow name: Freeling_tagging_for_crawled_data_with_output_download
file: massive_freeling_for_crawled_data_v11_download.t2flow
myexp url: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench

VM | Cores | RAM | HD
iula04 (UPF) | 4 | 8 | 40GB (SAS)

WS | parall. | poll. int. | poll. backoff | poll. max int. | retries | ini. delay | max | factor
WS1 python_preprocess + freeling_tagging + python_postprocessing | 3 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20
WS2 postagger_to_xces_converter | 3 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20

corpus | list file | urls | url example | Tokens
MCv2 | LAB_ES_list.sorted.txt | 13188 | http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml | 61 M

Name | Status | Queued it. | It. done | It. w/error | Average time/it.
Freeling_tagging_for_crawled_data_with_output_download | Finished | - | - | - | 5.2 h
download_dataUrl | Finished | 0 | 13188 | 0 | 31 ms
freeling_tagging | Finished | 0 | 13188 | 5 | 4.2 s
postagger_to_xces_converter | Finished | 0 | 13188 | 0 | 4.1 s
Machine Resources: handling parallel requests
Parallelization level 10 (10 parallel requests per service * 2 services = 20 concurrent requests)

Workflow name: Freeling_tagging_for_crawled_data_with_output_download
file: massive_freeling_for_crawled_data_v11_download.t2flow
myexp url: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench

VM | Cores | RAM | HD
iula04 (UPF) | 4 | 8 | 40GB (SAS)

WS | parall. | poll. int. | poll. backoff | poll. max int. | retries | ini. delay | max | factor
WS1 python_preprocess + freeling_tagging + python_postprocessing | 10 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20
WS2 postagger_to_xces_converter | 10 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20

corpus | list file | urls | url example | Tokens
MCv2 | LAB_ES_list.sorted.txt | 13188 | http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml | 61 M

Name | Status | Queued it. | It. done | It. w/error | Average time/it.
Freeling_tagging_for_crawled_data_with_output_download | Finished | - | - | - | 2.2 h
download_dataUrl | Finished | 0 | 13188 | 0 | 29 ms
freeling_tagging | Finished | 0 | 13188 | 5 | 5.9 s
postagger_to_xces_converter | Finished | 0 | 13188 | 0 | 4.8 s
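The two runs can be compared with a short calculation. The sketch assumes that the 5.2 h and 2.2 h figures reported for the whole workflow are total run times for the same 13188 URLs; the function and variable names are introduced here for illustration.

```python
DOCUMENTS = 13188        # URLs processed in both runs
HOURS_LEVEL_3 = 5.2      # workflow time at parallelization level 3
HOURS_LEVEL_10 = 2.2     # workflow time at parallelization level 10
SERVICES = 2             # services called in parallel by the workflow


def throughput(documents: int, hours: float) -> float:
    """Documents processed per hour."""
    return documents / hours


def speedup(slow_hours: float, fast_hours: float) -> float:
    """How many times faster the second run is than the first."""
    return slow_hours / fast_hours


def concurrent_requests(level: int, services: int = SERVICES) -> int:
    """Parallel requests per service multiplied by the number of services."""
    return level * services


if __name__ == "__main__":
    print("docs/hour at level 3:", round(throughput(DOCUMENTS, HOURS_LEVEL_3)))
    print("docs/hour at level 10:", round(throughput(DOCUMENTS, HOURS_LEVEL_10)))
    print("speedup:", round(speedup(HOURS_LEVEL_3, HOURS_LEVEL_10), 2))
    print("concurrency ratio:",
          round(concurrent_requests(10) / concurrent_requests(3), 2))
```

Under these assumptions the run with 20 concurrent requests is roughly 2.4 times faster than the run with 6, so the gain is smaller than the increase in concurrency on the same 4-core machine.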
Machine Resources: handling parallel requests
• From 1x to 10x experiment: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_parallelization_scalability_v01.mp4
  – Two Taverna instances running at the same time
  – 100 documents to be processed
  – 1 workflow with NO parallelization / the other with 10x
  – The same server: ws04 with 8GB RAM and 4 CPUs
• More resources > more parallel requests
Machine Resources: handling parallel requests
• Conclusions:
  – PANACEA fulfils the large data scalability goal (Scalability)
  – Requirements:
    • Robust WS deployment: Soaplab (with PANACEA improvements) or other robust frameworks.
    • Taverna 2.4
    • Workflow design must follow the PANACEA massive data tutorial (retries, polling, etc.)
• The architecture is highly scalable: growth is just a matter of resources
Statistics
Typical PANACEA server:
• 2 - 4 cores
• 4 - 8 GB RAM
• 30 - 100 GB HDD
• 100 Freeling WS parallel requests
EMBL-EBI (European Bioinformatics Institute in Cambridge):
• 200 servers
• 2000 cores
• Server request balancing software, etc.
• More than 50000 Freeling WS parallel requests
Conclusions
• Functional platform
  – Web services software
  – Registry / myExperiment
• Usability for users and providers
• Interoperability:
  – Data formats
  – Common Interfaces
• Tutorials and Documentation
• Scalability
The future
• Authentication Web Services (Business opportunity)
  – Institutions and companies can sell their services and/or machine resources
• Automatically build workflows (Usability and interoperability)
  – Based on input data and user desired output, etc.
• Data visualization tools / widgets (Usability)
• Improve total throughput (Scalability)
  – With more machine resources we can achieve faster experiment results
  – Software optimization: task splitting and parallelization
• Publications with experiments (Research)
  – Researchers could link their publications to real experiments (WS, workflows, data, etc.)
  – Fostering research by making experiments easily replicable
  – Improved experiments: more data, more machine resources, faster results, etc.