European Data
Grant agreement number: RI-283304
Deliverable D7.5.1 Technology adaptation and development framework
in EUDAT WP7.3
Authors Emanuel Dima, Christian Page, Yvonne Kustermann,
Reinhard Budich
Status Draft/Review/Approval/Final
Version v.1.0
Date November 21, 2013
Abstract:
This deliverable reports on the progress of the construction and integration of the Generic
Execution Framework (GEF), as well as additional required tools and components. The focus is on the
lessons learned from the first stage of technology adaptation and construction. It describes how existing
EUDAT user technologies have been incorporated, including any necessary adaptations. It also outlines
the expected behaviour of the framework with respect to User Community needs. The final report (D7.5.2) will
describe and assess the EUDAT GEF, including any adaptations that were necessary to accommodate
the requirements of the user communities.
Document identifier: EUDAT-DEL-WP7-D7.5.1
Deliverable lead Christian Page
Related work package 7
Author(s) Emanuel Dima, Christian Page, Yvonne Kustermann, Reinhard
Budich
Contributor(s) Stephane Coutin, Pascal Dugenie
Due date of deliverable 01/10/2013
Actual submission date 21/11/2013
Reviewed by Morris Riedel, Ari Lukkarinen
Approved by
Dissemination level PUBLIC
Website www.eudat.eu
Call FP7-INFRA-2011-1.2.2
Project number 283304
Instrument CP-CSA
Start date of project 01/10/2011
Duration 36 months
Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does
not necessarily represent the views expressed by the European Commission or its services.
While the information contained in the document is believed to be accurate, the author(s) or any
other participant in the EUDAT Consortium make no warranty of any kind with regard to this material
including, but not limited to the implied warranties of merchantability and fitness for a particular purpose.
Neither the EUDAT Consortium nor any of its members, their officers, employees or agents shall
be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission
herein.
Without derogating from the generality of the foregoing neither the EUDAT Consortium nor any of
its members, their officers, employees or agents shall be liable for any direct or indirect or consequential
loss or damage caused by or arising from any information advice or inaccuracy or omission herein.
EUDAT – 283304 D7.5
Contents
1 Introduction
  1.1 Design Goals
  1.2 Functionality Overview
  1.3 Security
  1.4 Metadata
  1.5 Service Catalog
  1.6 Related work
2 API
  2.1 REST Generalities
  2.2 Request Parameters
  2.3 List of Web Services
    2.3.1 Basic Data Retrieval
    2.3.2 Execution Functions
    2.3.3 Filtering Function
    2.3.4 Map-Reduce Function
    2.3.5 Workflow Management Service
  2.4 Using the API
    2.4.1 Conceptual Example
    2.4.2 Extended Example
  2.5 Job Control and Garbage Collection
  2.6 Differences to Description of Work
3 User Interface
4 Implementation
  4.1 The Web service
  4.2 The iRODS-Based Backend
  4.3 Map-Reduce: Hadoop, Pig and PigLatin
5 Testing Use Cases
  5.1 ENES Use Case
    5.1.1 Data Download
    5.1.2 Data Subsetting
    5.1.3 Scientific Computing
  5.2 GEF implementation in a data node at CINES
    5.2.1 Background
    5.2.2 Scenario
  5.3 CLARIN Use Cases
    5.3.1 Metadata Query Service
    5.3.2 Google Books Ngram
6 Conclusion, notes, discussion
4/27 PUBLIC Copyright © The EUDAT Consortium
1 Introduction
This deliverable reports on the progress of the construction and integration of the
Generic Execution Framework (GEF), as well as additional required tools and components.
The GEF is a mechanism designed for the enactment of scientific workflows (with certain
restrictions) on massive amounts of data in an environment where the data is readily accessible.
This document focuses on the lessons learned from the first stage of technology adaptation
and construction. It describes how existing EUDAT user technologies have been
incorporated, including any necessary adaptations. It also outlines the expected behaviour of
the framework with respect to User Community needs. The final report (D7.5.2) will describe and
assess the EUDAT GEF, including any adaptations that were necessary to accommodate
the requirements of the user communities.
1.1 Design Goals
The GEF makes it possible to process datasets at a network location very close to the
actual location of the data. The EUDAT CDI already hosts petabytes of data in its
datacenters. Analyzing these data often implies transferring them to a different software
environment, a prohibitive operation given the data volume and the available network
bandwidth. The topological proximity offered by the GEF provides advantages such as fast
access and lower network load. An especially useful application of the GEF is filtering and
subsetting datasets and then transferring only the resulting data to a local computer for
further analysis, again lowering the network load and the time needed for data analysis.
Being designed as a general-purpose framework for use by many diverse scientific
communities, the GEF must be capable of working with the tools already in use in these
communities, but it must also offer a framework for those communities that have not yet
organized their data in a federation. The GEF implementation must therefore be highly
flexible and designed for continuous enhancement of its functionality. The minimum required
GEF capability is to execute command line tools against the stored data. An additional
execution module for map-reduce jobs will also be integrated. Plugging in further execution
modules must be possible and easy by design (e.g. modules for processing streaming data
or for statistical analysis).
1.2 Functionality Overview
The GEF is defined as a collection of HTTP web services. The specification of the web
services constitutes the API layer, which is the only layer guaranteed to be stable:
backwards-incompatible changes will only be introduced in new versions of the API. This
stable, web-service-based API layer should considerably ease the integration of the GEF
with other software tools, including common workflow engines (e.g. Taverna, Kepler) and
community-specific data federation interfaces. The API is implemented by various back end
modules (see Figure 1).
In the initial phase the GEF will only support a limited number of fixed functions, or services.
Calling a GEF service involves sending an HTTP request to the corresponding web service
endpoint and providing the necessary parameters. Some services will be bound to
a particular dataset; others will accept an identifier of the input dataset as a parameter.
Services that need an input dataset to be specified should refuse the request when the
dataset is not close to the service, or when the transfer would take too long.

[Figure 1: GEF structural overview — the GEF offers its services through Map-Reduce, Workflows, and CLI scripts back ends on top of an iRODS federation]
The individual services will be distributed across various servers in the EUDAT infrastructure.
However, a single root URL for the GEF will serve as a request dispatcher (e.g.
https://eudat.eu/gef). A request coming to this URL will be redirected to a specific GEF
endpoint, or refused, depending on the service name and the location of the input data.
Each GEF endpoint would be close to a data center and thus able to work with the local
datasets.
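The dispatching behaviour described above can be sketched as follows; the endpoint registry and the location-lookup function are hypothetical illustrations, not part of the GEF specification:

```python
# Sketch of the central GEF request dispatcher: map the input dataset's
# location to the nearest GEF endpoint, or refuse the request.
# The registry and lookup below are invented example data.

ENDPOINTS = {
    # data center id -> local GEF endpoint
    "datacenter-a": "https://specific.datacenter.eu/eudat/gef",
    "datacenter-b": "https://other.datacenter.eu/eudat/gef",
}

def locate_dataset(data_id: str) -> str:
    """Hypothetical lookup: which data center holds this dataset?
    A real system would query the PID/handle system instead."""
    return "datacenter-a" if data_id.startswith("1234/") else "datacenter-b"

def dispatch(service: str, data_id: str):
    """Return a (status, location) pair mimicking an HTTP 307 redirect."""
    center = locate_dataset(data_id)
    endpoint = ENDPOINTS.get(center)
    if endpoint is None:
        return (403, None)  # refuse: no GEF endpoint close to the data
    return (307, f"{endpoint}/function/{service}")
```

The same lookup would also let the dispatcher refuse a request outright when no endpoint is close enough to the data.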
Input and output datasets should be specified indirectly (via URIs or handles/PIDs), not
transferred over HTTP. This also applies to the temporary files produced by a sequence of
steps, although these need not be assigned PIDs. Datasets should also be accessible via
different protocols, at least one of which should be capable of resiliently handling large data
transfers. It must be noted that although one of the goals of the GEF is to reduce the data
volume that needs to be transferred over the net, in the near future the resulting transferred
data can still be large by today's standards. The data transfer functionality will be unified
with the Lightweight Replication service1 when both services attain a mature status.
Many workflows consist of a set of GEF operations executing partly in sequence and
partly in parallel. An efficient sequence of GEF operations would execute without any
data transfers, the input of one service being the output of the preceding one (inputs
and outputs being specified by data handles). Commonly used workflows in the climate and
linguistic user communities start with a 'filter' step, and only afterwards process the resulting
data. A data transfer to the user would only occur when the workflow is finished.

1 The Lightweight Replication service is a simple service for data movement in and out of the CDI: https://confluence.csc.fi/display/Eudat/Lightweight+Replication+service
For the user, the GEF is the service framework together with the functions offered by
this framework. From an implementation perspective, however, the GEF is distinct from
the functions: it consists of a generic web service, the selection of the backend solution
(iRODS/Hadoop/other), the transfer of parameters to the backend where the function is
executed, and the return of results. A function is expressed either as a command line script
(executed via an iRODS rule), as a scientific workflow (represented in one of the workflow
management systems), or as a Pig/Hadoop script. The GEF API is common, but the functions
are to be created, owned and maintained independently by the user communities.
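This separation between the generic service layer and the community-owned functions can be illustrated roughly as follows; the function registry and the backend handles are invented examples, not the actual implementation:

```python
# Sketch: the GEF core only selects a backend and forwards parameters;
# the functions themselves are owned by the communities.
# All names below are illustrative assumptions.

FUNCTIONS = {
    # function id -> (backend kind, backend-specific handle)
    "filter":    ("irods",    "filterRule.r"),        # CLI script via an iRODS rule
    "mapreduce": ("hadoop",   "wordcount.pig"),       # Pig/Hadoop script
    "simulate":  ("workflow", "climate-sim.t2flow"),  # scientific workflow
}

def run_function(func_id: str, params: dict) -> str:
    """Select the backend for a function and forward the parameters.
    A real implementation would invoke the backend; here we only
    describe what would happen."""
    backend, handle = FUNCTIONS[func_id]
    return f"{backend} backend executes {handle} with {sorted(params)}"
```

Adding a new community function is then only a matter of registering a new backend handle, without touching the generic service layer.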
As specified above, a function can be implemented in the back end by a scientific workflow.
The term workflow, however, encompasses a large variety of meanings. In the scope of
the GEF, a workflow is understood as a representation of a data processing functionality,
enactable by means of a suitable workflow management system and constrained to execute
in the environment provided by a virtual machine with limited external connections.
The GEF execution backend of the prototype implementation uses the iRODS middleware
currently employed in EUDAT for data management. A separate backend module for executing
map-reduce jobs will make use of the Apache Pig and Apache Hadoop projects.
Communities that already have a mature data federation not based on iRODS can write
their own specific backends.
1.3 Security
All the GEF web services must run on encrypted connections, using the HTTPS protocol.
Ideally, a single sign-on (SSO) system will be used, which requires coordination with the
EUDAT AAI taskforce2. The currently envisioned solution is to use either OAuth, for simple
authorization cases, or client certificates transferred over HTTPS, from which the identity
and the relevant attributes of the client can be extracted.
1.4 Metadata
The GEF will only be fully usable if a Search Service (API) is available to look up
specific data and return matching URIs/Handles/PIDs. This requires a Metadata Catalog
covering at least the common semantics across EUDAT communities, supplemented with
community-specific metadata. A common semantic catalogue needs a well defined
API which allows typical search requests for the data needed as input for the processing
in question. The returned URIs/Handles/PIDs of the requested data are then provided to the
processing. Until this functionality is available, the user (or the interface) must know
the required URIs/Handles/PIDs beforehand.
The current implementation of the EUDAT Metadata Catalog is based on CKAN, an open
source data management platform. CKAN provides a rich user interface with facet-based
search facilities, as well as an HTTP REST API which can be used for this purpose.
2 http://www.eudat.eu/authentication-and-authorization-infrastructure-aai
1.5 Service Catalog
A catalog of the services available through the GEF will be offered to users, through
an API and a user interface. The catalog will be generated automatically by querying
all the registered GEF endpoints for available services. The metadata of each service will
contain a human-readable description, the locations (i.e. GEF endpoints/data centers) where
the service is available, the data types the service can operate on, and details about the
other parameters the service requires.

Currently the location of a dataset can be determined only indirectly and approximately,
from the server domain specified in the URL.
At each GEF endpoint the same functionality should be available by using the OPTIONS
method of the HTTP protocol.
1.6 Related work
SHIWA (SHaring Interoperable Workflows for large-scale scientific simulations on Available
DCIs) is a FP7 project that aims to develop new technologies for workflow systems interop-
erability.3 The project provides an execution platform where the workflows can be executed
on various Distributed Computing Infrastructures (DCIs).
The SCAPE (SCAlable Preservation Environment) FP7 project is aiming to build a scal-
able platform for digital preservation.4 The preservation processes will be realized as data
pipelines and implemented as workflows expressed in the Taverna workflow system. SCAPE
will deploy large scale workflows and execute them on cloud infrastructures, also collecting
the provenance data produced during this process.
Many other research projects use various workflow systems for complex data analysis or
contribute to the workflow ecosystem in other ways. Contrail5 offers autonomic workflow
execution on cloud infrastructures. e-LICO6 provides services and tools to assist the user in
designing scientific workflows. Wf4Ever7 provides a management environment for Research
Objects, which it defines as comprising scientific workflows, the provenance data gathered
at execution, the interconnections between them and other resources and the related social
aspects.
Work on converting workflows from one workflow representation to another has been done
in the frame of the SCI-BUS8 project (conversion from the desktop based KNIME system
to the DCI based system gUSE9). A more general solution to the problem of workflow
translation was given in the frame of the SHIWA project by introducing an intermediate
workflow language, IWIR10.
3 http://www.shiwa-workflow.eu
4 http://www.scape-project.eu/
5 http://contrail-project.eu
6 http://www.e-lico.eu
7 http://www.wf4ever-project.org/
8 http://www.sci-bus.eu
9 L. de la Garza, J. Kruger, C. Scharfe, M. Rottig, S. Aiche, K. Reinert, and O. Kohlbacher, 2013. From the Desktop to the Grid: Conversion of KNIME Workflows to gUSE. http://ceur-ws.org/Vol-993/paper9.pdf
10 Kassian Plankensteiner, Johan Montagnat, and Radu Prodan, 2011. IWIR: a language enabling portability across grid workflow systems. In Proceedings of the 6th workshop on Workflows in support of large-scale science (WORKS '11). ACM, New York, NY, USA, 97-106. http://doi.acm.org/10.1145/2110497.2110509
Web-based management of workflows has been developed in the P-GRADE project11, where
the execution environment is provided by various grid platforms12.
2 API
This section describes the principles of the API. A separate technical report will
follow to describe the API in more detail.
2.1 REST Generalities
Representational State Transfer13 (REST) is an architectural style in software engineering,
typically used in conjunction with the HTTP protocol for developing web applications. The
REST style requires, among other constraints, client-server separation and stateless
communication (no preservation of context on the server). In exchange, the resulting system
is scalable, reliable and easily modifiable.
An HTTP web service built on REST principles offers its functionality via an HTTP URL
that identifies the service. The client can create/update/get/delete resources on the server
using the HTTP methods POST/PUT/GET/DELETE.
In addition, the HEAD and OPTIONS methods can be used to query metadata
about the resources or the available operations. The HEAD method is equivalent to the GET
method but returns only the header information (the metadata part), not the actual data.
The OPTIONS method, usually used for determining the options or requirements
of a resource14, can be used in the GEF to reflect the data operations available for a
specific dataset.
2.2 Request Parameters
An HTTP request can carry an arbitrary number of parameters. Raw data can also be
transferred to the service as part of a request. As an example, a common form of a web
service URL, using GET parameters, is:

https://eudat.eu/gef/webservice?parameter1=value1&parameter2=value2
Each GEF web service is expected to take a number of parameters, of various kinds. Some of
the parameters are common for all web services. These parameters are defined as key-value
pairs, where the values can be either free form strings or a limited set of values (controlled
vocabulary).
The common parameters are:
certificate This parameter contains the certificate for authenticating a user (in the case a
certificate is needed for authentication and authorization).
11 http://portal.p-grade.hu/
12 Peter Kacsuk and Gergely Sipos, 2005. Multi-Grid, Multi-User Workflows in the P-GRADE Grid Portal. Journal of Grid Computing 3:3-4. http://link.springer.com/article/10.1007%2Fs10723-005-9012-6
13 http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
14 http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
dataID A data handle that identifies the concrete resource that the service is processing.
This handle can be either a PID or, for transient data, a temporary data identifier.
The parameters can be of different kinds. Path parameters show up in the path of the web
service; e.g. the dataID parameter, with the value 12345/00-6789-ABCDE-F, can be part
of the path:

https://eudat.eu/gef/dataset/12345/00-6789-ABCDE-F?queryType=SRU_CQL&query=dc.title+2001
Query parameters are appended to the URL with a special syntax (e.g. the queryType and
query parameters in the previous example). Form parameters are sent with the data payload
in the request body and are not visible in the URL.
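As an illustration, such a request URL can be assembled with Python's standard urllib; the parameter values are taken from the example above:

```python
from urllib.parse import urlencode

BASE = "https://eudat.eu/gef"

def filter_url(data_pid: str, query_type: str, query: str) -> str:
    """Build a GEF dataset URL: the handle is a path parameter (its '/'
    is part of the handle and stays unencoded), the rest are query
    parameters (urlencode escapes spaces as '+', as in the example)."""
    qs = urlencode({"queryType": query_type, "query": query})
    return f"{BASE}/dataset/{data_pid}?{qs}"

url = filter_url("12345/00-6789-ABCDE-F", "SRU_CQL", "dc.title 2001")
```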
2.3 List of Web Services
2.3.1 Basic Data Retrieval
The /gef/dataset/{dataID} service can be used to extract data out of the EUDAT
CDI over HTTP, using the GET method. It differs from the Lightweight Replication
service (which is designed to transfer data in and out of EUDAT servers) by accepting both
PIDs and temporary IDs for dataset identification. Also, it is not designed for ingesting
data into EUDAT, but only for getting data out of the system. The Lightweight Replication
service could be used as a backend for part of the implementation. This service should only
be used for transferring relatively small amounts of data.
dataID is a string identifying the dataset (either a PID or another temporary identifier).
2.3.2 Execution Functions
The /gef/function/{funcID} service is the actual execution service, calling various data
processing workflows on the data identified by dataID, a required query parameter. The
funcID parameter identifies the function to be called.
2.3.3 Filtering Function
The most important service of the GEF in this first phase is the filtering service
(/gef/function/filter), which is used for filtering/subsetting the dataset identified by
the required query parameter dataID and should return a list of data handles.
The parameters are:
dataID : string
queryType : string
query : string
The behaviour of the filter web service depends on the requirements of the individual
communities. Each queryType value directs the execution of the service to an internal
community-specific subservice.
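This queryType-based routing could be sketched as follows; the subservice names and behaviours are assumptions for illustration only:

```python
# Sketch: the filter service routes each request to a community-specific
# subservice based on the queryType parameter. Subservices are stubs.

def sru_cql_filter(data_id, query):
    # Would run an SRU/CQL query; here we just fabricate a handle.
    return [f"{data_id}/part-matching-{query}"]

def opensearch_filter(data_id, query):
    return []  # hypothetical second queryType, stubbed out

SUBSERVICES = {
    "SRU_CQL": sru_cql_filter,
    "OpenSearch": opensearch_filter,  # invented name, for illustration
}

def filter_service(data_id: str, query_type: str, query: str):
    """Dispatch to the subservice and return a list of data handles."""
    try:
        subservice = SUBSERVICES[query_type]
    except KeyError:
        raise ValueError(f"unsupported queryType: {query_type}")
    return subservice(data_id, query)
```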
2.3.4 Map-Reduce Function
The /gef/function/mapreduce service will offer a map-reduce functionality using the Pig
Latin script and a Hadoop backend (see section 4.3).
dataID : string
scriptType : string; currently, the only allowed value is "PigLatin"
script : string
2.3.5 Workflow Management Service
The /gef/workflow service allows authorized users to enhance the functionality of the
GEF by uploading and installing scientific workflows on site. The workflows can subsequently
be called just like any other function of the execution service.
As an example, we can imagine the case of a scientific workflow that takes a data file as input,
processes the input in some way and outputs the result of this processing to an output port.
The workflow can be POST-ed to the workflow management service, which would analyze
it, test for conformance, install it in the backend and return an identification (workflowID)
of the workflow and the function it performs. The workflow can subsequently be executed
by invoking the execution service (a POST command at /gef/function/{workflowID})
or removed by a DELETE /gef/workflow/{workflowID}.
The workflow management service is expected to be used much less frequently than the
execution and data-retrieval services. This service is only available to some users (usually
community representatives), as it can introduce security and stability issues. An important
requirement is for the back-end to have a flexible way of providing support for additional
workflow systems or other software environments, the current envisioned solution being
encapsulation of the necessary tools in virtual machines.
2.4 Using the API
2.4.1 Conceptual Example
In broad outline, a researcher would use the EUDAT/GEF software stack as in the
following use case:
1. The researcher would first search for data sets relevant to her problem or hypothesis.
This involves going to the Metadata Catalogue (the web service which indexes all the
data ingested into EUDAT), exploring the data sets and selecting a relevant few (with their
PIDs).
2. The researcher should then go to the Service Catalogue (the human-readable list of
GEF services), review the GEF functions available for the selected datasets at different
locations and select a set of relevant functions.
3. Once a set of relevant functions is identified, the researcher would start processing
the data. This could be done either from the GEF web based user interface, or by
directly using the API from a client-side workflow system or a programming language
environment or a command-line environment. During this step, the researcher would
iteratively produce new datasets by applying functions to existing datasets.
4. The final result(s) would be downloaded to the local computer either via the basic data
service or via other available protocols, and/or could be stored in the EUDAT
CDI and assigned a PID. In the latter case the researcher would probably also store the
provenance data generated by the workflow systems.
2.4.2 Extended Example
A large climatic/linguistic dataset is stored in an EUDAT datacenter and has a public PID:
1234/00-1111-2222-3333. This is the PID of the collection; the dataset comprises
multiple files. A user wants to process a subset of this data and get back the results.
The processing consists of 3 major steps:
a) filter the data based on metadata (year and location for climatic data / year and
language for linguistic data)
b) run a simulation and predict a future statistical variable based on the filtered data /
create a collocation table and select the values for a set of words
c) create a visualization of the resulting data (color coded map / frequency chart)
The services used in steps b) and c) are not currently included in the GEF; they are examples
of extending the framework's functionality based on the needs of the user communities.
Step 0.
The user makes the first request for filtering to the central GEF endpoint:
POST https://eudat.eu/gef/function/filter?dataPID=1234/00-1111-2222-3333
     &queryType=SRU_CQL&query=dc.title+2001+sortBy+dc.date/sort.descending
the endpoint responds with HTTP 307 (Temporary Redirect) to the following local GEF
endpoint:
Location: https://specific.datacenter.eu/eudat/gef/function/filter
     ?dataPID=1234/00-1111-2222-3333&queryType=SRU_CQL
     &query=dc.title+2001+sortBy+dc.date/sort.descending
Step 1.
The user reissues the request to the local GEF endpoint:
POST https://specific.datacenter.eu/eudat/gef/function/filter
     ?dataPID=1234/00-1111-2222-3333&queryType=SRU_CQL
     &query=dc.title+2001+sortBy+dc.date/sort.descending
the GEF responds with HTTP 202 (Accepted):
Location: https://specific.datacenter.eu/eudat/gef/jobs/job1
The user polls, waiting for the job to end:
GET https://specific.datacenter.eu/eudat/gef/jobs/job1
response: HTTP 204 (No Content) { job_status: running }
Eventually the user tries again and succeeds:
GET https://specific.datacenter.eu/eudat/gef/jobs/job1
response: HTTP 200 (OK) (job done):
{
  size: '192GB',
  dataId: 'jobs/job1/result',
  url: 'https://specific.datacenter.eu/eudat/gef/jobs/job1/result',
  irodsUrl: 'irods://specific.datacenter.eu/vzDATA/eudat/gef/jobs/job1/result'
}
Step 2.
The user requests a fixed functionality of the GEF, using the previous result as input (forward
slashes being encoded as %2F):
POST https://specific.datacenter.eu/eudat/gef/function/func
     ?dataID=jobs%2Fjob1%2Fresult&...
response: HTTP 202 (Accepted):
Location: https://specific.datacenter.eu/eudat/gef/jobs/job2
User requests the result:
GET https://specific.datacenter.eu/eudat/gef/jobs/job2
response: HTTP 200 (OK) (job done):
{
  size: '20MB',
  dataId: 'jobs/job2/result',
  url: 'https://specific.datacenter.eu/eudat/gef/jobs/job2/result',
  irodsUrl: 'irods://specific.datacenter.eu/vzDATA/eudat/gef/jobs/job2/result'
}
Step 3.
In the final visualization step, the same pattern is used:
POST https://specific.datacenter.eu/eudat/gef/function/visualize
     ?dataID=jobs%2Fjob2%2Fresult&...
server response: HTTP 202 (Accepted):
Location: https://specific.datacenter.eu/eudat/gef/jobs/job3
User requests the result:
GET https://specific.datacenter.eu/eudat/gef/jobs/job3
answered with HTTP 200 (OK) (job done):
{
  size: '1MB',
  dataId: 'jobs/job3/result',
  url: 'https://specific.datacenter.eu/eudat/gef/jobs/job3/result',
  irodsUrl: 'irods://specific.datacenter.eu/vzDATA/eudat/gef/jobs/job3/result'
}
Step 4. The user chooses to download the end result over HTTP:
GET https://specific.datacenter.eu/eudat/gef/jobs/job3/result
result: image.png
and also downloads the result of step 3 using iRODS icommands:

$ iinit   # specific.datacenter.eu, port 1247, user, pass
$ iget /vzDATA/eudat/gef/jobs/job3/result ./function.dat
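The request/poll/download pattern used throughout this example can be condensed into a small client-side polling loop; `get_job` here is a stand-in for the HTTP GET calls shown above, not an actual GEF client library:

```python
import time

def wait_for_job(get_job, job_uri, poll_interval=1.0, timeout=3600):
    """Poll a GEF job URI until the job is done, then return the result
    descriptor (the body with size/dataId/url/irodsUrl).

    get_job(uri) is assumed to return (http_status, body):
    204 while the job is still running, 200 when it is done.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status, body = get_job(job_uri)
        if status == 200:        # job done: body holds the descriptor
            return body
        if status != 204:        # anything else is treated as failure
            raise RuntimeError(f"job failed: HTTP {status}")
        time.sleep(poll_interval)
    raise TimeoutError(job_uri)
```

A real client would typically grow the poll interval over time rather than polling at a fixed rate.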
2.5 Job Control and Garbage Collection
Each job started as a result of a GEF request is referenced by the URI returned to the
user (e.g. https://specific.datacenter.eu/eudat/gef/jobs/job1). A GET request
to this resource returns the state of the job (scheduled, running, done, failed). A DELETE
request forcefully terminates the process and removes the URI resource. Requesting the
state of a job that has been deleted will return an HTTP 404 (Not Found) error.
When a job ends normally, the user collects the result and has the option to DELETE
the job. If the user does not DELETE the job, a garbage collection mechanism removes
the job results and the URI resource after a sufficiently long grace period. The garbage
collection mechanism has no information on whether the user retrieved the data; its
implementation can be as simple as a cron job, running every hour, that removes all jobs
older than a certain number of days.
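Such a garbage collector could be as simple as the following sketch, where the job store is modelled as a plain dict mapping job ids to creation times (a real deployment would instead scan the job directory from a cron script):

```python
import time

def collect_garbage(jobs: dict, max_age_days: float, now=None):
    """Remove all jobs older than max_age_days from the job store.
    `jobs` maps job id -> creation time (seconds since the epoch).
    Returns the list of removed job ids."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    expired = [job_id for job_id, created in jobs.items() if created < cutoff]
    for job_id in expired:
        del jobs[job_id]  # would also delete jobs/<id>/result on disk
    return expired
```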
The GEF should keep and manage as little state as possible. It should only expose the state
of the processes it starts, and it should not cache this state but retrieve it from the
appropriate backend. The job state reported back to the user will be one of WAITING,
RUNNING or DONE. Each state can have substates, populated whenever the corresponding
backend provides more information on the job.
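The mapping from backend-specific states to the three user-visible states could be as simple as the following sketch; the backend state names are assumptions for illustration:

```python
# Sketch: collapse backend-specific job states into the three states the
# GEF reports to the user. Backend state names are hypothetical.

BACKEND_STATE_MAP = {
    "queued":    ("WAITING", "queued"),
    "staging":   ("WAITING", "staging input data"),
    "executing": ("RUNNING", None),
    "finished":  ("DONE", "success"),
    "failed":    ("DONE", "failed"),
}

def user_state(backend_state: str):
    """Return (state, substate); unknown backend states default to RUNNING,
    since the GEF cannot assume more than 'the backend is busy'."""
    return BACKEND_STATE_MAP.get(backend_state, ("RUNNING", None))
```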
2.6 Differences to Description of Work
The DoW stresses that the GEF should support data streaming instead of in-memory or
file-based processing. Streaming implies the availability in the GEF backends of software
components designed for streaming data. A simple example of such a workflow would be

input → processing node 1 → processing node 2 → output

All the steps of the processing should work in parallel, like a bash pipeline. The simplicity
of the GEF makes this scenario possible with the appropriate backend. A suitable project
for this case is Storm15, a free and open-source framework for processing massive
streams of data (already used by Twitter).
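The all-stages-in-parallel behaviour can be illustrated with Python generators, a toy stand-in for a real streaming backend such as Storm; the processing nodes here are arbitrary placeholder operations:

```python
def source(items):
    for item in items:       # input
        yield item

def node1(stream):
    for item in stream:      # processing node 1 (placeholder: double)
        yield item * 2

def node2(stream):
    for item in stream:      # processing node 2 (placeholder: add one)
        yield item + 1

def run(items):
    # Each stage pulls items from the previous one lazily, so no stage
    # holds the whole dataset in memory -- like a bash pipeline.
    return list(node2(node1(source(items))))
```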
The DoW also specifies that the GEF should translate any workflow into a common format
and execute it. The current prototype implementation accepts only certain types of workflows
and executes them using their native engines, thus ensuring 100% compatibility. The existing
work on workflow translation can be used to provide a common enactment engine
for all the workflow systems.
3 User Interface
Meeting the needs of all users of a software system is a challenging process, as one can
conclude from the diversity of existing user interface technologies and design choices.
For example, in the climate community, the "ESGF based ENES data infrastructure provides
a rich set of different data access methods to meet the different user demands"16. There
are too many interfaces available for downloading ESGF17 data to list them here, as a first
glance at the data-access how-to proves.
An attempt to categorize users could, for example, look like the following:
The expert user wants an efficient interface, no matter how complicated it is. This user
repeats very similar tasks many times; if the interface is not optimized, using it becomes
too tedious to allow concentrating on the relevant scientific tasks. A web interface is not
a good solution here: clicking through options takes too much time compared to a command
line interface, which allows varying only one aspect of a request and then executing the
variation.

The novice or occasional user needs a rich interface, but not a complicated one with a lot
of possibilities, which could be overwhelming.

Anybody else: people generally interested in the topic, or people following a link in a
newspaper article. They need an interface to the GEF with an extremely restricted selection
of workflows and options.
15 http://storm-project.net
16 Citation from https://verc.enes.org/help/how-to-./data-access, last access November 21, 2013.
17 ESGF means Earth System Grid Federation and is an international collaboration for a data infrastructure in climate science.
With this in mind, we need to design a generic interface API in the GEF that can serve such
different interfaces. A typical minimal set for a community is a web interface (simple) and a
scripting interface (complex). Another typical pair is a graphical user interface (self-explanatory)
and a command line interface (not self-explanatory, but faster to use after some
tedious repetition).
These can be programmed separately, depending upon community needs. Our first choice
in EUDAT is for a web interface.
4 Implementation
The current prototypical implementation is organized as a system integrating the front-end
web service (which implements the GEF API) and the backend, which currently depends on
an iRODS environment and contains:
• The iRODS command trigger
• The command executor
• The workflow system
The prototype currently implements the basic data retrieval service and the workflow
management service. It also partially implements the execution service (without the special
cases of filtering and without map-reduce functionality), with support for Taverna workflows.
The sources are available on the EUDAT SVN18.
4.1 The Web service
The web service is the software component directly implementing the GEF API. It is a Java
servlet using Jersey19 as a REST framework.
The web service receives the HTTP requests from the users and acts accordingly. In the case
of a data transfer request, it mediates between the user and the iRODS server (via
the Java Jargon20 library). When the user requests data via a PID, the web service also
queries the handle system server for the actual URL of the data.
In the case of a workflow execution request, the web service does the following:
1. translates PIDs to iRODS URLs
2. aggregates the request parameters in a local file
3. transfers the parameters file to iRODS
4. returns a token to the user, identifying the new job in progress
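The four steps could be sketched as follows. The sketch is illustrative only: the handle system lookup and the Jargon-based iRODS transfer are stubbed out as function parameters, and the token format and iRODS path are assumptions (the prototype itself is Java code in the EUDAT SVN):

```python
import json
import uuid


def handle_workflow_request(pids, params, resolve_pid, put_to_irods):
    """Sketch of the GEF web service behaviour for a workflow execution
    request. `resolve_pid` and `put_to_irods` stand in for the handle
    system lookup and the Jargon-based iRODS transfer."""
    # 1. translate PIDs to iRODS URLs
    irods_urls = [resolve_pid(pid) for pid in pids]
    # 2. aggregate the request parameters (here: an in-memory JSON document)
    job_id = str(uuid.uuid4())
    params_file = json.dumps({"job": job_id, "inputs": irods_urls,
                              "params": params})
    # 3. transfer the parameters file to iRODS (its ingestion triggers the rule)
    put_to_irods("/gefZone/jobs/%s.params" % job_id, params_file)
    # 4. return a token to the user, identifying the new job in progress
    return job_id
```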
18 https://svn.eudat.eu/EUDAT/Services/WorkflowEngine/gef
19 https://jersey.java.net
20 https://www.irods.org/index.php/Jargon
4.2 The iRODS-Based Backend
The iRODS middleware is a core technology in EUDAT, allowing for data interchange via
federation and secure levels of access, facilitating distributed replicas of the data objects
and easing administration through the rule system and customizable data management
primitives.
Workflow Structured Objects (WSOs) are the iRODS mechanism designed to provide
basic support for data management workflows. A workflow is effectively defined in iRODS
as any sequence of operations, allowing for the possibility of cycles21. An iRODS WSO
(which is actually a type of script) defines the operations and the mitigating procedures
in case of errors. For workflows, the system offers various customization options, including
automatic data staging into and out of the execution environment. The execution of a
workflow is triggered by reading a special file, which also provides its success state. As a
side effect, a collection of files is created, containing staged data and workflow results,
which can subsequently be read or discarded.
Unfortunately, we were not able to use the WSO for integrating workflow support in iRODS.
The WSO is limited in the current implementation due to the following factors:
1. The WSO is a new feature in iRODS and potentially unstable; it only started working
correctly in the latest iRODS version to date (v.3.3). It is also insufficiently documented.
2. The workflow objects are difficult to access via the existing APIs. In particular the
Jargon library, which interfaces iRODS to the Java Virtual Machine, does not have (at
the time of this writing) full support for WSOs.
3. A single staging area is available per workflow object, so multiple workflows running
in parallel can potentially encounter data races.
Therefore we chose a different solution based on custom iRODS rules. The parameters sent
by the user are collected by the web service and deposited in a file in the iRODS system. The
ingestion of this file triggers the activation of a custom rule which starts the GEF command
executor; the executor reads the parameters from the file, prepares the required data and
runs whatever command line script or workflow system is needed. This general pattern is
also used for managing PIDs during Safe Replication.
A schematic view of the execution process is given in the provided sequence diagram (see
Figure 2); an architectural view of the system is also provided (see Figure 3).
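A minimal sketch of such a command executor, assuming a JSON parameters file and a small dispatch table (neither format is prescribed by the GEF; in the prototype the executor is started by the custom iRODS rule):

```python
import json
import subprocess

# Hypothetical dispatch table: maps a requested engine to a command line.
ENGINES = {
    "shell": lambda p: ["/bin/sh", "-c", p["command"]],
    "taverna": lambda p: ["taverna-commandline", p["workflow"]],
}


def build_command(params_file_content):
    """Read the parameters deposited by the web service and build the
    command line to execute; data preparation is omitted in this sketch."""
    params = json.loads(params_file_content)
    return ENGINES[params["engine"]](params)


def execute(params_file_content):
    """Run the selected workflow engine or command-line script."""
    return subprocess.call(build_command(params_file_content))
```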
4.3 Map-Reduce: Hadoop, Pig and PigLatin
Map-Reduce is a programming model for processing large datasets on computer clusters.
A computation expressed in this model consists of a "map" step, during which the large
computation is split into smaller computations, and a "reduce" step, during which the
results of the smaller computations are coalesced into a final result.
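The model can be illustrated by the classic word count, a single-process miniature of what Hadoop does at scale (splitting, shuffling and reducing here all happen in one process):

```python
from collections import defaultdict


def map_step(document):
    """Map: split the large input into small (key, value) pairs."""
    return [(word, 1) for word in document.split()]


def reduce_step(pairs):
    """Reduce: coalesce the per-word counts into the final result."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(documents):
    # A framework would distribute the map calls over the cluster and
    # shuffle the pairs to the reducers; here we simply chain the steps.
    pairs = []
    for doc in documents:
        pairs.extend(map_step(doc))
    return reduce_step(pairs)
```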
Apache Hadoop22 is an entire software ecosystem built around the map-reduce paradigm.
Part of this ecosystem, the Hadoop software library is an open-source software framework
21 https://www.irods.org/index.php/Introduction_to_Workflow_as_Objects
22 http://hadoop.apache.org
Figure 2: GEF service sequence diagram
[Architecture diagram with the following elements: HTTP user (request/response), iRODS, iRODS credentials obtained via the AAI, command executor, workflow engine, OS process.]
Figure 3: GEF implementation architecture
that manages map-reduce jobs. A related project, Apache Pig23, is a higher-level platform
for data analysis that relies on Hadoop as a backend. Pig has its own analysis language, Pig
Latin, which is a query algebra for expressing data transformations.
The GEF map-reduce execution module will be backed by Hadoop clusters and will use the
iRODS-Hadoop integration project done in the EUDAT work package 7.2.
5 Testing Use Cases
In this section we describe the testing use cases of our two communities, the linguistic and
the climate communities.
5.1 ENES Use Case
The ENES Use Case consists of downloading the data and then applying a scientific workflow
to it, where already the download is a challenging and complicated part of the workflow,
involving more than a mere network transfer of the data.
5.1.1 Data Download
In the ENES community, data is distributed through a federation of worldwide data servers,
with a few main gateways and several data nodes. The federation is called the Earth System
Grid Federation (ESGF).
23 http://pig.apache.org
Accessing the data is possible via web interfaces, via scripting interfaces, e.g. in Python24,
or via scripts called from the command line. This diversity can be confusing: which is the
right method for downloading the data? To the authors' knowledge, there is no summary
text which describes all access methods. Asked for support, the DKRZ, one of the three
access sites for the ENES data, points to this page25, but adds that it is also possible to
get the data from CERA26 in case it is replicated there.
To check how well the current federation infrastructure is working, we have interviewed
members of the community. To illustrate how unnecessarily difficult the data download is,
here is, to express it in Scrum terms27, one user story of a PhD student of the Max-
Planck-Fellowship Program at the Institute for Meteorology in Hamburg. She had a task in
mind (let us call it her scientific workflow) and knew which data she needed (data download
workflow). The first attempt to use one of the web interfaces for the download did not
work, and using one of the download scripts failed for lack of knowledge of how the
experiment names are encoded (internal knowledge of the file management). After finding
the right web interface and the right credentials, the download started, only to be interrupted
because the quota was exceeded. The PhD student ended up solving the download task by
asking her supervisor, who downloaded the data for her instead of telling her how to do it
(we can assume that explaining how to succeed in downloading would also have been
difficult). Thus, in the end, the download part of the workflow was accomplished via
“social engineering”. A lot of knowledge played a role: knowing the right web portals, the
replication places for faster download, a machine with fast network access to download
with, and directories with sufficient permissions and quota, to mention only a part of the
required knowledge. Choosing a PhD student for this user story was done on purpose: an
experienced user cannot show all the difficulties, because this type of user no longer notices
them. This user story illustrates how important even a mere download workflow is for the
climate community.
Another example of the required technical “download knowledge” is given on the CERA
page: “Jblob is a command-line based program for downloading data from the CERA
database. Please note, this program does not replace the graphical user interface. It is
mostly useful for people who know which data to download and for batch downloads.”28
When it comes to subsetting the data, to avoid transferring data which is not needed,
there is so far not even an interface to use.
The ESGF implements the required standards for distributed data, such as the Data
Reference Syntax (DRS)29, which specifies how the data files must be structured, as well
as the required metadata, described by a Common Information Model (CIM)30 and a
Controlled Vocabulary (CV). This ensures uniformity among the data centers and the data sets.
Currently there is a capability embedded in the ESGF software stack that enables the
extraction of spatial and temporal data subsets through the data query. However, the ENES
24 https://github.com/stephenpascoe/esgf-pyclient
25 https://verc.enes.org/help/how-to-./data-access
26 http://cera-www.dkrz.de
27 In Scrum, so-called user stories are used to guide implementation.
28 Emphasis added.
29 Taylor, K. E., Balaji, V., Hankin, S., Juckes, M., Lawrence, B., and Pascoe, S. (2010). CMIP5 Data Reference Syntax (DRS) and Controlled Vocabularies.
30 Guilyardi, E., Balaji, V., Callaghan, S., DeLuca, C., Devine, G., Denvil, S., Valcke, S. et al. (2011). The CMIP5 model and simulation documentation: a new standard for climate modelling metadata. CLIVAR Exchanges, 16(2), 42-46.
community has several data processing tools which can be used to apply complex data
processing to data subsets. But, as said before, these tools can only be used after the data
has been downloaded to the user's own computer systems.
The data volumes used by the community's users are currently increasing rapidly. This
happens not only because there are more users, but also because of the increase in the data
volumes generated by the community, due, for example, to increased spatial resolution,
ensembles of simulations, and a larger number of experiments developed to enable the
community to answer more scientific questions.
This raises the question of whether there is a need for reserving bandwidth. Reserving
bandwidth on a normal internet connection is not possible with current network technologies
and requires network research. A possible approach is a software-defined network architecture.
This would mean that the user receives information from the GEF on how long the download
will take, depending on when the user starts it. The scientist could then decide whether, and
if so when, to start the download. This is investigated in EUDAT in task WP7.2.
The scientist can then concentrate on the semantics: what data to use for which computation
to answer the scientific question. This means that the time for the data subsetting
must also be estimated. Whether the data just needs to be transferred, or subsetted prior
to the transfer, remains under the hood of the GEF.
The metadata taskforce is implementing the search functionality, which returns a handle
(PID) for every request. This PID is subsequently used by the GEF. Currently, in ESGF, the
user has to search manually31. A search interface returning a PID is, however, a long-term
aim; in the near future the request will return a URL, a DOI or a PID.
5.1.2 Data Subsetting
Currently, subsetting the data is done after downloading it, which results in unnecessary
data transfer and thus bandwidth usage. The subsetting in our workflow should instead be
done directly at the data centers, prior to the data transfer.
This approach is realistic since the subsetting is done via the cdo command suite32, which
is portable and thus easy to install at all the heterogeneous data centers. To accelerate a
cdo run, it would be possible to distribute it over a compute cluster via a Pig call (see
Section 2.3.4), if cdo were made map-reduce-capable. The current version of cdo does
support OpenMP; the map-reduce paradigm is not supported yet.
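For illustration, the box subsetting could be driven by a thin wrapper that assembles the cdo call. The operator names `sellonlatbox` and `selyear` are actual cdo operators, while the wrapper itself and its argument layout are assumptions for the example:

```python
def cdo_subset_command(infile, outfile, lon1, lon2, lat1, lat2,
                       year1, year2):
    """Build a cdo command that selects a lon/lat box and a year range in
    one chained call, so only the subset ever leaves the data center."""
    box = "sellonlatbox,%s,%s,%s,%s" % (lon1, lon2, lat1, lat2)
    years = "-selyear,%s/%s" % (year1, year2)  # chained operator
    return ["cdo", box, years, infile, outfile]
```

A GEF backend could then run, e.g., `subprocess.call(cdo_subset_command("tas.nc", "france.nc", -5, 10, 41, 51, 1970, 1999))` directly where the data lives.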
5.1.3 Scientific Computing
For some research questions it makes sense to provide standard workflows where the user can
choose custom parameters.33 For more complex questions it is necessary that the scientist
can design the part of the workflow that follows the data retrieval.
For designing workflows it is possible to use cross-community tools which provide a GUI.
Examples are Kepler, Taverna and VisTrails. These three examples were all investigated
31 Some search capabilities are http://esgf.org/wiki/ESGF_Search_API, http://esgf.org/wiki/ESGF_Search_REST_API and https://github.com/stephenpascoe/esgf-pyclient.
32 https://code.zmaw.de/projects/cdo
33 Example workflows can be found at https://verc.enes.org/computing/workflows.
in work package 7.3. Developments from the specific community, in this case the climate
community, should also be taken into account. One example is a domain-specific language
close to Python which can encode workflows. But no matter which way of defining the
workflow we choose, it will call the GEF. The GEF API is, as said before, the “fixed
point”.
5.2 GEF implementation in a data node at CINES
5.2.1 Background
CINES is located in Montpellier (France) and is part of the EUDAT Consortium. It offers
computer services to the scientific community in public research and higher education.
CINES is one of the Tier-1 computer operators and sites of national relevance selected
by GENCI, which is in charge of funding large HPC infrastructures for French public research.
CINES is also involved in PRACE, PRACE 1IP, PRACE 2IP, HPC Europa 2, and in the
initiative, supported by the European Commission, to put in place an integrated framework
for auditing and certifying digital repositories.
With this expertise, a collaboration has taken place between ENES (CERFACS) and CINES
to install a first demo of the (draft) GEF in the ESGF infrastructure. The demo aims to
present an operational prototype for a data workflow use case based on ENES requirements.
Due to a tight schedule, the prototype uses some simplified solutions which would need
to be reviewed should we want to deliver a production system.
5.2.2 Scenario
The scenario is based on use case 9 (UC9), defined as part of the WP7.3 initial
tests on the ENES workflows (see EUDAT MS23 Data Exploration Technology Experiments
and Benchmarking, section 3.2).
The objective of this use case was described as:
Generating data to support a Surface Temperature / Total Precipitation anomaly graph
over the largest possible number of scenarios:
• 30-year average 2050-2079 of rcp85 compared to
• 30-year average 1970-1999 of historical
(global, over France only and also over Europe only)
For the demo, the sequence is:
• The user enters the geographical coordinate box and two date ranges on a web form,
then launches the job.
• The job kicks off in batch mode: based on the entered parameters, relevant files are
selected, and a set of cdo commands calculates spatial and temporal averages for each
model. This creates a set of result files.
• The user can check the status of the job (ongoing, finished) from a web page.
• Once finished, the user can either display the results on a map or download them.
• The user can also decide to store the result in the EUDAT node, choosing either the
basic safe replication storage or a Data Seal of Approval compliant storage.
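The core of the UC9 calculation, the anomaly between two 30-year means, can be sketched on already-averaged yearly values. In the demo this is done per model with cdo on the NetCDF files; the data structure below is invented for the example:

```python
def period_mean(yearly_values, first_year, last_year):
    """Average a {year: value} mapping over an inclusive year range."""
    values = [v for y, v in yearly_values.items()
              if first_year <= y <= last_year]
    return sum(values) / len(values)


def anomaly(yearly_values, scenario_range=(2050, 2079),
            reference_range=(1970, 1999)):
    """30-year rcp85 average minus the 30-year historical average."""
    return (period_mean(yearly_values, *scenario_range)
            - period_mean(yearly_values, *reference_range))
```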
High level diagram
The diagram below is a high level representation of the scenario:
[Diagram components: entry form; job control page; result display on a map; result download; result store or archive; REST calls to the Generic EUDAT Framework (API); launch as batch; standard storage (could be an ESGF node) with the available data files (CMIP5/netCDF); copy required files; YODA environment with input files, data calculation using CDO, and result files (NetCDF); copy result file; store in EUDAT basic storage (files with PID); archive in EUDAT DSA storage (AIP).]
Figure 4: GEF demo ENES implementation at CINES
One of the constraints is to use the GEF (Generic Execution Framework) as it is defined in
its draft description. Even though this is an early-stage definition, the demo implements it
as closely as possible. This is what drives the usage of the REST interface between client
and server.
Sequence
‘Launch a new job’ page
For the demo, we assume only 1 job will be launched at a time. No control mechanism is
implemented.
Click on ‘Launch’ button
This click makes a REST call to the server:
https://server.cines.fr/eudat/gef/filter?queryType=UC9&lat1=xxx&long1=xxx&lat2=xxx&long2=xxx&year1=yyyy&year2=yyyy&year3=yyyy&year4=yyyy
(where xxx or yyyy are the entered parameters)
Receiving this, the server launches a job, passing along the parameters.
Once the job is launched, the server responds with HTTP 202 (Accepted):
https://server.cines.fr/eudat/gef/filter/job-id
(to simplify job-id will always be eudat-cerfacs-demo)
‘Job tracking’ page
Click on ‘Track’ button
This click makes a REST call to the server:
https://server.cines.fr/eudat/gef/filter/eudat-cerfacs-demo
The server checks the status of the job and returns:
If the job is running: HTTP 204 (No Content) { job status: running }
If the job is finished: HTTP 200 (OK) (job done):
{
    size: '192GB',    // Size of the result file
    url: 'https://server.cines.fr/eudat/gef/filter/eudat-cerfacs-demo/result'    // URI for the result files
}
The fields Job-id, Status and HTTP results are populated according to the answer. If the
status is ‘Finished’, the 3 other buttons are activated.
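A client only needs to interpret the three HTTP results above. A sketch of this interpretation (the helper function is illustrative, not part of the demo code; an actual polling loop would drive it with urllib against the server):

```python
import json


def interpret_status(http_code, body=None):
    """Translate the GEF filter endpoint's responses into a job status:
    202 on launch, 204 while running, 200 with a JSON body once the
    result files are ready."""
    if http_code == 202:
        return {"status": "accepted"}
    if http_code == 204:
        return {"status": "running"}
    if http_code == 200:
        result = json.loads(body)
        return {"status": "finished", "size": result["size"],
                "url": result["url"]}
    raise RuntimeError("unexpected HTTP status %d" % http_code)
```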
Click on ‘Display result’ button
This opens the Display result page if the job is finished.
Click on ‘Download result’ button
This downloads the result files using the standard browser download.
Click on ‘Store in EUDAT’ button
This triggers a script which copies the result set of files into the iRODS EUDAT space. The
script runs in batch mode. It then opens the CINES ISAAC web application on the SIP page.
Conclusion of the ENES Use Case
The demo of this ENES Use Case has been presented live both at the EUDAT Workshop
Days (25-26 September 2013, Barcelona) and at the 2nd EUDAT Conference (28-30 October
2013, Rome). Useful feedback from experts has been received, helping to better design the
API. The demo has also shown that the GEF draft API can be installed on a server near the
data storage to perform useful data reduction. This kind of use case can help in designing
a proper GEF API, useful for the ENES community as well as other scientific communities.
5.3 CLARIN Use Cases
5.3.1 Metadata Query Service
In the CLARIN community, data is usually accompanied by exhaustive metadata, stored as
CMDI34 files together with the object data. A filtering mechanism can utilize this metadata;
for example, a query on the publication date of the objects in scope can identify all objects
which were published in the 19th century.

Such a filtering mechanism on top of the metadata can be implemented using the GEF
infrastructure as a workflow taking a filtering expression as input (e.g. an XPath expression,
depending on the form of the metadata). The output is then a list of resources which match
the filter criteria (see Figure 5).
[Diagram (recovered labels): User; script metadata query; list of resources; Generic Execution Framework; query execution (workflow); EUDAT data center with metadata, object data and resources; data transfer.]
Figure 5: Query service: metadata based filtering
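A sketch of such a metadata filter, using the limited XPath support of the Python standard library; the element name `publicationYear` and the document layout are invented for the example, since real CMDI profiles differ:

```python
import xml.etree.ElementTree as ET


def filter_resources(cmdi_documents, century=19):
    """Return the identifiers of all records whose publication year falls
    in the given century. `cmdi_documents` maps an identifier to the raw
    XML of its metadata record."""
    first, last = (century - 1) * 100 + 1, century * 100
    matches = []
    for identifier, xml_text in cmdi_documents.items():
        root = ET.fromstring(xml_text)
        # find() accepts a subset of XPath path expressions
        node = root.find(".//publicationYear")
        if node is not None and first <= int(node.text) <= last:
            matches.append(identifier)
    return matches
```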
For interfacing with the rest of the CLARIN infrastructure, the filtering mechanism can
be wrapped in a REST-style web service. Subsequently, the filtering web service can be
integrated in other workflows, where the subsequent (web-) services can act on the list of
resources which were produced by the filter web service.
A full data querying mechanism, covering not only the metadata, is also possible. The
implementation of this service would use various streaming libraries for XML with support
for (a subset of) XPath (e.g. Nux35, Joost36).
34 http://www.clarin.eu/node/3219
35 http://acs.lbl.gov/software/nux/
36 http://joost.sourceforge.net/
5.3.2 Google Books Ngram
The Google Books Ngram dataset37 is a diachronic collection of n-grams classified by
language and sorted by the number of occurrences. Google provides a simple viewer of the
data, but for more advanced queries this functionality is insufficient; users must download
and process the dataset locally. The size of the dataset makes this operation prohibitively
expensive for most linguistic researchers. The solution is therefore to place the dataset in
a data center and use the GEF with custom workflows as a filtering mechanism.

The dataset is freely available for download and licensed under a Creative Commons
Attribution 3.0 Unported License. It consists of a collection of tabular text files with a
compressed size of 2.2 TB, approximately 10 TB uncompressed.
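A GEF workflow filtering the dataset at the data center could be a simple streaming scan over the tab-separated records; the assumed line layout (`ngram TAB year TAB match_count TAB volume_count`) should be checked against the dataset documentation:

```python
def filter_ngrams(lines, substring, first_year, last_year, min_count=1):
    """Stream over ngram records and keep those that contain `substring`
    and fall in the year range with at least `min_count` matches.
    Streaming keeps memory flat even for the multi-terabyte dataset."""
    for line in lines:
        ngram, year, match_count, _volume_count = \
            line.rstrip("\n").split("\t")
        if (substring in ngram
                and first_year <= int(year) <= last_year
                and int(match_count) >= min_count):
            yield ngram, int(year), int(match_count)
```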
6 Conclusion, notes, discussion
This document presents a theoretical design view of the GEF, backed by an in-progress
prototype implementation. The goal of the GEF is to offer a cross-community interface
API able to provide data reduction (through data subsetting and variable combination) in
order to deal with today's data volumes. This means that it must be generic enough to
access data in a heterogeneous landscape of data centres and federations, with several
data typologies and types, hosted by several disjoint communities, but useful and transparent
enough to be adopted by most EUDAT communities.
There already exists a plethora of workflow engines used in the scientific world which can
be seen as generic. Many communities have adopted one or several workflow engines, such
as Kepler and VisTrails, but the implementation of the processing within these workflow
engines is very dependent on each community. Part of these established workflow engines
could be refactored to interface with EUDAT services using the EUDAT GEF, which would
then provide a cross-community execution engine supporting these differences. Given that,
it has to be stressed that the GEF will be highly dependent on the outcomes of the
Metadata TF and the AAI TF. There are also some dependencies on Semantic Annotation,
Data Staging and Data Replication.
The current implementations of the GEF in CLARIN and ENES are still in an alpha stage,
but within the next months they will be enhanced and brought much closer to the current
GEF description, which will itself also evolve. Given the outcomes of the Workflows Track
at the EUDAT Workshop Days (25-26 September 2013, Barcelona), discussions on workflows
with experts have identified four recommendations (Table 1), which will need to be explored,
discussed further, and taken into account for further development. EUDAT must ensure
that the GEF is able to cope, efficiently, with the current and foreseen large data volumes
in federated data environments, and that it is appealing enough to communities that they
see large advantages in using it.
37 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Table 1: Four EUDAT Workshop Days Workflows Track Recommendations
Action: Provide EUDAT Service APIs for use within Workflows (Priority: High)
Short Description: Projects like EUDAT should not ‘create new complete WF tools’, but instead provide service APIs for workflows. This enables researchers to seamlessly take advantage of current and new EUDAT services such as data staging, data transfer, data replication, simple store, or PID assignment.
Next Steps: Some EUDAT services already offer APIs; create a document that provides an overview of how EUDAT service APIs can be used.

Action: Explore solutions for EUDAT Workflow Provenance Service(s) (Priority: High)
Short Description: There is an increasing variety of WF systems, and many of the communities have already chosen their solutions but might re-use components of others. EUDAT could offer a service that enables ‘workflow component sharing’: a repository/registry where components of workflows are stored together with provenance information. Such information includes, but is not limited to, PID assignments for workflow components, including concrete software elements, information about concrete execution runs, and sample data that enables other researchers to better understand the shared workflow components.
Next Steps: PPNL has some work on sharing components and describing workflows independently of concrete implementations. Such work needs to be surveyed and could be a baseline for a potential new EUDAT service.

Action: Provide higher-level Analysis & Analytics Workflow Components & Service APIs (Priority: Medium)
Short Description: The presentations across all fields have shown that statistical computing, data mining, and machine learning algorithms (e.g. classification, clustering, or regression techniques) are used in some parts of the workflows. A potential set of ‘higher-level data analysis/analytics services’ could be hosted by EUDAT close to the data of the researchers. This includes the provisioning of service APIs for a seamless integration in (existing) analysis workflows and their ‘application enabling process’.
Next Steps: Statistical computing (e.g. R) and machine learning (e.g. Apache Mahout) software already exists. Provide an overview of which of these packages could be conveniently hosted by EUDAT and which service APIs could be provided.

Action: Investigate solutions for data workflow recommender services (Priority: Medium)
Short Description: Data formats are set by user communities, and limited amounts of standardization are having an impact. EUDAT could investigate the possibility of recommender services that provide advice on suitable workflows depending on data formats, scalability, portability, etc. This might include benchmarks of workflows in context and access to (captured) best practices in the community.
Next Steps: Some data formats, such as HDF5 or NetCDF, are used especially across communities. Survey the use of common data formats in the communities.