
D6.5: Interim Interoperability Testbed report

Author(s) Doina Cristina Duma (INFN)

Status Final

Version V1.03

Date 29/08/2018

Dissemination Level: PU (Public)


Abstract:

The objectives of this deliverable are to describe the first results of the validation of services and demonstrators in the interoperability testbeds. The status of the testbeds used by the different Science Demonstrators is presented, with information on the services deployed in order to meet the respective use-case requirements. The deliverable also includes information on the testbeds deployed in the context of the Interoperability task, regarding resource centre networking (PiCO2), AAI in the context of the collaboration with the AARC2 project, and the data interoperability demonstrators.

The European Open Science Cloud for Research pilot project (EOSCpilot) is funded by the European Commission, DG Research & Innovation under contract no. 739563


Document identifier: EOSCpilot-WP6-D6.5

Deliverable lead INFN

Related work package WP6

Author(s) D. C. Duma (INFN)

Contributor(s) T6.3 team & Science Demonstrator shepherds

Due date 30/04/2018

Actual submission date 29/08/2018

Reviewed by John Alan Kennedy (MPCDF - MPG)

Approved by Brian Matthews (UKRI)

Start date of Project 01/01/2017

Duration 24 months

Versioning and contribution history

Version | Date | Authors | Notes
0.1 | 01/04/2018 | Doina Cristina Duma (INFN) | ToC available
0.2 | 29/04/2018 | Doina Cristina Duma (INFN) | First draft available for internal review
1.0 | 04/05/2018 | Doina Cristina Duma (INFN), Andrea Ceccanti (INFN), Xavier Jeannin (RENATER), Violaine Louvet (U. Grenoble), Alain Franc (INRA), Giuseppe la Rocca (EGI Foundation), Rafael Jimenez (ELIXIR), John Kennedy (MPCDF), Kathrin Beck (MPCDF), Michael Schuh (DESY), Dario Vianello (EBI), Thomas Zastrow (MPCDF), Erik van den Bergh (EBI), Nick Juty (UMAN), Carole Goble (UMAN), Keith Jeffery (UMAN), Juan A. Vizcaino (EMBL-EBI), Peter McQuilton (U Oxford), Giorgos Papanikos (Athena RC) | Information on various SDs' status; first version available for review
1.01 | 18/05/2018 | Doina Cristina Duma (INFN) | Preparation to address reviewers' comments
1.02 | 20/06/2018 | Doina Cristina Duma (INFN), Benjamin Guillon (IN2P3), Andrea Ceccanti (INFN), Xavier Jeannin (RENATER), Violaine Louvet (U. Grenoble), Alain Franc (INRA), Giuseppe la Rocca (EGI Foundation), Rafael Jimenez (ELIXIR), Kathrin Beck (MPCDF), Michael Schuh (DESY), Thomas Zastrow (MPCDF), Erik van den Bergh (EBI), Nick Juty (UMAN), Carole Goble (UMAN), Keith Jeffery (UMAN), Juan A. Vizcaino (EMBL-EBI), Peter McQuilton (U Oxford), Giorgos Papanikos (Athena RC) | Improved version addressing reviewers' comments
1.03 | 29/08/2018 | Mark Thorley (UKRI) | Minor typographic edits prior to submission

Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0.

Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services. While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EOSCpilot Consortium make no warranty of any kind with regard to this material including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Neither the EOSCpilot Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. Without derogating from the generality of the foregoing, neither the EOSCpilot Consortium nor any of its members, their officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any information, advice, inaccuracy or omission herein.


TABLE OF CONTENTS

EXECUTIVE SUMMARY
INTRODUCTION
A. Interoperability pilots in the context of WP6/T6.3
  1.1 Pilot for Connecting Computing Centres
  1.2 Pilot for distributed authorization and authentication, WLCG-AARC2-EOSCpilot
  1.3 Data Level interoperability
    Evaluation of EDMI metadata to find and access datasets
    Discovery of compliant data resources and data catalogues
    Research schemas for exposing dataset metadata
    Description and guidelines per metadata property
B. Science demonstrators
  a. First set of Science Demonstrators
    1. ENVRI Radiative Forcing Integration
    2. Pan-Cancer Analysis in the EOSC
    3. Research with Photons & Neutrons
    4. Collaborative semantic enrichment of text-based datasets (TEXTCROWD)
    5. WLCG Open Science Demonstrator - Data Preservation and Re-Use through Open Data Portal (DPHEP)
  b. Second set of Science Demonstrators
    1. HPCaaS for Fusion (PROMINENCE)
    2. EGA (European Genome-phenome Archive) Life Science Datasets Leveraging EOSC
    3. Virtual Earthquake and Computational Earth Science e-science environment in Europe (EPOS/VERCE)
    4. CryoEM workflows
    5. Astronomy Open Science Cloud access to LOFAR data
  c. Third set of Science Demonstrators
    1. Frictionless Data Exchange Across Research Data, Software and Scientific Paper Repositories
    2. Mining a large image repository to extract new biological knowledge about human gene function
    3. VisIVO: Data Knowledge Visual Analytics Framework for Astrophysics
    4. Switching on the EOSC for Reproducible Computational Hydrology by FAIR-ifying eWaterCycle and SWITCH-ON
    5. VisualMedia: a service for sharing and visualizing visual media files on the web
C. Conclusions and Future Work
ANNEXES
A.1. GLOSSARY


EXECUTIVE SUMMARY

In the context of the EOSCpilot project, the Interoperability Work Package (WP6) aims to develop and demonstrate the interoperability requirements between e-Infrastructures, domain research infrastructures and other service providers needed in the European Open Science Cloud. It deals with interoperability in general, leading to the ability to "plug and play" the services of the EOSC in the future. While the Science Demonstrators will show how some particular services can work together, we also need to work on a more general framework for interoperability between data and services. This WP works closely with WP4 (Science Demonstrators) and WP5 (Services), taking inputs from the Science Demonstrators to ensure the relevance of the interoperability framework, and from the services.

The interoperability work was mapped into two tracks:

● Research and Data Interoperability:
○ The Research and Data Interoperability track provides the research infrastructure and domain expert view in the work programme, with a focus on data interoperability. The definition of a Data Interoperability framework in the EOSC is based on the FAIR principles: data and services need to be Findable, Accessible, Interoperable and Reusable.
● Infrastructure Interoperability:
○ The complementary usage of Cloud, Grid, HTC and HPC infrastructures, including large datastores, through high-speed networks and performant data transfer protocols and tools. The high-level objective is to facilitate the most adequate infrastructures for the treatment of extensive amounts of data, generated by new generations of instruments, observatories, satellites, sensors, sequencers, imaging facilities and numerical simulations, and produced by well-known data-intensive communities but also by the long tail of science.
○ In the Infrastructure Interoperability track the provider view is at the centre of the work programme.
○ The federated infrastructure pilots that have to be set up with the resources provided by the other partners involved in this WP and by the selected Science Demonstrators will enable the analysis of the existing interoperation mechanisms for software components, services, workflows, users and resource access within existing RI systems.

The above objectives are envisioned to be implemented through the instantiation of multi-infrastructure, multi-community pilots. The services and the Science Demonstrators defined in WP4 and WP5 are continuously being deployed and validated in these pilots from the standpoint of maturity, scalability, and usability for a future EOSC.

The objective of this deliverable is to present the status of the testbeds made available in the context of EOSCpilot in order to fulfill the requirements of the Science Demonstrators, of e-infrastructure networking interoperability, of authorization and authentication interoperability in the context of the collaboration with the AARC2 project, and of the Data Interoperability Demonstrators. This report contains the results of the use of, and the feedback on, the different pilots. It will shortly be followed by deliverable D6.7, "Revised requirements of the interoperability testbeds", containing an enriched set of new requirements and changes to be applied to the testbeds and services deployed, together with the feedback received from the different stakeholders involved in the various demonstrators.

The activity of Task 6.3 will conclude with the last report, deliverable D6.10, "Final Interoperability Testbed report", with results and final feedback on the interoperability testbeds.


INTRODUCTION

The EOSCpilot project has been funded to support the first phase in the development of the European Open Science Cloud (EOSC). One of the three main objectives of the project is to develop a number of demonstrators functioning as high-profile pilots that integrate services and infrastructures to show interoperability and its benefits in a number of scientific domains, such as Earth Sciences, High-Energy Physics, Social Sciences, Life Sciences, Physics and Astronomy. The activities in this direction will leverage already existing resources and capabilities from different research infrastructure and e-infrastructure organizations to maximize their use across the research community.

In the context of the project the Interoperability Work Package, WP6, has specific objectives that are guiding the activity of task T6.3 - Interoperability Pilots (service implementation, integration, validation, provisioning for Science Demonstrators):

● Providing the architecture, validated technical solutions and best practices for enabling interoperability across multiple federated e-infrastructures, overcoming current gaps expressed by user communities and resource providers.

● Validating the compliance of services provided by WP5 with specifications and requirements defined by the Science Demonstrators in WP4.

● Defining and setting up distributed Interoperability Pilots, involving multiple infrastructures, providers and scientific communities, with the purpose of validating the WP5 service portfolio.

● Assessing the maturity level of solutions in close cooperation with the Science Demonstrators, taking into account factors such as TRL, openness, scalability, user community adoption and sustainability.

The purpose of this document is to provide a status report on the different testbeds deployed in order to fulfill the requirements of the different demonstrators envisaged by the project, both the Science Demonstrators selected by WP4 and the specific use cases defined in the Interoperability Work Package.

The document is organised as follows:

The document starts with an introduction listing all the pilot testbeds available in the project, followed by specific sections with details on the status of each testbed, together with feedback received until now regarding the services deployed.

The document closes with some conclusions and future work on how to improve the interoperability pilots, for example by deploying new services or by requesting new features or improvements to those already deployed.

A. INTEROPERABILITY PILOTS IN THE CONTEXT OF WP6/T6.3

In the framework of the EOSCpilot, WP6 aims to develop and demonstrate the interoperability requirements between e-Infrastructures, domain research infrastructures and other service providers needed in the European Open Science Cloud. It provides solutions, based on the analysis of existing and planned assets and techniques, to the challenge of interoperability. Two aspects of interoperability will be considered: data interoperability, ensuring that data can be discovered, accessed and used according to the FAIR principles, and service interoperability, ensuring that services operated within different infrastructures can interchange and interwork.

The work of the Interoperability WP is aligned with the expected impact of the INFRADEV-4-2016 call, requiring the project to “facilitate access of researchers across all scientific disciplines to the broadest possible set of data and to other resources needed for data driven science to flourish”. In order to run this infrastructure and to enable usability, a rich set of interoperable infrastructure services (IaaS) is required, ranging from AAI, reliable storage endpoints, cloud management frameworks and SDN endpoints to infrastructure monitoring, billing and accounting.

The specific pilots developed in WP6 address e-infrastructure level interoperability (PiCO2 and an AAI interoperability demonstrator run as an AARC-scoped pilot) and data level interoperability (the four data demonstrators detailed below). The Grid-Cloud interoperability demonstrator for the HEP community, mentioned in the previous deliverable, was, following the results obtained last year also through the EOSCpilot project, proposed and accepted as a Thematic Service in the context of the EOSC-hub project, as the Dynamic On Demand Analysis Service (DODAS)1.

1.1 Pilot for Connecting Computing Centres

Objective:

● PiCO2 (Pilot for COnnection between COmputing centers) aims at facilitating the exchange of data and code between Tier 2 and Tier 1 infrastructures, while being agnostic as far as scientific communities are concerned.

● PiCO2 provides a kind of “data exchange bus” to which laboratories, databases and grid computing resources are connected, thanks to iRODS.

● This bus is not local to a single data centre or country, so in order to optimize network exchanges a multi-domain L3VPN (spanning several European countries) is deployed to interconnect the PiCO2 sites. The L3VPN gives the sites the assurance that L3VPN traffic originates from PiCO2 partners.

Figure 1: PiCO2 L3VPN & “ideal” iRODS deployment

The graph depicted in Figure 1 shows currently established connections and illustrates the planned expansion to German and Italian network providers and infrastructures.
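As an illustration of how a laboratory could push data onto this iRODS-based "data exchange bus" and read data from a federated zone, the following minimal sketch uses the python-irodsclient library. The host names, zone names and credentials are hypothetical placeholders and do not refer to actual PiCO2 endpoints.

```python
# Minimal sketch: publishing a dataset on an iRODS zone of the PiCO2 "data exchange bus".
# All connection parameters below are hypothetical placeholders, not real PiCO2 endpoints.
from irods.session import iRODSSession

IRODS_HOST = "irods.example-tier2.fr"   # hypothetical Tier 2 iRODS server
IRODS_ZONE = "tier2Zone"                # hypothetical local zone name

with iRODSSession(host=IRODS_HOST, port=1247,
                  user="alice", password="secret", zone=IRODS_ZONE) as session:
    # Upload a local result file into the user's home collection.
    session.data_objects.put("results/distances.tsv",
                             f"/{IRODS_ZONE}/home/alice/distances.tsv")

    # With a zone federation in place, data held by a remote (e.g. Tier 1) zone can be
    # reached through the same session by addressing the remote zone's namespace.
    remote_obj = session.data_objects.get("/tier1Zone/home/alice#tier2Zone/reference.fasta")
    print(remote_obj.name, remote_obj.size)
```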

Current participants:

1 https://confluence.egi.eu/display/EOSC/EOSC-hub+service+catalogue#EOSC-hubservicecatalogue-thematic


The French sites initially involved in the PiCO2 project are:

● GRICAD, Tier 2 HPC/HTC/Grid/Cloud, Grenoble
● MCIA, Tier 2 HPC/Grid/Cloud, Bordeaux
● PlaFRIM, INRIA Bordeaux cluster, Tier 2 HPC
● IDRIS, CNRS National Computing Centre, Tier 1 HPC
● CC-IN2P3, IN2P3/CNRS Computing Centre, Tier 1 HTC/Grid/Cloud
● RENATER, REN, France
● France-Grille, EGI NGI, France, Tier 1 Grid/Cloud
● GRIF / CEA, Tier 2, Saclay

The project is open for other European sites to join and contribute; a first candidate actively engaging is DESY, and general interest has also been expressed by INFN.

Each participating site provides its own resources. The resources provided differ in type (physical or virtual hardware), amount (storage space, hardware specs) and dedication (shared with actual users or dedicated to the PoC). A table is in preparation that will include information such as the actual storage and compute resources (storage space, CPUs, RAM), whether the resources are bare-metal or virtualized, and whether they are shared with other users or dedicated; the logical deployment layer will also be described (automation: Heat templates, Ansible playbooks). This table will be included in D6.7, “Revised requirements of the interoperability testbeds”, highlighting also the heterogeneity in the testbeds, the points that could raise interoperability issues, and how these aspects could be a source of new requirements on the services envisaged.

For each technical track (storage, networking and computing) a working group was set up.

● One for building a network of peer-to-peer federations between iRODS zones (data storage service). This comprises connections between Tier 1 and Tier 2 centres, between Tier 2 and Tier 2 centres, and between Tier 2 centres and other grid computing sites. Currently, in the logic of the Tier pyramid, each regional data centre has its own iRODS zone for communication between the laboratories connected to it. What is planned and ongoing is to federate zones between different Tier 2 centres, and between Tier 2 and Tier 1 centres; this is in progress between GRICAD, CC-IN2P3, IDRIS, PlaFRIM and NGI France-Grille. What is currently out of reach is to federate several iRODS zones at once; federation can be done peer to peer only, hence the number of federations grows quadratically with the number of zones (n(n-1)/2 links for n zones), and this is under study. Some "real life" tests have been run between IDRIS (Tier 1) and PlaFRIM (Tier 2/Tier 3) with datasets produced at IDRIS (several tens of GB).

● One for connecting the infrastructures within an L3VPN network and monitoring the performance of the network between sites (see above for a more detailed description).

● One for setting up, and making accessible, tools that facilitate the use of the same workflow on various types of computing environments (such as Cloud, HPC, HTC or Grid). A C program (called disseq, which computes pairwise distances between sequences) and a test dataset have been made available to each participant in a Git repository. The tools used to "wrap" this program are NIX, GUIX, Stack, Jupyter notebooks, and containerization with Docker and Singularity, as sketched below. This work started in Spring 2018 at a common meeting in Lyon and is currently under development. The outcomes of these experiments (a non-exhaustive list) are expected by autumn 2018.
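To illustrate the kind of wrapping discussed above, the following sketch launches a containerized run of the distance computation from Python, either through Docker or through Singularity. The image name, file names and command-line arguments are hypothetical placeholders, since the actual disseq packaging is still being defined by the working group.

```python
# Sketch: running the same containerized tool through Docker or Singularity.
# The image name ("pico2/disseq:latest") and the disseq invocation are
# hypothetical placeholders; adapt them to the images actually published.
import subprocess
from pathlib import Path

IMAGE = "pico2/disseq:latest"          # hypothetical container image
DATA_DIR = Path("data").resolve()      # directory holding sequences.fasta

def run_with_docker() -> None:
    # Bind-mount the data directory and run the tool inside the container.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{DATA_DIR}:/data",
         IMAGE, "disseq", "/data/sequences.fasta", "/data/distances.tsv"],
        check=True)

def run_with_singularity() -> None:
    # Singularity can execute the same image pulled from a Docker registry,
    # which is convenient on HPC clusters where Docker is not available.
    subprocess.run(
        ["singularity", "exec",
         "--bind", f"{DATA_DIR}:/data",
         f"docker://{IMAGE}", "disseq", "/data/sequences.fasta", "/data/distances.tsv"],
        check=True)

if __name__ == "__main__":
    run_with_docker()       # or run_with_singularity() on HPC clusters
```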

Challenges addressed in 2017:

● Seamless data flow between infrastructures has been achieved using iRODS:
○ Initial tests were carried out between two sites in France (CC-IN2P3 and GRICAD).
○ This PoC used a dedicated virtualized iRODS cluster on one site (CC-IN2P3) and a physical shared production cluster on the other site (GRICAD).
○ Tests included transfers of various file sizes (from 1 MB to 10 GB) and reached a speed of roughly 100 Mb/s.


Actions for 2018:

● Ensure portability of code between infrastructures.
● A couple of case studies are welcome for real-life tests, with data flows and code sharing. Two have been identified in the long tail of science:
○ Connection between databases and models for climate modelling at regional scale.
○ Sharing metabarcoding computing workflows between labs at the European scale: the COST project "DNAqua.Net" (http://dnaqua.net/) is a candidate for being a case study. It had its annual meeting in Pécs (Hungary) on June 11-15, including working group 4 (Data analysis and storage). Its management committee has expressed interest in having DNAqua.Net's needs for data sharing and computational benchmarks (23 countries) become a use case for PiCO2. This involves sharing datasets and access to some computing infrastructures. The European infrastructure suitable for such a Europe-wide use case in the long tail of science seems to be EGI: the Europe-wide VO BioMed is ready to welcome a proof of concept, as is VO France-Grille for testbeds. Further contacts will be made between DNAqua.Net, the French NGI France-Grille, VO BioMed and possibly EGI to organize a training school in Lyon in autumn 2018, so that a volunteering community within DNAqua.Net can have access to the storage and computing facilities provided by VO BioMed. The connections currently set up within PiCO2 between infrastructures (GRICAD, PlaFRIM, CC-IN2P3, IDRIS, France-Grille) will be used to extend the network of partnerships to include, if feasible, HPC infrastructures.
○ Integrating the containerized Function-as-a-Service solution for the Photon and Neutron community, which has been developed in the context of the corresponding SD in WP4.
● Foster cooperation based on specific use cases and extend participation to a European scale, beyond the continuation of the work of the existing technical groups on iRODS zone federation and NREN monitoring.
● Facilitate code circulation and accessibility with modern tools like containerisation (using Docker or Singularity containers).

All-Hands Meeting, Pisa - March 8-9 2018

● During the EOSCpilot AHM2 in Pisa, 8-9 March 2018, T6.3 organized a dedicated technical session aiming at extending the capabilities to demonstrate technological interoperability on a wider scale inside EOSCpilot.

○ Initiated by the technical teams of DESY and CC-IN2P3, the session's goal was to foster and work on technical collaborations between the groups involved in EOSCpilot technical platforms.

○ In this context, DESY will provide proof-of-concept testing by joining the iRODS-based storage federation deployed for PiCO2. Furthermore, this will provide an opportunity to demonstrate the interoperability between the iRODS federation and the dCache-based storage systems accessible on DESY's OpenStack Cloud.

○ After the meeting, which saw the participation of CC-IN2P3, DESY and INFN, it was agreed to hold a series of follow-up meetings and a workshop in order to elaborate on the iRODS deployment and to collaborate on a validation phase of the Photon & Neutron SD results. INFN will initially participate as an observer in order to better understand the implications of the various setups from the point of view of networking and firewall policies.

2 https://eoscpilot.eu/eoscpilot-all-hands-meeting-agenda


1.2 Pilot for distributed authorization and authentication, WLCG-AARC2-EOSCpilot

The EOSCpilot and the AARC project3 have started a collaboration activity in the field of authorization and authentication and of the policies and recommendations regarding their design. In this section we present the status of the AAI interoperability demonstrator set up as part of the AARC pilots Task 1, "Pilots with research communities based on use cases provided"4, for the WLCG use case. The activities of the WLCG AuthZ working group5 were based on an analysis of the momentum within WLCG to address AuthZ and enable federated identity, focusing in particular on how to address:

● Membership Management
○ The VOMS backend will need to be maintained for some time; however, VOMS-Admin can/should be replaced.
○ VOMS provisioning is not included in existing solutions (e.g. IAM, Checkin+CoManage).
○ Integration with the CERN HR DB must be maintained for identity vetting purposes.
○ CERN's SSO will potentially be a single IdP for LCG VOs.
● Token Translation
○ Required to produce X.509 certificates for users authenticated via identity federations.
○ Using RCauth with either COManage+Checkin or MasterPortal will allow the use of the ssh-key command line flow.
● Authorisation Schemas
○ Necessary to define, agree to and test an interoperable schema for JWT tokens between EGI, OSG and other contributing infrastructures.

Objective: demonstrate a pilot solution allowing a researcher without a certificate to register in a WLCG VO and access a grid service:

● Introducing the minimal required new components to allow a smooth user experience:
○ Central IdP/SP proxy
○ Token Translation Service
○ Attribute Authority
● Managing authentication and authorization to comply with WLCG requirements and standards:
○ HR DB integration
○ Acceptable level of assurance in line with IGTF profiles
● Minimizing the number of new developments required by WLCG services

3 https://aarc-project.eu/
4 https://wiki.geant.org/display/AARC/AARC+Pilots
5 https://twiki.cern.ch/twiki/bin/view/LCG/WLCGAuthorizationWG


Figure 2: This picture gives a high-level overview of the main WLCG pilot components. The focus of the pilot is on the Proxy components box, where an IdP/SP proxy solution capable of integrating federated authentication (e.g. via eduGAIN and CERN SSO), identity vetting via the CERN HR database, and VOMS provisioning will be investigated.

The current authentication model (Figure 2) is based on users' X.509 certificates (IGTF), while the newly envisaged one should be based on federated user credentials released by the home institutes (eduGAIN IdPs), maintaining the use of X.509 certificates for legacy users. This approach would need new components such as:

● an IdP/SP proxy (proxying OIDC/OAuth2 identities, allowing attribute aggregation, etc.)
● a Token Translation Service (SAML to X.509)
● an Attribute Authority + Group Management

As regards the authorization model, in the current one authorization is managed via the VOMS extension of the user proxy presented to the Grid services (roles), while in the newly envisaged one it would be based on user attributes released by the IdP, plus additional ones provided by the Group Management/Attribute Authority (AA), mapped to VOMS proxy extensions (or to JWT claims in the future). A sketch of what such token-based claims could look like is given below.
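To make the idea of token-based authorization more concrete, the following minimal sketch encodes and decodes a JWT carrying group/role-style claims using the PyJWT library. The claim names and values are purely illustrative placeholders and do not reproduce the interoperable schema that WLCG, EGI and OSG still have to agree upon.

```python
# Minimal sketch of token-based authorization claims, using PyJWT (pip install pyjwt).
# Claim names ("groups", "roles") and values are illustrative placeholders only:
# the actual interoperable JWT schema is exactly what the working group must define.
import time
import jwt  # PyJWT

SECRET = "not-a-real-key"  # a real deployment would use asymmetric keys (e.g. RS256)

claims = {
    "iss": "https://iam.example.org/",          # hypothetical token issuer (IdP/SP proxy)
    "sub": "researcher-1234",                   # opaque subject identifier
    "aud": "https://grid-service.example.org/",
    "exp": int(time.time()) + 3600,             # one-hour lifetime
    "groups": ["/atlas", "/atlas/production"],  # VOMS-like group membership
    "roles": ["production-manager"],            # VOMS-like role
}

token = jwt.encode(claims, SECRET, algorithm="HS256")

# A relying service would verify the signature, issuer and audience before
# mapping the group/role claims to local authorization decisions.
decoded = jwt.decode(token, SECRET, algorithms=["HS256"],
                     audience="https://grid-service.example.org/")
print(decoded["groups"])
```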

The implementation strategy chosen by the WLCG AuthZ working group together with AARC2, and supported by EOSCpilot through the setup of one of the pilot demonstrators, is focused on the comparison and benchmarking of two possible solutions:

● EGI Checkin Service + CoManage
● INDIGO IAM

The components provided by the two solutions are listed in Table 1, below, and Table 2 highlights their main strengths and weaknesses.

Table 1: Components available


Table 2: Pros and Cons of the envisaged solutions

The outcome of the pilot will be to provide the WLCG PMB with hands-on feedback on the two possible solutions capable of paving the way towards an X.509-free WLCG:

● both approaches require additional developments;
● both strategies will have to ensure the sustainability of these developments in the forthcoming years.

Possible timelines:

● Finalization of the newly required developments for CoManage and INDIGO IAM: March-July 2018
● Deployment and pilot testing: August-September 2018
○ The work in EOSCpilot for providing the INDIGO IAM pilot has been ongoing since the beginning of April.
● Reporting / benchmarking: October-December 2018
● Final dissemination of the pilot: January-March 2019

This pilot supports the EOSC AAI vision by extending and deploying existing solutions (e.g. INDIGO IAM, EGI Checkin) to support the transition of a large-scale scientific computing infrastructure towards an authentication and authorization model that shields users from the complexity of handling certificates. The architecture depicted above is in line with the recommendations of the AARC Blueprint Architecture, on which the EOSC AAI will be based. The know-how and results obtained in this pilot will be useful not only to better understand the evolution of the WLCG AAI, but also to provide experience and knowledge that could support other scientific communities with similar AAI needs.

1.3 Data Level interoperability

EOSCpilot has produced a first draft of the strategy and recommendations6 to help users and services find and access datasets across several scientific disciplines. This strategy relies on three main ideas:

1. Agreeing on some common and minimum dataset metadata properties to be exposed by data resources (following evaluation of existing metadata models and APIs).

2. Supporting a coordinated ecosystem of dataset metadata catalogues working together to efficiently manage and exchange their metadata.

3. Demonstrating the applicability of these recommendations by implementing them in real use cases, to allow users and services to find and access data.

One of the core outcomes of the draft guidelines is the identification of three components judged to be important for the implementation of this strategy:

6 https://eoscpilot.eu/content/d63-1st-report-data-interoperability-findability-and-interoperability


● EDMI metadata guidelines - EOSC Datasets Minimum Information. A set of metadata properties and a metadata crosswalk (equivalence) across existing metadata models.

● Metadata catalogues strategy - Recommendations about how to support an ecosystem of metadata catalogues.

● Demonstrators - to validate and iteratively improve the proposed recommendations.

The status of the four demonstrators is presented in this section (Figure 3, Evaluation of EDMI metadata to find and access datasets; Figure 4, Discovery of compliant data resources and data catalogues; Figure 5, Research schemas for exposing dataset metadata; Figure 6, Description and guidelines per metadata property). Preparatory activities involving the use case communities began at the end of 2017, after the demonstrators were identified through discussions during the meetings organized by the Data Interoperability team, T6.2, and defined in the D6.3 deliverable. We report the objectives of each demonstrator, the participants, the results of the work done during the first face-to-face meetings, which took place at the EOSCpilot AHM7 in Pisa, 8-9 March 2018, in dedicated data interoperability and hands-on sessions, and the workplan for the next period. Further and final results of the work on these demonstrators, and any new interoperability requirements, will be presented in the next WP6 deliverables: D6.6 - "2nd Report on Data Interoperability", D6.7 - "Revised requirements of the interoperability testbeds", D6.9 - "Final Report on data interoperability" and D6.10 - "Final Interoperability Testbed report".

7 https://eoscpilot.eu/eoscpilot-all-hands-meeting-agenda

Evaluation of EDMI metadata to find and access datasets

Figure 3: First Data Interoperability demonstrator - The general strategy for EDMI metadata processing is illustrated: EDMI dataset metadata will be adopted and exposed by individual resources (e.g. PRIDE), harvested by dataset metadata catalogues (e.g. OmicsDI), and made available to other services (e.g. for indexing and monitoring, OpenAIRE). This demonstrator focuses on the EDMI exposure and harvesting (indicated by the hatched red line).

Objectives:

● test and provide feedback about the functional and operational metadata proposed in the EDMI guidelines, especially about how it will help services to find, access, transfer and replicate data available in third party data resources.

● involve data resources, dataset metadata catalogues and services to test EDMI.
● explore how compliance with the EDMI properties can be measured to evaluate FAIRness.

Tasks:

1. Expose EDMI metadata
○ Engaging data resources and dataset metadata catalogues to expose EDMI metadata in a way that it can be accessed programmatically, by reusing existing programmatic interfaces provided by the resource or by adopting an existing standard or interface (e.g. schema.org or OAI-PMH); a minimal harvesting sketch is given after this task list.
○ Status: a status report was presented at the Pisa hands-on meeting.
■ What was done during the hands-on:
● the work concentrated on a limited number of properties;
● feedback is expected in order to build the examples needed for the demonstrators, also as a means to refine the properties.
■ Next steps:
● analyse the feedback and see how schema.org and its vocabulary can be extended;
● build references and examples for the full sets of the schema.org model that will be used, as well as for the specific schema on which feedback is expected;
● write a cookbook on how to bring your schemas into schema.org;
● use all this information to create a tool able to build the edmischema.org.
2. Index EDMI metadata
○ Engaging dataset metadata catalogues to test the indexing of EDMI metadata exposed by data resources.
○ Status: a status report was presented at the Pisa hands-on meeting.
■ What was done during the hands-on:
● material and references were made available in a shared directory to be used by participants;
● priorities were defined for the dataset properties in the Reference Materials.xls, to be used to feed into RDA.
■ Next steps:
● further activity is expected during May and June, to be concluded with a final report in September 2018.
3. Use EDMI metadata
○ Engaging dataset metadata catalogues to test the indexing of EDMI metadata exposed by data resources.
○ Status: the activities of this task should start in September, after the previous one is done, and last until the end of October 2018.
4. Explore how to monitor compliance with EDMI
○ Testing how useful EDMI metadata is for services to find and access datasets from data resources.
○ Status: the activities of this task are expected to be performed during September and October 2018.
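As an illustration of the kind of programmatic access mentioned in task 1, the sketch below harvests dataset metadata records over OAI-PMH using the Sickle library. The endpoint URL is a hypothetical placeholder, and the field selection is only indicative, since the EDMI property list is still being refined.

```python
# Sketch: harvesting dataset metadata over OAI-PMH and picking out a few fields
# that an EDMI-style minimum-information profile could rely on.
# The endpoint URL is a hypothetical placeholder; the field selection is indicative only.
from sickle import Sickle  # pip install sickle

OAI_ENDPOINT = "https://data-resource.example.org/oai"  # hypothetical data resource

harvester = Sickle(OAI_ENDPOINT)

# Iterate over Dublin Core records exposed by the resource.
for record in harvester.ListRecords(metadataPrefix="oai_dc"):
    meta = record.metadata  # dict of lists, e.g. {"title": [...], "identifier": [...]}
    dataset = {
        "identifier": meta.get("identifier", [None])[0],
        "title": meta.get("title", [None])[0],
        "description": meta.get("description", [None])[0],
        "license": meta.get("rights", [None])[0],
    }
    print(dataset)
```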

Discovery of compliant data resources and data catalogues

Figure 4: Second Data Interoperability demonstrator - The general strategy for EDMI resource metadata processing is illustrated: resource-level EDMI metadata will be adopted and exposed by individual resources (e.g. PRIDE), harvested by dataset metadata catalogues (e.g. OmicsDI), and made available to other services (e.g. for indexing and monitoring, OpenAIRE). This demonstrator focuses on the additional harvesting of EDMI metadata by resource metadata catalogues such as FAIRsharing (indicated by the hatched red line).

Objectives:

● help users and services to better find existing catalogues and the data resources indexed by these catalogues. It also aims to recognise which catalogues and data resources comply with EDMI.

● To do so it aims to involve an existing metadata catalogue of data resources to 1) index and highlight dataset metadata catalogues compliant with EDMI, 2) index and highlight data resources compliant with EDMI and 3) make the link between dataset metadata catalogues and data resources.

Overall status: a status report was presented at the Pisa hands-on meeting.

- What was done during the hands-on:
○ discussed how to relate the EOSC resources;
○ reviewed how FAIRsharing works;
○ discussed how to add data to FAIRsharing to describe all the resources;
○ take the FAIRsharing data model and adapt it;
○ compiled a list of some resources associated with EOSC (more are needed);
○ discussed how to integrate with OpenAIRE, through schema.org or by defining an API;
○ use EDMI metadata;
○ explore how to monitor compliance with EDMI.
- Next steps:
○ the timelines for further activities are listed below under each foreseen task.

Tasks:

1. Create a collection of data resources per dataset catalogue
○ Facilitating the discovery of dataset catalogues and the data resources they index by creating collections per dataset catalogue in FAIRsharing. This task will especially look at the dataset metadata catalogues from the task "Index EDMI metadata" above.
○ Status: the activities of this task are expected to be finalised during November 2018.
2. Highlight compliance with EDMI
○ Create an entry for EDMI as a minimum information guideline in FAIRsharing and highlight data resources and metadata catalogues compliant with the EDMI guideline.
○ Status: the activities of this task are expected to be performed during June and finalised in November 2018.
3. Strategy to expose indexed data resources
○ Propose a simple strategy for dataset metadata catalogues to expose the list of resources they are indexing.
○ Status: the activities of this task are expected to be performed and finalised during June 2018.


Research schemas for exposing dataset metadata

Figure 5: Third Data Interoperability demonstrator - The general strategy for EDMI dataset metadata processing is illustrated: dataset-level EDMI metadata will be adopted and exposed by individual resources (e.g. PRIDE), harvested by dataset metadata catalogues (e.g. OmicsDI), and made available to other services (e.g. for indexing and monitoring, OpenAIRE). This demonstrator focuses on the implementation of EDMI metadata through the Schema.org-like mechanism of Bioschemas markup (indicated by the hatched red line).

Objectives:

● explore how to use Schema.org, in a manner akin to that used by Bioschemas, to facilitate exposing EDMI metadata properties in data resources and dataset metadata catalogues.

● facilitate adoption by coming up with several examples and by testing how a dataset metadata catalogue can index schema.org metadata from a data resource.

Tasks:

1. Expose EDMI properties using schema.org
○ Engage with data resources to expose EDMI metadata properties using schema.org (a minimal markup sketch is given after this task list).
○ Status:
■ Hands-on workshop in Pisa - preparation and results:
● A shared folder was prepared containing all the reference links needed for the Pisa hands-on workshop:
○ presentation of the issue to be solved;
○ details on schema.org - introduction, concepts, examples;
○ EDMI - properties, crosswalks;
○ tools & references - JSON editors (JSON-LD playground8, online JSON editor9), Google Drive shared folder10.
● Results of the hands-on workshop:
○ "Schema" -> EDMI -> schema.org: identified gaps and opportunities;
○ EDMI-compliant catalogues - web pages as API, structured JSON-LD markup;
○ build a solid base reference.
■ Work performed after the workshop:
● Analysis of the feedback and of metadata catalogues. Metadata catalogues chosen as demonstrator pilots:
○ OpenAIRE – linked research publications, project, dataset and author information;
○ BlueBRIDGE – Earth Observation datasets;
○ EBI Metagenomics – Life Science domain.
● The schema.org/Dataset profile was identified to facilitate the EDMI Dataset mapping:
○ incremental, low-barrier adoption;
○ focus on a minimal subset;
○ case studies and example mappings;
○ possible extensions identified;
○ findings reported11.
● Conversion tool:
○ crosswalk utility;
○ loose and tight coupling usage examples;
○ available in a GitHub repository12.
■ Next steps:
● Work on this demonstrator could also be influenced by findings in the other data demonstrators, so a further analysis will be done in the next deliverables to understand whether there are new requirements and what else, if anything, needs to be done.
2. Harvest EDMI metadata exposed in Schema.org
○ Demonstrate how dataset metadata catalogues can harvest EDMI metadata exposed with Schema.org by data resources and dataset metadata catalogues.
○ Status: the activities of this task are expected to be performed and finalised during September and October 2018.

8 https://json-ld.org/playground/
9 https://jsoneditoronline.org/
10 https://docs.google.com/document/d/1ABM5TadVrEELjkzrHDZM9isv2cqhr5f99ujVETF4mMA/edit
11 https://docs.google.com/document/d/1cAua1Hr5pdRm6BTy9z_fvotl2uW-n1nMss4E4aZN2dE/edit
12 https://github.com/madgik/schema2jsonld
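To give a feel for the schema.org/Dataset markup discussed in task 1, the sketch below builds a minimal JSON-LD description of a dataset. The dataset values and the property selection are illustrative placeholders and do not represent the agreed EDMI property list or crosswalk.

```python
# Sketch: minimal schema.org/Dataset description serialized as JSON-LD,
# of the kind a data resource could embed in its landing pages.
# Dataset values are hypothetical; the property selection only illustrates
# the "minimal subset" idea and is not the official EDMI crosswalk.
import json

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "identifier": "https://doi.org/10.1234/example-dataset",  # placeholder DOI
    "name": "Example proteomics dataset",
    "description": "Illustrative dataset description for the EOSCpilot markup sketch.",
    "url": "https://data-resource.example.org/datasets/example",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["proteomics", "EOSC", "EDMI"],
    "creator": {"@type": "Organization", "name": "Example Research Institute"},
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://data-resource.example.org/datasets/example/files.zip",
        "encodingFormat": "application/zip",
    },
}

# Embed this block in the dataset landing page inside a
# <script type="application/ld+json"> ... </script> element,
# so that harvesting catalogues can index it.
print(json.dumps(dataset, indent=2))
```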

Description and guidelines per metadata property



Figure 6: Fourth Data Interoperability demonstrator - The EDMI describes a set of minimal properties that are necessary for a description of scientific resources, enhancing their discoverability. These individual properties have, at least conceptually, been identified as being important across a number of other metadata models that are being actively worked upon by a number of international groups. The aim of this demonstrator is to select individual properties identified as being central to EDMI, which appear across metadata models, to build a formal specification of which components (potentially other properties) they are composed of, to ascribe the necessary attributes to capture valuable metadata, and to enumerate valid values for those attributes. An example property which may be elaborated upon is indicated in the red hatched area.

Objectives:

● Collaborate with the RDA MIG group to describe and provide recommendations for dataset metadata properties, with particular focus on those properties identified by the EOSCpilot as being part of EDMI.

Tasks:

1. Contribute to the RDA MIG guidelines
○ Engage with the RDA MIG working group to define guidelines for dataset properties. Participants in this task will focus on a set of dataset properties, with special emphasis on EDMI properties that need better definition, such as identifiers. These guidelines will provide generic as well as domain-specific recommendations.
○ Status: the activities of this task are expected to be finalised during May 2018.
2. Gap analysis and proposal of new properties
○ Do a gap analysis between EDMI and RDA MIG dataset metadata properties and bring especially the important operational metadata properties to RDA MIG.
○ Status: the activities of this task are expected to be finalised during May 2018.

To conclude, we provide below the timelines of the four Data Interoperability Demonstrators, together with the coordinators of each activity.

Table 3: Data Interoperability Demonstrators Timelines

B. SCIENCE DEMONSTRATORS

One of the objectives of the EOSCpilot project is to understand and address the barriers that stop European research from fully tapping into the potential of data. To improve interoperability between data infrastructures, the project has to engage different scientific and economic domains, countries and governance models, and will demonstrate how resources can be shared even when they are very large and complex. The aim of the EOSCpilot Science Demonstrators is to show the relevance and usefulness of the EOSC services and their enabling of data reuse, and thereby to drive the EOSC development.

In the following sections we provide a summary of the status of the testbeds involved in each of the 15 Science Demonstrators.

a. First set of Science Demonstrators

EOSCpilot started with five Science Demonstrators, pre-selected from a call in 2016, with project execution from January to December 2017:


● Environmental & Earth Sciences - ENVRI Radiative Forcing Integration to enable comparable data access across multiple research communities by working on data integration and harmonised access

● High Energy Physics - WLCG: large-scale, long-term data preservation and re-use of physics data through the deployment of HEP data in the EOSC open to other research communities

● Social Sciences – TEXTCROWD: Collaborative semantic enrichment of text-based datasets by developing new software to enable a semantic enrichment of text sources and make it available on the EOSC.

● Life Sciences - Pan-Cancer Analyses & Cloud Computing within the EOSC to accelerate genomic analysis on the EOSC and reuse solutions in other areas (e.g. for cardiovascular & neuro-degenerative diseases)

● Physics - The photon-neutron community to improve the community’s computing facilities by creating a virtual platform for all users (e.g., for users with no storage facilities at their home institutes)

1. ENVRI Radiative Forcing Integration

● Scientific and Technical Objectives
○ Focus on the dynamics of greenhouse gases, aerosols and clouds and their role in radiative forcing,
○ Interoperability between observations and climate modelling,
○ Cooperation between environmental research infrastructures,
○ Improvement of data integration services based on metadata ontologies,
○ Model-data integration by use of HPC,
○ Petascale data movement,
○ Innovative services to compile and compare model output from different sources, especially on semi-automatic spatiotemporal scale conversion.

● Status: the SD activity is ongoing and a pre-final report has been submitted. An extension of the SD activities until 06/2018 was requested and obtained.
○ IS-ENES and ICOS agreed upon which datasets were to be transferred, matching the scientific use case.
○ The solution proposed and agreed upon was to transfer IS-ENES climate model data hosted on the DKRZ node of the Earth System Grid Federation (ESGF) to a OneData EGI node (Virtual Machine) via HTTP or GridFTP (exploiting Synda, a command-line tool to search and download files from the ESGF archive).
○ CYFRONET configured OneData disk space in the EGI DataHub service to allocate 2 TB in the CEPH installation. A VM provided on the EGI FedCloud was configured to fetch datasets from the original data repository into the OneData volume (a minimal staging sketch is given after this list).
○ 1.1 TB of IS-ENES climate model data has been uploaded to the CYFRONET data infrastructure.
○ Next steps:
■ ICOS RIs to check/validate the transferred datasets.
○ Technical issues:
■ an initial lack of resources to support the SD - solved with the contribution of CYFRONET;
■ initial issues with the Oneprovider - solved by the Onedata developers and released in version “RC8” of the service.
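As an illustration of the staging step mentioned above, the sketch below streams a file over HTTP into a directory where a OneData volume is assumed to be mounted. The source URL and the mount path are hypothetical placeholders; the actual transfers in this SD were driven by Synda against the ESGF archive.

```python
# Sketch: staging a dataset file over HTTP into a OneData-mounted volume.
# The source URL and the destination path are hypothetical placeholders; the
# real transfers in the SD were performed with Synda against the ESGF archive.
import requests

SOURCE_URL = "https://esgf-node.example.org/thredds/fileServer/cmip5/example.nc"  # placeholder
DEST_PATH = "/mnt/onedata/isenes/example.nc"  # placeholder OneData mount point

with requests.get(SOURCE_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open(DEST_PATH, "wb") as out:
        # Stream in chunks so multi-GB climate model files do not need to fit in memory.
        for chunk in resp.iter_content(chunk_size=8 * 1024 * 1024):
            out.write(chunk)
```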

2. Pan-Cancer Analysis in the EOSC


● Scientific and Technical Objectives
○ Analyse large cohorts of cancer genomes and pursue so-called pan-cancer studies to identify factors that may be involved in tumour formation and disease progression across multiple cancer types. The PCAWG (Pan-Cancer Analysis of Whole Genomes) project is currently analyzing >2800 cancer whole genomes, largely on academic and public clouds, and is also developing approaches for data integration with transcriptome & clinical data to address specific hypotheses.
○ Establish a portable, cloud-based, federated solution for collaborative cancer genomics and associated health data management, and an environment accessible to European scientists for analysis.
○ Support the operation of the Butler13 application on multiple clouds.

● Status: the SD activity has finished and a pre-final report has been submitted. A prolongation beyond the 12-month duration (on own resources) has been suggested, to be able to deliver more precise recommendations.
○ CYFRONET addressed scalability issues in the context of the HNSciCloud project and is currently testing the new set-up in EOSCpilot.
○ The Butler scientific workflow framework has been set up and tested at three globally distributed cloud computing environments that are based on the OpenStack platform:
■ the EMBL-EBI Embassy Cloud in the UK, providing 700 vCPU cores, 2.6 TB of RAM, 4.9 TB of volume storage and 32 floating IPs;
■ the Cyfronet cloud in Poland, providing 1000 vCPU cores, 4 TB of RAM, 4 TB of volume storage and 1 PB of NFS storage;
■ the ComputeCanada cloud in Canada, providing 1000 vCPU cores, 3.9 TB of RAM, 62 TB of volume storage and 200 TB of NFS storage.
■ Results:
● >400 high-coverage whole genome samples (~60 TB of data) from the ICGC pediatric brain cancer cohort were downloaded to the ComputeCanada cloud, and ~50 TB of public data from the 1000 Genomes project was loaded onto the CYFRONET environment from the EMBL/EBI data servers utilising CYFRONET's Oneprovider software.
● Butler was used to run a genomic alignment workflow (based on BWA and developed at The Sanger Institute) on >400 samples at ComputeCanada and >400 samples at the EMBL/EBI Embassy cloud, with over 100 TB of data processed to date.
● Infrastructure operations were monitored by Butler's detailed monitoring and self-healing capabilities.
○ Technical issues:
■ stability issues encountered at the Cyfronet environment with the allocated storage and data staging mechanism (Oneprovider);
■ a storage throughput issue faced at the ComputeCanada environment, solved by a reconfiguration of the storage environment;
■ an interoperability issue - uniform identity and access management across environments remains unsolved.

13 https://github.com/llevar/butler


3. Research with Photons & Neutrons

● Scientific and Technical Objectives
○ Enable interoperable cloud-based storage and compute solutions to provide transparent and secure remote access to scientific data and a generic, community-agnostic compute platform.
○ The data analysis framework CrystFEL is applied to the platform; it is increasingly used at various synchrotrons and FELs to analyze data from serial (femtosecond) X-ray crystallography.
○ Integrate the CrystFEL14 framework for distributed computing, remaining interoperable between cloud, grid and HPC/HTC clusters.
○ Test and integrate the value of Jupyter Notebook as a frontend service.
○ Develop web services for easy consumption and visualization of the data.
○ Exploit existing authentication and authorization solutions.
○ Integrate with long-term preservation of the data, aiming at FAIR data along the complete data life cycle.
○ Promote data policies in laboratories and foster the standardization of data formats and the discoverability of metadata.

● Status: The SD activity has finished and a pre-final report has been submitted. It has requested and obtained an extension of SD activities until March 2018.

○ Deployed and tested software used by a large community in structural biology at Free Electron Laser and Synchrotrons on a local OpenStack cloud platform and on local HPC clusters at DESY, Hamburg.

○ Docker-machine has been tested to approach Infrastructure-as-a-Service which now is based on Magnum, Heat and performs well enabling cluster on demand, thereby simplifying the access and management of computing resources.

○ The team is in contact with EGI and is preparing for FedCloud integration, which will strongly enhance cloud interoperability; this includes integration with AAI services such as EGI AAI Check-In.

○ Portability of workflows has been achieved by containerizing several applications from the Photon and Neutron field; analysis code from the HEP community active at DESY has also been profiled for potential integration.

○ In cooperation with XFEL, further codes have been deployed and profiled, the most important being the OnDA (online data analysis) utility.

○ Function-as-a-Service platforms have been examined as a way to deploy codes: they guarantee broad interoperability thanks to simple HTTP-based usage and, in particular, independence from the choice of programming languages and packages included. Starting from OpenFaaS, the platform now integrates OpenWhisk; Kubernetes Fission is planned to be evaluated and compared in terms of elasticity, auto-scaling and user experience (a minimal action sketch is given at the end of this subsection).

○ The integration of the resulting microservices into user-side applications and workflows, and event-based programming models, has been discussed with pilot users and experts in the community.

14 http://www.desy.de/~twhite/crystfel/


○ Due to the successful proof-of-concept of cloud-based services, the on-site OpenStack and the Ceph backend storage are currently being scaled up significantly to 512 physical / 924 logical CPUs and 7.3 TB RAM on the Ceph backend.

○ Scientific packages used in the SD are part of a complex, non-redistributable software stack, which is free to use by academia and non-profit organizations. It has been containerized both for Singularity and Docker and runs on any platform supporting one of the two container environments. However, the licences need to address the new distribution channels (cloud images, container images, Function-as-a-Service).

○ Running tests on the local OpenStack infrastructure and on the HPC cluster, in particular with a focus on automated provisioning of distributed systems and deployment of applications. Integration with existing production environments and administration layers, especially the HPC clusters, is achieved by using existing Puppet, Foreman, QIP stack. Ansible is used to deploy parts of the serverless platform.

○ The metering-as-a-service system ceilometer was replaced by Gnocchi at the end of the SD implementation, providing scalable multi-tenant time-series storage.

○ Used AFS, CVMFS, conventional (NetApp) SMB/NFS-shares and the local/federated Nextcloud instance in order to provide data exchange between VMs and external resources outside the OpenStack environment.

○ Concerning data management and access - investigating interoperability between dCache, iRODS and Onedata.

● Technical issues:

○ licensing - Substantial parts of the applications in the Photon/Neutron science domain are “free to use for academic purposes” but subject to restrictive licensing conditions

○ security - Cloud resources like the EOSC core, including DESY’s local OpenStack instance, are presumably often behind firewalls in “demilitarized zones” (DMZ). Consuming such resources in the EOSC context might require access to specific EOSC entry points opening only very specific ports or routes to well-defined IPs

○ network - Access to cloud instances across domains and in particular privileged access would greatly benefit from inter-network trust and dynamically managed overlay networks for cloud VMs and container deployment in multi-cloud environments

● Next Steps:
○ Work on this SD’s objectives will continue with the integration of middleware solutions for mass storage systems such as dCache and iRODS
○ Preparation for FedCloud integration demands a restructuring of the network layout and security measures
○ Upscaling from microservices to user-defined applications, workflows and software stacks
○ The achieved access to performant IaaS opens great opportunities for Dynamic On-Demand Analysis Services; a series of candidates are being looked at, among them Jupyter Notebook as a service, Kafka clusters, Spark clusters, Docker and Kubernetes, and, most importantly, scalable clusters to underpin the serverless architecture.
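The Function-as-a-Service approach mentioned above can be made concrete with a minimal sketch, assuming OpenWhisk's standard Python action convention (a main(args) entry point returning a JSON-serialisable dictionary). The payload fields and the analysis hook are hypothetical, not the SD's actual code.

    # Minimal sketch of an OpenWhisk Python action exposing an analysis step over HTTP.
    # The payload fields ("run_id", "detector") are hypothetical examples; a real action
    # would trigger the containerised CrystFEL/OnDA processing instead of echoing.
    def main(args):
        run_id = args.get("run_id", "unknown")
        detector = args.get("detector", "unknown")
        return {
            "status": "accepted",
            "run_id": run_id,
            "detector": detector,
        }

Such an action can be created and invoked with the standard OpenWhisk CLI (wsk action create / wsk action invoke), which keeps the interface independent of the implementation language and packages, as noted above.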

4. Collaborative semantic enrichment of text-based datasets (TEXTCROWD)


● Scientific and Technical Objectives
○ work on the collaborative semantic enrichment of text-based datasets by developing new software to enable a semantic enrichment of text sources and make it available on the EOSC
○ support the encoding/metadata enrichment of text documents that are the main part of datasets used in Digital Humanities and Cultural Heritage research
○ improve item searchability through the ARIADNE catalogue, as it accesses the report content in a semantic way, and not only its summary description

● Status: The SD activity has finished and a final report has been submitted. The SD team is interested in using B2DROP to store/share metadata. A meeting has been scheduled with the EUDAT provider on Monday 30 April.

○ The TEXTCROWD service (https://textcrowd.d4science.org), empowered by the D4Science infrastructure, allows users to upload and store textual documents in a personal cloud folder, perform NLP and Named Entity Recognition (NER) operations, trigger the semantic enrichment process and get CIDOC CRM information in RDF. The demonstrator was deployed inside an EGI VM with EUDAT B2DROP linked to the D4Science environment.

○ During the SD timeline the following achievements have been reached:
■ Semantic enrichment of text documents: sets of similar texts concerning related topics (e.g. archaeological excavation); small size (datasets of KBs or MBs, not TBs or PBs); enrichment: metadata creation to improve indexing, discoverability, accessibility and reusability

■ Recognition and automatic annotation of entities and concepts in texts: Machine learning model for natural language processing (NLP) and Named Entities Recognition (NER)

■ Knowledge extraction in semantic format: CIDOC CRM (ISO 21127:2006) encoding in RDF format; machine-readable and consumable data to enhance integration and interoperability within the CH domain (a minimal RDF sketch is given at the end of this subsection)

■ Cloud based architecture empowered by a modular and extensible framework: intuitive, web-based user interface; User and access management; Cloud storage of private and shared files; Results available and reusable within various Virtual Research Environments (VRE)

■ Interoperability of extracted knowledge: Semantic information in CIDOC CRM format: full integration and interoperability with other Cultural Heritage semantic data

■ FAIR Principles implementation: Metadata to be stored in various registries for easy findability and accessibility (e.g.: ARIADNE Registry, PARTHENOS Registry); Results ready to be reused within the same environment or consumed by other services in different scenarios; Standard NLP encoding to make data interoperable with other NLP tools

■ Machine-learning capabilities: TEXTCROWD has been boosted and made capable of browsing big online knowledge repositories, educating itself on demand, and producing semantic metadata ready to be integrated with information coming from different domains, establishing an advanced machine learning scenario.

○ Technical issues:
■ No user-friendly cloud-based environment available
● Desktop pipeline developed in the first phase of development
● Cloud migration into the D4Science infrastructure in the second phase of development


■ Identification of NLP tools, and in particular of POS and NER frameworks, suitable to analyze linguistic entities in Italian and able to efficiently recognize and mark named entities relevant for the chosen domain (archaeology).

■ Definition of vocabularies and terminological tools to speed up entity recognition and machine learning process for Italian archaeological reports.

■ Identification of a cloud environment suitable for hosting TEXTCROWD’s modular infrastructure
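To illustrate the kind of semantic output described above, the following is a minimal sketch, assuming the rdflib library and the commonly used CIDOC CRM namespace; the document and place URIs are invented examples, not actual TEXTCROWD output.

    # Minimal sketch (not the TEXTCROWD code) of expressing one recognised entity
    # as CIDOC CRM triples in RDF with rdflib.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, RDFS

    CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")

    g = Graph()
    g.bind("crm", CRM)

    doc = URIRef("http://example.org/report/excavation-2017")   # hypothetical report URI
    place = URIRef("http://example.org/place/pompeii")          # hypothetical place URI

    # The excavation report is a document (E31) that refers to a place (E53).
    g.add((doc, RDF.type, CRM["E31_Document"]))
    g.add((place, RDF.type, CRM["E53_Place"]))
    g.add((place, RDFS.label, Literal("Pompeii")))
    g.add((doc, CRM["P67_refers_to"], place))

    print(g.serialize(format="turtle"))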

5. WLCG Open Science Demonstrator - Data Preservation and Re-Use through Open Data Portal (DPHEP)

● Scientific and Technical Objectives
○ the overall goal of this Science Demonstrator is to implement a service for the long-term preservation and re-use of HEP data, documentation and associated software
■ to use non-discipline-specific services combined in a simple and transparent manner (e.g. through PIDs) to build a system capable of storing and preserving Open Data at a scale of 100 TB or more
■ deploy services that tackle the following functions:
1. Trusted / certified digital repositories where data is referenced by a Persistent Identifier (PID);
2. Scalable “digital library” services where documentation is referenced by a Digital Object Identifier (DOI) (a minimal DOI resolution sketch is given at the end of this subsection);
3. A versioning file system to capture and preserve the associated software and needed environment;
4. A virtualised environment that allows the above to run in Cloud, Grid and many other environments.

● Status: The SD activity has finished and a final report has been submitted.
○ the Pilot integrates services from three well-established e-infrastructures: OpenAIRE (e.g. Zenodo), EGI Engage (e.g. CVMFS – already offered by EGI InSPIRE SA3, led by CERN) and a Trustworthy Digital Repository (TDR).

○ Whilst it was possible to upload a documentation file into the EUDAT B2SHARE test instance (chosen over Zenodo due to the affiliation of the shepherds) and whilst software from the LHC experiments is stored in the RAL CVMFS instance, there were significant delays in finding a site that could act as a TDR for this pilot

○ Technical issues:
■ the above three services are not interoperable
■ as stated in the final report: “The objective and the simple goal outlined above was not achieved”
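As a small illustration of the PID/DOI building block listed in the objectives above, the following sketch resolves a DOI to machine-readable metadata via content negotiation at doi.org; the DOI shown is a placeholder, and this is not part of the pilot's implementation.

    # Minimal sketch: resolve a DOI to CSL-JSON metadata using doi.org content negotiation.
    import json
    import urllib.request

    def resolve_doi(doi):
        req = urllib.request.Request(
            f"https://doi.org/{doi}",
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    if __name__ == "__main__":
        meta = resolve_doi("10.5281/zenodo.1234567")   # placeholder DOI, not a pilot record
        print(meta.get("title"))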

b. Second Set of Science Demonstrators

The first EOSCpilot Open Call for Science Demonstrators in April 2017 resulted in five new Science Demonstrators with execution from July 2017 to June 2018.

● Energy Research – PROMINENCE: HPCaaS for Fusion - Access to HPC class nodes for the Fusion Research community through a cloud interface

● Earth Sciences – EPOS/VERCE: Virtual Earthquake and Computational Earth Science e-science environment in Europe


● Life Sciences / Genome Research: Life Sciences Datasets: Leveraging EOSC to offload updating and standardizing life sciences datasets and to improve studies reproducibility, reusability and interoperability

● Life Sciences / Structural Biology: CryoEM Workflows: Linking distributed data and data analysis resources as workflows in Structural Biology with cryo Electron Microscopy: Interoperability and reuse

● Physical Sciences / Astronomy: LOFAR Data: Easy access to LOFAR data and knowledge extraction through Open Science Cloud

1. HPCaaS for Fusion (PROMINENCE)

● Scientific and Technical Objectives
○ The overall aim of this work is to demonstrate the feasibility of launching small-scale MPI/OpenMP jobs on cloud infrastructures, including EGI FedCloud resources, and to allow external communities to assess whether HPC-class nodes accessible through such a cloud interface are useful to them in their work
○ Run containerized fusion MPI-based modelling codes on both the CCFE cloud and private clouds
○ Demonstrate the ability to burst onto EGI FedCloud, and the ability for external communities to access resources at CCFE

● Status: the SD activity is ongoing, an 8-month activity report has been submitted
○ Containerized MPI-based applications and successfully ran them on both EGI and commercial clouds. The clusters were initially created as static SLURM clusters, using Infrastructure Manager through EC3 and CLUES, and later made dynamic using OpenVPN to create workers on multiple cloud instances using both OpenNebula and OpenStack.
○ Local OpenStack setup and testing
■ Negotiations with commercial OpenStack installers.
■ OpenStack was deployed locally by Bright Computing.
■ Successfully ran MPI applications (Geant4 simulations) in Docker containers in OpenStack VMs, making use of InfiniBand networking.
■ Internal discussions have started in order to provide more widespread access to OpenStack by internal users as well as, eventually, external users.
○ EGI FedCloud access and familiarisation: Grid certificate renewed; gained access to the fedcloud.egi.eu VO; carried out basic testing of EGI FedCloud using the VM Operations Dashboard, OCCI command-line tools and Infrastructure Manager.
■ Resources in use: 133 vCPU cores at CESNET-MetaCloud, 8 vCPU cores at IN2P3-IRES and CESGA

○ Running MPI jobs in containers on EGI FedCloud: Used Infrastructure Manager to deploy static SLURM clusters with OpenMPI and Singularity installed and tested running MPI applications on FedCloud sites and Azure; used EC3 (http://servproject.i3m.upv.es/ec3) combined with Infrastructure Manager to deploy elastic SLURM clusters and run MPI jobs; used OpenVPN along with EC3 in order to deploy elastic SLURM clusters spanning multiple clouds, enabling MPI jobs to run on one preferred cloud site but burst onto others when resources were no longer available on the first cloud; set up a proof-of-concept based on HTCondor providing a gateway where users can submit and monitor jobs. MPI clusters are created on demand, with the numbers of nodes and cores tailored to each job, and each cluster is destroyed after each job completes.

○ Container runtimes: Predominantly using Singularity to successfully run MPI jobs in containers; testing use of uDocker and Shifter as alternatives to Singularity. uDocker seems to give a small performance improvement of up to 4% compared to Singularity.


○ Data handling: Each cluster is deployed with one node acting as an NFS server with a large attached disk (up to 1 TB has been tested), providing the cluster with shared storage for storing temporary files; obtained access to Ceph-based object storage at STFC, available via both Swift and S3 APIs; set up a mechanism to automatically copy specified job output files to Ceph using the Swift API, and provide users with temporary URLs that can be used to access their data.

○ Monitoring: Initially deployed Ganglia monitoring on the SLURM clusters in order to monitor performance and resource consumption; Using a central Grafana instance for visualisation and InfluxDB for storing data, along with Telegraf installed on each node in each cluster in order to collect metrics and compare the usage of different clouds.

○ Next steps:
■ Ability to define policies for selecting which clouds to use
● Attempting to deploy a local instance of the INDIGO PaaS Orchestrator and associated components, such as the SLA Manager and IAM.

● Attempting to get access to an existing Orchestrator instance.
○ Technical issues:

■ When deploying IM in a Docker container it was not possible to deploy infrastructure on EGI FedCloud due to a missing RPM in the Docker image. This problem was very quickly fixed by the IM developers and a new Docker image was released.

■ The example SLURM cluster RADL in the INDIGO IM Github repository had a bug preventing it from working. It was clear how to resolve this but the developers quickly updated the repository once notified.

■ EC3 was unable to deploy clusters on EGI FedCloud sites due to an SSL issue affecting the IM instance deployed on the SLURM frontend. Initially used a workaround by deploying IM in a Docker container, but once the developers were able to reproduce the problem they corrected the IM role in Ansible Galaxy

■ Since users in the fusion community mainly need to run MPI jobs it is of course preferable to deploy clusters on clouds which have low latency interconnects, such as InfiniBand or Intel Omnipath, in order to achieve the best performance. Currently it is not possible to determine automatically which EGI FedCloud sites provide this ability, however EGI are aware of this and will include this in their new schema used by sites publishing their capabilities. This is not a blocking issue as there are in fact no existing sites in EGI FedCloud providing low latency networking.

■ Different credentials are being used to deploy infrastructure on EGI FedCloud (X.509 certificate) and storage (username and secret key). Ideally a single credential would provide access to both compute resources and storage, but this would depend on the existence of interoperable AAI infrastructures.

■ To use container runtimes such as Singularity and uDocker with MPI, the same version of MPI must be installed on the host as in the container (see the sketch at the end of this subsection).

■ OpenNebula sites within EGI FedCloud have a limitation that all VMs have public IP addresses and are accessible on the internet, e.g. by ssh. From a security point of view it would of course be better if this was not the case.

■ Interoperability between the different AAI infrastructures in different organisations is an important issue that needs to be solved in order to simplify access to both compute and storage resources.
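The container/MPI constraint noted above can be illustrated with a minimal sketch of how a containerised MPI code is typically launched on the elastic SLURM clusters, assuming a host-side mpirun and Singularity; the image name, rank count and binary path are hypothetical, and this is not the PROMINENCE code.

    # Illustrative sketch: host MPI launches the ranks, each rank starts inside the container.
    # The MPI version inside the image must match the one installed on the host.
    import subprocess

    def run_mpi_in_container(image, app, ranks=16):
        """Run an MPI binary that lives inside a Singularity image."""
        cmd = [
            "mpirun", "-np", str(ranks),
            "singularity", "exec", image,
            app,
        ]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        # Hypothetical image and application path.
        run_mpi_in_container("fusion_code.simg", "/opt/fusion/bin/solver")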

2. EGA (European Genome-phenome Archive) Life Science Datasets Leveraging EOSC

● Scientific and Technical Objectives


○ leverage EOSC resources to refresh datasets uploaded to the EGA using newly available or updated reference data
■ FAIRified metadata on both result sets is available at a testing EGA server and/or at an appropriate repository
○ the new dataset is to be made available in a FAIR manner, adding metadata according to the attributes chosen as contributing the most to the FAIR principles.

○ develop pipelines and security mechanisms as part of the demonstrator to automate this process

■ A set of results data to be reproduced using a portable version of the pipeline.
■ The same result set to be updated by re-analyzing it with a current version.

● Status: the SD activity is ongoing, an 8-month activity report has been submitted
○ the original software used in the GoNL pipeline was extracted from the various sources and containerised, to obtain a containerised version of the original GoNL pipeline for processing BAM files with the new pipeline

○ the original workflow was translated to the NextFlow workflow manager, and its runner was used to run the pipeline at the Barcelona Supercomputing Centre (BSC), by first uploading the GoNL dataset to the BSC filesystems and using Singularity containers (an illustrative launch command is sketched at the end of this subsection)

○ partial sample analyses from the original dataset were run with the containerised NextFlow workflow to compare these results with the original; these proved that the pipeline was correctly duplicated

○ Technical issues:
■ Original software versions were not always findable, so the closest available version had to be used.
■ Other software was not capable of running in a modern environment, in which case alternatives had to be found.
■ The original dataset turned out to be too large to run with the current resources.
■ Original ancillary files required for the pipeline execution were not available, so the closest available version had to be used.
○ Next steps:

■ Collecting the security requirements and policies needed to run the workflows in other providers to address the resource issues

■ Interested to use the B2FIND Metadata Catalogue. A technical meeting with the B2FIND provider has been organized.

● Working on stabilising the metadata schema of EGA Dataset to be used by B2FIND
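As an illustration of the containerised pipeline runs described above, the following is a minimal sketch assuming the standard Nextflow CLI with Singularity support; the pipeline script, image name and the --input parameter are assumptions, not the SD's actual files.

    # Illustrative sketch: drive a containerised Nextflow pipeline with Singularity.
    import subprocess

    def run_pipeline(pipeline="main.nf", image="gonl-pipeline.sif", bam_dir="/data/gonl/bams"):
        subprocess.check_call([
            "nextflow", "run", pipeline,
            "-with-singularity", image,   # execute every process inside the container
            "-resume",                     # reuse cached results when re-running
            "--input", bam_dir,            # pipeline-specific parameter (assumed name)
        ])

    if __name__ == "__main__":
        run_pipeline()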

3. Virtual Earthquake and Computational Earth Science e-science environment in Europe (EPOS/VERCE)

● Scientific and Technical Objectives

○ produce data products such as: simulated seismic waveform images, wave propagation videos, 3D volumetric meshes, sharable KMZ packages and parametric results to address the research needs and institutional requirements

○ Readiness for FedCloud integration
○ Post-processing workflows on Cloud resources to be available
○ Integration of B2STAGE between HPC/Cloud
○ Resources and self-managed iRODS instance completed
○ Improvement of the provenance model and associated reproducibility and validation services done


○ First test with large data post-processing and concurrent users successfully realized on the Cloud

● Status: the SD activity is ongoing, a 4-month activity report has been submitted and an extension of 4 months has been requested, until 10/2018

○ Three cloud providers, IN2P3-IRES, SCAI and HG-09-Okeanos-Cloud, of the EGI Federation have been identified and configured to support the verce.eu VO

○ The Provenance system (S-ProvFlow) has been improved in many aspects:
■ better REST API methods,
■ provenance repository performance,
■ frontend usability and
■ modular “Dockerisation” of each component.

○ A new DCI_BRIDGE VM image is now in the EGI AppDB and ready for testing.
○ The VERCE portal has extended its AAI framework to enable access with the EGI AAI Check-In service, using the OIDC module developed by INFN for Science Gateways based on Liferay technology.

○ Technical issues:
■ The VM image required by the gUSE Cloud integration needed to be updated (dci-bridge included in the image, ObsPy & GridFTP etc. included, LVM issue)
■ FedCloud integration requires site support for public floating IPs, and a network policy which allows access on the port where the DCI-Bridge listens in the available VM image

■ Regarding support of Per-User Sub-Proxies, not all software supports PUSP out of the box. E.g. iRODS GridFTP DSI does not support PUSP (all PUSPs will be recognized as the same user, i.e. using the robot certificate).

○ Next steps:
■ Testing of the integration with the EGI FedCloud infrastructure is in progress.
■ The EPOS AAI Team suggested waiting before integrating the portal with the official EPOS core AAI infrastructure (based on UNITY), since it is still not ready. This is the reason why the OIDC module has been used in the context of the EOSCpilot project to speed up the pilot activities. The migration to the official EPOS AAI will be done once it is ready.

4. CryoEM workflows

● Scientific and Technical Objectives
○ develop ways to share detailed information on cryo-EM image processing workflows, concentrating on those processes usually run at the Facility level, thereby increasing reproducibility in science

○ write from Scipion a workflow file that fully describes the image processing steps so that they can be re-executed resulting in exactly the same results (making the data more FAIR)

○ users of a representative subset of major CryoEM Facilities in Europe will leave the Facility not only with raw and preprocessed data, but also with a file linking the acquired data and the analysis workflows, that:

■ will contain detailed information enabling the reproduction of processing steps (assuming access to the original software),

■ will be ready and accepted to be deposited in cryo-EM databases like EMDB and EMPIAR, and

■ will be easy to browse and analyze over the Web.


● Status: the SD activity is ongoing, an 8-month activity report has been submitted
○ Extended and improved the demonstrator in order to make it compliant with the FAIR principles.
○ Scipion now writes the image processing pipeline used at electron microscopy facilities to a JSON file that can be exported from and imported into Scipion, and can be deposited in public databases such as EMPIAR. EMPIAR will integrate a web viewer specifically designed for this kind of file (a minimal sketch of reading such a description is given at the end of this subsection).

○ evaluated the possibility of using Common Workflow Language as a standardized reporting tool for other programmers

○ Technical issues:
■ none reported

○ Next steps:
■ Final version of the implementation of a JSON-based detailed workflow description covering all steps of current preprocessing workflows within Scipion.
■ Final version of the implementation of a web browser for the workflow description files.
■ Implementing Scipion’s capability to import and reproduce workflow description files
■ Deployment at Facility sites
■ From the interoperability point of view, the SD will work together with ELIXIR on ontological development, centred on extending EDAM to cryo-EM.
● Most test Facilities will be Instruct sites
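As an illustration of consuming the workflow description mentioned above, the following sketch reads a JSON export and lists its steps; the field names follow the general shape of Scipion workflow templates but should be treated as assumptions here, as should the file name.

    # Hypothetical sketch: summarise the protocol steps stored in an exported
    # JSON workflow description (a JSON array of step dictionaries).
    import json

    def summarise_workflow(path):
        with open(path) as fh:
            steps = json.load(fh)
        for step in steps:
            name = step.get("object.className", "UnknownProtocol")   # assumed key name
            step_id = step.get("object.id", "?")                      # assumed key name
            print(f"step {step_id}: {name}")

    if __name__ == "__main__":
        summarise_workflow("cryoem_preprocessing_workflow.json")   # hypothetical file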

5. Astronomy Open Science Cloud access to LOFAR data

● Scientific and Technical Objectives
○ Existing LOFAR data will be made readily available to a much larger and broader audience, enabling novel scientific breakthroughs. In particular:
■ Access to LOFAR data in accordance with FAIR principles
■ Container-based processing of LOFAR data on EOSC infrastructure
○ Allow the science community to locate, access, and extract science from the LOFAR archive without being an expert on data retrieval and data analysis tools. The pilot will develop services, based on existing tools such as Xenon, CWL, Docker and Virtuoso, that allow users to initiate processing on data stored in a distributed, large-scale archive

● Status: the SD activity is ongoing, an 8-month activity report has been submitted
○ provided access to EUDAT resources: B2SHARE (AAI via B2ACCESS) needs more testing
○ provided access to PSNC/FZJ resources
○ CWL implementation of three pipelines working
■ demonstrated that the Common Workflow Language standards are sufficient to enable the portable description of radio astronomy data analysis pipelines via three real-world examples:
● Prefactor, which is used for performing the first calibration of LOFAR data
● Presto, which is a framework to search for pulsars in radio frequency datasets
● Spiel, which is a radio telescope simulator that generates ‘dirty’ images from an input image
○ started making Debian packages for all required software
○ evaluated three containerisation platforms: Docker, Singularity and uDocker (user-space Docker)
○ the engine chosen for the first deployments is the open-source, Python-based Toil, a complete but still lean implementation that supports parallelisation and scheduling (an illustrative invocation is sketched at the end of this subsection)


○ run the pipelines on a variety of platforms: MacBook, Linux Server, the SURFsara HPC cloud environment and Cartesius, the Dutch national supercomputer, also hosted by SURFsara.

○ Benchmarks were performed on small datasets to evaluate the relative performance on the different systems

○ Technical issues:
■ bugs in the CWL implementations and ambiguities in the CWL standard:

● nested directory structures, typical for radio astronomy datasets (measurement sets) were not properly handled - Fixed

● optional in-place writing - an important addition to the standard made as requested by the SD; it had an issue where intermediate array products were not properly sorted, resulting in alignment problems if multiple arrays are used - Fixed

● Toil depends on a scheduler for running jobs on a cluster, which needs to be deployed if none is at hand. It was decided to try to set up a Mesos cluster to handle scheduling on a mini-cluster deployed on the SURFsara HPC cloud, but this was not successful and there was not enough time left to fully investigate the issue, so this setup could not be benchmarked.

● Container security: Docker is known to be insecure in a multi-user environment. Singularity seems better suited for this purpose. Although Singularity does contain an executable with a ‘setuid bit’ set, which might expose a security risk, many HPC sites have adopted this technology.

● Rigid dependencies between CUDA drivers and Nvidia kernel modules: if a container containing a CUDA version not matching the host’s Nvidia kernel module version is used, GPU acceleration does not work. This breaks the host-platform independence. For Docker there is a workaround in the form of a Docker extension, but for a different container technology there is no solution (yet).

● Interoperability issues:
○ Addressed:
■ Portable and flexible pipeline/workflow definition & deployment using containers.
○ To be done:
■ Data access and storage (FAIR data repository)
■ UI for defining & starting pipelines.
○ Not necessarily covered within EOSCpilot:
■ Make workflows usable over more systems.
■ High-bandwidth connectivity.
■ Federated access

○ Next steps:
■ Investigating FAIR services for LOFAR
■ CWL-based workflow development
■ Notebook-based processing, report/documentation
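As an illustration of the chosen engine mentioned in the status above, the following is a minimal sketch of launching one of the CWL pipelines with the Toil CWL runner; the workflow, job file and job-store names are hypothetical and this is not the SD's deployment script.

    # Illustrative sketch: run a CWL workflow with Toil's CWL runner.
    import subprocess

    def run_cwl(workflow="prefactor.cwl", job="prefactor_job.yml", jobstore="./toil-jobstore"):
        subprocess.check_call([
            "toil-cwl-runner",
            "--jobStore", jobstore,     # Toil keeps workflow state here
            workflow,                   # CWL description of the pipeline
            job,                        # YAML/JSON file with the input parameters
        ])

    if __name__ == "__main__":
        run_cwl()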

c. Third Set of Science Demonstrators

The second EOSCpilot Open Call for Science Demonstrators in August/September 2017 resulted in five new Science Demonstrators with execution from December 1, 2017, to November 2018:

● Generic Technologies: Frictionless Data Exchange Across Research Data, Software and Scientific Paper Repositories

● Astro Sciences: VisIVO: Data Knowledge Visual Analytics Framework for Astrophysics


● Social Sciences and Humanities: VisualMedia: a service for sharing and visualizing visual media files on the web

● Life Sciences - Genome Research - Bioimaging: Mining a large image repository to extract new biological knowledge about human gene function.

● Earth Sciences - Hydrology: Switching on the EOSC for Reproducible Computational Hydrology by FAIRifying eWaterCycle and SWITCH-ON

1. Frictionless Data Exchange Across Research Data, Software and Scientific Paper Repositories

● Scientific and Technical Objectives
○ demonstrator service for fast and highly scalable exchange of data across repositories storing research datasets, manuscripts and scientific software
○ showcase how scholarly communication resources can be effectively, regularly and reliably exchanged across systems
○ apply the ResourceSync protocol to real-world use cases
○ show data synchronization across a cross-disciplinary network of repositories and between repositories and global added-value services
○ design and implementation of a scalable client solution for accessing CORE metadata and full texts

● Status: the SD activity is ongoing, a 4-month activity report has been submitted
○ Initial definition of the experimental design
■ experimental design finished - resulted in a plan of experiments to be conducted together with a list of evaluation metrics:
● OAI-PMH speed assessment across a variety of repository platforms
● evaluation of three different ResourceSync setups (standard, batch and materialised dump) on several datasets
● recall evaluation of OAI-PMH full-text harvesting and its benchmark against ResourceSync harvesting (a minimal OAI-PMH harvesting sketch is given at the end of this subsection)
■ experimental software developed and the first set of experiments conducted - updated, more scalable ResourceSync client15
■ experimented with, developed and tested new client-server support for a new approach developed by the SD, called ResourceSync Batch
■ challenges encountered resulted in a new synchronisation approach being proposed to the ResourceSync community
○ Currently not using the TextCrowd dataset yet
○ Technical issues:

■ Time scalability issues of ResourceSync in scenarios requiring the synchronization of large numbers of small files (typically metadata files) - solved by developing a new approach called ResourceSync Batch16.

■ Issues around creating a fair benchmarking environment, considering the effects of network latency, geolocation, data size, HTTP connection overhead, repository platform implementation differences, and differences in average resource size. These issues have been discussed and tackled in the SD experimental design by:

● Drawing comparisons only from synchronization conducted on the same data

15 https://github.com/oacore/rs-aggregator
16 https://blog.core.ac.uk/2018/03/17/increasing-the-speed-of-harvesting-with-on-demand-resource-dumps/


● Planning how to understand the impact of network latency on the results
● Analysing variance in response time and average resource size of repositories per repository platform (for example, EPrints, DSpace, OJS, etc.)
● Identifying representative use cases for the benchmarking exercise
○ Next steps:

■ working on a research paper reporting on the outcomes of the first batch of experiments

■ preliminary tests to be done on the TextCrowd dataset
■ as from the workplan:

● improve ResourceSync server libraries to address scalability of metadata synchronization (synchronisation issues with many small files)

● discuss synergies with other EOSC demonstrators
● develop software for benchmarking
● finalise the evaluation of experimental scenarios
● first draft of the demonstrator results in the form of a scientific publication
● finalising the experimental software
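For reference, the following is a minimal OAI-PMH harvesting sketch of the kind benchmarked against ResourceSync above, paging through ListRecords responses with resumption tokens; the endpoint URL is a placeholder and this is not the SD's benchmarking client.

    # Minimal OAI-PMH ListRecords harvester using only the standard library.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, metadata_prefix="oai_dc"):
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI_NS + "record"):
                yield record
            # Follow the resumption token until the repository stops returning one.
            token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    if __name__ == "__main__":
        # Placeholder endpoint; real repository endpoints differ per platform.
        for i, rec in enumerate(harvest("https://example.org/oai")):
            if i >= 10:
                break
            print(rec.find(f"{OAI_NS}header/{OAI_NS}identifier").text)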

2. Mining a large image repository to extract new biological knowledge about human gene function

● Scientific and Technical Objectives
○ establish the resources required to perform comprehensive machine learning analyses on these datasets (image-based genome-scale RNAi), along with a validity test of the approach and a demonstration of how a large cloud-based collection of published datasets can be reused for novel discovery
○ comprehensive machine learning analysis
○ dataset reuse
○ reusable infrastructure and analysis for generating value from published image data
○ demonstrate the use of the infrastructure for users to run their own analysis via the cloud on publicly available image data sets
○ show how image data can be reused in a research context
○ showcase the reuse of the results by the community, e.g. for searching images by similarity, implementing supervised machine-learning methods to mine the repository, etc.

● Status: the SD activity is ongoing; a 4-month activity report has been submitted
○ Currently, the example setup in the EMBASSY cloud that ingests images from the IDR and processes them is being scaled up to a ~200 vCPU deployment to test scalability
■ Tests were done to compare results with the format that was available (JPG) vs RAW images. The outcome was that results were correlated in most cases, which was acceptable for the current trial phase
■ creation of an LSF-like environment at EMBASSY using OpenLava
■ The SD has used this deployment to do small-scale (~10 vCPU) test runs of the workflow in this environment
■ When trying to scale up, the deployment showed some scalability issues around networking, which were resolved together with the shepherd
■ Tests are now ongoing on a large-scale deployment (~200 vCPUs)

○ Technical issues:
■ Images from the data source (IDR API) were not available in the correct format.


■ Issues with the initial set-up of the tenancy:
● Software installation: lack of documentation, hidden parameters.
● Network configuration.
● NFS set-up.
■ Issues with cloud deployment and orchestration:
● OpenLava was targeted by a DMCA takedown, resulting in loss of source code and online documentation.
● Hidden quotas in OpenStack: the number of ports is limited, preventing the use of an architecture with many small nodes.
● Firewall-related issues.
● Inconsistency in software dependencies between master and nodes.
■ Networking issues started to appear when scaling up the cloud deployment to >50 vCPUs - now solved

○ Next steps:
■ Work is being undertaken on the IDR side to provide raw image access through the API.
■ Explore integrative mining of the image-derived data
■ Start evaluating results
■ Final evaluation of the results
■ Writing paper and report
■ Exporting data to the IDR

3. VisIVO: Data Knowledge Visual Analytics Framework for Astrophysics

● Scientific and Technical Objectives
○ integration into the EOSCpilot e-infrastructure of a visual analytics environment based on VisIVO (Visualization Interface for the Virtual Observatory) and its module VLVA (ViaLactea Visual Analytics).

○ publishing a data analysis and visualization environment fully integrated with a relevant set of astrophysical datasets

● Status: the SD activity is ongoing; a 4-month activity report has been submitted
○ The astrophysics survey data of the Galactic Plane have been deployed to the EGI Data Resources (~400 GB)
○ The DCI_BRIDGE VM image used by the Science Gateway to run the scientific workflow in the EGI FedCloud infrastructure has been updated with the tools required by the SD and is published in the EGI AppDB: https://appdb.egi.eu/store/vappliance/visivo.sd.va

○ Technical issues:
■ Currently the local cloud infrastructures of the SD are not yet federated with EGI FedCloud; access to the EGI FedCloud infrastructure for computing has therefore been requested, as the SD needs access to federated resources for testing the Science Demonstrator.

○ Next steps:
■ Interested in reusing the WS-PGRADE/gUSE configuration used by the EPOS/VERCE SD and in using the B2FIND solution
● meetings have been planned between the SD team, shepherds, and experts of the solutions mentioned above
■ Sustainable storage of 0.5-1.5 TB in order to provide open access to astrophysics survey data
■ first release of tools as-a-service in the cloud framework


■ providing access to data in accordance with FAIR principles
■ testing with multiple simultaneous users
■ bug fixes, documentation, and workflow refinements, especially when exploiting cloud infrastructures

4. Switching on the EOSC for Reproducible Computational Hydrology by FAIR-ifying eWaterCycle and SWITCH-ON

● Scientific and Technical Objectives
○ integrate the top-down approach of the eWaterCycle project with the bottom-up approach of the SWITCH-ON one by:
■ combining lessons learned from SWITCH-ON and eWaterCycle
■ making steps towards reproducible, reusable, open, data-driven science in Hydrology
■ showcasing a modern approach to large-scale simulation-based science
○ challenges:
■ create FAIR versions of the data required for running the hydrological models
■ FAIR software (source code and ready-to-run software)

● Status: the SD activity is ongoing; a 4-month activity report has been submitted
○ Testing pipeline
○ CWL will be used as the interoperable language for the workflows.
○ OneData is being explored as a data repository.
○ Details:
■ work started on translating the current workflow into the CWL language
■ engaging intensively with OneData through other SDs and outside of the EOSC, in order to better understand the solution in which the SD is interested
○ Technical issues:

■ The Earth science community depends on NetCDF, a widely used metadata / data format. What is missing from the Earth Science community is an authoritative dataset repository that is widely used. Data right now is still divided into many separate “pillars” or private repositories.

■ Need for basic storage that is common across research groups / institutes to get bits from one place to the other in a standardised way. OneData will probably play a big role in achieving this.

○ Next steps:
■ Obtain accounts for the compute infrastructure used.
■ Set up OneData on all compute infrastructure.
■ Work on a Hello-World BMI (Basic Model Interface)17 container example to check if the compute infrastructure will support the use case (a minimal BMI-style sketch is given at the end of this subsection).
■ Create an overview and attainability assessment for the input-data FAIR archive, per required dataset.
■ Set up the CWL reference implementation on all compute infrastructure.
■ Create BMI-based containers for the eWaterCycle model.
■ Create CWL-based containers for tools.
■ Start collection of input data into the archive.
■ Create a BMI “driver” application to run models via BMI.
■ Select and/or set up a FAIR archive for input and output data.

17 http://csdms.colorado.edu


■ Create a CWL workflow for deterministic simulation.
■ Community building (outreach) at the EGU General Assembly.
■ Create a BMI-based container for the HYPE model.
■ Add container support to OpenDA.
■ Create a CWL workflow for forecast (including Data Assimilation).
■ Attain a CWL implementation better suited to the compute infrastructure used (support for parallel running, etc.), possibly based partly on Cylc.
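As a sketch of the Hello-World BMI container example planned above, the following toy model exposes BMI-style methods; the single-store recession model and its values are invented for illustration and are not eWaterCycle code.

    # Hedged sketch of a "hello world" model with a BMI-style interface.
    # Method names follow the Basic Model Interface convention; the model itself
    # (a single linearly decaying water store) is an assumption for illustration.
    class HelloBmiModel:
        def initialize(self, config_file=None):
            # A real model would read its configuration file here.
            self.time = 0.0
            self.storage = 100.0      # arbitrary initial water storage [mm]

        def update(self):
            # One daily time step: simple linear recession of the store.
            self.storage *= 0.95
            self.time += 1.0

        def get_current_time(self):
            return self.time

        def get_value(self, name):
            if name == "storage":
                return self.storage
            raise KeyError(name)

        def finalize(self):
            pass

    if __name__ == "__main__":
        model = HelloBmiModel()
        model.initialize()
        for _ in range(5):
            model.update()
        print(model.get_current_time(), model.get_value("storage"))
        model.finalize()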

5. VisualMedia: a service for sharing and visualizing visual media files on the web

● Scientific and Technical Objectives
○ provide researchers with a system to publish on the web, visualize and analyze images and 3D models in a common workspace, enabling sharing, interoperability and reuse:
■ web publication and visualization module for high-resolution images and 3D models
■ thumbnail-based browsing interface
■ customization of the presentation through the selection of different browsing systems for 3D models
○ porting of the Visual Media Service to a cloud platform with more sophisticated support for a number of issues (user authentication, control of the allocation/use of computing and storage resources in real time)
○ metadata enrichment based on automatic processing of related documentation
○ improvement of visual (2D and 3D) data visualization
○ increase in collaborative research
○ educate Cultural Heritage (CH) researchers in moving towards the incorporation of virtual research environments in research methodology and in experiencing the use of cooperative visualization

● Status: the SD activity is ongoing; a 4-month activity report has been submitted
○ Working with the CNR team in order to run the workflow under the D4Science Infrastructure; the activity involves three steps:
■ Authentication - incorporated the authentication feature of D4Science, including support of the D4Science and the Google authentication features
■ Storage - personal storage management; implemented a few small services connected with the now available user data
● worked on the data browsing component of the Visual Media Service, to provide an option to access / visualize just the proprietary data of the specific logged-in user
● included the possibility to load visual media data (data files) from the personal storage area of any specific user (from the D4Science user workspace)
■ Scalability
○ Technical issues:

■ none found
○ Next steps:
■ partial porting of the Visual Media Service to the EOSC platform, endorsing a user authentication component provided by the D4Science infrastructure
■ delivery of Visual Media Service - OESC v.1, after development work on the enhanced functionalities


■ Report on the results of the first assessment phase, delivery of Visual Media Service - OESC v.1.1

■ Final report on the results of the Demonstrator, delivery of the final release of Visual Media Service - OESC

C. CONCLUSIONS AND FUTURE WORK

In this report we highlight the status of all Science Demonstrator testbeds and activities, most of them in line with what was planned initially, some of them requiring extensions in order to finalise the work. The testbeds deployed have been performing as expected, and the issues encountered so far were solved together with the respective SD shepherds, service and infrastructure providers. From the activities carried out so far, the SD teams and their shepherds also highlighted new requirements, both on the services used and on the infrastructures made available to them. All of them will be reported in the following deliverable, D6.7 “Revised requirements of the interoperability testbed”.

The results of the work on the SDs in the following months, and of the work on architecture and data interoperability with impact on the testbeds made available for the SDs, will be presented in the final WP6 project deliverable, D6.11 “Final Interoperability Testbed report”, which will also contain considerations on the evaluation of the readiness levels of the technical solutions requested and deployed on the various testbeds.


ANNEXES

A.1. GLOSSARY

Many definitions are taken from the EGI Glossary (https://wiki.egi.eu/wiki/Glossary). They are indicated by (EGI definition).

Term Explanation

e-infrastructures (definition of the Commission High Level Expert Group on the European Open Science Cloud in their report): this term is used to refer in a broader sense to all ICT-related infrastructures supporting ESFRIS (European Strategy Forum on Research Infrastructures) or research consortia or individual research groups, regardless of whether they are funded under the CONNECT scheme, nationally or locally.

High Performance Computing (HPC) (EGI definition) A computing paradigm that focuses on the efficient execution of compute-intensive, tightly-coupled tasks. Given the high parallel communication requirements, the tasks are typically executed on low-latency interconnects, which makes it possible to share data very rapidly between a large number of processors working on the same problem. HPC systems are delivered through low-latency clusters and supercomputers and are typically optimised to maximise the number of operations per second. The typical metrics are FLOPS, tasks/s, I/O rates.

High Throughput Computing (HTC) (EGI definition) A computing paradigm that focuses on the efficient execution of a large number of loosely-coupled tasks. Given the minimal parallel communication requirements, the tasks can be executed on clusters or physically distributed resources using grid technologies. HTC systems are typically optimised to maximise the throughput over a long period of time and a typical metric is jobs per month or year.


National Grid Initiative or National Grid Infrastructure (NGI)

(EGI definition) The national federation of shared computing, storage and data resources that delivers sustainable, integrated and secure distributed computing services to the national research communities and their international collaborators. The federation is coordinated by a National Coordinating Body providing a single point of contact at the national level and has official membership in the EGI Council through an NGI legal representative.

Virtual Organisation (VO) A group of people (e.g. scientists, researchers) with common interests and requirements, who need to work collaboratively and/or share resources (e.g. data, software, expertise, CPU, storage space) regardless of geographical location. They join a VO in order to access resources to meet these needs, after agreeing to a set of rules and Policies that govern their access and security rights (to users, resources and data).

AAI Authentication and Authorization Infrastructure

CMS Content Management System

EDMI EOSC Dataset Minimum Information

EOSC The European Open Science Cloud

FAIR Findable, Accessible, Interoperable and Reusable

RDA Research Data Alliance

RIs Research infrastructures

SD Science Demonstrator