deliverable 5.3 project id 654241 - phenomenal

Deliverable 5.3

Project ID 654241

Project Title A comprehensive and standardised e-infrastructure for analysing medical metabolic phenotype data

Project Acronym PhenoMeNal

Start Date of the Project 1st September 2015

Duration of the Project 36 Months

Work Package Number 5

Work Package Title Operations and Maintenance of PhenoMeNal GRID/Cloud

Deliverable Title D5.3 Operational grid/cloud allowing for combining data, tools, and compute VMIs. Most services available. Functional integration with EGI federated cloud/grid for compute resources. Demonstrated analysis on private/sensitive data in secure environment.

Delivery Date M24

Work Package leader UU

Contributing Partners UU, IPB, EMBL-EBI, ICL, CRS4

Authors Ola Spjuth, Pablo Moreno, Pierrick Roger, Etienne Thévenot, Kristian Peters, Steffen Neumann, Reza Salek, Christoph Steinbeck, Ken Haug, Gianluigi Zanetti, Pedro de Atauri, Tim Ebbels, Jianliang Gao, Payam Emami, Noureddin Sadawi, Anders Larsson, Marta Cascante, Vitaly Selivanov


Abstract: This deliverable reports on the results of the first 24 months’ work towards an operational PhenoMeNal VRE with a number of usable services. The report includes the description of the architecture, development and testing frameworks and procedures. We also present the M24 PhenoMeNal VRE release (codename Bucetin) and a set of demonstrators in the form of Workflows that can be run on this release. A report on the ongoing integration with EGI federated cloud/grid as well as demonstrated analysis on private/sensitive data in secure environments is also presented. The conclusion includes a listing of general e-infrastructure contributions of the project.

Table of Contents

1 EXECUTIVE SUMMARY
2 CONTRIBUTION TOWARDS PROJECT OBJECTIVES
3 Detailed Report on the Deliverable
  3.1 Introduction and overview
  3.2 PhenoMeNal architecture
    3.2.1 Deployment
    3.2.2 KubeNow
    3.2.3 Gateway/Portal
    3.2.4 Workflow orchestration with Galaxy
    3.2.5 Other user interfaces and workflow systems
  3.3 Development and testing
    3.3.1 Infrastructure testing
    3.3.2 Container development
    3.3.3 Container Testing
  3.4 EGI support
  3.5 Bucetin release
  3.6 Demonstrators
    3.6.1 LC-MS/MS Metabolite Annotation
    3.6.2 Univariate/multivariate statistics and search for biomarkers
    3.6.3 NMR workflow
  3.7 Analysis on private/sensitive data in secure environment
    3.7.1 Imperial College London
    3.7.2 Uppsala University Hospital
  3.8 General e-infrastructure contributions
  3.9 Risks
4 Delivery and Schedule
5 Conclusion

1 EXECUTIVE SUMMARY

The PhenoMeNal project develops a Virtual Research Environment (VRE) for interoperable and scalable metabolomics analysis. This deliverable reports on the results of the first 24 months’ work towards an operational PhenoMeNal VRE with a number of usable services, including a description of the architecture and of the development and testing frameworks and procedures. We also present the M24 PhenoMeNal VRE release (codename Bucetin) and a set of demonstrators in the form of workflows that can be run with this release. A report on the ongoing integration with the EGI federated cloud/grid, as well as on demonstrated analysis of private/sensitive data in secure environments, is also provided, along with a listing of the general e-infrastructure contributions of the project.

2 CONTRIBUTION TOWARDS PROJECT OBJECTIVES

The deliverable contributes towards the achievement of the following project objectives:

● Objective 5.1: Establishment of the PhenoMeNal e-infrastructure
● Objective 5.2: Operations and maintenance of the PhenoMeNal VRC portal
● Objective 5.3: Maintenance and provisioning of the PhenoMeNal services in the PhenoMeNal e-infrastructure

3 Detailed Report on the Deliverable

3.1 Introduction and overview

The European e-Infrastructure landscape contains several key projects that are either directly or indirectly relevant for PhenoMeNal (see Figure 1). There are also commercial actors, such as public cloud providers, which offer e-infrastructure services that are of importance for scientists. However, there is currently a gap between the researchers who want to analyze data and the current e-infrastructure providers: for most scientists, taking advantage of the e-infrastructure offerings is technically too complex. PhenoMeNal aims to bridge this gap by providing easily accessible, scalable Virtual Research Environments (VREs) that hide the complexity of e-infrastructure under user-friendly interfaces.


Figure 1: Overview of the e-infrastructure landscape. PhenoMeNal aims to bridge the gap between researchers and selected available e-infrastructure projects, available primarily in Europe, using Virtual Research Environments (VREs).

PhenoMeNal focuses on VREs for interoperable and scalable metabolomics analysis. End-users, such as researchers and research teams, educators, SMEs, and any other type of user, will be able to create, on demand and through a simple user interface, an environment of tools, services, and data supporting their research needs. The hardware setup and software deployment required to operate these facilities are completely transparent to the users of the VRE, who can hence focus on the analysis and not the technicalities (see Figure 2). PhenoMeNal VREs are built to run on a private computer (laptop, workstation, server) as well as with any Infrastructure-as-a-Service provider, including public cloud providers (e.g. Google Cloud Platform, Amazon Web Services, Microsoft Azure) and academic computing centres.


Figure 2: Responsibilities when carrying out contemporary metabolomics data analysis. (Left:) Today’s situation: Scientists are responsible for everything, including the computer hardware, installing all necessary software, and carrying out the actual analysis. All execution is limited by the resources in the single computer. (Right:) The PhenoMeNal approach: Software tools are available as containers without the need for installations, with data in agreed-upon interoperable file formats. The VRE can be started on single computers or on cloud resources, and the scientists benefit from only needing to deal with the analysis as the technical implementations are handled by the VRE.

PhenoMeNal has a release cycle of 6 months, and the latest release is Bucetin, which was released on 2017-08-16. See more in the section Bucetin release below.

3.2 PhenoMeNal architecture

The PhenoMeNal VRE is designed as a microservice architecture, with services being implemented as VMIs and software containers (we use Docker in PhenoMeNal for containers). Containers can easily be deployed without manual installation and dependency management, and can, in an elastic IT environment, scale out to run analyses in parallel on multiple compute nodes (see also the recent paper “Software Simplified” by A. Silver, Nature 546, 173–174, 2017 [1]). A key objective with VREs is that all technical details are transparent for the researcher.

1 https://www.nature.com/news/software-simplified-1.22059


PhenoMeNal makes use of Kubernetes (by Google) for container orchestration. The main workflow system in PhenoMeNal is Galaxy, which also comes with a GUI for graphical workflow authoring and execution. PhenoMeNal also supports other types of user interfaces and workflow systems, including Luigi and Jupyter. The majority of services (tools) for metabolomics in PhenoMeNal are implemented as Docker containers. See section Bucetin release for a list of available containers and workflows. Apart from the metabolomics-services, the e-infrastructure developments in PhenoMeNal are completely general and can serve as an example and template on which to build VREs.

3.2.1 Deployment

The way to deploy a functional PhenoMeNal VRE consists of two steps:

1) Provision a Kubernetes cluster. In PhenoMeNal we use mainly KubeNow for this step (see below) but there are alternatives; e.g. GCP has a built-in solution to provision Kubernetes clusters.

2) Deploy PhenoMeNal services in Kubernetes. For this process we use the Helm package system and Ansible. The goal is to move towards a single Helm chart in the M36 release (Dalcotidine).

We have developed two alternative strategies to make the deployment procedure as simple as possible:

A. Deployment via the PhenoMeNal client

This consists of a simple deploy script:

>phenomenal.sh deploy gcp/aws/ostack/kvm

The entire deployment environment comes as a Docker container, so the only dependency that needs to be installed on the computer is Docker. We currently support Mac OS X and Linux for the client; Windows support is planned for the M30 release (Cerebellin).

B. Deployment via the PhenoMeNal portal/gateway

This is a Web GUI that runs the same deploy script in the background. The gateway only supports gcp/aws so far; OpenStack support is planned for the M30 release (Cerebellin). The portal/gateway is developed in WP6.
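For scripted use, the client invocation shown under option A can be wrapped in a few lines of Python. This is a minimal sketch, assuming that `phenomenal.sh` sits in the working directory and accepts exactly the four provider keywords shown in the command above. The helper names `build_deploy_command` and `deploy` are hypothetical and not part of the PhenoMeNal client.

```python
import subprocess

# Provider keywords taken from the deploy command shown in the text.
SUPPORTED_PROVIDERS = ("gcp", "aws", "ostack", "kvm")


def build_deploy_command(provider, script="./phenomenal.sh"):
    """Return the argument list for a deployment on the given provider.

    `script` is an assumed path to the client script.
    """
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    return [script, "deploy", provider]


def deploy(provider):
    """Run the deploy script and raise if it exits with a non-zero status."""
    subprocess.run(build_deploy_command(provider), check=True)
```

Validating the provider keyword before calling the script means that a typo fails immediately rather than partway through a cloud deployment.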

3.2.2 KubeNow

Different deployment strategies have been evaluated (UU, EMBL-EBI) in the context of Infrastructure as Code (IaC), towards a streamlined provisioning of the PhenoMeNal VRE on different cloud providers. We have worked so far with MANTL (UU), Kargo (EMBL-EBI) and KubeNow (UU) to deploy our scalable infrastructure layer. Automated deployment is a complex and critical task for PhenoMeNal, and as such we reduced risk by trying different alternatives. During the project, the MANTL project was halted (funding by Cisco ended). Other Kubernetes deployment frameworks do not yet fully support instantiating and tearing down Kubernetes clusters on a regular basis; they focus instead on establishing long-running installations, which is more common in industry. This required us to develop a novel deployment framework (UU), KubeNow, which allows users to rapidly deploy, scale, and tear down Kubernetes clusters on public and private cloud systems (e.g., AWS, GCE and OpenStack) as well as local clusters (Vagrant/VirtualBox or libvirt/KVM). KubeNow is a thin layer on top of existing established software (Terraform, Packer, Ansible and kubeadm); see Figure 3 below. Following this approach we provide a simple, lightweight tool for Kubernetes provisioning, which is a critical tool for PhenoMeNal.

By deploying a KubeNow cluster the user will get:

● A Kubernetes cluster up and running in less than 10 minutes (deployed with Terraform and provisioned with kubeadm);
● Weave networking;
● Traefik HTTP reverse proxy and load balancer;
● Cloudflare dynamic DNS integration;
● GlusterFS distributed file system managed via Heketi (remote deployment) or mounted local volume/NFS (local deployment);
● HTTPS via Cloudflare.

On top of the deployed Kubernetes cluster, KubeNow provisions Galaxy through a Helm chart (EMBL-EBI, CRS4, CEA, UU). The chart can also be used independently to run our Galaxy installations on Kubernetes clusters deployed through other means, which diversifies the usage scenarios. For instance, our developers use the same Galaxy Helm deployment that is used for the production environment on VREs to run Galaxy on the minikube Kubernetes development environment. It is also possible to run VREs locally via KubeNow using libvirt and KVM.


Figure 3. KubeNow delivers Kubernetes clusters with dynamic DNS integration, networks, load balancing, and a distributed file system. KubeNow wraps around existing industry-standard tools and is used to deploy PhenoMeNal VREs. KubeNow is called by the PhenoMeNal Portal developed in WP6 to provide automated web deployment of VREs.

3.2.3 Gateway/Portal

The PhenoMeNal VRE Portal/Gateway is the first point of access that external users have for deploying PhenoMeNal VREs on top of the different cloud providers. The VRE Portal allows users to authenticate with their own external credentials through the Elixir Single Sign On (SSO) system, providing access through their institutional academic accounts or third-party identity providers such as Google, Orcid or LinkedIn. Once logged in, the user can proceed to deploy a PhenoMeNal CRE on OpenStack, Google Compute Engine (GCE) or Amazon Web Services (AWS). Before being allowed to deploy, the user is presented with Terms of Use and documentation regarding Ethical, Legal and Social Implications (ELSI) that need to be acknowledged. Besides deploying, the VRE Portal allows users to administer existing running PhenoMeNal VREs on their cloud tenancies.

Deployment of VREs through the portal happens through two other components: the EBI Cloud Portal (developed by the EBI Technology and Science Interface group, TSI) and KubeNow. When a user requests the deployment of a PhenoMeNal VRE, the VRE Portal communicates this request to the EBI Cloud Portal through the latter's REST API. The EBI Cloud Portal has the definition of the KubeNow deployment used by PhenoMeNal as a registered deployment repository, and triggers its deployment as a PhenoMeNal VRE on the desired cloud provider. For more flexibility in the deployment, such as setting the number of machines and other options, an advanced user can access the EBI TSI Portal directly to configure their own deployment of KubeNow on the available cloud providers, check logs, etc. The EBI Cloud Portal uses the same Elixir SSO login facilities as the PhenoMeNal VRE Portal, and as such any user accessing the PhenoMeNal VRE Portal can also access the EBI Cloud Portal (and see their active deployments there as well). Figure 4 shows the interaction between these components.

Given the results of our usability (UX) rounds, it was decided that the user-facing name for Virtual Research Environment (VRE) would be Cloud Research Environment (CRE), so the reviewer will encounter that name on slides and the web portals.

Figure 4: Users authenticate with the PhenoMeNal gateway using Elixir AAI, and the gateway makes use of the EMBL-EBI Cloud Portal to run KubeNow and instantiate the PhenoMeNal VRE on a virtual infrastructure provisioned on a supported IaaS provider. The use of EMBL-EBI Cloud Portal implies that PhenoMeNal will run on future providers supported by ELIXIR, such as Indigo DataCloud, EGI, and ELIXIR Cloud Platform.

3.2.4 Workflow orchestration with Galaxy

Apart from lifting a complete virtual infrastructure and setting up a Kubernetes cluster, users of the PhenoMeNal VRE need interfaces and tools to work with the containerized software applications developed in WP9. Galaxy (https://galaxyproject.org/) is a workflow environment tool developed by a large Bioinformatics community, mostly by people working in the context of Next Generation Sequencing (NGS) tools, but lately also including communities in the Proteomics and Metabolomics areas, such as Galaxy-P, Workflow4Metabolomics and Galaxy-M. As a workflow environment, it allows researchers with no programming ability to concatenate common bioinformatics tools to create pipelines or workflows.

The Galaxy workflow environment was extended to support scheduling jobs as Docker containers on a Kubernetes cluster (EMBL-EBI, CRS4, CEA), and the contribution was pushed upstream to the Galaxy project. Galaxy was integrated into the main deployment process of the PhenoMeNal VRE, which means that users deploying a private PhenoMeNal VRE immediately get a working instance of Galaxy that is secured for their own private usage only. Figure 5 shows how Galaxy interacts with Kubernetes through our contribution: a job flows from a user request, through the workflow environment, to the container orchestrator cluster (Kubernetes), and its results are provided back to the workflow environment. This process has been described in earlier deliverables, but is repeated here to show the entire process.

Figure 5. Diagram shows how Galaxy interacts with the Kubernetes container orchestrator, while running inside the same container orchestrator.

3.2.5 Other user interfaces and workflow systems

Luigi (https://github.com/spotify/luigi) is a workflow engine developed by Spotify Inc. to aid data analytics. Luigi was extended to support scheduling jobs on Kubernetes clusters consisting of containers (UU). This contribution has been pushed upstream to Luigi (http://luigi.readthedocs.io/en/stable/api/luigi.contrib.kubernetes.html).

Jupyter Notebook is a system for combining text (including e.g. mathematical equations) and code in an easy-to-read document that renders in a web browser (see Figure 5 for a screenshot). The code can be run directly from the notebook and display textual or graphical output, and kernels (i.e., backends) exist for many different programming languages and data analytics frameworks. Jupyter notebooks have been integrated in PhenoMeNal as a Helm chart and are available as one of the standard components.

3.3 Development and testing

3.3.1 Infrastructure testing

KubeNow follows the Infrastructure as Code (IaC) paradigm, which is the process of deploying and provisioning the computing resources through machine readable definition files. This paradigm has the benefit of automating the deployment, and also allows one to define the infrastructure as a set of files that have enough information to replicate the infrastructure over time and at different data centers. This programmatic approach to infrastructure definition makes it possible to apply the same best practices that are used for software testing to infrastructure testing, including version control. In practice, every time we make a commit to the repository that holds the KubeNow infrastructure definition, the continuous integration (CI) process is triggered and the infrastructure is automatically lifted and tested on the supported cloud providers.
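The deploy, test and tear-down cycle that the CI runs on each commit can be sketched as follows. This is an illustration under stated assumptions rather than the actual KubeNow CI configuration: the three callables `deploy`, `smoke_test` and `teardown` are hypothetical hooks that each take a provider name.

```python
def run_infrastructure_tests(providers, deploy, smoke_test, teardown):
    """Deploy, test, and tear down a cluster on each provider.

    All three callables are hypothetical hooks taking a provider name;
    `smoke_test` returns True on success.
    """
    results = {}
    for provider in providers:
        try:
            deploy(provider)
            results[provider] = bool(smoke_test(provider))
        except Exception:
            # A failed deployment counts as a failed test for that provider.
            results[provider] = False
        finally:
            # Always release cloud resources, even after a failure.
            teardown(provider)
    return results
```

Running the tear-down in a `finally` clause reflects the main practical concern of testing real infrastructure: a failed test must not leave paid cloud resources running.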

3.3.2 Container development

In PhenoMeNal, services are implemented as containers and made available for analysis primarily through workflow environments (e.g. Galaxy). Containers are written and maintained by individual tool developers, with source published in code repositories (e.g. GitHub). The PhenoMeNal Continuous Integration (CI) system pulls the source code, builds the containers, tests the containers, and, if the tests pass, pushes them to container repositories such as Docker Hub and the PhenoMeNal private container repository (see Figure 6). Since June 2017, released tag containers are also pushed to BioContainers, a community-driven repository of Life Sciences software containers. From the CRE, these containers are made available for download and use from within the workflow engine (such as Galaxy) and can be scheduled inside the Kubernetes cluster.
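The pull, build, test and push sequence described above can be sketched as an ordered list of commands that stops at the first failure. This is a simplified illustration, not the project's Jenkins configuration: the function names, the `workdir` default and the `test_script` path are hypothetical, and the sketch assumes that the container ships its test logic as an executable script.

```python
def ci_commands(repo_url, image, test_script, workdir="src"):
    """List the shell commands for one CI run, in execution order."""
    return [
        ["git", "clone", repo_url, workdir],
        ["docker", "build", "-t", image, workdir],
        # Run the test script shipped inside the image instead of the tool.
        ["docker", "run", "--rm", "--entrypoint", test_script, image],
        ["docker", "push", image],
    ]


def run_ci(commands, run):
    """Run commands in order and stop at the first failure, so that an
    image whose tests fail is never pushed. `run` returns an exit status."""
    for command in commands:
        if run(command) != 0:
            return False
    return True
```

In real use, `run` could be `lambda cmd: subprocess.run(cmd).returncode`; passing it in as a parameter keeps the sequencing logic testable without Docker.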


Figure 6: Overview of the continuous development and operation in PhenoMeNal.

PhenoMeNal provides guidelines for the development of our containers [2], and we try to be as consistent as possible with the guidelines from BioContainers. Once a container is ready, PhenoMeNal developers are expected to add it to the PhenoMeNal CI [3]. Every few weeks, and particularly before every release, developers are asked to make release versions of their containers. Besides the PhenoMeNal container registry, release container images are also deposited automatically in the BioContainers DockerHub repository. The “Tool container release process” is formalized and available on the PhenoMeNal GitHub [4]. All our guidelines for contributing tools to PhenoMeNal are publicly available [5]; they are generally followed by our developers but can be used by anyone interested in using the infrastructure for their own needs.

2 https://github.com/phnmnl/phenomenal-h2020/wiki/Dockerfile-Guide
3 https://github.com/phnmnl/phenomenal-h2020/wiki/Jenkins-Guide
4 https://github.com/phnmnl/phenomenal-h2020/wiki/Tool-container-release-process
5 https://github.com/phnmnl/phenomenal-h2020/wiki/How-to-make-your-software-tool-available-through-PhenoMeNal


3.3.3 Container Testing

Testing in the field of containers is not well defined within the software engineering community, and in order to fill this void we have contributed to defining and implementing new testing strategies. Through our Container Testing workshop (Nov 14-15, 2016 at EMBL-EBI in Hinxton) we have established best practices and a scheme for automatically testing containerised tools within our continuous integration system (EMBL-EBI, CEA). Figure 7 shows the flow through the automatic testing scheme for containers with real data within a Kubernetes cluster. Running within this environment allows testing to be as data-intensive as needed. Developers only need to provide the testing and data download logic within a single file in the container [6], which allows testing to be customized for each container.

During the same meeting, partners also presented a method to test entire workflows automatically within Galaxy (CRS4). This is implemented by Workflow Testing 4 Galaxy (WFT4Galaxy, https://github.com/phnmnl/wft4galaxy). We are now in a position to not only test tools individually within their containers, but run integration tests of our tools as they are involved in different workflows. Both tool testing and workflow testing are executed in the same container orchestration context, Kubernetes, that any PhenoMeNal VRE deployment uses.

Figure 7: Flow designed for automatically testing containers using real datasets through the CI inside a Kubernetes cluster, to replicate as closely as possible the environment that a tool would encounter in real-case usage. The testing and data download logic is stored in a shell file inside the container in a standard location. This shell file is not executed during container build time, but only during data-related testing time.

6 https://github.com/phnmnl/phenomenal-h2020/wiki/Testing-Guide-Proposal-3


3.4 EGI support

Since the time of writing the grant application, the main focus in the cloud computing area has moved from virtual machine images (VMIs) to containers (such as Docker). The PhenoMeNal project has achieved full EGI OpenStack support and is aligned with the communities in terms of focusing on container support. Since the first days of PhenoMeNal we have been in close contact with EGI, which is actively working on container support for the Federated Cloud, but this is not yet ready at the time of writing this report. PhenoMeNal will be capable of working with EGI’s federated container support once EGI makes it available. We are working together with EGI developers and hope to be able to implement a working prototype by the PhenoMeNal M30 release (Cerebellin).

3.5 Bucetin release

The 2017-08 release of PhenoMeNal, also known as “Bucetin”, was released on August 16th. It is an upgrade of the previous 2017-02 beta release: it has a richer set of tools, is no longer considered beta, and all tools had to pass much more stringent automated testing to be included. We have improved the automated testing of tools, containers and workflows, and have introduced usability/user testing.

One month prior to the release we built a release candidate (RC). This RC was deployed to our test environment [7], which allowed contributors to try the new version before it went live as the main public instance and on cloud deployments. From the moment we released the RC we only allowed hotfixes, and focused on testing and on improving the documentation of each of the Galaxy tools and of the Git repositories of each container. A similar approach was used for the cloud deployment portal, for which a release candidate was also pre-released [8]. As part of the release, workflow names were also normalized with naming patterns following recommendations of the SAB and other external users.

In the weeks before the release we conducted several usability/testing sessions, in which we assessed the quality and usability of the services. The testing was done with the clinician/technician in mind: users who usually have limited knowledge about high-performance/scalable cloud solutions like PhenoMeNal. The users participating in these sessions had no prior experience with any of the services that PhenoMeNal offers. The sessions usually lasted about 1-2 hours, during which the user was given several 10-20 minute tasks. During these sessions an observer was present to take notes and guide the user where needed. After each task the observer and user discussed the outcome and reviewed the notes, and the user was asked for feedback. Preparations for the sessions were supervised by usability expert Paula de Matos, who will do a heuristic usability scan and conduct external usability sessions in Q4 of 2017, after the 2017-08 release. More about the release can be found in the release notes [9].

7 https://publicdev.phenomenal-h2020.eu/
8 https://portaldev.phenomenal-h2020.eu/

3.6 Demonstrators

Several of the developed workflows, which combine different tools in a reproducible manner, are featured as part of the Bucetin release and are available both on the public Galaxy instance and in newly deployed PhenoMeNal CRE instances. They have been selected to cover different areas of typical metabolomics data analysis tasks. Their development was part of the WP9 tasks; this work is also reported in D1.4.4:

● Fluxomics - 13C traced MS fluxomics data analysis. (tutorial [10] / video [11])
● MS-MetFrag-XCMS - annotates molecules from compound (metabolite) databases to MS/MS (tandem mass spectrometry) spectra. (tutorial [12] / video [13])
● Univariate/multivariate statistics and search for biomarkers - statistics for exploratory data analysis, prediction, and feature selection. (tutorial [14] / video [15])
● NMR1d - 1D NMR Workflow. (tutorial [16] / video [17])

9 https://github.com/phnmnl/phenomenal-h2020/wiki/Release-notes-2017-08
10 https://portal.phenomenal-h2020.eu/help/fluxomics-workflow
11 https://portal.phenomenal-h2020.eu/help/fluxomics-workflow
12 https://portal.phenomenal-h2020.eu/help/MS-MetFrag-XCMS-Workflow
13 https://www.youtube.com/watch?v=V5M-dcE3qz8
14 https://portal.phenomenal-h2020.eu/help/Sacurine-statistical-workflow
15 https://www.youtube.com/watch?v=ABW9982s5gY&feature=youtu.be
16 https://portal.phenomenal-h2020.eu/help/NMR1d-Workflow
17 https://portal.phenomenal-h2020.eu/help/NMR1d-Workflow

3.6.1 LC-MS/MS Metabolite Annotation

This workflow concerns MS/MS compound spectra extraction, annotation and identification using XCMS, CAMERA and MetFrag. Metabolite identification in clinical studies is a crucial step when trying to understand e.g. the course of a disease at the metabolomics level. The MetFrag workflow takes a first step in this direction as it annotates molecules from compound (metabolite) databases to MS/MS (tandem mass spectrometry) spectra. This annotation is based on the mapping of in silico generated fragments to the experimental spectra and scoring of these mappings based on different criteria. Details on the workflow will be reported in D9.4. The continuous integration results are available on Jenkins [18].
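To make the idea of mapping predicted fragments onto a measured spectrum concrete, the sketch below scores a candidate by the fraction of spectrum intensity that its fragments explain within a mass tolerance. This is a deliberately simple stand-in and not MetFrag's scoring function, which this report does not describe. The function name, the ppm tolerance default and the intensity-fraction score are all assumptions made for illustration.

```python
def explained_intensity(fragment_mzs, peaks, tolerance_ppm=10.0):
    """Fraction of total peak intensity matched by predicted fragments.

    fragment_mzs: predicted fragment m/z values for one candidate.
    peaks: measured spectrum as (mz, intensity) pairs.
    A peak counts as explained if any fragment lies within the relative
    tolerance (in parts per million) of the peak's m/z.
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    explained = 0.0
    for mz, intensity in peaks:
        window = mz * tolerance_ppm / 1e6
        if any(abs(mz - fragment) <= window for fragment in fragment_mzs):
            explained += intensity
    return explained / total
```

Candidates from a compound database could then be ranked by this score, for example with `sorted(candidates, key=lambda c: explained_intensity(c, peaks), reverse=True)`, where each candidate is a list of predicted fragment masses.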

3.6.2 Univariate/multivariate statistics and search for biomarkers

This workflow focuses on statistical analysis: it allows users to explore the datasets, find variables of interest and build predictive models. It is based on the Workflow4Metabolomics W4M00001 history [19]. Statistical analysis is the next key step after metabolite annotation. It makes it possible to detect single variables or groups of variables that are of particular interest for the study. This workflow is composed of three modules:

● Univariate analysis.
● Multivariate analysis.
● Biosigner (to find significant signatures).

Continuous integration results are available on Jenkins [20].

3.6.3 NMR workflow

The 1D NMR workflow performs the processing of 1D NMR experiments from raw data to the data matrix required for visualisation and statistics. The first tool imports NMR data from a MetaboLights study of choice, which is then converted to nmrML, an open source raw format. The rnmr1d tool then runs the Fourier transformation (FFT), zero-filling, line broadening, phase correction and baseline correction. These steps are considered the pre-processing steps. After those steps, the alignment and bucketing of spectra takes place. The data matrix file is produced as output and then fed into the following step, which produces a stacked plot with all spectra.
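Three of the pre-processing steps named above (line broadening, zero-filling and Fourier transformation) can be illustrated in pure Python. This is a teaching sketch and not the rnmr1d implementation: the function names are hypothetical, the order of operations is the conventional one and is assumed rather than taken from the tool, the transform is a naive O(n²) DFT rather than an FFT, and phase and baseline correction are omitted.

```python
import cmath
import math


def line_broaden(fid, lb_hz, dwell_s):
    """Exponential apodisation: scale point t by exp(-pi * lb * t * dwell)."""
    return [x * math.exp(-math.pi * lb_hz * t * dwell_s)
            for t, x in enumerate(fid)]


def zero_fill(fid, size):
    """Pad the free induction decay (FID) with zeros up to `size` points."""
    return list(fid) + [0j] * (size - len(fid))


def dft(signal):
    """Naive discrete Fourier transform, adequate for small examples."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal))
        for k in range(n)
    ]


def process_fid(fid, lb_hz, dwell_s, size):
    """Line broadening, zero-filling, then Fourier transformation."""
    return dft(zero_fill(line_broaden(fid, lb_hz, dwell_s), size))
```

Zero-filling a 16-point signal to 32 points doubles the number of points in the spectrum, so a resonance that would fall in bin 3 of the unpadded transform appears in bin 6 of the padded one.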

3.7 Analysis on private/sensitive data in secure environment

18 http://phenomenal-h2020.eu/jenkins/view/C.-Integration%20tests/job/xcms-camera-metfrag-workflow-test/
19 http://workflow4metabolomics.org
20 http://phenomenal-h2020.eu/jenkins/view/%20C.-%20Integration%20tests/job/sacurine-workflow-test/

3.7.1 Imperial College London

At Imperial College London (ICL), an instance of PhenoMeNal has been installed on a local server. The objective of the local installation is to test the feasibility of running the PhenoMeNal infrastructure behind an institutional firewall in order to maintain privacy and security for sensitive data. ICL installed and used a local instance of PhenoMeNal on a server with the following specs: 40 cores with 2 threads per core, giving 80 virtual CPUs, and 1 TB RAM. This server sits behind the ICL firewall and VPN, which means it is only accessible from within the ICL local network or via a remote connection to the ICL VPN. Even after remotely connecting to the ICL VPN, or from inside ICL, the server has a strict access policy: one can only connect to it via an SSH key or a strong password. In addition, the server is physically located in an off-campus high-security data centre.

The data used in these experiments came from the Multi-Ethnic Study of Atherosclerosis (MESA [21]; see also deliverable D9.1). MESA is a medical research study involving more than 6,000 men and women in the United States, focusing on the characteristics of subclinical cardiovascular diseases. Both NMR and LC-MS metabolomics data were produced as part of the COMBI-BIO project (http://www.combi-bio.eu/). In the tests reported here, we used a large subset (n=1980) of the full MESA NMR dataset. The PhenoMeNal project has permission to use this data for development and testing of the e-infrastructure. However, the data cannot be shared outside the PhenoMeNal consortium, and hence it was selected as a representative test set to run on the private local instance.

ICL created virtual clusters of varying sizes and capacities to test the scalability of an example NMR processing tool, BATMAN. A large NMR dataset was split into smaller chunks of equal size so that each chunk can be processed separately. The idea is to have multiple instances of BATMAN running at the same time (on the same or on separate nodes). We ran a dockerised BATMAN so that its instances can be run as Kubernetes pods. As an initial test, a cluster of the following specification was used:

| Node type | Count | No. of vCPUs | RAM (GB) |
| --- | --- | --- | --- |
| Master | 1 | 2 | 2 |
| Edge | 1 | 2 | 2 |
| Worker nodes | 10 | 11 | 86 |

Table 1: Cluster specifications for dockerised BATMAN evaluation.

21 https://www.mesa-nhlbi.org/

Experiments via Luigi: One way to allocate resources to each pod is by writing Python wrappers via Luigi (each pod is run as a Luigi Worker). Hence, one node in the cluster can run one or more Luigi Workers (and consequently one or more BATMAN instances). The resources we allocated to each Luigi Worker (and therefore to each pod or BATMAN instance) were 2 vCPUs and 16 GB RAM.
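The chunking and per-worker allocation described above can be illustrated with a short Python sketch. It is not the ICL implementation: the function names and the choice of 40 chunks are hypothetical, and it assumes that the vCPU and RAM figures listed for the worker nodes in Table 1 are per node. The 1980 items stand in for the MESA NMR subset, and the 2 vCPU / 16 GB allocation is the one quoted in the text.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def workers_per_node(node_vcpus, node_ram_gb, worker_vcpus=2, worker_ram_gb=16):
    """Number of workers that fit on one node; the scarcer resource decides."""
    return min(node_vcpus // worker_vcpus, node_ram_gb // worker_ram_gb)


# Hypothetical identifiers standing in for the n=1980 NMR subset.
spectra = [f"spectrum_{i:04d}" for i in range(1980)]

# Hypothetical chunk count; each chunk would be handed to one BATMAN instance.
chunks = split_into_chunks(spectra, 40)

# Assumption: Table 1 lists 11 vCPUs and 86 GB RAM per worker node.
per_node = workers_per_node(node_vcpus=11, node_ram_gb=86)
total_workers = 10 * per_node  # 10 worker nodes in Table 1

print(len(chunks), per_node, total_workers)
```

In this reading both the vCPU and the RAM limits allow the same number of workers per node, so neither resource is left substantially unused by the 2 vCPU / 16 GB allocation.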

In a provisional test, ICL ran jobs with varying numbers of Luigi Workers and recorded the running times, as shown in Figure 8. The running time decreases as the number of workers increases, but it plateaus when the number of workers reaches approximately 40. We suspect that this is because the server has reached the highest performance it can deliver under this configuration.
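One way to see why a plateau near 40 workers would be expected is a toy model of ideal scaling. The server described above exposes 80 virtual CPUs and each Luigi Worker was given 2 vCPUs, so at most 40 workers can hold dedicated vCPUs at the same time. The sketch below assumes perfectly parallel chunks and no scheduling overhead; the chunk count and per-chunk time are hypothetical and are not the measurements behind Figure 8.

```python
from math import ceil

SERVER_VCPUS = 80       # from the server description in the text
VCPUS_PER_WORKER = 2    # allocation per Luigi Worker quoted in the text


def ideal_runtime(n_chunks, minutes_per_chunk, n_workers, max_parallel):
    """Idealised runtime: chunks are processed in waves of concurrently running workers."""
    # Workers beyond the hardware limit cannot add real parallelism.
    effective = min(n_workers, max_parallel)
    return ceil(n_chunks / effective) * minutes_per_chunk


max_parallel = SERVER_VCPUS // VCPUS_PER_WORKER

# Hypothetical workload: 400 chunks at 1 minute each.
for n_workers in (10, 20, 40, 60):
    print(n_workers, ideal_runtime(400, 1.0, n_workers, max_parallel))
```

In this model the runtime stops improving once the number of workers exceeds `max_parallel`, which is the qualitative behaviour the authors suspect.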

Overall, this shows that the PhenoMeNal infrastructure was successfully deployed on a local server behind an institutional firewall, and that the system can be run on sensitive data without any need for external connections or access.

Figure 8: Runtime vs Number of Luigi Workers for the BATMAN runs at ICL.

3.7.2 Uppsala University Hospital

Within the CARAMBA group, which serves as a mass spectrometry research lab as well as an accredited lab for clinical diagnostics, PhenoMeNal VREs were deployed on local servers, on Google Cloud, and on SNIC Science Cloud, and used for analysis of data with varying degrees of sensitivity. The most sensitive analyses were done on a local server (72 cores, 256 GB RAM), and non-sensitive runs were executed on Google Cloud and on SNIC Science Cloud. The data sets analysed included cerebrospinal fluid (CSF) samples from thirteen relapsing-remitting multiple sclerosis (RRMS) patients, fourteen secondary progressive multiple sclerosis (SPMS) patients, and ten healthy controls, all of which were processed on the local server. In addition, the experiments contained multiple quality control and identification samples, which were pre-processed on Google Cloud and on SNIC Science Cloud. The downstream analysis was performed using a PhenoMeNal Jupyter notebook deployed on the local server. Manuscripts are currently being written and will be submitted to scientific journals.
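The placement rule described above (sensitive data stays on the local server, while non-sensitive data may also run on Google Cloud or SNIC Science Cloud) can be written as a small policy check. This is only an illustrative sketch: the target labels, the enum, and the function name are hypothetical and are not part of PhenoMeNal.

```python
from enum import Enum


class Sensitivity(Enum):
    SENSITIVE = "sensitive"
    NON_SENSITIVE = "non-sensitive"


# Hypothetical labels for the three deployment targets named in the text.
ALLOWED_TARGETS = {
    Sensitivity.SENSITIVE: {"local-server"},
    Sensitivity.NON_SENSITIVE: {"local-server", "google-cloud", "snic-science-cloud"},
}


def check_placement(sensitivity, target):
    """Return the target if the data may be processed there, otherwise raise."""
    if target not in ALLOWED_TARGETS[sensitivity]:
        raise PermissionError(
            f"{sensitivity.value} data must not be processed on {target}"
        )
    return target


# Patient CSF samples: local server only.
print(check_placement(Sensitivity.SENSITIVE, "local-server"))
# Quality control samples: cloud pre-processing is permitted.
print(check_placement(Sensitivity.NON_SENSITIVE, "google-cloud"))
```

Failing loudly with an exception, rather than silently rerouting a job, makes a misconfigured run visible before any data leaves the local environment.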

3.8 General e-infrastructure contributions

Many of PhenoMeNal's contributions are not specific to metabolomics but are of general use. Some major e-infrastructure contributions are listed in Table 2 below:

| Component name | Description | PhenoMeNal contribution |
| --- | --- | --- |
| Kubernetes | Container orchestration, developed by Google | Developed KubeNow for simplified Kubernetes cluster provisioning. |
| Galaxy | GUI-centric workflow system, large community in bioinformatics | Developed Galaxy-K8S engine, pushed upstream. Developed new tool for automated testing of Galaxy workflows. |
| Luigi | Code-centric workflow system, developed by Spotify | Developed Luigi-K8S runner, pushed upstream. |
| Jupyter | Notebooks for interactive analysis | Instantiate via KubeNow. |
| EBI Cloud Portal | Services and GUI to interact with public and private clouds | Use cases and testing with KubeNow. |

Table 2: General e-infrastructure contributions by PhenoMeNal.


3.9 Risks

Specific risks for WP5 have been reported in earlier deliverables, e.g. D5.1 and D5.2. No new risks have appeared at this stage of the project. It is worth mentioning that in this field the state of the art is a moving target, so it is crucial for our project and similar projects to be agile. New frameworks that simplify tasks continue to emerge, and substantial effort is needed to stay up to date with recent developments in the field.

4 Delivery and Schedule

The deliverable was submitted on time.

5 Conclusion

This deliverable reports on the results of the first 24 months’ work towards an operational PhenoMeNal VRE with a number of usable services. We have also demonstrated analysis on private/sensitive data in secure environments. The integration with the EGI federated cloud/grid is ongoing and well under way; we depend on further EGI developments in order to proceed with our demonstrator, which we aim to complete during M24-M30.
