
Delivering ICT Infrastructure for Biomedical Research

Tommi Henrik Nyrönen, Jarno Laitinen, Olli Tourunen, Danny Sternkopf, Risto Laurikainen, Per Öster, Pekka T. Lehtovuori
CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland, +358-9-4572235, [email protected]

Timo A. Miettinen, Tomi Simonen, Teemu Perheentupa, Imre Västrik, Olli Kallioniemi
Institute for Molecular Medicine Finland – FIMM, P.O. Box 20, FI-00014 University of Helsinki, Finland, [email protected]

Andrew Lyall, Janet Thornton
EMBL–EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK, [email protected]

ABSTRACT This paper describes an implementation of the Infrastructure-as-a-Service (IaaS) concept for scientific computing and seven service pilot implementations driven by requirements from biomedical use cases at CSC – IT Center for Science. The key service design requirements were to enable the use of any scientific software environment the use cases needed to succeed, and to deliver the distributed infrastructure ICT resources seamlessly alongside the local ICT resources of the scientist users. The service concept targets the IT administrators at research organisations and delivers virtualised compute cluster and storage capacity via private network solutions. The virtualised resources can become part of the local cluster as virtual nodes, and they can share the same file system as the physical nodes provided the network performance is sufficient. Extension of the local resources can then be made transparent to enable easy infrastructure uptake for the scientist end-users. Based on 20 months of service piloting, most of the biomedical organisations express a sustained and growing need for the distributed compute and storage resources delivered with the IaaS. We conclude that a successful implementation of the IaaS can improve access to, and reduce the effort of running, the expensive ICT infrastructure needed for biomedical research.

Categories and Subject Descriptors A.0 [General]: Conference proceedings; D.4.7. [Operating systems]: Organization and Design – Distributed systems; K.6.4 [Management of computing and information systems]: System Management – Centralization/decentralization.

General Terms Management, Experimentation, Security, Human Factors.

Keywords Biomedical, Research, Data, ICT, Infrastructure, Service, Biological Information, IaaS, Computing, Storage, Network, Datacenter, Health, Biobanks.

1. INTRODUCTION Biomedical science has become a data- and computationally intensive discipline, with rapidly evolving software stacks for data analysis [1; 4; 13]. Many research groups and institutions have set up their own local computational resources to satisfy this need, usually compute clusters and storage solutions. At a quickening pace, these resources become insufficient. In addition to computing services, users need significant storage capacities for their data, and access to large reference datasets to place their findings in the context of current knowledge. Reference datasets in biomedical science such as the 1000 Genomes1 and The Cancer Genome Atlas2 (TCGA) are already hundreds of terabytes in size and grow rapidly. These challenges form a research bottleneck for the biosciences. The speed of biological data production will surpass the foreseeable available compute and storage capacities [12]. Various suggestions have emerged on how to best serve the scientific community with information technology resources using emerging cloud technology solutions [3-7; 10-12; 14; 15].

To help address some of these challenges, CSC – IT Center for Science (CSC), a Finnish non-profit research service organisation, is developing an Infrastructure as a Service (IaaS) concept [2; 16] in collaboration with biomedical research organisations. The development is part of the construction of the ELIXIR research infrastructure3. The IaaS delivers virtualised compute cluster capacity and storage for collaborating institutes and service providers via private network solutions. Importantly, the IaaS concept enables organisations to build specialised scientific workflows in Virtual Machines (VMs) they control. The scientific knowledge of the research group can be encapsulated in the local software environment inside the VMs. Scientific organisations can thus run and modify any software tool stack and distribute this know-how, while benefitting from the capacities of a trusted computing centre.

1 http://www.1000genomes.org 2 http://cancergenome.nih.gov/ 3 http://www.elixir-europe.org

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WICSA/ECSA 2012. August 20-24, Helsinki, Finland. Copyright 2012 ACM 978-1-4503-1568-5/12/08… $15


This has not been the standard way to deliver a computing service for research organisations. Computing centres typically aim to serve customers from many disciplines and thus the software environment and the interfaces provided by the computing centre tend to be generic. The users of the local compute clusters and storage solutions would like to move their tasks to a more efficient facility, but the technical work involved has been criticised by biomedical scientists in Finland.

Cloud computing is a technology that can offer a solution for exactly that. This paper describes the delivery of CSC’s computing capacity through the IaaS interfaces and reports seven different ways in which biomedical computing service providers have used it so far. Similar approaches have been proposed by other research infrastructure service providers, including SARA4 and the Australian research cloud5. The IaaS concept has been critically analysed [2] and mostly accepted as a good option [16] for delivering scientific computing. The service proposition is essentially an interface to compute and storage capacities via a high-quality network connection, designed to fit the way leading biomedical scientific organisations work. CSC’s aim is that the workflow of the biomedical specialists in the scientific organisations that the overall research infrastructure serves is minimally disturbed by the way the ICT infrastructure is delivered.

2. SERVICE IMPLEMENTATION

2.1 Service for ICT experts The Infrastructure as a Service targets the information technology administrator staff at biomedical organisations. The service extends local resources and can be transparent for the end-users. The technical know-how of the local IT environment and the application workflow is developed and maintained by the research organisations, who are the service users (Figure 1). In addition to a technical contact, the service setup requires an organisational contact who can sign a contract to start the collaboration. Once the collaboration is in place, the IaaS provides

• Physical cluster compute nodes and storage resources via virtualisation and (cloud) interfaces

• Support for usage of the resources

The ICT service provider does not necessarily manage the virtual machines, nor control or support what runs on them. The service user is a research organisation represented by an IT administrator who

• Manages VMs with administrative privileges

• Installs and maintains the operating system and other software for the VMs and pays any software license fees

• Can connect the existing compute and storage resources of the research organisation through private network solutions built with the service provider.

4 http://www.sara.nl/services/cloud-computing 5 http://www.nectar.org.au/research-cloud

2.2 Storage The VMs have a fast local disk that is not shared among the virtual cluster nodes. The service user can get a network file system (NFS) share that is visible to all nodes inside the network of the research organisation. A remote file system (such as NFS or a parallel file system in the local research organisation) can be connected to the organisation’s VMs over the network. Its usability depends on the network performance and the usage pattern. The service could offer back-up and long-term archive functionality, but we do not provide these at the moment.

Instead of typical block-level access to shared mass storage, the IaaS NFS service is virtualised and hides the internal SAN architecture from the clients. The benefit is that each customer organisation can have its own NFS Unix user identity (UID) range, i.e. the ranges can overlap between customer organisations. The organisations can utilise different authentication methods. Each virtualised NFS instance has its own routing and L2 forwarding tables, which enables the provision of private IP address spaces.
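To make the storage arrangement concrete, the following minimal Python sketch shows how a service user’s VM might mount such an organisation-specific NFS export. The server name, export path and mount point are hypothetical placeholders, not details of the CSC service.

    # Hypothetical sketch: mount an organisation-specific NFS export inside a VM.
    # Server name, export path and mount point are illustrative placeholders.
    import subprocess

    NFS_SERVER = "nfs.example-org.internal"      # virtualised NFS instance (assumed name)
    EXPORT = "/export/org-share"                 # per-organisation export (assumed path)
    MOUNT_POINT = "/mnt/org-share"

    def mount_org_share():
        subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
        # Plain NFS mount; mount options would follow the organisation's own policy.
        subprocess.run(
            ["mount", "-t", "nfs", f"{NFS_SERVER}:{EXPORT}", MOUNT_POINT],
            check=True,
        )

    if __name__ == "__main__":
        mount_org_share()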

Figure 1. Overview of the implemented IaaS concept. The local resources of the research organisation can be connected to the service provider either through the Internet by using VPN or through an OPN connection.


A customer organisation can have its own snapshot policy to save versions of recently changed files. However, to save storage space this is usually not enabled unless specifically requested.

2.3 Network The VMs have a flexible internet protocol address range. Each service user organisation has its own virtual local area network (VLAN) ID on the physical cloud hardware network interface. The network between the service provider and the service user can rely on an Optical Private Network (OPN), a.k.a. lightpaths, that is VLAN tagged, or alternatively on a Virtual Private Network (VPN) over the Internet. The VPN solution is currently OpenVPN with routing on the customer cloud frontend side. Traffic to the Internet generated, for example, by a biomedical service relying on the IaaS goes through the service user’s firewall.

2.4 Hypervisor The virtualisation hypervisor is KVM and the cloud middleware is OpenNebula6, which provides a command line interface and a web user interface with console access to VMs. The open source OpenNebula middleware is under heavy development and has been updated on the test system a few times. Minor tailoring has been implemented to improve service security. Virtualisation causes some overhead, but for CPU it is considered to be low. The network latency between the nodes is still too high to run communication-intensive parallel jobs efficiently.

6 http://opennebula.org

2.5 Virtual machine images The virtual machines are created from image files containing an operating system and a software environment. It is also possible to use CD-ROM/DVD ISO images. Modified virtual machines can be saved as new images. This makes it easy to revert to earlier versions of software environments, e.g. to reproduce results, and to distribute them. Different file formats for the virtual machine images are supported and are fairly portable across different host operating systems.
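As an illustration of image-format portability, the short Python sketch below wraps the standard qemu-img tool to convert a hypothetical qcow2 image to raw format; the file names are placeholders, and the formats actually used in the service are not specified here.

    # Hypothetical sketch: convert a VM image between formats with qemu-img.
    # File names are placeholders; the service's actual image formats may differ.
    import subprocess

    def convert_image(src="bio-env-v2.qcow2", dst="bio-env-v2.img"):
        # qemu-img convert -f <source format> -O <output format> <src> <dst>
        subprocess.run(
            ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", src, dst],
            check=True,
        )

    if __name__ == "__main__":
        convert_image()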

2.6 Access to VMs New VMs can be deployed by the service user according to the resource allocation policy and the physical limitations of the service provider. The cloud middleware provides a command line and a graphical interface to manage the VMs. The service user can e.g. list available images and start/stop virtual machines on hosts.
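The management operations described above might look roughly as follows when scripted against the standard OpenNebula command-line tools (oneimage, onevm). The template attributes, image name and network name are illustrative assumptions; the exact syntax accepted depends on the OpenNebula version deployed.

    # Hypothetical sketch: manage VMs through the OpenNebula CLI from a script.
    # Template attributes, image and network names are illustrative; details
    # depend on the OpenNebula version and the site configuration.
    import subprocess, tempfile

    VM_TEMPLATE = """
    NAME   = "bio-worker"
    CPU    = 4
    MEMORY = 8192
    DISK   = [ IMAGE = "bio-env-v2" ]
    NIC    = [ NETWORK = "org-vlan" ]
    """

    def list_resources():
        subprocess.run(["oneimage", "list"], check=True)   # available images
        subprocess.run(["onevm", "list"], check=True)      # running/pending VMs

    def start_vm():
        with tempfile.NamedTemporaryFile("w", suffix=".tmpl", delete=False) as f:
            f.write(VM_TEMPLATE)
            path = f.name
        subprocess.run(["onevm", "create", path], check=True)

    if __name__ == "__main__":
        list_resources()
        start_vm()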

2.7 System administration The service provider gives each user a permanent administrator interface accessible via SSH, where the service user can manage VMs, images and NFS space. In the VPN network solution, the IT administrator interface also acts as a gateway between the service user organisation and the ICT capacities provided from the IaaS, providing cloud cluster front-end, installer, DHCP, batch system, monitoring etc. functionalities.

2.8 Support The service provider communicates with the service users via two email channels: one for announcements, new features, service breaks etc., and one for issues. All parties have access to a shared documentation wiki. In addition, in the pilot phases an important part of establishing trust is active person-to-person communication between the cloud provider and the local IT administrators. To set up a service pilot, CSC needs a technical and an organisational contact. The point of contact is [email protected].

2.9 Hardware The first generation of the IaaS pilot hardware in 2010-2011 has 48 hosts or nodes in total. Each node has 2 CPUs (Intel Xeon X5650, 2.66 GHz) with a total of 12 physical cores, shown as 24 cores with HyperThreading, 24/48 GB of memory and 2 x 148/300/600 GB of SAS local storage, with optional network file system (NFS) storage. The physical nodes set the limits for the VMs. This resource soon became fully used and was extended in December 2011 with 72 new nodes with up to 96 GB of memory, of which 68 nodes have 1 TB SATA and four nodes 3.6 TB SAS storage. The upgrade extended the operational use cases and allowed new pilots to join. The data transfer speed between equipment is 10 Gbit/s. The effective NFS space is currently 300 TB.

3. RESULTS We have created a delivery model for virtualised compute and storage capacity intended to be used as a building block for distributed biomedical research services and piloted the service model in practice for 20 months.

3.1 Service delivery In the pilot implementation the service provider delivers virtualised compute cluster capacity for biomedical service providers via private network solutions (Figure 2). Biomedical organisations can build specialised scientific software environments in virtual machines (VMs) they control. VMs can encapsulate the local software environment. The software environment is typically part of a topical data provisioning or analysis service needed by the biomedical research community.

The IaaS extends local resources for running the VMs. Depending on policies and usage, the VMs can be on the same Internet protocol address range as the institute’s local nodes. If the VMs become a part of the local cluster as virtual nodes, they can share the same file system as the physical nodes if the network performance is sufficient. This makes it possible to make the infrastructure adaptation transparent for the local end-users. The local IT administrator can add virtual nodes and use them as backend nodes for local services to free local physical nodes, or configure them in the batch scheduling system. The scientist user experience is that the local cluster is able to run the software environment they control faster. The local IT administrator does the infrastructure maintenance as usual, but now also uses virtual capacity from the IaaS provider.

Figure 2. Delivery of service for the IT experts in the biomedical organisations.

An alternative method for local system service use is to make a queue for the VMs in the local batch scheduling system of the research organisation’s compute clusters. Jobs matching certain requirements (e.g. serial) could be assigned to that queue. For the scientist users, adopting the cloud resource as part of the local IT infrastructure of the biomedical organisation is also transparent in this case. However, some work on the job submission scripts is likely to be needed. For instance, the directory structure in the target virtual machine can be different from that in the local cluster.
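The following minimal Python sketch illustrates the kind of submission-script adaptation mentioned above: data paths are rewritten when a job is routed to a hypothetical cloud-backed queue whose VMs use a different directory layout. The queue name and paths are invented for illustration and do not reflect the pilots’ actual configuration.

    # Hypothetical sketch: adapt a job's data paths when it is routed to a
    # cloud-backed batch queue whose VMs use a different directory layout.
    # Queue name and paths are illustrative only.

    LOCAL_DATA_ROOT = "/scratch/project"          # layout on local physical nodes (assumed)
    CLOUD_DATA_ROOT = "/mnt/org-share/project"    # layout inside the virtual nodes (assumed)

    def adapt_script(script_text: str, queue: str) -> str:
        """Rewrite data paths if the job goes to the hypothetical 'cloud' queue."""
        if queue == "cloud":
            return script_text.replace(LOCAL_DATA_ROOT, CLOUD_DATA_ROOT)
        return script_text

    if __name__ == "__main__":
        original = "my_analysis --input /scratch/project/run42/reads.fq\n"
        print(adapt_script(original, queue="cloud"))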

Certain applications suit the virtualised computing resource better than others [9; 16]. Differences arise from the effectiveness of the virtualisation of resources (memory, CPUs, disk space) and the relative weight of using these resources for the application workflow that runs in the virtual machine.

Based on the feedback and a technical query, the biomedical organisations that have taken part in the pilot service approve of this delivery model because it helps the local IT administrators to, in turn, support the local research operations, and because the uptake does not require actions from the scientist end-users. Our model for the division of work thus improves human and technical interoperability since it allows specialisation between biomedical and e-Infrastructure organisations, and it holds the potential to make public non-profit research infrastructure ICT resource usage and collaboration more effective for biomedical use. Adopting a cloud interface requires some work from the IT administrator, but it makes the computing centres’ resources easier to utilise together with the local IT resources.

3.2 Private network as a delivery method The service can be delivered via optical private network (OPN) solutions such as a 10 Gbit/s link set up between the distributed physical hardware (Figure 1). The choice of data location during the computations affects the overall performance of the distributed system.

Dedicated (OPN) connections are generally considered preferable for performance and security reasons, since a breach would need to be made physically. An OPN offers more bandwidth and lower latency than the Internet. An OPN takes a few months to establish and has annual running costs that are commonly supported by the research and education network operators. It is suitable for long-term service co-operation and for cases where the data transfer needs require large network capacity. The Virtual Private Network (VPN) solution is faster to establish and cheaper to maintain than an OPN. The traffic goes over the Internet, but there are various methods to encrypt it, such as SSL encryption. It is possible to acquire a dedicated VPN device or to use VPN software like OpenVPN.

3.3 Data movements between distributed sites Depending on the amount of data and the data handling bottlenecks, it can make sense to automate the copying of data next to the cloud computing resources before executing the application workflow, if the data privacy policies allow this. A characteristic of not copying data over is that no permanent copy stays in the cloud compute service provider’s system, which might be required [2]. The data transfer is encrypted using secure copy (scp) or VPN connections. In OPN connections a breach in the data connection would have to be made physically between the hardware (switches) situated at the distributed collaborating sites. Monitoring is then typically considered an adequate security measure for data transmissions. Encryption of OPN data movement is theoretically feasible, but not currently provided.
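A staging step of the kind described above could be sketched in Python as follows: copy the input next to the cloud resource over scp, trigger the analysis, and delete the remote copy so that no permanent copy remains at the provider. The host name, paths and the run_workflow.sh launcher are hypothetical.

    # Hypothetical sketch: stage data to the cloud resource over scp, run the
    # analysis, and remove the remote copy afterwards so no permanent copy
    # remains at the service provider. Hosts, paths and the launcher script
    # are placeholders.
    import subprocess

    CLOUD_HOST = "vm.front.example-iaas.fi"       # assumed cloud front-end
    REMOTE_DIR = "/data/staging/run42"            # assumed staging directory

    def stage_run_cleanup(local_input="reads.fq"):
        subprocess.run(["ssh", CLOUD_HOST, "mkdir", "-p", REMOTE_DIR], check=True)
        subprocess.run(["scp", local_input, f"{CLOUD_HOST}:{REMOTE_DIR}/"], check=True)
        # Placeholder for launching the actual workflow on the cloud side.
        subprocess.run(["ssh", CLOUD_HOST, f"run_workflow.sh {REMOTE_DIR}"], check=True)
        # Remove the staged data so no copy stays at the provider.
        subprocess.run(["ssh", CLOUD_HOST, "rm", "-rf", REMOTE_DIR], check=True)

    if __name__ == "__main__":
        stage_run_cleanup()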

The staging of hundreds-of-terabytes reference datasets like the 1000 Human Genomes or The Cancer Genome Atlas (TCGA) for computing purposes for the biomedical community at large remains a challenge. Two of the current pilots partially download and provision TCGA via shared NFS (Figure 1), and in one case UniProt7 raw data is part of the virtual machine image. A better way of providing large datasets would be provisioning the data to the virtual machines from a shared file system within the IaaS service. Recently, Amazon announced a joint consortium to serve the 1000 Genomes data to the EC2 compute cloud in this way.8

4. BIOMEDICAL USE OF CSC IAAS The information technology requirements of biomedical use cases were surveyed with the IT administrators who support the local research groups in biomedical organisations. Services provided with the CSC IaaS concept in response to these requirements are listed in Table 1.

Observed requests for IT capacity vary. At one extreme, a pilot requested ten terabytes of shared storage but reported no need for compute capacity. For a small organisation, a cluster with 192 cores seems to be sufficient for most capacity requests. In peaks, however, the demand could be four times larger. For data-intensive computing like genome analysis, 48 GB of RAM is often not enough. During the lifecycle of the current hardware we have extended the RAM and NFS storage capacity compared to the original physical hardware design. Requests from the use cases to scale up typically emerge on short timescales according to the research interests. Provisioning this type of flexibility requires spare resources from the service provider. Large RAM per processor is important for many use cases and is currently difficult to secure from public cloud providers. Tens of terabytes of storage space is important for genomic sequence analysis projects.

Table 1. Physical resources used and the estimated maximum resource requirements of the biomedical IaaS pilot use cases.

Use case             Compute nodes/cores (a)   Est. max.   Storage NFS, TB   Network
Biomedinfra.fi       42/1008                   1000X       50                OPN
Anduril              2/48                      4X          30                OPN
Webmicroscope (b)    -                         -           -                 OPN
Chipster             8/192                     10X         5                 CSC
CBS ELIXIR tools     4/96                      4X          0.5               VPN
PredictProtein       2/48                      20X         1                 VPN
Bioinformatics lab   4/96                      4X          4                 VPN

Abbreviations: Est. max., multiplier for the IT administrator’s estimated maximum resources in peaks; NFS, Network File System; TB, terabyte; OPN, Optical Private Network; VPN, Virtual Private Network.

(a) Compute nodes have 24 cores with Hyper-Threading (Intel X5650, 2.66 GHz). Each node has 24/48/96 GB of memory, fast local storage, 10 Gbit/s internal connections and optional NFS storage.

(b) Integrated in Biomedinfra.fi.

7 http://uniprot.org 8 http://aws.amazon.com/1000genomes/

4.1 Biomedinfra.fi The IaaS is used as a distributed IT infrastructure for an integrated biomedical sciences European research infrastructure (ESFRI9) node in Finland between the Otaniemi and Meilahti campuses, where over 1800 researchers are engaged in biomedical research. An Optical Private Network (OPN) extends CSC’s services to Biomedicum Helsinki, the centre for medical practice, research and training, and delivers virtual cluster capacity to the local network of the Institute for Molecular Medicine Finland (FIMM). CSC’s infrastructure service is seamlessly embedded within the information technology environment of FIMM. Scientific software environments made by tens of research groups on the campus have access to CSC’s capacities. In practice, FIMM’s compute cluster has seemingly quintupled in size in terms of processing power, provided by CSC’s IaaS cloud. The arrangement utilises the OPN for high-speed data transmission between Meilahti and Otaniemi. Besides offering a 10 Gbit/s connection, the dedicated network brings data security and latency benefits. The distribution of various resources such as tiered data storage is feasible with the connection. The IT infrastructure and services are being developed for the purpose of analysing and utilising genetic information retrieved from the Finnish biobanks.10

4.2 Anduril The IaaS is a resource for the Anduril11 data workflow system [8] within the Finnish national research and education network (Funet), and benefits from the high-performance dedicated data connection between the Otaniemi and Meilahti campuses. The high-throughput biomedical data workflow system Anduril includes an engine responsible for scheduling and executing an acyclic network of components. The workflow is defined by dependencies between the inputs and outputs of components, preset priorities and forced execution sequences. Upon execution, output datasets created by workflow components are forwarded as input to other component(s). The engine is written in Java, whereas the components are written in Java, R, Matlab, Octave, Python, Perl or bash shell script. Sources of input to a component include data imported into a pipeline, the output created by other components, and results from querying external biological databases. Typical dataset sizes vary from a few gigabytes to over a terabyte. Some components are computationally intensive and demanding on I/O and storage space. The overall workflow requires a substantial amount of memory. I/O performance is crucial especially during simultaneous execution of components that access multiple sources of data. In summary, the resources needed by Anduril workflows depend on each component and its requirements, which in turn depend on the pipeline topology.
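The following conceptual Python sketch, which is not Anduril’s actual API, illustrates the execution model described above: components form an acyclic dependency graph, each component runs once its inputs are available, and its outputs are forwarded to its dependants.

    # Conceptual sketch only (not Anduril's actual API): execute components of
    # an acyclic workflow in dependency order, forwarding each component's
    # outputs to its dependants. Names and functions are illustrative.
    # Requires Python 3.9+ for graphlib.
    from graphlib import TopologicalSorter

    # component -> set of components it depends on
    DEPENDS_ON = {
        "import_data": set(),
        "normalise":   {"import_data"},
        "annotate":    {"import_data"},
        "integrate":   {"normalise", "annotate"},
    }

    # Toy component implementations; real components would be Java, R, Matlab,
    # Python, etc. programs producing datasets on disk.
    COMPONENTS = {
        "import_data": lambda inputs: "raw dataset",
        "normalise":   lambda inputs: f"normalised({inputs['import_data']})",
        "annotate":    lambda inputs: f"annotated({inputs['import_data']})",
        "integrate":   lambda inputs: f"integrated({inputs['normalise']}, {inputs['annotate']})",
    }

    def run_workflow():
        outputs = {}
        # static_order() yields components with all dependencies first.
        for name in TopologicalSorter(DEPENDS_ON).static_order():
            inputs = {dep: outputs[dep] for dep in DEPENDS_ON[name]}
            outputs[name] = COMPONENTS[name](inputs)
        return outputs["integrate"]

    if __name__ == "__main__":
        print(run_workflow())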

9 http://ec.europa.eu/research/esfri/ 10 http://bbmri.fi 11 http://csbi.ltdk.helsinki.fi/anduril/

4.3 WebMicroscope The IaaS is a web service backend for preparing images for the WebMicroscope service12. The service is integrated inside the FIMM infrastructure utilising the OPN. CSC IaaS cloud resources are available for WebMicroscope image pre-processing transparently from the FIMM local cluster and software environment. WebMicroscope is a web-based administration platform for virtual microscopy, with emphasis on functions for research purposes and on connecting image data to patient clinical data. Computationally intensive applications include automated scoring of immunohistochemically stained specimens, quantification of fluorescence in situ hybridisation signals, and texture analysis for various pattern recognition tasks like tissue classification and cell and organism detection. The instructions for remotely running a batch on the WebMicroscope server use standardised procedures, for example accessing a web service or downloading files. An example computational run in the local cluster extended with the IaaS is preparing a set of slides for WebMicroscope. This produces approximately 1 TB of raw image data that needs computational preparation to produce the information for the WebMicroscope service. Information is stored in SQL databases. The IaaS resources employed in the pilot correspond to a single moderate-capacity server instance with storage space for a moderately sized single biopsy, ranging from 100 GB to 1 TB.

4.4 Chipster The IaaS is used as a backend for a scientific service inside CSC’s internal network. Chipster13 [4] is user-friendly analysis software for high-throughput data. It contains over 200 analysis tools for next generation sequencing (NGS), microarray and proteomics data. Users can visualise data interactively, and save and share automatic analysis workflows. Chipster's client software uses Java Web Start to install itself automatically, and it connects to computing servers for the actual analysis. Resources of the IaaS pilot are utilised as file server nodes and computing nodes. The file server node needs access to terabyte-level storage. The computing nodes need less storage space, as many cores as possible and moderate amounts of memory. The IaaS resources are connected to an instance of Chipster. Future interests include running a Hadoop cluster made from nodes from the IaaS14.

4.5 CBS ELIXIR tools node At the Center for Biological Sequence Analysis (CBS15) the role of the IaaS is an on-line web service backend. The connection to the IaaS resources is made via VPN. The physical network cable length via Nordunet to the IaaS resource is over 1000 km. Bioinformatics as an e-science is accomplished through the wide use of web-service based tools. The ability to use these for research requires that data produced by the tools is presented in an interoperable, machine-readable way. Researchers also need to be able to find the right tools through catalogues of services. The tools infrastructure is part of the European Life Science Infrastructure for Biological Information, ELIXIR [1]. In this context, the CSC IaaS has three utilisation scenarios: first, to spawn short-response jobs covering one of many (50+) WWW-based services with limited storage, I/O, compute and memory requirements. Second, this evolves into a computationally more demanding scenario requiring regularly updated databases of tens to hundreds of gigabytes in size on the IaaS. Third, a future interest is to be able to efficiently mount the cloud storage and submit jobs directly from a local queue to the nodes in the IaaS cloud.

12 http://fimm.webmicroscope.net 13 http://chipster.csc.fi 14 http://seqahead.eu/meetings:hdfsvsparallelpage 15 http://www.cbs.dtu.dk/services/

4.6 PredictProtein The IaaS is used as a backend for making releases of the PredictProtein service16. The connection to the IaaS resources is via VPN, and the physical network cable length via the Nordunet17 and GÉANT18 research networks to CSC is over 2000 km. The PredictProtein service builds upon a refinement of UniProt and other biological datasets scanned for updates every two weeks. The PredictProtein service is part of the Rostlab19 infrastructure and publishes public service releases e.g. as virtual machine images. Currently more than 30 scientific software packages are encapsulated in the PredictProtein software environment image. Rostlab runs the PredictProtein environment on the IaaS cloud (software and data) and returns the compute results to its own infrastructure. Inside the compute workflow, the UniProt database is first converted into a flat table using post-processing tools over the latest database release. This data file is currently integrated in the virtual machine images. A better way could be to provision the data to the virtual machines from a shared file system at the IaaS provider’s site.
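A flat-table conversion step of the kind mentioned above could look roughly like the Python sketch below, which extracts only entry names and accessions from UniProtKB text-format records; the actual PredictProtein post-processing tools are more elaborate.

    # Rough sketch (not PredictProtein's actual tooling): flatten UniProtKB
    # text-format entries into a tab-separated table of entry name and accessions.
    import sys

    def flatten(uniprot_dat, out):
        entry_id, accessions = None, []
        for line in uniprot_dat:
            if line.startswith("ID"):
                entry_id = line.split()[1]
            elif line.startswith("AC"):
                accessions += [a.strip(";") for a in line.split()[1:]]
            elif line.startswith("//"):          # end of one entry
                out.write(f"{entry_id}\t{','.join(accessions)}\n")
                entry_id, accessions = None, []

    if __name__ == "__main__":
        # Usage sketch: python flatten_uniprot.py < uniprot_sprot.dat > table.tsv
        flatten(sys.stdin, sys.stdout)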

4.7 Structural bioinformatics laboratory The IaaS is used to extend the local computing cluster20 with more nodes. The connection to the IaaS resources is via VPN running within the Finnish national research and education network (Funet). The physical network cable length to CSC is less than 500 km. The laboratory cluster infrastructure contains versatile structural bioinformatics software supporting workflows in structural analysis, molecular modelling and X-ray crystallography. Part of the laboratory’s focus is on the latter, which involves computing electron density from diffraction images and then building a structural model based on the data. Most jobs are single-threaded. The space consumption of raw data is only 5-20 GB per data collection and there are about a dozen collections per year. The most resource-intensive procedures are molecular docking and dynamics simulations. Bioimaging, e.g. high-throughput microscopy, is a new user group that the local cluster aims to serve. Imaging will produce terabytes of raw data (bitmap images) per run, and the intention is to use the local cluster to computationally process the data.

16 https://www.predictprotein.org/ 17 http://nordu.net 18 http://www.geant.net 19 http://rostlab.org 20 http://web.abo.fi/fak/mnf/bkf/research/sbl/

5. DISCUSSION Modern biomedical science rapidly creates new information technology requirements. Turning these requirements into a research infrastructure service on a short timescale is a challenging task and requires collaboration from both commercial and governmental service providers. However, there is a fundamental difference between a service optimised for business and a non-profit service provider: for the latter, success is measured in scientific and technological impact rather than economic benefit for the service operator or owner. If they so choose, governmental cloud service providers can employ scientific experts to assist in the research infrastructure service delivery. According to our experience, delivering ICT infrastructure for biomedical science requires IT support service experts who work closely with the biomedical users. Commercial cloud providers could offer this as a service as well, but it would constitute a significant added cost.

Over the course of the piloting, the Biomedinfra.fi use case generated petabytes of network traffic, employed several hundred CPU cores 24/7, and has been in operation for 20 months. This is a usage pattern that is expensive to acquire from commercial cloud providers21. This observation is in line with a recent report on scientific cloud computing [16]. However, the interfaces to governmental supercomputing centres have not always been easy to adapt for biomedical computing use, and therefore many organisations have had to rely on local clusters. In any case, with the development of cloud interfaces to computing centres’ infrastructures, the rationale for biomedical organisations sustaining and growing self-owned IT resources is becoming harder to justify.

21 http://aws.amazon.com/pricing_effective_july_2011/

5.1 Enabling organisational specialisation In order to scale up the resources for biomedical services that become more widely used as they grow popular, the institutes need an easy mechanism to meet the ever-growing demand. Biomedinfra.fi is an example of how a biomedical expert organisation can focus its resources on higher-level scientific service building and rely on a public-sector IaaS cloud service provider for the IT infrastructure, with the potential to scale up or down with the demand on the biomedical end-user service.

If a biomedical organisation decides to rely on a distributed cloud service provider to meet the increasing demand, trust is of paramount importance. For instance, if the network solution is controlled bilaterally between organisations in the research and education networks, e.g. via dedicated network connections, it gives control over the physical location of the data. If data is sensitive, cannot leave the country, or is restricted for various legal reasons, control of the data location within a geographic region covered by certain laws could actually be a requirement for using a cloud service provider [2].

Projected needs of e-Infrastructure services by the biomedical research community to analyse data are varied and challenging to meet. Disorganised computer hardware setups are currently common and cannot achieve the efficiency of data centres that have been built with energy efficiency as a design factor [16]. Both environmental impact and savings from economies of scale in electricity prices are drivers for biomedical operators as well, and this should motivate them to put their compute operations in the hands of specialised service organisations. However, as long as interfaces and support for access to centralised resources are a bottleneck for the division of work between specialised organisations for e-Infrastructure and biomedical science services, this will not happen in practice. Fortunately the situation is changing rapidly.

5.2 Summary Biomedical research data has properties that affect ICT infrastructure service provisioning. Biomedical information is in many cases regulated for ethical, societal, legal or other reasons. For this reason data typically needs to reside within the network of the organisation that created it and has the mandate to handle it. Biomedical organisations need to balance the costs of ICT for data handling with, for example, the legal requirements for the proper handling of sensitive data. Therefore analysing the cost of delivering ICT for data analysis is not trivial and depends on the viewpoint, with the unwanted option being that research is delayed because adequate ICT resources cannot be delivered from the research infrastructure. Attaining large memory per CPU is important for computational genomic analysis, as is large compute-accessible storage space for analysis instrument output data. Only a few biomedical organisations can secure ICT resources for leading research as the field becomes more and more data-analysis intensive. Integration with distributed IaaS resources can offer a solution provided the trust and legal requirements can be met, and interfaces for doing so are becoming more standard at a rapid pace. According to our experience, expert-to-expert support at the level of IT administrators is one of the key success factors for a successful implementation of the IaaS.

The properties of the proposed cloud IaaS delivery model for scientific computing are:

• Workflow of the specialists in the scientific organisations is minimally disturbed upon uptake of the infrastructure

• Private network solution acts as delivery route via OPN or VPN in research and education networks, enabling control of physical data location

• A non-profit governmental organisation acts as a service provider for science rather than for profit

• Services are based on energy-efficient data centres, taking the environment into account (green ICT)

• Service provider’s IT experts have expertise for dialogue with biomedical organisations’ IT experts

• In the future the service can give computational access to large public reference datasets such as human genomic variation.

6. ACKNOWLEDGMENTS The IaaS development is part of the Biomedinfra.fi effort bridging nationally in Finland between European research infrastructures for bioinformatics – ELIXIR, biobanking – BBMRI and translational research – EATRIS.

IaaS pilot use case technical contacts are Peter Wad Sackett, Jukka Lehtonen, Javier Nunez-Fóñtarnau, Kristian Ovaska, Laszlo Kajan, Timo Miettinen, Tomi Simonen, Aleksi Kallio and Mikael Lundin; administrative contacts are Kristoffer Rapacki, Søren Brunak, Mark S. Johnson, Sampsa Hautaniemi, Burkhard Rost, Olli Kallioniemi, Imre Västrik, Eija Korpelainen and Johan Lundin.

The IaaS service has been made with support from Academy of Finland grants (Nos. 136452, 137370, 141512, 263164), the European Commission FP7 grant for the ELIXIR preparative stage (No. 211601), and CSC – IT Center for Science.

7. REFERENCES

[1] Brunak, S., Godzik, A., Blanchet, C., Clausen, I.G., Bryne, J.C., Toldstrup, N., Gordon, P., Lopez, R., and Ouellette, F., 2009. ELIXIR Infrastructure for Tools Integration. In ELIXIR European Life Sciences Infrastructure for Biological Information. http://www.elixir-europe.org/prep/bcms/elixir/Documents/reports/WP12_Infrastructure_for_Tools_Integration_Final_Report.pdf

[2] Catteddu, D., 2011. Security & Resilience in Governmental Clouds - Making an Informed Decision. European Network and Information Security Agency. http://www.enisa.europa.eu/activities/risk-management/emerging-and-future-risk/deliverables/security-and-resilience-in-governmental-clouds/

[3] Fusaro, V.A., Patil, P., Gafni, E., Wall, D.P., and Tonellato, P.J., 2011. Biomedical Cloud Computing With Amazon Web Services. PLOS Computational Biology 7, e1002147. DOI=http://dx.doi.org/10.1371/journal.pcbi.1002147

[4] Kallio, M.A., Tuimala, J.T., Hupponen, T., Klemelä, P., Gentile, M., Scheinin, I., Koski, M., Käki, J., and Korpelainen, E.I., 2011. Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics 12, 507. DOI=http://dx.doi.org/10.1186/1471-2164-12-507

[5] Kallioniemi, O., Wessels, L., and Valencia, A., 2011. On the organization of bioinformatics core services in biology-based research institutes. Bioinformatics 27, 10, 1345. DOI=http://dx.doi.org/10.1093/bioinformatics/btr125

[6] Kupfer, D.M., 2012. Cloud Computing in Biomedical Research. Aviation, Space, and Environmental Medicine 83, 2, 152-153. DOI=http://dx.doi.org/10.3357/ASEM.3242.2012

[7] O’Connor, B.D., Merriman, B., and Nelson, S.F., 2010. SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics 11(Suppl 12):S2, 9. DOI=http://dx.doi.org/10.1186/1471-2105-11-S12-S2

[8] Ovaska, K., Laakso, M., Haapa-Paananen, S., Louhimo, R., Chen, P., Aittomäki, V., Valo, E., Nunez-Fontarnau, J., Rantanen, V., Karinen, S., Nousiainen, K., Lahesmaa-Korpinen, A.M., Miettinen, M., Saarinen, L., Kohonen, P., Wu, J., Westermarck, J., and Hautaniemi, S., 2011. Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme. Genome Med 2, 9, 65. DOI=http://dx.doi.org/10.1186/gm186

[9] Qiu, J., Ekanayake, J., Gunarathne, T., Choi, J.Y., Bae, S.-H., Li, H., Zhang, B., Wu, T.-L., Ruan, Y., Ekanayake, S., Hughes, A., and Fox, G., 2010. Hybrid cloud and cluster computing paradigms for life science applications. BMC Bioinformatics 11(Suppl 12):S3, 6. DOI=http://dx.doi.org/10.1186/1471-2105-11-S12-S3

[10] Schadt, E.E., Linderman, M.D., Sorenson, J., Lee, L., and Nolan, G.P., 2011. Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology. Nature Rev. Genet. 12, 224. DOI=http://dx.doi.org/10.1038/nrg2857-c2

[11] Schatz, M.C., Langmead, B., and Salzberg, S.L., 2010. Cloud computing and the DNA data race. Nature Biotech. 28, 7, 691-693. DOI=http://dx.doi.org/10.1038/nbt0710-691

[12] Stein, L.D., 2010. The case for cloud computing in genome informatics. Genome Biology 11, 5, 207. DOI=http://dx.doi.org/10.1186/gb-2010-11-5-207

[13] Swertz, M.A., Dijkstra, M., Adamusiak, T., Velde, J.K.V.D., Kanterakis, A., Roos, E.T., Lops, J., Thorisson, G.A., Arends, D., Byelas, G., Muilu, J., Brookes, A.J., Brock, E.O.D., Jansen, R.C., and Parkinson, H., 2010. The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button. BMC Bioinformatics 11(Suppl 12):S12, 9. DOI=http://dx.doi.org/10.1186/1471-2105-11-S12-S12

[14] Taylor, R.C., 2010. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 11(Suppl 12):S1, 6. DOI=http://dx.doi.org/10.1186/1471-2105-11-S12-S1

[15] Wallom, D., Turilli, M., Martin, A., Raun, A., Taylor, G., Hargreaves, N., and McMoran, A., 2011. myTrustedCloud: Trusted Cloud Infrastructure for Security-critical Computation and Data Management. 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), 247-254. DOI=http://dx.doi.org/10.1109/CloudCom.2011.41

[16] Yelick, K., Coghlan, S., Draney, B., and Canon, R.S., 2011. The Magellan Report on Cloud Computing for Science. U.S. Department of Energy. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Magellan_final_report.pdf
