exploring architecture options for a federated, cloud...

Exploring Architecture Options for a Federated, Cloud-based Systems Biology

KnowledgebaseIan Gorton, Jenny Liu, Jian Yin

1

Systems Biology

Systems BiologyIntegrated study of organisms as a wholeObtain, integrate, and analyze complex data from multiple experimental sources using interdisciplinary tools

RequirementsLarge amount of dataDifferent types of toolsLarge amount of computation resources

2

Systems Biology Knowledgebase

Drawbacks of the current approachThreshold of entrance can be highLittle reusing and sharing of the data and tools, wasteful repetitive effort to develop similar software toolsResults are hard to replicated

Seamlessly sharing and integration of data and software tools between multiple institution are attractiveThe goal of system biology knowledgebase is to exploit cloud computing technologies to enable sharing of data and software tools

3

Why Cloud Computing

Enable sharing of data and software toolsDynamic allocation of computing resourcesMany software tools can be converted to run on top of cloud computing services such as Hadoop

4

Outline

IntroductionSystem ArchitecturePrototype of selected components

Case studyHadoop based systems biology tools

Conclusion

5

Centralize verse Federated

Advantages of centralized approachEase of integrationMore efficient computing resource allocations

However, many institute may want to retain controls of their data and toolsFederated approach

Leverage specialized computing resources across organizations

6

Architecture Overview

7

Workflow Tools, Web Portals, Desktop Apps

cURL scriptsphp java python

Cloud storage

Kbase core

Cloud computations

Cloud-based data APIs

e.g. EC2 e.g. Clusters

Kbase Interface Layer (for flexible federation of Kbase Data and Compute resources)

HPC-basedcomputations

RESTful API LayerMiddleware and Workflow Utilities Database Adaptors

e.g. S3

ExampleFederated Resources

Data and Resource Directories

User AccessLayer

InfrastructureLayer

FederationLayer

Semantic Access Interface Layer

Components

Location independent components Uniformed interfacesEasy compositionExecution can be monitored with JBPM

8

Secure Communication

Security must be ensured for communication across institutions

Only SSL traffic are allowed through firewallRequiring all the components to use SSL could be difficultUse SOCKS to minimize code changes of components

9

Example

Original codeURL url = new URL(urlname);Modified codeSocketAddress addr = new InetSocketAddress("localhost", 8182);Proxy proxy = new Proxy(Proxy.Type.SOCKS, addr); URL url = new URL(urlname); // Create the URLURLConnection uc = url.openConnection(proxy);

10

1Advanced Visualizations

VESPA

Prototype

Script: translate DNA (.fna file) in six frames

Protein fasta file(.faa file)

Visualization tool

Polygraph

Query & copy the .fna fileQuery & copy the

.faa file

Visualization at a user’s local workstation

Query & copy the .fna file and the .gbk file

peptide file

post-process script

parameter file

Proteomics data(dta files)

GenBankQuery & copy the .dta files

Hadoop Based Polygraph

Polygraph is a proteomics application to identify peptides from MS dataInitially implemented with MPILoosely coupled and suitable for HadoopSmall amount of effort to adapt it to run on top of Hadoop

12

Running Polygraph

13

Experimental Results

14

Comparison

MPI-base implementation is highly tuned and thus more efficientHadoop based approach is more flexible

Most cloud computing providers provide Hadoop serviceFlexibility for leveraging various amounts of computing resource without changing code

Can produce results even with one machineMore machines can speed up the computation

Many system biology applications can be adapted to Map Reduce paradigm

15

Conclusion

Sharing data, software tools, and computing resources is essential for systems biologyCloud computing can provide the ideal platforms

Many applications are loosely coupled and can be adapted to run in cloud computing environments

Federated approach provides more flexibilityUniformed interfaces enable easy integration

16

exploring architecture options for a federated, cloud...

Documents