exploring architecture options for a federated, cloud...
TRANSCRIPT
Exploring Architecture Options for a Federated, Cloud-based Systems Biology
KnowledgebaseIan Gorton, Jenny Liu, Jian Yin
1
Systems Biology
Systems BiologyIntegrated study of organisms as a wholeObtain, integrate, and analyze complex data from multiple experimental sources using interdisciplinary tools
RequirementsLarge amount of dataDifferent types of toolsLarge amount of computation resources
2
Systems Biology Knowledgebase
Drawbacks of the current approachThreshold of entrance can be highLittle reusing and sharing of the data and tools, wasteful repetitive effort to develop similar software toolsResults are hard to replicated
Seamlessly sharing and integration of data and software tools between multiple institution are attractiveThe goal of system biology knowledgebase is to exploit cloud computing technologies to enable sharing of data and software tools
3
Why Cloud Computing
Enable sharing of data and software toolsDynamic allocation of computing resourcesMany software tools can be converted to run on top of cloud computing services such as Hadoop
4
Outline
IntroductionSystem ArchitecturePrototype of selected components
Case studyHadoop based systems biology tools
Conclusion
5
Centralize verse Federated
Advantages of centralized approachEase of integrationMore efficient computing resource allocations
However, many institute may want to retain controls of their data and toolsFederated approach
Leverage specialized computing resources across organizations
6
Architecture Overview
7
Workflow Tools, Web Portals, Desktop Apps
cURL scriptsphp java python
Cloud storage
Kbase core
Cloud computations
Cloud-based data APIs
e.g. EC2 e.g. Clusters
Kbase Interface Layer (for flexible federation of Kbase Data and Compute resources)
HPC-basedcomputations
RESTful API LayerMiddleware and Workflow Utilities Database Adaptors
e.g. S3
ExampleFederated Resources
Data and Resource Directories
User AccessLayer
InfrastructureLayer
FederationLayer
Semantic Access Interface Layer
Components
Location independent components Uniformed interfacesEasy compositionExecution can be monitored with JBPM
8
Secure Communication
Security must be ensured for communication across institutions
Only SSL traffic are allowed through firewallRequiring all the components to use SSL could be difficultUse SOCKS to minimize code changes of components
9
Example
Original codeURL url = new URL(urlname);Modified codeSocketAddress addr = new InetSocketAddress("localhost", 8182);Proxy proxy = new Proxy(Proxy.Type.SOCKS, addr); URL url = new URL(urlname); // Create the URLURLConnection uc = url.openConnection(proxy);
10
1Advanced Visualizations
VESPA
Prototype
Script: translate DNA (.fna file) in six frames
Protein fasta file(.faa file)
Visualization tool
Polygraph
Query & copy the .fna fileQuery & copy the
.faa file
Visualization at a user’s local workstation
Query & copy the .fna file and the .gbk file
peptide file
post-process script
parameter file
Proteomics data(dta files)
GenBankQuery & copy the .dta files
Hadoop Based Polygraph
Polygraph is a proteomics application to identify peptides from MS dataInitially implemented with MPILoosely coupled and suitable for HadoopSmall amount of effort to adapt it to run on top of Hadoop
12
Running Polygraph
13
Experimental Results
14
Comparison
MPI-base implementation is highly tuned and thus more efficientHadoop based approach is more flexible
Most cloud computing providers provide Hadoop serviceFlexibility for leveraging various amounts of computing resource without changing code
Can produce results even with one machineMore machines can speed up the computation
Many system biology applications can be adapted to Map Reduce paradigm
15
Conclusion
Sharing data, software tools, and computing resources is essential for systems biologyCloud computing can provide the ideal platforms
Many applications are loosely coupled and can be adapted to run in cloud computing environments
Federated approach provides more flexibilityUniformed interfaces enable easy integration
16