L. Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds
Presentation at BOSC 2012 by L. Forer
Cloudgene - an execution platform for MapReduce programs in public and private clouds
Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner
University of Innsbruck, Austria
Medical University Innsbruck, Austria
BOSC 2012
Parallel approach
MapReduce
MapReduce: Simplified Data Processing on Large Clusters - Dean & Ghemawat - 2004
Serial approach
cluster
cloud
private public
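As a rough illustration of the MapReduce model cited above, the following sketch runs the three phases (map, shuffle, reduce) of a word count in a single Python process. The function names are chosen for this example only; on a real cluster the framework distributes the map and reduce calls across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster the map calls run in parallel on different nodes;
    # here they run one after another to keep the sketch self-contained.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The serial approach would loop over all lines in one process; the parallel approach keeps the same map and reduce logic but lets the framework split the input.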
How to support scientists when using (our) MapReduce programs?
Simplify the execution of MapReduce programs including data management
Simplify access to a working MapReduce cluster
Maintain data sensitivity
MapReduce in Genetics
CloudBurst
highly sensitive read mapping with MapReduce; Schatz, 2009
Crossbow
Searching for SNPs with cloud computing; Langmead et al., 2009
Myrna
Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al., 2010
Seal
a Distributed Short Read Mapping and Duplicate Removal Tool; Pireddu et al., 2012
Hadoop BAM
directly manipulating next generation sequencing data in the cloud; Niemenmaa et al., 2012
CloudBioLinux
CloudBioLinux: pre-configured and on-demand bioinformatics computing for the genomics community; Krampis et al., 2012
Difficulties with MapReduce
Required steps when the cluster is up and running and Hadoop is installed
Additional steps when setting up a cluster in a public environment
Approaches
Possible approaches
Program-specific approach
Implement a GUI for every program
Redundant work for the developer
Heterogeneity
Workflow systems
Galaxy, Taverna, Mobyle
Possible, but no HDFS support, black box
Our approach for Hadoop MapReduce
One GUI for different programs
Feedback, standardized import/export
Integration of programs via a plugin interface
Open-source platform to improve the usability of Hadoop MapReduce jobs
Provides a graphical web interface for their execution
Programs can be integrated by writing a simple configuration file
Public cloud & private cloud
Sets up a cluster in the cloud and installs all data on it
History of executed jobs with defined input/output parameters
Runs in your browser
Cloudgene
What is Cloudgene?
CloudBurst
Crossbow
Seal
CloudBioLinux
Myrna
Cloudgene
Features
Integration of programs easily possible
standard MapReduce programs (Java -> CloudBurst)
streaming jobs (e.g. Mapper and Reducer using Perl -> Myrna)
command line programs (e.g. using Pydoop -> Seal)
Data can be imported from different sources
S3 / HTTP / FTP
Import of huge datasets
Export results to S3 (public cloud)
Connect different MapReduce programs to a pipeline
Install additional programs via a web repository
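To make the "streaming jobs" item above concrete: a streaming mapper and reducer exchange plain text lines of the form key<TAB>value, and the reducer receives its input sorted by key. The sketch below uses Python instead of the Perl mentioned on the slide, and it assumes an input with one read sequence per line; both function names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'length<TAB>1' for every non-empty line (one sequence per line)."""
    for line in lines:
        sequence = line.strip()
        if sequence:
            yield f"{len(sequence)}\t1"


def reducer(lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"
```

In a streaming job each function would sit in its own script that reads from standard input and prints to standard output; locally, sorting the mapper output before passing it to the reducer imitates the shuffle step.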
Features
Cloudgene can be used on private and public clusters
Private cloud: sensitive data, local data
Public cloud: data on S3, no in-house cluster available
Open source
Summary
Cloudgene in Action
How to integrate a new program in Cloudgene
1. Implement the program (or use an existing one)
2. Write a plugin configuration file
Step 1 - Implement a program, executable via the command line
e.g. FASTQ pre-processing with MapReduce
base quality / sequence quality / duplication levels / length distribution
hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput
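A small sketch of how such a command line could be assembled programmatically, for example by a front end that collects the parameters in a form. The flags and values are taken from the slide; the helper name build_hadoop_command is an assumption made for this example.

```python
import shlex


def build_hadoop_command(jar, input_path, step, encoding, output_path):
    """Return the argument list for the 'hadoop jar' call from the slide."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-step", step,
        "-encoding", str(encoding),
        "-output", output_path,
    ]


command = build_hadoop_command(
    "exomePreprocessing.jar", "exomeData", "baseJob", 0, "resultsOutput"
)
print(shlex.join(command))
# To launch the job on a machine with Hadoop installed, one could use:
#   import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.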
Step 2 - Write a configuration file consisting of 3 parts
Part 1 - General information
Part 2 - Public cloud information
Part 3 - MapReduce information
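The transcript does not preserve the configuration file itself, so the following is only a sketch of what a three-part description (general, public cloud, MapReduce) might look like and how a platform could check and use it. Every field name, the "$name" placeholder convention and both helper functions are hypothetical; this is not Cloudgene's actual configuration format.

```python
# NOTE: all field names below are hypothetical and chosen for illustration only;
# this is not Cloudgene's actual configuration schema.
REQUIRED_PARTS = ("general", "public_cloud", "mapreduce")

example_config = {
    "general": {"name": "Exome Preprocessing"},
    "public_cloud": {"instance_type": "<instance type>"},
    "mapreduce": {
        "jar": "exomePreprocessing.jar",
        "params": ["-input", "$input", "-step", "baseJob",
                   "-encoding", "0", "-output", "$output"],
    },
}


def missing_parts(config):
    """List the required top-level parts that the configuration lacks."""
    return [part for part in REQUIRED_PARTS if part not in config]


def render_params(config, values):
    """Replace '$name' placeholders with the values supplied by the user."""
    rendered = []
    for token in config["mapreduce"]["params"]:
        if token.startswith("$"):
            rendered.append(values[token[1:]])
        else:
            rendered.append(token)
    return rendered
```

With such a description, one generic interface can ask the user for the placeholder values and build the job parameters without program-specific code.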
Different application – different GUI
Technologies
Apache Hadoop: http://hadoop.apache.org
Apache Whirr: http://whirr.apache.org
Restlet: http://www.restlet.org
ExtJS: http://www.sencha.com
H2: http://www.h2database.com
Evaluation
Amazon Elastic MapReduce (EMR)
Graphical execution for MapReduce programs
Excellent solution for public clouds
Combination with S3
but:
Data sensitivity
Reproducibility
Additional costs
[Bar chart: runtime in seconds (axis 0 to 4000 sec) of Cloudgene vs. Amazon EMR, broken down into Setup, Import, Calculation and Export]
Integrated programs
CloudBurst
Exome Preprocessing
Wordcount, Grep, etc.
Finding SNPs
in-house
Acknowledgements
Project-Website:
http://cloudgene.uibk.ac.at
Source Code:
http://github.com/genepi
Lukas Forer, Sebastian Schönherr, Hansi Weissensteiner
Anita Kloss-Brandstätter, Florian Kronenberg, Günther Specht
Thanks to the Open Source Community