everything comes in 3's
DESCRIPTION
A talk given at the BioIT World Conference 2010 Cloud Computing Workshop
TRANSCRIPT
![Page 1: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/1.jpg)
Everything Comes in 3’s
Angel Pizarro
Director, ITMAT Bioinformatics Facility
University of Pennsylvania School of Medicine
![Page 2: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/2.jpg)
Outline
• This talk looks at the practical aspects of Cloud Computing
  – We will be diving into specific examples
• 3 pillars of systems design
• 3 storage implementations
• 3 areas of bioinformatics
  – And how they are affected by clouds
• 3 interesting internal projects

There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
![Page 3: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/3.jpg)
Pillars of Systems Design
1. Provisioning
  – API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
  – Not discussing further, since this is the WHOLE POINT of cloud computing.
2. Configuration
  – How to get a system up to the point you can do something with it
3. Command and Control
  – How to tell the system what to do
![Page 4: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/4.jpg)
System Configuration with Chef
• Automatic installation of packages, service configuration and initialization
• Specifications use a real programming language with known behavior
• Runs are idempotent: applying the same specification again leaves the system in the same state
• http://opscode.com/chef/
http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
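The idempotence point on this slide can be illustrated with a minimal sketch. This is not Chef code (Chef recipes are written in a Ruby DSL); it is a plain Python stand-in, and the function name `ensure_line` and the example setting are hypothetical. The idea is that a configuration step checks the current state first and only acts when the system differs from the desired state.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it was already
    in the desired state. Hypothetical helper, not part of Chef.
    """
    existing = []
    if os.path.exists(path):
        with open(path) as handle:
            existing = handle.read().splitlines()
    if line in existing:
        return False  # already configured, so do nothing
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        config = os.path.join(workdir, "example.conf")
        print(ensure_line(config, "max_connections = 100"))  # first run changes the file
        print(ensure_line(config, "max_connections = 100"))  # second run is a no-op
```

Running the step twice leaves the file identical to running it once, which is what makes repeated configuration runs safe.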
![Page 5: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/5.jpg)
Chef Recipes & Cookbooks
• The specification for installing and configuring a system component
• Able to support more than one platform
• Has access to system-wide information
  – hostname, IP addr, RAM, # processors, etc.
• Contain templates, documentation, static files & assets
• Can define dependencies on other recipes
• Executed in order, execution stops at first failure
![Page 6: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/6.jpg)
Simple Recipe : Rsync
• Install rsync to the system
• Metadata file states what platforms are supported
• Note that Chef is a Linux-centric system
• BUT, the WikiWiki is MessyMessy
  – Look at Chef Solo & Resources
![Page 7: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/7.jpg)
More Complex Recipe: Heartbeat
• Installs heartbeat package
• Registers the service, specifying that it can be restarted and that it provides a status message
• Finally it starts the service
![Page 8: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/8.jpg)
Command and Control
• Traditional grid computing
  – QSUB (SGE, PBS, Torque)
  – Usually requires tightly coupled and static systems
  – Shared file systems, firewalls, user accounts, shared exe & lib locations
  – Best for capability processes (e.g. MPI)
• Map-Reduce is the new hotness
  – Best for data-parallel processes
  – Assumes loosely coupled non-static components
  – Job staging is a critical component
![Page 9: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/9.jpg)
Map Reduce in a Nutshell
• Algorithm pioneered by Google for distributed data analysis
  – Data-parallel analyses fit well into this model
  – Split data, work on each part in parallel, then merge results
• Hadoop, Disco, CloudCrowd, …
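The split / work in parallel / merge pattern can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hadoop, Disco or CloudCrowd are implemented: threads on one machine stand in for separate worker nodes, and the function names and the base-counting task are chosen for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(records, n_parts):
    """Split a list of records into at most n_parts roughly equal chunks."""
    size = max(1, -(-len(records) // n_parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count the letters in one chunk of sequences."""
    counts = Counter()
    for sequence in chunk:
        counts.update(sequence)
    return counts


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def map_reduce(records, n_parts=4):
    """Split the records, process each chunk in parallel, then merge."""
    chunks = split(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


if __name__ == "__main__":
    reads = ["ACGT", "AACC", "GGTT", "ACAC", "TTTT"]
    print(map_reduce(reads))
```

The approach works here because each chunk can be counted without looking at any other chunk, which is the data-parallel property the slide refers to.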
![Page 10: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/10.jpg)
Serial Execution of Proteomics Search
![Page 11: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/11.jpg)
Parallel Proteomics Search
![Page 12: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/12.jpg)
Roll-Your-Own MR on AWS
• Define small scripts to
  – Split a FASTA file
  – Run a BLAT search
  – The first script defines the inputs of the second
• Submit the input FASTA to S3
• Start a master node as the central communication hub
• Start slave nodes, configured to ask for work from master and save results back to S3
• Press “Play”
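As a rough illustration of the first script, the sketch below splits FASTA input into chunks of a fixed number of records, so that each chunk could become the input of one search job. It is not the script used in the talk; the function names and the chunking rule (a fixed record count per chunk) are assumptions made for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_fasta(lines, records_per_chunk):
    """Group FASTA records into chunks of at most records_per_chunk records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk = []
    for record in read_fasta(lines):
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly smaller, chunk


def format_chunk(chunk):
    """Render one chunk back into FASTA text, one sequence per line."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in chunk)


if __name__ == "__main__":
    example = [">read1", "ACGT", "ACGT", ">read2", "GGCC", ">read3", "TTAA"]
    for number, chunk in enumerate(split_fasta(example, 2)):
        print(f"chunk {number}:")
        print(format_chunk(chunk), end="")
```

Each formatted chunk would then be written out and handed to a worker as the query for one search.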
![Page 13: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/13.jpg)
Workflow of Distributed BLAT
[Diagram: a PC, S3, one Master node and four Slave nodes. Arrow labels: Boot master & slaves, Upload inputs, Download results, Submit the BLAT job]
Initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go.
![Page 14: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/14.jpg)
Master Node => Resque
• Background job processing framework developed by GitHub
• Jobs attached to a class from your application, stored as JSON
• Uses the Redis key-value store
• Simple front end for viewing job queue status and failed jobs
Resque can invoke any class that has a class method “perform()”
http://github.com/defunkt/resque
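The job model described on this slide can be sketched as follows. Resque itself is a Ruby library backed by Redis; this Python stand-in uses an in-memory queue instead, and every name in it (`QUEUE`, `REGISTRY`, `register`, `enqueue`, `work_one`, `SplitFasta`) is hypothetical. It only shows the shape of the idea: a job is a small JSON document naming a class and its arguments, and a worker looks the class up and calls its `perform` class method.

```python
import json
from collections import deque

QUEUE = deque()   # in-memory stand-in for the Redis-backed queue
REGISTRY = {}     # maps job class names to the classes themselves


def register(cls):
    """Class decorator that makes a job class discoverable by name."""
    REGISTRY[cls.__name__] = cls
    return cls


def enqueue(cls, *args):
    """Store a job as JSON: the class name plus its arguments."""
    QUEUE.append(json.dumps({"class": cls.__name__, "args": list(args)}))


def work_one():
    """Take one job off the queue and run it.

    Returns the job's result, or None when the queue is empty.
    """
    if not QUEUE:
        return None
    job = json.loads(QUEUE.popleft())
    return REGISTRY[job["class"]].perform(*job["args"])


@register
class SplitFasta:
    @classmethod
    def perform(cls, s3_key, n_parts):
        # A real job would fetch the file and enqueue follow-up jobs;
        # this one only names the parts it would produce.
        return [f"{s3_key}.part{i}" for i in range(n_parts)]


if __name__ == "__main__":
    enqueue(SplitFasta, "input.fa", 3)
    print(work_one())
```

Because the queued job is plain JSON, any worker that has the same classes loaded can pick it up, which is what lets worker nodes come and go.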
![Page 15: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/15.jpg)
The scripts
![Page 16: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/16.jpg)
Storage in the Cloud : S3
• Permanent storage for your data
• Pay as you go for transmission and holding
• Eliminates backups
• Pretty good CDN
  – Able to hook into better CDN SLA via CloudFront
• Can be slow at times
  – Reports of 10 second delay, but average is 300ms response
[Diagram: Your Data stored in S3]
![Page 17: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/17.jpg)
S3 Costs

| Usage | Rate | Example usage |
|---|---|---|
| Storage | $0.15 per GB per month | 1,690 GB |
| Transfer in | $0.10 per GB | 100 GB per month |
| Transfer out | $0.15 per GB | 100 GB per month |
| PUT/POST requests | $0.01 per 1,000 requests | 1,000,000 requests |
| GET requests | $0.01 per 10,000 requests | 1,000,000 requests |

Total: $289.50 per month
$0.17 per GB per month
$2.06 per GB per year
$3,474.00 per 1,690 GB per year
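The totals on this slide can be reproduced with a short calculation. The rates are the ones listed on the slide; the function and parameter names are made up for the sketch, and it assumes the transfer and request volumes are monthly figures.

```python
def s3_monthly_cost(storage_gb, gb_in, gb_out, put_requests, get_requests):
    """Monthly S3 bill in dollars, using the rates shown on the slide."""
    storage = storage_gb * 0.15                       # $0.15 per GB stored
    transfer = gb_in * 0.10 + gb_out * 0.15           # per GB in / out
    requests = (put_requests / 1_000) * 0.01 \
        + (get_requests / 10_000) * 0.01              # PUT/POST and GET pricing
    return storage + transfer + requests


monthly = s3_monthly_cost(1690, 100, 100, 1_000_000, 1_000_000)
per_gb_month = monthly / 1690
per_gb_year = monthly * 12 / 1690
yearly = monthly * 12

if __name__ == "__main__":
    print(f"per month:       ${monthly:,.2f}")
    print(f"per GB per month: ${per_gb_month:.2f}")
    print(f"per GB per year:  ${per_gb_year:.2f}")
    print(f"per year:        ${yearly:,.2f}")
```

Storage dominates the bill in this example: about $253.50 of the monthly total is the 1,690 GB held, with transfer and requests making up the rest.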
![Page 18: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/18.jpg)
Storage 2: Distributed FS on EC2
• Hadoop HDFS, Gigaspaces, etc.
• Network latency may be an issue for traditional DFSs
  – Gluster, GPFS, etc.
• Tighter integration with execution framework, better performance?
[Diagram: Your Data spread across the disks of several EC2 nodes]
![Page 19: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/19.jpg)
DFS on EC2 m1.xlarge Costs

| Item | Value |
|---|---|
| Initial cost: 3-yr reserved instance fee | $2,800.00 |
| Usage cost | $0.24 / hr |
| Hours per day | 24 |
| Days per year | 365 |
| Years | 3 |
| Total 3-yr cost | $9,107.20 |
| Cost per 1,690 GB per year* | $3,035.73 |
| Cost per GB per year* | $1.80 |

* Does not take into account transmission fees or data redundancy. Final cost is probably >= S3.
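The reserved-instance arithmetic on this slide works out as below. The fee, hourly rate and 1,690 GB figure come from the slide; the function name is made up, and the sketch assumes the node runs around the clock for the full three years.

```python
HOURS_PER_YEAR = 24 * 365


def reserved_total_cost(upfront_fee, hourly_rate, years=3):
    """Upfront reservation fee plus hourly usage for a node that never stops."""
    return upfront_fee + hourly_rate * HOURS_PER_YEAR * years


total_3yr = reserved_total_cost(2800.00, 0.24)
per_year = total_3yr / 3            # one node holding 1,690 GB of disk
per_gb_year = per_year / 1690

if __name__ == "__main__":
    print(f"3-year total:    ${total_3yr:,.2f}")
    print(f"per year:        ${per_year:,.2f}")
    print(f"per GB per year: ${per_gb_year:.2f}")
```

As the footnote says, this is a lower bound: storing every block on more than one node multiplies the per-GB figure by the number of copies.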
![Page 20: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/20.jpg)
Storage 3: Memory Grids
• “RAM is the new Disk”
• Application level RAM clustering
  – Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
• Performance for capability jobs?
[Diagram: Your Data held in the RAM of several EC2 nodes]
* There is also the “Disk is the new RAM” group, where redundant disk is used to mitigate seek times on subsequent reads
![Page 21: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/21.jpg)
Memory Grid Cost

| Item | Value |
|---|---|
| Initial cost: 3-yr reserved instance fee | $9,800.00 |
| Usage cost | $0.84 / hr |
| Hours per day | 24 |
| Days per year | 365 |
| Years | 3 |
| Total 3-yr cost | $31,875.20 |
| Cost per year | $10,625.07 |
| Cost per GB per year | $155.34 |
| Cost per 1,690 GB per year | $262,519.92 |

Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
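A short sketch of how the per-GB figure scales. The fee and hourly rate are from the slide. The slide does not state how much RAM each node has; 68.4 GB is an assumption here, chosen because it reproduces the slide's $155.34 per GB per year. The variable names are made up.

```python
import math

HOURS_PER_YEAR = 24 * 365
RAM_PER_NODE_GB = 68.4   # assumption: not stated on the slide, implied by its figures

cost_per_node_year = (9800.00 + 0.84 * HOURS_PER_YEAR * 3) / 3
cost_per_gb_year = cost_per_node_year / RAM_PER_NODE_GB

# The slide scales the per-GB price linearly to 1,690 GB.
linear_cost_1690 = cost_per_gb_year * 1690

# Nodes can only be rented whole, so a real grid needs a whole number of them.
nodes_needed = math.ceil(1690 / RAM_PER_NODE_GB)
whole_node_cost_1690 = nodes_needed * cost_per_node_year

if __name__ == "__main__":
    print(f"per node per year: ${cost_per_node_year:,.2f}")
    print(f"per GB per year:   ${cost_per_gb_year:,.2f}")
    print(f"1,690 GB, linear:  ${linear_cost_1690:,.2f}")
    print(f"1,690 GB, {nodes_needed} whole nodes: ${whole_node_cost_1690:,.2f}")
```

Rounding up to whole nodes makes the RAM option slightly more expensive than the linear estimate, which only strengthens the take-home message.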
![Page 22: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/22.jpg)
Cloud Influence on Bioinformatics
• Computational Biology
  – Algorithms will need to account for large I/O latency
  – Statistical tests will need to account for incomplete information, or incremental results
• Software Engineering
  – Built-for-the-cloud algorithms are popping up
    • CloudBurst is a featured example in AWS EMR!
• Application to Life Sciences
  – Deploy ready-made images for use
    • Cycle Computing, ViPDAC, others soon to follow
![Page 23: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/23.jpg)
Algorithms need to be I/O centric
• Incur a slightly higher computational burden to reduce I/O across non-optimal networks
P. Balaji, W. Feng, H. Lin 2008
![Page 24: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/24.jpg)
Some Internal Projects
• Resource Manager
  – Service for on-demand provisioning and release of EC2 nodes
  – Utilizes Chef to define and apply roles (compute node, DB server, etc.)
  – Terminates idle compute nodes at 52 minutes
• Workflow Manager
  – Defines and executes data analysis workflows
  – Relies on RM to provision nodes
  – Once appropriate worker nodes are available, acts as the central work queue
• RUM
  – RNA-Seq Ultimate Mapper
  – Map Reduce RNA-Seq analysis pipeline
  – Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
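The idle-termination rule of the Resource Manager might look something like the sketch below. The slide only says that idle compute nodes are terminated at 52 minutes; reading that as 52 minutes into each hour since launch (so that an idle node is released shortly before another full hour of usage would begin) is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

TERMINATE_AT_MINUTE = 52  # from the slide; the hourly interpretation is an assumption


def minutes_into_current_hour(launched_at, now):
    """Whole minutes elapsed in the node's current hour since launch."""
    elapsed_minutes = int((now - launched_at).total_seconds() // 60)
    return elapsed_minutes % 60


def should_terminate(launched_at, now, is_idle):
    """Release a node only if it is idle and late in its current hour."""
    if not is_idle:
        return False
    return minutes_into_current_hour(launched_at, now) >= TERMINATE_AT_MINUTE


if __name__ == "__main__":
    launched = datetime(2010, 4, 20, 9, 0)
    for minutes in (30, 52, 65, 115):
        now = launched + timedelta(minutes=minutes)
        print(minutes, should_terminate(launched, now, is_idle=True))
```

Keeping an idle node until late in the hour means it stays available for new work during time that may already have been paid for.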
![Page 25: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/25.jpg)
Bowtie Alone
[Pie chart: Mapping Efficiency. Legend: Mapped, Ambiguous, Unmapped. Values: 74%, 8%, 18%]
[Pie chart: Mapping Breakdown. Legend: Unique Paired, Unique Single, Ambiguous. Values: 38.0%, 37.0%, 25.0%]
![Page 26: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/26.jpg)
RUM (Bowtie + BLAT + processing)
[Pie chart: Mapping Breakdown. Legend: Unique Paired, Unique Single, Ambiguous. Values: 70%, 16%, 14%]
[Pie chart: Mapping Efficiency. Legend: Mapped, Unmapped, Mapped Ambiguously. Values: 81%, 4%, 15%]
Significantly increases the confidence in your data
![Page 27: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/27.jpg)
RUM Costs
• Computational cost ~$100 - $200
  – 6-8 hours per lane on m2.4xlarge ($2.40 / hour)
• Cost of reagents ~= $10,000
Computation is about 1% of the total
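A quick check of these numbers. The hourly rate, hours per lane and reagent cost come from the slide. The slide does not say how many lanes the $100 - $200 figure covers; the 8 lanes used below are a hypothetical value for illustration, as are the function names.

```python
HOURLY_RATE = 2.40       # m2.4xlarge, from the slide
REAGENT_COST = 10_000    # from the slide
LANES_PER_RUN = 8        # assumption: not stated on the slide


def lane_cost(hours):
    """Compute cost in dollars of analysing one lane."""
    return hours * HOURLY_RATE


def compute_share(compute_cost, reagent_cost=REAGENT_COST):
    """Fraction of the combined cost that goes to computation."""
    return compute_cost / (compute_cost + reagent_cost)


if __name__ == "__main__":
    for hours in (6, 8):
        run_cost = lane_cost(hours) * LANES_PER_RUN
        print(f"{hours} h per lane: ${lane_cost(hours):.2f} per lane, "
              f"${run_cost:.2f} per {LANES_PER_RUN}-lane run")
    for cost in (100, 200):
        print(f"${cost} of compute is {compute_share(cost):.1%} of the total")
```

Under these assumptions a run lands inside the quoted $100 - $200 range, and compute stays at roughly 1 to 2 percent of the combined cost.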
![Page 28: Everything comes in 3's](https://reader033.vdocuments.us/reader033/viewer/2022060118/558964edd8b42aac4b8b4657/html5/thumbnails/28.jpg)
Acknowledgements
• Garret FitzGerald
• Ian Blair
• John Hogenesch
• Greg Grant
• Tilo Grosser
• NIH & UPENN for support
• My Team
  – David Austin
  – Andrew Brader
  – Weichen Wu
Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s