langmead bosc2010 cloud-genomics
Cloud-scale genomics: examples and lessons
Ben Langmead
Department of Biostatistics
Why?
• Cost?
• Elastic supply
• Not my hardware
• Our only hope?
Why not?
• Cost?
• Harder to program
• Less user-friendly
• Data movement
• Loosely-coupled only
• Privacy (e.g. IRB)
Cloud debate on 1 slide
1.6 Gbp/day [1]    5 Gbp/day [1]    25 Gbp/day [2]
1. http://www.politigenomics.com/next-generation-sequencing-informatics
2. http://www.politigenomics.com/2010/01/hiseq-2000.html
Conclusion: let’s try it but hedge our bets
Crossbow
[Figure: Crossbow pipeline]
• Align: reads are aligned to the reference (parallel by read)
• Aggregate: alignments are grouped by genome bin (handled by Hadoop)
• Statistics: calls are made in each bin (parallel by genome bin), e.g. "Call: HET A, G; p-value: 0.0023"
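The Crossbow stages above (align by read, aggregate, then work per genome bin) can be sketched roughly as follows. This is a toy illustration, not Crossbow's code: the bin width, the function names, and the alignment coordinates are all made up for the example, and a plain dictionary stands in for the grouping and sorting that Hadoop performs between stages.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # hypothetical bin width in bases (an assumption)


def map_alignment(chrom, pos, read):
    """Map step: key each aligned read by the genome bin it falls in."""
    return (chrom, pos // BIN_SIZE), (pos, read)


def aggregate(keyed_records):
    """Stand-in for Hadoop's shuffle/sort: group by bin, sort by position."""
    bins = defaultdict(list)
    for key, value in keyed_records:
        bins[key].append(value)
    return {key: sorted(values) for key, values in sorted(bins.items())}


# Toy alignments: the coordinates are invented for illustration only.
alignments = [
    ("chr1", 2_500_000, "GTCGCAGTATCTGTCT"),
    ("chr1", 120, "AATCTGATCTTATTTT"),
    ("chr1", 2_500_004, "TGTCGCAGTATCTGTC"),
    ("chr2", 77, "ATATATATATATATAT"),
]

bins = aggregate(map_alignment(*a) for a in alignments)
for key, reads in bins.items():
    # Each bin could now be handed to an independent per-bin statistics step.
    print(key, len(reads))
```

Because each bin is self-contained after aggregation, the per-bin step needs no communication with other bins, which is what makes the loosely-coupled model workable here.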
Myrna
[Figure: Myrna pipeline, shown for Sample A and Sample B]
• Align (parallel by read)
• Aggregate (handled by Hadoop)
• Overlap (parallel by genome bin)
• Aggregate (handled by Hadoop)
• Normalize (parallel by sample)
• Aggregate (handled by Hadoop)
• Statistics (parallel by gene), e.g. "Gene 1 differentially expressed?: YES, p-value: 0.0012"
Myrna
Runtime, Cost for 1.1 billion reads from Pickrell et al study

EC2 Nodes                    1 master,          1 master,          1 master,
                             10 workers         20 workers         40 workers
Worker CPU cores             80                 160                320
Wall clock time              4h:20m             2h:32m             1h:38m
Cluster setup                4m                 4m                 3m
Align                        2h:56m             1h:31m             54m
Overlap                      52m                31m                16m
Normalize                    6m                 7m                 6m
Statistics                   9m                 6m                 6m
Summarize & Postprocess      13m                14m                13m
Approximate cost             $44.00 / $49.50    $50.40 / $56.70    $65.60 / $73.80
(N. Virginia / Elsewhere)
Table 1. Timing and cost for a Myrna experiment with 1.1 billion 35 bp unpaired reads from the Pickrell et al study as input. Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the Northern Virginia zone and $0.78 in other zones, plus a $0.12 per-node-per-hour surcharge for Elastic MapReduce in all zones. Times can vary subject to, for example, congestion and Internet traffic conditions.
Data transfer adds about 1hr:15m, $11
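The costs in Table 1 appear consistent with multiplying the node count (workers plus one master) by the wall clock time rounded up to whole hours, and then by the hourly node price plus the Elastic MapReduce surcharge. The rounding rule and the billing of the master at the worker rate are assumptions, not something the caption states. A minimal Python sketch under those assumptions (the function and variable names are made up for the example):

```python
import math

# Hourly prices quoted in the Table 1 caption (USD per node per hour).
EC2_RATE = {"N. Virginia": 0.68, "Elsewhere": 0.78}
EMR_SURCHARGE = 0.12


def approximate_cost(workers, wall_clock_minutes, zone):
    """Estimate cluster cost.

    Assumes every started hour is billed in full and the master node is
    billed at the same rate as a worker (both are assumptions).
    """
    nodes = workers + 1  # 1 master + N workers
    billed_hours = math.ceil(wall_clock_minutes / 60)
    return nodes * billed_hours * (EC2_RATE[zone] + EMR_SURCHARGE)


# Wall clock times from Table 1, converted to minutes.
runs = {10: 4 * 60 + 20, 20: 2 * 60 + 32, 40: 1 * 60 + 38}
for workers, minutes in runs.items():
    nva = approximate_cost(workers, minutes, "N. Virginia")
    other = approximate_cost(workers, minutes, "Elsewhere")
    print(f"{workers} workers: ${nva:.2f} / ${other:.2f}")
```

Under this billing model, doubling the cluster shortens the run but raises the bill, because the billed node-hours grow faster than the wall clock time shrinks.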
Myrna
71%
55%
Bet-hedging architecture
[Figure: the same Wrapper (bowtie), Wrapper (soapsnp) and Postprocess stages under three drivers]
• Cloud mode: Cloud driver script; stages connected by Hadoop
• Hadoop mode: Hadoop driver script; stages connected by Hadoop
• Single-computer mode: Singleton driver script; stages connected by Perl, fork, sort
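The idea behind the single-computer mode, as far as the slide shows, is that the same per-stage wrappers can run without Hadoop if something else supplies the sort between stages. The sketch below illustrates that idea in Python rather than Perl and without forking worker processes. It is not the tools' actual driver code, and `run_local`, `toy_mapper` and `toy_reducer` are hypothetical names standing in for the real wrappers.

```python
from itertools import groupby
from operator import itemgetter


def run_local(records, mapper, reducer):
    """Single-computer stand-in for Hadoop: map, sort by key, then reduce."""
    mapped = [kv for record in records for kv in mapper(record)]
    mapped.sort(key=itemgetter(0))  # plays the role of Hadoop's shuffle/sort
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


# Hypothetical toy wrappers; the real stages would wrap bowtie and soapsnp.
def toy_mapper(read):
    yield read[:2], read  # key each read by its first two bases


def toy_reducer(key, reads):
    return key, len(reads)  # count the reads that share a key


result = run_local(["GTCG", "GTAA", "AATC"], toy_mapper, toy_reducer)
print(result)
```

Keeping the mapper and reducer unaware of what runs them is what allows the driver to be swapped between cloud, Hadoop, and single-computer modes.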
Acknowledgements
• Michael Schatz
• Jimmy Lin
• Mihai Pop
• Steven Salzberg
• Jeff Leek
• Kasper Hansen
• Hector Corrada Bravo
• Rafael Irizarry
Crossbow
Data transfer adds about 1hr:15m, $28
Crossbow
43%
58%