langmead bosc2010 cloud-genomics

Cloud-scale genomics: examples and lessons. Ben Langmead, Department of Biostatistics

Upload: bosc-2010

Post on 11-Jun-2015


TRANSCRIPT

Page 1: Langmead bosc2010 cloud-genomics

Cloud-scale genomics: examples and lessons

Ben Langmead

Department of Biostatistics

Page 2: Langmead bosc2010 cloud-genomics

Why?

• Cost?

• Elastic supply

• Not my hardware

• Our only hope?

Why not?

• Cost?

• Harder to program

• Less user-friendly

• Data movement

• Loosely-coupled only

• Privacy (e.g. IRB)

Cloud debate on 1 slide

1.6 Gbp/day [1]   5 Gbp/day [1]   25 Gbp/day [2]

1. http://www.politigenomics.com/next-generation-sequencing-informatics

2. http://www.politigenomics.com/2010/01/hiseq-2000.html

Conclusion: let’s try it but hedge our bets

Page 3: Langmead bosc2010 cloud-genomics

Crossbow

[Figure: Crossbow pipeline, illustrated with example reads aligned against a reference sequence.]

• Align: parallel by read

• Aggregate: handled by Hadoop

• Statistics: parallel by genome bin

Example call: HET A, G

p-value: 0.0023
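The Align, Aggregate, Statistics flow on this slide can be sketched as a map, sort, reduce over genome bins. This is only an illustration, not Crossbow's code: the bin width, the `(chrom, pos, base)` record shape, and all function names are assumptions, and the reduce step merely tallies bases per position rather than calling SNPs.

```python
from collections import Counter, defaultdict

BIN_SIZE = 1_000_000  # hypothetical bin width in bases


def map_alignment(chrom, pos, base):
    """Map step: key one aligned base by the genome bin it falls in."""
    return (chrom, pos // BIN_SIZE), (pos, base)


def reduce_bin(records):
    """Reduce step: tally the bases observed at each position of one bin."""
    pileup = defaultdict(Counter)
    for pos, base in records:
        pileup[pos][base] += 1
    return dict(pileup)


def run(alignments):
    """Sort keyed records (standing in for Hadoop), then reduce each bin."""
    keyed = sorted(map_alignment(*a) for a in alignments)
    bins = defaultdict(list)
    for key, value in keyed:
        bins[key].append(value)
    return {key: reduce_bin(values) for key, values in bins.items()}


example = [
    ("chr1", 10, "A"),
    ("chr1", 10, "G"),
    ("chr1", 10, "A"),
    ("chr1", 1_500_000, "T"),
]
result = run(example)
```

Because each bin is reduced independently, the bins could in principle be handed to separate workers, which is the property the "parallel by genome bin" label points at.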

Page 4: Langmead bosc2010 cloud-genomics

Myrna

[Figure: Myrna pipeline, illustrated with example reads from Sample A and Sample B aligned against a reference sequence.]

• Align: parallel by read

• Aggregate: handled by Hadoop

• Overlap: parallel by genome bin

• Aggregate: handled by Hadoop

• Normalize: parallel by sample

• Aggregate: handled by Hadoop

• Statistics: parallel by gene

Example result: Gene 1 differentially expressed?: YES

p-value: 0.0012
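The Overlap and Normalize stages can be pictured with a small sketch. It is not Myrna's implementation: the slide does not say how Myrna normalizes, so dividing by each sample's total count is an assumption made purely for illustration, as are the record shapes and function names.

```python
from collections import defaultdict


def overlap_counts(alignments, genes):
    """Count reads per (sample, gene).

    alignments: iterable of (sample, chrom, pos)
    genes: dict mapping gene name -> (chrom, start, end), end exclusive
    """
    counts = defaultdict(int)
    for sample, chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos < end:
                counts[(sample, name)] += 1
    return dict(counts)


def normalize(counts):
    """Divide each count by its sample's total (assumed normalization)."""
    totals = defaultdict(int)
    for (sample, _gene), n in counts.items():
        totals[sample] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


genes = {"gene1": ("chr1", 100, 200), "gene2": ("chr1", 300, 400)}
alignments = [
    ("A", "chr1", 150),
    ("A", "chr1", 350),
    ("A", "chr1", 160),
    ("B", "chr1", 150),
]
counts = overlap_counts(alignments, genes)
normalized = normalize(counts)
```

Counting is keyed by gene while normalization is keyed by sample, which is why the pipeline regroups the data between those two stages.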

Page 5: Langmead bosc2010 cloud-genomics

Myrna: runtime and cost for 1.1 billion reads from the Pickrell et al study

EC2 nodes                  1 master, 10 workers   1 master, 20 workers   1 master, 40 workers
Worker CPU cores           80                     160                    320
Wall clock time            4h:20m                 2h:32m                 1h:38m
  Cluster setup            4m                     4m                     3m
  Align                    2h:56m                 1h:31m                 54m
  Overlap                  52m                    31m                    16m
  Normalize                6m                     7m                     6m
  Statistics               9m                     6m                     6m
  Summarize & Postprocess  13m                    14m                    13m
Approximate cost
(N. Virginia / Elsewhere)  $44.00 / $49.50        $50.40 / $56.70        $65.60 / $73.80

Table 1. Timing and cost for a Myrna experiment with 1.1 billion 35 bp unpaired reads from the Pickrell et al study as input. Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the Northern Virginia zone and $0.78 in other zones, plus a $0.12 per-node-per-hour surcharge for Elastic MapReduce in all zones. Times can vary subject to, for example, congestion and Internet traffic conditions.
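The cost row can be reproduced from the rates in the caption with a few lines of arithmetic. The sketch below assumes that every node, master included, is billed for each started hour of wall clock time; the caption does not state that rounding rule, but it matches all six figures in the table. The function name and the use of integer cents are choices made for this example.

```python
import math


def emr_cost_cents(nodes, wall_minutes, node_cents_per_hour,
                   surcharge_cents_per_hour=12):
    """Approximate cluster cost in cents.

    Assumes partial hours are billed as full hours (an assumption,
    not stated in the caption) and that the Elastic MapReduce
    surcharge applies to every node.
    """
    billed_hours = math.ceil(wall_minutes / 60)
    hourly = node_cents_per_hour + surcharge_cents_per_hour
    return nodes * hourly * billed_hours


# 1 master + 10 workers, 4h:20m, Northern Virginia rate ($0.68/hour)
small_cluster = emr_cost_cents(11, 4 * 60 + 20, 68)
# 1 master + 40 workers, 1h:38m, other zones ($0.78/hour)
large_cluster = emr_cost_cents(41, 60 + 38, 78)
print(small_cluster / 100, large_cluster / 100)
```

Working in integer cents avoids floating-point drift when summing the hourly rate and the surcharge.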

Data transfer adds about 1hr:15m, $11

Page 6: Langmead bosc2010 cloud-genomics

Myrna

71%

55%

Page 7: Langmead bosc2010 cloud-genomics

Bet-hedging architecture

[Figure: the same wrapped tools (bowtie, soapsnp) and the same Postprocess step run in three modes; only the driver script and the layer connecting the stages differ.]

• Cloud mode: Cloud driver script; Wrapper (bowtie), Wrapper (soapsnp), Postprocess; stages connected by Hadoop

• Hadoop mode: Hadoop driver script; Wrapper (bowtie), Wrapper (soapsnp), Postprocess; stages connected by Hadoop

• Single-computer mode: Singleton driver script; Wrapper (bowtie), Wrapper (soapsnp), Postprocess; stages connected by Perl, fork, sort
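The idea behind the single-computer mode, replacing Hadoop with an ordinary sort between stages, can be sketched as follows. The slide names Perl, fork, and sort; this Python version is a stand-in written for illustration, runs in one process, and uses hypothetical names (`run_local`, `count_mapper`, `sum_reducer`) and a hypothetical tab-separated record format.

```python
from itertools import groupby


def run_local(lines, mapper, reducer):
    """Single-computer stand-in for one Hadoop job: map, sort by key, reduce."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=lambda kv: kv[0])  # plays the role of Hadoop's shuffle
    output = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        output.extend(reducer(key, [value for _key, value in group]))
    return output


def count_mapper(line):
    # Hypothetical wrapper output: "read_id<TAB>chromosome"
    _read_id, chrom = line.split("\t")
    yield chrom, 1


def sum_reducer(key, values):
    yield key, sum(values)


per_chromosome = run_local(
    ["r1\tchr1", "r2\tchr2", "r3\tchr1"], count_mapper, sum_reducer
)
```

Keeping the mapper and reducer unaware of what runs them is what lets the same wrappers be reused across modes; only the runner would be swapped.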

Page 8: Langmead bosc2010 cloud-genomics

Acknowledgements

• Michael Schatz

• Jimmy Lin

• Mihai Pop

• Steven Salzberg

• Jeff Leek

• Kasper Hansen

• Hector Corrada Bravo

• Rafael Irizarry

Page 9: Langmead bosc2010 cloud-genomics

Crossbow

Data transfer adds about 1hr:15m, $28

Page 10: Langmead bosc2010 cloud-genomics

Crossbow

43%

58%