Strata Big Data Science talk on ADAM
TRANSCRIPT
ADAM: Fast, Scalable Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley
with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson
https://github.com/bigdatagenomics
Wednesday, February 12, 14
Problem
• Whole genome files are large
• Biological systems are complex
• Population analysis requires petabytes of data
• Analysis time is often a matter of life and death
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of | times, it was the worst | of times, it was the | age of wisdom, it was | the age of foolishness, …
It was the best | of times, it was the | worst of times, it was | the age of wisdom, it | was the age of foolishness,
It was the | best of times, it was | the worst of times, it | was the age of wisdom, | it was the age of | foolishness, …
It was | the best of times, it | was the worst of times, | it was the age of | wisdom, it was the age | of foolishness, …
It | was the best of times, | it was the worst of | times, it was the age | of wisdom, it was the | age of foolishness, …
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
• How can he reconstruct the text?
– 5 copies x 138,656 words / 5 words per fragment = 138k fragments
– The short fragments from every copy are mixed together
– Some fragments are identical
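The fragment arithmetic in this analogy can be checked with a small sketch. The Python below is only an illustration: the helper name `shred` and the choice of cut offsets are assumptions, and it shreds just the opening sentence rather than the whole book. Each copy is cut into five-word fragments starting at a different offset, the fragments are pooled, and identical fragments are counted.

```python
from collections import Counter

def shred(words, fragment_len=5, offset=0):
    """Cut a list of words into fragments of `fragment_len` words.

    `offset` is the length of a shorter first fragment, so that each copy
    is cut at different places; offset=0 means the first fragment is full.
    """
    fragments = []
    if offset:
        fragments.append(" ".join(words[:offset]))
    for i in range(offset, len(words), fragment_len):
        fragments.append(" ".join(words[i:i + fragment_len]))
    return fragments

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness,")
words = text.split()

# Five copies whose first fragments hold 5, 4, 3, 2, and 1 words.
copies = [shred(words, 5, offset) for offset in (0, 4, 3, 2, 1)]

# All fragments from every copy, mixed into one pile.
pile = [frag for copy in copies for frag in copy]

# Fragments that occur more than once cannot be told apart.
repeated = {frag: n for frag, n in Counter(pile).items() if n > 1}

# The slide's estimate: copies x words / words per fragment.
estimate = 5 * 138_656 // 5
```

Even in this one sentence, two different fragments show up twice in the pile, which is the "some fragments are identical" problem in miniature.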
What is ADAM?
• File formats: columnar file format that allows efficient parallel access to genomes
• API: interface for transforming, analyzing, and querying genomic data
• CLI: a handy toolkit for quickly processing genomes
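A rough sketch of why a columnar layout suits parallel access, in plain Python. The records and field names (`contig`, `start`, `mapq`, `sequence`) are hypothetical and are not ADAM's schema, and no real file format is involved; the point is only that a query needing one field can read one column and split it into independent chunks.

```python
# Hypothetical read records; field names are illustrative, not ADAM's schema.
rows = [
    {"contig": "chr1", "start": 100, "mapq": 60, "sequence": "ACGT"},
    {"contig": "chr1", "start": 150, "mapq": 12, "sequence": "TTGA"},
    {"contig": "chr2", "start": 30, "mapq": 47, "sequence": "GGCA"},
]

def to_columns(records):
    """Pivot row-oriented records into one list per field (column-oriented)."""
    return {field: [r[field] for r in records] for field in records[0]}

columns = to_columns(rows)

# A query that needs one field touches only that column, not whole records.
mean_mapq = sum(columns["mapq"]) / len(columns["mapq"])

def chunks(values, size):
    """Split a column into fixed-size chunks that could be scanned independently."""
    return [values[i:i + size] for i in range(0, len(values), size)]

# Each chunk yields a partial result; combining them gives the column total.
partial_sums = [sum(chunk) for chunk in chunks(columns["mapq"], 2)]
```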
The Broad Institute “Best Practices” Pipeline
SNAP, ADAM, Avocado
MLBase, GraphX
SiRen, CAGE
Design Goals
• Develop processing pipeline that enables efficient, scalable use of cluster/cloud
• Provide data format that has efficient parallel/distributed access across platforms
• Enhance semantics of data and allow more flexible data access patterns
Whole Genome Data Sizes

Tool      Pipeline Stage    Input                     Output
SNAP      Alignment         1GB Fasta, 150GB Fastq    250GB BAM
ADAM      Pre-processing    250GB BAM                 200GB ADAM
Avocado   Variant Calling   200GB ADAM                10MB ADAM

Variants found at about 1 in 1,000 loci
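The size changes across the pipeline can be worked out directly from the figures on this slide. The sketch below assumes decimal units (1GB = 1000MB) and reads the slide as three stages (alignment, pre-processing, variant calling). The genome length used for the variant estimate is not on the slide; roughly 3.2 billion loci is an assumed round figure for a human genome.

```python
GB = 1000  # megabytes per gigabyte (decimal units, an assumption)

# (stage, input size in MB, output size in MB), taken from the slide.
stages = [
    ("Alignment (SNAP)", 1 * GB + 150 * GB, 250 * GB),
    ("Pre-processing (ADAM)", 250 * GB, 200 * GB),
    ("Variant Calling (Avocado)", 200 * GB, 10),
]

for name, size_in, size_out in stages:
    print(f"{name}: output is {size_out / size_in:.5f}x the input size")

# Variant calling shrinks 200GB of reads to 10MB of variants.
reduction = stages[2][1] / stages[2][2]

# Assumption (not on the slide): a human genome has roughly 3.2 billion loci.
GENOME_LOCI = 3_200_000_000
expected_variants = GENOME_LOCI // 1000  # "about 1 in 1,000 loci"
```

Under these assumptions the last stage reduces the data by a factor of 20,000, and the 1-in-1,000 rate corresponds to a few million variant sites per genome.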
ADAM Stack
Physical
‣ Commodity Hardware
‣ Cloud Systems - Amazon, GCE, Azure
File/Block
‣ Hadoop Distributed Filesystem
‣ Local Filesystem
Record/Split
‣ Schema-driven records w/ Apache Avro
‣ Store and retrieve records using Parquet
‣ Read BAM Files using Hadoop-BAM
RDD
‣ Transform records using Apache Spark
‣ Query with SQL using Shark
‣ Graph processing with GraphX
‣ Machine learning using MLBase
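As a loose illustration of what "schema-driven records" means at the record layer, the sketch below declares a record type in Avro's JSON schema syntax and checks Python dictionaries against it. The schema `ExampleRead` and its fields are hypothetical, not ADAM's actual schema, and the `conforms` helper is a hand-written stand-in that handles only primitive types and unions; it does not use the Avro library.

```python
import json

# Hypothetical schema in Avro's JSON schema syntax; the record and field
# names are illustrative and are not ADAM's actual schema.
READ_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ExampleRead",
  "fields": [
    {"name": "contig",   "type": ["null", "string"]},
    {"name": "start",    "type": ["null", "long"]},
    {"name": "sequence", "type": "string"}
  ]
}
""")

# Mapping from the Avro primitive type names used above to Python types.
PYTHON_TYPES = {"string": str, "long": int, "null": type(None)}

def conforms(record, schema):
    """Return True if `record` has exactly the schema's fields with allowed types.

    Handles only primitive types and unions (lists) of primitive types.
    """
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    for f in fields:
        allowed = f["type"] if isinstance(f["type"], list) else [f["type"]]
        value = record[f["name"]]
        if not any(isinstance(value, PYTHON_TYPES[t]) for t in allowed):
            return False
    return True

mapped = {"contig": "chr1", "start": 100, "sequence": "ACGT"}
unmapped = {"contig": None, "start": None, "sequence": "ACGT"}
broken = {"contig": "chr1", "start": "100", "sequence": "ACGT"}
```

Here `mapped` and `unmapped` both satisfy the schema because the position fields are declared as nullable unions, while `broken` is rejected because `start` is a string.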
Implementation
• Work accelerated at end of September
• 15K lines of Scala code
• 100% Apache-licensed open-source
• Code contributions from Mt. Sinai, GenomeBridge, The Broad Institute
• Proof of concept implementations at The Broad Institute, Duke, Harvard, UCSC
• Global Alliance looking at ADAM as potential standard
[Figure: Commits]
AWS EC2 Performance
NA12878 Whole Genome: 234GB as BAM, 229GB as ADAM
Runtime in minutes (plotted on a log scale from 1 to 10,000)

Configuration     Sort    Mark Duplicates   Sort speedup vs. Picard
Picard            1064    1222              -
ADAM 1 node        536    -                 2x
ADAM 32 nodes       32    70                33x
ADAM 100 nodes      20    28                53x
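The speedup labels on this slide can be reproduced from the runtimes. The sketch below assumes that the 2x, 33x, and 53x labels belong to the series whose Picard baseline is 1064 minutes, and that the 536, 32, and 20 minute values are the ADAM runs on 1, 32, and 100 nodes in that series; this pairing is inferred from the numbers rather than stated on the slide.

```python
# Runtimes in minutes as read from the chart; the pairing of values with
# configurations is an assumption based on the printed speedup labels.
picard_minutes = 1064
adam_minutes = {"ADAM 1 node": 536, "ADAM 32 nodes": 32, "ADAM 100 nodes": 20}

# Speedup is the baseline runtime divided by the ADAM runtime.
speedups = {name: picard_minutes / minutes for name, minutes in adam_minutes.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x faster than Picard")
```

Because the chart values are whole minutes, the computed ratios only approximate the labels.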
Mt. Sinai Performance
NA12878 Whole Genome: 234GB as BAM, 229GB as ADAM
Runtime in minutes (plotted on a log scale from 1 to 10,000)

Configuration     Sort    Mark Duplicates   Sort speedup vs. Picard
Picard            1064    1222              -
ADAM 16 nodes       29    50                36x
ADAM 32 nodes       17    34                62x
ADAM 64 nodes       11    15                96x
ADAM 82 nodes        8    14                133x
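The same runtimes also show how well the job scales as nodes are added. The sketch below assumes the 29, 17, 11, and 8 minute values are the 16, 32, 64, and 82 node runs of the series with the 1064-minute Picard baseline, since those are the values that reproduce the 36x to 133x labels. Efficiency here means the speedup over the 16-node run divided by the increase in node count, so 1.0 would be perfectly linear scaling.

```python
import math

picard_minutes = 1064
adam_minutes = {16: 29, 32: 17, 64: 11, 82: 8}  # nodes -> minutes (assumed pairing)

# Speedup over Picard, truncated to a whole number as the slide's labels appear to be.
speedup_vs_picard = {
    nodes: math.floor(picard_minutes / minutes)
    for nodes, minutes in adam_minutes.items()
}

# Scaling relative to the smallest cluster: ideal speedup equals the node ratio.
base_nodes = 16
base_minutes = adam_minutes[base_nodes]
efficiency = {
    nodes: (base_minutes / minutes) / (nodes / base_nodes)
    for nodes, minutes in adam_minutes.items()
}

for nodes in adam_minutes:
    print(f"{nodes} nodes: {speedup_vs_picard[nodes]}x vs. Picard, "
          f"scaling efficiency {efficiency[nodes]:.2f}")
```

The chart values are whole minutes, so the efficiency figures are rough, especially for the shortest runs.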
Scalability
HG00096 → 16GB
NA12878 → 234GB
Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
Acknowledgements
This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA
XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Apple, Inc.,
Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric,
Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and
Yahoo!.
“Before seeing your data, I would not think sorting could be done that fast.”
- research scientist at the Broad Institute
Call for contributions
• As an open source project, we welcome contributions
• We maintain a list of open enhancements on our GitHub issue tracker
• Enhancements tagged with “Pick me up!” don’t require a genomics background
• GitHub: https://github.com/bigdatagenomics/