Genomics Is Not Special: Toward Data-Intensive Biology
DESCRIPTION
Genomics and the life sciences are using antiquated technology for processing data. As data volumes grow in the life sciences, many in the biology community are reinventing the wheel, without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.

TRANSCRIPT
Genomics Is Not Special
Toward Data-Intensive Biology
Uri Laserson // [email protected] // 13 November 2014
© 2014 Cloudera, Inc. All rights reserved.
http://omicsmaps.com/
>25 Pbp / year
Carr and Church, Nat. Biotech. 27: 1151 (2009)
For every “-ome” there’s a “-seq”
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
http://liorpachter.wordpress.com/seq/
Based on IMGT/LIGM release 201111-6
Developer/computational efficiency becoming paramount
Genome Biology 12: 125 (2011)
Software and data management practices have been around since the 1970s
• Version control/reproducibility
• Testing/automation/integration
• Databases and data formats
• API design
• Lots (most?) of big data innovation happening in industry
Example query
For each variant that is
• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation, using samples in the Framingham Heart Study
CHR  POS       REF  ALT  POP           MAF    POLYPHEN
7    12289237  A    G    Plain         0.01   possibly damaging
7    12289237  A    G    Star-bellied  0.03   possibly damaging
12   2288332   T    C    Plain         0.003  probably damaging
12   2288332   T    C    Star-bellied  0.09   probably damaging
Available data
Data set                 Format            Size
Population genotypes     VCF               10-100s of billions
DNase HS sites (ENCODE)  narrowPeak (BED)  <1 million
dbSNP                    CSV               10s of millions
Sample phenotypes        JSON              thousands
Why text data is a bad idea
• Text is highly inefficient
  • Compresses poorly
  • Values must be parsed
• Text is semi-structured at best
  • Flexible schemas make parsing difficult
  • Difficult to make assumptions on data structure
• Text poorly separates the roles of delimiters and data
  • Requires escaping of control characters
  • (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used)
• But still almost always better than Excel
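A quick sketch of the first point, with a made-up column of values (sizes will vary, but the shape of the result holds: text is wider than fixed-width binary and every value must be re-parsed on read):

```python
import gzip
import struct

# Made-up column of 100,000 small integer values (e.g., read depths).
values = [(i * 7) % 100 for i in range(100_000)]

# Text encoding: variable width, needs a delimiter, needs parsing on read.
text = ",".join(str(v) for v in values).encode()
parsed = [int(tok) for tok in text.decode().split(",")]  # per-value int() cost

# Binary encoding: fixed width, no delimiter, no parsing.
binary = struct.pack(f"<{len(values)}B", *values)

print("text bytes:", len(text), "binary bytes:", len(binary))
print("gzipped:", len(gzip.compress(text)), "vs", len(gzip.compress(binary)))
```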
Some reasons VCF in particular is bad
• Number of records (variants) grows with new variants, rather than new genotypes
  • difficult to write data
  • adding a sample requires rewrite of entire file
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:
  • catalogue of variation
  • repository of actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
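The non-unique encoding is worth one concrete illustration: the same indel can be written several ways, so records must be normalized before comparison. A minimal sketch of prefix/suffix trimming (not a full normalizer; positions and alleles are made up):

```python
def normalize(pos, ref, alt):
    """Trim shared suffix, then shared prefix, so that equivalent
    VCF-style records compare equal. Illustrative only."""
    # trim common suffix
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # trim common prefix, advancing the position
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# The same 3-bp deletion as two different callers might write it:
a = normalize(100, "CTCC", "C")
b = normalize(100, "CTCCT", "CT")   # trailing "T" trimmed -> same record
assert a == b
```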
Manually executing query in Python
import json

class IntervalTree(object):
    def update(self, feature):
        pass  # ...implement tree update
    def overlaps(self, feature):
        pass  # ...implement overlap test (True/False)

# parse_feature() and is_framingham() assumed defined elsewhere

dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
    for line in ip:
        feature = parse_feature(line)
        dnase_sites.update(feature)

samples = {}
with open('path/to/samples.json', 'r') as ip:
    for line in ip:
        sample = json.loads(line)
        if is_framingham(sample):
            samples[sample['name']] = sample

dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
    for line in ip:
        snp = tuple(line.split()[:3])
        dbsnp.add(snp)
Additional metadata must fit in memory
Can only read from POSIX filesystem
Manually executing query in Python
genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')  # PyVCF
for variant in reader:
    if (dnase_sites.overlaps(variant) and is_deleterious(variant)
            and not in_dbsnp(variant)):
        for call in variant.samples:
            if call.sample in samples:
                pop = samples[call.sample]['population']
                genotype_data.setdefault((variant, pop), []).append(call)

mafs = {}
for (variant, pop), calls in genotype_data.items():
    mafs[(variant, pop)] = compute_maf(calls)
Genotype data may be split across files
Manually executing query in Python
• If file is gzipped, cannot split file without decompressing (use Snappy)
• Reading files requires access to a POSIX-style file system
• Probably want to split VCF file into pieces to parallelize
  • Requires manual scatter-gather
• Samples may be scattered among multiple VCF files (difficult to append to VCF)
• Manually implementing broadcast join
  • Build side must fit into memory
Manually executing query in Python on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
Manually executing query in Python on HPC
• How to serialize intermediate output?
• Manually specify requested resources
• Manually split and merge
• Babysit and check for errors/failures
HPC separates compute from storage
HPC is about compute. Hadoop is about data.
• Storage infrastructure: proprietary, distributed file system; expensive
• Compute cluster: high-performance, reliable hardware; expensive
• The two connected by a big network pipe ($$$)
User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
HPC uses the file system as a DB, and the text file as the lowest common denominator
• All tools assume flat files with POSIX semantics
• Sharing data/collaboration involves copying large files
• Broad joint caller with 25k genomes hits file handle limits
• Files always streamed over network (HPC architecture)
HPC uses job scheduler as workflow tool
• Submitting jobs to scheduler is low level
• Workflow engines/execution models provide high-level execution graphs with built-in fault tolerance
  • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
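A toy sketch of what such an engine automates, using the standard-library graphlib for dependency ordering (the task names mirror the HPC example earlier; the task body and retry policy are made up, not any real engine's API):

```python
from graphlib import TopologicalSorter

# Made-up workflow: split, two parallel aggregations, then a merge.
deps = {
    "split_genotypes": set(),
    "query_agg_1": {"split_genotypes"},
    "query_agg_2": {"split_genotypes"},
    "merge_maf": {"query_agg_1", "query_agg_2"},
}

def do_work(task):
    return f"{task}: ok"  # stand-in for the real task body (hypothetical)

def run(task, attempts=3):
    # Retry on failure: the babysitting a workflow engine does for you.
    last_err = None
    for _ in range(attempts):
        try:
            return do_work(task)
        except Exception as e:
            last_err = e
    raise RuntimeError(f"{task} failed after {attempts} attempts") from last_err

# Execute in dependency order (a real engine would also parallelize ready tasks).
results = [run(t) for t in TopologicalSorter(deps).static_order()]
```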
Prepping data for local analysis in R/Python
• Manual script to prepare CSV file for working locally
• Same issues as above
• Requires working set of data to fit into memory of a single machine
• Visualization
Domain-specific tools (e.g., PLINK/Seq)
$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp
• v-stats: one of a limited set of specific, useful tasks
• --mask ...: (yet another) custom query specification
Domain-specific tools (e.g., PLINK/Seq)
• Works great if your problem fits into the pre-designed computations
• Only works if your problem fits into the pre-designed computations
• How to do stats by subpopulation?
  • Probably possible, but need to learn new notation
• Must work to get data in to begin with
• Not obviously parallelizable for performance on large data sets
• Built on SQLite underneath
RDBMS and SQL (e.g., MySQL)
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
ON g.sample = s.sample
INNER JOIN dnase d
ON g.chr = d.chr
AND g.pos >= d.start
AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
ON g.chr = p.chr
AND g.pos = p.pos
AND g.ref = p.ref
AND g.alt = p.alt
WHERE
s.study = "framingham" AND
p.pos IS NULL AND
g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
RDBMS and SQL (e.g., MySQL)
• Feature-rich and very mature
• Highly optimized and allows indexing
• Declarative (and abstracted) language for data
• Hassle to get data in; data end up formatted one way
• No clear scalability story
• SQL-only
Problems with old way
• Expensive
• No fault-tolerance
• No horizontal scalability
• Poor separation of data modeling and storage formats
  • File format proliferation
• Inefficient text formats
Indexing the web
• Web is huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!
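The core data structure behind keyword search is an inverted index mapping each term to the pages containing it; a minimal sketch with made-up pages:

```python
from collections import defaultdict

# Made-up corpus: page id -> text.
pages = {
    "p1": "hadoop stores big data",
    "p2": "genomics produces big data",
    "p3": "hadoop processes genomics data",
}

# Inverted index: word -> set of pages containing it.
index = defaultdict(set)
for page, text in pages.items():
    for word in text.split():
        index[word].add(page)

def search(*terms):
    """Pages containing all query terms (AND query)."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(sorted(search("hadoop", "data")))  # → ['p1', 'p3']
```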
Databases in 1999
• Buy a really big machine
• Install expensive DBMS on it
• Point your workload at it
• Hope it doesn’t fail
• Ambitious: buy another big machine as backup
Database limitations
• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data
Google does something different
• Designed their own storage and processing infrastructure
  • Google File System (GFS) and MapReduce (MR)
• Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years
  • Still used internally to this day (millions of jobs)
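The MapReduce contract itself is small: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A single-process sketch using word count, the classic example (the framework's real value is running this distributed and fault-tolerantly):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Emit (key, value) pairs: one (word, 1) per occurrence."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values."""
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["a b a", "b c"]  # made-up input documents
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
# counts == {'a': 2, 'b': 2, 'c': 1}
```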
Google benevolent enough to publish
(GFS paper, 2003; MapReduce paper, 2004)
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
Open-source proliferation
Google           Open-source     Function
GFS              HDFS            Distributed file system
MapReduce        MapReduce       Batch distributed data processing
Bigtable         HBase           Distributed DB/key-value store
Protobuf/Stubby  Thrift or Avro  Data serialization/RPC
Pregel           Giraph          Distributed graph processing
Dremel/F1        Impala          Scalable interactive SQL (MPP)
FlumeJava        Crunch          Abstracted data pipelines on Hadoop
Hadoop provides:
• Data centralization on HDFS
  • No rewriting data for each tool/application
  • Data-local execution to avoid moving terabytes
• High-level execution engines
  • SQL (Impala, Hive)
  • Relational algebra (Spark, MapReduce)
  • Bulk synchronous parallel (GraphX)
  • Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC
Hadoop provides serialization/RPC formats (Avro)
• Specify schemas/services in user-friendly IDLs
• Code-generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Support for schema evolution
• Like binary JSON

record Feature {
  union { null, string } featureId = null;
  union { null, string } featureType = null; // e.g., DNase HS
  union { null, string } source = null;      // e.g., BED, GFF file
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, Strand } strand = null;
  union { null, double } value = null;
  array<Dbxref> dbxrefs = [];
  array<string> parentIds = [];
  map<string> attributes = {};
}
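The schema-evolution point can be illustrated without the Avro library itself: defaults in the reader schema let records written before a field existed stay readable. A plain-Python sketch (field names echo the record above, but the mechanics are deliberately simplified):

```python
# Reader schema as field -> default value (mirroring the IDL defaults above).
reader_schema = {
    "featureId": None,
    "featureType": None,
    "start": None,
    "end": None,
    "value": None,   # hypothetical new field, added after v1 data was written
}

def read_with_schema(record, schema):
    """Project an old record onto the current schema, filling defaults."""
    return {field: record.get(field, default)
            for field, default in schema.items()}

# A record written before "value" existed:
old_record = {"featureId": "f1", "start": 100, "end": 200}
new_view = read_with_schema(old_record, reader_schema)
# new_view["value"] is None: old data stays readable as the schema evolves
```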
APIs instead of file formats
• Service-oriented architectures (SOA) ensure stable contracts
• Allows for implementation changes with new technologies
• Software community has lots of experience with SOA, along with mature tools
• Can be implemented in language-independent fashion
Current file format hairball
API-oriented architecture
Hadoop provides columnar storage (Parquet)
• Designed for general data storage
• Columnar format
  • read fewer bytes
  • compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary-encoding
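RLE and dictionary encoding are why low-cardinality columns (like a PolyPhen annotation) shrink so much in columnar layout; a minimal sketch with made-up values:

```python
from itertools import groupby

# A column of repeated string values, as a columnar store would see it.
column = ["possibly damaging"] * 4 + ["probably damaging"] * 3

# Dictionary encoding: replace each string with a small integer code.
dictionary = sorted(set(column))
codes = [dictionary.index(v) for v in column]   # [0, 0, 0, 0, 1, 1, 1]

# Run-length encoding: collapse each run of identical codes to (code, count).
rle = [(code, len(list(run))) for code, run in groupby(codes)]
# rle == [(0, 4), (1, 3)]: seven strings reduced to two (code, count) pairs
```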
• Vertical partitioning (projection pushdown)
• Horizontal partitioning (predicate pushdown)
• Read only the data you need!
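Both pushdowns can be sketched in a few lines: a toy columnar file made of row groups with per-column min/max statistics (the structure is invented for illustration, loosely modeled on Parquet):

```python
# Toy columnar file: row groups, each with column chunks and pos stats.
row_groups = [
    {"stats": {"pos": (1, 1000)},
     "cols": {"pos": [10, 500], "ref": ["A", "T"], "alt": ["G", "C"]}},
    {"stats": {"pos": (5000, 9000)},
     "cols": {"pos": [5000, 8000], "ref": ["C", "G"], "alt": ["T", "A"]}},
]

def scan(columns, lo, hi):
    """Return rows with lo <= pos <= hi, reading only `columns`."""
    out = []
    for rg in row_groups:
        mn, mx = rg["stats"]["pos"]
        if mx < lo or mn > hi:
            continue                                  # predicate pushdown: skip group
        wanted = {c: rg["cols"][c] for c in columns}  # projection pushdown
        for row in zip(*(wanted[c] for c in columns)):
            rec = dict(zip(columns, row))
            if lo <= rec["pos"] <= hi:
                out.append(rec)
    return out

print(scan(["pos", "ref"], 0, 600))
# → [{'pos': 10, 'ref': 'A'}, {'pos': 500, 'ref': 'T'}]
```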
Hadoop provides abstractions for data processing
• HDFS (scalable, distributed storage)
• YARN (resource management)
• Execution engines: MapReduce, Impala (SQL), Solr (search), Spark
• Applications on top: ADAM, quince, guacamole, ...
• bdg-formats (Avro/Parquet)
Hadoop examples: filesystem
[laserson@bottou01-10g ~]$ hadoop fs -ls /user/laserson
Found 16 items
drwx------   - laserson laserson 0 2014-11-12 16:00 .Trash
drwxr-xr-x   - laserson laserson 0 2014-11-12 00:29 .sparkStaging
drwx------   - laserson laserson 0 2014-06-07 13:27 .staging
drwxr-xr-x   - laserson laserson 0 2014-10-30 14:15 1kg
drwxr-xr-x   - laserson laserson 0 2014-05-08 17:29 bigml
drwxr-xr-x   - laserson laserson 0 2014-10-30 14:14 book
drwxrwxr-x   - laserson laserson 0 2014-06-16 12:59 editing
drwxr-xr-x   - laserson laserson 0 2014-06-06 13:49 gdelt
-rw-r--r--   3 laserson laserson 0 2014-10-27 16:24 hg19_text
drwxr-xr-x   - laserson laserson 0 2014-06-12 19:53 madlibport
drwxr-xr-x   - laserson laserson 0 2014-03-20 18:09 rock-health-python
drwxr-xr-x   - laserson laserson 0 2014-05-15 13:25 test-udf
drwxr-xr-x   - laserson laserson 0 2014-08-21 17:58 test_pymc
drwxr-xr-x   - laserson laserson 0 2014-10-27 22:25 tmp
drwxr-xr-x   - laserson laserson 0 2014-10-07 20:30 udf-scratch
drwxr-xr-x   - laserson laserson 0 2014-03-02 13:50 udfs
Hadoop examples: batch MapReduce job
hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \
  com.cloudera.science.vcf2parquet.VCFtoParquetDriver \
  hdfs:///path/to/variants.vcf \
  hdfs:///path/to/output.parquet
Hadoop examples: interactive Spark shell
[laserson@bottou01-10g ~]$ spark-shell --master yarn
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
[...]
scala>
Hadoop examples: interactive Spark shell
def inDbSnp(g: Genotype): Boolean = ???       // ...implement dbSNP lookup
def isDeleterious(g: Genotype): Boolean = ??? // e.g., from PolyPhen annotation

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")

val filteredRDD = genotypesRDD
  .filter(g => !inDbSnp(g))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))
  .saveAsNewAPIHadoopFile("path/to/output")
Genomics ETL
.fastq -> [short read alignment] -> .bam -> [genotype calling] -> .vcf -> [analysis] (together with .bed/.gtf/etc.)
Hadoop variant store architecture
• Clients (Impala shell for SQL, REST API, JDBC) submit SQL queries
• The Impala engine plans queries against the Hive metastore and returns result sets
• .vcf files are ETL'd into .parquet for storage
Data denormalization
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
• Amortize join cost up-front
• Replace joins with predicates (allowing predicate pushdown)
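A tiny sketch of the trade: the normalized layout needs a join at query time, while the denormalized layout answers the same question with a single predicate. Sample names are borrowed from the VCF example above; the study labels and variant keys are made up:

```python
# Normalized: genotypes must be joined to a samples table to filter by study.
samples = {"NA00001": "framingham", "NA00002": "other"}
genotypes = [("NA00001", "20:14370", "0|0"),
             ("NA00002", "20:14370", "1|0")]
joined = [(s, v, gt) for s, v, gt in genotypes
          if samples[s] == "framingham"]           # join at query time

# Denormalized: the study is written into every genotype row up front,
# so the same question becomes a plain (pushdown-friendly) predicate.
flat = [("NA00001", "framingham", "20:14370", "0|0"),
        ("NA00002", "other", "20:14370", "1|0")]
filtered = [row for row in flat if row[1] == "framingham"]

assert [r[0] for r in filtered] == [r[0] for r in joined]  # same answer
```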
Hadoop solution characteristics
• Data stored as Parquet columnar format for performance and compression
• Impala/Hive metastore provide unified, flexible data model
• Impala implements RDBMS-style operations (by experts in distributed systems)
• Spark offers flexible relational algebra operators (and in-memory computing)
• Built-in fault tolerance for computations and horizontal scalability
Example variant-filtering query
• “Give me all SNPs that are:
  • on chromosome 16
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples”
• On full 1000 Genomes data set
  • ~37 billion genotypes
  • 14-node cluster
  • query completion in several seconds
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
FROM
hg19_parquet_snappy_join_cached_partitioned
WHERE
COSMIC IS NOT NULL AND
dbSNP IS NULL AND
sample_study = "breast_cancer" AND
VCF_CHROM = "16";
Other queries/use cases
• All-vs-all eQTL integrated with ENCODE
  • >120 billion p-values
  • “Top 20 eQTLs for 5 genes of interest”: interactive
  • “Find all cis-eQTLs”: several minutes
• Population genetics queries (e.g., backend for PLINK)
• Interval arithmetic on large ENCODE data sets
• Duke CHGV
  • ATAV DSL for preparing data for GWAS
  • Week-long queries now take a few hours by parallelizing on Spark
Computational biologists are reinventing the wheel
• e.g., CRAM (columnar storage)
• e.g., workflow managers (Galaxy)
• e.g., GATK (scatter-gather)
Large-scale data analysis has been solved*
• Cheaper in terms of hardware
• Easier in terms of productivity
• Built-in horizontal scaling
• Built-in fault tolerance
• Layered abstractions for data modeling
• Hadoop!
Science on Hadoop
• ADAM project for genomics on Spark
  • http://bdgenomics.org/
• Guacamole for somatic variation on Spark
  • https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark
  • http://thefreemanlab.com/thunder/
• Quince for variant store on Impala
  • currently barebones, but with examples
  • https://github.com/laserson/quince
Suggestions/resources
• Everyone should learn Python
  • (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)
  • GitHub enables easy collaboration
  • See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:
  • Show people you know how to code: put your projects on GitHub
  • You should feel lucky if others start using your code
Acknowledgements
• Cloudera
  • Sandy Ryza (Spark development)
  • Nong Li (Impala)
  • Skye Wanderman-Milne (Impala)
• Impala genomics collaborators
  • Kiran Mukhyala
  • Slaton Lipscomb
• ADAM project
  • Matt Massie
  • Frank Nothaft
  • Timothy Danford
• Mount Sinai School of Medicine
  • Jeff Hammerbacher (+ lab)
• Duke CHGV
  • Jonathan Keebler
Thank you.