Genomics Is Not Special: Towards Data-Intensive Biology


DESCRIPTION

Genomics and the life sciences are using antiquated technology for processing data. As data volumes grow, many in the biology community are reinventing the wheel, unaware that a rich ecosystem of tools for processing large data sets already exists: Hadoop.

TRANSCRIPT

Page 1: Genomics Is Not Special: Towards Data Intensive Biology

Genomics Is Not Special

Uri Laserson // [email protected] // 13 November 2014

Toward Data-Intensive Biology

Page 2: Genomics Is Not Special: Towards Data Intensive Biology

http://omicsmaps.com/

>25 Pbp / year

Page 3: Genomics Is Not Special: Towards Data Intensive Biology


Carr and Church, Nat. Biotech. 27: 1151 (2009)

Page 4: Genomics Is Not Special: Towards Data Intensive Biology

For every “-ome” there’s a “-seq”

Genome → DNA-seq
Transcriptome → RNA-seq, FRT-seq, NET-seq
Methylome → Bisulfite-seq
Immunome → Immune-seq
Proteome → PhIP-seq, Bind-n-seq

http://liorpachter.wordpress.com/seq/

Page 5: Genomics Is Not Special: Towards Data Intensive Biology


Page 6: Genomics Is Not Special: Towards Data Intensive Biology


Based on IMGT/LIGM release 201111-6

Page 7: Genomics Is Not Special: Towards Data Intensive Biology


Page 8: Genomics Is Not Special: Towards Data Intensive Biology


Page 9: Genomics Is Not Special: Towards Data Intensive Biology

Developer/computational efficiency becoming paramount

Genome Biology 12: 125 (2011)

Page 10: Genomics Is Not Special: Towards Data Intensive Biology

Software and data management around since the 1970s

• Version control/reproducibility

• Testing/automation/integration

• Databases and data formats

• API design

• Lots (most?) of big data innovation happening in industry

Page 11: Genomics Is Not Special: Towards Data Intensive Biology

Example query

For each variant that is
• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation, using samples in the Framingham Heart Study.

CHR  POS       REF  ALT  POP           MAF    POLYPHEN
7    12289237  A    G    Plain         0.01   possibly damaging
7    12289237  A    G    Star-bellied  0.03   possibly damaging
12   2288332   T    C    Plain         0.003  probably damaging
12   2288332   T    C    Star-bellied  0.09   probably damaging

Page 12: Genomics Is Not Special: Towards Data Intensive Biology

Available data

Data set                 Format            Size
Population genotypes     VCF               10-100s of billions
DNase HS sites (ENCODE)  narrowPeak (BED)  <1 million
dbSNP                    CSV               10s of millions
Sample phenotypes        JSON              thousands

Page 13: Genomics Is Not Special: Towards Data Intensive Biology

Why text data is a bad idea

• Text is highly inefficient
  • Compresses poorly
  • Values must be parsed
• Text is semi-structured at best
  • Flexible schemas make parsing difficult
  • Difficult to make assumptions about data structure
• Text poorly separates the roles of delimiters and data
  • Requires escaping of control characters
  • (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used; see the sketch below)
• But still almost always better than Excel
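A minimal illustration (not from the deck) of the delimiter point: the ASCII record/field separators never collide with data, whereas CSV needs quoting rules as soon as a value contains a comma.

import csv, io

records = [["chr7", "12289237", "possibly damaging, regulatory"],
           ["chr12", "2288332", "probably damaging"]]

# CSV: the comma inside a value forces quoting/escaping on every parser.
buf = io.StringIO()
csv.writer(buf).writerows(records)
print(buf.getvalue())

# ASCII separators (FS = 0x1F between fields, RS = 0x1E between records):
# no escaping needed, because the separators never appear in the data itself.
FS, RS = "\x1f", "\x1e"
blob = RS.join(FS.join(fields) for fields in records)
parsed = [rec.split(FS) for rec in blob.split(RS)]
assert parsed == records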

Page 14: Genomics Is Not Special: Towards Data Intensive Biology

Some reasons VCF in particular is bad

• Number of records (variants) grows with new variants, rather than new genotypes
  • difficult to write data
  • adding a sample requires a rewrite of the entire file (see the sketch below)
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:
  • catalogue of variation
  • repository of actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
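A rough sketch (not from the deck) of why appending a sample is painful: genotypes live as per-sample columns on every data line, so adding one sample means rewriting the whole file. The add_sample function and the calls dict are hypothetical names used only for illustration.

def add_sample(in_path, out_path, sample_name, calls):
    """Append one sample column to a VCF; every record line must be rewritten."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("##"):
                dst.write(line + "\n")                       # metadata passes through
            elif line.startswith("#CHROM"):
                dst.write(line + "\t" + sample_name + "\n")  # header gains a column
            else:
                chrom, pos = line.split("\t")[:2]
                gt = calls.get((chrom, pos), "./.")          # missing genotype if uncalled
                dst.write(line + "\t" + gt + "\n")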

Page 15: Genomics Is Not Special: Towards Data Intensive Biology

Manually executing query in Python

import json

class IntervalTree(object):
    def update(self, feature):
        pass  # ...implement tree update
    def overlaps(self, feature):
        return True or False  # placeholder

# index DNase HS sites for interval-overlap queries
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
    for line in ip:
        feature = parse_feature(line)
        dnase_sites.update(feature)

# keep only Framingham samples, keyed by sample name
samples = {}
with open('path/to/samples.json', 'r') as ip:
    for line in ip:
        sample = json.loads(line)
        if is_framingham(sample):
            samples[sample['name']] = sample

# load dbSNP as a set of tuples for membership tests
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
    for line in ip:
        snp = tuple(line.split()[:3])
        dbsnp.add(snp)

Page 16: Genomics Is Not Special: Towards Data Intensive Biology


Additional metadata must fit in memory


Page 17: Genomics Is Not Special: Towards Data Intensive Biology


Can only read from POSIX filesystem


Page 18: Genomics Is Not Special: Towards Data Intensive Biology

Manually executing query in Python

genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')
for variant in reader:
    if (dnase_sites.overlaps(variant) and is_deleterious(variant)
            and not in_dbsnp(variant)):
        for call in variant.samples:
            if call.sample in samples:
                pop = samples[call.sample]['population']
                genotype_data.setdefault((variant, pop), []).append(call)

mafs = {}
for (variant, pop) in genotype_data:
    mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])

Page 19: Genomics Is Not Special: Towards Data Intensive Biology


Genotype data may be split across files


Page 20: Genomics Is Not Special: Towards Data Intensive Biology


Manually executing query in Python

• If file is gzipped, cannot split file without decompressing (use Snappy)
• Reading files requires access to a POSIX-style file system
• Probably want to split the VCF file into pieces to parallelize
  • Requires manual scatter-gather
• Samples may be scattered among multiple VCF files (difficult to append to VCF)
• Manually implementing a broadcast join
  • Build side must fit into memory

Page 21: Genomics Is Not Special: Towards Data Intensive Biology

Manually executing query in Python on HPC

$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py

Page 22: Genomics Is Not Special: Towards Data Intensive Biology


Manually executing query in Python on HPC


• How to serialize intermediate output?
• Manually specify requested resources
• Manually split and merge
• Babysit and check for errors/failures

Page 23: Genomics Is Not Special: Towards Data Intensive Biology

HPC separates compute from storage

HPC is about compute. Hadoop is about data.

• Storage infrastructure: proprietary, distributed file system; expensive
• Compute cluster: high-performance, reliable hardware; expensive
• Big network pipe ($$$) between them
• User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)

Page 24: Genomics Is Not Special: Towards Data Intensive Biology

HPC is lower-level than Hadoop

• HPC only exposes job scheduling
• Parallelization typically through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually

Page 25: Genomics Is Not Special: Towards Data Intensive Biology


HPC uses the file system as a DB and the text file as the LCD (lowest common denominator)

• All tools assume flat files with POSIX semantics

• Sharing data/collaboration involves copying large files

• The Broad’s joint caller on 25k genomes hits file handle limits

• Files always streamed over network (HPC architecture)

Page 26: Genomics Is Not Special: Towards Data Intensive Biology


HPC uses job scheduler as workflow tool

• Submitting jobs to scheduler is low level

• Workflow engines/execution models provide high-level execution graphs with built-in fault tolerance
  • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive (see the sketch below)
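A minimal sketch (not from the deck, assuming the third-party luigi package) of what the earlier bsub pipeline looks like under a workflow engine: dependencies are declared once, and the scheduler handles ordering, resumption, and failures instead of a human babysitting jobs. Task and file names are hypothetical.

import luigi

class SplitGenotypes(luigi.Task):
    def output(self):
        return [luigi.LocalTarget("genotypes_%d.vcf" % i) for i in range(4)]
    def run(self):
        pass  # split the big VCF into shards

class QueryAgg(luigi.Task):
    shard = luigi.IntParameter()
    def requires(self):
        return SplitGenotypes()
    def output(self):
        return luigi.LocalTarget("agg%d.csv" % self.shard)
    def run(self):
        pass  # filter and aggregate one shard

class MergeMAF(luigi.Task):
    def requires(self):
        return [QueryAgg(shard=i) for i in range(4)]
    def output(self):
        return luigi.LocalTarget("mafs.csv")
    def run(self):
        pass  # merge the per-shard aggregates

if __name__ == "__main__":
    luigi.build([MergeMAF()], local_scheduler=True)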

Page 27: Genomics Is Not Special: Towards Data Intensive Biology


Prepping data for local analysis in R/Python

• Manual script to prepare CSV file for working locally

• Same issues as above

• Requires working set of data to fit into memory of a single machine

• Visualization

Page 28: Genomics Is Not Special: Towards Data Intensive Biology


Domain-specific tools (e.g., PLINK/Seq)

$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp

• “v-stats” is one of a limited set of specific, useful tasks
• the “--mask …” expression is (yet another) custom query specification

Page 29: Genomics Is Not Special: Towards Data Intensive Biology


Domain-specific tools (e.g., PLINK/Seq)

• Works great if your problem fits into the pre-designed computations

• Only works if your problem fits into the pre-designed computations

• How to do stats by subpopulation?
  • Probably possible, but need to learn new notation
• Must work to get data in to begin with

• Not obviously parallelizable for performance on large data sets

• Built on SQLite underneath

Page 30: Genomics Is Not Special: Towards Data Intensive Biology


RDBMS and SQL (e.g., MySQL)

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)  -- MAF() is a placeholder aggregate
FROM genotypes g
INNER JOIN samples s
  ON g.sample = s.sample
INNER JOIN dnase d
  ON g.chr = d.chr
  AND g.pos >= d.start
  AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr
  AND g.pos = p.pos
  AND g.ref = p.ref
  AND g.alt = p.alt
WHERE
  s.study = "framingham" AND
  p.pos IS NULL AND
  g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop

Page 31: Genomics Is Not Special: Towards Data Intensive Biology


RDBMS and SQL (e.g., MySQL)

• Feature-rich and very mature

• Highly optimized and allows indexing

• Declarative (and abstracted) language for data

• Hassle to get data in; data end up formatted one way

• No clear scalability story

• SQL-only

Page 32: Genomics Is Not Special: Towards Data Intensive Biology


Problems with old way

• Expensive

• No fault-tolerance

• No horizontal scalability

• Poor separation of data modeling and storage formats
  • File format proliferation

• Inefficient text formats

Page 33: Genomics Is Not Special: Towards Data Intensive Biology


Page 34: Genomics Is Not Special: Towards Data Intensive Biology


Indexing the web

• Web is huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!

Page 35: Genomics Is Not Special: Towards Data Intensive Biology


Page 36: Genomics Is Not Special: Towards Data Intensive Biology


Databases in 1999

• Buy a really big machine

• Install expensive DBMS on it

• Point your workload at it

• Hope it doesn’t fail

• Ambitious: buy another big machine as backup

Page 37: Genomics Is Not Special: Towards Data Intensive Biology


Page 38: Genomics Is Not Special: Towards Data Intensive Biology


Database limitations

• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data

Page 39: Genomics Is Not Special: Towards Data Intensive Biology


Page 40: Genomics Is Not Special: Towards Data Intensive Biology


Google does something different

• Designed their own storage and processing infrastructure
  • Google File System (GFS) and MapReduce (MR)
• Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years
  • Still used internally to this day (millions of jobs)

Page 41: Genomics Is Not Special: Towards Data Intensive Biology


Google benevolent enough to publish

2003 (GFS paper), 2004 (MapReduce paper)

Page 42: Genomics Is Not Special: Towards Data Intensive Biology


Birth of Hadoop at Yahoo!

• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR

• 2006: Spun out as Apache Hadoop

• Named after Doug’s son’s yellow stuffed elephant

Page 43: Genomics Is Not Special: Towards Data Intensive Biology


Open-source proliferation

Google           Open-source     Function
GFS              HDFS            Distributed file system
MapReduce        MapReduce       Batch distributed data processing
Bigtable         HBase           Distributed DB/key-value store
Protobuf/Stubby  Thrift or Avro  Data serialization/RPC
Pregel           Giraph          Distributed graph processing
Dremel/F1        Impala          Scalable interactive SQL (MPP)
FlumeJava        Crunch          Abstracted data pipelines on Hadoop

Page 44: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop provides:

• Data centralization on HDFS
  • No rewriting data for each tool/application
• Data-local execution to avoid moving terabytes
• High-level execution engines
  • SQL (Impala, Hive)
  • Relational algebra (Spark, MapReduce)
  • Bulk synchronous parallel (GraphX)
  • Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC

Page 45: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides serialization/RPC formats (Avro)

• Specify schemas/services in user-friendly IDLs
• Code generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Support for schema evolution
• Like binary JSON
• Example Feature record in Avro IDL (a short usage sketch follows below):

record Feature {
  union { null, string } featureId = null;
  union { null, string } featureType = null;  // e.g., DNase HS
  union { null, string } source = null;       // e.g., BED, GFF file
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, Strand } strand = null;
  union { null, double } value = null;
  array<Dbxref> dbxrefs = [];
  array<string> parentIds = [];
  map<string> attributes = {};
}
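To make the IDL concrete, here is a minimal sketch (an assumption, not from the deck) that expresses a trimmed-down Feature as an Avro JSON schema and writes a couple of records with the third-party fastavro package; any Avro implementation would do, and contig is simplified to a string rather than a nested Contig record.

from fastavro import parse_schema, writer

feature_schema = parse_schema({
    "type": "record", "name": "Feature",
    "fields": [
        {"name": "featureId",   "type": ["null", "string"], "default": None},
        {"name": "featureType", "type": ["null", "string"], "default": None},
        {"name": "contig",      "type": ["null", "string"], "default": None},  # simplified
        {"name": "start",       "type": ["null", "long"],   "default": None},
        {"name": "end",         "type": ["null", "long"],   "default": None},
    ],
})

records = [{"featureId": "peak1", "featureType": "DNase HS",
            "contig": "chr7", "start": 12289000, "end": 12289400}]

# Compact binary file; the schema travels with the data, enabling evolution.
with open("features.avro", "wb") as out:
    writer(out, feature_schema, records)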

Page 46: Genomics Is Not Special: Towards Data Intensive Biology


APIs instead of file formats

• Service-oriented architectures (SOA) ensure stable contracts

• Allows for implementation changes with new technologies

• Software community has lots of experience with SOA, along with mature tools

• Can be implemented in a language-independent fashion (a rough sketch follows below)
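A rough sketch (not from the deck) of what API-oriented access looks like from a client: query a variant-search service by region instead of parsing a shared file. The endpoint URL and payload shape are hypothetical, loosely in the spirit of web APIs for genomic data; only the third-party requests package is assumed.

import requests

resp = requests.post(
    "https://example.org/variants/search",
    json={"referenceName": "7", "start": 12289000, "end": 12290000},
)
resp.raise_for_status()
for variant in resp.json().get("variants", []):
    print(variant["referenceName"], variant["start"], variant.get("alternateBases"))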

Page 47: Genomics Is Not Special: Towards Data Intensive Biology


Current file format hairball

Page 48: Genomics Is Not Special: Towards Data Intensive Biology


API-oriented architecture

Page 49: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides columnar storage (Parquet)

• Designed for general data storage
• Columnar format
  • read fewer bytes
  • compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary encoding

Page 50: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides columnar storage (Parquet)

Page 51: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides columnar storage (Parquet)

• Vertical partitioning (projection pushdown)
• Horizontal partitioning (predicate pushdown)
• Combined: read only the data you need (see the sketch below)
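A small sketch of these two ideas, assuming the pyarrow package (an assumption; any Parquet implementation works, and this particular tooling postdates the talk): write a toy genotype table to Parquet, then read back only some columns and only rows matching a predicate.

import pyarrow as pa
import pyarrow.parquet as pq

# Toy genotype-like table (column names are illustrative only).
table = pa.Table.from_pydict({
    "chrom":    ["7", "7", "12", "12"],
    "pos":      [12289237, 12289237, 2288332, 2288332],
    "sample":   ["NA00001", "NA00002", "NA00001", "NA00002"],
    "genotype": ["0/1", "1/1", "0/0", "0/1"],
})
pq.write_table(table, "genotypes.parquet", compression="snappy")

# Projection pushdown: only the requested columns are read off disk.
# Predicate pushdown: row groups that cannot match the filter are skipped.
subset = pq.read_table(
    "genotypes.parquet",
    columns=["chrom", "pos", "genotype"],
    filters=[("chrom", "=", "7")],
)
print(subset)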

Page 52: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides abstractions for data processing

• HDFS (scalable, distributed storage)
• YARN (resource management)
• Engines on top of YARN: MapReduce, Spark, Impala (SQL), Solr (search)
• Genomics tools layered above: ADAM, quince, guacamole, …
• Common data layer: bdg-formats (Avro/Parquet)

Page 53: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: filesystem

[laserson@bottou01-10g ~]$ hadoop fs -ls /user/laserson
Found 16 items
drwx------   - laserson laserson  0 2014-11-12 16:00 .Trash
drwxr-xr-x   - laserson laserson  0 2014-11-12 00:29 .sparkStaging
drwx------   - laserson laserson  0 2014-06-07 13:27 .staging
drwxr-xr-x   - laserson laserson  0 2014-10-30 14:15 1kg
drwxr-xr-x   - laserson laserson  0 2014-05-08 17:29 bigml
drwxr-xr-x   - laserson laserson  0 2014-10-30 14:14 book
drwxrwxr-x   - laserson laserson  0 2014-06-16 12:59 editing
drwxr-xr-x   - laserson laserson  0 2014-06-06 13:49 gdelt
-rw-r--r--   3 laserson laserson  0 2014-10-27 16:24 hg19_text
drwxr-xr-x   - laserson laserson  0 2014-06-12 19:53 madlibport
drwxr-xr-x   - laserson laserson  0 2014-03-20 18:09 rock-health-python
drwxr-xr-x   - laserson laserson  0 2014-05-15 13:25 test-udf
drwxr-xr-x   - laserson laserson  0 2014-08-21 17:58 test_pymc
drwxr-xr-x   - laserson laserson  0 2014-10-27 22:25 tmp
drwxr-xr-x   - laserson laserson  0 2014-10-07 20:30 udf-scratch
drwxr-xr-x   - laserson laserson  0 2014-03-02 13:50 udfs

Page 54: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: batch MapReduce job

hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \
  com.cloudera.science.vcf2parquet.VCFtoParquetDriver \
  hdfs:///path/to/variants.vcf \
  hdfs:///path/to/output.parquet

Page 55: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: interactive Spark shell

[laserson@bottou01-10g ~]$ spark-shell --master yarn
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
[...]

scala>

Page 56: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: interactive Spark shell

// Placeholder predicates (bodies elided on the slide)
def inDbSnp(g: Genotype): Boolean = ???        // e.g., membership test against the dbSNP set
def isDeleterious(g: Genotype): Boolean = ???  // e.g., based on g.getPolyPhen

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")

val filteredRDD = genotypesRDD
  .filter(g => !inDbSnp(g))
  .filter(g => isDeleterious(g))
  .filter(g => isFramingham(g))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))
  .saveAsNewAPIHadoopFile("path/to/output")

Page 57: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides abstractions for data processing

• HDFS (scalable, distributed storage)
• YARN (resource management)
• Engines on top of YARN: MapReduce, Spark, Impala (SQL), Solr (search)
• Genomics tools layered above: ADAM, quince, guacamole, …
• Common data layer: bdg-formats (Avro/Parquet)

Page 58: Genomics Is Not Special: Towards Data Intensive Biology


Genomics ETL

.fastq → (short read alignment) → .bam → (genotype calling) → .vcf → analysis, alongside .bed/.gtf/etc. annotation files

Page 59: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop variant store architecture

• Clients: Impala shell (SQL), REST API, JDBC
• SQL queries go to the Impala engine, which returns result sets
• Impala is backed by the Hive metastore
• ETL converts .vcf input into .parquet files for storage

Page 60: Genomics Is Not Special: Towards Data Intensive Biology


Data denormalization

##fileformat=VCFv4.1

##fileDate=20090805

##source=myImputationProgramV3.1

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3

20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4

20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

• Amortize join cost up-front
• Replace joins with predicates (allowing predicate pushdown); see the sketch below
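A rough sketch (not from the deck) of what denormalization means here: each VCF record fans out into one row per sample, with sample-level attributes such as population copied onto every row, so later queries filter on columns instead of joining. The dict layout and population labels are illustrative only.

record = {"chrom": "20", "pos": 14370, "ref": "G", "alt": "A",
          "calls": {"NA00001": "0|0", "NA00002": "1|0", "NA00003": "1/1"}}
sample_population = {"NA00001": "Plain", "NA00002": "Star-bellied", "NA00003": "Plain"}

# One wide row per (variant, sample): the join against sample metadata is paid
# once at ETL time, and downstream queries become simple column predicates.
rows = [
    {"chrom": record["chrom"], "pos": record["pos"], "ref": record["ref"],
     "alt": record["alt"], "sample": s, "gt": gt,
     "population": sample_population[s]}
    for s, gt in record["calls"].items()
]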

Page 61: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop solution characteristics

• Data stored as Parquet columnar format for performance and compression

• Impala/Hive metastore provide unified, flexible data model

• Impala implements RDBMS-style operations (by experts in distributed systems)

• Spark offers flexible relational algebra operators (and in-memory computing)

• Built-in fault tolerance for computations and horizontal scalability

Page 62: Genomics Is Not Special: Towards Data Intensive Biology


Example variant-filtering query

• “Give me all SNPs that are:
  • on chromosome 16
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples”
• On the full 1000 Genomes data set
  • ~37 billion genotypes
  • 14-node cluster
  • query completes in several seconds

SELECT cosmic as snp_id,
       vcf_chrom as chr,
       vcf_pos as pos,
       sample_id as sample,
       vcf_call_gt as genotype,
       sample_affection as phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE
  COSMIC IS NOT NULL AND
  dbSNP IS NULL AND
  sample_study = "breast_cancer" AND
  VCF_CHROM = "16";


Page 63: Genomics Is Not Special: Towards Data Intensive Biology


Other queries/use cases

• All-vs-all eQTL integrated with ENCODE
  • >120 billion p-values
  • “Top 20 eQTLs for 5 genes of interest”: interactive
  • “Find all cis-eQTLs”: several minutes

• Population genetics queries (e.g., backend for PLINK)

• Interval arithmetic on large ENCODE data sets

• Duke CHGV
  • ATAV DSL for preparing data for GWAS
  • Week-long queries now take a few hours by parallelizing on Spark

Page 64: Genomics Is Not Special: Towards Data Intensive Biology


Computational biologists are reinventing the wheel

• e.g., CRAM (columnar storage)

• e.g., workflow managers (Galaxy)

• e.g., GATK (scatter-gather)

Page 65: Genomics Is Not Special: Towards Data Intensive Biology


Large-scale data analysis has been solved*

• Cheaper in terms of hardware

• Easier in terms of productivity

• Built-in horizontal scaling

• Built-in fault tolerance

• Layered abstractions for data modeling

• Hadoop!

Page 66: Genomics Is Not Special: Towards Data Intensive Biology


Science on Hadoop

• ADAM project for genomics on Spark
  • http://bdgenomics.org/
• Guacamole for somatic variation on Spark
  • https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark
  • http://thefreemanlab.com/thunder/
• Quince for variant store on Impala
  • currently barebones, but with examples
  • https://github.com/laserson/quince

Page 67: Genomics Is Not Special: Towards Data Intensive Biology


Suggestions/resources

• Everyone should learn Python
  • (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)
  • GitHub enables easy collaboration
• See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:
  • Show people you know how to code: put your projects on GitHub
  • You should feel lucky if others will start using your code

Page 68: Genomics Is Not Special: Towards Data Intensive Biology


Page 69: Genomics Is Not Special: Towards Data Intensive Biology


Acknowledgements

• Cloudera
  • Sandy Ryza (Spark development)
  • Nong Li (Impala)
  • Skye Wanderman-Milne (Impala)
• Impala genomics collaborators
  • Kiran Mukhyala
  • Slaton Lipscomb
• ADAM project
  • Matt Massie
  • Frank Nothaft
  • Timothy Danford
• Mount Sinai School of Medicine
  • Jeff Hammerbacher (+ lab)
• Duke CHGV
  • Jonathan Keebler

Page 70: Genomics Is Not Special: Towards Data Intensive Biology

Thank you.