Genomics Is Not Special: Towards Data-Intensive Biology


DESCRIPTION

Genomics and the life sciences are using antiquated technology for processing data. As data volumes grow, many in the biology community are reinventing the wheel, unaware that a rich ecosystem of tools for processing large data sets already exists: Hadoop.

TRANSCRIPT

Page 1: Genomics Is Not Special: Towards Data Intensive Biology

Genomics Is Not Special

Uri Laserson // [email protected] // 13 November 2014

Toward Data-Intensive Biology

Page 2: Genomics Is Not Special: Towards Data Intensive Biology

http://omicsmaps.com/

>25 Pbp / year

Page 3: Genomics Is Not Special: Towards Data Intensive Biology


Carr and Church, Nat. Biotech. 27: 1151 (2009)

Page 4: Genomics Is Not Special: Towards Data Intensive Biology

For every “-ome” there’s a “-seq”

Genome → DNA-seq
Transcriptome → RNA-seq, FRT-seq, NET-seq
Methylome → Bisulfite-seq
Immunome → Immune-seq
Proteome → PhIP-seq, Bind-n-seq

http://liorpachter.wordpress.com/seq/

Page 5: Genomics Is Not Special: Towards Data Intensive Biology


Page 6: Genomics Is Not Special: Towards Data Intensive Biology


Based on IMGT/LIGM release 201111-6

Page 7: Genomics Is Not Special: Towards Data Intensive Biology


Page 8: Genomics Is Not Special: Towards Data Intensive Biology


Page 9: Genomics Is Not Special: Towards Data Intensive Biology

Developer/computational efficiency becoming paramount

Genome Biology 12: 125 (2011)

Page 10: Genomics Is Not Special: Towards Data Intensive Biology

Software and data management around since the 1970s

• Version control/reproducibility

• Testing/automation/integration

• Databases and data formats

• API design

• Lots (most?) of big data innovation happening in industry

Page 11: Genomics Is Not Special: Towards Data Intensive Biology

Example query

For each variant that is
• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation, using samples in the Framingham Heart Study.

CHR  POS       REF  ALT  POP           MAF    POLYPHEN
7    12289237  A    G    Plain         0.01   possibly damaging
7    12289237  A    G    Star-bellied  0.03   possibly damaging
12   2288332   T    C    Plain         0.003  probably damaging
12   2288332   T    C    Star-bellied  0.09   probably damaging

Page 12: Genomics Is Not Special: Towards Data Intensive Biology

Available data

Data set                 Format            Size
Population genotypes     VCF               10-100s of billions
DNase HS sites (ENCODE)  narrowPeak (BED)  <1 million
dbSNP                    CSV               10s of millions
Sample phenotypes        JSON              thousands

Page 13: Genomics Is Not Special: Towards Data Intensive Biology

Why text data is a bad idea

• Text is highly inefficient
  • Compresses poorly
  • Values must be parsed
• Text is semi-structured at best
  • Flexible schemas make parsing difficult
  • Difficult to make assumptions about data structure
• Text poorly separates the roles of delimiters and data
  • Requires escaping of control characters
  • (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used; see the sketch below)
• But still almost always better than Excel
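A minimal illustration (not from the deck) of the delimiter point: the ASCII record/field separators never collide with data, whereas CSV needs quoting rules as soon as a value contains a comma.

import csv, io

records = [["chr7", "12289237", "possibly damaging, regulatory"],
           ["chr12", "2288332", "probably damaging"]]

# CSV: the comma inside a value forces quoting/escaping on every parser.
buf = io.StringIO()
csv.writer(buf).writerows(records)
print(buf.getvalue())

# ASCII separators (FS = 0x1F between fields, RS = 0x1E between records):
# no escaping needed, because the separators never appear in the data itself.
FS, RS = "\x1f", "\x1e"
blob = RS.join(FS.join(fields) for fields in records)
parsed = [rec.split(FS) for rec in blob.split(RS)]
assert parsed == records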

Page 14: Genomics Is Not Special: Towards Data Intensive Biology

Some reasons VCF in particular is bad

• Number of records (variants) grows with new variants, rather than new genotypes
  • difficult to write data
  • adding a sample requires a rewrite of the entire file (see the sketch below)
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:
  • catalogue of variation
  • repository of actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
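A rough sketch (not from the deck) of why appending a sample is painful: genotypes live as per-sample columns on every data line, so adding one sample means rewriting the whole file. The add_sample function and the calls dict are hypothetical names used only for illustration.

def add_sample(in_path, out_path, sample_name, calls):
    """Append one sample column to a VCF; every record line must be rewritten."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("##"):
                dst.write(line + "\n")                       # metadata passes through
            elif line.startswith("#CHROM"):
                dst.write(line + "\t" + sample_name + "\n")  # header gains a column
            else:
                chrom, pos = line.split("\t")[:2]
                gt = calls.get((chrom, pos), "./.")          # missing genotype if uncalled
                dst.write(line + "\t" + gt + "\n")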

Page 15: Genomics Is Not Special: Towards Data Intensive Biology

Manually executing query in Python

import json

class IntervalTree(object):
    def update(self, feature):
        pass  # ...implement tree update
    def overlaps(self, feature):
        return True or False  # placeholder

# index DNase HS sites for interval-overlap queries
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
    for line in ip:
        feature = parse_feature(line)
        dnase_sites.update(feature)

# keep only Framingham samples, keyed by sample name
samples = {}
with open('path/to/samples.json', 'r') as ip:
    for line in ip:
        sample = json.loads(line)
        if is_framingham(sample):
            samples[sample['name']] = sample

# load dbSNP as a set of tuples for membership tests
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
    for line in ip:
        snp = tuple(line.split()[:3])
        dbsnp.add(snp)

Page 16: Genomics Is Not Special: Towards Data Intensive Biology


Additional metadata must fit in memory


Page 17: Genomics Is Not Special: Towards Data Intensive Biology


Can only read from POSIX filesystem


Page 18: Genomics Is Not Special: Towards Data Intensive Biology

Manually executing query in Python

genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')
for variant in reader:
    if (dnase_sites.overlaps(variant) and is_deleterious(variant)
            and not in_dbsnp(variant)):
        for call in variant.samples:
            if call.sample in samples:
                pop = samples[call.sample]['population']
                genotype_data.setdefault((variant, pop), []).append(call)

mafs = {}
for (variant, pop) in genotype_data:
    mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])

Page 19: Genomics Is Not Special: Towards Data Intensive Biology


Genotype data may be split across files


Page 20: Genomics Is Not Special: Towards Data Intensive Biology


Manually executing query in Python

• If file is gzipped, cannot split file without decompressing (use Snappy)
• Reading files requires access to a POSIX-style file system
• Probably want to split the VCF file into pieces to parallelize
  • Requires manual scatter-gather
• Samples may be scattered among multiple VCF files (difficult to append to VCF)
• Manually implementing a broadcast join
  • Build side must fit into memory

Page 21: Genomics Is Not Special: Towards Data Intensive Biology

Manually executing query in Python on HPC

$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py

Page 22: Genomics Is Not Special: Towards Data Intensive Biology


Manually executing query in Python on HPC


• How to serialize intermediate output?
• Manually specify requested resources
• Manually split and merge
• Babysit and check for errors/failures

Page 23: Genomics Is Not Special: Towards Data Intensive Biology

HPC separates compute from storage

HPC is about compute. Hadoop is about data.

• Storage infrastructure: proprietary, distributed file system; expensive
• Compute cluster: high-performance, reliable hardware; expensive
• Big network pipe ($$$) between them
• User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)

Page 24: Genomics Is Not Special: Towards Data Intensive Biology

HPC is lower-level than Hadoop

• HPC only exposes job scheduling
• Parallelization typically through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually

Page 25: Genomics Is Not Special: Towards Data Intensive Biology


HPC uses the file system as a DB and the text file as the LCD (lowest common denominator)

• All tools assume flat files with POSIX semantics

• Sharing data/collaboration involves copying large files

• The Broad’s joint caller on 25k genomes hits file handle limits

• Files always streamed over network (HPC architecture)

Page 26: Genomics Is Not Special: Towards Data Intensive Biology


HPC uses job scheduler as workflow tool

• Submitting jobs to scheduler is low level

• Workflow engines/execution models provide high-level execution graphs with built-in fault tolerance
  • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive (see the sketch below)
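A minimal sketch (not from the deck, assuming the third-party luigi package) of what the earlier bsub pipeline looks like under a workflow engine: dependencies are declared once, and the scheduler handles ordering, resumption, and failures instead of a human babysitting jobs. Task and file names are hypothetical.

import luigi

class SplitGenotypes(luigi.Task):
    def output(self):
        return [luigi.LocalTarget("genotypes_%d.vcf" % i) for i in range(4)]
    def run(self):
        pass  # split the big VCF into shards

class QueryAgg(luigi.Task):
    shard = luigi.IntParameter()
    def requires(self):
        return SplitGenotypes()
    def output(self):
        return luigi.LocalTarget("agg%d.csv" % self.shard)
    def run(self):
        pass  # filter and aggregate one shard

class MergeMAF(luigi.Task):
    def requires(self):
        return [QueryAgg(shard=i) for i in range(4)]
    def output(self):
        return luigi.LocalTarget("mafs.csv")
    def run(self):
        pass  # merge the per-shard aggregates

if __name__ == "__main__":
    luigi.build([MergeMAF()], local_scheduler=True)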

Page 27: Genomics Is Not Special: Towards Data Intensive Biology


Prepping data for local analysis in R/Python

• Manual script to prepare CSV file for working locally

• Same issues as above

• Requires working set of data to fit into memory of a single machine

• Visualization

Page 28: Genomics Is Not Special: Towards Data Intensive Biology


Domain-specific tools (e.g., PLINK/Seq)

$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp

• “v-stats” is one of a limited set of specific, useful tasks
• the “--mask …” expression is (yet another) custom query specification

Page 29: Genomics Is Not Special: Towards Data Intensive Biology


Domain-specific tools (e.g., PLINK/Seq)

• Works great if your problem fits into the pre-designed computations

• Only works if your problem fits into the pre-designed computations

• How to do stats by subpopulation?
  • Probably possible, but need to learn new notation
• Must work to get data in to begin with

• Not obviously parallelizable for performance on large data sets

• Built on SQLite underneath

Page 30: Genomics Is Not Special: Towards Data Intensive Biology


RDBMS and SQL (e.g., MySQL)

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)  -- MAF() is a placeholder aggregate
FROM genotypes g
INNER JOIN samples s
  ON g.sample = s.sample
INNER JOIN dnase d
  ON g.chr = d.chr
  AND g.pos >= d.start
  AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr
  AND g.pos = p.pos
  AND g.ref = p.ref
  AND g.alt = p.alt
WHERE
  s.study = "framingham" AND
  p.pos IS NULL AND
  g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop

Page 31: Genomics Is Not Special: Towards Data Intensive Biology


RDBMS and SQL (e.g., MySQL)

• Feature-rich and very mature

• Highly optimized and allows indexing

• Declarative (and abstracted) language for data

• Hassle to get data in; data end up formatted one way

• No clear scalability story

• SQL-only

Page 32: Genomics Is Not Special: Towards Data Intensive Biology


Problems with old way

• Expensive

• No fault-tolerance

• No horizontal scalability

• Poor separation of data modeling and storage formats
  • File format proliferation

• Inefficient text formats

Page 33: Genomics Is Not Special: Towards Data Intensive Biology


Page 34: Genomics Is Not Special: Towards Data Intensive Biology


Indexing the web

• Web is huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!

Page 35: Genomics Is Not Special: Towards Data Intensive Biology


Page 36: Genomics Is Not Special: Towards Data Intensive Biology


Databases in 1999

• Buy a really big machine

• Install expensive DBMS on it

• Point your workload at it

• Hope it doesn’t fail

• Ambitious: buy another big machine as backup

Page 37: Genomics Is Not Special: Towards Data Intensive Biology


Page 38: Genomics Is Not Special: Towards Data Intensive Biology


Database limitations

• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data

Page 39: Genomics Is Not Special: Towards Data Intensive Biology


Page 40: Genomics Is Not Special: Towards Data Intensive Biology


Google does something different

• Designed their own storage and processing infrastructure
  • Google File System (GFS) and MapReduce (MR)
• Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years
  • Still used internally to this day (millions of jobs)

Page 41: Genomics Is Not Special: Towards Data Intensive Biology


Google benevolent enough to publish

2003 (GFS paper), 2004 (MapReduce paper)

Page 42: Genomics Is Not Special: Towards Data Intensive Biology


Birth of Hadoop at Yahoo!

• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR

• 2006: Spun out as Apache Hadoop

• Named after Doug’s son’s yellow stuffed elephant

Page 43: Genomics Is Not Special: Towards Data Intensive Biology


Open-source proliferation

Google           Open-source     Function
GFS              HDFS            Distributed file system
MapReduce        MapReduce       Batch distributed data processing
Bigtable         HBase           Distributed DB/key-value store
Protobuf/Stubby  Thrift or Avro  Data serialization/RPC
Pregel           Giraph          Distributed graph processing
Dremel/F1        Impala          Scalable interactive SQL (MPP)
FlumeJava        Crunch          Abstracted data pipelines on Hadoop

Page 44: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop provides:

• Data centralization on HDFS
  • No rewriting data for each tool/application
• Data-local execution to avoid moving terabytes
• High-level execution engines
  • SQL (Impala, Hive)
  • Relational algebra (Spark, MapReduce)
  • Bulk synchronous parallel (GraphX)
  • Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC

Page 45: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides serialization/RPC formats (Avro)

• Specify schemas/services in user-friendly IDLs
• Code generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Support for schema evolution
• Like binary JSON
• Example Feature record in Avro IDL (a short usage sketch follows below):

record Feature {
  union { null, string } featureId = null;
  union { null, string } featureType = null;  // e.g., DNase HS
  union { null, string } source = null;       // e.g., BED, GFF file
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, Strand } strand = null;
  union { null, double } value = null;
  array<Dbxref> dbxrefs = [];
  array<string> parentIds = [];
  map<string> attributes = {};
}
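To make the IDL concrete, here is a minimal sketch (an assumption, not from the deck) that expresses a trimmed-down Feature as an Avro JSON schema and writes a couple of records with the third-party fastavro package; any Avro implementation would do, and contig is simplified to a string rather than a nested Contig record.

from fastavro import parse_schema, writer

feature_schema = parse_schema({
    "type": "record", "name": "Feature",
    "fields": [
        {"name": "featureId",   "type": ["null", "string"], "default": None},
        {"name": "featureType", "type": ["null", "string"], "default": None},
        {"name": "contig",      "type": ["null", "string"], "default": None},  # simplified
        {"name": "start",       "type": ["null", "long"],   "default": None},
        {"name": "end",         "type": ["null", "long"],   "default": None},
    ],
})

records = [{"featureId": "peak1", "featureType": "DNase HS",
            "contig": "chr7", "start": 12289000, "end": 12289400}]

# Compact binary file; the schema travels with the data, enabling evolution.
with open("features.avro", "wb") as out:
    writer(out, feature_schema, records)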

Page 46: Genomics Is Not Special: Towards Data Intensive Biology


APIs instead of file formats

• Service-oriented architectures (SOA) ensure stable contracts

• Allows for implementation changes with new technologies

• Software community has lots of experience with SOA, along with mature tools

• Can be implemented in a language-independent fashion (a rough sketch follows below)
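A rough sketch (not from the deck) of what API-oriented access looks like from a client: query a variant-search service by region instead of parsing a shared file. The endpoint URL and payload shape are hypothetical, loosely in the spirit of web APIs for genomic data; only the third-party requests package is assumed.

import requests

resp = requests.post(
    "https://example.org/variants/search",
    json={"referenceName": "7", "start": 12289000, "end": 12290000},
)
resp.raise_for_status()
for variant in resp.json().get("variants", []):
    print(variant["referenceName"], variant["start"], variant.get("alternateBases"))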

Page 47: Genomics Is Not Special: Towards Data Intensive Biology


Current file format hairball

Page 48: Genomics Is Not Special: Towards Data Intensive Biology


API-oriented architecture

Page 49: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides columnar storage (Parquet)

• Designed for general data storage
• Columnar format
  • read fewer bytes
  • compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary encoding

Page 50: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides columnar storage (Parquet)

Page 51: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides columnar storage (Parquet)

• Vertical partitioning (projection pushdown)
• Horizontal partitioning (predicate pushdown)
• Combined: read only the data you need (see the sketch below)
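A small sketch of these two ideas, assuming the pyarrow package (an assumption; any Parquet implementation works, and this particular tooling postdates the talk): write a toy genotype table to Parquet, then read back only some columns and only rows matching a predicate.

import pyarrow as pa
import pyarrow.parquet as pq

# Toy genotype-like table (column names are illustrative only).
table = pa.Table.from_pydict({
    "chrom":    ["7", "7", "12", "12"],
    "pos":      [12289237, 12289237, 2288332, 2288332],
    "sample":   ["NA00001", "NA00002", "NA00001", "NA00002"],
    "genotype": ["0/1", "1/1", "0/0", "0/1"],
})
pq.write_table(table, "genotypes.parquet", compression="snappy")

# Projection pushdown: only the requested columns are read off disk.
# Predicate pushdown: row groups that cannot match the filter are skipped.
subset = pq.read_table(
    "genotypes.parquet",
    columns=["chrom", "pos", "genotype"],
    filters=[("chrom", "=", "7")],
)
print(subset)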

Page 52: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides abstractions for data processing

• HDFS (scalable, distributed storage)
• YARN (resource management)
• Engines on top of YARN: MapReduce, Spark, Impala (SQL), Solr (search)
• Genomics tools layered above: ADAM, quince, guacamole, …
• Common data layer: bdg-formats (Avro/Parquet)

Page 53: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: filesystem

[laserson@bottou01-10g ~]$ hadoop fs -ls /user/laserson
Found 16 items
drwx------   - laserson laserson  0 2014-11-12 16:00 .Trash
drwxr-xr-x   - laserson laserson  0 2014-11-12 00:29 .sparkStaging
drwx------   - laserson laserson  0 2014-06-07 13:27 .staging
drwxr-xr-x   - laserson laserson  0 2014-10-30 14:15 1kg
drwxr-xr-x   - laserson laserson  0 2014-05-08 17:29 bigml
drwxr-xr-x   - laserson laserson  0 2014-10-30 14:14 book
drwxrwxr-x   - laserson laserson  0 2014-06-16 12:59 editing
drwxr-xr-x   - laserson laserson  0 2014-06-06 13:49 gdelt
-rw-r--r--   3 laserson laserson  0 2014-10-27 16:24 hg19_text
drwxr-xr-x   - laserson laserson  0 2014-06-12 19:53 madlibport
drwxr-xr-x   - laserson laserson  0 2014-03-20 18:09 rock-health-python
drwxr-xr-x   - laserson laserson  0 2014-05-15 13:25 test-udf
drwxr-xr-x   - laserson laserson  0 2014-08-21 17:58 test_pymc
drwxr-xr-x   - laserson laserson  0 2014-10-27 22:25 tmp
drwxr-xr-x   - laserson laserson  0 2014-10-07 20:30 udf-scratch
drwxr-xr-x   - laserson laserson  0 2014-03-02 13:50 udfs

Page 54: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: batch MapReduce job

hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \
  com.cloudera.science.vcf2parquet.VCFtoParquetDriver \
  hdfs:///path/to/variants.vcf \
  hdfs:///path/to/output.parquet

Page 55: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: interactive Spark shell

[laserson@bottou01-10g ~]$ spark-shell --master yarn
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
[...]

scala>

Page 56: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop examples: interactive Spark shell

// Placeholder predicates (bodies elided on the slide)
def inDbSnp(g: Genotype): Boolean = ???        // e.g., membership test against the dbSNP set
def isDeleterious(g: Genotype): Boolean = ???  // e.g., based on g.getPolyPhen

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")

val filteredRDD = genotypesRDD
  .filter(g => !inDbSnp(g))
  .filter(g => isDeleterious(g))
  .filter(g => isFramingham(g))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))
  .saveAsNewAPIHadoopFile("path/to/output")

Page 57: Genomics Is Not Special: Towards Data Intensive Biology

Hadoop provides abstractions for data processing

• HDFS (scalable, distributed storage)
• YARN (resource management)
• Engines on top of YARN: MapReduce, Spark, Impala (SQL), Solr (search)
• Genomics tools layered above: ADAM, quince, guacamole, …
• Common data layer: bdg-formats (Avro/Parquet)

Page 58: Genomics Is Not Special: Towards Data Intensive Biology


Genomics ETL

.fastq → (short read alignment) → .bam → (genotype calling) → .vcf → analysis, alongside .bed/.gtf/etc. annotation files

Page 59: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop variant store architecture

• Clients: Impala shell (SQL), REST API, JDBC
• SQL queries go to the Impala engine, which returns result sets
• Impala is backed by the Hive metastore
• ETL converts .vcf input into .parquet files for storage

Page 60: Genomics Is Not Special: Towards Data Intensive Biology


Data denormalization

##fileformat=VCFv4.1

##fileDate=20090805

##source=myImputationProgramV3.1

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3

20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4

20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

• Amortize join cost up-front
• Replace joins with predicates (allowing predicate pushdown); see the sketch below
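A rough sketch (not from the deck) of what denormalization means here: each VCF record fans out into one row per sample, with sample-level attributes such as population copied onto every row, so later queries filter on columns instead of joining. The dict layout and population labels are illustrative only.

record = {"chrom": "20", "pos": 14370, "ref": "G", "alt": "A",
          "calls": {"NA00001": "0|0", "NA00002": "1|0", "NA00003": "1/1"}}
sample_population = {"NA00001": "Plain", "NA00002": "Star-bellied", "NA00003": "Plain"}

# One wide row per (variant, sample): the join against sample metadata is paid
# once at ETL time, and downstream queries become simple column predicates.
rows = [
    {"chrom": record["chrom"], "pos": record["pos"], "ref": record["ref"],
     "alt": record["alt"], "sample": s, "gt": gt,
     "population": sample_population[s]}
    for s, gt in record["calls"].items()
]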

Page 61: Genomics Is Not Special: Towards Data Intensive Biology


Hadoop solution characteristics

• Data stored as Parquet columnar format for performance and compression

• Impala/Hive metastore provide unified, flexible data model

• Impala implements RDBMS-style operations (by experts in distributed systems)

• Spark offers flexible relational algebra operators (and in-memory computing)

• Built-in fault tolerance for computations and horizontal scalability

Page 62: Genomics Is Not Special: Towards Data Intensive Biology


Example variant-filtering query

• “Give me all SNPs that are:
  • on chromosome 16
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples”
• On the full 1000 Genomes data set
  • ~37 billion genotypes
  • 14-node cluster
  • query completes in several seconds

SELECT cosmic as snp_id,
       vcf_chrom as chr,
       vcf_pos as pos,
       sample_id as sample,
       vcf_call_gt as genotype,
       sample_affection as phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE
  COSMIC IS NOT NULL AND
  dbSNP IS NULL AND
  sample_study = "breast_cancer" AND
  VCF_CHROM = "16";


Page 63: Genomics Is Not Special: Towards Data Intensive Biology


Other queries/use cases

• All-vs-all eQTL integrated with ENCODE
  • >120 billion p-values
  • “Top 20 eQTLs for 5 genes of interest”: interactive
  • “Find all cis-eQTLs”: several minutes

• Population genetics queries (e.g., backend for PLINK)

• Interval arithmetic on large ENCODE data sets

• Duke CHGV
  • ATAV DSL for preparing data for GWAS
  • Week-long queries now take a few hours by parallelizing on Spark

Page 64: Genomics Is Not Special: Towards Data Intensive Biology


Computational biologists are reinventing the wheel

• e.g., CRAM (columnar storage)

• e.g., workflow managers (Galaxy)

• e.g., GATK (scatter-gather)

Page 65: Genomics Is Not Special: Towards Data Intensive Biology


Large-scale data analysis has been solved*

• Cheaper in terms of hardware

• Easier in terms of productivity

• Built-in horizontal scaling

• Built-in fault tolerance

• Layered abstractions for data modeling

• Hadoop!

Page 66: Genomics Is Not Special: Towards Data Intensive Biology


Science on Hadoop

• ADAM project for genomics on Spark
  • http://bdgenomics.org/
• Guacamole for somatic variation on Spark
  • https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark
  • http://thefreemanlab.com/thunder/
• Quince for variant store on Impala
  • currently barebones, but with examples
  • https://github.com/laserson/quince

Page 67: Genomics Is Not Special: Towards Data Intensive Biology


Suggestions/resources

• Everyone should learn Python
  • (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)
  • GitHub enables easy collaboration
• See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:
  • Show people you know how to code: put your projects on GitHub
  • You should feel lucky if others will start using your code

Page 68: Genomics Is Not Special: Towards Data Intensive Biology


Page 69: Genomics Is Not Special: Towards Data Intensive Biology


Acknowledgements

• Cloudera
  • Sandy Ryza (Spark development)
  • Nong Li (Impala)
  • Skye Wanderman-Milne (Impala)
• Impala genomics collaborators
  • Kiran Mukhyala
  • Slaton Lipscomb
• ADAM project
  • Matt Massie
  • Frank Nothaft
  • Timothy Danford
• Mount Sinai School of Medicine
  • Jeff Hammerbacher (+ lab)
• Duke CHGV
  • Jonathan Keebler

Page 70: Genomics Is Not Special: Towards Data Intensive Biology

Thank you.