adam

ADAM https://github.com/massie/adam Matt Massie University of California, Berkeley [email protected] Saturday, November 2, 13

Upload: matt-massie

Post on 07-May-2015


DESCRIPTION

Introductory talk on ADAM, a system for storing and analyzing genomic data using Avro, Parquet and Spark.

TRANSCRIPT

Page 1: ADAM

ADAM

https://github.com/massie/adam

Matt Massie

University of California, Berkeley

[email protected]

Saturday, November 2, 13

Page 2: ADAM

SAM BAM ADAM

Sequence Alignment Map (SAM)

Binary Alignment Map (BAM)

Avro Data Alignment Map (ADAM)


Page 3: ADAM

Pipeline Issues Today: Time and Scale

• The time to go from reads to answers is too long

• Processing thousands of BAM files for statistical analysis doesn’t scale


Page 4: ADAM

ADAM: Speed and Scale

• Read BAM once, perform transformations (e.g. sort, mark duplicates, BQSR) in distributed memory, write the analysis-ready ADAM file once

• Use a distributed filesystem (HDFS), a fast execution system (Spark) and a columnar data format (Parquet) to scale
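The "read once, transform in memory, write once" idea can be sketched in a few lines of Python. This is a single-process toy, not ADAM or Spark code: the `Read` class, the `run_pipeline` helper and the duplicate rule (same reference, start and strand as an earlier read) are simplifying assumptions made for illustration, and BQSR is left out entirely.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Read:
    # Hypothetical, heavily reduced stand-in for an aligned read.
    reference_name: str
    start: int
    negative_strand: bool
    duplicate: bool = False


def sort_reads(reads: List[Read]) -> List[Read]:
    # Sort by reference name, then by alignment start.
    return sorted(reads, key=lambda r: (r.reference_name, r.start))


def mark_duplicates(reads: List[Read]) -> List[Read]:
    # Simplified rule (an assumption): a read is a duplicate if an earlier
    # read has the same reference, start and strand.
    seen = set()
    marked = []
    for r in reads:
        key = (r.reference_name, r.start, r.negative_strand)
        marked.append(replace(r, duplicate=key in seen))
        seen.add(key)
    return marked


def run_pipeline(load: Callable[[], List[Read]],
                 transforms: List[Callable[[List[Read]], List[Read]]],
                 store: Callable[[List[Read]], None]) -> List[Read]:
    reads = load()              # the only read from storage
    for transform in transforms:
        reads = transform(reads)  # every stage works on in-memory data
    store(reads)                # the only write to storage
    return reads


def example_load() -> List[Read]:
    return [
        Read("chrom20", 300, False),
        Read("chrom20", 100, False),
        Read("chrom20", 100, False),
    ]


written: List[Read] = []
result = run_pipeline(example_load, [sort_reads, mark_duplicates], written.extend)
```

The point of the structure is that adding another stage to `transforms` adds no extra file reads or writes, which is where a chain of separate tools, each reading and writing BAM, loses time.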


Page 5: ADAM

Unlocking Genomic Data

[Diagram: ADAM files stored on the Hadoop Distributed File System (HDFS) or a local filesystem, alongside BAM, and accessed from Spark, Shark (SQL), Impala (SQL) and Hadoop M/R]


Page 6: ADAM

record ADAMRecord {
  union { null, string } referenceName = null;
  union { null, int } referenceId = null;
  union { null, long } start = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupId = null;

  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;

  union { null, string } mismatchingPositions = null;
  union { null, string } attributes = null;

  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;

  union { null, int } mateReferenceId = null;
}

http://avro.apache.org/
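One detail of the schema is worth spelling out: fields that default to null list null first in the union, while the flags that default to false list boolean first. That ordering follows the Avro rule, as stated in the specification of that period, that a union field's default must have the type of the union's first branch. The Python sketch below renders a handful of the fields as the equivalent JSON schema and checks that rule. The helper names (`default_branch`, `to_avsc_field`) are made up for this illustration and are not part of Avro or ADAM.

```python
import json

# (name, union branches, default) copied from a few ADAMRecord fields.
FIELDS = [
    ("referenceName", ["null", "string"], None),
    ("start", ["null", "long"], None),
    ("mapq", ["null", "int"], None),
    ("readMapped", ["boolean", "null"], False),
    ("duplicateRead", ["boolean", "null"], False),
]


def default_branch(default):
    # Map a Python default value to the Avro type name it represents.
    # Only the two kinds of default used in the record are handled.
    if default is None:
        return "null"
    if isinstance(default, bool):
        return "boolean"
    raise TypeError(f"unsupported default in this sketch: {default!r}")


def to_avsc_field(name, branches, default):
    # Enforce "default matches the first union branch" before emitting.
    if default_branch(default) != branches[0]:
        raise ValueError(
            f"{name}: default {default!r} does not match first branch {branches[0]!r}"
        )
    return {"name": name, "type": branches, "default": default}


schema = {
    "type": "record",
    "name": "ADAMRecord",
    "fields": [to_avsc_field(*field) for field in FIELDS],
}
schema_json = json.dumps(schema, indent=2)
```

Because every field has a default, a reader can request a subset of the fields and still construct complete records, which suits column projection in a columnar store.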


Page 7: ADAM

https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Parquet

Column-oriented layout

Row-oriented layout

http://parquet.io


Page 8: ADAM

Genomic Data Example

chrom20 TCGA 4M

chrom20 GAAT 4M1D

chrom20 CCGAT 5M

Row Oriented

chrom20 TCGA 4M chrom20 GAAT 4M1D chrom20 CCGAT 5M

Column Oriented

chrom20 chrom20 chrom20 TCGA GAAT CCGAT 4M 4M1D 5M
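The two layouts in this example can be reproduced with a short Python sketch. It also run-length encodes the reference column to hint at why storing similar values next to each other compresses well. The run-length step is only an illustration of the principle; it is not a description of how Parquet encodes data on disk.

```python
from itertools import groupby

# The three records from the slide: (reference, sequence, CIGAR).
rows = [
    ("chrom20", "TCGA", "4M"),
    ("chrom20", "GAAT", "4M1D"),
    ("chrom20", "CCGAT", "5M"),
]

# Row-oriented: all fields of record 1, then all fields of record 2, ...
row_oriented = [value for row in rows for value in row]

# Column-oriented: all references, then all sequences, then all CIGARs.
columns = list(zip(*rows))
column_oriented = [value for column in columns for value in column]


def run_length_encode(values):
    # Collapse each run of equal neighbouring values into (value, count).
    return [(value, len(list(group))) for value, group in groupby(values)]


# In the columnar layout the repeated reference name forms a single run.
reference_runs = run_length_encode(columns[0])

# A query that needs only CIGAR strings touches one column, not every record.
cigars = list(columns[2])
```

Here `reference_runs` collapses to a single `("chrom20", 3)` pair, whereas in the row-oriented list the three `chrom20` values are never adjacent and no run forms.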


Page 9: ADAM

http://spark.incubator.apache.org/


Page 10: ADAM

Low-Coverage BAM Experiment

• 14 GB low-coverage BAM with 145M reads

• 10-node EC2 cluster (m2.4xlarge)

• Reduced to 13 GB with ADAM

• Conversion and upload to HDFS took 22 minutes

• Sorted in 7 minutes


Page 11: ADAM

High-Coverage BAM Experiment

• Input: 237 GB NA12878 high-coverage, PCR-free, whole-genome BAM

• Conversion took 4 hours on EC2 m2.4xlarge (8 CPUs, 68.4 GB memory)

• Output size: 237 GB BAM reduced to 212 GB ADAM
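The figures quoted for the low-coverage and high-coverage experiments can be turned into rough ratios. The short Python calculation below uses only the numbers on the slides. Since those numbers are rounded, the results are approximate, and the function name is just a label chosen for this sketch.

```python
def percent_smaller(before_gb: float, after_gb: float) -> float:
    # Size reduction relative to the original file, in percent.
    return 100.0 * (before_gb - after_gb) / before_gb


# Low-coverage experiment: 14 GB BAM -> 13 GB ADAM.
low_coverage_saving = percent_smaller(14, 13)      # about 7.1 percent

# High-coverage experiment: 237 GB BAM -> 212 GB ADAM.
high_coverage_saving = percent_smaller(237, 212)   # about 10.5 percent

# Low-coverage sort: 145M reads in 7 minutes across the 10-node cluster.
reads_sorted_per_second = 145_000_000 / (7 * 60)   # about 345,000 reads/s
```

On these numbers the size reduction is modest, roughly 7 to 11 percent; the larger gains claimed in the talk come from processing speed and scale rather than from file size.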


Page 12: ADAM

Current Features

• Convert BAM to ADAM (read-oriented)

• Sort an ADAM file by reference

• Generate ADAMPileups

• Print mpileup output

• Very soon ADAM will be able to mark duplicates (initial benchmarks look good)


Page 13: ADAM

In progress...

• Frank is working on a distributed variant caller (https://github.com/fnothaft/avocado), local realignment, adam2bam

• Chris Hartl is integrating ADAM with GATK (https://github.com/chartl/GAParquet) DiagnoseTargets, adding new VCF formats to ADAM, BQSR

• Christos Kozanitis has been working on Shark and Impala integration for ad-hoc SQL read queries

• Collaborations with Mt. Sinai, GenomeBridge and the Broad Institute, which are interested in using ADAM
