SNAP: Fast, accurate sequence alignment enabling biological applications Ravi Pandya, Microsoft Research ASHG 10/19/2014


Page 1

SNAP: Fast, accurate sequence alignment enabling biological applications

Ravi Pandya, Microsoft Research

ASHG 10/19/2014

Page 2

SNAP

SNAP is fast*
  Align 50x genome in 1.2 hours (BWA-MEM = 11.75 hours)
  Sort + index + markdup BAM in 2 hours (samtools + sambamba = 4.25 hours)

SNAP is as accurate as BWA-MEM, Bowtie2, etc.
  ROC on simulated data
  % aligned on real data
  Variant calls on real data

* NA12878:ERR194147, Azure D14 (16 cores, 112GB RAM, 800GB SSD)

Page 3

Sequence alignment

The problem:
  Given a read R and a reference genome G,
  find the position p in G that minimizes EditDistance(R, G[p .. p + |R|])

SNAP solves this quickly and accurately because of:
  Efficient system architecture
  Reducing the number of comparisons
  Reducing the cost of comparisons
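
To make the problem statement concrete, here is a minimal brute-force sketch that scores every position p in G. It is only a baseline for illustration, not how SNAP works: the function names are hypothetical, and the quadratic cost per position is exactly what the techniques on the following slides avoid.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def brute_force_align(read: str, genome: str) -> tuple[int, int]:
    """Return (p, distance) minimizing EditDistance(read, genome[p:p+len(read)]).

    Ties are broken in favor of the leftmost position.
    """
    best_pos, best_dist = -1, len(read) + 1
    for p in range(len(genome) - len(read) + 1):
        d = edit_distance(read, genome[p:p + len(read)])
        if d < best_dist:
            best_pos, best_dist = p, d
    return best_pos, best_dist


genome = "ACGTACGTTTGACCA"                       # toy reference, not real data
print(brute_force_align("TTGACC", genome))       # exact match
print(brute_force_align("TTGTCC", genome))       # one substitution
```

For a read of length n and a genome of length g this does on the order of g * n^2 work, which is why reducing both the number and the cost of comparisons matters.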

Page 4

System architecture

[Figure: pipeline diagram. Labels: full, empty, async read, align, sort, async write, temp file, merge sort, mark duplicates, index, compress]
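
The stage labels in the diagram suggest reading, aligning, sorting, and writing overlap in time. The sketch below shows one way such an overlap can be arranged with bounded queues; it is an illustration under assumptions, not SNAP's implementation. `run_pipeline`, `num_buffers`, and the use of an in-memory list in place of temp files are all hypothetical.

```python
import heapq
import queue
import threading


def run_pipeline(batches, align, num_buffers=4):
    """Overlap reading, aligning + sorting, and writing with bounded queues."""
    to_align = queue.Queue(maxsize=num_buffers)   # filled input buffers
    to_write = queue.Queue(maxsize=num_buffers)   # aligned, sorted batches
    sorted_runs = []                              # stand-in for temp files
    done = object()                               # end-of-stream marker

    def reader():
        for batch in batches:
            to_align.put(batch)                   # blocks when all buffers are full
        to_align.put(done)

    def aligner():
        while (batch := to_align.get()) is not done:
            to_write.put(sorted(align(read) for read in batch))
        to_write.put(done)

    def writer():
        while (run := to_write.get()) is not done:
            sorted_runs.append(run)               # assumption: one sorted run per batch

    threads = [threading.Thread(target=f) for f in (reader, aligner, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Final merge of the sorted runs, standing in for the merge sort stage.
    return list(heapq.merge(*sorted_runs))


# Toy usage: "aligning" a read just returns its length.
print(run_pipeline([["ccc", "a"], ["bb", "dddd"]], align=len))
```

The bounded queues give back-pressure: a fast reader cannot run arbitrarily far ahead of the aligner, so memory use stays proportional to `num_buffers`.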

Page 5

The sequence alignment problem

The easy part:

97% of 20-mers in the human genome occur only once, but at only 75% of locations

The hard part:

The other 3% of 20-mers and 25% of locations
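
The two percentages above measure different things: the share of distinct k-mers that are unique, and the share of genome positions covered by a unique k-mer. A small sketch on a toy string shows how the two can diverge. It assumes forward-strand k-mers only, and `kmer_uniqueness` is a hypothetical helper.

```python
from collections import Counter


def kmer_uniqueness(genome: str, k: int) -> tuple[float, float]:
    """Return (unique share of distinct k-mers, unique share of positions)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    # A k-mer that occurs once occupies exactly one position, so the same
    # count is the numerator of both fractions; only the denominators differ.
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts), unique / len(kmers)


by_kmer, by_position = kmer_uniqueness("ACGTACGTAAAC", k=4)
print(f"{by_kmer:.0%} of distinct 4-mers are unique")
print(f"{by_position:.0%} of positions hold a unique 4-mer")
```

Repeated k-mers are few as a share of distinct k-mers but each one occupies many positions, which is why the position-based share is the lower of the two.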

[Figure: CDF of per-read/pair alignment time, NA18705, 169M pairs (using deeper search parameters than current defaults). Axis ticks: 0% to 100%. Series: Single Equally Weighted, Paired Equally Weighted, Single Time Weighted, Paired Time Weighted. Annotations: 10% of reads, 95% of time.]

Bill Bolosky, MSR

Page 6

Hash table lookup

Build a multi-valued map (~30GB for hg19) from all seeds S in G to all locations of S in G

For all seeds in read, all locations of seed in genome:
  Score implied alignment of read, keep the best
  330 reads/s

Ignore frequent seeds (>300 occurrences)
Only use a few seeds/read
  14k reads/s (42x)

Bill Bolosky, MSR
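
The lookup scheme on this slide can be sketched with an ordinary dictionary. This is an illustrative toy under assumptions, not SNAP's index layout: the helper names, the non-overlapping seed positions, and the parameters `max_occurrences` and `max_seeds` are hypothetical stand-ins for "ignore frequent seeds" and "only use a few seeds/read".

```python
from collections import defaultdict


def build_seed_index(genome: str, seed_len: int) -> dict[str, list[int]]:
    """Multi-valued map: seed string -> every position where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return dict(index)


def candidate_locations(read, index, seed_len, max_occurrences, max_seeds):
    """Map each implied read start position to its number of seed hits."""
    hits = defaultdict(int)
    seeds_used = 0
    # Assumption: seeds are taken at non-overlapping offsets in the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        if seeds_used == max_seeds:
            break
        locations = index.get(read[offset:offset + seed_len], [])
        if not locations or len(locations) > max_occurrences:
            continue                       # absent or too frequent: skip it
        seeds_used += 1
        for loc in locations:
            start = loc - offset           # read start implied by this seed hit
            if start >= 0:
                hits[start] += 1
    return dict(hits)


genome = "ACGTTGCAACGTAGGCTTAA"            # toy reference, not real data
index = build_seed_index(genome, seed_len=4)
read = genome[8:20]
print(candidate_locations(read, index, 4, max_occurrences=2, max_seeds=3))
print(candidate_locations(read, index, 4, max_occurrences=1, max_seeds=3))
```

Lowering `max_occurrences` drops the repeated seed "ACGT" and with it the spurious candidate at position 0, at the cost of one fewer supporting hit for the true location.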

Page 7

Fast scoring

O(n^2) to Ukkonen O(nd), n = len, d = min(limit, actual)
  Use limit = best score so far + 2 (for MAPQ)
  92k reads/s (6.6x)

Sort candidates by # of seed hits

Skip locations with # seed misses > limit

Throughput after the remaining steps: 113k reads/s (1.2x), then 154k reads/s (1.4x; 470x overall)

Bill Bolosky, MSR
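
The idea of bounding the edit distance by a limit d can be sketched with a banded dynamic program: only cells within d of the diagonal are filled, so the work is O(nd) rather than O(n^2). This is a simplified stand-in in the spirit of Ukkonen's bound, not SNAP's scoring code, and both function names and `max_limit` are hypothetical.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, limit: int) -> Optional[int]:
    """Edit distance of a and b if it is <= limit, otherwise None."""
    n, m = len(a), len(b)
    if abs(n - m) > limit:
        return None
    inf = limit + 1                        # any value above limit means "too far"
    width = 2 * limit + 1                  # band index k + limit, with j = i + k
    prev = [inf] * width
    for k in range(0, min(limit, m) + 1):
        prev[k + limit] = k                # row 0: D[0][j] = j
    for i in range(1, n + 1):
        curr = [inf] * width
        for k in range(-limit, limit + 1):
            j = i + k
            if j < 0 or j > m:
                continue
            if j == 0:
                best = i                   # delete the first i characters of a
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                best = prev[k + limit] + cost                  # diagonal
                if k + 1 <= limit:
                    best = min(best, prev[k + 1 + limit] + 1)  # from D[i-1][j]
                if k - 1 >= -limit:
                    best = min(best, curr[k - 1 + limit] + 1)  # from D[i][j-1]
            curr[k + limit] = min(best, inf)
        if min(curr) > limit:
            return None                    # whole band already exceeds the limit
        prev = curr
    result = prev[(m - n) + limit]
    return result if result <= limit else None


def score_candidates(read, genome, starts, max_limit):
    """Score candidate starts in order, tightening the limit as in the slide."""
    best = None                            # (start, distance)
    limit = max_limit
    for start in starts:
        window = genome[start:start + len(read)]
        d = bounded_edit_distance(read, window, limit)
        if d is not None and (best is None or d < best[1]):
            best = (start, d)
            limit = min(max_limit, d + 2)  # best score so far + 2
    return best


genome = "ACGTTGCAACGTAGGCTTAA"            # toy reference, not real data
print(score_candidates("ACGTAGGCTTAA", genome, starts=[8, 0], max_limit=4))
```

Scoring the most promising candidate first tightens the limit early, so weaker candidates are rejected after filling a narrower band.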

Page 8

Paired-end alignment

Find & score candidate location pairs
  C(R1:R2) = C(R1) ∩ C(R2) {± insert size}
  Enumerate in O(h log n), where h = |C(R1) ∩ C(R2)| and n = |C(R1)| + |C(R2)|

Increases accuracy by allowing a much higher limit on seed occurrences (e.g. 4k vs 300)

Bill Bolosky, MSR
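
The insert-size intersection can be sketched with sorted candidate lists and binary search. This simple version walks every candidate of R1 and bisects into the candidates of R2, so it costs O(|C(R1)| log |C(R2)|) plus the output size; it does not reach the O(h log n) bound quoted on the slide. The function name, the gap parameters, and the assumption that R1 lies upstream of R2 are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def paired_candidates(c1, c2, min_gap, max_gap):
    """Pairs (p1, p2) with min_gap <= p2 - p1 <= max_gap.

    c1 and c2 are sorted candidate start positions for the two mates.
    """
    pairs = []
    for p1 in c1:
        lo = bisect_left(c2, p1 + min_gap)    # first p2 that is far enough
        hi = bisect_right(c2, p1 + max_gap)   # one past the last p2 close enough
        pairs.extend((p1, p2) for p2 in c2[lo:hi])
    return pairs


c1 = [100, 5000, 90000]                       # toy candidate positions
c2 = [350, 5900, 70000]
print(paired_candidates(c1, c2, min_gap=200, max_gap=500))
print(paired_candidates(c1, c2, min_gap=200, max_gap=1000))
```

Because only pairs that agree on insert size are scored, each mate can afford a much longer candidate list, which is what makes the higher seed-occurrence limit practical.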

Page 9

Results: simulated data

Mason-generated paired-end 100bp reads

Page 10

Results: real data

NA18507 (Illumina HiSeq 50x)

* AWS cr1.8xlarge (32 cores, 244GB RAM, 2x120GB SSD)

Page 11

Results: GATK variant calls

Broad GATK pipeline, curated NA12878 variant calls

Page 12

Results: NIST Genome-in-a-Bottle

Appistry GATK pipeline, GIAB highly confident calls

Longer seeds are much faster, similar precision/recall

[Figure: chart not preserved; surviving data label: 11.75]

ERR194147*.fastq.gz, Azure D14 (16 cores, 112GB RAM, 800GB SSD)

Page 13

Results: NIST Genome-in-a-Bottle

Lower confidence calls (qual>20, 2 platforms)

Highly confident
Aligner    Indel recall   Indel precision   SNP recall   SNP precision
bwa-mem    97.24%         97.15%            99.57%       99.65%
snap-20    97.04%         97.48%            99.51%       99.57%
snap-24    97.04%         97.46%            99.52%       99.57%
snap-28    97.04%         97.45%            99.53%       99.57%
snap-32    97.00%         97.41%            99.51%       99.57%

Lower confidence
Aligner    Indel recall   Indel precision   SNP recall   SNP precision
bwa-mem    96.38%         96.30%            99.00%       99.32%
snap-20    96.17%         96.68%            98.94%       99.25%
snap-24    96.17%         96.67%            98.95%       99.23%
snap-28    96.16%         96.62%            98.96%       99.21%
snap-32    96.11%         96.55%            98.94%       99.17%

Page 14

Pathogen ID: SURPI (Charles Chiu, UCSF)

“This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete, Chiu said.”

Page 15

SURPI

SNAP enables SURPI with:
  Fast filtering mode
  64-bit index for >40GB ntDB
  Secondary mapping output

Charles Chiu, UCSF

Page 16

Acknowledgements

Microsoft Research: Bill Bolosky, Ravi Pandya
UC San Francisco: Taylor Sittler
Broad Institute: Christopher Hartl
UC Berkeley AMPLab: Matei Zaharia, Kristal Curtis, Armando Fox, Scott Shenker, Ion Stoica, David Patterson

Binaries, source, documentation (Apache 2.0 licensed): http://snap.cs.berkeley.edu