
Accelerating Genomic Computations 1000X with Hardware

Prof. Bill Dally (Electrical Engineering and Computer Science)
Prof. Gill Bejerano (Computer Science, Developmental Biology and Pediatrics)

Yatish Turakhia, EE PhD candidate, Stanford University

DNA sequencing costs and data explosion


•  Since 2003, genomics data doubling every 7 months!

•  Exabyte data by 2025 – 100M to 2B genomes to be sequenced!

[Figure: sequencing cost and data growth across 1st-gen, 2nd-gen and 3rd-gen sequencing]

“Storing and processing genome data will exceed the computing challenges of running YouTube and Twitter, biologists warn.” [Nature News, 2015]

“The decreasing cost of sequencing and the increasing number of sequence reads being generated are placing greater demand on the computational resources and knowledge necessary to handle sequence data.” [Genome Biology, 2016]

Stephens, Zachary D., et al. "Big data: astronomical or genomical?." PLoS Biology (2015)

Genomic Granular Computing Applications

Neonatal ICU
•  4 million newborns per year in the US alone
•  1 in 33 newborns with rare genetic conditions admitted to NICU
•  Time is of the essence for genome-based diagnosis

Prenatal ICU and IVF clinics
•  Non-invasively diagnose over 3,000 rare genetic conditions (e.g. Down Syndrome)
•  Free-floating DNA in blood – enormous volume!

Liquid Biopsy
•  Early cancer detection – life-saving application for millions of individuals
•  Non-invasive – circulating tumor DNA
•  Periodic sequencing of healthy individuals – enormous volume!

Patient Diagnosis: Sample-to-answer

[Figure: sample-to-answer pipeline. Patient genome (3 billion base pairs) → genome sequencing machine → reads → read assembly → mutations → find the causal mutation]

Reads: ATGTCGAT CGATACGA GAGTCATC ACTGACGT

REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
PATIENT:   --ATGTCTATGATC--GAGGATATTAGGATAT-

•  Long reads (>10Kbp) offer a better resolution of the mutation spectrum but have high error rate (15-40%)
•  >1,300 CPU hours for reference-guided assembly of noisy long reads
•  14.2M CPU-years for 100M individuals
•  >15,600 CPU hours for de novo assembly of noisy long reads
•  178M CPU-years for 100M individuals


Darwin: A Genomics Co-processor

High speed and programmability
1.  D-SOFT: Tunable speed/precision to match any error profile
2.  GACT: First algorithm with O(1) memory for compute-intensive step of alignment allowing arbitrarily long alignments in hardware – ideal for long reads
3.  First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware

[Figure: Darwin overview. Labels: Software, Aligner, D-SOFT API, GACT API, Darwin, D-SOFT (filter), GACT (aligner). The D-SOFT and GACT panels each plot Reference (R) against Query (Q).]

Darwin: 40nm ASIC configuration

[Figure: Darwin block diagram. Software talks to the Darwin chip through the D-SOFT API and GACT API; the chip contains a D-SOFT unit and multiple GACT units, attached to LPDDR4 (32GB) memories.]

Area: 300mm²  Power: 9W

Software (Intel Xeon E5)

Algorithm    Power (1 thread)
BWA-MEM      9.2W
GraphMap     10.7W
DALIGNER     8.8W

GACT algorithm and hardware design


Strategies for long sequence alignment

Algorithm                Time     Space (compute-intensive step)   Optimal
Smith-Waterman           O(mn)    O(mn)                            Y
Hirschberg               O(mn)    O(m+n)                           Y
Banded Smith-Waterman    O(n)     O(n)                             N
X-drop                   O(n)     O(n)                             N
GACT                     O(n)     O(1)                             N

m, n: sequence lengths, m >= n

Profound hardware design implications

Prior assumptions (hardware): small upper bound on sequence length n, OR trace-back of alignment in software – SLOW!
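As a point of comparison for the table above, here is a minimal Smith-Waterman sketch in Python. The function name, the linear gap penalty and the scoring defaults are illustrative assumptions, not taken from the talk. The score itself only needs two rows at a time, but recovering the alignment requires a trace-back pointer for every cell, which is the O(mn) space listed for the compute-intensive step.

```python
def smith_waterman(r, q, match=1, mismatch=-1, gap=-1):
    """Return (best local score, trace-back pointer matrix).

    Assumed scoring: linear gaps, match/mismatch as given.
    Pointer codes: 0 = stop, 1 = diagonal, 2 = up, 3 = left.
    """
    rows, cols = len(r) + 1, len(q) + 1
    prev = [0] * cols                               # previous score row
    pointers = [[0] * cols for _ in range(rows)]    # O(mn) trace-back state
    best = 0
    for i in range(1, rows):
        curr = [0] * cols
        for j in range(1, cols):
            sub = match if r[i - 1] == q[j - 1] else mismatch
            cands = (0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = max(cands)
            pointers[i][j] = cands.index(curr[j])
            best = max(best, curr[j])
        prev = curr
    return best, pointers


score, ptr = smith_waterman("ACGT", "ACGT")
print(score, len(ptr), len(ptr[0]))
```

The pointer matrix grows with both sequence lengths, which is why earlier hardware designs either capped n or left trace-back to software.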

Genome Alignment using Constant-memory Trace-back (GACT)

[Figure: dynamic-programming matrix with sequences GGCGACTTT and GGTCGTTT on the Reference (R) and Query (Q) axes, covered by Tile 1, Tile 2 and Tile 3, with (Icurr, Jcurr) marked between tiles. T = 5, D = 2.]

1.  Initialize Icurr, Jcurr in R, Q
2.  Form tile of maximum size T around Icurr, Jcurr in R, Q
3.  Align tile with trace-back from Icurr, Jcurr with at most (T-D) steps
4.  Update Icurr, Jcurr with trace-back end coordinates
5.  Repeat 2-4 till extension no longer possible

Optimal alignment (Score = 11):

G G - C G A C T T T
| |   | |     | | |
G G T C G - - T T T

Alignment with T = 5, D = 2 (Score = 11):

G G - C G A C T T T
| |   | |     | | |
G G T C G - - T T T
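The five steps above can be sketched in Python as follows. This is a simplified illustration, not the Darwin implementation. It assumes linear gap scoring (match +1, mismatch -1, gap -1), starts at the ends of both sequences and extends leftwards, and always traces back from the bottom-right cell of each tile. The names `align_tile` and `gact` are hypothetical.

```python
def align_tile(r_tile, q_tile, max_steps, match=1, mismatch=-1, gap=-1):
    """Fill a Smith-Waterman matrix for one tile, then trace back from its
    bottom-right cell for at most max_steps steps. Returns the moves taken."""
    n, m = len(r_tile), len(q_tile)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [["stop"] * (m + 1) for _ in range(n + 1)]   # trace-back pointers
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if r_tile[i - 1] == q_tile[j - 1] else mismatch
            options = [(0, "stop"),
                       (score[i - 1][j - 1] + sub, "diag"),
                       (score[i - 1][j] + gap, "up"),      # gap in Q
                       (score[i][j - 1] + gap, "left")]    # gap in R
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    i, j, ops = n, m, []
    while len(ops) < max_steps and move[i][j] != "stop":
        op = move[i][j]
        ops.append(op)
        if op != "left":
            i -= 1
        if op != "up":
            j -= 1
    return ops


def gact(R, Q, T=5, D=2):
    """Tile-by-tile left extension; per-tile state is at most (T+1) x (T+1)."""
    assert T > D >= 0
    i_curr, j_curr = len(R), len(Q)                 # step 1
    row_r, row_q = [], []
    while i_curr > 0 and j_curr > 0:                # step 5
        r_tile = R[max(0, i_curr - T):i_curr]       # step 2
        q_tile = Q[max(0, j_curr - T):j_curr]
        ops = align_tile(r_tile, q_tile, T - D)     # step 3
        if not ops:                                 # no further extension
            break
        for op in ops:                              # step 4
            if op != "left":
                i_curr -= 1
            if op != "up":
                j_curr -= 1
            row_r.append(R[i_curr] if op != "left" else "-")
            row_q.append(Q[j_curr] if op != "up" else "-")
    return "".join(reversed(row_r)), "".join(reversed(row_q))


top, bottom = gact("GGCGACTTT", "GGTCGTTT", T=5, D=2)
print(top)
print(bottom)
```

The trace-back matrix never exceeds one tile, so the memory needed for the compute-intensive step does not depend on the sequence lengths. The scoring used here may differ from the one behind the slide's score of 11, so the printed alignment is not guaranteed to match the slide.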

GACT empirically provides optimal alignments

•  GACT tile size T=400
•  GACT compared to optimal Smith-Waterman for 200,000 10Kbp sequences with 4 different error rates: 10%, 20%, 30% and 40%
•  Simple scoring (match: +1, mismatch: -1, gap: -1)
•  At D=120, all observed alignments were optimal

             Fraction alignments non-optimal      Worst-case score loss
D (in bp)    10%     20%     30%     40%          10%     20%     30%     40%
0            30.4%   61.0%   83.0%   94.7%        0.29%   0.67%   1.26%   2.38%
30           0.0%    0.02%   0.55%   55.3%        0.0%    0.35%   0.63%   1.59%
60           0.0%    0.0%    0.01%   1.38%        0.0%    0.0%    0.34%   0.81%
90           0.0%    0.0%    0.0%    0.05%        0.0%    0.0%    0.0%    0.33%
120          0.0%    0.0%    0.0%    0.0%         0.0%    0.0%    0.0%    0.0%

GACT Hardware-acceleration

[Figure: systolic array of PE 0 to PE 3 with trace-back (TB) logic and one SRAM per PE. The query is split into Query Block 1, Query Block 2 and Query Block 3 and the reference is streamed through the array. Example sequences AGGTCGGTA and AGTCACTAT, T = 9.]

•  Systolic array of Npe (= 4) processing elements (PEs) solves Smith-Waterman-Gotoh
•  Tile with size T > Npe: query divided into blocks, reference streamed through each block
•  Computation exploits wave-front parallelism
•  On-chip SRAM for storing trace-back state (4 bits per cell)
•  Total SRAM size = 4 bits x (Tmax)^2 => 128KB for Tmax = 512
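The SRAM sizing on this slide can be checked with a few lines of Python. The helper name `traceback_sram_bytes` is an assumption introduced for illustration only.

```python
def traceback_sram_bytes(t_max, bits_per_cell=4):
    """Bytes of trace-back state for a t_max x t_max tile."""
    return t_max * t_max * bits_per_cell // 8


# 512 * 512 cells * 4 bits = 1,048,576 bits = 131,072 bytes = 128 KB
print(traceback_sram_bytes(512) // 1024, "KB")
```

Because the tile size is fixed, this figure stays the same no matter how long the aligned sequences are.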

Darwin: GACT Performance

•  Runtime scales linearly with sequence length
•  300-1000X faster than Edlib
•  10,000X faster than software implementation of GACT

[Figure: alignments/sec (log scale, 1 to 1,000,000) versus sequence length (1 to 10 Kbp) for GACT (Software), Edlib and GACT (Darwin). Data labels: 574K, 108K, 54K; 302X, 591X, 986X; 35X, 19X, 11X.]

D-SOFT algorithm and hardware design


Seed Position table based exact matching

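R = AGCTATACTA

Seed    Positions in R
AA
AC      6
AG      0
AT      4
CA
CC
CG
CT      2, 7
GA
GC      1
GG
GT
TA      3, 5, 8
TC
TG
TT

Q = GCTA

Seed positions: GC: 1; CT: 2, 7; TA: 3, 5, 8

[Figure: seed hits plotted with R on one axis (ticks 1 to 8) and Q on the other (ticks 0 to 3); matching hits line up on a diagonal with slope = 1.]

For human genome, seed position table size > 12GB (4B x 3 x 10^9)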
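A minimal Python sketch of the seed position table and lookup shown above, using the slide's example sequences. The function names `build_seed_table` and `seed_hits` are assumptions for illustration; a hardware table would be a flat array in DRAM rather than a dictionary.

```python
from collections import defaultdict


def build_seed_table(ref, k):
    """Map every k-mer in ref to the sorted list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)
    return table


def seed_hits(table, query, k):
    """Return (query offset, reference position) for every exact seed match."""
    return [(j, pos)
            for j in range(len(query) - k + 1)
            for pos in table.get(query[j:j + k], [])]


table = build_seed_table("AGCTATACTA", 2)
hits = seed_hits(table, "GCTA", 2)
print(hits)

# Table size estimate from the slide: 4 bytes per position, 3 x 10^9 positions.
print(4 * 3 * 10**9 / 1e9, "GB")
```

For R = AGCTATACTA and Q = GCTA the hits (0, 1), (1, 2) and (2, 3) share the diagonal i - j = 1, which corresponds to the slope-1 line in the figure.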

Diagonal-band Seed Overlapping based Filtration Technique (D-SOFT)

•  Divide R into NB bins (diagonal bands)
•  Use N seeds of size k bp from different offsets in Q
•  Look up positions of seeds in R and assign each seed hit to corresponding bin (diagonal band)
•  Count non-overlapping Q base-pairs covered by seed hits for each bin and filter based on threshold h (same as DALIGNER)

[Figure: Reference (R) against Query (Q), divided into diagonal bands Bin 1 to Bin 6, with seeds numbered 1 to 10. Per-bin counts: 6, 5, 9, 4, 0, 5.]

NB = 6, N = 10, k = 4, h = 7
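The filtering steps listed above can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the Darwin implementation: bins are indexed by diagonal (reference position minus query offset) divided by a `bin_width` parameter, seeds are taken at a fixed `stride`, and the function returns the bins that reach the threshold. `dsoft`, `bin_width` and `stride` are hypothetical names; the slide specifies the number of bins NB rather than a bin width.

```python
from collections import defaultdict


def dsoft(ref, query, k, n_seeds, h, bin_width, stride=1):
    """Return the diagonal-band bins whose non-overlapping covered
    query bases reach the threshold h."""
    # Seed position table: k-mer -> start positions in ref.
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)

    bp_count = defaultdict(int)   # covered query bases per bin
    last_end = {}                 # per bin: query offset just past the last hit
    offsets = range(0, len(query) - k + 1, stride)[:n_seeds]
    for j in offsets:
        for i in table.get(query[j:j + k], []):
            b = (i - j) // bin_width              # diagonal band of this hit
            overlap = max(0, last_end.get(b, 0) - j)
            bp_count[b] += k - overlap            # count only new query bases
            last_end[b] = j + k
    return sorted(b for b, count in bp_count.items() if count >= h)


print(dsoft("AGCTATACTA", "GCTA", k=2, n_seeds=3, h=4, bin_width=4))
```

Counting covered bases rather than raw seed hits means overlapping seeds in the same band are not double counted, so the threshold h reflects how much of the query a band actually explains.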

D-SOFT hardware-acceleration design

Area: 264 mm²

Power: 7.3W

•  Random accesses to update bins using on-chip SRAM (bin count SRAM)
•  Area and power both dominated by 64MB bin count SRAM
•  Hardware exploits DRAM channel parallelism for seed position lookup

D-SOFT hardware-acceleration throughput

•  ~2X speedup from parallel DRAM channels
•  ~3X reduction in number of memory accesses to the DRAM
•  All random memory accesses to update bins using on-chip SRAM (64MB)
•  On-chip updates completely hide off-chip (DRAM) bandwidth

      Avg. hits per seed    Throughput (10^3 seeds/sec)
k     (Human Genome)        Software     Darwin         Darwin speedup
11    1765                  7.9          760.6          96.3X
12    457                   29.1         2,796.2        96.1X
13    118                   136.1        9,126.3        67.1X
14    32                    339.0        21,271.1       62.7X
15    8                     784.3        34,166.7       43.5X

Long read assembly on Darwin


Darwin: Read assembly

[Figure: read assembly on Darwin, with one panel for reference-guided assembly and one for de novo assembly]

Darwin: Performance Results

Reference-guided (54X human genome)
Baseline: BWA-MEM (15%), GraphMap (30%, 40%)

Read error    D-SOFT settings     Sensitivity    Sensitivity
rate          (k, N, h)           (Baseline)     (Darwin)       Speedup
15%           (14, 750, 24)       95.95%         99.91%         4,110X
30%           (12, 1000, 25)      98.11%         98.40%         4,088X
40%           (11, 1300, 22)      97.10%         97.40%         128X

De novo (54X human genome)
Baseline: DALIGNER

Read error    D-SOFT settings     Sensitivity    Sensitivity    Speedup
rate          (k, N, h)           (Baseline)     (Darwin)       (Bottleneck)
15%           (14, 1300, 24)      99.80%         99.89%         264X

Thank you!

Questions or feedback?
