Accelerating Genomic Computations 1000X with Hardware
Prof. Bill Dally (Electrical Engineering and Computer Science)
Prof. Gill Bejerano (Computer Science, Developmental Biology and Pediatrics)
Yatish Turakhia, EE PhD candidate, Stanford University
DNA sequencing costs and data explosion
• Since 2003, genomics data doubling every 7 months!
• Exabyte data by 2025 – 100M to 2B genomes to be sequenced!
[Figure labels: 1st gen, 2nd gen, 3rd gen]
“Storing and processing genome data will exceed the computing challenges of running YouTube and Twitter, biologists warn.” [Nature News, 2015]
“The decreasing cost of sequencing and the increasing number of sequence reads being generated are placing greater demand on the computational resources and knowledge necessary to handle sequence data.” [Genome Biology, 2016]
Stephens, Zachary D., et al. "Big data: astronomical or genomical?." PLoS Biology (2015)
Genomic Granular Computing Applications
Neonatal ICU
• 4 million newborns per year in the US alone
• 1 in 33 newborns with rare genetic conditions admitted to NICU
• Time is of the essence for genome-based diagnosis

Prenatal ICU and IVF clinics
• Non-invasively diagnose over 3,000 rare genetic conditions (e.g. Down Syndrome)
• Free-floating DNA in blood – enormous volume!

Liquid Biopsy
• Early cancer detection – life-saving application for millions of individuals
• Non-invasive – circulating tumor DNA
• Periodic sequencing of healthy individuals – enormous volume!
Patient Diagnosis: Sample-to-answer
[Figure: sample-to-answer pipeline. Patient genome (3 billion base pairs) → genome sequencing machine → reads → read assembly → mutations → find the causal mutation.]

Reads:     ATGTCGAT CGATACGA GAGTCATC ACTGACGT
REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
PATIENT:   --ATGTCTATGATC--GAGGATATTAGGATAT-

• Long reads (>10Kbp) offer a better resolution of the mutation spectrum but have high error rate (15-40%)
• >1,300 CPU hours for reference-guided assembly of noisy long reads
  • 14.2M CPU-years for 100M individuals
• >15,600 CPU hours for de novo assembly of noisy long reads
  • 178M CPU-years for 100M individuals
Darwin: A Genomics Co-processor
High speed and programmability
1. D-SOFT: Tunable speed/precision to match any error profile
2. GACT: First algorithm with O(1) memory for compute-intensive step of alignment allowing arbitrarily long alignments in hardware – ideal for long reads
3. First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware
[Figure: a software aligner drives Darwin through the D-SOFT API and the GACT API. D-SOFT (filter) and GACT (aligner) are each illustrated on a Reference (R) vs. Query (Q) matrix.]
Darwin: 40nm ASIC configuration
[Figure: Darwin block diagram. Software talks to Darwin through the D-SOFT API and the GACT API; Darwin contains a D-SOFT unit, multiple GACT units, and LPDDR4 (32GB) memories.]

Darwin
Area: 300 mm²
Power: 9W

Software (Intel Xeon E5)
Algorithm    Power (1 thread)
BWA-MEM      9.2W
GraphMap     10.7W
DALIGNER     8.8W
GACT algorithm and hardware design
Strategies for long sequence alignment
Algorithm                Time     Space (compute-intensive step)   Optimal
Smith-Waterman           O(mn)    O(mn)                            Y
Hirschberg               O(mn)    O(m+n)                           Y
Banded Smith-Waterman    O(n)     O(n)                             N
X-drop                   O(n)     O(n)                             N
GACT                     O(n)     O(1)                             N

m, n: sequence lengths, m >= n

Profound hardware design implications
Prior assumptions (hardware):
• Small upper bound on sequence length n, OR
• Trace-back of alignment in software – SLOW!
Genome Alignment using Constant-memory Trace-back (GACT)
1. Initialize Icurr, Jcurr in R, Q
2. Form tile of maximum size T around Icurr, Jcurr in R, Q
3. Align tile with trace-back from Icurr, Jcurr with at most (T-D) steps
4. Update Icurr, Jcurr with trace-back end coordinates
5. Repeat 2-4 till extension no longer possible

Example (T = 5, D = 2)
Reference (R): GGCGACTTT
Query (Q):     GGTCGTTT

[Figure: Tile 1, Tile 2 and Tile 3 drawn on the R vs. Q matrix, with (Icurr, Jcurr) marked where each trace-back ends.]

Optimal Alignment, Score = 11
G G - C G A C T T T
| |   | |     | | |
G G T C G - - T T T

Alignment, Score = 11
G G - C G A C T T T
| |   | |     | | |
G G T C G - - T T T
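The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Darwin implementation: the function names `align_tile` and `gact_left_extend` are hypothetical, scoring is a simple linear scheme (match +1, mismatch -1, gap -1), each tile is scored with local (zero-clamped) dynamic programming, the tile is taken to end at (Icurr, Jcurr), and extension runs toward the start of both sequences.

```python
def align_tile(r_tile, q_tile, max_steps, match=1, mismatch=-1, gap=-1):
    """Score one tile, then trace back from its bottom-right cell.

    Returns (ops, i_off, j_off). ops is in reverse order:
    'M' = match/mismatch, 'D' = reference base against a gap,
    'I' = query base against a gap. i_off / j_off are the numbers of
    reference / query bases consumed by the trace-back.
    """
    rows, cols = len(r_tile), len(q_tile)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    ptr = [[None] * (cols + 1) for _ in range(rows + 1)]  # None = stop
    for a in range(1, rows + 1):
        for b in range(1, cols + 1):
            sub = match if r_tile[a - 1] == q_tile[b - 1] else mismatch
            best, move = 0, None
            for cand, mv in ((score[a - 1][b - 1] + sub, 'M'),
                             (score[a - 1][b] + gap, 'D'),
                             (score[a][b - 1] + gap, 'I')):
                if cand > best:
                    best, move = cand, mv
            score[a][b], ptr[a][b] = best, move

    ops, a, b = [], rows, cols
    # Stop at a zero-score cell or once (T - D) bases are consumed.
    while (ptr[a][b] is not None
           and rows - a < max_steps and cols - b < max_steps):
        mv = ptr[a][b]
        ops.append(mv)
        if mv in 'MD':
            a -= 1
        if mv in 'MI':
            b -= 1
    return ops, rows - a, cols - b


def gact_left_extend(ref, query, i_curr, j_curr, tile=5, overlap=2):
    """Extend an alignment from (i_curr, j_curr) toward the sequence starts."""
    ops = []
    while i_curr > 0 and j_curr > 0:
        i0, j0 = max(0, i_curr - tile), max(0, j_curr - tile)
        tile_ops, i_off, j_off = align_tile(
            ref[i0:i_curr], query[j0:j_curr], tile - overlap)
        if i_off == 0 and j_off == 0:  # extension no longer possible
            break
        ops.extend(tile_ops)
        i_curr, j_curr = i_curr - i_off, j_curr - j_off
    ops.reverse()
    return ops, i_curr, j_curr


ref, qry = "GGCGACTTT", "GGTCGTTT"
print(gact_left_extend(ref, qry, len(ref), len(qry), tile=5, overlap=2))
```

Only one tile's score and pointer matrices exist at any time, so the working memory is bounded by T x T regardless of sequence length; the returned list of operations is the output and still grows with the alignment.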
GACT empirically provides optimal alignments
• GACT tile size T=400
• GACT compared to optimal Smith-Waterman for 200,000 10Kbp sequences with 4 different error rates: 10%, 20%, 30% and 40%
• Simple scoring (match: +1, mismatch: -1, gap: -1)
• At D=120, all observed alignments were optimal

            Fraction alignments non-optimal     Worst-case score loss
D (in bp)   10%     20%     30%     40%         10%     20%     30%     40%
0           30.4%   61.0%   83.0%   94.7%       0.29%   0.67%   1.26%   2.38%
30          0.0%    0.02%   0.55%   55.3%       0.0%    0.35%   0.63%   1.59%
60          0.0%    0.0%    0.01%   1.38%       0.0%    0.0%    0.34%   0.81%
90          0.0%    0.0%    0.0%    0.05%       0.0%    0.0%    0.0%    0.33%
120         0.0%    0.0%    0.0%    0.0%        0.0%    0.0%    0.0%    0.0%
GACT Hardware-acceleration
[Figure: tile with T = 9 (reference AGGTCGGTA, query AGTCACTAT). The query is split into Query Block 1, Query Block 2 and Query Block 3; the hardware shows four PEs (PE 0 to PE 3), four SRAM banks, and trace-back (TB) logic.]

• Systolic array of Npe (= 4) processing elements (PEs) solves Smith-Waterman-Gotoh
• Tile with size T > Npe: query divided into blocks, reference streamed through each block
• Computation exploits wave-front parallelism
• On-chip SRAM for storing trace-back state (4 bits per cell)
• Total SRAM size = 4 bits x (Tmax)^2 => 128KB for Tmax = 512
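The SRAM figure follows from simple arithmetic, worked here as a short check. The function name `traceback_sram_bytes` is hypothetical and introduced only for illustration.

```python
def traceback_sram_bytes(t_max, bits_per_cell=4):
    """Bytes needed to hold one trace-back pointer per cell of a T x T tile."""
    total_bits = bits_per_cell * t_max * t_max
    return total_bits // 8


# 4 bits x 512 x 512 = 1,048,576 bits = 131,072 bytes = 128 KB
print(traceback_sram_bytes(512) // 1024, "KB")  # prints "128 KB"
```

Because the size depends only on the maximum tile size Tmax and not on the sequence length, the same SRAM serves alignments of any length.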
Darwin: GACT Performance
• Runtime scales linearly with sequence length
• 300-1000X faster than Edlib
• 10,000X faster than software implementation of GACT

[Figure: alignments/sec (log scale, 1 to 1,000,000) vs. sequence length (1 to 10 Kbp) for GACT (Software), Edlib and GACT (Darwin). Data labels: 574K, 108K, 54K; 302X, 591X, 986X; 35X, 19X, 11X.]
D-SOFT algorithm and hardware design
Seed Position table based exact matching
R = AGCTATACTA

Seed   Positions
AA
AC     6
AG     0
AT     4
CA
CC
CG
CT     2 7
GA
GC     1
GG
GT
TA     3 5 8
TC
TG
TT

Q = GCTA
Seed hits: GC: 1; CT: 2, 7; TA: 3, 5, 8

[Figure: seed hits plotted on R (ticks 1-8) and Q (ticks 0-3) axes; hits from a true match lie along a diagonal with slope = 1.]

For human genome, seed position table size > 12GB (4B x 3 x 10^9)
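The table above can be reproduced with a small dictionary-based sketch. The helper names `build_seed_table` and `seed_hits` are hypothetical, and a Python dictionary stands in for whatever table layout the hardware actually uses.

```python
from collections import defaultdict


def build_seed_table(ref, k):
    """Map every k-mer of ref to the sorted list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)
    return table


def seed_hits(table, query, k):
    """For each query offset, look up the exact-match positions in ref."""
    hits = []
    for off in range(len(query) - k + 1):
        seed = query[off:off + k]
        hits.append((off, seed, table.get(seed, [])))
    return hits


table = build_seed_table("AGCTATACTA", 2)
for offset, seed, positions in seed_hits(table, "GCTA", 2):
    print(offset, seed, positions)
# 0 GC [1]
# 1 CT [2, 7]
# 2 TA [3, 5, 8]

# Position entries alone for a 3 x 10^9 bp genome at 4 bytes each:
print(4 * 3 * 10**9 / 1e9, "GB")  # 12.0 GB
```

Hits that belong to the same true match share the same value of (R position - Q offset), which is why they fall on a slope-1 diagonal: here (1, 0), (2, 1) and (3, 2) all have difference 1.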
Diagonal-band Seed Overlapping based Filtration Technique (D-SOFT)
• Divide R into NB bins (diagonal bands)
• Use N seeds of size k bp from different offsets in Q
• Lookup positions of seeds in R and assign each seed hit to corresponding bin (diagonal band)
• Count non-overlapping Q base-pairs covered by seed hits for each bin and filter based on threshold h (same as DALIGNER)

[Figure: Reference (R) divided into Bin 1 to Bin 6 (diagonal bands) against Query (Q) offsets 1 to 10. Bin counts for Bin 1 to Bin 6: 6, 5, 9, 4, 0, 5.]

Example parameters: NB = 6, N = 10, k = 4, h = 7
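A minimal software sketch of the filtering steps listed above follows. It rests on several assumptions that are not taken from the slides: the function name `dsoft`, the bin index computed as floor((i - j) / bin_width), seeds taken at a fixed `stride` in increasing query offset, and a per-bin `last_hit` record used to avoid double-counting overlapping query bases.

```python
from collections import defaultdict


def dsoft(ref, query, k, bin_width, h, stride=1, n_seeds=None):
    """Return (per-bin covered-base counts, candidate (ref_pos, query_off) list)."""
    # Seed position table: k-mer -> start positions in ref.
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)

    bp_count = defaultdict(int)          # non-overlapping query bases per bin
    last_hit = defaultdict(lambda: -k)   # query offset of last seed hit per bin
    candidates = []

    offsets = list(range(0, len(query) - k + 1, stride))[:n_seeds]
    for j in offsets:
        for i in table.get(query[j:j + k], []):
            b = (i - j) // bin_width     # diagonal band of this hit
            overlap = max(0, last_hit[b] + k - j)
            last_hit[b] = j
            bp_count[b] += k - overlap
            # Report a bin once, at the hit that first reaches the threshold.
            if h <= bp_count[b] < h + k - overlap:
                candidates.append((i, j))
    return dict(bp_count), candidates


ref = "AGCTATACTAGGCATCGATTACG"
query = ref[5:17]                        # exact 12 bp substring of ref
counts, cands = dsoft(ref, query, k=4, bin_width=8, h=10)
print(counts, cands)  # {0: 12} [(11, 6)]
```

In this toy case every seed hit falls in the same diagonal band, so the band accumulates all 12 query bases and crosses h = 10 exactly once. Raising h trades sensitivity for fewer candidates passed on to the aligner.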
D-SOFT hardware-acceleration design
Area: 264 mm²
Power: 7.3W
• Random accesses to update bins using on-chip SRAM (bin count SRAM)
• Area and power both dominated by 64MB bin count SRAM
• Hardware exploits DRAM channel parallelism for seed position lookup
D-SOFT hardware-acceleration throughput
• ~2X speedup from parallel DRAM channels
• ~3X reduction in number of memory accesses to the DRAM
• All random memory accesses to update bins using on-chip SRAM (64MB)
• On-chip updates completely hide off-chip (DRAM) bandwidth

      Avg. hits per seed   Throughput (10^3 seeds/sec)   Darwin
k     (Human Genome)       Software     Darwin           speedup
11    1765                 7.9          760.6            96.3X
12    457                  29.1         2,796.2          96.1X
13    118                  136.1        9,126.3          67.1X
14    32                   339.0        21,271.1         62.7X
15    8                    784.3        34,166.7         43.5X
Long read assembly on Darwin
Darwin: Read assembly
[Figure panels: Reference-guided; De novo]
Darwin: Performance Results
Reference-guided (54X human genome)
Baseline: BWA-MEM (15%), GraphMap (30%, 40%)

Read Error   D-SOFT settings   Sensitivity          Speedup
Rate         (k, N, h)         Baseline   Darwin
15%          (14, 750, 24)     95.95%     99.91%    4,110X
30%          (12, 1000, 25)    98.11%     98.40%    4,088X
40%          (11, 1300, 22)    97.10%     97.40%    128X

De novo (54X human genome)
Baseline: DALIGNER

Read Error   D-SOFT settings   Sensitivity          Speedup
Rate         (k, N, h)         Baseline   Darwin    (Bottleneck)
15%          (14, 1300, 24)    99.80%     99.89%    264X
Thank you!
Questions or feedback?