hive algorithm for next-generation sequencing analysis h i v emaintains a bloom lookup table where...

1
HIVE Algorithm for Next-Generation Sequencing Analysis Alin Voskanian 1 , Luis Santana-Quintero 1 , Raja Mazumder 2 , Jean et Danielle Thierry-Mieg 3 , Hayley Dingerdissen 1,2 , Vahan Simonyan 1 1 Food and Drug Administration, Center for Biologics Evaluation and Research, Rockville, MD, 20852 2 Department of Biochemistry and Molecular Biology, The George Washington University, Washington, DC, 20037 3 National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894 H I E V The High-performance Integrated Virtual Environment (HIVE) system, a cloud-based environment optimized for the storage and analysis of extra-large data, presents a novel algorithmic solution in the form of the dna- hexagon DNA-seq aligner. dna-hexagon implements a number of novel algorithmic paradigms to take full advantage of the nature of the underlying sequence data and to exploit CPU, RAM and I/O architecture to quickly compute accurate alignments in a massively parallel execution environment. Dynamic programing matrix linearization HIVE-hexagon implements a floating diagonal approach where the diagonal of the computation is maintained along the two sides of the current highest scoring path of the matrix. (A) We assume the optimal alignment will be along the diagonal. (B) Multiple insertions or deletions can result in the optimal path traversing outside the defined diagonal belt. (C) By defining a constant width to the diagonal and pre-computing cell scores line by line, the limits of the remaining diagonal matrix can float along with the likely optimal scoring path. (D) Furthermore, pre-computation of cells in the diagonal line by line allows the exclusion of regions which cannot possibly contain the highest scoring path. (E) The resultant minimized, dynamic diagonal matrix is linearized to further simplify the process and minimize the memory footprint required. Workflow for HIVE-hexagon alignment utility Overall alignment schema for dna-hexagon: short reads are non-redundified (A) and parallel portions (B) are sent to distributed cloud for computation. Reference genomes are then split into smaller pieces (C) and compiled into bloom/hash table (D). Every parallel execution thread performs a K-mer lookup against every reference sequence (E) then extends matches via inexact alignments (F) and performs a subsequent optimal alignment search on remaining candidates (G). Optimal alignment search optimization schema (A) Dynamic programming matrix Needleman-Wunsch or Smith-Waterman algorithms use a two dimensional rectangular matrix where cells represent the cumulative score of the alignment between short read (horizontal) and reference genome(vertical). (B,C) The fuzzy extension algorithm allows accurate definition of the alignment frame and squeezes the window where the high scoring alignments might be discovered. (D) dna-hexagon maintains a bloom lookup table where each K-mer is represented only by a single bit signifying the presence or absence of that K-mer and 2-na hash table where a sequence’s binary numeric representation is used as an index. Non-redundification: removal of identical reads This is a representation of 6 sequences in a tree. The tree is composed by linking all sequences’ tails (suffixes) to the parent nodes which represent the longest prefixes common to each branch. It contains all the information of the 6 sequences, including position and length, at each node. Once the tree is completed, we can traverse the tree in a left maze order (always taking a left path) to obtain the sorted list of elements as: S = {4,5,1,3,6,7,2}, which refers to the sorted list AGAC, AGACT, AGTAC, CCGGA, GTAGA, GTCTCA, TAGC. Motivation: Modern next-generation sequencing platforms generate a large number of short reads mapped to the same genomic position with low error rate, resulting in a high degree of redundancy. Approach: Innovative usage of redundancy identification and self-similarity between reads allows minimization of the memory footprint by removing repeats and by allowing a higher rate of vertical compression. Novel prefix-tree algorithms used to discover self-similarity decrease the number of individual reads in the alignment thereby accelerating the alignment process. Modifications to the optimal alignment searching Smith-Waterman allow the diagonal to float together with the running high scoring path, thus taking full advantage of the dynamic matrices diagonalization method. A refined approach to conventional heuristic seeding involving high levels of parallelization minimizes data transfer and removes most of the I/O bottlenecks. Accuracy and Reproducibility: dna-hexagon mapping is up to 99% accurate and is capable of reproducing results on par with other industry-standard tools. Speed: HIVE speed is competitive with other industry-standard systems. The dna-hexagon aligner can align 31GB of Human mRNA_seq reads to a 45MB genome in 23 minutes. NON-REDUNDIFICATION OPTIMAL ALIGNMENT ACKNOWLEDGEMENTS Konstantin Chumakov, PhD, Associate Director for Research, Office of Vaccines Research and Review, FDA CBER Carolyn A. Wilson, PhD, Associate Director for Research, Office of the Center Director, FDA CBER Implementation: blueHIVE technology group [email protected] (Mazumder and Simonyan groups) Reference sequences (C) serialization Short reads 1 0 1 0 1 1 0 1 1 1 1 (D) bloom/hash table compilation (A) non-redundification (E)K-mer lookup (F) fuzzy extension (G) optimal alignment search A C G T Pos= 1 Len = 2 P=3 L=5 P=6 L=2 P=2 L=4 P=1 L=5 P=4 L=4 P=5 L=5 P=6 L=5 P=7 L=6 Sequences in the order of appearance: Sorted order = {4, 5, 1, 3, 6, 7, 2} A G C C G G A G T T A G C Root A C T A C T A G A C T C A 1. AGTAC 2. TAGC 3. CCGGA 4. AGAC 5. AGACT 6. GTAGA 7. GTCTCA Short read sequence (size S) Reference sequence size R 1 0=AAA . . . AA 12,23,45,... 1 1=AAA . . . AC 24,322,3432,... 0 2=AAA . . . AT 1 n-1=CAA . . . AA 356,3434,23328,... 1 n=TTT . . . TT 123,3782,12332,... Bloom K-mer seed occurrence position buckets ATGCaCGAG.CGTATGCAG ATGCtCGAGACGTAT-CAG (C) Score along alignment position ATGCaCGAG.CGTATGCAG ||||.||||-|||||-||| ATGCaCGAGACGTAT-CAG (A) Rectangular matrix (B) Squared matrix (D) Bloom and hash tables (D) Armenian Cheese Diagonal Short read sequence Reference sequence (A)Diagonal optimization (B) Multiple In-Del alignment problem X (C) Floating Diagonal (E) Final Dynamic Matrix CATAGTGGACACTG CATAGTGGATAATA AAAAAA . . . AAA AAAAAA . . . AAC . . . TTTTTT . . . TTG TTTTTT . . . TTT Similarity measure shown in gray Reference sequence Candidate positions found by seeds for the first sequence (blue) and next sequence (red) Seed hit extension sizes shown as white boxes on a reference frame X X X Pool of optimal alignment candidates In this example, the first sequence CATAGTGGACACTG has generated 5 possible hit candidate positions (blue arrows), but only three of those have failed to extend or align long enough (red “X”). The next sequence, CATAGTGGATAATA, having a significant length of prefix similar to the prior one, does not need to consider all candidate positions, specifically those where the extension attempt failed with a wide margin of error.

Upload: others

Post on 14-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HIVE Algorithm for Next-Generation Sequencing Analysis H I V Emaintains a bloom lookup table where each K-mer is represented . only by a single bit signifying the presence or absence

HIVE Algorithm for Next-Generation Sequencing Analysis

Alin Voskanian1, Luis Santana-Quintero1, Raja Mazumder2, Jean et Danielle Thierry-Mieg3, Hayley Dingerdissen1,2, Vahan Simonyan1

1Food and Drug Administration, Center for Biologics Evaluation and Research, Rockville, MD, 20852 2Department of Biochemistry and Molecular Biology, The George Washington University, Washington, DC, 20037

3National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894

H I E V

The High-performance Integrated Virtual Environment (HIVE) system, a cloud-based environment optimized for the storage and analysis of extra-large data, presents a novel algorithmic solution in the form of the dna-hexagon DNA-seq aligner. dna-hexagon implements a number of novel algorithmic paradigms to take full advantage of the nature of the underlying sequence data and to exploit CPU, RAM and I/O architecture to quickly compute accurate alignments in a massively parallel execution environment.

Dynamic programing matrix linearization HIVE-hexagon implements a floating diagonal approach where the diagonal of the computation is maintained along the two sides of the current highest scoring path of the matrix. (A) We assume the optimal alignment will be along the diagonal. (B) Multiple insertions or deletions can result in the optimal path traversing outside the defined diagonal belt. (C) By defining a constant width to the diagonal and pre-computing cell scores line by line, the limits of the remaining diagonal matrix can float along with the likely optimal scoring path. (D) Furthermore, pre-computation of cells in the diagonal line by line allows the exclusion of regions which cannot possibly contain the highest scoring path. (E) The resultant minimized, dynamic diagonal matrix is linearized to further simplify the process and minimize the memory footprint required.

Workflow for HIVE-hexagon alignment utility Overall alignment schema for dna-hexagon: short reads are non-redundified (A) and parallel portions (B) are sent to distributed cloud for computation. Reference genomes are then split into smaller pieces (C) and compiled into bloom/hash table (D). Every parallel execution thread performs a K-mer lookup against every reference sequence (E) then extends

matches via inexact alignments (F) and performs a subsequent optimal alignment search on remaining candidates (G).

Optimal alignment search optimization schema (A) Dynamic programming matrix Needleman-Wunsch or Smith-Waterman algorithms use a two dimensional rectangular matrix where cells represent the cumulative score of the alignment between short read (horizontal) and reference genome(vertical). (B,C) The fuzzy extension algorithm allows accurate definition of the alignment frame and squeezes the window where the high scoring alignments might be discovered. (D) dna-hexagon maintains a bloom lookup table where each K-mer is represented only by a single bit signifying the presence or absence of that K-mer and 2-na hash table where a sequence’s binary numeric representation is used as an index.

Non-redundification: removal of identical reads This is a representation of 6 sequences in a tree. The tree is composed by linking all sequences’ tails (suffixes) to the parent nodes which represent the longest prefixes common to each branch. It contains all the information of the 6 sequences, including position and length, at each node. Once the tree is completed, we can traverse the tree in a left maze order (always taking a left path) to obtain the sorted list of elements as: S = {4,5,1,3,6,7,2}, which refers to the sorted list AGAC, AGACT, AGTAC, CCGGA, GTAGA, GTCTCA, TAGC.

Motivation: Modern next-generation sequencing platforms generate a large number of short reads mapped to the same genomic position with low error rate, resulting in a high degree of redundancy. Approach: Innovative usage of redundancy identification and self-similarity between reads allows minimization of the memory footprint by removing repeats and by allowing a higher rate of vertical compression. Novel prefix-tree algorithms used to discover self-similarity decrease the number of individual reads in the alignment thereby accelerating the alignment process. Modifications to the optimal alignment searching Smith-Waterman allow the diagonal to float together with the running high scoring path, thus taking full advantage of the dynamic matrices diagonalization method. A refined approach to conventional heuristic seeding involving high levels of parallelization minimizes data transfer and removes most of the I/O bottlenecks. Accuracy and Reproducibility: dna-hexagon mapping is up to 99% accurate and is capable of reproducing results on par with other industry-standard tools. Speed: HIVE speed is competitive with other industry-standard systems. The dna-hexagon aligner can align 31GB of Human mRNA_seq reads to a 45MB genome in 23 minutes.

NO

N-R

EDUN

DIFI

CA

TION

O

PTIMA

L ALIG

NM

ENT

ACKNOWLEDGEMENTS Konstantin Chumakov, PhD, Associate Director for Research, Office of Vaccines Research and Review, FDA CBER Carolyn A. Wilson, PhD, Associate Director for Research, Office of the Center Director, FDA CBER Implementation: blueHIVE technology group [email protected] (Mazumder and Simonyan groups)

Reference sequences

(C) serialization

Short reads

1 0 1 0 1 1 0 1 1 1 1

(D) bloom/hash table compilation

(A) non-redundification (E)K-mer lookup

(F) fuzzy extension

(G) optimal alignment search

A C G T

Pos= 1 Len = 2

P=3 L=5

P=6 L=2

P=2 L=4

P=1 L=5

P=4 L=4

P=5 L=5

P=6 L=5

P=7 L=6

Sequences in the order of

appearance:

Sorted order = {4, 5, 1, 3, 6, 7, 2}

A G

C C G G A

GT

T A G C

Root

A C

T A C

T

A G A

C T C A

1. AGTAC 2. TAGC 3. CCGGA 4. AGAC 5. AGACT 6. GTAGA 7. GTCTCA

Short read sequence (size S)

Reference sequence size R

1 0=AAA . . . AA 12,23,45,... 1 1=AAA . . . AC 24,322,3432,... 0 2=AAA . . . AT 1 n-1=CAA . . . AA 356,3434,23328,... 1 n=TTT . . . TT 123,3782,12332,...

Bloom K-mer seed occurrence position buckets

ATGCaCGAG.CGTATGCAG ATGCtCGAGACGTAT-CAG

(C) Score along alignment position

ATGCaCGAG.CGTATGCAG ||||.||||-|||||-||| ATGCaCGAGACGTAT-CAG

(A) Rectangular matrix (B) Squared matrix

(D) Bloom and hash tables

(D) Armenian Cheese Diagonal

Short read sequence Reference sequence

(A)Diagonal optimization (B) Multiple In-Del alignment problem

X

(C) Floating Diagonal

(E) Final Dynamic Matrix

CATAGTGGACACTG CATAGTGGATAATA

AAAAAA . . . AAA AAAAAA . . . AAC . . . TTTTTT . . . TTG TTTTTT . . . TTT

Similarity measure shown in gray

Reference sequence

Candidate positions found by seeds for the first sequence (blue) and next sequence (red)

Seed hit extension sizes shown as white boxes on a reference frame

X X X

Pool of optimal alignment candidates In this example, the first sequence CATAGTGGACACTG has generated 5 possible hit candidate positions (blue arrows), but only three of those have failed to extend or align long enough (red “X”). The next sequence, CATAGTGGATAATA, having a significant length of prefix similar to the prior one, does not need to consider all candidate positions, specifically those where the extension attempt failed with a wide margin of error.