Accelerating Bioinformatics Algorithms with
Reconfigurable Computing
Presentation to MAPLD Conference
September 2004
JYardley. 183/MAPLD2004
Overview
• The Problem
– Bioinformatics Algorithm: Smith-Waterman
– Current Implementations
• The Solution
– Viva as a Reconfigurable Computing SW & HW Design Tool
– Hypercomputer Architecture for High-End RC Applications
• The Implementation
– Smith-Waterman Viva Code
– Smith-Waterman Pipeline Design
– Smith-Waterman Pipeline Applied to Hypercomputer Architecture
– Smith-Waterman Pipeline Primitives Inside the FPGA
• The Results
– Visualization of Rat vs. Human Genetic Code
– Informal Benchmarks
• Other Potential Applications
– Seismic Data Processing; Weather Modeling; Image Rendering
The Problem: Enormous Biosciences Problems
• Exploding Datasets in Biosciences:
• DNA Sequencing
• Gene Expression
• Protein Identification
The Need: High-Speed High Sensitivity Algorithms
• High-Speed, High-Sensitivity DNA and Protein Searching Algorithms
– Critical in virtually every branch of molecular biology.
– Smith-Waterman:
• Theoretically optimal for sequence matching.
• BUT compute-intensive!
– BLAST and FASTA:
• Approximations.
• Faster than Smith-Waterman, but less sensitive.
The Need: High-Speed High Sensitivity Algorithms
• Comparative Genomics: comparing the genomes of related species
– Identifying genes, defining gene structure, elucidating evolutionary change, identifying regulatory elements, and revealing combinatorial control of gene regulation
• Sequencing Effort
– Human sequence is completed; other organisms are now being sequenced
– Sequencing effort will require high-sensitivity DNA searches and alignments
– Smith-Waterman is the preferred method: more accurate and more specific
– NCBI BLAST and WU-BLAST are not effective in low-coverage DNA situations
• RNA interference (RNAi): seeking novel therapies and developing new drugs
– The process: choosing the correct genetic sequence to effectively block a targeted messenger RNA (mRNA) without silencing additional genes
– Due to word-length limitations, BLAST algorithms can miss sequences that have one or more mismatches compared to the query siRNA sequence
• Genome Annotation
– BLAST does not allow for long introns or frameshifts
– Smith-Waterman is both frameshift- and intron-tolerant
The Need: High-Speed Smith Waterman
• Large Matrix comparison
• Large datasets
• High level of detail for each SW calculation
• NOT heuristic approximations
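Example scoring matrix (query AAGCG down the rows, reference GAAGAAGCG across the columns; each row lists score, patst, and datast for columns 0 to 9):
column       0  1  2  3  4  5  6  7  8  9
reference    0  G  A  A  G  A  A  G  C  G
row 0 (pattern 0)
  score      0  0  0  0  0  0  0  0  0  0
  patst      0  0  0  0  0  0  0  0  0  0
  datast     0  1  2  3  4  5  6  7  8  9
row 1 (pattern A)
  score      0  0  1  1  0  1  1  0  0  0
  patst      1  1  0  0  1  0  0  1  1  1
  datast     0  1  1  2  4  4  5  7  8  9
row 2 (pattern A)
  score      0  0  1  2  0  1  2  0  0  0
  patst      2  2  1  0  2  1  0  2  2  2
  datast     0  1  1  1  4  4  4  7  8  9
row 3 (pattern G)
  score      0  1  0  0  3  0  0  3  0  1
  patst      3  2  3  3  0  3  3  0  3  2
  datast     0  0  2  3  1  5  6  4  8  8
row 4 (pattern C)
  score      0  0  0  0  0  2  0  0  4  0
  patst      4  4  4  4  4  0  4  4  0  0
  datast     0  1  2  3  4  1  6  7  4  4
row 5 (pattern G)
  score      0  1  0  0  1  0  1  1  0  5
  patst      5  4  5  5  4  5  0  4  5  0
  datast     0  0  2  3  3  5  1  6  8  4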
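The score rows of the example matrix can be reproduced in software with the textbook Smith-Waterman recurrence. The sketch below is a plain Python reference, not the Viva design. The slide does not state its scoring parameters; match = +1, mismatch = -1, and a gap penalty of 4 are assumptions chosen because they reproduce the score rows shown above. The function names are illustrative.

```python
def smith_waterman_scores(query, reference, match=1, mismatch=-1, gap=-4):
    """Return the local-alignment score matrix H.

    Rows follow the query (pattern), columns follow the reference (data).
    Row 0 and column 0 are the all-zero boundary.
    Scoring parameters are assumptions, not values stated in the slides.
    """
    rows, cols = len(query) + 1, len(reference) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            H[i][j] = max(
                0,                      # local alignment never goes negative
                H[i - 1][j - 1] + sub,  # Northwest neighbor: match/mismatch
                H[i - 1][j] + gap,      # North neighbor: gap
                H[i][j - 1] + gap,      # West neighbor: gap
            )
    return H


def best_cell(H):
    """Return (score, row, column) of the first highest-scoring cell."""
    best = max(max(row) for row in H)
    for i, row in enumerate(H):
        for j, value in enumerate(row):
            if value == best:
                return best, i, j


H = smith_waterman_scores("AAGCG", "GAAGAAGCG")
print(best_cell(H))
```

For the example sequences, the best cell is the bottom-right entry of the matrix, where the full query AAGCG matches the last five reference characters.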
The Need: High Performance Biosciences Platform
• Cluster Computing: the most widely used platform, BUT there are diminishing returns:
– Expensive to build, difficult to maintain
– Requires significant power, air conditioning, and physical space
– Architecture inherently limits scalability and performance
• Reconfigurable Computing (RC): the promising alternative
– Advantages of a Custom Chip:
• Implement algorithms directly in hardware
• Performance advantages of an ASIC, but without chip development cost
– Advantages of a General-Purpose Platform:
• Development time comparable to software development
• FPGAs can be reconfigured to perform other computational tasks
The Solution: FPGA-Programming Environment: Viva
• VIVA GRAPHICAL LANGUAGE
– Capture natively parallel code
– Accommodate data of any type, size, or precision
– Tune algorithms for speed of execution or conservation of hardware resources
• VIVA EDITOR
– Call Viva algorithms from legacy code such as C, C++, or Fortran
– Interactively debug code
– Import/Export EDIF files
• VIVA COMPILER/SYNTHESIZER
– Program multi-million-gate designs
– Compile hardware designs quickly for efficient development
• VIVA LIBRARIES
– Reuse flexible Viva objects which accept any data type or size
– Target any hardware platform with a ‘System Description’
– Prototype Viva on any x86-based Windows machine
The Solution: FPGA-based Hypercomputers
Structure of an FPGA Processing Element
Structure of a Processing Element Quad
Structure of a Hypercomputer Accelerator Board
The Prototype Implementation: Smith Waterman in Viva Code
Smith Waterman Program Flow
• As the query sequence is loaded, the Init_Cells object creates our initial column and stores it in SW_Cell_Mem.
• After this initialization period, SW_Cell_Mem will provide a cell to the chain of SW_Iteration objects every clock cycle. It will also write a newly calculated cell every clock cycle.
• The SW_Cell_Mem object stores every nth column, where n is the number of SW_Iteration objects.
Smith Waterman Cells
• There are as many cells as there are characters in the query sequence.
• The array of cells represents a column of the scoring matrix.
• The initial (zero) column is initialized and stored into the cell memory object, SW_Cell_Mem.
• Each cell contains the following four parameters:
– Pattern – a character from the query sequence
– Score – the score of this cell in the current i,j position
– PatternStart – the position in the query sequence from which the score was calculated
– DataStart – the position in the reference sequence from which the score was calculated
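The four cell parameters can be pictured as a small record, with one record per query character. The sketch below is a software illustration only: the `Cell` class and `init_cells` function are hypothetical names, and the initial values (Score 0, PatternStart equal to the row index, DataStart 0) are inferred from column 0 of the example matrix shown earlier, not stated explicitly in the slides.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    pattern: str        # Pattern: a character from the query sequence
    score: int          # Score: score of this cell at the current i,j position
    pattern_start: int  # PatternStart: query position the score was calculated from
    data_start: int     # DataStart: reference position the score was calculated from


def init_cells(query):
    """Build the initial (zero) column: one cell per query character.

    Assumed initial values, read off column 0 of the example matrix:
    score 0, pattern_start = row index (1-based), data_start = 0.
    """
    return [
        Cell(pattern=ch, score=0, pattern_start=i, data_start=0)
        for i, ch in enumerate(query, start=1)
    ]


column0 = init_cells("AAGCG")
print(len(column0), column0[2])
```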
Cell Data Types
• Data element size may be adjusted depending on usage:
– Pattern – contains as many bits as needed to encode characters from the sequences (4 bits for nucleotides).
– Score and PatternStart – equal in size. Must be large enough to encode the number of entries in the query sequence.
– DataStart – the largest element, as it must be able to encode any position in the reference sequence.
• Right size for the job:
– Less circuitry is needed to calculate matches in smaller sequences.
– Smaller sequences may exploit more parallelism.
Smith Waterman Data Sets
In this example, our Pattern contains 4 bits, for modeling nucleotides. The Score and PatternStart parameters contain 26 bits, so our query sequence may contain up to 67,108,864 characters. The DataStart parameter contains 27 bits, meaning our reference sequence may contain up to 134,217,728 characters.
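The sequence-length limits quoted for this example follow directly from the field widths: a field of b bits can encode 2^b distinct positions. A minimal check in Python, where the helper names and the `CELL_BITS` table are illustrative and the combined cell width is simply the sum of the four stated fields:

```python
def max_positions(bits):
    """Number of distinct positions a field of the given width can encode."""
    return 2 ** bits


def bits_needed(positions):
    """Smallest field width that can encode `positions` distinct values (0-based)."""
    return max(1, (positions - 1).bit_length())


# Field widths from the example on this slide.
CELL_BITS = {"Pattern": 4, "Score": 26, "PatternStart": 26, "DataStart": 27}

print(max_positions(CELL_BITS["Score"]))      # query length limit
print(max_positions(CELL_BITS["DataStart"]))  # reference length limit
print(sum(CELL_BITS.values()))                # bits per cell, summed over the four fields
```

Going the other way, `bits_needed` shows how a shorter query or reference would allow narrower fields, which is the "right size for the job" point made on the previous slide.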
Smith Waterman Iteration
SW_Iteration Object
• Inputs:
– Matrix_In: receives a constant stream of cells. It is imperative for efficiency that the pipe remains full.
– Data: receives a single character from the reference sequence. The cells computed will be for the column of the scoring matrix corresponding to the Data value.
– CountBy: the radix of the algorithm (number of iteration objects)
– Init_J_In: this iteration object’s index in the chain of iteration objects
– ClkG: system clock
– Token_In: a token pulse precedes a set of cells, allowing the iteration object to clear out data from the previous set of cells
– Init: initialization pulse utilized only before the search commences
– G: accompanies each valid cell
SW_Iteration Object
• Outputs:
– Matrix_Out: newly computed cell
– Token_Out: passes token to next iteration object
– D: accompanies each newly computed cell
– Init_J_Out: used by next iteration object
– I & J: current row and column, used to report results
Pipe Stages
• The SW_Iteration object contains four pipe stages.
• A cell is received by and produced by the SW_Iteration object every clock cycle.
• When a cell enters, it is coming from the previous column, so its values are those of the West neighbor.
• Since the cell in the row above any given cell is in the next pipe stage, access to both the North and Northwest neighbors’ values is possible.
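The neighbor access described above can be mimicked in software by streaming one column of scores through a loop that keeps two registers. This is a behavioral analogy only, not the Viva circuit: it tracks scores but not PatternStart or DataStart, it does not model the four pipe stages, and the scoring parameters (match +1, mismatch -1, gap penalty 4) are assumptions. The function name `stream_column` is hypothetical.

```python
def stream_column(west_scores, query, data_char, match=1, mismatch=-1, gap=-4):
    """Compute one column of scores from the previous column, one cell at a time.

    west_scores: scores of the previous column, rows 1..len(query)
    data_char:   the reference character for the column being computed
    Scoring parameters are assumptions, not values stated in the slides.
    """
    north = 0      # score just produced for the row above (row 0 boundary at start)
    northwest = 0  # incoming score seen on the previous step (row above, previous column)
    out = []
    for pattern_char, west in zip(query, west_scores):
        sub = match if pattern_char == data_char else mismatch
        score = max(0, northwest + sub, north + gap, west + gap)
        out.append(score)
        northwest = west  # the cell that just entered becomes Northwest for the next row
        north = score     # the cell that just left becomes North for the next row
    return out


query = "AAGCG"
column = [0] * len(query)        # initial (zero) column
for ch in "GAAGAAGCG":           # one call per reference character
    column = stream_column(column, query, ch)
print(column)
```

Each call plays the role of one iteration object handling one reference character; resetting `north` and `northwest` at the start of the call corresponds loosely to the token pulse that precedes a set of cells.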
Parallelism
• If a given hardware system has enough physical resources to accommodate n SW_Iteration objects, the Smith Waterman program may operate on n columns in parallel.
• Hence n cells are computed every clock cycle.
• Each Virtex II 6000 can support 64 iteration objects.
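A rough way to reason about this parallelism is to count clock cycles. The sketch below assumes a simplified model that the slides do not spell out: each pass streams the whole query column through the chain once and advances n reference columns, and pipeline fill latency and I/O time are ignored. The function names and the sample sequence lengths are illustrative.

```python
import math


def cell_updates_per_second(n_iteration_objects, clock_hz):
    """n cells are computed every clock cycle."""
    return n_iteration_objects * clock_hz


def estimated_cycles(query_len, reference_len, n_iteration_objects):
    """Approximate cycle count under the simplified model described above."""
    passes = math.ceil(reference_len / n_iteration_objects)
    return passes * query_len


# One FPGA with 64 iteration objects at an assumed 25 MHz clock.
print(cell_updates_per_second(64, 25_000_000))
# Hypothetical workload: 1,000-character query against a 6,400-character reference.
print(estimated_cycles(1_000, 6_400, 64))
```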
The Implementation: Pipeline Primitives Inside the FPGA
The Implementation: Smith Waterman Pipeline
[Figure: block diagram. Labels: X86 System, Bus Controller, XPR Router, XPE Data Distribution, PE1 (Controller), PE2 through PE8.]
The Results: Rat vs. Human Genetic Code
The Results: Bacteria to Bacteria Comparison
The Results: Informal Statistics
• Total # Operations / Second
– 1 Smith-Waterman step includes:
• 25 Logic Operations (adds, compares; mostly 26-27 bit ops, some single-bit ops)
• 13 Data Reorder Operations (Move, Combine…)
• 11 Data Store Operations (Assignment)
– Logic Operations Only:
• 25 Ops * 25 MHz * 448 Smith-Waterman kernels = 280 Billion Operations / Second
– Logic & Data Operations:
• 49 Ops * 25 MHz * 448 Smith-Waterman kernels = about 550 Billion Operations / Second
• Total Aggregate Communications Bandwidth of Systolic Array
– 12 * 88 * 25 MHz = 26.4 Gb/s plus 7 * 22 * 50 MHz = 7.7 Gb/s, for a total of 34.1 Gb/s
• Resources Consumed / Resources Available
– PE2 – PE7: 60% to 70% consumed
– PE1: 20% consumed; XPE: 5%; XPR: 0.1%
• Compilation time
– # Gates: 70 Million Total
– Time to compile: 20 Minutes
• Power Consumption
– Meter: 50 Watts
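The throughput and bandwidth figures on this slide are simple products and can be checked with a few lines of arithmetic. All factors below are copied from the slide; the variable names are illustrative, and the slide does not say what the factors 12, 88, 7, and 22 in the bandwidth terms represent, so they are left unlabeled.

```python
MHZ = 1_000_000

kernels = 448          # parallel Smith-Waterman kernels
clock_hz = 25 * MHZ    # system clock used for these statistics

logic_ops, reorder_ops, store_ops = 25, 13, 11
all_ops = logic_ops + reorder_ops + store_ops    # operations per Smith-Waterman step

logic_rate = logic_ops * clock_hz * kernels      # logic operations per second
total_rate = all_ops * clock_hz * kernels        # logic and data operations per second

bandwidth_a = 12 * 88 * 25 * MHZ                 # first bandwidth term, bits per second
bandwidth_b = 7 * 22 * 50 * MHZ                  # second bandwidth term, bits per second

print(logic_rate / 1e9, "billion logic ops/s")
print(total_rate / 1e9, "billion total ops/s")
print((bandwidth_a + bandwidth_b) / 1e9, "Gb/s aggregate")
```

The second bandwidth term works out to 7.7 Gb/s, which is what makes the stated 34.1 Gb/s total add up, and the combined operation rate is 548.8 billion per second, which the slide rounds to 550 billion.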
Summary & Conclusions
• This Viva prototype of the Smith-Waterman algorithm demonstrates that the algorithm can be parallelized for fast operation in an FPGA system, and validates the use of FPGAs to speed up the Smith-Waterman algorithm compared to clusters.
• Speed of the Prototype:
– An HC-62 has the bandwidth to pass cells between 7 FPGAs, allowing for 448 parallel SW_Iteration objects
– At a conservative 30 MHz system clock speed, this gives 30,000,000 * 448 = 13.4 Billion Smith-Waterman steps/second.
• Opportunities to further optimize the algorithm include:
– Increasing the number of SW_Iterations that can be done in parallel (up to 100 Billion Smith-Waterman steps/second)
– Increasing the clock speed of the hardware (up to 1 Trillion Smith-Waterman steps/second)