Massively Parallel Computing for Protein Alignment
Bertil Schmidt
School of Computer Engineering
Nanyang Technological University
Singapore
Contents
Motivation
Smith-Waterman Algorithm
Parallelization on the Hybrid Architecture
Parallelization on the Fuzion 150
Performance Evaluation
Conclusion and Future Work
Motivation
Genetic sequence databases are growing exponentially
Database growth rate will continue for the foreseeable future, since multiple concurrent genome projects have begun, with more to come
Motivation
Discovered sequences are analyzed by comparison with databases
Complexity of sequence comparison is proportional to the product of query size times database size
Analysis too slow on sequential computers
Two possible approaches
Heuristics (e.g. BLAST, FastA): the more efficient the heuristics, the worse the quality of the results
Parallel processing: high-quality results in reasonable time
Full Genome Comparison
Related organisms, but tuberculosis causes a disease: find common and different parts
16 × 10^6 pairwise sequence comparisons
Project with IMCB, Thomas Dick
3918 protein sequences, 1,329,298 amino acids
4289 protein sequences, 1,359,008 amino acids
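The size of this comparison follows directly from the two proteome sizes on the slide. The short sketch below redoes that arithmetic; the variable names are illustrative, and the assumption that every protein of one proteome is compared with every protein of the other is inferred from the pairwise count on the slide.

```python
# Figures taken from the slide; the slide text does not say which
# organism each proteome belongs to, so they are labelled a and b here.
proteome_a = {"sequences": 3918, "amino_acids": 1_329_298}
proteome_b = {"sequences": 4289, "amino_acids": 1_359_008}

# Assumption: every protein of one proteome is compared with every
# protein of the other.
pairwise_comparisons = proteome_a["sequences"] * proteome_b["sequences"]

# Each comparison fills a DP matrix of (length 1) x (length 2) cells, so the
# total over all pairs is the product of the two amino-acid totals.
total_cells = proteome_a["amino_acids"] * proteome_b["amino_acids"]

print(f"{pairwise_comparisons:,} pairwise comparisons")
print(f"{total_cells:.2e} DP matrix cells")
```

This gives about 16.8 million pairs, in line with the count on the slide, and roughly 1.8 × 10^12 matrix cells to compute.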
Protein Sequence Alignment
BLAST, FastA, Smith-Waterman
GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII
GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV
|||::::| : |::| ||:::||||:|:|||:: ::| |::::
[Figure: BLAST, FastA, and Smith-Waterman placed on two axes, search speed (slower to faster) and data quality (lower to higher)]
Smith-Waterman Algorithm
Optimal local alignment of two sequences
Performs an exhaustive search for the optimal local alignment
Complexity O(nm) for sequence lengths n and m
Based on the 'dynamic programming' (DP) algorithm
Fill the DP matrix using a substitution (mutation) matrix
Find the maximal value (score) in the matrix
Trace back from the score until a 0 value is reached
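Smith-Waterman Algorithm
Aligning S1 and S2 of length l1 and l2 using recurrences:

$$H(i,j) = \max\{0,\ E(i,j),\ F(i,j),\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}, \quad 1 \le i \le l_1,\ 1 \le j \le l_2$$
$$E(i,j) = \max\{H(i,j-1) - \alpha,\ E(i,j-1) - \beta\}$$
$$F(i,j) = \max\{H(i-1,j) - \alpha,\ F(i-1,j) - \beta\}$$
$$H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0$$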
Calculate three possible ways to extend the alignment:
by one amino acid (AA) in each sequence
by one AA in the first sequence, aligned with a gap in the second
by one AA in the second sequence, aligned with a gap in the first
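The three extension cases map onto three DP tables: H for the best score ending at a cell, and E and F for alignments ending in a gap in one sequence or the other. The following is a minimal sketch of that scheme, not the implementation behind the slides. The function names, the simple match/mismatch scoring used in place of a real substitution matrix, and the default gap penalties `alpha` and `beta` are assumptions for illustration.

```python
def sbt(x, y, match=2, mismatch=-1):
    """Substitution score; a stand-in for a real substitution matrix."""
    return match if x == y else mismatch


def smith_waterman_score(s1, s2, alpha=1, beta=1):
    """Return the best local alignment score using three tables H, E, F."""
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 hold the boundary values H = E = F = 0.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            # Alignment ends with a gap: open a new one or extend an old one.
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            # Best of: restart at 0, either gap case, or pair the two residues.
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]))
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ATCTCGTATGATG", "GTCTATCAC"))
```

Choosing `alpha` larger than `beta` makes opening a gap more expensive than extending one; with both set to 1 the scheme reduces to a plain linear gap penalty.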
Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

$$Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases} \qquad \alpha = 1,\ \beta = 1$$

$$H(i,j) = \max\{0,\ H(i-1,j) - 1,\ H(i,j-1) - 1,\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}$$

DP matrix (S1 across the top, S2 down the side):

        A  T  C  T  C  G  T  A  T  G  A  T  G
     0  0  0  0  0  0  0  0  0  0  0  0  0  0
  G  0  0  0  0  0  0  2  1  0  0  2  1  0  2
  T  0  0  2  1  2  1  1  4  3  2  1  1  3  2
  C  0  0  1  4  3  4  3  3  3  2  1  0  2  2
  T  0  0  2  3  6  5  4  5  4  5  4  3  2  1
  A  0  2  2  2  5  5  4  4  7  6  5  6  5  4
  T  0  1  4  3  4  4  4  6  5  9  8  7  8  7
  C  0  0  3  6  5  6  5  5  5  8  8  7  7  7
  A  0  2  2  5  5  5  5  4  7  7  7 10  9  8
  C  0  1  1  4  4  7  6  5  6  6  6  9  9  8

Traceback from the maximal score: 10, 8, 9, 7, 5, 3, 4, 2, 0

A T C T C G T A T G A T G
    G T C - T A T C A C
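The worked example can be reproduced with a few lines of code. The sketch below fills the matrix with the simplified recurrence (match 2, mismatch -1, gap penalty 1) and then traces back from the maximal score until a 0 is reached. The function name and the tie-breaking order in the traceback (diagonal first) are assumptions made for illustration.

```python
def local_alignment(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (score, aligned part of s1, aligned part of s2)."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, best_pos = 0, (0, 0)
    # Fill the DP matrix and remember where the maximal score sits.
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j] - gap, H[i][j - 1] - gap,
                          H[i - 1][j - 1] + score)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the maximal score until a 0 value is reached.
    a1, a2 = [], []
    i, j = best_pos
    while H[i][j] > 0:
        score = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = local_alignment("ATCTCGTATGATG", "GTCTATCAC")
print(score)
print(top)
print(bottom)
```

For the two example sequences this yields a score of 10 with TCGTATGA aligned against TC-TATCA, that is, six matches, one mismatch, and one gap.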
Parallel Architectures for Bioinformatics
Embedded Massively Parallel Accelerators
Fuzion 150: 1536 processors on a single chip
Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA
Systola 1024: PC add-on board with 1024 processors
Parallel Architectures for Bioinformatics
Hybrid Computer: combines SIMD and MIMD paradigm within a parallel architecture
[Figure: 16 nodes, each labelled Systola 1024, connected through a high speed Myrinet switch]
Previous Applications
Volume Visualization
Automatic Visual Quality Control (Opel)
Cryptography
Computer Tomography
Video Compression
Range of Transforms (Fourier, Wavelet, Hough, Radon)
Computer Graphics
Architecture of Systola 1024
Instruction Systolic Array (ISA): 32 × 32 mesh of processing elements
Wavefront instruction execution
[Figure: block diagram showing the ISA, interface processors, RAM NORTH, RAM WEST, controller, program memory, and host computer bus]
Instruction Systolic Array
[Figure: instructions (+, *, -) move through the array as a wavefront, controlled by row selectors and column selectors]
Wavefront instruction execution
Fast accumulation operations (e.g. row sum, broadcast, ringshift)
Parallelization of Smith-Waterman
Matrix cells along a single diagonal are computed in parallel
Comparison is performed in l1 + l2 - 1 steps on l1 PEs
[Figure: the DP matrix for S1 = ATCTCGTATGATG (length l1) and S2 = GTCTATCAC (length l2), with one column per processing element P1, P2, ..., P13 and the cells of one diagonal computed in the same step]
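The diagonal ordering can be checked with a sequential sketch that visits the matrix one anti-diagonal at a time. Every cell on a diagonal depends only on the two previous diagonals, so the inner loop is the part a processor array would execute in parallel. The function name and the simple scoring parameters are assumptions; the code simulates the schedule on one CPU and is not a parallel implementation.

```python
def wavefront_smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Score a local alignment by sweeping the DP matrix diagonal by diagonal."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    steps = 0
    # Diagonal d holds the cells (i, j) with i + j == d.
    for d in range(2, l1 + l2 + 1):
        steps += 1
        # Conceptually, PE number i computes cell (i, d - i) in this step.
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):
            j = d - i
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j] - gap, H[i][j - 1] - gap,
                          H[i - 1][j - 1] + score)
            best = max(best, H[i][j])
    return best, steps


print(wavefront_smith_waterman("ATCTCGTATGATG", "GTCTATCAC"))
```

For the example sequences (l1 = 13, l2 = 9) the sweep takes 13 + 9 - 1 = 21 steps and finds the same maximal score of 10 as the row-by-row computation.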
Mapping onto Systola 1024
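[Figure: the query characters a0 ... a1023 are stored one per PE in the 32 × 32 array, with a31 a30 ... a0 in the first row, a63 a62 ... a32 in the second row, and a1023 a1022 ... a992 in the last row; the subject sequences ... c1 c0 X bk ... b1 b0 are shifted in one character per step]
a: query sequence (length equal to 1024)
b: subject sequence
Subject sequences can be pipelined with only one step delay
k steps for a subject sequence of length k
Efficient routing on the ISA: Row Ringshift and Broadcast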
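Why pipelining matters can be seen from a rough step count. The sketch below compares running each comparison on its own with streaming the subject sequences back to back. It uses a deliberately simplified model, assumed here for illustration and not taken from the slides: the array behaves like a linear chain of `query_len` PEs, one subject character enters per step, and one separator step sits between consecutive subject sequences.

```python
def scan_steps(query_len, subject_lens, pipelined=True):
    """Count array steps in a simplified linear-array model (an assumption)."""
    if not subject_lens:
        return 0
    if not pipelined:
        # Each comparison runs alone: l1 + l2 - 1 wavefront steps.
        return sum(query_len + k - 1 for k in subject_lens)
    # Pipelined: each subject enters in k steps, one separator step sits
    # between consecutive subjects, and the last subject needs query_len - 1
    # further steps to pass through the remaining PEs.
    separators = len(subject_lens) - 1
    return sum(subject_lens) + separators + query_len - 1


lengths = [300, 200, 500]  # hypothetical subject sequence lengths
print(scan_steps(1024, lengths, pipelined=False))  # 4069
print(scan_steps(1024, lengths, pipelined=True))   # 2025
```

In this model the cost per subject sequence drops from l1 + k - 1 steps to about k + 1 steps, so the fill time of the array is paid once per database scan instead of once per sequence.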
Performance Evaluation
Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths

Query sequence length               256    512   1024   2048   4096
Systola 1024: scan time (s)         294    577   1137   2241   4611
  speedup to PIII 850                 5      6      6      6      6
Cluster of 16 Systolas: time (s)     20     38     73    142    290
  speedup to PIII 850                81     86     91     94     94
Parallel implementation scales linearly with sequence length and number of PCs
Computing time dominates data transfer time
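The scaling claim can be checked against the table. The sketch below assumes that each fused table entry reads as a scan time in seconds followed by the speedup (so 2945 is 294 s with a speedup of 5); that reading is an interpretation of the flattened slide table, although it is consistent across both rows. It then computes how close the 16-board cluster comes to a 16-fold gain over a single board.

```python
query_lengths = [256, 512, 1024, 2048, 4096]
# Scan times in seconds, read from the slide table under the assumption
# stated above (time followed by speedup in each fused entry).
single_board_s = [294, 577, 1137, 2241, 4611]
cluster_s = [20, 38, 73, 142, 290]
boards = 16

efficiencies = []
for n, t1, t16 in zip(query_lengths, single_board_s, cluster_s):
    gain = t1 / t16             # cluster speedup over one board
    efficiency = gain / boards  # 1.0 would be perfect linear scaling
    efficiencies.append(efficiency)
    print(f"query {n:4d}: gain {gain:5.2f}, efficiency {efficiency:.2f}")
```

Under that reading the efficiency rises from about 0.92 for the shortest query to about 0.99 for the longest, which matches the statement that computing time dominates data transfer time.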
Fuzion 150 Architecture
0.25-µm, single-chip, SIMD architecture
1536 PEs @ 200 MHz
300 GOPS
600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
Multithreading (control units interact via semaphores)
Developed by Clearspeed Technology (UK) for graphics and network processing
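The quoted peak rate is consistent with the PE count and the clock frequency. The short check below assumes one operation per PE per clock cycle, which the slide does not state explicitly.

```python
pes = 1536
clock_hz = 200_000_000     # 200 MHz
ops_per_pe_per_cycle = 1   # assumption, not stated on the slide

peak_gops = pes * clock_hz * ops_per_pe_per_cycle / 1e9
print(f"{peak_gops:.1f} GOPS")
```

Under that assumption the result is 307.2 GOPS, which the slide rounds to 300 GOPS.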
[Figure: block diagram with these labels: linear SIMD array of 1536 PEs, each with 2 Kbytes DRAM; FUZION bus; 32-bit EPU (ARC); video I/O; display; instruction fetch; SIMD controller; local memory; 1, 2 or 4 Rambus channels (6.4 GB/s); host (AGP)]
Fuzion 150 Architecture
[Figure: the PEs are organized in six blocks, Block 0 to Block 5, each holding PE(b,0) ... PE(b,255), connected to the Fuzion bus and local memory. Each PE has an 8-bit ALU, a 32-byte register file, and 2 KByte of PE memory (DRAM); it receives instructions and is connected to its left PE, its right PE, and the block I/O channel]
Fuzion 150 - Debugger
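Mapping onto the Fuzion 150
[Figure: the query characters a0 ... a1535 are stored one per PE, with a0 a1 ... a255 in Block 0, a511 a510 ... a256 in Block 1, and a1535 a1534 ... a1280 in Block 5; the subject sequences ... c1 c0 X bk ... b1 b0 are shifted in one character per step]
a: query sequence (length equal to 1536)
b: subject sequence
No fast global communication
2-step local communication
Subject sequence can be pipelined with only one step delay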
Performance Evaluation
Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths

Query sequence length          256    512   1024   2048   4096
Fuzion 150: scan time (s)       12     22     42     82    162
  speedup to PIII 850          136    151    157    163    165

Parallel implementation scales linearly with sequence length
Computing time dominates data transfer time
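The linear scaling can be read off the table directly. The sketch below assumes that each fused table entry is a scan time in seconds followed by the speedup (so 12136 is 12 s with a speedup of 136); this is an interpretation of the flattened slide table. It computes how much the scan time grows each time the query length doubles, and the PIII 850 time implied by each time and speedup pair.

```python
query_lengths = [256, 512, 1024, 2048, 4096]
# Values read from the slide table under the assumption stated above.
fuzion_s = [12, 22, 42, 82, 162]
speedup_vs_piii = [136, 151, 157, 163, 165]

# Growth of the scan time for each doubling of the query length;
# a value of 2.0 would be exactly linear.
growth = [later / earlier for earlier, later in zip(fuzion_s, fuzion_s[1:])]

# Sequential time implied by each (time, speedup) pair.
implied_piii_s = [t * s for t, s in zip(fuzion_s, speedup_vs_piii)]

print([round(g, 2) for g in growth])
print(implied_piii_s)
```

The growth factors approach 2 as the query gets longer, which suggests a small fixed overhead that matters less for long queries.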
Performance Evaluation
Normalized time comparison for a 10 Mbase search on different parallel architectures with different query lengths
[Figure: chart of time in seconds (logarithmic axis, 1 to 100) for SAMBA, Fuzion 150, Kestrel, and 16K-PE MasPar at query lengths 512, 1024, and 2048]
4 times faster than 16K-PE MasPar
6 times faster than Kestrel
5 times faster than SAMBA (special-purpose 3-board architecture)
Conclusions and Future Work
Demonstrated how fine-grained parallel architectures can be applied efficiently for Comparative Genomics
Significant runtime savings for full genome comparisons and database searching
More discovery is possible at a good price-performance ratio
Accelerating other Bioinformatics Applications, e.g. Hidden Markov Models
Build a next generation architecture at Center for High Performance Embedded Systems, NTU
Integration of accelerators in a Grid Environment