
Massively Parallel Computing for Protein Alignment

Bertil Schmidt

School of Computer Engineering

Nanyang Technological University

Singapore

Contents

Motivation

Smith-Waterman Algorithm

Parallelization on the Hybrid Architecture

Parallelization on the Fuzion 150

Performance Evaluation

Conclusion and Future Work

Motivation

Genetic sequence databases are growing exponentially

The database growth rate will continue for the foreseeable future, since multiple concurrent genome projects have begun, with more to come

Motivation

Discovered sequences are analyzed by comparison with databases

Complexity of sequence comparison is proportional to the product of query size times database size

Analysis is too slow on sequential computers

Two possible approaches:

Heuristics, e.g. BLAST, FastA: the more efficient the heuristic, the worse the quality of the results

Parallel processing: get high-quality results in reasonable time

Full Genome Comparison

Related organisms, but Tuberculosis causes a disease: find common and different parts

16 × 10^6 pairwise sequence comparisons

Project with IMCB, Thomas Dick

Genome 1: 3918 protein sequences, 1,329,298 amino acids

Genome 2: 4289 protein sequences, 1,359,008 amino acids
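As a rough illustration of why this comparison is expensive, the sketch below multiplies out the counts given on the slide. The labels "genome A" and "genome B" are placeholders, since the text does not say which organism each pair of numbers belongs to. It relies on the fact that one comparison of lengths n and m fills n·m DP cells, so the total over all pairs is the product of the two amino-acid totals.

```python
# Counts taken from the slide; "a" and "b" are placeholder labels because
# the organism names are not recoverable from the text.
proteins_a, residues_a = 3918, 1_329_298
proteins_b, residues_b = 4289, 1_359_008

# Every protein of genome A is compared with every protein of genome B.
pairwise_comparisons = proteins_a * proteins_b

# A single comparison of lengths n and m fills n * m DP cells. Summing n * m
# over all pairs factorises into (sum of n) * (sum of m).
dp_cells = residues_a * residues_b

print(f"{pairwise_comparisons:,} comparisons, about {dp_cells:.2e} DP cells")
```

The cell count, on the order of 10^12, is the quantity that the parallel architectures discussed later have to work through.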

Protein Sequence Alignment

BLAST, FastA, Smith-Waterman

GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII

|||::::| : |::| ||:::||||:|:|||:: ::| |::::

GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV

[Figure: search speed (slower to faster) against data quality (lower to higher) for BLAST, FastA and Smith-Waterman]

Smith-Waterman Algorithm

Optimal local alignment of two sequences

Performs an exhaustive search for the optimal local alignment

Complexity O(nm) for sequence lengths n and m

Based on the 'dynamic programming' (DP) algorithm:

Fill the DP matrix using a substitution (mutation) matrix

Find the maximal value (score) in the matrix

Trace back from the score until a 0 value is reached

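Smith-Waterman Algorithm

Aligning S1 and S2 of lengths l1 and l2 using the recurrences:

\[
H(i,j) = \max\{\,0,\ E(i,j),\ F(i,j),\ H(i-1,j-1) + Sbt(S1_i, S2_j)\,\}, \qquad 1 \le i \le l1,\ 1 \le j \le l2
\]

\[
E(i,j) = \max\{\,H(i,j-1) - \alpha,\ E(i,j-1) - \beta\,\}
\]

\[
F(i,j) = \max\{\,H(i-1,j) - \alpha,\ F(i-1,j) - \beta\,\}
\]

\[
H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0
\]

Sbt is the substitution matrix; α and β are the penalties for opening and extending a gap.

The recurrences calculate three possible ways to extend the alignment:

by one amino acid (AA) in each sequence

by one AA in the first sequence, aligned with a gap in the second

by one AA in the second sequence, aligned with a gap in the first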
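The H, E and F recurrences can be written down almost literally. The following is a minimal sequential sketch, not the implementation used on the parallel boards; the function names `smith_waterman_score` and `match_mismatch` are made up for illustration, and the substitution function is passed in as a plain Python callable rather than a real mutation matrix.

```python
def smith_waterman_score(s1, s2, sbt, alpha, beta):
    """Return the best local alignment score from the H/E/F recurrences."""
    l1, l2 = len(s1), len(s2)
    # Index 0 in each dimension holds the boundary values H = E = F = 0.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            # Alignment ends with a gap reached from cell (i, j-1).
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            # Alignment ends with a gap reached from cell (i-1, j).
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            # Extend by one character in each sequence, or start afresh at 0.
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]))
            best = max(best, H[i][j])
    return best


def match_mismatch(x, y):
    """Toy substitution function: 2 for a match, -1 otherwise."""
    return 2 if x == y else -1


score = smith_waterman_score("ATCTCGTATGATG", "GTCTATCAC", match_mismatch, 1, 1)
print(score)
```

Both loops together visit l1 · l2 cells, which is the O(nm) complexity stated earlier.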

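Smith-Waterman Algorithm

Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with

\[
Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases}, \qquad \alpha = 1,\ \beta = 1
\]

With these parameters the recurrences reduce to

\[
H(i,j) = \max\{\,0,\ H(i-1,j) - 1,\ H(i,j-1) - 1,\ H(i-1,j-1) + Sbt(S1_i, S2_j)\,\}
\]

DP matrix (columns: S1, rows: S2):

           A  T  C  T  C  G  T  A  T  G  A  T  G
        0  0  0  0  0  0  0  0  0  0  0  0  0  0
     G  0  0  0  0  0  0  2  1  0  0  2  1  0  2
     T  0  0  2  1  2  1  1  4  3  2  1  1  3  2
     C  0  0  1  4  3  4  3  3  3  2  1  0  2  2
     T  0  0  2  3  6  5  4  5  4  5  4  3  2  1
     A  0  2  1  2  5  5  4  4  7  6  5  6  5  4
     T  0  1  4  3  4  4  4  6  6  9  8  7  8  7
     C  0  0  3  6  5  6  5  5  5  8  8  7  7  7
     A  0  2  2  5  5  5  5  4  7  7  7 10  9  8
     C  0  1  1  4  4  7  6  5  6  6  6  9  9  8

The maximal score is 10. Tracing back from it through the cells 8, 9, 7, 5, 3, 4, 2 until 0 is reached gives the local alignment:

     S1:  T C G T A T G A
     S2:  T C - T A T C A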
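The fill, find-maximum and trace-back steps for this example can be sketched as follows, using the simplified recurrence with a gap penalty of 1. The function name `local_align` and its default parameters are illustrative choices. Ties in the traceback are resolved in favour of the diagonal move, which is one arbitrary convention among several.

```python
def local_align(s1, s2, match=2, mismatch=-1, gap=1):
    """Fill the DP matrix, locate the maximum and trace back to a 0 cell."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j] - gap, H[i][j - 1] - gap,
                          H[i - 1][j - 1] + s)
            if H[i][j] > best:
                best, best_i, best_j = H[i][j], i, j

    # Trace back from the maximal cell until a 0 value is reached.
    row1, row2 = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # one character from each
            row1.append(s1[i - 1]); row2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:      # gap in the second sequence
            row1.append(s1[i - 1]); row2.append("-")
            i -= 1
        else:                                   # gap in the first sequence
            row1.append("-"); row2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


result = local_align("ATCTCGTATGATG", "GTCTATCAC")
print(result)
```

For the two example sequences this returns the score 10 with the aligned segments `TCGTATGA` and `TC-TATCA`.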

Parallel Architectures for Bioinformatics

Embedded Massively Parallel Accelerators

Fuzion 150: 1536 processors on a single chip

Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA

Systola 1024: PC add-on board with 1024 processors

Parallel Architectures for Bioinformatics

Hybrid Computer: combines the SIMD and MIMD paradigms within one parallel architecture

[Figure: 16 Systola 1024 boards connected by a high-speed Myrinet switch]

Previous Applications

Volume Visualization

Automatic Visual Quality Control (Opel)

Cryptography

Computer Tomography

Video Compression

Range of Transforms (Fourier, Wavelet, Hough, Radon)

Computer Graphics

Architecture of Systola 1024

Instruction Systolic Array (ISA): 32 × 32 mesh of processing elements

Wavefront instruction execution

[Figure: block diagram of the Systola 1024 board with ISA, interface processors, RAM NORTH, RAM WEST, controller, program memory and host computer bus]

Instruction Systolic Array

[Figure: processor array with row selectors, column selectors and an instruction stream (+, *, -) moving through the array as a wavefront]

Wavefront instruction execution

Fast accumulation operations (e.g. row sum, broadcast, ringshift)

Parallelization of Smith-Waterman

Matrix cells along a single anti-diagonal are computed in parallel

Comparison is performed in l1 + l2 − 1 steps on l1 PEs

[Figure: DP matrix for S1 = ATCTCGTATGATG (l1 = 13) and S2 = GTCTATCAC (l2 = 9), with one PE per character of S1 (P1, P2, ..., P13) and the cells of successive anti-diagonals filled in step by step]
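The anti-diagonal schedule can be simulated sequentially to check the step count. The sketch below assumes one PE per query character and the simplified recurrence with match 2, mismatch -1 and gap penalty 1; the name `wavefront_score` is hypothetical. The inner loop stands in for work that the PEs would perform simultaneously, and it says nothing about how the real hardware routes data.

```python
def wavefront_score(query, subject, match=2, mismatch=-1, gap=1):
    """Simulate anti-diagonal evaluation with one PE per query character."""
    l1, l2 = len(query), len(subject)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    steps = 0
    for t in range(l1 + l2 - 1):        # one anti-diagonal per step
        # In step t, PE i handles subject character j = t - i. The cells of
        # one anti-diagonal only depend on the two previous anti-diagonals,
        # so this loop could run in parallel on l1 PEs.
        for i in range(l1):
            j = t - i
            if 0 <= j < l2:
                s = match if query[i] == subject[j] else mismatch
                H[i + 1][j + 1] = max(0, H[i][j + 1] - gap,
                                      H[i + 1][j] - gap, H[i][j] + s)
                best = max(best, H[i + 1][j + 1])
        steps += 1
    return best, steps


best_score, step_count = wavefront_score("ATCTCGTATGATG", "GTCTATCAC")
print(best_score, step_count)
```

For the example sequences the simulation needs 13 + 9 − 1 = 21 steps and finds the same maximal score as the row-by-row computation.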

Mapping onto Systola 1024

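a: query sequence (length equal to 1024)

b: subject sequence

[Figure: query characters a0 ... a1023 distributed over the 32 × 32 PEs (rows a0 ... a31, a32 ... a63, ..., a992 ... a1023); subject sequences b0 b1 ... bk, c0 c1 ... are fed into the array one after another]

Subject sequences can be pipelined with only one step delay

k steps for a subject sequence of length k

Efficient routing on the ISA: Row Ringshift and Broadcast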

Performance Evaluation

Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths; speedup relative to a PIII 850 in parentheses

    Query sequence length      256        512        1024       2048       4096
    Systola 1024               294 (5)    577 (6)    1137 (6)   2241 (6)   4611 (6)
    Cluster of 16 Systolas     20 (81)    38 (86)    73 (91)    142 (94)   290 (94)

Parallel implementation scales linearly with sequence length and number of PCs

Computing time dominates data transfer time
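The scaling claim can be checked with a little arithmetic. The sketch below assumes the table entries read as scan time followed by speedup, i.e. 294, 577, 1137, 2241 and 4611 seconds for one board and 20, 38, 73, 142 and 290 seconds for the cluster. It computes the cluster's speedup over a single board and the resulting parallel efficiency for 16 boards.

```python
# Scan times in seconds, assuming each table entry is "time (speedup)".
query_lengths = [256, 512, 1024, 2048, 4096]
single_board = [294, 577, 1137, 2241, 4611]
cluster_16 = [20, 38, 73, 142, 290]
boards = 16

efficiencies = []
for n, t1, t16 in zip(query_lengths, single_board, cluster_16):
    speedup = t1 / t16                  # cluster relative to one board
    efficiency = speedup / boards       # 1.0 would be perfectly linear
    efficiencies.append(efficiency)
    print(f"query length {n:4d}: speedup {speedup:5.1f}, "
          f"efficiency {efficiency:.2f}")
```

Under this reading the efficiency stays above 0.9 for every query length and approaches 1 for the longest queries, which is consistent with near-linear scaling in the number of PCs.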

Fuzion 150 Architecture

0.25-µm, single-chip SIMD architecture

1536 PEs @ 200 MHz

300 GOPS

600 GB/s on-chip, 6.4 GB/s off-chip bandwidth

Multithreading (control units interact via semaphores)

Developed by Clearspeed Technology (UK) for graphics and network processing

[Figure: Fuzion 150 block diagram with linear SIMD array (1536 PEs, each with 2 Kbytes DRAM), FUZION bus, 32-bit EPU (ARC), SIMD controller, instruction fetch, video I/O, display, local memory with 1, 2 or 4 channels (6.4 GB/s), host, AGP and Rambus]

Fuzion 150 Architecture

[Figure: six blocks (Block 0 ... Block 5) of 256 PEs each, PE(0,0) ... PE(5,255), attached to the Fuzion bus and local memory]

[Figure: a single PE with 8-bit ALU, 32-byte register file, 2 KByte DRAM PE memory, instruction input, block I/O channel, and links to the left and right PE]

Fuzion 150 - Debugger

Mapping onto the Fuzion 150

a: query sequence (length equal to 1536)

b: subject sequence

[Figure: query characters a0 ... a1535 distributed over Block 0 ... Block 5 (a0 ... a255, a256 ... a511, ..., a1280 ... a1535); subject sequences b0 b1 ... bk, c0 c1 ... are fed into the array one after another]

No fast global communication

2-step local communication

Subject sequences can be pipelined with only one step delay

Performance Evaluation

Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths; speedup relative to a PIII 850 in parentheses

    Query sequence length      256        512        1024       2048       4096
    Fuzion 150                 12 (136)   22 (151)   42 (157)   82 (163)   162 (165)

Parallel implementation scales linearly with sequence length

Computing time dominates data transfer time

Performance Evaluation

Normalized time comparison for a 10 Mbase search on different parallel architectures with different query lengths

[Figure: bar chart of time in seconds (axis ticks 1, 10, 100) for SAMBA, Fuzion 150, Kestrel and 16K-PE MasPar at query lengths 512, 1024 and 2048]

4× faster than 16K-PE MasPar

6× faster than Kestrel

5× faster than SAMBA (special-purpose 3-board architecture)

Conclusions and Future Work

Demonstrated how fine-grained parallel architectures can be applied efficiently for Comparative Genomics

Significant runtime savings for full genome comparisons and database searching

More discovery is possible at a good price-performance ratio

Accelerating other Bioinformatics Applications, e.g. Hidden Markov Models

Build a next generation architecture at Center for High Performance Embedded Systems, NTU

Integration of accelerators in a Grid Environment
