
04/23/2003

Massively Parallel Solutions for Molecular Sequence Analysis

Prabhakar R. Gudla

CMSC 838T Presentation


Outline

• Motivation
• Smith-Waterman Algorithm
• Parallelization
• High Performance Computing
• Hybrid Architecture
• Fuzion 150
• Performance Evaluation
• Conclusions and Comments


Motivation

Discovered sequences are analyzed by comparison with databases

Complexity is proportional to the product of query size and database size

☞ Analysis too slow on sequential computers


Sequence Alignment

Two possible approaches:

• Heuristics, e.g. BLAST, FASTA; but the more efficient the heuristics, the worse the quality of the results
• Parallel processing: get high-quality results in reasonable time

BLAST, FASTA, Smith-Waterman (S-W)

[Figure: BLAST, FASTA, and Smith-Waterman placed on two axes, search speed (slower to faster) and data quality (lower to higher).]




Parallelization of S-W

Matrix cells along a single diagonal are computed in parallel

Comparison is performed in l1 + l2 − 1 steps on l1 PEs

[Figure: DP matrix for ATCTCG (length l1, one character per PE, P1 to P6) against GTCTATC (length l2). The subject sequence GTCTATC shifts through the PE array step by step, and one diagonal of the matrix is filled per step.]
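The diagonal scheme on this slide can be sketched in Python. This is a minimal sequential simulation, not the code that runs on the Systola 1024 or the Fuzion 150. It assumes the scoring of the deck's worked example (match 2, mismatch −1, gap penalty 1), and the names `sbt` and `sw_wavefront` are hypothetical. The inner loop visits the cells of one anti-diagonal; they are mutually independent, so each could be handled by its own PE.

```python
def sbt(x, y):
    """Substitution score assumed for illustration: 2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def sw_wavefront(s1, s2, gap=1):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time.

    Returns (best local score, number of anti-diagonal steps).
    """
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, steps = 0, 0
    # Cells with i + j == d form one anti-diagonal; d takes l1 + l2 - 1 values.
    for d in range(2, l1 + l2 + 1):
        steps += 1
        # Each i plays the role of one PE; the updates only read earlier diagonals.
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):
            j = d - i
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
            )
            best = max(best, H[i][j])
    return best, steps


score, steps = sw_wavefront("ATCTCG", "GTCTATC")
print(score, steps)
```

For the two sequences in the figure the loop runs 6 + 7 − 1 = 12 diagonal steps.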


Parallel Architectures

Embedded Massively Parallel Accelerators

Fuzion 150: 1536 processors on a single chip

Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan

Systola 1024: PC add-on board with 1024 processors



Previous Applications

• Volume Visualization [Schmidt '00]
• Automatic Visual Quality Control (Automobile Industry)
• Computer Tomography [Schmidt, Schimmler, and Schröder '98]
• Video Compression [Schmidt and Schimmler '99]
• Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder '99]
• Image Processing [Schimmler and Lang '96; Lenders and Schröder '90; Jiang, Edirisinghe, and Schröder '97]


Hybrid Architecture

[Figure: 16 Systola 1024 nodes connected through a high-speed Myrinet switch.]

Hybrid computer: combines the SIMD and MIMD paradigms within one parallel architecture


Architecture of Systola 1024

[Figure: Systola 1024 board with the ISA, interface processors, RAM NORTH, RAM WEST, controller, program memory, and host computer bus.]

Instruction Systolic Array:

• 32 × 32 mesh of processing elements
• wavefront instruction execution


Mapping onto Systola 1024

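[Figure: query characters a0 to a1023 assigned to the PEs of the 32 × 32 array (a0 to a31, a32 to a63, up to a992 to a1023); subject sequences b0 b1 ... bk and c0 c1 ... enter the array one after another.]

a: query sequence (length equal to 1024)

b: subject sequence

• Subject sequences can be pipelined with only step delay k steps for subject sequence of length k
• Efficient routing on the ISA: Row Ringshift and Broadcast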


Fuzion 150 Architecture

• 0.25-μm, single-chip, SIMD architecture
• 1536 PEs @ 200 MHz
• 300 GOPS
• 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
• multithreading (control units interact via semaphores)
• developed by Clearspeed Technology (UK) for graphics and network processing

[Figure: Fuzion 150 block diagram: linear SIMD array of 1536 PEs, each with 2 Kbytes DRAM; SIMD controller with instruction fetch; 32-bit EPU (ARC); FUZION bus; local memory; video I/O and display; host (AGP); Rambus, 1, 2 or 4 channels (6.4 GB/s).]


Fuzion 150 Architecture

[Figure: PEs arranged in six blocks (Block 0 to Block 5) of 256 PEs each, from PE(0,0) to PE(5,255), attached to the Fuzion bus and local memory. Each PE contains an 8-bit ALU, a 32-byte register file, and 2 KByte of PE memory (DRAM), with connections to the left PE, the right PE, the instruction stream, and the block I/O channel.]


Mapping onto the Fuzion 150

[Figure: query characters a0 to a1535 assigned to the PEs of Block 0 to Block 5 (a0 to a255, a256 to a511, up to a1280 to a1535); subject sequences b0 b1 ... bk and c0 c1 ... are fed through the array.]

a: query sequence (length equal to 1536)

b: subject sequence

• No fast global communication
• 2-step local communication
• Subject sequence can be pipelined with only step delay



Performance Evaluation

Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths

Query sequence length          256    512   1024   2048   4096
Fuzion 150 (s)                  12     22     42     82    162
  speedup to PIII 1 GHz         88     97    102    105    106
Systola 1024 (s)               294    577   1137   2241   4611
  speedup to PIII 1 GHz          4      4      4      4      4
Cluster of 16 Systolas (s)      20     38     73    142    290
  speedup to PIII 1 GHz         53     56     58     60     59

• Parallel implementation scales linearly with sequence length
• Computing time dominates data transfer time
• Fuzion 150 is 25 times faster than a single Systola 1024; difference in CMOS technology (0.25 μm vs. 1.0 μm)
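The "25 times faster" figure can be checked against the scan times. The sketch below takes the Fuzion 150 times as 12, 22, 42, 82, 162 s, the Systola 1024 times as 294, 577, 1137, 2241, 4611 s, and the cluster times as 20, 38, 73, 142, 290 s. This is one reading of the table, an assumption of this sketch. The helper name `ratios` is hypothetical.

```python
# Scan times in seconds, as read from the table (an assumption of this sketch).
lengths = [256, 512, 1024, 2048, 4096]
fuzion_s = [12, 22, 42, 82, 162]
systola_s = [294, 577, 1137, 2241, 4611]
cluster_s = [20, 38, 73, 142, 290]


def ratios(slow, fast):
    """Element-wise ratio slow/fast, rounded to one decimal."""
    return [round(s / f, 1) for s, f in zip(slow, fast)]


fuzion_vs_systola = ratios(systola_s, fuzion_s)    # Fuzion 150 vs. one Systola 1024
cluster_vs_systola = ratios(systola_s, cluster_s)  # 16 Systolas vs. one Systola 1024

for n, f, c in zip(lengths, fuzion_vs_systola, cluster_vs_systola):
    print(n, f, c)
```

Under this reading the Fuzion 150 comes out roughly 25 to 28 times faster than a single Systola 1024. The cluster of 16 boards gains a factor of about 15 to 16, close to linear scaling.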



Time comparisons for a 10 Mbase search on different parallel architectures with different query lengths

[Figure: bar chart of search time in seconds (logarithmic axis, 1 to 100) for SAMBA, Fuzion 150, Kestrel, and 16K-PE MasPar at query lengths 512, 1024, and 2048.]

• 4× faster than 16K-PE MasPar
• 6× faster than Kestrel
• 5× faster than SAMBA (special-purpose 3-board architecture)



USparc : Sun Ultrasparc 140 MHz

B-SYS: 470-PE ISA

Alpha: DEC Alpha – 433 MHz

1K MP2: 1K-PE MasPar

Paragon: 32-node Paragon

Decy-1: 1-board Decypher-II*

Merc1: 1-board Mercury+

Bcll-1: Biocellerator*

Samba: 2-board Samba+

16-MP2: 16K-PE MasPar

FDF-3: 5-Board Paracell FDF+

Kestrel: 1-board Kestrel


+ (single purpose); * (FPGA)

Source: Dahle et al., PDPTA, 1243-1249, 1999




Conclusions

Demonstrated how fine-grained and hybrid parallel architectures can be applied efficiently for Comparative Genomics

Significant runtime savings for full genome comparisons and database searching

Same systems can be used for accelerating other bioinformatics applications, e.g. Hidden Markov Models


Comments

☞ With hardware support, is S-W as fast as BLAST?

Time taken for the search (seconds), against the Swiss-Prot DB

                       Sequence Under Test
Search Tool        ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3              4.3        20.0           25.0
BLAST 2.2              1.0         4.0           10.0
SSearch (SW)           6.0       240.0          565.0
H'Ware Accl.           3.2        16.8           29.7

Comparative search speeds on a 600 MHz 21264A Alpha machine (comparable MCUPS to the Hybrid System and Fuzion 150)

* Source: Shane Sturrock, SCS, 2(1), April 2002


Comments

☞ Is it feasible to use S-W as the default?

• Currently offered as a default option at EBI (European Bioinformatics Institute), which handles 15K queries per month w/ full implementation of S-W
• Depends on the “objectives” of the search

☞ Just how much more accurate is S-W?

• 5-10% more “sensitive” towards divergent matches than BLAST (Shpaer et al., Genomics 38, 179-191, 1996)
• BLAST will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities


Comparison of S-W VS BLAST

Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996

☞ Is there a real difference in the results ? YES


Comparison of S-W, FASTA, and BLAST

Note: The numbers in the table show for how many protein SF the method in the column performed better than the one in the row


Acknowledgements

Dr. Bertil Schmidt

Dr. Chau-Wen Tseng


Q&A


Extra Slides


Full Genome Comparison

Related organisms, but Tuberculosis causes a disease: find common and different parts

16 × 10^6 pairwise sequence comparisons

• 3918 protein sequences, 1,329,298 amino acids
• 4289 protein sequences, 1,359,008 amino acids
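The size of this comparison follows from the two proteome sizes. The sketch below is plain arithmetic on the numbers from the slide, with hypothetical variable names. It uses the fact that summing the products of sequence lengths over all pairs equals the product of the two amino-acid totals.

```python
# Proteome sizes taken from the slide.
n_seqs_a, n_aa_a = 3918, 1_329_298
n_seqs_b, n_aa_b = 4289, 1_359_008

# Every sequence of one organism is compared with every sequence of the other.
pairwise_comparisons = n_seqs_a * n_seqs_b

# Each comparison fills a DP matrix of len(x) * len(y) cells; summed over all
# pairs this is the product of the two amino-acid totals.
dp_cells = n_aa_a * n_aa_b

print(pairwise_comparisons)
print(f"{dp_cells:.3e}")
```

The pair count is about 16.8 million, in line with the figure on the slide, and the total number of DP cells is on the order of 10^12.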


Smith-Waterman Algorithm

• Optimal local alignment of two sequences
• Performs an exhaustive search for the optimal local alignment
• Complexity O(nm) for sequence lengths n and m

Based on the 'dynamic programming' (DP) algorithm:

• Fill the DP matrix using a substitution (mutation) matrix
• Find the maximal value (score) in the matrix
• Trace back from the score until a 0 value is reached



Aligning S1 and S2 of length l1 and l2 using recurrences:

$$H(i,j) = \max\{0,\ E(i,j),\ F(i,j),\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}, \quad 1 \le i \le l1,\ 1 \le j \le l2$$

$$E(i,j) = \max\{H(i,j-1) - \alpha,\ E(i,j-1) - \beta\}$$

$$F(i,j) = \max\{H(i-1,j) - \alpha,\ F(i-1,j) - \beta\}$$

$$H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0$$

Calculate three possible ways to extend the alignment:

• by one amino acid (AA) in each sequence
• by one AA in the first sequence, aligned with a gap in the second
• by one AA in the second sequence, aligned with a gap in the first
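The recurrences for H, E, and F translate almost line by line into code. The following is a minimal sequential sketch, assuming the usual reading of the two gap penalties (here `alpha` for opening a gap and `beta` for extending it). The function name `smith_waterman_affine` and the example substitution function are hypothetical. It returns only the best score, without traceback.

```python
def smith_waterman_affine(s1, s2, sbt, alpha, beta):
    """Best local alignment score using the H/E/F recurrences.

    sbt(x, y) is the substitution score; alpha and beta are the gap penalties.
    """
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 stay at 0, matching the boundary conditions.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            H[i][j] = max(
                0,
                E[i][j],
                F[i][j],
                H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
            )
            best = max(best, H[i][j])
    return best


def match_mismatch(x, y):
    """Example substitution function (assumed): 2 for a match, -1 otherwise."""
    return 2 if x == y else -1


print(smith_waterman_affine("ATCTCGTATGATG", "GTCTATCAC", match_mismatch, 1, 1))
```

A real protein search would replace `match_mismatch` with a lookup into a substitution matrix; that choice is independent of the recurrences.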



Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

$$Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases}, \qquad \alpha = 1,\ \beta = 1$$

$$H(i,j) = \max\{0,\ H(i-1,j) - 1,\ H(i,j-1) - 1,\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}$$

[Figure: filled DP matrix for S1 and S2, with the traceback path and the resulting local alignment.]
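The worked example can be reproduced with a short script. This is a sketch under the slide's scoring (match 2, mismatch −1, both gap penalties 1), using the simplified single-matrix recurrence and a plain traceback from the maximum down to the first 0. The function names are hypothetical, and when several tracebacks are optimal the script simply returns the first one it finds.

```python
def sbt(x, y):
    """Substitution score from the slide: 2 if x == y, else -1."""
    return 2 if x == y else -1


def sw_traceback(s1, s2, gap=1):
    """Return (score, aligned_s1, aligned_s2) for the best local alignment."""
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
                H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
            )
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the maximum until a 0 entry is reached.
    a1, a2 = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = sw_traceback("ATCTCGTATGATG", "GTCTATCAC")
print(score)
print(top)
print(bottom)
```

The first printed line is the maximal value in the DP matrix; the two strings are the aligned segments of S1 and S2, with `-` marking a gap.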


Principles of the ISA

[Figure: array of processing elements; label: communication register.]


Interface Processors

[Figure: ISA with interface processors along its north and west edges.]


Instruction Systolic Array

[Figure: instructions (+, *, -) entering the array together with row selectors and column selectors.]

• wavefront instruction execution
• fast accumulation operations (e.g. row sum, broadcast, ringshift)


Advantage of ISA’s: Performing Aggregate Functions

• Row Broadcast: the instruction C := C[WEST] travels along the row. Successive values of C in the four PEs:
  234 0 0 0 → 234 234 0 0 → 234 234 234 0 → 234 234 234 234

• Row Sum: the instruction C := C + C[WEST] travels along the row. Successive values of C:
  1 2 3 4 → 1 3 3 4 → 1 3 6 4 → 1 3 6 10

• Row Ringshift: the instructions C := C[WEST]; C := C[EAST] travel along the row. Successive values of C:
  1000 1 1 1 → 1 1000 1 1 → 1 1 1000 1 → 1 1 1 1000
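The row broadcast and row sum can be mimicked with a small simulation of one instruction sweeping a row from west to east. This is only a model of the wavefront idea, not of the real ISA instruction set. The function name `wavefront_row` and the `west_input` parameter, which stands in for the value supplied by the western interface processor, are assumptions of this sketch.

```python
def wavefront_row(cells, op, west_input=0):
    """Let one instruction sweep a row of PEs from west to east.

    At step k, PE k applies op(own_value, value_of_west_neighbour);
    PE 0 reads west_input instead of a neighbour.
    Returns the list of row states, starting with the initial one.
    """
    row = list(cells)
    states = [tuple(row)]
    for k in range(len(row)):
        west = row[k - 1] if k > 0 else west_input
        row[k] = op(row[k], west)
        states.append(tuple(row))
    return states


# Row sum: C := C + C[WEST] leaves the prefix sums in the row.
row_sum = wavefront_row([1, 2, 3, 4], lambda c, w: c + w)

# Row broadcast: C := C[WEST] copies the western value into every PE.
broadcast = wavefront_row([0, 0, 0, 0], lambda c, w: w, west_input=234)

print(row_sum[-1])
print(broadcast[-1])
```

Because the instruction reaches PE k one step after PE k−1, each PE already sees the updated value of its western neighbour, which is why a single instruction is enough to accumulate along the whole row.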


Data Transfer

In Systola 1024:

• input of a new character (bj) into the lower western IP
• when l1 > 2048, input of previously computed H, E, and F cells and output of H, E, and F cells

For Fuzion 150: while the 16 new H-cells are computed in each PE, one new character is input via the Fuzion bus


Instruction Counts

Instruction Count (IC) to update 2 and 16 H-cells in Systola 1024 and Fuzion 150, respectively:

Operations in each PE per iteration step                          Systola   Fuzion
Get H(i−1, j), F(i−1, j), bj, max_{i−1} from neighbor                20        22
Compute t = max{0, H(i−1, j−1) + Sbt(ai, bj)}                        20       576
Compute F(i, j) = max{H(i−1, j) − α, F(i−1, j) − β}                   8       336
Compute E(i, j) = max{H(i, j−1) − α, E(i, j−1) − β}                   8       448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                            8       368
Compute max_i = max{H(i, j), max_{i−1}}                               4       184
Sum                                                                  68      1934
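The totals and the cost per H-cell follow directly from the table. The sketch below re-adds the per-operation counts and divides by the number of H-cells updated per iteration step (2 on Systola 1024, 16 on Fuzion 150, as stated on the slide). The dictionary keys are shorthand labels chosen here, not names from the source.

```python
# (Systola 1024, Fuzion 150) instruction counts per iteration step, from the table.
counts = {
    "get neighbour data": (20, 22),
    "compute t": (20, 576),
    "compute F": (8, 336),
    "compute E": (8, 448),
    "compute H": (8, 368),
    "compute max": (4, 184),
}

systola_total = sum(s for s, _ in counts.values())
fuzion_total = sum(f for _, f in counts.values())

# H-cells updated per iteration step in each PE.
systola_per_cell = systola_total / 2
fuzion_per_cell = fuzion_total / 16

print(systola_total, fuzion_total)
print(systola_per_cell, fuzion_per_cell)
```

On these numbers a Fuzion 150 PE spends roughly 3.5 times as many instructions per H-cell as a Systola 1024 PE.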


Maximum Characters/PE

The memory per PE on Systola is 32 (16-bit) registers

• 2 characters per PE is the maximum possible (2 chars × 20 AAs per substitution row × 8 bits per substitution value = 20 registers)

The memory per PE on Fuzion is 2 Kbytes

• the maximum is 16 chars per PE, restricted by “indirect addressing” per PE


Indirect Address

An addressing mode found in many processors' instruction sets where the instruction contains the address of a memory location which contains the address of the operand (the "effective address") or specifies a register which contains the effective address


Myrinet - Overview

Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers

Conventional networks (e.g., ethernet) can be used to build clusters, but do not provide the performance/features required for HPC or high-availability clustering


Myrinet - Characteristics

Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports

Flow control, error control, and "heartbeat" continuity monitoring on every link

Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications

Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts

Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets


lq processors: Hybrid

Query sequence length = M, number of processors in ISA = N², assuming M = k × N:

1. k < N: each k × N subarray computes the alignment of the same query sequence with different subject sequences

2. k ≥ N:
• k/N = 2: load 2 chars per PE
• k/N > 2: split the query sequence into k/2N passes and load 2N² chars in each pass


lq processors: Fuzion 150

Length of query sequence = M, number of processors = 1536:

1. k × M = 1536: k alignments of the same query sequence w/ different subject sequences are carried out in parallel

2. k × 1536 = M:
• Split into k passes; requires I/O of intermediate results in each step
• Data transfers can be minimized by assigning k/M chars per PE; currently 16 chars per PE is the limit


Concept of true and false hits

The following cases were distinguished:

• true positives: alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method)
• false positives: alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment
• true negatives: alignments between proteins of dissimilar structure that fall below a given threshold
• false negatives: alignments between proteins of similar structure that fall below a given threshold
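The four cases amount to a two-by-two decision on structural similarity and alignment score. A minimal sketch, assuming "above the threshold" means strictly greater; the function name `classify_hit` and the example scores are hypothetical.

```python
from collections import Counter


def classify_hit(similar_structure, score, threshold):
    """Label one alignment as a true/false positive/negative."""
    above = score > threshold  # assumption: strictly above counts as a hit
    if similar_structure:
        return "true positive" if above else "false negative"
    return "false positive" if above else "true negative"


# Made-up example data: (proteins structurally similar?, alignment score).
alignments = [(True, 120), (True, 40), (False, 95), (False, 10)]
tally = Counter(classify_hit(sim, s, threshold=80) for sim, s in alignments)
print(dict(tally))
```

Raising the threshold trades false positives for false negatives, which is the trade-off the sensitivity comparison between S-W, FASTA, and BLAST is measuring.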


Guidelines

When to use S-W?

• if you are looking for a protein distantly related to your query sequence (e.g., you have a known protein sequence and you want to find possible distant homologues)
• if you are looking for the protein encoded in your low-quality DNA query sequence (e.g., you have a badly sequenced cDNA clone)
• if you are looking for a DNA sequence corresponding to your protein query sequence (e.g., you want to identify potential homologues of your protein in the EST databases)

When to use BLAST?

• if you are looking for close matches and you don't mind missing lower homology sequences
• if you want a quick answer


Performance Evaluation of SAMBA

Time in seconds and speed-up of Samba over each workstation

Query sequence length       10     30    100    300   1000    3000   10000
Samba                       25     25     26     30     40      77     210
DEC-Alpha, 150 MHz          57    120    350   1041   3468   11510   38450
  Speed up                 2.3    4.8   13.5   34.7   86.7     150     183
SUN-Sparc 5, 110 MHz        95    239    746   2215   7300   24269   80300
  Speed up                 3.8    9.5   28.6   73.8    183     315     382
DEC 5000/250, 40 MHz       182    548   1407   4054  12920   41169  131193
  Speed up                 7.3     22     54    135    323     534     625

Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997

☞ The longer the query length, the better the speed-up
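Each speed-up entry is the workstation time divided by the Samba time for the same query length. The sketch below recomputes the DEC-Alpha row from the times on the slide; the helper name `speedups` is hypothetical.

```python
# Times in seconds from the slide, for query lengths 10 ... 10000.
query_lengths = [10, 30, 100, 300, 1000, 3000, 10000]
samba_s = [25, 25, 26, 30, 40, 77, 210]
dec_alpha_s = [57, 120, 350, 1041, 3468, 11510, 38450]


def speedups(workstation_times, accelerator_times):
    """Speed-up = workstation time / accelerator time, rounded to one decimal."""
    return [round(w / a, 1) for w, a in zip(workstation_times, accelerator_times)]


alpha_speedup = speedups(dec_alpha_s, samba_s)
for n, s in zip(query_lengths, alpha_speedup):
    print(n, s)
```

The Samba time barely changes for short queries (25 s at lengths 10 and 30) while the workstation time grows with the query length, so the ratio rises steadily.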


Performance Evaluation of Kestrel

USparc : Sun Ultrasparc 140 MHz

B-SYS: 470-PE ISA

Alpha: DEC Alpha – 433 MHz

1K MP2: 1K-PE MasPar

Paragon: 32-node Paragon


Merc1: 1-board Mercury+

Bcll-1: Biocellerator*

Samba: 2-board Samba+

16-MP2: 16K-PE MasPar

FDF-3: 5-Board Paracell FDF+

Kestrel: 1-board Kestrel


+ (single purpose); * (FPGA)

Source: Dahle et al., PDPTA, 1243-1249, 1999


Performance Evaluation of Splash-2

Hardware          Specifics             MCUPS
Splash-2          Unidir; 16 boards    43,000
Splash-2          Bidir; 16 boards     34,000
Splash-2          Unidir; 1 board       3,000
Splash-2          Bidir; 1 board        2,100
Splash-1          Bidir; 746 PE’s         370
SPARC 10/30 GX    gcc -O2                 1.2
VAX 6620          VMS; CC                 1.0
SPARC-1           gcc -O2                0.87
486DX-50 PC       DOS; gcc -O2           0.67

Source: Hoang, IEEE-CMM, 185-191, 1993
