
Streaming Algorithms for Biological Sequence Alignment on GPUs

Weiguo Liu, Bertil Schmidt, Member, IEEE, Gerrit Voss, and Wolfgang Muller-Wittig, Member, IEEE

Abstract—Sequence alignment is a common and often repeated task in molecular biology. Typical alignment operations consist of finding similarities between a pair of sequences (pairwise sequence alignment) or a family of sequences (multiple sequence alignment). The need for speeding up this treatment comes from the rapid growth rate of biological sequence databases: Every year their size increases by a factor of 1.5 to 2. In this paper, we present a new approach to high-performance biological sequence alignment based on commodity PC graphics hardware. Using modern graphics processing units (GPUs) for high-performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have reformulated dynamic-programming-based alignment algorithms as streaming algorithms in terms of computer graphics primitives. Our experimental results show that the GPU-based approach allows speedups of more than one order of magnitude with respect to optimized CPU implementations.

Index Terms—Streaming architectures, dynamic programming, pairwise sequence alignment, multiple sequence alignment, graphics hardware, GPGPU.

1 INTRODUCTION

Dynamic programming (DP)-based algorithms are commonly used for computing optimal pairwise sequence alignments in a genetic context. However, their corresponding complexities are quadratic with respect to the lengths of alignment targets (query sequences and subject sequences) and, therefore, this approach is time consuming for applications involving large data sets. Corresponding runtime requirements are likely to become even more severe due to the rapid growth in the size of available biological sequence databases. An approach to accelerate this operation is to introduce heuristics in the alignment algorithms. The drawback is that the more efficient the heuristics, the worse the quality of the result is. Another approach to get high-quality results in a short time is to use high-performance computing. In practice, increased attention has been given to redesigning alignment algorithms to fully utilize the available architectural characteristics and to exploit parallel execution possibilities, reducing the runtime [1], [2], [3], [4]. In this paper, we investigate how commodity PC graphics hardware can be used as a computational platform to accelerate DP-based alignment algorithms.

Our approach is motivated by the rapidly increasing power of the graphics processing unit (GPU). Its streaming architecture opens up a range of new possibilities for a variety of applications. With the enhanced programmability of commodity GPUs [6], [7], [8], [9], [10], these chips are now capable of performing more than the specific graphics computations they were originally designed for. Recent work shows the design and implementation of algorithms for general-purpose computations on GPUs (GPGPUs). Examples include scientific computing [11], computational geometry [12], database operations [13], image processing [14], and bioinformatics [15], [16], [17]. The evolution of GPUs is driven by the computer game market. This leads to a relatively small price per unit and to very rapid developments of next generations. Currently, the peak performance of high-end GPUs such as the GeForce 7900 GTX has a flops rating of around 200 Gigaflops compared to a high-end PC, which is capable of around 10 Gigaflops. Further, GPU performance has been increasing from two to two-and-a-half times a year (see Fig. 1). This growth rate is faster than Moore’s law as it applies to CPUs, which corresponds to about one-and-a-half times a year [18]. Consequently, GPUs are an attractive alternative for high-performance computing and will become even more attractive in the near future.

Architecturally, modern GPUs implement what is referred to as a streaming processor [9], [10]. This architecture gains its speed by devoting significantly more chip real estate to the computational engine than a conventional CPU. In order to exploit the GPU’s capabilities for high-performance sequence alignment, we present two DP-based streaming algorithms. Both algorithms take advantage of the particular data dependency relationship in the DP matrix. They have been implemented using C++ and OpenGL Shading Language (GLSL [6]) and applied to scanning of large databases, as well as to the computation of multiple sequence alignment (MSA) of up to 1,000 globin sequences. We have used techniques such as texture mapping, buffer binding, render-to-texture (RTT), and multiple render buffers to achieve high efficiency. Furthermore, we introduce the tunable multibatch and multiquery methods for efficient data partitioning. Synchronization and frequent data transfer between CPU and GPU can often be problematic for GPGPU implementations. However, due to the high computation-to-communication ratio, this is not the case for the implementations discussed in this paper. We show that the combination of these techniques leads to a significant performance improvement on Nvidia GeForce series cards. Our achieved speedups compared to optimized CPU implementations are up to 16 for database scanning and up to 12 for MSA.

The rest of this paper is organized as follows: In Section 2, we introduce the basic alignment algorithms and give a brief summary of previous work on parallelization of these algorithms on different architectures. Important features of the GPU architectures are described in Section 3. Section 4 presents the new streaming alignment algorithms and their efficient GPU implementation. Data partitioning schemes are discussed in Section 5. The performance is evaluated in Section 6. Finally, Section 7 concludes the paper.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 9, SEPTEMBER 2007

• W. Liu, G. Voss, and W. Muller-Wittig are with Nanyang Technological University, School of Computer Engineering, N4-02a-32, CAMTech, Singapore 639798. E-mail: {liuweiguo, asgerrit, askwmwittig}@ntu.edu.sg.

• B. Schmidt is with UNSW Asia, Division of Engineering, Science, and Technology, Tanglin Campus, 1 Kay Siang Road, Singapore 248982. E-mail: [email protected].

Manuscript received 17 Aug. 2006; revised 26 Dec. 2006; accepted 2 Jan. 2007; published online 19 Jan. 2007. Recommended for acceptance by D. Bader. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0234-0806. Digital Object Identifier no. 10.1109/TPDS.2007.1069.

1045-9219/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.

2 RELATED WORK

2.1 Pairwise Sequence Alignment

Surprising relationships have been discovered between protein sequences that have little overall similarity but in which similar subsequences can be found. In that sense, the identification of similar subsequences is probably the most useful and practical method for comparing two sequences. The Smith-Waterman algorithm [19] finds the most similar subsequences of two sequences (the local alignment) by DP. The algorithm compares two sequences by computing a distance that represents the minimal cost of transforming one segment into another. Two elementary operations are used: substitution and insertion/deletion (also called a gap operation). Through a series of such elementary operations, any segment can be transformed into any other segment. The smallest number of operations required to change one segment into another can be taken as the measure of the distance between the segments.

Consider two strings $S_1$ and $S_2$ of length $l_1$ and $l_2$. To identify the common subsequences, the Smith-Waterman algorithm computes the similarity $H_A(i,j)$ of the two sequences, ending at position $i$ and $j$ of the two sequences $S_1$ and $S_2$. The computation of $H_A(i,j)$, for $1 \le i \le l_1$, $1 \le j \le l_2$, is given by the following recurrences:

$$H_A(i,j) = \max\{0,\ E(i,j),\ F(i,j),\ H_A(i-1,j-1) + sbt(S_1[i], S_2[j])\},$$

$$E(i,j) = \max\{H_A(i,j-1) - \alpha,\ E(i,j-1) - \beta\},$$

$$F(i,j) = \max\{H_A(i-1,j) - \alpha,\ F(i-1,j) - \beta\},$$

where $sbt$ is a character substitution cost table. The initialization of these values is given by $H_A(i,0) = E(i,0) = H_A(0,j) = F(0,j) = 0$ for $0 \le i \le l_1$, $0 \le j \le l_2$. Multiple gap costs are taken into account as follows: $\alpha$ is the cost of the first gap; $\beta$ is the cost of the following gaps. This type of gap cost is known as affine gap penalty. Some applications also use a linear gap penalty, that is, $\alpha = \beta$. For linear gap penalties, the above recurrence relations can be simplified to

$$H_L(i,j) = \max\{0,\ H_L(i,j-1) - \alpha,\ H_L(i-1,j) - \alpha,\ H_L(i-1,j-1) + sbt(S_1[i], S_2[j])\},$$

where $H_L(i,j)$ represents a similarity value. The two segments of $S_1$ and $S_2$ producing this value can be determined by a traceback procedure. Fig. 2 illustrates an example.
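The two recurrences can be checked on the example of Fig. 2 with a few lines of Python. This is only a minimal CPU sketch, not the GPU implementation discussed later: the function names `sbt`, `sw_affine`, and `sw_linear` are illustrative, the substitution table is the toy +2/-1 scoring of Fig. 2, and the gap symbols are written `alpha` (first gap) and `beta` (following gaps).

```python
def sbt(a, b, match=2, mismatch=-1):
    """Toy substitution table: +2 for identical characters, -1 otherwise (Fig. 2)."""
    return match if a == b else mismatch


def sw_affine(s1, s2, alpha, beta):
    """Maximum of H_A over all cells; alpha = first gap, beta = following gaps."""
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 hold the zero initialization of H_A, E, and F.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]))
            best = max(best, H[i][j])
    return best


def sw_linear(s1, s2, alpha):
    """Maximum of H_L over all cells for a linear gap penalty alpha."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            H[i][j] = max(0,
                          H[i][j - 1] - alpha,
                          H[i - 1][j] - alpha,
                          H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]))
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # Sequences of Fig. 2; the highest score reported there is 10.
    print(sw_linear("ATCTCGTATGAT", "GTCTATCAC", alpha=1))
    # With alpha == beta the affine recurrence gives the same score.
    print(sw_affine("ATCTCGTATGAT", "GTCTATCAC", alpha=1, beta=1))
```

Both calls give the score 10 stated in the caption of Fig. 2, which also illustrates that the linear recurrence is the special case $\alpha = \beta$ of the affine one.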

2.2 MSA

The extension of the DP method for simultaneous alignment of multiple sequences is impractical as the time and space complexities are in the order of the product of the lengths of the sequences. Thus, many heuristics to compute MSA in reasonable time have been developed. Progressive alignment is a widely used heuristic [20]. Typically, progressive alignment methods such as ClustalW [21] consist of three steps (see Fig. 3):

1. Distance matrix. A distance value between each pair of sequences is computed using the Smith-Waterman algorithm. These values are stored in a so-called distance matrix.

2. Guided tree. This step uses the distance matrix obtained from the first step and forms a guided tree using the neighbor-joining method [22]. The leaves of the tree contain the various sequences. The topology of the tree is totally dependent upon the sequences that are taken, that is, closely related sequences are placed together and share a common branch in the guided tree and divergent sequences are widely spaced in the tree. The guided tree is used to find out closely related sequences or a group of sequences that are aligned progressively in the last step to form the final MSA.

3. Progressive alignment. First, closely related sequences or groups of sequences are aligned and, then, the most divergent sequences are aligned to get the final MSA.

Fig. 1. Peak performance comparison (measured in multiply-add instructions) of GPUs and CPUs in recent years. Figure taken from the work of Owens et al. [5].

Fig. 2. Example of the Smith-Waterman algorithm to compute the local alignment between two DNA sequences ATCTCGTATGAT and GTCTATCAC. The matrix $H_L(i,j)$ is shown for the linear gap cost $\alpha = 1$ and a substitution cost of $+2$ if the characters are identical and $-1$ otherwise. From the highest score ($+10$ in the example), a traceback procedure delivers the corresponding alignment, the two subsequences TCGTATGA and TCTATCA.

Fig. 3. The three stages of progressive MSA. (a) Distance matrix. (b) Guided tree. (c) Progressive alignment along the tree.

2.3 Hardware-Accelerated Alignment Algorithms

A number of parallel architectures have been developed for biological sequence alignment. In addition to architectures specifically designed for sequence analysis, existing programmable sequential and parallel architectures have been used for solving sequence alignment problems.

Special-purpose architectures can provide the fastest means of running a particular algorithm with very high processing element (PE) density. Each PE is specifically designed for the computation of one cell in the DP matrix (see Fig. 2). However, such architectures are limited to one single algorithm and, thus, cannot supply the flexibility necessary to run a variety of algorithms required for analyzing deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and proteins. Princeton Nucleic Acid Comparator (P-NAC) was the first such machine and computed edit distance over a four-character alphabet [23]. More recent examples, better tuned to the needs of computational biology, include the Biological Sequence Comparative Analysis Node (BioSCAN) [24] and Biological Information Signal Processor (BISP) [25].

An approach presented in [3] is based on instruction systolic arrays (ISAs). ISAs combine the speed and simplicity of systolic arrays with flexible programmability. Several other approaches are based on the single-instruction, multiple-data (SIMD) concept, for example, Kestrel [26] and Fuzion [3]. SIMD and ISA architectures are programmable and can be used for a wider range of applications such as image processing and scientific computing. Since these architectures contain more general-purpose parallel processors, their PE density is less than the density of special-purpose application-specific integrated circuits (ASICs). Nevertheless, SIMD solutions can still achieve significant runtime savings. However, the costs involved in designing and producing SIMD architectures are quite high. As a consequence, none of the above solutions has a successor generation, making upgrading impossible.

Reconfigurable systems are based on programmable logic such as field-programmable gate arrays (FPGAs). They are generally slower and have lower PE densities than special-purpose architectures, for example, [27]. They are flexible, but the configuration must be changed for each algorithm, which is generally more complicated than writing new code for a programmable architecture. Solutions based on FPGAs have the additional advantage that they can be regularly upgraded to state-of-the-art technology.

2.4 General-Purpose Computing on GPUs

All of the above approaches can be seen as accelerators, that is, as an approach satisfying the demand for a low-cost solution to computationally intensive problems. The main advantage of GPUs compared to the architectures mentioned above is that they are commodity components. In particular, most users already have access to PCs with modern graphics cards. For these users, this direction provides a zero-cost solution. Even if a graphics card has to be bought, the installation of such a card is trivial (plug and play). Writing the software for such a card still requires specialist knowledge, but new high-level languages such as Cg [8] offer a simplified programming environment.

As the programmability of GPUs is increasingly enhanced, many researchers have advocated the use of GPUs as a streaming processor [9], [28], [10] for general-purpose computation. The idea of using the power of graphics hardware for general-purpose computation is not new. The first approaches were done on machines like the Ikonas [29], the Pixel Machine [30], and Pixel-Planes [31] before the age of GPUs. High throughput and direct access to texture memory make GPUs powerful engines for certain numerical applications, including robot motion planning [32], procedural texturing and shading [33], [31], collision detection [34], ray tracing [28], image-based modeling [35], multigrid solvers for boundary value problems [36], physically-based visual simulation [37], simulation of cloud dynamics [38], and database operations [13].

The paper presented by Liu et al. [39] is close to the approach presented in this paper since it also uses GPUs to accelerate pairwise sequence alignments. In [39], parallelism is achieved by processing two pairwise alignments at the same time. Unfortunately, this significantly limits the number of cells computed in parallel in each iteration step, which makes the GPU’s internal memory bandwidth a bottleneck. Our approach overcomes this bottleneck by computing much larger batches of alignments in parallel. We also show how this can be extended to compute MSAs efficiently.

3 GPU ARCHITECTURE

Computation on a GPU follows a fixed order of processing stages, called the graphics pipeline (see Fig. 4). The pipeline consists of three stages: vertex processing, rasterization, and fragment processing. The vertex processing stage transforms three-dimensional (3D) vertex world coordinates into two-dimensional (2D) vertex screen coordinates. The rasterizer then converts the geometric vertex representation into an image fragment representation. Finally, the fragment processor forms a color for each pixel by reading texture pixels (texels) from the texture memory. Modern GPUs support the programmability of the vertex and fragment processor. Fragment programs, for instance, can be used to implement any mathematical operation on one or more input vectors (textures or fragments) to compute the color of a pixel. In order to meet the ever-increasing performance requirements set by the gaming industry, modern GPUs use two types of parallelism. First, multiple processors work on the vertex and fragment processing stage, that is, they operate on different vertices and fragments in parallel. For example, a typical midrange graphics card such as the Nvidia GeForce 6800 GT has six vertex processors and 16 fragment processors. Second, operations on four-dimensional vectors (the four channels red, green, blue, alpha (RGBA)) are natively supported without performance loss.

Fig. 4. Illustration of the GPU graphics pipeline.

The explicit parallelism and explicit memory locality make stream processing very fast. Streaming processors read an input stream, apply the kernel to the stream, and write the results into an output stream. In case of several kernels, the output stream of the leading kernel is the input stream for the following kernel (see Fig. 5). The vast majority of general-purpose GPU applications use only fragment programs for their computation. In this case, textures are considered input streams and the render buffers are output streams. Because fragment processors are SIMD architectures, only one program can be loaded at a time. Applying several kernels thus requires several passes (see Fig. 6).

A typical GPGPU program is structured as follows:

1. Data-parallel sections of the application are identified by the programmer. Each such section can be considered a kernel and is implemented as a fragment program. The input and output of each kernel are one or more data arrays, which are stored in textures in GPU memory.

2. To invoke a kernel, the range of the computation (or the size of the output stream) must be specified. The programmer does this by passing vertices to the GPU. A typical GPGPU invocation is a quadrilateral (quad) oriented parallel to the image plane, sized to cover a rectangular region of pixels matching the desired size of the output array.

3. The rasterizer generates a fragment for every pixel location in the quad, producing thousands to millions of fragments.

4. Each of the generated fragments is then processed by the active kernel fragment program. The fragment program can read from arbitrary texture memory locations, but can only write to memory locations corresponding to the location of the fragment in the frame buffer.

5. The output of the fragment program is a value (or vector of values) per fragment. This output may be the final result of the application or it may be stored as a texture and then used in subsequent passes. This feedback loop is realized by using the output buffer of a completed pass as input texture for the following one (known as RTT).

6. If the output of the fragment program will be further processed on the CPU, data readback from GPU to CPU is required. Because of the relatively low bus bandwidth between the CPU and GPU, the readback operation is a known bottleneck for GPGPU applications and should be minimized.
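The six steps above can be mimicked on the CPU to make the data flow concrete. The following is a pure-Python emulation under loud assumptions: it uses no graphics API at all, `run_pass`, `multipass`, and the two toy kernels are hypothetical names, and a nested list stands in for a texture. It only illustrates that a kernel may gather from any texel but writes exactly one output location, and that the output buffer of one pass becomes the input texture of the next.

```python
def run_pass(kernel, textures, width, height):
    """Emulate one rendering pass over a width x height quad.

    The loop plays the role of the rasterizer: it generates one 'fragment'
    per pixel, and the kernel computes the value written at that pixel only.
    """
    return [[kernel(textures, x, y) for x in range(width)]
            for y in range(height)]


def scale_kernel(textures, x, y):
    """Toy kernel: reads the texel at its own location and doubles it."""
    return 2 * textures["in"][y][x]


def blur_kernel(textures, x, y):
    """Toy kernel: gathers from neighboring texels (clamped at the border)."""
    row = textures["in"][y]
    left = row[max(x - 1, 0)]
    right = row[min(x + 1, len(row) - 1)]
    return left + row[x] + right


def multipass(kernels, data):
    """Apply the kernels one pass at a time (only one kernel is 'loaded').

    The output buffer of a completed pass is bound as the input texture of
    the following pass, which is the render-to-texture feedback loop.
    """
    buf = data
    for kernel in kernels:
        buf = run_pass(kernel, {"in": buf}, len(buf[0]), len(buf))
    return buf  # a single 'readback' at the very end


if __name__ == "__main__":
    print(multipass([scale_kernel, blur_kernel], [[1, 2, 3]]))
```

Keeping the intermediate buffer inside `multipass` and returning only the final one mirrors the advice in step 6 to minimize readback.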

4 STREAMING ALIGNMENT ALGORITHMS ON GPUS

4.1 Pairwise Sequence Alignment

4.1.1 Streaming Algorithm

Our method takes advantage of the fact that all elements in the same antidiagonal of the Smith-Waterman DP matrix can be computed independently of each other in parallel (see Fig. 7). Thus, the basic idea is to compute the DP matrix in antidiagonal order. The antidiagonals are stored as textures in the texture memory. Kernels are then used to implement the arithmetic operations specified by the recurrence relation.

Assume that we are aligning two sequences of length $l_1$ and $l_2$ with affine gap penalties on a GPU. As a preprocessing step, both sequences and the substitution matrix are loaded into the texture memory. We then compute the DP matrix in $l_1 + l_2 - 1$ kernels. In kernel (rendering pass) $k$ $(1 \le k \le l_1 + l_2 - 1)$, the values $H_A(i,j)$, $E(i,j)$, and $F(i,j)$ for all $i, j$ with $1 \le i \le l_1$, $1 \le j \le l_2$, and $k = i + j - 1$ are computed by the fragment processors in parallel. The new antidiagonal is stored in the texture memory as a texture. The subsequent kernel then reads the two previous antidiagonals from this memory.
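To make the indexing $k = i + j - 1$ concrete, the short sketch below lists the cells that belong to rendering pass $k$ and their count. The helper names `antidiagonal_cells` and `antidiagonal_length` are illustrative only; the closed form for the length is derived here from the index bounds and is not taken from the paper.

```python
def antidiagonal_cells(k, l1, l2):
    """Cells (i, j) with 1 <= i <= l1, 1 <= j <= l2 and i + j - 1 == k."""
    first_row = max(1, k + 1 - l2)   # j = k + 1 - i must not exceed l2
    last_row = min(l1, k)            # j = k + 1 - i must be at least 1
    return [(i, k + 1 - i) for i in range(first_row, last_row + 1)]


def antidiagonal_length(k, l1, l2):
    """Number of cells on antidiagonal k (derived from the bounds above)."""
    return min(k, l1, l2, l1 + l2 - k)


if __name__ == "__main__":
    l1, l2 = 12, 9                   # sequence lengths of the Fig. 2 example
    for k in range(1, l1 + l2):      # passes k = 1 .. l1 + l2 - 1
        cells = antidiagonal_cells(k, l1, l2)
        assert len(cells) == antidiagonal_length(k, l1, l2)
        print(k, cells)
```

For every cell $(i, j)$ of pass $k$, the neighbors $(i, j-1)$ and $(i-1, j)$ satisfy $i + j - 1 = k - 1$ and the neighbor $(i-1, j-1)$ satisfies $i + j - 1 = k - 2$, which is why only the two previous antidiagonals have to be read.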

Note that the purpose of our algorithm is the acceleration of sequence database scanning. This requires the alignment of a query sequence to a set of subject sequences. All subject sequences are ranked by the maximum score in the similarity matrix. Reconstruction of the actual alignment (the traceback) is performed only for the highest ranked sequences. Therefore, only the highest score of each pairwise alignment is computed on the GPU. Ranking the compared sequences and reconstructing the alignments are carried out by the front-end PC. Because this last operation is only performed for very few subject sequences, its computation time is negligible.

Fig. 5. Streaming model that applies kernels to an input stream and writes to an output stream.

Fig. 6. Multipass method for applying n kernels. The output buffer of the leading kernel is the input texture for the following kernel.

Fig. 7. Data dependency relationship in the Smith-Waterman DP matrix. Each cell $(i,j)$ depends on its left neighbor $(i,j-1)$, upper neighbor $(i-1,j)$, and upper left neighbor $(i-1,j-1)$. Therefore, all cells along antidiagonal $k$ can be computed in parallel from the antidiagonals $k-1$ and $k-2$.

Fig. 8 shows the streaming algorithm framework for sequence database scanning. It consists of two loops. The outer loop loads a new batch of subject sequences into the texture memory that will then be aligned to the query sequence. The inner loop computes the corresponding DP matrices in antidiagonal order as described above.
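The two-loop structure and the score-only ranking can be sketched on the CPU as follows. This is a rough stand-in, not the framework of Fig. 8: `sw_score` and `scan_database` are hypothetical names, the scoring uses the toy linear-gap parameters of Fig. 2, and the inner loop simply scores the subjects of a batch one after another, whereas on the GPU all alignments of a batch would advance by one antidiagonal per rendering pass.

```python
def sw_score(query, subject, alpha=1, match=2, mismatch=-1):
    """Highest Smith-Waterman score (linear gap penalty), keeping two rows."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for a in query:
        cur = [0]
        for j, b in enumerate(subject, start=1):
            s = match if a == b else mismatch
            h = max(0, prev[j - 1] + s, prev[j] - alpha, cur[j - 1] - alpha)
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


def scan_database(query, subjects, batch_size):
    """Score every subject against the query and rank subjects by score."""
    scores = []
    # Outer loop: load one batch of subject sequences at a time.
    for start in range(0, len(subjects), batch_size):
        batch = subjects[start:start + batch_size]
        # Inner loop (stand-in): compute the highest score of each alignment.
        scores.extend(sw_score(query, subject) for subject in batch)
    # Ranking is done on the host; traceback would follow only for the top hits.
    ranking = sorted(range(len(subjects)), key=lambda idx: scores[idx],
                     reverse=True)
    return scores, ranking


if __name__ == "__main__":
    db = ["GTCTATCAC", "ATCTCGTATGAT", "CCCC"]
    print(scan_database("ATCTCGTATGAT", db, batch_size=2))
```

Only one number per subject sequence leaves the scoring loop, which is the property that keeps the communication volume small relative to the computation.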

4.1.2 GPU Implementation

Textures are the GPU analog of arrays on the CPU. Since GPUs treat objects as polygon meshes, textures can be attached to a polygon. Each vertex of the polygon carries texture location information in the form of $(x, y)$ coordinates, and the requested texture is interpolated across the polygon surface. This process is called texture mapping. Texture mapping is an efficient way to provide intricate surface detail without increasing an object’s polygon count, and GPUs have specialized texture mapping hardware that is designed to assign the color of the fragments of a primitive. The fragment color is assigned by performing a lookup based on the texture coordinates.

Our GPU implementation of the streaming algorithm for pairwise alignment maps the query sequence, the batch of subject sequences, and the substitution matrix onto three textures. Another two textures are used to transfer the neighboring cells’ coordinates to the kernel. These five textures will be used by the kernel to do operations such as lookup and rendering.

As mentioned above, the kernel $k$ computes the antidiagonal $k$ $(1 \le k \le l_1 + l_2 - 1)$ if two sequences of length $l_1$ and $l_2$ are aligned. The newly computed antidiagonal is stored in the texture memory as a texture. The subsequent rendering pass then reads the two previous antidiagonals from the texture memory. Since antidiagonal $k$ depends on the antidiagonals $k-1$ and $k-2$, three antidiagonals have to be stored as separate buffers. We are using a cyclic method to change the buffer function as follows: Antidiagonals $k-1$ and $k-2$ are in the form of texture input, and antidiagonal $k$ is the render target. In the subsequent iteration, $k$ becomes $k-1$, $k-1$ becomes $k-2$, and $k-2$ becomes $k$. This is further illustrated in Fig. 9. An arrow pointing toward the fragment program means that the buffer is used as texture. An arrow pointing from the fragment program to a buffer means that the buffer is used as the render target. Fig. 10 shows the pseudocode of our GPU implementation.
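The cyclic change of buffer roles can be illustrated with a CPU sketch that keeps exactly three antidiagonal buffers. It is not the pseudocode of Fig. 10: the function name `sw_affine_wavefront` is hypothetical, Python lists indexed by the row $i$ stand in for the buffers, tuples `(H, E, F)` stand in for the color channels, and the inner loop over $i$ runs sequentially where the fragment processors would run in parallel.

```python
def sw_affine_wavefront(s1, s2, alpha, beta, match=2, mismatch=-1):
    """Highest H_A score, computed antidiagonal by antidiagonal."""
    l1, l2 = len(s1), len(s2)
    border = (0, 0, 0)                       # (H, E, F) outside the matrix
    # Three separate buffers; entry i holds cell (i, k + 1 - i) of a diagonal.
    target = [border] * (l1 + 1)             # render target: antidiagonal k
    prev1 = [border] * (l1 + 1)              # texture input: antidiagonal k-1
    prev2 = [border] * (l1 + 1)              # texture input: antidiagonal k-2
    best = 0
    for k in range(1, l1 + l2):              # passes k = 1 .. l1 + l2 - 1
        for i in range(max(1, k + 1 - l2), min(l1, k) + 1):
            j = k + 1 - i
            left = prev1[i] if j > 1 else border                 # (i, j-1)
            up = prev1[i - 1] if i > 1 else border               # (i-1, j)
            diag = prev2[i - 1] if i > 1 and j > 1 else border   # (i-1, j-1)
            e = max(left[0] - alpha, left[1] - beta)
            f = max(up[0] - alpha, up[2] - beta)
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            h = max(0, e, f, diag[0] + s)
            target[i] = (h, e, f)
            best = max(best, h)
        # Cyclic role change: k becomes k-1, k-1 becomes k-2, and the old
        # k-2 buffer is reused as the next render target.
        target, prev1, prev2 = prev2, target, prev1
    return best


if __name__ == "__main__":
    print(sw_affine_wavefront("ATCTCGTATGAT", "GTCTATCAC", alpha=1, beta=1))
```

The explicit border checks matter because the buffers are reused: an entry that was written three passes earlier may still be present, so a neighbor is read only when it is a valid cell of the expected antidiagonal.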

One concern is the way to map each antidiagonal of the DP matrix into a quad. Fig. 11a shows that drawing an antidiagonal quad would introduce many unwanted cells since all cells touched by the quad would be rendered. This problem can be solved by shifting each row of the matrix by its row number (see Fig. 11b). With this method, a $1 \times L(k)$ quad can be rendered in each iteration, where $L(k)$ denotes the length of antidiagonal $k$. Furthermore, it is possible to perform $N$ pairwise comparisons at the same time by using 2D render buffers. This is shown in Fig. 12. The render buffer contains $N$ antidiagonals of length $L(k)$. The computation is invoked by drawing an $N \times L(k)$ quad in each iteration. In order to perform rendering on a 2D render buffer, 2D texture lookups are used.
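The effect of the row shift can be seen on a small matrix of cell labels. The sketch below is only an illustration of the index mapping, with hypothetical helper names `shift_rows` and `column` and 0-based indices: after shifting row $i$ to the right by $i$ positions, every antidiagonal of the original matrix lines up as one column, so it can be covered by an axis-aligned strip without touching unwanted cells.

```python
def shift_rows(matrix):
    """Shift row i (0-based) right by i positions; None marks unused cells."""
    rows, cols = len(matrix), len(matrix[0])
    return [[None] * i + list(row) + [None] * (rows - 1 - i)
            for i, row in enumerate(matrix)]


def column(shifted, c):
    """Entries of column c of the shifted matrix that hold a real cell."""
    return [row[c] for row in shifted if row[c] is not None]


if __name__ == "__main__":
    # Label each cell with its 0-based (row, column) index.
    labels = [[(i, j) for j in range(4)] for i in range(3)]
    shifted = shift_rows(labels)
    for c in range(len(shifted[0])):
        # Column c collects exactly the cells with i + j == c, that is,
        # antidiagonal k = c + 1 in the 1-based numbering of the text.
        print(c, column(shifted, c))
```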

The overall maximum computation of the matrix $H_A$ can be incorporated as follows: The value $\max\{H_A(i,j),\ H_A(i-1,j),\ H_A(i,j-1)\}$ is calculated for each cell and stored in the A-channel of an RGBA color pixel. The R, G, and B-channels are used for the computation of $H_A(i,j)$, $E(i,j)$, and $F(i,j)$, respectively (see Fig. 13).
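One way to read this is as a running maximum that travels with the data, so that no separate reduction pass is needed. The sketch below makes an assumption that the text does not spell out: the value taken from the upper and left neighbors is their stored A-channel value (the maximum seen so far), not their raw $H_A$ value. Under that assumption the A-channel of the last cell equals the overall maximum of the matrix. The function name `running_max` is hypothetical.

```python
def running_max(H):
    """A[i][j] = max(H[i][j], A[i-1][j], A[i][j-1]) with zero borders.

    Assumes H contains non-negative scores, as Smith-Waterman matrices do.
    """
    rows, cols = len(H), len(H[0])
    A = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            up = A[i - 1][j] if i > 0 else 0       # A-channel of (i-1, j)
            left = A[i][j - 1] if j > 0 else 0     # A-channel of (i, j-1)
            A[i][j] = max(H[i][j], up, left)
    return A


if __name__ == "__main__":
    H = [[0, 2, 1],
         [3, 0, 0],
         [0, 1, 5]]
    A = running_max(H)
    print(A[-1][-1] == max(max(row) for row in H))
```

Because each cell only needs its upper and left neighbors, this propagation follows the same antidiagonal dependency as the score itself.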

4.2 MSA

4.2.1 Streaming Algorithm

The first stage (distance matrix computation) of the progressive alignment method is usually its most computationally intensive part. For example, our profiling of ClustalW reveals that more than 93 percent of the overall runtime is spent on this stage (see Table 2). Therefore, we only concentrate on developing a streaming algorithm for this stage. The following definition of distance between two sequences is used by the ClustalW program.

Fig. 8. Our GPU-based streaming alignment algorithm framework.

Fig. 9. Cyclic change of the functions of buffers A, B, and C for the computation of antidiagonals in the DP matrix.

Fig. 10. The pseudocode of our GPU implementation for database scanning using pairwise alignments. We are using a cyclic method to change the buffer function as follows: Antidiagonals $k-1$ and $k-2$ are bound as texture input and antidiagonal $k$ is treated as the render target. In the subsequent iteration, $k$ becomes $k-1$, $k-1$ becomes $k-2$, and $k-2$ becomes $k$. This cyclic loop will continue until all antidiagonals are computed.

Definition 1. Let $S = \{S_1, \ldots, S_n\}$ be a set of $n$ sequences. For two sequences $S_i, S_j \in S$, we define their distance $d(S_i, S_j)$ as follows:

$$d(S_i, S_j) = 1 - \frac{nid(S_i, S_j)}{\min\{l_i, l_j\}},$$

where $nid(S_i, S_j)$ denotes the number of exact matches in the optimal local alignment of $S_i$ and $S_j$ (with respect to the given scoring system, that is, the substitution matrix $sbt$ and gap penalty parameters $\alpha$ and $\beta$ for affine gap penalties or just $\alpha$ for linear gap penalties), and $l_i$ $(l_j)$ denotes the length of $S_i$ $(S_j)$.

The value $nid(S_i, S_j)$ can be computed by counting the number of exact character matches during the traceback procedure of the Smith-Waterman algorithm. For instance, the nid-value for the example given in Fig. 2 is six. Unfortunately, using this method would require storing the complete DP matrix, which is not feasible for long sequences. In this section, we present a new recurrence relation for the nid-value computation that is suitable for streaming architectures such as GPUs. It facilitates nid-calculation without computation of the actual traceback and exhibits the data dependency shown in Fig. 7. In the following, we first explain the idea for the linear gap penalty and then generalize it for affine gap penalties.

Definition 2. Given are the two sequences $S_1$ and $S_2$, the linear gap penalty $\alpha$, and a substitution table $sbt$. The matrix $N_L(i,j)$ $(1 \le i \le l_1,\ 1 \le j \le l_2)$ is recursively defined as follows:

$$N_L(i,j) = \begin{cases}
0, & \text{if } H_L(i,j) = 0 \\
N_L(i-1,j-1) + m(i,j), & \text{if } H_L(i,j) = H_L(i-1,j-1) + sbt(S_1[i], S_2[j]) \\
N_L(i,j-1), & \text{if } H_L(i,j) = H_L(i,j-1) - \alpha \\
N_L(i-1,j), & \text{if } H_L(i,j) = H_L(i-1,j) - \alpha,
\end{cases}$$

where

$$m(i,j) = \begin{cases}
1, & \text{if } S_1[i] = S_2[j] \\
0, & \text{otherwise.}
\end{cases}$$

Theorem 1. For the local alignment of the sequences $S_1$ and $S_2$, according to a linear gap penalty $\alpha$ and substitution matrix $sbt$, it holds that

$$nid(S_1, S_2) = N_L(i_{max}, j_{max}),$$

where $(i_{max}, j_{max})$ denote the coordinates of the maximum value in the corresponding matrix $H_L$.

Proof. Consider the optimal alignment of all pairs of suffixes of the first $i$ characters of $S_1$ $(S_1[1 \ldots i])$ and the first $j$ characters of $S_2$ $(S_2[1 \ldots j])$. This alignment is called the optimal $i, j$ suffix alignment (of $S_1$ and $S_2$). It can be found by computing a traceback in the matrix $H_L$ starting from cell $(i,j)$. We now show that, for a given pair of indices $i, j$ $(1 \le i \le l_1,\ 1 \le j \le l_2)$, $N_L(i,j)$ is equal to the number of exact matches in the optimal $i, j$ suffix alignment. The claim then follows from the fact that the optimal $i_{max}, j_{max}$ suffix alignment is equal to the optimal local alignment.

Case 1. $H_L(i,j) = 0$. The corresponding alignment is empty. Hence, $N_L(i,j) = 0$.

Case 2. $H_L(i,j) = H_L(i-1,j-1) + sbt(S_1[i], S_2[j])$. The alignment ends with $S_1[i]$ aligned to $S_2[j]$, which contributes $m(i,j)$ to the number of exact matches. The remaining number is then equal to the number of exact matches found in the optimal $i-1, j-1$ suffix alignment. Hence, $N_L(i,j) = N_L(i-1,j-1) + m(i,j)$.

Case 3. $H_L(i,j) = H_L(i-1,j) - \alpha$. The alignment ends with $S_1[i]$ aligned to a gap, which contributes zero exact matches. The remaining number is equal to the number found in the optimal $i-1, j$ suffix alignment. Hence, $N_L(i,j) = N_L(i-1,j)$.

Case 4. $H_L(i,j) = H_L(i,j-1) - \alpha$. Similar to Case 3, it follows that $N_L(i,j) = N_L(i,j-1)$.

Because $H_L(i,j)$ must be equal to one of these four cases, the theorem is proven. □

Fig. 11. (a) Drawing a quad over the antidiagonal results in rendering too many pixels. (b) Matrix with shifted rows.

Fig. 12. Using 2D render buffers to do $N$ pairwise alignments in parallel. Assume that the length of the query sequence is $l_1$ and the maximum length of the subject sequences is $l_2$. Then there is a total of $l_1 + l_2 - 1$ rendering passes (antidiagonals). The render buffer at rendering pass $k$ is shown. It renders $N$ antidiagonals of length $L(k)$ in parallel (in different fragment processors).

Fig. 13. Using the RGBA channels of 2D texture buffers for the computation of $H_A$, $E$, $F$, and maximum.
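Definition 2 and Theorem 1 can be tried out on the example of Fig. 2 with a short CPU sketch. The names `nid_linear` and `distance` are illustrative, the scoring is the toy +2/-1 table of Fig. 2, and full matrices are kept for readability even though only the neighboring cells are needed. When several cases of the recurrence hold at once, the sketch takes the first one in the order listed in Definition 2; that tie-breaking order is an assumption of this sketch.

```python
from fractions import Fraction


def nid_linear(s1, s2, alpha=1, match=2, mismatch=-1):
    """Return (highest score in H_L, N_L at that cell) without a traceback."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    N = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, nid = 0, 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            same = s1[i - 1] == s2[j - 1]
            diag = H[i - 1][j - 1] + (match if same else mismatch)
            left = H[i][j - 1] - alpha
            up = H[i - 1][j] - alpha
            h = max(0, diag, left, up)
            if h == 0:
                n = 0                                  # empty alignment
            elif h == diag:
                n = N[i - 1][j - 1] + (1 if same else 0)
            elif h == left:
                n = N[i][j - 1]
            else:
                n = N[i - 1][j]
            H[i][j], N[i][j] = h, n
            if h > best:                               # track (i_max, j_max)
                best, nid = h, n
    return best, nid


def distance(s1, s2, alpha=1):
    """Distance of Definition 1, as an exact fraction."""
    _, nid = nid_linear(s1, s2, alpha)
    return 1 - Fraction(nid, min(len(s1), len(s2)))


if __name__ == "__main__":
    print(nid_linear("ATCTCGTATGAT", "GTCTATCAC"))   # score and nid for Fig. 2
    print(distance("ATCTCGTATGAT", "GTCTATCAC"))
```

For the sequences of Fig. 2 this yields the score 10 and the nid-value six mentioned earlier, so the distance of Definition 1 is $1 - 6/9 = 1/3$.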

The recurrence relation for matrix $N_L(i,j)$ exhibits a dependency similar to the Smith-Waterman algorithm: Each cell $(i,j)$ is computed from the cells $(i-1,j)$, $(i,j-1)$, and $(i-1,j-1)$. Hence, it can be computed using the antidiagonal method described in Fig. 7 together with the matrix $H_L(i,j)$. For affine gap penalties, our method is extended as follows:

Definition 3. Given are the two sequences $S_1$ and $S_2$, the affine gap penalties $\alpha$ and $\beta$, and the substitution table $sbt$. The matrix $N_A(i,j)$ $(1 \le i \le l_1,\ 1 \le j \le l_2)$ is recursively defined as follows:

$$N_A(i,j) = \begin{cases}
0, & \text{if } H_A(i,j) = 0 \\
N_A(i-1,j-1) + m(i,j), & \text{if } H_A(i,j) = H_A(i-1,j-1) + sbt(S_1[i], S_2[j]) \\
N_E(i,j), & \text{if } H_A(i,j) = E(i,j) \\
N_F(i,j), & \text{if } H_A(i,j) = F(i,j),
\end{cases}$$

where

$$m(i,j) = \begin{cases}
1, & \text{if } S_1[i] = S_2[j] \\
0, & \text{otherwise,}
\end{cases}$$

$$N_E(i,j) = \begin{cases}
0, & \text{if } j = 1 \\
N_A(i,j-1), & \text{if } E(i,j) = H_A(i,j-1) - \alpha \\
N_E(i,j-1), & \text{if } E(i,j) = E(i,j-1) - \beta,
\end{cases}$$

$$N_F(i,j) = \begin{cases}
0, & \text{if } i = 1 \\
N_A(i-1,j), & \text{if } F(i,j) = H_A(i-1,j) - \alpha \\
N_F(i-1,j), & \text{if } F(i,j) = F(i-1,j) - \beta.
\end{cases}$$

Theorem 2. For the local alignment of the sequences $S_1$ and $S_2$, according to the affine gap penalties $\alpha$ and $\beta$ and substitution matrix $sbt$, it holds that

$$nid(S_1, S_2) = N_A(i_{max}, j_{max}),$$

where $(i_{max}, j_{max})$ denote the coordinates of the maximum value in the corresponding matrix $H_A$.

Proof. Similar to the proof of Theorem 1, we show that, for a given pair of indices $i, j$ $(1 \le i \le l_1,\ 1 \le j \le l_2)$, $N_A(i,j)$ is equal to the number of exact matches in the optimal $i$, $j$ suffix alignment.

Case 1. $H_A(i,j) = 0$. The corresponding alignment is empty. Hence, $N_A(i,j) = 0$.

Case 2. $H_A(i,j) = H_A(i-1,j-1) + \mathit{sbt}(S_1[i], S_2[j])$. Similarly to Case 2 in the previous proof, it follows that $N_A(i,j) = N_A(i-1,j-1) + m(i,j)$.

Case 3. $H_A(i,j) = F(i,j)$. The alignment ends with $S_1[i]$ aligned to a gap, which contributes zero exact matches. The remaining number is equal to the number found in the optimal $i-1$, $j$ suffix alignment. Depending on whether this alignment ends with a gap, this number is either $N_A(i-1,j)$ or $N_F(i-1,j)$. Hence, $N_A(i,j) = N_F(i,j)$.

Case 4. $H_A(i,j) = E(i,j)$. Similarly to Case 3, it follows that $N_A(i,j) = N_E(i,j)$.

Because $H_A(i,j)$ must be equal to one of these four cases, Theorem 2 is proven. □
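As an illustration of Definition 3, the following Python sketch computes the three score matrices together with the three match-count matrices and reads the result at the position of the maximum score, as stated in Theorem 2. It assumes that the first gap penalty is charged when a gap is opened and the second when it is extended, which is what the recurrences for E and F suggest. The function name `nid_affine`, the match/mismatch score used in place of the substitution table, the default penalties, and the tie-breaking order are hypothetical choices made for this example only.

```python
NEG = float("-inf")  # E and F are undefined on the borders


def nid_affine(s1, s2, alpha=3, beta=1, match=2, mismatch=-1):
    """Return (best local score, number of exact matches) with affine gaps.

    alpha: cost of opening a gap, beta: cost of extending it (assumed roles).
    """
    l1, l2 = len(s1), len(s2)

    def table(fill):
        return [[fill] * (l2 + 1) for _ in range(l1 + 1)]

    H, E, F = table(0), table(NEG), table(NEG)
    NA, NE, NF = table(0), table(0), table(0)
    best, best_nid = 0, 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            # E(i,j): the alignment ends with S2[j] aligned to a gap.
            e_open, e_ext = H[i][j - 1] - alpha, E[i][j - 1] - beta
            E[i][j] = max(e_open, e_ext)
            NE[i][j] = NA[i][j - 1] if E[i][j] == e_open else NE[i][j - 1]
            # F(i,j): the alignment ends with S1[i] aligned to a gap.
            f_open, f_ext = H[i - 1][j] - alpha, F[i - 1][j] - beta
            F[i][j] = max(f_open, f_ext)
            NF[i][j] = NA[i - 1][j] if F[i][j] == f_open else NF[i - 1][j]
            same = s1[i - 1] == s2[j - 1]
            diag = H[i - 1][j - 1] + (match if same else mismatch)
            h = max(0, diag, E[i][j], F[i][j])
            if h == 0:
                n = 0
            elif h == diag:
                n = NA[i - 1][j - 1] + (1 if same else 0)
            elif h == E[i][j]:
                n = NE[i][j]
            else:
                n = NF[i][j]
            H[i][j], NA[i][j] = h, n
            if h > best:
                best, best_nid = h, n
    return best, best_nid
```

For example, `nid_affine("AAACCC", "AAATTCCC")` aligns the two sequences with one gap of length two and reports six exact matches. The zero-initialised borders reproduce the special cases j = 1 and i = 1 of the definition without extra branches.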

4.2.2 GPU Implementation

Due to the introduction of the matrices $N_A$, $N_F$, and $N_E$, our previous implementation (see Fig. 10) has to be extended by additional textures and rendering buffers. Fig. 14 gives the corresponding pseudocode for the streaming alignment algorithm on the GPU.

In the render buffers, we pack the data as follows: The values maximum and nid are calculated for each cell and stored in the A-channel of an RGBA color pixel of two separate render buffers. In the first render buffer, the R, G, and B-channels are used for the computation of $H_A$, $E$, and $F$, respectively. In the second render buffer, the R, G, and B-channels are used for the computation of $N_A$, $N_E$, and $N_F$, respectively.
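The packing scheme can be mimicked on the CPU to make the data flow concrete. In the Python sketch below, every cell is a pair of 4-tuples that play the role of the two RGBA pixels, and the matrix is swept antidiagonal by antidiagonal while only the two previous antidiagonals are kept. Several details are assumptions made for this illustration and are not taken from Fig. 14: the A-channel of the first pixel is taken to hold the running maximum score and the A-channel of the second pixel the matching nid value, both are propagated from the three neighbouring cells, dictionaries stand in for textures, and a match/mismatch score replaces the substitution table. The name `stream_nid` is hypothetical.

```python
NEG = float("-inf")
# Pixel pair returned for positions outside the matrix:
# (H, E, F, running max) and (N_A, N_E, N_F, nid of running max).
BORDER = ((0, NEG, NEG, 0), (0, 0, 0, 0))


def stream_nid(s1, s2, alpha=3, beta=1, match=2, mismatch=-1):
    """Antidiagonal sweep over packed cells; returns (max score, nid)."""
    l1, l2 = len(s1), len(s2)
    if l1 == 0 or l2 == 0:
        return 0, 0
    prev2, prev1 = {}, {}  # antidiagonals d-2 and d-1, keyed by row index i
    for d in range(2, l1 + l2 + 1):
        cur = {}
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):
            j = d - i
            (hu, _, fu, au), (nau, _, nfu, idu) = prev1.get(i - 1, BORDER)  # (i-1, j)
            (hl, el, _, al), (nal, nel, _, idl) = prev1.get(i, BORDER)      # (i, j-1)
            (hd, _, _, ad), (nad, _, _, idd) = prev2.get(i - 1, BORDER)     # (i-1, j-1)
            e_open, e_ext = hl - alpha, el - beta
            e = max(e_open, e_ext)
            ne = nal if e == e_open else nel
            f_open, f_ext = hu - alpha, fu - beta
            f = max(f_open, f_ext)
            nf = nau if f == f_open else nfu
            same = s1[i - 1] == s2[j - 1]
            diag = hd + (match if same else mismatch)
            h = max(0, diag, e, f)
            if h == 0:
                na = 0
            elif h == diag:
                na = nad + (1 if same else 0)
            elif h == e:
                na = ne
            else:
                na = nf
            # A-channels: carry the best score seen so far and its nid value.
            best, nid = max((h, na), (au, idu), (al, idl), (ad, idd),
                            key=lambda pair: pair[0])
            cur[i] = ((h, e, f, best), (na, ne, nf, nid))
        prev2, prev1 = prev1, cur
    score_pixel, nid_pixel = prev1[l1]  # last antidiagonal holds cell (l1, l2)
    return score_pixel[3], nid_pixel[3]
```

Because the running maximum is carried along in the A-channel, the result can be read from the last cell alone, which avoids a separate search over the whole matrix in this sketch.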


Fig. 14. Pseudocode of our GPU implementation for MSA. Two render buffers c � and nidC � are used. We first create textures and render buffers. Then, we dynamically bind antidiagonal textures with corresponding read buffers, set coordinates, and call the kernel program to perform rendering operations.

5 PARTITIONING

The streaming architecture of modern GPUs is optimized for high-throughput computations. Hence, the performance of streaming alignment algorithms increases with the number of DP matrix cells that are computed in each rendering pass. This has motivated our choice to compute N pairwise alignments in parallel in each rendering pass (see Fig. 12). In practice, N is restricted by the size of the rendering buffer, that is, N ≤ 4,096 for most modern GPUs.

However, the length of biological sequences, such as proteins, varies. Since rendering buffers are rectangular objects, the length distribution of subject sequences within a batch influences the application's performance. The size of the buffer is determined by the longest sequence of the batch. Hence, superfluous cell computations are introduced for the alignment to the shorter sequences. For database scanning using pairwise alignments, the amount of superfluous computation can be reduced significantly as follows: The sequences in the database are sorted by their lengths and then partitioned into smaller sized batches. Fig. 15 illustrates this situation.
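The effect of sorting can be quantified with a small calculation. The Python sketch below counts the cells that are computed when every batch is padded to the length of its longest sequence and compares this with the number of cells that are actually needed. The function names, the batch size, and the sequence lengths are made up for illustration; the sketch also assumes that a partially filled last batch only occupies as many columns as it has sequences.

```python
def padded_cells(lengths, query_len, batch_size):
    """Cells computed when consecutive sequences share one rectangular buffer
    whose height is given by the longest sequence of the batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += query_len * max(batch) * len(batch)
    return total


def useful_cells(lengths, query_len):
    """Cells that belong to real DP matrices (no padding)."""
    return query_len * sum(lengths)


# Hypothetical subject lengths, alternating short and long sequences.
lengths = [100, 4000, 120, 3900, 110, 4050, 90, 3950]
unsorted_cost = padded_cells(lengths, query_len=300, batch_size=4)
sorted_cost = padded_cells(sorted(lengths), query_len=300, batch_size=4)
needed = useful_cells(lengths, query_len=300)
print(unsorted_cost, sorted_cost, needed)
```

With these invented numbers, the unsorted order computes 9,660,000 cells, whereas the sorted order computes 5,004,000 cells for 4,896,000 useful ones, so almost all padding disappears.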

The number of sequences involved in MSA (typically several hundred) is usually much smaller than in database scanning (typically several hundred thousand). Hence, the dimension of the subject sequence set for distance matrix computation is usually smaller than the maximum dimension of a texture buffer. Therefore, this application requires other partitioning schemes. Depending on the length of the longest sequence in a given subject sequence set (called maxlen), we are using one of two available schemes: the multiquery method or the multibatch method.

If maxlen is relatively small, the multiquery method is chosen. In this method, we put several subject sequence sets into a batch. A different query sequence will then be aligned to each set within a batch in parallel (see Fig. 16c). However, if maxlen is relatively large, the multiquery method would introduce a lot of superfluous computation (similarly to Fig. 15a). Hence, in this case, the multibatch method will be used. This method partitions the given subject sequence set into several batches. The query sequence will then be aligned to all subject sequences within a batch in parallel. Several rendering passes are required to align the query sequence to each batch (see Fig. 16b).

We are using the parameter threshold in order to decide between these two methods. If maxlen > threshold for the remaining subject sequence set, the multibatch method will be used. Otherwise, the multiquery method is chosen. The number of sequences within each batch is determined by another parameter called step. Fig. 16 illustrates our partitioning method.

The performance of our application is affected by the values of threshold and step. Decreasing the values of threshold and step reduces the amount of superfluous computation. However, it also reduces the throughput (since fewer cells are computed per rendering pass). Hence, the choice of values for threshold and step is a trade-off between computational efficiency (that is, the percentage of the relevant cells) and throughput (that is, the total cells computed per second).
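One way to read the decision rule is sketched below in Python. The sketch assumes that the subject sequences are processed from longest to shortest, that step is used directly as the number of sequences per multibatch batch, and that everything left over once maxlen no longer exceeds threshold is handled by the multiquery method. These choices, as well as the function name `partition`, are assumptions for illustration; the text does not fix them.

```python
def partition(lengths, threshold, step):
    """Split subject-sequence lengths into multibatch batches and the rest.

    Returns (batches, rest): `batches` is a list of length lists that would be
    aligned with the multibatch method, `rest` is left to the multiquery method.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    remaining = sorted(lengths, reverse=True)  # longest sequence first
    batches = []
    # maxlen of the remaining set is remaining[0] because of the sort order.
    while remaining and remaining[0] > threshold:
        batches.append(remaining[:step])
        remaining = remaining[step:]
    return batches, remaining


batches, rest = partition([3313, 60, 900, 450, 1200, 300], threshold=800, step=2)
print(batches)  # batches for the multibatch method
print(rest)     # sequences left for the multiquery method
```

A smaller step yields batches whose members have more similar lengths, which reduces padding, but each rendering pass then covers fewer cells. This is the trade-off between efficiency and throughput described above.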

Fig. 15. (a) Buffers filled with a randomly distributed subject sequence set. (b) Buffers filled with an ordered sequence set. A lot of superfluous computations will be omitted using an ordered subject set.

Fig. 16. Two parameters, threshold and step, are introduced in our implementation. (a) A, B, and C are sequences longer than the threshold, whereas D, E, and F are shorter than the threshold. (b) The multibatch alignment method is used for sequences A, B, and C to eliminate the superfluous computations. (c) The multiquery alignment method is used for sequences D, E, and F to make full use of the parallelism of GPUs.

6 PERFORMANCE EVALUATION

We have implemented the proposed algorithms using C++ and the high-level GPU programming language GLSL and evaluated them on the following graphics cards:

• Nvidia GeForce 6800 GTO. This has a 414 MHz engine clock speed, a 1.10 GHz memory clock speed, five vertex processors, 16 fragment processors, and 256 Mbytes of memory.

• Nvidia GeForce 7900 GTX. This has a 717 MHz engine clock speed, a 1.79 GHz memory clock speed, eight vertex processors, 24 fragment processors, and 512 Mbytes of memory.

Tests have been conducted with these cards installed in a PC with an Intel Pentium 4 3.0 GHz CPU and 1 Gbyte of RAM running Windows XP.

6.1 Experimental Results for Swiss-Prot Database Scanning

A performance measure commonly used in computational biology is cell updates per second (CUPS). One cell update comprises the complete computation of one entry in the similarity matrix, including all comparisons, additions, and maxima operations. We have scanned the Swiss-Prot protein databank (release 46.3, which contains 176,469 sequences with an average length of 361) for query sequences of various lengths using our implementation on a GeForce 6800 GTO and a GeForce 7900 GTX. Both cards allow handling sequences up to a length of 4,096. This restriction is imposed by the maximum texture buffer size of these graphics cards. However, this limitation is not severe since 99.8 percent of the sequences in the Swiss-Prot database are of length < 4,096. Furthermore, it is reasonable to expect that the allowed texture buffer sizes will increase in next-generation graphics hardware.

We have also compared the performance of our GPU implementation to a widely used CPU program for database scanning: FASTA [40]. FASTA stands for FAST-All, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison between a query sequence and a large database of known sequences. OSEARCH and SSEARCH [41] are two Smith-Waterman implementations that are part of the FASTA program. OSEARCH is a straightforward Smith-Waterman implementation. SSEARCH [41] is a highly optimized heuristic implementation of the Smith-Waterman algorithm. However, OSEARCH is more sensitive and accurate. We have run OSEARCH34 and SSEARCH34 on a Pentium 4 3.0 GHz CPU running Red Hat Linux 7.3.

Table 1 reports the runtimes of our GPU implementation, OSEARCH, and SSEARCH for different query sequence lengths. Fig. 17 compares the corresponding MCUPS performance values. As can be seen, our GPU implementation achieves speedups of up to 16 compared to OSEARCH and 10 compared to SSEARCH while producing the same accuracy as OSEARCH, which has a higher quality than SSEARCH.
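The relation between a measured runtime and an MCUPS value can be written down in a few lines. The Python sketch below estimates the number of database residues from the sequence count and the average length quoted above, which is only approximate because the average is rounded. The query length and the runtime in the example are hypothetical and are not entries of Table 1; the function names are likewise made up for this illustration.

```python
def mcups(query_len, db_residues, seconds):
    """Million cell updates per second for one database scan."""
    cells = query_len * db_residues  # one DP cell per residue pair
    return cells / seconds / 1e6


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of two runtimes for the same scan."""
    return baseline_seconds / accelerated_seconds


# Approximate size of Swiss-Prot release 46.3 from the figures in the text.
SWISS_PROT_RESIDUES = 176469 * 361

# Hypothetical scan: a query of length 500 that takes 100 seconds.
print(round(mcups(500, SWISS_PROT_RESIDUES, 100.0), 1))
```

Since the number of cells is fixed for a given query and database, the ratio of two MCUPS values equals the inverse ratio of the corresponding runtimes, so speedups can be read from either Table 1 or Fig. 17.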

6.2 Experimental Results for MSA

A set of performance evaluation tests has been conducted using different numbers of globin protein sequences to evaluate the processing time of the GPU implementation versus that of the original ClustalW pairwise alignment stage on the PC. The ClustalW application is benchmarked on an Intel Pentium 4 3 GHz processor with 1 Gbyte of RAM. We have used the ClustalW code from the work of Li ([42], available online at http://www.bii.a-star.edu.sg/software/clustalw-mpi/) for our evaluation. Profiling of the three stages of ClustalW for different numbers of globin sequences (see Table 2) reveals that more than 93 percent of the overall runtime is spent on the first stage (distance matrix computation). This justifies our choice to only accelerate this stage.

In Section 5, we have analyzed the influence of threshold and step on the performance of our implementations. Fig. 18 shows the performance of four sequence sets using different combinations of threshold and step. The measurements show that, for all four test sets with different numbers of sequences, the best performances (minimum runtime) are obtained when threshold is set from 500 to 900 and step is set from 50 to 160. As shown in Fig. 18, there exists an optimized trade-off between threshold and step for each sequence set. The best performance can be observed at the corresponding coordinate point.

TABLE 1. Runtimes (in Seconds) for Scanning the Swiss-Prot 46.3 with OSEARCH and SSEARCH Running on a Pentium 4 3 GHz CPU and Our GPU Implementations on a GeForce 6800 GTO and a GeForce 7900 GTX for Varying Query Lengths

The query sequences have the accession numbers P58229, P39985, Q96HP0, and P36022. The corresponding speedups compared to OSEARCH (speedup1) and SSEARCH (speedup2) are also reported. We have used the OSEARCH and SSEARCH implementations from ftp://ftp.virginia.edu/pub/fasta/ for our evaluation.

Fig. 17. Performance comparison (in MCUPS) for scanning the Swiss-Prot database (release 46.3, March 2005). The query sequences have accession numbers O29181, P03630, P53765, Q8ZGB4, P58229, P39985, Q96HP0, and P36022.

TABLE 2. Comparison of Runtimes (in Seconds) and Speedups of ClustalW Running on a Single Pentium 4 3 GHz CPU to Our GPU-ClustalW Version Running on a Pentium 4 3 GHz CPU with an Nvidia GeForce 7900 GTX 512 for 400, 600, 800, and 1,000 Input Globin Protein Sequences

Fig. 18. Performance comparison using different combinations of threshold and step. (a) For the sequence set with 400 sequences, the average length is 408. (b) For the sequence set with 600 sequences, the average length is 462. (c) For the sequence set with 800 sequences, the average length is 454. (d) For the sequence set with 1,000 sequences, the average length is 446. For all sequence sets, the length of the longest sequence is 3,313, and the shortest one is 60.

According to the measurements in Fig. 18, we obtain the best performance of our implementation for each sequence set; the results are shown in Table 2. As can be seen, our GPU implementation achieves speedups of almost 12 compared to the first stage of ClustalW and 7 compared to the overall runtime.
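The two speedup figures are consistent with each other, as a short check with Amdahl's law shows. The Python sketch below assumes that exactly 93 percent of the original runtime falls into the accelerated first stage (the profiling reports "more than 93 percent") and that this stage runs 12 times faster, while the other two stages are unchanged. The function name is hypothetical.

```python
def overall_speedup(accelerated_fraction, stage_speedup):
    """Amdahl's law: speedup of the whole program when only a fraction
    of its runtime is accelerated by the given factor."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / stage_speedup)


# 93 percent of the runtime accelerated by a factor of 12.
print(round(overall_speedup(0.93, 12.0), 2))
```

Under these assumptions the overall speedup is about 6.8, in line with the reported factor of 7. The remaining 7 percent of the runtime, which stays on the CPU, limits the overall speedup to roughly 14 regardless of how fast the first stage becomes.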

7 CONCLUSION

In this paper, we have introduced two streaming algorithms for DP-based biological sequence alignment that can be efficiently implemented on modern graphics hardware. Our implementations achieve speedups of over an order of magnitude on easily programmable mass-produced hardware available for less than US $500 at any local computer outlet. The very rapid growth of biological sequence databases demands even more powerful high-performance solutions in the near future. Hence, our results are especially encouraging since GPU performance grows faster than Moore’s law as it applies to CPUs.

Our future work will include the development of a pattern-based system that allows for easy implementation of DP-based streaming algorithms [43]. Due to the advances in modern VLSI technology (cheap arithmetic and expensive bandwidth), several parallel streaming architectures have been introduced recently [44]. It would be interesting to evaluate the performance of the streaming algorithms introduced in this paper on them and other future architectures.

ACKNOWLEDGMENTS

The work was supported by the A*Star BMRC Research Grant 04/1/22/19/375.

REFERENCES

[1] D. Bader, “Computational Biology and High-Performance Computing,” Comm. ACM, vol. 47, no. 11, pp. 34-41, 2004.

[2] S. Rajko and S. Aluru, “Space and Time Optimal Parallel Sequence Alignments,” IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 11, pp. 1070-1081, Nov. 2004.

[3] B. Schmidt, H. Schroder, and M. Schimmler, “Massively Parallel Solutions for Molecular Sequence Analysis,” Proc. First IEEE Int’l Workshop High Performance Computational Biology (HiCOMB ’02), 2002.

[4] T. Rognes, “ParAlign: A Parallel Sequence Alignment Algorithm for Rapid and Sensitive Database Searches,” Nucleic Acids Research, vol. 29, no. 7, pp. 1647-1652, 2001.

[5] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, “A Survey of General-Purpose Computation on Graphics Hardware,” Proc. Eurographics, pp. 21-51, 2005.

[6] J. Kessenich, D. Baldwin, and R. Rost, “The OpenGL Shading Language, Document Revision 59,” technical report, http://www.opengl.org/documentation/oglsl.html, 2005.

[7] Microsoft, “High-Level Shader Language,” technical report, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/dx9_graphics_reference_hlsl.asp, 2006.

[8] W. Mark, R. Glanville, K. Akeley, and M. Kilgard, “Cg: A System for Programming Graphics Hardware in a C-Like Language,” ACM Trans. Graphics, vol. 22, pp. 896-907, 2003.

[9] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Mike, and H. Pat, “Brook for GPUs: Stream Computing on Graphics Hardware,” Proc. ACM SIGGRAPH, 2004.

[10] M. Mccool, Z. Qin, and T. Popa, “Shader Metaprogramming,” Proc. ACM SIGGRAPH/Eurographics Graphics Hardware Workshop, 2002.

[11] J. Kruger and R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” ACM Trans. Graphics, vol. 22, pp. 908-916, 2003.

[12] P. Agarwal, S. Krishnan, N. Mustafa, and S. Venkatasubramanian, “Streaming Geometric Optimization Using Graphics Hardware,” Proc. 11th European Symp. Algorithms, 2003.

[13] N. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha, “Fast Computation of Database Operations Using Graphics Processors,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’04), pp. 215-226, 2004.

[14] F. Xu and K. Mueller, “Ultra-Fast 3D Filtered Backprojection on Commodity Graphics Hardware,” Proc. IEEE Int’l Symp. Biomedical Imaging (ISBI ’04), 2004.

[15] D. Horn, M. Houston, and P. Hanrahan, “ClawHMMer: A Streaming HMMer-Search Implementation,” Proc. ACM/IEEE Conf. Supercomputing (SC ’05), 2005.

[16] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-Wittig, “Bio-Sequence Database Scanning on a GPU,” Proc. 20th IEEE Int’l Parallel and Distributed Processing Symp. (High Performance Computational Biology (HiCOMB) Workshop), 2006.

[17] W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig, “GPU-ClustalW: Using Graphics Hardware to Accelerate Multiple Sequence Alignment,” Proc. 13th Ann. IEEE Int’l Conf. High Performance Computing (HiPC ’06), pp. 363-374, 2006.

[18] D. Manocha, “General-Purpose Computations Using Graphics Processors,” Computer, vol. 38, no. 8, pp. 85-88, Aug. 2005.

[19] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, pp. 195-197, 1981.

[20] D. Feng and R. Doolittle, “Progressive Sequence Alignment as a Prerequisite to a Correct Phylogenetic Trees,” J. Molecular Evolution, vol. 25, pp. 351-360, 1987.

[21] J. Thompson, D. Higgins, and T. Gibson, “ClustalW: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice,” Nucleic Acids Research, vol. 22, pp. 4673-4680, 1994.

[22] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.

[23] D. Lopresti, “P-NAC: A Systolic Array for Comparing Nucleic Acid Sequences,” Computer, vol. 20, no. 7, pp. 98-99, July 1987.

[24] R. Singh, “BioSCAN: A Network Sharable Computational Resource for Searching Biosequence Databases,” Computer Applications in the Biosciences, vol. 12, no. 3, pp. 191-196, 1996.

[25] E. Chow, T. Hunkapiller, J. Peterson, and M. Waterman, “Biological Information Signal Processor,” Proc. Int’l Conf. Application-Specific Array Processors (ASAP ’91), pp. 144-160, 1991.

[26] A. Di Blas, “The UCSC Kestrel Parallel Processor,” IEEE Trans. Parallel and Distributed Systems, vol. 16, no. 1, pp. 80-92, Jan. 2005.

[27] T. Oliver, B. Schmidt, D. Nathan, R. Clemens, and D. Maskell, “Using Reconfigurable Hardware to Accelerate Multiple Sequence Alignment with ClustalW,” Bioinformatics, vol. 21, pp. 3431-3432, 2005.

[28] T. Purcell, I. Buck, W. Mark, and P. Hanrahan, “Ray Tracing on Programmable Graphics Hardware,” ACM Trans. Graphics, pp. 703-712, 2002.

[29] J. England, “A System for Interactive Modeling of Physical Curved Surface Objects,” Proc. ACM SIGGRAPH ’78, pp. 336-340, 1978.

[30] M. Potmesil and E. Hoffert, “The Pixel Machine: A Parallel Image Computer,” Proc. ACM SIGGRAPH, pp. 69-78, 1989.

[31] J. Rhoades, G. Turk, A. Bell, A. State, U. Neumann, and A. Varshney, “Real-Time Procedural Textures,” Proc. Symp. Interactive 3D Graphics, pp. 95-100, 1992.

[32] J. Lengyel, M. Reichert, B. Donald, and D. Greenberg, “Real-Time Robot Motion Planning Using Rasterizing Computer Graphics Hardware,” Proc. ACM SIGGRAPH ’90, pp. 327-335, 1990.

[33] K. Proudfoot, W. Mark, S. Tzvetkov, and P. Hanrahan, “A Real-Time Procedural Shading System for Programmable Graphics Hardware,” Proc. 28th Ann. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’01), pp. 159-170, 2001.

[34] N. Govindaraju, S. Redon, M. Lin, and D. Manocha, “Cullide: Interactive Collision Detection between Complex Models in Large Environments Using Graphics Hardware,” Proc. ACM SIGGRAPH/Eurographics Graphics Hardware Workshop, pp. 25-32, 2003.

[35] K. Hillesland, S. Molinov, and R. Grzeszczuk, “Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware,” Proc. ACM SIGGRAPH ’03, pp. 925-934, 2003.

[36] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys, “A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware,” Proc. ACM SIGGRAPH/Eurographics Graphics Hardware Workshop, 2003.

[37] M. Harris, G. Coombe, T. Scheuermann, and A. Lastra, “Physically-Based Visual Simulation on Graphics Hardware,” Proc. ACM SIGGRAPH/Eurographics Graphics Hardware Workshop, pp. 109-118, 2002.

[38] M. Harris, W. Baxter, T. Scheuermann, and A. Lastra, “Simulation of Cloud Dynamics on Graphics Hardware,” Proc. ACM SIGGRAPH/Eurographics Graphics Hardware Workshop, pp. 92-101, 2003.

[39] Y. Liu, W. Huang, J. Johnson, and S. Vaidya, “GPU Accelerated Smith-Waterman,” Proc. Int’l Conf. Computational Science (ICCS ’06), pp. 188-195, 2006.

[40] W. Pearson, “Rapid and Sensitive Sequence Comparison with FASTP and FASTA,” Methods in Enzymology, vol. 183, pp. 63-98, 1990.

[41] W. Pearson, “Searching Protein Sequence Libraries: Comparison of the Sensitivity and Selectivity of the Smith-Waterman and FASTA Algorithms,” Genomics, vol. 11, pp. 635-650.

[42] K. Li, “ClustalW-MPI: ClustalW Analysis Using Parallel and Distributed Computing,” Bioinformatics, vol. 19, pp. 1585-1586, 2003.

[43] W. Liu and B. Schmidt, “Parallel Pattern-Based Systems for High Performance Computational Biology: A Case Study,” IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 8, pp. 750-763, Aug. 2006.

[44] W. Dally, P. Hanrahan, M. Erez, T. Knight, F. Labonte, J.-H. Ahn, N. Jayasena, U. Kapasi, A. Das, J. Gummaraju, and I. Buck, “Merrimac: Supercomputing with Streams,” Proc. ACM/IEEE Conf. Supercomputing (SC ’03), Nov. 2003.

Weiguo Liu received the bachelor’s and master’s degrees from Xi’an JiaoTong University, China, in 1998 and 2002 and the PhD degree from Nanyang Technological University (NTU), Singapore, in 2006. He is currently a research fellow at the Center for Advanced Media Technology, Nanyang Technological University (NTU). His research interests include computational biology, parallel algorithms and architectures, high-performance computing, and data mining.

Bertil Schmidt received the master’s degree in computer science from Kiel University, Germany, in 1995 and the PhD degree from the Loughborough University, United Kingdom, in 1999. He has worked with the company ISATEC in the area of embedded parallel systems. From 1999 to 2006, he worked as an assistant professor at Nanyang Technological University (NTU), Singapore. He currently is an associate professor in the Division of Engineering, Science, and Technology at the University of New South Wales Asia (UNSW Asia). His research interests include parallel algorithms and architectures, high-performance computing, and bioinformatics. He is a member of the IEEE.

Gerrit Voss received the diploma degree in computer science from Darmstadt Technical University, Germany, in 1997. He has been a senior staff member of the Centre for Advanced Media Technology, Nanyang Technological University (NTU), Singapore, since 2001. Prior to this, he worked in the Department of “Visualization and Virtual Reality” at Fraunhofer Institute for Computer Graphics (Fraunhofer-IGD), Darmstadt, Germany. He is a member of the core development team of OpenSG (www.opensg.org). He is an author of several publications in the area of real-time rendering, virtual reality, augmented reality, and general-purpose computation on graphics processing units (GPGPU).

Wolfgang Muller-Wittig received the university degree (Dipl.-Inform.) and the doctoral degree (Dr.-Ing.) in computer science from the Darmstadt University of Technology, Germany. He has been the director of the Centre for Advanced Media Technology (CAMTech) since January 2001. CAMTech is a joint venture between the Fraunhofer Institute for Computer Graphics (Fraunhofer-IGD), Darmstadt, Germany, and Nanyang Technological University (NTU), Singapore. Furthermore, he is an associate professor at the NTU School of Computer Engineering. Prior to joining CAMTech, he worked first as a scientist in the “Visualization and Virtual Reality” Department, Fraunhofer-IGD, and, then, as the head of the “Visualization Group.” His current research foci include highly interactive three-dimensional computer graphics for manufacturing, engineering, edutainment, cultural heritage, and biomedical sciences. He is a member of the IEEE.


