# Algorithms for String Comparison on GPUs



Kenneth Skovhus Andersen, s062390
Lasse Bach Nielsen, s062377


Technical University of Denmark

Informatics and Mathematical Modelling

Supervisors: Inge Li Gørtz & Philip Bille

August, 2012

DTU Informatics
Department of Informatics and Mathematical Modeling
Technical University of Denmark
Asmussens Allé, Building 305, DK-2800 Lyngby, Denmark
Phone +45 4525 3351, Fax +45 4588 2673
reception@imm.dtu.dk
www.imm.dtu.dk

ABSTRACT

We consider parallelization of string comparison algorithms, including sequence alignment, edit distance and longest common subsequence. These problems are all solvable using essentially the same dynamic programming scheme over a two-dimensional matrix, where an entry locally depends on neighboring entries. We generalize this set of problems as local dependency dynamic programming (LDDP).

We present a novel approach for solving any large pairwise LDDP problem using graphics processing units (GPUs). Our results include a new superior layout for utilizing the coarse-grained parallelism of the many-core GPU. The layout performs up to 18% better than the most widely used layout. To analyze layouts, we have devised theoretical descriptions, which accurately predict the relative speedup between different layouts on the coarse-grained parallel level of GPUs.

To evaluate the potential of solving LDDP problems on GPU hardware, we implement an algorithm for solving longest common subsequence. In our experiments we compare large biological sequences, each consisting of two million symbols, and show a 40X speedup compared to a state-of-the-art sequential CPU solution by Driga et al. Our results can be generalized on several levels of parallel computation using multiple GPUs.


RESUMÉ

We consider parallelization of string comparison algorithms, including sequence alignment, edit distance and longest common subsequence. These problems can all be solved using a two-dimensional dynamic programming matrix with local dependencies. We generalize these problems as local dependency dynamic programming (LDDP).

We present a new approach for solving large pairwise LDDP problems using graphics processing units (GPUs). Furthermore, we have developed a new layout for utilizing the multiprocessors of the GPU. Our new layout improves running time by up to 18% compared to previous layouts. To analyze the properties of a layout, we have developed theoretical descriptions that accurately predict the relative running-time improvement between different layouts.

To assess the potential of the GPU for solving LDDP problems, we have implemented an algorithm that solves longest common subsequence. In our experiments we compare long biological sequences, each consisting of two million symbols. We show a speedup of more than 40X compared to a state-of-the-art sequential CPU solution by Driga et al. Our results can be generalized to several levels of parallelism by using multiple GPUs.


PREFACE

This master's thesis has been prepared at DTU Informatics at the Technical University of Denmark from February to August 2012 under the supervision of associate professors Inge Li Gørtz and Philip Bille. It has an assigned workload of 30 ECTS credits for each of the two authors.

The thesis deals with the subject of local dependency dynamic programming algorithms for solving large-scale string comparison problems on modern graphics processing units (GPUs). The focus is to investigate, combine and further develop existing state-of-the-art algorithms.

Acknowledgments

We would like to thank our supervisors for their guidance during the project. Special thanks go to PhD student Morten Stöckel at the IT University of Copenhagen for providing the source code for sequential string comparison algorithms [1] and to PhD student Hjalte Wedel Vildhøj at DTU Informatics for his valuable feedback.

Lasse Bach Nielsen Kenneth Skovhus Andersen

August, 2012


CONTENTS

Abstract

Resumé

Preface

1 Introduction
  1.1 This Report
  1.2 Previous Work
  1.3 Our Results

2 Local Dependency Dynamic Programming
  2.1 Definitions
  2.2 Previous Results
  2.3 Our Approach Based on Previous Results

3 Graphics Processing Units
  3.1 GPU Architecture
  3.2 Memory
  3.3 Best Practices

4 Parallel Layouts for LDDP
  4.1 Diagonal Wavefront
  4.2 Column-cyclic Wavefront
  4.3 Diagonal-cyclic Wavefront
  4.4 Applying Layouts to the GPU Architecture
  4.5 Summary and Discussion

5 Implementing LDDP on GPUs
  5.1 Grid-level
  5.2 Thread-level
  5.3 Space Constraints

6 Experimental Results for Grid-level
  6.1 Setup
  6.2 Results

7 Experimental Results for Thread-level
  7.1 Setup
  7.2 Results for Forward Pass Kernels
  7.3 Results for Backward Pass Kernel
  7.4 Part Conclusion

8 Performance Evaluation
  8.1 The Potential of Solving LDDP Problems on GPUs
  8.2 Comparing to Similar GPU Solutions

9 Conclusion
  9.1 Future Work

Bibliography

Appendices

A NVIDIA GPU Data Sheets
  A.1 NVIDIA Tesla C2070
  A.2 NVIDIA GeForce GTX 590

B Kernel Source Code
  B.1 Forward pass kernels

1 INTRODUCTION

We revisit the classic algorithmic problem of comparing strings, including solving sequence alignment, edit distance and finding the longest common subsequence. In many textual information retrieval systems, the exact comparison of large-scale strings is an important but very time-consuming task. As an example, the exact alignment of huge biological sequences, such as genes and genomes, has previously been infeasible due to computing and memory requirements. Consequently, much research effort has been invested in faster heuristic algorithms¹ for sequence alignment. Although these methods are faster than exact methods, they come at the cost of sensitivity. However, the rise of new parallel computing platforms, such as graphics processing units, can change this scenario.

Graphics processing units (GPUs) are designed for graphics applications with a large degree of data parallelism; they use hundreds of cores and are built to solve multiple independent parallel tasks. Previous results for accelerating sequence alignment using GPUs show a significant speedup, but are currently focused on aligning many independent short sequences, a problem where the GPU architecture excels. Our focus is on the need to solve large-scale exact pairwise string comparison of biological sequences containing millions of symbols.

Our work is motivated by the increasing power of GPUs, and the challenge of making exact comparison of large strings feasible. We consider parallelization of a general set of pairwise string comparison algorithms, all solvable using essentially the same dynamic programming scheme over a two-dimensional matrix. Taking two input strings X and Y of the same length n, these problems can be solved by computing all entries in an n × n matrix using a specific cost function. Computation of an entry in the matrix depends on the neighboring entries. We generalize all these problems as local dependency dynamic programming (LDDP). In general, LDDP problems are not trivially solved in parallel, as the local dependencies give a varying degree of parallelism across the entire problem space.

The parallelism of a GPU is exposed as a coarse-grained grid of blocks, where each block consists of finer-grained threads. We call these levels of parallelism the grid-level and the thread-level. We focus on layouts as a means to describe how LDDP problems can be mapped to the different levels on the GPU.

¹ One of the first heuristic algorithms for sequence alignment was FASTA, presented by Pearson and Lipman in 1988 [2].


1.1 This Report

We start by presenting a short description of our work. The following chapter gives a theoretical introduction to LDDP problems, including a survey of previous sequential and parallel solutions. Based on this, we select a set of state-of-the-art algorithms as a basis for our new GPU solution. We then introduce the GPU architecture and the programming model. This is followed by a chapter describing layouts for distributing LDDP problems to parallel compute units. The chapter also introduces our main result: a new layout for improving the performance of solving LDDP problems on GPUs, targeted at the GPU grid-level.

We then describe our implementation on the NVIDIA GPU architecture, and the design considerations that have been made. Finally, the experimental results examine the practical performance of our LDDP implementation on GPUs.

1.2 Previous Work

Currently there are several GPU solutions for solving LDDP problems. Many of these implement the Smith-Waterman algorithm for the local alignment problem [3, 4, 5, 6, 7, 8, 9, 10]. The solutions achieve very high utilization of GPU multiprocessors by comparing a large number of short sequences, thereby solving multiple independent LDDP problems.

Existing GPU solutions for longest common subsequence [11, 12] are able to compare rather large sequences by decomposing the LDDP problem into smaller subproblems, called tiles, and processing these in an anti-diagonal manner on the GPU multiprocessors. This widely used layout for mapping tiles onto compute units is referred to as Diagonal Wavefront. It assigns all tiles in an anti-diagonal for computation and continues to the next anti-diagonal once all tiles have been computed.

Galper and Brutlag [13] presented another layout, Column-cyclic Wavefront,² which improves resource utilization compared to the widely used Diagonal Wavefront. Krusche and Tiskin [14] used a similar layout on the Bulk-Synchronous Parallelism Model [15] for solving large-scale LDDP, mapping each column in the tiled LDDP problem to a processor.

1.3 Our Results

We present an algorithm for solving large pairwise LDDP problems using GPUs. Our main result is a new layout, Diagonal-cyclic Wavefront, for distributing a decomposed LDDP problem onto the GPU. Let an n × n matrix be decomposed into k × k equally sized tiles, each computable on a GPU multiprocessor. The mapping of tiles onto multiprocessors takes place on the grid-level, and the computation of entries inside a tile is done on the thread-level.

² Originally called Row Wavefront Approach (RWF).


To compare our new layout Diagonal-cyclic Wavefront theoretically with the widely used Diagonal Wavefront and Column-cyclic Wavefront, we examine the utilization of each. The utilization of a layout is the ratio of the number of tiles computed while all multiprocessors are fully utilized to the total number of tiles. Theoretically, we show that our new layout Diagonal-cyclic Wavefront, in general, achieves the best utilization, as depicted in Figure 1.1.
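To make the utilization definition concrete, the following Python sketch evaluates it for the Diagonal Wavefront layout only. It rests on an assumption that is not spelled out in this chapter: an anti-diagonal holding d tiles is processed in rounds of at most p tiles, so that only the floor(d/p) full rounds count as fully utilized. The function name and this counting rule are illustrative and may differ from the formal description given later in the thesis.

```python
def diagonal_wavefront_utilization(k: int, p: int) -> float:
    """Fraction of the k*k tiles that are computed in rounds where all
    p compute units are busy, assuming one anti-diagonal at a time."""
    fully_utilized_tiles = 0
    for d in range(2 * k - 1):
        # Number of tiles on anti-diagonal d of a k x k tile grid.
        tiles = min(d, k - 1) - max(0, d - k + 1) + 1
        # Only complete rounds of p tiles keep every compute unit busy.
        fully_utilized_tiles += (tiles // p) * p
    return fully_utilized_tiles / (k * k)


if __name__ == "__main__":
    for k in (28, 56, 112, 280):
        print(k, round(100 * diagonal_wavefront_utilization(k, p=14), 1))
```

Under this assumption the utilization grows with k, since the short anti-diagonals near the two corners make up a shrinking share of all tiles.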

Our experiments confirm the theoretical analysis; our new layout is superior to the widely used Diagonal Wavefront, and achieves a performance speedup of up to 18% for any LDDP problem. Furthermore, for various input sizes, our new layout generally outperforms Column-cyclic Wavefront. We show that our theoretical descriptions of the layouts give very accurate predictions of how the architecture behaves. In general, the new layout Diagonal-cyclic Wavefront should always be used for distributing large LDDP problems onto the GPU grid-level.

To explore the potential of solving LDDP problems on GPUs, we implement a set of kernels for solving longest common subsequence. A kernel defines the computation of individual tiles. For simplicity, we focus on Diagonal Wavefront for distributing computable entries to threads. To determine the best kernel parameters, e.g., tile size and number of threads, we conduct automatic performance tuning [16]. We present a scaling technique for space reduction of cost values inside a tile, giving better kernel performance. Scaling can be applied to a subset of LDDP problems. In addition, we have investigated a rather undocumented feature of the CUDA compiler, the volatile keyword. Depending on its placement, we have observed a performance increase of up to 10%.

To evaluate whether the problem is viable to solve on a GPU, we compare our results to state-of-the-art sequential CPU solutions. The experiments show an average 40X performance advantage over the sequential CPU solution by Driga et al. [17] for comparing strings larger than 2¹⁹ using the NVIDIA Tesla C2070 GPU and an Intel i7 2.66 GHz CPU. To the best of our knowledge, our LDDP implementation supports the largest input size n for GPUs in the literature, up to 2²¹.

Figure 1.1: Utilization of the three layouts (Diagonal-cyclic Wavefront, Column-cyclic Wavefront and Diagonal Wavefront) for k × k tiles using p = 14 compute units. Horizontal axis: number of tiles k, from 28 to 280. Vertical axis: utilization U, from 50% to 100%.


2 LOCAL DEPENDENCY DYNAMIC PROGRAMMING

A large group of string comparison problems can be solved using essentially the same scheme over a two-dimensional dynamic programming matrix (DPM), where an entry (i, j) in the matrix depends on at most three neighboring entries. These include widely used string problems in bioinformatics such as edit distance, sequence alignment and longest common subsequence. We refer to all these problems as local dependency dynamic programming (LDDP) problems.

2.1 Definitions

Let X and Y be the input strings with characters from a finite alphabet Σ. For simplicity we assume equal string length, i.e., |X| = |Y| = n. The character at position i in X is denoted X[i].

2.1.1 Local Dependency Dynamic Programming

Given the input strings X and Y, an LDDP problem can be solved by filling an (n+1) × (n+1) DPM, denoted c. The entry c[i, j] depends on at most three neighboring entries, c[i-1, j-1], c[i, j-1] and c[i-1, j], and on the characters X[i] and Y[j]. We let parent(i, j) denote the neighboring entries that determine c[i, j]. In general, the recurrence for solving an LDDP problem is:

$$
c[i, j] =
\begin{cases}
b(i, j) & \text{if } i = 0 \lor j = 0, \\
f(X[i], Y[j], \mathrm{parent}(i, j)) & \text{if } i, j > 0
\end{cases}
\tag{2.1}
$$

The function b initializes the north and west border of the DPM c in time O(1) for each entry. The function f(X[i], Y[j], parent(i, j)) computes the solution to the subproblem c[i, j] in time O(1), as it depends only on three neighboring entries and the input characters X[i] and Y[j]. The forward pass computes the length of the optimal path by filling the DPM, and the backward pass finds the optimal path by backtracking through the DPM.
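As an illustration of recurrence (2.1), the sequential forward pass can be sketched in Python as below. The names `solve_lddp`, `border` and `cost` are hypothetical stand-ins for the matrix fill and the functions b and f; the edit-distance and LCS plug-ins are standard textbook cost functions used here only as examples, not code from the thesis. The sketch indexes strings from 0, so X[i] in the text corresponds to `x[i - 1]`.

```python
from typing import Callable, List, Sequence, Tuple

Parents = Tuple[int, int, int]  # (c[i-1][j-1], c[i][j-1], c[i-1][j])


def solve_lddp(x: Sequence, y: Sequence,
               border: Callable[[int, int], int],
               cost: Callable[[object, object, Parents], int]) -> List[List[int]]:
    """Fill the (len(x)+1) x (len(y)+1) DPM row by row (forward pass)."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 or j == 0:
                c[i][j] = border(i, j)               # b(i, j)
            else:
                parents = (c[i - 1][j - 1], c[i][j - 1], c[i - 1][j])
                c[i][j] = cost(x[i - 1], y[j - 1], parents)  # f(...)
    return c


# Example plug-in 1: unit-cost edit distance.
def edit_border(i: int, j: int) -> int:
    return i + j  # one of i, j is zero on the border


def edit_cost(a, b, parents: Parents) -> int:
    nw, w, n = parents
    return min(nw + (a != b), w + 1, n + 1)


# Example plug-in 2: longest common subsequence length.
def lcs_border(i: int, j: int) -> int:
    return 0


def lcs_cost(a, b, parents: Parents) -> int:
    nw, w, n = parents
    return nw + 1 if a == b else max(w, n)


if __name__ == "__main__":
    print(solve_lddp("flaw", "lawn", edit_border, edit_cost)[-1][-1])
    print(solve_lddp("flaw", "lawn", lcs_border, lcs_cost)[-1][-1])
```

Only the two callbacks change between problems, which is the sense in which the same scheme covers the whole LDDP family.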

2.1.2 Longest Common Subsequence

For simplicity, we use the longest common subsequence (LCS) problem as a case study, although all techniques and results presented generalize to any LDDP problem. We define the problem as follows:

Let X[i, j] denote the substring of X from position i to j. A subsequence of X is any string Z with zero or more elements left out of X. We say that Z is a common subsequence of X and Y if Z is a subsequence of both X and Y. The longest common subsequence problem for two input strings X and Y is to find a maximum-length common subsequence of both X and Y.

Given two strings X and Y where |X| = |Y| = n, the standard dynamic programming solution to LCS fills an (n+1) × (n+1) dynamic programming matrix c using the following recurrence [18]:

$$
c[i, j] =
\begin{cases}
0 & \text{if } i = 0 \lor j = 0, \\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \land X[i] = Y[j], \\
\max(c[i, j-1],\, c[i-1, j]) & \text{if } i, j > 0 \land X[i] \neq Y[j]
\end{cases}
\tag{2.2}
$$

The length of the LCS between X[1, i] and Y[1, j] is c[i, j]; therefore the length of the LCS of X and Y is c[n, n]. To compute the forward pass, the algorithm uses O(n²) time and space.

The solution path, and thus the LCS, is deduced by backtracking from c[n, n] to some c[i′, j′] where i′ = 0 ∨ j′ = 0. For a given entry c[i, j], the backward pass determines in O(1) time which of the three values in parent(i, j) was used to compute c[i, j]. The complete LCS is reconstructed in O(n) time when all cost values in the DPM are available.
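The forward and backward pass for LCS can be sketched as a plain full-matrix implementation in Python. This is a sequential illustration of recurrence (2.2) and the backtracking step, not the GPU kernel of the thesis; the function name `lcs` is chosen for the example, and the sketch also accepts strings of different lengths although the text assumes |X| = |Y|.

```python
from typing import List, Tuple


def lcs(x: str, y: str) -> Tuple[int, str]:
    """Return (length, one longest common subsequence) of x and y."""
    n, m = len(x), len(y)

    # Forward pass: fill the DPM; row 0 and column 0 stay at 0.
    c: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])

    # Backward pass: walk from c[n][m] until a border is reached,
    # deciding in O(1) per step which parent produced the value.
    i, j = n, m
    reversed_lcs = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            reversed_lcs.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return c[n][m], "".join(reversed(reversed_lcs))


if __name__ == "__main__":
    print(lcs("ABCBDAB", "BDCABA"))
```

Each backtracking step decreases i, j or both, so the backward pass takes at most n + m steps once the matrix is available.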

2.2 Previous Results

We start by presenting an overview of previous results for solving LDDP problems in general. We divide the findings into sequential and parallel solutions.

2.2.1 Sequential Solutions

Wagner and Fischer 1974 [19] presented one of the first dynamic programming solutions to the Levenshtein distance problem using O(n²) time and space. We call this a full-matrix algorithm as it stores the complete DPM. Needleman-Wunsch 1970 [20] and Smith-Waterman 1981 [21] presented other examples of full-matrix algorithms for LDDP problems.

Hirschberg 1975 [22] improved space at the cost of increased time for the backward pass. The algorithm uses a divide-and-conquer approach, combining a standard and a reverse application of the linear-space cost-only variation to find a partitioning midpoint. Although the original solution was presented for LCS, Myers and Miller [23] generalized it to sequence alignment in 1988. The algorithm uses O(n²) time, O(n) space and O(n²) recomputations.

Driga et al. 2006 [17] presented their cache-aware Fast Linear-Space Alignment (FLSA). It divides the DPM into k² equally sized tiles, as shown in Figure 2.1. All tiles share boundaries of intersecting cost values. The time-space tradeoff parameter k is selected so that the problem space in a tile can be computed using a full-matrix algorithm. The forward pass fills the boundaries. The backward pass uses the boundaries to compute the optimal path by processing only tiles that intersect the optimal path. The algorithm implements a backward pass optimization which reduces the size of the tiles according to the entry point of the optimal path. FLSA uses O(n²) time, O(nk) space and O(n²/k) recomputations.

Chowdhury and Ramachandran 2006 [24] also tiled the DPM, but reduced the I/O bound by splitting the DPM into four tiles and then recursively computing each tile. Unlike Driga et al. [17], the algorithm is cache-oblivious [25]. The algorithm uses O(n²) time and O(n) space. As the backward pass intersects at most 3/4 of the tiles, it performs O(n²) recomputations.

Bille and Stöckel 2012 [1] combined the k² tiles from Driga et al. [17] with the recursive and cache-oblivious approach from Chowdhury and Ramachandran [24]. Experiments showed superior performance over Chowdhury and Ramachandran and comparable performance to Driga et al. The algorithm uses O(n²) time, O(nk) space and O(n²/k) recomputations.

All presented algorithms solve LDDP string comparison problems in general. For specific LDDP problems, specialized solutions exist that improve complexity or space bounds by restricting the problem in terms of alphabet size or cost function, or by exploiting properties of a specific LDDP problem; see, e.g., [26, 27, 28, 29] and the surveys [30, 31].

[Figure: a DPM over the input strings X and Y divided into 4 × 4 tiles of side length t = n/k. The tiles are numbered 1 to 7 by anti-diagonal, starting from the north-west corner. Legend: input character, stored value.]

Figure 2.1: Decomposition of LDDP problems using tiled linear-space reduction as presented by Driga et al. [17]. The DPM is divided into k² equally sized tiles sharing boundaries of intersecting cost values. The forward pass of a tile receives as input the boundaries on the north and west, and outputs the south and east boundaries. The backward pass uses the stored boundaries to compute the optimal path by processing only tiles that intersect the optimal path. In a parallel context, the numbers inside each tile refer to the order in which the tiles can be calculated during a forward pass.
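The boundary interface of a tile can be sketched sequentially in Python for the LCS cost function. This is an illustrative reading of the tiling in Figure 2.1, not the FLSA implementation: the names `forward_tile` and `tiled_lcs_length` are hypothetical, tiles are visited row by row rather than in parallel, and only the boundaries needed to obtain c[n, n] are kept (a backward pass would need all stored boundaries).

```python
from typing import List, Tuple


def forward_tile(xs: str, ys: str, north: List[int],
                 west: List[int]) -> Tuple[List[int], List[int]]:
    """Compute one tile from its north and west boundaries.

    north has len(ys)+1 values and west has len(xs)+1 values; both
    include the shared north-west corner, so north[0] == west[0].
    Returns the (south, east) boundaries in the same convention.
    """
    prev = list(north)
    east = [north[-1]]
    for i, a in enumerate(xs, start=1):
        curr = [west[i]]
        for j, b in enumerate(ys, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(curr[j - 1], prev[j]))
        east.append(curr[-1])
        prev = curr
    return prev, east


def tiled_lcs_length(x: str, y: str, k: int) -> int:
    """LCS length of two equal-length strings using k x k tiles."""
    n = len(x)
    if len(y) != n or k < 1 or n % k != 0:
        raise ValueError("need |x| == |y| and k dividing n")
    t = n // k
    # north_of[c] is the north boundary of the next tile in column c.
    north_of = [[0] * (t + 1) for _ in range(k)]
    west = [0] * (t + 1)
    for r in range(k):
        west = [0] * (t + 1)  # column 0 of the DPM is all zeros
        for c in range(k):
            south, east = forward_tile(x[r * t:(r + 1) * t],
                                       y[c * t:(c + 1) * t],
                                       north_of[c], west)
            north_of[c] = south
            west = east
    return west[-1]


if __name__ == "__main__":
    print(tiled_lcs_length("ACGTACGT", "TGCATGCA", k=4))
```

Because a tile only reads its north and west boundaries, any two tiles on the same anti-diagonal are independent, which is what the parallel layouts exploit.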


2.2.2 Parallel Solutions

Several theoretical results for LDDP are based on the Parallel Random Access Machine (PRAM) model [32], which ignores memory hierarchy, memory latency and the cost of synchronization. As an example, Mathies [33] shows an algorithm for determining edit distances for two strings of size m and n in O(log m log n) time using mn processors. Although these results show the extent of parallelism, their assumption that the number of processors is in the order of the problem size makes the algorithm impractical.

In general, to solve an LDDP problem in parallel, the problem space must be decomposed and distributed to compute units. Let an n × n DPM be decomposed into k² equally sized square tiles of size t = n/k. A layout defines the order in which the computation of tiles in the DPM is performed.
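As a small illustration of what a layout specifies, the Python sketch below lists, for a k × k tile grid, which tiles become computable at each step when whole anti-diagonals are processed one after another, matching the tile numbering in Figure 2.1. The function name is hypothetical, and the sketch ignores how many compute units are available, so it describes the dependency order rather than a complete layout.

```python
from typing import List, Tuple


def diagonal_wavefront_schedule(k: int) -> List[List[Tuple[int, int]]]:
    """Return, per step, the (row, column) tiles on one anti-diagonal.

    Tile (r, c) depends on tiles (r-1, c), (r, c-1) and (r-1, c-1),
    so all tiles with the same r + c can be computed independently.
    """
    schedule = []
    for d in range(2 * k - 1):
        first_row = max(0, d - k + 1)
        last_row = min(d, k - 1)
        schedule.append([(r, d - r) for r in range(first_row, last_row + 1)])
    return schedule


if __name__ == "__main__":
    for step, tiles in enumerate(diagonal_wavefront_schedule(4), start=1):
        print(step, tiles)
```

The number of independent tiles rises from 1 to k and falls back to 1, which is the varying degree of parallelism that the different layouts try to handle.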

Parallel Solutions for CPU

Galper and Brutlag 1990 [13] presented the layout Column-cyclic Wavefront (originally called Row Wavefront Approach) for efficiently solving LDDP problems on a shared-memory multiprocessor. The layout is examined and analyzed in chapter 4.

Krusche and Tiskin 2006 [14] used a layout similar to that of Galper and Brutlag to find the length of the longest common subsequence using the Bulk Synchronous Parallelism Model (BSP) [15]. Their algorithm decomposes the DPM into rectangular tiles similar to Driga et al. [17], and sequentially computes the values inside each tile.

Driga et al. 2006 [17] presented a parallel version of their linear-space FLSA algorithm for CPU multicore systems. The algorithm computes the tiled DPM by advancing in a diagonal wavefront pattern, called the Diagonal Wavefront layout. The computation flow is shown in Figure 2.1. Their experiments showed a linear speedup up to 8 processors for sequences where n < 2¹⁹.

Chowdhury and Ramachandran 2008 [34] showed a general cache-efficient recursive multicore algorithm for solving LDDP problems. They considered three types of caching models for chip multiprocessors (CMP), including private, shared and multicore. Performance tests for two LDDP problems, pairwise sequence alignment with affine gap cost and median of 3 sequences (again with affine gap penalty), solved using their CMP algorithm, showed a 5 times speedup on an 8-core multiprocessor.

Diaz et al. 2011 [35] implemented Smith-Waterman and Needleman-Wunsch on the Tilera Tile64 processor, which has 64 cores. They based their parallel algorithm on FLSA by Driga et al. [17]. Their implementation achieved up to 15 times performance increase compared to the same algorithm on an x86 multicore architecture.


Parallel Solutions for GPU

Currently there are several GPU solutions to LDDP problems, but we found the Smith-Waterman algorithm for local alignment to be the most explored. The most important are listed here:

Liu, W. et al. 2006 [3, 4] presented the first solution to Smith-Waterman on a GPU, and achieved a very high utilization of GPU multiprocessors by comparing a large number of independent short sequences. This means that they solve multiple independent LDDP problems with no dependencies on the GPU grid-level. To reduce space when computing the optimal length of an n × n cost matrix, they only store three separate buffers of length n holding cost values for the most recently calculated diagonals; we call this linear-space reduction three cost diagonals. Similar solutions were presented in 2008-2009 [5, 6, 7].

Liu, Y. et al. 2010 [8, 9] presented CUDASW++, reported to perform up to 17 billion cell updates per second (GCUPS) on a single-GPU GeForce GTX 280 for solving Smith-Waterman. We note that CUDASW++ uses the Column-cyclic Wavefront layout on the thread-level. No backtracking is made, and their algorithm does not generalize to large LDDP problems.

Although many GPU solutions for Smith-Waterman were found, they are only able to compare strings of size n < 2¹⁶. As a result, they are not applicable for comparing the large biological sequences considered in this report.
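The three cost diagonals idea can be sketched in Python for the LCS cost function (the cited work applies it to Smith-Waterman). The sketch is sequential and its buffer layout is an assumption made for illustration: each anti-diagonal d = i + j is stored in a buffer indexed by the row i, and the three buffers are rotated after every diagonal. The inner loop over i has no dependencies between iterations, which is the part a GPU would execute in parallel.

```python
def lcs_length_three_diagonals(x: str, y: str) -> int:
    """LCS length using only three anti-diagonal buffers of length n+1."""
    n, m = len(x), len(y)
    prev2 = [0] * (n + 1)  # anti-diagonal d - 2
    prev1 = [0] * (n + 1)  # anti-diagonal d - 1
    curr = [0] * (n + 1)   # anti-diagonal d (being written)

    for d in range(2, n + m + 1):
        # Interior entries (i, j) with i + j == d, 1 <= i <= n, 1 <= j <= m.
        first_i = max(1, d - m)
        last_i = min(n, d - 1)
        for i in range(first_i, last_i + 1):
            j = d - i
            if x[i - 1] == y[j - 1]:
                curr[i] = prev2[i - 1] + 1            # north-west parent
            else:
                curr[i] = max(prev1[i], prev1[i - 1])  # west, north parents
        # Border entries are never written, so they keep their initial 0.
        prev2, prev1, curr = prev1, curr, prev2

    return prev1[n]


if __name__ == "__main__":
    print(lcs_length_three_diagonals("ABCBDAB", "BDCABA"))
```

The space used is three buffers of length n + 1 instead of the full (n+1) × (n+1) matrix, at the price of not being able to backtrack.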

For the solution of large LDDP problems, we found two interesting GPU implementations of longest common subsequence (LCS):

Kloetzli et al. 2008 [11] presented a combined CPU/GPU solution for solving LCS of large sequences (n up to 2²⁰). They showed a five-fold speedup over the cache-oblivious single-processor algorithm presented by Chowdhury and Ramachandran [24]. The experiments were performed on an AMD Athlon 64 and an NVIDIA G80 family GTX GPU.

Deorowicz 2010 [12] calculates the length of the LCS on large sequences. The algorithm decomposes the problem space into tiles like Driga et al. [17], and calculates the tiles using the Diagonal Wavefront layout. The experiments show significant speedups over the author's own serial CPU implementation of the same algorithm for n = 2¹⁶. Unfortunately, no comparison is made with any known CPU solution, and, despite having tried, we have not been able to obtain the source code.

2.3 Our Approach Based on Previous Results

We now select relevant results for our further investigations. As a basis for our LDDP solution, we use the tiling approach by Driga et al. [17] to achieve a linear-space reduction and decomposition of the problem space. Furthermore, we wish to investigate the properties and efficiencies of the layouts Diagonal Wavefront [12, 17] and Column-cyclic Wavefront [13, 14]. The three cost diagonals presented by Liu, W. et al. [3, 4] are explored for space reduction.


3 GRAPHICS PROCESSING UNITS

In this chapter we introduce relevant aspects of graphics processing unit architectures and the programming model exposing the hardware.

Where central processing units (CPUs) are highly optimized for solving a wide range of single-threaded applications, GPUs are built for graphics applications having a large degree of data parallelism. Graphics applications are also latency tolerant, as the processing of each pixel can be delayed as long as frames are processed at acceptable rates. As a result, GPUs can trade off single-thread performance for increased parallel processing. As a consequence, each processing element on the GPU is relatively simple and hundreds of cores can be packed per die [36].

There are currently several frameworks exposing the computation power of GPUs, including ATI Stream, Open Computing Language (OpenCL) and NVIDIA's Compute Unified Device Architecture (CUDA) [37]. For our implementation we choose to work with NVIDIA CUDA.

3.1 GPU Architecture

A GPU is composed of a number of streaming multiprocessors (SM), each having a number of compute units called streaming processors (SP) running in lockstep. The number of streaming multiprocessors differs between GPU models, but as an example the NVIDIA Tesla C2070 has 14 SMs, each with 32 SPs, totaling 448 SPs. See the hardware specifications in Appendix A.

The architecture of a GPU is akin to Single Instruction Multiple Data (SIMD); however, a GPU refines the SIMD architecture into Single Instruction Multiple Thread (SIMT). Instructions are issued to a collection of threads called a warp. SIMT allows individual execution paths of threads to diverge as a result of branching. If threads within a warp diverge, the warp will serialize each path taken by the threads [37].

Compared to CPU threads, threads on a GPU are lightweight and handled in hardware. Register memory for individual threads is kept in the SM register memory, making hardware-based context switching possible at no cost.

Warps are not scheduled to run until data is available to all threads within the warp, making it possible to hide memory latency.


3.1.1 CUDA Programming Model

The programming model provides two levels of parallelism, coarse-grained and fine-grained. On the coarse-grained grid-level, partitioning of work is done by dividing the problem space into a grid consisting of a number of blocks. A block is mapped to a streaming multiprocessor, and represents a task which can be solved independently. On the fine-grained thread-level, concurrent threads are assigned to a block, and provide data and instruction parallelism. The levels of parallelism are depicted in Figure 3.1.

[Figure: a kernel with size (gridDim, blockDim) and a set of instructions is executed on a grid of blocks 1..gridDim. Each block holds threads 1..blockDim, grouped into warps 1..n of 32 threads, and each warp executes the kernel instructions.]

Figure 3.1: Taxonomy of the CUDA work partitioning hierarchy.

A kernel function sets the partitioning parameters and defines the instructionsto be executed.

If the available resources of an SM allow it, multiple blocks can be allocated on an SM. This way, the hardware resources are better utilized.
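The hierarchy of grid, block, warp and thread can be made concrete with a few lines of index arithmetic, sketched here in Python rather than CUDA. The function name is hypothetical, and the sketch assumes a one-dimensional grid and block with warps formed from 32 consecutive thread indices, as Figure 3.1 suggests.

```python
from typing import Tuple

WARP_SIZE = 32  # threads per warp, as shown in Figure 3.1


def locate_thread(global_id: int, block_dim: int) -> Tuple[int, int, int, int]:
    """Map a global thread index to (block, thread in block, warp, lane).

    In a CUDA kernel the forward direction would be
    global_id = blockIdx.x * blockDim.x + threadIdx.x; this sketch
    inverts that for a one-dimensional launch.
    """
    block, thread = divmod(global_id, block_dim)
    warp, lane = divmod(thread, WARP_SIZE)
    return block, thread, warp, lane


if __name__ == "__main__":
    for gid in (0, 31, 32, 100, 130):
        print(gid, locate_thread(gid, block_dim=128))
```

Threads that share a block and warp number are issued the same instruction together, which is why divergent branches inside a warp are serialized.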

Synchronization Primitives

Each level has different means of synchronizing.

Grid-level: No specific synchronization primitive is available to handle synchronization between blocks, as concurrent blocks represent independent tasks. Implicit synchronization can, however, be achieved by a number of serialized kernel calls.

Thread-level: CUDA only supports barrier synchronization for all threads within a block.


3.2 Memory

[Figure: a streaming multiprocessor containing streaming processors with registers, shared memory, an L1 cache, a texture cache and a constant cache; outside the SM, an L2 cache in front of global memory, which also holds the texture and constant memory.]

Figure 3.2: CUDA memory spaces accessible from a streaming processor (SP). Please note that for simplicity only a single SM and a single SP are shown.

GPU memories are shown in Figure 3.2. Registers are used as private memory for threads, while all threads within a block have access to shared memory. All threads across blocks have access to global memory and the read-only texture and constant memory.

Two levels of caches, L1 and L2, exist. Both caches are remarkably smaller than typical caches on CPUs. Each SM is equipped with its own L1 cache that resides in shared memory. The L2 cache is shared between all SMs as a fully coherent unified cache, with cache lines of 128 bytes. As shown in Figure 3.2, the L1 and L2 caches are used for global memory. The special caches, texture and constant, can be mapped to specific parts of global memory, and provide specialized cached access patterns to these parts of global memory. The CUDA memory types and their traits are shown in Table 3.1.

For accessing global memory, the number of memory transactions performed will be equal to the number of L2 cache lines needed to completely satisfy the request.
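The effect of the access pattern on the number of transactions can be estimated with a small Python model that counts the distinct 128-byte cache lines touched by the threads of one warp. The function name and the 4-byte default access size are assumptions made for the example; the model ignores caching effects between requests.

```python
from typing import Iterable

CACHE_LINE_BYTES = 128  # L2 cache line size stated above


def count_transactions(byte_addresses: Iterable[int], access_size: int = 4) -> int:
    """Count the distinct cache lines needed to serve one warp request."""
    lines = set()
    for address in byte_addresses:
        first_line = address // CACHE_LINE_BYTES
        last_line = (address + access_size - 1) // CACHE_LINE_BYTES
        lines.update(range(first_line, last_line + 1))
    return len(lines)


if __name__ == "__main__":
    warp = range(32)
    # Consecutive, aligned 4-byte words: the whole warp shares one line.
    print(count_transactions(4 * t for t in warp))
    # Same pattern shifted by one word: the request straddles two lines.
    print(count_transactions(4 * t + 4 for t in warp))
    # One word per cache line: every thread needs its own transaction.
    print(count_transactions(128 * t for t in warp))
```

This is the reason neighboring threads should read neighboring addresses where possible.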

Shared memory is divided into equally sized memory banks which can be accessed simultaneously. Concurrent memory accesses that fall into distinct banks can be handled simultaneously, whereas concurrent accesses to the same bank will cause serialized access, referred to as bank conflicts.
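Bank conflicts can be modeled in a similar way. The Python sketch below assumes 32 banks with successive 32-bit words assigned to successive banks, and assumes that threads reading the same word are served by a broadcast; these parameters describe NVIDIA GPUs of compute capability 2.x and are not stated in the text above. The function name is hypothetical.

```python
from typing import Dict, Iterable, Set


def bank_conflict_degree(word_indices: Iterable[int], num_banks: int = 32) -> int:
    """Return the number of serialized passes for one warp request.

    word_indices are 32-bit word offsets into shared memory, one per
    thread. A result of 1 means the access is conflict free.
    """
    words_per_bank: Dict[int, Set[int]] = {}
    for word in word_indices:
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    # Distinct words in the same bank must be served one after another.
    return max((len(words) for words in words_per_bank.values()), default=0)


if __name__ == "__main__":
    warp = range(32)
    print(bank_conflict_degree(t for t in warp))       # stride 1
    print(bank_conflict_degree(2 * t for t in warp))   # stride 2
    print(bank_conflict_degree(32 * t for t in warp))  # stride 32
```

With these assumptions, odd strides are conflict free, while a stride equal to the number of banks sends every thread to the same bank.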

Access time for the texture and constant caches depends on access patterns, but the constant cache is stated to be as fast as reading from a register, as long as all threads read the same address [38].

Type      Location on SM  Cached  Access  Scope               Access latency (non-cached)
Register  yes             n/a     R/W     1 thread            0-1
Shared    yes             n/a     R/W     threads in block    1
Local     no              yes     R/W     1 thread            400-600
Global    no              yes     R/W     all threads + host  400-600
Constant  no              yes     R       all threads + host  400-600
Texture   no              yes     R       all threads + host  400-600

Table 3.1: Memory types in CUDA. n/a stands for not applicable, R for read and W for write. The documented access latencies are given in cycles [38].


3.3 Best Practices

A number of best practices to effectively exploit the GPU architecture are described by NVIDIA [38]. The most important are presented here:

Shared memory should be used when possible, as shared memory is faster than global memory. Values which are accessed often should be placed in shared memory.

The ratio of computation to global memory access should be maximized, as global memory access is slow, while parallel computation is fast.

Minimize kernel branch divergence, because divergent branches mean serialized execution of each divergent branch.


4 PARALLEL LAYOUTS FOR LDDP

We will now present how LDDP problems can be computed in parallel. From recurrence 2.1 on page 5, an entry (i, j) in the DPM is computable if the neighboring entries to the west, north-west and north have been computed. Thus, for any entry (i, j) to be computable, the data dependencies (id, jd) are where 0 id

t2. Our empirical studies have shown that t2 ≈ 1.7 t1. The tendency seen here for B = 2 will be similar in cases where B > 2.

Speedup Predictability

By applying the virtual time function to the concept of steps, we are able to theoretically determine the speedup of COLCYCLIC and DIACYCLIC relative to DIAWAVE. By comparing the theoretical speedup with the speedup measured in our experiments, we will now investigate how well the theoretical descriptions predict the actual speedup on the GPU architecture.

Figure 6.2 shows the difference between theoretical and experimental speedup with one block per multiprocessor. Plot (a) compares predictability for a kernel with substrings in shared memory, and plot (b) shows predictability when the substrings are in global memory. When the substrings are in shared memory, the access time is close to constant, and there is less pressure on the caches where the boundaries reside. When the substrings are in global memory, the caches are used for both substrings and boundaries, which puts more pressure on the caches and results in larger variations in memory access time. This explains why plot (b) contains more noise than plot (a).

In general, our theoretical descriptions predict a speedup that is less than 0.30 percentage points better than what is observed in experiments. For the kernels where the substrings are in shared memory, the predicted speedup is less than 0.10 percentage points better than the observed speedup.

[Figure 6.2 (plots omitted). Both panels show the difference between theoretical and experimental speedup in percentage points (0.00 to 0.25) against the number of tiles (k) from 28 to 280. Legend values: (a) Diagonal-cyclic Wavefront 0.08, Column-cyclic Wavefront 0.06; (b) Diagonal-cyclic Wavefront 0.23, Column-cyclic Wavefront 0.19.]

Figure 6.2: Comparing theoretical layout performance with experiments, using a fixed tile size of 1024, a varying number of tiles k = [28; 280] and one block per multiprocessor. Average deviation and standard deviation s from theory are listed in the labels. The plots show a slightly worse speedup in practice than predicted by our theory, due to the implementation of Diagonal Wavefront. (a) LCS kernel GKERNELSHARED, with input strings in shared memory. (b) GKERNELGLOBAL, with input strings in global memory.

Figure 6.3 shows how well we can predict the speedup theoretically when up to two blocks can execute concurrently on a multiprocessor. The data is from the same experiments as shown in Figure 6.1 (b). A lower accuracy is observed compared to when at most one block can execute concurrently on a multiprocessor. This is expected, since virtual time becomes more approximate as B gets larger, due to the t2 term. In this case, the theory predicts a speedup that is less than 1.2 percentage points better than what is observed in experiments.

For most results, the theory predicts a better speedup than what we see in our experiments. This degradation in experimental speedup is due to the implementation of DIAWAVE. As explained in section 5.1, we reduce the amount of grid-level synchronization by executing only a single kernel call per front of the wavefront. As a result, our DIAWAVE implementation has a slightly better running time than predicted by the step-to-virtual-time function, which lowers the measured speedup of the other layouts relative to it.

[Figure 6.3 (plot omitted): difference between theoretical and experimental speedup in percentage points against the number of tiles (k) from 28 to 280. Legend values: Diagonal-cyclic Wavefront 0.66, Column-cyclic Wavefront 0.58.]

Figure 6.3: Comparing theoretical layout performance with the experiments shown in Figure 6.1 (b), where two concurrent blocks can execute per multiprocessor. The plot shows a slightly worse speedup in practice than predicted by our theory. The LCS kernel used is GKERNELSHARED with 512 threads per block.

6.2.2 Part Conclusion

On the grid-level, the theoretical description of the three layouts captures what we see in our experimental results with a very small margin of error. This has been shown for one and for multiple concurrent blocks per multiprocessor. In the cases where our theoretical predictions are inaccurate, we are able to pinpoint the cause, and these inaccuracies are small enough to be considered negligible. Thus, our theoretical approach of describing the layouts as steps, and mapping steps to virtual time, gives very good predictions of how the architecture behaves. It also shows that the utilization metric gives a good theoretical way of comparing layouts on the grid-level.

This shows that our new Diagonal-cyclic Wavefront gives the best overall performance, both in experiments and theoretically.


7 EXPERIMENTAL RESULTS FOR THREAD-LEVEL

This chapter documents the structured approach and the experiments carried out to optimize kernel performance. A number of best practice optimizations have been evaluated during implementation, and a selection of these are presented in subsection 5.2.3.

7.1 Setup

On the thread-level we have four independent kernels with different characteristics for computing the forward pass of longest common subsequence. All kernels use the Diagonal Wavefront layout for mapping the problem onto CUDA threads. A simple backward pass kernel has also been designed. All kernels are implemented in NVIDIA CUDA. We use the same setup as in the grid-level experiments, described in section 6.1.

The performance of the kernels is tested on DNA sequences [39] and on randomly generated symbols from an alphabet of 256 symbols.

To compare the results we use cell updates per second (CUPS), a commonly used performance measure in bioinformatics literature [4, 6, 9, 10]. CUPS measures how many entries of the DPM are completely computed per second, including memory operations and communication. Given a DPM with $n \times n$ entries, the GCUPS (billion cell updates per second) measure is $n^2 / (t \cdot 10^9)$, where $t$ is the total computation time in seconds.
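As a worked example of the GCUPS measure, the sketch below evaluates the formula for the largest run reported in Table 8.1 (n = 2^21, GPU running time 0.2077 h). The function name `gcups` is hypothetical. Note that the timing in Table 8.1 includes memory transfer and traceback, so the resulting figure is an end-to-end value rather than a pure kernel measurement.

```cpp
#include <cassert>

// Billion cell updates per second for an n x n DPM computed in `seconds`.
double gcups(double n, double seconds) {
    assert(seconds > 0.0);
    return (n * n) / (seconds * 1e9);
}

// Converts a running time given in hours to seconds.
double hoursToSeconds(double hours) {
    return hours * 3600.0;
}
```

Calling `gcups(2097152.0, hoursToSeconds(0.2077))` gives roughly 5.88 GCUPS.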

7.2 Results for Forward Pass Kernels

We present the results for our four forward pass kernels for solving longest common subsequence. For all of them, each thread block computes a tile of $t \times t$ entries by comparing the substrings X' and Y' of length t, using two tile boundaries as input, and outputs the boundaries to the south and east. The four forward pass kernels are:

GKERNELSHARED general kernel, input strings in shared memory

GKERNELGLOBAL general kernel, input strings in global memory

SKERNELSHARED specialized kernel, input strings in shared memory

SKERNELGLOBAL specialized kernel, input strings in global memory
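The tile interface described above can be sketched sequentially. The following plain C++ code is a CPU reference under stated assumptions, not the GPU kernels themselves: it uses the textbook LCS recurrence, processes tiles in row-major order, and stores the north-west corner as the first element of the north boundary. The names `TileBoundaries`, `computeTile` and `lcsLengthTiled` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TileBoundaries {
    std::vector<int> south;  // cols + 1 values; south[0] is the south-west corner
    std::vector<int> east;   // one value per tile row
};

// Computes one tile from its north boundary (cols + 1 values, north[0] being
// the north-west corner) and its west boundary (one value per row).
TileBoundaries computeTile(const std::string& xs, const std::string& ys,
                           const std::vector<int>& north,
                           const std::vector<int>& west) {
    const std::size_t rows = xs.size(), cols = ys.size();
    assert(north.size() == cols + 1 && west.size() == rows);
    std::vector<int> prev = north;
    std::vector<int> cur(cols + 1, 0);
    TileBoundaries out;
    out.east.resize(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        cur[0] = west[i];
        for (std::size_t j = 0; j < cols; ++j) {
            if (xs[i] == ys[j]) {
                cur[j + 1] = prev[j] + 1;                   // match: extend diagonal
            } else {
                cur[j + 1] = std::max(prev[j + 1], cur[j]); // best of north and west
            }
        }
        out.east[i] = cur[cols];
        std::swap(prev, cur);
    }
    out.south = prev;
    return out;
}

// LCS length of x and y, computed tile by tile with tiles of at most t x t.
int lcsLengthTiled(const std::string& x, const std::string& y, std::size_t t) {
    assert(t > 0);
    const std::size_t n = x.size(), m = y.size();
    std::vector<std::vector<int>> northOf;  // boundary above each tile column
    for (std::size_t c = 0; c * t < m; ++c) {
        northOf.push_back(std::vector<int>(std::min(t, m - c * t) + 1, 0));
    }
    int result = 0;
    for (std::size_t r = 0; r * t < n; ++r) {
        const std::string xs = x.substr(r * t, t);
        std::vector<int> west(xs.size(), 0);  // first DPM column is all zeros
        for (std::size_t c = 0; c * t < m; ++c) {
            const TileBoundaries out =
                computeTile(xs, y.substr(c * t, t), northOf[c], west);
            northOf[c] = out.south;  // input for the tile below
            west = out.east;         // input for the tile to the right
            result = out.south.back();
        }
    }
    return result;
}
```

Only boundaries are kept between tiles, so memory use grows with the string lengths rather than with the size of the full matrix. The result does not depend on the tile size; for example, `lcsLengthTiled("AGGTAB", "GXTXAYB", 2)` and `lcsLengthTiled("AGGTAB", "GXTXAYB", 100)` both return 4.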


7.2.1 Automatic Performance Tuning

Selecting the optimal kernel parameters is a key task in optimizing the performance of GPU applications. As the performance of kernels is almost impossible to analyze theoretically, performance optimizations are driven by empirical studies. The strategy is to exhaustively search the parameter space by executing several runs with different tuning parameters. This technique is known as automatic performance tuning, or auto-tuning. Its effectiveness depends on which tuning parameters are chosen for optimization. Auto-tuning of GPU kernels can include relevant parameters like block size, threads per block, loop unrolling level and internal algorithm trade-offs. For more information on the topic in a GPU context, see [16, 41, 42].

Our automatic performance tuning is conducted by selecting a set of kernel parameters and, for each of these, automatically measuring the running time of our four kernels. To make the results independent of the grid-level layouts, we occupy all multiprocessors with the maximum number of concurrent blocks for the given parameters. The performance measured is therefore for the best case, where maximum utilization on the grid-level is achieved. All tests presented are for the targeted NVIDIA Tesla C2070 GPU.

Tile Size and Threads Ratio

We start by examining the optimal relationship between the kernel parameters tile size t and number of threads per block. The tile size is selected to be a multiple of the architecture's 128 byte memory transaction size, up to the maximum size supported by the kernels. The number of threads is selected to be a multiple of the warp size (32 threads), up to the maximum of 1024 threads. Our tests show a tendency for the optimal ratio between number of threads per block and tile size to be 1:4 for the general kernels and 1:8 for the specialized kernels. Figure 7.1 visualizes the result of an auto-tuning test.
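The exhaustive search over tile size and threads per block can be outlined as follows. This is a minimal sketch in plain C++ under stated assumptions: the candidate rules (tile size a multiple of 128, threads a multiple of 32 and at most 1024, threads dividing the tile size) are taken from the text and from the caption of Figure 7.1, while `KernelConfig`, `candidateConfigs`, `autoTune` and the measurement callback are hypothetical names. In a real tuner the callback would launch the kernel and return the measured GCUPS.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct KernelConfig {
    int tileSize;
    int threadsPerBlock;
};

// Enumerates supported configurations up to maxTileSize.
std::vector<KernelConfig> candidateConfigs(int maxTileSize) {
    std::vector<KernelConfig> configs;
    for (int t = 128; t <= maxTileSize; t += 128) {
        for (int threads = 32; threads <= 1024; threads += 32) {
            if (t % threads == 0) {  // threads must divide the tile size
                configs.push_back(KernelConfig{t, threads});
            }
        }
    }
    return configs;
}

// Runs the measurement for every candidate and returns the best one.
KernelConfig autoTune(
    const std::vector<KernelConfig>& configs,
    const std::function<double(const KernelConfig&)>& measureGcups) {
    assert(!configs.empty());
    KernelConfig best = configs.front();
    double bestScore = measureGcups(best);
    for (std::size_t i = 1; i < configs.size(); ++i) {
        const double score = measureGcups(configs[i]);
        if (score > bestScore) {
            bestScore = score;
            best = configs[i];
        }
    }
    return best;
}
```

Because every candidate is measured once, the cost of tuning grows with the number of supported configurations, which is why restricting the candidates to hardware-aligned values keeps the search manageable.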

Exploring parameter space for GKERNELSHARED (performance in GCUPS)

Threads per block  t=512  t=1024  t=1536  t=2048  t=2560  t=3072
128                5.1    5.0     4.0     2.2     2.2     2.3
256                4.6    5.4     5.4     3.9     4.0     4.1
384                -      -       5.4     -       -       5.1
512                3.6    4.7     5.2     4.9     5.2     5.3
640                -      -       -       -       5.2     -
768                -      -       4.4     -       -       5.3
1024               -      3.5     -       4.6     -       5.2

Figure 7.1: Example of auto-tuning two kernel parameters for GKERNELSHARED. The performance in GCUPS is shown for each supported configuration of tile size t and number of threads per block. A dash denotes an unsupported configuration, where the number of threads does not divide t.


Optimal Tile Size

To find the optimal value for tile size t, we tested all kernels by varying t, using the best ratio between t and number of threads per block. Notice that with different tile sizes, the number of concurrent blocks per multiprocessor will vary. The results for the general kernels are shown in Figure 7.2, and Figure 7.3 shows the results for the specialized kernels.

The results clearly show that selecting the tile size to be a multiple of the architecture's 128 byte memory transaction size generally yields the best performance. Also, having multiple blocks per multiprocessor gives better performance for all kernels.

[Figure 7.2 (plots omitted). Both panels show giga cell updates per second (GCUPS) against tile size. (a) Max performance of GKERNELSHARED (threads per block = 1/4 tile size), tile sizes 512 to 2048, with 6 down to 1 blocks per multiprocessor. (b) Max performance of GKERNELGLOBAL (threads per block = 1/4 tile size), tile sizes 512 to 4096, with 8 down to 1 blocks per multiprocessor.]

Figure 7.2: Experimental tile performance for the general kernels, using a varying tile size with the maximum number of blocks for each configuration. The vertical lines and numbers inside the plots denote the number of blocks per multiprocessor. The circles show where t is a multiple of the 128 byte memory transaction size. Tests were conducted on the NVIDIA Tesla C2070 GPU.


[Figure 7.3 (plots omitted). Both panels show giga cell updates per second (GCUPS) against tile size. (a) Max performance of SKERNELSHARED (threads per block = 1/8 tile size), tile sizes 512 to 4096, with 8 down to 1 blocks per multiprocessor. (b) Max performance of SKERNELGLOBAL (threads per block = 1/8 tile size), tile sizes 512 to 8192, with 8 down to 1 blocks per multiprocessor.]

Figure 7.3: Experimental tile performance for the specialized kernels, using a varying tile size with the maximum number of blocks for each configuration. The vertical lines and numbers inside the plots denote the number of blocks per multiprocessor. The circles show where t is a multiple of the 128 byte memory transaction size. Tests were conducted on the NVIDIA Tesla C2070 GPU.

When comparing the overall performance of the kernels in Figures 7.2 and 7.3, the specialized kernels surprisingly outperform the general kernels for most configurations. The specialized kernels do more computation when scaling cost values, but the smaller datatype reduces shared memory access. This does not fully explain why the performance is better. Another explanation could be different optimizations applied by the CUDA compiler.


Summary and Comparison

To give an overview of kernel performance on the NVIDIA Tesla GPU using the auto-tuned parameters and ratios, Table 7.1 presents combined results for tile sizes t = 1024, 2048, 4096 and 8192, using 256, 512 and 1024 threads per block. The results indicate the same ratio between t and number of threads per block as found in our auto-tuning test.

Performance in giga cell updates per second (GCUPS)

Tile size  Threads  SKERNELSHARED  SKERNELGLOBAL  GKERNELSHARED  GKERNELGLOBAL
1024       256      5.63 (4)       5.39 (5)       5.36 (3)       5.45 (4)
1024       512      4.87 (2)       4.70 (2)       4.71 (2)       4.81 (2)
1024       1024     3.62 (1)       3.58 (1)       3.53 (1)       3.62 (1)
2048       256      6.18 (3)       5.71 (4)       3.86 (1)       5.09 (2)
2048       512      5.79 (2)       5.36 (2)       4.91 (1)       5.51 (2)
2048       1024     4.81 (1)       4.59 (1)       4.64 (1)       4.70 (1)
4096       256      4.18 (1)       5.12 (2)       -              3.51 (1)
4096       512      5.73 (1)       5.77 (2)       -              5.32 (1)
4096       1024     5.74 (1)       5.32 (1)       -              5.49 (1)
8192       256      -              3.26 (1)       -              -
8192       512      -              5.24 (1)       -              -
8192       1024     -              5.67 (1)       -              -

Table 7.1: Kernel performance in GCUPS with the maximum number of blocks for different configurations of tile size t and threads per block. The number in parentheses denotes the number of concurrent blocks per SM, B. A dash denotes an unsupported configuration. Tests were conducted on the NVIDIA Tesla C2070 GPU.
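The number of concurrent blocks per SM is bounded by several hardware resources. The sketch below estimates it for GKERNELSHARED in plain C++ under explicit assumptions: the shared memory footprint of two character substrings plus three integer diagonals per tile is read off the kernel listing in Appendix B, the limits of 8 blocks, 49152 bytes of shared memory and 32768 registers per SM come from Appendix A.1, the limit of 1536 resident threads per SM is an assumption for compute capability 2.0 that is not listed in the data sheet, and the register count per thread is a hypothetical input. Allocation granularity and any other resource use are ignored. The names `SmLimits`, `sharedBytesGeneral` and `blocksPerSm` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct SmLimits {
    int maxBlocks = 8;        // resident blocks per SM (Appendix A.1)
    int maxThreads = 1536;    // assumed resident threads per SM
    int sharedBytes = 49152;  // shared memory per SM (Appendix A.1)
    int registers = 32768;    // registers per SM (Appendix A.1)
};

// Shared memory of one block: two char substrings and three int diagonals.
int sharedBytesGeneral(int tileSize) {
    return 2 * tileSize * static_cast<int>(sizeof(char)) +
           3 * tileSize * static_cast<int>(sizeof(int));
}

// Estimated number of concurrent blocks per SM; 0 means the tile does not fit.
int blocksPerSm(int tileSize, int threads, int regsPerThread,
                const SmLimits& sm = SmLimits()) {
    assert(tileSize > 0 && threads > 0 && regsPerThread > 0);
    const int shared = sharedBytesGeneral(tileSize);
    if (shared > sm.sharedBytes) {
        return 0;
    }
    const int byThreads = sm.maxThreads / threads;
    const int byShared = sm.sharedBytes / shared;
    const int byRegisters = sm.registers / (threads * regsPerThread);
    return std::min({sm.maxBlocks, byThreads, byShared, byRegisters});
}
```

With a hypothetical 32 registers per thread, chosen only for illustration, the estimate gives 3, 2 and 1 blocks for t = 1024 with 256, 512 and 1024 threads, 1 block for t = 2048, and 0 for t = 4096. These values agree with B for GKERNELSHARED in Table 7.1, but the agreement does not confirm the kernel's actual register usage.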

We conducted the same tests on the NVIDIA GeForce GTX 590, showing a performance increase of around 9% for all kernels compared to the Tesla GPU. The speedup can be explained by the increased clock frequency and by faster memory access, due to the lack of ECC, on the GeForce.

7.2.2 Effect of Alphabet Size and String Similarity

When varying the alphabet size up to 256 symbols for random sequences, we see no impact on the running time. The similarity of the given strings X and Y also has no influence on the running time. This is due to the branch reduction achieved by using CUDA max() functions in the implementation, which gives no branching within the kernel when matching character pairs from the strings.
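One way to express an LCS cell update without a data-dependent branch is shown below. This is a sketch in plain C++ and an assumption about the general idea, not the formulation used in the kernels: the comparison result is added to the north-west entry and the cell takes the maximum of the three candidates. This is equivalent to the usual recurrence because neighboring DPM entries differ by at most one. The names `lcsCell` and `lcsLength` are hypothetical, and whether `std::max` compiles to branch-free code on a CPU depends on the compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Cell update written with max() only; the work is the same whether or not
// the two characters match.
inline int lcsCell(int northWest, int north, int west, char a, char b) {
    return std::max(northWest + static_cast<int>(a == b),
                    std::max(north, west));
}

// Sequential LCS length using two rows of the DPM.
int lcsLength(const std::string& x, const std::string& y) {
    const std::size_t n = x.size(), m = y.size();
    std::vector<int> prev(m + 1, 0), cur(m + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        cur[0] = 0;
        for (std::size_t j = 0; j < m; ++j) {
            cur[j + 1] = lcsCell(prev[j], prev[j + 1], cur[j], x[i], y[j]);
        }
        std::swap(prev, cur);
    }
    // The LCS can never be longer than the shorter input.
    assert(prev[m] <= static_cast<int>(std::min(n, m)));
    return prev[m];
}
```

Since every cell performs the same operations, the running time of such a formulation does not depend on how often characters match, which is consistent with the observation above.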

7.3 Results for Backward Pass Kernel

We have implemented a simple backward pass kernel, BPKERNEL. Although the implementation does not utilize the available resources efficiently, the time spent on the backward pass is less than 2% of the total running time for $n = 2^{21}$.


7.4 Part Conclusion

Currently there is no way of modeling the behavior at the GPU thread-level, as small changes can have cascading effects which are hard to predict. Some metrics, like utilization and data locality, can help determine what is viable to implement, but experimental results are the only way to get solid answers.

The kernels presented are all implemented with NVIDIA's best practices in mind [38]. As examples, we have no bank conflicts for shared memory access, we have minimized branch divergence where possible, and we have investigated the effects of loop unrolling. We have also found a rather undocumented feature, the volatile keyword, which can have an impact on performance. Through these measures we have worked towards high performance for the implemented kernels.

Furthermore, we have experimentally auto-tuned the kernel parameters by performing an exhaustive search of the parameter space. This provides us with general pointers on the optimal relationship between kernel parameters. The auto-tuning results are also valuable in determining the optimal tile size for a given input size.


8 PERFORMANCE EVALUATION

8.1 The Potential of Solving LDDP Problems on GPUs

To evaluate the potential of solving LDDP problems on GPU hardware, we compare our results for finding longest common subsequence (LCS) to sequential CPU solutions. For comparison we have used the CPU solutions provided by Stöckel and Bille [1], including LCS implementations of Hirschberg [22], FLSA by Driga et al. [17], the cache-oblivious algorithm by Chowdhury and Ramachandran [24], and their own FCO [1]. As the fastest known general sequential CPU solution, we use Stöckel's optimized implementation of FLSA by Driga et al. [17].

Table 8.1 shows a performance comparison between our GPU solution on an NVIDIA Tesla C2070 and FLSA [17] running on an Intel i7 2.66 GHz with 4 GB memory. Both architectures were released in 2010. The data shows that our GPU solution has an average performance advantage of over 40X over FLSA when comparing two strings of length larger than $2^{19}$. Our experiments show that our inefficient backward pass has a negative impact on the GPU speedup for smaller strings.

Input size n  FLSA CPU running time  GPU running time  GPU speed-up
2^18          0.132h                 0.0035h           38X
2^19          0.529h                 0.0125h           42X
2^20          2.127h                 0.0473h           45X
2^21          8.741h                 0.2077h           42X

Table 8.1: Performance comparison of the state-of-the-art single-threaded CPU solution FLSA [17] (on an Intel i7 2.66 GHz with 4 GB memory) and our GPU solution (NVIDIA Tesla C2070) for solving the longest common subsequence. The timings include complete computation time, memory transfer and traceback of the solution path. All GPU tests use our new Diagonal-cyclic Wavefront layout. For n = 2^21, SKERNELGLOBAL was used as kernel with tile size t = 8192 and 1024 threads per block. All other tests used SKERNELSHARED with t = 2048 and 256 threads.

Driga et al. [17] presented a parallel FLSA with an almost linear speedup for up to eight processors, when comparing two strings with length just above $2^{18}$. For 32 processors the speedup is halved. When comparing this with the running times presented in Table 8.1, the GPU is still an order of magnitude faster than the parallel FLSA CPU solution running on a 32-core CPU. We believe this shows the potential of solving LDDP problems on GPU hardware.


8.2 Comparing to Similar GPU Solutions

As stated in section 2.2.2, there are currently many GPU solutions for LDDP problems, especially for Smith-Waterman. All of them are able to compare a large number of independent short sequences, with a length of up to $2^{16}$. As a result, they do not target the same problem size as we consider, making direct comparison impossible.

The GPU+CPU solution by Kloetzli et al. [11] is able to solve longest common subsequence (LCS) on strings of length up to $2^{20}$. They showed a five-fold speedup over the single processor algorithm presented by Chowdhury and Ramachandran [24]. Since we achieve a much higher speedup, we conclude that our solution is superior.

We found the solution by Deorowicz [12] to be the GPU implementation that most resembles ours, although it only computes the length of the LCS. The implementation is based on the widely used Diagonal Wavefront layout. Their experiments showed a significant speedup over their own serial CPU implementation of the same anti-diagonal algorithm for $n = 2^{16}$. Unfortunately, no comparison is made with any known CPU solutions. Although we have tried to obtain the source code for their implementation, it has not been possible.

To the best of our knowledge, all existing GPU solutions for solving large-scale LDDP problems use the Diagonal Wavefront layout on the grid-level, where it would be an advantage to use our Diagonal-cyclic Wavefront layout instead.


9 CONCLUSION

Based on an analysis of state-of-the-art algorithms for solving local dependency dynamic programming (LDDP) problems and a thorough investigation of GPU architectures, we have combined and further developed existing LDDP solutions. The result is a novel approach for solving any large pairwise LDDP problem, supporting the largest input size for GPUs in the literature.

Our results include a new superior layout, Diagonal-cyclic Wavefront, for utilizing the coarse-grained parallelism of the many-core GPU. For various input sizes the Diagonal-cyclic Wavefront always outperforms the widely used Diagonal Wavefront, and in most cases the Column-cyclic Wavefront layout. In general, any GPU algorithm for large LDDP problems can adopt our new layout with advantage.

Theoretically, we have been able to analyze the efficiency of the layouts and accurately predict the relative speedup. Our results can be generalized to several levels of parallel computation using multiple GPUs.

As a case study, we have implemented GPU kernels for finding the longest common subsequence. We present ways of optimizing kernel performance by minimizing branch divergence, scaling inner cost values, evaluating NVIDIA best practices, performing automatic tuning of kernel parameters and exploiting the compiler keyword volatile, which has a previously undocumented speedup effect.

Compared with the fastest known sequential CPU algorithm by Driga et al. [17], our GPU solution obtains a 40X speedup. Comparing two sequences of 2 million symbols each, the CPU running time was close to 9 hours. Our implementation solves the same problem in 12 minutes. This shows that exact comparison of large biological sequences is now feasible.

9.1 Future Work

In this report we have focused on efficient layouts for the GPU grid-level. To further optimize the performance, experiments must be made applying different layouts on the fine-grained thread-level. We believe the most appropriate is Column-cyclic Wavefront, due to its degree of data locality. The potential of using multiple GPUs should also be examined.

We have focused on implementing longest common subsequence within our parallel LDDP solution. It could be interesting to extend this implementation with other LDDP algorithms with more advanced cost functions.


BIBLIOGRAPHY

[1] P. Bille and M. Stöckel. Fast and cache-oblivious dynamic programming with local dependencies. Language and Automata Theory and Applications, pages 131-142, 2012.

[2] W.R. Pearson and D.J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444, 1988.

[3] W. Liu, B. Schmidt, G. Voss, A. Schröder, and W. Müller-Wittig. Bio-sequence database scanning on a GPU. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 8 pp. IEEE, 2006.

[4] W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig. Streaming algorithms for biological sequence alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems, 18(9):1270-1281, 2007.

[5] S.A. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(Suppl. 2), 2008.

[6] L. Ligowski and W. Rudnicki. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. 2009.

[7] G.M. Striemer and A. Akoglu. Sequence alignment with GPU: Performance and design challenges. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1-10. IEEE, 2009.

[8] Y. Liu, D.L. Maskell, and B. Schmidt. CUDASW++: Optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes, 2, 2009.

[9] Y. Liu, B. Schmidt, and D.L. Maskell. CUDASW++ 2.0: Enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Research Notes, 3(1):93, 2010.

[10] J. Blazewicz, W. Frohmberg, M. Kierzynka, E. Pesch, and P. Wojciechowski. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinformatics, 12, 2011.

[11] J. Kloetzli, B. Strege, J. Decker, and M. Olano. Parallel longest common subsequence using graphics hardware. In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization. Eurographics Association, 2008.

[12] S. Deorowicz. Solving longest common subsequence and related problems on graphical processing units. Software: Practice and Experience, 40(8):673-700, 2010.

[13] A.R. Galper and D.L. Brutlag. Parallel similarity search and alignment with the dynamic programming method. Knowledge Systems Laboratory, Medical Computer Science, Stanford University, 1990.

[14] P. Krusche and A. Tiskin. Efficient longest common subsequence computation using bulk-synchronous parallelism. Computational Science and Its Applications - ICCSA 2006, pages 165-174, 2006.

[15] L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.

[16] H.H.B. Sørensen. Auto-tuning of level 1 and level 2 BLAS for GPUs. 2012.

[17] A. Driga, P. Lu, J. Schaeffer, D. Szafron, K. Charter, and I. Parsons. FastLSA: A fast, linear-space, parallel and sequential algorithm for sequence alignment. Algorithmica, 45(3):337-375, 2006.

[18] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.

[19] R.A. Wagner and M.J. Fischer. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168-173, 1974.

[20] S.B. Needleman, C.D. Wunsch, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[21] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.

[22] D.S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341-343, 1975.

[23] E.W. Myers and W. Miller. Optimal alignments in linear space. Computer Applications in the Biosciences: CABIOS, 4(1):11-17, 1988.

[24] R.A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 591-600. ACM, 2006.

[25] E.D. Demaine. Cache-oblivious algorithms and data structures. Lecture Notes from the EEF Summer School on Massive Data Sets, pages 1-29, 2002.

[26] P. Bille. Faster approximate string matching for short patterns. Theory of Computing Systems, pages 1-24, 2008.


[27] J.W. Hunt and T.G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350-353, 1977.

[28] G.M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157-169, 1989.

[29] W.J. Masek and M.S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18-31, 1980.

[30] L. Bergroth, H. Hakonen, and T. Raita. A survey of longest common subsequence algorithms. In String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on, pages 39-48. IEEE, 2000.

[31] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31-88, 2001.

[32] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the Tenth Annual ACM Symposium on Theory of Computing, pages 114-118. ACM, 1978.

[33] T.R. Mathies. A fast parallel algorithm to determine edit distance. 1988.

[34] R.A. Chowdhury and V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, pages 207-216. ACM, 2008.

[35] D. Diaz, F.J. Esteban, P. Hernandez, J.A. Caballero, G. Dorado, and S. Galvez. Parallelizing and optimizing a bioinformatics pairwise sequence alignment algorithm for many-core architecture. Parallel Computing, 37(4-5):244-259, 2011.

[36] V.W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A.D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, et al. Debunking the 100x GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In ACM SIGARCH Computer Architecture News, volume 38, pages 451-460. ACM, 2010.

[37] NVIDIA. CUDA C Programming Guide Version 4.1. November 2011.

[38] NVIDIA. CUDA C Best Practices Guide Version 4.1. January 2012.

[39] P. Ferragina and G. Navarro. Pizza & Chili corpus. University of Pisa and University of Chile. http://pizzachili.di.unipi.it/. 2012.

[40] NVIDIA. CUDA C Toolkit Reference Manual Version 4.2. April 2012.

[41] Y. Li, J. Dongarra, and S. Tomov. A note on auto-tuning GEMM for GPUs. Computational Science - ICCS 2009, pages 884-892, 2009.

[42] P. Micikevicius. Analysis-driven optimization. In GPU Technology Conference. NVIDIA, 2010.


Appendices


A NVIDIA GPU DATA SHEETS

A.1 NVIDIA Tesla C2070

The Tesla C2070 is developed for high performance scientific computing.

Release year 2010

CUDA Compute and Graphics Architecture Fermi

CUDA Driver Version 4.1

CUDA Compute Capability 2.0

Total amount of global memory: 5375 MBytes (5636554752 bytes)

Streaming Multiprocessors (SM): 14

(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores (SPs)

GPU Clock Speed: 1.15 GHz

Memory Clock rate: 1494.00 MHz

Memory Bus Width: 384-bit

L2 Cache Size: 786432 bytes

Total amount of constant memory: 65536 bytes

Total amount of shared memory per SM: 49152 bytes

Total number of registers available per SM: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum number of resident blocks per SM: 8

Concurrent copy and execution: Yes with 2 copy engines

Device has ECC support enabled: Yes


A.2 NVIDIA GeForce GTX 590

The GeForce GTX 590 is intended for the PC gaming market. The specifications written in bold highlight where the GeForce GTX 590 differs from the NVIDIA Tesla C2070.

Release year 2011

CUDA Compute and Graphics Architecture Fermi

CUDA Driver Version 4.1

CUDA Compute Capability 2.0

Total amount of global memory: 1536 MBytes (1610285056 bytes)

Streaming Multiprocessors (SM): 16

(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores (SPs)

GPU Clock Speed: 1.22 GHz

Memory Clock rate: 1707.00 MHz

Memory Bus Width: 384-bit

L2 Cache Size: 786432 bytes

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum number of resident blocks per SM: 8

Concurrent copy and execution: Yes with 1 copy engine

Device has ECC support enabled: No


B KERNEL SOURCE CODE

B.1 Forward pass kernels

B.1.1 GKERNELSHARED

/**
 * Computes the LCS cost-boundaries for the tiles defined in boxArray.
 *
 * @param boxArray  list of tiles to compute
 * @param xStr      X
 * @param yStr      Y
 * @param boundaryX boundaries in X direction
 * @param boundaryY boundaries in Y direction
 * @param n         input size
 * @param boxDim    a multiple of #threads per block
 */
__global__
__launch_bounds__(MAXTHREADSPERBLOCK)
void kernel_lcs_fp_wave_boundary_striving(const int2 *boxArray,
    char const * const xStr, char const * const yStr,
    int * const boundaryX, int * const boundaryY, int n, int boxDim) {

  int boxI = boxArray[blockIdx.x].x;
  int boxJ = boxArray[blockIdx.x].y;

  extern __shared__ char dynSmemWstringsGeneral[];

  char * const xStr_shared = &dynSmemWstringsGeneral[0];
  char * const yStr_shared = &dynSmemWstringsGeneral[boxDim];
  int *diag_base = (int*)&dynSmemWstringsGeneral[boxDim*2];

  // inner cost diagonals
  int *subMatrix[3] = {
    &diag_base[0],
    &diag_base[boxDim],
    &diag_base[2*boxDim]
  };
  // a pointer for wrapping around the diagonals
  int *tmpPointer;

  // stride variables based on the problem size and number of threads
  int totalstrides = (boxDim/blockDim.x);
  int strideWidth = boxDim/totalstrides;

  // Copy local X-string and Y-string needed for the current tile
  for (int stride = 0; stride < totalstrides; ++stride) {
    xStr_shared[threadIdx.x + stride*strideWidth] =
        xStr[boxI*boxDim + threadIdx.x + stride*strideWidth];
    yStr_shared[threadIdx.x + stride*strideWidth] =
        yStr[boxJ*boxDim + threadIdx.x + stride*strideWidth];
  }

  int totalDiagonals = boxDim*2 - 1;

  // pre-calculate index values for y/boundaryX
  int boundaryXOffsetRead = n*(boxJ - 1) + boxDim * boxI;
  int boundaryXOffsetWrite = boundaryXOffsetRead + n;

  int boundaryYOffsetRead = n*(boxI - 1) + boxDim * boxJ;
  int boundaryYOffsetWrite = boundaryYOffsetRead + n;

  // sync all threads in block
  __syncthreads();

  // calculate the cost values
  for (int slice = 0; slice < totalDiagonals; ++slice) {

    // for each stride
    for (int stride = 0; stride < totalstrides; ++stride) {
      // update i,j
      int i = threadIdx.x + (stride*strideWidth);
      int j = slice - i;

      if (!(j < 0 || j >= boxDim)) {
        // calculate
        int northWestValue, result;
        int northValue = j == 0 ? boundaryX[boundaryXOffsetRead + i]
                                : subMatrix[NOMATCH][i];
        int westValue = i == 0 ? boundaryY[boundaryYOffsetRead + j]
                               : subMatrix[NOMATCH][i - 1];

        if (j == 0) {
          // border to the north
          northWestValue = boundaryX[boundaryXOffsetRead + i - 1];
        } else if (i == 0) {
          // border to the west
          northWestValue = boundaryY[boundaryYOffsetRead + j - 1];
        } else {
          // not on border, read from own cost values
          northWestValue = subMatrix[MATCH][i - 1];
        }

        result = max(
          westValue,
          northWestValue + (yStr_shared[j] == xStr_shared[i])
        );
        result = max(northValue, result);

        subMatrix[RESULT][i] = result;
        // end of calculation

        // on south/east edge? Save in output boundary
        if (j == boxDim - 1) {
          // south edge. Note: corner is only saved here and not in boundaryY
          boundaryX[boundaryXOffsetWrite + i] = subMatrix[RESULT][i];
        } else if (i == boxDim - 1) {
          // east edge
          boundaryY[boundaryYOffsetWrite + j] = subMatrix[RESULT][i];
        }
      }
    }

    // memory wrap around
    tmpPointer = subMatrix[2];
    subMatrix[2] = subMatrix[0];
    subMatrix[0] = subMatrix[1];
    subMatrix[1] = tmpPointer;

    __syncthreads();
  }
}
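As a cross-check on the indexing in the kernel above, the following host-side C++ sketch walks a single tile anti-diagonal by anti-diagonal using three rolling buffers, in the way `subMatrix[MATCH]`, `subMatrix[NOMATCH]` and `subMatrix[RESULT]` appear to be used. It is not part of the thesis code. The function name `lcsByDiagonals` is hypothetical, the tile is assumed to have all-zero north and west boundaries (as for the top-left tile), the inner loop over `i` stands in for the threads and strides, and the buffer roles are an assumption because the `MATCH`, `NOMATCH` and `RESULT` macros are not listed in this appendix.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sequential sketch of the anti-diagonal sweep over one tile.
// Assumption: north and west boundaries are all zero (top-left tile).
int lcsByDiagonals(const std::string& x, const std::string& y) {
    const int nx = static_cast<int>(x.size());
    const int ny = static_cast<int>(y.size());
    if (nx == 0 || ny == 0) return 0;

    // Three rolling diagonals, each indexed by i (the position in x).
    std::vector<int> match(nx, 0);    // diagonal slice-2: north-west neighbours
    std::vector<int> noMatch(nx, 0);  // diagonal slice-1: north and west neighbours
    std::vector<int> result(nx, 0);   // diagonal currently being written

    const int totalDiagonals = nx + ny - 1;
    for (int slice = 0; slice < totalDiagonals; ++slice) {
        // On the GPU every i is handled by a thread (or stride); here we loop.
        for (int i = 0; i < nx; ++i) {
            const int j = slice - i;
            if (j < 0 || j >= ny) continue;  // cell lies outside the tile
            const int north = (j == 0) ? 0 : noMatch[i];
            const int west = (i == 0) ? 0 : noMatch[i - 1];
            const int northWest = (i == 0 || j == 0) ? 0 : match[i - 1];
            result[i] = std::max({north, west,
                                  northWest + (x[i] == y[j] ? 1 : 0)});
        }
        // Rotate buffers: slice-1 becomes slice-2, the result becomes slice-1,
        // and the oldest buffer is reused for the next diagonal.
        std::swap(match, noMatch);
        std::swap(noMatch, result);
    }
    // After the final rotation the last diagonal lives in noMatch.
    return noMatch[nx - 1];
}
```

Calling `lcsByDiagonals("ABCBDAB", "BDCABA")` gives the same length as the usual row-by-row recurrence, which suggests that three diagonals are enough state for the sweep.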

B.1.2 GKERNELGLOBAL

/**
 * Computes the LCS cost-boundaries for the tiles defined in boxArray.
 *
 * @param boxArray  list of tiles to compute
 * @param xStr      X
 * @param yStr      Y
 * @param boundaryX boundaries in X direction
 * @param boundaryY boundaries in Y direction
 * @param n         input size
 * @param boxDim    a multiple of #threads per block
 */
__global__
__launch_bounds__(MAXTHREADSPERBLOCK)
void kernel_lcs_fp_wave_boundary_striving_globalStrings(const int2 *boxArray,
    char const * const xStr, char const * const yStr,
    int * const boundaryX, int * const boundaryY, int n, int boxDim) {

  int boxI = boxArray[blockIdx.x].x;
  int boxJ = boxArray[blockIdx.x].y;

  extern __shared__ int dynSmemWOstringsGeneral[];

  int *subMatrix[3] = {
    &dynSmemWOstringsGeneral[0],
    &dynSmemWOstringsGeneral[boxDim],
    &dynSmemWOstringsGeneral[boxDim*2]
  };
  int *tmpPointer;

  int totalstrides = (boxDim/blockDim.x);
  volatile int strideWidth = boxDim/totalstrides;
  volatile int totalDiagonals = boxDim*2 - 1;

  // pre-calculate index values for y/boundaryX
  volatile int boundaryXOffsetRead = n*(boxJ - 1) + boxDim * boxI;
  int boundaryXOffsetWrite = boundaryXOffsetRead + n;
  volatile int boundaryYOffsetRead = n*(boxI - 1) + boxDim * boxJ;
  int boundaryYOffsetWrite = boundaryYOffsetRead + n;

  // no sync needed
  // calculate the cost matrix
  for (int slice = 0; slice < totalDiagonals; ++slice) {

    // for each stride
    for (int stride = 0; stride < totalstrides; ++stride) {
      // update i,j
      int i = threadIdx.x + (stride*strideWidth);
      int j = slice - i;

      // from here the kernel is the same as kernel_lcs_fp_01_wave_boundary
      if (!(j < 0 || j >= boxDim)) {
        // calculate
        int northWestValue, result;
        int northValue = j == 0 ? boundaryX[boundaryXOffsetRead + i]
                                : subMatrix[NOMATCH][i];
        int westValue = i == 0 ? boundaryY[boundaryYOffsetRead + j]
                               : subMatrix[NOMATCH][i - 1];

        if (j == 0) {
          // border to the north
          northWestValue = boundaryX[boundaryXOffsetRead + i - 1];
        } else if (i == 0) {
          // border to the west
          northWestValue = boundaryY[boundaryYOffsetRead + j - 1];
        } else {
          // not on border, read from own cost values
          northWestValue = subMatrix[MATCH][i - 1];
        }

        result = max(
          westValue,
          northWestValue + (yStr[boxJ*boxDim + j] == xStr[boxI*boxDim + i])
        );
        result = max(northValue, result);

        subMatrix[RESULT][i] = result;
        // end of calculation

        // on south/east edge? Save in output boundary
        if (j == boxDim - 1) {
          // south edge. Corner is only saved here and not in boundaryY
          boundaryX[boundaryXOffsetWrite + i] = subMatrix[RESULT][i];
        } else if (i == boxDim - 1) {
          // east edge
          boundaryY[boundaryYOffsetWrite + j] = subMatrix[RESULT][i];
        }
      }
    }

    // memory wrap around
    tmpPointer = subMatrix[2];
    subMatrix[2] = subMatrix[0];
    subMatrix[0] = subMatrix[1];
    subMatrix[1] = tmpPointer;

    __syncthreads();
  }
}

B.1.3 SKERNELSHARED

/**
 * Computes the LCS cost-boundaries for the tiles defined in boxArray.
 *
 * @param boxArray  list of tiles to compute
 * @param xStr      X
 * @param yStr      Y
 * @param boundaryX boundaries in X direction
 * @param boundaryY boundaries in Y direction
 * @param n         input size
 * @param boxDim    a multiple of #threads per block
 */
__global__
__launch_bounds__(MAXTHREADSPERBLOCK)
void kernel_lcs_fp_wave_boundary_striving_scaling(const int2 *boxArray,
    char const * const xStr, char const * const yStr,
    int * const boundaryX, int * const boundaryY, int n, int boxDim) {

  int boxI = boxArray[blockIdx.x].x;
  int boxJ = boxArray[blockIdx.x].y;

  extern __shared__ char dynSmemWstringsSpecialized[];

  char * const xStr_shared = &dynSmemWstringsSpecialized[0];
  char * const yStr_shared = &dynSmemWstringsSpecialized[boxDim];
  diag_t *diag_base = (diag_t *)&dynSmemWstringsSpecialized[boxDim*2];

  diag_t *subMatrix[3] = {
    &diag_base[0],
    &diag_base[boxDim],
    &diag_base[2*boxDim]
  };
  diag_t *tmpPointer;

  int totalStrides = boxDim/blockDim.x;
  int strideWidth = boxDim/totalStrides;
  volatile int totalDiagonals = boxDim*2 - 1;

  // Copy X-string and Y-string needed for the current tile
  for (int stride = 0; stride < totalStrides; ++stride) {
    xStr_shared[threadIdx.x + stride*strideWidth] =
        xStr[boxI*boxDim + threadIdx.x + stride*strideWidth];
    yStr_shared[threadIdx.x + stride*strideWidth] =
        yStr[boxJ*boxDim + threadIdx.x + stride*strideWidth];
  }

  // pre-calculate index values for y/boundaryX
  volatile int boundaryXOffsetRead = n*(boxJ - 1) + boxDim * boxI;
  int boundaryXOffsetWrite = boundaryXOffsetRead + n;
  volatile int boundaryYOffsetRead = n*(boxI - 1) + boxDim * boxJ;
  int boundaryYOffsetWrite = boundaryYOffsetRead + n;

  // scaling
  int resultScaling = (boxJ == 0) ? 0 : boundaryX[boundaryXOffsetRead];

  // sync all threads in block
  __syncthreads();

  // calculate the cost matrix
  for (int slice = 0; slice < totalDiagonals; ++slice) {

    // for each stride
    //#pragma unroll 8
    for (int stride = 0; stride < totalStrides; ++stride) {
      // update i,j
      int i = threadIdx.x + (stride*strideWidth);
      int j = slice - i;

      if (!(j < 0 || j >= boxDim)) {
        int northWestValue, result;
        int northValue = j == 0
            ? boundaryX[boundaryXOffsetRead + i] - resultScaling
            : subMatrix[NOMATCH][i]; // scaling
        int westValue = i == 0
            ? boundaryY[boundaryYOffsetRead + j] - resultScaling
            : subMatrix[NOMATCH][i - 1]; // scaling

        if (j == 0) {
          // border to the north
          northWestValue = boundaryX[boundaryXOffsetRead + i - 1]
              - resultScaling; // scaling
        } else if (i == 0) {
          // border to the west
          northWestValue = boundaryY[boundaryYOffsetRead + j - 1]
              - resultScaling; // scaling
        } else {
          // not on border, read from own cost values
          northWestValue = subMatrix[MATCH][i - 1];
        }

        result = max(
          westValue, northWestValue + (yStr_shared[j] == xStr_shared[i])
        );
        result = max(northValue, result);

        subMatrix[RESULT][i] = result;
        // end of calculation

        // on south/east edge? Save in output boundary
        if (j == boxDim - 1) {
          // south edge, corner is only saved here and not in boundaryY
          boundaryX[boundaryXOffsetWrite + i] =
              subMatrix[RESULT][i] + resultScaling; // scaling
        } else if (i == boxDim - 1) {
          // east edge
          boundaryY[boundaryYOffsetWrite + j] =
              subMatrix[RESULT][i] + resultScaling; // scaling
        }
      }
    }

    // memory wrap around
    tmpPointer = subMatrix[2];
    subMatrix[2] = subMatrix[0];
    subMatrix[0] = subMatrix[1];
    subMatrix[1] = tmpPointer;

    __syncthreads();
  }
}
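The lines marked `// scaling` in the kernel above subtract `resultScaling` from every boundary value that is read and add it back to every boundary value that is written. This is safe because the recurrence only takes maxima and adds 0 or 1, so shifting all inputs by a constant shifts the result by the same constant. The host-side C++ sketch below illustrates that identity for a single cell. The names `lcsCell`, `lcsCellScaled` and `DiagT` are hypothetical, and using `std::int16_t` for `DiagT` is an assumption: the definition of `diag_t` is not listed in this appendix, and it is presumably a narrower integer type than `int`.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for diag_t, whose definition is not shown here.
using DiagT = std::int16_t;

// One step of the LCS recurrence as written in the kernels.
int lcsCell(int north, int west, int northWest, bool symbolsMatch) {
    return std::max(north,
                    std::max(west, northWest + (symbolsMatch ? 1 : 0)));
}

// The same step computed on values shifted down by `scaling`, stored in the
// narrow type, and shifted back up when the result leaves the tile.
int lcsCellScaled(int north, int west, int northWest, bool symbolsMatch,
                  int scaling) {
    const DiagT n = static_cast<DiagT>(north - scaling);
    const DiagT w = static_cast<DiagT>(west - scaling);
    const DiagT nw = static_cast<DiagT>(northWest - scaling);
    const DiagT r = static_cast<DiagT>(lcsCell(n, w, nw, symbolsMatch));
    return static_cast<int>(r) + scaling;
}
```

With boundary values around one million, which would not fit in 16 bits, `lcsCellScaled` still agrees with `lcsCell` as long as the differences from `scaling` stay small.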

B.1.4 SKERNELGLOBAL

/**
 * Computes the LCS cost-boundaries for the tiles defined in boxArray.
 *
 * @param boxArray  list of tiles to compute
 * @param xStr      X
 * @param yStr      Y
 * @param boundaryX boundaries in X direction
 * @param boundaryY boundaries in Y direction
 * @param n         input size
 * @param boxDim    a multiple of #threads per block
 */
__global__
__launch_bounds__(MAXTHREADSPERBLOCK)
void kernel_lcs_fp_wave_boundary_striving_scaling_globalStrings(
    const int2 *boxArray, char const * const xStr, char const * const yStr,
    int * const boundaryX, int * const boundaryY, int n, int boxDim) {

  int boxI = boxArray[blockIdx.x].x;
  int boxJ = boxArray[blockIdx.x].y;

  extern __shared__ diag_t dynSmemWOstringsSpecialized[];

  diag_t *subMatrix[3] = {
    &dynSmemWOstringsSpecialized[0],
    &dynSmemWOstringsSpecialized[boxDim],
    &dynSmemWOstringsSpecialized[boxDim*2]
  };
  diag_t *tmpPointer;

  int totalStrides = boxDim/blockDim.x;
  volatile int strideWidth = boxDim/totalStrides;
  volatile int totalDiagonals = boxDim*2 - 1;

  // pre-calculate index values for y/boundaryX
  volatile int boundaryXOffsetRead = n*(boxJ - 1) + boxDim * boxI;
  int boundaryXOffsetWrite = boundaryXOffsetRead + n;
  volatile int boundaryYOffsetRead = n*(boxI - 1) + boxDim * boxJ;
  int boundaryYOffsetWrite = boundaryYOffsetRead + n;

  // scaling
  int resultScaling = (boxJ == 0) ? 0 : boundaryX[boundaryXOffsetRead];

  // no sync needed
  // calculate the cost matrix
  for (int slice = 0; slice < totalDiagonals; ++slice) {

    // for each stride
    // stride counter is char, fix for register usage
    //#pragma unroll 8
    for (char stride = 0; stride < totalStrides; ++stride) {

      // update i,j
      int i = threadIdx.x + (stride*strideWidth);
      int j = slice - i;

      if (!(j < 0 || j >= boxDim)) {
        // calculate
        int northWestValue, result;
        int northValue = j == 0
            ? boundaryX[boundaryXOffsetRead + i] - resultScaling
            : subMatrix[NOMATCH][i]; // scaling
        int westValue = i == 0
            ? boundaryY[boundaryYOffsetRead + j] - resultScaling
            : subMatrix[NOMATCH][i - 1]; // scaling

        if (j == 0) {
          // border to the north
          northWestValue = boundaryX[boundaryXOffsetRead + i - 1]
              - resultScaling; // scaling
        } else if (i == 0) {
          // border to the west
          northWestValue = boundaryY[boundaryYOffsetRead + j - 1]
              - resultScaling; // scaling
        } else {
          // not on border, read from own cost values
          northWestValue = subMatrix[MATCH][i - 1];
        }

        result = max(
          westValue,
          northWestValue +
              (yStr[boxJ*boxDim + j] == xStr[boxI*boxDim + i])
        );
        result = max(northValue, result);

        subMatrix[RESULT][i] = result;
        // end of calculation

        // on south/east edge? Save in output boundary
        if (j == boxDim - 1) {
          // south edge, corner is only saved here and not in boundaryY
          boundaryX[boundaryXOffsetWrite + i] =
              subMatrix[RESULT][i] + resultScaling; // scaling
        } else if (i == boxDim - 1) {
          // east edge
          boundaryY[boundaryYOffsetWrite + j] =
              subMatrix[RESULT][i] + resultScaling; // scaling
        }
      }
    }

    // memory wrap around
    tmpPointer = subMatrix[2];
    subMatrix[2] = subMatrix[0];
    subMatrix[0] = subMatrix[1];
    subMatrix[1] = tmpPointer;

    __syncthreads();
  }
}

B.1.5 BPKERNEL

/**
 * Computes the backward pass using boundaries and pinned memory.
 *
 * @param xStr           X
 * @param yStr           Y
 * @param subBoundaryX   boundaries in X direction
 * @param subBoundaryY   boundaries in Y direction
 * @param boxDim         a multiple of #threads per block
 * @param subBoxDim      equal to #threads per block
 * @param hostLcs        pinned memory for the LCS string
 * @param globalLcsIndex index of current LCS char
 * @param current_trace  where are we in the global cost matrix
 * @param current_box    where are we in the current tile
 */
// Defines for bank conflict fix
#define JUMPCNT 4     // Number of entries within one bank.
#define WARPLENGTH 32 // number of threads in a warp
__global__ void kernel_lcs_bp(char *xStr, char *yStr, int *subBoundaryX,
    int *subBoundaryY, int boxDim, int subBoxDim, char *hostLcs,
    int *globalLcsIndex, int2 *current_trace, int2 *current_box) {

  // Bank conflict mapping, mod operations can be optimized by compiler.

  /* index of a warp within a warpCollection */
  int subWarpIndex = (threadIdx.x/WARPLENGTH)%JUMPCNT;

  // collection index. one collection calcs i=0..warpsize*entriesPerBank
  int warpCollectionIdx = threadIdx.x/(WARPLENGTH*JUMPCNT);

  // all threads IDs are relative to a warp
  int warpIndex = threadIdx.x%WARPLENGTH;
  int i = (warpIndex*JUMPCNT) + subWarpIndex + warpCollectionIdx*128;

  /* bank conflicts fixed */

  // shared y/x strings with the width of a subBoxDim
  __shared__ char xStr_shared[MAX_BACKWARD_PASS_N];
  __shared__ char yStr_shared[MAX_BACKWARD_PASS_N];

  // DPM for the current subBox
  __shared__ char subMatrix[MAX_BACKWARD_PASS_N*MAX_BACKWARD_PASS_N];

  int totalDiagonals = subBoxDim*2 - 1;
  int subBoundaryXOffsetRead;
  int subBoundaryYOffsetRead;

  // index to current subBox
  __shared__ int2 subBox; // volatile
  int2 subTrace;

  int lcsIndex = *globalLcsIndex;

  // initialize thread 0 variables
  if (i == 0) {
    subBox.x = floor((float) (current_trace->x % boxDim) / subBoxDim);
    subBox.y = floor((float) (current_trace->y % boxDim) / subBoxDim);

    subTrace.x = current_trace->x % subBoxDim;
    subTrace.y = current_trace->y % subBoxDim;
  }

  __syncthreads(); // make sure all is in sync before the while loop

  // loop until we are outside the matrix
  while (subBox.x >= 0 && subBox.y >= 0) {
    // Copy X-string and Y-string needed for the current box
    xStr_shared[i] = xStr[subBox.x*subBoxDim + i];
    yStr_shared[i] = yStr[subBox.y*subBoxDim + i];

    // index to sub boundary saved in the forward pass for the current subBox
    subBoundaryXOffsetRead = boxDim*(subBox.y - 1) + subBoxDim * subBox.x;
    subBoundaryYOffsetRead = boxDim*(subBox.x - 1) + subBoxDim * subBox.y;

    int resultScaling = subBoundaryX[subBoundaryXOffsetRead];

    // sync all threads in tile
    __syncthreads();

    // calculate the sub-matrix for the current subBox
    for (int slice = 0; slice < totalDiagonals; ++slice) {
      int j = slice - i;
      int resultOffset = j*subBoxDim + i;

      if (!(j < 0 || j >= subBoxDim)) {
        int northWestValue, result;
        int northValue = j == 0
            ? subBoundaryX[subBoundaryXOffsetRead + i] - resultScaling
            : subMatrix[resultOffset - subBoxDim];

        int westValue = i == 0
            ? subBoundaryY[subBoundaryYOffsetRead + j] - resultScaling
            : subMatrix[resultOffset - 1];

        if (j == 0) {
          // border to the north
          northWestValue = subBoundaryX[subBoundaryXOffsetRead + i - 1]
              - resultScaling;

          // boxJ > 0 could be removed by adding one extra value to boundaryY
          northWestValue = (i == 0 && subBox.x == 0 && subBox.y > 0)
              ? subBoundaryY[subBoundaryYOffsetRead + j - 1] - resultScaling
              : northWestValue;

        } else if (i == 0) {
          // border to the west
          northWestValue = subBoundaryY[subBoundaryYOffsetRead + j - 1]
              - resultScaling;
        } else {
          // not on border, read from own sub-matrix values
          northWestValue = subMatrix[resultOffset - subBoxDim - 1];
        }

        result = max(
          westValue,
          northWestValue + (yStr_shared[j] == xStr_shared[i])
        );
        result = max(northValue, result);

        subMatrix[resultOffset] = result;

      }

      __syncthreads(); // sync all threads after each diagonal
    }

    // sub-matrix is done, do backward pass using one thread

    // one thread backtrace
    if (i == 0) {
      // backtrace

      while (subTrace.x >= 0 && subTrace.y >= 0) {
        // inside tile

        if (yStr_shared[subTrace.y] == xStr_shared[subTrace.x]) {
          // match, save result, local and global
          hostLcs[lcsIndex] = yStr_shared[subTrace.y];
          lcsIndex--;
          subTrace.x--;
          subTrace.y--;
        } else {
          // no match, go north or west?
          int northValue = subTrace.y == 0
              ? subBoundaryX[subBoundaryXOffsetRead + subTrace.x] - resultScaling
              : subMatrix[(subTrace.y - 1)*subBoxDim + subTrace.x]; // scaling
          int westValue = subTrace.x == 0
              ? subBoundaryY[subBoundaryYOffsetRead + subTrace.y] - resultScaling
              : subMatrix[subTrace.y*subBoxDim + subTrace.x - 1]; // scaling

          if (northValue >= westValue) {
            subTrace.y--; // go north
          } else {
            subTrace.x--; // go west
          }
        } // end if
      } // while end (outside tile)

      // done, on border - update boxI/J and subTrace.x/J
      if (subTrace.x == -1) {
        // go west
        subBox.x--;
        // flip trace I
        subTrace.x = subBoxDim - 1;
      }

      if (subTrace.y == -1) {
        // go north
        subBox.y--;
        // flip trace J
        subTrace.y = subBoxDim - 1;
      }

    } // one thread is done with the trace for current subTile

    __syncthreads(); // backward trace done for current subTile, sync
  } // all is done

  // thread 0 updates global lcs index and pinned traces
  if (i == 0) {
    *globalLcsIndex = lcsIndex;

    // update current trace based on subTrace and box...
    current_trace->x =
        (current_box->x*boxDim + subBox.x*subBoxDim + subTrace.x);
    current_trace->y =
        (current_box->y*boxDim + subBox.y*subBoxDim + subTrace.y);
  }
}
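The index arithmetic at the top of the backward pass kernel remaps each thread ID to a column index `i`, which the source comments describe as a bank conflict fix. The host-side C++ sketch below reproduces only that arithmetic, so the mapping can be inspected without a GPU. The helper names `mappedIndex` and `coversEachIndexOnce` are hypothetical, and the literal 128 from the kernel is written as `kWarpLength * kJumpCnt` on the assumption that this is where it comes from. The sketch does not model shared memory banks and makes no claim about the resulting access pattern.

```cpp
#include <vector>

constexpr int kJumpCnt = 4;      // JUMPCNT in the kernel
constexpr int kWarpLength = 32;  // WARPLENGTH in the kernel

// Column index i that the kernel assigns to thread `tid`.
int mappedIndex(int tid) {
    const int subWarpIndex = (tid / kWarpLength) % kJumpCnt;
    const int warpCollectionIdx = tid / (kWarpLength * kJumpCnt);
    const int warpIndex = tid % kWarpLength;  // lane within the warp
    return warpIndex * kJumpCnt + subWarpIndex +
           warpCollectionIdx * (kWarpLength * kJumpCnt);
}

// True if threads 0..threads-1 map onto indices 0..threads-1 exactly once.
bool coversEachIndexOnce(int threads) {
    std::vector<bool> seen(threads, false);
    for (int tid = 0; tid < threads; ++tid) {
        const int i = mappedIndex(tid);
        if (i < 0 || i >= threads || seen[i]) return false;
        seen[i] = true;
    }
    return true;
}
```

Within one warp, neighbouring lanes receive indices `kJumpCnt` apart, and the four warps of a collection interleave to fill the gaps. The mapping is a permutation of the thread IDs only when the block size is a multiple of 128, so under this reading `subBoxDim` would need to satisfy the same constraint.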
