algorithms for string comparison on gpus

ALGORITHMS FORSTRING COMPARISON ON GPUS

Kenneth Skovhus Andersen, s062390Lasse Bach Nielsen, s062377

!

!

!

!

!

!

Template!Generic!Prose!Document!

DD.MM.YYY,!vX.Y!

Harald!Störrle!!!!!!!!

Institute!for!Informatics!and!Mathematical!Modelling!

Technical!University!of!Denmark!

!

T1!Technical University of Denmark

Informatics and Mathematical Modelling

Supervisors: Inge Li Gørtz & Philip Bille

August, 2012

DTU InformaticsDepartment of Informatics and Mathematical ModelingTechnical University of DenmarkAsmussens Alle, Building 305, DK-2800 Lyngby, DenmarkPhone +45 4525 3351, Fax +45 4588 [email protected]

ABSTRACT

We consider parallelization of string comparison algorithms, including se-quence alignment, edit distance and longest common subsequence. Theseproblems are all solvable using essentially the same dynamic programmingscheme over a two-dimensional matrix, where an entry locally depends onneighboring entries. We generalize this set of problems as local dependencydynamic programming (LDDP).

We present a novel approach for solving any large pairwise LDDP prob-lem using graphics processing units (GPUs). Our results include a new supe-rior layout for utilizing the coarse-grained parallelism of the many-core GPU.The layout performs up to 18% better than the most widely used layout. Toanalyze layouts, we have devised theoretical descriptions, which accuratelypredict the relative speedup between different layouts on the coarse-grainedparallel level of GPUs.

To evaluate the potential of solving LDDP problems on GPU hardware,we implement an algorithm for solving longest common subsequence. In ourexperiments we compare large biological sequences, each consisting of twomillion symbols, and show a 40X speedup compared to a state-of-the-art se-quential CPU solution by Driga et al. Our results can be generalized on severallevels of parallel computation using multiple GPUs.

iii

RESUME

Vi betragter parallelisering af algoritmer til sammenligning af strenge, herun-der sequence alignment, edit distance og longest common subsequence. Disseproblemer kan alle løses med en todimensional dynamisk programmerings-matrix med lokale afhængigheder. Vi generaliserer disse problemer til localdependency dynamic programming (LDDP).

Vi præsenterer en ny tilgang til at løse store parvise LDDP-problemer medgrafikprocessorer (GPU’er). Ydermere har vi udviklet et nyt layout til ud-nyttelse af GPU’ens multiprocessorer. Vores nye layout forbedrer køretidenmed op til 18% i forhold til tidligere layouts. Til analyse af et layouts egensk-aber, har vi udviklet teoretiske beskrivelser, der præcist forudsiger den rela-tive køretidsforbedring mellem forskellige layouts.

For at vurdere GPU’ens potentiale til at løse LDDP-problemer, har vi im-plementeret en algoritme, som løser longest common subsequence. I voreseksperimenter sammenligner vi lange biologiske sekvenser, der hver bestar afto millioner symboler. Vi viser mere end 40X hastighedsforøgelse i forhold tilen state-of-the-art sekventiel CPU-løsning af Driga et al. Vores resultater kangeneraliseres pa flere niveauer af parallelitet ved brug af flere GPU’ere.

v

PREFACE

This master’s thesis has been prepared at DTU Informatics at the TechnicalUniversity of Denmark from February to August 2012 under supervision ofassociate professors Inge Li Gørtz and Philip Bille. It has an assigned work-load of 30 ECTS credits for each of the two authors.

The thesis deals with the subject of local dependency dynamic program-ming algorithms for solving large scale string comparison problems on mod-ern graphical processing units (GPUs). The focus is to investigate, combineand further develop existing state of the art algorithms.

Acknowledgments

We would like to thank our supervisors for their guidance during the project.A special thanks to PhD student Morten Stockel at the IT University of Copen-hagen for providing the source code for sequential string comparison algo-rithms [1] and PhD student Hjalte Wedel Vildhøj at DTU Informatics for hisvaluable feedback.

Lasse Bach Nielsen Kenneth Skovhus Andersen

August, 2012

vii

CONTENTS

Abstract iii

Resume v

Preface vii

1 Introduction 11.1 This Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Local Dependency Dynamic Programming 52.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Previous Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Our Approach Based on Previous Results . . . . . . . . . . . . . 9

3 Graphics Processing Units 113.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.3 Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Parallel Layouts for LDDP 154.1 Diagonal Wavefront . . . . . . . . . . . . . . . . . . . . . . . . . . 164.2 Column-cyclic Wavefront . . . . . . . . . . . . . . . . . . . . . . 174.3 Diagonal-cyclic Wavefront . . . . . . . . . . . . . . . . . . . . . . 184.4 Applying Layouts to the GPU Architecture . . . . . . . . . . . . 194.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 20

5 Implementing LDDP on GPUs 235.1 Grid-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.2 Thread-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.3 Space Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Experimental Results for Grid-level 316.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 Experimental Results for Thread-level 377.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377.2 Results for Forward Pass Kernels . . . . . . . . . . . . . . . . . . 377.3 Results for Backward Pass Kernel . . . . . . . . . . . . . . . . . . 417.4 Part Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

ix

CONTENTS

8 Performance Evaluation 438.1 The Potential of Solving LDDP Problems on GPUs . . . . . . . . 438.2 Comparing to Similar GPU Solutions . . . . . . . . . . . . . . . . 44

9 Conclusion 459.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Bibliography 47

Appendices 51

A NVIDIA GPU Data Sheets 53A.1 NVIDIA Tesla C2070 . . . . . . . . . . . . . . . . . . . . . . . . . 53A.2 NVIDIA GeForce GTX 590 . . . . . . . . . . . . . . . . . . . . . . 54

B Kernel Source Code 55B.1 Forward pass kernels . . . . . . . . . . . . . . . . . . . . . . . . . 55

x

1 INTRODUCTION

We revisit the classic algorithmic problem of comparing strings, includingsolving sequence alignment, edit distance and finding the longest commonsubsequence. In many textural information retrieval systems, the exact com-parison of large-scale strings is an important, but very time consuming task.As an example, the exact alignment of huge biological sequences, such asgenes and genomes, has previously been infeasible due to computing andmemory requirements. Consequently, much research effort has been investedin faster heuristic algorithms1 for sequence alignment. Although these meth-ods are faster than exact methods, they come at the cost of sensitivity. How-ever, the rise of new parallel computing platforms such as graphics processingunits, are able to change this scenario.

Graphics processing units (GPUs) are designed for graphics applications,having a large degree of data parallelism using hundreds of cores, and aredesigned to solve multiple independent parallel tasks. Previous results foraccelerating sequence alignment using GPUs show a significant speedup, butare currently focused on aligning many independent short sequences, a prob-lem where the GPU architecture excel. Our focus, is based on the need to solvelarge-scale exact pairwise string comparison of biological sequences contain-ing millions of symbols.

Our work is motivated by the increasing power of GPUs, and the chal-lenge of making exact comparison of large strings feasible. We consider par-allelization of a general set of pairwise string comparison algorithms, all solv-able using essentially the same dynamic programming scheme over a twodimensional-matrix. Taking two input strings X and Y of the same length n,these problems can be solved by computing all entries in an n ⇥ n matrix us-ing a specific cost function. Computation of an entry in the matrix dependson the neighboring entries. We generalize all these problems as local depen-dency dynamic programming (LDDP). In general, LDDP problems are not triv-ially solved in parallel, as the local dependencies gives a varying degree ofparallelism across the entire problem space.

The parallelism of a GPU is exposed as a coarse-grained grid of blocks,where each block consists of finer-grained threads. We call these levels of par-allelism the grid- and thread-level. We focus on layouts as a mean to describehow LDDP problems can be mapped to the different levels on the GPU.

1One of the first heuristic algorithms for sequence alignment was FASTA, presented by Pear-son and Lipman in 1988 [2].

1

1. INTRODUCTION

1.1 This Report

We start by presenting a short description of our work. The following chap-ter gives a theoretical introduction to LDDP problems, including a survey ofprevious sequential and parallel solutions. Based on this, we select a set ofstate-of-the-art algorithms as a basis for our new GPU solution. We then in-troduce the GPU architecture, and the programming model. This is followedby a chapter describing layouts for distributing LDDP problems to parallelcompute units. The chapter also introduces our main result: a new layout forimproving the performance of solving LDDP problems on GPUs, targeted atthe GPU grid-level.

In the following we describe our implementation on the NVIDIA GPUarchitecture, and what design considerations have been made. Finally, theexperimental results examine the practical performance of our LDDP imple-mentation on GPUs.

1.2 Previous Work

Currently there are several GPU solutions for solving LDDP problems. Manyof these implement the Smith-Waterman algorithm for the local alignmentproblem [3, 4, 5, 6, 7, 8, 9, 10]. The solutions achieve very high utilizationof GPU multiprocessors by comparing a large number of short sequences,thereby solving multiple independent LDDP problems.

Existing GPU solutions for longest common subsequence [11, 12] are ableto compare rather large sequences by decomposing the LDDP problem intosmaller subproblems, called tiles, and process these in an anti-diagonal man-ner on the GPU multiprocessors. This widely used layout, for mapping tilesonto compute units, is referred to as Diagonal Wavefront. It assigns all tilesin an anti-diagonal for computation and continues to the next anti-diagonal,once all tiles have been computed.

Galper and Brutlag [13] presented another layout, Column-cyclic Wave-front,2 which improves resource utilization compared to the widely used Di-agonal Wavefront. Krusche and Tiskin [14] used a similar layout on the Bulk-Synchronous Parallelism Model [15] for solving large-scale LDDP, mappingeach column in the tiled LDDP problem to a processor.

1.3 Our Results

We present an algorithm for solving large pairwise LDDP problems usingGPUs. Our main result is a new layout Diagonal-cyclic Wavefront for dis-tributing a decomposed LDDP problem onto the GPU. Let an n ⇥ n matrix bedecomposed into k ⇥ k equally sized tiles, each computable on a GPU multi-processor. The mapping of tiles onto multiprocessors take place on the grid-level, and the computation of entries inside a tile is done on the thread-level.

2Originally called Row Wavefront Approach (RWF).

2

1.3. Our Results

To theoretically evaluate our new layout Diagonal-cyclic Wavefront withthe widely used Diagonal Wavefront and Column-cyclic Wavefront, we ex-amine the utilization of each. The utilization of a layout is the fraction of tilescomputed where all multiprocessors are fully utilized to the total number oftiles. Theoretically, we show that our new layout Diagonal-cyclic Wavefront,in general, achieves the best utilization—depicted in Figure 1.1.

Our experiments confirm the theoretical analysis; our new layout is su-perior to the widely used Diagonal Wavefront, and achieves a performancespeedup up to 18% for any LDDP problem. Furthermore, for various inputsizes, our new layout generally outperforms Column-cyclic Wavefront. Weshow, that our theoretical descriptions of the layouts give very accurate pre-dictions on how the architecture behaves. In general, the new layout Diagonal-cyclic Wavefront should always be used for distributing large LDDP problemsonto the GPU grid-level.

To explore the potential of solving LDDP problems on GPUs, we imple-ment a set of kernels for solving longest common subsequence. A kernel de-fines the computation of individual tiles. For simplicity, we focus on Diag-onal Wavefront for distributing computable entries to threads. To determinethe best kernel parameters, e.g., tile size and number of threads, we conductautomatic performance tuning [16]. We present a scaling technique for space-reduction of cost values inside a tile, giving a better kernel performance. Scal-ing can be applied to a subset of LDDP problems. Besides, we have investi-gated a rather undocumented feature of the CUDA compiler, the volatile

keyword. Depending on placement, we have observed up to 10% perfor-mance increase.

To evaluate whether the problem is viable to solve on a GPU, we com-pare our results to state-of-the-art sequential CPU solutions. The experimentsshows an average of 40X performance advantage over the sequential CPUsolution by Driga et al. [17] for comparing strings larger than 219 using theNVIDIA Tesla C2070 GPU and a Intel i7 2.66GHz CPU. To the best of ourknowledge, our LDDP implementation supports the largest input size n forGPUs in literature, up to 221.

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

50

60

70

80

90

100

Util

izat

ion

U(%

)

Utilization of k ⇥ k tiles using p = 14 compute units

Diagonal-cyclic WavefrontColumn-cyclic WavefrontDiagonal Wavefront

Figure 1.1: Utilization of the three layouts.

3

2 LOCAL DEPENDENCY DYNAMIC PROGRAMMING

A large group of string comparison problems can be solved using essentiallythe same scheme over a two-dimensional dynamic programming matrix (DPM),where an entry (i, j) in the matrix depends on at most three neighboring en-tries. These include widely-used string problems in bioinformatics such asedit distance, sequence alignment and longest common subsequence. We re-fer to all these problems as local dependency dynamic programming (LDDP) prob-lems.

2.1 Definitions

Let X and Y be the input strings with characters from a finite alphabet S, forsimplicity we assume equal string length, i.e., |X| = |Y| = n. The character atposition i in X is denoted X[i].

2.1.1 Local Dependency Dynamic Programming

Given the input strings X and Y, an LDDP problem can be solved by filling a(n + 1) ⇥ (n + 1) DPM, denoted c. The entry c[i, j] depends on at most threeneighboring entries, c[i � 1, j � 1], c[i, j � 1], c[i � 1, j] and the characters X[i]and Y[j]. We let parent(i, j) denote the neighboring entries that determinesc[i, j]. In general, the recurrence for solving an LDDP problem is:

c[i, j] =

(b(i, j) if i = 0 _ j = 0,f (X[i], Y[j], parent(i, j)) if i, j > 0

(2.1)

The function b initializes the north and west border of the DPM c in time O(1)for each entry. The function f (X[i], Y[j], parent(i, j)) computes the solution tothe subproblem c[i, j] in time O(1) as it depends on three neighboring entriesand input characters X[i] and Y[j]. The forward pass computes the length ofthe optimal path by filling the DPM, and backward pass finds the optimal pathby backtracking through the DPM.

2.1.2 Longest Common Subsequence

For simplicity, we use the longest common subsequence (LCS) problem as a casestudy, although all techniques and results presented generalize to any LDDPproblem. We define the problem as follows:

Let X[i, j] denote the substring of X from position i to j. A subsequence ofX is any string Z with zero or more elements left out of X. We say that Z is

5

2. LOCAL DEPENDENCY DYNAMIC PROGRAMMING

a common subsequence of X and Y if Z is a subsequence of both X and Y. Thelongest common subsequence problem for two input strings X and Y, is tofind a maximum-length common subsequence of both X and Y.

Given two strings X and Y where |X| = |Y| = n, the standard dynamicprogramming solution to LCS fills a (n + 1) ⇥ (n + 1) dynamic programmingmatrix c using the following recurrence [18]:

c[i, j] =

8><

>:

0 if i = 0 _ j = 0,c[i � 1, j � 1] + 1 if i, j > 0 ^ X[i] = Y[j],max ( c[i, j � 1], c[i � 1, j] ) if i, j > 0 ^ X[i] 6= Y[j]

(2.2)

The length of the LCS between X[1, i] and Y[1, j] is c[i, j], therefor the length ofthe LCS of X and Y is c[n, n]. To compute the forward pass, the algorithm useO(n2) time and space.

The solution path and thus the LCS is deduced by backtracking from c[n, n]to some c[i0, j0] where i0 = 0 _ j0 = 0. For a given entry c[i, j] the backward passdetermines in O(1) time which of the three values in parent(i, j) that was usedto compute c[i, j]. The complete LCS is reconstructed in O(n) time, when allcost values in the DPM are available.

2.2 Previous Results

We start by presenting an overview of previous results for solving LDDP prob-lems in general. We divide the findings in sequential and parallel solutions.

2.2.1 Sequential Solutions

Wagner and Fischer 1974 [19] presented one of the first dynamic program-ming solutions to the Levenshtein distance problem using O(n2) timeand space. We call this a full-matrix algorithm as it stores the completeDPM. Needleman-Wunch 1970 [20] and Smith-Waterman 1981 [21] pre-sented other examples of full-matrix algorithms for LDDP problems.

Hirschberg 1975 [22] improved space at the cost of increased time for thebackward pass. They used a divide and conquer approach by combininga standard and reverse application of the linear space cost-only variationto find a partitioning midpoint. Although the original solution was pre-sented for LCS, Myers and Miller [23] generalized it to sequence align-ment in 1988. The algorithm uses O(n2) time, O(n) space and O(n2)recomputations.

Driga et al. 2006 [17] presented their cache-aware Fast Linear-Space Align-ment (FLSA). It divides the DPM into k2 equally sized tiles, as shownin Figure 2.1. All tiles share boundaries of intersecting cost-values. Thetime-space tradeoff parameter k is selected so the problem space in a tilecan be computed using full-matrix. The forward pass fills the bound-aries. The backward pass uses the boundaries to compute the optimal

6

2.2. Previous Results

path by processing only tiles that intersect the optimal path. The algo-rithm implements backward pass optimization which reduces the sizeof the tiles according to the entry point of the optimal path. FLSA usesO(n2) time, O(nk) space and O( n2

k ) recomputations.

Chowdhury and Ramachandran 2006 [24] also tiled the DPM, but reducedI/O bound by splitting the DPM into four tiles, then recursively com-pute each tile. Unlike Driga et al. [17], the algorithm is cache-oblivious[25]. The algorithm uses O(n2) time and O(n) space. As the backwardpass intersect at most 3/4 tiles, it performs O(n2) recomputations.

Bille and Stockel 2012 [1] combined the k2 tiles from Driga et al. [17] with re-cursive and cache-oblivious approach from Chowdhury and Ramachan-dran [24]. Experiments showed a superior performance over Chowd-hury and a comparable performance to Driga. The algorithm uses O(n2)

time, O(nk) space and O( n2

k ) recomputations.

All presented algorithms solve LDDP string comparison problems in gen-eral. For specific LDDP problems, specialized solutions exist improving com-plexity or space bounds by restricting the problem in terms of alphabet size,cost function or by exploiting properties of a specific LDDP problem—see e.g.,[26, 27, 28, 29] and the surveys [30, 31].

1

2

3

4

2

3

4

5

3

4

5

6

4

5

6

7

X

Y

input character

stored valuet =

nk

Figure 2.1: Decomposition of LDDP problems using tiled linear-space reduction as presented byDriga et al. [17]. The DPM is divided into k2 equally sized tiles sharing boundaries of intersectingcost-values. The forward pass of a tile receives as input the boundaries on north and west, andoutputs the south and east boundaries. The backward pass uses the stored boundaries to computethe optimal path by processing only tiles that intersect the optimal path. In a parallel context, thenumbers inside each tile refer to the order in which the tiles can be calculated during a forwardpass.

7

2. LOCAL DEPENDENCY DYNAMIC PROGRAMMING

2.2.2 Parallel Solutions

Several theoretical results for LDDP are based on the Parallel Random AccessMachine (PRAM) model [32], which ignores memory hierarchy, memory la-tency and cost of synchronization. As an example, Mathies [33] shows analgorithm for determining edit distances for two strings of size m and n inO(log m log n) time for mn processors. Although these results show the ex-tend of parallelism, their assumptions, that the number of processors is in theorder of the problem size, make their algorithm impractical.

In general, to solve an LDDP problem in parallel, the problem space mustbe decomposed and distributed to compute units. Let an n ⇥ n DPM be decom-posed into k2 equally sized square tiles of size t = n/k. A layout defines theorder in which the computation of tiles in the DPM is performed.

Parallel Solutions for CPU

Galper and Brutlag 1990 [13] presented the layout Column-cyclic Wavefront(originally called Row Wavefront Approach) for efficiently solving LDDPproblems on a shared-memory multiprocessor. The layout is examinedand analyzed in chapter 4 on page 15.

Krusche and Tiskin 2006 [14] used a similar layout as Galper and Brutlag tofind the length of the longest common subsequence using the Bulk Syn-chronous Parallelism Model (BSP) [15]. Their algorithm decomposes theDPM into rectangular tiles similar to Driga et al. [17], and sequentiallycomputes the values inside the tile.

Driga et al. 2006 [17] presented a parallel version of their linear-space FLSAalgorithm for CPU multicore systems. The algorithm computes the tiledDPM by advancing in a diagonal wavefront pattern, called the DiagonalWavefront layout. The computation flow is shown in Figure 2.1. Theirexperiments showed a linear speedup up to 8 processors for sequenceswhere n < 219.

Chowdhury and Ramachandran 2008 [34] showed a general cache-efficientrecursive multicore algorithm for solving LDDP problems. They con-sidered three types of caching models for chip multiprocessors (CMP) in-cluding private, shared and multicore. Performance tests for two LDDPproblems pairwise sequence alignment with affine gap cost and median of 3sequences, again with affine gap penalty, solved using their CMP algorithm,showed a 5 times speedup on a 8-core multiprocessor.

Diaz et al. 2011 [35] implemented Smith-Waterman and Needleman-Wunschon the Tilera Tile64 processor having 64 cores. They based their parallelalgorithm on FLSA by Driga et al. [17]. Their implementation achievedup to 15 times performance increase compared to the same algorithm onan x86 multicore architecture.

8

2.3. Our Approach Based on Previous Results

Parallel Solutions for GPU

Currently there are several GPU solutions to LDDP problems, but we foundthe Smith-Waterman algorithm for local alignment to be the most explored.The most important is listed here:

Liu, W. et al. 2006 [3, 4] presented the first solution to Smith-Waterman on aGPU, and achieved a very high utilization of GPU multiprocessors bycomparing a large number of independent short sequences. This meansthat they solve multiple independent LDDP problems with no depen-dencies on a GPU grid-level. To reduce space when computing the op-timal length of a n2 cost matrix, they only store three separate buffers oflength n holding cost values for the most recent calculated diagonals—we call this linear space reduction three cost diagonals. Similar solutionswere presented in 2008–2009 [5, 6, 7].

Liu, Y. et al. 2010 [8, 9] presented CUDASW++, reported to perform up to 17billion cells update per second (GCUPS) on a single-GPU GeForce GTX280 for solving Smith-Waterman. We note that CUDASW++ use theColumn-cyclic Wavefront layout on the thread-level. No backtrackingis made, and their algorithm does not generalize to large LDDP.

Although many GPU solutions for Smith-Waterman were found, they are onlyable to compare strings of size n < 216. As a result, they are not applicable forcomparing large biological sequences considered in this report.

For solution of large LDDP problems, we found two interesting GPU im-plementations of Longest Common Subsequence (LCS):

Kloetzli et al. 2008 [11] presented a combined CPU/GPU solution for solv-ing LCS of large sequences (up to n 220). They showed a five-foldspeedup over the cache-oblivious single processor algorithm presentedby Chowdhury and Ramachandran [24]. The experiments were per-formed on an AMD Athlon 64 and an NVIDIA G80 family GTX GPU.

Deorowicz 2010 [12] calculates the length of LCS on large sequences. Theiralgorithm decomposes the problem space into tiles like Driga et al. [17],and calculates the tiles using the Diagonal Wavefront layout. Their ex-periments show a significant speedups obtained over their own serialCPU implementation of the same algorithm for n = 216. Unfortunatelyno comparison is made for any known CPU solutions, and, despite hav-ing tried, we have not been able to obtain the source code.

2.3 Our Approach Based on Previous Results

We now select relevant results, for our further investigations. As a basis forour LDDP solution, we use the tiling approach by Driga et al. [17] to achieve alinear-space reduction and decomposition of the problem space. Furthermore,we wish to investigate the properties and efficiencies of the layouts DiagonalWavefront [12, 17] and Column-cyclic Wavefront [13, 14]. The three cost diag-onals presented by Liu, W. et al. [3, 4] is explored for space reduction.

9

3 GRAPHICS PROCESSING UNITS

In this chapter we introduce relevant aspects of graphics processing unit ar-chitectures and the programming model exposing the hardware.

Where central processing units (CPUs) are highly optimized for solving awide range of single-threaded applications, GPUs are built for graphics ap-plications having a large degree of data parallelism. Graphics applicationsare also latency tolerant, as the processing of each pixel can be delayed aslong as frames are processed at acceptable rates. As a result, GPUs can tradeoff single-thread performance for increased parallel processing. As a conse-quence, each processing element on the GPU is relatively simple and hun-dreds of cores can be packed per die. [36]

There are currently several frameworks exposing the computation powerof GPUs, including ATI Stream, Open Computing Language (OpenCL) andNVIDIA’s Compute Unified Device Architecture (CUDA) [37]. For our imple-mentation we choose to work on NVIDIA CUDA.

3.1 GPU Architecture

A GPU is composed of a number of streaming multiprocessors (SM) each havinga number of compute units called streaming processors (SP) running in lockstep.The number of streaming multiprocessors differ between GPU models, but asan example the NVIDIA Tesla GPUs has 14 SMs each with 32 SPs, totaling 448SPs. See hardware specifications in Appendix A.

The architecture of a GPU is akin to Single Instruction Multiple Data (SIMD),however, a GPU refines the SIMD architecture into Single Instruction Multi-ple Thread (SIMT). Instructions are issued to a collection of threads called awarp. SIMT allows individual execution paths of threads to diverge as a resultof branching. If threads within a warp diverge, the warp will serialize eachpath taken by the threads. [37]

Compared to CPU-threads, threads on a GPU are lightweight and handledin hardware. Register memory for individual threads is kept in the SM registermemory, making hardware-based context switching possible at no cost.

Warps are not scheduled to run until data is available to all threads withinthe warp, making it possible to hide memory latency.

11

3. GRAPHICS PROCESSING UNITS

3.1.1 CUDA Programming Model

The programming model provides two levels of parallelism, coarse and fine-grained. On the coarse-grained grid-level, partitioning of work is done by di-viding the problem space into a grid consisting of a number of blocks. A blockis mapped to a symmetric multiprocessor, and represents a task which can besolved independently. On the fine-grained thread-level concurrent threads areassigned to a block, and provide data and instruction parallelism. The levelsof parallelism are depicted in Figure 3.1.

Grid

Block 1 Block ... Block gridDim

Warp 1

InstructionsWarp ...Instructions

Warp n

Instructions

Threads 1..32

Kernel with size (gridDim, blockDim) and a set of instructions.

Thread i..blockDim

Figure 3.1: Taxonomy of the CUDA work partitioning hierarchy.

A kernel function sets the partitioning parameters and defines the instructionsto be executed.

If the available resources of an SM allows it, multiple blocks can be allo-cated on an SM. This way, the hardware resources are better utilized.

Synchronization Primitives

Each level has different means of synchronizing.

Grid-level No specific synchronization primitive is available to handle syn-chronization between blocks as concurrent blocks represent indepen-dent tasks. Implicit synchronization can, however, be achieved by anumber of serialized kernel calls.

Thread-level CUDA only supports barrier synchronization for all threadswithin a block.

12

3.2. Memory

3.2 Memory

Global Memory

StreamingMultiprocessor

Shared Memory

Streaming Processors

Registers

Texture Cache

Constant Cache

L2 Cache

L1 Cache

Texture

Constant

Figure 3.2: CUDA memory spaces accessible froma streaming processor (SP). Please note that for sim-plicity only a single SM and a single SP are shown.

GPU memories are shown in Fig-ure 3.2. Registers are used as pri-vate memory for threads, while allthreads within a block have ac-cess to shared memory. All threadsacross blocks have access to globalmemory and the read-only textureand constant memory.

Two levels of caches, L1 andL2, exist. Both caches are remark-ably smaller than typical cacheson CPUs. Each SM is equippedwith its own L1 cache that residesin shared memory. The L2 cacheis shared between all SM as a fullycoherent unified cache, with cachelines of 128 bytes. As shown inFigure 3.2, L1 and L2 caches areused for global memory. The spe-cial caches; texture and constantmemory can be mapped to spe-cific parts of global memory, and provide specialized cached access patternsto these parts of global memory. The CUDA memory types and their traits areshown in Table 3.1.

For accessing global memory, the number of memory transactions per-formed will be equal to the number of L2 cache lines needed to completelysatisfy the request.

Shared memory is divided into equal sized memory banks which can beaccessed simultaneously. Concurrent memory access, which falls into distinctbanks, can be handled simultaneously, whereas concurrent access to the samebank will cause serialized access—referred to as bank conflicts.

Access time for texture and constant cache depends on access patterns, butconstant cache is stated to be as fast as reading from a register, as long as allthreads read the same address [38].

Type Location on SM Cached Access Scope Access latency(non-cached)

Register yes n/a R/W 1 thread 0-1

Shared yes n/a R/W threads in block 1

Local no yes R/W 1 thread 400-600

Global no yes R/W all threads + host 400-600

Constant no yes R all threads + host 400-600

Texture no yes R all threads + host 400-600

Table 3.1: Memory types in CUDA. n/a stand for ”not applicable”, R for read and W for write.The documented access latencies is given in cycles. [38]

13

3. GRAPHICS PROCESSING UNITS

3.3 Best Practices

A number of best practices to effectively exploit the GPU architecture are de-scribed by NVIDIA [38]. The most important are presented here:

Shared memory should be used when possible, as shared memory is fasterthan global memory. Values which are accessed often should be placedin shared memory.

Global memory to compute ratio should be maximized, as global memoryaccess is slow, while parallel computation is fast.

Minimize kernel branch diversion because divergent branches means seri-alized execution for each divergent branch.

14

4 PARALLEL LAYOUTS FOR LDDP

We will now present how LDDP problems can be computed in parallel. Fromrecurrence 2.1 on page 5, an entry (i, j) in the DPM is computable if neighboringentries to the west, north-west and north have been computed. Thus, for anyentry (i, j) to be computable, the data dependencies (id, jd) are where 0 id <i ^ 0 jd < j. Due to these data dependencies, the order in which the DPMcan be computed in parallel follows anti-diagonal lines, a pattern known aswavefront parallelism [17]. A wavefront Wd consists of the set of entries (i, j)where d = i + j + 1 and the number of entries in the set is denoted |Wd|.The dependencies give a varying degree of parallelism across the DPM, aninherent property of all LDDP problems.

In our context and modeling of the problem, parallel computations areexecuted as a sequence of steps, each consisting of computations and commu-nications followed by a barrier synchronization. Once all compute units reachthe barrier in step si they proceed to the next step si+1. The total number ofsteps to complete a computation is denoted S. This corresponds to the termsupersteps in Valiant’s BSP model [15].

To partition the LDDP problem for parallel computation and reduce thespace usage, we use Driga’s tiling algorithm [17]. The n ⇥ n DPM is decom-posed into k2 equally sized tiles of size t = n/k. The data dependency forentries applies to tiles as well. After decomposing the LDDP problem, com-putable tiles are mapped onto compute units. A scheme for this mapping iscalled a layout. A layout defines a series of steps, each computing a set of tiles.

As the degree of parallelism varies, so will resource utilization across theseries of steps a layout defines. When applying a layout, we distinguish be-tween tiles computed in a step where all compute units are fully utilized, andtiles computed in a step where compute units are under-utilized. These notionsare denoted TU=1 and TU<1 respectively, and called utilization-parameters. Theresource utilization U of a layout is the fraction of tiles computed where allcompute units are fully utilized to the total number of tiles: U = TU=1/k2.

From a utilization perspective, the best possible layout using p processorsis only under-utilizing the processors in the wavefronts in the north-west andsouth-east corners where |Wd| < p, i.e., the wavefronts with 1, 2, . . . , p � 1tiles. Each corner consists of Âp�1

d=1 |Wd| = (p2 � p)/2 tiles. For both corners,the best case number of tiles under-utilizing resources is TU<1 = p2 � p.

As our focus is comparison of large strings on GPU architecture, we as-sume k is an order of magnitude larger than the number of compute units p,so k � p, where p is constant.

We now describe three layouts with a different flow of computations: thewidely used Diagonal Wavefront, Column-cyclic Wavefront [13, 14] and ournew Diagonal-cyclic Wavefront.

15

4. PARALLEL LAYOUTS FOR LDDP

4.1 Diagonal Wavefront

The widely used Diagonal Wavefront (DIAWAVE) layout process the tiled DPMin an anti-diagonal manner where a wavefront Wd consist of all tiles (i, j) inwavefront. Wd is computable when Wd�1 has been processed and a barriersynchronization prevents Wd+1 from being processed before Wd is finished.Since all entries in Wd�1 have been computed, entries in Wd can be computedindependently. This layout is depicted in Figure 4.1 for k = 6 and a numberof compute units p.

1

2

2

3

3

3

4

4

4

4

5

5

5

5

5

6

6

6

6

6

6

7

7

7

7

7

8

8

8

8

9

9

9

10

10

11

1

2

2

3

3

3

4

4

4

5

6

6

6

7

7

8

8

8

9

9

9

10

10

10

11

11

12

12

12

13

14

14

14

15

15

16

1

2

2

3

3

3

4

4

4

4

5

5

5

5

6

7

7

7

7

8

8

9

9

9

9

10

11

11

11

11

12

12

12

13

13

14

(c) p = n = 6(a) p = 3 (b) p = 4

Figure 4.1: Diagonal Wavefront layout. A DPM subdivided into k ⇥ k tiles, here k = 6. Thenumber inside the tiles denote their parallel computation step. For clarity, the figure shows anordered execution of tiles in a Wd, although they could be computed in any order. The gray colorshows tiles, that are computed in a step where some compute units under-utilized. (a) 16 stepsare used to compute the DPM, with utilization U = 24/36 = 67%. (b) 14 steps and U = 56%.(c) 11 steps and U = 6/36 = 17%. Only step 6 is not under-utilizing compute units.

A complete wavefront Wd of length |Wd| can be computed inl

|Wd |p

mpar-

allel computation steps. As there are 2k � 1 diagonals in a k ⇥ k matrix, thenumber of steps to compute the entire DPM is:

S =2k�1

Âd=1

⇠|Wd|

p

⇡

As |Wd| = d for 1 d k, then by symmetry the number of steps for allWd where d 6= k is 2 Âk�1

d=1dd/pe. The complete number of steps is:

S = 2k�1

Âd=1

dd/pe + dk/pe (4.1)

Resource Utilization

When the length of a wavefront is not divisible by p, under-utilization will oc-cur. As each wavefront is divided into

l|Wd |

p

msteps, the utilization-parameter

TU<1 for each wavefront Wd is |Wd| mod p.Starting from the first wavefront, and advancing p wavefronts, the number

of under-utilized tiles1 are: Âpd=1 (|Wd| mod p) = (p2 � p)/2. This number

of under-utilized tiles will periodically continue to occur, when advancing p

1For simplicity, the phrase “under-utilized tiles” is used for the number of tiles computed ina step where under-utilization occurs.

16

4.2. Column-cyclic Wavefront

wavefronts further, see Figure 4.1 (a). This gives a total of 2bk/pc full periods,where (p2 � p)/2 tiles are computed while under-utilizing.The remaining k mod p wavefronts will by symmetry give (k mod p)2 tiles,see Figure 4.1 (b). For a full computation of k2 tiles, the total number of tileswhere DIAWAVE is under-utilizing the available resources are:

TU<1 = bk/pc(p2 � p) + (k mod p)2 (4.2)

4.2 Column-cyclic Wavefront

Galper and Brutlag [13] presented the Column-cyclic Wavefront (COLCYCLIC)layout, where columns in the tiled DPM are divided into column groups span-ning p columns. The layout is shown in Figure 4.2.

Each column group G is calculated as a wavefront Wd where |Wd| p.When column i in the group is completed, column i + p in the next group willstart. In cases where a compute unit can be explicitly mapped to a specificcolumn and values can be stored locally between steps, COLCYCLIC minimizesmemory transfers by keeping the values from (i, j) when advancing to tile(i, j + 1). Krusche and Tiskin [14] exploit this potential by mapping the layoutto the BSP model [15].

1

2

2

3

3

3

4

4

4

7

5

5

5

8

8

6

6

6

7

7

8

9

9

9

10

10

10

11

11

11

12

12

12

13

13

14

1

2

2

3

3

3

4

4

4

4

5

5

5

5

7

6

6

6

6

8

8

7

7

7

9

9

8

8

9

10

10

11

11

12

12

13

p0 p1 p2 p3 p0 p1p0 p1 p2 p0 p1 p2

(a) p = 3 (b) p = 4 (c)

p

k

p � (k mod p)

|Gsub||Gsub|

Figure 4.2: Column-cyclic Wavefront layout where the vertical line shows the separation of col-umn groups spanning p columns. p0 � p3 indicates the column mapping to compute units aspresented by Krusche and Tiskin [14]. Tiles computed while under-utilizing the compute unitsare shown with a gray color. (a) p = 3 using 14 steps and U = 83%. (b) p = 4 using 13 steps andU = 56%. Note the suboptimal utilization for the last column group Gsub. (c) Illustrates how tocalculate the number of steps when k is not a multiple of p.

When k is not a multiple of p, as in the example in Figure 4.2 (b), the utiliza-tion is suboptimal in the last column group Gsub since the number of columns|Gsub| = (k mod p) < p.

The number of steps needed by COLCYCLIC is described by two cases:

When k is a multiple of p we have k/p column groups and k steps are takenfor each, giving k2/p = kdk/pe steps. Due to the in-column wavefront,additional p � 1 steps are needed to get the total number of steps.

When k is not a multiple of p we treat the last column group Gsub as if thenumber of columns |Gsub| = p, and get kdk/pe + p � 1 steps. We now

17


need to adjust the steps to reflect the fact that |Gsub| < p. The number ofsuperfluous steps taken by the initial assumption is p � (k mod p). Thisgives kdk/pe + p � 1 � (p � (k mod p)) steps.

From the above two cases, the total number steps for COLCYCLIC is:

S =

⇠kp

⇡k + p � 1 +

(0 if k mod p = 0,(k mod p) � p if k mod p 6= 0

(4.3)


When k is a multiple of p, COLCYCLIC achieves the optimal utilization TU<1 =p2 � p. However, when that is not the case, Gsub has a negative impact onU. The tiles which are computed in steps where compute units are not fullyutilized in Gsub are TU<1 = (k � p)(k mod p). Thus, the total number of tilescomputed where COLCYCLIC under-utilizes the resources is:

TU<1 = p2 � p + (k � p)(k mod p) (4.4)

4.3 Diagonal-cyclic Wavefront

We present a new Diagonal-cyclic Wavefront2 (DIACYCLIC) layout which, inthe general case, improves utilization. The layout allows compute units tocyclically continue to Wd+1 even though Wd is not completely finished.

The layout imposes the constraint that the series of steps follow a consec-utive order for wavefronts where |Wd| � p. When the remaining tiles on Wdbecome less than p, the superfluous compute units will continue on the nextwavefront Wd+1, where the same consecutive ordering is honored. It followstrivially that the local dependencies hold for this layout under the given con-straints. The DIACYCLIC layout is shown in Figure 4.3.

DIACYCLIC will take 2(p � 1) steps, computing p2 � p tiles for wavefrontswhere |Wd| < p, the north-west and the south-east corners. The remaining

1

2

2

3

3

3

4

4

4

5

5

5

6

6

6

7

7

7

8

8

8

9

9

9

10

10

10

11

11

11

12

12

12

13

13

14

1

2

2

3

3

3

4

4

4

4

5

5

5

5

6

6

6

6

7

7

7

7

8

8

8

8

9

9

9

9

10

10

10

11

11

12

(a) p = 3 (b) p = 4

Figure 4.3: Diagonal-cyclic Wavefront layout. Tiles computed while under-utilizing are shownwith a gray color. (a) p = 3 using 14 steps and U = 83%. The arrows indicate the continuation ofstep 5 and 10 on the next diagonal. (b) p = 4 using 12 steps and U = 61%.

2A.k.a Snake.

18

4.4. Applying Layouts to the GPU Architecture

part is where the cyclic approach is used, and has k2 � (p2 � p) entries, com-puted by p processing units, giving (k2 � p2 + p)/p steps. To cope with thecase where (k2 � p2 + p) is not a multiple of p we ceil the expression. Thisgives total number of steps for Diagonal-cyclic Wavefront:

S = 2(p � 1) +

⇠k2 � p2 + p

p

⇡=

⇠k2

p

⇡+ p � 1 (4.5)


For wavefronts where |Wd| � p, the number of tiles computed while computeunits are fully utilized is TU=1 =

jk2�p2+p

p

kp. When k2 is not a multiple of

p, the last step in the cyclic approach will under-utilize the compute units,and the number of tiles in this last step is k2 mod p. The total number of tilescomputed where Diagonal-cyclic Wavefront under-utilize resources is:

TU<1 = p2 � p + k2 mod p (4.6)

4.4 Applying Layouts to the GPU Architecture

When applying the layouts to the GPU, we map tiles to a grid of blocks and theentries in a tile are computed by a thread block. The two levels of parallelismdiffer architecturally, and we consider these differences in the context of ourlayout descriptions.

Grid-level

Depending on the resource usage of a block, multiple blocks might be con-currently executed on a symmetric multiprocessor (SM). We let B denote thenumber of concurrent block per multiprocessor. To reflect this fact, we definethe number of compute units p that are used in our layout descriptions as:

p = B · number of SMs

As an example, if we have 14 multiprocessors available capable of computing2 concurrent blocks, under-utilization will occur when less than 28 tiles arecomputed in a step. Hence, this definition of p makes the expressions forunder-utilization applicable for all values of B.

On the GPU grid-level, there are currently no possibility of mapping ablock to a specific processor, and between steps nothing can be kept in sharedmemory, thus all blocks will have the exact same starting point. This means,the calculation of a tile can be assumed to take constant time—making it pos-sible to use steps to predict running time. This also means utilization U givesa way of comparing differences in running times for the layouts.

19


Thread-level

Different memories are available on the thread-level and data can be reusedfrom one step to the next. This means steps will take a varying amount oftime, and predicting running times from the number of steps is not possibleon this level. However, the utilization U can be used as an indication forperformance.

To extend the layout description for the thread-level, an entry is consideredto be a tile of size 1 ⇥ 1.

4.5 Summary and Discussion

The layouts Column-cyclic Wavefront (COLCYCLIC) and Diagonal-cyclic Wave-front (DIACYCLIC) take steps using a cyclic distribution of tiles which improvesutilization compared to Diagonal Wavefront (DIAWAVE). COLCYCLIC and DIA-CYCLIC divert when it comes to data locality as COLCYCLIC attains a higherdegree of data locality than DIACYCLIC. DIACYCLIC on the other hand, willin most cases take fewer parallel computation steps to complete than COL-CYCLIC.

Table 4.1 summarizes the number of parallel computation steps S each lay-out takes. Steps are also closely related to the utilization U as less steps for agiven k and p will give higher utilization.

Layout Steps S (forward pass)

Diagonal Wavefront 2k�1

Âd=1

dd/pe + dk/pe

Column-cyclic Wavefront⇠

kp

⇡k + p � 1 +

(0 if k mod p = 0,(k mod p) � p if k mod p 6= 0

Diagonal-cyclic Wavefront⇠

k2

p

⇡+ p � 1

Table 4.1: Comparison of different layouts for decomposing k ⇥ k tiles in the forward pass ofLDDP problems. Steps S corresponds to the term supersteps in Valiant’s BSP model [15].

Table 4.2 gives an overview of the number of tiles where the three lay-outs under-utilize available resources, including best-case and worst-case forconstant p. In general, the under-utilized corners |Wd| < p will have lessimpact on utilization when k grows.

The expressions show the relative TU<1-utilization is periodic around k mod p.For the layout DIAWAVE, TU<1 grows as k grows. Within the period k mod p,DIACYCLIC will in the worst case have TU<1 = p2 � 1. The utilization param-eter TU<1 for COLCYCLIC spans DIAWAVE and DIACYCLIC in the period, wherein the worst case TU<1 equals that of DIAWAVE and in the best case equalsDIACYCLIC. In general, the following is maintained for all k and p:

UDIAWAVE UCOLCYCLIC UDIACYCLIC

20

4.5. Summary and Discussion

Layout TU<1 Best-case Worst-case

Diagonal Wavefront (p2 � p)bk/pc + (k mod p)2 Q(k) Q(k)

Column-cyclic Wavefront p2 � p + (k � p)(k mod p) Q(1) Q(k)

Diagonal-cyclic Wavefront p2 � p + (k2 mod p) Q(1) Q(1)

Table 4.2: Expressions showing the number of tiles where the resources are under-utilized. Ex-pressed by the constant number of processors p and varying k ⇥ k tiles for the three layouts.

Furthermore, when k > 2p, then UDIAWAVE < UDIACYCLIC. The utilization Ufor the three layouts, when p = 14 and varying values of k are shown inFigure 4.4. The period k mod p is clear from the figure, and as expected wesee the upper-limit of COLCYCLIC is DIACYCLIC and lower limit DIAWAVE.

From a theoretical point of view, we have shown that our Diagonal-cyclicWavefront-layout gives the overall best resource utilization, which on the gridlevel means best running time, as explained in section 4.4. On this level ofparallelism, the Column-cyclic Wavefront only gives as good utilization asDiagonal-cyclic Wavefront when p divides k. The widely used layout Diago-nal Wavefront gives overall worst utilization.

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

50

60

70

80

90

100

Util

izat

ion

U(%

)

Utilization of k ⇥ k tiles using p = 14 compute units

Diagonal-cyclic WavefrontColumn-cyclic WavefrontDiagonal Wavefront

Figure 4.4: Utilization of the three layouts for 28 k 280 and p = 14. For other values of p, theplot will show the same tendency, however the period will change.

21

5 IMPLEMENTING LDDP ON GPUS

This chapter describes relevant aspects of our LDDP implementation on theGPU grid- and thread-level. To achieve a linear-space reduction, we use thetiling algorithm presented by Driga et al. [17]. Here the problem space isdecomposed into tiles, and only intersecting boundaries between the tiles issaved, as depicted in Figure 2.1. In our algorithm, the boundaries are allo-cated in global memory, and are available for all threads during the runtimeof the algorithm.

5.1 Grid-level

The three layouts Diagonal Wavefront, Column-cyclic Wavefront and our newDiagonal-cyclic Wavefront layout have all been implemented on the grid-levelfor distributing tiles to multiprocessors.

On the grid-level, tiles are represented by CUDA blocks. A CUDA kernelcall executes a set of blocks, on the multiprocessors, and in our context it isfollowed by a barrier synchronization. In relation to the theoretical descrip-tion of a layout, then to execute a step, a kernel call is made with the blocksneeded to compute the step.

The implementations of Column-cyclic Wavefront and our Diagonal-cyclicWavefront will make kernel calls with at maximum p blocks. For DiagonalWavefront, the implementation executes kernel calls containing all blocks in adiagonal. Relating to the description of the layout in section 4.1 on page 16, thesteps needed to complete a full diagonal is given by the length of a diagonaldivided by p. This is an approximation, as small variation in the executiontime of kernels will make it possible to better schedule blocks when a kernelis called with more blocks. This has the effect, that our Diagonal Wavefrontimplementation will have a slightly better running time than predicted by thelayout steps.

5.2 Thread-level

On the thread-level we solve part of the full DPM, by computing a t ⇥ t tilerepresented by a CUDA block. A block is organized as a set of data-parallelthreads, and the implementation of a block is called a kernel.

The cost function used in our thread-level computes longest common sub-sequence (LCS), but can be extended to solve any LDDP problem. In the fol-lowing we present our forward and backward pass kernels, and evaluate op-timization strategies.

23

5. IMPLEMENTING LDDP ON GPUS

5.2.1 Forward Pass Kernels

Input boundary

Inpu

t bou

ndar

y (c

ompu

ted

valu

es)

Output boundary

Out

put b

ound

ary

(res

ult v

alue

s)

Temporar

y resu

lt valu

es

!

Figure 5.1: Computation of a forward pass tilehaving t ⇥ t entries. A block of threads receivescomputed cost values, computes result values ina local DPM c (shown in white) and saves theresult values in the output boundaries. The tem-porary result values is locally stored in three di-agonals.

The forward pass kernels define thecalculation of a t ⇥ t tile using ablock of threads. The input is thesubstrings X0 and Y0 of size t, andinput boundaries for previously com-puted DPM values. The organiza-tion of boundaries is shown in Fig-ure 5.1. When threads compute en-tries in the local DPM c[i, j] wherei = 0 _ j = 0, the neighboring valuesare read from the input boundaries inglobal memory. When an entry hasbeen computed on the output bor-der i = t � 1 _ j = t � 1, the costvalue is saved in the output boundaryin global memory.

Selecting a Layout The previouslydescribed layouts are all candidatesfor distributing work in the forwardpass kernels. We believe the mostappropriate, due to data locality, is Column-cyclic Wavefront. Despite that,we have decided to explore the thread-level using the most simple layout, theDiagonal Wavefront. As the task of implementing efficient CUDA kernels isvery time consuming, we chose to focus on a simple layout. Optimizationfor the kernel implementation has been explored using best practice recom-mendations from NVIDIA [38]. Furthermore we have investigated optimalperformance in the kernel parameter space.

Space Reduction using Three Cost Diagonals In the algorithm by Drigaet al. [17], the problem space is divided down to a level where a complete full-matrix using O(n2) space can be stored in nearby memory. Due to the smallamount of shared memory on a GPU, this approach would reduce the tile sizeto t 128. A small tile size in that order would mean that boundaries taketoo much global memory space, as the size of boundaries in global memory isdirectly dependent on the tile size. Another drawback of a small tile size is asmall compute-to-memory ratio.

Since the Diagonal Wavefront layout computes a wavefront Wd concur-rently, the dependencies for Wd are the cost values on the previous two diag-onals Wd�1 and Wd�2, as shown in Figure 5.1. For space reduction, we onlyallocate memory for these three diagonals, like Liu et al. [3, 4]. This gives alinear space usage, resulting in a larger maximum tile sizes. The three cost-diagonals are placed in shared memory because they are frequently accessedand entries need to be accessible for all threads.

24

5.2. Thread-level

Boundaries When values are accessed on the boundaries, only a single entryis read from or written to global memory at a time. This memory access pat-tern is suboptimal. Following best practices, we ought to first transfer globalinput boundaries to shared memory, compute entries, store temporary outputboundaries in shared memory, then finally flush the temporary boundariesto global memory. However, by experiments we have found that reading theinput boundaries into shared memory does not have a significant speedupon the compute time compared to the additional 2t shared memory it woulduse—same goes for storing all output boundary temporary in shared memorybefore flushing them to global memory.

Substrings The substrings X0 and Y0 can be read directly from global mem-ory, or they can be transferred to shared memory for better memory accesstime. For an alphabet limited to 256 characters, shared memory substringscan be stored using 2t bytes. When the substrings are in global memory moreshared memory is available for the cost diagonals, resulting in a larger maxi-mum tile size at the cost of higher access times when reading characters fromthe substrings.

General and Specialized Kernels Two sets of kernels have been developed,general kernels and specialized kernels. The difference between them is the datatype used for representing the DPM cost values in a tile. The general kernelsrepresent inner cost values as 4-byte integers, whereas the specialized kernelsuse a smaller datatype. The general kernels, solves all LDDP problems, aslong as the cost values can be represented by 4-byte integers.

In the specialized kernels we apply the observation, that for a full tile com-putation, we are only interested in how much the cost values changes from in-put boundary to output boundary. If all possible cost-changes between neigh-boring entries can be represented in a smaller datatype, we can reduce theamount of shared memory needed.

The boundaries in global memory are represented as 4-byte integers. Torepresent cost values as a smaller data type, scaling is applied between bound-aries and inner cost values. Scaling is applied by subtraction and addition ofa tile scaling constant q.

When reading a value b from the input boundaries, the inner cost valuewill be computed using the scaled value b � q. Equivalently scaling by addi-tion is used when writing to the output boundaries to global memory.

For an LDDP problem to be solvable using the specialized kernels, we needto investigate how much the cost values maximum varies between entries,since this value must be representable by the smaller data type. The scalingconstant q is determined from an appropriate value on the input boundaries.

For LCS, the cost for a match is 1, hence the maximum difference in costvalues occurs when X0 = Y0. In this case the maximum difference equalsthe tile size t. The scaling value q is assigned to the input boundary valuenorth-west of c[0, 0]. Using a 2-byte integer datatype to represent the innercost values we are able to halve the amount of shared memory used for thecost values compared to the general kernels. This comes at the expense ofextra subtraction and addition operations when reading from and writing tothe global memory boundaries.

25


Overview of Our Forward Pass Kernels

The following forward pass kernels have been implemented:

GKERNELSHARED general kernel, input strings in shared memory

GKERNELGLOBAL general kernel, input strings in global memory

SKERNELSHARED specialized kernel, input strings in shared memory

SKERNELGLOBAL specialized kernel, input strings in global memory

5.2.2 Backward Pass Kernel

Input boundary

Inpu

t bou

ndar

y

!

Sub tile 1 Sub tile 1

Sub tile 1 Sub tile 4full-matrix

Figure 5.2: Computation of a backwardpass using sub-tiling.

Although our focus is on the forwardpass we have implemented a simple LCSbackward pass kernel BPKERNEL. LikeDriga et al. [17] our algorithm uses theboundaries saved in the forward pass tocompute one optimal path by processingtiles that intersect the path.

Due to shared memory limitationswhen using full-matrix, our BPKERNELis able to compute the deduced back-ward pass for a tile of up to 128 ⇥ 128.

For k ⇥ k tiles, we start by computingthe optimal path for the last tile (k, k). Ifthe tile size t used in the forward passis greater than 128, we need to decom-pose the problem space down to a level where full-matrix can be computedin shared memory. In this case we decompose the tile into sub-tiles of size128 ⇥ 128, and run a forward pass kernel to compute sub boundaries. When theproblem space can be represented in shared memory, we run our BPKERNELto obtain an optimal path through all sub-tiles. Once the kernel reaches theborder of the current tile, the computation is moved to the next tile where theoptimal path intersects. The algorithm continues in this way until the compu-tation reaches the border of the DPM. The sub-tiling is shown in Figure 5.2. Inour implementation, the solution path is saved directly in host memory.

For space reduction, the cost-matrix c used in BPKERNEL could be left outin favor of a matrix b containing directions taken while computing the forwardpass. A given point in b[i, j] points to the neighboring entry correspondingto the optimal solution chosen when computing c[i, j]. This approach is de-scribed by Cormen et al. [18]. As we still need cost-values in the computationof the forward pass, we could use the three cost diagonal approach for tem-porary cost values. Although this idea would increase the backward pass tilesize, it introduces bank-conflicts and makes the backward pass kernel rathercomplex.

26

5.2. Thread-level

5.2.3 Best Practice Recommendations

The NVIDIA Best Practices Guide [38] has been used as a guideline duringimplementation of the kernels. An excerpt of what we have found especiallyuseful is listed below.

Minimize Branching

In the forward pass kernels, two CUDA max()-functions can replace the branch-ing needed when testing for character match and finding the maximum be-tween neighboring values. This reduces branch diversion, which might oth-erwise occur for each entry. The reduced branching increased performancewith up to 8%.

Shared Memory

When accessing shared memory, we have organized the access pattern so nobank conflicts occur.

Loop Unrolling

Using #pragma unroll on loops will help the compiler predict branches inthe code. Experiments have shown up to 8% speedup for our kernels. As theright level of loop unrolling is very time consuming to find and dependent onother kernel parameters, the directive is not used in the results presented.

5.2.4 Undocumented CUDA Features

Due to findings on the NVIDIA Developer Forum,1 we have investigatedthe use of the rather undocumented features of the volatile keyword. TheCUDA compiler often inline operations needed to compute a value of a vari-able used in the code, instead of keeping the value saved in a register. Whendeclaring a variable volatile, it forces the compiler to save the variable in aregister, i.e., the value will not be recomputed. Besides changing the flow ofcomputations in the kernels, the register usage is changed. Manually explor-ing all relevant placement of the volatile keyword, gave up to 10% speedupfor some of our kernels.

1http://blog.icare3d.org/2010/04/cuda-volatile-trick.html

27

http://blog.icare3d.org/2010/04/cuda-volatile-trick.html


5.3 Space Constraints

We look into GPU memory constraints on the thread- and grid-level. Thedatatypes used will impact the amount of memory used. DVar describes thesize in bytes for representing one variable Var. The datatypes are described inTable 5.1 below.

Variable Description Size in bytes

DInputString Size of input string type 1DDPM-value Size of DPM representation in a block 2 or 4DBoundary Size of the global memory boundaries type 4

Table 5.1: Sizes of variables.

5.3.1 Thread-level

On the thread-level we are limited by the amount of shared memory Smem permultiprocessor. Each block requires an amount of memory depending on thetile size t. The upper limits on t are:

X0 and Y0 in shared memory

2t · DInputString| {z }Input strings X0 + Y0

+ 3t · DDPM-value| {z }DPM Cost Values

Smem

X0 and Y0 in global memory

3t · DDPM-value Smem

Notice, that DDPM-value can have two values, depending on whether the ker-nel applies scaling. Table 5.2 shows the the maximum value of tile size t fordifferent kernels on NVIDIA GPUs with Smem = 49152 bytes.

Forward Pass Kernel Maximum tile size t

GKERNELSHARED 3510GKERNELGLOBAL 4096SKERNELSHARED 6144SKERNELGLOBAL 8192

Table 5.2: Comparison of tile size limitations for different forward pass kernels. Values are forNVIDIA GPUs with compute capability 2.0, see data sheet in section A.1.

28

5.3. Space Constraints

5.3.2 Grid-level Memory Usage

On the grid-level we are limited by the amount of global memory Gmem onthe GPU. The constraints given by n and k, where k = n/t

2n · DInputString| {z }Input strings X + Y

+ 2n(k + 1)DBoundary| {z }Forward pass boundaries

+nk

⇣ n64k

� 2⌘

· DBoundary| {z }

Backward pass boundaries

Gmem

The solution path is saved directly in host memory. The relation shows thatthe maximum solvable problem size n becomes larger as k decreases. So tomaximize n, the largest available tile size t is selected.

When selecting the forward pass kernel SKERNELGLOBAL, the largest tilesize is t = 8192. Using this kernel on an NVIDIA Tesla C2070 with 5375 Mbyteof global memory, our implementation is able to compare two strings of up to2.39 · 106 symbols each, including finding the optimal path. For the GeForceGTX 590 with 1536 Mbyte of global memory, we can compare two strings oflength up to 1.2 · 106.

29

6 EXPERIMENTAL RESULTS FOR GRID-LEVEL

In this chapter we present our results for mapping the three LDDP layoutsDiagonal Wavefront (DIAWAVE), Column-cyclic Wavefront (COLCYCLIC) andDiagonal-cyclic Wavefront (DIACYCLIC) onto the GPU grid-level. The widelyused Diagonal Wavefront layout is selected as baseline to compare relativespeedup for the Column-cyclic Wavefront and Diagonal-cyclic Wavefront.

The results will be related to the theoretical descriptions of the layouts,presented in chapter 4.

6.1 Setup

The experiments are performed on an NVIDIA Tesla C2070 GPU having 14multiprocessors and ECC memory error protection.1 To explore the portabil-ity of our solution we conduct the same tests for the NVIDIA GeForce GTX 590having 16 multiprocessors. Unlike the Tesla, which is targeted for scientificcomputing, the GeForce series is intended for the PC gaming market, thusECC is not available and global memory is limited. A data sheet for each GPUis listed in Appendix A. Unless otherwise stated the Tesla GPU is used forthe presented results. All algorithms are implemented in CUDA/C++ andcompiled using nvcc 4.1 and gcc 4.4.3.

The input data is selected from the DNA collection of the Pizza & ChiliCorpus [39], with the alphabet S = {A, C, G, T}. The data is selected to berepresentative of real-life samples used in bioinformatics.

To perform accurate timings we use CUDA event timers, which are basedon on-board counters on the GPU with a sub-microsecond resolution [40]. Ad-ditional information has been obtained using the CUDA Profiler. Althoughthe independent timings are almost identical, all running times presented arethe median of 5 independent runs.

1Error-Correcting Code memory is used, where data corruption cannot be tolerated.

31

6. EXPERIMENTAL RESULTS FOR GRID-LEVEL

6.2 Results

Figure 6.1 shows the experimental speedup for COLCYCLIC and DIACYCLIC

relative to DIAWAVE for k = [28; 280] in the two cases where the maximumnumber of blocks per multiprocessor B is (a) 1 and (b) 2. As predicted by theutilization in chapter 4, DIACYCLIC always outperforms DIAWAVE. Also, asanticipated, COLCYCLIC varies between DIAWAVE and DIACYCLIC.

In the cases where the number of steps S is equal for DIACYCLIC and COL-CYCLIC, we observe a slightly better speedup for COLCYCLIC, the differencebeing less than 0.2 percentage points. This can be explained by the fact thatCOLCYCLIC has a higher degree of data locality. Since the L2 cache is notpurged between kernel calls, a higher hit rate on the L2 cache occurs for COL-CYCLIC. This has been confirmed by profiling the two layouts.

The results also show that COLCYCLIC performs up to 1% worse than DIA-WAVE, when the number of steps are the same. This is due to differences inplacements of the step synchronization barriers, refer to section 5.1.

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

0

5

10

15

20

Spee

dup

com

pare

dto

Dia

gona

lWav

efro

nt(%

)

Experimental layout-performance compared to Diagonal Wavefront

Diagonal-cyclic WavefrontColumn-cyclic Wavefront

(a)

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

0

5

10

15

20

Spee

dup

com

pare

dto

Dia

gona

lWav

efro

nt(%

)

Experimental layout-performance compared to Diagonal Wavefront

Diagonal-cyclic WavefrontColumn-cyclic Wavefront

(b)

Figure 6.1: Relative layout-performance of Column-cyclic Wavefront and Diagonal-cyclic Wave-front compared to Diagonal Wavefront with fixed tile size of 1024 and a varying number of tilesk = [28; 280]. The result is independent of the actual forward pass kernel used. (a) Shows speedupwith one block per multiprocessor. (b) Two concurrent blocks per multiprocessor.

32

6.2. Results

6.2.1 Comparing Theoretical and Experimental Speedup

We will now investigate how well the theoretical descriptions predict the ac-tual speedup for the grid-level layouts.

Depending on how many resources a block requires, the GPU may be ableto schedule multiple concurrent blocks on a symmetric multiprocessor. Wedenote the number of concurrent blocks per multiprocessor B and the numberof multiprocessors P.

To predict speedup theoretically, we need to take into account schedulingof multiple blocks on the multiprocessors. As examples, when B = 1, the timeto compute a number of blocks grows as a linear step function with steps thelength of P. When B = 2, then time cannot be predicted by linear steps astwo concurrent blocks on a multiprocessor will take less than double time tofinish.

We define a function virtual time which simulates the concurrent executionof blocks. It maps the characteristics of each step of a layout into a relativetime unit, see Table 6.1.

Number of tiles in a step

]0; P] ]P; 2P] ]2P; 3P] ]3P; 4P] ]4P; 5P] ]5P; 6P] . . .

B = 1 t1 2t1 3t1 4t1 5t1 6t1 . . .B = 2 t1 t2 t2 + t1 2t2 2t2 + t1 3t2 . . .

Table 6.1: Virtual time function simulating the GPU when B = 1 and B = 2. It shows approximatetimings for a step containing a number of tiles. Notice that the number of tiles is given in intervalsof number of multiprocessors P. For the timings t1 and t2, the following apply: t1 < t2 < 2t1.From empirical profiling of the architecture, we have found that: t2 ⇡ 1.7t1.

As an example, when B = 2 and the number of tiles in step lies in theinterval ]P; 2P], then due to better utilization on the multiprocessor, the com-putation time is less than 2t1 > t2. Our empirical studies have shown thatt2 ⇡ 1.7t1. The tendency seen here for B = 2 will be similar in cases whereB > 2.

Speedup Predictability

By applying the virtual time function to the concept of steps, we are ableto theoretically determine speedup for COLCYCLIC and DIACYCLIC relative toDIAWAVE. By comparing the theoretical speedup with the speedup measuredin our experiments, we will now investigate how well the theoretical descrip-tions predict the actual speedup on the GPU architecture.

Figure 6.2 shows the difference in theoretical speedup and experimentalspeedup with one block per multiprocessor. Plot (a) compares predictabilityfor a kernel with substrings in shared memory, and (b) shows predictabilitywhen the substrings are in global memory. When the substrings are in sharedmemory, the access time is close to constant, and it also gives less pressure onthe caches where boundaries will reside. When the substrings are in global

33

6. EXPERIMENTAL RESULTS FOR GRID-LEVEL

memory caches will be used for both substrings and boundaries, giving morepressure on the cache, which results in larger variations in memory accesstime. This explains the reason plot (b) contains more noise than (a).

In general we see that our theoretical descriptions predict a less than 0.30percentage point better speedup, than what is observed in experiments. Forthe kernels where substrings are in shared memory, we see a less than 0.10percentage point better speedup in theory, than observed in experiments.

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

0.00

0.05

0.10

0.15

0.20

0.25

Perc

enta

gepo

int

Difference between theoretical and experimental speedup

Diagonal-cyclic Wavefront (µ = 0.08)Column-cyclic Wavefront (µ = 0.06)

(a)

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

0.00

0.05

0.10

0.15

0.20

0.25

Perc

enta

gepo

int



(b)

Figure 6.2: Comparing theoretical layout-performance with experiments for using a fixed tilesize of 1024 and a varying number of tiles k = [28; 280] and one block per multiprocessor. Av-erage deviation µ and standard deviation s from theory is listed in the labels. The plot showsa slight worse speedup in practice than predicted in our theory, due to the implementation ofDiagonal Wavefront. (a) LCS kernel GKERNELSHARED having input strings in shared memory. (b)GKERNELGLOBAL with input strings in global memory.

Figure 6.3 shows how well we can predict speedup theoretically when upto two blocks can execute concurrently on a multiprocessor. The data is fromthe same experiments, as shown in Figure 6.1 (b). A lower accuracy is ob-served compared to when maximum one block can execute concurrently ona multiprocessor. This is expected, since virtual time becomes more approxi-mative as B gets larger due to the t2 term. In this case, the theoretical speeduppredicts a less than 1.2 percentage point better speedup, than what is observedin experiments.

For most results, we see that the theory predicts a better speedup thanwhat we see in our experiments. This degradation in our experimental speedupis due to the implementation of the DIAWAVE. As explained in section 5.1, wereduce the amount of grid-level synchronization by only executing a single

34

6.2. Results

28 42 56 70 84 98 112 126 140 154 168 182 196 210 224 238 252 266 280Number of tiles (k)

�0.5

0.0

0.5

1.0

Perc

enta

gepo

int



Figure 6.3: Comparing theoretical layout-performance for experiments shown in Figure 6.1 (b),where two concurrent blocks can execute per multiprocessor. The plot shows a slight worsespeedup in practice than predicted in our theory. The LCS kernel used is GKERNELSHARED with512 threads per block.

kernel call per wavefront front. This has the effect, that our implementationwill have a slightly better running time than predicted by step to virtual time-function.

6.2.2 Part Conclusion

On the grid-level, the theoretical description of the three layouts captureswhat we see in our experimental results with a very small margin of error.This has been shown for one and multiple concurrent blocks per multiproces-sor. In the cases where our theoretical predictions are inaccurate, we are ableto pinpoint what is causing it—and these inaccuracies are small enough toconsider negligible. Thus, our theoretical approach for describing the layoutsas steps, and mapping steps to virtual time, gives extremely good predictionson how the architecture behaves. It also shows that the utilization metric givesa good theoretical way of comparing layouts on the grid-level.

Overall this shows our new Diagonal-cyclic Wavefront gives the best over-all performance, both in experiments and theoretically.

35

7 EXPERIMENTAL RESULTS FOR THREAD-LEVEL

This chapter documents the structured approach and experiments carried outfor optimizing the kernel performance. A number of best practice optimiza-tions have been evaluated during implementation, and a selection of these arepresented in subsection 5.2.3.

7.1 Setup

On the thread-level we have four independent kernels with different char-acteristics for computing the forward pass of longest common subsequence.All kernels use the Diagonal Wavefront layout for mapping the problem ontoCUDA threads. A simple backward pass kernel have also been designed. Allkernels are implemented in NVIDIA CUDA. We use the same setup as thegrid-level experiments, described in section 6.1.

Performance of the kernels are tested on DNA sequences [39] and ran-domly generated symbols from an alphabet of 256 symbols.

To compare the results we use cell updates per second (CUPS), a com-monly used performance measure in bioinformatics literature [4, 6, 9, 10].CUPS represents the time for a complete computation of one entry in theDPM, including memory operations and communication. Given a DPM withn ⇥ n entries, the GCUPS (billion cell updates per second) measure is n2/(t ·109), where t is the total computation time in seconds.

7.2 Results for Forward Pass Kernels

We present the results for our four forward pass kernels for solving longestcommon subsequence. For all of them, each thread block computes a tile hav-ing t ⇥ t entries by comparing the substrings X0 and Y0 of length t, using twotile boundaries as input, and outputs the boundaries to the south and east.The four forward pass kernels are:

GKERNELSHARED general kernel, input strings in shared memory

GKERNELGLOBAL general kernel, input strings in global memory

SKERNELSHARED specialized kernel, input strings in shared memory

SKERNELGLOBAL specialized kernel, input strings in global memory

37

7. EXPERIMENTAL RESULTS FOR THREAD-LEVEL

7.2.1 Automatic Performance Tuning

Selecting the optimal kernel parameters is a key task in optimizing the perfor-mance of GPU applications. As performance of kernels is almost impossibleto analyze theoretically, performance optimizations are driven by empiricalstudies. The strategy is to exhaustive search the parameter space by executingseveral runs using different tuning parameters. This is a technique known asautomatic performance tuning, or auto-tuning. The effectiveness of this tech-nique depends on the chosen tuning parameters to optimize. Auto-tuningof GPU kernel parameters can include relevant parameters like block size,threads per block, loop unrolling level and internal algorithm trade-offs. Formore information on the topic in a GPU context, see [16, 41, 42].

Our automatic performance tuning is conducted by selecting a set of ker-nel parameters, and for each of these, automatically measure the running timeon our four kernels. To make the results independent of the grid-level lay-outs, we occupy all multiprocessors with the maximum number of concur-rent blocks for the given parameters. So the performance measured is for thebest case, where maximum utilization on the grid-level is achieved. All testpresented are for the targeted NVIDIA Tesla C2070 GPU.

Tile Size and Threads Ratio

We start by examining the optimal relationship between the kernel parameterstile size t and the number of threads per block. The tile size is selected to be amultiple of the architectures 128 byte memory transaction size and up to themaximum size supported by the kernels. The number of threads is selected tobe a multiple of the warp size (32 threads), and up to the maximum of 1024threads. Our tests show a tendency for the optimal ratio between tile size andnumber of threads per block to be 1:4 for the general kernels and 1:8 for thespecialized kernels. Figure 7.1 visualizes the result of an auto-tuning test.

512 1024 1536 2048 2560 3072Tile size t

128

256

384

512

640

768

1024

Thre

ads

perb

lock

5.1

4.6

3.6

5.0

5.4

4.7

3.5

4.0

5.4

5.4

5.2

4.4

2.2

3.9

4.9

4.6

2.2

4.0

5.2

5.2

2.3

4.1

5.1

5.3

5.3

5.2

Exploring parameter space for GKERNELSHARED

Figure 7.1: Example of auto-tuning two kernel parameters for GKERNELSHARED. The performancein GCUPS is shown for each supported configuration of tile size t and number of threads perblock. The black cells denote unsupported configurations, where number of threads does notdivide t. The best performance for each tile size t is shown with a underline.

38

7.2. Results for Forward Pass Kernels

Optimal Tile Size

To find the optimal value for tile size t, we tested all kernels by varying tusing the best ratio between t and number of threads per block. Notice thatwith different tile sizes, the number of concurrent blocks per multiprocessorwill vary. The results for the general kernels are shown in Figure 7.2, andFigure 7.3 on the next page shows results for the specialized kernels.

The results clearly show that selecting the tile size to be a multiple of thearchitecture’s 128 byte memory transaction size, generally yield the best per-formance. Also, having multiple blocks per multiprocessor gives a better per-formance for all kernels.

512 640 768 896 1024 1152 1280 1408 1536 1664 1792 1920 2048Tile size

4.2

4.4

4.6

4.8

5.0

5.2

5.4

5.6

Gig

aC

ellU

pdat

espe

rSec

ond

(GC

UPS

)

6 5 4 3 2 1

Max performance of GKERNELSHARED (threads per block = 1/4 tile size)

(a)

512 768 1024 1280 1536 1792 2048 2304 2560 2816 3072 3328 3584 3840 4096Tile size

4.2

4.4

4.6

4.8

5.0

5.2

5.4

5.6

Gig

aC

ellU

pdat

espe

rSe

cond

(GC

UPS

)

8 7 6 5 4 3 2 1

Max performance of GKERNELGLOBAL (threads per block = 1/4 tile size)

(b)

Figure 7.2: Experimental tile-performance for the general kernels using a varying tile size withmaximum number of blocks for each configuration. The vertical lines and numbers inside theplots, denote the number of blocks per multiprocessor. The circles show where t is a multiple ofthe 128 byte memory transaction size. Test were conducted on the NVIDIA Tesla C2070 GPU.

39


512 768 1024 1280 1536 1792 2048 2304 2560 2816 3072 3328 3584 3840 4096Tile size

4.0

4.5

5.0

5.5

6.0

6.5

Gig

aC

ellU

pdat

espe

rSe

cond

(GC

UPS

)

8 7 6 5 4 3 2 1

Max performance of SKERNELSHARED (threads per block = 1/8 tile size)

(a)

512 1024 1536 2048 2560 3072 3584 4096 4608 5120 5632 6144 6656 7168 7680 8192Tile size

3.5

4.0

4.5

5.0

5.5

6.0

Gig

aC

ellU

pdat

espe

rSec

ond

(GC

UPS

)

8 7 6 5 4 3 2 1

Max performance of SKERNELGLOBAL (threads per block = 1/8 tile size)

(b)

Figure 7.3: Experimental tile-performance for the specialized kernels using a varying tile sizewith maximum number of blocks for each configuration. The vertical lines and numbers insidethe plots, denote the number of blocks per multiprocessor. The circles show where t is a multipleof the 128 byte memory transaction size. Test were conducted on the NVIDIA Tesla C2070 GPU.

When comparing the overall performance of the kernels seen in Figure 7.2and 7.3, the specialized kernels surprisingly outperforms the general kernelfor most configurations. The specialized kernels do more computations whenscaling cost values, however the smaller datatype reduces shared memoryaccess. This does not fully explain why the performance is better. Anotherexplanation could be different optimizations applied by the CUDA compiler.

40

7.3. Results for Backward Pass Kernel

Summary and Comparison

To give an overview of kernel performance on the NVIDIA Tesla GPU usingthe auto-tuned parameters and ratios, Table 7.1 presents combined results fortile size t = 1024, 2048, 4096, 8192 using 256, 512, 1024 threads per block. Theresults indicate the same ratio between t and number of threads per block, asfound in our auto-tuning test.

Execution time in Giga Cell Updates per Second (GCUPS)

Tile size Threads SKERNELSHARED SKERNELGLOBAL GKERNELSHARED GKERNELGLOBAL

1024 256 5.63 (4) 5.39 (5) 5.36 (3) 5.45 (4)1024 512 4.87 (2) 4.70 (2) 4.71 (2) 4.81 (2)1024 1024 3.62 (1) 3.58 (1) 3.53 (1) 3.62 (1)

2048 256 6.18 (3) 5.71 (4) 3.86 (1) 5.09 (2)2048 512 5.79 (2) 5.36 (2) 4.91 (1) 5.51 (2)2048 1024 4.81 (1) 4.59 (1) 4.64 (1) 4.70 (1)

4096 256 4.18 (1) 5.12 (2) - 3.51 (1)4096 512 5.73 (1) 5.77 (2) - 5.32 (1)4096 1024 5.74 (1) 5.32 (1) - 5.49 (1)

8192 256 - 3.26 (1) - -8192 512 - 5.24 (1) - -8192 1024 - 5.67 (1) - -

Table 7.1: Kernel performance in GCUPS with maximum number of blocks for different configu-rations of tile size t and threads per block. The number in the parenthesis denotes the number ofconcurrent blocks per SM, B. Tests were conducted on the NVIDIA Tesla C2070 GPU.

We conducted the same tests for the NVIDIA GeForce GTX 590, showing aperformance increase around 9% for all kernels compared to the Tesla GPU.The speedup can be explained by the increased clock frequency and fastermemory access, due to the lack of ECC, on the GeForce.

7.2.2 Effect of Alphabet Size and String Similarity

By evaluating the performance of varying the alphabet size up to 256 symbolsfor random sequences, we see no impact on the running time. The similarityof the given strings X and Y also has no influence on running time. This is dueto the branch reduction achieved by CUDA max()-functions in the implemen-tation, giving no branching within the kernel when matching character pairsfrom the strings.

7.3 Results for Backward Pass Kernel

We have implemented a simple backward pass kernel BPKERNEL. Althoughthe implementation does not utilize the available resources efficiently, the timespend on the backward pass out of the total running time is less than 2%for n = 221.

41


7.4 Part Conclusion

Currently there are no possibilities of modeling the behavior at the GPU thread-level, as small changes can have cascading effects, which are hard to predict.Some metrics, like utilization and data locality can help determine what isviable to implement, but experimental results are the only way to get solidanswers.

The kernels presented, are all implemented with NVIDIA’s best practicesin mind [38]. As examples, we have no bank conflicts for shared memory ac-cess, we have minimized branch diversion when possible, and we have inves-tigated the effects of loop unrolling. We have found a rather undocumentedfeature, the volatile keyword, which can have an impact on performance.By doing this we have made sure that we achieved a high performance for theimplemented kernels.

Furthermore, we have experimentally auto-tuned the kernel parameters,by performing an exhaustive search of the parameter space. This providesus with general pointers on optimal relationship between kernel parameters.The auto-tuning results are also valuable in determining optimal tile size fora given input size.

42

8 PERFORMANCE EVALUATION

8.1 The Potential of Solving LDDP Problems on GPUs

To evaluate the potential of solving LDDP problems on GPU hardware, wecompare our results for finding longest common subsequence (LCS) to se-quential CPU solutions. For comparison we have used the CPU solutions pro-vided by Stockel and Bille [1], including LCS implementations for Hirschberg[22], FLSA by Driga et al. [17], Chowdhury and Ramachandran cache-obliviousalgorithm [24] and their own FCO [1]. As the fastest known general sequentialCPU solution, we use Stockel’s optimized implementation of FLSA by Drigaet al. [17].

Table 8.1 shows a performance comparison between our GPU solution onan NVIDIA Tesla C2070 and FLSA [17] running on a Intel i7 2.66GHz hav-ing 4GB memory. Both of the two architectures were released in 2010. Thedata shows that our GPU solution has an average of over 40X performanceadvantage over FLSA for comparing two string of size larger than 219. Ourexperiments show that our inefficient backward pass has a negative impacton the GPU speedup for smaller strings.

Input size n FLSA CPU running time GPU running time GPU speed-up

218 0.132h 0.0035h 38X219 0.529h 0.0125h 42X220 2.127h 0.0473h 45X221 8.741h 0.2077h 42X

Table 8.1: Performance comparison of state-of-the-art single-threaded CPU solution FLSA [17](on a Intel i7 2.66GHz having 4GB memory) and our GPU solution (NVIDIA Tesla C2070) for solv-ing the longest common subsequence. The timings include complete computation time, memorytransfer and traceback of solution path. All GPU tests use our new Diagonal-cyclic Wavefrontlayout. For n = 221, SKERNELGLOBAL was used as kernel with tile size t = 8192 and 1024 threadsper block. All other tests used SKERNELSHARED with t = 2048 and 256 threads.

Driga et al. [17] presented a parallel FLSA with an almost linear speedupfor up to eight processors, when comparing two strings with length just above218. For 32 processors the speedup is halved. When comparing this with therun times presented in Table 8.1, the GPU is still an order of magnitudes fasterthan the parallel FLSA CPU solution running on a 32-core CPU. We believethis shows the potential of solving LDDP problems on GPU hardware.

43

8. PERFORMANCE EVALUATION

8.2 Comparing to Similar GPU Solutions

As stated in section 2.2.2, there are currently many GPU solutions for LDDPproblems, especially for Smith-Waterman. All of them are able to compare alarge number of independent short sequences, with a length up to 216. As aresult, they are not targeted the same problem size as we consider, makingdirect comparison impossible.

The GPU+CPU solution by Kloetzli et al. [11] is able to solve longest com-mon subsequence (LCS) on string length up to 220. They showed a five-foldspeedup over the single processor algorithm presented by Chowdhury andRamachandran [24]. Since we achieve a much higher speedup, we concludethat our solution is superior.

We found Deorowicz’ [12] solution to be the most resembling GPU imple-mentation, although the implementation only computes the length of the LCS.The implementation is based on the widely used Diagonal Wavefront layout.Their experiments showed a significant speedup obtained over their own se-rial CPU implementation of the same anti-diagonal algorithm for n = 216. Un-fortunately no comparison is made for any known CPU solutions. Althoughwe have tried to obtain the source code for their implementation, it has notbeen possible.

To the best of our knowledge, all existing GPU solutions for solving large-scale LDDPs problem use Diagonal Wavefront-layout on the grid-level, whereit would be an advantage to use our Diagonal-cyclic Wavefront layout instead.

44

9 CONCLUSION

Based on an analysis of state-of-the-art algorithms for solving local depen-dency dynamic programming (LDDP) problems and a thorough investiga-tion of GPU architectures, we have combined and further developed existingLDDP solutions. The result is a novel approach for solving any large pairwiseLDDP problem, supporting the largest input size for GPUs in literature.

Our results include a new superior layout Diagonal-cyclic Wavefront forutilizing the coarse-grained parallelism of the many-core GPU. For various in-put sizes the Diagonal-cyclic Wavefront always outperforms the widely used,the Diagonal Wavefront, and in most cases the Column-cyclic Wavefront lay-out. In general, any GPU algorithm for large LDDP problems can adopt ournew layout with advantage.

Theoretically we have been able to analyze the efficiency of the layouts,and accurately predict the relative speedup. Our results can be generalized toseveral levels of parallel computation using multiple GPUs.

As case study, we have implemented GPU kernels for finding the longestcommon subsequence. We present ways of optimizing kernel performance byminimizing branch divergence, scaling inner cost values, evaluating NVIDIAbest practices, performing automatic tuning of kernel parameters and exploit-ing the compiler keyword volatile with a previously undocumented speedupeffect.

Compared with the fastest known sequential CPU algorithm by Driga et al.[17], our GPU solution obtain a 40X speedup. Comparing two sequences of2 million symbols each, the CPU running time was close to 9 hours. Ourimplementation solves the same problem in 12 minutes. This shows, that exactcomparison of large biological sequence is now feasible.

9.1 Future Work

In this report we have focused on efficient layouts for the GPU grid-level. Tofurther optimize the performance, experiments must be made applying differ-ent layouts on the fine-grained thread-level. We believe the most appropriateis Column-cyclic Wavefront, due to the degree of data locality. The potentialof using multiple GPUs should also be examined.

We have focused on implementing longest common subsequence withinour parallel LDDP solution. It could be interesting to extend this implemen-tation with other LDDP algorithms with more advanced cost functions.

45

BIBLIOGRAPHY

[1] P. Bille and M. Stockel. Fast and cache-oblivious dynamic programmingwith local dependencies. Language and Automata Theory and Applications,pages 131–142, 2012.

[2] W.R. Pearson and D.J. Lipman. Improved tools for biological sequencecomparison. Proceedings of the National Academy of Sciences, 85(8):2444,1988.

[3] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-Wittig. Bio-sequence database scanning on a gpu. In Parallel and Distributed Process-ing Symposium, 2006. IPDPS 2006. 20th International, pages 8–pp. IEEE,2006.

[4] Schmidt B.b Voss G.a Muller-Wittig W.a Liu, W.a. Streaming algorithmsfor biological sequence alignment on gpus. IEEE Transactions on Paralleland Distributed Systems, 18(9):1270–1281, 2007.

[5] Valle G.a Manavski, S.A.a b. Cuda compatible gpu cards as efficient hard-ware accelerators for smith-waterman sequence alignment. BMC Bioin-formatics, 9(SUPPL. 2), 2008.

[6] Rudnicki W. Ligowski, L. An efficient implementation of smith water-man algorithm on gpu using cuda, for massively parallel scanning ofsequence databases. 2009.

[7] G.M. Striemer and A. Akoglu. Sequence alignment with gpu: Perfor-mance and design challenges. In Parallel & Distributed Processing, 2009.IPDPS 2009. IEEE International Symposium on, pages 1–10. IEEE, 2009.

[8] Maskell D.L. Schmidt B. Liu, Y. Cudasw++: Optimizing smith-watermansequence database searches for cuda-enabled graphics processing units.BMC Research Notes, 2, 2009.

[9] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. Cudasw++2.0: en-hanced smith-waterman protein database search on cuda-enabled gpusbased on simt and virtualized simd abstractions. BMC Research Notes,3(1):93, 2010.

[10] Jacek Blazewicz, Wojciech Frohmberg, Michal Kierzynka, Erwin Pesch,and Pawel Wojciechowski. Protein alignment algorithms with an efficientbacktracking routine on multiple gpus. BMC BIOINFORMATICS, 12:–,2011.

[11] J. Kloetzli, B. Strege, J. Decker, and M. Olano. Parallel longest commonsubsequence using graphics hardware. In Proceedings of the Eurographics

47

BIBLIOGRAPHY

Symposium on Parallel Graphics and Visualization. Eurographics Association,2008.

[12] Sebastian Deorowicz. Solving longest common subsequence and relatedproblems on graphical processing units. Software: Practice and Experience,40(8):673–700, 2010.

[13] A.R. Galper, D.L. Brutlag, and Stanford University. Medical ComputerScience. Knowledge Systems Laboratory. Parallel similarity search andalignment with the dynamic programming method. Knowledge Systems Lab-oratory, Medical Computer Science, Stanford University, 1990.

[14] P. Krusche and A. Tiskin. Efficient longest common subsequence compu-tation using bulk-synchronous parallelism. Computational Science and ItsApplications-ICCSA 2006, pages 165–174, 2006.

[15] L.G. Valiant. A bridging model for parallel computation. Communicationsof the ACM, 33(8):103–111, 1990.

[16] H.H.B. Sørensen. Auto-tuning of level 1 and level 2 blas for gpus. 2012.

[17] A. Driga, P. Lu, J. Schaeffer, D. Szafron, K. Charter, and I. Parsons. Fastlsa:a fast, linear-space, parallel and sequential algorithm for sequence align-ment. Algorithmica, 45(3):337–375, 2006.

[18] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and CliffordStein. Introduction to Algorithms. The MIT Press, 2001.

[19] R.A. Wagner and M.J. Fischer. The string-to-string correction problem.Journal of the ACM (JACM), 21(1):168–173, 1974.

[20] S.B. Needleman, C.D. Wunsch, et al. A general method applicable to thesearch for similarities in the amino acid sequence of two proteins. Journalof molecular biology, 48(3):443–453, 1970.

[21] Waterman M.S. Smith, T.F. Identification of common molecular subse-quences. Journal of Molecular Biology, 147(1):195–197, 1981.

[22] D.S. Hirschberg. A linear space algorithm for computing maximal com-mon subsequences. Communications of the ACM, 18(6):341–343, 1975.

[23] E.W. Myers and W. Miller. Optimal alignments in linear space. Computerapplications in the biosciences: CABIOS, 4(1):11–17, 1988.

[24] R.A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic pro-gramming. In Proceedings of the seventeenth annual ACM-SIAM symposiumon Discrete algorithm, pages 591–600. ACM, 2006.

[25] E.D. Demaine. Cache-oblivious algorithms and data structures. LectureNotes from the EEF Summer School on Massive Data Sets, pages 1–29, 2002.

[26] P. Bille. Faster approximate string matching for short patterns. Theory ofComputing Systems, pages 1–24, 2008.

48

Bibliography

[27] J.W. Hunt and T.G. Szymanski. A fast algorithm for computing longestcommon subsequences. Communications of the ACM, 20(5):350–353, 1977.

[28] G.M. Landau and U. Vishkin. Fast parallel and serial approximate stringmatching. Journal of Algorithms, 10(2):157–169, 1989.

[29] W.J. Masek and M.S. Paterson. A faster algorithm computing string editdistances. Journal of Computer and System sciences, 20(1):18–31, 1980.

[30] L. Bergroth, H. Hakonen, and T. Raita. A survey of longest commonsubsequence algorithms. In String Processing and Information Retrieval,2000. SPIRE 2000. Proceedings. Seventh International Symposium on, pages39–48. IEEE, 2000.

[31] G. Navarro. A guided tour to approximate string matching. ACM com-puting surveys (CSUR), 33(1):31–88, 2001.

[32] S. Fortune and J. Wyllie. Parallelism in random access machines. In Pro-ceedings of the tenth annual ACM symposium on Theory of computing, pages114–118. ACM, 1978.

[33] T.R. Mathies. A fast parallel algorithm to determine edit distance. 1988.

[34] R.A. Chowdhury and V. Ramachandran. Cache-efficient dynamic pro-gramming algorithms for multicores. In Proceedings of the twentieth an-nual symposium on Parallelism in algorithms and architectures, pages 207–216. ACM, 2008.

[35] David Diaz, Francisco Jose Esteban, Pilar Hernandez, Juan Antonio Ca-ballero, Gabriel Dorado, and Sergio Galvez. Parallelizing and optimizinga bioinformatics pairwise sequence alignment algorithm for many-corearchitecture. PARALLEL COMPUTING, 37(4-5):244–259, APR-MAY 2011.

[36] V.W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A.D. Nguyen,N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, et al. De-bunking the 100x gpu vs. cpu myth: an evaluation of throughput com-puting on cpu and gpu. In ACM SIGARCH Computer Architecture News,volume 38, pages 451–460. ACM, 2010.

[37] NVidia. CUDA C Programming Guide Version 4.1. November 2011.

[38] NVidia. CUDA C Best Practices Guide Version 4.1. January 2012.

[39] P. Ferragina and G. Navarro. Pizza & chili corpus. University of Pisa andUniversity of Chile. http://pizzachili.di.unipi.it/. 2012.

[40] NVidia. CUDA C Toolkit Reference Manual version 4.2. April 2012.

[41] Y. Li, J. Dongarra, and S. Tomov. A note on auto-tuning gemm for gpus.Computational Science–ICCS 2009, pages 884–892, 2009.

[42] P. Micikevicius. Analysis-driven optimization. In GPU Technology Con-ference. NVIDIA, 2010.

49

http://pizzachili.di.unipi.it/

Appendices

51

A NVIDIA GPU DATA SHEETS

A.1 NVIDIA Tesla C2070

The Tesla C2070 is developed for high performance scientific computing.

Release year 2010

CUDA Compute and Graphics Architecture Fermi

CUDA Driver Version 4.1

CUDA Compute Capability 2.0

Total amount of global memory: 5375 MBytes (5636554752 bytes)

Symmetric Multiprocessor (SM) 14

(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores (SPs)

GPU Clock Speed: 1.15 GHz

Memory Clock rate: 1494.00 Mhz

Memory Bus Width: 384-bit

L2 Cache Size: 786432 bytes

Total amount of constant memory: 65536 bytes

Total amount of shared memory per SM: 49152 bytes

Total number of registers available per SM: 32768 bytes

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum number of resident blocks per SM: 8

Concurrent copy and execution: Yes with 2 copy engines

Device has ECC support enabled: Yes

53

A. NVIDIA GPU DATA SHEETS

A.2 NVIDIA GeForce GTX 590

The GeForce GTX 590 is intended for the PC gaming market.The specifications written in bold highlights where the GeForce GTX 590 dif-fers from NVIDIA Tesla C2070.

Release year 2011

CUDA Compute and Graphics Architecture Fermi

CUDA Driver Version 4.1

CUDA Compute Capability 2.0

Total amount of global memory: 1536 MBytes (1610285056 bytes)

Symmetric Multiprocessor (SM) 16

(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores (SPs)

GPU Clock Speed: 1.22 GHz

Memory Clock rate: 1707.00 Mhz

Memory Bus Width: 384-bit

L2 Cache Size: 786432 bytes

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768 bytes

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum number of resident blocks per SM: 8

Concurrent copy and execution: Yes with 1 copy engine

Device has ECC support enabled: No

54

B KERNEL SOURCE CODE

B.1 Forward pass kernels

B.1.1 GKERNELSHARED

1 /**2 * Computes the LCS cost -boundaries for the tiles defined in boxArray.3 *4 * @param boxArray list of tiles to compute5 * @param xStr X’6 * @param yStr Y’7 * @param boundaryX boundaries in X direction8 * @param boundaryY boundaries in Y direction9 * @param n input size

10 * @param boxDim a multiple of #threads per block11 */12 __global__13 __launch_bounds__(MAXTHREADSPERBLOCK)14 void kernel_lcs_fp_wave_boundary_striving(const int2 *boxArray , char const *

const xStr , char const * const yStr , int * const boundaryX , int *const boundaryY , int n, int boxDim) {

15

16 int boxI = boxArray[blockIdx.x].x;17 int boxJ = boxArray[blockIdx.x].y;18

19 extern __shared__ char dynSmemWstringsGeneral [];20

21 char * const xStr_shared = &dynSmemWstringsGeneral [0];22 char * const yStr_shared = &dynSmemWstringsGeneral[boxDim ];23 int * diag_base = (int*)&dynSmemWstringsGeneral[boxDim *2];24

25 // inner cost diagonals26 int *subMatrix [3] = {27 &diag_base [0],28 &diag_base[boxDim],29 &diag_base [2* boxDim]30 };31 // a pointer for wrapping around the diagonals32 int *tmpPointer;33

34 // stride variables based on the problem size and number of threads35 int totalstrides = (boxDim/blockDim.x);36 int strideWidth = boxDim/totalstrides;37

38 // Copy local X-string and Y-string needed for the current tile39 for (int stride = 0; stride < totalstrides; ++ stride) {40 xStr_shared[threadIdx.x+stride*strideWidth] =41 xStr[boxI*boxDim + threadIdx.x+stride*strideWidth ];42 yStr_shared[threadIdx.x+stride*strideWidth] =43 yStr[boxJ*boxDim + threadIdx.x+stride*strideWidth ];44 }45

46 int totalDiagonals = boxDim *2 - 1;47

48 // pre -calculate index values for y/boundaryX49 int boundaryXOffsetRead = n*(boxJ -1) + boxDim * boxI;50 int boundaryXOffsetWrite = boundaryXOffsetRead + n;51

55

B. KERNEL SOURCE CODE

52 int boundaryYOffsetRead = n*(boxI -1) + boxDim * boxJ;53 int boundaryYOffsetWrite = boundaryYOffsetRead + n;54

55 // sync all threads in block56 __syncthreads ();57

58 // calculate the cost values59 for (int slice = 0; slice < totalDiagonals; ++slice) {60

61 // for each stride62 for (int stride = 0; stride < totalstrides; ++ stride) {63 // update i,j64 int i = threadIdx.x + (stride*strideWidth);65 int j = slice - i;66

67 if (!(j < 0 || j >= boxDim)) {68 // calculate69 int northWestValue , result;70 int northValue = j == 0 ? boundaryX[boundaryXOffsetRead + i] :

subMatrix[NOMATCH ][i];71 int westValue = i == 0 ? boundaryY[boundaryYOffsetRead + j] :

subMatrix[NOMATCH ][i - 1];72

73 if (j == 0) {74 // border to the north75 northWestValue = boundaryX[boundaryXOffsetRead + i - 1];76 } else if (i == 0) {77 // border to the west78 northWestValue = boundaryY[boundaryYOffsetRead + j - 1];79 } else {80 // not on border , read from own cost values81 northWestValue = subMatrix[MATCH][i - 1];82 }83

84 result = max(85 westValue ,86 northWestValue + (yStr_shared[j] == xStr_shared[i])87 );88 result = max(northValue , result);89

90 subMatrix[RESULT ][i] = result;91 // end of calculation92

93 // on south/east edge? Save in output boundary94 if (j == boxDim - 1) {95 // south edge. Note: corner is only saved here and not in

boundaryY96 boundaryX[boundaryXOffsetWrite + i] = subMatrix[RESULT ][i];97 } else if (i == boxDim - 1) {98 // east edge99 boundaryY[boundaryYOffsetWrite + j] = subMatrix[RESULT ][i];

100 }101 }102 }103

104 // memory wrap around105 tmpPointer = subMatrix [2];106 subMatrix [2] = subMatrix [0];107 subMatrix [0] = subMatrix [1];108 subMatrix [1] = tmpPointer;109

110 __syncthreads ();111 }112 }

56

B.1. Forward pass kernels

B.1.2 SKERNELSHARED


10 * @param boxDim a multiple of #threads per block11 */12 __global__13 __launch_bounds__(MAXTHREADSPERBLOCK)14 void kernel_lcs_fp_wave_boundary_striving_globalStrings(const int2 *boxArray

, char const * const xStr , char const * const yStr , int * constboundaryX , int * const boundaryY , int n, int boxDim) {

15


19 extern __shared__ int dynSmemWOstringsGeneral [];20

21 int *subMatrix [3] = {22 &dynSmemWOstringsGeneral [0],23 &dynSmemWOstringsGeneral[boxDim],24 &dynSmemWOstringsGeneral[boxDim *2]25 };26 int *tmpPointer;27

28 int totalstrides = (boxDim/blockDim.x);29 volatile int strideWidth = boxDim/totalstrides;30 volatile int totalDiagonals = boxDim *2 - 1;31

32 // pre -calculate index values for y/boundaryX33 volatile int boundaryXOffsetRead = n*(boxJ -1) + boxDim * boxI;34 int boundaryXOffsetWrite = boundaryXOffsetRead + n;35 volatile int boundaryYOffsetRead = n*(boxI -1) + boxDim * boxJ;36 int boundaryYOffsetWrite = boundaryYOffsetRead + n;37

38 // no sync , needed39 // calculate the cost matrix40 for (int slice = 0; slice < totalDiagonals; ++slice) {41

42 // for each stride43 for (int stride = 0; stride < totalstrides; ++ stride) {44 // update i,j45 int i = threadIdx.x + (stride*strideWidth);46 int j = slice - i;47

48 // from here the kernel is the same as kernel_lcs_fp_01_wave_boundary49 if (!(j < 0 || j >= boxDim)) {50 // calculate51 int northWestValue , result;52 int northValue = j == 0 ? boundaryX[boundaryXOffsetRead + i] :

subMatrix[NOMATCH ][i];53 int westValue = i == 0 ? boundaryY[boundaryYOffsetRead + j] :

subMatrix[NOMATCH ][i - 1];54

55 if (j == 0) {56 // border to the north57 northWestValue = boundaryX[boundaryXOffsetRead + i - 1];58 } else if (i == 0) {59 // border to the west60 northWestValue = boundaryY[boundaryYOffsetRead + j - 1];61 } else {62 // not on border , read from own cost values63 northWestValue = subMatrix[MATCH][i - 1];64 }

57


65

66 result = max(67 westValue ,68 northWestValue + (yStr[boxJ*boxDim + j] == xStr[boxI*boxDim + i])69 );70 result = max(northValue , result);71


75 // on south/east edge? Save in output boundary76 if (j == boxDim - 1) {77 // south edge. Corner is only saved here and not in boundaryY78 boundaryX[boundaryXOffsetWrite + i] = subMatrix[RESULT ][i];79 } else if (i == boxDim - 1) {80 // east edge81 boundaryY[boundaryYOffsetWrite + j] = subMatrix[RESULT ][i];82 }83 }84 }85


92 __syncthreads ();93 }94 }

58


B.1.3 SKERNELSHARED


10 * @param boxDim a multiple of #threads per block11 */12 __global__13 __launch_bounds__(MAXTHREADSPERBLOCK)14 void kernel_lcs_fp_wave_boundary_striving_scaling(const int2 *boxArray , char

const * const xStr , char const * const yStr , int * const boundaryX ,int * const boundaryY , int n, int boxDim) {

15


19 extern __shared__ char dynSmemWstringsSpecialized [];20

21 char * const xStr_shared = &dynSmemWstringsSpecialized [0];22 char * const yStr_shared = &dynSmemWstringsSpecialized[boxDim ];23 diag_t* diag_base = (diag_t *)&dynSmemWstringsSpecialized[boxDim *2];24

25 diag_t *subMatrix [3] = {26 &diag_base [0],27 &diag_base[boxDim],28 &diag_base [2* boxDim]29 };30 diag_t *tmpPointer;31

32 int totalStrides = boxDim/blockDim.x;33 int strideWidth = boxDim/totalStrides;34 volatile int totalDiagonals = boxDim *2 - 1;35

36 // Copy X-string and Y-string needed for the current tile37 for (int stride = 0; stride < totalStrides; ++ stride) {38 xStr_shared[threadIdx.x+stride*strideWidth] =39 xStr[boxI*boxDim + threadIdx.x+stride*strideWidth ];40 yStr_shared[threadIdx.x+stride*strideWidth] =41 yStr[boxJ*boxDim + threadIdx.x+stride*strideWidth ];42 }43


50 // scaling51 int resultScaling = (boxJ ==0) ? 0 : boundaryX[boundaryXOffsetRead ];52

53 // sync all threads in block54 __syncthreads ();55

56 // calculate the cost matrix57 for (int slice = 0; slice < totalDiagonals; ++slice) {58

59 // for each stride60 //#pragma unroll 861 for (int stride = 0; stride < totalStrides; ++ stride) {62 // update i,j63 int i = threadIdx.x + (stride*strideWidth);64 int j = slice - i;65

66 if(!(j<0 || j>= boxDim)) {

59


67 int northWestValue , result;68 int northValue = j==0 ? boundaryX[boundaryXOffsetRead+i] -

resultScaling : subMatrix[NOMATCH ][i]; // scaling69 int westValue = i==0 ? boundaryY[boundaryYOffsetRead+j] -

resultScaling : subMatrix[NOMATCH ][i-1]; // scaling70

71 if(j==0) {72 // border to the north73 northWestValue = boundaryX[boundaryXOffsetRead+i-1] -

resultScaling; // scaling74 } else if(i==0) {75 // border to the west76 northWestValue = boundaryY[boundaryYOffsetRead+j-1] -

resultScaling; // scaling77 } else {78 // not on border , read from own cost values79 northWestValue = subMatrix[MATCH][i-1];80 }81

82 result = max(83 westValue , northWestValue + (yStr_shared[j] == xStr_shared[i])84 );85 result = max(northValue , result);86


90 // on south/east edge? Save in output boundary91 if(j == boxDim -1) {92 // south edge , corner is only saved here and not in boundaryY93 boundaryX[boundaryXOffsetWrite+i] =94 subMatrix[RESULT ][i] + resultScaling; // scaling95 } else if(i == boxDim -1) {96 // east edge97 boundaryY[boundaryYOffsetWrite+j] =98 subMatrix[RESULT ][i] + resultScaling; // scaling99 }

100 }101 }102


109 __syncthreads ();110 }111 }

60


B.1.4 SKERNELGLOBAL


10 * @param boxDim a multiple of #threads per block11 */12 __global__13 __launch_bounds__(MAXTHREADSPERBLOCK)14 void kernel_lcs_fp_wave_boundary_striving_scaling_globalStrings(const int2 *

boxArray , char const * const xStr , char const * const yStr , int * constboundaryX , int * const boundaryY , int n, int boxDim) {

15


19 extern __shared__ diag_t dynSmemWOstringsSpecialized [];20

21 diag_t *subMatrix [3] = {22 &dynSmemWOstringsSpecialized [0],23 &dynSmemWOstringsSpecialized[boxDim],24 &dynSmemWOstringsSpecialized[boxDim *2]25 };26 diag_t *tmpPointer;27

28 int totalStrides = boxDim/blockDim.x;29 volatile int strideWidth = boxDim/totalStrides;30 volatile int totalDiagonals = boxDim *2 - 1;31


38 // scaling39 int resultScaling = (boxJ ==0) ? 0 : boundaryX[boundaryXOffsetRead ];40

41 // no sync needed42 // calculate the cost matrix43 for (int slice = 0; slice < totalDiagonals; ++slice) {44

45 // for each stride46 // stride counter is char , fix for register usage47 //#pragma unroll 848 for (char stride = 0; stride < totalStrides; ++ stride) {49

50 // update i,j51 int i = threadIdx.x + (stride*strideWidth);52 int j = slice - i;53

54 if(!(j<0 || j>= boxDim)) {55 // calculate56 int northWestValue , result;57 int northValue = j==0 ? boundaryX[boundaryXOffsetRead+i] -

resultScaling : subMatrix[NOMATCH ][i]; // scaling58 int westValue = i==0 ? boundaryY[boundaryYOffsetRead+j] -

resultScaling : subMatrix[NOMATCH ][i-1]; // scaling59

60 if(j==0) {61 // border to the north62 northWestValue = boundaryX[boundaryXOffsetRead+i-1] -

resultScaling; // scaling63 } else if(i==0) {

61


64 // border to the west65 northWestValue = boundaryY[boundaryYOffsetRead+j-1] -

resultScaling; // scaling66 } else {67 // not on border , read from own cost values68 northWestValue = subMatrix[MATCH][i-1];69 }70

71 result = max(72 westValue ,73 northWestValue +74 (yStr[boxJ*boxDim + j] == xStr[boxI*boxDim + i])75 );76 result = max(northValue , result);77


81 // on south/east edge? Save in output boundary82 if(j == boxDim -1) {83 // south edge , corner is only saved here and not in boundaryY84 boundaryX[boundaryXOffsetWrite+i] =85 subMatrix[RESULT ][i] + resultScaling; // scaling86 } else if(i == boxDim -1) {87 // east edge88 boundaryY[boundaryYOffsetWrite+j] =89 subMatrix[RESULT ][i] + resultScaling; // scaling90 }91 }92 }93


100 __syncthreads ();101 }102 }

62


B.1.5 BPKERNEL

1 /**2 * Computes the backward pass using boundaries and pinned memory.3 *4 * @param xStr X’5 * @param yStr Y’6 * @param subBoundaryX boundaries in X direction7 * @param subBoundaryY boundaries in Y direction8 * @param boxDim a multiple of #threads per block9 * @param subBoxDim equal to #threads per block

10 * @param hostLcs pinned memory for the LCS string11 * @param globalLcsIndex index of current LCS char12 * @param current_trace where are we in the global cost matrix13 * @param current_box where are we in the current tile14 */15 // Defines for bank conflict fix16 #define JUMPCNT 4 // Number of entries within one bank.17 #define WARPLENGTH 32 // number of threads in a warp18 __global__ void kernel_lcs_bp( char *xStr , char *yStr , int *subBoundaryX ,

int *subBoundaryY , int boxDim , int subBoxDim , char *hostLcs , int *globalLcsIndex , int2 *current_trace , int2 *current_box ) {

19

20 // Bank conflict mapping , mod operations can be optimized by compiler.21

22 /* index of a warp within a warpCollection */23 int subWarpIndex = (threadIdx.x/WARPLENGTH)%JUMPCNT;24

25 // collection index. one collection calcs i=0.. warpsize*entriesPerBank26 int warpCollectionIdx = threadIdx.x/( WARPLENGTH*JUMPCNT);27

28 // all threads IDs are relative to a warp29 int warpIndex = threadIdx.x%WARPLENGTH;30 int i = (warpIndex*JUMPCNT)+subWarpIndex + warpCollectionIdx *128;31

32 /* bank conflicts fixed */33

34 // shared y/x strings with the width of a subBoxDim35 __shared__ char xStr_shared[MAX_BACKWARD_PASS_N ];36 __shared__ char yStr_shared[MAX_BACKWARD_PASS_N ];37

38 // DPM for the current subBox39 __shared__ char subMatrix[MAX_BACKWARD_PASS_N*MAX_BACKWARD_PASS_N ];40

41 int totalDiagonals = subBoxDim *2 - 1;42 int subBoundaryXOffsetRead;43 int subBoundaryYOffsetRead;44

45 // index to current subBox46 __shared__ int2 subBox; // volatile47 int2 subTrace;48

49 int lcsIndex = *globalLcsIndex;50

51 // initialize thread 0 variables52 if (i == 0) {53 subBox.x = floor( (float) (current_trace ->x % boxDim) / subBoxDim );54 subBox.y = floor( (float) (current_trace ->y % boxDim) / subBoxDim );55

56 subTrace.x = current_trace ->x % subBoxDim;57 subTrace.y = current_trace ->y % subBoxDim;58 }59

60 __syncthreads (); // make sure all is in sync before the while loop61

62 // loop until we are outside the matrix63 while(subBox.x >= 0 && subBox.y >= 0) {64 // Copy X-string and Y-string needed for the current box65 xStr_shared[i] = xStr[subBox.x*subBoxDim + i];66 yStr_shared[i] = yStr[subBox.y*subBoxDim + i];

63


67

68 // index to sub boundary saved in the forward pass for the currentsubBox

69 subBoundaryXOffsetRead = boxDim *( subBox.y-1) + subBoxDim * subBox.x;70 subBoundaryYOffsetRead = boxDim *( subBox.x-1) + subBoxDim * subBox.y;71

72 int resultScaling = subBoundaryX[subBoundaryXOffsetRead ];73

74 // sync all threads in tile75 __syncthreads ();76

77 // calculate the sub -matrix for the current subBox78 for (int slice = 0; slice < totalDiagonals; ++slice) {79 int j = slice - i;80 int resultOffset = j*subBoxDim + i;81

82 if(!(j<0 || j>= subBoxDim)) {83 int northWestValue , result;84 int northValue = j == 085 ? subBoundaryX[subBoundaryXOffsetRead + i] - resultScaling86 : subMatrix[resultOffset - subBoxDim ];87

88 int westValue = i == 089 ? subBoundaryY[subBoundaryYOffsetRead + j] - resultScaling90 : subMatrix[resultOffset - 1];91

92 if(j==0) {93 // border to the north94 northWestValue = subBoundaryX[subBoundaryXOffsetRead+i-1] -

resultScaling;95

96 // boxJ >0 could be removed by adding one extra value to boundaryY97 northWestValue = (i == 0 && subBox.x==0 && subBox.y>0)98 ? subBoundaryY[subBoundaryYOffsetRead+j-1] - resultScaling99 : northWestValue;

100

101 } else if(i==0) {102 // border to the west103 northWestValue = subBoundaryY[subBoundaryYOffsetRead+j-1] -

resultScaling;104 } else {105 // not on border , read from own sub -matrix values106 northWestValue = subMatrix[resultOffset -subBoxDim -1];107 }108

109 result = max(110 westValue ,111 northWestValue + (yStr_shared[j] == xStr_shared[i])112 );113 result = max(northValue , result);114

115 subMatrix[resultOffset] = result;116

117 }118

119 __syncthreads (); // sync all threads after each diagonal120 }121

122 // sub -matrix is done , do backward pass using one thread123

124 // one thread backtrace125 if (i == 0) {126 // backtrace127

128 while(subTrace.x >= 0 && subTrace.y >=0) {129 // inside tile130

131 if(yStr_shared[subTrace.y] == xStr_shared[subTrace.x]){132 // match , save result , local and global133 hostLcs[lcsIndex] = yStr_shared[subTrace.y];

64


134 lcsIndex --;135 subTrace.x--;136 subTrace.y--;137 } else {138 // no match , go north or west?139 int northValue = subTrace.y == 0 ? subBoundaryX[

subBoundaryXOffsetRead + subTrace.x] - resultScaling :subMatrix [( subTrace.y-1)*subBoxDim + subTrace.x]; // scaling

140 int westValue = subTrace.x == 0 ? subBoundaryY[subBoundaryYOffsetRead + subTrace.y] - resultScaling :subMatrix[subTrace.y*subBoxDim + subTrace.x-1]; //scaling

141

142 if(northValue >= westValue) {143 subTrace.y--; // go north144 } else {145 subTrace.x--; // go west146 }147 } // end if148 } // while end (outside tile)149

150 // done , on border - update boxI/J and subTrace.x/J151 if(subTrace.x==-1) {152 // go west153 subBox.x--;154 // flip trace I155 subTrace.x = subBoxDim -1;156 }157

158 if(subTrace.y==-1) {159 // go north160 subBox.y--;161 // flip trace J162 subTrace.y = subBoxDim -1;163 }164

165 } // one thread is done with the trace for current subTile166

167 __syncthreads (); // backward trace done for current subTile , sync168 } // all is done169

170 // thread 0 updates global lcs index and pinned traces171 if(i==0) {172 *globalLcsIndex = lcsIndex;173

174 // update current trace based on subTrace and box...175 current_trace ->x =176 (current_box ->x*boxDim + subBox.x*subBoxDim + subTrace.x);177 current_trace ->y =178 (current_box ->y*boxDim + subBox.y*subBoxDim + subTrace.y);179 }180 }

65

algorithms for string comparison on gpus

Documents

lddp problems

common subsequence

lse lddpproblemer

af et layouts egenskaber

sammenligning af strenge

different layouts

toanalyze layouts

tidligere layouts