Algorithms for String Comparison on GPUs

Post on 17-Mar-2016

DESCRIPTION

A novel approach for solving any large pairwise local dependency dynamic programming problem using graphics processing units (GPUs). Our results include a new superior layout for utilizing the coarse-grained GPU parallelism.

TRANSCRIPT

  • ALGORITHMS FOR STRING COMPARISON ON GPUS

    Kenneth Skovhus Andersen, s062390
    Lasse Bach Nielsen, s062377

    Technical University of Denmark

    Informatics and Mathematical Modelling

    Supervisors: Inge Li Gørtz & Philip Bille

    August, 2012

  • DTU Informatics
    Department of Informatics and Mathematical Modeling
    Technical University of Denmark
    Asmussens Allé, Building 305, DK-2800 Lyngby, Denmark
    Phone +45 4525 3351, Fax +45 4588 2673
    reception@imm.dtu.dk
    www.imm.dtu.dk

  • ABSTRACT

    We consider parallelization of string comparison algorithms, including sequence alignment, edit distance and longest common subsequence. These problems are all solvable using essentially the same dynamic programming scheme over a two-dimensional matrix, where an entry locally depends on neighboring entries. We generalize this set of problems as local dependency dynamic programming (LDDP).

    We present a novel approach for solving any large pairwise LDDP problem using graphics processing units (GPUs). Our results include a new superior layout for utilizing the coarse-grained parallelism of the many-core GPU. The layout performs up to 18% better than the most widely used layout. To analyze layouts, we have devised theoretical descriptions, which accurately predict the relative speedup between different layouts on the coarse-grained parallel level of GPUs.

    To evaluate the potential of solving LDDP problems on GPU hardware, we implement an algorithm for solving longest common subsequence. In our experiments we compare large biological sequences, each consisting of two million symbols, and show a 40X speedup compared to a state-of-the-art sequential CPU solution by Driga et al. Our results can be generalized on several levels of parallel computation using multiple GPUs.

  • RESUME

    We consider parallelization of algorithms for comparing strings, including sequence alignment, edit distance and longest common subsequence. These problems can all be solved with a two-dimensional dynamic programming matrix with local dependencies. We generalize these problems as local dependency dynamic programming (LDDP).

    We present a new approach for solving large pairwise LDDP problems with graphics processors (GPUs). Furthermore, we have developed a new layout for utilizing the multiprocessors of the GPU. Our new layout improves the running time by up to 18% compared to previous layouts. To analyze the properties of a layout, we have developed theoretical descriptions that accurately predict the relative running time improvement between different layouts.

    To assess the potential of the GPU for solving LDDP problems, we have implemented an algorithm that solves longest common subsequence. In our experiments we compare long biological sequences, each consisting of two million symbols. We show a speedup of more than 40X compared to a state-of-the-art sequential CPU solution by Driga et al. Our results can be generalized to several levels of parallelism by using multiple GPUs.

  • PREFACE

    This master's thesis has been prepared at DTU Informatics at the Technical University of Denmark from February to August 2012 under supervision of associate professors Inge Li Gørtz and Philip Bille. It has an assigned workload of 30 ECTS credits for each of the two authors.

    The thesis deals with the subject of local dependency dynamic programming algorithms for solving large-scale string comparison problems on modern graphics processing units (GPUs). The focus is to investigate, combine and further develop existing state-of-the-art algorithms.

    Acknowledgments

    We would like to thank our supervisors for their guidance during the project. A special thanks to PhD student Morten Stöckel at the IT University of Copenhagen for providing the source code for sequential string comparison algorithms [1] and PhD student Hjalte Wedel Vildhøj at DTU Informatics for his valuable feedback.

    Lasse Bach Nielsen Kenneth Skovhus Andersen

    August, 2012

  • CONTENTS

    Abstract
    Resume
    Preface

    1 Introduction
        1.1 This Report
        1.2 Previous Work
        1.3 Our Results

    2 Local Dependency Dynamic Programming
        2.1 Definitions
        2.2 Previous Results
        2.3 Our Approach Based on Previous Results

    3 Graphics Processing Units
        3.1 GPU Architecture
        3.2 Memory
        3.3 Best Practices

    4 Parallel Layouts for LDDP
        4.1 Diagonal Wavefront
        4.2 Column-cyclic Wavefront
        4.3 Diagonal-cyclic Wavefront
        4.4 Applying Layouts to the GPU Architecture
        4.5 Summary and Discussion

    5 Implementing LDDP on GPUs
        5.1 Grid-level
        5.2 Thread-level
        5.3 Space Constraints

    6 Experimental Results for Grid-level
        6.1 Setup
        6.2 Results

    7 Experimental Results for Thread-level
        7.1 Setup
        7.2 Results for Forward Pass Kernels
        7.3 Results for Backward Pass Kernel
        7.4 Part Conclusion

    8 Performance Evaluation
        8.1 The Potential of Solving LDDP Problems on GPUs
        8.2 Comparing to Similar GPU Solutions

    9 Conclusion
        9.1 Future Work

    Bibliography

    Appendices

    A NVIDIA GPU Data Sheets
        A.1 NVIDIA Tesla C2070
        A.2 NVIDIA GeForce GTX 590

    B Kernel Source Code
        B.1 Forward pass kernels

  • 1 INTRODUCTION

    We revisit the classic algorithmic problem of comparing strings, including solving sequence alignment, edit distance and finding the longest common subsequence. In many textual information retrieval systems, the exact comparison of large-scale strings is an important but very time-consuming task. As an example, the exact alignment of huge biological sequences, such as genes and genomes, has previously been infeasible due to computing and memory requirements. Consequently, much research effort has been invested in faster heuristic algorithms¹ for sequence alignment. Although these methods are faster than exact methods, they come at the cost of sensitivity. However, the rise of new parallel computing platforms, such as graphics processing units, may change this scenario.

    Graphics processing units (GPUs) are designed for graphics applications. They offer a large degree of data parallelism using hundreds of cores, and are designed to solve multiple independent parallel tasks. Previous results for accelerating sequence alignment using GPUs show a significant speedup, but are currently focused on aligning many independent short sequences, a problem where the GPU architecture excels. Our focus is on the need to solve large-scale exact pairwise string comparison of biological sequences containing millions of symbols.

    Our work is motivated by the increasing power of GPUs, and the challenge of making exact comparison of large strings feasible. We consider parallelization of a general set of pairwise string comparison algorithms, all solvable using essentially the same dynamic programming scheme over a two-dimensional matrix. Taking two input strings X and Y of the same length n, these problems can be solved by computing all entries in an n × n matrix using a specific cost function. Computation of an entry in the matrix depends on the neighboring entries. We generalize all these problems as local dependency dynamic programming (LDDP). In general, LDDP problems are not trivially solved in parallel, as the local dependencies give a varying degree of parallelism across the entire problem space.
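
    As an illustration of the scheme described above, the following is a minimal sequential sketch for one LDDP problem, longest common subsequence. It assumes the usual dependency pattern in which an entry reads only its left, upper and upper-left neighbors. The function names lcs_length and antidiagonal_sizes are hypothetical and are not taken from the thesis. The second function counts the entries on each anti-diagonal of an n × n matrix, which under this assumption is the number of entries that could be computed independently at each step.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (sequential sketch)."""
    n, m = len(x), len(y)
    # Row 0 and column 0 are the boundary for empty prefixes.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Match: extend the result of the upper-left neighbor.
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                # Mismatch: take the better of the upper and left neighbors.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[n][m]


def antidiagonal_sizes(n):
    """Number of entries on each of the 2n - 1 anti-diagonals of an n x n matrix."""
    return [min(d, 2 * n - d) for d in range(1, 2 * n)]


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"))
    print(antidiagonal_sizes(4))
```

    The anti-diagonal sizes grow from one entry to n and shrink back to one, which is one way to picture the varying degree of parallelism across the problem space.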

    The parallelism of a GPU is exposed as a coarse-grained grid of blocks, where each block consists of finer-grained threads. We call these levels of parallelism the grid-level and the thread-level. We focus on layouts as a means to describe how LDDP problems can be mapped to the different levels on the GPU.
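
    To make the two levels concrete, the sketch below simulates them sequentially in Python for longest common subsequence. It is not GPU code and not the layout proposed in the thesis. It assumes a plain diagonal wavefront over square tiles, where each tile stands in for the work of one block and the inner loops stand in for the work of its threads. The function name tiled_lcs_length and the tile parameter are hypothetical.

```python
def tiled_lcs_length(x, y, tile=4):
    """LCS length computed tile by tile, in diagonal wavefront order.

    Returns the LCS length and the schedule: one list of (tile_row, tile_col)
    pairs per wavefront step. Tiles within a step do not depend on each other.
    """
    n, m = len(x), len(y)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    rows = (n + tile - 1) // tile  # number of tile rows
    cols = (m + tile - 1) // tile  # number of tile columns
    schedule = []
    for d in range(rows + cols - 1):
        # Grid-level: all tiles (r, k) with r + k == d form one step.
        step = [(r, d - r) for r in range(rows) if 0 <= d - r < cols]
        schedule.append(step)
        for r, k in step:
            # Thread-level stand-in: fill the entries of this tile.
            for i in range(r * tile + 1, min((r + 1) * tile, n) + 1):
                for j in range(k * tile + 1, min((k + 1) * tile, m) + 1):
                    if x[i - 1] == y[j - 1]:
                        c[i][j] = c[i - 1][j - 1] + 1
                    else:
                        c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[n][m], schedule


if __name__ == "__main__":
    length, schedule = tiled_lcs_length("AGGTAB", "GXTXAYB", tile=2)
    print(length, [len(step) for step in schedule])
```

    Each tile only reads entries from tiles above it, to its left, or diagonally above and to the left, all of which belong to earlier steps, so the tiles within one step could in principle be assigned to different blocks.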

    ¹ One of the first heuristic algorithms for sequence alignment was FASTA, presented by Pearson and Lipman in 1988 [2].

    1.1 This Report

    We start by presenting a short description of our work. The following chapter gives a theoretical introduction to LDDP problems, including a survey of previous sequential and parallel solutions. Based on this, we sel