Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance
Abdullah Gharaibeh Matei Ripeanu
NetSysLab, The University of British Columbia
2
GPUs offer different characteristics
High peak compute power
High communication overhead
High peak memory bandwidth
Limited memory space
Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms
3
Motivating Question: How should we design applications to efficiently exploit GPU characteristics?
Context: A bioinformatics problem: Sequence Alignment
A string matching problem
Data intensive (10^2 GB)
4
Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
~4x speedup (end-to-end) compared to the CPU version
Hypothesis: a mismatch between the core data structure (suffix tree) and GPU characteristics

[Figure: runtime breakdown (%), showing > 50% overhead]
5
Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
4x speedup compared to the suffix tree-based GPU version
Idea: trade-off time for space
Consequences:
  Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  Focus shifts toward optimizing the compute stage
  Significant overhead reduction
6
Outline
Sequence alignment: background and offloading to GPU
Space/Time trade-off analysis
Evaluation
7
Background: the sequence alignment problem

Find where each query most likely originated from
  Queries: 10^8 queries, 10^1 to 10^2 symbols per query
  Reference: 10^6 to 10^11 symbols

[Figure: short query reads (CCAT, GGCT, TAGGC, ...) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]
8
GPU Offloading: opportunity and challenges
Opportunity:
  Sequence alignment: easy to partition, memory intensive
  GPU: massively parallel, high memory bandwidth

Challenges:
  Data intensive, large output size
  Limited memory space
  No direct access to other I/O devices (e.g., disk)
9
GPU Offloading: addressing the challenges
subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}
High-level algorithm (executed on the host)
• Data-intensive problem and limited memory space → divide and compute in rounds
• Large output size → compressed output representation (decompress on the CPU)
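The divide-and-compute-in-rounds loop above can be sketched as runnable host-side code. This is a CPU mock, not the original implementation: match() stands in for the GPU MatchKernel, and all names and chunk sizes are illustrative. Sub-references overlap by (max query length - 1) symbols so matches spanning a chunk boundary are not lost.

```python
def match(subqrys, subref, offset):
    # Stand-in for MatchKernel: report (query, absolute position) for
    # every exact occurrence of each query in this sub-reference.
    hits = set()
    for q in subqrys:
        pos = subref.find(q)
        while pos != -1:
            hits.add((q, offset + pos))
            pos = subref.find(q, pos + 1)
    return hits

def align(ref, qrys, ref_chunk, qry_chunk):
    max_q = max(len(q) for q in qrys)
    results = set()
    for i in range(0, len(qrys), qry_chunk):         # rounds over query subsets
        subqrys = qrys[i:i + qry_chunk]
        for start in range(0, len(ref), ref_chunk):  # rounds over sub-references
            # Overlap by max_q - 1 symbols to catch boundary-spanning matches;
            # the set deduplicates hits found in the overlap region.
            subref = ref[start:start + ref_chunk + max_q - 1]
            results |= match(subqrys, subref, start)
    return sorted(results)

ref = "CCATAGGCTATATGCGCCCTA"
print(align(ref, ["CAT", "GC", "TA"], ref_chunk=6, qry_chunk=2))
```

The result is independent of the chunking, which is the property the rounds scheme relies on.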
10
Space/Time Trade-off Analysis
11
The core data structure: a massive number of queries and a long reference => pre-process the reference into an index

[Figure: suffix tree of the reference TACACA$, leaves labeled with suffix start positions 0-5]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  Search: O(qry_len) per query
  Space: O(ref_len), but the constant is high: ~20x ref_len
  Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query
12
The suffix tree's costs map onto the algorithm stages:
  CopyToGPU / CopyFromGPU (data transfers): Expensive (the ~20x ref_len space constant means more data moved per round)
  Decompress (post-processing): Expensive (O(4^(qry_len - min_match_len)) DFS traversal per query)
  MatchKernel (search): Efficient (O(qry_len) per query)
13
A better matching data structure
[Figure: suffix tree (left) and suffix array (right) built over the reference TACACA$]

Suffix Array:
  0  A$
  1  ACA$
  2  ACACA$
  3  CA$
  4  CACA$
  5  TACACA$

               Suffix Tree                      Suffix Array
Space          O(ref_len), ~20 x ref_len        O(ref_len), ~4 x ref_len
Search         O(qry_len)                       O(qry_len x log ref_len)
Post-process   O(4^(qry_len - min_match_len))   O(qry_len - min_match_len)
Impact 1: reduced communication
Less data to transfer
14
A better matching data structure
Impact 2: better data locality is achieved at the cost of additional per-thread processing time
Space for longer sub-references => fewer processing rounds
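To make "fewer processing rounds" concrete, a back-of-the-envelope count using numbers that appear elsewhere in the deck (the ~238M-symbol HS1 reference, the 512MB GeForce card, and the ~20x vs ~4x space constants). This ignores the memory needed for queries and results, so real round counts are higher; it is an illustration, not a measurement.

```python
import math

def rounds(ref_len, bytes_per_symbol, gpu_mem_bytes):
    # Sub-reference rounds needed to stream the whole index through a GPU
    # that holds gpu_mem_bytes of index at a time.
    return math.ceil(ref_len * bytes_per_symbol / gpu_mem_bytes)

MEM = 512 * 1024 * 1024              # 512MB card
print(rounds(238_000_000, 20, MEM))  # suffix tree:  9 rounds
print(rounds(238_000_000, 4, MEM))   # suffix array: 2 rounds
```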
15
A better matching data structure
Impact 3: lower post-processing overhead
16
Evaluation
17
Evaluation setup
Workload / Species              Ref. length   # queries   Avg. read length
HS1 - Human (chromosome 2)      ~238M         ~78M        ~200
HS2 - Human (chromosome 3)      ~100M         ~2M         ~700
MONO - L. monocytogenes         ~3M           ~6M         ~120
SUIS - S. suis                  ~2M           ~26M        ~36

Testbed:
  Low-end: GeForce 9800 GX2 GPU (512MB)
  High-end: Tesla C1060 (4GB)
Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
Success metrics: performance, energy consumption
Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)
18
Speedup: array-based over tree-based
19
Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M ref. length, on the GeForce
20
Summary
  GPUs have drastically different performance characteristics
  Reconsidering the choice of data structure is necessary when porting applications to the GPU
  A good matching data structure ensures:
    Low communication overhead
    Data locality (may be achieved at the cost of additional per-thread processing time)
    Low post-processing overhead
21
Code available at: netsyslab.ece.ubc.ca