Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance
Abdullah Gharaibeh Matei Ripeanu
NetSysLab, The University of British Columbia
2
GPUs offer different characteristics
High peak compute power
High communication overhead
High peak memory bandwidth
Limited memory space
Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms
3
Motivating Question: How should we design applications to efficiently exploit GPU characteristics?
Context: A bioinformatics problem: Sequence Alignment
A string matching problem
Data intensive (10^2 GB)
4
Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
~4x speedup (end-to-end) compared to the CPU version
Hypothesis: a mismatch between the core data structure (suffix tree) and GPU characteristics

[Figure: runtime breakdown (%), showing > 50% overhead]
5
Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
4x speedup compared to the suffix tree-based GPU version
Idea: trade-off time for space
Consequences:
  Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  Focus shifts toward optimizing the compute stage
  Significant overhead reduction
6
Outline
Sequence alignment: background and offloading to GPU
Space/Time trade-off analysis
Evaluation
7
Background: the sequence alignment problem

Find where each query most likely originated from
  Queries: 10^8 queries, 10^1 to 10^2 symbols per query
  Reference: 10^6 to 10^11 symbols

[Figure: short query reads (CCAT, GGCT, TAGGC, ...) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]
8
GPU Offloading: opportunity and challenges
Opportunity:
  Sequence alignment: easy to partition, memory intensive
  GPU: massively parallel, high memory bandwidth

Challenges:
  Data intensive, large output size
  Limited memory space
  No direct access to other I/O devices (e.g., disk)
9
GPU Offloading: addressing the challenges
subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}
High-level algorithm (executed on the host)
• Data-intensive problem and limited memory space → divide and compute in rounds
• Large output size → compressed output representation (decompress on the CPU)
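The divide-and-compute-in-rounds loop above can be sketched as runnable host-side code. This is a CPU mock, not the original implementation: match() stands in for the GPU MatchKernel, and all names and chunk sizes are illustrative. Sub-references overlap by (max query length - 1) symbols so matches spanning a chunk boundary are not lost.

```python
def match(subqrys, subref, offset):
    # Stand-in for MatchKernel: report (query, absolute position) for
    # every exact occurrence of each query in this sub-reference.
    hits = set()
    for q in subqrys:
        pos = subref.find(q)
        while pos != -1:
            hits.add((q, offset + pos))
            pos = subref.find(q, pos + 1)
    return hits

def align(ref, qrys, ref_chunk, qry_chunk):
    max_q = max(len(q) for q in qrys)
    results = set()
    for i in range(0, len(qrys), qry_chunk):         # rounds over query subsets
        subqrys = qrys[i:i + qry_chunk]
        for start in range(0, len(ref), ref_chunk):  # rounds over sub-references
            # Overlap by max_q - 1 symbols to catch boundary-spanning matches;
            # the set deduplicates hits found in the overlap region.
            subref = ref[start:start + ref_chunk + max_q - 1]
            results |= match(subqrys, subref, start)
    return sorted(results)

ref = "CCATAGGCTATATGCGCCCTA"
print(align(ref, ["CAT", "GC", "TA"], ref_chunk=6, qry_chunk=2))
```

The result is independent of the chunking, which is the property the rounds scheme relies on.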
10
Space/Time Trade-off Analysis
11
The core data structure: a massive number of queries and a long reference => pre-process the reference into an index

[Figure: suffix tree of the reference TACACA$, leaves labeled with suffix start positions 0-5]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  Search: O(qry_len) per query
  Space: O(ref_len), but the constant is high: ~20x ref_len
  Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query
12
The suffix tree's costs map onto the algorithm stages:
  CopyToGPU / CopyFromGPU (data transfers): Expensive (the ~20x ref_len space constant means more data moved per round)
  Decompress (post-processing): Expensive (O(4^(qry_len - min_match_len)) DFS traversal per query)
  MatchKernel (search): Efficient (O(qry_len) per query)
13
A better matching data structure
[Figure: suffix tree (left) and suffix array (right) built over the reference TACACA$]

Suffix Array:
  0  A$
  1  ACA$
  2  ACACA$
  3  CA$
  4  CACA$
  5  TACACA$

               Suffix Tree                      Suffix Array
Space          O(ref_len), ~20 x ref_len        O(ref_len), ~4 x ref_len
Search         O(qry_len)                       O(qry_len x log ref_len)
Post-process   O(4^(qry_len - min_match_len))   O(qry_len - min_match_len)
Impact 1: reduced communication
Less data to transfer
14
A better matching data structure
Impact 2: better data locality is achieved at the cost of additional per-thread processing time
Space for longer sub-references => fewer processing rounds
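To make "fewer processing rounds" concrete, a back-of-the-envelope count using numbers that appear elsewhere in the deck (the ~238M-symbol HS1 reference, the 512MB GeForce card, and the ~20x vs ~4x space constants). This ignores the memory needed for queries and results, so real round counts are higher; it is an illustration, not a measurement.

```python
import math

def rounds(ref_len, bytes_per_symbol, gpu_mem_bytes):
    # Sub-reference rounds needed to stream the whole index through a GPU
    # that holds gpu_mem_bytes of index at a time.
    return math.ceil(ref_len * bytes_per_symbol / gpu_mem_bytes)

MEM = 512 * 1024 * 1024              # 512MB card
print(rounds(238_000_000, 20, MEM))  # suffix tree:  9 rounds
print(rounds(238_000_000, 4, MEM))   # suffix array: 2 rounds
```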
15
A better matching data structure
Impact 3: lower post-processing overhead
16
Evaluation
17
Evaluation setup
Workload / Species              Ref. length   # queries   Avg. read length
HS1 - Human (chromosome 2)      ~238M         ~78M        ~200
HS2 - Human (chromosome 3)      ~100M         ~2M         ~700
MONO - L. monocytogenes         ~3M           ~6M         ~120
SUIS - S. suis                  ~2M           ~26M        ~36

Testbed:
  Low-end: GeForce 9800 GX2 GPU (512MB)
  High-end: Tesla C1060 (4GB)
Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
Success metrics: performance, energy consumption
Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)
18
Speedup: array-based over tree-based
19
Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M ref. length, on the GeForce
20
Summary
  GPUs have drastically different performance characteristics
  Reconsidering the choice of data structure is necessary when porting applications to the GPU
  A good matching data structure ensures:
    Low communication overhead
    Data locality (may be achieved at the cost of additional per-thread processing time)
    Low post-processing overhead
21
Code available at: netsyslab.ece.ubc.ca