1
Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications
Matei Ripeanu
Networked Systems Laboratory (NetSysLab)
University of British Columbia
Joint work with: Abdullah Gharaibeh, Samer Al-Kiswany
2
A golf course …
… a (nudist) beach
(… and 199 days of rain each year)
Networked Systems Laboratory (NetSysLab), University of British Columbia
3
Hybrid architectures in Top 500 [Nov’10]
4
• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  [operated today at low overall efficiency]
• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures
5-11
Acknowledgement: Slides 5-11 borrowed from a presentation by Kayvon Fatahalian
12
Idea #3: Feed the cores with data
The processing elements are data hungry!
→ Wide, high-throughput memory bus
13
Idea #4: Hide memory access latency
→ Hardware-supported multithreading (10,000x parallelism!)
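Idea #4 can be made concrete with a back-of-envelope calculation. The numbers below are illustrative figures taken from this deck (roughly 600 cycles of global-memory latency, a few cycles of arithmetic per access, 448 cores); the function name is hypothetical:

```python
def threads_to_hide_latency(mem_latency_cycles, compute_cycles_per_access, cores):
    """Rough count of in-flight threads needed so cores never stall:
    while one thread waits mem_latency_cycles for memory, the hardware
    scheduler swaps in other threads that each do a few cycles of work."""
    threads_per_core = mem_latency_cycles // compute_cycles_per_access + 1
    return threads_per_core * cores

# With ~600-cycle latency and 448 cores, the hardware needs tens of
# thousands of threads in flight -- hence the "10,000x parallelism" claim.
print(threads_to_hide_latency(600, 4, 448))
```

This is the same reasoning as Little's law: concurrency needed equals latency times per-cycle demand.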
14
The Resulting GPU Architecture
[Figure: GPU architecture diagram. N multiprocessors, each with M cores, per-core registers, a shared memory, and an instruction unit; all multiprocessors access the global, texture, and constant memories; the GPU connects to the host machine's memory over PCIe.]

NVIDIA Tesla C2050: 448 cores

Four 'memories':
• Shared: fast (~4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
• Texture: read-only
• Constant: read-only

Hybrid: PCIe x16 link to host memory, ~4GB/s
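The gap between these two bandwidth figures is worth spelling out. A quick calculation, using only the numbers on this slide (~4GB/s PCIe, ~150GB/s global memory), shows why host-device transfers dominate:

```python
def transfer_time_s(num_bytes, bandwidth_gb_per_s):
    """Time to move num_bytes over a link of the given bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)

one_gb = 1e9
pcie_time = transfer_time_s(one_gb, 4)    # host <-> device over PCIe x16
gmem_time = transfer_time_s(one_gb, 150)  # streaming from global memory

# Moving a buffer across PCIe costs ~37x more than streaming it from
# device memory, so data should cross the bus as rarely as possible.
print(round(pcie_time / gmem_time, 1))  # 37.5
```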
15
GPU characteristics:
• High peak compute power
• High peak memory bandwidth
• High host-device communication overhead
• Limited memory space
• Complex to program (SIMD, co-processor model)
16
Roadmap: Two Projects

StoreGPU
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Context: porting a bioinformatics application (sequence alignment); a string-matching problem, data intensive (10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?
17
Computationally Intensive Operations in Distributed (Storage) Systems

Operations:
• Hashing
• Erasure coding
• Encryption/decryption
• Membership testing (Bloom filter)
• Compression

Techniques they enable:
• Similarity detection (deduplication)
• Content addressability
• Security
• Integrity checks
• Redundancy
• Load balancing
• Summary cache
• Storage efficiency

These operations are computationally intensive and limit performance.
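One of the operations above, membership testing with a Bloom filter (the basis of a summary cache), fits in a few lines. A minimal sketch; the bit-array size and hash count are illustrative, not those of any particular system:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k salted hashes set/test k bit positions."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes independent positions by salting one strong hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, item):
        # No false negatives; false positives possible with small probability.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("block-42")
print(bf.may_contain("block-42"))  # True
```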
18
Distributed Storage System Architecture
[Figure: architecture diagram. The client runs the application over an FS API; files are divided into a stream of blocks (b1, b2, b3, ..., bn); the application layer applies techniques to improve performance/reliability (de-duplication, security/integrity checks, redundancy) built on enabling operations (compression, encoding/decoding, encryption/decryption, hashing); an offloading layer directs these operations to the CPU or GPU. The client's access module talks to a metadata manager and storage nodes.]

MosaStore: http://mosastore.net
19
GPU-accelerated deduplication: design / prototype implementation that integrates similarity detection and GPU support
End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload
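The similarity-detection idea behind the deduplication prototype can be illustrated in a few lines. A minimal CPU-side sketch (real block sizes are far larger, and in the deck the hashing step is what gets offloaded to the GPU):

```python
import hashlib

def dedup_store(data, block_size=4, store=None):
    """Content-addressable store sketch: each block is keyed by its hash,
    so identical blocks are stored (and transferred) only once."""
    store = {} if store is None else store
    recipe = []                       # hashes needed to reconstruct the file
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:       # similarity detection: skip known blocks
            store[digest] = block
        recipe.append(digest)
    return recipe, store

data = b"AAAABBBBAAAACCCC"            # two identical 'AAAA' blocks
recipe, store = dedup_store(data)
print(len(recipe), len(store))        # 4 blocks referenced, only 3 stored
```

Reconstruction simply concatenates the stored blocks named in the recipe, which is why hashing sits on the critical write path and is worth accelerating.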
20
Challenges

Integration challenges:
• Minimizing the integration effort
• Transparency
• Separation of concerns

Extracting major performance gains:
• Hiding memory allocation overheads
• Hiding data transfer overheads
• Efficient utilization of the GPU memory units
• Use of multi-GPU systems

[Figure: files divided into a stream of blocks (b1, b2, b3, ..., bn); similarity detection feeds hashing, offloaded to the GPU through the offloading layer.]
21
Hashing on GPUs

HashGPU[1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems

[1] Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08

One performance data point: accelerates hashing by up to 5x compared to a single-core CPU.

However, significant speedup is achieved only for large blocks (>16MB), so it is not suitable for efficient similarity detection.

[Figure: HashGPU hashing a stream of blocks (b1, b2, b3, ..., bn) on the GPU.]
22
Profiling HashGPU

[Figure: execution-time breakdown showing at least 75% overhead beyond the hashing kernel itself.]

Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.
23
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations

[Figure: files divided into a stream of blocks (b1, b2, b3, ..., bn); similarity detection calls HashGPU, which runs on the GPU through CrystalGPU in the offloading layer.]

One performance data point: CrystalGPU can improve the speedup of hashing by more than 10x.
24
CrystalGPU Opportunities and Enablers

• Opportunity: reusing GPU memory buffers
  Enabler: a high-level memory manager
• Opportunity: overlapping communication and computation
  Enabler: double buffering and asynchronous kernel launch
• Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
  Enabler: a task queue manager

[Figure: the same stack as before, with CrystalGPU's components labeled: memory manager, task queue, double buffering.]
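The double-buffering enabler can be sketched with plain threads. A hedged sketch only: `transfer` and `compute` below are placeholders for the real asynchronous copy and kernel-launch calls that CrystalGPU manages transparently:

```python
from concurrent.futures import ThreadPoolExecutor

def process_stream(blocks, transfer, compute):
    """Double-buffering sketch: while the 'GPU' computes on one buffer,
    the next block is copied into the other buffer by a helper thread."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, blocks[0])   # fill first buffer
        for nxt in blocks[1:]:
            buf = pending.result()                     # wait for the copy
            pending = copier.submit(transfer, nxt)     # overlap next copy...
            results.append(compute(buf))               # ...with this compute
        results.append(compute(pending.result()))      # drain last buffer
    return results

out = process_stream([1, 2, 3], transfer=lambda b: b, compute=lambda b: b * 10)
print(out)  # [10, 20, 30]
```

With real copy and kernel times, the steady-state cost per block drops from copy + compute to max(copy, compute), which is the gain the profiling slide points at.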
25
HashGPU Performance on top of CrystalGPU
The gains enabled by the three optimizations can be realized!
Baseline: single-core CPU
26
End-to-end system evaluation
27
End-to-End System Evaluation
(A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10)

Testbed:
• Four storage nodes and one metadata server
• One client with a 9800 GX2 GPU

Three configurations:
• No similarity detection (without-SD)
• Similarity detection on the CPU (4 cores @ 2.6GHz) (SD-CPU)
• Similarity detection on the GPU (9800 GX2) (SD-GPU)

Three workloads:
• Real checkpointing workload
• Completely similar files: maximum gains in terms of data saving
• Completely different files: only overheads, no gains

Success metrics:
• System throughput
• Impact on a competing application (compute- or I/O-intensive)
28
System Throughput (Checkpointing Workload)
The integrated system preserves the throughput gains on a realistic workload!
1.8x improvement
29
System Throughput (Synthetic Workload of Similar Files)
Offloading to the GPU enables close to optimal performance!
Room for 2x improvement
30
Impact on a Competing (Compute Intensive) Application
Writing checkpoints back-to-back
[Figure: 2x throughput improvement for the storage system; only a 7% reduction for the competing application.]
Frees resources (CPU) for competing applications while preserving throughput gains!
31
Summary
32
Distributed Storage System Architecture
[Figure: recap of the distributed storage system architecture: client (application, access module), metadata manager, storage nodes.]

MosaStore: http://mosastore.net
33
StoreGPU summary

Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed storage) systems?

[Figure: the MosaStore stack: files divided into a stream of blocks (b1, b2, b3, ..., bn); application layer with FS API; techniques to improve performance/reliability (de-duplication, security/integrity checks, redundancy); enabling operations (compression, encoding/decoding, encryption/decryption, hashing); offloading layer over CPU/GPU.]

Results so far: StoreGPU, a storage system prototype that offloads to the GPU; evaluates the feasibility of GPU offloading and the impact on competing applications.
34
Roadmap: Two Projects

StoreGPU
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Context: porting a bioinformatics application (sequence alignment); a string-matching problem, data intensive (10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?
35
Background: Sequence Alignment Problem

[Figure: many short query reads (e.g., CCAT, GGCT..., ...CGCCCTA, GCAATTT..., ...GCGG, ...TAGGC, TGCGC...) aligned against a long reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

Problem: find where each query most likely originated from
Queries: 10^8 queries, 10^1 to 10^2 symbols per query
Reference: 10^6 to 10^11 symbols (up to ~400GB)
36
Sequence Alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• Achieves good speedup compared to the CPU version
• Based on a suffix tree
• However, suffers from significant communication and post-processing overheads (> 50% overhead)

MUMmerGPU++ [Gharaibeh 10]: uses a space-efficient data structure (though from a higher computational complexity class): the suffix array
• Achieves significant speedup compared to the suffix tree-based GPU implementation

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC '10
Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing, Jan/Feb 2011
37
Speedup Evaluation
Workload: Human, ~10M queries, ~30M ref. length
[Chart: Suffix Tree vs. Suffix Tree vs. Suffix Array implementations]
Over 60% improvement
38
Space/Time Trade-off Analysis
39
GPU Offloading: addressing the challenges
• Data-intensive problem and limited memory space
  → divide and compute in rounds
  → search-optimized data structures
• Large output size
  → compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }
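A runnable CPU analogue of this loop helps make the "divide and compute in rounds" idea concrete. A sketch, not the real kernel: `chunk.find` stands in for the GPU match kernel, chunks overlap by qry_len - 1 so matches crossing a chunk boundary are not lost, and the reference string is the example from the background slide:

```python
def match_in_rounds(ref, queries, chunk_len, overlap):
    """Round-based exact matching: the reference is split into chunks
    that fit in (simulated) device memory; overlapping chunks keep
    boundary-spanning matches, and a set deduplicates repeated hits."""
    hits = set()
    step = chunk_len - overlap
    for start in range(0, len(ref), step):
        chunk = ref[start:start + chunk_len]   # "CopyToGPU(subref)"
        for q in queries:                      # "MatchKernel" stand-in
            pos = chunk.find(q)
            while pos != -1:
                hits.add((q, start + pos))     # global match position
                pos = chunk.find(q, pos + 1)
    return sorted(hits)

ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"
print(match_in_rounds(ref, ["GCGG", "TAGG"], chunk_len=12, overlap=3))
```

With overlap set to the query length minus one, processing in rounds finds exactly the matches a single pass over the whole reference would find.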
40
The core data structure

Massive number of queries and a long reference => pre-process the reference into an index

[Figure: suffix tree for the example reference TACACA$, leaves labeled with suffix start positions 0-5.]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))
41
The core data structure

Massive number of queries and a long reference => pre-process the reference into an index

Past work: build a suffix tree (MUMmerGPU [Schatz 07])
• Search: O(qry_len) per query (efficient)
• Space: O(ref_len), but the constant is high: ~20 x ref_len (expensive)
• Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query (expensive)

[The host algorithm from the previous slide, annotated with these per-stage costs.]
42
A better matching data structure?

[Figure: suffix tree vs. suffix array for the example reference TACACA$. The suffix array lists the sorted suffixes: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$.]

              Suffix Tree                       Suffix Array
Space         O(ref_len), 20 x ref_len          O(ref_len), 4 x ref_len
Search        O(qry_len)                        O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Impact 1: Reduced communication (less data to transfer)
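The suffix-array column of this comparison is easy to make concrete. A minimal sketch over the deck's example reference TACACA$; the naive construction below is for illustration only (production tools use much faster construction algorithms):

```python
def build_suffix_array(ref):
    """Sorted suffix start positions: ~4 x ref_len of index data
    (vs. ~20 x ref_len for a suffix tree). Naive O(n^2 log n) build."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def find_matches(ref, sa, query):
    """Binary search over the sorted suffixes: O(qry_len x log ref_len)
    character comparisons per query, matching the table's search cost."""
    def lower_bound(q):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if ref[sa[mid]:sa[mid] + len(q)] < q:
                lo = mid + 1
            else:
                hi = mid
        return lo
    # '\x7f' sorts after every symbol in $ACGT, giving an upper bound.
    return sorted(sa[lower_bound(query):lower_bound(query + "\x7f")])

ref = "TACACA$"
sa = build_suffix_array(ref)
print(sa)                           # [6, 5, 3, 1, 4, 2, 0]
print(find_matches(ref, sa, "CA"))  # [2, 4]
```

The flat integer array is also what makes Impact 1 possible: it is a compact, contiguous structure that is cheap to copy to the device.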
43
A better matching data structure

[Same suffix tree vs. suffix array comparison as the previous slide.]

Impact 2: Better data locality is achieved at the cost of additional per-thread processing time
Space for longer sub-references => fewer processing rounds
44
A better matching data structure

[Same suffix tree vs. suffix array comparison as the previous slide.]

Impact 3: Lower post-processing overhead
45
Evaluation
46
Evaluation setup

Testbed:
• Low-end GeForce 9800 GX2 GPU (512MB)
• High-end Tesla C1060 (4GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species            Reference length   # of queries   Avg. read length
HS1 - Human (chromosome 2)    ~238M              ~78M           ~200
HS2 - Human (chromosome 3)    ~100M              ~2M            ~700
MONO - L. monocytogenes       ~3M                ~6M            ~120
SUIS - S. suis                ~2M                ~26M           ~36
47
Speedup: array-based over tree-based
48
Dissecting the overheads

Workload: HS1, ~78M queries, ~238M ref. length, on the GeForce

Consequences:
• Focus shifts to optimizing the compute stage
• Opportunity to exploit multi-GPU systems (as I/O is less of a bottleneck)
49
MummerGPU++ Summary

Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

The choice of an appropriate data structure can be crucial when porting applications to the GPU. A good matching data structure ensures:
• Low communication overhead
• Data locality (can be achieved at the cost of additional per-thread processing time)
• Low post-processing overhead
50
Unifying theme: making the use of hybrid architectures (e.g., GPU-based platforms) simple and effective

Hybrid platforms will gain wider adoption.

StoreGPU
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?
51
Code, benchmarks and papers available at: netsyslab.ece.ubc.ca
52
Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)

Accelerated storage systems:
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC '08

Porting applications to efficiently exploit GPU characteristics:
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC '10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development:
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, technical report

GPU-optimized building blocks, data structures and libraries:
• Hashing, Bloom filters, suffix arrays