cuda tricks
TRANSCRIPT
-
7/27/2019 CUDA Tricks
1/33
Presented by
Damodaran Ramani
-
Synopsis
Scan Algorithm
Applications
Specialized Libraries
CUDPP: CUDA Data Parallel Primitives Library
Thrust: a Template Library for CUDA Applications
CUDA FFT and BLAS libraries for the GPU
-
References
Scan primitives for GPU Computing. Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens
Presentation on scan primitives by Gary J. Katz based on the article
Parallel Prefix Sum (Scan) with CUDA - Harris, Sengupta and Owens (GPU Gems 3 chapter)
-
Introduction
GPUs are massively parallel processors
Programmable parts of the graphics pipeline operate on primitives (vertices, fragments)
These primitive programs spawn a thread for each primitive to keep the parallel processors full
-
Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra)
Fragment program operating on n fragments (accesses: O(n))
Problem arises when access requirements are complex (e.g. prefix-sum: O(n²))
-
Prefix-Sum Example
in:  3 1 7 0 4 1 6 3
out: 0 3 4 11 11 15 16 22 (exclusive)
-
Trivial Sequential Implementation
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}
-
Scan: An Efficient Parallel Primitive
Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs.
Why CUDA? (general load-store memory architecture, on-chip shared memory, thread synchronization)
-
Threads & Blocks
GeForce 8800 GTX (16 multiprocessors, 8 processors each)
CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads.
Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU.
Within a thread block, threads can communicate through shared memory and cooperate through synchronization.
Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.
-
The Scan Operator
Definition: The scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements
[a0, a1, ..., a(n-1)]
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-2))]
Types: inclusive, exclusive, forward, backward
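As a concrete illustration of the definition above, here is a minimal sequential Python sketch (my own illustration, not part of the slides) of the exclusive and inclusive variants on the example input from earlier:

```python
def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Exclusive scan: out[i] = identity op a[0] op ... op a[i-1]."""
    out, acc = [], identity
    for v in a:
        out.append(acc)
        acc = op(acc, v)
    return out

def inclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Inclusive scan: out[i] additionally includes a[i]."""
    out, acc = [], identity
    for v in a:
        acc = op(acc, v)
        out.append(acc)
    return out

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Any associative operator with an identity (max, min, logical AND, ...) can be passed as `op`; addition gives the familiar prefix sum.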
-
Parallel Scan
for d = 1 to log2 n:
    for all k in parallel:
        if k >= 2^(d-1):
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else:
            x[out][k] = x[in][k]
    swap(in, out)

Complexity: O(n log2 n)
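The double-buffered loop above can be simulated sequentially. This Python sketch (an illustration of the algorithm, not the slide's CUDA code) reproduces each "parallel" pass:

```python
import math

def naive_parallel_scan(a):
    """Simulate the naive (Hillis-Steele) parallel scan: on pass d, each
    element adds in the value 2^(d-1) positions to its left, using two
    buffers as on the slide. Produces an inclusive scan in O(n log n) adds."""
    n = len(a)
    x_in, x_out = list(a), list(a)
    for d in range(1, int(math.log2(n)) + 1):
        offset = 2 ** (d - 1)
        for k in range(n):  # "for all k in parallel"
            if k >= offset:
                x_out[k] = x_in[k - offset] + x_in[k]
            else:
                x_out[k] = x_in[k]
        x_in, x_out = x_out, x_in  # swap(in, out)
    return x_in

print(naive_parallel_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Each of the log2 n passes touches all n elements, which is where the O(n log2 n) work bound comes from.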
-
A work-efficient parallel scan
Goal is a parallel scan that is O(n) instead of O(n log n)
Solution: balanced trees: build a binary tree on the input data and sweep it to and from the root.
A binary tree with n leaves has d = log2 n levels; each level d has 2^d nodes.
One add is performed per node, so a single traversal of the tree costs O(n) adds.
-
O(n) unsegmented scan
Reduce / Up-Sweep:
for d = 0 to log2 n - 1:
    for all k = 0; k < n; k += 2^(d+1) in parallel:
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-Sweep:
x[n-1] = 0
for d = log2 n - 1 down to 0:
    for all k = 0; k < n; k += 2^(d+1) in parallel:
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
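A sequential Python simulation of the two sweeps (my own sketch of the algorithm above, assuming n is a power of two; the slide's version runs the inner loop in parallel on the GPU):

```python
import math

def work_efficient_scan(a):
    """Simulate the up-sweep/down-sweep (Blelloch) scan from the slide.
    Requires len(a) to be a power of two; returns the exclusive scan
    using O(n) additions in total."""
    x = list(a)
    n = len(x)
    levels = int(math.log2(n))
    # Reduce / up-sweep: build partial sums up the tree in place
    for d in range(levels):
        for k in range(0, n, 2 ** (d + 1)):  # "in parallel"
            x[k + 2 ** (d + 1) - 1] += x[k + 2 ** d - 1]
    # Down-sweep: clear the root, then swap-and-add back down the tree
    x[n - 1] = 0
    for d in range(levels - 1, -1, -1):
        for k in range(0, n, 2 ** (d + 1)):  # "in parallel"
            t = x[k + 2 ** d - 1]
            x[k + 2 ** d - 1] = x[k + 2 ** (d + 1) - 1]
            x[k + 2 ** (d + 1) - 1] += t
    return x

print(work_efficient_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

The result matches the sequential exclusive scan of the example input, but each level of the tree can be processed by independent threads.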
-
Tree analogy (array state for input x0..x7)
Up-sweep result:  x0  (x0..x1)  x2  (x0..x3)  x4  (x4..x5)  x6  (x0..x7)
Clear last:       x0  (x0..x1)  x2  (x0..x3)  x4  (x4..x5)  x6  0
Down-sweep d=2:   x0  (x0..x1)  x2  0         x4  (x4..x5)  x6  (x0..x3)
Down-sweep d=1:   x0  0         x2  (x0..x1)  x4  (x0..x3)  x6  (x0..x5)
Down-sweep d=0:   0   x0  (x0..x1)  (x0..x2)  (x0..x3)  (x0..x4)  (x0..x5)  (x0..x6)
-
O(n) Segmented Scan
Up-Sweep
-
Down-Sweep
-
Features of segmented scan
3 times slower than unsegmented scan
Enables applications which are not possible with unsegmented scan
-
Primitives built on scan
Enumerate: exclusive scan of the input flag vector
Distribute (copy): inclusive scan of the input vector
    distribute([a b c][d e]) = [a a a][d d]
Split and split-and-segment
Split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
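A minimal Python sketch of how split can be built from scan (my own illustration of the idea; the GPU version computes both destination addresses with scans and scatters in parallel):

```python
def exclusive_scan(a):
    """Sequential exclusive plus-scan, standing in for the GPU primitive."""
    out, acc = [], 0
    for v in a:
        out.append(acc)
        acc += v
    return out

def split(values, flags):
    """Scan-based split: stably move false-flagged elements to the left
    and true-flagged elements to the right of the output."""
    n = len(values)
    f = [0 if fl else 1 for fl in flags]   # 1 where the flag is false
    idx_false = exclusive_scan(f)          # destination of each false element
    total_false = idx_false[-1] + f[-1]    # how many false elements in total
    out = [None] * n
    num_true_seen = 0
    for i in range(n):                     # a parallel scatter on the GPU
        if flags[i]:
            out[total_false + num_true_seen] = values[i]
            num_true_seen += 1
        else:
            out[idx_false[i]] = values[i]
    return out

print(split(['a', 'b', 'c', 'd', 'e'], [True, False, True, False, False]))
# ['b', 'd', 'e', 'a', 'c']
```

Enumerate is exactly the `exclusive_scan(f)` step: each false element learns how many false elements precede it, which is its output address.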
-
Applications
Quicksort
Tridiagonal Matrix Solvers and Fluid Simulation
Radix Sort
Stream Compaction
Summed-Area Tables
-
Quicksort
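The quicksort figure on this slide did not survive the transcript. As a rough sketch of the idea (my own illustration, not the slide's code): each partition step is a split by comparison against a pivot, and on the GPU the splits for all active segments are performed at once with segmented scans.

```python
def quicksort_scan_style(seg):
    """Sketch of scan-based quicksort: pick a pivot, split the segment by
    comparison (the split primitive), and recurse on both halves. The GPU
    version processes every segment in one segmented-scan pass per level."""
    if len(seg) <= 1:
        return seg
    pivot = seg[0]
    less = [v for v in seg[1:] if v < pivot]    # "false" side of the split
    geq = [v for v in seg[1:] if v >= pivot]    # "true" side of the split
    return quicksort_scan_style(less) + [pivot] + quicksort_scan_style(geq)

print(quicksort_scan_style([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 1, 1, 3, 3, 4, 6, 7]
```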
-
Sparse Matrix-VectorMultiplication
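The figure for this slide was not captured. As a rough illustration of the idea (my own sketch, not the slide's content): sparse y = A*x in CSR form reduces to a sum of products per row, and those variable-length row sums are exactly what a segmented scan computes in parallel.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A*x for a CSR sparse matrix. Each row's partial products are
    reduced to one value -- the per-segment reduction that segmented
    scan performs in parallel across all rows at once."""
    y = []
    for r in range(len(row_ptr) - 1):
        products = [vals[j] * x[col_idx[j]]
                    for j in range(row_ptr[r], row_ptr[r + 1])]
        y.append(sum(products))
    return y

# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]] times [1, 1, 1]
print(csr_spmv([0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5], [1, 1, 1]))  # [3, 3, 9]
```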
-
Stream Compaction
Definition: Extracts the interesting elements from an array and places them contiguously in a new array
Uses: Sparse Matrix Compression
A B A D D E C F B
A B A C B
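The example above (keeping the marked elements of A B A D D E C F B) can be sketched with a scan-based compaction; this Python simulation is my own illustration of the standard technique, not the slide's code:

```python
def exclusive_scan(a):
    """Sequential exclusive plus-scan, standing in for the GPU primitive."""
    out, acc = [], 0
    for v in a:
        out.append(acc)
        acc += v
    return out

def compact(values, keep):
    """Scan-based stream compaction: an exclusive scan of the keep flags
    gives each surviving element its output address; kept elements are
    then scattered to those addresses (in parallel on the GPU)."""
    flags = [1 if k else 0 for k in keep]
    addr = exclusive_scan(flags)
    out = [None] * (addr[-1] + flags[-1])
    for i, v in enumerate(values):  # parallel scatter on the GPU
        if keep[i]:
            out[addr[i]] = v
    return out

marks = [True, True, True, False, False, False, True, False, True]
print(compact(list("ABADDECFB"), marks))  # ['A', 'B', 'A', 'C', 'B']
```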
-
Specialized Libraries
CUDPP: CUDA Data Parallel Primitives
CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum (scan), parallel sort and parallel reduction.
-
CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
    CUDPPHandle sparseMatrixHandle,
    void* d_y,
    const void* d_x)

Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.
-
CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;
cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);
-
CUFFT
[Figure: CUFFT benchmark, performance vs. no. of elements]
-
CUBLAS
CUDA Based Linear Algebra Subroutines
Saxpy, conjugate gradient, linear solvers
3D reconstruction of planetary nebulae:
http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf
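Saxpy (y = a*x + y) is the simplest of the routines listed above; this Python sketch (my own illustration, not CUBLAS code) shows why it maps so well to the GPU:

```python
def saxpy(a, x, y):
    """saxpy: y = a*x + y, the BLAS level-1 routine CUBLAS accelerates.
    Every output element is independent, so the GPU can assign one
    thread per element with no synchronization at all."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```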
-
GPU variant 100 times faster than CPU version
Matrix size is limited by graphics card memory
Although taking advantage of sparse matrices would reduce memory consumption, sparse matrix storage is not implemented
-
Useful Links
http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
http://gpgpu.org/2009/05/31/thrust