cuda tricks
TRANSCRIPT
-
7/27/2019 CUDA Tricks
1/33
Presented by
Damodaran Ramani
-
Synopsis
Scan Algorithm
Applications
Specialized Libraries
CUDPP: CUDA Data Parallel Primitives Library
Thrust: a Template Library for CUDA Applications
CUDA FFT and BLAS libraries for the GPU
-
References
Scan primitives for GPU Computing. Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens
Presentation on scan primitives by Gary J. Katz based on the article
Parallel Prefix Sum (Scan) with CUDA - Harris, Sengupta and Owens (GPU Gems 3 chapter)
-
Introduction
GPUs are massively parallel processors
Programmable parts of the graphics pipeline operate on primitives (vertices, fragments)
These primitive programs spawn a thread for each primitive to keep the parallel processors full
-
Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra)
Fragment program operating on n fragments (accesses: O(n))
Problem arises when access requirements are complex (e.g. prefix-sum: O(n²))
-
Prefix-Sum Example
in:  3 1 7 0 4 1 6 3
out: 0 3 4 11 11 15 16 22 (exclusive)
-
Trivial Sequential Implementation
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}
-
Scan: An Efficient Parallel Primitive
Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs.
Why CUDA? (general load-store memory architecture, on-chip shared memory, thread synchronization)
-
Threads & Blocks
GeForce 8800 GTX (16 multiprocessors, 8 processors each)
CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads.
Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU.
Within a thread block, threads can communicate through shared memory and cooperate through synchronization.
Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.
-
The Scan Operator
Definition: The scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements
[a0, a1, ..., a(n-1)]
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-2))]
Types: inclusive, exclusive, forward, backward
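As a concrete illustration of the definition above, here is a minimal sequential Python sketch (my own illustration, not part of the slides) of the exclusive and inclusive variants on the example input from earlier:

```python
def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Exclusive scan: out[i] = identity op a[0] op ... op a[i-1]."""
    out, acc = [], identity
    for v in a:
        out.append(acc)
        acc = op(acc, v)
    return out

def inclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Inclusive scan: out[i] additionally includes a[i]."""
    out, acc = [], identity
    for v in a:
        acc = op(acc, v)
        out.append(acc)
    return out

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Any associative operator with an identity (max, min, logical AND, ...) can be passed as `op`; addition gives the familiar prefix sum.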
-
Parallel Scan
for d = 1 to log2 n:
    for all k in parallel:
        if k >= 2^(d-1):
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else:
            x[out][k] = x[in][k]
    swap(in, out)

Complexity: O(n log2 n)
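The double-buffered loop above can be simulated sequentially. This Python sketch (an illustration of the algorithm, not the slide's CUDA code) reproduces each "parallel" pass:

```python
import math

def naive_parallel_scan(a):
    """Simulate the naive (Hillis-Steele) parallel scan: on pass d, each
    element adds in the value 2^(d-1) positions to its left, using two
    buffers as on the slide. Produces an inclusive scan in O(n log n) adds."""
    n = len(a)
    x_in, x_out = list(a), list(a)
    for d in range(1, int(math.log2(n)) + 1):
        offset = 2 ** (d - 1)
        for k in range(n):  # "for all k in parallel"
            if k >= offset:
                x_out[k] = x_in[k - offset] + x_in[k]
            else:
                x_out[k] = x_in[k]
        x_in, x_out = x_out, x_in  # swap(in, out)
    return x_in

print(naive_parallel_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Each of the log2 n passes touches all n elements, which is where the O(n log2 n) work bound comes from.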
-
A work-efficient parallel scan
Goal is a parallel scan that is O(n) instead of O(n log n)
Solution: balanced trees: build a binary tree on the input data and sweep it to and from the root.
A binary tree with n leaves has d = log2 n levels; each level d has 2^d nodes.
One add is performed per node, so a single traversal of the tree costs O(n) adds.
-
O(n) unsegmented scan
Reduce / Up-Sweep:
for d = 0 to log2 n - 1:
    for all k = 0; k < n; k += 2^(d+1) in parallel:
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-Sweep:
x[n-1] = 0
for d = log2 n - 1 down to 0:
    for all k = 0; k < n; k += 2^(d+1) in parallel:
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
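A sequential Python simulation of the two sweeps (my own sketch of the algorithm above, assuming n is a power of two; the slide's version runs the inner loop in parallel on the GPU):

```python
import math

def work_efficient_scan(a):
    """Simulate the up-sweep/down-sweep (Blelloch) scan from the slide.
    Requires len(a) to be a power of two; returns the exclusive scan
    using O(n) additions in total."""
    x = list(a)
    n = len(x)
    levels = int(math.log2(n))
    # Reduce / up-sweep: build partial sums up the tree in place
    for d in range(levels):
        for k in range(0, n, 2 ** (d + 1)):  # "in parallel"
            x[k + 2 ** (d + 1) - 1] += x[k + 2 ** d - 1]
    # Down-sweep: clear the root, then swap-and-add back down the tree
    x[n - 1] = 0
    for d in range(levels - 1, -1, -1):
        for k in range(0, n, 2 ** (d + 1)):  # "in parallel"
            t = x[k + 2 ** d - 1]
            x[k + 2 ** d - 1] = x[k + 2 ** (d + 1) - 1]
            x[k + 2 ** (d + 1) - 1] += t
    return x

print(work_efficient_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

The result matches the sequential exclusive scan of the example input, but each level of the tree can be processed by independent threads.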
-
Tree analogy (array state for input x0..x7)
Up-sweep result:  x0  (x0..x1)  x2  (x0..x3)  x4  (x4..x5)  x6  (x0..x7)
Clear last:       x0  (x0..x1)  x2  (x0..x3)  x4  (x4..x5)  x6  0
Down-sweep d=2:   x0  (x0..x1)  x2  0         x4  (x4..x5)  x6  (x0..x3)
Down-sweep d=1:   x0  0         x2  (x0..x1)  x4  (x0..x3)  x6  (x0..x5)
Down-sweep d=0:   0   x0  (x0..x1)  (x0..x2)  (x0..x3)  (x0..x4)  (x0..x5)  (x0..x6)
-
O(n) Segmented Scan
Up-Sweep
-
Down-Sweep
-
Features of segmented scan
3 times slower than unsegmented scan
Enables applications which are not possible with unsegmented scan
-
Primitives built on scan
Enumerate: exclusive scan of the input flag vector
Distribute (copy): inclusive scan of the input vector
    distribute([a b c][d e]) = [a a a][d d]
Split and split-and-segment
Split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
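A minimal Python sketch of how split can be built from scan (my own illustration of the idea; the GPU version computes both destination addresses with scans and scatters in parallel):

```python
def exclusive_scan(a):
    """Sequential exclusive plus-scan, standing in for the GPU primitive."""
    out, acc = [], 0
    for v in a:
        out.append(acc)
        acc += v
    return out

def split(values, flags):
    """Scan-based split: stably move false-flagged elements to the left
    and true-flagged elements to the right of the output."""
    n = len(values)
    f = [0 if fl else 1 for fl in flags]   # 1 where the flag is false
    idx_false = exclusive_scan(f)          # destination of each false element
    total_false = idx_false[-1] + f[-1]    # how many false elements in total
    out = [None] * n
    num_true_seen = 0
    for i in range(n):                     # a parallel scatter on the GPU
        if flags[i]:
            out[total_false + num_true_seen] = values[i]
            num_true_seen += 1
        else:
            out[idx_false[i]] = values[i]
    return out

print(split(['a', 'b', 'c', 'd', 'e'], [True, False, True, False, False]))
# ['b', 'd', 'e', 'a', 'c']
```

Enumerate is exactly the `exclusive_scan(f)` step: each false element learns how many false elements precede it, which is its output address.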
-
Applications
Quicksort
Tridiagonal Matrix Solvers and Fluid Simulation
Radix Sort
Stream Compaction
Summed-Area Tables
-
Quicksort
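The quicksort figure on this slide did not survive the transcript. As a rough sketch of the idea (my own illustration, not the slide's code): each partition step is a split by comparison against a pivot, and on the GPU the splits for all active segments are performed at once with segmented scans.

```python
def quicksort_scan_style(seg):
    """Sketch of scan-based quicksort: pick a pivot, split the segment by
    comparison (the split primitive), and recurse on both halves. The GPU
    version processes every segment in one segmented-scan pass per level."""
    if len(seg) <= 1:
        return seg
    pivot = seg[0]
    less = [v for v in seg[1:] if v < pivot]    # "false" side of the split
    geq = [v for v in seg[1:] if v >= pivot]    # "true" side of the split
    return quicksort_scan_style(less) + [pivot] + quicksort_scan_style(geq)

print(quicksort_scan_style([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 1, 1, 3, 3, 4, 6, 7]
```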
-
Sparse Matrix-VectorMultiplication
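The figure for this slide was not captured. As a rough illustration of the idea (my own sketch, not the slide's content): sparse y = A*x in CSR form reduces to a sum of products per row, and those variable-length row sums are exactly what a segmented scan computes in parallel.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A*x for a CSR sparse matrix. Each row's partial products are
    reduced to one value -- the per-segment reduction that segmented
    scan performs in parallel across all rows at once."""
    y = []
    for r in range(len(row_ptr) - 1):
        products = [vals[j] * x[col_idx[j]]
                    for j in range(row_ptr[r], row_ptr[r + 1])]
        y.append(sum(products))
    return y

# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]] times [1, 1, 1]
print(csr_spmv([0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5], [1, 1, 1]))  # [3, 3, 9]
```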
-
Stream Compaction
Definition: Extracts the interesting elements from an array and places them contiguously in a new array
Uses: Sparse Matrix Compression
A B A D D E C F B
A B A C B
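The example above (keeping the marked elements of A B A D D E C F B) can be sketched with a scan-based compaction; this Python simulation is my own illustration of the standard technique, not the slide's code:

```python
def exclusive_scan(a):
    """Sequential exclusive plus-scan, standing in for the GPU primitive."""
    out, acc = [], 0
    for v in a:
        out.append(acc)
        acc += v
    return out

def compact(values, keep):
    """Scan-based stream compaction: an exclusive scan of the keep flags
    gives each surviving element its output address; kept elements are
    then scattered to those addresses (in parallel on the GPU)."""
    flags = [1 if k else 0 for k in keep]
    addr = exclusive_scan(flags)
    out = [None] * (addr[-1] + flags[-1])
    for i, v in enumerate(values):  # parallel scatter on the GPU
        if keep[i]:
            out[addr[i]] = v
    return out

marks = [True, True, True, False, False, False, True, False, True]
print(compact(list("ABADDECFB"), marks))  # ['A', 'B', 'A', 'C', 'B']
```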
-
Specialized Libraries
CUDPP: CUDA Data Parallel Primitives
CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum (scan), parallel sort and parallel reduction.
-
CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
    CUDPPHandle sparseMatrixHandle,
    void* d_y,
    const void* d_x)

Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.
-
CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;
cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);
-
CUFFT
[Figure: CUFFT benchmark, performance vs. no. of elements]
-
CUBLAS
CUDA Based Linear Algebra Subroutines
Saxpy, conjugate gradient, linear solvers
3D reconstruction of planetary nebulae:
http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf
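Saxpy (y = a*x + y) is the simplest of the routines listed above; this Python sketch (my own illustration, not CUBLAS code) shows why it maps so well to the GPU:

```python
def saxpy(a, x, y):
    """saxpy: y = a*x + y, the BLAS level-1 routine CUBLAS accelerates.
    Every output element is independent, so the GPU can assign one
    thread per element with no synchronization at all."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```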
-
GPU variant 100 times faster than CPU version
Matrix size is limited by graphics card memory
Although taking advantage of sparse matrices would reduce memory consumption, sparse matrix storage is not implemented
-
Useful Links
http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
http://gpgpu.org/2009/05/31/thrust