CUDA Tricks

Upload: comdet-phudphut

Post on 14-Apr-2018


  • 7/27/2019 CUDA Tricks

    1/33

    Presented by

    Damodaran Ramani


    Synopsis

    Scan Algorithm

    Applications

    Specialized Libraries

    CUDPP: CUDA Data Parallel Primitives Library

    Thrust: a Template Library for CUDA Applications

    CUDA FFT and BLAS libraries for the GPU


    References

    Scan primitives for GPU Computing. Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens

    Presentation on scan primitives by Gary J. Katz based on the article

    Parallel Prefix Sum (Scan) with CUDA. Mark Harris, Shubhabrata Sengupta, and John D. Owens. GPU Gems 3, Chapter 39


    Introduction

    GPUs are massively parallel processors.

    Programmable parts of the graphics pipeline operate on primitives (vertices, fragments).

    These primitive programs spawn a thread for each primitive to keep the parallel processors full.


    Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra).

    A fragment program operating on n fragments needs O(n) accesses.

    Problems arise when the access requirements are complex (e.g. prefix-sum: O(n^2)).


    Prefix-Sum Example

    in:  3 1 7 0 4 1 6 3

    out: 0 3 4 11 11 15 16 22   (exclusive scan)


    Trivial Sequential Implementation

    void scan(int* in, int* out, int n)
    {
        out[0] = 0;
        for (int i = 1; i < n; i++)
            out[i] = in[i-1] + out[i-1];
    }
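For reference, a self-contained version of this sequential exclusive scan (the slide's loop header is garbled; `for (int i = 1; i < n; i++)` is the natural reading), checked against the prefix-sum example from the earlier slide:

```c
#include <assert.h>

/* Exclusive prefix sum: out[0] = 0, out[i] = in[0] + ... + in[i-1]. */
void scan(const int *in, int *out, int n) {
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i - 1] + out[i - 1];
}
```

On the example input 3 1 7 0 4 1 6 3 this produces 0 3 4 11 11 15 16 22.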


    Scan: An Efficient Parallel Primitive

    Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs.

    Why CUDA? (general load-store memory architecture, on-chip shared memory, thread synchronization)


    Threads & Blocks

    GeForce 8800 GTX (16 multiprocessors, 8 processors each).

    CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads.

    Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU.

    Within a thread block, threads can communicate through shared memory and cooperate through synchronization.

    Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.


    The Scan Operator

    Definition: the scan operation takes a binary associative operator ⊕ with identity I and an array of n elements

    [a0, a1, ..., a(n-1)]

    and returns the array

    [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-2))]

    Types: inclusive, exclusive, forward, backward.
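A minimal sketch of these variants with ⊕ = + and I = 0 (the function names are illustrative, not a library API):

```c
/* Exclusive forward: [I, a0, a0+a1, ..., a0+...+a(n-2)] */
void scan_exclusive(const int *a, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) { out[i] = acc; acc += a[i]; }
}

/* Inclusive forward: [a0, a0+a1, ..., a0+...+a(n-1)] */
void scan_inclusive(const int *a, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) { acc += a[i]; out[i] = acc; }
}

/* Backward variants run the same recurrence from the right end. */
void scan_exclusive_backward(const int *a, int *out, int n) {
    int acc = 0;
    for (int i = n - 1; i >= 0; i--) { out[i] = acc; acc += a[i]; }
}
```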


    Parallel Scan

    f or ( d = 1; d < l og2n; d++)f or al l k i n par al l el

    i f k >= 2d

    x[ out ] [ k] = x[ i n] [ k 2d- 1] + x[ i n] [ k]

    el se

    x out k = x i n k

    Compl exi t y O( nl og2n)
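A sequential C sketch of this naive scan (double buffering simulates the "for all k in parallel" step, so no element reads a value already overwritten in the same pass; inclusive output; assumes n is a power of two, at most 1024 here):

```c
#include <string.h>

/* Sequential simulation of the naive O(n log2 n) scan. */
void naive_scan(int *x, int n) {
    int in[1024], out[1024];
    memcpy(in, x, n * sizeof(int));
    for (int offset = 1; offset < n; offset *= 2) {  /* offset = 2^(d-1) */
        for (int k = 0; k < n; k++)
            out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
        memcpy(in, out, n * sizeof(int));
    }
    memcpy(x, in, n * sizeof(int));
}
```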


    A Work-Efficient Parallel Scan

    Goal: a parallel scan that is O(n) instead of O(n log2 n).

    Solution: balanced trees. Build a binary tree on the input data and sweep it to and from the root.

    A binary tree with n leaves has d = log2(n) levels, and each level d has 2^d nodes.

    If we perform one add per node, we perform O(n) adds in a single traversal of the tree.


    O(n) unsegmented scan

    Reduce / Up-Sweep:

    for d = 0 to log2(n) - 1:
        for all k = 0 to n - 1 by 2^(d+1) in parallel:
            x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

    Down-Sweep:

    x[n - 1] = 0
    for d = log2(n) - 1 down to 0:
        for all k = 0 to n - 1 by 2^(d+1) in parallel:
            t = x[k + 2^d - 1]
            x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
            x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
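The two sweeps translate almost line for line into sequential C (the parallel inner loops become ordinary loops; the result is an in-place exclusive prefix sum; n must be a power of two):

```c
/* Sequential version of the up-sweep / down-sweep scan.
 * d here is the stride 2^level. */
void blelloch_scan(int *x, int n) {
    /* Up-sweep (reduce): each internal node accumulates its children. */
    for (int d = 1; d < n; d *= 2)
        for (int k = 0; k + 2 * d - 1 < n; k += 2 * d)
            x[k + 2 * d - 1] += x[k + d - 1];

    /* Down-sweep: clear the root, then swap-and-add back down. */
    x[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int k = 0; k + 2 * d - 1 < n; k += 2 * d) {
            int t = x[k + d - 1];
            x[k + d - 1] = x[k + 2 * d - 1];
            x[k + 2 * d - 1] += t;
        }
}
```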


    Tree analogy (down-sweep, 8 elements)

    after up-sweep: x0  (x0..x1)  x2  (x0..x3)  x4  (x4..x5)  x6  (x0..x7)
    clear root:     x0  (x0..x1)  x2  (x0..x3)  x4  (x4..x5)  x6  0
    level d = 2:    x0  (x0..x1)  x2  0         x4  (x4..x5)  x6  (x0..x3)
    level d = 1:    x0  0         x2  (x0..x1)  x4  (x0..x3)  x6  (x0..x5)
    level d = 0:    0   x0  (x0..x1)  (x0..x2)  (x0..x3)  (x0..x4)  (x0..x5)  (x0..x6)


    O(n) Segmented Scan

    Up-Sweep


    Down-Sweep


    Features of segmented scan

    Only about 3 times slower than unsegmented scan.

    Enables applications which are not possible with unsegmented scan alone.


    Primitives built on scan

    Enumerate: exclusive scan of the input vector of true/false flags (treated as 1/0); gives each flagged element its index in the output.

    Distribute (copy): distribute([a b c][d e]) = [a a a][d d]; inclusive scan of the input vector with the copy operator.

    Split and split-and-segment: split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
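A sketch of the split primitive in sequential C (the two running counters play the role of the exclusive scans, i.e. the enumerations, of the flag vector and its negation; the function name is mine):

```c
/* Split: elements flagged 0 (false) move to the left of the output,
 * elements flagged 1 (true) to the right, each side keeping its
 * original order. */
void split(const int *in, const int *flags, int *out, int n) {
    int n_true = 0;
    for (int i = 0; i < n; i++)
        n_true += flags[i];
    int n_false = n - n_true;

    int idx_false = 0;  /* running exclusive scan of !flags */
    int idx_true = 0;   /* running exclusive scan of flags  */
    for (int i = 0; i < n; i++) {
        if (flags[i])
            out[n_false + idx_true++] = in[i];
        else
            out[idx_false++] = in[i];
    }
}
```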


    Applications

    Quicksort

    Tridiagonal Matrix Solvers and Fluid Simulation

    Radix Sort

    Stream Compaction

    Summed-Area Tables


    Quicksort


    Sparse Matrix-Vector Multiplication


    Stream Compaction

    Definition: extracts the "interest" elements from an array of elements and places them continuously in a new array.

    Uses: sparse matrix compression.

    in:  A B A D D E C F B   (interest elements: A, B, C)

    out: A B A C B
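The example above, as a sequential sketch of scan-based compaction (the running `pos` counter is exactly the exclusive scan of the keep flags, giving each kept element its output index; the function name is mine):

```c
/* Stream compaction: copy the elements whose keep flag is set into a
 * new array, preserving order.  Returns the compacted length. */
int compact_stream(const char *in, const int *keep, char *out, int n) {
    int pos = 0;  /* running exclusive scan of keep */
    for (int i = 0; i < n; i++)
        if (keep[i])
            out[pos++] = in[i];
    return pos;
}
```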


    Specialized Libraries

    CUDPP: CUDA Data Parallel Primitives

    CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum (scan), parallel sort, and parallel reduction.


    CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
        CUDPPHandle sparseMatrixHandle,
        void       *d_y,
        const void *d_x)

    Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.


    CUDPPScanConfig config;
    config.direction      = CUDPP_SCAN_FORWARD;
    config.exclusivity    = CUDPP_SCAN_EXCLUSIVE;
    config.op             = CUDPP_ADD;
    config.datatype       = CUDPP_FLOAT;
    config.maxNumElements = numElements;
    config.maxNumRows     = 1;
    config.rowPitch       = 0;
    cudppInitializeScan(&config);
    cudppScan(d_odata, d_idata, numElements, &config);


    CUFFT

    (Benchmark chart of CUFFT performance vs. number of elements; figure not preserved in the transcript.)


    CUBLAS

    CUDA-based Linear Algebra Subroutines: saxpy, conjugate gradient, linear solvers.

    Application: 3D reconstruction of planetary nebulae.

    http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf


    The GPU variant is 100 times faster than the CPU version.

    Matrix size is limited by the graphics card's memory.

    Although taking advantage of sparse matrix formats would reduce memory consumption, sparse matrix storage is more involved to implement on the GPU.


    Useful Links

    http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/

    http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf


    http://gpgpu.org/2009/05/31/thrust