Advanced CUDA 01

Uploaded by tenanpekok, posted 09-Apr-2018


TRANSCRIPT

  • Advanced CUDA 01 (© NVIDIA Corporation 2010; transcript captured 8/8/2019)

    Advanced CUDA

  • Agenda

    CUDA Review

    Review of CUDA Architecture

    Programming & Memory Models

    Programming Environment

    Execution

    Performance

    Optimization Guidelines

    Productivity

    Resources

  • CUDA Review: Review of CUDA Architecture

  • Processing Flow

    1. Copy input data from CPU memory to GPU memory

    2. Load the GPU program and execute it, caching data on chip for performance

    3. Copy results from GPU memory back to CPU memory

    (Diagram: CPU and GPU connected by the PCI bus)
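The three-step flow above can be sketched as host code; this is a minimal example, with the kernel name (`scale`) and sizes being illustrative rather than from the slides.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // runs on the GPU
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. CPU -> GPU over the PCI bus
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);       // 2. load and execute the GPU program
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 3. GPU -> CPU
    printf("%f\n", h[0]);
    cudaFree(d); free(h);
    return 0;
}
```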

  • CUDA Parallel Computing Architecture

    A parallel computing architecture and programming model

    Includes a CUDA C compiler, with support for OpenCL and DirectCompute

    Architected to natively support multiple computational interfaces (standard languages and APIs)

  • CUDA Parallel Computing Architecture (cont.)

    CUDA defines:

    Programming model

    Memory model

    Execution model

    CUDA uses the GPU, but is for general-purpose computing

    Facilitates heterogeneous computing: CPU + GPU

    CUDA is scalable

    Scales to run on hundreds of cores and thousands of parallel threads

  • CUDA Review: Programming Model

  • CUDA Kernels

    Parallel portions of the application execute as kernels

    Entire GPU executes kernel, many threads

    CUDA threads:

    Lightweight

    Fast switching

    1000s execute simultaneously

    The CPU (host) executes functions; the GPU (device) executes kernels

  • CUDA Kernels: Parallel Threads

    A kernel is a function executed on the GPU as an array of threads, in parallel

    All threads execute the samecode, can take different paths

    Each thread has an ID

    Select input/output data

    Control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
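Expanding the fragment above into a complete kernel shows where the thread ID comes from; `func` here is a stand-in for any per-element computation, not a function from the slides.

```cuda
__device__ float func(float x) { return 2.0f * x + 1.0f; }  // illustrative

__global__ void apply(const float *input, float *output, int n) {
    // Each thread derives its own ID from the built-in index variables
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {               // guard against out-of-range threads
        float x = input[threadID];    // select input data by thread ID
        float y = func(x);
        output[threadID] = y;         // select output location by thread ID
    }
}
```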

  • CUDA Kernels: Subdivide into Blocks

    Threads are grouped into blocks

    Blocks are grouped into a grid

    A kernel is executed as a grid of blocks of threads

    (Diagram: the grid of thread blocks mapped onto the GPU)
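The grid/block structure above is expressed at launch time with `dim3` execution configuration; the kernel name and dimensions here are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(void) {
    // blockIdx identifies this block within the grid;
    // threadIdx identifies this thread within its block
}

int main(void) {
    dim3 block(16, 16);          // 256 threads per block
    dim3 grid(8, 8);             // 64 blocks in the grid
    kernel<<<grid, block>>>();   // the kernel executes as a grid of blocks of threads
    cudaDeviceSynchronize();
    return 0;
}
```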

  • Communication Within a Block

    Threads may need to cooperate

    Memory accesses

    Share results

    Cooperate using shared memory

    Accessible by all threads within a block

    Restriction to within a block permits scalability

    Fast communication between N threads is not feasible when N is large
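A minimal sketch of block-level cooperation: threads stage data in `__shared__` storage and synchronize before reading each other's results (here, reversing one block's elements; the size 256 is an assumed block width).

```cuda
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];           // visible to all threads in this block
    int t = threadIdx.x;                  // assumes blockDim.x == 256
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();                      // all writes visible before any reads
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}
```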

  • Transparent Scalability

    (Diagrams: the same twelve thread blocks scheduled onto G84, G80, and GT200. A device with few multiprocessors runs a few blocks at a time; GT200 runs all twelve at once, leaving spare units idle. Blocks may execute in any order on any number of multiprocessors, so one binary scales across devices.)

  • CUDA Programming Model - Summary

    A kernel executes as a grid of thread blocks

    A block is a batch of threads

    Threads in a block communicate through shared memory

    Each block has a block ID

    Each thread has a thread ID

    (Diagram: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks indexed (0,0), (0,1), (1,0), (1,1), ...)

  • CUDA Review: Memory Model

  • Memory Hierarchy

    Thread: registers

    Thread: local memory

    Block of threads: shared memory

    All blocks: global memory
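The hierarchy maps onto CUDA declarations roughly as follows; this is a hedged sketch with illustrative names, assuming a block width of 256 and a grid that covers the array.

```cuda
__constant__ float coeff[16];          // constant memory (dedicated cache, see next slide)

__global__ void levels(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = in[i];                   // r lives in a register (per thread);
                                       // register spills go to local memory
    __shared__ float tile[256];        // shared memory: one copy per block
    tile[threadIdx.x] = r * coeff[0];
    __syncthreads();
    out[i] = tile[threadIdx.x];        // in/out point to global memory (all blocks)
}
```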

  • Additional Memories

    The host can also allocate textures and arrays of constants

    Textures and constants have dedicated caches

  • CUDA Review: Programming Environment

  • CUDA C and OpenCL

    Shared back-end compiler and optimization technology

    OpenCL: entry point for developers who want a low-level API

    CUDA C: entry point for developers who prefer a high-level language

  • Visual Studio

    Separate file types

    .c/.cpp for host code

    .cu for device/mixed code

    Compilation rules: cuda.rules

    Syntax highlighting

    Intellisense

    Integrated debugger and profiler: Nexus

  • NVIDIA Nexus IDE

    The industry's first IDE for massively parallel applications

    Accelerates co-processing (CPU + GPU) application development

    Complete Visual Studio-integrated development environment

  • Linux

    Separate file types

    .c/.cpp for host code

    .cu for device/mixed code

    Typically makefile driven

    cuda-gdb for debugging

    CUDA Visual Profiler
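The makefile-driven workflow above typically reduces to a compile-and-debug loop like the following; the file and binary names are illustrative.

```shell
# Build with host (-g) and device (-G) debug information
nvcc -g -G -o app host.cpp kernels.cu

# Debug device code with cuda-gdb
cuda-gdb ./app
```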

  • Performance: Optimization Guidelines

  • Optimize Algorithms for GPU

    Algorithm selection

    Understand the problem and consider alternate algorithms

    Maximize independent parallelism

    Maximize arithmetic intensity (math/bandwidth)

    Recompute? The GPU allocates transistors to arithmetic, not memory

    Sometimes better to recompute rather than cache

    Serial computation on GPU? Low-parallelism computation may still be faster on the GPU than copying the data back to the CPU

  • Optimize Memory Access

    Coalesce global memory access

    Maximise DRAM efficiency; order-of-magnitude impact on performance

    Avoid serialization

    Minimize shared memory bank conflicts

    Understand constant cache semantics

    Understand spatial locality

    Optimize use of textures to ensure spatial locality
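The coalescing point above can be illustrated by contrasting two access patterns; the kernel names and the stride parameter are illustrative.

```cuda
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // adjacent threads read adjacent
                                        // addresses: few wide transactions
}

__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];  // scattered addresses:
                                                  // many narrow transactions
}
```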

  • Exploit Shared Memory

    Hundreds of times faster than global memory

    Inter-thread cooperation via shared memory and sync

    Cache data that is reused by multiple threads

    Stage loads/stores to allow reordering

    Avoid non-coalesced global memory accesses
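Staging through shared memory is commonly illustrated with a tiled transpose: both the global load and the global store stay coalesced, and the reordering happens inside the on-chip tile. This sketch assumes the matrix width is a multiple of the tile size.

```cuda
#define TILE 16

__global__ void transpose(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                  // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```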

  • Use Resources Efficiently

    Partition the computation to keep the multiprocessors busy

    Many threads, many thread blocks

    Multiple GPUs

    Monitor per-multiprocessor resource utilization: registers and shared memory

    Low utilization per thread block permits multiple active blocks per multiprocessor

    Overlap computation with I/O: use asynchronous memory transfers
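Overlapping transfer and compute is done with streams and `cudaMemcpyAsync`; this is a hedged sketch with an illustrative kernel, splitting the data in two so the copy in one stream can overlap the kernel in the other.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void pipelined(float *h, float *d, int n) {
    // h must be pinned (cudaMallocHost) for the copies to be truly asynchronous
    cudaStream_t s[2];
    size_t half = (n / 2) * sizeof(float);
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        int off = k * (n / 2);
        cudaMemcpyAsync(d + off, h + off, half, cudaMemcpyHostToDevice, s[k]);
        work<<<(n / 2 + 255) / 256, 256, 0, s[k]>>>(d + off, n / 2);
        cudaMemcpyAsync(h + off, d + off, half, cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
}
```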

  • Productivity: Resources

  • Getting Started

    CUDA Zone

    www.nvidia.com/cuda

    Introductory tutorials/webinars

    Forums

    Documentation

    Programming Guide

    Best Practices Guide

    Examples

    CUDA SDK

  • Libraries

    NVIDIA

    cuBLAS: dense linear algebra (subset of the full BLAS suite)

    cuFFT: 1D/2D/3D real and complex transforms

    Third party

    NAG Numeric libraries e.g. RNGs

    cuLAPACK/MAGMA

    Open source:

    Thrust: STL/Boost-style template library

    cuDPP: data-parallel primitives (e.g. scan, sort, and reduce)

    CUSP: sparse linear algebra and graph computation

    Many more...
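As a taste of the STL-style interface that Thrust provides, a short sketch (the values are illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main(void) {
    thrust::device_vector<int> v(4);
    v[0] = 3; v[1] = 1; v[2] = 4; v[3] = 2;
    thrust::sort(v.begin(), v.end());                 // parallel sort on the GPU
    int sum = thrust::reduce(v.begin(), v.end(), 0);  // parallel reduction
    return sum == 10 ? 0 : 1;
}
```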
