Advanced CUDA 01

Uploaded by tenanpekok, posted 09-Apr-2018


TRANSCRIPT

  • Advanced CUDA 01 (© NVIDIA Corporation 2010; transcript captured 8/8/2019)

    Advanced CUDA

  • Agenda

    CUDA Review

    Review of CUDA Architecture

    Programming & Memory Models

    Programming Environment

    Execution

    Performance

    Optimization Guidelines

    Productivity

    Resources

  • CUDA Review: Review of CUDA Architecture

  • Processing Flow

    1. Copy input data from CPU memory to GPU memory

    2. Load the GPU program and execute it, caching data on chip for performance

    3. Copy results from GPU memory back to CPU memory

    (Diagram: CPU and GPU connected by the PCI bus)
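The three-step flow above can be sketched as host code; this is a minimal example, with the kernel name (`scale`) and sizes being illustrative rather than from the slides.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // runs on the GPU
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. CPU -> GPU over the PCI bus
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);       // 2. load and execute the GPU program
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 3. GPU -> CPU
    printf("%f\n", h[0]);
    cudaFree(d); free(h);
    return 0;
}
```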

  • CUDA Parallel Computing Architecture

    A parallel computing architecture and programming model

    Includes a CUDA C compiler, with support for OpenCL and DirectCompute

    Architected to natively support multiple computational interfaces (standard languages and APIs)

  • CUDA Parallel Computing Architecture (cont.)

    CUDA defines:

    Programming model

    Memory model

    Execution model

    CUDA uses the GPU, but is for general-purpose computing

    Facilitates heterogeneous computing: CPU + GPU

    CUDA is scalable

    Scales to run on hundreds of cores and thousands of parallel threads

  • CUDA Review: Programming Model

  • CUDA Kernels

    Parallel portions of the application execute as kernels

    Entire GPU executes kernel, many threads

    CUDA threads:

    Lightweight

    Fast switching

    1000s execute simultaneously

    The CPU (host) executes functions; the GPU (device) executes kernels

  • CUDA Kernels: Parallel Threads

    A kernel is a function executed on the GPU as an array of threads, in parallel

    All threads execute the samecode, can take different paths

    Each thread has an ID

    Select input/output data

    Control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
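Expanding the fragment above into a complete kernel shows where the thread ID comes from; `func` here is a stand-in for any per-element computation, not a function from the slides.

```cuda
__device__ float func(float x) { return 2.0f * x + 1.0f; }  // illustrative

__global__ void apply(const float *input, float *output, int n) {
    // Each thread derives its own ID from the built-in index variables
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {               // guard against out-of-range threads
        float x = input[threadID];    // select input data by thread ID
        float y = func(x);
        output[threadID] = y;         // select output location by thread ID
    }
}
```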

  • CUDA Kernels: Subdivide into Blocks

    Threads are grouped into blocks

    Blocks are grouped into a grid

    A kernel is executed as a grid of blocks of threads

    (Diagram: the grid of thread blocks mapped onto the GPU)
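The grid/block structure above is expressed at launch time with `dim3` execution configuration; the kernel name and dimensions here are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(void) {
    // blockIdx identifies this block within the grid;
    // threadIdx identifies this thread within its block
}

int main(void) {
    dim3 block(16, 16);          // 256 threads per block
    dim3 grid(8, 8);             // 64 blocks in the grid
    kernel<<<grid, block>>>();   // the kernel executes as a grid of blocks of threads
    cudaDeviceSynchronize();
    return 0;
}
```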

  • Communication Within a Block

    Threads may need to cooperate

    Memory accesses

    Share results

    Cooperate using shared memory

    Accessible by all threads within a block

    Restriction to within a block permits scalability

    Fast communication between N threads is not feasible when N is large
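A minimal sketch of block-level cooperation: threads stage data in `__shared__` storage and synchronize before reading each other's results (here, reversing one block's elements; the size 256 is an assumed block width).

```cuda
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];           // visible to all threads in this block
    int t = threadIdx.x;                  // assumes blockDim.x == 256
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();                      // all writes visible before any reads
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}
```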

  • Transparent Scalability

    (Diagrams: the same twelve thread blocks scheduled onto G84, G80, and GT200. A device with few multiprocessors runs a few blocks at a time; GT200 runs all twelve at once, leaving spare units idle. Blocks may execute in any order on any number of multiprocessors, so one binary scales across devices.)

  • CUDA Programming Model - Summary

    A kernel executes as a grid of thread blocks

    A block is a batch of threads

    Threads in a block communicate through shared memory

    Each block has a block ID

    Each thread has a thread ID

    (Diagram: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks indexed (0,0), (0,1), (1,0), (1,1), ...)

  • CUDA Review: Memory Model

  • Memory Hierarchy

    Thread: registers

    Thread: local memory

    Block of threads: shared memory

    All blocks: global memory
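The hierarchy maps onto CUDA declarations roughly as follows; this is a hedged sketch with illustrative names, assuming a block width of 256 and a grid that covers the array.

```cuda
__constant__ float coeff[16];          // constant memory (dedicated cache, see next slide)

__global__ void levels(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = in[i];                   // r lives in a register (per thread);
                                       // register spills go to local memory
    __shared__ float tile[256];        // shared memory: one copy per block
    tile[threadIdx.x] = r * coeff[0];
    __syncthreads();
    out[i] = tile[threadIdx.x];        // in/out point to global memory (all blocks)
}
```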

  • Additional Memories

    The host can also allocate textures and arrays of constants

    Textures and constants have dedicated caches

  • CUDA Review: Programming Environment

  • CUDA C and OpenCL

    Shared back-end compiler and optimization technology

    OpenCL: entry point for developers who want a low-level API

    CUDA C: entry point for developers who prefer a high-level language

  • Visual Studio

    Separate file types

    .c/.cpp for host code

    .cu for device/mixed code

    Compilation rules: cuda.rules

    Syntax highlighting

    Intellisense

    Integrated debugger and profiler: Nexus

  • NVIDIA Nexus IDE

    The industry's first IDE for massively parallel applications

    Accelerates co-processing (CPU + GPU) application development

    Complete Visual Studio-integrated development environment

  • Linux

    Separate file types

    .c/.cpp for host code

    .cu for device/mixed code

    Typically makefile driven

    cuda-gdb for debugging

    CUDA Visual Profiler
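The makefile-driven workflow above typically reduces to a compile-and-debug loop like the following; the file and binary names are illustrative.

```shell
# Build with host (-g) and device (-G) debug information
nvcc -g -G -o app host.cpp kernels.cu

# Debug device code with cuda-gdb
cuda-gdb ./app
```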

  • Performance: Optimization Guidelines

  • Optimize Algorithms for GPU

    Algorithm selection

    Understand the problem and consider alternate algorithms

    Maximize independent parallelism

    Maximize arithmetic intensity (math/bandwidth)

    Recompute? The GPU allocates transistors to arithmetic, not memory

    Sometimes better to recompute rather than cache

    Serial computation on GPU? Low-parallelism computation may still be faster on the GPU than copying the data back to the CPU

  • Optimize Memory Access

    Coalesce global memory access

    Maximise DRAM efficiency; order-of-magnitude impact on performance

    Avoid serialization

    Minimize shared memory bank conflicts

    Understand constant cache semantics

    Understand spatial locality

    Optimize use of textures to ensure spatial locality
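The coalescing point above can be illustrated by contrasting two access patterns; the kernel names and the stride parameter are illustrative.

```cuda
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // adjacent threads read adjacent
                                        // addresses: few wide transactions
}

__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];  // scattered addresses:
                                                  // many narrow transactions
}
```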

  • Exploit Shared Memory

    Hundreds of times faster than global memory

    Inter-thread cooperation via shared memory and sync

    Cache data that is reused by multiple threads

    Stage loads/stores to allow reordering

    Avoid non-coalesced global memory accesses
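Staging through shared memory is commonly illustrated with a tiled transpose: both the global load and the global store stay coalesced, and the reordering happens inside the on-chip tile. This sketch assumes the matrix width is a multiple of the tile size.

```cuda
#define TILE 16

__global__ void transpose(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                  // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```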

  • Use Resources Efficiently

    Partition the computation to keep the multiprocessors busy

    Many threads, many thread blocks

    Multiple GPUs

    Monitor per-multiprocessor resource utilization: registers and shared memory

    Low utilization per thread block permits multiple active blocks per multiprocessor

    Overlap computation with I/O: use asynchronous memory transfers
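Overlapping transfer and compute is done with streams and `cudaMemcpyAsync`; this is a hedged sketch with an illustrative kernel, splitting the data in two so the copy in one stream can overlap the kernel in the other.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void pipelined(float *h, float *d, int n) {
    // h must be pinned (cudaMallocHost) for the copies to be truly asynchronous
    cudaStream_t s[2];
    size_t half = (n / 2) * sizeof(float);
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        int off = k * (n / 2);
        cudaMemcpyAsync(d + off, h + off, half, cudaMemcpyHostToDevice, s[k]);
        work<<<(n / 2 + 255) / 256, 256, 0, s[k]>>>(d + off, n / 2);
        cudaMemcpyAsync(h + off, d + off, half, cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
}
```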

  • Productivity: Resources

  • Getting Started

    CUDA Zone

    www.nvidia.com/cuda

    Introductory tutorials/webinars

    Forums

    Documentation

    Programming Guide

    Best Practices Guide

    Examples

    CUDA SDK

  • Libraries

    NVIDIA

    cuBLAS: dense linear algebra (subset of the full BLAS suite)

    cuFFT: 1D/2D/3D real and complex transforms

    Third party

    NAG Numeric libraries e.g. RNGs

    cuLAPACK/MAGMA

    Open source:

    Thrust: STL/Boost-style template library

    cuDPP: data-parallel primitives (e.g. scan, sort, and reduce)

    CUSP: sparse linear algebra and graph computation

    Many more...
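As a taste of the STL-style interface that Thrust provides, a short sketch (the values are illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main(void) {
    thrust::device_vector<int> v(4);
    v[0] = 3; v[1] = 1; v[2] = 4; v[3] = 2;
    thrust::sort(v.begin(), v.end());                 // parallel sort on the GPU
    int sum = thrust::reduce(v.begin(), v.end(), 0);  // parallel reduction
    return sum == 10 ? 0 : 1;
}
```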
