Advanced CUDA 01
TRANSCRIPT
-
8/8/2019 Advanced CUDA 01
1/43
NVIDIA Corporation 2010
Advanced CUDA
-
Agenda
CUDA Review
Review of CUDA Architecture
Programming & Memory Models
Programming Environment
Execution
Performance
Optimization Guidelines
Productivity
Resources
-
REVIEW OF CUDA ARCHITECTURE
CUDA Review
-
Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
PCI Bus
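The three steps above can be sketched in CUDA C as follows (a minimal example; the `scale` kernel is a hypothetical placeholder and error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel standing in for "the GPU program".
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);

    // 1. Copy input data from CPU memory to GPU memory (across the PCI bus)
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. Load the GPU program and execute
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("h_data[1] = %f\n", h_data[1]);  // 2.0 on a CUDA-capable device
    return 0;
}
```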
-
CUDA Parallel Computing Architecture
Parallel computing architecture and programming model
Includes a CUDA C compiler, support for OpenCL and DirectCompute
Architected to natively support multiple computational interfaces (standard languages and APIs)
-
CUDA Parallel Computing Architecture
CUDA defines:
Programming model
Memory model
Execution model
CUDA uses the GPU, but is for general-purpose computing
Facilitate heterogeneous computing: CPU + GPU
CUDA is scalable
Scale to run on 100s of cores/1000s of parallel threads
-
PROGRAMMING MODEL
CUDA Review
-
CUDA Kernels
Parallel portions of the application execute as kernels
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU (Host): executes functions
GPU (Device): executes kernels
-
CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU by an array of threads, in parallel
All threads execute the same code, but can take different paths
Each thread has an ID
Select input/output data
Control decisions
float x = input[threadID];
float y = func(x);
output[threadID] = y;
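Filled out as a complete kernel (only the three-line body comes from the slide; the `__global__` wrapper, the index computation, and the body of `func` are assumptions):

```cuda
// Hypothetical device function standing in for the slide's func().
__device__ float func(float x) { return 2.0f * x + 1.0f; }

// Every thread runs the same code; its thread ID selects the element it handles.
__global__ void apply(const float *input, float *output)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}
```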
-
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
GPU
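A minimal sketch of the decomposition in code (the kernel and launch parameters are illustrative): each thread combines its block ID and thread ID into a unique global index.

```cuda
__global__ void addOne(float *data, int n)
{
    // blockIdx selects the block within the grid;
    // threadIdx selects the thread within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the last block may overshoot n
        data[i] += 1.0f;
}

// Host side: launch the kernel as a grid of blocks of threads.
// addOne<<<(n + 255) / 256, 256>>>(d_data, n);
```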
-
Communication Within a Block
Threads may need to cooperate
Memory accesses
Share results
Cooperate using shared memory
Accessible by all threads within a block
Restricting cooperation to within a block permits scalability
Fast communication between N threads is not feasible when N is large
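A sketch of intra-block cooperation through shared memory (a block-local reverse; the kernel name and block size are illustrative):

```cuda
#define BLOCK 256

__global__ void reverseWithinBlock(float *data)
{
    __shared__ float tile[BLOCK];     // visible to every thread in this block
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;

    tile[t] = data[base + t];         // each thread stages one element
    __syncthreads();                  // wait until the whole tile is written

    data[base + t] = tile[BLOCK - 1 - t];  // read an element staged by another thread
}
```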
-
Transparent Scalability G84
[Figure: the same kernel of 12 thread blocks runs on G84 two blocks at a time: (1,2), (3,4), (5,6), (7,8), (9,10), (11,12)]
-
Transparent Scalability G80
[Figure: the same 12 thread blocks run on G80 in fewer, wider waves, e.g. blocks 1-4, 5-8, then 9-12]
-
Transparent Scalability GT200
[Figure: all 12 thread blocks run concurrently on GT200; the remaining multiprocessors are idle]
-
CUDA Programming Model - Summary
A kernel executes as a grid of thread blocks
A block is a batch of threads
Communicate through shared memory
Each block has a block ID
Each thread has a thread ID
[Figure: the host launches Kernel 1 and Kernel 2; each executes on the device as a grid of thread blocks, identified by block IDs such as (0,0)...(1,1)]
-
MEMORY MODEL
CUDA Review
-
Memory hierarchy
Thread:
Registers
Thread:
Local memory
Block of threads:
Shared memory
All blocks:
Global memory
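How the hierarchy appears in CUDA C (a sketch; all names are illustrative):

```cuda
__device__ float d_global[256];     // global memory: visible to all blocks

__global__ void spaces(const float *in)
{
    int i = threadIdx.x;
    float r = in[i];                // register: private to this thread
    __shared__ float s[256];        // shared memory: one copy per block
    s[i] = r;
    __syncthreads();
    d_global[i] = s[i];             // global memory: persists across blocks
    // Per-thread arrays too large for registers are placed in local
    // memory, which is also per-thread but physically off-chip.
}
```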
-
Additional Memories
Host can also allocate textures and arrays of constants
Textures and constants have dedicated caches
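A sketch of constant memory in use (names are illustrative): the host fills the constant array with `cudaMemcpyToSymbol` before launching the kernel, and reads are served by the dedicated constant cache.

```cuda
__constant__ float coeffs[16];   // constant memory, cached on chip

__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = coeffs[0] + coeffs[1] * x[i];  // broadcast reads hit the cache
}

// Host side:
// float h_coeffs[16] = { /* ... */ };
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```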
-
PROGRAMMING ENVIRONMENT
CUDA Review
-
CUDA C and OpenCL
Shared back-end compiler and optimization technology
Entry point for developers who want a low-level API
Entry point for developers who prefer high-level C
-
Visual Studio
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Compilation rules: cuda.rules
Syntax highlighting
Intellisense
Integrated debugger andprofiler: Nexus
-
NVIDIA Nexus IDE
The industry's first IDE for massively parallel applications
Accelerates co-processing (CPU + GPU) application development
Complete Visual Studio-integrated development environment
-
Linux
Separate file types
.c/.cpp for host code
.cu for device/mixed code
Typically makefile driven
cuda-gdb for debugging
CUDA Visual Profiler
-
OPTIMIZATION GUIDELINES
Performance
-
Optimize Algorithms for GPU
Algorithm selection
Understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Recompute?
The GPU allocates transistors to arithmetic, not memory
Sometimes better to recompute rather than cache
Serial computation on GPU?
Low-parallelism computation may still be faster on the GPU than copying to and from the host
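A small sketch of the "recompute rather than cache" guideline (names are illustrative): each thread recomputes its grid coordinate from its index instead of reading it from a precomputed table in global memory, trading cheap arithmetic for expensive memory traffic.

```cuda
__global__ void sampleField(float *out, float x0, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = x0 + i * dx;   // recomputed per thread; no lookup table load
        out[i] = x * x;          // stand-in for the real computation
    }
}
```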
-
Optimize Memory Access
Coalesce global memory access
Maximise DRAM efficiency
Order of magnitude impact on performance
Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics
Understand spatial locality
Optimize use of textures to ensure spatial locality
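A sketch of coalesced versus strided global access (kernel names are illustrative): adjacent threads reading adjacent addresses combine into a few wide DRAM transactions, while a large stride scatters the accesses.

```cuda
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];           // thread i -> element i: coalesced
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];  // scattered: poor DRAM efficiency
}
```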
-
Exploit Shared Memory
Hundreds of times faster than global memory
Inter-thread cooperation via shared memory and sync
Cache data that is reused by multiple threads
Stage loads/stores to allow reordering
Avoid non-coalesced global memory accesses
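A classic example of staging loads/stores through shared memory is a tiled matrix transpose (a sketch, assuming a square matrix whose width is a multiple of TILE): both the global read and the global write stay coalesced, and the reordering happens on chip.

```cuda
#define TILE 16

__global__ void transpose(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 pad avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```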
-
Use Resources Efficiently
Partition the computation to keep multiprocessors busy
Many threads, many thread blocks
Multiple GPUs
Monitor per-multiprocessor resource utilization
Registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor
Overlap computation with I/O
Use asynchronous memory transfers
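A sketch of overlapping computation with I/O using two CUDA streams (the `work` kernel is hypothetical; the host buffer must be page-locked via `cudaMallocHost` for the copies to be truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void runOverlapped(int n)          // assumes n is even
{
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // page-locked host buffer
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    size_t bytes = half * sizeof(float);
    for (int k = 0; k < 2; ++k) {
        // The copy queued in stream k can overlap with the kernel
        // already running in the other stream.
        cudaMemcpyAsync(d + k * half, h + k * half, bytes,
                        cudaMemcpyHostToDevice, s[k]);
        work<<<(half + 255) / 256, 256, 0, s[k]>>>(d + k * half, half);
        cudaMemcpyAsync(h + k * half, d + k * half, bytes,
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
}
```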
-
RESOURCES
Productivity
-
Getting Started
CUDA Zone
www.nvidia.com/cuda
Introductory tutorials/webinars
Forums
Documentation
Programming Guide
Best Practices Guide
Examples
CUDA SDK
-
Libraries
NVIDIA
cuBLAS: dense linear algebra (subset of the full BLAS suite)
cuFFT: 1D/2D/3D real and complex transforms
Third party
NAG: numeric libraries, e.g. RNGs
cuLAPACK/MAGMA
Open source
Thrust: STL/Boost-style template library
cuDPP: data-parallel primitives (e.g. scan, sort, and reduce)
CUSP: sparse linear algebra and graph computation
Many more...
-