CUDA Materials
TRANSCRIPT
CUDA Teaching Center, Department of Computer Applications, NIT, Trichy
Course Outline
- GPU Architecture Overview (this module)
- Tools of the Trade
- Introduction to CUDA C
- Patterns of Parallel Computing
- Thread Cooperation and Synchronization
- The Many Types of Memory
- Atomic Operations
- Events and Streams
- CUDA in Advanced Scenarios
Note: to follow this course, you require a CUDA-capable GPU.
History of GPU Computation
- No GPU: the CPU handled graphics output
- Dedicated GPU: separate graphics card (PCI, AGP)
- Programmable GPU: shaders
Shaders
- Small programs that run on the GPU
- Types
  - Vertex (used to calculate the location of a vertex)
  - Pixel (used to calculate the color components of a single pixel)
- Shader languages
  - High Level Shader Language (HLSL, Microsoft DirectX)
  - OpenGL Shading Language (GLSL, OpenGL)
  - Both are C-like
  - Both are intended for graphics (i.e., not general-purpose)
- Pixel shaders used for math
  - Convert data to a texture
  - Run the texture through a pixel shader
  - Get the result texture and convert it back to data
Why GPGPU?
- General-Purpose computation on GPUs
- Highly parallel architecture
  - Lots of concurrent threads of execution (SIMT)
  - Higher throughput compared to CPUs
    - Even taking into account many cores, hyperthreading, SIMD
    - Thus more FLOPS (floating-point operations per second)
- Commodity hardware
  - Commonly available (mainly used by gamers)
  - Relatively cheap compared to custom solutions (e.g., FPGAs)
- Sensible programming model
  - Manufacturers realized GPGPU market potential
  - Graphics made optional
  - NVIDIA offers the dedicated GPU platform "Tesla"
    - No output connections
GPGPU Frameworks
- Compute Unified Device Architecture (CUDA)
  - Developed by NVIDIA Corporation
  - Extensions to programming languages (C/C++)
  - Wrappers for other languages/platforms (e.g., FORTRAN, PyCUDA, MATLAB)
- Open Computing Language (OpenCL)
  - Supported by many manufacturers (incl. NVIDIA)
  - The high-level way to perform computation on ATI devices
- C++ Accelerated Massive Parallelism (C++ AMP)
  - C++ superset
  - A standard by Microsoft, part of MSVC++
  - Supports both ATI and NVIDIA
- Other frameworks and cross-compilers
  - Alea, Aparapi, Brook, etc.
Graphics Processor Architecture
- Warning: NVIDIA terminology ahead
- Streaming Multiprocessor (SM)
  - Contains several CUDA cores
  - Can have more than one SM per card
- CUDA Core (a.k.a. Streaming Processor, Shader Unit)
  - Number of cores per SM is tied to compute capability
- Different types of memory
  - Means of access
  - Performance characteristics
Graphics Processor Architecture
[Diagram: two SMs (SM1, SM2), each containing streaming processors (SP1-SP4), registers, and shared memory; device memory, constant memory, and texture memory sit outside the SMs]
Compute Capability
- A number indicating what the card can do
  - Current range: 1.0, 1.x, 2.x, 3.0, 3.5
- Affects both hardware and API support
  - Number of CUDA cores per SM
  - Max number of 32-bit registers per SM
  - Max number of instructions per kernel
  - Support for double-precision ops
  - ... and many other parameters
- Higher is better
- See http://en.wikipedia.org/wiki/CUDA
Choosing a Graphics Card
- Look at performance specs (peak FLOPS)
  - Pay attention to single vs. double precision
  - E.g., http://www.nvidia.com/object/tesla-servers.html
- Number of GPUs
- Compute capability/architecture
  - COTS cards are cheaper than dedicated compute cards
- Memory size
- Ensure the PSU/motherboard is good enough to handle the card(s)
- Can have more than one graphics card
  - YMMV (PCI saturation)
  - Can mix architectures (e.g., NVIDIA + ATI)
Overview
- Windows, Mac OS, Linux
  - This course uses Windows/Visual Studio
  - This course focuses on desktop GPU development
- CUDA Toolkit
  - LLVM-based compiler
  - Headers & libraries
  - Documentation
  - Samples
- Nsight
  - Visual Studio: plugin that allows debugging
  - Eclipse IDE
- http://developer.nvidia.com/cuda
CUDA Tools
What is Nsight?
- Many developers, few GPUs
  - No need for each dev to have a GPU
- Client-server architecture
  - Server with GPU (target)
    - Runs the Nsight service
    - Accepts connections from developers
  - Clients with Visual Studio + Nsight plugin (host)
    - Use Nsight to connect to the server to run the application and debug GPU code
- Caveats
  - Need to use VS Remote Debugging to debug CPU code
  - Need your own mechanism for syncing output
[Diagram: Visual Studio with the Nsight plugin (host) connects to the Nsight service running on the target machine with the GPU(s)]
Overview
- Compilation Process
- Obligatory "Hello CUDA" demo
- Location Qualifiers
- Execution Model
- Grid and Block Dimensions
- Error Handling
- Device Introspection
NVIDIA CUDA Compiler (nvcc)
- nvcc is used to compile CUDA-specific code
  - Not a full compiler itself: it drives the host C or C++ compiler (MSVC, GCC)
  - Some aspects are C++ specific
- Splits code into GPU and non-GPU parts
  - Host code is passed to the native compiler
- Accepts project-defined GPU-specific settings
  - E.g., compute capability
- Translates code written in CUDA C into PTX
  - The graphics driver turns PTX into binary code
NVCC Compilation Process
[Diagram: nvcc takes a source file containing __global__ void a(){} and void b(){}; the GPU part (a) is compiled by nvcc itself, while the host part (b) is passed to the host compiler (e.g., MSVC's cl.exe); the results are linked into a single executable. A minimal example follows.]
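To make the split concrete, here is a minimal "Hello CUDA" sketch: one .cu file containing both a kernel (the GPU part) and host code (the CPU part). The file name, launch configuration, and nvcc invocation are illustrative, not from the course materials.

// hello.cu -- compile with, e.g.: nvcc hello.cu -o hello (illustrative invocation)
#include <cstdio>

// GPU part: nvcc compiles this to PTX
__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Host part: passed through to the host compiler (e.g., cl.exe or gcc)
int main()
{
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the kernel (and its printf) to finish
    return 0;
}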
Parallel Thread Execution (PTX)
- PTX is the "assembly language" of CUDA
  - Similar to .NET IL or Java bytecode
  - Low-level GPU instructions
- Can be generated from a project
- Typically useful to compiler writers
  - E.g., GPU Ocelot https://code.google.com/p/gpuocelot/
- Inline PTX (asm)
Location Qualifiers
- __global__: defines a kernel
  - Runs on the GPU, called from the CPU
  - Executed with <<<dim3>>> arguments
- __device__: runs on the GPU, called from the GPU
  - Can be used for variables too
- __host__: runs on the CPU, called from the CPU
- Qualifiers can be mixed (see the sketch below)
  - E.g., __host__ __device__ foo()
  - Code compiled for both CPU and GPU
  - Useful for testing
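A short sketch of how the qualifiers combine; the function names are illustrative:

// Callable from both host and device; compiled for both (handy for testing)
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: runs on the GPU, callable only from GPU code
__device__ float scaled(float x) { return 2.0f * square(x); }

// Kernel: runs on the GPU, launched from the CPU with <<<...>>>
__global__ void apply(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scaled(data[i]);
}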
Execution Model
[Diagram: doSomething<<<2,3>>>() launches a grid of 2 thread blocks with 3 threads each; doSomethingElse<<<4,2>>>() launches a grid of 4 blocks of 2 threads. Each launch forms a grid, made of thread blocks, made of threads.]
Execution Model
- Thread blocks are scheduled to run on available SMs
- Each SM executes one block at a time
- A thread block is divided into warps
  - Number of threads per warp depends on compute capability
- All warps are handled in parallel
- CUDA Warp Watch
Dimensions
- We defined execution as <<<a,b>>>
  - A grid of a blocks of b threads each
  - The grid and each block are 1D structures
- In reality, these constructs are 3D
  - A 3-dimensional grid of 3-dimensional blocks
  - You can define (a×b×c) blocks of (x×y×z) threads
  - Can have 2D or 1D by setting extra dimensions to 1
- Defined in the dim3 structure (see the sketch below)
  - Simple container with x, y and z values
  - Some constructors defined for C++
  - Automatic conversion: <<<a,b>>> means (a,1,1) by (b,1,1)
[Diagram: hierarchy of grid, blocks within the grid, and threads within each block]
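A minimal sketch of a 2D launch with dim3; the kernel name and block size are illustrative:

__global__ void kernel2d(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 1.0f;    // each thread handles one element
}

void launch(float* d_out, int width, int height)
{
    dim3 threads(16, 16);   // 2D block: 16x16 = 256 threads (z defaults to 1)
    dim3 blocks((width  + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);  // enough blocks to cover
    kernel2d<<<blocks, threads>>>(d_out, width, height);
}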
Thread Variables
- Execution parameters & current position
- blockIdx
  - Where we are in the grid
- gridDim
  - The size of the grid
- threadIdx
  - Position of the current thread in its thread block
- blockDim
  - Size of the thread block
- Limitations
  - Grid & block sizes
  - Number of threads
Error Handling
- CUDA does not throw
  - Silent failure
- Core functions return cudaError_t (see the sketch below)
  - Can check against cudaSuccess
  - Get a description with cudaGetErrorString()
- Libraries may have different error types
  - E.g., cuRAND has curandStatus_t
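A common pattern, shown here as a sketch rather than anything from the course materials, is to wrap calls in a checking macro:

#include <cstdio>
#include <cstdlib>

// Check the cudaError_t returned by core CUDA functions
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage (illustrative): CUDA_CHECK(cudaMalloc(&d_ptr, bytes));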
Rules of the Game
- Different types of memory
  - Shared vs. private
  - Access speeds
- Data is in arrays
  - No parallel data structures
  - No other data structures
  - No auto-parallelization/vectorization compiler support
  - No CPU-type SIMD equivalent
- Compiler constraint
  - No C++11 support
Overview
- Data Access
- Map
- Gather
- Reduce
- Scan
Data Access
- A real problem!
- Thread space can be up to 6D
  - 3D grid of 3D thread blocks
- Input space is typically 1D
  - 2D arrays are possible
- Need to map threads to inputs
- Some examples (see the sketch below)
  - 1 block, N threads: threadIdx.x
  - 1 block, M×N threads: threadIdx.y * blockDim.x + threadIdx.x
  - N blocks, M threads: blockIdx.x * blockDim.x + threadIdx.x
  - ... and so on
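A sketch of the most common mapping, the N-blocks-of-M-threads case above; the kernel name and operation are illustrative:

__global__ void mapThreadsToInput(const float* in, float* out, int n)
{
    // Global index: N blocks of M threads, flattened to one dimension
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: thread count may exceed input size
        out[i] = in[i] * 2.0f;
}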
Map
Gather
Black-Scholes Formula
Reduce
[Diagram: pairwise summation tree; elements are added in pairs over successive steps until one value remains]
Reduce in Practice
- Adding up N data elements
- Use 1 block of N/2 threads
- Each thread does x[i] += x[j]; (see the sketch below)
- At each step
  - Number of threads is halved
  - Distance (j-i) is doubled
- x[0] is the result
[Diagram: reducing 1 2 3 4 5 6 7 8; step 1 gives pairwise sums 3 7 11 15 (threads 0-3), step 2 gives 10 26 (threads 0-1), step 3 gives 36 (thread 0)]
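A sketch of this single-block reduction, assuming n is a power of two and the data already sits in device memory; the kernel name is illustrative:

// Launch as reduceSum<<<1, n/2>>>(d_x, n)
__global__ void reduceSum(float* x, int n)
{
    for (int step = 1; step < n; step *= 2) {
        int i = 2 * step * threadIdx.x;  // distance (j-i) doubles each round,
        if (i + step < n)                // so half as many threads stay active
            x[i] += x[i + step];
        __syncthreads();                 // all threads wait between rounds
    }
    // x[0] now holds the total
}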
Scan
Example: the input 4 2 5 3 6 produces the running sums 4 6 11 14 20
Scan in Practice
- Similar to reduce
- Requires N-1 threads
- Step size keeps doubling
- Number of threads is reduced by the step size
- Each thread n does x[n+step] += x[n]; (see the sketch below)
[Diagram: scanning 1 2 3 4; after step 1 the array is 1 3 5 7, after step 2 it is 1 3 6 10]
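A sketch of an inclusive scan for a single block. Applied literally in parallel, x[n+step] += x[n] races with neighboring threads, so this version double-buffers in shared memory (the naive Hillis-Steele formulation); the names are illustrative:

// Launch (illustrative): scanInclusive<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n)
__global__ void scanInclusive(const float* in, float* out, int n)
{
    extern __shared__ float buf[];   // 2*n floats, size passed at launch
    float* src = buf;
    float* dst = buf + n;
    int tid = threadIdx.x;

    if (tid < n) src[tid] = in[tid];
    __syncthreads();

    for (int step = 1; step < n; step *= 2) {
        if (tid < n)   // each round reads the previous round's values safely
            dst[tid] = (tid >= step) ? src[tid] + src[tid - step] : src[tid];
        __syncthreads();
        float* t = src; src = dst; dst = t;   // swap buffers
    }
    if (tid < n) out[tid] = src[tid];
}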
Graphics Processor Architecture
[Diagram: as before, two SMs with SPs, registers, and shared memory over device memory, but the constant and texture memories are now shown with their on-chip caches inside each SM]
Device Memory
- Grid scope (i.e., available to all threads in all blocks in the grid)
- Application lifetime (exists until the app exits or it is explicitly deallocated)
- Dynamic (see the sketch below)
  - cudaMalloc() to allocate
  - Pass the pointer to the kernel
  - cudaMemcpy() to copy to/from host memory
  - cudaFree() to deallocate
- Static
  - Declare a global variable as device: __device__ int sum = 0;
  - Use freely within the kernel
  - Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
  - No need to explicitly deallocate
- Slowest and most inefficient
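A sketch of the dynamic pattern; the kernel, buffer sizes, and names are illustrative:

#include <cstdlib>

__global__ void doubleAll(float* data, int n)   // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int n = 1024;
    size_t bytes = n * sizeof(float);
    float* h_data = (float*)calloc(n, sizeof(float));   // host buffer
    float* d_data = nullptr;

    cudaMalloc(&d_data, bytes);                                  // allocate on the GPU
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    doubleAll<<<n / 256, 256>>>(d_data, n);                      // pass pointer to kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);                                            // deallocate
    free(h_data);
    return 0;
}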
Constant & Texture Memory
- Read-only: useful for lookup tables, model parameters, etc.
- Grid scope, application lifetime
- Resides in device memory, but cached in a constant memory cache
- Constrained by MAX_CONSTANT_MEMORY
  - Expect 64 KB
- Similar operation to statically-defined device memory (see the sketch below)
  - Declare as __constant__
  - Use freely within the kernel
  - Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
- Very fast provided all threads read from the same location
- Used for kernel arguments
- Texture memory: similar to constant, optimized for 2D access patterns
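A sketch of the __constant__ pattern; the coefficient table and names are illustrative:

__constant__ float coeffs[16];   // lives in constant memory, grid scope

__global__ void weigh(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[0];   // fast: every thread reads the same slot
}

// Host side: copy the table to the symbol (no cudaMalloc/cudaFree needed)
void setup(const float* h_coeffs)
{
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}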
Shared Memory
- Block scope
  - Shared only within a thread block
  - Not shared between blocks
- Kernel lifetime
- Must be declared within the kernel function body
- Very fast (see the sketch below)
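A sketch of staging data in shared memory; the kernel is illustrative and assumes a launch with 256 threads per block:

__global__ void blockSum(const float* in, float* blockTotals)
{
    __shared__ float tile[256];          // one value per thread in this block
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];   // stage from device memory
    __syncthreads();                     // all loads done before anyone reads

    if (tid == 0) {                      // thread 0 sums the block's tile
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) sum += tile[i];
        blockTotals[blockIdx.x] = sum;
    }
}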
Register & Local Memory
- Memory can be allocated right within the kernel
- Thread scope, kernel lifetime
- Non-array memory
  - int tid = ...
  - Stored in a register
  - Very fast
- Array memory (see the sketch below)
  - Stored in "local memory"
  - Local memory is an abstraction, actually placed in global memory
  - Thus, as slow as global memory
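A sketch contrasting the two, per the rule of thumb above; the kernel is illustrative (in practice the compiler may promote small, statically-indexed arrays to registers):

__global__ void example(float* out)
{
    int tid = threadIdx.x;   // scalar: lives in a register (fast)
    float history[10];       // per-thread array: "local" memory,
                             // actually backed by slow global memory
    for (int i = 0; i < 10; ++i)
        history[i] = tid * i;
    out[tid] = history[9];
}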
Summary
Declaration             Memory    Scope   Lifetime     Slowdown
int foo;                Register  Thread  Kernel       1x
int foo[10];            Local     Thread  Kernel       100x
__shared__ int foo;     Shared    Block   Kernel       1x
__device__ int foo;     Global    Grid    Application  100x
__constant__ int foo;   Constant  Grid    Application  1x
Thread Synchronization
- Threads can take different amounts of time to complete a part of a computation
- Sometimes, you want all threads to reach a particular point before continuing their work
- CUDA provides a thread barrier function: __syncthreads()
  - A thread that calls __syncthreads() waits for the other threads to reach this line
  - Then, all threads can continue executing
[Diagram: threads A, B, C arrive at the barrier at different times and proceed together afterwards]
Restrictions
- __syncthreads() only synchronizes threads within a block
- A thread that calls __syncthreads() waits for other threads to reach this location
- All threads must reach __syncthreads()
  if (x > 0.5f)
  {
    __syncthreads(); // bad idea: threads that skip the branch never arrive
  }
- Each call to __syncthreads() is unique
  if (x > 0.5f)
  {
    __syncthreads();
  }
  else
  {
    __syncthreads(); // also a bad idea: these are two different barriers
  }
Branch Divergence
- Once a block is assigned to an SM, it is split into several warps
- All threads within a warp must execute the same instruction at the same time
  - I.e., the if and else branches cannot be executed concurrently
- Avoid branching if possible
- Ensure if statements cut on warp boundaries
[Diagram: at a branch, the warp serializes the two paths, causing a performance loss]
Summary
- Threads are synchronized with __syncthreads()
  - Block scope
  - All threads must reach the barrier
  - Each __syncthreads() creates a unique barrier
- A block is split into warps
  - All threads in a warp execute the same instruction
  - Branching leads to warp divergence (performance loss)
  - Avoid branching, or ensure branch cases fall in different warps
Overview
- Why atomics?
- Atomic Functions
- Atomic Sum
- Monte Carlo Pi
Why Atomics?
- x++ is a read-modify-write operation
  - Read x into a register
  - Increment the register value
  - Write the register back into x
  - Effectively { temp = x; temp = temp+1; x = temp; }
- If two threads do x++
  - Each thread has its own temp (say t1 and t2)
  - { t1 = x; t1 = t1+1; x = t1; }
    { t2 = x; t2 = t2+1; x = t2; }
  - Race condition: the thread that writes to x last wins
  - Whoever wins, x gets incremented only once
Atomic Functions
- Problem: many threads accessing the same memory location
- Atomic operations ensure that only one thread at a time can access the location
  - Grid scope!
- atomicOP(x,y) effectively performs, as one uninterruptible step:
  - t1 = *x;      // read
  - t2 = t1 OP y; // modify
  - *x = t2;      // write
- Atomics need to be configured (see the sketch below)
  - #include "sm_20_atomic_functions.h"
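A sketch of the "atomic sum" idea from the overview; the kernel name is illustrative (atomicAdd() on float requires compute capability 2.0 or higher, matching the sm_20 header above):

__global__ void sumAll(const float* in, float* total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, in[i]);   // read-modify-write as one uninterruptible step
}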
Monte Carlo Pi
[Diagram: a circle of radius r inscribed in a square; the fraction of random points landing inside the circle estimates pi/4]
Summary
- Atomic operations ensure operations on a variable cannot be interrupted by a different thread
- CUDA supports several atomic operations
  - atomicAdd()
  - atomicOr()
  - atomicMin()
  - ... and others
- Atomics incur a heavy performance penalty
Overview
- Events
- Event API
- Event example
- Pinned memory
- Streams
- Stream API
- Example (single stream)
- Example (multiple streams)
Events
- How to measure performance?
  - Use OS timers
    - Too much noise
  - Use a profiler
    - Times only kernel duration + other invocations
  - CUDA Events
    - Event = timestamp
    - Timestamp recorded on the GPU
    - Invoked from the CPU side
Event API
- cudaEvent_t
  - The event handle
- cudaEventCreate(&e)
  - Creates the event
- cudaEventRecord(e, 0)
  - Records the event, i.e. the timestamp
  - Second param is the stream to which to record the event
- cudaEventSynchronize(e)
  - CPU and GPU are async, can be doing things in parallel
  - cudaEventSynchronize() blocks all instruction processing until the GPU has reached the event
- cudaEventElapsedTime(&f, start, stop)
  - Computes the elapsed time (msec) between start and stop, stored as a float
  (see the sketch below)
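Putting the API together, a sketch of timing a kernel; someKernel and its launch configuration are illustrative placeholders:

__global__ void someKernel() { /* work to be timed */ }

void timeKernel()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // timestamp before the work
    someKernel<<<128, 256>>>();
    cudaEventRecord(stop, 0);         // timestamp after the work

    cudaEventSynchronize(stop);       // wait until the GPU reaches 'stop'
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}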
Pinned Memory
- CPU memory is pageable
  - Can be swapped to disk
- Pinned (page-locked) memory stays in place
  - Performance advantage when copying to/from the GPU
  - Use cudaHostAlloc() instead of malloc() or new (see the sketch below)
  - Use cudaFreeHost() to deallocate
- Cannot be swapped out
  - Must have enough physical memory
  - Proactively deallocate
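A sketch of allocating and releasing pinned memory; the size is illustrative:

float* h_buf = nullptr;
size_t bytes = 1 << 20;   // 1 MB, illustrative

// Page-locked host allocation instead of malloc/new
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);

// ... fill h_buf, copy to/from the device (faster, and required for cudaMemcpyAsync) ...

cudaFreeHost(h_buf);      // deallocate proactively: pinned memory is a scarce resource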
Streams
- Remember cudaEventRecord(event, stream)?
- A CUDA stream is a queue of GPU operations
  - Kernel launch
  - Memory copy
- Streams allow a form of task-based parallelism
  - Performance improvement
- To leverage streams you need device overlap support
  - GPU_OVERLAP
Stream API
- cudaStream_t
- cudaStreamCreate(&stream)
- kernel<<<blocks,threads,shared,stream>>>
- cudaMemcpyAsync()
  - Must use pinned memory!
  - Takes a stream parameter
- cudaStreamSynchronize(stream)
  (see the sketch below)
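A sketch of queueing a copy-kernel-copy pipeline on one stream. It assumes d_in/d_out are device buffers, h_in/h_out are pinned host buffers from cudaHostAlloc(), and process is an illustrative kernel:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Async copy requires pinned host memory
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
process<<<blocks, threads, 0, stream>>>(d_in, d_out);   // queued behind the copy
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // block the CPU until the queue drains
cudaStreamDestroy(stream);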
Summary
- CUDA events let you time your code on the GPU
- Pinned memory speeds up data transfers to/from the device
- CUDA streams allow you to queue up operations asynchronously
  - Lets you do different things in parallel on the GPU
  - Use of pinned memory is required