TRANSCRIPT
GPU Programming with CUDA – Optimisation
Mike Griffiths
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/
Hardware Model
[Diagram: the host side shows the CPU with its DRAM, I/O and the main program code; the device side shows the GPU with GDRAM, I/O, streaming multiprocessors (SMs) and shared memory. The two are connected by the PCIe bus, across which GPU kernel code and data are transferred.]
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code Branching
Data Transfer
• CPU (host) and GPU (device) have separate, dedicated memory
• All data read or written on the device must be copied across the PCIe bus
• This is a very expensive operation
• Optimisation technique: minimise data copies
• Keep resident data on the device
• You may have to move some computation to the GPU even if it is not computationally expensive
• It might be quicker to re-calculate data on the device than to copy it
Data Transfer Example
• Port the inexpensive routine to the device
• Minimise transfers by moving the copies out of the loop

Before:
Loop over timesteps
    inexpensive_routine_on_host(data_on_host)
    copy data from host to device
    expensive_routine_on_device(data_on_device)
    copy data from device to host
End loop over timesteps

After:
copy data from host to device
Loop over timesteps
    inexpensive_routine_on_device(data_on_device)
    expensive_routine_on_device(data_on_device)
End loop over timesteps
copy data from device to host
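In CUDA C the restructured loop might look like the following minimal sketch; the kernel names mirror the pseudocode above, and h_data, nsteps and the kernels themselves are assumed to exist:

int N = 1 << 20;                          /* problem size (illustrative) */
int threads = 256;
int blocks  = (N + threads - 1) / threads;

float *d_data;
cudaMalloc(&d_data, N * sizeof(float));

/* single host-to-device copy, moved out of the loop */
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

for (int step = 0; step < nsteps; step++) {
    /* both routines now run on the device: no per-step PCIe traffic */
    inexpensive_routine_on_device<<<blocks, threads>>>(d_data, N);
    expensive_routine_on_device<<<blocks, threads>>>(d_data, N);
}

/* single device-to-host copy after the loop */
cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);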
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code Branching
Exposing Parallelism
• GPU performance relies on the use of many threads
• The degree of parallelism must be much higher than on the CPU
• Ideally you need many more threads than cores
• Effort must be made to expose as much parallelism as possible
• This may require re-engineering your problem
• If significant sections of the code are serial then GPU acceleration will be limited: Amdahl's Law

With B the serial fraction of the code and N the number of processors:

\[ \text{Speedup}(N) = \frac{1}{B + \frac{1-B}{N}} \]
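As a quick worked illustration (the numbers are my own, not from the deck): with a serial fraction B = 0.05 and N = 1000 processors,

\[ \text{Speedup}(1000) = \frac{1}{0.05 + \frac{0.95}{1000}} \approx 19.6 \]

and however large N grows, the speedup can never exceed 1/B = 20.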
Memory Latency
• Access to GPU memory has several hundred cycles of latency
• When a thread is waiting for data it is stalled
• GPUs have very fast context switching
• Stalled threads can be switched with active threads
• Switching hides memory latency as long as other threads are performing compute
• This requires many threads, ideally performing large amounts of computation
• Optimisation technique: have lots of threads with high arithmetic intensity, defined as the ratio of arithmetic computation to memory accesses (a sketch contrasting the two follows)
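A minimal sketch of the contrast (the kernel names and the 64-iteration loop are my own choices, and each kernel assumes the grid exactly covers the array):

/* Low arithmetic intensity: one multiply per element loaded (memory-bound) */
__global__ void scale(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}

/* Higher arithmetic intensity: 64 multiply-adds per element loaded, giving
   the scheduler useful work with which to hide memory latency */
__global__ void poly64(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];
    float acc = 0.0f;
    for (int k = 0; k < 64; k++)
        acc = acc * x + 1.0f;   /* Horner-style evaluation */
    out[i] = acc;
}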
Exposing Parallelism Example

Original code:
Loop over i from 1 to 512
    Loop over j from 1 to 512
        independent iteration

1D decomposition (512 threads ✖):
Calc i from thread/block ID
Loop over j from 1 to 512
    independent iteration

2D decomposition (262,144 threads ✔):
Calc i & j from thread/block ID
independent iteration

512 threads are too few to keep the device busy; the 2D decomposition exposes far more parallelism.
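A minimal CUDA sketch of the 2D decomposition (the kernel body and launch configuration are assumptions for illustration; threadIdx.x is mapped to the fastest-varying index, anticipating the coalescing discussion below):

__global__ void iterate2d(float *data)
{
    /* both indices come from thread/block IDs: 512 x 512 = 262,144 threads */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* fastest-varying index */
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    data[i * 512 + j] *= 2.0f;                      /* independent iteration */
}

/* launch: 16x16 threads per block, 32x32 blocks covers 512x512 exactly */
dim3 threads(16, 16);
dim3 blocks(512 / 16, 512 / 16);
iterate2d<<<blocks, threads>>>(d_data);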
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code Branching
Memory Coalescing
• GPUs have high peak memory bandwidth
• Maximum bandwidth is achieved when data is accessed in large requests rather than many small requests
• Large requests must come from multiple threads
• Otherwise memory accesses are serialised, degrading performance
• Memory coalescing: consecutive threads accessing consecutive memory locations
• Optimisation technique: coalesced memory accesses reduce the number of requests and achieve higher bandwidth
Coalescing Example
• Consecutive threads are those with consecutive threadIdx.x values
• Question: do consecutive threads access consecutive memory locations?

index = blockIdx.x*blockDim.x + threadIdx.x;
output[index] = 2*input[index];

• Yes: consecutive threadIdx.x values give consecutive index values, and hence consecutive memory locations
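Wrapped up as a complete kernel (the wrapper and bounds check are additions of mine), the coalesced pattern is:

__global__ void double_elements(float *output, const float *input, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)                  /* guard against a partial final block */
        output[index] = 2 * input[index];
    /* threads 0,1,2,... of a warp touch input[0],input[1],input[2],...,
       so the warp's 32 loads coalesce into a few large requests */
}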
Coalescing Example 2
• Question: do consecutive threads access consecutive memory locations?

i = blockIdx.x*blockDim.x + threadIdx.x;
for (j=0; j<N; j++)
    output[i][j]=2*input[i][j];

• No: consecutive threadIdx.x corresponds to consecutive i values, which in a row-major C array are N elements apart in memory

j = blockIdx.x*blockDim.x + threadIdx.x;
for (i=0; i<N; i++)
    output[i][j]=2*input[i][j];

• Yes: consecutive threadIdx.x corresponds to consecutive j values, which are adjacent in memory
Memory Coalescing in 2D
• What about 2D or 3D decompositions?
• Exactly the same principle applies
• It is always the index derived from threadIdx.x that should address consecutive memory locations
• E.g. matrix addition:

int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;
c[i][j] = a[i][j] + b[i][j];
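A fuller sketch using flattened 1D arrays (the flattening, names and launch configuration are my own; the indexing is the same as above):

__global__ void matrix_add(float *c, const float *a, const float *b, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column: consecutive in memory */
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row */
    if (i < n && j < n)
        c[i * n + j] = a[i * n + j] + b[i * n + j]; /* threadIdx.x walks along a row */
}

dim3 threads(16, 16);
dim3 blocks((n + 15) / 16, (n + 15) / 16);
matrix_add<<<blocks, threads>>>(d_c, d_a, d_b, n);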
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code Branching
Code Branching
• On NVIDIA GPUs there are fewer instruction scheduling units than cores
• Threads are scheduled in groups of 32 (a warp)
• Threads within a warp execute the same instruction in lock-step
• Single Instruction Multiple Data (SIMD)
• CUDA C kernels are free to specify branches
• BUT if threads of a warp diverge, all threads in the warp have to follow every taken code path
• Optimisation technique: avoid intra-warp branching (divergence within a warp) wherever possible
Branching Example
• Suppose you want to split your threads into two groups

Divergent version: threads within each warp take different paths

i = blockIdx.x*blockDim.x + threadIdx.x;
if (i%2 == 0)
    …
else
    …

Warp-aligned version: all 32 threads of a warp take the same path

i = blockIdx.x*blockDim.x + threadIdx.x;
if ((i/32)%2 == 0)
    …
else
    …
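As a concrete (illustrative) version of the warp-aligned variant, with made-up branch bodies:

__global__ void branch_by_warp(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    /* i/32 is the warp number (assuming blockDim.x is a multiple of 32),
       so all 32 threads of a warp take the same branch: no divergence */
    if ((i / 32) % 2 == 0)
        data[i] *= 2.0f;   /* hypothetical work for even-numbered warps */
    else
        data[i] *= 0.5f;   /* hypothetical work for odd-numbered warps */
}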
CUDA Profiling
• Set the COMPUTE_PROFILE environment variable to 1
• A log file will be created at runtime, e.g. cuda_profile_0.log
• It contains timing information for kernels and data transfers
• It is possible to output more metrics (see doc/Compute_Profiler.txt)
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M1060
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6e2e9ee8858
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
...
Conclusions
• GPUs offer higher floating-point and memory bandwidth performance than CPUs
• A number of factors will inhibit execution performance
• A number of techniques can be applied to circumvent these
• Some techniques may require re-engineering your problem
• If your application can't be adapted, GPU performance will not be good!
• It is important to have an understanding of the application, the architecture and the programming model