Optimization on Kepler — Zehuan Wang, zehuan@nvidia.com
Fundamental Optimization
Optimization Overview

- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
GPU High Level View
[Diagram: the GPU consists of several streaming multiprocessors (SMs) sharing global memory; it is connected to the CPU through the chipset over PCIe.]
GK110 SM

- Control unit: 4 warp schedulers, 8 instruction dispatchers
- Execution units: 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, 32 LD/ST units
- Memory: registers (64K 32-bit); caches: L1 + shared memory (64 KB), texture, constant
GPU and Programming Model

Software maps to hardware as follows:

- Thread → CUDA core: threads are executed by CUDA cores
- Thread block → multiprocessor: thread blocks are executed on multiprocessors and do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
- Grid → device: a kernel is launched as a grid of thread blocks; up to 16 kernels can execute on a device at one time
Warp
A warp is 32 successive threads in a block.

- E.g. blockDim = 160: automatically divided into 5 warps by the GPU (warp 0: threads 0~31, ..., warp 4: threads 128~159)
- E.g. blockDim = 161: if blockDim is not a multiple of 32, the remaining threads occupy one more warp (warp 5 holds only thread 160)
Warp
SIMD: Single Instruction, Multiple Data. The threads in the same warp always execute the same instruction, and instructions are issued to the operation units warp by warp.

[Diagram: two warp schedulers, each issuing instructions from different warps in turn, e.g. warp 8 instruction 11, warp 2 instruction 42, warp 14 instruction 95, warp 8 instruction 12, ...]
Warp
Latency is caused by dependencies between neighboring instructions in the same warp. During the waiting time, instructions from other warps can be executed: context switching between warps is free, so a large number of warps can hide memory latency.
Kepler Memory Hierarchy
- Registers (spill to local memory)
- Caches: shared memory, L1 cache, L2 cache, constant cache, texture cache
- Global memory
Kepler/Fermi Memory Hierarchy

[Diagram: each SM has registers, L1 + shared memory, a texture cache, and a constant cache; all SMs share the L2 cache in front of global memory. Latency grows and bandwidth falls moving from registers toward global memory.]
Wrong View of Optimization

Trying every optimization method in the book: optimization is endless, and this approach is low-efficiency.
General Optimization Strategies: Measurement
Find out the limiting factor in kernel performance:

- Memory bandwidth bound (memory optimization)
- Instruction throughput bound (instruction optimization)
- Latency bound (configuration optimization)

Measure effective memory/instruction throughput: NVIDIA Visual Profiler.
[Flowchart: find the limiter; if memory bound, compare achieved GB/s to the effective value and apply memory optimization; if instruction bound, compare achieved inst/s to the effective value and apply instruction optimization; if latency bound, apply configuration optimization; repeat until resolved, then done.]
Optimization Overview

- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Memory Optimization
If the code is memory-bound and effective memory throughput is much lower than the peak:

Purpose: access only data that are absolutely necessary.

Major techniques:
- Improve access patterns to reduce wasted transactions
- Reduce redundant accesses: read-only cache, shared memory
Reduce Wasted Transactions
Memory accesses are per warp, and memory is accessed in discrete chunks. On Kepler, L1 is reserved for register spills and stack data only: global loads go directly to L2 (invalidating the line in L1), and on an L2 miss to DRAM. Memory is transported in 32 B segments (the same as for writes). If a warp cannot use all of the data in a segment, the rest of that memory transaction is wasted.
Reduce Wasted Transactions

Scenario: warp requests 32 aligned, consecutive 4-byte words.
- Addresses fall within 4 segments
- No replays
- Bus utilization: 100% (warp needs 128 bytes; 128 bytes move across the bus on a miss)
Reduce Wasted Transactions

Scenario: warp requests 32 aligned, permuted 4-byte words.
- Addresses fall within 4 segments
- No replays
- Bus utilization: 100% (warp needs 128 bytes; 128 bytes move across the bus on a miss)
Reduce Wasted Transactions

Scenario: warp requests 32 consecutive 4-byte words, offset from perfect alignment.
- Addresses fall within at most 5 segments
- 1 replay (2 transactions)
- Bus utilization: at least 80% (warp needs 128 bytes; at most 160 bytes move across the bus). Some misaligned patterns still fall within 4 segments, giving 100% utilization.
Reduce Wasted Transactions

Scenario: all threads in a warp request the same 4-byte word.
- Addresses fall within a single segment
- No replays
- Bus utilization: 12.5% (warp needs 4 bytes; 32 bytes move across the bus on a miss)
Reduce Wasted Transactions

Scenario: warp requests 32 scattered 4-byte words.
- Addresses fall within N segments
- (N-1) replays (N transactions); could be fewer if some segments can be combined into a single transaction
- Bus utilization: 128 / (N×32), up to 4× higher than with 128 B caching loads (warp needs 128 bytes; N×32 bytes move across the bus on a miss)
Read-Only Cache
An alternative to L1 when accessing DRAM. Also known as the texture cache: all texture accesses use this cache, and on compute capability 3.5 and higher it can also serve global memory reads. It should not be used if a kernel reads from and writes to the same addresses.

Compared to L1:
- Generally better for scattered reads
- Caches at 32 B granularity (L1, when it caches, operates at 128 B granularity)
- Does not require replays for multiple transactions (L1 does)
- Higher latency than L1 reads, and tends to increase register use
Read-Only Cache
Annotate eligible kernel pointer parameters with const and __restrict__; the compiler will automatically map such loads to the read-only data cache path.

```cuda
__global__ void saxpy(float x, float y,
                      const float* __restrict__ input,
                      float* output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use the read-only data cache for "input"
    output[offset] = (input[offset] * x) + y;
}
```
Shared Memory
Low latency (a few cycles) and high throughput.

Main uses:
- Inter-thread communication within a block
- User-managed cache to reduce redundant global memory accesses
- Avoiding non-coalesced access
Shared Memory Example: Matrix Multiplication
C = A × B. Every thread corresponds to one entry in C.

[Diagram: matrices A and B multiplied to produce C]
Naive Kernel
```cuda
__global__ void simpleMultiply(float* a, float* b, float* c, int N)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row * N + i] * b[i * N + col];
    }
    c[row * N + col] = sum;
}
```

Every thread corresponds to one entry in C.
Blocked Matrix Multiplication
C = A × B, with data reuse in the blocked version: each tile of A and B is loaded once and reused by the whole thread block.

[Diagram: tiled matrices A and B producing a tile of C]
Blocked and cached kernel:

```cuda
__global__ void coalescedMultiply(float* a, float* b, float* c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; k += TILE_DIM) {
        aTile[threadIdx.y][threadIdx.x] = a[row * N + threadIdx.x + k];
        bTile[threadIdx.y][threadIdx.x] = b[(threadIdx.y + k) * N + col];
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        __syncthreads();  // needed before the next tile overwrites shared memory
    }
    c[row * N + col] = sum;
}
```
Optimization Overview

- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Latency Optimization
When the code is latency bound: both memory and instruction throughput are far from the peak.

Latency hiding works by switching warps: a warp stalls when one of its operands isn't ready, and another warp is scheduled in the meantime.

Purpose: have enough warps to hide latency.

Major technique: increase the number of active warps.
Enough Blocks and the Right Block Size

- Number of blocks >> number of SMs; > 100 blocks to scale well to future devices
- Block size: minimum 64; 128 or 256 are generally good starting points, but use whatever is best for your app
- It depends on the problem: do experiments!
Occupancy & Active Warps
Occupancy: the ratio of active warps per SM to the maximum number of allowed warps (48 on Fermi, 64 on Kepler).

Occupancy needs to be high enough to hide latency, and it is limited by resource usage.
Dynamic Partitioning of SM Resources

- Shared memory is partitioned among blocks
- Registers are partitioned among threads: ≤ 255 per thread
- Thread block slots: ≤ 16
- Thread slots: ≤ 2048

Any of these can be the limiting factor on how many threads can run at the same time on an SM.
Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Occupancy Optimizations
- Know the current occupancy: Visual Profiler; --ptxas-options=-v outputs resource usage info (input to the Occupancy Calculator)
- Adjust resource usage to increase occupancy:
  - Change the block size
  - Limit register usage: compiler option -maxrregcount=n (per file), __launch_bounds__ (per kernel)
  - Dynamically allocate shared memory
Optimization Overview

- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Instruction Optimization
If you find the code is instruction bound: a compute-intensive algorithm can easily become memory-bound if you are not careful, so typically worry about instruction optimization after memory and execution-configuration optimizations.

Purpose: reduce the instruction count; use fewer instructions to get the same job done.

Major techniques:
- Use high-throughput instructions
- Reduce wasted instructions: branch divergence, etc.
Reduce Instruction Count
- Use float if precision allows: add the "f" suffix to floating-point literals (e.g. 1.0f), because the default is double
- Fast math functions: there are two types of runtime math library functions
  - func(): slower but higher accuracy (5 ulp or less)
  - __func(): faster but lower accuracy (see the programming guide for full details)
  - -use_fast_math: forces every func() to __func()
Control Flow
Divergent branches: threads within a single warp take different paths, and the different execution paths within the warp are serialized. Different warps can execute different code with no impact on performance.

Example with divergence (branch granularity < warp size):

    if (threadIdx.x > 2) {...} else {...}

Avoid diverging within a warp. Example without divergence (branch granularity is a whole multiple of warp size):

    if (threadIdx.x / WARP_SIZE > 2) {...} else {...}
Kernel Optimization Workflow
[Flowchart: find the limiter; if memory bound, compare achieved GB/s to peak and apply memory optimization; if instruction bound, compare achieved inst/s to peak and apply instruction optimization; if latency bound, apply configuration optimization; then done.]
Optimization Overview

- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Minimizing CPU-GPU Data Transfer

Host<->device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs. 250 GB/s and 3.95 Tinst/s (GK110).

Minimize transfers:
- Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
- Sometimes it is even better to recompute on the GPU

Group transfers:
- One large transfer is much better than many small ones
- Overlap memory transfers with computation
Streams and Async API
Default API:
- Kernel launches are asynchronous with the CPU
- Memcopies (D2H, H2D) block the CPU thread
- CUDA calls are serialized by the driver

Streams and async functions provide:
- Memcopies (D2H, H2D) asynchronous with the CPU
- Ability to concurrently execute a kernel and a memcopy
- Concurrent kernels (since Fermi)

A stream is a sequence of operations that execute in issue order on the GPU. Operations from different streams can be interleaved; a kernel and a memcopy from different streams can be overlapped.
Pinned (non-pageable) memory
Pinned memory enables memcopies that are asynchronous with both CPU and GPU.

Usage: cudaHostAlloc / cudaFreeHost instead of malloc / free.

Note: pinned memory is essentially removed from virtual memory paging, and cudaHostAlloc is typically very expensive.
Overlap kernel and memory copy
Requirements:
- D2H or H2D memcopy from pinned memory
- Device with compute capability ≥ 1.1 (G84 and later)
- Kernel and memcopy in different, non-0 streams

```cuda
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);  // potentially
kernel<<<grid, block, 0, stream2>>>(...);       // overlapped
```