![Page 1: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/1.jpg)
CUDA OPTIMIZATION WITH
NVIDIA NSIGHT™ VISUAL STUDIO EDITION
Julien Demouth, NVIDIA
![Page 2: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/2.jpg)
WHAT WILL YOU LEARN?
An iterative method to optimize your GPU code
A way to conduct that method with NVIDIA Nsight VSE
https://github.com/jdemouth/nsight-gtc2014
![Page 3: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/3.jpg)
A WORD ABOUT THE APPLICATION
Grayscale
Blur
Edges
![Page 4: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/4.jpg)
A WORD ABOUT THE APPLICATION
Grayscale Conversion
// r, g, b: Red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
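As a CPU reference for the grayscale step, the pseudocode can be sketched in plain C++. The function name `rgb_to_gray`, the interleaved RGB byte layout, and the rounding to nearest are assumptions made for this sketch; the slide only gives the three weights.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert interleaved 8-bit RGB pixels to 8-bit grayscale with the slide's weights.
// The interleaved layout and the round-to-nearest step are assumptions of this sketch.
std::vector<std::uint8_t> rgb_to_gray(const std::vector<std::uint8_t>& rgb) {
    assert(rgb.size() % 3 == 0);
    std::vector<std::uint8_t> gray(rgb.size() / 3);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        const float r = rgb[3 * i + 0];
        const float g = rgb[3 * i + 1];
        const float b = rgb[3 * i + 2];
        const float p = 0.298839f * r + 0.586811f * g + 0.114350f * b;
        gray[i] = static_cast<std::uint8_t>(p + 0.5f);  // round to nearest
    }
    return gray;
}
```

The three weights sum to 1, so a white pixel stays at 255 and a black pixel stays at 0.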
![Page 5: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/5.jpg)
A WORD ABOUT THE APPLICATION
Blur: 7x7 Gaussian Filter
foreach pixel p:
    p = weighted sum of p and its 48 neighbors
Weights:
 1  2  3  4  3  2  1
 2  4  6  8  6  4  2
 3  6  9 12  9  6  3
 4  8 12 16 12  8  4
 3  6  9 12  9  6  3
 2  4  6  8  6  4  2
 1  2  3  4  3  2  1
Image from Wikipedia
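The 49 weights on this slide range from 1 at the corners to 16 at the center, which matches the outer product of the 1D vector {1, 2, 3, 4, 3, 2, 1} with itself. A minimal sketch that rebuilds the table this way is shown below; the helper names are hypothetical, and dividing the weighted sum by the total weight (256) is an assumption, since the slide only says "weighted sum".

```cpp
#include <array>
#include <cassert>

// Build the 7x7 weight table as the outer product of {1,2,3,4,3,2,1} with itself.
std::array<std::array<int, 7>, 7> gaussian_weights_7x7() {
    const std::array<int, 7> w1d = {1, 2, 3, 4, 3, 2, 1};
    std::array<std::array<int, 7>, 7> w{};
    for (int y = 0; y < 7; ++y)
        for (int x = 0; x < 7; ++x)
            w[y][x] = w1d[y] * w1d[x];
    return w;
}

// Sum of all weights; a normalized blur would divide by this value (assumption).
int weight_sum(const std::array<std::array<int, 7>, 7>& w) {
    int sum = 0;
    for (const auto& row : w)
        for (int v : row) sum += v;
    return sum;
}
```

The total is 16 x 16 = 256, a power of two, so an integer implementation could normalize with a shift.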
![Page 6: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/6.jpg)
A WORD ABOUT THE APPLICATION
Edges: 3x3 Sobel Filters
foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)
Weights for Gx:
-1 0 1
-2 0 2
-1 0 1
Weights for Gy:
1 2 1
0 0 0
-1 -2 -1
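A per-pixel CPU sketch of the Sobel step follows. It uses the usual gradient magnitude, sqrt(Gx*Gx + Gy*Gy); the function name and the choice to pass the 3x3 neighborhood as a small array are assumptions for illustration, not the companion code's interface.

```cpp
#include <cassert>
#include <cmath>

// Sobel response for one pixel, given its 3x3 neighborhood n[row][col] (center at n[1][1]).
float sobel_magnitude(const float n[3][3]) {
    static const int wx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int wy[3][3] = {{1, 2, 1}, {0, 0, 0}, {-1, -2, -1}};
    float gx = 0.0f, gy = 0.0f;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) {
            gx += wx[r][c] * n[r][c];  // horizontal gradient
            gy += wy[r][c] * n[r][c];  // vertical gradient
        }
    return std::sqrt(gx * gx + gy * gy);
}
```

A flat neighborhood gives 0, and a vertical step edge gives a response from Gx only.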
![Page 7: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/7.jpg)
OPTIMIZATION METHOD
Trace the Application
Identify the Hot Spot and Profile it
Identify the Performance Limiter
— Memory Bandwidth
— Instruction Throughput
— Latency
Optimize the Code
Iterate
We focus on the Assess and Optimize steps of the APOD method. We do not cover the Parallelize and Deploy steps. See http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy
![Page 8: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/8.jpg)
ENVIRONMENT
NVIDIA Tesla K20c (GK110, SM3.5) without ECC
Microsoft Windows 7 x64
Microsoft Visual Studio 2012
NVIDIA CUDA 6.0
NVIDIA Nsight Visual Studio Edition 4.0
![Page 9: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/9.jpg)
BEFORE WE START
Some slides are for background
Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013
http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
CUDA Best Practices Guide
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
Chameleon from http://www.vectorportal.com, Creative Commons
![Page 10: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/10.jpg)
BEFORE WE START
Instructions are executed by warps of threads
— It is a hardware concept
— There are 32 threads per warp
![Page 11: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/11.jpg)
ITERATION 1
![Page 12: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/12.jpg)
TRACE THE APPLICATION
Select Trace Application
Activate CUDA
Launch
Verify Parameters
![Page 13: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/13.jpg)
TIMELINE
![Page 14: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/14.jpg)
CUDA LAUNCH SUMMARY
The hotspot is gaussian_filter_7x7_v0

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |

![Page 15: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/15.jpg)
PROFILE THE HOTSPOT
Select Profile CUDA Application
Select the Kernel
Select the Experiments (All)
Launch
![Page 16: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/16.jpg)
IDENTIFY THE MAIN LIMITER
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
![Page 17: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/17.jpg)
MEMORY BANDWIDTH
[Diagram: each SM has its own registers and SMEM/L1$; all SMs share the L2$, which connects to global memory (framebuffer) at 208GB/s on K20]
![Page 18: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/18.jpg)
MEMORY BANDWIDTH
Utilization of L2$ bandwidth (BW) is limited and DRAM BW utilization is < 2%
Not limited by memory bandwidth
![Page 19: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/19.jpg)
INSTRUCTION THROUGHPUT
Each SM has 4 schedulers (Kepler)
Schedulers issue instructions to pipes
A scheduler issues up to 2 instructions/cycle, but the sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8)
A scheduler issues inst. from a single warp
Cannot issue to a pipe if its issue slot is full
[Diagram: an SM with registers, SMEM/L1$, and four schedulers, each issuing to a group of pipes]
![Page 20: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/20.jpg)
INSTRUCTION THROUGHPUT
[Figure: three example profiles showing scheduler utilization and the utilization of the four pipe groups (Load/Store, Texture, Control Flow, ALU)]
Schedulers saturated: utilization 90%; pipe values shown: 11%, 8%, 65%, 6%
Schedulers and pipe saturated: utilization 92%; pipe values shown: 4%, 27%, 90%
Pipe saturated: utilization 64%; pipe values shown: 78%, 24%, 4%
![Page 21: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/21.jpg)
WARP ISSUE EFFICIENCY
Percentage of issue slots used (blue)
Aggregated over all the schedulers
![Page 22: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/22.jpg)
PIPE UTILIZATION
Percentages of issue slots used per pipe
Accounts for pipe throughputs
Four groups of pipes:
— Load/Store
— Texture
— Control Flow
— Arithmetic (ALU)
![Page 23: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/23.jpg)
INSTRUCTION THROUGHPUT
Neither schedulers nor pipes are saturated
Not limited by the instruction throughput
![Page 24: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/24.jpg)
LATENCY
GPUs cover latencies by having a lot of work in flight
[Diagram: timeline of warps 0 to 9, marking when each warp issues and when it waits (latency). Latency is fully covered while another warp can issue; it is exposed when no warp issues]
![Page 25: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/25.jpg)
LATENCY: LACK OF OCCUPANCY
Not enough active warps
The schedulers cannot find eligible warps at every cycle
[Diagram: timeline of warps 0 to 3, with gaps where no warp issues]
![Page 26: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/26.jpg)
LATENCY
50% of theoretical occupancy
31.2 active warps per cycle
3.57 warps eligible per cycle (just under one per scheduler, with 4 schedulers per SM)
Is the kernel limited by latency? Hard to tell. Let’s start with occupancy
![Page 27: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/27.jpg)
OCCUPANCY
Each SM has limited resources
64K Registers (32 bit) shared by threads
Up to 48KB of Shared memory
16 slots to execute Blocks
Full occupancy: 2048 threads per SM (64 warps)
Values for SM30/SM35. They vary with Compute Capability
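The way these limits combine into a theoretical occupancy can be sketched as a small calculation. This is a simplified model, not Nsight's calculator: it ignores register and shared memory allocation granularity, and the struct and function names, as well as the register count used in the example, are assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits quoted on the slide for SM30/SM35.
struct SmLimits {
    int registers = 64 * 1024;     // 32-bit registers
    int shared_bytes = 48 * 1024;  // shared memory
    int block_slots = 16;          // resident blocks
    int max_threads = 2048;        // resident threads (64 warps)
};

// Resident blocks per SM: the tightest of the four limits.
// Simplified: ignores register and shared memory allocation granularity.
int blocks_per_sm(int threads_per_block, int regs_per_thread, int smem_per_block,
                  const SmLimits& sm = SmLimits{}) {
    assert(threads_per_block > 0 && regs_per_thread > 0 && smem_per_block >= 0);
    int blocks = std::min(sm.block_slots, sm.max_threads / threads_per_block);
    blocks = std::min(blocks, sm.registers / (regs_per_thread * threads_per_block));
    if (smem_per_block > 0) blocks = std::min(blocks, sm.shared_bytes / smem_per_block);
    return blocks;
}

// Theoretical occupancy in percent: resident warps over the 64-warp maximum.
int occupancy_percent(int threads_per_block, int blocks) {
    const int warps = blocks * ((threads_per_block + 31) / 32);
    return 100 * warps / 64;
}
```

For example, with 64 threads per block and a hypothetical 32 registers per thread, `blocks_per_sm(64, 32, 0)` is capped at 16 by the block slots; 16 blocks of 2 warps are 32 warps, so `occupancy_percent(64, 16)` is 50.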
![Page 28: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/28.jpg)
OCCUPANCY (BLOCK DIMENSION)
Limited by the number of blocks
Blocks are too small (64 threads/block)
![Page 29: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/29.jpg)
OCCUPANCY (BLOCK DIMENSION)
Increase Block Size for (slightly) better occupancy
![Page 30: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/30.jpg)
OCCUPANCY (BLOCK DIMENSION)
Increase the block size to 128 threads (8x16)
It runs slightly faster: 6.074ms

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.263ms | |
| Larger blocks | 6.074ms | 1.03x |

![Page 31: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/31.jpg)
ITERATION 2
![Page 32: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/32.jpg)
TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v0

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |

![Page 33: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/33.jpg)
IDENTIFY THE MAIN LIMITER
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
![Page 34: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/34.jpg)
MEMORY BANDWIDTH
Utilization of L2$ BW is limited and DRAM BW utilization is < 2%
Not limited by memory bandwidth
![Page 35: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/35.jpg)
INSTRUCTION THROUGHPUT
Neither schedulers nor pipes are saturated
Not limited by the instruction throughput
![Page 36: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/36.jpg)
LATENCY (OCCUPANCY)
Limited occupancy (56%) but 4.62 eligible warps/cycle (> 4)
There is probably something else limiting our performance
![Page 37: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/37.jpg)
MEMORY TRANSACTIONS
Warp of threads (32 threads)
L1 transaction: 128B – Alignment: 128B (0, 128, 256, …)
L2 transaction: 32B – Alignment: 32B (0, 32, 64, 96, …)
![Page 38: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/38.jpg)
MEMORY TRANSACTIONS
A warp issues 32x4B aligned and consecutive loads/stores
Threads read different elements of the same 128B segment
1x 128B L1 transaction per warp: 128B needed / 128B transferred
4x 32B L2 transactions per warp: 128B needed / 128B transferred
![Page 39: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/39.jpg)
MEMORY TRANSACTIONS
Threads in a warp read/write 4B words, 128B between words
Each thread reads the first 4B of a 128B segment
32x 128B L1 transactions (1 per thread): 128B needed / 32x 128B transferred
32x 32B L2 transactions (1 per thread): 128B needed / 32x 32B transferred
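The transaction counts in the two access patterns above can be reproduced by counting the distinct aligned segments a warp touches. The sketch below is a simplified model (it assumes no single access straddles a segment boundary), and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Count the distinct aligned segments of `segment_bytes` touched by a warp's accesses.
// Each access is assumed not to straddle a segment boundary.
int count_transactions(const std::vector<std::uint64_t>& addresses, std::uint64_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addresses) segments.insert(a / segment_bytes);
    return static_cast<int>(segments.size());
}

// Byte addresses of 32 threads accessing 4B words `stride_bytes` apart, starting at `base`.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base, std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addresses;
    for (std::uint64_t lane = 0; lane < 32; ++lane) addresses.push_back(base + lane * stride_bytes);
    return addresses;
}
```

With a 4B stride from an aligned base, the warp needs 1 segment of 128B or 4 segments of 32B. With a 128B stride it needs 32 segments at either size. Shifting the consecutive case to start at byte 64 makes it span 2 segments of 128B, which shows why alignment matters as well.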
![Page 40: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/40.jpg)
TRANSACTIONS AND REPLAYS
A warp reads from addresses spanning 3 lines of 128B
1 instr. executed and 2 replays = 1 request and 3 transactions
[Diagram: threads 0-7 and 24-31 read from one line, threads 8-15 from a second, threads 16-23 from a third. Timeline: instruction issued (threads 0-7/24-31), instruction re-issued as 1st replay (threads 8-15), instruction re-issued as 2nd replay (threads 16-23)]
![Page 41: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/41.jpg)
TRANSACTIONS AND REPLAYS
With replays, requests take more time and use more resources
— More instructions issued
— More memory traffic
— Increased execution time
[Diagram: execution timeline of instructions 0, 1 and 2. Each is issued, split across three thread groups (0-7/24-31, 8-15, 16-23), and completes after its data is transferred. The replays cause extra latency, extra work on the SM, and extra memory traffic]
![Page 42: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/42.jpg)
TRANSACTIONS PER REQUEST
Transactions per Request: 4.20 (Load) / 4.00 (Store)
Too many memory transactions (too much pressure on LSU)
![Page 43: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/43.jpg)
TRANSACTIONS PER REQUEST
Our blocks are 8x16
We should use blocks of size 32x4 (or 32x8)
[Diagram: thread indices laid out along threadIdx.x. With 8 threads per row, a warp of 32 threads spans 4 rows of 8 pixels; with 32 threads per row, each of warps 0 to 3 covers a single row of 32 consecutive pixels]
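The effect of the block shape on transactions per request can be sketched by mapping each lane of a warp to the pixel it touches and counting the 128B lines involved. This is a simplified model under stated assumptions: one byte per pixel, a row pitch equal to the image width (2560 is the width quoted later in the deck), a block placed at the image origin, and only the thread's own pixel considered. The function name is hypothetical.

```cpp
#include <cassert>
#include <set>

// Number of distinct 128B lines touched when each thread of one warp accesses its own
// 1-byte pixel. Threads are linearized as threadIdx.y * blockDim.x + threadIdx.x, and
// consecutive groups of 32 linearized threads form a warp.
int lines_per_warp(int block_dim_x, int image_width, int warp = 0) {
    assert(block_dim_x > 0 && image_width > 0 && warp >= 0);
    std::set<long long> lines;
    for (int lane = 0; lane < 32; ++lane) {
        const int tid = warp * 32 + lane;
        const int x = tid % block_dim_x;  // threadIdx.x
        const int y = tid / block_dim_x;  // threadIdx.y
        const long long address = static_cast<long long>(y) * image_width + x;
        lines.insert(address / 128);
    }
    return static_cast<int>(lines.size());
}
```

With 8 threads along x, a warp covers 4 image rows and therefore 4 lines, in line with the roughly 4 transactions per request measured above. With 32 threads along x, the warp stays in one row and touches a single line.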
![Page 44: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/44.jpg)
IMPROVED MEMORY ACCESSES
Improved memory accesses: Blocks of size 32x4
It runs faster: 3.605ms

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |

![Page 45: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/45.jpg)
ITERATION 3
![Page 46: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/46.jpg)
TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v0

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |

We use the same block size for the Sobel filter kernel. That’s the reason why it also improves (2nd row of the Nsight table).
![Page 47: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/47.jpg)
MEMORY BANDWIDTH
Utilization of L2$ BW low and DRAM BW < 2%
Not limited by memory bandwidth
![Page 48: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/48.jpg)
INSTRUCTION THROUGHPUT
Neither schedulers nor pipes are saturated
Not limited by instruction throughput
![Page 49: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/49.jpg)
LATENCY (OCCUPANCY)
Limited occupancy and not enough eligible warps/cycle (2.0)
Need more active warps (i.e. occupancy)
![Page 50: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/50.jpg)
LATENCY (OCCUPANCY)
Limited by register usage
![Page 51: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/51.jpg)
REDUCE REGISTER USAGE
Use the __launch_bounds__ attribute
__global__ __launch_bounds__(128, 10) void gaussian_filter_7x7_v1(int w, int h, const uchar *src, uchar *dst)
First argument (128): number of threads per block
Second argument (10): minimum number of blocks per SM
A minimum of 10 blocks gives the best results (48 registers): 1.949ms

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |
| Fewer registers | 1.949ms | 3.21x |

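The 48-register figure is consistent with simple arithmetic on the register file: 10 blocks of 128 threads must share 64K registers. The sketch below makes that estimate explicit. The rounding down to a multiple of 8 registers per thread is an assumption about allocation granularity on this architecture, and the function name is hypothetical; the compiler's actual choice may differ.

```cpp
#include <cassert>

// Upper bound on registers per thread so that `min_blocks` blocks of
// `max_threads_per_block` threads fit in the SM register file.
// `granularity` models per-thread allocation in multiples of 8 registers (an assumption).
int register_budget(int max_threads_per_block, int min_blocks,
                    int registers_per_sm = 64 * 1024, int granularity = 8) {
    assert(max_threads_per_block > 0 && min_blocks > 0 && granularity > 0);
    const int raw = registers_per_sm / (max_threads_per_block * min_blocks);
    return raw / granularity * granularity;  // round down to the allocation granularity
}
```

Here `register_budget(128, 10)` computes 65536 / 1280 = 51, rounded down to 48. Asking for more blocks per SM lowers the budget further, which can force the compiler to spill registers, so the best value is found by measurement.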
![Page 52: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/52.jpg)
ITERATION 4
![Page 53: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/53.jpg)
TRACE THE APPLICATION
The hotspot is gaussian_filter_7x7_v1

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |
| Fewer registers | 1.949ms | 3.21x |

![Page 54: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/54.jpg)
MEMORY BANDWIDTH
Utilization of L2$ BW low and DRAM BW < 2%
Not limited by memory bandwidth
![Page 55: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/55.jpg)
INSTRUCTION THROUGHPUT
Neither schedulers nor pipes are saturated
Not limited by instruction throughput
![Page 56: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/56.jpg)
LATENCY (OCCUPANCY)
Enough active and eligible warps per cycle
Not limited by a lack of occupancy
![Page 57: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/57.jpg)
BRANCH DIVERGENCE
Threads of a warp take different branches of a conditional
if( threadIdx.x < 12 ) {}
else {}
[Diagram: over time, some threads execute the “if” branch, then the other threads execute the “else” branch]
Execution time = “if” branch + “else” branch
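The cost rule above can be written as a small model: a warp pays for every branch that at least one of its threads takes. The sketch assumes a 1D block, so the 32 lanes of a warp have consecutive threadIdx.x values; the function name and the cost units are hypothetical.

```cpp
#include <cassert>

// Time a 32-thread warp spends on `if (threadIdx.x < threshold) {...} else {...}`
// in a simplified model: each branch taken by at least one thread is executed in full.
// `first_thread_x` is the threadIdx.x of lane 0 (1D block assumed).
int warp_branch_cost(int first_thread_x, int threshold, int if_cost, int else_cost) {
    bool any_if = false, any_else = false;
    for (int lane = 0; lane < 32; ++lane) {
        if (first_thread_x + lane < threshold) any_if = true;
        else any_else = true;
    }
    return (any_if ? if_cost : 0) + (any_else ? else_cost : 0);
}
```

With the condition from the slide, the first warp (threads 0 to 31) pays for both branches, while the second warp (threads 32 to 63) only takes the "else" branch and does not diverge.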
![Page 58: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/58.jpg)
BRANCH DIVERGENCE
But no divergence in our code
![Page 59: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/59.jpg)
OTHER IDEAS
Shared memory bank conflicts: two threads read different addresses from the same bank
Excessive usage of synchronizations (__syncthreads)
But those symptoms do not affect our case
![Page 60: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/60.jpg)
MEMORY TRANSFERS
Our image size: 2560x1600 pixels = 4MB (1 byte per pixel)
We read 385MB from L2$: Too much traffic!
![Page 61: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/61.jpg)
MEMORY TRANSFERS
We do not saturate the issue slot of the Load-Store unit
But we saturate inside the Load-Store unit
Unfortunately, we cannot detect that with Nsight (yet)
![Page 62: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/62.jpg)
SHARED MEMORY
Adjacent pixels have neighbors in common
We should use shared memory to store those common pixels
__shared__ unsigned char smem_pixels[10][64];
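How much reuse a shared memory tile captures can be estimated by comparing the pixels a block fetches with and without staging. The sketch below assumes the 32x4 block chosen earlier and the 3-pixel halo implied by a 7x7 filter; the function names are hypothetical, and the counts are pixel fetches, not memory transactions.

```cpp
#include <cassert>

// Pixels a block fetches when every thread loads its own full filter window.
int loads_without_tile(int block_w, int block_h, int filter) {
    return block_w * block_h * filter * filter;
}

// Pixels a block fetches when it stages the block plus its halo once in shared memory.
int loads_with_tile(int block_w, int block_h, int filter) {
    const int halo = filter / 2;  // 3 pixels on each side for a 7x7 filter
    return (block_w + 2 * halo) * (block_h + 2 * halo);
}
```

For a 32x4 block, the direct approach fetches 128 x 49 = 6272 pixels, while the tile needs 38 x 10 = 380. The 10 rows match the first dimension of `smem_pixels`; the second dimension (64) is larger than the 38 columns needed, presumably for padding.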
![Page 63: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/63.jpg)
USE SHARED MEMORY
Use shared memory to keep data on the SM: 1.211ms

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |
| Fewer registers | 1.949ms | 3.21x |
| Shared memory | 1.211ms | 5.17x |

![Page 64: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/64.jpg)
ITERATION 5
![Page 65: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/65.jpg)
TRACE THE APPLICATION
The hotspot is gaussian_filter_7x7_v2

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |
| Fewer registers | 1.949ms | 3.21x |
| Shared memory | 1.211ms | 5.17x |

![Page 66: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/66.jpg)
MEMORY BANDWIDTH
Utilization of L2$ BW low and DRAM BW < 4%
Not limited by memory bandwidth
![Page 67: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/67.jpg)
INSTRUCTION THROUGHPUT
Not saturating the schedulers
But we use 73% of the Load-Store issue slot
![Page 68: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/68.jpg)
LOAD-STORE INSTRUCTIONS
LSU executes global and shared memory instructions
Change global loads to use the Read-Only path
![Page 69: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/69.jpg)
READ-ONLY CACHE (TEXTURE UNITS)
[Diagram: the memory hierarchy (registers and SMEM/L1$ per SM, shared L2$, global memory) with texture units added to each SM. Loads through the texture units skip the LSU and are cached]
![Page 70: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/70.jpg)
READ-ONLY PATH
Annotate our pointer with const __restrict
__global__ void gaussian_filter_7x7_v3(int w, int h, const uchar *__restrict src, uchar *dst)
The compiler generates LDG instructions: 1.019ms

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |
| Fewer registers | 1.949ms | 3.21x |
| Shared memory | 1.211ms | 5.17x |
| Read-Only path | 1.019ms | 6.15x |

![Page 71: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/71.jpg)
INSTRUCTION THROUGHPUT
We are doing much better
Things to investigate next:
— Improve memory efficiency
— Reduce computational intensity (separable filter)
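The separable-filter idea relies on the 7x7 weights being the outer product of {1, 2, 3, 4, 3, 2, 1} with itself: a horizontal 1D pass followed by a vertical 1D pass gives the same weighted sum with 7 + 7 multiply-adds per pixel instead of 49. The CPU sketch below checks that equivalence on unnormalized integer sums. The names are hypothetical, and leaving border pixels at 0 is a simplification of this sketch, not the companion code's border handling.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<int>>;  // image[y][x]

static const std::array<int, 7> kW = {1, 2, 3, 4, 3, 2, 1};

// Direct 7x7 weighted sum at (x, y): 49 multiply-adds. Caller keeps a 3-pixel margin.
int blur_2d(const Image& img, int x, int y) {
    int sum = 0;
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -3; dx <= 3; ++dx)
            sum += kW[dy + 3] * kW[dx + 3] * img[y + dy][x + dx];
    return sum;
}

// One 1D pass along x (horizontal) or y: 7 multiply-adds per pixel.
// Pixels without a full 7-tap window along the pass direction are left at 0.
Image pass_1d(const Image& img, bool horizontal) {
    const int h = static_cast<int>(img.size());
    const int w = h ? static_cast<int>(img[0].size()) : 0;
    Image out(h, std::vector<int>(w, 0));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            const int pos = horizontal ? x : y;
            const int len = horizontal ? w : h;
            if (pos < 3 || pos + 3 >= len) continue;
            int sum = 0;
            for (int d = -3; d <= 3; ++d)
                sum += kW[d + 3] * (horizontal ? img[y][x + d] : img[y + d][x]);
            out[y][x] = sum;
        }
    return out;
}

// Separable blur: horizontal pass, then vertical pass on the intermediate image.
Image blur_separable(const Image& img) { return pass_1d(pass_1d(img, true), false); }
```

For every pixel at least 3 pixels away from each border, `blur_separable(img)[y][x]` equals `blur_2d(img, x, y)`. The price is an intermediate image between the two passes, which a GPU version would typically keep in shared memory or write out in a separate kernel.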
![Page 72: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/72.jpg)
MORE IN OUR COMPANION CODE

| Kernel | Time | Speedup |
|---|---|---|
| Original version | 6.265ms | |
| Larger blocks | 6.075ms | 1.03x |
| Better memory accesses | 3.605ms | 1.74x |
| Fewer registers | 1.949ms | 3.21x |
| Shared memory | 1.211ms | 5.17x |
| Read-Only path | 1.019ms | 6.15x |
| Separable filter | 0.656ms | 9.55x |
| Process two pixels per thread (improve memory efficiency + add ILP) | 0.511ms | 12.26x |
| Use 64-bit shared memory (remove bank conflicts) | 0.499ms | 12.56x |
| Use float instead of int (increase instruction throughput) | 0.434ms | 14.44x |

Your next idea!!!
https://github.com/jdemouth/nsight-gtc2014
![Page 73: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/73.jpg)
CONCLUSION
![Page 74: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides](https://reader034.vdocuments.us/reader034/viewer/2022051408/5ff74394440a7c78154d1554/html5/thumbnails/74.jpg)
OPTIMIZATION METHOD
Trace the Application
Identify the Hot Spot and Profile it
Identify the Performance Limiter
— Memory Bandwidth
— Instruction Throughput
— Latency
Optimize the Code
Iterate
https://github.com/jdemouth/nsight-gtc2014