moving to vulkan - khronos group · nvidia maxwell 2 register file core load store unit. 210...
TRANSCRIPT
![Page 1: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/1.jpg)
Chris Hebert, Dev Tech Software Engineer, Professional Visualization
Moving To Vulkan Asynchronous Compute
![Page 2: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/2.jpg)
206
Who am I?
Dev Tech Software Engineer- Pro Vis
20 years in the industry
Joined NVIDIA in March 2015.
Real time graphics makes me happy
I also like helicopters
Chris Hebert@chrisjhebert
Chris Hebert - Circa 1974
![Page 3: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/3.jpg)
207
NVIDIA/KHRONOS CONFIDENTIAL
Agenda
• Some Context
• Sharing The Load
• Pipeline Barriers
![Page 4: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/4.jpg)
208
NVIDIA/KHRONOS CONFIDENTIAL
Some Context
![Page 5: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/5.jpg)
209
GPU ArchitectureIn a nutshell
NVIDIA Maxwell 2Register File
Core
Load Store Unit
![Page 6: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/6.jpg)
210
Execution ModelThread Hierarchies
32 threads
32 threads
32 threads
32 threads
Logical View HW View
Work Group Warps
SMM
![Page 7: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/7.jpg)
211
Resource PartitioningResources Are Limited
Key resources impacting local execution:
• Program Counters
• Registers
• Shared Memory
![Page 8: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/8.jpg)
212
Resource PartitioningResources Are Limited
Key resources impacting local execution:
• Program Counters
• Registers
• Shared Memory
Partitioned amongst threads
Partitioned amongst work groups
![Page 9: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/9.jpg)
213
Resource PartitioningResources Are Limited
Key resources impacting local execution:
• Program Counters
• Registers
• Shared Memory
Partitioned amongst threads
Partitioned amongst work groups
e.g. GTX 980 ti64k 32bit registers per SM96kb shared memory per SM
![Page 10: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/10.jpg)
214
Resource PartitioningRegisters
The more registers used by a kernel means few resident warps on the SM
Fewer Registers More Registers
More Threads Fewer Threads
![Page 11: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/11.jpg)
215
Resource PartitioningShared Memory
The more shared memory used by a work group means fewer work groups on the SM
Less SMEM More SMEM
More Groups Fewer Groups
![Page 12: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/12.jpg)
216
Keeping It MovingOccupancy
• Some small kernels may have low occupancy
• Depending on the algorithm
• Compute resources are limited
• Shared across threads or work groups on a per SM basis
• Warps stall when they have to wait for resources
• This latency can be hidden
• If there are other warps ready to execute.
![Page 13: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/13.jpg)
217
Keeping It MovingOccupancy – Simple Theoretical Example
• Simple kernel that updates positions of 20480 particles
• 1 FMAD - ~20 cycles (instruction latency)
• 20480 particles = 640 warps
• To hide this latency, according to Littles Law
• Required Warps = Latency x Throughput
• Throughput should be 32 threads * 16 sms = 512 to keep GPU busy
• Required warps is 20*512 = 10240
• ….oh….
![Page 14: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/14.jpg)
218
Keeping It MovingOccupancy – Simple Theoretical Example
• Simple kernel that updates positions of 20480 particles
• 1 FMAD - ~20 cycles (instruction latency)
• 20480 particles = 640 warps
• To hide this latency, according to Littles Law – But only on 1 SM..
• Required Warps = Latency x Throughput
• Throughput should be 32 threads * 1 sm = 32 to keep GPU busy
• Required warps is 20*32 = 640
• And we theoretically have 15 SMs to use for other stuff.
![Page 15: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/15.jpg)
219
Queuing It UpWorking with 1 Queue
KernelKernelKernel
Transfers
Command Queue
Command Buffer
Command Buffer
Command Buffer
Command Buffer
Command Buffer
• Scheduler will distribute work across all SMs
• kernels execute in sequence
(there may be some overlap)
• Low occupancy kernels will waste GPU time
![Page 16: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/16.jpg)
220
NVIDIA/KHRONOS CONFIDENTIAL
Sharing The Load
![Page 17: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/17.jpg)
221
Queuing It UpWorking with N Queues
KernelKernelKernelCommand Queue #1
Command Buffer
Command Buffer
Command Buffer
Command Buffer
• NVIDIA hardware gives you 16 all powerful queues
• 1 Queue family that supports all operations
• 16 queues available for use
KernelKernelKernelCommand Queue #2
KernelKernelKernelCommand Queue #3
![Page 18: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/18.jpg)
222
Queuing It UpWorking with N Queues
KernelKernelKernelCommand Queue #1
Command Buffer
Command Buffer
Command Buffer
Command Buffer
• Application decides which queues for which kernels
• Load balance for best performance
• Profile (Nsight) to gain insights
KernelKernelKernelCommand Queue #2
KernelKernelKernelCommand Queue #3
![Page 19: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/19.jpg)
223
Queuing It UpCompute and Graphics In Harmony
• Some hardware can even run compute and graphics work concurrently
• Needs fast context switching and at high granularity (not just at draw commands)
• Simple Graphics work tends to have high occupancy
• Complex graphics work can reduce occupancy
• Profile for performance insights
![Page 20: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/20.jpg)
224
Queuing It UpCompute and Graphics In Harmony
KernelKernelKernelCommand Queue #1
Compute Cmd Buffer
Compute Cmd Buffer
Graphics Cmd Buffer
Compute Cmd Buffer
• Profile to understand occupancy of both graphics and compute workloads
• Queues can support both compute and graphics
KernelKernelKernelCommand Queue #2
KernelKernelKernelCommand Queue #3
![Page 21: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/21.jpg)
225
An ExampleCompute and Graphics In Harmony
Free Surface Navier Stokes Solver
• 11 Compute Kernels
• 4 Shaders
• The output of each kernel is the input to the next
• Some kernels have very low occupancy
• Still opportunities for concurrency with compute
Click here to view this video
![Page 22: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/22.jpg)
226
An ExampleMany discretized operations are separable
SM SM SM SM
SM SM SM SM
SM SM SM SM
SM SM SM SM
Command Queue Command Queue
Process X Axis
(and half the Z)
Process Y Axis
(and other half of Z)
Examples• Fluid Sims• Gaussian Blurs• Convolution Kernels
Semaphore SemaphoreUse semaphores to synchronize
Driver handles dispatching groups
![Page 23: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/23.jpg)
227
An ExampleCompute and graphics run concurrently
SM SM
SM SM
SM SM
SM SM
SM SM
SM SM
Command Queue Command Queue
Graphics Work
Semaphore
SM SM
SM SM
Frame N
Frame
N+1
Frame
N+2
Frame
N+3
Frame
N+4
Frame N
Frame
N+1
Frame
N+2
Frame
N+3
Compute Work
Compute Graphics
![Page 24: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/24.jpg)
228
An ExamplePutting it all together
SM SM
SM SM
SM
SM
SM SM
SM SM
Command Queue Command Queue
Graphics Work
Semaphore
SM
SM
SM SM
SM SM
Frame N
Frame
N+1
Frame
N+2
Frame
N+3
Frame
N+4
Frame N
Frame
N+1
Frame
N+2
Frame
N+3
Process X Axis
(and half the Z)
Compute Graphics
Semaphore
Command Queue
Process Y Axis
(and other half of Z)
![Page 25: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/25.jpg)
229
Memory TransfersMore opportunity for concurrency
KernelTransferKernelTransferKernelCommand Queue #1
MMU may be idle
ALUs may be idle
• Memory transfers are handle by MMU
• Can run concurrently with Kernels
• As long as the current kernel isnt using the memory
Why do this?
![Page 26: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/26.jpg)
230
Memory TransfersMore opportunity for concurrency
TransferTransferTransferHost to Device Queue
When you can do this• DtoH and HtoD transfers can run concurrently
KernelKernelKernelCompute Queue
TransferTransferTransferDevice to Host Queue
Examples• Large image processing• Video processing
![Page 27: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/27.jpg)
231
ConclusionTakeaways
NVIDIA/KHRONOS CONFIDENTIAL
There is more than 1 queue available
Keep registers and shared memory to a minimum
Low occupancy leads to an under utilized GPU
Maximize GPU utilization by running kernels concurrently
Profile to understand the occupancy profiles of kernels and shaders
Some hardware can run kernels AND shaders concurrently
Use Semaphores to synchronize between queues
Be sensible at the beer festival
![Page 28: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/28.jpg)
232
NVIDIA/KHRONOS CONFIDENTIAL
Thank You Enjoy Vulkan!!
![Page 29: Moving To Vulkan - Khronos Group · NVIDIA Maxwell 2 Register File Core Load Store Unit. 210 Execution Model Thread Hierarchies 32 threads 32 threads 32 threads 32 threads Logical](https://reader034.vdocuments.us/reader034/viewer/2022042807/5f817375219b371332247361/html5/thumbnails/29.jpg)
Questions?Chris Hebert, Dev Tech Software Engineer, Professional Visualization