Dale Southard
BEST PRACTICES FOR DESIGNING,
DEPLOYING, AND MANAGING GPU CLUSTERS
Cluster Design
Tesla Accelerated Computing Platform
[Diagram: platform stack spanning Development and Data Center Infrastructure. Categories shown: GPU Accelerators, Interconnect, System Management, Compiler Solutions, Profile and Debug, Development Tools, Programming Languages, Libraries, Infrastructure Management, Communication, System Solutions, Software Solutions. Examples shown: GPU Boost, GPU Direct, NVLink, NVML, LLVM, CUDA Debugging API, cuBLAS, …]
TESLA: PLATFORM WITH OPEN ECOSYSTEM
Pick the CPU that’s Correct for You
[Diagram: CPU options (x86 shown); software layers: Libraries (AmgX, cuBLAS), Programming Languages, Compiler Directives]
DESIGNING FOR WORKLOAD
Multiple Ways to Balance Parallel and Serial Performance
[Diagram: three node configurations with 2, 3, and 4 GPUs per CPU, built from NVLink, PCIe Gen3 x16, and a PCIe switch; link bandwidths labeled 80 GB/s, 40 GB/s, 20 GB/s, and 16 GB/s]
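To get a feel for what the labeled bandwidths mean for a workload, a back-of-the-envelope estimate divides the data volume by the link rate. The sketch below is only illustrative: the 12 GB working set is a hypothetical figure, and treating the slide's numbers as sustained rates is an optimistic assumption.

```shell
#!/usr/bin/env bash
# Rough transfer-time estimate: seconds = gigabytes / (GB/s).
# Assumes the labeled link figure is a sustained rate, which is optimistic.
transfer_seconds() {
  local gigabytes="$1" gb_per_s="$2"
  awk -v gb="$gigabytes" -v bw="$gb_per_s" 'BEGIN { printf "%.2f\n", gb / bw }'
}

# Hypothetical 12 GB working set over each bandwidth figure on the slide.
for bw in 16 20 40 80; do
  echo "${bw} GB/s: $(transfer_seconds 12 "$bw") s"
done
```

The point is the ratio between configurations rather than the absolute times, which depend on the application's actual transfer pattern.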
IN SITU VIS – FASTER SCIENCE, LOWER COST
Traditional Workflow
CPU Supercomputer (Simulation: 1 week), then Viz Cluster (Viz: 1 day)
Multiple Iterations
Time to Discovery = Months
Tesla Platform
GPU-Accelerated Supercomputer
Visualize while you simulate
Restart Simulation Instantly
Multiple Iterations
Time to Discovery = Weeks
Faster Time to Discovery
Reduced Pressure on Filestore
Cluster Deployment
NODE TOPOLOGY CIRCA 2008
Life Was Good
[Diagram: two CPUs, a bridge, memory, and two GPUs]
NODE TOPOLOGY IN 2014 AND BEYOND
Better, but More Choices
[Diagram: two CPUs with memory and four GPUs, each with its own memory, with further devices implied by "…"]
TOPOLOGY-AWARE RESOURCE MANAGERS
$CUDA_VISIBLE_DEVICES today, cgroups coming soon
[Diagram: two CPUs and four GPUs shared among Job 1, Job 2, and Job 3]
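As a rough illustration of what a topology-aware launcher does with the environment variable, the sketch below restricts a job to the GPUs attached to one CPU socket. The socket-to-GPU mapping is a hypothetical layout invented for the example, and the function names are not from any resource manager; a real scheduler would discover the layout rather than hard-code it.

```shell
#!/usr/bin/env bash
# Hypothetical layout: GPUs 0,1 hang off CPU socket 0; GPUs 2,3 off socket 1.
# On a real node, check the actual layout first (for example with `nvidia-smi topo -m`).
gpus_for_socket() {
  case "$1" in
    0) echo "0,1" ;;
    1) echo "2,3" ;;
    *) echo "unknown socket: $1" >&2; return 1 ;;
  esac
}

# Run a command that can only see the GPUs attached to the given socket.
run_on_socket() {
  local socket="$1"; shift
  local gpus
  gpus="$(gpus_for_socket "$socket")" || return 1
  CUDA_VISIBLE_DEVICES="$gpus" "$@"
}
```

For example, `run_on_socket 1 ./my_app` (with `./my_app` standing in for any CUDA program) would start the program with only GPUs 2 and 3 visible. The variable only hides devices from the CUDA runtime; it does not pin CPU cores or memory, which is handled separately.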
Cluster Management
TESLA GPU DEPLOYMENT KIT
Command-line and Library
nvidia-smi provides a command-line interface
NVML provides API access
https://developer.nvidia.com/gpu-deployment-kit
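A minimal sketch of the command-line side: nvidia-smi can emit one CSV row per GPU, which is easy to post-process in a script. The helper names below are made up for the example, and the guard at the end simply skips the query on machines without the driver installed.

```shell
#!/usr/bin/env bash
# One CSV row per GPU: "index, name, memory.total".
list_gpus() {
  nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
}

# Read rows in that format on stdin and print just the GPU names.
gpu_names() {
  awk -F', *' '{ print $2 }'
}

# Only touch the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  list_gpus | gpu_names
fi
```

The same information is available programmatically through NVML for tools that prefer not to parse text.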
COMPUTE MODE
Insurance Against Scheduling Issues
The Compute Mode setting controls simultaneous use of a GPU:
DEFAULT allows multiple simultaneous processes
EXCLUSIVE_THREAD allows only one context
EXCLUSIVE_PROCESS allows one process, but multiple threads
PROHIBITED allows no compute contexts
Can be set by command line (nvidia-smi) or API (NVML)
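One way a job prolog might apply this setting is sketched below. The helper is hypothetical and deliberately a dry run: it validates the mode name against the four listed on the slide and prints the nvidia-smi command instead of executing it, since changing the compute mode normally requires root.

```shell
#!/usr/bin/env bash
# Print (dry run) the nvidia-smi command that sets a compute mode on one GPU.
# gpu:  GPU index as reported by nvidia-smi
# mode: one of the four mode names listed on the slide
compute_mode_cmd() {
  local gpu="$1" mode="$2"
  case "$mode" in
    DEFAULT|EXCLUSIVE_THREAD|EXCLUSIVE_PROCESS|PROHIBITED) ;;
    *) echo "unsupported compute mode: $mode" >&2; return 1 ;;
  esac
  echo "nvidia-smi -i $gpu -c $mode"
}

# Show what would be run for GPU 0.
compute_mode_cmd 0 EXCLUSIVE_PROCESS
```

An administrator would review the printed command and then run it with the appropriate privileges.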
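RESOURCE LIMITS
Use cgroups, Not ulimit
UVA (Unified Virtual Addressing) depends on allocating virtual address space
Virtual address space != physical RAM consumption
A ulimit on virtual memory can therefore stop a CUDA job that uses little physical RAM
Several batch systems and resource managers support cgroups, either directly or via plugins.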
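Most sites will let the batch system's cgroup support do this, but the underlying mechanism can be sketched by hand. The helpers below are illustrative and assume the caller passes an existing cgroup directory with permission to write to it; the function names are invented for the example. They handle both the cgroup v2 file (`memory.max`) and the cgroup v1 memory controller file (`memory.limit_in_bytes`).

```shell
#!/usr/bin/env bash
# Cap physical memory for a cgroup instead of capping virtual address space.
# cgroup_dir: an existing cgroup directory; bytes: limit in bytes.
set_memory_limit() {
  local cgroup_dir="$1" bytes="$2"
  if [ -e "$cgroup_dir/memory.max" ]; then              # cgroup v2
    echo "$bytes" > "$cgroup_dir/memory.max"
  elif [ -e "$cgroup_dir/memory.limit_in_bytes" ]; then # cgroup v1
    echo "$bytes" > "$cgroup_dir/memory.limit_in_bytes"
  else
    echo "no memory limit file in: $cgroup_dir" >&2
    return 1
  fi
}

# Move a process (default: this shell) into the cgroup so the limit applies.
join_cgroup() {
  local cgroup_dir="$1" pid="${2:-$$}"
  echo "$pid" > "$cgroup_dir/cgroup.procs"
}
```

Because the limit applies to memory actually used rather than to address space reserved, a CUDA process can still map the large virtual range that UVA needs.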
MONITORING
Open Source, Commercial, DIY
Environmental and utilization metrics are available
Tesla cards may provide out-of-band (OoB) access to telemetry via the BMC
NVML support has been integrated into many monitoring systems
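For the DIY case, a small script can poll nvidia-smi and flag outliers. The sketch below is one possible shape, with invented helper names and an arbitrary threshold chosen by the caller: it queries index, temperature, and utilization as CSV, then prints the indices of GPUs above a temperature limit.

```shell
#!/usr/bin/env bash
# One CSV row per GPU: "index, temperature (C), utilization (%)".
query_gpu_metrics() {
  nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu \
             --format=csv,noheader,nounits
}

# Read rows in that format on stdin; print the index of every GPU whose
# temperature is above the given limit (degrees C).
hot_gpus() {
  local limit="$1"
  awk -F', *' -v limit="$limit" '$2 + 0 > limit + 0 { print $1 }'
}
```

On a GPU node, `query_gpu_metrics | hot_gpus 80` would list any GPUs currently reporting more than 80 degrees, which could feed an alert or a node-health check.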
GPU ACCOUNTING
Per-process accounting of GPU usage by PID
Accessible by library (NVML) or command line (nvidia-smi)
Enable accounting mode:
sudo nvidia-smi -am 1
Human-readable accounting:
nvidia-smi -q -d ACCOUNTING
CSV accounting data:
nvidia-smi --query-accounted-apps=gpu_name,gpu_util --format=csv
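The CSV form lends itself to simple reporting. As a sketch, assuming accounting mode is already enabled, the helpers below (names invented for the example) request the PID and GPU utilization of each accounted process and compute the mean utilization, skipping rows whose value is not numeric.

```shell
#!/usr/bin/env bash
# One CSV row per accounted process: "pid, gpu_util (%)".
accounted_apps() {
  nvidia-smi --query-accounted-apps=pid,gpu_util --format=csv,noheader,nounits
}

# Read rows in that format on stdin and print the mean GPU utilization.
# Rows without a numeric utilization are ignored; no rows means no output.
mean_gpu_util() {
  awk -F', *' '$2 ~ /^[0-9]+/ { sum += $2; n++ }
               END { if (n) printf "%.1f\n", sum / n }'
}
```

Piping `accounted_apps` into `mean_gpu_util` at the end of a job gives a quick indication of whether the allocated GPUs were actually kept busy.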
Summary
TESLA ACCELERATED COMPUTING PLATFORM
Pick the HW mix that’s right for your site
Topology Matters in design and in resource management
Rich ecosystem of management hooks and tools
Libraries, Languages, Tools
Thanks