dcgm overview and hands on · 5 data center gpu manager (dcgm) pre-configured policies job level...
TRANSCRIPT
![Page 1: DCGM OVERVIEW AND HANDS ON · 5 DATA CENTER GPU MANAGER (DCGM) Pre-configured Policies Job Level Statistics Stateful Configuration POLICY AND ALERTING Software Deployment Tests Stress](https://reader033.vdocuments.us/reader033/viewer/2022052011/6027993f630eea639a460f75/html5/thumbnails/1.jpg)
June 2019
DCGM OVERVIEW AND HANDS ON
TOOLS FOR MANAGING GPUs
Out-of-Band
GPU Metrics and Monitoring via BMC (SMBPBI)
Provide metrics (thermals, power, etc.) without the NVIDIA driver
Typically used at public CSPs (i.e. multi-tenant environments)
In-Band
Tools use the NVIDIA driver to provide GPU and NVSwitch metrics
DCGM and NVML (nvidia-smi) are in-band tools
Typically used in single-tenant environments
NVIDIA IN-BAND TOOLS ECOSYSTEM
NVML
▶ Customers building their own GPU metrics/monitoring stack using NVML
DCGM
▶ Customers integrating DCGM; CSPs for system validation
3rd Party Tools
▶ Cluster managers, job schedulers, TSDBs, visualization tools
HOW SHOULD I MANAGE MY GPUS?
3RD PARTY TOOLS
▶ Provide database, graphs, and a nice UI
▶ Need management node(s)
▶ Development already done. You just have to configure the tools.
DCGM
▶ Can query a few hours of metrics
▶ Provides health checks and diagnostics
▶ Can batch queries/operations to groups of GPUs
▶ Can be remote or local
NVML
▶ Stateless queries. Can only query current data
▶ Low overhead while running, high overhead to develop
▶ Low-level control of GPUs
▶ Management app must run on same box as GPUs
DATA CENTER GPU MANAGER (DCGM)
POLICY AND ALERTING
▶ Pre-configured Policies
▶ Job Level Statistics
▶ Stateful Configuration
GPU DIAGNOSTICS
▶ Software Deployment Tests
▶ Stress Tests
▶ Hardware Issues and Interface Tests (PCIe, NVLink)
CONFIGURATION MANAGEMENT
▶ Dynamic Power Capping
▶ Synchronous Clock Boost
▶ Fixed Clocks
ACTIVE HEALTH MONITORING
▶ Runtime Health Checks
▶ Prologue Checks
▶ Epilogue Checks
DCGM OVERVIEW
GPU Management in the Accelerated Data Center
https://developer.nvidia.com/data-center-gpu-manager-dcgm
Supported NVIDIA Hardware
● Fully supported on Tesla GPUs (Kepler+)
● Supported on Quadro, GeForce, and Titan GPUs (Maxwell+, since v1.3)
● Supports NVSwitch and DGX-2
● Driver R384 or later (Linux only)
SDK Installer Packages
● .deb and .rpm packages
● Includes binaries: CLI (dcgmi) and daemon (nv-hostengine)
● Libraries and headers (includes NVML)
● C and Python bindings and code samples
● Documentation: user guides and API docs
Latest Release: v1.5.6
AVAILABLE NVIDIA MANAGEMENT TOOLS
Data Center GPU Manager (DCGM)
▶ Additional diagnostics (aka NVVS) and active health monitoring
▶ Policy management and more
NVIDIA Management Library (NVML)
▶ Low-level control of GPUs
▶ Included as part of the driver
▶ Header is part of the CUDA Toolkit / DCGM
[Figure: Software Stack. Labels: DCGM-Based 3rd Party Tools, DCGMI, Client Lib (x2), DCGM Daemon, GPU Diagnostics (NVVS), NVML, CUDA, NVIDIA Driver]
Two ways to use DCGM
APIs
Accessible programmatically through APIs
Supports C and Python
Can be embedded as a library within third-party management tools
Management scripts can also be automated using the Python bindings
CLI
Command line interface: dcgmi
Simple and interactive
Favored by regular users and system admins
Equivalent functionality is provided through the APIs and the CLI
Two modes of running DCGM
Embedded mode
DCGM agent is loaded as a shared library by a third-party agent
Periodically triggered by the third-party agent to gather data and perform management activities
Standalone mode
DCGM agent is embedded in a daemon called the NVIDIA Host Engine (nv-hostengine)
DCGM clients prefer to interact with a daemon
Multiple clients wish to interact with DCGM, not just one node agent
Users want to use the command-line tool, dcgmi
Standalone mode is widely used for its flexibility and lowest maintenance cost for users.
Package Selection
▪ Linux: x64 and POWER
▪ Ubuntu (deb) and CentOS/RHEL (rpm)
▪ Normal version: for servers without NVSwitch
▪ FM version: for servers with NVSwitch, such as DGX-2/HGX-2
Install and Quick Start
▪ $ dpkg -i *.deb    (or: rpm -ivh *.rpm)
▪ $ nv-hostengine
▪ Parameter: -b allows other hosts to access this daemon
Device for management
❑ Host: --host
❑ Group: -g
❑ GPU: --gpuid
Usage: dcgmi discovery
dcgmi discovery [--host <IP/FQDN>] -l
dcgmi discovery [--host <IP/FQDN>] -i <flags> [-g <groupId>] [-v]
DCGM Usage of CLI
▪ Group
▪ Dmon
▪ Policy
▪ Job Stats
▪ Health & Diagnostics
▪ Topology
▪ NVLINK
Group
Groups in DCGM
Groups: Almost all DCGM operations take place on groups. Users can create, modify and destroy collections of GPUs on the local node, and use these constructs to control all subsequent DCGM activities.
Partitioned groups, consisting of only a subset of GPUs, are useful for job-level concepts such as job stats and health.
Groups: create, list, delete
Managing groups is simple and needs only the "dcgmi group" subcommand.
"dcgmi group -d GroupID" deletes a group.
"dcgmi group -h" lists more detailed usage.
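As a rough illustration of scripting these subcommands, the Python sketch below only assembles `dcgmi group` argument lists; it does not run them. The helper name `dcgmi_group_cmd`, the group name `demo_group` and the host address are made up for this example. The `-c`, `-l` and `-a` flags are assumed from the `dcgmi group -h` help text, so check them against the installed version.

```python
from typing import List, Optional


def dcgmi_group_cmd(action: str, value: str = "",
                    group_id: Optional[int] = None,
                    host: Optional[str] = None) -> List[str]:
    """Build (but do not execute) a `dcgmi group` command line.

    action: "list", "create", "add" or "delete" (names chosen for this sketch).
    value:  group name for "create", group ID for "delete",
            comma-separated GPU IDs for "add".
    """
    cmd = ["dcgmi", "group"]
    if host:
        cmd += ["--host", host]          # talk to a remote nv-hostengine
    if action == "list":
        cmd += ["-l"]                    # assumed: list existing groups
    elif action == "create":
        cmd += ["-c", value]             # assumed: create a group by name
    elif action == "add":
        if group_id is None:
            raise ValueError("'add' needs a group_id")
        cmd += ["-g", str(group_id), "-a", value]  # assumed: add GPUs to a group
    elif action == "delete":
        cmd += ["-d", value]             # delete a group by ID (from the slide)
    else:
        raise ValueError(f"unknown action: {action}")
    return cmd


create_cmd = dcgmi_group_cmd("create", "demo_group")
add_cmd = dcgmi_group_cmd("add", "0,1", group_id=2)
print(" ".join(create_cmd))
print(" ".join(add_cmd))
```

On a machine where nv-hostengine is running, each list could be handed to `subprocess.run(cmd, check=True)`.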
Add GPUs to a group
Health & Diag
Health: check the health status of selected items
ACTIVE HEALTH MONITORING & ANALYSIS
NON INVASIVE CHECKS
Real-time monitoring & aggregated health indicator
Checks health of all GPUs and NVSwitch subsystems: PCIe, ECC, InfoROM, Power, Thermal, NVLink
dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health: Healthy                                                    |
+==================+=========================================================+
Run Health Check : Healthy System
dcgmi health -g 1 -c
Health Monitor Report
+----------------------------------------------------------------------------+
| Group 1          | Overall Health: Warning                                 |
+==================+=========================================================+
| GPU ID: 0        | Warning                                                 |
|                  | PCIe system: Warning - Detected more than 8 PCIe        |
|                  | replays per minute for GPU 0: 13                        |
+------------------+---------------------------------------------------------+
| GPU ID: 1        | Warning                                                 |
|                  | InfoROM system: Warning - A corrupt InfoROM has been    |
|                  | detected in GPU 1.                                      |
+------------------+---------------------------------------------------------+
Run Health Check : System with problems
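A scheduler hook usually wants the health report as data rather than as a table. The following Python sketch parses text shaped like the report above into an overall status plus one message per GPU. The function name `parse_health_report` is hypothetical, and the parser assumes only the layout shown on this slide; other DCGM versions may print a different format.

```python
from typing import Dict, List, Optional, Tuple

# Sample text copied from the "System with problems" report on this slide.
SAMPLE = """\
Health Monitor Report
+----------------------------------------------------------------------------+
| Group 1 | Overall Health: Warning |
+==================+=========================================================+
| GPU ID: 0 | Warning |
| | PCIe system: Warning - Detected more than 8 PCIe |
| | replays per minute for GPU 0: 13 |
+------------------+---------------------------------------------------------+
| GPU ID: 1 | Warning |
| | InfoROM system: Warning - A corrupt InfoROM has been |
| | detected in GPU 1. |
+------------------+---------------------------------------------------------+
"""


def parse_health_report(text: str) -> Tuple[str, Dict[int, str]]:
    """Return (overall health, {gpu_id: joined detail message})."""
    overall = ""
    details: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue                      # skip titles and +---+ borders
        cells = [c.strip() for c in line.strip("|").split("|")]
        for cell in cells:
            if cell.startswith("Overall Health:"):
                overall = cell.split(":", 1)[1].strip()
        if len(cells) != 2:
            continue
        left, right = cells
        if left.startswith("GPU ID:"):
            current = int(left.split(":", 1)[1])
            details[current] = []         # the status word on this row is not stored
        elif left == "" and current is not None:
            details[current].append(right)  # continuation line of the message
    return overall, {gpu: " ".join(parts) for gpu, parts in details.items()}


overall, details = parse_health_report(SAMPLE)
print(overall)
for gpu, message in details.items():
    print(gpu, message)
```

A prologue or epilogue script could refuse to schedule onto a node whenever `overall` is not `Healthy`.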
COMPREHENSIVE DIAGNOSTICS
ACTIVE HEALTH CHECKS
Identification, recovery & isolation of failed GPUs and NVSwitches.
Diagnostics to root-cause failures; pre- and post-job GPU health checks
System sanity checks that stress performance, bandwidth, power and thermal characteristics
Multi-level diagnostic options ranging from a few seconds to minutes
dcgmi diag -r 3
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
|-----  Deployment  --------+-------------|
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
| CUDA Main Library         | Pass        |
| CUDA Toolkit Library      | Pass        |
| Permissions and OS Blocks | Pass        |
| Persistence Mode          | Pass        |
| Environment Variables     | Pass        |
| Page Retirement           | Pass        |
| Graphics Processes        | Pass        |
| Inforom                   | Pass        |
+-----  Hardware  ----------+-------------+
| GPU Memory                | Pass - All  |
| Diagnostic                | Pass - All  |
+-----  Integration  -------+-------------+
| PCIe                      | Pass - All  |
+-----  Stress  ------------+-------------+
| SM Stress                 | Pass - All  |
| Targeted Stress           | Pass - All  |
| Targeted Power            | Warn - All  |
| Memory Bandwidth          | Pass - All  |
+---------------------------+-------------+
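For pre- and post-job checks it helps to pull out only the tests that did not pass. The sketch below does that for an abbreviated copy of the table above. The function name `non_passing` is hypothetical, and the parsing assumes the section-divider and row layout shown here rather than any documented output contract.

```python
from typing import List, Tuple

# Abbreviated copy of the `dcgmi diag -r 3` table shown on this slide.
SAMPLE = """\
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
|-----  Deployment  --------+-------------|
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
+-----  Hardware  ----------+-------------+
| GPU Memory                | Pass - All  |
| Diagnostic                | Pass - All  |
+-----  Stress  ------------+-------------+
| SM Stress                 | Pass - All  |
| Targeted Power            | Warn - All  |
+---------------------------+-------------+
"""


def non_passing(text: str) -> List[Tuple[str, str, str]]:
    """Return (section, test name, result) for every row not starting with 'Pass'."""
    section = ""
    problems: List[Tuple[str, str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if "-----" in line and any(ch.isalpha() for ch in line):
            # Section divider such as "+-----  Stress  ------------+----+"
            section = line.strip("|+").split("+")[0].strip("- ")
            continue
        if not line.startswith("|"):
            continue                       # plain border line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2 or cells[1] == "Result":
            continue                       # header row or unexpected shape
        name, result = cells
        if not result.startswith("Pass"):
            problems.append((section, name, result))
    return problems


problems = non_passing(SAMPLE)
for section, name, result in problems:
    print(f"{section}: {name} -> {result}")
```

An empty list means every listed test passed; anything else is a candidate reason to drain the node.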
Policy
Policy: what to do when a -T/-P/-e/-n/-x condition is triggered
FLEXIBLE GPU GOVERNANCE POLICIES
With Existing Tools
▶ Continuous monitoring by the user
▶ Identify GPUs with double-bit errors
▶ Manually perform GPU reset to correct problems
Using DCGM
▶ Auto-detects double-bit errors, performs GPU reset, and notifies the user
Condition, Action, Notification
▶ Condition: Watch for DBE
▶ Action: Reset GPU
▶ Notification: Callback
Job stats
MANAGING JOB LIFECYCLE
Which GPUs did my job run on?
How much of the GPUs did my job use?
Any error or warning conditions during my job (ECC errors, clock throttling, etc.)?
Are the GPUs healthy and ready for the next job?
Create GPU group and check health
Start Job Stats
Run Job
Stop Job Stats
Display Job Stats
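The lifecycle steps above can be sketched as an ordered list of commands that a job prologue/epilogue wrapper would run. This Python example only builds the list. The function name `job_stats_commands` and the job binary `./my_gpu_job` are hypothetical, and the `-e` (enable recording), `-s` (start) and `-x` (stop) flags of `dcgmi stats` are assumptions to verify with `dcgmi stats -h`; `-j`/`-v`/`-g` and `dcgmi health -g ... -c` appear elsewhere in this deck.

```python
from typing import List


def job_stats_commands(group_id: int, job_id: str,
                       job_cmd: List[str]) -> List[List[str]]:
    """Return the command sequence for one job, in execution order."""
    g = str(group_id)
    return [
        ["dcgmi", "health", "-g", g, "-c"],         # check group health first
        ["dcgmi", "stats", "-g", g, "-e"],          # assumed: enable stats recording
        ["dcgmi", "stats", "-g", g, "-s", job_id],  # assumed: start job stats
        job_cmd,                                    # run the job itself
        ["dcgmi", "stats", "-x", job_id],           # assumed: stop job stats
        ["dcgmi", "stats", "-j", job_id, "-v"],     # display job stats (verbose)
    ]


commands = job_stats_commands(2, "demojob", ["./my_gpu_job"])
for cmd in commands:
    print(" ".join(cmd))
```

Running each entry with `subprocess.run(cmd, check=True)` would stop the sequence at the first failing step, which is usually the desired behaviour for a prologue check.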
JOB STATISTICS
dcgmi stats --job demojob -v -g 2
Successfully retrieved statistics for job: demojob.
+----------------------------------------------------------------------------+
| GPU ID: 0                                                                  |
+==================================+=========================================+
|-----  Execution Stats  ----------+-----------------------------------------|
| Start Time                       | Wed Mar 7 10:02:34 2018                 |
| End Time                         | Wed Mar 7 10:10:00 2018                 |
| Total Execution Time (sec)       | 445.48                                  |
| No. of Processes                 | 1                                       |
| Compute PID                      | 23112                                   |
+-----  Performance Stats  --------+-----------------------------------------+
| Energy Consumed (Joules)         | 1437                                    |
| Max GPU Memory Used (bytes)      | 120324096                               |
| SM Clock (MHz)                   | Avg: 998, Max: 1177, Min: 405           |
| Memory Clock (MHz)               | Avg: 2068, Max: 2505, Min: 324          |
| SM Utilization (%)               | Avg: 76, Max: 100, Min: 0               |
| Memory Utilization (%)           | Avg: 0, Max: 1, Min: 0                  |
| PCIe Rx Bandwidth (megabytes)    | Avg: 0, Max: 0, Min: 0                  |
| PCIe Tx Bandwidth (megabytes)    | Avg: 0, Max: 0, Min: 0                  |
+-----  Event Stats  --------------+-----------------------------------------+
| Single Bit ECC Errors            | 5                                       |
| Double Bit ECC Errors            | 0                                       |
| PCIe Replay Warnings             | 0                                       |
| Critical XID Errors              | 0                                       |
+-----  Slowdown Stats  -----------+-----------------------------------------+
| Due to - Power (%)               | 0                                       |
|        - Thermal (%)             | Not Supported                           |
|        - Reliability (%)         | Not Supported                           |
|        - Board Limit (%)         | Not Supported                           |
|        - Low Utilization (%)     | Not Supported                           |
|        - Sync Boost (%)          | 0                                       |
+----------------------------------+-----------------------------------------+
Detailed stats show utilization, performance and more…
Config
GPU CONFIGURATION MANAGEMENT
Initialization: Configure all GPUs (global group)
Per-job basis: Individual partitioned group settings
Maintains settings across driver restarts, GPU resets or at job start
Supports SET, GET and ENFORCE
MAINTAINS CONFIGURATION
SUPPORTED SETTINGS
dcgmi config -g 1 --set -P 200
Configuration successfully set.
Disable ECC mode
dcgmi config -g 1 --get
+--------------------------+------------------------+------------------------+
| all_gpu_group            |                        |                        |
| Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
+==========================+========================+========================+
| Sync Boost               | Not Specified          | Disabled               |
| SM Application Clock     | Not Specified          | 705                    |
| Memory Application Clock | Not Specified          | 2600                   |
| ECC Mode                 | Disabled               | Disabled               |
| Power Limit              | 200                    | 225                    |
| Compute Mode             | Not Specified          | E. Process             |
+--------------------------+------------------------+------------------------+
Get Group config [Note DCGM performed reset]
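The TARGET and CURRENT columns make configuration drift easy to detect. As a minimal sketch, the Python below reports every setting whose target is specified but differs from the current value, using a copy of the table above. The function name `config_drift` is hypothetical and the parser assumes the three-column layout shown on this slide.

```python
from typing import Dict, Tuple

# Copy of the `dcgmi config -g 1 --get` table shown on this slide.
SAMPLE = """\
+--------------------------+------------------------+------------------------+
| all_gpu_group            |                        |                        |
| Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
+==========================+========================+========================+
| Sync Boost               | Not Specified          | Disabled               |
| SM Application Clock     | Not Specified          | 705                    |
| Memory Application Clock | Not Specified          | 2600                   |
| ECC Mode                 | Disabled               | Disabled               |
| Power Limit              | 200                    | 225                    |
| Compute Mode             | Not Specified          | E. Process             |
+--------------------------+------------------------+------------------------+
"""

# Target cells that carry no requested value.
IGNORED_TARGETS = {"", "Not Specified", "TARGET CONFIGURATION"}


def config_drift(text: str) -> Dict[str, Tuple[str, str]]:
    """Return {setting: (target, current)} where a set target is not yet in effect."""
    drift: Dict[str, Tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue                      # border line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, target, current = cells
        if target in IGNORED_TARGETS:
            continue
        if target != current:
            drift[name] = (target, current)
    return drift


drift = config_drift(SAMPLE)
for name, (target, current) in drift.items():
    print(f"{name}: target {target}, current {current}")
```

When drift is found, the ENFORCE operation mentioned on this slide is the natural follow-up; its exact command-line spelling should be checked in `dcgmi config -h`.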
dmon
Dmon: monitor GPU stats
List items available for monitoring
$ dcgmi dmon -l
Create a field group for monitoring
Monitor a field or field group
$ dcgmi dmon -e/-f
Integrating Prometheus & Grafana
Prometheus
❑ https://prometheus.io/
❑ Open Source systems monitoring and alerting toolkit
❑ Third-party exporters
DCGM exporter
https://github.com/NVIDIA/gpu-monitoring-tools
$ docker run -d --rm --net="host" --pid="host" --volumes-from nvidia-dcgm-exporter:ro \
    quay.io/prometheus/node-exporter --collector.textfile.directory="/run/prometheus"
$ curl localhost:9100/metrics
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-e0725273-7885-235f-be70-5a333ac6fd63"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-eea85835-2787-da88-cf6f-a8b790f6ec2c"} 0
dcgm_dec_utilization{gpu="4",uuid="GPU-fc9dd930-84df-d270-dc93-dc3122bb901f"} 0
dcgm_dec_utilization{gpu="5",uuid="GPU-95320c1d-4f08-66a8-e834-c731136a7822"} 0
# HELP dcgm_ecc_dbe_aggregate_total Total number of double-bit persistent ECC errors.
# TYPE dcgm_ecc_dbe_aggregate_total counter
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_ecc_dbe_aggregate_total{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
dcgm_ecc_dbe_aggregate_total{gpu="2",uuid="GPU-e0725273-7885-235f-be70-5a333ac6fd63"} 0
dcgm_ecc_dbe_aggregate_total{gpu="3",uuid="GPU-eea85835-2787-da88-cf6f-a8b790f6ec2c"} 0
dcgm_ecc_dbe_aggregate_total{gpu="4",uuid="GPU-fc9dd930-84df-d270-dc93-dc3122bb901f"} 0
dcgm_ecc_dbe_aggregate_total{gpu="5",uuid="GPU-95320c1d-4f08-66a8-e834-c731136a7822"} 0
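To sanity-check the exporter before wiring up Prometheus, the scraped text can be parsed directly. The Python sketch below handles only the line shape shown above (metric name, labels in braces, numeric value) and groups values by the `gpu` label. The function name `parse_metrics` is hypothetical, and a full exposition-format parser from a Prometheus client library would be more robust.

```python
import re
from typing import Dict

# A few lines copied from the curl output on this slide.
SAMPLE = """\
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
# HELP dcgm_ecc_dbe_aggregate_total Total number of double-bit persistent ECC errors.
# TYPE dcgm_ecc_dbe_aggregate_total counter
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_ecc_dbe_aggregate_total{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
"""

# name{label="value",...} number
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> Dict[str, Dict[str, float]]:
    """Return {metric name: {gpu label: value}} for labelled sample lines."""
    metrics: Dict[str, Dict[str, float]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip HELP/TYPE comments and blanks
        match = SAMPLE_LINE.match(line)
        if not match:
            continue                      # shape not covered by this sketch
        labels = dict(LABEL.findall(match.group("labels")))
        gpu = labels.get("gpu", "")
        metrics.setdefault(match.group("name"), {})[gpu] = float(match.group("value"))
    return metrics


metrics = parse_metrics(SAMPLE)
for name, per_gpu in metrics.items():
    print(name, per_gpu)
```

A non-zero value under `dcgm_ecc_dbe_aggregate_total` for any GPU would be a natural trigger for an alert rule.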
Prometheus config file
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'szdgxstation'
    static_configs:
      - targets: ['127.0.0.1:9100']
  - job_name: 'bjdlserver'
    static_configs:
      - targets: ['10.19.203.85:9100']
./prometheus --config.file=***.yml
DCGM + Prometheus
Grafana and data source configuration
The open platform for beautiful analytics and monitoring
DCGM + Prometheus + Grafana