maximizing the power of gpu for diverse workloads of digital … · 2018-04-11 · take advantage...
TRANSCRIPT
Confidential │ ©2018 VMware, Inc.
Maximizing The Power of GPU for Diverse Workloads of Digital Workspaces on VMware vSphere
Uday Kurkure, Hari Sivaraman, Lan Vu
NVIDIA GTC 2018, San Jose, CA, USA
Diverse Workloads on Virtualized GPUs in vSphere
GPUs are the new CPUs
• GPGPU: General Purpose Programming Using GPUs (CUDA)
• Graphics: 3D-rendering
• Video Encoding/Decoding
Virtualization Technology efficiently manages servers in the data centers
• Enables Diverse Workloads
• Windows and Linux VMs running on the same host
• Suspend/Resume and vMotion of Virtualized GPU enabled VMs
Combine the Power of GPUs with Management Benefits of Virtualization
Capabilities of GPUs supported in VMware vSphere
•General Purpose GPU (GPGPU)
• Machine learning / Deep Learning
• High performance computing workloads
•Accelerating 2D/3D Graphics workloads for VMware Virtual Desktop Infrastructure (VDI)
•VMware Blast Extreme protocol encoding / decoding for VDI
• H.264-based (MPEG-4 AVC) and H.265
Exploring Diverse Workload Performance on vSphere
GPGPU: Deep Learning/Machine Learning
3D-Graphics
Video Workloads (Encode/Decode)
Mixed Workloads: ML/DL + 3D Graphics on the same host
Nvidia GPUs in VMware vSphere
[Diagram: two ways to expose GPUs to virtual machines. In both, each virtual machine runs applications, a guest OS, and a GPU driver on top of the hypervisor.]
Nvidia GRID vGPU (GRID GPUs shared as vGPUs through the Nvidia GRID vGPU manager in the hypervisor)
• Multiple VMs with vGPU per physical GPU
• Allow one vGPU per VM
• More users / VMs with vGPU per host
• Flexibility of management
VMware DirectPath I/O (each GPU passed through the hypervisor to a virtual machine)
• One VM per physical GPU
• Allow multiple GPUs per VM
• Support more types of GPU cards
• Low overhead
What is a virtualized GPU (vGPU) in NVIDIA GRID?
[Diagram: virtual machines on a server, each with its own NVIDIA driver and a vGPU, run on a hardware virtualization layer (the hypervisor) that contains the NVIDIA GRID vGPU manager (vib). Below it sit the CPUs and an NVIDIA GPU with H.264 encode/decode.]
Performance of Virtualized Pascal
Performance: Tesla GM204 (M60) vs Pascal P40 GPU
                Tesla GM204 (Nvidia M60)   Pascal P40          Implications
CUDA cores      2048                       3840                Faster training
TeraFlops       ~4.X                       ~12.X               Faster training
INT8            NA                         47 (TOPS)           Fast inferencing
vGPU support    Graphics                   Graphics and CUDA   Diverse workloads, data center management benefits
ML Benchmark for comparison of M60 and P40
Complex Language Modelling – Given history of words, predicts next word
• Neural Network Type: Recurrent Neural Network
• Large Model
  – 1500 LSTM units /layer
• Medium
  – 650 LSTM units /layer
• Small
  – 200 LSTM units /layer
• Penn Tree Bank (PTB) Database:
  – 929K training words
  – 73K validation words
  – 82K test words
  – 10K vocabulary
Maxwell (Virtualized for Graphics) vs Pascal (Virtualized for CUDA & Graphics)
Virtual GPU type   Physical board   Graphics   CUDA   Maximum virtual GPUs per physical GPU
GRID M60-1q        Tesla M60        yes        no     8
GRID M60-2q        Tesla M60        yes        no     4
GRID M60-4q        Tesla M60        yes        no     2
GRID M60-8q        Tesla M60        yes        yes    1
GRID P40-1q        Tesla P40        yes        yes    24
GRID P40-2q        Tesla P40        yes        yes    12
GRID P40-3q        Tesla P40        yes        yes    8
GRID P40-4q        Tesla P40        yes        yes    6
GRID P40-6q        Tesla P40        yes        yes    4
GRID P40-8q        Tesla P40        yes        yes    3
GRID P40-12q       Tesla P40        yes        yes    2
GRID P40-24q       Tesla P40        yes        yes    1
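The per-GPU limits in this table are consistent with dividing the physical GPU's frame buffer by the profile's frame buffer. A minimal Python sketch of that arithmetic follows. It assumes 8 GB per physical GPU on the Tesla M60 and 24 GB on the Tesla P40; these sizes are not stated on the slide, although they match the 8q and 24q rows. The names `FRAME_BUFFER_GB` and `max_vgpus` are illustrative only.

```python
# Frame buffer per physical GPU in GB. Assumed values, not stated on the slide.
FRAME_BUFFER_GB = {"M60": 8, "P40": 24}


def max_vgpus(board: str, profile_gb: int) -> int:
    """Return how many vGPUs of the given size fit on one physical GPU."""
    total = FRAME_BUFFER_GB[board]
    if profile_gb <= 0 or profile_gb > total:
        raise ValueError(f"profile size {profile_gb} GB is invalid for {board}")
    # Integer division: leftover frame buffer cannot host a partial vGPU.
    return total // profile_gb


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 6, 8, 12, 24):
        print(f"GRID P40-{size}q: up to {max_vgpus('P40', size)} vGPUs per GPU")
```

Under these assumptions the function gives the same counts as the rows of the table for both boards.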
Performance: Training Times on M60-8q vs P40-8q profile
[Chart: Training Times for RNN on PTB, M60-8q vs P40-8q. Y-axis: training time in seconds, lower is better.]
Model    Training time M60-8q (secs)   Training time P40-8q (secs)
small    1827                          567
medium   7376                          3083
large    25134                         11077
3.22x speedup for small, 2.4x speedup for medium, 2.27x speedup for large
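The speedups quoted on this slide can be recomputed from the chart's data labels (M60-8q: 1827, 7376, 25134 seconds; P40-8q: 567, 3083, 11077 seconds). The following Python sketch does only that division; the variable and function names are illustrative.

```python
# Training times in seconds, read from the data labels of the chart on this slide.
M60_8Q = {"small": 1827, "medium": 7376, "large": 25134}
P40_8Q = {"small": 567, "medium": 3083, "large": 11077}


def speedup(baseline_secs: float, new_secs: float) -> float:
    """Speedup of the new configuration relative to the baseline."""
    return baseline_secs / new_secs


def speedups() -> dict:
    """Speedup of P40-8q over M60-8q for each model size, rounded to 2 decimals."""
    return {m: round(speedup(M60_8Q[m], P40_8Q[m]), 2) for m in M60_8Q}


if __name__ == "__main__":
    for model, factor in speedups().items():
        print(f"{model}: {factor}x")
```

The medium model comes out at 2.39x, which the slide rounds to 2.4x.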
Performance: vGPU Profiles & VM Scalability
Scalability: Scaling number of VMs/server
The number of VMs with ML workload per server: limited by # of vGPUs supported by GPU
[Diagram: four VMs with one vGPU each, compared with one VM with one vGPU]
Pascal (Virtualized for CUDA & Graphics)
Virtual GPU type   Physical board   Graphics   CUDA   Maximum virtual GPUs per physical GPU
GRID P40-1q        Tesla P40        yes        yes    24
GRID P40-2q        Tesla P40        yes        yes    12
GRID P40-3q        Tesla P40        yes        yes    8
GRID P40-4q        Tesla P40        yes        yes    6
GRID P40-6q        Tesla P40        yes        yes    4
GRID P40-8q        Tesla P40        yes        yes    3
GRID P40-12q       Tesla P40        yes        yes    2
GRID P40-24q       Tesla P40        yes        yes    1
vGPU profiles and Scaling the number of VMs
Lower vGPU Profiles allow more VMs
• P40-1q allows scaling up to 24 VMs
• Use Lower profiles for small model sizes
Higher vGPU profiles allow bigger models
• P40-12q allows scaling up to 2 VMs
• Use Higher vGPU profiles for demanding training jobs
How to choose a vGPU Profile: Allocating Graphics Memory
Training Times for P40 profiles for RNN on PTB (secs)
vGPU profile   Graphics memory   Small   Medium       Large
P40-1q         1 GB              568     out of mem   out of mem
P40-2q         2 GB              566     3091         out of mem
P40-3q         3 GB              567     3084         out of mem
P40-4q         4 GB              564     3083         out of mem
P40-6q         6 GB              570     3085         11076
P40-8q         8 GB              567     3083         11077
P40-12q        12 GB             571     3077         11048
P40-24q        24 GB             559     3074         11051
• Using a larger profile than necessary may not improve performance
• It may enable larger batch sizes.
• Caution: Batch size is a hyperparameter that needs to be tuned carefully.
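The guidance on this slide can be read straight off the training-time table. The Python sketch below encodes the table's numbers, picks the smallest profile that did not run out of memory for each model, and measures how much time the best larger profile saves. The function names are illustrative, and treating "out of mem" as `None` is a modelling choice of this sketch.

```python
# Training times in seconds per P40 profile size (GB), from the table on this slide.
# None marks a run that ended with "out of mem".
TRAINING_SECS = {
    "small": {1: 568, 2: 566, 3: 567, 4: 564, 6: 570, 8: 567, 12: 571, 24: 559},
    "medium": {1: None, 2: 3091, 3: 3084, 4: 3083, 6: 3085, 8: 3083, 12: 3077, 24: 3074},
    "large": {1: None, 2: None, 3: None, 4: None, 6: 11076, 8: 11077, 12: 11048, 24: 11051},
}


def smallest_fitting_profile(model: str) -> int:
    """Smallest profile size in GB on which the model trained without running out of memory."""
    return min(gb for gb, secs in TRAINING_SECS[model].items() if secs is not None)


def max_gain_from_larger_profile(model: str) -> float:
    """Fraction of training time saved by the fastest profile versus the smallest fitting one."""
    base = TRAINING_SECS[model][smallest_fitting_profile(model)]
    best = min(secs for secs in TRAINING_SECS[model].values() if secs is not None)
    return (base - best) / base


if __name__ == "__main__":
    for model in TRAINING_SECS:
        gb = smallest_fitting_profile(model)
        gain = max_gain_from_larger_profile(model)
        print(f"{model}: smallest profile P40-{gb}q, best larger profile saves {gain:.1%}")
```

With these numbers the smallest fitting profiles are P40-1q, P40-2q, and P40-6q, and no larger profile saves as much as 2% of the training time.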
Performance: vGPU Scheduling
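Scheduling vGPUs : Best Effort Scheduling
[Diagram: Best Effort Scheduler. VM 1, VM 2, and VM 3 submit work to a time-sliced round-robin scheduler that feeds the GPU engine. Pending task queues: 1, 2, 4, 6; 5, 7, 8; and 3. The GPU engine runs tasks 1 through 8 back to back.]
Content is adapted from NVIDIA materials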
Scheduling vGPUs : Equal Share Scheduling
[Diagram: Equal Share Scheduling. VM 1, VM 2, and VM 3 submit work to an equal-share round-robin scheduler that feeds the GPU engine. Pending task queues: 1, 2, 4, 6; 5, 7, 8; and 3. GPU is idle for 3 slots.]
Content is adapted from NVIDIA materials
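The difference between the two policies can be illustrated with a toy simulation. The Python sketch below is a simplified model, not NVIDIA's scheduler: it assumes equal-length tasks and fixed-size time slots, and it assumes that under equal share a VM's slot stays empty when that VM has nothing to run. The queue lengths in the usage example are made up for illustration and do not reproduce the slide's exact timeline.

```python
from typing import Dict, List, Optional


def simulate(queues: Dict[str, int], equal_share: bool) -> List[Optional[str]]:
    """Return the owner of each time slot; None marks an idle slot.

    queues maps a VM name to the number of equal-length tasks it has pending.
    """
    remaining = dict(queues)
    order = list(queues)
    timeline: List[Optional[str]] = []
    turn = 0
    while any(remaining.values()):
        vm = order[turn % len(order)]
        turn += 1
        if remaining[vm] > 0:
            remaining[vm] -= 1
            timeline.append(vm)
        elif equal_share:
            # Equal share: the slot belongs to this VM and is not given away.
            timeline.append(None)
        # Best effort: a VM with no work is skipped without spending a slot.
    return timeline


if __name__ == "__main__":
    pending = {"VM 1": 4, "VM 2": 3, "VM 3": 1}  # illustrative queue lengths
    for name, flag in (("best effort", False), ("equal share", True)):
        slots = simulate(pending, equal_share=flag)
        print(f"{name}: {len(slots)} slots, {slots.count(None)} idle")
```

In this toy model best effort finishes the same work in fewer slots because it never leaves the GPU idle while any VM has work pending.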
ML Benchmark for comparison of Scheduling Policies
Workload : Handwriting Recognition
Dataset: MNIST database of handwritten digits
Training set: 60,000 examples
Test set: 10,000 examples
Neural Net: CNN
Best Effort Scheduling vs. Equal Share Scheduling
[Chart: Training Time MNIST with P40-1q vGPU profile on 1 x P40 GPU. X-axis: number of concurrent VMs (or ML jobs) sharing 1 GPU. Y-axis: normalized training time, lower is better.]
Concurrent VMs   Best Effort Scheduling   Equal Share Scheduling
1                1                        1
4                3                        4
8                6                        8
16               11                       17
24               17                       28
Best Effort Scheduling vs. Equal Share Scheduling
[Chart: GPU Utilization MNIST with P40-1q vGPU profile on 1 x P40 GPU. X-axis: number of concurrent VMs (or ML jobs) sharing 1 GPU (1, 4, 8, 16, 24). Y-axis: utilization (%), higher is better. Series: Best Effort Scheduling, Equal Share Scheduling.]
CAD on P40
CAD Benchmarks: P40 Scalability
Goal: Demonstrate the scalability of P40 using CAD benchmarks
Server: Dell R740xd – Intel Skylake CPUs + 1 x NVidia GRID P40, 36 cores (2 x 18-core socket), 768 GB RAM
Graphics VM: Win 7, x64, 4 vCPU, 16 GB RAM, 120 GB HD
1. Autodesk 3ds Max 2015 + SPECapc
2. SPECViewPerf12
Benchmarks:
1. SPECapc for 3ds Max™ 2015
2. SPECViewPerf12
Experiment Design
For each P40 Profile supported:
Vary #VMs running concurrently from 1 VM to maximum allowed:
Run SPEC benchmark & record results
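The nested loop described above can be written out as a run matrix. The Python sketch below enumerates (profile, VM count) pairs using the per-profile VM limits from the P40 profile table. It assumes every VM count from 1 to the maximum is run, which the slide does not state, and it only lists the runs; launching the SPEC benchmarks is outside its scope.

```python
from typing import Dict, Iterator, Tuple

# Maximum VMs per physical P40 for each vGPU profile (from the P40 profile table).
P40_MAX_VMS: Dict[str, int] = {
    "P40-1q": 24, "P40-2q": 12, "P40-3q": 8, "P40-4q": 6,
    "P40-6q": 4, "P40-8q": 3, "P40-12q": 2, "P40-24q": 1,
}


def experiment_matrix(max_vms: Dict[str, int] = P40_MAX_VMS) -> Iterator[Tuple[str, int]]:
    """Yield one (profile, concurrent VM count) pair per benchmark run."""
    for profile, limit in max_vms.items():
        # Assumption: step through every VM count from 1 up to the profile's limit.
        for n_vms in range(1, limit + 1):
            yield profile, n_vms


if __name__ == "__main__":
    runs = list(experiment_matrix())
    print(f"{len(runs)} runs, from {runs[0]} to {runs[-1]}")
```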
Scalability Using SPECapc 3DSMax-2015
Summary of SPECapc Results
• Run-time is independent of the vGPU profile and depends only on the number of VMs running.
• Each extra VM adds a 20% performance penalty.
• CPU utilization scales linearly.
Scalability Using SPECViewPerf12
Mixed Workloads
ML + Knowledge-worker Workloads
Goal: Quantify performance impact of running ML & knowledge-worker workloads on the same server concurrently.
Benchmarks:
ML: Handwriting Recognition (mnist)
Knowledge-worker: ViewPlanner
Server: Dell R740xd – Intel Skylake CPUs + 1 x NVidia GRID P40, 36 cores (2 x 18-core socket), 768 GB RAM
ML VM: CentOS 7.2, TensorFlow 0.10, 12 vCPU, 16 GB RAM, 96 GB HD; ML on P40-12Q
Knowledge-worker Desktop: Win10, x64, 2 vCPU, 4 GB RAM, 40 GB HD; MS Office 2013, Acrobat, IE, Firefox, 720p Video
Experiment #   # ML VMs   # Knowledge-worker VMs
1              1          32
2              1          64
3              1          96
Mixed Workloads - Results
• Knowledge-worker impact < 0.5%
• ML training, a batch job, sees some performance impact.
CAD + Knowledge-worker Workloads
Goal: Quantify performance impact of running CAD & knowledge-worker workloads on the same server concurrently.
Benchmarks:
CAD: SPECapc for 3ds Max™ 2015
Knowledge-worker: ViewPlanner
Server: Dell R740xd – Intel Skylake CPUs + 1 x NVidia GRID P40, 36 cores (2 x 18-core socket), 768 GB RAM
CAD on P40-6Q
Experiment #   # CAD VMs   # Knowledge-worker VMs
1              1           32
2              2           32
3              4           32
4              1           64
5              2           64
6              4           64
7              1           96
8              2           96
9              4           96
Mixed Workloads Results: Low Impact on CAD
• Knowledge-worker impact < 0.5%
ML + CAD + Knowledge-worker Workloads
Goal: Quantify performance impact of running ML, CAD & knowledge-worker workloads on the same server concurrently.
Benchmarks:
ML: Handwriting Recognition (mnist)
CAD: SPECapc for 3ds Max™ 2015
Knowledge-worker: ViewPlanner
Server: Dell R740xd – Intel Skylake CPUs + 1 x NVidia GRID P40, 36 cores (2 x 18-core socket), 768 GB RAM
ML on P40-12Q
CAD on P40-12Q
Experiment #   # ML VMs   # CAD VMs   # Knowledge-worker VMs
1              1          1           32
2              1          1           64
3              1          1           96
Mixed Workloads Results: Low Impact on Interactive Workloads
Summary of ML + CAD + Knowledge-worker Results
• Low impact on interactive workloads
• ML training, a batch job, sees some performance penalty.
• CPU utilization scales linearly
Advantages of Running Mixed Workloads
Better Resource Utilization
Higher Consolidation Ratios
Easier to load-balance
Advantages of Virtualized GPUs
Run Multiple ML/Graphics jobs concurrently on the same GPU
• Perf. impact on interactive applications is negligible.
Advantages of Suspend/Resume Capability
• Temporal Separation of ML and Graphics jobs/VMs
  – Run ML Training Jobs at night time
  – Suspend the ML VMs in the morning
  – Run Interactive Graphics VMs during daytime
  – Resume ML Training jobs at night
• Suspend high profile 24q VM and start 6 4q VMs or vice versa
• Use vMotion
  – Certain HPC jobs take weeks to run
  – vMotion will be very useful
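The temporal separation described above amounts to a time-of-day policy. The Python sketch below shows one way to express it. The working hours (08:00 to 18:00), the group names, and the state labels are all assumptions of this sketch; the slides only say "morning", "daytime", and "night". The actual suspend and resume operations would be carried out with vSphere tooling, which is not shown here.

```python
from datetime import time
from typing import Dict

# Assumed working hours; the slides give no exact times.
DAY_START = time(8, 0)
DAY_END = time(18, 0)


def desired_state(now: time) -> Dict[str, str]:
    """Return the state each (hypothetical) VM group should be in at the given time."""
    daytime = DAY_START <= now < DAY_END
    return {
        # ML training is a batch job: run it at night, suspend it during the day.
        "ml_training_vms": "suspended" if daytime else "running",
        # Interactive graphics desktops get the GPU during working hours.
        "interactive_graphics_vms": "running" if daytime else "stopped",
    }


if __name__ == "__main__":
    for t in (time(2, 0), time(10, 0), time(22, 30)):
        print(t.strftime("%H:%M"), desired_state(t))
```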
Key Performance Takeaways
Virtualized GPUs deliver near bare metal Performance for ML workloads in VMware vSphere
GPUs can be used in two modes on vSphere: DirectPath I/O and NVIDIA GRID vGPU
Running diverse workloads simultaneously has no significant performance impact on interactive apps.
For Multi-GPU ML workloads, use DirectPath I/O mode
For more consolidation of GPU-based workloads, use GRID vGPU
Take Advantage of Suspend/Resume: Run ML training at night, suspend in the morning and run Interactive Graphics jobs during the daytime
GRID vGPU combines performance of GPUs and datacenter management benefits of VMware vSphere
Q&A
Thank you NVIDIA for the opportunity
Contact
Uday Kurkure, Lan Vu, Hari Sivaraman
{ukurkure,lanv,hsivaraman}@vmware.com
Thanks to our colleagues
• Bruce Herndon, Aravind Bappanadu
References
Our paper “Machine Learning Using Virtualized GPUs in Cloud Environments”
• Published by Springer in Lecture Notes in Computer Science on Oct. 20th 2017, as a chapter in International Conference on High Performance Computing ISC 2017. https://link.springer.com/chapter/10.1007/978-3-319-67630-2_41
A series of blogs on VMware VROOM. The contents in the blogs are almost the same as the aforementioned paper.
• https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
• https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
• https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html
• https://blogs.vmware.com/performance/2017/11/machine-learning-virtualized-containers-nvidia-vgpu-performance.html