maximizing the power of gpu for diverse workloads of digital … · 2018-04-11 · take advantage...

Confidential │ ©2018 VMware, Inc.

Maximizing The Power of GPU for Diverse Workloads of Digital Workspaces on VMware vSphere

Uday Kurkure, Hari Sivaraman, Lan Vu

NVIDIA GTC 2018, San Jose, CA, USA

Diverse Workloads on Virtualized GPUs in vSphere

GPUs are the new CPUs
• GPGPU: General Purpose Programming Using GPUs (CUDA)
• Graphics: 3D-rendering
• Video Encoding/Decoding

Virtualization Technology efficiently manages servers in the data centers
• Enables Diverse Workloads
• Windows and Linux VMs running on the same host
• Suspend/Resume and vMotion of Virtualized GPU enabled VMs

Combine the Power of GPUs with Management Benefits of Virtualization

Capabilities of GPUs supported in VMware vSphere

• General Purpose GPU (GPGPU)
  • Machine learning / Deep Learning
  • High performance computing workloads
• Accelerating 2D/3D Graphics workloads for VMware Virtual Desktop Infrastructure (VDI)
• VMware Blast Extreme protocol encoding / decoding for VDI
  • H.264 based (MPEG-4) and H.265

Exploring Diverse Workload Performance on vSphere

GPGPU: Deep Learning/Machine Learning

3D-Graphics

Video Workloads (Encode/Decode)

Mixed Workloads: ML/DL + 3D Graphics on the same host


Nvidia GPUs in VMware vSphere

[Diagram: two configurations. In both, each virtual machine runs applications, a guest OS, and a GPU driver on top of the hypervisor.]

Nvidia GRID vGPU: the Nvidia GRID vGPU manager in the hypervisor presents vGPUs, backed by GRID GPUs, to the virtual machines.
• Multiple VMs with vGPU per physical GPU
• Allow one vGPU per VM
• More users / VMs with vGPU per host
• Flexibility of management

VMware DirectPath I/O: each GPU is passed through to a virtual machine.
• One VM per physical GPU
• Allow multiple GPUs per VM
• Support more types of GPU cards
• Low overhead

What is a virtualized GPU (vGPU) in NVIDIA GRID?

[Diagram: several virtual machines, each with an NVIDIA driver and a vGPU, sit above the NVIDIA GRID vGPU manager (vib) in the virtualization layer. The hardware below consists of CPUs and an NVIDIA GPU with H.264 encode/decode.]

Performance of Virtualized Pascal

Performance: Tesla GM204 (M60) vs Pascal P40 GPU

                Tesla GM204 (Nvidia M60)   Pascal P40          Implications
CUDA Cores      2048                       3840                Faster Training
TeraFlops       ~4.X                       ~12.X               Faster Training
INT8            NA                         47 (TOPS)           Fast Inferencing
vGPU Support    Graphics                   Graphics and CUDA   Diverse Workloads, Data Center Management Benefits

ML Benchmark for comparison of M60 and P40

Complex Language Modelling: given a history of words, predict the next word

• Neural Network Type: Recurrent Neural Network
• Large Model
  – 1500 LSTM units / layer
• Medium
  – 650 LSTM units / layer
• Small
  – 200 LSTM units / layer
• Penn Tree Bank (PTB) Database:
  – 929K training words
  – 73K validation words
  – 82K test words
  – 10K vocabulary
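To give a feel for how far apart the three model sizes are, the sketch below estimates their trainable parameter counts. It relies on assumptions the slides do not state: two stacked LSTM layers, an embedding width equal to the number of LSTM units, and an untied softmax output layer over the 10K vocabulary. The function name `ptb_param_count` is illustrative only.

```python
def ptb_param_count(hidden, vocab=10_000, layers=2):
    """Rough trainable-parameter count for an LSTM language model.

    Assumptions (not from the slides): `layers` stacked LSTM layers,
    embedding width equal to the hidden size, untied softmax output layer.
    """
    embedding = vocab * hidden
    # Each LSTM layer has 4 gates; every gate sees the layer input and the
    # previous hidden state (both of width `hidden`) plus a bias vector.
    lstm_layer = 4 * (hidden * (hidden + hidden) + hidden)
    softmax = hidden * vocab + vocab
    return embedding + layers * lstm_layer + softmax


for name, units in [("small", 200), ("medium", 650), ("large", 1500)]:
    print(f"{name:6s} {units:5d} LSTM units/layer -> {ptb_param_count(units):,} parameters")
```

Under these assumptions the large model carries roughly 14 times as many parameters as the small one, which is one reason it needs considerably more graphics memory.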

Maxwell (Virtualized for Graphics) vs Pascal (Virtualized for CUDA & Graphics)

Virtual GPU type   Physical board   Graphics   CUDA   Maximum virtual GPUs per physical GPU
GRID M60-1q        Tesla M60        yes        no     8
GRID M60-8q        Tesla M60        yes        yes    1
GRID P40-1q        Tesla P40        yes        yes    24

Performance: Training Times on M60-8q vs P40-8q profile

Training Times for RNN on PTB, M60-8q vs P40-8q (seconds, lower is better)

Model    M60-8q   P40-8q
small    1827     567
medium   7376     3083
large    25134    11077

3.22x speedup for small, 2.4x speedup for medium, 2.27x speedup for large
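The quoted speedups are the ratio of M60-8q training time to P40-8q training time. A minimal sketch of that arithmetic follows; the P40-8q figures (567, 3083 and 11077 seconds) are the ones listed for that profile in the deck's P40 training-time table, and the variable names are illustrative.

```python
# Training times in seconds for the RNN on PTB, per model size.
m60_8q = {"small": 1827, "medium": 7376, "large": 25134}
p40_8q = {"small": 567, "medium": 3083, "large": 11077}


def speedup(baseline_secs, new_secs):
    """How many times faster the new run is than the baseline."""
    return baseline_secs / new_secs


speedups = {model: speedup(m60_8q[model], p40_8q[model]) for model in m60_8q}
for model, factor in speedups.items():
    print(f"{model}: {factor:.2f}x")
```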

Performance: vGPU Profiles & VM Scalability

Scalability: Scaling number of VMs/server

The number of VMs with ML workload per server: limited by # of vGPUs supported by GPU

[Diagram: four VMs with one vGPU each, compared with one VM with one vGPU]

Pascal (Virtualized for CUDA & Graphics)

[Table columns: virtual GPU type, physical board, graphics, CUDA, maximum virtual GPUs per physical GPU. Rows not available.]
vGPU profiles and Scaling the number of VMs

Lower vGPU Profiles allow more VMs
• P40-1q allows scaling up to 24 VMs
• Use Lower profiles for small model sizes

Higher vGPU profiles allow bigger models
• P40-12q allows scaling up to 2 VMs
• Use Higher vGPU profiles for demanding training jobs
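The VM counts quoted for P40-1q and P40-12q are consistent with dividing the card's frame buffer evenly among identical vGPUs. The sketch below makes that relationship explicit. It assumes a 24 GB Tesla P40 and that every vGPU on the card uses the same profile; `max_vms_per_gpu` is a hypothetical helper, not an NVIDIA or VMware API.

```python
P40_FRAME_BUFFER_GB = 24  # assumption: total frame buffer of one Tesla P40


def max_vms_per_gpu(profile_gb, frame_buffer_gb=P40_FRAME_BUFFER_GB):
    """Number of identical vGPUs (one per VM) that fit on one physical GPU."""
    if profile_gb <= 0 or frame_buffer_gb % profile_gb != 0:
        raise ValueError(f"{profile_gb} GB does not evenly divide {frame_buffer_gb} GB")
    return frame_buffer_gb // profile_gb


for gb in (1, 4, 12, 24):
    print(f"P40-{gb}q: up to {max_vms_per_gpu(gb)} VM(s) per GPU")
```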

How to choose vGPU Profile?: Allocating Graphics Memory

Training Times for P40 profiles for RNN on PTB (secs)

RNN Model   P40-1q       P40-2q       P40-3q       P40-4q       P40-6q   P40-8q   P40-12q  P40-24q
            1 GB         2 GB         3 GB         4 GB         6 GB     8 GB     12 GB    24 GB
Small       568          566          567          564          570      567      571      559
Medium      out of mem   3091         3084         3083         3085     3083     3077     3074
Large       out of mem   out of mem   out of mem   out of mem   11076    11077    11048    11051

• Using a larger profile than necessary may not improve performance.
• It may enable larger batch sizes.
• Caution: batch size is a hyperparameter that needs to be tuned carefully.
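One way to read the training-time table is "pick the smallest profile that does not run out of memory, because larger ones are not faster". The sketch below encodes the reported times and applies that rule. The helper names are illustrative, and the rule itself is an interpretation of the slide rather than a documented sizing procedure.

```python
# Reported training times in seconds per P40 profile; None marks "out of mem".
PROFILES_GB = [1, 2, 3, 4, 6, 8, 12, 24]
TRAINING_SECS = {
    "small": [568, 566, 567, 564, 570, 567, 571, 559],
    "medium": [None, 3091, 3084, 3083, 3085, 3083, 3077, 3074],
    "large": [None, None, None, None, 11076, 11077, 11048, 11051],
}


def smallest_fitting_profile(model):
    """Smallest P40 profile on which the model trained without running out of memory."""
    for gb, secs in zip(PROFILES_GB, TRAINING_SECS[model]):
        if secs is not None:
            return f"P40-{gb}q"
    raise ValueError(f"{model} did not fit on any profile")


def relative_spread(model):
    """(slowest - fastest) / fastest across the profiles that fit."""
    times = [t for t in TRAINING_SECS[model] if t is not None]
    return (max(times) - min(times)) / min(times)


for model in TRAINING_SECS:
    print(f"{model}: {smallest_fitting_profile(model)}, spread {relative_spread(model):.1%}")
```

For all three models the spread between the fastest and slowest fitting profile stays below 3%, which matches the slide's point that extra graphics memory alone does not shorten training.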

Performance: vGPU Scheduling

Scheduling vGPUs : Best Effort Scheduling

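Best Effort Scheduler

[Diagram: work from VM 1, VM 2 and VM 3 passes through a time-sliced round-robin scheduler to the GPU engine. The three per-VM queues hold items 1, 2, 4, 6; items 5, 7, 8; and item 3. The GPU engine timeline shows items 1 through 8.]

Content is adapted from NVIDIA materials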

Scheduling vGPUs : Equal Share Scheduling

Equal Share Scheduling

[Diagram: the same three VMs and per-VM queues (items 1, 2, 4, 6; items 5, 7, 8; item 3) pass through an equal-share round-robin scheduler to the GPU engine. The GPU engine timeline is annotated: GPU is idle for 3 slots.]

Content is adapted from NVIDIA materials
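The difference between the two policies can be illustrated with a toy model. This is a simplified sketch, not NVIDIA's scheduler: it assumes all work is queued up front, that best-effort round-robin skips any VM with nothing to run, and that equal-share gives every VM its own slot whether or not it has work. The queue contents mirror the per-VM queues in the diagrams, but the assignment of queues to VMs is an assumption, and the idle-slot count this model produces need not match the slide's figure, which depends on arrival timing.

```python
from collections import deque


def best_effort(queues):
    """Round-robin over VMs, skipping any VM that has no pending work."""
    pending = [deque(q) for q in queues]
    timeline = []
    while any(pending):
        for vm, queue in enumerate(pending):
            if queue:
                timeline.append((vm, queue.popleft()))
    return timeline


def equal_share(queues):
    """Round-robin where each VM owns its slot; an empty VM leaves it idle (None)."""
    pending = [deque(q) for q in queues]
    timeline = []
    while any(pending):
        for vm, queue in enumerate(pending):
            if not any(pending):
                break  # all work finished partway through the cycle
            timeline.append((vm, queue.popleft() if queue else None))
    return timeline


queues = [[1, 2, 4, 6], [5, 7, 8], [3]]  # hypothetical assignment of queues to three VMs
be = best_effort(queues)
es = equal_share(queues)
idle_slots = sum(1 for _, item in es if item is None)
print("best effort :", len(be), "slots, 0 idle")
print("equal share :", len(es), "slots,", idle_slots, "idle")
```

Even in this toy model, equal share needs more slots to finish the same eight work items because slots reserved for a VM with no work go unused.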

ML Benchmark for comparison of Scheduling Policies

Workload : Handwriting Recognition

Dataset: MNIST database of handwritten digits

Training set: 60,000 examples

Test set: 10,000 examples

Neural Net: CNN


Best Effort Scheduling vs. Equal Share Scheduling

Training Time MNIST with P40-1q vGPU profile on 1 x P40 GPU
(normalized training time, lower is better)

Number of concurrent VMs (or ML jobs) sharing 1 GPU   1   4   8   16   24
Best Effort Scheduling                                1   3   6   11   17
Equal Share Scheduling                                1   4   8   17   28

Best Effort Scheduling vs. Equal Share Scheduling

[Chart: GPU Utilization MNIST with P40-1q vGPU profile on 1 x P40 GPU. Y axis: utilization (%), 0 to 100, higher is better. X axis: number of concurrent VMs (or ML jobs) on 1 GPU: 1, 4, 8, 16, 24. Series: Best Effort Scheduling, Equal Share Scheduling.]

CAD on P40

CAD Benchmarks: P40 Scalability

Goal: Demonstrate the scalability of P40 using CAD benchmarks

Server: Dell R740xd
• Intel Skylake CPUs + 1 x NVidia GRID P40
• 36 cores (2 x 18-core socket)
• 768 GB RAM

Benchmarks:
1. SPECapc for 3ds Max™ 2015
2. SPECViewPerf12

Graphics VM: Win 7, x64, 4 vCPU, 16 GB RAM, 120 GB HD
1. Autodesk 3ds Max 2015 + SPECapc
2. SPECViewPerf12

Experiment Design

For each P40 Profile supported:

Vary #VMs running concurrently from 1 VM to maximum allowed:

Run SPEC benchmark & record results
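The experiment design is a nested loop over profiles and VM counts. The sketch below enumerates the resulting run matrix. The per-profile limits shown are the ones stated or implied elsewhere in the deck for four of the P40 profiles; `experiment_matrix` is a hypothetical helper, and the benchmark launch itself is left out.

```python
def experiment_matrix(max_vms_by_profile):
    """Yield (profile, concurrent_vms) pairs: 1 VM up to the profile's maximum."""
    for profile, max_vms in max_vms_by_profile.items():
        for n_vms in range(1, max_vms + 1):
            yield profile, n_vms


# Maximum concurrent VMs per physical P40 for a subset of profiles.
limits = {"P40-1q": 24, "P40-4q": 6, "P40-12q": 2, "P40-24q": 1}
runs = list(experiment_matrix(limits))
print(len(runs), "benchmark runs, starting with", runs[:3])
```

Each pair would correspond to one SPEC benchmark run with that many VMs active at once.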


Scalability Using SPECapc 3DSMax-2015


Summary of SPECapc Results

• Run-time is independent of profile; it depends only on the number of VMs running.

• Each extra VM adds 20% performance penalty.

• CPU utilization scales linearly
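The 20% figure can be turned into a rough run-time estimate, though the slide does not say whether the penalty accumulates additively or compounds. The sketch below supports both readings. The function name and the 100-second baseline are hypothetical and serve only to illustrate the arithmetic.

```python
def estimated_runtime(single_vm_secs, n_vms, penalty=0.20, compounding=False):
    """Estimate benchmark run-time with n_vms running concurrently.

    Assumption: each VM beyond the first adds `penalty` to the run-time,
    either additively (default) or compounding per extra VM.
    """
    if n_vms < 1:
        raise ValueError("n_vms must be at least 1")
    extra = n_vms - 1
    factor = (1 + penalty) ** extra if compounding else 1 + penalty * extra
    return single_vm_secs * factor


baseline = 100.0  # hypothetical single-VM run-time in seconds
for n in (1, 2, 4):
    print(n, "VMs:", round(estimated_runtime(baseline, n), 1), "s additive,",
          round(estimated_runtime(baseline, n, compounding=True), 1), "s compounding")
```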


Scalability Using SPECViewPerf12


Mixed Workloads

ML + Knowledge-worker Workloads

Goal: Quantify performance impact of running ML & knowledge-worker workloads on the same server concurrently.

Benchmarks:
• ML: Handwriting Recognition (mnist)
• Knowledge-worker: ViewPlanner

ML VM (ML on P40-12Q): CentOS 7.2, TensorFlow 0.10, 12 vCPU, 16 GB RAM, 96 GB HD

Knowledge-worker Desktop: Win10, x64, 2 vCPU, 4 GB RAM, 40 GB HD; MS Office 2013, Acrobat, IE, Firefox, 720p Video

Experiment #   # ML VM   # Knowledge-worker VM
1              1         32
2              1         64
3              1         96

Mixed Workloads - Results

• Knowledge-worker impact < 0.5%
• ML training, a batch job, sees some performance impact.

CAD + Knowledge-worker Workloads

Goal: Quantify performance impact of running CAD & knowledge-worker workloads on the same server concurrently.

Benchmarks:
• CAD: SPECapc for 3ds Max™ 2015
• Knowledge-worker: ViewPlanner

CAD on P40-6Q

Experiment #   # CAD VM   # Knowledge-worker VM
1              1          32
2              2          32
3              4          32
4              1          64
5              2          64
6              4          64
7              1          96
8              2          96
9              4          96

Mixed Workloads Results: Low Impact on CAD


• Knowledge-worker impact < 0.5%


ML + CAD + Knowledge-worker Workloads

Goal: Quantify performance impact of running ML, CAD & knowledge-worker workloads on the same server concurrently.

Benchmarks:
• ML: Handwriting Recognition (mnist)
• CAD: SPECapc for 3ds Max™ 2015
• Knowledge-worker: ViewPlanner

ML on P40-12Q
CAD on P40-12Q

Experiment #   # ML VM   # CAD VM   # Knowledge-worker VM
1              1         1          32
2              1         1          64
3              1         1          96

Mixed Workloads Results: Low Impact on Interactive Workloads


Summary of ML + CAD + Knowledge-worker Results

• Low impact on interactive workloads

• ML training, a batch job, sees some performance penalty.

• CPU utilization scales linearly


Advantages of Running Mixed Workloads

Better Resource Utilization

Higher Consolidation Ratios

Easier to load-balance

Advantages of Virtualized GPUs

Run Multiple ML/Graphics jobs concurrently on the same GPU
• Perf. impact on interactive applications is negligible.

Advantages of Suspend/Resume Capability
• Temporal Separation of ML and Graphics jobs/VMs
  – Run ML Training Jobs at night time
  – Suspend the ML VMs in the morning
  – Run Interactive Graphics VMs during daytime
  – Resume ML Training jobs at night
• Suspend high profile 24q VM and start 6 4q VMs or vice versa
• Use vMotion
  – Certain HPC jobs take weeks to run
  – vMotion will be very useful
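The day/night separation amounts to a simple time-based policy. The sketch below expresses it as a pure function that returns the state each class of VM should be in at a given time. The 08:00 to 18:00 working window, the state names and the function name are all assumptions made for illustration; actually suspending or resuming VMs would be done through vSphere tooling, which is not shown here.

```python
from datetime import time


def desired_vm_states(now, day_start=time(8, 0), day_end=time(18, 0)):
    """Return the desired state per VM class for a given time of day.

    Assumption: daytime is [day_start, day_end); ML training VMs hold the
    GPU at night and are suspended during the day for interactive users.
    """
    daytime = day_start <= now < day_end
    return {
        "ml_training": "suspended" if daytime else "running",
        "interactive_graphics": "running" if daytime else "idle",
    }


for t in (time(3, 0), time(8, 0), time(14, 30), time(22, 0)):
    print(t.strftime("%H:%M"), desired_vm_states(t))
```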

Key Performance Takeaways

Virtualized GPUs deliver near bare metal Performance for ML workloads in VMware vSphere

GPUs can be used in two modes on vSphere: DirectPath I/O and NVIDIA GRID vGPU

Running diverse workloads simultaneously has no significant perf. impact on interactive apps.

For Multi-GPU ML workloads, use DirectPath I/O mode

For more consolidation of GPU-based workloads, use GRID vGPU

Take Advantage of Suspend/Resume: Run ML training at night, suspend in the morning and run Interactive Graphics jobs during the daytime

GRID vGPU combines performance of GPUs and datacenter management benefits of VMware vSphere

Q&A

Thank you NVIDIA for the opportunity

Contact

Uday Kurkure, Lan Vu, Hari Sivaraman

{ukurkure,lanv,hsivaraman}@vmware.com

Thanks to our colleagues
• Bruce Herndon, Aravind Bappanadu

References

Our paper “Machine Learning Using Virtualized GPUs in Cloud Environments”
• Published by Springer in Lecture Notes in Computer Science on Oct. 20th, 2017, as a chapter in International Conference on High Performance Computing, ISC 2017. https://link.springer.com/chapter/10.1007/978-3-319-67630-2_41

A series of blogs on VMware VROOM. The contents of the blogs are almost the same as the aforementioned paper.
• https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
• https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
• https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html
• https://blogs.vmware.com/performance/2017/11/machine-learning-virtualized-containers-nvidia-vgpu-performance.html
