©2018 VMware, Inc.
Accelerating & Optimizing HPC/ML on vSphere Leveraging NVIDIA GPU
Mohan Potheri, VMware, Inc
Justin Murray, VMware, Inc
Agenda
New Demands on IT
VMware Goal and Approach
Why Virtualize AI & ML
Machine Learning Landscape
Maximizing GPU Utilization
Extending GPU Sharing to Containers
Summary
New Demands on IT Infrastructure
[Figure: new demands on IT infrastructure. Specialized hardware: x86, SGX, GPU, NVM, PMEM, FPGA, QAT, IPU. Infrastructure: security, hybrid cloud, public cloud, global infra and edge. Growth of apps: business-critical apps, desktop virtualization, graphics-intensive, cloud-native apps, edge/IoT, SaaS, mobile, custom/other, analytics/AI/ML.]
Our Goal and Approach
• Increase agility and decrease time to discovery for researchers, data scientists, and engineers
• Provide IT with the ability to efficiently provision, allocate, manage and ensure compliance of research compute infrastructure across an increasingly broad range of technical and business requirements
• Do this by leveraging VMware’s proven, enterprise-class virtualization and cloud technologies to meet the performance requirements of research computing, HPC, and ML workloads
• Bring capabilities to these workloads that are not available in traditional HPC/ML environments
Why Virtualize HPC AI/ML Infrastructure
vSphere can help data scientists get to answers faster
Themes: Operational Flexibility, Reduced Complexity, Secure Sensitive Workloads
• Simple cluster expansion and contraction
• Rapidly reproduce research environments
• Higher resiliency and less downtime with vMotion
• Fault isolation (hardware and software)
• Cluster resource sharing
• Minimize setup and configuration time with centralized management capabilities
• Simultaneously support mixed software environments
• Industry-leading virtualization platform that your IT already knows
• Easy, secure data access and sharing
• Security isolation
• Multi-tenant data security
Dispelling the Misunderstanding about GPUs on vSphere
• The hypervisor is not an intermediary when accessing the GPU
• GPU access is either directly via passthrough to the VM or through NVIDIA GRID vGPU
• Near-zero performance impact
Machine Learning Infrastructure Landscape
[Figure: landscape diagram with the labels Machine Learning, Deep Learning, Big Data, Data Analytics, Edge or IoT, on-prem, off-prem, training data, and inference.]
Two Main Phases in ML
• Training / Model Building
  • Often very large data sets
  • Compute, storage, and network intensive
  • Server-class infrastructure
• Inference / Scoring
  • Apply existing models to new data
  • Used for prediction
  • Edge or core infrastructure
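The split between the two phases can be illustrated with a deliberately tiny model. This is only a sketch: the least-squares fit, the function names `train` and `infer`, and the data are assumptions chosen for illustration and are not a workload from the talk.

```python
# Illustrative only: a tiny least-squares model standing in for a real ML workload.

def train(xs, ys):
    """Training phase: fit y = slope * x + intercept to the data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x):
    """Inference phase: apply the existing model to new data."""
    slope, intercept = model
    return slope * x + intercept


# Training touches the whole data set; inference only needs the fitted model.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = infer(model, 10.0)
```

Training reads every sample, which is why it lands on server-class infrastructure, while inference needs only the small fitted model and can run at the edge or the core.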
Using GPUs with vSphere
VM DirectPath I/O for NVIDIA GPU
[Figure: a virtualized GPU using passthrough on vSphere 6.5/6.7. An ESXi host with a GPU and two VMs; the guest stack is Linux, the CUDA library and driver, and TensorFlow.]
GPU Acceleration on vSphere with DirectPath I/O
• Can provision VMs with one or more GPUs
• Easily reuse GPU infrastructure
• Same behavior as public cloud GPU instances
• Benefits:
  • HW isolation
  • Workload isolation
  • VM-level Quality of Service
  • Fast environment provisioning
  • Near bare-metal performance
• Passthrough device certification for vSphere not required
  • Server must be compatible with the device, as published by the server OEM and GPU vendor
  • Server must be vSphere certified
• Caveats:
  • No vMotion
  • No Suspend and Resume
  • No DRS
  • No vSphere HA
VM DirectPath I/O – Multiple GPUs Attached to a Virtual Machine
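One way to confirm which passthrough GPUs a guest actually sees is to query the NVIDIA driver from inside the VM. The sketch below assumes a Linux guest with the NVIDIA driver installed so that `nvidia-smi` is on the path; the helper names and the sample GPU model are hypothetical.

```python
import shutil
import subprocess

# nvidia-smi query returning one CSV line per GPU visible to the guest.
QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"]


def parse_gpu_list(csv_text):
    """Turn 'index, name, memory' CSV lines into a list of dictionaries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_guest_gpus():
    """Return the GPUs the guest OS sees, or an empty list without a driver."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)


# Hypothetical output for a VM with two passthrough GPUs attached.
sample = "0, Tesla V100-PCIE-16GB, 16384 MiB\n1, Tesla V100-PCIE-16GB, 16384 MiB\n"
gpus = parse_gpu_list(sample)
```

Calling `list_guest_gpus()` inside the VM should return one entry per GPU attached through DirectPath I/O.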
vSphere GPU Sharing Mechanisms
VMware vSphere 6.7 and NVIDIA Quadro vDWS (GRID 7.0)
• Share a single GPU among multiple VMs
• Provision VMs with anything from a partial GPU up to one full GPU
• GRID vGPU VM Suspend and Resume support
• Quickly repurpose GPU infrastructure
  • VDI or data science by day
  • Compute (ML) by night
• Benefits:
  • HW isolation
  • Workload isolation
  • VM-level Quality of Service
  • GPU Quality of Service
  • Fast environment provisioning
  • Bare-metal comparable performance
NVIDIA GRID – Two Layers of Software/Drivers
NVIDIA GRID Configuration – Choosing the vGPU Profile
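The effect of the profile choice on VM density can be estimated with simple arithmetic. The sketch below assumes that each vGPU profile reserves a fixed slice of GPU framebuffer and that every vGPU on a physical GPU uses the same profile; the 16 GB card and the profile sizes are hypothetical, so check the NVIDIA documentation for the profiles a given GPU supports.

```python
def vms_per_gpu(gpu_framebuffer_gb, profile_framebuffer_gb):
    """Number of vGPU-backed VMs one physical GPU can host for a given profile."""
    if profile_framebuffer_gb <= 0 or profile_framebuffer_gb > gpu_framebuffer_gb:
        raise ValueError("profile must be positive and fit on the physical GPU")
    return gpu_framebuffer_gb // profile_framebuffer_gb


# Hypothetical 16 GB GPU with 1, 2, 4, 8, and 16 GB profiles.
capacity = {size: vms_per_gpu(16, size) for size in (1, 2, 4, 8, 16)}
```

A smaller profile packs more VMs onto the card, while the largest profile gives a single VM the whole GPU.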
Bitfusion Enables Remote GPU Sharing
• Dynamic GPU attach anywhere
• Fractional GPUs for Efficiency
• Application Run Time Virtualization
• Standard based GPU
[Figure: Bitfusion (BF) client VMs on ESX hosts connect to a vSphere GPU cluster of BF server VMs, each on an ESX host with GPU passthrough.]
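The idea of fractional GPUs can be pictured as packing client requests onto a pool of remote GPUs. The sketch below is a toy first-fit model written for illustration only; it is not Bitfusion's scheduler or API, and the GPU names, client names, and percentages are hypothetical.

```python
def place_requests(gpu_names, requests):
    """First-fit placement of fractional GPU requests, in whole percent."""
    free = {name: 100 for name in gpu_names}  # percent of each GPU still unallocated
    placement = {}
    for client, percent in requests:
        for name in gpu_names:
            if free[name] >= percent:
                free[name] -= percent
                placement[client] = name
                break
        else:
            placement[client] = None  # no GPU in the pool has enough free capacity
    return placement, free


placement, free = place_requests(
    ["gpu0", "gpu1"],
    [("client-a", 50), ("client-b", 50), ("client-c", 25), ("client-d", 100)],
)
```

In this run two half-GPU clients share `gpu0`, a quarter-GPU client lands on `gpu1`, and the full-GPU request has to wait because no GPU is entirely free.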
Maximize GPU Utilization
vSphere 6.7 GPU Virtual Machine Suspend and Resume
Source: Enhancing Operations for NVIDIA Grid
Video Demo:
https://youtu.be/PwVReRauY50
Blog Article:
https://blogs.vmware.com/vsphere/2018/07/vsphere-6-7-suspend-and-resume-of-gpu-attached-virtual-machines.html
Deep Learning Virtualization Use Case: Cycle Harvesting
Challenge:
Data scientists submit jobs in traditional batches because of limited compute availability
• Submit jobs one day
• Wait until the next day for the job results
What if…
The VDI environment has unused cycles. Could HPC jobs run in that environment when it is not needed for VDI?
Will it blend?
Outcome and benefit:
• Enable HPC compute jobs to harvest cycles from a VDI compute environment.
• Go beyond traditional batch processing and view HPC resources as an engine for returning results in real time.
Cycle Harvesting
[Figure: three VMware ESXi hosts shown over a timeline (8 AM, noon, 5 PM, 10 PM), with VMs assigned share values of either 100 or 1.]
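Share values act as relative weights that only matter when VMs compete for the same resource. The sketch below is a simplified proportional-share model, not the ESXi scheduler: it assumes one VDI VM with a share value of 100 and one HPC VM with a share value of 1, and the demand figures are hypothetical.

```python
def allocate(capacity, vms):
    """Split capacity by shares, capped at each VM's demand.

    vms maps a name to (shares, demand). Capacity a VM does not need is
    redistributed among the VMs that still want more.
    """
    allocation = {name: 0.0 for name in vms}
    active = set(vms)
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(vms[name][0] for name in active)
        satisfied = set()
        granted = 0.0
        for name in active:
            shares, demand = vms[name]
            offer = remaining * shares / total_shares
            take = min(offer, demand - allocation[name])
            allocation[name] += take
            granted += take
            if allocation[name] >= demand - 1e-9:
                satisfied.add(name)
        remaining -= granted
        if not satisfied:
            break  # everyone still wants more, so the split is final
        active -= satisfied
    return allocation


# Daytime: VDI is busy and keeps priority. Night: VDI is idle, HPC harvests.
day = allocate(100, {"vdi": (100, 95), "hpc": (1, 100)})
night = allocate(100, {"vdi": (100, 5), "hpc": (1, 100)})
```

Under these assumptions the HPC VM receives about 5% of the capacity during the day and about 95% at night, without any change to the share values.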
Cycle Harvesting Case Study: https://bit.ly/2MrBngH
Extending GPGPU Sharing to Containers
Why Singularity Containers?
• Docker is not designed for HPC architectures
• Singularity is the best-suited container solution for HPC:
  • A Singularity container is encapsulated in a single file, making it highly portable and secure
  • Singularity is designed from the ground up for scientific computing
Combining Virtual Machines & Containers for GPU sharing
• Sharing GPUs among containers is difficult because there is no resource management
• A vSphere VM with NVIDIA GRID or Bitfusion can use a whole or partial GPU
• Containers are a great packaging mechanism for applications
• By enclosing one container per virtual machine, we get the best of both worlds
  • GPU resources can be shared with other containers
  • Machine and deep learning applications and platforms can be packaged and distributed effectively as a container
Logical Schematic of Infrastructure components
• One Singularity Container per VM
• Containers leverage partial or full GPUs allocated to the virtual machine
• Container packaged with TensorFlow, tools, etc.
• Bitfusion provides GPU sharing
[Figure: Singularity containers, one per virtual machine, on ESX hosts in a generic vSphere cluster, using GPUs from Bitfusion (BF) server VMs on ESX hosts with GPU passthrough in a vSphere GPU cluster.]
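Inside each VM, the container is typically launched with Singularity's `--nv` option, which makes the NVIDIA driver libraries and devices of the host environment (here, the VM) available in the container. The sketch below only builds the command line; the image name `tensorflow.sif` and the script `train.py` are hypothetical placeholders, and how the VM obtains its GPU (GRID vGPU or Bitfusion) is outside the snippet.

```python
import shlex


def singularity_gpu_command(image_path, script_path):
    """Build a 'singularity exec --nv' command that runs a Python script."""
    return ["singularity", "exec", "--nv", image_path, "python", script_path]


cmd = singularity_gpu_command("tensorflow.sif", "train.py")
printable = " ".join(shlex.quote(part) for part in cmd)
# To launch for real, pass cmd to subprocess.run(cmd, check=True) inside the VM.
```

Because the container is a single image file, the same command works unchanged on any VM in the cluster that has a GPU allocated.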
Images/sec Throughput comparison for 1 GPU
2.5-3X more throughput with sharing
[Figure: throughput comparison with and without GPU sharing for Resnet50, Alexnet, and Inception3. Y-axis: throughput ratio, 0.00 to 3.50. Series: total throughput with sharing, and baseline with no sharing.]
Runtime comparison for 1 GPU (with/without sharing)
[Figure: runtime comparison for 1 GPU with and without sharing. Bars: unshared and shared, shown as runtime (%) and average run time (seconds), on an axis from 0 to 200; the annotated difference is 17%.]
Only 17% slower for nearly 3X Throughput
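One possible reading of these two numbers is that several jobs run concurrently on the shared GPU and each takes somewhat longer than it would alone. The sketch below works through that arithmetic; the number of concurrent jobs is an assumption, since the transcript does not state it.

```python
def aggregate_throughput_ratio(concurrent_jobs, slowdown):
    """Throughput of the shared GPU relative to one unshared job.

    slowdown is shared runtime divided by unshared runtime (1.17 = 17% slower).
    """
    return concurrent_jobs / slowdown


# Assumed job counts; only the 17% slowdown comes from the slide.
ratio_three_jobs = aggregate_throughput_ratio(3, 1.17)
ratio_four_jobs = aggregate_throughput_ratio(4, 1.17)
```

With a 17% slowdown, three concurrent jobs give roughly 2.6 times the unshared throughput and four give roughly 3.4 times, which brackets the range reported on the slides.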
Summary
• Sharing is key to enabling cloud-like capabilities on premises
• vSphere is the best platform for leveraging the latest high-performance hardware
• Virtualization supports device sharing and delivers near bare-metal performance
• Hardware sharing through vSphere can increase utilization (for example, cycle harvesting)