TRANSCRIPT
#vmworld
Interconnect Acceleration for Machine Learning, Big Data, and HPC
Adit Ranadive, VMware, Inc.
Aviad Shaul Yehezkel, Mellanox
VAP2807BU
#VAP2807BU
Disclaimer
This presentation may contain product features or functionality that are currently under development.
This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new features, functionality, or technology discussed or presented have not been determined.
Datacenter interconnect bandwidths are growing – 100 Gbps and beyond

Network devices with hardware-offloaded protocols
• Reduce CPU usage
• Access application data directly without involving the OS/hypervisor – lower latencies
Really – interconnect accelerators

Big Data, Machine Learning, High Performance Computing apps
• Speed up: benefit from increased network bandwidths and lower latencies
• Distributed workloads across VMs access a shared high-performance fabric

This talk aims to answer the following questions in a vSphere environment
• What are the hardware protocols for interconnect acceleration?
• How do we enable VM access to interconnect acceleration protocols?
• How do interconnect accelerators help with application performance?

Preamble
Accelerate my Interconnect – What and Why?
Agenda
Introduction to RDMA
Direct Device Access Technologies
Paravirtual RDMA (PVRDMA) in vSphere
Machine Learning over RDMA
Big Data over RDMA
HPC Applications over RDMA
Summary / Key Takeaways
Introduction to Remote Direct Memory Access (RDMA)
A hardware protocol to accelerate data transfers
A hardware transport protocol
o Optimized for moving data to/from memory and across interconnects
o Industry standard – InfiniBand Trade Association (IBTA)
o Examples: InfiniBand, RoCE (the focus of this talk)

Extreme performance
o 600 ns application-to-application latencies
o 100 Gbps throughput
o Negligible CPU overhead

RDMA use cases
o Storage (iSER, NFS-RDMA, NVMe-oF, Lustre)
o HPC (MPI, SHMEM)
o Big data and analytics (Hadoop, Spark)
o Machine learning (TensorFlow, Horovod)
Remote Direct Memory Access (RDMA)
Traditional network stack challenges
o Per-message / packet / byte overheads
o User-kernel crossings
o Memory copies

RDMA provides in hardware:
o Isolation between applications
o Transport
  o Packetizing messages
  o Reliable delivery
o Address translation

User-level networking
o Direct hardware access for the data path
o Hardware performs DMA to/from memory
How does RDMA achieve high performance?
[Figure: applications (AppA, AppB) in user space and kernel consumers (NVMe-oF, iSER) each own buffers that the RDMA-capable hardware – the Host Channel Adapter (HCA) – reads and writes directly, bypassing the kernel on the data path]
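To make the contrast concrete, here is a minimal sketch (not from the talk) of the traditional path described above: each call below is a user-kernel crossing, and the payload is copied between user and kernel socket buffers – exactly the per-message overhead that RDMA's user-level, zero-copy data path avoids.

    # Traditional kernel network stack: every send/recv is a syscall and a copy.
    import socket

    a, b = socket.socketpair()   # in-process stand-in for a TCP connection
    payload = b"x" * 64

    a.sendall(payload)           # user-kernel crossing; copy user -> kernel buffer
    data = b.recv(64)            # user-kernel crossing; copy kernel -> user buffer
    assert data == payload

    a.close(); b.close()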
o InfiniBand is a centrally managed, lossless, high-performance network architecture
  o Provides the specification for the RDMA transport
o RoCE adapts the efficient RDMA transport to run over Ethernet networks
  o Standard Ethernet management

RoCEv2 stack: RDMA transport over UDP/IP on the Ethernet link layer, with Ethernet/IP management
RoCEv2 packet: Eth L2 | IP | UDP | BTH+ | Payload | iCRC | FCS
RDMA over Converged Ethernet (RoCE)
InfiniBand stack: RDMA transport over the InfiniBand network and link layers, with InfiniBand management
InfiniBand packet: LRH | GRH | BTH+ | Payload | iCRC | vCRC
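As an illustration of the encapsulation (a sketch, not code from the talk), the snippet below packs the 12-byte IBTA Base Transport Header – the "BTH+" field shared by both packet formats – and notes the outer framing RoCEv2 adds. The field layout and the IANA-registered RoCEv2 UDP port 4791 are stated from the specs; verify the details against the IBTA documents before relying on them.

    # Pack an IBTA Base Transport Header (BTH): opcode(8) | SE/M/Pad/TVer(8) |
    # P_Key(16) | resv(8)+DestQP(24) | AckReq(1)+resv(7)+PSN(24) = 12 bytes.
    import struct

    def pack_bth(opcode, pkey, dest_qp, psn, solicited=False, ack_req=True):
        flags = 0x80 if solicited else 0    # SE bit; MigReq/PadCnt/TVer left 0
        word1 = dest_qp & 0xFFFFFF          # top 8 bits reserved
        word2 = ((1 << 31) if ack_req else 0) | (psn & 0xFFFFFF)
        return struct.pack("!BBHII", opcode, flags, pkey, word1, word2)

    bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12, psn=0)  # RC SEND-only
    assert len(bth) == 12
    # RoCEv2 frame: Eth L2 | IP | UDP (dst port 4791) | bth | payload | iCRC | FCS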
Direct Device Access Technologies
Accessing PCI devices from VMs with maximum performance
VM DirectPath I/O
Allows PCI devices to be accessed directly by the guest OS
• Examples: GPUs for computation (GPGPU), ultra-low-latency interconnects like InfiniBand and RoCE

Downsides: no vMotion, no snapshots, etc. (PVRDMA, available since vSphere 6.5 and discussed later, supports vMotion)
Full device is made available to a single VM – no sharing (PVRDMA allows sharing)
No ESXi driver required – just the standard vendor device driver
[Figure: DirectPath I/O – the application in the VM's guest OS kernel drives the PCI device directly, bypassing ESXi]
The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV)
A single PCI device can present as multiple logical devices (Virtual Functions, or VFs) to ESXi and to VMs
Downsides: No vMotion, No Snapshots (PVRDMA helps again!)
An ESXi driver and a guest driver are required for SR-IOV
Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RoCE interconnects
Device Partitioning (SR-IOV)
[Figure: SR-IOV – the guest driver in the VM accesses a Virtual Function (VF) directly while the Physical Function (PF) remains with ESXi; the vSwitch/VMXNET3 path is shown alongside]
Paravirtual RDMA (PVRDMA)
Accelerating VM data transfers while retaining all virtualization benefits
o VM
  o Exposes a virtual PCIe device
  o Guest support – a driver/library implements the RDMA APIs
o PVRDMA backend
  o Mediates guest access to the HCA
  o Exposes an RDMA resource space for each guest
  o Invokes ESXi RDMA APIs in response to guest RDMA operations
  o Physical RDMA resources created for each guest
o Physical HCA services all VMs on the host
PVRDMA Architecture
[Figure: VM1 and VM2 each run an app on the guest RDMA stack and PVRDMA driver; the PVRDMA backend in ESXi maps guest operations onto the ESXi RDMA stack and HCA driver, which service the physical HCA]
o Application registers the buffers it will use
  o PVRDMA registers these buffers with the HCA
o PVRDMA takes the application request (Work Request) to read/write a particular guest address and size
  o Issues the request to the HCA
  o No data is included here
o HCA does all the work of packetizing data to/from application memory
  o Bypasses the guest OS and hypervisor
  o Enables direct zero-copy data transfers in hardware!
Accelerating VM Data Transfers
[Figure: the application buffer in Guest OS 1 is moved by the HCA straight to the RDMA peer; API calls flow from the PVRDMA NIC through the ESXi RDMA stack and HCA device driver in the VMkernel, while the data transfer itself bypasses them]
o Challenges
  o Considerable connection state is contained within the RDMA hardware
  o The RDMA connection needs to be maintained before/during/after vMotion
o PVRDMA backend creates a management network across the cluster
  o Exchanges RDMA metadata between RDMA peers
  o Maintains the virtual RDMA connection between peers
  o Re-establishes the virtual and physical RDMA connections after the VM moves to a new host
o Support
  o vMotion
  o Snapshots
  o Suspend/Resume
  o High Availability
Supporting Virtualization Benefits
Drawbacks of the current approach
o Cannot connect to bare-metal hosts
o Suited only to small-scale systems
o Increased metadata exchange
o Decreased application performance
Improving PVRDMA Scalability
Future Improvements
o Shrink PVRDMA backend management functionality
  • Stop sending metadata updates between peers (evaluated in this talk)
  • Unify resource spaces between the backend and the HCA
o Create the exact hardware RDMA state for the VM after vMotion
  • Specific RDMA APIs to be added
  • RDMA hardware must keep the connection alive for the duration of vMotion
o Allows connection to bare-metal hosts
  • While supporting vMotion, Snapshots, and High Availability!
o Configure ESXi host for PVRDMA
o Add a PVRDMA device to the Linux VM through vCenter
o Install the PVRDMA driver and library
  o OFED 4.8.2
  o Upstream: Linux 4.10 or later, rdma-core v13 or later
  o Inbox: RHEL 7.5, SLES 15, Ubuntu 18.04
o PVRDMA support is now part of most Linux distributions
Enabling PVRDMA for a VM
vSphere 6.5 and later
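Once the driver and library are installed, a quick guest-side check is to list the RDMA devices the kernel has registered – a minimal sketch assuming a Linux guest (the standard ibv_devinfo tool from the library reports the same information in more detail):

    # RDMA devices registered by the guest kernel appear under this sysfs path;
    # with the PVRDMA driver loaded, one entry shows up per PVRDMA adapter.
    import os

    ib_sysfs = "/sys/class/infiniband"
    devices = os.listdir(ib_sysfs) if os.path.isdir(ib_sysfs) else []
    print("RDMA devices:", ", ".join(devices) or "none found")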
Machine Learning
Accelerating training time with a scale-out architecture over RDMA networks
Machine Learning Is Everywhere!
Fraud Detection
What Is Machine Learning?

Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, “uses statistical techniques to give computers the ability to learn with data, without being explicitly programmed.”
Source: https://en.wikipedia.org/wiki/Machine_learning
Driven by Deep Neural Networks (DNN)
• A subset of Artificial Neural Networks (ANN)

Deep Learning
Deep Learning is a subfield of machine learning concerned with algorithms, inspired by the structure and function of the brain, called artificial neural networks
Source: http://machinelearningmastery.com/what-is-deep-learning/
Training and Inference
Source: Mellanox Technologies
Deep Learning allows difficult problems to be solved
o In some cases, problems that can’t be solved in other ways

Deep Learning is not new
o 1943: “A logical calculus of the ideas immanent in nervous activity”, McCulloch & Pitts

So why now?
o Infrastructure: recent developments in GPU and network technology make the approach practical
o Data: more data is generated than ever – critical for the training process
o Software: a wave of open-source machine learning frameworks
Why Deep Learning And Why Now?
• Sonar: ~10–100 KB/s
• Camera: ~20–40 MB/s
• GPS: ~50 KB/s
• Radar: ~10–100 KB/s
• LIDAR (Light Detection & Ranging): ~10–70 MB/s

Data will grow by a factor of 10 over the next decade, to 160 zettabytes in 2025 (source: IDC)
Faster data processing requires faster interconnect speeds

Data is Growing Faster than Ever
An autonomous vehicle generates 4,000 GB per day
Source: Cruise RP-1/YouTube
Neural Networks Complexity Growth
[Figure: speech recognition models grew ~30x in complexity from DeepSpeech (2014) through DeepSpeech-2 to DeepSpeech-3 (2017); image recognition models grew ~350x from AlexNet (2013) through GoogleNet, ResNet, Inception-V2, Inception-V4, and PolyNet (2016)]
Source: Mellanox Technologies
Training with large data sets and networks can take a long time
o In some cases, even weeks

In many cases training needs to happen frequently
o Model development and tuning
o Real-life use cases may require regular retraining

Accelerate training time by scaling out the architecture
o Add workers (nodes) to reduce training time

Popular types of parallelism
o Data parallelism
o Model parallelism
Training Challenges
The network is a critical element in accelerating distributed training!
Model and Data Parallelism
[Figure: Data parallelism – each worker trains a full local model on its own mini-batch of the data and synchronizes through a main model, parameter server, or allreduce. Model parallelism – the model itself is partitioned across workers.]
Source: Mellanox Technologies
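To make the data-parallel pattern concrete, here is a minimal sketch (not from the talk) of synchronous gradient averaging with mpi4py; when the underlying MPI library is built with verbs/RDMA support, the Allreduce below runs over the RDMA fabric. The array size is an arbitrary stand-in.

    # Data parallelism in one step: each rank computes gradients on its own
    # mini-batch, then all ranks average them with an MPI allreduce.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    local_grad = np.random.rand(1024)     # stand-in gradient from this rank's mini-batch

    avg_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)  # sum across all workers
    avg_grad /= comm.Get_size()                       # average

    # Every worker now applies the same averaged update to its local model copy.

Launched across nodes with, e.g., mpirun -np 8 python step.py (filename hypothetical).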
RDMA and GPUDirect Accelerate Distributed Training
GPUDirect RDMA Technology
With GPUDirect RDMA in vSphere we get ~100 Gbps of RDMA bandwidth
TensorFlow: several implementations upstream
o Native (verbs) – https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs
o MPI, Horovod – donated by Uber, among others
Caffe2: Over MPI or Gloo library
Microsoft Cognitive Toolkit: Native support
NVIDIA NCCL2: Native support in NCCL
All Major Machine Learning Frameworks Support RDMA
Distributed training framework for TensorFlow
Inspired by work of Baidu, Facebook, et al.
Uses bandwidth-optimal communication protocols
o Makes use of RDMA (RoCE, InfiniBand) if available
Seamlessly installs on top of TensorFlow via ‘pip install horovod’
Horovod
Source: Horovod: fast and easy distributed deep learning in TensorFlow – https://arxiv.org/pdf/1802.05799.pdf
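A minimal training-loop sketch, assembled from Horovod's documented TF1-era API to match the TensorFlow 1.9 stack used in the tests below; the toy regression model is a stand-in:

    # Horovod + TensorFlow: wrap the optimizer so gradients are averaged with
    # allreduce (over RDMA when available), and pin one GPU per worker process.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()                                          # discover ranks (via MPI)

    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())  # scale LR by worker count
    train_op = hvd.DistributedOptimizer(opt).minimize(loss)     # allreduce gradients

    hooks = [hvd.BroadcastGlobalVariablesHook(0)]       # rank 0 seeds all workers
    with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
        for _ in range(100):
            sess.run(train_op, feed_dict={
                x: np.random.rand(32, 10).astype(np.float32),
                y: np.random.rand(32, 1).astype(np.float32)})

Run across nodes with mpirun, e.g. mpirun -np 8 python train.py.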
Horovod Setup – SR-IOV
8 Hosts, 1 VM per Host, 1 GPU (Full Passthrough) per VM
Mellanox SN2700 100GigE Spectrum Switch
Host:
o 2x Intel Xeon E5-2650 v4 @ 2.20 GHz (12 cores per socket)
o 256 GB RAM
o Mellanox ConnectX-5 100GbE (SR-IOV RoCE capable)
o 8x Nvidia Tesla P100 GPU (PCI, 16 GB)
o ESXi 6.7 GA
o ConnectX driver 4.17.13.8

VM:
o 12 vCPU
o 48 GB memory
o Ubuntu 16.04 x64
o Passthrough: SR-IOV RoCE VF + Nvidia Tesla P100 GPU
o MLNX OFED 4.4-2.0.7.0
o TensorFlow 1.9
o CUDA SDK 9.0
o cuDNN 7.0
o Horovod – master from GitHub
o Docker v18.09
Horovod Performance – VGG-16 Results
[Chart: speedup (x) vs. number of GPUs (2–8) for TCP, SR-IOV TCP, and SR-IOV RoCE against ideal linear scaling; SR-IOV RoCE scales nearly linearly]
Spark Big Data Analytics
Accelerating time to solution with a shared, high-performance interconnect
Spark is a data analysis platform with implicit data parallelism and fault-tolerance
Initial release: May 2014
Originally developed at UC Berkeley’s AMPLab
Donated as open source to the Apache Software Foundation
Most active Apache open source project
50% of production systems are in public clouds
Notable users: [company logos]
Apache Spark: Quick Facts
A programming model for processing big data sets in a distributed manner
Comprises 3 stages
o Map
o Shuffle
o Reduce
Map-Reduce
Applications frequently reuse data in a Map-Reduce pipeline
o Iterative algorithms (e.g., machine learning, graphs)
o Interactive data mining and streaming

Persisting each iteration in stable storage is inefficient

The Spark solution: Resilient Distributed Datasets (RDDs)
o In-memory data representation
o Preserves and enhances the appealing properties of Map-Reduce:
  o Fault tolerance
  o Data locality
  o Scalability
o Reuses the in-memory data set in each iteration
o Mostly network I/O – a perfect match for RDMA!
Spark and Map-Reduce
[Figure: an iterative pipeline of steps reusing in-memory RDDs between iterations]
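For illustration (a minimal sketch, not from the talk), the classic word count below shows the three stages in Spark: reduceByKey is what triggers the shuffle, the network-heavy stage that RDMA-accelerated shuffle plugins target.

    # Map-Shuffle-Reduce on RDDs with PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")
    lines = sc.parallelize(["spark over rdma", "rdma over converged ethernet"])

    counts = (lines.flatMap(lambda line: line.split())  # Map: line -> words
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))    # Shuffle + Reduce
    print(counts.collect())
    sc.stop()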
Host:
o 2x Intel Xeon E5-2697 v3 @ 2.60 GHz (14 cores per socket)
o 256 GB RAM
o RDMA adapters – Mellanox ConnectX-5 100GbE (PFC enabled)
o ESXi experimental build (includes PVRDMA optimization #1)

VM:
o 20 vCPU, 200 GB RAM, CentOS 7.4 x64
o SR-IOV RoCE passthrough VF / PVRDMA
o Mellanox OFED Linux 4.4-1.0.0.0
o Spark benchmark – HiBench/TeraSort over Mellanox SparkRDMA
o Hadoop 2.4.0

Name Node host:
o 2x Intel Xeon E5-2680 0
o Mellanox ConnectX-5 100GbE

Name Node server VM:
o 12 vCPU, 64 GB RAM, CentOS 7.4
o SR-IOV RoCE passthrough VF / PVRDMA
Spark Test Setup – RoCE SR-IOV/PVRDMA
8 ESXi hosts, 1 Spark VM per host
1 server used as Name Node
Mellanox SN2700 100GigE Spectrum Switch
90 GB TeraSort
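For reference, enabling the Mellanox SparkRDMA shuffle plugin (https://github.com/Mellanox/SparkRDMA) amounts to swapping Spark's shuffle manager and putting the plugin jar on the classpath. A hypothetical sketch – the jar path is a placeholder, and the exact settings should be taken from the SparkRDMA README:

    # Replace the default shuffle manager with the RDMA implementation.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("terasort-rdma")
             .config("spark.shuffle.manager",
                     "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
             .config("spark.driver.extraClassPath", "/opt/spark-rdma.jar")    # placeholder path
             .config("spark.executor.extraClassPath", "/opt/spark-rdma.jar")  # placeholder path
             .getOrCreate())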
Results – vSphere with RoCE SR-IOV
Runtime samples | SR-IOV TCP  | SR-IOV RDMA | Improvement
Average         | 127 seconds | 91 seconds  | 28%
Min             | 126 seconds | 88 seconds  | 30%
Max             | 130 seconds | 96 seconds  | 26%
Lower is better
[Chart: runtime in seconds (Average/Min/Max) for SR-IOV TCP vs. SR-IOV RoCE; RoCE is 28% faster on average]
Results – vSphere with PVRDMA
Runtime samples | TCP (VMXNET3) | PVRDMA      | Improvement
Average         | 129 seconds   | 99 seconds  | 23%
Min             | 127 seconds   | 96 seconds  | 24%
Max             | 132 seconds   | 101 seconds | 23%
Lower is better
[Chart: runtime in seconds (Average/Min/Max) for TCP (VMXNET3) vs. PVRDMA; PVRDMA is 23% faster on average]
High Performance Computing
Accelerating virtualized compute-intensive applications over a shared, high-performance fabric
HPC Workloads
o Scientific or technical workloads
  o Often floating-point intensive
  o Often storage intensive
  o Often parallel
  o Run on server-class systems

o Mechanical Design / Drafting
o Chemical Engineering
o Economics / Financial
o Weather
o Electronic Design Automation (EDA)
o Geosciences
o Defense
o Computer-Aided Engineering (CAE)
o Bioscience – Molecular Dynamics
o Government Lab
o University / Academic
MPI (Message Passing Interface)
o Molecular dynamics package designed for simulations of proteins, lipids, and nucleic acids
  • Simulates the Newtonian equations of motion for systems with hundreds to millions of particles
o Good for simulating the interaction of chemicals/polymers before actually mixing them!
o Runs on CPUs, GPUs
o Parallel execution via Message Passing Interface (MPI)
o Loads a molecular configuration from an initial file
  • Simulates the trajectory, or movement, of the atoms over time
GROMACS
GROningen MAchine for Chemical Simulations
Source: https://en.m.wikipedia.org/wiki/GROMACS, https://wiki.archlinux.org/index.php/GROMACS
Source: www.researchgate.net
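To give a feel for the per-timestep work (a toy sketch, not GROMACS code), the loop below is a velocity-Verlet integration of Newton's equations of motion with a placeholder force field; real MD codes add bonded/nonbonded force terms, neighbor lists, constraints, and MPI domain decomposition across nodes.

    # Velocity-Verlet: the core update an MD package performs every timestep.
    import numpy as np

    n, dt = 150_000, 2e-15          # particle count (as in ion_channel), 2 fs step
    pos = np.random.rand(n, 3)      # positions
    vel = np.zeros((n, 3))          # velocities
    mass = np.ones((n, 1))

    def forces(p):
        return -p                   # placeholder force field (harmonic well)

    f = forces(pos)
    for _ in range(10):             # a few integration steps
        vel += 0.5 * dt * f / mass  # half-kick
        pos += dt * vel             # drift
        f = forces(pos)             # recompute forces at the new positions
        vel += 0.5 * dt * f / mass  # half-kick with the new forces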
HPC Test Setup – RoCE SR-IOV/PVRDMA
Host:
o 2x Intel Xeon E5-2697 v3 @ 2.60 GHz (14 cores per socket)
o 256 GB RAM
o RDMA adapters – Mellanox ConnectX-5 100GbE
o ESXi experimental build (includes PVRDMA optimization #1)

VM:
o 20 vCPU, 200 GB RAM, CentOS 7.4 x64
o SR-IOV RoCE passthrough VF / PVRDMA
o Intel MPI
o Mellanox OFED Linux 4.4-1.0.0.0

GROMACS input
o ion_channel – pentameric chloride channel embedded in a lipid bilayer
o Simulation of 150,000 interacting atoms
o ns/day – nanoseconds of simulated time per day of wall-clock time
o Important for pharmaceutical applications

8 ESXi hosts, 1 VM per host
Mellanox SN2700 100GigE Spectrum Switch
Results – vSphere with RoCE SR-IOV
[Chart: ns/day vs. number of nodes (2, 4, 8) for SR-IOV TCP vs. SR-IOV RoCE]
Higher is better
#nodes | #processes | SR-IOV TCP (ns/day) | SR-IOV RoCE (ns/day)
1      | 20         | 10.13               | 10.13
2      | 40         | 17.25               | 18.44
4      | 80         | 27.80               | 32.81
8      | 160        | 44.49               | 71.60
Results – vSphere with PVRDMA
[Chart: ns/day vs. number of nodes (2, 4, 8) for TCP (VMXNET3) vs. PVRDMA; PVRDMA leads by 2x at 4 nodes and over 3x at 8 nodes]
Higher is better
#nodes | #processes | TCP VMXNET3 (ns/day) | PVRDMA (ns/day)
1      | 20         | 9.85                 | 9.85
2      | 40         | 9.28                 | 12.34
4      | 80         | 11.51                | 24.22
8      | 160        | 14.29                | 49.72
New categories of applications demand more network I/O
• Big Data, High Performance Computing, Machine Learning
• 100 GigE RDMA networks and beyond

VMware vSphere provides flexibility and high performance in sharing such fabrics
• Paravirtual RDMA
• Full passthrough
• SR-IOV
Summary
Virtualized data-intensive applications are accelerated through these technologies
• RDMA outperforms standard TCP for all these workloads
• Machine Learning workloads scale linearly with SR-IOV RoCE
• PVRDMA gives more than 90% of SR-IOV RoCE performance for Big Data analytics
• HPC workloads on PVRDMA improve by almost 3x vs. TCP
Key Takeaways
[Chart: performance normalized to SR-IOV RoCE (0–100%) for Spark and GROMACS, comparing PVRDMA, SR-IOV TCP, and TCP (VMXNET3)]
CTO2390BU Virtualize and Accelerate HPC/Big Data with SR-IOV, vGPU and RDMA
VIN2085BU Accelerating Performance of Mission-Critical Workloads with PVRDMA
HCI2476BU Tech Preview: RDMA and Next-Gen Storage Technologies for vSAN
VIN2062BU vSphere Networking: What’s New and What’s Next
CTO3693BUS Optimize your Virtualized Environment with Hardware Accelerators
Extreme RDMA Series – Las Vegas
PLEASE FILL OUT YOUR SURVEY. Take a survey and enter a drawing for a VMware company store gift card.
#vmworld #VAP2807BU
THANK YOU!
#vmworld #VAP2807BU
RDMA Operation
Asynchronous I/O
• Application memory management
• Zero-copy I/O for all operations

Main transport objects
• Queue Pair (QP)
  – Comprises Send and Receive queues
  – Services Work Request Entries (WQEs)
• Completion Queue (CQ)

Semantics
• Channel (message passing)
• RDMA (Write / Read / Atomics)
[Figure: Channel semantics – a Send WQE on the initiator's SendQ moves data from a send buffer in one address space to a receive buffer posted as a Recv WQE on the target's RecvQ, with completions (CQEs) reported on both sides' CQs. RDMA semantics – an RDMA-Write WQE moves the initiator buffer directly into the target buffer, with a CQE only on the initiator's CQ.]
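For completeness, a sketch of creating these transport objects with pyverbs, the Python bindings that ship with rdma-core (v13 or later, as listed earlier); the device name and the exact constructor arguments are assumptions to verify against the pyverbs documentation.

    # Create the main RDMA verbs objects: context, PD, CQ, MR, and an RC QP.
    import pyverbs.enums as e
    from pyverbs.device import Context
    from pyverbs.pd import PD
    from pyverbs.cq import CQ
    from pyverbs.mr import MR
    from pyverbs.qp import QP, QPCap, QPInitAttr

    ctx = Context(name="mlx5_0")                  # open the HCA (assumed name)
    pd = PD(ctx)                                  # protection domain
    cq = CQ(ctx, 128)                             # completion queue for CQEs
    mr = MR(pd, 4096, e.IBV_ACCESS_LOCAL_WRITE)   # register a 4 KB buffer

    # A reliable-connected Queue Pair: Send and Receive queues that accept WQEs.
    cap = QPCap(max_send_wr=64, max_recv_wr=64, max_send_sge=1, max_recv_sge=1)
    qp = QP(pd, QPInitAttr(qp_type=e.IBV_QPT_RC, scq=cq, rcq=cq, cap=cap))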