TRANSCRIPT
#vmworld
Interconnect Acceleration for Machine Learning, Big Data, and HPC
Adit Ranadive, VMware, Inc.
Aviad Shaul Yehezkel, Mellanox
VAP2807BU
#VAP2807BU
Disclaimer
This presentation may contain product features or functionality that are currently under development.
This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new features, functionality, or technology discussed or presented have not been determined.
Datacenter interconnect bandwidths are growing – 100 Gbps and beyond

Network devices with hardware-offloaded protocols
• Reduce CPU usage
• Access application data directly without involving the OS/hypervisor – lower latencies
Really – interconnect accelerators

Big Data, Machine Learning, High Performance Computing apps
• Speed up: benefit from increased network bandwidths and lower latencies
• Distributed workloads across VMs access a shared high-performance fabric

This talk aims to answer the following questions in a vSphere environment
• What are the hardware protocols for interconnect acceleration?
• How do we enable VM access to interconnect acceleration protocols?
• How do interconnect accelerators help with application performance?

Preamble
Accelerate my Interconnect – What and Why?
Agenda
Introduction to RDMA
Direct Device Access Technologies
Paravirtual RDMA (PVRDMA) in vSphere
Machine Learning over RDMA
Big Data over RDMA
HPC Applications over RDMA
Summary / Key Takeaways
Introduction to Remote Direct Memory Access (RDMA)
A hardware protocol to accelerate data transfers
A hardware transport protocol
o Optimized for moving data to/from memory and across interconnects
o Industry standard – InfiniBand Trade Association (IBTA)
o Examples: InfiniBand, RoCE (the focus of this talk)

Extreme performance
o 600 ns application-to-application latencies
o 100 Gbps throughput
o Negligible CPU overhead

RDMA use cases
o Storage (iSER, NFS-RDMA, NVMe-oF, Lustre)
o HPC (MPI, SHMEM)
o Big data and analytics (Hadoop, Spark)
o Machine learning (TensorFlow, Horovod)
Remote Direct Memory Access (RDMA)
Traditional network stack challenges
o Per-message / packet / byte overheads
o User-kernel crossings
o Memory copies

RDMA provides in hardware:
o Isolation between applications
o Transport
  o Packetizing messages
  o Reliable delivery
o Address translation

User-level networking
o Direct hardware access for the data path
o Hardware performs DMA to/from memory
How does RDMA achieve high performance?
[Figure: applications (AppA, AppB) in user space and kernel consumers (NVMe-oF, iSER) each own buffers that the RDMA-capable hardware – the Host Channel Adapter (HCA) – reads and writes directly, bypassing the kernel on the data path]
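To make the contrast concrete, here is a minimal sketch (not from the talk) of the traditional path described above: each call below is a user-kernel crossing, and the payload is copied between user and kernel socket buffers – exactly the per-message overhead that RDMA's user-level, zero-copy data path avoids.

    # Traditional kernel network stack: every send/recv is a syscall and a copy.
    import socket

    a, b = socket.socketpair()   # in-process stand-in for a TCP connection
    payload = b"x" * 64

    a.sendall(payload)           # user-kernel crossing; copy user -> kernel buffer
    data = b.recv(64)            # user-kernel crossing; copy kernel -> user buffer
    assert data == payload

    a.close(); b.close()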
o InfiniBand is a centrally managed, lossless, high-performance network architecture
  o Provides the specification for the RDMA transport
o RoCE adapts the efficient RDMA transport to run over Ethernet networks
  o Standard Ethernet management

RoCEv2 stack: RDMA transport over UDP/IP on the Ethernet link layer, with Ethernet/IP management
RoCEv2 packet: Eth L2 | IP | UDP | BTH+ | Payload | iCRC | FCS
RDMA over Converged Ethernet (RoCE)
InfiniBand stack: RDMA transport over the InfiniBand network and link layers, with InfiniBand management
InfiniBand packet: LRH | GRH | BTH+ | Payload | iCRC | vCRC
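As an illustration of the encapsulation (a sketch, not code from the talk), the snippet below packs the 12-byte IBTA Base Transport Header – the "BTH+" field shared by both packet formats – and notes the outer framing RoCEv2 adds. The field layout and the IANA-registered RoCEv2 UDP port 4791 are stated from the specs; verify the details against the IBTA documents before relying on them.

    # Pack an IBTA Base Transport Header (BTH): opcode(8) | SE/M/Pad/TVer(8) |
    # P_Key(16) | resv(8)+DestQP(24) | AckReq(1)+resv(7)+PSN(24) = 12 bytes.
    import struct

    def pack_bth(opcode, pkey, dest_qp, psn, solicited=False, ack_req=True):
        flags = 0x80 if solicited else 0    # SE bit; MigReq/PadCnt/TVer left 0
        word1 = dest_qp & 0xFFFFFF          # top 8 bits reserved
        word2 = ((1 << 31) if ack_req else 0) | (psn & 0xFFFFFF)
        return struct.pack("!BBHII", opcode, flags, pkey, word1, word2)

    bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12, psn=0)  # RC SEND-only
    assert len(bth) == 12
    # RoCEv2 frame: Eth L2 | IP | UDP (dst port 4791) | bth | payload | iCRC | FCS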
Direct Device Access Technologies
Accessing PCI devices from VMs with maximum performance
VM DirectPath I/O
Allows PCI devices to be accessed directly by the guest OS
• Examples: GPUs for computation (GPGPU), ultra-low-latency interconnects like InfiniBand and RoCE

Downsides: no vMotion, no snapshots, etc. (PVRDMA, available since vSphere 6.5 and discussed later, supports vMotion)
Full device is made available to a single VM – no sharing (PVRDMA allows sharing)
No ESXi driver required – just the standard vendor device driver
[Figure: DirectPath I/O – the application in the VM's guest OS kernel drives the PCI device directly, bypassing ESXi]
The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV)
A single PCI device can present as multiple logical devices (Virtual Functions, or VFs) to ESXi and to VMs
Downsides: No vMotion, No Snapshots (PVRDMA helps again!)
An ESXi driver and a guest driver are required for SR-IOV
Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RoCE interconnects
Device Partitioning (SR-IOV)
[Figure: SR-IOV – the guest driver in the VM accesses a Virtual Function (VF) directly while the Physical Function (PF) remains with ESXi; the vSwitch/VMXNET3 path is shown alongside]
Paravirtual RDMA (PVRDMA)
Accelerating VM data transfers while retaining all virtualization benefits
o VM
  o Exposes a virtual PCIe device
  o Guest support – a driver/library implements the RDMA APIs
o PVRDMA backend
  o Mediates guest access to the HCA
  o Exposes an RDMA resource space for each guest
  o Invokes ESXi RDMA APIs in response to guest RDMA operations
  o Physical RDMA resources created for each guest
o Physical HCA services all VMs on the host
PVRDMA Architecture
[Figure: VM1 and VM2 each run an app on the guest RDMA stack and PVRDMA driver; the PVRDMA backend in ESXi maps guest operations onto the ESXi RDMA stack and HCA driver, which service the physical HCA]
o Application registers the buffers it will use
  o PVRDMA registers these buffers with the HCA
o PVRDMA takes the application request (Work Request) to read/write a particular guest address and size
  o Issues the request to the HCA
  o No data is included here
o HCA does all the work of packetizing data to/from application memory
  o Bypasses the guest OS and hypervisor
  o Enables direct zero-copy data transfers in hardware!
Accelerating VM Data Transfers
[Figure: the application buffer in Guest OS 1 is moved by the HCA straight to the RDMA peer; API calls flow from the PVRDMA NIC through the ESXi RDMA stack and HCA device driver in the VMkernel, while the data transfer itself bypasses them]
o Challenges
  o Considerable connection state is contained within the RDMA hardware
  o The RDMA connection needs to be maintained before/during/after vMotion
o PVRDMA backend creates a management network across the cluster
  o Exchanges RDMA metadata between RDMA peers
  o Maintains the virtual RDMA connection between peers
  o Re-establishes the virtual and physical RDMA connections after the VM moves to a new host
o Support
  o vMotion
  o Snapshots
  o Suspend/Resume
  o High Availability
Supporting Virtualization Benefits
Drawbacks of the current approach
o Cannot connect to bare-metal hosts
o Suited only to small-scale systems
o Increased metadata exchange
o Decreased application performance
Improving PVRDMA Scalability
Future Improvements
o Shrink PVRDMA backend management functionality
  • Stop sending metadata updates between peers (evaluated in this talk)
  • Unify resource spaces between the backend and the HCA
o Create the exact hardware RDMA state for the VM after vMotion
  • Specific RDMA APIs to be added
  • RDMA hardware must keep the connection alive for the duration of vMotion
o Allows connection to bare-metal hosts
  • While supporting vMotion, Snapshots, and High Availability!
o Configure ESXi host for PVRDMA
o Add a PVRDMA device to the Linux VM through vCenter
o Install the PVRDMA driver and library
  o OFED 4.8.2
  o Upstream: Linux 4.10 or later, rdma-core v13 or later
  o Inbox: RHEL 7.5, SLES 15, Ubuntu 18.04
o PVRDMA support is now part of most Linux distributions
Enabling PVRDMA for a VM
vSphere 6.5 and later
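Once the driver and library are installed, a quick guest-side check is to list the RDMA devices the kernel has registered – a minimal sketch assuming a Linux guest (the standard ibv_devinfo tool from the library reports the same information in more detail):

    # RDMA devices registered by the guest kernel appear under this sysfs path;
    # with the PVRDMA driver loaded, one entry shows up per PVRDMA adapter.
    import os

    ib_sysfs = "/sys/class/infiniband"
    devices = os.listdir(ib_sysfs) if os.path.isdir(ib_sysfs) else []
    print("RDMA devices:", ", ".join(devices) or "none found")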
Machine Learning
Accelerating training time with a scale-out architecture over RDMA networks
Machine Learning Is Everywhere!
Fraud Detection
What Is Machine Learning?

Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, “uses statistical techniques to give computers the ability to learn with data, without being explicitly programmed.”
Source: https://en.wikipedia.org/wiki/Machine_learning
Driven by Deep Neural Networks (DNN)
• A subset of Artificial Neural Networks (ANN)

Deep Learning
Deep Learning is a subfield of machine learning concerned with algorithms, inspired by the structure and function of the brain, called artificial neural networks
Source: http://machinelearningmastery.com/what-is-deep-learning/
Training and Inference
Source: Mellanox Technologies
Deep Learning allows difficult problems to be solved
o In some cases, problems that can’t be solved in other ways

Deep Learning is not new
o 1943: “A logical calculus of the ideas immanent in nervous activity”, McCulloch & Pitts

So why now?
o Infrastructure: recent developments in GPU and network technology make the approach practical
o Data: more data is generated than ever – critical for the training process
o Software: a wave of open-source machine learning frameworks
Why Deep Learning And Why Now?
• Sonar: ~10–100 KB/s
• Camera: ~20–40 MB/s
• GPS: ~50 KB/s
• Radar: ~10–100 KB/s
• LIDAR (Light Detection & Ranging): ~10–70 MB/s

Data will grow by a factor of 10 over the next decade, to 160 zettabytes in 2025 (source: IDC)
Faster data processing requires faster interconnect speeds

Data is Growing Faster than Ever
An autonomous vehicle generates 4,000 GB per day
Source: Cruise RP-1/YouTube
Neural Networks Complexity Growth
[Figure: speech recognition models grew ~30x in complexity from DeepSpeech (2014) through DeepSpeech-2 to DeepSpeech-3 (2017); image recognition models grew ~350x from AlexNet (2013) through GoogleNet, ResNet, Inception-V2, Inception-V4, and PolyNet (2016)]
Source: Mellanox Technologies
Training with large data sets and networks can take a long time
o In some cases, even weeks

In many cases training needs to happen frequently
o Model development and tuning
o Real-life use cases may require regular retraining

Accelerate training time by scaling out the architecture
o Add workers (nodes) to reduce training time

Popular types of parallelism
o Data parallelism
o Model parallelism
Training Challenges
The network is a critical element in accelerating distributed training!
Model and Data Parallelism
[Figure: Data parallelism – each worker trains a full local model on its own mini-batch of the data and synchronizes through a main model, parameter server, or allreduce. Model parallelism – the model itself is partitioned across workers.]
Source: Mellanox Technologies
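To make the data-parallel pattern concrete, here is a minimal sketch (not from the talk) of synchronous gradient averaging with mpi4py; when the underlying MPI library is built with verbs/RDMA support, the Allreduce below runs over the RDMA fabric. The array size is an arbitrary stand-in.

    # Data parallelism in one step: each rank computes gradients on its own
    # mini-batch, then all ranks average them with an MPI allreduce.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    local_grad = np.random.rand(1024)     # stand-in gradient from this rank's mini-batch

    avg_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)  # sum across all workers
    avg_grad /= comm.Get_size()                       # average

    # Every worker now applies the same averaged update to its local model copy.

Launched across nodes with, e.g., mpirun -np 8 python step.py (filename hypothetical).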
RDMA and GPUDirect Accelerate Distributed Training
GPUDirect RDMA Technology
With GPUDirect RDMA in vSphere we get ~100 Gbps of RDMA bandwidth
TensorFlow: several implementations upstream
o Native (verbs) – https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs
o MPI, Horovod – donated by Uber, among others
Caffe2: Over MPI or Gloo library
Microsoft Cognitive Toolkit: Native support
NVIDIA NCCL2: Native support in NCCL
All Major Machine Learning Frameworks Support RDMA
Distributed training framework for TensorFlow
Inspired by work of Baidu, Facebook, et al.
Uses bandwidth-optimal communication protocols
o Makes use of RDMA (RoCE, InfiniBand) if available
Seamlessly installs on top of TensorFlow via ‘pip install horovod’
Horovod
Source: Horovod: fast and easy distributed deep learning in TensorFlow – https://arxiv.org/pdf/1802.05799.pdf
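A minimal training-loop sketch, assembled from Horovod's documented TF1-era API to match the TensorFlow 1.9 stack used in the tests below; the toy regression model is a stand-in:

    # Horovod + TensorFlow: wrap the optimizer so gradients are averaged with
    # allreduce (over RDMA when available), and pin one GPU per worker process.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()                                          # discover ranks (via MPI)

    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())  # scale LR by worker count
    train_op = hvd.DistributedOptimizer(opt).minimize(loss)     # allreduce gradients

    hooks = [hvd.BroadcastGlobalVariablesHook(0)]       # rank 0 seeds all workers
    with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
        for _ in range(100):
            sess.run(train_op, feed_dict={
                x: np.random.rand(32, 10).astype(np.float32),
                y: np.random.rand(32, 1).astype(np.float32)})

Run across nodes with mpirun, e.g. mpirun -np 8 python train.py.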
Horovod Setup – SR-IOV
8 Hosts, 1 VM per Host, 1 GPU (Full Passthrough) per VM
Mellanox SN2700 100GigE Spectrum Switch
Host:
o 2x Intel Xeon E5-2650 v4 @ 2.20 GHz (12 cores per socket)
o 256 GB RAM
o Mellanox ConnectX-5 100GbE (SR-IOV RoCE capable)
o 8x Nvidia Tesla P100 GPU (PCI, 16 GB)
o ESXi 6.7 GA
o ConnectX driver 4.17.13.8

VM:
o 12 vCPU
o 48 GB memory
o Ubuntu 16.04 x64
o Passthrough: SR-IOV RoCE VF + Nvidia Tesla P100 GPU
o MLNX OFED 4.4-2.0.7.0
o TensorFlow 1.9
o CUDA SDK 9.0
o cuDNN 7.0
o Horovod – master from GitHub
o Docker v18.09
Horovod Performance – VGG-16 Results
[Chart: speedup (x) vs. number of GPUs (2–8) for TCP, SR-IOV TCP, and SR-IOV RoCE against ideal linear scaling; SR-IOV RoCE scales nearly linearly]
Spark Big Data Analytics
Accelerating time to solution with a shared, high-performance interconnect
Spark is a data analysis platform with implicit data parallelism and fault-tolerance
Initial release: May 2014
Originally developed at UC Berkeley’s AMPLab
Donated as open source to the Apache Software Foundation
Most active Apache open source project
50% of production systems are in public clouds
Notable users: [company logos]
Apache Spark: Quick Facts
A programming model for processing big data sets in a distributed manner
Comprises 3 stages
o Map
o Shuffle
o Reduce
Map-Reduce
Applications frequently reuse data in a Map-Reduce pipeline
o Iterative algorithms (e.g., machine learning, graphs)
o Interactive data mining and streaming

Persisting each iteration in stable storage is inefficient

The Spark solution: Resilient Distributed Datasets (RDDs)
o In-memory data representation
o Preserves and enhances the appealing properties of Map-Reduce:
  o Fault tolerance
  o Data locality
  o Scalability
o Reuses the in-memory data set in each iteration
o Mostly network I/O – a perfect match for RDMA!
Spark and Map-Reduce
[Figure: an iterative pipeline of steps reusing in-memory RDDs between iterations]
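For illustration (a minimal sketch, not from the talk), the classic word count below shows the three stages in Spark: reduceByKey is what triggers the shuffle, the network-heavy stage that RDMA-accelerated shuffle plugins target.

    # Map-Shuffle-Reduce on RDDs with PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")
    lines = sc.parallelize(["spark over rdma", "rdma over converged ethernet"])

    counts = (lines.flatMap(lambda line: line.split())  # Map: line -> words
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))    # Shuffle + Reduce
    print(counts.collect())
    sc.stop()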
Host:
o 2x Intel Xeon E5-2697 v3 @ 2.60 GHz (14 cores per socket)
o 256 GB RAM
o RDMA adapters – Mellanox ConnectX-5 100GbE (PFC enabled)
o ESXi experimental build (includes PVRDMA optimization #1)

VM:
o 20 vCPU, 200 GB RAM, CentOS 7.4 x64
o SR-IOV RoCE passthrough VF / PVRDMA
o Mellanox OFED Linux 4.4-1.0.0.0
o Spark benchmark – HiBench/TeraSort over Mellanox SparkRDMA
o Hadoop 2.4.0

Name Node host:
o 2x Intel Xeon E5-2680 0
o Mellanox ConnectX-5 100GbE

Name Node server VM:
o 12 vCPU, 64 GB RAM, CentOS 7.4
o SR-IOV RoCE passthrough VF / PVRDMA
Spark Test Setup – RoCE SR-IOV/PVRDMA
8 ESXi hosts, 1 Spark VM per host
1 server used as Name Node
Mellanox SN2700 100GigE Spectrum Switch
90 GB TeraSort
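For reference, enabling the Mellanox SparkRDMA shuffle plugin (https://github.com/Mellanox/SparkRDMA) amounts to swapping Spark's shuffle manager and putting the plugin jar on the classpath. A hypothetical sketch – the jar path is a placeholder, and the exact settings should be taken from the SparkRDMA README:

    # Replace the default shuffle manager with the RDMA implementation.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("terasort-rdma")
             .config("spark.shuffle.manager",
                     "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
             .config("spark.driver.extraClassPath", "/opt/spark-rdma.jar")    # placeholder path
             .config("spark.executor.extraClassPath", "/opt/spark-rdma.jar")  # placeholder path
             .getOrCreate())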
Results – vSphere with RoCE SR-IOV
Runtime samples | SR-IOV TCP  | SR-IOV RDMA | Improvement
Average         | 127 seconds | 91 seconds  | 28%
Min             | 126 seconds | 88 seconds  | 30%
Max             | 130 seconds | 96 seconds  | 26%
Lower is better
[Chart: runtime in seconds (Average/Min/Max) for SR-IOV TCP vs. SR-IOV RoCE; RoCE is 28% faster on average]
Results – vSphere with PVRDMA
Runtime samples | TCP (VMXNET3) | PVRDMA      | Improvement
Average         | 129 seconds   | 99 seconds  | 23%
Min             | 127 seconds   | 96 seconds  | 24%
Max             | 132 seconds   | 101 seconds | 23%
Lower is better
[Chart: runtime in seconds (Average/Min/Max) for TCP (VMXNET3) vs. PVRDMA; PVRDMA is 23% faster on average]
High Performance Computing
Accelerating virtualized compute-intensive applications over a shared, high-performance fabric
HPC Workloads
o Scientific or technical workloads
  o Often floating-point intensive
  o Often storage intensive
  o Often parallel
  o Run on server-class systems

o Mechanical Design / Drafting
o Chemical Engineering
o Economics / Financial
o Weather
o Electronic Design Automation (EDA)
o Geosciences
o Defense
o Computer-Aided Engineering (CAE)
o Bioscience – Molecular Dynamics
o Government Lab
o University / Academic
MPI (Message Passing Interface)
o Molecular dynamics package designed for simulations of proteins, lipids, and nucleic acids
  • Simulates the Newtonian equations of motion for systems with hundreds to millions of particles
o Good for simulating the interaction of chemicals/polymers before actually mixing them!
o Runs on CPUs, GPUs
o Parallel execution via Message Passing Interface (MPI)
o Loads a molecular configuration from an initial file
  • Simulates the trajectory, or movement, of the atoms over time
GROMACS
GROningen MAchine for Chemical Simulations
Source: https://en.m.wikipedia.org/wiki/GROMACS, https://wiki.archlinux.org/index.php/GROMACS
Source: www.researchgate.net
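To give a feel for the per-timestep work (a toy sketch, not GROMACS code), the loop below is a velocity-Verlet integration of Newton's equations of motion with a placeholder force field; real MD codes add bonded/nonbonded force terms, neighbor lists, constraints, and MPI domain decomposition across nodes.

    # Velocity-Verlet: the core update an MD package performs every timestep.
    import numpy as np

    n, dt = 150_000, 2e-15          # particle count (as in ion_channel), 2 fs step
    pos = np.random.rand(n, 3)      # positions
    vel = np.zeros((n, 3))          # velocities
    mass = np.ones((n, 1))

    def forces(p):
        return -p                   # placeholder force field (harmonic well)

    f = forces(pos)
    for _ in range(10):             # a few integration steps
        vel += 0.5 * dt * f / mass  # half-kick
        pos += dt * vel             # drift
        f = forces(pos)             # recompute forces at the new positions
        vel += 0.5 * dt * f / mass  # half-kick with the new forces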
HPC Test Setup – RoCE SR-IOV/PVRDMA
Host:
o 2x Intel Xeon E5-2697 v3 @ 2.60 GHz (14 cores per socket)
o 256 GB RAM
o RDMA adapters – Mellanox ConnectX-5 100GbE
o ESXi experimental build (includes PVRDMA optimization #1)

VM:
o 20 vCPU, 200 GB RAM, CentOS 7.4 x64
o SR-IOV RoCE passthrough VF / PVRDMA
o Intel MPI
o Mellanox OFED Linux 4.4-1.0.0.0

GROMACS input
o ion_channel – pentameric chloride channel embedded in a lipid bilayer
o Simulation of 150,000 interacting atoms
o ns/day – nanoseconds of simulated time per day of wall-clock time
o Important for pharmaceutical applications

8 ESXi hosts, 1 VM per host
Mellanox SN2700 100GigE Spectrum Switch
Results – vSphere with RoCE SR-IOV
[Chart: ns/day vs. number of nodes (2, 4, 8) for SR-IOV TCP vs. SR-IOV RoCE]
Higher is better
#nodes | #processes | SR-IOV TCP (ns/day) | SR-IOV RoCE (ns/day)
1      | 20         | 10.13               | 10.13
2      | 40         | 17.25               | 18.44
4      | 80         | 27.80               | 32.81
8      | 160        | 44.49               | 71.60
Results – vSphere with PVRDMA
[Chart: ns/day vs. number of nodes (2, 4, 8) for TCP (VMXNET3) vs. PVRDMA; PVRDMA leads by 2x at 4 nodes and over 3x at 8 nodes]
Higher is better
#nodes | #processes | TCP VMXNET3 (ns/day) | PVRDMA (ns/day)
1      | 20         | 9.85                 | 9.85
2      | 40         | 9.28                 | 12.34
4      | 80         | 11.51                | 24.22
8      | 160        | 14.29                | 49.72
New categories of applications demand more network I/O
• Big Data, High Performance Computing, Machine Learning
• 100 GigE RDMA networks and beyond

VMware vSphere provides flexibility and high performance in sharing such fabrics
• Paravirtual RDMA
• Full passthrough
• SR-IOV
Summary
Virtualized data-intensive applications are accelerated through these technologies
• RDMA outperforms standard TCP for all these workloads
• Machine Learning workloads scale linearly with SR-IOV RoCE
• PVRDMA gives more than 90% of SR-IOV RoCE performance for Big Data analytics
• HPC workloads on PVRDMA improve by almost 3x vs. TCP
Key Takeaways
[Chart: performance normalized to SR-IOV RoCE (0–100%) for Spark and GROMACS, comparing PVRDMA, SR-IOV TCP, and TCP (VMXNET3)]
CTO2390BU Virtualize and Accelerate HPC/Big Data with SR-IOV, vGPU and RDMA
VIN2085BU Accelerating Performance of Mission-Critical Workloads with PVRDMA
HCI2476BU Tech Preview: RDMA and Next-Gen Storage Technologies for vSAN
VIN2062BU vSphere Networking: What’s New and What’s Next
CTO3693BUS Optimize your Virtualized Environment with Hardware Accelerators
Extreme RDMA Series – Las Vegas
PLEASE FILL OUT YOUR SURVEY. Take a survey and enter a drawing for a VMware company store gift card.
#vmworld #VAP2807BU
THANK YOU!
#vmworld #VAP2807BU
RDMA Operation
Asynchronous I/O
• Application memory management
• Zero-copy I/O for all operations

Main transport objects
• Queue Pair (QP)
  – Comprises Send and Receive queues
  – Services Work Request Entries (WQEs)
• Completion Queue (CQ)

Semantics
• Channel (message passing)
• RDMA (Write / Read / Atomics)
[Figure: Channel semantics – a Send WQE on the initiator's SendQ moves data from a send buffer in one address space to a receive buffer posted as a Recv WQE on the target's RecvQ, with completions (CQEs) reported on both sides' CQs. RDMA semantics – an RDMA-Write WQE moves the initiator buffer directly into the target buffer, with a CQE only on the initiator's CQ.]
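For completeness, a sketch of creating these transport objects with pyverbs, the Python bindings that ship with rdma-core (v13 or later, as listed earlier); the device name and the exact constructor arguments are assumptions to verify against the pyverbs documentation.

    # Create the main RDMA verbs objects: context, PD, CQ, MR, and an RC QP.
    import pyverbs.enums as e
    from pyverbs.device import Context
    from pyverbs.pd import PD
    from pyverbs.cq import CQ
    from pyverbs.mr import MR
    from pyverbs.qp import QP, QPCap, QPInitAttr

    ctx = Context(name="mlx5_0")                  # open the HCA (assumed name)
    pd = PD(ctx)                                  # protection domain
    cq = CQ(ctx, 128)                             # completion queue for CQEs
    mr = MR(pd, 4096, e.IBV_ACCESS_LOCAL_WRITE)   # register a 4 KB buffer

    # A reliable-connected Queue Pair: Send and Receive queues that accept WQEs.
    cap = QPCap(max_send_wr=64, max_recv_wr=64, max_send_sge=1, max_recv_sge=1)
    qp = QP(pd, QPInitAttr(qp_type=e.IBV_QPT_RC, scq=cq, rcq=cq, cap=cap))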