
Page 1: SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol

© 2019 Mellanox Technologies

Gil Bloch | Lugano

SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol

April, 2019

Page 2:

Exponential Growth of Data & Real-Time Analytics Requires Mellanox High-Throughput, Low-Latency, Smart Interconnect Solutions

Unleashing the Power of Data

Page 3:

Mellanox Accelerates Leading HPC and AI Systems

World's Top 3 Supercomputers

1. Summit CORAL System: World's Fastest HPC / AI System, 9.2K InfiniBand Nodes

2. Sierra CORAL System: #2 USA Supercomputer, 8.6K InfiniBand Nodes

3. Wuxi Supercomputing Center: Fastest Supercomputer in China, 41K InfiniBand Nodes

Page 4:

The Need for Intelligent and Faster Interconnect

CPU-Centric (Onload) Data-Centric (Offload)

Must Wait for the Data; Creates Performance Bottlenecks

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

[Diagram: "Onload Network" vs. "In-Network Computing" – CPUs and GPUs at the endpoints of each fabric]

Analyze Data as it Moves! Higher Performance and Scale

Page 5:

In-Network Computing to Enable Data-Centric Data Centers

[Diagram: data-centric data center built on In-Network Computing technologies – GPUDirect RDMA, Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), NVMe over Fabrics]

Page 6:

Accelerating All Levels of HPC/AI Frameworks

Application Framework
▪ Data Analysis
▪ Configurable Logic

Communication Framework
▪ SHARP – Data Aggregation
▪ MPI Tag Matching
▪ MPI Rendezvous
▪ SNAP - Software Defined Virtual Devices

Network Framework
▪ Network Transport Offload
▪ RDMA and GPU-Direct
▪ SHIELD (Self-Healing Network)
▪ Adaptive Routing and Congestion Control

Connectivity Framework
▪ Multi-Host
▪ Enhanced Topologies
▪ Dragonfly+

Page 7:

Accelerating HPC and AI Applications

Accelerating HPC Applications
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Enabling Artificial Intelligence Solutions to Perform Critical and Timely Decision Making
▪ Accelerating distributed machine learning

Page 8:

© 2018 Mellanox Technologies | Confidential

AllReduce Example – Trees

▪ Many-to-one and one-to-many traffic patterns – possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a higher tree / larger radix
▪ Result distribution – over the tree / multicast (MC)

[Diagram: two-stage switch tree (Stage 1, Stage 2) with endnodes at the leaves]
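The tree-based AllReduce above can be sketched as a host-side simulation: values are combined stage by stage up a radix-k switch tree, and the final result is distributed back to every endnode. This is a minimal illustration in Python; the function name and `radix` parameter are invented for the sketch and are not part of SHARP's API.

```python
from functools import reduce

def tree_allreduce(values, radix=2, op=lambda a, b: a + b):
    """Simulate tree-based AllReduce: reduce values up a radix-k tree of
    switches, then distribute (multicast) the result to every endnode."""
    level = list(values)
    # Reduction phase: each switch stage combines up to `radix` children.
    while len(level) > 1:
        level = [reduce(op, level[i:i + radix])
                 for i in range(0, len(level), radix)]
    result = level[0]
    # Distribution phase: the result is multicast back to all endnodes.
    return [result] * len(values)
```

A larger radix means fewer stages (a shallower tree), which is the slide's point about large scale requiring a larger radix.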

Page 9:

AllReduce (Example) - Recursive Doubling

▪ The data is recursively divided, processed by the CPUs, and distributed
▪ The ranks' CPUs are occupied performing the reduce algorithm
▪ The data is sent at least twice, consuming at least 2x the bandwidth

[Diagram: recursive doubling across Ranks 1–4 in four steps, exchanging ½ and then ¼ of the data; calculation and result-sending phases shown]
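A minimal host-side simulation of recursive doubling, assuming a power-of-two rank count. For simplicity this sketch exchanges the full buffer in each step; the variant on the slide additionally halves the data per step (reduce-scatter style), which changes the bandwidth cost but not the pairing pattern.

```python
def recursive_doubling_allreduce(bufs, op=lambda a, b: a + b):
    """Simulate recursive-doubling AllReduce: in step s, each rank r
    exchanges with rank r XOR 2^s and combines, so after log2(p) steps
    every rank holds the full result."""
    p = len(bufs)
    assert p & (p - 1) == 0, "power-of-two rank count assumed"
    bufs = list(bufs)
    step = 1
    while step < p:
        # Snapshot, then pairwise exchange-and-combine for this step.
        nxt = [op(bufs[r], bufs[r ^ step]) for r in range(p)]
        bufs = nxt
        step <<= 1
    return bufs
```

Note that every rank's CPU does reduction work in every step, which is exactly the CPU occupancy the slide criticizes.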

Page 10:

Which Offload Should We Suggest?

▪ Switches can aggregate the data while it is going through the network
▪ Reduces the amount of data moving through the network
▪ Reduces latency because data travels a shorter path
▪ Frees up the CPU/GPU – the operation is fully offloaded

Page 11:

Scalable Hierarchical Aggregation Protocol

Reliable, Scalable, General-Purpose Primitive, Applicable to Multiple Use-cases
▪ In-network tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation

Accelerating HPC applications
▪ Scalable High-Performance Collective Offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and Floating-Point, 16 / 32 / 64 bit
▪ Up to 1KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Accelerating Machine Learning applications
▪ Prevent the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA

[Diagram: SHArP tree with a tree root, aggregation nodes, and endnodes (processes running on HCAs)]

Page 12:

HCOLL: SHARP vs No-SHARP

[Diagram: recursive doubling (Step 1, Step 2) compared with a single SHARP in-network aggregation step]

Page 13:

SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables 75% Reduction in Latency

Providing Scalable Flat Latency

Scalable Hierarchical Aggregation and Reduction Protocol

Page 14:

SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)

SHARP Enables Highest Performance

Scalable Hierarchical Aggregation and Reduction Protocol

Page 15:

Scalable Hierarchical Aggregation Protocol

▪ SHARP Tree is a Logical Construct
▪ Nodes in the SHArP tree are IB endnodes
▪ Logical tree defined on top of the physical underlying fabric
▪ SHArP tree links are implemented on top of the IB transport (Reliable Connection)
▪ Expected to follow the physical topology for performance, but not required

▪ SHARP Operations are Executed by a SHARP Tree
▪ Multiple SHArP trees are supported
▪ Each SHArP tree can handle multiple outstanding SHArP operations
▪ Within a SHArP tree, each operation is uniquely identified by a SHArP-tuple:
▪ GroupID
▪ SequenceNumber

[Diagram: physical topology (switches/routers, HCAs) with one SHArP tree overlaid, showing the tree root, aggregation nodes, and endnodes (processes running on HCAs)]
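The identification scheme above can be illustrated with a small sketch. GroupID and SequenceNumber come from the slide; the class and method names are invented for illustration and say nothing about the real wire format or driver API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SharpTuple:
    """Uniquely identifies one operation within a SHArP tree."""
    group_id: int
    sequence_number: int

@dataclass
class SharpTree:
    """Tracks multiple outstanding operations, keyed by SHArP-tuple."""
    outstanding: dict = field(default_factory=dict)

    def start(self, t: SharpTuple, payload):
        # Tuples must be unique among outstanding operations on this tree.
        if t in self.outstanding:
            raise ValueError(f"duplicate operation {t}")
        self.outstanding[t] = payload

    def complete(self, t: SharpTuple):
        return self.outstanding.pop(t)
```

Keying by (GroupID, SequenceNumber) is what lets a single tree interleave many outstanding operations from different groups without confusion.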

Page 16:

SHARP Principles of Operation - Request

[Diagram: aggregation requests flow from the endnodes up the SHArP tree toward the tree root]

Page 17:

SHARP Principles of Operation – Response

[Diagram: the aggregation response is distributed from the SHArP tree root back down to the endnodes]
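The request/response flow of the two diagrams can be summarized in a short simulation: requests carry partial results up the logical tree, each aggregation node combines its children's contributions, and the response fans back out from the root to every endnode. This is a sketch; the tree encoding and function name are illustrative.

```python
def sharp_aggregate(tree, leaf_data, op=lambda a, b: a + b):
    """Simulate SHARP request/response flow on a logical tree.
    `tree` maps each node name to its children ([] for endnodes)."""
    def up(node):
        kids = tree[node]
        if not kids:                      # endnode: contribute local data
            return leaf_data[node]
        partial = None                    # aggregation node: combine children
        for k in kids:
            v = up(k)
            partial = v if partial is None else op(partial, v)
        return partial
    result = up("root")
    # Response phase: the root distributes the final result to all endnodes.
    return {leaf: result for leaf in leaf_data}
```

Each link carries only one (aggregated) value per operation, which is how the in-network tree avoids the many-to-one traffic pattern.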

Page 18:

GPU Direct™ RDMA
▪ Network adapter can directly read data from GPU device memory
▪ Avoids copies through the host
▪ Eliminates CPU bandwidth and latency bottlenecks
▪ Uses remote direct memory access (RDMA) transfers between GPUs
▪ Resulting in significantly improved MPI Send/Recv efficiency between GPUs in remote nodes
▪ Fastest possible communication between GPU and other PCI-E devices
▪ Allows for better asynchronous communication

With GPUDirect™ RDMA

Using PeerDirect™

Page 19:

SHARP & GPUDirect Performance Advantage for AI
▪ TensorFlow Horovod running ResNet50 benchmark
▪ E5-2650V4, 12 cores @ 2.2GHz, 30M L2 cache, 9.6GT QPI, 256GB RAM: 16 x 16 GB DDR4
▪ P100 NVIDIA GPUs, ConnectX-6 HCA, IB Quantum Switch (EDR speed)
▪ RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0

[Chart: 16% and 12% performance improvements]

Page 20:

SHARP SW Overview

Page 21:

Mellanox HPC-X™ Scalable HPC Software Toolkit

▪ Complete MPI, PGAS OpenSHMEM and UPC package

▪ Maximize application performance

▪ For commercial and open source applications

▪ Best out of the box experience

Page 22:

Mellanox HPC-X™ Scalable HPC Software Toolkit

▪ Allow fast and simple deployment of HPC libraries
▪ Both Stable & Latest Beta are bundled
▪ All libraries are pre-compiled
▪ Includes scripts/module files to ease deployment

▪ Package Includes

▪ OpenMPI / OpenSHMEM

▪ BUPC (Berkeley UPC)

▪ UCX

▪ HCOLL (FCA)

▪ SHARP

▪ KNEM

▪ Allows fast intra-node MPI communication for large messages

▪ Profiling Tools

▪ Libibprof

▪ IPM

▪ Standard Benchmarks

▪ OSU

▪ IMB

Page 23:

HPC-X/SHARP SW Architecture

▪ HCOLL
▪ Optimized collective library
▪ Easy to integrate with multiple MPIs (OpenMPI, MPICH, MVAPICH*)

▪ libsharp.so
▪ Implementation of the low-level SHARP API

▪ libsharp_coll.so
▪ Implementation of the high-level SHARP API enabling SHARP collectives for MPI
▪ Uses the low-level libsharp.so API
▪ Easy to integrate with multiple MPIs (OpenMPI, MPICH, MVAPICH*)

[Stack diagram: MPI (OpenMPI) → HCOLL (libhcoll) → SHARP (libsharp/libsharp_coll) → InfiniBand Network]

Page 24:

SHARP Software Architecture

[Diagram: compute nodes each run an MPI process and a SHARPD daemon; aggregation nodes form the tree; the Subnet Manager and the Aggregation Manager (AM) manage the fabric]

Page 25:

SHARP: Configuring Subnet Manager

▪ Edit the opensm.conf file.
▪ Set the parameter "sharp_enabled" to "2".
▪ Run OpenSM with the configuration file:
  % opensm -F <opensm configuration file> -B
▪ Verify that the Aggregation Nodes were activated by the OpenSM: run "ibnetdiscover". For example:

vendid=0x0
devid=0xcf09
sysimgguid=0x7cfe900300a5a2a0
caguid=0x7cfe900300a5a2a8
Ca 1 "H-7cfe900300a5a2a8" # "Mellanox Technologies Aggregation Node"
[1](7cfe900300a5a2a8) "S-7cfe900300a5a2a0"[37] # lid 256 lmc 0 "MF0;sharp2:MSB7800/U1" lid 512 4xFDR

Page 26:

NVIDIA Collective Communication Library (NCCL)

▪ Used by HCOLL
▪ Used by many Deep Learning frameworks (TensorFlow/Horovod, PyTorch, MXNet, Chainer, …)

▪ NCCL 2.4
▪ Hierarchical tree algorithm

Page 27:

NCCL Ring

[Diagram: a ring spanning the GPUs of all nodes, each node attached to the network fabric through its NIC]

▪ Simple
▪ Full (half?) bandwidth
▪ Linear latency
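The ring algorithm can be sketched as a reduce-scatter followed by an all-gather, which is roughly how a ring achieves full bandwidth at the cost of latency that grows linearly with the number of ranks. This is a host-side simulation for illustration, not NCCL code.

```python
def ring_allreduce(bufs, op=lambda a, b: a + b):
    """Simulate ring AllReduce: each of p ranks splits its buffer into p
    chunks; p-1 reduce-scatter steps, then p-1 all-gather steps."""
    p = len(bufs)
    chunks = [list(b) for b in bufs]      # chunks[r][c]: rank r's chunk c
    # Reduce-scatter: in step s, rank r sends chunk (r - s) to rank r + 1,
    # which folds it into its own copy.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, chunks[r][(r - s) % p]) for r in range(p)]
        for r, c, v in sends:
            dst = (r + 1) % p
            chunks[dst][c] = op(chunks[dst][c], v)
    # Now rank r holds the fully reduced chunk (r + 1) % p.
    # All-gather: circulate the completed chunks around the ring.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, chunks[r][(r + 1 - s) % p])
                 for r in range(p)]
        for r, c, v in sends:
            chunks[(r + 1) % p][c] = v
    return chunks
```

Each rank sends and receives 2(p-1)/p of the buffer in total, which is why the ring is bandwidth-optimal but needs 2(p-1) latency-bound steps.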

Page 28:

NCCL Tree

[Diagram: intra-node chains of GPUs, with the node leaders joined in a tree across the network fabric through their NICs]

▪ Support added in NCCL-2.4
▪ Keep intra-node chain
▪ Node leaders participate in tree
▪ Binary double tree
▪ Multiple rings -> multiple trees
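A single tree round (reduce to the root, then broadcast) among the node leaders can be simulated as below. NCCL 2.4 actually runs two complementary binary trees ("double binary tree") so both link directions stay busy, but one tree suffices to show the idea; this sketch uses an implicit heap layout and is purely illustrative.

```python
def binary_tree_allreduce(values, op=lambda a, b: a + b):
    """Simulate binary-tree AllReduce among node leaders: fold children
    into parents up to rank 0, then broadcast the result back down."""
    p = len(values)
    acc = list(values)
    # Reduce phase: in a heap layout, node i's parent is (i - 1) // 2;
    # folding from the last node upward reduces everything into acc[0].
    for i in range(p - 1, 0, -1):
        parent = (i - 1) // 2
        acc[parent] = op(acc[parent], acc[i])
    # Broadcast phase: the root's result propagates to every rank.
    return [acc[0]] * p
```

Compared with the ring, the tree needs only O(log p) steps, which is why it wins at scale for latency-bound (small-message) allreduces.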

Page 29:

NCCL 2.4

nccl-tests allreduce - 4096 nodes, 6x V100, 2x IB EDR (Summit/Oak Ridge National Laboratory)

[Chart: allreduce performance with NCCL 2.4 trees vs. rings]

S9656 - Distributed Training and Fast inter-GPU Communication with NCCL

Page 30:

NCCL SHARP

[Diagram: SHARP aggregation performed in the network fabric, nodes attached through their NICs]

Page 31:

NCCL SHARP

▪ Collective network plugin
▪ Replace inter-node tree with SHARP tree
▪ Keeps intra-node ring
▪ Aggregation in network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x BW compared to NCCL-TREE

SHARP Enables 2X Higher Data Throughput for NCCL

[Chart: "NCCL Allreduce Bandwidth - 1 Ring", bandwidth (GB/s, 0–25) vs. message size (8–2048 MB); SHARP vs. Ring vs. Tree]

Page 32:

NCCL-SHARP Performance – DL Training

[Chart: ResNet50 training throughput (images/sec): NCCL 18755 vs. NCCL-SHARP 20456]

System Configuration: (4) HPE Apollo 6500 systems configured with (8) NVIDIA Tesla V100 SXM2 16GB, (2) HPE DL360 Gen10 Intel Xeon-Gold 6134 (3.2 GHz/8-core/130 W) CPUs, (24) DDR4-2666 CAS-19-19-19 Registered Memory Modules, HPE 1.6 TB NVMe SFF (2.5") SSD, ConnectX-6 HCA, IB Quantum Switch (EDR speed), Ubuntu 16.04

Page 33:

Performance Example

Page 34:

Setup

▪ 4 nodes, 16 GPUs
▪ Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
▪ Volta NVIDIA GPUs
▪ InfiniBand ConnectX-6 HCA
▪ Ubuntu-16.04, Mellanox OFED 4.5, HPC-X v2.3, TensorFlow v1.12, Horovod 0.15.2
▪ IB Quantum Switch (EDR speed)
▪ NCCL: 1 ring, NVLink within the node
▪ TensorFlow/Horovod running ResNet50 benchmark
▪ SHARP: Using 4 channels (4 ports) directly participating in SAT operation

Page 35:

Allreduce - SHARP

[Charts: left: "SHARP Latency", AllReduce latency (µs) for 4–256 B messages, HOST vs. SHARP/LLT; right: "SHARP Streaming aggregation", latency (µs) for 8 MB–512 MB messages, HOST vs. SHARP/SAT, ~3x improvement]

Page 36:

Allreduce - GPU Direct & SHARP

[Charts: left: "GPU Direct", latency (µs) for 4 B–16 KB messages, NCCL vs. SHARP; right: "GPU Direct & SHARP", latency (µs) for 8 MB–512 MB messages, NCCL vs. SHARP, ~10x improvement]

Page 37:

Horovod – Resnet50

[Chart: Horovod ResNet50 training throughput with SHARP]

Page 38:

Thank You