software libraries and middleware for exascale...
TRANSCRIPT
Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Keynote Talk at HPSR ‘18
High-End Computing (HEC): Towards Exascale
• 100 PFlops in 2016
• 200 PFlops in 2018
• 1 EFlops in 2020-2021?
• Expected to have an ExaFlop system in 2020-2021!
Big Data – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents 3.7 Billion People.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/
Resurgence of AI/Machine Learning/Deep Learning
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/
A Typical Multi-Tier Data Center Architecture
(Figure: requests flow through routers/servers and switches across three tiers — Tier 1 front-end routers/servers, Tier 2 application servers, and Tier 3 database servers.)
Data Management and Processing on Modern Datacenters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Communication and Computation Requirements
• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
• If not cached, the request is forwarded to the next tier (application server)
• Application server performs the business logic (CGI, Java servlets, etc.)
– Retrieves appropriate data from the database to process the requests
(Figure: Clients → WAN → Proxy Server → Web Server (Apache) → Application Server (PHP) → Database Server (MySQL) → Storage, with increasing computation and communication requirements across the tiers.)
Increasing Usage of HPC, Big Data and Deep Learning on Modern Datacenters
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
Physical Compute
Spark Job, Hadoop Job, and Deep Learning Job running on the shared physical compute.
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, iWARP, RoCE, and Omni-Path)
• Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVMe/NVMf, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
(Figure: high-performance interconnects — InfiniBand with SR-IOV, <1 usec latency, 200 Gbps bandwidth; multi-/many-core processors; accelerators with high compute density and high performance/watt, >1 TFlop DP on a chip; SSD, NVMe-SSD, NVRAM; example systems: SDSC Comet, TACC Stampede, Cloud.)
Trends in Network Speed Acceleration
• Ethernet (1979 - ): 10 Mbit/sec
• Fast Ethernet (1993 - ): 100 Mbit/sec
• Gigabit Ethernet (1995 - ): 1000 Mbit/sec
• ATM (1995 - ): 155/622/1024 Mbit/sec
• Myrinet (1993 - ): 1 Gbit/sec
• Fibre Channel (1994 - ): 1 Gbit/sec
• InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
• 10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
• InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
• InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
• InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
• 40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
• InfiniBand (2011 - ): 54.6 Gbit/sec (4X FDR)
• InfiniBand (2012 - ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
• 25-/50-Gigabit Ethernet (2014 - ): 25/50 Gbit/sec
• 100-Gigabit Ethernet (2015 - ): 100 Gbit/sec
• Omni-Path (2015 - ): 100 Gbit/sec
• InfiniBand (2015 - ): 100 Gbit/sec (4X EDR)
• InfiniBand (2016 - ): 200 Gbit/sec (4X HDR)
Network speed has accelerated by 100 times in the last 17 years.
Available Interconnects and Protocols for Data Centers
(Figure: protocol-stack options beneath the Application / Middleware interface — Sockets over kernel-space TCP/IP on 1/10/25/40/50/100 GigE adapters and Ethernet switches; Sockets over hardware-offloaded TCP/IP (TOE); Sockets over IPoIB on InfiniBand adapters and switches; user-space RSockets and SDP over RDMA on InfiniBand; Verbs/RDMA in user space over iWARP adapters and Ethernet switches; RDMA over RoCE adapters and Ethernet switches; native IB Verbs/RDMA on InfiniBand adapters and switches; and OFI over RDMA on 100 Gb/s Omni-Path adapters and switches.)
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance data transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple transport services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple operations
– Send/Recv
– RDMA Read/Write (see the verbs sketch after this list)
– Atomic operations (unique to this class of networks)
• High-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …
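To make the verbs-level operations above concrete, here is a minimal sketch (in C, using the standard libibverbs API) of posting a one-sided RDMA Write on an already-connected queue pair. Memory registration, connection setup, and the out-of-band exchange of the peer's address and rkey are assumed to have been done elsewhere; the helper function and buffer names are illustrative only.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA Write of 'len' bytes from a locally registered
 * buffer to a remote virtual address the peer has exposed.  The queue
 * pair 'qp' is assumed to be connected (RTS) and 'mr' to cover 'local'. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no receive posted remotely */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion we can poll */
    wr.wr.rdma.remote_addr = remote_addr;         /* target VA advertised by the peer */
    wr.wr.rdma.rkey        = rkey;                /* remote memory key for that region */
    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```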
• 163 IB Clusters (32.6%) in the Nov’17 Top500 list
– (http://www.top500.org)
• Installations in the Top 50 (17 systems):
Large-scale InfiniBand Installations
• 19,860,000-core (Gyoukou) in Japan (4th)
• 241,108-core (Pleiades) at NASA/Ames (17th)
• 220,800-core (Pangea) in France (21st)
• 144,900-core (Cheyenne) at NCAR/USA (24th)
• 155,150-core (Jureca) in Germany (29th)
• 72,800-core Cray CS-Storm in US (30th)
• 72,800-core Cray CS-Storm in US (31st)
• 78,336-core (Electra) at NASA/USA (33rd)
• 124,200-core (Topaz) SGI ICE at ERDC DSRC in US (34th)
• 60,512-core (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
• 60,512-core (DGX SATURN V) at NVIDIA/USA (36th)
• 72,000-core (HPC2) in Italy (37th)
• 152,692-core (Thunder) at AFRL/USA (40th)
• 99,072-core (Mistral) at DKRZ/Germany (42nd)
• 147,456-core (SuperMUC) in Germany (44th)
• 86,016-core (SuperMUC Phase 2) in Germany (45th)
• 74,520-core (Tsubame 2.5) at Japan/GSIC (48th)
• 66,000-core (HPC3) in Italy (51st)
• 194,616-core (Cascade) at PNNL (53rd)
• and many more!
The #1 system also uses InfiniBand.
The upcoming US DOE Summit (200 PFlops, to be the new #1) uses InfiniBand.
High-speed Ethernet Consortium (10GE/25GE/40GE/50GE/100GE)
• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
• http://www.ethernetalliance.org
• 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG
• 25-Gbps Ethernet Consortium targeting 25/50 Gbps (July 2014)
– http://25gethernet.org
• Energy-efficient and power-conscious protocols
– On-the-fly link speed reduction for under-utilized links
• Ethernet Alliance Technology Forum looking forward to 2026
– http://insidehpc.com/2016/08/at-ethernet-alliance-technology-forum/
TOE and iWARP Accelerators
• TCP Offload Engines (TOE)
– Hardware acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Strictly speaking, refers to the IC on the network adapter that implements TCP/IP
– In practice, usually refers to the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
– Standardized by IETF and the RDMA Consortium
– Supports acceleration features (like IB) for Ethernet
– http://www.ietf.org & http://www.rdmaconsortium.org
RDMA over Converged Enhanced Ethernet (RoCE)
Network Stack Comparison:
(Figure: InfiniBand — IB Verbs / IB Transport / IB Network / InfiniBand Link Layer; RoCE — IB Verbs / IB Transport / IB Network / Ethernet Link Layer; RoCE v2 — IB Verbs / IB Transport / UDP-IP / Ethernet Link Layer; all implemented in hardware.)
Packet Header Comparison:
(Figure: RoCE — ETH L2 header (Ethertype), IB GRH L3 header, IB BTH+ L4 header; RoCE v2 — ETH L2 header (Ethertype), IP L3 header (Proto #), UDP header (Port #), IB BTH+ L4 header.)
• Takes advantage of IB and Ethernet
– Software written with IB Verbs
– Link layer is Converged (Enhanced) Ethernet (CE)
– 100 Gb/s support from latest EDR and ConnectX-3 Pro adapters
• Pros of RoCE vs. IB
– Works natively in Ethernet environments
• Entire Ethernet management ecosystem is available
– Has all the benefits of IB Verbs
– Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: additional benefits over RoCE
– Traditional network management tools apply
– ACLs (metering, accounting, firewalling)
– IGMP snooping for optimized multicast
– Network monitoring tools
Courtesy: OFED, Mellanox
HSE Scientific Computing Installations
• 204 HSE compute systems with ranking in the Nov ‘17 Top500 list
– 39,680-core installation in China (#73)
– 66,560-core installation in China (#101) – new
– 66,280-core installation in China (#103) – new
– 64,000-core installation in China (#104) – new
– 64,000-core installation in China (#105) – new
– 72,000-core installation in China (#108) – new
– 78,000-core installation in China (#125)
– 59,520-core installation in China (#128) – new
– 59,520-core installation in China (#129) – new
– 64,800-core installation in China (#130) – new
– 67,200-core installation in China (#134) – new
– 57,600-core installation in China (#135) – new
– 57,600-core installation in China (#136) – new
– 64,000-core installation in China (#138) – new
– 84,000-core installation in China (#139)
– 84,000-core installation in China (#140)
– 51,840-core installation in China (#151) – new
– 51,200-core installation in China (#156) – new
– and many more!
Omni-Path Fabric Overview
• Derived from QLogic InfiniBand
• Layer 1.5: Link Transfer Protocol
– Features: Traffic Flow Optimization, Packet Integrity Protection, Dynamic Lane Switching
– Error detection/replay occurs in Link Transfer Packet units
– Retransmit request via NULL LTP; carries replay command flit
• Layer 2: Link Layer
– Supports 24-bit fabric addresses
– Allows 10 KB of L4 payload; 10,368-byte max packet size
– Congestion management: Adaptive/Dispersive Routing, Explicit Congestion Notification
– QoS support: Traffic Class, Service Level, Service Channel and Virtual Lane
• Layer 3: Data Link Layer
– Fabric addressing, switching, resource allocation and partitioning support
Courtesy: Intel Corporation
• 35 Omni-Path Clusters (7%) in the Nov’17 Top500 list
– (http://www.top500.org)
Large-scale Omni-Path Installations
• 556,104-core (Oakforest-PACS) at JCAHPC in Japan (9th)
• 368,928-core (Stampede2) at TACC in USA (12th)
• 135,828-core (Tsubame 3.0) at TiTech in Japan (13th)
• 314,384-core (Marconi XeonPhi) at CINECA in Italy (14th)
• 153,216-core (MareNostrum) at BSC in Spain (16th)
• 95,472-core (Quartz) at LLNL in USA (49th)
• 95,472-core (Jade) at LLNL in USA (50th)
• 49,432-core (Mogon II) at Universitaet Mainz in Germany (65th)
• 38,552-core (Molecular Simulator) in Japan (70th)
• 35,280-core (Quriosity) at BASF in Germany (71st)
• 54,432-core (Marconi Xeon) at CINECA in Italy (72nd)
• 46,464-core (Peta4) at University of Cambridge in UK (75th)
• 53,352-core (Grizzly) at LANL in USA (85th)
• 45,680-core (Endeavor) at Intel in USA (86th)
• 59,776-core (Cedar) at SFU in Canada (94th)
• 27,200-core (Peta HPC) in Taiwan (95th)
• 39,774-core (Nel) at LLNL in USA (101st)
• 40,392-core (Serrano) at SNL in USA (112th)
• 40,392-core (Cayenne) at SNL in USA (113th)
• and many more!
IB, Omni-Path, and HSE: Feature Comparison (values listed as IB / iWARP-HSE / RoCE / RoCE v2 / Omni-Path)
• Hardware Acceleration: Yes / Yes / Yes / Yes / Yes
• RDMA: Yes / Yes / Yes / Yes / Yes
• Congestion Control: Yes / Optional / Yes / Yes / Yes
• Multipathing: Yes / Yes / Yes / Yes / Yes
• Atomic Operations: Yes / No / Yes / Yes / Yes
• Multicast: Optional / No / Optional / Optional / Optional
• Data Placement: Ordered / Out-of-order / Ordered / Ordered / Ordered
• Prioritization: Optional / Optional / Yes / Yes / Yes
• Fixed BW QoS (ETS): No / Optional / Yes / Yes / Yes
• Ethernet Compatibility: No / Yes / Yes / Yes / Yes
• TCP/IP Compatibility: Yes (using IPoIB) / Yes / Yes (using IPoIB) / Yes / Yes
Designing Communication and I/O Libraries for Clusters and Data Center Middleware: Challenges
(Figure: layered view — Applications; Programming Models (Sockets); Cluster and Data Center Middleware (MPI, PGAS, Memcached, HDFS, MapReduce, HBase, and gRPC/TensorFlow); a Communication and I/O Library providing an RDMA-based communication substrate, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, and virtualization (SR-IOV); running over networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD, SSD, NVM, and NVMe-SSD). Open questions: can RDMA be exploited, and what upper-level changes are needed?)
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Parallel Programming Models Overview
(Figure: three abstract machine models — Shared Memory Model (SHMEM, DSM): processes P1-P3 share one memory; Distributed Memory Model (MPI): each process has its own memory; Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories combined into a logical shared memory.)
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
(Figure: layered co-design view — Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); a Communication Library or Runtime providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance; running over networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA). Middleware co-design opportunities and challenges span these layers, targeting performance, scalability, and resilience.)
Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable collective communication
– Offload
– Non-blocking (see the sketch after this list)
– Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, …)
• Virtualization
• Energy-awareness
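As a concrete illustration of the non-blocking collectives mentioned above, the following minimal MPI sketch (standard MPI-3 API, usable with any MPI library such as MVAPICH2) overlaps an MPI_Iallreduce with independent computation. The buffer size and the compute placeholder are illustrative assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;                    /* 1M doubles, illustrative size */
    double *in  = malloc(n * sizeof(double));
    double *out = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) in[i] = 1.0;

    MPI_Request req;
    /* Start the reduction, then keep the CPU busy while the network works. */
    MPI_Iallreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... independent computation that does not touch 'in' or 'out' ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* collective now complete */

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```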
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,900 organizations in 86 countries
– More than 474,000 (> 0.47 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 9th, 556,104 cores (Oakforest-PACS) in Japan
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
MVAPICH2 Release Timeline and Downloads
(Figure: cumulative number of downloads from Sep 2004 to Jan 2018, growing to nearly 500,000; release milestones marked along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3rc1, and OSU INAM 0.9.3.)
Architecture of MVAPICH2 Software Family
(Figure: high-performance parallel programming models — MPI, PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk) — sit on top of a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis. The runtime supports modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU), with transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP, SR-IOV, multi-rail; MCDRAM*, NVLink*, CAPI*, XPMEM* — * upcoming), and transport mechanisms (shared memory, CMA, IVSHMEM).)
One-way Latency: MPI over IB with MVAPICH2
(Figure: OSU one-way latency vs. message size. Small-message latencies of roughly 0.98-1.19 us are measured across the five configurations — TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path — and a second panel shows large-message latency for the same adapters.)
Platforms: TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch.
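The latencies above come from OSU micro-benchmarks; the measurement pattern is a simple ping-pong. Below is a minimal C/MPI sketch of that pattern between ranks 0 and 1 (the iteration count, message size, and the lack of warm-up/skip iterations are simplifications relative to the real osu_latency benchmark).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000, size = 4;        /* 4-byte messages, illustrative */
    char buf[4096] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                      /* rank 0 sends first, then waits */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {               /* rank 1 echoes every message back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                            /* one-way latency = round-trip / 2 */
        printf("%d-byte latency: %.2f us\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```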
Bandwidth: MPI over IB with MVAPICH2
(Figure: OSU bandwidth vs. message size on the same platforms as the latency test. Peak unidirectional bandwidths of roughly 3,373, 6,356, 12,358, 12,366, and 12,590 MBytes/sec and peak bidirectional bandwidths of roughly 6,228, 12,161, 21,983, 22,564, and 24,136 MBytes/sec are measured across the TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path configurations.)
Platforms: TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch.
Hardware Multicast-aware MPI_Bcast on Stampede
(Figure: MPI_Bcast latency, Default vs. Multicast, on ConnectX-3-FDR (54 Gbps), 2.7 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch. Panels: small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, and 16-byte and 32-KByte messages as a function of the number of nodes.)
Performance of MPI_Allreduce on Stampede2 (10,240 Processes)
(Figure: OSU micro-benchmark, 64 PPN — MPI_Allreduce latency vs. message size (4 bytes to 256K bytes) for MVAPICH2, MVAPICH2-OPT, and IMPI.)
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available in MVAPICH2-X 2.3b.
Application Level Performance with Graph500 and Sort
Graph500 Execution Time:
(Figure: execution time at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs.)
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
• 8,192 processes
- 2.4X improvement over MPI-CSR
- 7.6X improvement over MPI-Simple
• 16,384 processes
- 1.5X improvement over MPI-CSR
- 13X improvement over MPI-Simple
Sort Execution Time:
(Figure: execution time for MPI vs. Hybrid designs at 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K input/process configurations.)
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
• 4,096 processes, 4 TB input size
- MPI – 2408 sec; 0.16 TB/min
- Hybrid – 1172 sec; 0.36 TB/min
- 51% improvement over the MPI design
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
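The hybrid designs above combine MPI with one-sided OpenSHMEM operations in the same program. The sketch below (standard MPI plus OpenSHMEM 1.2-style calls) shows the basic pattern of mixing a one-sided put into symmetric memory with an MPI collective; a unified runtime such as MVAPICH2-X is assumed, and the initialization ordering and error handling are simplified.

```c
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                      /* with a unified runtime, PEs == MPI ranks */

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE gets a remotely accessible counter. */
    int *counter = shmem_malloc(sizeof(int));
    *counter = 0;
    shmem_barrier_all();

    /* One-sided put: write our PE id into the neighbour's counter. */
    shmem_int_p(counter, me, (me + 1) % npes);
    shmem_barrier_all();

    /* Mix in an MPI collective over the same set of processes. */
    int sum = 0;
    MPI_Allreduce(counter, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0) printf("sum of shifted PE ids = %d\n", sum);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```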
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
• Standard MPI interfaces used for unified data movement
• Data movement is handled transparently inside MVAPICH2
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High performance and high productivity
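A minimal end-to-end sketch of the pattern above: with a CUDA-aware MPI library such as MVAPICH2-GDR, device pointers can be passed directly to MPI_Send/MPI_Recv. The message size and the two-rank layout are illustrative assumptions.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                    /* 1 MiB message, illustrative */
    int rank;
    char *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n);          /* GPU buffer, no host staging needed */
    cudaMemset(devbuf, rank, n);

    if (rank == 0)                            /* pass the device pointer straight to MPI */
        MPI_Send(devbuf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 1) printf("received %d bytes into GPU memory\n", n);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```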
Optimized MVAPICH2-GDR Design
(Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2 (no GDR) vs. MV2-GDR-2.3a — 1.88 us small-message latency (11X improvement) and roughly 9x-10x higher uni- and bi-directional bandwidth with GPUDirect RDMA.)
Platform: MVAPICH2-GDR-2.3a, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA.
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
(Figure: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs.)
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16.
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application.
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Data-Center Service Primitives
• Common services needed by data centers
– Better resource management
– Higher performance provided to higher layers
• Service primitives
– Soft Shared State
– Distributed Lock Management
– Global Memory Aggregator
• Network-based designs
– RDMA, Remote Atomic Operations (see the sketch below)
Soft Shared State
(Figure: multiple data-center applications Put updates into, and Get state from, a common soft shared state repository.)
Active Caching
• Dynamic data caching is challenging!
• Cache consistency and coherence
– Become more important than in the static case
(Figure: user requests arrive at proxy nodes, which cache responses; back-end nodes issue updates that must be reflected in the cache.)
RDMA based Client Polling Design
(Figure: front-end/back-end interaction on a request, showing the cache-hit and cache-miss paths; the front-end uses an RDMA version read to validate cached responses before replying.)
Active Caching – Performance Benefits
(Figure: data-center throughput for traces with increasing update rate (Traces 2-5) and under increasing load (0-64 compute threads), comparing No Cache, Invalidate All, and Dependency Lists schemes.)
• Higher overall performance – up to an order of magnitude
• Performance is sustained under loaded conditions
S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda, Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand. CCGrid 2005.
Resource Monitoring Services
• Traditional approaches
– Coarse-grained in nature
– Assume resource usage is consistent throughout the monitoring granularity (on the order of seconds)
• This assumption is no longer valid
– Resource usage is becoming increasingly divergent
• Fine-grained monitoring is desired but has additional costs
– High overheads, less accurate, slow in response
• Can we design a fine-grained resource monitoring scheme with low overhead and accurate resource usage?
Synchronous Resource Monitoring using RDMA (RDMA-Sync)
(Figure: a monitoring process on the front-end node reads kernel data structures exposed through /proc on the back-end node directly over RDMA, without involving the back-end CPU or its application threads.)
Impact of Fine-grained Monitoring with Applications
(Figure: percentage improvement on RUBiS and Zipf traces (alpha = 0.9, 0.75, 0.5, 0.25) for Socket-Sync, RDMA-Async, RDMA-Sync, and e-RDMA-Sync schemes.)
• Our schemes (RDMA-Sync and e-RDMA-Sync) achieve significant performance gains over existing schemes
K. Vaidyanathan, H.-W. Jin and D. K. Panda, Exploiting RDMA Operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-Based Servers, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations and Technologies, 2006.
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Architecture Overview of Memcached
• Three-layer architecture of Web 2.0
– Web servers, Memcached servers, database servers
• Memcached is a core component of the Web 2.0 architecture
• Distributed caching layer
– Allows aggregating spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
(Figure: Internet clients reach the web servers, which use the Memcached servers as a caching layer in front of the database servers.)
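The "cache the database query" usage pattern above looks roughly like the following C sketch using the standard libmemcached client API (the server address, key, TTL, and the database-lookup placeholder are illustrative assumptions).

```c
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* illustrative server */

    const char *key = "user:42:profile";
    size_t value_len;
    uint32_t flags;

    /* 1. Try the cache first. */
    char *value = memcached_get(memc, key, strlen(key), &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("cache hit: %.*s\n", (int)value_len, value);
        free(value);
    } else {
        /* 2. Cache miss: query the database (placeholder), then populate the cache. */
        const char *db_result = "name=Alice;plan=premium";
        memcached_set(memc, key, strlen(key),
                      db_result, strlen(db_result),
                      (time_t)300 /* TTL seconds */, (uint32_t)0);
        printf("cache miss: stored result from database\n");
    }

    memcached_free(memc);
    return 0;
}
```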
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– Master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• The Memcached server can serve both sockets and verbs clients simultaneously
• Memcached applications need not be modified; the verbs interface is used if available
(Figure: sockets and RDMA clients contact the master thread, which hands them off to sockets or verbs worker threads; all workers share the Memcached data — memory slabs and items.)
Memcached Performance (FDR Interconnect)
(Figure: Memcached GET latency vs. message size and throughput vs. number of clients, OSU-IB (FDR) vs. IPoIB (FDR); latency reduced by nearly 20X and throughput roughly 2X higher. Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR).)
• Memcached GET latency
– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
• Memcached throughput (4 bytes)
– 4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
– Nearly 2X improvement in throughput
Micro-benchmark Evaluation for OLDP Workloads
(Figure: latency (sec) and throughput (Kq/s) vs. number of clients (64-400) for Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps).)
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
– improve query latency by up to 66% over IPoIB (32Gbps)
– improve throughput by up to 69% over IPoIB (32Gbps)
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS '15.
Performance Evaluation on IB FDR + SATA/NVMe SSDs
(Figure: Set/Get latency breakdown — slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty — for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, when the data fits in memory and when it does not.)
• Memcached latency test with Zipf distribution; server with 1 GB memory; 32 KB key-value pair size; total data accessed is 1 GB (fits in memory) or 1.5 GB (does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
• RDMA-accelerated communication for Memcached Get/Set
• Hybrid 'RAM+SSD' slab management for higher data retention
• Non-blocking API extensions
– memcached_(iset/iget/bset/bget/test/wait)
– Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
– Ability to exploit communication/computation overlap
– Optional buffer re-use guarantees
• Adaptive slab manager with different I/O schemes for higher throughput
(Figure: client-side libmemcached library with blocking and non-blocking API flows over an RDMA-enhanced communication library, talking to a server that pairs an RDMA-enhanced communication library with a hybrid RAM+SSD slab manager.)
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016.
Performance Evaluation with Non-Blocking Memcached API
(Figure: average Set/Get latency broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory), for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b. H = hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API with buffer re-use guarantee.)
• Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
– >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
– >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similarly to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 285 organizations from 34 countries
• More than 26,700 downloads from the project site
The High-Performance Big Data (HiBD) Project
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
(Figure: execution time vs. data size for IPoIB (EDR) vs. OSU-IB (EDR) — RandomWriter at 80-160 GB and TeraGen at 80-240 GB; execution time reduced by 3x and 4x respectively.)
Cluster with 8 nodes with a total of 64 maps
• RandomWriter
– 3x improvement over IPoIB for 80-160 GB file size
• TeraGen
– 4x improvement over IPoIB for 80-240 GB file size
Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank
(Figure: PageRank total time for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, with 32 worker nodes/768 cores and 64 worker nodes/1536 cores.)
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
– 32 nodes/768 cores: total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet: Astronomy Application
• Kira Toolkit1: Distributed astronomy image processing toolkit implemented using Apache Spark.
• Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files.
• Compare RDMA Spark performance with the standard apache implementation using IPoIB.
1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/1507.03325, Aug 2015.
(Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark over IPoIB.)
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Deep Learning: New Challenges for Communication Runtimes
• Deep Learning frameworks are a different game altogether
– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers
• Existing state of the art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– CUDA-Aware MPI --> scale-out performance
• For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
– Efficient overlap of computation and communication
– Efficient large-message communication (reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?
(Figure: scale-up vs. scale-out performance space — cuDNN, cuBLAS, and NCCL rank high on scale-up; gRPC and Hadoop rank high on scale-out; MPI sits in between; the proposed co-designs target both.)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
Large Message Optimized Collectives for Deep Learning
(Figure: latency of Reduce (192 GPUs, 2-128 MB; 64 MB at 128-192 GPUs), Allreduce (64 GPUs, 2-128 MB; 128 MB at 16-64 GPUs), and Bcast (64 GPUs, 2-128 MB; 128 MB at 16-64 GPUs).)
• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available since MVAPICH2-GDR 2.2GA
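Data-parallel deep learning training drives exactly these large-message collectives: after each iteration every rank holds a local gradient buffer, and an allreduce produces the global average. A minimal C/MPI sketch of that step is shown below (the gradient size and the use of host memory are illustrative; with a CUDA-aware library the same call can take device pointers).

```c
#include <mpi.h>
#include <stdlib.h>

/* Average one iteration's gradients across all ranks (data-parallel SGD). */
static void allreduce_gradients(float *grad, int count, MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    /* Sum the large gradient buffer in place across all ranks ... */
    MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM, comm);

    /* ... then scale to get the average that every rank applies. */
    for (int i = 0; i < count; i++)
        grad[i] /= (float)nranks;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    const int count = 16 * 1024 * 1024;        /* ~64 MB of floats, illustrative */
    float *grad = calloc(count, sizeof(float));

    allreduce_gradients(grad, count, MPI_COMM_WORLD);   /* one training step */

    free(grad);
    MPI_Finalize();
    return 0;
}
```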
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
(Figure: MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) latency vs. message size on 16 GPUs — ~3X better for small/medium messages and ~1.2X better for large messages.)
• Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect.
MVAPICH2: Allreduce Comparison with Baidu and OpenMPI
(Figure: Allreduce latency vs. message size on 16 GPUs (4 nodes) for MVAPICH2-GDR*, Baidu-allreduce, and OpenMPI 3.0, across small, medium (512K-4M), and large (8M-512M) message ranges. Annotations across the three ranges: ~10X better, ~4X better, and ~30X better for MVAPICH2, with MV2 ~2X better than Baidu and OpenMPI ~5X slower than Baidu.)
• 16 GPUs (4 nodes): MVAPICH2-GDR(*) vs. Baidu-Allreduce and OpenMPI 3.0
*Available with MVAPICH2-GDR 2.3a
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
– Multi-GPU training within a single node
– Performance degradation for GPUs across different sockets
– Limited scale-out
• OSU-Caffe: MPI-based parallel training
– Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
– Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
– Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
(Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); default Caffe is an invalid use case at this scale.)
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Architecture Overview of gRPC
Key features:
• Simple service definition
• Works across languages and platforms
– C++, Java, Python, Android Java, etc.
– Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated authentication
• Used by Google (several of Google's cloud products and Google's externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication!
(Figure: large-scale distributed systems composed of micro services communicating through gRPC.)
Source: http://www.grpc.io/
Performance Benefits for AR-gRPC with Micro-Benchmark
(Figure: point-to-point latency vs. payload size — small (2 B-8 KB), medium (16 KB-512 KB), and large (1 MB-8 MB) — for Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps).)
• AR-gRPC (OSU design) latency on SDSC-Comet-FDR
– Up to 2.7x performance speedup over Default gRPC (IPoIB) for small messages
– Up to 2.8x performance speedup over Default gRPC (IPoIB) for medium messages
– Up to 2.5x performance speedup over Default gRPC (IPoIB) for large messages
R. Biswas, X. Lu, and D. K. Panda, Accelerating TensorFlow with Adaptive RDMA-based gRPC. (Under Review)
Performance Benefit for TensorFlow (Resnet50)
(Figure: images/second vs. batch size per GPU (8-64) on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps).)
• TensorFlow Resnet50 performance evaluation on an IB EDR cluster
– Up to 26% performance speedup over Default gRPC (IPoIB) for 4 nodes
– Up to 127% performance speedup over Default gRPC (IPoIB) for 8 nodes
– Up to 133% performance speedup over Default gRPC (IPoIB) for 12 nodes
Performance Benefit for TensorFlow (Inception3)
(Figure: images/second vs. batch size per GPU (8-64) on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps).)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
– Up to 47% performance speedup over Default gRPC (IPoIB) for 4 nodes
– Up to 116% performance speedup over Default gRPC (IPoIB) for 8 nodes
– Up to 153% performance speedup over Default gRPC (IPoIB) for 12 nodes
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Benefits of Deep Learning over Big Data (DLoBD)
– Easily integrate deep learning components into Big Data processing workflows
– Easily access the data stored in Big Data systems
– No need to set up new dedicated deep learning clusters; reuse existing big data analytics clusters
• Challenges
– Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
– What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization on DLoBD stacks
– CaffeOnSpark, TensorFlowOnSpark, and BigDL
– IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
– Performance, accuracy, scalability, and resource utilization
– RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
(Figure: epoch time and accuracy vs. epoch number for IPoIB and RDMA; RDMA reaches the same accuracy 2.6X faster.)
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Single Root I/O Virtualization (SR-IOV)
• SR-IOV is providing new opportunities to design cloud-based datacenters with very low overhead
• Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through
• Works with 10/40/100 GigE and InfiniBand
Can High-Performance and Virtualization be Combined?
• Virtualization has many benefits
– Fault-tolerance
– Job migration
– Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
– OpenStack, Docker, and Singularity
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14.
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14.
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid '15.
Application-Level Performance on Chameleon
(Figure: Graph500 execution time for problem sizes (22,20)-(26,16) and SPEC MPI2007 execution time for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native.)
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 processes
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
Application-Level Performance on Docker with MVAPICH2
(Figure: NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time and Graph 500 BFS execution time at scale/edgefactor (20,16) for 1 container x 16 processes, 2 containers x 8 processes, and 4 containers x 4 processes, comparing Container-Def, Container-Opt, and Native.)
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500
Application-Level Performance on Singularity with MVAPICH2
(Figure: NPB Class D (CG, EP, FT, IS, LU, MG) execution time and Graph500 BFS execution time for problem sizes (22,16)-(26,20), Singularity vs. Native.)
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17.
RDMA-Hadoop with Virtualization
(Figure: CloudBurst and Self-Join execution times in Default Mode and Distributed Mode for RDMA-Hadoop vs. RDMA-Hadoop-Virt.)
• 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
• 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
S. Gugnani, X. Lu, D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
• Next generation Clusters and Data Centers need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• Presented an overview of the networking technology trends
• Presented some of the approaches and results along these directions
• Enable the HPC, Big Data, Deep Learning, and Cloud communities to take advantage of modern networking technologies
• Many other open issues need to be solved
Concluding Remarks
Funding Acknowledgments
(Slide: sponsor logos for funding support and equipment support.)
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students – A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
• Looking for Bright and Enthusiastic Personnel to join as
– Post-Doctoral Researchers
– PhD Students
– MPI Programmer/Software Engineer
– Hadoop/Big Data Programmer/Software Engineer
• If interested, please contact me at this conference and/or send an e-mail to [email protected]
Multiple Positions Available in My Group
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/