software libraries and middleware for exascale...
TRANSCRIPT
Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Keynote Talk at HPSR ‘18
High-End Computing (HEC): Towards Exascale
• 100 PFlops in 2016
• 200 PFlops in 2018
• 1 EFlops in 2020-2021?
• Expected to have an ExaFlop system in 2020-2021!
Big Data – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents 3.7 Billion People.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/
Resurgence of AI/Machine Learning/Deep Learning
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/
A Typical Multi-Tier Data Center Architecture
(Figure: requests flow through routers/servers and switches across three tiers — Tier 1 front-end routers/servers, Tier 2 application servers, and Tier 3 database servers.)
Data Management and Processing on Modern Datacenters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Communication and Computation Requirements
• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
• If not cached, the request is forwarded to the next tier (application server)
• Application server performs the business logic (CGI, Java servlets, etc.)
– Retrieves appropriate data from the database to process the requests
(Figure: Clients → WAN → Proxy Server → Web Server (Apache) → Application Server (PHP) → Database Server (MySQL) → Storage, with increasing computation and communication requirements across the tiers.)
Increasing Usage of HPC, Big Data and Deep Learning on Modern Datacenters
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
Physical Compute
Spark Job, Hadoop Job, and Deep Learning Job running on the shared physical compute.
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, iWARP, RoCE, and Omni-Path)
• Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVMe/NVMf, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
(Figure: high-performance interconnects — InfiniBand with SR-IOV, <1 usec latency, 200 Gbps bandwidth; multi-/many-core processors; accelerators with high compute density and high performance/watt, >1 TFlop DP on a chip; SSD, NVMe-SSD, NVRAM; example systems: SDSC Comet, TACC Stampede, Cloud.)
Trends in Network Speed Acceleration
• Ethernet (1979 - ): 10 Mbit/sec
• Fast Ethernet (1993 - ): 100 Mbit/sec
• Gigabit Ethernet (1995 - ): 1000 Mbit/sec
• ATM (1995 - ): 155/622/1024 Mbit/sec
• Myrinet (1993 - ): 1 Gbit/sec
• Fibre Channel (1994 - ): 1 Gbit/sec
• InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
• 10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
• InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
• InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
• InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
• 40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
• InfiniBand (2011 - ): 54.6 Gbit/sec (4X FDR)
• InfiniBand (2012 - ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
• 25-/50-Gigabit Ethernet (2014 - ): 25/50 Gbit/sec
• 100-Gigabit Ethernet (2015 - ): 100 Gbit/sec
• Omni-Path (2015 - ): 100 Gbit/sec
• InfiniBand (2015 - ): 100 Gbit/sec (4X EDR)
• InfiniBand (2016 - ): 200 Gbit/sec (4X HDR)
Network speed has accelerated by 100 times in the last 17 years.
Available Interconnects and Protocols for Data Centers
(Figure: protocol-stack options beneath the Application / Middleware interface — Sockets over kernel-space TCP/IP on 1/10/25/40/50/100 GigE adapters and Ethernet switches; Sockets over hardware-offloaded TCP/IP (TOE); Sockets over IPoIB on InfiniBand adapters and switches; user-space RSockets and SDP over RDMA on InfiniBand; Verbs/RDMA in user space over iWARP adapters and Ethernet switches; RDMA over RoCE adapters and Ethernet switches; native IB Verbs/RDMA on InfiniBand adapters and switches; and OFI over RDMA on 100 Gb/s Omni-Path adapters and switches.)
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance data transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple transport services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple operations
– Send/Recv
– RDMA Read/Write (see the verbs sketch after this list)
– Atomic operations (unique to this class of networks)
• High-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …
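To make the verbs-level operations above concrete, here is a minimal sketch (in C, using the standard libibverbs API) of posting a one-sided RDMA Write on an already-connected queue pair. Memory registration, connection setup, and the out-of-band exchange of the peer's address and rkey are assumed to have been done elsewhere; the helper function and buffer names are illustrative only.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA Write of 'len' bytes from a locally registered
 * buffer to a remote virtual address the peer has exposed.  The queue
 * pair 'qp' is assumed to be connected (RTS) and 'mr' to cover 'local'. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no receive posted remotely */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion we can poll */
    wr.wr.rdma.remote_addr = remote_addr;         /* target VA advertised by the peer */
    wr.wr.rdma.rkey        = rkey;                /* remote memory key for that region */
    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```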
• 163 IB Clusters (32.6%) in the Nov’17 Top500 list
– (http://www.top500.org)
• Installations in the Top 50 (17 systems):
Large-scale InfiniBand Installations
• 19,860,000-core (Gyoukou) in Japan (4th)
• 241,108-core (Pleiades) at NASA/Ames (17th)
• 220,800-core (Pangea) in France (21st)
• 144,900-core (Cheyenne) at NCAR/USA (24th)
• 155,150-core (Jureca) in Germany (29th)
• 72,800-core Cray CS-Storm in US (30th)
• 72,800-core Cray CS-Storm in US (31st)
• 78,336-core (Electra) at NASA/USA (33rd)
• 124,200-core (Topaz) SGI ICE at ERDC DSRC in US (34th)
• 60,512-core (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
• 60,512-core (DGX SATURN V) at NVIDIA/USA (36th)
• 72,000-core (HPC2) in Italy (37th)
• 152,692-core (Thunder) at AFRL/USA (40th)
• 99,072-core (Mistral) at DKRZ/Germany (42nd)
• 147,456-core (SuperMUC) in Germany (44th)
• 86,016-core (SuperMUC Phase 2) in Germany (45th)
• 74,520-core (Tsubame 2.5) at Japan/GSIC (48th)
• 66,000-core (HPC3) in Italy (51st)
• 194,616-core (Cascade) at PNNL (53rd)
• and many more!
The #1 system also uses InfiniBand.
The upcoming US DOE Summit (200 PFlops, to be the new #1) uses InfiniBand.
High-speed Ethernet Consortium (10GE/25GE/40GE/50GE/100GE)
• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
• http://www.ethernetalliance.org
• 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG
• 25-Gbps Ethernet Consortium targeting 25/50 Gbps (July 2014)
– http://25gethernet.org
• Energy-efficient and power-conscious protocols
– On-the-fly link speed reduction for under-utilized links
• Ethernet Alliance Technology Forum looking forward to 2026
– http://insidehpc.com/2016/08/at-ethernet-alliance-technology-forum/
TOE and iWARP Accelerators
• TCP Offload Engines (TOE)
– Hardware acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Strictly speaking, refers to the IC on the network adapter that implements TCP/IP
– In practice, usually refers to the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
– Standardized by IETF and the RDMA Consortium
– Supports acceleration features (like IB) for Ethernet
– http://www.ietf.org & http://www.rdmaconsortium.org
RDMA over Converged Enhanced Ethernet (RoCE)
Network Stack Comparison:
(Figure: InfiniBand — IB Verbs / IB Transport / IB Network / InfiniBand Link Layer; RoCE — IB Verbs / IB Transport / IB Network / Ethernet Link Layer; RoCE v2 — IB Verbs / IB Transport / UDP-IP / Ethernet Link Layer; all implemented in hardware.)
Packet Header Comparison:
(Figure: RoCE — ETH L2 header (Ethertype), IB GRH L3 header, IB BTH+ L4 header; RoCE v2 — ETH L2 header (Ethertype), IP L3 header (Proto #), UDP header (Port #), IB BTH+ L4 header.)
• Takes advantage of IB and Ethernet
– Software written with IB Verbs
– Link layer is Converged (Enhanced) Ethernet (CE)
– 100 Gb/s support from latest EDR and ConnectX-3 Pro adapters
• Pros of RoCE vs. IB
– Works natively in Ethernet environments
• Entire Ethernet management ecosystem is available
– Has all the benefits of IB Verbs
– Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: additional benefits over RoCE
– Traditional network management tools apply
– ACLs (metering, accounting, firewalling)
– IGMP snooping for optimized multicast
– Network monitoring tools
Courtesy: OFED, Mellanox
HSE Scientific Computing Installations
• 204 HSE compute systems with ranking in the Nov ‘17 Top500 list
– 39,680-core installation in China (#73)
– 66,560-core installation in China (#101) – new
– 66,280-core installation in China (#103) – new
– 64,000-core installation in China (#104) – new
– 64,000-core installation in China (#105) – new
– 72,000-core installation in China (#108) – new
– 78,000-core installation in China (#125)
– 59,520-core installation in China (#128) – new
– 59,520-core installation in China (#129) – new
– 64,800-core installation in China (#130) – new
– 67,200-core installation in China (#134) – new
– 57,600-core installation in China (#135) – new
– 57,600-core installation in China (#136) – new
– 64,000-core installation in China (#138) – new
– 84,000-core installation in China (#139)
– 84,000-core installation in China (#140)
– 51,840-core installation in China (#151) – new
– 51,200-core installation in China (#156) – new
– and many more!
Omni-Path Fabric Overview
• Derived from QLogic InfiniBand
• Layer 1.5: Link Transfer Protocol
– Features: Traffic Flow Optimization, Packet Integrity Protection, Dynamic Lane Switching
– Error detection/replay occurs in Link Transfer Packet units
– Retransmit request via NULL LTP; carries replay command flit
• Layer 2: Link Layer
– Supports 24-bit fabric addresses
– Allows 10 KB of L4 payload; 10,368-byte max packet size
– Congestion management: Adaptive/Dispersive Routing, Explicit Congestion Notification
– QoS support: Traffic Class, Service Level, Service Channel and Virtual Lane
• Layer 3: Data Link Layer
– Fabric addressing, switching, resource allocation and partitioning support
Courtesy: Intel Corporation
• 35 Omni-Path Clusters (7%) in the Nov’17 Top500 list
– (http://www.top500.org)
Large-scale Omni-Path Installations
• 556,104-core (Oakforest-PACS) at JCAHPC in Japan (9th)
• 368,928-core (Stampede2) at TACC in USA (12th)
• 135,828-core (Tsubame 3.0) at TiTech in Japan (13th)
• 314,384-core (Marconi XeonPhi) at CINECA in Italy (14th)
• 153,216-core (MareNostrum) at BSC in Spain (16th)
• 95,472-core (Quartz) at LLNL in USA (49th)
• 95,472-core (Jade) at LLNL in USA (50th)
• 49,432-core (Mogon II) at Universitaet Mainz in Germany (65th)
• 38,552-core (Molecular Simulator) in Japan (70th)
• 35,280-core (Quriosity) at BASF in Germany (71st)
• 54,432-core (Marconi Xeon) at CINECA in Italy (72nd)
• 46,464-core (Peta4) at University of Cambridge in UK (75th)
• 53,352-core (Grizzly) at LANL in USA (85th)
• 45,680-core (Endeavor) at Intel in USA (86th)
• 59,776-core (Cedar) at SFU in Canada (94th)
• 27,200-core (Peta HPC) in Taiwan (95th)
• 39,774-core (Nel) at LLNL in USA (101st)
• 40,392-core (Serrano) at SNL in USA (112th)
• 40,392-core (Cayenne) at SNL in USA (113th)
• and many more!
IB, Omni-Path, and HSE: Feature Comparison (values listed as IB / iWARP-HSE / RoCE / RoCE v2 / Omni-Path)
• Hardware Acceleration: Yes / Yes / Yes / Yes / Yes
• RDMA: Yes / Yes / Yes / Yes / Yes
• Congestion Control: Yes / Optional / Yes / Yes / Yes
• Multipathing: Yes / Yes / Yes / Yes / Yes
• Atomic Operations: Yes / No / Yes / Yes / Yes
• Multicast: Optional / No / Optional / Optional / Optional
• Data Placement: Ordered / Out-of-order / Ordered / Ordered / Ordered
• Prioritization: Optional / Optional / Yes / Yes / Yes
• Fixed BW QoS (ETS): No / Optional / Yes / Yes / Yes
• Ethernet Compatibility: No / Yes / Yes / Yes / Yes
• TCP/IP Compatibility: Yes (using IPoIB) / Yes / Yes (using IPoIB) / Yes / Yes
Designing Communication and I/O Libraries for Clusters and Data Center Middleware: Challenges
(Figure: layered view — Applications; Programming Models (Sockets); Cluster and Data Center Middleware (MPI, PGAS, Memcached, HDFS, MapReduce, HBase, and gRPC/TensorFlow); a Communication and I/O Library providing an RDMA-based communication substrate, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, and virtualization (SR-IOV); running over networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD, SSD, NVM, and NVMe-SSD). Open questions: can RDMA be exploited, and what upper-level changes are needed?)
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Parallel Programming Models Overview
(Figure: three abstract machine models — Shared Memory Model (SHMEM, DSM): processes P1-P3 share one memory; Distributed Memory Model (MPI): each process has its own memory; Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories combined into a logical shared memory.)
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
(Figure: layered co-design view — Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); a Communication Library or Runtime providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance; running over networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA). Middleware co-design opportunities and challenges span these layers, targeting performance, scalability, and resilience.)
Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable collective communication
– Offload
– Non-blocking (see the sketch after this list)
– Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, …)
• Virtualization
• Energy-awareness
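As a concrete illustration of the non-blocking collectives mentioned above, the following minimal MPI sketch (standard MPI-3 API, usable with any MPI library such as MVAPICH2) overlaps an MPI_Iallreduce with independent computation. The buffer size and the compute placeholder are illustrative assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;                    /* 1M doubles, illustrative size */
    double *in  = malloc(n * sizeof(double));
    double *out = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) in[i] = 1.0;

    MPI_Request req;
    /* Start the reduction, then keep the CPU busy while the network works. */
    MPI_Iallreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... independent computation that does not touch 'in' or 'out' ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* collective now complete */

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```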
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,900 organizations in 86 countries
– More than 474,000 (> 0.47 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 9th, 556,104 cores (Oakforest-PACS) in Japan
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
MVAPICH2 Release Timeline and Downloads
(Figure: cumulative number of downloads from Sep 2004 to Jan 2018, growing to nearly 500,000; release milestones marked along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3rc1, and OSU INAM 0.9.3.)
Architecture of MVAPICH2 Software Family
(Figure: high-performance parallel programming models — MPI, PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk) — sit on top of a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis. The runtime supports modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU), with transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP, SR-IOV, multi-rail; MCDRAM*, NVLink*, CAPI*, XPMEM* — * upcoming), and transport mechanisms (shared memory, CMA, IVSHMEM).)
One-way Latency: MPI over IB with MVAPICH2
(Figure: OSU one-way latency vs. message size. Small-message latencies of roughly 0.98-1.19 us are measured across the five configurations — TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path — and a second panel shows large-message latency for the same adapters.)
Platforms: TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch.
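The latencies above come from OSU micro-benchmarks; the measurement pattern is a simple ping-pong. Below is a minimal C/MPI sketch of that pattern between ranks 0 and 1 (the iteration count, message size, and the lack of warm-up/skip iterations are simplifications relative to the real osu_latency benchmark).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000, size = 4;        /* 4-byte messages, illustrative */
    char buf[4096] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                      /* rank 0 sends first, then waits */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {               /* rank 1 echoes every message back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                            /* one-way latency = round-trip / 2 */
        printf("%d-byte latency: %.2f us\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```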
Bandwidth: MPI over IB with MVAPICH2
(Figure: OSU bandwidth vs. message size on the same platforms as the latency test. Peak unidirectional bandwidths of roughly 3,373, 6,356, 12,358, 12,366, and 12,590 MBytes/sec and peak bidirectional bandwidths of roughly 6,228, 12,161, 21,983, 22,564, and 24,136 MBytes/sec are measured across the TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path configurations.)
Platforms: TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch.
Hardware Multicast-aware MPI_Bcast on Stampede
(Figure: MPI_Bcast latency, Default vs. Multicast, on ConnectX-3-FDR (54 Gbps), 2.7 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch. Panels: small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, and 16-byte and 32-KByte messages as a function of the number of nodes.)
Performance of MPI_Allreduce on Stampede2 (10,240 Processes)
(Figure: OSU micro-benchmark, 64 PPN — MPI_Allreduce latency vs. message size (4 bytes to 256K bytes) for MVAPICH2, MVAPICH2-OPT, and IMPI.)
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available in MVAPICH2-X 2.3b.
Application Level Performance with Graph500 and Sort
Graph500 Execution Time:
(Figure: execution time at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs.)
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
• 8,192 processes
- 2.4X improvement over MPI-CSR
- 7.6X improvement over MPI-Simple
• 16,384 processes
- 1.5X improvement over MPI-CSR
- 13X improvement over MPI-Simple
Sort Execution Time:
(Figure: execution time for MPI vs. Hybrid designs at 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K input/process configurations.)
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
• 4,096 processes, 4 TB input size
- MPI – 2408 sec; 0.16 TB/min
- Hybrid – 1172 sec; 0.36 TB/min
- 51% improvement over the MPI design
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
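The hybrid designs above combine MPI with one-sided OpenSHMEM operations in the same program. The sketch below (standard MPI plus OpenSHMEM 1.2-style calls) shows the basic pattern of mixing a one-sided put into symmetric memory with an MPI collective; a unified runtime such as MVAPICH2-X is assumed, and the initialization ordering and error handling are simplified.

```c
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                      /* with a unified runtime, PEs == MPI ranks */

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE gets a remotely accessible counter. */
    int *counter = shmem_malloc(sizeof(int));
    *counter = 0;
    shmem_barrier_all();

    /* One-sided put: write our PE id into the neighbour's counter. */
    shmem_int_p(counter, me, (me + 1) % npes);
    shmem_barrier_all();

    /* Mix in an MPI collective over the same set of processes. */
    int sum = 0;
    MPI_Allreduce(counter, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0) printf("sum of shifted PE ids = %d\n", sum);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```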
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
• Standard MPI interfaces used for unified data movement
• Data movement is handled transparently inside MVAPICH2
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High performance and high productivity
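A minimal end-to-end sketch of the pattern above: with a CUDA-aware MPI library such as MVAPICH2-GDR, device pointers can be passed directly to MPI_Send/MPI_Recv. The message size and the two-rank layout are illustrative assumptions.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                    /* 1 MiB message, illustrative */
    int rank;
    char *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n);          /* GPU buffer, no host staging needed */
    cudaMemset(devbuf, rank, n);

    if (rank == 0)                            /* pass the device pointer straight to MPI */
        MPI_Send(devbuf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 1) printf("received %d bytes into GPU memory\n", n);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```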
Optimized MVAPICH2-GDR Design
(Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2 (no GDR) vs. MV2-GDR-2.3a — 1.88 us small-message latency (11X improvement) and roughly 9x-10x higher uni- and bi-directional bandwidth with GPUDirect RDMA.)
Platform: MVAPICH2-GDR-2.3a, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA.
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
(Figure: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs.)
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16.
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application.
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Data-Center Service Primitives
• Common services needed by data centers
– Better resource management
– Higher performance provided to higher layers
• Service primitives
– Soft Shared State
– Distributed Lock Management
– Global Memory Aggregator
• Network-based designs
– RDMA, Remote Atomic Operations (see the sketch below)
Soft Shared State
(Figure: multiple data-center applications Put updates into, and Get state from, a common soft shared state repository.)
Active Caching
• Dynamic data caching is challenging!
• Cache consistency and coherence
– Become more important than in the static case
(Figure: user requests arrive at proxy nodes, which cache responses; back-end nodes issue updates that must be reflected in the cache.)
RDMA based Client Polling Design
(Figure: front-end/back-end interaction on a request, showing the cache-hit and cache-miss paths; the front-end uses an RDMA version read to validate cached responses before replying.)
Active Caching – Performance Benefits
(Figure: data-center throughput for traces with increasing update rate (Traces 2-5) and under increasing load (0-64 compute threads), comparing No Cache, Invalidate All, and Dependency Lists schemes.)
• Higher overall performance – up to an order of magnitude
• Performance is sustained under loaded conditions
S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda, Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand. CCGrid 2005.
Resource Monitoring Services
• Traditional approaches
– Coarse-grained in nature
– Assume resource usage is consistent throughout the monitoring granularity (on the order of seconds)
• This assumption is no longer valid
– Resource usage is becoming increasingly divergent
• Fine-grained monitoring is desired but has additional costs
– High overheads, less accurate, slow in response
• Can we design a fine-grained resource monitoring scheme with low overhead and accurate resource usage?
Synchronous Resource Monitoring using RDMA (RDMA-Sync)
(Figure: a monitoring process on the front-end node reads kernel data structures exposed through /proc on the back-end node directly over RDMA, without involving the back-end CPU or its application threads.)
Impact of Fine-grained Monitoring with Applications
(Figure: percentage improvement on RUBiS and Zipf traces (alpha = 0.9, 0.75, 0.5, 0.25) for Socket-Sync, RDMA-Async, RDMA-Sync, and e-RDMA-Sync schemes.)
• Our schemes (RDMA-Sync and e-RDMA-Sync) achieve significant performance gains over existing schemes
K. Vaidyanathan, H.-W. Jin and D. K. Panda, Exploiting RDMA Operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-Based Servers, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations and Technologies, 2006.
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Architecture Overview of Memcached
• Three-layer architecture of Web 2.0
– Web servers, Memcached servers, database servers
• Memcached is a core component of the Web 2.0 architecture
• Distributed caching layer
– Allows aggregating spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
(Figure: Internet clients reach the web servers, which use the Memcached servers as a caching layer in front of the database servers.)
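The "cache the database query" usage pattern above looks roughly like the following C sketch using the standard libmemcached client API (the server address, key, TTL, and the database-lookup placeholder are illustrative assumptions).

```c
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* illustrative server */

    const char *key = "user:42:profile";
    size_t value_len;
    uint32_t flags;

    /* 1. Try the cache first. */
    char *value = memcached_get(memc, key, strlen(key), &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("cache hit: %.*s\n", (int)value_len, value);
        free(value);
    } else {
        /* 2. Cache miss: query the database (placeholder), then populate the cache. */
        const char *db_result = "name=Alice;plan=premium";
        memcached_set(memc, key, strlen(key),
                      db_result, strlen(db_result),
                      (time_t)300 /* TTL seconds */, (uint32_t)0);
        printf("cache miss: stored result from database\n");
    }

    memcached_free(memc);
    return 0;
}
```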
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– Master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• The Memcached server can serve both sockets and verbs clients simultaneously
• Memcached applications need not be modified; the verbs interface is used if available
(Figure: sockets and RDMA clients contact the master thread, which hands them off to sockets or verbs worker threads; all workers share the Memcached data — memory slabs and items.)
Memcached Performance (FDR Interconnect)
(Figure: Memcached GET latency vs. message size and throughput vs. number of clients, OSU-IB (FDR) vs. IPoIB (FDR); latency reduced by nearly 20X and throughput roughly 2X higher. Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR).)
• Memcached GET latency
– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
• Memcached throughput (4 bytes)
– 4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
– Nearly 2X improvement in throughput
Micro-benchmark Evaluation for OLDP Workloads
(Figure: latency (sec) and throughput (Kq/s) vs. number of clients (64-400) for Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps).)
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
– improve query latency by up to 66% over IPoIB (32Gbps)
– improve throughput by up to 69% over IPoIB (32Gbps)
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS '15.
Performance Evaluation on IB FDR + SATA/NVMe SSDs
(Figure: Set/Get latency breakdown — slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty — for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, when the data fits in memory and when it does not.)
• Memcached latency test with Zipf distribution; server with 1 GB memory; 32 KB key-value pair size; total data accessed is 1 GB (fits in memory) or 1.5 GB (does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
• RDMA-accelerated communication for Memcached Get/Set
• Hybrid 'RAM+SSD' slab management for higher data retention
• Non-blocking API extensions
– memcached_(iset/iget/bset/bget/test/wait)
– Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
– Ability to exploit communication/computation overlap
– Optional buffer re-use guarantees
• Adaptive slab manager with different I/O schemes for higher throughput
(Figure: client-side libmemcached library with blocking and non-blocking API flows over an RDMA-enhanced communication library, talking to a server that pairs an RDMA-enhanced communication library with a hybrid RAM+SSD slab manager.)
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016.
Performance Evaluation with Non-Blocking Memcached API
(Figure: average Set/Get latency broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory), for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b. H = hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API with buffer re-use guarantee.)
• Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
– >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
– >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similarly to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 285 organizations from 34 countries
• More than 26,700 downloads from the project site
The High-Performance Big Data (HiBD) Project
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
(Figure: execution time vs. data size for IPoIB (EDR) vs. OSU-IB (EDR) — RandomWriter at 80-160 GB and TeraGen at 80-240 GB; execution time reduced by 3x and 4x respectively.)
Cluster with 8 nodes with a total of 64 maps
• RandomWriter
– 3x improvement over IPoIB for 80-160 GB file size
• TeraGen
– 4x improvement over IPoIB for 80-240 GB file size
Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank
(Figure: PageRank total time for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, with 32 worker nodes/768 cores and 64 worker nodes/1536 cores.)
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
– 32 nodes/768 cores: total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet: Astronomy Application
• Kira Toolkit1: Distributed astronomy image processing toolkit implemented using Apache Spark.
• Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files.
• Compare RDMA Spark performance with the standard apache implementation using IPoIB.
1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/1507.03325, Aug 2015.
(Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark over IPoIB.)
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Deep Learning: New Challenges for Communication Runtimes
• Deep Learning frameworks are a different game altogether
– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers
• Existing state of the art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– CUDA-Aware MPI --> scale-out performance
• For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
– Efficient overlap of computation and communication
– Efficient large-message communication (reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?
(Figure: scale-up vs. scale-out performance space — cuDNN, cuBLAS, and NCCL rank high on scale-up; gRPC and Hadoop rank high on scale-out; MPI sits in between; the proposed co-designs target both.)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
Large Message Optimized Collectives for Deep Learning
(Figure: latency of Reduce (192 GPUs, 2-128 MB; 64 MB at 128-192 GPUs), Allreduce (64 GPUs, 2-128 MB; 128 MB at 16-64 GPUs), and Bcast (64 GPUs, 2-128 MB; 128 MB at 16-64 GPUs).)
• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available since MVAPICH2-GDR 2.2GA
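Data-parallel deep learning training drives exactly these large-message collectives: after each iteration every rank holds a local gradient buffer, and an allreduce produces the global average. A minimal C/MPI sketch of that step is shown below (the gradient size and the use of host memory are illustrative; with a CUDA-aware library the same call can take device pointers).

```c
#include <mpi.h>
#include <stdlib.h>

/* Average one iteration's gradients across all ranks (data-parallel SGD). */
static void allreduce_gradients(float *grad, int count, MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    /* Sum the large gradient buffer in place across all ranks ... */
    MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM, comm);

    /* ... then scale to get the average that every rank applies. */
    for (int i = 0; i < count; i++)
        grad[i] /= (float)nranks;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    const int count = 16 * 1024 * 1024;        /* ~64 MB of floats, illustrative */
    float *grad = calloc(count, sizeof(float));

    allreduce_gradients(grad, count, MPI_COMM_WORLD);   /* one training step */

    free(grad);
    MPI_Finalize();
    return 0;
}
```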
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
(Figure: MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) latency vs. message size on 16 GPUs — ~3X better for small/medium messages and ~1.2X better for large messages.)
• Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect.
MVAPICH2: Allreduce Comparison with Baidu and OpenMPI
(Figure: Allreduce latency vs. message size on 16 GPUs (4 nodes) for MVAPICH2-GDR*, Baidu-allreduce, and OpenMPI 3.0, across small, medium (512K-4M), and large (8M-512M) message ranges. Annotations across the three ranges: ~10X better, ~4X better, and ~30X better for MVAPICH2, with MV2 ~2X better than Baidu and OpenMPI ~5X slower than Baidu.)
• 16 GPUs (4 nodes): MVAPICH2-GDR(*) vs. Baidu-Allreduce and OpenMPI 3.0
*Available with MVAPICH2-GDR 2.3a
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
– Multi-GPU training within a single node
– Performance degradation for GPUs across different sockets
– Limited scale-out
• OSU-Caffe: MPI-based parallel training
– Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
– Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
– Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
(Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); default Caffe is an invalid use case at this scale.)
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Architecture Overview of gRPC
Key features:
• Simple service definition
• Works across languages and platforms
– C++, Java, Python, Android Java, etc.
– Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated authentication
• Used by Google (several of Google's cloud products and Google's externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication!
(Figure: large-scale distributed systems composed of micro services communicating through gRPC.)
Source: http://www.grpc.io/
Performance Benefits for AR-gRPC with Micro-Benchmark
(Figure: point-to-point latency vs. payload size — small (2 B-8 KB), medium (16 KB-512 KB), and large (1 MB-8 MB) — for Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps).)
• AR-gRPC (OSU design) latency on SDSC-Comet-FDR
– Up to 2.7x performance speedup over Default gRPC (IPoIB) for small messages
– Up to 2.8x performance speedup over Default gRPC (IPoIB) for medium messages
– Up to 2.5x performance speedup over Default gRPC (IPoIB) for large messages
R. Biswas, X. Lu, and D. K. Panda, Accelerating TensorFlow with Adaptive RDMA-based gRPC. (Under Review)
Performance Benefit for TensorFlow (Resnet50)
(Figure: images/second vs. batch size per GPU (8-64) on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps).)
• TensorFlow Resnet50 performance evaluation on an IB EDR cluster
– Up to 26% performance speedup over Default gRPC (IPoIB) for 4 nodes
– Up to 127% performance speedup over Default gRPC (IPoIB) for 8 nodes
– Up to 133% performance speedup over Default gRPC (IPoIB) for 12 nodes
Performance Benefit for TensorFlow (Inception3)
(Figure: images/second vs. batch size per GPU (8-64) on 4, 8, and 12 nodes for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps).)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
– Up to 47% performance speedup over Default gRPC (IPoIB) for 4 nodes
– Up to 116% performance speedup over Default gRPC (IPoIB) for 8 nodes
– Up to 153% performance speedup over Default gRPC (IPoIB) for 12 nodes
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Benefits of Deep Learning over Big Data (DLoBD)
– Easily integrate deep learning components into Big Data processing workflows
– Easily access the data stored in Big Data systems
– No need to set up new dedicated deep learning clusters; reuse existing big data analytics clusters
• Challenges
– Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
– What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization on DLoBD stacks
– CaffeOnSpark, TensorFlowOnSpark, and BigDL
– IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
– Performance, accuracy, scalability, and resource utilization
– RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
(Figure: epoch time and accuracy vs. epoch number for IPoIB and RDMA; RDMA reaches the same accuracy 2.6X faster.)
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
• High-Performance Programming Models Support for HPC Clusters
• RDMA-Enabled Communication Substrate for Common Services in Datacenters
• High-Performance and Scalable Memcached
• RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
• Deep Learning with Scale-Up and Scale-Out
– Caffe, CNTK, and TensorFlow
• Virtualization Support with SR-IOV and Containers
Designing Next-Generation Middleware for Clusters and Datacenters
Single Root I/O Virtualization (SR-IOV)
• SR-IOV is providing new opportunities to design cloud-based datacenters with very low overhead
• Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through
• Works with 10/40/100 GigE and InfiniBand
Can High-Performance and Virtualization be Combined?
• Virtualization has many benefits
– Fault-tolerance
– Job migration
– Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
– OpenStack, Docker, and Singularity
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14.
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14.
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid '15.
Application-Level Performance on Chameleon
(Figure: Graph500 execution time for problem sizes (22,20)-(26,16) and SPEC MPI2007 execution time for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native.)
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 processes
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
Application-Level Performance on Docker with MVAPICH2
(Figure: NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time and Graph 500 BFS execution time at scale/edgefactor (20,16) for 1 container x 16 processes, 2 containers x 8 processes, and 4 containers x 4 processes, comparing Container-Def, Container-Opt, and Native.)
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500
Application-Level Performance on Singularity with MVAPICH2
(Figure: NPB Class D (CG, EP, FT, IS, LU, MG) execution time and Graph500 BFS execution time for problem sizes (22,16)-(26,20), Singularity vs. Native.)
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17.
RDMA-Hadoop with Virtualization
(Figure: CloudBurst and Self-Join execution times in Default Mode and Distributed Mode for RDMA-Hadoop vs. RDMA-Hadoop-Virt.)
• 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
• 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
S. Gugnani, X. Lu, D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
• Next generation Clusters and Data Centers need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• Presented an overview of the networking technology trends
• Presented some of the approaches and results along these directions
• Enable the HPC, Big Data, Deep Learning, and Cloud communities to take advantage of modern networking technologies
• Many other open issues need to be solved
Concluding Remarks
Funding Acknowledgments
(Slide: sponsor logos for funding support and equipment support.)
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students – A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
• Looking for Bright and Enthusiastic Personnel to join as
– Post-Doctoral Researchers
– PhD Students
– MPI Programmer/Software Engineer
– Hadoop/Big Data Programmer/Software Engineer
• If interested, please contact me at this conference and/or send an e-mail to [email protected]
Multiple Positions Available in My Group
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/