InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage

A Tutorial at IT4 Innovations'18

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

The latest version of the slides can be obtained from
http://www.cse.ohio-state.edu/~panda/it4-advanced.pdf
Network Based Computing Laboratory - IT4 Innovations'18
High-End Computing (HEC): ExaFlop & ExaByte
• ExaFlop: 100 PFlops in 2016; 1 EFlops in 2019-2020?
  – Expected to have an ExaFlop system in 2019-2020!
• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
Various High-End Computing (HEC) Systems
• Compute Clusters
• Storage Clusters
• Multi-tier Data Centers
• Cloud Computing Environments
• Big Data Processing (Hadoop and Spark)
• Web 2.0 with Memcached
Various Clusters (Compute, Storage and Datacenters)
[Diagram: a compute cluster (frontend and compute nodes on a LAN) connected over a LAN/WAN to a storage cluster (a meta-data manager holding meta-data and I/O server nodes holding data), and an enterprise multi-tier datacenter for visualization and mining (Tier1 routers/servers, Tier2 application servers, and Tier3 database servers, interconnected by switches)]
Cloud Computing Environments
[Diagram: physical machines, each hosting multiple virtual machines, connected over a LAN/WAN to a virtualized network file system (a physical meta-data manager holding meta-data and physical I/O server nodes holding data)]
Overview of Apache Hadoop Architecture
• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
• Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
• http://hadoop.apache.org
[Diagram: Hadoop 1.x – MapReduce (cluster resource management & data processing) over HDFS over Hadoop Common/Core (RPC, ...); Hadoop 2.x – MapReduce (data processing) and other models (data processing) over YARN (cluster resource management & job scheduling) over HDFS over Hadoop Common/Core (RPC, ...)]
Big Data Processing with Hadoop Components
• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the high amount of communication during intermediate phases can be further optimized
[Diagram: user applications on top of the Hadoop framework – MapReduce and HBase over HDFS and Hadoop Common (RPC)]
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Runs standalone or on YARN or Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
• http://spark.apache.org
Memcached Architecture
• Distributed caching layer
  – Allows aggregation of spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
• Convergence of HPC, Big Data, and Deep Learning!
• Increasing need to run these applications on the Cloud!!
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Images: multi-core processors; high-performance interconnects – InfiniBand (<1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; example systems – Tianhe-2, Titan, K-Computer, Sunway TaihuLight]
Modern Interconnects and Protocols with IB, HSE, and Omni-Path
[Diagram: protocol stacks beneath the Application / Middleware layer; each column lists interface, protocol, adapter, and switch]
• Sockets interface
  – TCP/IP (kernel space) over Ethernet adapter and Ethernet switch (1/10/25/40/50/100 GigE)
  – TCP/IP over IPoIB (kernel space) with InfiniBand adapter and InfiniBand switch
  – TCP/IP with hardware offload over 10/40 GigE-TOE adapter and Ethernet switch
  – RSockets (user space) over InfiniBand adapter and InfiniBand switch
  – SDP (user space) over InfiniBand adapter and InfiniBand switch
• Verbs interface
  – RDMA (user space) over iWARP adapter and Ethernet switch (iWARP)
  – RDMA (user space) over RoCE adapter and Ethernet switch (RoCE)
  – RDMA (user space) over InfiniBand adapter and InfiniBand switch (IB Native)
• OFI interface
  – RDMA (user space) over Omni-Path adapter and Omni-Path switch (100 Gb/s)
Large-scale InfiniBand Installations
• 163 IB Clusters (32.6%) in the Nov'17 Top500 list (http://www.top500.org)
• Installations in the Top 50 (17 systems):
  – 19,860,000 cores (Gyoukou) in Japan (4th)
  – 241,108 cores (Pleiades) at NASA/Ames (17th)
  – 220,800 cores (Pangea) in France (21st)
  – 144,900 cores (Cheyenne) at NCAR/USA (24th)
  – 155,150 cores (Jureca) in Germany (29th)
  – 72,800 cores Cray CS-Storm in US (30th)
  – 72,800 cores Cray CS-Storm in US (31st)
  – 78,336 cores (Electra) at NASA/USA (33rd)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (34th)
  – 60,512 cores (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
  – 60,512 cores (DGX SATURN V) at NVIDIA/USA (36th)
  – 72,000 cores (HPC2) in Italy (37th)
  – 152,692 cores (Thunder) at AFRL/USA (40th)
  – 99,072 cores (Mistral) at DKRZ/Germany (42nd)
  – 147,456 cores (SuperMUC) in Germany (44th)
  – 86,016 cores (SuperMUC Phase 2) in Germany (45th)
  – 74,520 cores (Tsubame 2.5) at Japan/GSIC (48th)
  – 66,000 cores (HPC3) in Italy (51st)
  – 194,616 cores (Cascade) at PNNL (53rd)
  – and many more!
Large-scale Omni-Path Installations
• 35 Omni-Path Clusters (7%) in the Nov'17 Top500 list (http://www.top500.org)
  – 556,104 cores (Oakforest-PACS) at JCAHPC in Japan (9th)
  – 368,928 cores (Stampede2) at TACC in USA (12th)
  – 135,828 cores (Tsubame 3.0) at TiTech in Japan (13th)
  – 314,384 cores (Marconi XeonPhi) at CINECA in Italy (14th)
  – 153,216 cores (MareNostrum) at BSC in Spain (16th)
  – 95,472 cores (Quartz) at LLNL in USA (49th)
  – 95,472 cores (Jade) at LLNL in USA (50th)
  – 49,432 cores (Mogon II) at Universitaet Mainz in Germany (65th)
  – 38,552 cores (Molecular Simulator) in Japan (70th)
  – 35,280 cores (Quriosity) at BASF in Germany (71st)
  – 54,432 cores (Marconi Xeon) at CINECA in Italy (72nd)
  – 46,464 cores (Peta4) at University of Cambridge in UK (75th)
  – 53,352 cores (Grizzly) at LANL in USA (85th)
  – 45,680 cores (Endeavor) at Intel in USA (86th)
  – 59,776 cores (Cedar) at SFU in Canada (94th)
  – 27,200 cores (Peta HPC) in Taiwan (95th)
  – 39,774 cores (Nel) at LLNL in USA (101st)
  – 40,392 cores (Serrano) at SNL in USA (112th)
  – 40,392 cores (Cayenne) at SNL in USA (113th)
  – and many more!
Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A
Advanced Features of InfiniBand
• SRQ and XRC
• DCT
• User-Mode Memory Registration (UMR)
• On-Demand Paging
• CORE-Direct Offload
• SHArP
Memory Overheads in Large-Scale Systems
• Different transport protocols with IB
  – Reliable Connection (RC) is the most common
  – Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive messages from any sender
  – Buffer requirement can increase with system size
• Connections need to be established across processes under RC
  – Each connection requires a certain amount of memory for handling related data structures
  – Memory required for all connections can increase with system size
• Both issues have become critical as large-scale IB deployments have taken place
  – Being addressed by both the IB specification and upper-level middleware
Shared Receive Queue (SRQ)
• SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
  – Introduced in specification v1.2
• Instead of posting P buffers on each of its (M*N) - 1 connections, a process posts a single SRQ of depth Q, where 0 < Q << P*((M*N) - 1)
[Diagram: a process with one RQ per connection vs. a process with one SRQ for all connections]
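The saving can be made concrete with a small arithmetic sketch (the values of P, M, N, and the SRQ depth Q used below are hypothetical examples, not figures from the tutorial):

```c
#include <assert.h>

/* Without SRQ: a process posts P receive buffers on each of its
   (M*N - 1) RC connections (M processes/node, N nodes). */
long rq_buffers(long p, long m, long n) {
    return p * (m * n - 1);
}

/* With SRQ: one shared queue of depth Q serves every connection,
   where 0 < Q << P*((M*N) - 1). */
long srq_buffers(long q) {
    return q;
}
```

With 64 buffers per connection on a hypothetical 1,024-node, 16-process-per-node system, per-process posting drops from roughly a million buffers to whatever SRQ depth the runtime chooses.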
eXtended Reliable Connection (XRC)
• Each QP takes at least one page of memory
  – Connections between all processes are very costly for RC
• New IB transport added: eXtended Reliable Connection
  – Allows connections between nodes instead of processes
• With M = # of processes/node and N = # of nodes:
  – RC: M^2 x (N - 1) connections/node
  – XRC: M x (N - 1) connections/node
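A small sketch of the two per-node connection-count formulas (the M and N values used in the note below are hypothetical examples):

```c
#include <assert.h>

/* RC: every process pairs with every process on every other node. */
long rc_conns_per_node(long m, long n) {
    return m * m * (n - 1);
}

/* XRC: connections are per remote node, so each process needs only
   one connection per remote node. */
long xrc_conns_per_node(long m, long n) {
    return m * (n - 1);
}
```

For example, with 16 processes/node on 1,024 nodes, XRC cuts per-node connections by a factor of M = 16 (261,888 down to 16,368).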
XRC Addressing
• XRC uses SRQ Numbers (SRQNs) to direct where an operation should complete
• Hardware does all routing of data, so p2 is not actually involved in the data transfer
• Connections are not bi-directional, so p3 cannot send to p0
[Diagram: processes 0-3, each with an SRQ (#1 or #2); sends name the target SRQ number, e.g. "send to #2", "send to #1"]
DC Connection Model, Communication Objects and Addressing Scheme
• Constant connection cost
  – One QP for any peer
• Full feature set
  – RDMA, atomics, etc.
• Communication objects & addressing scheme
  – DCINI
    • Analogous to the send QPs
    • Can transmit data to any peer
  – DCTGT
    • Receive objects
    • Must be backed by an SRQ
    • Identified on a node by a "DCT Number"
  – Messages are routed with a combination of DCT Number + LID
  – Requires a "DC Key" to enable communication
    • Must be the same across all processes
[Diagram: nodes 0-3 (processes P0-P7) connected through an IB network]
User-Mode Memory Registration (UMR)
• Supports direct local and remote non-contiguous memory access
• Avoids packing at the sender and unpacking at the receiver
• Steps to create memory regions with UMR (process, kernel, and HCA/RNIC interact):
  1. UMR creation request: send the number of blocks
  2. HCA issues uninitialized memory keys for future UMR use
  3. Kernel maps virtual to physical and pins the region into physical memory
  4. HCA caches the virtual-to-physical mapping
On-Demand Paging (ODP)
• Applications no longer need to pin down the underlying physical pages
  – Memory Regions (MRs) are NEVER pinned by the OS
  – Pages are paged in by the HCA when needed and paged out by the OS when reclaimed
• ODP can be divided into two classes
  – Explicit ODP
    • Applications still register memory buffers for communication, but this operation is used to define access control for IO rather than to pin down the pages
  – Implicit ODP
    • Applications are provided with a special memory key that represents their complete address space; there is no need to register any virtual address range
• Advantages
  – Simplifies programming
  – Unlimited MR sizes
  – Physical memory optimization
Implicit On-Demand Paging (ODP)
• Introduced by Mellanox to avoid pinning the pages of registered memory regions
• An ODP-aware runtime can reduce the size of pin-down buffers while maintaining performance
[Chart: execution time (s) of applications at 256 processes (CG, EP, FT, IS, MG, LU, SP, AWP-ODC, Graph500), comparing pin-down, explicit ODP, and implicit ODP]
M. Li, X. Lu, H. Subramoni, and D. K. Panda, "Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand", HiPC '17
Collective Offload Support on the Adapters
• Performance of collective operations (broadcast, barrier, reduction, all-reduce, etc.) is very critical to the overall performance of MPI applications
• Currently done with basic pt-to-pt operations (send/recv and RDMA) using host-based operations
• Mellanox ConnectX-2, ConnectX-3, ConnectX-4, and ConnectX-5 adapters support offloading some of these operations to the adapters (CORE-Direct)
  – Provides overlap of computation and collective communication
  – Reduces OS jitter (since everything is done in hardware)
One-to-many Multi-Send
• Sender creates a task list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task list
  – A wait WQE is added to make the HCA wait for the ACK packet from the receiver
[Diagram: the application posts the task list (send, send, send, wait, ...) to a management queue (MQ) on the InfiniBand HCA, which drives the send/recv queues and completion queues (send CQ, recv CQ, MCQ) over the physical link]
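The task list above can be modeled with a short sketch (a toy list builder for illustration only, not the actual CORE-Direct API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy singly linked task list: one send WQE per registered receiver,
   appended at the rear, followed by a trailing wait WQE for the ACK. */
enum wqe_type { WQE_SEND, WQE_WAIT };

struct wqe {
    enum wqe_type type;
    struct wqe *next;
};

struct wqe *build_multisend_list(int nreceivers) {
    struct wqe *head = NULL, **tail = &head;
    for (int i = 0; i < nreceivers; i++) {   /* one send per receiver */
        struct wqe *s = malloc(sizeof *s);
        s->type = WQE_SEND;
        s->next = NULL;
        *tail = s;                           /* append at the rear */
        tail = &s->next;
    }
    struct wqe *w = malloc(sizeof *w);       /* trailing wait WQE */
    w->type = WQE_WAIT;
    w->next = NULL;
    *tail = w;
    return head;
}
```

Building the list for 3 receivers yields four WQEs: three sends followed by one wait.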
Scalable Hierarchical Aggregation Protocol (SHArP)
• Management and execution of MPI operations in the network by using SHArP
  – Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
• AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*
[Figures: physical network topology and the logical SHArP tree]
* Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. R. L. Graham, D. Bureddy, P. Lui, G. Shainer, H. Rosenstock, G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir, L. Levi, A. Margolin, T. Ronen, A. Shpiner, O. Wertheim, E. Zahavi, Mellanox Technologies, Inc. First Workshop on Optimization of Communication in HPC Runtime Systems (COM-HPC 2016)
The Ethernet Ecosystem
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
Emergence of 25 GigE and Benefits
[Figure: 25 GigE slashes the cost of top-of-rack switches (Source: IEEE 802.3)]
Courtesy:
http://www.eetimes.com/document.asp?doc_id=1323184
http://www.networkcomputing.com/data-centers/25-gbe-big-deal-will-arrive/1714647938
http://www.plexxi.com/2014/07/whats-25-gigabit-ethernet-want/
http://www.qlogic.com/Products/adapters/Pages/25Gb-Ethernet.aspx
Matching PCIe and Ethernet Speeds
• Requires half the number of lanes compared to 40G (x4 instead of x8 PCIe lanes)
• Better PCIe bandwidth utilization (25/32 = 78% vs. 40/64 = 62.5%) with lower power impact

Ethernet Rate (Gb/s)   PCIe Gen3 Lanes (Single Port)   PCIe Gen3 Lanes (Dual Port)
100                    16                              32 (uncommon)
40                     8                               16
25                     4                               8
10                     2                               4

Courtesy: http://www.ieee802.org/3/cfi/0314_3/CFI_03_0314.pdf
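The lane counts in the table follow from rounding the Ethernet rate up to the next power-of-two PCIe Gen3 width at roughly 8 Gb/s per lane — a sketch of the arithmetic, not a PCIe library:

```c
#include <assert.h>

/* One PCIe Gen3 lane carries ~8 Gb/s; ports come in power-of-two
   widths (x1, x2, x4, x8, x16). */
static const double GEN3_LANE_GBPS = 8.0;

int gen3_lanes_for_rate(double rate_gbps) {
    int lanes = 1;
    while (lanes * GEN3_LANE_GBPS < rate_gbps)
        lanes *= 2;          /* round up to the next PCIe width */
    return lanes;
}

/* Fraction of the PCIe link the Ethernet rate actually uses. */
double pcie_utilization(double rate_gbps) {
    return rate_gbps / (gen3_lanes_for_rate(rate_gbps) * GEN3_LANE_GBPS);
}
```

This reproduces the table (25 GigE: 4 lanes at 78% utilization; 40 GigE: 8 lanes at 62.5%).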
Detailed Specifications for 25 and 50 GigE and Looking Forward
• The 25G & 50G Ethernet specification extends IEEE 802.3 to work at increased data rates
• Features in Draft 1.4 of the specification
  – PCS/PMA operation at 25 Gb/s over a single lane
  – PCS/PMA operation at 50 Gb/s over two lanes
  – Optional Forward Error Correction modes
  – Optional auto-negotiation using an OUI next page
  – Optional link training
• Standards for 50 Gb/s, 200 Gb/s and 400 Gb/s under development
  – Next standards expected around 2017-2018
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
Ethernet Roadmap – To Terabit Speeds?
• 50G, 100G, 200G and 400G by 2018-2019
• Terabit speeds by 2025?
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
RDMA over Converged (Enhanced) Ethernet
• Takes advantage of IB and Ethernet
  – Software is written with IB Verbs
  – Link layer is Converged (Enhanced) Ethernet (CE)
• Network stack comparison (the application uses IB Verbs in all three cases):
  – InfiniBand: IB transport / IB network / InfiniBand link layer
  – RoCE: IB transport / IB network / Ethernet link layer
  – RoCE v2: IB transport / UDP + IP / Ethernet link layer
• Pros (IB vs. RoCE)
  – Works natively in Ethernet environments
    • The entire Ethernet management ecosystem is available
  – Has all the benefits of IB Verbs
  – Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: additional benefits over RoCE
  – Traditional network management tools apply
  – ACLs (metering, accounting, firewalling)
  – IGMP snooping for optimized multicast
  – Network monitoring tools
• Cons
  – Network bandwidth might be limited by Ethernet switches
    • 10/40GE switches available; 56 Gbps IB is available
• Packet header comparison
  – RoCE: Ethernet L2 header (Ethertype) + IB GRH (L3) + IB BTH+ (L4)
  – RoCE v2: Ethernet L2 header (Ethertype) + IP header (L3, Proto #) + UDP header (Port #) + IB BTH+ (L4)
Courtesy: OFED, Mellanox
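The header comparison can be quantified with the standard header lengths (Ethernet L2 = 14 bytes, IB GRH = 40, IPv4 = 20, UDP = 8, IB BTH = 12); VLAN tags, ICRC, and FCS are ignored in this sketch:

```c
#include <assert.h>

enum {
    ETH_L2   = 14,  /* Ethernet L2 header (includes Ethertype) */
    IB_GRH   = 40,  /* IB Global Routing Header (L3) */
    IPV4_HDR = 20,  /* IPv4 header (L3, carries the Proto #) */
    UDP_HDR  = 8,   /* UDP header (carries the Port #) */
    IB_BTH   = 12   /* IB Base Transport Header (L4) */
};

int roce_header_bytes(void)    { return ETH_L2 + IB_GRH + IB_BTH; }
int roce_v2_header_bytes(void) { return ETH_L2 + IPV4_HDR + UDP_HDR + IB_BTH; }
```

Swapping the 40-byte GRH for IP + UDP makes the RoCE v2 header slightly smaller while making packets routable by ordinary IP gear.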
Software Convergence with OpenFabrics
• Open source organization (formerly OpenIB) – www.openfabrics.org
• Incorporates IB, RoCE, and iWARP in a unified manner
  – Support for Linux and Windows
• Users can download the entire stack and run it
  – Latest stable release is OFED 4.8.1
    • New naming convention to get aligned with Linux kernel development
  – OFED 4.8.2 is under development
OpenFabrics Software Stack
[Diagram: layered stack from hardware (InfiniBand HCA and iWARP R-NIC with hardware-specific drivers), through the mid-layer (kernel-level verbs/API, connection manager, connection manager abstraction (CMA), MAD, SA client, SMA) and upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), to user APIs (user-level verbs/API, user-level MAD API, UDAPL, SDP lib, OpenSM, diag tools) and applications (IP-based apps, sockets-based access, various MPIs, block storage access, access to file systems, clustered DB access); user-space paths can bypass the kernel]
Key:
SA – Subnet Administrator
MAD – Management Datagram
SMA – Subnet Manager Agent
PMA – Performance Manager Agent
IPoIB – IP over InfiniBand
SDP – Sockets Direct Protocol
SRP – SCSI RDMA Protocol (Initiator)
iSER – iSCSI RDMA Protocol (Initiator)
RDS – Reliable Datagram Service
UDAPL – User Direct Access Programming Lib
HCA – Host Channel Adapter
R-NIC – RDMA NIC
Programming with OpenFabrics
Sample steps (sender and receiver):
1. Create QPs (endpoints)
2. Register memory for sending and receiving
3. Send
  – Channel semantics
    • Post receive
    • Post send
  – RDMA semantics
[Diagram: process, kernel, and HCA on both the sender and the receiver]
Verbs: Post Send
• Prepare and post a send descriptor (channel semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;

if (len < max_inline_size) {
    sr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
} else {
    sr.send_flags = IBV_SEND_SIGNALED;
}

sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
Verbs: Post RDMA Write
• Prepare and post an RDMA write (memory semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE; /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;

sr.wr.rdma.remote_addr = remote_addr; /* remote virtual addr. */
sr.wr.rdma.rkey = rkey; /* from remote node */

sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf; /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
Libfabrics Connection Model
[Diagram: server and client processes each talk to an OFI provider (Sockets/Verbs/PSM) over a GigE/IB/TrueScale HCA]
Server side:
1. fi_fabric – open fabric
2. fi_domain – open domain
3. fi_mr_reg – register memory
4. fi_passive_ep – open passive EP
5. fi_eq_open – open event queue
6. fi_pep_bind – bind passive EP to EQ
7. fi_listen – listen for incoming connections
8. fi_eq_sread – when a new event is detected on the EQ, validate that the event == FI_CONNREQ
Client side:
1. fi_fabric – open fabric
2. fi_domain – open domain
3. fi_mr_reg – register memory
4. fi_endpoint – open endpoint
5. fi_cq_open – open completion queue
6. fi_ep_bind – bind EP to CQ
7. fi_connect – connect to remote EP
Libfabrics Connection Model (Cont.)
Server side (after FI_CONNREQ):
1. fi_ep_open – open endpoint
2. fi_cq_open – open completion queue
3. fi_ep_bind – bind EP to CQ
4. fi_accept – accept connection
5. fi_eq_sread – validate that the new event == FI_CONNECTED
Client side:
1. fi_eq_sread – when a new event is detected on the EQ, validate that the event == FI_CONNECTED
Data transfer (both sides):
• fi_recv – post recv
• fi_send – post send
• fi_cq_read / fi_cq_sread – poll / wait for data and completions
Teardown (both sides):
• fi_shutdown – shutdown channel
• fi_close – close all open resources
Scalable EndPoints vs Shared TX/RX Context
• Normal endpoint: one endpoint with its own transmit/receive contexts and completion queue
  – Similar to a socket / QP
  – Simple / easy to use
• Shared TX/RX context: multiple endpoints share one set of transmit/receive contexts
  – Shares HW resources
  – # EPs >> HW resources
• Scalable endpoints: one endpoint with multiple transmit/receive contexts
  – Uses more HW resources
  – Higher performance per EP
Courtesy: http://www.slideshare.net/seanhefty/ofa-workshop2015ofiwg?ref=http://ofiwg.github.io/libfabric/
Libfabrics: Fabric, Domain and Endpoint Creation
• Open fabric, domain and EP
struct fi_info *info, *hints;
struct fid_fabric *fabric;
struct fid_domain *dom;
struct fid_ep *ep;
hints = fi_allocinfo();
/* Obtain fabric information */
rc = fi_getinfo(VERSION, node, service, flags, hints, &info);
/* Free fabric information */
fi_freeinfo(hints);
/* Open fabric */
rc = fi_fabric(info->fabric_attr, &fabric, NULL);
/* Open domain */
rc = fi_domain(fabric, entry.info, &dom, NULL);
/* Open End point */
rc = fi_endpoint(dom, entry.info, &ep, NULL);
Libfabrics: Memory Registration
• Open fabric/domain and create EQ and EPs to end nodes
  – Connection establishment is abstracted out using connection management APIs (fi_cm) – fi_listen, fi_connect, fi_accept
  – A fabric provider can implement them with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication
• Register memory
int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len, uint64_t access, uint64_t offset, uint64_t requested_key, uint64_t flags, struct fid_mr **mr, void *context);
rc = fi_mr_reg(domain, buffer, size, FI_SEND | FI_RECV,
0, 0, 0, &mr, NULL);
rc = fi_mr_reg(domain, buffer, size,
FI_REMOTE_READ | FI_REMOTE_WRITE, 0,
user_key, 0, &mr, NULL);
Permissions can be set as needed
Libfabrics: Post Receive (Channel Semantics)
• Prepare and post a receive request
ssize_t fi_recv(struct fid_ep *ep, void * buf, size_t len,
void *desc, fi_addr_t src_addr,
void *context);
- For connected EPs
ssize_t fi_recvmsg(struct fid_ep *ep,
const struct fi_msg *msg, uint64_t flags);
- For connected and un-connected EPs
struct fid_ep *ep;
struct fid_mr *mr;
/* Post recv request */
rc = fi_recv(ep, buf, size, fi_mr_desc(mr), 0,
(void *)(uintptr_t)RECV_WCID);
Libfabrics: Post Send (Channel Semantics)
• Prepare and post a send descriptor
ssize_t fi_send(struct fid_ep *ep, void *buf, size_t len,
void *desc, fi_addr_t dest_addr, void *context);
- For connected EPs
ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg,
uint64_t flags);
- For connected and un-connected EPs
ssize_t fi_inject(struct fid_ep *ep, void *buf, size_t len,
fi_addr_t dest_addr);
- Buffer available for re-use as soon as function returns
- No completion event generated for send
struct fid_ep *ep;
struct fid_mr *mr;
static fi_addr_t remote_fi_addr;
rc = fi_send(ep, buf, size, fi_mr_desc(mr), 0,
(void *)(uintptr_t)SEND_WCID);
rc = fi_inject(ep, buf, size, remote_fi_addr);
Libfabrics: Post Remote Read (Memory Semantics)
• Prepare and post a remote read request
ssize_t fi_read(struct fid_ep *ep, void *buf, size_t len,
void *desc, fi_addr_t src_addr, uint64_t addr,
uint64_t key, void *context);
- For connected EPs
ssize_t fi_readmsg(struct fid_ep *ep,
const struct fi_msg_rma *msg,
uint64_t flags);
- For connected and un-connected EPs
struct fid_ep *ep;
struct fid_mr *mr;
struct fi_context fi_ctx_read;
/* Post remote read request */
ret = fi_read(ep, buf, size, fi_mr_desc(mr), local_addr,
remote_addr, remote_key, &fi_ctx_read);
Libfabrics: Post Remote Write (Memory Semantics)
• Prepare and post a remote write descriptor
ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len,
void *desc, fi_addr_t dest_addr, uint64_t addr,
uint64_t key, void *context);
- For connected EPs
ssize_t fi_writemsg(struct fid_ep *ep,
const struct fi_msg_rma *msg, uint64_t flags);
- For connected and un-connected EPs
ssize_t fi_inject_write(struct fid_ep *ep, const void *buf,
size_t len, fi_addr_t dest_addr,
uint64_t addr, uint64_t key);
- Buffer available for re-use as soon as function returns
- No completion event generated for send
ssize_t fi_writedata(struct fid_ep *ep, const void *buf,
size_t len, void *desc, uint64_t data,
fi_addr_t dest_addr, uint64_t addr,
uint64_t key, void *context);
- Similar to fi_write
- Allows for the sending of remote CQ data
Network Management Infrastructure and Tools
• Management infrastructure
  – Subnet manager
  – Diagnostic tools
    • System discovery tools
    • System health monitoring tools
    • System performance monitoring tools
  – Fabric management tools
Concepts in IB Management
• Agents
  – Processes or hardware units running on each adapter, switch, and router (everything on the network)
  – Provide the capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages
InfiniBand Management
• All IB management happens using packets called Management Datagrams
  – Popularly referred to as "MAD packets"
• Four major classes of management mechanisms
  – Subnet management
  – Subnet administration
  – Communication management
  – General services
• Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
– Each adapter, switch, and router has an agent running
– Communication between the SM and agents, or between agents, happens using MAD packets called Subnet Management Packets (SMPs)
• The SM’s responsibilities include:
– Discovering the physical topology of the subnet
– Assigning LIDs to the end nodes, switches, and routers
– Populating switches and routers with routing paths
– Subnet sweeps to discover topology changes
Subnet Management & Administration
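The discover/assign/populate steps above can be illustrated with a toy model. This is a sketch with hypothetical data structures, not the SMP protocol: a real SM exchanges MAD packets with SMAs, while this sketch just walks an in-memory adjacency map.

```python
# Toy model of a subnet manager's first sweep: discover nodes by
# breadth-first search from the SM's own node, assign each node a LID,
# and populate per-node forwarding tables along the discovery tree.
from collections import deque

def first_sweep(adjacency, sm_node):
    """adjacency: dict node -> {neighbor: out_port}; returns (lids, fdb)."""
    lids = {}
    next_lid = 1
    parent = {sm_node: None}
    order = []
    q = deque([sm_node])
    while q:                           # breadth-first discovery
        node = q.popleft()
        lids[node] = next_lid          # assign a LID as each node is found
        next_lid += 1
        order.append(node)
        for nbr in adjacency[node]:
            if nbr not in parent:
                parent[nbr] = node
                q.append(nbr)
    # Populate forwarding tables: route every LID along the BFS tree.
    fdb = {n: {} for n in adjacency}   # node -> {dest LID: out-port}
    for dest in order:
        node, hop = parent[dest], dest
        while node is not None:        # walk back toward the SM
            fdb[node][lids[dest]] = adjacency[node][hop]
            hop, node = node, parent[node]
    return lids, fdb
```

A second sweep would diff the discovered topology against the stored one and update only the affected forwarding entries, which is why sweep frequency matters on large fabrics.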
[Diagram: a subnet manager discovering compute nodes and switches over active and inactive links, and servicing multicast join and multicast setup requests]
• The SM can be configured to sweep once or continuously
• On the first sweep:
– All ports are assigned LIDs
– All routes are set up on the switches
• On subsequent sweeps:
– If there has been any change to the topology, the appropriate routes are updated
– If DLID X is down, packets are not sent all the way
• The first hop will not have a forwarding entry for LID X
• The sweep time is configured by the system administrator
– Cannot be too high or too low
Subnet Manager Sweep Behavior
• A single subnet manager has issues on large systems
– Performance and overhead of scanning
• Hardware implementations on switches are faster, but work only for small systems (memory usage)
• Software implementations are more popular (OpenSM)
– Multi-SM models
• Two benefits: fault tolerance (if one SM dies) and scalability (different SMs can handle different portions of the network)
• Current SMs only provide a fault-tolerance model
• Network subsetting is still being investigated
• Asynchronous events specified to improve scalability
– E.g., traps are events sent by an agent to the SM when a link goes down
Subnet Manager Scalability Issues
• Creating, joining/leaving, and deleting multicast groups occur as SA requests
– The requesting node sends a request to an SA
– The SA sends MAD packets to the SMAs on the switches to set up routes for the multicast packets
• Each switch contains information on which ports to forward the multicast packet to
• Multicast traffic itself does not go through the subnet manager
– Only the setup and teardown go through the SM
Multicast Group Management
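The division of labor above (SA handles setup/teardown, switches replicate data packets on their own) can be sketched as follows. The class and method names are ours for illustration; a real SA programs multicast forwarding tables on switch SMAs via MADs.

```python
# Sketch of the SA's bookkeeping for one multicast group: joins and
# leaves update per-switch member-port sets; data forwarding then
# happens purely in the switches, without any SM involvement.
class MulticastGroup:
    def __init__(self, mlid):
        self.mlid = mlid          # multicast LID of the group
        self.ports = {}           # switch -> set of member out-ports

    def join(self, switch, port):
        # The SA would send a MAD to this switch's SMA to add the port.
        self.ports.setdefault(switch, set()).add(port)

    def leave(self, switch, port):
        self.ports.get(switch, set()).discard(port)
        if not self.ports.get(switch):
            self.ports.pop(switch, None)   # last member gone: teardown

    def forward(self, switch, in_port):
        # A switch replicates each multicast packet to all member
        # ports except the one it arrived on.
        return sorted(self.ports.get(switch, set()) - {in_port})
```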
• Management Infrastructure
– Subnet Manager
– Diagnostic tools
• System Discovery Tools
• System Health Monitoring Tools
• System Performance Monitoring Tools
– Fabric management tools
Network Management Infrastructure and Tools
• Different types of tools exist:
– High-level tools that internally talk to the subnet manager using management datagrams
– Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
• Possible to write your own tools based on the management datagram interface
– Several vendors provide such IB management tools
Tools to Analyze InfiniBand Networks
• Starting with almost no knowledge about the system, we can identify several details of the network configuration
– Example tools include:
• ibstatus: shows adapter status
• smpquery: SMP query tool
• perfquery: reports performance/error counters of a port
• ibportstate: shows the status of an IB port; can enable/disable the port
• ibhosts: finds all the network adapters in the system
• ibswitches: finds all the network switches in the system
• ibnetdiscover: finds the connectivity between the ports
• … and many others exist
– Possible to write your own tools based on the management datagram interface
• Several vendors provide such IB management tools
Network Discovery Tools
• Several tools exist to monitor the health and performance of the InfiniBand network
– Example health monitoring tools include:
• ibdiagnet: queries for overall fabric health
• ibportstate: identifies the state and link speed of an InfiniBand port
• ibdatacounts: gets InfiniBand port data counters
– Example performance monitoring tools include:
• ibv_send_lat, ibv_write_lat: IB verbs level performance tests
• perfquery: queries performance counters in an IB HCA
Health and Performance Monitoring Tools
Tools for Network Switching and Routing
% ibroute -G 0x66a000700067c
Lid Out Destination
Port Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120
GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120
GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...
Packets to LID 0x0001 will be sent out on Port 001
• Based on destination LIDs and switching/routing information, the exact path of the packets can be identified
– If the application communication pattern is known, we can statically identify possible network contention
Static Analysis of Network Contention
[Diagram: leaf and spine blocks of a fat-tree switch, with destination LIDs mapped onto spine blocks to illustrate which flows share links]
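Given a static routing table (destination LID to out-port, as printed by ibroute earlier) and a known communication pattern, the static analysis boils down to counting how many flows land on each out-port. A minimal sketch, with made-up LIDs and ports:

```python
# Count how many flows share each switch out-port under static routing.
# routes mimics an ibroute dump: destination LID -> out-port.
from collections import Counter

def contention(routes, pattern):
    """pattern: list of (src, dest_lid) flows crossing this switch.
    Returns only the out-ports carrying more than one flow, i.e. the
    potential hot-spots for this communication pattern."""
    load = Counter(routes[dlid] for _, dlid in pattern)
    return {port: n for port, n in load.items() if n > 1}
```

Repeating this per switch along each path predicts where a fixed pattern will queue, which is exactly what static routing cannot avoid once two flows hash to the same port.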
• IB provides many optional performance counters that can be queried
– PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
– RNR NAKs: number of times a message was sent, but the receiver had not yet posted a receive buffer
• This can time out, so it can be an error in some cases
– PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
– SWPortVLCongestion: number of packets dropped due to congestion
Dynamic Analysis of Network Contention
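For example, sampling PortXmitWait with perfquery at the start and end of an interval gives a rough congestion estimate: the fraction of time the port had data queued but no credits. The tick duration is device-specific (readable from the port's attributes); the helper below is our own sketch, not a perfquery feature.

```python
def xmit_wait_fraction(wait_ticks_delta, tick_seconds, interval_seconds):
    """Fraction of the sampling interval that the port was stalled,
    from two PortXmitWait readings taken interval_seconds apart.
    wait_ticks_delta: second reading minus first reading.
    tick_seconds: duration of one counter tick (device-specific)."""
    return wait_ticks_delta * tick_seconds / interval_seconds
```

A value near 1.0 over a sustained interval points at a congested link; correlating it with SWPortVLCongestion separates credit starvation from actual drops.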
• Management Infrastructure
– Subnet Manager
– Diagnostic tools
• System Discovery Tools
• System Health Monitoring Tools
• System Performance Monitoring Tools
– Fabric management tools
Network Management Infrastructure and Tools
• InfiniBand provides two forms of management
– Out-of-band management (similar to other networks)
– In-band management (used by the subnet manager)
• Out-of-band management requires a separate Ethernet port on the switch, where an administrator can plug in a laptop
• In-band management allows the switch to receive management commands directly over the regular communication network
In-band Management vs. Out-of-band Management
[Diagram: switches managed over InfiniBand connectivity (in-band management) and over a separate Ethernet connectivity (out-of-band management)]
Overview of OSU INAM
• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
– http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• OSU INAM v0.9.2 released on 10/31/2017
– Significant enhancements to the user interface to enable scaling to clusters with thousands of nodes
– Improved database insert times by using “bulk inserts”
– Capability to look up the list of nodes communicating through a network link
– Capability to classify data flowing over a network link at job-level and process-level granularity in conjunction with MVAPICH2-X 2.3b
– “Best practices” guidelines for deploying OSU INAM on different clusters
• Capability to analyze and profile node-level, job-level, and process-level activities for MPI communication
– Point-to-point, collectives, and RMA
• Ability to filter data based on the type of counters using a “drop down” list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• “Job Page” to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a “live” or “historical” fashion for the entire network, a job, or a set of nodes
OSU INAM Features
• Shows the network topology of large clusters
• Visualizes traffic patterns on different links
• Quickly identifies congested links and links in an error state
• See the history unfold – play back the historical state of the network
Comet@SDSC --- Clustered View (1,879 nodes, 212 switches, 4,377 network links)
Finding Routes Between Nodes
OSU INAM Features (Cont.)
Visualizing a Job (5 Nodes)
• Job-level view
– Shows different network metrics (load, error, etc.) for any live job
– Plays back historical data for completed jobs to identify bottlenecks
• Node-level view – details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (point-to-point, collective, RMA)
– Network metrics (e.g., XmitDiscard, RcvError) per rank/node
• Estimated process-level link utilization view
– Classifies data flowing over a network link at job-level and process-level granularity in conjunction with MVAPICH2-X 2.2rc1
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Deep Learning
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
Common Challenges for Large-Scale Installations
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
• Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
• Network switches
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
• Data communication traverses three buses (or links) before it reaches the network switch
– Memory bus (memory to I/O hub)
– I/O link (I/O hub to the network adapter)
– Network link (network adapter to switch)
• For optimal communication, all of these need to be balanced
• Network bandwidth:
– 4X SDR (8 Gbps), 4X DDR (16 Gbps), 4X QDR (32 Gbps), 4X FDR (56 Gbps), 4X EDR (100 Gbps), and 4X HDR (200 Gbps)
– 40 GigE (40 Gbps)
• Memory bandwidth:
– Shared bandwidth (incoming and outgoing)
– For IB FDR (56 Gbps), memory bandwidth greater than 112 Gbps is required to fully utilize the network
• I/O link bandwidth:
– Tricky because several aspects need to be considered
– Connector capacity vs. link capacity
– I/O link communication headers, etc.
[Diagram: two multi-core processors with local memories, connected through an I/O hub to the network adapter and on to the network switch]
I/O Bus Limitations
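The balance condition above reduces to two simple inequalities: the I/O link must cover the network rate, and the shared memory bus must cover it twice (once inbound, once outbound). A minimal sketch of that check (the function is ours, numbers from the text):

```python
def balanced(network_gbps, io_link_gbps, mem_gbps):
    """True if neither the I/O link nor the shared memory bus caps the
    network link. Memory bandwidth is shared by incoming and outgoing
    traffic, so it must be at least twice the network rate; for IB FDR
    (56 Gbps) that means mem >= 112 Gbps, as noted above."""
    return io_link_gbps >= network_gbps and mem_gbps >= 2 * network_gbps
```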
• Common I/O interconnect used on most current platforms
– Can be configured with multiple lanes (1X, 4X, 8X, 16X, 32X)
• Generation 1 provided 2 Gbps of bandwidth per lane, Gen 2 provides 4 Gbps, and Gen 3 provides 8 Gbps per lane
– Compatible with adapters using fewer lanes (use the full I/O bandwidth)
• If a PCIe connector is 16X, it will still support an 8X adapter by using only 8 lanes
– Provides multiplexing over a smaller number of lanes (beware: connector capacity vs. link capacity)
• A 1X PCIe link can sit behind an 8X PCIe connector (allowing an 8X adapter to be plugged in)
– I/O interconnects are like networks, with packetized communication (beware of the overheads):
• Communication headers for each packet
• Reliability acknowledgments
• Flow-control acknowledgments
• Typical efficiency is around 75-80% with 256-byte PCIe packets
PCI Express
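The per-packet overhead can be made concrete with a back-of-envelope sketch. The 24-byte figure below is an assumption (framing, sequence number, TLP header, and LCRC together are typically 20-30 bytes); it is a header-only estimate, and ack/flow-control DLLPs plus read-completion overheads pull the delivered figure down further, toward the 75-80% quoted above.

```python
def tlp_efficiency(payload_bytes, header_overhead=24):
    """Payload fraction of a single PCIe TLP, ignoring DLLP traffic.
    header_overhead is an assumed per-TLP cost (framing + sequence
    number + header + LCRC); the real value depends on generation and
    header format. Smaller payloads pay proportionally more overhead."""
    return payload_bytes / (payload_bytes + header_overhead)
```

This is why maximum-payload-size settings matter: shrinking the TLP payload from 256 to 64 bytes roughly doubles the relative header cost.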
• Several multi-port adapters are available in the market
– A single adapter can drive multiple network ports at full bandwidth
– Important to measure other overheads (memory bandwidth and I/O link bandwidth) before assuming a performance benefit
• Case study: IB dual-port 4X QDR adapter
– Each network link is 32 Gbps (dual-port adapters can drive 64 Gbps)
– A PCIe Gen2 8X link gives a 32 Gbps data rate, but only around 24 Gbps effective rate (20% encoding overheads!)
• Dual-port IB QDR is not expected to give any benefit in this case
– A PCIe Gen3 8X link gives a 64 Gbps data rate with close to 64 Gbps effective (minimal encoding overheads)
• Delivers close to peak performance with dual-port IB adapters
Multi-port Adapters
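The Gen2 vs. Gen3 case study is mostly encoding arithmetic: Gen2 uses 8b/10b (20% overhead on 5 GT/s lanes), Gen3 uses 128b/130b (~1.5% overhead on 8 GT/s lanes). A sketch of that arithmetic (the helper name is ours):

```python
def effective_gbps(lanes, raw_gt_per_s, enc_data_bits, enc_total_bits):
    """Deliverable data rate of a PCIe link after line encoding."""
    return lanes * raw_gt_per_s * enc_data_bits / enc_total_bits

gen2_x8 = effective_gbps(8, 5.0, 8, 10)      # 8b/10b encoding
gen3_x8 = effective_gbps(8, 8.0, 128, 130)   # 128b/130b encoding
# A dual-port 4X QDR adapter wants 2 x 32 = 64 Gbps: Gen2 x8 cannot
# feed it, while Gen3 x8 comes within a couple of percent.
```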
• Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
• Network switches
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
NUMA Interactions
• Different cores in a NUMA platform have different communication costs
[Diagram: four quad-core sockets (cores 0-15), each with local memory, interconnected by QPI or HT; the network card hangs off one socket’s PCIe link, so its distance differs per core]
Impact of NUMA on Inter-node Latency
[Chart: MPI send latency (us) vs. message size (2 B – 2 KB) for sender/receiver cores 0 and 7 (Socket 0) and 14 and 27 (Socket 1)]
• Cores in Socket 0 (closest to the network card) have the lowest latency
• Cores in Socket 1 (one hop from the network card) have the highest latency
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches
Impact of NUMA on Inter-node Bandwidth
[Charts: MPI send bandwidth (MBps) vs. message size for AMD MagnyCours (cores 0, 6, 12, 18) and Intel Broadwell (cores 0, 7, 14, 27)]
• NUMA interactions have a significant impact on bandwidth
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches
ConnectX-2 QDR (36 Gbps): 2.5 GHz hex-core (MagnyCours) AMD with IB (QDR) switches
• Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
• Network switches
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
• Network switches
– Switch topologies
– Switching and routing
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
• InfiniBand installations come in multiple topologies
– Single crossbar switches (up to 36 ports for QDR or FDR)
• Applicable only to very small systems (hard to scale to large clusters)
– Fat-tree topologies (medium-scale topologies)
• Provide full bisection bandwidth: given independent communication between processes, you can find a switch configuration that provides fully non-blocking paths (though the same configuration might have contention if the communication pattern changes)
• Issue: the number of switch components increases super-linearly with the number of nodes (not scalable for large-scale systems)
• Large-scale installations can use more conservative topologies
– Partial fat-tree topologies (over-provisioning)
– 3D torus (Sandia Red Sky and SDSC Gordon), hypercube (SGI Altix), and 10D hypercube (NASA Pleiades) topologies
Switch Topologies
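The switch-count growth can be seen from the standard two-level folded-Clos (fat-tree) sizing arithmetic. This is a sketch of the counting argument, not a vendor design; the function name and the rounding policy are ours.

```python
import math

def fat_tree_switches(n_hosts, radix):
    """Crossbar count for a full-bisection two-level fat tree built
    from `radix`-port switches. Each leaf uses half its ports for
    hosts and half for uplinks; at most radix**2 // 2 hosts fit
    before a third level (and further growth) is needed."""
    if n_hosts > radix * radix // 2:
        raise ValueError("needs another level of switches")
    leaves = math.ceil(n_hosts / (radix // 2))  # half the ports face hosts
    spines = math.ceil(leaves / 2)              # uplinks spread over spines
    return leaves + spines
```

With 36-port crossbars, 648 hosts already need 54 switch ASICs (36 leaves + 18 spines), i.e. 12 hosts per crossbar instead of the 36 a single crossbar serves, and the per-host cost keeps worsening as levels are added; hence the appeal of torus/hypercube topologies whose switch count scales linearly.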
Switch Topology: Absolute Performance vs. Scalability
[Diagram: a single crossbar ASIC (all-to-all connectivity); a full fat-tree topology of leaf and spine blocks (full bisection bandwidth); a partial fat-tree topology (reduced inter-switch connectivity to free up more out-ports: still super-linear scaling of switch components, but slower than a full fat-tree); a torus/hypercube topology with only a few links connected (linear scaling of switch components)]
• The IB standard only supports static routing
– Not scalable for large systems, where traffic might be non-deterministic, causing hot-spots
• Next-generation IB switches support adaptive routing (in addition to static routing): outside the IB standard
• QLogic (Intel) support for adaptive routing
– Continually monitors application messaging patterns and selects the optimum path for each traffic flow, eliminating slowdowns caused by pathway bottlenecks
– Dispersive routing load-balances traffic among multiple pathways
– http://ir.qlogic.com/phoenix.zhtml?c=85695&p=irol-newsarticle&id=1428788
• Mellanox support for adaptive routing
– Supports moving traffic via multiple parallel paths
– Dynamically and automatically re-routes traffic to alleviate congested ports
– http://www.mellanox.com/related-docs/prod_silicon/PB_InfiniScale_IV.pdf
Static Routing in IB + Adaptive Routing Models from QLogic (Intel) and Mellanox
• Network adapters and interactions with other components
• Network switches
• Network bridges
– IB interoperability with Ethernet and FC
Common Challenges in Building HEC Systems with IB and HSE
IB-Ethernet and IB-FC Bridging Solutions
• Mainly developed for backward compatibility with existing infrastructure
– Ethernet over IB (EoIB)
– Fibre Channel over IB (FCoIB)
[Diagram: a host with an IB adapter, exposing a virtual Ethernet/FC adapter, connects through an Ethernet packet convertor switch (e.g., Mellanox BridgeX) to a host with a native Ethernet/FC adapter]
• Can be used in an infrastructure where a portion of the nodes are connected over Ethernet or FC
– All of the IB-connected nodes can communicate over IB
– The same nodes can communicate with nodes in the older infrastructure using Ethernet-over-IB or FC-over-IB
• Does not have the performance benefits of IB
– The host thinks it is using an Ethernet or FC adapter
– For example, with Ethernet, communication will use TCP/IP
• There is some hardware support for segmentation offload, but the rest of the IB features are unutilized
• Note that this is different from VPI, as there is only one network connection from the adapter
Ethernet/FC over IB
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Deep Learning
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
System Specific Challenges for HPC Systems
• Common challenges: adapters and interactions (I/O bus, multi-port adapters, NUMA); switches (topologies, switching/routing); bridges (IB interoperability)
• HPC-specific challenges:
– MPI: multi-rail, collective scalability, application scalability, energy awareness
– PGAS: programmability with performance, optimized resource utilization
– GPU / Xeon Phi: programmability with performance, hiding data movement costs, heterogeneity-aware design, streaming, deep learning
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2, MPI-3.0, and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 2,850 organizations in 85 countries
– More than 440,000 (> 0.44 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ’17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun ’17, 10M cores, 100 PFlops)
• Interaction with Multi-Rail Environments
• Collective Communication
• Scalability for Large-scale Systems
• Energy Awareness
Design Challenges and Sample Results
Impact of Multiple Rails on Inter-node MPI Bandwidth
[Charts: MPI bandwidth (MBytes/sec) vs. message size (1 B – 1 MB) for single-rail and dual-rail configurations with 1, 2, 4, 8, and 16 communicating pairs]
Designs based on: S. Sur, M. J. Koop, L. Chai and D. K. Panda, “Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms”, IEEE Hot Interconnects, 2007
ConnectX-4 EDR (100 Gbps): 2.4 GHz deca-core (Haswell) Intel with IB (EDR) switches
Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: latency (us), Default vs. Multicast, for small messages (2 B – 512 B) and large messages (2 KB – 128 KB) at 102,400 cores, and for 16-byte and 32 KByte messages across varying node counts]
ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
Hardware Multicast-aware MPI_Bcast on Broadwell + EDR
[Charts: latency (us), Default vs. Multicast, for small messages (2 B – 512 B) and large messages (2 KB – 128 KB) at 1,120 cores, and for 16-byte and 32 KByte messages across varying node counts]
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with Mellanox IB (EDR) switches
Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders
• Socket-based design can reduce the communication latency by 23% and 40% on Xeon + IB nodes
• Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b
[Charts (lower is better): HPCG communication latency (seconds) at 56–448 processes (28 PPN), and OSU Micro Benchmark latency (us) for 4 B – 4 KB messages (16 nodes, 28 PPN); MVAPICH2 vs. Proposed-Socket-Based vs. MVAPICH2+SHArP, showing the 23% and 40% improvements]
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
Performance of MPI_Allreduce on Stampede2 (10,240 Processes)
[Charts: latency (us) vs. message size (4 B – 256 KB), OSU Micro Benchmark, 64 PPN; MVAPICH2 vs. MVAPICH2-OPT vs. Intel MPI (IMPI)]
• MPI_Allreduce latency with 32 KB messages reduced by 2.4X
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; Default vs. Topology-Aware for a 2048-core run, showing a 15% improvement]
• Reduced network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist
Dynamic and Adaptive Tag Matching
• Challenge: tag matching is a significant overhead for receivers; existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead
• Solution: a new tag matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases decisions on the number of request objects that must be traversed before hitting the required one
• Results: better performance than other state-of-the-art tag-matching schemes, with minimum memory consumption
[Charts: normalized total tag matching time and normalized memory overhead per process at 512 processes, relative to the default (lower is better)]
Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee]
• Will be available in future MVAPICH2 releases
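The per-rank adaptation idea can be sketched as follows. This is a heavily simplified, hypothetical model of our own (the published design differs in detail): keep a linear unexpected-message queue per peer, and switch a peer to a hashed-by-tag scheme once a search has to traverse too many entries.

```python
# Sketch of adaptive tag matching: linear search per peer by default,
# promotion to per-tag hash bins when searches get long.
from collections import defaultdict, deque

class AdaptiveMatcher:
    def __init__(self, threshold=4):
        self.threshold = threshold         # max tolerated traversal length
        self.linear = defaultdict(deque)   # rank -> queue of posted tags
        self.hashed = {}                   # rank -> {tag: deque of tags}

    def post(self, rank, tag):
        if rank in self.hashed:
            self.hashed[rank].setdefault(tag, deque()).append(tag)
        else:
            self.linear[rank].append(tag)

    def match(self, rank, tag):
        if rank in self.hashed:
            bucket = self.hashed[rank].get(tag)
            return bucket.popleft() if bucket else None
        q = self.linear[rank]
        for i, t in enumerate(q):
            if t == tag:
                del q[i]
                if i >= self.threshold:    # search was too long:
                    self._promote(rank)    # adapt this rank to hashing
                return t
        return None

    def _promote(self, rank):
        bins = {}
        for t in self.linear.pop(rank):
            bins.setdefault(t, deque()).append(t)
        self.hashed[rank] = bins
```

Ranks with short queues never pay the hashing memory overhead, which is the memory-consumption point made above.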
● Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables
● Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU
● Control the runtime’s behavior via MPI Control Variables (CVARs)
● Introduced support for new MPI_T based CVARs to MVAPICH2
○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
● TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications
Performance Engineering Applications using MVAPICH2 and TAU
[Screenshots: VBUF usage without and with CVAR-based tuning, as displayed by ParaProf]
Dynamic and Adaptive MPI Point-to-point Communication Protocols
• Desired eager threshold for an example communication pattern (processes 0–3 on Node 1 paired with processes 4–7 on Node 2): pair 0–4: 32 KB, pair 1–5: 64 KB, pair 2–6: 128 KB, pair 3–7: 32 KB
• Eager threshold chosen by each design:
– Default: 16 KB for every pair – poor overlap, low memory requirement (low performance, high productivity)
– Manually Tuned: 128 KB for every pair – good overlap, high memory requirement (high performance, low productivity)
– Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, and 32 KB per pair – good overlap, optimal memory requirement (high performance, high productivity)
[Charts: execution time (seconds) and relative memory consumption of Amber at 128–1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper
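The per-pair selection above can be sketched as a simple policy: raise each pair's eager threshold just far enough to cover the message sizes that pair actually exchanges. This is an illustrative policy of our own, not the published protocol, which adapts at runtime rather than from a recorded trace.

```python
def adapt_thresholds(observed_sizes, choices=(16, 32, 64, 128)):
    """observed_sizes: {pair: list of message sizes in KB}.
    Returns {pair: eager threshold in KB}: the smallest configured
    choice covering the pair's largest message (capped at the largest
    choice), so memory is only spent where eager delivery pays off."""
    out = {}
    for pair, sizes in observed_sizes.items():
        need = max(sizes)
        out[pair] = next((c for c in choices if c >= need), choices[-1])
    return out
```

Applied to the example pattern, pairs sending up to 32, 64, 128, and 17 KB get thresholds of 32, 64, 128, and 32 KB respectively, matching the per-pair values above instead of one global setting.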
Enhanced MPI_Bcast with Optimized CMA-based Design
[Charts: latency (us, log scale) vs. message size (1 KB – 4 MB) on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes); MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the proposed design, which switches from shared memory (SHMEM) to CMA at larger message sizes]
• Up to 2x – 4x improvement over the existing implementation for 1 MB messages
• Up to 1.5x – 2x faster than Intel MPI and Open MPI for 1 MB messages
• Improvements obtained for large messages only
• p-1 copies with CMA, p copies with shared memory
• Falls back to SHMEM for small messages
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, Best Paper Finalist
Support is available in MVAPICH2-X 2.3b
Designing Energy-Aware (EA) MPI Runtime
[Diagram: overall application energy expenditure splits into energy spent in communication routines (point-to-point, collective, and RMA) and in computation routines; MVAPICH2-EA designs target MPI two-sided and collectives (e.g., MVAPICH2) and impact MPI-3 RMA implementations (e.g., MVAPICH2), one-sided runtimes (e.g., ComEx), and other PGAS implementations (e.g., OSHMPI)]
• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies the energy reduction lever to each MPI call
• Available for download from the MVAPICH project site since Aug ’15
MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)
A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing ‘15, Nov 2015 [Best Student Paper Finalist]
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
• Global view improves programmer productivity
• Idea is to decouple data movement from process synchronization
• Processes should have asynchronous access to globally distributed data
• Well suited for irregular applications and kernels that require dynamic access to different data
• Different approaches:
– Library-based (Global Arrays, OpenSHMEM)
– Compiler-based (Unified Parallel C (UPC), Co-Array Fortran (CAF))
– HPCS language-based (X10, Chapel, Fortress)
Partitioned Global Address Space (PGAS) Models
[Diagram: three models compared. Shared Memory Model (SHMEM, DSM): processes P1-P3 share one memory; Distributed Memory Model (MPI): each process has its own private memory; PGAS: per-process memories combined into a logical shared memory]
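The library-based approach can be made concrete with a minimal OpenSHMEM sketch. This is illustrative only: the PE targets and the value 42 are arbitrary, and it needs an OpenSHMEM toolchain (e.g. oshcc/oshrun) to build and launch.

```c
/* Minimal OpenSHMEM sketch of the library-based PGAS style described above:
 * each PE holds a symmetric variable, and PE 0 writes into PE 1's copy with
 * a one-sided put -- no matching receive is posted on the target. */
#include <stdio.h>
#include <shmem.h>

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    static int dest = -1;          /* symmetric: exists on every PE */

    if (me == 0 && npes > 1) {
        int value = 42;
        shmem_int_put(&dest, &value, 1, 1);  /* one-sided write to PE 1 */
    }
    shmem_barrier_all();           /* ensure the put has completed and is visible */

    if (me == 1)
        printf("PE 1 observed dest = %d\n", dest);

    shmem_finalize();
    return 0;
}
```

The asynchronous, one-sided access shown here is exactly what makes PGAS attractive for irregular kernels: PE 0 never needs PE 1's cooperation to deliver the data.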
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
[Diagram: HPC application composed of sub-kernels. Kernel 1 (MPI), Kernel 2 (MPI, re-written in PGAS), Kernel 3 (MPI), ..., Kernel N (MPI, re-written in PGAS)]
MVAPICH2-X for Hybrid MPI + PGAS Applications
• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
– Possible deadlock if both runtimes are not progressed
– Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF
– Available since 2012 (starting with MVAPICH2-X 1.9) – http://mvapich.cse.ohio-state.edu
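A hybrid program over such a unified runtime can be sketched as follows. This is an illustrative fragment, not MVAPICH2-X-specific code: the phases and values are invented, the calls follow the MPI and OpenSHMEM standards, and in practice it is built with a hybrid-capable toolchain and launched with mpirun.

```c
/* Hybrid MPI + OpenSHMEM sketch: both models in one program, the pattern a
 * unified runtime supports without deadlock or duplicated network resources. */
#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* PGAS phase: every PE atomically increments a symmetric counter on PE 0 */
    static long counter = 0;
    shmem_long_add(&counter, 1, 0);
    shmem_barrier_all();

    /* MPI phase: collective reduction over the same set of processes */
    long local = 1, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("counter = %ld, allreduce sum = %ld\n", counter, sum);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```

With separate runtimes, the shmem_barrier_all and MPI_Allreduce above could each stall waiting for progress in the other library; a single runtime progresses both.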
UPC++ Collectives Performance
[Diagram: two runtime stacks for an MPI + UPC++ application. Left: UPC++ runtime over GASNet interfaces with an MPI conduit. Right: UPC++ runtime and MPI interfaces over the MVAPICH2-X unified communication runtime (UCR)]
• Full and native support for hybrid MPI + UPC++ applications
• Better performance compared to IBV and MPI conduits
• OSU Micro-benchmarks (OMB) support for UPC++
• Available since MVAPICH2-X 2.2RC1
[Chart: inter-node broadcast time (us) vs. message size, 64 nodes (1 ppn), comparing GASNet_MPI, GASNET_IBV, and MV2-X; MV2-X up to 14x faster]
J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)
Application-Level Performance with Graph500 and Sort
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013
[Chart: Graph500 execution time (s) vs. no. of processes (4K-16K), comparing MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); up to 13x and 7.6x improvements over MPI-Simple]
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4x improvement over MPI-CSR; 7.6x improvement over MPI-Simple
– 16,384 processes: 1.5x improvement over MPI-CSR; 13x improvement over MPI-Simple
Sort Execution Time
[Chart: Sort execution time (seconds) vs. input data and no. of processes (500GB-512 to 4TB-4K), MPI vs. Hybrid; 51% improvement]
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
• 4,096 processes, 4 TB input size:
– MPI: 2408 sec (0.16 TB/min)
– Hybrid: 1172 sec (0.36 TB/min)
– 51% improvement over the MPI design
J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS’14
Performance of PGAS Models on KNL using MVAPICH2-X
[Chart: intra-node PUT latency (us) vs. message size (1B-1M) for shmem_put, upc_putmem, and upcxx_async_put]
[Chart: intra-node GET latency (us) vs. message size (1B-1M) for shmem_get, upc_getmem, and upcxx_async_get]
• Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using MVAPICH2-X communication conduit
• Near-native communication performance is observed on KNL
Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation
• On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4x improved performance
[Chart: Heat-Image kernel, time (s) vs. no. of processes (16-128) for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]
[Chart: Heat-2D kernel using the Jacobi method, time (s) vs. no. of processes (16-128), same configurations]
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(GPU data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
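Putting the two calls above in context, a minimal CUDA-aware MPI program might look like this. It is a sketch, not MVAPICH2's internals: the buffer size and tag are arbitrary, error checking is omitted, and it requires a CUDA-aware MPI build plus CUDA to compile and run.

```c
/* CUDA-aware MPI sketch: device pointers are passed straight to
 * MPI_Send/MPI_Recv; a CUDA-aware library (e.g. MVAPICH2-GDR) performs the
 * staging or GPUDirect RDMA transfer internally. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;          /* one million floats */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0) {
        /* s_devbuf from the slide: a GPU buffer handed directly to MPI */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* r_devbuf from the slide: received straight into GPU memory */
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-awareness, the application would have to cudaMemcpy each buffer to a host staging area around every MPI call; here that productivity and pipelining burden moves into the library.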
Optimized MVAPICH2-GDR Design
[Chart: GPU-GPU inter-node latency (us) vs. message size (1B-8K), MV2-(NO-GDR) vs. MV2-GDR-2.3a; 1.88 us, up to 11x improvement]
[Chart: GPU-GPU inter-node bandwidth (MB/s) vs. message size (1B-4K); up to 9x improvement]
[Chart: GPU-GPU inter-node bi-bandwidth (MB/s) vs. message size (1B-4K); up to 10x improvement]
Platform: MVAPICH2-GDR 2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (HOOMD-blue)
[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, MV2 vs. MV2+GDR; about 2x improvement]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2x improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Enhanced Support for GPU Managed Memory
● CUDA managed memory => no memory pin-down
● No IPC support for intra-node communication
● No GDR support for inter-node communication
● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
● Initial and basic support in MVAPICH2-GDR: for both intra- and inter-node transfers, "pipeline through" host memory
● Enhanced intra-node managed memory support uses IPC
● Double-buffered, pair-wise IPC-based scheme brings IPC performance to managed memory: high performance and high productivity, with a 2.5x improvement in bandwidth
● OMB extended to evaluate the performance of point-to-point and collective communication using managed buffers
[Chart: bandwidth (MB/s) vs. message size (32K-2M), Enhanced design vs. MV2-GDR 2.2b; 2.5x improvement]
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16
[Chart: 2D stencil halo exchange time (ms) vs. total dimension size (1B-16K), halo width 1, Device vs. Managed buffers]
• Streaming applications on GPU clusters
– Use a pipeline of broadcast operations to move host-resident data from a single source (typically live) to multiple GPU-based computing sites
– Existing schemes require explicit data movement between host and GPU memories, giving poor performance and breaking the pipeline
• IB hardware multicast + scatter-list
– Efficient heterogeneous-buffer broadcast operation
• CUDA Inter-Process Communication (IPC)
– Efficient intra-node topology-aware broadcast operations for multi-GPU systems
• Available in MVAPICH2-GDR 2.3a!
High-Performance Heterogeneous Broadcast for Streaming Applications
[Diagram: broadcast pipeline. The source node's CPU multicasts host-resident data through the IB switch to nodes 1-N; each node's IB HCA delivers it via a scatter-list step to both host and GPU memory, and IPC-based cudaMemcpy (device-to-device) fans the data out to GPUs 0-N within a node]
Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters. C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh , B. Elton, and D. K. Panda, SBAC-PAD'16, Oct 2016.
Control Flow Decoupling through GPUDirect Async
• CPU offloads the compute, communication, and synchronization tasks to the GPU
– All operations asynchronous from the CPU; hides the kernel launch overhead
– Needs stream-based extensions to MPI semantics
• Latency oriented: able to hide the kernel launch overhead
– 25% improvement at 256 bytes
• Throughput oriented: asynchronously offloads the communication and computation tasks to the offload queue
– 14% improvement at 1KB message size
• Platform: Intel Sandy Bridge, NVIDIA K20, and Mellanox FDR HCA
• Will be available in a public release soon
[Diagram: GPU, CPU, and HCA timelines showing the kernel launch overhead hidden]
[Chart: latency oriented (Kernel+Send and Recv+Kernel), latency (us) vs. message size (1B-4K), Default MPI vs. Enhanced MPI+GDS]
[Chart: overlap with host computation/communication, overlap (%) vs. message size (1B-4K), Default MPI vs. Enhanced MPI+GDS]
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
• On-load approach
– Takes advantage of the idle cores
– Dynamically configurable
– Takes advantage of highly multithreaded cores
– Takes advantage of MCDRAM of KNL processors
• Applicable to other programming models such as PGAS, Task-based, etc.
• Provides portability, performance, and applicability to runtime as well as applications in a transparent manner
Enhanced Designs for KNL: MVAPICH2 Approach
Performance Benefits of the Enhanced Designs
• New designs to exploit high concurrency and MCDRAM of KNL
• Significant improvements for large message sizes
• Benefits seen in varying message size as well as varying MPI processes
[Charts, MVAPICH2 vs. MVAPICH2-Optimized: (1) very large message bi-directional bandwidth (MB/s) vs. message size (2M-64M); (2) 16-process intra-node All-to-All latency (us) vs. no. of processes (4-16), 27% improvement; (3) intra-node Broadcast latency (us) vs. message size (1M-32M, 64MB messages), 17.2-52% improvement]
Performance Benefits of the Enhanced Designs
[Charts: (1) multi-bandwidth (MB/s) using 32 MPI processes vs. message size (1M-64M), comparing MV2 default and optimized designs on DRAM and MCDRAM, up to 30% improvement; (2) CNTK MLP training time (s) on MNIST (batch size 64) for varying MPI processes : OMP threads, up to 15% improvement with the optimized design]
• Benefits observed on training time of Multi-level Perceptron (MLP) model on MNIST dataset using CNTK Deep Learning Framework
Enhanced Designs will be available in upcoming MVAPICH2 releases
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabric Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Big Data
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
System-Specific Challenges for Big Data Processing
Common challenges:
– Adapters and interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching/routing
– Bridges: IB interoperability
System-specific:
– Big Data: taking advantage of RDMA; performance; scalability; backward compatibility
– HPC (MPI): multi-rail, collectives, scalability, application scalability, energy awareness
– PGAS: programmability with performance; optimized resource utilization
– GPU/MIC: programmability with performance; hiding data movement costs; heterogeneity-aware design
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Bring HPC and Big Data processing into a "convergent trajectory"!
Can We Run Big Data Jobs on Existing HPC Infrastructure?
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Diagram: layered stack. Applications; Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached); programming models (sockets; upper-level changes? RDMA?); the communication and I/O library (point-to-point communication, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, virtualization (SR-IOV), benchmarks); networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); storage technologies (HDD, SSD, NVM, and NVMe-SSD); commodity computing system architectures (multi- and many-core architectures and accelerators)]
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 275 organizations from 34 countries
• More than 24,700 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE; also runs on Ethernet
Support for OpenPower is available
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
[Charts: execution time (s) vs. data size (80-240 GB), IPoIB (EDR) vs. OSU-IB (EDR), for RandomWriter and TeraGen]
Cluster with 8 Nodes with a total of 64 maps
• RandomWriter: 3x improvement over IPoIB for 80-160 GB file sizes
• TeraGen: 4x improvement over IPoIB for 80-240 GB file sizes
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
• Sort: 61% improvement over IPoIB for 80-160 GB data (cluster of 8 nodes with a total of 64 maps and 32 reduces)
• TeraSort: 18% improvement over IPoIB for 80-240 GB data (cluster of 8 nodes with a total of 64 maps and 14 reduces)
[Charts: execution time (s) vs. data size (GB), IPoIB (EDR) vs. OSU-IB (EDR); Sort reduced by 61%, TeraSort by 18%]
• Design features:
– RDMA-based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with the communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016
[Diagram: Apache Spark benchmarks/applications/libraries/frameworks run on Spark Core with a shuffle manager (Sort, Hash, Tungsten-Sort) and a block transfer service (Netty, NIO, RDMA-Plugin). The Netty and NIO servers/clients use the Java socket interface over 1/10/40/100 GigE or IPoIB networks; the RDMA server/client uses the Java Native Interface (JNI) to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, ...)]
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536 maps, 1536 reduces)
• RDMA vs. IPoIB (56 Gbps) with 1536 concurrent tasks, single SSD per node:
– SortBy: total time reduced by up to 80%
– GroupBy: total time reduced by up to 74%
[Charts: SortByTest and GroupByTest total time (sec) vs. data size (64-256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; up to 80% and 74% reductions]
Application Evaluation on SDSC Comet
• Kira Toolkit: Distributed astronomy image processing toolkit implemented using Apache Spark
– https://github.com/BIDS/Kira
• Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files.
[Chart: execution time (sec), RDMA Spark 21% faster than Apache Spark (IPoIB)]
Execution times (sec) for Kira SE benchmark using 65 GB dataset, 48 cores.
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
• BigDL: distributed deep learning tool using Apache Spark
– https://github.com/intel-analytics/BigDL
• VGG training model on the CIFAR-10 dataset
[Chart: one-epoch time (sec) vs. number of cores (24-384), IPoIB vs. RDMA; up to 4.58x improvement]
Using HiBD Packages on Existing HPC Infrastructure
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email [email protected] (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
[Chart: Memcached GET latency (us) vs. message size (1B-4K), OSU-IB (FDR) vs. IPoIB (FDR); latency reduced by nearly 20x]
[Chart: Memcached throughput (thousands of transactions per second) vs. no. of clients (16-4080); nearly 2x improvement]
• Memcached GET latency:
– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
• Memcached throughput (4 bytes):
– 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec (nearly 2x improvement in throughput)
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabric Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Big Data
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
System-Specific Challenges for Cloud Computing
Common challenges:
– Adapters and interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching/routing
– Bridges: IB interoperability
System-specific:
– Cloud Computing: SR-IOV support, virtualization, containers
– HPC (MPI): multi-rail, collectives, scalability, application scalability, energy awareness
– PGAS: programmability with performance; optimized resource utilization
– GPU/Xeon Phi: programmability with performance; hiding data movement costs; heterogeneity-aware design
– Big Data: taking advantage of RDMA; performance; scalability; backward compatibility
• Cloud Computing widely adopted in industry computing environment
• Cloud Computing provides high resource utilization and flexibility
• Virtualization is the key technology to enable Cloud Computing
• Intersect360 study shows cloud is the fastest growing class of HPC
• HPC Meets Cloud: The convergence of Cloud Computing and HPC
HPC Meets Cloud Computing
• Virtualization has many benefits:
– Fault tolerance
– Job migration
– Compaction
• Virtualization has not been very popular in HPC due to the overhead associated with it
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
– OpenStack, Docker, and Singularity
Can HPC and Virtualization be Combined?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid'15
Application-Level Performance on Chameleon
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) vs. problem size (scale, edgefactor), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
Application-Level Performance on Docker with MVAPICH2
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500, respectively
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500, respectively
[Charts: NAS Class D (MG, FT, EP, LU, CG) execution time (s), and Graph 500 BFS execution time (ms) at scale/edgefactor (20,16) for 1Cont*16P, 2Conts*8P, and 4Conts*4P, comparing Container-Def, Container-Opt, and Native]
Application-Level Performance on Singularity with MVAPICH2
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
[Charts: Graph500 BFS execution time (ms) vs. problem size (scale, edgefactor), and NPB Class D (CG, EP, FT, IS, LU, MG) execution time (s), Singularity vs. Native]
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17, Best Student Paper Award
• Challenges
– Existing designs in Hadoop are not virtualization-aware
– No support for automatic topology detection
• Design
– Automatic topology detection using a MapReduce-based utility
• Requires no user input
• Can detect topology changes during runtime without affecting running jobs
– Virtualization- and topology-aware communication through map task scheduling and YARN container allocation policy extensions
Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom’16, December 2016
[Charts: execution time for Hadoop benchmarks (Sort, WordCount, PageRank at 40/60 GB) and Hadoop applications (CloudBurst, Self-join in default and distributed modes), RDMA-Hadoop vs. Hadoop-Virt; reductions of up to 55% and 34%]
• Presented advanced features of InfiniBand, HSE, Omni-Path, and RoCE
• Provided an overview of Open Fabrics Verbs-level and Libfabrics-level Programming and InfiniBand Network Management
• Discussed common set of challenges in designing HEC Systems
• Presented Challenges and Solutions in designing various High-End Computing systems with IB, Omni-Path, and HSE
• IB, Omni-Path, and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions
Concluding Remarks
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborty (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientists
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/