InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage

A Tutorial at IT4 Innovations'18

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

The latest version of the slides can be obtained from
http://www.cse.ohio-state.edu/~panda/it4-advanced.pdf
Network Based Computing Laboratory - IT4 Innovations'18
High-End Computing (HEC): ExaFlop & ExaByte
• ExaFlop: 100 PFlops in 2016; 1 EFlops in 2019-2020?
  – Expected to have an ExaFlop system in 2019-2020!
• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
Various High-End Computing (HEC) Systems
• Compute Clusters
• Storage Clusters
• Multi-tier Data Centers
• Cloud Computing Environments
• Big Data Processing (Hadoop and Spark)
• Web 2.0 with Memcached
Various Clusters (Compute, Storage and Datacenters)
[Diagram: a compute cluster (frontend and compute nodes on a LAN) connected over a LAN/WAN to a storage cluster (a meta-data manager holding meta-data and I/O server nodes holding data), and an enterprise multi-tier datacenter for visualization and mining (Tier1 routers/servers, Tier2 application servers, and Tier3 database servers, interconnected by switches)]
Cloud Computing Environments
[Diagram: physical machines, each hosting multiple virtual machines, connected over a LAN/WAN to a virtualized network file system (a physical meta-data manager holding meta-data and physical I/O server nodes holding data)]
Overview of Apache Hadoop Architecture
• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
• Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
• http://hadoop.apache.org
[Diagram: Hadoop 1.x – MapReduce (cluster resource management & data processing) over HDFS over Hadoop Common/Core (RPC, ...); Hadoop 2.x – MapReduce (data processing) and other models (data processing) over YARN (cluster resource management & job scheduling) over HDFS over Hadoop Common/Core (RPC, ...)]
Big Data Processing with Hadoop Components
• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the high amount of communication during intermediate phases can be further optimized
[Diagram: user applications on top of the Hadoop framework – MapReduce and HBase over HDFS and Hadoop Common (RPC)]
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Runs standalone or on YARN or Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
• http://spark.apache.org
Memcached Architecture
• Distributed caching layer
  – Allows aggregation of spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
• Convergence of HPC, Big Data, and Deep Learning!
• Increasing need to run these applications on the Cloud!!
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Images: multi-core processors; high-performance interconnects – InfiniBand (<1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; example systems – Tianhe-2, Titan, K-Computer, Sunway TaihuLight]
Modern Interconnects and Protocols with IB, HSE, and Omni-Path
[Diagram: protocol stacks beneath the Application / Middleware layer; each column lists interface, protocol, adapter, and switch]
• Sockets interface
  – TCP/IP (kernel space) over Ethernet adapter and Ethernet switch (1/10/25/40/50/100 GigE)
  – TCP/IP over IPoIB (kernel space) with InfiniBand adapter and InfiniBand switch
  – TCP/IP with hardware offload over 10/40 GigE-TOE adapter and Ethernet switch
  – RSockets (user space) over InfiniBand adapter and InfiniBand switch
  – SDP (user space) over InfiniBand adapter and InfiniBand switch
• Verbs interface
  – RDMA (user space) over iWARP adapter and Ethernet switch (iWARP)
  – RDMA (user space) over RoCE adapter and Ethernet switch (RoCE)
  – RDMA (user space) over InfiniBand adapter and InfiniBand switch (IB Native)
• OFI interface
  – RDMA (user space) over Omni-Path adapter and Omni-Path switch (100 Gb/s)
Large-scale InfiniBand Installations
• 163 IB Clusters (32.6%) in the Nov'17 Top500 list (http://www.top500.org)
• Installations in the Top 50 (17 systems):
  – 19,860,000 cores (Gyoukou) in Japan (4th)
  – 241,108 cores (Pleiades) at NASA/Ames (17th)
  – 220,800 cores (Pangea) in France (21st)
  – 144,900 cores (Cheyenne) at NCAR/USA (24th)
  – 155,150 cores (Jureca) in Germany (29th)
  – 72,800 cores Cray CS-Storm in US (30th)
  – 72,800 cores Cray CS-Storm in US (31st)
  – 78,336 cores (Electra) at NASA/USA (33rd)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (34th)
  – 60,512 cores (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
  – 60,512 cores (DGX SATURN V) at NVIDIA/USA (36th)
  – 72,000 cores (HPC2) in Italy (37th)
  – 152,692 cores (Thunder) at AFRL/USA (40th)
  – 99,072 cores (Mistral) at DKRZ/Germany (42nd)
  – 147,456 cores (SuperMUC) in Germany (44th)
  – 86,016 cores (SuperMUC Phase 2) in Germany (45th)
  – 74,520 cores (Tsubame 2.5) at Japan/GSIC (48th)
  – 66,000 cores (HPC3) in Italy (51st)
  – 194,616 cores (Cascade) at PNNL (53rd)
  – and many more!
Large-scale Omni-Path Installations
• 35 Omni-Path Clusters (7%) in the Nov'17 Top500 list (http://www.top500.org)
  – 556,104 cores (Oakforest-PACS) at JCAHPC in Japan (9th)
  – 368,928 cores (Stampede2) at TACC in USA (12th)
  – 135,828 cores (Tsubame 3.0) at TiTech in Japan (13th)
  – 314,384 cores (Marconi XeonPhi) at CINECA in Italy (14th)
  – 153,216 cores (MareNostrum) at BSC in Spain (16th)
  – 95,472 cores (Quartz) at LLNL in USA (49th)
  – 95,472 cores (Jade) at LLNL in USA (50th)
  – 49,432 cores (Mogon II) at Universitaet Mainz in Germany (65th)
  – 38,552 cores (Molecular Simulator) in Japan (70th)
  – 35,280 cores (Quriosity) at BASF in Germany (71st)
  – 54,432 cores (Marconi Xeon) at CINECA in Italy (72nd)
  – 46,464 cores (Peta4) at University of Cambridge in UK (75th)
  – 53,352 cores (Grizzly) at LANL in USA (85th)
  – 45,680 cores (Endeavor) at Intel in USA (86th)
  – 59,776 cores (Cedar) at SFU in Canada (94th)
  – 27,200 cores (Peta HPC) in Taiwan (95th)
  – 39,774 cores (Nel) at LLNL in USA (101st)
  – 40,392 cores (Serrano) at SNL in USA (112th)
  – 40,392 cores (Cayenne) at SNL in USA (113th)
  – and many more!
Presentation Overview
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
• System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
• Conclusions and Final Q&A
Advanced Features of InfiniBand
• SRQ and XRC
• DCT
• User-Mode Memory Registration (UMR)
• On-Demand Paging
• CORE-Direct Offload
• SHArP
Memory Overheads in Large-Scale Systems
• Different transport protocols with IB
  – Reliable Connection (RC) is the most common
  – Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive messages from any sender
  – Buffer requirement can increase with system size
• Connections need to be established across processes under RC
  – Each connection requires a certain amount of memory for handling related data structures
  – Memory required for all connections can increase with system size
• Both issues have become critical as large-scale IB deployments have taken place
  – Being addressed by both the IB specification and upper-level middleware
Shared Receive Queue (SRQ)
• SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
  – Introduced in specification v1.2
• Instead of posting P buffers on each of its (M*N) - 1 connections, a process posts a single SRQ of depth Q, where 0 < Q << P*((M*N) - 1)
[Diagram: a process with one RQ per connection vs. a process with one SRQ for all connections]
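The saving can be made concrete with a small arithmetic sketch (the values of P, M, N, and the SRQ depth Q used below are hypothetical examples, not figures from the tutorial):

```c
#include <assert.h>

/* Without SRQ: a process posts P receive buffers on each of its
   (M*N - 1) RC connections (M processes/node, N nodes). */
long rq_buffers(long p, long m, long n) {
    return p * (m * n - 1);
}

/* With SRQ: one shared queue of depth Q serves every connection,
   where 0 < Q << P*((M*N) - 1). */
long srq_buffers(long q) {
    return q;
}
```

With 64 buffers per connection on a hypothetical 1,024-node, 16-process-per-node system, per-process posting drops from roughly a million buffers to whatever SRQ depth the runtime chooses.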
eXtended Reliable Connection (XRC)
• Each QP takes at least one page of memory
  – Connections between all processes are very costly for RC
• New IB transport added: eXtended Reliable Connection
  – Allows connections between nodes instead of processes
• With M = # of processes/node and N = # of nodes:
  – RC: M^2 x (N - 1) connections/node
  – XRC: M x (N - 1) connections/node
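A small sketch of the two per-node connection-count formulas (the M and N values used in the note below are hypothetical examples):

```c
#include <assert.h>

/* RC: every process pairs with every process on every other node. */
long rc_conns_per_node(long m, long n) {
    return m * m * (n - 1);
}

/* XRC: connections are per remote node, so each process needs only
   one connection per remote node. */
long xrc_conns_per_node(long m, long n) {
    return m * (n - 1);
}
```

For example, with 16 processes/node on 1,024 nodes, XRC cuts per-node connections by a factor of M = 16 (261,888 down to 16,368).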
XRC Addressing
• XRC uses SRQ Numbers (SRQNs) to direct where an operation should complete
• Hardware does all routing of data, so p2 is not actually involved in the data transfer
• Connections are not bi-directional, so p3 cannot send to p0
[Diagram: processes 0-3, each with an SRQ (#1 or #2); sends name the target SRQ number, e.g. "send to #2", "send to #1"]
DC Connection Model, Communication Objects and Addressing Scheme
• Constant connection cost
  – One QP for any peer
• Full feature set
  – RDMA, atomics, etc.
• Communication objects & addressing scheme
  – DCINI
    • Analogous to the send QPs
    • Can transmit data to any peer
  – DCTGT
    • Receive objects
    • Must be backed by an SRQ
    • Identified on a node by a "DCT Number"
  – Messages are routed with a combination of DCT Number + LID
  – Requires a "DC Key" to enable communication
    • Must be the same across all processes
[Diagram: nodes 0-3 (processes P0-P7) connected through an IB network]
User-Mode Memory Registration (UMR)
• Supports direct local and remote non-contiguous memory access
• Avoids packing at the sender and unpacking at the receiver
• Steps to create memory regions with UMR (process, kernel, and HCA/RNIC interact):
  1. UMR creation request: send the number of blocks
  2. HCA issues uninitialized memory keys for future UMR use
  3. Kernel maps virtual to physical and pins the region into physical memory
  4. HCA caches the virtual-to-physical mapping
On-Demand Paging (ODP)
• Applications no longer need to pin down the underlying physical pages
  – Memory Regions (MRs) are NEVER pinned by the OS
  – Pages are paged in by the HCA when needed and paged out by the OS when reclaimed
• ODP can be divided into two classes
  – Explicit ODP
    • Applications still register memory buffers for communication, but this operation is used to define access control for IO rather than to pin down the pages
  – Implicit ODP
    • Applications are provided with a special memory key that represents their complete address space; there is no need to register any virtual address range
• Advantages
  – Simplifies programming
  – Unlimited MR sizes
  – Physical memory optimization
Implicit On-Demand Paging (ODP)
• Introduced by Mellanox to avoid pinning the pages of registered memory regions
• An ODP-aware runtime can reduce the size of pin-down buffers while maintaining performance
[Chart: execution time (s) of applications at 256 processes (CG, EP, FT, IS, MG, LU, SP, AWP-ODC, Graph500), comparing pin-down, explicit ODP, and implicit ODP]
M. Li, X. Lu, H. Subramoni, and D. K. Panda, "Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand", HiPC '17
Collective Offload Support on the Adapters
• Performance of collective operations (broadcast, barrier, reduction, all-reduce, etc.) is very critical to the overall performance of MPI applications
• Currently done with basic pt-to-pt operations (send/recv and RDMA) using host-based operations
• Mellanox ConnectX-2, ConnectX-3, ConnectX-4, and ConnectX-5 adapters support offloading some of these operations to the adapters (CORE-Direct)
  – Provides overlap of computation and collective communication
  – Reduces OS jitter (since everything is done in hardware)
One-to-many Multi-Send
• Sender creates a task list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task list
  – A wait WQE is added to make the HCA wait for the ACK packet from the receiver
[Diagram: the application posts the task list (send, send, send, wait, ...) to a management queue (MQ) on the InfiniBand HCA, which drives the send/recv queues and completion queues (send CQ, recv CQ, MCQ) over the physical link]
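The task list above can be modeled with a short sketch (a toy list builder for illustration only, not the actual CORE-Direct API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy singly linked task list: one send WQE per registered receiver,
   appended at the rear, followed by a trailing wait WQE for the ACK. */
enum wqe_type { WQE_SEND, WQE_WAIT };

struct wqe {
    enum wqe_type type;
    struct wqe *next;
};

struct wqe *build_multisend_list(int nreceivers) {
    struct wqe *head = NULL, **tail = &head;
    for (int i = 0; i < nreceivers; i++) {   /* one send per receiver */
        struct wqe *s = malloc(sizeof *s);
        s->type = WQE_SEND;
        s->next = NULL;
        *tail = s;                           /* append at the rear */
        tail = &s->next;
    }
    struct wqe *w = malloc(sizeof *w);       /* trailing wait WQE */
    w->type = WQE_WAIT;
    w->next = NULL;
    *tail = w;
    return head;
}
```

Building the list for 3 receivers yields four WQEs: three sends followed by one wait.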
Scalable Hierarchical Aggregation Protocol (SHArP)
• Management and execution of MPI operations in the network by using SHArP
  – Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
• AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*
[Figures: physical network topology and the logical SHArP tree]
* Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. R. L. Graham, D. Bureddy, P. Lui, G. Shainer, H. Rosenstock, G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir, L. Levi, A. Margolin, T. Ronen, A. Shpiner, O. Wertheim, E. Zahavi, Mellanox Technologies, Inc. First Workshop on Optimization of Communication in HPC Runtime Systems (COM-HPC 2016)
The Ethernet Ecosystem
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
Emergence of 25 GigE and Benefits
[Figure: 25 GigE slashes the cost of top-of-rack switches (Source: IEEE 802.3)]
Courtesy:
http://www.eetimes.com/document.asp?doc_id=1323184
http://www.networkcomputing.com/data-centers/25-gbe-big-deal-will-arrive/1714647938
http://www.plexxi.com/2014/07/whats-25-gigabit-ethernet-want/
http://www.qlogic.com/Products/adapters/Pages/25Gb-Ethernet.aspx
Matching PCIe and Ethernet Speeds
• Requires half the number of lanes compared to 40G (x4 instead of x8 PCIe lanes)
• Better PCIe bandwidth utilization (25/32 = 78% vs. 40/64 = 62.5%) with lower power impact

Ethernet Rate (Gb/s)   PCIe Gen3 Lanes (Single Port)   PCIe Gen3 Lanes (Dual Port)
100                    16                              32 (uncommon)
40                     8                               16
25                     4                               8
10                     2                               4

Courtesy: http://www.ieee802.org/3/cfi/0314_3/CFI_03_0314.pdf
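The lane counts in the table follow from rounding the Ethernet rate up to the next power-of-two PCIe Gen3 width at roughly 8 Gb/s per lane — a sketch of the arithmetic, not a PCIe library:

```c
#include <assert.h>

/* One PCIe Gen3 lane carries ~8 Gb/s; ports come in power-of-two
   widths (x1, x2, x4, x8, x16). */
static const double GEN3_LANE_GBPS = 8.0;

int gen3_lanes_for_rate(double rate_gbps) {
    int lanes = 1;
    while (lanes * GEN3_LANE_GBPS < rate_gbps)
        lanes *= 2;          /* round up to the next PCIe width */
    return lanes;
}

/* Fraction of the PCIe link the Ethernet rate actually uses. */
double pcie_utilization(double rate_gbps) {
    return rate_gbps / (gen3_lanes_for_rate(rate_gbps) * GEN3_LANE_GBPS);
}
```

This reproduces the table (25 GigE: 4 lanes at 78% utilization; 40 GigE: 8 lanes at 62.5%).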
Detailed Specifications for 25 and 50 GigE and Looking Forward
• The 25G & 50G Ethernet specification extends IEEE 802.3 to work at increased data rates
• Features in Draft 1.4 of the specification
  – PCS/PMA operation at 25 Gb/s over a single lane
  – PCS/PMA operation at 50 Gb/s over two lanes
  – Optional Forward Error Correction modes
  – Optional auto-negotiation using an OUI next page
  – Optional link training
• Standards for 50 Gb/s, 200 Gb/s and 400 Gb/s under development
  – Next standards expected around 2017-2018
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
Ethernet Roadmap – To Terabit Speeds?
• 50G, 100G, 200G and 400G by 2018-2019
• Terabit speeds by 2025?
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
RDMA over Converged (Enhanced) Ethernet
• Takes advantage of IB and Ethernet
  – Software is written with IB Verbs
  – Link layer is Converged (Enhanced) Ethernet (CE)
• Network stack comparison (the application uses IB Verbs in all three cases):
  – InfiniBand: IB transport / IB network / InfiniBand link layer
  – RoCE: IB transport / IB network / Ethernet link layer
  – RoCE v2: IB transport / UDP + IP / Ethernet link layer
• Pros (IB vs. RoCE)
  – Works natively in Ethernet environments
    • The entire Ethernet management ecosystem is available
  – Has all the benefits of IB Verbs
  – Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: additional benefits over RoCE
  – Traditional network management tools apply
  – ACLs (metering, accounting, firewalling)
  – IGMP snooping for optimized multicast
  – Network monitoring tools
• Cons
  – Network bandwidth might be limited by Ethernet switches
    • 10/40GE switches available; 56 Gbps IB is available
• Packet header comparison
  – RoCE: Ethernet L2 header (Ethertype) + IB GRH (L3) + IB BTH+ (L4)
  – RoCE v2: Ethernet L2 header (Ethertype) + IP header (L3, Proto #) + UDP header (Port #) + IB BTH+ (L4)
Courtesy: OFED, Mellanox
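The header comparison can be quantified with the standard header lengths (Ethernet L2 = 14 bytes, IB GRH = 40, IPv4 = 20, UDP = 8, IB BTH = 12); VLAN tags, ICRC, and FCS are ignored in this sketch:

```c
#include <assert.h>

enum {
    ETH_L2   = 14,  /* Ethernet L2 header (includes Ethertype) */
    IB_GRH   = 40,  /* IB Global Routing Header (L3) */
    IPV4_HDR = 20,  /* IPv4 header (L3, carries the Proto #) */
    UDP_HDR  = 8,   /* UDP header (carries the Port #) */
    IB_BTH   = 12   /* IB Base Transport Header (L4) */
};

int roce_header_bytes(void)    { return ETH_L2 + IB_GRH + IB_BTH; }
int roce_v2_header_bytes(void) { return ETH_L2 + IPV4_HDR + UDP_HDR + IB_BTH; }
```

Swapping the 40-byte GRH for IP + UDP makes the RoCE v2 header slightly smaller while making packets routable by ordinary IP gear.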
Software Convergence with OpenFabrics
• Open source organization (formerly OpenIB) – www.openfabrics.org
• Incorporates IB, RoCE, and iWARP in a unified manner
  – Support for Linux and Windows
• Users can download the entire stack and run it
  – Latest stable release is OFED 4.8.1
    • New naming convention to get aligned with Linux kernel development
  – OFED 4.8.2 is under development
OpenFabrics Software Stack
[Diagram: layered stack from hardware (InfiniBand HCA and iWARP R-NIC with hardware-specific drivers), through the mid-layer (kernel-level verbs/API, connection manager, connection manager abstraction (CMA), MAD, SA client, SMA) and upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), to user APIs (user-level verbs/API, user-level MAD API, UDAPL, SDP lib, OpenSM, diag tools) and applications (IP-based apps, sockets-based access, various MPIs, block storage access, access to file systems, clustered DB access); user-space paths can bypass the kernel]
Key:
SA – Subnet Administrator
MAD – Management Datagram
SMA – Subnet Manager Agent
PMA – Performance Manager Agent
IPoIB – IP over InfiniBand
SDP – Sockets Direct Protocol
SRP – SCSI RDMA Protocol (Initiator)
iSER – iSCSI RDMA Protocol (Initiator)
RDS – Reliable Datagram Service
UDAPL – User Direct Access Programming Lib
HCA – Host Channel Adapter
R-NIC – RDMA NIC
Programming with OpenFabrics
Sample steps (sender and receiver):
1. Create QPs (endpoints)
2. Register memory for sending and receiving
3. Send
  – Channel semantics
    • Post receive
    • Post send
  – RDMA semantics
[Diagram: process, kernel, and HCA on both the sender and the receiver]
Verbs: Post Send
• Prepare and post a send descriptor (channel semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;

if (len < max_inline_size) {
    sr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
} else {
    sr.send_flags = IBV_SEND_SIGNALED;
}

sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
Verbs: Post RDMA Write
• Prepare and post an RDMA write (memory semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE; /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;

sr.wr.rdma.remote_addr = remote_addr; /* remote virtual addr. */
sr.wr.rdma.rkey = rkey; /* from remote node */

sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf; /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
Libfabrics Connection Model
[Diagram: server and client processes each talk to an OFI provider (Sockets/Verbs/PSM) over a GigE/IB/TrueScale HCA]
Server side:
1. fi_fabric – open fabric
2. fi_domain – open domain
3. fi_mr_reg – register memory
4. fi_passive_ep – open passive EP
5. fi_eq_open – open event queue
6. fi_pep_bind – bind passive EP to EQ
7. fi_listen – listen for incoming connections
8. fi_eq_sread – when a new event is detected on the EQ, validate that the event == FI_CONNREQ
Client side:
1. fi_fabric – open fabric
2. fi_domain – open domain
3. fi_mr_reg – register memory
4. fi_endpoint – open endpoint
5. fi_cq_open – open completion queue
6. fi_ep_bind – bind EP to CQ
7. fi_connect – connect to remote EP
Libfabrics Connection Model (Cont.)
Server side (after FI_CONNREQ):
1. fi_ep_open – open endpoint
2. fi_cq_open – open completion queue
3. fi_ep_bind – bind EP to CQ
4. fi_accept – accept connection
5. fi_eq_sread – validate that the new event == FI_CONNECTED
Client side:
1. fi_eq_sread – when a new event is detected on the EQ, validate that the event == FI_CONNECTED
Data transfer (both sides):
• fi_recv – post recv
• fi_send – post send
• fi_cq_read / fi_cq_sread – poll / wait for data and completions
Teardown (both sides):
• fi_shutdown – shutdown channel
• fi_close – close all open resources
Scalable EndPoints vs Shared TX/RX Context
• Normal endpoint: one endpoint with its own transmit/receive contexts and completion queue
  – Similar to a socket / QP
  – Simple / easy to use
• Shared TX/RX context: multiple endpoints share one set of transmit/receive contexts
  – Shares HW resources
  – # EPs >> HW resources
• Scalable endpoints: one endpoint with multiple transmit/receive contexts
  – Uses more HW resources
  – Higher performance per EP
Courtesy: http://www.slideshare.net/seanhefty/ofa-workshop2015ofiwg?ref=http://ofiwg.github.io/libfabric/
Libfabrics: Fabric, Domain and Endpoint Creation
• Open fabric, domain and EP
struct fi_info *info, *hints;
struct fid_fabric *fabric;
struct fid_domain *dom;
struct fid_ep *ep;
hints = fi_allocinfo();
/* Obtain fabric information */
rc = fi_getinfo(VERSION, node, service, flags, hints, &info);
/* Free fabric information */
fi_freeinfo(hints);
/* Open fabric */
rc = fi_fabric(info->fabric_attr, &fabric, NULL);
/* Open domain */
rc = fi_domain(fabric, entry.info, &dom, NULL);
/* Open End point */
rc = fi_endpoint(dom, entry.info, &ep, NULL);
Libfabrics: Memory Registration
• Open fabric/domain and create EQ and EPs to end nodes
  – Connection establishment is abstracted out using connection management APIs (fi_cm) – fi_listen, fi_connect, fi_accept
  – A fabric provider can implement them with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication
• Register memory
int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len, uint64_t access, uint64_t offset, uint64_t requested_key, uint64_t flags, struct fid_mr **mr, void *context);
rc = fi_mr_reg(domain, buffer, size, FI_SEND | FI_RECV,
0, 0, 0, &mr, NULL);
rc = fi_mr_reg(domain, buffer, size,
FI_REMOTE_READ | FI_REMOTE_WRITE, 0,
user_key, 0, &mr, NULL);
Permissions can be set as needed
Libfabrics: Post Receive (Channel Semantics)
• Prepare and post a receive request
ssize_t fi_recv(struct fid_ep *ep, void * buf, size_t len,
void *desc, fi_addr_t src_addr,
void *context);
- For connected EPs
ssize_t fi_recvmsg(struct fid_ep *ep,
const struct fi_msg *msg, uint64_t flags);
- For connected and un-connected EPs
struct fid_ep *ep;
struct fid_mr *mr;
/* Post recv request */
rc = fi_recv(ep, buf, size, fi_mr_desc(mr), 0,
(void *)(uintptr_t)RECV_WCID);
Libfabrics: Post Send (Channel Semantics)
• Prepare and post a send descriptor
ssize_t fi_send(struct fid_ep *ep, void *buf, size_t len,
void *desc, fi_addr_t dest_addr, void *context);
- For connected EPs
ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg,
uint64_t flags);
- For connected and un-connected EPs
ssize_t fi_inject(struct fid_ep *ep, void *buf, size_t len,
fi_addr_t dest_addr);
- Buffer available for re-use as soon as function returns
- No completion event generated for send
struct fid_ep *ep;
struct fid_mr *mr;
static fi_addr_t remote_fi_addr;
rc = fi_send(ep, buf, size, fi_mr_desc(mr), 0,
(void *)(uintptr_t)SEND_WCID);
rc = fi_inject(ep, buf, size, remote_fi_addr);
Libfabrics: Post Remote Read (Memory Semantics)
• Prepare and post a remote read request
ssize_t fi_read(struct fid_ep *ep, void *buf, size_t len,
void *desc, fi_addr_t src_addr, uint64_t addr,
uint64_t key, void *context);
- For connected EPs
ssize_t fi_readmsg(struct fid_ep *ep,
const struct fi_msg_rma *msg,
uint64_t flags);
- For connected and un-connected EPs
struct fid_ep *ep;
struct fid_mr *mr;
struct fi_context fi_ctx_read;
/* Post remote read request */
ret = fi_read(ep, buf, size, fi_mr_desc(mr), local_addr,
remote_addr, remote_key, &fi_ctx_read);
Libfabrics: Post Remote Write (Memory Semantics)
• Prepare and post a remote write descriptor
ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len,
void *desc, fi_addr_t dest_addr, uint64_t addr,
uint64_t key, void *context);
- For connected EPs
ssize_t fi_writemsg(struct fid_ep *ep,
const struct fi_msg_rma *msg, uint64_t flags);
- For connected and un-connected EPs
ssize_t fi_inject_write(struct fid_ep *ep, const void *buf,
size_t len, fi_addr_t dest_addr,
uint64_t addr, uint64_t key);
- Buffer available for re-use as soon as function returns
- No completion event generated for send
ssize_t fi_writedata(struct fid_ep *ep, const void *buf,
size_t len, void *desc, uint64_t data,
fi_addr_t dest_addr, uint64_t addr,
uint64_t key, void *context);
- Similar to fi_write
- Allows for the sending of remote CQ data
Network Management Infrastructure and Tools
• Management infrastructure
  – Subnet manager
  – Diagnostic tools
    • System discovery tools
    • System health monitoring tools
    • System performance monitoring tools
  – Fabric management tools
Concepts in IB Management
• Agents
  – Processes or hardware units running on each adapter, switch, and router (everything on the network)
  – Provide the capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages
InfiniBand Management
• All IB management happens using packets called Management Datagrams
  – Popularly referred to as "MAD packets"
• Four major classes of management mechanisms
  – Subnet management
  – Subnet administration
  – Communication management
  – General services
• Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
– Each adapter, switch, and router has an agent running
– Communication between the SM and agents, or between agents, happens using MAD packets called Subnet Management Packets (SMPs)
• The SM’s responsibilities include:
– Discovering the physical topology of the subnet
– Assigning LIDs to the end nodes, switches, and routers
– Populating switches and routers with routing paths
– Subnet sweeps to discover topology changes
Subnet Management & Administration
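The discover/assign/populate steps above can be illustrated with a toy model. This is a sketch with hypothetical data structures, not the SMP protocol: a real SM exchanges MAD packets with SMAs, while this sketch just walks an in-memory adjacency map.

```python
# Toy model of a subnet manager's first sweep: discover nodes by
# breadth-first search from the SM's own node, assign each node a LID,
# and populate per-node forwarding tables along the discovery tree.
from collections import deque

def first_sweep(adjacency, sm_node):
    """adjacency: dict node -> {neighbor: out_port}; returns (lids, fdb)."""
    lids = {}
    next_lid = 1
    parent = {sm_node: None}
    order = []
    q = deque([sm_node])
    while q:                           # breadth-first discovery
        node = q.popleft()
        lids[node] = next_lid          # assign a LID as each node is found
        next_lid += 1
        order.append(node)
        for nbr in adjacency[node]:
            if nbr not in parent:
                parent[nbr] = node
                q.append(nbr)
    # Populate forwarding tables: route every LID along the BFS tree.
    fdb = {n: {} for n in adjacency}   # node -> {dest LID: out-port}
    for dest in order:
        node, hop = parent[dest], dest
        while node is not None:        # walk back toward the SM
            fdb[node][lids[dest]] = adjacency[node][hop]
            hop, node = node, parent[node]
    return lids, fdb
```

A second sweep would diff the discovered topology against the stored one and update only the affected forwarding entries, which is why sweep frequency matters on large fabrics.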
[Diagram: a subnet manager discovering compute nodes and switches over active and inactive links, and servicing multicast join and multicast setup requests]
• The SM can be configured to sweep once or continuously
• On the first sweep:
– All ports are assigned LIDs
– All routes are set up on the switches
• On subsequent sweeps:
– If there has been any change to the topology, the appropriate routes are updated
– If DLID X is down, packets are not sent all the way
• The first hop will not have a forwarding entry for LID X
• The sweep time is configured by the system administrator
– Cannot be too high or too low
Subnet Manager Sweep Behavior
• A single subnet manager has issues on large systems
– Performance and overhead of scanning
• Hardware implementations on switches are faster, but work only for small systems (memory usage)
• Software implementations are more popular (OpenSM)
– Multi-SM models
• Two benefits: fault tolerance (if one SM dies) and scalability (different SMs can handle different portions of the network)
• Current SMs only provide a fault-tolerance model
• Network subsetting is still being investigated
• Asynchronous events specified to improve scalability
– E.g., traps are events sent by an agent to the SM when a link goes down
Subnet Manager Scalability Issues
• Creating, joining/leaving, and deleting multicast groups occur as SA requests
– The requesting node sends a request to an SA
– The SA sends MAD packets to the SMAs on the switches to set up routes for the multicast packets
• Each switch contains information on which ports to forward the multicast packet to
• Multicast traffic itself does not go through the subnet manager
– Only the setup and teardown go through the SM
Multicast Group Management
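The division of labor above (SA handles setup/teardown, switches replicate data packets on their own) can be sketched as follows. The class and method names are ours for illustration; a real SA programs multicast forwarding tables on switch SMAs via MADs.

```python
# Sketch of the SA's bookkeeping for one multicast group: joins and
# leaves update per-switch member-port sets; data forwarding then
# happens purely in the switches, without any SM involvement.
class MulticastGroup:
    def __init__(self, mlid):
        self.mlid = mlid          # multicast LID of the group
        self.ports = {}           # switch -> set of member out-ports

    def join(self, switch, port):
        # The SA would send a MAD to this switch's SMA to add the port.
        self.ports.setdefault(switch, set()).add(port)

    def leave(self, switch, port):
        self.ports.get(switch, set()).discard(port)
        if not self.ports.get(switch):
            self.ports.pop(switch, None)   # last member gone: teardown

    def forward(self, switch, in_port):
        # A switch replicates each multicast packet to all member
        # ports except the one it arrived on.
        return sorted(self.ports.get(switch, set()) - {in_port})
```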
• Management Infrastructure
– Subnet Manager
– Diagnostic tools
• System Discovery Tools
• System Health Monitoring Tools
• System Performance Monitoring Tools
– Fabric management tools
Network Management Infrastructure and Tools
• Different types of tools exist:
– High-level tools that internally talk to the subnet manager using management datagrams
– Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
• Possible to write your own tools based on the management datagram interface
– Several vendors provide such IB management tools
Tools to Analyze InfiniBand Networks
• Starting with almost no knowledge about the system, we can identify several details of the network configuration
– Example tools include:
• ibstatus: shows adapter status
• smpquery: SMP query tool
• perfquery: reports performance/error counters of a port
• ibportstate: shows the status of an IB port; can enable/disable the port
• ibhosts: finds all the network adapters in the system
• ibswitches: finds all the network switches in the system
• ibnetdiscover: finds the connectivity between the ports
• … and many others exist
– Possible to write your own tools based on the management datagram interface
• Several vendors provide such IB management tools
Network Discovery Tools
• Several tools exist to monitor the health and performance of the InfiniBand network
– Example health monitoring tools include:
• ibdiagnet: queries for overall fabric health
• ibportstate: identifies the state and link speed of an InfiniBand port
• ibdatacounts: gets InfiniBand port data counters
– Example performance monitoring tools include:
• ibv_send_lat, ibv_write_lat: IB verbs level performance tests
• perfquery: queries performance counters in an IB HCA
Health and Performance Monitoring Tools
Tools for Network Switching and Routing
% ibroute -G 0x66a000700067c
Lid Out Destination
Port Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120
GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120
GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...
Packets to LID 0x0001 will be sent out on Port 001
• Based on destination LIDs and switching/routing information, the exact path of the packets can be identified
– If the application communication pattern is known, we can statically identify possible network contention
Static Analysis of Network Contention
[Diagram: leaf and spine blocks of a fat-tree switch, with destination LIDs mapped onto spine blocks to illustrate which flows share links]
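Given a static routing table (destination LID to out-port, as printed by ibroute earlier) and a known communication pattern, the static analysis boils down to counting how many flows land on each out-port. A minimal sketch, with made-up LIDs and ports:

```python
# Count how many flows share each switch out-port under static routing.
# routes mimics an ibroute dump: destination LID -> out-port.
from collections import Counter

def contention(routes, pattern):
    """pattern: list of (src, dest_lid) flows crossing this switch.
    Returns only the out-ports carrying more than one flow, i.e. the
    potential hot-spots for this communication pattern."""
    load = Counter(routes[dlid] for _, dlid in pattern)
    return {port: n for port, n in load.items() if n > 1}
```

Repeating this per switch along each path predicts where a fixed pattern will queue, which is exactly what static routing cannot avoid once two flows hash to the same port.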
• IB provides many optional performance counters that can be queried
– PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
– RNR NAKs: number of times a message was sent, but the receiver had not yet posted a receive buffer
• This can time out, so it can be an error in some cases
– PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
– SWPortVLCongestion: number of packets dropped due to congestion
Dynamic Analysis of Network Contention
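For example, sampling PortXmitWait with perfquery at the start and end of an interval gives a rough congestion estimate: the fraction of time the port had data queued but no credits. The tick duration is device-specific (readable from the port's attributes); the helper below is our own sketch, not a perfquery feature.

```python
def xmit_wait_fraction(wait_ticks_delta, tick_seconds, interval_seconds):
    """Fraction of the sampling interval that the port was stalled,
    from two PortXmitWait readings taken interval_seconds apart.
    wait_ticks_delta: second reading minus first reading.
    tick_seconds: duration of one counter tick (device-specific)."""
    return wait_ticks_delta * tick_seconds / interval_seconds
```

A value near 1.0 over a sustained interval points at a congested link; correlating it with SWPortVLCongestion separates credit starvation from actual drops.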
• Management Infrastructure
– Subnet Manager
– Diagnostic tools
• System Discovery Tools
• System Health Monitoring Tools
• System Performance Monitoring Tools
– Fabric management tools
Network Management Infrastructure and Tools
• InfiniBand provides two forms of management
– Out-of-band management (similar to other networks)
– In-band management (used by the subnet manager)
• Out-of-band management requires a separate Ethernet port on the switch, where an administrator can plug in a laptop
• In-band management allows the switch to receive management commands directly over the regular communication network
In-band Management vs. Out-of-band Management
[Diagram: switches managed over InfiniBand connectivity (in-band management) and over a separate Ethernet connectivity (out-of-band management)]
Overview of OSU INAM
• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
– http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• OSU INAM v0.9.2 released on 10/31/2017
– Significant enhancements to the user interface to enable scaling to clusters with thousands of nodes
– Improved database insert times by using “bulk inserts”
– Capability to look up the list of nodes communicating through a network link
– Capability to classify data flowing over a network link at job-level and process-level granularity in conjunction with MVAPICH2-X 2.3b
– “Best practices” guidelines for deploying OSU INAM on different clusters
• Capability to analyze and profile node-level, job-level, and process-level activities for MPI communication
– Point-to-point, collectives, and RMA
• Ability to filter data based on the type of counters using a “drop down” list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• “Job Page” to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a “live” or “historical” fashion for the entire network, a job, or a set of nodes
OSU INAM Features
• Shows the network topology of large clusters
• Visualizes traffic patterns on different links
• Quickly identifies congested links and links in an error state
• See the history unfold – play back the historical state of the network
Comet@SDSC --- Clustered View (1,879 nodes, 212 switches, 4,377 network links)
Finding Routes Between Nodes
OSU INAM Features (Cont.)
Visualizing a Job (5 Nodes)
• Job-level view
– Shows different network metrics (load, error, etc.) for any live job
– Plays back historical data for completed jobs to identify bottlenecks
• Node-level view – details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (point-to-point, collective, RMA)
– Network metrics (e.g., XmitDiscard, RcvError) per rank/node
• Estimated process-level link utilization view
– Classifies data flowing over a network link at job-level and process-level granularity in conjunction with MVAPICH2-X 2.2rc1
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Deep Learning
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
Common Challenges for Large-Scale Installations
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
• Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
• Network switches
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
• Data communication traverses three buses (or links) before it reaches the network switch
– Memory bus (memory to I/O hub)
– I/O link (I/O hub to the network adapter)
– Network link (network adapter to switch)
• For optimal communication, all of these need to be balanced
• Network bandwidth:
– 4X SDR (8 Gbps), 4X DDR (16 Gbps), 4X QDR (32 Gbps), 4X FDR (56 Gbps), 4X EDR (100 Gbps), and 4X HDR (200 Gbps)
– 40 GigE (40 Gbps)
• Memory bandwidth:
– Shared bandwidth (incoming and outgoing)
– For IB FDR (56 Gbps), memory bandwidth greater than 112 Gbps is required to fully utilize the network
• I/O link bandwidth:
– Tricky because several aspects need to be considered
– Connector capacity vs. link capacity
– I/O link communication headers, etc.
[Diagram: two multi-core processors with local memories, connected through an I/O hub to the network adapter and on to the network switch]
I/O Bus Limitations
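The balance condition above reduces to two simple inequalities: the I/O link must cover the network rate, and the shared memory bus must cover it twice (once inbound, once outbound). A minimal sketch of that check (the function is ours, numbers from the text):

```python
def balanced(network_gbps, io_link_gbps, mem_gbps):
    """True if neither the I/O link nor the shared memory bus caps the
    network link. Memory bandwidth is shared by incoming and outgoing
    traffic, so it must be at least twice the network rate; for IB FDR
    (56 Gbps) that means mem >= 112 Gbps, as noted above."""
    return io_link_gbps >= network_gbps and mem_gbps >= 2 * network_gbps
```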
• Common I/O interconnect used on most current platforms
– Can be configured with multiple lanes (1X, 4X, 8X, 16X, 32X)
• Generation 1 provided 2 Gbps of bandwidth per lane, Gen 2 provides 4 Gbps, and Gen 3 provides 8 Gbps per lane
– Compatible with adapters using fewer lanes (use the full I/O bandwidth)
• If a PCIe connector is 16X, it will still support an 8X adapter by using only 8 lanes
– Provides multiplexing over a smaller number of lanes (beware: connector capacity vs. link capacity)
• A 1X PCIe link can sit behind an 8X PCIe connector (allowing an 8X adapter to be plugged in)
– I/O interconnects are like networks, with packetized communication (beware of the overheads):
• Communication headers for each packet
• Reliability acknowledgments
• Flow-control acknowledgments
• Typical efficiency is around 75-80% with 256-byte PCIe packets
PCI Express
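The per-packet overhead can be made concrete with a back-of-envelope sketch. The 24-byte figure below is an assumption (framing, sequence number, TLP header, and LCRC together are typically 20-30 bytes); it is a header-only estimate, and ack/flow-control DLLPs plus read-completion overheads pull the delivered figure down further, toward the 75-80% quoted above.

```python
def tlp_efficiency(payload_bytes, header_overhead=24):
    """Payload fraction of a single PCIe TLP, ignoring DLLP traffic.
    header_overhead is an assumed per-TLP cost (framing + sequence
    number + header + LCRC); the real value depends on generation and
    header format. Smaller payloads pay proportionally more overhead."""
    return payload_bytes / (payload_bytes + header_overhead)
```

This is why maximum-payload-size settings matter: shrinking the TLP payload from 256 to 64 bytes roughly doubles the relative header cost.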
• Several multi-port adapters are available in the market
– A single adapter can drive multiple network ports at full bandwidth
– Important to measure other overheads (memory bandwidth and I/O link bandwidth) before assuming a performance benefit
• Case study: IB dual-port 4X QDR adapter
– Each network link is 32 Gbps (dual-port adapters can drive 64 Gbps)
– A PCIe Gen2 8X link gives a 32 Gbps data rate, but only around 24 Gbps effective rate (20% encoding overheads!)
• Dual-port IB QDR is not expected to give any benefit in this case
– A PCIe Gen3 8X link gives a 64 Gbps data rate with close to 64 Gbps effective (minimal encoding overheads)
• Delivers close to peak performance with dual-port IB adapters
Multi-port Adapters
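The Gen2 vs. Gen3 case study is mostly encoding arithmetic: Gen2 uses 8b/10b (20% overhead on 5 GT/s lanes), Gen3 uses 128b/130b (~1.5% overhead on 8 GT/s lanes). A sketch of that arithmetic (the helper name is ours):

```python
def effective_gbps(lanes, raw_gt_per_s, enc_data_bits, enc_total_bits):
    """Deliverable data rate of a PCIe link after line encoding."""
    return lanes * raw_gt_per_s * enc_data_bits / enc_total_bits

gen2_x8 = effective_gbps(8, 5.0, 8, 10)      # 8b/10b encoding
gen3_x8 = effective_gbps(8, 8.0, 128, 130)   # 128b/130b encoding
# A dual-port 4X QDR adapter wants 2 x 32 = 64 Gbps: Gen2 x8 cannot
# feed it, while Gen3 x8 comes within a couple of percent.
```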
• Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
• Network switches
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
NUMA Interactions
• Different cores in a NUMA platform have different communication costs
[Diagram: four quad-core sockets (cores 0-15), each with local memory, interconnected by QPI or HT; the network card hangs off one socket’s PCIe link, so its distance differs per core]
Impact of NUMA on Inter-node Latency
[Chart: MPI send latency (us) vs. message size (2 B – 2 KB) for sender/receiver cores 0 and 7 (Socket 0) and 14 and 27 (Socket 1)]
• Cores in Socket 0 (closest to the network card) have the lowest latency
• Cores in Socket 1 (one hop from the network card) have the highest latency
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches
Impact of NUMA on Inter-node Bandwidth
[Charts: MPI send bandwidth (MBps) vs. message size for AMD MagnyCours (cores 0, 6, 12, 18) and Intel Broadwell (cores 0, 7, 14, 27)]
• NUMA interactions have a significant impact on bandwidth
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches
ConnectX-2 QDR (36 Gbps): 2.5 GHz hex-core (MagnyCours) AMD with IB (QDR) switches
• Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
• Network switches
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
• Network adapters and interactions with other components
• Network switches
– Switch topologies
– Switching and routing
• Network bridges
Common Challenges in Building HEC Systems with IB and HSE
• InfiniBand installations come in multiple topologies
– Single crossbar switches (up to 36 ports for QDR or FDR)
• Applicable only to very small systems (hard to scale to large clusters)
– Fat-tree topologies (medium-scale topologies)
• Provide full bisection bandwidth: given independent communication between processes, you can find a switch configuration that provides fully non-blocking paths (though the same configuration might have contention if the communication pattern changes)
• Issue: the number of switch components increases super-linearly with the number of nodes (not scalable for large-scale systems)
• Large-scale installations can use more conservative topologies
– Partial fat-tree topologies (over-provisioning)
– 3D torus (Sandia Red Sky and SDSC Gordon), hypercube (SGI Altix), and 10D hypercube (NASA Pleiades) topologies
Switch Topologies
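The switch-count growth can be seen from the standard two-level folded-Clos (fat-tree) sizing arithmetic. This is a sketch of the counting argument, not a vendor design; the function name and the rounding policy are ours.

```python
import math

def fat_tree_switches(n_hosts, radix):
    """Crossbar count for a full-bisection two-level fat tree built
    from `radix`-port switches. Each leaf uses half its ports for
    hosts and half for uplinks; at most radix**2 // 2 hosts fit
    before a third level (and further growth) is needed."""
    if n_hosts > radix * radix // 2:
        raise ValueError("needs another level of switches")
    leaves = math.ceil(n_hosts / (radix // 2))  # half the ports face hosts
    spines = math.ceil(leaves / 2)              # uplinks spread over spines
    return leaves + spines
```

With 36-port crossbars, 648 hosts already need 54 switch ASICs (36 leaves + 18 spines), i.e. 12 hosts per crossbar instead of the 36 a single crossbar serves, and the per-host cost keeps worsening as levels are added; hence the appeal of torus/hypercube topologies whose switch count scales linearly.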
Switch Topology: Absolute Performance vs. Scalability
[Diagram: a single crossbar ASIC (all-to-all connectivity); a full fat-tree topology of leaf and spine blocks (full bisection bandwidth); a partial fat-tree topology (reduced inter-switch connectivity to free up more out-ports: still super-linear scaling of switch components, but slower than a full fat-tree); a torus/hypercube topology with only a few links connected (linear scaling of switch components)]
• The IB standard only supports static routing
– Not scalable for large systems, where traffic might be non-deterministic, causing hot-spots
• Next-generation IB switches support adaptive routing (in addition to static routing): outside the IB standard
• QLogic (Intel) support for adaptive routing
– Continually monitors application messaging patterns and selects the optimum path for each traffic flow, eliminating slowdowns caused by pathway bottlenecks
– Dispersive routing load-balances traffic among multiple pathways
– http://ir.qlogic.com/phoenix.zhtml?c=85695&p=irol-newsarticle&id=1428788
• Mellanox support for adaptive routing
– Supports moving traffic via multiple parallel paths
– Dynamically and automatically re-routes traffic to alleviate congested ports
– http://www.mellanox.com/related-docs/prod_silicon/PB_InfiniScale_IV.pdf
Static Routing in IB + Adaptive Routing Models from QLogic (Intel) and Mellanox
• Network adapters and interactions with other components
• Network switches
• Network bridges
– IB interoperability with Ethernet and FC
Common Challenges in Building HEC Systems with IB and HSE
IB-Ethernet and IB-FC Bridging Solutions
• Mainly developed for backward compatibility with existing infrastructure
– Ethernet over IB (EoIB)
– Fibre Channel over IB (FCoIB)
[Diagram: a host with an IB adapter, exposing a virtual Ethernet/FC adapter, connects through an Ethernet packet convertor switch (e.g., Mellanox BridgeX) to a host with a native Ethernet/FC adapter]
• Can be used in an infrastructure where a portion of the nodes are connected over Ethernet or FC
– All of the IB-connected nodes can communicate over IB
– The same nodes can communicate with nodes in the older infrastructure using Ethernet-over-IB or FC-over-IB
• Does not have the performance benefits of IB
– The host thinks it is using an Ethernet or FC adapter
– For example, with Ethernet, communication will use TCP/IP
• There is some hardware support for segmentation offload, but the rest of the IB features are unutilized
• Note that this is different from VPI, as there is only one network connection from the adapter
Ethernet/FC over IB
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabrics Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Deep Learning
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
System Specific Challenges for HPC Systems
• Common challenges: adapters and interactions (I/O bus, multi-port adapters, NUMA); switches (topologies, switching/routing); bridges (IB interoperability)
• HPC-specific challenges:
– MPI: multi-rail, collective scalability, application scalability, energy awareness
– PGAS: programmability with performance, optimized resource utilization
– GPU / Xeon Phi: programmability with performance, hiding data movement costs, heterogeneity-aware design, streaming, deep learning
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2, MPI-3.0, and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 2,850 organizations in 85 countries
– More than 440,000 (> 0.44 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ’17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun ’17, 10M cores, 100 PFlops)
• Interaction with Multi-Rail Environments
• Collective Communication
• Scalability for Large-scale Systems
• Energy Awareness
Design Challenges and Sample Results
Impact of Multiple Rails on Inter-node MPI Bandwidth
[Charts: MPI bandwidth (MBytes/sec) vs. message size (1 B – 1 MB) for single-rail and dual-rail configurations with 1, 2, 4, 8, and 16 communicating pairs]
Designs based on: S. Sur, M. J. Koop, L. Chai and D. K. Panda, “Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms”, IEEE Hot Interconnects, 2007
ConnectX-4 EDR (100 Gbps): 2.4 GHz deca-core (Haswell) Intel with IB (EDR) switches
Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: latency (us), Default vs. Multicast, for small messages (2 B – 512 B) and large messages (2 KB – 128 KB) at 102,400 cores, and for 16-byte and 32 KByte messages across varying node counts]
ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
Hardware Multicast-aware MPI_Bcast on Broadwell + EDR
[Charts: latency (us), Default vs. Multicast, for small messages (2 B – 512 B) and large messages (2 KB – 128 KB) at 1,120 cores, and for 16-byte and 32 KByte messages across varying node counts]
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with Mellanox IB (EDR) switches
Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders
• Socket-based design can reduce the communication latency by 23% and 40% on Xeon + IB nodes
• Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b
[Charts (lower is better): HPCG communication latency (seconds) at 56–448 processes (28 PPN), and OSU Micro Benchmark latency (us) for 4 B – 4 KB messages (16 nodes, 28 PPN); MVAPICH2 vs. Proposed-Socket-Based vs. MVAPICH2+SHArP, showing the 23% and 40% improvements]
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
Performance of MPI_Allreduce on Stampede2 (10,240 Processes)
[Charts: latency (us) vs. message size (4 B – 256 KB), OSU Micro Benchmark, 64 PPN; MVAPICH2 vs. MVAPICH2-OPT vs. Intel MPI (IMPI)]
• MPI_Allreduce latency with 32 KB messages reduced by 2.4X
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; Default vs. Topology-Aware for a 2048-core run, showing a 15% improvement]
• Reduced network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist
Dynamic and Adaptive Tag Matching
• Challenge: tag matching is a significant overhead for receivers; existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead
• Solution: a new tag matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases decisions on the number of request objects that must be traversed before hitting the required one
• Results: better performance than other state-of-the-art tag-matching schemes, with minimum memory consumption
[Charts: normalized total tag matching time and normalized memory overhead per process at 512 processes, relative to the default (lower is better)]
Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee]
• Will be available in future MVAPICH2 releases
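The per-rank adaptation idea can be sketched as follows. This is a heavily simplified, hypothetical model of our own (the published design differs in detail): keep a linear unexpected-message queue per peer, and switch a peer to a hashed-by-tag scheme once a search has to traverse too many entries.

```python
# Sketch of adaptive tag matching: linear search per peer by default,
# promotion to per-tag hash bins when searches get long.
from collections import defaultdict, deque

class AdaptiveMatcher:
    def __init__(self, threshold=4):
        self.threshold = threshold         # max tolerated traversal length
        self.linear = defaultdict(deque)   # rank -> queue of posted tags
        self.hashed = {}                   # rank -> {tag: deque of tags}

    def post(self, rank, tag):
        if rank in self.hashed:
            self.hashed[rank].setdefault(tag, deque()).append(tag)
        else:
            self.linear[rank].append(tag)

    def match(self, rank, tag):
        if rank in self.hashed:
            bucket = self.hashed[rank].get(tag)
            return bucket.popleft() if bucket else None
        q = self.linear[rank]
        for i, t in enumerate(q):
            if t == tag:
                del q[i]
                if i >= self.threshold:    # search was too long:
                    self._promote(rank)    # adapt this rank to hashing
                return t
        return None

    def _promote(self, rank):
        bins = {}
        for t in self.linear.pop(rank):
            bins.setdefault(t, deque()).append(t)
        self.hashed[rank] = bins
```

Ranks with short queues never pay the hashing memory overhead, which is the memory-consumption point made above.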
● Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables
● Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU
● Control the runtime’s behavior via MPI Control Variables (CVARs)
● Introduced support for new MPI_T based CVARs to MVAPICH2
○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
● TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications
Performance Engineering Applications using MVAPICH2 and TAU
[Screenshots: VBUF usage without and with CVAR-based tuning, as displayed by ParaProf]
Dynamic and Adaptive MPI Point-to-point Communication Protocols
• Desired eager threshold for an example communication pattern (processes 0–3 on Node 1 paired with processes 4–7 on Node 2): pair 0–4: 32 KB, pair 1–5: 64 KB, pair 2–6: 128 KB, pair 3–7: 32 KB
• Eager threshold chosen by each design:
– Default: 16 KB for every pair – poor overlap, low memory requirement (low performance, high productivity)
– Manually Tuned: 128 KB for every pair – good overlap, high memory requirement (high performance, low productivity)
– Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, and 32 KB per pair – good overlap, optimal memory requirement (high performance, high productivity)
[Charts: execution time (seconds) and relative memory consumption of Amber at 128–1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper
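The per-pair selection above can be sketched as a simple policy: raise each pair's eager threshold just far enough to cover the message sizes that pair actually exchanges. This is an illustrative policy of our own, not the published protocol, which adapts at runtime rather than from a recorded trace.

```python
def adapt_thresholds(observed_sizes, choices=(16, 32, 64, 128)):
    """observed_sizes: {pair: list of message sizes in KB}.
    Returns {pair: eager threshold in KB}: the smallest configured
    choice covering the pair's largest message (capped at the largest
    choice), so memory is only spent where eager delivery pays off."""
    out = {}
    for pair, sizes in observed_sizes.items():
        need = max(sizes)
        out[pair] = next((c for c in choices if c >= need), choices[-1])
    return out
```

Applied to the example pattern, pairs sending up to 32, 64, 128, and 17 KB get thresholds of 32, 64, 128, and 32 KB respectively, matching the per-pair values above instead of one global setting.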
Enhanced MPI_Bcast with Optimized CMA-based Design
[Charts: latency (us, log scale) vs. message size (1 KB – 4 MB) on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes); MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the proposed design, which switches from shared memory (SHMEM) to CMA at larger message sizes]
• Up to 2x – 4x improvement over the existing implementation for 1 MB messages
• Up to 1.5x – 2x faster than Intel MPI and Open MPI for 1 MB messages
• Improvements obtained for large messages only
• p-1 copies with CMA, p copies with shared memory
• Falls back to SHMEM for small messages
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, Best Paper Finalist
Support is available in MVAPICH2-X 2.3b
Designing Energy-Aware (EA) MPI Runtime
[Diagram: overall application energy expenditure splits into energy spent in communication routines (point-to-point, collective, and RMA) and in computation routines; MVAPICH2-EA designs target MPI two-sided and collectives (e.g., MVAPICH2) and impact MPI-3 RMA implementations (e.g., MVAPICH2), one-sided runtimes (e.g., ComEx), and other PGAS implementations (e.g., OSHMPI)]
• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies the energy reduction lever to each MPI call
• Available for download from the MVAPICH project site since Aug ’15
MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)
A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing ‘15, Nov 2015 [Best Student Paper Finalist]
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
• Global view improves programmer productivity
• Idea is to decouple data movement from process synchronization
• Processes should have asynchronous access to globally distributed data
• Well suited for irregular applications and kernels that require dynamic access to different data
• Different approaches:
– Library-based (Global Arrays, OpenSHMEM)
– Compiler-based (Unified Parallel C (UPC), Co-Array Fortran (CAF))
– HPCS language-based (X10, Chapel, Fortress)
Partitioned Global Address Space (PGAS) Models
[Diagram: three models compared. Shared Memory Model (SHMEM, DSM): processes P1-P3 share one memory; Distributed Memory Model (MPI): each process has its own private memory; PGAS: per-process memories combined into a logical shared memory]
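The library-based approach can be made concrete with a minimal OpenSHMEM sketch. This is illustrative only: the PE targets and the value 42 are arbitrary, and it needs an OpenSHMEM toolchain (e.g. oshcc/oshrun) to build and launch.

```c
/* Minimal OpenSHMEM sketch of the library-based PGAS style described above:
 * each PE holds a symmetric variable, and PE 0 writes into PE 1's copy with
 * a one-sided put -- no matching receive is posted on the target. */
#include <stdio.h>
#include <shmem.h>

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    static int dest = -1;          /* symmetric: exists on every PE */

    if (me == 0 && npes > 1) {
        int value = 42;
        shmem_int_put(&dest, &value, 1, 1);  /* one-sided write to PE 1 */
    }
    shmem_barrier_all();           /* ensure the put has completed and is visible */

    if (me == 1)
        printf("PE 1 observed dest = %d\n", dest);

    shmem_finalize();
    return 0;
}
```

The asynchronous, one-sided access shown here is exactly what makes PGAS attractive for irregular kernels: PE 0 never needs PE 1's cooperation to deliver the data.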
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
[Diagram: HPC application composed of sub-kernels. Kernel 1 (MPI), Kernel 2 (MPI, re-written in PGAS), Kernel 3 (MPI), ..., Kernel N (MPI, re-written in PGAS)]
MVAPICH2-X for Hybrid MPI + PGAS Applications
• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
– Possible deadlock if both runtimes are not progressed
– Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF
– Available since 2012 (starting with MVAPICH2-X 1.9) – http://mvapich.cse.ohio-state.edu
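A hybrid program over such a unified runtime can be sketched as follows. This is an illustrative fragment, not MVAPICH2-X-specific code: the phases and values are invented, the calls follow the MPI and OpenSHMEM standards, and in practice it is built with a hybrid-capable toolchain and launched with mpirun.

```c
/* Hybrid MPI + OpenSHMEM sketch: both models in one program, the pattern a
 * unified runtime supports without deadlock or duplicated network resources. */
#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* PGAS phase: every PE atomically increments a symmetric counter on PE 0 */
    static long counter = 0;
    shmem_long_add(&counter, 1, 0);
    shmem_barrier_all();

    /* MPI phase: collective reduction over the same set of processes */
    long local = 1, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("counter = %ld, allreduce sum = %ld\n", counter, sum);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```

With separate runtimes, the shmem_barrier_all and MPI_Allreduce above could each stall waiting for progress in the other library; a single runtime progresses both.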
UPC++ Collectives Performance
[Diagram: two runtime stacks for an MPI + UPC++ application. Left: UPC++ runtime over GASNet interfaces with an MPI conduit. Right: UPC++ runtime and MPI interfaces over the MVAPICH2-X unified communication runtime (UCR)]
• Full and native support for hybrid MPI + UPC++ applications
• Better performance compared to IBV and MPI conduits
• OSU Micro-benchmarks (OMB) support for UPC++
• Available since MVAPICH2-X 2.2RC1
[Chart: inter-node broadcast time (us) vs. message size, 64 nodes (1 ppn), comparing GASNet_MPI, GASNET_IBV, and MV2-X; MV2-X up to 14x faster]
J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)
Application-Level Performance with Graph500 and Sort
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013
[Chart: Graph500 execution time (s) vs. no. of processes (4K-16K), comparing MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); up to 13x and 7.6x improvements over MPI-Simple]
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4x improvement over MPI-CSR; 7.6x improvement over MPI-Simple
– 16,384 processes: 1.5x improvement over MPI-CSR; 13x improvement over MPI-Simple
Sort Execution Time
[Chart: Sort execution time (seconds) vs. input data and no. of processes (500GB-512 to 4TB-4K), MPI vs. Hybrid; 51% improvement]
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
• 4,096 processes, 4 TB input size:
– MPI: 2408 sec (0.16 TB/min)
– Hybrid: 1172 sec (0.36 TB/min)
– 51% improvement over the MPI design
J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS’14
Performance of PGAS Models on KNL using MVAPICH2-X
[Chart: intra-node PUT latency (us) vs. message size (1B-1M) for shmem_put, upc_putmem, and upcxx_async_put]
[Chart: intra-node GET latency (us) vs. message size (1B-1M) for shmem_get, upc_getmem, and upcxx_async_get]
• Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using MVAPICH2-X communication conduit
• Near-native communication performance is observed on KNL
Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation
• On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4x improved performance
[Chart: Heat-Image kernel, time (s) vs. no. of processes (16-128) for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]
[Chart: Heat-2D kernel using the Jacobi method, time (s) vs. no. of processes (16-128), same configurations]
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(GPU data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
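Putting the two calls above in context, a minimal CUDA-aware MPI program might look like this. It is a sketch, not MVAPICH2's internals: the buffer size and tag are arbitrary, error checking is omitted, and it requires a CUDA-aware MPI build plus CUDA to compile and run.

```c
/* CUDA-aware MPI sketch: device pointers are passed straight to
 * MPI_Send/MPI_Recv; a CUDA-aware library (e.g. MVAPICH2-GDR) performs the
 * staging or GPUDirect RDMA transfer internally. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;          /* one million floats */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0) {
        /* s_devbuf from the slide: a GPU buffer handed directly to MPI */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* r_devbuf from the slide: received straight into GPU memory */
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-awareness, the application would have to cudaMemcpy each buffer to a host staging area around every MPI call; here that productivity and pipelining burden moves into the library.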
Optimized MVAPICH2-GDR Design
[Chart: GPU-GPU inter-node latency (us) vs. message size (1B-8K), MV2-(NO-GDR) vs. MV2-GDR-2.3a; 1.88 us, up to 11x improvement]
[Chart: GPU-GPU inter-node bandwidth (MB/s) vs. message size (1B-4K); up to 9x improvement]
[Chart: GPU-GPU inter-node bi-bandwidth (MB/s) vs. message size (1B-4K); up to 10x improvement]
Platform: MVAPICH2-GDR 2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (HOOMD-blue)
[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, MV2 vs. MV2+GDR; about 2x improvement]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2x improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Enhanced Support for GPU Managed Memory
● CUDA managed memory => no memory pin-down
● No IPC support for intra-node communication
● No GDR support for inter-node communication
● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
● Initial and basic support in MVAPICH2-GDR: for both intra- and inter-node transfers, "pipeline through" host memory
● Enhanced intra-node managed memory support uses IPC
● Double-buffered, pair-wise IPC-based scheme brings IPC performance to managed memory: high performance and high productivity, with a 2.5x improvement in bandwidth
● OMB extended to evaluate the performance of point-to-point and collective communication using managed buffers
[Chart: bandwidth (MB/s) vs. message size (32K-2M), Enhanced design vs. MV2-GDR 2.2b; 2.5x improvement]
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16
[Chart: 2D stencil halo exchange time (ms) vs. total dimension size (1B-16K), halo width 1, Device vs. Managed buffers]
• Streaming applications on GPU clusters
– Use a pipeline of broadcast operations to move host-resident data from a single source (typically live) to multiple GPU-based computing sites
– Existing schemes require explicit data movement between host and GPU memories, giving poor performance and breaking the pipeline
• IB hardware multicast + scatter-list
– Efficient heterogeneous-buffer broadcast operation
• CUDA Inter-Process Communication (IPC)
– Efficient intra-node topology-aware broadcast operations for multi-GPU systems
• Available in MVAPICH2-GDR 2.3a!
High-Performance Heterogeneous Broadcast for Streaming Applications
[Diagram: broadcast pipeline. The source node's CPU multicasts host-resident data through the IB switch to nodes 1-N; each node's IB HCA delivers it via a scatter-list step to both host and GPU memory, and IPC-based cudaMemcpy (device-to-device) fans the data out to GPUs 0-N within a node]
Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters. C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh , B. Elton, and D. K. Panda, SBAC-PAD'16, Oct 2016.
Control Flow Decoupling through GPUDirect Async
• CPU offloads the compute, communication, and synchronization tasks to the GPU
– All operations asynchronous from the CPU; hides the kernel launch overhead
– Needs stream-based extensions to MPI semantics
• Latency oriented: able to hide the kernel launch overhead
– 25% improvement at 256 bytes
• Throughput oriented: asynchronously offloads the communication and computation tasks to the offload queue
– 14% improvement at 1KB message size
• Platform: Intel Sandy Bridge, NVIDIA K20, and Mellanox FDR HCA
• Will be available in a public release soon
[Diagram: GPU, CPU, and HCA timelines showing the kernel launch overhead hidden]
[Chart: latency oriented (Kernel+Send and Recv+Kernel), latency (us) vs. message size (1B-4K), Default MPI vs. Enhanced MPI+GDS]
[Chart: overlap with host computation/communication, overlap (%) vs. message size (1B-4K), Default MPI vs. Enhanced MPI+GDS]
• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS) models
• GPU Computing
• Xeon Phi Computing
HPC System Challenges and Case Studies
• On-load approach
– Takes advantage of the idle cores
– Dynamically configurable
– Takes advantage of highly multithreaded cores
– Takes advantage of MCDRAM of KNL processors
• Applicable to other programming models such as PGAS, Task-based, etc.
• Provides portability, performance, and applicability to runtime as well as applications in a transparent manner
Enhanced Designs for KNL: MVAPICH2 Approach
Performance Benefits of the Enhanced Designs
• New designs to exploit high concurrency and MCDRAM of KNL
• Significant improvements for large message sizes
• Benefits seen in varying message size as well as varying MPI processes
[Charts, MVAPICH2 vs. MVAPICH2-Optimized: (1) very large message bi-directional bandwidth (MB/s) vs. message size (2M-64M); (2) 16-process intra-node All-to-All latency (us) vs. no. of processes (4-16), 27% improvement; (3) intra-node Broadcast latency (us) vs. message size (1M-32M, 64MB messages), 17.2-52% improvement]
Performance Benefits of the Enhanced Designs
[Charts: (1) multi-bandwidth (MB/s) using 32 MPI processes vs. message size (1M-64M), comparing MV2 default and optimized designs on DRAM and MCDRAM, up to 30% improvement; (2) CNTK MLP training time (s) on MNIST (batch size 64) for varying MPI processes : OMP threads, up to 15% improvement with the optimized design]
• Benefits observed on training time of Multi-level Perceptron (MLP) model on MNIST dataset using CNTK Deep Learning Framework
Enhanced Designs will be available in upcoming MVAPICH2 releases
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabric Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Big Data
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
System-Specific Challenges for Big Data Processing
Common challenges:
– Adapters and interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching/routing
– Bridges: IB interoperability
System-specific:
– Big Data: taking advantage of RDMA; performance; scalability; backward compatibility
– HPC (MPI): multi-rail, collectives, scalability, application scalability, energy awareness
– PGAS: programmability with performance; optimized resource utilization
– GPU/MIC: programmability with performance; hiding data movement costs; heterogeneity-aware design
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Bring HPC and Big Data processing into a "convergent trajectory"!
Can We Run Big Data Jobs on Existing HPC Infrastructure?
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Diagram: layered stack. Applications; Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached); programming models (sockets; upper-level changes? RDMA?); the communication and I/O library (point-to-point communication, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, virtualization (SR-IOV), benchmarks); networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); storage technologies (HDD, SSD, NVM, and NVMe-SSD); commodity computing system architectures (multi- and many-core architectures and accelerators)]
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 275 organizations from 34 countries
• More than 24,700 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE; also runs on Ethernet
Support for OpenPower is available
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
[Charts: execution time (s) vs. data size (80-240 GB), IPoIB (EDR) vs. OSU-IB (EDR), for RandomWriter and TeraGen]
Cluster with 8 Nodes with a total of 64 maps
• RandomWriter: 3x improvement over IPoIB for 80-160 GB file sizes
• TeraGen: 4x improvement over IPoIB for 80-240 GB file sizes
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
• Sort: 61% improvement over IPoIB for 80-160 GB data (cluster of 8 nodes with a total of 64 maps and 32 reduces)
• TeraSort: 18% improvement over IPoIB for 80-240 GB data (cluster of 8 nodes with a total of 64 maps and 14 reduces)
[Charts: execution time (s) vs. data size (GB), IPoIB (EDR) vs. OSU-IB (EDR); Sort reduced by 61%, TeraSort by 18%]
• Design features:
– RDMA-based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with the communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016
[Diagram: Apache Spark benchmarks/applications/libraries/frameworks run on Spark Core with a shuffle manager (Sort, Hash, Tungsten-Sort) and a block transfer service (Netty, NIO, RDMA-Plugin). The Netty and NIO servers/clients use the Java socket interface over 1/10/40/100 GigE or IPoIB networks; the RDMA server/client uses the Java Native Interface (JNI) to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, ...)]
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536 maps, 1536 reduces)
• RDMA vs. IPoIB (56 Gbps) with 1536 concurrent tasks, single SSD per node:
– SortBy: total time reduced by up to 80%
– GroupBy: total time reduced by up to 74%
[Charts: SortByTest and GroupByTest total time (sec) vs. data size (64-256 GB) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; up to 80% and 74% reductions]
Application Evaluation on SDSC Comet
• Kira Toolkit: Distributed astronomy image processing toolkit implemented using Apache Spark
– https://github.com/BIDS/Kira
• Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files.
[Chart: execution time (sec), RDMA Spark 21% faster than Apache Spark (IPoIB)]
Execution times (sec) for Kira SE benchmark using 65 GB dataset, 48 cores.
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
• BigDL: distributed deep learning tool using Apache Spark
– https://github.com/intel-analytics/BigDL
• VGG training model on the CIFAR-10 dataset
[Chart: one-epoch time (sec) vs. number of cores (24-384), IPoIB vs. RDMA; up to 4.58x improvement]
Using HiBD Packages on Existing HPC Infrastructure
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email [email protected] (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
[Chart: Memcached GET latency (us) vs. message size (1B-4K), OSU-IB (FDR) vs. IPoIB (FDR); latency reduced by nearly 20x]
[Chart: Memcached throughput (thousands of transactions per second) vs. no. of clients (16-4080); nearly 2x improvement]
• Memcached GET latency:
– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
• Memcached throughput (4 bytes):
– 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec (nearly 2x improvement in throughput)
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)
• Advanced Features for InfiniBand
• Advanced Features for High Speed Ethernet
• RDMA over Converged Ethernet
• Open Fabrics Software Stack and RDMA Programming
• Libfabric Software Stack and Programming
• Network Management Infrastructure and Tools
• Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
• System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Big Data
– Cloud Computing
• Conclusions and Final Q&A
Presentation Overview
System-Specific Challenges for Cloud Computing
Common challenges:
– Adapters and interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching/routing
– Bridges: IB interoperability
System-specific:
– Cloud Computing: SR-IOV support, virtualization, containers
– HPC (MPI): multi-rail, collectives, scalability, application scalability, energy awareness
– PGAS: programmability with performance; optimized resource utilization
– GPU/Xeon Phi: programmability with performance; hiding data movement costs; heterogeneity-aware design
– Big Data: taking advantage of RDMA; performance; scalability; backward compatibility
• Cloud Computing widely adopted in industry computing environment
• Cloud Computing provides high resource utilization and flexibility
• Virtualization is the key technology to enable Cloud Computing
• Intersect360 study shows cloud is the fastest growing class of HPC
• HPC Meets Cloud: The convergence of Cloud Computing and HPC
HPC Meets Cloud Computing
• Virtualization has many benefits:
– Fault tolerance
– Job migration
– Compaction
• Virtualization has not been very popular in HPC due to the overhead associated with it
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
– OpenStack, Docker, and Singularity
Can HPC and Virtualization be Combined?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid'15
Application-Level Performance on Chameleon
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) vs. problem size (scale, edgefactor), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
Application-Level Performance on Docker with MVAPICH2
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500, respectively
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500, respectively
[Charts: NAS Class D (MG, FT, EP, LU, CG) execution time (s), and Graph 500 BFS execution time (ms) at scale/edgefactor (20,16) for 1Cont*16P, 2Conts*8P, and 4Conts*4P, comparing Container-Def, Container-Opt, and Native]
Application-Level Performance on Singularity with MVAPICH2
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
[Charts: Graph500 BFS execution time (ms) vs. problem size (scale, edgefactor), and NPB Class D (CG, EP, FT, IS, LU, MG) execution time (s), Singularity vs. Native]
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17, Best Student Paper Award
• Challenges
– Existing designs in Hadoop are not virtualization-aware
– No support for automatic topology detection
• Design
– Automatic topology detection using a MapReduce-based utility
• Requires no user input
• Can detect topology changes during runtime without affecting running jobs
– Virtualization- and topology-aware communication through map task scheduling and YARN container allocation policy extensions
Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom’16, December 2016
[Charts: execution time for Hadoop benchmarks (Sort, WordCount, PageRank at 40/60 GB) and Hadoop applications (CloudBurst, Self-join in default and distributed modes), RDMA-Hadoop vs. Hadoop-Virt; reductions of up to 55% and 34%]
• Presented advanced features of InfiniBand, HSE, Omni-Path, and RoCE
• Provided an overview of Open Fabrics Verbs-level and Libfabrics-level Programming and InfiniBand Network Management
• Discussed common set of challenges in designing HEC Systems
• Presented Challenges and Solutions in designing various High-End Computing systems with IB, Omni-Path, and HSE
• IB, Omni-Path, and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions
Concluding Remarks
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborty (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientists
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/