Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet


Page 1: Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet

A Tutorial at Supercomputing '09 by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Pavan Balaji
Argonne National Laboratory
E-mail: [email protected]
http://www.mcs.anl.gov/~balaji

Matthew Koop
NASA Goddard
E-mail: matthew.koop@nasa.gov
http://www.cse.ohio-state.edu/~koop

Page 2: Presentation Overview

• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The OpenFabrics Software Stack
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A

Page 3: HPC Clusters and Applications

• Multi-core processors becoming increasingly common
• System scales growing rapidly
  – Small (~256 cores: 8/16 nodes with 8/16 cores each)
  – Medium (~1K cores: 32/64 nodes with 8/16 cores each)
  – Large (~10K cores: 320/640 nodes with 8/16 cores each)
  – Huge (~100K cores: 3200/6400 nodes with 8/16 cores each)
• Large range of applications
  – Scientific
  – Commercial
• Diverse computation and communication characteristics
• Diverse scaling requirements

Page 4: Trends for Computing Clusters in the Top 500 List

• Top 500 list of Supercomputers (www.top500.org)

  Jun. 2001:  33/500  (6.6%)     Nov. 2005: 360/500 (72.0%)
  Nov. 2001:  43/500  (8.6%)     Jun. 2006: 364/500 (72.8%)
  Jun. 2002:  80/500 (16.0%)     Nov. 2006: 361/500 (72.2%)
  Nov. 2002:  93/500 (18.6%)     Jun. 2007: 373/500 (74.6%)
  Jun. 2003: 149/500 (29.8%)     Nov. 2007: 406/500 (81.2%)
  Nov. 2003: 208/500 (41.6%)     Jun. 2008: 400/500 (80.0%)
  Jun. 2004: 291/500 (58.2%)     Nov. 2008: 410/500 (82.0%)
  Nov. 2004: 294/500 (58.8%)     Jun. 2009: 410/500 (82.0%)
  Jun. 2005: 304/500 (60.8%)     Nov. 2009: To be announced

Page 5: PetaFlop to ExaFlop Computing

• 10 PFlops expected in 2011
• 100 PFlops expected in 2015
• Expected to have an ExaFlop system in 2018-2019!

Page 6: Integrated High-End Computing Environments

[Diagram: a compute cluster (frontend node and compute nodes) connects over a LAN to a storage cluster (a meta-data manager holding metadata, and I/O server nodes holding data); a LAN/WAN link connects these to a datacenter for visualization and mining, built from tiers of routers/servers, application servers and database servers (Tier 1 through Tier 3) joined by switches. Different requirements apply to each type of HEC system.]

Page 7: Programming Models for HPC Clusters

• Message Passing Interface (MPI) is the de-facto standard
  – MPI 1.0 was the initial standard; MPI 2.2 is the latest
  – MPI 3.0 is under discussion in the MPI Forum
• Other models coming up as well
  – Traditional Partitioned Global Address Space models (Global Arrays, UPC, Coarray Fortran)
  – HPCS languages (X10, Chapel, Fortress)

Page 8: Networking Requirements for HPC Clusters

• Different communication patterns
  – Point-to-point
    • Low latency, high bandwidth, low CPU usage, overlap of computation & communication
  – Scalable collective or "group" communication
    • Broadcast, barrier, reduce, all-reduce, all-to-all, etc.
  – Support for concurrent multi-pair communication at the NIC
    • Emerging multi-core architectures
• Reduced network contention and congestion
  – Good routing scheme
  – Efficient congestion control management
• Reliability and fault tolerance
  – Failure detection and recovery (network and process level)
  – Ease of administration

Page 9: Sample Diagram of State-of-the-Art File Systems

• Sample file systems:
  – Lustre, Panasas, GPFS, Sistina/Redhat GFS
  – PVFS, Google File System, Oracle Cluster File System (OCFS2)

[Diagram: computing nodes and I/O servers connected through a network, with a metadata server attached.]

Page 10: Networking Requirements for Storage Clusters

• Several requirements are similar to HPC clusters
  – Low latency, high bandwidth, low CPU utilization
  – Reduced network contention & congestion
  – Reliability and fault tolerance
• Several unique challenges as well
  – Aggregate bandwidth becomes very important
    • Systems typically use fewer I/O nodes than compute nodes
  – Quality of Service (QoS)
    • The same set of file servers supports multiple "client applications"
  – Network unification
    • Need to work with both compute systems & "standard" storage protocols (Fibre Channel, iSCSI)

Page 11: Enterprise Datacenter Environments

[Diagram: clients on the WAN reach a proxy/web-server tier (Apache), which forwards to an application server tier (PHP) and a database server tier (MySQL) backed by storage; computation and communication requirements grow toward the back-end tiers.]

• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
  – If not cached, the request is forwarded to the next tier: the application server
• The application server performs the business logic (CGI, Java servlets)
  – Retrieves appropriate data from the database to process the requests

Page 12: Networking Requirements for Datacenters

• Support for a large number of communication streams
• Heterogeneous compute requirements
  – Front-end servers are typically much less loaded than back-end servers; communication must be handled in a load-resilient manner
• Quality of Service (QoS)
  – The same set of servers responds to many clients
• Network virtualization
  – Zero-overhead communication in virtual environments
  – Efficient migration

Page 13: Networking Requirements for Integrated Environments

• High-performance WAN-level communication
  – Low latency
  – High bandwidth (unidirectional and bidirectional)
  – Low CPU utilization
• Good performance in spite of delays
  – Out-of-order messages
  – Buffering requirements to keep high-bandwidth pipes full
• Seamless integration between LAN and WAN protocols
• Resiliency to network failures

Page 14: Presentation Overview

• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The OpenFabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A

Page 15: A Typical IB Network

• Three primary components:
  – Channel Adapters
  – Switches/Routers
  – Links and connectors

Page 16: Communication and Management Semantics

• Two forms of communication semantics
  – Channel semantics (Send/Recv)
  – Memory semantics (RDMA, atomic operations)
• Management model
  – A detailed management model complete with managers, agents, messages and protocols
• Verbs interface
  – A low-level programming interface for performing communication as well as management

Page 17: Communication in the Channel Semantics (Send/Receive Model)

[Diagram: two nodes, each with send/receive buffers in memory, a processor, and a QP (send and receive queues) with a CQ on an InfiniBand device; the hardware generates the ACK.]

• The processor is involved only to:
  1. Post a receive WQE
  2. Post a send WQE
  3. Pull completed CQEs out of the CQ
• The send WQE contains information about the send buffer
• The receive WQE contains information about the receive buffer; incoming messages have to be matched to a receive WQE to know where to place the data
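
To make these three steps concrete, here is a minimal sketch using the OpenFabrics verbs API. It assumes an already-connected QP (qp), its CQ (cq), and a registered memory region (mr) covering the buffers; names such as recv_buf, send_buf, msg_len and BUF_SIZE are illustrative, and error handling is omitted.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* 1. Post a receive WQE describing where an incoming message may land */
    struct ibv_sge r_sge = { .addr = (uintptr_t) recv_buf,
                             .length = BUF_SIZE, .lkey = mr->lkey };
    struct ibv_recv_wr r_wr = { .wr_id = 1, .sg_list = &r_sge, .num_sge = 1 };
    struct ibv_recv_wr *r_bad;
    ibv_post_recv(qp, &r_wr, &r_bad);

    /* 2. Post a send WQE describing the send buffer */
    struct ibv_sge s_sge = { .addr = (uintptr_t) send_buf,
                             .length = msg_len, .lkey = mr->lkey };
    struct ibv_send_wr s_wr = { .wr_id = 2, .sg_list = &s_sge, .num_sge = 1,
                                .opcode = IBV_WR_SEND,
                                .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *s_bad;
    ibv_post_send(qp, &s_wr, &s_bad);

    /* 3. Pull the completed CQE out of the CQ; the adapter does the rest */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* busy-poll until the WQE completes */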

Page 18: Communication in the Memory Semantics (RDMA Model)

[Diagram: the same two-node setup as the previous slide; the hardware generates the ACK.]

• The initiator processor is involved only to:
  1. Post a send WQE
  2. Pull the completed CQE out of the send CQ
• No involvement from the target processor
• The send WQE contains information about both the send buffer and the receive buffer
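
The memory semantics reduce to a single work request on the initiator; a sketch under the same assumptions as the previous page, where remote_addr and remote_rkey stand for the address and rkey of the target's exposed buffer, exchanged out of band:

    /* Initiator: a single WQE names both the local source and the remote target */
    struct ibv_sge sge = { .addr = (uintptr_t) local_buf,
                           .length = msg_len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,
                              .send_flags = IBV_SEND_SIGNALED,
                              .wr.rdma = { .remote_addr = remote_addr,
                                           .rkey = remote_rkey } };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
    /* The target CPU posts nothing and sees no completion for this write */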

Page 19: Basic iWARP Capabilities

• Supports most of the communication features supported by IB (with minor differences)
  – Hardware acceleration, RDMA, multicast, QoS
• Lacks some features
  – E.g., atomic operations
• … but supports some other features
  – Out-of-order data placement (useful for iSCSI semantics)
  – Fine-grained data rate control (very useful for long-haul networks)
  – Fixed-bandwidth QoS (more details later)

Page 20: IB and 10GE: Commonalities and Differences

  Feature                | IB                        | iWARP/10GE
  -----------------------|---------------------------|----------------------------------------------
  Hardware acceleration  | Supported                 | Supported (for TOE and iWARP)
  RDMA                   | Supported                 | Supported (for iWARP)
  Atomic operations      | Supported                 | Not supported
  Multicast              | Supported                 | Supported
  Data placement         | Ordered                   | Out-of-order (for iWARP)
  Data rate control      | Static and coarse-grained | Dynamic and fine-grained (for TOE and iWARP)
  QoS                    | Prioritization            | Prioritization and fixed-bandwidth QoS

Page 21: One-way Latency: MPI over IB

[Charts: small-message and large-message one-way MPI latency vs. message size for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; the small-message latencies of the five configurations are 1.06, 1.28, 1.49, 2.19 and 2.77 us.]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel, back-to-back
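
Latency curves of this kind come from a simple ping-pong microbenchmark; below is a minimal sketch in C, modeled loosely on such benchmarks (it is not MVAPICH or OSU benchmark code).

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI ping-pong latency sketch (run with exactly 2 ranks) */
    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000, size = 4;
        char buf[4] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)  /* one-way latency = round-trip time / 2 */
            printf("%.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));
        MPI_Finalize();
        return 0;
    }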

Page 22: Bandwidth: MPI over IB

[Charts: unidirectional and bidirectional MPI bandwidth vs. message size for the same five MVAPICH configurations; peak unidirectional bandwidths are 3022.1, 1952.9, 1399.8, 1389.4 and 936.5 MillionBytes/s, and peak bidirectional bandwidths are 5553.5, 3621.4, 2718.3, 2457.4 and 1519.8 MillionBytes/s.]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.4 GHz Quad-core (Nehalem) Intel with IB switch

Page 23: One-way Latency: MPI over iWARP

[Chart: one-way MPI latency vs. message size (1 byte to 4 KB) on Chelsio adapters; TCP/IP reaches 15.47 us where iWARP reaches 6.88 us.]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Page 24: Bandwidth: MPI over iWARP

[Charts: unidirectional and bidirectional MPI bandwidth vs. message size for Chelsio TCP/IP vs. Chelsio iWARP; the marked peaks are 839.8 and 1231.8 MillionBytes/s (unidirectional) and 855.3 and 2260.8 MillionBytes/s (bidirectional).]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Page 25: Presentation Overview

• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The OpenFabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A

Page 26: Advanced Capabilities in InfiniBand

[Figure: overview of the advanced IB capabilities covered in this section; complete hardware implementations exist.]

Page 27: Basic IB Capabilities at Each Protocol Layer

• Link layer
  – CRC-based data integrity, buffering and flow control, virtual lanes, service levels and QoS, switching and multicast, WAN capabilities
• Network layer
  – Routing and flow labels
• Transport layer
  – Reliable Connection, Unreliable Datagram, Reliable Datagram and Unreliable Connection
  – Shared Receive Queues and eXtended Reliable Connections (discussed in more detail later)

Page 28: Advanced Capabilities in IB

• Link layer
  – Congestion control
• Network layer
  – Multipathing capability and Automatic Path Migration
• Transport layer
  – Shared Receive Queues, eXtended Reliable Connection
  – Data segmentation, transaction ordering
  – Message-level end-to-end flow control
  – Static rate control and auto-negotiation
• Management tools

Page 29: Congestion Control

• A switch detects congestion on a link
  – Detects whether it is the root or the victim of the congestion
• IB follows a three-step protocol
  – Forward Explicit Congestion Notification (FECN)
    • Used to communicate congested-port status
    • The switch sets the FECN bit, marking packets leaving the congested port
  – Backward Explicit Congestion Notification (BECN)
    • The destination sends a BECN to the sender, informing it about the congestion
  – Injection rate control (throttling)
    • The source temporarily throttles its send rate (timer based)
    • The original injection rate is restored over time
    • Congestion control may be performed per QP or per SL
• Pro-active: does not wait for packet drops to occur

Page 30: Multipathing Capability

• Similar to basic switching, except…
  – … the sender can utilize multiple LIDs associated with the same destination port
    • Packets sent to one DLID take a fixed path
    • Different packets can be sent using different DLIDs
    • Each DLID can have a different path (the switch can be configured differently for each DLID)
• Can cause out-of-order arrival of packets
  – IB uses a simplistic approach:
    • If packets in one connection arrive out-of-order, they are dropped
  – Easier to use different DLIDs for different connections

Page 31: Automatic Path Migration

• Automatically utilizes IB multipathing for network fault tolerance
• Enables migrating connections to a different path
  – Connection recovery in the case of failures
  – Optional feature
• Available for RC, UC and RD
• Reliability guarantees for the service type are maintained during migration

Page 32: Advanced Capabilities in IB

• Link layer
  – Congestion control
• Network layer
  – Multipathing capability and Automatic Path Migration
• Transport layer
  – Shared Receive Queues, eXtended Reliable Connection
  – Data segmentation, transaction ordering
  – Message-level end-to-end flow control
  – Static rate control and auto-negotiation
• Management tools

Page 33: Shared Receive Queues (SRQs)

[Diagram: one process with many QPs, each backed by its own receive queue, versus one process whose QPs all draw from a single shared receive queue.]

• A Shared Receive Queue allows multiple QPs to share a single receive queue
• Allows much better scalability for applications/libraries that pre-post to receive queues
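
A minimal sketch of setting up an SRQ with the verbs API; pd and cq are an existing protection domain and completion queue, r_wr/r_bad are receive work requests as in the Page 17 sketch, and error handling is omitted:

    #include <infiniband/verbs.h>

    /* Create one receive queue shared by all QPs */
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 1024, .max_sge = 1 }  /* depth of the shared pool */
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

    /* Any QP created with .srq set draws its receives from the SRQ */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq, .recv_cq = cq, .srq = srq,
        .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 128, .max_send_sge = 1 }
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

    /* Receive buffers are now posted to the SRQ rather than per QP */
    ibv_post_srq_recv(srq, &r_wr, &r_bad);

Because all connections share one buffer pool, the number of pre-posted buffers scales with the message arrival rate rather than with the number of QPs.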

Page 34: Flow Control with SRQs

• An SRQ has link-level flow control, but no end-to-end flow control
  – Problem: how will the sender know how many messages to send (buffers are shared by all receivers)?
    • Solution: the SRQ limit event functionality can be used to post additional buffers
• Limit event functionality
  – The receiver asks the network for an interrupt when fewer than some number of buffers are left (low watermark)
  – Problem: what if a burst of messages uses up the buffers before you can post more?
    • Solution: the sender will just retransmit (handled in hardware)
    • Can lose some performance, but this is a very rare case!

Page 35: Low Watermark Method

• Upon receiving a low-watermark notification (SRQ limit event), the receiver can post additional buffers

[Diagram: the receiver posts 10 buffers and sets a limit of 4; as senders 1 and 2 consume buffers, a limit event fires with 4 remaining, and a thread posts 6 more.]

• If the SRQ is exhausted, the send operation does not complete until the receiver has posted more buffers
  – The sender hardware gets a Receiver Not Ready (RNR NAK)
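
The low-watermark mechanism maps onto the verbs API as sketched below; srq and ctx are the SRQ and device context from before, and post_more_buffers is a hypothetical helper that reposts receives and re-arms the limit:

    /* Ask for an IBV_EVENT_SRQ_LIMIT_REACHED event when fewer
       than 4 receive buffers remain in the SRQ */
    struct ibv_srq_attr attr = { .srq_limit = 4 };
    ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);

    /* The event arrives on the device's asynchronous event channel */
    struct ibv_async_event ev;
    ibv_get_async_event(ctx, &ev);
    if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED)
        post_more_buffers(srq);  /* hypothetical: repost and re-arm */
    ibv_ack_async_event(&ev);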

Page 36: SRQ Buffer Usage

• Buffers in RQs and SRQs are consumed in the order in which messages are received
• Libraries such as MPI are forced to post fixed-size buffers, leading to potential waste

[Diagram: a 1 KB MPI message lands in a posted 8 KB buffer, leaving 7 KB unused.]

• Multiple SRQs, each with a different buffer size?

Page 37: Multiple SRQs

[Diagram: a process holding one set of per-peer QPs for an SRQ with 8 KB buffers and a second set for an SRQ with 4 KB buffers.]

• Each additional SRQ multiplies the number of QPs required
  – Each QP can only be associated with one SRQ
• Not a scalable solution, since many QPs are needed for each process

Page 38: eXtended Reliable Connection (XRC)

• Instead of connecting processes to processes, XRC connects a process to a node
  (M = number of nodes, P = number of processes per node)

[Diagram: processes on node A hold one QP per remote node B, C, D, i.e., M-1 per process.]

• Each process needs only a single QP to a node
• In the best case, M-1 QPs per process are needed for a fully-connected setup
• M*(M-1)*P QPs total, vs. M^2*P^2 - M*P for RC
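
A worked example of these counts: with M = 64 nodes and P = 8 processes per node, RC needs M^2*P^2 - M*P = 4096 * 64 - 512 = 261,632 QPs system-wide, while XRC needs at most M*(M-1)*P = 64 * 63 * 8 = 32,256.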

Page 39: XRC Addressing

• XRC uses SRQ numbers (SRQNs) to direct where an operation should complete

[Diagram: processes 0 and 1 on one node send to SRQ #1 and SRQ #2 of processes 2 and 3 on another node over shared connections.]

• The hardware does all routing of the data, so process 2 is not actually involved in a transfer destined for process 3's SRQ
• Connections are not bi-directional, so process 3 cannot send to process 0 over the same connection

Page 40: Multiple SRQs with XRC

[Diagram: with RC, a process needs one set of per-process QPs for each SRQ (8 KB and 4 KB buffer pools); with XRC, one set of per-node QPs can address both SRQs by SRQ number.]

• Addressing by SRQ number also allows multiple SRQs per process without additional memory resources
• Potential to use less memory in applications and libraries

Page 41: Advanced Capabilities in IB

• Link layer
  – Congestion control
• Network layer
  – Multipathing capability and Automatic Path Migration
• Transport layer
  – Shared Receive Queues, eXtended Reliable Connection
  – Data segmentation, transaction ordering
  – Message-level end-to-end flow control
  – Static rate control and auto-negotiation
• Management tools

Page 42: Data Segmentation

• The application can hand over a large message
  – The network adapter segments it into MTU-sized packets
  – A single notification is raised when the entire message is transmitted or received (not one per packet)
  – E.g., a 1 MB message over a 2 KB MTU becomes 512 packets on the wire, but still only one completion at each end
• Reduced host overhead to send/receive messages
  – Depends on the number of messages, not the number of bytes

Page 43: Transaction Ordering

• IB follows strong transaction ordering for RC
• The sender network adapter transmits messages in the order in which the WQEs were posted
• Each QP utilizes a single LID
  – All WQEs posted on the same QP take the same path
  – All packets are received by the receiver in the same order
  – All receive WQEs are completed in the order in which they were posted

Page 44: Message-level Flow Control

• Also called end-to-end flow control
  – Does not depend on the number of network hops
• Separate from link-level flow control
  – Link-level flow control relies only on the number of bytes being transmitted, not the number of messages
  – Message-level flow control relies only on the number of messages transferred, not the number of bytes
• If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs), as in the credit sketch below
  – If the sent messages are larger than the posted receive buffers, flow control cannot handle it
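
A sketch of how a communication library might track such message-level credits in software (purely illustrative; post_send_wqe is a hypothetical wrapper, and this is not the code of any particular MPI implementation):

    /* Sender-side credit accounting: one credit per receive WQE the
       peer has posted, independent of message size */
    typedef struct { int credits; } peer_state_t;

    int try_send(peer_state_t *peer, const void *msg, size_t len)
    {
        if (peer->credits == 0)
            return -1;            /* queue locally until credits return */
        peer->credits--;          /* consumes exactly one receive WQE */
        post_send_wqe(msg, len);  /* hypothetical verbs wrapper */
        return 0;
    }

    /* The receiver piggybacks the number of newly posted receive WQEs
       on return traffic; the sender adds them back */
    void on_credit_update(peer_state_t *peer, int newly_posted)
    {
        peer->credits += newly_posted;
    }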

Page 45: Static Rate Control and Auto-Negotiation

• IB allows link rates to be statically changed
  – On a 4X link, we can set data to be sent at the 1X rate
  – For heterogeneous links, the rate can be set to the lowest link rate
  – Useful for low-priority traffic
• Auto-negotiation is also available
  – E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at the 1X rate
• Only fixed settings are available
  – Cannot set the rate requirement to 3.16 Gbps, for example

Page 46: Advanced Capabilities in IB

• Link layer
  – Congestion control
• Network layer
  – Multipathing capability and Automatic Path Migration
• Transport layer
  – Shared Receive Queues, eXtended Reliable Connection
  – Data segmentation, transaction ordering
  – Message-level end-to-end flow control
  – Static rate control and auto-negotiation
• Management tools

Page 47: InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools

Page 48: Concepts in IB Management

• Agents
  – Processes or hardware units running on each adapter, switch and router (everything on the network)
  – Provide the capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages

Page 49: InfiniBand Management

• All IB management happens using packets called Management Datagrams
  – Popularly referred to as "MAD packets"
• Four major classes of management mechanisms
  – Subnet Management
  – Subnet Administration
  – Communication Management
  – General Services

Page 50: Subnet Management & Administration

• Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
  – Each adapter, switch and router has an agent running
  – Communication between the SM and agents, or between agents, happens using MAD packets called Subnet Management Packets (SMPs)
• The SM's responsibilities include:
  – Discovering the physical topology of the subnet
  – Assigning LIDs to the end nodes, switches and routers
  – Populating switches and routers with routing paths
  – Subnet sweeps to discover topology changes

Page 51: Subnet Manager

[Diagram: a fabric of compute nodes and switches with active and inactive links; the subnet manager processes multicast joins and performs the multicast setup on the switches.]

Page 52: Subnet Manager Sweep Behavior

• The SM can be configured to sweep once or continuously
• On the first sweep:
  – All ports are assigned LIDs
  – All routes are set up on the switches
• On subsequent sweeps:
  – If there has been any change to the topology, the appropriate routes are updated
  – If DLID X is down, the packet is not sent all the way
    • The first hop will not have a forwarding entry for LID X
• The sweep time is configured by the system administrator
  – Cannot be too high or too low

Page 53: Subnet Manager Scalability Issues

• A single subnet manager has issues on large systems
  – Performance and overhead of scanning
    • Hardware implementations on switches are faster, but work only for small systems (memory usage)
    • Software implementations are more popular (OpenSM)
  – Fault tolerance
    • There can be multiple SMs
    • During initialization only one should be active; once started, other SMs can handle different network portions
• Asynchronous events are specified to improve scalability
  – E.g., TRAPs are events sent by an agent to the SM when a link goes down

Page 54: Multicast Group Management

• Creating, joining/leaving and deleting multicast groups occur as SA requests
  – The requesting node sends a request to an SA
  – The SA sends MAD packets to SMAs on the switches to set up routes for the multicast packets
    • Each switch contains information on which ports to forward a multicast packet to
• Multicast traffic itself does not go through the subnet manager
  – Only the setup and teardown go through the SM

Page 55: General Services

• Several general service management features are provided by the standard
  – Performance Management
    • Several required and optional performance counters
    • Flow-control counters, RNR counters, numbers of sent and received packets
  – Hardware Management
    • Baseboard Management
    • Device Management
    • SNMP Tunneling
    • Vendor Specific
    • Application Specific

Page 56: InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools

Page 57: Tools to Analyze InfiniBand Networks

• Different types of tools exist:
  – High-level tools that internally talk to the subnet manager using management datagrams
  – Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
• Possible to write your own tools based on the management datagram interface
  – Several vendors provide such IB management tools

Page 58: Network Discovery Tools

• Starting with almost no knowledge about the system, we can identify several details of the network configuration
  – Example tools include:
    • ibhosts: finds all the network adapters in the system
    • ibswitches: finds all the network switches in the system
    • ibnetdiscover: finds the connectivity between the ports
    • … and many others exist
  – Possible to write your own tools based on the management datagram interface
    • Several vendors provide such IB management tools

Page 59: Discovering Network Adapters

% ibhosts
Ca : 0x0002c9020023c314 ports 2 " HCA-2"
Ca : 0x0002c9020023c05c ports 2 " HCA-2"
Ca : 0x0002c9020023c0e8 ports 2 " HCA-2"
Ca : 0x0002c9020023c178 ports 2 " HCA-2"
Ca : 0x0002c9020023c058 ports 2 " HCA-2"
Ca : 0x0002c9020023bffc ports 2 " HCA-2"
Ca : 0x0002c9020023c08c ports 2 "wci59"
Ca : 0x0011750000ffe01a ports 1 " HCA-1"
Ca : 0x0011750000ffe141 ports 1 " HCA-1"
Ca : 0x0011750000ffe1dd ports 1 " HCA-1"
Ca : 0x0011750000ffe079 ports 1 " HCA-1"
Ca : 0x0011750000ffe25c ports 1 " HCA-1"
Ca : 0x0002c9020023c318 ports 2 " HCA-2"
...

Each line shows the GUID of the adapter, the number of adapter ports, and the adapter description; 96 adapters are "online" in this fabric.

Page 60: Network Adapter Classification

% ibnetdiscover -H     /* Some parts snipped out */
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
...

The device ID and vendor ID classify the adapters: 59 Mellanox InfiniHost III adapters and 29 Mellanox ConnectX adapters (vendid 0x2c9), and 8 Qlogic adapters (vendid 0x1fc1).

Page 61: Discovering Network Switches

% ibswitches     /* Some parts snipped out */
Switch : ports 24 "SilverStorm 9120 Leaf 1, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 2, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 1, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 3, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 2, Chip B"
Switch : ports 24 "SilverStorm 9120 Spine 1, Chip B"
Switch : ports 24 "SilverStorm 9120 Spine 3, Chip B"
Switch : ports 24 "SilverStorm 9120 Leaf 8, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 2, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 4, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 12, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 6, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 10, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 7, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 3, Chip A"
...

Each line shows the switch vendor information and the number of ports per switch: 12 leaf switches, and 3 spine switches with 2 chips each.

Page 62: Discovering Network Connectivity

% ibnetdiscover     /* Some parts snipped out */
Switch 24 "S-00066a000700067c"  # "SilverStorm 9120 GUID=0x00066a00020001aa Leaf 1, Chip A" base port 0 lid 66 lmc 0
[24] # "SilverStorm 9120 Spine 2, Chip A" lid 104 4xDDR
[23] # "SilverStorm 9120 Spine 2, Chip A" lid 104 4xDDR
[22] # "SilverStorm 9120 Spine 3, Chip A" lid 100 4xDDR
[21] # "SilverStorm 9120 Spine 1, Chip A" lid 110 4xDDR
...
[12] "H-0002c9030001e5e6" # " HCA-1" lid 125 4xDDR
[11] "H-0002c9030001e3fa" # " HCA-1" lid 142 4xDDR
[10] "H-0002c9030000b0c4" # " HCA-1" lid 106 4xDDR
[9]  "H-0002c9030000b0c8" # " HCA-1" lid 108 4xDDR
[8]  "H-0002c9030001e5fa" # " HCA-1" lid 143 4xDDR
...

The output lists the connectivity of each switch port.

Page 63: Roughly Constructed Network Fabric

[Diagram: the reconstructed fabric, with three spine blocks (Spine 1 and Spine 2 managed, Spine 3 unmanaged, each with Chips A and B), twelve leaf switches (Leaf 1-12), fans (FAN1-4) and power supplies (PS1-6), and the end nodes: 8 Qlogic adapters, 59 Mellanox InfiniHost III adapters and 29 Mellanox ConnectX adapters.]

Page 64: InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools

Page 65: Overall Diagnostics

• Tools to query overall fabric health

[ib1 ]# ibdiagnet -r
...
STAGE                           Errors  Warnings
Bad GUIDs/LIDs Check            0       0
Link State Active Check         0       0
Performance Counters Report     0       0
Partitions Check                0       0
IPoIB Subnets Check             0       0
Subnet Manager Check            0       0
Fabric Qualities Report         0       0
Credit Loops Check              0       0
Multicast Groups Report         0       0

Page 66: End-node Adapter State

[ib1 ]# ibportstate 8 1
PortInfo:
# Port info: Lid 8 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
LinkSpeedActive:.................5.0 Gbps
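
The same port state is also available programmatically through the verbs interface; a sketch, assuming an open device context ctx:

    #include <infiniband/verbs.h>
    #include <stdio.h>

    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr) == 0) {
        /* pattr.state is e.g. IBV_PORT_ACTIVE; width and speed are
           encoded equivalents of the 4X / 5.0 Gbps fields above */
        printf("state=%d width_code=%u speed_code=%u\n",
               pattr.state, pattr.active_width, pattr.active_speed);
    }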

Page 67: End-node Adapter Counters

[ib1 ]# ibdatacounts 119 1
# Port counters: Lid 119 port 1
XmtData:.........................2102127705
RcvData:.........................2101904109
XmtPkts:.........................9069780
RcvPkts:.........................9068305

[ib1 ]# ibdatacounts 119 1
# Port counters: Lid 119 port 1
XmtData:.........................432
RcvData:.........................432
XmtPkts:.........................6
RcvPkts:.........................6

[ib1 ]# ibcheckerrs -v 20 1
Error check on lid 20 (ib12 HCA-2) port 1: OK

Page 68: InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools

Page 69: Network Switching and Routing

% ibroute -G 0x66a000700067c
Lid    Out   Destination
       Port  Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...

For example, packets to LID 0x0001 will be sent out on port 001.

Page 70: Static Analysis of Network Contention

[Diagram: leaf blocks and spine blocks of the fabric annotated with destination LIDs, showing which uplinks different flows share.]

• Based on destination LIDs and switching/routing information, the exact path of the packets can be identified
  – If the application communication pattern is known, we can statically identify possible network contention

Page 71: Dynamic Analysis of Network Contention

• IB provides many optional performance counters that can be queried
  – PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
  – RNR NAKs: number of times a message was sent but the receiver had not yet posted a receive buffer
    • This can time out, so it can be an error in some cases
  – PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
  – SWPortVLCongestion: number of packets dropped due to congestion
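
These counters can be read with the standard diagnostics tools; for example, perfquery from the infiniband-diags suite dumps the port counters for a given LID and port (output abbreviated and illustrative; which optional counters appear depends on the adapter and the tool version):

    % perfquery 119 1
    # Port counters: Lid 119 port 1
    ...
    PortXmitWait:....................<ticks spent waiting for credits>
    ...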

Page 72: Presentation Overview

• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The OpenFabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A

Page 73: Advanced Capabilities in iWARP/10GE

• Security in iWARP
• Multipathing support using VLANs (Ethernet feature)
• 64b/66b encoding standards
• Link aggregation (Ethernet feature)

Page 74: Security in iWARP

• iWARP was designed to be compliant with the Internet, while providing high performance
  – Security is an important consideration
  – E.g., it can be used in data centers where unknown clients communicate over iWARP with the server
• Multiple levels of security measures are specified by the standard (in practice, only a few are implemented)
  – Untrusted peer access model
  – Encrypted wire protocol
  – Information disclosure

Page 75: Untrusted Peer Access Model

• Single-access RDMA
  – The target exposes a memory region for a single protected access
  – The initiator performs three steps:
    • Writes data to the target location
    • Sends an invalidate STag message (all access to the target memory is removed)
    • Sends a verification key
  – The target performs two steps:
    • Uses the verification key to ensure the message was not tampered with
    • Marks the process as complete
  – The verification model is unspecified by the iWARP standard

Page 76: Encrypted Wire Protocol and Information Disclosure

• iWARP is built on top of TCP/IP, so all of its security protocols are directly usable by iWARP
  – The standard discusses IPsec, but does not specify it
    • Can be thought of as the "recommended mechanism"
  – Security capabilities are not directly accessible, except to turn them on or off
• Information disclosure
  – Peer-specific memory protection capabilities
    • E.g., only peer X can access this buffer
    • Only peer X can write to this buffer, and peer Y can read from it
    • Mixed modes (a buffer is readable by a subset of peers)

Page 77: Denial of Service Attacks

• Typical forms of DoS attack are when a peer negotiates resources (e.g., by opening a connection) but performs no real work
  – Especially difficult to handle when done on the network adapter (limited resources)
• The iWARP standard does not specify any solution for this; it leaves it to the TCP/IP layer to handle
  – E.g., using authentication, or terminating connections by monitoring usage
  – Recommends that communication be offloaded only after the authentication is done

Page 78: Advanced Capabilities in iWARP/10GE

• Security in iWARP
• Multipathing support using VLANs (Ethernet feature)
• 64b/66b encoding standards
• Link aggregation (Ethernet feature)

Page 79: VLAN-based Multipathing

• Ethernet basic switching
  – The network is broken down to a tree by disabling links
  – Pros: no live-locks and simple switching
  – Cons: a single path between nodes and wastage of links
• VLAN-based multipathing
  – Overlay many logical networks on one physical network
    • Each overlay network breaks down into a unique tree
    • Depending on which overlay network you send on, you get a different path
    • Adding nodes/links is simple; you just add a new overlay
    • Older overlays continue to work as before

Page 80: Example VLAN Configuration

• Basic Ethernet converts the example topology to a tree
  – Wastes four of the links
• The same topology can be considered as two different VLANs
  – All the links in the network are then utilized
• Can be used for:
  – High performance
  – Security (if someone should get access only to a part of the network)
  – Fault tolerance
• Supported by several switch vendors
  – Woven Systems, Cisco

Page 81: Advanced Capabilities in iWARP/10GE

• Security in iWARP
• Multipathing support using VLANs (Ethernet feature)
• 64b/66b encoding standards
• Link aggregation (Ethernet feature)

Page 82: 64b/66b Network Data Encoding

• All communication channels utilize data encoding
  – There can be an imbalance in the number of 1's and 0's in the data bytes being transmitted
    • Leads to a problem called DC balancing
    • Reduces signal integrity, especially for fast networks
  – Encoding converts the data into a format with a more even mix of 1's and 0's
  – E.g., 10GE has 12.5 Gbps signaling (10 Gbps × 10/8 under 8b/10b); same for Myrinet, Quadrics, GigaNet cLAN and most other networks
• 8b/10b encoding
  – Pros: better signal integrity, so fewer retransmits
  – Cons: more bits sent over the wire (20% overhead)
• 64b/66b has the same benefit, but much lower overhead (2 extra bits per 64 instead of per 8)

Page 83: Designing High-End Computing Systems with …...Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet Dhabaleswar K. (DK) Panda A Tutorial at Supercomputing

Link Aggregation

• Link aggregation allows multiple links to logically look like a single faster link
– Done at the hardware level
– Several multi-port network adapters allow for packet sequencing to avoid out-of-order packets
– Both 64b/66b encoding and link aggregation are mainly driven by the need to conserve power
• More data rate, but not necessarily a higher clock speed


Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The OpenFabrics Software Stack

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A


OpenFabrics

• www.openfabrics.org

• Open source organization (formerly OpenIB)
• Incorporates both IB and iWARP in a unified manner
• Focusing on effort for open-source IB and iWARP support for Linux and Windows
• Design of complete software stack with `best of breed' components
– Gen1
– Gen2 (current focus)
• Users can download the entire stack and run
– Latest release is OFED 1.4.2
– OFED 1.5 is being worked out


OpenFabrics Software Stack

[Stack diagram: application-level users (clustered DB access, sockets-based apps, various MPIs, file-system access, block storage, IP-based apps) sit on the OpenFabrics user-level verbs/API (with the SDP library and UDAPL); kernel-space upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems) sit on a common mid-layer (connection manager abstraction (CMA), connection managers, SA client, MAD, SMA) over the OpenFabrics kernel-level verbs/API; hardware-specific drivers target InfiniBand HCAs and iWARP R-NICs; data-path accesses can bypass the kernel]

Legend: SA – Subnet Administrator; MAD – Management Datagram; SMA – Subnet Manager Agent; PMA – Performance Manager Agent; IPoIB – IP over InfiniBand; SDP – Sockets Direct Protocol; SRP – SCSI RDMA Protocol (Initiator); iSER – iSCSI RDMA Protocol (Initiator); RDS – Reliable Datagram Service; UDAPL – User Direct Access Programming Lib; HCA – Host Channel Adapter; R-NIC – RDMA NIC


Programming with OpenFabrics

[Diagram: sender and receiver processes, each with its kernel and HCA; data moves HCA-to-HCA]

Sample Steps
1. Create QPs (endpoints)
2. Register memory for sending and receiving
3. Send
– Channel semantics
• Post receive
• Post send
– Memory semantics
• Post RDMA operation


Verb Steps

• Open HCA and create QPs to end nodes

– Can be done with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication
• Register memory

struct ibv_mr *mr_handle = ibv_reg_mr(pd, buffer, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
        IBV_ACCESS_REMOTE_READ);

• Permissions can be set as needed


Verbs: Post Receive

• Prepare and post receive descriptor (channel semantics)

struct ibv_recv_wr *bad_wr;
struct ibv_recv_wr rr;
struct ibv_sge sg_entry;

rr.next = NULL;
rr.wr_id = 0;
rr.num_sge = 1;
rr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;        /* local buffer address */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;        /* memory handle */

ret = ibv_post_recv(qp, &rr, &bad_wr);          /* post to QP */
ret = ibv_post_srq_recv(srq, &rr, &bad_wr);     /* post to SRQ */


Verbs: Post Send

• Prepare and post send descriptor (channel semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);


Verbs: Post RDMA Write

• Prepare and post RDMA write (memory semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE;          /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.wr.rdma.remote_addr = remote_addr;   /* remote virtual addr. */
sr.wr.rdma.rkey = rkey;                 /* from remote node */
sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;        /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
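The posted sends, receives and RDMA writes complete asynchronously; a minimal sketch (not on the original slides) of reaping completions from the CQ created earlier:

#include <stdio.h>
#include <infiniband/verbs.h>

struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);     /* non-blocking poll for one completion */
} while (n == 0);                    /* busy-wait; event channels also possible */
if (n < 0 || wc.status != IBV_WC_SUCCESS)
    fprintf(stderr, "work request %llu failed: %s\n",
            (unsigned long long) wc.wr_id, ibv_wc_status_str(wc.status));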


Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The OpenFabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A


Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)

• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)

• File Systems

• Multi-tier Data Centers

• Virtualization


Designing MPI Using IB/iWARP Features

[Diagram: MPI design components – protocol mapping, buffer management, flow control, connection management, communication progress, collective communication, multi-rail support, one-sided active/passive, substrate – built over IB and Ethernet features: RDMA operations, send/receive, unreliable datagram, atomic operations, multicast, shared receive queues, static and dynamic rate control, out-of-order placement, end-to-end flow control, QoS, multi-path (LMC), multi-path (VLANs)]


MVAPICH/MVAPICH2 Software

• High Performance MPI Library for IB and 10GE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 975 organizations in 51 countries
– More than 32,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 8th ranked 62,976-core cluster (Ranger) at TACC
– Available with software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
– Also supports a uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/


MPICH2 Software Stack

• High-performance and Widely Portable MPI
– Supports MPI-1, MPI-2 and MPI-2.1
– Supports multiple networks (TCP, IB, iWARP, Myrinet)
– Commercial support by many vendors
• IBM (integrated stack distributed by Argonne)
• Microsoft, Intel (in process of integrating their stack)
– Used by many derivative implementations
• E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom
• MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)
– Available with many software distributions
– Integrated with the ROMIO MPI-IO implementation and the MPE profiling library
– http://www.mcs.anl.gov/research/projects/mpich2


Design Challenges and Sample Results

• Interaction with Multi-core Environments
– Communication Characteristics on Multi-core Systems
– Protocol Processing Interactions

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability


MPI Bandwidth on ConnectX with Multicore

[Figure: multi-stream bandwidth (MillionBytes/sec) vs. message size for 1, 2, 4 and 8 pairs – up to a 5-fold performance difference]

S. Sur, M. J. Koop, L. Chai and D. K. Panda, “Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms”, IEEE Hot Interconnects, 2007


Design Challenges and Sample Results

• Interaction with Multi-core Environments
– Communication Characteristics on Multi-core Systems
– Protocol Processing Interactions

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability


Analyzing Interrupts and Cache Misses

[Figure: hardware interrupts per message, and percentage difference in L2 cache misses, vs. message size (1 byte to 2 MB), measured separately for cores 0-3]


MPI Performance on Different Cores

[Figure: MPI bandwidth (Mbps) vs. message size (1 byte to 2 MB) on Intel and AMD platforms, for cores 0-3]


Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability


Hot-Spot Avoidance with MVAPICH

• Deterministic nature of IB routing leads to network hot-spots
• Responsibility of path utilization is up to the MPI library
• We design HSAM (Hot-Spot Avoidance MVAPICH) to alleviate this problem

[Figure: NAS FT execution time (seconds) at 32x1 and 64x1 processes, original vs. HSAM with 4 paths]

A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula and D. K. Panda , “Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective”, CCGrid ’07 (Best Paper Nominee)


Network Congestion with Multi-Cores

[Diagram: timelines of protocol processing vs. network communication under a faster processor, a faster network, and multi-cores; caption: Network Usage]


Communication Burstiness

[Figure: communication performance – throughput (Mbps) per configuration; network congestion – RX/TX pause frames per configuration]

A. Shah, B. N. Bryant, H. Shah, P. Balaji and W. Feng, “Switch Analysis of Network Congestion for Scientific Applications” (under preparation)


Out-of-Order (OoO) Packets

• Multi-path communication supported by many networks
– IB, 10GE: Hardware feature!
– Can cause OoO packets that the protocol stack has to handle
• Simple approach by most protocols (e.g., IB): drop & retransmit
• Not good for large-scale systems
– iWARP specifies a more graceful approach
• Out-of-order placement of data
• Overhead of out-of-order packets should be minimized

[Diagram: packets 1-4 sent in order arrive out of order over multiple paths]


Issues with Out-of-Order Packets

[Diagram: iWARP packets (packet header + iWARP header + data payload) can be re-segmented by an intermediate switch into partial payloads; for a delayed or out-of-order packet the receiver cannot identify the iWARP header]


Handling Out-of-Order Packets in iWARP

[Diagram: markers inserted into the RDMAP/DDP packet stream (DDP header, payload, pad, CRC, marker segments) let the receiver locate iWARP headers; marker processing can be host-based, host-offloaded (NIC), or host-assisted]

• Packet structure becomes overly complicated
• Performing it in hardware is no longer straightforward!


Overhead of Supporting OoO Packets

[Figure: iWARP latency (us) and bandwidth (Mbps) vs. message size for host-offloaded, host-based, host-assisted and in-order iWARP]


Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication
– IB Multicast based MPI Broadcast
– Shared memory aware collectives

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability


MPI Broadcast Using IB Multicast

[Figure: broadcast latency (us) vs. message size (1 byte to 4 KB), multicast vs. point-to-point; actual measurements at 64 processes and an analytical model at 1024 processes]
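At the verbs level, hardware multicast is used by attaching a UD QP to a multicast group; a minimal sketch, assuming the group GID/LID have already been obtained from the subnet manager (e.g., via an SA join request):

#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

union ibv_gid mgid;   /* multicast group GID, filled in from the SA join (assumed) */
uint16_t mlid = 0;    /* multicast group LID, also from the SA join (assumed) */

if (ibv_attach_mcast(qp, &mgid, mlid))   /* 'qp' must be a UD QP (IBV_QPT_UD) */
    perror("ibv_attach_mcast");
/* Sends to the group are then ordinary UD sends addressed to the multicast
   address handle; detach with ibv_detach_mcast() when leaving the group. */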


Shared-memory Aware Collectives
(4K cores on TACC Ranger with MVAPICH2)

[Figure: MPI_Reduce and MPI_Allreduce latency (us) vs. message size at 4096 cores, original vs. shared-memory aware designs]


Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems
– Memory Efficient Communication

• Fault Tolerance

• Quality of Service

• Application Scalability


Memory Utilization using Shared Receive Queues

[Figure: MPI_Init memory utilization (MB) measured for 2-32 processes, and an analytical model (GB) up to 16,384 processes, for MVAPICH-RDMA, MVAPICH-SR and MVAPICH-SRQ]

• SRQ consumes only 1/10th the memory of RDMA for 16,000 processes
• Send/Recv exhausts the buffer pool after 1000 processes; consumes 2X the memory of SRQ for 16,000 processes

S. Sur, L. Chai, H. –W. Jin and D. K. Panda, “Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters”, IPDPS 2006
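At the verbs level, a single SRQ is created once and shared by many QPs; a minimal sketch (the queue sizes are illustrative assumptions):

#include <string.h>
#include <infiniband/verbs.h>

struct ibv_srq_init_attr srq_attr;
memset(&srq_attr, 0, sizeof(srq_attr));
srq_attr.attr.max_wr    = 512;    /* receive buffers shared by all QPs */
srq_attr.attr.max_sge   = 1;
srq_attr.attr.srq_limit = 64;     /* low-watermark for the SRQ limit event */
struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

struct ibv_qp_init_attr qp_attr;  /* attach the SRQ at QP-creation time */
memset(&qp_attr, 0, sizeof(qp_attr));
qp_attr.send_cq = cq;
qp_attr.recv_cq = cq;
qp_attr.srq     = srq;            /* all receives draw from the shared pool */
qp_attr.qp_type = IBV_QPT_RC;
qp_attr.cap.max_send_wr  = 64;
qp_attr.cap.max_send_sge = 1;
struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
/* Buffers are replenished with ibv_post_srq_recv(), as shown earlier. */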


Communication Buffer Memory Utilization with NAMD (apoa1)

[Figure: normalized memory usage (MB) and performance for ARDMA-SR, ARDMA-SRQ and SRQ at 16, 32 and 64 processes]

Profile: avg. RDMA channels 53.15; avg. low watermarks 0.03; unexpected msgs 48.2%; total messages 3.7e6; MPI time 23.54%

• 50% of messages < 128 bytes, the other 50% between 128 bytes and 32 KB
– 53 RDMA connections set up for the 64-process experiment
• SRQ channel takes 5-6 MB of memory
– Memory needed by SRQ decreases by 1 MB going from 16 to 64 processes

S. Sur, M. Koop and D. K. Panda, “High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis”, SC ‘06


UD vs. RC: Performance and Scalability (SMG2000 Application)

[Figure: normalized execution time, RC vs. UD, for 128-4096 processes]

Memory Usage (MB/process):
            RC (MVAPICH 0.9.8)               UD Design
Processes   Conn.  Buffers  Struct.  Total   Buffers  Struct.  Total
512         22.9   65.0     0.3       88.2   37.0     0.2      37.2
1024        29.5   65.0     0.6       95.1   37.0     0.4      37.4
2048        42.4   65.0     1.2      107.4   37.0     0.9      37.9
4096        66.7   65.0     2.4      134.1   37.0     1.7      38.7

• Large number of peers per process (992 at maximum)
• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, “High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters,” ICS ‘07
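A UD endpoint differs from RC mainly in that one QP talks to any peer and each send names its destination; a minimal sketch (not from the original slides; the Q_Key value is an illustrative assumption):

#include <string.h>
#include <infiniband/verbs.h>

struct ibv_qp_init_attr attr;
memset(&attr, 0, sizeof(attr));
attr.send_cq = cq;
attr.recv_cq = cq;
attr.qp_type = IBV_QPT_UD;           /* unreliable datagram transport */
attr.cap.max_send_wr  = 64;
attr.cap.max_recv_wr  = 64;
attr.cap.max_send_sge = 1;
attr.cap.max_recv_sge = 1;
struct ibv_qp *qp = ibv_create_qp(pd, &attr);

struct ibv_send_wr sr;               /* a UD send names its destination per WR */
memset(&sr, 0, sizeof(sr));
sr.wr.ud.ah          = ah;           /* address handle from ibv_create_ah() */
sr.wr.ud.remote_qpn  = remote_qpn;   /* peer QP number (exchanged out-of-band) */
sr.wr.ud.remote_qkey = 0x11111111;   /* agreed Q_Key (illustrative) */
/* ... remaining fields filled in as on the earlier post-send slide ... */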


eXtended Reliable Connection (XRC)

[Figure: memory usage (MB/process) vs. number of connections (1 to 16K) for RC, XRC (8-core) and XRC (16-core); normalized NAMD time at 64 cores, RC vs. XRC, on the apoa1, er-gre, f1atpase and jac datasets]

• Memory usage for 32K processes with 16 cores per node can be 30 MB/process (for connections)
• Performance for NAMD can increase when there is frequent communication to many peers (HCA cache misses go down)

M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable Connection,” Cluster ‘08


Hybrid Transport Design (UD/RC/XRC)

• Both UD and RC/XRC have benefits

• Evaluate characteristics of all of them and use two sets of transports in the same application – get the best of both


M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ‘08


Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance
– Network Faults: Automatic Path Migration
– Process Fault: Checkpoint-Restart

• Quality of Service

• Application Scalability


Fault Tolerance

• Component failures are common in large-scale clusters

• Imposes a need for reliability and fault tolerance
• Working along the following three angles:
– Reliable Networking with Automatic Path Migration (APM) utilizing redundant communication paths (available since MVAPICH 1.0 and MVAPICH2 1.0)
– Process Fault Tolerance with Efficient Checkpoint and Restart (available since MVAPICH2 0.9.8)
– End-to-end Reliability with memory-to-memory CRC (available since MVAPICH 0.9.9)


Network Fault-Tolerance with APM

• Network Fault Tolerance using InfiniBand Automatic Path Migration (APM)

– Utilizes Redundant Communication Paths

• Multiple Ports

• LMC

• Supported in OFED 1.2

A. Vishnu, A. Mamidala, S. Narravula and D. K. Panda, “Automatic Path Migration over InfiniBand: Early Experiences”, Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IPDPS ‘07
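At the verbs level, APM is enabled by loading an alternate path into a connected QP and arming it; a minimal sketch (the port number, timeout and DLID values are illustrative assumptions):

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(attr));
attr.path_mig_state       = IBV_MIG_ARMED;  /* arm migration to the alt path */
attr.alt_port_num         = 2;              /* assumed redundant port */
attr.alt_timeout          = 14;
attr.alt_ah_attr.dlid     = alt_dlid;       /* alternate DLID (e.g., via LMC) */
attr.alt_ah_attr.port_num = 2;

if (ibv_modify_qp(qp, &attr, IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE))
    perror("ibv_modify_qp");
/* On a path failure the HCA migrates to the alternate path in hardware;
   software can also force it with path_mig_state = IBV_MIG_MIGRATED. */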


APM Performance Evaluation

[Figure: NAS FT Class B and LU Class B execution time (seconds) vs. QPs per process (8-64), for the Original, Armed, Armed-Migrated and Network Fault cases]


Checkpoint-Restart Support in MVAPICH2

• Process-level Fault Tolerance

– User-transparent, system-level checkpointing
– Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and individual processes
– Designed novel schemes to
• Coordinate all MPI processes to drain all in-flight messages in IB connections
• Store communication state & buffers while checkpointing
• Restart from the checkpoint
• System-level checkpoint can be initiated from the application (available since MVAPICH2 1.0)


Checkpoint-Restart Performance with PVFS2

NAS LU Class C, 32x1 (Storage: 8 PVFS2 servers on IPoIB)

[Figure: execution time (seconds) vs. number of checkpoints taken, from no checkpoint up to 4 ckpts (avg 20 sec interval)]

Q. Gao, W. Yu, W. Huang and D.K. Panda, “Application-Transparent Checkpoint/Restart for MPI over InfiniBand”, ICPP ‘06


Enhancing CR Performance with I/O Aggregation for Multi-core Systems

[Figure: time to take one checkpoint (ms) for LU.C.64, SP.C.64, BT.C.64 and EP.C.64, original BLCR vs. node-level write aggregation, with speedups from 1.67 up to 13.08]

• 64 MPI processes on 4 nodes, 16 processes/node
• Checkpoint data is written to local disk files

X. Ouyang, K. Gopalakrishnan, D. K. Panda, Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems, Int'l Conference on Parallel Processing (ICPP '09), Sept. 2009.


Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability


QoS in IB

• IB is capable of providing network-level differentiated service – QoS
• Uses Service Levels (SL) and Virtual Lanes (VL) to classify traffic

[Diagram: IB HCA buffer organization – a common buffer pool feeds multiple virtual lanes, and a virtual lane arbiter multiplexes them onto the IB link]
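At the verbs level, a connection is placed on a service level through the address vector used when moving the QP to RTR; a minimal sketch (the SL value is an illustrative assumption; the SL-to-VL mapping itself is configured at the subnet manager):

#include <string.h>
#include <infiniband/verbs.h>

struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(attr));
attr.qp_state           = IBV_QPS_RTR;
attr.path_mtu           = IBV_MTU_2048;
attr.dest_qp_num        = remote_qpn;   /* exchanged out-of-band */
attr.rq_psn             = remote_psn;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer      = 12;
attr.ah_attr.dlid       = remote_lid;
attr.ah_attr.sl         = 2;            /* service level for this job (assumed) */
attr.ah_attr.port_num   = 1;

ret = ibv_modify_qp(qp, &attr,
          IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
          IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);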


Inter-Job Quality of Service

[Figure: latency (us) vs. message size for small and large messages, comparing the No-Traffic, No-QoS and QoS cases]

• Can differentiate between multiple jobs


Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability


Performance of HPC Applications on TACC Ranger using MVAPICH + IB

• Rob Farber's facial recognition application was run up to 60K cores using MVAPICH

• Ranges from 84% of peak at low end to 65% of peak at high end

http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber


3DFFT based Computations

• Internally utilize sequential 1D FFT libraries and perform data grid transforms to collect the required data
– Example implementations: P3DFFT, FFTW
– 3D volume of data divided amongst a 2D grid of processes
– Grid transpose: MPI_Alltoallv across the row and column
• Each process communicates with all processes in its row (a sketch follows below)
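A minimal sketch of the transpose step in MPI (an illustration, not taken from P3DFFT or FFTW): each process exchanges its slab pieces with every process in its row communicator.

#include <mpi.h>

/* my_row/my_col come from the 2D process-grid decomposition (assumed). */
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);

/* sendcounts/sdispls/recvcounts/rdispls describe the (generally uneven)
   slab pieces destined for each row peer; filled in by the decomposition. */
MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
              recvbuf, recvcounts, rdispls, MPI_DOUBLE, row_comm);
MPI_Comm_free(&row_comm);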


Performance of HPC Applications on TACC Ranger: DNS/Turbulence

Courtesy: P.K. Yeung, Diego Donzis, TG 2008


Application Example: Blast Simulations

• Researchers from the University of Utah have developed a simulation framework called Uintah
• Combines advanced mechanical, chemical and physical models into a novel computational framework
• Have run > 32K MPI tasks on Ranger
• Uses asynchronous communication

http://www.tacc.utexas.edu/news/feature-stories/2009/explosive-science/
Courtesy: J. Luitjens, M. Berzins, Univ of Utah


Application Example: OMEN

• OMEN is a two- and three-dimensional Schrodinger-Poisson solver
• Used in semiconductor modeling
• Run to almost 60K tasks

Courtesy: Mathieu Luisier, Gerhard Klimeck, Purdue
http://www.tacc.utexas.edu/RangerImpact/pdf/Save_Our_Semiconductors.pdf


Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)

• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)

• File Systems

• Multi-tier Data Centers

• Virtualization


IPoIB vs. SDP Architectural Models

[Diagram: traditional model – sockets application over the sockets API, a kernel TCP/IP sockets provider and TCP/IP transport driver down to the InfiniBand CA; possible SDP model – the same sockets application routed through the Sockets Direct Protocol with kernel bypass and RDMA semantics to the InfiniBand CA]

(Source: InfiniBand Trade Association 2002)
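Because SDP preserves byte-stream socket semantics, an unmodified TCP client like the sketch below can be moved onto IB without source changes (with OFED's libsdp this is typically done by preloading the library); the server address and port here are illustrative assumptions:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int fd = socket(AF_INET, SOCK_STREAM, 0);           /* ordinary TCP socket */
struct sockaddr_in srv = { .sin_family = AF_INET,
                           .sin_port   = htons(5001) };  /* assumed port */
inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);       /* assumed server */
connect(fd, (struct sockaddr *) &srv, sizeof(srv));
write(fd, "hello", 5);
close(fd);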


SDP vs. IPoIB (IB QDR)

[Figure: unidirectional and bidirectional bandwidth (MBps) and latency (us) vs. message size for IPoIB-RC, IPoIB-UD and SDP]

• SDP enables high bandwidth (up to 15 Gbps) and low latency (6.6 µs)


Flow-Control in IB

• Previous implementations of high-speed sockets (such as SDP) were on other networks
– Implemented flow-control in software
• IB provides end-to-end message-level flow-control in hardware
– Benefits: Asynchronous progress (i.e., the SDP stack does not need to keep waiting for the receiver to be ready; hardware will take care of it)
– Issues:
• No intelligent software coalescing
• Does not handle buffer overflow


Flow-control Performance

[Figure: latency (us) and bandwidth (Mbps) vs. message size for credit-based, RDMA-based and NIC-assisted flow control]

P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp, Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand, ICPP, Sep 2007


Application Evaluation

[Figure: execution time (s) for iso-surface rendering (dataset dimensions 1024-8192) and virtual microscope (512-4096) under credit-based, RDMA-based and NIC-assisted flow control]


Designing High-End Computing S t ith IB d iWARP

• Message Passing Interface (MPI)

Systems with IB and iWARP

• Message Passing Interface (MPI)

• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)( ) ( )

• File Systems

• Multi-tier Data Centers

• Virtualization

Supercomputing '09

Page 142: Designing High-End Computing Systems with …...Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet Dhabaleswar K. (DK) Panda A Tutorial at Supercomputing

Lustre Performance

[Figure: write and read throughput (MBps) with 4 OSSs vs. number of clients (1-4), IPoIB vs. native IB]

• Lustre over Native IB
– Write: 1.38X faster than IPoIB; Read: 2.16X faster than IPoIB
• Memory copies in IPoIB and Native IB
– Reduced throughput and high overhead; I/O servers are saturated


CPU Utilization

[Figure: CPU utilization (%) vs. number of clients (1-4) for IPoIB and native IB reads and writes]

• 4 OSS nodes, IOzone record size 1 MB
• Offers potential for greater scalability


Can we enhance NFS Performance using IB RDMA?

• Many enterprise environments use NFS
– IETF current major revision is NFSv4; NFSv4.1 deals with pNFS
• Current systems use Ethernet with TCP/IP; can they use IB?
– Metadata-intensive workloads are latency sensitive
– Need throughput for large transfers and OLTP-type workloads
• NFS over RDMA standard has been proposed
– Designed and implemented this on InfiniBand in OpenSolaris
• Taking advantage of RDMA mechanisms in InfiniBand
• Design works for NFSv3 and NFSv4
• Interoperable with Linux NFS/RDMA
– NFS over RDMA design incorporated into OpenSolaris by Sun
• Ongoing work for pNFS (NFSv4.1)
– Joint work with Sun and NetApp
– http://nowlab.cse.ohio-state.edu/projects/nfs-rdma/index.html


NFS/RDMA Performance

[Figure: IOzone write and read throughput (MB/s) on tmpfs vs. number of threads (1-15) for the Read-Read and Read-Write designs]

• IOzone read bandwidth up to 913 MB/s (Sun x2200's with x8 PCIe)
• Read-Write design by OSU, available with the latest OpenSolaris
• NFS/RDMA is being added into OFED 1.4

R. Noronha, L. Chai, T. Talpey and D. K. Panda, “Designing NFS With RDMA For Security, Performance and Scalability”, ICPP ‘07


Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)

• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)

• File Systems

• Multi-tier Data Centers

• Virtualization


Enterprise Datacenter Environments

[Diagram: clients reach a multi-tier data center over the WAN – proxy/web servers (Apache), application servers (PHP), database servers (MySQL) and storage; computation and communication requirements grow toward the back end]

• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
– If not cached, the request is forwarded to the next tier: the Application Server
• The application server performs the business logic (CGI, Java servlets)
– Retrieves appropriate data from the database to process the requests


Proposed Architecture for Enterprise Multi-tier Data-centers

[Diagram: existing data-center components supported by advanced system services (active caching, cooperative caching, dynamic reconfiguration, resource monitoring – for dynamic content caching and active resource adaptation); data-center service primitives (soft shared state, distributed lock manager, global memory aggregator, point-to-point services); a distributed data sharing substrate; and communication protocols and subsystems (Sockets Direct Protocol, RDMA, atomics, multicast, protocol offload, RDMA-based flow control, async zero-copy communication) over the network]


Data-Center Response Time with SDP (Internet Proxies)

[Figure: response time (ms) vs. message size (32 KB to 2 MB), IPoIB vs. SDP]


Cache Coherency and Consistency with Dynamic Data

[Diagram: example of strong cache coherency – never send stale data from the proxy node's cache relative to the back-end data; example of strong cache consistency – always follow an increasing time line of events across user requests #1, #2 and #3]


Active Polling: An Approach for Strong Cache Coherency

[Diagram: on each request, the proxy issues an RDMA Read to the back-end to validate cached data before serving a cache hit or handling a miss; the RDMA-based check avoids software overhead on the back-end]
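A minimal verbs sketch of the active-polling check: an RDMA Read of a (hypothetical) version word exposed by the back-end. All names here are illustrative assumptions:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_send_wr sr, *bad_wr;
struct ibv_sge sge;
memset(&sr, 0, sizeof(sr));
sge.addr   = (uintptr_t) &local_version;       /* registered local copy */
sge.length = sizeof(local_version);
sge.lkey   = mr_handle->lkey;
sr.opcode     = IBV_WR_RDMA_READ;              /* one-sided; no back-end CPU */
sr.send_flags = IBV_SEND_SIGNALED;
sr.num_sge    = 1;
sr.sg_list    = &sge;
sr.wr.rdma.remote_addr = remote_version_addr;  /* back-end version word */
sr.wr.rdma.rkey        = remote_rkey;
ret = ibv_post_send(qp, &sr, &bad_wr);
/* After completion: version unchanged -> serve the cache hit;
   version changed -> treat as a miss and refetch from the back-end. */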


Strong Cache Coherency with RDMA

[Figure: data-center throughput (transactions per second) and response time (ms) vs. number of compute threads (0-200) for No Cache, IPoIB and RDMA]

• RDMA can sustain performance even with heavy load on the back-end

S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda, “Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand”, SAN ‘04


Advanced System Services over IB

• Cache coherency is one of the possible enhancements IB can provide for enterprise datacenter environments
• Other concepts:
– System load monitoring
• RDMA-based monitoring of the load on each node (the kernel writes this information to a memory location; read it using RDMA)
• Multicast capabilities can help spread such information quickly to multiple processes (reliability not very important)
– Load Balancing for Performance and QoS
• Asynchronously update forwarding tables to decide how many machines serve the content for a given website


http://nowlab.cse.ohio-state.edu/projects/data-centers


Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)

• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)

• File Systems

• Multi-tier Data Centers

• Virtualization


Current I/O Virtualization Approaches

[Diagram: applications in Dom0 and a guest VM, with guest, backend and privileged modules; I/O flows through the VMM to the device]

• I/O in VMM (e.g. VMware ESX)
– Device drivers hosted in the VMM
– I/O operations always trap into the VMM
– VMM ensures safe device sharing among VMs
• I/O in a special VM
– Device drivers are hosted in a special (privileged) VM
– I/O operations always involve the VMM and the special VM
– E.g.: Xen and VMware Workstation


From OS-bypass to VMM-bypass

[Diagram: guest VMs with guest modules, backend and privileged modules below the VMM, and the device; privileged access goes through the VMM while VMM-bypass access reaches the device directly]

• Guest modules in guest VMs handle setup and management (privileged access)
– Guest modules communicate with backend modules to get jobs done
– The original privileged module can be reused
• Once set up, devices are accessed directly from guest VMs (VMM-bypass)
– Either from the OS kernel or from applications
• Backend and privileged modules can also reside in a special VM


Xen Overhead with VMM Bypass

[Figure: one-way latency (us) and unidirectional bandwidth (MillionBytes/sec) vs. message size, Xen vs. native]

• Only VMM-bypass operations are used (MVAPICH implementation)
• Xen-IB performs similar to native InfiniBand

W. Huang, J. Liu, B. Abali, D. K. Panda. “A Case for High Performance Computing with Virtual Machines”, ICS ’06


Optimizing VM Migration through RDMA

[Diagram: a VM and helper process on the source physical host transfer machine states to pre-allocated resources on the target physical host]

Live VM migration:
• Step 1: Pre-allocate resources on the target host
• Step 2: Pre-copy machine states for multiple iterations
• Step 3: Suspend the VM and copy the latest updates to machine states
• Step 4: Restart the VM on the new host


Fast Migration over RDMA

[Figure: NAS benchmark execution time (s) during migration for SP, BT, FT, LU, EP and CG with native, IPoIB and RDMA; effective bandwidth (MB/s) and CPU utilization (%) during migration of SP.A.9, BT.A.9, FT.B.8, LU.A.8, EP.B.9 and CG.B.8, RDMA vs. IPoIB]

• Disable one physical CPU on the nodes
• Migration overhead with IPoIB drastically increases
• RDMA achieves higher migration performance with less CPU usage

W. Huang, Q. Gao, J. Liu, D. K. Panda. “High Performance Virtual Machine Migration with RDMA over Modern Interconnects”, Cluster ’07 (Selected as a Best Paper)


Summary of Design Issues and Results

• Current generation IB and 10GE/iWARP adapters and software environments are already delivering competitive performance compared to other interconnects
• IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes
• Significant performance improvement is expected in the near future


Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The OpenFabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A


Concluding Remarks

• Presented networking requirements for HEC Clusters

• Presented advanced features of IB and 10GE
• Discussed the OpenFabrics stack and its usage
• Discussed design issues, challenges, and the state of the art in designing various high-end systems with IB and 10GE
• IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions


Funding Acknowledgments

Our research is supported by the following organizations

• Funding support by

• Equipment support by


Personnel Acknowledgments

Current Students: M. Kalaiya (M.S.), K. Kandalla (M.S.), P. Lai (Ph.D.), M. Luo (Ph.D.), G. Marsh (Ph.D.), X. Ouyang (Ph.D.), S. Potluri (M.S.), H. Subramoni (Ph.D.)

Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (Ph.D.), J. Liu (Ph.D.), A. Mamidala (Ph.D.), S. Narravula (Ph.D.), R. Noronha (Ph.D.), G. Santhanaraman (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Current Post-Doc: E. Mancini

Current Programmer: J. Perkins


Web Pointers

http://www.cse.ohio-state.edu/~panda
http://www.mcs.anl.gov/~balaji
http://www.cse.ohio-state.edu/~koop
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu

[email protected]
[email protected]
[email protected]