TRANSCRIPT
Interconnect Your Future
Rich Graham
February 2016, HPCAC Stanford Conference
© 2015 Mellanox Technologies
The Ever Growing Demand for Higher Performance
[Timeline, 2000–2020: Terascale ("Roadrunner", 1st on the TOP500) through Petascale (2015) toward Exascale; single-core to many-core, SMP to clusters. Performance development rests on co-design of hardware, software, and application.]
The Interconnect is the Enabling Technology
© 2015 Mellanox Technologies 3
Co-Design Architecture to Enable Exascale Performance
CPU-centric design is limited to main CPU usage and results in performance limitations.
Co-design creates synergies across software, in-CPU computing, in-network computing, and in-storage computing, enabling higher performance and scale.
The Intelligence is Moving to the Interconnect
[Diagram: intelligence shifting from the CPU (past) to the interconnect (future).]
Breaking the Application Latency Wall
Today: Network device latencies are on the order of 100 nanoseconds
Challenge: Enabling the next order of magnitude improvement in application performance
Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
10 years ago: ~10 microsecond network, ~100 microsecond communication framework
Today: ~0.1 microsecond network, ~10 microsecond communication framework
Future: ~0.05 microsecond co-design network, ~1 microsecond communication framework
Co-Design: Offloaded Technologies Target Application Characteristics
Programmability
RDMA, GPUDirect, Virtualization
Backward and future compatibility
Direct communication
Applications (innovations, scalability, performance)
Software-Defined Networking (SDN)
Co-Design Requires Intelligent Interconnect
Offloaded Technologies: Intelligent Interconnect
The Road to Exascale – Co-Design System Architecture
[Diagram: co-design links among CPU, GPU, FPGA, HCA, and switch; in-CPU, in-GPU, in-FPGA, and in-network computing.]
Introducing Switch-IB 2 – World's First Smart Switch
The world's fastest switch, with <90 nanosecond latency
36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec
Adaptive routing, congestion control, support for multiple topologies
Built for scalable compute and storage infrastructures
10X higher performance with the new SHArP switch technology
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X performance improvement for MPI and SHMEM/PGAS applications
Switch-IB 2 enables the switch network to operate as a co-processor
SHArP enables Switch-IB 2 to manage and execute MPI operations in the network
Scalable Hierarchical Aggregation Protocol
Reliable, scalable, general-purpose primitive, applicable to multiple use cases
• In-network tree-based aggregation mechanism
• Large number of groups
• Multiple simultaneous outstanding operations
Accelerating HPC applications with scalable, high-performance collective offload:
• Barrier, Reduce, All-Reduce, Broadcast
• Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
• Integer and floating-point, 32/64 bit
Significantly reduces MPI collective runtime
Increases CPU availability and efficiency
Enables communication and computation overlap
Accelerating MapReduce applications; prevents the incast traffic pattern
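The in-network tree aggregation described above can be sketched in plain Python. This is a toy model, not Mellanox code: `tree_allreduce`, the radix, and the operator are illustrative assumptions. Each tree level plays the role of a rank of switches combining partial results, so the reduction finishes in a logarithmic number of aggregation steps instead of funneling every value through one host CPU:

```python
from functools import reduce

def tree_allreduce(values, radix=2, op=lambda a, b: a + b):
    """Reduce `values` up a radix-`radix` aggregation tree (toy model).

    Returns (result, levels): each level models a rank of switches that
    combine up to `radix` partial results, as SHArP-style in-network
    aggregation does, so `levels` grows with log(N) rather than N.
    """
    level, levels = list(values), 0
    while len(level) > 1:
        level = [reduce(op, level[i:i + radix])
                 for i in range(0, len(level), radix)]
        levels += 1
    return level[0], levels

# 64 "nodes" contribute one value each; a binary tree needs 6 levels.
print(tree_allreduce(range(64)))            # (2016, 6)
# A 36-port switch radix collapses the same job into 2 levels.
print(tree_allreduce(range(64), radix=36))  # (2016, 2)
```

In the real protocol the result is then distributed back down the tree, and the switch also supports the Min/Max, logical, and floating-point operations listed above.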
SHArP Performance Advantage – MiniFE Details
MiniFE is a Finite Element mini-application
• Implements kernels that represent implicit finite-element applications
10X to 25X performance improvement for the MPI AllReduce collective
Nodes   CPU-Based Latency (usec)   SHArP Latency (usec)   Ratio
32      41.7                       4.24                   9.9
64      49.08                      4.63                   10.6
128     57.67                      4.76                   12.1
256     67.76                      4.87                   13.9
512     79.62                      5.09                   15.6
1024    93.55                      5.58                   16.8
2048    109.92                     5.63                   19.5
4096    129.16                     5.73                   22.5
8192    151.76                     5.94                   25.5
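The ratio column can be recomputed directly from the two latency columns; a quick sanity check in Python, with the numbers copied from the table above:

```python
# Latencies (usec) from the MiniFE table: nodes -> (CPU-based, SHArP).
table = {
    32: (41.7, 4.24), 64: (49.08, 4.63), 128: (57.67, 4.76),
    256: (67.76, 4.87), 512: (79.62, 5.09), 1024: (93.55, 5.58),
    2048: (109.92, 5.63), 4096: (129.16, 5.73), 8192: (151.76, 5.94),
}

for nodes, (cpu, sharp) in table.items():
    print(f"{nodes:5d} nodes: {cpu / sharp:4.1f}x")

# The gap widens with scale: each doubling of node count adds several
# microseconds to the host-based allreduce, but only a few tenths of a
# microsecond to the offloaded one; hence ~10x at 32 nodes vs 25.5x at 8192.
```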
SHArP Performance – First Results (Partial Implementation)
3.5X Performance Improvement on 64 Nodes
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
[Diagram: application and transport offloads in the adapter and switch: RDMA, SR-IOV, collectives, Peer-Direct, GPUDirect, and more; MPI/SHMEM offloads arriving Q1'16 and Q3'16.]
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA: Mellanox acceleration engines and FPGA programmability on one adapter
InfiniBand Router – In Progress
Isolation between InfiniBand subnets
Simple connectivity between different topologies
• Enable sharing a common storage network by multiple disconnected subnets
Supports 2^128 nodes (effectively unlimited system size)
SB7780
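The 2^128 figure follows from the size of an InfiniBand GID: a 128-bit address built from a 64-bit subnet prefix and a 64-bit port GUID. A minimal sketch of that packing (the helper names and the GUID value are made up for illustration; FE80::/64 is the standard link-local prefix):

```python
def make_gid(subnet_prefix: int, guid: int) -> int:
    """Pack a 64-bit subnet prefix and a 64-bit port GUID into a 128-bit GID."""
    assert 0 <= subnet_prefix < 2**64 and 0 <= guid < 2**64
    return (subnet_prefix << 64) | guid

def split_gid(gid: int) -> tuple:
    """Recover (subnet_prefix, guid); a router forwards across subnets on the prefix."""
    return gid >> 64, gid & (2**64 - 1)

# Example values only: the link-local prefix plus an arbitrary GUID.
gid = make_gid(0xFE80_0000_0000_0000, 0x0002_C903_00A1_B2C3)
assert split_gid(gid) == (0xFE80_0000_0000_0000, 0x0002_C903_00A1_B2C3)
# 2**128 distinct GIDs across subnets, versus 2**16 LIDs within one subnet.
```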
InfiniBand Router Details
Router implements GID-to-LID mapping
SM allocates an alias GID to each HCA
Address resolution:
• IP-based applications
- Name to IP (standard), IP to GID using a new API
• Pure IB applications
- Upon LID assignment change, the GID DNS is updated
[Diagram: three IB subnets joined by a router; each subnet runs an SM, SRPM, and SRTM, with a GID DNS agent on each HCA; the router runs the RTM and a per-port RPA, backed by the GID DNS.]
RTM: Routing Table Manager
SRTM: Subnet Routing Table Manager
RPA: Router Port Agent
SRPM: Subnet Router Port Manager
GID DNS: IP to GID resolution
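For IP-based applications, the two-step lookup described above can be sketched as follows. This is a toy model: the dictionaries stand in for standard DNS and the GID DNS service, and every name and address is invented for illustration.

```python
DNS = {"node17.cluster": "10.0.3.17"}            # standard name -> IP lookup
GID_DNS = {"10.0.3.17": (0xFE80 << 112) | 0x17}  # IP -> 128-bit GID (toy value)

def resolve(name: str) -> int:
    """Name -> IP via ordinary DNS, then IP -> GID via the new API."""
    ip = DNS[name]
    return GID_DNS[ip]

print(hex(resolve("node17.cluster")))
```

For pure IB applications no name step is involved; instead, the GID DNS entry is refreshed whenever a subnet's SM changes a LID assignment.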
Multi-Host Socket Direct – Low-Latency Socket Communication
Each CPU has direct network access
QPI avoidance for I/O improves performance
Enables GPUDirect / peer-direct on both sockets
The solution is transparent to software
[Diagram: two CPU sockets sharing one adapter, bypassing the QPI link for I/O.]
Multi-Host Socket Direct performance: 50% lower CPU utilization, 20% lower latency
Multi-Host evaluation kit available
Lower application latency, free up the CPU
Mellanox InfiniBand Leadership Over Future Competition
• Switch latency: 20% lower; message rate: 44% higher
• Power consumption per switch port: 25% lower
• Scalability and CPU efficiency: 2X higher
100Gb/s link speed (2014); 200Gb/s link speed (2017)
Gain competitive advantage today, protect your future
Smart network for smart systems: RDMA, acceleration engines, programmability
Higher performance, unlimited scalability, higher resiliency – proven!
Technology Roadmap – One-Generation Lead over the Competition
[Timeline, 2000–2020: InfiniBand generations 20G, 40G, 56G, 100G (2015), 200G, and Mellanox 400G. Milestones: "Roadrunner" (Mellanox Connected, 1st on the TOP500) and Virginia Tech (Apple), 3rd on the TOP500 in 2003; Terascale through Petascale toward Exascale.]
Thank You