TRANSCRIPT
Interconnect Your Future
Rich Graham
February 2016, HPCAC Stanford Conference
© 2015 Mellanox Technologies
The Ever Growing Demand for Higher Performance
[Timeline, 2000–2020: Terascale ("Roadrunner", 1st on the TOP500) through Petascale (2015) toward Exascale; single-core to many-core, SMP to clusters. Performance development rests on co-design of hardware, software, and application.]
The Interconnect is the Enabling Technology
© 2015 Mellanox Technologies 3
Co-Design Architecture to Enable Exascale Performance
CPU-centric design is limited to main CPU usage and results in performance limitations.
Co-design creates synergies across software, in-CPU computing, in-network computing, and in-storage computing, enabling higher performance and scale.
The Intelligence is Moving to the Interconnect
[Diagram: intelligence shifting from the CPU (past) to the interconnect (future).]
Breaking the Application Latency Wall
Today: Network device latencies are on the order of 100 nanoseconds
Challenge: Enabling the next order of magnitude improvement in application performance
Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
10 years ago: ~10 microsecond network, ~100 microsecond communication framework
Today: ~0.1 microsecond network, ~10 microsecond communication framework
Future: ~0.05 microsecond co-design network, ~1 microsecond communication framework
Co-Design: Offloaded Technologies Target Application Characteristics
Programmability
RDMA, GPUDirect, Virtualization
Backward and future compatibility
Direct communication
Applications (innovations, scalability, performance)
Software-Defined Networking (SDN)
Co-Design Requires Intelligent Interconnect
Offloaded Technologies: Intelligent Interconnect
The Road to Exascale – Co-Design System Architecture
[Diagram: co-design links among CPU, GPU, FPGA, HCA, and switch; in-CPU, in-GPU, in-FPGA, and in-network computing.]
Introducing Switch-IB 2 – World's First Smart Switch
The world's fastest switch, with <90 nanosecond latency
36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec
Adaptive routing, congestion control, support for multiple topologies
Built for scalable compute and storage infrastructures
10X higher performance with the new SHArP switch technology
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X performance improvement for MPI and SHMEM/PGAS applications
Switch-IB 2 enables the switch network to operate as a co-processor
SHArP enables Switch-IB 2 to manage and execute MPI operations in the network
Scalable Hierarchical Aggregation Protocol
Reliable, scalable, general-purpose primitive, applicable to multiple use cases
• In-network tree-based aggregation mechanism
• Large number of groups
• Multiple simultaneous outstanding operations
Accelerating HPC applications with scalable, high-performance collective offload:
• Barrier, Reduce, All-Reduce, Broadcast
• Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
• Integer and floating-point, 32/64 bit
Significantly reduces MPI collective runtime
Increases CPU availability and efficiency
Enables communication and computation overlap
Accelerating MapReduce applications; prevents the incast traffic pattern
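The in-network tree aggregation described above can be sketched in plain Python. This is a toy model, not Mellanox code: `tree_allreduce`, the radix, and the operator are illustrative assumptions. Each tree level plays the role of a rank of switches combining partial results, so the reduction finishes in a logarithmic number of aggregation steps instead of funneling every value through one host CPU:

```python
from functools import reduce

def tree_allreduce(values, radix=2, op=lambda a, b: a + b):
    """Reduce `values` up a radix-`radix` aggregation tree (toy model).

    Returns (result, levels): each level models a rank of switches that
    combine up to `radix` partial results, as SHArP-style in-network
    aggregation does, so `levels` grows with log(N) rather than N.
    """
    level, levels = list(values), 0
    while len(level) > 1:
        level = [reduce(op, level[i:i + radix])
                 for i in range(0, len(level), radix)]
        levels += 1
    return level[0], levels

# 64 "nodes" contribute one value each; a binary tree needs 6 levels.
print(tree_allreduce(range(64)))            # (2016, 6)
# A 36-port switch radix collapses the same job into 2 levels.
print(tree_allreduce(range(64), radix=36))  # (2016, 2)
```

In the real protocol the result is then distributed back down the tree, and the switch also supports the Min/Max, logical, and floating-point operations listed above.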
SHArP Performance Advantage – MiniFE Details
MiniFE is a Finite Element mini-application
• Implements kernels that represent implicit finite-element applications
10X to 25X performance improvement for the MPI AllReduce collective
Nodes   CPU-Based Latency (usec)   SHArP Latency (usec)   Ratio
32      41.7                       4.24                   9.9
64      49.08                      4.63                   10.6
128     57.67                      4.76                   12.1
256     67.76                      4.87                   13.9
512     79.62                      5.09                   15.6
1024    93.55                      5.58                   16.8
2048    109.92                     5.63                   19.5
4096    129.16                     5.73                   22.5
8192    151.76                     5.94                   25.5
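The ratio column can be recomputed directly from the two latency columns; a quick sanity check in Python, with the numbers copied from the table above:

```python
# Latencies (usec) from the MiniFE table: nodes -> (CPU-based, SHArP).
table = {
    32: (41.7, 4.24), 64: (49.08, 4.63), 128: (57.67, 4.76),
    256: (67.76, 4.87), 512: (79.62, 5.09), 1024: (93.55, 5.58),
    2048: (109.92, 5.63), 4096: (129.16, 5.73), 8192: (151.76, 5.94),
}

for nodes, (cpu, sharp) in table.items():
    print(f"{nodes:5d} nodes: {cpu / sharp:4.1f}x")

# The gap widens with scale: each doubling of node count adds several
# microseconds to the host-based allreduce, but only a few tenths of a
# microsecond to the offloaded one; hence ~10x at 32 nodes vs 25.5x at 8192.
```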
SHArP Performance – First Results (Partial Implementation)
3.5X Performance Improvement on 64 Nodes
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
[Diagram: application and transport offloads in the adapter and switch: RDMA, SR-IOV, collectives, Peer-Direct, GPUDirect, and more; MPI/SHMEM offloads arriving Q1'16 and Q3'16.]
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA: Mellanox acceleration engines and FPGA programmability on one adapter
InfiniBand Router – In Progress
Isolation between InfiniBand subnets
Simple connectivity between different topologies
• Enable sharing a common storage network by multiple disconnected subnets
Supports 2^128 nodes (effectively unlimited system size)
SB7780
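The 2^128 figure follows from the size of an InfiniBand GID: a 128-bit address built from a 64-bit subnet prefix and a 64-bit port GUID. A minimal sketch of that packing (the helper names and the GUID value are made up for illustration; FE80::/64 is the standard link-local prefix):

```python
def make_gid(subnet_prefix: int, guid: int) -> int:
    """Pack a 64-bit subnet prefix and a 64-bit port GUID into a 128-bit GID."""
    assert 0 <= subnet_prefix < 2**64 and 0 <= guid < 2**64
    return (subnet_prefix << 64) | guid

def split_gid(gid: int) -> tuple:
    """Recover (subnet_prefix, guid); a router forwards across subnets on the prefix."""
    return gid >> 64, gid & (2**64 - 1)

# Example values only: the link-local prefix plus an arbitrary GUID.
gid = make_gid(0xFE80_0000_0000_0000, 0x0002_C903_00A1_B2C3)
assert split_gid(gid) == (0xFE80_0000_0000_0000, 0x0002_C903_00A1_B2C3)
# 2**128 distinct GIDs across subnets, versus 2**16 LIDs within one subnet.
```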
InfiniBand Router Details
Router implements GID-to-LID mapping
SM allocates an alias GID to each HCA
Address resolution:
• IP-based applications
- Name to IP (standard), IP to GID using a new API
• Pure IB applications
- Upon LID assignment change, the GID DNS is updated
[Diagram: three IB subnets joined by a router; each subnet runs an SM, SRPM, and SRTM, with a GID DNS agent on each HCA; the router runs the RTM and a per-port RPA, backed by the GID DNS.]
RTM: Routing Table Manager
SRTM: Subnet Routing Table Manager
RPA: Router Port Agent
SRPM: Subnet Router Port Manager
GID DNS: IP to GID resolution
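For IP-based applications, the two-step lookup described above can be sketched as follows. This is a toy model: the dictionaries stand in for standard DNS and the GID DNS service, and every name and address is invented for illustration.

```python
DNS = {"node17.cluster": "10.0.3.17"}            # standard name -> IP lookup
GID_DNS = {"10.0.3.17": (0xFE80 << 112) | 0x17}  # IP -> 128-bit GID (toy value)

def resolve(name: str) -> int:
    """Name -> IP via ordinary DNS, then IP -> GID via the new API."""
    ip = DNS[name]
    return GID_DNS[ip]

print(hex(resolve("node17.cluster")))
```

For pure IB applications no name step is involved; instead, the GID DNS entry is refreshed whenever a subnet's SM changes a LID assignment.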
Multi-Host Socket Direct – Low-Latency Socket Communication
Each CPU has direct network access
QPI avoidance for I/O improves performance
Enables GPUDirect / peer-direct on both sockets
The solution is transparent to software
[Diagram: two CPU sockets sharing one adapter, bypassing the QPI link for I/O.]
Multi-Host Socket Direct performance: 50% lower CPU utilization, 20% lower latency
Multi-Host evaluation kit available
Lower application latency, free up the CPU
Mellanox InfiniBand Leadership Over Future Competition
• Switch latency: 20% lower; message rate: 44% higher
• Power consumption per switch port: 25% lower
• Scalability and CPU efficiency: 2X higher
100Gb/s link speed (2014); 200Gb/s link speed (2017)
Gain competitive advantage today, protect your future
Smart network for smart systems: RDMA, acceleration engines, programmability
Higher performance, unlimited scalability, higher resiliency – proven!
Technology Roadmap – One-Generation Lead over the Competition
[Timeline, 2000–2020: InfiniBand generations 20G, 40G, 56G, 100G (2015), 200G, and Mellanox 400G. Milestones: "Roadrunner" (Mellanox Connected, 1st on the TOP500) and Virginia Tech (Apple), 3rd on the TOP500 in 2003; Terascale through Petascale toward Exascale.]
Thank You