QsNetIII, an HPC Interconnect for PetaScale Systems
QsNetIII An HPC Interconnect
for PetaScale Systems
Duncan Roweth, Quadrics Ltd
ISC08 Dresden June 2008
Quadrics Background
• Develops interconnect products for the HPC market
– HPC Linux systems
– AlphaServer SC systems
• Quadrics is owned by the Finmeccanica group
• Quadrics will be 12 years old in July
Interconnect Network – QsNet
• QsNetIII Network
– Multi-stage switch network
– Evolution of the QsNetII design
– Increased use of commodity hardware
– Increasing support for standard
software
• QsNetIII Components
– ASICs Elan5 and Elite5
– Adapters, switches, cables
– Firmware, drivers, libraries
– Diagnostics, documentation
Elan5 Adapter
[Figure: Elan5 adapter block diagram. Labelled blocks: PCIe host interface (16 lanes, SERDES), TLB, command launch, fabric, bridge (x8), PLL, EEPROM, clocks, local functions, buffer manager, object cache tags, free list, local memory (16K x 8 x 8 banks = 1MB ECC RAM), external interface, SDRAM interface to external DDRII, external cache, packet engines (each with 16K instruction cache and 9K data buffers), and two CX4/QsNetIII links.]
Elan5 Adapter Overview
• 2 × 25 Gbit/s QsNetIII links
• PCIe, PCIe2 host interface
• Multiple packet engines
• 512KB of high bandwidth on-chip local memory
• SDRAM interface to optional local memory
• Buffer manager, object cache
QsNetIII Adapter Overview
• QM700 PCIe x16
• 128MB adapter memory
• 2 QSFP links
• Half height low profile
• Adapter variants
– PCIe Gen2
– Blade formats
– 10Gbit/s Ethernet 10GBase-CX4
Elite5 - Overview
• Physical layer DDR XAUI
– 4 x 6.25Gbit/s (2.5Gbytes/s) in each direction
• 32-way crosspoint router
• 32 virtual channels per link
• Fat tree or mesh topologies
• Adaptive routing
• Broadcast & barrier support
• Memory mapped stats & error counters accessed via control network
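The two link figures on this slide (4 x 6.25Gbit/s and 2.5Gbytes/s) can be reconciled with a short calculation. This is a sketch only: it assumes the link uses 8b/10b line coding, as standard XAUI does; the slide does not state the encoding.

```python
# Link rate arithmetic for one direction of an Elite5 link.
# Figures from the slide: 4 lanes at 6.25 Gbit/s each.
LANES = 4
LANE_RATE_GBIT = 6.25

# Assumption (not stated on the slide): 8b/10b line coding as in
# standard XAUI, so 8 data bits are carried per 10 line bits.
ENCODING_EFFICIENCY = 8 / 10


def link_data_rate_gbytes(lanes=LANES, lane_rate=LANE_RATE_GBIT,
                          efficiency=ENCODING_EFFICIENCY):
    """Return the usable data rate in Gbytes/s for one direction."""
    raw_gbit = lanes * lane_rate          # raw line rate, Gbit/s
    data_gbit = raw_gbit * efficiency     # after line coding overhead
    return data_gbit / 8                  # bits -> bytes


if __name__ == "__main__":
    print(f"raw line rate : {LANES * LANE_RATE_GBIT:.2f} Gbit/s")
    print(f"data rate     : {link_data_rate_gbytes():.2f} Gbytes/s")
```

Under that assumption the 25 Gbit/s raw rate leaves 20 Gbit/s of data, which matches the 2.5Gbytes/s quoted per direction.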
QsNetIII Adaptive Routing
• Packet by packet dynamic routing
– Single cycle routing decision
• Selects route based on
– Link state, errors etc
– Number of pending acks
• High radix switches
– 2 routing decisions for 2048 nodes
• More flexible than QsNetII
– Operates on groups of links
– Can adaptively route up or down
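The selection criteria above (link state, errors, pending acks, groups of links) can be illustrated with a small software model. This is a hypothetical sketch, not the Elite5 logic, which makes the decision in hardware in a single cycle; the `Link` fields and the `choose_link` function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical per-link state used by the model."""
    port: int
    up: bool
    error_count: int
    pending_acks: int


def choose_link(group, max_errors=0):
    """Pick one link from a group of equivalent links.

    Links that are down or have too many errors are skipped; among
    the rest, the link with the fewest pending acks wins (ties go to
    the lowest port number). Returns None if no link is usable.
    """
    usable = [l for l in group if l.up and l.error_count <= max_errors]
    if not usable:
        return None
    return min(usable, key=lambda l: (l.pending_acks, l.port))


if __name__ == "__main__":
    group = [
        Link(port=0, up=True, error_count=0, pending_acks=3),
        Link(port=1, up=False, error_count=0, pending_acks=0),
        Link(port=2, up=True, error_count=0, pending_acks=1),
        Link(port=3, up=True, error_count=5, pending_acks=0),
    ]
    print(choose_link(group).port)
```

In this model port 1 is skipped because it is down and port 3 because of its errors, so the packet goes out on port 2, the least loaded healthy link.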
Bandwidth scalability – 1024 nodes
• Bandwidth achieved when 1024 nodes all communicate at the same time
• QsNetII provides better average bandwidth and much narrower spread in best to worst case performance

System    Interconnect    Min    Max    Average
Atlas     Infiniband       95    762    263
Thunder   QsNetII         248    403    369

Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop June 2007
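The "narrower spread" claim can be checked directly from the Min, Max and Average figures quoted on this slide. The sketch below only does that arithmetic; the units are whatever the original measurements used, since the slide does not state them.

```python
# Min / max / average bandwidth figures as quoted on the slide.
RESULTS = {
    "Atlas (Infiniband)": {"min": 95, "max": 762, "avg": 263},
    "Thunder (QsNetII)": {"min": 248, "max": 403, "avg": 369},
}


def spread_ratio(r):
    """Best case divided by worst case (1.0 would be perfectly uniform)."""
    return r["max"] / r["min"]


def worst_vs_average(r):
    """Worst-case bandwidth as a fraction of the average."""
    return r["min"] / r["avg"]


if __name__ == "__main__":
    for name, r in RESULTS.items():
        print(f"{name:20s} max/min = {spread_ratio(r):.2f}  "
              f"min/avg = {worst_vs_average(r):.2f}")
```

On these numbers the best-to-worst ratio is about 8 for the Infiniband system and about 1.6 for the QsNetII system.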
QsNetIII Device Overview
Manufacturing partner LSI/TSMC, G90 process
Semi-custom ASICs, 500MHz system clock
High performance BGA package

Device    Package    Power
Elan      672 pin    17W
Elite     982 pin    18W
QsNetIII – Federated Network Switches
• Node switch chassis
– 128 links up, 128 down
• Same chassis provides multiple top switch configurations:
– 64 × 4 for 512-way systems
– 32 × 8 for 1024-way systems
– 16 × 16 for 2048-way systems
– 8 × 32 for 4096-way systems
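One way to read the four top switch configurations is as 64 × 4, 32 × 8, 16 × 16 and 8 × 32, that is, the 256 links of one chassis split into a number of independent k-way switches, where k is the number of node switch chassis in the system. The sketch below reproduces those pairs under that reading; the 256-link total and the function name are assumptions, not figures stated on the slide.

```python
NODE_CHASSIS_DOWN_LINKS = 128   # from the slide: 128 links down per node chassis
TOP_CHASSIS_LINKS = 256         # assumption: 128 + 128 links, all facing down


def top_switch_config(system_size):
    """Return (switches_per_chassis, ways_per_switch) for a top chassis.

    Each top switch needs one link to every node switch chassis, so its
    width equals the number of node chassis in the system.
    """
    if system_size % NODE_CHASSIS_DOWN_LINKS:
        raise ValueError("system size must be a multiple of 128")
    node_chassis = system_size // NODE_CHASSIS_DOWN_LINKS
    return TOP_CHASSIS_LINKS // node_chassis, node_chassis


if __name__ == "__main__":
    for size in (512, 1024, 2048, 4096):
        count, ways = top_switch_config(size)
        print(f"{size:5d}-way system: {count} x {ways}-way top switches per chassis")
```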
QsNetIII Network 4096–way
QsNetIII cables
• QSFP connectors throughout
• Optical cables (e.g. Luxtera), 5-300m
– PVDF plenum rated
– LSZH available as an option
• Active copper cables (Gore), 8-20m
• Copper cables (Gore), 1-10m
• No longer Quadrics proprietary
• Bit error rates are a big issue at 5 Gbps and above
– Optical cables between switches
– Short copper cables from nodes
QsNetIII for HP BladeSystem
Elite5 switch module
Full bandwidth
16 links to the blades (via backplane)
16 links to back of the module
Elan5 mezzanine adapter
2 QsNet links
PCI-E x8 (initially)
128 MB of memory
2048-way QsNetIII BladeSystem Network
Building a 16K node system in 2009/10
• Single water cooled rack will provide 1000-2000 standard cores, ~12-25 TF
• 8 blade switches per rack
• Connect 128 of these racks with 1024-way top switches
• Single fibre cable per node for full bisection bandwidth
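The 16K figure follows from the rack numbers above combined with the 16 blade links per Elite5 switch module given on the BladeSystem slide. The sketch below does that arithmetic; the per-node core counts and the top switch count are derived values under a full bisection assumption, not figures from the slides.

```python
# Figures from the slides.
BLADE_LINKS_PER_SWITCH = 16      # links to the blades per Elite5 switch module
UPLINKS_PER_SWITCH = 16          # links to the back of the module
BLADE_SWITCHES_PER_RACK = 8
RACKS = 128
TOP_SWITCH_WAYS = 1024
CORES_PER_RACK = (1000, 2000)
TFLOPS_PER_RACK = (12, 25)


def system_size():
    """Derive node, uplink and top-switch counts for the whole system."""
    nodes_per_rack = BLADE_SWITCHES_PER_RACK * BLADE_LINKS_PER_SWITCH
    nodes = nodes_per_rack * RACKS
    uplinks = BLADE_SWITCHES_PER_RACK * UPLINKS_PER_SWITCH * RACKS
    # Derived, assuming every uplink lands on a top switch port.
    top_switches = uplinks // TOP_SWITCH_WAYS
    return nodes_per_rack, nodes, uplinks, top_switches


if __name__ == "__main__":
    per_rack, nodes, uplinks, tops = system_size()
    print(f"nodes per rack : {per_rack}")
    print(f"total nodes    : {nodes}")
    print(f"fibre uplinks  : {uplinks} (one per node)")
    print(f"top switches   : {tops} x {TOP_SWITCH_WAYS}-way")
    lo, hi = (c / per_rack for c in CORES_PER_RACK)
    print(f"cores per node : about {lo:.0f} to {hi:.0f}")
    print(f"system peak    : {TFLOPS_PER_RACK[0] * RACKS} to "
          f"{TFLOPS_PER_RACK[1] * RACKS} TF")
```

With 8 switch modules of 16 blades each, a rack holds 128 nodes, and 128 racks give 16384 nodes with one uplink per node.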
QsNetIII Fault Tolerance
• All of the QsNetII features
– CRCs on every packet
– Automatic retransmission
– Adaptive routing avoids failed links
– Redundant routes
– Redundant, hot-pluggable PSUs and fans
+ Full line rate testing of each link as it comes up
– Switches generate CRPAT, CJPAT or PRBS packets
– Links are only added to the route tables when they are (a) up, (b) connected to the right place, and (c) able to transfer data without error.
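The link bring-up rule above can be modelled in a few lines. This is an illustrative sketch only: PRBS-7 (polynomial x^7 + x^6 + 1) is chosen as a simple example pattern, since the slide does not say which PRBS the switches use, and the function names and peer identifiers are hypothetical.

```python
def prbs7(nbits, seed=0x7F):
    """Generate nbits of the PRBS-7 sequence (x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(nbits):
        newbit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | newbit) & 0x7F
        bits.append(newbit)
    return bits


def count_bit_errors(sent, received):
    """Number of positions where the received pattern differs."""
    return sum(s != r for s, r in zip(sent, received))


def may_add_to_route_table(link_up, peer_id, expected_peer_id, bit_errors):
    """Apply conditions (a), (b) and (c) from the slide."""
    return link_up and peer_id == expected_peer_id and bit_errors == 0


if __name__ == "__main__":
    pattern = prbs7(1000)
    corrupted = list(pattern)
    corrupted[10] ^= 1                      # inject a single bit error
    for name, rx in (("clean", pattern), ("noisy", corrupted)):
        errors = count_bit_errors(pattern, rx)
        ok = may_add_to_route_table(True, "sw3:p7", "sw3:p7", errors)
        print(f"{name}: {errors} bit errors, add to route table: {ok}")
```

A link that is up and correctly cabled but shows even one bit error under test stays out of the route tables in this model.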
Software Model – Firmware & Drivers
• Base firmware in the ROMs
• Firmware modules loadable with the device driver
– Elan, OpenFabrics, 10GE Ethernet, …
• Kernel modules
– elan5, elan, rms
• Device dependent library (libelan5)
• Device independent library (libelan)
• User libraries
Software Model – Elan Libraries
• Point-to-point message passing
• One-sided put/get
• Transparent rail striping
• Optimised collectives
• Locks and atomic ops
• Global memory allocation
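As an illustration of what rail striping means, the sketch below splits one message into contiguous chunks, one per rail (for example the two links of an Elan5 adapter). It is a generic model, not the libelan implementation or API; the `stripe` function is hypothetical.

```python
def stripe(length, rails):
    """Split a message of `length` bytes across `rails` links.

    Returns a list of (rail, offset, size) tuples covering the whole
    message, with sizes differing by at most one byte.
    """
    if rails < 1:
        raise ValueError("need at least one rail")
    base, extra = divmod(length, rails)
    chunks = []
    offset = 0
    for rail in range(rails):
        size = base + (1 if rail < extra else 0)
        chunks.append((rail, offset, size))
        offset += size
    return chunks


if __name__ == "__main__":
    for rail, offset, size in stripe(1_000_001, 2):
        print(f"rail {rail}: bytes {offset}..{offset + size - 1}")
```

"Transparent" here means the application issues a single send or put and the library decides the split; a real implementation would also have to choose a chunk size and handle small messages, which this sketch ignores.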
Why Quadrics?
• Focus on the most demanding HPC applications
• Delivers large system scalability
– All nodes achieve host adapter bandwidth at the same time
– Minimal spread between best and worst case performance
– Low and uniform latency
– Highly optimised collectives
• Single supplier of interconnect hardware, software, support
• Stability of our products
• Track record of delivering production systems
• European company