QsNetIII, An HPC Interconnect for PetaScale Systems

QsNet III An HPC Interconnect for PetaScale Systems Duncan Roweth, Quadrics Ltd ISC08 Dresden June 2008



TRANSCRIPT

Page 1: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII: An HPC Interconnect for PetaScale Systems

Duncan Roweth, Quadrics Ltd

ISC08 Dresden June 2008

Page 2: QsNetIII, An HPC Interconnect For Peta Scale Systems

Quadrics Background

• Develops interconnect products for the HPC market

– HPC Linux systems

– AlphaServer SC systems

• Quadrics is owned by the Finmeccanica group

• Quadrics will be 12 years old in July

Page 3: QsNetIII, An HPC Interconnect For Peta Scale Systems

Interconnect Network – QsNet

• QsNetIII Network

– Multi-stage switch network

– Evolution of the QsNetII design

– Increased use of commodity hardware

– Increasing support for standard software

• QsNetIII Components

– ASICs Elan5 and Elite5

– Adapters, switches, cables

– Firmware, drivers, libraries

– Diagnostics, documentation

Page 4: QsNetIII, An HPC Interconnect For Peta Scale Systems

[Figure: Elan5 adapter block diagram. Labelled blocks: PCIe host interface (SERDES, 16 lanes) with TLB and command launch; fabric; bridge (x8); seven packet engines, each with 16K instruction cache and 9K data buffers; local memory (16K x 8 x 8 banks = 1MB ECC RAM); buffer manager with object cache tags and free list; SDRAM interface to external DDRII memory and external cache interface; local functions (PLL, EEPROM, clocks); two CX4/QsNetIII links.]

Elan5 Adapter Overview

• 2 × 25 Gbit/s QsNetIII links

• PCIe, PCIe2 host interface

• Multiple packet engines

• 512KB of high-bandwidth on-chip local memory

• SDRAM interface to optional local memory

• Buffer manager, object cache

Page 5: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII Adapter Overview

• QM700 PCIe x16

• 128MB adapter memory

• 2 QSFP links

• Half height low profile

• Adapter variants

– PCIe Gen2

– Blade formats

– 10Gbit/s Ethernet 10GBase-CX4

Page 6: QsNetIII, An HPC Interconnect For Peta Scale Systems

Elite5 - Overview

• Physical layer DDR XAUI

– 4 x 6.25Gbit/s (2.5Gbytes/s) in each direction

• 32-way crosspoint router

• 32 virtual channels per link

• Fat tree or mesh topologies

• Adaptive routing

• Broadcast & barrier support

• Memory-mapped stats & error counters accessed via control network
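The two physical-layer figures above can be reconciled with a short calculation. This is a sketch under one assumption that the slide does not state: that the link uses 8b/10b line coding, as standard XAUI does, so 8 of every 10 bits on the wire carry data.

```python
# Link bandwidth per direction for the Elite5 physical layer.
LANES = 4                 # from the slide
LANE_RATE_GBIT = 6.25     # signalling rate per lane, from the slide
DATA_BITS, LINE_BITS = 8, 10   # assumption: 8b/10b line coding (as in XAUI)

raw_gbit = LANES * LANE_RATE_GBIT                  # bits on the wire
payload_gbit = raw_gbit * DATA_BITS / LINE_BITS    # bits after decoding
payload_gbyte = payload_gbit / 8                   # bytes per second

print(f"raw {raw_gbit} Gbit/s, payload {payload_gbit} Gbit/s, "
      f"{payload_gbyte} Gbytes/s per direction")
```

Under that assumption the 25 Gbit/s raw rate gives 20 Gbit/s of payload, which matches the 2.5 Gbytes/s quoted on the slide.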

Page 7: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII Adaptive Routing

• Packet by packet dynamic routing

– Single cycle routing decision

• Selects route based on

– Link state, errors, etc.

– Number of pending acks

• High radix switches

– 2 routing decisions for 2048 nodes

• More flexible than QsNetII

– Operates on groups of links

– Can adaptively route up or down
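The selection criteria on this slide can be illustrated in software. The sketch below is not the Elite5 hardware logic, which makes its decision in a single cycle; the `Link` fields, the `choose_link` name and the port-number tie-break are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Link:
    port: int
    up: bool = True          # link state
    errors: int = 0          # recent error count
    pending_acks: int = 0    # packets sent but not yet acknowledged

def choose_link(group, max_errors=0):
    """Pick one link from a group of equivalent links.

    Links that are down or over the error threshold are skipped; among the
    rest, the link with the fewest pending acks wins (ties go to the lowest
    port number). Returns None when no link in the group is usable.
    """
    usable = [l for l in group if l.up and l.errors <= max_errors]
    if not usable:
        return None
    return min(usable, key=lambda l: (l.pending_acks, l.port))

group = [
    Link(port=0, pending_acks=3),
    Link(port=1, up=False),              # failed link, never selected
    Link(port=2, pending_acks=1),
    Link(port=3, errors=5),              # erroring link, skipped
]
print(choose_link(group).port)
```

Because the choice is made per packet over a group of links, a failed or congested link is avoided without any change to the route tables.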

Page 8: QsNetIII, An HPC Interconnect For Peta Scale Systems

Bandwidth scalability – 1024 nodes

• Bandwidth achieved when 1024 nodes all communicate at the same time

• QsNetII provides better average bandwidth and much narrower spread in best to worst case performance

System    Interconnect   Min   Max   Average

Atlas     InfiniBand      95   762   263

Thunder   QsNetII        248   403   369

Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop June 2007
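The "narrower spread" claim can be checked directly from the table. The sketch below uses only the three figures given per system; the slide does not state units, so none are assumed, and the metric names are chosen here for illustration.

```python
# (min, max, average) bandwidth per system, as given in the table.
results = {
    "Atlas (InfiniBand)": (95, 762, 263),
    "Thunder (QsNetII)": (248, 403, 369),
}

def spread(min_bw, max_bw, avg_bw):
    """Summarise how far the best and worst cases are from each other."""
    return {
        "range": max_bw - min_bw,          # absolute best-to-worst gap
        "max_over_min": max_bw / min_bw,   # best case relative to worst case
        "min_over_avg": min_bw / avg_bw,   # worst case relative to average
    }

for system, figures in results.items():
    s = spread(*figures)
    print(f"{system}: range {s['range']}, "
          f"max/min {s['max_over_min']:.2f}, min/avg {s['min_over_avg']:.2f}")
```

The best-to-worst gap is 667 for Atlas against 155 for Thunder, and the best case is roughly 8 times the worst case on Atlas against about 1.6 times on Thunder.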

Page 9: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII Device Overview

• Manufacturing partner LSI/TSMC, G90 process

• Semi-custom ASICs, 500MHz system clock

• High performance BGA package

Device   Pins   Power

Elan     672    17W

Elite    982    18W

Page 10: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII – Federated Network Switches

• Node switch chassis

– 128 links up 128 down

• Same chassis provides multiple

top switch configurations:

– 64 × 4 for 512-way systems

– 32 × 8 for 1024-way systems

– 16 × 16 for 2048-way systems

– 8 × 32 for 4096-way systems
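The node-switch side of each system size follows from the 128 links up / 128 links down figure above. The sketch below counts node switch chassis and the up-links they present to the top switches; it assumes every down-link connects one node and says nothing about how the top switch chassis are counted, which the slide does not spell out.

```python
import math

DOWN_LINKS_PER_CHASSIS = 128   # from the slide
UP_LINKS_PER_CHASSIS = 128     # from the slide

def node_stage(nodes):
    """Return (node switch chassis, total up-links) for a system size.

    Assumption: one down-link per node, chassis fully populated in order.
    """
    chassis = math.ceil(nodes / DOWN_LINKS_PER_CHASSIS)
    return chassis, chassis * UP_LINKS_PER_CHASSIS

for nodes in (512, 1024, 2048, 4096):
    chassis, uplinks = node_stage(nodes)
    print(f"{nodes}-way: {chassis} node switch chassis, {uplinks} up-links")
```

With equal numbers of up and down links per chassis, the up-link count always matches the node count, which is what full bisection bandwidth requires.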

Page 11: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII Network 4096–way

Page 12: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII cables

• QSFP connectors throughout

• Optical cables (e.g. Luxtera), 5-300m

– PVDF Plenum rated

– LSZH available as an option

• Active copper cables (Gore), 8-20m

• Copper cables (Gore) 1-10m

• No longer Quadrics proprietary

• Bit error rates are a big issue at 5 Gbps and above

– Optical cables between switches

– Short copper cables from nodes

Page 13: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII for HP BladeSystem

Elite5 switch module

Full bandwidth

16 links to the blades (via backplane)

16 links to back of the module

Elan5 mezzanine adapter

2 QsNet links

PCI-E x8 (initially)

128 MB of memory

Page 14: QsNetIII, An HPC Interconnect For Peta Scale Systems

2048-way QsNetIII BladeSystem Network

Page 15: QsNetIII, An HPC Interconnect For Peta Scale Systems

Building a 16K node system in 2009/10

• Single water-cooled rack will provide 1000-2000 standard cores, ~12-25 TF

• 8 blade switches per rack

• Connect 128 of these racks with 1024-way top switches

• Single fibre cable per node for full bisection bandwidth
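The figures on this slide combine with the BladeSystem switch module (16 links to the blades, 16 links to the back of the module) as follows. This is a worked sketch: the count of top switches assumes that every rack up-link goes straight to a top-switch port, which the slide implies but does not state.

```python
RACKS = 128                    # from the slide
BLADE_SWITCHES_PER_RACK = 8    # from the slide
BLADE_LINKS_PER_SWITCH = 16    # Elite5 switch module: links to the blades
UP_LINKS_PER_SWITCH = 16       # Elite5 switch module: links to the back
TOP_SWITCH_PORTS = 1024        # "1024-way top switches"
RACK_TFLOPS = (12, 25)         # per-rack range from the slide

nodes_per_rack = BLADE_SWITCHES_PER_RACK * BLADE_LINKS_PER_SWITCH
uplinks_per_rack = BLADE_SWITCHES_PER_RACK * UP_LINKS_PER_SWITCH
total_nodes = RACKS * nodes_per_rack
total_uplinks = RACKS * uplinks_per_rack

# Assumption: each up-link occupies exactly one top-switch port.
top_switches = total_uplinks // TOP_SWITCH_PORTS
system_tflops = tuple(RACKS * t for t in RACK_TFLOPS)

print(nodes_per_rack, uplinks_per_rack, total_nodes, top_switches, system_tflops)
```

Each rack holds 128 nodes and presents 128 up-links, one fibre per node, so the 128 racks give 16384 nodes. On the slide's per-rack figures the whole system comes to roughly 1.5 to 3.2 PF.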

Page 16: QsNetIII, An HPC Interconnect For Peta Scale Systems

QsNetIII Fault Tolerance

• All of the QsNetII features

– CRCs on every packet

– Automatic retransmission

– Adaptive routing avoids failed links

– Redundant routes

– Redundant, hot-pluggable PSUs and fans

+ Full line rate testing of each link as it comes up

– Switches generate CRPAT, CJPAT or PRBS packets

– Links are only added to the route tables when they are (a) up, (b) connect to the right place, and (c) can transfer data without error.
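The three conditions for admitting a link to the route tables can be sketched in software. This is an illustration only, not the switch firmware: the function names are hypothetical, and the test pattern is a simple 7-bit LFSR sequence standing in for whichever PRBS, CRPAT or CJPAT pattern the switches actually generate.

```python
def prbs7_bits(n, seed=0x7F):
    """Generate n bits from a 7-bit LFSR (period 127). Seed must be non-zero."""
    state = seed & 0x7F
    bits = []
    for _ in range(n):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of the two oldest bits
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits

def count_bit_errors(sent, received):
    """Number of positions where the received pattern differs from the sent one."""
    return sum(a != b for a, b in zip(sent, received))

def link_qualifies(is_up, remote_id, expected_remote_id, sent, received):
    """True only if the link is (a) up, (b) wired to the expected peer,
    and (c) carried the whole test pattern without a single bit error."""
    return (
        is_up
        and remote_id == expected_remote_id
        and len(sent) == len(received)
        and count_bit_errors(sent, received) == 0
    )

pattern = prbs7_bits(1000)
corrupted = pattern.copy()
corrupted[500] ^= 1   # a single flipped bit is enough to reject the link

print(link_qualifies(True, "sw3:p7", "sw3:p7", pattern, pattern))
print(link_qualifies(True, "sw3:p7", "sw3:p7", pattern, corrupted))
```

Gating on all three conditions keeps a miswired or marginal cable out of service from the start, so it never has to be discovered through retransmissions on live traffic.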

Page 17: QsNetIII, An HPC Interconnect For Peta Scale Systems

Software Model – Firmware & Drivers

• Base firmware in the ROMs

• Firmware modules loadable with the device driver

– Elan, OpenFabrics, 10GE Ethernet, …

• Kernel modules

– elan5, elan, rms

• Device dependent library (libelan5)

• Device independent library (libelan)

• User libraries

Page 18: QsNetIII, An HPC Interconnect For Peta Scale Systems

Software Model – Elan Libraries

• Point-to-point message passing

• One-sided put/get

• Transparent rail striping

• Optimised collectives

• Locks and atomic ops

• Global memory allocation
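Rail striping, one of the library features on this slide, can be illustrated with a toy model. The sketch below does not use the libelan API, whose calls are not shown in this presentation; `stripe`, `reassemble` and the 4096-byte chunk size are hypothetical names and values chosen for illustration.

```python
def stripe(message: bytes, rails: int, chunk: int = 4096):
    """Split a message into chunks and deal them round-robin across rails.

    Returns one queue per rail, each a list of (offset, data) pairs, so the
    receiver can place every chunk without caring which rail delivered it.
    """
    queues = [[] for _ in range(rails)]
    for i, offset in enumerate(range(0, len(message), chunk)):
        queues[i % rails].append((offset, message[offset:offset + chunk]))
    return queues

def reassemble(queues, length: int) -> bytes:
    """Rebuild the message from per-rail queues, in any arrival order."""
    buf = bytearray(length)
    for queue in queues:
        for offset, data in queue:
            buf[offset:offset + len(data)] = data
    return bytes(buf)

message = bytes(range(256)) * 50          # 12800 bytes of test data
queues = stripe(message, rails=2)         # e.g. the two links of one adapter
assert reassemble(queues, len(message)) == message
print([len(q) for q in queues])
```

Carrying the offset with each chunk is what makes the striping transparent in this model: the application sees one message, while the chunks travel over both links in parallel.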

Page 19: QsNetIII, An HPC Interconnect For Peta Scale Systems

Why Quadrics?

• Focus on the most demanding HPC applications

• Delivers large system scalability

– All nodes achieve host adapter bandwidth at the same time

– Minimal spread between best and worst case performance

– Low and uniform latency

– Highly optimised collectives

• Single supplier of interconnect hardware, software, support

• Stability of our products

• Track record of delivering production systems

• European company