TRANSCRIPT
Leveraging HyperTransport for a custom high-performance cluster network
Mondrian Nüssle, HTCE Symposium 2009
11.02.2009
Outline
Background & Motivation
Architecture
Hardware Implementation
Software Stack
Results
Conclusion
EXTOLL: Background & Motivation
High-performance computing is synonymous with parallel computing
Interconnection networks between processors are a key component of parallel systems
As Patterson observed: “Latency lags Bandwidth”
The EXTOLL project at the CAG aims to significantly lower communication latency and improve communication in parallel systems
Goals
Enable communication with extremely low latency → close to main-memory access times
Enable overlap of communication and computation
Design a balanced system:
in terms of CPU on-loading and off-loading
in terms of system complexity
Adding bandwidth is much easier ☺
Key design facts
Leverage HT as the host interface for the lowest-latency data transport between CPU and device
Leverage a modified HT as the on-chip communication protocol
Implement a lean network interface controller:
minimize state information on the NIC
provide user-level, virtualized access (avoid the kernel)
minimize the number of CPU ↔ device and memory ↔ device transactions
A network layer that provides a reliable, in-order, low-latency transport service
Block diagram
[Block diagram: Host Interface (HyperTransport IP core, HTAX crossbar) → NIC (ATU, VELO, RMA, C&S registerfile) → Network (EXTOLL crossbar with 3 network ports and 6 link ports)]
Host Interface block
NIC block: several communication functions
Network block: 6 links, 9x9 crossbar
Flexible architecture:
Configurable data path
Communication functions: VELO
Virtualized Engine for Low Overhead
Enables ultra-low-latency send/receive communication
Supports messages of up to 64 bytes (one cache line) directly
A single PIO transaction triggers the sending of a message
Message completion at the receiver is usually signaled with a single DMA transaction
Minimized traffic between host and device!
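A minimal sketch of what a single-PIO-transaction send could look like from software. The message layout below (a small header plus payload filling one 64-byte cache line) is an assumption for illustration, not the actual EXTOLL register format; in hardware, `window` would be a write-combining mapping of the device's doorbell page, while here a plain buffer stands in so the idea is testable in user space:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical VELO message layout: 8 bytes of header (destination node,
 * length) plus up to 56 bytes of payload -- one 64-byte cache line total,
 * so a single PIO burst carries the whole message. */
typedef struct {
    uint16_t dest_node;
    uint16_t len;
    uint32_t reserved;
    uint8_t  payload[56];
} velo_msg_t;

/* Assemble the message locally, then emit it as one 64-byte store burst
 * to the (simulated) device window, triggering the send. */
void velo_send(volatile uint8_t *window, uint16_t dest,
               const void *data, uint16_t len)
{
    velo_msg_t msg = { .dest_node = dest, .len = len, .reserved = 0 };
    memcpy(msg.payload, data, len > 56 ? 56 : len);
    memcpy((void *)window, &msg, sizeof msg);  /* one cache-line write */
}
```

Building the full message in a register or write-combining buffer before the single store is what keeps host-device traffic at one transaction per send.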
Communication functions: RMA
EXTOLL Remote Memory Architecture
Enables access to remote memory using put, get, and atomic transactions
A transaction is triggered by a single 128-bit SSE2 store → minimizing start-up latency
Flexible notifications: at the requester, the completer, the responder, or any combination thereof
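The point of the single 128-bit store is that the complete put command reaches the device in one posted write. A hedged sketch, assuming a hypothetical 16-byte descriptor layout (not the real EXTOLL encoding); on x86 the final copy would be issued as a single SSE2 store such as `_mm_stream_si128`, modeled here as a plain 16-byte copy so it stays portable:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit RMA put descriptor: the whole command fits in
 * 16 bytes, so one SSE2 store delivers it to the doorbell atomically.
 * Field layout is illustrative only. */
typedef struct {
    uint64_t local_addr;   /* source buffer in registered local memory */
    uint64_t remote_addr;  /* destination in the remote node's memory  */
} rma_put_desc_t;

void rma_put(void *doorbell, uint64_t local_addr, uint64_t remote_addr)
{
    rma_put_desc_t desc = { local_addr, remote_addr };
    /* In hardware: one 128-bit non-temporal store to a write-combining
     * mapping of the device; a 16-byte copy models it here. */
    memcpy(doorbell, &desc, sizeof desc);
}
```

Because the descriptor is written in one indivisible transaction, the NIC never observes a half-written command, and start-up cost is a single posted write rather than several register accesses.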
Supporting modules
Address Translation Unit
Provides address translation services for RMA
Registration/unregistration latency on prototype systems starts at ~2 µs
Translation uses an on-chip TLB and main-memory tables
Control and Status Registerfile
Automatically generated from a high-level specification (including kernel code)
Local and remote access possible (for network management software)
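The ATU's two-level lookup (on-chip TLB first, main-memory tables on a miss) can be sketched as follows. Sizes and the direct-mapped organization are assumptions for illustration, not the real ATU design:

```c
#include <stdint.h>

#define TLB_ENTRIES 8   /* toy size; the real on-chip TLB differs */

typedef struct {
    uint64_t vpn;   /* virtual page number (tag) */
    uint64_t ppn;   /* physical page number */
    int      valid;
} tlb_entry_t;

typedef struct {
    tlb_entry_t tlb[TLB_ENTRIES];
    const uint64_t *table;   /* main-memory table, indexed by VPN */
} atu_t;

/* Returns the physical page number for vpn, filling the TLB on a miss. */
uint64_t atu_translate(atu_t *atu, uint64_t vpn)
{
    tlb_entry_t *e = &atu->tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn)
        return e->ppn;               /* TLB hit: no memory access needed */
    uint64_t ppn = atu->table[vpn];  /* miss: walk the main-memory table */
    *e = (tlb_entry_t){ .vpn = vpn, .ppn = ppn, .valid = 1 };
    return ppn;
}
```

Keeping the tables in main memory rather than on the NIC is one way the design minimizes state held on the device.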
HT Interface
HT-Core: the interface to the host
All functional units need to communicate with the host
Avoid protocol conversion for the on-chip network
→ HTAX crossbar running an on-chip protocol: simplified, more source tags, fixed format
Network layer
Fully parametrizable data-path width and number of ports
In-order delivery of packets
Virtual channels
Hardware retransmission
Cut-through switching
Credit-based flow control
Current implementation: 6 ports connect to external links, 16+2-bit data path width
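The credit-based flow control mentioned above can be modeled minimally: the sender may only inject a packet while it holds credits (free buffer slots granted by the receiver), and a credit flows back once the receiver drains a slot. This is a sketch of the general mechanism, not the EXTOLL link ports' actual per-virtual-channel hardware:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sender-side credit counter for one link (one virtual channel). */
typedef struct {
    uint32_t credits;  /* free receiver buffer slots currently granted */
} link_tx_t;

/* Attempt to inject one packet; fails (sender stalls) without credits,
 * which is what guarantees the receiver buffer can never overflow. */
bool link_try_send(link_tx_t *tx)
{
    if (tx->credits == 0)
        return false;
    tx->credits--;     /* one packet consumes one credit */
    return true;
}

/* Receiver freed a buffer slot and returned a credit. */
void link_credit_return(link_tx_t *tx)
{
    tx->credits++;
}
```

Because packets are only sent into guaranteed buffer space, packets are never dropped for lack of buffering, which is what makes the reliable, in-order service cheap to provide.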
Implementation I
EXTOLL prototype is implemented on the HTX-Board
Virtex-4 FX100 FPGA, speed grade -11 or -12
6 SFP optical transceivers
Currently: 16-bit width, 180 MHz core frequency
3.6 Gb/s links
Implementation II
> 90% of the FPGA's slices are in use for the design
The HT-Core runs at 200 MHz internal frequency and HT400
The EXTOLL modules run at 180 MHz on a speed-grade -12 device
Software Stack
[Software stack diagram: user applications and middleware (MPI, GASNet) use libVELO and libRMA in user space; the velodrv, rmadrv, atudrv, and extoll_rf drivers sit on the EXTOLL base driver in kernel space; below them the EXTOLL hardware exposes VELO, RMA, the registerfile, the ATU, and the PCI config space]
OS bypass
Layered approach
PGAS support through GASNet
MPI support through Open MPI
Linux kernel driver
Results: Latency
[Figure: latency in µs (0 to 8) vs. message size in bytes (10 to 1000, logarithmic scale) for EXTOLL VELO, EXTOLL RMA Put, and EXTOLL RMA Get]
Start-up latency: ~1 µs
An RMA Put transaction beats VELO at 256 bytes
Get latency is a full round trip
Results: Bandwidth
[Figure: bandwidth in MB/s (50 to 350) vs. message size in bytes (10 to 1000, logarithmic scale) for EXTOLL VELO, Put, and Get, with peak and half-peak payload bandwidth marked]
More than n½ bandwidth at 32 bytes!
Maximum bandwidth reached at 4 KB
Technology Scaling
[Figure: projected start-up latency (0 to 1200 on the x-axis) for: FPGA at HT400, 180 MHz; optimized FPGA at HT400, 200 MHz; HT800 ASIC at 500 MHz internal (est.); HT1000 ASIC at 800 MHz (est.); reference: Mellanox ConnectX DDR InfiniBand]
Already beats the best InfiniBand silicon
An ASIC would show 3 times lower latency!
Conclusion
EXTOLL is an architecture for ultra-low-latency communication in parallel systems
The prototype hardware is up and running
The basic software environment is up and running
Performance numbers are excellent:
~1 µs start-up latency on the FPGA prototype
Bandwidth is limited by the serializers and the board, but can be improved with a new platform
Next Steps
More software is being added; most interesting is GASNet
Evaluation on the 1024-core Valencia cluster
On the hardware side, the next step is a new revision on a more powerful base technology
Evaluation of the next platform for the hardware
Thanks!
Questions?