SWIFT
Simon Pickartz, Pablo Reble, Carsten Clauß, and Stefan Lankes
A Transparent and Flexible Communication Layer for PCIe-coupled Accelerators and (Co-)Processors
05/19/2014
Today’s HPC Systems
Homogeneous hardware landscape
Separate computer systems connected to increase the aggregated compute power
Different interconnects for LAN and SAN
RDMA capabilities
[Figure: several hosts, each with its own I/O and storage, connected via a fabric]
→ For portability concerns, one standardised interface to layers on top is sufficient (e.g. uDAPL)
05/19/2014 | ACS Automation for Complex Power Systems | 2Simon Pickartz
Tomorrow’s HPC Systems
Intra-rack network connecting
  Hosts
  I/O devices
  Storage
Heterogeneous compute nodes
  CPUs
  GPUs
  Accelerators
  Etc.
Peer-to-peer communication
RDMA and RMA capabilities
Still different interconnects
[Figure: hosts, I/O devices, and storage devices all attached to a shared PCIe fabric]
→ Computer systems connected to share resources and to increase the aggregated compute power
Agenda
Socket Wheeled Intelligent Fabric Transport (SWIFT)
Hardware
Results
Outlook
SWIFT – Requirements
Support for heterogeneous network landscapes
→ A transparent solution is targeted
High portability
→ Hardware abstraction
Supply of different programming models
→ Service-oriented, SPMD, etc.
Consideration of the hardware’s RDMA and RMA capabilities
→ Offer one-sided communication primitives
High performance
→ Low latencies and high data rates
SWIFT – Basic Concepts
Topology
  Hosts
  Nodes
  Endpoints
Communication modes
  Asynchronous signaling via mails
  Non-blocking two-sided communication
  One-sided communication (including atomics)
Gateway-based fabric connection
Topology
[Figure: four fabrics (Fabric 0 to Fabric 3) joined into one SWIFT network]
Layered Architecture
Application layer
  Higher level libraries (e.g. MPI)
  Parallel applications
  Service-oriented apps
SWIFT layer
  Routing
  Topology
  Messaging services
Device layer
  Hardware abstraction
  Optimization
[Figure: layer stack. Software: applications / high-level communication libraries, on top of SWIFT, on top of Device #0 ... Device #N. Hardware: Network #0 ... Network #N]
→ Well-defined interfaces to layers above and below
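The layering can be illustrated with a minimal Python sketch. Every name in it (`Device`, `LoopbackDevice`, `SwiftLayer`, `add_route`, `send`, `recv`) is hypothetical and not taken from SWIFT itself; the sketch only shows the idea that the middle layer selects a device by destination and touches the hardware exclusively through a small device interface.

```python
from abc import ABC, abstractmethod
from collections import deque


class Device(ABC):
    """Hypothetical device-layer interface; one subclass per interconnect."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...


class LoopbackDevice(Device):
    """Stand-in for a real interconnect: queues messages in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._queue = deque()

    def send(self, payload: bytes) -> None:
        self._queue.append(payload)

    def recv(self) -> bytes:
        return self._queue.popleft()


class SwiftLayer:
    """Hypothetical middle layer: maps a destination node to a device."""

    def __init__(self) -> None:
        self._routes = {}  # node ID -> Device

    def add_route(self, node_id: int, device: Device) -> None:
        self._routes[node_id] = device

    def send(self, node_id: int, payload: bytes) -> None:
        # The application never sees which network carries the message.
        self._routes[node_id].send(payload)


layer = SwiftLayer()
dev_a = LoopbackDevice("network-0")
dev_b = LoopbackDevice("network-1")
layer.add_route(1, dev_a)
layer.add_route(2, dev_b)
layer.send(1, b"hello")
layer.send(2, b"world")
```

Supporting a further interconnect in this sketch means adding one more `Device` subclass; the application code calling `SwiftLayer.send` stays unchanged.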
SWIFT Device
Small interface (around 20 prototypes only)
Administration module
  Constructor and destructor
  Automatic discovery of fabric nodes
Channel module
  Bi-directional FIFO channel
  Fixed channel size
  Asynchronous connection establishment via create() and connect()
  Three transfer modes: PIO, DMA, and AUTO
→ High portability
[Figure: bi-directional channel between A and B; each side issues put and get]
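A rough Python model of such a channel is sketched below. It is not the SWIFT device API: the class and function names, the per-receiver queues, and in particular the size threshold that `AUTO` uses to choose between PIO and DMA are assumptions made for illustration, since the slides only name the three modes.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    PIO = "pio"
    DMA = "dma"
    AUTO = "auto"


# Assumed cut-over point for AUTO; the slides do not state one.
AUTO_DMA_THRESHOLD = 4096


def resolve_mode(mode: Mode, size: int) -> Mode:
    """Return the concrete transfer mode; AUTO decides by message size."""
    if mode is not Mode.AUTO:
        return mode
    return Mode.DMA if size >= AUTO_DMA_THRESHOLD else Mode.PIO


class Channel:
    """Bi-directional channel between endpoints "A" and "B".

    Each direction is a FIFO with a fixed number of slots.
    """

    def __init__(self, slots: int) -> None:
        self.slots = slots
        self._fifo = {"A": deque(), "B": deque()}  # keyed by receiver

    def put(self, sender: str, payload: bytes, mode: Mode = Mode.AUTO) -> Mode:
        receiver = "B" if sender == "A" else "A"
        if len(self._fifo[receiver]) >= self.slots:
            raise BufferError("channel full")
        self._fifo[receiver].append(payload)
        return resolve_mode(mode, len(payload))

    def get(self, receiver: str) -> bytes:
        return self._fifo[receiver].popleft()
```

For example, `Channel(slots=4).put("A", b"ping")` queues the message for B and reports `Mode.PIO`, because the payload is below the assumed threshold.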
Connection Setup
A distributed Directory Service (DS)
  Dedicated process for holding topology information
  Automatically maintains connections to other DS on neighbor hosts
  Manages node IDs for the local nodes
  No single point of failure
On-demand connection setup via DS
  Direct communication between nodes on different hosts
  → Minimization of the hop count
  Automatically connect to destination DS if necessary
  A bit of proactivity
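The interplay of directory service and on-demand connection setup might look roughly like the following Python sketch. The classes, the method names, and the node ID format `(host ID, running index)` are all hypothetical; the slides do not describe the actual DS protocol or ID scheme.

```python
class DirectoryService:
    """Hypothetical per-host directory service (DS)."""

    def __init__(self, host_id: int) -> None:
        self.host_id = host_id
        self.local_nodes = {}  # node ID -> node name
        self.neighbors = {}    # host ID -> DirectoryService
        self._next_local = 0

    def register(self, name: str) -> tuple:
        """Assign a node ID to a local node (assumed format)."""
        node_id = (self.host_id, self._next_local)
        self._next_local += 1
        self.local_nodes[node_id] = name
        return node_id

    def link(self, other: "DirectoryService") -> None:
        """Connect two DS instances on neighboring hosts."""
        self.neighbors[other.host_id] = other
        other.neighbors[self.host_id] = self

    def resolve(self, node_id: tuple) -> str:
        """Look a node up locally, otherwise ask the DS on its host."""
        if node_id in self.local_nodes:
            return self.local_nodes[node_id]
        return self.neighbors[node_id[0]].resolve(node_id)


class Node:
    """Hypothetical node that registers with the DS of its host."""

    def __init__(self, name: str, ds: DirectoryService) -> None:
        self.ds = ds
        self.node_id = ds.register(name)
        self.connections = {}  # peer node ID -> peer name

    def connect(self, peer_id: tuple) -> str:
        """Resolve the peer via the DS once, then reuse the connection."""
        if peer_id not in self.connections:
            self.connections[peer_id] = self.ds.resolve(peer_id)
        return self.connections[peer_id]


ds0, ds1 = DirectoryService(0), DirectoryService(1)
ds0.link(ds1)
a = Node("A", ds0)
b = Node("B", ds1)
peer = a.connect(b.node_id)
```

Because every host runs its own DS and only local node IDs are managed there, no central registry is needed in this model, which mirrors the "no single point of failure" property named above.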
Connection Setup
[Figure: connection setup between node A on Host #0 and node B on Host #1 via the DS on each host, shown as four numbered steps]
The ACS Cluster
[Figure: Host #0 and Host #1, each with two Intel Xeon E5-2650 processors (#0, #1) and two Intel Xeon Phi coprocessors (#0, #1), connected by a Dolphin IXH610 PCIe fabric]
Mapping SWIFT onto the Cluster
SCIF Fabrics
SISCI Fabric
SWIFT Overhead – Latencies
[Figure: latency in µs (RTT/2), axis 0 to 20, for Host-to-Host, Host-to-Phi, and Phi-to-Phi; series: Device, SWIFT, MPI]
SWIFT Overhead – Throughput
[Figure: Host-to-Host (RMA); throughput in MiB/s (0 to 2,500) over message size in bytes (64 to 16 M); series: Device, SWIFT, MPI, TPP]
SWIFT Overhead – Throughput
[Figure: Host-to-Phi (RMA); throughput in MiB/s (0 to 1,000) over message size in bytes (64 to 16 M); series: Device, SWIFT, MPI, TPP]
Multi-Hop PingPong – Latency
[Figure: latency in µs (0 to 15) for 1-Hop, 2-Hop, and 3-Hop; series: Real, Phi-Host, Host-Host]
Multi-Hop PingPong – Throughput
Latencies accumulate
Average throughput equals that of the bottleneck link
[Figure: chain of nodes A, B, C, D connected by links with throughputs tX, tY, tZ]
Multi-Hop PingPong – Throughput
[Figure, two panels: throughput in MiB/s (0 to 800) over message size in bytes (64 to 16 M); first panel series: 1-Hop, 2-Hop; second panel series: 1-Hop, 2-Hop, 3-Hop]
Multi-Hop PingPong – Throughput
Latencies accumulate
Average throughput equals that of the bottleneck link
[Figure: chain of nodes A, B, C, D connected by links tX, tY, tZ; each link has one throughput per direction: tAB/tBA, tBC/tCB, tCD/tDC]
→ Asymmetric links
→ Average throughput corresponds to the harmonic mean of the two bottleneck links
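As a small worked model of this statement, the following Python function takes the per-direction throughputs of every link in the chain, picks the slowest link in each direction, and averages the two with the harmonic mean. The function name and the numbers in the usage note are illustrative only, not measurements; the model ignores latency and therefore applies to large messages.

```python
def pingpong_throughput(forward, backward):
    """Average ping-pong throughput over a chain of asymmetric links.

    forward[i] and backward[i] are the throughputs of link i in the two
    directions (same unit, e.g. MiB/s). Each direction is limited by its
    slowest link; the round trip combines both limits as a harmonic mean.
    """
    t_fwd = min(forward)   # bottleneck on the way out
    t_bwd = min(backward)  # bottleneck on the way back
    return 2 * t_fwd * t_bwd / (t_fwd + t_bwd)
```

With made-up values, `pingpong_throughput([600, 300, 600], [600, 600, 600])` gives 400.0: the outbound path is limited to 300, the return path to 600, and their harmonic mean is 400. For symmetric links the result reduces to the throughput of the single bottleneck link.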
Multi-Hop PingPong – Throughput
[Figure: throughput in MiB/s (0 to 500) over message size in bytes (64 to 16 M); series: 3-Hop, Phi-to-Host]
What we have . . .
Support for heterogeneous network landscapes
High portability
  Device abstraction
  Three devices: SCIF, SISCI, and SHMEM
Supply of different programming models
  Three-layered topology
  Automatic node ID assignment
Consideration of the hardware’s RDMA and RMA capabilities
  One-sided communication
  Zero-copy forwarding
High performance
  Good latency results (asynchronous signaling)
  Promising multi-hop throughput results
What we need . . .
DMA over the whole platform
Implementation of swift_put() and swift_get() (work in progress)
Multicast or PUB/SUB communication mode
Dynamic routing (e.g. Bellman-Ford)
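Bellman-Ford is named here only as one candidate for dynamic routing. For reference, a textbook version in Python is sketched below; it is not part of SWIFT, and the node numbering and link costs in the example are hypothetical.

```python
def bellman_ford(num_nodes, edges, source):
    """Textbook Bellman-Ford.

    edges is a list of directed (u, v, cost) tuples. Returns (dist, pred):
    the cheapest cost from source to every node and each node's
    predecessor on that path.
    """
    inf = float("inf")
    dist = [inf] * num_nodes
    pred = [None] * num_nodes
    dist[source] = 0
    # At most num_nodes - 1 relaxation rounds are needed.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                pred[v] = u
                changed = True
        if not changed:
            break
    # One more pass detects cycles with negative total cost.
    for u, v, cost in edges:
        if dist[u] + cost < dist[v]:
            raise ValueError("negative cycle")
    return dist, pred


# Hypothetical chain 0 - 1 - 2 - 3 plus an expensive direct link 0 - 3.
links = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (0, 3, 5)]
edges = links + [(v, u, c) for u, v, c in links]  # both directions
dist, pred = bellman_ford(4, edges, source=0)
```

In this example the cheapest route from node 0 to node 3 runs over nodes 1 and 2 (total cost 3) rather than over the direct link (cost 5), so `pred[3]` is 2.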
Thank you for your kind attention!
Simon Pickartz – [email protected]
Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustraße 10
52074 Aachen
www.eonerc.rwth-aachen.de
Asymmetric Channels
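[Figure: nodes A and B; the link has throughput tAB from A to B and tBA from B to A]
Time to send a message of length L from A to B and back:
$$T = \frac{L}{t_{AB}} + \frac{L}{t_{BA}}$$
Resulting throughput:
$$t_{res} = \frac{L}{T/2} = \frac{2L}{\frac{L}{t_{AB}} + \frac{L}{t_{BA}}} = \frac{2 \cdot t_{AB} \cdot t_{BA}}{t_{AB} + t_{BA}}$$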
RDMA Results
[Figure: Phi-to-Phi; throughput in MiB/s (0 to 1,500) over message size in bytes (64 to 16 M); series: Device, SWIFT (host-routed), MPI (host-routed)]