Feroz Zahid, Ernst Gunnar Gran, Tor Skeie (Simula Research Laboratory, Norway); Bartosz Bogdański, Bjørn Dag Johnsen (Oracle Corporation)
PDP 2015, Turku, Finland
March 5, 2015
A weighted fat-tree routing algorithm for efficient load-balancing in InfiniBand clusters
InfiniBand (IB) is a popular interconnect for HPC systems
Source: Top500 Supercomputers List, http://top500.org/
44.8% share in the November 2014 Top500 list
Network performance in HPC systems depends on three important factors
Routing
Network Topology
Traffic Patterns
Many different topologies are found in real-world clusters: Ring, Kautz, Torus, Clos, Fat-trees
Fat-tree and its variants are very common in IB networks
• k-ary-n-tree • n levels, k^n nodes, n·k^(n-1) switches • 2k ports on each switch • Each switch has an equal number of up and down connections • Only half of the ports of the root switches are used
• XGFTs • More generalized • Allows different number of up and down connections on switches • Also, allows different number of connections at each level
• PGFTs • Allows multiple connecting links between switches
• RLFTs • Restrictions on PGFTs • Same port switches at all levels
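The k-ary-n-tree figures listed above (k^n nodes, n·k^(n-1) switches, 2k ports per switch) can be checked with a few lines of Python. This is only a minimal sketch; the function name kary_ntree_size is made up for illustration and is not part of any IB tool.

```python
def kary_ntree_size(k: int, n: int) -> dict:
    """Return the basic size figures of a k-ary-n-tree."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive integers")
    return {
        "levels": n,
        "end_nodes": k ** n,              # k^n end nodes
        "switches": n * k ** (n - 1),     # n * k^(n-1) switches
        "ports_per_switch": 2 * k,        # k up-ports and k down-ports
    }


if __name__ == "__main__":
    for k, n in [(4, 2), (8, 3), (16, 3)]:
        print(f"{k}-ary-{n}-tree:", kary_ntree_size(k, n))
```

The end-node counts produced this way agree with the topologies used later in the execution-time comparison (for example 16 nodes for a 4-ary-2-tree and 4096 for a 16-ary-3-tree).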
Fat-trees have nice properties that make them popular
• Maintenance of full-bisection bandwidth
• Easy deadlock-free routing
• Fault tolerance
Routing in IB networks is generally deterministic
Based on linear forwarding tables (LFTs) stored in the switches
Deterministic routing is traffic oblivious!
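To make the "traffic oblivious" point concrete, here is a minimal sketch of a linear forwarding table lookup. The Switch class and its method names are hypothetical and only model the idea that the out-port is a fixed function of the destination identifier; it does not reproduce any real subnet manager data structure.

```python
class Switch:
    """A switch holding a linear forwarding table indexed by destination LID."""

    def __init__(self, name: str, max_lid: int) -> None:
        self.name = name
        # One entry per destination LID; None marks "no route programmed".
        self.lft = [None] * (max_lid + 1)

    def set_route(self, dlid: int, out_port: int) -> None:
        """Program the out-port used for packets addressed to dlid."""
        self.lft[dlid] = out_port

    def forward(self, dlid: int) -> int:
        """Return the out-port for a packet; source and load play no part."""
        out_port = self.lft[dlid]
        if out_port is None:
            raise LookupError(f"{self.name}: no route for LID {dlid}")
        return out_port


if __name__ == "__main__":
    sw = Switch("leaf-0", max_lid=8)
    sw.set_route(dlid=5, out_port=3)
    # Every packet to LID 5 leaves on the same port, however busy that port is.
    print(sw.forward(5), sw.forward(5))
```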
Routing in fat-tree networks can be source based or destination based, and can be closed form or iterative
• Source-based • Out-port for a packet at a switch based on source node identifier
• Destination-based • Out-port for a packet at a switch based on destination node identifier
• Closed form • D-mod-K, S-mod-K
• Iterative
for each leaf switch lf
    for each node connected to lf
        id <- node identifier
        route_downgoing_go_up(id)
        ...
    end for
end for
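The closed-form schemes named above can be sketched as follows for a k-ary-n-tree. This uses a commonly cited formulation, up-port = floor(id / k^level) mod k, with the leaf level numbered 0; the level numbering and the function names are assumptions made for this example, not details taken from the slides.

```python
def mod_k_up_port(node_id: int, level: int, k: int) -> int:
    """Up-port index chosen at a switch on the given level (leaf level = 0)."""
    return (node_id // k ** level) % k


def d_mod_k(src: int, dst: int, level: int, k: int) -> int:
    """Destination-based: the up-port depends only on the destination id."""
    return mod_k_up_port(dst, level, k)


def s_mod_k(src: int, dst: int, level: int, k: int) -> int:
    """Source-based: the up-port depends only on the source id."""
    return mod_k_up_port(src, level, k)


if __name__ == "__main__":
    k = 4
    for dst in (1, 5, 9):
        # Destinations 1, 5 and 9 share index 1 in their leaf switches,
        # so D-mod-K sends traffic for all of them to up-port 1 at the leaf level.
        print(dst, d_mod_k(src=0, dst=dst, level=0, k=k))
```

Under this formulation, destinations that sit at the same index position in different leaf switches map to the same up-port index at the leaf level, which is the sharing effect discussed later in the talk.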
OFED’s fat-tree routing algorithm tends to spread the routes across the tree using counters
Ref: Zahavi, Eitan, et al. "Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns." Concurrency and Computation: Practice and Experience 22.2 (2010): 217-231.
OFED is the de-facto standard software stack for building and deploying IB based applications
• Deterministic • High-performance, Avoids out-of-order packet deliveries
• Destination-based • Direct realization in IB networks
• Iterative • Better route balancing
• Maintains counters on ports • When a new route is added, the port's counter is incremented by 1
• Supports XGFTs, PGFTs, RLFTs
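The counter idea above can be illustrated with a simplified sketch: each up-port keeps a count of the routes placed on it, and every new destination goes to the port with the smallest count. This is not the OFED code; the function names and the tie-break on the lowest port index are assumptions for the example.

```python
def pick_least_loaded_port(counters: list) -> int:
    """Return the port with the smallest route counter (lowest index on ties)."""
    return min(range(len(counters)), key=lambda port: (counters[port], port))


def assign_routes(destinations: list, num_up_ports: int):
    """Assign each destination to an up-port, balancing by route count only."""
    counters = [0] * num_up_ports
    assignment = {}
    for dst in destinations:
        port = pick_least_loaded_port(counters)
        assignment[dst] = port
        counters[port] += 1  # a new route was added on this port: +1
    return assignment, counters


if __name__ == "__main__":
    assignment, counters = assign_routes([1, 2, 3, 4, 5, 6], num_up_ports=3)
    print(assignment)
    print(counters)
```

Note that the result depends on the order in which destinations are processed, and that every route counts the same regardless of how much traffic the destination is expected to receive.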
“Multi-stage switches are not cross-bars!”
The effective bisection-bandwidth depends on the traffic pattern
Ref: Hoefler, Torsten, Timo Schneider, and Andrew Lumsdaine. "Multistage switches are not crossbars: Effects of static routing in high-performance networks." Cluster Computing, 2008
Nodes 1 and 4 share the same index position in their leaf switches
We identify two important issues with the fat-tree routing algorithm as implemented by OFED’s subnet manager
• Node Traffic Oblivious Routing • All nodes treated equally • Node roles ignored
• Non-predictable Performance • Nodes are routed in an order that depends on the port numbers • Port numbering is hard to set
• Sysadmins do not care about it • Addition of new nodes
• Which nodes share links? • Depends on the indexing sequence!
Some nodes tend to receive more traffic than others, so routes towards those nodes are more likely to be congested. Nodes 4 and 5 are more likely to receive traffic, e.g. storage nodes.
We call these nodes receiver nodes!
648-port fat-tree is a common building block for HPC systems
Result: The probability of index collision for receiver nodes is very high for node oblivious routing
With 2 receiver nodes per leaf switch, the probability that two receiver nodes share the same index is about 90%!
The weighted fat-tree routing algorithm (wFatTree) assigns weights to the nodes
The algorithm is still deterministic!
• All compute nodes are assigned a new parameter • receive weight
• Weights can be assigned based on • Known node roles e.g. storage nodes • Known traffic priorities e.g. following QoS levels • Traffic profiling
• Nodes are routed in the decreasing order of their weights • Not based on port numbering • Predictable
• Port selection is based on both • Downward weight • Upward weight
Port selection in wFatTree uses both downward and upward weights
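The two ideas above (route nodes in decreasing order of weight, and select ports by accumulated weight) can be sketched as below. This is an illustrative simplification and not the authors' implementation: the Port class, the rule of comparing downward weight first and upward weight second, and the example weights are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Port:
    index: int
    down_weight: int = 0  # accumulated weight of routes placed on this port
    up_weight: int = 0    # hypothetical upward weight, used here as a tie-breaker


def route_by_weight(node_weights: dict, ports: list) -> dict:
    """Assign nodes to ports, heaviest node first, onto the lightest port."""
    assignment = {}
    # Decreasing weight; ties broken by node id so the result stays deterministic.
    ordered = sorted(node_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for node, weight in ordered:
        port = min(ports, key=lambda p: (p.down_weight, p.up_weight, p.index))
        assignment[node] = port.index
        port.down_weight += weight
    return assignment


if __name__ == "__main__":
    # Nodes 4 and 5 play the role of receiver nodes with a large weight.
    weights = {1: 1, 2: 1, 3: 1, 4: 100, 5: 100, 6: 1}
    ports = [Port(i) for i in range(3)]
    print(route_by_weight(weights, ports))
```

In this toy run the two heavy nodes end up on different ports no matter how the nodes are numbered, which is the kind of predictability the weighted ordering is meant to provide.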
Result: Evaluation on 648-port fat-tree shows substantial improvements in total network bandwidth
18 Switches with receiver nodes
27 Switches with receiver nodes
All 36 Switches with receiver nodes
Result: wFatTree minimizes the total contention on the links by balancing routes
Result: The wFatTree execution time is competitive with the original fat-tree routing
Topology        No. of End Nodes   Fat Tree Routing   wFatTree Routing
4-ary-2-tree    16                 0.167              0.255
8-ary-2-tree    64                 0.318              0.365
16-ary-2-tree   256                1.686              2.268
8-ary-3-tree    512                16.386             19.657
12-ary-3-tree   1728               188.856            230.639
16-ary-3-tree   4096               1029.369           1434.287
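To read the table as relative overhead, the ratio of the wFatTree time to the original fat-tree time can be computed per topology. The numbers below are copied from the table; the helper name overhead_ratio is made up for this sketch, and the unit of the timings is not stated on the slide.

```python
# (topology, end nodes, fat-tree routing time, wFatTree routing time)
TIMINGS = [
    ("4-ary-2-tree", 16, 0.167, 0.255),
    ("8-ary-2-tree", 64, 0.318, 0.365),
    ("16-ary-2-tree", 256, 1.686, 2.268),
    ("8-ary-3-tree", 512, 16.386, 19.657),
    ("12-ary-3-tree", 1728, 188.856, 230.639),
    ("16-ary-3-tree", 4096, 1029.369, 1434.287),
]


def overhead_ratio(fat_tree_time: float, wfattree_time: float) -> float:
    """Return how many times longer wFatTree took than the original routing."""
    return wfattree_time / fat_tree_time


if __name__ == "__main__":
    for topology, nodes, base, weighted in TIMINGS:
        ratio = overhead_ratio(base, weighted)
        print(f"{topology:14s} {nodes:5d} nodes  ratio = {ratio:.2f}")
```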
Future Work: Enable smart network provisioning – Four important components
Nodes with weights
Balanced Traffic Better Routes
Optimized Algorithms
• Smart Routing • Reconfiguration • Load Balancing • Congestion Control
IB Congestion Control
Performance
Adjusting to Load
Optimization
Monitor->Optimize->Execute Loop
Questions?
State-of-the-art fat-tree routing with oblivious path assignment
The weighted fat-tree routing with better load-balancing
In summary, weighted fat-tree routing improves actual load-balancing in IB based fat-tree networks