congestion in hpc interconnection networks€¦ · by the interconnection network ... • knowledge...
TRANSCRIPT
Congestion Management in HPC Interconnection NetworksInterconnection Networks
Pedro J GarcíaPedro J. GarcíaUniversidad de Castilla-La Mancha (SPAIN)
Conference title 1
OutlineOutline
Why may congestion become a problem?
Should we care about congestion in current HPC systems?
How can congestion be managed?
ChallengesChallenges
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 2
Why may congestion become a problem?
• For three decades the goal of computer architects has been g pto keep the processors busy → top performance
• Interconnects were usually cheap, and never a bottleneckInterconnects were usually cheap, and never a bottleneck
• Now, global system performance in large systems is limited by the interconnection networkby the interconnection network
• Network saturation leads to congestion situations that may drastically degrade network performancedrastically degrade network performance
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 3
Contention
• Several packets from different flows request the same output port in a switch
• One packet makes progress, the others wait
Networkcontentioncontention
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 4
Congestion
• Persistent contention, mainly in network saturation state• Buffers containing packets belonging to flows involved in g p g g
contention become full
Persistentnetwork
contentioncontention
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 5
Congestion propagation
• In saturated lossless networks, congestion is quickly propagated by flow control, forming congestion trees
Flow control
Persistentnetwork
contention
Flow control
contention
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 6
Congestion propagation
• In saturated lossless networks, congestion is quickly propagated by flow control, forming congestion trees
• Congestion tree structure:Congestion
Congestiontree leaf
Congestion
tree branchCongestionpropagation may reach
the sourcestree root
Congestiontree branch
Congestion
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 7
tree leaf
Congestion trees and Head‐of‐Line blocking
• Congestion trees may cause Head‐of‐Line (HoL) blocking
Non-congested packets advance at the same speed as
congested ones
Congestion affectssources that do notsources that do notcause congestion
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 8
Network performance at saturation
HS HS
HS = traffic injected to Hot Spot destination
HSstarts
HSends
At saturation, network performance drops dramatically due to congestion situations
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 9
due to congestion situations
Should we currently care about congestion?
• Conflicting interests: cost vs. performance
• Saturation was traditionally avoided by overdimensioning the interconnection network
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 10
Network overdimensioning
• Many more components than really necessary
Off d t k b d idth i h hi h th th • Offered network bandwidth is much higher than the bandwidth requested by end nodes
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 11
Network overdimensioning
• Advantage: low link utilization → congestion is unlikelySaturation
Working zoneWorking zone
ency
Late
Injected trafficj• Disadvantages:
• Expensive (processors cheaper relative to interconnects)P ti i ( i li k d)
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 12
• Power consumption increases (growing link speed)
Should we currently care about congestion?
• Conflicting interests: cost vs. performance
• Saturation was traditionally avoided by overdimensioning the interconnection network → currently not suitabley
• No network overdimensioning?
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 13
Network not overdimensioned
• Only the components strictly necessary to interconnect all the processing nodesp g
• Offered network bandwidth decreases
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 14
Network not overdimensioned
• Advantages: cheaper, less power consumptionSaturation
zoneWorking zoneWorking zone
tenc
yLa
t
Injected traffic
• Disadvantage: high link utilization → congestion is likely
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 15
Should we currently care about congestion?
• Conflicting interests: cost vs. performance
• Saturation was traditionally avoided by overdimensioning the interconnection network →Currently not suitabley
• No overdimensioning→Danger when working with high traffic loads (close to the saturation point)( p )
• Network performance (throughput, latency) should be good under very different traffic patterns & load scenariosunder very different traffic patterns & load scenarios
• Traffic load may significantly vary over time, reaching saturationsaturation
Some strategy to deal with congestion is required
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 16
gy g q
The big picture:
Growing processorGrowing processordd
Growing linkd
Power consumptioni
Relative
speedspeed speed increases
Power managementProcessor pricesProcessor pricesdrop (demand)drop (demand)
Relative interconnectcost increases Smaller networks
Power management
Congestion PerformancePerformanceCongestion probability
grows
PerformancePerformancedegradationdegradationManagementStrategies
Bandwidth decreasesSaturation pointreached with lower traffic load
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 17
reached with lower traffic load
Benefits of congestion management
• Stable performance when the network reaches saturation
• No performance drop
• Delivers maximum achievable throughput
• Reacts quickly when power management turned somecomponents off and demand suddenly increasesp y
• Prevents performance degradation due to power management
• Enables more aggressive power saving strategies without risk• Enables more aggressive power saving strategies without risk
• Helps to keep performance when faults occur and faulttolerance techniques enable alternative pathstolerance techniques enable alternative paths
• Alternative paths may become congested (fewer resources are available)
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 18
How can congestion be managed?
• Different approaches to congestion management:
• Packet dropping
• Proactive techniques
• Reactive techniques
• HoL‐blocking elimination techniquesg q
• Hybrid techniques
• Related techniques • Related techniques
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 19
Packet dropping
• Packets in congested buffers are discarded
• Suitable for computer networks (like the Internet) but not suitable for most current HPC parallel applicationsp pp
• Both congested and non‐congested packets may be discardeddiscarded
• Discarded packets must be retransmitted, thus increasing final packet latencyfinal packet latency
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 20
Proactive congestion management
• A.K.A. congestion prevention
• Path setup before data transmission [1]• Path setup before data transmission [1]
• Used in ATM, computer networks (QoS)
• Optimal performance requires to know in advance:
• Resource requirements of each transmissionq
• Network status
• Knowledge about network status is not always available• Knowledge about network status is not always available
• High overhead, high setup latencies, poor link utilization (not i bl f HPC)suitable for HPC)
[1] P. Yew , N. Tzeng, D.H. Lawrie, “Distributing Hot‐Spot Addressing in Large‐Scale
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 21
Multiprocessors”, IEEE Transactions on Computers, 36(4): 388–395, 1987.
Reactive congestion management
• A.K.A. congestion recovery
• Injection limitation techniques (injection throttling) using • Injection limitation techniques (injection throttling) using closed‐loop feedback
D t l ll ith t k i d li k b d idth• Does not scale well with network size and link bandwidth– Notification delay (proportional to distance / number of hops)
– Link and buffer capacity (proportional to clock frequency)– Link and buffer capacity (proportional to clock frequency)
– May produce traffic oscillations (closed loop system with pure delay)
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 22
Reactive congestion management
• Example: Infiniband FECN/BECN mechanism [2]:
• Two bits in the packet header are reserved for congestion notificationTwo bits in the packet header are reserved for congestion notification
• If a switch port is considered as congested, the Forward Explicit Congestion Notification (FECN) bit in the header of packets crossing that g p gport is set
• Upon reception of such a “FECN‐marked” packet, a destination will k f k h h d llreturn a packet (Congestion Notification Packet, CNP) whose header will
have the Backward Explicit Congestion Notification (BECN) bit set back to the source
• Any source receiving a “BECN‐marked” packet will then reduce its packet injection rate for this traffic flow
[2] E.G. Gran, M. Eimot, S.A. Reinemo, T. Skeie, O. Lysne, L. Huse, G. Shainer, “First experiences
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 23
with congestion control in InfiniBand hardware”, in Proceedings of IPDPS 2010, pp. 1–12.
HoL‐blocking elimination techniques
• Key idea:
The real problem is not the congestion itself, but its negative effect (HoL blocking)effect (HoL blocking)
By eliminating HoL blocking, congestion becomes harmless
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 24
Example of HoL‐blocking due to congestionShould congested flows be throttled? Should congested flows be throttled?
33 %33 %
Congested flows
N t d fl
Sw. 1
Sw. 5Src. 0
33 %33 %
Non-congested flows
Sw. 2S 1 33 %
33 %33 %
100 %Sw. 6 Sw. 8
Dst. 1
Src. 1
33 % 33 %
33 %66 %Sw. 3 Sw. 7
Dst. 2
Src 2
33 % Sending33 % Stopped
Sw 4
Src. 2
33 % 33 %
33 % Sending
Sw. 4Src. 3
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 25
Example of real‐life HoL‐blocking The A 31 highway metaphorThe A‐31 highway metaphor
Bottleneck
A-31
A-43
The flow is affected by the bottleneck of the the bottleneck of the
A‐31 highway
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 26
Map Source: Google Maps
HoL‐blocking elimination techniques
• In general, these techniques rely on having different queues t h t t t diff t k t flat each port to separate different packet flows
• They differ mainly in the criteria to map packets to queues and in the number of required queues per port
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 27
HoL‐blocking elimination techniques
• VOQnet (Virtual Output Queuing at network level) [3]( p g ) [3]
• A separate queue at each input port for every destination
• Packets with the same destination are stored in the same queue• Packets with the same destination are stored in the same queue
Selected_Queue = Packet_Destination
• Completely eliminates HoL‐blocking
_ _
• Number of required buffer resources increases at least quadratically with network size !!!
[3] W. Dally, P. Carvey, L. Dennison, “Architecture of the Avici terabit switch/router”, in Proceedings
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 28
of 6th Hot Interconnects, 1998, pp. 41–50.
HoL‐blocking elimination techniques
• VOQsw (Virtual Output Queuing at switch level) [4]& DAMQs (Dynamically Allocated Multi‐Queues) [5]DAMQs (Dynamically Allocated Multi Queues) [5]
• A separate queue at every input port for every output port
P k t ti th t t t d i th • Packets requesting the same output are stored in the same queue
Selected Queue = Requested Output Port
• Better than nothing but they do not completely eliminate HoL‐blocking
Selected_Queue = Requested_Output_Port
• Effectiveness depends on topology and traffic pattern
[4] T. Anderson, S. Owicki, J. Saxe, C. Thacker, “High‐speed switch scheduling for local‐area networks,” ACM Transactions on Computer Systems, vol. 11 (4), pp. 319–352, November 1993.
[5] Y. Tamir , G. Frazier, “Dynamically‐allocated multi‐queue buffers for VLSI communication
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 29
switches”, IEEE Transactions on Computers,vol. 41 (6), June 1992.
HoL‐blocking elimination techniques
• DBBM (Destination‐Based Buffer Management) [6]( g ) [ ]• Several groups of destinations are defined
• A separate queue for each group at every port (q queues per port)
• Packets with destinations in the same group are stored at the same queue
Selected Queue = Packet Destination MOD q
• Does not completely eliminate HoL‐blocking
Selected_Queue = Packet_Destination MOD q
• Effectiveness depends on the number of queues, topology and traffic pattern
[6] T. Nachiondo, J. Flich, J. Duato, “Buffer management strategies to reduce HoL‐blocking”, IEEE
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 30
ff g g gTransactions on Parallel and Distributed Systems, vol. 21 (6), pp. 739–753, 2010.
HoL‐blocking elimination techniques
• OBQA (Output‐Based Queue Assignment) [7]• Suitable for fat‐trees with DESTRO routing
• Queue assignment linked with topology & routing algorithm
• Reduces HoL‐blocking with the minimum number of queues per port (q)
S l t d Q R t d O t t P t MOD
• q smaller than half the switch radix
Selected_Queue = Requested_Output_Port MOD q
q smaller than half the switch radix
• Does not completely eliminate HoL‐blocking
• Effectiveness depends on the number of queues
[7] J. Escudero‐Sahuquillo, P. J. García, F. J. Quiles, J. Duato, “An efficient strategy for reducing head‐of‐line blocking in fat‐trees”, in LNCS vol. 6272, pp. 413–427. Proceedings of 16th
l C f ( ) h l S
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 31
International Euro‐Par Conference (II), Ischia, Italy, Sept. 2010.
Performance comparisonPerformance comparisonUniform traffic simulation resultsUniform traffic simulation results
Network Latency (cycles) vs Normalized Generated Trafficy ( y )
4‐ary 4‐tree 8x8 switches
16‐ary 2‐tree 32x32 switches
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 32
HoL‐blocking elimination techniques• RECN (Regional Explicit Congestion Notification) [8] & FBICM (Flow‐Based Implicit Congestion Management) [9]RECN h b d f b d ti t k hil • RECN has been proposed for source‐based routing networks while FBICM for distributed table‐based routing networks
• The key difference with respect to previous techniques is that they y p p q ycompletely and dynamically isolate congested flows
• Basics:
f f f– Explicit identification of congested flows
– Storage of congestion information
Dynamic queue allocation to isolate congested flows– Dynamic queue allocation to isolate congested flows
[8] P. J. García, J. Flich, J. Duato, I. Johnson, F. J. Quiles, F. Naven, “Efficient, scalable congestion management for interconnection networks” IEEE Micro vol 26 (5) pp 52–66 September management for interconnection networks , IEEE Micro, vol. 26 (5), pp. 52 66, September 2006.
[9] J. Escudero‐Sahuquillo, P. J. García, F. J. Quiles, J. Flich, J. Duato, “Cost‐effective congestion management for interconnection networks using distributed deterministic routing”, in
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 33
Proceedings of ICPADS 2010, Shanghai, China, December 2010.
RECN/FBICM basic procedure• Congested points are detected at any port of the network by
measuring queue occupancyTh l i f d d d i i d i • The location of any detected congested point is stored in a control memory (a CAM line) at any port forwarding packets towards the congested point:towards the congested point:• RECN: an explicit route is stored• FBICM: a list of destinations is stored to implicitly locate the pointp y p
• A special queue associated to the CAM line is also allocated to exclusively store packets addressed to that congested point
• Congestion information is progressively notified to any port in other switches crossed by congested flows, where new CAM li d i l ll d lines and special queues are allocated
• A packet arriving at a port is stored in the standard queue only if its ro ting information does not match an CAM line
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 34
if its routing information does not match any CAM line
RECN/FBICM queue requirements
• Non‐congested packets can share queues without suffering significant HoL‐blocking → only one standard queue per port
• Special queues are allocated/deallocated when required, thus congested packets can be separately buffered by using a small
b f l bl k d dnumber of special queues per port→HoL‐blocking produced by congested packets is eliminated in a scalable way
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 35
RECN/FBICM drawbacks
• In scenarios with a lot of different congested points, it is possible to run out of special queues at some ports
• The need for CAMs at switch ports increases implementation cost and required silicon area per port
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 36
Hybrid congestion management strategies
• Example: Combining Injection Throttling and FBICM [10]:
• Use FBICM to quickly and locally eliminate HoL‐blocking propagating Use FBICM to quickly and locally eliminate HoL blocking, propagating congestion information and allocating queues as necessary
• Use reactive congestion management to slowly eliminate congestion, g g y gdeallocating FBICM queues whenever possible
• Use of FBICM provides immediate response and allows reactive b d f l h dcongestion management to be tuned for slow reaction, thus avoiding
oscillations
• Reactive congestion management drastically reduces FBICM buffer • Reactive congestion management drastically reduces FBICM buffer requirements (just one or two queues per port)
[10] J. Escudero‐Sahuquillo, E. G. Gran, P.J. García, J. Flich, T. Skeie, O. Lysne, F.J. Quiles, J . Duato, “Combining Congested‐Flow Isolation and Injection Throttling in HPC Interconnection
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 37
g g j gNetworks “, to appear in Proceedings of ICPP 2011.
Performance comparisonPerformance comparisonHot spot scenario simulation resultsHot‐spot scenario simulation results
Network Normalized Throughput vs Timeg p
4‐ary 3‐tree 1 hot‐spot
4‐ary 3‐tree 4 hot‐spots
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 38
Related techniques
• Adaptive Routing/Traffic balancing
M h l d l h f i• May help to delay the occurrence of congestion
• Useless when heavy congestion arises
• Problems regarding in‐order packet delivery
• Existing congestion management techniques do not work correctly with d i i ( d i )adaptive routing (congested points may vary)
• Adaptive routing may spread congestion over more links
• Virtual Channels
P f d d h l ( ) i t• Performance depends on channel (queue) assignment
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 39
Challenges
• To develop congestion management techniques that react locally and immediatelywhen congestion ariseslocally and immediatelywhen congestion arises
• To make congestion management techniques truly scalable
• To achieve coordination among end nodes without explicit communication among them
• To eliminate instabilities and oscillatory responses
• To minimize the number of extra resources needed to handle • To minimize the number of extra resources needed to handle congestion
T k i ibl i h d i • To make congestion management compatible with adaptive routing
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 40
AcknowledgementsAcknowledgements
• Jose Duato (Universitat Politecnica de Valencia), whogenerously gave us the main ideas behind our congestiongenerously gave us the main ideas behind our congestionmanagement proposals
J Fli h (U i i P li i d V l i ) d J• Jose Flich (Universitat Politecnica de Valencia) and JesusEscudero‐Sahuquillo (Universidad de Castilla‐La Mancha),who have developed alongside me all our congestionwho have developed alongside me all our congestionmanagement proposals
• The technique combining reactive congestion management• The technique combining reactive congestion managementand FBICM has been developed in collaboration with SimulaResearch Laboratory (Oslo)Research Laboratory (Oslo)
“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 41