congestion in hpc interconnection networks€¦ · by the interconnection network ... • knowledge...

Congestion Management in HPC Interconnection NetworksInterconnection Networks

Pedro J GarcíaPedro J. GarcíaUniversidad de Castilla-La Mancha (SPAIN)

Conference title 1

OutlineOutline

Why may congestion become a problem?

Should we care about congestion in current HPC systems?

How can congestion be managed?

ChallengesChallenges

“Congestion Management in HPC Interconnection Networks” , Pedro J. GarcíaHPC Advisory Council European Workshop , Hamburg , June 19th 2011 2

Why may congestion become a problem?

• For three decades the goal of computer architects has been g pto keep the processors busy → top performance

• Interconnects were usually cheap, and never a bottleneckInterconnects were usually cheap, and never a bottleneck

• Now, global system performance in large systems is limited by the interconnection networkby the interconnection network

• Network saturation leads to congestion situations that may drastically degrade network performancedrastically degrade network performance


Contention

• Several packets from different flows request the same output port in a switch

• One packet makes progress, the others wait

Networkcontentioncontention


Congestion

• Persistent contention, mainly in network saturation state• Buffers containing packets belonging to flows involved in g p g g

contention become full

Persistentnetwork

contentioncontention


Congestion propagation

• In saturated lossless networks, congestion is quickly propagated by flow control, forming congestion trees

Flow control

Persistentnetwork

contention

Flow control

contention


Congestion propagation

• In saturated lossless networks, congestion is quickly propagated by flow control, forming congestion trees

• Congestion tree structure:Congestion

Congestiontree leaf

Congestion

tree branchCongestionpropagation may reach

the sourcestree root

Congestiontree branch

Congestion


tree leaf

Congestion trees and Head‐of‐Line blocking

• Congestion trees may cause Head‐of‐Line (HoL) blocking

Non-congested packets advance at the same speed as

congested ones

Congestion affectssources that do notsources that do notcause congestion


Network performance at saturation

HS HS

HS = traffic injected to Hot Spot destination

HSstarts

HSends

At saturation, network performance drops dramatically due to congestion situations


due to congestion situations

Should we currently care about congestion?

• Conflicting interests: cost vs. performance

• Saturation was traditionally avoided by overdimensioning the interconnection network


Network overdimensioning

• Many more components than really necessary

Off d t k b d idth i h hi h th th • Offered network bandwidth is much higher than the bandwidth requested by end nodes


Network overdimensioning

• Advantage: low link utilization → congestion is unlikelySaturation

Working zoneWorking zone

ency

Late

Injected trafficj• Disadvantages:

• Expensive (processors cheaper relative to interconnects)P ti i ( i li k d)


• Power consumption increases (growing link speed)



• Saturation was traditionally avoided by overdimensioning the interconnection network → currently not suitabley

• No network overdimensioning?


Network not overdimensioned

• Only the components strictly necessary to interconnect all the processing nodesp g

• Offered network bandwidth decreases


Network not overdimensioned

• Advantages: cheaper, less power consumptionSaturation

zoneWorking zoneWorking zone

tenc

yLa

t

Injected traffic

• Disadvantage: high link utilization → congestion is likely




• Saturation was traditionally avoided by overdimensioning the interconnection network →Currently not suitabley

• No overdimensioning→Danger when working with high traffic loads (close to the saturation point)( p )

• Network performance (throughput, latency) should be good under very different traffic patterns & load scenariosunder very different traffic patterns & load scenarios

• Traffic load may significantly vary over time, reaching saturationsaturation

Some strategy to deal with congestion is required


gy g q

The big picture:

Growing processorGrowing processordd

Growing linkd

Power consumptioni

Relative

speedspeed speed increases

Power managementProcessor pricesProcessor pricesdrop (demand)drop (demand)

Relative interconnectcost increases Smaller networks

Power management

Congestion PerformancePerformanceCongestion probability

grows

PerformancePerformancedegradationdegradationManagementStrategies

Bandwidth decreasesSaturation pointreached with lower traffic load


reached with lower traffic load

Benefits of congestion management

• Stable performance when the network reaches saturation

• No performance drop

• Delivers maximum achievable throughput

• Reacts quickly when power management turned somecomponents off and demand suddenly increasesp y

• Prevents performance degradation due to power management

• Enables more aggressive power saving strategies without risk• Enables more aggressive power saving strategies without risk

• Helps to keep performance when faults occur and faulttolerance techniques enable alternative pathstolerance techniques enable alternative paths

• Alternative paths may become congested (fewer resources are available)


How can congestion be managed?

• Different approaches to congestion management:

• Packet dropping

• Proactive techniques

• Reactive techniques

• HoL‐blocking elimination techniquesg q

• Hybrid techniques

• Related techniques • Related techniques


Packet dropping

• Packets in congested buffers are discarded

• Suitable for computer networks (like the Internet) but not suitable for most current HPC parallel applicationsp pp

• Both congested and non‐congested packets may be discardeddiscarded

• Discarded packets must be retransmitted, thus increasing final packet latencyfinal packet latency


Proactive congestion management

• A.K.A. congestion prevention

• Path setup before data transmission [1]• Path setup before data transmission [1]

• Used in ATM, computer networks (QoS)

• Optimal performance requires to know in advance:

• Resource requirements of each transmissionq

• Network status

• Knowledge about network status is not always available• Knowledge about network status is not always available

• High overhead, high setup latencies, poor link utilization (not i bl f HPC)suitable for HPC)

[1] P. Yew , N. Tzeng, D.H. Lawrie, “Distributing Hot‐Spot Addressing in Large‐Scale


Multiprocessors”, IEEE Transactions on Computers, 36(4): 388–395, 1987.

Reactive congestion management

• A.K.A. congestion recovery

• Injection limitation techniques (injection throttling) using • Injection limitation techniques (injection throttling) using closed‐loop feedback

D t l ll ith t k i d li k b d idth• Does not scale well with network size and link bandwidth– Notification delay (proportional to distance / number of hops)

– Link and buffer capacity (proportional to clock frequency)– Link and buffer capacity (proportional to clock frequency)

– May produce traffic oscillations (closed loop system with pure delay)


Reactive congestion management

• Example: Infiniband FECN/BECN mechanism [2]:

• Two bits in the packet header are reserved for congestion notificationTwo bits in the packet header are reserved for congestion notification

• If a switch port is considered as congested, the Forward Explicit Congestion Notification (FECN) bit in the header of packets crossing that g p gport is set

• Upon reception of such a “FECN‐marked” packet, a destination will k f k h h d llreturn a packet (Congestion Notification Packet, CNP) whose header will

have the Backward Explicit Congestion Notification (BECN) bit set back to the source

• Any source receiving a “BECN‐marked” packet will then reduce its packet injection rate for this traffic flow

[2] E.G. Gran, M. Eimot, S.A. Reinemo, T. Skeie, O. Lysne, L. Huse, G. Shainer, “First experiences


with congestion control in InfiniBand hardware”, in Proceedings of IPDPS 2010, pp. 1–12.

HoL‐blocking elimination techniques

• Key idea:

The real problem is not the congestion itself, but its negative effect (HoL blocking)effect (HoL blocking)

By eliminating HoL blocking, congestion becomes harmless


Example of HoL‐blocking due to congestionShould congested flows be throttled? Should congested flows be throttled?

33 %33 %

Congested flows

N t d fl

Sw. 1

Sw. 5Src. 0

33 %33 %

Non-congested flows

Sw. 2S 1 33 %

33 %33 %

100 %Sw. 6 Sw. 8

Dst. 1

Src. 1

33 % 33 %

33 %66 %Sw. 3 Sw. 7

Dst. 2

Src 2

33 % Sending33 % Stopped

Sw 4

Src. 2

33 % 33 %

33 % Sending

Sw. 4Src. 3


Example of real‐life HoL‐blocking The A 31 highway metaphorThe A‐31 highway metaphor

Bottleneck

A-31

A-43

The flow is affected by the bottleneck of the the bottleneck of the

A‐31 highway


Map Source: Google Maps


• In general, these techniques rely on having different queues t h t t t diff t k t flat each port to separate different packet flows

• They differ mainly in the criteria to map packets to queues and in the number of required queues per port



• VOQnet (Virtual Output Queuing at network level) [3]( p g ) [3]

• A separate queue at each input port for every destination

• Packets with the same destination are stored in the same queue• Packets with the same destination are stored in the same queue

Selected_Queue = Packet_Destination

• Completely eliminates HoL‐blocking

_ _

• Number of required buffer resources increases at least quadratically with network size !!!

[3] W. Dally, P. Carvey, L. Dennison, “Architecture of the Avici terabit switch/router”, in Proceedings


of 6th Hot Interconnects, 1998, pp. 41–50.


• VOQsw (Virtual Output Queuing at switch level) [4]& DAMQs (Dynamically Allocated Multi‐Queues) [5]DAMQs (Dynamically Allocated Multi Queues) [5]

• A separate queue at every input port for every output port

P k t ti th t t t d i th • Packets requesting the same output are stored in the same queue

Selected Queue = Requested Output Port

• Better than nothing but they do not completely eliminate HoL‐blocking

Selected_Queue = Requested_Output_Port

• Effectiveness depends on topology and traffic pattern

[4] T. Anderson, S. Owicki, J. Saxe, C. Thacker, “High‐speed switch scheduling for local‐area networks,” ACM Transactions on Computer Systems, vol. 11 (4), pp. 319–352, November 1993.

[5] Y. Tamir , G. Frazier, “Dynamically‐allocated multi‐queue buffers for VLSI communication


switches”, IEEE Transactions on Computers,vol. 41 (6), June 1992.


• DBBM (Destination‐Based Buffer Management) [6]( g ) [ ]• Several groups of destinations are defined

• A separate queue for each group at every port (q queues per port)

• Packets with destinations in the same group are stored at the same queue

Selected Queue = Packet Destination MOD q

• Does not completely eliminate HoL‐blocking

Selected_Queue = Packet_Destination MOD q

• Effectiveness depends on the number of queues, topology and traffic pattern

[6] T. Nachiondo, J. Flich, J. Duato, “Buffer management strategies to reduce HoL‐blocking”, IEEE


ff g g gTransactions on Parallel and Distributed Systems, vol. 21 (6), pp. 739–753, 2010.


• OBQA (Output‐Based Queue Assignment) [7]• Suitable for fat‐trees with DESTRO routing

• Queue assignment linked with topology & routing algorithm

• Reduces HoL‐blocking with the minimum number of queues per port (q)

S l t d Q R t d O t t P t MOD

• q smaller than half the switch radix

Selected_Queue = Requested_Output_Port MOD q

q smaller than half the switch radix

• Does not completely eliminate HoL‐blocking

• Effectiveness depends on the number of queues

[7] J. Escudero‐Sahuquillo, P. J. García, F. J. Quiles, J. Duato, “An efficient strategy for reducing head‐of‐line blocking in fat‐trees”, in LNCS vol. 6272, pp. 413–427. Proceedings of 16th

l C f ( ) h l S


International Euro‐Par Conference (II), Ischia, Italy, Sept. 2010.

Performance comparisonPerformance comparisonUniform traffic simulation resultsUniform traffic simulation results

Network Latency (cycles) vs Normalized Generated Trafficy ( y )

4‐ary 4‐tree 8x8 switches

16‐ary 2‐tree 32x32 switches


HoL‐blocking elimination techniques• RECN (Regional Explicit Congestion Notification) [8] & FBICM (Flow‐Based Implicit Congestion Management) [9]RECN h b d f b d ti t k hil • RECN has been proposed for source‐based routing networks while FBICM for distributed table‐based routing networks

• The key difference with respect to previous techniques is that they y p p q ycompletely and dynamically isolate congested flows

• Basics:

f f f– Explicit identification of congested flows

– Storage of congestion information

Dynamic queue allocation to isolate congested flows– Dynamic queue allocation to isolate congested flows

[8] P. J. García, J. Flich, J. Duato, I. Johnson, F. J. Quiles, F. Naven, “Efficient, scalable congestion management for interconnection networks” IEEE Micro vol 26 (5) pp 52–66 September management for interconnection networks , IEEE Micro, vol. 26 (5), pp. 52 66, September 2006.

[9] J. Escudero‐Sahuquillo, P. J. García, F. J. Quiles, J. Flich, J. Duato, “Cost‐effective congestion management for interconnection networks using distributed deterministic routing”, in


Proceedings of ICPADS 2010, Shanghai, China, December 2010.

RECN/FBICM basic procedure• Congested points are detected at any port of the network by

measuring queue occupancyTh l i f d d d i i d i • The location of any detected congested point is stored in a control memory (a CAM line) at any port forwarding packets towards the congested point:towards the congested point:• RECN: an explicit route is stored• FBICM: a list of destinations is stored to implicitly locate the pointp y p

• A special queue associated to the CAM line is also allocated to exclusively store packets addressed to that congested point

• Congestion information is progressively notified to any port in other switches crossed by congested flows, where new CAM li d i l ll d lines and special queues are allocated

• A packet arriving at a port is stored in the standard queue only if its ro ting information does not match an CAM line


if its routing information does not match any CAM line

RECN/FBICM queue requirements

• Non‐congested packets can share queues without suffering significant HoL‐blocking → only one standard queue per port

• Special queues are allocated/deallocated when required, thus congested packets can be separately buffered by using a small

b f l bl k d dnumber of special queues per port→HoL‐blocking produced by congested packets is eliminated in a scalable way


RECN/FBICM drawbacks

• In scenarios with a lot of different congested points, it is possible to run out of special queues at some ports

• The need for CAMs at switch ports increases implementation cost and required silicon area per port


Hybrid congestion management strategies

• Example: Combining Injection Throttling and FBICM [10]:

• Use FBICM to quickly and locally eliminate HoL‐blocking propagating Use FBICM to quickly and locally eliminate HoL blocking, propagating congestion information and allocating queues as necessary

• Use reactive congestion management to slowly eliminate congestion, g g y gdeallocating FBICM queues whenever possible

• Use of FBICM provides immediate response and allows reactive b d f l h dcongestion management to be tuned for slow reaction, thus avoiding

oscillations

• Reactive congestion management drastically reduces FBICM buffer • Reactive congestion management drastically reduces FBICM buffer requirements (just one or two queues per port)

[10] J. Escudero‐Sahuquillo, E. G. Gran, P.J. García, J. Flich, T. Skeie, O. Lysne, F.J. Quiles, J . Duato, “Combining Congested‐Flow Isolation and Injection Throttling in HPC Interconnection


g g j gNetworks “, to appear in Proceedings of ICPP 2011.

Performance comparisonPerformance comparisonHot spot scenario simulation resultsHot‐spot scenario simulation results

Network Normalized Throughput vs Timeg p

4‐ary 3‐tree 1 hot‐spot

4‐ary 3‐tree 4 hot‐spots


Related techniques

• Adaptive Routing/Traffic balancing

M h l d l h f i• May help to delay the occurrence of congestion

• Useless when heavy congestion arises

• Problems regarding in‐order packet delivery

• Existing congestion management techniques do not work correctly with d i i ( d i )adaptive routing (congested points may vary)

• Adaptive routing may spread congestion over more links

• Virtual Channels

P f d d h l ( ) i t• Performance depends on channel (queue) assignment


Challenges

• To develop congestion management techniques that react locally and immediatelywhen congestion ariseslocally and immediatelywhen congestion arises

• To make congestion management techniques truly scalable

• To achieve coordination among end nodes without explicit communication among them

• To eliminate instabilities and oscillatory responses

• To minimize the number of extra resources needed to handle • To minimize the number of extra resources needed to handle congestion

T k i ibl i h d i • To make congestion management compatible with adaptive routing


AcknowledgementsAcknowledgements

• Jose Duato (Universitat Politecnica de Valencia), whogenerously gave us the main ideas behind our congestiongenerously gave us the main ideas behind our congestionmanagement proposals

J Fli h (U i i P li i d V l i ) d J• Jose Flich (Universitat Politecnica de Valencia) and JesusEscudero‐Sahuquillo (Universidad de Castilla‐La Mancha),who have developed alongside me all our congestionwho have developed alongside me all our congestionmanagement proposals

• The technique combining reactive congestion management• The technique combining reactive congestion managementand FBICM has been developed in collaboration with SimulaResearch Laboratory (Oslo)Research Laboratory (Oslo)


Thanks!!Any question? y q

Conference title 42

congestion in hpc interconnection networks€¦ · by the interconnection network ... • knowledge...

Documents