TRANSCRIPT
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
Saurabh Jha*, Archit Patke*, Jim Brandt^, Ann Gentile^, Mike Showerman^, Eric Roman^^, Zbigniew Kalbarczyk*, Bill Kramer*, Ravi Iyer*
* UIUC/NCSA, ^ Sandia National Lab, ^^ NERSC
Presented on August 16, 2019 at HOTI 26
Email: [email protected]
DOE HPC Facilities
Source: HPCwire
HPC interconnect technologies:
• Cray Gemini (3D Torus), Aries (DragonFly), Slingshot (DragonFly)
• Mellanox InfiniBand (Non-blocking Fat Tree)
• IBM BG/Q proprietary technology (5D Torus)
Studied HPC Systems
• NCSA Blue Waters: Cray Gemini (3-D Torus) interconnect
• NERSC Edison: Cray Aries (DragonFly) interconnect
Mystery of Application Performance Variation
MILC Completion Time (NERSC Edison); NAMD Completion Time (NCSA Blue Waters)
Variation caused by:
• Interference from other applications
• Non-optimal configuration settings (too many knobs!), e.g., huge pages, placement, message size, node sharing
[Figure: completion-time variation across repeated runs. NAMD on NCSA Blue Waters: mean runtime 318 min, standard deviation 103 min. MILC on NERSC Edison: mean runtime 576 s, standard deviation 100 s.]
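The run-to-run variability quoted on this slide can be summarized as a coefficient of variation; a minimal sketch (the function name is illustrative, the numbers are the means and standard deviations from the slide):

```python
def coefficient_of_variation(mean: float, std: float) -> float:
    """Relative runtime variability: standard deviation divided by mean."""
    return std / mean

# Slide numbers: NAMD on Blue Waters (minutes), MILC on Edison (seconds).
namd_cv = coefficient_of_variation(318.0, 103.0)  # ~0.32
milc_cv = coefficient_of_variation(576.0, 100.0)  # ~0.17
```

A CV of ~0.32 means a single NAMD run's length is hard to predict to better than roughly a third of its mean, which is what motivates the congestion study.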
Monet: End-to-End Interconnect Monitoring Workflow
Data Measurement and Metrics
• Congestion measured in terms of Percent Time Stalled (PTS):
  PTS = 100 × (T_s / T_I)
  where T_s is the time spent in the stalled state and T_I is the measurement interval
• On Blue Waters: T_I is 60 s; data gathered: ~700 MB/minute
• On Edison: T_I is 1 s; data gathered: ~7.5 GB/minute
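A minimal sketch of the PTS calculation above (function and variable names are illustrative, not from the Monet code):

```python
def percent_time_stalled(stall_time: float, interval: float) -> float:
    """PTS = 100 * T_s / T_I, where T_s is the time spent in the stalled
    state and T_I is the measurement interval (same time units)."""
    if not 0.0 <= stall_time <= interval:
        raise ValueError("stall time must lie within the measurement interval")
    return 100.0 * stall_time / interval

# Example: a link stalled for 9 s of a 60 s Gemini sampling interval.
pts = percent_time_stalled(9.0, 60.0)  # 15.0
```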
[Figure: NAMD execution time (min) vs. max of average PTS across all regions overlapping the application topology (%), binned into Neg, Low, Med, and High congestion classes.]
Correlation Between NAMD Completion Time and Network Congestion (NCSA Blue Waters); Congestion Cloud (NCSA Blue Waters)
Characterizing Congestion On Toroidal Networks
Performance Variation Depends on Duration, Size and Intensity of Congestion Clouds
Low Medium High
Neg: 0-5%, Low: 5-15%, Medium: 15-25%, High: >25%
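These thresholds can be expressed as a simple binning function; a sketch in which the class names and ranges follow the slide, while the handling of values exactly at 5/15/25% is an assumption:

```python
def congestion_class(pts: float) -> str:
    """Map a PTS value (%) to the congestion classes from the study:
    Neg: 0-5%, Low: 5-15%, Medium: 15-25%, High: >25%."""
    if pts < 5.0:
        return "Neg"
    if pts < 15.0:
        return "Low"
    if pts <= 25.0:
        return "Med"
    return "High"

# Example: a link at 18% PTS falls in the Medium class.
```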
Characterizing Congestion On Toroidal Networks
[Figure: Hotspot link duration (NCSA Blue Waters). Duration (minutes) vs. PTS threshold (%), showing the median, 99th, and 99.9th percentiles.]
• Continuous presence of highly congested links
• 95% of the operation time contained a region with a minimum size of 20 links
• Maximum region size: 6904 links (17% of all links)
• Average congestion duration: 16 minutes; 95th percentile: 16 hours
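The duration statistics above (median, tail percentiles) can be computed with a simple nearest-rank percentile; a sketch over hypothetical per-hotspot durations, not data from the study:

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value such that at least p%
    of the samples are at or below it (0 < p <= 100)."""
    s = sorted(values)
    rank = max(1, -(-len(s) * p // 100))  # ceil(len(s) * p / 100)
    return s[int(rank) - 1]

# Hypothetical hotspot durations, in minutes:
durations = [1.0, 2.0, 4.0, 8.0, 16.0, 960.0]
median = percentile(durations, 50)  # 4.0
tail = percentile(durations, 95)    # 960.0
```

The large gap between the median and the 95th percentile in such skewed data is why the slide reports both.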
[Figure: congestion-state transition diagram among the classes Neg, Low, Med, and High, annotated with state-transition probabilities. Congestion cloud as % of total system-time: Neg 98.29%, Low 1.36%, Med 0.25%, High 0.10%.]
Comparing with DragonFly Network – Link Hotspot Characterization
[Figure: Hotspot link duration for Aries (NERSC Edison). Duration (minutes) vs. PTS threshold (%), showing the median, 99th, and 99.9th percentiles.]
• Duration of continuous congestion on links is significantly reduced in DragonFly
• Hotspots continue to exist
[Figure: Hotspot link duration (NCSA Blue Waters), repeated for comparison.]
Measuring Load Imbalance: Equivalence Groups
Example: node N1 sending data to node N2 over links L1 and L2, which belong to an equivalence group.

Congestion Imbalanced Scenario:
Link | Congestion (PTS)
L1   | 20%
L2   | 10%

Congestion Balanced Scenario (Ideal):
Link | Congestion (PTS)
L1   | 15%
L2   | 15%
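The imbalance across an equivalence group can be summarized by the range of PTS over its links; a sketch matching the "Range of PTS in equivalence groups" quantity used in the comparison, with an illustrative function name:

```python
def pts_range(group_pts: list[float]) -> float:
    """Range (max - min) of PTS across the links of one equivalence
    group; 0 means the group is perfectly load-balanced."""
    return max(group_pts) - min(group_pts)

# Imbalanced scenario from the slide: L1 at 20% PTS, L2 at 10% PTS.
imbalanced = pts_range([20.0, 10.0])  # 10.0
# Balanced (ideal) scenario: both links at 15% PTS.
balanced = pts_range([15.0, 15.0])    # 0.0
```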
Comparing with DragonFly Network – Load Imbalance Characterization
[Figure: inverse CDF of the range of PTS within equivalence groups (%), plotted separately for local and global links.]
• Load imbalance exists on local links
• Load balance is significantly improved on global links
Existence of congestion cloud for long duration (NCSA Blue Waters)
(Cray Testbed System)
Demo – 5 minutes of congestion visualization on Edison
[Visualization hierarchy: Cabinet, Chassis, Blade, Aries Switch. Color legend: >15% PTS is dark blue; 0%-15% is a green gradient.]
• Duration of congestion hotspots is short
• Hotspots move rapidly, and applications continue to experience congestion
Going Forward: Taking a Systems Approach to Minimizing Performance Variation
Sources of Contention
Multiple sources for contention/congestion
[Diagram: contention map for traffic flowing from P1 to P4 through NICs and network switches, with numbered contention points 1-5 along the path.]
Contention points and reasons:
• <src> processor: cache conflict, TLB, OS resources
• <src>/<dst> NIC: NIC buffers busy
• Network switches: head-of-line blocking due to busy buffers
• <dst> processor: not ready to receive data
Associated metrics: proc2nic, nic2proc, hl2hl
ML-driven Performance Prediction in Aries
Feature Type                       | Feature List                                 | Avg R²
Application and Scheduler Specific | app, hugepages, placement, balance, avg hops | 0.64
Network Specific                   | nic2proc, proc2nic, hl2hl PTS & SPF          | 0.86
All features                       | All of the above                             | 0.91
• Diagnosis for understanding performance variation
• Understanding sensitivity to each configuration parameter
• Scheduling policy
Method: ensemble models with k-fold cross-validation
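The evaluation loop behind the R² table above can be sketched as follows. This is a toy illustration: a trivial mean predictor stands in for the actual ensemble models, the data is synthetic, and all names are illustrative rather than from the study's code:

```python
import random

def k_fold_indices(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def r2_score(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination, the metric reported in the table."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cross_validate(y: list[float], k: int = 5) -> float:
    """Average R^2 over k folds; a real study would fit an ensemble
    model on the feature sets here instead of the training mean."""
    scores = []
    for fold in k_fold_indices(len(y), k):
        train = [y[i] for i in range(len(y)) if i not in fold]
        pred = sum(train) / len(train)  # "model" = training-set mean
        test = [y[i] for i in fold]
        scores.append(r2_score(test, [pred] * len(test)))
    return sum(scores) / k
```

Averaging held-out R² across folds, as sketched here, is what makes the "Avg R²" column comparable across the three feature sets.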
1. Increase observability into system and applications
2. Backpressure model for contention using probabilistic methods
3. Use explainable machine-learning methods
Conclusion and Future Work
Conclusion
• Congestion studies across generations of production systems help improve understanding of:
  • App. performance variation: diagnostic models
  • Parameter tuning: selecting optimal parameters
  • Network design: load imbalance continues to be a problem
Future work
• Characterize and understand upcoming QoS features in Slingshot
• ML-driven scheduling for HPC kernels
References