TRANSCRIPT
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
Saurabh Jha*, Archit Patke*, Jim Brandt^, Ann Gentile^, Mike Showerman^, Eric Roman^^, Zbigniew Kalbarczyk*, Bill Kramer*, Ravi Iyer*
* UIUC/NCSA, ^ Sandia National Lab, ^^ NERSC
Presented on August 16, 2019 at HOTI 26
Email: [email protected]
DOE HPC Facilities
Source: HPCwire
HPC interconnect technologies:
• Cray Gemini (3D Torus), Aries (DragonFly), Slingshot (DragonFly)
• Mellanox InfiniBand (Non-blocking Fat Tree)
• IBM BG/Q proprietary technology (5D Torus)
Studied HPC Systems
• NCSA Blue Waters: Cray Gemini (3-D Torus) interconnect
• NERSC Edison: Cray Aries (DragonFly) interconnect
Mystery of Application Performance Variation
MILC Completion Time (NERSC Edison); NAMD Completion Time (NCSA Blue Waters)
Variation caused by:
• Interference from other applications
• Non-optimal configuration settings (too many knobs!), e.g., huge pages, placement, message size, node sharing
[Figure: completion-time variation across repeated runs. NAMD on NCSA Blue Waters: mean runtime 318 min, standard deviation 103 min. MILC on NERSC Edison: mean runtime 576 s, standard deviation 100 s.]
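The run-to-run variability quoted on this slide can be summarized as a coefficient of variation; a minimal sketch (the function name is illustrative, the numbers are the means and standard deviations from the slide):

```python
def coefficient_of_variation(mean: float, std: float) -> float:
    """Relative runtime variability: standard deviation divided by mean."""
    return std / mean

# Slide numbers: NAMD on Blue Waters (minutes), MILC on Edison (seconds).
namd_cv = coefficient_of_variation(318.0, 103.0)  # ~0.32
milc_cv = coefficient_of_variation(576.0, 100.0)  # ~0.17
```

A CV of ~0.32 means a single NAMD run's length is hard to predict to better than roughly a third of its mean, which is what motivates the congestion study.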
Monet: End-to-End Interconnect Monitoring Workflow
Data Measurement and Metrics
• Congestion measured in terms of Percent Time Stalled (PTS):
  PTS = 100 × (T_s / T_I)
  where T_s is the time spent in the stalled state and T_I is the measurement interval
• On Blue Waters: T_I is 60 s; data gathered: ~700 MB/minute
• On Edison: T_I is 1 s; data gathered: ~7.5 GB/minute
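A minimal sketch of the PTS calculation above (function and variable names are illustrative, not from the Monet code):

```python
def percent_time_stalled(stall_time: float, interval: float) -> float:
    """PTS = 100 * T_s / T_I, where T_s is the time spent in the stalled
    state and T_I is the measurement interval (same time units)."""
    if not 0.0 <= stall_time <= interval:
        raise ValueError("stall time must lie within the measurement interval")
    return 100.0 * stall_time / interval

# Example: a link stalled for 9 s of a 60 s Gemini sampling interval.
pts = percent_time_stalled(9.0, 60.0)  # 15.0
```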
[Figure: NAMD execution time (min) vs. max of average PTS across all regions overlapping the application topology (%), binned into Neg, Low, Med, and High congestion classes.]
Correlation Between NAMD Completion Time and Network Congestion (NCSA Blue Waters); Congestion Cloud (NCSA Blue Waters)
Characterizing Congestion On Toroidal Networks
Performance Variation Depends on Duration, Size and Intensity of Congestion Clouds
Low Medium High
Neg: 0-5%, Low: 5-15%, Medium: 15-25%, High: >25%
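These thresholds can be expressed as a simple binning function; a sketch in which the class names and ranges follow the slide, while the handling of values exactly at 5/15/25% is an assumption:

```python
def congestion_class(pts: float) -> str:
    """Map a PTS value (%) to the congestion classes from the study:
    Neg: 0-5%, Low: 5-15%, Medium: 15-25%, High: >25%."""
    if pts < 5.0:
        return "Neg"
    if pts < 15.0:
        return "Low"
    if pts <= 25.0:
        return "Med"
    return "High"

# Example: a link at 18% PTS falls in the Medium class.
```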
Characterizing Congestion On Toroidal Networks
[Figure: Hotspot link duration (NCSA Blue Waters). Duration (minutes) vs. PTS threshold (%), showing the median, 99th, and 99.9th percentiles.]
• Continuous presence of highly congested links
• 95% of the operation time contained a region with a minimum size of 20 links
• Maximum region size: 6904 links (17% of all links)
• Average congestion duration: 16 minutes; 95th percentile: 16 hours
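The duration statistics above (median, tail percentiles) can be computed with a simple nearest-rank percentile; a sketch over hypothetical per-hotspot durations, not data from the study:

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value such that at least p%
    of the samples are at or below it (0 < p <= 100)."""
    s = sorted(values)
    rank = max(1, -(-len(s) * p // 100))  # ceil(len(s) * p / 100)
    return s[int(rank) - 1]

# Hypothetical hotspot durations, in minutes:
durations = [1.0, 2.0, 4.0, 8.0, 16.0, 960.0]
median = percentile(durations, 50)  # 4.0
tail = percentile(durations, 95)    # 960.0
```

The large gap between the median and the 95th percentile in such skewed data is why the slide reports both.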
[Figure: congestion-state transition diagram among the classes Neg, Low, Med, and High, annotated with state-transition probabilities. Congestion cloud as % of total system-time: Neg 98.29%, Low 1.36%, Med 0.25%, High 0.10%.]
Comparing with DragonFly Network – Link Hotspot Characterization
[Figure: Hotspot link duration for Aries (NERSC Edison). Duration (minutes) vs. PTS threshold (%), showing the median, 99th, and 99.9th percentiles.]
• Duration of continuous congestion on links is significantly reduced in DragonFly
• Hotspots continue to exist
[Figure: Hotspot link duration (NCSA Blue Waters), repeated for comparison.]
Measuring Load Imbalance: Equivalence Groups
Example: node N1 sending data to node N2 over links L1 and L2, which belong to an equivalence group.

Congestion Imbalanced Scenario:
Link | Congestion (PTS)
L1   | 20%
L2   | 10%

Congestion Balanced Scenario (Ideal):
Link | Congestion (PTS)
L1   | 15%
L2   | 15%
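The imbalance across an equivalence group can be summarized by the range of PTS over its links; a sketch matching the "Range of PTS in equivalence groups" quantity used in the comparison, with an illustrative function name:

```python
def pts_range(group_pts: list[float]) -> float:
    """Range (max - min) of PTS across the links of one equivalence
    group; 0 means the group is perfectly load-balanced."""
    return max(group_pts) - min(group_pts)

# Imbalanced scenario from the slide: L1 at 20% PTS, L2 at 10% PTS.
imbalanced = pts_range([20.0, 10.0])  # 10.0
# Balanced (ideal) scenario: both links at 15% PTS.
balanced = pts_range([15.0, 15.0])    # 0.0
```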
Comparing with DragonFly Network – Load Imbalance Characterization
[Figure: inverse CDF of the range of PTS within equivalence groups (%), plotted separately for local and global links.]
• Load imbalance exists on local links
• Load balance is significantly improved on global links
Existence of congestion cloud for long duration (NCSA Blue Waters)
(Cray Testbed System)
Demo – 5 minutes of congestion visualization on Edison
[Visualization hierarchy: Cabinet, Chassis, Blade, Aries Switch. Color legend: >15% PTS is dark blue; 0%-15% is a green gradient.]
• Duration of congestion hotspots is short
• Hotspots move rapidly, and applications continue to experience congestion
Going Forward: Taking a Systems Approach to Minimizing Performance Variation
Sources of Contention
Multiple sources for contention/congestion
[Diagram: contention map for traffic flowing from P1 to P4 through NICs and network switches, with numbered contention points 1-5 along the path.]
Contention points and reasons:
• <src> processor: cache conflict, TLB, OS resources
• <src>/<dst> NIC: NIC buffers busy
• Network switches: head-of-line blocking due to busy buffers
• <dst> processor: not ready to receive data
Associated metrics: proc2nic, nic2proc, hl2hl
ML-driven Performance Prediction in Aries
Feature Type                       | Feature List                                 | Avg R²
Application and Scheduler Specific | app, hugepages, placement, balance, avg hops | 0.64
Network Specific                   | nic2proc, proc2nic, hl2hl PTS & SPF          | 0.86
All features                       | All of the above                             | 0.91
• Diagnosis for understanding performance variation
• Understanding sensitivity to each configuration parameter
• Scheduling policy
Method: ensemble models with k-fold cross-validation
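The evaluation loop behind the R² table above can be sketched as follows. This is a toy illustration: a trivial mean predictor stands in for the actual ensemble models, the data is synthetic, and all names are illustrative rather than from the study's code:

```python
import random

def k_fold_indices(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def r2_score(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination, the metric reported in the table."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cross_validate(y: list[float], k: int = 5) -> float:
    """Average R^2 over k folds; a real study would fit an ensemble
    model on the feature sets here instead of the training mean."""
    scores = []
    for fold in k_fold_indices(len(y), k):
        train = [y[i] for i in range(len(y)) if i not in fold]
        pred = sum(train) / len(train)  # "model" = training-set mean
        test = [y[i] for i in fold]
        scores.append(r2_score(test, [pred] * len(test)))
    return sum(scores) / k
```

Averaging held-out R² across folds, as sketched here, is what makes the "Avg R²" column comparable across the three feature sets.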
1. Increase observability into system and applications
2. Backpressure model for contention using probabilistic methods
3. Use explainable machine-learning methods
Conclusion and Future Work
Conclusion
• Congestion studies across generations of production systems help improve understanding of:
  • App. performance variation: diagnostic models
  • Parameter tuning: selecting optimal parameters
  • Network design: load imbalance continues to be a problem
Future work
• Characterize and understand upcoming QoS features in Slingshot
• ML-driven scheduling for HPC kernels
References