Jonathan Freed, University of South Carolina; Saurabh Gupta and Devesh Tiwari, Oak Ridge National Laboratory (OLCF)
Acknowledgments
• This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internship program.
• This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
• I would like to thank the University of South Carolina’s Office of Undergraduate Research and Computer Science and Engineering Department, as well as ACM’s SRC for travel support.
An Analysis Of Network Congestion in the Titan Supercomputer’s Interconnect
Titan implements the Cray Gemini interconnect, which has a 3D Torus topology. A 3D Torus topology is more prone to interference among applications sharing the network.
Each node in the 3D Torus network is a Gemini router, which hosts two compute nodes in Titan. Each compute node consists of a CPU and a GPU.
Node Neighborhood
We used two different neighborhoods of nodes for our analyses:
• Direct Path
• One-Hop Expanded
The Direct Path spans the intermediate connections (routers) between the two test nodes. The One-Hop Expanded neighborhood includes the nodes immediately surrounding the direct path in all dimensions.
Expanding the neighborhood allows us to investigate the effects of the surrounding area’s status at the time of the throughput test.
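The two neighborhoods can be sketched in code. This is a minimal illustration and not the poster's actual tooling. It assumes that the two test nodes differ in a single dimension, that traffic takes the shorter way around the ring, that the direct path includes both endpoints, and that "immediately surrounding" means the six face neighbors of each router. The function names and the torus size used below are hypothetical.

```python
def direct_path(src, dst, dim, dims):
    """Routers from src to dst along axis `dim` (assumed: shorter way around the ring)."""
    size = dims[dim]
    forward = (dst[dim] - src[dim]) % size      # hops going in the +1 direction
    step = 1 if forward <= size - forward else -1
    hops = min(forward, size - forward)
    path = []
    for k in range(hops + 1):                   # endpoints included (assumption)
        node = list(src)
        node[dim] = (src[dim] + step * k) % size
        path.append(tuple(node))
    return path


def one_hop_expanded(path, dims):
    """Direct path plus every router one hop away in X, Y or Z (with wraparound)."""
    region = set(path)
    for node in path:
        for axis in range(3):
            for delta in (-1, 1):
                neighbor = list(node)
                neighbor[axis] = (neighbor[axis] + delta) % dims[axis]
                region.add(tuple(neighbor))
    return region


# Hypothetical 8 x 8 x 8 torus, test nodes three hops apart in X.
dims = (8, 8, 8)
path = direct_path((0, 0, 0), (3, 0, 0), dim=0, dims=dims)
expanded = one_hop_expanded(path, dims)
```

Counting busy nodes or resident applications then reduces to looking up each coordinate in `path` or `expanded` in the scheduler's allocation records at the time of the test.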
Correlation with Number of Busy Nodes
From our analysis of the amount of activity (the number of busy nodes in the neighborhood), we found correlations with throughput. We quantified these correlations using Pearson’s and Spearman’s correlation coefficients.
Our data analysis shows:
• a moderate negative correlation exists between the number of busy nodes and throughput in all dimensions
• the correlation is stronger in the X and Y dimensions
[Figure panels: Direct Path; One-Hop Expanded]
Pearson & Spearman Correlations

Neighborhood      Dimension  Pearson Coefficient  Pearson p-value  Spearman Coefficient  Spearman p-value
Direct Path       X          -0.421876            0.000000         -0.519125             0.000000
Direct Path       Y          -0.467197            0.000000         -0.713708             0.000000
Direct Path       Z          -0.269941            0.000000         -0.099382             0.017526
One-Hop Expanded  X          -0.421416            0.000000         -0.505862             0.000000
One-Hop Expanded  Y          -0.404840            0.000000         -0.615761             0.000000
One-Hop Expanded  Z          -0.384263            0.000000         -0.388042             0.000000

A high number of busy nodes negatively affects throughput.
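For readers who want to reproduce this kind of analysis, the following is a minimal pure-Python sketch of both coefficients. It is not the code used for the poster. The sample values are invented for illustration, and p-values are not computed here. Spearman's coefficient is taken as Pearson's coefficient applied to average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented sample: busy-node counts and measured throughput (arbitrary units).
busy_nodes = [1, 2, 3, 4, 5]
throughput = [10.0, 9.0, 7.0, 4.0, 0.0]
r = pearson(busy_nodes, throughput)
rho = spearman(busy_nodes, throughput)
```

A negative value of either coefficient indicates that throughput tends to fall as the number of busy nodes rises. Spearman's coefficient only requires the relationship to be monotonic, which may explain why it differs from Pearson's in some rows of the table.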
Correlation with Applications
Introduction
Large-scale systems, like Titan, rely on high-speed interconnects to benefit from using vast amounts of parallelism. This interconnect is essential for compute nodes to communicate data with each other throughout the computation. If the interconnect becomes congested with data, the throughput drops, resulting in a slower compute time for researchers.
This project explored the factors causing network congestion and investigated which applications are likely to interfere with others.
[Figure labels: 3D Torus; Gemini Router]
Methods Overview
We collected data by testing the throughput between two nodes. We then investigated this data to find correlations between throughput and the following three variables:
• number of “busy nodes”
• neighboring applications
• distance between test nodes
For the analyses of the number of “busy nodes” and neighboring applications, we used two node neighborhoods: the Direct Path and the One-Hop Expanded.
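Effect of Path Distance
Because of the wraparound links in the 3D Torus design, the longest path between two nodes occurs at the midpoint of each dimension, not at its far end: past the midpoint, the route through the wraparound link is shorter.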
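The wraparound effect is easy to check with a small calculation. The sketch below assumes shortest-path routing within a single dimension. The ring size of 16 is a hypothetical value chosen for illustration, and `torus_hops` is an illustrative helper name.

```python
def torus_hops(a, b, size):
    """Shortest hop count between positions a and b on a ring of `size` routers."""
    direct = abs(a - b) % size
    return min(direct, size - direct)    # either go directly or use the wraparound link


# Hypothetical ring of 16 routers: hop count from position 0 to every other position.
size = 16
hops_from_zero = [torus_hops(0, b, size) for b in range(size)]
farthest = max(range(size), key=lambda b: hops_from_zero[b])
```

Here the hop count rises to 8 at position 8 and then falls again, so the position with the largest coordinate difference (15) is only one hop away.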
Conclusion
We can identify applications that warrant further investigation as possible causes of network congestion. Our study shows that network throughput is affected by:
• the number of busy nodes nearby
• certain applications running on nearby nodes
• the path distance between the communicating nodes
Using the One-Hop Expanded node neighborhood, we were able to find applications that we suspect cause network congestion.
[Figure panels: Direct Path; One-Hop Expanded]
Confidence: how often low throughput occurred with a given application in the neighborhood
Presence: average percentage of nodes in the neighborhood that are occupied by a given application
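One possible reading of these two metrics is sketched below. It is an interpretation and not the poster's exact procedure. It assumes that confidence is the fraction of tests with the application in the neighborhood that showed low throughput, that presence is averaged over those same tests, and that "low throughput" means falling below a chosen threshold. The record layout, the threshold, and the application IDs are all hypothetical.

```python
from collections import defaultdict


def confidence_and_presence(tests, low_threshold):
    """Return {app_id: (confidence, presence_percent)} under the assumptions above.

    Each test is a dict with 'throughput' (a number) and 'neighborhood'
    (one application ID per node, or None for an idle node).
    """
    seen = defaultdict(int)       # tests in which the application appears
    low = defaultdict(int)        # of those, tests with low throughput
    share = defaultdict(float)    # summed percentage of nodes occupied
    for test in tests:
        nodes = test["neighborhood"]
        for app in set(nodes) - {None}:
            seen[app] += 1
            if test["throughput"] < low_threshold:
                low[app] += 1
            share[app] += 100.0 * nodes.count(app) / len(nodes)
    return {app: (low[app] / seen[app], share[app] / seen[app]) for app in seen}


# Invented records: two throughput tests over four-node neighborhoods.
tests = [
    {"throughput": 1.0, "neighborhood": ["A", "A", None, "B"]},
    {"throughput": 5.0, "neighborhood": ["A", None, None, None]},
]
metrics = confidence_and_presence(tests, low_threshold=2.0)
```

An application with both high confidence and high presence would be a natural first candidate for closer investigation, since it is frequently nearby when throughput is low and occupies a large share of the neighborhood.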
Some neighboring applications could cause low throughput.
Note: Application names/users replaced with IDs.