SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Performance of Applications Using Dual-Rail InfiniBand 3D Torus Network on the Gordon Supercomputer
Dongju Choi, Glenn Lockwood, Robert Sinkovits, Mahidhar Tatineni
San Diego Supercomputer Center
University of California, San Diego
Background
• SDSC data-intensive supercomputer Gordon:
  • 1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB DDR3-1333 memory
  • 16 cores per node and 16 nodes (256 cores) per switch
  • Large I/O nodes and local/global SSD disks
• Dual-rail QDR InfiniBand network supports I/O and compute communication separately.
  • Both rails can also be scheduled for use by compute traffic.
• We are interested in how communication oversubscription on switch-to-switch links and the switch/node topology affect application performance.
Gordon System Architecture
[Figure: 3-D torus of switches on Gordon]
[Figure: Subrack-level network architecture on Gordon]
MVAPICH2 MPI Implementation
• MVAPICH2 versions 1.9 and 2.0 are currently available on the Gordon system
• Full control of dual-rail usage at the task level via user-settable environment variables:
  • MV2_NUM_HCAS=2
  • MV2_IBA_HCA=mlx4_0:mlx4_1
  • MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8000: message size above which messages are shared across rails; can be set as low as 8 KB
  • MV2_SM_SCHEDULING=ROUND_ROBIN: explicitly distribute tasks over rails
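As an illustration of the settings above, here is a minimal sketch that assembles the four variables into an environment for a job launch. The helper name `dual_rail_env` and its parameters are hypothetical; only the variable names and values come from the slide, and the launcher command itself is site-specific and not shown.

```python
import os


def dual_rail_env(threshold_bytes=8000, round_robin=True, base_env=None):
    """Return a copy of an environment with the dual-rail MVAPICH2 settings.

    The variable names and values are the ones listed on the slide;
    this helper and its arguments are illustrative only.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MV2_NUM_HCAS"] = "2"                 # use both HCAs
    env["MV2_IBA_HCA"] = "mlx4_0:mlx4_1"      # the two rails on a node
    env["MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD"] = str(threshold_bytes)
    if round_robin:
        env["MV2_SM_SCHEDULING"] = "ROUND_ROBIN"
    return env


if __name__ == "__main__":
    # Print the settings as shell export lines for a batch script.
    settings = dual_rail_env(base_env={})
    for name in sorted(settings):
        print(f"export {name}={settings[name]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the MPI launcher.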
OSU Micro-Benchmarks
• Compare the performance of single- and dual-rail QDR InfiniBand vs. FDR InfiniBand: evaluate the impact of the rail sharing, scheduling, and threshold parameters
  • Bandwidth tests
  • Latency tests
OSU Bandwidth Test Results for Single Rail QDR, FDR, and Dual-Rail QDR Network Configurations
- Single-rail FDR performance is much better than single-rail QDR for message sizes larger than 4K bytes
- Dual-rail QDR performance exceeds FDR performance at sizes greater than 32K bytes
- FDR shows better performance between 4K and 32K bytes due to the rail-sharing threshold
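For context on why two QDR rails can overtake one FDR rail at large message sizes, the sketch below compares nominal 4x link data rates. The lane rates and line encodings are standard InfiniBand figures added here as background, not numbers from the slides, and measured bandwidth will be lower because of protocol, PCIe, and software overheads.

```python
def data_rate_gbps(lane_gbps, lanes, payload_bits, coded_bits):
    """Nominal data rate of a link after line-encoding overhead, in Gb/s."""
    return lane_gbps * lanes * payload_bits / coded_bits


# Assumed nominal figures for 4x links (not taken from the slides):
# QDR signals at 10 Gb/s per lane with 8b/10b encoding,
# FDR signals at 14.0625 Gb/s per lane with 64b/66b encoding.
QDR = data_rate_gbps(10.0, 4, 8, 10)
FDR = data_rate_gbps(14.0625, 4, 64, 66)
DUAL_QDR = 2 * QDR  # upper bound when large messages use both rails

if __name__ == "__main__":
    for name, rate in [("QDR", QDR), ("FDR", FDR), ("Dual-rail QDR", DUAL_QDR)]:
        print(f"{name:14s} {rate:6.2f} Gb/s  ({rate / 8:.2f} GB/s)")
```

Under these assumptions the ordering is single QDR < FDR < dual-rail QDR, which matches the large-message ordering reported above.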
OSU Bandwidth Test Performance with MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8K
- Lowering the rail-sharing threshold closes the performance gap between dual-rail QDR and FDR down to 8K bytes.
OSU Bandwidth Test Performance with MV2_SM_SCHEDULING=ROUND_ROBIN
- Explicit round-robin scheduling makes tasks communicate over different rails
OSU Latency Benchmark Results for QDR, Dual-Rail QDR with MVAPICH2 Defaults, FDR
- There is no latency penalty at small message sizes (expected, as only one rail is active below the striping threshold).
- Above the striping threshold a minor increase in latency is observed, but the performance is still better than single-rail FDR.
OSU Latency Benchmark Results for QDR, Dual-Rail QDR with Round Robin Option, FDR
- Distributing messages across HCAs using the round-robin option increases the latency at small message sizes.
- Again, the latency results are better than the FDR case.
Application Performance Benchmarks
• Applications
  • P3DFFT Benchmark
  • LAMMPS Water Box Benchmark
  • AMBER Cellulose Benchmark
• Test Configuration
  • Single Rail vs. Dual Rails
  • Multiple-switch runs with Maximum Hops=1 or no hop limit for 512-core runs (2 switches are involved)
P3DFFT Benchmark
• Parallel Three-Dimensional Fast Fourier Transforms
• Used for studies of turbulence, climatology, astrophysics, and materials science
• Depends strongly on the available bandwidth, as the main communication component is driven by transposes of large arrays (alltoallv)
Simulation Results for P3DFFT Benchmark with 256 Cores and QDR, Dual-Rail QDR

Run #   QDR Wallclock Time (s)   Dual-Rail QDR Wallclock Time (s)
1       992                      761
2       985                      760
3       991                      766
4       993                      759

- Dual-rail runs are consistently faster than the single-rail runs, with an average performance gain of 23%.
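The 23% figure can be reproduced from the four runs above. The sketch below assumes the gain is measured as the relative reduction in mean wallclock time; the function name is illustrative.

```python
from statistics import mean

# Wallclock times in seconds, copied from the 256-core table above.
qdr = [992, 985, 991, 993]
dual_rail_qdr = [761, 760, 766, 759]


def percent_reduction(baseline, improved):
    """Relative reduction of the mean time, in percent."""
    return 100.0 * (mean(baseline) - mean(improved)) / mean(baseline)


if __name__ == "__main__":
    print(f"mean QDR time:       {mean(qdr):.2f} s")
    print(f"mean dual-rail time: {mean(dual_rail_qdr):.2f} s")
    print(f"time reduction:      {percent_reduction(qdr, dual_rail_qdr):.1f} %")
    print(f"speedup factor:      {mean(qdr) / mean(dual_rail_qdr):.2f}x")
```

Expressed as a speedup factor rather than a time reduction, the same data give about 1.3x.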
Communication and Compute Time Breakdown for 256 Core, Single/Dual QDR Rail P3DFFT Runs

Single Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       992              539              453
2       985              535              450
3       991              539              452
4       993              543              450

Dual Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       761              302              459
2       760              301              459
3       766              308              458
4       759              300              459

- The compute part is nearly identical in both sets of runs
- The performance improvement is almost entirely in the communication part of the code
- This shows that dual rail boosts the alltoallv performance and consequently speeds up the overall calculation
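A short worked check of the breakdown: if only the communication part speeds up and the compute part stays fixed, the expected dual-rail total is the single-rail compute time plus the dual-rail communication time. The sketch below uses the tabulated values; the variable names are illustrative.

```python
from statistics import mean

# Per-run times in seconds, copied from the 256-core breakdown tables.
single = {"comm": [539, 535, 539, 543], "compute": [453, 450, 452, 450]}
dual = {"comm": [302, 301, 308, 300], "compute": [459, 459, 458, 459]}


def averages(runs):
    """Return (mean comm, mean compute, mean total) for one configuration."""
    comm = mean(runs["comm"])
    compute = mean(runs["compute"])
    return comm, compute, comm + compute


s_comm, s_compute, s_total = averages(single)
d_comm, d_compute, d_total = averages(dual)

comm_fraction = s_comm / s_total       # share of single-rail time in comm
comm_speedup = s_comm / d_comm         # how much faster comm is on two rails
predicted_total = s_compute + d_comm   # total if compute were unchanged

if __name__ == "__main__":
    print(f"single-rail comm fraction: {100 * comm_fraction:.1f} %")
    print(f"comm speedup on dual rail: {comm_speedup:.2f}x")
    print(f"predicted dual-rail total: {predicted_total:.2f} s")
    print(f"observed dual-rail total:  {d_total:.2f} s")
```

The predicted and observed totals differ only by the small change in compute time, which is consistent with the gain coming from communication.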
Communication and Compute Time Breakdown for 512 Core, Single/Dual QDR Rail P3DFFT Runs. Maximum Switch Hops=1

Single Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       802              592              210
2       802              592              210
3       804              594              210
4       803              592              211

Dual Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       537              322              215
2       538              322              216
3       538              322              216
4       538              322              216

- Shows similar dual-rail benefits
- The run spans fewer links, reducing the likelihood of oversubscription due to other jobs
- It can also increase the likelihood of oversubscription, because fewer switch connections are available
P3DFFT Benchmark with 512 Cores, Single Rail QDR. No Switch Hop Restriction

Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       717              506              211
2       732              525              207
3       789              580              209
4       726              518              208
5       825              615              210
6       697              488              209
- In the best run, oversubscription is mitigated by the topology of the run, and performance is nearly 15% better than in the single-hop case (697 s versus about 803 s). However, as the results show, a different topology may also lead to lower performance if the distribution is not optimal (through oversubscription by the job itself or by other jobs).
- Spreading the computation over several switches lowers the bandwidth requirements on a given set of switch-to-switch links
- This is bad for latency-bound codes (given the extra switch hops) but can benefit bandwidth-sensitive codes, depending on the topology of the run
- Nukada et al. use dynamic links to minimize congestion and obtain better performance in the dual-rail case
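The run-to-run variability without a hop restriction can be quantified directly from the table. The sketch below compares the six unrestricted single-rail totals with the four hop-limited single-rail totals from the 512-core Maximum Switch Hops=1 runs; the summary statistics chosen here are an illustration, not the authors' analysis.

```python
from statistics import mean, pstdev

# Total times in seconds for 512-core single-rail P3DFFT runs.
unrestricted = [717, 732, 789, 726, 825, 697]   # no switch hop restriction
hop_limited = [802, 802, 804, 803]              # Maximum Switch Hops=1

hop_mean = mean(hop_limited)
best, worst = min(unrestricted), max(unrestricted)

if __name__ == "__main__":
    print(f"hop-limited:  mean {hop_mean:.2f} s, "
          f"std dev {pstdev(hop_limited):.2f} s")
    print(f"unrestricted: mean {mean(unrestricted):.2f} s, "
          f"std dev {pstdev(unrestricted):.2f} s")
    print(f"best unrestricted run vs hop-limited mean:  "
          f"{hop_mean / best:.2f}x faster")
    print(f"worst unrestricted run vs hop-limited mean: "
          f"{worst / hop_mean:.2f}x slower")
```

The best unrestricted run is about 15% faster than the hop-limited mean, while the worst one is slower than it, which illustrates the topology sensitivity described above.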
Nukada, A., Sato, K., and Matsuoka, S. 2012. Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 44, 10 pages.
Communication and Compute Time Breakdown for 1024 Core P3DFFT Runs

Single Rail Runs
Run   Total Time (s)   Comm. Time (s)   Compute Time (s)
1     404              307              97
2     408              310              98

Dual Rail Runs
Run   Total Time (s)   Comm. Time (s)   Compute Time (s)
1     332              232              100
2     325              226              99

- No switch hop restrictions are placed on the runs.
- Communication time is greatly improved in the dual-rail cases, while the compute time is nearly identical in all the runs.
LAMMPS Water Box Benchmark
• Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a widely used classical molecular dynamics code.
• 12,000 water molecules (36,000 atoms) are set in the input
• Simulation is run for 20 picoseconds.
LAMMPS Water Box Benchmark with Single/Dual Rail QDR and 256 Cores

Run #   QDR Wallclock Time (s)   Dual-Rail QDR Wallclock Time (s)
1       57                       46
2       57                       46
3       58                       46
4       57                       46

- Dual-rail runs are faster than the single-rail runs and mitigate communication overhead, with an average improvement of about 20% in wallclock time (57.25 s versus 46 s).
LAMMPS Water Box Benchmark with Single/Dual Rail QDR and 512 Cores

Run #   Single Rail QDR, MAX_HOP=1   Single Rail QDR, No Limit in MAX_HOP   Dual Rail QDR
        Wallclock Time (s)           Wallclock Time (s)                     Wallclock Time (s)
1       69                           71                                     47
2       69                           70                                     47
3       70                           281                                    48
4       69                           450                                    47

- The application is not scaling due to larger communication overhead (caused by the fine level of domain decomposition)
- The LAMMPS benchmark is very sensitive to topology and shows large variations if the maximum switch hops are not restricted
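Both observations can be read off the 512-core table with a few lines of arithmetic. This sketch computes the dual-rail reduction relative to the hop-limited single-rail runs and the spread of the unrestricted runs; the column names are shortened for the example.

```python
from statistics import mean

# Wallclock times in seconds from the 512-core LAMMPS table.
single_hop1 = [69, 69, 70, 69]       # single rail, MAX_HOP=1
single_nolimit = [71, 70, 281, 450]  # single rail, no limit in MAX_HOP
dual_rail = [47, 47, 48, 47]         # dual rail

reduction = 100.0 * (mean(single_hop1) - mean(dual_rail)) / mean(single_hop1)
spread = max(single_nolimit) / min(single_nolimit)

if __name__ == "__main__":
    print(f"dual rail vs single rail (MAX_HOP=1): {reduction:.1f} % less time")
    print(f"unrestricted runs: slowest is {spread:.1f}x the fastest")
```

With these numbers the dual-rail runs take roughly a third less time than the hop-limited single-rail runs, and the unrestricted single-rail runs differ from each other by more than a factor of six.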
AMBER Cellulose Benchmark
• Amber is a package of programs for molecular dynamics simulations of proteins and nucleic acids.
• 408,609 atoms are used for the tests.
Amber Cellulose Benchmark with Single/Dual Rail QDR and 256 Cores

Run #   Single Rail QDR Wallclock Time (s)   Dual Rail QDR Wallclock Time (s)
1       218                                  212
2       219                                  213
3       218                                  212
4       219                                  212

- Communication overhead is low and the dual-rail benefit is minor (<3%)
Amber Cellulose Benchmark with Single/Dual Rail QDR, 512 Cores

Run #   Single Rail QDR, MAX_HOP=1   Single Rail QDR, No Limit in MAX_HOP   Dual Rail QDR
        Wallclock Time (s)           Wallclock Time (s)                     Wallclock Time (s)
1       204                          332                                    168
2       202                          331                                    168
3       202                          396                                    168
4       202                          373                                    167

- There is a modest benefit (<5%) in the single-rail QDR runs
- Communication overhead increases with increased core count, leading to the drop-off in scaling. This can be mitigated with dual-rail QDR
- Dual-rail QDR performance is better by 17%
Amber Cellulose Benchmark with Single/Dual Rail QDR, 512 Cores
- Dual rail enables the benchmark to scale to higher core counts
- Shows sensitivity to the topology due to the larger number of switch hops and possible contention from other jobs
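To make the scaling statement concrete, the sketch below computes the speedup and parallel efficiency when going from 256 to 512 cores, using the Amber wallclock times from the two tables. Parallel efficiency relative to the 256-core run is a metric chosen here for illustration; it does not appear on the slides.

```python
from statistics import mean

# Amber Cellulose wallclock times in seconds.
single_256 = [218, 219, 218, 219]
dual_256 = [212, 213, 212, 212]
single_512 = [204, 202, 202, 202]   # single rail, MAX_HOP=1
dual_512 = [168, 168, 168, 167]


def scaling(times_256, times_512, core_ratio=2):
    """Return (speedup, efficiency) for a 256 -> 512 core step."""
    speedup = mean(times_256) / mean(times_512)
    return speedup, speedup / core_ratio


single_speedup, single_eff = scaling(single_256, single_512)
dual_speedup, dual_eff = scaling(dual_256, dual_512)
dual_gain_512 = 100.0 * (mean(single_512) - mean(dual_512)) / mean(single_512)

if __name__ == "__main__":
    print(f"single rail: speedup {single_speedup:.2f}x, "
          f"efficiency {100 * single_eff:.0f} %")
    print(f"dual rail:   speedup {dual_speedup:.2f}x, "
          f"efficiency {100 * dual_eff:.0f} %")
    print(f"dual-rail gain at 512 cores: {dual_gain_512:.0f} %")
```

Doubling the core count yields a larger speedup with dual rail than with single rail, and the dual-rail gain at 512 cores comes out at about 17%.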
Summary
• Aggregated bandwidth obtained with dual rail QDR exceeds the FDR performance.
• The benchmarked applications show performance benefits from dual-rail QDR configurations.
• Gordon’s 3-D torus of switches leads to variability in performance due to oversubscription/topology considerations.
• Switch topology can be configured to enable mitigation of the link oversubscription bottleneck.
Summary
• Performance improvement also varies based on the degree of communication overhead.
  • Benchmark cases with larger communication fractions (with respect to overall run time) show more improvement with dual-rail QDR configurations.
• Computational time scaled with the core counts in both single- and dual-rail configurations for the currently benchmarked applications: LAMMPS and Amber
Acknowledgements
• This work was supported by NSF grant OCI #0910847, Gordon: A Data Intensive Supercomputer.