HPC Best Practices: Application Performance Optimization
Application Performance Optimizations
Pak Lui
140 Applications Best Practices Published
• Abaqus
• ABySS
• AcuSolve
• Amber
• AMG
• AMR
• ANSYS CFX
• ANSYS FLUENT
• ANSYS Mechanics
• BQCD
• CCSM
• CESM
• COSMO
• CP2K
• CPMD
• Dacapo
• Desmond
• DL-POLY
• Eclipse
• FLOW-3D
• GADGET-2
• GROMACS
• Himeno
• HOOMD-blue
• HYCOM
• ICON
• LAMMPS
• Lattice QCD
• LS-DYNA
• MILC
• miniFE
• MM5
• MPQC
• MR Bayes
• MSC Nastran
• NAMD
• Nekbone
• NEMO
• NWChem
• Octopus
• OpenAtom
• OpenFOAM
• OpenMX
• PARATEC
• PFA
• PFLOTRAN
• Quantum ESPRESSO
• RADIOSS
• SPECFEM3D
• WRF

For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php
HPC Advisory Council HPC Center
• Dell™ PowerEdge™ R730 32-node cluster
• HP Cluster Platform 3000SL 16-node cluster
• HP ProLiant SL230s Gen8 4-node cluster
• Dell™ PowerEdge™ R815 11-node cluster
• Dell™ PowerEdge™ C6145 6-node cluster
• Dell™ PowerEdge™ M610 38-node cluster
• Dell™ PowerEdge™ C6100 4-node cluster
• Dell™ PowerVault MD3420 / MD3460 InfiniBand-based Lustre Storage
• Dell™ PowerEdge™ R720/R720xd 32-node cluster
• HP ProLiant XL230a Gen9 10-node cluster
• Colfax CX1350s-XK5 4-node cluster
Agenda
• Overview of HPC Applications Performance
• Ways to Inspect, Profile, and Optimize HPC Applications – CPU, memory, file I/O, network
• System Configurations and Tuning
• Case Studies, Performance Comparisons, Optimizations and Highlights
• Conclusions
HPC Application Performance Overview
• Achieving scalable performance on HPC applications
– Requires understanding the workload through profile analysis
• Tune for where the most time is spent (CPU, network, I/O, etc.)
– Underlying implicit requirement: each node must perform similarly
• Run CPU/memory/network tests or a cluster checker to identify bad node(s)
– Compare behavior when using different HW components
– This pinpoints bottlenecks in different areas of the HPC cluster
• A selection of HPC applications will be shown
– To demonstrate the method of profiling and analysis
– To determine the bottleneck in SW/HW
– To determine the effectiveness of tuning to improve performance
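Since each node is expected to perform similarly, a quick way to spot a bad node is to collect the same snapshot on every node and compare. A minimal sketch (plain Linux commands, not any particular cluster checker; run it across nodes with pdsh or ssh):

```shell
#!/bin/sh
# Per-node sanity snapshot: run on every node and diff the outputs.
# A node whose CPU model, memory size, or idle load differs from its
# peers is a candidate "bad node" worth benchmarking in isolation.
echo "host: $(hostname)"
grep -m1 "model name" /proc/cpuinfo   # CPU model should be identical cluster-wide
grep MemTotal /proc/meminfo           # installed memory should match
uptime                                # unexpected load hints at a stray process
```

Dedicated tools such as Intel Cluster Checker, or a per-node STREAM/HPL run, automate the same comparison more thoroughly.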
Ways To Inspect and Profile Applications
• Computation (CPU/Accelerators)
– Tools: gprof, top, htop, perf top, pstack, Visual Profiler, etc
– Tests and Benchmarks: HPL, STREAM
• File I/O
– Bandwidth and Block Size: iostat, collectl, darshan, etc
– Characterization Tools and Benchmarks: iozone, ior, etc
• Network Interconnect and MPI communications
– Tools and Profilers: perfquery, MPI profilers (IPM, TAU, etc)
– Characterization Tools and Benchmarks:
– Latency and Bandwidth: OSU benchmarks, IMB
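When iozone or ior are not available, a crude streaming-write probe with dd gives a first-order file I/O bandwidth number (a sketch only; `conv=fdatasync` folds the final flush into dd's timing, but caching and contention still affect the result):

```shell
# Rough streaming-write bandwidth probe: write 256 MiB in 1 MiB blocks
# to the filesystem under test and let dd report the effective rate.
dd if=/dev/zero of=/tmp/ddprobe bs=1M count=256 conv=fdatasync 2>&1 | tail -1
rm -f /tmp/ddprobe
```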
Case Study:
LS-DYNA
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource: HPC Advisory Council Cluster Center
• The following was done to provide best practices
– LS-DYNA performance overview
– Understanding LS-DYNA communication patterns
– Ways to increase LS-DYNA productivity
– MPI library comparisons
• For more info please refer to
– http://www.dell.com
– http://www.intel.com
– http://www.mellanox.com
– http://www.lstc.com
LS-DYNA
• LS-DYNA
– A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
– Developed by the Livermore Software Technology Corporation (LSTC)
• LS-DYNA is used in
– Automotive
– Aerospace
– Construction
– Military
– Manufacturing
– Bioengineering
Objectives
• The presented research was done to provide best practices
– LS-DYNA performance benchmarking
• MPI Library performance comparison
• Interconnect performance comparison
• CPU cores/speed comparison
• Optimization tuning
• The presented results will demonstrate
– The scalability of the compute environment/application
– Considerations for higher productivity and efficiency
Test Cluster Configuration
• Dell PowerEdge R730 32-node (896-core) “Thor” cluster
– Dual-Socket 14-Core Intel E5-2697 v3 @ 2.60 GHz CPUs (BIOS Power Management set to Maximum Performance)
– Memory: 64GB DDR4 2133 MHz (BIOS Memory Snoop Mode set to Home Snoop)
– OS: RHEL 6.5, MLNX_OFED_LINUX-2.4-1.0.5.1_20150408_1555 InfiniBand SW stack
– Hard Drives: 2x 1TB 7.2K RPM SATA 2.5” in RAID 1
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
• Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand Switch
• Mellanox ConnectX-3 FDR VPI InfiniBand and 40Gb/s Ethernet Adapters
• Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch
• MPI: Open MPI 1.8.4, Mellanox HPC-X v1.2.0-326, Intel MPI 5.0.2.044, IBM Platform MPI 9.1
• Application:
– LS-DYNA 8.0.0 (builds 95359, 95610), Single Precision
• Benchmarks: 3 Vehicle Collision, Neon refined revised
PowerEdge R730 – Massive flexibility for data-intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management with extensive power management features
– Innovative tools including automation for parts replacement and lifecycle manageability
– Broad choice of networking technologies from GigE to IB
– Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
• Benefits
– Designed for performance workloads
• From big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
• High-performance scale-out compute and low-cost dense storage in one package
• Hardware capabilities
– Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
– Large memory footprint (up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” – or – 24 x 2.5” + 2x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot-plug drives in rear of server for boot or scratch
LS-DYNA Performance – Network Interconnects
• EDR InfiniBand delivers superior scalability in application performance
– Provides over 4-5x higher performance than 1GbE, 10GbE and 40GbE
– 1GbE stops scaling beyond 4 nodes; 10GbE stops scaling beyond 8 nodes
– InfiniBand demonstrates continuous performance gain at scale
[Chart: 444% / 505% / 572% gains; 28 MPI Processes / Node; higher is better]
LS-DYNA Profiling – Time Spent in MPI
• The majority of MPI time is spent in MPI_Recv and MPI collective operations
– MPI_Recv (36%), MPI_Allreduce (27%), MPI_Bcast (24%)
• Similar communication characteristics are seen on both input datasets
– Both exhibit similar communication patterns
[Charts: 3 Vehicle Collision – 32 nodes; Neon_refined_revised – 32 nodes]
LS-DYNA Profiling – MPI Message Sizes
• Most MPI messages are small to medium sized
– The bulk of message sizes fall between 0 and 64B
• For the most time-consuming MPI calls:
– MPI_Recv: most messages are under 4KB
– MPI_Bcast: the majority are less than 16B, but larger messages exist
– MPI_Allreduce: most messages are less than 256B
[Chart: neon_refined_revised]
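Breakdowns like these can be reproduced with a lightweight MPI profiler such as IPM, injected at launch time without relinking (the install path and application command below are illustrative):

```shell
# Sketch: collect per-call MPI time and message-size histograms with IPM.
export LD_PRELOAD=/opt/ipm/lib/libipm.so   # adjust to your IPM install
export IPM_REPORT=full                     # full report at MPI_Finalize
mpirun -np 896 ./lsdyna i=neon_refined_revised.k
```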
LS-DYNA Performance – EDR vs FDR InfiniBand
• EDR InfiniBand delivers superior scalability in application performance
– As the cluster scales, the performance gap in favor of EDR IB widens
• Performance advantage of EDR InfiniBand increases at larger core counts
– EDR IB provides a 15% gain versus FDR IB at 32 nodes (896 cores)
[Chart: 15% gain; 28 MPI Processes / Node; higher is better]
LS-DYNA Performance – Cores Per Node
• Better performance is seen at scale with fewer CPU cores per node
– At low node counts, higher performance can be achieved with more cores per node
– At high node counts, slightly better performance is achieved by using fewer cores per node
– Memory bandwidth may be the limiting factor as more CPU cores are used
[Chart: 5% / 3% / 2% / 5% deltas; CPU @ 2.6GHz; higher is better]
LS-DYNA Performance – AVX2/SSE2 CPU Instructions
• LS-DYNA provides executables built for different CPU instruction sets
– AVX2 is supported on “Haswell”, while SSE2 is supported on previous generations
– Due to a runtime issue, AVX2 executable build 95610 is used instead of the public build 95359
– Slight improvement of ~2-4% from using the executable with AVX2 instructions
– AVX2 instructions run at a lower clock speed (2.2GHz) than the normal CPU clock (2.6GHz)
[Chart: 4% gain; 24 MPI Processes / Node; higher is better]
LS-DYNA Performance – Turbo Mode
• Turbo Boost enables processors to run above their base frequency
– CPU cores may run dynamically above the rated clock when thermal headroom allows
– The 2.6GHz base clock can boost to a Max Turbo Frequency of 3.3GHz
– Running with Turbo Boost translates to a ~25% performance boost
[Chart: 40% / 25% gains; 28 MPI Processes / Node; higher is better]
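Whether Turbo Boost is actually engaging can be verified on a loaded node (Linux-specific paths; the intel_pstate knob exists only where that driver is in use):

```shell
# Sample the running clock: with Turbo active under load, "cpu MHz"
# should read above the 2.6GHz base (up to 3.3GHz on this CPU).
grep -m2 "cpu MHz" /proc/cpuinfo
# On intel_pstate systems, no_turbo=0 means Turbo Boost is enabled.
cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null \
  || echo "intel_pstate driver not in use"
```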
LS-DYNA Performance – Memory Optimization
• Setting environment variables for the memory allocator improves performance
– Modifying the memory allocator allows faster memory registration for communications
• Environment variables used:
– export MALLOC_MMAP_MAX_=0
– export MALLOC_TRIM_THRESHOLD_=-1
[Chart: 176% / 19% gains; 28 MPI Processes / Node; higher is better]
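The two variables tune glibc's allocator: MALLOC_MMAP_MAX_=0 stops glibc from serving large allocations via mmap (mmap'd regions are unmapped on free, which invalidates cached InfiniBand memory registrations), and MALLOC_TRIM_THRESHOLD_=-1 stops it from returning freed heap pages to the OS. A launch sketch (Open MPI-style -x forwarding; the executable name is a placeholder):

```shell
# Keep malloc'd buffers resident in the heap so memory registration
# can be cached across MPI communication calls.
export MALLOC_MMAP_MAX_=0         # never satisfy malloc() with mmap
export MALLOC_TRIM_THRESHOLD_=-1  # never give freed heap pages back to the OS
mpirun -np 896 -x MALLOC_MMAP_MAX_ -x MALLOC_TRIM_THRESHOLD_ ./lsdyna i=3cars.k
```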
LS-DYNA Performance – MPI Optimization
• FCA and MXM enhance LS-DYNA performance at scale with HPC-X
– HPC-X is based on the Open MPI distribution
– The “yalla” PML, the UD transport and the memory optimizations in HPC-X reduce overhead
– MXM provides a speedup of 38% over the un-tuned baseline run at 32 nodes (768 cores)
• MCA parameters used:
– -mca btl_sm_use_knem 1 -mca pml yalla -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768
– -x HCOLL_CONTEXT_CACHE_ENABLE=1
[Chart: 38% gain; 24 MPI Processes / Node; higher is better]
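Combined into a single HPC-X / Open MPI launch line, the tuning above looks roughly like this (executable and input deck are placeholders):

```shell
mpirun -np 768 --bind-to core \
    -mca pml yalla -mca btl_sm_use_knem 1 \
    -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768 \
    -x HCOLL_CONTEXT_CACHE_ENABLE=1 \
    ./lsdyna i=neon_refined_revised.k
```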
LS-DYNA Performance – Intel MPI Optimization
• The DAPL provider performs better than the OFA provider for Intel MPI
– DAPL provides better scalability for Intel MPI on LS-DYNA
• Parameters used:
– Common to both tests: I_MPI_DAPL_SCALABLE_PROGRESS 1, I_MPI_RDMA_TRANSLATION_CACHE 1, I_MPI_FAIR_CONN_SPIN_COUNT 2147483647, I_MPI_FAIR_READ_SPIN_COUNT 2147483647, I_MPI_ADJUST_REDUCE 2, I_MPI_ADJUST_BCAST 0, I_MPI_RDMA_RNDV_BUF_ALIGN 65536, I_MPI_SPIN_COUNT 121
– For OFA: -IB, MV2_USE_APM 0, I_MPI_OFA_USE_XRC 1
– For DAPL: -DAPL, I_MPI_DAPL_DIRECT_COPY_THRESHOLD 65536, I_MPI_DAPL_UD enable, I_MPI_DAPL_PROVIDER ofa-v2-mlx5_0-1u
[Chart: 20% gain; 24 MPI Processes / Node; higher is better]
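With Intel MPI these are environment variables, typically passed via -genv; a DAPL-provider launch might look like the following sketch (process count, executable and input deck are illustrative):

```shell
mpirun -np 768 -DAPL \
    -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx5_0-1u \
    -genv I_MPI_DAPL_UD enable \
    -genv I_MPI_DAPL_DIRECT_COPY_THRESHOLD 65536 \
    -genv I_MPI_DAPL_SCALABLE_PROGRESS 1 \
    -genv I_MPI_RDMA_TRANSLATION_CACHE 1 \
    ./lsdyna i=3cars.k
```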
LS-DYNA Performance – MPI Libraries
• HPC-X outperforms Platform MPI and Open MPI in scalability performance
– HPC-X delivers higher performance than Intel MPI (OFA) by 33%, Intel MPI (DAPL) by 11%, and Platform MPI by 27% on neon_refined_revised
– Performance is 20% higher than Intel MPI (OFA) and 8% better than Platform MPI on 3cars
• Tuning parameters used:
– For Open MPI: -bind-to-core and KNEM. For Platform MPI: -cpu_bind, -xrc. For Intel MPI: see previous slide
[Chart: 33% / 20% / 11% / 27% / 7% gains; 24 MPI Processes / Node; higher is better]
LS-DYNA Performance – System Generations
• The current Haswell system configuration outperforms prior system generations
– Outperforms Ivy Bridge by 47%, Sandy Bridge by 75%, Westmere by 148%, and Nehalem by 290%
– Scalability support from EDR InfiniBand and HPC-X provides a large performance boost at scale for LS-DYNA
• System components used:
– Haswell: 2-socket 14-core E5-2697 v3 @ 2.60GHz, 2133MHz DIMMs, ConnectX-4 EDR InfiniBand
– Ivy Bridge: 2-socket 10-core Intel Xeon, 1600MHz DIMMs, Connect-IB FDR InfiniBand
– Sandy Bridge: 2-socket 8-core E5-2680 @ 2.7GHz, 1600MHz DIMMs, ConnectX-3 FDR InfiniBand
– Westmere: 2-socket 6-core X5670 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR InfiniBand
– Nehalem: 2-socket 4-core Intel Xeon, 1333MHz DIMMs, ConnectX-2 QDR InfiniBand
[Chart: 47% / 75% / 148% / 290% gains over prior generations; higher is better]
LS-DYNA Summary
• Performance
– Compute: the Intel Haswell cluster outperforms previous system generations
• Outperforms Ivy Bridge by 47%, Sandy Bridge by 75%, Westmere by 148%, and Nehalem by 290%
• Slight improvement of ~2-4% from using the executable with AVX2 instructions
– Turbo Mode: running with Turbo Boost provides a ~25% performance boost in some cases
• Turbo Boost enables processors to run above their base frequency
– Network: EDR InfiniBand and the HPC-X MPI library deliver superior scalability in application performance
• EDR IB provides over 4-5x higher performance vs 1GbE, 10GbE and 40GbE, and 15% vs FDR IB at 32 nodes
• MPI Tuning
– HPC-X enhances LS-DYNA performance at scale
• MXM UD provides a speedup of 38% over the un-tuned baseline run at 32 nodes
– HPC-X outperforms Platform MPI and Open MPI in scalability performance
• Up to 27% better than Platform MPI on neon_refined_revised, and 8% better on 3cars
Case Study:
GROMACS
GROMACS
• GROMACS (GROningen MAchine for Chemical Simulations)
– A molecular dynamics simulation package
– Primarily designed for biochemical molecules like proteins, lipids and nucleic acids
– Many algorithmic optimizations have been introduced in the code; extremely fast at calculating the nonbonded interactions
– Ongoing development extends GROMACS with interfaces to both Quantum Chemistry and Bioinformatics/databases
– Open-source software released under the GPL
Objectives
• The presented research was done to provide best practices
– GROMACS performance benchmarking
• CPU performance comparison
• MPI library performance comparison
• Interconnect performance comparison
• System generations comparison
• The presented results will demonstrate
– The scalability of the compute environment/application
– Considerations for higher productivity and efficiency
Test Cluster Configuration
• Dell PowerEdge R730 32-node (896-core) “Thor” cluster
– Dual-Socket 14-Core Intel E5-2697 v3 @ 2.60 GHz CPUs (BIOS Power Management set to Maximum Performance)
– Memory: 64GB DDR4 2133 MHz (BIOS Memory Snoop Mode set to Home Snoop; Turbo enabled)
– OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack
– Hard Drives: 2x 1TB 7.2K RPM SATA 2.5” in RAID 1
• Mellanox ConnectX-4 100Gb/s EDR InfiniBand Adapters
• Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand Switch
• Mellanox ConnectX-3 FDR InfiniBand, 10/40GbE Ethernet VPI Adapters
• Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch
• MPI: Mellanox HPC-X v1.2.0-326
• Compiler and Libraries: Intel Composer XE 2015.3.187 and MKL
• Application: GROMACS 4.6.7
• Benchmark datasets: DPPC in Water (d.dppc, 121856 atoms, 150000 steps, SP) unless stated otherwise
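For reference, a benchmark run of this kind is launched through the MPI-enabled mdrun of GROMACS 4.6.x; a sketch (binary and file names are illustrative; -noconfout skips writing the final configuration, a common benchmarking convenience):

```shell
# Single-precision MPI build of GROMACS 4.6.x, one rank per core.
mpirun -np 896 mdrun_mpi -s d.dppc.tpr -nsteps 150000 -noconfout
```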
GROMACS Performance – Network Interconnects
• InfiniBand is the only interconnect tested that delivers scalable performance
– EDR InfiniBand provides higher performance and better scaling than 1GbE, 10GbE, or 40GbE
– Performance over Ethernet stays flat (stops scaling) beyond 2 nodes
– EDR InfiniBand outperforms 10GbE-RoCE by 55% at 32 nodes / 896 cores
– EDR InfiniBand demonstrates continuous performance gain at scale
[Chart: 55% / 4.1x / 4.2x / 4.6x gains; 28 MPI Processes / Node; higher is better]
GROMACS Profiling – Time Spent by MPI Calls
• The most time-consuming MPI call is MPI_Sendrecv
– MPI_Sendrecv: 66% of MPI time (27% of runtime) at 32 nodes (896 cores)
– MPI_Waitall: 18% (7% of runtime); MPI_Bcast: 6% (2% of runtime)
– Point-to-point and non-blocking sends/receives consume most of the MPI time in GROMACS
[Chart: 32 nodes]
GROMACS Profiling – MPI Message Sizes
• The majority of data-transfer messages are medium sized
– MPI_Sendrecv messages concentrate in the 8B to 8KB range
– MPI_Bcast also shows some concentration of message sizes
[Chart: 32 nodes]
GROMACS Profiling – MPI Data Transfer
• As the cluster grows, similar communication behavior is seen
– The majority of communications are between neighboring ranks
– Non-blocking and point-to-point data transfers are shown in the graph
– Collective data volumes are small compared to point-to-point communications
[Charts: 2 Nodes / 56 Cores; 32 Nodes / 896 Cores]
GROMACS Performance – EDR vs FDR InfiniBand
• EDR InfiniBand delivers superior scalability in application performance
– As the number of nodes scales, the performance gap in favor of EDR IB widens
• Performance advantage of EDR InfiniBand increases at larger core counts
– EDR InfiniBand provides 29% versus FDR InfiniBand at 32 nodes (896 cores)
[Chart: 29% gain; 28 MPI Processes / Node; higher is better]
GROMACS Performance – System Generations
• The Thor cluster (based on Intel E5-2697 v3 “Haswell”) outperforms prior generations
– 1.1x to 3.5x higher performance than clusters based on previous generations of Intel architecture
• System components used:
– Janus: 2-socket 6-core Xeon X5670 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR IB
– Jupiter: 2-socket 8-core Xeon E5-2680 @ 2.7GHz, 1600MHz DIMMs, ConnectX-3 FDR IB
– Thor: 2-socket 14-core Xeon E5-2697 v3 @ 2.6GHz, 2133MHz DIMMs, ConnectX-4 EDR IB
[Chart: 118% / 356% gains; higher is better]
GROMACS Performance – Cores Per Node
• Running more CPU cores per node provides higher performance
– ~7-10% higher productivity with 28 PPN compared to 24 PPN
– Higher demand on memory bandwidth and the network may limit performance as more cores are used
[Chart: 7% / 13% / 39% gains; CPU @ 2.6GHz; higher is better]
GROMACS Performance – Turbo Mode & CPU Clock
• Advantages are seen from running at a higher clock rate
– Either by enabling Turbo mode or by using a higher CPU clock frequency
• Boosting the CPU clock rate yields higher performance at lower cost
– Increasing the clock from 2300MHz to 2600MHz runs 11% faster
[Chart: 16% / 11% gains; CPU @ 2.6GHz; higher is better]
GROMACS Performance – Floating Point Precision
• GROMACS can run at either SP or DP floating-point precision
• Running at SP is shown to be faster than running at DP
– Around 41-47% faster at SP (Single Precision) versus DP (Double Precision)
– All other slides use Single Precision
[Chart: 41% / 47% gains; CPU @ 2.6GHz; higher is better]
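Precision in GROMACS is fixed at build time: single precision is the default, and double precision is selected with the GMX_DOUBLE CMake option (a sketch for the 4.6.x CMake build; double-precision binaries get a _d suffix):

```shell
# Single precision (default)
cmake .. -DGMX_MPI=ON -DGMX_DOUBLE=OFF && make
# Double precision: produces mdrun_d / mdrun_mpi_d, etc.
cmake .. -DGMX_MPI=ON -DGMX_DOUBLE=ON && make
```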
GROMACS Summary
• The latest system generation improves GROMACS performance at scale
– Compute: the Intel Haswell cluster outperforms previous system generations
• The Haswell cluster outperforms the Sandy Bridge cluster by 110% and the Westmere cluster by 350% at 32 nodes
– Compute: running more CPU cores provides higher performance
• ~7-10% higher productivity with 28 PPN compared to 24 PPN
– Network: EDR InfiniBand delivers superior scalability in application performance
• EDR InfiniBand provides higher performance and better scaling than 1GbE, 10GbE, or 40GbE
• Ethernet (1GbE/10GbE/40GbE) performance stays flat (stops scaling) beyond 2 nodes
• EDR InfiniBand outperforms 10GbE-RoCE by 55% at 32 nodes / 896 cores
– Precision: running at Single Precision is substantially faster than Double Precision
• Around 41-47% faster at SP versus DP
– MPI profiling shows the majority of data transfers are point-to-point and non-blocking communications
• MPI_Sendrecv and MPI_Waitall are the most used MPI calls
Thank You HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “as is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.