Application Performance in Chemistry and Materials Science:
Intel’s Xeon Scalable Processor Family (Skylake) and AMD’s EPYC Processor
Martyn Guest≠, Christine Kitchen≠ & Enguerrand Petit†
≠ Cardiff University † Performance Engineering Group, Atos
14 December 2017
Outline
• Measurement of parallel application performance featuring synthetics and
end-user applications across a variety of clusters
¤ Synthetic Codes – STREAM, IMB (interconnect) and HPCC
¤ Variety of End-user Codes – DL_POLY, GROMACS, NAMD, LAMMPS,
GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM
• Focus on Intel’s Xeon Scalable processors (“Skylake”), including
¤ Intel Xeon Gold 6150 Processor (18c, 24.75M Cache, 2.70 GHz)
¤ Intel Xeon Gold 6142 Processor (16c, 22M Cache, 2.60 GHz)
¤ Intel Xeon Gold 6148 Processor (20c, 27.5M Cache, 2.40 GHz)
¤ Intel Xeon Gold 6130 Processor (16c, 22M Cache, 2.10 GHz).
¤ Clusters with dual-socket nodes + Mellanox EDR & Intel’s OPA Interconnects.
• Comparison with a number of HPC systems based on earlier CPUs:
¤ Intel’s Broadwell (16-core E5-2697A v4 2.6 GHz & 14-core E5-2680 v4 2.4 GHz) and Sandy Bridge (E5-2690 & E5-2670) clusters from Bull and Fujitsu
¤ Mellanox IB EDR and Intel’s Omni-Path (OPA) interconnects
¤ IBM’s Power8 cluster – S822LC
• Preliminary Evaluation of the AMD EPYC 7601 Processor
Contents
1. Performance Benchmarks and Cluster Systems
a. Synthetic Code Performance: STREAM, IMB, and HPCC
b. Application Code Performance: DL_POLY, GROMACS, LAMMPS, NAMD, GAMESS-UK, VASP, Quantum Espresso, CP2K, and OpenFOAM
2. Selecting Fabrics and Optimising Performance
a. Interconnect Performance: Mellanox EDR Infiniband and
Intel’s Omnipath (OPA)
3. Relative Code Performance: Processor Family and
Interconnect – “core to core” and “node to node” benchmarks.
4. Preliminary Evaluation of the AMD EPYC 7601 Processor
a. A fairly radical design departure from what Intel offers
5. Acknowledgements and Summary
The Xeon Skylake Architecture
• The architecture of Skylake is very different from that of the prior “Haswell” and “Broadwell” Xeon chips.
• Three basic variants now cover what were formerly the Xeon E5 and Xeon E7 product lines, with Intel converging the E5 and E7 chips onto a single socket.
• Product segmentation – Platinum, Gold, Silver, & Bronze – with 51
variants of the SP chip
• Also custom versions requested by hyperscale and OEM customers.
• All of these chips differ from each other in a number of ways, including
number of cores, clock speed, L3 cache capacity, number and speed of
UltraPath links between sockets, number of sockets supported, main
memory capacity, width of the AVX vector units, etc.
Intel Xeon: Westmere – Skylake
(columns: Xeon 5600 “Westmere-EP”; Xeon E5-2600 “Sandy Bridge-EP”; Xeon E5-2600 v4 “Broadwell-EP”; Intel Xeon Scalable Processor “Skylake”)
Cores / Threads: up to 6 / 12; up to 8 / 16; up to 22 / 44; up to 28 / 56
Last-level cache: 12 MB; up to 20 MB; up to 55 MB; up to 38.5 MB (non-inclusive)
Max memory channels, speed / socket: 3 × DDR3, 1333; 4 × DDR3, 1600; 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz; 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
New instructions: AES-NI; AVX 1.0 (8 DP Flops/clock); AVX 2.0 (16 DP Flops/clock); AVX-512 (32 DP Flops/clock)
QPI / UPI speed: 1 QPI channel @ 6.4 GT/s; 2 QPI channels @ 8.0 GT/s; 2 QPI channels @ 9.6 GT/s; up to 3 UPI @ 10.4 GT/s
PCIe lanes / controllers / speed: 36 lanes PCIe 2.0 on chipset; 40 lanes/socket integrated PCIe 3.0; 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s); 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP: 130 W; up to 130 W server, 150 W workstation; 55 – 145 W; 70 – 205 W
Baseline Cluster Systems
Intel Sandy Bridge Clusters
• “Raven”: 128 × Bull|ATOS b510 EP-nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz); Mellanox QDR InfiniBand.
• Supercomputing Wales: 384 × Fujitsu CX250 EP-nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz); Mellanox QDR InfiniBand.
Intel Broadwell Clusters
• Dell PE R730/R630, Broadwell EP-2697A v4 2.6 GHz 16C: HPC Advisory Council “Thor” cluster, 36-node Dell PowerEdge R730/R630: 2 × Xeon E5-2697A v4 @ 2.6 GHz, 16 cores, 145 W TDP, 40 MB cache, 256 GB DDR4-2400; Interconnect: ConnectX-4 EDR.
• ATOS Broadwell EP-2680 v4 2.4 GHz: 32-node cluster; node config: 2 × Xeon E5-2680 v4 @ 2.4 GHz, 145 W TDP, 40 MB cache, 128 GB DDR4-2400; Interconnect: Mellanox ConnectX-4 EDR and Intel OPA.
IBM Power8 S822LC
• IBM Power8 S822LC with Mellanox EDR: 20 cores, 3.49 GHz with performance CPU governor; 256 GB memory; 1 × IB (EDR) port; 2 × NVIDIA K80 GPU; IBM PE (Parallel Environment); OS: RHEL 7.2 LE; Compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (IBM Advance Toolchain 9.0).
Skylake Cluster Systems
• 32-node Dell|EMC cluster running SLURM with separate partitions for each processor SKU; Mellanox EDR:
¤ Intel Xeon Gold 6150 (24.75M cache, 2.70 GHz; max turbo 3.70 GHz); 18 cores / 36 threads; DDR4-2666; TDP 165 W; 3 UPI links.
¤ Intel Xeon Gold 6142 (22M cache, 2.60 GHz; max turbo 3.70 GHz); 16 cores / 32 threads; DDR4-2666; TDP 150 W; 3 UPI links.
• 28-node Dell|EMC cluster running SLURM; Intel OPA:
¤ Intel Xeon Gold 6130 (22M cache, 2.10 GHz; max turbo 3.70 GHz); 16 cores / 32 threads; DDR4-2666; TDP 125 W; 3 UPI links.
¤ The 6130s are configured with 12 × 8 GB 2666 DIMMs rather than 12 × 16 GB, resulting in somewhat slower memory bandwidth (165 GB/s vs 195 GB/s STREAM Triad).
• 20-node Bull|ATOS cluster running SLURM; Mellanox EDR:
¤ Intel Xeon Gold 6150 (24.75M cache, 2.70 GHz; max turbo 3.70 GHz); 18 cores / 36 threads; DDR4-2666; TDP 165 W; 3 UPI links; SMT.
• 16-node Intel cluster running SLURM; Intel OPA:
¤ Intel Xeon Gold 6148 (27.5M cache, 2.40 GHz; max turbo 3.70 GHz); 20 cores / 40 threads; DDR4-2666; TDP 150 W; 3 UPI links; SMT.
The Performance Benchmarks
• The test suite comprises both synthetics and end-user applications. Synthetics include HPCC (http://icl.cs.utk.edu/hpcc/), the IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-benchmarks), IOR and STREAM.
• A variety of “open source” and commercial end-user application codes. These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g. memory bandwidth and latency, node floating-point performance, interconnect performance (both latency and bandwidth) and sustained I/O performance:
¤ GROMACS, LAMMPS, NAMD, DL_POLY Classic & DL_POLY 4 (molecular dynamics)
¤ Quantum Espresso, Siesta, CP2K, ONETEP, CASTEP and VASP (ab initio materials properties)
¤ NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)
Memory B/W – STREAM performance: TRIAD [Rate (MB/s)]
(OMP_NUM_THREADS = #cores, KMP_AFFINITY=physical)
Bull b510 “Raven” Sandy Bridge e5-2670/2.6GHz IB-QDR: 74,309
ClusterVision e5-2650v2 2.6GHz: 93,486
Dell R730 Haswell e5-2697v3 2.6GHz (T): 118,605
Dell OPA32 e5-2660v3 2.6GHz (T) OPA: 114,367
Thor Broadwell e5-2697A v4 2.6GHz (T): 132,035
ATOS Broadwell e5-2680v4 2.4GHz (T) OPA: 128,083
Dell Skylake Gold 6130 2.1GHz (T) OPA: 165,974
Dell Skylake Gold 6142 2.6GHz (T): 185,863
Dell Skylake Gold 6148 2.4GHz (T): 195,122
IBM Power8 S822LC 2.92GHz IB/EDR: 184,087
AMD EPYC 7601 2.2 GHz: 279,562
Memory B/W – STREAM / core performance: TRIAD [Rate (MB/s) per core]
(OMP_NUM_THREADS = #cores, KMP_AFFINITY=physical)
Bull b510 “Raven” Sandy Bridge e5-2670/2.6GHz IB-QDR: 4,644
ClusterVision e5-2650v2 2.6GHz: 5,843
Dell R730 Haswell e5-2697v3 2.6GHz (T): 4,236
Dell OPA32 e5-2660v3 2.6GHz (T) OPA: 5,718
Thor Broadwell e5-2697A v4 2.6GHz (T): 4,126
ATOS Broadwell e5-2680v4 2.4GHz (T) OPA: 4,574
Dell Skylake Gold 6130 2.1GHz (T) OPA: 5,187
Dell Skylake Gold 6142 2.6GHz (T): 5,808
Dell Skylake Gold 6148 2.4GHz (T): 4,878
IBM Power8 S822LC 2.92GHz IB/EDR: 9,204
AMD EPYC 7601 2.2 GHz: 4,368
MPI Performance – PingPong: IMB Benchmark (Intel), 1 PE / node
[Log–log chart of latency and bandwidth (Mbytes/sec) vs message length (bytes) for: ATOS AMD EPYC 7601 2.2 GHz (T) EDR; Intel SKL Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6150 2.7GHz (T) EDR; IBM Power8 S822LC 2.92GHz IB/EDR; Thor BDW e5-2697A v4 2.6GHz (T) EDR; Intel BDW e5-2690v4 2.6GHz (T) OPA; Dell OPA32 e5-2660v3 2.6GHz (T) OPA; Bull HSW E5-2680v3 2.5 GHz (T) Connect-IB; Dell R720 e5-2680v2 2.8 GHz (T) Connect-IB; Azure A9 WE (e5-2670 2.6 GHz) IB RDMA; Merlin Xeon E5472 3.0 GHz QC + IB (mvapich2 1.4). Annotated extremes: short-message latencies of 1.7 and 3.8 µs; peak bandwidths of 1,729, 3,694, 5,957 and 11,466 Mbytes/sec.]
MPI Collectives – Alltoallv (128 PEs): IMB Benchmark (Intel)
[Log–log chart of measured time (µsec) vs message length (bytes) for: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR. Time-consuming messages called by Alltoall & Alltoallv (IPM).]
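The IMB runs above can be reproduced with a minimal launcher fragment; a sketch, assuming Intel MPI with the IMB binaries on the PATH and a suitable node allocation:

# PingPong between two nodes, one rank per node (as in the chart above)
mpirun -np 2 -ppn 1 IMB-MPI1 PingPong
# Alltoallv across 128 ranks, 16 ranks per node
mpirun -np 128 -ppn 16 IMB-MPI1 Alltoallv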
HPC Challenge Benchmark (Source – http://icl.cs.utk.edu/hpcc/)
• HPL – the Linpack TPP benchmark; measures the floating-point rate of execution for solving a linear system of equations.
• DGEMM – measures the floating-point rate of execution of double-precision real matrix–matrix multiplication.
• STREAM – measures sustainable memory B/W (in GB/s) and the corresponding computation rate for simple vector kernels.
• PTRANS – parallel matrix transpose; exercises the communications where pairs of processors communicate with each other simultaneously. A useful test of the total communications capacity of the network.
• RandomAccess – measures the rate of integer random updates of memory (GUPS).
• FFTE – measures the floating-point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform. Performance is a combination of flops, memory and network bandwidth.
• Communication B/W and latency – tests to measure latency and B/W of a number of simultaneous communication patterns; based on b_eff (the effective bandwidth benchmark).
HPL problem size N by CPU core count:
128 cores – N = 83,000; 256 cores – N = 117,000; 512 cores – N = 166,000; 1024 cores – N = 234,000
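As a sketch (assuming an hpcc binary built against the local MPI), the 128-PE runs take their problem size from the HPL section of hpccinf.txt:

# N = 83000 for 128 PEs, per the table above; hpccinf.txt holds the input parameters
mpirun -np 128 ./hpcc      # results are written to hpccoutf.txt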
HPCC – 128 Processing Elements [Matrix Size 83,000]
[Chart: results normalised to 1.0 per metric – HPL [Gflops], G-PTRANS [GB/s], G-RandomAccess [Gup/s], G-FFTE [Gflops], EP-STREAM Sys [GB/s], EP-STREAM Triad [GB/s], EP-DGEMM [Gflops], Random Ring Bandwidth [GBytes], Random Ring Latency [µsec] – for: Fujitsu CX250 e5-2670 2.6 GHz QDR; Bull b510 (Raven) e5-2670 2.6GHz IB-QDR; ATOS Genji e5-2680 v4 2.4GHz (T) OPA IMPI; Thor Dell e5-2697A v4 2.6GHz (T) EDR HPCX; Thor Dell e5-2697A v4 2.6GHz (T) OPA; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR.]
Application Code Performance in Materials
Science, Chemistry and Nanoscience:
DLPOLY, GROMACS, NAMD, LAMMPS, GAMESS,
NWChem, GAMESS-UK, ONETEP, VASP, SIESTA,
CASTEP, Quantum Espresso, CP2K – on a variety of HPC
systems.
IPM Performance Monitoring
http://ipm-hpc.sourceforge.net/userguide.html
IPM is a profiling tool that helps analyse MPI programs.
• Very easy to use, requires no code modifications (unless you want more information), and provides a lightweight profiling interface with very low overhead (<2%).
• Can create HTML output that includes graphical representations of the data.
There are three ways of running a program with IPM profiling:
1. Set an environment variable before you run your program:
$ export LD_PRELOAD=/application/tools/ipm-2.0.6/install-impi/lib/libipm.so
2. Recompile your program with IPM enabled:
$ mpicc your_program.c -o your_program -L/path/to/ipm/lib -lipm
3. Use “export I_MPI_STATS=ipm” if using Intel’s mpirun or mpiexec.hydra.
When executing a program with IPM, an XML file is created that can be parsed to text or HTML using ''ipm_parse -html xmlfile''.
IPM 2.0.6
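For example, a complete preload-based profiling run might look as follows (a sketch: the library path is site-specific and DLPOLY.Z stands in for any MPI executable):

export LD_PRELOAD=/application/tools/ipm-2.0.6/install-impi/lib/libipm.so
mpirun -np 128 ./DLPOLY.Z      # run as normal; IPM writes an XML profile
ipm_parse -html <xml file>     # convert the XML profile to an HTML report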
Allinea Performance Reports
Allinea Performance Reports provides a mechanism to characterise and understand the performance of HPC application runs through a single-page HTML report.
• Based on Allinea MAP's adaptive sampling technology, which keeps the data volumes collected and the application overhead low.
• Modest application slowdown (ca. 5%) even with 1000s of MPI processes.
• Runs on existing codes: a single command added to execution scripts.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command:
• perf-report mpiexec -n 4 $code
• The Report Summary characterises how the application's wallclock time was spent, broken down into CPU, MPI and I/O.
• All examples were generated on the Broadwell Mellanox cluster (E5-2697A v4).
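A minimal SLURM submission sketch (the module name is site-specific and illustrative only):

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32
module load allinea-reports      # site-specific module name (assumption)
# prefix the usual launch line; a single-page HTML report is produced
perf-report mpiexec -n 256 $code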
Molecular Simulation I. DL_POLY
Molecular Dynamics Codes: AMBER, DL_POLY, CHARMM, NAMD, LAMMPS, GROMACS etc.
DL_POLY: developed as the CCP5 parallel MD code by W. Smith, T.R. Forester and I. Todorov
• UK CCP5 + international user community
• DL_POLY Classic (replicated data) and DL_POLY 3 & 4 (distributed data – domain decomposition)
• Areas of application: liquids, solutions, spectroscopy, ionic solids, molecular crystals, polymers, glasses, membranes, proteins, metals, solid and liquid interfaces, catalysis, clathrates, liquid crystals, biopolymers, polymer electrolytes.
DL_POLY 4 – Distributed data
Domain Decomposition – distributed data:
• Distribute atoms, forces across the nodes
¤ More memory efficient, can address much larger cases (10^5–10^7 atoms)
• SHAKE and short-range forces require only neighbour communication
¤ communications scale linearly with the number of nodes
• Coulombic energy remains global
¤ Adopt the Smooth Particle Mesh Ewald scheme
• includes a Fourier transform of the smoothed charge density (reciprocal space grid typically 64×64×64 – 128×128×128)
http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx
W. Smith and I. Todorov
Benchmarks:
1. NaCl simulation; 216,000 ions, 200 time steps, cutoff = 12 Å
2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps
DL_POLY 4 – Gramicidin Simulation: Performance Data (64–256 PEs)
Gramicidin 792,960 atoms; 50 time steps. Performance relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs).
[Bar chart, 64–256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR. Factors range from ~1.8 at 64 PEs to ~9.1 at 256 PEs. SKL 6142 2.6 GHz ~ 1.06 × e5-2697v4 2.6 GHz.]
DL_POLY 4 – Gramicidin Simulation Performance Report: Performance Data (32–256 PEs)
Smooth Particle Mesh Ewald scheme.
[Charts at 32, 64, 128 and 256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]
“DL_POLY_4 and Xeon Phi: Lessons Learnt”, Alin Marin Elena, Christian Lalanne, Victor Gamayunov, Gilles Civario, Michael Lysaght and Ilian Todorov
Molecular Simulation II. GROMACS
GROMACS v5.0 (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen].
Berk Hess et al., “GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation”, Journal of Chemical Theory and Computation 4 (3): 435–447.
Ion channel system
• The 142k-particle ion channel system is the membrane protein GluCl – a pentameric chloride channel embedded in a DOPC membrane and solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This system is a challenging parallelisation case due to its small size, but it is one of the most wanted target sizes for biomolecular simulations.
Lignocellulose
• GROMACS Test Case B from the UEABS benchmark suite: a model of cellulose and lignocellulosic biomass in an aqueous solution. This 3.3M-atom system is inhomogeneous and uses reaction-field electrostatics instead of PME, and therefore should scale well. A typical benchmark invocation is sketched below.
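A sketch of such an invocation (the input file name is illustrative; -resethway and -noconfout are standard mdrun benchmarking flags):

# 256-PE run of the ion channel case; timers reset half-way to exclude start-up
mpirun -np 256 gmx_mpi mdrun -s ion_channel.tpr -nsteps 10000 \
       -resethway -noconfout -g ion_channel.log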
GROMACS – Ion Channel Simulation: Performance Data (64–256 PEs)
142k-particle ion channel system; performance in ns/day.
[Bar chart, 64–256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR. Rates range from ~20.6 ns/day (64 PEs, slowest) to ~123.6 ns/day (256 PEs, fastest).]
GROMACS – Ion-channel Performance Report: Performance Data (32–256 PEs)
[Charts at 32, 64, 128 and 256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]
GROMACS – Lignocellulose: Performance Data (64–256 PEs)
3,316,463-atom system using reaction-field electrostatics instead of PME; performance in ns/day.
[Bar chart, 64–256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR. Rates range from ~0.9 ns/day (64 PEs, slowest) to ~6.1 ns/day (256 PEs, fastest).]
GROMACS – IPM Reports: Performance Data (256 PEs)
Ion-channel: wallclock 26.8 secs; # mpi_tasks: 256 on 8 nodes; %comm: 34.90% (largest individual MPI contributions ~13%, 10% and 7% of total time).
Lignocellulose: wallclock 352.1 secs; # mpi_tasks: 256 on 8 nodes; %comm: 5.81% (largest individual MPI contributions ~4%, 1% and 1% of total time).
GAMESS-UK – Moving to Distributed Data: the MPI/ScaLAPACK implementation of the GAMESS-UK SCF/DFT module
• An alternative, pragmatic approach in which:
¤ MPI-based tools (such as ScaLAPACK) are used in place of Global Arrays
¤ All data structures except those required for the Fock matrix build (F, P) are fully distributed
• The partially distributed model was chosen because, in the absence of efficient one-sided communications, it is difficult to load balance a distributed Fock matrix build.
• The obvious drawback is that some large replicated data structures are required.
¤ These are kept to a minimum: for a closed-shell HF or DFT calculation only 2 replicated matrices are required, 1 × Fock and 1 × Density (doubled for UHF).
“The GAMESS-UK electronic structure package: algorithms, developments and applications”, M.F. Guest, I.J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H. van Lenthe, R.W.A. Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.
GAMESS-UK Performance – Zeolite Y cluster
Zeolite Y cluster SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP (3975 GTOs). Performance relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs).
[Bar chart at 128 and 256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ClusterVision e5-2650v2 2.6GHz TrueScale QDR; Intel Ivy Bridge e5-2690v2 3.0GHz (T) True Scale QDR; Bull Haswell e5-2695v3 2.3GHz Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR; Huawei Fusion CH140 e5-2683 v4 2.1GHz (T) EDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) EDR; Thor Broadwell e5-2697A v4 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR; IBM Power8 S822LC 2.92GHz IB/EDR. SKL 6142 2.6 GHz ~ 1.05 × e5-2697v4 2.6 GHz.]
GAMESS-UK.MPI – DFT Performance Report: Performance Data (32–256 PEs)
Cyclosporin, 6-31G** basis (1855 GTOs); DFT B3LYP.
[Charts at 32, 64, 128 and 256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]
Computational Materials – Advanced Materials Software
• VASP – performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.
• Quantum Espresso – an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory (DFT), plane waves and pseudopotentials.
• SIESTA – an O(N) DFT code for electronic-structure calculations and ab initio molecular dynamics simulations of molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K – a program to perform atomistic and molecular simulations of solid-state, liquid, molecular and biological systems. It provides a framework for different methods, e.g. DFT using a mixed Gaussian and plane-waves approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) – a linear-scaling code for quantum-mechanical calculations based on DFT.
Quantum Espresso
Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves and pseudopotentials. Capabilities include: ground-state calculations; structural optimisation; transition states and minimum energy paths; ab initio molecular dynamics; response properties (DFPT); spectroscopic properties; quantum transport.
Benchmark details:
• DEISA AU112 – Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288)
• PRACE GRIR443 – Carbon–Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions: (180, 180, 192)
Quantum Espresso – Au112: Performance Data (32–320 PEs)
Relative to the Fujitsu e5-2670 2.6 GHz 8-C (32 PEs).
[Bar chart, 32–320 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL; Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Intel Skylake Gold 6148 2.4GHz (T) OPA. Relative performance reaches ~8.8 at 320 PEs on the fastest system.]
Quantum Espresso – Au112 Performance Report: Performance Data (32–256 PEs)
Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288).
[Charts at 32, 64, 128 and 256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]
Quantum Espresso – Au112 IPM Report: Performance Data (256 PEs)
wallclock: 108.3 secs; # mpi_tasks: 256 on 8 nodes; %comm: 72.52%, dominated by MPI_Alltoall (individual contributions ~36%, 17% and 8% of total time). The time-consuming messages called by Alltoall are modest sized (<1KB).
VASP – Vienna Ab-initio Simulation Package
VASP (5.4.1) performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.
Zeolite Benchmark
• Zeolite (Si96O192) with the MFI structure unit cell, running a single-point calculation with a 400 eV plane-wave cutoff using the PBE functional
• 2 k-points; maximum number of plane waves: 96,834
• FFT grid: NGX=65, NGY=65, NGZ=43, giving a total of 181,675 points
Pd-O Benchmark
• Pd-O complex (Pd75O12), 5×4 3-layer supercell, running a single-point calculation with a 400 eV plane-wave cutoff. Uses the RMM-DIIS algorithm for the SCF and is calculated in real space.
• 10 k-points; maximum number of plane waves: 34,470
• FFT grid: NGX=31, NGY=49, NGZ=45, giving a total of 68,355 points
VASP 5.4.1 – Pd-O Benchmark
Palladium–Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points. Performance relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs).
[Bar chart, 64–256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR. Factors reach ~6.8 at 256 PEs.]
VASP 5.4.1 – Pd-O Benchmark – Parallelisation on k-points
Palladium–Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points. Performance relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs).
[Bar chart, 64–256 PEs, systems as in the previous figure plus Intel Skylake Gold 6148 2.4GHz (T) OPA [KPAR=2]; the added KPAR=2 series reaches relative performance of 4.0, 6.1 and 9.2 at 64, 128 and 256 PEs, well ahead of the default settings.]
KPAR/NPAR settings used (sketched in the INCAR fragment below):
NPEs 64: KPAR 2, NPAR 2; NPEs 128: KPAR 2, NPAR 4; NPEs 256: KPAR 2, NPAR 8
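A sketch of how the table above maps onto the VASP input (KPAR and NPAR are the standard INCAR parallelisation tags; vasp_std is the usual 5.4.x executable name):

# 128-PE case: 2 k-point groups of 64 ranks, NPAR=4 within each group
cat >> INCAR <<EOF
KPAR = 2
NPAR = 4
EOF
mpirun -np 128 vasp_std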
VASP – Pd-O Benchmark Performance Report: Performance Data (32–256 PEs)
Palladium–Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points.
[Charts at 32, 64, 128 and 256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]
VASP 5.4.1 – Zeolite Benchmark
Zeolite (Si96O192) with the MFI structure unit cell, running a single-point calculation with a 400 eV plane-wave cutoff using the PBE functional; maximum number of plane waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points. Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs).
[Bar chart, 64–256 PEs, systems as in the Pd-O chart; factors reach ~4.7 at 256 PEs. IBM Power8 S822LC: 4.3.]
OpenFOAM – the open source CFD toolbox
The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox is a free, open-source CFD software package produced by OpenCFD Ltd. (http://www.openfoam.com/)
• Features: an extensive range of capabilities to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics and electromagnetics.
• Applications: includes over 80 solver applications that simulate specific problems in engineering mechanics, and over 170 utility applications that perform pre- and post-processing tasks, e.g. meshing, data visualisation, etc.
Lid-driven cavity flow (Cavity 3d)
• Isothermal, incompressible flow in a 2D square domain. All boundaries of the square are walls: the top wall moves in the x-direction at 1 m/s while the other three are stationary. Initially the flow is assumed laminar and is solved on a uniform mesh using the icoFoam solver; the standard workflow is sketched below.
Geometry of the lid-driven cavity.
http://www.openfoam.org/docs/user/cavity.php#x5-170002.1.5
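A sketch of the standard parallel workflow for the cavity case (assumes the case's decomposeParDict requests 64 subdomains):

blockMesh                        # generate the uniform mesh
decomposePar                     # partition the case across processors
mpirun -np 64 icoFoam -parallel  # solve in parallel
reconstructPar                   # reassemble the decomposed solution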
OpenFOAM – Cavity 3d-3M: Performance Data (32–512 PEs)
OpenFOAM with the lid-driven cavity flow 3d-3M data set; performance relative to the Fujitsu CX250 e5-2670 8-C (32 PEs).
[Bar chart, 32–512 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR; Intel Skylake Gold 6148 2.4GHz (T) OPA. Relative performance rises to ~36.1 at 512 PEs on the fastest Skylake system.]
OpenFOAM – Cavity 3d-3M Performance Report: Performance Data (32–256 PEs)
OpenFOAM with the lid-driven cavity flow 3d-3M data set.
[Charts at 32, 64, 128 and 256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]
Application Code Performance in Materials
Science, Chemistry and Nanoscience:
2. Selecting Fabrics and Optimising
Performance:
Intel MPI, Mellanox HPCX, IPM
and Allinea Performance Report.
Selecting Fabrics – MPI Optimisation
• Intel MPI Library – can select a communication fabric at runtime without having to recompile the application. By default it automatically selects the most appropriate fabric based on both S/W and H/W configuration, i.e. in most cases you do not have to worry about manually selecting a fabric.
• Specifying a particular fabric can boost performance. Fabrics can be specified for both intra-node and inter-node communications. The following fabrics are available:
¤ shm – shared memory (for intra-node communication only)
¤ dapl – Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB) and iWARP (through DAPL)
¤ ofa – OpenFabrics Alliance (OFA) fabrics, e.g. InfiniBand (through OFED verbs)
¤ tcp – TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB)
¤ tmi – Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture and Myrinet (through TMI)
¤ ofi – OpenFabrics Interfaces (OFI)-capable fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture, IB and Ethernet (through the OFI API)
• For inter-node communication, Intel MPI uses the first available fabric from the default fabric list. This list is defined automatically for each H/W and S/W configuration (see I_MPI_FABRICS_LIST). For most configurations it is: dapl, ofa, tcp, tmi, ofi.
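A minimal override sketch using the documented environment variables (./app is a placeholder executable):

# Force shared memory intra-node and DAPL inter-node ...
export I_MPI_FABRICS=shm:dapl
# ... or adjust the fallback order instead
export I_MPI_FABRICS_LIST=dapl,ofa,tcp,tmi,ofi
mpirun -np 128 ./app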
MPIFLAGS+="-genv DAT_OVERRIDE /etc/dat.conf "
MPIFLAGS+="-genv I_MPI_DAT_LIBRARY /usr/lib64/libdat2.so "
if [[ "$TRANSPORT" == "DAPL" ]]; then
MPIFLAGS+="-DAPL "
MPIFLAGS+="-genv I_MPI_FABRICS shm:dapl "
MPIFLAGS+="-genv I_MPI_DAPL_UD off "
MPIFLAGS+="-genv I_MPI_DAPL_PROVIDER ofa-v2-$HCA-${HCAPORT}u "
MPIFLAGS+="-genv DAPL_MAX_INLINE 256 "
MPIFLAGS+="-genv I_MPI_DAPL_RDMA_RNDV_WRITE on "
MPIFLAGS+="-genv DAPL_IB_MTU 4096 "
elif [[ "$TRANSPORT" == "OFA" ]]; then
MPIFLAGS+="-IB "
MPIFLAGS+="-genv MV2_USE_APM 0 "
MPIFLAGS+="-genv I_MPI_FABRICS shm:ofa "
MPIFLAGS+="-genv I_MPI_OFA_USE_XRC 1 "
MPIFLAGS+="-genv I_MPI_OFA_NUM_ADAPTERS 1 "
MPIFLAGS+="-genv I_MPI_OFA_ADAPTER_NAME $HCA "
MPIFLAGS+="-genv I_MPI_OFA_NUM_PORTS 1 "
fi
if [ "$NET" == "OPA" ]; then
MPIFLAGS="-PSM2 "
fi
MPIFLAGS+="-genv I_MPI_PIN on "
MPIFLAGS+="-genv I_MPI_DEBUG 4 "
MPIFLAGS+="-genv MALLOC_MMAP_MAX_ 0 -genv MALLOC_TRIM_THRESHOLD_ -1 "
45Application Performance in Materials Science
HCA=mlx5_0
HCAPORT=1
TRANSPORT=DAPL
mpirun -np $NP $MPIFLAGS
Selecting Fabrics – MPI Optimisation
Intel MPI Library
14 December 2017
Mellanox HPC-X Toolkit and Intel MPI
The Mellanox HPC-X Toolkit provides an MPI, SHMEM and UPC software suite for HPC environments. It delivers “enhancements to significantly increase the scalability & performance of message communications in the network”. It includes:
¤ Complete MPI, SHMEM, UPC package, including the Mellanox MXM and FCA acceleration engines
¤ Offload of collectives communication from the MPI process onto the Mellanox interconnect hardware
¤ Maximised application performance with the underlying hardware architecture; optimised for Mellanox InfiniBand and VPI interconnects
¤ Increased application scalability and resource efficiency
¤ Multiple transport support, including RC, DC and UD
¤ Intra-node shared memory communication
• The performance comparison was conducted on the Mellanox HP ProLiant E5-2697A v4 EDR-based cluster.
http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf
Application Performance & Interconnect
Two comparison exercises undertaken:
¤ For each application (and associated data sets) analyse the
performance as a function of interconnect – Mellanox EDR and
Intel OPA – as a function of increasing core count.
• DLPOLY4 & GROMACS – 128-1024 cores
• VASP PdO (128-384 cores) & Zeolite (128-512 cores)
• Quantum ESPRESSO (Au112, 64-512; GRIR443, 128-1024)
• OpenFOAM (64-512 cores)
¤ On the Mellanox HP Proliant- E5-2697A v4 EDR based Thor
cluster, compare for each application (and associated data sets)
the relative performance undertaken using Intel MPI and
Mellanox HPCX i.e.
T HPCX / T Intel-MPI
47Application Performance in Materials Science
http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf
14 December 2017
DL_POLY 4 – Gramicidin: Performance Data (128–1024 PEs)
Gramicidin 792,960 atoms; 50 time steps. Performance relative to the Fujitsu CX250 e5-2670/2.6 GHz 8-C (32 PEs).
[Bar chart, 128–1024 PEs: Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX; Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR. Factors rise from ~4.9–5.3 at 128 PEs to ~19.2–19.3 at 1024 PEs.]
GROMACS – Lignocellulose: Performance Data (128–1024 PEs)
3,316,463-atom system using reaction-field electrostatics instead of PME; performance in ns/day.
[Bar chart, 128–1024 PEs: Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR. Rates reach ~15.4–18.2 ns/day at 1024 PEs.]
VASP 5.4.1 – Zeolite Benchmark: Performance Data (128–512 PEs)
Zeolite (Si96O192) with the MFI structure unit cell, running a single-point calculation with a 400 eV plane-wave cutoff using the PBE functional; maximum number of plane waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points. Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs).
[Bar chart, 128–512 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR. Factors reach ~4.9–5.6 at 512 PEs.]
Quantum Espresso – GRIR443: Performance Data (128–1024 PEs)
[Relative to the Fujitsu e5-2670 2.6 GHz 8-C (96 PEs)]
[Bar chart, 128–1024 PEs: Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX; Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR. Factors reach ~5.2–7.7 at 1024 PEs.]
OpenFOAM – Cavity 3d-3M: Performance Data (64–512 PEs)
OpenFOAM with the lid-driven cavity flow 3d-3M data set; relative to the Fujitsu CX250 e5-2670 8-C (32 PEs).
[Bar chart, 64–512 PEs: Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX; Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR. Factors rise to ~33.1–39.7 at 512 PEs.]
DL_POLY 4 – Intel MPI vs. HPCX
[Chart: % Intel MPI performance vs. HPCX (90%–120%) against processor count (0–1024 PEs) for DL_POLY4 NaCl and DL_POLY4 Gramicidin.]
Intel MPI is seen to outperform HPC-X for the DL_POLY 4 NaCl test case at all core counts, and at lower core counts for Gramicidin.
GROMACS – Intel MPI vs. HPCX
[Chart: % Intel MPI performance vs. HPCX (95%–130%) against processor count (0–1024 PEs) for GROMACS ion channel and GROMACS lignocellulose.]
At no point does the HPC-X implementation of GROMACS outperform that using Intel MPI.
VASP 5.4.1 – Intel MPI vs. HPCX
[Chart: % Intel MPI performance vs. HPCX (50%–110%) against processor count (0–512 PEs) for the VASP Palladium complex and Zeolite cluster.]
Significantly different to the classical MD codes – HPCX is now seen to outperform Intel MPI for the Zeolite cluster at all core counts, and at larger core counts for the Palladium complex.
Quantum Espresso – Intel MPI vs. HPCX
[Chart: % Intel MPI performance vs. HPCX (65%–105%) against processor count (0–768 PEs) for Quantum Espresso GRIR443 and Au112.]
Significantly different to the classical MD codes – as with VASP, HPCX is seen to outperform Intel MPI at the larger core counts.
Application Performance in Chemistry and Materials Science:
3. Relative Performance as a Function of Processor Family and Interconnect
Target Codes and Data Sets – 128 PEs: 128 PE Performance [Applications]
[Chart: normalised performance (0.0–1.0) across DLPOLY-4 Gramicidin, DLPOLY-4 NaCl, GROMACS ion-channel, GROMACS lignocellulose, OpenFoam 3d3M, QE Au112, QE GRIR443, VASP Pd-O complex, VASP Zeolite complex and BSMBench Balance, for: Bull b510 “Raven” Sandy Bridge e5-2670/2.6 GHz IB-QDR; Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR.]
Improved Performance of Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR
SKL “Gold” 6130 2.1 GHz OPA vs. SB e5-2670 2.6 GHz QDR; 256 cores. Average Factor = 1.54.
[Bar chart of speed-up factors, ranging from 1.08 to 2.64, across: DLPOLYclassic Bench7, OpenFoam 3d3M, GAMESS-UK (hf12z), GAMESS-UK (Siosi7), GROMACS ion-channel, NAMD (F1atpase), GAMESS-UK (valino.A2), GAMESS-UK (cyc-sporin), GROMACS lignocellulose, LAMMPS (3d LJ melt), NAMD (apoa1), NAMD (stmv), DLPOLY-4 NaCl, DLPOLY-4 Gramicidin, CP2K H2O.256, VASP Zeolite complex, QE GRIR443, QE Au112 and VASP Pd-O complex.]
Improved Performance of Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR
SKL “Gold” 6150 2.7 GHz EDR vs. SB e5-2670 2.6 GHz QDR; 256 cores. Average Factor = 1.80.
[Bar chart of speed-up factors across the same 19 application data sets as in the previous chart.]
Improved Performance of Intel Skylake Gold 6148 2.4GHz (T) OPA vs. ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
SKL “Gold” 6148 2.4 GHz vs. BDW e5-2680v4 2.4 GHz; 256 cores. Average Factor = 1.16.
[Bar chart of speed-up factors, ranging from 0.87 to 1.59, across: OpenFoam 3d3M, VASP Zeolite complex, DLPOLY-4 NaCl, GAMESS-UK (hf12z), LAMMPS (3d LJ melt), GROMACS ion-channel, GAMESS-UK (valino.A2), DLPOLY-4 Gramicidin, GAMESS-UK (Siosi7), GAMESS-UK (cyc-sporin), NAMD (stmv), NAMD (apoa1), GROMACS lignocellulose, NAMD (F1atpase), VASP Pd-O complex, QE Au112, DLPOLYclassic Bench4, DLPOLYclassic Bench5, QE GRIR443 and DLPOLYclassic Bench7.]
Performance Benchmarks – Node to Node
• Analysis of performance metrics across a variety of data sets:
¤ “Core to core” and “node to node” workload comparisons
• The previous charts were based on core-to-core comparisons, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of increasing core count per socket.
¤ Focus on a six-node “node to node” comparison of the following:
1. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR [216 cores]
2. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR [192 cores]
3. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [168 cores]
4. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. “Thor” Broadwell cluster e5-2697A v4 2.6GHz (T) IB EDR [192 cores]
5. “Thor” Broadwell cluster e5-2697A v4 2.6GHz (T) IB EDR [192 cores] vs. Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR [216 cores]
¤ Benchmarks based on a set of 10 applications & 19 data sets.
Improved Performance of Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [192 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]
SKL “Gold” 6130 2.1 GHz OPA vs. SB e5-2670 2.6 GHz QDR; 6-node comparison. Average Factor = 2.54.
[Bar chart of speed-up factors, ranging from 1.49 to 4.01, across: DLPOLYclassic Bench4, GAMESS-UK (cyc-sporin), GAMESS-UK (hf12z), GAMESS-UK (Siosi7), VASP Pd-O complex, GAMESS-UK (valino.A2), DLPOLY-4 NaCl, CP2K H2O.256, DLPOLY-4 Gramicidin, GROMACS ion-channel, VASP Zeolite complex, NAMD (stmv), LAMMPS (3d LJ melt), GROMACS lignocellulose, NAMD (F1atpase), NAMD (apoa1), QE Au112, OpenFoam 3d3M and QE GRIR443.]
Improved Performance of Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR [216 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]
SKL “Gold” 6150 2.7 GHz EDR vs. SB e5-2670 2.6 GHz QDR; 6-node comparison. Average Factor = 3.15.
[Bar chart of speed-up factors, ranging from 1.76 to 5.23, across: DLPOLYclassic Bench5, QE Au112, DLPOLY-4 NaCl, GROMACS ion-channel, DLPOLY-4 Gramicidin, VASP Zeolite complex, LAMMPS (3d LJ melt), VASP Pd-O complex, CP2K H2O.256, GROMACS lignocellulose, QE GRIR443 and OpenFoam 3d3M.]
Improved Performance of Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR [192 cores] vs. Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX [192 cores]
SKL “Gold” 6142 2.6 GHz EDR vs. BDW e5-2697A v4 2.6 GHz EDR; 6-node comparison. Average Factor = 1.03.
[Bar chart of speed-up factors, ranging from 0.91 to 1.14, across: VASP Pd-O complex, OpenFoam 3d3M, DLPOLYclassic Bench7, QE Au112, QE GRIR443, DLPOLY-4 NaCl, CP2K H2O.256, GAMESS-UK (hf12z), DLPOLY-4 Gramicidin, GAMESS-UK (valino.A2), GAMESS-UK (Siosi7), GAMESS-UK (cyc-sporin), VASP Zeolite complex, LAMMPS (3d LJ melt), GROMACS ion-channel, NAMD (apoa1), NAMD (F1atpase), NAMD (stmv) and GROMACS lignocellulose.]
Application Performance in Chemistry and Materials Science:
4. Preliminary Evaluation of the AMD EPYC 7601 Processor
• Zen cores
¤ Private L1/L2 cache
• CCX
¤ 4 Zen cores (or fewer)
¤ 8MB shared L3 cache
• Zeppelin
¤ 2 CCX (or fewer)
¤ 2 DDR4 channels
¤ 2 × PCIe x16
• Naples
¤ 4 Zeppelin SoC dies fully connected by Infinity Fabric
¤ 4 NUMA nodes!
EPYC Architecture – Naples, Zeppelin & CCX
[Diagram: four Zeppelin dies per socket, each carrying 2 CCX (4 Zen cores with private L2 plus 8M shared L3 per CCX), 2 × DDR4 channels and 2 × 16 PCIe lanes, fully connected by Infinity Fabric coherent links.]
• Delivers 32 cores / 64 threads, 16MB L2 cache and 64MB L3 cache per socket.
• The design also means there are four NUMA nodes per socket, or eight NUMA nodes in a dual-socket system, i.e. different memory latencies depending on whether a die needs data from memory attached to that die or to another die on the fabric.
• The key difference from Intel’s Skylake SP architecture is that AMD needs to go off-die within the same socket, whereas Intel stays on a single piece of silicon.
INFINITY FABRICS and Inter-Socket Connectivity
Intra-socket:
• Fully connected with 4B links
• 42.6GB/s per link; 170GB/s aggregate
Inter-socket:
• 4 links (2B) between sockets in 2-processor configurations
• Each die connects to its peer die
• 2-hop maximum system diameter
• 38GB/s bi-directional BW per link; 152GB/s between two sockets
• Infinity Control Fabric connected between sockets
Infinity Fabric is a feat of engineering, but it does mean that there are significant performance variations as you move off die and onto the fabric.
AMD Epyc 7000 Series – SKU Map and FLOP/cycle
SKU:                        7601   7551   7501     7451   7401     7351     7301
Freq (base, GHz):           2.2    2.0    2.0      2.3    2.0      2.4      2.2
Turbo (all cores active):   2.7    2.6    2.6      2.9    2.8      2.9      2.7
Turbo (one core active):    3.2    3.0    3.0      3.2    3.0      2.9      2.7
Cores/socket:               32     32     32       24     24       16       16
TDP (W):                    180    180    155/170  180    155/170  155/170  155/170
L3 cache size: 64 MB; memory channels: 8; memory freq: 2667 MT/s (all SKUs)

Architecture:               Sandy Bridge      Haswell    Skylake    EPYC
ISA*:                       AVX               AVX2       AVX-512    AVX2
op/cycle:                   2 (1 ADD, 1 MUL)  4 (2 FMA)  4 (2 FMA)  4 (2 ADD, 2 MUL)
Vector size (DP = 64-bit):  4                 4          8          2
FLOP/cycle:                 8                 16         32         8
* Instruction Set Architecture
AMD EPYC 7601 System – 4 × SuperMicro AS-1123US-TR4
– 1U, 1-node form factor
– Dual-socket AMD Epyc 7601
 • x86_64 architecture (supports up to the AVX2 ISA)
 • Base frequency 2.2GHz; Turbo Core 3.2GHz (one core active), 2.7GHz (all cores active)
 • Per socket: 32 Zen cores / 64 threads
 – 563.2 GFLOPS peak in DP64 per socket
 – L1-D 32KB, L1-I 64KB, L2 512KB (per core); 8MB L3 per CCX
 – Hyperthreading on, Turbo Core on
 – TDP: 180 W
– Memory sub-system
 – 8 memory channels per socket, up to 2666MT/s (theoretical peak: 170.4GBytes/s per socket)
 – 16 × 16GB SR DIMMs (1 per channel) @ 2666MT/s (Samsung M393A2K40BB2-CTD)
▶ 4 disks ST1000NM0008-2F2
 – Seagate Capacity 1TB (3.5"), SATA, 7200rpm; 1 disk for OS (Ext4); 3 disks for /scratch (not configured)
▶ OS: RHEL7.4, kernel 3.10.0-693.el7.x86_64
▶ Interconnected through a Mellanox EDR-4x fabric
EPYC – Compiler and Run-time Options
Compilation: Intel Compilers 2018, Intel MPI 2017 Update 3, FFTW-3.3.5
AMD EPYC: -O3 -xAVX2
Intel Skylake: -O3 -xCORE-AVX2
#
# Preload the amd-cputype library to navigate
# the Intel "Genuine CPU" test
module use /opt/amd/modulefiles
module load AMD/amd-cputype/1.0
export LD_PRELOAD=$AMD_CPUTYPE_LIB
export OMP_PROC_BIND=true
# export KMP_AFFINITY=granularity=fine
export I_MPI_DEBUG=5
export MKL_DEBUG_CPU_TYPE=5   # select MKL's AVX2 code path on the non-Intel CPU
STREAM:
source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
module load AMD/amd-cputype/1.0
icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp \
    -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel
export OMP_NUM_THREADS=16
export OMP_PROC_BIND=true
export OMP_PLACES="{0:4:1}:16:4"   # 1 thread per CCX
export OMP_DISPLAY_ENV=true
MPI Performance – PingPong: IMB Benchmark (Intel), 1 PE / node
[Log–log chart of latency and bandwidth (Mbytes/sec) vs message length (bytes) for: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR. Annotated values: 2.0 µs latency; bandwidths of 2,999 and 11,567 Mbytes/sec.]
MPI Collectives – Alltoallv (128 PEs): IMB Benchmark (Intel)
[Log–log chart of measured time (µsec) vs message length (bytes) for: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]
EPYC performance with Intel MPI ~ 4–6 × worse than that with SKL processors!
DL_POLY Classic – NaCl Simulation: Performance Data (32–256 PEs)
NaCl 27,000 atoms; 500 time steps. Performance relative to the Fujitsu HTC X5650 2.67 GHz 6-C (16 PEs).
[Bar chart, 32–256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ClusterVision Ivy Bridge e5-2650v2 2.6GHz True Scale QDR; Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB; Intel Broadwell2 e5-2690v4 2.6GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR. Factors range from ~2.9–4.3 at 32 PEs to ~13.4–16.7 at 256 PEs.]
GAMESS-UK MPI/ScaLAPACK code
Zeolite Y cluster SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP (3975 GTOs). Performance relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs).
[Bar chart at 128 and 256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ClusterVision e5-2650v2 2.6GHz TrueScale QDR; Bull Haswell e5-2695v3 2.3GHz Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR; Huawei Fusion CH140 e5-2683 v4 2.1GHz (T) EDR; Thor Broadwell e5-2697A v4 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR; IBM Power8 S822LC 2.92GHz IB/EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR. Factors range from (1.1, 2.0) to (1.9, 3.5) at (128, 256) PEs.]
DL_POLY 4 – Gramicidin Simulation
Gramicidin 792,960 atoms; 50 time steps. Performance relative to the Fujitsu CX250 e5-2670/2.6 GHz 8-C (32 PEs).
[Bar chart, 64–256 PEs: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR. Factors range from ~1.8 at 64 PEs to ~9.0 at 256 PEs.]
VASP 5.4.1 – Zeolite Benchmark
Zeolite (Si96O192) with the MFI structure unit cell, running a single-point calculation with a 400 eV plane-wave cutoff using the PBE functional; maximum number of plane waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points. Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs).
[Bar chart at 64 / 96 / 128 PEs: Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA – 1.4 / 2.0 / 2.6; Intel Skylake Gold 6148 2.4GHz (T) OPA – 1.5 / 2.1 / 2.7; ATOS AMD EPYC 7601 2.2 GHz (T) EDR – 0.9 / 1.4 / 1.6; ATOS AMD EPYC 7601 2.2 GHz (T) EDR (16c/socket) – 1.4 / 1.8 / 2.1.]
Quantum Espresso – GRIR443
[Figure: performance data (96-160 PEs), relative to the Fujitsu e5-2670 2.6 GHz 8-C (96 PEs); higher is better. Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR]
EPYC – Target Codes and Data Sets – 128 PEs
[Figure: 128 PE performance across the target data sets – DLPOLY classic Bench4, DLPOLY-4 Gramicidin, DLPOLY-4 NaCl, GROMACS ion-channel, GROMACS lignocellulose, GAMESS-UK (cyc-sporin), GAMESS-UK (Siosi7), QE Au112, QE GRIR443, VASP Pd-O complex and VASP Zeolite complex. Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2GHz (T) EDR]
Performance Benchmarks – Node to Node
• Analysis of performance metrics across a variety of data sets
¤ “Core to core” and “node to node” workload comparisons
• The previous EPYC charts are based on a core-to-core comparison,
i.e. performance for jobs run on a fixed number of cores
• A node-to-node comparison is typical of the performance seen when
running a real-life production workload, and is expected to reveal the
major benefit of increasing the core count per socket
¤ Focus on a “node to node” comparison of the following pairings
¤ Benchmarks based on a set of 6 applications & 15 data sets; the
factor arithmetic is sketched after the table below.
1. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]
   vs. ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
2. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [168 cores]
   vs. ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
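The relative-performance factors quoted on the following slides are simple ratios of measured wall times, averaged arithmetically over the data sets. A minimal Python sketch of that bookkeeping follows; the wall-clock timings in it are invented purely for illustration, while the factor list reproduces the values from the next slide.

    # Sketch of the core-to-core vs. node-to-node bookkeeping behind the
    # relative-performance factors (the timings below are invented examples).
    def relative_performance(t_ref: float, t_sys: float) -> float:
        """Factor > 1 means the test system beats the reference."""
        return t_ref / t_sys

    # Core-to-core: both systems run the same core count (e.g. 128 PEs).
    t_ref_128, t_sys_128 = 412.0, 198.0          # seconds, hypothetical
    print(f"core-to-core: {relative_performance(t_ref_128, t_sys_128):.2f}")

    # Node-to-node: both systems run the same node count (here 4 nodes),
    # so each uses all of its own cores (e.g. 64 vs. 256 PEs).
    t_ref_4node, t_sys_4node = 731.0, 266.0      # seconds, hypothetical
    print(f"node-to-node: {relative_performance(t_ref_4node, t_sys_4node):.2f}")

    # The per-chart "Average Factor" is the arithmetic mean over data sets;
    # these are the EPYC vs. Sandy Bridge values from the next slide.
    factors = [1.48, 1.93, 2.09, 2.09, 2.46, 2.49,
               2.67, 2.92, 3.37, 3.43, 3.98, 4.05]
    print(f"average factor: {sum(factors) / len(factors):.2f}")  # -> 2.75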
AMD EPYC 7601 2.2 GHz (T) EDR vs. SB e5-2670 2.6 GHz QDR – 4 Node Comparison
Relative performance of the ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. the Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]:

DLPOLYclassic Bench7      1.48
VASP Zeolite complex      1.93
DLPOLYclassic Bench5      2.09
VASP Pd-O complex         2.09
DLPOLY-4 NaCl             2.46
DLPOLYclassic Bench4      2.49
DLPOLY-4 Gramicidin       2.67
QE Au112                  2.92
GROMACS ion-channel       3.37
GAMESS-UK (cyc-sporin)    3.43
GROMACS lignocellulose    3.98
GAMESS-UK (valino.A2)     4.05

Average Factor = 2.75
AMD EPYC 7601 2.2 GHz (T) EDR vs. SKL “Gold” 6130 2.1 GHz OPA – 4 Node Comparison
Relative performance of the ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. the Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores]:

QE Au112                  0.70
QE GRIR443                0.72
VASP Pd-O complex         0.74
VASP Zeolite complex      0.76
DLPOLYclassic Bench7      0.95
DLPOLY-4 NaCl             1.00
DLPOLY-4 Gramicidin       1.06
DLPOLYclassic Bench5      1.19
DLPOLYclassic Bench4      1.29
GAMESS-UK (cyc-sporin)    1.43
GROMACS ion-channel       1.49
GAMESS-UK (valino.A2)     1.59
GAMESS-UK (Siosi7)        1.70
GROMACS lignocellulose    1.71
GAMESS-UK (hf12z)         1.78

Average Factor = 1.21
5. Acknowledgements
• Ludovic Sauge, Enguerrand Petit and Patrick Berghaeger (Bull/ATOS)
for informative performance discussions and access to the Skylake &
EPYC clusters at the Bull HPC Competency Centre.
• Pak Lui, David Cho, Gilad Shainer, Colin Bridger & Steve Davey for
access to the “Thor” cluster at the HPC Advisory Council and the
“Hercules” partition at Mellanox.
• Doug Mark, Farid Parpia, John Simpson, Ludovic Enault, Xinghong
He, James Kuchler & Luke Willett for access to and assistance on the
IBM Power8 S822LC cluster in Poughkeepsie.
• David Power for access to two Skylake nodes at Boston/BIOS IT.
• Jamie Wilcox, Bogdan Pop, Toby Smith and Andrew Richardson (Intel)
for past access to and help with a host of processors and for access to
the Swindon clusters.
• Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles Civario and
Christopher Huggins for access to, and assistance with, the variety of
Skylake SKUs at the Dell Benchmarking Centre.
Summary
• Focus on performance benchmarks and clusters featuring Intel’s
Skylake “Gold” processors (6130, 2.1 GHz [16c]; 6148, 2.4 GHz
[20c]; 6142, 2.6 GHz [16c]; and 6150, 2.7 GHz [18c]).
• Performance comparison with current Sandy Bridge systems and
those based on dual Intel Broadwell processor EP nodes (16-
core, 14-core) with Mellanox EDR and Intel’s Omnipath OPA
interconnects.
• Measurements of parallel application performance based on
synthetic codes (STREAM, IMB, HPCC and IOR) and end-user
applications – DLPOLY, GROMACS, NAMD, LAMMPS, GAMESS-
UK, Quantum ESPRESSO, VASP and CP2K.
¤ Use of IPM and Allinea Performance Reports, and comparison of
Mellanox’s HPC-X and Intel MPI on EDR-based systems
• Enhanced performance of the Skylake-based clusters is at first
sight modest, particularly when compared with optimised runs
(HPC-X) on previous-generation Intel Broadwell clusters.
Summary II
• A Core-to-Core comparison across 19 data sets (10 applications)
suggests modest speedups (ca. 1.08) when comparing a Skylake
“Gold” 6150 cluster (EDR) to the Broadwell-based “Thor” e5-2697A v4
2.6GHz (T) cluster with EDR and the HPC-X environment.
¤ Little difference in application performance between Mellanox’s IB
EDR interconnect and Intel’s Omnipath (OPA) interconnect, at least
at modest core counts (< 512 cores).
¤ A comparison against the Fujitsu CX250 Sandy Bridge 2.6GHz IB-
QDR shows average performance-increase factors of:
• 1.54 (256 cores) for clusters based on the Gold 6130 [OPA]
• 1.64 (256 cores) for clusters based on the Gold 6148 [OPA]
• 1.71 (256 cores) for clusters based on the Gold 6142 [EDR]
• 1.88 (256 cores) for clusters based on the Gold 6150 [EDR]
¤ Some applications, however, show much higher factors, e.g.
Quantum Espresso and VASP.
¤ A Node-to-Node comparison, typical of the performance when
running a production workload, is more encouraging.
Summary III
• A 6-node benchmark based on examples from 10 applications and 19
data sets shows the following improvement factors against 6-node runs
on the Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz cluster:
¤ Skylake “Gold” 6130 cluster (16c) with OPA interconnect: 2.54
¤ Skylake “Gold” 6142 cluster (16c) with EDR interconnect: 2.83
¤ Skylake “Gold” 6150 cluster (18c) with EDR interconnect: 3.15
• Optimum interconnect performance is a function of both application
and core count.
¤ With the materials-based codes & OpenFOAM, and at high core
count (> 512 cores), EDR exhibits a clear performance
advantage over OPA.
¤ This is not the case for the classical MD codes where OPA
shows a distinct advantage at all but the highest core counts.
• Preliminary studies on the EPYC 7601 show a complex performance
dependency on the EPYC architecture.
¤ Codes making heavy use of vector instructions (GROMACS, VASP
and Quantum Espresso) perform in a somewhat modest fashion at best.