Download - Application Performance in Chemistry and Materials Science193.62.125.70/CIUK2017/MartynGuest_Cardiff.pdf · 2018. 2. 6. · Dell Skylake Gold 6130 2.1GHz (T) OPA Dell Skylake Gold

Intel’s Xeon Scalable processor Family

(Skylake) and AMD’s EPYC processor

Application Performance in

Chemistry and Materials Science:

Martyn Guest≠, Christine Kitchen≠ & Enguerrand Petit†

≠ Cardiff University †Performance Engineering Group, Atos

2Application Performance in Materials Science

Outline

• Measurement of parallel application performance featuring synthetics and

end-user applications across a variety of clusters

¤ Synthetic Codes – STREAM, IMB (interconnect) and HPCC

¤ Variety of End-user Codes – DL_POLY, GROMACS, NAMD, LAMMPS,

GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM

• Focus on Intel’s Xeon Scalable processors (“Skylake”), including

¤ Intel Xeon Gold 6150 Processor (18c, 24.75M Cache, 2.70 GHz)

¤ Intel Xeon Gold 6142 Processor (16c, 22M Cache, 2.60 GHz)

¤ Intel Xeon Gold 6148 Processor (20c, 27.5M Cache, 2.40 GHz)

¤ Intel Xeon Gold 6130 Processor (16c, 22M Cache, 2.10 GHz).

¤ Clusters with dual-socket nodes + Mellanox EDR & Intel’s OPA Interconnects.

• Comparison with a number of HPC systems based on earlier CPUs:

¤ Intel’s Broadwell (16-core E5-2697Av4 2.6GHz) & 14-core E5-2680 v4 2.4

GHz), and Sandy Bridge (E5-2690 & E5-2670) clusters from Bull and Fujitsu

¤ Mellanox IB EDR, and Intel’s Omnipath OPA Interconnects.

¤ IBM’s Power 8 cluster - S822LC

• Preliminary Evaluation of the AMD EPYC 7601 Processor

14 December 2017


Contents1. Performance Benchmarks and Cluster Systems

a. Synthetic Code Performance: STREAM, IMB, and HPCC

b. Application Code Performance: DLPOLY, GROMACS,

LAMMPS, NAMD, GAMESS_UK, VASP, Quantum Espresso,

CP2K, and OpenFOAM

2. Selecting Fabrics and Optimising Performance

a. Interconnect Performance: Mellanox EDR Infiniband and

Intel’s Omnipath (OPA)

3. Relative Code Performance: Processor Family and

Interconnect – “core to core” and “node to node” benchmarks.

4. Preliminary Evaluation of the AMD EPYC 7601 Processor

a. represents a fairly radical design departure from what Intel

offers

5. Acknowledgements and Summary

14 December 2017

The Xeon Skylake Architecture


• The architecture of Skylake is

very different from that of the prior

“Haswell” and “Broadwell” Xeon

chips

• Three basic variants that now

cover what was formerly the Xeon

E5 and Xeon E7 product lines, with

Intel converging the Xeon E5 and

E7 chips into a single socket.

• Product segmentation – Platinum, Gold, Silver, & Bronze – with 51

variants of the SP chip

• Also custom versions requested by hyperscale and OEM customers.

• All of these chips differ from each other in a number of ways, including

number of cores, clock speed, L3 cache capacity, number and speed of

UltraPath links between sockets, number of sockets supported, main

memory capacity, width of the AVX vector units etc.

14 December 2017

Intel Xeon : Westmere - Skylake

Xeon 5600

(Westmere-EP)

Xeon E5-2600

(Sandy Bridge-EP)

Xeon E5-2600 v4

“Broadwell-EP”

Intel Xeon Scalable

Processor

“Skylake”

Cores / ThreadsUp to 6 cores / 12

threads

Up to 8 cores / 16

threads

Up to 22 Cores / 44

threads

Up to 28 Cores / 56

threads

Last-level cache 12 MB Up to 20 MB Up to 55 MBUp to 38.5 MB (non-

inclusive)

Max memory

channels, speed

/ socket

3xDDR3 channels,

1333

4xDDR3 channels,

1600

4 channels of up to 3

RDIMMs, LRDIMMs

or 3DS LRDIMMs,

2400 MHz

6 channels of up to 2

RDIMMs, LRDIMMs

or 3DS LRDIMMs,

2666 MHz

New

instructionsAES-NI

AVX 1.0

8 DP Flops/Clock

AVX 2.0

16 DP Flops/Clock

AVX 512

32 DP Flops/Clock

QPI / UPI Speed

(GT/s)

1 QPI channels @

6.4 GT/s

2 QPI channels @ 8.0

GT/s

2 x QPI channels @

9.6 GT/s

Up to 3 x UPI @ 10.4

GT/s

PCIe Lanes /

Controllers /

Speed (GT/s)

36 lanes PCIe 2.0 on

chipset

40 Lanes / Socket

Integrated PCIe 3.0

40 / 10 / PCIe* 3.0

(2.5, 5, 8 GT/s)48 / 12 / PCIe* 3.0

(2.5, 5, 8 GT/s)

Server /

Workstation

TDP

Server /

Workstation: 130W

Up to 130W Server;

150W Workstation 55 - 145W 70 – 205W

5Application Performance in Materials Science 14 December 2017

Baseline Cluster Systems


Cluster Configuration

Intel Sandy Bridge Clusters

“Raven”128 x Bull|ATOS b510 EP-nodes each with 2 Intel Sandy Bridge E5-2670

(2.6 GHz), with Mellanox QDR infiniband.

Supercomputing

Wales

384 x Fujitsu CX250 EP-nodes each with 2 Intel Sandy Bridge E5-2670

(2.6 GHz), with Mellanox QDR infiniband.

Intel Broadwell Clusters

Dell PE R730/R630,

Broadwell EP-2697A v4

2.6 GHz 16C

HPC Advisory Council, “Thor” cluster, Dell PowerEdge R730/R630 36-

node cluster: 2 x Xeon E5-2697A v4 @ 2.6GHz, 16 Core, 145W TDP,

40MB Cache,256GB DDR4 2400MHz, Interconnect: ConnectX-4 EDR

ATOS Broadwell EP-

2680 v4 2.4 GHz 16C

32 node cluster, Node config: 2 x Xeon E5-2680 v4 @ 2.4GHz, 16 Core,

145W TDP, 40MB Cache,128GB DDR4 2400MHz, Interconnect: Mellanox

ConnectX-4 EDR; and Intel OPA

IBM Power 8 S822LC

IBM Power 8 S822LC

with Mellanox EDR

20 cores, 3.49 GHz with performance CPU governor; 256 GB memory ;

1 – IB (EDR) port ; 2 × NVIDIA K80 GPU;

IBM PE (Parallel Environment) Operating System: RHEL 7.2 LE;

Compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (from IBM

Advance Toolchain 9.0)

14 December 2017

Skylake Cluster Systems

Cluster / Configuration

32 node Dell|EMC cluster running SLURM with separate partitions for each processor

SKU; Mellanox EDR:

Intel® Xeon® Gold 6150 Processor (24.75M Cache, 2.70GHz), Max Turbo frequency, 3.70

GHz. # cores 18; #threads 36; DDR4-2666; TDP 165W; # of UPI Links 3;

Intel® Xeon® Gold 6142 Processor (22M Cache, 2.60GHz), Max Turbo frequency, 3.70

GHz, # cores 16; #threads 32; DDR4-2666; TDP 150W; # of UPI Links 3.

28 node Dell|EMC cluster running SLURM: Intel OPA

Intel® Xeon® Gold 6130 Processor (22M Cache, 2.10GHz), Max Turbo frequency, 3.70

GHz, # cores 16; #threads 32; DDR4-2666; TDP 125W; # of UPI Links 3.

The 6130’s are configured with 12 × 8GB 2666 DIMMs rather than 12 × 16GB, resulting in

somewhat slower memory bandwidth (165 GB/s vs 195 GB/s STREAM Triad).

20 node Bull|ATOS cluster running SLURM;

Intel® Xeon® Gold 6150 Processor (24.75M Cache, 2.70 GHz), Max Turbo frequency, 3.70

GHz. # cores 18; #threads 36; DDR4-2666; TDP 165W; # of UPI Links 3; SMT; Mellanox

EDR.

16 node Intel cluster running SLURM;

Intel® Xeon® Gold 6148 Processor (27.5M Cache, 2.40 GHz), Max Turbo frequency, 3.70

GHz. # cores 20; #threads 40; DDR4-2666; TDP 150W; # of UPI Links 3; SMT; Intel OPA.


The Performance Benchmarks

• The Test suite comprises both synthetics & end-user applications.

Synthetics include HPCC (http://icl.cs.utk.edu/hpcc/) & IMB benchmarks

(http://software.intel.com/en-us/articles/intel-mpi-benchmarks), IOR and

STREAM

• Variety of “open source” & commercial end-user application codes:

• These stress various aspects of the architectures under consideration

and should provide a level of insight into why particular levels of

performance are observed e.g., memory bandwidth and latency, node

floating point performance and interconnect performance (both latency

and B/W) and sustained I/O performance.

GROMACS, LAMMPS, NAMD, DL_POLY classic & DL_POLY-4

(molecular dynamics)

Quantum Espresso, Siesta, CP2K, ONETEP, CASTEP and VASP

(ab initio Materials properties)

NWChem, GAMESS-US and GAMESS-UK

(molecular electronic structure)


74,309

93,486

118,605 114,367

132,035 128,083

165,974

185,863195,122

184,087

279,562

0

50,000

100,000

150,000

200,000

250,000

Bull b510"Raven"Sandy

Bridge e5-2670/2.6GHz IB-

QDR

ClusterVisione5-2650v2

2.6GHz

Dell R730Haswell e5-

2697v3 2.6GHz(T)

Dell OPA32 e5-2660v3 2.6GHz

(T) OPA

Thor Broadwelle5-2697A v42.6GHz (T)

ATOSBroadwell e5-2680v4 2.4GHz

(T) OPA

Dell SkylakeGold 6130

2.1GHz (T) OPA

Dell SkylakeGold 61422.6GHz (T)


IBM Power8S822LC

2.92GHz IB/EDR

AMD Epyc 76012.2 GHz

Memory B/W –STREAM performance

TRIAD [Rate (MB/s) ]

Ivy Bridge & Haswell

E5-26xx v2,v3

OMP_NUM_THREADS (KMP_AFFINITY=physical

Broadwell

E5-26xx v4

Skylake Gold

6130, 6142, 6148

Application Performance in Materials Science 914 December 2017

4,644

5,843

4,236

5,718

4,126

4,574

5,187

5,808

4,878

9,204

4,368

0

1,000

2,000

3,000

4,000

5,000

6,000

7,000

8,000

9,000

10,000

Bull b510"Raven"Sandy

Bridge e5-2670/2.6GHz IB-

QDR

ClusterVisione5-2650v2

2.6GHz

Dell R730Haswell e5-

2697v3 2.6GHz(T)

Dell OPA32 e5-2660v3 2.6GHz

(T) OPA

Thor Broadwelle5-2697A v42.6GHz (T)

ATOSBroadwell e5-2680v4 2.4GHz

(T) OPA

Dell SkylakeGold 6130

2.1GHz (T) OPA



IBM Power8S822LC

2.92GHz IB/EDR

AMD Epyc 76012.2 GHz

Memory B/W – STREAM / core performance

TRIAD [Rate (MB/s) ]

OMP_NUM_THREADS (KMP_AFFINITY=physical

Ivy Bridge & Haswell

E5-26xx v2,v3

Broadwell

E5-26xx v4

Skylake Gold

6130, 6142, 6148


3.8

11,466

5,957

1.7

3,694

1,729

1

10

100

1,000

10,000

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07

ATOS AMD EPYC 7601 2.2 GHz (T) EDR

Intel SKL Gold 6148 2.4GHz (T) OPA

Dell Skylake Gold 6150 2.7GHz (T) EDR

IBM Power8 S822LC 2.92GHz IB/EDR

Thor BDW e5-2697A v4 2.6GHz (T) EDR

Intel BDW e5-2690v4 2.6GHz (T) OPA

Dell OPA32 e5-2660v3 2.6GHz (T) OPA

Bull HSW E5-2680v3 2.5 GHz (T) Connect-IB

Dell R720 e5-2680v2 2.8 GHz (T) connect-IB

Azure A9 WE (e5-2670 2.6 GHz) IB RDMA

Merlin Xeon E5472 3.0 GHz QC + IB (mvapich2 1.4)

MPI Performance – PingPong

IMB Benchmark (Intel)

1 PE / node

Latency

Message Length (Bytes)

Mb

yte

s/s

ec


BE

TT

ER

14 December 2017

215.4

5.0E+01

5.0E+02

5.0E+03

5.0E+04

5.0E+05

5.0E+06

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07

Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR

ATOS Broadwell e5-2680v4 2.4GHz (T) OPA

Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR

Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA

Dell Skylake Gold 6130 2.1GHz (T) OPA

Intel Skylake Gold 6148 2.4GHz (T) OPA


MPI Collectives – Alltoallv (128 PEs)


128 PEs

Latency

BE

TT

ER


Measured Time (usec)

Time-consuming

messages called by

Alltoall & Alltoallv (IPM)


• Linpack TPP benchmark measures floating point rate of execution for solving a linear system of equations. HPL

• measures floating point rate of execution of double precision real matrix-matrix multiplication. DGEMM

• measures sustainable memory B/W (in GB/s) and the corresponding computation rate for simple vector kernel. STREAM

• parallel matrix transpose - exercises the communicationswhere pairs of processors communicate with each other simultaneously. Useful test of the total communications capacity of the network.

PTRANS

• measures rate of integer random updates of memory (GUPS). RandomAccess

• measures floating point rate of execution of double precision complex one-dimensional Discrete Fourier Transform. Performance a combination of flops, memory, and network bandwidth

FFTE

• tests to measure latency and B/W of a number of simultaneous communication patterns; based on b_eff (effective bandwidth benchmark).

Communication B/W and latency

HPC Challenge Benchmark (Source - http://icl.cs.utk.edu/hpcc/)

CPU cores N

128 83,000

256 117,000

512 166,000

1024 234,000


0.0

0.2

0.4

0.6

0.8

1.0

HPL [Gflops]

G-PTRANS[GB/s]

G-RandomAccess [Gup/s]

G-FFTE [Gflops]

EP-STREAM Sys[GB/s]

EP-STREAMTriad [GB/s]

EP-DGEMM[Gflops]

Random RingBandwidth[Gbytes]

Random RingLatency [usec]

Fujitsu CX250 e5-2670 2.6 GHzQDR

Bull b510 (Raven) e5-2670 2.6GHz IB-QDR

ATOS Genji e5-2680 v4 2.4GHz(T) OPA IMPI

Thor Dell e5-2697A v4 2.6GHz (T)EDR HCPX

Thor Dell e5-2697A v4 2.6GHz (T)OPA

Dell Skylake Gold 6130 2.1GHz(T) OPA

Intel Skylake Gold 6148 2.4GHz(T) OPA

Dell Skylake Gold 6142 2.6GHz(T) EDR

Dell Skylake Gold 6150 2.7GHz(T) EDR

HPCC - 128 Processing Elements


[Matrix Size 83,000]

14 December 2017

Application Code Performance in Materials

Science, Chemistry and Nanoscience:

DLPOLY, GROMACS, NAMD, LAMMPS, GAMESS,

NWChem, GAMESS-UK, ONETEP, VASP, SIESTA,

CASTEP, Quantum Espresso, CP2K – on a variety of HPC

systems.

IPM Performance Monitoring

http://ipm-hpc.sourceforge.net/userguide.html

IPM is a profiling tool that helps analyse MPI programs.

• Very easy to use, requires no code modifications (unless you want more

information), and provides a lightweight profiling interface (with very low

overhead <2%).

• Can create html O/P that include graphical representation of the data

To run a program with IPM profiling; There are three ways of using IPM.

1. set an environment variable before you run your program:

$ export LD_PRELOAD=/application/tools/ipm-2.0.6/install-impi/lib/libipm.so

2. Recompile your program with IPM enabled:

$ mpicc -L/path/to/ipm/lib -lipm your_program.c -o your-program

3. Use “export I_MPI_STATS=ipm” if using Intel’s mpirun or mpiexec.hydra

When executing a program with ipm, an xml file is created that can be parsed to text

or html using ''ipm_parse -html xmlfile''.

IPM 2.0.6


Allinea Performance Reports

Allinea Performance Reports provides a

mechanism to characterize and understand the

performance of HPC application runs through a

single-page HTML report.


• Based on Allinea MAP's adaptive sampling technology that keeps data

volumes collected and application overhead low.

• Modest application slowdown (ca. 5%) even with 1000’s of MPI

processes.

• Runs on existing codes: a single command added to execution scripts.

• If submitted through a batch queuing system, then the submission script

is modified to load the Allinea module and add the 'perf-report' command

in front of the required mpiexec command.

• perf-report mpiexec -n 4 $code

• A Report Summary: This characterizes how the application's wallclock

time was spent, broken down into CPU, MPI and I/O

• All examples updated on Broadwell Mellanox Cluster (E5-2697A v4)

14 December 2017

DL_POLY Developed as CCP5 parallel MD code by W. Smith,

T.R. Forester and I. Todorov

UK CCP5 + International user community

DLPOLY_classic (replicated data) and DLPOLY_3 &

_4 (distributed data – domain decomposition)

Areas of application:

liquids, solutions, spectroscopy, ionic solids,

molecular crystals, polymers, glasses, membranes,

proteins, metals, solid and liquid interfaces,

catalysis, clathrates, liquid crystals, biopolymers,

polymer electrolytes.

Molecular Dynamics Codes: AMBER, DL_POLY,

CHARMM, NAMD, LAMMPS, GROMACS etc

Molecular Simulation I. DL_POLY


A B

C D

• Distribute atoms, forces across the nodes

¤ More memory efficient, can address much larger

cases (105-107)

• Shake and short-ranges forces require only

neighbour communication

¤ communications scale linearly with number of

nodes

• Coulombic energy remains global

¤ Adopt Smooth Particle Mesh Ewald scheme

• includes Fourier transform smoothed charge

density (reciprocal space grid typically

64x64x64 - 128x128x128)

http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx

W. Smith and I. Todorov

Domain Decomposition - Distributed data:

DL_POLY 4 – Distributed data


Benchmarks1. NaCl Simulation; 216,000 ions, 200 time steps, Cutoff=12Å

2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps

14 December 2017

1.8

3.1

4.0

4.9

2.7

5.0

6.5

8.4

2.9

5.3

7.0

9.1

0.0

2.0

4.0

6.0

8.0

10.0

64 128 192 256


Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA

Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI

Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA


Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR

Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

DL_POLY 4 – Gramicidin Simulation

Number of Processing Elements

Performance

BE

TT

ER

Gramicidin 792,960 atoms; 50 time steps

Performance Data (64-256 PEs)


Relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs)

14 December 2017

SKL 6142 2.6 GHz ~

1.06 X e5-2697v4 2.6

GHz

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU Scalar numeric ops (%)

CPU Vector numeric ops (%)

CPU Memory accesses (%)


DLPOLY4 – Gramicidin Simulation Performance Report

Smooth Particle Mesh Ewald Scheme


CPU Time Breakdown

Total Wallclock Time

Breakdown

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU (%)

MPI (%)

14 December 2017

“DL_POLY_4 and Xeon Phi: Lessons Learnt”,

Alin Marin Elena , Christian Lalanne, Victor

Gamayunov , Gilles Civario, Michael Lysaght,

and IlianTodorov

Molecular Simulation II. GROMACS

GROMACS v 5.0 (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins,

lipids and nucleic acids [University of Groningen] .

Berk Hess et al. "GROMACS 4: Algorithms for Highly Efficient, Load-

Balanced, and Scalable Molecular Simulation". Journal of Chemical Theory

and Computation 4 (3): 435–447.


Ion channel system

• The 142k particle ion channel system is the membrane protein GluCl - a

pentameric chloride channel embedded in a DOPC membrane and

solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This

system is a challenging parallelization case due to the small size, but is

one of the most wanted target sizes for biomolecular simulations.

Lignocellulose

• Gromacs Test Case B from the UEA Benchmark Suite. A model of

cellulose and lignocellulosic biomass in an aqueous solution. This system

of 3.3M atoms is inhomogeneous, and uses reaction-field

electrostatics instead of PME and therefore should scale well.

14 December 2017

20.6

37.9

54.5

68.8

32.5

60.3

75.5

100.4

36.4

68.5

98.8

123.6

0.0

20.0

40.0

60.0

80.0

100.0

120.0

64 128 192 256



Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX





23

GROMACS – Ion Channel Simulation


Performance (ns /day)


BE

TT

ER

142k particle ion channel system


0.0

5.0

10.0

15.0

20.0

25.0

30.0

35.0

40.0

45.0

50.0

32 PEs

64 PEs

128 PEs

256 PEs




0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU (%)

MPI (%)


GROMACS – Ion-channel Performance Report


CPU Time Breakdown


Breakdown

14 December 2017

0.9

1.7

2.5

3.3

1.3

2.6

3.8

5.0

1.3

2.6

3.8

5.0

1.4

2.8

3.9

5.2

1.6

3.1

4.6

6.1

0.0

1.0

2.0

3.0

4.0

5.0

6.0

64 128 192 256



Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI






25

GROMACS – Lignocellulose




BE

TT

ER

3,316,463 atom system using

reaction-field electrostatics instead

of PME


Performance Data (256 PEs)GROMACS - IPM Reports


13%

10%

wallclock : 26.8 secs

# mpi_tasks : 256 on 8 nodes

%comm : 34.90%

7%

% of Total Time

ligno-cellulosewallclock : 352.1 secs


%comm : 5.81%

Ion-channel

% of Total Time

4%

1%

1%

14 December 2017

GAMESS-UK - Moving to Distributed Data.

The MPI/ScaLAPACK Implementation

of the GAMESS-UK SCF/DFT module

• Alternative pragmatic approach in which

¤ MPI-based tools (such as ScaLAPACK) used in place of Global Arrays

¤ All data structures except those required for the Fock matrix build are fully

distributed (F, P)

• Partially distributed model chosen because, in the absence of efficient

one-sided communications it is difficult to efficiently load balance a

distributed Fock matrix build.

• Obvious drawback - some large replicated data structures are required.

¤ These are kept to a minimum. For a closed shell HF or DFT calculation only

2 replicated matrices are required, 1 × Fock and 1 × Density (doubled for

UHF).


“The GAMESS-UK electronic structure package: algorithms, developments and

applications'' M.F. Guest, I. J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H.

van Lenthe, R.W.A Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.

14 December 2017

1.2

1.5

2.7

1.6

2.8

1.2

2.1

1.6

2.9

1.5

2.7

1.8

3.2

1.6

2.8

1.8

3.3

1.9

3.5

1.1

2.0

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

128 256


ClusterVision e5-2650v2 2.6GHz Truescale QDR

Intel Ivy Bridge e5-2690v2 3.0GHz (T) True Scale QDR

Bull Haswell e5-2695v3 2.3GHz Connect-IB

Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR

Huawei Fusion CH140 e5-2683 v4 2.1GHz (T) EDR

Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) EDR

Thor Broadwell e5-2697A v4 2.6GHz (T) EDR






Zeolite Y cluster SioSi7 DZVP (Si,O), DZVP2 (H) B3LYP(3975 GTOs)

GAMESS-UK Performance - Zeolite Y cluster

Performance


Relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs)

BE

TT

ER


SKL 6142 2.6 GHz

~ 1.05 X e5-2697v4 2.6 GHz

14 December 2017

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU (%)

MPI (%)


GAMESS-UK.MPI DFT – DFT Performance Report


Cyclosporin 6-31G** basis (1855

GTOs); DFT B3LYP

CPU Time Breakdown


Breakdown

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

32 PEs

64 PEs

128 PEs

256 PEs




14 December 2017

• VASP – performs ab-initio QM molecular dynamics (MD) simulations using

pseudopotentials or the projector-augmented wave method and a plane

wave basis set.

• Quantum Espresso – an integrated suite of Open-Source computer codes

for electronic-structure calculations and materials modelling at the

nanoscale. It is based on density-functional theory (DFT), plane waves,

and pseudopotentials

• SIESTA - an O(N) DFT code for electronic structure calculations and ab

initio molecular dynamics simulations for molecules and solids. It uses

norm-conserving pseudopotentials and linear combination of numerical

atomic orbitals (LCAO) basis set.

• CP2K is a program to perform atomistic and molecular simulations of solid

state, liquid, molecular, and biological systems. It provides a framework for

different methods such as e.g., DFT using a mixed Gaussian & plane waves

approach (GPW) and classical pair and many-body potentials. • ONETEP (Order-N Electronic Total Energy Package) is a linear-scaling

code for quantum-mechanical calculations based on DFT.

Computational Materials

Advanced Materials Software


http://www.quantum-espresso.org/


Quantum Espresso is an

integrated suite of Open-

Source computer codes

for electronic-structure

calculations and

materials modelling at the

nanoscale. It is based on

density-functional theory,

plane waves, and

pseudopotentials.

Ground-state calculations.

Structural Optimization.

Transition states & minimum energy paths.

Ab-initio molecular dynamics.

Response properties (DFPT).

Spectroscopic properties.

Quantum Transport.

Benchmark Details

DEISA AU112

Au complex (Au112), 2,158,381 G-

vectors, 2 k-points, FFT dimensions:

(180, 90, 288)

PRACE

GRIR443

Carbon-Iridium complex (C200Ir243),

2,233,063 G-vectors, 8 k-points, FFT

dimensions: (180, 180, 192)

Quantum Espresso




1.01.3

2.0 2.0

2.52.9

3.3 3.4

3.3

5.6

5.0

5.7

7.67.8

1.5

2.2

3.2

5.1

4.3

4.9

5.9

1.9

2.7

4.0

6.7

5.7

6.4

8.3

8.8

0.0

2.0

4.0

6.0

8.0

10.0

0 64 128 192 256 320



Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL

Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA





Perf

orm

an

ce

Performance Data (32 - 320 PEs)

BE

TT

ER

Quantum Espresso – Au112


Relative to the Fujitsu e5-2670

2.6 GHz 8-C (32 PEs)

14 December 2017

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU (%)

MPI (%)

0.0

5.0

10.0

15.0

20.0

25.0

30.0

35.0

40.0

45.0

32 PEs

64 PEs

128 PEs

256 PEs





Quantum Espresso – Au112 Performance Report

Au complex (Au112), 2,158,381 G-

vectors, 2 k-points, FFT

dimensions: (180, 90, 288)


CPU Time Breakdown


Breakdown

14 December 2017

Performance Data (256 PEs)

Quantum Espresso – Au112 IPM Report


36%

17%

wallclock : 108.3 secs


%comm : 72.52%

Time-consuming messages

called by Alltoall are modest

sized (<1KB)

8%

MPI_Alltoall

% of Total Time

14 December 2017

Zeolite Benchmark

• Zeolite with the MFI structure unit cell running

a single point calculation and a planewave cut

off of 400eV using the PBE functional

• 2 k-points; maximum number of plane-

waves: 96,834

• FFT grid; NGX=65, NGY=65, NGZ=43,

giving a total of 181,675 points

Pd-O Benchmark

• Pd-O complex – Pd75O12, 5X4 3-layer

supercell running a single point calculation

and a planewave cut off of 400eV. Uses the

RMM-DIIS algorithm for the SCF and

is calculated in real space.

• 10 k-points; maximum number of plane-

waves: 34,470

• FFT grid; NGX=31, NGY=49, NGZ=45,

giving a total of 68,355 points

VASP – Vienna Ab-initio Simulation Package

Benchmark Details

MFI Zeolite

Zeolite (Si96O192), 2 k-

points, FFT grid: (65,

65, 43); 181,675 points

Pd-O

complex

Palladium-Oxygen

complex (Pd75O12), 10

k-points, FFT grid: (31,

49, 45), 68,355 points

VASP (5.4.1) performs ab-

initio QM molecular

dynamics (MD) simulations

using pseudopotentials or

the projector-augmented

wave method and a plane

wave basis set.


1.7

2.5

2.1

2.8

4.6

6.5

3.3

5.3

6.1

3.7

5.6

6.8

0.0

2.0

4.0

6.0

8.0

64 128 256




Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA






Performance Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs)

BE

TT

ER

Palladium-Oxygen complex (Pd75O12), 8 k-

points, FFT grid: (31, 49, 45), 68,355 points

VASP 5.4.1 – Pd-O Benchmark


1.7

2.52.1

2.8

4.6

6.5

3.3

5.3

6.1

4.0

6.1

9.2

3.7

5.6

6.8

0.0

2.0

4.0

6.0

8.0

10.0

64 128 256







Intel Skylake Gold 6148 2.4GHz (T) OPA [KPAR=2]




Performance Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs)

BE

TT

ER



VASP 5.4.1 – Pd-O Benchmark - Parallelisation on k-points


NPEs KPAR NPAR

64 2 2

128 2 4

256 2 8

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

32 PEs

64 PEs

128 PEs

256 PEs




0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU (%)

MPI (%)


VASP – Pd-O Benchmark Performance Report




CPU Time Breakdown


Breakdown

14 December 2017

1.0

1.51.7

1.4

2.9

4.7

1.5

2.7

3.9

1.7

3.0

4.2

1.7

3.2

4.5

0.0

1.0

2.0

3.0

4.0

5.0

64 128 256









IBM Power8 S822LC: 4.3


Performance

BE

TT

ER

Zeolite (Si96O192) with MFI structure unit cell running a single

point calculation and a 400eV planewave cut off of using the

PBE functional. maximum number of plane-waves: 96,834,

2 k-points, FFT grid: (65, 65, 43); 181,675 points

VASP 5.4.1 – Zeolite Benchmark


Relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs)

14 December 2017

40

OpenFOAM - The open source CFD toolbox

• OpenFOAM has an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics and electromagnetics.

Features

• Includes over 80 solver applications that simulate specific problems in engineering mechanics and over 170 utility applications that perform pre- and post-processing tasks, e.g. meshing, data visualisation, etc.

Applications

http://www.openfoam.com/

The OpenFOAM® (Open Field Operation and Manipulation) CFD

Toolbox is a free, open source CFD software package produced by

OpenCFD Ltd.


• Isothermal, incompressible flow in a 2D square domain. The geometry has all the boundaries of the square are walls. The top wall moves in the x-direction at 1 m/s while the other 3 are stationary. Initially, the flow is assumed laminar and is solved on a uniform mesh using the icoFoam solver.

Lid-driven cavity flow (Cavity 3d)

Geometry of the lid driven cavity

http://www.openfoam.org/docs/user/cavity.php#x5-170002.1.5

1.02.2

3.3

7.2

9.3

11.512.4

15.0 15.0

16.9 16.9

19.9

23.4

26.5 26.5

1.4

3.1

4.6

9.7

12.4

15.6

17.6

23.4

29.4

34.5

36.1

0.0

5.0

10.0

15.0

20.0

25.0

30.0

35.0

40.0

0 64 128 192 256 320 384 448 512








41

OpenFOAM – Cavity 3d-3M


Perf

orm

an

ce

Relative to the Fujitsu CX250 e5-2670 8-C (32 PEs)

BE

TT

ER


OpenFOAM with lid-driven cavity flow 3d-3M data set


0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

100.0

32 PEs

64 PEs

128 PEs

256 PEs

CPU (%)

MPI (%)

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

32 PEs

64 PEs

128 PEs

256 PEs





OpenFOAM – Cavity 3d-3M Performance Report


OpenFOAM with lid-driven cavity

flow 3d-3M data set

CPU Time Breakdown


Breakdown

14 December 2017

Application Code Performance in Materials

Science, Chemistry and Nanoscience:

2. Selecting Fabrics and Optimising

Performance:

Intel MPI, Mellanox HPCX, IPM

and Allinea Performance Report.

• Intel MPI Library - can select a communication fabric at runtime without having

to recompile the application. By default, it automatically selects the most

appropriate fabric based on both S/W and H/W configuration i.e. in most cases

you do not have to worry about manually selecting a fabric.

• Specifying a particular fabric can boost performance. Can specify fabrics for both

intra-node and inter-node communications. Following fabrics available:

• For inter-node communication, it uses the first available fabric from the default

fabric list. This list is defined automatically for each H/W and S/W configuration

(see I_MPI_FABRICS_LIST).

• For most configurations, this list is as follows: dapl, ofa, tcp, tmi, ofi

Selecting Fabrics – MPI Optimisation


Fabric Network hardware and software used

shm Shared memory (for intra node communication only).

dapl Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB)

and iWarp (through DAPL).

ofa OpenFabrics Alliance (OFA) fabrics e.g. InfiniBand (through OFED verbs).

tcp TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB).

tmi Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel

Omni Path Architecture and Myrinet (through TMI).

ofi OpenFabrics Interfaces* (OFI) - capable fabrics, such as Intel True Scale

Fabric, Intel Omni Path Architecture, IB and Ethernet (through OFI API).

14 December 2017

MPIFLAGS+="-genv DAT_OVERRIDE /etc/dat.conf "

MPIFLAGS+="-genv I_MPI_DAT_LIBRARY /usr/lib64/libdat2.so "

if [[ "$TRANSPORT" == "DAPL" ]]; then

MPIFLAGS+="-DAPL "

MPIFLAGS+="-genv I_MPI_FABRICS shm:dapl "

MPIFLAGS+="-genv I_MPI_DAPL_UD off "

MPIFLAGS+="-genv I_MPI_DAPL_PROVIDER ofa-v2-$HCA-${HCAPORT}u "

MPIFLAGS+="-genv DAPL_MAX_INLINE 256 "

MPIFLAGS+="-genv I_MPI_DAPL_RDMA_RNDV_WRITE on "

MPIFLAGS+="-genv DAPL_IB_MTU 4096 "

elif [[ "$TRANSPORT" == "OFA" ]]; then

MPIFLAGS+="-IB "

MPIFLAGS+="-genv MV2_USE_APM 0 "

MPIFLAGS+="-genv I_MPI_FABRICS shm:ofa "

MPIFLAGS+="-genv I_MPI_OFA_USE_XRC 1 "

MPIFLAGS+="-genv I_MPI_OFA_NUM_ADAPTERS 1 "

MPIFLAGS+="-genv I_MPI_OFA_ADAPTER_NAME $HCA "

MPIFLAGS+="-genv I_MPI_OFA_NUM_PORTS 1 "

fi

if [ "$NET" == "OPA" ]; then

MPIFLAGS="-PSM2 "

fi

MPIFLAGS+="-genv I_MPI_PIN on "

MPIFLAGS+="-genv I_MPI_DEBUG 4 "

MPIFLAGS+="-genv MALLOC_MMAP_MAX_ 0 -genv MALLOC_TRIM_THRESHOLD_ -1 "


HCA=mlx5_0

HCAPORT=1

TRANSPORT=DAPL

mpirun -np $NP $MPIFLAGS

Selecting Fabrics – MPI Optimisation

Intel MPI Library

14 December 2017

Mellanox HPC-X Toolkit and Intel MPI

The Mellanox HPC-X Toolkit provides a MPI, SHMEM and UPC

software suite for HPC environments. Delivers “enhancements to

significantly increase the scalability & performance of message

communications in the network”. Includes:

¤ Complete MPI, SHMEM, UPC package, including Mellanox MXM and

FCA acceleration engines

¤ Offload collectives communication from MPI process onto Mellanox

interconnect hardware

¤ Maximize application performance with underlying hardware

architecture. Optimized for Mellanox InfiniBand and VPI interconnects

¤ Increase application scalability and resource efficiency

¤ Multiple transport support including RC, DC and UD

¤ Intra-node shared memory communication

• Performance comparison conducted on the Mellanox HP Proliant- E5-

2697A v4 EDR based cluster


http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf

14 December 2017

Application Performance & Interconnect

Two comparison exercises undertaken:

¤ For each application (and associated data sets) analyse the

performance as a function of interconnect – Mellanox EDR and

Intel OPA – as a function of increasing core count.

• DLPOLY4 & GROMACS – 128-1024 cores

• VASP PdO (128-384 cores) & Zeolite (128-512 cores)

• Quantum ESPRESSO (Au112, 64-512; GRIR443, 128-1024)

• OpenFOAM (64-512 cores)

¤ On the Mellanox HP Proliant- E5-2697A v4 EDR based Thor

cluster, compare for each application (and associated data sets)

the relative performance undertaken using Intel MPI and

Mellanox HPCX i.e.

T HPCX / T Intel-MPI


http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf

14 December 2017

4.9

8.4

13.8

19.2

5.2

8.9

13.7

19.3

5.3

9.1

13.7

0.0

3.0

6.0

9.0

12.0

15.0

18.0

128 256 512 1024



Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX




DL_POLY 4 – Gramicidin


Performance Relative to the Fujitsu CX250 e5-2670/ 2.6 GHz 8-C (32 PEs)

BE

TT

ER



14 December 2017


2.6

5.0

9.8

18.2

2.5

4.9

9.6

15.4

2.9

5.7

11.0

0.0

4.0

8.0

12.0

16.0

20.0

128 256 512 1024







49

GROMACS – Lignocellulose




BE

TT

ER

3,316,463 atom system using reaction-

field electrostatics instead of PME


1.51.7 1.7

2.9

4.7

5.6

3.0

4.3

4.9

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

128 256 512








Performance

BE

TT

ER

Zeolite (Si96O192) with MFI structure unit cell running a single point

calculation and a 400eV planewave cut off of using the PBE

functional. maximum number of plane-waves: 96,834, 2 k-points,

FFT grid: (65, 65, 43); 181,675 points




14 December 2017

2.4

4.6

6.5

7.7

2.5

4.4

5.7

5.2

2.8

5.2

6.3

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

128 256 512 1024



Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX





Pe

rfo

rma

nc

e

BE

TT

ER

Performance Data (128-1024 PEs)Quantum Espresso – GRIR443


[Re

lative

to

th

e F

ujit

su

e5

-2670

2.6

GH

z 8

-C (

96

PE

s)]

14 December 2017

2.3

8.3

16.9

33.1

2.4

9.0

21.5

39.7

2.5

7.9

15.3

36.1

0.0

5.0

10.0

15.0

20.0

25.0

30.0

35.0

40.0

64 128 256 512







52

OpenFOAM – Cavity 3d-3M


Perf

orm

an

ce

Relative to the Fujitsu CX250 e5-2670 8-C (32 PEs)

BE

TT

ER


OpenFOAM with lid-driven cavity flow 3d-3M data set


DL_POLY 4 – Intel MPI vs. HPCX


% Intel MPI

Performance vs.

HPCX

Processor Count

Intel MPI is seen to outperform HPC-X for the DLPOLY 4 NaCl test

case at all core counts, and at lower core counts for Gramicidin

90%

95%

100%

105%

110%

115%

120%

0 128 256 384 512 640 768 896 1024

DL_POLY4 - NaCl

DL_POLY4 - Gramicidin

14 December 2017

95%

100%

105%

110%

115%

120%

125%

130%

0 128 256 384 512 640 768 896 1024

GROMACS - ion channel

GROMACS - lignocellulose

GROMACS – Intel MPI vs. HPCX


% Intel MPI

Performance vs. HPCX

Processor Count

At no point does the HPC-X implementation

of Gromacs outperform that using Intel MPI

14 December 2017

50%

60%

70%

80%

90%

100%

110%

0 128 256 384 512

VASP - Palladium Complex

VASP - Zeolite Cluster

VASP 5.4.1 – Intel MPI vs. HPCX


% Intel MPI Performance vs.

HPCX

Processor Count

Significantly different to the classical MD codes – now

HPCX is seen to outperform Intel MPI for the Zeolite

cluster at all core counts, and at larger core counts for

the Palladium complex

14 December 2017

65%

75%

85%

95%

105%

0 128 256 384 512 640 768

Quantum Espresso - GRIR443

Quantum Espresso - Au112

Quantum Espresso – Intel MPI vs. HPCX


% Intel MPI Performance vs. HPCX

Processor Count

Significantly different to the classical MD codes – as

with VASP, HPCX is seen to outperform Intel MPI for the

larger core counts

14 December 2017

3. Relative Performance as a Function of

Processor Family and Interconnect.



0.00

0.20

0.40

0.60

0.80

1.00

DLPOLY-4Gramicidin

DLPOLY-4 NaCl

GROMACS ion-channel

GROMACSlignocellulose

OpenFoam -3d3M

QE Au112

QE GRIR443

VASP Pd-Ocomplex

VASP Zeolitecomplex

BSMBenchBalance

Bull b510 "Raven"Sandy Bridge e5-2670/2.6 GHz IB-QDR

Fujitsu CX250 SandyBridge e5-2670/2.6 GHzIB-QDR


Thor Dell|EMC e5-2697Av4 2.6GHz (T) EDR IMPI

Thor Dell|EMC e5-2697Av4 2.6GHz (T) EDRHPCX

Dell Skylake Gold 61302.1GHz (T) OPA

Intel Skylake Gold 61482.4GHz (T) OPA

Dell Skylake Gold 61422.6GHz (T) EDR


Target Codes and Data Sets – 128 PEs


128 PE Performance [Applications]

14 December 2017

http://www.cp2k.org/_media/lumo.jpg?cache=


1.08

1.19

1.32

1.34

1.38

1.39

1.40

1.42

1.44

1.44

1.49

1.52

1.55

1.64

1.76

2.07

2.12

2.29

2.64

1.00 1.30 1.60 1.90 2.20 2.50

DLPOLYclassic Bench7

OpenFoam - 3d3M

GAMESS-UK (hf12z)

GAMESS-UK (Siosi7)

GROMACS ion-channel

NAMD (F1atpase)

GAMESS-UK (valino.A2)

GAMESS-UK (cyc-sporin)

GROMACS…

LAMMPS (3d LJ melt)

NAMD (apoa1)

NAMD (stmv)

DLPOLY-4 NaCl

DLPOLY-4 Gramicidin

CP2K H2O.256

VASP Zeolite complex

QE GRIR443

QE Au112

VASP Pd-O complex

Improved Performance of

Dell |EMC Skylake Gold 6130

2.1GHz (T) OPA

vs.

Fujitsu CX250 Sandy Bridge e5-

2670/2.6 GHz IB-QDR


Average Factor = 1.54

SKL “Gold” 6130 2.1 GHz OPA vs. SB e5-2670 2.6 GHz QDR

256 cores

14 December 2017

1.08

1.19

1.32

1.34

1.38

1.39

1.40

1.42

1.44

1.44

1.49

1.52

1.55

1.64

1.76

2.07

2.12

2.29

2.64

1.00 1.30 1.60 1.90 2.20 2.50


OpenFoam - 3d3M

GAMESS-UK (hf12z)

GAMESS-UK (Siosi7)

GROMACS ion-channel

NAMD (F1atpase)



GROMACS…

LAMMPS (3d LJ melt)

NAMD (apoa1)

NAMD (stmv)

DLPOLY-4 NaCl

DLPOLY-4 Gramicidin

CP2K H2O.256


QE GRIR443

QE Au112

VASP Pd-O complex



2.7GHz (T) EDR

vs.


2670/2.6 GHz IB-QDR



SKL “Gold” 6150 2.7 GHz EDR vs. SB e5-2670 2.6 GHz QDR

256 cores

14 December 2017

0.87

0.98

1.00

1.00

1.01

1.02

1.02

1.04

1.04

1.04

1.11

1.11

1.11

1.12

1.12

1.22

1.33

1.45

1.58

1.59

0.70 0.80 0.90 1.00 1.10 1.20 1.30 1.40 1.50 1.60

OpenFoam - 3d3M


DLPOLY-4 NaCl

GAMESS-UK (hf12z)

LAMMPS (3d LJ melt)

GROMACS ion-channel


DLPOLY-4 Gramicidin

GAMESS-UK (Siosi7)


NAMD (stmv)

NAMD (apoa1)

GROMACS…

NAMD (F1atpase)

VASP Pd-O complex

QE Au112



QE GRIR443



Intel Skylake Gold 6148 2.4GHz (T)

OPA

vs.

ATOS Broadwell e5-2680v4 2.4GHz

(T) OPA



SKL “Gold” 6148 2.4 GHz vs. BDW e5-2680v4 2.4 GHz EDR

256 cores

14 December 2017


Performance Benchmarks – Node to Node

• Analysis of performance Metrics across a variety of data sets

¤ “Core to core” and “node to node” workload comparisons

• Previous charts based on Core to core comparison i.e.

performance for jobs with a fixed number of cores

• Node to Node comparison typical of the performance when

running a workload (real life production). Expected to reveal

the major benefits of increasing core count per socket

¤ Focus on a 6 “node to node” comparison of the following:

¤ Benchmarks based on set of 10 applications & 19 data sets.

1Fujitsu CX250 Sandy Bridge e5-2670/2.6

GHz IB-QDR [96 cores]Dell |EMC Skylake Gold 6150 2.7GHz

(T) EDR [216 cores]

2Fujitsu CX250 Sandy Bridge e5-2670/2.6

GHz IB-QDR [96 cores]Dell |EMC Skylake Gold 6142 2.6GHz

(T) EDR [192 cores]

3Fujitsu CX250 Sandy Bridge e5-

2670/2.6 GHz IB-QDR [96 cores]

Dell |EMC Skylake Gold 6130 2.1GHz

(T) OPA [168 cores]



“Thor” Broadwell cluster e5-2697A v4

2.6GHz (T) IB EDR [192 cores]

5“Thor” Broadwell cluster e5-2697A v4

2.6GHz (T) IB EDR [192 cores]


(T) EDR [216 cores]

14 December 2017

1.49

2.15

2.22

2.42

2.46

2.46

2.47

2.51

2.58

2.58

2.62

2.67

2.72

2.75

2.79

2.86

2.86

3.98

4.01

1.0 1.5 2.0 2.5 3.0 3.5 4.0



GAMESS-UK (hf12z)

GAMESS-UK (Siosi7)

VASP Pd-O complex


DLPOLY-4 NaCl

CP2K H2O.256

DLPOLY-4 Gramicidin

GROMACS ion-channel


NAMD (stmv)

LAMMPS (3d LJ melt)

GROMACS lignocellulose

NAMD (F1atpase)

NAMD (apoa1)

QE Au112

OpenFoam - 3d3M

QE GRIR443



2.1GHz (T) OPA [192 cores]

vs.





SKL “Gold” 6130 2.1 GHz OPA vs. SB e5-2670 2.6 GHz QDR

6 Node Comparison

14 December 2017

1.76

2.83

3.08

3.17

3.26

3.28

3.31

3.32

3.34

3.83

4.50

5.23

0.0 1.0 2.0 3.0 4.0 5.0

DLPOLYclassicBench5

QE Au112

DLPOLY-4 NaCl

GROMACS ion-channel

DLPOLY-4 Gramicidin


LAMMPS (3d LJ melt)

VASP Pd-O complex

CP2K H2O.256

GROMACSlignocellulose

QE GRIR443

OpenFoam - 3d3M



2.7GHz (T) EDR [216 cores]

vs.





SKL “Gold” 6150 2.7 GHz EDR vs. SB e5-2670 2.6 GHz QDR

6 Node Comparison

14 December 2017

0.91

0.91

0.93

0.94

1.00

1.01

1.02

1.02

1.04

1.04

1.05

1.05

1.06

1.08

1.09

1.10

1.11

1.12

1.14

0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2

VASP Pd-O complex

OpenFoam - 3d3M


QE Au112

QE GRIR443

DLPOLY-4 NaCl

CP2K H2O.256

GAMESS-UK (hf12z)

DLPOLY-4 Gramicidin


GAMESS-UK (Siosi7)



LAMMPS (3d LJ melt)

GROMACS ion-channel

NAMD (apoa1)

NAMD (F1atpase)

NAMD (stmv)




6 Node Comparison

SKL “Gold” 6142 2.6 GHz EDR vs. BDW e5-2697Av4 2.4 GHz EDR



(T) EDR [192 cores]

vs.

Thor Dell|EMC e5-2697A v4 2.6GHz

(T) EDR HPCX [192 cores]

14 December 2017

4. Preliminary Evaluation of the AMD

EPYC 7601 Processor.



• Zen cores

¤ Private L1/L2 cache

• CCX

¤ 4 ZEN cores (or less)

¤ 8MB L3 shared cache

• Zeppelin

¤ 2 CCX (or less)

¤ 2 DDR4 channels

¤ 2 PCIe 16x

• Naples

¤ 4 Zeppelin SoC dies fully

connected by Infinity

Fabric.

¤ 4 Numa Nodes !

067

EPYC Architecture - Naples, Zeppelin & CCX2x16 PCie

2x D

DR

4 C

han

nels

Coherent Links

Co

here

nt L

inks

8M

L3

Zen L2

Zen L2

Zen L2

Zen L2

8M

L3

L2

L2

Zen

Zen

L2

L2

Zen

Zen

2x16 PCie

2x D

DR

4 C

han

nels

Coherent Links

Co

here

nt L

inks

8M

L3

Zen L2

Zen L2

Zen L2

Zen L2

8M

L3

L2

L2

Zen

Zen

L2

L2

Zen

Zen

2x D

DR

4 C

ha

nn

els

Coherent Links

Co

here

nt L

inks

8M

L3

Zen L2

Zen L2

Zen L2

Zen L2

8M

L3

L2

L2

Zen

Zen

L2

L2

Zen

Zen

2x16 PCie

2x D

DR

4 C

han

nels

Coherent Links

Co

here

nt L

inks

8M

L3

Zen L2

Zen L2

Zen L2

Zen L2

8M

L3

L2

L2

Zen

Zen

L2

L2

Zen

Zen

2x16 PCie

∞

• Delivers 32 cores / 64 threads, 16MB L2 cache and 64MB L3 cache per socket.

• Design also means that there are four NUMA nodes per socket or eight NUMA nodes in a

dual socket system i.e. different memory latencies depending on which die needs data from

memory that can be attached to that die or another die on the fabric.

• The key difference with Intel’s Skylake SP architecture is that AMD needs to go off die within

the same socket where Intel stays on a single piece of silicon.


68

068

Inter-socket• 4 link (2B) between

sockets in 2 processors configurations.

• Each die connect to peer die

• 2 hop maximum system diameter

• 38GB/s bi-dir BW per link• 152GB/s between two

sockets • Infinity Control Fabric

connected between socket

INFINITY FABRICS and Inter-Socket Connectivity

Intra socket• Fully connect with 4B link• 42,6GB/s per link • 170GB/s

Infinity Fabric is a feat of engineering, but it does mean that

there are significant performance variations as you move off

die and onto the fabric.


069

SKU 7601 7551 7501 7451 7401 7351 7301

Freq (base) 2.2 2.0 2.0 2.3 2.0 2.4 2.2

Turboboost

All cores

active

2.7 2.6 2.6 2.9 2.8 2.9 2.7

Turboboost

On core

active

3.2 3.0 3.0 3.2 3.0 2.9 2.7

Cores/socket 32 32 32 24 24 16 16

L3 cache size 64 MB

Memory

Channel8

Memory Freq 2667 MT/s

TDP (W) 180 180 155/170 180 155/170 155/170 155/170

AMD® Epyc™ 7000 Series - SKU Map and FLOP/cycle

Architecture Sandy Bridge Haswell Skylake EPYC

ISA* AVX AVX2 AVX-512 AVX2

op/cycle2

(1 ADD, 1 MUL)

4

(2 FMA)

4

(2 FMA)

4

(2 ADD, 2 MUL)

Vector size (DP =

64-bits)4 4 8 2

FLOP/cycle 8 16 32 8

* Instruction Set Architecture


AMD EPYC 7601 System – 4 x SuperMicro AS-1123US-TR4

– 1U 1 node form-factor

– Bi-socket AMD® Epyc 7601

• x86_64 architecture (Support up to AVX2 ISA)

• Base Frequency 2.2GHz, Turbo core:3.2GHz (one

core active), 2.7GHz (all cores active)

• Per socket: 32 zen-cores/64 threads

– 563.2 MFLOPS in DP64 per socket

– L1-D 32KB, L1-I 64KB, L2 512KB, L3 3MB

– Hyperthreading On, Turbo core On

– TDP: 180 Watt

– Memory sub-system

– 8-Memory channels per socket, up to 2666MT/s

» Theoretical Peak: 170.4GBytes/s per socket

– 16 DIMM 16GB SR (1 per Channel) @ 2666MT/s

» Samsung M393A2K40BB2-CTD

▶ 4 disks ST1000NM0008-2F2

– Seagate Capacity 1TB (3,5“) SATA, 7200rpm

– 1 disk for OS (Ext4)

– 3 disks for /scratch (not configured)

▶ OS : RHEL7.4, kernel 3.10.0-693.el7.x86_64

▶ Interconnected through Mellanox® EDR-4x Fabric


EPYC - Compiler and Run-time Options

Compilation:

INTEL COMPILERS 2018, IntelMPI 2017

Update 3, FFTW-3.3.5

AMD EPYC: –O3 -xAVX2

INTEL SKYLAKE: -O3 –xCORE-AVX2

#

# Preload the amd-cputype library to navigate

# the Intel Genuine cpu test

module use /opt/amd/modulefiles

module load AMD/amd-cputype/1.0

export LD_PRELOAD=$AMD_CPUTYPE_LIB

export OMP_PROC_BIND=true

# export KMP_AFFINITY=granularity=fine

export I_MPI_DEBUG=5

export MKL_DEBUG_CPU_TYPE=5


STREAM:source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh

intel64

module load AMD/amd-cputype/1.0

icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp -

DSTREAM_ARRAY_SIZE=800000000 \

-mcmodel=large -shared-intel

export OMP_NUM_THREADS=16

export OMP_PROC_BIND=true

export OMP_PLACES="{0:4:1}:16:4” #1 thread per CCX

export OMP_DISPLAY_ENV=true

2.0

2,999

11,567

1

10

100

1,000

10,000

1.0E+00 1.0E+01 1.0E+02 1.0E+03 1.0E+04 1.0E+05 1.0E+06 1.0E+07







MPI Performance – PingPong


1 PE / node

Latency


Mb

yte

s/s

ec


BE

TT

ER

14 December 2017

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07




Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA

Dell Skylake Gold 6130 2.1GHz (T) OPA




MPI Collectives – Alltoallv (128 PEs)


128 PEs

Latency

BE

TT

ER


Measured Time (usec)


EPYC performance with

Intel MPI ~ 4-6 × worse

than that with SKL

processors!

14 December 2017

2.9

5.8

9.9

13.4

3.5

6.8

11.1

15.7

4.1

7.7

12.4

16.0

4.2

7.7

12.4

16.7

4.3

6.4

10.2

13.6

0.0

3.0

6.0

9.0

12.0

15.0

32 64 128 256


ClusterVision Ivy Bridge e5-2650v2 2.6GHz True Scale QDR

Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB

Intel Broadwell2 e5-2690v4 2.6GHz (T) OPA





DL_POLY Classic – NaCl Simulation


Performance



BE

TT

ER

NaCl 27,000 atoms; 500 time steps


1.2

2.1

1.5

2.7

1.2

2.1

1.6

2.9

1.5

2.7

1.8

3.2

1.6

2.8

1.7

3.1

1.8

3.3

1.9

3.5

1.1

2.0

1.5

2.7

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

128 256


ClusterVision e5-2650v2 2.6GHz Truescale QDR

Bull Haswell e5-2695v3 2.3GHz Connect-IB

Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR

Huawei Fusion CH140 e5-2683 v4 2.1GHz (T) EDR

Thor Broadwell e5-2697A v4 2.6GHz (T) EDR







Zeolite Y cluster SioSi7 DZVP (Si,O), DZVP2 (H) B3LYP(3975 GTOs)

GAMESS-UK MPI/ScaLAPACK code

Performance



BE

TT

ER


1.8

3.1

4.9

2.9

5.1

8.5

2.7

5.0

8.4

3.0

5.4

9.0

2.4

3.6

4.9

0.0

2.0

4.0

6.0

8.0

10.0

64 128 256








DL_POLY 4 – Gramicidin Simulation


Performance Relative to the Fujitsu CX250 e5-2670/ 2.6 GHz 8-C (32 PEs)

BE

TT

ER



1.4

2.0

2.6

1.5

2.1

2.7

0.9

1.4

1.6

1.4

1.8

2.1

0.0

0.5

1.0

1.5

2.0

2.5

3.0

64 96 128




ATOS AMD EPYC 7601 2.2 GHz (T) EDR (16c/socket)


Perf

orm

an

ce


BE

TT

ER


Zeolite (Si96O192) with MFI structure unit cell running a single

point calculation and a 400eV planewave cut off of using the

PBE functional. maximum number of plane-waves: 96,834, 2 k-

points, FFT grid: (65, 65, 43); 181,675 points


1.3

1.9

2.42.3

3.2

3.8

2.8

4.0

5.2

2.8

4.0

4.9

1.4

1.82.0

0.0

1.0

2.0

3.0

4.0

5.0

6.0

96 128 160









Pe

rfo

rma

nc

e

BE

TT

ER

Performance Data (96-160 PEs)Quantum Espresso – GRIR443[R

ela

tive

to

th

e F

ujit

su

e5

-

26

70

2.6

GH

z 8

-C (

96

PE

s)]


0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00


DLPOLY-4Gramicidin

DLPOLY-4NaCl

GROMACSion-channel

GROMACSlignocellulos

e

GAMESS-UK(cyc-sporin)

GAMESS-UK(Siosi7)

QE Au112

QE GRIR443

VASP Pd-Ocomplex

VASP Zeolitecomplex

Fujitsu CX250 SandyBridge e5-2670/2.6 GHz IB-QDR


Thor Dell|EMC e5-2697Av4 2.6GHz (T) EDR IMPI

Dell Skylake Gold 61302.1GHz (T) OPA

Intel Skylake Gold 61482.4GHz (T) OPA



Bull|ATOS Skylake Gold6150 2.7GHz (T) EDR

ATOS AMD EPYC 7601 2.2GHz (T) EDR

EPYC - Target Codes and Data Sets – 128 PEs


128 PE Performance [Applications]

14 December 2017




Performance Benchmarks – Node to Node

• Analysis of performance Metrics across a variety of data sets

¤ “Core to core” and “node to node” workload comparisons

• Previous EPYC charts based on Core to core comparison

i.e. performance for jobs with a fixed number of cores

• Node to Node comparison typical of the performance when

running a workload (real life production). Expected to reveal

the major benefits of increasing core count per socket

¤ Focus on a “node to node” comparison of the following:

¤ Benchmarks based on set of 6 applications & 15 data sets.



ATOS AMD EPYC 7601 2.2 GHz (T)

EDR [256 cores]

2Dell |EMC Skylake Gold 6130 2.1GHz

(T) OPA [168 cores]

ATOS AMD EPYC 7601 2.2 GHz (T)

EDR [256 cores]

14 December 2017

1.48

1.93

2.09

2.09

2.46

2.49

2.67

2.92

3.37

3.43

3.98

4.05

1.0 1.5 2.0 2.5 3.0 3.5 4.0




VASP Pd-O complex

DLPOLY-4 NaCl


DLPOLY-4 Gramicidin

QE Au112

GROMACS ion-channel




Relative Performance of

ATOS AMD EPYC 7601 2.2 GHz

(T) EDR [256 cores]

vs.





AMD EPYC 7601 2.2 GHz (T) EDR vs. SB e5-2670 2.6 GHz QDR

14 December 2017

4 Node Comparison

0.70

0.72

0.74

0.76

0.95

1.00

1.06

1.19

1.29

1.43

1.49

1.59

1.70

1.71

1.78

0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9

QE Au112

QE GRIR443

VASP Pd-O complex



DLPOLY-4 NaCl

DLPOLY-4 Gramicidin




GROMACS ion-channel


GAMESS-UK (Siosi7)


GAMESS-UK (hf12z)

Relative Performance of

ATOS AMD EPYC 7601 2.2 GHz

(T) EDR [256 cores]

vs.


2.1GHz (T) OPA [128 cores]



SKL “Gold” 6130 2.1 GHz OPA vs. AMD EPYC 7601 2.2 GHz (T) EDR vs.

14 December 2017

4 Node Comparison

5. Acknowledgements

• Ludovic Sauge, Enguerrand Petit and Patrick Berghaeger (Bull/ATOS)

for informative performance discussions and access to the Skylake &

EPYC clusters at the Bull HPC Competency Centre.

• Pak Lui, David Cho, Gilad Shainer, Colin Bridger & Steve Davey for

access to the “Thor” cluster at the HPC Advisory Council and

“Hercules” partition in Mellanox.

• Doug Mark, Farid Parpia, John Simpson, Ludovic Enault, Xinghong

He, James Kuchler & Luke Willett for access to and assistance on the

IBM Power8 S822LC cluster in Poughkeepsie.

• David Power for access to two Skylake nodes in Boston /BIOSIT.

• Jamie Wilcox, Bogdan Pop, Toby Smith and Andrew Richardson (Intel)

for past access to and help with a host of processors and for access to

the Swindon clusters.

• Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles Civario and

Christopher Huggins for access to, and assistance with, the variety of

Skylake SKUs at the Dell Benchmarking Centre.


Summary

• Focus on performance benchmarks and clusters featuring Intel’s

Skylake “Gold” processors (6130, 2.1 GHz [16c]; 6148, 2.4 GHz

[20c], 6142, 2.6 GHz [16c] ; and 6150, 2.7 GHz [18c]).

• Performance comparison with current Sandy Bridge systems and

those based on dual Intel Broadwell processor EP nodes (16-

core, 14-core) with Mellanox EDR and Intel’s Omnipath OPA

interconnects.

• Measurements of parallel application performance based on

synthetic (STREAM, IMB, HPCC and IOR) and end user

applications – DLPOLY, Gromacs, NAMD, LAMMPS, GAMESS-

UK, Quantum ESPRESSO, VASP and CP2K.

¤ Use of IPM and Allinea Performance reports and comparison of

Mellanox’s HPC-X and Intel MPI on EDR-based systems

• Enhanced performance of the Skylake-based clusters is at first

sight modest, particularly when compared with optimised runs

(HPCX) on previous generation Intel Broadwell clusters.


Summary II

• A Core-to-Core comparison across 19 data sets (10 applications)

suggests modest speedups (ca. 1.08) when comparing a Skylake

“Gold” 6150 cluster (EDR) to the Broadwell-based “Thor” e5-2697A v4

2.6GHz (T) cluster with EDR and the HPCX environment.

¤ Little difference in application performance based on Mellanox’s IB

EDR interconnect and Intel’s Omnipath (OPA) interconnect, at least

at modest (< 512 core) counts.

¤ A comparison against the Fujitsu CX250 Sandy Bridge 2.6GHz IB-

QDR, shows average performance increase factors of

• 1.54 (256 cores) for clusters based on the Gold 6130 [OPA]

• 1.64 (256 cores) for clusters based on the Gold 6148 [OPA]

• 1.71 (256 cores) for clusters based on the Gold 6142 [EDR]

• 1.88 (256 cores) for clusters based on the Gold 6150 [EDR]

¤ Some applications however show much higher factors e.g.

Quantum Espresso and VASP.

¤ A Node-to-Node comparison typical of the performance when

running a workload is more encouraging.


Summary III

• A 6-node benchmark based on examples from 10 application and 19

data sets show the following improvement factors against 6 node runs

on the Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz cluster

¤ Skylake “Gold” 6130 cluster (16c) with EDR interconnect : 2.54



• Optimum Interconnect performance is a function of both application

and core count.

¤ With the materials-based codes & OpenFOAM, and at high core

count (> 512 cores), EDR exhibits a clear performance

advantage over OPA.

¤ This is not the case for the classical MD codes where OPA

shows a distinct advantage at all but the highest core counts.

• Preliminary studies on the EPYC 2701 shows a complex performance

dependency on EPYC architecture.

¤ Codes with high usage of vector instructions (Gromacs, VASP and

Quantum Espresso) perform at best in somewhat modest fashion.


Download - Application Performance in Chemistry and Materials Science193.62.125.70/CIUK2017/MartynGuest_Cardiff.pdf · 2018. 2. 6. · Dell Skylake Gold 6130 2.1GHz (T) OPA Dell Skylake Gold

Top Related