Applications on the A64FX
TRANSCRIPT
Adrian Jackson
Senior Research Fellow
EPCC, The University of Edinburgh
@adrianjhpc
Investigating Applications on the A64FX
Adrian Jackson
Michèle Weiland
Nick Brown
Andrew Turner
Mark Parsons
EPCC, The University of Edinburgh
Arm-based processors
• Arm (Nvidia?) designed processors have a strong presence in low-power computing
– Arm-licensed processor designs are found in a wide range of mobile devices, and even in Intel products
• Recent work has seen a number of Arm-designed processors developed for server-class applications
– Cavium’s ThunderX2
– Amazon Graviton
– Ampere eMAG
– Huawei Kunpeng 920
– Fujitsu A64FX
A64FX processor
• Fujitsu A64FX
– 48 cores
– Maximum clock speed of 2.2 GHz
– 12 cores per quadrant
– Separate assistant cores for the O/S
– 64KB L1 per core, 32MB L2 (8MB per quadrant) per chip
– 512-bit wide SVE vectors (see the sketch after this list)
– Compared to Skylake’s 512-bit, Broadwell’s 256-bit, and Ivy Bridge and TX2’s 128-bit vectors
– 4 memory controllers (one per quadrant) with 8 GB HBM2 each
– ~1TB/s memory bandwidth across the chip
– Compared to 6 channels on latest Intel
– AMD’s EPYC also has 8 channels
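As an aside, here is a minimal sketch (my own illustration, not Fujitsu code) of querying the SVE vector width at runtime using the ACLE intrinsics. Assuming an SVE-enabled compiler such as GCC with -march=armv8.2-a+sve, on the A64FX this should report 512 bits, i.e. 8 doubles per vector, versus 128 bits on the TX2.

  // Minimal sketch: report the SVE vector width at runtime via ACLE intrinsics.
  // Build (assumed): g++ -O2 -march=armv8.2-a+sve sve_width.cpp -o sve_width
  #include <arm_sve.h>
  #include <cstdio>

  int main() {
      // svcntb() returns the number of bytes in an SVE vector,
      // svcntd() the number of 64-bit (double) lanes.
      std::printf("SVE width: %llu bits, %llu doubles per vector\n",
                  (unsigned long long)(svcntb() * 8),
                  (unsigned long long)svcntd());
      return 0;
  }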
Previous work
• Existing results on other Arm processors (TX2)
Evaluating the Arm Ecosystem for High Performance Computing. A. Jackson, A. Turner, M. Weiland, N. Johnson, O. Perks, M. Parsons, PASC '19, June 2019.
[Figure: time to solution relative to SGI ICE XA (lower is quicker) for COSA, OpenSBLI, GROMACS and nektar++ on the HPE Apollo 70, SGI ICE XA, Cray XC30 and Dell EMC systems]
Multi-node performance
• Investigating A64FX performance at scale
• Large scale application runs
• Networking evaluation
• Evaluating porting effort and software eco-system maturity
– Including relevant programming languages
– Fortran, C, and C++
System comparison
Application benchmarking
HPCG
• High Performance Conjugate Gradient (HPCG) kernel benchmark, aiming to exercise:
– Floating point performance
– Memory bandwidth
– Network bandwidth and latency
– Implemented with C++, MPI, and OpenMP (see the sketch below)
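A minimal sketch (my own illustration, not HPCG source) of the pattern HPCG stresses at scale: an OpenMP-threaded local dot product combined with an MPI_Allreduce, so it touches memory bandwidth locally and network latency globally.

  // Illustrative sketch (not HPCG source): the distributed dot product used in
  // each CG iteration, combining an OpenMP reduction with an MPI_Allreduce.
  // Build (assumed): mpicxx -O2 -fopenmp dot.cpp -o dot
  #include <mpi.h>
  #include <cstdio>
  #include <vector>

  double dot(const std::vector<double>& x, const std::vector<double>& y) {
      double local = 0.0;
      const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
      #pragma omp parallel for reduction(+:local)
      for (std::ptrdiff_t i = 0; i < n; ++i)
          local += x[i] * y[i];
      double global = 0.0;
      // Latency-sensitive collective, performed every solver iteration.
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return global;
  }

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
      double d = dot(x, y);  // collective: every rank must call it
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) std::printf("dot = %g\n", d);
      MPI_Finalize();
      return 0;
  }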
HPCG multi-node
minikab
• Mini Krylov ASiMoV Benchmark (minikab)
• Parallel CG solver
– Fortran 2008 MPI and OpenMP
• Can configure
– The type of decomposition;
– The solver algorithm;
– The communication approach;
– The serial sparse-matrix routine, written in plain Fortran or provided via a numerical library (such as MKL); see the SpMV sketch below
• Sparse matrix benchmark
– 9,573,984 degrees of freedom and 696,096,138 non-zero elements
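For illustration (an assumed sketch, not minikab source), this is the kind of plain CSR sparse matrix-vector product that can either be written directly or delegated to a library such as MKL; each non-zero is touched exactly once, which is why such solvers are memory-bandwidth bound.

  // Illustrative sketch (not minikab source): a plain CSR sparse matrix-vector
  // product, the routine minikab can keep in plain code or hand to a library.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  void spmv_csr(const std::vector<std::size_t>& row_ptr,  // size n+1
                const std::vector<std::size_t>& col_idx,  // size nnz
                const std::vector<double>& val,           // size nnz
                const std::vector<double>& x,             // size n
                std::vector<double>& y) {                 // size n
      const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(row_ptr.size()) - 1;
      #pragma omp parallel for
      for (std::ptrdiff_t i = 0; i < n; ++i) {
          double sum = 0.0;
          for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
              sum += val[k] * x[col_idx[k]];
          y[i] = sum;
      }
  }

  int main() {
      // 2x2 example: [[4, 1], [0, 3]] * [1, 2] = [6, 6]
      std::vector<std::size_t> row_ptr{0, 2, 3}, col_idx{0, 1, 1};
      std::vector<double> val{4, 1, 3}, x{1, 2}, y(2);
      spmv_csr(row_ptr, col_idx, val, x, y);
      std::printf("y = [%g, %g]\n", y[0], y[1]);
      return 0;
  }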
minikab performance
nekbone
• The Nekbone mini-app benchmark captures the basic structure of the Nek5000 application
– a high-order, incompressible Navier-Stokes solver based on the spectral element method, implemented in Fortran
– Dominated by matrix-vector multiplications performed in an element-by-element fashion (see the sketch below)
– Nearest-neighbour communication and MPI Allreduce operations
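A minimal sketch (an assumed structure, not Nekbone source) of the element-by-element matrix-vector product: each spectral element applies the same small dense operator to its own local degrees of freedom, so the work is dense compute repeated over many elements.

  // Illustrative sketch (not Nekbone source): apply an np-by-np dense operator
  // A to each element's local vector of degrees of freedom.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  void elem_matvec(int nelems, int np,
                   const std::vector<double>& A,  // np*np operator, shared
                   const std::vector<double>& u,  // nelems*np input
                   std::vector<double>& w) {      // nelems*np output
      #pragma omp parallel for
      for (int e = 0; e < nelems; ++e) {
          const double* ue = &u[static_cast<std::size_t>(e) * np];
          double* we = &w[static_cast<std::size_t>(e) * np];
          for (int i = 0; i < np; ++i) {
              double sum = 0.0;
              for (int j = 0; j < np; ++j)
                  sum += A[static_cast<std::size_t>(i) * np + j] * ue[j];
              we[i] = sum;
          }
      }
  }

  int main() {
      const int nelems = 1000, np = 8;
      std::vector<double> A(np * np, 1.0), u(nelems * np, 2.0), w(nelems * np);
      elem_matvec(nelems, np, A, u, w);
      std::printf("w[0] = %g (expect %g)\n", w[0], 2.0 * np);
      return 0;
  }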
nekbone
COSA
• Fluid dynamics code
– Harmonic balance (frequency domain approach)
– Unsteady Navier-Stokes solver
– Optimised for turbo-machinery-like problems
– Multi-grid, multi-level, multi-block code
– Implemented in Fortran (with Cray pointers)
– Parallelised with MPI
COSA
CASTEP
• CASTEP is a DFT code for calculating the properties of materials from first principles
– can simulate a wide range of material properties including energetics, structure at the atomic level, vibrational properties, electronic response properties, etc.
– Fortran code with MPI and OpenMP parallelisations
– Uses FFT libraries heavily (see the sketch below)
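To illustrate that FFT dependence (my own sketch, with FFTW used purely as a stand-in for whichever FFT library a given build links against, such as MKL or the Fujitsu maths libraries): a 3D complex-to-complex transform of the kind plane-wave DFT codes perform repeatedly.

  // Illustrative sketch: a 3D complex-to-complex FFT using FFTW as an example
  // library; plane-wave DFT codes perform transforms like this many times.
  // Build (assumed): g++ -O2 fft3d.cpp -lfftw3 -o fft3d
  #include <fftw3.h>
  #include <cstddef>
  #include <cstdio>

  int main() {
      const int n = 64;  // 64^3 grid
      const std::size_t total = static_cast<std::size_t>(n) * n * n;
      fftw_complex* grid = static_cast<fftw_complex*>(
          fftw_malloc(sizeof(fftw_complex) * total));
      for (std::size_t i = 0; i < total; ++i) { grid[i][0] = 1.0; grid[i][1] = 0.0; }

      // Plan once, execute many times: the usual FFTW usage pattern.
      fftw_plan plan = fftw_plan_dft_3d(n, n, n, grid, grid,
                                        FFTW_FORWARD, FFTW_ESTIMATE);
      fftw_execute(plan);
      std::printf("DC component after forward FFT: %g\n", grid[0][0]);

      fftw_destroy_plan(plan);
      fftw_free(grid);
      return 0;
  }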
CASTEP
OpenSBLI
• Programming framework to generate finite difference approximations
• Implemented in Python, which generates C code using the OPS library, with MPI and OpenMP parallel functionality as well as GPU parallelisations (see the sketch below)
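A minimal sketch (my own illustration, not actual OPS-generated code) of the kind of finite-difference stencil kernel OpenSBLI ultimately produces, here a second-order central difference on a 1D grid; the real generated code also handles halo exchanges and the MPI decomposition.

  // Illustrative sketch (not OPS-generated code): a second-order central
  // difference du/dx over the interior of a 1D grid.
  #include <cstdio>
  #include <vector>

  int main() {
      const int n = 1024;
      const double dx = 1.0 / (n - 1);
      std::vector<double> u(n), dudx(n, 0.0);
      for (int i = 0; i < n; ++i) {
          const double x = i * dx;
          u[i] = x * (1.0 - x);  // test field with known derivative 1 - 2x
      }

      // Interior points only; boundaries/halos are handled elsewhere in practice.
      #pragma omp parallel for
      for (int i = 1; i < n - 1; ++i)
          dudx[i] = (u[i + 1] - u[i - 1]) / (2.0 * dx);

      std::printf("du/dx near the mid-point: %g\n", dudx[n / 2]);
      return 0;
  }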
OpenSBLI
Summary
• A porting investigation, not an optimisation study
– General porting was very straightforward
– GNU compilers and Fujitsu maths libraries help significantly
– Fujitsu compilers can bring performance benefits
• Performance is generally extremely good
– Not all codes are necessarily faster than on top-end Intel processors
• Memory bandwidth dominated codes benefit significantly
• The small memory capacity per node poses challenges for some applications
– Over-decomposition is necessary to fit simulations into memory
• Scope for targeted optimisation to improve performance
Acknowledgements
• Access to the A64FX was provided through the Fujitsu early
access programme
• The Fulhame HPE Apollo 70 system is supplied to EPCC as part
of the Catalyst UK programme, a collaboration with Hewlett
Packard Enterprise, Arm and SUSE to accelerate the adoption of
Arm based supercomputer applications in the UK.
• This work used the Cirrus UK National Tier-2 HPC Service at
EPCC (http://www.cirrus.ac.uk) funded by the University of
Edinburgh and EPSRC (EP/P020267/1).
• This work used the ARCHER UK National Supercomputing
Service (http://www.archer.ac.uk).
• The EPCC NGIO system was funded by the European Union's
Horizon 2020 Research and Innovation programme under Grant
Agreement no. 671951.