Applications on the A64FX
TRANSCRIPT
Adrian Jackson
Senior Research Fellow
EPCC, The University of Edinburgh
@adrianjhpc
Investigating Applications on the A64FX
Adrian Jackson
Michèle Weiland
Nick Brown
Andrew Turner
Mark Parsons
EPCC, The University of Edinburgh
Arm-based processors
• Arm (Nvidia?) designed processors have a strong presence in low-power computing
– Arm-licensed processor designs are found in a wide range of mobile devices, and even in Intel products
• Recent work has seen a number of Arm-designed processors developed for server-class applications
– Cavium’s ThunderX2
– Amazon Graviton
– Ampere eMAG
– Huawei Kunpeng 920
– Fujitsu A64FX
A64FX processor
• Fujitsu A64FX
– 48 cores
– Maximum clock speed of 2.2 GHz
– 12 cores per quadrant
– Separate assistant cores for the O/S
– 64KB L1 per core, 32MB L2 (8MB per quadrant) per chip
– 512-bit wide SVE vectors (see the sketch after this list)
– Compared to Skylake’s 512-bit, Broadwell’s 256-bit, and Ivy Bridge and TX2’s 128-bit vectors
– 4 memory controllers (one per quadrant) with 8 GB HBM2 each
– ~1TB/s memory bandwidth across the chip
– Compared to 6 channels on latest Intel
– AMD’s EPYC also has 8 channels
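As an aside, here is a minimal sketch (my own illustration, not Fujitsu code) of querying the SVE vector width at runtime using the ACLE intrinsics. Assuming an SVE-enabled compiler such as GCC with -march=armv8.2-a+sve, on the A64FX this should report 512 bits, i.e. 8 doubles per vector, versus 128 bits on the TX2.

  // Minimal sketch: report the SVE vector width at runtime via ACLE intrinsics.
  // Build (assumed): g++ -O2 -march=armv8.2-a+sve sve_width.cpp -o sve_width
  #include <arm_sve.h>
  #include <cstdio>

  int main() {
      // svcntb() returns the number of bytes in an SVE vector,
      // svcntd() the number of 64-bit (double) lanes.
      std::printf("SVE width: %llu bits, %llu doubles per vector\n",
                  (unsigned long long)(svcntb() * 8),
                  (unsigned long long)svcntd());
      return 0;
  }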
Previous work
• Existing results on other Arm processors (TX2)
Evaluating the Arm Ecosystem for High Performance Computing. A. Jackson, A. Turner, M. Weiland, N. Johnson, O. Perks, M. Parsons, PASC '19, June 2019.
[Figure: time to solution relative to SGI ICE XA (lower is quicker) for COSA, OpenSBLI, GROMACS and nektar++ on the HPE Apollo 70, SGI ICE XA, Cray XC30 and Dell EMC systems]
Multi-node performance
• Investigating A64FX performance at scale
• Large scale application runs
• Networking evaluation
• Evaluating porting effort and software eco-system maturity
– Including relevant programming languages
– Fortran, C, and C++
System comparison
Application benchmarking
HPCG
• High Performance Conjugate Gradient (HPCG) kernel benchmark, aiming to exercise:
– Floating point performance
– Memory bandwidth
– Network bandwidth and latency
– Implemented with C++, MPI, and OpenMP (see the sketch below)
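A minimal sketch (my own illustration, not HPCG source) of the pattern HPCG stresses at scale: an OpenMP-threaded local dot product combined with an MPI_Allreduce, so it touches memory bandwidth locally and network latency globally.

  // Illustrative sketch (not HPCG source): the distributed dot product used in
  // each CG iteration, combining an OpenMP reduction with an MPI_Allreduce.
  // Build (assumed): mpicxx -O2 -fopenmp dot.cpp -o dot
  #include <mpi.h>
  #include <cstdio>
  #include <vector>

  double dot(const std::vector<double>& x, const std::vector<double>& y) {
      double local = 0.0;
      const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
      #pragma omp parallel for reduction(+:local)
      for (std::ptrdiff_t i = 0; i < n; ++i)
          local += x[i] * y[i];
      double global = 0.0;
      // Latency-sensitive collective, performed every solver iteration.
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return global;
  }

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
      double d = dot(x, y);  // collective: every rank must call it
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) std::printf("dot = %g\n", d);
      MPI_Finalize();
      return 0;
  }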
HPCG multi-node
minikab
• Mini Krylov ASiMoV Benchmark (minikab)
• Parallel CG solver
– Fortran 2008 MPI and OpenMP
• Can configure
– The type of decomposition;
– The solver algorithm;
– The communication approach;
– The serial sparse-matrix routine, written in plain Fortran or provided via a numerical library (such as MKL); see the SpMV sketch below
• Sparse matrix benchmark
– 9,573,984 degrees of freedom and 696,096,138 non-zero elements
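For illustration (an assumed sketch, not minikab source), this is the kind of plain CSR sparse matrix-vector product that can either be written directly or delegated to a library such as MKL; each non-zero is touched exactly once, which is why such solvers are memory-bandwidth bound.

  // Illustrative sketch (not minikab source): a plain CSR sparse matrix-vector
  // product, the routine minikab can keep in plain code or hand to a library.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  void spmv_csr(const std::vector<std::size_t>& row_ptr,  // size n+1
                const std::vector<std::size_t>& col_idx,  // size nnz
                const std::vector<double>& val,           // size nnz
                const std::vector<double>& x,             // size n
                std::vector<double>& y) {                 // size n
      const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(row_ptr.size()) - 1;
      #pragma omp parallel for
      for (std::ptrdiff_t i = 0; i < n; ++i) {
          double sum = 0.0;
          for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
              sum += val[k] * x[col_idx[k]];
          y[i] = sum;
      }
  }

  int main() {
      // 2x2 example: [[4, 1], [0, 3]] * [1, 2] = [6, 6]
      std::vector<std::size_t> row_ptr{0, 2, 3}, col_idx{0, 1, 1};
      std::vector<double> val{4, 1, 3}, x{1, 2}, y(2);
      spmv_csr(row_ptr, col_idx, val, x, y);
      std::printf("y = [%g, %g]\n", y[0], y[1]);
      return 0;
  }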
minikab performance
nekbone
• The Nekbone mini-app benchmark captures the basic structure of the Nek5000 application
– a high-order, incompressible Navier-Stokes solver based on the spectral element method, implemented in Fortran
– Dominated by matrix-vector multiplications performed in an element-by-element fashion (see the sketch below)
– Nearest-neighbour communication and MPI Allreduce operations
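A minimal sketch (an assumed structure, not Nekbone source) of the element-by-element matrix-vector product: each spectral element applies the same small dense operator to its own local degrees of freedom, so the work is dense compute repeated over many elements.

  // Illustrative sketch (not Nekbone source): apply an np-by-np dense operator
  // A to each element's local vector of degrees of freedom.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  void elem_matvec(int nelems, int np,
                   const std::vector<double>& A,  // np*np operator, shared
                   const std::vector<double>& u,  // nelems*np input
                   std::vector<double>& w) {      // nelems*np output
      #pragma omp parallel for
      for (int e = 0; e < nelems; ++e) {
          const double* ue = &u[static_cast<std::size_t>(e) * np];
          double* we = &w[static_cast<std::size_t>(e) * np];
          for (int i = 0; i < np; ++i) {
              double sum = 0.0;
              for (int j = 0; j < np; ++j)
                  sum += A[static_cast<std::size_t>(i) * np + j] * ue[j];
              we[i] = sum;
          }
      }
  }

  int main() {
      const int nelems = 1000, np = 8;
      std::vector<double> A(np * np, 1.0), u(nelems * np, 2.0), w(nelems * np);
      elem_matvec(nelems, np, A, u, w);
      std::printf("w[0] = %g (expect %g)\n", w[0], 2.0 * np);
      return 0;
  }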
nekbone
COSA
• Fluid dynamics code
– Harmonic balance (frequency domain approach)
– Unsteady Navier-Stokes solver
– Optimised for turbo-machinery-like problems
– Multi-grid, multi-level, multi-block code
– Implemented in Fortran (with Cray pointers)
– Parallelised with MPI
COSA
CASTEP
• CASTEP is a DFT code for calculating the properties of materials from first principles
– can simulate a wide range of material properties including energetics, structure at the atomic level, vibrational properties, electronic response properties, etc.
– Fortran code with MPI and OpenMP parallelisations
– Uses FFT libraries heavily (see the sketch below)
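To illustrate that FFT dependence (my own sketch, with FFTW used purely as a stand-in for whichever FFT library a given build links against, such as MKL or the Fujitsu maths libraries): a 3D complex-to-complex transform of the kind plane-wave DFT codes perform repeatedly.

  // Illustrative sketch: a 3D complex-to-complex FFT using FFTW as an example
  // library; plane-wave DFT codes perform transforms like this many times.
  // Build (assumed): g++ -O2 fft3d.cpp -lfftw3 -o fft3d
  #include <fftw3.h>
  #include <cstddef>
  #include <cstdio>

  int main() {
      const int n = 64;  // 64^3 grid
      const std::size_t total = static_cast<std::size_t>(n) * n * n;
      fftw_complex* grid = static_cast<fftw_complex*>(
          fftw_malloc(sizeof(fftw_complex) * total));
      for (std::size_t i = 0; i < total; ++i) { grid[i][0] = 1.0; grid[i][1] = 0.0; }

      // Plan once, execute many times: the usual FFTW usage pattern.
      fftw_plan plan = fftw_plan_dft_3d(n, n, n, grid, grid,
                                        FFTW_FORWARD, FFTW_ESTIMATE);
      fftw_execute(plan);
      std::printf("DC component after forward FFT: %g\n", grid[0][0]);

      fftw_destroy_plan(plan);
      fftw_free(grid);
      return 0;
  }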
CASTEP
OpenSBLI
• Programming framework to generate finite difference approximations
• Implemented in Python, which generates C code using the OPS library, with MPI and OpenMP parallel functionality as well as GPU parallelisations (see the sketch below)
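A minimal sketch (my own illustration, not actual OPS-generated code) of the kind of finite-difference stencil kernel OpenSBLI ultimately produces, here a second-order central difference on a 1D grid; the real generated code also handles halo exchanges and the MPI decomposition.

  // Illustrative sketch (not OPS-generated code): a second-order central
  // difference du/dx over the interior of a 1D grid.
  #include <cstdio>
  #include <vector>

  int main() {
      const int n = 1024;
      const double dx = 1.0 / (n - 1);
      std::vector<double> u(n), dudx(n, 0.0);
      for (int i = 0; i < n; ++i) {
          const double x = i * dx;
          u[i] = x * (1.0 - x);  // test field with known derivative 1 - 2x
      }

      // Interior points only; boundaries/halos are handled elsewhere in practice.
      #pragma omp parallel for
      for (int i = 1; i < n - 1; ++i)
          dudx[i] = (u[i + 1] - u[i - 1]) / (2.0 * dx);

      std::printf("du/dx near the mid-point: %g\n", dudx[n / 2]);
      return 0;
  }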
OpenSBLI
Summary
• A porting investigation, not an optimisation study
– General porting was very straightforward
– GNU compilers and Fujitsu maths libraries help significantly
– Fujitsu compilers can bring performance benefits
• Performance is generally extremely good
– Not all codes are necessarily faster than on top-end Intel processors
• Memory bandwidth dominated codes benefit significantly
• The small memory capacity per node poses challenges for some applications
– Over-decomposition is necessary to fit simulations into memory
• Scope for targeted optimisation to improve performance
Acknowledgements
• Access to the A64FX was provided through the Fujitsu early
access programme
• The Fulhame HPE Apollo 70 system is supplied to EPCC as part
of the Catalyst UK programme, a collaboration with Hewlett
Packard Enterprise, Arm and SUSE to accelerate the adoption of
Arm based supercomputer applications in the UK.
• This work used the Cirrus UK National Tier-2 HPC Service at
EPCC (http://www.cirrus.ac.uk) funded by the University of
Edinburgh and EPSRC (EP/P020267/1).
• This work used the ARCHER UK National Supercomputing
Service (http://www.archer.ac.uk).
• The EPCC NGIO system was funded by the European Union's
Horizon 2020 Research and Innovation programme under Grant
Agreement no. 671951.