«What I cannot compute, I do not understand.» (adapted from Richard P. Feynman)
Achievements and challenges running GPU-accelerated Quantum ESPRESSO
on heterogeneous clusters
Filippo Spiga1,2 <[email protected]>
1 HPCS, University of Cambridge; 2 Quantum ESPRESSO Foundation
What is Quantum ESPRESSO?
• Quantum ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization
• Quantum ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
• Quantum ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development
What can Quantum ESPRESSO do?
• ground-state calculations
  – Kohn-Sham orbitals and energies, total energies and atomic forces
  – finite as well as infinite systems
  – any crystal structure or supercell
  – insulators and metals (different schemes of BZ integration)
  – structural optimization (many minimization schemes available)
  – transition states and minimum-energy paths (via NEB or string dynamics)
  – electronic polarization via Berry's phase
  – finite electric fields via saw-tooth potential or electric enthalpy
• norm-conserving as well as ultra-soft and PAW pseudo-potentials
• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
• scalar-relativistic as well as fully relativistic (spin-orbit) calculations
• magnetic systems, including non-collinear magnetism
• Wannier interpolations
• ab-initio molecular dynamics
  – Car-Parrinello (many ensembles and flavors)
  – Born-Oppenheimer (many ensembles and flavors)
  – QM-MM (interface with LAMMPS)
• linear response and vibrational dynamics
  – phonon dispersions, real-space interatomic force constants
  – electron-phonon interactions and superconductivity
  – effective charges and dielectric tensors
  – third-order anharmonicities and phonon lifetimes
  – infrared and (off-resonance) Raman cross sections
  – thermal properties via the quasi-harmonic approximation
• electronic excited states
  – TDDFT for very large systems (both real-time and "turbo-Lanczos")
  – MBPT for very large systems (GW, BSE)
... plus several post-processing tools!
Quantum ESPRESSO in numbers
• 350,000+ lines of FORTRAN/C code
• 46 registered developers
• 1600+ registered users
• 5700+ downloads of the latest 5.x.x version
• 2 web-sites (quantum-espresso.org & qe-forge.org)
• 1 official user mailing-list, 1 official developer mailing-list
• 24 international schools and training courses (1000+ participants)
PWscf in a nutshell: program flow
[Program-flow diagram: each stage of the SCF loop is dominated by 3D-FFT, GEMM, and LAPACK kernels]
Spoiler!
• Only PWscf has been ported to GPU
• Serial performance (full socket vs full socket + GPU): 3x ~ 4x
• Parallel performance (best MPI+OpenMP vs the same + GPU): 2x ~ 3x
• Designed to run best at a low number of nodes (parallel efficiency is not high)
• Spin magnetization and non-collinear magnetism not ported yet (work in progress)
• I/O is set low on purpose
• NVIDIA Kepler GPUs not yet exploited at their best (work in progress)
Achievement: smart and selective BLAS
[Diagram: phiGEMM splits C = A × B between CPU and GPU (A1 × B into C1 on the GPU, A2 × B into C2 on the CPU), overlapping H2D and D2H copies; the split is tuned to reduce unbalance]
phiGEMM: CPU+GPU GEMM operations
• a drop-in library won't work as expected: explicit control is needed
• overcomes the limit of the GPU memory
• flexible interface (C on the HOST, C on the DEVICE)
• dynamic workload adjustment (SPLIT), heuristic-based (see the sketch below)
• call-by-call profiling capabilities
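Below is a minimal, illustrative sketch (not the actual phiGEMM source) of the SPLIT idea, assuming column-major double-precision matrices and a caller-supplied split ratio; the real library additionally tiles operands that exceed GPU memory, uses pinned memory and streams, and adjusts the split heuristically.

```c
/* Hypothetical sketch of a phiGEMM-style CPU+GPU split of
 * C = alpha*A*B + beta*C (column-major): the first n_gpu columns of B
 * and C are handled by cuBLAS, the remaining columns by the host BLAS. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cblas.h>

void split_dgemm(int m, int n, int k, double alpha,
                 const double *A, const double *B,
                 double beta, double *C,
                 double split /* GPU share, e.g. 0.7 */)
{
    int n_gpu = (int)(split * n);   /* columns computed on the GPU */
    int n_cpu = n - n_gpu;          /* columns computed on the CPU */

    cublasHandle_t h;
    cublasCreate(&h);

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(double) * (size_t)m * k);
    cudaMalloc((void **)&dB, sizeof(double) * (size_t)k * n_gpu);
    cudaMalloc((void **)&dC, sizeof(double) * (size_t)m * n_gpu);

    /* H2D copies: all of A, plus the GPU share of B and C */
    cudaMemcpy(dA, A, sizeof(double) * (size_t)m * k,     cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * (size_t)k * n_gpu, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyHostToDevice);

    /* GPU part: the call only enqueues the kernel, so the host is free */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    /* CPU part on the remaining columns, concurrent with the GPU GEMM */
    if (n_cpu > 0)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n_cpu, k, alpha, A, m,
                    B + (size_t)k * n_gpu, k,
                    beta, C + (size_t)m * n_gpu, m);

    /* D2H copy on the same stream waits for the GPU GEMM to finish */
    cudaMemcpy(C, dC, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}
```

Because the cuBLAS call returns as soon as the kernel is enqueued, the host GEMM on the remaining columns overlaps with the GPU portion; the final device-to-host copy provides the synchronization.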
Challenge: rectangular GEMM (bad shape, poor performance)
Issues:
• A and B can be larger than the GPU memory
• A and B matrices are "badly" rectangular (one dominant dimension)

Solutions (~ +15% performance):
• tiling approach
  – tiles not too big, not too small
  – GEMM computation must exceed the copies (H2D, D2H), especially for small tiles
• handling the "SPECIAL-K" case (see the sketch below)
  – beta × C is added only once
  – alpha × Ai × Bi is accumulated over the k-tiles

Optimizations included in phiGEMM (version > 1.9)
[Diagram: C (m × n) = A (m × k) × B (k × n) with k much larger than m and n; this shape is the common case due to the data distribution]
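A hedged sketch of the "SPECIAL-K" handling, assuming for brevity that A and B already fit on the device (phiGEMM actually streams each k-slab from host memory): beta × C is applied only on the first slab, and every later slab accumulates with beta = 1.

```c
/* Illustrative sketch of the "SPECIAL-K" tiling (not phiGEMM's code):
 * a huge k is cut into slabs, beta*C is applied only on the first slab,
 * later slabs accumulate with beta = 1. Device-resident A/B assumed. */
#include <cublas_v2.h>

void special_k_dgemm(cublasHandle_t h, int m, int n, int k, int k_tile,
                     double alpha, double beta,
                     const double *dA,   /* m x k, lda = m */
                     const double *dB,   /* k x n, ldb = k */
                     double *dC)         /* m x n, ldc = m */
{
    for (int k0 = 0; k0 < k; k0 += k_tile) {
        int kb = (k - k0 < k_tile) ? (k - k0) : k_tile;
        double b = (k0 == 0) ? beta : 1.0;   /* add beta*C only once */

        /* C += alpha * A(:, k0:k0+kb) * B(k0:k0+kb, :)  (column-major) */
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, kb,
                    &alpha, dA + (size_t)m * k0, m,
                            dB + k0,             k,
                    &b,     dC,                  m);
    }
}
```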
Challenge: parallel 3D-FFT
• 3D-FFT burns up to 40%~45% of total SCF run-time
• 90-ish % 3D-FFT of PWscf are inside vloc_psi ("Wave" grid)
• 3D-FFT is "small" <3003 COMPLEX DP
• 3D-FFT can be not a cube
• In serial a 3DFFT is called as it is, in parallel 3D-FFT = Σ1D-FFT
• In serial data layout is straightforward, in parallel not*
• MPI communication become big issue for many-node problem
• GPU FFT is mainly memory bounded grouping & batching 3D-FFT
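As a rough illustration of the grouping/batching point, the z-stage of the distributed transform can be expressed as a single batched cuFFT plan over all the z-sticks a rank owns (names and layout are assumptions, not the PWscf data structures):

```c
/* Assumed-layout sketch of the grouping/batching idea: the z-stage of
 * the distributed 3D-FFT is one batched cuFFT plan over the n_sticks
 * z-sticks owned by this MPI rank (contiguous sticks of length nz). */
#include <cufft.h>

cufftHandle make_z_stage_plan(int nz, int n_sticks)
{
    cufftHandle plan;
    int n[1] = { nz };

    /* NULL embed => contiguous layout: stride 1, stick-to-stick distance nz */
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, nz,    /* input layout  */
                  NULL, 1, nz,    /* output layout */
                  CUFFT_Z2Z, n_sticks);
    return plan;
}

/* usage: cufftExecZ2Z(plan, d_sticks, d_sticks, CUFFT_FORWARD);
 * the y- and x-stages look the same after each parallel transpose. */
```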
Challenge: FFT data layout: it is all about sticks & planes
[Diagram: the wave-function grid is distributed as z-sticks across PEs: ~Nx·Ny/5 FFTs along z, then a parallel transpose exchanging ~Nx·Ny·Nz/(5·Np) data per PE, Nx·Nz/2 FFTs along y, and finally Ny·Nz FFTs along x]
Data are not contiguous and not "trivially" distributed across processors.
A single 3D-FFT is divided into independent 1D-FFTs.
There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4 Ecut).
Zeros are not transformed; the different cut-offs preserve accuracy.
Challenge: parallel 3D-FFT
11
Optimization #1
• CUDA-enabled MPI for P2P (within the socket)
• overlap FFT computation with MPI communication
• for many nodes, MPI communication >>> FFT computation (see the sketch below)
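A minimal sketch of what Optimization #1 looks like with a CUDA-aware MPI (buffer names are illustrative): device pointers are handed directly to MPI_Alltoallv, so intra-node exchanges can go peer-to-peer without staging through host memory.

```c
/* Sketch of Optimization #1 assuming a CUDA-aware MPI build: device
 * buffers go straight to MPI_Alltoallv, so intra-node traffic can use
 * P2P over the PCIe topology instead of staging through host memory. */
#include <mpi.h>

void transpose_device_buffers(void *d_send, void *d_recv,
                              int *scount, int *sdispl,
                              int *rcount, int *rdispl, MPI_Comm comm)
{
    /* d_send / d_recv are cudaMalloc'ed buffers; no explicit
       cudaMemcpy to the host is needed with a CUDA-aware MPI */
    MPI_Alltoallv(d_send, scount, sdispl, MPI_C_DOUBLE_COMPLEX,
                  d_recv, rcount, rdispl, MPI_C_DOUBLE_COMPLEX, comm);
}
```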
Challenge: parallel 3D-FFT
Optimization #2
Observation: limited overlap of the D2H copy due to MPI communication
• pinned (page-locked) host memory needed (!!!)
• stream the D2H copy to hide the CPU copy and the FFT computation

Optimization #3
Observation: MPI "packets" become small for many nodes
• re-order data before communication
• batch the MPI_Alltoallv communications

Optimization #4
Idea: reduce the data transmitted (risky...)
• perform FFTs and GEMM in DP, truncate data to SP before communication (see the sketch below)
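A sketch of Optimization #4 under the stated idea (buffer names are hypothetical): computation stays in double precision, but the transpose traffic is truncated to single precision, halving the MPI_Alltoallv volume at the cost of some accuracy.

```c
/* Sketch of Optimization #4 (buffer names are hypothetical): FFTs and
 * GEMMs stay in double precision, but the transpose exchange is
 * truncated to single precision to halve the MPI_Alltoallv volume. */
#include <mpi.h>
#include <complex.h>

void truncated_alltoallv(const double complex *send, double complex *recv,
                         float complex *sbuf_sp, float complex *rbuf_sp,
                         int *scount, int *sdispl, int ns_total,
                         int *rcount, int *rdispl, int nr_total,
                         MPI_Comm comm)
{
    for (int i = 0; i < ns_total; i++)        /* DP -> SP before sending  */
        sbuf_sp[i] = (float complex)send[i];

    MPI_Alltoallv(sbuf_sp, scount, sdispl, MPI_C_FLOAT_COMPLEX,
                  rbuf_sp, rcount, rdispl, MPI_C_FLOAT_COMPLEX, comm);

    for (int i = 0; i < nr_total; i++)        /* SP -> DP after receiving */
        recv[i] = (double complex)rbuf_sp[i];
}
```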
Achievements: parallel 3D-FFT, miniDFT 1.6 (k-point calculations, ultra-soft pseudo-potentials)
Optimization #1: +37% improvement in communication
Optimization #2:
Optimization #3: +10% improvement in communication
Optimization #4: +52% (!!!) improvement in communication (SP vs DP)
Lower gain in the full PWscf!!!
Challenge: parallel 3D-FFT
[Timelines: (1) all data of all FFTs copied back to host memory; (2) data reordered before the GPU-GPU communication]
Challenge: H*psi
Compute/update H * psi:
• compute kinetic and non-local terms (in G space)
  complexity: Ni × (N × Ng + Ng × N × Np)
• loop over the (not yet converged) bands:
  – FFT(psi) to R space; complexity: Ni × Nb × FFT(Nr)
  – compute V * psi; complexity: Ni × Nb × Nr
  – FFT(V * psi) back to G space; complexity: Ni × Nb × FFT(Nr)
• compute Vexx; complexity: Ni × Nc × Nq × Nb × (5 × Nr + 2 × FFT(Nr))

N = 2 × Nb (Nb = number of valence bands), Ng = number of G vectors, Ni = number of Davidson iterations, Np = number of PP projectors, Nr = size of the 3D FFT grid, Nq = number of q-points (may differ from Nk)
Challenge: H*psi: the non-converged electronic bands dilemma
Non-predictable number of FFTs across the SCF iterations
Challenge: parallel 3D-FFT: the orthogonal approach
[Diagram. Distributed scheme: PSI, FFT G→R, products, FFT R→G, HPSI, all on the DISTRIBUTED grid. Orthogonal scheme: an "MPI_Allgatherv" collects PSI into multiple LOCAL grids, each GPU runs CUFFT G→R, products, CUFFT R→G, and an "MPI_Allscatterv" redistributes HPSI; overlapping is possible!]
Considerations:
• memory on the GPU: at least a K40 (12 GByte)
• (still) too much communication: GPU Direct capability needed
• whether there are enough 3D-FFTs is not predictable in advance
• it benefits the CPU-only code too! Not ready for production yet
(a minimal sketch of the gather step follows below)
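A very rough sketch of the gather step of the orthogonal approach, assuming a CUDA-aware MPI and ignoring the sparse G-vector mapping (all names are illustrative, this is not the production code):

```c
/* Very rough sketch of the gather step (names illustrative, the sparse
 * G-vector scatter into the dense grid is omitted): every rank collects
 * the whole band, then runs a purely LOCAL 3D cuFFT on it instead of a
 * distributed transform with parallel transposes. */
#include <mpi.h>
#include <cufft.h>

void gather_band(cufftDoubleComplex *d_psi_slice,  /* my G-space slice of one band */
                 cufftDoubleComplex *d_psi_g,      /* full G-space band, replicated */
                 int *counts, int *displs, MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);

    /* the "MPI_Allgatherv" step of the diagram; device pointers are
       legal here only with a CUDA-aware MPI */
    MPI_Allgatherv(d_psi_slice, counts[me], MPI_C_DOUBLE_COMPLEX,
                   d_psi_g, counts, displs, MPI_C_DOUBLE_COMPLEX, comm);

    /* after scattering the G coefficients into a dense nr1*nr2*nr3 grid,
       a purely LOCAL cufftExecZ2Z(plan3d, grid, grid, CUFFT_INVERSE)
       replaces the distributed transform; V(r) is applied, the grid is
       transformed back, and an "MPI_Allscatterv"-like step returns each
       rank its own slice. */
}
```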
Challenge: eigen-solvers: which library?
• LAPACK → MAGMA (ICL, University of Tennessee)
  – hybridization approach (CPU + GPU), dynamic scheduling based on DLA (QUARK)
  – single and multi-GPU, no distributed memory (yet)
  – some (inevitable) numerical "discrepancies"
• ScaLAPACK → ELPA → ELPA + GPU (RZG + NVIDIA)
  – ELPA (Eigenvalue SoLvers for Petaflop Applications) improves on ScaLAPACK
  – ELPA-GPU is a proof-of-concept based on CUDA FORTRAN
  – effective results below expectation
• Lanczos diagonalization with tridiagonal QR algorithm (Penn State)
  – simple (too simple?) and designed to be GPU friendly
  – takes advantage of GPU Direct
  – experimental, needs testing and validation
HPC Machines
WILKES (HPCS) [DELL]: #2 Green500 Nov 2013 (~3632 MFlops/W)
• 128 dual-socket nodes
• dual 6-core Intel Ivy Bridge
• dual NVIDIA K20c per node
• dual Mellanox Connect-IB FDR

TITAN (ORNL) [CRAY]: #2 Top500 Jun 2013 (~17.59 PFlops Rmax)
• 18688 single-socket nodes
• single 16-core AMD Opteron
• one NVIDIA K20x per node
• Gemini interconnect
Achievement: Save Power (serial multi-threaded, single GPU, NVIDIA Fermi generation)
[Chart: speed-ups of 3.67x, 3.1x, and 3.2x, with power/energy reductions of -57%, -58%, and -54%. Tests run early 2012 @ ICHEC]
Achievement: improved time-to-solution
[Chart: measured speed-ups between ~2.1x and ~3.5x. Serial tests run on the SBN machine; parallel tests run on Wilkes]
Challenge: running on CRAY XK7
Key differences...
• AMD Bulldozer architecture, 2 cores share the same FPU pipeline → aprun -j 1
• NUMA locality matters a lot, for both CPU-only and CPU+GPU → aprun -cc numanode
• GPU Direct over RDMA is not supported (yet?) → GPU-aware 3D-FFT not working
• scheduling policy is "unfriendly" → the input has to be really big
Performance below expectation (<2x)
Tricks: many-pw.x, __USE_3D_FFT
Challenge: educate users
• Performance portability myth
• "configure, compile, run" same as the CPU version
• All dependencies (MAGMA, phiGEMM) compiled by QE-GPU
• No more than 2 MPI processes per GPU
  – Hyper-Q does not work automatically, an additional running daemon is needed
• Forget about 1:1 output comparison
• QE-GPU can run on every GPU, but some GPUs are better than others...
Lessons learnt: being "heterogeneous" today and tomorrow
• GPU does not really improve code scalability, only time-to-solution
• Re-think about data distribution for massive parallel architectures
• Deal with un-controlled "numerical fluctuations" (GPU magnifies this)
• The "data movement" constrain will soon disappear new Intel Phi Kings Landing, NVIDIA project Denver expected by 2015
• Looking for true alternatives, new algorithms– not easy, extensive validation _plus_ module dependencies
• Performance is a function of human effort
• Follow the mantra «Do what you are good at.»
Thank you for your attention!
Links:
• http://hpc.cam.ac.uk
• http://www.quantum-espresso.org/
• http://foundation.quantum-espresso.org/
• http://qe-forge.org/gf/project/q-e/
• http://qe-forge.org/gf/project/q-e-gpu/