Efficient multigrid solvers for mixed finite element discretisations in NWP models
Colin Cotter†, David Ham†, Lawrence Mitchell†, Eike Hermann Muller*, Robert Scheichl*
*University of Bath, †Imperial College London
16th ECMWF Workshop on High Performance Computing in Meteorology, Oct 2014
Motivation

[Diagram: the domain specialist works on equations, algorithms and the discretisation, the computational scientist on parallel optimised code; a high-level code layer (firedrake/PyOP2 on top of PETSc) sits between the two.]

Use case: complex iterative solver for pressure correction eqn.
* All credit for developing firedrake/PyOP2 goes to David Ham and his group at Imperial, I'm giving this talk as a "user"

Separation of concerns and high-level abstractions
Motivation
Computational challenge in numerical forecasts:
Solver for elliptic pressure correction equation
(algorithmic + parallel) scalability & performance
[Figure: left panel "Weak scaling, total solution time" shows total solution time [s] (0.1 to 10) against the number of GPUs (1 to 16384) on Titan (nVidia Kepler) and the number of CPU cores (16 to 65536) on HECToR (AMD Opteron), for the CG and multigrid solvers (curves labelled 256, 512, 768); right panel "Absolute performance" shows achieved FLOPs (GigaFLOP to PetaFLOP range) against the number of GPUs, with the single-GPU and Titan theoretical peaks marked.]

0.5 · 10¹² dofs on 16384 GPUs of TITAN, 0.78 PFLOPs, 20%-50% BW
Finite volume discretisation, simpler geometry
Motivation
Challenges
Finite element discretisation (Met Office next generation dyn. core), unstructured grids ⇒ more complicated
Duplicate effort for CPU, GPU (Xeon Phi, ...) implementation & optimisation, reinvent the wheel?
Mix of two issues:
  algorithmic (domain specialist/numerical analyst)
  parallelisation & low level optimisation (computational scientist)

Goal
Separation of concerns (algorithmic ↔ computational) ⇒ rapid algorithm development and testing at high abstraction level
Performance portability
Reuse existing, tested and optimised tools and libraries

⇒ Choice of correct abstraction(s)
This talk
Iterative solver for Helmholtz equation:
φ + ω ∇·(φ* u) = r_φ
−u − ω ∇φ = r_u
Mixed finite element discretisation, icosahedral grid
Matrix-free multigrid preconditioner for pressure correction
Performance portable firedrake/PyOP2 toolchain [Rathgeber et al. (2012 & 2014)]
Python ⇒ automatic generation of optimised C code
Collaboration between
Numerical algorithm developers (Colin, Rob, Eike)
Computational scientists, library developers (David, Lawrence, PETSc, PyOP2, firedrake teams)
Shallow water equations
Model system: shallow water equations
∂φ/∂t + ∇·(φ u) = 0
∂u/∂t + (u·∇) u + g ∇φ = 0

Implicit timestepping ⇒ linear system for velocity u and height perturbation φ

φ + ω ∇·(φ* u) = r_φ
−u − ω ∇φ = r_u

reference state φ* ≡ 1,  ω = c_h Δt = √(g φ*) Δt
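Where this form comes from (a hedged sketch; the slides do not spell out the normalisation, so the backward-Euler step and the velocity rescaling below are my assumptions): linearise about a state of rest with constant height φ* and take one implicit step of size Δt,

\begin{aligned}
\phi^{n+1} + \Delta t\,\nabla\cdot(\phi^*\,\mathbf{u}^{n+1}) &= \phi^n,\\
\mathbf{u}^{n+1} + \Delta t\,g\,\nabla\phi^{n+1} &= \mathbf{u}^n.
\end{aligned}

With φ* ≡ 1, rescaling the velocity as \hat{\mathbf{u}} = \sqrt{\phi^*/g}\,\mathbf{u} and setting \omega = \sqrt{g\phi^*}\,\Delta t turns this into the symmetric system above, up to the overall sign convention chosen for r_u.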
Mixed system
Finite element discretisation: φ ∈ V2, u ∈ V1
⇒ Matrix system for dof-vectors Φ ∈ R^{n_V2}, U ∈ R^{n_V1}

A (Φ; U) = [ M_φ   ωB ; ωBᵀ   −M_u ] (Φ; U) = (R_φ; R_u)

Mixed finite elements
DG0 + RT1 (lowest order)
DG1 + BDFM1 (higher order)

(M_φ)_ij ≡ ∫_Ω β_i(x) β_j(x) dx,   (M_u)_ef ≡ ∫_Ω v_e(x)·v_f(x) dx   (mass matrices)
B_ie ≡ ∫_Ω ∇·v_e(x) β_i(x) dx   (derivative matrix)
Iterative solver
1 Outer iteration (mixed system): GMRES, Richardson, BiCGStab, ...

Preconditioner (with lumped velocity mass matrix M*_u):

P ≡ [ M_φ   ωB ; ωBᵀ   −M*_u ]

P⁻¹ = [ 1   0 ; (M*_u)⁻¹ωBᵀ   1 ] · [ H⁻¹   0 ; 0   −(M*_u)⁻¹ ] · [ 1   ωB(M*_u)⁻¹ ; 0   1 ]

with the elliptic, positive definite Helmholtz operator

H = M_φ + ω²B(M*_u)⁻¹Bᵀ

2 Inner iteration (pressure system Hφ′ = R′_φ): CG with geometric multigrid preconditioner
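Reading the three factors of P⁻¹ from right to left gives a forward elimination, a pressure solve and a back-substitution. A minimal sketch of this application (the helper names B, BT, Mu_lumped_inv and helmholtz_solve are placeholders for the operators above, not the talk's actual code):

def apply_P_inv(R_phi, R_u, omega, B, BT, Mu_lumped_inv, helmholtz_solve):
    # right factor: fold the velocity residual into the pressure residual
    R_tilde = R_phi + omega * B(Mu_lumped_inv(R_u))
    # middle factor: pressure solve H Phi = R_tilde (CG + geometric multigrid)
    Phi = helmholtz_solve(R_tilde)
    # left factor: back-substitute for the velocity increment
    U = Mu_lumped_inv(omega * BT(Phi) - R_u)
    return Phi, U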
Multigrid
Multigrid V-cycle
[Diagram: multigrid V-cycle with smoothing on every level; the cost per level decreases as ×1, ×1/4, ×1/16, ×1/64 towards the coarser grids.]

Computational bottleneck
Smoother and operator application

φ′ ↦ φ′ + ρ_relax D_H⁻¹ (R′_φ − H φ′),   H = M_φ + ω²B(M*_u)⁻¹Bᵀ

Intrinsic length scale
Λ/Δx = c_s Δt/Δx ∼ ν_CFL = O(10) grid spacings (∀ resolutions Δx)
⇒ small number of multigrid levels/coarse smoother sufficient
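A minimal sketch of the resulting shallow V-cycle (illustration only: the level objects with apply_H, D_inv, restrict, prolong and zero methods are assumptions, and D_inv stands for the inverse of the diagonal D_H of H):

def vcycle(level, phi, R, n_smooth=2, rho_relax=1.0):
    # pre-smoothing: phi' -> phi' + rho_relax * D_H^{-1} (R' - H phi')
    for _ in range(n_smooth):
        phi = phi + rho_relax * level.D_inv(R - level.apply_H(phi))
    if level.coarser is not None:
        # restrict the residual, recurse, prolongate the coarse-grid correction
        r_coarse = level.restrict(R - level.apply_H(phi))
        d_coarse = vcycle(level.coarser, level.coarser.zero(), r_coarse,
                          n_smooth, rho_relax)
        phi = phi + level.prolong(d_coarse)
        # post-smoothing
        for _ in range(n_smooth):
            phi = phi + rho_relax * level.D_inv(R - level.apply_H(phi))
    return phi

Because the intrinsic length scale is only O(10) grid spacings, the recursion terminates after a few levels.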
Grid iteration in FEM codes
Computational bottlenecks in finite element codes
[Diagram: a patch of the grid with global numbering of cells and dofs, showing the velocity dofs u_{m_e(1)}, u_{m_e(2)}, u_{m_e(3)} and the pressure dof φ_e attached to one cell e.]

Example: weak divergence φ = ∇·u → Φ = BU, assembled cell by cell as

φ_e ↦ φ_e + b^(e)_1 u_{m_e(1)} + b^(e)_2 u_{m_e(2)} + b^(e)_3 u_{m_e(3)}

Loop over topological entities (e.g. cells); a PyOP2-style sketch of this pattern follows below. In each cell:
Access local dofs (e.g. pressure and velocity)
Execute local kernel
Write data back
Manage halo exchanges, thread parallelism
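The same pattern written as a PyOP2 parallel loop might look roughly like this (a sketch: the names cellset, cell2dof_map_pressure, cell2dof_map_velocity, cell2coeff_map and the kernel body are my assumptions, not the talk's code):

div_kernel = op2.Kernel('''
void weak_divergence(double **phi, double **u, double **b) {
  /* accumulate b^(e)_k * u_{m_e(k)} into the pressure dof of this cell */
  for (int k=0; k<3; ++k) phi[0][0] += b[0][k] * u[k][0];
}''', "weak_divergence")

op2.par_loop(div_kernel, cellset,
             phi.dat(op2.INC, cell2dof_map_pressure),
             u.dat(op2.READ, cell2dof_map_velocity),
             b.dat(op2.READ, cell2coeff_map))

PyOP2 schedules this loop over the mesh and takes care of the halo exchanges and thread parallelism.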
Firedrake/PyOP2 framework
[Diagram: the Firedrake/PyOP2 toolchain. The problem is defined in the Unified Form Language (UFL) as a FEM weak form; a modified FFC, using FIAT, generates local assembly kernels as an AST, which the COFFEE AST optimiser transforms; PyOP2 (a parallel unstructured-mesh computation framework with Set, Map and Dat data structures) schedules these kernels as parallel loops over the mesh and generates explicitly parallel, hardware-specific implementations for CPUs (OpenMP/OpenCL), GPUs (PyCUDA/PyOpenCL) and future architectures; meshes, matrices, vectors, (non)linear solves and MPI are handled through PETSc4py (KSP, SNES, DMPlex). Algorithm developers and numerical analysts (Eike, Rob, Colin) work at the Firedrake/UFL level; library developers and computational scientists (Lawrence, David, the PyOP2, firedrake and PETSc developers) work below it.]
[Florian Rathgeber, PhD thesis, Imperial College London (2014)]
UFL
Unified Form Language [Alnæs et al. (2014)]
Example: bilinear form
a(u, v) = ∫_Ω u(x) v(x) dx
Python code
from firedrake import *

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)
u = TrialFunction(V)
v = TestFunction(V)
a_uv = assemble(u*v*dx)
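The same language extends to the mixed form of this talk; a hedged sketch (the space names and the use of lowest-order RT/DG spaces on this flat test mesh are illustrative, not the talk's icosahedral setup):

V1 = FunctionSpace(mesh, 'RT', 1)    # velocity space
V2 = FunctionSpace(mesh, 'DG', 0)    # pressure space
W = V1 * V2
u, phi = TrialFunctions(W)
w, psi = TestFunctions(W)
omega = Constant(0.1)
# mixed operator (M_phi, omega*B; omega*B^T, -M_u) as a single bilinear form
a_mixed = (psi*phi + omega*psi*div(u)
           + omega*div(w)*phi - inner(w, u))*dx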
Generated code
C code generated by Firedrake/PyOP2
[Listing/figure: the generated local assembly kernel (below) and its execution over the grid (right); OpenMP backend]
Putting it all together
≈ 1,600 lines of highly modular Python code (incl. comments)

1 Use UFL wherever possible
∫ φ (∇·u) dx ↦ assemble(phi*div(u)*dx) (weak divergence operator)

2 Use PETSc for iterative solvers
provide operators/preconditioners as callbacks
3 Escape hatch: Write direct PyOP2 parallel loops
lumped mass matrix division: U ↦ (M*_u)⁻¹ U

kernel_code = '''
void lumped_mass_divide(double **m, double **U) {
  for (int i=0; i<4; ++i) U[i][0] /= m[0][i];
}'''
kernel = op2.Kernel(kernel_code, "lumped_mass_divide")
op2.par_loop(kernel, facetset,
             m.dat(op2.READ, self.facet2dof_map_facets),
             u.dat(op2.RW, self.facet2dof_map_BDFM1))
Helmholtz operator
Helmholtz operator Hφ = M_φ φ + ω²B(M*_u)⁻¹Bᵀφ
Operator application

class Operator(object):
    [...]
    def apply(self, phi):
        psi = TestFunction(V_pressure)
        w = TestFunction(V_velocity)
        B_phi = assemble(div(w)*phi*dx)
        self.velocity_mass_matrix.divide(B_phi)
        BT_B_phi = assemble(psi*div(B_phi)*dx)
        M_phi = assemble(psi*phi*dx)
        return assemble(M_phi + self.omega**2*BT_B_phi)
Interface with PETSc

petsc_op = PETSc.Mat().create()
petsc_op.setPythonContext(Operator(omega))
petsc_op.setUp()
ksp = PETSc.KSP()
ksp.create()
ksp.setOperators(petsc_op)
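The multigrid preconditioner can be attached in the same callback style; a sketch (assuming a hypothetical MultigridPC class implementing the petsc4py preconditioner interface, which is not shown in the talk):

pc = ksp.getPC()
pc.setType(PETSc.PC.Type.PYTHON)
pc.setPythonContext(MultigridPC(omega))
ksp.setType(PETSc.KSP.Type.CG)   # CG for the SPD Helmholtz system
ksp.setFromOptions()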
Algorithmic performance
Convergence history
Inner solve: 3 CG iterations with multigrid preconditioner

Icosahedral grid, 327,680 cells
4 multigrid levels

[Figure: relative residual ||r||₂/||r₀||₂ (10⁻⁶ to 10⁰) against outer iteration (0 to 20) for DG0+RT0 and DG1+BDFM1, each with Richardson and GMRES outer iterations.]
Efficiency
Single node run on ARCHER, lowest order, 5.2 · 10⁶ cells [preliminary]
[Figure: time [s] (0 to 7) for the total solve, the pressure solve and the Helmholtz operator application, annotated with measured memory bandwidths of 17.1, 10.6, 4.7 and 1.3 GB/s.]

STREAM triad: 74.1 GB/s per node ⇒ up to ≈ 23% of peak BW
Parallel scalability
Weak scaling on ARCHER [preliminary]
[Figure: weak scaling on ARCHER, solution time [s] (10⁰ to 10²) against number of cores (6, 24, 96, 384, 1536), with the number of cells growing from 1.3 to 5.2, 21.0, 83.9 and 335.5 million; curves for hypre BoomerAMG and the matrix-free multigrid, for DG0+RT0 and DG1+BDFM1.]
Conclusion
Summary
Matrix-free multigrid preconditioner for pressure correction equation
Implementation in firedrake/PyOP2/PETSc framework:
Performance portability
Correct abstractions ⇒ separation of concerns
Outlook
Test on GPU backend
Extend to 3d: regular vertical grid ⇒ improved caching ⇒ improved memory BW
Improve and extend parallel scaling
Full 3d dynamical core (Colin Cotter)
References
The Firedrake project: http://firedrakeproject.org/

F. Rathgeber et al.: Firedrake: automating the finite element method by composing abstractions. In preparation.

F. Rathgeber et al.: PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. High Performance Computing, Networking Storage and Analysis, SC Companion, pp. 1116-1123, Los Alamitos, CA, USA, 2012. IEEE Computer Society.

E. Muller et al.: Petascale elliptic solvers for anisotropic PDEs on GPU clusters. Submitted to Parallel Computing (2014) [arXiv:1402.3545]

E. Muller, R. Scheichl: Massively parallel solvers for elliptic PDEs in Numerical Weather- and Climate Prediction. QJRMS (2013) [arXiv:1307.2036]

Helmholtz solver code on GitHub: https://github.com/firedrakeproject/firedrake-helmholtzsolver