Efficient multigrid solvers for mixed finite element discretisations in NWP models
Colin Cotter†, David Ham†, Lawrence Mitchell†, Eike Hermann Muller*, Robert Scheichl*
*University of Bath, †Imperial College London
16th ECMWF Workshop on High Performance Computing in Meteorology, Oct 2014
Motivation

[Diagram: the domain specialist works on equations, algorithms and the discretisation, the computational scientist on parallel optimised code; a high-level code layer (firedrake/PyOP2 on top of PETSc) sits between the two.]

Use case: complex iterative solver for pressure correction eqn.
* All credit for developing firedrake/PyOP2 goes to David Ham and his group at Imperial, I'm giving this talk as a "user"

Separation of concerns and high-level abstractions
Motivation
Computational challenge in numerical forecasts:
Solver for elliptic pressure correction equation
(algorithmic + parallel) scalability & performance
[Figure: left panel "Weak scaling, total solution time" shows total solution time [s] (0.1 to 10) against the number of GPUs (1 to 16384) on Titan (nVidia Kepler) and the number of CPU cores (16 to 65536) on HECToR (AMD Opteron), for the CG and multigrid solvers (curves labelled 256, 512, 768); right panel "Absolute performance" shows achieved FLOPs (GigaFLOP to PetaFLOP range) against the number of GPUs, with the single-GPU and Titan theoretical peaks marked.]

0.5 · 10¹² dofs on 16384 GPUs of TITAN, 0.78 PFLOPs, 20%-50% BW
Finite volume discretisation, simpler geometry
Motivation
Challenges
Finite element discretisation (Met Office next generation dyn. core), unstructured grids ⇒ more complicated
Duplicate effort for CPU, GPU (Xeon Phi, ...) implementation & optimisation, reinvent the wheel?
Mix of two issues:
  algorithmic (domain specialist/numerical analyst)
  parallelisation & low level optimisation (computational scientist)

Goal
Separation of concerns (algorithmic ↔ computational) ⇒ rapid algorithm development and testing at high abstraction level
Performance portability
Reuse existing, tested and optimised tools and libraries

⇒ Choice of correct abstraction(s)
This talk
Iterative solver for Helmholtz equation:
φ + ω ∇·(φ* u) = r_φ
−u − ω ∇φ = r_u
Mixed finite element discretisation, icosahedral grid
Matrix-free multigrid preconditioner for pressure correction
Performance portable firedrake/PyOP2 toolchain [Rathgeber et al. (2012 & 2014)]
Python ⇒ automatic generation of optimised C code
Collaboration between
Numerical algorithm developers (Colin, Rob, Eike)
Computational scientists, library developers (David, Lawrence, PETSc, PyOP2, firedrake teams)
Shallow water equations
Model system: shallow water equations
∂φ/∂t + ∇·(φ u) = 0
∂u/∂t + (u·∇) u + g ∇φ = 0

Implicit timestepping ⇒ linear system for velocity u and height perturbation φ

φ + ω ∇·(φ* u) = r_φ
−u − ω ∇φ = r_u

reference state φ* ≡ 1,  ω = c_h Δt = √(g φ*) Δt
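Where this form comes from (a hedged sketch; the slides do not spell out the normalisation, so the backward-Euler step and the velocity rescaling below are my assumptions): linearise about a state of rest with constant height φ* and take one implicit step of size Δt,

\begin{aligned}
\phi^{n+1} + \Delta t\,\nabla\cdot(\phi^*\,\mathbf{u}^{n+1}) &= \phi^n,\\
\mathbf{u}^{n+1} + \Delta t\,g\,\nabla\phi^{n+1} &= \mathbf{u}^n.
\end{aligned}

With φ* ≡ 1, rescaling the velocity as \hat{\mathbf{u}} = \sqrt{\phi^*/g}\,\mathbf{u} and setting \omega = \sqrt{g\phi^*}\,\Delta t turns this into the symmetric system above, up to the overall sign convention chosen for r_u.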
Mixed system
Finite element discretisation: φ ∈ V2, u ∈ V1
⇒ Matrix system for dof-vectors Φ ∈ R^{n_V2}, U ∈ R^{n_V1}

A (Φ; U) = [ M_φ   ωB ; ωBᵀ   −M_u ] (Φ; U) = (R_φ; R_u)

Mixed finite elements
DG0 + RT1 (lowest order)
DG1 + BDFM1 (higher order)

(M_φ)_ij ≡ ∫_Ω β_i(x) β_j(x) dx,   (M_u)_ef ≡ ∫_Ω v_e(x)·v_f(x) dx   (mass matrices)
B_ie ≡ ∫_Ω ∇·v_e(x) β_i(x) dx   (derivative matrix)
Iterative solver
1 Outer iteration (mixed system): GMRES, Richardson, BiCGStab, ...

Preconditioner (with lumped velocity mass matrix M*_u):

P ≡ [ M_φ   ωB ; ωBᵀ   −M*_u ]

P⁻¹ = [ 1   0 ; (M*_u)⁻¹ωBᵀ   1 ] · [ H⁻¹   0 ; 0   −(M*_u)⁻¹ ] · [ 1   ωB(M*_u)⁻¹ ; 0   1 ]

with the elliptic, positive definite Helmholtz operator

H = M_φ + ω²B(M*_u)⁻¹Bᵀ

2 Inner iteration (pressure system Hφ′ = R′_φ): CG with geometric multigrid preconditioner
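Reading the three factors of P⁻¹ from right to left gives a forward elimination, a pressure solve and a back-substitution. A minimal sketch of this application (the helper names B, BT, Mu_lumped_inv and helmholtz_solve are placeholders for the operators above, not the talk's actual code):

def apply_P_inv(R_phi, R_u, omega, B, BT, Mu_lumped_inv, helmholtz_solve):
    # right factor: fold the velocity residual into the pressure residual
    R_tilde = R_phi + omega * B(Mu_lumped_inv(R_u))
    # middle factor: pressure solve H Phi = R_tilde (CG + geometric multigrid)
    Phi = helmholtz_solve(R_tilde)
    # left factor: back-substitute for the velocity increment
    U = Mu_lumped_inv(omega * BT(Phi) - R_u)
    return Phi, U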
Multigrid
Multigrid V-cycle
[Diagram: multigrid V-cycle with smoothing on every level; the cost per level decreases as ×1, ×1/4, ×1/16, ×1/64 towards the coarser grids.]

Computational bottleneck
Smoother and operator application

φ′ ↦ φ′ + ρ_relax D_H⁻¹ (R′_φ − H φ′),   H = M_φ + ω²B(M*_u)⁻¹Bᵀ

Intrinsic length scale
Λ/Δx = c_s Δt/Δx ∼ ν_CFL = O(10) grid spacings (∀ resolutions Δx)
⇒ small number of multigrid levels/coarse smoother sufficient
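A minimal sketch of the resulting shallow V-cycle (illustration only: the level objects with apply_H, D_inv, restrict, prolong and zero methods are assumptions, and D_inv stands for the inverse of the diagonal D_H of H):

def vcycle(level, phi, R, n_smooth=2, rho_relax=1.0):
    # pre-smoothing: phi' -> phi' + rho_relax * D_H^{-1} (R' - H phi')
    for _ in range(n_smooth):
        phi = phi + rho_relax * level.D_inv(R - level.apply_H(phi))
    if level.coarser is not None:
        # restrict the residual, recurse, prolongate the coarse-grid correction
        r_coarse = level.restrict(R - level.apply_H(phi))
        d_coarse = vcycle(level.coarser, level.coarser.zero(), r_coarse,
                          n_smooth, rho_relax)
        phi = phi + level.prolong(d_coarse)
        # post-smoothing
        for _ in range(n_smooth):
            phi = phi + rho_relax * level.D_inv(R - level.apply_H(phi))
    return phi

Because the intrinsic length scale is only O(10) grid spacings, the recursion terminates after a few levels.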
Grid iteration in FEM codes
Computational bottlenecks in finite element codes
[Diagram: a patch of the grid with global numbering of cells and dofs, showing the velocity dofs u_{m_e(1)}, u_{m_e(2)}, u_{m_e(3)} and the pressure dof φ_e attached to one cell e.]

Example: weak divergence φ = ∇·u → Φ = BU, assembled cell by cell as

φ_e ↦ φ_e + b^(e)_1 u_{m_e(1)} + b^(e)_2 u_{m_e(2)} + b^(e)_3 u_{m_e(3)}

Loop over topological entities (e.g. cells); a PyOP2-style sketch of this pattern follows below. In each cell:
Access local dofs (e.g. pressure and velocity)
Execute local kernel
Write data back
Manage halo exchanges, thread parallelism
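The same pattern written as a PyOP2 parallel loop might look roughly like this (a sketch: the names cellset, cell2dof_map_pressure, cell2dof_map_velocity, cell2coeff_map and the kernel body are my assumptions, not the talk's code):

div_kernel = op2.Kernel('''
void weak_divergence(double **phi, double **u, double **b) {
  /* accumulate b^(e)_k * u_{m_e(k)} into the pressure dof of this cell */
  for (int k=0; k<3; ++k) phi[0][0] += b[0][k] * u[k][0];
}''', "weak_divergence")

op2.par_loop(div_kernel, cellset,
             phi.dat(op2.INC, cell2dof_map_pressure),
             u.dat(op2.READ, cell2dof_map_velocity),
             b.dat(op2.READ, cell2coeff_map))

PyOP2 schedules this loop over the mesh and takes care of the halo exchanges and thread parallelism.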
Firedrake/PyOP2 framework
[Diagram: the Firedrake/PyOP2 toolchain. The problem is defined in the Unified Form Language (UFL) as a FEM weak form; a modified FFC, using FIAT, generates local assembly kernels as an AST, which the COFFEE AST optimiser transforms; PyOP2 (a parallel unstructured-mesh computation framework with Set, Map and Dat data structures) schedules these kernels as parallel loops over the mesh and generates explicitly parallel, hardware-specific implementations for CPUs (OpenMP/OpenCL), GPUs (PyCUDA/PyOpenCL) and future architectures; meshes, matrices, vectors, (non)linear solves and MPI are handled through PETSc4py (KSP, SNES, DMPlex). Algorithm developers and numerical analysts (Eike, Rob, Colin) work at the Firedrake/UFL level; library developers and computational scientists (Lawrence, David, the PyOP2, firedrake and PETSc developers) work below it.]
[Florian Rathgeber, PhD thesis, Imperial College London (2014)]
UFL
Unified Form Language [Alnæs et al. (2014)]
Example: bilinear form
a(u, v) = ∫_Ω u(x) v(x) dx
Python code
from firedrake import *

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)
u = TrialFunction(V)
v = TestFunction(V)
a_uv = assemble(u*v*dx)
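The same language extends to the mixed form of this talk; a hedged sketch (the space names and the use of lowest-order RT/DG spaces on this flat test mesh are illustrative, not the talk's icosahedral setup):

V1 = FunctionSpace(mesh, 'RT', 1)    # velocity space
V2 = FunctionSpace(mesh, 'DG', 0)    # pressure space
W = V1 * V2
u, phi = TrialFunctions(W)
w, psi = TestFunctions(W)
omega = Constant(0.1)
# mixed operator (M_phi, omega*B; omega*B^T, -M_u) as a single bilinear form
a_mixed = (psi*phi + omega*psi*div(u)
           + omega*div(w)*phi - inner(w, u))*dx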
Generated code
C code generated by Firedrake/PyOP2
[Listing/figure: the generated local assembly kernel (below) and its execution over the grid (right); OpenMP backend]
Putting it all together
≈ 1,600 lines of highly modular Python code (incl. comments)

1 Use UFL wherever possible
∫ φ (∇·u) dx ↦ assemble(phi*div(u)*dx) (weak divergence operator)

2 Use PETSc for iterative solvers
provide operators/preconditioners as callbacks
3 Escape hatch: Write direct PyOP2 parallel loops
lumped mass matrix division: U ↦ (M*_u)⁻¹ U

kernel_code = '''
void lumped_mass_divide(double **m, double **U) {
  for (int i=0; i<4; ++i) U[i][0] /= m[0][i];
}'''
kernel = op2.Kernel(kernel_code, "lumped_mass_divide")
op2.par_loop(kernel, facetset,
             m.dat(op2.READ, self.facet2dof_map_facets),
             u.dat(op2.RW, self.facet2dof_map_BDFM1))
Helmholtz operator
Helmholtz operator Hφ = M_φ φ + ω²B(M*_u)⁻¹Bᵀφ
Operator application

class Operator(object):
    [...]
    def apply(self, phi):
        psi = TestFunction(V_pressure)
        w = TestFunction(V_velocity)
        B_phi = assemble(div(w)*phi*dx)
        self.velocity_mass_matrix.divide(B_phi)
        BT_B_phi = assemble(psi*div(B_phi)*dx)
        M_phi = assemble(psi*phi*dx)
        return assemble(M_phi + self.omega**2*BT_B_phi)
Interface with PETSc

petsc_op = PETSc.Mat().create()
petsc_op.setPythonContext(Operator(omega))
petsc_op.setUp()
ksp = PETSc.KSP()
ksp.create()
ksp.setOperators(petsc_op)
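The multigrid preconditioner can be attached in the same callback style; a sketch (assuming a hypothetical MultigridPC class implementing the petsc4py preconditioner interface, which is not shown in the talk):

pc = ksp.getPC()
pc.setType(PETSc.PC.Type.PYTHON)
pc.setPythonContext(MultigridPC(omega))
ksp.setType(PETSc.KSP.Type.CG)   # CG for the SPD Helmholtz system
ksp.setFromOptions()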
Algorithmic performance
Convergence history
Inner solve: 3 CG iterations with multigrid preconditioner

Icosahedral grid, 327,680 cells
4 multigrid levels

[Figure: relative residual ||r||₂/||r₀||₂ (10⁻⁶ to 10⁰) against outer iteration (0 to 20) for DG0+RT0 and DG1+BDFM1, each with Richardson and GMRES outer iterations.]
Efficiency
Single node run on ARCHER, lowest order, 5.2 · 10⁶ cells [preliminary]
[Figure: time [s] (0 to 7) for the total solve, the pressure solve and the Helmholtz operator application, annotated with measured memory bandwidths of 17.1, 10.6, 4.7 and 1.3 GB/s.]

STREAM triad: 74.1 GB/s per node ⇒ up to ≈ 23% of peak BW
Parallel scalability
Weak scaling on ARCHER [preliminary]
[Figure: weak scaling on ARCHER, solution time [s] (10⁰ to 10²) against number of cores (6, 24, 96, 384, 1536), with the number of cells growing from 1.3 to 5.2, 21.0, 83.9 and 335.5 million; curves for hypre BoomerAMG and the matrix-free multigrid, for DG0+RT0 and DG1+BDFM1.]
Conclusion
Summary
Matrix-free multigrid preconditioner for pressure correction equation
Implementation in firedrake/PyOP2/PETSc framework:
Performance portability
Correct abstractions ⇒ separation of concerns
Outlook
Test on GPU backend
Extend to 3d: regular vertical grid ⇒ improved caching ⇒ improved memory BW
Improve and extend parallel scaling
Full 3d dynamical core (Colin Cotter)
References
The Firedrake project: http://firedrakeproject.org/

F. Rathgeber et al.: Firedrake: automating the finite element method by composing abstractions. In preparation.

F. Rathgeber et al.: PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. High Performance Computing, Networking Storage and Analysis, SC Companion, pp. 1116-1123, Los Alamitos, CA, USA, 2012. IEEE Computer Society.

E. Muller et al.: Petascale elliptic solvers for anisotropic PDEs on GPU clusters. Submitted to Parallel Computing (2014) [arXiv:1402.3545]

E. Muller, R. Scheichl: Massively parallel solvers for elliptic PDEs in Numerical Weather- and Climate Prediction. QJRMS (2013) [arXiv:1307.2036]

Helmholtz solver code on GitHub: https://github.com/firedrakeproject/firedrake-helmholtzsolver