massively parallel lattice boltzmann-based simulations with grid … · 8192 32,768 262,144 cores...

Post on 10-Oct-2020

0 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Massively Parallel Lattice Boltz-mann-based Simulations with Grid Refinement SIAM PP 2014, Portland

February 19, 2014

Florian Schornbaum, Ehsan Fattahi, David Staubach, Christian Godenschwager, Martin Bauer, Ulrich Rüde

Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

2

Outline

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Introduction • The waLBerla Simulation Framework

• The Lattice Boltzmann Method

• Uniform Grids • Domain Decomposition & Parallelization

• Performance / Benchmarks

• Statically Refined Grids • Lattice Boltzmann & Grid Refinement

• Domain Decomposition & Load Balancing

• Performance / Benchmarks

• Outlook on Dynamic Grid Refinement / Conclusion

Introduction

• The waLBerla Simulation Framework

• The Lattice Boltzmann Method

4

Introduction

• waLBerla (widely applicable Lattice Boltzmann frame-work from Erlangen): • main focus on CFD (computational fluid dynamics) simulations

based on the lattice Boltzmann method (LBM)

• at its very core designed as an HPC software framework: • scales from laptops to current petascale supercomputers

• largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)

• hybrid parallelization: MPI + OpenMP

• vectorization of compute kernels

• written in C++(11)

• support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)

• coupling with in-house rigid body physics engine pe

• open source → http://www.walberla.net

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

5

Introduction

• The lattice Boltzmann method: • regular grid with multiple particle distribution functions (=

scalar values) per cell [D2Q9, D3Q19, D3Q27, …]

• explicit method → time stepping

• two steps: stream (neighbors) & collide (cell-local)

• For the collision, different operators exist: SRT, TRT, MRT.

• Macroscopic quantities (velocity, density, …) can be calculated from the particle distribution functions.

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

(stream) (collide)

Uniform Grids

• Domain Decomposition & Parallelization

• Performance / Benchmarks

7

Uniform Grids

• Domain Decomposition: • regular decomposition into blocks containing uniform grids

• Parallelization: • data exchange on borders between blocks via ghost layers

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

[special case of our much more general forest of

octrees data structure → non-uniform/refined grids]

receiver process

sender process

8

Uniform Grids

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

geometry given by surface mesh domain decomposition into blocks

empty blocks are discarded load balancing

Load balancing can be based on either space-filling curves (Z-order/Morton order, Hilbert curve) using the under-lying forest of octrees or graph partitioning (METIS, …).

Whatever fits best the needs of the simulation.

flow simulation only in here (example: complex geometry

of an artery)

9

Uniform Grids

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

geometry given by surface mesh domain decomposition into blocks

empty blocks are discarded load balancing

Load balancing can be based on either space-filling curves (Z-order/Morton order, Hilbert curve) using the under-lying forest of octrees or graph partitioning (METIS, …).

Whatever fits best the needs of the simulation.

10

Uniform Grids

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

geometry given by surface mesh domain decomposition into blocks

empty blocks are discarded load balancing

allocation of block data (→ grids)

The domain decomposition and load balancing can be performed during

the actual simulation … OR …

11

Uniform Grids

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

geometry given by surface mesh domain decomposition into blocks

empty blocks are discarded load balancing

allocation of block data (→ grids)

DISK

separation of domain

partitioning from simulation

file size: kilobytes to few megabytes

DISK

12

Uniform Grids

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

allocation of block data (→ grids)

DISK

separation of domain

partitioning from simulation

file size: kilobytes to few megabytes

DISK

empty blocks are discarded load balancing

geometry given by surface mesh domain decomposition into blocks

All of this (the entire pipeline) works just the same when grid

refinement is used.

13

• Benchmark Environments: • JUQUEEN (TOP500: 8)

• Blue Gene/Q, 459K cores, 1 GB/core

• compiler: IBM XL / IBM MPI

• SuperMUC (TOP500: 10) • Intel Xeon, 147K cores, 2 GB/core

• compiler: Intel XE / IBM MPI

• Benchmarks (LBM D3Q19):

Uniform Grids - Performance

lid-driven cavity weak scaling

(= const. number of cells per core)

coronary artery tree strong scaling (= const. total

number of cells)

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

⇒ C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde, A Framework for Hybrid ⇒ Parallel Flow Simulations with a Trillion Cells in Complex Geometries, SC13, Denver

14

• SuperMUC – single node (3.34 million cells per core)

⇒ LBM: low FLOPs to bytes ratio → memory intensive

0

20

40

60

80

100

120

140

160

180

MLU

P/s

SRT .

TRT

1 2 4 8 12 16 cores

LBM compute kernel type

more complex, but also much more relevant

for physical simulations!

MLUP/s: million/mega

lattice cell updates per

second

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

15

• SuperMUC – single node (3.34 million cells per core)

⇒ limited by memory bandwidth

0

20

40

60

80

100

120

140

160

180

MLU

P/s

SRT

TRT

SRT .

SRT .

1 2 4 8 12 16 cores naïve, straightforward

implementation

already quite optimized!

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

bandwidth limit

vectorized compute kernel

16

• JUQUEEN – single node (1.73 million cells per core)

⇒ limited by memory bandwidth

0

10

20

30

40

50

60

70

80

MLU

P/s

SRT

TRT

SRT .

SRT .

1 2 4 8 16 cores naïve, straightforward

implementation

already quite optimized!

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

vectorized compute kernel

bandwidth limit

17

• SuperMUC – TRT kernel (3.34 million cells per core)

0

1

2

3

4

5

6

7

8

9

10

MLU

P/s

pe

r c

ore

16P 1T

4P 4T

2P 8T

32 8192 147,456 cores

#processes per node

#threads per process

0.99 x 1012 cells updated per second! (19 values per cell)

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

18

• JUQUEEN – TRT kernel (1.73 million cells per core)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

MLU

P/s

pe

r c

ore

64P 1T

16P 4T

8P 8T

32 512 8192 458,752 cores

#processes per node

#threads per process

2.1 x 1012 cells updated per second! (19 values per cell)

⇒ 40 x 1012 values updated per second!

⇒ 0.41 PFlop/s (0.41 x 1015 Flop/s)

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

19

• SuperMUC – TRT kernel (2.1 million fluid cells)

0

1,000

2,000

3,000

4,000

5,000

6,000

7,000

tim

e s

tep

s /

sec

32 128 512 2048 8192 32,768 cores

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

“extreme” strong scaling

more processes = shorter time to solution!

20

• JUQUEEN – TRT kernel (16.9 million fluid cells)

0

100

200

300

400

500

600

700

800

900

1000

tim

e s

tep

s /

sec

512 2048 8192 32,768 262,144 cores

Uniform Grids - Performance

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

“extreme” strong scaling

more processes = shorter time to solution!

Statically Refined Grids

• Lattice Boltzmann & Grid Refinement

• Domain Decomposition & Load Balancing

• Performance / Benchmarks

Statically Refined Grids

• Lattice Boltzmann & Grid Refinement: • Almost all grid refinement schemes for the lattice Boltzmann

method rely on a 2:1 balance between neighboring cells:

• waLBerla now uses a massively parallel implementation of the refinement scheme presented in [1] (also relies on 2:1 balance)

• consequences of the 2:1 balance: • twice as many time steps on the fine grid as on the next coarser grid

• In 3D, for each finer grid level, the memory requirement increases by a factor of 8 and the generated workload by a factor of 16.

22 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

[1] Zhao Yu and Liang-Shih Fan, 2009, An interaction potential based lattice Boltzmann method with adaptive [1] mesh refinement (AMR) for two-phase flow simulation, J. Comput. Phys. 228, 17 (September 2009), 6456-6478

Statically Refined Grids

23 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Domain Decomposition: • “blocks containing regular grids

distributed among all processes”

⇒ distributed forest of octrees (→ 2:1 balance)

• Each process only knows about its own blocks and their neighbors, but has no information about the rest of the domain.

⇒ perfectly distributed data structure that scales to huge numbers of processes without any runtime overhead

• Property of this Distributed Data Structure: • can be viewed as a graph: each block is connected to all of its

neighboring blocks → enables all kinds of graph algorithms

Statically Refined Grids

24 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Comparison with Uniform LBM: • The setup phase remains unchanged:

The decoupling (via a file) of the domain decomposition & initial load balancing from the actual simulation is still possible.

• All the refinement “magic” happens during communication

(fine and coarse blocks share ghost layer regions → refinement requires multiple ghost layers per block†).

• Time stepping becomes more complicated (simple, uniform LBM:

boundary handling → stream & collide → communication).

• All the compute kernels remain unchanged!

† Even though for the computation/interpolation multiple ghost layers are required, the com- † munication is heavily optimized and the average number of bytes communicated per block † only slightly increases compared to a uniform simulation.

Statically Refined Grids

25 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1

3

3

2

3

3

2

2

2

2

2

2

1

1

0

1

1 1

2

2

2

3

3

3

3

Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

26 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1

3

3

2

3

3

2

2

2

2

2

2

1

1

0

1

1 1

2

2

2

3

3

3

3

Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

27 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

- 0 -

- 1 1 - 1 1 - 1

2 - 2 2 2 2 2 2 2 2 - 2

3 3 3 3 3 3 3 3

2

1 1

numbers represent corresponding grid levels

Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

28 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

- 0 -

- 1 1 - 1 1 - 1

2 - 2 2 2 2 2 2 2 2 - 2

3 3 3 3 3 3 3 3

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

29 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

- 0 -

- 1 1 - 1 1 - 1

2 - 2 2 2 2 2 2 2 2 - 2

3 3 3 3 3 3 3 3

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

30 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

- 0 -

- 1 1 - 1 1 - 1

2 - 2 2 2 2 2 2 2 2 - 2

3 3 3 3 3 3 3 3

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

31 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

- 0 -

- 1 1 - 1 1 - 1

2 - 2 2 2 2 2 2 2 2 - 2

3 3 3 3 3 3 3 3

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

32 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

• “even” distribution to all available processes

Example (cont.): distribution to six

available processes - level by level

process: 0 1 2 3 4 5

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

33 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

• “even” distribution to all available processes

Example (cont.): distribution to six

available processes - level by level

process: 0 1 2 3 4 5

3

3

3

3

3 3 3 3

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

34 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

• “even” distribution to all available processes

Example (cont.): distribution to six

available processes - level by level

process: 0 1 2 3 4 5

3

3

2

3

3

2

3

2

2

3

2

2

3

2

2

3

2

2

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

35 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

• “even” distribution to all available processes

Example (cont.): distribution to six

available processes - level by level

process: 0 1 2 3 4 5

3

3

2

1

3

3

2

1

3

2

2

1

3

2

2

1

3

2

2

1

3

2

2

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

Statically Refined Grids

36 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Load Balancing (adapted to our LBM refinement implementation): • one space filling curve (Hilbert curve) per grid level

• “even” distribution to all available processes

Example (cont.): distribution to six

available processes - level by level

process: 0 1 2 3 4 5

3

3

2

1

3

3

2

1

3

2

2

1

3

2

2

1

3

2

2

1

3

2

2

0

3 3 3 3 3 3 3 3

2 2 2 2 2 2

1 1 1

2 2 2

0

2

1 1 Example: 24 blocks on four

different levels (forest with 3 trees)

37

• Benchmark Environments: • JUQUEEN (TOP500: 8)

• Blue Gene/Q, 459K cores, 1 GB/core

• compiler: IBM XL / IBM MPI

• SuperMUC (TOP500: 10) • Intel Xeon, 147K cores, 2 GB/core

• compiler: Intel XE / IBM MPI

• Benchmark (LBM D3Q19 TRT):

• lid-driven cavity • weak scaling • 6.6875 blocks per

process • 4 grid levels

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

Statically Refined Grids

• finest level … … covers 1.4% of space … generates 80% of workload

• coarsest level … … covers 78% of space … generates 1.1% of workload

38

• SuperMUC – TRT kernel (3.34 million cells per core)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

MLU

P/s

pe

r c

ore

16P 1T

4P 4T

2P 8T

32 8192 147,456 cores

#processes per node

#threads per process

Statically Refined Grids

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

39

• JUQUEEN – TRT kernel (1.81 million cells per core)

0

0.5

1

1.5

2

2.5

MLU

P/s

pe

r c

ore

64P 1T

16P 4T

8P 8T

32 512 8192 262,144 cores

#processes per node

#threads per process

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

Statically Refined Grids

Statically Refined Grids

40 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• LBM with refinement only achieves half the number of cell updates per second as regular LBM without refinement.

• Refinement Overhead:

• much more complex communication patterns which additionally involve interpolation

• more complex time stepping scheme

• multiple blocks per process ⇔ smaller blocks

BUT

LBM with grid refinement requires much less memory and far fewer cells (100 times fewer ...) !

Statically Refined Grids

41 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Example Application: A Study of the Vocal Fold

⇒ http://youtu.be/kUPf__THVZs

Statically Refined Grids

42 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Example Application: A Study of the Vocal Fold

DNS (direct numerical simulation)

Reynolds number: 1000 / D3Q19 TRT

4300 processes @ SuperMUC

101,466,432 fluid cells

25,800 blocks with 16 x 16 x 16 cells

Statically Refined Grids

43 Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Example Application: A Study of the Vocal Fold

number of different grid levels: 5

95.5 time steps / sec (finest grid)

total number of time steps (finest grid): 864,000

without refinement: 55.2 times more memory …

… and 98.6 times the workload

Outlook on Dynamic Grid Refinement / Conclusion

45

Outlook / Conclusion

• Dynamic Grid Refinement: • space filling curve? (see p4est and work done by C. Burstedde et al.)

• multiple space filling curves – one for each grid level?

• repartitioning based on graph algorithms?

→ diffusive algorithms! (see work of H. Meyerhenke et al.)

• Conclusion: • The waLBerla framework now has a highly efficient, massively

parallel implementation of LBM for …

• … uniform grids as well as …

• … statically refined grids.

⇒ next milestone: support for dynamic grid refinement / next milestone: runtime adaptivity

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

THANK YOU FOR YOUR ATTENTION!

QUESTIONS ? (visit http://www.walberla.net)

top related