TRANSCRIPT
Experiences with Earthquake and Tsunami Simulation on Xeon Phi Platforms
Special session on MIC experience and best practice
Michael Bader (and co-authors listed with chapters), Technical University of Munich
LRZ, 29 June 2016
Overview and Agenda

Dynamic Rupture and Earthquake Simulation with SeisSol:
• unstructured tetrahedral meshes
• high-order ADER-DG discretisation
• compute-bound performance via optimized matrix kernels

Optimising SeisSol for Xeon Phi Platforms:
• offload scheme: 1992 Landers Earthquake as landmark simulation, scalability on SuperMUC, Tianhe-2, Stampede
• optimisation for Knights Corner and Landing
• towards simulations in symmetric mode (1st results on Salomon)

Tsunami Simulation on SuperMIC:
• parallel adaptive mesh refinement with sam(oa)²
• enable vectorization via introducing patches
• towards load balancing on heterogeneous systems
M. Bader et al. | Earthquake and tsunami simulation on Xeon Phi | CzeBaCCA Workshop | 29 June 2016 2
Part I
Dynamic Rupture and Earthquake Simulation with SeisSol
http://www.seissol.org/
Dumbser, Käser et al. [9]: An arbitrary high-order discontinuous Galerkin method . . .
Pelties, Gabriel et al. [12]: Verification of an ADER-DG method for complex dynamic rupture problems
Dynamic Rupture and Earthquake Simulation
Landers fault system: simulated ground motion and seismic waves [3]
SeisSol – ADER-DG for seismic simulations:
• adaptive tetrahedral meshes → complex geometries, heterogeneous media, multiphysics
• complicated fault systems with multiple branches → non-linear multiphysics dynamic rupture simulation
• ADER-DG: high-order discretisation in space and time
Example: 1992 Landers M7.2 Earthquake
• multiphysics simulation of dynamic rupture and resulting ground motion of a M7.2 earthquake
• fault inferred from measured data, regional topography from satellite data, physically consistent stress and friction parameters
• static mesh refinement at fault and near surface
Multiphysics Dynamic Rupture Simulation
• spontaneous rupture, non-linear interaction with wave field
• featuring rupture jumps, fault branching, etc.
• tackles fundamental questions on earthquake dynamics
• realistic rupture source for seismic hazard assessment
Part II
SeisSol as a Compute-Bound Code: Code Generation for Matrix Kernels

Breuer, Heinecke, Rannabauer, Bader [1]: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol (ISC’15)
Uphoff, Bader [6]: Generating high performance matrix kernels for earthquake simulations with viscoelastic attenuation (HPCS 2016)
Seismic Wave Propagation with SeisSol

Elastic Wave Equations (velocity-stress formulation):

q_t + A q_x + B q_y + C q_z = 0

with q = (σ11, σ22, σ33, σ12, σ23, σ13, u, v, w)^T
The nine-dimensional vector of unknowns q collects the normal stress components σ11, σ22 and σ33, the shear stresses σ12, σ23 and σ13, and the particle velocities u, v and w in x-, y- and z-direction. The matrices A, B and C are given by

A = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & -(\lambda+2\mu) & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
-\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0
\end{pmatrix}

B = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -(\lambda+2\mu) & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0
\end{pmatrix}

C = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -(\lambda+2\mu) \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 \\
0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}

Here λ(x, y, z) and μ(x, y, z) are the Lamé parameters, where μ is the shear modulus and λ has no direct physical interpretation, and ρ(x, y, z) > 0 is the density of the material.
• high-order discontinuous Galerkin discretisation
• ADER-DG: high approximation order in space and time
• additional features: local time stepping, high accuracy of earthquake faulting (full frictional sliding)
→ Dumbser, Käser et al., e.g. [9]
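The Jacobian A of the velocity-stress formulation can be assembled directly from the Lamé parameters and the density; a minimal numpy sketch (the function name and sample parameter values are illustrative, not part of SeisSol):

```python
import numpy as np

def elastic_jacobian_A(lam, mu, rho):
    """Jacobian A of the elastic wave equations in velocity-stress form,
    for q = (s11, s22, s33, s12, s23, s13, u, v, w)^T."""
    A = np.zeros((9, 9))
    A[0, 6] = -(lam + 2 * mu)   # ds11/dt couples to du/dx
    A[1, 6] = -lam
    A[2, 6] = -lam
    A[3, 7] = -mu               # ds12/dt couples to dv/dx
    A[5, 8] = -mu               # ds13/dt couples to dw/dx
    A[6, 0] = -1.0 / rho        # du/dt couples to ds11/dx
    A[7, 3] = -1.0 / rho        # dv/dt couples to ds12/dx
    A[8, 5] = -1.0 / rho        # dw/dt couples to ds13/dx
    return A

A = elastic_jacobian_A(lam=2.0, mu=1.0, rho=1.0)
print(np.count_nonzero(A))  # 8 non-zero entries out of 81
```

The extreme sparsity of A, B and C (8 of 81 entries each) is exactly what the sparse/dense kernel discussion in Part II exploits.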
SeisSol in a Nutshell – ADER-DG
ADER time integration (Cauchy-Kovalewski procedure): the time-integrated degrees of freedom are

I(t^n, t^{n+1}, Q^n_k) = \sum_{j=0}^{J} \frac{(t^{n+1} - t^n)^{j+1}}{(j+1)!} \, \frac{\partial^j}{\partial t^j} Q_k(t^n),

with the time derivatives computed recursively from

(Q_k)_t = -M^{-1} \left( (K^\xi)^T Q_k A^*_k + (K^\eta)^T Q_k B^*_k + (K^\zeta)^T Q_k C^*_k \right).

Complete update scheme (combining the spatial discretisation, the flux computation and the ADER time discretisation):

Q^{n+1}_k = Q^n_k - \frac{|S_k|}{|J_k|} M^{-1} \left( \sum_{i=1}^{4} F^{-,i} \, I(t^n, t^{n+1}, Q^n_k) \, N_{k,i} A^+_k N^{-1}_{k,i} + \sum_{i=1}^{4} F^{+,i,j,h} \, I(t^n, t^{n+1}, Q^n_{k(i)}) \, N_{k,i} A^-_{k(i)} N^{-1}_{k,i} \right)
  + M^{-1} K^\xi \, I(t^n, t^{n+1}, Q^n_k) \, A^*_k
  + M^{-1} K^\eta \, I(t^n, t^{n+1}, Q^n_k) \, B^*_k
  + M^{-1} K^\zeta \, I(t^n, t^{n+1}, Q^n_k) \, C^*_k

All operations reduce to (sparse or dense) matrix products. For an ADER-DG scheme of polynomial order 3 this yields (3+1)(3+2)(3+3)/6 = 20 degrees of freedom per quantity; the global matrices correspond to the reference tetrahedron.
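The Cauchy-Kovalewski time integration can be written out in a few lines; the following sketch uses random stand-in matrices (M, the stiffness matrices and the star matrices here are placeholders, not SeisSol data):

```python
import numpy as np

rng = np.random.default_rng(0)
B, V = 20, 9                      # 20 basis functions, 9 quantities
Minv = np.diag(1.0 / rng.uniform(1, 2, B))            # inverse mass matrix
K = [0.1 * rng.standard_normal((B, B)) for _ in range(3)]    # K^xi, K^eta, K^zeta
star = [0.1 * rng.standard_normal((V, V)) for _ in range(3)]  # A*, B*, C*

def time_integrated(Q, dt, order):
    """Taylor-series time integral I = sum_j dt^(j+1)/(j+1)! * d^j/dt^j Q,
    with derivatives from (Q)_t = -M^-1 sum_c (K^c)^T Q (X^c)*."""
    I = np.zeros_like(Q)
    deriv, fac = Q.copy(), dt
    for j in range(order):
        I += fac * deriv
        deriv = -Minv @ sum(Kc.T @ deriv @ Sc for Kc, Sc in zip(K, star))
        fac *= dt / (j + 2)       # dt^(j+2)/(j+2)! for the next term
    return I

I = time_integrated(rng.standard_normal((B, V)), dt=0.01, order=4)
```

With `order=1` the integral degenerates to dt·Q, the forward-Euler contribution, which is a useful sanity check on the recursion.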
Optimisation of Matrix Operations

Apply sparse matrices to multiple DOF-vectors Q_k
[Fig. 1: sparsity patterns of the first two recursions of the ADER time integration for a fifth-order method — (a) computation of the first time derivative ∂¹/∂t¹ Q_k, (b) computation of the second derivative ∂²/∂t² Q_k. Orange blocks generate zero blocks in the derivatives, ivory blocks hit zero blocks.]
2.2 Efficient Evaluation of the ADER Time Integration

In this subsection, we introduce an improved scheme that reduces the computational effort of the ADER time integration, formulated in [24] and in discrete form in Sec. 2.1.

Analyzing the sparsity patterns of the involved stiffness matrices, we can substantially reduce the number of operations in the ADER time integration. This is especially true for high orders in space and time: the transposed stiffness matrices (K^ξ)^T, (K^η)^T and (K^ζ)^T of order O contain a zero block starting at the first row, which accounts for the new hierarchical basis functions added to those of the preceding order O−1 (without proof). According to Eq. (1) we have to compute the temporal derivatives ∂^j/∂t^j Q_k for j ∈ 1…O−1 by applying Eq. (2) recursively to reach approximation order O in time.

The first step (illustrated in Fig. 1(a)) computes ∂¹/∂t¹ Q_k from the initial DOFs Q^n_k. The zero blocks of the three stiffness matrices generate a zero block in the resulting derivative, and only the upper block of size B_{O−1} × 9 contains non-zeros. In the computation of the second derivative ∂²/∂t² Q_k we only have to account for the top-left block of size B_{O−1} × B_{O−1} in the stiffness matrices. As illustrated in Fig. 1(b), the additional non-zeros hit the previously generated zero block of derivative ∂¹/∂t¹ Q_k. The following derivatives ∂^j/∂t^j Q_k for j ∈ 3…O−1 proceed analogously: for each derivative j only the stiffness sub-blocks of size B_{O−j} × B_{O−j} have to be taken into account. The zero blocks also appear in the multiplication with the matrices A*, B* and C* from the right, where additional operations can be saved.
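A rough back-of-the-envelope estimate of these savings, following the B_{O−j} × B_{O−j} sub-block sizes stated above (an illustrative flop count only, not SeisSol's actual kernel cost model):

```python
def basis(O):
    """Number of 3-D hierarchical basis functions of an order-O scheme."""
    return O * (O + 1) * (O + 2) // 6

def ck_flops(O, blocked):
    """Rough multiply-add count for the three stiffness products per
    Cauchy-Kovalewski derivative (9 quantities per basis function)."""
    total = 0
    for j in range(1, O):                     # derivatives 1 .. O-1
        b = basis(O - j) if blocked else basis(O)
        total += 3 * 2 * b * b * 9            # 3 stiffness matrices
    return total

O = 5                                         # fifth-order method: B_5 = 35
savings = ck_flops(O, blocked=False) / ck_flops(O, blocked=True)
```

For the fifth-order case the blocked recursion touches blocks of 20, 10, 4 and 1 basis functions instead of 35 each time, which is where most of the reduction comes from.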
Dense vs. Sparse Kernels (Breuer et al. [2]):
• most kernels fastest if executed as dense matrix multiplications
• exploit zero blocks generated during recursive CK computation
• switch to sparse kernels depending on achieved time-to-solution
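The sparse-vs-dense decision can be sketched as a per-kernel heuristic: a sparse kernel wins only when its lower flop count outweighs its lower fraction of peak on wide-SIMD hardware. The efficiency fractions below are made-up illustration values, not measured numbers:

```python
def choose_kernel(nnz, rows, cols, vecs, dense_eff=0.8, sparse_eff=0.15):
    """Pick sparse vs dense for one matrix kernel applied to `vecs`
    DOF-vectors: dense costs 2*rows*cols*vecs flops at high efficiency,
    sparse costs 2*nnz*vecs flops at much lower efficiency."""
    dense_time = 2 * rows * cols * vecs / dense_eff
    sparse_time = 2 * nnz * vecs / sparse_eff
    return "sparse" if sparse_time < dense_time else "dense"

# a nearly empty matrix pays off as sparse, a moderately filled one does not
print(choose_kernel(nnz=24, rows=35, cols=35, vecs=9))
print(choose_kernel(nnz=400, rows=35, cols=35, vecs=9))
```

In practice (as the bullets above note) the decision is made from achieved time-to-solution rather than a static model, but the trade-off has this shape.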
Sparse, Dense → Block-Sparse

Consider equivalent sparsity patterns (Uphoff, [6]):
Graph representation and block-sparse memory layouts
Code Generator: Intrinsics → Assembler
Code Generator – Programming Interface
db = Tools.parseMatrixFile('matrices.xml')
Tools.memoryLayoutFromFile('layout.xml', db)
arch = Arch.getArchitectureByIdentifier('dhsw')

volume = db['kXiDivM'] * db['timeIntegrated'] * db['AstarT'] \
       + db['timeIntegrated'] * db['ET']

kernels = [('volume', volume)]
Tools.generate('path/to/output', db, kernels,
               'path/to/libxsmm_gemm_generator', arch)
Exploit efficient backend: libxsmm library [11]
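The interface above builds kernel descriptions from overloaded * and + operators. One way such a description could be assembled is via a small expression tree; the class names below are hypothetical illustrations, not the actual generator's API:

```python
class Expr:
    """Base class: overloaded operators build a kernel description tree."""
    def __mul__(self, other):
        return Node('*', self, other)
    def __add__(self, other):
        return Node('+', self, other)

class Matrix(Expr):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

class Node(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def __repr__(self):
        return f'({self.left} {self.op} {self.right})'

# mimic the volume-kernel expression from the slide
db = {n: Matrix(n) for n in ['kXiDivM', 'timeIntegrated', 'AstarT', 'ET']}
volume = db['kXiDivM'] * db['timeIntegrated'] * db['AstarT'] \
       + db['timeIntegrated'] * db['ET']
print(volume)
```

A generator can then walk this tree, pick memory layouts per matrix, and emit one small-GEMM call per node, which is where the libxsmm backend comes in.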
Benefit of High Order ADER-DG – Energy-Efficient
[Fig. 3: L∞-error of variable σyz (1E-10 to 1E-02) as a function of the consumed energy (75–1000 kJ) for the HSW machine in single- and double-precision, convergence orders 2–7.]
fully reached, as we align the elements' DOFs to the 32-byte boundary. This, in general, imposes some overheads with longer vectors, and switching from double to single precision naturally doubles the vector length.

Finally, in Fig. 4 we evaluated the measured power-to-solution curves by linear interpolation in log-log space at 150 kJ for all architectures (using on-demand frequency for WSM and SNB). For this energy budget we can now easily compare the achieved accuracy depending on convergence order and architecture. Note that we excluded settings beyond convergence and low-order settings not fitting into the memory of KNC due to slow convergence. Comparing O2 and O7 on HSW we see an error reduction of five orders of magnitude. For the cross-architecture comparison we select O6. According to Fig. 1 roughly the same efficiency can be reached on all platforms. In this case the error is reduced by a factor of 3.4 when switching from WSM to SNB and 4× when comparing SNB to KNC and HSW. These findings align well with Dennard scaling⁴ between SNB and HSW on an iso-frequency and iso-TDP level. From WSM to SNB we even see a superior scaling. This is due to the fact that we use full-box power measurements, but strictly speaking Dennard scaling applies to the processor only. Other parts of the system became more power-efficient, too, such that SNB clearly exceeds the WSM machine. One KNC coprocessor is able to achieve the same energy efficiency as the entire HSW machine. The reason for not (clearly) outperforming HSW is the (already analyzed) lower efficiency of the matrix kernels. This emphasizes that for compute-bound applications the fastest execution is also most likely the most energy-efficient one.

⁴ Doubling the number of transistors doubles the amount of computations within the same energy budget; number of transistors: WSM: 2×1.17 B, SNB: 2×2.26 B, HSW: 2×5.57 B.
• measure maximum error vs. consumed energy
• for increasing discretisation order on regular meshes
• here: dual-socket “Haswell” server, 36 cores @ 1.9 GHz
Benefit of High Order ADER-DG – Energy-Efficient
• high order (“compute”) beats high resolution (“memory”)
• ≈35% gain in energy-to-solution for single precision, but only for low order
Benefit of High Order ADER-DG – Compute-Bound
[Bar chart: measured GFlop/s per convergence order (2–7) at selected clock frequencies — 1.60/2.26/3.46 GHz, 1.20/2.00/2.60 GHz, 1.06 GHz and 1.20/1.90/2.30 GHz — on the WSM, SNB, KNC and HSW architectures.]
• measured “GFlop/s” and “MFlop/s per Watt” for Westmere, Sandy Bridge, Knights Corner and Haswell architectures [1]
• at selected clock frequencies and for different orders
• preference towards high order and low frequency on newest architectures
M. Bader et al. | Earthquake and tsunami simulation on Xeon Phi | CzeBaCCA Workshop | 29 June 2016 15
Part III
Accelerators – Dynamic Rupture Simulation on Xeon Phi Supercomputers

Heinecke, Breuer, Rettenberger, Gabriel, Pelties et al. [3]: Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers (Gordon Bell Prize Finalist 2014)
On the Road from Peta- to Exascale?

SuperMUC @ LRZ, Munich
• 9216 compute nodes (18 “thin node” islands), 147,456 Intel SNB-EP cores (2.7 GHz)
• Infiniband FDR10 interconnect (fat tree)
• #20 in Top 500: 2.897 PFlop/s

Stampede @ TACC, Austin
• 6400 compute nodes, 522,080 cores: 2 SNB-EP (8c) + 1 Xeon Phi SE10P per node
• Mellanox FDR 56 interconnect (fat tree)
• #8 in Top 500: 5.168 PFlop/s

Tianhe-2 @ NSCC, Guangzhou
• 8000 compute nodes used, 1.6 million cores: 2 IVB-EP (12c) + 3 Xeon Phi 31S1P per node
• TH2-Express custom interconnect
• #1 in Top 500: 33.862 PFlop/s
Optimization for Intel Xeon Phi Platforms
Offload Scheme:
• hide communication with Xeon Phi and between nodes
• use “heavy” CPU cores for dynamic rupture
Hybrid parallelism:
• on 1–3 Xeon Phis and host CPU(s)
• reflects multiphysics simulation
• manycore parallelism on Xeon Phi
[Diagram: offload scheme. Host: MPI communication and receiver output; dynamic rupture fluxes and fault output; time integration of MPI boundary cells; plot wave field (if required). PCIe transfers: download cells for receivers, DR and MPI; upload MPI-received cells; upload dynamic rupture updates; download all data (if required). Xeon Phi: time integration of non-MPI cells; volume integration; wave propagation fluxes; dynamic rupture updates; pack transfer data.]
Strong Scaling of Landers Scenario
[Plot: parallel efficiency (%, 40–100) vs. number of nodes (256–9216) for Stampede, SuperMUC (classic and grouped buffers) and Tianhe-2 with 1, 2 and 3 Xeon Phi cards]
• 191 million tetrahedrons; 220,982 element faces on fault
• 6th order, 96 billion degrees of freedom
• more than 85 % parallel efficiency on Stampede and Tianhe-2(when using only one Xeon Phi per node)
• multiple-Xeon-Phi performance suffers from MPI communication
• 3.3 PFlop/s on Tianhe-2 (7000 nodes)
• 2.0 PFlop/s on Stampede (6144 nodes)
• 1.3 PFlop/s on SuperMUC (9216 nodes)
Optimizing SeisSol for Xeon Phi (Knights Landing)
Heinecke et al., ISC 16 [7]
Code Generation:
• 512-bit wide vector processing unit
• profits from the Knights Landing optimizations of the libxsmm library [11]
Memory Optimization:

• examine impact of DDR4-only, CACHE and FLAT mode
• FLAT mode: careful placement of element-local matrices in MCDRAM (table from [7]):
order   Qk       Bk, Dk    Aξc_k, A−,i_k, A+,i_k    Kξc, Kξc, F−,i, F+,i,j,h
  2     MCDRAM   MCDRAM    MCDRAM                    MCDRAM
  3     MCDRAM   MCDRAM    MCDRAM                    MCDRAM
  4     DDR4     MCDRAM    MCDRAM                    MCDRAM
  5     DDR4     MCDRAM    DDR4                      MCDRAM
  6     DDR4     MCDRAM    DDR4                      MCDRAM

Table 1. Placements for all orders and the different data structures of SeisSol; DDR4/MCDRAM denotes whether a particular data structure is placed in DDR4 or MCDRAM.
For higher orders these matrices are bigger but have the same access frequency. These access patterns allow us to overcome the size limitation of the 16 GB MCDRAM by placing the ’slow-running’ data structures in DDR4. Therefore, in FLAT mode and for higher-order runs, we store Bk and/or Dk of every element in MCDRAM on the fly via the memkind library when computing them. As both memory types are seamlessly integrated into the architecture, we simply change the place of allocation, but not our macro-kernels. Thus pointers to Bk and/or Dk reference memory physically stored in MCDRAM, whereas Aξc_k, A−,i_k, A+,i_k and Qk reside in the DDR4 portion of the address space for orders O = 5 and O = 6. Additionally, we hold unique matrices (Kξc, Kξc, F−,i, F+,i,j,h, including the 48 flux matrices required for the neighboring elements’ contribution to the surface kernel (7)) in MCDRAM as well, as we expect local L2 cache evictions for higher orders. For lower orders, two to four, the bandwidth requirements of SeisSol for the element-local matrices and Qk increase; we therefore allocate more data structures in MCDRAM. In fact, for orders O = 2 and O = 3, all important data structures are placed in MCDRAM. Table 1 summarizes the placements used when running on KNL in FLAT mode.
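Table 1 amounts to a small per-order lookup; a sketch in Python (the group names are shorthand introduced here, not SeisSol identifiers):

```python
# FLAT-mode placement policy of Table 1 as a lookup table.
# Column order: Qk | Bk,Dk | star matrices A | global stiffness/flux matrices.
PLACEMENT = {
    2: ("MCDRAM", "MCDRAM", "MCDRAM", "MCDRAM"),
    3: ("MCDRAM", "MCDRAM", "MCDRAM", "MCDRAM"),
    4: ("DDR4",   "MCDRAM", "MCDRAM", "MCDRAM"),
    5: ("DDR4",   "MCDRAM", "DDR4",   "MCDRAM"),
    6: ("DDR4",   "MCDRAM", "DDR4",   "MCDRAM"),
}
GROUPS = ("Qk", "Bk,Dk", "A-matrices", "K,F-matrices")

def placement(order: int, group: str) -> str:
    """Return 'MCDRAM' or 'DDR4' for a data-structure group at a given order."""
    return PLACEMENT[order][GROUPS.index(group)]

print(placement(5, "Qk"))      # DDR4: slow-running data goes to DDR4
print(placement(5, "Bk,Dk"))   # MCDRAM: frequently streamed data stays fast
```

The point of encoding it this way: the allocation site is the only thing that changes per order; the macro-kernels are untouched, exactly as the excerpt above describes.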
4.3 Optimizing the Mesh Traffic and Prefetching
KNL’s last level cache (LLC) is not a shared cache level: it is implemented by a 2D mesh of up to 36 slices of 1 MB L2 cache each, cf. Sect. 2. These slices are kept coherent by a distributed tag directory in each tile’s CHA. As we pointed out in the last section, for orders higher than four the 48 flux matrices F+,i,j,h approach (500 KB for order five) or even exceed (1.5 MB for order O = 6) the size of one tile’s L2 cache. This can negatively affect the performance of (7) for two reasons: a) especially for order O = 6 this results in a high rate of CHA-to-CHA communication, as the unstructured mesh causes unstructured accesses to the flux matrices; b) the hardware prefetcher cannot pick up the unstructured accesses. Keeping the last section in mind, we know that we still have plenty of MCDRAM bandwidth available at higher orders. Therefore, we place several copies, one per two tiles, in MCDRAM. This ensures that the mesh traffic gets equally distributed and that the access latency is not limited by one CHA in the entire mesh holding the directory entries for one particular flux matrix. Additionally, we use modified matrix kernel operations in (7).
Performance Results on Knights Landing
Heinecke et al., ISC 16 [7]
[Bar chart, Fig. 9 from [7]: normalized time-to-solution speed-up over HSX for KNC and KNL (orders 2–6; DDR4, CACHE and FLAT modes; with AVX core frequency and with AVX Turbo Boost) when simulating the 1992 Landers scenario using global time stepping; KNL reaches up to ≈3.5×, KNC ≈1.0–1.2×]
The setup would easily fit into MCDRAM at any time, as the total memory consumption at order 6 is 7.1 GB.

The GTS performance of the 1992 Landers setup is provided in Fig. 9. As this is a multi-physics scenario, we expect slightly lower performance than for the earlier pure wave-propagation runs on a many-core processor, because the dynamic rupture portion of the solver requires high scalar performance. Here, KNL’s increased single-thread performance becomes visible: KNL recovers more than 92% of the pure wave-propagation speed-up over HSX, whereas the previous-generation KNC chip only attains 83%. This results in a relative performance comparable to HSX. KNL’s time-to-solution speed-up for executing the 1992 Landers earthquake simulations is 2.5–2.9× depending on the chosen order.
6 Conclusion

In this article, we presented a holistic optimization of SeisSol, a multi-physics simulation package for seismic simulations which tightly couples seismic wave propagation and dynamic rupture processes. First, we presented a deep dive into KNL’s architectural features and their challenges and opportunities for high-performance software. After a brief recapitulation of SeisSol’s mathematical background, we discussed in detail how to exploit KNL’s two VPUs per core efficiently and how to leverage both memory subsystems for a novel out-of-core implementation in SeisSol’s high-order wave propagation solver. The KNL-optimized implementation was evaluated for three different scenarios with distinct challenges and sizes. In the case of global time stepping runs, KNL was able to outperform its predecessor, KNC, by 2.9× and the current most powerful Intel Xeon processor, E5v3, by more than 3.4×. Even more important, in contrast to …
Landers scenario, 466,574 elements
Part IV
Current Work – Simulations in Symmetric Mode on Salomon

Rettenberger, Uphoff, Rannabauer;
Project CzeBaCCA: Czech-Bavarian Competence Centre for Supercomputing Applications
Modify Mesh Input and Load Distribution
Scalable Mesh Partitioning and Input Pipeline:

[Diagram: mesh input pipeline; a CAD model is meshed either serially (GAMBIT mesh file) or in parallel (SimModeler via the Simulation Modeling Suite C++ API), then processed by PUMGen and partitioned with ParMETIS for the simulation]
Towards Symmetric Mode:
• Modify weights for METIS graph partitioning: compensate speed differences between host CPU and Xeon Phi
• Work in progress: modify input of meshes
→ Xeon Phi mesh partitions may be read by host and sent via MPI
→ in case of bad I/O bandwidth (library support) of Xeon Phis
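Compensating the speed difference between host CPU and Xeon Phi can be expressed through the partitioner's target partition weights (METIS's `tpwgts` argument); a minimal sketch, where the relative speeds are made-up illustration values, not measurements:

```python
def target_partition_weights(rel_speed):
    """Fraction of the mesh each partition should receive, proportional
    to the relative speed of the device that will own it (the shape of
    METIS's tpwgts array)."""
    total = sum(rel_speed)
    return [s / total for s in rel_speed]

# Illustrative: one host rank assumed twice as fast as each of two
# Xeon Phi ranks.
print(target_partition_weights([2.0, 1.0, 1.0]))  # [0.5, 0.25, 0.25]
```

The same idea carries over to per-cell weights when cells differ in cost (e.g. dynamic rupture faces), which is the "modify weights" bullet above.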
Work in Progress: Modify Wave Field Output
Aggregation of MPI ranks to speed up I/O (Rettenberger [5]):
[Diagram: without aggregation, multiple SeisSol/HDF5 tasks write to one file-system block and one process writes to multiple blocks; with aggregation, dedicated I/O tasks (ranks 0, 3, 6, …) collect the data of several ranks and write block-aligned output]
Towards Symmetric Mode:
• Output routines aggregate data from several MPI ranks
→ match I/O block size to achieve substantial speedup
• Only use host MPI ranks for output
→ again in case of bad I/O bandwidth (library support) of Xeon Phis
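The rank-to-aggregator mapping sketched in the diagram (every third rank acts as I/O task) is a one-liner; a toy illustration, assuming fixed-size groups of consecutive ranks rather than SeisSol's actual scheme:

```python
def aggregator(rank: int, group_size: int) -> int:
    """The I/O task that collects this rank's output: the lowest rank in
    each group of `group_size` consecutive ranks."""
    return (rank // group_size) * group_size

# Six ranks, groups of three: ranks 0 and 3 become I/O tasks.
print([aggregator(r, 3) for r in range(6)])  # [0, 0, 0, 3, 3, 3]
```

Choosing `group_size` so that one aggregator's buffer matches the file-system block size is what yields the speedup mentioned above.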
First Runs on Salomon – Native and Symmetric
Setup: LOH4 benchmark, 250k elements, order 6, no output yet
[Plot: hardware TFlop/s (0.5–64, log scale) vs. number of nodes (2–64) for Host, MIC, MIC+MIC, Host+MIC and Host+MIC+MIC configurations, against ideal scaling]
[Same measurements plotted over the number of devices instead of the number of nodes]
Part V
sam(oa)2
Parallel Adaptive Mesh Refinement using Sierpinski Space-Filling Curves

O. Meister, K. Rahnema, M. Bader [4]: Parallel Memory Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells
sam(oa)2: Scalable Dynamic Adaptivity
Using Structured Triangular Meshes and Sierpinski Space-Filling Curve
Triangular Meshes Generated by Bisection
“Newest Vertex Bisection” Refinement & Sierpinski Curves:
• fully adaptive grid described by a corresponding refinement tree
• tree and grid cells traversed in Sierpinski order
• minimum memory requirements (1 bit per tree node?!)
→ triangle strips as data structure
• exploit for cache efficiency and parallelisation
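Newest-vertex bisection itself is only a few lines; a Python sketch under one common vertex-ordering convention (storing explicit coordinates for clarity, whereas sam(oa)2 stores only refinement bits):

```python
# Triangles are vertex triples (a, b, c) with c the "newest" vertex;
# the refinement edge is the opposite edge (a, b).
def bisect(tri):
    """Split one triangle into two children; the midpoint of the
    refinement edge becomes the newest vertex of both children."""
    a, b, c = tri
    m = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)  # midpoint of edge (a, b)
    return [(c, a, m), (b, c, m)]

# Two levels of uniform refinement of the unit right triangle:
mesh = [((0, 0), (1, 0), (0, 1))]
for _ in range(2):
    mesh = [child for tri in mesh for child in bisect(tri)]
print(len(mesh))  # 4 triangles after two bisection levels
```

Because each split produces exactly two children, the refinement tree is binary, which is what makes the 1-bit-per-node encoding of the bullet above possible.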
The Stack Principle for Data Exchange on Edges
[Figure: cells A, B, C along the Sierpinski traversal exchanging edge data via stacks]
• to compute numerical fluxes in Finite Volume and DG methods
• to synchronise refinement of neighbour cells (conforming grids)
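The stack principle can be illustrated on a 1D strip of cells: each interior edge is visited exactly twice along the traversal, so the first visit pushes the left cell's value and the second visit pops it to form the flux, with no neighbour indices or global edge numbering. A toy sketch (sam(oa)2 uses several coloured stacks for the 2D Sierpinski traversal; this is the 1D essence only):

```python
def fluxes_via_stack(cell_values):
    """Edge 'fluxes' for a strip of cells using only a stack."""
    stack, fluxes = [], []
    for q in cell_values:
        if stack:                      # second visit of the edge on the left
            q_left = stack.pop()
            fluxes.append(q - q_left)  # toy "flux": jump across the edge
        stack.append(q)                # first visit of the edge on the right
    return fluxes

print(fluxes_via_stack([1.0, 4.0, 2.0]))  # [3.0, -2.0]
```

The same push/pop discipline also serves the second bullet: refinement flags travel across edges without any explicit neighbour lookup.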
Stream- and Stack-based Processing
[Diagram: element-wise processing between an input stream and an output stream via a system of stacks; cells are added for refinement and deleted for coarsening]
• drop fixed memory location of variables!
• persistent (degrees of freedom) vs. non-persistent (residuals, “old” variables) data
• aim: reduce memory footprint and number of traversals
• strongly element-oriented; impedes vectorization over elements
Partitions & Load Balancing: Sierpinski Sections
Use Sierpinski Space-Filling Curve for Partitioning:
• remeshing requires three sections: “add”, “keep”, and “cut”
• generalized towards restructuring of Sierpinski sections: turn M sections into N (balanced) sections
• migrate only sections to simplify data transfer (tolerate sacrifice on load balancing)
• flexible concept of “load”: number of cells, weight per cell, average measured runtime
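Because the Sierpinski curve linearizes the cells, turning M sections into N balanced sections reduces to cutting a 1D sequence at prefix-sum targets; a simplified sketch of the cut computation only (sam(oa)2 additionally migrates whole sections between ranks):

```python
from itertools import accumulate
from bisect import bisect_left

def rebalance(cell_loads, n_sections):
    """Cut the Sierpinski-ordered cell sequence into n contiguous sections
    whose loads are as equal as prefix-sum splitting allows."""
    prefix = list(accumulate(cell_loads))
    total = prefix[-1]
    cuts = [bisect_left(prefix, total * k / n_sections) + 1
            for k in range(1, n_sections)]
    bounds = [0] + cuts + [len(cell_loads)]
    return [cell_loads[bounds[i]:bounds[i + 1]] for i in range(n_sections)]

# Six cells with per-cell loads; both sections end up with load 4:
print(rebalance([1, 1, 1, 1, 2, 2], 2))  # [[1, 1, 1, 1], [2, 2]]
```

The "flexible concept of load" bullet maps directly onto `cell_loads`: unit weights count cells, while per-cell weights or measured runtimes just change the numbers fed in.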
Load Balancing Using Sierpinski Index Sections
• Adaptive traversal: refine/coarsen cells
• Load balancing: estimate new load and exchange sections
• Resize sections
• Compute (mark cells for refinement/coarsening) …

[Figure: per-cell refinement marks (+1 −1 +3 −1 −1 −1 −1 −1 −1 +1 +2 +0) along the cell sequence determine the estimated new load]
Part VI
Dynamically Adaptive Tsunami Simulation – Performance, Scalability, Patches

Oliver Meister, Kaveh Rahnema, Chaulio Ferreira
Tsunami Simulation: Vectorization vs. AMR
[Plot: million element updates per second per core vs. elements per dimension (400–4000) for a shallow water simulation; curves for no vectorization, SSE4 and AVX, each on 1 and on 16 cores]
Vectorization Imperative on Modern CPU Architectures:
• example above: shallow water simulation on static Cartesian mesh
• speed-up due to intrinsics implementation of augmented Riemann solver
• sam(oa)2: mesh adaptivity impedes vectorization over grid cells
• possible remedy: introduce regularly refined patches
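The patch remedy works because cells stored contiguously per patch turn the inner loop into a fixed-trip-count, unit-stride loop, exactly the shape compilers and SIMD intrinsics need. A toy sketch (the decay update is an invented stand-in for the Riemann-solver-based update, and the patch size is illustrative):

```python
# Patches of a fixed number of cells, stored contiguously.
PATCH_SIZE = 8

def update_patch(patch, dt):
    # fixed trip count, unit stride over one patch: vectorizable shape
    return [h - dt * h for h in patch]

def update_mesh(patches, dt):
    # adaptivity lives at patch granularity; cells inside stay regular
    return [update_patch(p, dt) for p in patches]

mesh = [[float(i + PATCH_SIZE * p) for i in range(PATCH_SIZE)]
        for p in range(4)]
out = update_mesh(mesh, 0.5)
print(out[0][2])  # 1.0: cell value 2.0 halved
```

The trade-off is that refinement and coarsening now operate on whole patches, so some cells are over-refined; the measurements on the next slides quantify whether the SIMD gain outweighs that.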
Using Patches – First Results on Xeon Phi
SuperMIC: 2 × Xeon Phi 5110P (60 cores, 1.1 GHz) (Chaulio Ferreira)
Simple f-wave Riemann solver:
• solid speedup, but away from optimum (vector width) and not perfectly scaling (single node, shared memory bandwidth)
• need to study influence of memory-bound parts, etc.
Using Patches – First Results on Xeon Phi
SuperMIC: 2 × Xeon Phi 5110P (60 cores, 1.1 GHz) (Chaulio Ferreira)
Vector HLLE solver: (compared against augmented Riemann)
• relatively larger speed-up (more compute-bound than f-wave)
• best time-to-solution for patch size 8; detailed analysis “to do”
• from ∼12 to ∼45 million element updates per second per node
Load Balancing for Symmetric Mode
SuperMIC: 2× Ivy Bridge (8 cores, 2.6 GHz) plus 2× Xeon Phi 5110P (60 cores, 1.1 GHz)
• guided (2:1:1) load balancing between host (1 MPI rank) and Xeon Phis (2 ranks) clearly superior to homogeneous (1:1:1)
• “automatic” (measured-runtime) load balancing finds 42:29:29
• suspect heavy losses due to communication
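The guided and automatic ratios above translate into per-rank cell counts as follows; a small sketch with largest-remainder rounding so the counts sum up exactly (the cell count 1000 is illustrative):

```python
def split_by_ratio(n_cells, ratio):
    """Distribute n_cells across ranks according to speed ratios,
    largest-remainder rounding so the counts sum to n_cells exactly."""
    total = sum(ratio)
    exact = [n_cells * r / total for r in ratio]
    counts = [int(e) for e in exact]
    for _ in range(n_cells - sum(counts)):   # hand out remainders
        i = max(range(len(ratio)), key=lambda k: exact[k] - counts[k])
        counts[i] += 1
    return counts

print(split_by_ratio(1000, [2, 1, 1]))     # guided:    [500, 250, 250]
print(split_by_ratio(1000, [42, 29, 29]))  # automatic: [420, 290, 290]
```

That the measured-runtime ratio lands at 42:29:29 rather than the prescribed 50:25:25 is what points to communication overhead eating into the hosts' advantage.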
Symmetric Mode – Multiple Nodes
SuperMIC: 2× Ivy Bridge (8 cores, 2.6 GHz) plus 2× Xeon Phi 5110P (60 cores, 1.1 GHz)

• strong scaling currently impeded by slow communication
• fall back to load balancing only every 10 time steps
• will need to test with communication proxy / on Salomon
Part VII
Conclusions and Outlook
Conclusions on SeisSol
Performance Optimisation on Multi&Manycore Platforms:
• high convergence order and high computational intensity of ADER-DG→ compute-bound performance on current and imminent CPUs
• code generation to accelerate element kernels• careful tuning and parallelisation of the entire simulation pipeline
(scalable mesh input, output and checkpointing)
Xeon Phi Platforms
• offload scheme scaled to 1.5 million cores (Tianhe-2, Stampede)• our goal: scale in symmetric mode on heterogeneous supercomputers→ current work on SuperMIC and esp. Salomon
• heterogeneity challenges exist in load balancing and scalable I/O• SeisSol runs on Knights Landing→ ISC’16 [7] and KNL-Book [8]
(Preliminary) Conclusions on sam(oa)2
High-Performance Parallel AMR:• making dynamically adaptive simulation codes achieve high performance
is hard work• space-filling-curve approaches for hybrid parallelism→ sam(oa)2
• load balancing in each time step is feasible (and hybrid parallelism helps)
Performance Challenges for Modern Heterogeneous Platforms:• crucial performance question #1:
Where can I apply SIMD parallelism for dynamic adaptivity?
• patches offer additional opportunity for vectorization(similar: layered models→ 2D adaptivity for 3D problems)
• crucial performance question #2:How do I effectively load balance with heterogeneity?
• hoping for automagic (runtime-based) load balancing;→ prescribed weights currently more successful
Acknowledgements
Special thanks go to . . .• the entire SeisSol team and all contributors, esp.:
– Alex Breuer, Sebastian Rettenberger, Carsten Uphoff– Alex Heinecke– Alice Gabriel, Christian Pelties, Stephanie Wolherr
• the sam(oa)2 team:– Oliver Meister, Kaveh Rahnema, Chaulio Ferreira
• all colleagues from the Leibniz Supercomputing Centre• all colleagues from the IT4Innovations Supercomputing Centre• for financial and project support:
– Intel Corporation (IPCC ExScaMIC)
– Volkswagen Foundation (project ASCETE)
– BMBF (project CzeBaCCA)
Publications[1] A. Breuer, A. Heinecke, L. Rannabauer, M. Bader: High-Order ADER-DG Minimizes Energy-
and Time-to-Solution of SeisSol. In: High Performance Computing, Proceedings of ISC 15,LNCS 9137, p. 340–357, 2015.
[2] A. Breuer, A. Heinecke, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties: SustainedPetascale Performance of Seismic Simulations with SeisSol on SuperMUC. In:Supercomputing, LNCS 8488, p. 1–18. PRACE ISC Award 2014.
[3] A. Heinecke, A. Breuer, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth,X.-K. Liao, K. Vaidyanathan, M. Smelyanskiy, P. Dubey: Petascale High Order DynamicRupture Earthquake Simulations on Heterogeneous Supercomputers. Gordon Bell PrizeFinalist 2014.
[4] O. Meister, K. Rahnema, M. Bader: Parallel Memory Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells. ACM Transactions on Mathematical Software (TOMS), accepted.
[5] S. Rettenberger, M. Bader: Optimizing Large Scale I/O for Petascale Seismic Simulations on Unstructured Meshes. 2015 IEEE International Conference on Cluster Computing (CLUSTER), p. 314–317. IEEE Xplore, 2015.
[6] C. Uphoff, M. Bader: Generating high performance matrix kernels for earthquake simulationswith viscoelastic attenuation. The 2016 International Conference on High PerformanceComputing & Simulation (HPCS 2016), p. 908–916.
Publications and References[7] A. Heinecke, A. Breuer, M. Bader: High Order Seismic Simulations on the Intel Xeon Phi
Processor (Knights Landing). ISC High Performance, 2016.[8] A. Heinecke, A. Breuer, M. Bader: High Performance Seismic Simulations. In J. Jeffers, J.
Reinders, A. Sodani (ed.), Intel Xeon Phi Processor High Performance Programming –Knights Landing Edition, ch. 21. Morgan Kaufmann, 2016.
[9] M. Dumbser, M. Kaser: An arbitrary high-order discontinuous Galerkin method for elasticwaves on unstructured meshes – II. The three-dimensional isotropic case. Geophys. J. Int.167(1), 2006.
[10] D. George: Augmented Riemann solvers for the shallow water equations over variabletopography with steady states and inundation.J. Comput. Phys. 227(6), 2008.
[11] A. Heinecke, G. Henry, M. Hutchinson, H. Pabst: LIBXSMM: Accelerating Small MatrixMultiplications by Runtime Code Generation, SC16, accepted.
[12] C. Pelties, A.-A. Gabriel, J.-P. Ampuero: Verification of an ADER-DG method for complexdynamic rupture problems, Geoscientific Model Development, 7(3), p. 847–866.
[13] C. Pelties, J. de la Puente, J.-P. Ampuero, G. B. Brietzke, M. Kaser: Three-dimensionaldynamic rupture simulation with a high-order discontinuous Galerkin method on unstructuredtetrahedral meshes. J. Geophys. Res.: Solid Earth, 117(B2), 2012.