TRANSCRIPT
Experiences with Earthquake and Tsunami Simulation on Xeon Phi Platforms
Special session on MIC experience and best practice
Michael Bader (and co-authors listed with chapters), Technical University of Munich
LRZ, 29 June 2016
Overview and Agenda

Dynamic Rupture and Earthquake Simulation with SeisSol:
• unstructured tetrahedral meshes
• high-order ADER-DG discretisation
• compute-bound performance via optimized matrix kernels

Optimising SeisSol for Xeon Phi Platforms:
• offload scheme: 1992 Landers Earthquake as landmark simulation, scalability on SuperMUC, Tianhe-2, Stampede
• optimisation for Knights Corner and Landing
• towards simulations in symmetric mode (1st results on Salomon)

Tsunami Simulation on SuperMIC:
• parallel adaptive mesh refinement with sam(oa)²
• enable vectorization via introducing patches
• towards load balancing on heterogeneous systems
M. Bader et al. | Earthquake and tsunami simulation on Xeon Phi | CzeBaCCA Workshop | 29 June 2016 2
Part I
Dynamic Rupture and Earthquake Simulation with SeisSol
http://www.seissol.org/
Dumbser, Käser et al. [9]: An arbitrary high-order discontinuous Galerkin method . . .
Pelties, Gabriel et al. [12]: Verification of an ADER-DG method for complex dynamic rupture problems
Dynamic Rupture and Earthquake Simulation
Landers fault system: simulated ground motion and seismic waves [3]
SeisSol – ADER-DG for seismic simulations:
• adaptive tetrahedral meshes → complex geometries, heterogeneous media, multiphysics
• complicated fault systems with multiple branches → non-linear multiphysics dynamic rupture simulation
• ADER-DG: high-order discretisation in space and time
Example: 1992 Landers M7.2 Earthquake
• multiphysics simulation of dynamic rupture and resulting ground motion of a M7.2 earthquake
• fault inferred from measured data, regional topography from satellite data, physically consistent stress and friction parameters
• static mesh refinement at fault and near surface
Multiphysics Dynamic Rupture Simulation
• spontaneous rupture, non-linear interaction with wave field
• featuring rupture jumps, fault branching, etc.
• tackles fundamental questions on earthquake dynamics
• realistic rupture source for seismic hazard assessment
Part II
SeisSol as a Compute-Bound Code: Code Generation for Matrix Kernels

Breuer, Heinecke, Rannabauer, Bader [1]: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol (ISC’15)
Uphoff, Bader [6]: Generating high performance matrix kernels for earthquake simulations with viscoelastic attenuation (HPCS 2016)
Seismic Wave Propagation with SeisSol

Elastic Wave Equations (velocity-stress formulation):

q_t + A q_x + B q_y + C q_z = 0

with q = (σ11, σ22, σ33, σ12, σ23, σ13, u, v, w)^T
The nine-dimensional vector of unknowns q collects the normal stress components σ11, σ22 and σ33, the shear stresses σ12, σ23 and σ13, and the particle velocities u, v and w in x-, y- and z-direction. The matrices A, B and C are given by

A = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & -(\lambda+2\mu) & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
-\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0
\end{pmatrix}

B = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -(\lambda+2\mu) & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 \\
0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0
\end{pmatrix}

C = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -\lambda \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -(\lambda+2\mu) \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -\mu & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 \\
0 & 0 & -\rho^{-1} & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}

Here λ(x, y, z) and μ(x, y, z) are the Lamé parameters, where μ is the shear modulus and λ has no direct physical interpretation, and ρ(x, y, z) > 0 is the density of the material.
• high-order discontinuous Galerkin discretisation
• ADER-DG: high approximation order in space and time
• additional features: local time stepping, high accuracy of earthquake faulting (full frictional sliding)
→ Dumbser, Käser et al., e.g. [9]
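The Jacobian A of the velocity-stress formulation can be assembled directly from the Lamé parameters and the density; a minimal numpy sketch (the function name and sample parameter values are illustrative, not part of SeisSol):

```python
import numpy as np

def elastic_jacobian_A(lam, mu, rho):
    """Jacobian A of the elastic wave equations in velocity-stress form,
    for q = (s11, s22, s33, s12, s23, s13, u, v, w)^T."""
    A = np.zeros((9, 9))
    A[0, 6] = -(lam + 2 * mu)   # ds11/dt couples to du/dx
    A[1, 6] = -lam
    A[2, 6] = -lam
    A[3, 7] = -mu               # ds12/dt couples to dv/dx
    A[5, 8] = -mu               # ds13/dt couples to dw/dx
    A[6, 0] = -1.0 / rho        # du/dt couples to ds11/dx
    A[7, 3] = -1.0 / rho        # dv/dt couples to ds12/dx
    A[8, 5] = -1.0 / rho        # dw/dt couples to ds13/dx
    return A

A = elastic_jacobian_A(lam=2.0, mu=1.0, rho=1.0)
print(np.count_nonzero(A))  # 8 non-zero entries out of 81
```

The extreme sparsity of A, B and C (8 of 81 entries each) is exactly what the sparse/dense kernel discussion in Part II exploits.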
SeisSol in a Nutshell – ADER-DG
ADER time integration (Cauchy-Kovalewski procedure): the time-integrated degrees of freedom are

I(t^n, t^{n+1}, Q^n_k) = \sum_{j=0}^{J} \frac{(t^{n+1} - t^n)^{j+1}}{(j+1)!} \, \frac{\partial^j}{\partial t^j} Q_k(t^n),

with the time derivatives computed recursively from

(Q_k)_t = -M^{-1} \left( (K^\xi)^T Q_k A^*_k + (K^\eta)^T Q_k B^*_k + (K^\zeta)^T Q_k C^*_k \right).

Complete update scheme (combining the spatial discretisation, the flux computation and the ADER time discretisation):

Q^{n+1}_k = Q^n_k - \frac{|S_k|}{|J_k|} M^{-1} \left( \sum_{i=1}^{4} F^{-,i} \, I(t^n, t^{n+1}, Q^n_k) \, N_{k,i} A^+_k N^{-1}_{k,i} + \sum_{i=1}^{4} F^{+,i,j,h} \, I(t^n, t^{n+1}, Q^n_{k(i)}) \, N_{k,i} A^-_{k(i)} N^{-1}_{k,i} \right)
  + M^{-1} K^\xi \, I(t^n, t^{n+1}, Q^n_k) \, A^*_k
  + M^{-1} K^\eta \, I(t^n, t^{n+1}, Q^n_k) \, B^*_k
  + M^{-1} K^\zeta \, I(t^n, t^{n+1}, Q^n_k) \, C^*_k

All operations reduce to (sparse or dense) matrix products. For an ADER-DG scheme of polynomial order 3 this yields (3+1)(3+2)(3+3)/6 = 20 degrees of freedom per quantity; the global matrices correspond to the reference tetrahedron.
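The Cauchy-Kovalewski time integration can be written out in a few lines; the following sketch uses random stand-in matrices (M, the stiffness matrices and the star matrices here are placeholders, not SeisSol data):

```python
import numpy as np

rng = np.random.default_rng(0)
B, V = 20, 9                      # 20 basis functions, 9 quantities
Minv = np.diag(1.0 / rng.uniform(1, 2, B))            # inverse mass matrix
K = [0.1 * rng.standard_normal((B, B)) for _ in range(3)]    # K^xi, K^eta, K^zeta
star = [0.1 * rng.standard_normal((V, V)) for _ in range(3)]  # A*, B*, C*

def time_integrated(Q, dt, order):
    """Taylor-series time integral I = sum_j dt^(j+1)/(j+1)! * d^j/dt^j Q,
    with derivatives from (Q)_t = -M^-1 sum_c (K^c)^T Q (X^c)*."""
    I = np.zeros_like(Q)
    deriv, fac = Q.copy(), dt
    for j in range(order):
        I += fac * deriv
        deriv = -Minv @ sum(Kc.T @ deriv @ Sc for Kc, Sc in zip(K, star))
        fac *= dt / (j + 2)       # dt^(j+2)/(j+2)! for the next term
    return I

I = time_integrated(rng.standard_normal((B, V)), dt=0.01, order=4)
```

With `order=1` the integral degenerates to dt·Q, the forward-Euler contribution, which is a useful sanity check on the recursion.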
Optimisation of Matrix Operations

Apply sparse matrices to multiple DOF-vectors Q_k
[Fig. 1: sparsity patterns of the first two recursions of the ADER time integration for a fifth-order method — (a) computation of the first time derivative ∂¹/∂t¹ Q_k, (b) computation of the second derivative ∂²/∂t² Q_k. Orange blocks generate zero blocks in the derivatives, ivory blocks hit zero blocks.]
2.2 Efficient Evaluation of the ADER Time Integration

In this subsection, we introduce an improved scheme that reduces the computational effort of the ADER time integration, formulated in [24] and in discrete form in Sec. 2.1.

Analyzing the sparsity patterns of the involved stiffness matrices, we can substantially reduce the number of operations in the ADER time integration. This is especially true for high orders in space and time: the transposed stiffness matrices (K^ξ)^T, (K^η)^T and (K^ζ)^T of order O contain a zero block starting at the first row, which accounts for the new hierarchical basis functions added to those of the preceding order O−1 (without proof). According to Eq. (1) we have to compute the temporal derivatives ∂^j/∂t^j Q_k for j ∈ 1…O−1 by applying Eq. (2) recursively to reach approximation order O in time.

The first step (illustrated in Fig. 1(a)) computes ∂¹/∂t¹ Q_k from the initial DOFs Q^n_k. The zero blocks of the three stiffness matrices generate a zero block in the resulting derivative, and only the upper block of size B_{O−1} × 9 contains non-zeros. In the computation of the second derivative ∂²/∂t² Q_k we only have to account for the top-left block of size B_{O−1} × B_{O−1} in the stiffness matrices. As illustrated in Fig. 1(b), the additional non-zeros hit the previously generated zero block of derivative ∂¹/∂t¹ Q_k. The following derivatives ∂^j/∂t^j Q_k for j ∈ 3…O−1 proceed analogously: for each derivative j only the stiffness sub-blocks of size B_{O−j} × B_{O−j} have to be taken into account. The zero blocks also appear in the multiplication with the matrices A*, B* and C* from the right, where additional operations can be saved.
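A rough back-of-the-envelope estimate of these savings, following the B_{O−j} × B_{O−j} sub-block sizes stated above (an illustrative flop count only, not SeisSol's actual kernel cost model):

```python
def basis(O):
    """Number of 3-D hierarchical basis functions of an order-O scheme."""
    return O * (O + 1) * (O + 2) // 6

def ck_flops(O, blocked):
    """Rough multiply-add count for the three stiffness products per
    Cauchy-Kovalewski derivative (9 quantities per basis function)."""
    total = 0
    for j in range(1, O):                     # derivatives 1 .. O-1
        b = basis(O - j) if blocked else basis(O)
        total += 3 * 2 * b * b * 9            # 3 stiffness matrices
    return total

O = 5                                         # fifth-order method: B_5 = 35
savings = ck_flops(O, blocked=False) / ck_flops(O, blocked=True)
```

For the fifth-order case the blocked recursion touches blocks of 20, 10, 4 and 1 basis functions instead of 35 each time, which is where most of the reduction comes from.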
Dense vs. Sparse Kernels (Breuer et al. [2]):
• most kernels fastest if executed as dense matrix multiplications
• exploit zero blocks generated during recursive CK computation
• switch to sparse kernels depending on achieved time-to-solution
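The sparse-vs-dense decision can be sketched as a per-kernel heuristic: a sparse kernel wins only when its lower flop count outweighs its lower fraction of peak on wide-SIMD hardware. The efficiency fractions below are made-up illustration values, not measured numbers:

```python
def choose_kernel(nnz, rows, cols, vecs, dense_eff=0.8, sparse_eff=0.15):
    """Pick sparse vs dense for one matrix kernel applied to `vecs`
    DOF-vectors: dense costs 2*rows*cols*vecs flops at high efficiency,
    sparse costs 2*nnz*vecs flops at much lower efficiency."""
    dense_time = 2 * rows * cols * vecs / dense_eff
    sparse_time = 2 * nnz * vecs / sparse_eff
    return "sparse" if sparse_time < dense_time else "dense"

# a nearly empty matrix pays off as sparse, a moderately filled one does not
print(choose_kernel(nnz=24, rows=35, cols=35, vecs=9))
print(choose_kernel(nnz=400, rows=35, cols=35, vecs=9))
```

In practice (as the bullets above note) the decision is made from achieved time-to-solution rather than a static model, but the trade-off has this shape.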
Sparse, Dense → Block-Sparse

Consider equivalent sparsity patterns (Uphoff, [6]):
Graph representation and block-sparse memory layouts
Code Generator: Intrinsics → Assembler
Code Generator – Programming Interface
db = Tools.parseMatrixFile('matrices.xml')
Tools.memoryLayoutFromFile('layout.xml', db)
arch = Arch.getArchitectureByIdentifier('dhsw')

volume = db['kXiDivM'] * db['timeIntegrated'] * db['AstarT'] \
       + db['timeIntegrated'] * db['ET']

kernels = [('volume', volume)]
Tools.generate('path/to/output', db, kernels,
               'path/to/libxsmm_gemm_generator', arch)
Exploit efficient backend: libxsmm library [11]
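The interface above builds kernel descriptions from overloaded * and + operators. One way such a description could be assembled is via a small expression tree; the class names below are hypothetical illustrations, not the actual generator's API:

```python
class Expr:
    """Base class: overloaded operators build a kernel description tree."""
    def __mul__(self, other):
        return Node('*', self, other)
    def __add__(self, other):
        return Node('+', self, other)

class Matrix(Expr):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

class Node(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def __repr__(self):
        return f'({self.left} {self.op} {self.right})'

# mimic the volume-kernel expression from the slide
db = {n: Matrix(n) for n in ['kXiDivM', 'timeIntegrated', 'AstarT', 'ET']}
volume = db['kXiDivM'] * db['timeIntegrated'] * db['AstarT'] \
       + db['timeIntegrated'] * db['ET']
print(volume)
```

A generator can then walk this tree, pick memory layouts per matrix, and emit one small-GEMM call per node, which is where the libxsmm backend comes in.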
Benefit of High Order ADER-DG – Energy-Efficient
[Fig. 3: L∞-error of variable σyz (1E-10 to 1E-02) as a function of the consumed energy (75–1000 kJ) for the HSW machine in single- and double-precision, convergence orders 2–7.]
fully reached, as we align the elements' DOFs to the 32-byte boundary. This, in general, imposes some overheads with longer vectors, and switching from double to single precision naturally doubles the vector length.

Finally, in Fig. 4 we evaluated the measured power-to-solution curves by linear interpolation in log-log space at 150 kJ for all architectures (using on-demand frequency for WSM and SNB). For this energy budget we can now easily compare the achieved accuracy depending on convergence order and architecture. Note that we excluded settings beyond convergence and low-order settings not fitting into the memory of KNC due to slow convergence. Comparing O2 and O7 on HSW we see an error reduction of five orders of magnitude. For the cross-architecture comparison we select O6. According to Fig. 1 roughly the same efficiency can be reached on all platforms. In this case the error is reduced by a factor of 3.4 when switching from WSM to SNB and 4× when comparing SNB to KNC and HSW. These findings align well with Dennard scaling⁴ between SNB and HSW on an iso-frequency and iso-TDP level. From WSM to SNB we even see a superior scaling. This is due to the fact that we use full-box power measurements, but strictly speaking Dennard scaling applies to the processor only. Other parts of the system became more power-efficient, too, such that SNB clearly exceeds the WSM machine. One KNC coprocessor is able to achieve the same energy efficiency as the entire HSW machine. The reason for not (clearly) outperforming HSW is the (already analyzed) lower efficiency of the matrix kernels. This emphasizes that for compute-bound applications the fastest execution is also most likely the most energy-efficient one.

⁴ Doubling the number of transistors doubles the amount of computations within the same energy budget; number of transistors: WSM: 2×1.17 B, SNB: 2×2.26 B, HSW: 2×5.57 B.
• measure maximum error vs. consumed energy
• for increasing discretisation order on regular meshes
• here: dual-socket “Haswell” server, 36 cores @ 1.9 GHz
Benefit of High Order ADER-DG – Energy-Efficient
• high order (“compute”) beats high resolution (“memory”)
• ≈35% gain in energy-to-solution for single precision, but only for low order
Benefit of High Order ADER-DG – Compute-Bound
[Bar chart: measured GFlop/s per convergence order (2–7) at selected clock frequencies — 1.60/2.26/3.46 GHz, 1.20/2.00/2.60 GHz, 1.06 GHz and 1.20/1.90/2.30 GHz — on the WSM, SNB, KNC and HSW architectures.]
• measured “GFlop/s” and “MFlop/s per Watt” for Westmere, Sandy Bridge, Knights Corner and Haswell architectures [1]
• at selected clock frequencies and for different orders
• preference towards high order and low frequency on newest architectures
M. Bader et al. | Earthquake and tsunami simulation on Xeon Phi | CzeBaCCA Workshop | 29 June 2016 15
Part III
Accelerators – Dynamic Rupture Simulation on Xeon Phi Supercomputers

Heinecke, Breuer, Rettenberger, Gabriel, Pelties et al. [3]: Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers (Gordon Bell Prize Finalist 2014)
On the Road from Peta- to Exascale?

SuperMUC @ LRZ, Munich
• 9216 compute nodes (18 “thin node” islands), 147,456 Intel SNB-EP cores (2.7 GHz)
• Infiniband FDR10 interconnect (fat tree)
• #20 in Top 500: 2.897 PFlop/s

Stampede @ TACC, Austin
• 6400 compute nodes, 522,080 cores: 2 SNB-EP (8c) + 1 Xeon Phi SE10P per node
• Mellanox FDR 56 interconnect (fat tree)
• #8 in Top 500: 5.168 PFlop/s

Tianhe-2 @ NSCC, Guangzhou
• 8000 compute nodes used, 1.6 million cores: 2 IVB-EP (12c) + 3 Xeon Phi 31S1P per node
• TH2-Express custom interconnect
• #1 in Top 500: 33.862 PFlop/s
Optimization for Intel Xeon Phi Platforms
Offload Scheme:
• hide communication with Xeon Phi and between nodes
• use “heavy” CPU cores for dynamic rupture
Hybrid parallelism:
• on 1–3 Xeon Phis and host CPU(s)
• reflects multiphysics simulation
• manycore parallelism on Xeon Phi
[Diagram: offload scheme. Host: MPI communication and receiver output; dynamic rupture fluxes and fault output; time integration of MPI boundary cells; plot wave field (if required). PCIe transfers: download cells for receivers, DR and MPI; upload MPI-received cells; upload dynamic rupture updates; download all data (if required). Xeon Phi: time integration of non-MPI cells; volume integration; wave propagation fluxes; dynamic rupture updates; pack transfer data.]
Strong Scaling of Landers Scenario
[Plot: parallel efficiency (%, 40–100) vs. number of nodes (256–9216) for Stampede, SuperMUC (classic and grouped buffers) and Tianhe-2 with 1, 2 and 3 Xeon Phi cards]
• 191 million tetrahedrons; 220,982 element faces on fault
• 6th order, 96 billion degrees of freedom
• more than 85 % parallel efficiency on Stampede and Tianhe-2(when using only one Xeon Phi per node)
• multiple-Xeon-Phi performance suffers from MPI communication
• 3.3 PFlop/s on Tianhe-2 (7000 nodes)
• 2.0 PFlop/s on Stampede (6144 nodes)
• 1.3 PFlop/s on SuperMUC (9216 nodes)
Optimizing SeisSol for Xeon Phi (Knights Landing)
Heinecke et al., ISC 16 [7]
Code Generation:
• 512-bit wide vector processing unit
• profits from the Knights Landing optimizations of the libxsmm library [11]
Memory Optimization:

• examine impact of DDR4-only, CACHE and FLAT mode
• FLAT mode: careful placement of element-local matrices in MCDRAM (table from [7]):
order   Qk       Bk, Dk    Aξc_k, A−,i_k, A+,i_k    Kξc, Kξc, F−,i, F+,i,j,h
  2     MCDRAM   MCDRAM    MCDRAM                    MCDRAM
  3     MCDRAM   MCDRAM    MCDRAM                    MCDRAM
  4     DDR4     MCDRAM    MCDRAM                    MCDRAM
  5     DDR4     MCDRAM    DDR4                      MCDRAM
  6     DDR4     MCDRAM    DDR4                      MCDRAM

Table 1. Placements for all orders and the different data structures of SeisSol; DDR4/MCDRAM denotes whether a particular data structure is placed in DDR4 or MCDRAM.
For higher orders these matrices are bigger but have the same access frequency. These access patterns allow us to overcome the size limitation of the 16 GB MCDRAM by placing the ’slow-running’ data structures in DDR4. Therefore, in FLAT mode and for higher-order runs, we store Bk and/or Dk of every element in MCDRAM on the fly via the memkind library when computing them. As both memory types are seamlessly integrated into the architecture, we simply change the place of allocation, but not our macro-kernels. Thus pointers to Bk and/or Dk reference memory physically stored in MCDRAM, whereas Aξc_k, A−,i_k, A+,i_k and Qk reside in the DDR4 portion of the address space for orders O = 5 and O = 6. Additionally, we hold unique matrices (Kξc, Kξc, F−,i, F+,i,j,h, including the 48 flux matrices required for the neighboring elements’ contribution to the surface kernel (7)) in MCDRAM as well, as we expect local L2 cache evictions for higher orders. For lower orders, two to four, the bandwidth requirements of SeisSol for the element-local matrices and Qk increase; we therefore allocate more data structures in MCDRAM. In fact, for orders O = 2 and O = 3, all important data structures are placed in MCDRAM. Table 1 summarizes the placements used when running on KNL in FLAT mode.
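Table 1 amounts to a small per-order lookup; a sketch in Python (the group names are shorthand introduced here, not SeisSol identifiers):

```python
# FLAT-mode placement policy of Table 1 as a lookup table.
# Column order: Qk | Bk,Dk | star matrices A | global stiffness/flux matrices.
PLACEMENT = {
    2: ("MCDRAM", "MCDRAM", "MCDRAM", "MCDRAM"),
    3: ("MCDRAM", "MCDRAM", "MCDRAM", "MCDRAM"),
    4: ("DDR4",   "MCDRAM", "MCDRAM", "MCDRAM"),
    5: ("DDR4",   "MCDRAM", "DDR4",   "MCDRAM"),
    6: ("DDR4",   "MCDRAM", "DDR4",   "MCDRAM"),
}
GROUPS = ("Qk", "Bk,Dk", "A-matrices", "K,F-matrices")

def placement(order: int, group: str) -> str:
    """Return 'MCDRAM' or 'DDR4' for a data-structure group at a given order."""
    return PLACEMENT[order][GROUPS.index(group)]

print(placement(5, "Qk"))      # DDR4: slow-running data goes to DDR4
print(placement(5, "Bk,Dk"))   # MCDRAM: frequently streamed data stays fast
```

The point of encoding it this way: the allocation site is the only thing that changes per order; the macro-kernels are untouched, exactly as the excerpt above describes.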
4.3 Optimizing the Mesh Traffic and Prefetching
KNL’s last level cache (LLC) is not a shared cache level: it is implemented by a 2D mesh of up to 36 slices of 1 MB L2 cache each, cf. Sect. 2. These slices are kept coherent by a distributed tag directory in each tile’s CHA. As we pointed out in the last section, for orders higher than four the 48 flux matrices F+,i,j,h approach (500 KB for order five) or even exceed (1.5 MB for order O = 6) the size of one tile’s L2 cache. This can negatively affect the performance of (7) for two reasons: a) especially for order O = 6 this results in a high rate of CHA-to-CHA communication, as the unstructured mesh causes unstructured accesses to the flux matrices; b) the hardware prefetcher cannot pick up the unstructured accesses. Keeping the last section in mind, we know that we still have plenty of MCDRAM bandwidth available at higher orders. Therefore, we place several copies, one per two tiles, in MCDRAM. This ensures that the mesh traffic gets equally distributed and that the access latency is not limited by one CHA in the entire mesh holding the directory entries for one particular flux matrix. Additionally, we use modified matrix kernel operations in (7).
Performance Results on Knights Landing
Heinecke et al., ISC 16 [7]
[Bar chart, Fig. 9 from [7]: normalized time-to-solution speed-up over HSX for KNC and KNL (orders 2–6; DDR4, CACHE and FLAT modes; with AVX core frequency and with AVX Turbo Boost) when simulating the 1992 Landers scenario using global time stepping; KNL reaches up to ≈3.5×, KNC ≈1.0–1.2×]
The setup would easily fit into MCDRAM at any time, as the total memory consumption at order 6 is 7.1 GB.

The GTS performance of the 1992 Landers setup is provided in Fig. 9. As this is a multi-physics scenario, we expect slightly lower performance than for the earlier pure wave-propagation runs on a many-core processor, because the dynamic rupture portion of the solver requires high scalar performance. Here, KNL’s increased single-thread performance becomes visible: KNL recovers more than 92% of the pure wave-propagation speed-up over HSX, whereas the previous-generation KNC chip only attains 83%. This results in a relative performance comparable to HSX. KNL’s time-to-solution speed-up for executing the 1992 Landers earthquake simulations is 2.5–2.9× depending on the chosen order.
6 Conclusion

In this article, we presented a holistic optimization of SeisSol, a multi-physics simulation package for seismic simulations which tightly couples seismic wave propagation and dynamic rupture processes. First, we presented a deep dive into KNL’s architectural features and their challenges and opportunities for high-performance software. After a brief recapitulation of SeisSol’s mathematical background, we discussed in detail how to exploit KNL’s two VPUs per core efficiently and how to leverage both memory subsystems for a novel out-of-core implementation in SeisSol’s high-order wave propagation solver. The KNL-optimized implementation was evaluated for three different scenarios with distinct challenges and sizes. In the case of global time stepping runs, KNL was able to outperform its predecessor, KNC, by 2.9× and the current most powerful Intel Xeon processor, E5v3, by more than 3.4×. Even more important, in contrast to …
Landers scenario, 466,574 elements
Part IV
Current Work – Simulations in Symmetric Mode on Salomon

Rettenberger, Uphoff, Rannabauer;
Project CzeBaCCA: Czech-Bavarian Competence Centre for Supercomputing Applications
Modify Mesh Input and Load Distribution
Scalable Mesh Partitioning and Input Pipeline:

[Diagram: mesh input pipeline; a CAD model is meshed either serially (GAMBIT mesh file) or in parallel (SimModeler via the Simulation Modeling Suite C++ API), then processed by PUMGen and partitioned with ParMETIS for the simulation]
Towards Symmetric Mode:
• Modify weights for METIS graph partitioning: compensate speed differences between host CPU and Xeon Phi
• Work in progress: modify input of meshes
→ Xeon Phi mesh partitions may be read by host and sent via MPI
→ in case of bad I/O bandwidth (library support) of Xeon Phis
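Compensating the speed difference between host CPU and Xeon Phi can be expressed through the partitioner's target partition weights (METIS's `tpwgts` argument); a minimal sketch, where the relative speeds are made-up illustration values, not measurements:

```python
def target_partition_weights(rel_speed):
    """Fraction of the mesh each partition should receive, proportional
    to the relative speed of the device that will own it (the shape of
    METIS's tpwgts array)."""
    total = sum(rel_speed)
    return [s / total for s in rel_speed]

# Illustrative: one host rank assumed twice as fast as each of two
# Xeon Phi ranks.
print(target_partition_weights([2.0, 1.0, 1.0]))  # [0.5, 0.25, 0.25]
```

The same idea carries over to per-cell weights when cells differ in cost (e.g. dynamic rupture faces), which is the "modify weights" bullet above.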
Work in Progress: Modify Wave Field Output
Aggregation of MPI ranks to speed up I/O (Rettenberger [5]):
[Diagram: without aggregation, multiple SeisSol/HDF5 tasks write to one file-system block and one process writes to multiple blocks; with aggregation, dedicated I/O tasks (ranks 0, 3, 6, …) collect the data of several ranks and write block-aligned output]
Towards Symmetric Mode:
• Output routines aggregate data from several MPI ranks
→ match I/O block size to achieve substantial speedup
• Only use host MPI ranks for output
→ again in case of bad I/O bandwidth (library support) of Xeon Phis
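The rank-to-aggregator mapping sketched in the diagram (every third rank acts as I/O task) is a one-liner; a toy illustration, assuming fixed-size groups of consecutive ranks rather than SeisSol's actual scheme:

```python
def aggregator(rank: int, group_size: int) -> int:
    """The I/O task that collects this rank's output: the lowest rank in
    each group of `group_size` consecutive ranks."""
    return (rank // group_size) * group_size

# Six ranks, groups of three: ranks 0 and 3 become I/O tasks.
print([aggregator(r, 3) for r in range(6)])  # [0, 0, 0, 3, 3, 3]
```

Choosing `group_size` so that one aggregator's buffer matches the file-system block size is what yields the speedup mentioned above.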
First Runs on Salomon – Native and Symmetric
Setup: LOH4 benchmark, 250k elements, order 6, no output yet
[Plot: hardware TFlop/s (0.5–64, log scale) vs. number of nodes (2–64) for Host, MIC, MIC+MIC, Host+MIC and Host+MIC+MIC configurations, against ideal scaling]
[Same measurements plotted over the number of devices instead of the number of nodes]
Part V
sam(oa)2
Parallel Adaptive Mesh Refinement using Sierpinski Space-Filling Curves

O. Meister, K. Rahnema, M. Bader [4]: Parallel Memory Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells
sam(oa)2: Scalable Dynamic Adaptivity
Using Structured Triangular Meshes and Sierpinski Space-Filling Curve
Triangular Meshes Generated by Bisection
“Newest Vertex Bisection” Refinement & Sierpinski Curves:
• fully adaptive grid described by a corresponding refinement tree
• tree and grid cells traversed in Sierpinski order
• minimum memory requirements (1 bit per tree node?!)
→ triangle strips as data structure
• exploit for cache efficiency and parallelisation
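Newest-vertex bisection itself is only a few lines; a Python sketch under one common vertex-ordering convention (storing explicit coordinates for clarity, whereas sam(oa)2 stores only refinement bits):

```python
# Triangles are vertex triples (a, b, c) with c the "newest" vertex;
# the refinement edge is the opposite edge (a, b).
def bisect(tri):
    """Split one triangle into two children; the midpoint of the
    refinement edge becomes the newest vertex of both children."""
    a, b, c = tri
    m = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)  # midpoint of edge (a, b)
    return [(c, a, m), (b, c, m)]

# Two levels of uniform refinement of the unit right triangle:
mesh = [((0, 0), (1, 0), (0, 1))]
for _ in range(2):
    mesh = [child for tri in mesh for child in bisect(tri)]
print(len(mesh))  # 4 triangles after two bisection levels
```

Because each split produces exactly two children, the refinement tree is binary, which is what makes the 1-bit-per-node encoding of the bullet above possible.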
The Stack Principle for Data Exchange on Edges
[Figure: cells A, B, C along the Sierpinski traversal exchanging edge data via stacks]
• to compute numerical fluxes in Finite Volume and DG methods
• to synchronise refinement of neighbour cells (conforming grids)
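The stack principle can be illustrated on a 1D strip of cells: each interior edge is visited exactly twice along the traversal, so the first visit pushes the left cell's value and the second visit pops it to form the flux, with no neighbour indices or global edge numbering. A toy sketch (sam(oa)2 uses several coloured stacks for the 2D Sierpinski traversal; this is the 1D essence only):

```python
def fluxes_via_stack(cell_values):
    """Edge 'fluxes' for a strip of cells using only a stack."""
    stack, fluxes = [], []
    for q in cell_values:
        if stack:                      # second visit of the edge on the left
            q_left = stack.pop()
            fluxes.append(q - q_left)  # toy "flux": jump across the edge
        stack.append(q)                # first visit of the edge on the right
    return fluxes

print(fluxes_via_stack([1.0, 4.0, 2.0]))  # [3.0, -2.0]
```

The same push/pop discipline also serves the second bullet: refinement flags travel across edges without any explicit neighbour lookup.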
Stream- and Stack-based Processing
[Diagram: element-wise processing between an input stream and an output stream via a system of stacks; cells are added for refinement and deleted for coarsening]
• drop fixed memory location of variables!
• persistent (degrees of freedom) vs. non-persistent (residuals, “old” variables) data
• aim: reduce memory footprint and number of traversals
• strongly element-oriented; impedes vectorization over elements
Partitions & Load Balancing: Sierpinski Sections
Use Sierpinski Space-Filling Curve for Partitioning:
• remeshing requires three sections: “add”, “keep”, and “cut”
• generalized towards restructuring of Sierpinski sections: turn M sections into N (balanced) sections
• migrate only sections to simplify data transfer (tolerate sacrifice on load balancing)
• flexible concept of “load”: number of cells, weight per cell, average measured runtime
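Because the Sierpinski curve linearizes the cells, turning M sections into N balanced sections reduces to cutting a 1D sequence at prefix-sum targets; a simplified sketch of the cut computation only (sam(oa)2 additionally migrates whole sections between ranks):

```python
from itertools import accumulate
from bisect import bisect_left

def rebalance(cell_loads, n_sections):
    """Cut the Sierpinski-ordered cell sequence into n contiguous sections
    whose loads are as equal as prefix-sum splitting allows."""
    prefix = list(accumulate(cell_loads))
    total = prefix[-1]
    cuts = [bisect_left(prefix, total * k / n_sections) + 1
            for k in range(1, n_sections)]
    bounds = [0] + cuts + [len(cell_loads)]
    return [cell_loads[bounds[i]:bounds[i + 1]] for i in range(n_sections)]

# Six cells with per-cell loads; both sections end up with load 4:
print(rebalance([1, 1, 1, 1, 2, 2], 2))  # [[1, 1, 1, 1], [2, 2]]
```

The "flexible concept of load" bullet maps directly onto `cell_loads`: unit weights count cells, while per-cell weights or measured runtimes just change the numbers fed in.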
Load Balancing Using Sierpinski Index Sections
• Adaptive traversal: refine/coarsen cells
• Load balancing: estimate new load and exchange sections
• Resize sections
• Compute (mark cells for refinement/coarsening) …

[Figure: per-cell refinement marks (+1 −1 +3 −1 −1 −1 −1 −1 −1 +1 +2 +0) along the cell sequence determine the estimated new load]
Part VI
Dynamically Adaptive Tsunami Simulation – Performance, Scalability, Patches

Oliver Meister, Kaveh Rahnema, Chaulio Ferreira
Tsunami Simulation: Vectorization vs. AMR
[Plot: million element updates per second per core vs. elements per dimension (400–4000) for a shallow water simulation; curves for no vectorization, SSE4 and AVX, each on 1 and on 16 cores]
Vectorization Imperative on Modern CPU Architectures:
• example above: shallow water simulation on static Cartesian mesh
• speed-up due to intrinsics implementation of augmented Riemann solver
• sam(oa)2: mesh adaptivity impedes vectorization over grid cells
• possible remedy: introduce regularly refined patches
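The patch remedy works because cells stored contiguously per patch turn the inner loop into a fixed-trip-count, unit-stride loop, exactly the shape compilers and SIMD intrinsics need. A toy sketch (the decay update is an invented stand-in for the Riemann-solver-based update, and the patch size is illustrative):

```python
# Patches of a fixed number of cells, stored contiguously.
PATCH_SIZE = 8

def update_patch(patch, dt):
    # fixed trip count, unit stride over one patch: vectorizable shape
    return [h - dt * h for h in patch]

def update_mesh(patches, dt):
    # adaptivity lives at patch granularity; cells inside stay regular
    return [update_patch(p, dt) for p in patches]

mesh = [[float(i + PATCH_SIZE * p) for i in range(PATCH_SIZE)]
        for p in range(4)]
out = update_mesh(mesh, 0.5)
print(out[0][2])  # 1.0: cell value 2.0 halved
```

The trade-off is that refinement and coarsening now operate on whole patches, so some cells are over-refined; the measurements on the next slides quantify whether the SIMD gain outweighs that.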
Using Patches – First Results on Xeon Phi
SuperMIC: 2 × Xeon Phi 5110P (60 cores, 1.1 GHz) (Chaulio Ferreira)
Simple f-wave Riemann solver:
• solid speedup, but away from optimum (vector width) and not perfectly scaling (single node, shared memory bandwidth)
• need to study influence of memory-bound parts, etc.
Using Patches – First Results on Xeon Phi
SuperMIC: 2 × Xeon Phi 5110P (60 cores, 1.1 GHz) (Chaulio Ferreira)
Vector HLLE solver: (compared against augmented Riemann)
• relatively larger speed-up (more compute-bound than f-wave)
• best time-to-solution for patch size 8; detailed analysis “to do”
• from ∼12 to ∼45 million element updates per second per node
Load Balancing for Symmetric Mode
SuperMIC: 2× Ivy Bridge (8 cores, 2.6 GHz) plus 2× Xeon Phi 5110P (60 cores, 1.1 GHz)
• guided (2:1:1) load balancing between host (1 MPI rank) and Xeon Phis (2 ranks) clearly superior to homogeneous (1:1:1)
• “automatic” (measured-runtime) load balancing finds 42:29:29
• suspect heavy losses due to communication
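The guided and automatic ratios above translate into per-rank cell counts as follows; a small sketch with largest-remainder rounding so the counts sum up exactly (the cell count 1000 is illustrative):

```python
def split_by_ratio(n_cells, ratio):
    """Distribute n_cells across ranks according to speed ratios,
    largest-remainder rounding so the counts sum to n_cells exactly."""
    total = sum(ratio)
    exact = [n_cells * r / total for r in ratio]
    counts = [int(e) for e in exact]
    for _ in range(n_cells - sum(counts)):   # hand out remainders
        i = max(range(len(ratio)), key=lambda k: exact[k] - counts[k])
        counts[i] += 1
    return counts

print(split_by_ratio(1000, [2, 1, 1]))     # guided:    [500, 250, 250]
print(split_by_ratio(1000, [42, 29, 29]))  # automatic: [420, 290, 290]
```

That the measured-runtime ratio lands at 42:29:29 rather than the prescribed 50:25:25 is what points to communication overhead eating into the hosts' advantage.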
Symmetric Mode – Multiple Nodes
SuperMIC: 2× Ivy Bridge (8 cores, 2.6 GHz) plus 2× Xeon Phi 5110P (60 cores, 1.1 GHz)

• strong scaling currently impeded by slow communication
• fall back to load balancing only every 10 time steps
• will need to test with communication proxy / on Salomon
Part VII
Conclusions and Outlook
Conclusions on SeisSol
Performance Optimisation on Multi&Manycore Platforms:
• high convergence order and high computational intensity of ADER-DG→ compute-bound performance on current and imminent CPUs
• code generation to accelerate element kernels• careful tuning and parallelisation of the entire simulation pipeline
(scalable mesh input, output and checkpointing)
Xeon Phi Platforms
• offload scheme scaled to 1.5 million cores (Tianhe-2, Stampede)• our goal: scale in symmetric mode on heterogeneous supercomputers→ current work on SuperMIC and esp. Salomon
• heterogeneity challenges exist in load balancing and scalable I/O• SeisSol runs on Knights Landing→ ISC’16 [7] and KNL-Book [8]
(Preliminary) Conclusions on sam(oa)2
High-Performance Parallel AMR:• making dynamically adaptive simulation codes achieve high performance
is hard work• space-filling-curve approaches for hybrid parallelism→ sam(oa)2
• load balancing in each time step is feasible (and hybrid parallelism helps)
Performance Challenges for Modern Heterogeneous Platforms:• crucial performance question #1:
Where can I apply SIMD parallelism for dynamic adaptivity?
• patches offer additional opportunity for vectorization(similar: layered models→ 2D adaptivity for 3D problems)
• crucial performance question #2:How do I effectively load balance with heterogeneity?
• hoping for automagic (runtime-based) load balancing;→ prescribed weights currently more successful
Acknowledgements
Special thanks go to . . .• the entire SeisSol team and all contributors, esp.:
– Alex Breuer, Sebastian Rettenberger, Carsten Uphoff– Alex Heinecke– Alice Gabriel, Christian Pelties, Stephanie Wolherr
• the sam(oa)2 team:– Oliver Meister, Kaveh Rahnema, Chaulio Ferreira
• all colleagues from the Leibniz Supercomputing Centre• all colleagues from the IT4Innovations Supercomputing Centre• for financial and project support:
– Intel Corporation (IPCC ExScaMIC)
– Volkswagen Foundation (project ASCETE)
– BMBF (project CzeBaCCA)
Publications[1] A. Breuer, A. Heinecke, L. Rannabauer, M. Bader: High-Order ADER-DG Minimizes Energy-
and Time-to-Solution of SeisSol. In: High Performance Computing, Proceedings of ISC 15,LNCS 9137, p. 340–357, 2015.
[2] A. Breuer, A. Heinecke, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties: SustainedPetascale Performance of Seismic Simulations with SeisSol on SuperMUC. In:Supercomputing, LNCS 8488, p. 1–18. PRACE ISC Award 2014.
[3] A. Heinecke, A. Breuer, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth,X.-K. Liao, K. Vaidyanathan, M. Smelyanskiy, P. Dubey: Petascale High Order DynamicRupture Earthquake Simulations on Heterogeneous Supercomputers. Gordon Bell PrizeFinalist 2014.
[4] O. Meister, K. Rahnema, M. Bader: Parallel Memory Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells. ACM Transactions on Mathematical Software (TOMS), accepted.
[5] S. Rettenberger, M. Bader: Optimizing Large Scale I/O for Petascale Seismic Simulations on Unstructured Meshes. 2015 IEEE International Conference on Cluster Computing (CLUSTER), p. 314–317. IEEE Xplore, 2015.
[6] C. Uphoff, M. Bader: Generating high performance matrix kernels for earthquake simulationswith viscoelastic attenuation. The 2016 International Conference on High PerformanceComputing & Simulation (HPCS 2016), p. 908–916.
Publications and References[7] A. Heinecke, A. Breuer, M. Bader: High Order Seismic Simulations on the Intel Xeon Phi
Processor (Knights Landing). ISC High Performance, 2016.[8] A. Heinecke, A. Breuer, M. Bader: High Performance Seismic Simulations. In J. Jeffers, J.
Reinders, A. Sodani (ed.), Intel Xeon Phi Processor High Performance Programming –Knights Landing Edition, ch. 21. Morgan Kaufmann, 2016.
[9] M. Dumbser, M. Kaser: An arbitrary high-order discontinuous Galerkin method for elasticwaves on unstructured meshes – II. The three-dimensional isotropic case. Geophys. J. Int.167(1), 2006.
[10] D. George: Augmented Riemann solvers for the shallow water equations over variabletopography with steady states and inundation.J. Comput. Phys. 227(6), 2008.
[11] A. Heinecke, G. Henry, M. Hutchinson, H. Pabst: LIBXSMM: Accelerating Small MatrixMultiplications by Runtime Code Generation, SC16, accepted.
[12] C. Pelties, A.-A. Gabriel, J.-P. Ampuero: Verification of an ADER-DG method for complexdynamic rupture problems, Geoscientific Model Development, 7(3), p. 847–866.
[13] C. Pelties, J. de la Puente, J.-P. Ampuero, G. B. Brietzke, M. Kaser: Three-dimensionaldynamic rupture simulation with a high-order discontinuous Galerkin method on unstructuredtetrahedral meshes. J. Geophys. Res.: Solid Earth, 117(B2), 2012.