BEMFMM: An FMM-Accelerated Boundary Element
Method-Based Solver for the 3D Helmholtz Equation
Mustafa Abduljabbar, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan
Bagci, and David Keyes
April 24, 2018
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Tokyo Institute of Technology, Tokyo, Japan
Highlights
- FMM-based solver for BEM discretizations of oscillatory operators.
  • Demonstrated for acoustics in Helmholtz formulation.
  • Scattering from a sphere (exact solution available).
- Three levels of parallelism: MPI + X + Y.
- Three contemporary Intel architectures: Haswell, Skylake, KNL.
- Up to 2 billion Degrees-of-Freedom and up to 0.2 million cores of Shaheen XC40.
Outline
Introduction and Background
Extreme Scale Implementation and Optimizations
Shared-memory Optimizations
Distributed-memory Optimizations
Evaluation and Discussion
Shared-memory Optimizations
Distributed-memory Optimizations
Concluding Remarks and Future Work
Motivation
- Validate FMM-based horizontal and vertical parallelism techniques on modern many/multi-core architectures and on 196,608 hardware cores of the Shaheen XC40 supercomputer.
- Propose a performance model that estimates a near-optimal granularity of recursive task creation during tree traversal.
- Describe the tradeoffs of the various modes of singularity treatment in the BEM.
Problem Statement and Formulation [1/3]
• Time-dependent acoustic wave propagation is governed by the wave equation

\nabla^2 U(\mathbf{r}, t) - \frac{1}{c^2} \frac{\partial^2 U(\mathbf{r}, t)}{\partial t^2} = 0    (1)

• The time-harmonic form, obtained by substituting U(\mathbf{r}, t) = Re[U_0(\mathbf{r}) e^{-j\omega t}] into Eq. 1, is the Helmholtz equation

\nabla^2 U_0(\mathbf{r}) + k^2 U_0(\mathbf{r}) = 0    (2)
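As a sanity check on the step from Eq. 1 to Eq. 2, a short derivation (added here; the identification k = ω/c is the standard convention and is not spelled out on the slide):

\nabla^2 \left[ U_0(\mathbf{r}) e^{-j\omega t} \right] - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} \left[ U_0(\mathbf{r}) e^{-j\omega t} \right]
= \left[ \nabla^2 U_0(\mathbf{r}) + \frac{\omega^2}{c^2} U_0(\mathbf{r}) \right] e^{-j\omega t} = 0
\quad\Longrightarrow\quad
\nabla^2 U_0(\mathbf{r}) + k^2 U_0(\mathbf{r}) = 0, \qquad k = \frac{\omega}{c}.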
Problem Statement and Formulation [2/3]
- Applying the second form of Green's theorem yields the surface integral equation

p^{\mathrm{inc}}(\mathbf{r}) + \int_S \left[ \frac{\partial G(\mathbf{r}, \mathbf{r}')}{\partial n'} p(\mathbf{r}') - G(\mathbf{r}, \mathbf{r}') q(\mathbf{r}') \right] dS' = \frac{1}{2} p(\mathbf{r}), \quad \mathbf{r} \in S    (3)

G(\mathbf{r}, \mathbf{r}') = \frac{e^{jkR}}{4\pi R}    (4)

- If we set p = 0 on S, Eq. 3 reduces to

\int_S G(\mathbf{r}, \mathbf{r}') q(\mathbf{r}') \, dS' = p^{\mathrm{inc}}(\mathbf{r})    (5)
Problem Statement and Formulation [3/3]
• Wave scattering and applications.
  High-intensity focused ultrasound1,2

1 T. Betcke, E. van 't Wout and P. Gélat. Computationally efficient boundary element methods for high-frequency Helmholtz problems in unbounded domains, in: Modern Solvers for Helmholtz Problems, Springer (2017).
2 E. van 't Wout, P. Gélat, T. Betcke and S. Arridge. A fast boundary element method for the scattering analysis of high-intensity focused ultrasound, Journal of the Acoustical Society of America 138(5) (2015), pp. 2726–2737.
System Overview
[Figure 1: Dataflow across libraries (color-coded): Gmsh mesh input, normalization, BEMFMM, PETSc GMRES iteration, FMM wrapper, ExaFMM.]
Discretization of the Scatterer’s Surface
- Divide the surface into curvilinear triangular patches.
- Each curvilinear patch has Ni interpolation points.
- Using the Nyström method, the unknown velocity is expanded as the interpolation

q(\mathbf{r}') = \sum_{n=1}^{N_p} \sum_{i=1}^{N_i} \vartheta^{-1}(\mathbf{r}') \, L_{(i,n)}(\zeta, \eta) \, \{I\}_{(i,n)}    (6)

Z I = V^{\mathrm{inc}}    (7)

[Z]_{(j,m)(i,n)} = \int_{\Delta_n} G(\mathbf{r}_{(j,m)}, \mathbf{r}') \, \vartheta^{-1}(\mathbf{r}') \, L_{(i,n)}(\zeta, \eta) \, d\mathbf{r}'    (8)
FMM in Numerical Science
- FMM accelerates solutions that can be written (to some extent) as a boundary integral equation of the form

u(\mathbf{x}) = \int_{\Gamma} \Phi(\mathbf{x}, \mathbf{y}) \, q(\mathbf{y}) \, dS(\mathbf{y})

- Examples of FMM use cases:

Application                                Kernel      Single/Double
Gravitational potential                    Laplace     Single
Electrostatic field                        Laplace     Double
Acoustic scattering field (low frequency)  Helmholtz   Single
Electromagnetic scattering field           Helmholtz   Double
FMM as an Accelerator for the 3D Helmholtz Solution
- FMM works as a matrix-free accelerator for the mat-vec multiplication (or IFMM as a preconditioner [Takahashi et al., 2017]); a coupling sketch follows the equations below.
- The discretized BIE yields a dense, but structured, matrix.
- Example: impulse response due to a monopole source.

p(\mathbf{r}) = \sum_{j=1}^{N_s} G(\mathbf{r}, \mathbf{r}'_j) \, q(\mathbf{r}'_j)    (9)

- Eq. 9 is expanded into a series of spherical harmonics:

p(\mathbf{r}) = ik \sum_{n=0}^{\infty} \sum_{m=-n}^{n} S_n^{-m}(\mathbf{r}_j) \, R_n^m(\mathbf{r}), \quad r \le r_q    (10)

p(\mathbf{r}) = ik \sum_{n=0}^{\infty} \sum_{m=-n}^{n} C_n^m \, R_n^m(\mathbf{r})    (11)

C_n^m = \sum_{r_q < R_{\max}} Q_q \, S_n^{-m}(\mathbf{r}_q)    (12)
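The matrix-free coupling mentioned above can be illustrated with a PETSc shell matrix. This is a minimal sketch, not the BEMFMM source: FMMCtx and fmm_evaluate() are hypothetical stand-ins for the ExaFMM wrapper; only the PETSc shell-matrix calls (MatCreateShell, MatShellSetOperation, KSPSetOperators) are standard API.

#include <petscksp.h>

/* Hypothetical context and evaluation routine standing in for the ExaFMM wrapper. */
typedef struct { void *fmm; } FMMCtx;
void fmm_evaluate(void *fmm, const PetscScalar *q, PetscScalar *p);

/* Matrix-free mat-vec y = Z x: the dense BEM matrix Z is never assembled. */
static PetscErrorCode MatMult_FMM(Mat A, Vec x, Vec y)
{
  FMMCtx            *ctx;
  const PetscScalar *xa;
  PetscScalar       *ya;
  PetscErrorCode     ierr;
  ierr = MatShellGetContext(A, &ctx); CHKERRQ(ierr);
  ierr = VecGetArrayRead(x, &xa); CHKERRQ(ierr);
  ierr = VecGetArray(y, &ya); CHKERRQ(ierr);
  fmm_evaluate(ctx->fmm, xa, ya);   /* near field directly, far field via FMM */
  ierr = VecRestoreArray(y, &ya); CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(x, &xa); CHKERRQ(ierr);
  return 0;
}

/* Registration sketch: GMRES then sees an ordinary Mat.
 *   MatCreateShell(comm, nloc, nloc, N, N, &ctx, &A);
 *   MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MatMult_FMM);
 *   KSPSetType(ksp, KSPGMRES);  KSPSetOperators(ksp, A, A);
 */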
An Overview of FMM
[Figure: Quad-tree partitioning on 4 processes, and the FMM hierarchy (upward, horizontal, and downward sweeps) built from the P2M, M2M, M2L, P2P, L2L, and L2P kernels.]
Traversal Stage
Traditional FMM: iterate over the Hilbert/Morton order and probe each cell for its neighbors [MS Warren et al., 1995].

[Figure: Morton-key cell ordering in the original Warren and Salmon FMM.]

Dual Tree Traversal (DTT)3: simultaneously traverse the source and target trees in a pre-order, depth-first manner (a recursion sketch follows below).

[Figure: Dual tree traversal.]
3Mustafa Abduljabbar, Mohammed Al Farhan, Rio Yokota and David Keyes. Performance Evaluation of Computation and Communication Kernels
of the Fast Multipole Method on Intel Manycore Architecture, Proceedings of the European Conference on Parallel Processing (2017) 553–564.
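A minimal sketch of the DTT recursion referenced above; the Cell struct and the accept()/M2L()/P2P() routines are illustrative placeholders (the actual admissibility test and kernels live in ExaFMM), not the released data structures.

typedef struct Cell {
  struct Cell *child[8];
  int          nchild;      /* 0 for a leaf */
  double       radius;      /* cell bounding radius */
} Cell;

int  accept(const Cell *Ci, const Cell *Cj);  /* multipole acceptance criterion */
void M2L(Cell *Ci, const Cell *Cj);           /* far-field approximation        */
void P2P(Cell *Ci, const Cell *Cj);           /* direct near-field kernel       */

/* Pre-order, depth-first traversal over (target, source) cell pairs. */
void dual_tree_traversal(Cell *Ci, Cell *Cj)
{
  if (accept(Ci, Cj)) {
    M2L(Ci, Cj);                              /* well separated: approximate   */
  } else if (Ci->nchild == 0 && Cj->nchild == 0) {
    P2P(Ci, Cj);                              /* two leaves: direct summation  */
  } else if (Cj->nchild == 0 ||
             (Ci->nchild != 0 && Ci->radius >= Cj->radius)) {
    for (int i = 0; i < Ci->nchild; ++i)      /* split the larger cell         */
      dual_tree_traversal(Ci->child[i], Cj);
  } else {
    for (int j = 0; j < Cj->nchild; ++j)
      dual_tree_traversal(Ci, Cj->child[j]);
  }
}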
P2P/M2L: Data-level Parallelism
Auto-vectorization with #pragma simd can almost always win in embarrassingly parallel kernels with structured memory access; however:
• Check the icc compiler's output with -qopt-report. By default it vectorized the inner-most loop (suboptimal here).
• Double-check vector loads for array-of-structs layouts (an SoA illustration follows the kernel below).
• Rewrite divisions and complex numbers in polar form (experimentally, a 20% reduction in vector moves).
for ( ; i < ni; ++i) {
  vir = real(Bi.SRC); vii = imag(Bi.SRC);
  potr = 0; poti = 0;                      // accumulators for this target
  for (j = 0; j < nj; ++j) {
    dX = xi - xj; R2 = norm(dX);
    // relay self-singularity to PETSc callback
    if (Bi.PATCH != Bj.PATCH && R2 != 0) {
      real_t R = sqrt(R2);
      if (R <= near_patch_distance) {
        for (k = 0; k < gauss_quad_points; ++k) {
          // near-patch singularity treatment
        }
      } else {
        vjr = real(Bj.SRC); vji = imag(Bj.SRC);
        src2r = vir * vjr - vii * vji;
        src2i = vir * vji + vii * vjr;
        invR  = 1.0 / sqrt(R2);            // 1/R
        eikr  = 1.0 / exp(wave_i * R);     // e^{-Im(k) R}
        eikr *= invR;
        eikrr = cos(wave_r * R) * eikr;    // Re of e^{jkR}/R
        eikri = sin(wave_r * R) * eikr;    // Im of e^{jkR}/R
        potr += src2r * eikrr - src2i * eikri;
        poti += src2r * eikri + src2i * eikrr;
      }
    }
  }
  Bi.TRG += complex(potr, poti);
}
Helmholtz S2T Kernel
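As an illustration of the precautions above (not taken from the BEMFMM source): the far-field interaction rewritten over structure-of-arrays inputs, with an explicit simd reduction over the source loop so the compiler does not settle for an inner quadrature loop. The function and parameter names are assumptions; compile with icc and -qopt-report to confirm which loop was vectorized.

#include <math.h>

/* Simplified single-target Helmholtz P2P over SoA source arrays (illustrative).
   SoA avoids the strided vector loads that an array-of-structs layout forces;
   kr/ki are the real/imaginary parts of the wavenumber, as in the kernel above. */
void p2p_soa(int nj, const float *xj, const float *yj, const float *zj,
             const float *qr, const float *qi,
             float xi, float yi, float zi, float kr, float ki,
             float *potr_out, float *poti_out)
{
  float potr = 0.0f, poti = 0.0f;
#pragma omp simd reduction(+:potr, poti)
  for (int j = 0; j < nj; ++j) {
    float dx = xi - xj[j], dy = yi - yj[j], dz = zi - zj[j];
    float R    = sqrtf(dx * dx + dy * dy + dz * dz);
    float invR = (R > 0.0f) ? 1.0f / R : 0.0f;   /* mask the self interaction  */
    float amp  = expf(-ki * R) * invR;           /* |e^{jkR}| / R (polar form) */
    float gr   = cosf(kr * R) * amp;             /* Re of e^{jkR}/R            */
    float gi   = sinf(kr * R) * amp;             /* Im of e^{jkR}/R            */
    potr += qr[j] * gr - qi[j] * gi;
    poti += qr[j] * gi + qi[j] * gr;
  }
  *potr_out = potr;
  *poti_out = poti;
}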
Traversal: Thread-level Parallelism
- Control cell size and task granularity by minimizing the difference between the LLC size and the size of the interacting cells, i.e., minimize the cache-miss rate (a sketch of evaluating this model follows below).

\min_{s,c} f(s, c) = (\mathrm{M2L}_{\mathrm{size}} + \mathrm{P2P}_{\mathrm{size}}) - \mathrm{LLC}_{L2/L3}
                   = 2 \times c \times \mathrm{task\_size} \times n_{\mathrm{threads/core}} \times [(c_{\mathrm{size}} \times \log_s c) + b_{\mathrm{size}}] - \mathrm{LLC}_{L2/L3}    (13)

where:
- the multiplier 2 accounts for both sources and targets,
- c_size / b_size is the cell / body struct size,
- s is the task-spawning parameter,
- c is the number of bodies per leaf cell,
- log_s c is the depth of the recursive branch,
- LLC is the L2/L3 last-level cache size.
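A sketch of how Eq. 13 can be evaluated to pick the spawning parameter s and the leaf size c. It interprets "minimize the difference" as minimizing the absolute gap between the per-task footprint and the LLC; the candidate ranges and constants are illustrative, not values from the paper.

#include <math.h>

/* Per-task working-set estimate from Eq. 13 (all sizes in bytes). */
static double footprint(double s, double c, double task_size, double threads_per_core,
                        double cell_bytes, double body_bytes)
{
  return 2.0 * c * task_size * threads_per_core
       * (cell_bytes * (log(c) / log(s)) + body_bytes);
}

/* Scan power-of-two candidates for the (s, c) whose footprint is closest to the LLC. */
void pick_granularity(double task_size, double threads_per_core,
                      double cell_bytes, double body_bytes, double llc_bytes,
                      int *s_out, int *c_out)
{
  double best = INFINITY;
  *s_out = 2; *c_out = 64;
  for (int s = 2; s <= 512; s *= 2)
    for (int c = 64; c <= 2048; c *= 2) {
      double gap = fabs(footprint(s, c, task_size, threads_per_core,
                                  cell_bytes, body_bytes) - llc_bytes);
      if (gap < best) { best = gap; *s_out = s; *c_out = c; }
    }
}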
Large Mesh Partitioning
- Pre-partitioning:
• Create intermediate format with
minimal indexed binary data
(double-precision coordinates).
• Map each rank to its region in the file (a minimal I/O sketch follows after this list).
- Partitioning:
• Separate Global/Local trees using
modified ORB [Abduljabbar et al.,
2017].
• Graft tree in one step when doing
global traversal.
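A minimal sketch of the pre-partitioning file mapping mentioned above: each rank reads its own slice of the intermediate binary file. The layout of npoints x 3 doubles and the function name are assumptions; only the MPI-IO calls themselves are standard.

#include <mpi.h>
#include <stdlib.h>

/* Each rank reads its contiguous slice of a binary file holding npoints * 3
   double-precision coordinates (x, y, z per point); the layout is assumed. */
double *read_my_region(const char *fname, long long npoints, long long *nlocal_out)
{
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  long long chunk  = (npoints + size - 1) / size;   /* points per rank */
  long long begin  = (long long)rank * chunk;
  long long nlocal = (begin >= npoints) ? 0 :
                     (begin + chunk <= npoints ? chunk : npoints - begin);

  double  *coords = (double *)malloc((size_t)(3 * nlocal) * sizeof(double));
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  MPI_File_read_at_all(fh, (MPI_Offset)(3 * begin) * (MPI_Offset)sizeof(double),
                       coords, (int)(3 * nlocal), MPI_DOUBLE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  *nlocal_out = nlocal;
  return coords;
}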
[Figure: LET grafting: local tree, process tree, and local essential tree with local root nodes, shown across ranks (rank 0, my rank, rank 2).]
Load-balancing (Repartitioning)
- Load imbalance harms GMRES performance, because every step waits on a global reduction.
- Optimal partitioning for random distributions is NP-hard.
- Consider the previous workload and communication:

w_i = l_i + \alpha \, r_i    (14)

- The variables l_i and r_i, and the total runtime, are already measured in the present code, so the information is available at negligible cost.
- Control the granularity of partitioning based on the load imbalance (a sketch of the weight computation follows the figure below).
[Figure: Partitioning granularity, from coarse to fine: defer partitioning for a few steps; weighted Morton/Hilbert partitioning every time step; migration within a time step (node failure).]
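A sketch of the weight in Eq. 14 and of a deferral criterion built from the measurements the slides say are already available (l_i, r_i, per-step runtime); alpha and the imbalance threshold are illustrative, not values from the code.

#include <mpi.h>

/* Weight for the next weighted Morton/Hilbert partition (Eq. 14): l_i is the
   measured local work, r_i the measured communication, alpha a tunable factor. */
double repartition_weight(double l_i, double r_i, double alpha)
{
  return l_i + alpha * r_i;
}

/* Coarse-grained control: defer repartitioning while the measured imbalance
   (max/avg - 1) stays below a threshold, e.g. 0.1. */
int need_repartition(double my_step_time, double imbalance_threshold)
{
  int size;
  double tmax, tsum;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Allreduce(&my_step_time, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(&my_step_time, &tsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return (tmax / (tsum / size) - 1.0) > imbalance_threshold;
}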
A Neighborhood-based Communication Protocol (HSDX)4
4Mustafa Abduljabbar, George S. Markomanolis, Huda Ibeid, Rio Yokota and David Keyes. Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions, Proceedings of the International Supercomputing Conference (2017), 79–96.
Hardware Specifications
                   KNL            Haswell        Skylake
Family             x200           E5V3           Scalable
Model              7290           2670           8176
Socket(s)          1              2              2
Cores              72             32             56
GHz                1.50           2.60           2.10
Watts/socket       245            120            165
DDR4 (GB)          192            128            264
Frequency Driver   acpi-cpufreq   acpi-cpufreq   acpi-cpufreq
Max GHz            1.50           2.60           2.10
Governor           conservative   performance    ondemand
Turbo Boost        X              X              X
Data-level Parallelism
[Figure: Single-precision floating-point performance (GFLOP/s) vs. N for no-vectorization, intrinsics, and auto-vectorization against the theoretical peak. Left: Skylake, 112 threads; right: KNL (quadrant/flat mode), 288 threads.]
- Lower latency and higher throughput for AVX512 in Skylake.
- Increased cache miss penalty in KNL.
Thread-level Parallelism [1/2]
[Figure: Experimental results vs. the performance model on Skylake (112 threads): heatmaps over task size (16–512) and grain size (64–2,048).]
Thread-level Parallelism [2/2]
[Figure: Experimental results vs. the performance model on KNL (quadrant/flat mode, 288 threads): heatmaps over task size (16–512) and grain size (64–2,048).]
Cray XC40 Characteristics
- Distributed experiments are run on Shaheen XC40, which hosts 196,608 Haswell cores and has a LINPACK performance of 5.5 PFlop/s.
- Strong scaling is challenging in traditional FMM codes.
XC40 Network Hierarchy
Level   Hardware/Network Unit   Nodes   Cores     Overhead   Hops
1       Socket                  1       16        32         N/A
2       NUMA Node               1       32        64         N/A
3       Blade                   4       128       256        1
4       Chassis                 64      2,048     4,096      1
5       Cabinet                 192     6,144     8,192      1
6       Local alltoall G        384     12,288    16,384     1
7       Global alltoall G1      2,304   74,728    131,072    2
8       Global alltoall G2      6,174   197,568   N/A        3
Weak Scalability
• A single scattering sphere, to validate the concept and the convergence.
• We fix the problem size per core while keeping 10 points per wavelength (1.0e−4 accuracy vs. the analytical solution).

[Figure: Weak scalability: mean time [s] vs. number of cores (32 to 131,072) against ideal scaling, for problem sizes growing from 561,540 to 2,300,067,840 DoF, with parallel efficiencies between 0.95 and 1.06.]
Cores     Mem. [GB]   Freq. [kHz]   N (DoF)          T [s]    Iters
512       2,048       24            8,984,640        3.1e3    185
2,048     8,192       96            35,938,560       3.3e3    190
8,192     32,768      384           143,754,240      3.6e3    200
32,768    131,072     1,536         575,016,960      4.6e3    223
131,072   524,288     6,144         2,300,067,840    5.7e3    256
Strong Scalability [1/2]
[Figure: Strong scalability (speedup vs. number of cores) against ideal scaling: 2,300,067,840 DoF on 16,384–196,608 cores (parallel efficiency 1.0 down to 0.63); 575,016,960 DoF on 8,192–196,608 cores (down to 0.37); 143,754,240 DoF on 1,024–131,072 cores (down to 0.37).]
Strong Scalability [2/2]
[Figure: Strong scalability (speedup vs. number of cores) against ideal scaling: 35,938,560 DoF on 256–131,072 cores (parallel efficiency down to 0.17); 8,984,640 DoF on 64–16,384 cores (down to 0.55); 2,246,160 DoF on 32–8,192 cores (down to 0.51); 561,540 DoF on 32–4,096 cores (down to 0.37).]
Communication Load Balancing
[Figure: Mean communication time [s] of LET bodies and LET cells per MPI rank (1,000 ranks), before and after balancing, and the mean time [s] per time step with the balancing step marked.]
Data Scalability
[Figure: FMM mean time [s] vs. N compared against O(N), O(N log N), and O(N²) trends. Left: N = 10⁵–10⁷ on 64 compute nodes of Shaheen; right: N = 10⁸–10⁹ on 1,024 compute nodes.]
Convergence Aspects and Numerical Error
[Figure: Relative residual norm vs. number of iterations for four singularity correction modes: self and near singularity, self singularity only, near singularity only, and ignoring the self singularity.]
Convergence behavior of 1M DoF using different singularity correction modes.
Conclusion
• Auto-vectorization of FMM kernels by the Intel C/C++ Compiler 2017 can achieve a significant performance boost, given the precautions described above.
• Heavy tuning of the recursive task-parallelism parameters of FMM's tree traversal can be avoided using our performance model, at a small prediction penalty.
• Communication balancing results in a 6× faster global communication time for 100M DoF.
• Solution of a 2-billion-DoF system of high-order curvilinear triangular patches on a spherical mesh in about 60 minutes of time-to-solution, with an accuracy of 1.0e−4 compared to the analytical solution.
• Near-optimal parallel efficiency on Shaheen for both weak and strong scalability
studies.
Future Direction
• Explore and address performance challenges of more heterogeneous HPC architectures.
  • GPU-based supercomputers.
• Study different network topologies of various supercomputer architectures, and make HSDX aware of the underlying network units.
  • Intel Omni-Path network architecture.
• Further extend our solver implementation to include different, highly nonuniform domains and more complex geometries.
  • Wing-fuselage configuration emulated by two intersecting ellipsoids.
Thank you