BEMFMM: An FMM-Accelerated Boundary Element
Method-Based Solver for the 3D Helmholtz Equation
Mustafa Abduljabbar, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan
Bagci, and David Keyes
April 24, 2018
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Tokyo Institute of Technology, Tokyo, Japan
Highlights
- FMM-based solver for BEM discretizations of oscillatory operators.
  • Demonstrated for acoustics in Helmholtz formulation.
  • Scattering from a sphere (exact solution available).
- Three levels of parallelism: MPI + X + Y.
- Three contemporary Intel architectures: Haswell, Skylake, KNL.
- Up to 2 billion Degrees-of-Freedom and up to 0.2 million cores of Shaheen XC40.
Outline
Introduction and Background
Extreme Scale Implementation and Optimizations
Shared-memory Optimizations
Distributed-memory Optimizations
Evaluation and Discussion
Shared-memory Optimizations
Distributed-memory Optimizations
Concluding Remarks and Future Work
Motivation
- Validate FMM-based horizontal and vertical parallelism techniques on modern many/multi-core architectures and on 196,608 hardware cores of the Shaheen XC40 supercomputer.
- Propose a performance model that estimates a near-optimal granularity of recursive task creation during tree traversal.
- Describe the tradeoffs of the various modes of singularity treatment in the BEM.
Problem Statement and Formulation [1/3]
• Time-dependent acoustic wave propagation is governed by the wave equation

\nabla^2 U(\mathbf{r}, t) - \frac{1}{c^2} \frac{\partial^2 U(\mathbf{r}, t)}{\partial t^2} = 0    (1)

• The time-harmonic form, obtained by substituting U(\mathbf{r}, t) = Re[U_0(\mathbf{r}) e^{-j\omega t}] into Eq. 1, is the Helmholtz equation

\nabla^2 U_0(\mathbf{r}) + k^2 U_0(\mathbf{r}) = 0    (2)
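As a sanity check on the step from Eq. 1 to Eq. 2, a short derivation (added here; the identification k = ω/c is the standard convention and is not spelled out on the slide):

\nabla^2 \left[ U_0(\mathbf{r}) e^{-j\omega t} \right] - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} \left[ U_0(\mathbf{r}) e^{-j\omega t} \right]
= \left[ \nabla^2 U_0(\mathbf{r}) + \frac{\omega^2}{c^2} U_0(\mathbf{r}) \right] e^{-j\omega t} = 0
\quad\Longrightarrow\quad
\nabla^2 U_0(\mathbf{r}) + k^2 U_0(\mathbf{r}) = 0, \qquad k = \frac{\omega}{c}.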
Problem Statement and Formulation [2/3]
- Applying the second form of Green's theorem yields the surface integral equation

p^{\mathrm{inc}}(\mathbf{r}) + \int_S \left[ \frac{\partial G(\mathbf{r}, \mathbf{r}')}{\partial n'} p(\mathbf{r}') - G(\mathbf{r}, \mathbf{r}') q(\mathbf{r}') \right] dS' = \frac{1}{2} p(\mathbf{r}), \quad \mathbf{r} \in S    (3)

G(\mathbf{r}, \mathbf{r}') = \frac{e^{jkR}}{4\pi R}    (4)

- If we set p = 0 on S, Eq. 3 reduces to

\int_S G(\mathbf{r}, \mathbf{r}') q(\mathbf{r}') \, dS' = p^{\mathrm{inc}}(\mathbf{r})    (5)
Problem Statement and Formulation [3/3]
• Wave scattering and applications.
  High-intensity focused ultrasound1,2

1 T. Betcke, E. van 't Wout and P. Gélat. Computationally efficient boundary element methods for high-frequency Helmholtz problems in unbounded domains, in: Modern Solvers for Helmholtz Problems, Springer (2017).
2 E. van 't Wout, P. Gélat, T. Betcke and S. Arridge. A fast boundary element method for the scattering analysis of high-intensity focused ultrasound, Journal of the Acoustical Society of America 138(5) (2015), pp. 2726–2737.
System Overview
[Figure 1: Dataflow across libraries (color-coded): Gmsh mesh input, normalization, BEMFMM, PETSc GMRES iteration, FMM wrapper, ExaFMM.]
Discretization of the Scatterer’s Surface
- Divide the surface into curvilinear triangular patches.
- Each curvilinear patch has Ni interpolation points.
- Using the Nyström method, the unknown velocity is expanded as the interpolation

q(\mathbf{r}') = \sum_{n=1}^{N_p} \sum_{i=1}^{N_i} \vartheta^{-1}(\mathbf{r}') \, L_{(i,n)}(\zeta, \eta) \, \{I\}_{(i,n)}    (6)

Z I = V^{\mathrm{inc}}    (7)

[Z]_{(j,m)(i,n)} = \int_{\Delta_n} G(\mathbf{r}_{(j,m)}, \mathbf{r}') \, \vartheta^{-1}(\mathbf{r}') \, L_{(i,n)}(\zeta, \eta) \, d\mathbf{r}'    (8)
FMM in Numerical Science
- FMM accelerates solutions that can be written (to some extent) as a boundary integral equation of the form

u(\mathbf{x}) = \int_{\Gamma} \Phi(\mathbf{x}, \mathbf{y}) \, q(\mathbf{y}) \, dS(\mathbf{y})

- Examples of FMM use cases:

Application                                Kernel      Single/Double
Gravitational potential                    Laplace     Single
Electrostatic field                        Laplace     Double
Acoustic scattering field (low frequency)  Helmholtz   Single
Electromagnetic scattering field           Helmholtz   Double
FMM as an Accelerator for the 3D Helmholtz Solution
- FMM works as a matrix-free accelerator for the mat-vec multiplication (or IFMM as a preconditioner [Takahashi et al., 2017]); a coupling sketch follows the equations below.
- The discretized BIE yields a dense, but structured, matrix.
- Example: impulse response due to a monopole source.

p(\mathbf{r}) = \sum_{j=1}^{N_s} G(\mathbf{r}, \mathbf{r}'_j) \, q(\mathbf{r}'_j)    (9)

- Eq. 9 is expanded into a series of spherical harmonics:

p(\mathbf{r}) = ik \sum_{n=0}^{\infty} \sum_{m=-n}^{n} S_n^{-m}(\mathbf{r}_j) \, R_n^m(\mathbf{r}), \quad r \le r_q    (10)

p(\mathbf{r}) = ik \sum_{n=0}^{\infty} \sum_{m=-n}^{n} C_n^m \, R_n^m(\mathbf{r})    (11)

C_n^m = \sum_{r_q < R_{\max}} Q_q \, S_n^{-m}(\mathbf{r}_q)    (12)
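The matrix-free coupling mentioned above can be illustrated with a PETSc shell matrix. This is a minimal sketch, not the BEMFMM source: FMMCtx and fmm_evaluate() are hypothetical stand-ins for the ExaFMM wrapper; only the PETSc shell-matrix calls (MatCreateShell, MatShellSetOperation, KSPSetOperators) are standard API.

#include <petscksp.h>

/* Hypothetical context and evaluation routine standing in for the ExaFMM wrapper. */
typedef struct { void *fmm; } FMMCtx;
void fmm_evaluate(void *fmm, const PetscScalar *q, PetscScalar *p);

/* Matrix-free mat-vec y = Z x: the dense BEM matrix Z is never assembled. */
static PetscErrorCode MatMult_FMM(Mat A, Vec x, Vec y)
{
  FMMCtx            *ctx;
  const PetscScalar *xa;
  PetscScalar       *ya;
  PetscErrorCode     ierr;
  ierr = MatShellGetContext(A, &ctx); CHKERRQ(ierr);
  ierr = VecGetArrayRead(x, &xa); CHKERRQ(ierr);
  ierr = VecGetArray(y, &ya); CHKERRQ(ierr);
  fmm_evaluate(ctx->fmm, xa, ya);   /* near field directly, far field via FMM */
  ierr = VecRestoreArray(y, &ya); CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(x, &xa); CHKERRQ(ierr);
  return 0;
}

/* Registration sketch: GMRES then sees an ordinary Mat.
 *   MatCreateShell(comm, nloc, nloc, N, N, &ctx, &A);
 *   MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MatMult_FMM);
 *   KSPSetType(ksp, KSPGMRES);  KSPSetOperators(ksp, A, A);
 */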
An Overview of FMM
[Figure: Quad-tree partitioning on 4 processes, and the FMM hierarchy (upward, horizontal, and downward sweeps) built from the P2M, M2M, M2L, P2P, L2L, and L2P kernels.]
Traversal Stage
Traditional FMM: iterate over the Hilbert/Morton order and probe each cell for its neighbors [MS Warren et al., 1995].

[Figure: Morton-key cell ordering in the original Warren and Salmon FMM.]

Dual Tree Traversal (DTT)3: simultaneously traverse the source and target trees in a pre-order, depth-first manner (a recursion sketch follows below).

[Figure: Dual tree traversal.]
3Mustafa Abduljabbar, Mohammed Al Farhan, Rio Yokota and David Keyes. Performance Evaluation of Computation and Communication Kernels
of the Fast Multipole Method on Intel Manycore Architecture, Proceedings of the European Conference on Parallel Processing (2017) 553–564.
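A minimal sketch of the DTT recursion referenced above; the Cell struct and the accept()/M2L()/P2P() routines are illustrative placeholders (the actual admissibility test and kernels live in ExaFMM), not the released data structures.

typedef struct Cell {
  struct Cell *child[8];
  int          nchild;      /* 0 for a leaf */
  double       radius;      /* cell bounding radius */
} Cell;

int  accept(const Cell *Ci, const Cell *Cj);  /* multipole acceptance criterion */
void M2L(Cell *Ci, const Cell *Cj);           /* far-field approximation        */
void P2P(Cell *Ci, const Cell *Cj);           /* direct near-field kernel       */

/* Pre-order, depth-first traversal over (target, source) cell pairs. */
void dual_tree_traversal(Cell *Ci, Cell *Cj)
{
  if (accept(Ci, Cj)) {
    M2L(Ci, Cj);                              /* well separated: approximate   */
  } else if (Ci->nchild == 0 && Cj->nchild == 0) {
    P2P(Ci, Cj);                              /* two leaves: direct summation  */
  } else if (Cj->nchild == 0 ||
             (Ci->nchild != 0 && Ci->radius >= Cj->radius)) {
    for (int i = 0; i < Ci->nchild; ++i)      /* split the larger cell         */
      dual_tree_traversal(Ci->child[i], Cj);
  } else {
    for (int j = 0; j < Cj->nchild; ++j)
      dual_tree_traversal(Ci, Cj->child[j]);
  }
}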
P2P/M2L: Data-level Parallelism
Auto-vectorization with #pragma simd can almost always win in embarrassingly parallel kernels with structured memory access; however:
• Check the icc compiler's output with -qopt-report. By default it vectorized the inner-most loop (suboptimal here).
• Double-check vector loads for array-of-structs layouts (an SoA illustration follows the kernel below).
• Rewrite divisions and complex numbers in polar form (experimentally, a 20% reduction in vector moves).
for ( ; i < ni; ++i) {
  vir = real(Bi.SRC); vii = imag(Bi.SRC);
  potr = 0; poti = 0;                      // accumulators for this target
  for (j = 0; j < nj; ++j) {
    dX = xi - xj; R2 = norm(dX);
    // relay self-singularity to PETSc callback
    if (Bi.PATCH != Bj.PATCH && R2 != 0) {
      real_t R = sqrt(R2);
      if (R <= near_patch_distance) {
        for (k = 0; k < gauss_quad_points; ++k) {
          // near-patch singularity treatment
        }
      } else {
        vjr = real(Bj.SRC); vji = imag(Bj.SRC);
        src2r = vir * vjr - vii * vji;
        src2i = vir * vji + vii * vjr;
        invR  = 1.0 / sqrt(R2);            // 1/R
        eikr  = 1.0 / exp(wave_i * R);     // e^{-Im(k) R}
        eikr *= invR;
        eikrr = cos(wave_r * R) * eikr;    // Re of e^{jkR}/R
        eikri = sin(wave_r * R) * eikr;    // Im of e^{jkR}/R
        potr += src2r * eikrr - src2i * eikri;
        poti += src2r * eikri + src2i * eikrr;
      }
    }
  }
  Bi.TRG += complex(potr, poti);
}
Helmholtz S2T Kernel
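As an illustration of the precautions above (not taken from the BEMFMM source): the far-field interaction rewritten over structure-of-arrays inputs, with an explicit simd reduction over the source loop so the compiler does not settle for an inner quadrature loop. The function and parameter names are assumptions; compile with icc and -qopt-report to confirm which loop was vectorized.

#include <math.h>

/* Simplified single-target Helmholtz P2P over SoA source arrays (illustrative).
   SoA avoids the strided vector loads that an array-of-structs layout forces;
   kr/ki are the real/imaginary parts of the wavenumber, as in the kernel above. */
void p2p_soa(int nj, const float *xj, const float *yj, const float *zj,
             const float *qr, const float *qi,
             float xi, float yi, float zi, float kr, float ki,
             float *potr_out, float *poti_out)
{
  float potr = 0.0f, poti = 0.0f;
#pragma omp simd reduction(+:potr, poti)
  for (int j = 0; j < nj; ++j) {
    float dx = xi - xj[j], dy = yi - yj[j], dz = zi - zj[j];
    float R    = sqrtf(dx * dx + dy * dy + dz * dz);
    float invR = (R > 0.0f) ? 1.0f / R : 0.0f;   /* mask the self interaction  */
    float amp  = expf(-ki * R) * invR;           /* |e^{jkR}| / R (polar form) */
    float gr   = cosf(kr * R) * amp;             /* Re of e^{jkR}/R            */
    float gi   = sinf(kr * R) * amp;             /* Im of e^{jkR}/R            */
    potr += qr[j] * gr - qi[j] * gi;
    poti += qr[j] * gi + qi[j] * gr;
  }
  *potr_out = potr;
  *poti_out = poti;
}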
Traversal: Thread-level Parallelism
- Control cell size and task granularity by minimizing the difference between the LLC size and the size of the interacting cells, i.e., minimize the cache-miss rate (a sketch of evaluating this model follows below).

\min_{s,c} f(s, c) = (\mathrm{M2L}_{\mathrm{size}} + \mathrm{P2P}_{\mathrm{size}}) - \mathrm{LLC}_{L2/L3}
                   = 2 \times c \times \mathrm{task\_size} \times n_{\mathrm{threads/core}} \times [(c_{\mathrm{size}} \times \log_s c) + b_{\mathrm{size}}] - \mathrm{LLC}_{L2/L3}    (13)

where:
- the multiplier 2 accounts for both sources and targets,
- c_size / b_size is the cell / body struct size,
- s is the task-spawning parameter,
- c is the number of bodies per leaf cell,
- log_s c is the depth of the recursive branch,
- LLC is the L2/L3 last-level cache size.
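A sketch of how Eq. 13 can be evaluated to pick the spawning parameter s and the leaf size c. It interprets "minimize the difference" as minimizing the absolute gap between the per-task footprint and the LLC; the candidate ranges and constants are illustrative, not values from the paper.

#include <math.h>

/* Per-task working-set estimate from Eq. 13 (all sizes in bytes). */
static double footprint(double s, double c, double task_size, double threads_per_core,
                        double cell_bytes, double body_bytes)
{
  return 2.0 * c * task_size * threads_per_core
       * (cell_bytes * (log(c) / log(s)) + body_bytes);
}

/* Scan power-of-two candidates for the (s, c) whose footprint is closest to the LLC. */
void pick_granularity(double task_size, double threads_per_core,
                      double cell_bytes, double body_bytes, double llc_bytes,
                      int *s_out, int *c_out)
{
  double best = INFINITY;
  *s_out = 2; *c_out = 64;
  for (int s = 2; s <= 512; s *= 2)
    for (int c = 64; c <= 2048; c *= 2) {
      double gap = fabs(footprint(s, c, task_size, threads_per_core,
                                  cell_bytes, body_bytes) - llc_bytes);
      if (gap < best) { best = gap; *s_out = s; *c_out = c; }
    }
}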
Large Mesh Partitioning
- Pre-partitioning:
• Create intermediate format with
minimal indexed binary data
(double-precision coordinates).
• Map each rank to its region in the file (a minimal I/O sketch follows after this list).
- Partitioning:
• Separate Global/Local trees using
modified ORB [Abduljabbar et al.,
2017].
• Graft tree in one step when doing
global traversal.
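A minimal sketch of the pre-partitioning file mapping mentioned above: each rank reads its own slice of the intermediate binary file. The layout of npoints x 3 doubles and the function name are assumptions; only the MPI-IO calls themselves are standard.

#include <mpi.h>
#include <stdlib.h>

/* Each rank reads its contiguous slice of a binary file holding npoints * 3
   double-precision coordinates (x, y, z per point); the layout is assumed. */
double *read_my_region(const char *fname, long long npoints, long long *nlocal_out)
{
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  long long chunk  = (npoints + size - 1) / size;   /* points per rank */
  long long begin  = (long long)rank * chunk;
  long long nlocal = (begin >= npoints) ? 0 :
                     (begin + chunk <= npoints ? chunk : npoints - begin);

  double  *coords = (double *)malloc((size_t)(3 * nlocal) * sizeof(double));
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  MPI_File_read_at_all(fh, (MPI_Offset)(3 * begin) * (MPI_Offset)sizeof(double),
                       coords, (int)(3 * nlocal), MPI_DOUBLE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  *nlocal_out = nlocal;
  return coords;
}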
[Figure: LET grafting: local tree, process tree, and local essential tree with local root nodes, shown across ranks (rank 0, my rank, rank 2).]
Load-balancing (Repartitioning)
- Load imbalance harms GMRES performance, because every step waits on a global reduction.
- Optimal partitioning for random distributions is NP-hard.
- Consider the previous workload and communication:

w_i = l_i + \alpha \, r_i    (14)

- The variables l_i and r_i, and the total runtime, are already measured in the present code, so the information is available at negligible cost.
- Control the granularity of partitioning based on the load imbalance (a sketch of the weight computation follows the figure below).
[Figure: Partitioning granularity, from coarse to fine: defer partitioning for a few steps; weighted Morton/Hilbert partitioning every time step; migration within a time step (node failure).]
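A sketch of the weight in Eq. 14 and of a deferral criterion built from the measurements the slides say are already available (l_i, r_i, per-step runtime); alpha and the imbalance threshold are illustrative, not values from the code.

#include <mpi.h>

/* Weight for the next weighted Morton/Hilbert partition (Eq. 14): l_i is the
   measured local work, r_i the measured communication, alpha a tunable factor. */
double repartition_weight(double l_i, double r_i, double alpha)
{
  return l_i + alpha * r_i;
}

/* Coarse-grained control: defer repartitioning while the measured imbalance
   (max/avg - 1) stays below a threshold, e.g. 0.1. */
int need_repartition(double my_step_time, double imbalance_threshold)
{
  int size;
  double tmax, tsum;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Allreduce(&my_step_time, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(&my_step_time, &tsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return (tmax / (tsum / size) - 1.0) > imbalance_threshold;
}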
A Neighborhood-based Communication Protocol (HSDX)4
4Mustafa Abduljabbar, George S. Markomanolis, Huda Ibeid, Rio Yokota and David Keyes. Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions, Proceedings of the International Supercomputing Conference (2017), 79–96.
Hardware Specifications
                   KNL            Haswell        Skylake
Family             x200           E5V3           Scalable
Model              7290           2670           8176
Socket(s)          1              2              2
Cores              72             32             56
GHz                1.50           2.60           2.10
Watts/socket       245            120            165
DDR4 (GB)          192            128            264
Frequency Driver   acpi-cpufreq   acpi-cpufreq   acpi-cpufreq
Max GHz            1.50           2.60           2.10
Governor           conservative   performance    ondemand
Turbo Boost        X              X              X
Data-level Parallelism
[Figure: Single-precision floating-point performance (GFLOP/s) vs. N for no-vectorization, intrinsics, and auto-vectorization against the theoretical peak. Left: Skylake, 112 threads; right: KNL (quadrant/flat mode), 288 threads.]
- Lower latency and higher throughput for AVX512 in Skylake.
- Increased cache miss penalty in KNL.
Thread-level Parallelism [1/2]
[Figure: Experimental results vs. the performance model on Skylake (112 threads): heatmaps over task size (16–512) and grain size (64–2,048).]
Thread-level Parallelism [2/2]
[Figure: Experimental results vs. the performance model on KNL (quadrant/flat mode, 288 threads): heatmaps over task size (16–512) and grain size (64–2,048).]
Cray XC40 Characteristics
- Distributed experiments are run on Shaheen XC40, which hosts 196,608 Haswell cores and has a LINPACK performance of 5.5 PFlop/s.
- Strong scaling is challenging in traditional FMM codes.
XC40 Network Hierarchy
Level   Hardware/Network Unit   Nodes   Cores     Overhead   Hops
1       Socket                  1       16        32         N/A
2       NUMA Node               1       32        64         N/A
3       Blade                   4       128       256        1
4       Chassis                 64      2,048     4,096      1
5       Cabinet                 192     6,144     8,192      1
6       Local alltoall G        384     12,288    16,384     1
7       Global alltoall G1      2,304   74,728    131,072    2
8       Global alltoall G2      6,174   197,568   N/A        3
Weak Scalability
• A single scattering sphere, to validate the concept and the convergence.
• We fix the problem size per core while keeping 10 points per wavelength (1.0e−4 accuracy vs. the analytical solution).

[Figure: Weak scalability: mean time [s] vs. number of cores (32 to 131,072) against ideal scaling, for problem sizes growing from 561,540 to 2,300,067,840 DoF, with parallel efficiencies between 0.95 and 1.06.]
Cores     Mem. [GB]   Freq. [kHz]   N (DoF)          T [s]    Iters
512       2,048       24            8,984,640        3.1e3    185
2,048     8,192       96            35,938,560       3.3e3    190
8,192     32,768      384           143,754,240      3.6e3    200
32,768    131,072     1,536         575,016,960      4.6e3    223
131,072   524,288     6,144         2,300,067,840    5.7e3    256
Strong Scalability [1/2]
[Figure: Strong scalability (speedup vs. number of cores) against ideal scaling: 2,300,067,840 DoF on 16,384–196,608 cores (parallel efficiency 1.0 down to 0.63); 575,016,960 DoF on 8,192–196,608 cores (down to 0.37); 143,754,240 DoF on 1,024–131,072 cores (down to 0.37).]
Strong Scalability [2/2]
[Figure: Strong scalability (speedup vs. number of cores) against ideal scaling: 35,938,560 DoF on 256–131,072 cores (parallel efficiency down to 0.17); 8,984,640 DoF on 64–16,384 cores (down to 0.55); 2,246,160 DoF on 32–8,192 cores (down to 0.51); 561,540 DoF on 32–4,096 cores (down to 0.37).]
Communication Load Balancing
[Figure: Mean communication time [s] of LET bodies and LET cells per MPI rank (1,000 ranks), before and after balancing, and the mean time [s] per time step with the balancing step marked.]
Data Scalability
[Figure: FMM mean time [s] vs. N compared against O(N), O(N log N), and O(N²) trends. Left: N = 10⁵–10⁷ on 64 compute nodes of Shaheen; right: N = 10⁸–10⁹ on 1,024 compute nodes.]
Convergence Aspects and Numerical Error
[Figure: Relative residual norm vs. number of iterations for four singularity correction modes: self and near singularity, self singularity only, near singularity only, and ignoring the self singularity.]
Convergence behavior of 1M DoF using different singularity correction modes.
Conclusion
• Auto-vectorization of FMM kernels by the Intel C/C++ Compiler 2017 can achieve a significant performance boost, given the precautions described above.
• Heavy tuning of the recursive task-parallelism parameters of FMM's tree traversal can be avoided using our performance model, at a small prediction penalty.
• Communication balancing results in a 6× faster global communication time for 100M DoF.
• Solution of a 2-billion-DoF system of high-order curvilinear triangular patches on a spherical mesh in about 60 minutes of time-to-solution, with an accuracy of 1.0e−4 compared to the analytical solution.
• Near-optimal parallel efficiency on Shaheen for both weak and strong scalability
studies.
Future Direction
• Explore and address performance challenges of more heterogeneous HPC architectures.
  • GPU-based supercomputers.
• Study different network topologies of various supercomputer architectures, and make HSDX aware of the underlying network units.
  • Intel Omni-Path network architecture.
• Further extend our solver implementation to include different, highly nonuniform domains and more complex geometries.
  • Wing-fuselage configuration emulated by two intersecting ellipsoids.
Thank you