Scaling SINBAD software to 3-D on Yemoja
TRANSCRIPT
SLIM, The University of British Columbia
Curt Da Silva, Haneet Wason, Mathias Louboutin, Bas Peters, Shashin Sharan, Zhilong Fang
Wednesday, 28 October, 15
Released to public domain under Creative Commons license type BY (https://creativecommons.org/licenses/by/4.0). Copyright (c) 2018 SINBAD consortium - SLIM group @ The University of British Columbia.
This talk
Showcase SLIM software as it applies to larg(er) scale problems on the Yemoja cluster
Performance scaling:
• as # of parallel resources increases
• comparisons to existing codes in C
Large data examples
FWI - Time Domain
Mathias Louboutin
Acoustic wave equation in time domain

Continuous form:

(1/v²) ∂²u/∂t² − ∇²u = q

Linear algebra form:

Au = q

A : time-domain forward modelling matrix
u : vectorized wavefield over all time steps and modelling grid points
q : source
Usual forward modelling

Fully discretized wave equation:

A₁ u_{k+1} + A₂ u_k + A₃ u_{k−1} = q_{k−1}

with:
A₁ = diag(m/Δt²)
A₂ = −L − 2·diag(m/Δt²)
A₃ = diag(m/Δt²)

q_k : source wavefield at time step k
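The recurrence above is an explicit leapfrog scheme: since A₁ is diagonal, each step is a stencil application plus a diagonal solve. A minimal 1-D NumPy sketch (second-order Laplacian, zero boundaries; names and the toy setup are illustrative, not the SLIM code):

```python
import numpy as np

def laplacian_1d(u, dx):
    """Second-order finite-difference Laplacian with zero Dirichlet boundaries."""
    lap = np.zeros_like(u)
    lap[1:-1] = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2
    return lap

def time_step(u_prev, u_curr, q, m, dt, dx):
    """One explicit step of A1 u_next + A2 u_curr + A3 u_prev = q,
    solved for u_next; A1 = A3 = diag(m/dt^2), A2 = -L - 2 diag(m/dt^2)."""
    rhs = q + laplacian_1d(u_curr, dx) + (m / dt**2) * (2.0 * u_curr - u_prev)
    return rhs * dt**2 / m  # apply the diagonal A1^{-1}

# usage: propagate an impulsive centre source for a few steps
n, dx, dt = 101, 10.0, 1e-3
m = np.full(n, 1.0 / 1500.0**2)   # m = 1/v^2 for v = 1500 m/s
u_prev = np.zeros(n)
u_curr = np.zeros(n)
q = np.zeros(n)
q[n // 2] = 1.0
for k in range(10):
    u_prev, u_curr = u_curr, time_step(u_prev, u_curr, q, m, dt, dx)
```

The chosen dt satisfies the CFL limit for this grid, so the loop stays stable.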
FWI Gradient

The FWI gradients have to pass the adjoint test:
‣ we only compute actions of J, Jᵀ, never the matrices themselves
‣ to ensure they are true adjoints, the migration/demigration operators need to satisfy

‖δdᵀ J δm − δmᵀ Jᵀ δd‖ < ε
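A standard way to check this in practice is a dot-product test on random vectors. A minimal matrix-free sketch (function names are illustrative, not from the SLIM codebase):

```python
import numpy as np

def dot_product_test(forward, adjoint, n_in, n_out, tol=1e-10, seed=0):
    """Check <J dm, dd> = <dm, J^T dd> for random test vectors, where
    `forward` and `adjoint` are matrix-free actions of J and J^T."""
    rng = np.random.default_rng(seed)
    dm = rng.standard_normal(n_in)
    dd = rng.standard_normal(n_out)
    lhs = np.dot(forward(dm), dd)   # dd^T J dm
    rhs = np.dot(dm, adjoint(dd))   # dm^T J^T dd
    return abs(lhs - rhs) / max(abs(lhs), abs(rhs)) < tol

# usage: an explicit matrix stands in for the Jacobian action
A = np.random.default_rng(1).standard_normal((50, 30))
ok = dot_product_test(lambda x: A @ x, lambda y: A.T @ y, 30, 50)
bad = dot_product_test(lambda x: A @ x, lambda y: 2.0 * (A.T @ y), 30, 50)
```

`ok` passes because `A.T` is the true adjoint of `A`; `bad` fails because the mis-scaled adjoint breaks the identity.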
Gradient Test

Ensure 2nd-order convergence of the Taylor expansion:

Zeroth-order Taylor error, O(h):
‖F(m₀ + h·δm, ε₀ + h·δε) − F(m₀, ε₀)‖

First-order Taylor error, O(h²):
‖F(m₀ + h·δm, ε₀ + h·δε) − F(m₀, ε₀) − h·J_m δm − h·J_ε δε‖

[Figure: multiparameter gradient test; both Taylor errors vs. step size h on log-log axes, showing the O(h) and O(h²) slopes]
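The test above can be sketched on a toy objective; here F(m) = ‖m‖² with Jacobian action J(m)δm = 2mᵀδm stands in for the FWI forward map (an assumed stand-in, chosen so the slopes are easy to verify):

```python
import numpy as np

# Taylor-error (gradient) test on a toy objective F(m) = ||m||^2,
# whose exact Jacobian action is J(m) dm = 2 m^T dm.
def F(m):
    return np.dot(m, m)

def J_action(m, dm):
    return 2.0 * np.dot(m, dm)

m0 = np.linspace(1.0, 2.0, 10)
dm = np.ones(10)

hs = 10.0 ** -np.arange(1, 6)
err0 = np.array([abs(F(m0 + h * dm) - F(m0)) for h in hs])
err1 = np.array([abs(F(m0 + h * dm) - F(m0) - h * J_action(m0, dm)) for h in hs])

# log-log slopes: ~1 for the zeroth-order error, ~2 for the first-order error
slope0 = np.polyfit(np.log10(hs), np.log10(err0), 1)[0]
slope1 = np.polyfit(np.log10(hs), np.log10(err1), 1)[0]
```

A correct Jacobian makes the first-order error decay one power of h faster than the zeroth-order error, which is exactly what the log-log slopes measure.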
True velocity

[Figure: true velocity model; X location 0-5000 m, depth 0-2000 m, velocity scale 1.5-4.5 km/s]
Good initial model

[Figure: initial velocity model; X location 0-5000 m, depth 0-2000 m, velocity scale 1.5-4.5 km/s]
FWI

[Figure: FWI result; X location 0-5000 m, depth 0-2000 m, velocity scale 1.5-4.5 km/s]
Implementation
Chevron Modeling code
• Modeling only
• SSE and AVX enabled
• 10th order in space, 4th order in time
• Stencil based (no matrix)
Matlab basic
Matrix based
No fancy speed-up yet
Contains:
• forward time stepping and adjoint time stepping (true adjoint)
• Jacobian and its adjoint (true adjoint)
• necessary for FWI, LSRTM, ...
Matlab basic
Setup: 300 sec for 400x400x400 (1 source)

Time-step memory usage:
A1_inv : 1.891 GB
A2 : 16.968 GB
A3 : 1.891 GB
Ps : 1.703 KB
U1 : 645.481 MB
U2 : 645.481 MB
U3 : 645.481 MB
adjoint_mode : 1.000 B
mode : 8.000 B
nt : 8.000 B
op : 645.494 MB
x : 38.586 KB
y : 645.481 MB
==========================
T : 23.902 GB
Matlab advanced
No matrix => stencil based + 3 vectors
Double precision => Single precision
Matlab MatVec => C MatVec with multi RHS
Matlab advanced
Single precision + stencil based:
• 20 times less memory than sparse matrices

Single precision:
• wavefields two times less expensive memory-wise

C MatVec:
• multi-threaded over RHS
• no matrix instead of 20
• communication overhead between Matlab and C
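The stencil-plus-multi-RHS idea can be sketched in NumPy (a stand-in for the C MEX kernel; a 2nd-order stencil is used for brevity, and the shapes and names are illustrative):

```python
import numpy as np

def stencil_matvec(U, dx):
    """Apply a 3-D 2nd-order Laplacian stencil to many wavefields at once.

    U has shape (nz, nx, ny, nrhs): one "column" per right-hand side, kept in
    single precision; no sparse matrix is ever formed.
    """
    out = np.zeros_like(U)
    out[1:-1, 1:-1, 1:-1] = (
        U[:-2, 1:-1, 1:-1] + U[2:, 1:-1, 1:-1]
        + U[1:-1, :-2, 1:-1] + U[1:-1, 2:, 1:-1]
        + U[1:-1, 1:-1, :-2] + U[1:-1, 1:-1, 2:]
        - 6.0 * U[1:-1, 1:-1, 1:-1]
    ) / np.float32(dx**2)
    return out

# usage: 20 right-hand sides on a small grid, all in float32
U = np.random.default_rng(0).standard_normal((16, 16, 16, 20)).astype(np.float32)
V = stencil_matvec(U, dx=10.0)
```

Vectorizing over the trailing RHS axis is the NumPy analogue of the multi-threaded multi-RHS loop in the C kernel.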
Setup: 17 sec for 400x400x400 (20 sources)

Memory usage:
N : 12.000 B
P : 8.000 B
U1 : 5.673 GB
U2 : 5.673 GB
U3 : 5.673 GB
a1i : 322.741 MB
a2 : 322.741 MB
a3 : 322.741 MB
adjoint_mode : 1.000 B
d : 4.000 B
i : 8.000 B
idx : 52.000 B
idxsrc : 7.594 KB
mode : 8.000 B
nt : 8.000 B
op : 645.495 MB
tsrc : 8.000 B
wsrc : 7.594 KB
x : 19.293 KB
y : 5.673 GB
==========================
T : 24.269 GB
Timings and memory
How does Matlab scale?

• compared with the Chevron modelling code
• compared to single-precision multi-RHS multiplication
Single time step
For a given 561x561x194 cube
Matlab
• 100 GB of RAM
• 2 sec per time step per source
• 40 sec per time step for 20 sources (20 runs)
Single time step
For a given 561x561x194 cube
Chevron
• 0.10 sec per time step (20 threads)
• 2 sec per time step for 20 sources (needs to run 20 times)
• stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
For a given 561x561x194 cube
Chevron
• 2 sec per time step (1 thread)
• 2 sec per time step for 20 sources (can run 20 at once)
• stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
For a given 561x561x194 cube
Single precision MEX Matlab
• ~0.2 sec per time step per source
• ~4 sec per time step for 20 sources
• 0 GB of RAM for the matrices
• 1 GB of RAM per source
Full Waveform Inversion - Time Harmonic
Zhilong Fang
FWI performance scaling
Model size: 134 x 134 x 28
Number of shots: 30
Number of frequencies: 1

                          Time per iteration   Memory use per node
1 node  * 8 processes     1 hour               0.5 GB
1 node  * 16 processes    0.53 hours           1 GB
5 nodes * 16 processes    0.15 hours           1 GB
FWI performance scaling
Model size: 268 x 268 x 56
Number of shots: 30
Number of frequencies: 1

                          Time per iteration   Memory use per node
1 node  * 8 processes     12 hours             4 GB
1 node  * 16 processes    6.3 hours            8 GB
5 nodes * 16 processes    1.3 hours            8 GB
3D WRI
Bas Peters
Scaling
• 6 km cube model
• ~40 wavelengths propagated between source & receiver
• 8 nodes
• each node solves the PDEs in the sub-problems for 8 right-hand sides simultaneously
• this setup can process 8 x 8 = 64 PDE solves simultaneously
• fixed tolerance for all PDE solves
8 Hz. Varying number of sources & receivers (8 - 256).

[Figure: left panel, log-log plot of total, comp U, comp W, and other time [s] vs. nsrc (8 nodes, 8 Hz); right panel, time per source [s] vs. nsrc]
[Same figure as above, annotated across three slides:]
• not enough sources & receivers to use the computational capacity of the nodes (at small nsrc)
• more sources & receivers: still close to constant time per source
• other costs (including communication) increase, but remain relatively small
8 Hz. 64 sources & 64 receivers. Varying number of nodes (2 - 16).

[Figure: log-log plot of total, comp U, comp W, and other time [s] vs. # of nodes (64 sources & receivers, 8 Hz)]
[Same figure as above]
→ not enough sources & receivers to use the computational capacity of the nodes; results in smaller speedup
3D data simulation
Haneet Wason & Shashin Sharan
Simulation parameters

3D ocean-bottom cable/node data set generated on the BG 3D Compass model
- model size (nz x nx x ny): 164 x 601 x 601
- grid size: 6 m x 25 m x 25 m

Data dimensions: 2501 x 500 x 500 x 85 x 85
- number of time samples: 2501
- number of receivers in x & y direction: 500
- number of shots in x & y direction: 85
- sampling intervals: 0.004 s, 25 m (receiver), 150 m (shot)

Simulated with the Chevron 3D modeling code
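These dimensions fix the storage budget. A back-of-the-envelope check, assuming 4-byte single-precision samples (an assumption; the talk does not state the sample format), reproduces the per-shot and total storage figures quoted later for this survey:

```python
# Storage estimate for the 3-D survey above, at 4 bytes per sample.
nt, nrx, nry = 2501, 500, 500   # time samples, receivers in x and y
nsx, nsy = 85, 85               # shots in x and y
bytes_per_sample = 4            # assumed single precision

shot_bytes = nt * nrx * nry * bytes_per_sample
total_bytes = shot_bytes * nsx * nsy

shot_gb = shot_bytes / 1e9      # ~2.5 GB per shot record
total_tb = total_bytes / 1e12   # ~18 TB for all 85 x 85 shots
```

The arithmetic lands on ~2.5 GB per shot record and ~18 TB overall, matching the resource slide.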
BG 3D Compass model

[Figure: x-direction and y-direction slices; lateral 0-15000 m, depth 0-2000 m]
Source-receiver layout

[Figure: map view; sources on a 150 m x 150 m grid, receivers on a 25 m x 25 m grid in x and y]
Computational resources used

Node partition: 128 GB
Number of nodes: 660

Time & memory usage:
- simulation per 3D shot: 1.5 hours
- cumulative simulation time (85 x 85 shots): 27 hours
- storage of one shot record: 2.5 GB
- storage of all shot records: 18 TB
Running jobs & activated nodes (SENAI Yemoja cluster)
3D shot records

[Figure: 3D shot record cube; Ry direction x Rx direction, nt x nrx x nry = 2500 x 500 x 500]
Simulation estimation

X & Y receiver   X & Y shot    Number of shots   Disk space
spacing (m)      spacing (m)   (X x Y)           (TB)
25               25            500 x 500         610
25               50            250 x 250         153
25               75            165 x 165         67
25               100           125 x 125         38.5
25               125           100 x 100         25
25               150           85 x 85           18
Simultaneous acquisition
Haneet Wason & Shashin Sharan
Performance scaling

Size of 3D survey: 2500 x 500 x 10 x 500 x 50
- number of time samples: 2500
- number of streamers: 10 (with 500 channels each)
- number of shots in x & y direction: 500 x 50

Number of   Number of SPGL1   Recovery time per      Recovery time,
workers     iterations        seismic line (hrs)     all data (days)
20          200               78                     162
50          200               31                     64
100         200               16                     33
500         200               3                      6
Interpolation - Tensor Completion
Curt Da Silva
Hierarchical Tucker format

X : an n₁ × n₂ × n₃ × n₄ tensor

[Diagram: the matricization U⁽¹²⁾, an (n₁n₂) × k₁₂ matrix, reshaped into an n₁ × n₂ × k₁₂ tensor]
[Diagram continued: the reshaped U⁽¹²⁾ factors further into U₁ (n₁ × k₁), U₂ (n₂ × k₂), and a small transfer core B₁₂]
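The idea behind one node of the dimension tree can be sketched with a plain truncated SVD standing in for the Hierarchical Tucker machinery (illustrative only; HT uses nested transfer cores rather than one flat SVD):

```python
import numpy as np

# One node of the dimension tree: matricize a 4-D tensor over dimensions
# (1,2) vs (3,4), then keep a rank-k12 factor U12 for the (1,2) side.
rng = np.random.default_rng(0)
n1, n2, n3, n4, k12 = 6, 5, 4, 3, 2

# Build a tensor that is exactly rank k12 in this matricization
L = rng.standard_normal((n1 * n2, k12))
R = rng.standard_normal((k12, n3 * n4))
X = (L @ R).reshape(n1, n2, n3, n4)

# Matricization X^(12): rows indexed by (i1, i2), columns by (i3, i4)
X12 = X.reshape(n1 * n2, n3 * n4)

# A truncated SVD recovers U12 of size (n1*n2) x k12 and the approximation
U, s, Vt = np.linalg.svd(X12, full_matrices=False)
U12 = U[:, :k12] * s[:k12]
approx = U12 @ Vt[:k12]
```

Because X was built with exact rank k₁₂ in this matricization, the truncated factorization reproduces X⁽¹²⁾ exactly; on real data the truncation error is what the interpolation controls.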
Data set

Generated from the BG Compass model using time stepping

Transformed into frequency slices, ~26 GB in size

85 x 85 sources at 150 m spacing, 500 x 500 receivers at 25 m spacing

90% of receiver pairs removed, on-grid sampling
Tensor interpolation
Parallelized over frequencies, with implicit parallelism via Matlab's calls to LAPACK

20 iterations of Gauss-Newton Hierarchical Tucker interpolation

Each frequency slice takes 13-15 hours to interpolate, ~70-80 GB max memory

Run on the Yemoja cluster in Brazil "out of the box"
HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data vs. subsampled data (input); receiver x and receiver y axes, 50-500]
HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data vs. interpolated data, SNR 19.3 dB]
HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data vs. difference]
SNR vs Frequency

[Figure: train SNR and test SNR [dB] vs. frequency [Hz], with SNRs ranging from ~18 to 30 dB]
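The SNR values above presumably follow the usual norm-ratio definition; a small sketch (the exact convention used in the talk is an assumption):

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB: 20 log10(||ref|| / ||ref - estimate||)."""
    return 20.0 * np.log10(np.linalg.norm(reference)
                           / np.linalg.norm(reference - estimate))

# usage: shrinking the error norm by 10x gains 20 dB
d = np.ones(100)
d_est = d + 0.1   # uniform 10% error
```

Under this definition, the reported 19.3 dB corresponds to a reconstruction whose error norm is roughly a tenth of the data norm.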
Acknowledgements
This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISEII (CRDPJ 375142-08).
Thank you for your attention