Scaling SINBAD software to 3-D on Yemoja
TRANSCRIPT
SLIM, The University of British Columbia
Curt Da Silva, Haneet Wason, Mathias Louboutin, Bas Peters, Shashin Sharan, Zhilong Fang
Wednesday, 28 October, 15
Released to public domain under Creative Commons license type BY (https://creativecommons.org/licenses/by/4.0). Copyright (c) 2018 SINBAD consortium - SLIM group @ The University of British Columbia.
This talk
Showcase SLIM software as it applies to larg(er) scale problems on the Yemoja cluster
Performance scaling:
• as # of parallel resources increases
• comparisons to existing codes in C
Large data examples
FWI - Time Domain
Mathias Louboutin
Acoustic wave equation in time domain

Continuous form:

(1/v²) ∂²u/∂t² − ∇²u = q

Linear algebra form:

Au = q

A : time-domain forward modelling matrix
u : vectorized wavefield over all time steps and modelling grid points
q : source
Usual forward modelling

Fully discretized wave equation:

A₁ u_{k+1} + A₂ u_k + A₃ u_{k−1} = q_{k−1}

with:
A₁ = diag(m/Δt²)
A₂ = −L − 2·diag(m/Δt²)
A₃ = diag(m/Δt²)

q_k : source wavefield at time step k
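The recurrence above is an explicit leapfrog scheme: since A₁ is diagonal, each step is a stencil application plus a diagonal solve. A minimal 1-D NumPy sketch (second-order Laplacian, zero boundaries; names and the toy setup are illustrative, not the SLIM code):

```python
import numpy as np

def laplacian_1d(u, dx):
    """Second-order finite-difference Laplacian with zero Dirichlet boundaries."""
    lap = np.zeros_like(u)
    lap[1:-1] = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2
    return lap

def time_step(u_prev, u_curr, q, m, dt, dx):
    """One explicit step of A1 u_next + A2 u_curr + A3 u_prev = q,
    solved for u_next; A1 = A3 = diag(m/dt^2), A2 = -L - 2 diag(m/dt^2)."""
    rhs = q + laplacian_1d(u_curr, dx) + (m / dt**2) * (2.0 * u_curr - u_prev)
    return rhs * dt**2 / m  # apply the diagonal A1^{-1}

# usage: propagate an impulsive centre source for a few steps
n, dx, dt = 101, 10.0, 1e-3
m = np.full(n, 1.0 / 1500.0**2)   # m = 1/v^2 for v = 1500 m/s
u_prev = np.zeros(n)
u_curr = np.zeros(n)
q = np.zeros(n)
q[n // 2] = 1.0
for k in range(10):
    u_prev, u_curr = u_curr, time_step(u_prev, u_curr, q, m, dt, dx)
```

The chosen dt satisfies the CFL limit for this grid, so the loop stays stable.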
FWI Gradient

The FWI gradients have to pass the adjoint test:
‣ we only compute actions of J, Jᵀ, never the matrices themselves
‣ to ensure they are true adjoints, the migration/demigration operators need to satisfy

‖δdᵀ J δm − δmᵀ Jᵀ δd‖ < ε
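A standard way to check this in practice is a dot-product test on random vectors. A minimal matrix-free sketch (function names are illustrative, not from the SLIM codebase):

```python
import numpy as np

def dot_product_test(forward, adjoint, n_in, n_out, tol=1e-10, seed=0):
    """Check <J dm, dd> = <dm, J^T dd> for random test vectors, where
    `forward` and `adjoint` are matrix-free actions of J and J^T."""
    rng = np.random.default_rng(seed)
    dm = rng.standard_normal(n_in)
    dd = rng.standard_normal(n_out)
    lhs = np.dot(forward(dm), dd)   # dd^T J dm
    rhs = np.dot(dm, adjoint(dd))   # dm^T J^T dd
    return abs(lhs - rhs) / max(abs(lhs), abs(rhs)) < tol

# usage: an explicit matrix stands in for the Jacobian action
A = np.random.default_rng(1).standard_normal((50, 30))
ok = dot_product_test(lambda x: A @ x, lambda y: A.T @ y, 30, 50)
bad = dot_product_test(lambda x: A @ x, lambda y: 2.0 * (A.T @ y), 30, 50)
```

`ok` passes because `A.T` is the true adjoint of `A`; `bad` fails because the mis-scaled adjoint breaks the identity.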
Gradient Test

Ensure 2nd-order convergence of the Taylor expansion:

Zeroth-order Taylor error, O(h):
‖F(m₀ + h·δm, ε₀ + h·δε) − F(m₀, ε₀)‖

First-order Taylor error, O(h²):
‖F(m₀ + h·δm, ε₀ + h·δε) − F(m₀, ε₀) − h·J_m δm − h·J_ε δε‖

[Figure: multiparameter gradient test; both Taylor errors vs. step size h on log-log axes, showing the O(h) and O(h²) slopes]
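The test above can be sketched on a toy objective; here F(m) = ‖m‖² with Jacobian action J(m)δm = 2mᵀδm stands in for the FWI forward map (an assumed stand-in, chosen so the slopes are easy to verify):

```python
import numpy as np

# Taylor-error (gradient) test on a toy objective F(m) = ||m||^2,
# whose exact Jacobian action is J(m) dm = 2 m^T dm.
def F(m):
    return np.dot(m, m)

def J_action(m, dm):
    return 2.0 * np.dot(m, dm)

m0 = np.linspace(1.0, 2.0, 10)
dm = np.ones(10)

hs = 10.0 ** -np.arange(1, 6)
err0 = np.array([abs(F(m0 + h * dm) - F(m0)) for h in hs])
err1 = np.array([abs(F(m0 + h * dm) - F(m0) - h * J_action(m0, dm)) for h in hs])

# log-log slopes: ~1 for the zeroth-order error, ~2 for the first-order error
slope0 = np.polyfit(np.log10(hs), np.log10(err0), 1)[0]
slope1 = np.polyfit(np.log10(hs), np.log10(err1), 1)[0]
```

A correct Jacobian makes the first-order error decay one power of h faster than the zeroth-order error, which is exactly what the log-log slopes measure.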
True velocity

[Figure: true velocity model; X location 0-5000 m, depth 0-2000 m, velocity scale 1.5-4.5 km/s]
Good initial model

[Figure: initial velocity model; X location 0-5000 m, depth 0-2000 m, velocity scale 1.5-4.5 km/s]
FWI

[Figure: FWI result; X location 0-5000 m, depth 0-2000 m, velocity scale 1.5-4.5 km/s]
Implementation
Chevron Modeling code
• Modeling only
• SSE and AVX enabled
• 10th order in space, 4th order in time
• Stencil based (no matrix)
Matlab basic
Matrix based
No fancy speed-up yet
Contains:
• forward time stepping and adjoint time stepping (true adjoint)
• Jacobian and its adjoint (true adjoint)
• necessary for FWI, LSRTM, ...
Matlab basic
Setup: 300 sec for 400x400x400 (1 source)

Time-step memory usage:
A1_inv : 1.891 GB
A2 : 16.968 GB
A3 : 1.891 GB
Ps : 1.703 KB
U1 : 645.481 MB
U2 : 645.481 MB
U3 : 645.481 MB
adjoint_mode : 1.000 B
mode : 8.000 B
nt : 8.000 B
op : 645.494 MB
x : 38.586 KB
y : 645.481 MB
==========================
T : 23.902 GB
Matlab advanced
No matrix => stencil based + 3 vectors
Double precision => Single precision
Matlab MatVec => C MatVec with multi RHS
Matlab advanced
Single precision + stencil based:
• 20 times less memory than sparse matrices

Single precision:
• wavefields two times less expensive memory-wise

C MatVec:
• multi-threaded over RHS
• no matrix instead of 20
• communication overhead between Matlab and C
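The stencil-plus-multi-RHS idea can be sketched in NumPy (a stand-in for the C MEX kernel; a 2nd-order stencil is used for brevity, and the shapes and names are illustrative):

```python
import numpy as np

def stencil_matvec(U, dx):
    """Apply a 3-D 2nd-order Laplacian stencil to many wavefields at once.

    U has shape (nz, nx, ny, nrhs): one "column" per right-hand side, kept in
    single precision; no sparse matrix is ever formed.
    """
    out = np.zeros_like(U)
    out[1:-1, 1:-1, 1:-1] = (
        U[:-2, 1:-1, 1:-1] + U[2:, 1:-1, 1:-1]
        + U[1:-1, :-2, 1:-1] + U[1:-1, 2:, 1:-1]
        + U[1:-1, 1:-1, :-2] + U[1:-1, 1:-1, 2:]
        - 6.0 * U[1:-1, 1:-1, 1:-1]
    ) / np.float32(dx**2)
    return out

# usage: 20 right-hand sides on a small grid, all in float32
U = np.random.default_rng(0).standard_normal((16, 16, 16, 20)).astype(np.float32)
V = stencil_matvec(U, dx=10.0)
```

Vectorizing over the trailing RHS axis is the NumPy analogue of the multi-threaded multi-RHS loop in the C kernel.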
Setup: 17 sec for 400x400x400 (20 sources)

Memory usage:
N : 12.000 B
P : 8.000 B
U1 : 5.673 GB
U2 : 5.673 GB
U3 : 5.673 GB
a1i : 322.741 MB
a2 : 322.741 MB
a3 : 322.741 MB
adjoint_mode : 1.000 B
d : 4.000 B
i : 8.000 B
idx : 52.000 B
idxsrc : 7.594 KB
mode : 8.000 B
nt : 8.000 B
op : 645.495 MB
tsrc : 8.000 B
wsrc : 7.594 KB
x : 19.293 KB
y : 5.673 GB
==========================
T : 24.269 GB
Timings and memory
How does Matlab scale?

• compared with the Chevron modelling code
• compared to single-precision multi-RHS multiplication
Single time step
For a given 561x561x194 cube
Matlab
• 100 GB of RAM
• 2 sec per time step per source
• 40 sec per time step for 20 sources (20 runs)
Single time step
For a given 561x561x194 cube
Chevron
• 0.10 sec per time step (20 threads)
• 2 sec per time step for 20 sources (needs to run 20 times)
• stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
For a given 561x561x194 cube
Chevron
• 2 sec per time step (1 thread)
• 2 sec per time step for 20 sources (can run 20 at once)
• stencil based, 0 RAM for matrices
• 1 GB of RAM (one source at a time)
Single time step
For a given 561x561x194 cube
Single precision MEX Matlab
• ~0.2 sec per time step per source
• ~4 sec per time step for 20 sources
• 0 GB of RAM for the matrices
• 1 GB of RAM per source
Full Waveform Inversion - Time Harmonic
Zhilong Fang
FWI performance scaling
Model size: 134 x 134 x 28
Number of shots: 30
Number of frequencies: 1

                          Time per iteration   Memory use per node
1 node  * 8 processes     1 hour               0.5 GB
1 node  * 16 processes    0.53 hours           1 GB
5 nodes * 16 processes    0.15 hours           1 GB
FWI performance scaling
Model size: 268 x 268 x 56
Number of shots: 30
Number of frequencies: 1

                          Time per iteration   Memory use per node
1 node  * 8 processes     12 hours             4 GB
1 node  * 16 processes    6.3 hours            8 GB
5 nodes * 16 processes    1.3 hours            8 GB
3D WRI
Bas Peters
Scaling
• 6 km cube model
• ~40 wavelengths propagated between source & receiver
• 8 nodes
• each node solves the PDEs in the sub-problems for 8 right-hand sides simultaneously
• this setup can process 8 x 8 = 64 PDE solves simultaneously
• fixed tolerance for all PDE solves
8 Hz. Varying number of sources & receivers (8 - 256).

[Figure: left panel, log-log plot of total, comp U, comp W, and other time [s] vs. nsrc (8 nodes, 8 Hz); right panel, time per source [s] vs. nsrc]
[Same figure as above, annotated across three slides:]
• not enough sources & receivers to use the computational capacity of the nodes (at small nsrc)
• more sources & receivers: still close to constant time per source
• other costs (including communication) increase, but remain relatively small
8 Hz. 64 sources & 64 receivers. Varying number of nodes (2 - 16).

[Figure: log-log plot of total, comp U, comp W, and other time [s] vs. # of nodes (64 sources & receivers, 8 Hz)]
[Same figure as above]
→ not enough sources & receivers to use the computational capacity of the nodes; results in smaller speedup
3D data simulation
Haneet Wason & Shashin Sharan
Simulation parameters

3D ocean-bottom cable/node data set generated on the BG 3D Compass model
- model size (nz x nx x ny): 164 x 601 x 601
- grid size: 6 m x 25 m x 25 m

Data dimensions: 2501 x 500 x 500 x 85 x 85
- number of time samples: 2501
- number of receivers in x & y direction: 500
- number of shots in x & y direction: 85
- sampling intervals: 0.004 s, 25 m (receiver), 150 m (shot)

Simulated with the Chevron 3D modeling code
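These dimensions fix the storage budget. A back-of-the-envelope check, assuming 4-byte single-precision samples (an assumption; the talk does not state the sample format), reproduces the per-shot and total storage figures quoted later for this survey:

```python
# Storage estimate for the 3-D survey above, at 4 bytes per sample.
nt, nrx, nry = 2501, 500, 500   # time samples, receivers in x and y
nsx, nsy = 85, 85               # shots in x and y
bytes_per_sample = 4            # assumed single precision

shot_bytes = nt * nrx * nry * bytes_per_sample
total_bytes = shot_bytes * nsx * nsy

shot_gb = shot_bytes / 1e9      # ~2.5 GB per shot record
total_tb = total_bytes / 1e12   # ~18 TB for all 85 x 85 shots
```

The arithmetic lands on ~2.5 GB per shot record and ~18 TB overall, matching the resource slide.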
BG 3D Compass model

[Figure: x-direction and y-direction slices; lateral 0-15000 m, depth 0-2000 m]
Source-receiver layout

[Figure: map view; sources on a 150 m x 150 m grid, receivers on a 25 m x 25 m grid in x and y]
Computational resources used

Node partition: 128 GB
Number of nodes: 660

Time & memory usage:
- simulation per 3D shot: 1.5 hours
- cumulative simulation time (85 x 85 shots): 27 hours
- storage of one shot record: 2.5 GB
- storage of all shot records: 18 TB
Running jobs & activated nodes (SENAI Yemoja cluster)
3D shot records

[Figure: 3D shot record cube; Ry direction x Rx direction, nt x nrx x nry = 2500 x 500 x 500]
Simulation estimation

X & Y receiver   X & Y shot    Number of shots   Disk space
spacing (m)      spacing (m)   (X x Y)           (TB)
25               25            500 x 500         610
25               50            250 x 250         153
25               75            165 x 165         67
25               100           125 x 125         38.5
25               125           100 x 100         25
25               150           85 x 85           18
Simultaneous acquisition
Haneet Wason & Shashin Sharan
Performance scaling

Size of 3D survey: 2500 x 500 x 10 x 500 x 50
- number of time samples: 2500
- number of streamers: 10 (with 500 channels each)
- number of shots in x & y direction: 500 x 50

Number of   Number of SPGL1   Recovery time per      Recovery time,
workers     iterations        seismic line (hrs)     all data (days)
20          200               78                     162
50          200               31                     64
100         200               16                     33
500         200               3                      6
Interpolation - Tensor Completion
Curt Da Silva
Hierarchical Tucker format

X : an n₁ × n₂ × n₃ × n₄ tensor

[Diagram: the matricization U⁽¹²⁾, an (n₁n₂) × k₁₂ matrix, reshaped into an n₁ × n₂ × k₁₂ tensor]
[Diagram continued: the reshaped U⁽¹²⁾ factors further into U₁ (n₁ × k₁), U₂ (n₂ × k₂), and a small transfer core B₁₂]
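The idea behind one node of the dimension tree can be sketched with a plain truncated SVD standing in for the Hierarchical Tucker machinery (illustrative only; HT uses nested transfer cores rather than one flat SVD):

```python
import numpy as np

# One node of the dimension tree: matricize a 4-D tensor over dimensions
# (1,2) vs (3,4), then keep a rank-k12 factor U12 for the (1,2) side.
rng = np.random.default_rng(0)
n1, n2, n3, n4, k12 = 6, 5, 4, 3, 2

# Build a tensor that is exactly rank k12 in this matricization
L = rng.standard_normal((n1 * n2, k12))
R = rng.standard_normal((k12, n3 * n4))
X = (L @ R).reshape(n1, n2, n3, n4)

# Matricization X^(12): rows indexed by (i1, i2), columns by (i3, i4)
X12 = X.reshape(n1 * n2, n3 * n4)

# A truncated SVD recovers U12 of size (n1*n2) x k12 and the approximation
U, s, Vt = np.linalg.svd(X12, full_matrices=False)
U12 = U[:, :k12] * s[:k12]
approx = U12 @ Vt[:k12]
```

Because X was built with exact rank k₁₂ in this matricization, the truncated factorization reproduces X⁽¹²⁾ exactly; on real data the truncation error is what the interpolation controls.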
Data set

Generated from the BG Compass model using time stepping

Transformed into frequency slices, ~26 GB in size

85 x 85 sources at 150 m spacing, 500 x 500 receivers at 25 m spacing

90% of receiver pairs removed, on-grid sampling
Tensor interpolation
Parallelized over frequencies, with implicit parallelism via Matlab's calls to LAPACK

20 iterations of Gauss-Newton Hierarchical Tucker interpolation

Each frequency slice takes 13-15 hours to interpolate, ~70-80 GB max memory

Run on the Yemoja cluster in Brazil "out of the box"
HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data vs. subsampled data (input); receiver x and receiver y axes, 50-500]
HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data vs. interpolated data, SNR 19.3 dB]
HT Interpolation - 90% missing receivers
Common source gather - 10 Hz

[Figure: true data vs. difference]
SNR vs Frequency

[Figure: train SNR and test SNR [dB] vs. frequency [Hz], with SNRs ranging from ~18 to 30 dB]
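The SNR values above presumably follow the usual norm-ratio definition; a small sketch (the exact convention used in the talk is an assumption):

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB: 20 log10(||ref|| / ||ref - estimate||)."""
    return 20.0 * np.log10(np.linalg.norm(reference)
                           / np.linalg.norm(reference - estimate))

# usage: shrinking the error norm by 10x gains 20 dB
d = np.ones(100)
d_est = d + 0.1   # uniform 10% error
```

Under this definition, the reported 19.3 dB corresponds to a reconstruction whose error norm is roughly a tenth of the data norm.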
Acknowledgements
This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISEII (CRDPJ 375142-08).
Thank you for your attention