TRANSCRIPT
-
Multi-Block GPU Implementation of a Stokes Equations Solver for Absolute Permeability Computation
Nicolas Combaret, Ph.D. - Software Engineer
FEI Visualization Sciences Group
-
Preamble
GTC 2014
March 26, 2014
• Avizo Fire for Material Sciences:
– Visualization
– Image Processing
– Measures & Quantification
• Compute physical properties:
– Diffusive properties (thermal, electrical, molecular)
– Absolute Permeability
-
Why Absolute Permeability?
(workflow diagram: exploration wells, well analysis, production wells, upscaling, reservoir modeling)
-
Overview
• Stokes Equations for Absolute Permeability
• Solving Stokes Equations
• Out-of-Core GPU implementation
-
Stokes Equations for Absolute Permeability
-
Absolute Permeability
• Measures the ability of a porous medium to transmit a fluid
• Relevant throughout the world of porous media:
– Soils (petroleum, mining, civil engineering)
– Rocks, core sample, core plugs
– Cement, foams, ceramics
– Powders, sands
-
Darcy’s Law
• Empirical law: Q/S = −(k/μ) · (ΔP/L)
To estimate k:
• S (cross-section area), L (sample length) and μ (fluid viscosity) are external parameters
• Q (fluid flow) and ΔP (pressure difference between input and output) need to be computed
-
Stokes Equations
• Simplification of the Navier-Stokes equations for an incompressible, Newtonian fluid in steady-state, laminar flow
• ∇²v − ∇p = 0
  ∇ · v = 0
– v: local fluid velocity
– p: local fluid pressure
• With v known everywhere: Q
• With p known everywhere: ∆P
-
Solving Stokes Equations
-
Stokes Equations Discretization (1)
• Finite volume, explicit discretization
• To compute one v at time step t:
v(t − 1) + p(t − 1) → v(t)   (stencil diagram)
-
Stokes Equations Discretization (2)
• Finite volume, explicit discretization
• To compute one p at time step t:
v(t) + p(t − 1) → p(t)   (stencil diagram)
-
Time and Space Dependency
v(t − 1) + p(t − 1) → v(t)
v(t) + p(t − 1) → p(t)
-
Iterative solver
• Iterative solver with two time steps t − 1 and t
• p(t) depends on v(t)
• Convergence:
– Slow: many iterations are necessary
– Guaranteed: no divergence
1. Initialize data structures
2. Compute v(t)
3. Compute p(t)
4. Convergence? If not, go to step 2
5. Output results
-
First Implementation
• CPU implementation:
– Double indirection approach
• Direct port to the GPU: poor performance
– Too many non-coalesced memory accesses
(diagram: cell values accessed through an indices array, i.e. double indirection)
-
Current Implementation
• Target CUDA GPUs with Compute Capability ≥ 2.0 (Fermi)
• Target workstations with one or two GPUs
• Regular 3D grid
• Velocities and pressures allocated on the GPU
• Each GPU thread computes one value of velocity and pressure
• Error (for convergence) computed on GPU every 100 iterations
-
Results
Data size   CPU time for 100 iterations (s)   GPU time for 100 iterations (s)   Speedup
50³         0.202                             0.341                             0.6
100³        1.854                             0.628                             3.0
200³        16.5                              2.684                             6.1
400³        151.583                           18.454                            8.2
500³        283.129                           26.842                            10.5
GPU: Quadro K6000 / CPU: 2×4 cores
-
Out-of-Core GPU implementation
-
Memory Limit is an Issue
• Max memory on a GPU: up to 12 GB (Quadro K6000, Tesla K40)
• Solver memory consumption:
– 4 unknowns per cell (3 velocity components + 1 pressure)
– Double precision (8 bytes each) = 32 bytes
– 2 time steps for each cell = 64 bytes
– A 1000³ data set = 64 GB
-
Idea
• Divide the data set into blocks that fit in GPU memory
• Overlap block transfers with GPU computation
-
Blocks Transfer Process
(animated diagram over 8 slides: the data set is split into blocks 1–4 in host memory; blocks are uploaded to the GPU one after another, computed, and downloaded back, so that at any time the GPU holds the block being computed plus the blocks in transfer)
-
Challenge
• Covering data transfer with computation:
– Several iterations are computed on each block
– Halo data is transferred (colors refer to the slide's figure):
• Values in the black cell: 2 iterations
• Values in the green cells: only 1 iteration
• Values in the white cells: not computed
-
Challenge (continued)
• Available memory on the GPU is determined at runtime
– Defines the maximal size of a block (1/3 of GPU memory)
• Need to balance:
– Number of iterations = halo size = useless computation
– Number of blocks = number of transfer-compute cycles
-
Result
(profiler timeline: GPU kernel execution overlapping CPU-GPU memory transfers in both directions)
-
Results
Data size   CPU time for 100 iterations (s)   GPU time for 100 iterations (s)   Speedup
50³         0.202                             0.341                             0.6
100³        1.854                             0.628                             3.0
200³        16.5                              2.684                             6.1
400³        151.583                           18.454                            8.2
500³        283.129                           26.842                            10.5
800³        461.86                            161.42                            2.86
1024³       711.89                            238.89                            2.98
-
Conclusion & future work
-
Conclusion
• Implementation of a Stokes equations solver in CUDA
• Able to manage “unlimited” data sizes on one GPU
• Speedup from 3× (out-of-core) to 10× (in-core) compared to the CPU
• CUDA was integrated:
– Into general-purpose software
– Supporting a large number of devices
– Within a limited development time
-
Future work
• Optimize GPU kernels: textures? shared memory?
• Use more GPUs:
– Peer-to-peer memory access if the data fits in memory
– Distribute blocks across several GPUs
• Optimize block division:
– Fewer blocks
– Better overlap of memory copies and compute
-
Acknowledgments
• NVIDIA for training and support:
– François Courteille
– Paulius Micikevicius
– Julien Demouth
-
Thank you for your attention.