TRANSCRIPT
-
Multi-Block GPU Implementation of a Stokes Equations Solver for Absolute Permeability Computation
Nicolas Combaret, Ph.D. - Software Engineer
FEI Visualization Sciences Group
-
Preamble
GTC 2014
March 26, 2014
• Avizo Fire for Material Sciences:
– Visualization
– Image Processing
– Measures & Quantification
• Compute physical properties:
– Diffusive properties (thermal, electrical, molecular)
– Absolute Permeability
-
Why Absolute Permeability?
(workflow diagram: exploration wells, well analysis, production wells, upscaling, reservoir modeling)
-
Overview
• Stokes Equations for Absolute Permeability
• Solving Stokes Equations
• Out-of-Core GPU implementation
-
Stokes Equations for Absolute Permeability
-
Absolute Permeability
• Measures the ability of a porous medium to transmit a fluid
• Relevant throughout the world of porous media:
– Soils (petroleum, mining, civil engineering)
– Rocks, core sample, core plugs
– Cement, foams, ceramics
– Powders, sands
-
Darcy’s Law
• Empirical law: Q/S = −(k/μ) · (ΔP/L)
To estimate k:
• S (cross-section area), L (sample length) and μ (fluid viscosity) are external parameters
• Q (fluid flow) and ΔP (pressure difference between input and output) need to be computed
-
Stokes Equations
• Simplification of the Navier-Stokes equations for an incompressible, Newtonian fluid in steady-state, laminar flow
• ∇²v − ∇p = 0
  ∇ · v = 0
– v: local fluid velocity
– p: local fluid pressure
• With v known everywhere: Q
• With p known everywhere: ∆P
-
Solving Stokes Equations
-
Stokes Equations Discretization (1)
• Finite volume, explicit discretization
• To compute one v at time step t:
v(t − 1) + p(t − 1) → v(t)   (stencil diagram)
-
Stokes Equations Discretization (2)
• Finite volume, explicit discretization
• To compute one p at time step t:
v(t) + p(t − 1) → p(t)   (stencil diagram)
-
Time and Space Dependency
v(t − 1) + p(t − 1) → v(t)
v(t) + p(t − 1) → p(t)
-
Iterative solver
• Iterative solver with two time steps t − 1 and t
• p(t) depends on v(t)
• Convergence:
– Slow: many iterations are necessary
– Guaranteed: no divergence
1. Initialize data structures
2. Compute v(t)
3. Compute p(t)
4. Convergence? If not, go to step 2
5. Output results
-
First Implementation
• CPU implementation:
– Double indirection approach
• Direct port to the GPU: poor performance
– Too many non-coalesced memory accesses
(diagram: cell values accessed through an indices array, i.e. double indirection)
-
Current Implementation
• Target CUDA GPUs with Compute Capability ≥ 2.0 (Fermi)
• Target workstations with one or two GPUs
• Regular 3D grid
• Velocities and pressures allocated on the GPU
• Each GPU thread computes one value of velocity and pressure
• Error (for convergence) computed on GPU every 100 iterations
-
Results
Data size   CPU time for 100 iterations (s)   GPU time for 100 iterations (s)   Speedup
50³         0.202                             0.341                             0.6
100³        1.854                             0.628                             3.0
200³        16.5                              2.684                             6.1
400³        151.583                           18.454                            8.2
500³        283.129                           26.842                            10.5
GPU: Quadro K6000 / CPU: 2×4 cores
-
Out-of-Core GPU implementation
-
Memory Limit is an Issue
• Max memory on a GPU: up to 12 GB (Quadro K6000, Tesla K40)
• Solver memory consumption:
– 4 unknowns per cell (3 velocity components + 1 pressure)
– Double precision (8 bytes each) = 32 bytes
– 2 time steps for each cell = 64 bytes
– A 1000³ data set = 64 GB
-
Idea
• Divide the data set into blocks that fit in GPU memory
• Overlap block transfers with GPU computation
-
Blocks Transfer Process
(animated diagram over 8 slides: the data set is split into blocks 1–4 in host memory; blocks are uploaded to the GPU one after another, computed, and downloaded back, so that at any time the GPU holds the block being computed plus the blocks in transfer)
-
Challenge
• Covering data transfer with computation:
– Several iterations are computed on each block
– Halo data is transferred (colors refer to the slide's figure):
• Values in the black cell: 2 iterations
• Values in the green cells: only 1 iteration
• Values in the white cells: not computed
-
Challenge (continued)
• Available memory on the GPU is determined at runtime
– Defines the maximal size of a block (1/3 of GPU memory)
• Need to balance:
– Number of iterations = halo size = useless computation
– Number of blocks = number of transfer-compute cycles
-
Result
(profiler timeline: GPU kernel execution overlapping CPU-GPU memory transfers in both directions)
-
Results
Data size   CPU time for 100 iterations (s)   GPU time for 100 iterations (s)   Speedup
50³         0.202                             0.341                             0.6
100³        1.854                             0.628                             3.0
200³        16.5                              2.684                             6.1
400³        151.583                           18.454                            8.2
500³        283.129                           26.842                            10.5
800³        461.86                            161.42                            2.86
1024³       711.89                            238.89                            2.98
-
Conclusion & future work
-
Conclusion
• Implementation of a Stokes equations solver in CUDA
• Able to manage “unlimited” data sizes on one GPU
• Speedup from 3× (out-of-core) to 10× (in-core) compared to the CPU
• CUDA was integrated:
– Into general-purpose software
– Supporting a large number of devices
– Within a limited development time
-
Future work
• Optimize GPU kernels: textures? shared memory?
• Use more GPUs:
– Peer-to-peer memory access if the data fits in memory
– Distribute blocks across several GPUs
• Optimize block division:
– Fewer blocks
– Better overlap of memory copies and compute
-
Acknowledgments
• NVIDIA for training and support:
– François Courteille
– Paulius Micikevicius
– Julien Demouth
-
Thank you for your attention.