towards achieving gpu-native adaptive mesh refinement€¦ · daino (2016) the amr algorithm...

Post on 28-Sep-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Towards achieving GPU-native adaptive mesh refinement

Ania Brown

Prof Takayuki Aoki

Why AMR?

AMR is not GPU friendly

• Complicated, time varying data structures

• Can you use AMR and keep GPU performance?My conclusion: yes, but it’s messy

Contents

• Introduction to the algorithm + data structures

• The challenges

• Optimisation possibilities

• The RSE perspective — some lessons learnt

Problem domain

• Stencil calculations on a square structured mesh

• Cell centre values

i, j+1

i+1, ji-1, j i, j

i, j-1

1) Block structured AMR

1) Block structured AMR

1) Block structured AMR

2) Tree based AMRSimulation mesh Refinement representation

Simulation mesh Refinement representation

2) Tree based AMR

Simulation mesh Refinement representation

2) Tree based AMR

Simulation mesh Refinement representation

… …

2) Tree based AMR

Simulation mesh Refinement representation

… …… …

2) Tree based AMR

3) Patches +Tree AMR

Block-structured Enzo CHOMBO

Octree RAMSES

Octree + patches FLASH, using PARAMESH NIRVANA

CPU GPU

Octree + patches GAMER (2011) Daino (2016)

The AMR algorithmInitialize data structures

Create halo regions

Update patch values

Refine/coarsen patches

Output values for visualisation

Main loop:

Update neighbour relations

Initialize data structures

Main loop:

CPU GPU

Create halo regions

Update patch values

Refine/coarsen patches

Output values for visualisation

Update neighbour relations

Update step

1 CUDA block

Update step

• Tune block size for coalesced access

• Zmarching

Initialize data structures

Main loop:

CPU GPU

Create halo regions

Update patch values

Refine/coarsen patches

Output values for visualisation

Update neighbour relations

Initialize data structures

Main loop:

CPU GPU

Create halo regions

Update patch values

Refine/coarsen patches

Output values for visualisation

Update neighbour relations

Ordering leaves

Hilbert curve

At each step:

• Divide space into 4

• Replace each quadrant with rotated or reflected versions of the original curve

• Connect such that that start and end points remain the same

Rules for refinement

Neighbour relations

Leaf nodes:Neighbour indices in each direction Parent index

Parent nodes: Child indices

Create halo regionsFind correct neighbour node

Create halo regionsFind correct neighbour node

Create halo regionsFind correct neighbour node

Create halo regionsCopy halo values

Interpolating halo values

Interpolating halo values

Interpolating halo values

Interpolating halo values

Interpolating halo values

Reducing halo values

Reducing halo values

Main loop:

CPU GPU

Refine/coarsen patch values

Update neighbour relations

Defragment value array

Find patches to coarsen/refine

Coarsen/refine step

Defragment value array refine node

Defragment value array refine node

Defragment value array refine node

Defragment value array coarsen node

Defragment value array coarsen node

Defragment value array coarsen node

Calculate new defragmented position

• Input: for all nodes, flag whether that node is to be refined, coarsened or unchanged

• Refined: +3Coarsened: -3Unchanged: 0

• For each element, sum all preceding elements in the array

• For n nodes, requires n/2 threads and O(log2(n)) serial steps

Multi-GPULoad balancing

Boundaries between subdomains

Node 0

Node 1

Node 2

Boundaries between subdomains

Node 0

Node 1

Node 2

Boundaries between subdomains

How to distribute tree

How to distribute tree

Software design

• Code framework — allow user to edit/add functions for initialisation, resolution criterion, stencil calc

• Code generation — annotated regular data structures

• How much to offer? — cell/node centre, interpolation level, stencil type

Software development process

• Unit testing

• Verification

• Profiling

Code by: T.Shimokawabe, T.Takaki (2011)

Phase field model of dendritic solidification in a binary alloy

7 refinement levels in quad-tree

Regular mesh

Adaptive mesh

Performance testing for dendritic solidification model

256 x 256

L = 1.5⇥ 10�3m

R = 4.5⇥ 10�4m

�xmin = 6⇥ 10�6m

�x

max

= 1.2⇥ 10�5m

8192 x 8192

L = 1.5⇥ 10�3m

R = 4.5⇥ 10�4m

�x

max

= 1.2⇥ 10�5m

�xmin = 1.9⇥ 10�7m

Performance testing for dendritic solidification model

Exec

utio

n tim

e pe

r tim

este

p (m

s)

0

62.5

125

187.5

250

Resolution

1024

2048

3072

4096

5120

6144

7168

8192AMR Regular Mesh Worst case AMR

256512

In summary

• Patch based tree-AMR

• For quick gains, offload update step to GPU

• GPU-native version possible — values on GPU, neighbour relations on CPU

• Likely won’t be a one size fits all fix

Governing PDEs for phase field model

�(x, y, t)

c = (1� �)cL + �cSc(x, y, t)

cLcS

: phase

: liquid concentration: solid concentration

@c@t = r · [DS�rcS +DL(1� �)rcL]

Diffusion Interface anisotropy

Chemical driving force Phase change

@�

@t

= �M

r · (a2r�) +

@

@x

(a@a

@�

x

|r�|2) + @

@y

(a@a

@�

y

|r�|2)

+@

@z

(a@a

@�

z

|r�|2)� S�T

dp(�)

d�

�W

dq(�)

d�

M�

ap(�), q(�)DS , DL

SWT

: mobility: interface anisotropy

: interpolating functionM�

ap(�), q(�)DS , DL

SWT

q(�) : double well function: diffusion in solid, liquid

: entropy of fusion

: temperature: height of double well potential

top related