Heterogeneous Particle-Based Simulation (SIGGRAPH Asia 2011)

Post on 11-Jun-2015


HETEROGENEOUS PARTICLE-BASED SIMULATION

Takahiro Harada, AMD

2 Harada, Heterogeneous Particle-based Simulation

PARTICLE BASED SIMULATION ON THE GPU

Large number of particles

Particles with identical size

– Work granularity is almost the same

– Good for the wide SIMD architecture

Harada et al. 2007


PARTICLE BASED SIMULATION

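Collision

– $f_{\mathrm{collide}} = \sum_j f_{ij}$

Integration

– $\frac{dv}{dt} = \frac{f}{m}$, discretized as $v \mathrel{+}= \frac{f}{m}\,\Delta t$

– $\frac{dx}{dt} = v$, discretized as $x \mathrel{+}= v\,\Delta t$

Acceleration structure is used for efficient collision detection

– Uniform grid → Suited for the GPU

– Less divergence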
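The collision and integration steps on this slide can be sketched in a few lines of Python. This is a minimal 1-D illustration, not the presenter's code: the `Particle` class, the `spring_force` pair force (a linear repulsion between overlapping particles), and its `radius` and `k` parameters are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float  # position (1-D for brevity)
    v: float  # velocity
    m: float  # mass


def spring_force(a, b, radius=0.5, k=100.0):
    """Hypothetical pair force f_ij on `a` from `b`: linear repulsion on overlap."""
    d = a.x - b.x
    overlap = 2.0 * radius - abs(d)
    if overlap <= 0.0:
        return 0.0
    return k * overlap * (1.0 if d >= 0.0 else -1.0)


def collide_force(i, particles, pair_force=spring_force):
    """f_collide for particle i: the sum of f_ij over all other particles j."""
    return sum(pair_force(particles[i], q)
               for j, q in enumerate(particles) if j != i)


def integrate(p, f, dt):
    """v += (f / m) * dt, then x += v * dt, as on the slide."""
    p.v += f / p.m * dt
    p.x += p.v * dt
```

The brute-force sum over all `j` is only for clarity; the acceleration structure mentioned on the slide exists to avoid exactly that loop.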


DIVERGENCE ON SIMD

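SIMD lanes: 0 1 2 3 4 5 6 7

void Kernel()
{
    if (A)
        FuncA();   // lanes where A is true
    else if (B)
        FuncB();   // lanes where A is false and B is true
    else
        FuncC();   // all remaining lanes
}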
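A rough way to reason about the cost of this kernel: when lanes of one SIMD group take different branches, the branches are executed one after another with the inactive lanes masked off. The sketch below uses a deliberately simplified cost model (one pass per distinct branch taken); the function names `pick_branch` and `simd_passes` are made up for this illustration.

```python
def pick_branch(a, b):
    """Mirror the if / else-if / else chain of Kernel() for one lane."""
    if a:
        return "FuncA"
    if b:
        return "FuncB"
    return "FuncC"


def simd_passes(lane_branches):
    """Simplified model: a SIMD group needs one serialized pass per
    distinct branch taken by any of its lanes."""
    return len(set(lane_branches))
```

With eight lanes that all satisfy `A`, the group needs one pass; if the lanes are spread over all three branches, it needs three, and each pass leaves some lanes idle.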


PARTICLE BASED SIMULATION ON THE GPU

Particle collision using a uniform grid

SIMD lanes: 0 1 2 3 4 5 6 7

void Kernel()
{
    prepare();
    // Every lane visits the same 9 grid cells in the same order: no branching.
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}

Cell0 Cell1 Cell2

Cell3 Cell4 Cell5

Cell6 Cell7 Cell8
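The nine `collide(CellN)` calls correspond to a 3×3 neighborhood lookup in a uniform grid. A minimal 2-D sketch follows, assuming the cell size equals the particle diameter so that any touching particle lies in the 3×3 block around a particle's own cell. The names `build_grid` and `neighbors` and the dictionary-based grid are illustrative choices, not the data layout used in the talk.

```python
from collections import defaultdict


def cell_of(pos, cell_size):
    """Integer cell coordinates of a 2-D position."""
    return (int(pos[0] // cell_size), int(pos[1] // cell_size))


def build_grid(positions, cell_size):
    """Map each cell to the indices of the particles inside it."""
    grid = defaultdict(list)
    for idx, pos in enumerate(positions):
        grid[cell_of(pos, cell_size)].append(idx)
    return grid


def neighbors(i, positions, grid, cell_size):
    """Indices of particles closer than cell_size to particle i,
    found by scanning the 3x3 block of cells around it."""
    cx, cy = cell_of(positions[i], cell_size)
    xi, yi = positions[i]
    found = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                if j == i:
                    continue
                xj, yj = positions[j]
                if (xi - xj) ** 2 + (yi - yj) ** 2 < cell_size ** 2:
                    found.append(j)
    return sorted(found)
```

Every particle runs the same nine-cell loop regardless of where it is, which is why this structure maps well to wide SIMD hardware.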


MIXED PARTICLE SIMULATION

Not only small particles

Difficulty for GPUs

– Large particles interact with small particles

– Large-large collision


CHALLENGE

Non-uniform work granularity

– Small-small (SS) collision: uniform → GPU

– Large-large (LL) collision: non-uniform → CPU

– Large-small (LS) collision: non-uniform → CPU


FUSION ARCHITECTURE

CPU and GPU are:

– On the same die

– Much closer

– Efficient data sharing

CPU and GPU are good at different kinds of work

– CPU: serial computation, conditional branches

– GPU: parallel computation

Work can be dispatched accordingly:

– Serial work with varying granularity → CPU

– Parallel work with uniform granularity → GPU


MIXED PARTICLE SIMULATION

Benefit from Fusion Architecture

– Different kinds of work in one simulation

– CPU & GPU work together

– CPU & GPU share data


METHOD


TWO SIMULATIONS

[Figure: two pipelines. Small particles: Build Acc. Structure → SS Collision → S Integration, with buffers Position, Velocity, Force, Grid. Large particles: Build Acc. Structure → LL Collision → L Integration, with buffers Position, Velocity, Force. LS Collision connects the two pipelines.]

CLASSIFY BY WORK GRANULARITY

[Figure: the stages for small and large particles (Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision) and their buffers (Position, Velocity, Force, Grid), grouped into "Uniform Work" and "Non Uniform Work"]

CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR

[Figure: the stages for small and large particles (Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision) and their buffers (Position, Velocity, Force, Grid), with each group assigned to the GPU or the CPU]

DATA SHARING

The grid and the small particle data have to be shared with the CPU for LS collision

– Allocated as a zero copy buffer

[Figure: GPU and CPU stages, with the small particle buffers (Position, Velocity, Force, Grid) shared with LS Collision]

SYNCHRONIZATION

The grid and the small particle data have to be shared with the CPU for LS collision

– Allocated as a zero copy buffer

[Figure: the GPU and CPU pipelines with two Synchronization points around the shared buffers (Position, Velocity, Force, Grid)]

VISUALIZING WORKLOADS

Grid construction can be moved to the end of the pipeline

– Unbalanced workload

[Figure: workload of each stage on the GPU and CPU: Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, L Integration]

LOAD BALANCING

To get better load balancing

– The sync passes the force buffer filled by the CPU to the GPU

– Move the LL collision after the sync

[Figure: workload of each stage on the GPU and CPU after reordering: Build Acc. Structure, SS Collision, S Integration, LL Collision, Synchronization, L Integration, LS Collision]
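The reordering described on this slide can be illustrated with two threads and an event. This is only a sketch of the ordering, not of the real OpenCL pipeline: the main thread stands in for the GPU, a Python list stands in for the shared force buffer, and `run_step`, `ls_forces`, and the log strings are hypothetical names introduced here.

```python
import threading


def run_step(small_force, ls_forces, cpu_log):
    """One step: the CPU thread fills the shared force buffer (LS collision),
    signals the sync, and only then does the LL work."""
    force_ready = threading.Event()

    def cpu_side():
        for i, f in ls_forces.items():   # LS collision writes shared forces
            small_force[i] += f
        cpu_log.append("LS collision")
        force_ready.set()                # the sync: force buffer is complete
        cpu_log.append("LL collision")   # moved after the sync
        cpu_log.append("L integration")

    cpu = threading.Thread(target=cpu_side)
    cpu.start()
    gpu_log = ["build grid", "SS collision"]  # does not depend on the CPU
    force_ready.wait()                   # wait only for the LS forces
    forces_seen = list(small_force)      # S integration reads the buffer
    gpu_log.append("S integration")
    cpu.join()
    return forces_seen, gpu_log
```

Because the GPU side waits only for the LS forces, the LL collision no longer delays it, which is the point of moving that stage behind the sync.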

[Figure: labels "GPU Work" and "CPU Work"]


MULTI THREADING

(4 THREADS)


FURTHER OPTIMIZATION

[Figure: timeline across GPU, CPU0, CPU1, CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, and Synchronization]

1. Not optimized for “Llano”, which is a 4-core CPU

– Only 2 CPU cores were used

– Can use 2 more cores for LS collision

2. LL collision was not optimized

– The CPU waits while the GPU is constructing the grid

– Use the CPU to improve SS collision


OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

Cannot split the work by large particle indices

– More than 1 large particle can collide with a small particle

– Have to lock the memory on write → Inefficient

Prepare a local buffer for each thread

– A buffer storing forces on small particles

– Lock free

Local buffers are merged into one

[Figure: large particles L0, L1 and small particles S0, S1, processed by Thread0, Thread1, Thread2]
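The local-buffer idea can be sketched as follows. Each worker handles a subset of the large particles and accumulates forces on small particles into its own private buffer, so no lock is needed; the buffers are summed afterwards. The `contacts` dictionary (large particle index → list of `(small index, force)` pairs) is a hypothetical stand-in for the real collision test, and scalar forces are used for brevity. Python threads do not give real parallel speedup here; the example only shows the data flow.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_collision_local(large_ids, contacts, n_small):
    """Accumulate forces on small particles into a thread-private buffer."""
    local = [0.0] * n_small
    for large in large_ids:
        for small, force in contacts.get(large, ()):
            local[small] += force      # private buffer: no lock required
    return local


def ls_collision(contacts, n_large, n_small, n_threads=3):
    """Split large particles across threads, then merge the local buffers."""
    chunks = [range(t, n_large, n_threads) for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(
            lambda ids: ls_collision_local(ids, contacts, n_small), chunks))
    # Merge: element-wise sum of all thread-local buffers.
    return [sum(column) for column in zip(*buffers)]
```

The trade-off is memory and a merge pass: each thread needs a buffer as large as the small-particle force array.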

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

[Figure: timeline across GPU, CPU0, CPU1, CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, and Synchronization]

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

[Figure: timeline across GPU, CPU0, CPU1, CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., three LS Collision tasks each followed by a Merge, and two Synchronization points]

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

Spatially coherent memory layout improves cache utilization

As particles move, spatial locality decreases


SPATIAL SORT

Sort particles by spatial location to improve cache utilization

– Z curve
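Sorting along a Z curve means sorting by a key that interleaves the bits of each particle's cell coordinates (a Morton code). A minimal 2-D sketch, assuming non-negative cell coordinates that fit in 16 bits; the function names and the 2-D restriction are choices made for this example.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit sits between each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(cx, cy):
    """Z-curve key of cell (cx, cy): x bits in even positions, y bits in odd."""
    return part1by1(cx) | (part1by1(cy) << 1)


def z_order(positions, cell_size):
    """Particle indices sorted by the Z-curve key of their grid cell."""
    def key(i):
        x, y = positions[i]
        return morton2d(int(x // cell_size), int(y // cell_size))
    return sorted(range(len(positions)), key=key)
```

Reordering the particle arrays by `z_order` places particles from nearby cells near each other in memory, which is what improves cache utilization in the SS collision.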


CHOOSE SORT

Requirements

– A full sort was over budget

– A full sort is not “a must”

– Sorting is an optional computation for performance improvement

– Incremental sort

– Use multiple threads

Solution

– Used a generalized “odd-even transition sort” (commonly called odd-even transposition sort)


BLOCK TRANSITION SORT

Generalized “Odd-even transition sort”

Instead of sorting 2 adjacent elements, sort adjacent 2 blocks

Iterate until convergence

Use a thread to sort 2 adjacent blocks

– 6 blocks for 3 threads

– Radix sort

[Figure: odd-even transition sort compared with block transition sort]
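The block transition sort described above can be sketched like this. In each pass, disjoint pairs of adjacent blocks are sorted; passes alternate between pairs starting at even and at odd block indices, and the loop stops when a full round changes nothing. This is a sequential illustration under stated assumptions: Python's built-in `sorted` stands in for the per-pair radix sort, the pairs within one pass are independent and could each go to its own thread, and `max_iters` is a hypothetical knob for running only a bounded amount of sorting per frame. The data is assumed to span at least two blocks.

```python
def block_transition_pass(data, block, phase):
    """Sort every pair of adjacent blocks whose first block index has the
    given parity (phase 0: even, phase 1: odd). Returns True if data changed."""
    changed = False
    for lo in range(phase * block, len(data) - block, 2 * block):
        hi = min(lo + 2 * block, len(data))
        merged = sorted(data[lo:hi])   # stand-in for the per-pair radix sort
        if merged != data[lo:hi]:
            data[lo:hi] = merged
            changed = True
    return changed


def block_transition_sort(data, block, max_iters=None):
    """Alternate even and odd passes until nothing changes (or max_iters)."""
    iters = 0
    while max_iters is None or iters < max_iters:
        even = block_transition_pass(data, block, 0)
        odd = block_transition_pass(data, block, 1)
        iters += 1
        if not (even or odd):
            break
    return iters
```

Stopping early leaves the array only partially ordered, which is acceptable here because the sort affects performance rather than correctness.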

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

[Figure: timeline across GPU, CPU0, CPU1, CPU2 showing Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., three LS Collision tasks each followed by a Merge, and two Synchronization points]

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

[Figure: timeline across GPU, CPU0, CPU1, CPU2 showing Build Acc. Structure, SS Collision, S Integ., three LS Collision tasks each followed by a Merge, LL Coll., L Integ., three "S Sorting" tasks, and three Synchronization points]

DEMO

[Figure: demo with labels "GPU Work" and "CPU Work"]



CONCLUSIONS

Realized a simulation that handles variable sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture

– The CPU is used for work with non-identical compute granularity

– The GPU is used for highly parallel work

Memory sharing between the CPU and GPU is the key to efficiency

– Avoids wasteful memory copies


REFERENCE

Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)

Justin Hensley, Takahiro Harada, Chapter X, OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
