Heterogeneous Particle Based Simulation (SIGGRAPH Asia 2011)

HETEROGENEOUS PARTICLE BASED SIMULATION Takahiro Harada, AMD

Upload: takahiro-harada

Post on 11-Jun-2015

TRANSCRIPT

Page 1: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

HETEROGENEOUS PARTICLE BASED SIMULATION

Takahiro Harada, AMD

Page 2: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

2 Harada, Heterogeneous Particle-based Simulation

PARTICLE BASED SIMULATION ON THE GPU

Large number of particles

Particles with identical size

– Work granularity is almost the same

– Well suited to wide SIMD architectures

Harada et al. 2007

Page 3: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

PARTICLE BASED SIMULATION

Collision

$f_{\mathrm{collide}} = f_{ij}$

An acceleration structure is used for efficient collision detection

– Uniform grid → Suited for the GPU

– Less divergence

Integration

$\frac{dv}{dt} = \frac{f}{m}$, discretized as $v \mathrel{+}= \frac{f}{m}\,\Delta t$

$\frac{dx}{dt} = v$, discretized as $x \mathrel{+}= v\,\Delta t$
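
The integration step on this slide can be sketched as follows. This is a minimal illustration, not the author's code: the `Particle` layout and the function name `integrate` are hypothetical, and updating the position with the already-updated velocity is an assumption (the slide only lists the two update rules in that order).

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle record; the field names are illustrative only.
struct Particle {
    float x[3];  // position
    float v[3];  // velocity
    float f[3];  // accumulated collision force
    float mass;
};

// Advance every particle by one time step dt:
//   v += (f / m) * dt, then x += v * dt (assumed to use the updated velocity).
void integrate(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        assert(p.mass > 0.0f);  // f / m is undefined for massless particles
        for (int k = 0; k < 3; ++k) {
            p.v[k] += p.f[k] / p.mass * dt;
            p.x[k] += p.v[k] * dt;
        }
    }
}
```

Every particle runs the same arithmetic with no branching, which is why this stage maps well onto wide SIMD hardware.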

Page 4: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

DIVERGENCE ON SIMD

(Figure: SIMD lanes 0 1 2 3 4 5 6 7)

void Kernel()
{
    if (A)
        FuncA();
    else if (B)
        FuncB();
    else
        FuncC();
}

Page 5: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

PARTICLE BASED SIMULATION ON THE GPU

Particle collision using a uniform grid

(Figure: SIMD lanes 0 1 2 3 4 5 6 7)

void Kernel()
{
    prepare();
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}

(Figure: 3×3 block of grid cells)

Cell0 Cell1 Cell2

Cell3 Cell4 Cell5

Cell6 Cell7 Cell8
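
The nine `collide(Cell…)` calls above visit the 3×3 block of cells around a particle. A minimal 2D sketch of that pattern follows. It is an illustration under stated assumptions, not the kernel from the talk: `Vec2`, `UniformGrid`, the clamped grid bounds, and the linear repulsion force are all hypothetical choices, and the cell size is assumed to be at least one particle diameter so that the 3×3 block covers every possible contact.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

// Uniform grid of dim x dim cells; each cell stores particle indices.
struct UniformGrid {
    float cellSize;
    int dim;
    std::vector<std::vector<int>> cells;

    UniformGrid(float cellSize_, int dim_)
        : cellSize(cellSize_), dim(dim_), cells(dim_ * dim_) {}

    // Map a coordinate to a cell index along one axis, clamped to the grid.
    int coord(float p) const {
        int c = static_cast<int>(std::floor(p / cellSize));
        return c < 0 ? 0 : (c >= dim ? dim - 1 : c);
    }

    void build(const std::vector<Vec2>& pos) {
        for (auto& c : cells) c.clear();
        for (int i = 0; i < static_cast<int>(pos.size()); ++i)
            cells[coord(pos[i].y) * dim + coord(pos[i].x)].push_back(i);
    }
};

// Sum the force on particle i from overlapping particles in the 3x3
// neighborhood. The linear spring model is an assumption for illustration.
Vec2 collide(const UniformGrid& grid, const std::vector<Vec2>& pos,
             int i, float radius, float stiffness) {
    assert(grid.cellSize >= 2.0f * radius);  // 3x3 block must cover all contacts
    Vec2 f{0.0f, 0.0f};
    int cx = grid.coord(pos[i].x), cy = grid.coord(pos[i].y);
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= grid.dim || ny >= grid.dim) continue;
            for (int j : grid.cells[ny * grid.dim + nx]) {
                if (j == i) continue;
                float rx = pos[i].x - pos[j].x, ry = pos[i].y - pos[j].y;
                float dist = std::sqrt(rx * rx + ry * ry);
                float overlap = 2.0f * radius - dist;
                if (overlap > 0.0f && dist > 0.0f) {
                    f.x += stiffness * overlap * rx / dist;
                    f.y += stiffness * overlap * ry / dist;
                }
            }
        }
    }
    return f;
}
```

Because all particles have the same radius, every particle walks the same fixed set of nine cells, so the control flow is nearly identical across SIMD lanes.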

Page 6: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

MIXED PARTICLE SIMULATION

The scene contains more than just small particles

Difficulties for GPUs

– Large particles interact with small particles

– Large-large collision

Page 7: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

CHALLENGE

Non-uniform work granularity

– Small-small (SS) collision: uniform, GPU

– Large-large (LL) collision: non-uniform, CPU

– Large-small (LS) collision: non-uniform, CPU

Page 8: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

FUSION ARCHITECTURE

CPU and GPU are:

– On the same die

– Much closer

– Efficient data sharing

CPU and GPU are good at different kinds of work

– CPU: serial computation, conditional branches

– GPU: parallel computation

Work can be dispatched to the better-suited processor:

– Serial work with varying granularity → CPU

– Parallel work with uniform granularity → GPU

Page 9: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

MIXED PARTICLE SIMULATION

Benefits from the Fusion Architecture

– A simulation contains different kinds of work

– The CPU and GPU work together

– They share data

Page 10: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

METHOD

Page 11: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

TWO SIMULATIONS

(Diagram: two simulation pipelines coupled by LS Collision.)

Small particles: Build Acc. Structure → SS Collision → S Integration

– Data: Position, Velocity, Force, Grid

Large particles: Build Acc. Structure → LL Collision → L Integration

– Data: Position, Velocity, Force

LS Collision connects the two pipelines

Page 12: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

CLASSIFY BY WORK GRANULARITY

(Diagram: the stages of the small particle and large particle simulations regrouped into two classes, Uniform Work and Non Uniform Work.)

– Stages shown: Build Acc. Structure (one per particle set), SS Collision, S Integration, L Integration, LL Collision, LS Collision

– Small particle data: Position, Velocity, Force, Grid

– Large particle data: Position, Velocity, Force

Page 13: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR

(Diagram: the same stages of the small particle and large particle simulations, with each class of work assigned to a processor, GPU or CPU.)

– Stages shown: Build Acc. Structure (one per particle set), SS Collision, S Integration, L Integration, LL Collision, LS Collision

– Small particle data: Position, Velocity, Force, Grid

– Large particle data: Position, Velocity, Force

Page 14: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

DATA SHARING

The grid and the small particle data have to be shared with the CPU for LS collision

– Allocated as zero copy buffers

(Diagram: GPU and CPU stages for small and large particles, with the small particle Position, Velocity, Grid, and Force buffers shown as inputs and outputs of LS Collision.)

– Stages shown: Build Acc. Structure (one per particle set), SS Collision, S Integration, L Integration, LL Collision, LS Collision

– Small particle data: Position, Velocity, Force, Grid

– Large particle data: Position, Velocity, Force

Page 15: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

SYNCHRONIZATION

The grid and the small particle data have to be shared with the CPU for LS collision

– Allocated as zero copy buffers

(Diagram: GPU and CPU stages for small and large particles, now with two Synchronization points between the processors.)

– Stages shown: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision

– Small particle data: Position, Velocity, Force, Grid

– Large particle data: Position, Velocity, Force

– Shared with LS Collision: Position, Velocity, Grid, Force

Page 16: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

VISUALIZING WORKLOADS

(Diagram: GPU and CPU timelines for small and large particles, with one Synchronization point.)

– Stages shown: Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, L Integration

– Small particle data: Position, Velocity, Force, Grid

– Large particle data: Position, Velocity, Force

Grid construction can be moved to the end of the pipeline

– Unbalanced workload

Page 17: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

LOAD BALANCING

To get better load balancing

– The sync exists to pass the force buffer filled by the CPU to the GPU

– Move the LL collision after the sync

(Diagram: GPU and CPU timelines for small and large particles, with one Synchronization point.)

– Stages shown: Build Acc. Structure, SS Collision, S Integration, LL Collision, L Integration, LS Collision

– Small particle data: Position, Velocity, Force, Grid

– Large particle data: Position, Velocity, Force

Page 18: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

(Figure: GPU Work and CPU Work)

Page 19: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

MULTI THREADING (4 THREADS)

Page 20: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

FURTHER OPTIMIZATION

(Diagram: timelines for GPU, CPU0, CPU1, and CPU2, with one Synchronization point. Stages shown: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision.)

1. Not optimized for “Llano”, which is a 4-core CPU

– Only 2 CPU cores were used

– Can use 2 more cores for LS collision

2. LL collision was not optimized

– The CPU waits while the GPU is constructing the grid

– Use the CPU to improve SS collision

Page 21: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

The work cannot simply be split by large particle index

– More than 1 large particle can collide with the same small particle

– Writes would have to lock the memory → Inefficient

Prepare a local buffer for each thread

– A buffer storing the force on small particles

– Lock free

The local buffers are then merged into one

(Figure: large particles L0 and L1, small particles S0 and S1, and threads Thread0, Thread1, Thread2.)
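
The per-thread buffer idea can be sketched as follows. This is a minimal illustration, not the implementation from the talk: the `Contact` record, the 1D force, the function name `accumulateLsForces`, and the modulo assignment of large particles to threads are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// One large-small contact: large particle `large` applies `force` to small
// particle `small` (1D for brevity). All names here are hypothetical.
struct Contact { int large; int small; float force; };

// Accumulate forces on small particles using one private buffer per thread,
// then merge the private buffers into a single force array.
std::vector<float> accumulateLsForces(const std::vector<Contact>& contacts,
                                      int numSmall, int numThreads) {
    assert(numThreads > 0 && numSmall >= 0);
    std::vector<std::vector<float>> local(
        numThreads, std::vector<float>(numSmall, 0.0f));

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t owns the large particles with index % numThreads == t
            // and writes only to local[t], so no lock is needed even when
            // two large particles touch the same small particle.
            for (const Contact& c : contacts)
                if (c.large % numThreads == t)
                    local[t][c.small] += c.force;
        });
    }
    for (std::thread& w : workers) w.join();

    // Merge step: sum the private buffers into the shared force buffer.
    std::vector<float> force(numSmall, 0.0f);
    for (int t = 0; t < numThreads; ++t)
        for (int s = 0; s < numSmall; ++s)
            force[s] += local[t][s];
    return force;
}
```

The cost of this design is extra memory (one force buffer per thread) and the merge pass, traded against removing all locking from the collision loop.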

Page 22: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

(Diagram: timelines for GPU, CPU0, CPU1, and CPU2, with one Synchronization point. Stages shown: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision.)

Page 23: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

(Diagram: timelines for GPU, CPU0, CPU1, and CPU2, with two Synchronization points. LS Collision now appears three times, and Merge appears three times. Other stages shown: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ.)

Page 24: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

Spatially coherent memory layout improves cache utilization

As particles move, spatial locality decreases

Page 26: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

SPATIAL SORT

Sort particles by spatial location to improve cache utilization

– Z curve
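
A Z curve (Morton order) key interleaves the bits of a particle's cell coordinates, so that sorting by the key places spatially nearby cells close together in memory. The sketch below is a 2D illustration only: the function names `spreadBits` and `mortonKey` and the 16-bit-per-axis limit are assumptions, and the slides do not say how the keys were computed in the actual demo.

```cpp
#include <cassert>
#include <cstdint>

// Spread the lower 16 bits of v so that a zero bit sits between each
// pair of original bits (bit k moves to bit 2k).
std::uint32_t spreadBits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 2D Z curve key: x bits go to even positions, y bits to odd positions.
std::uint32_t mortonKey(std::uint32_t cellX, std::uint32_t cellY) {
    assert(cellX < 65536u && cellY < 65536u);  // 16 bits per axis
    return spreadBits(cellX) | (spreadBits(cellY) << 1);
}
```

For example, the four cells (0,0), (1,0), (0,1), (1,1) receive keys 0, 1, 2, 3, which traces the "Z" shape that gives the curve its name.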

Page 28: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

CHOOSE SORT

Requirements

– A full sort was over the time budget

– A full sort is not “a must”

– Sorting is an optional computation for performance improvement

– Incremental sort

– Use multiple threads

Solution

– Used a generalized “Odd-even transition sort” (commonly known as odd-even transposition sort)

Page 29: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

BLOCK TRANSITION SORT

Generalized “Odd-even transition sort”

Instead of sorting 2 adjacent elements, sort 2 adjacent blocks

Iterate until convergence

Use one thread to sort each pair of adjacent blocks

– 6 blocks for 3 threads

– Radix sort

(Figure: Odd-even transition sort compared with Block transition sort)
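
The block-based scheme described on this slide can be sketched serially as follows. This is a hedged illustration, not the author's code: the function names are hypothetical, `std::sort` stands in for the radix sort named on the slide, and the `maxPasses` parameter is an assumption added to show how the sort could be run incrementally over several frames. Within one phase the block pairs cover disjoint ranges, so each pair could be handed to its own thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One phase: sort each pair of adjacent blocks (b, b+1) for
// b = first, first + 2, ... Returns true if any pair was out of order.
bool sortBlockPairs(std::vector<unsigned>& keys, std::size_t blockSize,
                    std::size_t first) {
    bool changed = false;
    for (std::size_t b = first; (b + 1) * blockSize < keys.size(); b += 2) {
        auto begin = keys.begin() + b * blockSize;
        auto end = keys.begin() + std::min((b + 2) * blockSize, keys.size());
        if (!std::is_sorted(begin, end)) {
            std::sort(begin, end);  // stand-in for the slide's radix sort
            changed = true;
        }
    }
    return changed;
}

// Alternate even and odd phases for at most maxPasses passes.
// Returns true once the keys are fully sorted.
bool blockTransitionSort(std::vector<unsigned>& keys, std::size_t blockSize,
                         int maxPasses) {
    assert(blockSize > 0);
    if (keys.size() <= blockSize) {  // a single block has no neighbor
        std::sort(keys.begin(), keys.end());
        return true;
    }
    for (int pass = 0; pass < maxPasses; ++pass) {
        bool even = sortBlockPairs(keys, blockSize, 0);
        bool odd = sortBlockPairs(keys, blockSize, 1);
        if (!even && !odd) return true;  // converged: every pair is in order
    }
    return std::is_sorted(keys.begin(), keys.end());
}
```

Stopping after a fixed number of passes leaves the data partially ordered, which matches the requirement that the sort is an optional, incremental optimization rather than a correctness step.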

Page 30: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

(Diagram: timelines for GPU, CPU0, CPU1, and CPU2, with two Synchronization points. LS Collision and Merge each appear three times. Other stages shown: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ.)

Page 31: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

(Diagram: timelines for GPU, CPU0, CPU1, and CPU2, with three Synchronization points. LS Collision, Merge, and S Sorting each appear three times. Other stages shown: Build Acc. Structure, SS Collision, S Integ., LL Coll., L Integ.)

Page 32: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

DEMO

(Figure: GPU Work and CPU Work)

Page 34: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

CONCLUSIONS

Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture

– The CPU is used for work with non-identical compute granularity

– The GPU is used for highly parallel work

Memory sharing between the CPU and GPU is the key to efficiency

– Avoids wasteful memory copies

Page 35: Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

REFERENCE

Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)

Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)