Heterogeneous Particle Based Simulation (SIGGRAPH Asia 2011)

HETEROGENEOUS PARTICLE BASED SIMULATION

Takahiro Harada, AMD


PARTICLE BASED SIMULATION ON THE GPU

Large number of particles

Particles with identical size

– Work granularity is almost the same

– Good for the wide SIMD architecture

Harada et al. 2007

PARTICLE BASED SIMULATION

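Collision

$f_{collide} = \sum_j f_{ij}$

Integration

$\frac{dv}{dt} = \frac{f}{m}, \quad \frac{dx}{dt} = v$

$v \mathrel{+}= \frac{f}{m}\,\Delta t, \quad x \mathrel{+}= v\,\Delta t$

Acceleration structure is used for efficient collision detection

– Uniform grid → Suited for the GPU

– Less divergence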
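The collision and integration steps above can be sketched as follows. This is a minimal illustration, not the presenter's code: the linear-spring repulsion in `pair_force`, the stiffness `k`, the 2D setting, and the brute-force loop over all pairs are assumptions made for brevity (the slides use a uniform grid to limit which pairs are tested).

```python
import math

def pair_force(xi, xj, radius, k=100.0):
    """Hypothetical repulsive spring force on particle i from particle j (2D)."""
    dx, dy = xi[0] - xj[0], xi[1] - xj[1]
    dist = math.hypot(dx, dy)
    overlap = 2.0 * radius - dist
    if dist == 0.0 or overlap <= 0.0:
        return (0.0, 0.0)  # not touching, or exactly coincident
    return (k * overlap * dx / dist, k * overlap * dy / dist)

def step(pos, vel, mass, radius, dt):
    """One time step: f_i = sum_j f_ij, then v += f/m * dt, x += v * dt."""
    n = len(pos)
    force = []
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):
            if i != j:
                f = pair_force(pos[i], pos[j], radius)
                fx += f[0]
                fy += f[1]
        force.append((fx, fy))
    # Integrate only after all forces are known, so every particle
    # sees the same positions during the collision phase.
    for i in range(n):
        vx = vel[i][0] + force[i][0] / mass * dt
        vy = vel[i][1] + force[i][1] / mass * dt
        vel[i] = (vx, vy)
        pos[i] = (pos[i][0] + vx * dt, pos[i][1] + vy * dt)
    return pos, vel
```

Every particle runs the same loop body, which is why identically sized particles map well to wide SIMD hardware.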

DIVERGENCE ON SIMD

SIMD lanes: 0 1 2 3 4 5 6 7

void Kernel()
{
    if (A)
        FuncA();
    else if (B)
        FuncB();
    else
        FuncC();
}
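A rough way to see why divergence hurts is the simplified cost model below. It assumes the whole SIMD group executes every branch that at least one lane takes, one branch after another, with inactive lanes masked off. The cycle counts are hypothetical placeholders, not measurements.

```python
def wavefront_cost(lane_branches, branch_cost):
    """Cost of one SIMD group under the simplified model: each distinct
    branch taken by any lane is executed once by the whole group."""
    return sum(branch_cost[b] for b in set(lane_branches))

cost = {"A": 10, "B": 10, "C": 10}  # hypothetical cycle counts per branch

# All 8 lanes take the same branch: only FuncA is executed.
uniform = wavefront_cost(["A"] * 8, cost)

# Lanes disagree: FuncA, FuncB, and FuncC are all executed in turn.
divergent = wavefront_cost(["A", "B", "C", "A", "B", "C", "A", "B"], cost)
```

In this model the divergent group costs three times as much as the uniform one, although each lane does the same amount of useful work.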

PARTICLE BASED SIMULATION ON THE GPU

Particle collision using a uniform grid

SIMD lanes: 0 1 2 3 4 5 6 7

void Kernel()
{
    prepare();
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}

Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
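The grid lookup behind the kernel above can be sketched as below. It is an illustrative 2D version: the function names are hypothetical, and it assumes the cell size is at least one particle diameter, so that every contact lies inside the 3x3 block of cells (Cell0 to Cell8) around a particle.

```python
from collections import defaultdict

def build_grid(pos, cell_size):
    """Map each cell coordinate (cx, cy) to the indices of particles in it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(pos):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid

def neighbors(i, pos, grid, cell_size, diameter):
    """Indices of particles touching particle i, found by scanning the
    3x3 block of cells around it. Every particle scans the same nine
    cells, so the control flow is identical across lanes."""
    cx, cy = int(pos[i][0] // cell_size), int(pos[i][1] // cell_size)
    hits = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                if j != i:
                    ddx = pos[i][0] - pos[j][0]
                    ddy = pos[i][1] - pos[j][1]
                    if ddx * ddx + ddy * ddy < diameter * diameter:
                        hits.append(j)
    return sorted(hits)
```

For example, `neighbors(0, pos, build_grid(pos, 1.0), 1.0, 1.0)` returns the particles within one diameter of particle 0 without testing particles in distant cells.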

MIXED PARTICLE SIMULATION

Not only small particles, but also large ones

Difficult for GPUs

– Large particles interact with small particles

– Large-large collision

CHALLENGE

Non uniform work granularity

– Small-small (SS) collision: uniform → GPU

– Large-large (LL) collision: non uniform → CPU

– Large-small (LS) collision: non uniform → CPU
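The split above can be written as a small lookup. The sketch assumes particles are told apart by a radius threshold; `large_threshold` is a hypothetical parameter, since the slides do not say how large and small particles are distinguished.

```python
def classify_pair(radius_a, radius_b, large_threshold):
    """Return 'SS', 'LL', or 'LS' for a pair of particle radii."""
    a_large = radius_a >= large_threshold
    b_large = radius_b >= large_threshold
    if a_large and b_large:
        return "LL"
    if a_large or b_large:
        return "LS"
    return "SS"

# Processor assignment as listed on the slide.
PROCESSOR = {"SS": "GPU", "LL": "CPU", "LS": "CPU"}
```

`PROCESSOR[classify_pair(0.1, 2.0, 1.0)]` yields `"CPU"`, because a large-small pair has non-uniform work granularity.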

FUSION ARCHITECTURE

CPU and GPU are:

– On the same die

– Much closer to each other

– Able to share data efficiently

CPU and GPU are good at different kinds of work

– CPU: serial computation, conditional branches

– GPU: parallel computation

Work can be dispatched to the better-suited processor:

– Serial work with varying granularity → CPU

– Parallel work with uniform granularity → GPU

MIXED PARTICLE SIMULATION

Benefits from the Fusion Architecture

– Different kinds of work in one simulation

– CPU and GPU work together

– They share data

METHOD

TWO SIMULATIONS

[Figure: two separate simulation pipelines. Small particles (uniform work): Position, Velocity; Acc. Structure, Collision, Integration. Large particles (non uniform work): Position, Velocity; Acc. Structure, Collision, Integration.]

CLASSIFY BY WORK GRANULARITY

[Figure: simulation stages for small and large particles, classified by work granularity. Labels: Position, Velocity, Acc. Structure, Collision, Integration, Force, LL Collision.]

CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR

[Figure: simulation stages for small and large particles, classified by work granularity and assigned to a processor. Labels: Position, Velocity, Acc. Structure, Collision, Integration, Force, LL Collision.]

DATA SHARING

The grid and the small particle data have to be shared with the CPU for LS collision

– Allocated as a zero copy buffer

[Figure: data shared between the small-particle and large-particle pipelines. Labels: Position, Velocity, Acc. Structure, Collision, Integration, Force, LL Collision.]
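The idea of a zero copy buffer can be illustrated with a host-only analogy. This is not GPU code and does not show how the presenter allocated the buffer. It only shows two parties reading and writing one allocation through views, with no transfer in between. The names `cpu_view` and `gpu_view` are stand-ins.

```python
from array import array

# Force buffer for four small particles, allocated once.
force = array("f", [0.0] * 4)

# Two views of the same memory: creating a view copies no data.
cpu_view = memoryview(force)  # stands in for the CPU-side mapping
gpu_view = memoryview(force)  # stands in for the GPU-side mapping

cpu_view[2] = 1.5             # the "CPU" writes an LS collision force
seen_by_gpu = gpu_view[2]     # the "GPU" reads it without a transfer
```

With separate device memory, the same hand-off would need an explicit copy in each direction every frame.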

SYNCHRONIZATION

[Figure: synchronization between the small-particle and large-particle pipelines. Labels: Position, Velocity, Acc. Structure, Collision, Integration.]

VISUALIZING WORKLOADS

[Figure: workload timeline for small particles and large particles. Labels: Position, Velocity, Acc. Structure, Collision, Integration, Force, LL Collision.]

Grid construction can be moved to the end of the pipeline

– Unbalanced workload

LOAD BALANCING

To get better load balancing

– The sync exists to pass the force buffer filled by the CPU to the GPU

– Move the LL collision after the sync

[Figure: rebalanced workload timeline. Labels: Position, Velocity, Acc. Structure, Collision, Integration, Force, LL Collision.]
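The reordering can be sketched with two worker threads standing in for the two processors. Everything here is an assumption made for illustration: the function names, the placeholder force values, the 1D velocities, and in particular what the GPU runs after the sync (integration of small particles), which the slides do not spell out.

```python
from concurrent.futures import ThreadPoolExecutor

def ss_collision(small_v):          # "GPU" work: uniform granularity
    return [0.0 for _ in small_v]   # placeholder SS forces

def ls_collision(small_v, large_v):  # "CPU" work: fills the shared force buffer
    return [0.1 * len(large_v) for _ in small_v]  # placeholder LS forces

def ll_collision(large_v):          # "CPU" work, moved after the sync
    return [0.0 for _ in large_v]   # placeholder LL forces

def integrate(values, forces, dt):
    return [v + f * dt for v, f in zip(values, forces)]

def simulate_step(small_v, large_v, dt):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Phase 1: both processors compute forces on small particles.
        gpu = pool.submit(ss_collision, small_v)
        cpu = pool.submit(ls_collision, small_v, large_v)
        ss, ls = gpu.result(), cpu.result()  # sync: force buffer handed over
        # Phase 2: the CPU is free again, so LL collision runs here
        # instead of delaying the sync.
        total = [a + b for a, b in zip(ss, ls)]
        gpu = pool.submit(integrate, small_v, total, dt)
        cpu = pool.submit(ll_collision, large_v)
        return gpu.result(), integrate(large_v, cpu.result(), dt)
```

Placing `ll_collision` in phase 2 keeps it off the path to the sync point, so the GPU does not wait for work it does not depend on.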

MULTI THREADING (4 THREADS)

FURTHER OPTIMIZATION


1. Not optimized for “Llano”, which has 4 CPU cores

– Only 2 CPU cores were used

– Can use 2 more cores for LS collision

2. LL collision was not optimized

– The CPU waits while the GPU is constructing the grid

– Use the CPU to improve SS collision

OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

Cannot split the work by large particle indices

– More than 1 large particle can collide with the same small particle

– Would have to lock the memory on write → Inefficient

Prepare a local buffer for each thread

– A buffer storing the forces on small particles

– Lock free

Local buffers are merged into one
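The per-thread buffer scheme can be sketched as follows. It is a toy 1D version: the force law (magnitude equal to the overlap), the `(position, radius)` tuples, and the strided split of large particles across threads are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def ls_forces_local(large_chunk, small_pos):
    """Forces on small particles from one chunk of large particles,
    accumulated into a buffer owned by the calling thread only."""
    local = [0.0] * len(small_pos)
    for lx, lr in large_chunk:  # (position, radius) of a large particle
        for s, sx in enumerate(small_pos):
            overlap = lr - abs(sx - lx)
            if overlap > 0.0:
                # Push the small particle away from the large one.
                local[s] += overlap if sx >= lx else -overlap
    return local

def ls_forces(large, small_pos, n_threads=3):
    # Split large particles across threads. Two threads may hit the same
    # small particle, but each writes to its own buffer, so no lock is needed.
    chunks = [large[t::n_threads] for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(
            lambda chunk: ls_forces_local(chunk, small_pos), chunks))
    # Merge step: sum the per-thread buffers element-wise.
    return [sum(vals) for vals in zip(*buffers)]
```

The cost of avoiding locks is one extra buffer per thread plus the merge pass over all small particles.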

[Figure: local buffers for Thread0, Thread1, and Thread2. Timeline labels: Acc. Structure, Collision.]


OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

Spatially coherent memory layout improves cache utilization

As particles move, spatial locality decreases

SPATIAL SORT

Sort particles by spatial location to improve cache utilization

– Z curve
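A Z curve ordering can be computed by interleaving the bits of each particle's cell coordinates (a Morton code) and sorting by the result. The sketch below is 2D, assumes non-negative coordinates that fit in 16 bits per axis, and uses a full `sorted` call for clarity. The function names are hypothetical.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit separates each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(cx, cy):
    """Interleave the bits of integer cell coordinates (x in the even bits)."""
    return part1by1(cx) | (part1by1(cy) << 1)

def z_order(pos, cell_size):
    """Particle indices sorted along the Z curve of their grid cells."""
    return sorted(
        range(len(pos)),
        key=lambda i: morton2d(int(pos[i][0] // cell_size),
                               int(pos[i][1] // cell_size)))
```

Particles that are close in space tend to receive close Morton codes, so storing them in this order keeps neighbors close in memory.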

CHOOSE SORT

Requirements

– A full sort was over budget

– A full sort is not “a must”

– Sorting is an optional computation for performance improvement

– Incremental sort

– Use multiple threads

Solution

– Used a generalized “Odd-even transition sort”

BLOCK TRANSITION SORT

Generalized “Odd-even transition sort”

Instead of sorting 2 adjacent elements, sort 2 adjacent blocks

Iterate until convergence

Use one thread to sort each pair of adjacent blocks

– 6 blocks for 3 threads

– Radix sort

[Figure: odd-even transition sort compared with block transition sort. Timeline labels: Acc. Structure, Collision, S (S = Sorting).]
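The block scheme can be sketched as below. The underlying element-wise algorithm is usually called odd-even transposition sort in the literature. In this sketch `sorted` stands in for the radix sort named on the slide, the pairs run in a plain loop rather than on separate threads, and `max_passes` is a hypothetical knob for spreading the sort over several frames. It assumes the input spans at least two blocks.

```python
def block_transition_pass(keys, block_size, phase):
    """One pass: sort each pair of adjacent blocks together.
    phase 0 pairs blocks (0,1), (2,3), ...; phase 1 pairs (1,2), (3,4), ...
    Pairs within a pass are disjoint, so each could go to its own thread.
    Returns True if any element moved."""
    changed = False
    start = phase * block_size
    for lo in range(start, len(keys) - block_size, 2 * block_size):
        hi = min(lo + 2 * block_size, len(keys))
        merged = sorted(keys[lo:hi])  # stand-in for the radix sort
        if merged != keys[lo:hi]:
            keys[lo:hi] = merged
            changed = True
    return changed

def block_transition_sort(keys, block_size, max_passes=None):
    """Alternate even and odd passes until two consecutive passes change
    nothing, or stop early after max_passes (incremental sorting)."""
    passes, quiet, phase = 0, 0, 0
    while quiet < 2 and (max_passes is None or passes < max_passes):
        if block_transition_pass(keys, block_size, phase):
            quiet = 0
        else:
            quiet += 1
        phase ^= 1
        passes += 1
    return keys
```

With `block_size=1` this reduces to the element-wise odd-even scheme. Stopping after a few passes leaves the array only partially ordered, which is acceptable here because the sort only affects cache behavior, not correctness.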

CONCLUSIONS

Realized a simulation that handles variable sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture

– The CPU is used for work with non uniform compute granularity

– The GPU is used for highly parallel work

Memory sharing between the CPU and GPU is the key to efficiency

– Avoids wasteful memory copies

REFERENCE

Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)

Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
