a parallel constraint solver for a rigid body simulation (siggraph asia 2011)
Post on 27-Jun-2015
A PARALLEL CONSTRAINT SOLVER
FOR A RIGID BODY SIMULATION
Takahiro Harada, AMD
2 Harada, A Parallel Constraint Solver for a Rigid Body Simulation
INTRODUCTION
RIGID BODY SIMULATION PIPELINE
Broad phase collision detection
– Quick check using bounding volumes
– (A,C)(A,B)(B,C)(B,D)(E,F)
Narrow phase collision detection
– Detailed check using geometry
– (A,B)(B,D)(E,F)
Constraint solve
[Figure: six bodies labeled A to F]
RIGID BODY SIMULATION ON THE GPU
Broad phase collision detection
– Harada et al., Smoothed Particle Hydrodynamics on GPUs (2007)
– Le Grand, Broad-phase Collision Detection with CUDA (2007)
– Liu et al., Real-time Collision Culling of a Million Bodies on Graphics Processing Units (2010)
Narrow phase collision detection
– Sathe, Rigid Body Collision Detection on the GPU (2006)
– Harada et al., Real-time Rigid Body Simulation on GPUs (2007)
– Kipfer, LCP Algorithms for Collision Detection using CUDA (2007)
Constraint solve
– Harada, Real-time Rigid Body Simulation on GPUs (2007)
– Harada, Parallelizing the Physics Pipeline (2009)
– Tonge et al., PhysX GPU Rigid Bodies in Batman (2010)
WHY ISN’T THE SOLVER STRAIGHTFORWARD?
LCP
– Projected Gauss Seidel
Dependency between constraints
PARALLEL SOLVE
Split constraints into batches
Objects are dynamic
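A minimal serial sketch of splitting constraints into batches, assuming each constraint is simply a pair of body identifiers. The function name `create_batches` is hypothetical, and greedy first-fit assignment is an illustrative choice, not necessarily the scheme used in the talk. A constraint joins the first batch in which neither of its bodies already appears, so constraints within one batch share no body and could be solved in parallel.

```python
def create_batches(pairs):
    """Greedy serial batching. pairs is a list of (body_a, body_b) tuples.

    Returns a list of batches, each a list of constraint indices.
    """
    batches = []  # batches[k] = constraint indices in batch k
    used = []     # used[k] = set of bodies already touched by batch k
    for i, (a, b) in enumerate(pairs):
        for k in range(len(batches)):
            if a not in used[k] and b not in used[k]:
                batches[k].append(i)
                used[k].update((a, b))
                break
        else:
            # No existing batch is free of both bodies: open a new one.
            batches.append([i])
            used.append({a, b})
    return batches


# Narrow phase pairs from the pipeline slide: (A,B), (B,D), (E,F)
pairs = [("A", "B"), ("B", "D"), ("E", "F")]
batches = create_batches(pairs)
```

Here (A,B) and (B,D) share body B, so they land in different batches, while (E,F) can go with (A,B).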
BY INTRODUCING BATCHES
Constraints can now be solved in parallel
Batch creation is a serial process
– GPU needs parallelism
So the batches have to be created in parallel
– How?
RIGID BODY SIMULATION ON THE GPU
Broad phase collision detection
– Harada et al., Smoothed Particle Hydrodynamics on GPUs (2007)
– Le Grand, Broad-phase Collision Detection with CUDA (2007)
– Liu et al., Real-time Collision Culling of a Million Bodies on Graphics Processing Units (2010)
Narrow phase collision detection
– Sathe, Rigid Body Collision Detection on the GPU (2006)
– Harada et al., Real-time Rigid Body Simulation on GPUs (2007)
– Kipfer, LCP Algorithms for Collision Detection using CUDA (2007)
Constraint solve
– Harada, Real-time Rigid Body Simulation on GPUs (2007) <- Penalty method
– Harada, Parallelizing the Physics Pipeline (2009) <- Partially serializing batch creation
– Tonge et al., PhysX GPU Rigid Bodies in Batman (2010) <- Global atomics. Many corner cases
METHOD
GOAL
Good performance == Fit to the architecture
DESIGNING FOR GPUS
2 levels of parallelization
– SIMD level
– SIMD lane level
Sync
Share data
Less communication is better
– Inter SIMD
– Inter SIMD lane
The best algorithm for CPUs is not always the best for GPUs
– In order, out of order
[Figure: GPU architecture. Global memory shared by 24 SIMDs, each with 16 ALUs and a local data share (LDS)]
STRATEGY
2 step batch creation
– 1st step: Global split
Localize the problem by splitting pairs into disjoint sets
– 2nd step: Local batch creation
Efficient local operation with streaming data from global memory
GLOBAL SPLIT
Split the pairs by space
Procedures
– Calculate cell index for a pair
– Reorder pairs by cell indices
GPU Radix sort
Each group(cell) is independent
– except for adjacent cells
Ref. for the GPU radix sort: Introduction to GPU Radix Sort
4 independent sets of groups
4 kernel dispatches
– Green
– Orange
– Red
– Blue
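The global split can be sketched as follows, under several assumptions that are not in the slides: a 2D uniform grid, one representative point per pair (for example a contact point), and a parity-based coloring that yields four sets in which no two cells of the same color are adjacent. The names `cell_of`, `color_of`, and `global_split` are hypothetical, Python's `sorted` stands in for the GPU radix sort, and the mapping of color numbers 0 to 3 onto green, orange, red, and blue is arbitrary.

```python
def cell_of(point, cell_size):
    """Integer grid cell containing a 2D point."""
    return (int(point[0] // cell_size), int(point[1] // cell_size))


def color_of(cell):
    """Parity coloring: same-colored cells are never adjacent."""
    return (cell[0] % 2) + 2 * (cell[1] % 2)


def global_split(pair_points, cell_size):
    """Group pair indices by cell, and cells by color.

    Returns {color: {cell: [pair indices]}}. Each color would correspond
    to one kernel dispatch; each cell in it to one independent group.
    """
    # Reorder pairs by cell index (stand-in for the GPU radix sort).
    order = sorted(range(len(pair_points)),
                   key=lambda i: cell_of(pair_points[i], cell_size))
    dispatches = {color: {} for color in range(4)}
    for i in order:
        cell = cell_of(pair_points[i], cell_size)
        dispatches[color_of(cell)].setdefault(cell, []).append(i)
    return dispatches


points = [(0.5, 0.5), (1.5, 0.2), (2.5, 0.5), (0.1, 1.9), (0.9, 0.9)]
split = global_split(points, cell_size=1.0)
```

With this coloring, cells (0,0) and (2,0) share a color and are separated by cell (1,0), so their pairs can be processed in the same dispatch without touching a common neighbor cell at the same time.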
1ST DISPATCH (GREEN)
[Figure: four SIMDs, each with 16 ALUs and an LDS]
2ND DISPATCH (ORANGE)
[Figure: four SIMDs, each with 16 ALUs and an LDS]
3RD DISPATCH (RED)
[Figure: four SIMDs, each with 16 ALUs and an LDS]
4TH DISPATCH (BLUE)
[Figure: four SIMDs, each with 16 ALUs and an LDS]
PROBLEM SOLVED?
Not yet
Need a strategy for each SIMD (64 wide)
Solution
2nd level: Local batch creation
LOCAL BATCH CREATION
Constraints are assigned to a SIMD
Q: How to extract independent batches to utilize the SIMD?
Parallel batch creation doesn’t work
PARALLEL BATCH CREATION FAILURE CASE
[Figure: bodies A to J connected by constraints 0 to 8, with a table of bodies (rows A to J) against constraints (columns 0 to 8). The table is empty]
PARALLEL BATCH CREATION FAILURE CASE
[Figure: bodies A to J, constraints 0 to 8, and the body/constraint table. Marks per row: B: V V; C: V; D: V; E: V; F: V; G: V V; I: V]
PARALLEL BATCH CREATION FAILURE CASE
[Figure: bodies A to J, constraints 0 to 8, and the body/constraint table. Marks per row: B: O X; C: O; D: O; E: O; F: O; G: O X; I: O]
PARALLEL BATCH CREATION FAILURE CASE
[Figure: bodies A to J, constraints 0 to 8, and the body/constraint table. Marks per row: B: O X; C: V O; D: V O; E: V O; F: V V O; G: O X; H: V V; I: V O; J: V]
PARALLEL BATCH CREATION FAILURE CASE
[Figure: bodies A to J, constraints 0 to 8, and the body/constraint table. Marks per row: B: O X; C: X O; D: X O; E: X O; F: X X O; G: O X; H: O X; I: X O; J: X]
LOCAL ATOMICS BATCH CREATION
A one-shot parallel approach doesn’t work
A serial approach is inefficient
Iterative parallel batch creation
– Even if a single pass doesn’t work, the batches get better after a few iterations
– Needs frequent access to the pairs
Utilize the fast LDS
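The iterative idea can be emulated on the CPU with a short sketch. The slides do not spell out the exact scheme, so the bidding rule here is an assumption: in each pass, every still-eligible constraint bids for both of its bodies, the lowest constraint index wins a body (standing in for a local atomic operation in the LDS), and a constraint joins the current batch only if it won both bodies. The name `iterative_batches` and the parameter `passes_per_batch` are hypothetical.

```python
def iterative_batches(pairs, passes_per_batch=4):
    """Assign a batch index to each (body_a, body_b) constraint.

    Emulates lock-step parallel passes: phase 1 lets all candidates bid,
    phase 2 lets all candidates check whether they won both bodies.
    """
    n = len(pairs)
    batch_of = [-1] * n
    batch = 0
    while -1 in batch_of:
        locked = set()  # bodies already used by the current batch
        for _ in range(passes_per_batch):
            candidates = [i for i in range(n)
                          if batch_of[i] == -1
                          and pairs[i][0] not in locked
                          and pairs[i][1] not in locked]
            if not candidates:
                break
            # Phase 1: bid for bodies; the lowest constraint index wins.
            bid = {}
            for i in candidates:
                for body in pairs[i]:
                    bid[body] = min(bid.get(body, i), i)
            # Phase 2: join the batch only if both bids were won.
            for i in candidates:
                a, b = pairs[i]
                if bid[a] == i and bid[b] == i:
                    batch_of[i] = batch
                    locked.update((a, b))
        batch += 1
    return batch_of


# A chain of bodies 0-1-2-3-4 linked by four constraints.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
one_shot = iterative_batches(chain, passes_per_batch=1)
iterated = iterative_batches(chain, passes_per_batch=4)
```

On this chain a single pass per batch admits only one constraint at a time, giving four batches, while a few passes per batch fill each batch further and give two.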
WHAT IF A CELL HAS 1,000,000 PAIRS?
Obviously, they do not fit in the LDS
Likely to happen
Streaming the pairs to LDS
– Like moving a window over a buffer
Procedures
– Fill the local buffer with pairs
– Iterative batch creation
– Flush to global memory
This step only reorders constraints
No additional data output
[Figure: streaming batch creation over three iterations. Pairs move from the global constraint buffer into the local constraint buffer and the local batched buffer: (1) Fill, (2) Batch, (3) Fill, (4) Batch, (5) Fill, (6) Batch, then (7) Flush to the global constraint buffer]
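A simplified sketch of the windowed procedure, assuming a fixed window size that stands for the LDS capacity. It differs from the figure in one respect that should be noted: it flushes after every window rather than once at the end, and it uses a greedy local batching instead of the iterative parallel one. The names `local_batch_ids` and `stream_reorder` are hypothetical. The output is only a reordered list of pairs, in line with the slide's point that no additional data is written.

```python
def local_batch_ids(pairs):
    """Greedy batch id for each pair within one window."""
    used = []  # used[k] = bodies already touched by local batch k
    ids = []
    for a, b in pairs:
        for k, bodies in enumerate(used):
            if a not in bodies and b not in bodies:
                bodies.update((a, b))
                ids.append(k)
                break
        else:
            used.append({a, b})
            ids.append(len(used) - 1)
    return ids


def stream_reorder(global_pairs, window):
    """Reorder pairs window by window so each window is sorted by batch."""
    out = []
    for start in range(0, len(global_pairs), window):
        local = global_pairs[start:start + window]   # fill the local buffer
        ids = local_batch_ids(local)                  # batch creation
        order = sorted(range(len(local)), key=ids.__getitem__)
        out.extend(local[i] for i in order)           # flush to global memory
    return out


cell_pairs = [(0, 1), (1, 2), (3, 4), (2, 3), (5, 6), (0, 5)]
reordered = stream_reorder(cell_pairs, window=3)
```

In the first window, (3,4) moves ahead of (1,2) because it is independent of (0,1) and can share its batch.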
SOLVE A GROUP
The pairs are already sorted by batch
But we need to know where the batch boundaries are
Procedures
– Read pairs to local dispatch buffer
– Check boundary
– Parallel solve
– Repeat until done
Previously, batches were maintained entirely by the CPU
This moves the batch dispatch work to the GPU
[Figure: constraint buffer divided into Batch0 to Batch4, read in chunks of the SIMD width]
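One way the boundary check could work, sketched under the assumption that no batch ids are stored and boundaries are rediscovered from the data: read up to a SIMD width of pairs and stop at the first pair that reuses a body already seen in the current chunk. The function name `dispatch_chunks` is hypothetical, and the solve itself is omitted; each returned chunk is a set of pairs that could be solved in parallel.

```python
def dispatch_chunks(sorted_pairs, simd_width=64):
    """Split batch-sorted pairs into chunks that are safe to solve in parallel.

    A chunk ends when it reaches simd_width pairs or when the next pair
    shares a body with a pair already in the chunk (a batch boundary).
    """
    chunks = []
    start = 0
    while start < len(sorted_pairs):
        seen = set()
        end = start
        while end < len(sorted_pairs) and end - start < simd_width:
            a, b = sorted_pairs[end]
            if a in seen or b in seen:
                break  # boundary: this pair depends on the current chunk
            seen.update((a, b))
            end += 1
        chunks.append(sorted_pairs[start:end])
        start = end
    return chunks


group = [(0, 1), (3, 4), (1, 2), (2, 3), (5, 6), (0, 5)]
chunks = dispatch_chunks(group, simd_width=4)
```

Because the first pair of a chunk is always accepted, the loop advances on every iteration and terminates.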
RESULTS
PIPELINE
Copy body and pair buffers
GPU allocates big buffers
– Contact
– Constraints
Narrow phase and solve are done on the GPU
Don’t have to read back big buffers
[Figure: pipeline diagram. CPU: broad phase collision, merge, dispatch logic. GPU: NP (narrow phase) collision, solve. Buffers shown: Body, Pair, Contact, Constraint, with copy steps for Body and Pair between CPU and GPU]
MOVIE
CONCLUSIONS
Presented a parallel constraint solver
– 2 stage batch creation
– Reduced the number of dispatches from the CPU
– GPU does dispatch by itself
Parallel iterative batch creation improved the batch quality a lot
– It surpassed the quality of single-threaded batch creation after a few iterations
Still room for improvement in SIMD utilization
Integrate GPU broad phase collision detection to complete the GPU pipeline