a parallel constraint solver for a rigid body simulation (siggraph asia 2011)

A PARALLEL CONSTRAINT SOLVER FOR A RIGID BODY SIMULATION Takahiro Harada, AMD

Upload: takahiro-harada

Post on 27-Jun-2015


TRANSCRIPT

Page 1: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

A PARALLEL CONSTRAINT SOLVER

FOR A RIGID BODY SIMULATION

Takahiro Harada, AMD

Page 2: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

2 Harada, A Parallel Constraint Solver for a Rigid Body Simulation

INTRODUCTION

Page 3: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


RIGID BODY SIMULATION PIPELINE

Broad phase collision detection

– Quick check using bounding volumes

– (A,C)(A,B)(B,C)(B,D)(E,F)

Narrow phase collision detection

– Detailed check using geometry

– (A,B)(B,D)(E,F)

Constraint solve
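The difference between the two collision phases can be sketched as below. This is a minimal 2D illustration, not the talk's implementation: the circle bodies, their coordinates, and the function names (`aabb_overlap`, `circles_touch`, `broad_phase`, `narrow_phase`) are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical bodies: name -> (x, y, radius). Circles stand in for real geometry.
BODIES = {"A": (0.0, 0.0, 1.0), "B": (1.5, 1.5, 1.0),
          "C": (10.0, 10.0, 1.0), "D": (1.5, 3.0, 1.0)}

def aabb_overlap(c1, c2):
    """Quick check: do the axis-aligned bounding boxes of two circles overlap?"""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return abs(x1 - x2) <= r1 + r2 and abs(y1 - y2) <= r1 + r2

def circles_touch(c1, c2):
    """Detailed check: do the circles themselves overlap?"""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return (x1 - x2) ** 2 + (y1 - y2) ** 2 <= (r1 + r2) ** 2

def broad_phase(bodies):
    """Candidate pairs from bounding volumes only (brute force for clarity)."""
    return [(a, b) for a, b in combinations(sorted(bodies), 2)
            if aabb_overlap(bodies[a], bodies[b])]

def narrow_phase(bodies, candidates):
    """Keep only the candidate pairs whose actual geometry is in contact."""
    return [(a, b) for a, b in candidates
            if circles_touch(bodies[a], bodies[b])]

candidates = broad_phase(BODIES)
contacts = narrow_phase(BODIES, candidates)
```

As on the slide, the narrow phase only ever removes pairs from the broad-phase list; the surviving pairs are what the constraint solver receives.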

[Figure: six bodies labeled A to F]

Page 4: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


RIGID BODY SIMULATION ON THE GPU

Broad phase collision detection

– Harada et al., Smoothed Particle Hydrodynamics on GPUs (2007)

– Le Grand, Broad-phase Collision Detection with CUDA (2007)

– Liu et al., Real-time Collision Culling of a Million Bodies on Graphics Processing Units (2010)

Narrow phase collision detection

– Sathe, Rigid Body Collision Detection on the GPU (2006)

– Harada et al., Real-time Rigid Body Simulation on GPUs (2007)

– Kipfer, LCP Algorithms for Collision Detection using CUDA (2007)

Constraint solve

– Harada, Real-time Rigid Body Simulation on GPUs (2007)

– Harada, Parallelizing the Physics Pipeline (2009)

– Tonge et al., PhysX GPU Rigid Bodies in Batman (2010)

Page 5: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

WHY ISN’T THE SOLVER STRAIGHTFORWARD?

LCP

– Projected Gauss Seidel

Dependency between constraints
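To make the dependency concrete, here is a minimal sketch of projected Gauss-Seidel on a small LCP (find x >= 0 with Ax + b >= 0 and x·(Ax + b) = 0). The function name, the dense-list matrix layout, and the iteration count are assumptions for illustration; a rigid body solver would apply the same update to per-constraint impulses.

```python
def projected_gauss_seidel(A, b, iterations=50):
    """Approximately solve the LCP: x >= 0, A x + b >= 0, x . (A x + b) = 0.

    Each row update reads the newest values of x, including entries written
    earlier in the same sweep. That read-after-write chain is the dependency
    between constraints that prevents solving all rows at once.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            residual = b[i] + sum(A[i][j] * x[j] for j in range(n))
            # Gauss-Seidel step for row i, then projection onto x_i >= 0.
            x[i] = max(0.0, x[i] - residual / A[i][i])
    return x

A = [[2.0, 1.0], [1.0, 2.0]]
x_active = projected_gauss_seidel(A, [-1.0, -1.0])   # both rows end up active
x_clamped = projected_gauss_seidel(A, [-1.0, 1.0])   # second row is clamped to 0
```

Two rows that share no unknowns could be updated in either order, or at the same time, without changing the result; that observation is what batching exploits.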

Page 6: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


PARALLEL SOLVE

Split constraints into batches

Objects are dynamic
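A batch is a set of constraints in which no body appears twice, so every constraint in the batch can be solved at the same time. A minimal serial sketch of batch creation follows; the function name `create_batches` and the greedy strategy are assumptions for illustration, and the example input is the contact pair list from the pipeline slide.

```python
def create_batches(pairs):
    """Greedily split body pairs into batches with no shared body per batch."""
    batches = []
    remaining = list(pairs)
    while remaining:
        used = set()            # bodies already touched by this batch
        batch, deferred = [], []
        for a, b in remaining:
            if a in used or b in used:
                deferred.append((a, b))   # conflicts, retry in a later batch
            else:
                batch.append((a, b))
                used.update((a, b))
        batches.append(batch)
        remaining = deferred
    return batches

batches = create_batches([("A", "B"), ("B", "D"), ("E", "F")])
```

Here (A,B) and (E,F) land in the first batch, while (B,D) shares body B with (A,B) and is pushed to a second batch. Note that the loop itself is serial: each decision depends on the `used` set built by the previous ones.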

Page 7: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

BY INTRODUCING BATCHES

Constraints can now be solved in parallel

Batch creation is a serial process

– The GPU needs parallelism

But batches have to be created in parallel

– How?

Page 8: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


RIGID BODY SIMULATION ON THE GPU

Broad phase collision detection

– Harada et al., Smoothed Particle Hydrodynamics on GPUs (2007)

– Le Grand, Broad-phase Collision Detection with CUDA (2007)

– Liu et al., Real-time Collision Culling of a Million Bodies on Graphics Processing Units (2010)

Narrow phase collision detection

– Sathe, Rigid Body Collision Detection on the GPU (2006)

– Harada et al., Real-time Rigid Body Simulation on GPUs (2007)

– Kipfer, LCP Algorithms for Collision Detection using CUDA (2007)

Constraint solve

– Harada, Real-time Rigid Body Simulation on GPUs (2007) <- Penalty method

– Harada, Parallelizing the Physics Pipeline (2009) <- Partially serializing batch creation

– Tonge et al., PhysX GPU Rigid Bodies in Batman (2010) <- Global atomics. Many corner cases

Page 9: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


METHOD

Page 10: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


GOAL

Good performance == Fit to the architecture

Page 11: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


DESIGNING FOR GPUS

2 levels of parallelization

– SIMD level

– SIMD lane level

Sync

Share data

Less communication is better

– Inter SIMD

– Inter SIMD lane

The best algorithm for CPUs is not always the best for GPUs

– In order, out of order

[Figure: GPU architecture. Global memory shared by 24 SIMDs, each with 16 ALUs and its own LDS (local data share)]

Page 12: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


STRATEGY

2 step batch creation

– 1st step: Global split

Localize the problem by splitting pairs into disjoint sets

– 2nd step: Local batch creation

Efficient local operation with streaming data from global memory

Page 13: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


GLOBAL SPLIT

Split the pairs by space

Procedures

– Calculate cell index for a pair

– Reorder pairs by cell indices

GPU Radix sort

Each group (cell) is independent

– except for adjacent cells
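The two procedures above can be sketched as follows. This is a 2D illustration under stated assumptions: the slides do not say which point of a pair is used to pick its cell, so the midpoint of the two body centers is assumed here; coordinates are assumed non-negative; and Python's `sorted` stands in for the GPU radix sort. The names `cell_index` and `global_split` are hypothetical.

```python
def cell_index(point, cell_size, grid_width):
    """Flatten the 2D grid cell containing `point` into a single sort key."""
    cx = int(point[0] // cell_size)
    cy = int(point[1] // cell_size)
    return cy * grid_width + cx

def global_split(pairs, centers, cell_size, grid_width):
    """Return (cell, pair) tuples ordered by cell index.

    The pair's cell is taken from the midpoint of its two body centers,
    which is an assumption of this sketch.
    """
    def key(pair):
        (ax, ay), (bx, by) = centers[pair[0]], centers[pair[1]]
        midpoint = ((ax + bx) / 2.0, (ay + by) / 2.0)
        return cell_index(midpoint, cell_size, grid_width)

    # A stable sort keeps the original order of pairs inside one cell.
    return sorted(((key(p), p) for p in pairs), key=lambda kp: kp[0])

centers = {"A": (0.5, 0.5), "B": (1.0, 0.5), "D": (1.0, 1.5),
           "E": (5.0, 5.0), "F": (5.5, 5.0)}
split = global_split([("E", "F"), ("A", "B"), ("B", "D")],
                     centers, cell_size=2.0, grid_width=4)
```

After the sort, every cell's pairs occupy one contiguous range of the buffer, so one SIMD can be handed one range.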

Page 14: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


GLOBAL SPLIT

Split the pairs by space

Procedures

– Calculate cell index for a pair

– Reorder pairs by cell index

GPU Radix sort

Ref: Introduction to GPU Radix Sort

4 independent sets of groups

4 kernel dispatches

– Green

– Orange

– Red

– Blue
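One way to obtain four independent sets on a 2D grid is a 2x2 coloring, sketched below. The parity formula, the function names, and the numbering of colors 0 to 3 are assumptions of this sketch; which number corresponds to green, orange, red, or blue is not stated in the slides.

```python
def cell_color(cx, cy):
    """2x2 coloring: cells with the same color are never adjacent,
    not even diagonally."""
    return (cx % 2) + 2 * (cy % 2)

def dispatch_order(cells):
    """Group cells by color; each group can be processed in one dispatch."""
    groups = {0: [], 1: [], 2: [], 3: []}
    for cx, cy in cells:
        groups[cell_color(cx, cy)].append((cx, cy))
    return [groups[color] for color in range(4)]

grid = [(cx, cy) for cy in range(4) for cx in range(4)]
dispatches = dispatch_order(grid)

def neighbors_share_color(cells):
    """True if any two adjacent cells (8-neighborhood) share a color."""
    cell_set = set(cells)
    for cx, cy in cells:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                other = (cx + dx, cy + dy)
                if other != (cx, cy) and other in cell_set:
                    if cell_color(*other) == cell_color(cx, cy):
                        return True
    return False
```

Because same-colored cells are at least one cell apart, they share no bodies as long as a body cannot span more than the gap between them, which is an additional assumption about cell size.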

Page 15: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

1ST DISPATCH (GREEN)

[Figure: four SIMDs, each with 16 ALUs and an LDS]

Page 16: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

2ND DISPATCH (ORANGE)

[Figure: four SIMDs, each with 16 ALUs and an LDS]

Page 17: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

3RD DISPATCH (RED)

[Figure: four SIMDs, each with 16 ALUs and an LDS]

Page 18: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

4TH DISPATCH (BLUE)

[Figure: four SIMDs, each with 16 ALUs and an LDS]

Page 19: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


PROBLEM SOLVED?

Not yet

Need a strategy for each SIMD (64 wide)

Solution

2nd level: Local batch creation

Page 20: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


LOCAL BATCH CREATION

Constraints are assigned for a SIMD

Q: How to extract the independent batches to utilize SIMD?

Parallel batch creation doesn’t work

0 1 2 3 0 1 2 3

Page 21: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

[Figure: graph with nodes labeled A to J and 0 to 8; empty table with rows A to J and columns 0 to 8]

PARALLEL BATCH CREATION FAILURE CASE

Page 22: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

[Figure: graph with nodes labeled A to J and 0 to 8; table with rows A to J and columns 0 to 8]

A

B V V

C V

D V

E V

F V

G V V

H

I V

J

PARALLEL BATCH CREATION FAILURE CASE

Page 23: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

[Figure: graph with nodes labeled A to J and 0 to 8; table with rows A to J and columns 0 to 8]

A

B O X

C O

D O

E O

F O

G O X

H

I O

J

PARALLEL BATCH CREATION FAILURE CASE

Page 24: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

[Figure: graph with nodes labeled A to J and 0 to 8; table with rows A to J and columns 0 to 8]

A

B O X

C V O

D V O

E V O

F V V O

G O X

H V V

I V O

J V

PARALLEL BATCH CREATION FAILURE CASE

Page 25: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

[Figure: graph with nodes labeled A to J and 0 to 8; table with rows A to J and columns 0 to 8]

A

B O X

C X O

D X O

E X O

F X X O

G O X

H O X

I X O

J X

PARALLEL BATCH CREATION FAILURE CASE

Page 26: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


LOCAL ATOMICS BATCH CREATION

Parallel approach doesn’t work

Serial approach is inefficient

Iterative Parallel Batch creation

– Even if one shot doesn’t work, it will get better after a few iterations

– Need frequent access to the pairs

Utilize fast LDS
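The idea that repeated parallel attempts improve a batch can be sketched with a sequential simulation. This is not the talk's kernel: the claiming rule (the lowest constraint id wins a contested body, imitating an atomic min in LDS), the lockstep rounds, and the names `claim_round` and `iterative_batches` are all assumptions made for illustration.

```python
def claim_round(constraints, candidates, locked):
    """One simulated lockstep round of parallel claiming.

    Every candidate whose bodies are still free tries to claim both of them;
    a contested body goes to the lowest constraint id. A candidate wins only
    if it ends up owning all of its bodies.
    """
    eligible = [cid for cid in candidates
                if not any(body in locked for body in constraints[cid])]
    owner = {}
    for cid in eligible:
        for body in constraints[cid]:
            owner[body] = min(owner.get(body, cid), cid)
    return [cid for cid in eligible
            if all(owner[body] == cid for body in constraints[cid])]

def iterative_batches(constraints, iterations):
    """Build batches of constraint ids, running `iterations` rounds per batch."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    unassigned = list(range(len(constraints)))
    batches = []
    while unassigned:
        locked, batch = set(), []
        candidates = list(unassigned)
        for _ in range(iterations):
            winners = claim_round(constraints, candidates, locked)
            if not winners:
                break
            for cid in winners:
                batch.append(cid)
                locked.update(constraints[cid])
            candidates = [c for c in candidates if c not in winners]
        batches.append(batch)
        unassigned = [c for c in unassigned if c not in batch]
    return batches

# A chain of bodies A-B-C-D-E: neighbors always contend for a shared body.
chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
one_shot = iterative_batches(chain, iterations=1)
two_rounds = iterative_batches(chain, iterations=2)
```

On this chain a single round admits only one constraint per batch, which degenerates to four serial batches, while a second round fills each batch with a second independent constraint and halves the batch count. Every round re-reads all remaining pairs, which is why keeping them in fast local memory matters.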

Page 27: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)

WHAT IF A CELL HAS 1,000,000 PAIRS??

Obviously, they do not fit in LDS

Likely to happen

Streaming the pairs to LDS

– Like moving a window over a buffer

Procedures

– Fill the local buffer with pairs

– Iterative batch creation

– Flush to global memory

This step only reorders constraints

No additional data output
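The fill, batch, flush loop can be sketched as a sliding window over the pair buffer. In this sketch the window size plays the role of the LDS capacity, a serial greedy pass stands in for the iterative parallel batch creation, and the name `stream_reorder` is hypothetical. Consistent with the slide, the only output is the reordered pair list.

```python
def stream_reorder(global_buffer, window_size):
    """Reorder pairs so that independent pairs sit next to each other,
    looking at no more than `window_size` pairs at a time."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    reordered = []      # stands in for the global buffer being rewritten
    window = []         # stands in for the local constraint buffer
    next_read = 0
    while next_read < len(global_buffer) or window:
        # Fill: top up the window from the global buffer.
        while len(window) < window_size and next_read < len(global_buffer):
            window.append(global_buffer[next_read])
            next_read += 1
        # Batch: pull one set of independent pairs out of the window.
        used, batch, rest = set(), [], []
        for a, b in window:
            if a in used or b in used:
                rest.append((a, b))
            else:
                batch.append((a, b))
                used.update((a, b))
        # Flush: append the batch; leftovers stay in the window.
        reordered.extend(batch)
        window = rest
    return reordered

chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
wide = stream_reorder(chain, window_size=4)
narrow = stream_reorder(chain, window_size=2)
```

A wider window sees more pairs at once and can form larger batches; with a window of two, the chain cannot be improved at all and keeps its original order.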

[Figure: streaming batch creation over three iterations. Each iteration fills the Local Constraint Buffer from the Global Constraint Buffer and batches into the Local Batched Buffer: (1) Fill, (2) Batch, (3) Fill, (4) Batch, (5) Fill, (6) Batch, then (7) Flush to the Global Constraint Buffer]

Page 28: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


SOLVE A GROUP

The pairs are already sorted by batch

But we need to know where the batch boundaries are

Procedures

– Read pairs to local dispatch buffer

– Check boundary

– Parallel solve

– Repeat until done
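Since the reordered buffer carries no batch markers, the boundary has to be found while solving. One possible boundary check is sketched below: read up to a SIMD width of pairs and stop at the first pair that reuses a body. The conflict test, the callback `solve`, and the name `solve_group` are assumptions of this sketch, and the inner loop that would run across SIMD lanes is written serially.

```python
def solve_group(pairs, simd_width, solve):
    """Solve batch-sorted pairs, detecting batch boundaries on the fly.

    Returns the number of pairs solved in each step, for inspection only.
    """
    step_sizes = []
    start = 0
    while start < len(pairs):
        # Read pairs into the local dispatch buffer.
        chunk = pairs[start:start + simd_width]
        # Check boundary: longest prefix in which no body repeats.
        used, count = set(), 0
        for a, b in chunk:
            if a in used or b in used:
                break
            used.update((a, b))
            count += 1
        # Parallel solve: on a GPU each lane would take one of these pairs.
        for pair in chunk[:count]:
            solve(pair)
        step_sizes.append(count)
        start += count
    return step_sizes

solved = []
sorted_pairs = [("A", "B"), ("C", "D"), ("B", "C"), ("D", "E")]
steps = solve_group(sorted_pairs, simd_width=64, solve=solved.append)
steps_width_1 = solve_group(sorted_pairs, simd_width=1, solve=lambda p: None)
```

The first pair of a chunk never conflicts, so every step solves at least one pair and the loop always finishes.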

Previously, batches were maintained entirely by the CPU

This moves batch dispatch work to the GPU

[Figure: Constraint Buffer divided into Batch0 to Batch4, shown against the SIMD width]

Page 29: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


RESULTS

Page 30: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


PIPELINE

Copy body and pair buffer

GPU allocates big buffers

– Contact

– Constraints

Narrow phase and solve are done on the GPU

Don’t have to read back big buffers

[Figure: CPU/GPU pipeline. CPU: Broad phase Collision, Merge, Dispatch Logic. GPU: NP Collision, Solve. Buffers: Body, Pair, Contact, Constraint, with copies of the Body and Pair buffers between CPU and GPU]

Page 31: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


MOVIE

Page 32: A Parallel Constraint Solver for a Rigid Body Simulation (SIGGRAPH ASIA 2011)


CONCLUSIONS

Presented parallel constraint solver

– 2 stage batch creation

– Reduced the number of dispatches from the CPU

– GPU does dispatch by itself

Parallel iterative batch creation improved the batch quality a lot

– It surpassed the quality of single-threaded batch creation after a few iterations

Still room for improvement for SIMD utilization

Integrate GPU broadphase collision detection to complete GPU pipeline