internals of the cunbody-1 library: particle/force decomposition … · 2007. 11. 11. · tsuyoshi...

16
Tsuyoshi Hamada, AstroGPU2007 1 Internals of the CUNBODY-1 library: particle/force decomposition and reduction Tsuyoshi Hamada, Yousuke Ohno, Gentaro Morimoto, Makoto Taiji High-Performance Molecular Simulation Team Genomic Sciences Center, RIKEN Toshiaki Iitaka, Computational Astrophysics Laboratory, RIKEN Keigo Nitadori Department of Astronomy, University of Tokyo

Upload: others

Post on 29-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

1

Internals of the CUNBODY-1 library: particle/force decomposition and reduction

Tsuyoshi Hamada, Yousuke Ohno, Gentaro Morimoto, Makoto Taiji

High-Performance Molecular Simulation TeamGenomic Sciences Center, RIKEN

Toshiaki Iitaka,Computational Astrophysics Laboratory,RIKEN

Keigo NitadoriDepartment of Astronomy, University of Tokyo

Page 2: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

2

GPU computing at RIKEN Motivation

Accelerating billions of particle simulations such as

Cosmological N-body simulation

Large-scale molecular dynamics simulation

using cost-effective hardware.

Target code

PPPM (Particle-Particle Particle-Mesh)

PME (Particle Mesh Ewald)

TreePM (Tree Particle-Mesh)

Page 3: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

3

N-body simulation Particles are interacting with each other

Particle

stars (star cluster)

galaxies (cluster of galaxies)

atom s(protein, metal, crystal, etc)

Computation cost in naive algorithm)( 2NO

particle

interaction

Page 4: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

4

CUNBODY-1 library CUda NBODY version 1 library

The first implementation of accelerating particle-particle interaction in N-body simulation using

CUDA

GeForce8800GTX

Host ComputerGPU

(GeForce8800GTX)

time-integration, etc interaction between xi and xj

(Hamada & Iitaka, Astro-ph 2007)

The basic idea is the same as GRAPE systems

ij

jii xxgf ),(

ji xx

,

if

You can get source code from http://progrape.jp/cs/

Page 5: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

5

History of N-body with GPU

0

50

100

150

200

250

300

350

400

L.Nyland

(2004)

M. Harris

(2005)

M. Harris

(2005)

S. P. Zwart

(2007)

S. P. Zwart

(2007)

M. Harris

(2007)

T.Hamada

(2007)

Gfl

ops

Page 6: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

6

The key to breakthrough –GeForce8800GTX + CUDA

GeForce8800GTX

8192 thread s in maxmum

345 Gflops peak

Using CUDA software

We can use scatter memory operation on GPU.

GPU Threads (~ 8192)

Memory

d0 d0 d0 d0 d0 d0 d0 d0

Scatter memory operation

Page 7: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

7

Practice : a simple test- direct summation GeForce8800GTX + Core2Duo E4400

131 k particles

2000 shared time steps

1 hours

about 300 Gflops

http://progrape.jp/cs/ or YouTubePlummer sphere, 131072 particles

You can get source code from http://progrape.jp/cs/

Page 8: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

8

Details of parallelization

j 0 1 2 3 4 5 6 ...i0 → GPU Thread 01 → GPU Thread 12 → GPU Thread 23 Fij → GPU Thread 3

4 → GPU Thread 45 → GPU Thread 56 → GPU Thread 6...

→ GPU Thread 20472047

All 2048 threads are calculating forces on different i-particles

→ Particle Decomposition (i-parallelization)

Page 9: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

9

Practice: Hierarchical Tree Algorithm

O(N log N) scaling with particle number

particles with a long distance are grouped

using hierarchical oct-tree structure

faster than direct summation for large number of particles

distant j-particles

i-particles

Page 10: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

10

Almost all GPU threads will sleep in hierarchical tree algorithm

2M particles

one time-step

ng(average number of groups) ≈ 500

Page 11: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

11

# of i particles ≈ 500, # of threads = 2048

j 0 1 2 3 4 5 6 ...i0 → GPU Thread 0...

Fij → GPU Thread 499

→ GPU Thread 500E

... m …p

ty → GPU Thread 20472047

499

500

About 1500 threads are sleeping

Page 12: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

12

Force decomposition (j-parallelization)

ji

0 → GPU Thread 0@block N1 → GPU Thread 1@block N2 → GPU Thread 2@block N3 Fij → GPU Thread 3@block N

4 → GPU Thread 4@block N5 → GPU Thread 5@block N6 → GPU Thread 6@block N...

→ GPU Thread 127@block N

3840...4095

127

GPU Block0

GPU Block1

GPU Block15

↓ ↓ ↓

256...5110..255

......

Force Reduction

Reducing the i-parallelization using j-parallelization

15

0block j

ijforce

Page 13: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

13

Practice : more complex test - Hierarchical Tree Algorithm GeForce8800GTX + Core2Duo E4400

2M particles

1000 time steps

2 hours

about 70 Gflops

Cosmological N-body simulation

Page 14: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

14

Breakdown of calculation time, and future work

Host calc 3.7 sec (47%) 3.7 sec (83%)Host->GPU 0.45 sec (6%) 0.45 sec (10%)GPU calc 0.04 sec (1%) 0.04 sec (1%)GPU->Host 3.6 sec (46%) 0.23 sec (5%)Total Time 7.79 sec 4.44 secTotal Speed 70 Gflop/s 140 Gflop/s

CUNBODY-1 Future

Page 15: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

15

Conclusion CUNBODY-1 library

The first implementation of accelerating particle-particle interaction in N-body simulation using GeForce8800GTX

Performance in Direct sum. algorithm

131072 particle

particle decomposition number = 2048

sustained speed ≈ 300 Gflops Performance in Hierarchical Tree Algorithm

2M particle

particle decomposition number = 128

force decomposition number = 16

sustained speed ≈ 70 Gflops

there is still room for improvement. (→ 140 Gflops)

Page 16: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first

Tsuyoshi Hamada, AstroGPU2007

16

Acknowledgement H. Yahagi, J. Makino (NAOJ)

Constructing GPU cluster

Developing tree code/initial conditions

Many discussions

T. Fukushige, A. Kawai (K&F co. Ltd)

Developing tree code/initial conditions

K. Yasuda (Univ. Nagoya)

Constructing GPU cluster

S. Fujikawa, T. Takahei, H. Matsubara, T. Ebisuzaki (RIKEN) Constructing GPU cluster

S. Nakamura (Mitsubishi Chemical Group) Constructing GPU cluster

T. Narumi (Keio Univ)

Many discussions

N.Nakasato (Aizu Univ)

Initial conditions/tree code