internals of the cunbody-1 library: particle/force decomposition … · 2007. 11. 11. · tsuyoshi...
TRANSCRIPT
![Page 1: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/1.jpg)
Tsuyoshi Hamada, AstroGPU2007
1
Internals of the CUNBODY-1 library: particle/force decomposition and reduction
Tsuyoshi Hamada, Yousuke Ohno, Gentaro Morimoto, Makoto Taiji
High-Performance Molecular Simulation TeamGenomic Sciences Center, RIKEN
Toshiaki Iitaka,Computational Astrophysics Laboratory,RIKEN
Keigo NitadoriDepartment of Astronomy, University of Tokyo
![Page 2: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/2.jpg)
Tsuyoshi Hamada, AstroGPU2007
2
GPU computing at RIKEN Motivation
Accelerating billions of particle simulations such as
Cosmological N-body simulation
Large-scale molecular dynamics simulation
using cost-effective hardware.
Target code
PPPM (Particle-Particle Particle-Mesh)
PME (Particle Mesh Ewald)
TreePM (Tree Particle-Mesh)
![Page 3: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/3.jpg)
Tsuyoshi Hamada, AstroGPU2007
3
N-body simulation Particles are interacting with each other
Particle
stars (star cluster)
galaxies (cluster of galaxies)
atom s(protein, metal, crystal, etc)
Computation cost in naive algorithm)( 2NO
particle
interaction
![Page 4: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/4.jpg)
Tsuyoshi Hamada, AstroGPU2007
4
CUNBODY-1 library CUda NBODY version 1 library
The first implementation of accelerating particle-particle interaction in N-body simulation using
CUDA
GeForce8800GTX
Host ComputerGPU
(GeForce8800GTX)
time-integration, etc interaction between xi and xj
(Hamada & Iitaka, Astro-ph 2007)
The basic idea is the same as GRAPE systems
ij
jii xxgf ),(
ji xx
,
if
You can get source code from http://progrape.jp/cs/
![Page 5: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/5.jpg)
Tsuyoshi Hamada, AstroGPU2007
5
History of N-body with GPU
0
50
100
150
200
250
300
350
400
L.Nyland
(2004)
M. Harris
(2005)
M. Harris
(2005)
S. P. Zwart
(2007)
S. P. Zwart
(2007)
M. Harris
(2007)
T.Hamada
(2007)
Gfl
ops
![Page 6: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/6.jpg)
Tsuyoshi Hamada, AstroGPU2007
6
The key to breakthrough –GeForce8800GTX + CUDA
GeForce8800GTX
8192 thread s in maxmum
345 Gflops peak
Using CUDA software
We can use scatter memory operation on GPU.
GPU Threads (~ 8192)
Memory
d0 d0 d0 d0 d0 d0 d0 d0
Scatter memory operation
![Page 7: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/7.jpg)
Tsuyoshi Hamada, AstroGPU2007
7
Practice : a simple test- direct summation GeForce8800GTX + Core2Duo E4400
131 k particles
2000 shared time steps
1 hours
about 300 Gflops
http://progrape.jp/cs/ or YouTubePlummer sphere, 131072 particles
You can get source code from http://progrape.jp/cs/
![Page 8: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/8.jpg)
Tsuyoshi Hamada, AstroGPU2007
8
Details of parallelization
j 0 1 2 3 4 5 6 ...i0 → GPU Thread 01 → GPU Thread 12 → GPU Thread 23 Fij → GPU Thread 3
4 → GPU Thread 45 → GPU Thread 56 → GPU Thread 6...
→ GPU Thread 20472047
All 2048 threads are calculating forces on different i-particles
→ Particle Decomposition (i-parallelization)
![Page 9: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/9.jpg)
Tsuyoshi Hamada, AstroGPU2007
9
Practice: Hierarchical Tree Algorithm
O(N log N) scaling with particle number
particles with a long distance are grouped
using hierarchical oct-tree structure
faster than direct summation for large number of particles
distant j-particles
i-particles
![Page 10: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/10.jpg)
Tsuyoshi Hamada, AstroGPU2007
10
Almost all GPU threads will sleep in hierarchical tree algorithm
2M particles
one time-step
ng(average number of groups) ≈ 500
![Page 11: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/11.jpg)
Tsuyoshi Hamada, AstroGPU2007
11
# of i particles ≈ 500, # of threads = 2048
j 0 1 2 3 4 5 6 ...i0 → GPU Thread 0...
Fij → GPU Thread 499
→ GPU Thread 500E
... m …p
ty → GPU Thread 20472047
499
500
About 1500 threads are sleeping
![Page 12: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/12.jpg)
Tsuyoshi Hamada, AstroGPU2007
12
Force decomposition (j-parallelization)
ji
0 → GPU Thread 0@block N1 → GPU Thread 1@block N2 → GPU Thread 2@block N3 Fij → GPU Thread 3@block N
4 → GPU Thread 4@block N5 → GPU Thread 5@block N6 → GPU Thread 6@block N...
→ GPU Thread 127@block N
3840...4095
127
GPU Block0
GPU Block1
GPU Block15
↓ ↓ ↓
256...5110..255
......
Force Reduction
Reducing the i-parallelization using j-parallelization
15
0block j
ijforce
![Page 13: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/13.jpg)
Tsuyoshi Hamada, AstroGPU2007
13
Practice : more complex test - Hierarchical Tree Algorithm GeForce8800GTX + Core2Duo E4400
2M particles
1000 time steps
2 hours
about 70 Gflops
Cosmological N-body simulation
![Page 14: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/14.jpg)
Tsuyoshi Hamada, AstroGPU2007
14
Breakdown of calculation time, and future work
Host calc 3.7 sec (47%) 3.7 sec (83%)Host->GPU 0.45 sec (6%) 0.45 sec (10%)GPU calc 0.04 sec (1%) 0.04 sec (1%)GPU->Host 3.6 sec (46%) 0.23 sec (5%)Total Time 7.79 sec 4.44 secTotal Speed 70 Gflop/s 140 Gflop/s
CUNBODY-1 Future
![Page 15: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/15.jpg)
Tsuyoshi Hamada, AstroGPU2007
15
Conclusion CUNBODY-1 library
The first implementation of accelerating particle-particle interaction in N-body simulation using GeForce8800GTX
Performance in Direct sum. algorithm
131072 particle
particle decomposition number = 2048
sustained speed ≈ 300 Gflops Performance in Hierarchical Tree Algorithm
2M particle
particle decomposition number = 128
force decomposition number = 16
sustained speed ≈ 70 Gflops
there is still room for improvement. (→ 140 Gflops)
![Page 16: Internals of the CUNBODY-1 library: particle/force decomposition … · 2007. 11. 11. · Tsuyoshi Hamada, AstroGPU2007 4 CUNBODY-1 library CUda NBODY version 1 library The first](https://reader036.vdocuments.us/reader036/viewer/2022071105/5fdfc268e54bf76d8e4c87e3/html5/thumbnails/16.jpg)
Tsuyoshi Hamada, AstroGPU2007
16
Acknowledgement H. Yahagi, J. Makino (NAOJ)
Constructing GPU cluster
Developing tree code/initial conditions
Many discussions
T. Fukushige, A. Kawai (K&F co. Ltd)
Developing tree code/initial conditions
K. Yasuda (Univ. Nagoya)
Constructing GPU cluster
S. Fujikawa, T. Takahei, H. Matsubara, T. Ebisuzaki (RIKEN) Constructing GPU cluster
S. Nakamura (Mitsubishi Chemical Group) Constructing GPU cluster
T. Narumi (Keio Univ)
Many discussions
N.Nakasato (Aizu Univ)
Initial conditions/tree code