Post on 05-May-2018
Efficient Parallelization of Molecular
Dynamics Simulations on Hybrid
CPU/GPU Supercomputers
Jaewoon Jung (RIKEN, RIKEN AICS)
Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC,
RIKEN iTHES)
Molecular Dynamics (MD)
1. Energies/forces are described by a classical molecular mechanics
force field.
2. The state is updated according to the equations of motion.
Long-time MD trajectories are important for obtaining thermodynamic quantities of target systems.
Equation of motion:
  dr_i/dt = p_i/m_i
  dp_i/dt = F_i

Long-time MD trajectory => ensemble generation

Integration:
  r_i(t + Δt) = r_i(t) + Δt p_i(t)/m_i
  p_i(t + Δt) = p_i(t) + Δt F_i(t)
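The update above can be sketched as one velocity Verlet step (a minimal sketch: the `force` callable and the scalar one-dimensional state are illustrative, not GENESIS code):

```python
def velocity_verlet(r, p, m, force, dt):
    """One velocity Verlet step for the equations of motion above (sketch).

    `force` is an illustrative callable returning F(r); the state is kept
    one-dimensional for brevity, not how a real MD code stores coordinates.
    """
    p_half = p + 0.5 * dt * force(r)          # half-kick with current force
    r_new = r + dt * p_half / m               # drift with half-step momentum
    p_new = p_half + 0.5 * dt * force(r_new)  # half-kick with updated force
    return r_new, p_new
```

For a harmonic test force F(r) = -r, repeated steps conserve the total energy to O(dt²), which is the property the long-trajectory ensemble generation relies on.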
Potential energy in MD using PME
E_total = Σ_bonds k_b (b − b_0)²                                          O(N)
        + Σ_angles k_θ (θ − θ_0)²                                         O(N)
        + Σ_dihedrals (V_n/2) [1 + cos(nφ − δ)]                           O(N)
        + Σ_j Σ_{i<j} [ A_ij/r_ij¹² − B_ij/r_ij⁶ + q_i q_j/r_ij ]         O(N²) (main bottleneck in MD)

With PME, the O(N²) non-bonded term is split into:

Real space:       Σ_j Σ_{i<j} [ A_ij/r_ij¹² − B_ij/r_ij⁶ + q_i q_j erfc(α r_ij)/r_ij ]   O(CN)
Reciprocal space: (2π/V) Σ_{k≠0} exp(−k²/4α²)/k² |FFT(Q)(k)|²                             O(N log N)

N: total number of particles
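The Ewald/PME splitting can be checked numerically: since erfc(x) + erf(x) = 1, the short-range and smooth parts recover the full Coulomb pair term exactly (a minimal sketch; the pairwise `smooth` evaluation here only verifies the identity, whereas PME evaluates that part on a grid via FFT):

```python
import math

def coulomb_split(qi, qj, r, alpha):
    """Ewald/PME splitting of the Coulomb pair term (minimal sketch).

    The short-range erfc part is summed directly within the cutoff; the
    smooth erf part is what PME actually evaluates on a charge grid via
    FFT. The pairwise form below is only for checking the identity.
    """
    real = qi * qj * math.erfc(alpha * r) / r    # real-space, O(CN) with cutoff
    smooth = qi * qj * math.erf(alpha * r) / r   # reciprocal-space part
    return real, smooth
```

The splitting parameter α trades work between the two parts: larger α shrinks the real-space cutoff but demands a finer reciprocal-space grid.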
GENESIS MD software
(Generalized Ensemble Simulation Systems)
1. Aims at developing efficient and accurate methodologies for free
energy calculations in biological systems.
2. Efficient parallelization, suitable for massively parallel
supercomputers, in particular K.
3. Applicability to large-scale simulations.
4. Algorithms coupled with different molecular models such as
coarse-grained, all-atom, and hybrid QM/MM.
5. Generalized ensemble with replica-exchange molecular
dynamics.
Ref : Jaewoon Jung et al. WIREs CMS, 5, 310-323 (2015)
Website : http://www.riken.jp/TMS2012/cbp/en/research/software/genesis/index.html
Midpoint method: the interaction between
two particles is computed by the process
that owns the midpoint of their positions.
Midpoint cell method: the interaction
between two particles is computed by the
process that owns the midpoint of the cells
in which the particles reside.
=> Small communication, efficient energy/force evaluations
Ref : J. Jung, T. Mori and Y. Sugita, JCC 35, 1064 (2014)
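The assignment rule above can be sketched for integer cell indices (a sketch: integer division is a simplified tie-breaking convention for odd-distance cell pairs; the actual GENESIS rule may differ):

```python
def midpoint_cell(cell_i, cell_j):
    """Return the cell that owns the interaction of a particle pair.

    Under the midpoint cell method, the pair (i, j) is computed by the
    process owning the cell midway between the two particles' cells.
    Integer division here is a simplified tie-breaking convention; the
    published GENESIS scheme may break ties differently.
    """
    return tuple((a + b) // 2 for a, b in zip(cell_i, cell_j))
```

Because the midpoint cell is never farther from either particle's cell than half the pair distance, each process only needs coordinates from nearby cells, which is what keeps the communication small.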
Parallelization of the real space interaction:
Midpoint cell method (1)
Basic domain decomposition using the midpoint
cell method (2)
1. Partition space into fixed-size
boxes with dimensions larger than
the cutoff distance.
2. Only information from neighboring
domains is needed to compute
energies.
3. Communication per process decreases
as the number of processes increases.
4. Good parallelization, and
suitable for large systems on massively
parallel supercomputers.
Parallelization of FFT in GENESIS :
Volumetric decomposition scheme in 3D FFT
1. More communication steps than
existing FFT schemes.
2. MPI Alltoall communications
only within one-dimensional subgroups
(existing schemes: communications in
two-/three-dimensional subgroups).
3. Reduces communication cost
for large numbers of processors.
FFT in GENESIS (2 dimensional view)
GENESIS
(identical domain
decomposition
between the two spaces)
NAMD, Gromacs
(different domain
decompositions)
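The property the volumetric scheme relies on can be demonstrated directly: a 3D FFT factorizes into three passes of 1D FFTs, one per axis, so each pass needs data contiguous along only one dimension and the Alltoall exchange can be restricted to one-dimensional process subgroups (a sketch using NumPy on a single process; no MPI involved):

```python
import numpy as np

# A 3D FFT factorizes into three passes of 1D FFTs, one per axis.
# In a volumetric decomposition, each pass is computed locally after
# an Alltoall exchange within a single dimension of the process grid.
rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 8))

staged = np.fft.fft(grid, axis=0)    # pass 1: x-direction
staged = np.fft.fft(staged, axis=1)  # pass 2: y-direction
staged = np.fft.fft(staged, axis=2)  # pass 3: z-direction

# The staged result matches the one-shot 3D transform.
assert np.allclose(staged, np.fft.fftn(grid))
```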
GENESIS performance on K
[Figure: left, ApoA1 (92,224 atoms) performance on 128 cores for GENESIS r1.0, r1.1, r1.2, NAMD 2.9, Gromacs 5.0.7, and Gromacs 5.1.2; right, STMV (1,066,628 atoms) strong scaling, ns/day (0.5 to 64) versus number of cores (128 to 65,536), for GENESIS r1.1, GENESIS r1.0, and NAMD 2.9.]
Why midpoint cell method for GPU+CPU cluster?
1. For small numbers of processors, the main bottleneck of MD is the
real-space non-bonded interactions.
2. As the number of processors increases, the main bottleneck moves to
the reciprocal-space non-bonded interactions.
3. When the real-space non-bonded interactions are assigned to GPUs, the
reciprocal-space interactions become even more critical.
4. The midpoint cell method with volumetric-decomposition FFT is a good
solution for optimizing the reciprocal-space interactions because it
avoids communications before/after the FFT.
5. In particular, this combination is very useful for massively parallel
supercomputers with GPUs.
Overview of CPU+GPU calculations
1. Computation-intensive work: GPU
• Pairlist generation
• Real-space non-bonded interactions
2. Communication-intensive or less
computation-intensive work: CPU
• Reciprocal-space non-bonded
interactions with FFT
• Bonded interactions
• Exclusion list
3. Integration is performed on the CPU
due to file I/O.
Real space non-bonded interaction on GPU (1)
- non-excluded particle list scheme
The non-excluded particle list scheme is suitable for GPUs because the
pairlist requires only a small amount of memory.
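The idea can be sketched as follows (an illustrative sketch, not the GENESIS data layout: only pairs that survive the exclusion filter are stored, so no per-pair exclusion mask needs to be kept on the GPU):

```python
def nonexcluded_pairs(n_atoms, exclusions):
    """Build a pair list containing only non-excluded pairs (sketch).

    `exclusions` holds bonded/1-3/1-4 pairs as frozensets. Storing only
    the surviving pairs, instead of a full exclusion mask over all
    pairs, keeps the pairlist memory footprint small.
    """
    return [(i, j)
            for i in range(n_atoms)
            for j in range(i + 1, n_atoms)
            if frozenset((i, j)) not in exclusions]
```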
Real space non-bonded interaction on GPU (2)
- How to make block/threads in each cell pair
We form 32-thread blocks for efficient calculation on the GPU by pairing
8 atoms of cell i with 4 atoms of cell j.
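The 8 × 4 tiling can be sketched by enumerating which atom pairs each 32-thread warp covers (our reading of the slide; the actual GENESIS kernel layout and lane ordering may differ in detail):

```python
def tile_assignments(n_i, n_j, ti=8, tj=4):
    """Enumerate (atom_i, atom_j) pairs per 32-thread warp tile (sketch).

    Each warp covers ti x tj = 32 atom pairs at a time: ti atoms from
    cell i against tj atoms from cell j. Partial tiles at the cell
    boundaries carry fewer than 32 pairs.
    """
    tiles = []
    for bi in range(0, n_i, ti):
        for bj in range(0, n_j, tj):
            lanes = [(bi + a, bj + b)
                     for a in range(min(ti, n_i - bi))
                     for b in range(min(tj, n_j - bj))]
            tiles.append(lanes)
    return tiles
```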
Overview of GPU+CPU calculations with multiple
time step integrator
1. In the multiple time step
integrator, the reciprocal-space
interaction is not computed
every step.
2. On steps where the reciprocal-space
interaction is not needed, a
subset of the real-space interactions
is assigned to the CPU to maximize
the performance.
3. Integration is performed on the
CPU only.
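The multiple time step scheme can be sketched as one r-RESPA outer step (a sketch with scalar state; the `fast_force`/`slow_force` callables stand in for the real-space and reciprocal-space parts and are not GENESIS code):

```python
def respa_step(r, p, m, fast_force, slow_force, dt, n_inner):
    """One r-RESPA outer step (sketch; scalar state for brevity).

    The slow force (e.g. the reciprocal-space PME part) is applied only
    at the outer time step; the fast force is integrated with velocity
    Verlet at the inner time step dt / n_inner.
    """
    p = p + 0.5 * dt * slow_force(r)   # outer half-kick (slow force)
    h = dt / n_inner
    for _ in range(n_inner):           # inner velocity Verlet loop
        p = p + 0.5 * h * fast_force(r)
        r = r + h * p / m
        p = p + 0.5 * h * fast_force(r)
    p = p + 0.5 * dt * slow_force(r)   # outer half-kick (slow force)
    return r, p
```

On the outer steps without a slow-force evaluation there is no FFT to wait for, which is exactly when GENESIS moves part of the real-space work back onto the CPU.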
Validation Tests (Energy drift)
Machine    Precision  Integrator               Energy drift
CPU        Double     Velocity Verlet           3.37×10⁻⁶
CPU        Single     Velocity Verlet           1.03×10⁻⁵
CPU        Double     RESPA (4 fs)              1.01×10⁻⁶
CPU        Single     RESPA (4 fs)              8.92×10⁻⁵
CPU+GPU    Double     Velocity Verlet           7.03×10⁻⁶
CPU+GPU    Single     Velocity Verlet          −4.56×10⁻⁵
CPU+GPU    Double     RESPA (4 fs)             −3.21×10⁻⁶
CPU+GPU    Single     RESPA (4 fs)             −3.68×10⁻⁵
CPU+GPU    Single     Langevin RESPA (8 fs)     5.48×10⁻⁵
CPU+GPU    Single     Langevin RESPA (10 fs)    1.63×10⁻⁶
• Units: kT/ns/degree of freedom
• 2 fs time step with SHAKE/RATTLE/SETTLE constraints
• For RESPA, the slow-force time step is given in parentheses
• Our energy drift is similar to AMBER double- and hybrid-precision calculations.
Benchmark condition
1. MD program: GENESIS
2. Systems: TIP3P water (22,000 atoms), STMV (1 million), crowding system 1 (11.7
million), and crowding system 2 (100 million)
3. Cutoff: 12.0 Å
4. PME grid sizes: 192³ (STMV), 384³ (crowding system 1), and 768³
(crowding system 2)
5. Integrators: velocity Verlet (VVER), RESPA (PME reciprocal every second
step), and Langevin RESPA (PME reciprocal every fourth step)
Acceleration of real space interactions (1)
• System: 9,250 TIP3P water molecules
• Cutoff distance: 12 Å
• Box size: 64 Å × 64 Å × 64 Å
1. One GPU increases the speed 3 times, and two GPUs 6 times.
2. By assigning the CPU as well as the GPU on steps where the FFT on the CPU is
skipped, the speedup reaches 7.7 times.
Acceleration of real space interactions (2)
Benchmark system
(11.7 million atoms,
Cutoff = 12.0 Å)
Comparison between real space and
reciprocal space interactions
1. In both systems, the main bottleneck is the reciprocal-space interactions, irrespective of
the number of processors.
2. Therefore, it is important to optimize the reciprocal-space interaction when CPU+GPU
clusters are used (the midpoint cell method could be the best choice).
STMV (1 million atoms) 11.7 million atoms
Comparison between TSUBAME and K
STMV (1 million atoms) 11.7 million atoms
1. K has better parallel efficiency for the reciprocal-space interaction than TSUBAME.
2. Irrespective of the reciprocal-space performance, TSUBAME shows better overall
performance than K due to efficient evaluation of the real-space interaction on GPUs.
Benchmark on TSUBAME
VVER (1 million atoms) VVER (11.7 million atoms)
RESPA (1 million atoms) RESPA (11.7 million atoms)
Performance on TSUBAME for 100-million-atom systems
Integrator      Number of nodes  Time per step (ms)  Simulation time (ns/day)
VVER            512              126.09              1.37
VVER            1024             97.87               1.77
RESPA           512              109.80              1.57
RESPA           1024             70.77               2.44
Langevin RESPA  512              78.92               2.19
Langevin RESPA  1024             44.13               3.92
Summary
1. We implemented MD for GPU+CPU clusters.
2. We assign GPUs to the real-space non-bonded interactions, and CPUs to the
reciprocal-space interactions, bonded interactions, and integration.
3. We introduced a non-excluded particle list scheme for efficient usage of
GPU memory.
4. We also optimized the usage of GPUs and CPUs for multiple time step
integrators.
5. Benchmark results on TSUBAME show very good strong/weak scalability
for 1 million, 11.7 million, and 100 million atom systems.