Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercomputers
Jaewoon Jung (RIKEN, RIKEN AICS)
Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC, RIKEN iTHES)


Page 1:

Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercomputers

Jaewoon Jung (RIKEN, RIKEN AICS)
Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC, RIKEN iTHES)

Page 2:

Molecular Dynamics (MD)

1. Energies and forces are described by a classical molecular mechanics force field.

2. The state is updated according to the equations of motion.

Long MD trajectories are needed to obtain the thermodynamic quantities of target systems.

Equation of motion:

$$\frac{d\mathbf{r}_i}{dt} = \frac{\mathbf{p}_i}{m_i}, \qquad \frac{d\mathbf{p}_i}{dt} = \mathbf{F}_i$$

Integration:

$$\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \Delta t\,\frac{\mathbf{p}_i(t)}{m_i}, \qquad \mathbf{p}_i(t + \Delta t) = \mathbf{p}_i(t) + \Delta t\,\mathbf{F}_i(t)$$

Long-time MD trajectory => ensemble generation.
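To make the update rule concrete, here is a minimal Python sketch of one integration step following the equations above; the array shapes and the function name are illustrative assumptions, not GENESIS code.

```python
import numpy as np

def integrate_step(r, p, m, force_fn, dt):
    """One step of the update rules above (illustrative sketch).

    r, p : (N, 3) position/momentum arrays; m : (N,) masses;
    force_fn : callable returning (N, 3) forces for given positions.
    """
    f = force_fn(r)                  # F(t)
    r = r + dt * p / m[:, None]      # r(t+dt) = r(t) + dt * p(t) / m
    p = p + dt * f                   # p(t+dt) = p(t) + dt * F(t)
    return r, p
```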

Page 3:

Potential energy in MD using PME

$$E_{\text{total}} = \sum_{\text{bonds}} k_b (b - b_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{dihedrals}} \frac{V_n}{2}\left[1 + \cos(n\phi - \delta)\right] + \sum_{j>i}^{N} \left\{ \varepsilon_{ij}\left[\left(\frac{r_{ij}^{0}}{r_{ij}}\right)^{12} - 2\left(\frac{r_{ij}^{0}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{r_{ij}} \right\}$$

The bond, angle, and dihedral terms are each O(N); the non-bonded sum is O(N²) and is the main bottleneck in MD.

With PME, the non-bonded term is split into a real-space part and a reciprocal-space part:

$$\sum_{j>i} \varepsilon_{ij}\left[\left(\frac{r_{ij}^{0}}{r_{ij}}\right)^{12} - 2\left(\frac{r_{ij}^{0}}{r_{ij}}\right)^{6}\right] + \sum_{j>i} \frac{q_i q_j \operatorname{erfc}(\alpha r_{ij})}{r_{ij}} + \frac{2\pi}{V} \sum_{\mathbf{k} \neq 0} \frac{\exp(-|\mathbf{k}|^2/4\alpha^2)}{|\mathbf{k}|^2} \left|\mathrm{FFT}(Q)(\mathbf{k})\right|^2$$

Real space: O(CN). Reciprocal space: O(N log N). Here N is the total number of particles.
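As a concrete (if naive) reading of the real-space term, the sketch below evaluates the erfc-screened Coulomb sum by brute force; the function name and the dense O(N²) loop are illustrative assumptions, whereas a production code reaches O(CN) with cell lists and a cutoff.

```python
import numpy as np
from scipy.special import erfc

def ewald_real_space(r, q, alpha, cutoff):
    """Real-space Ewald/PME sum: sum_{i<j} q_i q_j erfc(alpha r_ij) / r_ij.

    r : (N, 3) positions, q : (N,) charges. Brute-force sketch only.
    """
    n = len(q)
    e = 0.0
    for i in range(n - 1):
        d = np.linalg.norm(r[i + 1:] - r[i], axis=1)   # distances to all j > i
        mask = d < cutoff                              # pairs inside the cutoff
        e += np.sum(q[i] * q[i + 1:][mask] * erfc(alpha * d[mask]) / d[mask])
    return e
```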

Page 4:

GENESIS MD software
(Generalized Ensemble Simulation Systems)

1. Aims at developing efficient and accurate methodologies for free energy calculations in biological systems.

2. Efficient parallelization: suitable for massively parallel supercomputers, in particular the K computer.

3. Applicable to large-scale simulations.

4. Algorithms coupled with different molecular models, such as coarse-grained, all-atom, and hybrid QM/MM.

5. Generalized ensembles with replica-exchange molecular dynamics.

Ref: J. Jung et al., WIREs Comput. Mol. Sci. 5, 310-323 (2015)

Website: http://www.riken.jp/TMS2012/cbp/en/research/software/genesis/index.html

Page 5:

Parallelization of the real space interaction: Midpoint cell method (1)

Midpoint method: the interaction between two particles is assigned based on the midpoint of their positions.

Midpoint cell method: the interaction between two particles is assigned based on the midpoint of the cells in which the particles reside.

Small communication, efficient energy/force evaluations.

Ref: J. Jung, T. Mori, and Y. Sugita, J. Comput. Chem. 35, 1064 (2014)
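A minimal sketch of the two assignment rules, assuming an orthorhombic box divided into cubic cells of edge cell_size; the function names are illustrative, and the tie-breaking rule for cell pairs with no unique midpoint cell is omitted.

```python
def home_cell(pos, cell_size):
    """Cell index (ix, iy, iz) of the cell containing a position."""
    return tuple(int(x // cell_size) for x in pos)

def owner_midpoint(pos_i, pos_j, cell_size):
    """Midpoint method: the interaction is owned by the cell containing
    the midpoint of the two particle positions."""
    mid = [(a + b) / 2.0 for a, b in zip(pos_i, pos_j)]
    return home_cell(mid, cell_size)

def owner_midpoint_cell(pos_i, pos_j, cell_size):
    """Midpoint cell method: the owner is determined from the midpoint of
    the two home cells, so every pair belonging to the same cell pair is
    assigned to the same owner regardless of exact particle positions."""
    ci, cj = home_cell(pos_i, cell_size), home_cell(pos_j, cell_size)
    return tuple((a + b) // 2 for a, b in zip(ci, cj))
```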

Page 6:

Basic domain decomposition using the midpoint cell method (2)

1. Space is partitioned into fixed-size boxes whose dimensions are larger than the cutoff distance.

2. Only information from neighboring domains is needed to compute the energies (see the sketch below).

3. Communication is reduced as the number of processes increases.

4. Parallelizes well and is suitable for large systems on massively parallel supercomputers.
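A small sketch of point 2, assuming a periodic cubic arrangement of n × n × n domains (names illustrative): because each domain edge is at least the cutoff distance, only the 26 face, edge, and corner neighbors can hold interacting particles.

```python
from itertools import product

def neighbor_domains(ix, iy, iz, n):
    """The 26 neighboring domains (periodic boundaries) whose particles
    can lie within one cutoff of domain (ix, iy, iz), given that each
    domain edge is at least the cutoff distance."""
    return {((ix + dx) % n, (iy + dy) % n, (iz + dz) % n)
            for dx, dy, dz in product((-1, 0, 1), repeat=3)
            if (dx, dy, dz) != (0, 0, 0)}
```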

Page 7:

Parallelization of FFT in GENESIS: volumetric decomposition scheme for the 3D FFT

1. More communication steps than existing FFT schemes.

2. MPI_Alltoall communications only within one dimension of the process grid (existing schemes communicate across two or three dimensions).

3. Reduces the communication cost for large numbers of processors (see the sketch below).
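The data flow can be sketched in a few lines: a 3D FFT is three passes of batched 1D FFTs, and in a volumetric decomposition each pass is separated by an all-to-all along a single dimension of the process grid. The single-process numpy sketch below shows only where those exchanges would sit, not the MPI code itself.

```python
import numpy as np

def fft3d_volumetric(q):
    """3D FFT as three batched 1D FFT passes, one axis at a time
    (equivalent to np.fft.fftn for a single process)."""
    for axis in range(3):
        q = np.fft.fft(q, axis=axis)   # local 1D FFTs along this axis
        # <-- in the parallel code: MPI_Alltoall along one dimension of
        #     the process grid to bring the next axis into local memory
    return q
```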

Page 8:

FFT in GENESIS (two-dimensional view)

GENESIS: identical domain decomposition between the two spaces (real and reciprocal).

NAMD, Gromacs: different domain decompositions between the two spaces, requiring data redistribution before/after the FFT.

Page 9:

GENESIS performance on K

[Figure: benchmark plots. Left: ApoA1 (92,224 atoms) on 128 cores, comparing GENESIS r1.0, r1.1, and r1.2 with NAMD 2.9, Gromacs 5.0.7, and Gromacs 5.1.2. Right: STMV (1,066,628 atoms) strong scaling, ns/day versus number of cores (128 to 65,536), for GENESIS r1.0, GENESIS r1.1, and NAMD 2.9.]

Page 10:

Why the midpoint cell method for GPU+CPU clusters?

1. For small numbers of processors, the main bottleneck of MD is the real-space non-bonded interactions.

2. As the number of processors increases, the main bottleneck moves to the reciprocal-space non-bonded interactions.

3. When the GPU is assigned the real-space non-bonded interactions, the reciprocal-space interactions become even more critical.

4. The midpoint cell method with volumetric-decomposition FFT is a good way to optimize the reciprocal-space interactions, because it avoids communications before/after the FFT.

5. In particular, this combination should be very useful for massively parallel supercomputers with GPUs.

Page 11:

Overview of CPU+GPU calculations

1. Computation-intensive work → GPU:
• Pairlist generation
• Real-space non-bonded interactions

2. Communication-intensive or non-computation-intensive work → CPU:
• Reciprocal-space non-bonded interactions with FFT
• Bonded interactions
• Exclusion lists

3. Integration is performed on the CPU because of file I/O (a sketch of this split follows).
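A schematic of this division of labor, with all three force callables as hypothetical placeholders; GENESIS itself realizes this with MPI ranks and GPU kernels, not Python threads, so this only illustrates that the three force contributions are computed concurrently and then combined.

```python
from concurrent.futures import ThreadPoolExecutor

def md_force_step(coords, gpu_realspace, cpu_reciprocal, cpu_bonded):
    """One force evaluation with the division of labor described above.
    gpu_realspace, cpu_reciprocal, cpu_bonded are placeholder callables
    standing in for the corresponding GENESIS components."""
    with ThreadPoolExecutor() as pool:
        f_real = pool.submit(gpu_realspace, coords)    # GPU: real space
        f_recip = pool.submit(cpu_reciprocal, coords)  # CPU: PME/FFT
        f_bond = pool.submit(cpu_bonded, coords)       # CPU: bonded terms
        return f_real.result() + f_recip.result() + f_bond.result()
```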

Page 12:

Real-space non-bonded interactions on GPU (1): non-excluded particle list scheme

The non-excluded particle list scheme is suitable for GPUs because the pairlist requires only a small amount of memory.
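A toy version of the idea, assuming a dense distance matrix and a set of excluded (bonded) pairs; a real code builds this cell by cell, but the point is the same: exclusions are filtered out when the list is built, so the force kernel needs no exclusion test and no per-pair exclusion storage.

```python
def build_pairlist(dist, exclusions, cutoff):
    """Pairlist storing only non-excluded pairs within the cutoff.

    dist : (N, N) distance matrix (e.g. a numpy array);
    exclusions : set of (i, j) pairs with i < j (1-2, 1-3 bonded pairs, ...).
    """
    n = dist.shape[0]
    return [(i, j)
            for i in range(n - 1) for j in range(i + 1, n)
            if dist[i, j] < cutoff and (i, j) not in exclusions]
```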

Page 13:

Real-space non-bonded interactions on GPU (2): how blocks/threads are built for each cell pair

We build 32-thread blocks for efficient calculation on the GPU, pairing 8 atoms of cell i with 4 atoms of cell j (8 × 4 = 32 pairs per block).
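The mapping from a thread index to an atom pair is a simple division/modulo, sketched below; this is illustrative Python (the real kernel is CUDA), and the 8 × 4 tile then slides over the remaining atoms of both cells.

```python
def thread_pair(tid):
    """Map thread 0..31 of a block to an (i, j) pair of an 8 x 4 tile:
    8 atoms from cell i times 4 atoms from cell j = 32 pairs."""
    assert 0 <= tid < 32
    return tid // 4, tid % 4   # (atom index in cell i, atom index in cell j)
```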

Page 14:

Overview of GPU+CPU calculations with a multiple-time-step integrator

1. With a multiple-time-step integrator, the reciprocal-space interactions are not computed every step.

2. On steps where the reciprocal-space interactions are skipped, a subset of the real-space interactions is assigned to the CPU to maximize performance.

3. Integration is performed on the CPU only (see the sketch below).
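Put together, one outer loop captures the scheme; every callable here is a hypothetical placeholder standing in for the corresponding GENESIS component.

```python
def respa_loop(n_steps, k_recip, fast_forces, recip_forces, integrate):
    """Multiple-time-step loop: reciprocal-space (slow) forces only every
    k_recip steps; on the other steps the CPU helps the GPU with part of
    the real-space work instead of sitting idle."""
    for step in range(n_steps):
        f = fast_forces(step)              # real space: GPU (+ CPU help)
        if step % k_recip == 0:
            f = f + recip_forces(step)     # reciprocal space: CPU + FFT
        integrate(f)                       # integration: always on the CPU
```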

Page 15:

Validation Tests (Energy drift)

Machine | Precision | Integrator | Energy drift
CPU | Double | Velocity Verlet | 3.37×10⁻⁶
CPU | Single | Velocity Verlet | 1.03×10⁻⁵
CPU | Double | RESPA (4 fs) | 1.01×10⁻⁶
CPU | Single | RESPA (4 fs) | 8.92×10⁻⁵
CPU+GPU | Double | Velocity Verlet | 7.03×10⁻⁶
CPU+GPU | Single | Velocity Verlet | −4.56×10⁻⁵
CPU+GPU | Double | RESPA (4 fs) | −3.21×10⁻⁶
CPU+GPU | Single | RESPA (4 fs) | −3.68×10⁻⁵
CPU+GPU | Single | Langevin RESPA (8 fs) | 5.48×10⁻⁵
CPU+GPU | Single | Langevin RESPA (10 fs) | 1.63×10⁻⁶

• Unit: kT/ns/degree of freedom
• 2 fs time step with SHAKE/RATTLE/SETTLE constraints
• For RESPA, the slow-force time step is given in parentheses
• Our energy drift is similar to that of AMBER in double and hybrid precision calculations.

Page 16:

Benchmark conditions

1. MD program: GENESIS

2. Systems: TIP3P water (22,000 atoms), STMV (1 million atoms), crowding system 1 (11.7 million atoms), and crowding system 2 (100 million atoms)

3. Cutoff: 12.0 Å

4. PME grid sizes: 192³ (STMV), 384³ (crowding system 1), and 768³ (crowding system 2)

5. Integrators: velocity Verlet (VVER), RESPA (PME reciprocal every second step), and Langevin RESPA (PME reciprocal every fourth step)

Page 17:

Acceleration of real space interactions (1)

• System: 9,250 TIP3P water molecules
• Cutoff distance: 12 Å
• Box size: 64 Å × 64 Å × 64 Å

1. One GPU increases the speed 3-fold; two GPUs, 6-fold.

2. By assigning work to the CPU as well as the GPU on steps where the FFT on the CPU is skipped, the speedup reaches 7.7-fold.

Page 18:

Acceleration of real space interactions (2)

Benchmark system: 11.7 million atoms, cutoff = 12.0 Å.

Page 19:

Comparison between real space and reciprocal space interactions

Systems: STMV (1 million atoms) and the 11.7-million-atom crowding system.

1. In both systems, the main bottleneck is the reciprocal-space interactions, irrespective of the number of processors.

2. It is therefore important to optimize the reciprocal-space interactions when CPU+GPU clusters are used (the midpoint cell method could be the best choice).

Page 20:

Comparison between TSUBAME and K

Systems: STMV (1 million atoms) and the 11.7-million-atom crowding system.

1. K has better parallel efficiency for the reciprocal-space interactions than TSUBAME.

2. Nevertheless, TSUBAME outperforms K overall because the real-space interactions are evaluated efficiently on its GPUs.

Page 21:

Benchmark on TSUBAME

[Figure: benchmark plots for the VVER and RESPA integrators on the 1-million-atom and 11.7-million-atom systems.]

Page 22:

Performance on TSUBAME for 100-million-atom systems

Integrator | Number of nodes | Time per step (ms) | Simulation speed (ns/day)
VVER | 512 | 126.09 | 1.37
VVER | 1024 | 97.87 | 1.77
RESPA | 512 | 109.80 | 1.57
RESPA | 1024 | 70.77 | 2.44
Langevin RESPA | 512 | 78.92 | 2.19
Langevin RESPA | 1024 | 44.13 | 3.92

Page 23:

Summary

1. We implemented MD for GPU+CPU clusters.

2. GPUs are assigned the real-space non-bonded interactions; CPUs handle the reciprocal-space interactions, bonded interactions, and integration.

3. We introduced a non-excluded particle list scheme for efficient use of GPU memory.

4. We also optimized the division of work between GPUs and CPUs for multiple-time-step integrators.

5. Benchmarks on TSUBAME show very good strong/weak scalability for systems of 1 million, 11.7 million, and 100 million atoms.