TRANSCRIPT
Efficient Molecular Dynamics on Heterogeneous Architectures in
GROMACS
Berk Hess, Szilárd Páll
KTH Royal Institute of Technology
GTC 2012
GROMACS: fast, scalable, free
● Classical molecular dynamics package
● Main developers: Stockholm, Sweden & world-wide
● Open source: GPL, www.gromacs.org
● User base:
  – thousands worldwide, both academic & private
  – hundreds of thousands through Folding@Home (300k active CPUs, Apr. 2012)
● Force fields: AMBER, CHARMM, OPLS, GROMOS
● Coarse-grained simulations
● Strong focus on optimized algorithms and efficient code
  – philosophy: do more with less → scaling != absolute performance
Target application areas
● Membrane protein: 200k atoms
● Water droplet on substrate: 1.5 million atoms
● Cellulose + lignocellulose + water: 5 million atoms
GROMACS acceleration
● GROMACS 4.5:
  – highly optimized non-bonded SSE assembly kernels
  – single-GPU acceleration using OpenMM
  – wall-time per iteration as low as 1 ms
● GROMACS 4.6:
  – SSE/AVX intrinsics in all compute-intensive code
  – GPU acceleration:
    ● hard to beat the CPU
    ● re-implementing everything is not an option
● Design principles:
  – support all features
  – maximize both CPU and GPU utilization
  – develop future-proof algorithms
● Offload non-bonded force calculation to GPUs
  – strategy successfully used by other packages
  – our challenges:
    ● a fast code is hard to accelerate
    ● sub-millisecond iteration rate: latencies hurt more
[Figure: scaling of cells, parallel constraints, and virtual interaction sites (arbitrary units).]
What/how to accelerate with GPUs?
Hybrid parallelization
● Each domain maps to an MPI process
  – OpenMP within a process
  – single GPU per process
● Automated multi-level load balancing:
  – inter-process: dynamic domain resizing
  – intra-process: automated CPU ↔ GPU work shifting
Non-bonded cluster pair-list
● Standard cell grid: spatially uniform cells
● Instead: x,y gridding + z sorting + z binning (rather than full x,y,z gridding)
● Cluster pair-list generation
● Atom clusters: uniform number of atoms per cluster
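The cluster pair-list described above is built on the CPU from cluster bounding boxes rather than per-atom distances. A minimal sketch of that in-range test, with illustrative names and box coordinates (not GROMACS source):

```python
# Minimal sketch (not GROMACS source): a cluster pair enters the
# pair-list when the minimum distance between the clusters'
# axis-aligned bounding boxes is within the list cut-off.

def bbox_dist2(lo_a, hi_a, lo_b, hi_b):
    """Squared minimum distance between two axis-aligned boxes."""
    d2 = 0.0
    for d in range(3):
        gap = max(lo_b[d] - hi_a[d], lo_a[d] - hi_b[d], 0.0)
        d2 += gap * gap
    return d2

def clusters_in_range(box_i, box_j, rlist):
    lo_a, hi_a = box_i
    lo_b, hi_b = box_j
    return bbox_dist2(lo_a, hi_a, lo_b, hi_b) < rlist * rlist

# Two 0.3 nm clusters separated by a 0.7 nm gap along x:
a = ((0.0, 0.0, 0.0), (0.3, 0.3, 0.3))
b = ((1.0, 0.0, 0.0), (1.3, 0.3, 0.3))
print(clusters_in_range(a, b, 1.0), clusters_in_range(a, b, 0.5))  # True False
```

The box-box test is conservative: it may admit cluster pairs with no atom pair in range, which is exactly what the GPU-side pruning later removes.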
Non-bonded algorithm
● CPU: SSE (AVX)
  – cluster: 4 atoms
  – work unit: 4x4 pair-forces
● GPU: CUDA
  – cluster: 8 atoms
  – work unit: 8x8 pair-forces (2 warps)
  – optimize for caching: super-clusters with 8 clusters each
Heterogeneous scheme: data & control flow
Avg. CPU/GPU overlap: 60-80% per iteration
[Diagram: MD iteration timeline. CPU (OpenMP threads): pair search every 10-50 iterations, bonded F, PME, integration & constraints, wait for GPU (idle). GPU (CUDA): non-bonded F & pair-list pruning. Transfers: pair-list, coordinates & charges to the GPU; forces (and energies) back.]
GPU non-bonded kernel
sci = i-supercluster index = block index

for each cj cluster (loop over all neighbors of any ci in sci)
    load cj cluster interaction and exclusion mask
    if cj not masked
        load j atom data
        for each ci cluster in sci (loop over the 8 i-clusters)
            load i atom data
            r2 = |xj - xi|^2
            load exclusion mask (one per warp)
            extract exclusion bit for the i-j atom-pair: excl_bit
            if (r2 < rcoulomb_squared * excl_bit)
                calculate i-j Coulomb and LJ forces
                accumulate i- and j-forces in registers
        store per-thread j-forces in shmem
        reduce j-forces
reduce i-forces
Launch configuration:
– grid: #i-superclusters x 1 (one super-cluster per block)
– block: 8x8x1 = 64 threads
– shared mem: Fermi: 768 bytes (reduction); Kepler: 0 bytes
Pruning
● All-vs-all atom distance check is expensive
  → the pair-list is built with a cluster bounding-box distance check on the CPU
  → the atom-pair distances are calculated on the GPU anyway
● Solution: prune using the warp vote (Fermi+):
  __any(r2 < rlist_sq) == false ⇔ no pairs in the warp are within range
  – 10-25% overhead
  – only needs to run every pair-search step!
  – prunes 25-35% of the atom-pairs
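The warp-vote prune can be sketched in scalar form; the emulated `__any()` over 32 lanes is the only CUDA-specific piece, and all names here are illustrative:

```python
# Scalar emulation (illustrative, not GROMACS source) of the warp-vote
# prune: on the GPU each thread of the 8x8 tile holds one i-j atom-pair
# distance, and __any(r2 < rlist_sq) over the warp decides whether the
# j-cluster survives.

WARP_SIZE = 32

def warp_any(predicates):
    """Emulates CUDA __any(): true if the predicate holds on any lane."""
    return any(predicates)

def prune_j_cluster(r2_per_lane, rlist_sq):
    """Return True if the j-cluster can be pruned (no pair in range)."""
    return not warp_any(r2 < rlist_sq for r2 in r2_per_lane)

# All 32 pairs beyond the list cut-off -> pruned:
print(prune_j_cluster([2.0] * WARP_SIZE, 1.0))   # True
# A single pair in range keeps the whole j-cluster:
print(prune_j_cluster([2.0] * 31 + [0.5], 1.0))  # False
```

On the GPU the vote costs one instruction per warp, which is why the check can piggyback on distances that the force loop computes anyway.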
GPU non-bonded kernel with pruning
sci = i-supercluster index = block index

for each cj cluster (loop over all neighbors of any ci in sci)
    load cj cluster interaction and exclusion mask
    if cj not masked
        load j atom data
        for each ci cluster in sci (loop over the 8 i-clusters)
            load i atom data
            r2 = |xj - xi|^2
            if !__any(r2 < rlist_squared)
                prune cj from the current ci (pair-list pruning)
            load exclusion mask (one per warp)
            extract exclusion bit for the i-j atom-pair: excl_bit
            if (r2 < rcoulomb_squared * excl_bit)
                calculate i-j Coulomb and LJ forces
                accumulate i- and j-forces in registers
        store per-thread j-forces in shmem
        reduce j-forces
        store pruned i-mask
reduce i-forces

Launch configuration: as above (grid: #i-superclusters x 1, one super-cluster per block; block: 8x8x1 = 64 threads; shared mem: Fermi 768 bytes for the reduction, Kepler 0 bytes).
Kernel characteristics
● Full warp "skips" → branch-free execution
● Kernel emphasizes data reuse, relies heavily on caching:
  – 95% L1 hit rate
  – ~75 flops/byte ⇒ compute bound
  – ~39 flops, ~150 ops total per inner loop ⇒ many iops (Kepler concern!)
● 15.5 warps in flight
● IPC: 1.5-1.6 on Fermi
● Force accumulation requires lots of shmem/registers → limiting:
  – i-force: 8 x 64 x 12 bytes shmem, or 8 x 12 bytes reg + shmem
  – j-force: (only) 64 x 12 bytes shmem
  ⇒ ECC agnostic
GPU non-bonded kernel work-efficiency
● Work-efficiency = fraction of calculated forces that are non-zero
● Even with pruning, only 40-60% of the calculated pairs are non-zero
● Pruning improves work-efficiency by 1.6-2x

Fraction of non-zero pairs:
                  Verlet   Cluster-pair pruned   Cluster-pair unpruned
rc=0.9, rl=1.0    0.73     0.43                  0.21
rc=1.2, rl=1.3    0.75     0.52                  0.29
rc=1.5, rl=1.6    0.82     0.59                  0.36
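As a back-of-envelope check on the Verlet column (an approximation, not the measured data): for a Verlet list, the fraction of listed pairs that fall inside the real cut-off is roughly the volume ratio (rc/rl)^3, which tracks the 0.73-0.82 efficiencies above:

```python
# Approximate non-zero fraction of a Verlet pair-list as the ratio of
# the cut-off sphere volume to the list sphere volume, (rc/rl)**3.
# This is a geometric estimate, not GROMACS's measured numbers.

def verlet_nonzero_fraction(rc, rl):
    return (rc / rl) ** 3

for rc, rl in [(0.9, 1.0), (1.2, 1.3), (1.5, 1.6)]:
    print(f"rc={rc} rl={rl}: ~{verlet_nonzero_fraction(rc, rl):.2f}")
```

The cluster-pair scheme trades this per-atom efficiency for SIMD-friendly, fixed-size work units, which is why its raw fractions are lower yet the kernels are faster overall.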
Pair force calculation rate (Mpairs/s)

Core i7-2600 3.4 GHz:
                      PME SSE   PME non-zero SSE   PME AVX   PME non-zero AVX
rc=0.9, nstlist=10    1360      690                1420      606
rc=1.2, nstlist=10    1408      834                1382      700

GeForce GTX 580:
                      PME CUDA   PME non-zero CUDA
rc=0.9, nstlist=10    5991       2397
rc=0.9, nstlist=20    6457       2196
rc=0.9, nstlist=40    7579       1743
rc=1.2, nstlist=10    6157       3017
rc=1.2, nstlist=20    6572       2826
rc=1.2, nstlist=40    7484       2395

Non-bonded force evaluation: 2x-5x faster on GPUs
– GeForce GTX 580:
  ● effective: 6-7.5 Gpairs/s
  ● useful: 1.7-3 Gpairs/s
– Core i7-2600 3.4 GHz (SSE4.1 + AVX):
  ● effective: ~1.4 Gpairs/s
  ● useful: 0.6-0.8 Gpairs/s
(rc: cut-off, nstlist: pair-list update interval)
Single-node weak scaling
● Systems: water boxes 1500 – 3 million atoms
● Settings: electrostatic cut-off ≥0.9 nm with PME (auto-tuned), 0.9 nm with reaction-field, LJ cut-off 0.9 nm, 2 fs time step
● Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
[Figure: two panels (PME, reaction-field) plotting iteration time per 1000 atoms (ms/step, 0.05-0.3) vs. system size per GPU (1.5k-3072k atoms). Series: 1x C2075 CUDA F kernel; CPU total with 1x, 2x, and 4x C2075. The rise at small system sizes marks the limit to strong scaling.]
Absolute performance & speedup
● PME, CUDA- vs. SSE-accelerated non-bonded kernels
● System: RNase in water:
  – 24040 atoms, cubic box
  – 16816 atoms, dodecahedron box
● Settings:
– elec. cut-off tuned ≥0.9 nm
– LJ cut-off 0.9 nm
– 2 fs and 5 fs (with vsites)
● Hardware:
– 2x Xeon X5650 (6C)
– 4x Tesla C2075
Performance (ns/day):
                     3C     3C+C2075   6C     6C+2xC2075   9C     9C+3xC2075   12C    12C+4xC2075
dodec box + vsites   33.4   106.2      60.7   137.5        72.9   207.5        90.2   239.1
dodec box            15.0   48.6       27.8   61.3         33.4   93.1         42.4   108.2
cubic box, NPT       10.9   37.3       19.9   58.4         26.2   73.3         33.7   89.3
cubic box            11.2   38.7       20.7   61.3         26.8   75.7         34.6   91.2
[Figure: box of water, 1.5M atoms; ns/day (0.1-100, log scale) vs. #GPUs (1-100). Series: RF, RF linear scaling, PME, PME linear scaling.]
Strong scaling on GPU clusters
[Figure: ADH solvated protein, 134k atoms; ns/day (1-1000, log scale) vs. #GPUs.]
Settings:
– cut-off: elec. ≥0.9 nm with PME (tuned), 0.9 nm with reaction-field; LJ 0.9 nm
– 2 fs time-step
Hardware: BSC Bullx cluster
– 2x Intel Xeon E5649 (6C)
– 2x NVIDIA Tesla M2090
– 2x QDR InfiniBand 40 Gb/s
[Figure: cellulose + lignin + water, 23M atoms; ns/day (2-20) vs. #GPUs (48-480). Series: RF scaling, RF linear scaling.]
● Hardware: Cray XK6, Jaguar GPU partition 480 nodes
● Settings: reaction-field, cut-off 1.2 nm, 2 fs time-step
Courtesy: Roland Schulz, ORNL-CMB
Bonded F imbalance + kernel slowdown @10-20k atoms/GPU
Kepler outlook
● Current performance: GTX 680 ≈ GTX 580 + 15%
● Concerns:
  – integer throughput
  – nvcc 4.2:
    ● kernels slower than with 4.0, even on Kepler; worked around it, but required ninja moves
    ● unrolling can result in register spilling
● shfl-based reduction is not only elegant:
  → no in-shared-memory accumulation needed
  → shfl reduction + sm_30 = shmem reduction + sm_20 + 30%
● Dual-chip boards: GTX 690/equivalent Tesla + PCI-E 3.0
  → close to having 2x GTX 680 in PCI-E 2.0
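The shfl-based reduction mentioned above can be sketched as a scalar emulation of a `__shfl_down` butterfly: the offset is halved each step until lane 0 holds the warp sum, with no shared-memory traffic. The lane-shift helper here is an illustrative emulation, not CUDA API code:

```python
# Scalar emulation (illustrative) of a __shfl_down-based warp-sum
# reduction: at each step every lane adds the value held `offset`
# lanes above it; after log2(32) steps lane 0 holds the full sum.

WARP_SIZE = 32

def shfl_down(values, offset):
    """Emulates __shfl_down(): lane i reads lane i+offset (or itself)."""
    return [values[i + offset] if i + offset < len(values) else values[i]
            for i in range(len(values))]

def warp_reduce_sum(values):
    v = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(v, offset)          # all reads happen "in parallel"
        v = [a + b for a, b in zip(v, shifted)]  # then all lanes accumulate
        offset //= 2
    return v[0]

print(warp_reduce_sum(range(WARP_SIZE)))  # 0 + 1 + ... + 31 = 496
```

This is why the Kepler kernels can drop the 768-byte shared-memory reduction buffer entirely.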
Future directions
● Accelerating dihedral force calculations on the GPU
  → improves the CPU-GPU load balance ⇒ better scaling
● Further workload balancing/regularization
  → improves scaling to small systems ⇒ better strong scaling
● Mont Blanc: Tegra 3 + GeForce 520M
  – 38 TFlops @ ~5 kW ⇒ 7.5 GFlops/W, 3.5x better than BG/Q
Acknowledgements
● Developers:
Roland Schulz
Erik Lindahl
Sander Pronk
The GROMACS community
● NVIDIA:
  – Gernot Ziegler
  – engineering team
We are looking for computer scientists/engineers to join our team!
Hardware / support
Funding
Extra material
Atom-cluster pair algorithm
Super-cluster definition and particle sorting:

set x/y super-cluster size (s.t. super-clusters will be approx. cubic)

for each p < Np
    sx = x[p] / size_x
    sy = y[p] / size_y
    scluster_xy[p] = sx*Nsy + sy   (column of super-clusters for given x,y)

for each scluster_xy
    sort p on z
    add dummy particles to get to Ns_xy*64 particles
    (now we have Ns_xy super-clusters in this column)
    for each super-cluster with this scluster_xy
        for upper and lower half
            sort p on y
            for upper and lower half
                sort p on x
                for upper and lower half
                    define a cluster for the 8 particles we have here
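The z-sort-and-chop step of the binning above can be sketched as follows; the dummy-particle padding from the pseudocode is omitted and all names are illustrative:

```python
# Illustrative sketch of the binning step: within one x,y column, sort
# particles on z and chop the sorted order into fixed-size clusters, so
# every cluster holds the same number of atoms (padding omitted here).

CLUSTER_SIZE = 8

def clusters_by_z(z_coords):
    """Return lists of particle indices, binned by sorted z position."""
    order = sorted(range(len(z_coords)), key=lambda p: z_coords[p])
    return [order[i:i + CLUSTER_SIZE]
            for i in range(0, len(order), CLUSTER_SIZE)]

z = [5, 1, 9, 3, 12, 0, 7, 2, 11, 4, 8, 6, 14, 10, 13, 15]
for c, members in enumerate(clusters_by_z(z)):
    print(f"cluster {c}: z in [{min(z[p] for p in members)}, "
          f"{max(z[p] for p in members)}]")
# cluster 0 spans z 0-7, cluster 1 spans z 8-15
```

Fixing the atom count per cluster (rather than the cell volume) is what makes the work units map cleanly onto SIMD lanes and warps.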
Pair search and cluster generation:

for each super-cluster si
    for each super-cluster sj in range of si
        for each cj cluster in sj
            for each ci cluster in si
                if (ci in range of cj)
                    add (ci, cj) to the pair-list of (si, sj) to create the interaction
● Cluster: algorithmic work unit, 4 with SSE, 8 with CUDA
● Super-cluster: hierarchical grouping for cache-efficiency on GPUs
● Flexible cluster size: adjust to match the architecture's SIMD width
GPU non-bonded kernel in/out
In:
● Simulation constants:
  – C6/C12 params & tabulated Ewald Coulomb force: texture memory (fully cached)
● Coordinates + charges: updated every iteration
● Pair-list: updated every 10-50 iterations
  – list of "j-clusters": grouped 4-by-4 for coalesced loading
  – list of "i-superclusters" (8 i-clusters each) + references to all j-clusters in range
  – interaction bit-masks encoding:
    ● i-j cluster in-range relationship (updated at pruning)
    ● exclusions
Out: calculate only what/when needed:
  – forces: every iteration
  – energies: typically every 50-500 iterations
  – pruned pair-list: at pair-search (kept on GPU)
→ 12 kernels:
  ● 4 per output type
  ● 3 per electrostatics type
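The 12-kernel count can be sketched as a small enumeration, assuming the kernels are the product of 4 output variants and 3 electrostatics variants; the concrete variant names here are illustrative assumptions, not the exact GROMACS naming:

```python
# Sketch of enumerating 12 = 3 x 4 non-bonded kernel flavors.
# The flavor names are illustrative placeholders, not GROMACS's.

ELEC_TYPES = ["cut-off", "reaction-field", "ewald-tab"]   # 3 electrostatics
OUT_TYPES = ["F", "F+E", "F+prune", "F+E+prune"]          # 4 output variants

KERNELS = {(e, o): f"nbkernel_{e}_{o}"
           for e in ELEC_TYPES for o in OUT_TYPES}

print(len(KERNELS))                        # 12
print(KERNELS[("reaction-field", "F+E")])  # nbkernel_reaction-field_F+E
```

Specializing a kernel per flavor avoids per-iteration branching on whether energies or pruning output are needed.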
Data & control flow – parallel case
[Diagram: MD iteration timeline, parallel case. CPU: local & non-local pair search (every 10-50 iterations), bonded F, PME, integration & constraints; MPI receive of non-local x, MPI send of non-local F; waits first for non-local F, then for local F. GPU, two CUDA streams: the local stream computes local non-bonded F & pair-list pruning, the non-local stream computes non-local non-bonded F & pair-list pruning. Transfers: local and non-local pair-lists, local and non-local x,q to the GPU; non-local F, then local F back.]
Load balancing on GPU: balanced pair lists
● Pair-list splitting balances the workload:
  ⇒ improves SM load balance
  ⇒ improves performance with small inputs
  ⇒ improves scaling
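The splitting idea can be sketched as an even chop of one pair-list into per-SM chunks; the counts and sizes here are illustrative:

```python
# Sketch of pair-list splitting for SM load balance: chop n_pairs work
# units into n_chunks near-equal lists so every multiprocessor receives
# comparable work. Chunk counts here are illustrative.

def split_list(n_pairs, n_chunks):
    base, rem = divmod(n_pairs, n_chunks)
    # Spread the remainder over the first `rem` chunks.
    return [base + (1 if c < rem else 0) for c in range(n_chunks)]

print(split_list(1000, 6))  # [167, 167, 167, 167, 166, 166]
```

With few super-clusters (small systems), splitting keeps every SM busy instead of leaving some idle behind one long list.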
CPU-GPU load balancing
[Figure: PME tuning on Opteron 6276 16C + Tesla M2090; ns/day (0-10) vs. #cores (0-18). Series: without tuning, with tuning.]
Settings:
– cut-off:
  ● electrostatic: ≥0.9 nm with tuning, 0.9 nm without
  ● LJ: 0.9 nm
– pair-list update every 20 steps
– 2 fs time-step
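The idea behind the PME tuning above can be sketched as a uniform scaling: increasing the electrostatics cut-off and the PME grid spacing by the same factor keeps the Ewald accuracy roughly constant while shifting work from the CPU (FFT grid) to the GPU (pair interactions). The base values below are typical, not GROMACS's actual tuning schedule:

```python
# Sketch of CPU<->GPU work shifting via PME tuning: scale the
# electrostatics cut-off and the PME grid spacing together.
# Base values are illustrative, not the real tuning schedule.

def scale_pme_setup(rc, grid_spacing, scale):
    """Return the scaled (cut-off, grid spacing) pair."""
    return rc * scale, grid_spacing * scale

rc, spacing = 0.9, 0.12  # nm
for s in (1.0, 1.1, 1.2, 1.3):
    rc_s, sp_s = scale_pme_setup(rc, spacing, s)
    print(f"scale {s}: rc = {rc_s:.2f} nm, grid spacing = {sp_s:.3f} nm")
```

A larger cut-off means more GPU pair work but a coarser FFT grid, i.e. less CPU PME work, which is the knob the auto-tuner turns.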