TRANSCRIPT
Efficient Molecular Dynamics on Heterogeneous Architectures in
GROMACS
Berk Hess, Szilárd Páll
KTH Royal Institute of Technology
GTC 2012
GROMACS: fast, scalable, free
● Classical molecular dynamics package
● Main developers: Stockholm, Sweden & world-wide
● Open source: GPL, www.gromacs.org
● User base:
  – thousands worldwide, both academic & private
  – hundreds of thousands through Folding@Home (300k active CPUs, Apr. 2012)
● Force fields: AMBER, CHARMM, OPLS, GROMOS
● Coarse-grained simulations
● Strong focus on optimized algorithms and efficient code
  – philosophy: do more with less → scaling != absolute performance
Target application areas
● Membrane protein: 200k atoms
● Water droplet on substrate: 1.5 million atoms
● Cellulose + lignocellulose + water: 5 million atoms
GROMACS acceleration
● GROMACS 4.5:
  – highly optimized non-bonded SSE assembly kernels
  – single-GPU acceleration using OpenMM
  – wall-time per iteration as low as 1 ms
● GROMACS 4.6:
  – SSE/AVX intrinsics in all compute-intensive code
  – GPU acceleration:
    ● hard to beat the CPU
    ● re-implementing everything is not an option
● Design principles:
  – support all features
  – maximize both CPU and GPU utilization
  – develop future-proof algorithms
● Offload non-bonded force calculation to GPUs
  – strategy successfully used by other packages
  – our challenges:
    ● a fast code is hard to accelerate
    ● sub-millisecond iteration rate: latencies hurt more
[Figure: scaling of cells, parallel constraints, and virtual interaction sites (arbitrary units).]
What/how to accelerate with GPUs?
Hybrid parallelization
● Each domain maps to an MPI process
  – OpenMP within a process
  – single GPU per process
● Automated multi-level load balancing:
  – inter-process: dynamic domain resizing
  – intra-process: automated CPU ↔ GPU work shifting
Non-bonded cluster pair-list
● Standard cell grid: spatially uniform cells
● Instead: x,y gridding + z sorting + z binning (rather than full x,y,z gridding)
● Cluster pair-list generation
● Atom clusters: uniform number of atoms per cluster
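The cluster pair-list described above is built on the CPU from cluster bounding boxes rather than per-atom distances. A minimal sketch of that in-range test, with illustrative names and box coordinates (not GROMACS source):

```python
# Minimal sketch (not GROMACS source): a cluster pair enters the
# pair-list when the minimum distance between the clusters'
# axis-aligned bounding boxes is within the list cut-off.

def bbox_dist2(lo_a, hi_a, lo_b, hi_b):
    """Squared minimum distance between two axis-aligned boxes."""
    d2 = 0.0
    for d in range(3):
        gap = max(lo_b[d] - hi_a[d], lo_a[d] - hi_b[d], 0.0)
        d2 += gap * gap
    return d2

def clusters_in_range(box_i, box_j, rlist):
    lo_a, hi_a = box_i
    lo_b, hi_b = box_j
    return bbox_dist2(lo_a, hi_a, lo_b, hi_b) < rlist * rlist

# Two 0.3 nm clusters separated by a 0.7 nm gap along x:
a = ((0.0, 0.0, 0.0), (0.3, 0.3, 0.3))
b = ((1.0, 0.0, 0.0), (1.3, 0.3, 0.3))
print(clusters_in_range(a, b, 1.0), clusters_in_range(a, b, 0.5))  # True False
```

The box-box test is conservative: it may admit cluster pairs with no atom pair in range, which is exactly what the GPU-side pruning later removes.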
Non-bonded algorithm
● CPU: SSE (AVX)
  – cluster: 4 atoms
  – work unit: 4x4 pair-forces
● GPU: CUDA
  – cluster: 8 atoms
  – work unit: 8x8 pair-forces (2 warps)
  – optimize for caching: super-clusters with 8 clusters each
Heterogeneous scheme: data & control flow
Avg. CPU/GPU overlap: 60-80% per iteration
[Diagram: MD iteration timeline. CPU (OpenMP threads): pair search every 10-50 iterations, bonded F, PME, integration & constraints, wait for GPU (idle). GPU (CUDA): non-bonded F & pair-list pruning. Transfers: pair-list, coordinates & charges to the GPU; forces (and energies) back.]
GPU non-bonded kernel
sci = i-supercluster index = block index

for each cj cluster (loop over all neighbors of any ci in sci)
    load cj cluster interaction and exclusion mask
    if cj not masked
        load j atom data
        for each ci cluster in sci (loop over the 8 i-clusters)
            load i atom data
            r2 = |xj - xi|^2
            load exclusion mask (one per warp)
            extract exclusion bit for the i-j atom-pair: excl_bit
            if (r2 < rcoulomb_squared * excl_bit)
                calculate i-j Coulomb and LJ forces
                accumulate i- and j-forces in registers
        store per-thread j-forces in shmem
        reduce j-forces
reduce i-forces
Launch configuration:
– grid: #i-superclusters x 1 (one super-cluster per block)
– block: 8x8x1 = 64 threads
– shared mem: Fermi: 768 bytes (reduction); Kepler: 0 bytes
Pruning
● All-vs-all atom distance check is expensive
  → the pair-list is built with a cluster bounding-box distance check on the CPU
  → the atom-pair distances are calculated on the GPU anyway
● Solution: prune using the warp vote (Fermi+):
  __any(r2 < rlist_sq) == false ⇔ no pairs in the warp are within range
  – 10-25% overhead
  – only needs to run every pair-search step!
  – prunes 25-35% of the atom-pairs
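The warp-vote prune can be sketched in scalar form; the emulated `__any()` over 32 lanes is the only CUDA-specific piece, and all names here are illustrative:

```python
# Scalar emulation (illustrative, not GROMACS source) of the warp-vote
# prune: on the GPU each thread of the 8x8 tile holds one i-j atom-pair
# distance, and __any(r2 < rlist_sq) over the warp decides whether the
# j-cluster survives.

WARP_SIZE = 32

def warp_any(predicates):
    """Emulates CUDA __any(): true if the predicate holds on any lane."""
    return any(predicates)

def prune_j_cluster(r2_per_lane, rlist_sq):
    """Return True if the j-cluster can be pruned (no pair in range)."""
    return not warp_any(r2 < rlist_sq for r2 in r2_per_lane)

# All 32 pairs beyond the list cut-off -> pruned:
print(prune_j_cluster([2.0] * WARP_SIZE, 1.0))   # True
# A single pair in range keeps the whole j-cluster:
print(prune_j_cluster([2.0] * 31 + [0.5], 1.0))  # False
```

On the GPU the vote costs one instruction per warp, which is why the check can piggyback on distances that the force loop computes anyway.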
GPU non-bonded kernel with pruning
sci = i-supercluster index = block index

for each cj cluster (loop over all neighbors of any ci in sci)
    load cj cluster interaction and exclusion mask
    if cj not masked
        load j atom data
        for each ci cluster in sci (loop over the 8 i-clusters)
            load i atom data
            r2 = |xj - xi|^2
            if !__any(r2 < rlist_squared)
                prune cj from the current ci (pair-list pruning)
            load exclusion mask (one per warp)
            extract exclusion bit for the i-j atom-pair: excl_bit
            if (r2 < rcoulomb_squared * excl_bit)
                calculate i-j Coulomb and LJ forces
                accumulate i- and j-forces in registers
        store per-thread j-forces in shmem
        reduce j-forces
        store pruned i-mask
reduce i-forces

Launch configuration: as above (grid: #i-superclusters x 1, one super-cluster per block; block: 8x8x1 = 64 threads; shared mem: Fermi 768 bytes for the reduction, Kepler 0 bytes).
Kernel characteristics
● Full warp "skips" → branch-free execution
● Kernel emphasizes data reuse, relies heavily on caching:
  – 95% L1 hit rate
  – ~75 flops/byte ⇒ compute bound
  – ~39 flops, ~150 ops total per inner loop ⇒ many iops (Kepler concern!)
● 15.5 warps in flight
● IPC: 1.5-1.6 on Fermi
● Force accumulation requires lots of shmem/registers → limiting:
  – i-force: 8 x 64 x 12 bytes shmem, or 8 x 12 bytes reg + shmem
  – j-force: (only) 64 x 12 bytes shmem
  ⇒ ECC agnostic
GPU non-bonded kernel work-efficiency
● Work-efficiency = fraction of calculated forces that are non-zero
● Even with pruning, only 40-60% of the calculated pairs are non-zero
● Pruning improves work-efficiency by 1.6-2x

Fraction of non-zero pairs:
                  Verlet   Cluster-pair pruned   Cluster-pair unpruned
rc=0.9, rl=1.0    0.73     0.43                  0.21
rc=1.2, rl=1.3    0.75     0.52                  0.29
rc=1.5, rl=1.6    0.82     0.59                  0.36
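As a back-of-envelope check on the Verlet column (an approximation, not the measured data): for a Verlet list, the fraction of listed pairs that fall inside the real cut-off is roughly the volume ratio (rc/rl)^3, which tracks the 0.73-0.82 efficiencies above:

```python
# Approximate non-zero fraction of a Verlet pair-list as the ratio of
# the cut-off sphere volume to the list sphere volume, (rc/rl)**3.
# This is a geometric estimate, not GROMACS's measured numbers.

def verlet_nonzero_fraction(rc, rl):
    return (rc / rl) ** 3

for rc, rl in [(0.9, 1.0), (1.2, 1.3), (1.5, 1.6)]:
    print(f"rc={rc} rl={rl}: ~{verlet_nonzero_fraction(rc, rl):.2f}")
```

The cluster-pair scheme trades this per-atom efficiency for SIMD-friendly, fixed-size work units, which is why its raw fractions are lower yet the kernels are faster overall.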
Pair force calculation rate (Mpairs/s)

Core i7-2600 3.4 GHz:
                      PME SSE   PME non-zero SSE   PME AVX   PME non-zero AVX
rc=0.9, nstlist=10    1360      690                1420      606
rc=1.2, nstlist=10    1408      834                1382      700

GeForce GTX 580:
                      PME CUDA   PME non-zero CUDA
rc=0.9, nstlist=10    5991       2397
rc=0.9, nstlist=20    6457       2196
rc=0.9, nstlist=40    7579       1743
rc=1.2, nstlist=10    6157       3017
rc=1.2, nstlist=20    6572       2826
rc=1.2, nstlist=40    7484       2395

Non-bonded force evaluation: 2x-5x faster on GPUs
– GeForce GTX 580:
  ● effective: 6-7.5 Gpairs/s
  ● useful: 1.7-3 Gpairs/s
– Core i7-2600 3.4 GHz (SSE4.1 + AVX):
  ● effective: ~1.4 Gpairs/s
  ● useful: 0.6-0.8 Gpairs/s
(rc: cut-off, nstlist: pair-list update interval)
Single-node weak scaling
● Systems: water boxes 1500 – 3 million atoms
● Settings: electrostatic cut-off ≥0.9 nm with PME (auto-tuned), 0.9 nm with reaction-field, LJ cut-off 0.9 nm, 2 fs time step
● Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
[Figure: two panels (PME, reaction-field) plotting iteration time per 1000 atoms (ms/step, 0.05-0.3) vs. system size per GPU (1.5k-3072k atoms). Series: 1x C2075 CUDA F kernel; CPU total with 1x, 2x, and 4x C2075. The rise at small system sizes marks the limit to strong scaling.]
Absolute performance & speedup
● PME, CUDA- vs. SSE-accelerated non-bonded kernels
● System: RNase in water:
  – 24040 atoms, cubic box
  – 16816 atoms, dodecahedron box
● Settings:
– elec. cut-off tuned ≥0.9 nm
– LJ cut-off 0.9 nm
– 2 fs and 5 fs (with vsites)
● Hardware:
– 2x Xeon X5650 (6C)
– 4x Tesla C2075
Performance (ns/day):
                     3C     3C+C2075   6C     6C+2xC2075   9C     9C+3xC2075   12C    12C+4xC2075
dodec box + vsites   33.4   106.2      60.7   137.5        72.9   207.5        90.2   239.1
dodec box            15.0   48.6       27.8   61.3         33.4   93.1         42.4   108.2
cubic box, NPT       10.9   37.3       19.9   58.4         26.2   73.3         33.7   89.3
cubic box            11.2   38.7       20.7   61.3         26.8   75.7         34.6   91.2
[Figure: box of water, 1.5M atoms; ns/day (0.1-100, log scale) vs. #GPUs (1-100). Series: RF, RF linear scaling, PME, PME linear scaling.]
Strong scaling on GPU clusters
[Figure: ADH solvated protein, 134k atoms; ns/day (1-1000, log scale) vs. #GPUs.]
Settings:
– cut-off: elec. ≥0.9 nm with PME (tuned), 0.9 nm with reaction-field; LJ 0.9 nm
– 2 fs time-step
Hardware: BSC Bullx cluster
– 2x Intel Xeon E5649 (6C)
– 2x NVIDIA Tesla M2090
– 2x QDR InfiniBand 40 Gb/s
[Figure: cellulose + lignin + water, 23M atoms; ns/day (2-20) vs. #GPUs (48-480). Series: RF scaling, RF linear scaling.]
● Hardware: Cray XK6, Jaguar GPU partition 480 nodes
● Settings: reaction-field, cut-off 1.2 nm, 2 fs time-step
Courtesy: Roland Schulz, ORNL-CMB
Bonded F imbalance + kernel slowdown @10-20k atoms/GPU
Kepler outlook
● Current performance: GTX 680 ≈ GTX 580 + 15%
● Concerns:
  – integer throughput
  – nvcc 4.2:
    ● kernels slower than with 4.0, even on Kepler; worked around it, but required ninja moves
    ● unrolling can result in register spilling
● shfl-based reduction is not only elegant:
  → no in-shared-memory accumulation needed
  → shfl reduction + sm_30 = shmem reduction + sm_20 + 30%
● Dual-chip boards: GTX 690/equivalent Tesla + PCI-E 3.0
  → close to having 2x GTX 680 in PCI-E 2.0
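The shfl-based reduction mentioned above can be sketched as a scalar emulation of a `__shfl_down` butterfly: the offset is halved each step until lane 0 holds the warp sum, with no shared-memory traffic. The lane-shift helper here is an illustrative emulation, not CUDA API code:

```python
# Scalar emulation (illustrative) of a __shfl_down-based warp-sum
# reduction: at each step every lane adds the value held `offset`
# lanes above it; after log2(32) steps lane 0 holds the full sum.

WARP_SIZE = 32

def shfl_down(values, offset):
    """Emulates __shfl_down(): lane i reads lane i+offset (or itself)."""
    return [values[i + offset] if i + offset < len(values) else values[i]
            for i in range(len(values))]

def warp_reduce_sum(values):
    v = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(v, offset)          # all reads happen "in parallel"
        v = [a + b for a, b in zip(v, shifted)]  # then all lanes accumulate
        offset //= 2
    return v[0]

print(warp_reduce_sum(range(WARP_SIZE)))  # 0 + 1 + ... + 31 = 496
```

This is why the Kepler kernels can drop the 768-byte shared-memory reduction buffer entirely.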
Future directions
● Accelerating dihedral force calculations on the GPU
  → improves the CPU-GPU load balance ⇒ better scaling
● Further workload balancing/regularization
  → improves scaling to small systems ⇒ better strong scaling
● Mont Blanc: Tegra 3 + GeForce 520M
  – 38 TFlops @ ~5 kW ⇒ 7.5 GFlops/W, 3.5x better than BG/Q
Acknowledgements
● Developers:
Roland Schulz
Erik Lindahl
Sander Pronk
The GROMACS community
● NVIDIA:
  – Gernot Ziegler
  – engineering team
We are looking for computer scientists/engineers to join our team!
Hardware / support
Funding
Extra material
Atom-cluster pair algorithm
Super-cluster definition and particle sorting:

set x/y super-cluster size (s.t. super-clusters will be approx. cubic)

for each p < Np
    sx = x[p] / size_x
    sy = y[p] / size_y
    scluster_xy[p] = sx*Nsy + sy   (column of super-clusters for given x,y)

for each scluster_xy
    sort p on z
    add dummy particles to get to Ns_xy*64 particles
    (now we have Ns_xy super-clusters in this column)
    for each super-cluster with this scluster_xy
        for upper and lower half
            sort p on y
            for upper and lower half
                sort p on x
                for upper and lower half
                    define a cluster for the 8 particles we have here
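The z-sort-and-chop step of the binning above can be sketched as follows; the dummy-particle padding from the pseudocode is omitted and all names are illustrative:

```python
# Illustrative sketch of the binning step: within one x,y column, sort
# particles on z and chop the sorted order into fixed-size clusters, so
# every cluster holds the same number of atoms (padding omitted here).

CLUSTER_SIZE = 8

def clusters_by_z(z_coords):
    """Return lists of particle indices, binned by sorted z position."""
    order = sorted(range(len(z_coords)), key=lambda p: z_coords[p])
    return [order[i:i + CLUSTER_SIZE]
            for i in range(0, len(order), CLUSTER_SIZE)]

z = [5, 1, 9, 3, 12, 0, 7, 2, 11, 4, 8, 6, 14, 10, 13, 15]
for c, members in enumerate(clusters_by_z(z)):
    print(f"cluster {c}: z in [{min(z[p] for p in members)}, "
          f"{max(z[p] for p in members)}]")
# cluster 0 spans z 0-7, cluster 1 spans z 8-15
```

Fixing the atom count per cluster (rather than the cell volume) is what makes the work units map cleanly onto SIMD lanes and warps.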
Pair search and cluster generation:

for each super-cluster si
    for each super-cluster sj in range of si
        for each cj cluster in sj
            for each ci cluster in si
                if (ci in range of cj)
                    add (ci, cj) to the pair-list of (si, sj) to create the interaction
● Cluster: algorithmic work unit, 4 with SSE, 8 with CUDA
● Super-cluster: hierarchical grouping for cache-efficiency on GPUs
● Flexible cluster size: adjust to match the architecture's SIMD width
GPU non-bonded kernel in/out
In:
● Simulation constants:
  – C6/C12 params & tabulated Ewald Coulomb force: texture memory (fully cached)
● Coordinates + charges: updated every iteration
● Pair-list: updated every 10-50 iterations
  – list of "j-clusters": grouped 4-by-4 for coalesced loading
  – list of "i-superclusters" (8 i-clusters each) + references to all j-clusters in range
  – interaction bit-masks encoding:
    ● i-j cluster in-range relationship (updated at pruning)
    ● exclusions
Out: calculate only what/when needed:
  – forces: every iteration
  – energies: typically every 50-500 iterations
  – pruned pair-list: at pair-search (kept on GPU)
→ 12 kernels:
  ● 4 per output type
  ● 3 per electrostatics type
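The 12-kernel count can be sketched as a small enumeration, assuming the kernels are the product of 4 output variants and 3 electrostatics variants; the concrete variant names here are illustrative assumptions, not the exact GROMACS naming:

```python
# Sketch of enumerating 12 = 3 x 4 non-bonded kernel flavors.
# The flavor names are illustrative placeholders, not GROMACS's.

ELEC_TYPES = ["cut-off", "reaction-field", "ewald-tab"]   # 3 electrostatics
OUT_TYPES = ["F", "F+E", "F+prune", "F+E+prune"]          # 4 output variants

KERNELS = {(e, o): f"nbkernel_{e}_{o}"
           for e in ELEC_TYPES for o in OUT_TYPES}

print(len(KERNELS))                        # 12
print(KERNELS[("reaction-field", "F+E")])  # nbkernel_reaction-field_F+E
```

Specializing a kernel per flavor avoids per-iteration branching on whether energies or pruning output are needed.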
Data & control flow – parallel case
[Diagram: MD iteration timeline, parallel case. CPU: local & non-local pair search (every 10-50 iterations), bonded F, PME, integration & constraints; MPI receive of non-local x, MPI send of non-local F; waits first for non-local F, then for local F. GPU, two CUDA streams: the local stream computes local non-bonded F & pair-list pruning, the non-local stream computes non-local non-bonded F & pair-list pruning. Transfers: local and non-local pair-lists, local and non-local x,q to the GPU; non-local F, then local F back.]
Load balancing on GPU: balanced pair lists
● Pair-list splitting balances the workload:
  ⇒ improves SM load balance
  ⇒ improves performance with small inputs
  ⇒ improves scaling
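The splitting idea can be sketched as an even chop of one pair-list into per-SM chunks; the counts and sizes here are illustrative:

```python
# Sketch of pair-list splitting for SM load balance: chop n_pairs work
# units into n_chunks near-equal lists so every multiprocessor receives
# comparable work. Chunk counts here are illustrative.

def split_list(n_pairs, n_chunks):
    base, rem = divmod(n_pairs, n_chunks)
    # Spread the remainder over the first `rem` chunks.
    return [base + (1 if c < rem else 0) for c in range(n_chunks)]

print(split_list(1000, 6))  # [167, 167, 167, 167, 166, 166]
```

With few super-clusters (small systems), splitting keeps every SM busy instead of leaving some idle behind one long list.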
CPU-GPU load balancing
[Figure: PME tuning on Opteron 6276 16C + Tesla M2090; ns/day (0-10) vs. #cores (0-18). Series: without tuning, with tuning.]
Settings:
– cut-off:
  ● electrostatic: ≥0.9 nm with tuning, 0.9 nm without
  ● LJ: 0.9 nm
– pair-list update every 20 steps
– 2 fs time-step
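The idea behind the PME tuning above can be sketched as a uniform scaling: increasing the electrostatics cut-off and the PME grid spacing by the same factor keeps the Ewald accuracy roughly constant while shifting work from the CPU (FFT grid) to the GPU (pair interactions). The base values below are typical, not GROMACS's actual tuning schedule:

```python
# Sketch of CPU<->GPU work shifting via PME tuning: scale the
# electrostatics cut-off and the PME grid spacing together.
# Base values are illustrative, not the real tuning schedule.

def scale_pme_setup(rc, grid_spacing, scale):
    """Return the scaled (cut-off, grid spacing) pair."""
    return rc * scale, grid_spacing * scale

rc, spacing = 0.9, 0.12  # nm
for s in (1.0, 1.1, 1.2, 1.3):
    rc_s, sp_s = scale_pme_setup(rc, spacing, s)
    print(f"scale {s}: rc = {rc_s:.2f} nm, grid spacing = {sp_s:.3f} nm")
```

A larger cut-off means more GPU pair work but a coarser FFT grid, i.e. less CPU PME work, which is the knob the auto-tuner turns.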