Molecular Dynamics Range-Limited
Force Evaluation Optimized for FPGA
Chen Yang§, Tong Geng§, Tianqi Wang*§, Charles Lin‡, Jiayi Sheng †,
Vipin Sachdeva‡, Woody Sherman‡, Martin Herbordt§
§Department of Electrical and Computer Engineering, Boston University, Boston, MA
*Department of Physics, University of Science and Technology of China, China
‡Silicon Therapeutics, Boston, MA
†Falcon Computing Solutions, Inc., Los Angeles, CA
07/17/2019
Boston University Slideshow Title Goes Here
Why Molecular Dynamics Simulation is so important …
▪ Core of Computational Chemistry
▪ MD is a large fraction of supercomputing cycles (~25%)
▪ Central to Computational Biology, with applications to
→ Drug design
→ Understanding disease processes
From DeMarco & Daggett: PNAS 2/24/04
Shows conversion of PrP protein from
healthy to harmful isoform. Aggregation of
misfolded intermediates appears to be the
pathogenic species in amyloid (e.g. “mad
cow” & Alzheimer’s) diseases.
Note: this could only have been discovered
with simulation!
Why Acceleration of Molecular Dynamics is so important …
1 ms of simulated reality for a 90K-particle simulation takes:
8 cores (PC)            4000 years
1K cores (cluster)      30 years
1.6M cores (Sequoia)    ??? (how to scale 90K particles to 1.6M cores?)
0.5K ASICs (Anton)      0.1 years
Question: Can we get similar performance w/ off-the-shelf components?
Source: Folding@home
Time and Length Scales in Biology
[Figure: time scale (s) versus number of particles (10^3 to 10^14), spanning proteins, virus, chromatophore, bacteria cell scale, and eukaryote cell scale. Marked time scales: 1 core-sec, 1 µs, 1 ms. Annotations: "Current limit on strong scaling w/ CPU/GPU clusters"; "1 day AMBER w/ 2 GPUs"; "1 day w/ 1K cores"; "1 day w/ 2K cores"; "1 day w/ Anton2"; "16 days w/ Anton2"; "2+ orders of magnitude gap".]
Weak Scaling (works!)
Strong Scaling (very hard!)
Classic MD perhaps not viable
Crucial applications!
• Protein Folding
• Drug Discovery
7/22/2019 High-Performance Communication and Application on FPGA-Centric Clusters
FPGA/MD Road Map
2008
Single FPGA 3x better than GPU ☺
But FPGAs are too expensive
FPL2009, TRETS2010, FCCM 2011
2008
Anton 100x better than anything!
But costs $??? and is not available
→ Shows potential of clusters w/ ultra-low latency
2014
FPGA Clusters demonstrate strong scaling ☺
But proof-of-concept only. And FPGAs still not viable.
HPEC2014, HEART2015
2016-Present - FPGAs become plausible HPC devices ☺☺
Today – Time to build FPGA cluster for MD?
We know strong scaling will work
BUT
Is single FPGA performance still competitive?
ASAP2019, SC2019
2016
Anton II still 100x better than anything!
But costs $??? and is not available
→ Shows potential of clusters w/ ultra-low latency
Molecular Dynamics Simulation
▪ Why is MD important?
  ▪ Core of Computational Chemistry
  ▪ Central to Computational Biology, with applications to
    ▪ Drug design
    ▪ Understanding disease processes
▪ State of the art: 1 ms of simulated reality for a 90K-particle system
  8 cores (PC)             4000 years
  1K cores (HPC cluster)   50 years
  0.5K ASICs (Anton)       0.1 year
  Question: Can we get similar performance with COTS components such as FPGAs?
▪ What needs to be done?
  ▪ Force Evaluation (90%)
    ▪ Bonded force
    ▪ Non-bonded force
  ▪ Motion Update (10%): $F = m \frac{d^2 r}{dt^2}$
  $F_i^{total} = F^{bond} + F^{angle} + F^{torsion} + F^{H} + F^{non\text{-}bonded}$
  First four terms: generally O(n). Non-bonded term: initially O(n²), performed on accelerators.
▪ Why is MD hard?
  ▪ All-to-all communication, hard to scale
  ▪ Paradox: more compute nodes shorten computation time, but communication latency increases with the number of hops
Outline
▪ MD Background
▪ Single-chip End-to-end MD System Implementation
▪ Evaluation
▪ Future Work:
  ▪ Multi-FPGA Strong Scaling
  ▪ Communication Pattern Analysis
Background: Non-Bonded Force and O(N²) Complexity
▪ Lennard-Jones Force:
  ▪ Decays quickly -> cut-off!
▪ Coulombic Force:
  ▪ Does not decay fast enough
  ▪ Partition!
  ▪ Short-range part:
    ▪ Direct computation
  ▪ Long-range part:
    ▪ Flat curve
    ▪ Particle Mesh Ewald algorithm
    ▪ Update every few iterations
Lennard-Jones Force (cut-off approximation):
$$\frac{F_i^{LJ}}{r_{ji}} = \sum_{j \neq i} \frac{\epsilon_{ab}}{\sigma_{ab}^{2}} \left\{ 12 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{14} - 6 \left( \frac{\sigma_{ab}}{|r_{ji}|} \right)^{8} \right\}$$
Coulombic Force (short-range part computed directly; long-range part not on critical path):
$$\frac{F_i^{C}}{r_{ji}} = q_i \sum_{j \neq i} \frac{q_j}{|r_{ji}|^{3}}$$
Fundamental problem: computation is not on particles, it's on particle PAIRS
Analogous to matrix-matrix multiplication (MMM), there exists a solution based on spatial locality
▪ Each (reference) particle is part of many particle pairs (neighbor set)
▪ Each "other" particle interacts with its own (different) neighbor set of particles
▪ Method 1: Coarse spatial sorting by cell
  ▪ Problem: 85% of work is unnecessary
▪ Method 2: Each particle maintains its own list
  ▪ Problem: Large preprocessing overhead
  ▪ Problem: Lots of storage
Background: Cutoff Radius and Cell-list
▪ Cutoff radius rc:
  ▪ Evaluate the non-bonded force only when the distance is within rc
  ▪ F = 0 when r > rc
▪ Cell list:
  ▪ Cell size = rc³
  ▪ Given reference particle P, only evaluate neighbor particles within the 26 nearest neighbor cells (plus the home cell)
▪ Newton's 3rd Law:
  ▪ Particle interaction is mutual -> only evaluate half of the particles within the cutoff
▪ Filtering:
  ▪ Remove unnecessary particles outside the cutoff
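The cell list, half-shell, and filtering steps above can be sketched in software. This is a minimal illustration, not the FPGA design: the function names (`build_cells`, `pairs_within_cutoff`), the cubic periodic box, and the requirement of at least 3 cells per dimension are assumptions made for the example.

```python
import itertools

# Half shell: 13 of the 26 neighbor-cell offsets, so each cell pair is visited once
# (Newton's 3rd law lets one evaluation serve both particles of a pair).
HALF_SHELL = [o for o in itertools.product((-1, 0, 1), repeat=3) if o > (0, 0, 0)]


def min_image(d, box):
    """Minimum-image displacement along one axis of a periodic box."""
    return d - box * round(d / box)


def build_cells(positions, box, rc):
    """Bin particle indices into cubic cells whose edge is at least rc."""
    n = int(box // rc)                      # cells per dimension
    assert n >= 3, "sketch assumes at least 3 cells per dimension"
    edge = box / n
    cells = {}
    for idx, p in enumerate(positions):
        key = tuple(min(int(c / edge), n - 1) for c in p)
        cells.setdefault(key, []).append(idx)
    return cells, n


def pairs_within_cutoff(positions, box, rc):
    """Return the set of index pairs (i, j), i < j, with distance <= rc."""
    cells, n = build_cells(positions, box, rc)
    rc2 = rc * rc
    found = set()

    def check(i, j):
        # Filtering step: discard candidate pairs outside the cutoff.
        d2 = sum(min_image(a - b, box) ** 2
                 for a, b in zip(positions[i], positions[j]))
        if d2 <= rc2:
            found.add((min(i, j), max(i, j)))

    for (cx, cy, cz), members in cells.items():
        for i, j in itertools.combinations(members, 2):    # home cell
            check(i, j)
        for dx, dy, dz in HALF_SHELL:                      # 13 neighbor cells
            key = ((cx + dx) % n, (cy + dy) % n, (cz + dz) % n)
            for i in members:
                for j in cells.get(key, []):
                    check(i, j)
    return found
```

Each accepted pair would then be sent through the force pipeline once, with the result applied to one particle and its negation to the other.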
MD Related Work
▪ Software toolkit (Targeting CPU and GPU Cluster)
▪ Amber, NAMD, OpenMM, Gromacs, LAMMPS, etc.
▪ GPU
▪ GTX1080Ti (Courtesy of Silicon Therapeutics)
▪ OpenMM with OpenCL (86.21 ns/day)
▪ ASICs
▪ Anton & Anton 2 by DE Shaw Group
▪ SoC covers force evaluation, motion integration
▪ Off-chip SRAM for data storage & management
▪ MDGRAPE-4 by Riken Quantitative Biology Center
▪ SoC featuring 64 force evaluation pipelines
▪ Focus on non-bonded force
Where we are now…
▪ Prior FPGA/GPU work relies on an off-chip device to host and process the input/output
  ▪ Not enough on-chip RAM -> particle data in DRAM (long latency)
  ▪ Host generates the input particle pairs
  ▪ Host performs Motion Update & Particle Migration
▪ New opportunity for an end-to-end MD system on FPGA(s)!
  ▪ High-end FPGAs now feature ~200 Mb of on-chip SRAM (enough for a small dataset)
  ▪ MGTs provide high-bandwidth, low-latency communication (distributed storage)
  ▪ Support for native floating point (better accuracy)
  ▪ More resources on-chip (more function units)
What if we can keep all the data on chip?
Single-Chip End-to-End MD Architecture
▪ Dataset: Liquid Argon
  ▪ Monoisotopic element
  ▪ LJ interaction only
▪ Particle Cache
  ▪ Position, Velocity, Force
▪ Filter Banks
  ▪ Multiple filters per pipeline with buffers
▪ Force Evaluation
  ▪ Non-bonded force evaluation
▪ Accumulation
  ▪ Accumulate particle force to both reference particle and neighbor particles
▪ Motion Update & Particle Migration
  ▪ Update particle position
  ▪ Update particle velocity
[Figure: architecture block diagram with Position, Velocity, and Force Caches; Particle Pairs Generation (Ref, Neighbor); filters with Filter Buffer and Filter Arbitration; Force Pipeline; Force Negation; reference and neighbor accumulators (Ref Acc, Neighbor Acc); Motion Update.]
Mapping: Particles on Memory Modules
▪ Q1: Which memory resource?
  ▪ MLAB: LUT-based, suitable for small sizes
  ▪ M20K: flexible, large capacity, high performance
  ▪ eSRAM: large capacity, low latency, low flexibility, low availability
▪ Q2: Unified memory or separate memories for different datatypes?
  ▪ Position, Velocity, Force
  ▪ Different access time & access frequency
▪ Q3: Single global memory or individual cell memories?
  ▪ Mem 1: Single Global Memory
    ▪ Simplified wiring, but limited read & write bandwidth
  ▪ Mem 2: Individual Cell Memory
    ▪ Complex wiring, but sufficient bandwidth
[Figure: the two memory organizations. Mem 1: Single Global Memory; Mem 2: Individual Cell Memory (Cell 0 ... Cell N). Each feeds Pair Gen units, filters (F), and force pipelines Force 1 ... Force M.]
Mapping: Workload on Pipelines
▪ Distribution 1: Pipelines working on the same reference particle
  [Figure: cells CN ... CN+13 feed the filters of Force 1 ... Force M; an adder tree sums the partial results.]
▪ Distribution 2: Pipelines working on the same cell, but different reference particles
  [Figure: cells CN ... CN+13 feed the filters; Force 1 ... Force M each have their own accumulator (Acc).]
▪ Distribution 3: Pipelines working on different home cells
  [Figure: Force 1 takes cells C1 ... C14, Force 2 takes cells C2 ... C15, ..., Force M takes cells CN ... CN+13; each pipeline has its own filters and accumulator.]
Implementation: Filter Logic
▪ 8 filters per pipeline
  ▪ Average pass rate = $\frac{\frac{4}{3}\pi r_c^3}{27 \times r_c^3} \approx 15.5\%$
  ▪ ≥ 7 filters needed to guarantee a throughput of 1
▪ Direct Computation
  ▪ Datatype: single precision
  ▪ $r^2 = \Delta x^2 + \Delta y^2 + \Delta z^2 \le r_c^2$
  ▪ DSP usage: 10
▪ Planar Method [1]
  ▪ Datatype: fixed point
  ▪ $|x| + |y| + |z| < \sqrt{3}\, r_c$
  ▪ DSP usage: 0
  ▪ Overhead: 7% extra work
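The filter arithmetic above can be checked with a short script. It is only a sketch: the slide shows one planar inequality, read here as the tangent-plane bound |x| + |y| + |z| ≤ √3·rc; the per-axis and pairwise planes in `planar_filter` are additional assumptions for illustration, and all function names are hypothetical.

```python
import math


def average_pass_rate():
    """Cutoff-sphere volume over the 27-cell neighborhood volume (cell edge = rc)."""
    return (4.0 / 3.0) * math.pi / 27.0


def filters_needed(pass_rate):
    """Filters per pipeline so that, on average, one pair passes per cycle."""
    return math.ceil(1.0 / pass_rate)


def direct_filter(dx, dy, dz, rc):
    """Exact test: squared distance against squared cutoff (needs multipliers)."""
    return dx * dx + dy * dy + dz * dz <= rc * rc


def planar_filter(dx, dy, dz, rc):
    """Conservative test using only additions and comparisons.

    Assumed plane set: every plane bounds the cutoff sphere, so no pair
    inside the sphere is rejected; some pairs outside it still pass.
    """
    ax, ay, az = abs(dx), abs(dy), abs(dz)
    return (max(ax, ay, az) <= rc
            and max(ax + ay, ay + az, ax + az) <= math.sqrt(2.0) * rc
            and ax + ay + az <= math.sqrt(3.0) * rc)


if __name__ == "__main__":
    rate = average_pass_rate()
    print(f"pass rate = {rate:.1%}, filters needed = {filters_needed(rate)}")
```

Pairs that pass the planar test but lie outside the sphere are the extra work the force pipeline has to absorb; this sketch does not attempt to reproduce the 7% figure quoted on the slide.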
[Figure: datapaths of the two filter designs. Both subtract the coordinates (xi - xj, yi - yj, zi - zj) and apply a +/- Box Length correction to obtain Δx, Δy, Δz. Direct Computation: floating-point subtractors, multipliers, and adders form Δx² + Δy² + Δz², which a comparator checks against rc². Planar Method: fixed-point subtractors and adders form the sum that a comparator checks against √3 rc.]
[1] M. Chiu, Molecular Dynamics Simulations on High Performance Reconfigurable Computing Systems, ACM-TRETS (2010)
Implementation: Force Evaluation using Interpolation
▪ Interpolation: Sections and Intervals
  ▪ Divide the curve into multiple sections
  ▪ Each section covers double the range of the previous one
  ▪ Same # of intervals within each section
  ▪ Each interval: interpolation with polynomials
  ▪ $r^{-k} = C_2 (x - a)^2 + C_1 (x - a) + C_0$
▪ Force Evaluation with Table Lookup
  ▪ Pre-calculate coefficients and store them in lookup tables
  ▪ Use R² as the table lookup index
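A software sketch of the section/interval lookup described on this slide. It assumes first-order interpolation with secant coefficients; how the actual design fits its coefficients is not stated here, and `build_table`, `lookup`, and the parameter names are hypothetical.

```python
import math


def build_table(f, x_min, n_sections, n_intervals):
    """Precompute (a, c0, c1) per interval so that f(x) ~ c1 * (x - a) + c0.

    Section s spans [x_min * 2**s, x_min * 2**(s + 1)), so each section
    covers double the range of the previous one, with the same number of
    intervals in every section.
    """
    table = []
    lo = x_min
    for _ in range(n_sections):
        width = lo / n_intervals            # section length is lo
        row = []
        for i in range(n_intervals):
            a = lo + i * width
            c0 = f(a)
            c1 = (f(a + width) - c0) / width
            row.append((a, c0, c1))
        table.append(row)
        lo *= 2.0
    return table


def lookup(table, x_min, n_intervals, x):
    """Evaluate the interpolated function at x = r^2."""
    s = int(math.floor(math.log2(x / x_min)))   # section: from the exponent
    lo = x_min * 2.0 ** s
    i = min(int((x - lo) / (lo / n_intervals)), n_intervals - 1)  # interval
    a, c0, c1 = table[s][i]
    return c1 * (x - a) + c0


if __name__ == "__main__":
    # r^-14 expressed as a function of x = r^2, so no square root is needed.
    r14 = lambda x: x ** -7
    tab = build_table(r14, x_min=0.25, n_sections=6, n_intervals=64)
    x = 2.5
    print(lookup(tab, 0.25, 64, x), r14(x))
```

In hardware the section and interval indexes would come from the exponent and the leading mantissa bits of the floating-point R² value rather than from a logarithm.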
$$\frac{F_i^{LJ}}{r_{ji}} = \sum_{j \neq i} \left( A_{ab}\, |r_{ji}|^{-14} + B_{ab}\, |r_{ji}|^{-8} \right)$$
$$\frac{F_i^{RL}}{r_{ji}} = A_{ab}\, R_{14}(|r_{ji}|^2) + B_{ab}\, R_{8}(|r_{ji}|^2)$$
[Figure: two force pipeline datapaths, both starting from R² = Δx² + Δy² + Δz² computed with +/- Box Length correction. Interpolation pipeline (single float, x = R²): extract exponent and mantissa to locate the segment and interval and to extract a; coefficient memories supply c0 and c1 for the R^-14 and R^-8 tables; the results are multiplied by Aji and Bji, combined, and multiplied by Δx, Δy, Δz to give Fx, Fy, Fz. Direct-computation pipeline (Coulomb + LJ with C3 smooth function): uses division and square root to form R^-2, R^-3, R^-4, R^-6, R^-8, and R^-14, scales by Aji, Bji, and QQji, applies the smoothing terms G0, G1, G2, and sums before multiplying by Δx, Δy, Δz.]
Implementation: Partial Force Accumulation
▪ Task:
  ▪ Accumulate to reference & neighbor particles
▪ Challenge:
  ▪ DSP latency: 3 cycles
▪ Reference Particle Acc
  ▪ Location: output side of force pipeline
  ▪ Accumulating to the same value
  ▪ Solution: accumulate to 3 different temp values
▪ Neighbor Particle Acc
  ▪ Location: input side of force cache
  ▪ Accumulating to different values most of the time
  ▪ Some chance of accumulating to the same value
  ▪ Solution: data hazard detection by checking active IDs
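The two accumulation schemes can be modeled in software as follows. This is a behavioral sketch under assumptions: the 3-cycle adder latency comes from the slide, but stalling on a hazard is only one possible response to a detected conflict, and both function names are hypothetical.

```python
from collections import deque


def accumulate_interleaved(values, latency=3):
    """Reference particle: rotate over `latency` partial sums.

    Each partial sum is touched once every `latency` cycles, so a new add
    never depends on a result that is still inside the adder pipeline.
    """
    partial = [0.0] * latency
    for cycle, v in enumerate(values):
        partial[cycle % latency] += v
    return sum(partial)


def neighbor_accumulate(updates, latency=3):
    """Neighbor particles: issue one add per cycle unless the target particle
    ID is still in flight (assumed behavior: stall until it retires).

    `updates` is a sequence of (particle_id, force) pairs.
    Returns (force_cache, stall_cycles).
    """
    force_cache = {}
    in_flight = deque()                     # (pid, new_sum, cycles_left)
    pending = deque(updates)
    stalls = 0
    while pending or in_flight:
        # Advance the adder pipeline by one cycle and retire finished adds.
        in_flight = deque((p, v, c - 1) for p, v, c in in_flight)
        while in_flight and in_flight[0][2] == 0:
            pid, total, _ = in_flight.popleft()
            force_cache[pid] = total
        if pending:
            pid, force = pending[0]
            if any(p == pid for p, _, _ in in_flight):
                stalls += 1                 # data hazard: same ID is active
            else:
                pending.popleft()
                in_flight.append((pid, force_cache.get(pid, 0.0) + force, latency))
    return force_cache, stalls
```

For example, two back-to-back updates to the same particle ID force the second one to wait until the first has left the adder, while updates to different IDs issue on consecutive cycles.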
[Figure: Reference Particle Accumulator: input buffer, particle ID comparison (==?), FP adder, and a Sum Cache holding Sum1, Sum2, Sum3, whose total is written to the Force Cache. Neighbor Particle Accumulator: input buffer, particle ID comparison against the active IDs PID1, PID2, PID3, read enable, FP adder, and Force Cache.]
Implementation: Motion Update & Particle Migration
▪ Motion Update
  ▪ Classic Newtonian law
  ▪ Frequency: once every simulation timestep
  ▪ Datatype: floating point
▪ Particle Migration
  ▪ Small timestep -> migrate only to nearest neighbor cells
  ▪ Cell boundary: table lookup
  ▪ Cell particle update: double buffering in Position & Velocity Cache
    ▪ Write new particles to the alternate set of memory
    ▪ Avoids maintaining a list of vacant memory space
Motion Update Datapath (Position & Velocity Cache):
$$\vec{a}(t) = \frac{\vec{F}(t)}{m}$$
$$\vec{v}(t + \Delta t) = \vec{v}(t) + \vec{a}(t) \times \Delta t$$
$$\vec{r}(t + \Delta t) = \vec{r}(t) + \vec{v}(t + \Delta t) \times \Delta t$$
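The three update equations translate directly into code. A minimal sketch for one particle, with hypothetical names and plain tuples standing in for the cached position, velocity, and force:

```python
def motion_update(pos, vel, force, mass, dt):
    """One timestep: a = F / m, then v(t + dt), then r(t + dt) using the new velocity."""
    acc = tuple(f / mass for f in force)
    new_vel = tuple(v + a * dt for v, a in zip(vel, acc))
    new_pos = tuple(r + v * dt for r, v in zip(pos, new_vel))
    return new_pos, new_vel


if __name__ == "__main__":
    new_pos, new_vel = motion_update(
        pos=(0.0, 0.0, 0.0), vel=(1.0, 0.0, 0.0),
        force=(2.0, 0.0, 0.0), mass=2.0, dt=0.5)
    print(new_pos, new_vel)
```

Note that the position update uses the already-updated velocity, matching the third equation above.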
[Figure: particle migration between cells CN-1, CN, and CN+1 (particles P0 ... PM, migrating particle PX), comparing On-Site Update with Update with Double Buffer.]
Evaluation: Experimental Setup
▪ FPGA: Reflex XpressGX S10-FH200G Board
  ▪ Intel Stratix 10 1SG280LU2F50E2VG
  ▪ Abundant DSP and RAM resources:
    ALMs       RAM Blocks   DSP Units
    933,120    11,721       5,760
▪ MD Dataset:
  ▪ Liquid Argon dataset: 20K atoms, generated by Packmol [1], LJ interaction only
▪ MD Benchmark:
  ▪ Amber: software-based benchmark, supports CUDA
[1] L. Martinez, R. Andrade, E. Birgin, and J. Martinez. PACKMOL: a package for building initial configurations for molecular dynamics simulations. Journal of Computational Chemistry, 30(13):2157–2164, 2009.
Evaluation: Interpolation Accuracy
▪ Which interpolation order?
▪ How many intervals per section?
Accuracy by interpolation order and number of intervals per section:
Intervals    16        32        64        128       256
1st Order    99.7700   99.9336   99.9842   99.9960   99.9990
2nd Order    99.9899   99.9988   99.9998   99.9999   99.9999
3rd Order    99.9996   99.9999   99.9999   99.9999   99.9999
▪ Higher order -> more DSP usage & more indexes to store
▪ More intervals -> more memory usage (Block RAM)
Evaluation: System Resource Utilization
▪ Quick Recap:
  ▪ 2 particle-to-memory mappings:
    ▪ Mem 1: All particles in a single large memory unit
    ▪ Mem 2: Individual cell memory
  ▪ 3 workload distributions:
    ▪ Distribution 1: All pipelines working on the same reference particle
    ▪ Distribution 2: All pipelines working on the same home cell, but different reference particles
    ▪ Distribution 3: Pipelines working on different home cells
  ▪ Total of 6 combinations
Evaluation: System Resource Utilization
Design                 ALM               BRAM            DSP             Freq (MHz)   # Pipe   # MU
Design1: Mem1+Dist1    728,678 (78.1%)   8,530 (72.8%)   2,772 (48.1%)   352          105      10
Design2: Mem2+Dist1    740,954 (79.4%)   5,078 (43.3%)   2,037 (35.4%)   338          55       10
Design3: Mem1+Dist2    728,378 (78.1%)   8,320 (70.1%)   2,391 (41.5%)   343          105      10
Design4: Mem2+Dist2    740,654 (79.4%)   5,078 (43.3%)   1,656 (28.8%)   340          55       10
Design5: Mem1+Dist3    746,561 (80.0%)   8,734 (74.5%)   2,436 (42.3%)   346          110      10
Design6: Mem2+Dist3    753,060 (80.7%)   8,420 (71.8%)   2,466 (42.9%)   346          110      10
▪ Design 2 & 4: require all-to-all connection between distributed cell memories and pipelines
▪ Design 1 & 3: Design 1 requires a large adder tree to sum up partial results from all pipelines, thus slightly fewer pipelines
▪ Design 5 & 6: no adder tree, no all-to-all connection
▪ In general, Mem1 seems beneficial due to simplified interconnection
Global Memory is the way to go??
Evaluation: Simulation Time Performance
Platform                   Iteration Time (µs)   Simulation Rate (ns/day)
CPU (i7-8700K) 1-core      71,300                2.42
CPU (i7-8700K) 24-core     11,800                14.64
GPU (GTX 1080Ti)           402                   430.05
Design1: Mem1+Dist1        63,645                2.72
Design2: Mem2+Dist1        184                   941.45
Design3: Mem1+Dist2        832                   207.72
Design4: Mem2+Dist2        255                   677.53
Design5: Mem1+Dist3        180                   959.43
Design6: Mem2+Dist3        122                   1412.98
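The two columns of the table are related by a simple conversion. The sketch below assumes a 2 fs timestep, which is not stated on the slide but reproduces the listed rates to within the rounding of the iteration times; `ns_per_day` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400.0


def ns_per_day(iteration_time_us, timestep_fs=2.0):
    """Simulated nanoseconds per wall-clock day.

    timestep_fs = 2.0 is an assumption inferred from the table, not a
    value given in the slides.
    """
    steps_per_day = SECONDS_PER_DAY / (iteration_time_us * 1e-6)
    return steps_per_day * timestep_fs * 1e-6      # fs -> ns


if __name__ == "__main__":
    for name, t_us in [("CPU 1-core", 71_300), ("GPU", 402), ("Design6", 122)]:
        print(f"{name}: {ns_per_day(t_us):.2f} ns/day")
    print(f"Design6 vs GPU speedup: {402 / 122:.1f}x")
```

Comparing iteration times directly, Design 6 is roughly 3.3x faster than the GPU entry in the table.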
▪ Design 1: memory bottleneck: the loading time to fill the input cache cannot be hidden by computation
▪ Design 2: distributed cell memory alleviates the bandwidth pressure
▪ Design 3 & 4: each cell has fewer particles (80) than the number of pipelines, so some pipelines are idle
▪ Design 3: also suffers from the bandwidth limit
▪ Design 5 & 6: similar performance: no memory bottleneck, same number of pipelines
▪ Design 5: overhead on reading out the first set of input data; later reads are hidden by computation time
#Pipe under Distribution 2: Design 3: 95, Design 4: 60
Future Work on MD: Full Evaluation
▪ What we achieved
  ▪ Particle Pair Generation
  ▪ Short-Range Pipeline
  ▪ Motion Update
▪ To be done
  ▪ Long-Range Part
  ▪ The O(N) part:
    ▪ Bonded Force
    ▪ Angle, torsion, etc.
$F_i^{total} = F^{bond} + F^{angle} + F^{torsion} + F^{H} + F^{non\text{-}bonded}$
First four terms: generally O(N), performed on host. Non-bonded term: initially O(n²), performed on coprocessor.
Future Work: Multi-FPGA Communication Pattern
▪ Each FPGA handles a single cell:
  ▪ Exchange with 26 neighbor nodes
▪ Newton's 3rd Law -> Half Shell Method*
  ▪ Exchange with 13 neighbor nodes
▪ Data movement:
  ▪ Export position data (multicast)
  ▪ Import and sum partial forces (reduction)
* D. Shaw, A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions, Journal of Computational Chemistry, 2005
Future Work: Mapping Cell onto Multi-FPGA
▪ Mapping cells to FPGA*
  ▪ Case 1: Single cell per FPGA
    ▪ Multicasting to 13 neighbor nodes in 3D (single cell)
  ▪ Case 2: Multiple cells per FPGA
    ▪ Multicasting to 13 neighbor nodes in 3D (multiple cells)
  ▪ Case 3: Single cell on multiple FPGAs
    ▪ Multicasting to 62 neighboring nodes (partial cell)
[Figure: nodes and cells, showing the home cell and the neighbor cells within Rc.]
* B. Towles, Unifying on-chip and inter-node switching within the Anton 2 network, Computer Architecture News, 2014
Thank you!
Q & A