ANTON
D. E. Shaw Research
Force Fields: Typical Energy Functions
$$
U \;=\; \sum_{\text{bonds}} \tfrac{1}{2} k_r (r - r_0)^2
\;+\; \sum_{\text{angles}} \tfrac{1}{2} k_\theta (\theta - \theta_0)^2
\;+\; \sum_{\text{torsions}} \tfrac{V_n}{2}\bigl[1 + \cos(n\phi - \delta)\bigr]
\;+\; \sum_{\text{improper}} V_{\text{improper torsion}}
\;+\; \sum_{i<j} \frac{q_i q_j}{\varepsilon\, r_{ij}}
\;+\; \sum_{i<j} \left[\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right]
$$
Bond stretches
Angle bending
Torsional rotation
Improper torsion (sp2)
Electrostatic interaction
Lennard-Jones interaction
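For illustration only (not DESRES code), a minimal Python sketch of a few of these terms; the parameter names (`k_r`, `V_n`, `A_ij`, `B_ij`, and the dielectric constant `eps`) are placeholders:

```python
import math

def bond_energy(k_r, r, r0):
    """Harmonic bond-stretch term: (1/2) k_r (r - r0)^2."""
    return 0.5 * k_r * (r - r0) ** 2

def torsion_energy(V_n, n, phi, delta):
    """Torsional term: (V_n / 2) [1 + cos(n*phi - delta)]."""
    return 0.5 * V_n * (1.0 + math.cos(n * phi - delta))

def nonbonded_pair_energy(q_i, q_j, A_ij, B_ij, r_ij, eps=1.0):
    """Electrostatic + Lennard-Jones energy for one atom pair at distance r_ij."""
    coulomb = q_i * q_j / (eps * r_ij)
    lj = A_ij / r_ij ** 12 - B_ij / r_ij ** 6
    return coulomb + lj
```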
Molecular Dynamics
• Solve Newton’s equation for a molecular system:
$$F_i = m_i a_i = m_i \frac{d^2 \mathbf{r}_i}{dt^2}$$
Integrator: Verlet Algorithm
Start with {r(t), v(t)} and integrate to {r(t+Δt), v(t+Δt)}:

The new position at t+Δt:
$$r(t+\Delta t) = r(t) + v(t)\,\Delta t + \tfrac{1}{2} a(t)\,\Delta t^2 + O(\Delta t^3) \quad (1)$$

Similarly, the old position at t-Δt:
$$r(t-\Delta t) = r(t) - v(t)\,\Delta t + \tfrac{1}{2} a(t)\,\Delta t^2 - O(\Delta t^3) \quad (2)$$

Add (1) and (2):
$$r(t+\Delta t) = 2\,r(t) - r(t-\Delta t) + a(t)\,\Delta t^2 + O(\Delta t^4) \quad (3)$$

Thus the velocity at t is:
$$v(t) = \frac{r(t+\Delta t) - r(t-\Delta t)}{2\,\Delta t} + O(\Delta t^2) \quad (4)$$
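A minimal sketch of this recurrence in Python, using a toy harmonic force with unit mass (not Anton's integrator code):

```python
import numpy as np

def verlet_step(r, r_prev, a, dt):
    """One step of equations (3) and (4):
    r(t+dt) = 2 r(t) - r(t-dt) + a(t) dt^2,
    v(t)    = (r(t+dt) - r(t-dt)) / (2 dt)."""
    r_next = 2.0 * r - r_prev + a * dt ** 2
    v = (r_next - r_prev) / (2.0 * dt)
    return r_next, v

# Toy usage: one particle in a harmonic well, F = -k r, m = 1.
k, dt = 1.0, 0.01
r = np.array([1.0, 0.0, 0.0])            # r(t)
r_prev = r + 0.5 * (-k * r) * dt ** 2    # r(t - dt) from eq. (2) with v(t) = 0
for _ in range(1000):
    a = -k * r                           # acceleration from the force field
    r_next, v = verlet_step(r, r_prev, a, dt)
    r_prev, r = r, r_next
```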
Molecular Dynamics
Iterate... and iterate
Integrate Newton’s laws of motion
Two Distinct Problems
Problem 1: Simulate many short trajectories
Problem 2: Simulate one long trajectory
Simulating Many Short Trajectories
Can answer a surprising number of interesting questions
Can be done using:
– Many slow computers
– Distributed processing approach
– Little inter-processor communication
E.g., Pande's Folding@home project
Simulating One Long Trajectory
Harder problem
Essential to elucidate many biologically interesting processes
Requires a single machine with:
– Extremely high performance
– Truly massive parallelism
– Lots of inter-processor communication
DESRES Goal
Single, millisecond-scale MD simulations (long trajectories)
– Protein with 64K or more atoms
– Explicit water molecules
Why?
– That’s the time scale at which many biologically interesting things start to happen
Image: Istvan Kolossvary & Annabel Todd, D. E. Shaw Research
Protein Folding
What Will It Take to Simulate a Millisecond?
We need an enormous increase in speed
– Current (single processor): ~100 ms / fs
– Goal will require < 10 µs / fs
Required speedup:
> 10,000 x faster than current single-processor speed
~ 1,000x faster than current parallel implementations
Can’t accept >10,000x the power (~5 Megawatts)!
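As a rough sanity check on these figures (assuming a ~1 fs integration timestep, so a millisecond of simulated time is about 10^12 steps, consistent with the "repeat ~10^12 times" figure on the next slide):

$$
10^{12}\ \text{steps} \times 100\ \tfrac{\text{ms}}{\text{step}} \approx 10^{11}\ \text{s} \approx 3000\ \text{years},
\qquad
10^{12}\ \text{steps} \times 10\ \tfrac{\mu\text{s}}{\text{step}} \approx 10^{7}\ \text{s} \approx 4\ \text{months}.
$$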
What Takes So Long?
Inner loop of force field evaluation looks at all pairs of atoms (within distance R)
On the order of 64K atoms in typical system
Repeat ~10^12 times
Current approaches too slow by several orders of magnitude
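For illustration, a naive version of that inner loop (a sketch only; `pair_force` is a hypothetical stand-in for the electrostatic + Lennard-Jones kernel, and production codes use neighbor lists rather than this O(N^2) scan):

```python
import numpy as np

def nonbonded_forces(pos, cutoff, pair_force):
    """Visit every atom pair within the cutoff radius R and accumulate forces."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            dr = pos[j] - pos[i]
            r = np.linalg.norm(dr)
            if r < cutoff:
                f = pair_force(r) * dr / r   # force on atom i due to atom j
                forces[i] += f
                forces[j] -= f               # Newton's third law
    return forces
```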
What can be done?
Parallelization (getting an idea of the level of computation needed)
• At every time step, each atom must exchange data with every other atom within its cut-off radius.
• A large amount of inter-processor communication that scales well is needed.
MD Simulator Requirements
Parallelization (getting an idea of the level of computation needed)
• The whole system is broken down into boxes (processing nodes); a sketch of this decomposition follows below.
• Each node handles the bonded interactions within its box.
• NT method for non-bonded interactions (which are much more common).
• NT method for atom migration.
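A minimal sketch of the home-box decomposition idea (illustration only; Anton's actual NT method assigns pair computations to "neutral territory" nodes to reduce communication, which this simple binning does not capture):

```python
import numpy as np

def assign_to_boxes(pos, box_len, n_boxes_per_dim):
    """Map each atom to the box (processing node) that owns its region of space."""
    cell = box_len / n_boxes_per_dim
    # Integer (x, y, z) box index per atom, wrapped for periodic boundaries.
    idx = np.floor(pos / cell).astype(int) % n_boxes_per_dim
    boxes = {}
    for atom_id, (ix, iy, iz) in enumerate(idx):
        boxes.setdefault((int(ix), int(iy), int(iz)), []).append(atom_id)
    return boxes
```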
MD Simulator Requirements
• 1) Need a huge number of arithmetic processing elements
• 2) A lot of inter-processor communication that scales well
• 3) Memory is not an issue: with 25,000 atoms (64 bytes each), the total is 1.6 MB; spread over 512 nodes that is 3.2 KB/node, which is less than most L1 caches
Why Specialized Hardware?
[Diagram: the three needs of an MD machine – Memory, Communication, Computation]
• Consider Moore's Law: ~10X improvement in 5 years vs. Anton's 1000X in 1 year.
• Can great discoveries wait?
• Can use custom pipelines with more precision and faster datapath logic over less silicon area
• Tailored ISAs for geometric calculations
• Programmability to accommodate various force fields and integration algorithms
• Dedicated memory for each particle to accumulate forces
ANTON Strategy
New architectures
– Design a specialized machine
– Enormously parallel architecture
– Based on special-purpose ASICs
– Dramatically faster for MD, but less flexible
– Projected completion: 2008
New algorithms
– Applicable to
• Conventional clusters
• Our own machine
– Scale to very large # of processing elements
Interdisciplinary Lab
Computational Chemists and Biologists
Computer Scientists and Applied Mathematicians
Computer Architects and Engineers
Alternative Machine Architectures
Conventional cluster of commodity processors
General-purpose scientific supercomputer
Special-purpose molecular dynamics machine
Conventional Cluster of Commodity Processors
Strengths:
– Flexibility
– Mass market economies of scale
Limitations:
– Doesn’t exploit special features of the problem
– Communication bottlenecks
• Between processor and memory
• Among processors
– Insufficient arithmetic power
General-Purpose Scientific Supercomputer
E.g., IBM Blue Gene
More demanding goal than ours:
– General-purpose scientific supercomputing
– Fast for a wide range of applications
Strengths:
– Flexibility
– Ease of programmability
Limitations for MD simulations:
– Expensive
– Still not fast enough for our purposes
Anton: Special-Purpose MD Machine
Strengths:
– Several orders of magnitude faster for MD
– Excellent cost/performance characteristics
Limitations:
– Not designed for other scientific applications
• They’d be difficult to program
• Still wouldn’t be especially fast
– Limited flexibility
Anton System-Level Organization
Multiple segments (probably 8 in first machine)
512 nodes (each consisting of one ASIC plus DRAM) per segment
– Organized in an 8 x 8 x 8 toroidal mesh
Each ASIC has performance equivalent to roughly 500 general-purpose microprocessors
– ASIC power similar to a single microprocessor
3D Torus Network
Why a 3D Torus?
Topology reflects the physical space being simulated:
– Three-dimensional nearest-neighbor connections
– Periodic boundary conditions
Bulk of communication is to near neighbors
– No switching needed to reach immediate neighbors
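A small sketch of the wrap-around neighbor links (a hypothetical helper for illustration, not Anton firmware):

```python
def torus_neighbors(node, dims=(8, 8, 8)):
    """Six nearest neighbors of a node in a 3D torus; the modulo gives the
    periodic (wrap-around) links matching the periodic boundary conditions."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),   # +X / -X
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),   # +Y / -Y
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),   # +Z / -Z
    ]

# Node (7, 0, 3) reaches (0, 0, 3) in one +X hop, with no switching in between.
print(torus_neighbors((7, 0, 3)))
```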
Source of Speedup on Our Machine
Judicious use of arithmetic specialization
– Flexibility, programmability only where needed
– Elsewhere, hardware tailored for speed
• Tables and parameters, but not programmable
Carefully choreographed communication
– Data flows to just where it’s needed
– Almost never need to access off-chip memory
Two Subsystems on Each ASIC
Specialized Subsystem
– Pairwise point interactions
– Enormously parallel
– Aggressive clock rate

Flexible Subsystem
– Programmable, general-purpose
– Efficient geometric operations
– Modest clock rate
Anton
• 33M gate ASIC
• Two computational subsystems connected by a communication ring
• Hardware datapaths compute over 25 billion interactions/s
• Full machine has 512 ASICs in a 3D torus
• 13 embedded processors
[ASIC block diagram: Flexible Subsystem and High-Throughput Interaction Subsystem on a communication ring, with memory controllers and DRAM, a host interface to the host processor, and router/channel ports to the +X/-X, +Y/-Y, +Z/-Z torus neighbors]
Where We Use Specialized Hardware
Specialized hardware (with tables, parameters) where:
Inner loop
Simple, regular algorithmic structure
Unlikely to change
Examples:
Electrostatic forces
Van der Waals interactions
Example: Particle Interaction Pipeline (one of 32)
– Executes non-bonded MD interaction calculations (charge spreading & force interpolation)
– Accumulates forces on each particle as data streams through
– The ICB controls the flow of data through the HTIS, provides programmable ISA extensions, and acts as a buffering, pre-fetching, synchronization, and write-back controller
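A toy software model of this streaming accumulation pattern (illustration only; `pair_stream` is a hypothetical stand-in for the HTIS front end, not actual Anton behavior):

```python
import numpy as np

def run_pipeline(pair_stream, n_atoms):
    """Pairs stream in; forces accumulate into per-particle arrays,
    a software stand-in for the pipeline's dedicated accumulation memory."""
    force = np.zeros((n_atoms, 3))
    for i, j, f_ij in pair_stream:    # (atom i, atom j, force on i from j)
        force[i] += f_ij              # accumulate as data streams through
        force[j] -= f_ij              # equal and opposite force on the partner
    return force
```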
High-Throughput Interaction Subsystem
Array of 32 Particle Interaction Pipelines
Advantages of Particle Interaction Pipelines
Save area that would have been allocated to
– Cache
– Control logic
– Wires
Achieve extremely high arithmetic density
Save time that would have been spent on
– Cache misses
– Load/store instructions
– Misc. data shuffling
Where We Use Flexible Hardware
– Use programmable hardware where:
• Algorithm less regular
• Smaller % of total computation
- E.g., local interactions (fewer of them)
• More likely to change
– Examples:
• Bonded interactions
• Bond length constraints
• Experimentation with
- New, short-range force field terms
- Alternative integration techniques
Forms of Parallelism in Flexible Subsystem
The Flexible Subsystem exploits three forms of parallelism:
– Multi-core parallelism (4 Tensilicas, 8 Geometry Cores)
– Instruction-level parallelism
– SIMD parallelism – calculate on 3D and 4D vectors as single operation
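As a software analogy only (not Geometry Core code), a single lane-wise operation on an (X, Y, Z, W) vector replaces four scalar instructions:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 1.0])   # (X, Y, Z, W)
b = np.array([0.5, 0.5, 0.5, 0.0])
c = np.array([0.0, 1.0, 0.0, 1.0])
result = a * b + c                    # one SIMD-style multiply-add across all four lanes
```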
Overview of the Flexible Subsystem
GC = Geometry Core (each a VLIW processor)
Geometry Core (one of 8; 64 pipelined lanes/chip)
[Datapath diagram: program counter, instruction memory, and decode; input from the Tensilica core; X/Y/Z/W vector lanes of multipliers and adders; data memory]
But Communication is Still a Bottleneck
Scalability limited by inter-chip communication
To execute a single millisecond-scale simulation,
– Need a huge number of processing elements
– Must dramatically reduce amount of data transferred between these processing elements
Can’t do this without fundamentally new algorithms:
– A family of Neutral Territory (NT) methods that reduce pair interaction communication load significantly
– A new variant of the Ewald method for distant interactions, Gaussian Split Ewald (GSE), which simplifies calculation and communication
– These are the subject of a different talk.
Anton in Action
• ~500X faster than NAMD, 80-100X faster than Desmond, ~100X faster than Blue Matter
Simulation Evaluations
GPU + FPGA ???
[Diagram: GPU with 6× GDDR5 and 16× PCIe lanes, linked over LVDS / high-speed serial I/O (up to 2 Tbit/s) to an FPGA handling the FFT and LJ calculations]