Scalability and interoperable libraries in NAMD
Laxmikant (Sanjay) Kale
Theoretical Biophysics Group
and
Department of Computer Science
University of Illinois at Urbana-Champaign
Contributors
• PIs: Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD 1: Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
• NAMD2: M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, …
Middle layers
[Diagram: Applications at the top, Parallel Machines at the bottom, connected by the “Middle Layers”: languages, tools, libraries]
Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step:
  – Calculate forces on each atom
    • bonded forces
    • non-bonded forces: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
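A minimal sketch of this time-stepping loop in plain C++ (not NAMD's actual code; Atom, Vec3, and the force-routine names are illustrative assumptions, and the non-bonded routine is filled in on the next slide):

#include <vector>

struct Vec3 { double x, y, z; };

struct Atom {
    Vec3 pos{}, vel{}, force{};
    double mass = 1.0, charge = 0.0;
};

// Stand-ins for NAMD's force computations.
void computeBondedForces(std::vector<Atom>& atoms) { (void)atoms; /* omitted */ }
void computeNonBondedForces(std::vector<Atom>& atoms, double cutoff);  // next sketch

void simulate(std::vector<Atom>& atoms, double dt, long numSteps) {
    for (long step = 0; step < numSteps; ++step) {   // dt ~ 1 fs; millions of steps
        for (Atom& a : atoms) a.force = {0.0, 0.0, 0.0};
        computeBondedForces(atoms);
        computeNonBondedForces(atoms, 12.0);         // the bulk of the work
        for (Atom& a : atoms) {                      // update velocities, advance positions
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
    }
}

NAMD's integrator is more careful (multiple time-stepping, constraints); this only shows the shape of the per-step work.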
Cut-off radius
• Use of a cut-off radius to reduce work: 8 - 14 Å
  – Faraway charges ignored!
• 80-95% of the work is non-bonded force computation
• Some simulations need faraway contributions
  – Periodic systems: Ewald, Particle-Mesh Ewald (PME)
  – Aperiodic systems: FMA
• Even so, cut-off based computations are important:
  – near-atom calculations are part of the above
  – multiple time-stepping is used: k cut-off steps, then 1 PME/FMA step
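Filling in the computeNonBondedForces declared in the previous sketch, with the cutoff test applied to each pair (a simplified all-pairs O(N²) sketch with units folded into the charges; NAMD avoids the all-pairs scan via the spatial decomposition discussed later):

#include <cmath>
#include <vector>

void computeNonBondedForces(std::vector<Atom>& atoms, double cutoff) {
    const double cutoff2 = cutoff * cutoff;          // cutoff ~ 8-14 Angstroms
    for (size_t i = 0; i < atoms.size(); ++i) {
        for (size_t j = i + 1; j < atoms.size(); ++j) {
            double dx = atoms[j].pos.x - atoms[i].pos.x;
            double dy = atoms[j].pos.y - atoms[i].pos.y;
            double dz = atoms[j].pos.z - atoms[i].pos.z;
            double r2 = dx*dx + dy*dy + dz*dz;
            if (r2 > cutoff2) continue;              // faraway pair: ignored
            double r = std::sqrt(r2);
            // Coulomb force magnitude divided by r, so f*d is the force vector:
            double f = atoms[i].charge * atoms[j].charge / (r2 * r);
            atoms[i].force.x -= f * dx;  atoms[i].force.y -= f * dy;  atoms[i].force.z -= f * dz;
            atoms[j].force.x += f * dx;  atoms[j].force.y += f * dy;  atoms[j].force.z += f * dz;  // Newton's 3rd law
        }
    }
}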
Scalability
• The program should scale up to use a large number of processors.
  – But what does that mean?
• An individual simulation isn't truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
Isoefficiency
• Quantifies scalability (work of Vipin Kumar, U. Minnesota)
• How much increase in problem size is needed to retain the same efficiency on a larger machine?
• Efficiency: Seq. Time / (P · Parallel Time)
  – parallel time = computation + communication + idle
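In symbols, following Kumar's standard formulation (W for problem size, i.e. sequential work, and T_o for total overhead are my notation, not the slide's):

E \;=\; \frac{T_{\mathrm{seq}}}{P \, T_{\mathrm{par}}}
  \;=\; \frac{W}{W + T_o(W,P)},
\qquad
T_o(W,P) \;=\; P \, T_{\mathrm{par}} - W .

Holding E fixed requires W = \frac{E}{1-E} \, T_o(W,P); the isoefficiency function is how fast W must grow with P to keep this balance.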
Traditional Approaches
• Replicated Data:
  – All atom coordinates stored on each processor
  – Non-bonded forces distributed evenly
  – Analysis: assume N atoms, P processors
    • Computation: O(N/P)
    • Communication: O(N log P)
    • Communication/Computation ratio: (N log P) / (N/P) = O(P log P)
  – Fraction of communication increases with the number of processors, independent of problem size!
  – So, not scalable by this definition
Atom decomposition
• Partition the Atoms array across processors
  – Nearby atoms may not be on the same processor
  – Communication: O(N) per processor
  – Communication/Computation ratio: O(N) / (N/P) = O(P)
  – Again, not scalable by our definition
Force Decomposition
• Distribute the force matrix to processors
  – Matrix is sparse and non-uniform
  – Each processor has one block
  – Communication: O(N/√P) per processor
  – Communication/Computation ratio: O(√P)
• Better scalability in practice (can use 100+ processors)
  – Plimpton
  – Hwang, Saltz, et al: 6% on 32 PEs, 36% on 128 processors
  – Yet not scalable in the sense defined here!
Spatial Decomposition
• Allocate close-by atoms to the same processor
• Three variations possible:
  – Partitioning into P boxes, 1 per processor
    • Good scalability, but hard to implement
  – Partitioning into fixed-size boxes, each a little larger than the cutoff distance
  – Partitioning into smaller boxes
• Communication: O(N/P), so scalable in principle
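A sketch of the second variation: each atom is owned by the cubic box containing it, with the box side at least the cutoff (plain C++, reusing Vec3 from the earlier sketch; BoxIndex and ownerBox are my names):

#include <cmath>

struct BoxIndex { int ix, iy, iz; };

// Map an atom's position to the box (patch) that owns it. With the
// box side >= cutoff, an atom can only interact with atoms in its
// own box and the 26 surrounding boxes.
BoxIndex ownerBox(const Vec3& pos, double boxSide) {
    return { static_cast<int>(std::floor(pos.x / boxSide)),
             static_cast<int>(std::floor(pos.y / boxSide)),
             static_cast<int>(std::floor(pos.z / boxSide)) };
}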
Spatial Decomposition in NAMD
• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but for a fixed-size system, load balancing problems
• For midsize systems, got good speedups up to 16 processors…
• Use the symmetry of Newton's 3rd law to facilitate load balancing
Spatial Decomposition
But the load balancing problems are still severe:
[Figure]
FD + SD
• Now, we have many more objects to load balance:
  – Each diamond can be assigned to any processor
  – Number of diamonds (3D): 14 · (number of patches)
    • each patch needs one force object for itself plus, using Newton's 3rd law, one for only half of its 26 neighbors: 1 + 13 = 14
Bond Forces
• Multiple types of forces:
  – Bonds (2 atoms), Angles (3), Dihedrals (4), …
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – send a message to all neighbors,
  – receive forces from them
  – 26 × 2 messages per patch!
Bonded Forces
• Assume one patch per processor:
  – an angle force involving atoms in patches
    (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
    is calculated in patch (max{xi}, max{yi}, max{zi})
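This rule picks a unique owner for each bonded term without any negotiation. A sketch (plain C++, reusing BoxIndex from the earlier sketch; angleOwner is my name, not NAMD's):

#include <algorithm>

// The patch with the coordinate-wise maximum index among the three
// atoms' patches computes the angle force, so each term is computed
// exactly once, and only neighboring patches are ever involved.
BoxIndex angleOwner(const BoxIndex& a, const BoxIndex& b, const BoxIndex& c) {
    return { std::max({a.ix, b.ix, c.ix}),
             std::max({a.iy, b.iy, c.iy}),
             std::max({a.iz, b.iz, c.iz}) };
}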
Implementation
• Multiple objects per processor
  – Different types: patches, pairwise forces, bonded forces, …
  – Each may have its data ready at different times
  – Need ability to map and remap them
  – Need prioritized scheduling
• Charm++ supports all of these
Charm++
• Parallel C++ with data-driven objects
• Object Groups:
  – a global object with a “representative” on each PE
• Asynchronous method invocation
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
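A minimal flavor of Charm++'s asynchronous method invocation (a sketch under my own naming, not NAMD code): entry methods are declared in a .ci interface file, and invoking one on a proxy enqueues a message rather than calling the method directly.

// sketch.ci -- Charm++ interface file (all names made up)
mainmodule sketch {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg* m);
    entry void done();
  };
  chare Worker {
    entry Worker();
    entry void work(int step);
  };
};

// sketch.C -- implementation
#include "sketch.decl.h"

CProxy_Main mainProxy;                         // readonly: set once at startup

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Worker w = CProxy_Worker::ckNew();  // create a chare; the RTS picks its PE
    w.work(0);   // asynchronous: enqueues a message and returns immediately
  }
  void done() { CkExit(); }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  void work(int step) {
    CkPrintf("work(%d) on PE %d\n", step, CkMyPe());
    mainProxy.done();                          // also asynchronous
  }
};

#include "sketch.def.h"

The charmc wrapper translates the .ci file into the generated sketch.decl.h / sketch.def.h headers included above.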
Data driven execution
[Diagram: one scheduler per processor, each pulling work off its message queue and dispatching it to the objects on that processor]
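Conceptually, each processor's scheduler does something like the following (a toy sketch in plain C++, not the actual Converse/Charm++ scheduler):

#include <functional>
#include <queue>

// A message carries a priority and a bound entry-method invocation
// on some object living on this processor.
struct Message {
    int priority;
    std::function<void()> invoke;
    bool operator<(const Message& other) const { return priority < other.priority; }
};

// One scheduler runs per processor: execution is driven by whichever
// object's data (message) becomes available next, in priority order.
void schedulerLoop(std::priority_queue<Message>& messageQ, const bool& quit) {
    while (!quit) {
        if (messageQ.empty()) continue;   // in reality: wait on the network
        Message m = messageQ.top();
        messageQ.pop();
        m.invoke();                       // deliver to the target object
    }
}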
Load Balancing
• A major challenge for this application
  – especially for a large number of processors
• Unpredictable workloads
  – Each diamond (force object) and patch encapsulates a variable amount of work
  – Static estimates are inaccurate
• Measurement-based Load Balancing Framework
  – Robert Brunner's recent Ph.D. thesis
  – Load varies only very slowly across timesteps, so past measurements predict future load
Bipartite graph balancing
• Background load:
  – patches (integration, …) and bond-related forces
• Migratable load:
  – non-bonded forces
  – bond-related forces involving atoms of the same patch
• Bipartite communication graph between migratable and non-migratable objects
• Challenge: balance load while minimizing communication
Load balancing
• Collect timing data for several cycles
• Run a heuristic load balancer
  – several alternative ones
• Re-map and migrate objects accordingly
  – registration mechanisms facilitate migration
• Needs a separate talk!
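One heuristic in this measurement-based spirit is a greedy re-map: sort migratable objects by measured cost and repeatedly give the heaviest one to the currently least-loaded processor (a sketch under that assumption; it ignores the communication-minimization term and is not Brunner's actual algorithm):

#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct ObjTime { int objId; double measuredTime; };   // from instrumentation

// background[p]: measured non-migratable load on processor p.
// Returns the new owner processor for each migratable object id.
std::vector<int> greedyRemap(std::vector<ObjTime> objs,
                             const std::vector<double>& background) {
    // Heaviest objects first.
    std::sort(objs.begin(), objs.end(),
              [](const ObjTime& a, const ObjTime& b) {
                  return a.measuredTime > b.measuredTime;
              });
    // Min-heap of (current load, processor).
    using Load = std::pair<double, int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int p = 0; p < static_cast<int>(background.size()); ++p)
        heap.push({background[p], p});

    int maxId = 0;
    for (const auto& o : objs) maxId = std::max(maxId, o.objId);
    std::vector<int> owner(maxId + 1, -1);
    for (const auto& o : objs) {
        auto [load, p] = heap.top();      // least-loaded processor so far
        heap.pop();
        owner[o.objId] = p;
        heap.push({load + o.measuredTime, p});
    }
    return owner;
}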
[Two charts: time per processor (Processors vs. Time), each bar split into migratable work and non-migratable work]
Performance: size of system
(Time is per step, in seconds; speedups in parentheses are relative to an estimated sequential time.)

                        Procs:   1      2      4      8      16     32     64     128    160
bR (3,762 atoms)        Time     1.14   0.58   0.315  0.158  0.086  0.048
                        Speedup  1.0    1.97   3.61   7.20   13.2   23.7
ER-ERE (36,573 atoms)   Time            6.115  3.099  1.598  0.810  0.397  0.212  0.123  0.098
                        Speedup         (1.97) 3.89   7.54   14.9   30.3   56.8   97.9   123
ApoA-I (92,224 atoms)   Time                   10.76  5.46   2.85   1.47   0.729  0.382  0.321
                        Speedup                (3.88) 7.64   14.7   28.4   57.3   109    130

Performance data on the Cray T3E.
Performance: various machines
                        Procs:   1      2      4      8      16     32     64     128    160    192
T3E                     Time            6.12   3.10   1.60   0.810  0.397  0.212  0.123  0.098
                        Speedup         (1.97) 3.89   7.54   14.9   30.3   56.8   97.9   123
Origin 2000             Time     8.28   4.20   2.17   1.07   0.542  0.271  0.152
                        Speedup  1.0    1.96   3.80   7.74   15.3   30.5   54.3
ASCI Red                Time     28.0   13.9   7.24   3.76   1.91   1.01   0.500  0.279  0.227  0.196
                        Speedup  1.0    2.01   3.87   7.45   14.7   27.9   56.0   100    123    143
NOWs (HP 735/125)       Time     24.1   12.4   6.39   3.69
                        Speedup  1.0    1.94   3.77   6.54
Speedup
[Plot: speedup vs. processors (up to ~240), measured speedup against perfect speedup]
Recent Speedup Results: ASCI Red
[Plot: “Speedup on ASCI Red: Apo-A1” — speedup vs. processors, up to about 1200 processors]
Recent Results on Linux Cluster
[Plot: “Speedup on Linux Cluster” — speedup vs. processors, up to about 120 processors]
Recent Results on Origin 2000
[Plot: “Performance on Origin 2000” — speedup vs. processors, up to about 120 processors]
Multi-paradigm programming
• Long-range electrostatic interactions
  – Some simulations require this
  – Contributions of faraway atoms can be calculated infrequently
  – PVM-based library, DPMTA
    • developed at Duke by John Board et al.
• Patch life cycle
  – better expressed as a thread
Converse
• Supports multi-paradigm programming
• Provides portability
• Makes it easy to implement runtime systems for new paradigms
• Several languages/libraries:
  – Charm++, threaded MPI, PVM, Java, md-perl, pc++, Nexus, Path, Cid, CC++, DP, Agents, …
Namd2 with Converse
NAMD2
• In production use
  – internally for about a year
  – several simulations completed/published
• Fastest MD program? We think so
• Modifiable/extensible
  – Steered MD
  – Free energy calculations
Real Application for CS research?
• Benefits:
  – Subtle and complex research problems uncovered only with a real application
  – Satisfaction of a “real,” concrete contribution
  – With careful planning, you can truly enrich the “middle layers”
  – Bring back a rich variety of relevant CS problems
  – Apply to other domains: Rockets? Casting?