scalable molecular dynamics for large biomolecular systems
DESCRIPTION
Scalable Molecular Dynamics for Large Biomolecular Systems. Robert Brunner, James C Phillips, Laxmikant Kale. Department of Computer Science and Theoretical Biophysics Group, University of Illinois at Urbana Champaign. Parallel Computing with Data-driven Objects. Laxmikant (Sanjay) Kale.
TRANSCRIPT
1
Scalable Molecular Dynamics for Large Biomolecular Systems
Robert Brunner
James C Phillips
Laxmikant Kale
Department of Computer Science
and
Theoretical Biophysics Group
University of Illinois at Urbana Champaign
2
Parallel Computing with Data-driven Objects
Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science
http://charm.cs.uiuc.edu
3
Overview
• Context: approach and methodology
• Molecular dynamics for biomolecules
• Our program, NAMD
  – Basic parallelization strategy
• NAMD performance optimizations
  – Techniques
  – Results
• Conclusions: summary, lessons and future work
4
Parallel Programming Laboratory
• Objective: Enhance performance and productivity in parallel programming
  – For complex, dynamic applications
  – Scalable to thousands of processors
• Theme:
  – Adaptive techniques for handling dynamic behavior
• Strategy: Look for optimal division of labor between human programmer and the “system”
  – Let the programmer specify what to do in parallel
  – Let the system decide when and where to run them
• Data driven objects as the substrate: Charm++
5
System Mapped Objects
[Figure: numbered objects and the processors the system maps them to]
6
Data Driven Execution
[Figure: two Scheduler and Message Q pairs]
7
Charm++
• Parallel C++ with data driven objects
• Object Arrays and collections
• Asynchronous method invocation
• Object Groups:
  – global object with a “representative” on each PE
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
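Asynchronous method invocation with prioritized scheduling can be illustrated with a toy model. The following is a minimal Python sketch, not Charm++ code: the names `Scheduler`, `send`, `run` and `Recorder` are hypothetical, and the convention that a lower number means higher priority is an assumption made for the example.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler: a priority queue of pending method invocations."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker: FIFO order within one priority

    def send(self, obj, method, *args, priority=0):
        # Asynchronous invocation: enqueue the message and return immediately.
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        # Data-driven execution: repeatedly pick the most urgent message (lowest
        # number) and invoke the named method on its target object.
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Recorder:
    """Hypothetical target object that records the order in which it is invoked."""

    def __init__(self):
        self.log = []

    def note(self, text):
        self.log.append(text)


sched = Scheduler()
rec = Recorder()
sched.send(rec, "note", "low", priority=5)
sched.send(rec, "note", "urgent", priority=1)
sched.send(rec, "note", "also low", priority=5)
sched.run()
print(rec.log)
```

The sender never waits for the method to run; the scheduler alone decides the execution order, which is what lets a runtime overlap work from many objects on one processor.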
8
Multi-partition Decomposition
• Writing applications with Charm++
  – Decompose the problem into a large number of chunks
  – Implement chunks as objects
    • Or, now, as MPI threads (AMPI on top of Charm++)
• Let Charm++ map and remap objects
  – Allow for migration of objects
  – If desired, specify potential migration points
9
Load Balancing Mechanisms
• Re-map and migrate objects
  – Registration mechanisms facilitate migration
  – Efficient message delivery strategies
  – Efficient global operations
    • Such as reductions and broadcasts
• Several classes of load balancing strategies provided
  – Incremental
  – Centralized as well as distributed
  – Measurement based
10
Principle of Persistence
• An observation about CSE applications
  – Extension of principle of locality
  – Behavior of objects, including computational load and communication patterns, tends to persist over time
• Application induced imbalance:
  – Abrupt, but infrequent, or
  – Slow, cumulative
  – Rarely: frequent, large changes
    • Our framework still deals with this case as well
• Measurement based strategies
11
Measurement-Based Load Balancing Strategies
• Collect timing data for several cycles
• Run heuristic load balancer
  – Several alternative ones
• Robert Brunner’s recent Ph.D. thesis:
  – Instrumentation framework
  – Strategies
  – Performance comparisons
12
Molecular Dynamics
ApoA-I: 92k Atoms
13
Molecular Dynamics and NAMD
• MD is used to understand the structure and function of biomolecules
  – Proteins, DNA, membranes
• NAMD is a production-quality MD program
  – Active use by biophysicists (published science)
  – 50,000+ lines of C++ code
  – 1000+ registered users
  – Features include:
    • CHARMM and XPLOR compatibility
    • PME electrostatics and multiple timestepping
    • Steered and interactive simulation via VMD
14
NAMD Contributors
• PIs:
  – Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD Version 1:
  – Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
• NAMD2:
  – M. Bhandarkar, R. Brunner, Justin Gullingsrud, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ..
Theoretical Biophysics Group, supported by NIH
15
Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
  – Calculate forces on each atom
    • Bonds
    • Non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
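The per-step loop above (compute forces, update velocities, advance positions) can be sketched in a few lines. The slides do not name an integrator, so velocity Verlet is assumed here as a common choice; `velocity_verlet_step` and `spring_force` are hypothetical names, and a one-dimensional spring stands in for the real bonded and non-bonded forces. At 1 fs per step, one nanosecond of simulated time takes a million iterations of this loop.

```python
def velocity_verlet_step(pos, vel, mass, force_fn, dt):
    """Advance one time-step; pos, vel and mass are lists of floats (1-D for brevity)."""
    f_old = force_fn(pos)
    # Advance positions using the current velocities and forces.
    new_pos = [x + v * dt + 0.5 * (f / m) * dt * dt
               for x, v, f, m in zip(pos, vel, f_old, mass)]
    f_new = force_fn(new_pos)
    # Update velocities with the average of the old and new forces.
    new_vel = [v + 0.5 * (fo + fn) / m * dt
               for v, fo, fn, m in zip(vel, f_old, f_new, mass)]
    return new_pos, new_vel


def spring_force(pos, k=1.0):
    # Stand-in for bonded + non-bonded forces: each particle is tied to the origin.
    return [-k * x for x in pos]


pos, vel, mass = [1.0], [0.0], [1.0]
for _ in range(1000):
    pos, vel = velocity_verlet_step(pos, vel, mass, spring_force, dt=0.01)

# Total energy of the toy oscillator; it should stay close to its initial value 0.5.
energy = 0.5 * mass[0] * vel[0] ** 2 + 0.5 * pos[0] ** 2
print(pos[0], energy)
```

In a real MD code the force evaluation dominates the cost of each step, which is why the rest of the talk concentrates on parallelizing it.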
16
Cut-off Radius
• Use of cut-off radius to reduce work
  – 8 - 14 Å
  – Far away atoms ignored! (screening effects)
• 80-95 % work is non-bonded force computations
• Some simulations need faraway contributions
  – Particle-Mesh Ewald (PME)
• Even so, cut-off based computations are important:
  – Near-atom calculations constitute the bulk of the above
  – Multiple time-stepping is used: k cut-off steps, 1 PME
    • So, (k-1) steps do just cut-off based simulation
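The multiple time-stepping schedule can be made concrete with a small sketch. It assumes the PME evaluation falls on the first step of every cycle of k steps; the slide does not say which step of the cycle carries it, and `step_kinds` is a hypothetical helper.

```python
def step_kinds(num_steps, k):
    """Label each time-step: one step per cycle of k adds PME, the rest are cut-off only."""
    return ["cutoff+PME" if step % k == 0 else "cutoff" for step in range(num_steps)]


kinds = step_kinds(num_steps=8, k=4)
pme_steps = kinds.count("cutoff+PME")
print(kinds)
print(f"{pme_steps} of {len(kinds)} steps evaluate PME")
```

With k = 4, three out of every four steps do only cut-off work, so the scalability of the cut-off computation still governs most of the run.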
19
Early methods
• Atom replication:
  – Each processor has data for all atoms
  – Force calculations parallelized
    • Collection of forces: O(N log P) communication
  – Computation: O(N/P)
  – Communication/computation ratio: O(P log P): Not Scalable
• Atom Decomposition
  – Partition the atoms array across processors
    • Nearby atoms may not be on the same processor
  – Communication: O(N) per processor
  – Ratio: O(N) / (N / P) = O(P): Not Scalable
20
Force Decomposition
• Distribute force matrix to processors
  – Matrix is sparse, non uniform
  – Each processor has one block
  – Communication: O(N/√P)
  – Ratio: O(√P)
• Better scalability in practice
  – Can use 100+ processors
  – Plimpton:
  – Hwang, Saltz, et al:
    • 6% on 32 processors
    • 36% on 128 processors
  – Yet not scalable in the sense defined here!
21
Spatial Decomposition
• Allocate close-by atoms to the same processor
• Three variations possible:
  – Partitioning into P boxes, 1 per processor
    • Good scalability, but hard to implement
  – Partitioning into fixed size boxes, each a little larger than the cut-off distance
  – Partitioning into smaller boxes
• Communication: O(N/P)
  – Communication/Computation ratio: O(1)
  – So, scalable in principle
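The second variation (fixed-size boxes slightly larger than the cut-off) can be sketched as a simple binning step. This is an illustrative sketch rather than NAMD's code: `assign_patches` is a hypothetical name, the simulation box is assumed cubic with coordinates in [0, box_length], and box_length is assumed to be at least the cut-off.

```python
import math
from collections import defaultdict


def assign_patches(positions, box_length, cutoff):
    """Bin atoms into cubic patches whose side is at least the cut-off distance."""
    per_dim = max(1, math.floor(box_length / cutoff))  # patches along each axis
    side = box_length / per_dim                        # >= cutoff when box_length >= cutoff
    patches = defaultdict(list)
    for atom, (x, y, z) in enumerate(positions):
        # min(...) keeps an atom sitting exactly on the upper boundary in the last patch.
        index = tuple(min(int(c / side), per_dim - 1) for c in (x, y, z))
        patches[index].append(atom)
    return dict(patches), side


positions = [(1.0, 1.0, 1.0), (2.0, 1.5, 0.5), (35.0, 35.0, 35.0), (13.0, 1.0, 1.0)]
patches, side = assign_patches(positions, box_length=36.0, cutoff=12.0)
print(side, patches)
```

Because each patch is at least one cut-off wide, an atom can only interact with atoms in its own patch or in the immediately adjacent ones, which is what bounds the communication per patch.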
22
Ongoing work
• Plimpton, Hendrickson:
  – new spatial decomposition
• NWChem (PNL)
• Peter Kollman, Yong Duan et al:
  – microsecond simulation
  – AMBER version (SANDER)
23
Spatial Decomposition in NAMD
But the load balancing problems are still severe
24
Hybrid Decomposition
25
FD + SD
• Now, we have many more objects to load balance:
  – Each diamond can be assigned to any processor
  – Number of diamonds (3D):
    • 14·Number of Patches
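One way to arrive at the factor of 14 is to count, for each patch, one self-interaction object plus one object for half of its 26 neighbours, so that every neighbouring pair is covered exactly once. The sketch below checks that count; it assumes periodic boundaries and at least three patches per dimension, and `compute_objects` is a hypothetical helper, not NAMD's own enumeration.

```python
from itertools import product


def compute_objects(nx, ny, nz):
    """One compute per patch (self) and one per unordered pair of neighbouring patches."""
    # Keep the 13 lexicographically positive offsets out of the 26 neighbour offsets,
    # so that each pair of neighbouring patches is listed only once.
    half_offsets = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]
    computes = []
    for x, y, z in product(range(nx), range(ny), range(nz)):
        computes.append(((x, y, z), (x, y, z)))  # self-interaction
        for dx, dy, dz in half_offsets:
            # Periodic wrap-around is an assumption of this sketch.
            neighbour = ((x + dx) % nx, (y + dy) % ny, (z + dz) % nz)
            computes.append(((x, y, z), neighbour))
    return computes


computes = compute_objects(4, 4, 4)
print(len(computes), "computes for", 4 * 4 * 4, "patches")
```

Having 14 migratable objects per patch, rather than one, is what gives the load balancer room to even out the work.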
26
Bond Forces
• Multiple types of forces:
  – Bonds(2), angles(3), dihedrals (4), ..
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – Send message to all neighbors,
  – receive forces from them
  – 26*2 messages per patch!
27
Bond Forces
• Assume one patch per processor:
  – An angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in patch: (max{xi}, max{yi}, max{zi})
[Figure: patches labelled A, B and C]
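The ownership rule for bonded terms can be written as a one-line function. This is a sketch under the slide's assumption of one patch per processor; `owner_patch` is a hypothetical name, and periodic wrap-around at the box edges is ignored.

```python
def owner_patch(patch_coords):
    """Patch that computes a bonded term: component-wise maximum of the atoms' patch indices."""
    xs, ys, zs = zip(*patch_coords)
    return (max(xs), max(ys), max(zs))


# Patch indices of the three atoms of one angle term (illustrative values).
angle = [(2, 3, 1), (3, 3, 1), (3, 2, 1)]
owner = owner_patch(angle)
print(owner)
```

Every patch that applies the same rule arrives at the same owner, so each bonded term is computed exactly once without any negotiation between patches.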
28
NAMD Implementation
• Multiple objects per processor
  – Different types: patches, pairwise forces, bonded forces
  – Each may have its data ready at different times
  – Need ability to map and remap them
  – Need prioritized scheduling
• Charm++ supports all of these
29
Load Balancing
• Is a major challenge for this application
  – Especially for a large number of processors
• Unpredictable workloads
  – Each diamond (force “compute” object) and patch encapsulate variable amount of work
  – Static estimates are inaccurate
  – Very slow variations across timesteps
• Measurement-based load balancing framework
[Figure: a Compute object and two Cell (patch) objects]
31
Load Balancing Strategy
Greedy variant (simplified):
  Sort compute objects (diamonds)
  Repeat (until all assigned)
    S = set of all processors that:
      -- are not overloaded
      -- generate least new communication
    P = least loaded {S}
    Assign heaviest compute to P
Refinement:
  Repeat
    - Pick a compute from the most overloaded PE
    - Assign it to a suitable underloaded PE
  Until (No movement)
[Figure: a Compute object between two Cells]
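The two phases above can be sketched in Python. This is one reading of the simplified pseudocode, not NAMD's load balancer: the overload threshold (1.1 times the average load), the measure of "new communication" (number of patches a processor does not already hold), the choice of the heaviest compute that fits during refinement, and the names `greedy_assign` and `refine` are all assumptions made for the example.

```python
def greedy_assign(computes, background, home, overload=1.1):
    """Greedy pass. computes: {name: (load, patches)}; background: {pe: fixed load};
    home: {pe: patches already resident}. Returns (placement, load, limit)."""
    load = dict(background)
    present = {pe: set(home.get(pe, ())) for pe in load}
    total = sum(load.values()) + sum(work for work, _ in computes.values())
    limit = overload * total / len(load)  # a PE above this counts as overloaded
    placement = {}
    # Heaviest compute first.
    for name, (work, patches) in sorted(computes.items(), key=lambda kv: -kv[1][0]):
        fits = [pe for pe in load if load[pe] + work <= limit] or list(load)
        new_comm = {pe: len(set(patches) - present[pe]) for pe in fits}
        least = min(new_comm.values())
        candidates = [pe for pe in fits if new_comm[pe] == least]  # the set S
        target = min(candidates, key=lambda pe: load[pe])          # least loaded in S
        placement[name] = target
        load[target] += work
        present[target] |= set(patches)
    return placement, load, limit


def refine(placement, load, computes, limit):
    """Refinement pass: move computes off the most loaded PE while it exceeds the limit."""
    while True:
        donor = max(load, key=load.get)
        if load[donor] <= limit:
            return placement, load
        target = min(load, key=load.get)
        # Try the donor's computes from heaviest to lightest; keep those that fit.
        on_donor = sorted((n for n, pe in placement.items() if pe == donor),
                          key=lambda n: -computes[n][0])
        movable = [n for n in on_donor if load[target] + computes[n][0] <= limit]
        if target == donor or not movable:
            return placement, load  # no movement possible
        work = computes[movable[0]][0]
        placement[movable[0]] = target
        load[donor] -= work
        load[target] += work


computes = {"c1": (5.0, {"A", "B"}), "c2": (4.0, {"A"}),
            "c3": (3.0, {"B", "C"}), "c4": (2.0, {"C"})}
background = {0: 1.0, 1: 1.0}       # non-migratable work per PE
home = {0: {"A"}, 1: {"B", "C"}}    # patches resident on each PE

placement, load, limit = greedy_assign(computes, background, home)
print(placement, load)

# Refinement demo: start from a deliberately skewed placement (everything on PE 0).
skewed = {name: 0 for name in computes}
skewed_load = {0: 15.0, 1: 1.0}
refined, refined_load = refine(skewed, skewed_load, computes, limit)
print(refined, refined_load)
```

In a measurement-based setting the loads fed to these functions would come from timing the objects over several cycles rather than from static estimates.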
32
[Figure: two bar charts of Time per processor, split into migratable work and non-migratable work; the second chart covers Processors 0 to 14 and the Average]
33
Speedups in 1998
[Figure: Speedup and Perfect Speedup versus Processors (0 to 240), ApoA-I: 92k atoms]
34
Optimizations
• Series of optimizations
• Examples discussed here:
  – Grainsize distributions (bimodal)
  – Integration: message sending overheads
• Several other optimizations
  – Separation of bond/angle/dihedral objects
    • Inter-patch and intra-patch
  – Prioritization
  – Local synchronization to avoid interference across steps
35
Grainsize and Amdahl’s Law
• A variant of Amdahl’s law, for objects, would be:
  – The fastest time can be no shorter than the time for the biggest single object!
• How did it apply to us?
  – Sequential step time was 57 seconds
  – To run on 2k processors, no object should be more than 28 msecs.
    • Should be even shorter
  – Grainsize analysis via projections showed that was not so..
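The 28 ms figure follows directly from dividing the sequential step time by the processor count. The short calculation below reproduces it, reading "2k" as 2048 (an assumption), and also shows the object-level bound on speedup; the 40 ms object is a hypothetical value chosen only for illustration, and `speedup_bound` is a hypothetical helper.

```python
sequential_step_s = 57.0   # sequential step time quoted on the slide
processors = 2048          # "2k processors", read here as 2048 (an assumption)

# Perfect speedup would leave this much time per step, so no single object may exceed it.
ideal_step_ms = sequential_step_s / processors * 1000.0


def speedup_bound(total_s, largest_object_s):
    """Object-level Amdahl bound: a step cannot finish faster than its largest object."""
    return total_s / largest_object_s


# Hypothetical largest object of 40 ms, chosen only for illustration.
bound = speedup_bound(sequential_step_s, 0.040)
print(f"ideal step time: {ideal_step_ms:.1f} ms; bound with a 40 ms object: {bound:.0f}x")
```

Under these assumptions the ideal step time is about 27.8 ms, and a single 40 ms object would cap the speedup at roughly 1425 no matter how many processors are added.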
36
Grainsize Analysis
[Figure: histogram of Number of objects versus Grainsize (milliseconds), grainsizes from 1 to 43 ms, annotated “Problem”]
Solution: Split compute objects that may have too much work, using a heuristic based on number of interacting atoms
37
Grainsize Reduced
[Figure: histogram of Number of objects versus Grainsize (milliseconds), grainsizes from 1 to 25 ms]
38
Performance Audit
• Through the optimization process,
  – an audit was kept to decide where to look to improve performance

              Ideal   Actual
Total         57.04    86
nonBonded     52.44    49.77
Bonds          3.16     3.9
Integration    1.44     3.05
Overhead       0        7.97
Imbalance      0       10.45
Idle           0        9.25
Receives       0        1.61

Integration time doubled
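The audit figures can be cross-checked and ranked with a few lines of Python. The numbers are copied from the table; the variable names are hypothetical, and the slide does not state the units, so none are assumed here.

```python
ideal = {"nonBonded": 52.44, "Bonds": 3.16, "Integration": 1.44}
actual = {"nonBonded": 49.77, "Bonds": 3.9, "Integration": 3.05,
          "Overhead": 7.97, "Imbalance": 10.45, "Idle": 9.25, "Receives": 1.61}

total_ideal = sum(ideal.values())
total_actual = sum(actual.values())

# Extra cost per category relative to the ideal column (missing categories count as 0).
extra = {name: round(value - ideal.get(name, 0.0), 2) for name, value in actual.items()}
worst = max(extra, key=extra.get)
integration_ratio = actual["Integration"] / ideal["Integration"]

print(f"ideal total {total_ideal:.2f}, actual total {total_actual:.2f}")
print(f"largest extra cost: {worst} ({extra[worst]})")
print(f"integration: {integration_ratio:.2f}x the ideal")
```

The column totals match the Total row, load imbalance is the largest single source of extra time, and integration costs a little over twice its ideal value, which is the item the following slides examine.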
39
Integration Overhead Analysis
[Figure labelled “integration”]
Problem: integration time had doubled from sequential run
40
Integration Overhead Example
• The projections pictures showed the overhead was associated with sending messages.
• Many cells were sending 30-40 messages.
  – The overhead was still too much compared with the cost of messages.
  – Code analysis: memory allocations!
  – Identical message is being sent to 30+ processors.
• Simple multicast support was added to Charm++
  – Mainly eliminates memory allocations (and some copying)
41
Integration Overhead: After Multicast
42
ApoA-I on ASCI Red
[Figure: Speedup versus Processors (0 to 2500), annotated “57 ms/step”]
43
ApoA-I on Origin 2000
[Figure: Speedup versus Processors (0 to 120)]
44
ApoA-I on Linux Cluster
[Figure: Speedup versus Processors (0 to 120)]
46
ApoA-I on T3E
[Figure: Speedup versus Processors (0 to 300)]
47
BC1 complex: 200k atoms
48
BC1 on ASCI Red
[Figure: Speedup versus Processors (0 to 2500), annotated “58.4 GFlops”]
49
Lessons Learned
• Need to downsize objects!
  – Choose smallest possible grainsize that amortizes overhead
• One of the biggest challenges
  – Was getting time for performance tuning runs on parallel machines
50
ApoA-I with PME on T3E
[Figure: Speedup versus Processors (0 to 300)]
51
Future and Planned Work
• Increased speedups on 2k-10k processors
  – Smaller grainsizes
  – Parallelizing integration further
  – New algorithms for reducing communication impact
  – New load balancing strategies
• Further performance improvements for PME
  – With multiple timestepping
  – Needs multi-phase load balancing
• Speedup on small molecules!
  – Interactive molecular dynamics
52
More Information
• Charm++ and associated framework:
  – http://charm.cs.uiuc.edu
• NAMD and associated biophysics tools:
  – http://www.ks.uiuc.edu
• Both include downloadable software
53
Parallel Programming Laboratory
• Funding:
  – Dept of Energy (via Rocket center)
  – National Science Foundation
  – National Institutes of Health
• Group Members
Milind Bhandarkar
Terry Wilmarth
Orion Lawlor
Neelam Saboo
Arun Singla
Karthikeyan Mahesh
Joshua Unger
Gengbin Zheng
Jay Desouza
Sameer Kumar
Chee wai Lee
Affiliated (NIH/Biophysics)
Jim Phillips
Kirby Vandivoort
54
The Parallel Programming Problem
• Is there one?
  – We can all write MPI programs, right?
  – Several Large Machines in use
• But:
  – New complex apps with dynamic and irregular structure
  – Should all application scientists also be experts in parallel computing?
55
What makes it difficult?
• Multiple objectives
  – Correctness, Sequential efficiency, speedups
  – Nondeterminacy: affects correctness
  – Several obstacles to speedup:
    • Communication costs
    • Load imbalances
    • Long critical paths
56
Parallel Programming
• Decomposition
  – Decide what to do in parallel
    • Tasks (loop iterations, functions,.. ) that can be done in parallel
• Mapping:
  – Which processor does each task
• Scheduling (sequencing)
  – On each processor
• Machine dependent expression
  – Express the above decisions for the particular parallel machine
57
Spectrum of parallel Languages
[Figure: MPI, Charm++ and Parallelizing Fortran compiler arranged along axes labelled Specialization and Level; the “What is automated” scale lists Machine dependent expression, Scheduling (sequencing), Mapping, Decomposition]
58
Charm++
• Data Driven Objects
• Asynchronous method invocation
• Prioritized scheduling
• Object Arrays
• Object Groups:
  – global object with a “representative” on each PE
• Information sharing abstractions
  – readonly data
  – accumulators
  – distributed tables
59
Data Driven Execution
[Figure: Objects, with two Scheduler and Message Q pairs]
60
Group Mission and Approach
• To enhance Performance and Productivity in programming complex parallel applications
• Approach: Application Oriented yet CS centered research
  – Develop enabling technology, for many apps.
  – Develop, use and test it in the context of real applications
• Theme
  – Adaptive techniques for irregular and dynamic applications
  – Optimal division of labor: “system” and programmer:
    • Decomposition done by programmer, everything else automated
    • Develop standard library for parallel programming of reusable components
61
Active Projects
• Charm++/ Converse parallel infrastructure
• Scientific/Engineering apps
  – Molecular Dynamics
  – Rocket Simulation
  – Finite Element Framework
• Web-based interaction and monitoring
• Faucets: anonymous compute power
• Parallel
  – Operations Research, discrete event simulation, combinatorial search
62
Charm++: Parallel C++ With Data Driven Objects
• Chares: dynamically balanced objects
• Object Groups:
  – global object with a “representative” on each PE
• Object Arrays/ Object Collections
  – User defined indexing (1D,2D,..,quad and oct-tree,..)
  – System supports remapping and forwarding
• Asynchronous method invocation
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
[Figure: Data driven Execution]
63
Multi-partition Decomposition
• Idea: divide the computation into a large number of pieces
  – Independent of number of processors
  – Typically larger than number of processors
  – Let the system map entities to processors
64
Converse
• Portable parallel run-time system that allows interoperability among parallel languages
• Rich features to allow quick and efficient implementation of new parallel languages
• Based on message-driven execution that allows co-existence of different control regimes
• Support for debugging and performance analysis of parallel programs
• Support for building parallel servers
65
Converse
• Languages and paradigms implemented:
  – Charm++, a parallel object-oriented language
  – Thread-safe MPI and PVM
  – Parallel Java, message-driven Perl, pC++
• Platforms supported:
  – SGI Origin2000, IBM SP, ASCI Red, CRAY T3E, Convex Ex.
  – Workstation clusters (Solaris, HP-UX, AIX, Linux etc.)
  – Windows NT Clusters
[Figure labels: Paradigms, Languages, Libraries; Parallel Machines; Converse]
66
Adaptive MPI
• A bridge between legacy MPI codes and dynamic load balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Based on Charm++ object arrays and Converse’s migratable threads
• Minimal modification needed to convert existing MPI programs (to be automated in future)
• Bindings for C, C++, and Fortran90
• Currently supports most of the MPI 1.1 standard
67
Converse Use in NAMD
68
Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
  – Calculate forces on each atom
    • bonds
    • non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 200,000)
Collaboration with Klaus Schulten, Robert Skeel
69
Spatial Decomposition in NAMD
• Space divided into cubes
  – Forces between atoms in neighboring cubes computed by individual compute objects
  – Compute objects are remapped by load balancer
70
NAMD: a Production-quality MD Program
• NAMD is used by biophysicists routinely, with several published results
• NIH funded collaborative effort with Profs. K. Schulten and R. Skeel
• Supports full range electrostatics
  – Parallel Particle-Mesh Ewald for periodic and Fast multipole for aperiodic systems
• Implemented in C++/Charm++
• Supports visualization (via VMD), Interactive MD, and haptic interface:
  – see http://www.ks.uiuc.edu
  – Part of Biophysics collaboratory
[Figure: ApoLipoprotein A1]
71
NAMD Scalable Performance
[Figure: Speedup on ASCI Red: BC1 (200k atoms); Speedup versus Processors (0 to 2500)]
Sequential Performance of NAMD (a C++ program) is comparable to or better than contemporary MD programs, written in Fortran.
Speedup of 1250 on 2048 processors on ASCI red, simulating BC1 with about 200k atoms. (compare with best speedups on production-quality MD by others: 170/256 processors)
Around 10,000 varying-size objects mapped by the load balancer
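The two speedup figures quoted above are easier to compare as parallel efficiencies (speedup divided by processor count). The helper below is a hypothetical convenience, and the numbers are the ones given on the slide.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


namd_eff = parallel_efficiency(1250, 2048)
other_eff = parallel_efficiency(170, 256)
print(f"NAMD on ASCI Red: {namd_eff:.0%}; quoted comparison: {other_eff:.0%}")
```

The efficiencies are of similar size (about 61% and 66%), but the NAMD figure is sustained on eight times as many processors.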
72
Rocket Simulation
• Rocket behavior (and therefore its simulation) is irregular, dynamic
• We need to deal with dynamic variations adaptively
• Dynamic behavior arises from
  – Combustion: moving boundaries
  – Crack propagation
  – Evolution of the system
73
Rocket Simulation
• Our Approach:
  – Multi-partition decomposition
  – Data-driven objects (Charm++)
  – Automatic load balancing framework
• AMPI: Migration path for existing MPI+Fortran90 codes
  – ROCFLO, ROCSOLID, and ROCFACE
[Figure: "Overhead" of multipartition decomposition, plotted against Number of partitions (1 to 1000, logarithmic axis)]
74
FEM Framework
• Objective: To make it easy to parallelize existing Finite Element Method (FEM) applications and to quickly build new parallel FEM applications, including those with irregular and dynamic behavior
• Hides the details of parallelism; developer provides only sequential callback routines
• Embedded mesh partitioning algorithms split mesh into chunks that are mapped to different processors (many-to-one)
• Developer’s callbacks are executed in migratable threads, monitored by the run-time system
• Migration of chunks to correct load imbalance
• Examples:
– Pressure-driven crack propagation
– 3-D Dendritic Growth
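The division of labor described on this slide (the framework partitions the mesh into chunks, maps several chunks to each processor, and calls the developer's sequential routine per chunk) can be sketched in a few lines. This is a toy illustration only: it is not the Charm++ FEM framework API, all names are hypothetical, a contiguous split stands in for a real mesh partitioner, and round-robin placement stands in for the run-time system's load balancing.

```python
from typing import Callable, Dict, List, Sequence


def partition(elements: Sequence[int], num_chunks: int) -> List[List[int]]:
    """Split elements into num_chunks contiguous chunks of near-equal size."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    base, extra = divmod(len(elements), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(elements[start:start + size]))
        start += size
    return chunks


def map_chunks(num_chunks: int, num_pes: int) -> Dict[int, List[int]]:
    """Assign chunk ids to processors round-robin (many chunks per processor)."""
    placement: Dict[int, List[int]] = {pe: [] for pe in range(num_pes)}
    for chunk_id in range(num_chunks):
        placement[chunk_id % num_pes].append(chunk_id)
    return placement


def run_callbacks(elements: Sequence[int], num_chunks: int, num_pes: int,
                  callback: Callable[[List[int]], int]) -> Dict[int, List[int]]:
    """Call the developer's sequential callback once per chunk, grouped by processor."""
    chunks = partition(elements, num_chunks)
    placement = map_chunks(num_chunks, num_pes)
    return {pe: [callback(chunks[c]) for c in ids] for pe, ids in placement.items()}


# The "developer" supplies only a sequential routine; here it just counts elements.
work = run_callbacks(list(range(1000)), num_chunks=128, num_pes=16, callback=len)
print({pe: sum(counts) for pe, counts in work.items()})
```

With 128 chunks on 16 processors, each processor holds 8 chunks, so the run-time system has small units it can move between processors when the load becomes uneven.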
75
FEM Framework: Responsibilities
Charm++ (Dynamic Load Balancing, Communication)
FEM Framework (Update of Nodal properties, Reductions over nodes or partitions)
FEM Application (Initialize, Registration of Nodal Attributes, Loops Over Elements, Finalize)
METIS I/O
Partitioner Combiner
76
Crack Propagation
• Explicit FEM code
• Zero-volume Cohesive Elements inserted near the crack
• As the crack propagates, more cohesive elements added near the crack, which leads to severe load imbalance
• Framework handles
– Partitioning elements into chunks
– Communication between chunks
– Load Balancing
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Pictures: S. Breitenfeld and P. Geubelle
77
Dendritic Growth
• Studies evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
• Adaptive refinement and coarsening of grid involves re-partitioning
Work by Prof. Jon Dantzig, Jun-ho Jeong
78
79
Anonymous Compute Power
What is needed to make this metaphor work?
• Timeshared parallel machines in the background
– effective resource management
• Quality of computational service contracts/guarantees
• Front ends that will allow agents to submit jobs on user’s behalf: Computational Faucets
80
Computational Faucets
• What does a Computational faucet do?
– Submit requests to “the grid”
– Evaluate bids and decide whom to assign work
– Monitor applications (for performance and correctness)
– Provide interface to users:
    • Interacting with jobs, and monitoring behavior
• What does it look like?
A browser!
81
Faucets QoS
• User specifies desired job parameters such as: program executable name, executable platform, min PE, max PE, estimated CPU-seconds (for various PE counts), priority, etc.
• User does not specify a machine. Faucet software contacts a central server and obtains a list of available workstation clusters, then negotiates with the clusters and chooses one to submit the job to.
• User can view status of clusters.
• Planned: file transfer, user authentication, merge with Appspector for job monitoring.
[Diagram: Web Browser, Faucet Client, Central Server, and three Workstation Clusters]
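The negotiation step on this slide (collect offers from clusters, then choose one) might look like the sketch below. The slides do not describe the actual Faucets protocol or selection rule, so everything here is an assumption: the class and field names are hypothetical, and "cheapest feasible bid" is just one plausible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRequest:
    """Job parameters named on the slide (a subset)."""
    executable: str
    platform: str
    min_pe: int
    max_pe: int


@dataclass
class Bid:
    """A cluster's offer; the price field is a hypothetical addition."""
    cluster: str
    platform: str
    offered_pe: int
    price: float


def choose_bid(job: JobRequest, bids: List[Bid]) -> Optional[Bid]:
    """Pick the cheapest bid that matches the platform and offers at least min_pe."""
    feasible = [b for b in bids
                if b.platform == job.platform and b.offered_pe >= job.min_pe]
    if not feasible:
        return None
    return min(feasible, key=lambda b: b.price)


job = JobRequest(executable="namd", platform="linux-x86", min_pe=16, max_pe=64)
bids = [
    Bid("cluster-a", "linux-x86", offered_pe=32, price=10.0),
    Bid("cluster-b", "linux-x86", offered_pe=8, price=2.0),   # too few processors
    Bid("cluster-c", "solaris", offered_pe=64, price=1.0),    # wrong platform
]
winner = choose_bid(job, bids)
print(winner.cluster if winner else "no feasible bid")
```

A fuller version would also weigh the estimated CPU-seconds and priority that the user supplies.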
82
Timeshared Parallel Machines
• Need resource management
– Shrink and expand individual jobs to available sets of processors
– Example: Machine with 100 processors
    • Job1 arrives, can use 20-150 processors
    • Assign 100 processors to it
    • Job2 arrives, can use 30-70 processors, and will pay more if we meet its deadline
• Make resource allocation decisions
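The 100-processor example above can be worked through with a small allocator. The slide does not state which allocation policy is used, so the rule below is an assumption chosen for illustration: every job first receives its minimum, then spare processors go to jobs in descending priority order up to each job's maximum. The `Job` class and its `priority` field are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    min_pe: int
    max_pe: int
    priority: int  # higher means better-paying; hypothetical field


def allocate(jobs: List[Job], total_pe: int) -> Dict[str, int]:
    """Give each job its minimum, then hand out spare processors by priority."""
    if sum(j.min_pe for j in jobs) > total_pe:
        raise ValueError("not enough processors to meet every job's minimum")
    alloc = {j.name: j.min_pe for j in jobs}
    spare = total_pe - sum(alloc.values())
    for j in sorted(jobs, key=lambda j: j.priority, reverse=True):
        extra = min(spare, j.max_pe - j.min_pe)
        alloc[j.name] += extra
        spare -= extra
    return alloc


job1 = Job("Job1", min_pe=20, max_pe=150, priority=1)
job2 = Job("Job2", min_pe=30, max_pe=70, priority=2)

print(allocate([job1], 100))         # Job1 alone expands to the whole machine
print(allocate([job1, job2], 100))   # Job1 shrinks once Job2 arrives
```

Under this policy Job1 runs on all 100 processors while alone, then shrinks to 30 so that the better-paying Job2 can have its full 70.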
83
Time-shared Parallel Machines
• To bid effectively (profitably) in such an environment, a parallel machine must be able to run well-paying (important) jobs, even when it is already running others.
• Allows a suitably written Charm++/Converse program running on a workstation cluster to dynamically change the number of CPUs it is running on, in response to a network (CCS) request.
• Works in coordination with a Cluster Manager to give a job as many CPUs as are available when there are no other jobs, while providing the flexibility to accept new jobs and scale down.
84
Appspector
• Appspector provides a web interface for submitting and monitoring parallel jobs.
• Submission: user specifies machine, login, password, and program name (the program must already be available on the target machine).
• Jobs can be monitored from any computer with a web browser. Advanced program information can be shown on the monitoring screen using CCS.