
The Use of Processor Groups in Molecular Dynamics Simulations to Sample Free-Energy States

Bruce Palmer,* Shawn Kathmann, Manojkumar Krishnan, Vinod Tipparaju, and Jarek Nieplocha

Pacific Northwest National Laboratory, Box 999, Richland, Washington 99352

* Corresponding author e-mail: [email protected].

Received August 9, 2006

J. Chem. Theory Comput. 2007, 3, 583-592. DOI: 10.1021/ct600260u. Published on Web 01/18/2007.

Abstract: Molecular dynamics calculations composed of many independent simulations are frequently encountered in free-energy calculations, as well as many other simulation approaches. In principle, the availability of a large number of independent tasks should make possible the development of highly scalable parallel code that executes these tasks concurrently. This paper discusses the use of processor groups to write simulation codes of this type and describes results for a code that evaluates the volume dependence of the Helmholtz free energy for clusters of an immiscible fluid in a solvent. The results show that very high levels of scalability can be achieved using processor groups with corresponding reductions in the time to completion. The main limitation to scaling appears to be a load imbalance due to variations in the execution times of the individual tasks.

Introduction

Molecular dynamics (MD) simulations are becoming an increasingly important tool in understanding the properties and behavior of complex chemical and biochemical systems. Biochemists routinely use simulations to help understand problems such as protein-ligand interactions, the binding of drugs to active sites, and the role of conformation changes in enzymatic activity.1 Simulations are also being used to understand many problems in materials science and chemistry, particularly with regard to reactions in aqueous phases.2

The higher speeds and decreasing cost of processors have also contributed to the widespread use of molecular dynamics. Simulations that previously took days on a high-end supercomputer can now be performed on a relatively inexpensive workstation. Furthermore, the availability of clusters and other parallel architectures has continued to push the bounds on the size of the simulation that can be performed, to the point where the main limitation is, in many cases, not the size of the system that can be simulated but rather the time interval over which the simulation can be performed. For many problems, the goal is to simulate a fixed-size system for longer periods of time or to obtain better ensemble averages rather than increasing the size of the system.

The development of parallel molecular dynamics algorithms has been only partially successful in addressing the need for simulations that extend over longer time periods. Molecular dynamics algorithms are relatively communication-bound and thus scale poorly once the number of atoms per processor drops below a certain point.3,4 Applying parallel programming to these algorithms has been much more successful in increasing the size of the system that can be simulated and relatively unsuccessful in decreasing the amount of time required to simulate a system for a given time interval. To address these difficulties, theoreticians have developed free-energy techniques and other sampling protocols that are designed to provide indirect estimates of the population of events that are rare on the time scale of the simulation.5-9 Instead of trying to simulate rare events directly, a collection of shorter simulations is used to estimate the relative free energy of these rare events and therefore estimate their relative population.

If the individual simulations are small enough to fit on a single processor, then one method of speeding up these calculations using parallel architectures is to farm out each simulation to an individual processor. This is commonly referred to as an embarrassingly parallel algorithm. However, if the calculations are too large to fit on a single processor, then each simulation or task must be distributed over several processors, and the only way to run the calculation is to run many consecutive parallel tasks.


This can be a major constraint if the individual tasks do not scale well above a certain number of processors. There is also no way to speed up the calculation if the number of processors available is more than the number of tasks, even if the individual tasks scale well beyond a single processor per task.

This paper will discuss the use of processor groups to achieve very high levels of scalability for algorithms that are based on multiple independent simulations. Processor groups are designed to allow the programmer to subdivide the domain of available processors (the "world group") into independent subdomains. Processor groups make it possible to run multiple concurrent independent calculations, each of which is running as a parallel, distributed task. This can be achieved with little additional effort beyond that required to write a conventional parallel code. These extensions have been incorporated into the Global Arrays Toolkit,10 which has been used to obtain the results described below.

The remainder of this paper will give an overview of multiple task simulation strategies and then discuss the shared-memory programming model and the processor group extensions to it, the implementation of a spatial decomposition molecular dynamics algorithm using shared-memory programming, the development of a code that uses processor groups, and scaling results for the code.

Multiple Task Simulations

As discussed in the Introduction, multiple task simulations can be sped up in two ways using conventional parallel programming models. The first is to use an embarrassingly parallel approach to farm out tasks to individual processors; the second is to run multiple consecutive parallel tasks. For the embarrassingly parallel algorithm, the program starts with a list of independent tasks and assigns each processor a subset of tasks from the list. This is illustrated schematically in Figure 1a. A variant on this approach is the master-worker algorithm, where one processor (the master) assigns tasks to the remaining processors (the workers). As each worker processor finishes its task, it is assigned the next task on the list by the master. The embarrassingly parallel approach works very well if the individual tasks are relatively small calculations that can be run on single processors, but it breaks down for large simulations where each task must be distributed across more than one processor in order to be run at all. It also has the problem that there is no way to reduce the simulation time further once the number of processors exceeds the number of tasks.

Most general-purpose MD codes that perform free-energy calculations on parallel platforms use the approach of running consecutive parallel tasks. This is illustrated in Figure 1b. Each individual task is run in parallel, but only one task is run at a time. This approach works well if the individual tasks are too large to fit on a single processor but requires very good scaling of the parallel code to utilize large numbers of processors. This kind of scaling is difficult to achieve, however, and many molecular dynamics codes show very poor scaling once the number of atoms per processor drops to a few hundred.3,4 Running consecutive parallel tasks is a natural extension of the way most parallel programs are written and can handle arbitrarily large systems, but it does not provide a good mechanism for utilizing large numbers of processors.

The approach discussed in this paper is to use processor groups to run multiple concurrent parallel tasks. This approach permits the use of very large numbers of processors to complete the calculation without taking the performance hit associated with poor scaling. The basic idea is to take the original collection of processors and subdivide them into smaller groups. Each group can then be used to run a parallel task that is independent of the other tasks. This is illustrated in Figure 1c. So far as the authors are aware, this approach has not been used in molecular dynamics, but it offers a way to greatly speed up calculations that consist of many independent tasks. The main disadvantage of this approach is that it is potentially much harder to program. However, by making use of the concept of a default processor group, discussed in more detail below, it turns out that a groups-based code is not significantly more complicated than a sequential parallel code.

Figure 1. Schematic illustration of programming models discussed in the text: (a) embarrassingly parallel, (b) parallel sequential tasks, (c) concurrent parallel tasks using groups. Dotted circles indicate results that are currently being generated; solid circles represent completed tasks.


Processor Groups and Shared Memory

The shared-memory programming model assumes memory can be divided into two categories, local and shared. Data structures that are created in local memory are visible only to the process that created them, while data structures in shared memory are visible to any process in the system. The message-passing programming model, on the other hand, assumes that all memory is local. Shared memory is a feature of some computer architectures in which multiple processors can directly access the same physical address space. Data objects created in this address space, such as arrays, can be written to and read from using a simple PUT/GET syntax that allows programmers to specify blocks of data in the shared array that need to be copied to or from local arrays. This style of programming tends to be much simpler than the distributed model used in message passing because shared memory preserves the global index space. Message passing, in contrast, requires the programmer to transform a data point from the global index space to a local index held on a particular processor. Communication in a shared programming model is accomplished by writing to and from arrays held in shared memory, while communication in a message-passing model is managed by sending data to a specific location on a specific processor. Although the shared-memory programming model originally reflected the physical layout of the memory on some parallel systems, it is also available on distributed memory platforms via the Global Arrays Toolkit.10 This is a library of routines that allows programmers to create globally accessible data structures even if memory is completely distributed.
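The following sketch illustrates the PUT/GET style described above using the Global Arrays C API as commonly documented; it is not taken from the paper's code, and the array name, sizes, and the MA memory settings are arbitrary choices made for illustration.

/* Minimal sketch of PUT/GET access to a shared (global) array.  Every
 * process writes a block into the global index space without knowing which
 * process physically owns those elements, then reads any block back. */
#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 100000, 100000);   /* local buffer space used internally by GA */

    int me = GA_Nodeid();             /* this process's id in the world group     */
    int np = GA_Nnodes();             /* number of processes                      */

    int dims[1]  = {10 * np};         /* one global index space shared by all     */
    int chunk[1] = {-1};              /* let the library pick the distribution    */
    int g_x = NGA_Create(C_DBL, 1, dims, "x", chunk);

    /* PUT: copy a local buffer into a block of the shared array. */
    double buf[10];
    for (int i = 0; i < 10; i++) buf[i] = me + 0.1 * i;
    int lo[1] = {10 * me}, hi[1] = {10 * me + 9}, ld[1] = {10};
    NGA_Put(g_x, lo, hi, buf, ld);

    GA_Sync();                        /* make all writes visible everywhere       */

    /* GET: read a block back, wherever it physically lives. */
    double peek[10];
    int plo[1] = {0}, phi[1] = {9};
    NGA_Get(g_x, plo, phi, peek, ld);
    if (me == 0) printf("x[0] = %f\n", peek[0]);

    GA_Destroy(g_x);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}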

Most global operations in parallel programming are executed on the set of all available processors (the "world group"). These operations consist of synchronizations, global sums, and broadcasts. A synchronization forces all processes to wait until everyone has reached the same point in the execution before any process can proceed further, global sums add a value or values across all processors, and broadcasts send a value or values from one processor to all others. The global sum and broadcast operations implicitly synchronize the processors as well. Global operations are usually employed to guarantee data consistency across processors. Because they force processes to stop until everyone has reached the same point in the execution, global operations tend to degrade scalability in parallel codes, and their widespread use should be avoided. However, almost all parallel codes have at least a few global operations.
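For readers more familiar with message passing, the three global operations named above have direct MPI analogues (the code described in this paper uses the corresponding Global Arrays calls); a minimal illustration:

/* Synchronization, global sum, and broadcast on the world group, in MPI form. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);                 /* synchronization          */

    double local = rank + 1.0, total = 0.0;      /* global sum across ranks  */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double param = (rank == 0) ? 0.01 : 0.0;     /* broadcast from process 0 */
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f, param = %f\n", total, param);
    MPI_Finalize();
    return 0;
}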

The presence of global operations is a major impediment to subdividing the world group into smaller groups that can run independent parallel tasks. Global operations in any task will force processes to wait until there is a corresponding global operation in other tasks. If the algorithm is such that other tasks will have different numbers or different sequences of global operations, the code will eventually hit a point where there is an unanswered global operation, and the program hangs. To avoid this, the concept of a processor group has been introduced. Global operations executed on a processor group are restricted to only those processes in the group. It is also possible to restrict the scope of queries that return the number of processes and the local processor ID so that they act on subgroups instead of the world group.

For shared-memory programming models, processor groups also need to restrict the extent of shared arrays that are created in the context of a subgroup so that the address space represented by the shared array is accessible only from processes within the group. Implementing this functionality requires some modification of the memory allocation routines and additional layers of index translation, but this is transparent to the user. The main modification to the programming model is that programmers need to be able to specify that a given shared object is associated with a particular processor group, instead of an implicit assumption that it is being created on the world group. The Global Arrays Toolkit used in developing the cluster simulation application described here has been modified to include these capabilities.

Finally, an important concept associated with programming on processor groups is the default processor group. Support for this has been explicitly incorporated into the Global Arrays Toolkit, but the programming concept could also be used in codes based on the message-passing interface (MPI), provided sufficient care is used. The default processor group is the group that all global operations and queries refer to unless another group is explicitly specified. For shared-memory programming, all shared data structures are created on the default group, again, unless explicitly specified otherwise. Conventional parallel programs have the world group as the default group. However, it is extremely useful to be able to specify some other group as the default. Parallel codes that have been written using the world group as the default can be repackaged as modules that run on a subgroup by wrapping them as a subroutine call. The calling program first creates the subgroup, sets the default group of all processes in the group to be the subgroup, and then calls the subroutine. The subroutine then runs exclusively on the subgroup. The advantage of this type of programming is that it allows modules to be developed as standard parallel codes before converting them to run on groups. This capability is explicitly supported in the Global Arrays Toolkit by subroutine calls that allow the programmer to set the default group, but the concept could also be executed using MPI by replacing explicit references to MPI_COMM_WORLD with a variable reference.
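A minimal sketch of this wrapping pattern, assuming the Global Arrays C API for group creation and default-group selection (GA_Pgroup_create, GA_Pgroup_set_default); the task routine run_md_task is a hypothetical stand-in for the existing parallel module, and the exact collective requirements of the group-creation call should be checked against the toolkit documentation.

/* Sketch of the "default group" pattern: the world group is split into
 * subgroups of two processes, each subgroup is made the default, and an
 * existing parallel routine then runs unchanged on its subgroup. */
#include <stdlib.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

/* Placeholder for the conventional parallel MD module; because it relies only
 * on default-group queries and collectives, it needs no changes to run on a group. */
void run_md_task(int task_id)
{
    int me = GA_Nodeid();   /* id within the default (sub)group  */
    int np = GA_Nnodes();   /* size of the default (sub)group    */
    /* ... create group-local global arrays, integrate, collect data ... */
    (void)task_id; (void)me; (void)np;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 100000, 100000);

    const int group_size = 2;              /* assumes world size is a multiple of 2 */
    int world_me = GA_Nodeid();
    int my_group = world_me / group_size;

    /* Build the list of world ranks belonging to this process's subgroup. */
    int *list = malloc(group_size * sizeof(int));
    for (int i = 0; i < group_size; i++) list[i] = my_group * group_size + i;
    int p_handle = GA_Pgroup_create(list, group_size);
    free(list);

    GA_Pgroup_set_default(p_handle);   /* queries/collectives now see the subgroup */
    run_md_task(my_group);             /* existing parallel code, unmodified       */
    GA_Pgroup_set_default(GA_Pgroup_get_world());

    GA_Terminate();
    MPI_Finalize();
    return 0;
}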

Parallel Molecular Dynamics Using Shared Memory-Style Programming

The parallel molecular dynamics code used in these studies is designed to simulate a Lennard-Jones fluid and fluid mixtures. There are no bonded interactions in this system, so there is no need for tables of bonded interactions and excluded atoms. The code is complicated by the need to incorporate a confining sphere that encloses one component of the Lennard-Jones mixture by forcing all atoms of the component to remain within a fixed distance of the center of mass of the cluster. This confining sphere uses a hard-sphere interaction to specularly reflect particles that cross the sphere boundary.


The basic simulation kernel uses a spatial decomposition algorithm11 to evaluate forces and distribute particle data. The original set of processors is decomposed into a three-dimensional block of processors (a set of eight processors would be divided into a cube of 2 × 2 × 2 processors). The dimensions of the processor block grid are used to divide the simulation volume into a set of equal-sized cells. Particles in each cell of the simulation volume are then assigned to the corresponding processor. Forces in the Lennard-Jones fluid have a finite interaction length; that is, only particles that are within a finite cutoff distance Rc of each other have any interactions. To evaluate the forces at each time step, it is necessary to get the coordinates of all particles on surrounding processors that are within the cutoff distance. To reduce the number of data-transfer operations, it is usually desirable to pad the cutoff distance by an extra amount, δ. Thus, each processor must collect all particles that are within a distance Rc + δ of the spatial boundaries associated with the processor. This operation is accomplished using the shift algorithm.11
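A sketch of the bookkeeping implied by this decomposition: factoring the processor count into a three-dimensional grid and mapping a particle's coordinates to the rank that owns the enclosing cell. The factorization routine and rank ordering below are illustrative assumptions, not the paper's implementation.

/* Factor nproc into px*py*pz (keeping the largest factor small) and map a
 * particle position inside a cubic box of side L to the owning rank. */
#include <stdio.h>

static void factor3(int nproc, int *px, int *py, int *pz)
{
    int best[3] = {nproc, 1, 1};
    for (int a = 1; a <= nproc; a++) {
        if (nproc % a) continue;
        for (int b = 1; b <= nproc / a; b++) {
            if ((nproc / a) % b) continue;
            int c   = nproc / (a * b);
            int hi  = a > b ? (a > c ? a : c) : (b > c ? b : c);
            int cur = best[0] > best[1] ? (best[0] > best[2] ? best[0] : best[2])
                                        : (best[1] > best[2] ? best[1] : best[2]);
            if (hi < cur) { best[0] = a; best[1] = b; best[2] = c; }
        }
    }
    *px = best[0]; *py = best[1]; *pz = best[2];
}

static int owner_rank(double x, double y, double z, double L,
                      int px, int py, int pz)
{
    int ix = (int)(x / L * px); if (ix == px) ix = px - 1;
    int iy = (int)(y / L * py); if (iy == py) iy = py - 1;
    int iz = (int)(z / L * pz); if (iz == pz) iz = pz - 1;
    return (iz * py + iy) * px + ix;   /* x varies fastest (an arbitrary choice) */
}

int main(void)
{
    int px, py, pz;
    factor3(8, &px, &py, &pz);         /* eight processors -> a 2 x 2 x 2 cube   */
    printf("grid %d x %d x %d, particle -> rank %d\n",
           px, py, pz, owner_rank(3.1, 7.9, 0.4, 8.0, px, py, pz));
    return 0;
}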

The shift operation consists of gathering particles that lie within a distance Rc + δ of the cell boundary along the x axis and sending their coordinates to the adjoining processors. This is illustrated in Figure 2a. The next step is to move all particles within a distance Rc + δ of the cell boundary along the y axis to adjoining processors. The particle locations from the adjoining cells along the x axis are already on the processor, so these can be sent along with particles located locally on the processor. This is illustrated in Figure 2b. After the transfer along the y axis is complete, the particles are transferred in a similar manner along the z axis. If all processors execute these transfers, then each processor will have a complete list of coordinates for all particles on neighboring processors within a distance of Rc + δ of the local cell boundaries. The fact that each of the updates is done sequentially along the axes reduces the total number of transfers to six, compared to the 26 transfers that would be required in three dimensions if data were collected from each of the neighboring processors independently. The shift algorithm generally needs to be performed twice during each time step. The first shift is before the force calculation, when all particle coordinates must be exchanged. The second shift occurs after the force calculation, when all of the partial forces must be sent back to their home processors. This second shift can be eliminated by dropping the use of Newton's third law on all interactions between a locally held particle and a remotely held particle, but this can significantly increase the time spent in the force calculation.
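The structure of the shift can be sketched as a loop over the three axes with two transfers per axis; the Domain type and exchange_boundary helper below are hypothetical placeholders rather than the paper's routines.

/* Structural sketch of the shift algorithm.  Boundary particles are exchanged
 * along x, then y, then z; each later exchange also forwards what was received
 * earlier, so six transfers cover all 26 neighbors. */
#include <stddef.h>

typedef struct { double x[3]; int id; } Particle;

typedef struct {
    Particle *local;       /* particles owned by this processor        */
    size_t    nlocal;
    Particle *ghost;       /* particles received from neighbors        */
    size_t    nghost;
    double    lo[3], hi[3];/* spatial bounds of this processor's cell  */
} Domain;

/* Placeholder: send every local or already-received particle within
 * (Rc + delta) of the face of axis 'axis' in direction 'dir' to that
 * neighbor, and append whatever the neighbor sends back onto d->ghost.
 * A real implementation would use the PUT/GET exchange described below. */
static void exchange_boundary(Domain *d, int axis, int dir, double skin)
{
    (void)d; (void)axis; (void)dir; (void)skin;
}

/* One shift: after the three axis sweeps, d->ghost holds every remote
 * particle within Rc + delta of the local cell boundaries. */
void shift(Domain *d, double rc, double delta)
{
    double skin = rc + delta;
    for (int axis = 0; axis < 3; axis++) {      /* x, then y, then z       */
        exchange_boundary(d, axis, -1, skin);   /* send toward lower face  */
        exchange_boundary(d, axis, +1, skin);   /* send toward upper face  */
    }
}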

The data transfer must be repeated at every time step. At periodic intervals, it is also necessary to redistribute the particles so that particles that have drifted outside the box associated with a given processor are assigned to new processors. Because the cutoff distance is padded by the amount δ, particles that have drifted a short distance outside the simulation cell do not cause any problems with the force calculation. Eventually, however, particles will move far enough that interactions will be missed if the particles are not reassigned to their appropriate processors. This reassignment is performed after a fixed number of steps. To summarize, the major communication steps in this algorithm are (1) the distribution of particles to processors on the basis of current coordinates, (2) the gathering of particle coordinates before evaluating forces, and (3) the scattering of partial forces back to particles after the force calculation. The bulk of the communication is from steps 2 and 3, which must be performed at every time step; the particle redistribution step occurs less frequently. Additional communication steps are associated with initialization and closeout of the calculation, but these are typically an insignificant fraction of the total execution time. There are also synchronizations and global summations of small quantities of data.

Figure 2. Schematic diagram of the shift algorithm for a 2D system. (a) In the first stage, data that is within the boundary region of a neighboring east or west processor is sent to the neighboring processor. (b) In the second stage, data that is within the boundary region of a neighboring north or south processor is sent to the neighboring processor. This includes data that were transferred from east and west neighbors in the first stage.

Individual shifts can be implemented using either message passing or a shared-memory programming model. The message-passing model assumes point-to-point communication in which data are sent from a local array on one processor to a local array on another processor. Typically, this requires that both the sending and receiving processes post signals that they intend to send or receive a message, and they must also specify the message length. Since the length is generally not known beforehand, this requires the exchange of an extra integer before the actual list of coordinates and so forth can be sent. Instead of exchanging individual messages, the shared-memory model creates a set of shared memory or otherwise globally accessible arrays that are used for communication. The shared-memory arrays are either one-dimensional or quasi-one-dimensional (the second dimension is typically small). To communicate, two arrays are needed. The first array, referred to as the size array, is one-dimensional and only contains a number of elements equal to the number of processors. Each processor owns one data element that is used to store the number of particles that need to be moved. The second array, referred to as the communication array, is much larger. For concreteness, assume that the array will be used for moving coordinates around. A parameter is chosen that represents the maximum number of atoms that can be expected to reside in the communication array for any one processor at any point in the communication cycle. Call this parameter MAXAT. The size of the communication array for moving around a vector quantity such as the coordinates is then NPROCS × MAXAT × 3, where NPROCS is the total number of processors. Each processor then owns a MAXAT × 3 sized block of the total communication buffer.

A communication step takes place in two stages. In the first stage, all processors determine which particles need to have their coordinates sent to a neighboring processor. These coordinates are all gathered into a local buffer and then copied into the portion of the communication array owned by that processor, using a PUT operation. The number of particles that each processor puts in the communication array is also placed in the size array. After these operations are complete, a synchronization call is made to guarantee that all data is finished being copied to the communication array and the size array. The second stage begins by having each processor get the number of elements in the size array from the appropriate adjoining processor using a GET operation. Once this number has been obtained, the processor can get the block of data corresponding to the adjoining processor from the communication array. The entire sequence of operations is illustrated schematically in Figure 3 for a 2 × 2 array of processors. Each processor is receiving data from the processor to its right. Periodic boundary conditions are assumed, so each processor at the end of a row wraps around and gets data from the first processor in the row.
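A sketch of this two-stage exchange in terms of Global Arrays PUT/GET calls; the array handles are assumed to have been created elsewhere (the size array as an integer array of length NPROCS, the communication array as NPROCS × MAXAT·3 doubles), and the buffer packing and neighbor selection are simplified stand-ins for the real code.

/* Two-stage exchange: PUT outgoing data and counts, synchronize, then GET the
 * neighbor's count followed by its block of the communication array. */
#include "ga.h"

#define MAXAT 1000   /* assumed bound on particles exchanged per processor */

void exchange_coords(int g_size, int g_comm,
                     double sendbuf[][3], int nsend,
                     double recvbuf[][3], int *nrecv,
                     int neighbor /* rank this processor receives from */)
{
    int me = GA_Nodeid();
    int lo[2], hi[2], ld[1];

    /* Stage 1: each processor PUTs its outgoing coordinates into its own block
     * of the communication array and records the count in the size array. */
    lo[0] = me;  hi[0] = me;
    lo[1] = 0;   hi[1] = 3 * nsend - 1;  ld[0] = 3 * nsend;
    if (nsend > 0) NGA_Put(g_comm, lo, hi, &sendbuf[0][0], ld);

    int count = nsend;
    int slo[1] = {me}, shi[1] = {me}, sld[1] = {1};
    NGA_Put(g_size, slo, shi, &count, sld);

    GA_Sync();   /* all PUTs must complete before anyone reads */

    /* Stage 2: GET the neighbor's count, then GET its block of coordinates. */
    slo[0] = neighbor;  shi[0] = neighbor;
    NGA_Get(g_size, slo, shi, &count, sld);

    *nrecv = count;
    if (count > 0) {
        lo[0] = neighbor;  hi[0] = neighbor;
        lo[1] = 0;         hi[1] = 3 * count - 1;  ld[0] = 3 * count;
        NGA_Get(g_comm, lo, hi, &recvbuf[0][0], ld);
    }
}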

Statistical Mechanics of Cluster Nucleation

The multiple task simulation targeted in this paper is the evaluation of atomic cluster size distribution functions for clusters of an immiscible fluid in a solvent. These calculations are not traditional free-energy calculations but share the feature of consisting of multiple independent simulations. The distributions that result from the individual simulations are converted into free-energy profiles for further analysis. The profiles lead directly to cluster dissolution rates. When combined with additional simulations of the relative free energies of clusters with different numbers of particles, they can also give the cluster nucleation rates. When such information is obtained for all cluster sizes up to the critical cluster (i.e., the cluster that defines the peak in the work of formation), a complete picture of the nucleation of phase separation can be developed.

Figure 3. Schematic illustration of particle exchange for a four-processor calculation which is decomposed into a 2 × 2 grid. (a) Each processor copies particle coordinates and so forth that need to be transferred to another processor to its locally held portion of a communication array. It also copies the number of particles that are in the communication array to a size array. (b) After a synchronization operation, each processor gets the amount of data from neighboring arrays from the size array. (c) Each processor gets the data for a neighboring process from the communication array.

A parallel single-task MD code was written that is designed to evaluate the cluster size probability distribution function

$$P(V) \propto \exp(-\beta p_{\mathrm{ext}} V) \int d\mathbf{r}^N \exp[-\beta U(\mathbf{r}^N)] \prod_{i=1}^{M} \theta(r_i - r_{\mathrm{cnf}})$$

where r^N is a collective coordinate representing the location of N particles, U is the potential energy of those particles, β = 1/k_BT is the inverse temperature, V = 4πr_cnf³/3 is the volume of the confining sphere, r_cnf is the corresponding radius of the confining sphere, r_i is the distance of a particle to the cluster center of mass, θ(x) is the Heaviside step function, and the product runs over a subset containing M particles. The external pressure p_ext is an arbitrary applied pressure that guarantees that the probability distribution decays to zero at large volumes. This distribution can be directly related to the evaporation (or dissolution) rate of a cluster of size M.12,13 For evaporation, the system is chosen so that M = N, but for dissolution, the system would be configured so that M < N. Once evaluated, P(V) can be converted to a free energy, A(r_cnf), using the expression12

$$A(r_{\mathrm{cnf}}) = -k_B T \ln[P(V)] - p_{\mathrm{ext}} V = -k_B T \ln\!\left[\frac{P(r_{\mathrm{cnf}})}{4\pi r_{\mathrm{cnf}}^2}\right] - p_{\mathrm{ext}} V$$

The second expression is frequently more useful since the Monte Carlo sampling actually generates the distribution P(r_cnf) = 4πr_cnf² P(V). Note that the explicit dependence of A(r_cnf) on p_ext drops out in this expression. The monomer dissolution rate of the cluster is directly proportional to the derivative of A with respect to r_cnf.
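A small routine showing this conversion from a sampled histogram of the confining-sphere radius to A(r_cnf); the bin layout and normalization are assumptions, and all quantities are in reduced units.

/* Convert a histogram of sampled confining-sphere radii into the free energy
 * A(r_cnf) = -kT ln[ P(r_cnf) / (4 pi r_cnf^2) ] - p_ext V, as in the second
 * expression above. */
#include <math.h>

void radius_histogram_to_free_energy(const long *counts, int nbins,
                                     double rmin, double dr,
                                     double kT, double pext,
                                     double *A /* out, length nbins */)
{
    const double pi = 3.14159265358979323846;
    long total = 0;
    for (int i = 0; i < nbins; i++) total += counts[i];

    for (int i = 0; i < nbins; i++) {
        if (total == 0 || counts[i] == 0) { A[i] = INFINITY; continue; } /* unsampled bin */
        double r = rmin + (i + 0.5) * dr;                 /* bin-center radius     */
        double V = 4.0 * pi * r * r * r / 3.0;            /* confining volume      */
        double P = (double)counts[i] / ((double)total * dr);   /* estimate of P(r_cnf) */
        A[i] = -kT * log(P / (4.0 * pi * r * r)) - pext * V;
    }
}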

The problem of evaluating the distribution P(V) can be reduced to simulating the behavior of a collection of particles inside a confining sphere of radius r_cnf. For simplicity, only the case in which all particles are contained in the confining sphere is discussed; the extension to the case where a subset of particles is restricted to the confining sphere is straightforward. The distribution P(V) has been simulated previously using Monte Carlo techniques.12,13 In this paper, the use of molecular dynamics simulations to evaluate P(V) is explored. The confining sphere can be considered as an infinitely hard potential whose center corresponds to the center of mass of the confined particles. Whenever a particle hits the confining sphere, it is reflected specularly away from the sphere surface. The velocity of the remaining particles is also altered slightly because the momentum of the center of mass of the remaining particles is specularly reflected by the collision as well. A more detailed description of how these collisions are implemented will be provided elsewhere; for the purposes of this paper, it is important to note the following points:

(1) The presence of "hard-sphere"-type collisions means that time steps during which a collision occurs must be broken up into smaller intervals. Before and after the collision, the trajectories are smooth and can be integrated using a conventional second-order algorithm. The velocity-Verlet algorithm, recast as a second-order Gear predictor-corrector,14 is especially convenient because it only needs the current coordinates and velocities to execute the next step. After determining that a collision with the confining sphere occurred during a time step, the time at which the collision occurred is calculated and the original step is broken up into pre- and postcollision intervals. The system is updated using a time step equal to the precollision interval, and then the velocities are modified using the rules for hard-sphere dynamics. Once the postcollision velocities are calculated, a second step is taken using a time step equal to the postcollision interval. If another collision is detected within the second interval, then the process is repeated until the original time step has been completed (a sketch of this sub-stepping loop is given after this list).

(2) For parallel simulations, hard-sphere dynamics associated with the confining sphere add several additional global operations. These are needed because each processor independently determines whether one of its particles has hit the confining sphere and, if so, when. This information must then be shared with all other processors. Additionally, processors also check for special cases, such as multiple collisions within a single time step and trajectories that loop into and out of the confining sphere within a time step, and communicate the results of these checks to other processors. Although these special cases occur infrequently, they cause significant problems if not handled correctly.

(3) The confining sphere radius is chosen using a conventional Monte Carlo procedure.12 After a fixed number of steps, a new confining sphere volume is generated. If the new volume is smaller than the old volume, then a check is performed to see if any of the particles in the cluster have moved outside the confining sphere. If they have, the volume is rejected. If no particles are outside the sphere, then the sphere is accepted or rejected using a standard Monte Carlo acceptance test based on the change p_ext ΔV. This test also adds an extra layer of synchronization.
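As referenced in point (1), the sub-stepping of a time step around a collision can be sketched as follows; the integrator, collision search, and reflection routines are placeholders, since the detailed collision rules are to be published elsewhere.

/* Sketch of the sub-stepping in point (1): when a collision with the
 * confining sphere is detected inside a step, the step is split at the
 * collision time, velocities are reflected, and the remainder of the step is
 * retaken.  advance(), find_collision_time(), and reflect() are placeholders. */
typedef struct State State;

/* Integrate the smooth dynamics forward by dt (e.g., velocity Verlet). */
extern void advance(State *s, double dt);

/* Return the time (0 < t <= dt) of the first collision with the confining
 * sphere inside the next interval of length dt, or -1.0 if there is none. */
extern double find_collision_time(const State *s, double dt);

/* Apply the hard-sphere rules: specularly reflect the colliding particle and
 * adjust the center-of-mass momentum of the remaining particles. */
extern void reflect(State *s);

void step_with_collisions(State *s, double dt)
{
    double remaining = dt;
    for (;;) {
        double tc = find_collision_time(s, remaining);
        if (tc < 0.0) {                 /* no collision: finish the step     */
            advance(s, remaining);
            return;
        }
        advance(s, tc);                 /* pre-collision interval            */
        reflect(s);                     /* post-collision velocities         */
        remaining -= tc;                /* retake the rest; repeat if needed */
    }
}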

Overall, the simulations proceed much like conventional molecular dynamics simulations. The main differences are that the coordinate update steps are more complicated and involve much higher levels of synchronization.

The system investigated in this study consists of a two-component Lennard-Jones fluid characterized by the potential function

$$U(\mathbf{r}^N) = \sum_{i<j} \phi_{ij}(r_{ij})$$

where the Lennard-Jones interaction φ_ij has the form

$$\phi_{ij}(r_{ij}) = \begin{cases} 4\epsilon_{ij}\left[\left(\sigma_{ij}/r_{ij}\right)^{12} - \left(\sigma_{ij}/r_{ij}\right)^{6}\right] - \phi_{ij}^{0}, & r_{ij} < R_{ij}^{c} \\ 0, & r_{ij} \ge R_{ij}^{c} \end{cases}$$

The constant φ_ij^0, representing the cutoff interaction energy, is

$$\phi_{ij}^{0} = 4\epsilon_{ij}\left[\left(\sigma_{ij}/R_{ij}^{c}\right)^{12} - \left(\sigma_{ij}/R_{ij}^{c}\right)^{6}\right]$$

This choice guarantees that φ_ij(r) goes continuously to zero as r approaches R_ij^c. For these simulations, the fluid contains two components denoted by A and B. For all interactions, σ_ij and ε_ij are set equal to 1. The particle masses are also set to 1. If both i and j belong to the same component, then R_ij^c is set equal to 2.5. If i and j belong to different components, then R_ij^c is set equal to 2^{1/6}. These choices guarantee that interactions between particles in the same component are attractive at longer distances while interactions between particles in different components are purely repulsive. The pair interactions are illustrated in Figure 4. For these simulations, the component forming the cluster is B (referred to hereafter as the solute) while A is the majority component and acts as the solvent. The repulsive cross interaction means that B will tend to form clusters within A.
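The truncated-and-shifted pair potential defined above can be written directly as a short function; parameter values follow the text (σ = ε = 1, with R^c = 2.5 for like pairs and 2^{1/6} for unlike pairs).

/* Truncated-and-shifted Lennard-Jones pair interaction:
 * phi(r) = 4*eps[(sigma/r)^12 - (sigma/r)^6] - phi0 for r < Rc, 0 otherwise,
 * with phi0 chosen so that phi goes continuously to zero at Rc. */
#include <math.h>
#include <stdio.h>

static double lj_shifted(double r, double sigma, double eps, double rc)
{
    if (r >= rc) return 0.0;
    double s6   = pow(sigma / r, 6.0);
    double c6   = pow(sigma / rc, 6.0);
    double phi0 = 4.0 * eps * (c6 * c6 - c6);   /* value at the cutoff */
    return 4.0 * eps * (s6 * s6 - s6) - phi0;
}

int main(void)
{
    double rc_like   = 2.5;                 /* A-A and B-B: attractive tail */
    double rc_unlike = pow(2.0, 1.0 / 6.0); /* A-B: purely repulsive        */
    printf("phi_AA(1.5) = %f\n", lj_shifted(1.5, 1.0, 1.0, rc_like));
    printf("phi_AB(1.5) = %f\n", lj_shifted(1.5, 1.0, 1.0, rc_unlike));
    return 0;
}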

The output of the simulation is the distribution P(r_cnf) [which is readily converted to P(V)]. This is obtained by binning the corresponding values of the radius of the confining sphere that are selected at each Monte Carlo step. The program has also been modified to incorporate an equilibration protocol into each trajectory, along with an additional setup routine that creates an initial condition by specifying the number of Lennard-Jones particles of each type that are used in the simulation. The initial conditions are chosen so that the minority B component is always located in a roughly spherical blob at the center of the simulation cell. To improve convergence, the range over which the confining sphere radius is sampled is restricted. The simulation is initially run for a short period during which the confining sphere radius is brought within the sampling range; it is then equilibrated further. After equilibration, data is collected and the distribution P(r_cnf) is evaluated. The data collection stage is much longer than the equilibration stage, primarily because the trajectory has to be run for a very long time to get good statistics for P(r_cnf).

Results

The scaling properties of the groups-based simulation code were investigated by first determining the scaling behavior of the isolated molecular dynamics module running as a standard parallel application. After determining the single-task scaling behavior, the scaling properties of multiple tasks running on groups were evaluated. The single-task scaling behavior was determined for a system consisting of 4000 solvent particles and two solute particles. The simulations were run for 50 000 steps using a time step of Δτ = 0.01. These simulations are too short to provide valid scientific results, but they are long enough to characterize the scaling behavior of the code. Longer simulations that produced useful data are discussed below. The simulations were run using the isothermal/isobaric method of Nosé and Andersen15,16 with a reduced temperature of 0.8 and a reduced pressure of 0.05. This locates the simulations in the liquid part of the Lennard-Jones phase diagram for the pure Lennard-Jones fluid.17 The single-task scaling behavior of the molecular dynamics module on an HP dual Itanium2 1.5 GHz cluster with a Quadrics switch using the Elan communication libraries is shown in Figure 5. It can be seen that the scaling behavior of this calculation is fairly poor and drops off rapidly as the number of processors increases. The speedup from one to two processors is reasonable but starts falling off significantly in going from two to four processors. Beyond that, the speedup is relatively flat, and at 16 processors, the speedup has actually started to decrease. The poor scaling for this calculation is primarily due to the relatively small problem size and very poor load balancing associated with the initial condition. Simulations run for longer periods of time have better load balancing as the density equilibrates and density fluctuations present at the beginning disappear.

To perform multiple, concurrent parallel tasks, the molecular dynamics code was modified to divide the available processors in the world group into subgroups. The code then executes the tasks on the groups. When a task is completed on one group, the group gets the next task from a global counter. The global counter is implemented using the READ-INCREMENT functionality available in the Global Arrays Toolkit. A shared array consisting of a single integer is created at the start of the calculation and initialized to zero. The READ-INCREMENT function reads the current value of an array variable and increments it by 1. As each processor group completes a task, the processor corresponding to process 0 in the group performs a read-increment on the array variable acting as the global counter and translates the current value of the variable into a set of task parameters.
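A sketch of this counter using the read-increment operation of the Global Arrays C API (NGA_Read_inc); the translation of the counter value into task parameters is left as a placeholder.

/* Global task counter: a one-element integer global array is atomically
 * read-and-incremented by process 0 of a group to claim the next task. */
#include "ga.h"
#include "macdecls.h"

extern void get_task_parameters(long task_id);   /* placeholder translation */

/* Create the shared counter (call once on the world group, before splitting). */
int create_task_counter(void)
{
    int dims[1] = {1}, chunk[1] = {-1};
    int g_counter = NGA_Create(C_LONG, 1, dims, "task counter", chunk);
    GA_Zero(g_counter);
    return g_counter;
}

/* Called by process 0 of a group when the group needs its next task; returns
 * the task id, or -1 when the task list (ntasks entries) is exhausted. */
long next_task(int g_counter, long ntasks)
{
    int subscript[1] = {0};
    long id = NGA_Read_inc(g_counter, subscript, 1);   /* atomic fetch-and-add */
    if (id >= ntasks) return -1;
    get_task_parameters(id);
    return id;
}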

Simulations of M tasks were run with each task representing a system of 4000 solvent particles and between two and M + 1 solute particles. For the timing studies described below, the number of tasks was set to 512. Most of the remaining parameters are the same as for the calculations described in the single-task scaling study, except that the number of time steps was reduced to 10 000 in order to reduce the overall simulation time. Because the tasks with larger numbers of solute particles were expected to take slightly longer, the tasks were executed in reverse order to improve load balancing. These simulations were run on between eight and 1024 processors on the HP cluster described above. On the basis of the single-task simulation results, the available processors were divided into groups of two processors. The results are shown in Figure 6.

Figure 4. Plot of Lennard-Jones interaction potential using both long- and short-range cutoffs.

Figure 5. Single-task scaling behavior of the molecular dynamics simulation.


The results show good speedup all the way up to 1024 processors. The parallel efficiency, which is the actual speedup divided by the theoretical maximum speedup, is shown in Figure 7. The efficiency remains above 80% all the way up to 512 processors and is close to 70% even for 1024 processors. The main reason for the drop in efficiency appears to be the granularity of the execution times of the individual tasks. As the number of tasks per group drops, the variation in the amount of time required for each group to finish its tasks increases.

The variation in individual task execution times can be seen in Figure 8, where the minimum and maximum execution times for the individual tasks are plotted as a function of the number of processors. If the individual tasks are truly running independently, then each task should take about the same amount of time regardless of the total number of processors. This further implies that the minimum and maximum execution times for the collection of tasks should also be independent of the total number of processors, and this is seen in Figure 8, although there may be a small increase in execution times at very large processor counts. The figure also shows that there is a substantial variation in execution times. The maximum execution time is over 60% higher than the minimum execution time. This variation is much larger than the variation in the total number of particles in the different tasks (approximately 13%) and is due to the relatively short execution times and differences in short-time behavior caused by differences in the initial configuration.

The variance in execution time on individual groups was also calculated and is plotted along with the parallel efficiency in Figure 7. If T_i represents the total time spent on group i executing its tasks and T_min and T_max are the minimum and maximum values of T_i across the different groups, then the variance is 100(T_max - T_min)/T_max. Note that T_max is almost identical to the overall execution time for the entire calculation. The variance is inversely correlated with parallel efficiency; as the variance increases, parallel efficiency decreases. This supports the conclusion that the main source of the decrease in parallel efficiency is the granularity of the individual tasks. This problem would probably decrease for longer simulations, where the variation in initial conditions is not as important. For longer simulations, the execution times should track more closely with the total number of particles, and the execution times would be much closer together.

The short calculations described above were good only for evaluating the scalability of the code. To produce useful scientific data, much longer calculations are needed. To demonstrate that such calculations are possible, a set of simulations consisting of 10^8 steps was run on systems containing two to five solute particles. In order to run simulations containing such a large number of steps in a reasonable amount of time, the number of solvent particles was reduced to 400. For the small solute clusters investigated in this study, 400 solvent particles should be sufficient. Simulations in this regime indicate that the correlation length of the radial distribution function for the pure Lennard-Jones fluid is on the order of 3-4σ, while the typical dimension of the simulation cell is a little over 8σ. The first 2 000 000 steps were used to move the confining sphere to within the maximum value of 4.0σ and to equilibrate the system. The remainder of the simulation was used to collect data. The simulations were run on eight processors divided into four groups of two processors each. Some shorter runs on trajectories containing 1 000 000 steps indicate that the speedup in going from one to two processors is about 1.6. The total execution time for all four calculations was about 52 h, so the use of groups amounted to a substantial decrease in the overall execution time. There is no additional benefit to using groups containing more than two processors for a system this small since the speedup actually starts to decrease at four processors.

Figure 6. Speedup curve for 512 tasks as a function of the number of processors. The available processors were divided into groups containing two processors each.

Figure 7. Parallel efficiency and maximum variance in execution times on each of the groups as a function of the number of processors for 512 tasks.

Figure 8. Plot of minimum and maximum task execution times for 512 tasks.


For these longer simulations, the load imbalance issues are much smaller. The total variation in execution times for the different tasks is about 1.6%.

The probability distributions for the confining sphere radius are shown in Figure 9. A few things are worth noting. First, the leading edge of the distributions moves to larger radius values as the number of solute particles increases. This is expected because more particles imply a larger cluster. Second, even with 10^8 steps, the distributions still appear to contain noticeable amounts of noise. This is a common problem with simulations containing many degrees of freedom where the focus is to determine the behavior of a single coordinate. It is difficult to obtain accurate averages since each point in the distribution must be sampled many times, but at the same time a substantial amount of work is required for each system update.

Figure 9. Confining sphere size distributions for simulations containing two to five solute particles and 400 solvent particles.

The actual quantity needed for evaluating dissolution rates is the free-energy curve A(r_cnf), which can be derived from the distributions P(V). The curves for A(r_cnf) are shown in Figure 10. Because the A(r_cnf)'s are proportional to the logarithm of the distribution functions, much of the noise in the P(V) curves is suppressed. It is possible that, after applying some smoothing algorithms to these curves, reasonable values of both the derivative of A(r_cnf) and the location of the minimum value of the derivative could be obtained and used to evaluate the dissolution rate. A more extensive analysis of these results and results for larger clusters will be presented elsewhere.

Figure 10. Cluster free energies as a function of size for simulations containing two to five solute particles and 400 solvent particles.

As a final comment on the performance of multitask codes using groups, if only four processors had been available, the most efficient way to perform the simulations would be to run on groups containing only a single processor. This would be equivalent to the embarrassingly parallel algorithm. The reason for using only one processor per group is that efficiency is highest for single-processor calculations, and since the number of tasks exceeds the number of processors, the shortest time to solution will be obtained by running each task at the highest possible efficiency. This is generally true for any calculation with a limited number of processors. The shortest time to solution is obtained by running individual tasks on the smallest number of processors on which the calculation will fit. It is only when a large number of processors are available, so that all tasks can be run concurrently, that there is a time benefit to increasing the number of processors in the group.

Conclusions

The results presented here demonstrate that processor groups can be used to substantially reduce the amount of time required to perform multiple independent molecular dynamics simulations. The scenario of many independent calculations is often encountered in evaluations of free energies, which are an important class of calculations in biochemical simulations and other evaluations of chemical stability and reactivity. The results of these scaling studies indicate that processor groups can be used quite successfully to run multiple concurrent instances of a parallel calculation. The results also illustrate the utility of the default processor group concept, which allows developers to easily reconfigure parallel applications to run on groups with only minor adjustments to the code.

For free-energy calculations, processor groups can be used either to run multiple concurrent simulations when each individual simulation is too large to fit on a single processor or to speed up simulations in which the number of available processors exceeds the number of individual simulations. Without processor groups, the only available alternatives are to run individual simulations on single processors following an embarrassingly parallel model or to submit multiple independent calculations to the computer. The first option can only be used if the individual simulations fit on a single processor, and the second option, besides being inconvenient, breaks down completely if more complicated simulation protocols are used in which initial results are used to bias the sampling and improve statistics.

Acknowledgment. This work was supported by Pacific Northwest National Laboratory (PNNL) and the U.S. Department of Energy's Office of Science through the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) and the Division of Chemical Sciences and was performed in part using the Molecular Science Computing Facility (MSCF) in EMSL. The MSCF and EMSL are funded by the Office of Biological and Environmental Research in the U.S. Department of Energy. Pacific Northwest National Laboratory is operated by Battelle Memorial Institute for the U.S. Department of Energy under Contract No. DE-AC06-76RLO 1830.

References

(1) Wang, W.; Donini, O.; Reyes, C. M.; Kollman, P. A. Biomolecular Simulations: Recent Developments in Force Fields, Simulations of Enzymatic Catalysis, Protein-Ligand, Protein-Protein, and Protein-Nucleic Acid Noncovalent Interactions. Annu. Rev. Biophys. Biomol. Struct. 2001, 30, 211.

(2) Garrett, B. C.; Schenter, G. K.; Morita, A. Molecular Simulations of the Transport of Molecules across the Liquid/Vapor Interface of Water. Chem. Rev. 2006, 106, 1355.

(3) Phillips, J. C.; Braun, R.; Wang, W.; Gumbart, J.; Tajkhorshid, E.; Villa, E.; Chipot, C.; Skeel, R. D.; Kalé, L.; Schulten, K. Scalable Molecular Dynamics with NAMD. J. Comput. Chem. 2005, 26, 1781.

(4) Germain, R. S.; Zhestkov, Y.; Eleftheriou, M.; Rayshubskiy, A.; Suits, F.; Ward, T. J. C.; Fitch, B. G. Early Performance Data on the Blue Matter Molecular Simulation Framework. IBM J. Res. Dev. 2005, 49, 447.

(5) Straatsma, T. P.; McCammon, J. A. Multiconfiguration Thermodynamic Integration. J. Chem. Phys. 1991, 95, 1175.

(6) Tobias, D. J.; Brooks, C. L.; Fleischman, S. H. Conformational Flexibility in Free Energy Simulations. Chem. Phys. Lett. 1989, 156, 256.

(7) Tobias, D. J.; Brooks, C. L. Calculation of Free Energy Surfaces Using the Methods of Thermodynamic Perturbation Theory. Chem. Phys. Lett. 1987, 142, 472.

(8) Widom, B. Some Topics in the Theory of Fluids. J. Chem. Phys. 1963, 39, 2808.

(9) Straatsma, T. P. Free Energy by Molecular Simulation. Rev. Comput. Chem. 1996, 9, 81.

(10) Nieplocha, J.; Palmer, B. J.; Tipparaju, V.; Krishnan, M.; Trease, H. E.; Apra, E. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit. Int. J. High Perform. Comput. Appl. 2006, 20, 203.

(11) Plimpton, S. Fast Parallel Algorithms for Short-Range Molecular Dynamics. J. Comput. Phys. 1995, 117, 1.

(12) Kathmann, S. M.; Schenter, G. K.; Garrett, B. C. Dynamical Nucleation Theory: Calculation of Condensation Rate Constants for Small Water Clusters. J. Chem. Phys. 1999, 111, 4688.

(13) Schenter, G. K.; Kathmann, S. M.; Garrett, B. C. Variational Transition State Theory of Vapor Phase Nucleation. J. Chem. Phys. 1999, 110, 7951.

(14) Allen, M. P.; Tildesley, D. J. Computer Simulation of Liquids; Oxford University Press: New York, 1987.

(15) Andersen, H. C. Molecular Dynamics Simulations at Constant Pressure and/or Temperature. J. Chem. Phys. 1980, 72, 2384.

(16) Nosé, S. A Unified Formulation of the Constant Temperature Molecular Dynamics Methods. J. Chem. Phys. 1984, 81, 511.

(17) Hansen, J.-P.; Verlet, L. Phase Transitions of the Lennard-Jones System. Phys. Rev. 1969, 184, 151.
