framework design, parallelization and force computation in

108
Framework Design, Parallelization and Force Computation in Molecular Dynamics Thierry Matthey Department of Informatics University of Bergen September 2002 Thesis submitted in partial fulfillment of the requirements for the degree Doctor Scientiarum

Upload: others

Post on 01-Jan-2022

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Framework Design, Parallelization and Force Computation in

Framework Design,ParallelizationandForceComputationin MolecularDynamics

ThierryMattheyDepartmentof Informatics

Universityof Bergen

September2002

Thesissubmittedin partial fulfillmentof therequirementsfor thedegreeDoctorScientiarum

Page 2: Framework Design, Parallelization and Force Computation in
Page 3: Framework Design, Parallelization and Force Computation in

Contents

1 Intr oduction 11.1 Structureof thethesis . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Physicaland Numerical Model 62.1 Numericalintegration . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Switchingfunctions . . . . . . . . . . . . . . . . . . . . . . . . . 112.3 Boundaryconditions . . . . . . . . . . . . . . . . . . . . . . . . 12

3 ForceComputation 133.1 Fastelectrostaticforcealgorithms . . . . . . . . . . . . . . . . . 143.2 StandardEwaldsummation. . . . . . . . . . . . . . . . . . . . . 143.3 Mesh-basedEwaldmethods . . . . . . . . . . . . . . . . . . . . 183.4 Multi-grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.5 Multi-pole methods. . . . . . . . . . . . . . . . . . . . . . . . . 243.6 Otherfastelectrostaticmethods. . . . . . . . . . . . . . . . . . . 24

4 Parallelization 264.1 Loadbalancing . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.2 Parallelprogrammingparadigms. . . . . . . . . . . . . . . . . . 284.3 Two parallelimplementations. . . . . . . . . . . . . . . . . . . . 29

4.3.1 Spatialdecomposition . . . . . . . . . . . . . . . . . . . 294.3.2 Incrementalparallelizationapproach. . . . . . . . . . . . 29

5 The PROTOM OL Framework 315.1 Object-orientedvs. traditionalproceduralprogramming . . . . . 325.2 A component-basedframework design . . . . . . . . . . . . . . 335.3 Integrators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3.1 Integratordefinitionlanguage . . . . . . . . . . . . . . . 365.4 Forces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.4.1 Forcerequirements. . . . . . . . . . . . . . . . . . . . . 385.4.2 Option1: All-inclusiveinterface . . . . . . . . . . . . . . 385.4.3 Option2: Multiple inheritance. . . . . . . . . . . . . . . 395.4.4 Option3: Genericprogramming. . . . . . . . . . . . . . 395.4.5 Combinedapproach:Policy, Strategy, andTraits . . . . . 405.4.6 Forcedesignof PROTOMOL . . . . . . . . . . . . . . . . 405.4.7 Forceinterface . . . . . . . . . . . . . . . . . . . . . . . 42

I

Page 4: Framework Design, Parallelization and Force Computation in

5.4.8 Forcesubtyping. . . . . . . . . . . . . . . . . . . . . . . 425.5 Forcefactory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.5.1 Solution:AbstractFactoryandPrototypes. . . . . . . . . 435.6 Functionalitiesto compareforcealgorithms . . . . . . . . . . . . 465.7 Implementationandvalidationof fastelectrostaticforcealgorithms 47

5.7.1 StandardEwald summation . . . . . . . . . . . . . . . . 475.7.2 Particle-meshbasedEwald summation . . . . . . . . . . 485.7.3 Multi-grid . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.8 Practicalexamplesof extendibility andoptimization. . . . . . . . 505.8.1 Addinganew force . . . . . . . . . . . . . . . . . . . . . 505.8.2 Parallelizinga force . . . . . . . . . . . . . . . . . . . . 52

5.9 Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.9.1 Sequentialscalability . . . . . . . . . . . . . . . . . . . . 535.9.2 Parallelscalability . . . . . . . . . . . . . . . . . . . . . 545.9.3 Scalabilityof fastelectrostaticforcealgorithms . . . . . . 55

6 Conclusions 57

Acknowledgments 60

References 61

A Publications 73

B Design 113

C Code 120

D Gallery 122D.1 Biomolecules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122D.2 IM200 – Labcoursein ComputationalScience. . . . . . . . . . . 125D.3 Magneticholes– Ugelstadspheres. . . . . . . . . . . . . . . . . 127D.4 Coulombcrystals . . . . . . . . . . . . . . . . . . . . . . . . . . 129

II

Page 5: Framework Design, Parallelization and Force Computation in

List of Figures

1 Coulomb: ��������� . . . . . . . . . . . . . . . . . . . . . . . . . . 82 LJ: ����� ����� � �� . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Two switchingfunctionsmodifying theoriginal potential� ��� . . . 114 Themultilevel schemeof themulti-grid algorithm. . . . . . . . . 195 A plotof theoriginalkernel� ��� andtwosmoothedkernels���smooth

�������� ������with softeningdistances� ��� and � � � . . . . . . . . . . . . . . 20

6 Dataexchangebasedon one-sidedcommunication. . . . . . . . . 307 Thecomponent-basedframework PROTOMOL. . . . . . . . . . . 348 Collaborationdiagrambetweentheintegratorobjectandtheforce

objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359 Chainof integratorsimplementingmultiple timesteppingschemes.3610 Correspondencebetweentheinput definitionandtheactualforce

object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4311 Themaximumforceerrorfor agivenaccuracy parameter! . . . . . 4812 Normalizedrun-timefor agivenaccuracy parameter! . . . . . . . 4813 Theenergy for periodicboundaryconditionsandstepsize1 fs. . . 4914 Theenergy for vacuumandstepsize1 fs. . . . . . . . . . . . . . 4915 Parallelscalabilityfor periodicboundaryconditions. . . . . . . . 5416 Parallelscalabilityfor vacuum. . . . . . . . . . . . . . . . . . . . 5417 Comparisonof fastelectrostaticalgorithmsfor periodicboundary

conditions.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5618 Comparisonof fastelectrostaticalgorithmsfor vacuum.. . . . . . 5619 Overview designof forces. . . . . . . . . . . . . . . . . . . . . . 11420 Top level designof forces. . . . . . . . . . . . . . . . . . . . . . 11521 Detaileddesignof bondedsystemforces. . . . . . . . . . . . . . 11622 Detaileddesignof non-bondedsystemforces. . . . . . . . . . . . 11723 Detaileddesignof extendedforces.. . . . . . . . . . . . . . . . . 11824 Detailed designof TOneAtomPair , evaluating one interaction

pair or two simultaneously. . . . . . . . . . . . . . . . . . . . . . 11925 Bovinepancreatictrypsininhibitor (BPTI)withoutwater, 898atoms.12226 Alanin, 66atoms. . . . . . . . . . . . . . . . . . . . . . . . . . . 12327 141-moleculewatersystem. . . . . . . . . . . . . . . . . . . . . 12328 Apolipoprotein(ApoA1) withoutwater, 27,850atoms,P02647.. . 12429 15Ugelstadspheres(by courtesyof AndreasHellesøy).. . . . . . 12730 Fronthemisphereof theoutershellof a20,288Ca"#%$ system. . . . 13031 Radialdistributionof asystemwith 20,288Ca"#%$ . . . . . . . . . . 131

III

Page 6: Framework Design, Parallelization and Force Computation in

32 Fronthemisphereof theoutershell of a 10,144Ca"#%$ and10,144A "& $ system.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

33 Fronthemisphereof theoutershell (250.4-252.2' m) of a 10,144Ca"#%$ and10,144A "& $ system.. . . . . . . . . . . . . . . . . . . . 133

34 Fronthemisphereof theoutershell (242.2-250.4' m) of a 10,144Ca"#%$ and10,144A "& $ system.. . . . . . . . . . . . . . . . . . . . 134

35 Radialdistributionof asystemwith 10,144Ca"#%$ and10,144A "& $ . 13536 Radialdistributionof asystemwith 100,000Ca"#%$ . . . . . . . . . . 13637 Radialdistributionof asystemwith 50,000Ca"#%$ and50,000A "& $ . 13738 Radialdistributionof asystemwith 2,057Ca"#%$ . . . . . . . . . . . 138

List of Tables

1 Identificationstringof angleforceandaLennard-Jonesforce. . . 442 Sequentialcomparisonof PROTOMOL vs. NAMD2 (PentiumIII). 533 Sequentialcomparisonof PROTOMOL vs. NAMD2 (Origin2000). 554 Performancecomparisonof fastelectrostaticforcealgorithms. . . 565 Overview of policy choicesfor thenon-bondedforcealgorithms. . 113

List of Algorithms

1 Pseudo-codeof anMD simulation. . . . . . . . . . . . . . . . . . 92 TheVerlet-I/r-RESPA method. . . . . . . . . . . . . . . . . . . . 103 Pseudo-codeof theforceevaluation. . . . . . . . . . . . . . . . . 134 Pseudo-codeof a recursivemulti-grid schemewith V-cycle. . . . . 22

List of Programs

1 Grammarfor PROTOMOL ’s integratordefinitionlanguage. . . . . 372 Definition of Leap-Frogintegratorwith 1 fs time stepandbond,

angle,dihedralandimproperforces. . . . . . . . . . . . . . . . . 373 Three-level Verlet-I/r-RESPA MTS. . . . . . . . . . . . . . . . . 374 Dynamicinstantiationof anangularforcewith periodicboundary

conditions.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 Declarationof a Coulombforce with cutoff and ( � -continuous

switchingfunction. . . . . . . . . . . . . . . . . . . . . . . . . . 42

IV

Page 7: Framework Design, Parallelization and Force Computation in

6 Registrationof prototypes:angularforceandLennard-Jonesforceprototypewith cutoff and ( -continuousswitchingfunction. . . . 44

7 Grammarfor PROTOMOL ’s forcedefinitionlanguage.. . . . . . . 458 Examplesof accuracy andtiming comparison.. . . . . . . . . . . 479 Implementationof a PaulTrapattraction. . . . . . . . . . . . . . 5110 Registrationof aPaul Trapattraction. . . . . . . . . . . . . . . . 5111 Parallelizationof avelocitydependentfriction. . . . . . . . . . . 5212 PROTOMOL: Integratorandforcedefinitionsusedfor theperfor-

mancecomparisontest. . . . . . . . . . . . . . . . . . . . . . . . 5313 Detailedimplementationof a PaulTrapattraction.. . . . . . . . . 12014 Detailedimplementationof a velocitydependentfriction. . . . . . 121

V

Page 8: Framework Design, Parallelization and Force Computation in
Page 9: Framework Design, Parallelization and Force Computation in

1 Intr oduction

The modelingof phenomenathat occur in natureis most frequentlyformulatedby a ”fluid-flow” approach.In generalparticlesystemswith zeronet charge forexample,theconservationof energy, massandmomentumtogetherwith a forcerelationtypically leadsto asetof fivecoupleddifferentialequations.In principle,theseequationscanbe solved accuratelyif appropriateboundaryconditionsareknown. Anotherconditionis thatthenumberof realparticlesinsideany computa-tional volumemustalwaysbelargeenoughsuchthatmacroscopiccharacteristics(e.g.,pressure,density, andtemperature)apply. At smallerlengthscales,or forsystemsdescribedby discretebuilding blocks,the particle-like natureof matterhasto be taken into account,andonefacesa numericalproblemof calculatingthebehavior of ) particles– possiblyinteractingwith fluid phasesor heatbaths.Most frequently, the actualparticlesof sucha systemare atomsor molecules.Such systemsare most accuratelydescribedby quantumtheory. Fortunately,for massiveparticlesat sufficient high temperature,quantaleffectsarenegligiblesmallandclassicalmechanicscanbeused1 [84].

Moleculardynamics(MD) [2, 75, 97] describingaclassicalparticlemolecularsystemasa function of time hasbeenusedfor several decades.MD hasbeensuccessfullyapplied to understandand explain macro phenomenafrom microstructures,sinceit is in many respectssimilar to realexperiments.For example,transportandequilibrium propertiesof the systemin questioncanbe predicted.An MD systemis definedby particles(positionandmomenta) andtheir interac-tions (potential), andthedynamicsof a systemis containedin solvingNewton’sequationof motion,i.e., theclassical) � bodyproblem.

The classical) � body problemlacksa generalanalyticalsolution, thusnu-mericalsolutionsareneeded.Solving the dynamicsnumericallyandevaluatingtheinteractionstendsto becomputationallyexpensivealreadyfor a few thousandparticles.Especially, theinteractionsaregenerallythecomputationallydominantpart.For largescaleMD simulations,wedothereforenotonly requireapowerfulmachine,but alsonew algorithmictechniquesandparallelizationschemesto solvethe problemin reasonabletime. With the introductionof novel computationalalgorithms(e.g., multiple time steppingintegration schemes,fast electrostaticforce algorithms,etc.) andlarge scaleparallelcomputers,it becamepossibletostudylarger systemsbeyond ��*,+ particles[9, 80, 99]. Thus,with suchproblem

1For molecules,the de Broglie wavelengthis alreadyvery small for very low temperatures;e.g.,for aCa-.0/ Coulombcrystalat11 K thewavelengthis of order 24365,7 m, afactor50-100smallerthanthetypical separationdistance.

1

Page 10: Framework Design, Parallelization and Force Computation in

sizescertainmacroscopicpropertiesof mattercanbestudied.Thereis a proliferationof programsfor MD [4, 14, 30, 31, 32, 53, 77, 81,

86, 91, 98, 108, 110, 119, 124, 125] andseveral of thesearerobust productioncodes;somewith scalableparallel implementations.They cover commonMDproblemsandareexcellenttools to performsimulations.However, many of thecodesarelegacy programsthatareeitherpoorlyorganizedor extremelycomplex.Oneimportantfactoris usuallythelargenumberof peoplethatcontributedto thewriting of thecodesandthelackof a strongcoordinationto enforcedesign,codeorganization,and programmingguide-lines. Most MD applicationsalso sufferfrom missingdocumentationthat is neededto understandboth the designandimplementation.

Furthermore,somecodeshave a long history and were modified multipletimesto solve differenttypesof problemsat differentpoints in time. Suchpro-gramsmay imposesomeseveredesignconstraintson futuredevelopment.Theymay alsousedifferentprogramminglanguagesandsystemdependentlibraries.Moreover, in thepast,thedesign,thereadabilityandalsotheportability hasbeensacrificedto acertainlevel,dueto aggressivemanualoptimizationtechniquesandtricks to obtainhigh performance.Nowadayscompilersareableto performsuchoptimizationsto acertainextent.

Finally, a parallelimplementationaddsanextra complexity level, sincetheseMD programsusuallyforceparallelismthroughoutthewholeapplication.As analgorithmdevelopmentplatform they areinappropriateto testnovel algorithms.Thelackof asuitabledevelopmentplatformhasnotonly anegativeeffectonstu-dentsandjunior researchers,but alsoon non-corecomputerscienceresearchers,who have to spendconsiderabletime dealingwith thesedifficulties. At theveryend, it doesalso affect the rate of disseminationof new results; for example,many new algorithmsfor MD appearin leadingjournalseachmonth,but theirimplementationsin MD codesappearonly severalyearslater, if at all (e.g.,[8]).

In the past,the main focus for MD programshasbeenon producingwell-performingand robust productioncodes. As discussedabove, codesof suchapplicationsoften have a high thresholdwith respectto testingandcomparingnew algorithmsand lack general,built-in comparisonfacilities. However, thedevelopmentfrom scratchis verytimeconsuming,andmostof thetimeis spentinimplementingsupportingcodethat is only marginally relatedto therealproblemof interest. To overcomethis dilemma,a flexible andeasyextendableplatformfor researchis needed. This will motivate researchersto reusefunctionalitiesandextendtheplatformwith new algorithmsor applications.An exampleis thesoftwarepackageBALL [12] that hasspecialemphasison molecularmodeling

2

Page 11: Framework Design, Parallelization and Force Computation in

andits representationby datastructures.Thereare different approachesthat enablereuseof code,and accordingto

Coulange[25] andMeyer[85], theobject-orientedapproachis themostfavorableone. Thus, to satisfy the requirementof flexibility and codereuse,an object-orientedcomponent-basedframework2 8 3 is proposed.Thepurposeof thecompo-nent-basedframework is to supportboth white- and black-boxdesign. White-box designis basedon inheritanceto adaptto a particularproblem. Designandimplementationhave to bewell known, but provide greatflexibility . In a black-box design,componentsarereusedby composition,which doesnot demandtoomuchknowledgeof the framework, but it comeswith lessflexibility . Run-timeefficiency, however, is a crucial issuefor MD programs. The framework mustaddressthis from the beginningof thedesignandbe awareof languagespecificpitfalls [18, 115]. Furthermore,theframework mustbedesignedto supporttrans-parentparallelization,suchthatdevelopersarenot forcedto dealwith parallelismissues,evenin a parallelenvironment.

The first part of the thesisis concernedwith the optimizationandextensionof the particular MD prototypeSPRINGS developedby Jan PetterHansen4.This work wascarriedout to get somevaluableinsight informationaboutMDsimulationsand requirementsfrom MD applicationusers. SPRINGS is alsousedto investigateparallelizationschemesfor MD applications,especiallytheusageof new featuresof theMPI-2 standard,e.g.,theone-sidedcommunicationmechanism.

The secondand main part of the thesisconsistsof the designand devel-opmentof a multi-purposeframework for MD applications,with emphasisondesignandimplementationof forcealgorithms.Therequirementsto a novel MDframework were motivatedby the knowledgeof MD obtainedin the first partandtheevaluationof advantagesanddrawbacksof existing MD programs.Themain goal was to develop a flexible, extendableMD researchplatform, neitherexcluding run-timeefficiency nor parallelism.Thedesignof the framework andits implementationPROTOMOL5 [63] wererealizedin collaborationwith JesusA.Izaguirre6 andhis students.Furthermore,somedesignaspectswere introducedfrom previouswork with theobject-andcomponent-orientedgraphicsframework

2“A set of classesthat embodiesan abstractdesign for solutions to a family of relatedproblems.” [65]

3“Good frameworkscanbeusedfor thingsthatthedesignerneverdreamedof.” [64]4Departmentof Physics,Universityof Bergen.5PROTOtypeMOL eculardynamics6Departmentof ComputerScienceandEngineering,Universityof NotreDame,USA.

3

Page 12: Framework Design, Parallelization and Force Computation in

BOOGA7 [3, 17,83,112].

1.1 Structure of the thesis

Thisdissertationdescribesthecomponent-basedframework PROTOMOL andde-tails the most essentialforce algorithmsand parallelizationaspects.Section2startswith anintroductionof thephysicalandnumericalbackgroundof MD sys-tems.In Section3, anoverview of forcecomputationis presented,with emphasison fastelectrostaticforcealgorithms.Then,in Section4, differentparallelizationapproacheswith focuson MD simulationsarediscussed;someaspectsaredis-cussedin moredetail in theappendedpublications[C] and[D]. Section5 detailsthedesignandimplementationof thecomponent-basedframework PROTOMOL.Furthermore,the sectiongivessomepracticalexamplesof the extendibility andoptimizationthat is presentin PROTOMOL. Section6 containsconclusionsandpossibledirectionsfor futurework.

Appendix A is the collection of paperspublishedduring this thesis. Thepublications[A] and [B] are MD studiesof large scale2-dimensionalphysi-cal problems.The simulationswereperformedwith theprototypeMD programSPRINGS. Paper [C] is a detailedstudy of MPI’s one-sidedcommunicationmechanism,a featureof the MPI-2 standard. Publication[D] explains an in-crementalparallelizationapproachandits implementationinto PROTOMOL.

Paper A: T. Matthey andJ.P. Hansen.Moleculardynamicssimulationsof slid-ing friction in a densegranularmaterial. Modelling and SimulationinMaterialsScienceandEngineering, 6(6):701–707,November1998.

Paper B: I. Skauvik,J. P. Hansen,andT. Matthey. Dynamicsof clusteringina binary Lennard-Jonesmaterial. Modelling and Simulationin MaterialsScienceandEngineering, 8(5):665–676,September2000.

Paper C: T. Matthey andJ.P. Hansen.Evaluationof MPI’sone-sidedcommuni-cationmechanismfor short-rangemoleculardynamicson theOrigin2000.In T. Sørevik, R. Moe, A. H. GebremedhinandF. Manne,editors,AppliedParallel Computing. New Paradigmsfor HPC in Industryand Academia,volume 1947 of In Lecture Notesin ComputerScience, pages374–365,Bergen,Norway, June2000.Springer-Verlag.

7Berne’sObject–OrientedGraphicsArchitecture

4

Page 13: Framework Design, Parallelization and Force Computation in

Paper D: T. Matthey and J. A. Izaguirre. ProtoMol: A moleculardynamicsframework with incrementalparallelization. In Proceedingsof the TenthSIAMConferenceon Parallel Processingfor ScientificComputing(PP01),Proceedingsin Applied Mathematics,Portsmouth,Virginia, March 2001.Societyfor IndustrialandAppliedMathematics(SIAM).

AppendixB containsdetaileddiagramsof thedesignandthehierarchyof theforces. AppendixC includesthe implementationcodeof an actualparallelizedforce. Appendix D.1 containssnapshotsof four different molecularsystems.AppendixD.2 givesan exampleof PROTOMOL usedin a studentcourse. Ap-pendicesD.3 andD.4 roundup the thesiswith a galleryof resultsfrom ongoingproject collaborationson applicationsof PROTOMOL to very different setsofparticlesystems.

5

Page 14: Framework Design, Parallelization and Force Computation in

2 Physicaland Numerical Model

In classicalMD simulationsthedynamicsaredescribedby Newton’s equationofmotion 9:�<; ;>= ���� � = ��� �? � � = �@� (1)

where 9:� is the massof atom8 A , ���� � = � the atomicpositionat time = and�? � � = �

the instantforceon theatom. Note that thereareotherformulationsto describethedynamicsof anMD system,e.g.,Lagrangiandefinitionof classicalmechan-ics [97, pp.42-44,Ch. 6] or Hamilton’sequation[47]. WhereNewton’sequationof motion describesnature(to a certainextent) conservingthe energy (NVE),theseadvancedformulationsallow to modify thedynamicsto achieveequilibriumstatessatisfyingcertainspecifiedrequirements,e.g.,constanttemperature(NVT),constantpressure(NVP) or rigid molecules. Theseapproachesare more forcomputationalconvenienceandbeyondthescopeof this thesis;they werepartlyimplementedinto PROTOMOL [60] by otherdevelopers.

Theforce�? � is definedby thepotentialenergy�? �B� �DC �FE �G�� � � �� �GHGHGHI� ��KJL�NM �? extended� � (2)

where E is thepotential,�? extended� anextendedforce(e.g.,velocity-basedfriction)

and ) the total numberof atomsin thesystem.Typically, thepotentialis givenby E � E bonded MOE non-bondedMOE external (3)E bonded � E bond MPE angle MPE dihedral MOE improper (4)E non-bonded � E electrostaticMOE Lennard-JonesH (5)

Thebondedforcesareasumof Q � ) � terms.Thenon-bondedforcesareasumof Q � ) @� termsdueto the pair-wise definition. E external coversadditionalinter-actions,e.g.,electromagneticfields ( Q � ) � ) or evengravitation ( Q � ) @� ). E bond,E angle, E dihedraland E improperdefinethecovalentbondinteractionsto modelflexiblemolecules.In this thesis,thebondedforcetermsarebasedonharmonicpotentialsand mimic physicaleffects basedon quantummechanics. Note that thereareotherformulationsto modeltheseterms,whicharenotharmonicnecessarily, e.g.,

8Atom is hereusedasanindivisibleentityandinterchangeablyusedwith any massiveparticle.

6

Page 15: Framework Design, Parallelization and Force Computation in

Morse-potentials. E electrostaticrepresentsthe well-known CoulombpotentialandE Lennard-JonesmodelsavanderWaalsattractionandahard-corerepulsion.Theco-efficientsfor E Lennard-Jonescanbedeterminedexperimentallyand/ortheoretically,e.g.,by ab initio quantummechanicalcalculations.

Thepotentialfor bondinteractionsis describedasa linearbondbetweentwoatomsby asimpleharmonicspringE bondR � ��<SUT �WVXV����Y� VXV �[Z R � � (6)

where SUT is a bondforceconstant,����Y��� ���� � ���� is thedistancevectorbetween

atomsA and \ , and Z R is theequilibriumbondlengthbetweentheatomsA and \ ;9 denotesthebondindex. Thepotentialfor angleinteractionsis describedasanangularbondbetweenthreeatomsby aharmonicangularspringE angleR � ��<S^] �_ �Y� � � _ R � � (7)

where S^] is anangleforceconstant,_ �Y� � thecurrentangle,and

_ R thereferenceanglefor constraint9 . Somesystemsalso include the so-calledUrey-Bradleyterm,aharmonicspringbetweenatomsA and ` ; S UB

�WVaV���� � VXV � � UB� , where S UB

and � UB are Urey-Bradley parameters.Dihedral and improper(also known astorsion)bondedforces[75] definethe interaction(torsion)betweenfour bondedatoms.They aremodeledby anangularspringbetweenplanesformedby thefirstthreeatomsandthesecondthreeatoms.Thepotentialfor a dihedralor improperforcebetweenatomsA , \ , ` , and Z is givenbyE dihedralR �IE improperR �cb Sed � �fM[g�hji �lkNm Mon,�4� if

kqp *Sed �Fm � nr� ifk � * � (8)

where S�d is a torsionforce constant,m

is the anglebetweenthe planesformedby theatomsA , \ , ` and \ , ` , Z , respectively.

kis theperiodicityand n thephase

shift of thebond.Thepotentialfunctionfor anelectrostaticpair-wiseinteractionis givenby E electrostatic�Y� � �srt ! $ou � u �VvV0����Y� VXV � (9)

where ! $ is thepermittivity of free spaceconstant,and u � and u � arethe chargesof atoms A and \ , respectively. Figure1 shows the slowly decayingelectrostatic

7

Page 16: Framework Design, Parallelization and Force Computation in

interactionwith normalizedconstants.Thepotentialfor theLennard-Jones9 (LJ)pair-wiseinteractionis E Lennard-Jones�Y� � w �Y�VvV����Y� VXV � � x �Y�VaV0����Y� VaV � (10)

where w �Y�zy * and x �Y�zy * are the LJ parametersfor atoms A and \ . Theydefinethe energy minimum and the cross-over point, wherethe LJ function iszero. Figure2 shows the attractive andrepulsive part of the LJ function for w ,x �{� .

0 10 20 30 40 500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Figure 1: Coulomb: |�}�~ ��� . 0 1 2 3 4 5

−0.2

−0.1

0

0.1

0.2

0.3

0.4

0.5

Figure 2: LJ: |�}�~ ����� ~ �� .2.1 Numerical integration

Newton’sequationof motionis anordinarydifferentialequationof secondorder.Its integration is often solved by the numericalLeap-Frog10 method,which issymplectic11 [127] and time reversible. Despiteits low order, it hasexcellentenergy conservation propertiesand is computationallycheap. With � = as timestep, � d and � d the coordinateand the velocity at time = , respectively, a singleintegrationstepis givenby � d "f�� � � d � �� M � =9 ? d (11)� d " � � � d M � = � d "f�� H (12)

9Also calledLondon-vanderWaalsinteraction.10TheLeap-FrogandthevelocityVerletmethodarealgebraicallyequivalent.11TheJacobianof theintegratoris symplectic.

8

Page 17: Framework Design, Parallelization and Force Computation in

Thelocal errorof coordinatesandvelocitiesis of order Q � � = # � and Q � � = � , re-spectively. Thename’Leap-Frog’hasits origin from thefactthatthecoordinatesandthevelocitiesareinterleavedin time. If necessary, � d " � canbeobtainedby� d " � � � d "f�� M � =�r9 ? d " � H (13)

By reformulatingtheLeap-Frogandaddingtheforceevaluation,wecandescribeonecompletetimestep(

� � d � � d ��� � � d " � � � d " � � ) of anMD simulationas� d " �� � � d M � =�r9 ? d (half akick) (14)� d " � �O� d M � = � d "f�� (adrift) (15)? d " � � ��C E � � d " � �NM ? d " �extended (evaluateforces) (16)� d " � � � d "f�� M � =�r9 ? d " � (half akick) H (17)

A completeMD simulationis describedby Algorithm 1. It consistsbasicallyofthe loop solving theequationof motionnumericallyandtheevaluationof forceson eachparticle,andsomeadditionalpre-andpost-processing.

MD Simulation:

1. Constructinitial configurationof ~ $ , � $ and � $ ;2. loop 1 to numberof steps

(a) Updatevelocities– Eq.(14);

(b) Evaluate forcesoneachparticle– Eq. (15); (seeAlgorithm 3)

(c) Updatepositions– Eq. (16);

(d) Updatevelocities– Eq.(17);

3. Post-processing

Algorithm 1: Pseudo-code of an MD simulation.

WhenintegratingNewton’s equationof motion numerically, the fastestmo-tions restrict the time step � = . For the Leap-Frogmethod,this is typically 1femtosecond(1 fs = ��* ���� s),assumingatypically MD systemof biomolecules12.

12Moleculesof biologicalorigin or relevance.

9

Page 18: Framework Design, Parallelization and Force Computation in

MD systemsconsistin generalof differentforcetypes,whichalsohavediffer-enttimescalesof dynamics.Velocitiesof moleculesor singleatomsaretypicallya factorof 10-50slower thantheintra-molecularmotions.Multiple timestepping(MTS) integratorsaddressthe different time scalesby splitting the forces– ifpossible– by frequenciesandintegratingthemwith appropriatetime steps.Thesplittingof forcesis definedby switchingfunctions,discussedin thenext section.MTS hasbeenusedto lengthenthe time stepfor mostof the interactionsin theequationof motion.Furthermore,onealsohopesthatthelow frequency forcepartis the computationallymostexpensive part, sinceit is evaluatedlessfrequently.MTS is thereforearemedyto achievebetterperformance,but it comeswith someadditionalcomputationalwork.

A typicalMTS integratoris theVerlet-I/r-RESPA [49, 118] method,describedin Algorithm 2. The methodconsistsof splitting the force contributions on ataggedparticle into fastandslow components.If thesecontributionsarecom-pletely decoupled,it is possibleto integratethe equationsof motion with dif-ferent time stepswhich are appropriatefor the slow and fast dynamics. Slowcomponentsaretaken into accountonly every ) time steps,wherethey actasakind of impulse,actingon theparticle. A detailedevaluationanddescriptionofintegratorsandMTS methodsarebeyond the scopeof this thesis;the readerisreferredto [7, 39,60, 57,79,129].

In particular, researchersat theUniversityof NotreDamehave implemented

� d � � d M�� d J R ? dslow (half akick)

for ` � *��GHGH�H�� ) � � do� d "���� ��� � � d " �� M � d R ? d " ��fast (half akick)� d " ��� �� ��� d " �� M � = � d "���� ��� (adrift)? d " ��� ��fast� �DC E

fast� � d " ��� �� � (evaluatefastforces)� d " ��� �� � � d " �F� ��� M�� d R ? d " �F� ��fast (half akick)

enddo? d " �slow� ��C E

slow� � d " ��� (evaluateslow forces)� d " ��� � d " ��M � d J R ? d " � slow (half akick).

Algorithm 2: The Verlet-I/r-RESPA method.

10

Page 19: Framework Design, Parallelization and Force Computation in

MTS integratoralgorithms,thatcoupledwith thefastelectrostaticsolversandanincrementalparallelizationapproach(bothimplementedfor this thesis),representasubstantialspeedup.

2.2 Switching functions

Switchingfunctions( ��� � � � ) areprimarily for computationalconvenience.Thesefunctionsmodify (or filter) an original potentialand its force contributions toachieve certainproperties(Figure3). They are typically usedfor the pair-wisenon-bondedinteractionsandaredistancebased.Themodificationof thepotentialandforcecontributionsis givenbyE���Y� � ��� �WVaV����Y� VaV ��E��Y� (18)? ��Y� � ? �Y� ��� ��VXV����Y� VXV � � E ���� C ��� ��VXV������ VaV � (19)

Themostimportantswitchingfunctionsare:� Truncation(or shift): Thepotentialis truncatedat a certaincutoff distanceandis zerobeyond.A shift removesthe ( $ -discontinuityatcutoff distance,but theforceis still discontinuous(seeShifted in Figure3).� Modification: This bringsthe potentialandforcessmoothlyto zeroat thecutoff distanceto avoid any discontinuities(seeModified in Figure3).� Range:It filters thepotentialandforcesfor agivenpairdistancerange.

0 10 20 30 40 50

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Switchon

Cutoff

r

Original potential, r−1

ModifiedShifted

Figure 3: Two switching functions modifying the original potential � ��� .11

Page 20: Framework Design, Parallelization and Force Computation in

Switchingfunctionsareoften usedin conjunctionwith MTS to split forcesintoslow andfastvaryingparts.

2.3 Boundary conditions

For a completedefinitionof theMD model,oneneedsto specifyboundarycon-ditions to definehow atomsor moleculesinteractwith their surrounding. Theselectionof boundaryconditionsdependsonthephysicalsystemandthequestionsof matter. Themostrelevantboundaryconditionsare:� Vacuum: no boundaryconditionsat all. It is generallyusedto testalgo-

rithmsor for studiesof singlemoleculesystems.� Periodicboundaryconditions:particlesleaving theunitMD cell arewrappedaround,re-enteringthe unit MD cell on the oppositeside. It is usedtosimulateaninfinite systemwith asmall-sizedsystem,e.g.,bulk liquids.� Sphericalor cylindrical boundaryconditions(or constraints):convenientfor systemswhich fit the geometricalstructureandto preserve the spatialstructure,e.g.,waterdroplets.Typically, they areimplementedby externalforces.

A full discussionof boundaryconditionscanbefoundin [2, pp.24-32].

12

Page 21: Framework Design, Parallelization and Force Computation in

3 ForceComputation

The force evaluation in Algorithm 1 is definedby the sum of interactionsinEq. (3), wherethe particlesinteractwith eachother. By expandingpoint 2b inAlgorithm 1 recursively, theforceevaluationfor a typical MD simulationcanbedescribedasin Algorithm 3. Note that for someparticularMD simulationsomeforcesmaynotapply, e.g.,SPRINGS considersonly systemswith Lennard-Jonesinteractionsandanoptionalvelocity-basedfriction (seepublications[A] and[B]).

Evaluation of forces:� Bondforces– Eq.(6);� Angle forces– Eq.(7);� Dihedralforces– Eq.(8);� Improperforces– Eq.(8);� Electrostaticforces– Eq. (9);� Lennard-Jonesforces– Eq.(10);� Optionalor otherforces;

Algorithm 3: Pseudo-code of the force evaluation.

A forcetypeactingon oneparticularparticlemayin somecasesonly dependon theparticleitself (e.g.,friction), but otherforcetypesaredefinedaspair-wiseinteractionsor by

k-tuples,so-called

k � bodyinteractions.GeneralMD applica-tionsdistinguishbetweenbondedandnon-bondedforces.Thetopologyof bondedforcesareof statictypeanddoin generalnotchangeduringthewholesimulation,sothey canbedescribedwith simplestaticgraphrepresentations.They typicallymodeltheintra-molecularforces,whichareshort-rangeinteractions.Thebondedforcesarea sumof Q � ) � terms. The non-bondedforcesdescribeinteractionsbetweenparticlesof dynamicbehavior andaregenerallydefinedby a pair-wisepotentialin Eqs.(9) and(10). Theseinteractionscannotbedeterminedstatically.Generally, they are of a long-rangenatureand dominatethe force calculationby Q � ) W� complexity. Thus, they are the mostcomputationalcritical part andoptimizationshappenhere.

13

Page 22: Framework Design, Parallelization and Force Computation in

One commonoptimization methodfor non-bondedforce evaluationsis totruncateat a given cutoff distance( �G� ) to limit the numberof pair-wise interac-tions, only consideringthe closestneighbors.This canbe achieved usinga celllist algorithm[9][97, pp. 47-52] (seealsopaper[C]), or a list of neighbors[97,pp.52-55]. However, simpletruncationleadsto discontinuitiesin theforcesandit violatesenergy conservation for symplecticintegrators. One way to handlecutoffs areswitchingfunctions,which smoothlybring theenergy andthe forcesto zeroatthecutoff distanceto avoid any discontinuities.Truncationtogetherwithswitchingfunctionsperformswith Q � ) � timecomplexity andworkswell for fastdecayingpotentials,where the long-rangepart – beyond the cutoff distance–is negligible, e.g.,van der Waalsinteractions(Figure2). Furthermore,onecanaccountfor the neglectedenergy contributions in van der Waalspotentials(orothershort-rangepotentials)by applyinga long-rangecorrection,which assumesthattheradialdistribution function   � � �¡���j¢ � p �G� (cf. [2]).

For long-rangeinteractions,wherethecontributionsdecayslowly (Figure1)andthelong-rangepartmayhaveasignificantimpactonthesimulationresults,weneedadvancedalgorithmsto avoid thedirectmethodwith Q � ) @� time complex-ity. In MD, theelectrostaticinteractionsarethemostdifficult problem,whereasvan der Waalsinteractionsarenormally handledby cutoff techniques.Consid-ering also the fact that the numericalintegration itself is of order Q � ) � , onecanconcludethat themostcritical part in anMD simulationaretheelectrostaticinteractionsin Algorithm 3.

3.1 Fastelectrostatic forcealgorithms

For systemswith partialchargesor highermultipoles,theelectrostaticinteractionsplay a dominantrole, e.g., in protein folding, ligand binding, and ion crystals.Themostrelevantalgorithmsto solveelectrostaticinteractionin lessthan Q � ) @�time complexity arediscussedin Sections3.2-3.5.Threeof thesemethodswereimplemented(seeSection5.7).

3.2 Standard Ewald summation

A generalpotentialenergy function E of a systemof ) particleswith an inter-actionpotential

m¡�G������DM �k � andperiodicboundaryconditionscanbe expressed

14

Page 23: Framework Design, Parallelization and Force Computation in

as E£� ��¥¤ ¦§¨ J¤ �ª© �J¤ ��© � m«������Y��M �k �W� (20)

where ¬ §¨ is the sumover all lattice vectors�k � �­�®Gk<® � ­°¯Ik<¯ � ­�±�kB± � , k<® 8 ¯ 8 ±³²´

and­°® 8 ¯ 8 ± are the dimensionsof the unit MD cell and

����Y�µ� ���� � ���� . The“daggered”( ¦ ) summationindicatesthe exclusion of all pairs A � \ inside theoriginal unit MD cell (

�k � �* ). Most molecularforce fields provide exclusionschemesto exclude additionalpairs, e.g., the so-called1-3 exclusionexcludesdirectly connectedpairsandpairswith a directcommonneighbor. Furthermore,someexclusion schemesmay also modify the potentialby a factor for certainpairs.

If thepotentialm

satisfies VYm«����K� V·¶ w V�� V ��¸���¹ (21)

for large enough�� and w p * and ! p * , thenthe sumin Eq. (20) is absolute

convergent13.Inequality(21) is notsatisfiedby theCoulombpotential

m«������Y�I���£º u � u � V����Y� V ��� ,suchthattheinfinite latticesumin Eq.(20) is only conditionallyconvergent14. Incaseof theLennard-Jonespotential,thesumis absolutelyconvergent.

Thewell-knownEwaldsummationmethod[34] is in generalusefulin systemswith large,spatialpotentialdifferences,wherethe summationover oneunit celldoesnot converge sufficiently, i.e., the lattice sumis not absolutelyconvergent.Thelatticesumwith theCoulombpotentialis usuallyexpressedasE electrostatic � �s,t ! $ ���¤ ¦§¨ J¤ �»© �

J¤ �4© � u � u �V����Y�°M �kLV H (22)

To overcometheconditionallyandinsufficient convergenceof Eq. (22), thesumis split into two partsby thefollowing trivial identity�� �½¼ � � �� M � � ¼ � � �� H (23)

The basicidea is to separatethe fastvariationpart for small � and the smoothpart for large � . In particular, the first part shoulddecayfastandbe negligible

13Thesumconvergesanddoesnot dependon theorderof summation.14Theconvergenceof thesumdependson theorderof summation.

15

Page 24: Framework Design, Parallelization and Force Computation in

beyond somecutoff distance,whereasthe secondpart shouldbe smoothfor all� , suchthat its Fourier transformcanberepresentedby a few terms.Ewald [34]suggestedto choosea Gaussianconvergencefactor   � � � �k �¾�À¿r�·Á4 §¨  � suchthatthe sumbecomes(see[76] for a detailedmathematicalderivation of the Ewaldsummation)

E electrostatic � �s,t ! $ �� ¤ ¦§¨ J¤ �»© �J¤ �4© � u � u � erfc

�Ã�V������°M �kfV �V����Y�°M �kfVÄ Å�Æ ÇReal-spacetermM �! $@È �� ¤ §�6É© $��` ¿ �ËÊ� �ÌÎÍ �ÐÏÑ,ÒÒÒÒÒ

J¤ �»© � u �6g�h,i � �`ÔÓ ����� ÒÒÒÒÒ M ÒÒÒÒÒ

J¤ �»© � u �6i�ÕaÖ � �`ÔÓ ����� ÒÒÒÒÒ4×Ø

Ä Å�Æ ÇReciprocal-spaceterm� �s,t ! $ ��ÚÙ¤ ¦FÛ ���© �

J�ܤ � © �JÝܤ Þ © � u � � u �Fß erf

�0ÃàV���� � � ß V �V���� � � ß VÄ Å�Æ ÇIntra-molecularself energy

(24)

� Ãs,tâá� ! $ J¤ �»© � u �Ä Å�Æ ÇPointself energy

� �ã ! $@È Ã ÒÒÒÒÒJ¤ �»© � u � ÒÒÒÒÒ

Ä Å�Æ Ç

Chargedsystemterm

H

Here,Ã

is thesplittingparameterof therealandreciprocalpart.For anoptimalÃ

theEwaldsummationscalesas Q � ) á �6� [34,36,89] in Eq.(26). ¦ Û � is the“inversedaggered”summation.Theintra-molecularself termcorrectsinteractionson thesamemolecule,whichareimplicitly includedin thereciprocal-spaceterm,but arenot requiredin theexclusionmodel.Self interactionsarecanceledout by theselfpoint term. Thechargedsystemterm is only necessaryif the total net charge ofthesystemis non-zero.Note, thatsomeMD systemshave anadditionalsurfacedipoletermto modeldipolarsystemsmoreaccurately[76, 122]. This termis notsuitedfor mobile ions,sinceit will creatediscontinuitiesin theenergy andforcecontributionswhenionscrossboundaries.

16

Page 25: Framework Design, Parallelization and Force Computation in

Theelectrostaticforceon particle A is givenby? electrostatic� � �DC §®Wä E electrostatic� u �s,t ! $ ¤ ¦§¨ J¤ ��© � u �æå erfc�Ã�V����Y��M �k�V �V�������M �k�V M � Ãç t ¿ ��è �  §®Wä Ü " §¨  �%é ����Y��M �kV����Y�°M �kLV Ä Å�Æ Ç

Real-spacetermM �! $�È ¤ §�6É© $ u ��`�` ¿ �ËÊ� �ÌÎÍ �ëê i�ÕXÖ � `ÔÓ�� �l� J¤ �4© � u �<gIhji � �`ÔÓ ����I� � g�h,i � `ÔÓ�� �ì�

J¤ ��© � u �<i4ÕXÖ � �`ÔÓ �������íÄ Å�Æ ÇReciprocal-spacetermM u �s,t ! $ ���¤ ¦�Û �� u �Ôå � Ãç t ¿ ��è �  §®Wä Ü Â � � erf

�0ÃàV0����Y� V �V������ V é ����Y�V������ V Ä Å�Æ ÇIntra-molecularterm

(25)

Furthermore,theEwaldmethodcanalsobeusedfor vanderWaalsinteractions[21,67] andseveral other potentials[43, pp. 237-256],[76]. The computationalef-ficiency and accuracy for 2-dimensionalperiodic boundaryconditionsin a 3-dimensionalsystemis discussedin [69]. The casewith 1-dimensionalperiodicboundaryconditionsis addressedin [74]. To someextent,vacuumsystemscanbetreatedby imposingperiodicboundaryconditions.Thedimensionsof theoriginalunit cell arechosenbig enough,suchthat thecontributionsbetweencell imagesarenegligible. In practice,thedimensionsof theunit cell areafactorof ��*j* largerthantheminimalboundingboxof particles.

Choiceof parameters

Theaccuracy andperformanceof theEwald summationis governedby thesplit-ting factor

Ã, thereal-andreciprocal-spacetermcutoffs �G� and `j� , andtheaccu-

racy ! . Thesplitting parameterÃ

defineshow fastthesumsconvergeanddefinesthe cutoffs �G� and `î� for a given accuracy ! . For mostapplications,�G� is smallenoughsuchthat

�kï²ñð �*�ò , which alsosimplifiesthesumfor thereal-spaceterm.Both the real- andreciprocal-spaceterm converge rapidly, suchthat only a fewtermsneedto beconsidered.For therealpartonly termssatisfying

V������«M �k�V�ó �G�are included,whereasfor the reciprocalpart only summandswith

V �` V�ó `j� areevaluated.

17

Page 26: Framework Design, Parallelization and Force Computation in

FromEq.(24) it is obviousthatfor agivenaccuracy andafixedÃ

therequiredwork scalesas Q � ) � for thereciprocaltermandas Q � ) � for the real term. Toachieveanoverallwork complexity of Q � ) á � � , Ã mustvarywith )Ã � º ç tïô�)È îõ �ö (26)

�G� � ç �ë÷ Ö !à (27)`î� � � à ç �ë÷ Ö ! H (28)

Here, È is thevolumeandtheconstantº determinestheratioof executiontimeoftherealandreciprocalterm,which mayvary from oneplatformto another. ThestandardEwald summationis unsurpassedfor very high accuracy. It is relativelyeasyto implementandthe desiredaccuracy canbe increasedandcontrolledupto machineprecisionwithout any additionalprogrammingeffort (seeFigure11).Due to theseexcellentproperties,the Ewald methodis often usedasreferencefor the evaluationof othermethodswith periodicboundaryconditions. A moredetaileddiscussionfor theoptimalchoiceof

Ãandmoreaccurateerrorestimates

canbefoundin [27, 36,71, 90].

3.3 Mesh-basedEwald methods

The mesh-basedEwald methodsapproximatethe reciprocal-spaceterm of thestandardEwald summationby a discreteconvolution on an interpolatinggrid,using the discreteFast-Fourier transforms(FFT). By choosingan appropriatesplitting parameter

Ã, the computationalcost can be reducedfrom Q � ) á � � toQ � ) ÷ h,ø ) � . The accuracy and speedare additionally governedby the mesh

sizeandtheinterpolationscheme,which makesthechoiceof optimalparametersmoredifficult. At present,thereexist severalimplementationsbasedon this idea,but they differ in detail. In [27] threeessentialmethodsarecomparedandsum-marized: particle-particle-particle-mesh[54] (P M), particle-meshEwald [26](PME)andsmoothparticle-meshEwald [33] (SPME).

Unfortunately, the mesh-basedEwald methodsare affectedby errorswhenperforminginterpolation,FFT, and differentiation[90]. However, it would bemisleadingto infer that thesemethodssacrificeaccuracy in favor of run-timeperformance.

For the fast mesh-basedEwald methods,thereexists a critical number )úùsuch that they are fasterthan the standardEwald methodfor ) p )³ù , due

18

Page 27: Framework Design, Parallelization and Force Computation in

to the fact of betterscaling. In [70], the computationalefficiency andaccuracywith 2-dimensionalperiodicboundaryconditionsfor 3-dimensionalsystemsarediscussed.

3.4 Multi-grid

Multi-grid (MG) wasintroducedin the1960sbyFedorenko [35] andBakhvalov [5]to solve partial differentialequations.It got full attentionin the1980s,but onlyrecentlyit hasbeenappliedandimplementedfor ) -bodyproblems[13,107]. MGscalesas Q � ) � .

MG imposesahierarchicalseparationof spatialscales.Thepair-wiseinterac-tions aresplit into a local anda smoothpart. The local part arethe short-rangeinteractions,which arecomputeddirectly. Thesmoothpartrepresentstheslowlyvaryingenergy contributions,approximatedwith fewer terms– coarsening. MGusesgriddedinterpolationfor boththecharges(source)andthepotentials(desti-nation)to representits smooth– coarse– part. Thesplitting andcoarseningareappliedrecursively anddefineagrid hierarchy(Figure4).

−1l

(1)

(1) (3)

(4)

(2)

(1) (3)

(4)

(4)

(3)

Potential values Point Charges

l

1

0

Figure 4: The multilevel scheme of the multi-grid algorithm. (1) Aggregate tocoarser grids; (2) Compute potential induced by the coarsest grid; (3) Interpolatepotential values from coarser grids; (4) Local corrections.

19

Page 28: Framework Design, Parallelization and Force Computation in

For the electrostaticpotential,the kernel is definedby � � � �û� � ��� and � �VXV0�� � �� VaV . � � � � is obviously not boundedfor small � . The interpolationimposessmoothnessto boundits interpolationerror. By separation,the smoothedkernel(smoothpart)for grid level ` ²�ð �j�W�·��HGHGH�� Z ò is definedas� �smooth

������ü� ����I���cb � Á � �WVXV���� � ���� VXV �þý VaV0���� � ���� VXVÝó � �� �WVXV���� � ���� VXV � ý otherwiseH (29)

Here, � � is thesofteningdistanceat level ` and � Á � � � � is thesmoothingfunctionwith � Á � � � � �¥� � � � � � . We define � � �cÿ � ��� � , typically ÿë��� , togetherwith acorrespondingconstantsequenceof levels with coarseningratios ��ýÝÿ . � Á � aredefinedby apolynomialparameterizedby � � (Figure5) with ( ¨ -continuity. Notethat � � andthecoarseningratioscanbedefinedmoregeneral,but it mayinfluenceperformanceandaccuracy, especiallyif � �����>� arenotof thesameorder[13,pp.6-31].

0 0.5 1 1.5 2 2.5 30

0.5

1

1.5

2

2.5

3Original KernelSmoothed Kernel, s=1Smoothed Kernel, s=2

Figure 5: A plot of the original kernel � ��� and two smoothed kernels���smooth ��~� � �~���� with softening distances � } � and � } � . � Á ��� ����Á � ������ � "!# Á%$&('*)� !+# , $ - ,

� , ��� � is . & -continuous.

The correctionsof the potentialsarerequiredwhenthe modified,smoothedkernelis usedinsteadof theexactone.Thecorrectionof thekernel(local part)atlevel / is givenby021

corr 3546 87 46 �%9;: < 0 3�= 9?> 0 ,�@ 3A= 9CB =EDGF 1 7 = :IH�H 46 �J> 46 KH�HL B otherwise(30)0 �

corr 3546 87 46 �%9;: < 0 �smooth3�= 9?> 0 ,NM 3�= 9OB =PDGF � 7 = :IH�H 46 �J> 46 QHRHL B otherwise S (31)

20

Page 29: Framework Design, Parallelization and Force Computation in

Note that0 �

corr 3546 N7 46 �%9T7 /VUXW , canbe pre-computedandrepresentedby a singletablewith the correspondingpre-factor Y[Z]\ � Z �_^ imposinga constantcoarseningratio W B Y , Ya`cb , YaUdW and F � : Y Z]\ � Z �_^ .

We definean interpolatione �+f � Bhg �+f �ji g � of order k with local support,where g � representsthe potential valuesat grid level / , and g 1 the particlepotentials.Theanterpolation(or adjoint interpolation)is definedby l � Bhm � im �Qf � , wherem � arethechargesatgrid level / and m 1 aretheparticlecharges. n �arethelatticeor particlepositions( n 1 :po 46 _q ). Wecanthendescribea recursiveMG schemewith r levelsby

m �+f � : l � 3 m � 9 (32)gtsu: 0 �smooth3 n � 7 n � 9?vwm 3 n � 9 (33)g � : e �Qf � 3 g �+f � 9yx 0 �corr 3 n � 7 n � 9?vwm 3 n � 9 (34)z : W{ g � vwm � (35)4| }: ~� ����5� \_��� ^�� ����Tg � 3 n �� 9 S (36)

� 3�46 A9 denotesthe lattice points where 46 has local supportfrom e � . By re-formulatingEqs.(32-36), the MG schemecanbe describedas in Algorithm 4,defininga V-cycle. TheV-cyclereflectstheorderin which thegridsareused;thealgorithmtelescopesdown to the coarsestgrid, andthenworks its way backtothefinestgrid, describingaV. Thepseudo-codefor main handlestheaggregationfrom the interpolationto the particlesandcomputesthe force contributionsandthetotal energy. multiscale recursively performstheaggregationof charges,theinterpolationof potentialvaluesandthe local correctionasdepictedin Figure4.Dueto theuniform grids,theinterpolationcoefficientscanbepre-calculatedandrepresentedby one1-dimensionalsetof coefficients.Thesameholdsfor thelocalcorrection,sincethesofteningdistanceis proportionalto its correspondingmeshsize.

Assuminguniform grids,a constantcoarseningratio of W B { , a k -orderinter-polation, � themeshsizeof thefinestgrid andanaverageparticledensity� ) :���

21

Page 30: Framework Design, Parallelization and Force Computation in

main:

1. anterpolatepositionchargesto thefinestchargegrid(1)– step(1);

2. call multiscale(maxLevel, level 1);

3. interpolatefinestpotentialgrid(1) to thepositionpotentials– step(3);

4. correctpositionpotentials– step(4), Eq.(30);

5. computeforcesandtotal energy – Eqs.(35-36);

multiscale(maxLevel, level k):

1. if maxLevel = k then

(a) computepotentialvaluesoncoarsestgrid(maxLevel) – step(2), Eq. (29);

2. otherwise

(a) anterpolatechargegrid(k) to coarserchargegrid(k+1)– step(1);

(b) call multiscale(maxLevel, k+1);

(c) interpolatecoarserpotentialgrid(k+1) to potentialgrid(k) – step(3);

(d) correctpotentialgrid(k) – step(4), Eq. (31);

Algorithm 4: Pseudo-code of a recursive multi-grid scheme with V-cycle.

boundedby someconstant,thework atparticlelevel andfor level / is givenbyY 1� : � � W{�� v]���� F ) � ) (37)Y � � : � � W{ g3 { � Z � � 9 ) v � � � 3 { � Z � F 9 ) 3 { � Z � � 9 Z ) (38)Y 1� 7 Y 1� : � � � k ) (39)Y �� 7 Y �� : � � g3 { � Z � � 9 ) k S (40)

Here, Y 1� denotesthe work for the local correctionat the particle level, Y � � isthework for local correctionsat thegrid level or thepotentialevaluationfor thecoarsestgrid. Y 1� , Y 1� , Y �� , and Y �� representtheinterpolationandaggregationwork.� � , � � and � � areconstantsto weigh thedifferentwork. For r i �

, the total

22

Page 31: Framework Design, Parallelization and Force Computation in

work isY 1� x s~ �+� � Y � � x s~ �+� 1 3 Y �� x Y �� 9;: {� � � 3 � F 9 )�� � � x�� ��  �¡ 3 �¢� 9 Z�£Q¤ (41)

x { � � � k � k & xp¥¡ W3 �¢� 9 ) k ¤ SThus,the total work is boundedby ¦ 3 � 9 . Balancingthework of Y 1� and §¨Y ��leadsto thechoice � : {��©ª � �¡ � � 7 (42)

which is independentof thesofteningdistanceF . In general,we candeduce� :¦ 3 �« 9 , whereasthework of §¨Y �� x Y �� is dominatedby thetermsY 1� and Y 1� andisof order ¦ 3 � k ) 9 , essentiallytheinterpolationbetweenparticlepositionsandthefinestgrid. Finally, thetotalwork is givenby¦ � � ��¬ F�®­ ) x k ) ¤�¤ S (43)

Theaccuracy is governedby thesizeof the interpolationerrorof thesmoothpart.Theinterpolationerrordependsonthesmoothnessof thekernel

0, themesh

size � , andthe interpolationorder k . Assuminga �°¯ -continuouskernelandatleast ± orderaccurateinterpolation,the relative forceerrorof thesmoothpart is²´³5µ·¶, ¶�¸ . Thus,thetotal relative forceerroris¦ � �º¹� & F ¹ f & ¤ S (44)

A detailederrorestimationof MG is givenin [107].In [13, pp.6-31], theperformanceis comparedwith thedirectmethodfor a2-

dimensionalsystemof chargesanddipoles.In [107], MG is comparedagainstthefastmulti-polemethodthat is implementedin theparallelprogramDPMTA [95]for watersystems.As expected,experimentsfrom [107] show that for F i �

,MG convergesto the direct methodand the error dropsto zero monotonously.Thework increasesmonotonouslywith F , until F is largeenoughto encompassallpairsandremainsconstant.Furthermore,it wasindicatedin [107] that MG hasstabledynamicsfor muchloweraccuracy thanmulti-polemethods.

23

Page 32: Framework Design, Parallelization and Force Computation in

3.5 Multi-pole methods

Themulti-polemethodcanbedividedinto two key components.First, theshort-rangeinteractionsare computeddirectly. Second,individual particlesinteractwith groupsof distantparticlesrepresentedby multi-poleexpansions.Themulti-poleexpansionexpressesthe interactionbetweena collectionof particlesandanindividual,well-separatedparticle. Well-separatedmeansthat thedistanceto thesphereenclosingthe distantparticlesis larger than the radius. The subdivisioninto groupsusesa hierarchicalspatialdecomposition(typically an oct-tree). Amulti-poleexpansionis calculatedfor eachsub-cellat thefinestlevel representingtheeffect to distantparticles.Theseexpansionsareaggregatedhierarchicallytorepresentlarger andlarger groupsof particles(upward pass).Thentheparticlesinteractwith sufficiently far apart(well-separated)sub-cells,coveringthewholedomain. Thus, ever-larger sub-cellscan interact throughindividual multi-poleexpansionsat increasingdistance.A first practicalalgorithm(Barnes-Hut)withwork complexity ¦ 3 �¼»�½�¾J� 9 wasappliedto astrophysicalsimulations[6].

GreengardandRokhlin [48] introduceda local expansionreducingtheworkcomplexity from ¦ 3 �¼»�½�¾¿� 9 to ¦ 3 � 9 . This is known as the fast multi-polemethod(FMM). The key ideaof the local expansionis the interactionbetweentwo well-separatedgroupsof particles.Thecontribution dueto anothergroupofparticlesthat is sufficiently far apartis computedby a Taylor (local) expansionutilizing themulti-poleexpansionof thedistantsub-cell(local expansion).Thenthecontributionsarepassedfrom parentcellsto thechildrensub-cells(downwardpass).

FMM hasbeenimplementedfor differentMD applications[11, 48,73] andisespeciallysuitedfor parallelization[93, 95,96, 123]. However, optimizedFMMcodestendto beratherelaborate.More importantly, they do not conserve energyduring MD simulationsunlessenforcingunusualhigh accuracy, e.g.,12th-ordermulti-poles[10, 128]. Typically, the electrostaticproblemcomeswith usuallysmall distributionsof mono-polemomentsthat causeself cancellation;positiveandnegativechargesareroughlycanceled.

3.6 Other fast electrostaticmethods

Thereexist many fastelectrostaticmethodsandimplementations.Typically, theyarerecombinationsof thesewell-known methodsor they recastpartsin yet otherformsthatareevaluatedusingother, generalnumericalmethods.

For example,in [29] the real-spacetermof thestandardEwald method(first

24

Page 33: Framework Design, Parallelization and Force Computation in

termof Eq. (24)) is evaluatedby multi-poleapproximationusinga treecodeandrecursion. The multi-pole approximationmethodbasedon Taylor expansionoforderk scalesas ¦ 3 � 9 . Theauthorsclaimarelativeaverageforceerrorin Eq.(47)of W L Z � with a Taylor expansionof order10. The trade-off is about � : WwÀ L�L�Lcomparedto thestandardEwald method.

The Fast-FourierPoisson(FFP)method[126] ( ¦ 3 �¼»R½�¾Á� 9 ) implementsthereciprocalpart by the solutionof Poisson’s equationin periodicboundarycon-ditions,with Gaussiancharge densitiesassources.Poisson’s equationis solveddirectlyusingFFT. Theadvantageis thattheenergy partcanbeevaluateddirectlyin thereciprocalspacetherebyavoiding a reverseFFT. Theforcesrequireonly alocalsummationoverthegrid sincethedensityof theGaussianscreeningfunctionis highly localized. In [100], the Poissonequationis solved by a multi-gridapproach( ¦ 3 � 9 ).

Theparticle-particleparticle-mesh/multi-poleexpansion[104] (P) M/PME,¦ 3 �¼»�½�¾Á� 9 ) is basicallyanextensionof theP) M [54] methodfor periodicbound-ary conditionsusing multi-pole expansions. The cell multi-pole method[28](CMM, ¦ 3 � 9 ) sharesthe key featuresof the multi-pole methods,but it usesCartesiancoordinatesonly, unlike the otherformulationsthat usesphericalhar-monics. The reducedcell multi-polemethod[28] (RCMM, ¦ 3 � 9 ) is theexten-sionof CMM for periodicsystems,wherethe infinite periodiclattice is handledby thestandardEwaldmethod.Themacroscopicmulti-polemethod[73] (MMM,¦ 3 � 9 ) is basedon FMM techniquesto treatperiodicsystems.A moredetailedsummaryis givenin [117].

Finally, we mentionthat thereexist several implementationsof fastelectro-static methodsthat are especiallyadaptedfor multiple time steppingintegratorschemes[7, 8, 10,68,72,129].

25

Page 34: Framework Design, Parallelization and Force Computation in

4 Parallelization

A typical MD simulationis describedby Algorithm 1. The work requiredforthe numericalintegration is of order ¦ 3 � 9 , which canbe carriedout indepen-dently for eachparticle. The calculationof forcesscalesas ¦ 3 � & 9 , dueto thedominatingnon-bondedpair-wise interactionsin Eqs. (9) and (10). The forceevaluationis thereforethemostcomputationallyexpensive part. Evenfor ¦ 3 � 9interactionstheforcestypically dominate,sincetheintegrationis relatively cheapto carryout. Nevertheless,thetermsof thesumof interactions 3546 � 7 46 & 7 S5S5S 7 46 � 9are independent.Altogether, this makesMD – to a certainextent – inherentlyparallel.Thishasbeenexploitedby severalparallelMD programs[4, 14, 22, 23,32, 53,77, 81,86,91,98,110, 124,125].

WhenparallelizingMD, therearetwo importantconsiderationsto make. First,theMD programmustperformwell for asmallnumberof particles,i.e., lessthan1000. Thereareseveral intereststo carry out simulationswith a few thousandparticlesovera long timescale,e.g.,aproteinfolding simulationwith integrationtimestepof 1 fs ( W L Z ��� s) over1 ms( W L Z ) s). Second,theMD algorithmsshouldscaleto beableto exploit parallelmachines.

To meetthecomputationaldemands,thereareseveralsequentialoptimizationapproaches– thefirst optimizationstep.Theapproachescanbeclassifiedinto ad-vancedintegrationschemes(i.e.,MTS, p. 9) andfastforceevaluationalgorithmshandlingtheexpensivenon-bondedforces(Section3). Typically, thesetechniquesarecombined[129]. For the parallelization,therearefive basicstrategies [39,109]:Ã

Cloningsimply assignseachindependentsimulationto onesingleproces-sor. This approachis very efficient (no communication)andeasyto im-plement,but its applicationareais ratherlimited (e.g.,ensembleweatherforecast).ÃA master-slave approachallocateswork amongslaves. It hasboth com-municationandloadbalancingdifficulties,especiallyfor a largenumberofprocessors.ÃAtom decompositionbasedon data replication is an easybut memory-expensive approach.It haspoor scalingpropertiesdueto global commu-nication[91]. ProgramsusingthisdecompositionincludeUHGromos[22],Amber[121],CHARMM [15], Moldy [98], andanearlyversionof EGO[31].

26

Page 35: Framework Design, Parallelization and Force Computation in

Systolicor hypersystolicloopalgorithms[78, 114] areapossibleremedytoreducethememoryusageandto improvethescaling.ÃForce decompositioninvolves either force matrix or systolic loop meth-ods. It scalesbetter than atom decompositionby reducingcommunica-tion coststhroughtheuseof a block-decomposition,asin LAMMPS [92],CHARMM [55] andPROTOMOL [D], andasdiscussedin [87].ÃSpatialdecompositionuseslinked lists andscalesvery well [80], but it ismoredifficult to implementandextend.It isusedin ddgmq[16], SIGMA [53],NAMD2 [66], PMD [125], EulerGromos[22], DMMD [4], andevaluatedin [9], SPRINGS (seepaper[C]).

4.1 Load balancing

A goodload balancingof the computationalwork in a parallelenvironmentbe-comesmore important for a large numberof processors[66, 86] (seeTable 1in paper[D]). Therearetwo main approaches,namely, staticanddynamicloadbalancing.

Staticloadbalancingis usuallyappropriatewhenthecomputationalwork loaddefinedby the sum of all interactions 3546 � 7 46 & 7 S5S5S 7 46 � 9 can be estimatedanddistributedequallyamongthe processorswithout any run-timeinformation– atleastfor the mostexpensive parts. In caseof a non-bondedforce including thewholedomainor abondedforce,wecansimplydistributetheinteractionsequallyamongthe processors,since interactionsare known for the whole simulationduration. Optionally, onemayalsoconsiderspatialinformationto reducecachetrashingor communication.For the non-bondedforceswith cutoff, onehastodistributepairsof spatialsub-domainscontaininginteractioncandidates,sincetheinteractingparticlepairsarenotstaticallydefinedandchangeduringasimulation.Thesub-domainsaredefinedby a cell list algorithm[9][97, pp. 47-52] (seealsopaper[C]). Typically, one assignsgroupsof sub-domainpairs, wherea groupcontainsall essentialsub-domainpairsto computeall interactionsfor onegivensub-domain.Theapproachassumesthateachgrouphasa work loadof thesameorder. The advantageof static load balancingis that we do not needexplicitcommunicationto (re-)distribute the work, but it generallydoesnot scalefor alargenumberof processorsor for smallsystems.

Dynamic load balancingcanbe doneeitheron a local or on a global level.In caseof local loadbalancing,neighboringprocessorstry to balancetheir work

27

Page 36: Framework Design, Parallelization and Force Computation in

amongeachother. Globalbalancingrequiresseveral iterationsto redistributethework evenly amongall processorsbut leadsin generalto a betterdistribution ofthe work. The work can be either assignedon demandor implicitly with thehelp of someestimatesfrom previous distributions. The work is split up in thesamemannerasfor staticbalancing.In caseof dynamicassignment,themasterdistributeswork to the next idle slave. The assignedwork shouldbe asbig aspossibleto reducethe communicationbetweenthe masterand the slaves, butsmall enoughto balanceout the total work amongall slaves. One remedyforglobalsynchronizationis to usethesameestimatesfor severaliterations,therebycombining the advantagesboth from dynamicand static load balancing. Theestimatesaregenerallyvalid for a longerperiod,dueto the fact that theparticledislocationis small or only local. However, for medium-sizedMD purposeanda moderatenumberof processors,we canhandleglobal load balancingby onemasterglobally, whereasfor massively parallelmachineswe needa hierarchyofseveralmasters.

4.2 Parallel programming paradigms

Todayparallelprogrammingtechniqueshave convergedto two mainparadigms.MPI15 [41] is a standardfor message-passingbasedparallelprogramsin C/C++andFortran. It supportsFlynn’s taxonomy[37] of MIMD 16 andprovidesa widevarietyof message-passingfunctionalities.MPI wasalreadywidely usedbeforeits initial standardappearedin 1995. It supportspoint-to-pointandglobal com-municationbasedon sometransportlayer(e.g.,Sockets).With MPI-2 [40], one-sidedcommunicationandprocessspawning was introduced. Oneof the majoradvantagesof MPI is therich functionality, relatively low thresholdof usageandthe compliancefor both distributed-memoryand heterogeneousenvironments.Most hardwarevendorsprovide well optimized,native implementations,but notmany of thesesupportthefull setof MPI-2 extensions.

OPENMP17 [88] is anapplicationprograminterface(API) thatsupportsmulti-platformshared-memoryparallelprogrammingin C/C++ andFortran.It supportsFlynn’s taxonomyof SIMD18. Its specificationis a set of compiler directives,run-timelibrary routinesandenvironmentvariablesthat canbe usedto specifysharedmemoryparallelism. The directivescanbe seenas“hints” to compilers

15MessagePassingInterface16Multiple Instruction,Multiple Data17Openspecificationsfor Multi Processing18Single Instruction,Multiple Data

28

Page 37: Framework Design, Parallelization and Force Computation in

to parallelizethe code. Although OPENMP is a relatively new standardanditsfirst specificationappearedin October1997,it hasgainedtremendouspopularity.Oneimportantfactoris thelow learningthresholdandeaseto parallelizeexistingsequentialcode.Theparallelizationis fine grainedandit scalesratherwell for amoderatenumberof processors.It is actuallyonly supportedin shared-memoryenvironments,dueto the fact thatdistributed-memoryenvironmentshave a highlatency to accessnon-localmemory, which is problematicfor fine grainedparal-lelism. Furthermore,thememorycannotbecontrolledexplicitly, suchthatcodeanddatamaybedistributedamongdifferentprocessors.

In thepast,severalotherparallelprogrammingparadigmswereused.Todaymostof themhave lost their importance(e.g.,PVM [46]) or never really gainedwideacceptanceamongsoftwareandhardwaredevelopers(e.g.,HPF),dueto thelack of portability, a high thresholdto use(e.g.,Pthreads)or exploring only onekey requirement.

4.3 Two parallel implementations

Two differentparallelizationstrategieswere implementedandevaluatedin thisthesiswork. Thesearedescribedbelow.

4.3.1 Spatial decomposition

SPRINGS is anMD programthat targets2-dimensionalphysicalproblems(seepapers[A] and [B]) with short-rangevan der Waals interactionsin Eq. (10).The simulationregion is split into rectangularcells anddomains(collectionofcells) are distributed amongthe processors.This is reflectedby the multi-cellalgorithm,[9],[97, pp. 47-52],organizingthespatialinformationof particles(seepaper[C]). The parallelizationis basedon MPI usingdifferentcommunicationstrategies, including the one-sidedcommunication(Figure 6) featureof MPI-2. Furthermore,SPRINGS supportsa varietyof partitioningsof thesimulationregion to test the communicationstrategies. Detailsandresultsof performancetestswith up to W L�Ä particlesand128processorsarepresentedin publication[C].

4.3.2 Incr ementalparallelization approach

PROTOMOL is a general-purposecomponent-basedMD framework basedon in-crementalparallelization[D]. The approachencapsulatesthe parallelizationde-tails andprovidesmechanismsto postponethedevelopmentof a parallel imple-

29

Page 38: Framework Design, Parallelization and Force Computation in

local memory

global memory

Inte

rcon

nect

p1get()

p0

put()

Figure 6: Data exchange based on one-sided communication. (1) put():processor p0 ’exposes’ its data to other processors; (2) get(): processor p1 ’grabs’needed data independently of p0.

mentation,evenin aparallelenvironment.With co-existingparallelandsequentialimplementations,we cantransparentlybenefitfrom alreadyparallelizedpartsinthe framework, andeasilyexperimentwith new methodsby insertingsequentialcode.

The parallelizationitself is basedon rangecomputationwith a master-slaveconcept. The slaves are assignedon demanda range(work-command)by themaster, whichrepresentsacollectionof termsof thesumof all interactions 3�46 � 746 & 7 S5S5S 7 46 � 9 . The splitting of eachparallel force is definedand implementedindividually, suchthateachforceprovidessplitting (or distribution) informationfor the masterbasedon its own characteristicandthe actualsystem.The workdefinedby therangeshouldbeasbig aspossibleto reducedthecommunication,but smallenoughto balanceout thetotal work for onetime step.Theframeworkassumesthat eachrangeof a single force representscomputationalwork of thesameorder. In somecases,a completeforce type is assignedto oneslave, i.e.,whenno parallel implementationis provided. After calculatingall interactions(completionof step2b in Algorithm 1) the force and energy contributions areglobally updatedby theframework. Theinitial implementationis basedon MPIanddatareplication. This may not scalewell for a large numberof processors.Nevertheless,it is proven to performwell for a moderatenumberof processors,i.e.,up to 32. Detailsandresultsarepresentedin publication[D].

30

Page 39: Framework Design, Parallelization and Force Computation in

5 The PROTOM OL Framework

MD programsmaysubstantiallydiffer in designor level of implementation,butthey areessentiallyall basedonAlgorithm 1. TheproposedMD component-basedframework distinguishesPROTOMOL [63] from otherwell-known MD programs,suchasGROMOS [119], Amber [121], CHARMM [15]. A programwith sim-ilar goals,which provided the initial ideabehindPROTOMOL, is NAMD2 [66].However, NAMD2’sdesigngoalis primarily highscalability.

By revisiting andlearningfrom theseMD programs,threemainrequirementswereaddressedduringthedesignof theframework:

1. Allow end-usersto composeintegratorsandforce evaluationmethodsdy-namically. Thisallowsuserstoexperimentwith differentintegrationschemes.

2. Allow developersto easilyintegrateandevaluatenovelforcealgorithmsandintegrationschemes.

3. Develop an encapsulatedparallelizationapproach,where sequentialandparallelcomponentsco-exist.

Easilycustomizedforcesandinterchangeableintegrationschemeshave emergedas important requirements. For the last decade,several new ways of solvingNewton’s equationof motion basedon multiple time stepping(MTS) have beendevelopedthat improvedperformancedramatically[7, 39, 60, 57, 79, 129]. It isusually necessaryto carefully tuneall partsof the MTS schemeto get the fullbenefitsof this technique.Thus,to experimentandto evaluatenew anddifferentintegration schemes,their dynamiccompositionand configurationis essential,ratherthanto changethesourcecodeandrecompileit. This is discussedin moredetail in Section5.3.

For the developer, it is importantto provide an experimentalplatform andadesignof looselycoupledcomponentsin which onecaneasilyaddnovel forcealgorithmsor defineand implementnew potentialsin an orthogonalway. Thismeansalsothatalgorithmscanshareoptimizationscommonto all of them.Suchaframework will hopefullyin thefuturealsoserveasaplatformfor fair comparisonof differentalgorithms.

A major problemof most parallel MD applicationsis that the developerisforced to take into accountparallelismissueswhen modifying the code. Thisis inappropriatewhenonly implementingandtestingexperimentalideas. Thus,parallelismmust be encapsulatedand we must allow that both sequentialand

31

Page 40: Framework Design, Parallelization and Force Computation in

parallel implementationscanco-exist. Therefore,a designwith an incrementalparallelizationapproachis proposed.It suggeststhat the applicationdevelopermaystartwith a sequentialimplementationof thealgorithm,while transparentlybenefitingfrom alreadyparallelizedpartsin the framework. At a laterstageonemayparallelizethewholenew algorithm.Theparallelismitself is basedonrangecomputationwith amaster-slaveconcept.Moredetailsarediscussedin paper[D].Section5.8.2givesapracticalexampleof how anexistingsequentialforcecanbeparallelized.

On theotherhand,thedesignanddevelopmentarealsouserandapplicationdriven:

1. Componentsto implementsophisticatedintegratorsandfast force evalua-tors,suchasMOLLY [44, 45, 58, 59, 60, 62] andDissipative Molly [82],andmulti-grid summation[13, 107], PME methods[26, 27, 33, 54], andstandardEwald summation[34].

2. Facilities to compareaccuracy of force methods,and to make decisionsconcerningoptimalparametersandmethodfor agivenaccuracy andsystem.

3. Componentsto timeperformanceandanalyzeparallelscalability.

4. Ability to incorporatemethodsto performgeneralsimulationbasedonNew-ton’sequationof motion,e.g.,thesolarsystem.

5.1 Object-oriented vs. traditional procedural programming

Theuseof object-orientedprogramming(OOP)comparedto traditionalprocedu-ral approacheshasa trade-off when it comesto run-timeefficiency. The OOPintroducessomelanguageconstruct-dependentoverhead,which variesstronglyon the softwaredesignandthe actualimplementationof the problem. C++ andto someextendFortran90 areexamplesof languageswith built-in supportof theOOPparadigm. Both languageshave inheritedtheir proceduralpart from theirancestorsC and Fortran 77, respectively. C++ supportsgenericprogramming(e.g.,templates)and(multiple) inheritanceenablinggreatflexibility andeaseofcodereuse.C++ is arich andpopularOOPlanguagebecausetheprogrammercancombinehigh-level OOPabstractionandoptimizeat low-level machinedetails,ifneeded.

However, run-timeperformanceof C++ implementationcomparedto Fortran77or C remainsanopenquestion.Wedid somecomparisonsonaSGIOrigin2000

32

Page 41: Framework Design, Parallelization and Force Computation in

(R10000)of a specializedproceduralFortran MD implementation[50] againstthegeneralC++ MD framework PROTOMOL. WeobservedthattheFortrancodewasapproximatelytwice asfastasPROTOMOL, bothcodeswerecompiledwithMIPSprocompilerswith thesameoptimizationoptions.Oneimportantfactorforthis may be the lack of fully featuredoptimizationcompilersfor C++, which islikely dueto the difficulty to optimizeOOPimplementations.C statementsaremappedlinearly to sequencesof assemblystatementsof more or lessconstantsize. C++ breaksthis correspondencedue to the high non-uniformtranslationof C++ statementsto assemblycode. Furthermore,compilersproducinghighlyoptimizedcodefor new hardwarewerefirst releasedfor Fortran,especiallyin thehigh-performancecomputingsegment.Somevendorsdid notconsiderC++ asanoptionfor high-performancecomputingandgave low priority to thedevelopmentof aC++ compiler.

Despitethe run-time efficiency problem,C++ is a strongcompetitorwhencorrectlydesigningand implementingcomplex software. In the long run, flex-ibility and codereusewill prevail [25, 85], ratherthan having to continuouslyre-implementor adaptthe softwareof interest. Thus, the framework C++ waspreferredoverJavaandotherobject-oriented(OO) languagesbecauseof thewideacceptanceand the abilities – asdiscussedbefore– to combinelow-level opti-mizationandhigh-level abstraction.

5.2 A component-basedframework design

Therearemany differentapproachesto designsoftwarepackages,but to satisfytherequirementsof flexibility andeaseof codereuse,we do not only needa finegrainedcollectionof classes,but alsohigh-level componentsto easilycomposeasolutionfor the problemof interest. This ideais capturedby an object-orientedcomponent-basedframeworkdesign(Figure7) andsatisfiesalsotherequirementson page31. In a component-basedframework (or componentarchitecture),anumberof modules,interfacesamongthem,andbasiccommunicationservicesare defined. Examplesof componentarchitecturesare Microsoft’s ComponentObjectModel (COM, cf. [20]) andCORBA [24]. A componentarchitectureforhigh performancesoftwareis theCommonComponentArchitecture(CCA) [19].For the designof the component-basedframework PROTOMOL, threedifferentmoduleswereidentified:

1. The front-endprovidescomponentsto composeandconfigureMD appli-cations.Thecomponentsareresponsibleto composeandcreatetheactual

33

Page 42: Framework Design, Parallelization and Force Computation in

libparallel, libforceslibbase, libtopology

libintegrators

libfrontendComponent

Framework

Class Library

Middle Layer

Front−end

Back−end

Figure 7: The component-based framework PROTOMOL.

MD simulationsetupwith its integrationschemeandparticleconfiguration,especiallyinterfacing the user. This layer is stronglydecoupledfrom therestto theextentthatthefront-endcanbereplacedby ascriptinglanguage.For moreinformationonthesetopics,seetheMD applicationprogrammersinterfaceMDAPI [51], andthe interactive andsteeredMD interfacesde-scribedin [56, 111].

2. Themiddlelayeris awhite-boxframework for numericalintegrationreflect-ing ageneralMTS design.

3. Theback-endis aclasslibrary carryingout theforcecomputationandpro-viding basicfunctionalities.It hasastrongemphasisonrun-timeefficiency.In Sections5.4.2-5.4.5,differentoptionsfor theforcedesignareevaluated.

Thedesignwasalsomotivatedby thegoalof encapsulatingoptimizationssuchasparallelismwithout destroying performance.The most importantoptimizationsconsideredwereincrementalandscalableparallelizationof forceevaluation,celllist algorithmsto retrievespatialparticleinformation,andfully customizablenon-bondedforces using arbitrary shapedcells and boundaryconditions. This isdiscussedin moredetail in Section5.4.

Thediscussionof theframework hasastrongemphasison thedesignof forcealgorithms,sinceconsiderabletimewasspentto designandimplementnew forcealgorithms(e.g.,standardEwald summation,PME, MG, etc.). The front-endismainly the pre- and post-processingin Algorithm 1 and is not detailedin thisthesis,whereasthe middle layer is shortly explainedto give an overview of thecollaborationof integratorsandtheirassociatedforces.

34

Page 43: Framework Design, Parallelization and Force Computation in

5.3 Integrators

Theintegratorsrepresentthemiddle layerof theframework andarein chargeofnumericalintegration.Thecollaborationof anintegratorobjectwith its associatedforceobjectsis basedoncompositionandencapsulation(Figure8).

doHalfKick();doDriftOrNextIntegrator();calculateForces();doHalfKick();

myForceGroup−>evaluateSystemForces(tp,r,f,e);myForceGroup−>evaluateExtendedForces(tp,r,v,f,e);

ForceGroup

evaluateSystemForces(tp,r,f,e)

evaluateExtendedForces(tp,r,v,f,e)

ExtendedForce

evaluate(tp,r,v,f,e)=0

parallelEvaluate()

STSIntegrator

doDriftOrNextIntegrator()

MTSIntegrator

doDriftOrNextIntegrator()

StandardIntegrator

run(numTimesteps)

doHalfKick()=0

doDriftOrNextIntegrator()=0

calculateForces()

Integrator

run(numTimesteps)=0

Force

SystemForce

evaluate(tp,r,f,e)=0

parallelEvaluate()

n

m1

Figure 8: Collaboration diagram between the integrator object and the forceobjects.

To meetthe requirementsof dynamicallyconfigurablemultiple steppingal-gorithmswith arbitrary levels andarbitrary associatedforcesat eachlevel, theapproachof a chainof integratorswaschosen(Figure9). A chainof arbitrarylength is definedby recursion,where a single time stepping(STS) integratorterminatestherecursion.For thedesignof integrators,inheritanceis used,whereMTS andSTSintegratorsare two differentmain branchesimplementingAlgo-rithm 1. Inheritanceis well-suitedto capturethe uniqueand sharedbehaviorsof integrators.Furthermore,it givesgreatflexibility andhasa negligible impacton run-timeefficiency in thewhole framework. Researchersat theUniversityofNotreDamehavedesignedanddevelopedtheintegratorlayerin PROTOMOL [61].

35

Page 44: Framework Design, Parallelization and Force Computation in

doHalfKick(); doDriftOrNextIntegrator();

calculateForces(); doHalfKick();

MTSIntegrator doHalfKick(); doDriftOrNextIntegrator();

calculateForces(); doHalfKick();

*nextIntegrator

STSIntegrator doHalfKick(); doDriftOrNextIntegrator(); calculateForces(); doHalfKick();

*nextIntegrator

MTSIntegrator

Figure 9: Chain of integrators implementing multiple time stepping schemes.MTS differs from STS in that it calls recursively the next inner integrator beforeevaluating its forces. The recursion is terminated by an STS integrator.

Theassociatedforcesareencapsulatedin ForceGroup to decouplethemfromthe integrator when invoked to evaluatethe forces. This makes it possibletoperformsomepre-andpost-processing,e.g.,to distribute the forcecomputationamongdifferent processors.Furthermore,in a parallel environment this classhandlesparallelismandinteractswith theparallellibrary of theframework. Theclassinvokestheactualdistribution of thework andupdatesof forceandenergycontributionsamongthenodes.However, this(default)behavior maybeoverwrit-tenandtheparallelismmovedup to theintegratorlevel, which correspondsto anoptimizedparallelupdateof positionsandvelocities.

5.3.1 Integrator definition language

At thehighestlevel, theframework definesanintegratordefinitionlanguage(Pro-gram1), in which theusercancomposehis or herown integrationscheme.Thegrammarimposesthe proposedhierarchy(or chain) of integrators,whereeachlevel is definedby anintegratorof choiceandits associatedforces.An integratorcan be either MTS or STS.The inner most integration (level 0) is definedbySTSandMTS for thesuperiorlevels. Theprogramfragment2 definesa simpleLeap-Frogintegrationschemewith a time stepof 1 fs and its associatedbond,angle,dihedraland improper forces. Program3 illustratesa three-level MTSintegration. The inner most integrationusesLeap-Frog,the superiorlevels useVerlet-I/r-RESPA. It evaluatesforcesatdifferenttimescales.Theactualfrequencyof a forceat a givenlevel is recursively definedby theproductof theactualcyclelengthandthefrequency of thenext innerintegrator.

36

Page 45: Framework Design, Parallelization and Force Computation in

Integrator Ålevel N-1 integrator name Å # MTS

cyclelengthvalueforce forcenameforceoptions, Æ forcenameforceoptionsÇÈ+È+È ÉÈ+È+È

level 0 integrator name Å # STStimestepvalueforce forcenameforceoptions, Æ forcenameforceoptionsÇÈ+È+È ÉÉProgram 1: Grammar for PROTOMOL’s integrator definition language.

Integrator {level 0 Leapfrog {

timestep 1.0 # [fs]force Bond, Angle, Dihedral, Improper}

}

Program 2: Definition of Leap-Frog integrator with 1 fs time step and bond, angle,dihedral and improper forces.

Integrator {level 2 Impulse { # Long-range electrostatics

cyclelength 4force Coulomb -algorithm Full -switchingFunction ComplementSWC1}

level 1 Impulse { # Medium varying forcescyclelength 2force Improper, Dihedralforce Coulomb -algorithm Cutoff -switchingFunction SWC1force LennardJones -algorithm Cutoff -switchingFunction SWC1}

level 0 Leapfrog { # Fast varying forcestimestep 1 # [fs]force Bond, Angle}}

Program 3: Three-level Verlet-I/r-RESPA MTS. The fast, medium and slow forcesare evaluated at 1 fs, 2 (=1 È 2) fs, and 8 (=1 È 2 È 4) fs, respectively.

37

Page 46: Framework Design, Parallelization and Force Computation in

5.4 Forces

The forcesare designedasseparatecomponentsandpart of the computationalback-end.They representoneparticularforce (a bullet in Algorithm 3) andpartof integratorsbycomposition(Figure8). FromanMD modelingpointof view andfrom performanceconsiderations,five different requirements(or customizableoptions)areproposed.Thesearediscussedbelow.

5.4.1 Forcerequirements

R1 An algorithmto selectan ± -tupleof particlesto calculatetheinteraction.

R2 Boundaryconditionsdefiningpositionsandmeasurementof distancesin thesystem.

R3 A componentto retrieve efficiently the spatialinformationof eachparticle.Thishas ¦ 3 W 9 complexity.

R4 A potentialdefiningtheforceandenergy contributionson an ± -tuple.

R5 A componentto modify the potential (force and energy contributions) toobtaincertainpropertiesof thepotential,i.e.,switchingfunctions.

In orderto addresstheserequirements,severaldesignapproachescanbechosen.In general,softwareengineeringcomeswith a rich multiplicity, whereonecansolve a problemin many different ways. However, thereare infinite nuancesbetweenwrongandright. To choosea solutionout of thecombinatorialsolutionspacedependsprimarily on theproblemandfutureneedsandextensions.Thesemaybedifficult to point out.

5.4.2 Option 1: All-inclusi ve interface

A first option is to implementeverythingunderthe sameumbrellaof a do-it-allinterface. This may result in a mammothclass,intellectualoverheadandineffi-ciency. Furthermore,trying to implementthe force objectswould result in im-plementingall combinationsof requirementsR1-R5,andanexponentialnumberof classes.Consideringonly thenon-bondedforcesin theactualimplementationof PROTOMOL with 3 algorithms(R1), 2 boundaryconditions(R2), 1 cell listalgorithm(R3),3 differentpotentials(R4)and17switchingfunctions(R5),wouldrequire306implementations;withouttakinginto accountevaluatingsimultaneous

38

Page 47: Framework Design, Parallelization and Force Computation in

two differentpotentials.Thesameproblemoccurswhendevelopinga graphicalenginefor ± different geometricalshapesand Ê different target systems. Wewould prefer to separatethe geometricalshapedefinition andthe target systemspecificpartsthanto endup with ±�Ê implementations.Moreover, theapproachis alsoextremelyrigid, sinceonly oneslightly unpredictedcustomizationwouldmake thedesignuseless.Thedesignenforcesanalysisconstraints,but we wouldlike thedevelopersto enforcetheirown constraints,ratherthanthedesignimpos-ing predefinedconstraints.

5.4.3 Option 2: Multiple inheritance

Another possibility is multiple inheritance,which could avoid the exponentialnumberof classesthrougha designbasedon a small numberof smartclassesimplementingeachdesignoption.Theproblemwith multiple inheritanceis thatitcanonly do a superpositionof theclasses.This forcesthedeveloperto carefullycoordinateinheritanceandthesmartclasses.Also, carehasto betakento resolvenamingconflictsandinheritanceof multiple implementationsof thesameoption.This can result in ambiguousdefinitions, e.g., simultaneousevaluationof theelectrostaticandthe van der Waal forces. Anotherproblemof classesbasedonmultiple inheritanceis thegenerallack of completetypeinformation,which maymake it impossiblefor theclassto carryout its work directly. Nevertheless,therearewell-definedsolutionsor work-aroundsto the relatingproblemsof multipleinheritance,cf. [85, Ch.15],[113,Ch.15],[115, pp.155-167].

5.4.4 Option 3: Genericprogramming

A third option is genericprogrammingusing templatesthat supportmetapro-gramming. This is in someway a typedextensionto C macros,cf. [103, 105,113,120]. Templatesenableparameterizationin waysnot supportedby regularclasses.Templateparameterscanbeseenasplaceholdersthattheuserwill fill outwheninstantiatinganobjector defininga type. Thecodeof templatedclassesiscreatedat compiletime dependingon user-definedtypesor values.For example,in the framework designthe boundaryconditionsare type templateparameters(periodicor vacuum);thenon-bondedforcedefinitiontakesa booleanparameterindicatingwhetheraswitchingfunctionis usedor not.

Templatesalso comewith options to specializememberfunctions for onespecificinstantiation,andto partiallyspecializetemplatedclasses.Thesefeaturesareheavily usedin theStandardTemplateLibrary (STL); e.g.,copy() maycome

39

Page 48: Framework Design, Parallelization and Force Computation in

with awell-tunedimplementationdependingon thetypeof containerto copy andthetargetplatform.Thedrawbacksof templatesarethatonecanneitherspecializeits datamembers,nor canoneaddnew datamembers.Furthermore,onecannotpartiallyspecializememberfunctions.Thismeansthatthespecializationdoesnotscalelike it doesfor inheritance.

5.4.5 Combinedapproach: Policy, Strategy, and Traits

Theevaluationof advantagesanddisadvantagesof theapproachesmentionabovea combinationof templatesandinheritanceleadingto a highly customizableandefficient componentdesign. This is discussedmoredeeplyin [1]. This idea iscapturedby thePolicy pattern19, betterknownastheStrategy pattern[42,pp.315-323]. Thispatternpromotestheideato varythebehavior of aclassindependentofits context. It is well-suitedto breakup many behaviors with multiple conditionsandit decreasesthenumberof conditionalstatements.CloselyrelatedareTraits,which intend to carry someextra information of a class,but without requiringany changesof theclassitself. PoliciesandTraitshave a lot in common,but thePoliciestendto bebehavioral, whereasTraitsallow subtyping.

5.4.6 Forcedesignof PROTOM OL

The force designdefinesonetypical type of force (e.g.,angularforcesin mole-cules)andhasaninterfacedefiningall necessaryparametersfor thecomputationof force and energy contributions. The designof forcesitself usesthe Policypattern[1], asdiscussedin Section5.4.5. The algorithmto selectthe Ë -tuples(R1) is customizedwith the restof the four requirements(R2-R5). Someforcesmay not allow any parameterizationat all sincethey arespecializedandhand-craftedto carryoutoneparticularcase,e.g.,propagatingforcecontributionsfroman external device like a haptic device. At first, it seemsmore naturalto cus-tomizethepotential(R4) with analgorithm(R1), but afteranalyzingthegeneralapplicationrequirementsand their dependencies,it makesmore sensein termsof designandefficiency to choosethefirst designapproach.This becomesmoreobvious whenwe like to evaluatesimultaneouslydifferent typesof forceswiththe samealgorithm. Furthermore,for specializedforceslike PME, it would bepracticallyimpossibleto designapropersolution,sincesomepartsdonot rely ontheselectionof Ë -tuplesbut on grid evaluationalgorithms(seeFigure19 for all

19A patterndescribesasolutionto a recurringproblemthatarisesin specificdesignsituations.

40

Page 49: Framework Design, Parallelization and Force Computation in

availableforces). In orderto illustratehow forcesarecustomizedwith differentpolicy choices,two examplesaregivenbelow.

Theevaluationof angularforcesin moleculesis a goodexampleof a special-izedforcetype. Its natureis simpleandof staticcharacter, which meansthat the3-tuplesof atomsaregivenby the initial input structure.Thepotentialenergy isdefinedby (seealsoEq.(7))Ì angleÍ ÎÐÏÑ�ÒÔÓÖÕ�×�ØÚÙKÛÝÜÞ× Í�ßáàãâ ×�ØÚÙKÛ Îåä�æ�ç%è]éÝê Õ�ëì ÙíÜîëì Ø ß?ï Õ�ëì ÙíÜîëì Û ßð�ð ëì ÙÁÜñëì Ø ðRð ï ð�ð ëì ÙÁÜñëì Û ðRðóòõô (45)

It doesnotshareany behaviorswith otherforces,besidestheboundaryconditionsdefiningthemeasurementof distancesto computetheangle ×�ØÚÙKÛ of the3 atoms.Program4 declaresthe angle force with periodic boundaryconditions. Sincethe boundarycondition is a templateand known at compile time the compilercanoptimizethecodeextensively, e.g.,for vacuumthecompileronly appliesanordinaryvectordifferenceto computethedistance.

Force* aForce = new AngleSystemForce<PBC>();

Program 4: Dynamic instantiation of an angular force with periodic boundaryconditions.

More complex force objectsstemfrom non-bondedforces. For example,todefineanelectrostaticforce(Program5), we maychoosea cutoff algorithm(R1)that considersonly the closestneighboringatoms. To find the closestatomsinconstanttime,we usea cell manager(CM, R3) basedon a cell list algorithm.PBC

definesperiodicboundaryconditions(R2). OneAtomPair definestheenergy andforce contributionsbetweentwo arbitraryatoms. Here,CoulombForce imple-mentsEq. (9) modifiedby a ö é -continuousswitchingfunction(C1SwitchingFunc tio n) usingEqs.(18) and(19). Thebooleanvaluedefinestheswitchingfunctionshouldbeappliedor not,whichis for optimizationpurpose.For example, to computeall interactionsover the whole domainno switchingfunction is requiredand the compiler can drop all coderelatedto the switch-ing function. In caseof a MOLLY integrator scheme,OneAtomPair wouldbe replacedby OneAtomPairMol li fy LJ to implementa ’mollification’. Ta-ble 5 givesa completeoverview of supportedpolicy choices.Figure24 detailsOneAtomPair , OneAtomPairMol li fy LJ andothersupportedpair-wise inter-actions,e.g., the simultaneousevaluationof two different forces. The actualobjectcreationis a little morecomplicated,duethenumerouspoliciesandusageof nestedtemplates.A genericobjectcreationis discussedin Section5.5.

41

Page 50: Framework Design, Parallelization and Force Computation in

typdefOneAtomPair<PBC,CM,C1SwitchingFunction,Cou lombFor ce,true > OneAtomPairType;

class NonbondedCutoffSystemForce<PBC,CM,OneAtomPa irType> ;

Program 5: Declaration of a Coulomb force with cutoff and ÷ é -continuousswitching function.

5.4.7 Forceinterface

Theforcesaredesignedwith acommoninterface,theso-calledevaluate(...)

andoptionalparallelEvaluat e( .. .) method,which doesthe evaluationoftheforcecontributionsbasedon its parameterizationandpolicy choices.evaluate(...) is apurevirtual functionandexpectsalwaysasequentialimple-mentation.parallelEvalua te (.. .) is anoptional(or ’override-able’)inter-facefor parallelimplementation.In caseno parallelimplementationis provided,it will call thesequentialmethodevaluate(...) . This mechanismallows thedeveloperto postponea parallelimplementation,evenin a parallelenvironment.It reflectstheideaof incrementalparallelization(seepaper[C]).

The force interfacecan be seenas a function object, but to make it a realfunctor, the Commandpattern[42, pp. 233-242]would be neededto passtheparameters.Note that this would be appropriateif we would needan ’undo’functionalityor delayed(lazy) evaluation.

5.4.8 Forcesubtyping

The force designdistinguishesbetweensystemforces(Figure21) andextendedforces(Figure23). The systemforcesaredesignedto handleordinary interac-tions of the form

Ì Õ5ëì é â ëì à â ô�ô5ô â ëì�ø ß , whereasthe extendedforcesaredesignedfor velocity dependentinteractions.Furthermore,the forcesareencapsulatedinForceGroup , which is part of the intergrator. The contributionsof the systemand extendedforcesare invoked by separatemethodsfrom an integrator (Fig-ure8), suchthatdifferentpre-andpostprocessingfor thosetwo forcetypescanbeperformed.Notethatfrom theuserpoint of view, thereis no distinctionbetweenextendedandsystemforces.

Beyondthissubtyping,theforcesaredividedbetweenbondedandnon-bondedforces.Bondedforceskeeplocal neighborinformationthat is typically staticanddoesnot rely on requirementR3,whereasnon-bondedforcesoperateondynamicneighborhoods(R3)or mayincludethewholecomputationaldomain.Thismeans

42

Page 51: Framework Design, Parallelization and Force Computation in

that non-bondedforcesmay be computedusingcutoff techniques,or in the fulldomain.AppendixB describestheforcecomputationdesignin detail. Figure19is anoverview of actuallysupportedforces.

5.5 Forcefactory

Oncetheforcesaredesignedandimplemented,we needto create(or instantiate)theactualforceobjectsneededby integrators.Oneapproachcouldbe to let theintegrator or someother function createthem. The integrator or the functionwould have theresponsibilityto parsetheinput andinstantiatethecorresponding

force LennardJones

−algorithm NonbondedCutoff

−cutoff 6.5

−switchingFunction C2

−switchon 0.1 Lennard−Jones force with

cutoff and C2 switching function

Angle forceforce Angle

Wizard

Figure 10: Correspondence between the input definition and the actual forceobject.

forceobjects(Figure10). Thiswould leadto arigid design,i.e.,hugeif-then-elseor switch-casestatementsdueto thenumerousforcetypes.Additionally, it wouldnot only be difficult to addor remove force types,but also the correspondencebetweenthe input definition and the actual force object would be hard-coded.Wewouldratherliketo makevisible forcetypesof interestonly andlet theforcesdecidethecorrespondence.Thus,weneedsomesortof wizard,thatcantransformagiveninputdefinitioninto a realforceobject.

5.5.1 Solution: Abstract Factory and Prototypes

The requirementsof object creationare satisfiedby the AbstractFactory [42,pp. 87-95] and the Prototype[42, pp. 117-126]patterns.The AbstractFactory

43

Page 52: Framework Design, Parallelization and Force Computation in

patterndelegatesthe objectcreation,and the Prototypepatternallows dynamicconfiguration.Thelogical stepsin usinganAbstractFactoryandPrototypecom-binedis asfollows:

1. Prototype registration. The force factory is dynamicallyconfiguredbyregisteringforceprototypeobjects.A forceprototypeobjectis a forceob-ject with undefinedparametervalues.Thesetof registeredforceprototypeobjectsdefinesthe force (algorithm)spaceavailableto theuseror compo-nents.This enablesus to customizetheavailableforces. For example,wecaneliminateall thosethatdonotmakesenselogically.

ForceFactory::registerExemplar(new NonbondedCutoffSystemForce<PBC,CM,

OneAtomPair<PBC,CM,C2SwitchingFunction,LennardJ onesFo rce,tru e> >());

ForceFactory::registerExemplar(new AngleSystemForce<PBC>());

Program 6: Registration of prototypes: angular force and Lennard-Jones forceprototype with cutoff and ÷ à -continuous switching function.

2. Forceprototypesymbol table. Beforeany forceobjectcanbeinstantiated,somesortof correspondence– map–betweentheinputandtheavailable(orregistered)forceprototypesmustbedefined.In C++, onecoulduseRun-Time Type Identification(RTTI), but this featureis non-portableandmayleadto ambiguouscorrespondences.In orderto definethismapgloballyandof statictype, the forcesareresponsibleto provide a uniqueidentificationstring,basedon theforceclassandtheir policies(Table1). More precisely,forceoptionsdefiningpoliciesarepartof theidentificationstring,e.g.,inter-polationscheme.For forcesusingnestedtemplates,theidentificationstringis definedasthedeep-copy principle,e.g.,the complementof a switchingfunction,which expectsaswitchingfunctionastemplateparameter.

Force IdentificationstringAngle ’Angle ’LJ ’LennardJones -algorithm NonbondedCutoff -switchingFunction C1’

Table 1: Identification string of angle force and a Lennard-Jones force.

44

Page 53: Framework Design, Parallelization and Force Computation in

3. Force definition language. The definition of a force is essentiallybasedon theuniqueidentificationstring.A forcedefinitionstartsalwayswith thekeyword force followedby akeyword(i.e.,forcename) reflectingthetypeof forceandthe restof theuniqueidentificationstring (i.e., forceoptions).forcenamemayconsistof two wordsto evaluatetwo forcessimultaneously.Theforcenameis followedby parametersof theforce(i.e., forceoptions) toparameterizetheforcewith thedesiredvalues,e.g.,cutoff distance.Someforce may have additionalpoliciesor force parameters,e.g., interpolationschemeor friction coefficient. Other forcesaredefinedonly by onekey-word, e.g.,angleforces. ’compare’and’time’ arefor comparisonpurpose(seeSection5.6).

level n integrator name ùú+ú+úforce û compareü�û timeü forcenameforceoptionsû force û compareüýû timeü forcenameforceoptionsüú+ú+úþProgram 7: Grammar for PROTOMOL’s force definition language.

4. Force parsing. The force factory takes the rule of the wizard. An ex-tendedmapbasedon the identificationstring definesthe correspondence.In general,theorderof definitiondoesnot matter, but morecomplex forcedefinitionsrequirean exact matchin order to preserve uniqueness.Oncethecorrectforceprototypeis identified,theexpectedforceparametersareretrieved from the prototype. To let the prototypedefinethe parametersmakestheparsinggeneralandtiesthenumberandkeywordsof parametersto theforceprototype.More exactly, theparametersaretied to thepolicies,suchthatthereis only onedefinitionin thewholesystemthatis in thesamefile astheimplementationof theaccordingpolicy. Furthermore,thisallowsus to provide default values,if needed.Next, the force factoryparsestherequestedparametervaluesandpassesthemwith theirvaluesto anAbstractFactorymethodmake(...) (TemplateMethod) to createthedesiredforce.The class-dependentbehavior for the AbstractFactory methodis imple-mentedin doMake(...) (HookMethod), which is calledby make(...) .This techniqueis describedby theTemplateMethodpattern[42, p. 325].

45

Page 54: Framework Design, Parallelization and Force Computation in

5. Forceobject creation. Finally, afully featuredforceobjectwith setparam-etersis createdby theprototype.In orderto makethedynamicconfigurationandtheactualobjectcreationindependent,andthe factoryglobally acces-sible, the force factoryusesthe Singletonpattern[42, pp. 127-134]. Thisavoids the creationof unnecessaryfactories,andallows to pre-customizethefactorywith behavior commonto all forces.Thelatteris usedto defineboundaryconditionsand the type of cell manager, sincethey are uniquepolicies insidethe whole MD application. Furthermore,the force factoryprovidesfeaturesfor lazy registrationof prototypesandtheirdestruction.

5.6 Functionalities to compare forcealgorithms

Sinceoneof therequirementsof theframework is theeaseto evaluateandcom-paredifferentforce algorithms,functionalitieswereaddedto the framework forthispurpose(Figure20). At present,pairsof forcescanbecomparedto determineenergy andforceerrorsof new forcemethods:

FEmax Î ÿ�� ��� Ø�� è]é � àØ ðRð ë � Ø Ü ë Ø ð�ð Ø � è]é � àØ ðRð ë Ø ð�ð (46)

FEavg Î Ø � è]é � àØ ðRð ë � Ø Ü ë Ø ð�ð Ø � è]é � àØ ð�ð ë Ø ðRð (47)

FEabs Î � ���Ø ð�ð ë � Ø Ü ë Ø ð�ð (48)

UE Î ����� �Ì Ü ÌÌ ����� ô (49)

Here, ë � Ø and �Ì arethecontributionsof theforcemethodto beevaluated. ë Ø andÌdenotethevaluesof thereferenceforcemethod(or themostaccurateone),e.g.,

obtainedby directcalculation.Thecomparisonis performedon-the-fly, suchthatthe referenceforcedoesnot affect thecurrentsimulation. For example,onecancomparea fastelectrostaticsmethodsuchasPME usingtwo grid sizes,suchthatthemoreaccurateoneservesasanaccuracy estimator(Program8).

For thebenchmarkingof forcesatimer, functionwasimplementedto measurethe total andaveragetime spentin dedicatedforce methods(Program8). Bothcomparisonsfunctionscanbe nestedto evaluateaccuracy and run-timeperfor-mancesimultaneously.

46

Page 55: Framework Design, Parallelization and Force Computation in

force compare time force Coulomb -algorithm PMEwald -real -reciprocal-correction -interpolation BSpline -cutoff 6.5 -gridsize 10 10 10

force compare time force Coulomb -algorithm PMEwald -real -reciprocal-correction -interpolation BSpline -cutoff 6.5 -gridsize 20 20 20

force time LennardJones -algorithm NonbondedCutoff -switchingFunction C1-cutoff 8.0

Program 8: Examples of accuracy and timing comparison.

Thecomparisonof forcepairsis basedontheCountProxypattern[42, pp.207-217] to link two forcestogetherandcalculatetheactualerrors.TheCountProxypatternmakescomparisontransparentfrom theintegratorpoint of view, suchthatthe contributionsof the referenceforce arenot mixedup with the currentsimu-lation. To handlebothsystemandextendedforceswith their differentinterfacespolymorphinheritanceis used.Thetimer functionis basedon thesameapproachasfor theaccuracy comparison.Dueto thetransparency, theapproachallows thenestingof theaccuracy andtiming functionalities.

5.7 Implementation and validation of fast electrostatic forcealgorithms

Duringthisthesisthreedifferentfastelectrostaticforcealgorithmsweredevelopedandincorporatedinto theframework. Moredesigndetailsaregivenin Figure 22.Thecorrectnessof implementationof thesemethodswastestedona141-moleculewater system(Figure 27) with periodic boundaryconditions25A � 25A � 25Aor vacuum. The testswereperformedon an IBM p690RegattaTurbo system.Calculationswereperformedin 64-bitarithmeticandamachineprecisionof orderÏ è]é�� . Thefollowing threesectionssummarizethefeaturesof thealgorithms.

5.7.1 Standard Ewald summation

TheEwaldmethodis describedin Section3.2.Thereal,reciprocalandcorrectionterms(the last3 termsof Eq. (24)) canbeevaluatedindependently. This is typi-cally usedin combinationwith MTS to split the force into fastandslow varyingparts. Furthermore,the real spaceterm supportsnot only simpletruncation,butalsoswitching functionsto modify the potential. The implementationsupportsparallelism(rangecomputation)for the real, reciprocalandcorrectionterms. IthandlesMD systemsfor bothvacuumandperiodicboundaryconditions.Dueto

47

Page 56: Framework Design, Parallelization and Force Computation in

the relatively expensive sineandcosinefunctions,PROTOMOL useslook-up ta-blesandtheadditiontheoremto evaluateä%æ�ç Õ ë� ï ëì Ø ß and ç���� Õ ë� ï ëì Ø ß moreefficientlyin Eqs.(24) and(25); erf Õ ì ß anderfcÕ ì ß areby default not approximated,sincethesystemplatformsprovideawell optimizedimplementation.

Figure11 shows the maximumrelative force error of the Ewald methodfordifferent accuracy parameters� comparedto the Ewald methodwith accuracyparameter� Î Ï è]é�� (seeEqs. (26-28)) under periodic boundaryconditions.Therewas no significant improvementfor smaller � , which is obvious, due tomachineprecisionof order Ï è]é�� . The maximumrelative error of the Ewaldimplementationcomparedagainstthedirectmethodin vacuumis lessthan Ï è]é��with an accuracy parameter� Î Ï è]é�� anda unit MD cell Ï � timeslarger thanthe minimal boundingbox of particles. Figure12 illustratesthe correspondingnormalizedrun-time.Figure13showsexcellentenergy conservationwith amax-imum relative force error of order Ï è]é�� . Moldy [98] wasalsousedfor furthervalidation.

10−20

10−15

10−10

10−5

100

10−15

10−10

10−5

100

Max

imum

rel

ativ

e fo

rce

erro

r

ε

Figure 11: The maximum force errorfor a given accuracy parameter � .

10−15

10−10

10−5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Tim

e

ε

Figure 12: Normalized run-time for agiven accuracy parameter � .

5.7.2 Particle-meshbasedEwald summation

PROTOMOL supportsPME (Section3.3) with a genericinterpolationschemeinterfaceof arbitraryorder. Actually, B-splinesandHermitian interpolationareimplemented.Thereal,reciprocalandcorrectiontermcanbeevaluatedseparately,as for the standardEwald implementation. The real term can be modified bya genericswitching function. The implementationsupportsparallelism(range

48

Page 57: Framework Design, Parallelization and Force Computation in

computation)for the real andcorrectionterms. Thework of the reciprocaltermcannotbedistributedin theactualimplementation.

The maximumrelative error of the PME methodcomparedagainstthe stan-dardEwald summationis lessthan

Ñ ï Ï è]é�� . Both methodsusedan accuracyparameter� Î Ï è]é�� . ThePMEwasdefinedwith meshsizeof 0.1A, cutoff in real-spacepart ��� Î Ï , andB-splinesof order12. Figure13 shows excellentenergyconservation with a maximumrelative force error of order Ï è�� . NAMD2 [66]wasusedto validatetheresults.

5.7.3 Multi-grid

Multi-grid (Section3.4)is supportedwith agenericinterpolationschemeinterfaceof arbitraryorder. B-splinesandHermitian(cubicandquintic) interpolationareimplemented. The direct and smoothpart (at particle level) can be evaluatedseparately. Thekernelandits softeningfunctionaregenericandactually öEé -, ö à -,ö! - and ö"� -continuouspolynomialsmoothingfunctionsfor Coulombinteractionsareimplemented(e.g., #!$ in Figure5). Thecoreof multi-grid handlesvacuumorperiodicity in eachdimensionindependently. Thequestionof periodicboundaryconditionsfor the coarsestgrid evaluationremainsan openresearchquestion.Theactualimplementationperformsanordinarymatrix-vectormultiplicationforperiodicboundaryconditionsandvacuum.

TheMG codein [107] wasusedfor validation. We obtainedthesameerrorsaspresentedin [107] for the6848-moleculewatersystemwith cubicandquinticHermitian interpolation. Figure 13 shows the energy conservation for a 141-

0 20 40 60 80 100

−1192

−1190

−1188

−1186

−1184

−1182

−1180

−1178

Time [ps]

Ene

rgy

[kca

l/mol

]

Direct method: FE=0.03MG: h=2.083, s=6.5, p=4, l=1, FE=0.15Cutoff: r=12.5, FE=0.03PME: N=25, r=10, p=10, ε=10−7, FE=10−5

Ewald: ε=10−10, FE=10−10

Figure 13: The energy for periodicboundary conditions and step size 1 fs.

0 200 400 600 800 1000−1192

−1191.5

−1191

−1190.5

−1190

−1189.5

−1189

−1188.5

−1188

Time [ps]

Ene

rgy

[kca

l/mol

]

MG: h=3, s=6.5, p=4, l=2, FE=0.03Direct method

Figure 14: The energy for vacuum andstep size 1 fs.

49

Page 58: Framework Design, Parallelization and Force Computation in

moleculewatersystemwith periodicboundaryconditions.MG shows excellentenergy conservation for low accuracy (maximum relative force error of order0.15). The direct methoddoesnot considerany images,only the original unitMD cell. Usingtruncationat12.5A (half of theunit MD cell dimension)togetherwith a öEé -continuousswitching function performswell andhasthe sametimecomplexity as the direct method. In caseof a small cutoff 6.5 A, the energyfluctuatesaround-1172.6kcal/mol(notshown in thisfigure).Figure14illustratesthe energy conservation for vacuum.The minor energy drift comparedwith thedirectmethodis dueto translationsof thegridsfollowing theparticles.Thereasonis thatthepotentialenergy functionis not invariantunderglobaltranslationand/orrotationof theparticlepositions.

5.8 Practical examplesof extendibility and optimization

In orderto show theextendibility andflexibility , two practicalexamplesaregiven.Thefirst oneshows how to implementa new force,whereasthesecondexampledescribestheparallelizationof agivenforce.

5.8.1 Adding a newforce

Coulombcrystalsystemsaredefinedby ionswith anadditionalmagnetictrap[52,94, 102,116]. A commonusedtrapis thePaulTrapattraction,which is givenbyÌ PaulTrap Î ø% Ø'& é ê ÏÑ � Ø�( à)+* Õ ì àØ-,/. àØ ß , ÏÑ � Ø0( à1�2 àØ ò â ë� Ø Î Õ ì Ø â . Ø â 2 Ø ß43 ô (50)

Here, ë� Ø is the coordinateof particle 5 . ( )+* and ( 1 parameterizethe Paul Trapattraction.

The interactionis implementedin theclassPaulTrapExtende dFor ce andimplementsanextendedforce(ExtendedForce ). ExtendedForce waschosenasbaseclassto be able to incorporatea velocity dependentterm in future im-plementations.TheclassPaulTrapExtende dForc eBas e is inheritedprivatelyanddefinesonly staticfunctionalities,i.e., thekeyword PaulTrap . Themethodevaluate(...) implementsEq. (50) (Program9). The first constructor(theemptyone)is usedwhencreatinga prototype,while thesecondoneis calledinthe AbstractFactorymethoddoMake(...) to instantiatea fully parameterizedobject with ( )+* and ( 1 . getParameters(.. .) definesthe parameternamesandreturnsalso the parametervalues. Program10 registersa prototypeof the

50

Page 59: Framework Design, Parallelization and Force Computation in

force and makes the force ’visible’. The completeimplementationis given inAppendixC, Program13.

template<class TBoundaryConditions>class PaulTrapExtendedForce : public ExtendedForce,

private PaulTrapExtendedForceBase {public: // Constructors

PaulTrapExtendedForce():myOmegaXY(0.0),myOm egaZ(0. 0){};PaulTrapExtendedForce(Real omeagXY, Real omegaZ):myOmegaXY(omegaXY),

myOmegaZ(omegaZ){};public: // From class SystemForce

virtual void evaluate(...){const TBoundaryConditions &boundary = ...;Real e = 0.0;for(int i=0;i<topo->atoms.size();i++){

Real c = topo->atomTypes[topo->atoms[i].type].mass*;Coordinates pos(boundary.basisPosition((*positions)[i]));Coordinates f(c*myOmegaXY*myOmegaXY*pos.x,c*myOmegaXY*myOme gaXY*po s.y,

c*myOmegaZ*myOmegaZ*pos.z);(*forces)[i] -= f;e += 0.5*c*(myOmegaXY*myOmegaXY*(pos.x*pos.x+pos.y*p os.y)+

myOmegaZ*myOmegaZ*pos.z*pos.z);}energies->otherEnergy += e;

}public: // From class Force

virtual void getParameters(vector<ParameterType>& parameters) const{parameters.push_back(ParameterType("-omeg aXY",Va rValTy pe::REA L,

VarVal(myOmegaXY)));parameters.push_back(ParameterType("-omeg aZ",Var ValTyp e::REAL ,

VarVal(myOmegaZ)));}

protected: // New methodsvirtual Force* doMake(string&, const vector<VarVal>&) const{

Real omegaXY = values[0].getReal();Real omegaZ = values[1].getReal();return new PaulTrapExtendedForce(omegaXY,omegaZ);

}...};

Program 9: Implementation of a Paul Trap attraction.

ForceFactory::registerExemplar(new PaulTrapExtendedForce<PBC>());

Program 10: Registration of a Paul Trap attraction.

51

Page 60: Framework Design, Parallelization and Force Computation in

5.8.2 Parallelizing a force

In orderto parallelizea givenforce,parallelEvalua te (. .. ) hasto beover-written. Furthermore,we have to definea splitting of the force to enablerangecomputation.For example,a velocity dependentfriction could be implementedasdescribedin Program11. In this program,n is the numberof particles,andblocks definesthe numberof blocks,typically the numberof processors.Thesamesplittingmustbeimplementedin themethodnumberOfBlocks( .. .) (seecompleteimplementationin Program14) to allow the masterprocessorto dis-tributethework. Whenever a slave entersparallelEvaluate (. .. ) , next()

will betrueif theactualblock i hasnotbeenprocessedby any otherslave.Likefor thissimpleforce,wemayratherimplementtheactualforceevaluation

in a private methoddoEvaluate(... ,f ro m,to) with rangeability to avoidcode duplication. The sequentialimplementationcalls it with the full range,whereastheparallelversioncallsit with theaccordingblockindices(Program14).

virtual evaluate(...){...} // sequential implementationvirtual parallelEvaluate(...){

...int n = positions->size();int blocks = ...; // number of blocksfor(int i = 0;i<blocks;i++){

if(topo->parallel->next()){int to = (n*(i+1))/blocks;if(to > n) to = n;int from = (n*i)/blocks;for(int j=from;j<to;j++){

(*forces)[j] += (*velocities)[j]*myF;} } } }

Program 11: Parallelization of a velocity dependent friction.

5.9 Performance

In the following threesectionsthe run-timeperformanceof PROTOMOL is dis-cussed.WecomparePROTOMOL to NAMD2 [66] for sequentialandparallelruns,andcomparethedifferentfastelectrostaticforcealgorithmsof PROTOMOL. Theperformanceevaluationsarebasedonthreesystems,a141-moleculewatersystem(Figure27),BPTI with water(Figure25),andApoA1 with water(Figure28).

52

Page 61: Framework Design, Parallelization and Force Computation in

5.9.1 Sequentialscalability

We performeda sequentialcomparisonof PROTOMOL with NAMD2, sincetheyhavesimilarbasicfunctionalitiesandarebothimplementedC++. In orderto makethe comparisonsfair, we usedthe sameintegrationschemeandconsideredthesameforces.BothconfigurationsuseLeap-Frogwith timestep1 fs (Program12).For thestructuralandinitial simulationstate,theidenticalinputwasused.Table2demonstratesthe sequentialperformanceof PROTOMOL againstNAMD2. OnecanconcludethatPROTOMOL performswell andhasaslightly better(sequential)scalingthanNAMD2, in respectof problemsize. All runswereperformedon adedicatednodeof aLinux clusterwith 1.26GHzPentiumIII processors.PROTO-MOL wasconfiguredwith a cell sizeof 5 A for thewatercaseand 6 é A for thetwo othercasesin orderto exploit thecacheoptimally.

Testcase # atoms PROTOMOL NAMD2 Ratio7[s] [ 8 s] /

7[s] [ 8 s] /

7Water vacuum 423 0.0108 25.53 0.0122 28.84 0.89

periodic 0.0156 36.88 0.0149 35.23 1.05BPTI vacuum 14,281 0.7826 54.8 0.889 62.25 0.88

periodic 1.386 97.01 1.216 85.15 1.14ApoA1 vacuum 92,224 5.622 60.96 6.537 70.88 0.86

periodic 9.259 100.04 10.056 109.04 0.92

Table 2: Sequential comparison of PROTOMOL vs. NAMD2 for one time step inaverage on a Pentium III, 1.26 GHz running Linux RedHat. All test cases useLeap-Frog integration with 1 fs time step and a non-bonded force cutoff of 10 A,and Lennard-Jones and Coulomb forces with ÷ é -continuous and shift switchingfunction, respectively.

Integrator {level 0 Leapfrog {

timestep 1.0 # [fs]force Improper, Dihedral, Bond, Angleforce LennardJones Coulomb # Combining LJ and Coulomb evaluation

-switchingFunction C2 -switchingFunction Shift-algorithm NonbondedCutoff -switchon 0.1 -cutoff 10.0 }}

Program 12: PROTOMOL: Integrator and force definitions used for the perfor-mance comparison test. Lennard-Jones and Coulomb contributions are evaluatedsimultaneously.

53

Page 62: Framework Design, Parallelization and Force Computation in

5.9.2 Parallel scalability

Figures15 and16 illustratetheparallelscalabilityof PROTOMOL andNAMD2.PROTOMOL scalessmoothlyand well up to 32 processors.For one and twoprocessorsPROTOMOL doesstaticloadbalancinganddynamicloadbalancingforthreeor moreprocessors.For dynamicload balancing,onededicatedprocessortakesthe role of masterservingtheslaves. Table3 containstheaveragetime ofa sequentialrun for onetime step. PROTOMOL andNAMD2 have comparablesequentialperformance,PROTOMOL is 9 Ü Ï ;: fasterfor vacuum,but 9 Ü Ï <:slower for periodicboundaryconditionsthanNAMD2. Thesimulationsetupforall testcaseswasthesameasfor thesequentialcomparison,exceptfor acell size

5 10 15 20 25 30

5

10

15

20

25

30

Spe

edup

# Processors

ProtoMol: ApoA1NAMD2: ApoA1ProtoMol: BPTINAMD2: BPTIProtoMol: WaterNAMD2: Water

100

101

10−1

100

101

# P

roce

ssor

s ×

Tim

e pe

r st

ep [s

]

# Processors

ProtoMol: ApoA1NAMD2: ApoA1ProtoMol: BPTINAMD2: BPTIProtoMol: WaterNAMD2: Water

Figure 15: Parallel scalability for periodic boundary conditions.

5 10 15 20 25 30

5

10

15

20

25

30

Spe

edup

# Processors

ProtoMol: ApoA1NAMD2: ApoA1ProtoMol: BPTINAMD2: BPTIProtoMol: WaterNAMD2: Water

100

101

10−1

100

101

# P

roce

ssor

s ×

Tim

e pe

r st

ep [s

]

# Processors

ProtoMol: ApoA1NAMD2: ApoA1ProtoMol: BPTINAMD2: BPTIProtoMol: WaterNAMD2: Water

Figure 16: Parallel scalability for vacuum.

54

Page 63: Framework Design, Parallelization and Force Computation in

of 5 A for all PROTOMOL runs.All runswereperformedon anOrigin2000with195MHzR10000processors.

Testcase # atoms PROTOMOL NAMD27[s] [s]

Water vacuum 423 0.0317 0.0334periodic 0.0473 0.0434

BPTI vacuum 14,281 2.209 2.408periodic 3.562 3.304

ApoA1 vacuum 92,224 16.03 17.83periodic 26.36 25.04

Table 3: Sequential comparison of PROTOMOL vs. NAMD2 for one time step inaverage on an Origin2000 with 195MHz R10000 processors. All test cases useLeap-Frog integration with 1 fs time step and a non-bonded force cutoff of 10 A,and Lennard-Jones and Coulomb forces with ÷ é -continuous and shift switchingfunction, respectively.

5.9.3 Scalability of fast electrostatic forcealgorithms

Figure 17 shows the run-time efficiency of the PME and the standardEwaldmethodwith a maximumrelative forceerrorof Ï è�� for thethreetestcaseswithperiodicboundaryconditions.TheEwaldsummationwith anaccuracy parameter� of Ï è]é�� served as reference.Figure 18 illustratesthe run-time efficiency ofMG with amaximumrelative forceerrorof Ï è à and Ï è� comparedto thedirectmethodfor vacuum. As expected,the direct methodscalesas = Õ ÿ à ß andMGscalesas = Õ ÿ ß , which is supportedby the observed speedupof 620 or morefor a Ï � Ca>�?� Coulombcrystalsimulationcomparedwith thedirectmethodandmaximumrelative force error of 9 ï Ï è� . Note that one may get even betterperformancenumbersby extensive fine-tuningof the MG parameters.Table4summarizesthe performancenumbersboth for vacuumand periodic boundaryconditions.All runeswereperformedon anIBM p690RegattaTurbosysteminproductionmode.Calculationswereperformedin 64-bitarithmeticandamachineprecisionof order Ï è]é�� .

55

Page 64: Framework Design, Parallelization and Force Computation in

103

104

10−1

100

101

102

103

Tim

e pe

r st

ep [s

]

# Atoms

EwaldEwald FE=10−5

PME FE=10−5

n3/2

n log n

2 4 6 8

x 104

0

50

100

150

200

250

300

350

400

450

Tim

e pe

r st

ep [s

]

# Atoms

EwaldEwald FE=10−5

PME FE=10−5

n3/2

n log n

Figure 17: Comparison of fast electrostatic algorithms for periodic boundaryconditions. The left figure has logarithmic axes; the right figure has linear axes.

103

104

10−1

100

101

102

Tim

e pe

r st

ep [s

]

# Atoms

DirectMG FE=10−3

MG FE=10−2

n2

n

2 4 6 8

x 104

0

5

10

15

20

25

30

35

40

45

50

Tim

e pe

r st

ep [s

]

# Atoms

DirectMG FE=10−3

MG FE=10−2

n2

n

Figure 18: Comparison of fast electrostatic algorithms for vacuum. The left figurehas logarithmic axes; the right figure has linear axes.

Testcase # atoms Direct MG @�ACBED MG @�ACB<F Ewald Ewald @�ACB<G PME @HA�BEGI[s] [s] [s] [s] [s] [s]

Water 432 0.01962 0.03498 0.03707 0.31159 0.07277 0.05365BPTI 14,281 21.584 3.1818 7.3027 144.035 21.021 3.85ApoA1 92,224 903.963 19.994 51.441 2978.52 482.847 26.728

Table 4: Performance comparison of fast electrostatic force algorithms on an IBMp690 Regatta Turbo system. The performance numbers represent the averagetime to evaluate the electrostatic contributions for one time step.

56

Page 65: Framework Design, Parallelization and Force Computation in

6 Conclusions

Theaim of this thesishasbeento studythreedifferentrelevantaspectsof molec-ular dynamics(MD):J Thestudyof parallelismthatis achievablein real-life MD applicationsand

algorithms.J The designandaspectsof a generalpurposeframework for MD applica-tions.J Thestudyof themostrelevantfastelectrostaticforcealgorithms.

In this thesis,two differentparallelizationapproachesarestudied. First, in pa-per [C] spatialdecompositionwasevaluatedfor a particularMD programfor 2-dimensionalphysicalproblems(seealsopapers[A] and[B]) with short-rangevanderWaalsinteractions.Dif ferentpartitioningandcommunicationstrategieswereappliedwith up to KML<N particlesand128 processorsto demonstratethe parallelscalability. Second,in paper[D] an incrementalparallelapproachwasdesignedand incorporatedinto the framework PROTOMOL. Although the representationrelies on replicateddata, the approachscaleswell for a moderatenumberofprocessors(up to 32). The approachwasmotivatedto provide transparentandencapsulatedparallelism,wheresequentialandparallelimplementationscanco-exist. It supportsthedevelopmentandintegrationof sequentialexperimental,andnovel algorithms,while concurrentlybenefitingfrom alreadyparallelizedpartsinthe framework. During this work, this hasemergedasan importantfeatureofthe framework, sincenumerousresearchersfocusprimarily on algorithmdevel-opmentanddonotwish to dealwith parallelizationissues.

The main part of the work is devoted to the designandthe developmentofthe fastelectrostaticforcealgorithms.In orderto do so,we useda considerableamountof timeandeffort to extendandenhancethecomponent-basedframeworkPROTOMOL. The front-endwas re-designedand the forceswere generalized(Policy pattern[42, pp. 315-323])and extendedtogetherwith a force factory(AbstractFactory[42, pp. 87-95]). PROTOMOL hasbeenparallelizedbasedonanincrementalparallelizationapproach.Thedevelopmentwasrealizedin collab-orationwith JesusA. Izaguirre20 andhisstudents.

20Departmentof ComputerScienceandEngineering,Universityof NotreDame,USA.

57

Page 66: Framework Design, Parallelization and Force Computation in

Thecomponent-basedframework approachgivestheneededflexibility to con-figure anddynamicallycomposeMD applicationstogetherwith a run-timeeffi-cientback-endfor theMD computationalintensivepart.

The PROTOMOL framework is sufficiently general. It hasbeenusedin aframework called COMPUCELL [38], which modelsmorphogenesisand otherprocessesof developmentalbiology at the cellular andorganismlevel. The ef-ficiency of theback-endis demonstratedby acomparisonof PROTOMOL againstNAMD2 (Sections5.9.1and5.9.2). The dynamiccompositionis illustratedbythe ability to define arbitrary MTS schemesbasedon an integrator definitionlanguage.The framework’s flexibility andeaseto performgeneralpurposeMDsimulationsbasedontheNewton’sequationof motionhavebeendemonstratedbyamacroscopicsimulationof Ugelstadspheresin amagneticfluid (AppendixD.3).In a laboratorycoursein computationalscienceat theDepartmentof Informaticsat theUniversityof Bergen,studentssuccessfullyansweredthequestionsof mat-terby performingsimulationswith thehelpof PROTOMOL (AppendixD.2). Theframework is also usedin an ongoingproject to study CoulombcrystalswithKCLEO ions(AppendixD.4) in collaborationwith experimentalists.Furthermore,thedevelopmentof sophisticatedintegrators,suchasMOLLY andDissipativeMollyis ongoingwork attheUniversityof NotreDame.Overall,theframework satisfiestherequirementsof anexperimentalplatformfor generalMD applications.

The framework has beenextendedwith three fast electrostaticforce algo-rithms:StandardEwaldsummation( PRQ�SUTVCW ), smoothparticle-meshbasedEwaldsummation( PXQYS[Z]\E^_S`W ) and multi-grid ( PXQYSaW ), where S is the numberofparticles/atoms.In Section5.9.3,the performancesof the differentmethodsarecompared.Their designsarebasedon thePolicy patternto provide theflexibilityto customizethem.All threemethodssupportbothvacuumandperiodicboundaryconditions.In caseof multi-grid underperiodicboundaryconditions,thecoarsestgrid is evaluatedby anordinarymatrix-vectormultiplication,whichto ourknowl-edgeis the bestapproachto conserve systemproperties.Thesefastalgorithmsalsosupportdifferentsplittingsfor MTS integratorschemesto lengthenthetimestepand achieve betterperformance. However, theseadvancedmethodshavenumerousparametersand it is not always trivial to find the right configurationto achieve optimal run-timeefficiency for a givenaccuracy. Therefore,differentinstrumentalmethodsareprovidedto comparerun-timeefficiency andaccuracy,suchthat the optimal choiceof parametersfor a given force algorithm can bedeterminedautomaticallyin thefuture.

ThesoftwarepackagePROTOMOL is generallyavailableasa public domaincode[63]. It is written in C++ andconsistsof 40,000linesof codeand180C++

58

Page 67: Framework Design, Parallelization and Force Computation in

classes.The communicationand parallelismis basedon the MessagePassingInterface(MPI). PROTOMOL runson Solaris,Linux, AIX, andIRIX systems.

Outlook

MD is anactiveresearchfield, especiallythedevelopmentof thefastelectrostaticforce algorithmscombinedwith MTS integrator schemes. A major difficultywith fast electrostaticforce algorithmsis the optimal choiceof parametersfora given accuracy and run-time efficiency. It would be desirableto determinetheseparametersautomatically. Furthermore,the choiceof the right methodisobviously important.Thereis anongoingcollaborationprojectat theUniversityof Notre Dameto provide heuristicestimatesandalgorithmsto chooseboth theright methodandoptimalparameters.

The fastelectrostaticforcealgorithmsshouldbeextendedfor genericpoten-tials, e.g., van der Waals. This will requirea careful extensionof the existingdesignwith new policies.For somemethodsthis mayhowevernotbetrivial.

During this thesis,therequirementto customizetheparallelismin moredetailwas identifiedasa possibleenhancement.This is desirableto be ableto selectamongdifferentcommunicationapproachesandstrategiesof work distribution.

Fromtheframework pointof view, therearestill severalenhancementspossi-ble. For example,theintegratordesignis missingthesupportof shadow Hamilto-nianintegrators[106]. Thesearecheapapproximationsto themodifiedHamilto-nianthatis closeto theHamiltonianof interest[101]. Furthermore,theintegratordesignshouldbe revisitedonceexperimentalintegratorshave beenprovenvalu-able.

59

Page 68: Framework Design, Parallelization and Force Computation in

Acknowledgments

I would like to thankmy supervisors,ProfessorPetterE. Bjørstadandco-advisorProfessorJanPetterHansen21, for a very challengingandinterestingproject,andfor lots of support. Many thanksgo to the ResearchCouncil of Norway forfundingtheproject. Also many thanksto assistantprofessorJesusA. Izaguirre22

andhisstudentsfor their fruitful collaborationandsupport.Furthermore,I wouldlike to thank the committee,Dr. GodehardSutmann23, ProfessorHansPetterLangtangen24 andProfessorTor Sørevik, for the time andeffort they have takento readandcommenton this manuscript.Their commentswerevery useful. I’malsogratefulto Jacko Kosterfor commentingon early draftsof this manuscript.I alsoowe thanksto theDepartmentof InformaticsandParallabfor theexcellentworking environment. I wish to thankParallabfor giving meunrestrictedaccessto their supercomputerfacilities. Last but not least, I would like to thank the“lunch-gang”andKKK.

21Departmentof Physics,Universityof Bergen.22Departmentof ComputerScienceandEngineering,Universityof NotreDame,USA.23CentralInstitutefor AppliedMathematics,ResearchCenterJulich, Julich, Germany.24SimulaResearchLaboratory, Oslo.

60

Page 69: Framework Design, Parallelization and Force Computation in

References

[1] A. Alexandrescu.ModernC++ design:genericprogramminganddesignpatternsapplied. Addison-Wesley, 2001.

[2] M. P. Allen andD. J. Tildesley. ComputerSimulationof Liquids. OxfordUniversityPress,New York, 1987.

[3] S. Amann, C. Streit, and H. Bieri. BOOGA – A component-orientedframework for computergrahics.In GraphiCon, pages193–200,Moscow,1997.

[4] N. Attig, M. Lewerenz,G. Sutmann,andR. Vogelsang.DMMD – A mod-ular moleculardynamicsprogramfor MPP-systems.5th SGI/CrayMPPWorkshop, Bologna, Italy, http://www.cine ca .i t/m pp-w or ks hop ,September1999.

[5] N. S. Bakhvalov. On theconvergenceof a relaxationmethodwith naturalconstraintson the elliptic operator. USSRComput.Math. Phys., 6(101–135),1966.

[6] J. BarnesandP. Hut. A hierarchicalbcQYSdZ�\<^_S`W force-calculationalgo-rithm. Nature, 324:446–449,1986.

[7] P. F. Batcho, D. A. Case, and T. Schlick. Optimized particle-meshEwald/multiple-timestepintegrationfor moleculardynamicssimulations.J. Chem.Phys., 115(9):4003–4018,2001.

[8] P. F. Batcho and T. Schlick. New splitting formulations for latticesummations.J. Chem.Phys., 115(18):8312–8326,November2001.

[9] D. M. Beazley andP. S. Lomdahl. Message-passingmulti-cell moleculardynamicsontheConnectionMachine5. Parallel Computing, 20:173–195,1994.

[10] T. Bishop, R. D. Skeel, and K. Schulten. Difficulties with multipletimesteppingandthe fastmultipole algorithmin moleculardynamics. J.Comp.Chem., 18(14):1785–1791,November15,1997.

[11] JohnA. Board,Jr., J. W. Causey, JamesF. Leathrum,Jr., AndreasWinde-muth, and Klaus Schulten. Acceleratedmoleculardynamicssimulation

61

Page 70: Framework Design, Parallelization and Force Computation in

with theparallelfastmultipole algorithm. Chem.Phys.Lett., 198:89–94,1992.

[12] N. P. Boghossian,O. Kohlbacher, andH.-P. Lenhof. Ball: Biochemicalalgorithmslibrary. In J. S Vitter and C. D. Zaroliagis,editors,3rd Int.WorkshoponAlgorithmEngineering(WAE-99), volume1668of In LectureNotesin ComputerScience, pages330–344.Springer-Verlag,London,July1999.

[13] A. Brandt,J. Bernholc,andK. Binder, editors. MultiscaleComputationalMethodsin ChemistryandPhysics, volume177of NATO ScienceSeries:Series III Computer and SystemsSciences. IOS Press, Amsterdam,Netherlands,January2001.

[14] B. R.Brooks,R.E.Bruccoleri,B. D. Olafson,D. J.States,S.Swaminathan,and M. Karplus. CHARMM: A program for macromolecularenergy,minimization, and dynamicscalculations. J. Comp.Chem., 4:187–217,1983.

[15] B. R. BrooksandM. Hodoscek. Parallelizationof CHARMM for MIMDmachines.ChemicalDesignAutomationNews, 7:16–22,December1992.

[16] D. Brown, H. Minoux, andB. Maigret. A domaindecompositionparallelprocessingalgorithm for moleculardynamicssimulationsof systemsofarbitraryconnectivity. ComputerPhysicsCommunications, 103:170–186,1997.

[17] B. Buhlmann. Extensionsand Applicationsof the GraphicsFrameworkBOOGA. PhD thesis,Instituteof ComputerScienceandApplied Mathe-matics,Universityof Berne,1998.

[18] D. Bulka and D. Mayhew. Efficient C++, PerformanceProgrammingTechniques. Addison-Wesley, 1999.

[19] Common Component Architecture Forum. Seehttp://www.acl. lan l. gov/ cc a-f or um.

[20] D. Chappell.UnderstandingActiveXandOLE. Microsoft Press,1997.

[21] Z. M. Chen,T. Cagin,andW. A. Goddard. FastEwald sumsfor generalvanderWaalspotentials.J. Comp.Chem., 18:1365–1370,1997.

62

Page 71: Framework Design, Parallelization and Force Computation in

[22] T. W. Clark, R. van Hanxleden,J. A. McCammon,and L. Ridgway.Parallelization strategies for a molecular dynamicsprogram. In IntelTechnologyFocysConferenceOroceedings, April 1992.

[23] T. W. Clark, R. vanHanxleden,J.A. McCammon,andL. R. Scott. Paral-lelizing moleculardynamicsusingspatialdecomposition.In Proceedingsof theScalableHigh-PerformanceComputingConference, pages95–102,LosAlamitos,California,1994.IEEE ComputerSocietyPress.

[24] The CommonObject RequestBroker: Architecture and Specification,December2001.Revision 2.6.

[25] B. Coulange.SoftwareReuse. SpringerVerlag,London,1998.

[26] T. A. Darden,D. M. York, andL. G. Pedersen.Particle meshEwald. AnS efZ�\<^fQYSaW methodfor Ewald sumsin large systems. J. Chem.Phys.,98:10089–10092,1993.

[27] M. DesernoandC.Holm. How to meshupEwaldsums.I. A theoreticalandnumericalcomparisonof variousparticlemeshroutines. J. Chem.Phys.,109(18):7678–7693,1998.

[28] H.-Q. Ding, N. Karasawa, andW. A. GoddardIII. The reducedcell mul-tipole methodfor coulombinteractionsin periodicsystemswith million-atomunit cell. Chem.Phys.Lett., 196(6):6–10,August1992.

[29] Z. H. DuanandR. Krasny. An Ewaldsummationbasedmultipolemethod.J. Chem.Phys., 113(9):3492–3495,2000.

[30] M. J. DudekandJ. W. Ponder. Accuratemodelingof the intramolecularelectrostaticenergy of proteins.J. Comp.Chem., 16:791–816,1995.

[31] M. Eichinger, H. Grubmuller, H. Heller, and P. Tavan. FAMUSAMM:An algorithmfor rapidevaluationof electrostaticinteractionsin moleculardynamicssimulations.J. Comp.Chem., 18:1729–1749,1997.

[32] R. Esser, P. Grassberger, J. Grotendorst,and M. Lewerenz. EGO –an efficient moleculardynamicsprogramand its applicationto proteindynamicssimulations. In Workshopon Molecular Dynamicson ParallelComputers, World Scientific,pages154–174,Singapore912805,2000.

63

Page 72: Framework Design, Parallelization and Force Computation in

[33] U. Essmann,L. Perera,and M. L. Berkowitz. A smoothparticle meshEwald method.J. Chem.Phys., 103(19):8577–8593,1995.

[34] P. Ewald. Die BerechnungoptischerundelektrostatischerGitterpotentiale.Ann.Phys., 64:253–287,1921.

[35] R. P. Fedorenko. A relaxationmethodfor solving elliptic differentialequations.USSRComput.Math.Phys., 1:1092,1961.

[36] D. Fincham.Optimisationof theEwald sumfor largesystems.Mol. Sim.,13:1–9,1994.

[37] M. Flynn. Somecomputerorganizationsand their effectiveness. IEEETrans.Comput., C(21):94,1972.

[38] Laboratory for Computational Life Sciences.http://www.nd.e du/ ˜l cl s/ co mpuce ll , 2002. CompuCell homepage.

[39] T. ForesterandW. Smith. Onmultiple time-stepalgorithmsandtheEwaldsum.Mol. Sim., 13(3):195–204,1994.

[40] MPI Forum. MPI-2: Extensionsto the Message-PassingInterface.http://www.mpi- for um.o rg /d ocs /d oc s. ht ml .

[41] MPI Forum. MPI: Message-PassingInterface. http://www.mpi-

forum.org/docs/ doc s. ht ml .

[42] E. Gamma,R. Helm, R. Johnson,andJ. Vlissides. DesignPatterns.El-ementsof ReusableObject-OrientedSoftware. Addison-Wesley, Reading,Massachusetts,1995.

[43] G. Gao. Large ScaleMolecularSimulationswith Applicationto PolymersandNano-scaleMaterials. PhDthesis,Caltech,1998.

[44] B. Garcıa-Archilla, J. M. Sanz-Serna,and R. D. Skeel. Long-time-stepmethodsfor oscillatory differential equations. SIAM J. Sci. Comput.,20(3):930–963,October20,1998.

[45] B. Garcıa-Archilla, J. M. Sanz-Serna,and R. D. Skeel. The mollifiedimpulsemethodfor oscillatorydifferentialequations.In D. F. GriffithsandG. A. Watson,editors,NumericalAnalysis1997, pages111–123,London,1998.Pitman.

64

Page 73: Framework Design, Parallelization and Force Computation in

[46] A. Geist,A. Beguelin, J. Dongarra,W. Jiang,R. Manchek,and V. Sun-deram. PVM 3 usersguide and referencemanual. TechnicalManualORNL/TM-12187,OakRidgeNationalLaboratory, May 1994.

[47] H. Goldstein. Classical mechanics. Addison-Wesley, Reading,Mas-sachusetts,1980.

[48] L. GreengardandV. Rokhlin. A fastalgorithmfor particlesimulations.J.Comput.Phys., 73:325–348,1987.

[49] H. Grubmuller, H. Heller, A. Windemuth,andK. Schulten. GeneralizedVerlet algorithmfor efficient moleculardynamicssimulationswith long-rangeinteractions.MolecularSimulation, 6:121–142,1991.

[50] B. HafskjoldandL. Rekvig.ComparisonbetweenPROTOMOL andNEMDfor a simple Lennard-Jonessystem. Technical report, DepartmentofChemistry, Trondheim,NTNU, March2002.

[51] D. Hardy. Molecular DynamicsAPI. TheoreticalBiophysicsGroup,University of Illinois at Urbana-Champaign,405 North Matthews Av-enue, Urbana, IL 61801, 0.96 edition, October 1999. Download athttp://www.ks. ui uc. edu/ ˜d hardy /mdapi .

[52] R.H. HasseandV. V. Avilov. StructureandMandelungenergy of sphericalCoulombcrystals.Phys.Rev. A, 44(7):4506–4515,1991.

[53] J.Hermans,R.H. Yun,J.Leech,andD. Cavanaugh.SIGMA: SI-mulationsof MA-cromolecule.http://femto.med .u nc .e du/ SI GMA.

[54] R. W. Hockney andJ.W. Eastwood.ComputerSimulationUsingParticles.McGraw-Hill, New York, 1981.

[55] Y.-S. Hwang,R. Das,J. H. Saltz,M. Hodoscek, andB. R. Brooks. Par-allelizingmoleculardynamicsprogramsfor distributed-memorymachines.IEEEComputationalScience& Engineering, 2(2):18–29,Summer1995.

[56] B. Isralewitz, M. Gao,andK. Schulten.Steeredmoleculardynamicsandmechanicalfunctionsof proteins.Curr. OpinionStruct.Biol., 2001.InvitedArticle.

65

Page 74: Framework Design, Parallelization and Force Computation in

[57] J. A. Izaguirre. Longer Time Stepsfor MolecularDynamics. PhD thesis,Universityof Illinois at Urbana-Champaign,1999. Also UIUC TechnicalReportUIUCDCS-R-99-2107.

[58] J. A. Izaguirre. Generalizedmollified multiple time steppingmethodsformoleculardynamics. In A. Brandt, J. Bernholc,and K. Binder, editors,MultiscaleComputationalMethodsin ChemistryandPhysics, volume177of NATO ScienceSeries:SeriesIII ComputerandSystemsSciences, pages34–47.IOS Press,Amsterdam,Netherlands,Jan2001.

[59] J.A. Izaguirre,D. P. Catarello,J.M. Wozniak,andR. D. Skeel. Langevinstabilizationof moleculardynamics. J. Chem.Phys., 114(5):2090–2098,February1, 2001.

[60] J.A. Izaguirre,Q. Ma, T. Matthey, J.Willcock, T. Slabach,B. Moore,andG. Viamontes.Overcominginstabilitiesin Verlet-I/r-RESPA with themol-lified impulsemethod. In T. SchlickandH. H. Gan,editors,Proceedingsof 3rd InternationalWorkshopon Methodsfor MacromolecularModeling,volume24 of Lecture Notesin ComputationalScienceand Engineering,pages146–174.Springer-Verlag,Berlin, New York, 2002.

[61] J. A. Izaguirre,T. Matthey, J. Willcock, Q. Ma, B. Moore, T. Slabach,and G. Viamontes. A tutorial on the prototyping of multipletime stepping integrators for molecular dynamics. Available fromhttp://www.cse. nd. edu/ ˜l cl s/P ro to mol. htm l , 2001.

[62] J.A. Izaguirre,S.Reich,andR. D. Skeel. Longertimestepsfor moleculardynamics.J. Chem.Phys., 110(19):9853–9864,May 15,1999.

[63] J. A. Izaguirre,J. Willcock, T. Matthey, Q. Ma, T. B. Slabach,T. Stein-bach, S. Stender, G. F. Viamontes,and J. Mohnke. PROTOMOL: Anobjectorientedframework for moleculardynamics. Availableonline viahttp://www.cse. nd. edu/ ˜l cl s/P ro to mol. htm l , 2000–2002.

[64] R. E. Johnson. Frameworks. Seminar, Zuhlke Informatik, Zurich, May1996.

[65] R. E. Johnsonand B. Foote. Designingreusableclasses. J. of Object-OrientedProgramming, June/July1988.

66

Page 75: Framework Design, Parallelization and Force Computation in

[66] L. Kale, R. Skeel, R. Brunner, M. Bhandarkar, A. Gursoy, N. Krawetz,J. Phillips, A. Shinozaki,K. Varadarajan,and K. Schulten. NAMD2:Greaterscalability for parallel moleculardynamics. J. Comput.Phys,151(1):283–312,May 1, 1999.

[67] N. KarawasaandW. A. Goddard.Accelerationof convergencefor latticesums.J. Phys.Chem., 93:7320,1989.

[68] M. Kawata and M. Mikami. Accelerationof the canonicalmoleculardynamicssimulationby theparticlemeshEwaldmethodcombinedwith themultiple time-stepintegratoralgorithm. ChemicalPhysicsLetters, 313(1-2):261–266,1999.

[69] M. KawataandM. Mikami. Rapidcalculationof two-dimensionalEwaldsummation.ChemicalPhysicsLetters, 340(1-2):157–164,2001.

[70] M. Kawata and U. Nagashima. Particle meshEwald methodfor three-dimensionalsystemswith two-dimensionalperiodicity. ChemicalPhysicsLetters, 340(1-2):165–172,2001.

[71] J.KolafaandJ.W. Perram.Cutoff errorsin theEwaldsummationformulaefor point chargesystems.MolecularSimulation, 9:351–368,1992.

[72] Y. Komeiji. Ewald summationandmultiple time stepmethodsfor molec-ular dynamicssimulationof biological molecules. Journal of MolecularStructure (THEOCHEM), 530(3):237–243,2000.

[73] C.G.Lambert,T. A. Darden,andJ.A. Board.A multipole-basedalgorithmfor efficient calculationof forcesand potentialsin macroscopicperiodicassembliesof particles.J. Comp.Phys., 126(2):274–287,July1996.

[74] D. J.Langridge,J.F. Hart,andS.Crampin.Ewaldsummationtechniqueforone-dimensionalchargedistributions.ComputerPhysicsCommunications,134(1):78–85,2001.

[75] Andrew R. Leach. Molecular Modelling: Principles and Applications.Addison-Wesley Longman,Reading,Massachusetts,July1996.

[76] S.W. DeLeeuw, J.W. Perram,andE. R. Smith.Simulationof electrostaticsystemsin periodic boundaryconditions.I. lattice sumsand dielectricconstants.Proc.R.Soc.Lond., A 373:27–56,1980.

67

Page 76: Framework Design, Parallelization and Force Computation in

[77] E. Lindahl,B. Hess,andD. vanderSpoel.GROMACS3.0: A packageformolecularsimulationandtrajectory. JMM, 7(8):306–317,2001.

[78] T. Lippert,A. Seyfried, A. Bode,andK. Schilling. Hyper-systolicparallelcomputing.IEEE Trans.Paral. Distr. Syst., 9:97–108,1998.

[79] T. R. Littell, R. D. Skeel, and M. Zhang. Error analysisof symplecticmultiple time stepping.SIAMJ. Numer. Anal., 34(5):1792–1807,October1997.

[80] P. S. Lomdahl,P. Tamayo,N. Grønbech-Jensen,andD. M. Beazley. 50G-Flopsmoleculardynamicson theConnectionMachine. In Proceedingsof Supercomputing’93, pages520–527.IEEE ComputerSociety, 1993.

[81] A. P. Lyubartsev and A. Laaksonen. MDynaMix – a scalableportableparallelMD simulationpackagefor arbitrarymolecularmixtures.Comput.Phys.Commun., 128:565–589,2000.

[82] Q.MaandJ.A. Izaguirre.Targetedmollified impulsemethodfor moleculardynamics.Submittedto J. Chem.Phys., 2002.

[83] T. Matthey. ObjektorientierteGebaudemodellierung. Master’s thesis,Institute of ComputerScienceand Applied Mathematics,University ofBerne,1997.

[84] E. Merzbacher. QuantumMechanics. JohnWiley & Sons,3rd editionedition,1998.

[85] B. Meyer. Object-OrientedSoftware Construction. Prentice-Hall,2ndedition,1997.

[86] M. Nelson,W. Humphrey, A. Gursoy, A. Dalke,L. Kale, R. D. Skeel,andK. Schulten. NAMD – A parallel, object-orientedmoleculardynamicsprogram. Int. J. Supercomput.Applics. High PerformanceComputing,10(4):251–268,Winter 1996.

[87] D. Okunbor and R. Murty. Parallel molecular dynamicsusing forcedecomposition. In P. Deuflhard,J. Hermans,B. Leimkuhler, A. Mark,S. Reich,andR. D. Skeel, editors,ComputationalMolecular Dynamics:Challenges,Methods,Ideas, volume4 of Lecture Notesin ComputationalScienceand Engineering, pages483–494.Springer-Verlag, November1998.

68

Page 77: Framework Design, Parallelization and Force Computation in

[88] OpenMP-forum.OpenMP:A proposedindustrystandardAPI for sharedmemory programming.technical report. http://www.ope nmp. org ,October1997.

[89] J. W. Perram,H. G. Petersen,andS. W. de Leeuw. An algorithmfor thesimulationof condensedmatterwhichgrowsasthe gh powerof thenumberof particles.Mol. Phys., 65:875–893,1988.

[90] H. G.Petersen.Accuracy andefficiency of theparticlemeshEwaldmethod.J. Chem.Phys., 103(3668-3679),1995.

[91] S. Plimpton. Fastparallelalgorithmsfor short-rangemoleculardynamics.J. Chem.Phys., 117:1–19,1995.

[92] S.PlimptonandB. Hendrickson.A new parallelmethodfor moleculardy-namicssimulationof macromolecularsystems.J. Comp.Chem., 17(3):326,1996.

[93] E. L. Pollock and J. Glosli. Commentson PPPM, FMM, and theEwald methodfor large periodicCoulombicsystems.ComputerPhysicsCommunications, 95:93–110,1996.

[94] L. Hornekær. Single-andMulti-SpeciesCoulombIon Crystals:Structures,Dynamicsand SympatheticCooling. PhD thesis,University of Aarhus,Denmark,November2000.

[95] W. Rankin and J. Board. A portabledistributed implementationof theparallelmultipole treealgorithm. IEEE Symposiumon High PerformanceDistributedComputing, 1995.[DukeUniversityTechnicalReport95-002].

[96] W. T. Rankin. Efficient Parallel Implementationsof Multipole BasedN-BodyAlgorithms. PhDthesis,DukeUniversity, 1997.

[97] D. C. Rapaport.TheArt of Molecular DynamicsSimulation. CambridgeUniversityPress,1995.

[98] K. Refson.Moldy: A portablemoleculardynamicssimulationprogramforserialandparallelcomputers.Comput.Phys.Commun., 126(3):309–328,2000.

[99] J. Roth, F. Gahler, and H.-R. Trebin. A moleculardynamicsrun with5.180.116.000particles.Int. J. of ModernPhys.C, 11(2):317–322,2000.

69

Page 78: Framework Design, Parallelization and Force Computation in

[100] C. Sagui and T. Darden. Multigrid methodsfor classicalmoleculardynamicssimulationsof biomolecules. J. Chem.Phys., 114(15):6578–6591,2001.

[101] J. M. Sanz-Sernaand M. P. Calvo. NumericalHamiltonian Problems.ChapmanandHall, London,1994.

[102] J. P. Schiffer. Phasetransitionsin anisotropicallyconfinedionic crystals.PhysicalReview Letters, 70(6):818–821,1993.

[103] SGI. The Standard Template Library: Introduction.http://www.sgi. com/t ec h/ st l .

[104] J. Shimada,H. Kaneko, and T. Takada. Performanceof fast multipolemethodsfor calculatingelectrostaticinteractionsin biomacromolecularsimulations.J. Comp.Chem., 15(1):28,1994.

[105] J. G. Siek andA. Lumsdaine.The Matrix TemplateLibrary: A unifyingframework for numericallinear algebra. In InternationalSymposiumonComputingin Object-OrientedParallel, ParallelObjectOrientedScientificComputing.ECOOP, 1998.

[106] R. D. SkeelandD. J.Hardy. Practicalconstructionof modifiedHamiltoni-ans.SIAMJ. on Sci.Comput., 23(4):1172–1188,November2001.

[107] R. D. Skeel,I. Tezcan,andD. J.Hardy. Multiple grid methodsfor classicalmoleculardynamics.J. Comp.Chem., 23(6):673–684,March2002.

[108] G. D. Smith, C. Ayyagari, and D. Bedrov. Lu-cretius: A molecular dynamics simulation program, 1999.http://www.che. uta h. edu/ ˜g dsmit h/ mdco de/ main .h tml .

[109] L. Smith. Molecular sciencemodelling. Technical report, EdinburghParallelComputingCenter, TheUniversityof Edinburgh,UK, 1997.

[110] W. Smith andT. R. Forester. DL POLY 2.0: A general-purposeparallelmoleculardynamicssimulationpackage. J. Mol. Graphics, 14(3), June1996.

[111] J. Stone, J. Gullingsrud, and K. Schulten. A systemfor interactivemoleculardynamics. In 2001 Symposiumon Interactive 3D Graphics.ACM, 2001. In Press.

70

Page 79: Framework Design, Parallelization and Force Computation in

[112] C. Streit. BOOGA,ein Komponentenframework fur Grafikanwendungen.PhD thesis, Institute of ComputerScienceand Applied Mathematics,Universityof Berne,1997.

[113] B. Stroustrup.TheC++ ProgrammingLanguage. Addison-Wesley, thirdedition,1997.

[114] G. Sutmann. Classicaldynamics. In J. Grotendorst,D. Marx, andA. Muramatsu,editors, QuantumSimulationsof Complex Many-BodySystems:FromTheoryto Algorithms, volume10 of NIC, pages211–254,ForschungszentrumJulich, Germany, 2000.Johnvon NeumannInstitutefor Computing,NIC-Directors.

[115] H. Sutter. ExceptionalC++: 47 EngineeringPuzzles,ProgrammingProblems,andSolutions. Addison-Wesley, 2000.

[116] H. Tosuji,T. Kishimoto,C. Totsuji, andK. Tsuruta.Competitionbetweentwo formsof orderingin finite Coulombclusters.PhysicalReview Letters,88(12),March2002.

[117] A. Y. Toukmaji and J. A. Board Jr. Ewald summationtechniquesinperspective: A survey. ComputerPhysicsCommunications, 95:73–92,1996.

[118] M. Tuckerman,B. J. Berne,andG. J. Martyna. Reversiblemultiple timescalemoleculardynamics.J. Chem.Phys., 97(3):1990–2001,1992.

[119] W. F. van Gunsterenand H. J. C. Berendsen. Computersimulationof moleculardynamics: Methodology, applicationsand perspectives inchemistry. ChemieInt. Ed.Engl., 29:922–1023,1990.

[120] T. Veldhuizen.Blitz++: Thelibrary thatthinksit is acompiler. Conferencepresentation,Extreme! ComputingLaboratory, IndianaUniversity Com-puterScienceDepartment,Sep.1998.

[121] J. VincentandK. M. Merz. A highly portableparallelimplementationofAMBER usingtheMessageMassingInterfacestandard.J. Comp.Chem.,11:1420–1427,1995.

[122] Z. WangandC.Holm. Estimateof thecutoff errorsin theEwaldsummationfor dipolarsystems.J. Chem.Phys., 115(14):6351–6359,2001.

71

Page 80: Framework Design, Parallelization and Force Computation in

[123] Z. Wang,J. Lupo, A. McKenney, andR. Pachter. Large scalemoleculardynamicssimulationswith fastmultipoleimplementations.In Proceedingsof Supercomputing’99, Portland,Oregon,November1999.

[124] P. K. WeinerandP. A. Kollman. AMBER: Assistedmodelbuilding withenergy refinement.a generalprogramfor modelingmoleculesand theirinteractions.J. Comp.Chem., 2:287,1981.

[125] A. Windemuth.Advancedalgorithmsfor moleculardynamicssimulation:TheprogramPMD. In Timothy G. Mattson,editor, Parallel ComputinginComputationalChemistry, Washington,D.C.,1995.ACSBooks. In press.

[126] D. York andW. T. Yang. ThefastFourier–Poissonmethodfor calculatingEwald sums.J. Chem.Phys., 101:3298–3300,1994.

[127] M. Q. ZhangandR. D. Skeel. Symplecticintegratorsandtheconservationof angularmomentum.J. Comput.Chem., 16:365–369,March1995.

[128] R. Zhou andB. J. Berne. A new moleculardynamicsmethodcombiningthe referencesystempropagatoralgorithm with a fast multipole methodfor simulating proteins and other complex systems. J. Chem.Phys.,103(21):9444–9459,1995.

[129] R. Zhou, E. Harder, H. Xu, andB. J. Berne. Efficient multiple time stepmethodfor usewith EwaldandparticlemeshEwaldfor largebiomolecularsystems.J. Chem.Phys., 115(5):2348–2358,August1 2001.

72

Page 81: Framework Design, Parallelization and Force Computation in

A Publications

73

Page 82: Framework Design, Parallelization and Force Computation in
Page 83: Framework Design, Parallelization and Force Computation in

B Design

R2: Boundaryconditions VacuumBoundaryConditions(TBC) PeriodicBoundaryConditionsR3: Cell manager(CM) CubicCellManagerOn atompair OneAtomPair<TBC, TCM, TSF, TNF, useSF>(TOneAtomPair) OneAtomPairTwo<TBC, TCM, TSF1, TNF1,

TSF2, TNF2, useSF>OneAtomPairMollifyLJ <TBC, TCM, TSF, TNF,useSF>OneAtomPairMollifyT wo<TBC, TCM, TSF1,TNF1, TSF2, TNF2, useSF>

R4: Non-bondedforce CoulombForceGravitationForceLennardJonesForceMagneticDipoleForce

R5: Switchingfunction C1SwitchingFunction(TSF) C2SwitchingFunction

ComplementSwitchingFunction<TSF>CutoffSwitchingFunctionRangeSwitchingFunction<TSF>ShiftSwitchingFunctionUniversalSwitchingFunction

Table 5: Overview of policy choices for the non-bonded force algorithms (R1):cutoff (or truncation), simple-full (or direct) and full (encounters also pairs overdifferent unit MD cells). OneAtomPair computes one interaction pair, whereOneAtomPairTwo computes two potentials simultaneously. OneAtomPairMolli-fyLJ and OneAtomPairMollifyT wo are the corresponding interactions for MOLLYintegrators.

113

Page 84: Framework Design, Parallelization and Force Computation in

BondSystemForceBase

BondSystemForce<TBoundaryConditions>

AngleSystemForceBase

AngleSystemForce<TBoundaryConditions>

HapticSystemForceBase

HapticSystemForce

NonbondedFullSystemForceBase

NonbondedSimpleFullSystemForceBase

NonbondedSimpleFullSystemForce<TBoundaryConditions,

TCellManager,TOneAtomPair>

TCellManager,TOneAtomPair>

NonbondedCutoffSystemForce<TBoundaryConditions,

MagneticDipoleMirrorSystemForceBase

MagneticDipoleMirrorSystemForce<TBoundaryConditions>

NonbondedCutoffSystemForceBase

NonbondedFullEwaldSystemForceBase

Force

FrictionExtendedForce<TBoundaryConditions>

PaulTrapExtendedForceBase

PaulTrapExtendedForce<TBoundaryConditions>

MTorsionSystemForce<TBoundaryConditions>

DihedralSystemForceBase

DihedralSystemForce<TBoundaryConditions>

ImproperSystemForceBase

ImproperSystemForce<TBoundaryConditions>

NonbondedMultiGridSystemForceBase

NonbondedPMEwaldSystemForceBase

NonbondedMultiGridSystemForce<TBoundaryConditions,

NonbondedFullEwaldSystemForce<TBoundaryConditions,

TCellManager,real,reciprocal,correction,TSwitchingFunction>

TCellManager,TInterpolation,TKernel,pbc,direct,smooth>

NonbondedPMEwaldSystemForce<TBoundaryConditions,TCellManager,real,reciprocal,correction,TInterpolation,TSwitchingFunction>

FrictionExtendedForceBase

CompareForce SystemForce CompareTimeExtendedForce

SystemCompareForce ExtendedCompareForce SystemTimeForce ExtendedTimeForce

NonbondedFullSystemForce<TBoundaryConditions,

TCellManager,TOneAtomPair>

Figure 19: Overview design of forces.

114

Page 85: Framework Design, Parallelization and Force Computation in

preprocess();myActualForce−>parallelEvaluate(tp,r,f,e);

preprocess();myActualForce−>evaluate(tp,r,f,e);myActualForce−>mollify(tp,r,fout,e,fin);

preprocess();

postprocess(tp,fout,e);

preprocess();

postprocess(tp,fout,e); fout,e,fin);

new SystemCompareForce(actualForce,compareForce);

postprocess(tp,f,e);

if(tp−>parallel−next())evaluate(tp,r,v,f,e);

postprocess(tp,f,e);

mollify(tp,r,fout,e,fin);

"No mollify implementation"

myActualForce−>evaluate(tp,r,v,f,e);preprocess();

postprocess(tp,v,f,e);

myActualForce−>parallelMollify(tp,r,

ExtendedForce

makeCompareForce()evaluate()=0parallelEvaluate()

CompareForce

getForceObject()getForces()

preprocess()postprocess()

getEnergies()doMake()getKeyword()getIdString()getParameters()numberOfBlocks()

Force* myActualForceCompareForce* myCompareForceCoordinateBlock* myForcesEnergyStructure* myEnergies

Compute relative errors

myActualForce−>make(errMsg,values);

myActualForce−>numberOfBlocks();

Force

getKeyword()=0makeCompareForce()=0getParameters()=0getIdString()=0make()numberOfBlocks()doMake()=0

doMake(errMsg,values);

SystemForce

parallelMollify()mollify()

parallelEvaluate()evaluate()=0makeCompareForce()

if(tp−>parallel−next())evaluate(tp,r,v,f,e);

new ExtendedCompareForce(actualForce,compareForce);

SystemCompareForce

parallelEvaluate()mollify()parallelMollify()

evaluate()

ExtendedCompareForce

evaluate()parallelEvaluate()

preprocess();myActualForce−>parallelEvaluate(tp,r,v,f,e);postprocess(tp,v,f,e);

CompareForce(Force*, CompareForce*)

Figure 20: Top level design of forces.

115

Page 86: Framework Design, Parallelization and Force Computation in

if(tp−>parallel−>next())evaluate(tp,r,f,e);

parallelMollify()mollify()

SystemForce

evaluate(tp,r,f,e)=0parallelEvaluate(tp,r,f,e) "No mollify implementation"

HapticSystemForce

evaluate(tp,r,f,e)

AngleSystemForceBase

BondSystemForce<TBoundaryConditions>

evaluate(tp,r,f,e)calcBond(boundary,bond,r,f,e)calcBondEnergy(boundary,bond,r)

BondSystemForceBase

IMDElf myIMDElfCoordinateBlock* myHapticForces

MagneticDipoleMirrorSystemForce<TBoundaryConditions>

MagneticDipoleMirrorSystemForceBase

evaluate(tp,r,f,e)

Real myXsi, myR, myD, myHx, myHy, myHz, myTau

DihedralSystemForce<TBoundaryConditions>

evaluate(tp,r,f,e)

AngleSystemForce<TBoundaryConditions>

evaluate(tp,r,f,e)calcAngle(boundary,angle,r,f,e)calcAngleEnergy(boundary,angle,r)

HapticSystemForceBase

ImproperSystemForceBase

ImproperSystemForce<TBoundaryConditions>

evaluate(tp,r,f,e)

MTorsionSystemForce<TBoundaryConditions>

calcTorsion(boundary,torsion,r,f,e)calcTorsionEnergy(boundary,torsion,r)

DihedralSystemForceBase

Figure 21: Detailed design of bonded system forces.

116

Page 87: Framework Design, Parallelization and Force Computation in

if(tp−>parallel−>next())evaluate(tp,r,f,e);

"No mollify implementation"

for(i=0;i<r−size();i++)for(j=i+1;j<r−>size();j++)

myOneAtomPair.doOneAtomPair(i,j)

parallelMollify()mollify()

SystemForce

evaluate(tp,r,f,e)=0parallelEvaluate(tp,r,f,e)

NonbondedCutoffSystemForceBase

NonbondedCutoffSystemForce<TBoundaryConditions,

TCellManager,TOneAtomPair>

evaluate()parallelEvaluate()

Real myCutoffTOneAtomPair myOneAtomPair

mollify()doEvaluate()

NonbondedPMEwaldSystemForceBase

NonbondedFullEwaldSystemForce<TBoundaryConditions,

evaluate()parallelEvaluate()initialize()realTerm()reciprocalTerm()correctionTerm()

Real myAlpha

NonbondedMultiGridSystemForceBase

NonbondedMultiGridSystemForce<TBoundaryConditions,

evaluate()initialize()shortRangeTerm()longRangeTerm()correctionTerm()

MultiGrid<TInterpolation,TKernel,pbc,pbc,pbc> myGrid

NonbondedFullEwaldSystemForceBase

NonbondedSimpleFullSystemForce<TBoundaryConditions,

evaluate()parallelEvaluate()doEvaluate()

TOneAtomPair myOneAtomPairReal myCutoff

NonbondedSimpleFullSystemForceBase

NonbondedFullSystemForce<TBoundaryConditions,

NonbondedFullSystemForceBase

evaluate()parallelEvaluate()doEvaluate()

TOneAtomPair myOneAtomPair

TCellManager,TOneAtomPair>

myOneAtomPair.doOneAtomPair(i,j,l)

for(j=i+1;j<r−>size();j++)for all lattice vectors

for(i=0;i<r−size();i++)

for (; !enumerator.done(); enumerator.next())for(i=pair.first; i!=−1; i=tp−>atoms[i].cellListNext)

for(j=pair.second; j!=−1; j=tp−>atoms[j].cellListNext)myOneAtomPair.doOneAtomPair(i,j);

NonbondedPMEwaldSystemForce<TBoundaryConditions,

evaluate()parallelEvaluate()initialize()realTerm()reciprocalTerm()correctionTerm()

Grid<TInterpolation> myGridReal myAlpha

TCellManager,TInterpolation,TKernel,pbc,direct,smooth>TCellManager,real,reciprocal,correction,TInterpolation,TSwitchingFunction>

TCellManager,real,reciprocal,correction,TSwitchingFunction>

TCellManager,TOneAtomPair>

Figure 22: Detailed design of non-bonded system forces.

117

Page 88: Framework Design, Parallelization and Force Computation in

return new FrictionExtendedForce(f);Real f = values[0].getReal();

doEvaluate(tp,r,v,f,e,0,r−>size());

parameters.push_back(ParameterType(

"−k",VarValType::REAL,VarVal(myF)));

if(tp−>parallel−>next())evaluate(tp,r,v,f,e);

Real omegaR = values[0].getReal();Real omegaZ = values[1].getReal();return new PaulTrapExtendedForce(omegaR,omegaZ);

parameters.push_back(ParameterType("−omegaR",VarValType::REAL,VarVal(myOmegaR)));

parameters.push_back(ParameterType("−omegaZ",VarValType::REAL,VarVal(myOmegaZ)));

ExtendedForce

evaluate(tp,r,v,f,e)=0parallelEvaluate(tp,r,v,f,e)

for(int i=from;i<to;i++)(*f)[i] += (*v)[i]*myF;

return keyword;

for(unsigned int i = 0;i<count;i++){if(topo−>parallel−>next()){

int to = ...;int from = ...;doEvaluate(tp,r,v,f,e,from,to);

}

}

PaulTrapExtendedForceBase

static const string keyword

FrictionExtendedForce<TBoundaryConditions>

FrictionExtendedForce()FrictionExtendedForce(Real f)evaluate()parallelEvaluate()getParameters()doMake()numberOfBlocks()

Real myF

doEvaluate()getKeyword()

FrictionExtendedForceBase

static const string keyword

PaulTrapExtendedForce<TBoundaryConditions>

PaulTrapExtendedForce()PaulTrapExtendedForce(Real omegaR, Real omegaZ)evaluate()getParameters()

Real myOmegaZReal myOmegaR

doMake()getKeyword()

for(int i=0;i<r−>size();i++)...

Figure 23: Detailed design of extended forces.

118

Page 89: Framework Design, Parallelization and Force Computation in

Accumulate force and energy contributions

Check for exclusions

Compute force and energy contributions

Apply switching function, if necessary

Rough cutoff test, if necessary

Compute distance between atom pair considering boundary conditions

TSwitchingFunction switchingFunctionTNonbondedForce nonbondedForceFunctionTSwitchingFunction switchingFunctionTNonbondedForce nonbondedForceFunction

initialize()

getIdString()

TSwitchingFunction switchingFunction

Real mySwitchOn, myCutoffTNonbondedForce nonbondedForceFunctionTSwitchingFunction switchingFunction

calcLJHessianAndUpdateZi()calcLJHessianAndUpdateZi()doMake()getIdString()

doOneAtomPair(i,j)

OneAtomPairMollifyLJ

getParameters()

initialize()

OneAtomPairMollifyTwo

TNonbondedForce nonbondedForceFunctionTSwitchingFunction switchingFunctionTNonbondedForce nonbondedForceFunction

Real mySwitchOn, myCutoff

calcLJHessianAndUpdateZi()calcLJHessianAndUpdateZi()doMake()getIdString()

doOneAtomPair(i,j)getParameters()

initialize()

doOneAtomPair(i,j)

OneAtomPairTwo

Accumulate forces and energy contributions

TNonbondedForce nonbondedForceFunction

TNonbondedForce::doMake());

doOneAtomPair(i,j)getParameters()

Apply 2nd switching function on 2nd contribution, if necessary

doMake()

OneAtomPair

initialize()

return OneAtomPair(TSwitchingFunctionFirst::doMake(),

getParameters()getIdString()doMake()

TSwitchingFunction switchingFunction

TNonbondedForceSecond::doMake());

return OneAtomPair(TSwitchingFunction::doMake(),

nonbondedForceFunctionSecond.getParameters(parameters);

nonbondedForceFunction.getParameters(parameters);

switchingFunctionFirst.getParameters(parameters);

TNonbondedForceFirst::doMake(),TSwitchingFunctionSecond::doMake()

switchingFunction.getParameters(parameters);

nonbondedForceFunctionFirst.getParameters(parameters);

switchingFunctionSecond.getParameters(parameters);

Compute distance between atom pair considering boundary conditions

Rough cutoff test, if necessary

Compute 1st force and 1st energy contributions

Check for exclusions

Compute 2nd force and 2nd energy contributions

Apply 1st switching function on 1st contribution, if necessary

TBoundaryConditions

TSwitchingFunctionTCellManager

TNonbondedForceuseSwitchingFunction

useSwitchingFunctionFirst

TNonbondedForceSeconduseSwitchingFunctionSecond

TSwitchingFunctionSecond

TNonbondedForceFirstTSwitchingFunctionFirst

TBoundaryConditionsTCellManager

useSwitchingFunctionFirst

TNonbondedForceSeconduseSwitchingFunctionSecond

TSwitchingFunctionSecond

TNonbondedForceFirstTSwitchingFunctionFirst

TBoundaryConditionsTCellManager

TBoundaryConditions

useSwitchingFunctionTNonbondedForce

TCellManagerTSwitchingFunction

Figure 24: Detailed design of TOneAtomPair , evaluating one interaction pair ortwo simultaneously.

119

Page 90: Framework Design, Parallelization and Force Computation in

C Code

#include "ExtendedForce.h"#include "PaulTrapExtendedForceBase.h"

template<class TBoundaryConditions>class PaulTrapExtendedForce : public ExtendedForce,

private PaulTrapExtendedForceBase {public:

PaulTrapExtendedForce():myOmegaXY(0.0),myOmegaZ(0.0){} ;PaulTrapExtendedForce(Real omeagXY, Real omegaZ):myOmegaXY(omegaXY),myOmegaZ(omegaZ){};

public: // From class SystemForcevirtual void evaluate(const GenericTopology* topo,

const CoordinateBlock* positions,const CoordinateBlock* velocities,CoordinateBlock* forces,EnergyStructure* energies){

const TBoundaryConditions &boundary =(dynamic_cast<const SemiGenericTopology<TBoundaryConditions>& >(*topo)).boundaryConditions;

Real e = 0.0;for(int i=0;i<topo->atoms.size();i++){

Real c = topo->atomTypes[topo->atoms[i].type].mass;Coordinates pos(boundary.basisPosition((*positions)[i]));Coordinates f(c*myOmegaXY*myOmegaXY*pos.x,

c*myOmegaXY*myOmegaXY*pos.y,c*myOmegaZ*myOmegaZ*pos.z);

(*forces)[i] -= f;e += 0.5*c*(myOmegaXY*myOmegaXY*(pos.x*pos.x+pos.y*pos.y)+

myOmegaZ*myOmegaZ*pos.z*pos.z);}energies->otherEnergy += e;

}

public: // From class Forcevirtual string getKeyword() const{return keyword;}virtual string getIdString() const{return keyword;}virtual void getParameters(vector<ParameterType>& parameters) const{

parameters.push_back(ParameterType("-omegaXY",VarValType::REA L,VarVal(myOmegaXY)));

parameters.push_back(ParameterType("-omegaZ",VarValType::REAL ,VarVal(myOmegaZ)));

}virtual unsigned int getParametersCount() const{return 2;}

protected:virtual Force* doMake(string&, const vector<VarVal>&) const{

Real omegaXY = values[0].getReal();Real omegaZ = values[1].getReal();return new PaulTrapExtendedForce(omegaXY,omegaZ);

}

private: // My data membersReal myOmegaXY;Real myOmegaZ;

};

Program 13: Detailed implementation of a Paul Trap attraction.

120

Page 91: Framework Design, Parallelization and Force Computation in

#include "ExtendedForce.h"#include "FrictionExtendedForceBase.h"

template<class TBoundaryConditions>class FrictionExtendedForce : public ExtendedForce, private FrictionExtendedForceBase {

public:FrictionExtendedForce():myF(0.0){};FrictionExtendedForce(Real f):myF(f){};

private:void doEvaluate(const GenericTopology* topo, const CoordinateBlock* positions,

const CoordinateBlock* velocities, CoordinateBlock* forces,EnergyStructure* energies, int from, int to){

for(int i=from;i<to;i++)(*forces)[i] += (*velocities)[i]*myF;

}

public: // From class SystemForcevirtual void evaluate(const GenericTopology* topo, const CoordinateBlock* positions,

const CoordinateBlock* velocities, CoordinateBlock* forces,EnergyStructure* energies){

doEvaluate(topo,positions,velocities,forces,energies,0,p ositions- >size());}

virtual void parallelEvaluate(const GenericTopology* topo, const CoordinateBlock* positions,const CoordinateBlock* velocities, CoordinateBlock* forces,EnergyStructure* energies){

int n = positions->size();int num = topo->parallel->getAvailableNum();int count = 0;if(num >= n) count = n;else if(num*2 <= n) count = 2*num;else count = num;

for(int i = 0;i<count;i++){if(topo->parallel->next()){int to = (n*(i+1))/count;

if(to > (int)n) to = n;int from = (n*i)/count;doEvaluate(topo,positions,velocities,forces,energies,from,to) ;

} } }

public: // From class Forcevirtual string getKeyword() const{return keyword;}virtual string getIdString() const{return keyword;}virtual void getParameters(vector<ParameterType>& parameters) const{

parameters.push_back(ParameterType("-k",VarValType::REAL ,VarVal(m yF)));}virtual unsigned int getParametersCount() const{return 1;}virtual void numberOfBlocks(const GenericTopology* topo,

const CoordinateBlock* pos,unsigned int& n,unsigned int& step){

int count = pos->size();int num = (unsigned int)topo->parallel->getAvailableNum();if(num >= count)

n = count;else if(num*2 <= count)

n = 2*num;else

n = num;step = 1;

}protected:

virtual Force* doMake(string&, const vector<VarVal>&) const{Real f = values[0].getReal();return new FrictionExtendedForce(f);

}

private: // My data membersReal myF;

};

Program 14: Detailed implementation of a velocity dependent friction.

121

Page 92: Framework Design, Parallelization and Force Computation in

D Gallery

D.1 Biomolecules

Thefollowing picturesweregeneratedwith PROTOMOL andraytraced25 with thegraphicsframework BOOGA[3, 17, 112].

Figure 25: Bovine pancreatic trypsin inhibitor (BPTI) without water, 898 atoms;a 58 amino acid polypeptide that adopts a tertiary fold comprising two strands ofantiparallel beta sheet and two short segments of alpha helix.

25The raytracemethodis a well-known methodin computergraphicsto generaterealisticpictureswith shadow, reflectionandtextures.

122

Page 93: Framework Design, Parallelization and Force Computation in

Figure 26: Alanin, 66 atoms. A nonessential amino acid, alpha-aminopropanoicacid, occurring in proteins.

Figure 27: 141-molecule water system.

123

Page 94: Framework Design, Parallelization and Force Computation in

Figure 28: Apolipoprotein (ApoA1) without water, 27,850 atoms, P02647.

124

Page 95: Framework Design, Parallelization and Force Computation in

D.2 IM200 – Labcoursein Computational Science

During spring2002, IM200 (a laboratorycoursein computationalscience)wasgiven at the Departmentof Informaticsat the University of Bergen. The mainpurposeof this coursewas to simulateand study scientific problemswith thehelp of software packages.One of the exerciseswas basedon the frameworkPROTOMOL.

Thepurposeof theexercisewasto investigatethestructureanddynamicsofso-calledions trappedin anelectricandmagneticfield (Paul Trap). Ions in freespacerepeleachotherdueto Coulombforcerepulsionandexpandto infinity. Ina trap, an additionalattractive potentialkeepsthe systembounded.For certainparametersof thepotential,themostfavorableenergetic configuration(thestatewith lowestenergy) of theionsorganizethemselvesin alinearstring.Suchastringis at presentoneof the mostpromisingcandidatesfor implementinga quantumprocessor, which have thepotentialof solvingcertainclassicalexponentialprob-lemsin lineartime.

During thisexercise,thestudentsperformedseveralsimulationswith PROTO-MOL andwereaskedto solve thefollowing problems:i Problem1

– For atwo particles,calculatetheequilibriumdistancebetweenthetwoCajk?l ions. monqp<rtsurEv , w!nyxEsur .

– Verify your calculationby runningPROTOMOL with the two-ion ex-ample.DownloadVMD to visualizethebehavior.

– Setup a seriesof simulationswith increasingnumberof ionsandrunPROTOMOL until theinitial stringformsabulk.i Problem2

– Find the right Paul Trap parameterssuchthat up to 41 Cajk?l ions aconvergenceto a string is seen,where42 or more ions form a zig-zagline or a bulk. Down load andusethe systemfiles (PSF, XYZ-positionsand PAR) for 41 and 42 ions. For the configurationfilesset z|{�n xCr�}�~ l�� fs }�~�� , temperatureof xCr�}�� K and usethe Leapfrogintegratoranda friction force.i Problem3

125

Page 96: Framework Design, Parallelization and Force Computation in

– Therealscientificinterestis in largenumbersof particles.Runsometestfor differentvaluesof � anddevelopa modelfor the time com-plexity for thecode.Assumeyou got thenew IBM regattasystematyour disposalfor aweek,how largesystemcanyousimulate?

– If we increasethenumberof ionsbeyond thebreak-over from stringto bulk otherformswill appear. Setthenumberof ionsto 10000andcomputeuntil astableform is achieved.How doesthisform look like?

– Now the problemhasbecomecomputationallyexpensive. Thusweneedfast force calculationandefficient time integration. Play withdifferentforcecalculationandtime integratorsandfind onethatbal-anceefficiency with accuracy well.

A detaileddescriptioncanbefoundat

http://www.ii. ui b.n o/ ˜mat th ey/ im 200/ 02/

126

Page 97: Framework Design, Parallelization and Force Computation in

D.3 Magnetic holes– Ugelstadspheres

Ugelstadspheresin a magneticfluid and an appliedmagneticfield createananalogyto Archimedeslaw. The spheresbecomemagneticholes and act asmagneticdipoleswith adipolemomentin theoppositewayof themagneticfield.

Figure 29: 15 Ugelstad spheres (by courtesy of Andreas Hellesøy).

The strengthof the dipole �� is governedby the volumedisplacementof theferrofluid, thestrengthof theexternalmagneticfield andthesusceptibilityof theferrofluid:

�� ny����������� �� s (51)

������� is theeffectivesusceptibilityasaresultof thegeometryof theUgelstadbead,a sphere.Theforceon a dipole � interactingwith adipole � of equalstrengthanddirectionis givenby

��� nd� �]��� � ��� ���� ¢¡H£¥¤ �¦ �§�¦�¨�§� © x�ª   ��_ �¡H£«¤ �¦ �¬� £?�­¤ �¦ �§�¦E®�§� �/¯ ��¥ �¡H£¥°±  ��_ �¡H£¥° �¦ �¬� £¦�¨�§� ² (52)

127

Page 98: Framework Design, Parallelization and Force Computation in

In experimentscomparingwith the theoreticalresults,the spheresandthe mag-netic fluid areconfinedbetweentwo glass-plates.The boundaryconditionsbe-tweentheferrofluid andtheglass-platescanbesatisfiedby introducingso-calledmirror-dipoles.An infinite seriesof imaginarydipolesof Ugelstadspheresmirror-imageswould appear, if the glass-plateswere mirrors satisfying the boundaryconditions.

Thestrengthof themirror-imagesdecreasesas ³ � , where� is theimagenumberand ³cn �|���� ©U´|µ (53)

and ��� is thesusceptibilityof theferrofluid.This work wascarriedout by AndreasHellesøy26 for his Master’s degreeand

aspartof theNOTUR TechnologyTransferProjectNo. 5, fundedin partby theNorwegianResearchCouncil.More detailscanbefoundat

http://www.fi. ui b.n o/ ˜h el le soy /T TP5. ht ml

26Departmentof Physics,Universityof Bergen.

128

Page 99: Framework Design, Parallelization and Force Computation in

D.4 Coulomb crystals

Ion CoulombCrystalsarethesolidstateof plasmacontainingonly particlesof thesamesignof charge. One-componentplasma(OCP’s) consistonly of onesingleion species(Figure30). OCP’s at low temperatureshavebeenstudiedintensivelytheoreticallyandduring the last 10 yearsalsoexperimentallywith lasercooledionsin traps[52, 94, 102,116].

Thepotentialis asumof electrostaticrepulsionsandtrapattractions

¶ n ·��'¸ ~�¹ ��º¸t� ¶ electrostatic�§� © ·� �'¸ ~� x´ m � z �»+¼  �½ �� ©�¾ �� £ © x´ m � z �{M¿ �� ² µ �¦ � n  �½ � µ ¾ � µ ¿ � £HÀ s

(54)

Here, �¦ � is thecoordinateof particle � . Thisongoingprojectis acollaborationwithJanPetterHansen27 andMichaelDrewsen28 who providesexperimentaldata.Weinvestigatefor increasingnumbersof ionswhetherthey convergeto shellstructureor not (Figures31,35-38),especiallyfor bi-crystals(Figure 32).

¶ electrostatic isevaluatedwith the multi-grid method. For the larger simulationsthe multi-gridmethodis about100-300times fasterwith a maximumrelative force error oforder xCr }�Á comparedto thedirectmethod.For theconvergenceto shellor latticestructurethis accuracy is appropriated.If needed,onemay run somefew finalstepswith amoreaccuratemethod.It is partof theNOTUR TechnologyTransferProjectNo. 5, fundedin partby theNorwegianResearchCouncil.

27Departmentof Physics,Universityof Bergen.28The ion Trap group, Institute of Physics and Astronomy, University of Aarhus.

http://www.ifa.au.dk/iontrapgroup/

129

Page 100: Framework Design, Parallelization and Force Computation in

Figure 30: Front hemisphere of the outer shell of a 20,288 Ca jk?l system.

130

Page 101: Framework Design, Parallelization and Force Computation in

0 50 100 150 2000

1

2

3

4

5

6

7Radial distribution

µm

a

Figure 31: Radial distribution of a system with 20,288 Ca jk?l .

131

Page 102: Framework Design, Parallelization and Force Computation in

Figure 32: Front hemisphere of the outer shell of a 10,144 Ca jk?lCÂ�ÃCÄ and 10,144A � jÅ l-Â�ÆCÄ system.

132

Page 103: Framework Design, Parallelization and Force Computation in

Figure 33: Front hemisphere of the outer shell (250.4-252.2 Ç m) of a 10,144Ca jk?lMÂ�ÃCÄ and 10,144 A � jÅ l-Â�ÆCÄ system.

133

Page 104: Framework Design, Parallelization and Force Computation in

Figure 34: Front hemisphere of the outer shell (242.2-250.4 Ç m) of a 10,144Ca jk?lCÂ�ÃCÄ and 10,144 A � jÅ l¥Â�ÆCÄ system.

134

Page 105: Framework Design, Parallelization and Force Computation in

0 50 100 150 200 2500

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5Radial distribution

µm

a

10144 Ca+

10144 A2+

10144 Ca+ + 10144 A2+

Figure 35: Radial distribution of a system with 10,144 Ca jk?l and 10,144 A � jÅ l .

135

Page 106: Framework Design, Parallelization and Force Computation in

0 50 100 150 200 250 300 3500

1

2

3

4

5

6Radial distribution

µm

a

Figure 36: Radial distribution of a system with 100,000 Ca jk?l .

136

Page 107: Framework Design, Parallelization and Force Computation in

0 50 100 150 200 250 300 350 4000

0.5

1

1.5

2

2.5

3Radial distribution

µm

a

50000 Ca+

50000 A2+

50000 Ca+ + 50000 A2+

Figure 37: Radial distribution of a system with 50,000 Ca jk?l and 50,000 A � jÅ l .

137

Page 108: Framework Design, Parallelization and Force Computation in

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

2

2.5

3

3.5

4Radial distribution

µm

a

Figure 38: Radial distribution of a system with 2,057 Ca jk?l .

138