


A High-Performance Parallel-Generalized Born Implementation Enabled by Tabulated Interaction Rescaling

PER LARSSON, ERIK LINDAHL

Center for Biomembrane Research, Department of Biochemistry & Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden

Received 19 October 2009; Revised 22 January 2010; Accepted 27 February 2010

DOI 10.1002/jcc.21552

Published online 6 May 2010 in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Implicit solvent representations in general, and generalized Born models in particular, provide an attractive way to reduce the number of interactions and degrees of freedom in a system. The instantaneous relaxation of the dielectric shielding provided by an implicit solvent model can be extremely efficient for high-throughput and Monte Carlo studies, and a reduced system size can also remove a lot of statistical noise. Despite these advantages, it has been difficult for generalized Born implementations to significantly outperform optimized explicit-water simulations, due to more complex functional forms and the two extra interaction stages necessary to calculate Born radii and the derivative chain rule terms contributing to the force. Here, we present a method that uses a rescaling transformation to make the standard generalized Born expression a function of a single variable, which enables an efficient tabulated implementation on any modern CPU hardware. The total performance is within a factor 2 of simulations in vacuo. The algorithm has been implemented in Gromacs, including single-instruction multiple-data acceleration, for three different Born radius models and corresponding chain rule terms. We have also adapted the model to work with the virtual interaction sites commonly used for hydrogens to enable long time steps, which makes it possible to achieve a simulation performance of 0.86 µs/day for BBA5 with a 1-nm cutoff on a single quad-core desktop processor. Finally, we have also implemented a set of streaming kernels without neighborlists to accelerate the non-cutoff setup occasionally used for implicit solvent simulations of small systems.

© 2010 Wiley Periodicals, Inc. J Comput Chem 31: 2593–2600, 2010

Key words: generalized born; tabulation; molecular dynamics

Introduction

Most molecular dynamics simulations today are performed with atomic detail for the solvent water, but implicit solvent models have long been available as an alternative. While some details might require atomic-level descriptions, the general electrostatic environment of an ionic solvent can be modeled quite accurately with the continuum Poisson-Boltzmann equation (see, e.g., reviews by Koehl [1] or Roux and Simonson [2]). Because of the high computational cost of Poisson-Boltzmann electrostatics, there has been significant work on simpler approximations, in particular generalized Born models [3] that estimate screening from the degree of burial of atoms. These models reduce the number of particles significantly when compared with explicit-solvent simulations, but still provide reasonably efficient and accurate representations of the electrostatic effects from the solvent environment. In particular, modern generalized Born models have been quite successful in reproducing macromolecular properties [4-6].

While there are certain limitations to implicit models—few if any continuum descriptions will, for instance, be able to model salt bridges or hydrogen bond network details—there are also many interesting advantages. For instance, the electrostatic shielding is instantaneous rather than requiring thousands of simulation steps for water molecules to relax, which is highly useful for high-throughput screening or Monte Carlo methods. In addition, the absence of viscosity can allow the solute to quickly explore conformational space (at the cost of unphysical dynamics), and potential artifacts from periodic boundary conditions are removed. Finally, since water can account for >90% of the atoms in explicit-solvent simulations, the implicit solvent alternative has far fewer degrees of freedom, which significantly reduces the thermal fluctuations in energy terms. Paradoxically, this means the less detailed model can provide better energy averages, which is critical, e.g., in replica-exchange simulations.

Correspondence to: E. Lindahl; e-mail: [email protected]

Contract/grant sponsor: European Research Council; contract/grant number: 209825

Contract/grant sponsors: Swedish Foundation for Strategic Research, and the Swedish Research Council

Contract/grant sponsor: NSF; contract/grant number: CNS-0619926

© 2010 Wiley Periodicals, Inc.


2594 Larsson and Lindahl • Vol. 31, No. 14 • Journal of Computational Chemistry

Another motivation for using implicit solvent has been to extend simulation time scales beyond what is possible with explicit solvent. In this case, it is typically used with stochastic dynamics to mimic solvent viscosity, but it has also recently been combined with Monte Carlo sampling [7, 8]. This type of generalized Born simulation has been used quite successfully to study folding of multiple small proteins and peptides [9-13]. There are several other applications in the literature, including the possibility to study protein aggregation, for example using oligopeptides [14], calculation of various aspects of free energies such as hydration and solvation [15-17], simulations of biomolecules at constant pH or performing pKa calculations [18, 19] (sometimes even in conjunction with protein aggregation [20]), as well as continuing studies of the folding properties and energetic landscapes of small proteins [21]. Studies of the latter kind include dynamics of the avian-flu influenza viruses [22], structure prediction of peptides [23], and loop conformations [24], just to mention a few.

However, despite the theoretically lower number of interactions, it has frequently been difficult for generalized Born models to keep up with the absolute performance of explicit solvent simulations. The functional form is more complex with an exponential term, the Born radii have to be calculated every step, and there is an extra chain rule term in the force. Altogether, this requires three full neighborlist traversals for each force evaluation step, compared with a single traversal for classical interactions.

Generalized Born simulations are also harder to parallelize efficiently due to the extra calculation stages, highly inhomogeneous particle and interaction densities, and because the interactions simply account for a smaller fraction of the runtime. As an example, we can use the 24-residue BBA5 protein and our codebase Gromacs: with 3000 explicit solvent waters and reaction-field electrostatics, we reach 143 ns/day on a single desktop (Nehalem 2.8 GHz quad-core) using MPI, and as long as our implicit solvent performance could not surpass this, it was of somewhat limited value for simulations where we wanted an accurate representation of the dynamics. Below, we present a rescaling approach to speed up generalized Born electrostatics, and a highly tuned implementation that both breaks the previous impasse to achieve a performance of 0.86 µs/day for BBA5, and enables large-scale parallel implicit solvent simulation in Gromacs.

Theory

Using an explicit representation of the solvent molecules, the solvation free energy G_solv is described by the force-field-mediated interactions between solvent and solute. Alternatively, we can opt for an implicit description of the solvent environment by using the generalized Born approximation to continuum electrostatics. In this case, following Still [3], the free energy G_solv is the sum of three terms: a solvent–solvent cavity term (G_cav), a solute–solvent van der Waals term (G_vdw), and, finally, a solvent–solute electrostatic polarization term (G_pol). As the sum of G_cav and G_vdw corresponds to the (nonpolar) free energy of solvation for a molecule from which all charges have been removed, it is commonly called G_np, and it is calculated from the total solvent accessible surface area multiplied by a surface tension. The total expression for the solvation free energy then becomes:

Table 1. Computational Cost (Cycles).

Function    gcc    icc
exp         100     61
log         141     68
sqrt         62     39
full GB     186    125

Number of clock cycles required for single calls to standard C-language math routines for both gcc (version 4.3.3) and icc (version 11.0). The reported numbers are averages from 10^9 evaluations. In addition, the number of cycles required to calculate the full generalized Born equation is included. For comparison, the table lookup interaction uses about 40 cycles.

G_solv = G_np + G_pol    (1)

Under the generalized Born formalism, G_pol is calculated from the generalized Born equation [3]:

G_pol = Σ_{i=1}^{n} Σ_{j>i} q_i q_j / sqrt( r_ij^2 + b_i b_j exp( −r_ij^2 / (4 b_i b_j) ) )    (2)

While this expression offers several advantages as indicated in the introduction, it is not uncomplicated from a computational point of view. Each interaction requires two divisions, one square root and, in particular, one exponential; all operations that can be fairly expensive to calculate on modern CPUs (Table 1). There are already several good implementations of eq. (2) available [25-27], but in the current form, it would only be marginally faster than the optimized water–water interaction calculations already used in Gromacs [28, 29]. As explicit solvent arguably represents a more accurate (or at least more detailed) picture, it would be natural to opt for the explicit solvent if the reachable time scales are roughly the same. The aim of this work has been to investigate ways to break the computational complexity of eq. (2) and, as a consequence, speed up the evaluation of eq. (1) to allow accurate simulations of different proteins on the multi-µs timescale. In principle, there are many ways in which this can be achieved. As the most expensive part of eq. (2) is the exponential (Table 1), one can try to simplify that expression alone. One approach would be to create a tabulated exponential function that is used in a first step, possibly with lower accuracy to save cycles. It is not possible to tabulate the entire expression as it stands, since it is a function of three variables (r_ij, b_i, b_j) and would require a prohibitively large table data structure. In addition, care must be taken when introducing approximations, tables in particular, since they may lead to loss of accuracy in the calculations, and the extra memory operations can reduce performance. Nevertheless, while a tabulated Coulomb potential is slower than the analytical form of a simple cutoff, it is typically faster for complex interaction forms such as Ewald. For this reason, we chose to investigate (a) whether it is possible to tabulate the entire generalized Born expression without significant loss of accuracy and (b) whether the performance of such an implementation could be pushed to significantly surpass current explicit-solvent simulations. Our new formulation is based on a transformation of variables described below. The resulting algorithm has been implemented in the molecular simulation package Gromacs, together with highly optimized routines to calculate Born radii for the common Still [3], HCT [30], and OBC [31] models.
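A rough back-of-the-envelope estimate (our own illustration, not from the paper) makes the size argument concrete: a dense three-dimensional table over (r_ij, b_i, b_j) is compared with a one-dimensional table over a rescaled variable, using the 500 points/nm Coulomb-table resolution and the table ranges quoted later in the text.

```python
def table_bytes(points_per_dim, dims, bytes_per_entry=4):
    """Memory for a dense lookup table with the given number of points
    per dimension (single-precision entries by default)."""
    return points_per_dim ** dims * bytes_per_entry

# Assume 500 points/nm over a 2 nm range: 1000 points per dimension.
three_d = table_bytes(1000, 3)   # dense table over (r_ij, b_i, b_j)

# The 1-D alternative: 50 points per unit length up to x = 50.
one_d = table_bytes(50 * 50, 1)

print(three_d)  # 4000000000 bytes, i.e. 4 GB
print(one_d)    # 10000 bytes, i.e. 10 kB
```

Under these (assumed) resolutions, the three-dimensional table needs gigabytes while the one-dimensional table fits comfortably in L1 cache, which is the whole point of the variable transformation.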

Methods

Table Implementation

To be able to tabulate the generalized Born equation, we start again with eq. (2). This expression is a function of three variables, namely r_ij, the distance between atoms i and j, and their effective Born radii b_i and b_j, respectively. Such a function is unsuitable to tabulate, both because of the size of three-dimensional table data and the look-up time (cache misses) for three different dimensions (b_i, b_j, and r_ij). To circumvent this, we first introduce a pre-calculation for all Born radii to store an array

c_i = 1 / sqrt(b_i)    (3)

This makes it possible to introduce a cheap transformation to a new variable x when evaluating each interaction, such that

x = r_ij / sqrt(b_i b_j) = r_ij c_i c_j    (4)

which already transforms away some of the computationally expensive parts of the generalized Born equation [eq. (2)] into a function of one variable, ξ(x), where

ξ(x) = 1 / sqrt( x^2 + exp(−x^2/4) )    (5)

To stress the functional form, we have made this a dimensionless expression; the actual polarization energy will also contain a simple factor with the charge and Born radii as well as an electrostatic factor. This function is theoretically possible to tabulate; the dependence on the single variable x is necessary for efficient tabulation but not sufficient: if the function is not smooth enough, the tabulation would introduce too large errors. In this case, however, no such problems exist. Plotting ξ(x) along with its derivative (Fig. 1) shows that the function is indeed smooth enough to be suitable for tabulation; in fact, it is much smoother than standard electrostatics. In the end, the full expression including the extra factors becomes:

G_pol = (1 − 1/ε) Σ_{i=1}^{n} Σ_{j>i} [ q_i q_j / sqrt(b_i b_j) ] ξ(x)
      = (1 − 1/ε) Σ_{i=1}^{n} (q_i c_i) Σ_{j>i} (q_j c_j) ξ(x)    (6)
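As a sanity check on the algebra, the following sketch (our own, with arbitrary example values) verifies that the rescaled pairwise term of eq. (6) reproduces the direct pairwise term of eq. (2); the dielectric prefactor (1 − 1/ε) is common to both forms and omitted here.

```python
import math

def gb_pair_direct(q_i, q_j, r, b_i, b_j):
    """Pairwise term of eq. (2): q_i q_j / sqrt(r^2 + b_i b_j exp(-r^2/(4 b_i b_j)))."""
    return q_i * q_j / math.sqrt(r * r + b_i * b_j * math.exp(-r * r / (4.0 * b_i * b_j)))

def xi(x):
    """Dimensionless rescaled interaction, eq. (5)."""
    return 1.0 / math.sqrt(x * x + math.exp(-x * x / 4.0))

def gb_pair_rescaled(q_i, q_j, r, b_i, b_j):
    """Same term via eqs. (3)-(6): (q_i c_i)(q_j c_j) xi(x), with c = 1/sqrt(b)."""
    c_i, c_j = 1.0 / math.sqrt(b_i), 1.0 / math.sqrt(b_j)  # eq. (3), once per atom
    x = r * c_i * c_j                                      # eq. (4), two multiplications
    return (q_i * c_i) * (q_j * c_j) * xi(x)

# The two formulations agree to machine precision:
print(gb_pair_direct(0.4, -0.3, 0.5, 0.12, 0.2))
print(gb_pair_rescaled(0.4, -0.3, 0.5, 0.12, 0.2))
```

Only xi(x) has to come from the table at run time; the per-atom (q_i c_i) products can be precomputed together with the Born radii.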

When calculating the Born radii, we simultaneously compute the inverse square root of the radii (once per atom). This avoids calculating the c_i factors again in the force calculation, which saves a bit of time in the kernels (sqrt() in Table 1). Calculating the table look-up variable then only takes two extra multiplications, x_ij = r_ij c_i c_j. The expression can easily be reformulated to tabulate against the square distance r_ij^2. This is used in some tabulated codes as it avoids another square root operation, but as it leads to slightly reduced accuracy when compared with analytical forms at short distances, we have chosen not to use it here.

Figure 1. The rescaled (dimensionless) interaction function ξ(x) and its negative derivative corresponding to the force. Only the range up to x = 10 is shown, but the table data is typically calculated at least up to x = 50. Note the relative smoothness of both the function and derivative, which makes them easy to tabulate.

The density required for the table function depends on the requested accuracy. In Gromacs, tabulated potentials have long been available for ordinary Coulomb interactions, both in single and double precision. Those tables typically have a resolution of 500 points per nm in single precision and 2000 in double precision. This gives a maximum relative error in the potential of 10^-6 in single precision when compared with the analytical formula for Coulomb interactions. For the generalized Born table, the aim was to achieve errors of the same magnitude compared with the analytical generalized Born formulation [eq. (2)]. As the new table variable is the dimensionless r_ij/sqrt(b_i b_j), we cannot automatically use the same number of points. Fortunately, due to the smoothness of the function ξ(x), it is sufficient to use as few as 50 points per unit length in single precision and 200 in double. From Figures 2 and 3, it is evident that both the relative and absolute single precision errors are small, on the order of 10^-8 for the potential and 10^-7 for the forces. If anything, one could consider reducing the table density slightly to improve performance even further. The larger error for very small values of x (<0.002) is not a problem, as the table will never be accessed in this regime. Concretely, this would only happen if two atoms separated by 0.1 nm had Born radii of 50 nm each; in common force fields, this is an unphysical situation that does not occur (the Lennard-Jones potential would become repulsive well before 0.1 nm).
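The quoted table accuracy is easy to reproduce in a few lines. The sketch below (our own minimal version; the actual Gromacs table layout differs) builds a cubic Hermite table for ξ(x) from tabulated values and analytical derivatives at 50 points per unit length, then measures the worst relative error over a grid that avoids the tiny-x regime never accessed in practice.

```python
import math

def xi(x):
    """Eq. (5): the rescaled generalized Born interaction."""
    return 1.0 / math.sqrt(x * x + math.exp(-x * x / 4.0))

def dxi(x):
    """Analytical derivative of xi(x) = g(x)^(-1/2), with g = x^2 + exp(-x^2/4)."""
    g = x * x + math.exp(-x * x / 4.0)
    dg = 2.0 * x - 0.5 * x * math.exp(-x * x / 4.0)
    return -0.5 * dg * g ** -1.5

H = 1.0 / 50.0        # 50 points per unit length, as in the text
N = 2500              # table extends to x = 50, the range quoted above
xs = [i * H for i in range(N + 1)]
vals = [xi(x) for x in xs]
ders = [dxi(x) for x in xs]

def xi_table(x):
    """Cubic Hermite interpolation from the tabulated value/derivative pairs."""
    i = int(x / H)
    t = (x - xs[i]) / H
    h00 = (1.0 + 2.0 * t) * (1.0 - t) ** 2
    h10 = t * (1.0 - t) ** 2
    h01 = t * t * (3.0 - 2.0 * t)
    h11 = t * t * (t - 1.0)
    return (h00 * vals[i] + h10 * H * ders[i]
            + h01 * vals[i + 1] + h11 * H * ders[i + 1])

# Worst relative error on midpoints between table nodes, x in (0.015, 10):
err = max(abs(xi_table(x) - xi(x)) / xi(x)
          for k in range(1, 1000) for x in [0.01 * k + 0.005])
print(err)
```

In double precision the measured error lands well below 10^-6, consistent with the smoothness argument above; the exact figure depends on the spline variant used.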

It is necessary to ensure that the table extends far enough so that all possible values of the variable (all interactions) are accounted for. This is relatively straightforward for a Coulomb table, as the tabulated variable is just the distance between pairs of atoms, which caps the table variable at a low value. It is not obvious what the maximum range of a generalized Born table should be, as we are tabulating a more complex and dimensionless variable. The extreme case occurs when two atoms that have very small Born radii are at the same time very far apart. In some simulations with implicit solvent, cutoffs are occasionally longer than with explicit solvent, perhaps up to 2 nm. If two hydrogen atoms are separated by this distance, and they have the smallest possible Born radii (around 0.115 nm for the Still/Hawkins-Cramer-Truhlar (HCT)/OBC models), the table would have to extend up to x ≈ 300. For practical purposes, the value of x rarely exceeds 10, so when setting up the table, a range of 50, five times as long, was chosen; in practical simulations we have never seen this violated (but it could be system dependent). However, the range of the table is not a limiting factor; setting the range to 1000 only consumes slightly more memory. As long as that part of the table is not accessed frequently for very large values of x, no performance penalty due to cache thrashing will be incurred.

Figure 2. Relative and absolute error from a cubic spline interpolation table of eq. (5) using single precision with 50 points per unit length.

Calculation of Born Radii

The Born radius of each atom can superficially be viewed as the distance between a charge in the protein and the boundary with the solvent. Mathematically, the Born radius b_i of an atom can be defined as an integral over the solute interior,

b_i^{-1} = a_i^{-1} − (1/(4π)) ∫_{in, r>a_i} r^{-4} dV    (7)

excluding a region a_i around the origin [5]. A number of different alternative Born radii algorithms exist, differing mainly in the way they calculate the integral in eq. (7), and earlier studies have shown them to have slightly different properties. In the long term, it might be worthwhile to look into Born radius models with nice computational properties, but for the present work, we wanted to focus on accelerating models that have already proven successful in applications. As shown by Bashford et al. using perfect radii, accurate calculation of the Born radii is key to the accuracy of implicit solvation models [32].
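Eq. (7) has a simple limiting case that makes a useful sanity check: if the solute is a sphere of radius R centered on the atom, the angular factors in dV = 4π r^2 dr cancel the 1/(4π) prefactor, the integral evaluates to 1/a_i − 1/R, and the Born radius comes out as exactly R. A small numerical sketch of this check (our own, not from the paper):

```python
def born_radius_sphere(a, R, n=100000):
    """Evaluate eq. (7) for a spherical solute of radius R centred on the
    atom: 1/b = 1/a - (1/4pi) * integral_{a<r<R} r^-4 dV.  With
    dV = 4*pi*r^2 dr this reduces to the 1-D integral of r^-2 over [a, R],
    computed here with a midpoint rule."""
    dr = (R - a) / n
    integral = sum((a + (k + 0.5) * dr) ** -2 for k in range(n)) * dr
    return 1.0 / (1.0 / a - integral)

# Analytically the integral is 1/a - 1/R, so b should equal R:
print(born_radius_sphere(0.115, 1.0))  # close to 1.0
```

The Still, HCT, and OBC models differ precisely in how they approximate this integral for realistic, non-spherical molecular shapes.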

One of the earliest algorithms for Born radii calculations was proposed by Still and co-workers in 1997 [3]. In this overlapping-spheres model, the Born radii are calculated from the van der Waals radii of the atoms in the system. There are additional contributions from neighboring atoms, with a scaling factor dependent on the connectivity (1, 2, 3 or more bonds away). Since then, many more schemes have been presented. In addition to the original Still model, we have also implemented the HCT method [30], as well as a popular variation of this model developed by Onufriev et al. [31] (called OBC). Both the HCT and OBC models use a sum of integrals over spherical atomic volumes, which can be calculated analytically, to compute the integral in eq. (7). A set of empirically determined scaling factors is then used to reduce the overcounting when volumes from overlapping spheres are summed. The OBC model was developed to correct a few shortcomings of the HCT model, such as the tendency to underestimate the Born radii of certain atoms (particularly buried ones).

The nonpolar part (G_np) of eq. (1) was calculated directly from the Born radius of each atom using a simple ACE-type approximation by Schaefer et al. [33], including a simple loop over all atoms. This requires only one extra solvation parameter, independent of atom type, but differing slightly between the three Born radii models.

System Setup

To test the performance of the tabulated algorithm, we selected the Villin headpiece (PDB code 1VII, 596 atoms), the smaller BBA5 (1T8J, 377 atoms), protein L (2PTL, 923 atoms), Lysozyme (194L, 1008 atoms), and chains I, K, and M from the influenza fusion peptide Hemagglutinin (1RUZ, 7458 atoms). We used the OPLS-AA force field by Jorgensen et al. [34], and performed simulations using 1 nm, 2 nm, and infinite (all-vs-all) cutoffs, both for Coulomb and Lennard-Jones interactions and for the calculation of the Born radii themselves. As generalized Born runs are occasionally performed without cutoffs for very small molecules, we also implemented a separate set of streaming accelerated kernels tuned specifically for the all-vs-all interaction case (for obvious reasons, this cannot be expected to scale with system size). A stochastic dynamics integrator with a friction coefficient of 91 ps^-1 was used for integrating the equations of motion. As previously described by Zagrovic [9], this accounts for the lack of friction from the absent solvent molecules. The temperature was kept at 300 K with the stochastic dynamics integrator, and all covalent bonds were constrained using the P-LINCS [35] algorithm. Explicit water simulations used 1.0 nm cutoffs, and either the PME [36] or reaction-field method to calculate electrostatics. For BBA5, the Villin headpiece, and protein L, the solvation box was defined to accommodate 3000 water molecules. For Lysozyme and Hemagglutinin, a minimum distance of 1.0 nm between the protein and the box edge was used, resulting in systems with 9251 and 47161 water molecules, respectively. All simulations above were performed using single precision arithmetic. To assess the energy conservation of the table approach itself, Gromacs was also compiled in double precision for a 100 ns reference simulation using the Villin headpiece and the Still model. This single test system was run without cutoffs, using a 1 fs time step and a leap-frog integrator in the NVE ensemble, as a stochastic dynamics integrator would not conserve energy.

Figure 3. Relative and absolute error in the derivative when using cubic spline interpolation for eq. (5). Settings are identical to Fig. 2, i.e., cubic spline table, single precision, and 50 points per unit length. As described in the text, small values of x do not occur in practice, as they would correspond to extremely short separations with large Born radii.

Figure 4. Total energy for a 100-ns generalized Born simulation using a leap-frog integrator. In contrast to stochastic dynamics, this integrator will conserve energy and, therefore, expose numerical errors in the tabulated forces or integration. The drift in total energy is essentially negligible. The inset shows a magnified version and also includes the generalized Born polarization separately. While the latter is not strictly a conserved quantity, it is included to indicate that the small remaining drift is not likely to come from the tabulated Born interactions.

Results

Stability and Energy Conservation

Production-level implicit solvent simulations should typically not be run without a stochastic dynamics integrator to account for viscosity, but it is nevertheless very useful to assess the accuracy of the table approximation and implementation by testing the energy conservation for long NVE ensemble simulations. To this end, we performed a 100-ns simulation of the Villin headpiece using a leap-frog integrator, which clearly shows that energy is well conserved with our tabulated generalized Born implementation (Figure 4). While energy conservation in theory is a binary property (it is either conserved or not), in practice this is rarely the case over very long time scales in any code, due to rounding errors and numerical precision. The values reported here show a relative drift in total energy of only ∼10^-7 per ns. While not exactly zero, this should be negligible for all practical purposes. In particular, it can be seen that the part of the potential energy related to generalized Born has a drift close to zero over the entire 100 ns of simulation time (Fig. 4), which probably means the small remaining drift has other sources. The energy conservation primarily tests self-consistency between energy and force, but note that the absolute errors in the force and potential were already assessed in Figures 2 and 3.

The general structural stability of proteins in implicit simulations is a more difficult question, in particular as it is very much dependent on the model for the Born radii. Arguably, the lack of explicit solvent molecules around proteins can also make them a bit more flexible. As seen from Figure 5, the Villin structure initially shows fluctuations in the first part of a 1 µs simulation even when simulated with the OBC model. This could be due to an imperfect initial structure or force field differences, but interestingly it then relaxes to a backbone RMSD (root-mean-square deviation) of around 0.17 nm. This is more than acceptable for an implicit solvent simulation, which indicates that generalized Born models can indeed keep proteins stable on microsecond scales.

Performance

An efficient way to increase performance in any molecular dynamicssimulation is to use a longer time step. In Gromacs, this can be doneusing virtual interaction sites for hydrogen atoms. Effectively, thisremoves the integration degrees of freedom for the hydrogens butretains them as force interaction sites.37 Using virtual sites, it ispossible to use a time step of at least 4 fs with explicit water. It

Figure 5. Backbone RMSD over a 1 µs simulation of the Villin head-piece using the OBC model. There are significant initial fluctuations,with RMSD values as high as 0.3 nm. However, after roughly 500 ns,the structure stabilizes at an RMSD of 0.17 nm, and the average for theentire simulation is 0.21 nm.

Journal of Computational Chemistry DOI 10.1002/jcc

Page 6: A high-performance parallel-generalized born implementation enabled by tabulated interaction rescaling

2598 Larsson and Lindahl • Vol. 31, No. 14 • Journal of Computational Chemistry

Table 2. Simulation Performance (ns/day).

Cut-off Hemag-Method (nm) BBA5 Villin Protein L Lysozyme glutinin

PME (4 fs) 1.0 53 50 45 17 3.6PME 1.0 66 62 56 21 4.5Reaction-field (4 fs) 1.0 115 110 95 39 8Reaction-field 1.0 143 138 120 49 10Still 1.0 860 580 330 110 33

2.0 590 305 150 33 7.5∞ 845 440 205 48 3.3

HCT 1.0 710 460 255 85 262.0 460 230 110 25 5.7∞ 640 315 150 29 2.3

OBC 1.0 700 450 250 82 25.52.0 450 220 105 25 5.7∞ 630 310 140 28 2.13

In vacuo ∞ 1580 930 450 112 8.9

Simulation performance for different systems and solvent models on a 2.8GHz quad-core. All simulations use virtual interaction sites to enable longtime steps. PME and reaction-field uses explicit solvent, and the performanceis reported both for standard 4 fs time steps as well as 5 fs (borderline forexplicit water). While the preferred choice for many simulations, PME is alsoclearly slowest. The Still model is the fastest implicit solvent alternative, butit can lead to stability problems for small proteins such as BBA5 on long timescales. The HCT/OBC models correct this, but are also somewhat slower dueto more complex Born radius calculations. Note how the special all-vs-allkernels (infinite cutoff) provide very high performance for small systems andthat the Born models significantly outperform the explicit water simulations.

can often be pushed to 5 fs,37 although at that stage there are small signs of integration errors that lead to solvent heating. With implicit solvent the water molecules no longer pose a problem, and 5 fs time steps will conserve energy with virtual hydrogen sites. In Table 2, we compare the performance for different proteins and setups using both a 4 fs time step (explicit solvent only) and a 5 fs time step (all solvent models).
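As a concrete illustration, a run-input fragment for such a setup could look as follows. The option names follow the Gromacs 4.x .mdp conventions, and the values shown are illustrative choices matching the 1.0 nm cutoff runs in Table 2, not a prescription; the exact spellings and defaults should be checked against the manual for the Gromacs version used:

```
; Long time step, enabled by virtual hydrogen sites
; (topology generated with pdb2gmx -vsite hydrogens)
dt                  = 0.005     ; 5 fs
constraints         = all-bonds

; Generalized Born implicit solvent
implicit_solvent    = GBSA
gb_algorithm        = OBC       ; or Still / HCT
nstgbradii          = 1         ; recompute Born radii every step
rgbradii            = 1.0       ; Born radii cutoff (nm), matching rlist
gb_epsilon_solvent  = 80

; Nonbonded cutoffs
rlist               = 1.0
rvdw                = 1.0
rcoulomb            = 1.0
```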

Using 1.0 nm cutoffs and a 5 fs time step, we achieve a Still model performance of 0.58 µs/day for Villin and 0.86 µs/day for BBA5 on the previously described desktop (see Table 2). A dual quad-core desktop will likely break the microsecond barrier. The HCT/OBC models are significantly more costly, since the polarization effects from atom i to j and from j to i have to be evaluated separately, but even in this case we reach 0.45 and 0.7 µs/day for Villin and BBA5, respectively. In comparison, explicit-solvent PME calculations produce around 50 ns/day on the same systems. It is also worth noting that the implicit solvent performance is now within a factor of two of the vacuum value (despite three iterations over the neighborlist each step), and that the new streaming all-vs-all kernels can be remarkably useful for small proteins: for BBA5, for instance, it is more efficient to run without a cutoff than with a 2 nm one. Some previous studies, such as those performed with Folding@Home on BBA5, reached aggregated simulation times of up to 10 µs using large-scale distributed computing. With the present implementation, this could be repeated in two weeks on a desktop, and it will obviously improve distributed computing performance by the same amount.

While the implementation is parallel, there is still some room for further improvement in scaling; the performance when using just a single core for, e.g., BBA5 with the all-vs-all kernel and Still model is as high as 230 ns/day. This is primarily due to the extra communication steps required for implicit solvent combined with a very small system: with BBA5, there are fewer than 100 atoms per core. For larger systems, the bottleneck is rather load balancing, since we currently cannot balance the load separately for the Born radii calculation, the interactions, and the chain rule. Further improving the scaling of these inhomogeneous systems is thus an important future challenge.

Discussion

Notwithstanding several other advantages, raw simulation performance has not been the strongest point of implicit solvent simulations in the last few years. In particular, as a lot of effort has gone into optimization of water interactions, lattice summation methods, and parallelization, explicit-solvent simulation has occasionally even been faster. While this has made microsecond-scale simulations possible even for large systems,38 it typically requires a large number of nodes, which makes it difficult to use for high-throughput studies. The main aim of this work has been to change this by pushing implicit solvent performance to enable multi-microsecond simulations on relatively modest hardware.

The usefulness of the fast table formulation of generalized Born is, however, not limited to long simulations. The method should provide a similar speedup when implicit solvent is used to screen large numbers of structures, for instance in single-point potential energy calculations of protein structures to assess stability. For example, in homology modeling it is common to build a whole set of structures from the initial alignment, and faster implicit solvent methods will help in finding good low-energy models (assuming these are more native-like) by single-point calculations or short minimizations. In addition, implicit solvent models have recently been shown to be surprisingly useful for refinement of structure ensembles using simple energy minimization techniques.39

It is interesting to compare the performance of this new CPU implementation with other recent work using graphics cards (GPUs), as it shows advantages of both architectures. Using the OBC model for Villin without cutoffs (all-vs-all kernels), Friedrichs et al.40 were able to reach 528.5 and 260.8 ns/day on Nvidia and ATI cards, respectively. Those authors also compared this to a previous CPU generalized Born implementation with a performance of 3.9 ns/day for the same system. The non-cutoff kernels are particularly well suited for GPUs, as they avoid the neighborlist and can be "streamed" through the processor, which was one of the reasons we decided to implement similar all-vs-all interactions optimized for CPUs. Even for this special non-cutoff case, our present CPU implementation reaches 310 ns/day (Table 2), indicating that an efficient implementation can get surprisingly close to GPU performance. In addition, it does so with less than 200 W of power and very low cost for the entire system, which might even make for better performance-to-price and power ratios for CPUs than current stream processor alternatives.

However, the main point of an efficient CPU implementation isto enable efficient neighborlists, parallelization, and other tasks that




are currently difficult on other architectures. For our large hemagglutinin system, we reach 25.5 ns/day using cutoffs on a single machine. On a 4-socket AMD Istanbul 2.1 GHz machine, the performance for hemagglutinin is 35 ns/day using 12 cores, 43 ns/day with 16 cores, and 57 ns/day with 24 cores. There are still some bottlenecks remaining for really efficient scaling, in particular that it is hard to load-balance separately for the three steps of radius, interaction, and chain-rule calculations. Comparing the numbers in Table 2 highlights the efficiency of the all-vs-all calculations, but the performance and scaling with cutoffs is more than decent. For large systems, we believe the implementation presented here is the fastest generalized Born implementation available regardless of architecture, in particular as it also scales over multiple processors.

There are a few additional numerical approximations that could improve the performance further, e.g., linear tables, tabulating against r² instead of r, and only calculating the force (not the potential) every step. Another potential speedup is to only calculate the Born radii every n steps, with n in the range 2–3. This violates the strict ensemble and, more seriously, we have unfortunately found it to destabilize proteins, particularly when combined with long time steps and the HCT/OBC models. Nevertheless, in some cases it might be possible to use with the Still model. Finally, getting this implementation to scale better over very large numbers of processors with domain decomposition is an important future challenge.
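To illustrate the tabulation idea itself (not the production SIMD kernels), the sketch below tabulates the single-variable function ξ(x) = (x + e^(−x/4))^(−1/2) that results from rescaling the Still generalized Born denominator with x = r²/(b_i b_j), and measures the accuracy of a plain linear-interpolation lookup. The table range and size are illustrative choices for this sketch, not the values used in Gromacs, which uses higher-order (spline) tables:

```python
import math

# With x = r^2/(b_i*b_j), the Still denominator
# f_GB = sqrt(r^2 + b_i*b_j*exp(-r^2/(4*b_i*b_j)))
# factors as sqrt(b_i*b_j)/xi(x), with xi the single-variable function:
def xi_exact(x):
    return 1.0 / math.sqrt(x + math.exp(-x / 4.0))

# Build a uniformly spaced table on [0, X_MAX]; linear interpolation is
# shown for brevity, a production kernel would store spline coefficients.
X_MAX, N = 25.0, 4096          # illustrative table range and size
H = X_MAX / N                  # table spacing
table = [xi_exact(i * H) for i in range(N + 1)]

def xi_table(x):
    """Linear-interpolation lookup, clamped to the tabulated range."""
    x = min(max(x, 0.0), X_MAX)
    i = min(int(x / H), N - 1)
    frac = x / H - i
    return table[i] * (1.0 - frac) + table[i + 1] * frac

# Worst-case relative error over a dense scan of the tabulated range:
# the function is smooth, so even linear tables stay very accurate.
err = max(abs(xi_table(k * 0.001) - xi_exact(k * 0.001)) / xi_exact(k * 0.001)
          for k in range(int(X_MAX * 1000)))
print(f"max relative error: {err:.2e}")
```

Because ξ is smooth and bounded on the whole range, the lookup error here is far below the force accuracy needed in a simulation, which is why the tabulated formulation loses essentially no precision.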

Conclusions

Implicit solvation is by no means the only way of reducing the computational complexity of molecular dynamics simulations, but as discussed above, it does offer some key advantages over explicit solvent. This becomes particularly important if the same models also enable us to increase simulation performance significantly, and we believe the implementation presented here achieves that goal. The (surprisingly simple) rescaling method itself is no approximation to the generalized Born equation, and as we have shown, the tabulated function is smooth enough in the region used that the errors approach numerical precision. Altogether, this enables an efficient and portable CPU implementation of generalized Born that has also been fully implemented in publicly available code; we believe it provides generalized Born simulations that are significantly faster than the best explicit-solvent implementations available on general hardware. This will enable more studies of phenomena that are currently at the limit of what is possible with explicit-water simulations, for instance folding of larger proteins and protein aggregation. In addition, as this performance is reached with only one or a few nodes, it also opens the possibility of ensemble simulations of thousands of structures on the microsecond scale on normal clusters, which is important for refinement and high-throughput work. While the scaling is not yet perfect on large numbers of nodes due to the different load balancing in the multiple stages involved, we hope it will provide a useful addition to the field of implicit solvent simulations.
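For reference, the exactness of the rescaling can be summarized in one line, using the standard Still et al.3 form of the generalized Born pair denominator with Born radii b_i and b_j:

```latex
f_{GB}(r_{ij}) = \sqrt{\, r_{ij}^2 + b_i b_j \exp\!\left(-\frac{r_{ij}^2}{4 b_i b_j}\right)}
             = \sqrt{b_i b_j}\,\sqrt{\, x + e^{-x/4}},
\qquad x = \frac{r_{ij}^2}{b_i b_j},
```

so that 1/f_GB = (b_i b_j)^(-1/2) ξ(x) with ξ(x) = (x + e^(−x/4))^(−1/2) a function of the single variable x that can be tabulated once; the substitution is an identity, and the only numerical error is the interpolation error of the smooth table.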

Acknowledgment

The authors would like to acknowledge Pär Bjelkmar and Peter Kasson for helpful discussions, and Berk Hess for help with some of the programming.

References

1. Koehl, P. Curr Opin Struct Biol 2006, 16, 142.
2. Roux, B.; Simonson, T. Biophys Chem 1999, 78, 1.
3. Qiu, D.; Shenkin, P. S.; Hollinger, F. P.; Still, W. C. J Phys Chem A 1997, 101, 3005.
4. Cramer, C. J.; Truhlar, D. G. Chem Rev 1999, 99, 2161.
5. Bashford, D.; Case, D. A. Annu Rev Phys Chem 2000, 51, 129.
6. Lopes, A.; Alexandrov, A.; Bathelt, C.; Archontis, G.; Simonson, T. Proteins 2007, 67, 853.
7. Nilmeier, J.; Jacobson, M. J Chem Theory Comput 2008, 4, 835.
8. Michel, J.; Taylor, R. D.; Essex, J. W. J Chem Theory Comput 2006, 2, 732.
9. Zagrovic, B.; Snow, C. D.; Khaliq, S.; Shirts, M. R.; Pande, V. S. J Mol Biol 2002, 323, 153.
10. Zagrovic, B.; Snow, C. D.; Shirts, M. R.; Pande, V. S. J Mol Biol 2002, 323, 927.
11. Snow, C. D.; Zagrovic, B.; Pande, V. S. J Am Chem Soc 2002, 124, 14548.
12. Ho, B. K.; Dill, K. A. PLoS Comput Biol 2006, 2, e27.
13. Lei, H.; Wang, Z. X.; Wu, C.; Duan, Y. J Chem Phys 2009, 131, 16510.
14. Tamamis, P.; Adler-Abramovich, L.; Reches, M.; Marshall, K.; Sikorski, P.; Serpell, L.; Gazit, E.; Archontis, G. Biophys J 2009, 96, 5020.
15. Konig, G.; Boresch, S. J Phys Chem B 2009, 113, 8967.
16. Purisima, E. O.; Sulea, T. J Phys Chem B 2009, 113, 8206.
17. Jiao, D.; Zhang, J.; Duke, R. E.; Li, G.; Schnieders, M. J.; Ren, P. J Comput Chem 2009, 30, 1701.
18. Mongan, J.; Case, D. A. Curr Opin Struct Biol 2005, 15, 157.
19. Mongan, J.; Case, D. A.; McCammon, J. A. J Comput Chem 2004, 25, 2038.
20. Khandogin, J.; Brooks, C. L., III. Proc Natl Acad Sci USA 2007, 104, 16880.
21. Fan, H.; Mark, A. E.; Zhu, J.; Honig, B. Proc Natl Acad Sci USA 2005, 102, 6760.
22. Amaro, R. E.; Cheng, X.; Ivanov, I.; Xu, D.; McCammon, J. A. J Am Chem Soc 2009, 131, 4702.
23. Voelz, V. A.; Shell, M. S.; Dill, K. A. PLoS Comput Biol 2009, 5, e1000281.
24. Felts, A. K.; Gallichio, E.; Chekmarev, D.; Paris, K. A.; Friesner, R. A.; Levy, R. M. J Chem Theory Comput 2008, 4, 855.
25. Ponder, J. W. Tinker: Software Tools for Molecular Design, Version 3.9; Washington University: St. Louis, 2001.
26. Pearlman, D. A.; Case, D. A.; Caldwell, J. W.; Ross, W. S.; Cheatham, T. E.; Debolt, S.; Ferguson, D.; Siebel, G.; Kollman, P. Comput Phys Commun 1995, 91, 1.
27. MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J Phys Chem B 1998, 102, 3586.
28. Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J. J Comput Chem 2005, 26, 1701.
29. Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. J Chem Theory Comput 2008, 4, 435.
30. Hawkins, D. G.; Cramer, C. J.; Truhlar, D. G. J Phys Chem A 1996, 100, 19824.
31. Onufriev, A.; Bashford, D.; Case, D. A. Proteins 2004, 55, 383.
32. Onufriev, A.; Case, D. A.; Bashford, D. J Comput Chem 2002, 23, 1297.
33. Schaefer, M.; Bartels, C.; Karplus, M. J Mol Biol 1998, 284, 835.




34. Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J. J Am Chem Soc 1996, 118, 11225.
35. Hess, B.; Bekker, H.; Berendsen, H. J. C.; Fraaije, J. G. E. M. J Comput Chem 1997, 18, 1463.
36. Darden, T.; York, D.; Pedersen, L. J Chem Phys 1993, 98, 10089.
37. Feenstra, A.; Hess, B.; Berendsen, H. J Comput Chem 1999, 20, 786.
38. Bjelkmar, P.; Niemela, P. S.; Vattulainen, I.; Lindahl, E. PLoS Comput Biol 2009, 5, e1000289.
39. Chopra, G.; Summa, C. M.; Levitt, M. Proc Natl Acad Sci USA 2008, 105, 20239.
40. Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. J Comput Chem 2009, 30, 864.

Journal of Computational Chemistry DOI 10.1002/jcc