ms thesis on benchmark test set for free energy calculations using molecular simulations

Using a molecular benchmark set to compare free energy estimators

and explore electrostatic potential parameter space

Himanshu Paliwal

B. Tech. Chemical Engineering, Institute of Technology Banaras Hindu University, 2002

A Thesis presented to the Graduate Faculty

of the University of Virginia in Candidacy for the Degree of

Master of Science

Department of Chemical Engineering

University of Virginia

July, 2011

APPROVAL SHEET

This thesis is submitted in partial fulfillment of the

requirements for the degree of

Master of Science (Chemical Engineering)

______________________________________________________

Author

This thesis has been read and approved by the Examining Committee:

______________________________________________________

Michael R. Shirts, Thesis Advisor

______________________________________________________

John OConnell, Committee Chairperson

______________________________________________________

Erik J. Fernandez, Committee Member

Accepted for the School of Engineering and Applied Science:

______________________________________________________

Kathryn Thornton, Dean, School of Engineering and Applied Science

July, 2011

ABSTRACT

There is a significant need for improved tools to validate thermo-physical quantities

computed via molecular simulation. In this paper we present the initial version of a

benchmark set for testing methods of calculating free energies of molecular

transformation in solution. This set is based on molecular changes common to many

molecular design problems, such as insertion and deletion of atomic sites and changing

atomic partial charges. We use this benchmark set to compare the statistical efficiency,

reliability and quality of uncertainty estimates for a number of published free energy

methods, including thermodynamic integration, free energy perturbation, the Bennett

acceptance ratio (BAR), and its multistate equivalent MBAR. We identify MBAR as the

consistently best performing method, though other methods are comparable in reliability

and accuracy in many cases. We demonstrate that assumptions of Gaussian distributed

errors in free energies are usually valid for most methods studied. We demonstrate that

bootstrap error estimation is a robust and useful technique for estimating statistical

variance for all free energy methods studied. We also use MBAR to explore electrostatic

potential parameters space in order to find the set of parameters resulting in shortest

simulation time for a given accuracy in the free energy estimate of the transformations in

benchmark set and enthalpy of vaporization of water. This benchmark set is provided in a

number of different file formats with the hope of becoming a useful and general tool for

method comparisons.

CONTENTS

ACKNOWLEDGEMENTS .............................................................................................. I

LIST OF FIGURES ......................................................................................................... II

LIST OF TABLES ........................................................................................................ VII

1. Introduction ................................................................................................................... 1

2. Molecular systems of benchmark set .......................................................................... 7

2.1 Minimal test system .................................................................................................. 7

2.2 Charge mutation system ............................................................................................ 7

2.3 Large molecule mutation system .............................................................................. 8

3. Comparing free energy methods based on statistical tests, using benchmark set 10

3.1 Free energy methods and error propagation ........................................................... 10

3.1.1 Thermodynamic integration ............................................................................. 10

3.1.2 Exponential averaging ..................................................................................... 12

3.1.3 Bennett acceptance ratio .................................................................................. 15

3.1.4 Unoptimized Bennett acceptance ratio ............................................................ 16

3.1.5 Range-based Bennett acceptance ratio ............................................................. 16

3.1.6 Multistate Bennett acceptance ratio ................................................................. 17

3.1.7 Gaussian estimate of exponential averaging .................................................... 17

3.2 Building molecular systems, setting up simulation framework and designing

statistical experiments. .................................................................................................. 18

(A) System preparation and simulation parameters. ..................................................... 19

3.2.1 values and spacing between intermediate states for free energy calculations

................................................................................................................................... 19

3.2.1.1 Full set ................................................................................................... 20

3.2.1.2 Sparse set ............................................................................................... 21

3.2.2 Generating an ensemble of uncorrelated configurations ................................. 22

(B) Statistical tests. ....................................................................................................... 23

3.2.3 Quantifying accuracy and precision in uncertainty estimate of an estimator .. 23

3.2.3.1 Sample standard deviation ........................................................................ 23

3.2.3.2 Analytical estimate.................................................................................... 24

3.2.3.3 Bootstrap estimate ..................................................................................... 25

3.2.4 Quantifying bias of free energy estimates ....................................................... 26

3.2.4.1 Bias due to number of samples ................................................................. 27

3.2.4.2 Bias due to number of intermediate states ................................................ 28

3.2.5 Quantifying reliability of a free energy estimator ............................................ 28

3.2.6 Validating the Gaussian distributions of the free energy differences .............. 31

3.3 Results and discussion ............................................................................................ 31

3.3.1 Validation of uncertainty estimates ................................................................. 32

3.3.2 Analysis of bias in free energy estimates ......................................................... 40

3.3.3 Overall reliability of a free energy estimator ................................................... 47

3.3.4 Testing the Gaussian distribution of free energies ........................................... 52

3.3.5 Convergence properties of free energy estimates ............................................ 55

3.4. Conclusions ............................................................................................................ 58

4. Exploring computational efficiency of multiple electrostatic potential parameter

sets using data sampled at a single set. .......................................................................... 61

4.1 Methods................................................................................................................... 64

4.1.1 Difference between free energy estimates calculated with two different PME

parameter sets............................................................................................................ 64

4.1.2 Difference in enthalpy of transformation between two PME parameter sets .. 68

4.1.3 Enthalpy of vaporization of water calculated at new set of PME parameters. 70

4.2 Results and Discussion ........................................................................................... 75

4.3 Conclusions ............................................................................................................. 88

References: ...................................................................................................................... 90

Appendix: ........................................................................................................................ 94

A1 Thermodynamic integration using cubic splines .................................................... 94

A2 Supplementary information..................................................................................... 97

A3 Availability of datasets for other simulation packages ......................................... 120

Readme for free energy benchmark V1.0 ............................................................... 121

A3.1. Benchmark Test Set Energies ................................................................... 121

A3.2. Difficulties in Exact Energy Matching ..................................................... 121

A3.2.1. Agreement in combination rule.......................................................... 121

A3.2.2. Agreement in Cutoff .......................................................................... 122

A3.2.3. Agreement in PME Parameters .......................................................... 123

A3.3. Best Matches to Parameters used in the Benchmark Set Tests ................. 123

A3.4. Long Cutoff Parameters ............................................................................ 124

A3.5. Format of output........................................................................................ 125

A3.6. File Organization....................................................................................... 126

I

ACKNOWLEDGEMENTS

I would like to express my heartfelt gratitude to my advisor Prof. Michael R. Shirts.

Without his help, support and vision, this work would not have been possible. I would

like to thank my lab members, Kai, Christoph, Arjan, Joe, Tri and Jon. Their comments

and discussions have contributed in bringing this effort to its final form.

I thank UVA ITC and UVA XCG for their computing resources support. I thank Karolina

from the UVA Computer Science department. Her help during the beta test of cross

campus grid gave me ideas about setting up the framework for running high through-put

simulations. I thank James Watney at D. E. Shaw Research for significant help in running

Desmond.

Last but not the least I thank my parents, my family and my wife Priya for all the moral

support they have provided me throughout the work.

II

LIST OF FIGURES

Figure 1. (a) In the coupled state or solvated state both intermolecular and intramolecular

interactions for anthracene are turned on. (b) In the decoupled state or vacuum state the

intermolecular interactions with water molecules are turned off. 9

Figure 2. Free energy differences of transitions in the direction of increasing and

decreasing entropy should be added separately to get the overall free energy for a dipole

inversion. 14

Figure 3. (a) Gaussians are plotted on mutually perpendicular axes. Both have the G

calculated using a method (here TI for methane solvation) as mean and RMSE1 and

RMSE2 as their standard deviations. (b) These are fused to generate a bivariate Gaussian

plot. (c) Top view of the bivariate Gaussian plot. 30

Figure 4. Uncertainty estimates (sample standard deviation, analytical, bootstrap) are

plotted for all the methods in three different test cases for full set. Consistent free

energy methods have bars equal in height, and the most accurate methods have the

shortest bars. 36


plotted for six methods in three different test cases for sparse set. Four methods

(DEXP, IEXP, GINS and GDEL) are not graphed because they did not converge properly

for any of the three uncertainty estimates. 38

Figure 6. Bias plots for different test cases for the full set. DEXP, TI and TI3 show

numerically significant bias for methane, while all methods show moderate bias with

respect to number of states for anthracene. 45

III

Figure 7. Bias plots for different test cases or sparse set. DEXP and IEXP show large

biases both due to number of samples and due to number of intermediate states. 46

Figure 8. Bivariate Gaussian plots for UA methane solvation. Note how TI and TI3 fail

in reliability test for sparse set. GDEL and GINS are not at all reliable. MBAR and

BAR are accurate but the estimate of the precision in BAR is misleading as it

underestimates uncertainty, as shown in Figures 4 and 5. SP after the method name

indicates the results of the sparse set. 49

Figure 9. Bivariate Gaussian plots for dipole inversion. TI is reliable for this molecule,

but DEXP, IEXP, GDEL and GINS again are the least accurate and precise of all the

methods studied. 50

Figure 10. Bivariate Gaussian plots for Anthracene hydration free energy. The effect of

bias due to intermediate states is evident in almost all methods as all spreads are elliptical

in vertical direction. TI3 and MBAR both appear reliable, especially for the full set. 51

Figure 11. Each subplot has a comparison between the distribution of estimated free

energies from 100 repetitions (in blue), Gaussian with the mean G and standard

deviation (G) from 100 repetitions (in black), Gaussian with the mean (G)450ns and

standard deviation ((G)450ns) (in magenta), Gaussian with the mean (G)51states and

standard deviation ((G)51states ) (in cyan) for the full set for methane solvation. 53

Figure 12. Each subplot has a comparison between the distribution of estimated free

energies from 100 repetitions (in blue), Gaussian with the mean G and standard

deviation (G) from 100 repetitions (in black), Gaussian with the mean (G)450ns and

IV

standard deviation ((G)450ns) (in magenta), Gaussian with the mean (G)51states and

standard deviation ((G)51states ) (in cyan) for sparse set for methane solvation. 54

Figure 13. Free energies estimated using large number of intermediate states and a large

number of samples converge to a single value for all free energy estimators. 56

Figure 14. Accuracy in methane solvation free energy estimates as a function of different

PME parameters. 76

Figure 15. Inaccuracy in free energy estimates increases with increasing phase space

overlap. 77

Figure 16. Molar enthalpy of vaporization of water as a function of different PME

parameters. 80

Figure 17. Accuracy in methane solvation enthalpy estimates as a function of different

PME parameters. 81

Figure 18. Simulation time for methane solvation as a function of different PME

parameters. 82

Figure 19. Subplot at row 2 column 2 in Figure 14 is expanded to explore exactly how

G estimates of methane solvation vary for different Fourier spacings and cutoffs

converge. 83


Hvap estimates vary for different Fourier spacings and cutoffs converge. 83


simulation time for methane solvation varies for different Fourier spacings and cutoffs. 84

V

Figure S 1. Plots comparing the distribution of dipole inversion free energies with the

Gaussians for full set for dipole inversion. 108

Figure S 2. Plots comparing the distribution of dipole inversion free energies with the

Gaussians for sparse set for dipole inversion. 109

Figure S 3. Plots comparing the distribution of estimated anthracene solvation free

energies with the Gaussians for full set. 110

Figure S 4. Plots comparing the distribution of estimated anthracene solvation free

energies with the Gaussians for sparse set. 111

Figure S 5. Accuracy in free energy estimates of dipole inversion as a function of

different PME parameters. 112

Figure S 6. Accuracy in the estimate of enthalpy for dipole inversion process as a

function of different PME parameters. 113

Figure S 7. Simulation time for dipole inversion as a function of different PME

parameters. 114

Figure S 8. Subplot row 2, column 2 in Figure S5 is expanded to explore exactly how

G estimate of dipole inversion vary for different Fourier spacings and cutoffs. 115


simulation time for dipole inversion varies for different Fourier spacings and cutoffs. 115

Figure S 10. Accuracy in anthracene solvation free energy estimate as a function of


Figure S 11. Accuracy in anthracene solvation enthalpy estimates as a function of


VI

Figure S 12. Simulation time for anthracene solvation as a function of different PME

parameters. 118


G estimate of anthracene solvation vary for different Fourier spacings and cutoffs. 119

Figure S 14. Subplot at row 2, column 2 in Figure S12 is expanded to explore exactly

how simulation time for anthracene solvation varies for different Fourier spacings and

cutoffs. 119

VII

LIST OF TABLES

Table 1. Statistical uncertainty calculated using three different approaches (analytical

(G), sample standard deviation (G), bootstrap (G)bs) for UA methane

solvation using the full set. All quantities are in kJ/mol. G is not the ensemble

average but the average over 100 repetitions. 35



solvation using sparse set. All quantities are in kJ/mol. 37

Table 3. Free energy estimates and corresponding uncertainty estimates in the large

number of samples (450 ns) and large number of intermediate states (51 states) for UA

methane solvation. Bootstrap estimates are reported as they are better than analytical

estimates. All quantities are in kJ/mol. 41

Table 4. Bias estimates due to number of samples and number of lambda states for full

and sparse sets for UA methane solvation. All quantities are in kJ/mol. 42

Table 5. RMSEs and statistical uncertainties in RMSEs in UA methane solvation free

energy estimate are largest for GINS and GDEL in sparse set while the acceptance ratio

methods consistently show low RMSEs and the corresponding uncertainty in the RMSEs.

All quantities are in kJ/mol. 43

Table 6. Time consumed by different methods to calculate the free energy 201 times of

anthracene solvation. 57

Table 7. Summary of all statistical tests are presented. 60

Table 8. PME parameters used to generate input sets. 74

VIII

Table 9. Lead parameter set for methane solvation. 85

Table 10. Lead parameter sets for dipole inversion. 85

Table 11. Lead parameter sets for anthracene solvation. 86

Table 12. Comparison of lead parameter sets with the parameter set used in simulation

done to test free energy estimators for methane solvation. 87

Table S 1. Statistical uncertainty calculated using three different approaches (analytical

(G), sample standard deviation (G), and bootstrap (G)bs) for dipole inversion

using the full set. All quantities are in kJ/mol. 97


(G), sample standard deviation (G), bootstrap (G)bs) for dipole inversion

using the sparse set. All quantities are in kJ/mol. 98


(G), sample standard deviation (G), bootstrap (G)bs) for anthracene solvation

using the full set. All quantities are in kJ/mol. 99


(G), sample standard deviation (G), bootstrap (G)bs) for anthracene solvation

using sparse set. All quantities are in kJ/mol. 100

Table S 5. Free energy estimated using large number of samples and large number of

intermediate states for dipole inversion. 101

Table S 6. Bias in free energy estimates due number of samples and number of

intermediate states for dipole inversion are presented here. None of the methods show

significantly large bias for this molecular test set. 102

IX

Table S 7. RMSEs and statistical uncertainties in RMSEs in free energy change of dipole

inversion of IEXP, DEXP, UBAR GDEL, and GINS are significantly higher (indicating

lower reliability) compared to other methods specifically in the sparse set. 103

Table S 8. Free energies estimated u large number of samples and large number of

intermediates states for anthracene hydration free energy. 104

Table S 9. GDEL and GINS have largest bias in free energy estimates for anthracene

solvation due to number of samples and number of intermediate states even for full set.

For sparse set GDEL and GINS have even larger bias compared to full set. DEXP and

IEXP also show significantly large biases in sparse set. 105

Table S 10. High RMSEs and statistical uncertainties in RMSEs for GDEL and GINS in

both the full and the sparse set indicate low reliability. DEXP, IEXP, UBAR also

become unreliable in the sparse set. 106

Table A 1. Single point simulation parameters for MD_sim_parm energies for methane

solvation, dipole inversion, anthracene solvation. The second set of parameters for

number of grid points is for dipole inversion, which had a larger box. 124

Table A 2. Single point simulation parameters for high cutoff energies for methane

solvation and anthracene solvation. 125

Table A 3. Single point simulation parameters for high cutoff energies for dipole

inversion. 125

1

1. Introduction

Simulation and theory communities have developed substantial interest in using free

energy calculations for molecular design problems. Specifically, free energy calculations

can guide experimental screening techniques for measuring biological interaction

energies and offer the potential of a faster and cheaper way to get thermodynamic

information over large chemical spaces in a variety of molecular contexts.1 For example,

drug design2 requires prediction of binding affinities, tautomers, protonation states,

membrane permeabilities, and solubilities, all of which involve free energy calculations.3-

6 Similarly, free energy calculations could become useful tools in material design

problems ranging from improved protein selectivity and stability on chromatographic

surfaces7 to tailoring metal organic frameworks8 for applications such as gas storage and

separation. Such potential extends to the design of new nanomaterials such as therapeutic

dendrimers, hetero-polymers and hyperbranched polymers for molecular recognition,

imaging, sensing/signaling and controlled payload delivery.9-13 However, substantial

roadblocks to routine use of molecular simulations as a complement to experiment are

confusion over suitability of methods for different molecular problems, as well as lack of

rigorous, validated understanding about the reliability of free energy calculations and

other observables estimated using statistical methods.

Other computational fields have successfully benchmarked and tested computational

methods to improve the reliability and thus the utility of simulations. For example the

field of computational fluid dynamics has also grappled with issues of reliability and

2

standardization of simulations. During the late 90s, substantial research efforts in the

field of computational fluid dynamics (CFD) were focused on establishing validation

benchmarks to improve the reliability of CFD simulations in various design

applications.14-16 This research helped to bring down costs, increase data fidelity, and

reduce design cycle time in the early development phases of new airplanes,17 Formula 1

cars,18 treatment and diagnosis of cardiovascular diseases19,20 and off-shore oil rigs.21

Similar validation benchmarks were developed for simulations in the nuclear industry, to

improve the reliability in nuclear reactor safety, underground nuclear waste storage, and

nuclear weapon safety.22 For molecular simulation to play a similarly useful role in

molecular engineering design,23 benchmarks and validation sets must be established. In

this paper we provide tools for improved standardization of molecular simulations

through the first version of a benchmark set for free energy calculations.

There are a large number of free energy methods available,23-30 which by early 2011 have

been cited collectively over 4600 times. As of the time of writing, 20% of those citations

occurred within the last 18 months.1 However, the simulation field lacks consensus in

choosing a method most appropriate for a given molecular design situation. At least

three fundamental issues contribute to this confusion.

1 These numbers were generated by adding the primary citations for each of the listed

methods (23-30) and searches for Thermodynamic Integration and Free Energy

Perturbation in ISI Web of Science, as the original papers for these methods are older

than the ISI database.

3

First, there is a lack of standard test cases for rigorous comparisons between different free

energy methods. Studies of new methods frequently use relatively trivial model systems,

such as a one- or two-dimensional analytically solvable potential energy function, or a

small Lenard-Jones sphere in water. Such test cases may not be representative of the

problems encountered in actual molecular changes. Alternatively, papers describing new

methods may use extremely complicated systems, such as protein-ligand binding systems

that are hard to converge and therefore make it difficult to accurately gauge true gains in

efficiency. Both of these scenarios yield little knowledge about whether a given method

will be useful in answering actual molecular design questions. Second, computing

thermo-physical properties by molecular simulation involves stochastic sampling of

molecular configurations, and all comparisons must deal with the fact that repeated

independent measurements will have associated statistical error; unlike in quantum

mechanics, it will never be possible to match calculation to fourteen decimal places.24

Finally, direct comparisons between methods can be difficult because of the differences

between simulation code bases. Free energy calculation capabilities are recent additions

to most large-scale molecular simulation codes, and therefore most codes support usually

only a small subset of available free energy methods.

As a first step towards helping solve these problems, we propose the first version of a

molecular test set comprising realistic systems undergoing challenging molecular

transformations. We then use this test set to test the reliability of different methods for

estimating free energy differences. Although the molecular design applications listed in

the introduction seem very different, there are features common to all free energy

4

calculations performed for these applications. All involve determining the preference of

a molecule to partition between two environments, which can be calculated by way of a

difference in the free energies of molecular transformation between these two

environments. For example, we might wish to design a solute preferentially solvated by

a protein when compared to solvent (pure water) or a different complex medium (another

protein), as in the case of drug design. Alternately we might design a solvent which

preferentially solvates a given solute in a mixture; such as, designing ionic liquids25 for

sequestering CO2. These molecular transformations primarily involve either growing or

deleting atoms, changing the size or dispersion interaction between atoms, or altering the

partial charge on mutation sites. Any benchmark test set must include examples of these

transformations which are simultaneously challenging enough to push new methods and

yet possible to evaluate with sufficiently high precision to give meaningful comparisons

between methods in a reasonable amount computer time.

The most important features of a property estimation method to understand are the

statistical errors inherent in the method, both bias and statistical uncertainty, and the

reliability of the methods estimate of the property. Without such data, we cannot trust

our calculations or compare two different calculations for validation purposes. Studies

commissioned by U.S. science funding agencies on future directions for simulation based

engineering and science have emphasized the fundamental need for improved uncertainty

verification and validation.26,27 Almost all estimators of statistical quantities, like free

energy and ensemble average calculation methods, have some bias, a systematic

deviation from the true answer that would be obtained with perfect sampling.

5

Additionally, computing a given observable from independently selected samples gives

different estimates of the observable; this variation is the statistical uncertainty of the

estimate. Most free energy methods include estimates of this statistical uncertainty; but of

course, these uncertainty estimates are themselves statistical quantities and have variation

from sample to sample, and must be validated.

A few valuable studies have compared28,29 free energy methods, but not necessarily in a

systematic way. We use our proposed benchmark set to directly compare the estimated

uncertainties with the sample uncertainty. We compute the change in the mean square

error as a function of the number of intermediate states and number of samples to capture

both bias and uncertainty. We test whether the distribution of free energy estimators is

indeed Gaussian, a condition usually assumed when using statistical uncertainty estimates

to calculate error. Finally, we evaluate the bootstrap method30 as a method for computing

statistical uncertainty, as this method can be easily implemented for all of free energy

algorithms described in this thesis, and indeed generally for most statistical estimates of

observables.

Bias and uncertainty associated with the free energy algorithm are not the only sources of

errors in free energy estimates. Simulation parameters associated with the calculations of

non-bonded interactions, like electrostatic potential, also contribute to the errors in free

energy estimates. These parameters specify the extent of approximations allowed during

force and potential calculations. Molecular dynamics simulations require accurate force

calculations at each step. Similarly, Monte Carlo simulations require accurate potentials

6

at each step for generating samples. The selection criterions for these simulation

parameters are not well defined and there still exists a challenge as to how these can be

tuned efficiently for an entirely new system. The most common technique is to tune the

parameters such that for a given accuracy in force calculations the selection results in

minimum computational cost31-33. We have introduced a new way of looking into this

problem. We explore the electrostatic potential parameter space using our benchmark test

set. We search for the set of parameters which results in the least computational cost for a

given accuracy in free energy estimates of systems in benchmark set, estimates of molar

enthalpy of vaporization of water and estimates of enthalpy of transformation of each

system in benchmark set.

In this paper, we first explain the molecular test set and the rationale behind the

molecular choices. Next, we use this set to test and compare the accuracy, precision and

reliability of ten free energy methods. We then present a summary of the method

comparison, with much of the data presented as supplementary material because of its

length, and present our recommendations for methods for performing free energy

calculations. We then get into a discussion about the parameter space search and explain

our methodology and protocols to explore it and finally present our recommendations for

electrostatic potential parameters for performing molecular simulations.

7

2. Molecular systems of benchmark set

The systems in this benchmark set are designed to represent alchemical changes, or

changes of molecular identity, common to most molecular design applications.

Alchemical transformations frequently require the deletion or introduction of atoms and

large changes in the partial charges. Changes in torsional, angle, or dihedral parameters

usually result in smaller changes in phase space, as do small changes in dispersion

strength atomic radius, or charge. We therefore focus on atomic introduction/deletion and

large changes in partial charge.

2.1 Minimal test system: OPLS UA methane in TIP3P water (MS): The solvation of a

Lennard-Jones (LJ) sphere, representing methane, is perhaps the simplest molecular free

energy system that can be truly defined as molecularly realistic. There are no bonds,

angles, or torsions terms, nor are there solute/solvent charge terms. This system

represents a minimal test of whether the free energy method is at all valid or applicable

for molecular systems. We examine the transformation of coupling the sphere into water,

which corresponds to the solvation free energy of this molecule Figure 1 (a).

2.2 Charge mutation system: dipole inversion with OPLS UA ethane molecule in

TIP3P water (DI): We chose two LJ methyl spheres, tethered together, with +1/-1

charges and the bond length of a C-C bond as illustrated in Figure 1(b). This setup avoids

computing free energies of ions directly, as changing the total charge of a system with

periodic boundary conditions is not always handled completely correctly in many codes.35

8

This test measures whether the method can handle large water rearrangements around

charges. The system is a null transform; the free energy change is zero as the final state is

identical to the initial state by symmetry.

2.3 Large molecule mutation system: absolute hydration free energy of UA

anthracene in TIP3P water: In our third test, we compute the solvation of anthracene

via the decoupling of the intermolecular interactions from water. This system tests

whether the method can handle introduction or deletion of multiple atomic sites

efficiently. Importantly, there are no internal ligand degrees of freedom to complicate the

analysis. Force field parameters are taken from Pitera & Van Gunsteren.36 Originally, we

chose a null transformation of anthracene to anthracene via a benzene intermediate, but

the simpler solvation problem was eventually chosen because of the difficulty of

supporting such transformations in other codes and the difficulty of interpreting the

statistics of multiple transformations; a key requirement of the benchmark set is to be

simple to use. For this test set, we have used a single topology scheme, slowly turning

off the interactions between the solute and the solvent but keeping the intramolecular

interactions in the solute turned on as shown in Figure 1 (c) and (d) . The free energy

change of this transformation corresponds to the desolvation free energy of anthracene,

but we report the solvation free energy for ease of interpretation.

9

Figure 1. (a) Methane solvation (b) Dipole inversion (c) In the coupled state or solvated

state both intermolecular and intramolecular interactions for anthracene are turned on. (d)

In the decoupled state or vacuum state the intermolecular interactions with water

molecules are turned off.

10

3. Comparing free energy methods based on statistical tests, using benchmark set.

We use the benchmark set to test a total of ten free energy methods. In the following

explanations, we assume that the simulations are performed in the isothermal-isobaric

ensemble, and thus the Gibbs free energy G is the quantity of interest; the Helmholtz

free energy can also be computed if the simulations are performed in the canonical

ensemble. U indicates the generalized potential energy, which in the case of the isobaric-

isothermal ensemble is actually U+PV, is 1/kBT, and is a coupling parameter

connecting the initial and final states in an arbitrary manner. Brackets indicate an

ensemble average over the appropriate ensemble.

3.1 Free energy methods and error propagation

3.1.1 Thermodynamic integration24 using a trapezoid rule (TI) and a cubic spline37

(TI3):

For TI, we compute the ensemble average of the derivative of potential energy function

with respect to a coupling parameter for a system i.e. ( U()/ )i at all values and

the corresponding variances 2i of the ( U()/ )i distributions.

222 xxi , where x =

iU )/)(( (1)

The ( U()/ )i values at different intermediates are interpolated and then integrated

to get an overall free energy change.

1

0

10 )/)(()0()1(

UdGGG (2)

11

For TI we have used a linear interpolation, which leads to the standard trapezoid rule to

integrate the total free energy. For TI3, the ( U()/ )i vs i curve is fit piecewise to a

natural cubic spline and then integrated analytically using the coefficients of the cubic

equation (see Appendix 1 for the derivation). Both the trapezoidal and the cubic spline

integration can be expressed in the form of weighted sum of individual ( U()/ )i

K

i

i iUWG

1

10 )/)(( (3)

Here the Wis are the respective weights corresponding to each state and K is the total

number of intermediate states. The variance of this estimate of free energy can be

calculated by the following variance propagation formula:

K

i

iiW1

222

10 (4)

Occasionally in the literature, the variance of the free energy over each interval i to i+1 is

computed individually, and then propagated into the total variance by the sum of squares.

This is incorrect, since the variance of each interval is then correlated to the variance of

its neighbors. For example, the free energy difference between state 1 and state 2 and

between state 2 and state 3 both contain statistical information from state 2. A number

of alternative thermodynamic integration schemes have been proposed.37,38 However,

such schemes require some knowledge of the magnitude of the statistical uncertainty for

optimality. Other schemes use nonlinear fits to two different functional forms separately

describing Lennard-Jones and Coulomb contributions to the free energy.39 Such schemes

are not particularly flexible and introduce integration bias that is difficult to quantify. By

using cubic splines, we can obtain a higher order formula independent of functional form

12

of dU/d, while propagating error using the same formalism as is used in standard TI

(Eq. 4).

3.1.2 Exponential averaging (EXP) in two forms: deletion (DEXP) and insertion

(IEXP)28,40:

In exponential averaging schemes, the free energy change Gij is calculated using the

exponential average of the difference of the potential energies Uij between two states i

and j over one of the ensembles. The free energy difference as a function of potential

energy difference Uij and N samples is then

N

n

ijnij xUN

G1

])(exp[1

ln1

(5)

This averaging is performed using samples from state i to compute potential energy

differences Uij from state i to state j. The free energy of the reverse process can be

computed using samples from state j and computing potential energy differences to state

i. Since the labels themselves are arbitrary, to remove ambiguity in the direction we will

describe such computations as being either deletion or insertion. We will call Uij

taken in the direction of decreasing entropy as an insertion step, and Uij taken in the

direction of increasing entropy as a deletion step, as inspired by Wu et al.41 Hence the

free energy method using a Uij which steps in the direction of increasing entropy in Eq.

5 is labeled as deletion exponential averaging (DEXP), and the free energy method which

uses Uij which steps in the direction of decreasing entropy in Eq. 5 is labeled as

insertion exponential averaging (IEXP). In both cases, the variance 2ij between two

adjacent intermediate states can be estimated using standard point estimation theory as:

13

])(exp[,1

2

2

ijnx

ij xUxxN

(6)

In both the exponential averaging methods the overall free energy change G10 is

the sum of intermediate free energy changes Gij, and so the variance 210 is simply the

sum of the associated variances 2ij. In some cases the changes from the i state to the i-1

state and i+1 states might both be deletion or insertion cases; in this case, all the sampling

performed at i, and the two estimates of the free energy difference will not be statistically

independent. Complicated molecular changes will frequently involve both addition and

subtractions of phase space, and thus will fall somewhere in between these two general

schemes.

When we calculate the overall free energy change using a method which has inherent

directionality, like exponential averaging, then given our definitions, we need to make it

sure that the direction of entropy change remains intact throughout the process in order to

interpret the whole transformation as a deletion or an insertion process. Methane

solvation and anthracene solvation involve moving molecules from vapor to liquid phase,

resulting in a decrease in total entropy. However, in the dipole inversion case, we have a

symmetric transformation, and thus deletion and insertion happen within a single process.

In dipole inversion, going from a very large magnitude dipole to a small uncharged

intermediate, we have an increase of entropy as the water around the particle becomes

less structured, and thus we use the term deletion. From the intermediate uncharged

intermediate state to the reversed -/+ dipolar state (during the second half of the

inversion), we have the reverse process, and we use the term insertion consistent with the

14

entropy direction definition. To use the terminology IEXP, GINS, DEXP, and GDEL

pathways, we therefore need to combine mixed halves of what would typically be called

the forward and reverse pathways as illustrated in Figure 2.

Figure 2. Free energy differences of transitions in the direction of increasing and

decreasing entropy should be added separately to get the overall free energy for a dipole

inversion.

Although these particular sums are nonzero, they provide a consistent definition of the

statistical variance of insertion and deletion. The statistical variance for symmetric

transformations will simply be the average of the variance for the deletion and insertion

processes.

15

3.1.3 Bennett acceptance ratio (BAR)42:

The Bennett acceptance ratio uses samples of the potential energy both from i to j and j to

i to obtain a provably minimum variance estimate of the free energy difference.

Calculation of free energy change between any two intermediate states through BAR

requires self-consistent solution of the two equations:

i

j

N

li

l

N

kj

kij

N

NC

CU

CUG

i

j

ln1

)](exp[1

1

)](exp[1

1

ln1

1

1

(7)

i

j

ijN

NGC ln

1

(8)

The first equation is true for any constant C, but when Eqs. (7) and (8) are solved self-

consistently, then Gij will have minimized variance. There exist a large number of ways

to solve the equations self-consistently which is beyond the scope of this paper. The

variance 2ij in free energy change can be estimated from:

1

)(

)(11

)(

)(12

2

22

2

2

2

j

j

ji

i

i

ijxf

xf

Nxf

xf

N (9)

Where f(x) is the Fermi function 1/(1+x) and x = (U-C). The total free energy change

is the sum over changes between consecutive intermediate states. Typically, the variance

in the full free energy is computed by assuming independent error and summing the

variance for consecutive intermediate states. However, the assumption that the errors add

independently is not correct, since the free energy difference from i-1 to i and from i to

i+1 states both depend on the potential energy at i, so their variances are not independent.

16

3.1.4 Unoptimized Bennett acceptance ratio (UBAR):

Eq. 9 in Section 3.4 is valid for any initial estimate of the free energy, though choices of

C not given by the implicit equation (Eq. 9) will not have minimum variance. If we make

the choice of C=-1ln(Nj/Ni), we no longer need to self consistently solve equations,

which can avoid saving, reading, and reprocessing all of the data, potentially saving

significant resources, at a cost of statistical efficiency. We instead accumulate the

averages in Eqs. 7 through 9 as the simulation progresses. If each intermediate free

energy is relatively near zero, the free energy estimate will be close to optimal. Such an

estimator is directly equivalent to the minimum variance version of transition state Monte

Carlo, where Barker acceptance probability43 is used.44

3.1.5 Range-based Bennett acceptance ratio (RBAR):

If we keep track of the ensemble averages in Eqs. 11 and 12 for a range of trial values of

C, we will obtain a number of estimates of the best estimate free energy.44 Of these

estimates, the one that is closest to the input C will be the least biased and will have

minimum variance. By choosing this particular estimate, we are essentially pre-

calculating the self-consistent solution. To apply this method, a range of starting values

of C is chosen. This trial C is fed as an initial guess and C is calculated using a single

iteration of Eq. 11, with corresponding G and then calculated. Accumulated averages

are maintained for each choice of G. A decent estimate of the range of C and G is

therefore a requirement for using this method. In some cases, it may end up being more

costly than BAR, as accumulated averages must be maintained for a certain number of

17

trial free energy values, instead of simply performing 5-10 self-consistent iterations.

However, the advantage of what we will call in this paper RBAR (range-based Bennett

acceptance ratio) is that data from each simulation step does not need to be retained for

post-processing as is required with BAR, and only the accumulated averages need to be

maintained.

3.1.6 Multistate Bennett acceptance ratio (MBAR)45:

MBAR is a method to find the free energies of K states simultaneously by minimizing the

KK matrix of variances of the free energy differences of these K states simultaneously.

The derivation of MBAR is a straightforward if mathematically difficult extension of the

derivation BAR to more than two states considered simultaneously. This can be a

significant improvement over BAR, which minimize variances of the free energy

differences for two states at a time. For MBAR, the equation:

K

k

N

nknkk

K

kk

knii

k

xUGN

xUG

1 1

1

)](exp[

)](exp[ln

1

''

'

'

(11)

is solved self-consistently for each Gi. Gij = G(j) - G(i) gives the free energy change

between two states i and j. The statistical variance of Gij, ij2 , is calculated using Eqs. 8

and 12 in the paper by Shirts and Chodera.45

3.1.7 Gaussian estimate of exponential averaging in two forms: deletion (GDEL) and

insertion (GINS)29:

18

If 2U = U2-U2 is finite, and we approximate the Uij distribution as a Gaussian,

the free energy can be expressed as a sum of moments of the energy differences46 by:

2

2 ijUijij

UG

(12)

The variance over N samples of this free energy difference is given by:

)1(2

422

2

NN

ijij UU

ij

(13)

If the distribution Uij is close to Gaussian, then this estimation method can minimize the

statistical effect of rare events, resulting in a more efficient and substantially simpler

estimate method. To remove ambiguities with respect to direction of the process, we use

the same convention of deletion and insertion as described for exponential averaging. In

Eq. 12, when Uij is in the direction of increasing entropy, we refer to this as the

Gaussian estimate with deletion or GDEL, and if we use Uij in the direction decreasing

entropy, we refer to this estimate as the Gaussian estimate with insertion or GINS.

Summing the free energy changes between intermediates again gives the total free energy

changes. Total variance is calculated assuming independent sampling at each state, which

is not an approximation here, as each calculation depends on samples from only one state.

The total free energy is calculated by summing over the free energy changes between

neighboring states.

3.2 Building molecular systems, setting up simulation framework and designing

statistical experiments.

19

(A) System preparation and simulation parameters.

Topologies for UA methane, dipole inversion, anthracene solvation test systems were

created by a combination of automated tools (Dundee PRODRG47, OpenEye libraries,48

ACPYPE49) and manual editing, and are available on the Alchemistry.org website,

http://www.alchemistry.org. Starting configurations were generated using GROMACS

4.0.4. The automated topologies were solvated using GROMACS genbox, and these

solvated systems were minimized with the L-BFGS50 (Low memory-BroydenFletcher

GoldfarbShanno) minimization method, followed by steepest descent minimization. All

systems were then equilibrated at constant volume at 300K for 100 ps, using Langevin

dynamics with a time step of 0.002 fs.51, 52 All hydrogen-containing bonds were

constrained using the SHAKE algorithm to a relative tolerance of 10-12. The systems were

then equilibrated at constant pressure at 1 atm using a Parrinello-Rahman barostat43 and a

Nose-Hoover thermostat44 for 100 ps. A coupling time constant of 5 ps was used for both

thermostat and barostat. A switching function was used for both PME and van der Waals

potentials. The PME switch started at 0.88 nm with a coulomb cutoff distance of 0.9 nm

for electrostatics. Other PME parameters were: Fourier spacing of 0.12 nm, 4th order B-

spline interpolation and a Ewald tolerance of 10-8. A van der Waals switch at 0.8 nm and

cutoff distance of 0.9 nm was used. A long range van der Waals dispersion correction

was used for both energy and pressure.

3.2.1 values and spacing between intermediate states for free energy calculations:

Initial simulations (5 ns long, including 0.5 ns equilibration) were performed with 21

equally spaced values to guide the selection of the values for the main study.

20

Intermediate states were chosen so that each window contributed almost equally to the

total error, specifically insuring that the maximum variance among all windows was no

larger than the twice of the variance among all windows. In order to examine the

change in bias, statistical error, and mean square error as a function of the spacing

between coupling parameter values, we choose two sets of states for each model: a

full set, and a sparse set.

3.2.1.1 Full set:

For the full intermediate set, states were chosen such that the total uncertainty in MBAR

was no greater than 0.1kJ/mol for methane solvation and no greater than 0.22 kJ/mol for

dipole inversion and anthracene solvation over the 5 ns initial simulation. For UA

methane solvation (MS), 8 intermediates were selected: = [0.0, 0.2, 0.4, 0.5, 0.6, 0.7,

0.8, 1.0] where = 0 denotes fully interacting UA methane in water and = 1 denotes

ghost UA methane, where there are no interactions with the solvent. For consistency of

sign between test cases, the reported results are in terms of the reverse process, the

solvation of methane. For dipole inversion (DI), we include 11 intermediate states: =

[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. Here = 0 denotes a starting +/-

configuration of the dipole; = 1 denotes the reversed configuration of dipole i.e. -/+,

with = 0.5 a state with zero partial charges. For anthracene solvation (AS), the full set

has 15 total states, with = [0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85,

0.9, 1.0]. = 0 denotes fully interacting united atom anthracene in water; = 1 denotes

the vacuum state anthracene with no interactions with water. Again this corresponds to a

desolvation process, and in tables and charts, we report the free energy of hydration,

21

which is simply the reverse process and so includes a sign reversal. For RBAR, this

spacing means that the largest free energy between intervals is approximately 35 kJ/mol,

meaning we must use range of -40 to 40 kJ/mol, with increments of 1 kJ/mol.

3.2.1.2 Sparse set:

For the sparse sets, states were chosen such that the total uncertainty in MBAR was no

greater than 0.25 kJ/mol for MS and 0.4kJ/mol for DI and AS over the 5 ns initial

simulation with no fewer than 3 total states. For methane solvation, there are only three

states [0, 0.5, 1]. For dipole inversion, the sparse set was generated by picking every

alternate along with = 0.5 (which represents zero net charge) [0, 0.2, 0.4, 0.5, 0.6, 0.8,

1.0], reducing the number of states from 11 to 7. For the AS test set every third was

chosen to create the sparse set [0.0, 0.2, 0.5, 0.7, 0.85, 1.0], reducing the number of

states from 15 to 6. Note that we did not need to run the separate simulations for the

sparse states; we merely select for analysis only a subset of the states from the full set.

For RBAR, this spacing means that the largest free energy between intervals is

approximately 66 kJ/mol, meaning we must use a range of -70 to 70 kJ/mol, with

increments of 1 kJ/mol. In order to also test the bias with respect to number of states, for

each model we conduct a set of 5 ns simulations at 51 equally spaced states, with a

spacing 0.02 between two neighboring states.

22

3.2.2 Generating an ensemble of uncorrelated configurations

The most important use of the benchmark test set in this paper is to compare estimates of

the statistical error of different estimators of the free energy with direct sample error

obtained by repeating the experiment N times. For this purpose, we started by generating

100 uncorrelated starting configurations. Using the =0 state from the 5 ns test runs, we

used the GROMACS program g_analyze to compute the autocorrelation time of the

potential energy, kinetic energy, total energy, coulomb interactions, and dUpot/d for all

three systems. We used block averaging53 using the GROMACS g_analyze program to

compute the autocorrelation times. The autocorrelation time of potential energy was

chosen since it was the longest correlation time of all the observables examined. The

autocorrelation times of potential energy for UA methane solvation, dipole inversion,

anthracene solvation were 25 ps, 30 ps, and 25 ps respectively.

We selected configurations separated by 2 ns as our uncorrelated starting points. 2 ns is

more than 50 times longer than the 30 ps autocorrelation time of the potential energy.

Each of the three test systems were run for 20 ns and configurations were printed out

after every 2 ns. From each of these 10 parent configurations, 20 ns simulation were run

with different random seeds to generate a Maxwell-Boltzmann velocity distribution, and

configurations were printed out every 2 ns, giving a total of 100 uncorrelated starting

configurations for each system. These configurations were generated using a beta version

of GROMACS 4.5 also used for subsequent free energy configurations. Velocity Verlet

(option md-vv) was used as the integrator algorithm. A Nose-Hoover thermostat and the

MTTK54 barostat was used to control temperature and pressure, respectively.

23

The starting co-ordinate files for each intermediate state simulation, other than the first

state, were generated by running short 10 ps equilibration runs in series. The first state at

=0 used one of the 100 uncorrelated starting configurations as the starting co-ordinate

file. The starting configuration for any other state used the final configuration of the

previous equilibrated state. After this initial equilibration round, 5 ns of NPT simulation

were then performed for each initial configuration and each separate state. Data

corresponding to first 500 ps was discarded as equilibration. The remaining 4.5 ns of

equilibrium data for each model at each intermediate state were used for all subsequent

calculations.

(B) Statistical tests.

3.2.3 Quantifying accuracy and precision in uncertainty estimate of an estimator.

We estimate uncertainty for each free energy estimator in three ways. First, we compute

the sample standard deviation G2 - G2 from Gs computed from the series of

100 uncorrelated simulation runs described above. Second, we compute the analytical

estimates of error corresponding to each of the methods. Finally, we use the bootstrap

estimator for the standard deviation30.

3.2.3.1 Sample standard deviation: To compute the sample standard deviation, we take

the simulations started from 100 initial configurations and compute free energy

differences from each simulation to obtain a distribution of free energy differences. We

24

then directly compute the sample standard deviation corresponding to each individual

estimator from:

1

)(

)( 1

2

N

GG

G

N

i

i

(14)

Where N = 100, and where G is the mean over the 100 values of Gis. Crucially, the

standard deviation computed from a finite sized sample is itself a statistical quantity, and

must therefore have an associated uncertainty. Rigorously, in order to compute the

sample standard deviation of the uncertainty ((G)), we would need to repeat our 100

simulation experiment 100 times. Instead, we have used the bootstrap method (described

in the bootstrap estimate section) to estimate ((G)). From this exercise we finally get

G and (G) ((G)). G here is not an ensemble average but the average

over 100 repetitions.

3.2.3.2 Analytical estimate: Each free energy estimator has an associated uncertainty

estimator as discussed in previous sections, namely the square root of the estimated

variance of the total free energy. From 100 uncorrelated starting configurations, we will

obtain not only 100 Gs, but 100 error estimates from the methods analytical

uncertainty estimate. We denote the average and standard deviation of these estimated

uncertainties over all 100 independent runs as (G) ((G)), and call these the

analytical uncertainty estimate and the standard deviation of the analytical estimate

distribution.

25

3.2.3.3 Bootstrap estimate: For each of the 100 independent free energy calculations, we

also calculate a bootstrap error estimate, which is a rigorous and robust statistical

estimation technique.30 The bootstrap error is constructed as follows: From each set of

potential energy differences or dU/d values, we generate N bootstrap sets from the

original sample of molecular simulation data. To generate a bootstrap set, we first

subsample the data using an estimate of the autocorrelation time to obtain N statistically

uncorrelated values. For each bootstrap set, we draw N samples with replacement from a

set of uncorrelated measurements. For example, if our set was the integers {3,6,8,9},

then a bootstrap set would consist in randomly selecting each of the four numbers for

times; {3,3,8,9}, {9, 6, 6, 3}, and {8, 8, 8, 8} would all be valid sets, though clearly the

last one would be the rarest. We repeat this sub-sampling process many times; in this

case, we draw 200 bootstrap sets. For each of these 200 bootstrap sets, we compute the

free energy and uncertainties using the estimators as if they were the original data set.

This gives us 200 Gs, one for each of the 200 bootstrapped sets; standard rules of thumb

suggest using 50-200 bootstrapped sets to get robust estimates of uncertainty.30 The

average of these 200 bootstrapped Gbss gives Gbs. The bootstrap estimate of the

error, (G)bs is the sample standard deviation of the 200 Gbs values for each of the

initial configurations. The average (G)bs over all 100 initial configurations is the

bootstrap error estimate. The statistical uncertainty of this bootstrap uncertainty estimate

is estimated by computing the sample standard deviation over the 100 (G)bs, and is

denoted by ((G)bs).

26

A good free energy estimator should have an analytical uncertainty estimate that is

consistent with the sample standard deviation. If the analytical uncertainty estimate is not

accurate, then we have no way of knowing how accurate our free energy estimate is when

we run only a single calculation. This can be very problematic; for example, the analytic

estimate of the uncertainty of EXP diverges from the true estimate well before the error

in EXP itself.28 The smaller the difference between the direct sample standard deviation

error estimate (G) and the predicted or analytical error estimate (G), the better

we know the methods ability to predict error without having to run multiple trials. If the

statistics are well-behaved, the bootstrap estimate of the statistical uncertainty should also

agree with the sample estimate of the uncertainty. If we can show that the bootstrap

estimate agrees with the sample estimate of the uncertainty, then bootstrap error can

substitute for sample uncertainty estimates even when the analytical estimate fails.

3.2.4 Quantifying bias of free energy estimates

The average estimate from a statistically biased estimator, even if repeated an infinite

number of times, will still deviate from the true estimate by the amount of the bias.

There are typically two types of bias in the free energy estimators considered here.

Asymptotically unbiased estimators have bias with finite number of samples but

converge to a true answer in the limit of large numbers of samples. An example is the

nave estimator of the variance in the average of a set of numbers, which is var(a) = 1/N

[(A Ai)2]. This estimate can be shown to always be slightly too large. If N is

replaced by N-1, however, the estimator becomes unbiased. In the limit of very large N,

27

the difference is meaningless. Thus the nave estimator of the variance is not unbiased,

but it is asymptotically unbiased. Thermodynamic integration does not have asymptotic

bias, because at each intermediate state, the simple average of dU/d is unbiased for any

numbers of samples. Exponential averaging, BAR, and MBAR are only asymptotically

unbiased, though the bias of exponential averaging is usually significantly higher than

that of BAR and MBAR;28 careful design of the pathway can minimize this large bias to

some extent.55 However, unlike the simple case of the estimator for the variance of an

average, there exist no known unbiased versions of these estimators.

Bias also occurs due to using a limited number of intermediate states, because of lack of

either phase space overlap between intermediates in exponential averaging, as occurs in

acceptance ratio methods and exponential averaging, or because of numerical integration

error occurring in thermodynamic integration methods. Even for a large benchmark set

like the present study, computational expense and storage limits make it very difficult to

approach the number of sample limit and the number of intermediate states limit

simultaneously. Therefore, we estimate the two contributions to bias independently. For

asymptotic bias, we compare the results from combining a fixed amount of data in either

one large data set, or a series of shorter data sets. For bias as a function of number of

intermediate states, we vary the number of intermediates given fixed length simulations

to investigate bias as a function of number of intermediate states.

3.2.4.1 Bias due to number of samples: Data from all 100 5ns runs is stitched into a

single trajectory to mimic a single long trajectory. The data corresponding to

28

equilibrations was not included while estimating free energies, and since each simulation

contained 4.5 ns, stitching results in 450 ns total simulation data. The difference between

free energy estimates computed with 450 ns of data (G)450 and the same data used as an

average of 100 4.5 ns trajectories G is the bias due to number of samples.

3.2.4.2 Bias due to number of intermediate states: We also ran a set of simulations

with 51 states for each model as discussed above. We can estimate the bias due to the

number of intermediate states as the difference between free energy estimates computed

using 51 states (G)51, G estimated using the full states, and G estimated using the

sparse states. Besides having low and consistent error estimates, ideal free energy

estimation methods would show little or no bias in these tests. In many cases, we are

limited by the statistical uncertainty in determining the bias with high accuracy, since it is

computationally too demanding to generate 500 ns of simulation for all 51 states for all

test sets. In these cases, we can only determine if the bias is statistically insignificant

with respect to the statistical uncertainty. In the case of bias with respect to number of

samples, in most cases, asymptotic bias scales with 1/N while statistical uncertainty

scales with 1/N1/2, so statistical variance is usually dominant.

3.2.5 Quantifying reliability of a free energy estimator

Root mean square error (RMSE) is the square root of the sum of squared differences

between the sample and the true answer. Alternatively, we can write it as the sum of the

variance estimate (2) and the square of the bias. Mean square error is therefore the most

29

useful overall measure of the reliability of any statistically estimated observable.

Calculating a true RMSE requires collecting an infinite number of samples at an infinite

number of intermediate states to obtain the true answer. Since it is computationally

impossible to reach this limit, and computationally expensive to approach it, we use the

free energy estimate G450 from the 450 ns run to approximate the unbiased limit of free

energies given a fixed set of intermediates, and the free energy estimate G51 from 51

state run as the unbiased limit of the free energy estimate with respect to large numbers of

intermediate states, given a fixed amount of sampling. From the two different biases

generated from these two reference states, we obtain two different estimates of mean

square error. Neither of these are true RMSEs because we lack the true reference answer.

However, these estimates of the RMSEs capture the combined effect of the statistical

error and the two different sources of bias. We define the two estimates of mean square

errors as MSE1 and MSE2:

MSE1i = i2 + Bias1i2, where Bias1i = (G)450 (Gest)i and 1 i 100 (15)

MSE2i = i2 + Bias2i2, where Bias2i = (G)51 (Gest)i and 1 i 100 (16)

RMSE1 and RMSE2 will be square roots of MSE1 and MSE2 respectively. The errors in

RMSE1 and RMSE2, (RMSE1) and (RMSE2) respectively, are the standard deviations

over the 100 RMSE1i and 100 RMSE2i. With the two quantities RMSE1 and RMSE2,

we can examine qualitative information about the reliability of the methods using all the

information from our experiments. We plot a bivariate Gaussian with variances equal to

RMSE1 and RMSE2 on mutually perpendicular axes, as shown in Figure 3(a), with the

analytical average of the free energy estimate G estimated by the method as the mean.

An example bivariate Gaussian RMSE plots is shown in Figure 3(b). Figure 3(c) shows

30

the top view of this Gaussian. The overall spread of the rings in Figure 3(c) is a measure

of overall reliability.

Figure 3. (a) Gaussians are plotted on mutually perpendicular axes. Both have the G

calculated using a method (here TI for methane solvation) as mean and RMSE1 and

RMSE2 as their standard deviations. (b) These are fused to generate a bivariate Gaussian

plot. (c) Top view of the bivariate Gaussian plot.

A poor estimator has either large and or unequal spreads in horizontal and vertical

directions, yielding a large circle or an ellipse. A good estimator has small and equal

spread in both the horizontal and vertical direction. Larger spread in one direction

indicates dominance of a single type of bias. In the above plot, the vertical axis

corresponds to the Gaussian with variance equal to RMSE2. Vertical spread indicates that

bias due to number of intermediate states dominates the uncertainty estimate, while

31

horizontal spread indicates bias due to number of samples dominates the uncertainty

estimate.

3.2.6 Validating the Gaussian distributions of the free energy differences.

When we express uncertainty in the free energy estimate in terms of a single number, the

variance or the statistical uncertainty, rather than as a distribution, we are implicitly

assuming that the errors are well described by a normal distribution, whose spread is

equal to the statistical variance. Most analytical variance methods use propagation

estimates that are only rigorously true in the Gaussian limit. As long as the variances are

bounded (not infinite), the central limit theorem ensures that with enough samples,

variances will indeed converge to the Gaussian limit. However, for finite number of

samples, this assumption must be tested, not simply assumed, or else we run the risk of

underestimating the chance of large deviations (black swans) from the average value.

To test whether the shape of the free energy distribution is Gaussian or not, histograms of

free energy distributions are plotted against Gaussians using the free energy estimate as

mean and analytical uncertainty estimate as standard deviation for each method.

3.3 Results and discussion:

Results are presented in Tables 1-5 as well as in Figures 4-13. Tables 1 and 2 contain

results of the uncertainty analysis for full and sparse sets only for methane solvation.

Tables containing full results for dipole inversion and anthracene solvation can be found

in the supplementary material. Tables 3, 4 and 5 contain the free energy estimates

32

corresponding to 450 ns and 51 states runs, bias analysis, and reliability estimates of

free energy and uncertainty predictions for UA methane solvation, for full and sparse

sets. Figures 4-7 provide comparison of the accuracy and precision in free energy and

uncertainty predictions for all ten methods for the three test cases. Tables of the

uncertainty analysis for the dipole inversion and anthracene hydration free energy test

cases are presented in the supplementary material as tables S1-S10. Bivariate Gaussian

plots presenting the analysis of reliabilities of free energy methods are presented in

Figures 8-10. Figures 11 and 12 compare the actual distribution of the free energies

(computed using 100 5ns simulations with full and sparse sets) with Gaussians using

the corresponding variance and mean estimates of the free energy methods, along with

the Gaussians of 450 ns trajectory solution and the 51 state simulation set for

comparison. In some plots, we omit GDEL and GINS, as their errors are significantly

larger than the scale of errors of the other methods in the plots.

3.3.1 Validation of uncertainty estimates. The first column in Table 1 is the free energy

change calculated as an average over the 100 uncorrelated repetitions. The next three

columns are the average of the analytical estimate of uncertainty over all repetitions

(G), the sample standard deviation (G) and the average bootstrap estimate of the

uncertainty over all repetitions (G)bs. The last column gives the percent deviation of

the analytical estimate of uncertainty from the sample standard deviation along with the

propagated error. Standard deviations {((G)),((G)), ((G)bs} for the error

estimate distributions are also reported.

33

We find that the analytic error estimate of most methods underestimates the true sample

standard deviation. In some of the methods this deviation is within either one or two

standard deviations, and thus likely within the statistical noise. Examining the data in

Tables 1, S2, and S3 for TI and TI3, analytical and sample estimates of uncertainty differ

by one or less standard deviation for all but the anthracene solvation with the full set,

where it differs by approximately two standard deviations, and is likely therefore noise.

For IEXP and DEXP, analytical and sample estimates also differ by one or in some cases

two standard deviations. However, analytical and sample uncertainty estimates for BAR,

RBAR and UBAR have differences of up to five standard deviations which indicates that

the BAR uncertainty estimates can be significantly inaccurate for multiple intermediate

states. In contrast, analytical uncertainty estimates with MBAR have less than one

standard deviation from sample uncertainty estimates. Over all methods, MBAR

analytical error estimates deviate least from the sample estimate.

The percentage deviation of analytical estimate from the sample uncertainty estimate is

listed explicitly in column six of Table 1. A negative percent deviation indicates

underestimation of uncertainty and a positive percent deviation indicates overestimation

of uncertainty by the analytical uncertainty estimate. The percentage deviation of the

analytical uncertainty estimate from sample uncertainty for TI is -78% and for MBAR

is -108%, in both cases, within the noise, while for BAR, the deviation from the true

uncertainty is -315%, which is clearly statistically significant. Large deviation of the

analytical estimate from sample standard deviation indicates poor accuracy of estimator

in estimating uncertainty. Similar deviations in analytical and sample estimates are seen

34

for UBAR and RBAR. As with BAR and its variants, analytical and sample estimates

GDEL and GINS are off by 305%.

Bootstrap uncertainty is clearly a robust alternative to the sample standard deviation for

all methods. Bootstrap estimates of the uncertainty (in column five of Table 1) are very

close to sample uncertainty (in column four of Table 1). Specifically in the case of BAR,

RBAR, UBAR, GDEL and GINS bootstrap estimates of uncertainty are better than the

analytical counterparts. In cases where the analytical estimate does not accurately predict

the sample standard deviation, such as with BAR, RBAR, and UBAR, the bootstrap

method provides a useful estimate of the error in the free energy without a need for

performing repeated sampling.

Figure 4 visually demonstrates the efficiency of free energy methods and the consistency

of uncertainty estimators. Short bars indicate precise free energy estimates. Equal height

bars indicate that the analytical and bootstrap uncertainties are consistent with the sample

standard deviation. For the full set MBAR, TI and TI3 predict free energies with the

highest precision and with the most reliable error estimate. BAR, RBAR and UBAR

analytical estimates have large deviations from the standard deviation estimate but the

bootstrap estimate closely matches sample standard uncertainty estimate. IEXP, DEXP

have the largest deviations particularly for large transformations like dipole inversion and

anthracene solvation.

35



solvation using the full set. All quantities are in kJ/mol. G is not the ensemble

average but the average over 100 repetitions.

Method G

(G)

((G))

(G)

((G))

(G)bs

((G)bs)

% dev

(% dev)

TI 9.081 0.1070.002 0.1150.009 0.1060.006 -7.17.9

TI3 9.008 0.1110.003 0.1190.012 0.1100.007 -6.99.5

DEXP 8.984 0.3200.255 0.5560.122 0.3190.267 -42.547.5

IEXP 8.928 0.1040.002 0.1100.011 0.1030.006 -5.89.6

UBAR 8.936 0.0790.002 0.1060.010 0.0980.005 -25.37.5

BAR 8.933 0.0750.001 0.1090.010 0.0990.006 -31.36.3

RBAR 8.937 0.0750.001 0.1090.010 0.0990.006 -31.06.7

MBAR 8.929 0.0950.002 0.1060.010 0.0940.005 -9.98.6

GDEL 7.042 0.0930.002 0.1360.011 0.1210.008 -31.75.8

GINS 1.097 0.2530.008 0.3990.032 0.4000.169 -36.55.5

36


plotted for all the methods in three different test cases for full set. Consistent free

energy methods have bars equal in height, and the most accurate methods have the

shortest bars.

37



solvation using sparse set. All quantities are in kJ/mol.

Method G

(G)

((G))

(G)

((G))

(G)bs

((G)bs)

% dev

(% dev)

TI 2.545 0.175 0.007 0.177 0.014 0.175 0.011 -1.5 8.7

TI3 3.792 0.214 0.009 0.217 0.017 0.214 0.014 -1.5 8.7

DEXP 5.631 1.343 0.524 3.179 0.679 1.628 1.186 -57.8 18.8

IEXP 9.091 0.666 0.101 0.638 0.042 0.683 0.108 4.4 17.3

UBAR 8.954 0.344 0.018 0.358 0.024 0.351 0.024 -3.9 8.2

BAR 8.926 0.225 0.006 0.263 0.014 0.232 0.014 -14.4 4.9

RBAR 8.927 0.226 0.005 0.260 0.016 0.233 0.014 -13.2 5.6

MBAR 8.928 0.232 0.006 0.262 0.015 0.232 0.014 -11.4 5.4

GDEL -3.833 0.112 0.004 0.189 0.014 0.200 0.016 -40.6 4.9

GINS -1.68E+32 3E+30 34E+30 13E+32 10E+32 1E+32 15E+32 -99.7 2.7

38


plotted for six methods in three different test cases for sparse set. Four methods

(DEXP, IEXP, GINS and GDEL) are not graphed because they did not converge properly

for any of the three uncertainty estimates.

39

For the sparse set, as shown in Tables 2, S2 and S4 and Figure 5, TI and TI3 still show

the lowest percent deviation from sample standard deviation. However, the free energy

estimate of methane solvation is off by 6.5 kJ/mol for TI and 5 kJ/mol for TI3. GDEL

and GINS have clearly un-converged free energy and uncertainty estimates; the free

energy estimate of methane solvation for GDEL is off by 12 kJ/mol and for GINS is off

by ~1032 kJ/mol. This clearly indicates the failure of the Gaussian approximation of U.

The free energy estimate of methane solvation for DEXP differs from the converged

answer by 3 kJ/mol and its uncertainty estimate is 5 times larger than the largest

estimated uncertainty shown in Figure 5. IEXP differs from the converged answer by

only 0.2 kJ/mol but its uncertainty estimate is twice the largest plotted uncertainty.

MBAR again has the lowest and most consistent uncertainty estimates. However, unlike

with the full set, BAR and RBAR have uncertainty estimates which are as accurate as

MBAR to within statistical noise.

Bootstrap and analytical estimates of the error in the sparse set are slightly lower than

the sample standard deviation for acceptance ratio methods for methane solvation, though

not for the other two molecules. For this set of states, MBAR does not provide the

same clear advantage over BAR in estimating the uncertainty as with the full set. We

hypothesize that this advantage may only exist when the overlap between states is

somewhat high. However, even our full set uses relatively aggressive spacing for

typical free energy calculations. MBAR analytical estimates will generally have the

advantage over BAR analytical estimates as with MBAR it is not necessary to attempt to

determine whether the calculation is in the low overlap or high overlap regime.

40

3.3.2 Analysis of bias in free energy estimates: Statistical uncertainty is the most

important measure to quantify to understand the reliability of the free energy estimate but

understanding systematic bias is also important. By comparisons with free energy

estimate using large number of samples we can test whether or not the method is

asymptotically biased. By comparisons with the free energy estimate using large number

of intermediate states we can find how sensitive a methods accuracy is to the overlap

between intermediate states. Table 3 shows the free energy and uncertainty estimates

predicted by different methods for UA methane solvation for large number of samples

(450 ns trajectory) and large number of intermediate states (51 states). Tables 4 and 5

include estimates of both types of bias in free energy estimates, with respect to number of

samples and with respect to number of intermediate states, and the corresponding root

mean square error estimates for UA methane solvation. For UA methane solvation,

MBAR, BAR, RBAR, UBAR, TI, and TI3 have very low bias in free energy estimates

with respect to number of samples. The acceptance ratio methods, MBAR, BAR, RBAR,

and UBAR have biases with respect to number of states within the statistical noise. TI

and TI3 show larger bias in free energy estimates with respect to number of states than

the other methods. DEXP and IEXP show large biases in free energy estimates both with

respect to number

ms thesis on benchmark test set for free energy calculations using molecular simulations

Documents