ms thesis on benchmark test set for free energy calculations using molecular simulations

Upload: hkmydreams

Post on 08-Mar-2016

15 views

Category:

Documents


1 download

DESCRIPTION

Benchmark test set for free energy calculations using molecular simulations

TRANSCRIPT

  • Using a molecular benchmark set to compare free energy estimators

    and explore electrostatic potential parameter space

    Himanshu Paliwal

    B. Tech. Chemical Engineering, Institute of Technology Banaras Hindu University, 2002

    A Thesis presented to the Graduate Faculty

    of the University of Virginia in Candidacy for the Degree of

    Master of Science

    Department of Chemical Engineering

    University of Virginia

    July, 2011

  • APPROVAL SHEET

    This thesis is submitted in partial fulfillment of the

    requirements for the degree of

    Master of Science (Chemical Engineering)

    ______________________________________________________

    Author

    This thesis has been read and approved by the Examining Committee:

    ______________________________________________________

    Michael R. Shirts, Thesis Advisor

    ______________________________________________________

    John OConnell, Committee Chairperson

    ______________________________________________________

    Erik J. Fernandez, Committee Member

    Accepted for the School of Engineering and Applied Science:

    ______________________________________________________

    Kathryn Thornton, Dean, School of Engineering and Applied Science

    July, 2011

  • ABSTRACT

    There is a significant need for improved tools to validate thermo-physical quantities

    computed via molecular simulation. In this paper we present the initial version of a

    benchmark set for testing methods of calculating free energies of molecular

    transformation in solution. This set is based on molecular changes common to many

    molecular design problems, such as insertion and deletion of atomic sites and changing

    atomic partial charges. We use this benchmark set to compare the statistical efficiency,

    reliability and quality of uncertainty estimates for a number of published free energy

    methods, including thermodynamic integration, free energy perturbation, the Bennett

    acceptance ratio (BAR), and its multistate equivalent MBAR. We identify MBAR as the

    consistently best performing method, though other methods are comparable in reliability

    and accuracy in many cases. We demonstrate that assumptions of Gaussian distributed

    errors in free energies are usually valid for most methods studied. We demonstrate that

    bootstrap error estimation is a robust and useful technique for estimating statistical

    variance for all free energy methods studied. We also use MBAR to explore electrostatic

    potential parameters space in order to find the set of parameters resulting in shortest

    simulation time for a given accuracy in the free energy estimate of the transformations in

    benchmark set and enthalpy of vaporization of water. This benchmark set is provided in a

    number of different file formats with the hope of becoming a useful and general tool for

    method comparisons.

  • CONTENTS

    ACKNOWLEDGEMENTS .............................................................................................. I

    LIST OF FIGURES ......................................................................................................... II

    LIST OF TABLES ........................................................................................................ VII

    1. Introduction ................................................................................................................... 1

    2. Molecular systems of benchmark set .......................................................................... 7

    2.1 Minimal test system .................................................................................................. 7

    2.2 Charge mutation system ............................................................................................ 7

    2.3 Large molecule mutation system .............................................................................. 8

    3. Comparing free energy methods based on statistical tests, using benchmark set 10

    3.1 Free energy methods and error propagation ........................................................... 10

    3.1.1 Thermodynamic integration ............................................................................. 10

    3.1.2 Exponential averaging ..................................................................................... 12

    3.1.3 Bennett acceptance ratio .................................................................................. 15

    3.1.4 Unoptimized Bennett acceptance ratio ............................................................ 16

    3.1.5 Range-based Bennett acceptance ratio ............................................................. 16

    3.1.6 Multistate Bennett acceptance ratio ................................................................. 17

    3.1.7 Gaussian estimate of exponential averaging .................................................... 17

    3.2 Building molecular systems, setting up simulation framework and designing

    statistical experiments. .................................................................................................. 18

    (A) System preparation and simulation parameters. ..................................................... 19

  • 3.2.1 values and spacing between intermediate states for free energy calculations

    ................................................................................................................................... 19

    3.2.1.1 Full set ................................................................................................... 20

    3.2.1.2 Sparse set ............................................................................................... 21

    3.2.2 Generating an ensemble of uncorrelated configurations ................................. 22

    (B) Statistical tests. ....................................................................................................... 23

    3.2.3 Quantifying accuracy and precision in uncertainty estimate of an estimator .. 23

    3.2.3.1 Sample standard deviation ........................................................................ 23

    3.2.3.2 Analytical estimate.................................................................................... 24

    3.2.3.3 Bootstrap estimate ..................................................................................... 25

    3.2.4 Quantifying bias of free energy estimates ....................................................... 26

    3.2.4.1 Bias due to number of samples ................................................................. 27

    3.2.4.2 Bias due to number of intermediate states ................................................ 28

    3.2.5 Quantifying reliability of a free energy estimator ............................................ 28

    3.2.6 Validating the Gaussian distributions of the free energy differences .............. 31

    3.3 Results and discussion ............................................................................................ 31

    3.3.1 Validation of uncertainty estimates ................................................................. 32

    3.3.2 Analysis of bias in free energy estimates ......................................................... 40

    3.3.3 Overall reliability of a free energy estimator ................................................... 47

    3.3.4 Testing the Gaussian distribution of free energies ........................................... 52

    3.3.5 Convergence properties of free energy estimates ............................................ 55

    3.4. Conclusions ............................................................................................................ 58

  • 4. Exploring computational efficiency of multiple electrostatic potential parameter

    sets using data sampled at a single set. .......................................................................... 61

    4.1 Methods................................................................................................................... 64

    4.1.1 Difference between free energy estimates calculated with two different PME

    parameter sets............................................................................................................ 64

    4.1.2 Difference in enthalpy of transformation between two PME parameter sets .. 68

    4.1.3 Enthalpy of vaporization of water calculated at new set of PME parameters. 70

    4.2 Results and Discussion ........................................................................................... 75

    4.3 Conclusions ............................................................................................................. 88

    References: ...................................................................................................................... 90

    Appendix: ........................................................................................................................ 94

    A1 Thermodynamic integration using cubic splines .................................................... 94

    A2 Supplementary information..................................................................................... 97

    A3 Availability of datasets for other simulation packages ......................................... 120

    Readme for free energy benchmark V1.0 ............................................................... 121

    A3.1. Benchmark Test Set Energies ................................................................... 121

    A3.2. Difficulties in Exact Energy Matching ..................................................... 121

    A3.2.1. Agreement in combination rule.......................................................... 121

    A3.2.2. Agreement in Cutoff .......................................................................... 122

    A3.2.3. Agreement in PME Parameters .......................................................... 123

    A3.3. Best Matches to Parameters used in the Benchmark Set Tests ................. 123

    A3.4. Long Cutoff Parameters ............................................................................ 124

    A3.5. Format of output........................................................................................ 125

  • A3.6. File Organization....................................................................................... 126

  • I

    ACKNOWLEDGEMENTS

    I would like to express my heartfelt gratitude to my advisor Prof. Michael R. Shirts.

    Without his help, support and vision, this work would not have been possible. I would

    like to thank my lab members, Kai, Christoph, Arjan, Joe, Tri and Jon. Their comments

    and discussions have contributed in bringing this effort to its final form.

    I thank UVA ITC and UVA XCG for their computing resources support. I thank Karolina

    from the UVA Computer Science department. Her help during the beta test of cross

    campus grid gave me ideas about setting up the framework for running high through-put

    simulations. I thank James Watney at D. E. Shaw Research for significant help in running

    Desmond.

    Last but not the least I thank my parents, my family and my wife Priya for all the moral

    support they have provided me throughout the work.

  • II

    LIST OF FIGURES

    Figure 1. (a) In the coupled state or solvated state both intermolecular and intramolecular

    interactions for anthracene are turned on. (b) In the decoupled state or vacuum state the

    intermolecular interactions with water molecules are turned off. 9

    Figure 2. Free energy differences of transitions in the direction of increasing and

    decreasing entropy should be added separately to get the overall free energy for a dipole

    inversion. 14

    Figure 3. (a) Gaussians are plotted on mutually perpendicular axes. Both have the G

    calculated using a method (here TI for methane solvation) as mean and RMSE1 and

    RMSE2 as their standard deviations. (b) These are fused to generate a bivariate Gaussian

    plot. (c) Top view of the bivariate Gaussian plot. 30

    Figure 4. Uncertainty estimates (sample standard deviation, analytical, bootstrap) are

    plotted for all the methods in three different test cases for full set. Consistent free

    energy methods have bars equal in height, and the most accurate methods have the

    shortest bars. 36

    Figure 5. Uncertainty estimates (sample standard deviation, analytical, bootstrap) are

    plotted for six methods in three different test cases for sparse set. Four methods

    (DEXP, IEXP, GINS and GDEL) are not graphed because they did not converge properly

    for any of the three uncertainty estimates. 38

    Figure 6. Bias plots for different test cases for the full set. DEXP, TI and TI3 show

    numerically significant bias for methane, while all methods show moderate bias with

    respect to number of states for anthracene. 45

  • III

    Figure 7. Bias plots for different test cases or sparse set. DEXP and IEXP show large

    biases both due to number of samples and due to number of intermediate states. 46

    Figure 8. Bivariate Gaussian plots for UA methane solvation. Note how TI and TI3 fail

    in reliability test for sparse set. GDEL and GINS are not at all reliable. MBAR and

    BAR are accurate but the estimate of the precision in BAR is misleading as it

    underestimates uncertainty, as shown in Figures 4 and 5. SP after the method name

    indicates the results of the sparse set. 49

    Figure 9. Bivariate Gaussian plots for dipole inversion. TI is reliable for this molecule,

    but DEXP, IEXP, GDEL and GINS again are the least accurate and precise of all the

    methods studied. 50

    Figure 10. Bivariate Gaussian plots for Anthracene hydration free energy. The effect of

    bias due to intermediate states is evident in almost all methods as all spreads are elliptical

    in vertical direction. TI3 and MBAR both appear reliable, especially for the full set. 51

    Figure 11. Each subplot has a comparison between the distribution of estimated free

    energies from 100 repetitions (in blue), Gaussian with the mean G and standard

    deviation (G) from 100 repetitions (in black), Gaussian with the mean (G)450ns and

    standard deviation ((G)450ns) (in magenta), Gaussian with the mean (G)51states and

    standard deviation ((G)51states ) (in cyan) for the full set for methane solvation. 53

    Figure 12. Each subplot has a comparison between the distribution of estimated free

    energies from 100 repetitions (in blue), Gaussian with the mean G and standard

    deviation (G) from 100 repetitions (in black), Gaussian with the mean (G)450ns and

  • IV

    standard deviation ((G)450ns) (in magenta), Gaussian with the mean (G)51states and

    standard deviation ((G)51states ) (in cyan) for sparse set for methane solvation. 54

    Figure 13. Free energies estimated using large number of intermediate states and a large

    number of samples converge to a single value for all free energy estimators. 56

    Figure 14. Accuracy in methane solvation free energy estimates as a function of different

    PME parameters. 76

    Figure 15. Inaccuracy in free energy estimates increases with increasing phase space

    overlap. 77

    Figure 16. Molar enthalpy of vaporization of water as a function of different PME

    parameters. 80

    Figure 17. Accuracy in methane solvation enthalpy estimates as a function of different

    PME parameters. 81

    Figure 18. Simulation time for methane solvation as a function of different PME

    parameters. 82

    Figure 19. Subplot at row 2 column 2 in Figure 14 is expanded to explore exactly how

    G estimates of methane solvation vary for different Fourier spacings and cutoffs

    converge. 83

    Figure 20. Subplot at row 2 column 2 in Figure 17 is expanded to explore exactly how

    Hvap estimates vary for different Fourier spacings and cutoffs converge. 83

    Figure 21. Subplot at row 2 column 2 in Figure 19 is expanded to explore exactly how

    simulation time for methane solvation varies for different Fourier spacings and cutoffs. 84

  • V

    Figure S 1. Plots comparing the distribution of dipole inversion free energies with the

    Gaussians for full set for dipole inversion. 108

    Figure S 2. Plots comparing the distribution of dipole inversion free energies with the

    Gaussians for sparse set for dipole inversion. 109

    Figure S 3. Plots comparing the distribution of estimated anthracene solvation free

    energies with the Gaussians for full set. 110

    Figure S 4. Plots comparing the distribution of estimated anthracene solvation free

    energies with the Gaussians for sparse set. 111

    Figure S 5. Accuracy in free energy estimates of dipole inversion as a function of

    different PME parameters. 112

    Figure S 6. Accuracy in the estimate of enthalpy for dipole inversion process as a

    function of different PME parameters. 113

    Figure S 7. Simulation time for dipole inversion as a function of different PME

    parameters. 114

    Figure S 8. Subplot row 2, column 2 in Figure S5 is expanded to explore exactly how

    G estimate of dipole inversion vary for different Fourier spacings and cutoffs. 115

    Figure S 9. Subplot row 2, column 2 in Figure S7 is expanded to explore exactly how

    simulation time for dipole inversion varies for different Fourier spacings and cutoffs. 115

    Figure S 10. Accuracy in anthracene solvation free energy estimate as a function of

    different PME parameters. 116

    Figure S 11. Accuracy in anthracene solvation enthalpy estimates as a function of

    different PME parameters. 117

  • VI

    Figure S 12. Simulation time for anthracene solvation as a function of different PME

    parameters. 118

    Figure S 13. Subplot row 2, column 2 in Figure S10 is expanded to explore exactly how

    G estimate of anthracene solvation vary for different Fourier spacings and cutoffs. 119

    Figure S 14. Subplot at row 2, column 2 in Figure S12 is expanded to explore exactly

    how simulation time for anthracene solvation varies for different Fourier spacings and

    cutoffs. 119

  • VII

    LIST OF TABLES

    Table 1. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for UA methane

    solvation using the full set. All quantities are in kJ/mol. G is not the ensemble

    average but the average over 100 repetitions. 35

    Table 2. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for UA methane

    solvation using sparse set. All quantities are in kJ/mol. 37

    Table 3. Free energy estimates and corresponding uncertainty estimates in the large

    number of samples (450 ns) and large number of intermediate states (51 states) for UA

    methane solvation. Bootstrap estimates are reported as they are better than analytical

    estimates. All quantities are in kJ/mol. 41

    Table 4. Bias estimates due to number of samples and number of lambda states for full

    and sparse sets for UA methane solvation. All quantities are in kJ/mol. 42

    Table 5. RMSEs and statistical uncertainties in RMSEs in UA methane solvation free

    energy estimate are largest for GINS and GDEL in sparse set while the acceptance ratio

    methods consistently show low RMSEs and the corresponding uncertainty in the RMSEs.

    All quantities are in kJ/mol. 43

    Table 6. Time consumed by different methods to calculate the free energy 201 times of

    anthracene solvation. 57

    Table 7. Summary of all statistical tests are presented. 60

    Table 8. PME parameters used to generate input sets. 74

  • VIII

    Table 9. Lead parameter set for methane solvation. 85

    Table 10. Lead parameter sets for dipole inversion. 85

    Table 11. Lead parameter sets for anthracene solvation. 86

    Table 12. Comparison of lead parameter sets with the parameter set used in simulation

    done to test free energy estimators for methane solvation. 87

    Table S 1. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), and bootstrap (G)bs) for dipole inversion

    using the full set. All quantities are in kJ/mol. 97

    Table S 2. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for dipole inversion

    using the sparse set. All quantities are in kJ/mol. 98

    Table S 3. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for anthracene solvation

    using the full set. All quantities are in kJ/mol. 99

    Table S 4. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for anthracene solvation

    using sparse set. All quantities are in kJ/mol. 100

    Table S 5. Free energy estimated using large number of samples and large number of

    intermediate states for dipole inversion. 101

    Table S 6. Bias in free energy estimates due number of samples and number of

    intermediate states for dipole inversion are presented here. None of the methods show

    significantly large bias for this molecular test set. 102

  • IX

    Table S 7. RMSEs and statistical uncertainties in RMSEs in free energy change of dipole

    inversion of IEXP, DEXP, UBAR GDEL, and GINS are significantly higher (indicating

    lower reliability) compared to other methods specifically in the sparse set. 103

    Table S 8. Free energies estimated u large number of samples and large number of

    intermediates states for anthracene hydration free energy. 104

    Table S 9. GDEL and GINS have largest bias in free energy estimates for anthracene

    solvation due to number of samples and number of intermediate states even for full set.

    For sparse set GDEL and GINS have even larger bias compared to full set. DEXP and

    IEXP also show significantly large biases in sparse set. 105

    Table S 10. High RMSEs and statistical uncertainties in RMSEs for GDEL and GINS in

    both the full and the sparse set indicate low reliability. DEXP, IEXP, UBAR also

    become unreliable in the sparse set. 106

    Table A 1. Single point simulation parameters for MD_sim_parm energies for methane

    solvation, dipole inversion, anthracene solvation. The second set of parameters for

    number of grid points is for dipole inversion, which had a larger box. 124

    Table A 2. Single point simulation parameters for high cutoff energies for methane

    solvation and anthracene solvation. 125

    Table A 3. Single point simulation parameters for high cutoff energies for dipole

    inversion. 125

  • 1

    1. Introduction

    Simulation and theory communities have developed substantial interest in using free

    energy calculations for molecular design problems. Specifically, free energy calculations

    can guide experimental screening techniques for measuring biological interaction

    energies and offer the potential of a faster and cheaper way to get thermodynamic

    information over large chemical spaces in a variety of molecular contexts.1 For example,

    drug design2 requires prediction of binding affinities, tautomers, protonation states,

    membrane permeabilities, and solubilities, all of which involve free energy calculations.3-

    6 Similarly, free energy calculations could become useful tools in material design

    problems ranging from improved protein selectivity and stability on chromatographic

    surfaces7 to tailoring metal organic frameworks8 for applications such as gas storage and

    separation. Such potential extends to the design of new nanomaterials such as therapeutic

    dendrimers, hetero-polymers and hyperbranched polymers for molecular recognition,

    imaging, sensing/signaling and controlled payload delivery.9-13 However, substantial

    roadblocks to routine use of molecular simulations as a complement to experiment are

    confusion over suitability of methods for different molecular problems, as well as lack of

    rigorous, validated understanding about the reliability of free energy calculations and

    other observables estimated using statistical methods.

    Other computational fields have successfully benchmarked and tested computational

    methods to improve the reliability and thus the utility of simulations. For example the

    field of computational fluid dynamics has also grappled with issues of reliability and

  • 2

    standardization of simulations. During the late 90s, substantial research efforts in the

    field of computational fluid dynamics (CFD) were focused on establishing validation

    benchmarks to improve the reliability of CFD simulations in various design

    applications.14-16 This research helped to bring down costs, increase data fidelity, and

    reduce design cycle time in the early development phases of new airplanes,17 Formula 1

    cars,18 treatment and diagnosis of cardiovascular diseases19,20 and off-shore oil rigs.21

    Similar validation benchmarks were developed for simulations in the nuclear industry, to

    improve the reliability in nuclear reactor safety, underground nuclear waste storage, and

    nuclear weapon safety.22 For molecular simulation to play a similarly useful role in

    molecular engineering design,23 benchmarks and validation sets must be established. In

    this paper we provide tools for improved standardization of molecular simulations

    through the first version of a benchmark set for free energy calculations.

    There are a large number of free energy methods available,23-30 which by early 2011 have

    been cited collectively over 4600 times. As of the time of writing, 20% of those citations

    occurred within the last 18 months.1 However, the simulation field lacks consensus in

    choosing a method most appropriate for a given molecular design situation. At least

    three fundamental issues contribute to this confusion.

    1 These numbers were generated by adding the primary citations for each of the listed

    methods (23-30) and searches for Thermodynamic Integration and Free Energy

    Perturbation in ISI Web of Science, as the original papers for these methods are older

    than the ISI database.

  • 3

    First, there is a lack of standard test cases for rigorous comparisons between different free

    energy methods. Studies of new methods frequently use relatively trivial model systems,

    such as a one- or two-dimensional analytically solvable potential energy function, or a

    small Lenard-Jones sphere in water. Such test cases may not be representative of the

    problems encountered in actual molecular changes. Alternatively, papers describing new

    methods may use extremely complicated systems, such as protein-ligand binding systems

    that are hard to converge and therefore make it difficult to accurately gauge true gains in

    efficiency. Both of these scenarios yield little knowledge about whether a given method

    will be useful in answering actual molecular design questions. Second, computing

    thermo-physical properties by molecular simulation involves stochastic sampling of

    molecular configurations, and all comparisons must deal with the fact that repeated

    independent measurements will have associated statistical error; unlike in quantum

    mechanics, it will never be possible to match calculation to fourteen decimal places.24

    Finally, direct comparisons between methods can be difficult because of the differences

    between simulation code bases. Free energy calculation capabilities are recent additions

    to most large-scale molecular simulation codes, and therefore most codes support usually

    only a small subset of available free energy methods.

    As a first step towards helping solve these problems, we propose the first version of a

    molecular test set comprising realistic systems undergoing challenging molecular

    transformations. We then use this test set to test the reliability of different methods for

    estimating free energy differences. Although the molecular design applications listed in

    the introduction seem very different, there are features common to all free energy

  • 4

    calculations performed for these applications. All involve determining the preference of

    a molecule to partition between two environments, which can be calculated by way of a

    difference in the free energies of molecular transformation between these two

    environments. For example, we might wish to design a solute preferentially solvated by

    a protein when compared to solvent (pure water) or a different complex medium (another

    protein), as in the case of drug design. Alternately we might design a solvent which

    preferentially solvates a given solute in a mixture; such as, designing ionic liquids25 for

    sequestering CO2. These molecular transformations primarily involve either growing or

    deleting atoms, changing the size or dispersion interaction between atoms, or altering the

    partial charge on mutation sites. Any benchmark test set must include examples of these

    transformations which are simultaneously challenging enough to push new methods and

    yet possible to evaluate with sufficiently high precision to give meaningful comparisons

    between methods in a reasonable amount computer time.

    The most important features of a property estimation method to understand are the

    statistical errors inherent in the method, both bias and statistical uncertainty, and the

    reliability of the methods estimate of the property. Without such data, we cannot trust

    our calculations or compare two different calculations for validation purposes. Studies

    commissioned by U.S. science funding agencies on future directions for simulation based

    engineering and science have emphasized the fundamental need for improved uncertainty

    verification and validation.26,27 Almost all estimators of statistical quantities, like free

    energy and ensemble average calculation methods, have some bias, a systematic

    deviation from the true answer that would be obtained with perfect sampling.

  • 5

    Additionally, computing a given observable from independently selected samples gives

    different estimates of the observable; this variation is the statistical uncertainty of the

    estimate. Most free energy methods include estimates of this statistical uncertainty; but of

    course, these uncertainty estimates are themselves statistical quantities and have variation

    from sample to sample, and must be validated.

    A few valuable studies have compared28,29 free energy methods, but not necessarily in a

    systematic way. We use our proposed benchmark set to directly compare the estimated

    uncertainties with the sample uncertainty. We compute the change in the mean square

    error as a function of the number of intermediate states and number of samples to capture

    both bias and uncertainty. We test whether the distribution of free energy estimators is

    indeed Gaussian, a condition usually assumed when using statistical uncertainty estimates

    to calculate error. Finally, we evaluate the bootstrap method30 as a method for computing

    statistical uncertainty, as this method can be easily implemented for all of free energy

    algorithms described in this thesis, and indeed generally for most statistical estimates of

    observables.

    Bias and uncertainty associated with the free energy algorithm are not the only sources of

    errors in free energy estimates. Simulation parameters associated with the calculations of

    non-bonded interactions, like electrostatic potential, also contribute to the errors in free

    energy estimates. These parameters specify the extent of approximations allowed during

    force and potential calculations. Molecular dynamics simulations require accurate force

    calculations at each step. Similarly, Monte Carlo simulations require accurate potentials

  • 6

    at each step for generating samples. The selection criterions for these simulation

    parameters are not well defined and there still exists a challenge as to how these can be

    tuned efficiently for an entirely new system. The most common technique is to tune the

    parameters such that for a given accuracy in force calculations the selection results in

    minimum computational cost31-33. We have introduced a new way of looking into this

    problem. We explore the electrostatic potential parameter space using our benchmark test

    set. We search for the set of parameters which results in the least computational cost for a

    given accuracy in free energy estimates of systems in benchmark set, estimates of molar

    enthalpy of vaporization of water and estimates of enthalpy of transformation of each

    system in benchmark set.

    In this paper, we first explain the molecular test set and the rationale behind the

    molecular choices. Next, we use this set to test and compare the accuracy, precision and

    reliability of ten free energy methods. We then present a summary of the method

    comparison, with much of the data presented as supplementary material because of its

    length, and present our recommendations for methods for performing free energy

    calculations. We then get into a discussion about the parameter space search and explain

    our methodology and protocols to explore it and finally present our recommendations for

    electrostatic potential parameters for performing molecular simulations.

  • 7

    2. Molecular systems of benchmark set

    The systems in this benchmark set are designed to represent alchemical changes, or

    changes of molecular identity, common to most molecular design applications.

    Alchemical transformations frequently require the deletion or introduction of atoms and

    large changes in the partial charges. Changes in torsional, angle, or dihedral parameters

    usually result in smaller changes in phase space, as do small changes in dispersion

    strength atomic radius, or charge. We therefore focus on atomic introduction/deletion and

    large changes in partial charge.

    2.1 Minimal test system: OPLS UA methane in TIP3P water (MS): The solvation of a

    Lennard-Jones (LJ) sphere, representing methane, is perhaps the simplest molecular free

    energy system that can be truly defined as molecularly realistic. There are no bonds,

    angles, or torsions terms, nor are there solute/solvent charge terms. This system

    represents a minimal test of whether the free energy method is at all valid or applicable

    for molecular systems. We examine the transformation of coupling the sphere into water,

    which corresponds to the solvation free energy of this molecule Figure 1 (a).

    2.2 Charge mutation system: dipole inversion with OPLS UA ethane molecule in

    TIP3P water (DI): We chose two LJ methyl spheres, tethered together, with +1/-1

    charges and the bond length of a C-C bond as illustrated in Figure 1(b). This setup avoids

    computing free energies of ions directly, as changing the total charge of a system with

    periodic boundary conditions is not always handled completely correctly in many codes.35

  • 8

    This test measures whether the method can handle large water rearrangements around

    charges. The system is a null transform; the free energy change is zero as the final state is

    identical to the initial state by symmetry.

    2.3 Large molecule mutation system: absolute hydration free energy of UA

    anthracene in TIP3P water: In our third test, we compute the solvation of anthracene

    via the decoupling of the intermolecular interactions from water. This system tests

    whether the method can handle introduction or deletion of multiple atomic sites

    efficiently. Importantly, there are no internal ligand degrees of freedom to complicate the

    analysis. Force field parameters are taken from Pitera & Van Gunsteren.36 Originally, we

    chose a null transformation of anthracene to anthracene via a benzene intermediate, but

    the simpler solvation problem was eventually chosen because of the difficulty of

    supporting such transformations in other codes and the difficulty of interpreting the

    statistics of multiple transformations; a key requirement of the benchmark set is to be

    simple to use. For this test set, we have used a single topology scheme, slowly turning

    off the interactions between the solute and the solvent but keeping the intramolecular

    interactions in the solute turned on as shown in Figure 1 (c) and (d) . The free energy

    change of this transformation corresponds to the desolvation free energy of anthracene,

    but we report the solvation free energy for ease of interpretation.

  • 9

    Figure 1. (a) Methane solvation (b) Dipole inversion (c) In the coupled state or solvated

    state both intermolecular and intramolecular interactions for anthracene are turned on. (d)

    In the decoupled state or vacuum state the intermolecular interactions with water

    molecules are turned off.

  • 10

    3. Comparing free energy methods based on statistical tests, using benchmark set.

    We use the benchmark set to test a total of ten free energy methods. In the following

    explanations, we assume that the simulations are performed in the isothermal-isobaric

    ensemble, and thus the Gibbs free energy G is the quantity of interest; the Helmholtz

    free energy can also be computed if the simulations are performed in the canonical

    ensemble. U indicates the generalized potential energy, which in the case of the isobaric-

    isothermal ensemble is actually U+PV, is 1/kBT, and is a coupling parameter

    connecting the initial and final states in an arbitrary manner. Brackets indicate an

    ensemble average over the appropriate ensemble.

    3.1 Free energy methods and error propagation

    3.1.1 Thermodynamic integration24 using a trapezoid rule (TI) and a cubic spline37

    (TI3):

    For TI, we compute the ensemble average of the derivative of potential energy function

    with respect to a coupling parameter for a system i.e. ( U()/ )i at all values and

    the corresponding variances 2i of the ( U()/ )i distributions.

    222 xxi , where x =

    iU )/)(( (1)

    The ( U()/ )i values at different intermediates are interpolated and then integrated

    to get an overall free energy change.

    1

    0

    10 )/)(()0()1(

    UdGGG (2)

  • 11

    For TI we have used a linear interpolation, which leads to the standard trapezoid rule to

    integrate the total free energy. For TI3, the ( U()/ )i vs i curve is fit piecewise to a

    natural cubic spline and then integrated analytically using the coefficients of the cubic

    equation (see Appendix 1 for the derivation). Both the trapezoidal and the cubic spline

    integration can be expressed in the form of weighted sum of individual ( U()/ )i

    K

    i

    i iUWG

    1

    10 )/)(( (3)

    Here the Wis are the respective weights corresponding to each state and K is the total

    number of intermediate states. The variance of this estimate of free energy can be

    calculated by the following variance propagation formula:

    K

    i

    iiW1

    222

    10 (4)

    Occasionally in the literature, the variance of the free energy over each interval i to i+1 is

    computed individually, and then propagated into the total variance by the sum of squares.

    This is incorrect, since the variance of each interval is then correlated to the variance of

    its neighbors. For example, the free energy difference between state 1 and state 2 and

    between state 2 and state 3 both contain statistical information from state 2. A number

    of alternative thermodynamic integration schemes have been proposed.37,38 However,

    such schemes require some knowledge of the magnitude of the statistical uncertainty for

    optimality. Other schemes use nonlinear fits to two different functional forms separately

    describing Lennard-Jones and Coulomb contributions to the free energy.39 Such schemes

    are not particularly flexible and introduce integration bias that is difficult to quantify. By

    using cubic splines, we can obtain a higher order formula independent of functional form

  • 12

    of dU/d, while propagating error using the same formalism as is used in standard TI

    (Eq. 4).

    3.1.2 Exponential averaging (EXP) in two forms: deletion (DEXP) and insertion

    (IEXP)28,40:

    In exponential averaging schemes, the free energy change Gij is calculated using the

    exponential average of the difference of the potential energies Uij between two states i

    and j over one of the ensembles. The free energy difference as a function of potential

    energy difference Uij and N samples is then

    N

    n

    ijnij xUN

    G1

    ])(exp[1

    ln1

    (5)

    This averaging is performed using samples from state i to compute potential energy

    differences Uij from state i to state j. The free energy of the reverse process can be

    computed using samples from state j and computing potential energy differences to state

    i. Since the labels themselves are arbitrary, to remove ambiguity in the direction we will

    describe such computations as being either deletion or insertion. We will call Uij

    taken in the direction of decreasing entropy as an insertion step, and Uij taken in the

    direction of increasing entropy as a deletion step, as inspired by Wu et al.41 Hence the

    free energy method using a Uij which steps in the direction of increasing entropy in Eq.

    5 is labeled as deletion exponential averaging (DEXP), and the free energy method which

    uses Uij which steps in the direction of decreasing entropy in Eq. 5 is labeled as

    insertion exponential averaging (IEXP). In both cases, the variance 2ij between two

    adjacent intermediate states can be estimated using standard point estimation theory as:

  • 13

    ])(exp[,1

    2

    2

    ijnx

    ij xUxxN

    (6)

    In both the exponential averaging methods the overall free energy change G10 is

    the sum of intermediate free energy changes Gij, and so the variance 210 is simply the

    sum of the associated variances 2ij. In some cases the changes from the i state to the i-1

    state and i+1 states might both be deletion or insertion cases; in this case, all the sampling

    performed at i, and the two estimates of the free energy difference will not be statistically

    independent. Complicated molecular changes will frequently involve both addition and

    subtractions of phase space, and thus will fall somewhere in between these two general

    schemes.

    When we calculate the overall free energy change using a method which has inherent

    directionality, like exponential averaging, then given our definitions, we need to make it

    sure that the direction of entropy change remains intact throughout the process in order to

    interpret the whole transformation as a deletion or an insertion process. Methane

    solvation and anthracene solvation involve moving molecules from vapor to liquid phase,

    resulting in a decrease in total entropy. However, in the dipole inversion case, we have a

    symmetric transformation, and thus deletion and insertion happen within a single process.

    In dipole inversion, going from a very large magnitude dipole to a small uncharged

    intermediate, we have an increase of entropy as the water around the particle becomes

    less structured, and thus we use the term deletion. From the intermediate uncharged

    intermediate state to the reversed -/+ dipolar state (during the second half of the

    inversion), we have the reverse process, and we use the term insertion consistent with the

  • 14

    entropy direction definition. To use the terminology IEXP, GINS, DEXP, and GDEL

    pathways, we therefore need to combine mixed halves of what would typically be called

    the forward and reverse pathways as illustrated in Figure 2.

    Figure 2. Free energy differences of transitions in the direction of increasing and

    decreasing entropy should be added separately to get the overall free energy for a dipole

    inversion.

    Although these particular sums are nonzero, they provide a consistent definition of the

    statistical variance of insertion and deletion. The statistical variance for symmetric

    transformations will simply be the average of the variance for the deletion and insertion

    processes.

  • 15

    3.1.3 Bennett acceptance ratio (BAR)42:

    The Bennett acceptance ratio uses samples of the potential energy both from i to j and j to

    i to obtain a provably minimum variance estimate of the free energy difference.

    Calculation of free energy change between any two intermediate states through BAR

    requires self-consistent solution of the two equations:

    i

    j

    N

    li

    l

    N

    kj

    kij

    N

    NC

    CU

    CUG

    i

    j

    ln1

    )](exp[1

    1

    )](exp[1

    1

    ln1

    1

    1

    (7)

    i

    j

    ijN

    NGC ln

    1

    (8)

    The first equation is true for any constant C, but when Eqs. (7) and (8) are solved self-

    consistently, then Gij will have minimized variance. There exist a large number of ways

    to solve the equations self-consistently which is beyond the scope of this paper. The

    variance 2ij in free energy change can be estimated from:

    1

    )(

    )(11

    )(

    )(12

    2

    22

    2

    2

    2

    j

    j

    ji

    i

    i

    ijxf

    xf

    Nxf

    xf

    N (9)

    Where f(x) is the Fermi function 1/(1+x) and x = (U-C). The total free energy change

    is the sum over changes between consecutive intermediate states. Typically, the variance

    in the full free energy is computed by assuming independent error and summing the

    variance for consecutive intermediate states. However, the assumption that the errors add

    independently is not correct, since the free energy difference from i-1 to i and from i to

    i+1 states both depend on the potential energy at i, so their variances are not independent.

  • 16

    3.1.4 Unoptimized Bennett acceptance ratio (UBAR):

    Eq. 9 in Section 3.4 is valid for any initial estimate of the free energy, though choices of

    C not given by the implicit equation (Eq. 9) will not have minimum variance. If we make

    the choice of C=-1ln(Nj/Ni), we no longer need to self consistently solve equations,

    which can avoid saving, reading, and reprocessing all of the data, potentially saving

    significant resources, at a cost of statistical efficiency. We instead accumulate the

    averages in Eqs. 7 through 9 as the simulation progresses. If each intermediate free

    energy is relatively near zero, the free energy estimate will be close to optimal. Such an

    estimator is directly equivalent to the minimum variance version of transition state Monte

    Carlo, where Barker acceptance probability43 is used.44

    3.1.5 Range-based Bennett acceptance ratio (RBAR):

    If we keep track of the ensemble averages in Eqs. 11 and 12 for a range of trial values of

    C, we will obtain a number of estimates of the best estimate free energy.44 Of these

    estimates, the one that is closest to the input C will be the least biased and will have

    minimum variance. By choosing this particular estimate, we are essentially pre-

    calculating the self-consistent solution. To apply this method, a range of starting values

    of C is chosen. This trial C is fed as an initial guess and C is calculated using a single

    iteration of Eq. 11, with corresponding G and then calculated. Accumulated averages

    are maintained for each choice of G. A decent estimate of the range of C and G is

    therefore a requirement for using this method. In some cases, it may end up being more

    costly than BAR, as accumulated averages must be maintained for a certain number of

  • 17

    trial free energy values, instead of simply performing 5-10 self-consistent iterations.

    However, the advantage of what we will call in this paper RBAR (range-based Bennett

    acceptance ratio) is that data from each simulation step does not need to be retained for

    post-processing as is required with BAR, and only the accumulated averages need to be

    maintained.

    3.1.6 Multistate Bennett acceptance ratio (MBAR)45:

    MBAR is a method to find the free energies of K states simultaneously by minimizing the

    KK matrix of variances of the free energy differences of these K states simultaneously.

    The derivation of MBAR is a straightforward if mathematically difficult extension of the

    derivation BAR to more than two states considered simultaneously. This can be a

    significant improvement over BAR, which minimize variances of the free energy

    differences for two states at a time. For MBAR, the equation:

    K

    k

    N

    nknkk

    K

    kk

    knii

    k

    xUGN

    xUG

    1 1

    1

    )](exp[

    )](exp[ln

    1

    ''

    '

    '

    (11)

    is solved self-consistently for each Gi. Gij = G(j) - G(i) gives the free energy change

    between two states i and j. The statistical variance of Gij, ij2 , is calculated using Eqs. 8

    and 12 in the paper by Shirts and Chodera.45

    3.1.7 Gaussian estimate of exponential averaging in two forms: deletion (GDEL) and

    insertion (GINS)29:

  • 18

    If 2U = U2-U2 is finite, and we approximate the Uij distribution as a Gaussian,

    the free energy can be expressed as a sum of moments of the energy differences46 by:

    2

    2 ijUijij

    UG

    (12)

    The variance over N samples of this free energy difference is given by:

    )1(2

    422

    2

    NN

    ijij UU

    ij

    (13)

    If the distribution Uij is close to Gaussian, then this estimation method can minimize the

    statistical effect of rare events, resulting in a more efficient and substantially simpler

    estimate method. To remove ambiguities with respect to direction of the process, we use

    the same convention of deletion and insertion as described for exponential averaging. In

    Eq. 12, when Uij is in the direction of increasing entropy, we refer to this as the

    Gaussian estimate with deletion or GDEL, and if we use Uij in the direction decreasing

    entropy, we refer to this estimate as the Gaussian estimate with insertion or GINS.

    Summing the free energy changes between intermediates again gives the total free energy

    changes. Total variance is calculated assuming independent sampling at each state, which

    is not an approximation here, as each calculation depends on samples from only one state.

    The total free energy is calculated by summing over the free energy changes between

    neighboring states.

    3.2 Building molecular systems, setting up simulation framework and designing

    statistical experiments.

  • 19

    (A) System preparation and simulation parameters.

    Topologies for UA methane, dipole inversion, anthracene solvation test systems were

    created by a combination of automated tools (Dundee PRODRG47, OpenEye libraries,48

    ACPYPE49) and manual editing, and are available on the Alchemistry.org website,

    http://www.alchemistry.org. Starting configurations were generated using GROMACS

    4.0.4. The automated topologies were solvated using GROMACS genbox, and these

    solvated systems were minimized with the L-BFGS50 (Low memory-BroydenFletcher

    GoldfarbShanno) minimization method, followed by steepest descent minimization. All

    systems were then equilibrated at constant volume at 300K for 100 ps, using Langevin

    dynamics with a time step of 0.002 fs.51, 52 All hydrogen-containing bonds were

    constrained using the SHAKE algorithm to a relative tolerance of 10-12. The systems were

    then equilibrated at constant pressure at 1 atm using a Parrinello-Rahman barostat43 and a

    Nose-Hoover thermostat44 for 100 ps. A coupling time constant of 5 ps was used for both

    thermostat and barostat. A switching function was used for both PME and van der Waals

    potentials. The PME switch started at 0.88 nm with a coulomb cutoff distance of 0.9 nm

    for electrostatics. Other PME parameters were: Fourier spacing of 0.12 nm, 4th order B-

    spline interpolation and a Ewald tolerance of 10-8. A van der Waals switch at 0.8 nm and

    cutoff distance of 0.9 nm was used. A long range van der Waals dispersion correction

    was used for both energy and pressure.

    3.2.1 values and spacing between intermediate states for free energy calculations:

    Initial simulations (5 ns long, including 0.5 ns equilibration) were performed with 21

    equally spaced values to guide the selection of the values for the main study.

  • 20

    Intermediate states were chosen so that each window contributed almost equally to the

    total error, specifically insuring that the maximum variance among all windows was no

    larger than the twice of the variance among all windows. In order to examine the

    change in bias, statistical error, and mean square error as a function of the spacing

    between coupling parameter values, we choose two sets of states for each model: a

    full set, and a sparse set.

    3.2.1.1 Full set:

    For the full intermediate set, states were chosen such that the total uncertainty in MBAR

    was no greater than 0.1kJ/mol for methane solvation and no greater than 0.22 kJ/mol for

    dipole inversion and anthracene solvation over the 5 ns initial simulation. For UA

    methane solvation (MS), 8 intermediates were selected: = [0.0, 0.2, 0.4, 0.5, 0.6, 0.7,

    0.8, 1.0] where = 0 denotes fully interacting UA methane in water and = 1 denotes

    ghost UA methane, where there are no interactions with the solvent. For consistency of

    sign between test cases, the reported results are in terms of the reverse process, the

    solvation of methane. For dipole inversion (DI), we include 11 intermediate states: =

    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. Here = 0 denotes a starting +/-

    configuration of the dipole; = 1 denotes the reversed configuration of dipole i.e. -/+,

    with = 0.5 a state with zero partial charges. For anthracene solvation (AS), the full set

    has 15 total states, with = [0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85,

    0.9, 1.0]. = 0 denotes fully interacting united atom anthracene in water; = 1 denotes

    the vacuum state anthracene with no interactions with water. Again this corresponds to a

    desolvation process, and in tables and charts, we report the free energy of hydration,

  • 21

    which is simply the reverse process and so includes a sign reversal. For RBAR, this

    spacing means that the largest free energy between intervals is approximately 35 kJ/mol,

    meaning we must use range of -40 to 40 kJ/mol, with increments of 1 kJ/mol.

    3.2.1.2 Sparse set:

    For the sparse sets, states were chosen such that the total uncertainty in MBAR was no

    greater than 0.25 kJ/mol for MS and 0.4kJ/mol for DI and AS over the 5 ns initial

    simulation with no fewer than 3 total states. For methane solvation, there are only three

    states [0, 0.5, 1]. For dipole inversion, the sparse set was generated by picking every

    alternate along with = 0.5 (which represents zero net charge) [0, 0.2, 0.4, 0.5, 0.6, 0.8,

    1.0], reducing the number of states from 11 to 7. For the AS test set every third was

    chosen to create the sparse set [0.0, 0.2, 0.5, 0.7, 0.85, 1.0], reducing the number of

    states from 15 to 6. Note that we did not need to run the separate simulations for the

    sparse states; we merely select for analysis only a subset of the states from the full set.

    For RBAR, this spacing means that the largest free energy between intervals is

    approximately 66 kJ/mol, meaning we must use a range of -70 to 70 kJ/mol, with

    increments of 1 kJ/mol. In order to also test the bias with respect to number of states, for

    each model we conduct a set of 5 ns simulations at 51 equally spaced states, with a

    spacing 0.02 between two neighboring states.

  • 22

    3.2.2 Generating an ensemble of uncorrelated configurations

    The most important use of the benchmark test set in this paper is to compare estimates of

    the statistical error of different estimators of the free energy with direct sample error

    obtained by repeating the experiment N times. For this purpose, we started by generating

    100 uncorrelated starting configurations. Using the =0 state from the 5 ns test runs, we

    used the GROMACS program g_analyze to compute the autocorrelation time of the

    potential energy, kinetic energy, total energy, coulomb interactions, and dUpot/d for all

    three systems. We used block averaging53 using the GROMACS g_analyze program to

    compute the autocorrelation times. The autocorrelation time of potential energy was

    chosen since it was the longest correlation time of all the observables examined. The

    autocorrelation times of potential energy for UA methane solvation, dipole inversion,

    anthracene solvation were 25 ps, 30 ps, and 25 ps respectively.

    We selected configurations separated by 2 ns as our uncorrelated starting points. 2 ns is

    more than 50 times longer than the 30 ps autocorrelation time of the potential energy.

    Each of the three test systems were run for 20 ns and configurations were printed out

    after every 2 ns. From each of these 10 parent configurations, 20 ns simulation were run

    with different random seeds to generate a Maxwell-Boltzmann velocity distribution, and

    configurations were printed out every 2 ns, giving a total of 100 uncorrelated starting

    configurations for each system. These configurations were generated using a beta version

    of GROMACS 4.5 also used for subsequent free energy configurations. Velocity Verlet

    (option md-vv) was used as the integrator algorithm. A Nose-Hoover thermostat and the

    MTTK54 barostat was used to control temperature and pressure, respectively.

  • 23

    The starting co-ordinate files for each intermediate state simulation, other than the first

    state, were generated by running short 10 ps equilibration runs in series. The first state at

    =0 used one of the 100 uncorrelated starting configurations as the starting co-ordinate

    file. The starting configuration for any other state used the final configuration of the

    previous equilibrated state. After this initial equilibration round, 5 ns of NPT simulation

    were then performed for each initial configuration and each separate state. Data

    corresponding to first 500 ps was discarded as equilibration. The remaining 4.5 ns of

    equilibrium data for each model at each intermediate state were used for all subsequent

    calculations.

    (B) Statistical tests.

    3.2.3 Quantifying accuracy and precision in uncertainty estimate of an estimator.

    We estimate uncertainty for each free energy estimator in three ways. First, we compute

    the sample standard deviation G2 - G2 from Gs computed from the series of

    100 uncorrelated simulation runs described above. Second, we compute the analytical

    estimates of error corresponding to each of the methods. Finally, we use the bootstrap

    estimator for the standard deviation30.

    3.2.3.1 Sample standard deviation: To compute the sample standard deviation, we take

    the simulations started from 100 initial configurations and compute free energy

    differences from each simulation to obtain a distribution of free energy differences. We

  • 24

    then directly compute the sample standard deviation corresponding to each individual

    estimator from:

    1

    )(

    )( 1

    2

    N

    GG

    G

    N

    i

    i

    (14)

    Where N = 100, and where G is the mean over the 100 values of Gis. Crucially, the

    standard deviation computed from a finite sized sample is itself a statistical quantity, and

    must therefore have an associated uncertainty. Rigorously, in order to compute the

    sample standard deviation of the uncertainty ((G)), we would need to repeat our 100

    simulation experiment 100 times. Instead, we have used the bootstrap method (described

    in the bootstrap estimate section) to estimate ((G)). From this exercise we finally get

    G and (G) ((G)). G here is not an ensemble average but the average

    over 100 repetitions.

    3.2.3.2 Analytical estimate: Each free energy estimator has an associated uncertainty

    estimator as discussed in previous sections, namely the square root of the estimated

    variance of the total free energy. From 100 uncorrelated starting configurations, we will

    obtain not only 100 Gs, but 100 error estimates from the methods analytical

    uncertainty estimate. We denote the average and standard deviation of these estimated

    uncertainties over all 100 independent runs as (G) ((G)), and call these the

    analytical uncertainty estimate and the standard deviation of the analytical estimate

    distribution.

  • 25

    3.2.3.3 Bootstrap estimate: For each of the 100 independent free energy calculations, we

    also calculate a bootstrap error estimate, which is a rigorous and robust statistical

    estimation technique.30 The bootstrap error is constructed as follows: From each set of

    potential energy differences or dU/d values, we generate N bootstrap sets from the

    original sample of molecular simulation data. To generate a bootstrap set, we first

    subsample the data using an estimate of the autocorrelation time to obtain N statistically

    uncorrelated values. For each bootstrap set, we draw N samples with replacement from a

    set of uncorrelated measurements. For example, if our set was the integers {3,6,8,9},

    then a bootstrap set would consist in randomly selecting each of the four numbers for

    times; {3,3,8,9}, {9, 6, 6, 3}, and {8, 8, 8, 8} would all be valid sets, though clearly the

    last one would be the rarest. We repeat this sub-sampling process many times; in this

    case, we draw 200 bootstrap sets. For each of these 200 bootstrap sets, we compute the

    free energy and uncertainties using the estimators as if they were the original data set.

    This gives us 200 Gs, one for each of the 200 bootstrapped sets; standard rules of thumb

    suggest using 50-200 bootstrapped sets to get robust estimates of uncertainty.30 The

    average of these 200 bootstrapped Gbss gives Gbs. The bootstrap estimate of the

    error, (G)bs is the sample standard deviation of the 200 Gbs values for each of the

    initial configurations. The average (G)bs over all 100 initial configurations is the

    bootstrap error estimate. The statistical uncertainty of this bootstrap uncertainty estimate

    is estimated by computing the sample standard deviation over the 100 (G)bs, and is

    denoted by ((G)bs).

  • 26

    A good free energy estimator should have an analytical uncertainty estimate that is

    consistent with the sample standard deviation. If the analytical uncertainty estimate is not

    accurate, then we have no way of knowing how accurate our free energy estimate is when

    we run only a single calculation. This can be very problematic; for example, the analytic

    estimate of the uncertainty of EXP diverges from the true estimate well before the error

    in EXP itself.28 The smaller the difference between the direct sample standard deviation

    error estimate (G) and the predicted or analytical error estimate (G), the better

    we know the methods ability to predict error without having to run multiple trials. If the

    statistics are well-behaved, the bootstrap estimate of the statistical uncertainty should also

    agree with the sample estimate of the uncertainty. If we can show that the bootstrap

    estimate agrees with the sample estimate of the uncertainty, then bootstrap error can

    substitute for sample uncertainty estimates even when the analytical estimate fails.

    3.2.4 Quantifying bias of free energy estimates

    The average estimate from a statistically biased estimator, even if repeated an infinite

    number of times, will still deviate from the true estimate by the amount of the bias.

    There are typically two types of bias in the free energy estimators considered here.

    Asymptotically unbiased estimators have bias with finite number of samples but

    converge to a true answer in the limit of large numbers of samples. An example is the

    nave estimator of the variance in the average of a set of numbers, which is var(a) = 1/N

    [(A Ai)2]. This estimate can be shown to always be slightly too large. If N is

    replaced by N-1, however, the estimator becomes unbiased. In the limit of very large N,

  • 27

    the difference is meaningless. Thus the nave estimator of the variance is not unbiased,

    but it is asymptotically unbiased. Thermodynamic integration does not have asymptotic

    bias, because at each intermediate state, the simple average of dU/d is unbiased for any

    numbers of samples. Exponential averaging, BAR, and MBAR are only asymptotically

    unbiased, though the bias of exponential averaging is usually significantly higher than

    that of BAR and MBAR;28 careful design of the pathway can minimize this large bias to

    some extent.55 However, unlike the simple case of the estimator for the variance of an

    average, there exist no known unbiased versions of these estimators.

    Bias also occurs due to using a limited number of intermediate states, because of lack of

    either phase space overlap between intermediates in exponential averaging, as occurs in

    acceptance ratio methods and exponential averaging, or because of numerical integration

    error occurring in thermodynamic integration methods. Even for a large benchmark set

    like the present study, computational expense and storage limits make it very difficult to

    approach the number of sample limit and the number of intermediate states limit

    simultaneously. Therefore, we estimate the two contributions to bias independently. For

    asymptotic bias, we compare the results from combining a fixed amount of data in either

    one large data set, or a series of shorter data sets. For bias as a function of number of

    intermediate states, we vary the number of intermediates given fixed length simulations

    to investigate bias as a function of number of intermediate states.

    3.2.4.1 Bias due to number of samples: Data from all 100 5ns runs is stitched into a

    single trajectory to mimic a single long trajectory. The data corresponding to

  • 28

    equilibrations was not included while estimating free energies, and since each simulation

    contained 4.5 ns, stitching results in 450 ns total simulation data. The difference between

    free energy estimates computed with 450 ns of data (G)450 and the same data used as an

    average of 100 4.5 ns trajectories G is the bias due to number of samples.

    3.2.4.2 Bias due to number of intermediate states: We also ran a set of simulations

    with 51 states for each model as discussed above. We can estimate the bias due to the

    number of intermediate states as the difference between free energy estimates computed

    using 51 states (G)51, G estimated using the full states, and G estimated using the

    sparse states. Besides having low and consistent error estimates, ideal free energy

    estimation methods would show little or no bias in these tests. In many cases, we are

    limited by the statistical uncertainty in determining the bias with high accuracy, since it is

    computationally too demanding to generate 500 ns of simulation for all 51 states for all

    test sets. In these cases, we can only determine if the bias is statistically insignificant

    with respect to the statistical uncertainty. In the case of bias with respect to number of

    samples, in most cases, asymptotic bias scales with 1/N while statistical uncertainty

    scales with 1/N1/2, so statistical variance is usually dominant.

    3.2.5 Quantifying reliability of a free energy estimator

    Root mean square error (RMSE) is the square root of the sum of squared differences

    between the sample and the true answer. Alternatively, we can write it as the sum of the

    variance estimate (2) and the square of the bias. Mean square error is therefore the most

  • 29

    useful overall measure of the reliability of any statistically estimated observable.

    Calculating a true RMSE requires collecting an infinite number of samples at an infinite

    number of intermediate states to obtain the true answer. Since it is computationally

    impossible to reach this limit, and computationally expensive to approach it, we use the

    free energy estimate G450 from the 450 ns run to approximate the unbiased limit of free

    energies given a fixed set of intermediates, and the free energy estimate G51 from 51

    state run as the unbiased limit of the free energy estimate with respect to large numbers of

    intermediate states, given a fixed amount of sampling. From the two different biases

    generated from these two reference states, we obtain two different estimates of mean

    square error. Neither of these are true RMSEs because we lack the true reference answer.

    However, these estimates of the RMSEs capture the combined effect of the statistical

    error and the two different sources of bias. We define the two estimates of mean square

    errors as MSE1 and MSE2:

    MSE1i = i2 + Bias1i2, where Bias1i = (G)450 (Gest)i and 1 i 100 (15)

    MSE2i = i2 + Bias2i2, where Bias2i = (G)51 (Gest)i and 1 i 100 (16)

    RMSE1 and RMSE2 will be square roots of MSE1 and MSE2 respectively. The errors in

    RMSE1 and RMSE2, (RMSE1) and (RMSE2) respectively, are the standard deviations

    over the 100 RMSE1i and 100 RMSE2i. With the two quantities RMSE1 and RMSE2,

    we can examine qualitative information about the reliability of the methods using all the

    information from our experiments. We plot a bivariate Gaussian with variances equal to

    RMSE1 and RMSE2 on mutually perpendicular axes, as shown in Figure 3(a), with the

    analytical average of the free energy estimate G estimated by the method as the mean.

    An example bivariate Gaussian RMSE plots is shown in Figure 3(b). Figure 3(c) shows

  • 30

    the top view of this Gaussian. The overall spread of the rings in Figure 3(c) is a measure

    of overall reliability.

    Figure 3. (a) Gaussians are plotted on mutually perpendicular axes. Both have the G

    calculated using a method (here TI for methane solvation) as mean and RMSE1 and

    RMSE2 as their standard deviations. (b) These are fused to generate a bivariate Gaussian

    plot. (c) Top view of the bivariate Gaussian plot.

    A poor estimator has either large and or unequal spreads in horizontal and vertical

    directions, yielding a large circle or an ellipse. A good estimator has small and equal

    spread in both the horizontal and vertical direction. Larger spread in one direction

    indicates dominance of a single type of bias. In the above plot, the vertical axis

    corresponds to the Gaussian with variance equal to RMSE2. Vertical spread indicates that

    bias due to number of intermediate states dominates the uncertainty estimate, while

  • 31

    horizontal spread indicates bias due to number of samples dominates the uncertainty

    estimate.

    3.2.6 Validating the Gaussian distributions of the free energy differences.

    When we express uncertainty in the free energy estimate in terms of a single number, the

    variance or the statistical uncertainty, rather than as a distribution, we are implicitly

    assuming that the errors are well described by a normal distribution, whose spread is

    equal to the statistical variance. Most analytical variance methods use propagation

    estimates that are only rigorously true in the Gaussian limit. As long as the variances are

    bounded (not infinite), the central limit theorem ensures that with enough samples,

    variances will indeed converge to the Gaussian limit. However, for finite number of

    samples, this assumption must be tested, not simply assumed, or else we run the risk of

    underestimating the chance of large deviations (black swans) from the average value.

    To test whether the shape of the free energy distribution is Gaussian or not, histograms of

    free energy distributions are plotted against Gaussians using the free energy estimate as

    mean and analytical uncertainty estimate as standard deviation for each method.

    3.3 Results and discussion:

    Results are presented in Tables 1-5 as well as in Figures 4-13. Tables 1 and 2 contain

    results of the uncertainty analysis for full and sparse sets only for methane solvation.

    Tables containing full results for dipole inversion and anthracene solvation can be found

    in the supplementary material. Tables 3, 4 and 5 contain the free energy estimates

  • 32

    corresponding to 450 ns and 51 states runs, bias analysis, and reliability estimates of

    free energy and uncertainty predictions for UA methane solvation, for full and sparse

    sets. Figures 4-7 provide comparison of the accuracy and precision in free energy and

    uncertainty predictions for all ten methods for the three test cases. Tables of the

    uncertainty analysis for the dipole inversion and anthracene hydration free energy test

    cases are presented in the supplementary material as tables S1-S10. Bivariate Gaussian

    plots presenting the analysis of reliabilities of free energy methods are presented in

    Figures 8-10. Figures 11 and 12 compare the actual distribution of the free energies

    (computed using 100 5ns simulations with full and sparse sets) with Gaussians using

    the corresponding variance and mean estimates of the free energy methods, along with

    the Gaussians of 450 ns trajectory solution and the 51 state simulation set for

    comparison. In some plots, we omit GDEL and GINS, as their errors are significantly

    larger than the scale of errors of the other methods in the plots.

    3.3.1 Validation of uncertainty estimates. The first column in Table 1 is the free energy

    change calculated as an average over the 100 uncorrelated repetitions. The next three

    columns are the average of the analytical estimate of uncertainty over all repetitions

    (G), the sample standard deviation (G) and the average bootstrap estimate of the

    uncertainty over all repetitions (G)bs. The last column gives the percent deviation of

    the analytical estimate of uncertainty from the sample standard deviation along with the

    propagated error. Standard deviations {((G)),((G)), ((G)bs} for the error

    estimate distributions are also reported.

  • 33

    We find that the analytic error estimate of most methods underestimates the true sample

    standard deviation. In some of the methods this deviation is within either one or two

    standard deviations, and thus likely within the statistical noise. Examining the data in

    Tables 1, S2, and S3 for TI and TI3, analytical and sample estimates of uncertainty differ

    by one or less standard deviation for all but the anthracene solvation with the full set,

    where it differs by approximately two standard deviations, and is likely therefore noise.

    For IEXP and DEXP, analytical and sample estimates also differ by one or in some cases

    two standard deviations. However, analytical and sample uncertainty estimates for BAR,

    RBAR and UBAR have differences of up to five standard deviations which indicates that

    the BAR uncertainty estimates can be significantly inaccurate for multiple intermediate

    states. In contrast, analytical uncertainty estimates with MBAR have less than one

    standard deviation from sample uncertainty estimates. Over all methods, MBAR

    analytical error estimates deviate least from the sample estimate.

    The percentage deviation of analytical estimate from the sample uncertainty estimate is

    listed explicitly in column six of Table 1. A negative percent deviation indicates

    underestimation of uncertainty and a positive percent deviation indicates overestimation

    of uncertainty by the analytical uncertainty estimate. The percentage deviation of the

    analytical uncertainty estimate from sample uncertainty for TI is -78% and for MBAR

    is -108%, in both cases, within the noise, while for BAR, the deviation from the true

    uncertainty is -315%, which is clearly statistically significant. Large deviation of the

    analytical estimate from sample standard deviation indicates poor accuracy of estimator

    in estimating uncertainty. Similar deviations in analytical and sample estimates are seen

  • 34

    for UBAR and RBAR. As with BAR and its variants, analytical and sample estimates

    GDEL and GINS are off by 305%.

    Bootstrap uncertainty is clearly a robust alternative to the sample standard deviation for

    all methods. Bootstrap estimates of the uncertainty (in column five of Table 1) are very

    close to sample uncertainty (in column four of Table 1). Specifically in the case of BAR,

    RBAR, UBAR, GDEL and GINS bootstrap estimates of uncertainty are better than the

    analytical counterparts. In cases where the analytical estimate does not accurately predict

    the sample standard deviation, such as with BAR, RBAR, and UBAR, the bootstrap

    method provides a useful estimate of the error in the free energy without a need for

    performing repeated sampling.

    Figure 4 visually demonstrates the efficiency of free energy methods and the consistency

    of uncertainty estimators. Short bars indicate precise free energy estimates. Equal height

    bars indicate that the analytical and bootstrap uncertainties are consistent with the sample

    standard deviation. For the full set MBAR, TI and TI3 predict free energies with the

    highest precision and with the most reliable error estimate. BAR, RBAR and UBAR

    analytical estimates have large deviations from the standard deviation estimate but the

    bootstrap estimate closely matches sample standard uncertainty estimate. IEXP, DEXP

    have the largest deviations particularly for large transformations like dipole inversion and

    anthracene solvation.

  • 35

    Table 1. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for UA methane

    solvation using the full set. All quantities are in kJ/mol. G is not the ensemble

    average but the average over 100 repetitions.

    Method G

    (G)

    ((G))

    (G)

    ((G))

    (G)bs

    ((G)bs)

    % dev

    (% dev)

    TI 9.081 0.1070.002 0.1150.009 0.1060.006 -7.17.9

    TI3 9.008 0.1110.003 0.1190.012 0.1100.007 -6.99.5

    DEXP 8.984 0.3200.255 0.5560.122 0.3190.267 -42.547.5

    IEXP 8.928 0.1040.002 0.1100.011 0.1030.006 -5.89.6

    UBAR 8.936 0.0790.002 0.1060.010 0.0980.005 -25.37.5

    BAR 8.933 0.0750.001 0.1090.010 0.0990.006 -31.36.3

    RBAR 8.937 0.0750.001 0.1090.010 0.0990.006 -31.06.7

    MBAR 8.929 0.0950.002 0.1060.010 0.0940.005 -9.98.6

    GDEL 7.042 0.0930.002 0.1360.011 0.1210.008 -31.75.8

    GINS 1.097 0.2530.008 0.3990.032 0.4000.169 -36.55.5

  • 36

    Figure 4. Uncertainty estimates (sample standard deviation, analytical, bootstrap) are

    plotted for all the methods in three different test cases for full set. Consistent free

    energy methods have bars equal in height, and the most accurate methods have the

    shortest bars.

  • 37

    Table 2. Statistical uncertainty calculated using three different approaches (analytical

    (G), sample standard deviation (G), bootstrap (G)bs) for UA methane

    solvation using sparse set. All quantities are in kJ/mol.

    Method G

    (G)

    ((G))

    (G)

    ((G))

    (G)bs

    ((G)bs)

    % dev

    (% dev)

    TI 2.545 0.175 0.007 0.177 0.014 0.175 0.011 -1.5 8.7

    TI3 3.792 0.214 0.009 0.217 0.017 0.214 0.014 -1.5 8.7

    DEXP 5.631 1.343 0.524 3.179 0.679 1.628 1.186 -57.8 18.8

    IEXP 9.091 0.666 0.101 0.638 0.042 0.683 0.108 4.4 17.3

    UBAR 8.954 0.344 0.018 0.358 0.024 0.351 0.024 -3.9 8.2

    BAR 8.926 0.225 0.006 0.263 0.014 0.232 0.014 -14.4 4.9

    RBAR 8.927 0.226 0.005 0.260 0.016 0.233 0.014 -13.2 5.6

    MBAR 8.928 0.232 0.006 0.262 0.015 0.232 0.014 -11.4 5.4

    GDEL -3.833 0.112 0.004 0.189 0.014 0.200 0.016 -40.6 4.9

    GINS -1.68E+32 3E+30 34E+30 13E+32 10E+32 1E+32 15E+32 -99.7 2.7

  • 38

    Figure 5. Uncertainty estimates (sample standard deviation, analytical, bootstrap) are

    plotted for six methods in three different test cases for sparse set. Four methods

    (DEXP, IEXP, GINS and GDEL) are not graphed because they did not converge properly

    for any of the three uncertainty estimates.

  • 39

    For the sparse set, as shown in Tables 2, S2 and S4 and Figure 5, TI and TI3 still show

    the lowest percent deviation from sample standard deviation. However, the free energy

    estimate of methane solvation is off by 6.5 kJ/mol for TI and 5 kJ/mol for TI3. GDEL

    and GINS have clearly un-converged free energy and uncertainty estimates; the free

    energy estimate of methane solvation for GDEL is off by 12 kJ/mol and for GINS is off

    by ~1032 kJ/mol. This clearly indicates the failure of the Gaussian approximation of U.

    The free energy estimate of methane solvation for DEXP differs from the converged

    answer by 3 kJ/mol and its uncertainty estimate is 5 times larger than the largest

    estimated uncertainty shown in Figure 5. IEXP differs from the converged answer by

    only 0.2 kJ/mol but its uncertainty estimate is twice the largest plotted uncertainty.

    MBAR again has the lowest and most consistent uncertainty estimates. However, unlike

    with the full set, BAR and RBAR have uncertainty estimates which are as accurate as

    MBAR to within statistical noise.

    Bootstrap and analytical estimates of the error in the sparse set are slightly lower than

    the sample standard deviation for acceptance ratio methods for methane solvation, though

    not for the other two molecules. For this set of states, MBAR does not provide the

    same clear advantage over BAR in estimating the uncertainty as with the full set. We

    hypothesize that this advantage may only exist when the overlap between states is

    somewhat high. However, even our full set uses relatively aggressive spacing for

    typical free energy calculations. MBAR analytical estimates will generally have the

    advantage over BAR analytical estimates as with MBAR it is not necessary to attempt to

    determine whether the calculation is in the low overlap or high overlap regime.

  • 40

    3.3.2 Analysis of bias in free energy estimates: Statistical uncertainty is the most

    important measure to quantify to understand the reliability of the free energy estimate but

    understanding systematic bias is also important. By comparisons with free energy

    estimate using large number of samples we can test whether or not the method is

    asymptotically biased. By comparisons with the free energy estimate using large number

    of intermediate states we can find how sensitive a methods accuracy is to the overlap

    between intermediate states. Table 3 shows the free energy and uncertainty estimates

    predicted by different methods for UA methane solvation for large number of samples

    (450 ns trajectory) and large number of intermediate states (51 states). Tables 4 and 5

    include estimates of both types of bias in free energy estimates, with respect to number of

    samples and with respect to number of intermediate states, and the corresponding root

    mean square error estimates for UA methane solvation. For UA methane solvation,

    MBAR, BAR, RBAR, UBAR, TI, and TI3 have very low bias in free energy estimates

    with respect to number of samples. The acceptance ratio methods, MBAR, BAR, RBAR,

    and UBAR have biases with respect to number of states within the statistical noise. TI

    and TI3 show larger bias in free energy estimates with respect to number of states than

    the other methods. DEXP and IEXP show large biases in free energy estimates both with

    respect to number