reproducible research in molecular biophysics and structural biology

Reproducible research in molecular biophysics and structural bioinformatics

Konrad HINSEN

Centre de Biophysique Moléculaire, Orléans, France, and

Synchrotron SOLEIL, Saint Aubin, France

13 December 2012

Konrad HINSEN (CBM/SOLEIL), Reproducible research in molecular biophysics and structural bioinformatics, 13 December 2012

Reproducibility

One of the ideals of science

Scientific results should be verifiable.

Verification requires reproduction by other scientists.

Few results actually are reproduced, but it’s still important to make this possible:

important for the credibility of science in society (remember “Climategate”, cold fusion, ...)

important for the credibility of a specific study

The more detail you provide about what you did, the more your peers are willing to believe that you did what you claim to have done.

Reproducibility also matters for efficient collaboration inside a team.


Reproducibility in practice

Near-perfect in mathematics

No journal publishes a theorem unless the author provides a proof.

Often best-effort in experimental science

Lab notebooks record all the details for in-house replication.

Published protocols are less detailed, but often clear enough for an expert in the field.

The main limitation is technical: lab equipment and concrete samples cannot be reproduced identically.

Lousy in computational science

Papers give short method summaries and concentrate on results.

Few scientists can reproduce their own results after a few months.


Reproducibility in computational science

We could do better than experimentalists

Results are deterministic, fully determined by input data and algorithms.

If we published all input data and programs, anyone could reproduce the results exactly.

But we don’t

Lots of technical difficulties.

Significant additional effort.

Few incentives.

Goals of the Reproducible Research movement

Create more awareness of the problem.

Provide better tools.


Reproducible (computational) Research matters

“Climategate”

In 2009 a server at the Climatic Research Unit at the University of East Anglia was hacked and many of its files became public. One of them describes a scientist’s difficulties in reproducing his colleagues’ results. This has been used by climate change skeptics to discredit climate research.

Protein structure retractions

In 2006, three important protein structures published in Science were retracted following the discovery of a bug in the software used for data processing.

For more examples and details:

Z. Merali, “...Error ... why scientific programming does not compute”, Nature 467, 775 (2010)

Science Special Issue on Computational Biology, 13 April 2012


Today’s tools for Reproducible Research

Version control for source code and data

Mercurial, Git, Subversion, ...

Great for source code, usable for small data sets

Workflow management systems

VisTrails, Taverna, Kepler, ...

Preservation of computational procedures and provenance tracking, great for scientists who like graphical workflows

Electronic lab notebooks

Mathematica, IPython notebook, ...

Preservation of computational procedures and provenance tracking, great for scientists who like scripting

Publication tools for computations

Collage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ...

Experimental and/or domain-specific


Missing pieces (1/2)

Integration of version control and provenance tracking

Record for reproduction that dataset X was obtained from version 5 of dataset Y using version 2.2 of program Z compiled with gcc version 4.1.
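A minimal sketch of what such a record could look like as machine-readable data. The class name ProvenanceRecord and its field names are hypothetical, chosen only for illustration; they do not come from any existing provenance tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record: how one dataset was derived from another."""
    output_dataset: str
    input_dataset: str
    input_version: str
    program: str
    program_version: str
    compiler: str
    compiler_version: str


def to_json(record: ProvenanceRecord) -> str:
    # Sorted keys give a stable text form that version control can diff.
    return json.dumps(asdict(record), sort_keys=True, indent=2)


def from_json(text: str) -> ProvenanceRecord:
    return ProvenanceRecord(**json.loads(text))


# The example from the slide: X was obtained from version 5 of Y
# using version 2.2 of program Z compiled with gcc 4.1.
record = ProvenanceRecord(
    output_dataset="X",
    input_dataset="Y",
    input_version="5",
    program="Z",
    program_version="2.2",
    compiler="gcc",
    compiler_version="4.1",
)
print(to_json(record))
```

Storing the record as plain text next to dataset X would let it travel with the data, independently of the tool that produced it.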

Version control for big datasets

Version control tools are made for text-based formats.

Provenance tracking and workflows across machines

If you do part of your computations elsewhere (supercomputer, lab’s cluster, ...), existing tools won’t work for you.

Tool-independent standard file formats

If Alice wants to reproduce Bob’s results, she needs to use Bob’s tools for version control and workflows/notebooks.


Missing pieces (2/2)

Reproducible floating-point computations

Conflict between reproducibility and performance:

Reproducibility for IEEE float arithmetic requires an exact sequence of load, store, and arithmetic operations.

Performance optimization requires shuffling around operations depending on processor type, cache size, memory access speed, number of processors, etc.
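The root of the conflict is that IEEE floating-point addition is not associative, so reordering operations can change the result. A small sketch in Python, whose floats are IEEE doubles on all common platforms:

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, added in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(repr(left_to_right), repr(right_to_left))
print("identical:", left_to_right == right_to_left)

# math.fsum returns the correctly rounded sum of its inputs, so its
# result does not depend on the order of the terms. It is slower than
# naive summation, which illustrates the reproducibility/performance
# trade-off.
print(math.fsum(values) == math.fsum(reversed(values)))
```

A parallel reduction that splits the sum differently depending on the number of processors changes the grouping of terms in exactly this way.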


Reproducibility in biomolecular simulations (1/3)

Heavy resource requirements

Most important technique: Molecular Dynamics

Long runs (days to weeks) on big parallel machines

Produces large datasets (a few GB per trajectory)

Trajectory analysis can be as expensive as the simulation itself

Datasets are too big for version control.

Today’s provenance trackers can’t keep track of computations across machines.



Complexity

Computational models contain thousands of parameters

No complete formal description of the models (except for program source code)

Simulation algorithms contain many approximations made for performance reasons

These approximations are usually undocumented

Publishing a complete description of a molecular simulation is so difficult that very few scientists could do it and equally few could understand the resulting description.

There is no file format for storing a complete description of a molecular simulation in machine-readable form.



A split community

A small number of “developers” with backgrounds in physics and chemistry

A large number of “users” with backgrounds in biology and biochemistry

A negligibly small number of computing specialists

Users don’t care about the technical details of the simulations.

Users (have to) trust the developers blindly.

Developers understand the models but have no training in software development or data management.

The field is dominated by a few large monolithic simulation packages whose internal workings are complex and mostly undocumented.

Intermediate stages of processing in a simulation workflow are impossible to access or available only in program-specific formats.


Complexity: the Amber force field (1/2)

The Amber force field (http://ambermd.org/#ff) is one of the most popular computational models for biomolecules. In a research paper, it is typically described by its formula for the potential energy:

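\[
U = \sum_{\text{bonds } ij} k_{ij} \left( r_{ij} - r_{ij}^{(0)} \right)^2
  + \sum_{\text{angles } ijk} k_{ijk} \left( \phi_{ijk} - \phi_{ijk}^{(0)} \right)^2
  + \sum_{\text{dihedrals } ijkl} k_{ijkl} \cos\left( n_{ijkl} \theta_{ijkl} - \delta_{ijkl} \right)
  + \sum_{\text{all pairs } ij} 4 \epsilon_{ij} \left( \frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}} \right)
  + \sum_{\text{all pairs } ij} \frac{q_i q_j}{4 \pi \epsilon_0 r_{ij}}
\]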
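To make the formula concrete, here is a sketch of its individual terms as Python functions. It is an illustration of the published formula only, not Amber's implementation: the function names are hypothetical, units are left to the caller, and the prefactor 1/(4πε0) is passed in as a parameter that defaults to 1.0 (reduced units, an assumption of this sketch).

```python
import math


def harmonic_term(x, x0, k):
    """Bond or angle term: k * (x - x0)**2."""
    return k * (x - x0) ** 2


def dihedral_term(theta, k, n, delta):
    """Dihedral term: k * cos(n * theta - delta), angles in radians."""
    return k * math.cos(n * theta - delta)


def lennard_jones_term(r, epsilon, sigma):
    """Pair term: 4 * epsilon * ((sigma / r)**12 - (sigma / r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb_term(r, qi, qj, prefactor=1.0):
    """Pair term: prefactor * qi * qj / r, with prefactor = 1 / (4 pi eps0)."""
    return prefactor * qi * qj / r


# The Lennard-Jones term vanishes at r = sigma and reaches its minimum,
# -epsilon, at r = 2**(1/6) * sigma.
print(lennard_jones_term(1.0, epsilon=2.0, sigma=1.0))
print(lennard_jones_term(2.0 ** (1.0 / 6.0), epsilon=2.0, sigma=1.0))
```

None of these functions says which bonds, angles, dihedrals, or pairs enter the sums, or where the parameters come from. That information is not contained in the formula.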



Complexity: the Amber force field (2/2)

In addition to that formula, we need

rules for summing over bonds, angles, etc.

rules for finding the values of all the parameters for a given molecule

Most of these rules can be found on the Amber Web site (with some effort), but not in any research paper.

One rule is, to the best of my knowledge, not written down anywhere. It concerns the attribution of “improper dihedral” terms. I obtained it from the Amber developers by asking a precise question by e-mail.

That rule refers to the indices of the atoms in the input file defining the molecular structure. These indices are chosen arbitrarily. Making an energy dependent on them is profoundly unphysical.

How many users of the Amber force field know about this unphysical dependence on atom indices? The effect is very small.


A real-life case of (non-)reproducibility (1/5)

Goal: find the gas-phase structure of short peptide sequences by molecular simulation


Earlier work on this topic [1] finds that AcA15K + H+ forms a helix, but AcKA15 + H+ forms a globule.

My simulations predict that both sequences form globules.

What did I do differently?

[1] M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)


From the paper:

Molecular Dynamics (MD) simulations were performed to help interpret the experimental results. The simulations were done with the MACSIMUS suite of programs [31] using the CHARMM21.3 parameter set. A dielectric constant of 1.0 was employed.

The URL in ref. 31 is broken, but Google helps me find MACSIMUS nevertheless. It’s free and comes with a manual! :-)

I download the “latest release” dated 2012-11-09.

But which one was used for that paper in 2007?



MACSIMUS comes with a file charmm21.par that starts with

! Parameter File for CHARMM version 21.3 [June 24, 1991]
! Includes parameters for both polar and all hydrogen topology files
! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]
! Modified by JK using various sources

Did that paper use the “polar” or the “all hydrogen” topology files?

I can find only one set of topology files named charmm21, and that’s with polar hydrogens only, so I guess “polar”.

Now I have the parameters, but I don’t know the rules of the CHARMM force field, nor can I be sure that MACSIMUS uses the same rules as the CHARMM software.



From the paper:

A variety of starting structures were employed (such as helix, sheet, and extended linear chain) and a number of simulated annealing schedules were used in an effort to escape high energy local minima. Often, hundreds of simulations were performed to explore the energy landscape of a particular peptide. In some cases, MD with simulated annealing was unable to locate the lowest energy conformation and more sophisticated methods were used (see description of evolutionary based methods below).

I might as well give up here...

My point is not to criticize this particular paper.

The level of description is typical for the field.



What I would have liked to get:

a machine-readable file containing a full specification of the simulated system:

chemical structure

all force field terms with their parameters

the initial atom positions

a script implementing the annealing protocol

links to all the software used in the simulation, with version numbers

All that with persistent references (DOIs).


Enough whining...

Let’s do something to improve the situation!


1. A data model for molecular simulations

2. A data model for executable papers

3. Teaching best practices to computational scientists


Data handling in molecular simulation

There is no standard file format for molecular simulations.

Best approximation: the PDB format

made for a different purpose (crystallography)

lacks precision for coordinates

lacks the possibility to store anything else but coordinates

limited to 99,999 atoms (five-digit atom serial numbers)

few if any programs respect it completely
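The precision limit can be illustrated with a short sketch. The PDB format stores each Cartesian coordinate in a fixed-width field of eight characters with three decimals (in Ångström); the helper name below is hypothetical, and the choice of this field as the example of "lacks precision" is an assumption about what the slide refers to.

```python
def pdb_coordinate_field(x: float) -> str:
    """Format one coordinate as a PDB-style field: width 8, 3 decimals."""
    return f"{x:8.3f}"


x = 12.3456789                      # coordinate as held in a simulation
stored = pdb_coordinate_field(x)    # what ends up in the file
restored = float(stored)            # what a reader gets back

print(repr(stored))
print("round-trip error:", abs(restored - x))
```

Anything beyond the third decimal is lost on writing, so a configuration that has passed through a PDB file is no longer the configuration the simulation actually used.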

OK... so let’s define something better...

... which is a far from trivial task.


What do we want to archive and exchange?

the system we are simulating

the chemical structure of the molecules

the number of each kind of molecule

the topology of the system (infinite, 3D periodic, 2D periodic, ...)

symmetries (crystals, ...)

configurations (positions of all atoms plus PBC parameters)

velocities, masses, charges, and other per-atom properties

force field parameters

force fields (the rules)

trajectories

experimental data used in simulations

... lots of other stuff ...

Don’t try to do everything at once: we want a modular set of conventions.


Which systems, which simulation techniques?

anything at the atomic and molecular scale: gases, liquids, solids

from atoms via small molecules to macromolecules

all-atom and coarse-grained models

from quantum models via classical mechanics to stochastic dynamics

Application domains:

Chemical/statistical physics

Solid-state physics

Chemistry

Molecular/structural biology


Mosaic: a data model for molecular simulation

MOlecular SimulAtion Interchange Conventions

http://bitbucket.org/molsim/mosaic/

Mosaic is a data model with (currently) two representations.

Data model

expresses molecular simulation data in terms of basic data items and structures: numbers, strings, lists, trees, ...

Representation

defines how a data model is stored as a sequence of bytes in a file. Conversion between representations is exact.
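The distinction can be sketched in a few lines of Python. This is not the Mosaic API: the classes Atom and Fragment and the four conversion functions are hypothetical, and JSON stands in for a second representation purely for illustration. The point is that one data model can have several byte-level representations, with lossless conversion between them.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    label: str
    name: str


@dataclass(frozen=True)
class Fragment:
    """Data model: plain strings and tuples, independent of any file format."""
    label: str
    species: str
    atoms: tuple   # tuple of Atom
    bonds: tuple   # tuple of (atom label, atom label) pairs


def to_xml(frag: Fragment) -> str:
    root = ET.Element("fragment", label=frag.label, species=frag.species)
    atoms = ET.SubElement(root, "atoms")
    for atom in frag.atoms:
        ET.SubElement(atoms, "atom", label=atom.label, name=atom.name)
    bonds = ET.SubElement(root, "bonds")
    for a, b in frag.bonds:
        ET.SubElement(bonds, "bond", atoms=f"{a} {b}")
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Fragment:
    root = ET.fromstring(text)
    atoms = tuple(Atom(a.get("label"), a.get("name")) for a in root.find("atoms"))
    bonds = tuple(tuple(b.get("atoms").split()) for b in root.find("bonds"))
    return Fragment(root.get("label"), root.get("species"), atoms, bonds)


def to_json(frag: Fragment) -> str:
    return json.dumps({
        "label": frag.label,
        "species": frag.species,
        "atoms": [[a.label, a.name] for a in frag.atoms],
        "bonds": [list(b) for b in frag.bonds],
    })


def from_json(text: str) -> Fragment:
    d = json.loads(text)
    atoms = tuple(Atom(label, name) for label, name in d["atoms"])
    bonds = tuple(tuple(b) for b in d["bonds"])
    return Fragment(d["label"], d["species"], atoms, bonds)


methyl = Fragment(
    label="CH3",
    species="methyl",
    atoms=(Atom("C", "C"), Atom("H1", "H"), Atom("H2", "H"), Atom("H3", "H")),
    bonds=(("C", "H1"), ("C", "H2"), ("C", "H3")),
)

# XML -> data model -> JSON -> data model gives back the same object.
print(from_json(to_json(from_xml(to_xml(methyl)))) == methyl)
```

Because programs work with the data model rather than with a file format, adding a new representation only requires one more pair of conversion functions.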

This is work in progress.

In order to succeed, it must become a community project.



Mosaic: current state

Data items

System topology, chemical structure

Configurations

Per-atom/site numerical properties (velocities, charges, masses, ...)

Per-atom/site labels (atom names, atom types, ...)

Representations

HDF5 for big data sets, fast access, and easy interfacing to C/C++/Fortran code

XML for processing with standard XML tools

Planned: JSON for interfacing with Web-based tools

Three in-memory representations for Python

Tools

Conversion between representations

Import from PDB (mmCIF format)


HDF5 (Hierarchical Data Format, version 5)

Platform-independent binary storage

Efficient library for data handling

Metadata (provenance, dependencies, ...)

Well-established: used by big organizations (NASA, ...) for very large data sets (meteorology, ...)

Ideal for exchanging and archiving data

Suitable for very large data sets

[Figure: structure of an HDF5 file, with a root group containing groups and datasets (≈ arrays)]


So here’s my model for the gas phase peptide...

<?xml version="1.0" encoding="utf-8"?>
<mosaic version="0.1">
  <universe cell_shape="infinite" convention="MMTK" id="universe">
    <molecules>
      <molecule count="1">
        <fragment label="" species="protein">
          <fragments>
            <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Lys">
              <fragments>
                <fragment label="ACE0" species="ace_beginning">
                  <atoms>
                    <atom label="CBond" name="C" type="element" />
                    <atom label="CH3" name="C" type="element" />
                    <atom label="HH31" name="H" type="element" />
                    <atom label="HH32" name="H" type="element" />
                    <atom label="HH33" name="H" type="element" />
                    <atom label="O" name="O" type="element" />
                  </atoms>
                  <bonds>
                    <bond atoms="CH3 HH31" order="" />
                    <bond atoms="CH3 HH32" order="" />
                    <bond atoms="CH3 HH33" order="" />
                    <bond atoms="CBond CH3" order="" />
                    <bond atoms="CBond O" order="" />
                  </bonds>
                </fragment>

. . . (686 lines in total)
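Since the XML representation can be processed with standard XML tools, a reader can check such a file without any simulation software. The sketch below uses only Python's standard library; it copies the ACE0 fragment from the excerpt above as a standalone document, and the function name check_fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# The ACE0 fragment from the excerpt, as a self-contained document.
ACE_FRAGMENT = """
<fragment label="ACE0" species="ace_beginning">
  <atoms>
    <atom label="CBond" name="C" type="element" />
    <atom label="CH3" name="C" type="element" />
    <atom label="HH31" name="H" type="element" />
    <atom label="HH32" name="H" type="element" />
    <atom label="HH33" name="H" type="element" />
    <atom label="O" name="O" type="element" />
  </atoms>
  <bonds>
    <bond atoms="CH3 HH31" order="" />
    <bond atoms="CH3 HH32" order="" />
    <bond atoms="CH3 HH33" order="" />
    <bond atoms="CBond CH3" order="" />
    <bond atoms="CBond O" order="" />
  </bonds>
</fragment>
"""


def check_fragment(xml_text):
    """Return atom labels, bonds, and bonds that refer to undeclared atoms."""
    root = ET.fromstring(xml_text.strip())
    atom_labels = [atom.get("label") for atom in root.iter("atom")]
    bonds = [tuple(bond.get("atoms").split()) for bond in root.iter("bond")]
    declared = set(atom_labels)
    dangling = [bond for bond in bonds if not set(bond) <= declared]
    return atom_labels, bonds, dangling


atom_labels, bonds, dangling = check_fragment(ACE_FRAGMENT)
print(len(atom_labels), "atoms,", len(bonds), "bonds, dangling bonds:", dangling)
```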


Biomolecular simulation . . . a few years from now

[Figure: a single Mosaic HDF5 file holding all data items, produced step by step by small independent programs]

Contents of the Mosaic HDF5 file:

PDB crystal universe + configuration

Protein in vacuum universe + configuration

Force field for protein in vacuum

Solvated protein universe + configuration

Force field for solvated protein

Trajectory for solvated protein

Small independent programs:

PDB import

Chain extraction

Force field

Solvation

MD simulation

All the data for the simulation study in one place


Managing, publishing, and archiving reproducible research

Reproducible research results in large sets of interrelated software, data sets, and documentation.

Managing files, dependencies, and metadata (provenance, ...) becomes a nightmare when moving between multiple computers.

Publishing everything in an easily usable form is difficult due to a lack of standards.

Archiving is almost pointless: which software will still work 50 years from now?

ActivePapers (http://dirac.cnrs-orleans.fr/plone/software/activepapers)

provides a solution to these problems

is a ready-to-run prototype (needs a Java installation)

is rather useless in practice for now

K. Hinsen, Procedia Computer Science 4 (2011): 579-588


Preserving data for a long time

Formalized descriptions

data models

ontologies

well-defined file formats

Use standards, even bad ones

Standards ensure the critical mass that will motivate future scientists to maintain the knowledge and the tools required to interpret your data.

Limited dependencies

software: operating systems, compilers, libraries, . . .

infrastructure: Internet, communication protocols, . . .

organizations: research labs, publishers, . . .


Code is data

There is nothing special about software: it’s just a kind of data!

Everything is data:

Scientific data is data.

Code is data.

Input parameters are data.

Documentation is data.

All we need is good data models for everything, plus a way to describe dependencies.

The tricky part is finding a good data model for code.


Representing code as data

“Data types” for code:

source code: defined by a language specification

there are lots of languages

programmers are attached to them

binary/executable: defined by a processor/hardware specification

very complex specification

evolves rapidly with technological progress

bytecode: defined by a virtual machine specification (JVM, CLI, ...)

clear and simple specification

not directly visible to the programmer

today’s virtual machines were not designed for scientific computing → sub-optimal performance

JVM bytecode provides the best existing infrastructure.

There is not much scientific software for the JVM platform, which is why ActivePapers is rather useless today.


Dependency handling

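[Figure: two graphs over the same items. The first is the directed acyclic graph of dependencies between dataset 1, dataset 2, dataset 3, the program, its source code, and the compiler. The second is the dataflow graph of program execution, connecting the program to dataset 1, dataset 2, and dataset 3 through edges labelled input and output.]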
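A dependency graph of this kind can be written down and processed with very little code. The sketch below uses graphlib from the Python standard library (Python 3.9 or later). The particular edges are an assumption made for illustration: the program is taken to depend on its source code and the compiler, and dataset 3 on the program and on datasets 1 and 2.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the items in its set (assumed edges, see above).
dependencies = {
    "program": {"source code", "compiler"},
    "dataset 3": {"program", "dataset 1", "dataset 2"},
}


def rebuild_order(graph):
    """Return an order in which every item comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """A dependency graph is only usable if it contains no cycle."""
    try:
        rebuild_order(graph)
    except CycleError:
        return False
    return True


order = rebuild_order(dependencies)
print(order)
print("acyclic:", is_acyclic(dependencies))
```

Any valid order places the compiler and the source code before the program, and the program before dataset 3, which is the order in which a reproduction has to proceed.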


ActivePapers overview (1/2)

An ActivePaper . . .

is an HDF5 file containing any number of data sets

typically contains scientific data, code, and documentation

is identified by a DOI (Digital Object Identifier)

can refer to other ActivePapers through their DOIs

stores the dependency graph between data sets in HDF5 metadata

Executable code in an ActivePaper . . .

cannot access files outside of ActivePaper files

cannot modify data except inside its own ActivePaper

comes in three varieties: calclet (reads and writes data), viewlet (reads data for interactive visualization), and library (called by other code)


ActivePapers overview (2/2)

Special cases

a raw dataset (usually with some documentation)

a software library (usually with some documentation)

External dependencies (must be maintained for 50 years . . .)

a Java Virtual Machine with the Java standard library

the HDF5 library (could be rewritten as a JVM library)

JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)

Problems solved

future-proof publishing and archiving of computational research

secure repeatability of computational studies

reusability of scientific data and software

integration of scientific data and software into bibliometry


Teaching best practices to computational scientists

Software Carpentry is an organization dedicated to teaching computing skills to scientists, with support from the Alfred P. Sloan Foundation and the Mozilla Foundation.

Activities:

short intensive workshops (“boot camps”)

online courses

You can help by

hosting a boot camp

being an instructor at boot camps

working on teaching material

http://software-carpentry.org/

