reproducible research in molecular biophysics and structural biology
Reproducible science in the context of molecular biophysics and structural biology: state of the art and outlook.
Reproducible research in molecular biophysics and
structural bioinformatics
Konrad HINSEN
Centre de Biophysique Moléculaire, Orléans, France and
Synchrotron SOLEIL, Saint Aubin, France
13 December 2012
Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 1 / 36
Reproducibility
One of the ideals of science
Scientific results should be verifiable.
Verification requires reproduction by other scientists.
Few results actually are reproduced, but it’s still important to make this possible:
important for the credibility of science in society (remember “Climategate”, cold fusion, ...)
important for the credibility of a specific study
The more detail you provide about what you did, the more your peers are willing to believe that you did what you claim to have done.
Reproducibility also matters for efficient collaboration inside a team.
Reproducibility in practice
Near-perfect in mathematics
No journal publishes a theorem unless the author provides a proof.
Often best-effort in experimental science
Lab notebooks record all the details for in-house replication.
Published protocols are less detailed, but often clear enough for an expert in the field.
The main limitation is technical: lab equipment and concrete samples cannot be reproduced identically.
Lousy in computational science
Papers give short method summaries and concentrate on results.
Few scientists can reproduce their own results after a few months.
Reproducibility in computational science
We could do better than experimentalists
Results are deterministic, fully determined by input data and algorithms.
If we published all input data and programs, anyone could reproduce the results exactly.
But we don’t
Lots of technical difficulties.
Important additional effort.
Few incentives.
Goals of the Reproducible Research movement
Create more awareness of the problem.
Provide better tools.
Reproducible (computational) Research matters
“Climategate”
In 2009 a server at the Climatic Research Unit at the University of East Anglia was hacked and many of its files became public. One of them describes a scientist’s difficulties in reproducing his colleagues’ results. This has been used by climate change skeptics to discredit climate research.
Protein structure retractions
In 2006, three important protein structures published in Science were retracted following the discovery of a bug in the software used for data processing.
For more examples and details:
Z. Merali, “...Error ... why scientific programming does not compute”, Nature 467, 775 (2010)
Science Special Issue on Computational Biology, 13 April 2012
Today’s tools for Reproducible Research
Version control for source code and data
Mercurial, Git, Subversion, ...
Great for source code, usable for small data sets
Workflow management systems
VisTrails, Taverna, Kepler, ...
Preservation of computational procedures and provenance tracking, great for scientists who like graphical workflows
Electronic lab notebooks
Mathematica, IPython notebook, ...
Preservation of computational procedures and provenance tracking, great for scientists who like scripting
Publication tools for computations
Collage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ...
Experimental and/or domain-specific
Missing pieces (1/2)
Integration of version control and provenance tracking
Record for reproduction that dataset X was obtained from version 5 of dataset Y using version 2.2 of program Z compiled with gcc version 4.1.
Version control for big datasets
Version control tools are made for text-based formats.
Provenance tracking and workflows across machines
If you do part of your computations elsewhere (supercomputer, lab’s cluster, ...), existing tools won’t work for you.
Tool-independent standard file formats
If Alice wants to reproduce Bob’s results, she needs to use Bob’s tools for version control and workflows/notebooks.
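The provenance record described under “Integration of version control and provenance tracking” can be sketched in a few lines of Python. This is only an illustration, not an existing tool: the function `provenance_record`, the field names, and the placeholder contents of dataset Y are all assumptions made for the example.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hexadecimal SHA-256 checksum of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(output_name, input_name, input_bytes, input_version,
                      program, program_version, compiler):
    """Build a plain dictionary describing how one dataset was derived.

    All field names are hypothetical; no standard is implied.
    """
    return {
        "output": output_name,
        "derived_from": {
            "name": input_name,
            "version": input_version,
            "sha256": sha256_of(input_bytes),
        },
        "program": {
            "name": program,
            "version": program_version,
            "compiler": compiler,
        },
    }


# Placeholder bytes stand in for the real contents of dataset Y.
record = provenance_record("dataset X", "dataset Y", b"contents of dataset Y",
                           5, "Z", "2.2", "gcc 4.1")
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing a checksum next to the version number lets a reader verify that the dataset they hold is the one that was actually used.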
Missing pieces (2/2)
Reproducible floating-point computations
Conflict between reproducibility and performance:
Reproducibility for IEEE float arithmetic requires an exact sequence of load, store, and arithmetic operations.
Performance optimization requires shuffling around operations depending on processor type, cache size, memory access speed, number of processors, etc.
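The conflict is easy to demonstrate: floating-point addition is not associative, so reordering a sum can change its result. The short Python sketch below uses a hypothetical helper, `ordered_sum`, that adds values strictly left to right, the way a simple serial loop would.

```python
import math


def ordered_sum(values):
    """Add the values strictly left to right, like a simple serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total


# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# The same four numbers, summed in two different orders.
serial = ordered_sum([1e16, 1.0, -1e16, 1.0])
reordered = ordered_sum([1e16, -1e16, 1.0, 1.0])
print(serial, reordered)  # 1.0 2.0

# math.fsum tracks the rounding errors and returns the correctly rounded sum.
exact = math.fsum([1e16, 1.0, -1e16, 1.0])
print(exact)  # 2.0
```

A parallel reduction that splits the sum differently on 4 or 64 processors is free to pick either order, which is why bit-for-bit reproducibility and performance tuning pull in opposite directions.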
Reproducibility in biomolecular simulations (1/3)
Heavy resource requirements
Most important technique: Molecular Dynamics
Long runs (days to weeks) on big parallel machines
Produces large datasets (a few GB per trajectory)
Trajectory analysis can be as expensive as the simulation itself
Datasets are too big for version control.
Today’s provenance trackers can’t keep track of computations across machines.
Reproducibility in biomolecular simulations (2/3)
Complexity
Computational models contain thousands of parameters
No complete formal description of the models (except for program source code)
Simulation algorithms contain many approximations made for performance reasons
These approximations are usually undocumented
Publishing a complete description of a molecular simulation is so difficult that very few scientists could do it and equally few could understand the resulting description.
There is no file format for storing a complete description of a molecular simulation in machine-readable form.
Reproducibility in biomolecular simulations (3/3)
A split community
A small number of “developers” with backgrounds in physics and chemistry
A large number of “users” with backgrounds in biology and biochemistry
A negligibly small number of computing specialists
Users don’t care about the technical details of the simulations.
Users (have to) trust the developers blindly.
Developers understand the models but have no training in software development or data management.
The field is dominated by a few large monolithic simulation packages whose internal workings are complex and mostly undocumented.
Intermediate stages of processing in a simulation workflow are impossible to access or available only in program-specific formats.
Complexity: the Amber force field (1/2)
The Amber force field (http://ambermd.org/#ff) is one of the most popular computational models for biomolecules. In a research paper, it is typically described by its formula for the potential energy:
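\[
U = \sum_{\text{bonds } ij} k_{ij} \left( r_{ij} - r_{ij}^{(0)} \right)^2
  + \sum_{\text{angles } ijk} k_{ijk} \left( \phi_{ijk} - \phi_{ijk}^{(0)} \right)^2
  + \sum_{\text{dihedrals } ijkl} k_{ijkl} \cos\left( n_{ijkl}\theta_{ijkl} - \delta_{ijkl} \right)
  + \sum_{\text{all pairs } ij} 4\epsilon_{ij} \left( \frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}} \right)
  + \sum_{\text{all pairs } ij} \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}}
\]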
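To make the last two sums of the formula concrete, here is a minimal Python sketch of the pair terms (Lennard-Jones plus Coulomb). It is not Amber's implementation: the function names, the explicit pair list, the toy parameter values, and the approximate Coulomb constant for kcal/mol, Å, and elementary-charge units are all assumptions of this example. Which pairs enter the sum is deliberately left to the caller, since the formula by itself does not say.

```python
import math

# Approximate value of 1/(4*pi*eps0) in kcal*Angstrom/(mol*e^2); an assumption
# of this sketch, not a value taken from the slides.
COULOMB = 332.06


def lennard_jones(r, epsilon, sigma):
    """4*epsilon*((sigma/r)**12 - (sigma/r)**6) for one pair at distance r."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def coulomb(r, qi, qj, k=COULOMB):
    """qi*qj/(4*pi*eps0*r) for one pair, with the prefactor folded into k."""
    return k * qi * qj / r


def nonbonded_energy(positions, pairs):
    """Sum both pair terms over an explicit list of pairs.

    pairs: iterable of (i, j, epsilon, sigma, qi, qj) tuples (hypothetical layout).
    """
    total = 0.0
    for i, j, epsilon, sigma, qi, qj in pairs:
        r = math.dist(positions[i], positions[j])
        total += lennard_jones(r, epsilon, sigma) + coulomb(r, qi, qj)
    return total


# Toy system: two atoms 3 Angstrom apart with made-up parameters.
positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
pairs = [(0, 1, 0.1, 3.4, 0.2, -0.2)]
energy = nonbonded_energy(positions, pairs)
print(energy)
```

The code needs the pair list as an input that the formula never specifies, which is exactly the kind of missing rule this part of the talk is about.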
Complexity: the Amber force field (2/2)
In addition to that formula, we need
rules for summing over bonds, angles, etc.
rules for finding the values of all the parameters for a given molecule
Most of these rules can be found on the Amber Web site (with some effort), but not in any research paper.
One rule is, to the best of my knowledge, not written down anywhere. It concerns the attribution of “improper dihedral” terms. I obtained it from the Amber developers by asking a precise question by e-mail.
That rule refers to the indices of the atoms in the input file defining the molecular structure. These indices are chosen arbitrarily. Making an energy dependent on them is profoundly unphysical.
How many users of the Amber force field know about this unphysical dependence on atom indices? The effect is very small.
A real-life case of (non-)reproducibility (1/5)
Goal: find the gas-phase structure of short peptide sequences by molecular simulation
[Figure: helix or globule?]
Earlier work on this topic [1] finds that AcA15K + H+ forms a helix, but AcKA15 + H+ forms a globule.
My simulations predict that both sequences form globules.
What did I do differently?
[1] M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)
A real-life case of (non-)reproducibility (2/5)
From the paper:
Molecular Dynamics (MD) simulations were performed to help interpret the experimental results. The simulations were done with the MACSIMUS suite of programs [31] using the CHARMM21.3 parameter set. A dielectric constant of 1.0 was employed.
The URL in ref. 31 is broken, but Google helps me find MACSIMUS nevertheless. It’s free and comes with a manual! :-)
I download the “latest release” dated 2012-11-09.
But which one was used for that paper in 2007?
A real-life case of (non-)reproducibility (3/5)
MACSIMUS comes with a file charmm21.par that starts with
! Parameter File for CHARMM version 21.3 [June 24, 1991]
! Includes parameters for both polar and all hydrogen topology files
! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]
! Modified by JK using various sources
Did that paper use the “polar” or the “all hydrogen” topology files?
I can find only one set of topology files named charmm21, and that’s with polar hydrogens only, so I guess “polar”.
Now I have the parameters, but I don’t know the rules of the CHARMM force field, nor can I be sure that MACSIMUS uses the same rules as the CHARMM software.
A real-life case of (non-)reproducibility (4/5)
From the paper:
A variety of starting structures were employed (such as helix, sheet, and extended linear chain) and a number of simulated annealing schedules were used in an effort to escape high energy local minima. Often, hundreds of simulations were performed to explore the energy landscape of a particular peptide. In some cases, MD with simulated annealing was unable to locate the lowest energy conformation and more sophisticated methods were used (see description of evolutionary based methods below).
I might as well give up here...
My point is not to criticize this particular paper.
The level of description is typical for the field.
A real-life case of (non-)reproducibility (5/5)
What I would have liked to get:
a machine-readable file containing a full specification of the simulated system:
  chemical structure
  all force field terms with their parameters
  the initial atom positions
a script implementing the annealing protocol
links to all the software used in the simulation, with version numbers
All that with persistent references (DOIs).
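As an illustration of what “a script implementing the annealing protocol” could look like, here is a deliberately small Python sketch. Everything in it is hypothetical: the paper does not state its schedules, so the linear cooling rule, the temperatures, the stage count, and the names `protocol` and `linear_annealing_schedule` are invented for the example.

```python
def linear_annealing_schedule(t_start, t_end, n_stages):
    """Return n_stages temperatures going linearly from t_start to t_end."""
    if n_stages < 2:
        raise ValueError("need at least two stages")
    step = (t_end - t_start) / (n_stages - 1)
    return [t_start + i * step for i in range(n_stages)]


# The protocol as data: every number a reader needs to rerun it (made-up values).
protocol = {
    "schedule": "linear",
    "t_start_K": 600.0,
    "t_end_K": 300.0,
    "n_stages": 4,
    "steps_per_stage": 1000,
}

temperatures = linear_annealing_schedule(
    protocol["t_start_K"], protocol["t_end_K"], protocol["n_stages"]
)
print(temperatures)  # [600.0, 500.0, 400.0, 300.0]
```

Even a toy script like this pins down details that a sentence such as “a number of simulated annealing schedules were used” leaves open.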
Enough whining...
Let’s do something to improve the situation!
1. A data model for molecular simulations
2. A data model for executable papers
3. Teaching best practices to computational scientists
Data handling in molecular simulation
There is no standard file format for molecular simulations.
Best approximation: the PDB format
made for a different purpose (crystallography)
lacks precision for coordinates
lacks the possibility to store anything else but coordinates
limited to 99,999 atoms (five-digit atom serial numbers)
few if any programs respect it completely
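The precision limit comes from the fixed-column layout: the PDB format stores each coordinate in an eight-character field with three decimals. A minimal Python illustration follows; the helper name `pdb_coordinate_field` and the sample value are made up for this example.

```python
def pdb_coordinate_field(x):
    """Format one coordinate the way a PDB ATOM record does (width 8, 3 decimals)."""
    return "%8.3f" % x


x = 12.3456789                    # coordinate in Angstrom, as held in memory
field = pdb_coordinate_field(x)   # what ends up in the file
restored = float(field)           # what a reader of the file gets back

print(repr(field))                # '  12.346'
print(abs(restored - x))          # round-off of a few 1e-4 Angstrom
```

Writing a configuration to PDB and reading it back is therefore not an exact round trip, which matters when a simulation is supposed to restart from precisely the same state.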
OK... so let’s define something better...
... which is a far from trivial task.
What do we want to archive and exchange?
the system we are simulating
  the chemical structure of the molecules
  the number of each kind of molecule
  the topology of the system (infinite, 3D periodic, 2D periodic, ...)
  symmetries (crystals, ...)
configurations (positions of all atoms plus PBC parameters)
velocities, masses, charges, and other per-atom properties
force field parameters
force fields (the rules)
trajectories
experimental data used in simulations
... lots of other stuff ...
Don’t try to do everything at once: we want a modular set of conventions.
Which systems, which simulation techniques?
anything at the atomic and molecular scale: gases, liquids, solids
from atoms via small molecules to macromolecules
all-atom and coarse-grained models
from quantum models via classical mechanics to stochastic dynamics
Application domains:
Chemical/statistical physics
Solid-state physics
Chemistry
Molecular/structural biology
Mosaic: a data model for molecular simulation
MOlecular SimulAtion Interchange Conventions
http://bitbucket.org/molsim/mosaic/
Mosaic is a data model with (currently) two representations.
Data model
expresses molecular simulation data in terms of basic data items and structures: numbers, strings, lists, trees, . . .
Representation
defines how a data model is stored as a sequence of bytes in a file. Conversion between representations is exact.
This is work in progress.
In order to succeed, it must become a community project.
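The distinction between a data model and its representations can be illustrated with a toy example in Python. This is not Mosaic's code or API: the two-atom data model, the function names, and the JSON and XML layouts are assumptions chosen only to show what “conversion between representations is exact” means.

```python
import json
import xml.etree.ElementTree as ET

# Toy data model: a list of atoms, each with a label and an element name.
atoms = [{"label": "CH3", "name": "C"}, {"label": "O", "name": "O"}]


def to_json(atom_list):
    """Representation 1: a JSON document."""
    return json.dumps({"atoms": atom_list})


def from_json(text):
    return json.loads(text)["atoms"]


def to_xml(atom_list):
    """Representation 2: an XML document with one <atom> element per atom."""
    root = ET.Element("atoms")
    for atom in atom_list:
        ET.SubElement(root, "atom", label=atom["label"], name=atom["name"])
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return [{"label": e.get("label"), "name": e.get("name")}
            for e in ET.fromstring(text).iter("atom")]


# Going through both representations gives back the same data model instance.
round_trip = from_xml(to_xml(from_json(to_json(atoms))))
print(round_trip == atoms)  # True
```

The bytes on disk differ completely between the two files, but a program that works on the data model never needs to know which one it was given.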
Mosaic: current state
Data items
System topology, chemical structure
Configurations
Per-atom/site numerical properties (velocities, charges, masses, . . .)
Per-atom/site labels (atom names, atom types, . . .)
Representations
HDF5 for big data sets, fast access, and easy interfacing to C/C++/Fortran code
XML for processing with standard XML tools
Planned: JSON for interfacing with Web-based tools
Three in-memory representations for Python
Tools
Conversion between representations
Import from PDB (mmCIF format)
HDF5 (Hierarchical Data Format, version 5)
Platform-independent binary storage
Efficient library for data handling
Metadata (provenance, dependencies, . . . )
Well-established: used by big organizations (NASA, . . .) for very large data sets (meteorology . . .)
Ideal for exchanging and archiving data
Suitable for very large data sets
[Figure: HDF5 file layout, with a root group containing groups and datasets (≈ arrays)]
So here’s my model for the gas phase peptide...
<?xml version="1.0" encoding="utf-8"?>
<mosaic version="0.1">
  <universe cell_shape="infinite" convention="MMTK" id="universe">
    <molecules>
      <molecule count="1">
        <fragment label="" species="protein">
          <fragments>
            <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Lys">
              <fragments>
                <fragment label="ACE0" species="ace_beginning">
                  <atoms>
                    <atom label="CBond" name="C" type="element" />
                    <atom label="CH3" name="C" type="element" />
                    <atom label="HH31" name="H" type="element" />
                    <atom label="HH32" name="H" type="element" />
                    <atom label="HH33" name="H" type="element" />
                    <atom label="O" name="O" type="element" />
                  </atoms>
                  <bonds>
                    <bond atoms="CH3 HH31" order="" />
                    <bond atoms="CH3 HH32" order="" />
                    <bond atoms="CH3 HH33" order="" />
                    <bond atoms="CBond CH3" order="" />
                    <bond atoms="CBond O" order="" />
                  </bonds>
                </fragment>
. . . (686 lines in total)
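Because the XML representation is plain XML, standard tools can read it. The Python sketch below parses only the ACE0 fragment from the excerpt above with the standard library and checks that every bond refers to a declared atom. The element and attribute names are those visible in the excerpt; the rest of the 686-line file and any official Mosaic tooling are not assumed.

```python
import xml.etree.ElementTree as ET

# The ACE0 fragment from the excerpt, copied as a standalone document.
FRAGMENT = """
<fragment label="ACE0" species="ace_beginning">
  <atoms>
    <atom label="CBond" name="C" type="element" />
    <atom label="CH3" name="C" type="element" />
    <atom label="HH31" name="H" type="element" />
    <atom label="HH32" name="H" type="element" />
    <atom label="HH33" name="H" type="element" />
    <atom label="O" name="O" type="element" />
  </atoms>
  <bonds>
    <bond atoms="CH3 HH31" order="" />
    <bond atoms="CH3 HH32" order="" />
    <bond atoms="CH3 HH33" order="" />
    <bond atoms="CBond CH3" order="" />
    <bond atoms="CBond O" order="" />
  </bonds>
</fragment>
"""

root = ET.fromstring(FRAGMENT.strip())

# Map each atom label to its chemical element.
elements = {atom.get("label"): atom.get("name") for atom in root.iter("atom")}

# Each bond lists two atom labels separated by a space.
bonds = [tuple(bond.get("atoms").split()) for bond in root.iter("bond")]

# Consistency check: every bonded label must be a declared atom.
dangling = [label for pair in bonds for label in pair if label not in elements]

print(len(elements), "atoms,", len(bonds), "bonds,", len(dangling), "dangling labels")
```

A check like this takes a few lines for a documented, tool-independent format, and is much harder to write against a program-specific binary file.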
Biomolecular simulation . . . a few years from now
[Diagram: data items in one Mosaic HDF5 file, connected by small independent programs]
Contents of the Mosaic HDF5 file:
PDB crystal universe + configuration
Protein in vacuum universe + configuration
Force field for protein in vacuum
Solvated protein universe + configuration
Force field for solvated protein
Trajectory for solvated protein
Small independent programs:
PDB import
Chain extraction
Force field
Solvation
MD simulation
All the data for the simulation study in one place
Managing, publishing, and archiving reproducible research
Reproducible research results in large sets of interrelated software, data sets, and documentation.
Managing files, dependencies, and metadata (provenance, . . .) becomes a nightmare when moving between multiple computers.
Publishing everything in an easily usable form is difficult due to a lack of standards.
Archiving is almost pointless: which software will still work 50 years from now?
ActivePapers (http://dirac.cnrs-orleans.fr/plone/software/activepapers)
provides a solution to these problems
is a ready-to-run prototype (needs a Java installation)
is rather useless in practice for now
K. Hinsen, Procedia Computer Science 4 (2011): 579-588
Preserving data for a long time
Formalized descriptions
data models
ontologies
well-defined file formats
Use standards, even bad ones
Standards ensure the critical mass that will motivate future scientists to maintain the knowledge and the tools required to interpret your data.
Limited dependencies
software: operating systems, compilers, libraries, . . .
infrastructure: Internet, communication protocols, . . .
organizations: research labs, publishers, . . .
Code is data
There is nothing special about software, it’s just a kind of data!
Everything is data:
Scientific data is data.
Code is data.
Input parameters are data.
Documentation is data.
All we need is good data models for everything, plus a way to describe dependencies.
The tricky part is finding a good data model for code.
Representing code as data
“Data types” for code:
source code: defined by a language specification
  there are lots of languages
  programmers are attached to them
binary/executable: defined by a processor/hardware specification
  very complex specification
  evolves rapidly with technological progress
bytecode: defined by a virtual machine specification (JVM, CLI, . . .)
  clear and simple specification
  not directly visible to the programmer
  today’s virtual machines were not designed for scientific computing → sub-optimal performance
JVM bytecodes provide the best existing infrastructure.
There is not much scientific software for the JVM platform, which is why ActivePapers is rather useless today.
Dependency handling
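[Figure: two views of the same computation. One is a directed acyclic graph of dependencies among dataset 1, dataset 2, dataset 3, program, source code, and compiler. The other is a dataflow graph of program execution, with dataset 1 and dataset 2 as input to the program and dataset 3 as its output.]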
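A dependency graph of this kind is straightforward to handle in code. The Python sketch below encodes one plausible reading of the figure (dataset 3 depends on the program and on datasets 1 and 2; the program depends on its source code and the compiler) and uses the standard-library `graphlib` module to find an order in which everything can be rebuilt. The edge list is an interpretation of the slide, not something ActivePapers prescribes.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the items in its set (one reading of the figure).
dependencies = {
    "program": {"source code", "compiler"},
    "dataset 3": {"program", "dataset 1", "dataset 2"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)


def is_acyclic(graph):
    """Return True if the dependency graph has no cycles."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


print(is_acyclic(dependencies))                # True
print(is_acyclic({"a": {"b"}, "b": {"a"}}))    # False
```

The acyclicity check matters: a dataset that (indirectly) depends on itself cannot be regenerated from scratch, so such a graph should be rejected when it is recorded.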
ActivePapers overview (1/2)
An ActivePaper . . .
is an HDF5 file containing any number of data sets
typically contains scientific data, code, and documentation
is identified by a DOI (Digital Object Identifier)
can refer to other ActivePapers through their DOIs
stores the dependency graph between data sets in HDF5 metadata
Executable code in an ActivePaper . . .
cannot access files outside of ActivePaper files
cannot modify data except inside its own ActivePaper
comes in three varieties: calclet (reads and writes data), viewlet (reads data for interactive visualization), and library (called by other code)
ActivePapers overview (2/2)
Special cases
a raw dataset (usually with some documentation)
a software library (usually with some documentation)
External dependencies (must be maintained for 50 years . . .)
a Java Virtual Machine with the Java standard library
the HDF5 library (could be rewritten as a JVM library)
JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)
Problems solved
future-proof publishing and archiving of computational research
secure repeatability of computational studies
reusability of scientific data and software
integration of scientific data and software into bibliometry
Teaching best practices to computational scientists
Software Carpentry
is an organization dedicated to teaching computing skills to scientists, with support from the Alfred P. Sloan Foundation and the Mozilla Foundation.
Activities:
short intensive workshops (“boot camps”)
online courses
You can help by
hosting a boot camp
being an instructor at boot camps
working on teaching material
http://software-carpentry.org/