reproducible research in molecular biophysics and structural biology

36
Reproducible research in molecular biophysics and structural bioinformatics Konrad HINSEN Centre de Biophysique Moléculaire, Orléans, France and Synchrotron SOLEIL, Saint Aubin, France 13 December 2012 Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 1 / 36

Upload: khinsen

Post on 10-May-2015

648 views

Category:

Technology


0 download

DESCRIPTION

Reproducible science in the context of molecular biophysics and structural biology: state of the art and outlook.

TRANSCRIPT

Page 1: Reproducible research in molecular biophysics and structural biology

Reproducible research in molecular biophysics and

structural bioinformatics

Konrad HINSEN

Centre de Biophysique Moléculaire, Orléans, Franceand

Synchrotron SOLEIL, Saint Aubin, France

13 December 2012

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 1 / 36

Page 2: Reproducible research in molecular biophysics and structural biology

Reproducibility

One of the ideals of science

Scientific results should be verifiable.

Verification requires reproduction by other scientists.

Few results actually are reproduced, but it’s still important tomake this possible:

important for the credibility of science in society(remember “Climategate”, cold fusion, ...)important for the credibility of a specific studyThe more detail you provide about what you did, the more yourpeers are willing to believe that you did what you claim to havedone.

Reproducibility also matters for efficient collaboration inside ateam.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 2 / 36

Page 3: Reproducible research in molecular biophysics and structural biology

Reproducibility in practice

Near-perfect in mathematics

No journal publishes a theorem unless the author provides a proof.

Often best-effort in experimental science

Lab notebooks record all the details for in-house replication.

Published protocols are less detailed, but often clear enough foran expert in the field.

The main limitation is technical: lab equipment and concretesamples cannot be reproduced identically.

Lousy in computational science

Papers give short method summaries and concentrate on results.

Few scientists can reproduce their own results after a fewmonths.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 3 / 36

Page 4: Reproducible research in molecular biophysics and structural biology

Reproducibility in computational science

We could do better than experimentalists

Results are deterministic, fully determined by input data andalgorithms.

If we published all input data and programs, anyone couldreproduce the results exactly.

But we don’tLots of technical difficulties.

Important additional effort.

Few incentives.

Goals of the Reproducible Research movement

Create more awareness of the problem.

Provide better tools.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 4 / 36

Page 5: Reproducible research in molecular biophysics and structural biology

Reproducible (computational) Research matters

“Climategate”

In 2009 a server at the Climatic Research Unit at the University ofEast Anglia was hacked and many of its files became public. One ofthem describes a scientist’s difficulties to reproduce his colleagues’results. This has been used by climate change skeptics to discreditclimate research.

Protein structure retractionsIn 2006, three important protein structures published in Sciencewere retracted following the discovery of a bug in the software usedfor data processing.

For more examples and details:

Z. Merali, “...Error ... why scientific programming does notcompute”, Nature 467, 775 (2010)Science Special Issue on Computational Biology, 13 April 2012

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 5 / 36

Page 6: Reproducible research in molecular biophysics and structural biology

Today’s tools for Reproducible Research

Version control for source code and dataMercurial, Git, Subversion, ...

Great for source code, usable for small data sets

Workflow management systemsVisTrails, Taverna, Kepler, ...

Preservation of computational procedures and provenancetracking, great for scientists who like graphical workflows

Electronic lab notebooksMathematica, IPython notebook, ...

Preservation of computational procedures and provenancetracking, great for scientists who like scripting

Publication tools for computationsCollage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ...

Experimental and/or domain-specific

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 6 / 36

Page 7: Reproducible research in molecular biophysics and structural biology

Missing pieces (1/2)

Integration of version control and provenance tracking

Record for reproduction that dataset X was obtained from version 5of dataset Y using version 2.2 of program Z compiled with gccversion 4.1.

Version control for big datasets

Version control tools are made for text-based formats.

Provenance tracking and workflows across machines

If you do part of your computations elsewhere (supercomputer, lab’scluster, ...), existing tools won’t work for you.

Tool-independent standard file formats

If Alice wants to reproduce Bob’s results, she needs to use Bob’stools for version control and workflows/notebooks.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 7 / 36

Page 8: Reproducible research in molecular biophysics and structural biology

Missing pieces (2/2)

Reproducible floating-point computations

Conflict between reproducibility and performance:

Reproducibility for IEEE float arithmetic requires an exactsequence of load, store, and arithmetic operations.

Performance optimization requires shuffling around operationsdepending on processor type, cache size, memory access speed,number of processors, etc.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 8 / 36

Page 9: Reproducible research in molecular biophysics and structural biology

Reproducibility in biomolecular simulations (1/3)

Heavy resource requirements

Most important technique: Molecular Dynamics

Long runs (days to weeks) on big parallel machines

Produces large datasets (a few GB per trajectory)

Trajectory analysis can be as expensive as the simulation itself

Datasets are too big for version control.

Today’s provenance trackers can’t keep track of computations acrossmachines.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 9 / 36

Page 10: Reproducible research in molecular biophysics and structural biology

Reproducibility in biomolecular simulations (2/3)

Complexity

Computational models contain thousands of parameters

No complete formal description of the models(except for program source code)

Simulation algorithms contain many approximations made forperformance reasons

These approximations are usually undocumented

Publishing a complete description of a molecular simulation is sodifficult that very few scientists could do it and equally few couldunderstand the resulting description.

There is no file format for storing a complete description of amolecular simulation in machine-readable form.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 10 / 36

Page 11: Reproducible research in molecular biophysics and structural biology

Reproducibility in biomolecular simulations (3/3)

A split community

A small number of “developers” with backgrounds in physics andchemistry

A large number of “users” with backgrounds in biology andbiochemistry

A negligibly small number of computing specialists

Users don’t care about the technical details of the simulations.

Users (have to) trust the developers blindly.

Developers understand the models but have no training insoftware development or data management.

The field is dominated by a few large monolithic simulation packageswhose internal workings are complex and mostly undocumented.

Intermediate stages of processing in a simulation workflow areimpossible to access or available only in program-specific formats.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 11 / 36

Page 12: Reproducible research in molecular biophysics and structural biology

Complexity: the Amber force field (1/2)

The Amber force field (http://ambermd.org/#ff) is one of the mostpopular computational models for biomolecules. In a research paper,it is typically described by its formula for the potential energy:

U =∑

bonds ij

kij(rij − r

(0)ij

)2

+∑

angles ijk

kijk(φijk − φ

(0)ijk

)2

+∑

dihedrals ijkl

kijkl cos (nijklθijkl − δijkl)

+∑

all pairs ij 4εij

(σ12ij

r12 −σ6ij

r6

)+

∑all pairs ij

qiqj4πε0rij

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 12 / 36

Page 13: Reproducible research in molecular biophysics and structural biology

Complexity: the Amber force field (2/2)

In addition to that formula, we need

rules for summing over bonds, angles, etc.

rules for finding the values of all the parameters for a givenmolecule

Most of these rules can be found on the Amber Web site (with someeffort), but not in any research paper.

One rule is, to the best of my knowledge, not written down anywhere.It concerns the attribution of “improper dihedral” terms. I obtained itfrom the Amber developers by asking a precise question by e-mail.

That rule refers to the indices of the atoms in the input file definingthe molecular structure. These indices are chosen arbitrarily. Makingan energy dependent on them is profoundly unphysical.

How many users of the Amber force field know about this unphysicaldependence on atom indices? The effect is very small.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 13 / 36

Page 14: Reproducible research in molecular biophysics and structural biology

A real-life case of (non-)reproducibility (1/5)

Goal: find the gas-phase structure of short peptide sequences bymolecular simulation

or

Earlier work on this topic1 finds that AcA15 K + H+ forms a helix, butAcK A15 + H+ forms a globule.

My simulations predict that both sequences form globules.

What did I do differently?

1M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 14 / 36

Page 15: Reproducible research in molecular biophysics and structural biology

A real-life case of (non-)reproducibility (2/5)

From the paper:

Molecular Dynamics (MD) simulations were performed tohelp interpret the experimental results. The simulationswere done with the MACSIMUS suite of programs [31] usingthe CHARMM21.3 parameter set. A dielectric constant of 1.0was employed.

The URL in ref. 31 is broken, but Google helps me find MACSIMUSnevertheless. It’s free and comes with a manual! ¨̂

I download the “latest release” dated 2012-11-09.But which one was used for that paper in 2007?

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 15 / 36

Page 16: Reproducible research in molecular biophysics and structural biology

A real-life case of (non-)reproducibility (3/5)

MACSIMUS comes with a file charmm21.par that starts with

! Parameter File for CHARMM version 21.3 [June 24, 1991]! Includes parameters for both polar and all hydrogen topology files! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]! Modified by JK using various sources

Did that paper use the “polar” or the “all hydrogen” topology files?

I can find only one set of topology files named charmm21, and that’swith polar hydrogens only, so I guess “polar”.

Now I have the parameters, but I don’t know the rules of theCHARMM force field, nor can I be sure that MACSIMUS uses the samerules as the CHARMM software.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 16 / 36

Page 17: Reproducible research in molecular biophysics and structural biology

A real-life case of (non-)reproducibility (4/5)

From the paper:

A variety of starting structures were employed (such ashelix, sheet, and extended linear chain) and a number ofsimulated annealing schedules were used in an effort toescape high energy local minima. Often, hundreds ofsimulations were performed to explore the energylandscape of a particular peptide. In some cases, MD withsimulated annealing was unable to locate the lowest energyconformation and more sophisticated methods were used(see description of evolutionary based methods below).

I might as well give up here...

My point is not to criticize this particular paper.The level of description is typical for the field.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 17 / 36

Page 18: Reproducible research in molecular biophysics and structural biology

A real-life case of (non-)reproducibility (5/5)

What I would have liked to get:

a machine-readable file containing a full specification of thesimulated system:

chemical structureall force field terms with their parametersthe initial atom positions

a script implementing the annealing protocol

links to all the software used in the simulation, with versionnumbers

All that with persistent references (DOIs).

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 18 / 36

Page 19: Reproducible research in molecular biophysics and structural biology

Enough whining...

Let’s do somethingto improve the situ-tion!

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 19 / 36

Page 20: Reproducible research in molecular biophysics and structural biology

1 A data model for molecularsimulations

2 A data model for executablepapers

3 Teaching best practices tocomputational scientists

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 20 / 36

Page 21: Reproducible research in molecular biophysics and structural biology

Data handling in molecular simulation

There is no standard file format for molecular simulations.

Best approximation: the PDB format

made for a different purpose (crystallography)

lacks precision for coordinates

lacks the possibility to store anything else but coordinates

limited to 10.000 atoms

few if any programs respect it completely

OK... so let’s define something better...... which is a far from trivial task.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 21 / 36

Page 22: Reproducible research in molecular biophysics and structural biology

What do we want to archive and exchange?

the system we are simulating

the chemical structure of the moleculesthe number of each kind of moleculethe topology of the system (infinite, 3D periodic, 2D periodic, ...)symmetries (crystals, ...)

configurations (positions of all atoms plus PBC parameters)

velocities, masses, charges, and other per-atom properties

force field parameters

force fields (the rules)

trajectories

experimental data used in simulations

... lots of other stuff ...

Don’t try to do everything at once: we want a modular set ofconventions.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 22 / 36

Page 23: Reproducible research in molecular biophysics and structural biology

Which systems, which simulation techniques?

anything at the atomic and molecular scale: gases, liquids, solids

from atoms via small molecules to macromolecules

all-atom and coarse-grained models

from quantum models via classical mechanics to stochasticdynamics

Application domains:

Chemical/statistical physics

Solid-state physics

Chemistry

Molecular/structural biology

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 23 / 36

Page 24: Reproducible research in molecular biophysics and structural biology

Mosaic: a data model for molecular simulation

MOlecular SimulAtion Interchange Conventions

http://bitbucket.org/molsim/mosaic/

Mosaic is a data model with (currently) two representations.

Data modelexpresses molecular simulation data in terms of basicdata items and structures: numbers, strings, lists, trees,. . .

Representationdefines how a data model is stored as a sequence ofbytes in a file. Conversion between representations isexact.

This is work in progress.In order to succeed, it must become a community project.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 24 / 36

Page 25: Reproducible research in molecular biophysics and structural biology

Mosaic: current state

Data items

System topology, chemical structureConfigurationsPer-atom/site numerical properties (velocities, charges, masses,. . .)Per-atom/site labels (atom names, atom types, . . .)

Representations

HDF5 for big data sets, fast access, and easy interfacing toC/C++/Fortran codeXML for processing with standard XML toolsPlanned: JSON for interfacing with Web-based toolsThree in-memory representations for Python

Tools

Conversion between representationsImport from PDB (mmCIF format)

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 25 / 36

Page 26: Reproducible research in molecular biophysics and structural biology

HDF5 (Hierarchical Data Format, version 5)

Platform-independent binarystorage

Efficient library for data handling

Metadata (provenance,dependencies, . . . )

Well-established: used by bigorganizations (NASA, . . .) for verylarge data sets (meteorology . . .)

Ideal for exchanging andarchiving data

Suitable for very large data sets

groups

datasets(≈ arrays)

root group

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 26 / 36

Page 27: Reproducible research in molecular biophysics and structural biology

So here’s my model for the gas phase peptide...

1 <?xml version="1.0" encoding="utf-8"?>2 <mosaic version="0.1">3 <universe cell_shape="infinite" convention="MMTK" id="universe">4 <molecules>5 <molecule count="1">6 <fragment label="" species="protein">7 <fragments>8 <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Ala,Lys">9 <fragments>

10 <fragment label="ACE0" species="ace_beginning">11 <atoms>12 <atom label="CBond" name="C" type="element" />13 <atom label="CH3" name="C" type="element" />14 <atom label="HH31" name="H" type="element" />15 <atom label="HH32" name="H" type="element" />16 <atom label="HH33" name="H" type="element" />17 <atom label="O" name="O" type="element" />18 </atoms>19 <bonds>20 <bond atoms="CH3 HH31" order="" />21 <bond atoms="CH3 HH32" order="" />22 <bond atoms="CH3 HH33" order="" />23 <bond atoms="CBond CH3" order="" />24 <bond atoms="CBond O" order="" />25 </bonds>26 </ fragment>

. . . (686 lines in total)

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 27 / 36

Page 28: Reproducible research in molecular biophysics and structural biology

Biomolecular simulation . . . a few years from now

Mosaic HDF5 filePDB crystal universe + configuration

Protein in vacuum universe + configuration

Force field for protein in vacuum

Solvated protein universe + configuration

Force field for solvated protein

Trajectory for solvated protein

PDB import

Chain extraction

Force field

Solvatation

MD simulation

Small independentprograms

All the data for the simulation study in one place

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 28 / 36

Page 29: Reproducible research in molecular biophysics and structural biology

Managing, publishing, and archiving reproducible

research

Reproducible research results in large sets of interrelated software,data sets, and documentation.

Managing files, dependencies, and metadata (provenance, . . .)becomes a nightmare when moving between multiplecomputers.Publishing everything in an easily usable form is difficult due to alack of standards.Archiving is almost pointless: which software will still work 50years from now?

ActivePapers(http://dirac.cnrs-orleans.fr/plone/software/activepapers)

provides a solution to these problemsis a ready-to-run prototype (needs a Java installation)is rather useless in practice for now

K. Hinsen, Procedia Computer Science 4 (2011): 579-588Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 29 / 36

Page 30: Reproducible research in molecular biophysics and structural biology

Preserving data for a long time

Formalized descriptions

data models

ontologies

well-defined file formats

Use standards, even bad onesStandards ensure the critical mass that will motivate future scientiststo maintain the knowledge and the tools required to interpret yourdata.

Limited dependencies

software: operating systems, compilers, libraries, . . .

infrastructure: Internet, communication protocols, . . .

organizations: research labs, publishers, . . .

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 30 / 36

Page 31: Reproducible research in molecular biophysics and structural biology

Code is data

There is nothing special about software, it’s just a kind ofdata!

Everything is data:

Scientific data is data.

Code is data.

Input parameters are data.

Documentation is data.

All we need is good data models for everything, plus a way todescribe dependencies.

The tricky part is finding a good data model for code.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 31 / 36

Page 32: Reproducible research in molecular biophysics and structural biology

Representing code as data

“Data types” for code:

source code: defined by a language specificationthere are lots of languagesprogrammers are attached to them

binary/executable: defined by a processor/hardwarespecification

very complex specificationevolves rapidly with technological progress

bytecode: defined by a virtual machine specification (JVM, CLI,. . .)

clear and simple specificationnot directly visible to the programmertoday’s virtual machines were not designed for scientificcomputing→ sub-optimal performance

JVM bytecodes provide the best existing infrastructure.

There is not much scientific software for the JVM platform, which iswhy ActivePapers is rather useless today.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 32 / 36

Page 33: Reproducible research in molecular biophysics and structural biology

Dependency handling

dataset 1

dataset 2

dataset 3

program

source code

compiler

DirectedAcyclicGraphof dependencies

program

dataset 1 dataset 2

dataset 3

input

output

Dataflow graphof program execution

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 33 / 36

Page 34: Reproducible research in molecular biophysics and structural biology

ActivePapers overview (1/2)

An ActivePaper . . .

is an HDF5 file containing any number of data sets

typically contains scientific data, code, and documentation

is identified by a DOI (Digital Object Identifier)

can refer to other ActivePapers through their DOIs

stores the dependency graph between data sets in HDF5metadata

Executable code in an ActivePaper . . .

cannot access files outside of ActivePaper files

cannot modify data except inside its own ActivePaper

comes in three varieties: calclet (reads and writes data), viewlet(reads data for interactive visualization), and library (called byother code)

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 34 / 36

Page 35: Reproducible research in molecular biophysics and structural biology

ActivePapers overview (2/2)

Special cases

a raw dataset (usually with some documentation)

a software library (usually with some documentation)

External dependencies (must be maintained for 50 years . . .)

a Java Virtual Machine with the Java standard library

the HDF5 library (could be rewritten as a JVM library)

JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)

Problems solvedfuture-proof publishing and archiving of computational research

secure repeatability of computational studies

reusability of scientific data and software

integration of scientific data and software into bibliometry

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 35 / 36

Page 36: Reproducible research in molecular biophysics and structural biology

Teaching best practices to computational scientists

Software Carpentryis an organization dedicated to teaching computing skills toscientists, with support from the Alfred P. Sloan Foundation and theMozilla Foundation.

Activities:

short intensive workshops (“boot camps”)

online courses

You can help by

hosting a boot camp

being an instructor at boot-camps

working on teaching material

http://software-carpentry.org/Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics13 December 2012 36 / 36