reproducible science with python

29
Reproducible Science with Python Andreas Schreiber Department for Intelligent and Distributed Systems German Aerospace Center (DLR), Cologne/Berlin > PyCon DE 2016 > Andreas Schreiber Reproducible Science with Python > 29.10.2016 DLR.de Chart 1

Upload: andreas-schreiber

Post on 22-Jan-2017

152 views

Category:

Science


2 download

TRANSCRIPT

Page 1: Reproducible Science with Python

Reproducible Science with Python

Andreas SchreiberDepartment for Intelligent and Distributed SystemsGerman Aerospace Center (DLR), Cologne/Berlin

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 1

Page 2: Reproducible Science with Python

Simulations, experiments, data analytics, …

Science

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 2

SIMULATION FAILED

Page 3: Reproducible Science with Python

Reproducing results relies on• Open Source codes• Code reviews• Code repositories• Publications with code• Computational environment

captured (Docker etc.)• Workflows• Open Data formats• Data management • (Electronics) laboratory notebooks• Provenance

Reproducible Science

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 3

Page 4: Reproducible Science with Python

Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.

PROV W3C Working Group https://www.w3.org/TR/prov-overview

Provenance

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 4

Page 5: Reproducible Science with Python

W3C Provenance Working Group: https://www.w3.org/2011/prov

PROV• The goal of PROV is to enable the wide publication and

interchange of provenance on the Web and other information systems

• PROV enables one to represent and interchange provenance information using widely available formats such as RDF and XML

PROV

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 5

Page 6: Reproducible Science with Python

Entities• Physical, digital, conceptual, or other kinds of things• For example, documents, web sites, graphics, or data sets

Activities• Activities generate new entities or

make use of existing entities• Activities could be actions or processes

Agents• Agents takes a role in an activity and

have the responsibility for the activity• For example, persons, pieces of software, or organizations

Overview of PROVKey Concepts

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 6

Page 7: Reproducible Science with Python

PROV Data Model

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 7

Page 8: Reproducible Science with Python

Example Provenance

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 8

https://provenance.ecs.soton.ac.uk/store/documents/113794/

Page 9: Reproducible Science with Python

Storing Provenance

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 9

ProvenanceStore

Provenance recordingduring runtime

Applications / Workflow

Data (results)

Page 10: Reproducible Science with Python

Some Storage Technologies

• Relational databases and SQL

• XML and Xpath

• RDF and SPARQL

• Graph databases and Gremlin/Cypher

Services

• REST APIs

• ProvStore (University of Southampton)

Storing and Retrieving Provenance

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 10

Page 11: Reproducible Science with Python

ProvStore

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 11

https://provenance.ecs.soton.ac.uk/store/

Page 12: Reproducible Science with Python

Provenance is a directed acyclic graph (DAG)

Graphs

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 12

A

B

E

F

GD

C

Page 13: Reproducible Science with Python

Naturally, graph databases are a good technology for storing (Provenance) graphs

Many graph databases are available• Neo4J• Titan• ArangoDB• ...

Query languages• Cypher• Gremlin (TinkerPop)• GraphQL

Graph Databases

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 13

Page 14: Reproducible Science with Python

• Open-Source• Implemented in Java• Stores property graphs

(key-value-based, directed)

http://neo4j.com

Neo4j

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 14

Page 15: Reproducible Science with Python

Depends on your application (tools, languages, etc.)

Libraries for Python• prov• provneo4j• NoWorkflow• …

Tools• Git2PROV• Prov-sty for LaTeX• …

Gathering Provenance

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 15

Page 16: Reproducible Science with Python

Python Library provhttps://github.com/trungdong/prov

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 16

from prov.model import ProvDocument# Create a new provenance documentd1 = ProvDocument()# Entity: now:employment-article-v1.htmle1 = d1.entity('now:employment-article-v1.html')# Agent: nowpeople:Bobd1.agent('nowpeople:Bob')# Attributing the article to the agentd1.wasAttributedTo(e1, 'nowpeople:Bob')d1.entity('govftp:oesm11st.zip', {'prov:label': 'employment-stats-2011', 'prov:type': 'void:Dataset'})d1.wasDerivedFrom('now:employment-article-v1.html', 'govftp:oesm11st.zip')# Adding an activityd1.activity('is:writeArticle')d1.used('is:writeArticle', 'govftp:oesm11st.zip')d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')

Page 17: Reproducible Science with Python

Python Library provhttps://github.com/trungdong/prov

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 17

Page 18: Reproducible Science with Python

Example

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 18

http://localhost:8888/notebooks/WeightCompanion PROV.ipynb

Page 19: Reproducible Science with Python

provneo4j – Storing PROV Documents in Neo4jhttps://github.com/DLR-SC/provneo4j

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 19

import provneo4j.api

provneo4j_api = provneo4j.api.Api( base_url="http://localhost:7474/db/data", username="neo4j", password="python")

provneo4j_api.document.create(prov_doc, name=”MyProv”)

Page 20: Reproducible Science with Python

provneo4j – Storing PROV Documents in Neo4jhttps://github.com/DLR-SC/provneo4j

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 20

Page 21: Reproducible Science with Python

NoWorkflow – Provenance of Scriptshttps://github.com/gems-uff/noworkflow

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 21

Project

experiment.py

p12.dat

p13.dat

precipitation.py

p14.dat

out.png

$ now run -e Tracker experiment.py

Page 22: Reproducible Science with Python

Git2PROVhttp://git2prov.org

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 22

Page 23: Reproducible Science with Python

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 23

Page 24: Reproducible Science with Python

LaTeX Annotations

prov-sty for LaTeXhttps://github.com/prov-suite/prov-sty

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 24

\begin{document}\provAuthor{Andreas Schreiber}{http://orcid.org/0000-0001-5750-5649}\provOrganization{German Aerospace Center (DLR)}{http://www.dlr.de}\provTitle{A Provenance Model for Quantified Self Data}\provProject {PROV-SPEC (FS12016)} {http://www.dlr.de/sc/desktopdefault.aspx/tabid-8073/} {http://www.bmwi.de/}

Page 25: Reproducible Science with Python

Electronic Laboratory Notebook

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 25

Page 26: Reproducible Science with Python

Query „Who worked on experiment X?“

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 26

$experiment = g.key($_g, 'identifier', X)$user = $experiment/inE/inV/outE[@label = controlled_by]

Page 27: Reproducible Science with Python

Practical Use Case – Archive all Data of a Paper

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 27

https://github.com/DLR-SC/DataFinder

Page 28: Reproducible Science with Python

New PROV Library for Python in development• https://github.com/DLR-SC/prov-db-connector• Connectors for Neo4j implemented, ArangoDB planned• APIs in REST, ZeroMQ, MQTT

Trusted Provenance• Storing Provenance in Blockchains

Provenance for people• New approaches for visualization• For example, PROV Comics

Current Research and Development

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 28

Page 29: Reproducible Science with Python

> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 29

Thank You!

Questions?

[email protected]/sc | @onyame