KeepIt course 3: Provenance (and OPM), based on slides by Luc Moreau
DESCRIPTION
This presentation offers a brief introduction to provenance, a record of the process that led to the current state of an object, based on a new descriptive model designed to allow provenance information to be exchanged between systems, the Open Provenance Model (OPM). It was given as part of module 3 of a 5-module course on digital preservation tools for repository managers, presented by the JISC KeepIt project. For more on this and other presentations in this course, look for the tag 'KeepIt course' in the project blog http://blogs.ecs.soton.ac.uk/keepit/
TRANSCRIPT
Brief Introduction to Provenance
“As data becomes plentiful, verifiable truth becomes scarce”
http://go-to-hellman.blogspot.com/2010/02/named-graphs-argleton-and-truth-economy.html
For JISC KeepIt course on Digital Preservation Tools for Repository Managers
Module 3, Primer on preservation workflow, formats and characterisation
Westminster-Kingsway College, London, 2 March 2010
Provenance: example
The following excerpt and slides are taken with permission from Moreau, L., The Open Provenance Model: Towards inter-operability of Provenance Systems http://users.ecs.soton.ac.uk/lavm/talks/iam09.pdf
Example
The provenance of a bottle of wine includes:
• Grapes from which it is made
• Where those grapes grew
• Process in the wine’s preparation
• How the wine was stored
• Between which parties the wine was transported, e.g. producer to distributer to retailer
• Where it was auctioned
Provenance Definition
• Oxford English Dictionary:
  – the fact of coming from some particular source or quarter; origin, derivation
  – the history or pedigree of a work of art, manuscript, rare book, etc.;
  – concretely, a record of the passage of an item through its various owners.
• The provenance of a piece of data is the process that led to that piece of data
The Science Lifecycle
[Figure: science lifecycle diagram. Labels: scientists; experimentation; Data, Metadata, Provenance, Scripts, Workflows, Services, Ontologies, Blogs, ...; Local Web Repositories; Digital Libraries; Virtual Learning Environment; Technical Reports; Reprints; Preprints & Metadata; Peer-Reviewed Journal & Conference Papers; Certified Experimental Results & Analyses; Graduate Students; Undergraduate Students; Next Generation Researchers]
Adapted from David De Roure’s slides
Finding the Provenance of research outputs across all the systems data transited through
Open Provenance Model (OPM)
• Allows us to express all the causes of an item
• Allows for process-oriented and dataflow-oriented views
• Based on a notion of annotated causality graph
Moreau, L., et al. v1.00 (Dec 2007), OPM v1.01 (Jul 2008), OPM v1.1 (Dec 2009)
OPM Requirements
• To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
• To allow developers to build and share tools that operate on such a provenance model.
• To define the model in a precise, technology-agnostic manner.
• To define bindings to XML/RDF separately
• To support a digital representation of provenance for any “thing”, whether produced by computer systems or not
OPM Serialisation
• OPM is an abstract data model to represent past execution and what causes data and processes to occur
• OPM can be serialised in different formats, referred to as “technology bindings” or serializations
• OPM XML schema (http://openprovenance.org/model/v1.01.a)
• OPM RDF schema
• OPM OWL ontology
• Effort underway to ensure full equivalence of representations
Nodes
• Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
• Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
• Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.
[Figure: node symbols A (artifact), P (process), Ag (agent)]
Edges
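• used(R): process P used artifact A
• wasGeneratedBy(R): artifact A was generated by process P
• wasControlledBy(R): process P was controlled by agent Ag
• wasTriggeredBy: process P2 was triggered by process P1
• wasDerivedFrom: artifact A2 was derived from artifact A1
(R denotes the role attached to the edge)
Edge labels are in the past tense to express that they describe past executions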
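The three node kinds and five edge labels can be sketched as a small data structure. This is only an illustrative sketch, not an official OPM binding: the class and field names (`Node`, `Edge`, `effect`, `cause`) are hypothetical, and the sketch assumes that each edge points from an effect to its cause.

```python
from dataclasses import dataclass
from typing import Optional

# Node kinds named in the slides.
NODE_KINDS = {"artifact", "process", "agent"}

# For each edge label: (kind of the effect node, kind of the cause node).
# Assumption: an edge points from the effect to its cause.
EDGE_SIGNATURES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass(frozen=True)
class Node:
    id: str
    kind: str

    def __post_init__(self):
        if self.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {self.kind}")


@dataclass(frozen=True)
class Edge:
    label: str
    effect: Node
    cause: Node
    role: Optional[str] = None  # the "(R)" annotation, where the edge carries one

    def __post_init__(self):
        expected = EDGE_SIGNATURES.get(self.label)
        if expected is None:
            raise ValueError(f"unknown edge label: {self.label}")
        if (self.effect.kind, self.cause.kind) != expected:
            raise ValueError(f"{self.label} cannot link {self.effect.kind} to {self.cause.kind}")


# Usage: a process P that used artifact A and was controlled by agent Ag.
a = Node("A", "artifact")
p = Node("P", "process")
ag = Node("Ag", "agent")
edges = [Edge("used", p, a, role="R"), Edge("wasControlledBy", p, ag, role="R")]
```

Checking node kinds when an edge is created means an ill-formed statement, such as an artifact "using" a process, is rejected as soon as it is recorded.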
Illustration
• Process “used” artifacts and “generated” artifact
• Edge “roles” indicate the function of the artifact with respect to the process (akin to function parameters)
• Edges and nodes can be typed
Causation chain:
• P was caused by A1 and A2
• A3 and A4 were caused by P
• Does it mean that A3 and A4 were caused by A1 and A2?
[Figure: process P (type=division) with edges used(dividend) and used(divisor) to artifacts A1 and A2, and edges wasGeneratedBy(quotient) and wasGeneratedBy(rest) from artifacts A3 and A4]
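The division illustration can be written down as a short list of edges. This is a sketch under stated assumptions: the transcript does not preserve which artifact carries which role, so the pairing below (A1 as dividend, A2 as divisor, A3 as quotient, A4 as rest) is assumed, and `candidate_sources` is a hypothetical helper. It returns only candidates. Whether each output really depends on each input is the question the slide raises, and the used and wasGeneratedBy edges alone do not settle it.

```python
# Edges as (effect, label, role, cause) tuples.
# Assumption: the role-to-artifact pairing is not recoverable from the slide.
edges = [
    ("P", "used", "dividend", "A1"),
    ("P", "used", "divisor", "A2"),
    ("A3", "wasGeneratedBy", "quotient", "P"),
    ("A4", "wasGeneratedBy", "rest", "P"),
]


def candidate_sources(artifact, edges):
    """Artifacts used by any process that generated `artifact`."""
    processes = {
        cause
        for effect, label, _role, cause in edges
        if effect == artifact and label == "wasGeneratedBy"
    }
    return sorted(
        cause
        for effect, label, _role, cause in edges
        if effect in processes and label == "used"
    )


print(candidate_sources("A3", edges))  # ['A1', 'A2']
```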
Time Constraints
[Figure: an artifact A, generated at T1 (wasGeneratedBy(R)), is used at T3 (used(R)) by process P, which starts at T2, ends at T5 and is controlled by agent Ag (wasControlledBy(R)); P generates an artifact A at T4 (wasGeneratedBy(R)), which is used at T6 (used(R))]
• T1<T3 (artifact must exist before being used)
• T2<T3 (process must have started before using artifacts)
• T3<T5 (process uses artifacts before it ends)
• T2<T4 (process must have started before generating artifacts)
• T4<T5 (process generates artifacts before it ends)
• T4<T6 (artifact must exist before being used)
• T2<T5 (process must have started before ending)
• No constraint between T3 and T4
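The constraints listed on this slide can be checked mechanically. The sketch below is illustrative only: the function name `violations` and the use of plain numbers for timestamps are assumptions, and no pair (T3, T4) appears because the slide places no constraint between them.

```python
# Each entry: (earlier, later, reason), as listed on the slide.
CONSTRAINTS = [
    ("T1", "T3", "artifact must exist before being used"),
    ("T2", "T3", "process must have started before using artifacts"),
    ("T3", "T5", "process uses artifacts before it ends"),
    ("T2", "T4", "process must have started before generating artifacts"),
    ("T4", "T5", "process generates artifacts before it ends"),
    ("T4", "T6", "artifact must exist before being used"),
    ("T2", "T5", "process must have started before ending"),
]


def violations(times):
    """Return a description of every constraint the given timestamps break."""
    return [
        f"{earlier}<{later} ({reason})"
        for earlier, later, reason in CONSTRAINTS
        if not times[earlier] < times[later]
    ]


# T4 precedes T3 here, which is allowed: the two are unconstrained.
ok = {"T1": 1, "T2": 2, "T3": 4, "T4": 3, "T5": 5, "T6": 6}
print(violations(ok))  # []

# An artifact recorded as generated after it was used breaks T1<T3.
bad = dict(ok, T1=7)
print(violations(bad))  # ['T1<T3 (artifact must exist before being used)']
```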
Dublin Core Profile (draft)
• To many people, provenance is primarily about attribution, citation, bibliographic information
• DC provides terms to relate resources to such information
• DC profile aims to map the use of Dublin Core terms to OPM concepts and graph patterns
with Simon Miles and Joe Futrelle
DC to OPM example: dc:publisher
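[Figure: a publish process P with used and wasGeneratedBy edges linking artifacts A1 and A2 (labels state=unpublished and state=published, related by wasSameResourceAs), and an agent Ag (person, name=Luc) linked by wasActionOf]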
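Read as a graph pattern, the dc:publisher example expands one Dublin Core statement into a handful of nodes and edges. The sketch below is an assumption-laden illustration, not the draft profile itself. The edge labels come from the figure, but the edge directions, the assignment of the unpublished state to A1 and the published state to A2, and every Python name (`dc_publisher_pattern`, the `resource` key) are hypothetical.

```python
def dc_publisher_pattern(resource, publisher_name):
    """Expand one dc:publisher statement into OPM-style nodes and edges.

    Assumptions: edge directions (effect first, cause last) and the
    A1 = unpublished / A2 = published reading are not confirmed by the slide.
    """
    nodes = {
        "A1": {"kind": "artifact", "resource": resource, "state": "unpublished"},
        "A2": {"kind": "artifact", "resource": resource, "state": "published"},
        "P": {"kind": "process", "type": "publish"},
        "Ag": {"kind": "agent", "type": "person", "name": publisher_name},
    }
    edges = [
        ("P", "used", "A1"),
        ("A2", "wasGeneratedBy", "P"),
        ("A2", "wasSameResourceAs", "A1"),
        ("P", "wasActionOf", "Ag"),
    ]
    return nodes, edges


# Usage: the example on the slide, where the publisher is a person named Luc.
nodes, edges = dc_publisher_pattern("some-resource", "Luc")
for effect, label, cause in edges:
    print(effect, label, cause)
```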
What have we learned about provenance?
• Provenance: describes and records the results of processes on objects over time
• OPM represents provenance as XML
• OPM can be serialised in different formats
  – RDF, Semantic Web
• OPM is a work in progress
By working with an open standard model that can pass information as XML and in standard serialisation formats (e.g. RDF), it should be possible to build provenance services into repository environments.