KeepIt Course 3: Provenance (and OPM), based on slides by Luc Moreau

Brief Introduction to Provenance. “As data becomes plentiful, verifiable truth becomes scarce” (http://go-to-hellman.blogspot.com/2010/02/named-graphs-argleton-and-truth-economy.html). For the JISC KeepIt course on Digital Preservation Tools for Repository Managers, Module 3, Primer on preservation workflow, formats and characterisation. Westminster-Kingsway College, London, 2 March 2010.

Upload: jisc-keepit-project

Post on 28-Nov-2014

DESCRIPTION

This presentation offers a brief introduction to provenance, a record of the process that led to the current state of an object, based on a new descriptive model designed to allow provenance information to be exchanged between systems, the Open Provenance Model (OPM). It was given as part of module 3 of a 5-module course on digital preservation tools for repository managers, presented by the JISC KeepIt project. For more on this and other presentations in this course look for the tag 'KeepIt course' in the project blog http://blogs.ecs.soton.ac.uk/keepit/

TRANSCRIPT

Page 1: Keepit Course 3: Provenance (and OPM), based on slides by Luc Moreau

Brief Introduction to Provenance

“As data becomes plentiful, verifiable truth becomes scarce”

http://go-to-hellman.blogspot.com/2010/02/named-graphs-argleton-and-truth-economy.html

For JISC KeepIt course on Digital Preservation Tools for Repository Managers
Module 3, Primer on preservation workflow, formats and characterisation

Westminster-Kingsway College, London, 2 March 2010

Page 2

Provenance: example

The following excerpt and slides are taken with permission from Moreau, L., The Open Provenance Model: Towards inter-operability of Provenance Systems http://users.ecs.soton.ac.uk/lavm/talks/iam09.pdf

Example

The provenance of a bottle of wine includes:
• Grapes from which it is made
• Where those grapes grew
• Process in the wine’s preparation
• How the wine was stored
• Between which parties the wine was transported, e.g. producer to distributor to retailer
• Where it was auctioned

Page 3

Provenance Definition

• Oxford English Dictionary:
– the fact of coming from some particular source or quarter; origin, derivation
– the history or pedigree of a work of art, manuscript, rare book, etc.;
– concretely, a record of the passage of an item through its various owners.

• The provenance of a piece of data is the process that led to that piece of data

Page 4

The Science Lifecycle

[Figure: the science lifecycle. Labels: scientists; experimentation; Local Web Repositories; Graduate Students; Undergraduate Students; Virtual Learning Environment; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata; Certified Experimental Results & Analyses; Digital Libraries; Next Generation Researchers; Data, Metadata, Provenance, Scripts, Workflows, Services, Ontologies, Blogs, ...]

Adapted from David De Roure’s slides

Page 5

[Figure: the science lifecycle diagram from the previous page, repeated with an overlaid caption.]

Finding the provenance of research outputs across all the systems data transited through

Page 6

Open Provenance Model (OPM)

• Allows us to express all the causes of an item
• Allows for process-oriented and dataflow-oriented views
• Based on a notion of annotated causality graph

Moreau, L., et al.: OPM v1.00 (Dec 2007), OPM v1.01 (Jul 2008), OPM v1.1 (Dec 2009)

Page 7

OPM Requirements

• To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
• To allow developers to build and share tools that operate on such a provenance model.
• To define the model in a precise, technology-agnostic manner.
• To define bindings to XML/RDF separately
• To support a digital representation of provenance for any “thing”, whether produced by computer systems or not

Page 8

OPM Serialisation

• OPM is an abstract data model to represent past execution and what causes data and processes to occur
• OPM can be serialised in different formats, referred to as “technology bindings” or serialisations
• OPM XML schema (http://openprovenance.org/model/v1.01.a)
• OPM RDF schema
• OPM OWL ontology
• Effort underway to ensure full equivalence of representations
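As a rough illustration of the split between the abstract model and its bindings, the sketch below holds two provenance assertions in memory and writes them out in two ways. The element names, the `ex:` prefix and both output layouts are invented for the example; they are not the OPM XML schema or the OPM RDF vocabulary.

```python
import xml.etree.ElementTree as ET

# Abstract assertions as (effect, relation, cause). Identifiers are hypothetical.
ASSERTIONS = [("a2", "wasGeneratedBy", "p1"), ("p1", "used", "a1")]


def to_xml(assertions):
    """Render assertions as ad-hoc XML (NOT the OPM XML schema)."""
    root = ET.Element("graph")
    for effect, relation, cause in assertions:
        ET.SubElement(root, "edge", effect=effect, relation=relation, cause=cause)
    return ET.tostring(root, encoding="unicode")


def to_triples(assertions, prefix="ex:"):
    """Render the same assertions as subject-predicate-object lines."""
    return [f"{prefix}{e} {prefix}{r} {prefix}{c} ." for e, r, c in assertions]


def from_xml(text):
    """Read the ad-hoc XML back into the abstract form."""
    return [
        (el.get("effect"), el.get("relation"), el.get("cause"))
        for el in ET.fromstring(text).iter("edge")
    ]
```

Reading the XML back with `from_xml(to_xml(ASSERTIONS))` recovers the original list, which is the kind of equivalence between representations the last bullet refers to.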

Page 9

Nodes

• Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
• Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
• Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.

[Figure: node symbols A (artifact), P (process), Ag (agent)]

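Page 10

Edges

[Figure: the five edge types, each drawn between two nodes]
• used(R): between a process P and an artifact A
• wasGeneratedBy(R): between an artifact A and a process P
• wasControlledBy(R): between a process P and an agent Ag
• wasTriggeredBy: between processes P1 and P2
• wasDerivedFrom: between artifacts A1 and A2

Edge labels are in the past tense to express that these are used to describe past executions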
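The three node kinds and five edge kinds can be captured in a small data structure. The following Python sketch is one possible encoding, not part of OPM itself: the class names `Node`, `Edge` and `Graph` and the `ALLOWED` table are assumptions made for illustration. Each edge is stored from an effect to its cause, in keeping with the past-tense labels.

```python
from dataclasses import dataclass, field
from typing import Optional

# Edge label -> (kind of the effect node, kind of the cause node)
ALLOWED = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass(frozen=True)
class Node:
    ident: str
    kind: str  # "artifact", "process" or "agent"


@dataclass(frozen=True)
class Edge:
    effect: Node
    label: str
    cause: Node
    role: Optional[str] = None  # the (R) on used / wasGeneratedBy / wasControlledBy


@dataclass
class Graph:
    edges: list = field(default_factory=list)

    def add(self, effect, label, cause, role=None):
        """Add an edge after checking that the node kinds fit the label."""
        if ALLOWED.get(label) != (effect.kind, cause.kind):
            raise ValueError(f"{label} cannot link {effect.kind} to {cause.kind}")
        edge = Edge(effect, label, cause, role)
        self.edges.append(edge)
        return edge
```

With this encoding, `Graph().add(Node("P", "process"), "used", Node("A", "artifact"), role="R")` is accepted, while swapping the two nodes raises `ValueError`.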

Page 11

Illustration

• Process “used” artifacts and “generated” artifacts
• Edge “roles” indicate the function of the artifact with respect to the process (akin to function parameters)
• Edges and nodes can be typed

Causation chain:
• P was caused by A1 and A2
• A3 and A4 were caused by P
• Does it mean that A3 and A4 were caused by A1 and A2?

[Figure: process P (type=division) linked to artifacts A1 and A2 by edges used(dividend) and used(divisor), and to artifacts A3 and A4 by edges wasGeneratedBy(quotient) and wasGeneratedBy(rest)]
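The question at the end of the causation chain can be explored by following edges transitively. The sketch below encodes the division example as plain tuples and collects every node reachable from a starting node along effect-to-cause edges. Which of A1 and A2 is the dividend is not recoverable from the slide, so the role assignment here is an assumption, as is the helper name `causes`.

```python
# (effect, label, role, cause); each edge points from effect to cause.
EDGES = [
    ("P", "used", "dividend", "A1"),   # assumed: A1 is the dividend
    ("P", "used", "divisor", "A2"),    # assumed: A2 is the divisor
    ("A3", "wasGeneratedBy", "quotient", "P"),
    ("A4", "wasGeneratedBy", "rest", "P"),
]


def causes(node, edges):
    """Return every node reachable from `node` by following edges to causes."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for effect, _label, _role, cause in edges:
            if effect == current and cause not in found:
                found.add(cause)
                stack.append(cause)
    return found
```

Here `causes("A3", EDGES)` contains P, A1 and A2. Reachability only shows that a causal path exists in the graph; whether that is enough to say A3 was derived from A1 and A2 is the question the slide leaves open.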

Page 12

Time Constraints

[Figure: timeline for a process P, controlled by agent Ag via wasControlledBy(R), with start: T2 and end: T5. P used(R) an artifact at T3 that wasGeneratedBy(R) at T1. Another artifact wasGeneratedBy(R) P at T4 and is used(R) at T6.]

• T1<T3 (artifact must exist before being used)
• T2<T3 (process must have started before using artifacts)
• T3<T5 (process uses artifacts before it ends)
• T2<T4 (process must have started before generating artifacts)
• T4<T5 (process generates artifacts before it ends)
• T4<T6 (artifact must exist before being used)
• T2<T5 (process must have started before ending)
• no constraint between T3 and T4
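These constraints are simple enough to check mechanically. The sketch below takes the six timestamps as numbers and returns the constraints that are violated; the function name `violations` and the use of plain numbers for time are assumptions for illustration.

```python
# (earlier, later, reason) for each constraint listed on the slide
CONSTRAINTS = [
    ("T1", "T3", "artifact must exist before being used"),
    ("T2", "T3", "process must have started before using artifacts"),
    ("T3", "T5", "process uses artifacts before it ends"),
    ("T2", "T4", "process must have started before generating artifacts"),
    ("T4", "T5", "process generates artifacts before it ends"),
    ("T4", "T6", "artifact must exist before being used"),
    ("T2", "T5", "process must have started before ending"),
]


def violations(times):
    """Return the constraints that `times` (a dict with keys T1..T6) breaks."""
    return [
        (earlier, later, reason)
        for earlier, later, reason in CONSTRAINTS
        if not times[earlier] < times[later]
    ]
```

Because T3 and T4 are unconstrained relative to each other, a record in which the process generates its output (T4) before reading one of its inputs (T3) still passes, as long as both fall between T2 and T5.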

Page 13

Dublin Core Profile (draft)

• To many people, provenance is primarily about attribution, citation, bibliographic information
• DC provides terms to relate resources to such information
• The DC profile aims to map Dublin Core terms to OPM concepts and graph patterns

with Simon Miles and Joe Futrelle

Page 14

DC to OPM example: dc:publisher

[Figure: OPM graph for dc:publisher. Nodes: artifacts A1 and A2, process P (publish), agent Ag (person, name=Luc). Edge labels: used, wasGeneratedBy, wasSameResourceAs, wasActionOf. Annotations: state=unpublished, state=published.]
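Read as a rule, the pattern turns one Dublin Core statement into several OPM-style assertions. The sketch below is a loose reading of the figure, not the draft profile itself: the function name `expand_publisher`, the identifier scheme, and the directions chosen for `wasActionOf` and `wasSameResourceAs` are all assumptions.

```python
def expand_publisher(resource, publisher):
    """Expand (resource, dc:publisher, publisher) into (effect, label, cause) tuples."""
    before = f"{resource}#unpublished"   # artifact annotated state=unpublished
    after = f"{resource}#published"      # artifact annotated state=published
    process = f"publish({resource})"     # process of type publish
    return [
        (process, "used", before),
        (after, "wasGeneratedBy", process),
        (after, "wasSameResourceAs", before),   # assumed direction
        (process, "wasActionOf", publisher),    # assumed direction
    ]
```

For example, `expand_publisher("report", "Luc")` produces four assertions, one of which links the publish process to the agent named Luc.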

Page 15

What have we learned about provenance?

• Provenance: describes and records the results of processes on objects over time
• OPM represents provenance as XML
• OPM can be serialised in different formats
– RDF, Semantic Web
• OPM is a work in progress

By working with an open standard model that can pass information as XML and in standard serialisation formats (e.g. RDF), it should be possible to build provenance services into repository environments