Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories

Catherine Wise, Nicholas J Car, Ryan Fraser and Geoff Squire

Data61 and LAND & WATER

Standard Provenance Reporting and Scientific Software Management in Virtual Labs

Outline

• What are VLs?
• What is VHIRL?
• What is provenance?
• How does VHIRL manage provenance (or not)?
• How do we represent VHIRL’s actions using standardised provenance?
• What work, other than representation, is needed for provenance?
• What benefits do we get from this work?

What are VLs?

From https://nectar.org.au/virtual-laboratories-1, they are:

data repositories and computational tools and streamlining research workflows

What is VHIRL?

• Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific workflow portal
• Gives researchers access to cloud computing for natural hazards research
• data from a variety of sources
• uses cloud computing resources
• currently has tools for earthquakes, tsunamis & tropical cyclones in the Asia-Pacific region

[Figure: Components of the Virtual Lab: Virtual Hazards Impact & Risk Laboratory (VHIRL). Column headings: Data Services, Processing Services, Compute Services, Enablers, Virtual Laboratories / Apps, Data Analytics. Component labels: Magnetics, Gravity, DEM, eScript, ANUGA, NCI Petascale, NCI Cloud, NeCTAR Cloud, Amazon Cloud, Desktop, Service Orchestration, Provenance Metadata, Auth., Coastal Inundation, Tsunami Inundation, Scenario, Cyclone Wind Path Calculation, Landsat, Bathymetry, Cyclone Wind Model, Surface Wave Propagation (earthquake), TCRM.]

Connectivity via Provenance | Melanie Ayre | eResearch Australasia 2015, Brisbane

What is provenance?

From http://en.wikipedia.org/wiki/Provenance#Computer_Science:

“Computer science uses the term provenance to mean the lineage of data or processes, as per data provenance. However there is a field of informatics research within computer science called provenance that studies how provenance of data and processes should be characterised, stored and used. Semantic web standards bodies, such as the World Wide Web Consortium, ratified a standard for provenance representation in 2014, known as PROV.”

How do we represent VLs using standardised provenance?

VHIRL’s own data management

• Natively tracks ‘everything’ used for scenario (re)runs
• Is not a: Data store, Software repo, Records mgt system
• Externalises as much information mgt as possible
• Code managed by the SSSC

Scientific Solutions Software Centre (SSSC)

• SSSC is a web-based system to manage code & dependencies
• Contains Problems & Solutions that define a workflow
• A Solution consists of a Toolbox
• Toolboxes are code wrapped in a Python script + description of the required inputs

Class diagram for the SSSC

Scientific Solutions Software Centre (SSSC)

• Beautiful, RESTful API
  this example: http://vhirl-dev.csiro.au/scm/toolbox/2
• Solution → prov:Plan
• No RDF metadata, yet!



Mapping VHIRL to PROV 1

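[Figure, built up over three slides:
1. Input Data → Process → Output Data
2. Code, Config and Input Data → Process → Output Data (the “Ontology Design Pattern”)
3. The same pattern with two agents added (“Who / which system” and “Who”), labelled with the PROV classes Entity, Activity and Agent and the PROV relations used, wasGeneratedBy, wasAssociatedWith and wasAttributedTo.]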
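The pattern above can be sketched in code. This is a minimal illustration, not VHIRL’s implementation: it writes one run as PROV-O triples in N-Triples syntax using only the Python standard library. The example.org URIs, the helper names triple and describe_run, and the choice of which agent each relation points to are assumptions made for the example; only the class and property names come from the W3C PROV-O vocabulary.

```python
# Minimal sketch: Code/Config/Input Data -> Process -> Output Data
# expressed as PROV-O triples. All example.org URIs are hypothetical.

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # assumed namespace, for illustration only


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_run(process, inputs, outputs, system, user):
    """Return PROV-O triples for one process run.

    inputs  -- URIs of the code, config and input data the run used
    outputs -- URIs of the data the run generated
    system  -- URI of the agent (which system) that ran the process
    user    -- URI of the agent (who) the outputs are attributed to
    """
    triples = [
        triple(process, RDF_TYPE, PROV + "Activity"),
        triple(system, RDF_TYPE, PROV + "Agent"),
        triple(user, RDF_TYPE, PROV + "Agent"),
        triple(process, PROV + "wasAssociatedWith", system),
    ]
    for uri in inputs:
        triples.append(triple(uri, RDF_TYPE, PROV + "Entity"))
        triples.append(triple(process, PROV + "used", uri))
    for uri in outputs:
        triples.append(triple(uri, RDF_TYPE, PROV + "Entity"))
        triples.append(triple(uri, PROV + "wasGeneratedBy", process))
        triples.append(triple(uri, PROV + "wasAttributedTo", user))
    return triples


run_triples = describe_run(
    process=EX + "run/456",
    inputs=[EX + "code/1", EX + "config/1", EX + "dataset/123"],
    outputs=[EX + "dataset/789"],
    system=EX + "system/vl",
    user=EX + "person/jane",
)
print("\n".join(run_triples))
```

Plain strings keep the sketch dependency-free; a real system would more likely use an RDF library to build and serialise the graph.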

Mapping VHIRL to PROMS

[Figure: the PROV classes (Entity, Activity, Agent) alongside the PROMS classes Report and Reporting System (R.S.). Report N points to Activities via hadStartingActivity / hadEndingActivity and to Reporting System X via reportingSystem.]

VHIRL provenance into PROMS Server

[Figure: Reporting System X and Reporting System Y each produce Reports (Report N, Report M), which are reported and stored in an Organisational Provenance Store.]

Modelling VHIRL’s data types

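[Figure: a VL Run with its inputs (managed data, web service data, user supplied data, managed code, user supplied code) and its output data. An actedOnBehalfOf relation connects the user and The VL. Report N is connected to The VL via reportingSystem.]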

PROMS Reporting Toolkits

VHIRL’s native PROV output

RDF file

What work, other than representation, is needed for provenance?

Provenance effort (step) pyramid

Data Management

Establishing Reporting

Continued Reporting

Data Management

• Entities (managed data, web service data, user supplied data, managed code, user supplied code, output data): all Entities need to be ID’d (via URI) and persisted
• VL Run: each VL run is reported as an Activity within a Report
• The VL: each VL instance has/needs an ID and is modelled as a Reporting System
• user: each VL user is known by their login (account) details. Modelled as a Reporter
• Report N: each VL Report is ID’d and persisted in the VL Provenance Store

Data Management

Data types: managed data, web service data, user supplied data, managed code, user supplied code, output data

Status labels on the diagram:
• VL ID’d and persisted
• cited using PROMS-O format
• soon to be VL ID’d and persisted, with minimal metadata recorded too
• SSSC ID’d and persisted
• perhaps SSSC ID’d and persisted, perhaps VL managed
• soon to be VL ID’d and persisted, if required, perhaps with time limits

Virtual Labs Service Citation Example

[{ref}] {service title}
{service endpoint URI}
{query}
{time queried}
{cached copy ID}

[1] “Subset of elevation”
http://pid.csiro.au/service/anuga-thredds
“busselton.nc?var=elevation&spatial=bb&north=-33.06495205829679&south=-33.551573283840156&west=114.84967874597227&east=115.70661233971667&temporal=all&time_start=&time_end=&horizStride”
“2014-12-15T13:15:11”
http://pid.csiro.au/dataset/abcd1234
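A citation in this format could be assembled as below. This is a sketch only: the ServiceCitation class, its field names and its render method are hypothetical, and the query string is shortened for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCitation:
    """One web service query, holding the fields of the citation template."""
    ref: int
    title: str
    endpoint_uri: str
    query: str
    time_queried: str    # ISO 8601 timestamp, kept as a string
    cached_copy_id: str  # URI of the persisted copy of the response

    def render(self):
        """Render the citation, one template field per line."""
        return "\n".join([
            f'[{self.ref}] "{self.title}"',
            self.endpoint_uri,
            f'"{self.query}"',
            f'"{self.time_queried}"',
            self.cached_copy_id,
        ])


citation = ServiceCitation(
    ref=1,
    title="Subset of elevation",
    endpoint_uri="http://pid.csiro.au/service/anuga-thredds",
    query="busselton.nc?var=elevation&spatial=bb",  # shortened for the example
    time_queried="2014-12-15T13:15:11",
    cached_copy_id="http://pid.csiro.au/dataset/abcd1234",
)
print(citation.render())
```

Recording the query and the query time together with a cached copy ID is what lets a later reader retrieve the exact response the run used, even if the live service has changed.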

Establishing Reporting

[Figure: the VL sends a Report, via a Provenance Reporting Toolkit (C#, Java, Python), to the Organisational Provenance Store, which supports querying & redelivery.]

Establishing Reporting - Reporting Toolkits

[Figure: managed data “Grid X” and web service data “Service Y” are inputs to VL Run “Run 456”, shown with Agent N. Report N is the “Report for Run 456”.]

e1 = Entity(
    title='Grid X',
    description='netCDF grid of property X',
    uri='http://eg-vl.org.au/dataset/123',
    downloadURL='http://eg-vl.org.au/dataset/123?_view=dl',
    wasAttributedTo='http://data.ga.gov.au/id/person/john.doe'
)


e2 = ServiceEntity(
    title='Subset of elevation',
    description='5km solar radiation interpolated raster service',
    serviceBaseUri='http://siss2.anu.edu.au/anuga/busselton.nc',
    query='var=elevation&spatial=bb&north=-33.06495205&south=-33.551573283&west=114.84967874&east=115.70661233&temporal=all&time_start=&time_end=&horizStride',
    queriedAtTime='2014-12-15T13:15:11',
    cachedCopy='http://bom.gov.au/dataset/678'
)


a0 = Activity(
    title='Run 456',
    description='Upper bound run, full Grid X use',
    wasAssociatedWith={VL added automatically},
    startedAtTime={VL added automatically},
    endedAtTime={VL added automatically},
    usedEntities=[e1, e2],
    generatedEntities={VL added automatically}
)


r0 = Report(
    title='Report for Run 456',
    description='Upper bound run, full Grid X use',
    startingActivity={VL added automatically},
    endingActivity={VL added automatically}
)

rs0 = ReportSender('http://provstore.vl.org.au/report/')
rs0.send(r0)
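The reporting steps above (describe the inputs, describe the run, wrap it in a Report, send it) can be imitated without the real toolkit. The sketch below uses hypothetical stand-in classes, not the PROMS Reporting Toolkit API, to show how a VL might fill in the fields marked {VL added automatically}. The names run_and_report and fake_job and the example.org URIs are invented for illustration, and the send step is replaced by printing a JSON payload.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Entity:
    title: str
    uri: str


@dataclass
class Activity:
    title: str
    used_entities: list
    generated_entities: list = field(default_factory=list)
    was_associated_with: str = ""
    started_at_time: str = ""
    ended_at_time: str = ""


@dataclass
class Report:
    title: str
    reporting_system: str
    starting_activity: Activity
    ending_activity: Activity


def now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def run_and_report(vl_uri, user_uri, activity, job):
    """Run `job` and fill in the fields the VL would add automatically."""
    activity.was_associated_with = user_uri
    activity.started_at_time = now()
    activity.generated_entities = job(activity.used_entities)
    activity.ended_at_time = now()
    # A single-activity run starts and ends with the same Activity.
    return Report(
        title=f"Report for {activity.title}",
        reporting_system=vl_uri,
        starting_activity=activity,
        ending_activity=activity,
    )


def fake_job(inputs):
    """Stand-in for a real model run: returns one hypothetical output Entity."""
    return [Entity("Output grid", "http://example.org/dataset/out-1")]


e1 = Entity("Grid X", "http://eg-vl.org.au/dataset/123")
a0 = Activity(title="Run 456", used_entities=[e1])
r0 = run_and_report("http://example.org/vl", "http://example.org/user/jane", a0, fake_job)
payload = json.dumps(asdict(r0), indent=2)  # what a sender might POST to a store
print(payload)
```

The user supplies only the title and inputs; the agent, timestamps and outputs are captured by the wrapper around the run, so they are recorded by the system instead of being typed in by hand.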

What do we get from this work?

Graph power!

[Figure: Report N from Reporting System X, shown as a graph.]

URI power!

[Figure: Report N from Reporting System X, linked to a corporate staff DB, a temp repo, a public web service, a DAP-style repo and a PROMS instance.]

Distributed graphs!

[Figure: GA PROMS instance, VL PROMS instance and Uni Prov Store. Distributed Querying via endpoint cache.]