Post on 11-Apr-2018


Reproducible science via semantics and provenance for ecological data

@metamattj [email protected] 0000-0003-0077-4738

Matthew B. Jones, Christopher Jones, Lauren Walker, Peter Slaughter, Benjamin Leinfelder
National Center for Ecological Analysis and Synthesis (NCEAS)

Science  

Smith, Melinda D., Alan K. Knapp, and Scott L. Collins. "A framework for assessing ecosystem dynamics in response to chronic resource alterations induced by global change." Ecology 90.12 (2009): 3279-3289. doi:10.1890/08-1815.1

Reproducible Science

Capturing provenance is crucial for transparency, interpretation, debugging, …
=> repeatable experiments
=> reproducible science

Slide credit: B. Ludaescher

Kinds of Provenance

•  Prospective provenance
   •  Method/workflow description (“workflow-land”)
•  Retrospective provenance
   •  Runtime tracking (“trace-land”)
   •  “This created from that”

Common Uses of Provenance

•  Audit trail: data trace and possible errors
•  Attribution: credit and responsibility for data and scientific results
•  Data quality: assess input data, computation
•  Discovery: find versions, derived products
•  Replication: computations are repeatable
•  Re-use: adapt and adopt for new uses

•  Goal: Facilitate reproducible science

•  Track data derivation history
•  Track data inputs and outputs of analyses
•  Track analysis and model executions
•  Preserve and document software
•  Link all of these to publications

Provenance: time travel in DataONE

Using a common model

W3C has published the ‘PROV’ family of recommendations

[Figure: core PROV model. Classes: Entity, Activity, Agent. Relations: used, wasGeneratedBy, wasAssociatedWith, wasAttributedTo.]

See w3.org/TR/prov-o/


Example: Scientific workflow

[Figure: PROV graph for the example. Nodes: map image, R script Execution, Scientist, CSV data. Relations: wasGeneratedBy, used, wasAssociatedWith, wasAttributedTo, wasDerivedFrom.]

< “map image” wasDerivedFrom “CSV data” >
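The derivation statement on this slide can be illustrated with a small sketch. It is written in Python with plain tuples and is not the recordr or DataONE API. The arrow directions are assumed from the PROV-O definitions (an entity wasGeneratedBy an activity, an activity used an entity), and the rule that every output derives from every input of the same execution is a simplifying assumption; PROV itself does not infer derivation from every used/generated pair.

```python
# Illustrative sketch only: plain tuples, not the recordr or DataONE API.
# Each statement is (subject, relation, object), following the slide's labels.
# Arrow directions are assumed from the PROV-O definitions.
trace = [
    ("map image", "wasGeneratedBy", "R script Execution"),
    ("R script Execution", "used", "CSV data"),
    ("R script Execution", "wasAssociatedWith", "Scientist"),
    ("map image", "wasAttributedTo", "Scientist"),
]


def derive(statements):
    """Return wasDerivedFrom statements for each output/input pair of an activity.

    Simplifying assumption: every output of an execution depends on
    every input that the same execution used.
    """
    used = {}  # activity -> list of inputs it read
    for subj, rel, obj in statements:
        if rel == "used":
            used.setdefault(subj, []).append(obj)

    derived = []
    for subj, rel, obj in statements:
        if rel == "wasGeneratedBy":
            for source in used.get(obj, []):
                derived.append((subj, "wasDerivedFrom", source))
    return derived


derived = derive(trace)
for statement in derived:
    print(statement)
```

With the four statements above, the only derived statement is the one shown on the slide: the map image was derived from the CSV data.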

ProvONE extends PROV for science

http://purl.dataone.org/provone-v1-dev

Data package with ProvONE trace

[Figure: Data Package 1 and Data Package 2. Labeled components: resource map, science metadata, science data, figures, software, each accompanied by system metadata, plus a ProvONE trace showing relationships.]

Provenance search and browse

DataONE harvests provenance information and indexes it

[Figure labels: ITK Client, Repository, DataONE; arrows: publish, harvest.]
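As a rough illustration of what indexing harvested provenance makes possible, the sketch below builds a lookup from each source object to the objects derived from it, then walks that index to answer a discovery query ("what was derived from this data?"). It is a Python sketch under stated assumptions, not DataONE's indexing code: the pair format and the object names "summary table" and "publication figure" are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical harvested statements: (derived object, source object).
harvested = [
    ("map image", "CSV data"),
    ("summary table", "CSV data"),
    ("publication figure", "map image"),
]


def build_index(pairs):
    """Map each source object to the objects directly derived from it."""
    index = defaultdict(list)
    for derived_obj, source in pairs:
        index[source].append(derived_obj)
    return index


def derived_products(index, start):
    """Breadth-first walk: everything derived, directly or indirectly, from start."""
    seen = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in index.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


index = build_index(harvested)
products = derived_products(index, "CSV data")
print(products)
```

The same index, built in the opposite direction, would answer the reverse question of which inputs a figure depends on.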

Provenance: Figures

Provenance: Data

Tools for creating provenance

Matlab DataONE Toolbox

Recordr R Library

Java YesWorkflow Tool

R ‘recordr’ package

# Generate map of locations by type
library(recordr)
recordr <- new("Recordr")
# Run the script under provenance capture and tag the run
pkg <- record(recordr, "./hcdbSites.R", "loc-by-type-png")

‘recordr’ functions

record()

startRecord()

endRecord()

listRuns()

deleteRuns()

viewRun()

publish()

set()

get()

saveConfig()

loadConfig()

listConfig()

See: Run Manager API document

R: managing script runs

> listRuns(recordr)

Script       StartTime  EndTime   Published    Tag              RunID
hcdbSites.R  18:53:09   18:53:09  unpublished  loc-by-type-png  C85A ...

> deleteRuns(recordr, "loc-by-type-png")

C85A188-B72E-49F1-AEF4-7BFC24DA186B

> viewRun(recordr, "loc-by-type-png")

… details about the run listed here ...

> publishRun(recordr, "loc-by-type-png")

C85A188-B72E-49F1-AEF4-7BFC24DA186B

Publication: version 1

Publication: version 2

Transitive credit

•  Now, when a user cites a pub, we know:
   •  Which data produced it
   •  What software produced it
   •  What was derived from it
   •  Who to credit down the attribution stack

•  Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
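One way to picture crediting "down the attribution stack" is the sketch below. It is an illustration in Python, not the JSON-LD format from Katz & Smith: each product is assumed to split its credit into fractions that sum to 1, shared among people and the products it builds on, and credit passed to a product is redistributed according to that product's own split. All names and weights are invented for the example, and the dependency graph is assumed to have no cycles.

```python
# Hypothetical credit maps: each product splits credit (fractions summing to 1)
# among people and the products it builds on. All weights are invented.
credit_maps = {
    "publication": {"author": 0.5, "map image": 0.5},
    "map image": {"analyst": 0.5, "R script": 0.25, "CSV data": 0.25},
    "R script": {"developer": 1.0},
    "CSV data": {"data collector": 1.0},
}


def transitive_credit(product, maps, weight=1.0, totals=None):
    """Accumulate each person's share of credit for `product`.

    A contributor that is itself a product (a key in `maps`) passes its
    share on to its own contributors. Assumes the graph is acyclic.
    """
    if totals is None:
        totals = {}
    for contributor, share in maps[product].items():
        if contributor in maps:
            transitive_credit(contributor, maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


credit = transitive_credit("publication", credit_maps)
for person, share in credit.items():
    print(person, share)
```

Because every split sums to 1, the shares credited to people for the publication also sum to 1, so citing the publication distributes credit to the analyst, the software developer and the data collector as well as the author.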

Open Software and Specifications

R Packages (in testing)
recordr: https://github.com/NCEAS/recordr
dataone: https://github.com/DataONEorg/rdataone
datapackage: https://github.com/ropensci/datapackage

Matlab Toolbox (in development)
matlab-dataone: https://github.com/DataONEorg/matlab-dataone

ProvONE Specification (draft)
http://purl.dataone.org/provone-v1-dev

Image credit: @ESAOpenSci

Thank You

DataONE Provenance Contributors

•  Matt Jones
•  Chris Jones
•  Lauren Walker
•  Peter Slaughter
•  Ben Leinfelder
•  Mark Schildhauer
•  Steve Aulenbach
•  Christopher Schwalm
•  Paolo Missier
•  Bertram Ludäscher
•  Rachel Volentine
•  Susanna Yang Cao
•  Dave Vieglais
•  Yaxing Wei
•  Tim McPhillips
•  Phase I Provenance group

Funding
DataONE: NSF Grant # 0830944 and 1430508
Community Dynamics: NSF Grant # 1262463

This presentation is made available under a CC-BY 4.0 license.

http://creativecommons.org/licenses/by/4.0/