

Semantic Provenance for Science Data Products: Application to Image Data Processing

Stephan Zednik, Peter Fox, Deborah L. McGuinness, Paulo Pinheiro da Silva, Cynthia Chang

Abstract—A challenge in providing scientific data services to a broad user base is to also provide the metadata services and tools the user base needs to correctly interpret and trust the provided data. Provenance metadata is especially vital to establishing trust, giving the user information on the conditions under which the data originated and any processing that was applied to generate the data product provided. In this paper, we describe our work on a federated set of data services in the area of solar coronal physics. These data services provide a particular challenge because there are decades of existing data whose provenance we will have to reconstruct, and because the quality of the final data product is highly sensitive to data capture conditions, information which is not currently propagated with the data. We describe our use of semantic technologies for encoding provenance and domain knowledge and show how provenance and domain ontologies can be used together to satisfy complex use cases. We show our progress on provenance search and visualization tools and highlight the need for semantics in the user tools. Finally, we describe how our methods are applicable to generic data processing systems.


1 INTRODUCTION

We aim to create a next-generation virtual observatory¹ with extensive provenance support. Provenance is a first-class concept in our system, with full support in our search, explanation, and visualization tools. We require a general provenance model that is applicable in a wide array of domains and integrable with domain models, so that domain concepts can be modeled along with provenance concepts. For these reasons we have chosen to use the Proof Markup Language (PML) [3], [7] family of OWL ontologies as our provenance model. We show how PML can be used to model provenance causality chains, introduce our domain model, and show how the PML provenance model and our science domain models can be integrated in a manner that provides a rich provenance infrastructure, able to model complex scientific provenance relations.

• Stephan Zednik, Peter Fox, Deborah L. McGuinness, and Cynthia Chang are with the Rensselaer Polytechnic Institute, Tetherless World Constellation.

• Paulo Pinheiro da Silva is with the University of Texas at El Paso, Department of Computer Science.

1. A virtual observatory is a collection of interoperating data archives and software tools which utilize the internet to form a scientific research environment in which research programs can be conducted.

We have chosen to test our system in the domain of solar coronal physics, using the Advanced Coronal Observing System (ACOS) as a testbed. The ACOS data products are the result of several data ingest pipelines, processing observations from three imaging instruments located at the Mauna Loa Solar Observatory (MLSO). The ACOS data pipelines are distributed, operated in part at MLSO in Hawaii and in part at the National Center for Atmospheric Research High Altitude Observatory (NCAR/HAO) in Boulder, CO. ACOS has been operational for over a decade and has produced terabytes of data.

ACOS was chosen because its data pipelines are typical of data ingest systems and the vast quantity of existing data ACOS has generated in its decades of operation provides the opportunity to design a system geared to reconstruct, as well as capture, provenance. One of the ACOS data pipelines, the Chromospheric Helium-I Imaging Photometer (CHIP) Intensity Image pipeline, is illustrated in Figure 1. This high-level diagram is designed to show not just the process/artifact flow of the pipeline, but domain concepts that could and should be captured and represented in the provenance. In the pipeline, data (square boxes) passes through a number of stages (ovals), each of which can contain a number of complex processing, analysis, human interaction, and decision steps. Each of these stages contains domain-specific information (dotted-lined boxes) related to the data product provenance.

Of particular interest is information in the pipeline that is not a direct or inferred result of the data capture event. The Observer Log is a human-generated account of weather conditions and system status during the instrument observing schedule for the day. Bad weather conditions or instrument instability, noted in the log, can have a significant negative effect on the quality of the data observations. This information is currently not propagated in the data pipeline, nor do the data products reference it in any way, but it is an invaluable reference in determining why an image has been given a low-grade quality assessment. This information is an important component of the origin of the data image and should be represented in its provenance.

The motivation for this project arose from our experiences designing and deploying a solar-terrestrial physics virtual observatory system [1], [2], and from numerous discussions with the data providers (i.e., the 'roles' in Figure 1). Among their remarks were the following:

• Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision.

• We often fail to capture, represent, and propagate manually generated information that needs to go with the data flows.

Further, when science data and visual representations of the data (such as the CHIP Quick Look images) are made available to the end-user, the product has often gone through a number of data filtration and processing steps. If thorough provenance metadata and processing documentation are not captured, propagated, and made available to the end-user, the data system is in effect a 'black box', and the end-user must blindly trust the science quality of the data product and the long-term consistency of the pipeline processing.

Virtual Observatories are particularly prone to this information gap. This project traces the entire pipeline and accounts for all roles, processes, and metadata as they relate to use cases which require provenance.

[Figure 1: pipeline diagram. Labels recoverable from the diagram include: Site Observer; Data Analyst; PI; Data Product Manager; End User; Data Capture; Quick Look Processing; Ship to NCAR; Quality Assessment; Calibration, Accounting, Quality Rating; Flatten, Re-scale, Rotate Image; Raw Image Files with Headers; Observer Logs (general, weather, and instrument comments); Images with Headers, and Observer Logs; Quick Look GIF (Visual Product); Calibrated Data with Quality value; Final Data Images (Data Product); Site/Instrument Configuration; Instrument Calibration (site calibration, calibration image set); instrument operating mode; sub-processes, processing algorithms, input parameters; Quality field specifying 'Good', 'Bad', or 'Ugly' data, added by QA.]

Fig. 1. Chromospheric Helium-I Imaging Photometer (CHIP) Intensity Image pipeline

2 USE CASES

During discussions with science project participants, we developed an initial set of use cases which reflect real user questions that cannot presently be answered in any routine or automated manner:

• What were the cloud cover and seeing conditions during the observation period of this data product artifact?

• What calibrations have been applied to this data product artifact?

• Who (person or program) added the comments to the science data file for the best vignetted rectangular polarization brightness image from January 26, 2005 18:49:09 UT taken by the ACOS Mark IV polarimeter?


• Find all good CHIP He 1083 nm intensity images on March 21, 2008.

• Why does this data look bad?

These use case scenarios mix domain and provenance terms in a manner that would make the questions difficult to answer if the provenance and domain models are independent. The cloud cover and seeing conditions during observation periods are recorded, but they are never directly associated with, nor propagated with, the processed data. We could adjust the data processing pipeline to pull this information from the observation records and propagate it with the data as metadata, but this would be a very heavy-handed approach to take for any and all information associated with a data product's generation and processing that a user may want to see. Instead, we intend to build the link in the provenance representation of the data product, so that by following the data product's causality graph the system can find the weather condition records, calibrations applied, quality control information, etc. that are associated with the data product's generation or processing.

These use case scenarios are representative of many different question types that are routinely asked for all data products produced by the ACOS data ingest pipeline, and we believe these are representative of use cases that are common in any science data pipeline application.

3 PROVENANCE REQUIREMENTS

We require a provenance infrastructure that supports queries, filtering, and reasoning by domain concepts. A design requiring the hardwiring of domain concepts into the provenance model, or into the system logic that accesses the provenance store, is undesirable because it will be difficult to maintain and extend, and furthermore, such a hardwired application will make interoperation more challenging.

The provenance infrastructure must also support existing data systems, requiring little to no modification of the processing pipeline; ACOS is a production system, and we do not have the opportunity to re-engineer it. The system should also support generating some amount of provenance for existing processed data. It is not feasible to reprocess all existing data, and doing so with the current pipeline may introduce discrepancies between the newly processed products and archived products processed on an earlier and different version of the pipeline. The provenance capture should be configurable such that as much provenance as possible can be generated based on our understanding of an earlier version of the pipeline, without forcing us to re-run data processing.

Finally, since ACOS is a distributed system, the provenance infrastructure must also work as a distributed system. Provenance should be gathered where processing occurs and made available as part of a distributed provenance store.

4 DATA MODEL

4.1 Provenance Representation

To support our provenance requirements we have elected to use OWL ontologies to model both domain and provenance concepts. The provenance and domain base ontologies are independent, but the system's individuals reference both models (via multiple inheritance), so queries, filtering, and reasoning by either domain or provenance concepts are supported. This design supports our desire to build a maintainable system that refrains from hardcoding solar-terrestrial concepts into the base provenance model or provenance logic.

We have chosen as our provenance model the Inference Web [6] framework's Proof Markup Language (PML) [3], [7] because of its capabilities in representing conclusions, justifications (inference and source usage), and explanations. Another particularly useful aspect of the PML model is its separation of the process engine and process rule concepts. By defining these concepts separately, PML can represent both the process that was executed (PML InferenceEngine) and the rule (PML InferenceRule) by which the executed process operated. Another way to view this concept separation is that PML can capture both execution history and execution purpose. Inference rules are a pivotal concept of the justification aspect of PML and provide a clear mechanism for relating domain concepts to a provenance causality graph. While other provenance models such as the Open Provenance Model (OPM) could have provided some of the foundation provided in PML, we found some of the core representational constructs, such as those mentioned above, to be well suited for our applications. For further analysis of the relative benefits of PML and OPM, see 'Towards Usable and Interoperable Workflow Provenance: Empirical Case Studies using PML' [4] and 'Domain Knowledge and Provenance in Science Data Systems' [5].


Fig. 2. Basic PML NodeSet

In PML, a piece of information (the conclusion) and its justification(s) are modeled as a nodeset. A nodeset justification, known as an inference step, is used to describe the engine or source and the rule used to generate the nodeset conclusion. Each inference step may specify a list of nodesets, known as antecedents, whose conclusions it is dependent upon. The antecedent relations between nodesets can be used to build a causality graph, or explanation, for the conclusion of the nodeset. The fundamental classes and properties of a PML nodeset are shown in Figure 2.
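The nodeset structure described above can be approximated in a few lines of Python. This is only a minimal sketch, not the PML ontology itself: the class and field names (NodeSet, InferenceStep, engine, rule, source, antecedents) mirror the prose, while every `ex:` identifier and the choice to link the observer log to the Quick Look product are hypothetical illustrations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InferenceStep:
    """A justification: what produced a conclusion, by which rule, from what."""
    engine: Optional[str] = None   # the process that was executed
    rule: Optional[str] = None     # the rule by which the process operated
    source: Optional[str] = None   # source usage, for directly asserted information
    antecedents: List["NodeSet"] = field(default_factory=list)


@dataclass
class NodeSet:
    """A conclusion together with its justification(s)."""
    uri: str
    conclusion: str
    steps: List[InferenceStep] = field(default_factory=list)


def explanation(nodeset, depth=0, seen=None):
    """Walk antecedent links depth-first and return (depth, uri) pairs."""
    seen = set() if seen is None else seen
    if nodeset.uri in seen:
        return []
    seen.add(nodeset.uri)
    rows = [(depth, nodeset.uri)]
    for step in nodeset.steps:
        for antecedent in step.antecedents:
            rows.extend(explanation(antecedent, depth + 1, seen))
    return rows


# Hypothetical fragment of a pipeline; all identifiers are illustrative.
observer_log = NodeSet("ex:observerLog", "weather and instrument comments",
                       [InferenceStep(source="ex:SiteObserver")])
raw_image = NodeSet("ex:rawImage", "raw image file with headers",
                    [InferenceStep(engine="ex:CHIP", rule="ex:IntensityMode")])
quick_look = NodeSet("ex:quickLook", "Quick Look GIF",
                     [InferenceStep(engine="ex:QuickLookProcessing",
                                    rule="ex:QuickLookRule",
                                    antecedents=[raw_image, observer_log])])

graph = explanation(quick_look)
for depth, uri in graph:
    print("  " * depth + uri)
```

Keeping the engine and the rule as separate fields reflects the separation of execution history and execution purpose discussed earlier; the traversal simply follows antecedents, which is all that is needed to reach records such as an observer log from a derived product.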

4.2 Domain Representation

The VSTO² solar-terrestrial ontology, developed during our previous experience deploying a semantic virtual observatory [1], [2], will be used as one of our core science domain models. The VSTO ontology provides a model for data products, instruments, and parameters related to solar-terrestrial data systems. The VSTO ontology does not currently describe the processing that occurs in a typical science data ingest pipeline (calibrations, transformations, data filtering, quality control processes, etc.), so we are developing our own science data processing ontology based on experience gained during this³ project and similar work with the MDSA⁴ project.

Figure 3 illustrates some domain model concepts from the VSTO and (in-development) science data processing ontologies that we will be integrating with the ACOS provenance. Of particular interest in this example is the vsto:InstrumentOperatingMode concept, which is defined as the configuration and process that allows an instrument to produce the required signal. In the solar-terrestrial domain terminology, an operating mode is treated not as an input configuration to a process, nor as an artifact, but as a description of the state of the instrument and the entailing process for data capture. It is a description of how an instrument performs data capture of a specific parameter type. In fact, vsto:InstrumentOperatingMode was originally modeled as a subclass of the class vsto:AbstractProcess. The VSTO InstrumentOperatingMode and science data processing ontology DataCapture concepts relate to each other in much the same way the PML InferenceRule and InferenceEngine relate, and in the next section we will show how they can be integrated.

Fig. 3. VSTO (vsto) and Science Data Processing (spcdis) domain concepts

2. Virtual Solar Terrestrial Observatory, http://vsto.org/

3. Semantic Provenance Capture in Data Ingest Systems

4. Multi-Sensor Data Synergy Advisor, http://tw.rpi.edu/portal/MDSA

4.3 Provenance and Domain Model Integration

The provenance and domain ontology concepts are integrated not in the model definitions, but in the individuals' declarations, by taking advantage of OWL's natural support for multiple inheritance. Where it is deemed beneficial to express both domain and provenance concepts, individuals (ontology class instances) are defined with multiple types, one type from the provenance model and at least one type from the domain ontologies. As an example, the science data processing ontology may define an individual spcdis:FlatFieldCalibration of type spcdis:Calibration and type pmlp:InferenceRule. The use case 'What calibrations have been applied to this data product artifact?' may now be answered by querying the data product's causality graph for inference rules of type spcdis:Calibration used by any nodeset's justification.

Fig. 4. Individuals integrating provenance and domain models

Fig. 5. PML NodeSet comprised of domain model integrated individuals

Figure 4 shows how individuals of the domain concepts illustrated in Figure 3 may be mapped to PML provenance concepts. The instrument operating mode instance is modeled not as the conclusion of some nodeset's justification that acts as an antecedent to the data capture nodeset's justification, but as the rule by which data capture occurs. The individual spcdis:CHIP is defined as both a vsto:Photometer and a pmlp:Sensor, allowing it both to be the source of the conclusion of the data capture justification and to assert any properties for which vsto:Photometer is in the domain.

We can now model a PML NodeSet using individuals that have types and properties from domain ontologies, as shown in Figure 5. Integrating domain types and properties with the provenance-based causality graph allows us to answer complex use case scenarios such as 'What calibrations have been applied to this data product artifact?' by performing reasoning on domain concepts integrated with the individuals in the provenance.
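To make the calibration use case concrete, the following sketch stores a handful of triples in plain Python and walks a causality graph looking for rules that carry the domain type spcdis:Calibration. It is an illustration under stated assumptions: the individuals spcdis:FlatFieldCalibration and spcdis:CHIP and their types come from the text, but the link properties `ex:hasRule` and `ex:hasAntecedent`, the other `ex:` names, and the shape of the graph are hypothetical stand-ins for the actual PML terms.

```python
RDF_TYPE = "rdf:type"

# (subject, predicate, object) triples. The ex: properties and nodesets are
# hypothetical placeholders, not terms from the PML or VSTO ontologies.
triples = {
    ("spcdis:FlatFieldCalibration", RDF_TYPE, "spcdis:Calibration"),
    ("spcdis:FlatFieldCalibration", RDF_TYPE, "pmlp:InferenceRule"),
    ("ex:RotateImage", RDF_TYPE, "pmlp:InferenceRule"),
    ("spcdis:CHIP", RDF_TYPE, "vsto:Photometer"),
    ("spcdis:CHIP", RDF_TYPE, "pmlp:Sensor"),
    ("ex:finalImage", "ex:hasRule", "ex:RotateImage"),
    ("ex:finalImage", "ex:hasAntecedent", "ex:calibratedImage"),
    ("ex:calibratedImage", "ex:hasRule", "spcdis:FlatFieldCalibration"),
    ("ex:calibratedImage", "ex:hasAntecedent", "ex:rawImage"),
    ("ex:rawImage", "ex:hasRule", "ex:IntensityMode"),
}


def objects(subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def calibrations_applied(nodeset):
    """Collect rules typed spcdis:Calibration anywhere in the causality graph."""
    found, stack, seen = [], [nodeset], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for rule in objects(node, "ex:hasRule"):
            if "spcdis:Calibration" in objects(rule, RDF_TYPE):
                found.append(rule)
        stack.extend(objects(node, "ex:hasAntecedent"))
    return found


print(calibrations_applied("ex:finalImage"))
```

Note that the sketch only matches directly asserted types. A reasoner operating over the real ontologies could also match subclasses of spcdis:Calibration, which is part of the motivation for keeping domain types on the individuals rather than in the provenance logic.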

5 PROVENANCE CAPTURE

To assist in PML generation, we describe a type of program referred to as a PML data annotator. A PML data annotator is a simple program whose sole purpose is to capture the provenance of a single decision/process in a decision system and encode that provenance as a PML nodeset. PML data annotator programs are run as components of a workflow, either as part of, or separate from, the actual decision processing. When run as part of the data processing, the PML data annotator invokes the inference engine directly, extracting the required processing inputs from antecedent nodesets and passing this information during inference engine invocation.

For the ACOS provenance capture we will utilize a parallel workflow, where PML data annotators do not directly invoke inference engines but reconstruct the processing of the existing data ingest pipelines. This architecture also supports our need to generate provenance for archived or pre-existing data products without reprocessing the data. The PML data annotators run in a workflow that simulates the processing of the data pipeline, using analysis of existing artifacts and information about the data processing encoded in the PML data annotator configuration to reconstruct provenance. The PML data annotator workflow can be reconfigured to simulate different variations of the data processing pipeline, in order to generate provenance from data processing pipelines that are no longer active.
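The parallel-workflow idea can be sketched as a set of small annotators driven by a configuration that describes one version of the pipeline. The code below is a rough illustration only: the configuration format, stage names, file suffixes, and `ex:` identifiers are all assumptions introduced for the example, and the output is a plain dictionary rather than real PML.

```python
import json

# Hypothetical description of one historical pipeline version. Stage names,
# engine/rule identifiers, and file suffixes are illustrative assumptions.
PIPELINE_V1 = [
    {"stage": "DataCapture", "engine": "spcdis:CHIP", "rule": "ex:IntensityMode",
     "output_suffix": ".raw", "input_suffixes": []},
    {"stage": "QuickLook", "engine": "ex:QuickLookProcessing",
     "rule": "ex:QuickLookRule",
     "output_suffix": ".gif", "input_suffixes": [".raw"]},
]


def annotate(stage, base_name, existing_files):
    """Encode one stage as a nodeset-like record, or None if its artifact is absent."""
    output = base_name + stage["output_suffix"]
    if output not in existing_files:
        return None
    antecedents = [base_name + suffix for suffix in stage["input_suffixes"]
                   if base_name + suffix in existing_files]
    return {
        "nodeset": output,
        "stage": stage["stage"],
        "engine": stage["engine"],
        "rule": stage["rule"],
        "antecedents": antecedents,
    }


def reconstruct(config, base_name, existing_files):
    """Run every annotator in pipeline order; no pipeline code is executed."""
    records = []
    for stage in config:
        record = annotate(stage, base_name, existing_files)
        if record is not None:
            records.append(record)
    return records


# Hypothetical archive listing for one observation.
archive = {"example_0001.raw", "example_0001.gif"}
records = reconstruct(PIPELINE_V1, "example_0001", archive)
print(json.dumps(records, indent=2))
```

Because the pipeline description is data rather than code, provenance for a retired pipeline version could in principle be generated by passing a different configuration list to `reconstruct`, without touching the annotators or re-running any processing.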

6 PROVENANCE SEARCH

We will utilize the search and explanation capabilities of the Inference Web toolset to provide both a free text and a guided search on provenance and domain concepts. Guided searches generate a SPARQL query on the provenance + domain RDF, and free text searches currently perform a standard full text index search on the same. Search results can be viewed using a number of tools, including the Inference Web browser, where the user can explore the provenance encoding in detail, or in Probe-It!, a second-generation PML provenance visualization tool using applet technology for greater graph functionality, introduced in the next section.

Fig. 6. Probe-It! visualizing the provenance of a CHIP QuickLook Visual Product
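The guided and free text search modes described in Section 6 can be illustrated with a short sketch. It is not the Inference Web implementation: the namespace URIs are placeholders, `ex:hasRule` is a hypothetical stand-in for the actual PML property, and the sample log text is invented for the example. The guided search only builds a SPARQL query string, and the free text search uses a simple in-memory inverted index.

```python
import re
from collections import defaultdict

# Placeholder namespaces; the real ontology URIs and property names may differ.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/provenance#",
    "spcdis": "http://example.org/spcdis#",
}


def guided_query(rule_class):
    """Build a SPARQL query for nodesets justified by a rule of the given class."""
    prefix, _, local = rule_class.partition(":")
    if prefix not in PREFIXES or not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", local):
        raise ValueError(f"unsupported class name: {rule_class!r}")
    header = [f"PREFIX {p}: <{uri}>" for p, uri in sorted(PREFIXES.items())]
    body = [
        "SELECT DISTINCT ?nodeset ?rule WHERE {",
        "  ?nodeset ex:hasRule ?rule .",
        f"  ?rule rdf:type {rule_class} .",
        "}",
    ]
    return "\n".join(header + body)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(literals):
    """Map each token to the set of resources whose literal text contains it."""
    index = defaultdict(set)
    for resource, text in literals.items():
        for token in tokenize(text):
            index[token].add(resource)
    return index


def free_text_search(index, phrase):
    """Return resources whose text contains every token in the phrase."""
    tokens = tokenize(phrase)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


query = guided_query("spcdis:Calibration")
# Invented sample literals standing in for observer log comments.
index = build_index({
    "ex:observerLog1": "Heavy cloud cover, poor seeing after 18:00 UT",
    "ex:observerLog2": "Clear sky, instrument stable",
})
hits = free_text_search(index, "cloud cover")
print(query)
print(sorted(hits))
```

The guided path restricts input to known prefixes and simple local names before interpolating it into the query, which keeps user-selected facets from altering the query structure.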

7 PROVENANCE VISUALIZATION

Probe-It! [8] is a graphical browser of PML-based provenance, developed by Cyber-ShARE at the University of Texas at El Paso. Probe-It! generates a causality graph for all antecedents to a specified nodeset and generates a visual representation (where applicable) for the conclusion of all nodesets in the graph. A Probe-It! visualization of the CHIP QuickLook⁵ Visual Product is shown in Figure 6. Our primary interest in Probe-It! for the ACOS provenance is in enabling scientists to better understand imperfections in, and processing consequences upon, science data images.

8 DISCUSSION AND CONCLUSION

To date we have reconstructed provenance for the Quick Look visual product from the CHIP Intensity data ingest and shown how weather condition information, previously not propagated, can add value to the data product. We have introduced our work constructing a domain-aware provenance store built using a generic, extensible provenance model and a solar-terrestrial domain model.

5. QuickLook images are lightly-calibrated visual approximations of the data image, generated in real-time and used primarily as quality checks on instrument operation.

We described how this provenance store can be used to represent the relationships expressed in our use cases, and why these use cases are important in increasing user trust in the data products. While we did this in one workflow setting, since the use cases are representative of those in many scientific workflow settings, we believe this work provides a foundation for scientific workflow provenance applications. We have described in brief how we intend to use a parallel workflow to reconstruct and generate provenance, and the semantically-enabled provenance search, explanation, and visualization tools we will provide for the end users.

The next stage of our work will involve further modeling of data pipeline concepts in the ACOS provenance ontology, further documentation of the ACOS data pipelines, and construction of PML wrappers for newly documented sections of the data pipelines. We also intend to prototype semantic provenance faceted-search interfaces, move our free text search to the Apache Lucene text search engine, and develop new visual representations of nodeset conclusions in our visual provenance browser. Following the completion and testing of the ACOS application, we will separate our extensions that are ACOS-specific from those that are general to science applications and release scientific provenance module extensions for VSTO and PML ontologies and related wrapper support tools.

ACKNOWLEDGMENTS

The SPCDIS⁶ project is funded by NSF, Office of Cyber Infrastructure under the SDCI program, grant number OCI-0721943, and in collaboration with UTEP CyberShare Center, funded by NSF under the CREST program, grant number HRD-0734825.

6. Semantic Provenance Capture in Data Ingest Systems

REFERENCES

[1] McGuinness, D., Fox, P., Cinquini, L., West, P., Garcia, J., Benedict, J., Middleton, D.: The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. In the Proceedings of the 19th Conference on Innovative Applications of Artificial Intelligence (IAAI). Vancouver, BC, Canada, July 2007, 1730-1737 and AI Magazine, 29, #1, 65-76.

[2] Fox, P., McGuinness, D., Cinquini, L., West, P., Garcia, J., Benedict, J., Middleton, D.: Ontology-supported Scientific Data Frameworks: The Virtual Solar-Terrestrial Observatory Experience. Computers and Geosciences. Vol. 35, Issue 4, pp 724-738.

[3] McGuinness, D., Ding, L., Pinheiro da Silva, P., Chang, C.: PML 2: A Modular Explanation Interlingua. In ExaCt, pp. 49-55. Also Stanford KSL Tech Report KSL-07-07 (2007)

[4] Michaelis, J., Ding, L., Shangguan, Z., Zednik, S., Huang, R., Pinheiro da Silva, P., Del Rio, N., McGuinness, D.: Towards Usable and Interoperable Workflow Provenance: Empirical Case Studies using PML. To appear in the Proceedings of the First International Workshop on the role of Semantic Web in Provenance Management, Chantilly, VA. (2009)

[5] Zednik, S., Fox, P., McGuinness, D.: Domain Knowledge and Provenance in Science Data Systems. To appear in Emerging Issues in e-Science: Collaboration, Provenance, and the Ethics of Data (IN13), AGU Fall 2009 Meeting, San Francisco, CA. (2009)

[6] McGuinness, D., Ding, L., Pinheiro da Silva, P.: Explaining Answers from the Semantic Web: The Inference Web Approach. Web Semantics: Science, Services and Agents on the World Wide Web, Special issue: International Semantic Web Conference 2003, edited by K. Sycara and J. Mylopoulos. 1(4). Fall, 2004. Also, Stanford KSL Tech Report KSL-04-03.

[7] Pinheiro da Silva, P., McGuinness, D., Fikes, R.: A Proof Markup Language for Semantic Web Services. Information Systems, 31(4-5), June-July 2006, pp 381-395. Prev. version, KSL Tech Report KSL-04-01 (June 2006)

[8] Del Rio, N., Pinheiro da Silva, P.: Probe-It! Visualization support for provenance. Proceedings of the Second International Symposium on Visual Computing (ISVC 2), Lake Tahoe, NV, USA. Volume 4842 of LNCS, pages 732-741, Springer (2007)
