emerging trends in provenance

Emerging Trends in Provenance. Deborah L. McGuinness, Tetherless World Constellation Chair, Rensselaer Polytechnic Institute. SWPM Workshop at ISWC, November 7, 2010, Shanghai, China.


Page 1: Emerging Trends in Provenance

Emerging Trends in Provenance

Deborah L. McGuinness

Tetherless World Constellation Chair

Rensselaer Polytechnic Institute

SWPM Workshop at ISWC

November 7, 2010

Shanghai, China

Page 2: Emerging Trends in Provenance

Outline

– Some historical explanation & provenance settings

– Selected current provenance settings• Virtual Observatory• Open Data

– Discussion topics

Page 3: Emerging Trends in Provenance

Selected Background

• Bell Labs: designing description logics & environments aimed at supporting applications such as configuration.
  – led to research on making DL-based systems useful
  – with focus on explanation
• Stanford: focus on ontology-enabled xx, large hybrid systems, later x informatics
  – led to ontology evolution and diagnostic environments, renewed explanation, now from a broader perspective expanding beyond FOL and adding emphasis on provenance

Page 4: Emerging Trends in Provenance

Background cont.

• Rensselaer Polytechnic Institute / TWC: next generation web, web science research center, open data, next generation semantic eScience
  – led to more connections with social platforms, empowering collections (of users, data, etc.)

Page 5: Emerging Trends in Provenance

Explanation via Graph

Explanation via Customized Summary

Explanation via Annotation

Inference Web (IW)

End Users

End-User Interaction Services

Distributed PML data

Data Access & Data

Analysis Services

Validate PML data

Access published PML data

Inference Web is a semantic web-based knowledge provenance management infrastructure:

• Uses a provenance interlingua (PML) for encoding and interchange of provenance metadata in distributed environments
• Provides interactive explanation services for end-users
• Provides data access and analysis services for enriching the value of knowledge provenance
• It has been used in a wide range of applications

Page 6: Emerging Trends in Provenance

Proof/Provenance Markup Language (PML)

• A kind of linked data on the Web

• Modularized & extensible
  – Provenance: annotate provenance properties
  – Justification: encodes provenance relations (including support for multiple justifications)
  – Trust: add trust annotation

• Semantic Web based

[Figure: PML data distributed across two enterprise webs and the World Wide Web]

Page 7: Emerging Trends in Provenance

Making Systems Actionable using Knowledge Provenance

• Mobile Wine Agent
• GILA
• Combining Proofs in TPTP
• CALO
• Knowledge Provenance in Virtual Observatories
• Intelligence Analyst Tools
• NOW including Data-gov

Page 8: Emerging Trends in Provenance

Users Require Provenance! Users demand it!

If users (humans and agents) are to use, reuse, and integrate system answers, they must trust them.

Intelligence analysts (from DTO/IARPA’s NIMD): Andrew Cowell, Deborah McGuinness, Carrie Varley, and David A. Thurman. Knowledge-Worker Requirements for Next Generation Query Answering and Explanation Systems. Proc. of Intelligent User Interfaces for Intelligence Analysis Workshop, Intl Conf. on Intelligent User Interfaces (IUI 2006), Sydney, Australia.

Intelligent Assistant Users (from DARPA’s PAL/CALO): Alyssa Glass, Deborah L. McGuinness, Paulo Pinheiro da Silva, and Michael Wolverton. Trustable Task Processing Systems. In Roth-Berghofer, T., and Richter, M.M., editors, KI Journal, Special Issue on Explanation, Künstliche Intelligenz, 2008.

Virtual Observatory Users (from NSF’s VSTO): Deborah McGuinness, Peter Fox, Luca Cinquini, Patrick West, Jose Garcia, James L. Benedict, and Don Middleton. The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. Proc. of the Nineteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-07). Vancouver, British Columbia, Canada.

And… as systems become more diverse, distributed, and embedded, and depend on more varied data and communities, more provenance and more types of provenance are needed.

Page 9: Emerging Trends in Provenance

Two Application Scenarios:
1. Interdisciplinary next generation virtual observatories
2. Open Linked Data

Page 10: Emerging Trends in Provenance

CHIP Pipeline (Chromospheric Helium Image Photometer)

[Figure: CHIP data pipeline]
• Mauna Loa Solar Observatory (MLSO), Hawaii
• National Center for Atmospheric Research (NCAR) Data Center, Boulder, CO
• Raw data capture: raw image data captured by CHIP (Chromospheric Helium-I Image Photometer)
• Follow-up processing on raw data (e.g., flat field calibration)
• Quality checking (images graded GOOD, BAD, UGLY)
• Publishes: intensity images (GIF) and velocity images (GIF)

Page 11: Emerging Trends in Provenance

Semantic Provenance Capture for Data Ingest Systems (SPCDIS)

Fact: Scientific data services are increasing in usage and scope, and with these increases comes growing need for access to provenance information.

Provenance Project Goal: to design a reusable, interoperable provenance infrastructure.

Science Project Goal: design and implement an extensible provenance solution that is deployed at the science data ingest/ product generation time.

Outcome: implemented provenance solution in one science setting AND operational specification for other scientific data applications.

Extends vsto.org

Page 12: Emerging Trends in Provenance

ACOS Data Ingest

• Typical science data processing pipelines

• Distributed

• Some metadata in silos

• Much metadata lost

• Many human-in-loop decisions, events

• No metadata infrastructure for any user

• Community is broadening

Chromospheric Helium Imaging Photometer (CHIP) Data Ingest

ACOS – Advanced Coronal Observing System

Page 13: Emerging Trends in Provenance

The Advanced Coronal Observing System case for Provenance

[Figure: Source, Processing, Product; the links between them are marked “???”]

• Provenance metadata currently not propagated with or linked to the data products
  – Processing metadata
  – Origin (observation) metadata
• Data products are the result of “black box” systems
• Most users do not know what calibrations, transformations, and QA processing have been applied to the data product

Page 14: Emerging Trends in Provenance

Advanced Coronal Observing System (ACOS) Provenance Use Cases

• What were the cloud cover and seeing conditions during the observation period of this image?

• What calibrations have been applied to this image?

• Why does this image look bad?


Page 15: Emerging Trends in Provenance

PML Usage in SPCDIS

• Justification
  – Explanation
  – Causality graph
• Provenance
  – Conclusion
  – Source
  – Engine
  – Rule
• Trust
  – Trust/Belief metrics

[Figure: chained NodeSets, each with a Justification and a Conclusion. Other labeled elements: Engine, Rule, SourceUsage, Source, DateTime. Labeled relations: hasAntecedentList, hasSourceUsage, hasInferenceRule, hasInferenceEngine]

Page 16: Emerging Trends in Provenance


Page 17: Emerging Trends in Provenance


Tools

Page 18: Emerging Trends in Provenance

PML in Action

• This is the PML provenance encoding for a “quick look” GIF file, which is generated from two image datasets
• Node set for the quicklook GIF file
• hasConclusion: a reference to the GIF file itself
• InferenceStep: how the GIF file was derived (hasAntecedents, hasInferenceRule, hasInferenceEngine)
• The “antecedents” of the quicklook GIF file are other node sets
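
The structure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names echo the PML terms above, but the field layout, the file names, and the rule and engine names are hypothetical and are not the PML schema itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InferenceStep:
    """How a conclusion was derived (illustrative layout, not the PML schema)."""
    rule: str      # stands in for hasInferenceRule
    engine: str    # stands in for hasInferenceEngine
    antecedents: List["NodeSet"] = field(default_factory=list)


@dataclass
class NodeSet:
    """A conclusion plus an optional step that justifies it."""
    conclusion: str                       # stands in for hasConclusion
    step: Optional[InferenceStep] = None


def sources_of(node: NodeSet) -> List[str]:
    """Walk antecedents and return the conclusions that have no further justification."""
    if node.step is None or not node.step.antecedents:
        return [node.conclusion]
    found: List[str] = []
    for antecedent in node.step.antecedents:
        found.extend(sources_of(antecedent))
    return found


# Hypothetical names: a quick-look GIF derived from two image datasets.
intensity = NodeSet("intensity_dataset")
velocity = NodeSet("velocity_dataset")
quicklook = NodeSet(
    "quicklook.gif",
    InferenceStep(rule="MakeQuickLook", engine="ExampleImageTool",
                  antecedents=[intensity, velocity]),
)

print(sources_of(quicklook))
```

Because antecedents are themselves node sets, the same walk works for derivation chains of any depth.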

Page 19: Emerging Trends in Provenance

A PML-Enhanced Image

provenance

CHIP Quick-Look

CHIP PML-Enhanced Quick-Look

Page 20: Emerging Trends in Provenance

Integrated View

• Information from the observer log added to the quicklook image’s provenance

Page 21: Emerging Trends in Provenance

Provenance aware faceted search


Page 22: Emerging Trends in Provenance

Current Issues

• Successful interdisciplinary VO; needed provenance
• Successful provenance integration for experts; needs to support a more diverse audience
  – As the user base diversifies, what updates are needed?
  – Will a domain ontology for MLSO/NCAR-affiliated staff be understandable by citizen scientists? No.
  – How can our representational infrastructure be extended with contextual information relevant to user needs? E.g., linking data products from one part of the CHIP pipeline to specific solar events or events at MLSO (such as reports of bad weather)
  – Should provenance ontologies provide extension capabilities to include domain-informed extensions? Yes.

[1] Stephan Zednik, Peter Fox and Deborah L. McGuinness, “System Transparency, or How I Learned to Worry about Meaning and Love Provenance!” Proceedings of IPAW 2010
[2] James R. Michaelis, Li Ding, Zhenning Shangguan, Stephan Zednik, Rui Huang, Paulo Pinheiro da Silva, Nicholas Del Rio and Deborah L. McGuinness, “Towards Usable and Interoperable Workflow Provenance: Empirical Case Studies Using PML” Proceedings of SWPM 2009
[3] AGU 2010 papers by Fox et al., McGuinness et al., Zednik et al., West et al., Michaelis et al., …

Page 23: Emerging Trends in Provenance

User Annotations (James Michaelis)

• Allowing users to annotate provenance elements is a potential solution
• Allow a user community to make replies to questions from individuals
• E.g., citizen scientists can get information extensions through help of project staff
• Additionally, allow user community to assert information on provenance elements
• Vision: to incrementally aggregate information attached to provenance traces, through these annotations.

Page 25: Emerging Trends in Provenance

User Annotations

• Can expand information attached to provenance records in two ways:
  – Clarification: Providing an answer to a question about a provenance element (such as an expanded definition of its purpose).
  – Context Extension: Providing supplemental information outside the scope of a provenance record, which may aid in provenance understanding.

Page 26: Emerging Trends in Provenance

User Annotations

• Types of annotations
  – Assertion: A user directly asserts a clarification or context extension.
  – Clarification Request: A user makes a request for a clarification on a provenance element.
  – Context Extension Request: A user makes a request for a context extension.
  – Reply: A user replies to a clarification request or context extension request.
• Discussions may feature participants with different backgrounds. At a high level, such users can be distinguished by Roles (e.g., Staff, Citizen Scientist)

Page 27: Emerging Trends in Provenance

Use Case 1A

[Figure: interaction between Alice and the Web Service]
• Request: Intensity Image: 20101007.232213.chp.hsh.gif
• Server Response: Processing Details for Intensity Image 20101007.232213.chp.hsh.gif (table columns: ACTIVITY ID, PERFORMED BY FUNCTION; rows: ID:1 Flatten, ID:2 CenterImage)
• Request: Definition for function Flatten
• Server Response: Flatten: Apply flat field calibration to an image, using averaged bias and flat files for the corresponding processing day.
• Annotation Submission: Type: Clarification Request; Topic: Flatten (Function Definition); Text: Could someone provide a definition of “Flat Field Calibration”?

Page 28: Emerging Trends in Provenance

Use Case 1B

[Figure: interaction between Bob and the Web Service]
• Request: Details for Annotation: Annotation_1
• Server Response: Type: Clarification Request; Topic: Flatten; Text: Could someone provide a definition of “Flat Field Calibration”?
• Annotation Submission: Type: Reply; Reply To: Annotation_1; Clarification On: Flatten; Author: Bob; Role: Staff; Reply: A definition of Flat Field Calibration is given at the provided link.; Link: http://www.phys.vt.edu/~jhs/SIP/processing.html

Page 29: Emerging Trends in Provenance

Annotation Structure – Use Cases 1A, 1B

[Figure: annotation graph]

Annotation_1
  – Type: Clarification Request
  – Topic: Flatten
  – Has Text: Could someone provide a definition of “Flat Field Calibration”?
  – Has Author: Alice

Annotation_2
  – Type: Reply
  – Reply To: Annotation_1
  – Clarification For: Flatten
  – Has Text: A definition of Flat Field Calibration is given at the provided link.
  – Has Link: http://www.phys.vt.edu/~jhs/SIP/processing.html
  – Has Author: Bob (Role: Staff)
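
One possible way to hold the annotations from Use Cases 1A and 1B in code is sketched below. The annotation types, roles, and example values come from the slides; the class layout, field names, and the `thread_for` helper are hypothetical and are not the draft PMLA module.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class AnnotationType(Enum):
    ASSERTION = "Assertion"
    CLARIFICATION_REQUEST = "Clarification Request"
    CONTEXT_EXTENSION_REQUEST = "Context Extension Request"
    REPLY = "Reply"


@dataclass
class Annotation:
    ident: str
    kind: AnnotationType
    topic: str                      # the provenance element being annotated
    author: str
    text: str
    role: Optional[str] = None      # e.g. "Staff" or "Citizen Scientist"
    reply_to: Optional[str] = None  # ident of the annotation being answered
    link: Optional[str] = None


def thread_for(topic: str, annotations: List[Annotation]) -> List[Annotation]:
    """Collect every annotation attached to one provenance element."""
    return [a for a in annotations if a.topic == topic]


annotations = [
    Annotation("Annotation_1", AnnotationType.CLARIFICATION_REQUEST, "Flatten",
               author="Alice",
               text="Could someone provide a definition of 'Flat Field Calibration'?"),
    Annotation("Annotation_2", AnnotationType.REPLY, "Flatten",
               author="Bob", role="Staff", reply_to="Annotation_1",
               text="A definition of Flat Field Calibration is given at the provided link.",
               link="http://www.phys.vt.edu/~jhs/SIP/processing.html"),
]

for a in thread_for("Flatten", annotations):
    print(a.ident, a.kind.value, a.author)
```

Keying annotations on the provenance element is what lets information accumulate on a trace over time, which is the vision stated earlier in the deck.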

Page 30: Emerging Trends in Provenance

Use Case 2

[Figure: interaction between Bob and the Web Service]
• Request, then Initial Server Response: List of Intensity Images For 2010-08-01 – 2010-08-04
• For each listed image i = {0 … n}:
  – Request: Visualization of listed image i
  – Server Response: Visualization of image i (ID: image_i)
  – Bob inspects each image to see if it has visual evidence of Coronal Mass Ejection related activity
• Annotation Submission: Type: Assertion; Author: Bob; Topic: (all applicable images viewed); Text: CME Event observed in referenced images.

Page 31: Emerging Trends in Provenance

Related Work & Status

• myExperiment [1]
  – Social networking site for exchanging workflow-centric materials
  – Support primarily for annotation on workflow scripts, as opposed to provenance-based information
• Tupelo [2]
  – Semantic content repository, designed to facilitate provenance storage/querying
  – Uses the Open Provenance Model (OPM)
  – User annotations/discussions supported for URI-based content, but no specific focus on aggregating content directly on provenance elements
• Status: draft PMLA module. Implementation and evaluation with SPCDIS

[1] http://www.myexperiment.org/
[2] http://tupeloproject.ncsa.uiuc.edu/

Page 32: Emerging Trends in Provenance

Example Population Science Issues (with NIH)

• Do policies (taxation, smoking bans, etc) impact health and health care costs?

• What data should we display to help scientists and lay people evaluate related questions?

• What data might be presented so that people choose to make (positive) behavior changes?

• What does the following data show?

• What are appropriate follow ups?

Page 33: Emerging Trends in Provenance

PopSciGrid (Alpha)

Page 34: Emerging Trends in Provenance

PopSciGrid

Page 35: Emerging Trends in Provenance

PopSciGrid II

Page 36: Emerging Trends in Provenance

PopSciGrid III

Page 37: Emerging Trends in Provenance

Drill Down Questions

• Should we focus on prevalence?

• What is prevalence (definition)?

• How is it measured (overall / in this data set)?

• Conditions under which the data was obtained (date, sample set, extenuating conditions, …)

• Do we need more data, more inference, more xxx…

Page 38: Emerging Trends in Provenance

Our Position

System Transparency supports user understanding and trust

Our Research Goal: Provide interoperable infrastructure that supports explanations of sources, assumptions, and answers as an enabler for trust

Page 39: Emerging Trends in Provenance

Mashup Provenance from data-gov

• Critical for making demos useful, understandable, and actionable

[Figure labels: Dataset, Demo, Agency]

Page 41: Emerging Trends in Provenance

Sample Application Domain (with Xian Li)

• Study of Supreme Court Justices needs data from different sources

  – Judicial databases, e.g. SCDB (Spaeth 1999)
  – Newspaper comments, e.g. The New York Times
  – Biographical directories, e.g. Who's Who in America
  – Public opinions (Tate and Handberg 1991)
  – Court cases, votes (Segal and Spaeth 1993; Schubert 1965; Pritchett 1948; Rohde and Spaeth 1976)
  – Personal attributes: education, nominator, … (Segal and Spaeth 1993, 2002)

Page 42: Emerging Trends in Provenance

Sample Use Case (with Li and Lebo)

• Surprise
• Application reports that Robert H. Jackson was nominated by a Green Party President
• There hasn't been a Green Party President

Page 43: Emerging Trends in Provenance

Use Case

• Green Party President?
  o User believes that the system is incorrect
  o Look for the provenance of the information to identify whether the source is incorrect or the application interpreted the source incorrectly.

Page 44: Emerging Trends in Provenance

Provenance Encoding

[Figure: provenance graph for the “Green” result]
• Steps (each a pmlj:InferenceStep, linked by pmlj:isConsequenceOf): Query Creation, Query Execution
• Properties shown: ns:subject http://dbpedia.org/resources/Robert_H._Jackson; ns:query_template http://dbpedia.org/sparql?query=select...%JUSTICE%...; ns:query_uri; ns:query_result; ns:output_format; ns:service_uri; …
• Values shown: “Green”, “DBpedia”
• Annotations: Distrust event; Attribution located!
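
The lookup in this use case amounts to walking isConsequenceOf links from the distrusted value back to where it came from. The sketch below assumes one reading of the figure: query creation feeds query execution, and the “Green” value is attributed to DBpedia. The `Step` class, the `attributed_to` key, and the `trace` helper are hypothetical; the elided query template is kept exactly as it appears on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One inference-step-like record (illustrative layout only)."""
    label: str
    properties: Dict[str, str] = field(default_factory=dict)
    is_consequence_of: Optional["Step"] = None  # stands in for pmlj:isConsequenceOf


def trace(step: Optional[Step]) -> List[Step]:
    """Follow is_consequence_of links from a result back to its origin."""
    chain: List[Step] = []
    while step is not None:
        chain.append(step)
        step = step.is_consequence_of
    return chain


# Assumed arrangement of the two steps named on the slide.
creation = Step(
    "Query Creation",
    {"ns:subject": "http://dbpedia.org/resources/Robert_H._Jackson",
     "ns:query_template": "http://dbpedia.org/sparql?query=select...%JUSTICE%..."},
)
execution = Step(
    "Query Execution",
    {"ns:query_result": "Green", "attributed_to": "DBpedia"},  # hypothetical key
    is_consequence_of=creation,
)

for step in trace(execution):
    print(step.label, step.properties)
```

With the chain in hand, a user can see whether the surprising value arrived unchanged from the source or was introduced by the application's own query or interpretation.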

Page 45: Emerging Trends in Provenance

Challenges for Data Aggregators (with Tim Lebo, Greg Williams)


Page 47: Emerging Trends in Provenance

Assumptions and Objectives

• Most data are from third-party sources
• Data are updated regularly and irregularly
• Complete interpretation is not immediately possible
• Subsequent interpretations should be backward-compatible
• Distinguishing among sources
• Minimizing manual modifications
• Tracing to source data
• Attributing data authors and curators

Page 48: Emerging Trends in Provenance

Approach

• Capturing conversion provenance, exposed as linked data:
  1 – Following redirects
  2 – Retrieving data file
  3 – Unzipping
  4 – Manual tweaks
  5 – Converter invocation
  6 – Predicate lineage
  7 – Tracing triple to table cell
  8 – Populating endpoint
• Parameterized interpretation parameters
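
Step 7 above (tracing a triple to a table cell) can be illustrated with a toy converter that records, for every triple it emits, the source file, row, and column the value came from. This is a minimal sketch, not the converter used in the project: the `convert` function, the `http://example.org/` URIs, and the sample data are all made up for illustration.

```python
import csv
import io
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def convert(csv_text: str, source_uri: str,
            base: str = "http://example.org/") -> Tuple[List[Triple], Dict[Triple, dict]]:
    """Turn CSV rows into triples and remember which cell produced each one."""
    triples: List[Triple] = []
    lineage: Dict[Triple, dict] = {}
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        subject = f"{base}row/{row_number}"
        for col_number, (name, value) in enumerate(zip(header, row), start=1):
            triple = (subject, f"{base}vocab/{name}", value)
            triples.append(triple)
            lineage[triple] = {"source": source_uri,
                               "row": row_number,
                               "column": col_number}
    return triples, lineage


# Made-up sample data.
sample = "country,amount\nA,10\nB,20\n"
triples, lineage = convert(sample, "http://example.org/data.csv")

question = ("http://example.org/row/3", "http://example.org/vocab/amount", "20")
print(lineage[question])
```

Keeping the lineage alongside the triples is what makes it possible to get back to the portion of the source that was used.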

Page 49: Emerging Trends in Provenance

Future Directions

• Presenting provenance information in LOGD dataset description pages

• Extending visualization APIs to incorporate provenance within interface

• Leveraging provenance connectivity to investigate latent associations among datasets and presentations

[Figure: US-UK Foreign Aid Comparison]
• Queried as RDF
• Providing direct link to original data

Page 50: Emerging Trends in Provenance

Discussion

• Provenance is growing in acceptance, need, and type
• Some interlinguas have emerged that have significant usage and have shown significant value
• Interdisciplinary eScience and open data are increasing the need and potentially the pace
• A few trends we have observed:
  – Domain-specific extensions can be of value
  – Techniques for supporting interaction with large diverse communities are needed (we believe user annotation is one such critical technique)
  – Data aggregators face additional challenges if provenance is not available… and may accelerate the demand for provenance and provenance standards
  – Getting back to the portion of the source used is critical for some
  – Tracking manipulations is critical for some
  – Providing and creating provenance as part of a larger eco-system is key
• Open (govt, science, etc.) data (along with semantic web applications with embedded information about knowledge provenance and term meaning) is providing many new opportunities and will continue to change our lives.

• Questions? dlm <at> cs <dot> rpi <dot> edu
