provenance in open distributed information systems syed imran jami phd candidate fast-nu

Provenance in Open Distributed Information Systems

Syed Imran Jami, PhD Candidate

FAST-NU

Introduction

• Provenance Systems
  – Provenance is metadata that records the origin and history of a target object
  – The metadata contains a log of each step in sourcing, moving, and processing the object
  – Keeps a record of the transformation steps applied to the target object
  – Provides the information needed to recreate the object
  – Helps maintain the quality and reliability of the object
  – Provides a trust mechanism for the object when it is used in simulations and experiments

[Figure: provenance system overview. Labels: Computer (x3); Log 1, Log 2, ..., Log n; Provenance Recording; Representation; Provenance Storage; Processing System; Querying by User and Group.]

Introduction (2)

• Open Distributed Information Systems
  – Information and the sequence of steps performed are distributed among information systems that are independent and may be under different administrative controls
  – Nodes can be heterogeneous
  – Now widely used for collaboration and information sharing
  – Require open (read/write) access to digital artifacts
    • Web 2.0 (blogs, Wikipedia, etc.)
    • Grids and Cloud Computing

Our Problem

• Main Problem
  – To propose and develop a provenance system for open distributed environments
• Research Question
  – How can we develop a provenance model for an information system in an open distributed environment?
• Hypothesis
  1. A provenance model for an information system in an open distributed environment can be developed by incorporating agents that autonomously track the interactions.
  2. Providing a provenance ontology enables the provenance representation in RDF graphs to work in a heterogeneous environment.
  3. The use of an ontology and RDF graphs will also make the system domain independent.

Motivation & Justification

• Most existing provenance systems track data only
  – The definition of data is now changing
  – Information portals in an open environment can contain data, documents, and information
  – Tagged representation in XML reduces the gap between data and documents
• Most existing provenance systems are specialized (domain dependent)
  – Open distributed systems should be able to accommodate any kind of information, i.e. be generic
• The existing systems are not autonomous
  – They require changes to operating systems or workflows in order to track provenance
• Most existing provenance systems do not give importance to heterogeneity
  – It is one of the important factors to consider in open distributed systems

Research Issues

• Provenance tracking, representation, and storage in open distributed systems lead to the following research challenges
  – Autonomy
  – Domain independence
  – Heterogeneity
  – Scalability and efficiency
  – Genericity
  – Mobility
  – Privacy and security

Proposed Solution

• As a testbed we developed an XML-based information system
  – An XML page contains information contributed by different sources and used by different users
  – Each interaction is merged with the main XML page using agents
  – The provenance of each interaction is tracked using a multi-agent system
  – Provenance logs are represented in RDF graphs as triples
  – The logs are stored in distributed locations
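
To make the "logs as RDF triples" idea concrete, here is a minimal sketch in plain Python. It is not the prototype's code: the namespace `http://example.org/provenance#`, the property names (`onArtifact`, `performedBy`, `action`), and the helper functions are all hypothetical, chosen only to show how one interaction could become a few subject-predicate-object statements.

```python
from typing import List, NamedTuple

# Hypothetical namespace and property names, invented for this illustration.
PROV_NS = "http://example.org/provenance#"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def log_interaction(artifact: str, agent: str, action: str, step: int) -> List[Triple]:
    """Describe one interaction on an artifact as a small set of triples."""
    event = f"{artifact}/interaction/{step}"
    return [
        Triple(event, PROV_NS + "onArtifact", artifact),
        Triple(event, PROV_NS + "performedBy", agent),
        Triple(event, PROV_NS + "action", action),
    ]


def history(graph: List[Triple], artifact: str) -> List[str]:
    """Return the interaction nodes recorded for an artifact, in log order."""
    return [t.subject for t in graph
            if t.predicate == PROV_NS + "onArtifact" and t.obj == artifact]


# Three interactions on two pages, appended to one provenance graph.
graph: List[Triple] = []
graph += log_interaction("http://example.org/page/1", "alice", "insert", 1)
graph += log_interaction("http://example.org/page/1", "bob", "edit", 2)
graph += log_interaction("http://example.org/page/2", "alice", "insert", 1)
```

Calling `history(graph, "http://example.org/page/1")` returns the two interaction nodes logged for that page. A real deployment would more likely use an RDF library and a published vocabulary than bare tuples.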

Proposed Solution

• Generic
  – Research Question (1):
    • Can we develop a provenance system that can track not only data but also other digital objects?
  – Most existing systems work for data only
    • For example, they use an RDBMS as the underlying storage mechanism
  – The provenance model should be generic so that it can accommodate data, documents, and other digital artifacts
  – Semantic Grid based techniques can play a role
    • XML reduces the gap between data and documents due to its tagged representation
    • All data formats are translated to XML in the information system
    • Our provenance tracking system will track the interactions performed as an XML tree
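
The translation of a data record into XML, and its merge into the main page as a subtree, can be sketched with Python's standard `xml.etree.ElementTree`. The element names (`page`, `record`), the `contributor` attribute, and the sample record are assumptions made for this example, not the testbed's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def record_to_xml(tag: str, record: Dict[str, object]) -> ET.Element:
    """Translate a flat data record into an XML element, one child per field."""
    elem = ET.Element(tag)
    for key, value in record.items():
        child = ET.SubElement(elem, key)
        child.text = str(value)
    return elem


def merge_contribution(page: ET.Element, contribution: ET.Element,
                       contributor: str) -> ET.Element:
    """Attach a contribution to the main page as a subtree, tagged with its source."""
    contribution.set("contributor", contributor)
    page.append(contribution)
    return page


# Hypothetical main page and one contributed record.
page = ET.Element("page")
row = {"name": "sensor-7", "reading": 21.5}
merge_contribution(page, record_to_xml("record", row), "alice")
xml_text = ET.tostring(page, encoding="unicode")
```

Because each contribution arrives as its own subtree, a tracker can treat "which subtree was added or changed, and by whom" as the unit of an interaction, regardless of whether the original input was tabular data or a document fragment.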

Proposed Solutions

• Autonomy
  – Research Question (Sub problem 1)
    • Can we develop a model that does not require changing or adapting the OS, language platform, or workflow application to track provenance?
  – Goal: provide automated and autonomous tracking
  – Almost all existing systems depend on APIs, OS routines, workflows, etc. to track provenance. This is not recommended for open systems such as grids, since one cannot change the OS or workflows in order to use a provenance-aware information service
  – Multi-agent systems (MAS) can be used to provide autonomy
  – Only one existing work uses MAS to track data provenance, for a health care system (a specialized domain)
  – Among the options considered, a MAS-based system is expected to provide the most autonomous tracking
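
As a rough illustration of the agent idea, the sketch below models a tracking agent that receives interaction messages in a mailbox and keeps its own log, so the application producing the interactions is not modified. This is plain Python rather than a mobile-agent platform, and the class name `TrackerAgent`, its methods, and the message fields are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class TrackerAgent:
    """Hypothetical agent that logs interactions delivered to its mailbox."""
    node: str
    mailbox: Deque[Dict[str, str]] = field(default_factory=deque)
    log: List[Dict[str, str]] = field(default_factory=list)

    def notify(self, interaction: Dict[str, str]) -> None:
        """Queue an interaction message for later processing."""
        self.mailbox.append(interaction)

    def run_once(self) -> int:
        """Drain the mailbox and append one log entry per interaction."""
        handled = 0
        while self.mailbox:
            interaction = self.mailbox.popleft()
            self.log.append({"node": self.node, **interaction})
            handled += 1
        return handled


# One agent per node; each keeps its own log independently of the others.
agent_a = TrackerAgent("node-a")
agent_b = TrackerAgent("node-b")
agent_a.notify({"artifact": "page-1", "user": "alice", "action": "insert"})
agent_a.notify({"artifact": "page-1", "user": "bob", "action": "edit"})
agent_b.notify({"artifact": "page-2", "user": "carol", "action": "insert"})
handled_a = agent_a.run_once()
handled_b = agent_b.run_once()
```

The point of the design is the separation of concerns: the agent owns the logging logic and its schedule, and the host application only needs some way to deliver interaction messages.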

Proposed Solution

• Heterogeneity
  – Research Question (2)
    • Can we develop a provenance system that can track the transformation steps in heterogeneous nodes of an open distributed system?
  – The system should record and track provenance even across heterogeneous nodes
    • Device heterogeneity
    • Platform heterogeneity
    • Semantic (schema) heterogeneity
  – A JVM-based implementation will handle heterogeneity at the device and platform levels
  – Semantic heterogeneity will be addressed by representing provenance metadata as RDF triples in graphs
    • XML and RDF are W3C standards, usable across systems and devices
    • Requires developing an RDF vocabulary for provenance, i.e. an ontology
  – A JVM, XML, and RDF based provenance model will make our system domain independent
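
One way to picture how a shared vocabulary addresses semantic heterogeneity: each node keeps its own field names, and a per-node mapping translates them to common predicates before the triples are emitted. The sketch below uses two Dublin Core element URIs (`creator`, `date`) as the shared terms. The node names, local field names, and sample values are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Dublin Core element set namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical per-node schemas: each maps a local field name to a shared term.
NODE_SCHEMAS: Dict[str, Dict[str, str]] = {
    "node-a": {"author": DC + "creator", "when": DC + "date"},
    "node-b": {"created_by": DC + "creator", "timestamp": DC + "date"},
}


def to_triples(node: str, artifact: str,
               record: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Translate a node-local record into triples that use the shared vocabulary."""
    mapping = NODE_SCHEMAS[node]
    return [(artifact, mapping[key], value)
            for key, value in record.items() if key in mapping]


# Two nodes describe the same fact with different local schemas.
from_a = to_triples("node-a", "urn:doc:1", {"author": "alice", "when": "2009-01-05"})
from_b = to_triples("node-b", "urn:doc:1", {"created_by": "alice", "timestamp": "2009-01-05"})
```

After translation both nodes emit identical triples, so a query over the merged graph does not need to know which node, or which local schema, produced a statement.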

Proposed Solution

• Scalability
  – Research Question (3)
    • Can we make provenance storage and tracking scalable?
  – The tracking system should scale as the number of users in the open distributed system increases
    • Simultaneous recording through agents will make the tracking scalable. Each node is responsible for autonomously tracking its own interactions
  – The scalability of the storage system depends on the location of the provenance store containing the log
    • Stored with the target object or separately?
    • Centralized or decentralized?
    • A decentralized system will be scalable
      – RDF graphs will reside on some other node
        » No single node will be over-utilized
    • Problem: this solution costs efficiency
    • Another solution is to store subgraphs at the local host instead of combining and merging them into one graph
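
The slides do not say how a decentralized store would choose the node that holds a given RDF graph. As one possible policy, offered purely as an assumption, the sketch below hashes the artifact ID and uses the result to pick a node, which spreads graphs across nodes without a central coordinator. The node names and ID format are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List


def store_node(artifact_id: str, nodes: List[str]) -> str:
    """Pick the node that stores an artifact's provenance graph by hashing its ID."""
    digest = hashlib.sha256(artifact_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the node count.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical cluster of four provenance store nodes.
NODES = ["node-a", "node-b", "node-c", "node-d"]

# Count how many of 1000 artifact graphs land on each node.
placement = Counter(store_node(f"urn:artifact:{i}", NODES) for i in range(1000))
```

Any node can recompute the placement from the ID alone, which partly offsets the lookup cost noted above. Simple modular hashing does remap most artifacts when the node list changes, so a system with frequent membership changes might prefer consistent hashing.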

Proposed Solution

• Efficiency
  – Research Question (4)
    • With the proposed solution for scalability, can we make the system efficient enough for fast retrieval of provenance metadata scattered around the system?
  – The scalability solution carries an efficiency overhead
    • Extra time is required to search for RDF graphs
    • Some lookup tables will be required
  – Solution
    • Each digital artifact must be given a unique ID, like a URI
    • Unique IDs should be composed of binary strings
    • The lookup table will use these binary strings for fast retrieval
      – We can use our own ID system
      – A trie-based indexing scheme can be used
        » Requires a very small number of entries to store large strings
        » Lookup cost depends on the width of the strings, not on the number of stored values: O(w), where w is independent of n (the number of IDs)
    • A single RDF graph should be maintained for multiple copies
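
The trie-based lookup can be sketched as a binary trie keyed on fixed-width bit strings. A lookup follows one branch per bit, so it takes at most w steps for a w-bit ID regardless of how many IDs are stored. The class name, the 8-bit width, and the stored values (node names) are assumptions for illustration; the prototype's ID system is not described in enough detail to reproduce here.

```python
from typing import Dict, Optional


def to_bits(artifact_id: int, width: int) -> str:
    """Render a non-negative integer ID as a fixed-width binary string."""
    return format(artifact_id, f"0{width}b")


class BinaryTrie:
    """Maps binary-string IDs to the location of their provenance graph."""

    def __init__(self) -> None:
        self._root: Dict[str, object] = {}

    def insert(self, bits: str, location: str) -> None:
        node = self._root
        for bit in bits:
            if bit not in "01":
                raise ValueError(f"not a binary string: {bits!r}")
            node = node.setdefault(bit, {})
        # The key "value" cannot collide with the single-character keys "0" and "1".
        node["value"] = location

    def lookup(self, bits: str) -> Optional[str]:
        """Follow one branch per bit: at most len(bits) steps."""
        node = self._root
        for bit in bits:
            node = node.get(bit)
            if node is None:
                return None
        return node.get("value")


# Hypothetical 8-bit IDs mapped to the nodes that hold their graphs.
index = BinaryTrie()
index.insert(to_bits(5, 8), "node-a")
index.insert(to_bits(200, 8), "node-c")
```

Here `index.lookup(to_bits(5, 8))` returns `"node-a"`, while an ID that was never inserted, or a bare prefix such as `"0000"`, returns `None`. IDs that share a prefix also share trie nodes, which is where the space saving for large sets of long strings comes from.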

Current Progress

• A prototype application has been developed that serves as a testbed for an information system in an open distributed environment
• The system records the provenance log in an RDF file, which is merged into a single main RDF graph that keeps track of the information
• Dublin Core is used as the ontology for provenance
• Both contributions to the information and the provenance metadata are transmitted through Aglets
• An ID system has been developed to label digital artifacts
• Scalability analysis has been performed on a distributed, tightly coupled provenance store

Results

• Early results show that the provenance log size is independent of file size
• The log size depends on the number of interactions
• Our storage algorithm has some limitations: logs converge at one place

[Figure: Document Size and Doc Prov Size per document. X-axis: Document No. (1 to 10); y-axis: File Size (0 to 25000).]

[Figure: Doc Prov Size and Frames per document. X-axis: Document No. (1 to 10); y-axes: Provenance log size (bytes), 0 to 3000, and Frames (interaction), 0 to 8.]

[Figure: ProvenanceStorage by number of users. X-axis: Users (1 to 19); y-axis: Provenance Storage size (0 to 30000).]

Contribution towards Provenance

• A knowledge provenance architecture for open distributed systems
• Autonomous provenance recording in heterogeneous nodes
• A scalable provenance storage system
• Semantic heterogeneity of the provenance system handled using a provenance ontology
• A domain-independent provenance system

Publications

• Syed Imran Jami and Zubair A. Shaikh, "A workflow based academic management system using multi agent approach", Proceedings of the 11th WSEAS International Conference on Computers, Agios Nikolaos, Crete Island, Greece, pp. 202-207, 2007, ISSN 1790-5117
• Imran Jami and Zubair A. Shaikh, "A Multi Agent based Architecture for Data Provenance in Semantic Grid", Proceedings of the International Multi-Conference of Engineers and Computer Scientists, Hong Kong, pp. 360-364, 2008, ISBN 978-988-98671-8-8
• Syed Imran Jami, Jemal Abawajy, and Zubair A. Shaikh, "A Taxonomy of Provenance Models for Open Distributed Systems", submitted to the Journal of Information Sciences, Elsevier (impact factor 2.147)
• Syed Imran Jami, Jemal Abawajy, and Zubair A. Shaikh, "Information Provenance for Open Distributed Collaborative System", to be submitted to an ACS high-impact conference
