Provenance in Open Distributed Information Systems
Syed Imran Jami, PhD Candidate
FAST-NU
Introduction
• Provenance Systems
– Provenance is metadata that records the origin and history of a target object
– The metadata logs each step in sourcing, moving, and processing the object
– It records the transformation steps applied to the target object
– It provides the information needed to recreate the object
– It helps maintain the quality and reliability of the object
– It provides a trust mechanism for using the object in simulations and experiments
[Figure: provenance system overview. Labels: Computer (x3); Log 1, Log 2, Log n; Provenance Recording; Representation; Provenance Storage; Processing System; User, Group; Querying]
Introduction (2)
• Open Distributed Information Systems
– Information and the sequence of steps performed are distributed among information systems that are independent and may be under different administrative controls
– Nodes can be heterogeneous
– Now widely used for collaboration and information sharing
– Require open (read/write) access to digital artifacts
    • Web 2.0 (blogs, Wikipedia, etc.)
    • Grids and Cloud Computing
Our Problem
• Main Problem
– To propose and develop a provenance system for an open distributed environment
• Research Question
– How can we develop a provenance model for an information system in an open distributed environment?
• Hypothesis
1. A provenance model for an information system in an open distributed environment can be developed by incorporating agents that autonomously track the interactions.
2. A provenance ontology enables provenance represented as RDF graphs to work in a heterogeneous environment.
3. The use of an ontology and RDF graphs also makes the system domain independent.
Motivation & Justification
• Most existing provenance systems track data only
– The definition of data is changing
– Information portals in an open environment can contain data, documents, and information
– Tagged representation in XML narrows the gap between data and documents
• Most existing provenance systems are specialized (domain dependent)
– Open distributed systems should accommodate any kind of information (generic)
• Existing systems are not autonomous
– They require changes to operating systems or workflows in order to track provenance
• Most existing provenance systems give little importance to heterogeneity
– It is an important factor in open distributed systems
Research Issues
• Provenance tracking, representation, and storage in open distributed systems lead to the following research challenges
– Autonomy
– Domain independence
– Heterogeneity
– Scalability and efficiency
– Genericity
– Mobility
– Privacy and security
Proposed Solution
• As a testbed, we developed an XML-based information system
– An XML page contains information contributed by different sources and used by different users
– Each interaction is merged into the main XML page using agents
– The provenance of each interaction is tracked using a multi-agent system
– Provenance logs are represented as triples in RDF graphs
– The logs are stored in distributed locations
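To make the "logs as triples" idea concrete, here is a minimal sketch of how one interaction might be turned into (subject, predicate, object) triples. The `PROV` prefix, the predicate names, and the `record_interaction` helper are hypothetical illustrations, not the vocabulary or API used in the prototype.

```python
from datetime import datetime, timezone

# Hypothetical vocabulary prefix; the slides do not name the actual terms.
PROV = "http://example.org/prov#"


def record_interaction(artifact_uri, agent_uri, action, when=None):
    """Return the (subject, predicate, object) triples for one interaction."""
    when = when or datetime.now(timezone.utc)
    # One node per interaction, so that its properties stay grouped together.
    event = f"{artifact_uri}#event-{int(when.timestamp())}"
    return [
        (event, PROV + "onArtifact", artifact_uri),
        (event, PROV + "performedBy", agent_uri),
        (event, PROV + "action", action),
        (event, PROV + "timestamp", when.isoformat()),
    ]


triples = record_interaction(
    "http://example.org/pages/main.xml",
    "http://example.org/agents/agent-7",
    "append",
    when=datetime(2009, 1, 1, tzinfo=timezone.utc),
)
for triple in triples:
    print(triple)
```

Each interaction yields a small, self-contained set of triples, which is what allows a log to be stored away from the page it describes.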
Proposed Solution
• Generic
– Research Question (1):
    • Can we develop a provenance system that tracks not only data but also other digital objects?
– Most existing systems work for data only
    • For example, they use an RDBMS as the underlying storage mechanism
    • The provenance model should be generic enough to accommodate data, documents, and other digital artifacts
– Semantic Grid based techniques can play a role
    • XML narrows the gap between data and documents through its tagged representation
    • All data formats are translated to XML in the information system
    • Our provenance tracking system will track the interactions performed on the XML tree
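As an illustration of tracking an interaction on the XML tree, the sketch below appends a contributed fragment to a page and returns a description of what changed. It uses Python's standard `xml.etree.ElementTree`; the `merge_contribution` helper, the page layout, and the fields of the returned record are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def merge_contribution(page_root, parent_path, fragment_xml, contributor):
    """Append a contributed fragment under parent_path and describe the change."""
    parent = page_root if parent_path == "." else page_root.find(parent_path)
    if parent is None:
        raise ValueError(f"no element at {parent_path!r}")
    fragment = ET.fromstring(fragment_xml)
    parent.append(fragment)
    # The returned record is what a tracking agent could log as provenance.
    return {
        "contributor": contributor,
        "parent": parent_path,
        "added_tag": fragment.tag,
        "position": len(parent) - 1,
    }


page = ET.fromstring("<page><section id='intro'/></page>")
entry = merge_contribution(page, "section", "<para>Hello</para>", "user-a")
print(entry)
print(ET.tostring(page, encoding="unicode"))
```

Because every data format is translated to XML first, one tree-level record like this can describe a change to a document and a change to a data record in the same way.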
Proposed Solutions
• Autonomy
– Research Question (sub-problem 1)
    • Can we develop a model that does not require changing or adapting the OS, language platform, or workflow application to track provenance?
– The goal is automated, autonomous tracking
– Almost all existing systems depend on APIs, OS routines, workflows, etc. to track provenance. This is not suitable for open systems such as grids, where one cannot change the OS or workflows to use a provenance-aware information service
– Multi-agent systems (MAS) can provide the autonomy
– Only one existing work uses MAS to track data provenance, for a health care system (a specialized domain)
– Among the options considered, a MAS-based system offers the most autonomy
Proposed Solution
• Heterogeneity
– Research Question (2)
    • Can we develop a provenance system that tracks the transformation steps across heterogeneous nodes of an open distributed system?
– The system should record and track provenance even across heterogeneous nodes
    • Device heterogeneity
    • Platform heterogeneity
    • Semantic (schema) heterogeneity
– A JVM-based implementation handles heterogeneity at the device and platform levels
– Semantic heterogeneity is addressed by representing provenance metadata as RDF triples in graphs
    • XML and RDF are W3C standards, not tied to a particular system or device
    • This requires developing an RDF vocabulary (ontology) for provenance
    • A provenance model based on the JVM, XML, and RDF makes our system domain independent
Proposed Solution
• Scalability
– Research Question (3)
    • Can we make provenance storage and tracking scalable?
– The tracking system should scale as the number of users in the open distributed system grows
    • Simultaneous recording through agents makes the tracking scalable. Each node is responsible for autonomously tracking its own interactions
– The scalability of storage depends on where the provenance store containing the log is located
    • With the target object, or separate from it?
    • Centralized or decentralized?
    • A decentralized system will be scalable
        – RDF graphs will reside on some other node
            » No single node will be overutilized
    • Problem: this solution costs efficiency
    • Another option is to store sub-graphs at the local host instead of combining and merging them into one
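The option of keeping sub-graphs on the local host can be sketched as follows: each node stores only the triples it recorded, and a small directory remembers which nodes hold a piece of each artifact's history. The `LocalStore` and `Directory` classes are hypothetical names for this illustration, and the in-memory dictionaries stand in for what would be remote nodes in the real system.

```python
from collections import defaultdict


class LocalStore:
    """Provenance sub-graph kept on one node (a list of triples)."""

    def __init__(self, node_name):
        self.node_name = node_name
        self.triples = []

    def add(self, triple):
        self.triples.append(triple)


class Directory:
    """Lookup table: artifact ID -> names of nodes holding a sub-graph for it."""

    def __init__(self):
        self.holders = defaultdict(set)
        self.stores = {}

    def register(self, store):
        self.stores[store.node_name] = store

    def record(self, node_name, triple):
        # The triple stays on the recording node; only its location is shared.
        self.stores[node_name].add(triple)
        self.holders[triple[0]].add(node_name)

    def provenance_of(self, artifact_id):
        """Gather triples from every node that holds a piece, at query time."""
        result = []
        for name in sorted(self.holders.get(artifact_id, ())):
            store = self.stores[name]
            result.extend(t for t in store.triples if t[0] == artifact_id)
        return result


directory = Directory()
for name in ("node-a", "node-b"):
    directory.register(LocalStore(name))
directory.record("node-a", ("doc-1", "editedBy", "alice"))
directory.record("node-b", ("doc-1", "editedBy", "bob"))
directory.record("node-b", ("doc-2", "createdBy", "bob"))
history = directory.provenance_of("doc-1")
print(history)
```

Writes stay cheap and local, while the cost moves to query time, when the pieces must be located and gathered. That is the efficiency trade-off noted in the slide.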
Proposed Solution
• Efficiency
– Research Question (4)
    • Given the proposed scalability solution, can we make the system efficient at retrieving provenance metadata scattered around the system?
    • The scalability solution comes at the cost of lower efficiency
        – Extra time is required to search for RDF graphs
        – Some lookup tables will be required
    • Solution
        – Each digital artifact must be given a unique ID, such as a URI
        – Unique IDs should be composed of binary strings
        – The lookup table will use these binary strings for fast retrieval
            • Can use our own ID system
            • A trie-based indexing scheme can be used
                – Requires very few entries to store large strings
                – Lookup cost depends on the width of the strings, not the number of values: O(w), where w is independent of n (the number of IDs)
        – A single RDF graph should be maintained for multiple copies
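A minimal sketch of a trie over binary-string IDs shows why lookup cost follows the ID width w rather than the number of stored IDs n: a lookup visits at most one node per bit. The `BinaryTrie` class and the example IDs and locations are illustrative assumptions, not the ID system developed for the prototype.

```python
class BinaryTrie:
    """Maps binary-string IDs to values; a lookup visits at most w nodes."""

    def __init__(self):
        self.root = {}

    def insert(self, bits, value):
        node = self.root
        for b in bits:
            if b not in "01":
                raise ValueError("ID must be a binary string")
            # Create the child for this bit on first use, then descend.
            node = node.setdefault(b, {})
        node["value"] = value

    def lookup(self, bits):
        node = self.root
        for b in bits:
            node = node.get(b)
            if node is None:
                return None  # no stored ID has this prefix
        return node.get("value")


index = BinaryTrie()
index.insert("0101", "node-a/doc-1.rdf")
index.insert("0110", "node-b/doc-2.rdf")
print(index.lookup("0101"))
print(index.lookup("1111"))
```

IDs that share a prefix share nodes ("0101" and "0110" share the path for "01"), which keeps the table compact as more IDs are added.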
Current Progress
• A prototype application has been developed that serves as a testbed for an information system in an open distributed environment
• The system tracks each provenance log in an RDF file, which is merged into a single main RDF graph that keeps track of the information
• Dublin Core is used as the ontology for provenance
• Both contributions to the information and the provenance metadata are transmitted through Aglets
• An ID system has been developed to label digital artifacts
• Scalability analysis has been performed on a distributed, tightly coupled provenance store
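As a rough sketch of merging per-interaction logs into one main graph with Dublin Core terms, the example below treats a graph as a set of triples and a merge as a set union, then serializes the result as N-Triples. The namespace is the standard Dublin Core element set. Which Dublin Core elements the prototype actually uses is not stated in the slides, so `contributor` and `date` are assumptions, as are the helper names.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def dc_triples(artifact_uri, **fields):
    """Build a set of (subject, predicate, literal) triples from DC fields."""
    return {(artifact_uri, DC + name, value) for name, value in fields.items()}


def to_ntriples(graph):
    """Serialize triples with literal objects as N-Triples lines, sorted."""
    lines = []
    for s, p, o in sorted(graph):
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


main_graph = set()
page = "http://example.org/pages/main.xml"
# Merging a log into the main graph is a set union, so repeats are harmless.
main_graph |= dc_triples(page, contributor="user-a", date="2009-01-01")
main_graph |= dc_triples(page, contributor="user-b", date="2009-01-02")
print(to_ntriples(main_graph))
```

This flat form does not record which date belongs to which contributor. A fuller log would attach both to a per-interaction node.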
Results
• Early results show that the provenance log size is independent of file size
• The log size depends on the number of interactions
• Our storage algorithm has some limitations: logs converge at one place
[Chart: Document Size and Doc Prov Size (y axis: File Size) by Document No., 1 to 10]
[Chart: Doc Prov Size (left y axis: Provenance log size, bytes) and Frames (right y axis: Frames, interactions) by Document No., 1 to 10]
[Chart: Provenance Storage size by number of Users, 1 to 19]
Contribution towards Provenance
• A knowledge provenance architecture for open distributed systems
• Autonomous provenance recording on heterogeneous nodes
• A scalable provenance storage system
• Semantic heterogeneity of the provenance system handled using a provenance ontology
• A domain-independent provenance system
Publications
• Syed Imran Jami and Zubair A. Shaikh, "A workflow based academic management system using multi agent approach", Proceedings of the 11th WSEAS International Conference on Computers, Agios Nikolaos, Crete Island, Greece, pp. 202-207, 2007, ISSN: 1790-5117
• Imran Jami and Zubair A. Shaikh, "A Multi Agent based Architecture for Data Provenance in Semantic Grid", Proceedings of International Multi-Conference of Engineers and Computer Scientists, Hong Kong, pp. 360-364, 2008, ISBN: 978-988-98671-8-8
• Syed Imran Jami, Jemal Abawajy, Zubair A. Shaikh, "A Taxonomy of Provenance Models for Open Distributed Systems", submitted to Journal of Information Sciences, Elsevier (impact factor 2.147)
• Syed Imran Jami, Jemal Abawajy, Zubair A. Shaikh, "Information Provenance for Open Distributed Collaborative System", to be submitted to an ACS high-impact conference