Provenance in Open Distributed Information Systems

1
Provenance in Open Distributed Information Systems

PhD Scholar: Syed Imran Jami
Date: 17 February 2009
Presented under CRUC Weekly Research Seminar

2
Introduction

Provenance Systems
Provenance is metadata that records the origin and history of a target object.
The metadata contains a log of each step in sourcing, moving, and processing the object.
Keeps a record of the transformation steps applied to the target object
Provides the information needed to recreate the object
Helps maintain the quality and reliability of the object
Provides a trust mechanism for using the object in simulations and experiments

3
(No Transcript)
4
Introduction (2)

Open Distributed Information Systems
Information and the sequence of steps performed on it are distributed among information systems that are independent and may be under different administrative control
Nodes can be heterogeneous
Now widely used for collaboration and information sharing
Require open (read/write) access to digital artifacts
Web 2.0 (blogs, Wikipedia, etc.)
Grids and cloud computing

5
Our Problem

Main Problem
To propose and develop a provenance system for open distributed environments
Research Question
How can we develop a provenance model for an information system in an open distributed environment?
Hypothesis
A provenance model for an information system in an open distributed environment can be developed by incorporating agents that autonomously track the interactions.
Providing a provenance ontology enables provenance represented as RDF graphs to work in a heterogeneous environment.
The use of an ontology and RDF graphs will also make the system domain independent.

6
Motivation and Justification

Most existing provenance systems track data only
The definition of data is changing
Information portals in open environments can contain data, documents, and information
Tagged representation in XML narrows the gap between data and documents
Most existing provenance systems are specialized (domain dependent)
Open distributed systems should be able to accommodate any kind of information, i.e. be generic
Existing systems are not autonomous
They require changes to operating systems or workflows in order to track provenance
Most existing provenance systems give little importance to heterogeneity
It is an important factor to consider in open distributed systems

7
Research Issues

Provenance tracking, representation, and storage in open distributed systems lead to the following research challenges:
Autonomy
Domain independence
Heterogeneity
Scalability and efficiency
Genericity
Mobility
Privacy and security

8
Proposed Solution

As a testbed, we developed an XML-based information system
An XML page contains information contributed by different sources and used by different users
Each interaction is merged into the main XML page by agents
The provenance of each interaction is tracked by a multi-agent system
Provenance logs are represented as triples in RDF graphs
The logs are stored in distributed locations
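
The representation of a provenance log as RDF triples can be sketched as follows. This is a minimal illustration in Python, not the prototype's code: the `http://example.org/` namespace, the function names, and the choice of which Dublin Core elements describe an interaction are all assumptions made for the example. Only the Dublin Core element set namespace itself is a real vocabulary.

```python
# Hypothetical sketch: one tracked interaction expressed as RDF triples.
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
EX = "http://example.org/"                # placeholder namespace (assumption)


def interaction_triples(log_id, artifact_id, agent_id, timestamp, action):
    """Return (subject, predicate, object) triples for a single interaction."""
    log = f"{EX}log/{log_id}"
    return [
        (log, DC + "source", f"{EX}artifact/{artifact_id}"),  # object acted on
        (log, DC + "creator", f"{EX}agent/{agent_id}"),       # who acted
        (log, DC + "date", timestamp),                        # when
        (log, DC + "description", action),                    # what was done
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines; non-URI objects become plain literals."""
    def term(value):
        if value.startswith("http://"):
            return f"<{value}>"
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return "\n".join(f"<{s}> <{p}> {term(o)} ." for s, p, o in triples)


triples = interaction_triples("42", "page1", "alice",
                              "2009-02-17T10:00:00Z", "appended section")
print(to_ntriples(triples))
```

Each interaction yields a small, self-contained sub-graph, so logs written on different nodes can later be combined by simply concatenating their triples.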

9
Proposed Solution

Generic
Research Question (1)
Can we develop a provenance system that tracks not only data but also other digital objects?
Most existing systems work for data only
For example, they use an RDBMS as the underlying storage mechanism
The provenance model should be generic enough to accommodate data, documents, and other digital artifacts
Semantic Grid based techniques can play a role
XML narrows the gap between data and documents through its tagged representation
All data formats are translated to XML in the information system
Our provenance tracking system will track the interactions performed as an XML tree
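
One possible reading of the last two points is that every contribution becomes a node in the XML page and the tracker records where in the tree it was applied. The sketch below illustrates that reading with Python's standard `xml.etree.ElementTree` module. The element names, the `merge_contribution` helper, and the shape of the interaction record are hypothetical and are not taken from the prototype.

```python
import xml.etree.ElementTree as ET


def merge_contribution(page, parent_path, tag, text, contributor):
    """Append a new element under parent_path and return an interaction record."""
    parent = page if parent_path == "." else page.find(parent_path)
    if parent is None:
        raise ValueError(f"no element at {parent_path!r}")
    child = ET.SubElement(parent, tag)
    child.text = text
    # The record notes who changed the page and where in the tree.
    return {"contributor": contributor, "action": "append",
            "path": f"{parent_path}/{tag}"}


page = ET.fromstring("<page><section name='intro'/></page>")
record = merge_contribution(page, "section", "para",
                            "Provenance is metadata.", "alice")
print(ET.tostring(page, encoding="unicode"))
print(record)
```

Because the record refers to a path in the tree and not to a database row or file offset, the same tracking logic applies whether the original contribution was data or a document.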

10
Proposed Solution

Autonomy
Research Question (Sub problem 1)
Can we develop a model that does not require changing or adapting the OS, language platform, or workflow application to track provenance?
To provide automated and autonomous tracking
Almost all existing systems depend on APIs, OS routines, workflows, etc. to track provenance, which is not suitable for open systems like grids, since one cannot change the OS or workflows in order to use a provenance-aware information service
Multi-agent systems (MAS) can be used to provide autonomy
Only one existing work uses MAS to track data provenance, for a health care system (a specialized domain)
Among the available options, a MAS-based system is expected to provide the most autonomous tracking

11
Proposed Solution

Heterogeneity
Research Question (2)
Can we develop a provenance system that tracks transformation steps across the heterogeneous nodes of an open distributed system?
The system should record and track provenance even across heterogeneous nodes
Device heterogeneity
Platform heterogeneity
Semantic (schema) heterogeneity
A JVM-based implementation will handle heterogeneity at the device and platform levels
Semantic heterogeneity will be addressed by representing provenance metadata as RDF triples in graphs
XML and RDF are W3C standards usable across systems and devices
Requires developing an RDF vocabulary for the provenance ontology
A JVM, XML, and RDF based provenance model will make our system domain independent

12
Proposed Solution

Scalability
Research Question (3)
Can we make provenance storage and tracking
scalable?
The tracking system should scale as the number of users in the open distributed system increases
Simultaneous recording through agents will make the tracking scalable: each node is responsible for autonomously tracking its interactions
The scalability of storage depends on the location of the provenance store containing the log
With the target object, or separate?
Centralized or decentralized?
A decentralized system will be scalable
RDF graphs will reside on some other node
No single node will be overutilized
Problem: this solution comes at the cost of efficiency
Another solution is to store sub-graphs at the local host instead of combining and merging them into one
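
The trade-off between decentralized storage and retrieval cost can be illustrated with a small sketch. It assumes, hypothetically, that each node keeps only the triples it recorded itself and that a query must ask every node. The class and function names and the `artifact:`/`agent:` identifiers are invented for the example.

```python
class LocalProvenanceStore:
    """One node's store: holds only the triples recorded on that node."""

    def __init__(self, node_name):
        self.node_name = node_name
        self.triples = []

    def record(self, subject, predicate, obj):
        self.triples.append((subject, predicate, obj))

    def about(self, subject):
        return [t for t in self.triples if t[0] == subject]


def query_all(stores, subject):
    """Gather the sub-graphs for a subject; also report how many nodes were asked."""
    found = []
    for store in stores:
        found.extend(store.about(subject))
    return found, len(stores)


node_a = LocalProvenanceStore("node-a")
node_b = LocalProvenanceStore("node-b")
node_a.record("artifact:page1", "editedBy", "agent:alice")
node_b.record("artifact:page1", "editedBy", "agent:bob")
node_b.record("artifact:page2", "editedBy", "agent:bob")

history, contacted = query_all([node_a, node_b], "artifact:page1")
print(history, contacted)
```

Writes touch only the local node, so no single store grows with the whole system, but without an index a lookup has to contact every node. That is the efficiency cost noted above.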

13
Proposed Solution

Efficiency
Research Question (4)
With the proposed scalability solution, can our system still quickly retrieve provenance metadata scattered around the system?
The scalability solution carries an efficiency overhead
Extra time is required to search for RDF graphs
Some lookup tables will be required
Solution
Each digital artifact must be given a unique ID, like a URI
Unique IDs should be composed of binary strings
The lookup table will use these binary strings for fast retrieval
We can use our own ID system
A single RDF graph should be maintained for multiple copies
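
A lookup table keyed by binary-string IDs might look like the sketch below. The slides do not say how IDs are derived, so hashing the URI with SHA-256 and keeping the first 64 bits is purely an assumption, as are the `LookupTable` class and its method names. A truncated hash is also not guaranteed unique, so a real ID system would need to handle collisions.

```python
import hashlib


def artifact_id(uri, bits=64):
    """Derive a fixed-length binary-string ID from a URI (hash choice is an assumption)."""
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (256 - bits)
    return format(value, f"0{bits}b")


class LookupTable:
    """Maps artifact URIs to IDs, and IDs to the nodes holding their sub-graphs."""

    def __init__(self):
        self._ids = {}        # URI -> binary-string ID
        self._locations = {}  # ID -> set of node names

    def register(self, uri, node):
        aid = self._ids.setdefault(uri, artifact_id(uri))
        self._locations.setdefault(aid, set()).add(node)
        return aid

    def register_copy(self, copy_uri, original_uri):
        # A copy shares the original's ID, so both resolve to one provenance graph.
        aid = self._ids[original_uri]
        self._ids[copy_uri] = aid
        return aid

    def locate(self, uri):
        return sorted(self._locations[self._ids[uri]])


table = LookupTable()
aid = table.register("http://example.org/artifact/page1", "node-a")
table.register("http://example.org/artifact/page1", "node-b")
table.register_copy("http://example.org/mirror/page1",
                    "http://example.org/artifact/page1")
print(aid)
print(table.locate("http://example.org/mirror/page1"))
```

With such a table, retrieval goes directly to the nodes that hold the relevant sub-graphs without searching the whole system, and copies of an artifact do not multiply the provenance graphs.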

14
Current Progress

A prototype application has been developed that serves as a testbed for an information system in an open distributed environment
The system records each provenance log in an RDF file, which is merged into a single main RDF graph that keeps track of the information
Dublin Core is used as the ontology for provenance
Both contributions to the information and provenance metadata are transmitted through Aglets
An ID system has been developed to label digital artifacts
Scalability analysis has been performed on a distributed, tightly coupled provenance store

15
Results

Early results show that the provenance log is independent of file size
The logs depend on the interactions
Our storage algorithm has some limitations: logs are converging at one place

16
Contribution towards Provenance

A knowledge provenance architecture for open distributed systems
Autonomous provenance recording in heterogeneous nodes
A scalable provenance storage system
Semantic heterogeneity of the provenance system using a provenance ontology
A domain-independent provenance system

17
Publications

Syed Imran Jami and Zubair A. Shaikh, "A workflow based academic management system using multi agent approach", Proceedings of the 11th WSEAS International Conference on Computers, Agios Nikolaos, Crete Island, Greece, pp. 202-207, 2007, ISSN 1790-5117
Imran Jami and Zubair A. Shaikh, "A Multi Agent based Architecture for Data Provenance in Semantic Grid", Proceedings of International Multi-Conference of Engineers and Computer Scientists, Hong Kong, pp. 360-364, 2008, ISBN 978-988-98671-8-8
Syed Imran Jami, Jemal Abawajy, Zubair A. Shaikh, "A Taxonomy of Provenance Models for Open Distributed Systems", submitted to Journal of Information Sciences, Elsevier, Impact Factor 2.147
Syed Imran Jami, Jemal Abawajy, Zubair A. Shaikh, "Information Provenance for Open Distributed Collaborative System", about to be submitted to an ACS high impact conference