Title: PROVENANCE
1PROVENANCE
Department of Computer Science, Software Engineering Research Group, Berlin, Germany
Welcome to this Presentation
2Presentation Agenda
- What is Provenance?
- Why is provenance important, and what are its two major strands?
- Provenance and Linked Data
- Provenance Data Model
- Provenance Vocabularies
- The Open Provenance Model
- Provenance Data Quality Assessment
- Summary: Scientific and Technical Challenges of Provenance
3What is Provenance?
- Provenance
- Recording the history of data and its place of origin
- Provenance Dictionary Definitions
- Merriam-Webster online dictionary: origin, source
- Oxford English Dictionary: the place of origin or earliest known history of something; origin, derivation
- Provenance Definitions
- 1. Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. (Yolanda)
- 2. Provenance is a description of how things came to be, and how they came to be in the state they are in today. Statements about provenance can themselves be considered to have provenance. (Jim M)
Continues ...
4What is Provenance?
- Provenance Working Definitions
- Provenance of a resource is a record that
describes entities and processes involved in
producing and delivering or otherwise influencing
that resource. Provenance provides a critical
foundation for assessing authenticity, enabling
trust, and allowing reproducibility. Provenance
assertions are a form of contextual metadata and
can themselves become important records with
their own provenance. (W3C)
- Provenance Web Definition
- 4. On the web, provenance would include
information about the creation and publication of
web resources as well as information about access
of those resources, and activities related to
their discussion, linking, and reuse.
Continues ...
5What is Provenance?
- Provenance Definitions
- 5. Provenance is documentation of the set of artifacts, processes, and agents that have caused an artifact to be, and of the contexts of these entities. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility, and assertions of provenance can themselves become important records with their own provenance. (Jim M)
6What kind of History?
- Data Creator/Data Publisher
- Data Creation Date
- Data Modifier / Modification Date
- Data Description
- Etc...
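The kinds of history listed above could be held in a small record type. The following Python sketch is illustrative only: the class name `ProvenanceRecord`, its field names, and the sample values are assumptions made for this example and do not come from the presentation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical container for the history items listed on the slide."""
    creator: str                      # data creator / data publisher
    created: date                     # data creation date
    description: str = ""             # free-text data description
    modifier: Optional[str] = None    # who last modified the data, if anyone
    modified: Optional[date] = None   # when it was last modified, if ever

    def last_touched(self) -> date:
        """Return the modification date if present, else the creation date."""
        return self.modified or self.created


# Made-up sample values, for illustration only.
record = ProvenanceRecord(
    creator="Example Publisher",
    created=date(2010, 11, 1),
    description="Sample dataset",
    modifier="Example Editor",
    modified=date(2011, 2, 15),
)
print(record.last_touched())
```

Keeping the modifier and modification date optional reflects that a data item may never have been changed after creation.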
7Why is Provenance Important?
- Provenance is needed for data integration and reuse
- Data comes from diverse data sources
- Varying Quality
- Different Scope
- Different Assumptions
8Two major strands of Provenance
- Data Provenance
- Data provenance is fine-grained provenance
- Derivation of a piece of data, i.e. the results of transformations
- Description of the origin of a piece of data and the process by which it arrives in a database
- Workflow Provenance
- Workflow provenance is coarse-grained provenance
- Refers to records of the history of the derivation of the final output of a workflow
- Typically performed for complex processing tasks
9Data And Workflow Provenance
Data Provenance
Information describing how data has moved through a network of databases is referred to as fine-grained or data provenance. Fine-grained provenance can be further categorized into where-, how- and why-provenance. A query execution simply copies data elements from some source to some target database, and where-provenance identifies the source elements from which the data in the target was copied. Why-provenance provides justification for the data elements appearing in the output, and how-provenance describes how some parts of the input influenced certain parts of the output.
Workflow Provenance
Information describing how derived data has been calculated from raw observations is referred to as coarse-grained or workflow provenance. The widespread use of workflow tools for processing scientific data facilitates the capture of provenance information. The workflow process describes all the steps involved in producing a given data set and hence captures its provenance information.
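As a rough illustration of where-provenance, the Python sketch below copies values from a source table into a target and tags each copied value with the source location it came from. The table contents, the function name `select_with_where_provenance`, and the (table, row index, column) location format are assumptions made for this example. Why- and how-provenance need richer bookkeeping and are not shown.

```python
# Source "database": table name -> list of rows. Sample data is made up.
source = {
    "employees": [
        {"id": 1, "name": "Ada", "dept": "R&D"},
        {"id": 2, "name": "Bob", "dept": "Sales"},
        {"id": 3, "name": "Cy", "dept": "R&D"},
    ]
}


def select_with_where_provenance(db, table, column, predicate):
    """Copy `column` from rows matching `predicate`, tagging each copied
    value with the source location (table, row index, column) it came from."""
    result = []
    for row_index, row in enumerate(db[table]):
        if predicate(row):
            location = (table, row_index, column)
            result.append((row[column], location))
    return result


target = select_with_where_provenance(
    source, "employees", "name", lambda row: row["dept"] == "R&D"
)
for value, location in target:
    print(value, "copied from", location)
```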
10Provenance Dimensions - 1
- Content of Provenance Information
- Attribution - provenance as the sources or entities that were used to create a new result
- Responsibility - knowing who endorses a particular piece of information or result
- Origin - recorded vs reconstructed, verified vs non-verified, asserted vs inferred
- Process - provenance as the process that yielded an artifact
- Reproducibility (e.g. workflows, mashups, text extraction)
- Data Access (e.g. access time, accessed server, party responsible for accessed server)
- Evolution and versioning
- Republishing (e.g. re-tweeting, re-blogging, re-publishing)
- Updates (e.g. a document with content from various sources and that changes over time)
- Justification for decisions - includes argumentation, hypotheses, why-not questions
- Entailment - given the results to a particular query, what tuples led to those results
11Provenance Dimensions - 2
- Management of Provenance Information
- Publication - making provenance information available (expose, distribute)
- Access - finding and querying provenance information
- Dissemination control - track policies specified by the creator for when/how an artifact can be used
- Access Control - incorporate access control policies to access provenance information
- Licensing - stating what rights the object creators and users have based on provenance
- Law enforcement (e.g. enforcing privacy policies on the use of personal information)
- Scale - how to operate with large amounts of provenance information
- Use of Provenance Information
- Understanding - end user consumption of provenance
- abstraction, multiple levels of description, summary
- presentation, visualization
12Provenance Dimensions - 3
- Interoperability - combining provenance produced by multiple different systems
- Comparison - finding what is common in the provenance of two or more entities
- Accountability - the ability to check the provenance of an object with respect to some expectation
- Verification - of a set of requirements
- Compliance - with a set of policies
- Trust - making trust judgments based on provenance
- Information quality - choosing among competing evidence from diverse sources (e.g. linked data use cases)
- Incorporating reputation and reliability ratings with attribution information
- Imperfections - reasoning about provenance information that is not complete or correct
- Incomplete provenance
- Uncertain/probabilistic provenance
- Erroneous provenance
- Fraudulent provenance
- Debugging
13Web of Data
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G
Perez, Provenance eScience to the Web of Data,
11/09
14The Linked Data Paradigm
- How can we exploit all the available data?
- Data can be reused and remixed
- Common, flexible and usable APIs
- Standard vocabularies to describe interlinked datasets
- Various Tools
- Understand the Semantic Web vision
15Provenance and Linked Data
- Provenance provides the ability to
- Trace the sources of various kinds of data
- Enable the exploration of relationships between datasets, their authors and affiliations
- Provenance analysis provides insight into how data is produced and exploited
- Provenance creates a notion of information quality
- Is a certain dataset consistent and up to date?
- Is the connection between two datasets meaningful?
- Is a given dataset relevant for a particular domain?
- Provenance to establish information trustworthiness
- Provenance to provide data views relating to some criteria
16The Provenance Data Model
Institutional Level
Metadata associated with the origin in terms of its data attributes (e.g. AuthorName, Title, URL, etc.)
Experimental Protocol Level
The origin of datasets (e.g. history, area, region, organisation or institution)
Data Analysis and Significance Level
The statistical analysis methodology applied to datasets for selecting relevant attributes (e.g. whether datasets are divided into parts, output values, versions, etc.)
Dataset Description Level
Who published the datasets. The vocabulary of interlinked datasets, such as Dublin Core, voiD, PRV, etc.
17The Provenance Related Vocabularies
- DC: Dublin Core
- FOAF: Friend of a Friend
- SIOC: Semantically-Interlinked Online Communities
- WOT: Web of Trust Schema
- OMV: Ontology Metadata Vocabulary
- SWP: Semantic Web Publishing
- VoiD: Vocabulary of Interlinked Datasets
- PRV: Provenance Vocabulary
- PML: Proof Markup Language
- PAV: SWAN provenance ontology
- OUZO: Provenance ontology
- CS: Changeset Vocabulary
- Etc.
18Provenance Related Metadata
- Provenance-related metadata is either directly attached to a data item or its host document, or it is available as additional data on the web.
- Examples of attached metadata are RDF statements about the RDF graph that contains those statements, the author name and creation date of blog entries added to a syndication feed, or information about an image. Detached metadata can be represented in RDF using vocabularies.
19A Provenance Architecture for the Web of Data
Application Layer
Authoritative agencies are needed to certify data provenance and keep it secure
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G
Perez, Provenance eScience to the Web of Data,
11/09
20Main Action Points
Provenance Vocabularies
Awareness of Data Providers
Tools for Data Providers
Represent and reason with trust and information
quality
Generalization of Provenance Metadata
W3C Provenance Incubator Group
Extend emerging Linked data vocabularies
Provenance Authoritative Agencies
Linked Data Standards (VOiD)
Provenance Visualization
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G
Perez, Provenance eScience to the Web of Data,
11/09
21The Open Provenance Model
- The Open Provenance Model (OPM) describes the processes by which data is produced or transformed into a new state. It can also represent the transition of one or more data items from an old state to a new state.
- OPM is a graph model for provenance: the edges of the graph denote the relationships between the occurrences represented by the nodes.
- The main purpose of OPM is to support the assessment of various data qualities such as reliability, accuracy and timeliness.
22OPM Classifies Nodes into Three Types
Artifacts
Artifacts are pieces of data of fixed value and context that may represent an entity in a given state. Edges can also carry annotations that provide information on how one occurrence causes another.
Process
Processes are performed on artifacts in order to produce other artifacts.
Agents
Agents are the entities that control a process, such as a user.
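The three node types and the annotated edges can be sketched as a small graph structure. The Python below is a minimal illustration, not an OPM implementation: the class `OPMGraph`, its methods, and the sample file and user names are assumptions made for this example. The edge names are the five causal dependencies defined by the OPM specification, each pointing from effect to cause.

```python
from dataclasses import dataclass, field

NODE_KINDS = {"artifact", "process", "agent"}

# OPM causal dependencies and the node kinds they connect: (effect, cause).
EDGE_KINDS = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass
class OPMGraph:
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (effect, kind, cause, annotations)

    def add_node(self, node_id, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, effect, kind, cause, **annotations):
        """Add an edge after checking that it links the right node kinds."""
        expected = EDGE_KINDS[kind]
        actual = (self.nodes[effect], self.nodes[cause])
        if actual != expected:
            raise ValueError(f"{kind} must link {expected}, got {actual}")
        self.edges.append((effect, kind, cause, annotations))

    def causes_of(self, node_id):
        """Return the direct causes of a node, in insertion order."""
        return [cause for effect, _, cause, _ in self.edges if effect == node_id]


# Hypothetical example: a user cleans a raw file into a new file.
g = OPMGraph()
g.add_node("raw.csv", "artifact")
g.add_node("clean.csv", "artifact")
g.add_node("cleaning", "process")
g.add_node("alice", "agent")
g.add_edge("cleaning", "used", "raw.csv", role="input")
g.add_edge("clean.csv", "wasGeneratedBy", "cleaning", role="output")
g.add_edge("cleaning", "wasControlledBy", "alice", role="operator")
g.add_edge("clean.csv", "wasDerivedFrom", "raw.csv")
print(g.causes_of("clean.csv"))
```

Checking node kinds when an edge is added keeps the graph consistent with the three-way classification, for example by rejecting an agent as the effect of a `used` edge.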
23Model of Web Data Provenance
Provenance Graph: describes the provenance of data items
Nodes: provenance elements (pieces of provenance information)
Edges: relate provenance elements to each other
Sub-graphs: for related data items, if possible
24Main Focus of Provenance of Web Data
- Provenance Models Define
- Types of Provenance elements (roles)
- Relationship between those elements
Adapted from Olaf Hartig, Humboldt University Berlin, Provenance Information in the Web of Data, 04/09
25Provenance Data Quality Assessment
- The Quality of Information
- The main objective is assessing the quality of datasets
- Quality of datasets from a multidimensional perspective
Category          Criteria
Intrinsic         Objectivity, Believability, Accuracy
Contextual        Completeness, Relevance, Timeliness
Representational  Understandable, Concise, Precise
Accessibility     Availability, Security, Licensing, Constraints (Format, Procedures)
- The relevance of each criterion is determined by user preferences and by the tasks performed on the available datasets
26Provenance Data Quality
- Data Trustworthiness
- Data Authenticity
- Data Reliability
- Dimensions of Believability
- Trustworthiness of source
- Data Lineage: the origin of data
- Related Artifacts and actors
- Reasonableness of data
- Possibility: the extent to which a data value is possible
- Consistency: the extent to which a data value is consistent with other values of the same data
- Quality of data provenance has three dimensions
- Correctness
- Completeness
- Relevancy
27Provenance Data Quality
- Quality of Datasets
- Timeliness
- Consistency between datasets
- Consistency over sources: the extent to which a data value is consistent with other values of the same data
- Consistency over time: the extent to which the data value is consistent with past data values
- Stable and meaningful data
- Temporality of Data
- Transaction and valid times closeness: the extent to which a data value is credible based on the proximity of transaction time to valid times
- Transaction time overlap: the extent to which a data value is derived from data values with overlapping valid times
28Trust Evaluation
- Some questions need to be considered when evaluating trust in provenance data
- Who created the content (author or attribution)?
- Was the content manipulated? If yes, by what process or source?
- Who is providing the content (repositories)?
29Quality of Data Assessment
- Assign numeric values to quality criteria of datasets, or use scoring/rating systems
- Proactive Approach
- Precision vs Practicality
Semi-Automatic Approach
Manual Approach
- Rating-based system
- Reputation-based system
- Questionnaire-based system
30Reasons of Assessment
- Main Reasons
- Provenance of assessed data on the web
- Primary Objectives
- Identify methods/approaches to automatically assess the quality of data on the web
- Or identify methods to automatically assess the quality criteria of web data
31A General Assessment Approach
Step - 1
Generate a provenance graph for the data item
Step - 2
Annotate the provenance graph with impact values
Step - 3
Execute the assessment function/program (script)
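The three steps can be sketched end to end for one quality criterion, timeliness. This Python example is a simplified illustration under stated assumptions: the graph layout, the linear freshness score, the 365-day lifetime, and the rule that a derived item is no fresher than its stalest source are all choices made for this sketch, not the method of any cited paper.

```python
from datetime import datetime

# Step 1: a tiny provenance graph. Each data item lists its creation time
# and the source items it was derived from. All values are made up.
provenance_graph = {
    "report": {"created": datetime(2010, 11, 20), "sources": ["table_a", "table_b"]},
    "table_a": {"created": datetime(2010, 11, 1), "sources": []},
    "table_b": {"created": datetime(2010, 6, 1), "sources": []},
}


def annotate(graph, now, lifetime_days):
    """Step 2: attach an impact value (here a freshness score in [0, 1])
    to every element, based on its age relative to an assumed lifetime."""
    impact = {}
    for item, info in graph.items():
        age_days = (now - info["created"]).days
        impact[item] = max(0.0, 1.0 - age_days / lifetime_days)
    return impact


def assess_timeliness(graph, impact, item):
    """Step 3: the assessment function. A derived item is taken to be
    no fresher than the stalest item it was derived from."""
    scores = [impact[item]]
    for source in graph[item]["sources"]:
        scores.append(assess_timeliness(graph, impact, source))
    return min(scores)


now = datetime(2010, 12, 1)
impact_values = annotate(provenance_graph, now, lifetime_days=365)
score = assess_timeliness(provenance_graph, impact_values, "report")
print(round(score, 3))
```

Here the report inherits the low freshness of `table_b`, even though the report itself was created recently.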
32Generate a Provenance Graph
- What types of provenance elements are necessarily required?
- What level of detail (i.e. granularity) is necessarily required?
- Where and how do we get provenance information?
- Two complementary options
- Recordings
- Analyzing the metadata
33Annotation with Impact Values
- How might each provenance element influence the quality of data?
- Each type of element has to be analyzed systematically
- What kinds of impact values are necessary, and how should the influence be represented through impact values?
- Impact values do not necessarily have to be numeric
- It also depends on the assessment functions
- How do we determine the impact values?
34Determine the Impact Values
- From Provenance Information
- From user Input
- Rating-based systems or reputation-based systems
- Configuration options
- Through Content Analysis
- Comparison of data contents
- Adoption of information retrieval methods
- Adoption of data cleansing techniques
- Through Context Analysis
- Further metadata
- Domain knowledge
35Annotation with Impact Values
- How might each provenance element influence the quality of data?
Provenance Element Type
Creation Date
Creation Guidelines
Source data items
Data creator
Impact Values
Creation time
Weights
Expiry time
36Assessment Function(s)
- What does the assessment function look like?
- Develop the function together with the impact values
- Take incompleteness into consideration
- The provenance graph could be fragmentary
- Annotations could be missing
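One way to take incompleteness into consideration is to score only the annotations that exist and to report how much of the intended evidence was available. The Python sketch below assumes numeric impact values and a weighted average. The function name `assess`, the element names, and the weights are hypothetical.

```python
def assess(elements, weights):
    """Weighted average of impact values that tolerates gaps.

    `elements` maps a provenance element name to its impact value, or to
    None when the annotation is missing. Elements absent from the mapping
    are treated as missing parts of a fragmentary provenance graph.
    Returns (score, coverage): score is None if nothing could be assessed,
    coverage is the share of total weight backed by known values.
    """
    total_weight = sum(weights.values())
    known_weight = 0.0
    weighted_sum = 0.0
    for name, weight in weights.items():
        value = elements.get(name)
        if value is None:
            continue  # missing annotation or missing graph element
        known_weight += weight
        weighted_sum += weight * value
    if known_weight == 0:
        return None, 0.0
    return weighted_sum / known_weight, known_weight / total_weight


# Hypothetical weights and annotations; "source_items" is absent entirely.
weights = {"creator": 2.0, "creation_date": 1.0, "source_items": 1.0}
annotations = {"creator": 0.9, "creation_date": None}

score, coverage = assess(annotations, weights)
print(score, coverage)
```

Returning the coverage alongside the score lets a consumer distinguish a high score backed by complete provenance from one computed on a fragment.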
37Scientific and Technical Challenges of Provenance - 1 (Summary)
- Provenance information needs to be
- Represented
- Captured and recorded
- Stored and secured, queried and reasoned about
- Visualized and browsed
38Scientific and Technical Challenges of Provenance - 2
- Vocabularies for representation of provenance contents
- Need representation of processes (workflows), entities, roles, data collections, meta-assertions, etc.
- The Open Provenance Model (OPM)
- Granularity of provenance records
- How much detail is useful and manageable/scalable in practice?
- Size of provenance can be orders of magnitude larger than base data.
- Provenance evaluation for information quality and trust management
39Scientific and Technical Challenges of Provenance - 2a
- Evaluation and updates
- Shelf life / timeliness of data
- Determine when data becomes obsolete based on provenance information
- Versioning of data sources
- Relate updates of data based on provenance information
- Provenance-aware visualization, navigation and resource consumption
40Scientific and Technical Challenges of Provenance and Trust - 3
- Policies based on Provenance information
- Association-based policies
- Source is cited in Spiegel
- Source is cited in Wikipedia
- Bias-based policies
- Source is an Oil company
- Distrust policies
- Source is a blog
- Policies may be restricted to a context
- Topic of search, topics of pages, tags of page
- Trust policies may be shared across users
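The policy types above can be pictured as predicates over a source's provenance, each optionally restricted to a context. The following Python sketch is only an illustration: the `Policy` class, the trust adjustments, the base score of 0.5, and the topic names are assumptions for this example and are not defined in the presentation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Policy:
    name: str
    applies: Callable[[dict], bool]  # does the policy match this source?
    delta: float                     # assumed trust adjustment when it matches
    topic: Optional[str] = None      # restrict to a search topic, if set


policies = [
    Policy("association: cited in Wikipedia",
           lambda s: "Wikipedia" in s["cited_in"], +0.2),
    Policy("bias: oil company writing about climate",
           lambda s: s["kind"] == "oil company", -0.3, topic="climate"),
    Policy("distrust: blog",
           lambda s: s["kind"] == "blog", -0.4),
]


def trust(source, topic, base=0.5):
    """Apply every matching policy valid in this context; clamp to [0, 1]."""
    score = base
    for p in policies:
        in_context = p.topic is None or p.topic == topic
        if in_context and p.applies(source):
            score += p.delta
    return min(1.0, max(0.0, score))


# Hypothetical source description derived from provenance information.
source = {"kind": "oil company", "cited_in": ["Wikipedia"]}
print(trust(source, topic="climate"))
print(trust(source, topic="sports"))
```

Because the bias policy is restricted to one topic, the same source receives a lower score for a climate search than for an unrelated one. A list of `Policy` objects like this could also be shared across users.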
41Thanks for your attention!
Any Questions?
Freie Universität Berlin, Computer Science Department, Software Engineering Research Group, Takustr. 9, Berlin, Germany.
42References
- W3C website, What Is Provenance, modified November 2010, http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
- W3C website, A Working Definition of Provenance, modified November 2010, http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance#A_Working_Definition_of_Provenance
- O. Hartig. Provenance Information in the Web of Data. In Proceedings of LDOW 2009 (Madrid, Spain, April 2009).
- O. Hartig and J. Zhao. Using Web Data Provenance for Quality Assessment. Proceedings of the 1st Int. Workshop on the Role of Semantic Web in Provenance
- D. Brickley and L. Miller. FOAF Vocabulary Specification, November 2007. http://xmlns.com/foaf/spec
- U. Bojars and J. G. Breslin. SIOC Core Ontology Specification, Revision 1.30, Jan. 2009. http://rdfs.org/sioc/spec/
- L. Moreau, J. Freire, J. Futrelle, R. E. McGrath, J. Myers, and P. Paulson. The Open Provenance Model: An Overview. In IPAW, pages 323-326, 2008.
- L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, vol. 45, no. 4, p. 211-218, 2002.
- Y.-W. Cheah and B. Plale. Provenance Analysis: Towards Quality Provenance. In Proceedings of the 8th IEEE International Conference on eScience, Chicago, Illinois, Oct. 2012. http://www.ci.uchicago.edu/escience2012/pdf/Provenance_Analysis-Towards_Quality_Provenance.pdf
- Y. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance in e-Science. SIGMOD Record, vol. 34, no. 3, p. 31-36, ACM, Sept. 2005.
- N. Prat and S. Madnick. Evaluating and Aggregating Data Believability across Quality Sub-dimensions and Data Lineage. In Proceedings of WITS 2007 (Montreal, Canada, December 2007), p. 169-174.
- P. Buneman, S. Khanna, and W. C. Tan. Data Provenance: Some Basic Issues. In Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science (FST TCS), p. 87-93, Springer, Dec. 2000.
- N. Prat and S. Madnick. Measuring Data Believability: A Provenance Approach. Proceedings of HICSS-41 (Big Island, HI, January 2008), IEEE, p. 1-10.
- Jose Manuel Gomez-Perez. Invited Lectures on Programmable Web and the Web of Data, November 2009, URJC, Campus de Mostoles, Departmental II, Salon de Grados, Madrid, Spain. http://www.cetinia.urjc.es/es/node/331
- Website, http://www.w3.org/2005/Incubator/prov/wiki/images/0/02/Provenance-XG-Overview.pdf
- http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Dimensions
- http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki