PROVENANCE - PowerPoint PPT Presentation

1
PROVENANCE
Department of Computer Science, Software
Engineering Research Group, Berlin, Germany
  • Abdul Saboor

Welcome to this Presentation
2
Presentation Agenda
  • What is Provenance?
  • Why is provenance important, and what are its
    two major strands?
  • Provenance and Linked Data
  • Provenance Data Model
  • Provenance Vocabularies
  • The Open Provenance Model
  • Provenance Data Quality Assessment
  • Summary - Scientific and Technical Challenges of
    Provenance

3
What is Provenance?
  • Provenance
  • Recording the history of data and its place of
    origin
  • Provenance Dictionary Definitions
  • Merriam-Webster online dictionary: origin,
    source
  • Oxford English Dictionary: the place of origin
    or earliest known history of something; origin,
    derivation.
  • Provenance Definitions
  • 1. Provenance refers to the sources of information,
    such as the entities and processes involved in
    producing or delivering an artifact. (Yolanda)
  • 2. Provenance is a description of how things came
    to be, and how they came to be in the state they
    are in today. Statements about the provenance can
    themselves be considered to have provenance. (Jim
    M)

Continues ...
4
What is Provenance?
  • Provenance Working Definitions
  • 3. Provenance of a resource is a record that
    describes entities and processes involved in
    producing and delivering or otherwise influencing
    that resource. Provenance provides a critical
    foundation for assessing authenticity, enabling
    trust, and allowing reproducibility. Provenance
    assertions are a form of contextual metadata and
    can themselves become important records with
    their own provenance. (W3C)
  • Provenance Web Definition
  • 4. On the web, provenance would include
    information about the creation and publication of
    web resources as well as information about access
    of those resources, and activities related to
    their discussion, linking, and reuse.

Continues ...
5
What is Provenance?
  • Provenance Definitions
  • 5. Provenance is documentation of the set of
    artifacts, processes, and agents that have caused
    an artifact to be, and of the contexts of these
    entities. Provenance provides a critical
    foundation for assessing authenticity, enabling
    trust, and allowing reproducibility. Assertions
    of provenance can themselves become important
    records with their own provenance. (Jim M)

6
What kind of History?
  • Data Creator/Data Publisher
  • Data Creation Date
  • Data Modifier and Modification Date
  • Data Description
  • Etc...

7
Why Is Provenance Important?
  • Provenance is needed for data integration and
    reuse
  • Data comes from various diverse data sources
  • Varying Quality
  • Different Scope
  • Different Assumptions

8
Two major strands of Provenance
  • Data Provenance
  • Data provenance is fine-grained provenance
  • Derivation of a piece of data, i.e. the results of
    transformations
  • Description of the origin of a piece of data and
    the process by which it arrived in a database
  • Workflow Provenance
  • Workflow provenance is coarse-grained provenance
  • Refers to records of the history of the derivation
    of the final output of a workflow
  • Typically captured for complex processing tasks

9
Data And Workflow Provenance
Data Provenance
Information describing how data has moved through a
network of databases is referred to as fine-grained
or data provenance. Fine-grained provenance can be
further categorized into where-, why- and
how-provenance. A query execution simply copies data
elements from some source to some target database;
where-provenance identifies the source elements from
which the data in the target was copied.
Why-provenance provides the justification for the
data elements appearing in the output, and
how-provenance describes how some parts of the input
influenced certain parts of the output.
Workflow Provenance
Information describing how derived data has been
calculated from raw observations is referred to as
coarse-grained or workflow provenance. The widespread
use of workflow tools for processing scientific data
facilitates the capture of provenance information. The
workflow process describes all the steps involved in
producing a given data set and hence captures its
provenance information.
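The difference between where- and why-provenance can be illustrated with a toy query. The following is a minimal sketch and not part of the original slides: the two tables, the row identifiers and the function name join_with_provenance are all hypothetical, chosen only to show which source elements each kind of provenance points back to.

```python
from typing import NamedTuple

# Hypothetical source tables; every row has an identifier so that
# provenance records can refer back to it.
EMPLOYEES = {
    "e1": {"name": "Ada", "dept": "D1"},
    "e2": {"name": "Bob", "dept": "D2"},
}
DEPARTMENTS = {
    "d1": {"dept": "D1", "city": "Berlin"},
    "d2": {"dept": "D2", "city": "Madrid"},
}


class OutputRow(NamedTuple):
    values: dict     # the result tuple, e.g. {"name": ..., "city": ...}
    where: dict      # where-provenance: output field -> (source row id, source field)
    why: frozenset   # why-provenance: source rows that justify this output row


def join_with_provenance(employees, departments):
    """Join employees and departments on 'dept' and record provenance."""
    result = []
    for e_id, emp in employees.items():
        for d_id, dep in departments.items():
            if emp["dept"] == dep["dept"]:
                result.append(OutputRow(
                    values={"name": emp["name"], "city": dep["city"]},
                    where={"name": (e_id, "name"), "city": (d_id, "city")},
                    why=frozenset({e_id, d_id}),
                ))
    return result


rows = join_with_provenance(EMPLOYEES, DEPARTMENTS)
for row in rows:
    print(row.values, "copied from", row.where, "because of", sorted(row.why))
```

Where-provenance answers "which source cell was this value copied from?", while why-provenance lists the source rows without which the output row would not exist.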
10
Provenance Dimensions - 1
  • Content of Provenance Information
  • Attribution - provenance as the sources or
    entities that were used to create a new result
  • Responsibility - knowing who endorses a
    particular piece of information or result
  • Origin - recorded vs reconstructed, verified vs
    non-verified, asserted vs inferred
  • Process - provenance as the process that yielded
    an artifact
  • Reproducibility (e.g. workflows, mashups, text
    extraction)
  • Data Access (e.g. access time, accessed server,
    party responsible for accessed server)
  • Evolution and versioning
  • Republishing (e.g. re-tweeting, re-blogging,
    re-publishing)
  • Updates (e.g. a document with content from
    various sources and that changes over time)
  • Justification for decisions - includes
    argumentation, hypotheses, why-not questions
  • Entailment - given the results to a particular
    query, what tuples led to those results

11
Provenance Dimensions - 2
  • Management of Provenance Information
  • Publication - Making provenance information
    available (expose, distribute)
  • Access - Finding and querying provenance
    information
  • Dissemination control - track policies specified
    by creator for when/how an artifact can be used
  • Access Control - incorporate access control
    policies to access provenance information
  • Licensing - stating what rights the object
    creators and users have based on provenance
  • Law enforcement (e.g. enforcing privacy policies
    on the use of personal information)
  • Scale - how to operate with large amounts of
    provenance information
  • Use of Provenance Information
  • Understanding - End user consumption of
    provenance
  • abstraction, multiple levels of description,
    summary
  • presentation, visualization

12
Provenance Dimensions - 3
  • Interoperability - combining provenance produced
    by multiple different systems
  • Comparison - finding what is common in the
    provenance of two or more entities
  • Accountability - the ability to check the
    provenance of an object with respect to some
    expectation
  • Verification - of a set of requirements
  • Compliance - with a set of policies
  • Trust - making trust judgments based on
    provenance
  • Information quality - choosing among competing
    evidence from diverse sources (e.g. linked data
    use cases)
  • Incorporating reputation and reliability ratings
    with attribution information
  • Imperfections - reasoning about provenance
    information that is not complete or correct
  • Incomplete provenance
  • Uncertain/probabilistic provenance
  • Erroneous provenance
  • Fraudulent provenance
  • Debugging

13
Web of Data
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G
Perez, Provenance eScience to the Web of Data,
11/09
14
The Linked Data Paradigm
  • How can we exploit all the available data?
  • Data can be reused and remixed
  • Common flexible and usable APIs
  • Standard vocabularies to describe interlinked
    datasets
  • Various Tools
  • Understand the Semantic Web vision

15
Provenance and Linked Data
  • Provenance provides the ability to:
  • Trace the sources of various kinds of data
  • Enable the exploration of relationships between
    datasets, their authors and affiliations
  • Provenance analysis provides an insight on how
    data is produced and exploited
  • Provenance creates a notion of information quality
  • Is a certain dataset consistent and up to date?
  • Is the connection between two datasets
    meaningful?
  • Is a given dataset relevant for a particular
    domain?
  • Provenance to establish information
    trustworthiness
  • Provenance to provide data views relating to some
    criteria

16
The Provenance Data Model
Institutional Level
Metadata associated with the origin of the data in
terms of its attributes (e.g. AuthorName, Title, URL,
etc.)
Experimental Protocol Level
The origin of datasets (e.g. history, area,
region, organisation or institution)
Data Analysis and Significance Level
The statistical analysis methodology applied to
datasets for selecting relevant attributes (e.g.
whether datasets are divided into parts, output
values, versions, etc.)
Dataset Description Level
Who published the datasets. Vocabularies for
describing interlinked datasets, such as Dublin Core,
voiD, PRV, etc.
17
The Provenance Related Vocabularies
  • DC: Dublin Core
  • FOAF: Friend of a Friend
  • SIOC: Semantically-Interlinked Online Communities
  • WOT: Web of Trust Schema
  • OMV: Ontology Metadata Vocabulary
  • SWP: Semantic Web Publishing
  • voiD: Vocabulary of Interlinked Datasets
  • PRV: Provenance Vocabulary
  • PML: Proof Markup Language
  • PAV: SWAN provenance ontology
  • OUZO: Provenance ontology
  • CS: Changeset Vocabulary
  • Etc.

18
Provenance Related Metadata
  • Provenance-related metadata is either attached
    directly to a data item or its host document, or
    it is available as additional data on the web.
  • Examples of attached metadata are RDF
    statements about the RDF graph that contains those
    statements, the author name and creation date of
    blog entries added to a syndication feed, or
    information embedded in an image. Detached metadata
    can be represented in RDF using vocabularies.
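As an illustration of detached metadata expressed with a vocabulary, the sketch below writes two Dublin Core statements about a dataset as N-Triples lines using only the Python standard library. The dcterms:creator and dcterms:created properties are real Dublin Core terms; the example.org IRIs, the date and the helper names (iri, literal, describe) are hypothetical and not taken from the slides.

```python
# Namespace of the Dublin Core metadata terms and the XSD date datatype.
DCTERMS = "http://purl.org/dc/terms/"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def iri(value):
    """Format an IRI the way N-Triples expects it."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Format a literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if datatype:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def describe(resource, creator, created):
    """Return N-Triples lines stating who created a resource and when."""
    return [
        f"{iri(resource)} {iri(DCTERMS + 'creator')} {iri(creator)} .",
        f"{iri(resource)} {iri(DCTERMS + 'created')} {literal(created, XSD_DATE)} .",
    ]


# Hypothetical dataset and creator, used only for illustration.
triples = describe(
    "http://example.org/dataset/1",
    "http://example.org/people/alice",
    "2010-11-01",
)
print("\n".join(triples))
```

Because the statements live in their own document and merely point at the dataset IRI, they can be published and updated independently of the data they describe.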

19
A Provenance Architecture for the Web of Data
Application Layer
Authoritative agencies are required to certify
data provenance and keep it secure
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G
Perez, Provenance eScience to the Web of Data,
11/09
20
Main Action Points
Provenance Vocabularies
Awareness of Data Providers
Tools for Data Providers
Represent and reason with trust and information
quality
Generalization of Provenance Metadata
W3C Provenance Incubator Group
Extend emerging Linked data vocabularies
Provenance Authoritative Agencies
Linked Data Standards (voiD)
Provenance Visualization
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G
Perez, Provenance eScience to the Web of Data,
11/09
21
The Open Provenance Model
  • The Open Provenance Model (OPM) describes how data
    is produced or transformed into a new state. It can
    also represent the transition of one or more data
    items from an old state to a new one.
  • OPM is a graph model for provenance: nodes
    represent occurrences, and edges denote the
    relationships between those occurrences.
  • The main purpose of OPM is to support the
    assessment of various data qualities such as
    reliability, accuracy and timeliness.

22
OPM Classifies Nodes into Three Types
Artifacts
Artifacts are pieces of data with a fixed value
and context that may represent an entity in
a given state. Edges can also carry annotations
that provide information on how one occurrence
caused another.
Processes
Processes are performed on artifacts in order to
produce other artifacts.
Agents
Agents are the entities that control a process,
such as a user.
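The node types and edges described here can be sketched as a small in-memory graph. This is an illustrative sketch only, not an implementation of the OPM specification: the edge names (used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, wasDerivedFrom) are the causal dependencies OPM defines, but the ProvenanceGraph class, its methods and the example node names are assumptions made for this example.

```python
from dataclasses import dataclass, field

NODE_KINDS = {"artifact", "process", "agent"}

# OPM edge types and the node kinds they connect, written as (effect, cause).
EDGE_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    edges: list = field(default_factory=list)  # (effect, edge type, cause)

    def add_node(self, node_id, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, effect, edge_type, cause):
        """Add an edge after checking that it links the right node kinds."""
        expected = EDGE_TYPES[edge_type]
        actual = (self.nodes[effect], self.nodes[cause])
        if actual != expected:
            raise ValueError(f"{edge_type} expects {expected}, got {actual}")
        self.edges.append((effect, edge_type, cause))

    def causes_of(self, node_id):
        """Return every node reachable by following edges from effect to cause."""
        seen, stack = set(), [node_id]
        while stack:
            current = stack.pop()
            for effect, _, cause in self.edges:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical example: a user runs a cleaning step on a raw file.
g = ProvenanceGraph()
g.add_node("raw.csv", "artifact")
g.add_node("clean.csv", "artifact")
g.add_node("clean", "process")
g.add_node("alice", "agent")
g.add_edge("clean", "used", "raw.csv")
g.add_edge("clean.csv", "wasGeneratedBy", "clean")
g.add_edge("clean", "wasControlledBy", "alice")
print(sorted(g.causes_of("clean.csv")))
```

Edges point from an effect to its cause, so walking them from a data item collects everything that contributed to its current state.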
23
Model of Web Data Provenance
Provenance Graph: describes the provenance of
data items
Nodes
Edges
Sub-graphs
Provenance elements (Pieces of provenance
information)
Relating Provenance elements to each other
Related data items if possible
24
Main Focus of Provenance of Web Data
  • Provenance Models Define
  • Types of Provenance elements (roles)
  • Relationship between those elements

Adapted from Olaf Hartig, Humboldt University
Berlin, Provenance Information in the Web of
Data, 04/09
25
Provenance Data Quality Assessment
  • The Quality of Information
  • The main objective is assessing the quality of
    datasets
  • Quality of datasets from a multidimensional
    perspective

Categories: Criteria
Intrinsic: Objectivity, Believability, Accuracy
Contextual: Completeness, Relevance, Timeliness
Representational: Understandable, Concise, Precise
Accessibility: Availability, Security, Licensing, Constraints (Format, Procedures)
  • The relevance of each criterion depends on user
    preferences and on the task performed on the
    available datasets

26
Provenance Data Quality
  • Data Trustworthiness
  • Data Authenticity
  • Data Reliability
  • Dimensions of Believability
  • Trustworthiness of source
  • Data lineage: the origin of data
  • Related artifacts and actors
  • Reasonableness of data
  • Possibility: the extent to which a data value is
    possible
  • Consistency: the extent to which a data value is
    consistent with other values of the same data
  • Quality of data provenance has three dimensions
  • Correctness
  • Completeness
  • Relevancy

27
Provenance Data Quality
  • Quality of Datasets
  • Timeliness
  • Consistency between datasets
  • Consistency over sources: the extent to which a
    data value is consistent with other values of the
    same data
  • Consistency over time: the extent to which the
    data value is consistent with past data values
  • Stable and meaningful data
  • Temporality of data
  • Transaction and valid times closeness: the extent
    to which a data value is credible based on the
    proximity of its transaction time to its valid times
  • Transaction time overlap: the extent to which a
    data value is derived from data values with
    overlapping valid times
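The two temporal criteria can be made concrete with simple interval arithmetic. The formulas below are assumptions chosen for illustration, not the measures defined in the cited literature: overlap_ratio reports how much of the shorter valid-time interval is shared, and closeness decays with the gap between the transaction time and the valid period, using a hypothetical scale_days parameter.

```python
from datetime import date


def overlap_days(a, b):
    """Number of days shared by two (start, end) valid-time intervals."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max((end - start).days, 0)


def overlap_ratio(a, b):
    """Share of the shorter interval covered by the overlap (0.0 to 1.0)."""
    shorter = min((a[1] - a[0]).days, (b[1] - b[0]).days)
    if shorter <= 0:
        return 0.0
    return overlap_days(a, b) / shorter


def closeness(transaction_time, valid, scale_days=30):
    """1.0 if recorded within the valid period, decaying with the gap otherwise."""
    if valid[0] <= transaction_time <= valid[1]:
        return 1.0
    gap = min(
        abs((transaction_time - valid[0]).days),
        abs((transaction_time - valid[1]).days),
    )
    return 1.0 / (1.0 + gap / scale_days)


# Hypothetical valid-time intervals of two source values.
q1 = (date(2010, 1, 1), date(2010, 4, 1))
q2 = (date(2010, 3, 1), date(2010, 7, 1))
print(overlap_days(q1, q2), overlap_ratio(q1, q2))
print(closeness(date(2010, 5, 1), q1))
```

A value recorded long after the period it describes, or derived from inputs whose valid times barely overlap, receives a lower score under these assumptions.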

28
Trust Evaluation
  • Some questions must be considered when evaluating
    trust in provenance data
  • Who created the content (author or
    attributions)?
  • Was the content manipulated? If yes, by
    what process or source?
  • Who is providing the content
    (repositories)?

29
Quality of Data Assessment
  • Assign numeric values to Quality Criteria of
    Datasets or Scoring/Rating Systems
  • Proactive Approach
  • Precision vs Practicality

Semi-Automatic Approach
Manual Approach
  • Rating-based system
  • Reputation-based system
  • Questionnaire-based system

30
Reasons for Assessment
  • Main Reasons
  • Provenance of assessed data on the web
  • Primary Objectives
  • Identify methods and approaches to
    automatically assess the quality of data on the
    web
  • Or identify methods to automatically assess the
    quality criteria of web data

31
A Generalized Assessment Approach
Step 1: Generate a provenance graph for the data item
Step 2: Annotate the provenance graph with impact values
Step 3: Execute the assessment function/program (script)
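The three steps of this approach can be sketched end to end for a single quality criterion. The example below is a minimal sketch under stated assumptions: it assesses timeliness only, the graph and its dates are hand-written and hypothetical, and the freshness formula (remaining lifetime as a fraction of total lifetime) is a simple choice made for illustration rather than the method used in the cited work.

```python
from datetime import datetime

# Step 1: a provenance graph for one data item (hypothetical, hand-written).
graph = {
    "item": {"derived_from": ["source_a", "source_b"]},
    "source_a": {"derived_from": []},
    "source_b": {"derived_from": []},
}

# Step 2: annotate each provenance element with impact values.
impact = {
    "item": {"created": datetime(2010, 11, 1), "expires": datetime(2010, 11, 11)},
    "source_a": {"created": datetime(2010, 10, 7), "expires": datetime(2010, 11, 16)},
    "source_b": {"created": datetime(2010, 11, 5), "expires": datetime(2010, 11, 15)},
}


def freshness(values, now):
    """Remaining lifetime as a fraction of total lifetime, clamped to [0, 1]."""
    lifetime = (values["expires"] - values["created"]).total_seconds()
    age = (now - values["created"]).total_seconds()
    if lifetime <= 0:
        return 0.0
    return min(max(1.0 - age / lifetime, 0.0), 1.0)


def assess_timeliness(graph, impact, node, now):
    """Step 3: an item is only as fresh as the least fresh element it depends on."""
    score = freshness(impact[node], now)
    for parent in graph[node]["derived_from"]:
        score = min(score, assess_timeliness(graph, impact, parent, now))
    return score


now = datetime(2010, 11, 6)
score = assess_timeliness(graph, impact, "item", now)
print(score)
```

Taking the minimum along the derivation chain is one possible aggregation; a weighted combination would be an equally valid choice depending on the assessment goal.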
32
Generate a Provenance Graph
  • What types of provenance elements are
    required?
  • What level of detail (i.e. granularity) is
    required?
  • Where and how do we get provenance information?
  • Two complementary options
  • Recordings
  • Analyzing the metadata

33
Annotation with Impact Values
  • How might each provenance element influence
    the quality of data?
  • Each type of element has to be analyzed
    systematically
  • What kinds of impact values are necessary, and how
    can influence be represented through impact values?
  • Impact values do not have to be numeric
  • The choice also depends on the assessment functions
  • How do we determine the impact values?

34
Determine the Impact Values
  • From Provenance Information
  • From user Input
  • Rating-based systems, or reputation-based
    systems
  • Configuration options
  • Through Content Analysis
  • Comparison of data contents
  • Adoption of information retrieval methods
  • Adoption of data cleansing techniques
  • Through Context Analysis
  • Further metadata
  • Domain knowledge

35
Annotation with Impact Values
  • How might each provenance element influence
    the quality of data?

Provenance element types: creation date, creation
guidelines, source data items, data creator
Impact values: creation time, weights, expiry time
36
Assessment Function(s)
  • What does the assessment function look like?
  • Develop function together with impact values
  • Take incompleteness into consideration
  • Provenance graph could be fragmentary
  • Annotation could be missing

37
Scientific and Technical Challenges of Provenance
- 1 (Summary)
  • Provenance information needs to be
  • Represented
  • Captured and recorded
  • Stored and secured, queried and reasoned about
  • Visualized and browsed

38
Scientific and Technical Challenges of Provenance
- 2
  • Vocabularies for representation of provenance
    contents
  • Need representation of processes (workflows),
    entity roles, data collections,
    meta-assertions, etc.
  • The open provenance model (OPM)
  • Granularity of provenance records
  • How much detail is useful, manageable/scalable in
    practice?
  • Size of provenance can be orders of magnitude
    larger than base data.
  • Provenance evaluation for information quality and
    trust management

39
Scientific and Technical Challenges of Provenance
- 2a
  • Evaluation and updates
  • Shelf timeliness of data
  • Determine when data becomes obsolete based on
    provenance information
  • Versioning of data sources
  • Relate updates of data based on provenance
    information
  • Provenance-aware visualization, navigation and
    resource consumption

40
Scientific and Technical Challenges of Provenance
and Trust - 3
  • Policies based on Provenance information
  • Association-based policies
  • Source is cited in Spiegel
  • Source is cited in Wikipedia
  • Bias-based policies
  • Source is an Oil company
  • Distrust policies
  • Source is a blog
  • Policies may be restricted to a context
  • Topic of search, topics of pages, tags of page
  • Trust policies may be shared across users
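These policy types can be sketched as rules that adjust a trust score based on provenance facts about a source. This is an illustrative sketch only: the Policy class, the numeric adjustments, the baseline of 0.5 and the source dictionary keys (kind, cited_in) are all assumptions, and the "climate" context is a hypothetical example of restricting a policy to a topic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Policy:
    name: str
    applies: Callable[[dict], bool]   # test on the provenance of a source
    adjustment: float                 # change applied to the trust score
    context: Optional[str] = None     # topic the policy is restricted to, if any


# Hypothetical shared policy set with made-up adjustment values.
POLICIES = [
    Policy("association: cited in Wikipedia",
           lambda s: "Wikipedia" in s.get("cited_in", ()), +0.2),
    Policy("bias: source is an oil company",
           lambda s: s.get("kind") == "oil company", -0.3, context="climate"),
    Policy("distrust: source is a blog",
           lambda s: s.get("kind") == "blog", -0.5),
]


def trust(source, topic, policies=POLICIES, baseline=0.5):
    """Apply every policy valid in this context and clamp the result to [0, 1]."""
    score = baseline
    for policy in policies:
        if policy.context not in (None, topic):
            continue  # policy is restricted to a different context
        if policy.applies(source):
            score += policy.adjustment
    return min(max(score, 0.0), 1.0)


company = {"kind": "oil company", "cited_in": ["Wikipedia"]}
print(trust(company, "climate"), trust(company, "sports"))
print(trust({"kind": "blog"}, "sports"))
```

Keeping the policies in a plain list makes them easy to share across users, while the context field keeps a bias-based rule from firing outside the topic it was written for.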

41
Thanks for your attention!
Any Questions?
Freie Universität Berlin, Computer Science
Department, Software Engineering Research
Group, Takustr. 9, Berlin, Germany.
42
References
  1. W3C Website, What is provenance? Modified
    November 2010,
    http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
  2. W3C Website, A Working Definition of Provenance,
    modified November 2010,
    http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance#A_Working_Definition_of_Provenance
  3. O. Hartig. Provenance information in the Web of
    data. In Proceedings of LDOW 2009 (Madrid, Spain,
    April 2009).
  4. O. Hartig and J. Zhao. Using web data provenance
    for quality assessment. Proceedings of the 1st
    Int. Workshop on the Role of Semantic Web in
    Provenance
  5. D. Brickley and L. Miller. FOAF Vocabulary
    Specification, November 2007.
    http://xmlns.com/foaf/spec
  6. U. Bojars and J. G. Breslin. SIOC Core Ontology
    Specification, Revision 1.30, Jan. 2009.
    http://rdfs.org/sioc/spec/
  7. Luc Moreau, Juliana Freire, Joe Futrelle, Robert
    E. McGrath, Jim Myers, and Patrick Paulson. The
    open provenance model: An overview. In IPAW,
    pages 323-326, 2008.
  8. L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data
    Quality Assessment. Communications of the ACM,
    vol. 45, no. 4, p. 211-218, 2002.
  9. You-Wei Cheah and Beth Plale. Provenance Analysis:
    Towards Quality Provenance. In Proceedings of the
    8th IEEE International Conference on eScience,
    Chicago, Illinois, Oct. 2012.
    http://www.ci.uchicago.edu/escience2012/pdf/Provenance_Analysis-Towards_Quality_Provenance.pdf
  10. Yogesh Simmhan, Beth Plale, and Dennis Gannon. A
    survey of data provenance in e-science. SIGMOD
    Record, 34(3):31-36, 2005.
  11. N. Prat and S. Madnick. Evaluating and
    aggregating data believability across quality
    sub-dimensions and data lineage. In Proceedings
    of WITS 2007 (Montreal, Canada, December 2007),
    p. 169-174.
  12. Y. Simmhan, B. Plale, and D. Gannon. A Survey of
    Data Provenance in e-Science. SIGMOD Record,
    Computer Science Department, Indiana University.
    Vol. 34, no. 3, p. 31-36, ACM, Sept. 2005.
  13. P. Buneman, S. Khanna, and W. C. Tan. Data
    Provenance: Some Basic Issues. In Proceedings of
    the 20th Conference on Foundations of Software
    Technology and Theoretical Computer Science (FST
    TCS), p. 87-93, Springer, Dec. 2000.
  14. N. Prat and S. Madnick. Measuring data
    believability: A provenance approach. Proceedings
    of HICSS-41 (Big Island, HI, January 2008), IEEE,
    p. 1-10.
  15. Jose Manuel Gomez-Perez. Invited Lectures on
    Programmable web and the web of data, November
    2009, URJC, Campus de Mostoles, Departmental II,
    Salon de grados, Madrid, Spain.
    http://www.cetinia.urjc.es/es/node/331
  16. http://www.w3.org/2005/Incubator/prov/wiki/images/0/02/Provenance-XG-Overview.pdf
  17. http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Dimensions
  18. http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki