DATA PROVENANCE - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: DATA PROVENANCE


1
DATA PROVENANCE
  • Sudha Ram
  • Eller Professor of MIS
  • Department of MIS
  • University of Arizona
  • Tucson, AZ
  • Email: ram_at_eller.arizona.edu
  • URL: http://adrg.eller.arizona.edu
  • April 7, 2006

Supported in part by NSF grant IIS-0455993
2
National Digital Information Infrastructure and
Preservation Program (NDIIPP)
  • Funded by Congress: $100 million program
  • Managed by Library of Congress and National
    Science Foundation
  • DIGITAL ARCHIVING PROGRAM (DIGARCH)
  • 10 projects funded in the first cycle in May
    2005
  • Data Provenance: UA and Raytheon Missile Systems

3
Research Objectives
  • Investigate the semantics of Data Provenance
  • Develop an ontology to represent the semantics
    of Provenance
  • Develop automated ways to harvest Provenance

4
What is Provenance?
5
What is Provenance?
  • From the Merriam-Webster Online Dictionary
  • http://webster.com/cgibin/dictionary?bookDictionaryvaprovenancex
  • provenance
  • One entry found for provenance.
  • Main Entry: provenance
  • Pronunciation: 'präv-nən(t)s, 'prä-və-"nän(t)s
  • Function: noun
  • Etymology: French, from provenir to come forth, originate,
    from Latin provenire, from pro- forth + venire to
    come (more at PRO-, COME)
  • 1: ORIGIN, SOURCE
  • 2: the history of ownership of a valued object or
    work of art or literature

6
What is Provenance?
  • Lineage, Pedigree, Origin
  • Enables correct interpretation
  • Includes:
  • - Who created it
  • - How was it derived
  • - Ownership
  • - Assumptions
  • - ...
  • Provenance is an overloaded term

7
Uses of Provenance
  • Data Quality
  • Audit Trail
  • Replication Recipe
  • Attribution/Digital Rights Management
  • Informational

8
Previous Work on Provenance
  • The concept of data provenance has not been
    rigorously defined
  • - Origin of data and its movement between
    databases (P. Buneman et al., 2002)
  • - Processing and transformations of data
    (C. Goble et al., 2002)
  • Data provenance should be comprehensive enough to
    be useful in the future
  • Current research focuses on some aspects of
    provenance while ignoring others

9
Related research
A metadata schema captures creator, dates (e.g.
creation date and modification date), and software
program.
  • Data-centric approach
  • - Metadata models in the archival domain such
    as Dublin Core, MARC, PREMIS, and OAIS
  • - The Collaboratory for Multi-scale Chemical
    Science (CMCS) in chemistry (Pancerella et al.,
    2003)
  • Process-centric approach
  • - MyGrid in bioinformatics (Goble et al., 2004)
  • - Chimera in high energy physics and
    astronomy (Foster et al., 2004)
  • - The Earth System Science Workbench (ESSW)
    in earth science (Frew and Bose, 2001)

[Figure: diagram of data items D0 to D4 and processes P1 to P3]
Neither of these two approaches has specified
the full spectrum of data provenance
10
Previous Work on Provenance
  • Investigated in many domains, e.g. Chemistry,
    Physics, Astronomy, Biology
  • Focus on Why and Where (Buneman, 2003)
  • Closely related to Data Quality, Reliability and
    Metadata
  • Gap: comprehensive definition of Provenance

11
Specific Research Questions
  • Understanding Semantics of Provenance
  • Representing and Harvesting Provenance
  • Implementing and Using Provenance

12
Research questions
Understanding semantics of provenance
  • What are the key elements of data provenance?
  • What are the relationships between these elements?
Representing and harvesting provenance
  • How can data provenance be represented?
  • How can data provenance be automatically or
    semi-automatically harvested?
Implementing and evaluating provenance
  • How useful is our model of data provenance?
13
Methodology
  • Domains: New Product Design, BioTechnology
    Computing
  • Understand Provenance Requirements
  • Analyzed more than 200 Scenarios
  • Developed W7 Model of Provenance

14
New Product Design
15
BioTechnology Computing Facility
16
Importance of Provenance
  • Both Domains of Study
  • Many types of Data
  • Many places of origin
  • Many users
  • Reusability of data

17
Scenarios
  • Collected more than 200 Scenarios (Use Cases)
  • Analyzed the Use Cases
  • Derived W7 Model of Provenance

18
Examples of provenance
Who
  • Creator, publisher, contributor
  • Ownership
When
  • Dates (e.g. creation date and modification date)
Where
  • The literature reference where data were first
    reported
  • Current location of storage of the data
How
  • How the data has been derived or transformed
  • Experimental procedures or computations that
    transform data
Why
  • The sequence of ideas leading to an experiment
  • Hypotheses an experiment is intended to test
Which
  • Instrument settings
  • Parameters of software application
What
  • Creation, transformation, derivation,
    retirement
19
Theoretical basis: Bunge's work
  • Bunge's view of history (Mario Bunge, Treatise on
    Basic Philosophy, Vol. 3: Ontology I: The
    Furniture of the World. Boston, MA: Reidel, 1977)

[Figure: Bunge's concepts mapped to W7 elements. Event: What
(including Creation and Destruction); Space: Where; Time: When;
Action: How; Agent: Who, Which]
20
Data Provenance Semantics
An engineering team is developing the actuator
fin, the wing system that steers missiles.
21
Semantics of Provenance
  • The team considers a material that turns out
    to be unqualified for the actuator fin because the
    vendor of the material is not on the
    list of approved vendors, the material panels
    were fabricated outside the U.S., and the material
    test results provided were generated 10 years
    ago.

Who, Where, When: determine data quality and
reliability
22
Semantics of Provenance
  • The actuator fin team encounters a
    new-to-the-industry material. As the material is
    fairly new, it is not established how to use it.
    The team finds that another team used a
    material, M, which has similar physical and
    mechanical properties, in another project. The
    actuator team refers to the M provenance record
    to find out how and why M has been used and what
    lessons were learned in previous experiences.

Who, How, Why: facilitate data reuse and sharing
23
Semantics of Provenance
  • Two engineers are measuring the tensile
    strength of the new material. They both perform
    the same test on several samples of the material.
    The first engineer computes the tensile strength
    of the material by taking the average of values
    of the samples, while the second computes it by
    recording the minimum value. There is no way to
    compare them unless the derivation procedures are
    known to both engineers.

How: record the data derivation procedure
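
The scenario above can be made concrete with a small sketch. The sample readings and the `derive` and `comparable` helpers are hypothetical, introduced only for illustration; the point is that a reported value is only comparable to another when the How (the derivation procedure) is stored alongside it.

```python
from statistics import mean

# Hypothetical tensile-strength readings (MPa) for the same set of samples.
samples_mpa = [762.0, 756.3, 760.1]


def derive(samples, method):
    """Return a reported value together with the 'How' that produced it."""
    procedures = {"average": mean, "minimum": min}
    return {"value": procedures[method](samples), "how": method}


def comparable(a, b):
    """Two reported values are comparable only if derived the same way."""
    return a["how"] == b["how"]


engineer_a = derive(samples_mpa, "average")   # first engineer averages
engineer_b = derive(samples_mpa, "minimum")   # second engineer takes the minimum

print(engineer_a, engineer_b, comparable(engineer_a, engineer_b))
```

Both engineers start from identical measurements yet report different numbers; the recorded `how` field is what lets a later reader see why.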
24
Semantics of Provenance
  • An engineer detects a calibration error in an
    instrument and determines that the error may have
    existed since July of last year. He wants to
    locate the material data that has been generated
    using the instrument since July of last year.

Which and When: avoid and correct data recording
mistakes
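
A minimal sketch of the calibration scenario, assuming each data record carries a Which (instrument) and a When (date) field. The record identifiers, instrument names, and dates below are invented for illustration and do not come from the project.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    data_id: str
    which: str   # instrument that generated the data (hypothetical names)
    when: date   # date the data was generated


records = [
    Record("T-001", "tensile-tester-A", date(2005, 6, 15)),
    Record("T-002", "tensile-tester-A", date(2005, 8, 2)),
    Record("T-003", "tensile-tester-B", date(2005, 9, 10)),
    Record("T-004", "tensile-tester-A", date(2006, 1, 5)),
]


def affected(all_records, instrument, since):
    """Return ids of records produced by `instrument` on or after `since`."""
    return [r.data_id for r in all_records
            if r.which == instrument and r.when >= since]


suspect = affected(records, "tensile-tester-A", date(2005, 7, 1))
print(suspect)
```

Without the Which and When elements, the engineer would have to treat every record as potentially affected.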
25
Semantics of Provenance
26
Analysis of Provenance
27
Semantics of Provenance
  • Anchor point: information life cycle events
    (WHAT)
  • All other elements of Provenance describe the
    events

28
Information lifecycle
[Figure: Information lifecycle in new product design and
development. Stages shown: Review, Approval, Archiving, Storage,
Verification, Deletion, Creation, Access]
29
W7 Model of Provenance
Provenance
  • What: describes what happens to the data by recording
    various events such as creation, use, and
    transformation of the data during its lifecycle
  • When: records the occurrence time of events
  • Where: location where an event happens
  • Who: refers to people or organizations involved in
    data creation and transformation
  • How: documents actions upon the data; describes the
    details of how data has been created or
    transformed
  • Which: describes the instruments or software
    applications used in creating or processing the
    data
  • Why: the decision-making rationale of actions
Legend: Concept, Property, Relationship (is_a, part_of)
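
One way to picture the seven elements is as a single record type anchored on the What event. This is only a sketch: the `W7Record` class and its `missing` helper are hypothetical names, not part of the presented model, and the example values are borrowed from the Provenance Graph slide later in the deck.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class W7Record:
    what: str                       # the event; the anchor of the record
    when: Optional[date] = None     # occurrence time of the event
    where: Optional[str] = None     # location where the event happens
    who: Optional[str] = None       # people or organizations involved
    how: Optional[str] = None       # action taken upon the data
    which: Optional[str] = None     # instrument or software used
    why: Optional[str] = None       # rationale for the action

    def missing(self):
        """List the W7 elements that have not been recorded yet."""
        return [name for name, value in vars(self).items() if value is None]


record = W7Record(
    what="Derivation",
    when=date(2006, 1, 5),
    where="Raytheon, Tucson, AZ",
    who="John Herold (Creator)",
    how="Average (exclude outliers)",
)
print(record.missing())
```

Making What the only required field mirrors the idea that every other element describes an event.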
30
Granularity of Provenance
  • Single Data Value
  • Data Field (Attribute)
  • Record (Entity)
  • Subset of Records
  • Whole Data Set (Entity Class)
  • Whole Database

31
Representation of Provenance
  • Annotations
  • Grammar for Annotations

32
Annotation Syntax
<what annotation>//<when annotation>//<where annotation>//<who annotation>//<how annotation>//<which annotation>//<why annotation>
<what annotation> ::= What: <event type>
<event type> ::= Creation | Transformation | Access | Archiving | Storage
<when annotation> ::= //When: <valid time>
<valid time> ::= Instant | Period | Interval (<Granularity>)
<Granularity> ::= Year | Day | Hour | Minute | Second
<where annotation> ::= //Where: <location>/<location type>
<location> ::= Point | Line | Region
<location type> ::= Internal | External
<how annotation> ::= //How: <action type><action properties>
<action type> ::= Primitive | Complex
<action properties> ::= <precondition><method><artifact><resource>
<precondition> ::= /Precondition (<string>, <string>)
<method> ::= /Method (<string>, <string>)
<artifact> ::= /Artifact (<string>, <string>)
<resource> ::= /Resource (<string>, <string>)
<who annotation> ::= //Who: <agent type><role><position>
<agent type> ::= Individual | Organization | Artificial
<role> ::= (Role)
<position> ::= (position)
<which annotation> ::= //Which: <device type><description><function><settings>
<device type> ::= Instrument | Software
<description> ::= /Description (<string>, <string>)
<function> ::= /Function
<settings> ::= /Settings
<why annotation> ::= //Why: <belief>/<goal>
<belief> ::= - | Assumption | Hypothesis | Assumption/Hypothesis
<goal> ::= - | Goal
<string> ::= letter {letter | digit}
What Annotation Phrase
What:Creation|Transformation|Access|Archiving|Storage//
When Annotation Phrase
When:Period(Day)//
Who Annotation Phrase
Who:Individual(Role, Position)//
33
Data Provenance Modeling: Annotated Conceptual
Schema
Annotation Phrase: What:Creation //
How:Primitive/Artifacts(Input) // Why:-/Goal
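
An annotation phrase like the one above could be split into its W7 elements with a few lines of code. The sketch below assumes each element is written as a keyword, a colon, and a value, with elements separated by `//`; the colon is an assumption about the slide's notation, and `parse_annotation` is a hypothetical helper rather than part of PROMS.

```python
W7_KEYWORDS = {"What", "When", "Where", "Who", "How", "Which", "Why"}


def parse_annotation(phrase):
    """Split a phrase such as 'What:Creation // Why:-/Goal' into a dict."""
    result = {}
    for part in phrase.split("//"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing '//'
        keyword, sep, value = part.partition(":")
        if not sep or keyword not in W7_KEYWORDS:
            raise ValueError(f"not a W7 annotation: {part!r}")
        result[keyword] = value.strip()
    return result


parsed = parse_annotation(
    "What:Creation // How:Primitive/Artifacts(Input) // Why:-/Goal"
)
print(parsed)
```

The parser only separates elements; checking each value against the grammar (event types, action types, and so on) would be a further step.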
34
Architecture of PROMS
35
Provenance Graph
Data: Cycom381/S2 Uni-Glass 111, Tensile Strength
759 MPa
  • What: Derivation
  • When: Jan. 5, 2006
  • Where: Raytheon, Tucson, AZ
  • Who
  • - Name: John Herold
  • - Role: Creator
  • How
  • - Method: Average
    (exclude outliers)
  • Why
  • - Project: SM-3
    Program
  • Which: Granta Design
Relationships: occurs_at, is_involved_in, happens_in,
is_used_in, leads_to, because_of

has_input: Test Specimen S1, Tensile Strength 762 MPa
has_input: Test Specimen S2, Tensile Strength 756.3 MPa
  • What: Creation
  • How
  • - Test specification: SACMA SRM-4
  • - Test temperature: 108 F
  • - Condition of test specimen: Dry
  • Who
  • - Name: AME Material
    Test Lab
  • - Role: Tester
Relationships: occurs_at, leads_to, is_involved_in
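
The graph on this slide can be read as a derived value linked to its inputs. The sketch below stores the two test specimens and the derived tensile strength as nodes, follows the `has_input` edges, and recomputes the average named in the How element. The node identifiers, the dictionary layout, and the edge direction (derived value to specimen) are assumptions made for illustration; only the numbers and the `has_input` label come from the slide.

```python
from statistics import mean

# Node names are hypothetical; values are those shown on the slide.
nodes = {
    "S1": {"tensile_strength_mpa": 762.0},
    "S2": {"tensile_strength_mpa": 756.3},
    "derived": {"tensile_strength_mpa": 759.0, "how": "Average"},
}

# (source, relationship, target) triples; direction is an assumption.
edges = [
    ("derived", "has_input", "S1"),
    ("derived", "has_input", "S2"),
]


def inputs_of(node):
    """Follow has_input edges from a derived node to its source data."""
    return [dst for src, label, dst in edges
            if src == node and label == "has_input"]


def recompute(node):
    """Re-derive a value by averaging the values of its inputs."""
    return mean(nodes[i]["tensile_strength_mpa"] for i in inputs_of(node))


recomputed = recompute("derived")
matches = round(recomputed) == nodes["derived"]["tensile_strength_mpa"]
print(recomputed, matches)
```

Because the inputs and the method are both recorded, the reported 759 MPa can be checked against the specimen values instead of being taken on trust.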

36
Provenance Graph
  • Provenance is useful for context-based data
    analysis
  • Provenance can be used to group knowledge
  • SQL-P: provenance language

37
Miscellaneous Issues
  • Aggregation
  • Inheritance
  • Immutability
  • Automated Harvesting
  • Propagation
  • Security and Access Controls

38
Ongoing and Future Work
  • Evaluating Utility of Provenance
  • Extending to Other Domains
  • Using Provenance for Information Life Cycle
    Management

39
Contribution/Conclusion
  • Understand the meaning of Provenance
  • Autonomous harvesting of Provenance
  • Extend to other domains: Drug Discovery,
    Biodiversity, Manufacturing.
  • Develop a consortium: industry, federal agencies,
    and other partners.
  • Impact on Long Term Digital Archiving and
    preservation and Data Sharing

40
QUESTIONS?