Title: DATA PROVENANCE
1DATA PROVENANCE
- Sudha Ram
- Eller Professor of MIS
- Department of MIS
- University of Arizona
- Tucson, AZ
- Email: ram_at_eller.arizona.edu
- URL: http://adrg.eller.arizona.edu
- April 7, 2006
Supported in part by NSF grant IIS-0455993
2National Digital Information Infrastructure and Preservation Program (NDIIPP)
- Funded by Congress -- $100 Million Program
- Managed by Library of Congress and National Science Foundation
- DIGITAL ARCHIVING PROGRAM (DIGARCH)
- 10 Projects Funded in the first cycle in May 2005
- Data Provenance: UA and Raytheon Missile Systems
3Research Objectives
- Investigate the semantics of Data Provenance
- Develop an ontology to represent the semantics of Provenance
- Develop automated ways to harvest Provenance
5What is Provenance?
- From the Merriam-Webster Online Dictionary
- http://webster.com/cgi-bin/dictionary?book=Dictionary&va=provenance&x
- provenance
- One entry found for provenance.
Main Entry: provenance
Pronunciation: 'präv-nən(t)s, 'prä-və-"nän(t)s
Function: noun
Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come -- more at PRO-, COME
1 : ORIGIN, SOURCE
2 : the history of ownership of a valued object or work of art or literature
6What is Provenance?
- Lineage, Pedigree, Origin
- Enables correct interpretation
- Includes
- Who created it
- How was it derived
- Ownership
- Assumptions
- ...
- Provenance is an overloaded term
7Uses of Provenance
- Data Quality
- Audit Trail
- Replication Recipe
- Attribution/Digital Rights Management
- Informational
8Previous Work on Provenance
- The concept of data provenance hasn't been rigorously defined
  - Origin of data and its movement between databases (P. Buneman et al., 2002)
  - Processing and transformations of data (C. Goble et al., 2002)
- Data provenance should be comprehensive enough to be useful in the future
- Current research focuses on some aspects of provenance while ignoring others
9Related research
- Data-centric approach
  - A metadata schema captures creator, dates (e.g. creation date and modification date), and software program.
  - Metadata models in the archival domain such as Dublin Core, MARC, PREMIS, and OAIS
  - The Collaboratory for Multi-scale Chemical Science (CMCS) in chemistry (Pancerella et al., 2003)
- Process-centric approach
  - MyGrid in bioinformatics (Goble et al., 2004)
  - Chimera in high energy physics and astronomy (Foster et al., 2004)
  - The Earth System Science Workbench (ESSW) in earth science (Frew & Bose, 2001)
[Figure: graph with nodes D0-D4 and P1-P3]
Neither of these two approaches has specified
the full spectrum of data provenance
10Previous Work on Provenance
- Investigated in many domains, e.g. Chemistry, Physics, Astronomy, Biology
- Focus on Why and Where (Buneman, 2003)
- Closely related to Data Quality, Reliability and Metadata
- Gap: Comprehensive definition of Provenance
11Specific Research Questions
- Understanding Semantics of Provenance
- Representing and Harvesting Provenance
- Implementing and Using Provenance
12Research questions
Understanding semantics of provenance
- What are the key elements of data provenance?
- What are the relationships between these elements?
Representing and harvesting provenance
- How can data provenance be represented?
- How can data provenance be automatically or semi-automatically harvested?
Implementing and evaluating provenance
- How useful is our model of data provenance?
13Methodology
- Domains: New Product Design, BioTechnology Computing
- Understand Provenance Requirements
- Analyzed more than 200 Scenarios
- Developed W7 Model of Provenance
14New Product Design
15BioTechnology Computing Facility
16Importance of Provenance
- Both Domains of Study
- Many types of Data
- Many places of origin
- Many users
- Reusability of data
17Scenarios
- Collected more than 200 Scenarios (Use Cases)
- Analyzed the Use Cases
- Derived W7 Model of Provenance
18Examples of provenance
- Who
  - Creator, publisher, contributor
  - Ownership
- When
  - Dates (e.g. creation date and modification date)
- Where
  - The literature reference where data were first reported
  - Current location of storage of the data
- How
  - How the data has been derived or transformed
  - Experimental procedures or computations that transform data
- Why
  - The sequence of ideas leading to an experiment
  - Hypotheses an experiment is intended to test
- Which
  - Instrument settings
  - Parameters of software application
- What
  - Creation, transformation, derivation, retirement
19Theoretical basis: Bunge's work
- Bunge's view of history: Mario Bunge, Treatise on Basic Philosophy, Vol. 3: Ontology I: The Furniture of the World. Boston, MA: Reidel, 1977
- Event (Creation, Destruction): What
- Space: Where
- Time: When
- Action: How
- Agent: Who, Which
20Data Provenance Semantics
An engineering team is developing the actuator
fin, the wing system that steers missiles.
21Semantics of Provenance
- The team considers a material that turns out to be unqualified for the actuator fin: the vendor of the material is not on the list of approved vendors, the material panels were fabricated outside of the U.S., and the material test results provided were generated 10 years ago.
Who, where, when determine data quality and
reliability
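The scenario above can be sketched as a check that reads only the Who, Where, and When elements of a record. This is a minimal illustration, not the project's implementation: the vendor names, the maximum test age, and the `MaterialTestRecord` type are hypothetical, since the slides do not state the actual qualification rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MaterialTestRecord:
    vendor: str          # Who supplied the material
    fabricated_in: str   # Where the panels were fabricated
    tested_on: date      # When the test results were generated


# Hypothetical policy values (assumptions, not taken from the slides).
APPROVED_VENDORS = {"Vendor A", "Vendor B"}
MAX_TEST_AGE_YEARS = 5


def qualification_issues(record, today):
    """List the provenance-based reasons a material would be rejected."""
    issues = []
    if record.vendor not in APPROVED_VENDORS:
        issues.append("who: vendor not on approved list")
    if record.fabricated_in != "US":
        issues.append("where: fabricated outside the U.S.")
    age_years = (today - record.tested_on).days / 365.25
    if age_years > MAX_TEST_AGE_YEARS:
        issues.append("when: test results too old")
    return issues


rejected = MaterialTestRecord("Vendor X", "Elsewhere", date(1996, 4, 7))
accepted = MaterialTestRecord("Vendor A", "US", date(2005, 11, 1))
rejected_issues = qualification_issues(rejected, today=date(2006, 4, 7))
accepted_issues = qualification_issues(accepted, today=date(2006, 4, 7))
print(rejected_issues)
print(accepted_issues)
```

Without the three provenance elements, the two records would be indistinguishable to the check.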
22Semantics of Provenance
- The actuator fin team encounters a new-to-the-industry material. As the material is fairly new, it is not yet established how to use it. The team finds that another team used a material, M, which has similar physical and mechanical properties, in another project. The actuator team refers to M's provenance record to find out how and why M was used and what lessons were learned from previous experience.
Who, How, Why facilitate data reuse and sharing
23Semantics of Provenance
- Two engineers are measuring the tensile
strength of the new material. They both perform
the same test on several samples of the material.
The first engineer computes the tensile strength
of the material by taking the average of values
of the samples, while the second computes it by
recording the minimum value. There is no way to
compare them unless the derivation procedures are
known to both engineers.
How: record the data derivation procedure
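A small sketch of the two engineers' situation: the same readings yield different tensile strengths depending on the derivation, so the derivation is stored next to the value. The sample readings and the `derive` and `comparable` helpers are illustrative assumptions.

```python
from statistics import mean

# Illustrative tensile-strength readings in MPa for the same samples.
samples = [762.0, 756.3, 759.1]


def derive(values, method):
    """Return a derived value together with a record of how it was derived."""
    functions = {"average": mean, "minimum": min}
    return {"value": functions[method](values), "how": method}


def comparable(a, b):
    """Derived values are directly comparable only if derived the same way."""
    return a["how"] == b["how"]


engineer_1 = derive(samples, "average")
engineer_2 = derive(samples, "minimum")
print(engineer_1, engineer_2, comparable(engineer_1, engineer_2))
```

Because each result carries its How, a reader can tell that the two numbers differ by procedure rather than by measurement.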
24Semantics of Provenance
- An engineer detects a calibration error in an
instrument and determines that the error may have
existed since July of last year. He wants to
locate the material data that has been generated
using the instrument since July of last year.
Which and When help avoid and correct data recording mistakes
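The calibration scenario amounts to a filter on the Which and When elements. The sketch below assumes each measurement stores the instrument and the measurement date; the instrument names, dates, and the `Measurement` type are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Measurement:
    value_mpa: float
    which: str   # instrument that produced the value (Which)
    when: date   # date the value was generated (When)


def affected(records, instrument, since):
    """Return measurements made with the given instrument on or after a date."""
    return [r for r in records if r.which == instrument and r.when >= since]


records = [
    Measurement(762.0, "tester-A", date(2005, 6, 30)),
    Measurement(756.3, "tester-A", date(2005, 7, 15)),
    Measurement(759.0, "tester-B", date(2005, 8, 1)),
]

# Everything produced by the miscalibrated instrument since July 1.
suspect = affected(records, "tester-A", date(2005, 7, 1))
print(suspect)
```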
25Semantics of Provenance
26Analysis of Provenance
27Semantics of Provenance
- Anchor Point: Information Life Cycle Events (WHAT)
- All other elements of Provenance describe the events
28Information lifecycle
[Figure: Information lifecycle in new product design and development. Events shown: Review, Approval, Archiving, Storage, Verification, Deletion, Creation, Access]
29W7 Model of Provenance
Provenance
- What: Describes what happens to the data by recording various events such as creation, use, and transformation of the data during its lifecycle
- When: Records the occurrence time of events
- Where: Location where an event happens
- Who: Refers to people or organizations involved in data creation and transformation
- How: Documents actions upon the data. It describes the details of how data has been created or transformed
- Which: Describes the instruments or software applications used in creating or processing the data
- Why: The decision-making rationale of actions
[Figure legend: Concept, Property, Relationship (is_a, part_of)]
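One way to picture the W7 model is as a record with one slot per element, anchored on What. The sketch below is an assumption about a possible representation, not the ontology itself; the `W7Record` class and the example values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class W7Record:
    what: str                    # the event being described (anchor)
    when: Optional[str] = None   # occurrence time of the event
    where: Optional[str] = None  # location where the event happens
    who: Optional[str] = None    # people or organizations involved
    how: Optional[str] = None    # actions performed on the data
    which: Optional[str] = None  # instruments or software used
    why: Optional[str] = None    # rationale for the action

    def missing(self):
        """Names of the W7 elements that have not been recorded yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


record = W7Record(what="Creation", when="2006-01-05",
                  who="Tester", which="Instrument")
print(record.missing())
```

A `missing()` helper of this kind could point a harvesting step at the elements it still has to fill in.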
30Granularity of Provenance
- Single Data Value
- Data Field (Attribute)
- Record (Entity)
- Subset of Records
- Whole Data Set (Entity Class)
- Whole Database
31Representation of Provenance
- Annotations
- Grammar for Annotations
32Annotation Syntax
<what annotation>//<when annotation>//<where annotation>//<who annotation>//<how annotation>//<which annotation>//<why annotation>
<what annotation> ::= What: <event type>
<event type> ::= Creation | Transformation | Access | Archiving | Storage
<when annotation> ::= //When: <valid time>
<valid time> ::= Instant | Period | Interval (<Granularity>)
<Granularity> ::= Year | Day | Hour | Minute | Second
<where annotation> ::= //Where: <location>/<location type>
<location> ::= Point | Line | Region
<location type> ::= Internal | External
<how annotation> ::= //How: <action type><action properties>
<action type> ::= Primitive | Complex
<action properties> ::= <precondition><method><artifact><resource>
<precondition> ::= /Precondition (<string> , <string>)
<method> ::= /Method (<string> , <string>)
<artifact> ::= /Artifact (<string> , <string>)
<resource> ::= /Resource (<string> , <string>)
<who annotation> ::= //Who: <agent type><role><position>
<agent type> ::= Individual | Organization | Artificial
<role> ::= (Role)
<position> ::= (position)
<which annotation> ::= //Which: <device type><description><function><settings>
<device type> ::= Instrument | Software
<description> ::= /Description (<string> , <string>)
<function> ::= /Function
<settings> ::= /Settings
<why annotation> ::= //Why: <belief>/<goal>
<belief> ::= - | Assumption | Hypothesis | Assumption/Hypothesis
<goal> ::= - | Goal
<string> ::= letter {letter | digit}
What Annotation Phrase: What:Creation|Transformation|Access|Archiving|Storage//
When Annotation Phrase: When:Period(Day)//
Who Annotation Phrase: Who:Individual(Role, Position)//
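As a rough illustration of the grammar, the sketch below builds What and When phrases and joins them with the `//` separator. It assumes the element name is followed by a colon and that a granularity in parentheses follows the valid time, as in the When example; the function names are hypothetical and only two of the seven annotations are covered.

```python
EVENT_TYPES = {"Creation", "Transformation", "Access", "Archiving", "Storage"}
VALID_TIMES = {"Instant", "Period", "Interval"}
GRANULARITIES = {"Year", "Day", "Hour", "Minute", "Second"}


def what_phrase(event_type):
    """Build a What annotation, rejecting event types outside the grammar."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return f"What:{event_type}"


def when_phrase(valid_time, granularity):
    """Build a When annotation such as When:Period(Day)."""
    if valid_time not in VALID_TIMES:
        raise ValueError(f"unknown valid time: {valid_time}")
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity}")
    return f"When:{valid_time}({granularity})"


def annotation(*phrases):
    """Join individual phrases with the // separator used by the grammar."""
    return "//".join(phrases)


phrase = annotation(what_phrase("Creation"), when_phrase("Period", "Day"))
print(phrase)
```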
33Data Provenance Modeling: Annotated Conceptual Schema
Annotation Phrase: What:Creation // How:Primitive/Artifact(Input) // Why:-/Goal
34Architecture of PROMS
35Provenance Graph
Data: Cycom381/S2 Uni-Glass 111, Tensile Strength: 759 MPa
- What: Derivation
- When: Jan. 5, 2006
- Where: Raytheon, Tucson, AZ
- Who
  - Name: John Herold
  - Role: Creator
- How
  - Method: Average (exclude outliers)
- Which: Granta Design
- has_input: Test Specimen S1, Tensile Strength: 762 MPa
- has_input: Test Specimen S2, Tensile Strength: 756.3 MPa
Test specimen data
- What: Creation
- How
  - Test specification: SACMA SRM-4
  - Test temperature: 108 F
  - Condition of test specimen: Dry
- Who
  - Name: AME Material Test Lab
  - Role: Tester
Relationship labels in the graph: occurs_at, is_involved_in, happens_in, is_used_in, leads_to, because_of, has_input
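The example graph can be approximated with a dictionary of nodes plus `has_input` edges, which is enough to trace a derived value back to its sources. This is a sketch under assumptions: the values come from the slide, while the dictionary layout and the `lineage` helper are hypothetical and the other relationship labels are left out.

```python
from statistics import mean

# Nodes carry W7 attributes; values are taken from the example graph.
nodes = {
    "Cycom381/S2 Uni-Glass 111": {
        "value_mpa": 759,
        "what": "Derivation",
        "when": "2006-01-05",
        "where": "Raytheon, Tucson, AZ",
        "who": {"name": "John Herold", "role": "Creator"},
        "how": "Average (exclude outliers)",
        "which": "Granta Design",
    },
    "Test Specimen S1": {"value_mpa": 762, "what": "Creation"},
    "Test Specimen S2": {"value_mpa": 756.3, "what": "Creation"},
}

# has_input edges: derived item -> the items it was derived from.
has_input = {
    "Cycom381/S2 Uni-Glass 111": ["Test Specimen S1", "Test Specimen S2"],
}


def lineage(item):
    """Return every upstream item reachable through has_input edges."""
    found = []
    for source in has_input.get(item, []):
        found.append(source)
        found.extend(lineage(source))
    return found


sources = lineage("Cycom381/S2 Uni-Glass 111")
# Re-apply the recorded How (average) to the inputs and compare.
recomputed = mean(nodes[s]["value_mpa"] for s in sources)
print(sources, round(recomputed))
```

With only two specimens no outlier is excluded, so averaging the inputs and rounding reproduces the recorded 759 MPa.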
36Provenance Graph
- Provenance is useful for context-based data analysis
- Provenance can be used to group knowledge
- SQL-P: Provenance language
37Miscellaneous Issues
- Aggregation
- Inheritance
- Immutability
- Automated Harvesting
- Propagation
- Security and Access Controls
38Ongoing and Future Work
- Evaluating Utility of Provenance
- Extending to Other Domains
- Using Provenance for Information Life Cycle
Management
39Contribution/Conclusion
- Understand the meaning of Provenance
- Autonomous harvesting of Provenance
- Extend to other domains: Drug Discovery, Biodiversity, Manufacturing.
- Develop a Consortium: Industry, Federal Agencies, and other partners.
- Impact on Long Term Digital Archiving and preservation and Data Sharing
40QUESTIONS?