Title: Provenance: an open approach to experiment validation in e-Science
1Provenance: an open approach to experiment validation in e-Science
- Professor Luc Moreau
- L.Moreau_at_ecs.soton.ac.uk
- University of Southampton
- www.ecs.soton.ac.uk/lavm
2Provenance PASOA Teams
- University of Southampton: Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
- IBM UK (EU Project Coordinator): John Ibbotson, Neil Hardman, Alexis Biller
- University of Wales, Cardiff: Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
- Universitat Politècnica de Catalunya (UPC): Steven Willmott, Javier Vazquez
- SZTAKI: Laszlo Varga, Arpad Andics, Tamas Kifor
- German Aerospace: Andreas Schreiber, Guy Kloss, Frank Danneman
3Contents
- Motivation
- Provenance Concepts
- Provenance Architecture
- Standardisation
- Provenance Queries
- Conclusions
4Motivation
5Scientific Research
Academic Peer Review
6Audit Business Regulations
Domains: Accounting, Healthcare, Banking
Audit:
- Sarbanes-Oxley
- Basel II
- European Rec. R(97)5 (protection of medical data)
- ...
7e-Science datasets
- How to undertake peer-reviewing and validation of
e-Scientific results?
8Sarbanes-Oxley
- The American Competitiveness and Corporate Accountability Act of 2002, commonly known as the Sarbanes-Oxley Act, was signed into law on July 30, 2002.
- The law is intended to protect investors by improving the accuracy and reliability of corporate disclosures.
- Sarbanes-Oxley also defines a higher level of responsibility, accountability, and financial reporting transparency: changes that are also intended to return confidence to investors.
9Food and Drug Administration
10Basel II
11Compliance to Regulations
- The next-compliance problem
- Can we be certain that, by ensuring compliance with a new regulation, we do not break compliance with previous ones?
12Current Solutions
- Proprietary, Monolithic
- Silos, Closed
- Do not inter-operate with other applications
- Not adaptable to new regulations
13Provenance
- Oxford English Dictionary
- the fact of coming from some particular source or quarter; origin, derivation
- the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners
- Concept vs representation
14Provenance in Computer Systems
- Our definition of provenance, in the context of applications for which process matters to end users:
- The provenance of a piece of data is the process that led to that piece of data
- Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
15Our Approach
- Define core concepts pertaining to provenance
- Specify functionality required to become provenance-aware
- Define open data models and protocols that allow systems to inter-operate
- Standardise data models and protocols
- Provide a reference implementation
- Provide reasoning capability
16Context (1)
- Aerospace engineering: maintain a historical record of design processes, up to 99 years.
- Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and the recovery rate of patients
17Context (2)
- Bioinformatics: verification and auditing of experiments (e.g. for drug approval)
- High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)
18Provenance Concepts
19Provenance Lifecycle
Core Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
20Nature of Documentation
- We represent the provenance of some data by documenting the process that led to the data
- documentation can be complete or partial
- it can be accurate or inaccurate
- it can present conflicting or consensual views of the actors involved
- it can provide operational details of execution or it can be abstract
21p-assertion
- A given element of process documentation will be referred to as a p-assertion
- A p-assertion is an assertion that is made by an actor and pertains to a process.
22Service Oriented Architecture
- Broad definition of a service as a component that takes some inputs and produces some outputs.
- Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
- Interactions with services take place with messages that are constructed according to the service's interface specification.
- The term actor denotes either a client or a service in a SOA.
- A process is defined as the execution of a workflow
23Process Documentation (1)
[Diagram: Actor 1 and Actor 2 with messages M1, M2, M3, M4]
From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4)
If actors are black boxes, these assertions are not very useful because we do not know the dependencies between messages
24Process Documentation (2)
[Diagram: Actor 1 and Actor 2 with messages M1, M2, M3, M4]
These assertions help identify the order of messages, but not how data was computed
25Process Documentation (3)
[Diagram: Actor 1 and Actor 2 with messages M1, M2, M3, M4]
These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc.)
26Process Documentation (4)
[Diagram: Actor 1 and Actor 2 with messages M1, M2, M3, M4]
27Types of p-assertions (1)
- An interaction p-assertion is an assertion of the contents of a message by an actor that has sent or received that message
28Types of p-assertions (2)
- A relationship p-assertion is an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)
29Types of p-assertions (3)
- An actor state p-assertion is an assertion made by an actor about its internal state in the context of a specific interaction
- Examples: "I used sparc processor", "I used algorithm x version x.y.z"
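The three kinds of p-assertion can be pictured as three small record types. The sketch below is only an illustration of the definitions on these slides: the class names, field names, actor names ("client", "service") and message identifiers ("m1", "m2") are assumptions made for this example and are not taken from the PASOA schemas.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class InteractionPAssertion:
    """An actor asserts the contents of a message it sent or received."""
    asserter: str      # actor making the assertion
    interaction: str   # hypothetical message identifier, e.g. "m1"
    view: str          # "sender" or "receiver"
    content: str       # the asserted message contents


@dataclass(frozen=True)
class RelationshipPAssertion:
    """An actor asserts how it obtained an output from some inputs."""
    asserter: str
    effect: str              # interaction carrying the output
    causes: Tuple[str, ...]  # interactions carrying the inputs
    relation: str            # name of the function that was applied


@dataclass(frozen=True)
class ActorStatePAssertion:
    """An actor asserts something about its internal state in an interaction."""
    asserter: str
    interaction: str
    state: str


PAssertion = Union[InteractionPAssertion, RelationshipPAssertion, ActorStatePAssertion]


def asserted_by(documentation: List[PAssertion], actor: str) -> List[PAssertion]:
    """Return every p-assertion made by the given actor."""
    return [p for p in documentation if p.asserter == actor]


# Hypothetical process documentation: a client sends m1, a service answers with m2.
documentation: List[PAssertion] = [
    InteractionPAssertion("client", "m1", "sender", "request"),
    InteractionPAssertion("service", "m1", "receiver", "request"),
    RelationshipPAssertion("service", effect="m2", causes=("m1",), relation="computedFrom"),
    ActorStatePAssertion("service", "m2", "I used algorithm x version x.y.z"),
]
```

Every record carries its asserter, which reflects the point that a p-assertion is always a statement made by some actor rather than an objective fact.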
30Data flow
- Interaction p-assertions allow us to specify a flow of data between actors
- Relationship p-assertions allow us to characterise the flow of data inside an actor
- The overall data flow (internal and external) constitutes a DAG, which characterises the process that led to a result
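As a rough sketch of how the two kinds of flow combine into one DAG, the example below turns interactions into edges between actors and relationships into edges inside an actor, then checks that the result can be ordered topologically. The tuple layouts, actor names and message names are assumptions for illustration only; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (actor, message): a message as seen by one actor

# External flow, from interaction p-assertions: (sender, receiver, message)
interactions = [("Actor 1", "Actor 2", "M1"), ("Actor 2", "Actor 1", "M2")]
# Internal flow, from relationship p-assertions: (actor, input message, output message)
relationships = [("Actor 2", "M1", "M2")]


def data_flow(
    interactions: List[Tuple[str, str, str]],
    relationships: List[Tuple[str, str, str]],
) -> Dict[Node, Set[Node]]:
    """Map each node to the set of nodes it directly depends on."""
    preds: Dict[Node, Set[Node]] = {}
    for sender, receiver, msg in interactions:
        preds.setdefault((sender, msg), set())
        # The received message depends on the message as it was sent.
        preds.setdefault((receiver, msg), set()).add((sender, msg))
    for actor, msg_in, msg_out in relationships:
        preds.setdefault((actor, msg_in), set())
        # The output of an actor depends on the input it was computed from.
        preds.setdefault((actor, msg_out), set()).add((actor, msg_in))
    return preds


graph = data_flow(interactions, relationships)
# static_order() raises graphlib.CycleError if the flow is not a DAG.
order = list(TopologicalSorter(graph).static_order())
```

Reading `order` from start to end gives one valid account of the process that led to the last message.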
31Provenance Architecture
32Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
34P-Assertion schemas
35The p-structure (1)
- The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
- Hierarchical
- Indexed by interactions (interaction = 1 message exchange)
- Each interaction has a sender's view and a receiver's view
36The p-structure (2)
Asserter identity
All p-assertions asserted by a given actor
participating in an interaction
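One way to picture this hierarchy is as nested maps: interaction, then view, then asserter, then that asserter's p-assertions. The sketch below assumes an interaction key made of message source, message sink and an identifier; that shape, the class name and the method names are hypothetical and are not the actual provenance store schema.

```python
from typing import Dict, List, Tuple

# Assumed shape of an interaction key: (message source, message sink, id)
InteractionKey = Tuple[str, str, str]
VIEWS = ("sender", "receiver")


class PStructure:
    """Hierarchical store: interaction -> view -> asserter -> p-assertions."""

    def __init__(self) -> None:
        self._records: Dict[InteractionKey, Dict[str, Dict[str, List[str]]]] = {}

    def record(self, key: InteractionKey, view: str, asserter: str, p_assertion: str) -> None:
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view!r}")
        views = self._records.setdefault(key, {v: {} for v in VIEWS})
        views[view].setdefault(asserter, []).append(p_assertion)

    def assertions(self, key: InteractionKey, view: str, asserter: str) -> Tuple[str, ...]:
        """All p-assertions made by one asserter in one view of one interaction."""
        return tuple(self._records.get(key, {}).get(view, {}).get(asserter, []))

    def interactions(self) -> List[InteractionKey]:
        return list(self._records)


store = PStructure()
key = ("http://www.example.com/gui", "http://www.example.com/store", "1")
store.record(key, "sender", "GUI", "sent store(6, file1)")
store.record(key, "receiver", "Store", "received store(6, file1)")
store.record(key, "receiver", "Store", "state: disk quota checked")
```

Because both views hang off the same interaction key, a querying actor can compare what the sender and the receiver each asserted about the same message exchange.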
37Recording Protocol (Groth04-06)
- Abstract machines
- DS Properties
- Termination
- Liveness
- Safety
- Statelessness
- Documentation Properties
- Immutability
- Attribution
- Datatype safety
- Foundation for adding necessary cryptographic
techniques
38Querying Functionality (Miles06)
- Process Documentation Query Interface allows for navigation of the documentation of execution
- Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures
- Independent of the technology used for running the application and of the internal store representation
- Seamless navigation of application-dependent and application-independent process documentation
39Querying Functionality (Miles06)
- Provenance Query Interface allows us to obtain the provenance of some specific data
- A recognition that there is not one provenance for a piece of data, but there may be different ones, depending on the end-user's interest
- Hence, provenance is seen as the result of a query
- Identify a piece of data at a specific execution point
- Scope of the process of interest
- Filter in/out p-assertions according to actors, process, types of relationships, etc.
40Available Software
- PReServ (Paul Groth, Simon Miles)
- Offers recording and querying interfaces
- Available from www.pasoa.org
- OGSA-DAI based version available from www.gridprovenance.org
- Is being used in a bioinformatics application (cf. hpdc05, iswc05)
41Provenance Store Components
[Architecture diagram. Components: ProvenanceStoreFactory (Factory), ProvenanceServiceResourceHome, ProvenanceServiceResource, ProvenanceService, PStoreDatabase, OGSA-DAI Client API, OGSA-DAI, eXist XML Database, External Security Services. Containers: Globus GT4 Container. Edge labels: Uses, Manages.]
Slide from John Ibbotson
42Provenance Store Security
[Diagram. Elements: Request, ProvenanceStoreFactory (Factory), ProvenanceService, two Policy Decision Points each with Approve and Deny outcomes, ACL File (XML), Provenance GT4 Container.]
Slide from John Ibbotson
43Provenance Implementation
- The Client Side Library exposes Provenance Store functionality and separates the Actor from alternative server-side implementations:
- EU Provenance project implementation
- PASOA PReServ
- Security is being extended to allow federation using Globus Community Authorization Service (CAS)
Slide from John Ibbotson
44Standardisation
45Standardisation Options
46Purpose of Standardisation
[Diagram labels: Application, Application, Provenance Stores]
Allow for multiple applications to document their execution. Applications may be running in different institutions.
47Purpose of Standardisation
[Diagram labels: Application, three Provenance Stores]
Allow for multiple stores from multiple IT providers
48Purpose of Standardisation
[Diagram labels: two Provenance Stores, Query Provenance of Data]
Allow for multiple stores from multiple IT providers
49Purpose of Standardisation
Convert to standard data format
Allow for legacy, monolithic applications to expose their contents (according to a standard schema)
50Purpose of Standardisation
[Diagram label: Application]
Allow third parties to host provenance stores, which are trusted not only by application owners but also by auditors
51Compliance Oriented Architectures
- Separate execution documentation from compliance verification
- Allows for multiple compliance verifications
- Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
- Approach is suitable for e-scientific peer-reviewing and business compliance verification
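A minimal sketch of this separation, under stated assumptions: the process documentation is recorded once, and each compliance verification is an independent function applied to it afterwards. The record layout and both rules below are invented for illustration and do not correspond to any actual regulation or to the PASOA data model.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]                       # one hypothetical documentation entry
Check = Callable[[List[Record]], List[str]]   # returns a list of violations


def both_views_documented(records: List[Record]) -> List[str]:
    """Hypothetical rule: each interaction is documented by sender and receiver."""
    views: Dict[str, set] = {}
    for r in records:
        views.setdefault(r["interaction"], set()).add(r["view"])
    return [
        f"{interaction}: missing {view} view"
        for interaction, seen in sorted(views.items())
        for view in ("sender", "receiver")
        if view not in seen
    ]


def only_known_actors(allowed: set) -> Check:
    """Hypothetical rule factory: every asserter must be a known actor."""
    def check(records: List[Record]) -> List[str]:
        return [
            f"{r['interaction']}: unknown actor {r['actor']}"
            for r in records
            if r["actor"] not in allowed
        ]
    return check


def verify(records: List[Record], checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every check over the same, unchanged documentation."""
    return {name: check(records) for name, check in checks.items()}


records = [
    {"interaction": "m1", "view": "sender", "actor": "GUI"},
    {"interaction": "m1", "view": "receiver", "actor": "Averager"},
    {"interaction": "m2", "view": "sender", "actor": "Averager"},
]
report = verify(records, {
    "both_views": both_views_documented,
    "known_actors": only_known_actors({"GUI", "Averager", "Divider"}),
})
```

Adding a check for a new regulation means adding one entry to the dictionary passed to `verify`; the recorded documentation and the existing checks are left untouched, which is the property the next-compliance problem asks for.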
52Standardisation Philosophy
- Thin layer common between systems: extensible data model
- Model can be extended for specific technologies (WS, Web, ...) or application domains (Bio, Healthcare, Desktop, ...)
- Service interfaces
53Proposed List of Specifications
Generic Profiles
Domain Specific Profiles
WS-Prov-DM-Sec
WS-Prov-Intro
WS-Prov-DM-Link
WS-Prov-Glo
WS-Prov-DM-Infer
WS-Prov-DM
WS-Prov-DM-DS
WS-Prov-Primer
WS-Prov-DM-Rel
WS-Prov-Rec
WS-Prov-Query
Technology Bindings
WS-Prov-SOAP
WS-Prov-WWW
54Provenance Queries (Miles06)
55Example Application
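[Diagram: GUI, Averager, Divider and Store exchanging the messages below]
1. average (7, 5), from GUI to Averager
2. divide (12, 2), from Averager to Divider
3. answer (6), from Divider to Averager
4. answer (6), from Averager to GUI
5. store (6, file1), from GUI to Store
Averager(in1, in2) returns (in1 + in2)/2. Averager delegates the division operation to the service Divider.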
56Example Application
[Diagram: GUI, Averager, Divider and Store exchanging the messages below]
1. average (7, 5)
2. divide (12, 2)
3. answer (6)
4. answer (6)
5. store (6, file1)
- Relationships
- 12 in msg 2 is sum of 7, 5 in msg 1
- 6 in msg 3 is division of 12, 2 in msg 2
- 6 in msg 4 is copy of 6 in msg 3
- 6 in msg 4 is average of 7, 5 in msg 1
- 6 in msg 5 is copy of 6 in msg 4
- Tracers
- are used to demarcate activities (aka sets of services)
- added by Averager in call to Divider
- returned by Divider in response
57The data we want to find the provenance of
- Identify the event where the entity is documented
- In this case, the event is the receipt of a request to store the data in the file named file1
- Identify the data entity within that message
- In this case, the data of interest is the 6 stored in file1
58Provenance Graph
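[Provenance graph of the 6 stored in file1. The 6 sent from GUI to Store is a copy of the 6 sent from Averager to GUI. That 6 is a copy of the 6 sent from Divider to Averager, and is also the average of the 7 and 5 sent from GUI to Averager. The 6 from Divider is the division of the dividend 12 and the divisor 2 sent from Averager to Divider. The 12 is the sum of the 7 and the 5.]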
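The provenance graph for this example can be obtained by walking the relationships backwards from the data item of interest. The sketch below is one way to do that, not the PASOA query engine: data items are identified by an assumed (message number, value) pair, with the store request taken as message 5, and the relation names are written in camel case for convenience.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Item = Tuple[int, str]         # (message number, value)
Edge = Tuple[Item, str, Item]  # (effect, relation, cause)

# Relationships asserted in the Averager example.
RELATIONSHIPS: List[Edge] = [
    ((2, "12"), "sumOf", (1, "7")),
    ((2, "12"), "sumOf", (1, "5")),
    ((3, "6"), "divisionOf", (2, "12")),
    ((3, "6"), "divisionOf", (2, "2")),
    ((4, "6"), "copyOf", (3, "6")),
    ((4, "6"), "averageOf", (1, "7")),
    ((4, "6"), "averageOf", (1, "5")),
    ((5, "6"), "copyOf", (4, "6")),
]


def provenance(start: Item, relationships: List[Edge]) -> List[Edge]:
    """Collect every relationship reachable backwards from the start item."""
    by_effect: Dict[Item, List[Edge]] = {}
    for edge in relationships:
        by_effect.setdefault(edge[0], []).append(edge)

    seen: Set[Item] = {start}
    queue = deque([start])
    graph: List[Edge] = []
    while queue:
        item = queue.popleft()
        for edge in by_effect.get(item, []):
            graph.append(edge)
            cause = edge[2]
            if cause not in seen:   # visit each data item only once
                seen.add(cause)
                queue.append(cause)
    return graph


# Provenance of the 6 received by Store in message 5.
graph = provenance((5, "6"), RELATIONSHIPS)
items = {edge[0] for edge in graph} | {edge[2] for edge in graph}
```

Starting from a different item, for example `(2, "12")`, returns only the two "sumOf" edges, which illustrates that provenance is always relative to the data item being asked about.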
59Scoped Provenance Graph
Allows us to ignore the high-level structure of the computation and to focus on the actual operations, e.g. allows us to establish what a given provider actually does
Filter to exclude "Average of" relationships
[Provenance graph of slide 58, without the "Average of" edges]
60Scoped Provenance Graph
Allows us to consider a given service (and all its inferior invocations) as a black box: a high-level account of provenance, e.g. no detail should be provided about the internals of Averager
Filter to exclude messages containing the tracer. This is equivalent to hiding the internal operation of Averager.
[Provenance graph of slide 58, without the messages exchanged between Averager and Divider]
61Scoped Provenance Graph
Allows us to scope the provenance graph according to types of data or operations, e.g. looking at the restorations of a painting rather than its various owners
Filter to exclude Divisor parameters
[Provenance graph of slide 58, without the divisor 2]
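The three scoped graphs can be reproduced by giving the backward traversal a filter that decides which relationships may be followed. This is a sketch under assumptions: the Edge fields, the role labels and the set of traced messages (2 and 3, the call to Divider and its response) are modelled by hand for the Averager example and are not the PASOA query interface.

```python
from collections import deque
from typing import Callable, Dict, List, NamedTuple, Set, Tuple

Item = Tuple[int, str]  # (message number, value)


class Edge(NamedTuple):
    effect: Item
    relation: str
    cause: Item
    role: str = ""      # role of the cause in the relation, where known


# Messages carrying the tracer that Averager added when calling Divider.
TRACED_MESSAGES = {2, 3}

EDGES: List[Edge] = [
    Edge((2, "12"), "sumOf", (1, "7")),
    Edge((2, "12"), "sumOf", (1, "5")),
    Edge((3, "6"), "divisionOf", (2, "12"), "dividend"),
    Edge((3, "6"), "divisionOf", (2, "2"), "divisor"),
    Edge((4, "6"), "copyOf", (3, "6")),
    Edge((4, "6"), "averageOf", (1, "7")),
    Edge((4, "6"), "averageOf", (1, "5")),
    Edge((5, "6"), "copyOf", (4, "6")),
]


def scoped_provenance(start: Item, edges: List[Edge],
                      keep: Callable[[Edge], bool]) -> List[Edge]:
    """Backward traversal that only follows edges accepted by `keep`."""
    by_effect: Dict[Item, List[Edge]] = {}
    for edge in edges:
        by_effect.setdefault(edge.effect, []).append(edge)
    seen: Set[Item] = {start}
    queue = deque([start])
    graph: List[Edge] = []
    while queue:
        for edge in by_effect.get(queue.popleft(), []):
            if not keep(edge):
                continue
            graph.append(edge)
            if edge.cause not in seen:
                seen.add(edge.cause)
                queue.append(edge.cause)
    return graph


target = (5, "6")
without_average = scoped_provenance(target, EDGES, lambda e: e.relation != "averageOf")
black_box = scoped_provenance(target, EDGES, lambda e: e.cause[0] not in TRACED_MESSAGES)
without_divisor = scoped_provenance(target, EDGES, lambda e: e.role != "divisor")
```

With the tracer filter, the 6 returned by Averager is explained only as the average of 7 and 5, so the Divider never appears; with the "averageOf" filter, the same 6 is explained only through the sum and the division.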
62Provenance Query
63Practically
- Event and Data Identification
- Event identification:
    //ps:interactionRecord[ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address = "http://www.example.com/store"]
- The interaction record in which the receiver (messageSink) has address http://www.example.com/store
- Data identification:
    //ps:interactionPAssertion[ex:envelope/ex:store/ex:location = "/home/sm/data/file1"]//ex:envelope/ex:store/ex:data
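To make the event and data identification concrete, the sketch below runs the same two selections over a small, invented XML document using Python's standard library. The element names follow the paths on this slide, but the document itself, the root element, the namespace URIs and the second record's address are all placeholders. ElementTree supports only a subset of XPath, so the nested-path conditions are applied in Python rather than as XPath predicates.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs; the real ones are not given on the slide.
NS = {"ps": "urn:example:ps", "wsa": "urn:example:wsa", "ex": "urn:example:ex"}

DOCUMENT = """
<ps:pstruct xmlns:ps="urn:example:ps" xmlns:wsa="urn:example:wsa" xmlns:ex="urn:example:ex">
  <ps:interactionRecord>
    <ps:interactionKey><ps:messageSink><wsa:EndpointReference>
      <wsa:Address>http://www.example.com/averager</wsa:Address>
    </wsa:EndpointReference></ps:messageSink></ps:interactionKey>
  </ps:interactionRecord>
  <ps:interactionRecord>
    <ps:interactionKey><ps:messageSink><wsa:EndpointReference>
      <wsa:Address>http://www.example.com/store</wsa:Address>
    </wsa:EndpointReference></ps:messageSink></ps:interactionKey>
    <ps:interactionPAssertion>
      <ex:envelope><ex:store>
        <ex:location>/home/sm/data/file1</ex:location>
        <ex:data>6</ex:data>
      </ex:store></ex:envelope>
    </ps:interactionPAssertion>
  </ps:interactionRecord>
</ps:pstruct>
"""

SINK_PATH = "ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address"


def find_events(root, sink_address):
    """Event identification: records whose message sink has the given address."""
    return [
        record
        for record in root.findall(".//ps:interactionRecord", NS)
        if record.findtext(SINK_PATH, namespaces=NS) == sink_address
    ]


def find_data(record, location):
    """Data identification: data stored at the given location in that record."""
    found = []
    for assertion in record.findall(".//ps:interactionPAssertion", NS):
        for store in assertion.findall("ex:envelope/ex:store", NS):
            if store.findtext("ex:location", namespaces=NS) == location:
                found.extend(data.text for data in store.findall("ex:data", NS))
    return found


root = ET.fromstring(DOCUMENT.strip())
events = find_events(root, "http://www.example.com/store")
data = [value for event in events for value in find_data(event, "/home/sm/data/file1")]
```

The two steps mirror the slide: first narrow down to the event (the receipt of the store request), then pick out the data item inside that message.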
64Practically
- The scope of the provenance query
- Unscoped query:
    /
- Exclude averageOf relation:
    /pq:relationshipTarget[ps:relation != "http://www.example.com#averageOf"]
- Exclude tracer introduced by Averager:
    /pq:relationshipTarget/ps:interactionPAssertion[not(ex:envelope/ph:pheader/ph:interactionMetaData/ph:tracer = "process://sub/1")]
65Provenance of Donor Diagnosis Request
[Provenance graph of a donor diagnosis request. Relations: Was Caused By, Is Response To, Includes Data, Is Diagnosis Request For. Other labels (messages, data and actors): Data Collection Request, EHCR Request, EHCR, Data Collection Complete, Patient (in Brain Death Notification), Patient, Test Results, Donor Data Collection, Diagnose Request, Healthcare Record Manager, Donor Data Collector, EHCRS, Brain Death Manager, Testing Lab, User Interface, Decision Maker.]
66Conclusions
67To Sum Up
Standardising the documentation of Business Processes
Domains: Finance, Distribution, Aerospace, Healthcare, Automobile, Pharmaceutical
Query:
- Compliance check
- Rerun/Reproduce
- Analyse
Slide from John Ibbotson
68Conclusions
- Crucial topic for many applications
- Full architectural specification
- An implementation available for download
- Methodology to make applications provenance-aware
- www.pasoa.org
- www.gridprovenance.org
69Provenance Challenge
twiki.ipaw.info
70Publications
- Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
- Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
- Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
- Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
- Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing: Position Paper. 2003.
- Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.
- Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.
71Questions