Provenance: an open approach to experiment validation in e-Science

1
Provenance: an open approach to experiment validation in e-Science
  • Professor Luc Moreau
  • L.Moreau_at_ecs.soton.ac.uk
  • University of Southampton
  • www.ecs.soton.ac.uk/lavm

2
Provenance and PASOA Teams
  • University of Southampton: Luc Moreau, Paul Groth, Simon Miles,
    Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve
    Munroe, Zheng Chen
  • IBM UK (EU Project Coordinator): John Ibbotson, Neil Hardman,
    Alexis Biller
  • University of Wales, Cardiff: Omer Rana, Arnaud Contes, Vikas
    Deora, Ian Wootten, Shrija Rajbhandari
  • Universitat Politècnica de Catalunya (UPC): Steven Willmott,
    Javier Vazquez
  • SZTAKI: Laszlo Varga, Arpad Andics, Tamas Kifor
  • German Aerospace: Andreas Schreiber, Guy Kloss, Frank Danneman

3
Contents
  • Motivation
  • Provenance Concepts
  • Provenance Architecture
  • Standardisation
  • Provenance Queries
  • Conclusions

4
Motivation
5
Scientific Research
Academic Peer Review
6
Audit Business Regulations
Accounting, Healthcare, Banking
Audit: Sarbanes-Oxley, Basel II, European Rec. R(97)5 (protection of
medical data), ...
7
e-Science datasets
  • How to undertake peer-reviewing and validation of
    e-Scientific results?

8
Sarbanes-Oxley
  • The American Competitiveness and Corporate
    Accountability Act of 2002, commonly known as the
    Sarbanes-Oxley Act, was signed into law on July
    30, 2002.
  • The law is intended to protect investors by
    improving the accuracy and reliability of
    corporate disclosures.
  • Sarbanes-Oxley also defines a higher level of
    responsibility, accountability, and financial
    reporting transparency - changes that are
    intended to return confidence to investors, as
    well.

9
Food and Drug Administration
10
Basel II
11
Compliance with Regulations
  • The next-compliance problem:
  • Can we be certain that, by ensuring compliance with
    a new regulation, we do not break previous
    compliance?

12
Current Solutions
  • Proprietary, Monolithic
  • Silos, Closed
  • Do not inter-operate with other applications
  • Not adaptable to new regulations

13
Provenance
  • Oxford English Dictionary:
  • the fact of coming from some particular source or
    quarter; origin, derivation
  • the history or pedigree of a work of art,
    manuscript, rare book, etc.
  • concretely, a record of the passage of an item
    through its various owners.
  • Concept vs representation

14
Provenance in Computer Systems
  • Our definition of provenance, in the context of
    applications for which process matters to end
    users:
  • The provenance of a piece of data is the
    process that led to that piece of data
  • Our aim is to conceive a computer-based
    representation of provenance that allows us to
    perform useful analysis and reasoning to support
    our use cases

15
Our Approach
  • Define core concepts pertaining to provenance
  • Specify functionality required to become
    provenance-aware
  • Define open data models and protocols that allow
    systems to inter-operate
  • Standardise data models and protocols
  • Provide a reference implementation
  • Provide reasoning capability

16
Context (1)
  • Aerospace engineering: maintain a historical
    record of design processes for up to 99 years.
  • Organ transplant management: tracking of previous
    decisions, crucial to maximise the efficiency in
    matching and the recovery rate of patients.
17
Context (2)
  • Bioinformatics: verification and auditing of
    experiments (e.g. for drug approval)
  • High Energy Physics: tracking, analysing and
    verifying data sets in the ATLAS Experiment of
    the Large Hadron Collider (CERN)
18
Provenance Concepts
19
Provenance Lifecycle
Core Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
20
Nature of Documentation
  • We represent the provenance of some data by
    documenting the process that led to the data
  • documentation can be complete or partial
  • it can be accurate or inaccurate
  • it can present conflicting or consensual views of
    the actors involved
  • it can provide operational details of execution
    or it can be abstract.

21
p-assertion
  • A given element of process documentation will be
    referred to as a p-assertion
  • A p-assertion is an assertion that is made by an
    actor and pertains to a process.

22
Service Oriented Architecture
  • Broad definition of service as component that
    takes some inputs and produces some outputs.
  • Services are brought together to solve a given
    problem typically via a workflow definition that
    specifies their composition.
  • Interactions with services take place with
    messages that are constructed according to
    the service's interface specification.
  • The term actor denotes either a client or a
    service in a SOA.
  • A process is defined as the execution of a workflow

23
Process Documentation (1)
[Figure: Actor 1 and Actor 2 with messages M1, M2, M3 and M4]
From these p-assertions, we can derive that M3
was sent by Actor 1 and received by Actor 2 (and
likewise for M4).
If actors are black boxes, these assertions are
not very useful because we do not know the
dependencies between messages.
24
Process Documentation (2)
[Figure: Actor 1 and Actor 2 with messages M1, M2, M3 and M4]
These assertions help identify the order of
messages, but not how data was computed.
25
Process Documentation (3)
[Figure: Actor 1 and Actor 2 with messages M1, M2, M3 and M4]
These assertions help identify how data is
computed, but provide no information about
non-functional characteristics of the
computation (time, resources used, etc.).
26
Process Documentation (4)
[Figure: Actor 1 and Actor 2 with messages M1, M2, M3 and M4]
27
Types of p-assertions (1)
  • An interaction p-assertion is an assertion of the
    contents of a message by an actor that has sent
    or received that message

28
Types of p-assertions (2)
  • A relationship p-assertion is an assertion, made
    by an actor, that describes how the actor
    obtained an output message sent in an
    interaction by applying some function to input
    messages from other interactions (likewise for
    data)

29
Types of p-assertions (3)
  • An actor state p-assertion is an assertion made by
    an actor about its internal state in the context of
    a specific interaction
  • Examples: "I used sparc processor"; "I used
    algorithm x version x.y.z"
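
The three kinds of p-assertion can be pictured as three small record types. The sketch below is only an illustration: the field names, the use of message names such as "M3" as interaction keys, and the "computed-from" relation are assumptions made for the example, not the PASOA schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    """Asserts the contents of a message, by its sender or its receiver."""
    asserter: str      # actor making the assertion
    interaction: str   # hypothetical key naming one message exchange
    view: str          # "sender" or "receiver"
    content: str       # message contents as seen by the asserter


@dataclass(frozen=True)
class RelationshipPAssertion:
    """Asserts how an output was obtained from inputs of other interactions."""
    asserter: str
    interaction: str          # interaction carrying the output
    relation: str             # hypothetical name of the function applied
    inputs: Tuple[str, ...]   # interactions that supplied the inputs


@dataclass(frozen=True)
class ActorStatePAssertion:
    """Asserts the actor's internal state in the context of one interaction."""
    asserter: str
    interaction: str
    state: str


# Illustrative documentation of two messages; the contents are made up.
assertions = [
    InteractionPAssertion("Actor 1", "M3", "sender", "payload of M3"),
    InteractionPAssertion("Actor 2", "M3", "receiver", "payload of M3"),
    RelationshipPAssertion("Actor 2", "M4", "computed-from", ("M3",)),
    ActorStatePAssertion("Actor 2", "M4", "I used algorithm x version x.y.z"),
]

kinds = [type(a).__name__ for a in assertions]
print(kinds)
```

Making the records immutable (frozen) mirrors the idea that an assertion, once made, is not edited afterwards.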
30
Data flow
  • Interaction p-assertions allow us to specify a
    flow of data between actors
  • Relationship p-assertions allow us to
    characterise the flow of data inside an actor
  • Overall data flow (internal and external)
    constitutes a DAG, which characterises the
    process that led to a result

31
Provenance Architecture
32
Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
34
P-Assertion schemas
35
The p-structure (1)
  • The p-structure is a common logical structure of
    the provenance store shared by all asserting and
    querying actors
  • Hierarchical
  • Indexed by interactions (interaction = 1 message
    exchange)

Sender's view, Receiver's view
36
The p-structure (2)
Asserter identity
All p-assertions asserted by a given actor
participating in an interaction
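
As a rough picture of this hierarchy, the p-structure can be sketched as a nested mapping: interaction, then view (sender or receiver), then asserter identity, then the p-assertions made by that asserter. The key names and the string payloads below are assumptions for illustration only; the actual p-structure is defined by the p-assertion schemas.

```python
from collections import defaultdict

# interaction key -> view ("sender" / "receiver") -> asserter -> p-assertions
p_structure = defaultdict(
    lambda: {"sender": defaultdict(list), "receiver": defaultdict(list)}
)


def record(interaction, view, asserter, p_assertion):
    """File one p-assertion under its interaction, view and asserter."""
    p_structure[interaction][view][asserter].append(p_assertion)


# Hypothetical documentation of one message exchange, M3.
record("M3", "sender", "Actor 1", "interaction: sent M3")
record("M3", "receiver", "Actor 2", "interaction: received M3")
record("M3", "receiver", "Actor 2", "actor state: used sparc processor")

# All p-assertions asserted by Actor 2 in the receiver's view of M3.
receiver_view = p_structure["M3"]["receiver"]["Actor 2"]
print(receiver_view)
```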
37
Recording Protocol (Groth04-06)
  • Abstract machines
  • DS Properties: Termination, Liveness, Safety, Statelessness
  • Documentation Properties: Immutability, Attribution,
    Datatype safety
  • Foundation for adding necessary cryptographic
    techniques

38
Querying Functionality (Miles06)
  • Process Documentation Query Interface allows for
    navigation of the documentation of execution
  • Allows us to view the provenance store (i.e. the
    p-structure) as if containing XML data structures
  • Independent of technology used for running
    application and internal store representation
  • Seamless navigation of application dependent and
    application independent process documentation

39
Querying Functionality (Miles06)
  • Provenance Query Interface allows us to obtain
    the provenance of some specific data
  • A recognition that there is not one provenance
    for a piece of data, but there may be several,
    depending on the end user's interest
  • Hence, provenance is seen as the result of a
    query
  • Identify a piece of data at a specific execution
    point
  • Scope of the process of interest
  • Filter in/out p-assertions according to actors,
    process, types of relationships, etc

40
Available Software
  • PReServ (Paul Groth and Simon Miles)
  • Offers recording and querying interfaces
  • Available from www.pasoa.org
  • OGSA-DAI based version available from
    www.gridprovenance.org
  • Is being used in a bioinformatics application
    (cf. hpdc05, iswc05)

41
Provenance Store Components
[Figure: component diagram with the labels ProvenanceStoreFactory,
Factory, ProvenanceServiceResourceHome, ProvenanceServiceResource,
ProvenanceService, PStoreDatabase, OGSA-DAI Client API, OGSA-DAI,
eXist XML Database, Globus GT4 Container and External Security
Services; edges labelled Uses and Manages]
Slide from John Ibbotson
42
Provenance Store Security
[Figure: security diagram with the labels Request, Policy Decision
Point (Approve / Deny), ProvenanceStoreFactory, Factory,
ProvenanceService, ACL File (XML) and Provenance GT4 Container]
Slide from John Ibbotson
43
Provenance Implementation
  • The Client Side Library exposes Provenance Store
    functionality and separates the Actor from
    alternative server-side implementations:
    EU Provenance project implementation, PASOA PReServ
  • Security is being extended to allow federation
    using Globus Community Authorization Service (CAS)

Slide from John Ibbotson
44
Standardisation
45
Standardisation Options
46
Purpose of Standardisation
Application
Application
Provenance Stores
Allow for multiple applications to document their
execution. Applications may be running in
different institutions.
47
Purpose of Standardisation
Application
Provenance Store
Provenance Store
Provenance Store
Allow for multiple stores from multiple IT
providers
48
Purpose of Standardisation
Provenance Store
Provenance Store
Query Provenance of Data
Allow for multiple stores from multiple IT
providers
49
Purpose of Standardisation
Convert into standard data format
Allow for legacy, monolithic applications to
expose their contents (according to standard
schema)
50
Purpose of Standardisation
Application
Allow third parties to host provenance stores,
which are trusted by application owners but also
auditors
51
Compliance Oriented Architectures
  • Separate execution documentation from compliance
    verification
  • Allows for multiple compliance verifications
  • Allows for validation to take place across
    multiple applications, possibly run by different
    institutions (in particular, allows for
    outsourcing and subcontracting).
  • Approach is suitable for e-scientific
    peer-reviewing and business compliance
    verification

52
Standardisation Philosophy
  • Thin layer common between systems, with an extensible
    data model
  • Model can be extended for specific technologies
    (WS, Web, ...) or application domains (Bio,
    Healthcare, Desktop, ...)
  • Service interfaces

53
Proposed List of Specifications
Generic Profiles
Domain Specific Profiles
WS-Prov-DM-Sec
WS-Prov-Intro
WS-Prov-DM-Link
WS-Prov-Glo
WS-Prov-DM-Infer
WS-Prov-DM
WS-Prov-DM-DS
WS-Prov-Primer
WS-Prov-DM-Rel
WS-Prov-Rec
WS-Prov-Query
Technology Bindings
WS-Prov-SOAP
WS-Prov-WWW
54
Provenance Queries (Miles06)
55
Example Application
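[Figure: message exchange between GUI, Averager, Divider and Store]
  1. average (7, 5): GUI to Averager
  2. divide (12, 2): Averager to Divider
  3. answer (6): Divider to Averager
  4. answer (6): Averager to GUI
  5. store (6, file1): GUI to Store
Averager(in1, in2) returns (in1 + in2)/2. Averager delegates the
division operation to the service Divider.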
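
The exchange can be mimicked with three plain functions standing in for the services. This is a minimal sketch, assuming in-process calls instead of real service invocations; the `send` helper and the tuple layout of the log are hypothetical, introduced only to make the five messages visible.

```python
messages = []  # log of (number, sender, receiver, operation, arguments)


def send(sender, receiver, operation, *args):
    """Log one message; its number is its position in the exchange."""
    messages.append((len(messages) + 1, sender, receiver, operation, args))


def divider(dividend, divisor):
    result = dividend // divisor  # integer division keeps the slide's value 6
    send("Divider", "Averager", "answer", result)        # message 3
    return result


def averager(in1, in2):
    # Averager computes the sum itself and delegates the division.
    send("Averager", "Divider", "divide", in1 + in2, 2)  # message 2
    result = divider(in1 + in2, 2)
    send("Averager", "GUI", "answer", result)            # message 4
    return result


def gui():
    send("GUI", "Averager", "average", 7, 5)             # message 1
    result = averager(7, 5)
    send("GUI", "Store", "store", result, "file1")       # message 5
    return result


gui()
for message in messages:
    print(message)
```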
56
Example Application
[Figure: the example message exchange between GUI, Averager, Divider
and Store]
  • Relationships:
  • 12 in msg 2 is sum of 7, 5 in msg 1
  • 6 in msg 3 is division of 12, 2 in msg 2
  • 6 in msg 4 is copy of 6 in msg 3
  • 6 in msg 4 is average of 7, 5 in msg 1
  • 6 in msg 5 is copy of 6 in msg 4
  • Tracers:
  • are used to demarcate activities (aka sets of
    services)
  • added by Averager in call to Divider
  • returned by Divider in response
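
These relationships are enough to compute a provenance by walking backwards from a data item. The following is a minimal sketch, assuming data items are identified by (value, message number) pairs and using made-up relation names such as "sumOf"; it starts from the 6 carried by the store request (message 5) and optionally filters out one relation type to scope the result.

```python
# Each entry: (effect, relation, causes); data items are (value, message number).
RELATIONSHIPS = [
    ((12, 2), "sumOf", [(7, 1), (5, 1)]),
    ((6, 3), "divisionOf", [(12, 2), (2, 2)]),
    ((6, 4), "copyOf", [(6, 3)]),
    ((6, 4), "averageOf", [(7, 1), (5, 1)]),
    ((6, 5), "copyOf", [(6, 4)]),
]


def provenance(item, exclude=()):
    """Collect every relationship reachable backwards from `item`.

    `exclude` names relation types to leave out, which scopes the result.
    """
    found, pending, seen = [], [item], {item}
    while pending:
        current = pending.pop()
        for effect, relation, causes in RELATIONSHIPS:
            if effect != current or relation in exclude:
                continue
            found.append((effect, relation, causes))
            for cause in causes:
                if cause not in seen:
                    seen.add(cause)
                    pending.append(cause)
    return found


full = provenance((6, 5))
scoped = provenance((6, 5), exclude={"averageOf"})
print(len(full), len(scoped))
```

Excluding "averageOf" drops the high-level shortcut from the result, yet the inputs 7 and 5 are still reached through the sum and division steps.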
57
The data we want to find the provenance of
  • Identify the event where the entity is
    documented
  • In this case, the event is the receipt of a
    request to store the data in the file named file1
  • Identify the data entity within that message
  • In this case, the data of interest is the 6
    stored in file1

58
Provenance Graph
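[Figure: provenance graph. 7 and 5 (GUI to Averager) are linked by
"Sum of" to 12 (Averager to Divider); 12 (Dividend) and 2 (Divisor)
are linked by "Division of" to 6 (Divider to Averager); "Copy of"
links it to 6 (Averager to GUI), which is also "Average of" 7 and 5;
"Copy of" links that to 6 (GUI to Store).]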
59
Scoped Provenance Graph
[Figure: the provenance graph of the example application, with a
filter to exclude "Average of" relationships]
Allows us to ignore the high level structure of
the computation and to focus on the actual
operations, e.g. allows us to establish what a
given provider actually does.
60
Scoped Provenance Graph
[Figure: the provenance graph of the example application, with a
filter to exclude messages containing the tracer]
Allows us to consider a given service (and all
its inferior invocations) as a black box: a high
level account of provenance, e.g. no detail
should be provided about the internals of Averager.
Filtering out messages containing the tracer is equivalent
to hiding the internal operation of Averager.
61
Scoped Provenance Graph
[Figure: the provenance graph of the example application, with a
filter to exclude Divisor parameters]
Allows us to scope the provenance graph according
to types of data or operations, e.g. looking at
the restorations of a painting rather than its
various owners.
62
Provenance Query
63
Practically
  • Event and Data Identification
  • //ps:interactionRecord[ps:interactionKey/ps:messageSink/
    wsa:EndpointReference/wsa:Address =
    "http://www.example.com/store"]
  • The interaction record in which the receiver
    (messageSink) has address http://www.example.com/store
  • //ps:interactionPAssertion[ex:envelope/ex:store/ex:location =
    "/home/sm/data/file1"]
  • //ex:envelope/ex:store/ex:data

Event identification
Data identification
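
To make the two identification steps concrete, the sketch below runs them over a tiny hand-written XML document using Python's standard `xml.etree.ElementTree`. The namespace URIs, the `ps:store` root element and the document itself are assumptions (the slides show prefixes only). ElementTree supports only a subset of XPath, so the predicates are checked in Python rather than inside the path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the slides show the prefixes only.
NS = {
    "ps": "urn:example:ps",
    "wsa": "urn:example:wsa",
    "ex": "urn:example:ex",
}

DOCUMENT = """
<ps:store xmlns:ps="urn:example:ps" xmlns:wsa="urn:example:wsa"
          xmlns:ex="urn:example:ex">
  <ps:interactionRecord>
    <ps:interactionKey>
      <ps:messageSink>
        <wsa:EndpointReference>
          <wsa:Address>http://www.example.com/store</wsa:Address>
        </wsa:EndpointReference>
      </ps:messageSink>
    </ps:interactionKey>
    <ps:interactionPAssertion>
      <ex:envelope>
        <ex:store>
          <ex:location>/home/sm/data/file1</ex:location>
          <ex:data>6</ex:data>
        </ex:store>
      </ex:envelope>
    </ps:interactionPAssertion>
  </ps:interactionRecord>
</ps:store>
""".strip()

SINK = "ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address"
LOCATION = "ex:envelope/ex:store/ex:location"
DATA = "ex:envelope/ex:store/ex:data"


def find_data(root, sink_address, location):
    """Return the data stored at `location` by requests sent to `sink_address`."""
    results = []
    for record in root.iterfind(".//ps:interactionRecord", NS):
        # Select the interaction record whose receiver (messageSink) matches.
        if record.findtext(SINK, namespaces=NS) != sink_address:
            continue
        for assertion in record.iterfind(".//ps:interactionPAssertion", NS):
            # Within it, pick the data item stored at the given location.
            if assertion.findtext(LOCATION, namespaces=NS) == location:
                results.append(assertion.findtext(DATA, namespaces=NS))
    return results


root = ET.fromstring(DOCUMENT)
print(find_data(root, "http://www.example.com/store", "/home/sm/data/file1"))
```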
64
Practically
  • The scope of the provenance query
  • Unscoped query
  • /
  • Exclude averageOf relation
  • /pq:relationshipTarget[ps:relation !=
    "http://www.example.comaverageOf"]
  • Exclude tracer introduced by Averager
  • /pq:relationshipTarget/ps:interactionPAssertion[
    not(ex:envelope/ph:pheader/
    ph:interactionMetaData/
    ph:tracer = "process//sub/1")]

65
Provenance of Donor Diagnosis Request
[Figure: provenance graph of a donor diagnosis request. Node labels:
Data Collection Request, EHCR Request, EHCR, Data Collection Complete,
Patient (in Brain Death Notification), Test Results, Donor Data
Collection, Patient, Diagnose Request. Actor labels: Healthcare Record
Manager, Donor Data Collector, EHCRS, Brain Death Manager, Testing
Lab, User Interface, Decision Maker. Relationship labels: Was Caused
By, Is Response To, Includes Data, Is Diagnosis Request For.]
66
Conclusions
67
To Sum Up
Standardising the documentation of Business Processes
Finance, Distribution, Aerospace, Healthcare, Automobile,
Pharmaceutical
  • Compliance check
  • Rerun/Reproduce
  • Analyse

Query
Slide from John Ibbotson
68
Conclusions
  • Crucial topic for many applications
  • Full architectural specification
  • An implementation available for download
  • Methodology to make applications provenance-aware
  • www.pasoa.org
  • www.gridprovenance.org

69
Provenance Challenge
twiki.ipaw.info
70
Publications
  1. Paul Groth, Simon Miles, Weijian Fang, Sylvia C.
    Wong, Klaus-Peter Zauner, and Luc Moreau.
    Recording and Using Provenance in a Protein
    Compressibility Experiment. In Proceedings of the
    14th IEEE International Symposium on High
    Performance Distributed Computing (HPDC'05), July
    2005.
  2. Paul Groth, Michael Luck, and Luc Moreau. A
    protocol for recording provenance in
    service-oriented Grids. In Proceedings of the 8th
    International Conference on Principles of
    Distributed Systems (OPODIS'04), Grenoble,
    France, December 2004.
  3. Paul Groth, Michael Luck, and Luc Moreau.
    Formalising a protocol for recording provenance
    in Grids. In Proceedings of the UK OST e-Science
    second All Hands Meeting 2004 (AHM'04),
    Nottingham, UK, September 2004.
  4. Simon Miles, Paul Groth, Miguel Branco, and Luc
    Moreau. The requirements of recording and using
    provenance in e-Science experiments. Technical
    report, University of Southampton, 2005.
  5. Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf
    Hempel, Omer Rana, Lazslo Varga, Ulises Cortes,
    and Steven Willmott. Provenance-based Trust for
    Grid Computing --- Position Paper. In , 2003.
  6. Paul Townend, Paul Groth, and Jie Xu. A
    Provenance-Aware Weighted Fault Tolerance Scheme
    for Service-Based Applications. In Proc. of the
    8th IEEE International Symposium on
    Object-oriented Real-time distributed Computing
    (ISORC 2005), May 2005.
  7. Paul Groth, Simon Miles, Victor Tan, and Luc
    Moreau. Architecture for Provenance Systems.
    Technical report, University of Southampton,
    October 2005.

71
Questions