Title: Virtual Data Provenance: Representation and Query
1 Virtual Data Provenance: Representation and Query
PASS Workshop, Harvard University
- Yong Zhao
- Department of Computer Science
- University of Chicago
- yongzh_at_cs.uchicago.edu
Michael Wilde (ANL, UChicago), Ian Foster (ANL, UChicago)
31 May 2006
2 Virtual Data System
- The GriPhyN (Grid Physics Network) Project
- Petascale Data Grid Infrastructure for Data Intensive Sciences
- Four large physics experiments
- Started as a provenance system
- Represent, query, and automate data derivation process
- A data and workflow management system for science communities
- Applied in physics, astronomy, neuroscience, bioinformatics, and scientific education
3 Motivation
- Scale and complexity of data and analysis procedures
- Enormous quantities of data, petabyte-scale
- Large, complex procedures/workflows composed from individual simple ones
- Community-wide collaboration
- Description, discovery, understanding, validation, composition, adaptation
- Reproducibility, valid-ability, audit-ability
- Usability and productivity
- Gain control over data
- Ease of use, focusing on science itself
- Throughput
4 Virtual Data Concept
- Capture and manage information about relationships among
- Data (of distributed locations and widely varying representations)
- Programs (their inputs, outputs, prerequisites, constraints)
- Computations (execution environments)
- Apply this information to, e.g.
- Discovery: data and program discovery
- Explanation: provenance (data reproduction and validation)
- Workflow management: structured paradigm for organizing, locating, specifying and requesting data
- Planning and scheduling
- Performance optimization
5 What's Virtual about it?
- Data represented by logical structures and logical file names
- Mapped to persistent storage and physical locations
- Data associated with recipes and derivation histories
- Transfer vs. computation
- Make vs. build mode
- Mapped to workflows and executed on Grid
- Procedures describe logical operations on typed inputs and outputs
- Mapped to applications/services on multiple Grid sites
- Workflows represented in logical graph structures
- Compiled into concrete execution plans
- Scheduled dynamically onto available Grid resources
- Execution recorded as invocation records
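The idea that a logical file name is either mapped to an existing physical copy or re-derived from its recipe can be illustrated with a small sketch. This is not the VDS planner: the `replica_catalog` and `recipes` tables, the `materialize` function, the example host name, and the file name `subject1.img` are all hypothetical, chosen only to show the transfer-versus-computation choice.

```python
# Hypothetical replica catalog: logical file name -> known physical locations.
replica_catalog = {
    "atlas.std.2005.img": ["gsiftp://siteA.example.org/data/atlas.std.2005.img"],
}

# Hypothetical recipe table: logical file name -> (procedure, logical inputs).
recipes = {
    "subject1.warp": ("align_warp", ["subject1.img", "atlas.std.2005.img"]),
}


def materialize(lfn):
    """Return a plan step for obtaining the dataset named by `lfn`."""
    replicas = replica_catalog.get(lfn, [])
    if replicas:
        # A physical copy exists, so transferring it is one option.
        return ("transfer", replicas[0])
    if lfn in recipes:
        # No copy is known, but a recipe is: the data can be computed.
        procedure, inputs = recipes[lfn]
        return ("compute", procedure, inputs)
    raise KeyError(f"no replica or recipe known for {lfn!r}")
```

A real planner would also weigh the cost of a transfer against the cost of recomputation; this sketch simply prefers an existing replica.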
6 Virtual Data Schema
7 Data Derivation Process
(Diagram stages: Specification, Planning, Execution)
8 Multi-Tier Storage Support
(Diagram labels: FileSys, DB Schema, DB Driver, JDBC, XMLDB, RDBMS, NXD)
9 Provenance Model
- Temporal Aspect
- Prospective Provenance
- Workflow recipes for how to produce data
- Metadata annotations about procedures and data
- Retrospective Provenance
- Invocation records of run time environments and resources used: site, host, executable, execution time, file stats ...
- Dimensional Aspect
- Virtual data relationships
- Derivation lineage
- Metadata annotations
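One way to picture the split between prospective and retrospective provenance is as two record types, one describing how data would be produced and one describing what happened when it was. The sketch below is an assumption for illustration only: the class names, field names, and sample values are hypothetical and are not taken from the actual virtual data schema.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """Prospective provenance: a recipe step saying how data would be produced."""
    procedure: str
    args: dict
    inputs: list
    outputs: list
    annotations: dict = field(default_factory=dict)  # metadata annotations


@dataclass
class Invocation:
    """Retrospective provenance: what was observed when a call actually ran."""
    call: Call
    site: str
    host: str
    executable: str
    execution_time_s: float


# Hypothetical example values.
call = Call("align_warp", {"model": "rigid"}, ["subject1.img"], ["subject1.warp"])
run = Invocation(call, site="siteA", host="node17",
                 executable="/usr/bin/align_warp", execution_time_s=412.0)
```

Because an `Invocation` points back at its `Call`, a query can move from a runtime record to the recipe that produced it, which is the link the dimensional aspect relies on.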
10 Provenance Query
- Virtual data relationships
- Primary entities in the schema: procedures, calls, args, datasets, invocations
- Annotations
- Application specific information
- Lineage graph
- The derivation history of a dataset
- Graph pattern query
- Multi-dimensional
- Modification and Composition
11 Context for Query Examples: Functional MRI Analysis
Workflow courtesy James Dobson, Dartmouth Brain Imaging Center
12 Virtual Data Relationships
- Query by procedure signature
- Show procedures that have inputs of type subjectImage and output types of warp
- Query by actual arguments
- Show align_warp calls (including all arguments), with argument model=rigid
- Query by runtime characteristics
- Show calls to procedure align_warp, and their runtimes, that ran in less than 30 minutes on non-IA64 processors.
- Combined query
- Show me the average runtime of all align_warp calls with argument model=rigid that ran in less than 30 minutes.
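The runtime and combined queries above can be sketched as joins over three tables. The table layout, column names, and sample rows below are assumptions made for illustration (using Python's built-in `sqlite3`); they are not the actual virtual data schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls (id INTEGER PRIMARY KEY, proc TEXT);
CREATE TABLE args (call_id INTEGER, name TEXT, value TEXT);
CREATE TABLE invocations (call_id INTEGER, arch TEXT, runtime_min REAL);
INSERT INTO calls VALUES (1, 'align_warp'), (2, 'align_warp'),
                         (3, 'softmean'), (4, 'align_warp');
INSERT INTO args VALUES (1, 'model', 'rigid'), (2, 'model', 'rigid'),
                        (4, 'model', 'affine');
INSERT INTO invocations VALUES (1, 'ia32', 10.0), (2, 'ia64', 20.0),
                               (3, 'ia32', 5.0), (4, 'ia32', 12.0);
""")

# Combined query: average runtime of align_warp calls with model=rigid
# that ran in less than 30 minutes.
(avg_runtime,) = db.execute("""
    SELECT AVG(i.runtime_min)
    FROM calls c
    JOIN args a ON a.call_id = c.id
    JOIN invocations i ON i.call_id = c.id
    WHERE c.proc = 'align_warp'
      AND a.name = 'model' AND a.value = 'rigid'
      AND i.runtime_min < 30
""").fetchone()

# Runtime query: align_warp calls under 30 minutes on non-IA64 processors.
non_ia64 = db.execute("""
    SELECT c.id, i.runtime_min
    FROM calls c
    JOIN invocations i ON i.call_id = c.id
    WHERE c.proc = 'align_warp' AND i.runtime_min < 30 AND i.arch <> 'ia64'
    ORDER BY c.id
""").fetchall()
```

The combined query is just the argument filter and the runtime filter joined on the call id, which is what lets the separate query kinds compose.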
13 Annotation Queries
- Find all the annotations for a specific object
- Procedures, calls, arguments, datasets
- Find objects by specific annotations
- Find all objects (of any type) annotated with predicate p of type t and value v
- Find objects of a specific type annotated with predicate p of type t and value v
- Find objects (one type or any type) annotated by same set of attribute predicates.
- Example
- List anonymized subject images for young subjects
- Find datasets of type subjectImage, annotated with privacy=anonymized and subjectType=young
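The example query above amounts to filtering annotated objects by kind, dataset type, and a set of predicate=value pairs. The sketch below assumes an in-memory annotation store; the `Obj` tuple, the `find_objects` function, and the sample objects are hypothetical, and the sketch ignores the type of each predicate for brevity.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    kind: str   # "procedure", "call", "argument", or "dataset"
    name: str
    dtype: str  # dataset type; empty for non-datasets


# Hypothetical annotation store: object -> {predicate: value}.
annotations = {
    Obj("dataset", "s1.img", "subjectImage"):
        {"privacy": "anonymized", "subjectType": "young"},
    Obj("dataset", "s2.img", "subjectImage"):
        {"privacy": "anonymized", "subjectType": "adult"},
    Obj("procedure", "align_warp", ""): {"QALevel": 5.9},
}


def find_objects(required, kind=None, dtype=None):
    """Names of objects whose annotations contain every pair in `required`."""
    return sorted(
        obj.name
        for obj, tags in annotations.items()
        if (kind is None or obj.kind == kind)
        and (dtype is None or obj.dtype == dtype)
        and all(tags.get(p) == v for p, v in required.items())
    )


young_anonymized = find_objects(
    {"privacy": "anonymized", "subjectType": "young"},
    kind="dataset", dtype="subjectImage",
)
```

Leaving `kind` and `dtype` unset gives the "objects of any type" variant of the same query.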
14 Lineage Queries
- Basic lineage graph queries refer to information that has been propagated along derivation relationships
- find datasets derived from dataset d
- find ancestor datasets to dataset d that have type t
- find datasets that were derived within 2 levels of procedure p
- Graph pattern matching (in progress)
- find datasets that are the result of workpattern wp
- find the procedure calls in workflow w whose inputs have been processed by any subgraph matching workpattern wp.
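The basic lineage queries reduce to graph traversal over derivation edges. The following is a minimal sketch under assumed data: the `derived_from` and `dataset_type` tables and all dataset names are hypothetical, and the depth limit is applied to dataset-to-dataset steps as a loose analogue of "within 2 levels".

```python
from collections import deque

# Hypothetical derivation edges: output dataset -> datasets it was derived from.
derived_from = {
    "s1.warp": ["s1.img", "atlas.img"],
    "s1.resliced": ["s1.warp"],
    "atlas.mean": ["s1.resliced"],
}
dataset_type = {
    "s1.img": "subjectImage", "atlas.img": "ImageAtlas", "s1.warp": "warp",
    "s1.resliced": "subjectImage", "atlas.mean": "ImageAtlas",
}


def ancestors(d, max_depth=None):
    """Datasets that `d` was derived from, directly or indirectly (BFS)."""
    seen, queue = set(), deque([(d, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested number of levels
        for parent in derived_from.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


def descendants(d):
    """Datasets derived from `d`: those that have `d` among their ancestors."""
    return {out for out in derived_from if d in ancestors(out)}


def ancestors_of_type(d, t):
    """Ancestor datasets of `d` that have type `t`."""
    return {a for a in ancestors(d) if dataset_type[a] == t}
```

For example, `descendants("s1.img")` answers "find datasets derived from dataset d", and `ancestors_of_type("atlas.mean", "subjectImage")` answers the typed-ancestor query.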
15 Workflow Patterns
- Match graph patterns of transformations, calls, and invocations
- Workpattern query yields set of workflows with subgraphs that match the workpattern
- Examples
- Show me all output datasets of softmean calls that were aligned with model=affine.
- I.e., where softmean was preceded in the workflow, directly or indirectly, by an align_warp call with argument model=affine
- Show me all the calls to reslicer that follow directly after softmean.
16 Work Pattern Queries
align_warp//softmean
softmean/slicer
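Reading the two patterns above in the spirit of path expressions, `a/b` can be taken to mean that a call to `b` follows a call to `a` directly, and `a//b` that it follows directly or indirectly. The sketch below implements that reading over a tiny call graph. The workflow contents (including the `reslice` step), the call ids, and the `match` function are hypothetical, and the actual workpattern language may differ.

```python
import re

# Hypothetical workflow: call id -> procedure, and call id -> direct successors.
procedure_of = {"c1": "align_warp", "c2": "reslice", "c3": "softmean", "c4": "slicer"}
successors = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": []}


def reachable(call):
    """All calls downstream of `call`, at any distance."""
    found, stack = set(), list(successors[call])
    while stack:
        nxt = stack.pop()
        if nxt not in found:
            found.add(nxt)
            stack.extend(successors[nxt])
    return found


def match(pattern):
    """Return (first, second) call pairs for 'a/b' (direct) or 'a//b' (any distance)."""
    first, sep, second = re.fullmatch(r"(\w+)(//?)(\w+)", pattern).groups()
    pairs = []
    for call, proc in procedure_of.items():
        if proc != first:
            continue
        candidates = successors[call] if sep == "/" else reachable(call)
        pairs += [(call, c) for c in sorted(candidates) if procedure_of[c] == second]
    return pairs
```

With this toy workflow, `match("align_warp//softmean")` finds the pair even though another call sits between the two, while `match("align_warp/softmean")` finds nothing because the two calls are not adjacent.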
17 Multi-Dimensional Queries
- Powerful by joining queries across multiple dimensions of the schema
- Can be used to successively filter/expand a result set, to arbitrary depth.
- Examples
- Find procedures that take in ImageAtlas and Axis, have been called with atlas.std.2005.img, and have annotation QALevel > 5.6
- Find the output dataset names (and all their metadata tags) of softmean that were align_warped with model=affine and with input metadata center=UChicago
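Successive filtering across dimensions can be pictured as a chain of set-valued filters, each consulting a different part of the schema. In this sketch the three lookup tables, the filter function names, and all sample values (including which procedures take which input types) are hypothetical.

```python
# Hypothetical per-dimension indexes over the same set of procedures.
signature = {            # procedure -> input types it accepts
    "softmean": {"ImageAtlas", "Axis"},
    "slicer": {"ImageAtlas", "Axis"},
    "align_warp": {"subjectImage"},
}
called_with = {          # procedure -> datasets it has been called with
    "softmean": {"atlas.std.2005.img"},
    "slicer": {"other.img"},
    "align_warp": {"s1.img"},
}
qa_level = {"softmean": 5.9, "slicer": 6.2, "align_warp": 4.0}  # annotation


def with_inputs(procs, *types):
    """Signature dimension: keep procedures accepting all the given types."""
    return {p for p in procs if set(types) <= signature[p]}


def called_on(procs, dataset):
    """Call dimension: keep procedures that were called with `dataset`."""
    return {p for p in procs if dataset in called_with[p]}


def annotated_above(procs, min_qa):
    """Annotation dimension: keep procedures with QALevel greater than `min_qa`."""
    return {p for p in procs if qa_level.get(p, 0.0) > min_qa}


# Each step narrows the result set produced by the previous one.
result = set(signature)
result = with_inputs(result, "ImageAtlas", "Axis")
result = called_on(result, "atlas.std.2005.img")
result = annotated_above(result, 5.6)
```

Because every filter takes and returns a set of procedures, the steps can be reordered or extended to arbitrary depth.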
18 Modification and Composition
- Change argument values in a set of calls
- Parameter sweeping
- Change procedures in a set of workflows
- For a specific workflow w, replace every occurrence of procedure p1 with procedure p2 (which has the same signature as p1)
- Edit subgraphs of a workflow, creating new workflows
- Edit metadata throughout a workflow
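Two of the edits listed above, parameter sweeping and procedure replacement, can be sketched as functions that derive new workflows and leave the original untouched. The `Call` class, the list-of-calls workflow representation, and the procedure name `softmean_v2` are hypothetical simplifications.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Call:
    procedure: str
    args: tuple  # sorted (name, value) pairs, so calls stay immutable


workflow = [
    Call("align_warp", (("model", "rigid"),)),
    Call("softmean", ()),
]


def with_arg(call, name, value):
    """Copy of `call` with argument `name` set to `value`."""
    merged = {**dict(call.args), name: value}
    return replace(call, args=tuple(sorted(merged.items())))


def sweep(calls, procedure, name, values):
    """Parameter sweep: one new workflow per value of argument `name`."""
    return [
        [with_arg(c, name, v) if c.procedure == procedure else c for c in calls]
        for v in values
    ]


def swap_procedure(calls, p1, p2):
    """Replace every occurrence of p1 with p2 (assumed to share p1's signature)."""
    return [replace(c, procedure=p2) if c.procedure == p1 else c for c in calls]


variants = sweep(workflow, "align_warp", "model", ["rigid", "affine"])
```

Producing new workflows instead of mutating the old one keeps the original recipe available, which matters if its provenance is to remain queryable.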
19 Wait, There is More
- Statistics and data mining from invocation records
- Example: ATLAS Simulation
- 1.2M runtime records, 447K datasets derived within 1.5 years
- How much compute time was delivered?

years   mon   year
------------------
.45     6     2004
20      7     2004
34      8     2004
40      9     2004
15      10    2004
15      11    2004
8.9     12    2004
------------------

- Reporting
- Anomaly analysis
- Performance prediction: Prophesy, TAMU
- Optimization
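A per-month compute-time report like the one above could be produced by summing invocation runtimes grouped by month. The sketch below uses made-up records and a hypothetical `compute_years_by_month` function; it does not reproduce the ATLAS figures.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical invocation records: (start time, runtime in seconds).
records = [
    (datetime(2004, 6, 3, 12, 0), 3600.0),
    (datetime(2004, 6, 20, 8, 30), 1800.0),
    (datetime(2004, 7, 1, 0, 0), 7200.0),
]

SECONDS_PER_YEAR = 365 * 24 * 3600


def compute_years_by_month(rows):
    """Sum runtimes per (year, month) and express each total in years."""
    totals = defaultdict(float)
    for start, runtime_s in rows:
        totals[(start.year, start.month)] += runtime_s
    return {key: secs / SECONDS_PER_YEAR for key, secs in sorted(totals.items())}


report = compute_years_by_month(records)
```

The same grouping over site or host instead of month would support the reporting and anomaly-analysis uses listed above.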
20 Integration into Applications
- Command Line Tools
- Java APIs
- Virtual Data Portal
- Integrated, interactive problem solving environment
- User management
- Job submission
- Workflow construction, visualization, execution
- Grid resource management
- Web Service Interfaces
21 Portal Architecture
22 QuarkNet Cosmic Ray e-Lab
- Using Grid virtual data tools and methods to transform and enrich science learning and education
- It's an experiment to give students the means to
- discover and apply datasets, algorithms, and data analysis methods
- collaborate by developing new ones and sharing results and observations
- learn data analysis methods that will ready and excite them for a scientific career
- Educational researchers evaluate the effectiveness of such an endeavor
Collaboration with Marge Bardeen, Tom Jordan, Liz Quigg, Eric Gilbert, Paul Nepywoda, Fermilab
http://quarknet.uchicago.edu/elab/cosmic
23 Cosmic Ray Shower Study
24 The Big Picture: Virtual Data Grid
25 Summary
- An integrated schema for provenance representation
- Powerful query of multiple dimensions of provenance information
- Graph pattern matching and editing to exploit deeper knowledge in lineage information
- Query examples from scientific applications
26 Remaining Challenging Issues
- Provenance of metadata annotation
- Provenance of workflow composition
- Granularity of provenance capturing
- Provenance lifetime management
- Lease based
- Pattern discovery from user/group behavior
- Instance versioning for multiple re-runs
- Smart re-runs
- data associated with signature-based hashes
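The last item, associating data with signature-based hashes to enable smart re-runs, could work roughly as follows. This is a sketch of one possible approach, not a description of VDS: the `signature` and `run_or_reuse` functions, the `completed` cache, and the choice of SHA-256 over a JSON payload are all assumptions.

```python
import hashlib
import json


def signature(procedure, args, input_hashes):
    """Hash of what is assumed to determine a call's output."""
    payload = json.dumps(
        {"procedure": procedure, "args": args, "inputs": list(input_hashes)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


completed = {}  # signature -> output recorded by an earlier run


def run_or_reuse(procedure, args, input_hashes, execute):
    """Run `execute` only if no earlier run has the same signature.

    Returns (output, executed), where `executed` is False on reuse.
    """
    key = signature(procedure, args, input_hashes)
    if key in completed:
        return completed[key], False
    completed[key] = execute()
    return completed[key], True
```

A second request with the same procedure, arguments, and input hashes returns the recorded output without executing anything, while any change to an argument or input produces a different signature and triggers a fresh run.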
27 For More Information
- GriPhyN
- http://www.griphyn.org
- VDS
- http://www.griphyn.org/vds
- QuarkNet e-Lab
- http://quarknet.cs.uchicago.edu/elab/cosmic
- Publications
- http://www.cs.uchicago.edu/yongzh