Virtual Data Provenance: Representation and Query

1
Virtual Data Provenance: Representation and Query
PASS Workshop, Harvard University
  • Yong Zhao
  • Department of Computer Science
  • University of Chicago
  • yongzh_at_cs.uchicago.edu

Michael Wilde (ANL, UChicago), Ian Foster (ANL,
UChicago)
31 May 2006
2
Virtual Data System
  • The GriPhyN (Grid Physics Network) Project
  • Petascale Data Grid Infrastructure for Data
    Intensive Sciences
  • Four large physics experiments
  • Started as a provenance system
  • Represent, query, and automate data derivation
    process
  • A data and workflow management system for science
    communities
  • Applied in physics, astronomy, neuroscience,
    bioinformatics, and scientific education

3
Motivation
  • Scale and complexity of data and analysis
    procedures
  • Enormous quantities of data, petabyte-scale
  • Large, complex procedures/workflows composed from
    individual simple ones
  • Community-wide Collaboration
  • Description, discovery, understanding,
    validation, composition, adaptation
  • Reproducibility, validatability, auditability
  • Usability and productivity
  • Gain control over data
  • Ease of use, focusing on science itself
  • Throughput

4
Virtual Data Concept
  • Capture and manage information about
    relationships among
  • Data (of distributed locations and widely varying
    representations)
  • Programs (their inputs, outputs, prerequisites,
    constraints)
  • Computations (execution environments)
  • Apply this information to, e.g.
  • Discovery: data and program discovery
  • Explanation: provenance (data reproduction and
    validation)
  • Workflow management: structured paradigm for
    organizing, locating, specifying and requesting
    data
  • Planning and scheduling
  • Performance optimization

5
What's Virtual about It?
  • Data represented by logical structures and
    logical file names
  • Mapped to persistent storage and physical
    locations
  • Data associated with recipes and derivation
    histories
  • Transfer vs. computation
  • Make vs. build mode
  • Mapped to workflows and executed on Grid
  • Procedures describe logical operations on typed
    inputs and outputs
  • Mapped to applications/services on multiple Grid
    sites
  • Workflows represented in logical graph structures
  • Compiled into concrete execution plans
  • Scheduled dynamically onto available Grid
    resources
  • Execution recorded as invocation records

6
Virtual Data Schema
7
Data Derivation Process
(Diagram stages: Specification, Planning, Execution)
8
Multi-Tier Storage Support
(Diagram labels: FileSys, DB Schema, DB Driver, JDBC, XMLDB, RDBMS, NXD)
9
Provenance Model
  • Temporal Aspect
  • Prospective Provenance
  • Workflow recipes for how to produce data
  • Metadata annotations about procedures and data
  • Retrospective Provenance
  • Invocation records of runtime environments and
    resources used: site, host, executable, execution
    time, file stats ...
  • Dimensional Aspect
  • Virtual data relationships
  • Derivation lineage
  • Metadata annotations

10
Provenance Query
  • Virtual data relationships
  • Primary entities in the schema: procedures,
    calls, args, datasets, invocations
  • Annotations
  • Application specific information
  • Lineage graph
  • The derivation history of a dataset
  • Graph pattern query
  • Multi-dimensional
  • Modification and Composition

11
Context for Query Examples: Functional MRI Analysis
Workflow courtesy James Dobson, Dartmouth Brain
Imaging Center
12
Virtual Data Relationships
  • Query by procedure signature
  • Show procedures that have inputs of type
    subjectImage and output types of warp
  • Query by actual arguments
  • Show align_warp calls (including all arguments),
    with argument model=rigid
  • Query by runtime characteristics
  • Show calls to procedure align_warp, and their
    runtimes, that ran in less than 30 minutes on
    non-IA64 processors.
  • Combined query
  • Show me the average runtime of all align_warp
    calls with argument model=rigid that ran in less
    than 30 minutes.
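
The runtime and combined queries above can be sketched against an in-memory list of invocation records. This is only an illustration: the record fields (`procedure`, `args`, `runtime_min`, `arch`) and the sample values are hypothetical stand-ins, not the actual VDS schema or its query language.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Invocation:
    """Hypothetical invocation record; field names are illustrative only."""
    procedure: str
    args: dict = field(default_factory=dict)
    runtime_min: float = 0.0
    arch: str = "ia32"


# Made-up sample data, not results from any real run.
invocations = [
    Invocation("align_warp", {"model": "rigid"}, 12.0, "ia32"),
    Invocation("align_warp", {"model": "rigid"}, 28.0, "ia64"),
    Invocation("align_warp", {"model": "affine"}, 45.0, "ia32"),
    Invocation("softmean", {}, 5.0, "ia32"),
]


def fast_non_ia64(records, procedure, limit_min=30.0):
    """Calls to `procedure` that ran under `limit_min` on non-IA64 hosts."""
    return [
        r for r in records
        if r.procedure == procedure
        and r.runtime_min < limit_min
        and r.arch != "ia64"
    ]


def average_runtime(records, procedure, arg_name, arg_value, limit_min=30.0):
    """Average runtime of matching calls, or None if nothing matches."""
    runtimes = [
        r.runtime_min for r in records
        if r.procedure == procedure
        and r.args.get(arg_name) == arg_value
        and r.runtime_min < limit_min
    ]
    return mean(runtimes) if runtimes else None
```

In a database-backed catalog the same filters would more likely be expressed as joins across the call, argument, and invocation tables; the list comprehension only shows the shape of the predicate.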

13
Annotation Queries
  • Find all the annotations for a specific object
  • Procedures, calls, arguments, datasets
  • Find objects by specific annotations
  • Find all objects (of any type) annotated with
    predicate p of type t and value v
  • Find objects of a specific type annotated with
    predicate p of type t and value v
  • Find objects (one type or any type) annotated by
    same set of attribute predicates.
  • Example
  • List anonymized subject images for young
    subjects
  • Find datasets of type subjectImage, annotated
    with privacy=anonymized and subjectType=young
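
The example above can be sketched with annotations stored as (object, predicate, value) triples. The dataset names and the triple layout are assumptions made for illustration, and the sketch ignores the annotation value type t mentioned in the bullets.

```python
# Hypothetical catalog: dataset name -> dataset type.
datasets = {
    "img1": "subjectImage",
    "img2": "subjectImage",
    "warp1": "warp",
}

# Hypothetical annotation triples: (object, predicate, value).
annotations = [
    ("img1", "privacy", "anonymized"),
    ("img1", "subjectType", "young"),
    ("img2", "privacy", "anonymized"),
    ("img2", "subjectType", "old"),
    ("warp1", "privacy", "anonymized"),
]


def annotations_for(obj):
    """All annotations attached to one object, as a predicate -> value dict."""
    return {p: v for o, p, v in annotations if o == obj}


def find_datasets(dataset_type, **required):
    """Datasets of the given type whose annotations include every required pair."""
    matches = []
    for name, dtype in datasets.items():
        if dtype != dataset_type:
            continue
        tags = annotations_for(name)
        if all(tags.get(p) == v for p, v in required.items()):
            matches.append(name)
    return sorted(matches)
```

For instance, `find_datasets("subjectImage", privacy="anonymized", subjectType="young")` selects only the datasets that carry both annotations.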

14
Lineage Queries
  • Basic lineage graph queries refer to information
    that has been propagated along derivation
    relationships
  • find datasets derived from dataset d
  • find ancestor datasets to dataset d that have
    type t
  • find datasets that were derived within 2 levels
    of procedure p
  • Graph pattern matching (in progress)
  • find datasets that are the result of workpattern
    wp
  • find the procedure calls in workflow w whose
    inputs have been processed by any subgraph
    matching workpattern wp.
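
The basic lineage queries above amount to graph traversal over derivation relationships. A minimal sketch, assuming the lineage is available as a mapping from each derived dataset to its direct inputs (the dataset names and types below are hypothetical):

```python
# Hypothetical derivation edges: derived dataset -> direct input datasets.
derived_from = {
    "warp1": ["anatomy1", "reference"],
    "resliced1": ["warp1"],
    "atlas": ["resliced1"],
}

# Hypothetical dataset types.
dataset_type = {
    "anatomy1": "subjectImage",
    "reference": "subjectImage",
    "warp1": "warp",
    "resliced1": "reslicedImage",
    "atlas": "atlas",
}


def ancestors(d):
    """All datasets that d was derived from, directly or indirectly."""
    seen = set()
    stack = list(derived_from.get(d, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(derived_from.get(current, []))
    return seen


def descendants(d):
    """All datasets derived from d, directly or indirectly."""
    return {out for out in derived_from if d in ancestors(out)}


def ancestors_of_type(d, t):
    """Ancestor datasets of d that have type t."""
    return {a for a in ancestors(d) if dataset_type.get(a) == t}
```

Recomputing ancestors for every candidate in `descendants` is quadratic; it keeps the sketch short, and a reverse edge index would be the obvious refinement for a large catalog.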

15
Workflow Patterns
  • Match graph patterns of transformations, calls,
    and invocations
  • Workpattern query yields set of workflows with
    subgraphs that match the workpattern
  • Examples
  • Show me all output datasets of softmean calls
    that were aligned with model=affine.
  • I.e., where softmean was preceded in the
    workflow, directly or indirectly, by an
    align_warp call with argument model=affine
  • Show me all the calls to slicer that follow
    directly after softmean.

16
Work Pattern Queries
align_warp//softmean
softmean/slicer
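
One way to read the two patterns above, based on the examples on the previous slide, is that `/` means "followed directly by" and `//` means "followed directly or indirectly by". The sketch below implements that reading for two-step patterns only; the call identifiers and the graph layout are hypothetical, and this is not the VDS pattern-matching implementation.

```python
# Hypothetical workflow: call id -> procedure name.
procedure_of = {
    "c1": "align_warp",
    "c2": "reslice",
    "c3": "softmean",
    "c4": "slicer",
}

# Hypothetical data-flow edges: call id -> calls that directly consume its output.
next_calls = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": []}


def reachable(call):
    """Calls that follow `call` directly or indirectly."""
    seen = set()
    stack = list(next_calls.get(call, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(next_calls.get(current, []))
    return seen


def match(pattern):
    """Return sorted (first_call, second_call) pairs matching 'a/b' or 'a//b'."""
    if "//" in pattern:
        first, second = pattern.split("//")
        followers = reachable
    else:
        first, second = pattern.split("/")

        def followers(call):
            return set(next_calls.get(call, []))

    return sorted(
        (a, b)
        for a, name in procedure_of.items()
        if name == first
        for b in followers(a)
        if procedure_of[b] == second
    )
```

With this sample graph, `align_warp//softmean` matches because reslice sits between the two calls, while `align_warp/softmean` matches nothing.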
17
Multi-Dimensional Queries
  • Powerful by joining queries across multiple
    dimensions of the schema
  • Can be used to successively filter/expand a
    result set, to arbitrary depth.
  • Examples
  • Find procedures that take in ImageAtlas and
    Axis, have been called with atlas.std.2005.img,
    and have annotation QALevel > 5.6
  • Find the output dataset names (and all their
    metadata tags) of softmean that were align_warped
    with model=affine and with input metadata
    center=UChicago

18
Modification and Composition
  • Change argument values in a set of calls
  • Parameter sweeping
  • Change procedures in a set of workflows
  • For a specific workflow w, replace every
    occurrence of procedure p1 with procedure p2
    (which has the same signature as p1)
  • Edit subgraphs of a workflow, creating new
    workflows
  • Edit metadata throughout a workflow
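
Two of the edits above, procedure replacement and parameter sweeping, can be sketched on immutable call records so that every edit yields a new workflow rather than changing the original. All class names, type names, and the `softmean_v2` procedure are hypothetical and serve only to illustrate the signature check.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Procedure:
    name: str
    inputs: tuple   # input type names
    outputs: tuple  # output type names


@dataclass(frozen=True)
class Call:
    procedure: Procedure
    args: tuple  # (name, value) pairs


def same_signature(p1, p2):
    """Two procedures are interchangeable here if their typed inputs and outputs agree."""
    return p1.inputs == p2.inputs and p1.outputs == p2.outputs


def replace_procedure(workflow, p1, p2):
    """Return a new workflow with every call to p1 redirected to p2."""
    if not same_signature(p1, p2):
        raise ValueError("p1 and p2 must have the same signature")
    return [replace(c, procedure=p2) if c.procedure == p1 else c for c in workflow]


def sweep(call, arg_name, values):
    """Return one copy of `call` per value, with `arg_name` set to that value."""
    return [
        replace(call, args=tuple((n, v if n == arg_name else old) for n, old in call.args))
        for v in values
    ]


# Hypothetical procedures and workflow.
align = Procedure("align_warp", ("subjectImage",), ("warp",))
softmean = Procedure("softmean", ("image",), ("atlas",))
softmean_v2 = Procedure("softmean_v2", ("image",), ("atlas",))
workflow = [Call(align, (("model", "rigid"),)), Call(softmean, ())]

edited = replace_procedure(workflow, softmean, softmean_v2)
swept = sweep(workflow[0], "model", ["rigid", "affine"])
```

Keeping the records frozen means the original workflow stays available for comparison with the edited one.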

19
Wait, There is More
  • Statistics and data mining from invocation
    records
  • Example: ATLAS Simulation
  • 1.2M runtime records, 447K datasets derived
    within 1.5 years
  • How much compute time was delivered?

    Years   Month   Year
    -----   -----   ----
    0.45    6       2004
    20      7       2004
    34      8       2004
    40      9       2004
    15      10      2004
    15      11      2004
    8.9     12      2004

  • Reporting
  • Anomaly analysis
  • Performance prediction (Prophesy, TAMU)
  • Optimization
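
A per-month summary of delivered compute time can be produced by grouping invocation records by year and month. The sketch below assumes each record is a (start time, CPU seconds) pair, which is a hypothetical layout, and the sample values are synthetic rather than taken from the ATLAS records.

```python
from collections import defaultdict
from datetime import datetime

SECONDS_PER_YEAR = 365 * 24 * 3600

# Synthetic records: (start time, CPU seconds consumed).
records = [
    (datetime(2004, 6, 3), 7_884_000.0),
    (datetime(2004, 6, 20), 7_884_000.0),
    (datetime(2004, 7, 1), 31_536_000.0),
]


def compute_years_by_month(rows):
    """Sum CPU time per (year, month) and convert seconds to years."""
    totals = defaultdict(float)
    for started, cpu_seconds in rows:
        totals[(started.year, started.month)] += cpu_seconds
    return {key: seconds / SECONDS_PER_YEAR for key, seconds in sorted(totals.items())}


summary = compute_years_by_month(records)
for (year, month), years in summary.items():
    print(f"{years:6.2f} {month:3d} {year}")
```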

20
Integration into Applications
  • Command Line Tools
  • Java APIs
  • Virtual Data Portal
  • Integrated, interactive problem solving
    environment
  • User management
  • Job submission
  • Workflow construction, visualization, execution
  • Grid resource management
  • Web Service Interfaces

21
Portal Architecture
22
QuarkNet Cosmic Ray e-Lab
  • Using Grid virtual data tools and methods to
    transform and enrich science learning and
    education
  • It's an experiment to give students the means to
  • discover and apply datasets, algorithms, and data
    analysis methods
  • collaborate by developing new ones and sharing
    results and observations
  • learn data analysis methods that will ready and
    excite them for a scientific career
  • Educational researchers evaluate the
    effectiveness of such an endeavor

Collaboration with Marge Bardeen, Tom Jordan, Liz
Quigg, Eric Gilbert, Paul Nepywoda, Fermilab
http://quarknet.uchicago.edu/elab/cosmic
23
Cosmic Ray Shower Study
24
The Big Picture: Virtual Data Grid
25
Summary
  • An integrated schema for provenance
    representation
  • Powerful query of multiple dimensions of
    provenance information
  • Graph pattern matching and editing to exploit
    deeper knowledge in lineage information
  • Query examples from scientific applications

26
Remaining Challenging Issues
  • Provenance of metadata annotation
  • Provenance of workflow composition
  • Granularity of provenance capturing
  • Provenance lifetime management
  • Lease based
  • Pattern discovery from user/group behavior
  • Instance versioning for multiple re-runs
  • Smart re-runs
  • data associated with signature-based hashes
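
One possible reading of "smart re-runs" with signature-based hashes is that a derivation is skipped when an identical one (same procedure, arguments, and input signatures) has already produced a result. The sketch below illustrates that reading only; the slide does not specify the hashing scheme, so the `signature` function, the `DerivationCache` class, and the choice of SHA-256 over a JSON encoding are all assumptions.

```python
import hashlib
import json


def signature(procedure, args, input_signatures):
    """Hash of everything that determines a derivation's result (assumed scheme)."""
    payload = json.dumps(
        {"procedure": procedure, "args": args, "inputs": sorted(input_signatures)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DerivationCache:
    """Runs a derivation only if no result with the same signature exists."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def derive(self, procedure, args, input_signatures, run):
        key = signature(procedure, args, input_signatures)
        if key not in self._results:
            self._results[key] = run()
            self.executions += 1
        return key, self._results[key]


cache = DerivationCache()
first_key, first = cache.derive("align_warp", {"model": "rigid"}, ["abc"], lambda: "warp1")
second_key, second = cache.derive("align_warp", {"model": "rigid"}, ["abc"], lambda: "warp1-again")
third_key, third = cache.derive("align_warp", {"model": "affine"}, ["abc"], lambda: "warp2")
```

The second call reuses the stored result because its signature matches the first; changing the argument produces a new signature and a new execution.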

27
For More Information
  • GriPhyN
  • http://www.griphyn.org
  • VDS
  • http://www.griphyn.org/vds
  • QuarkNet e-Lab
  • http://quarknet.cs.uchicago.edu/elab/cosmic
  • Publications
  • http://www.cs.uchicago.edu/yongzh