Towards Iterators in the Virtual Data Language - PowerPoint PPT Presentation

1
Towards Iterators in the Virtual Data Language
  • Luc Moreau
  • Electronics and Computer Science
  • University of Southampton
  • on sabbatical at UofC/ANL
  • L.Moreau@ecs.soton.ac.uk

2
Contents
  • Brief VDL Overview
  • Types to specify data set structures
  • Abstract data sets vs. physical data sets
  • Physical representation of data sets
  • Queries over data sets
  • Towards iterators for VDL2

3
Virtual Data Scenario
  • Manage workflow
  • On-demand data generation
  • Update workflow following changes
  • Explain how to derive a result, e.g. for file8
  • psearch -t 10 -i file3 file4 file5 -o file8
  • summarize -t 10 -i file6 -o file7
  • reformat -f fz -i file2 -o file3 file4 file5
  • conv -l esd -o aod -i file2 -o file6
  • simulate -t 10 -o file1 file2

4
Virtual Data: Describes analysis workflow
[Figure: workflow graph in which simulate, reformat, conv, summarize and psearch connect file1, file2, file6, file7 and file8; the requested dataset is marked]
  • The recorded virtual data recipe here is:
  • Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  • Programs: 8 < psearch, 7 < summarize, (3,4,5) <
    reformat, 6 < conv, (1,2) < simulate

5
Virtual Data: Describes analysis workflow
[Figure: workflow graph with the requested file marked]
  • To re-create file8, step 1:
  • simulate > file1, file2

6
Virtual Data: Describes analysis workflow
[Figure: workflow graph with the requested file marked]
  • To re-create file8, step 2:
  • Files 3, 4, 5 and 6 are derived from file2
  • reformat > file3, file4, file5
  • conv > file6

7
Virtual Data: Describes analysis workflow
[Figure: workflow graph with the requested file marked]
  • To re-create file8, step 3:
  • file7 depends on file6
  • summarize > file7

8
Virtual Data: Describes analysis workflow
[Figure: workflow graph reduced to psearch, simulate, summarize, file7 and file8, with the requested file marked]
  • To re-create file8, final step:
  • file8 depends on files 1, 3, 4, 5 and 7
  • psearch < file1, file3, file4, file5, file7 >
    file8
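
The steps on slides 5 to 8 amount to walking the recorded recipe backwards from the requested file. A minimal Python sketch of that walk, assuming the recipe is held as a plain dictionary; `PRODUCED_BY` and `plan` are illustrative names, not part of VDL:

```python
# Recorded recipe from the slides: each file maps to the program that
# produces it and the files that program reads.
PRODUCED_BY = {
    "file1": ("simulate", []),
    "file2": ("simulate", []),
    "file3": ("reformat", ["file2"]),
    "file4": ("reformat", ["file2"]),
    "file5": ("reformat", ["file2"]),
    "file6": ("conv", ["file2"]),
    "file7": ("summarize", ["file6"]),
    "file8": ("psearch", ["file1", "file3", "file4", "file5", "file7"]),
}


def plan(target, recipe=PRODUCED_BY):
    """Return the programs to run, in order, to re-create `target`.

    Assumption: each program name stands for a single invocation, so a
    program that produces several files is scheduled only once.
    """
    steps = []    # ordered program names
    seen = set()  # files already visited

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        program, inputs = recipe[name]
        for dep in inputs:
            visit(dep)  # inputs must exist before the program runs
        if program not in steps:
            steps.append(program)

    visit(target)
    return steps


print(plan("file8"))  # ['simulate', 'reformat', 'conv', 'summarize', 'psearch']
```

Requesting file7 instead yields a shorter plan, because reformat and psearch are not on its dependency path.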

9
VDL in One Slide
  • Transformation
  • Abstract template of program invocation
  • Similar to "function definition"
  • Derivation
  • Function call to a Transformation
  • Invocation
  • Record of a Derivation execution
  • Storage of both past and future
  • Derivation: a recipe of how data products can be
    generated
  • Provenance: a record of how data products were
    generated
  • These XML documents reside in a virtual data
    catalog (VDC), a relational database
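
The function-definition analogy can be made concrete with three small record types. This is only a sketch: the class and field names below (including `exit_code`) are hypothetical and do not reflect the actual VDC schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Transformation:
    """Abstract template of a program invocation (a "function definition")."""
    name: str
    formal_args: Tuple[str, ...]


@dataclass(frozen=True)
class Derivation:
    """A call to a Transformation with actual arguments (a recipe)."""
    name: str
    transformation: Transformation
    actual_args: Dict[str, str]

    def __post_init__(self):
        # Every formal argument must be bound by the call.
        missing = set(self.transformation.formal_args) - set(self.actual_args)
        if missing:
            raise ValueError(f"unbound arguments: {sorted(missing)}")


@dataclass(frozen=True)
class Invocation:
    """Record of one execution of a Derivation (provenance)."""
    derivation: Derivation
    exit_code: int  # hypothetical field, for illustration only


tr1 = Transformation("tr1", ("a1", "a2"))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0)
print(run.derivation.transformation.name)  # tr1
```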

10
VDL Describes Workflow via Data Dependencies
  • TR tr1(in a1, out a2) {
  •   argument stdin = ${a1};
  •   argument stdout = ${a2}; }
  • TR tr2(in a1, out a2) {
  •   argument stdin = ${a1};
  •   argument stdout = ${a2}; }
  • DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
  • DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

[Figure: file1 → x1 → file2 → x2 → file3]
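
Because x2 reads the file that x1 writes, the ordering between the two derivations never has to be stated explicitly. A hedged Python sketch of how such ordering could be inferred; the `DERIVATIONS` dictionary and `dependency_graph` helper are assumptions made for illustration, and `graphlib` requires Python 3.9 or later:

```python
from graphlib import TopologicalSorter

# Each derivation lists the logical files it reads and writes,
# mirroring DV x1 and DV x2 on the slide.
DERIVATIONS = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}


def dependency_graph(derivations):
    """Map each derivation to the set of derivations it must wait for."""
    producer = {}
    for name, files in derivations.items():
        for f in files["out"]:
            producer[f] = name
    graph = {}
    for name, files in derivations.items():
        # Files with no producer (here file1) are external inputs.
        graph[name] = {producer[f] for f in files["in"] if f in producer}
    return graph


graph = dependency_graph(DERIVATIONS)
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['x1', 'x2']
```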
11
Workflow example
  • Graph structure
  • Fan-in
  • Fan-out
  • "left" and "right" can run in parallel
  • Needs external input file
  • Located via replica catalog
  • Data file dependencies
  • Form graph structure

[Figure: workflow graph with nodes preprocess, findrange (left), findrange (right) and analyze]
12
VDL Shortcomings
  • Currently, VDL has no iterator over data sets.
  • Users have to go through an awkward process:
  • Outside VDL, select a subset of the data set
  • Generate the mapping between logical and physical files
  • Generate the workflow (DAX)
  • Run the workflow
  • This prevents true compositionality of
    transformations and automated provenance
    tracking.

13
Virtual Data Sets?
  • Can we extend the idea of Virtual Data to Virtual
    Data Sets?
  • Can we separate the abstract description of a
    data set from its physical implementation?
  • Can we define transformations in terms of data
    set abstract descriptions, and use such
    transformations on different physical
    representations?
  • Can the system take care of casting between the
    different physical representations?

14
Types to Specify Data Sets
  • Type declaration inspired by C type constructs
  • Structs allow us to refer to elements by name
  • Arrays allow us to refer to elements by index
  • Typedefs allow us to name types
  • e.g. a type Foo with fields int a, int b, Bar c, Hux d

15
QuarkNet Example
16
SDSS DR2
17
SDSS DR2
  • How to deal with encoding of file names?
  • fpObjc-100000-3-0110.fit
  • run 100000
  • column 3
  • field number 0110
  • Data sets may be given attributes and associated
    values, i.e. key-value pairs.

18
XML Schema
  • XML Schemas express shared vocabularies and
    allow machines to carry out rules made by people.
  • They provide a means for defining the structure,
    content and semantics of XML documents
  • http://www.w3.org/XML/Schema

Data Sets
19
XML Schema benefits
  • W3C Standard
  • Lots of existing tools (editors, validators,
    query languages)
  • Adopted by Web and Grid services
  • Does it mean VDL should have an XML syntax? No!
    (cf. VDLt vs. VDLx)

20
XML Schemas for Describing Data Sets
  • XML Schemas provide a good mechanism to specify
    the structure of data sets.
  • Uniform way for representing both data within
    files and sets outside files.
  • Good because in the long run this will allow us
    to express workflows that operate both on data
    sets and their file contents
  • DR2 key-value pairs in .fit file headers could
    be referred to using this mechanism.
  • It is not a requirement to express the contents
    of a file (e.g. not desirable for binary formats),
    but it is a possibility that can be used when
    convenient.

21
Physical Representation of Data Sets
  • The QuarkNet HEPSearch example comprises 4 different
    formats for a small logical data set (Excel and
    ASCII files for 2002 and 2003).
  • Transformations (written as Perl programs) expect
    data sets in a specific format.
  • DR2 is available from a local directory or an HTTP
    URL: http://das.sdss.org/DR2/data/
  • Subsets were made available to us as tarballs.

23
How to express physical representation?
  • As a first approximation, for each type, we need
    to provide:
  • The kind of physical data container used to
    represent this element, e.g. directory, URL,
    file, etc.
  • Two conversion functions:
  • Read function: given the physical representation,
    how to define the name of the abstract object and
    its attributes.
  • Write function: given the abstract
    representation, how to construct the name of the
    physical object.
  • To be complete, we need to identify the element
    name and the complex type in which it appears
    (and possibly its context).
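
One way to picture the container kind plus the read/write pair is the sketch below. It assumes the file-name layout `<element>-<run>-<column>-<field>.fit` read off the fpObjc example on slide 17; `PhysicalMapping`, `read_fp_objc` and `write_fp_objc` are hypothetical names, not the functions used in the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PhysicalMapping:
    container: str                          # e.g. "Dir", "File(BLOB)"
    read: Callable[[str], Dict[str, str]]   # physical name -> abstract name and attributes
    write: Callable[[Dict[str, str]], str]  # abstract name and attributes -> physical name


def read_fp_objc(physical_name):
    """Split a name such as fpObjc-100000-3-0110.fit into its parts."""
    stem, _extension = physical_name.rsplit(".", 1)
    element, run, column, field = stem.split("-")
    # Values stay strings so that leading zeros (field 0110) survive.
    return {"name": element, "run": run, "column": column, "field": field}


def write_fp_objc(abstract):
    """Rebuild the physical file name from the abstract description."""
    return "{name}-{run}-{column}-{field}.fit".format(**abstract)


FP_OBJC = PhysicalMapping("File(BLOB)", read_fp_objc, write_fp_objc)

abstract = FP_OBJC.read("fpObjc-100000-3-0110.fit")
print(abstract)
print(FP_OBJC.write(abstract))  # fpObjc-100000-3-0110.fit
```

Keeping the two functions inverse to each other is what lets a data set be read from one physical representation and written to another.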

24
How to express physical representation?
  • Work still in progress. Current programmatic
    representation for reading DR2.
  • Each entry is { context type, physical representation,
    element name, function to convert element name }
  • { null, "Dir", "imaging", "ConvertToSelf" },
  • { "ReRun", "Dir", "objcs", "ConvertToSelf" },
  • { "Imaging", "Dir", "run", "ConvertRun" },
  • { "Run", "Dir", "rerun", "ConvertReRun" },
  • { "Objcs", "Dir", "camcol", "ConvertCamcol" },
  • { "Camcol", "File(BLOB)", "fpBIN", "ConvertFpBIN" },
  • { "Camcol", "File(BLOB)", "fpM", "ConvertFpM" },
  • { "Camcol", "File(BLOB)", "psField", "ConvertPsField" },
  • { "Camcol", "File(BLOB)", "fpFieldStat", "ConvertFpFieldStat" },
  • { "Camcol", "File(BLOB)", "fpAtlas", "ConvertFpAtlas" },
  • { "Camcol", "File(BLOB)", "fpObjc", "ConvertFpObjc" }
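
A table of this shape can be consulted by context type and element name. The sketch below copies the entries from the slide into Python tuples; the `lookup` and `children` helpers are illustrative assumptions, not part of the implementation described in the talk.

```python
# (context type, physical representation, element name, conversion function name)
MAPPING = [
    (None, "Dir", "imaging", "ConvertToSelf"),
    ("ReRun", "Dir", "objcs", "ConvertToSelf"),
    ("Imaging", "Dir", "run", "ConvertRun"),
    ("Run", "Dir", "rerun", "ConvertReRun"),
    ("Objcs", "Dir", "camcol", "ConvertCamcol"),
    ("Camcol", "File(BLOB)", "fpBIN", "ConvertFpBIN"),
    ("Camcol", "File(BLOB)", "fpM", "ConvertFpM"),
    ("Camcol", "File(BLOB)", "psField", "ConvertPsField"),
    ("Camcol", "File(BLOB)", "fpFieldStat", "ConvertFpFieldStat"),
    ("Camcol", "File(BLOB)", "fpAtlas", "ConvertFpAtlas"),
    ("Camcol", "File(BLOB)", "fpObjc", "ConvertFpObjc"),
]


def lookup(context, element):
    """Return (physical representation, converter name) for an element."""
    for ctx, container, name, converter in MAPPING:
        if ctx == context and name == element:
            return container, converter
    raise KeyError((context, element))


def children(context):
    """List the element names that may appear inside a context type."""
    return [name for ctx, _, name, _ in MAPPING if ctx == context]


print(lookup("Camcol", "fpAtlas"))  # ('File(BLOB)', 'ConvertFpAtlas')
print(children("Run"))              # ['rerun']
```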
25
Query examples
  • Use of XPath 1.0 as the query language
  • Simple directory-like navigation of data sets
  • /DR2/imaging/run[1]/rerun[1]/objcs/camcol[3]/fpBIN[10]
  • 1st run, 1st rerun, 3rd camcol, 10th fpBIN file

26
Query examples
  • Short-cuts
  • /DR2//fpBIN
  • fpBIN files at any depth

27
Query examples
  • Use of attributes
  • /DR2/imaging/run[@number='1239']/rerun[@number='6']/objcs/camcol[1]/fpObjc[@field='110']
  • Run 1239, rerun 6, fpObjc files with field 110
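
The three query styles (positions, the `//` short-cut, attribute predicates) can be tried on a toy tree with Python's standard library. This is a sketch under stated assumptions: the XML below is a made-up stand-in for the abstract DR2 data set, and `xml.etree.ElementTree` implements only a subset of XPath 1.0, with paths written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Element and attribute names follow the slides; the contents are invented.
DR2 = ET.fromstring("""
<DR2>
  <imaging>
    <run number="1239">
      <rerun number="6">
        <objcs>
          <camcol>
            <fpObjc field="110"/>
            <fpObjc field="111"/>
            <fpBIN/>
          </camcol>
          <camcol>
            <fpBIN/>
          </camcol>
        </objcs>
      </rerun>
    </run>
  </imaging>
</DR2>
""")

# Directory-like navigation with positions.
first_camcol = DR2.findall("./imaging/run[1]/rerun[1]/objcs/camcol[1]")

# Short-cut: fpBIN elements at any depth.
all_fpbin = DR2.findall(".//fpBIN")

# Attribute predicates.
selected = DR2.findall(
    "./imaging/run[@number='1239']/rerun[@number='6']"
    "/objcs/camcol[1]/fpObjc[@field='110']"
)

print(len(first_camcol), len(all_fpbin), [e.get("field") for e in selected])
# 1 2 ['110']
```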

28
VDC Queries
  • In practice, queries over data sets should not
    only be related to the structure of the data set
    but also to metadata contained in the virtual
    data catalog.
  • XPath queries support functions, and we have
    pre-defined a function to query the catalog.

29
VDC Queries (example)
  • Get all fpAtlas files in run 1239 of DR2 such
    that the metadata attribute isUseful is set to
    yes in the VDC.
  • /DR2//run[@number='1239']//fpAtlas[vdc:metadata('isUseful')='yes']
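
The standard-library XPath subset in Python has no extension functions, so the sketch below approximates the query in two stages: a structural XPath selection, then a filter against a stand-in catalog. The `id` attribute, the `VDC` dictionary and the `vdc_metadata` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

DR2 = ET.fromstring("""
<DR2>
  <imaging>
    <run number="1239">
      <fpAtlas id="a1"/>
      <fpAtlas id="a2"/>
    </run>
    <run number="1240">
      <fpAtlas id="a3"/>
    </run>
  </imaging>
</DR2>
""")

# Stand-in for the virtual data catalog: (object id, attribute) -> value.
VDC = {
    ("a1", "isUseful"): "yes",
    ("a2", "isUseful"): "no",
    ("a3", "isUseful"): "yes",
}


def vdc_metadata(element, attribute):
    """Look up a metadata attribute of a data set member in the catalog."""
    return VDC.get((element.get("id"), attribute))


# Stage 1: structure of the data set. Stage 2: metadata held in the catalog.
candidates = DR2.findall(".//run[@number='1239']//fpAtlas")
useful = [e for e in candidates if vdc_metadata(e, "isUseful") == "yes"]

print([e.get("id") for e in useful])  # ['a1']
```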

30
Iterators in VDL2
  • Selection of a subset of a data set
  • define dataset1 with type, format
  • dataset2 = select <<xpath_expr>> in dataset1

31
Iterators in VDL2
  • Iterating over a subset of a data set
  • dataset2 = forall x in <<xpath_expr>> of dataset1
  •   call <<transformation>> x, params,
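
The select and forall constructs might behave roughly like the Python generators below. This is not VDL2 syntax or semantics, only a sketch: `select`, `forall` and `describe` are illustrative names, and the data set is again a toy XML tree.

```python
import xml.etree.ElementTree as ET
from typing import Iterator


def select(dataset: ET.Element, xpath_expr: str) -> Iterator[ET.Element]:
    """Lazily yield the members of `dataset` matched by `xpath_expr`."""
    yield from dataset.iterfind(xpath_expr)


def forall(dataset, xpath_expr, transformation, **params):
    """Apply `transformation` to every selected member, one at a time."""
    for x in select(dataset, xpath_expr):
        yield transformation(x, **params)


dataset1 = ET.fromstring(
    "<camcol><fpObjc field='110'/><fpObjc field='111'/><fpBIN/></camcol>"
)


def describe(member, prefix):
    """Toy transformation: label a member instead of running a program."""
    return f"{prefix}{member.tag}-{member.get('field')}"


dataset2 = list(forall(dataset1, "./fpObjc", describe, prefix="processed:"))
print(dataset2)  # ['processed:fpObjc-110', 'processed:fpObjc-111']
```

Using generators keeps the traversal lazy, so members are visited only as the consumer asks for them.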

32
Conclusion
  • Separating abstract data type from physical
    representation is powerful.
  • In the spirit of a semantic description.
  • Useful for casting of data sets into the
    appropriate physical representation requested by
    transformation.
  • Key ideas presented here have been implemented.

33
Future Work
  • Complete the language to specify physical
    encoding
  • Support other physical representations (Stateful
    Grid Services, Databases, etc.)
  • Specify VDL2 iterators
  • Large data sets: lazy traversal of data sets,
    checkpointing of traversal state, recovery from
    failures.

34
Acknowledgements
  • The GriPhyN team at UoC
  • Ian Foster
  • Mike Wilde
  • Jens Voeckler
  • Yong Zhao