An SQL API for Object Oriented Perl - PowerPoint PPT Presentation

1 / 29
About This Presentation
Title:

An SQL API for Object Oriented Perl

Description:

The BaBar SQL API. User view: Perl classes and command-line tool. Behind the scenes. Table schema configuration classes. Summary of features ... – PowerPoint PPT presentation

Number of Views:43
Avg rating:3.0/5.0
Slides: 30
Provided by: TimA96
Category:
Tags: api | sql | object | oriented | perl

less

Transcript and Presenter's Notes

Title: An SQL API for Object Oriented Perl


1
An SQL API for Object Oriented Perl
  • Tim Adye
  • BaBar Collaboration
  • Particle Physics Department
  • eScience Centre informal seminar
  • 12th November 2004

2
Talk Plan
  • Summary of BaBar and its bookkeeping system
  • large-scale distributed metadata catalogue in
    Gridspeak
  • The problem
  • Why not just write SQL?
  • The BaBar SQL API
  • User view Perl classes and command-line tool
  • Behind the scenes
  • Table schema configuration classes
  • Summary of features
  • Could this be generalised to other applications?
  • Possible improvements
  • Comparison with other DBIx packages
  • Summary and references

3
Feedback and Discussion
  • Would this be useful outside BaBar?
  • Would it be worth making a public release?
  • eg. on CPAN
  • Does it need any improvements?
  • New features
  • Make it compatible with some other standards
  • eg. sit on top of another abstraction like
    DBIxTable
  • A better name!
  • or have I just reinvented the wheel?
  • If so, maybe I should import some classic wheel
    technology into BaBar, rather than try to export
    my own square wheel idea to the rest of the
    world

4
The BaBar Collaboration
10 Countries 77 Institutions 593
Physicists
5
The BaBar Detector at SLAC
Linear Accelerator
26th May 1999 first events recorded by BaBar
2.2 km PEP-II ring
BaBar
6
The BaBar Bookkeeping
  • New bookkeeping system created in parallel with
    new data model
  • Switch from OO database (Objectivity) to flat
    files (ROOT)
  • Much simpler to manage, but no longer have the
    top-level Objectivity catalogue
  • Actually the catalogue didnt really help us, so
    random set of bookkeeping tools had been build up
  • The main aim of the project was to create a
    coherent bookkeeping system to replace these ad
    hoc tools
  • Successfully deployed Feb 2004. In use ever
    since.
  • Started development in Jan 2003
  • beta test in Autumn

7
Bookkeeping Concepts
  • Currently BaBar analysis data consists of
  • 43G events (event results from one ee-
    collision)
  • in 335k files
  • total size of 186 TB
  • 156M rows in largest table
  • Expect all to grow geometrically over the next
    few years
  • Data grouped in an overlapping hierarchy
  • many-to-many relations between tables, eg.
  • 1.7M runs unit of datataking
  • 219k collections unit of access to event store
  • 335k files each collection stored in a few
    physical files
  • 18k datasets groups of data that analysis users
    want
  • Need to select runs, collections, etc based on
    attributes from any of these tables

8
(No Transcript)
9
Features of the BaBar Bookkeeping
  • Decoupled from event store
  • Job configuration written to site-independent tcl
    file
  • Worker nodes dont need access to database server
  • Mirrored to other sites
  • From SLAC to 4 Tier A, 15 Tier C, user
    laptops
  • Supports Oracle and MySQL (mirror Oracle?MySQL)
  • Remote access
  • Used for mirroring and for sites who dont want
    local db
  • Connection parameter/key distribution system
  • Eg. BookkeepingCommand -siteslac or -siteral
    transparently connects to correct Oracle server
    at SLAC or MySQL at RAL.
  • Allows fallback servers if there are problems

10
Other Features of the Bookkeeping
  • Import of data to each site managed by local
    bookkeeping system
  • Each site knows what it has and what it wants
  • New data imported automatically in cron job
  • Task management system
  • Controls job submission, checking, resubmission
  • Currently just for production jobs, but soon
    for user analysis jobs too
  • Looking at interfacing this to LCG job submission
  • Already use LCG for UK Monte Carlo production

11
User Access
  • Users need to query database to find out what
    data to process
  • May also need other information
  • eg. luminosity, run numbers, file sizes
  • Mostly select by dataset, but may need to limit
    further
  • eg. only data available locally, taken at peak
    energy, excluding some problem datataking period
  • Cannot expect users to know which tables to use,
    how to join them, or even the SQL syntax
  • Even worse if the schema change
  • Cannot expect developers to code for all
    combinations of queries with all possible
    selections
  • Previous ad hoc tools (some mine!) tried to do
    this and it was a nightmare, even for a simpler
    table structure

12
BaBar SQL API user view
  • Each column that users might want to query or
    select on is given a unique logical name
    regardless of which table it lives in
  • These names are used to specify query values
  • query-gtaddValues('collection', 'gbytes')
  • and selections
  • query-gtaddSelector('dataset', 'Dilepton-')
  • query-gtaddSelector('run', '10000-19999')
  • Different types of data allow for different
    selection syntax, eg. wildcards for names, or
    ranges for run numbers.
  • Can also use SQL expressions (in terms of logical
    names)
  • query-gtaddValues('SUM(lumi)/1000')
  • and sorting, row limits, etc

Each of these happens to be in a different table
13
SQL API returning results
  • Thats enough to generate a valid SQL SELECT
    query. To return the results-
  • my sta query-gtexecute()
  • while (my row sta-gtfetch())
  • print row-gtgbytes(), row-gtcollection(), "\n"
  • The query object collects the user requests
  • query-gtexecute() returns a statement accessor
    (like a DBI statement handle).
  • sta iterates over row objects, each of which has
    accessors for each query value, gbytes and
    collection.
  • Thats all there is to it!
  • After the usual DBI connect, and query object
    instantiation (see later), these statements form
    a working program

14
Command-Line Tools
  • Standard BaBar tools use this API to create job
    configuration, create datasets, calculate
    luminosities, etc.
  • Standard tasks, but optionally allowing
    additional selections
  • Also provide an expert tool that allows access
    to full API functionality from the command line
  • This has proved very popular, with many
    non-experts making their own unique queries

15
Examples
  • BbkUser --datasetA0-Run4-OnPeak-R14 \
    --is_local1 --file_status0 \ dse_lumi
    events gbytes file
  • DSE_LUMI EVENTS GBYTES FILE

  • 1250.3 526115 1.6 /store/PRskims/R14/14.4.0d/
    A0/02/A0_0239.01.root
  • 1250.3 526115 0.8 /store/PRskims/R14/14.4.0d/
    A0/02/A0_0239.02HBCA.root
  • 1348.4 576239 1.6 /store/PRskims/R14/14.4.0d/
    A0/02/A0_0240.01.root
  • 1348.4 576239 1.0 /store/PRskims/R14/14.4.0d/
    A0/02/A0_0240.02HBCA.root
  • ...
  • 156 rows returned
  • BbkUser -collection-filecoll.lis \
  • tot_gbytes collection

16
What happens behind your back
  • The SQL API
  • translates the logical names to table columns
  • selects the required tables and joins
  • including otherwise unused tables required for
    the joins
  • generates and executes a valid SQL SELECT
    statement
  • creates a statement accessor object
  • dynamically generates a class for the row objects
    with accessors for each query value

17
Our Example
BbkUser --datasetA0-Run4-OnPeak-R14
--is_local1 --file_status0 dse_lumi
events gbytes file
  • That first BbkUser command involved 5 tables
  • including one that provides the join between
    dataset and collection tables
  • SELECT dse.lumi_sum AS "dse_lumi",
    dse.output_nev AS "events",
  • file.bytes, dse.name AS
    "collection",
  • file.suffix AS "file_suffix", ds.id AS
    "ds_id",
  • dse.id AS "dse_id", dtd.id AS
    "dtd_id",
  • dtd.link_status
  • FROM bbk_dataset ds, bbk_dsentities dse,
    data_files dfile,
  • bbk_files file, bbk_dstodse dtd
  • WHERE ds.iddtd.ds_id
  • AND dtd.dse_iddse.id
  • AND dse.idfile.dse_id
  • AND file.iddfile.file_id
  • AND ds.name'A0-Run4-OnPeak-R14'
  • AND dse.is_local'1'
  • AND dfile.status'0'

The SQL API can even pretty-print it like this
for you ? (Whats shown here is somewhat
abbreviated actual command includes full
database and table names in case of ambiguities)
18
(No Transcript)
19
Table schema configuration classes
  • Mapping between logical names and table columns
    is defined in the configuration classes.
  • One class per table
  • Can also define special properties of each column
    (eg. whether to allow ranges (100-199) for
    selection).
  • Possible joins between tables defined here too
  • Use logical column names for join conditions, so
    one table class does not need to know about
    column names in other classes.
  • In most cases its just a matter of listing
    logical vs. column names
  • with a little Perl syntactic sugar
  • Inheritence of config classes expresses
    commonalities
  • eg. common id and created, and modified columns

20
Example Table Configuration
  • sub table return 'bbk_files'
  • sub tableConfig
  • return
  • alias gt 'file',
  • columns gt
  • bytes gt 'bytes,
  • uuid gt 'uuid',
  • checksum gt 'checksum',
  • file_suffix gt 'suffix',
  • nfiles gt 'COUNT(DISTINCT file_id)',
  • gbytes gt '(bytes/1073741824)',
  • tot_gbytes gt 'SUM(bytes)/1073741824',
  • file_dse_id gt column gt 'dse_id',
    selectorType gt 'range' ,
  • file gt valueAction gt 'addLfnValue',
    selectorAction gt 'lfnSelector' ,
  • ,
  • joins gt
  • dse_id gt 'file_dse_id',
  • file_id gt 'dfile_file_id',

21
Putting it all together
  • Configuration classes must be registered with
    query object
  • my query new BbkSqlSelect
  • query-gtaddModules(new MyTableClass(dbi))
  • but of course it is usually simpler to provide a
    query object pre-registered with all the table
    configs as part of a specific API.

22
Overriding and Synthetic Columns
  • A crutial advantage of this system is that it
    allows us to override the default behaviour
  • Allows us to hide complexities from user
  • Make even complex schema changes transparent to
    users
  • A logical name can refer to
  • ordinary database column name
  • SQL expression (in terms of database columns, or
    other logical names)
  • Perl method to pre- or post-process selection or
    query value
  • synthetic query value or selection
  • can return calculated value or alter behaviour
  • Global post-processing
  • Can be triggered by value, selection, or table
    inclusion
  • Allows global filtering of returned rows

23
What happens behind your back 2
  • We already used some of these features without
    noticing!
  • Dataset names can be found in the bbk_dataset or
    the bbk_aliases table
  • Requires a check and translation using the alias
    table
  • Datasets can evolve with time, with collection
    being added or removed
  • Need to query dataset for any time in the past,
    or use tagged dataset alias (like a CVS tag)
  • Implemented by automatically including date
    selection in query, and post-processing returned
    results to remove deleted collections
  • File names are made from a collection name a
    suffix
  • query-gtaddSelector('file') splits the file name
    for the query and the row-gtfile() accessor
    rejoins them

24
Features
  • Supports Oracle 8 and MySQL 3.23
  • Most queries that can be expressed in both these
    dialects can be expressed by users via the API
    without breaking the paradigm of a flat namespace
  • aggregatation and grouping
  • sorting and distinct
  • MySQLs LIMIT emulated in Oracle
  • inner and outer joins (generates Oracle or MySQL
    syntax)
  • Does not support UNION or subqueries
  • Could be added, but not in MySQL 3.23
  • Convenience features
  • automatic Getopt specification
  • query results display formatting and summary
    table generation
  • Configuration class summary table generation

25
Limitations
  • Assumes tables can be joined in a unique way
  • ie. the joins form an acyclic graph
  • can still select different joins with explicit
    switches
  • Each column must have its own unique logical name
  • This is usually a good thing
  • but if the same data is held in different
    columns, it would be more efficient to
    automatically select from tables that are already
    included

26
A public version
  • Current version has a few BaBar-specific pieces
  • BaBar Connection manager can use DBI directly
  • BaBar Options manager can use Getopt directly
  • BaBar base objects borrow required methods
  • BaBar table formatting class remove
    functionality or publish this too
  • otherwise just uses standard Perl modules
  • but with different table configs could be used
    elsewhere
  • Already do this in BaBar used for QA and TM
    databases
  • Maybe Im making some other assumptions that are
    true of our database and requirements, but not
    more generally so. I cant think of any.
  • Needs a better name!
  • This is really an SQL API creator

27
Possible improvements
  • Tidy up code!
  • User and config APIs are OK, but in between its
    pretty ugly
  • Separate functionality that can be used on its
    own
  • Already true of the DBI statement accessor class
  • More SQL dialects PostgreSQL, MS SQL?
  • New SQL syntax subqueries, UNIONs,
  • INSERT, UPDATE, etc
  • these dont need joins, so hand-coding not such a
    problem
  • Automatic selection of different join
    possibilities
  • Automatic generation of default table classes
    from SQL schema
  • Could use The SQL Fairy
  • though not much work to do it by hand

28
Why not use another package?
  • More than 100 DBIx and other SQL access packages
    in CPAN
  • Could not find any that do all (or even most) of
  • hide table structure from user
  • allow multi-table queries, taking care of joins
    automatically
  • do not impose their own conventions on table
    schema
  • allow query values and selections to be
    overridden
  • allow transparent post-processing of query
    results
  • provide accessor functions for query results
  • I believe that taken together these features
    provide a clear and easy to use abstraction

29
References
  • The BaBar Experiment
  • http//hepunx.rl.ac.uk/BFROOT/
  • BaBar Bookkeeping project
  • http//hepunx.rl.ac.uk/BFROOT/www/Computing/Distri
    buted/Bookkeeping/Documentation/
  • BaBar Bookkeeping presentation and paper
  • http//indico.cern.ch/contributionDisplay.py?contr
    ibId338sessionId7confId0
  • D.A. Smith et al.,BaBar Book Keeping project
    a distributed meta-data catalog of the BaBar
    event store,Proc. Computing in High Energy and
    Nuclear Physics 2004 (CHEP04).
  • CPAN Database Interfaces (see particularly DBIx)
  • http//cpan.uwinnipeg.ca/chapter/Database_Interfac
    es
Write a Comment
User Comments (0)
About PowerShow.com