Title: An SQL API for Object Oriented Perl
1 An SQL API for Object Oriented Perl
- Tim Adye
- BaBar Collaboration
- Particle Physics Department
- eScience Centre informal seminar
- 12th November 2004
2 Talk Plan
- Summary of BaBar and its bookkeeping system
  - a large-scale distributed metadata catalogue, in Gridspeak
- The problem
  - Why not just write SQL?
- The BaBar SQL API
  - User view: Perl classes and command-line tool
  - Behind the scenes
  - Table schema configuration classes
  - Summary of features
- Could this be generalised to other applications?
  - Possible improvements
  - Comparison with other DBIx packages
- Summary and references
3 Feedback and Discussion
- Would this be useful outside BaBar?
- Would it be worth making a public release?
  - eg. on CPAN
- Does it need any improvements?
  - New features
  - Make it compatible with some other standards
    - eg. sit on top of another abstraction like DBIx::Table
  - A better name!
- or have I just reinvented the wheel?
  - If so, maybe I should import some classic wheel technology into BaBar, rather than try to export my own square wheel idea to the rest of the world
4 The BaBar Collaboration
- 10 countries, 77 institutions, 593 physicists
5 The BaBar Detector at SLAC
[Diagram: the SLAC linear accelerator feeding the 2.2 km PEP-II ring, with the BaBar detector at the interaction point; first events recorded by BaBar on 26th May 1999]
6 The BaBar Bookkeeping
- New bookkeeping system created in parallel with the new data model
  - Switch from OO database (Objectivity) to flat files (ROOT)
  - Much simpler to manage, but no longer have the top-level Objectivity catalogue
  - Actually the catalogue didn't really help us, so a random set of bookkeeping tools had been built up
- The main aim of the project was to create a coherent bookkeeping system to replace these ad hoc tools
- Successfully deployed Feb 2004. In use ever since.
  - Started development in Jan 2003
  - beta test in Autumn
7 Bookkeeping Concepts
- Currently BaBar analysis data consists of
  - 43G events (an event = the results from one e+e- collision)
  - in 335k files
  - total size of 186 TB
  - 156M rows in the largest table
- Expect all of these to grow geometrically over the next few years
- Data grouped in an overlapping hierarchy
  - many-to-many relations between tables, eg.
    - 1.7M runs - unit of datataking
    - 219k collections - unit of access to the event store
    - 335k files - each collection stored in a few physical files
    - 18k datasets - groups of data that analysis users want
- Need to select runs, collections, etc. based on attributes from any of these tables
9 Features of the BaBar Bookkeeping
- Decoupled from the event store
  - Job configuration written to a site-independent tcl file
  - Worker nodes don't need access to the database server
- Mirrored to other sites
  - From SLAC to 4 Tier A sites, 15 Tier C sites, user laptops
  - Supports Oracle and MySQL (can mirror Oracle -> MySQL)
- Remote access
  - Used for mirroring and for sites who don't want a local db
- Connection parameter/key distribution system
  - Eg. BookkeepingCommand -site=slac or -site=ral transparently connects to the correct Oracle server at SLAC or MySQL server at RAL (a sketch of the idea follows below)
  - Allows fallback servers if there are problems
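The connection manager itself is BaBar-specific, but the idea can be sketched in a few lines of Perl. Everything below (package name, DSNs, accounts) is illustrative rather than the real BaBar code:

    # Hypothetical sketch of a site -> connection-parameter map with fallback.
    package BbkConnectSketch;
    use strict;
    use warnings;
    use DBI;

    my %sites = (
        slac => [ 'dbi:Oracle:host=ora1.slac.example;sid=BBK',     # primary
                  'dbi:Oracle:host=ora2.slac.example;sid=BBK' ],   # fallback
        ral  => [ 'dbi:mysql:database=bbk;host=mysql.ral.example' ],
    );

    sub connect_site {
        my ($site, $user, $pass) = @_;
        for my $dsn (@{ $sites{$site} || [] }) {
            # try each server in turn; the first one that answers wins
            my $dbh = DBI->connect($dsn, $user, $pass,
                                   { RaiseError => 0, PrintError => 0 });
            return $dbh if $dbh;
        }
        die "no database server reachable for site '$site'\n";
    }

    1;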
10 Other Features of the Bookkeeping
- Import of data to each site managed by the local bookkeeping system
  - Each site knows what it has and what it wants
  - New data imported automatically in a cron job
- Task management system
  - Controls job submission, checking, resubmission
  - Currently just for production jobs, but soon for user analysis jobs too
- Looking at interfacing this to LCG job submission
  - Already use LCG for UK Monte Carlo production
11 User Access
- Users need to query the database to find out what data to process
- May also need other information
  - eg. luminosity, run numbers, file sizes
- Mostly select by dataset, but may need to limit further
  - eg. only data available locally, taken at peak energy, excluding some problem datataking period
- Cannot expect users to know which tables to use, how to join them, or even the SQL syntax
  - Even worse if the schema changes
- Cannot expect developers to code for all combinations of queries with all possible selections
  - Previous ad hoc tools (some mine!) tried to do this, and it was a nightmare even for a simpler table structure
12 BaBar SQL API - user view
- Each column that users might want to query or select on is given a unique logical name, regardless of which table it lives in
- These names are used to specify query values

    $query->addValues('collection', 'gbytes');

- and selections

    $query->addSelector('dataset', 'Dilepton-');
    $query->addSelector('run', '10000-19999');

- Different types of data allow for different selection syntax, eg. wildcards for names, or ranges for run numbers.
- Can also use SQL expressions (in terms of logical names)

    $query->addValues('SUM(lumi)/1000');

- and sorting, row limits, etc.
- Each of these logical names happens to be in a different table
13 SQL API - returning results
- That's enough to generate a valid SQL SELECT query. To return the results:

    my $sta = $query->execute();
    while (my $row = $sta->fetch()) {
        print $row->gbytes(), $row->collection(), "\n";
    }

- The query object collects the user requests
- $query->execute() returns a statement accessor (like a DBI statement handle).
- $sta iterates over row objects, each of which has accessors for each query value, gbytes and collection.
- That's all there is to it!
- After the usual DBI connect and query object instantiation (see later), these statements form a working program; a complete sketch follows below.
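To make that concrete, here is a minimal end-to-end sketch. The connection parameters are invented, and the registration call anticipates the table configuration classes described later in this talk; only the addValues/addSelector/execute/fetch calls are taken directly from the slides above:

    use strict;
    use warnings;
    use DBI;

    # hypothetical connection parameters
    my $dbi = DBI->connect('dbi:mysql:database=bbk;host=localhost',
                           'reader', 'secret', { RaiseError => 1 });

    # query object with a table configuration class registered (see later)
    my $query = new BbkSqlSelect;
    $query->addModules(new MyTableClass($dbi));

    $query->addValues('collection', 'gbytes');
    $query->addSelector('dataset', 'Dilepton-');
    $query->addSelector('run', '10000-19999');

    my $sta = $query->execute();
    while (my $row = $sta->fetch()) {
        print $row->gbytes(), " ", $row->collection(), "\n";
    }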
14 Command-Line Tools
- Standard BaBar tools use this API to create job configuration, create datasets, calculate luminosities, etc.
  - Standard tasks, but optionally allowing additional selections
- Also provide an "expert" tool that allows access to the full API functionality from the command line
- This has proved very popular, with many non-experts making their own unique queries
15 Examples

    BbkUser --dataset=A0-Run4-OnPeak-R14 \
            --is_local=1 --file_status=0 \
            dse_lumi events gbytes file

    DSE_LUMI  EVENTS  GBYTES  FILE
    --------------------------------------------------------------------
      1250.3  526115     1.6  /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root
      1250.3  526115     0.8  /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root
      1348.4  576239     1.6  /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root
      1348.4  576239     1.0  /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root
    ...
    156 rows returned

    BbkUser --collection-file=coll.lis \
            tot_gbytes collection
16 What happens behind your back
- The SQL API
  - translates the logical names to table columns
  - selects the required tables and joins
    - including otherwise unused tables required for the joins
  - generates and executes a valid SQL SELECT statement
  - creates a statement accessor object
  - dynamically generates a class for the row objects, with accessors for each query value (a sketch of the idea follows)
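The dynamic row-class generation can be done with standard Perl symbol-table manipulation. The following shows only the general technique, not the actual BaBar implementation; the package and function names are invented:

    # Hypothetical sketch: build a class whose accessors return the
    # corresponding elements of an array-based row object.
    package BbkRowFactory;
    use strict;
    use warnings;

    my $serial = 0;

    sub make_row_class {
        my @names = @_;                    # logical names of the query values
        my $class = 'BbkRow::Auto' . ++$serial;
        no strict 'refs';
        for my $i (0 .. $#names) {
            my $idx = $i;                  # capture the index in a closure
            *{"${class}::$names[$i]"} = sub { return $_[0]->[$idx] };
        }
        *{"${class}::new"} = sub { my ($c, @vals) = @_; bless [@vals], $c };
        return $class;
    }

    1;

So make_row_class('gbytes', 'collection') would return a class whose instances respond to $row->gbytes() and $row->collection().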
17 Our Example

    BbkUser --dataset=A0-Run4-OnPeak-R14 \
            --is_local=1 --file_status=0 \
            dse_lumi events gbytes file

- That first BbkUser command involved 5 tables
  - including one that provides the join between the dataset and collection tables

    SELECT dse.lumi_sum AS "dse_lumi", dse.output_nev AS "events",
           file.bytes, dse.name AS "collection",
           file.suffix AS "file_suffix", ds.id AS "ds_id",
           dse.id AS "dse_id", dtd.id AS "dtd_id",
           dtd.link_status
    FROM bbk_dataset ds, bbk_dsentities dse, data_files dfile,
         bbk_files file, bbk_dstodse dtd
    WHERE ds.id = dtd.ds_id
      AND dtd.dse_id = dse.id
      AND dse.id = file.dse_id
      AND file.id = dfile.file_id
      AND ds.name = 'A0-Run4-OnPeak-R14'
      AND dse.is_local = '1'
      AND dfile.status = '0'

- The SQL API can even pretty-print it like this for you :-)
- (What's shown here is somewhat abbreviated - the actual generated SQL includes full database and table names in case of ambiguities)
19 Table schema configuration classes
- Mapping between logical names and table columns is defined in the configuration classes.
  - One class per table
- Can also define special properties of each column (eg. whether to allow ranges (100-199) for selection).
- Possible joins between tables defined here too
  - Use logical column names for join conditions, so one table class does not need to know about column names in other classes.
- In most cases it's just a matter of listing logical vs. column names
  - with a little Perl syntactic sugar
- Inheritance of config classes expresses commonalities
  - eg. common id, created, and modified columns (a sketch below)
20 Example Table Configuration

    sub table { return 'bbk_files' }

    sub tableConfig {
        return {
            alias   => 'file',
            columns => {
                bytes       => 'bytes',
                uuid        => 'uuid',
                checksum    => 'checksum',
                file_suffix => 'suffix',
                nfiles      => 'COUNT(DISTINCT file_id)',
                gbytes      => '(bytes/1073741824)',
                tot_gbytes  => 'SUM(bytes)/1073741824',
                file_dse_id => { column => 'dse_id', selectorType => 'range' },
                file        => { valueAction    => 'addLfnValue',
                                 selectorAction => 'lfnSelector' },
            },
            joins => {
                dse_id  => 'file_dse_id',
                file_id => 'dfile_file_id',
            },
        };
    }
21 Putting it all together
- Configuration classes must be registered with the query object

    my $query = new BbkSqlSelect;
    $query->addModules(new MyTableClass($dbi));

- but of course it is usually simpler to provide a query object pre-registered with all the table configs as part of a specific API (a sketch below).
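For example, a BaBar-specific API might wrap the registration in a small factory. This is an illustrative sketch (the factory name and the list of table classes are invented), using only the BbkSqlSelect/addModules calls shown above:

    # Hypothetical factory returning a query object with all table
    # configuration classes already registered.
    package BbkQueryFactory;
    use strict;
    use warnings;

    sub newQuery {
        my ($dbi) = @_;
        my $query = new BbkSqlSelect;
        # register every table configuration class the application knows about
        $query->addModules(new MyTableClass($dbi));
        # ... further addModules() calls for the other tables ...
        return $query;
    }

    1;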
22 Overriding and Synthetic Columns
- A crucial advantage of this system is that it allows us to override the default behaviour
- Allows us to hide complexities from the user
  - Make even complex schema changes transparent to users
- A logical name can refer to
  - an ordinary database column name
  - an SQL expression (in terms of database columns, or other logical names)
  - a Perl method to pre- or post-process a selection or query value
  - a synthetic query value or selection
    - can return a calculated value or alter behaviour
- Global post-processing
  - Can be triggered by value, selection, or table inclusion
  - Allows global filtering of returned rows
23 What happens behind your back (2)
- We already used some of these features without noticing!
- Dataset names can be found in the bbk_dataset or the bbk_aliases table
  - Requires a check and translation using the alias table
- Datasets can evolve with time, with collections being added or removed
  - Need to query the dataset for any time in the past, or use a tagged dataset alias (like a CVS tag)
  - Implemented by automatically including a date selection in the query, and post-processing the returned results to remove deleted collections
- File names are made from a collection name + a suffix
  - $query->addSelector('file') splits the file name for the query, and the $row->file() accessor rejoins them (a sketch of the idea follows)
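The slide 20 configuration names the hooks involved (valueAction => 'addLfnValue', selectorAction => 'lfnSelector'). What those methods actually do is not shown in this talk; the following is only a guess at the shape of such hooks, with an over-simplified splitting rule:

    # Hypothetical sketch of the 'file' column's pre/post-processing hooks.
    # The real BaBar splitting rule for logical file names is certainly
    # more involved than a split on the first '.'.
    sub lfnSelector {
        my ($self, $query, $lfn) = @_;
        my ($collection, $suffix) = split /\./, $lfn, 2;
        $query->addSelector('collection',  $collection);
        $query->addSelector('file_suffix', $suffix) if defined $suffix;
    }

    sub addLfnValue {
        my ($self, $query) = @_;
        # fetch both pieces; the generated $row->file() accessor rejoins them
        $query->addValues('collection', 'file_suffix');
    }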
24 Features
- Supports Oracle 8 and MySQL 3.23
- Most queries that can be expressed in both these dialects can be expressed by users via the API, without breaking the paradigm of a flat namespace
  - aggregation and grouping
  - sorting and DISTINCT
  - MySQL's LIMIT emulated in Oracle (see the sketch below)
  - inner and outer joins (generates Oracle or MySQL syntax)
- Does not support UNION or subqueries
  - Could be added, but not in MySQL 3.23
- Convenience features
  - automatic Getopt specification
  - query results display formatting and summary table generation
  - Configuration class summary table generation
25 Limitations
- Assumes tables can be joined in a unique way
  - ie. the joins form an acyclic graph
  - can still select different joins with explicit switches
- Each column must have its own unique logical name
  - This is usually a good thing
  - but if the same data is held in different columns, it would be more efficient to automatically select from tables that are already included
26 A public version
- Current version has a few BaBar-specific pieces
  - BaBar Connection manager - can use DBI directly
  - BaBar Options manager - can use Getopt directly
  - BaBar base objects - borrow required methods
  - BaBar table formatting class - remove functionality or publish this too
  - otherwise just uses standard Perl modules
- but with different table configs it could be used elsewhere
  - Already do this in BaBar - used for the QA and TM databases
- Maybe I'm making some other assumptions that are true of our database and requirements, but not more generally so. I can't think of any.
- Needs a better name!
  - This is really an SQL API creator
27 Possible improvements
- Tidy up code!
  - User and config APIs are OK, but in between it's pretty ugly
- Separate functionality that can be used on its own
  - Already true of the DBI statement accessor class
- More SQL dialects - PostgreSQL, MS SQL?
- New SQL syntax - subqueries, UNIONs, ...
- INSERT, UPDATE, etc.
  - these don't need joins, so hand-coding is not such a problem
- Automatic selection of different join possibilities
- Automatic generation of default table classes from the SQL schema
  - Could use "The SQL Fairy" (SQL::Translator)
  - though not much work to do it by hand
28 Why not use another package?
- More than 100 DBIx and other SQL access packages in CPAN
- Could not find any that do all (or even most) of
  - hide table structure from the user
  - allow multi-table queries, taking care of joins automatically
  - do not impose their own conventions on the table schema
  - allow query values and selections to be overridden
  - allow transparent post-processing of query results
  - provide accessor functions for query results
- I believe that taken together these features provide a clear and easy-to-use abstraction
29 References
- The BaBar Experiment
  - http://hepunx.rl.ac.uk/BFROOT/
- BaBar Bookkeeping project
  - http://hepunx.rl.ac.uk/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/
- BaBar Bookkeeping presentation and paper
  - http://indico.cern.ch/contributionDisplay.py?contribId=338&sessionId=7&confId=0
  - D.A. Smith et al., "BaBar Book Keeping project - a distributed meta-data catalog of the BaBar event store", Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04).
- CPAN Database Interfaces (see particularly DBIx)
  - http://cpan.uwinnipeg.ca/chapter/Database_Interfaces