Title: The Bookkeeping SQL API
1The BookkeepingSQL API
- Tim Adye
- Rutherford Appleton Laboratory
- Bookkeeping / Data Distribution Parallel
- BaBar Collaboration Meeting
- 8th December 2004
2Talk Plan
- The problem
- Why not just write SQL?
- The BaBar SQL API
- User view Perl classes and command-line tool
- Behind the scenes
- Table schema configuration classes
- Summary of features
- Could this be generalised to other applications?
- Possible improvements
- Comparison with other DBIx packages
- Summary and references
3User Access
- Users need to query database to find out what
data to process - May also need other information
- eg. luminosity, run numbers, file sizes
- Mostly select by dataset, but may need to limit
further - eg. only data available locally, taken at peak
energy, excluding some problem datataking period - Cannot expect users to know which tables to use,
how to join them, or even the SQL syntax - Even worse if the schema change
- Cannot expect developers to code for all
combinations of queries with all possible
selections - Previous ad hoc tools (some mine!) tried to do
this and it was a nightmare, even for a simpler
table structure
4(No Transcript)
5BaBar SQL API user view
- Each column that users might want to query or
select on is given a unique logical name
regardless of which table it lives in - These names are used to specify query values
- query-gtaddValues('collection', 'gbytes')
- and selections
- query-gtaddSelector('dataset', 'Dilepton-')
- query-gtaddSelector('run', '10000-19999')
- Different types of data allow for different
selection syntax, eg. wildcards for names, or
ranges for run numbers. - Can also use SQL expressions (in terms of logical
names) - query-gtaddValues('SUM(lumi)/1000')
- and sorting, row limits, etc
Each of these happens to be in a different table
6SQL API returning results
- Thats enough to generate a valid SQL SELECT
query. To return the results- - my sta query-gtexecute()
- while (my row sta-gtfetch())
- print row-gtgbytes(), row-gtcollection(), "\n"
-
- The query object collects the user requests
- query-gtexecute() returns a statement accessor
(like a DBI statement handle). - sta iterates over row objects, each of which has
accessors for each query value, gbytes and
collection. - Thats all there is to it!
- After the usual DBI connect, and query object
instantiation (see later), these statements form
a working program
7Command-Line Tools
- Standard BaBar tools use this API to create job
configuration, create datasets, calculate
luminosities, etc. - Standard tasks, but optionally allowing
additional selections - Also provide an expert tool that allows access
to full API functionality from the command line - This has proved very popular, with many
non-experts making their own unique queries
8Examples
- BbkUser --datasetA0-Run4-OnPeak-R14 \
--is_local1 --file_status0 \ dse_lumi
events gbytes file \ - --styleadye --display
- DSE_LUMI EVENTS GBYTES FILE
-
- 1250.3 526115 1.6 /store/PRskims/R14/14.4.0d/
A0/02/A0_0239.01.root - 1250.3 526115 0.8 /store/PRskims/R14/14.4.0d/
A0/02/A0_0239.02HBCA.root - 1348.4 576239 1.6 /store/PRskims/R14/14.4.0d/
A0/02/A0_0240.01.root - 1348.4 576239 1.0 /store/PRskims/R14/14.4.0d/
A0/02/A0_0240.02HBCA.root - ...
- 156 rows returned
- BbkUser -collection-filecoll.lis \
- tot_gbytes collection
9What happens behind your back
- The SQL API
- translates the logical names to table columns
- selects the required tables and joins
- including otherwise unused tables required for
the joins - generates and executes a valid SQL SELECT
statement - creates a statement accessor object
- dynamically generates a class for the row objects
with accessors for each query value
10 Our Example
BbkUser --datasetA0-Run4-OnPeak-R14
--is_local1 --file_status0 dse_lumi
events gbytes file
- That first BbkUser command involved 5 tables
- including one that provides the join between
dataset and collection tables - SELECT dse.lumi_sum AS "dse_lumi",
dse.output_nev AS "events", - file.bytes, dse.name AS
"collection", - file.suffix AS "file_suffix", ds.id AS
"ds_id", - dse.id AS "dse_id", dtd.id AS
"dtd_id", - dtd.link_status
- FROM bbk_dataset ds, bbk_dsentities dse,
data_files dfile, - bbk_files file, bbk_dstodse dtd
- WHERE ds.iddtd.ds_id
- AND dtd.dse_iddse.id
- AND dse.idfile.dse_id
- AND file.iddfile.file_id
- AND ds.name'A0-Run4-OnPeak-R14'
- AND dse.is_local'1'
- AND dfile.status'0'
The SQL API can even pretty-print it like this
for you ? (Whats shown here is somewhat
abbreviated actual command includes full
database and table names in case of ambiguities)
11(No Transcript)
12Table schema configuration classes
- Mapping between logical names and table columns
is defined in the configuration classes. - One class per table
- Can also define special properties of each column
(eg. whether to allow ranges (100-199) for
selection). - Possible joins between tables defined here too
- Use logical column names for join conditions, so
one table class does not need to know about
column names in other classes. - In most cases its just a matter of listing
logical vs. column names - with a little Perl syntactic sugar
- Inheritence of config classes expresses
commonalities - eg. common id and created, and modified columns
13Example Table Configuration
- sub table return 'bbk_files'
- sub tableConfig
- return
- alias gt 'file',
- columns gt
- bytes gt 'bytes,
- uuid gt 'uuid',
- checksum gt 'checksum',
- file_suffix gt 'suffix',
- nfiles gt 'COUNT(DISTINCT file_id)',
- gbytes gt '(bytes/1073741824)',
- tot_gbytes gt 'SUM(bytes)/1073741824',
- file_dse_id gt column gt 'dse_id',
selectorType gt 'range' , - file gt valueAction gt 'addLfnValue',
selectorAction gt 'lfnSelector' , - ,
- joins gt
- dse_id gt 'file_dse_id',
- file_id gt 'dfile_file_id',
14Putting it all together
- Configuration classes must be registered with
query object - my query new BbkSqlSelect(bbkconfig)
- query-gtaddModules(new MyTableClass(bbkconfig))
- but of course it is usually simpler to provide a
query object pre-registered with all the table
configs as part of a specific API.
15Overriding and Synthetic Columns
- A crutial advantage of this system is that it
allows us to override the default behaviour - Allows us to hide complexities from user
- Make even complex schema changes transparent to
users - A logical name can refer to
- ordinary database column name
- SQL expression (in terms of database columns, or
other logical names) - Perl method to pre- or post-process selection or
query value - synthetic query value or selection
- can return calculated value or alter behaviour
- Global post-processing
- Can be triggered by value, selection, or table
inclusion - Allows global filtering of returned rows
16What happens behind your back 2
- We already used some of these features without
noticing! - Dataset names can be found in the bbk_dataset or
the bbk_aliases table - Requires a check and translation using the alias
table - Datasets can evolve with time, with collection
being added or removed - Need to query dataset for any time in the past,
or use tagged dataset alias (like a CVS tag) - Implemented by automatically including date
selection in query, and post-processing returned
results to remove deleted collections - File names are made from a collection name a
suffix - query-gtaddSelector('file') splits the file name
for the query and the row-gtfile() accessor
rejoins them
17Features
- Supports Oracle 8 and MySQL 3.23
- Most queries that can be expressed in both these
dialects can be expressed by users via the API
without breaking the paradigm of a flat namespace - aggregatation and grouping
- sorting and distinct
- MySQLs LIMIT emulated in Oracle
- inner and outer joins (generates Oracle or MySQL
syntax) - Does not support UNION or subqueries
- Could be added, but not in MySQL 3.23
- Convenience features
- automatic Getopt specification
- query results display formatting and summary
table generation - Configuration class summary table generation
18Limitations
- Assumes tables can be joined in a unique way
- ie. the joins form an acyclic graph
- can still select different joins with explicit
switches - Each column must have its own unique logical name
- This is usually a good thing
- but if the same data is held in different
columns, it would be more efficient to
automatically select from tables that are already
included
19A public version
- Current version has a few BaBar-specific pieces
- BaBar Connection/Configuration manager can use
DBI directly - BaBar Options manager can use Getopt directly
- BaBar base objects borrow required methods
- BaBar table formatting class publish this too
- otherwise just uses standard Perl modules
- but with different table configs could be used
elsewhere - Already do this in BaBar used for QA and TM
databases - Maybe Im making some other assumptions that are
true of our database and requirements, but not
more generally so. I cant think of any. - Needs a better name!
- This is really an SQL API creator
DBIxSqlAbstractor ???
20Possible improvements
- Tidy up code!
- User and config APIs are OK, but in between its
pretty ugly - Separate functionality that can be used on its
own - Already true of the DBI statement accessor class
- More SQL dialects PostgreSQL, MS SQL?
- New SQL syntax subqueries, UNIONs,
- INSERT, UPDATE, etc
- these dont need joins, so hand-coding not such a
problem - Automatic selection of different join
possibilities - Automatic generation of default table classes
from SQL schema - Could use The SQL Fairy
- though not much work to do it by hand
21Why not use another package?
- More than 100 DBIx and other SQL access packages
in CPAN - Could not find any that do all (or even most) of
- hide table structure from user
- allow multi-table queries, taking care of joins
automatically - do not impose their own conventions on table
schema - allow query values and selections to be
overridden - allow transparent post-processing of query
results - provide accessor functions for query results
- I believe that taken together these features
provide a clear and easy to use abstraction
22Feedback and Discussion
- Would this be useful outside BaBar?
- Is it a good idea to make a public release?
- eg. on CPAN
- Does it need any improvements?
- New features
- Make it compatible with some other standards
- eg. sit on top of another abstraction like
DBIxTable - A better name!
23References
- BaBar Bookkeeping project
- http//slac.stanford.edu/BFROOT/www/Computing/Dist
ributed/Bookkeeping/Documentation/ - BaBar Bookkeeping presentation and paper
- http//indico.cern.ch/contributionDisplay.py?contr
ibId338sessionId7confId0 - D.A. Smith et al.,BaBar Book Keeping project
a distributed meta-data catalog of the BaBar
event store,Proc. Computing in High Energy and
Nuclear Physics 2004 (CHEP04). - CPAN Database Interfaces (see particularly DBIx)
- http//cpan.uwinnipeg.ca/chapter/Database_Interfac
es