Title: Dirk Duellmann, IT-DB
1The POOL Persistency Framework
- Dirk Duellmann, IT-DB LCG-POOL
- Opera Computing Meeting
- 23rd September 2003
2What is POOL?
- Project goal: develop a common Persistency Framework for physics applications at the LHC
- POOL: Pool Of persistent Objects for LHC
- Common effort between the LHC experiments and the CERN IT-DB group
- for defining its scope and architecture
- for the development of its components
- Being integrated by CMS, ATLAS and soon LHCb
- First test production now: CMS processed 700k events
3POOL Timeline and Statistics
- POOL project started April 2002
- Ramping up from 1.6 to 10 FTE
- Persistency Workshop in June 2002
- First internal release POOL V0.1 in October 2002
- In one year of active development since then
- 10 public releases
- POOL V1.3.1 last Friday
- Some 50 internal releases
- Often picked up by experiments to confirm fixes/new functionality
- Very useful to ensure releases meet experiment expectations beforehand
- Handled some 150 bug reports
- Savannah web portal has proven helpful
- From the beginning, POOL followed a rather aggressive schedule to meet the first production needs of the experiments.
4POOL Persistency
- To allow the multi-PB of experiment data and associated meta data to be stored in a distributed and Grid enabled fashion
- various types of data of different volumes (event data, physics and detector simulation, detector data and bookkeeping data)
- Hybrid technology approach, combining
- C++ object streaming technology, such as Root I/O, for the bulk data
- transactionally safe Relational Database (RDBMS) services, such as MySQL, for catalogs, collections and meta data
- In particular, it provides
- Persistency for C++ transient objects
- Transparent navigation from one object to another across file and technology boundaries
- Integration with an external File Catalog to keep track of the physical file location, allowing files to be moved or replicated
5POOL Architecture
- POOL is a component based system
- follows the LCG Architecture Blueprint
- Provides a technology neutral API
- Abstract component C++ interfaces
- Insulates the experiment framework user code from several concrete implementation details and technologies used today
- POOL user code is not dependent on implementation libraries
- No link time dependency on implementation packages (e.g. MySQL, Root, Xerces-C, ...)
- Backend component implementations are loaded at runtime via the SEAL plug-in infrastructure
- Three major domains, weakly coupled, interacting via abstract interfaces
6POOL Work Package breakdown
- Storage Manager
- Streams transient C++ objects to and from disk storage
- Resolves a logical object reference into a physical object
- Uses Root I/O. A proof of concept with an RDBMS storage manager prototype is underway
- File Catalog
- Maintains consistent lists of accessible files (physical and logical names) together with their unique identifiers (FileID), which appear in the object representation in the persistent space
- Resolves a logical file reference (FileID) into a physical file
- Collections
- Provides the tools to manage (potentially large) ensembles of objects stored via POOL persistence services
- Explicit: server-side selection of objects from queryable collections
- Implicit: defined by physical containment of the objects
7POOL Component Breakdown
8POOL on the Grid
[Diagram: a User Application uses the LCG POOL components (Collections, File Catalog, RootI/O, Meta Data), which connect to Grid Middleware services (Grid Dataset Registry, Replica Location Service, Replica Manager, Grid Resources, Meta Data Catalog).]
9POOL off the Grid
[Diagram: on a Disconnected Laptop, a User Application uses the LCG POOL components (Collections, File Catalog, RootI/O, Meta Data), backed by local services (MySQL or RootIO Collection, XML / MySQL Catalog, MySQL).]
10POOL Client <-> Server
- POOL will mainly be used from experiment frameworks, as a client library loaded by user applications
- POOL applications are Grid aware via the File Catalog component based on the EDG Replica Location Service (RLS)
- File resolution and meta data queries are forwarded as requests to the Grid middleware
- The POOL storage manager provides remote file access via Root I/O (e.g. RFIO/dCache), possibly replaced later by the Grid File Access Library (GFAL), once it is available
11POOL File Catalog
- Files are referred to inside POOL via a unique and immutable file identifier (FileID), generated at creation time
- POOL added the system generated FileID to the standard Grid m-n mapping between logical and physical file names
- Stable inter-file reference
- Global Unique Identifier (GUID) implementation for FileID
- allows the production of consistent sets of files with internal references without requiring a central ID allocation service
- catalog fragments created independently can later be merged without modification to the corresponding data files
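A minimal Python sketch of why GUID based FileIDs make independent catalog fragments mergeable. The dictionary layout, the function names `new_entry` and `merge`, and the example file names are assumptions made for illustration and do not reflect POOL's actual catalog schema or API.

```python
import uuid


def new_entry(lfn, pfn):
    """Create a catalog entry keyed by a freshly generated GUID."""
    file_id = str(uuid.uuid4())
    return file_id, {"lfns": {lfn}, "pfns": {pfn}}


def merge(*fragments):
    """Merge catalog fragments; FileIDs are never rewritten."""
    merged = {}
    for fragment in fragments:
        for file_id, entry in fragment.items():
            target = merged.setdefault(file_id, {"lfns": set(), "pfns": set()})
            target["lfns"] |= entry["lfns"]
            target["pfns"] |= entry["pfns"]
    return merged


# Two sites produce files independently, with no shared ID service.
id_a, entry_a = new_entry("run1/events", "file:/siteA/run1.root")
id_b, entry_b = new_entry("run2/events", "file:/siteB/run2.root")
site_a = {id_a: entry_a}
site_b = {id_b: entry_b}

# A replica registered elsewhere keeps the FileID of the original file.
replica = {id_a: {"lfns": {"run1/events"},
                  "pfns": {"file:/siteB/copy_run1.root"}}}

catalog = merge(site_a, site_b, replica)
```

Because the identifier is generated locally and never changes, references stored inside data files stay valid after the merge, and a replica simply adds another physical name under the same FileID.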
12Concrete implementations
- XML Catalog
- typically used as a local file by a single user/process at a time
- no need for network or centralised server
- supports R/O operations via http
- tested up to 50K entries
- Native MySQL Catalog
- handles multiple users and jobs (multi-threaded)
- tested up to 1M entries
- EDG-RLS Catalog
- Grid aware applications
- Oracle iAS or Tomcat + Oracle / MySQL backend
- pre-production service based on Oracle (from IT/DB), RLSTEST, already in use for POOL V1.0
13Use case isolated system
[Diagram: with no network, jobs look up input files and register output files in a local XML catalog; catalog fragments are imported from and published to the EDG catalog.]
- The user extracts a set of interesting files and a catalog fragment describing them from a (central) Grid based catalog into a local XML catalog
- Selection is performed based on file or collection meta data
- After disconnecting from the Grid the user executes some standard jobs navigating through the extracted data
- New output files are registered into the local XML catalog
- Once the new data is ready for publishing and the user is connected, the new catalog fragment is submitted to the Grid based catalog
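The extract, work offline, publish cycle can be sketched in a few lines of Python. Plain dictionaries stand in for both the Grid based and the local XML catalog; the functions `extract`, `register` and `publish`, the meta data keys and the file names are all hypothetical.

```python
import uuid


def extract(catalog, predicate):
    """Copy the entries whose meta data satisfy the predicate."""
    return {fid: dict(entry) for fid, entry in catalog.items()
            if predicate(entry["meta"])}


def register(catalog, pfn, meta):
    """Register a new file under a locally generated FileID."""
    fid = str(uuid.uuid4())
    catalog[fid] = {"pfn": pfn, "meta": meta}
    return fid


def publish(local, central, known):
    """Submit only the entries that were not part of the original extract."""
    new = {fid: entry for fid, entry in local.items() if fid not in known}
    central.update(new)
    return new


# Hypothetical central (Grid based) catalog with two files.
grid = {}
register(grid, "file:/grid/run1.root", {"run": 1, "type": "raw"})
register(grid, "file:/grid/run2.root", {"run": 2, "type": "sim"})

# 1. Connected: extract a fragment selected by meta data.
local = extract(grid, lambda meta: meta["type"] == "raw")
extracted_ids = set(local)

# 2. Disconnected: jobs register their output in the local catalog only.
out_id = register(local, "file:/laptop/histos.root", {"run": 1, "type": "histo"})

# 3. Connected again: publish the new fragment to the central catalog.
published = publish(local, grid, extracted_ids)
```

Only the entries created while disconnected travel back to the central catalog, which is what keeps the publish step cheap.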
14Use case farm production
[Diagram: production jobs on farm nodes (lx1.cern.ch, pc3.cern.ch) look up and register files in local XML catalogs; after a quality check the fragments are published to the site MySQL catalog and on to the EDG catalog.]
- A production job runs and creates files and their catalog entries in a local XML file
- During the production the catalog can be used to clean up files
- Once the data quality checks have been passed, the production manager decides to publish the production XML catalog fragment to the site database catalog and eventually to the Grid based catalog
15POOL Storage Hierarchy
- An application may access databases (e.g. ROOT files) from a set of catalogs
- Each database has containers of one specific technology (e.g. ROOT trees)
- Smart Pointers are used
- to transparently load objects into a client side cache
- to define object associations across file or technology boundaries
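The hierarchy catalog, database, container, object can be made concrete with a short Python sketch. It is not POOL's C++ API: the classes `Address`, `Store` and `Ref`, the GUID strings and the file names are invented for illustration, and nested dictionaries stand in for real files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Hypothetical persistent address: file, container and entry."""
    file_id: str
    container: str
    entry: int


class Store:
    def __init__(self, catalog, databases):
        self.catalog = catalog      # FileID -> physical file name
        self.databases = databases  # file name -> {container: [objects]}
        self.cache = {}             # Address -> loaded object

    def load(self, address):
        """Load on demand: catalog -> database -> container -> object."""
        if address not in self.cache:
            pfn = self.catalog[address.file_id]
            container = self.databases[pfn][address.container]
            self.cache[address] = container[address.entry]
        return self.cache[address]


class Ref:
    """Smart pointer stand-in: holds an address, resolves it on use."""

    def __init__(self, store, address):
        self._store, self._address = store, address

    def get(self):
        return self._store.load(self._address)


catalog = {"guid-A": "events.root", "guid-B": "tracks.root"}
databases = {
    "events.root": {"Events": [{"id": 7,
                                "track": Address("guid-B", "Tracks", 0)}]},
    "tracks.root": {"Tracks": [{"pt": 12.5}]},
}
store = Store(catalog, databases)

event = Ref(store, Address("guid-A", "Events", 0)).get()
# The association stored in the event points into a different file.
track = Ref(store, event["track"]).get()
```

The user follows `event["track"]` without knowing that the target lives in another file; the catalog lookup happens inside `Store.load`.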
16Storage Hierarchy
[Diagram: storage hierarchy layers Data Cache, StorageMgr and DiskStorage, annotated with Object Pointer and Persistent Address.]
17Database Technologies
- Identify commonalities and differences between technologies
- Necessary knowledge when reading/writing
- Model adapts to any technology with direct record access
- Need to know record identifier in advance
- RDBMS: more or less traditional
- Primary key must be uniquely determined before writing
- Probably two round-trips
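One possible reading of the "two round-trips" remark, sketched with Python's standard `sqlite3` module as a stand-in for an RDBMS: the first statement reserves a row so that the primary key is known, the second writes the payload, which can then embed that key. The table layout and the function `write_object` are assumptions; the slides do not say how POOL's RDBMS prototype actually handles this.

```python
import sqlite3


def write_object(conn, payload):
    """Write a payload that needs to know its own primary key."""
    # Round-trip 1: reserve a row so the primary key is determined.
    cursor = conn.execute("INSERT INTO objects (payload) VALUES (NULL)")
    key = cursor.lastrowid
    # Round-trip 2: store the payload, now able to embed its identifier.
    conn.execute("UPDATE objects SET payload = ? WHERE id = ?",
                 (f"{key}:{payload}", key))
    conn.commit()
    return key


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, payload TEXT)")

k1 = write_object(conn, "track")
k2 = write_object(conn, "vertex")
stored = conn.execute("SELECT payload FROM objects WHERE id = ?",
                      (k1,)).fetchone()[0]
```

With direct record access the identifier is known up front and a single write suffices, which is the contrast this slide draws.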
18Dictionary Population/Conversion
DictionaryGeneration
19Client Data Access
20POOL Milestones
- First Public Release - V0.3 December 02
- Navigation between files supported, catalog components integrated
- LCG Dictionary moved to SEAL and picked up from there
- Basic dictionary integration for elementary types
- First Functionally Complete Release - V1.0 June 03
- LCG dictionary integration for most requested language features including STL containers
- Consistent meta data support for file catalog and event collections (aka tag collections)
- Integration with EDG-RLS pre-production service (rlstest.cern.ch)
- First Production Release - V1.1 July 03
- Added bare C++ pointer support, transient data members, update of streaming layer data, simplified (user) transaction model
- Due to the large number of requests from integration activities, still more a functionality release than the planned consolidation release
- EDG-RLS production service (one catalog server per experiment)
- Project stayed close to release date estimates
- Maximum variance 2 weeks
- Usually release within a few days around the predicted target date
21POOL Experiment Integration
- Experiment Integration started after the POOL V1.1 release
- CMS
- LCG AA s/w environment was easy to pick up: scram, SEAL plugin loading
- (optional) Object cache provided by POOL is used directly
- Many functional requests have been formulated and implemented during their first POOL integration
- ATLAS
- More complicated framework integration, e.g. due to interference between the SEAL and GAUDI plugin loading facilities
- Investigating several integration options with or w/o POOL object cache; role of LCG object white board
- More development work required on the experiment side and more people involved
- LHCb
- Shares most integration issues with ATLAS
- Expect no significant problems as key POOL developer and framework integrator is the same person
22Summary
- The LCG POOL project provides a hybrid store integrating object streaming (e.g. Root I/O) with RDBMS technology (e.g. MySQL) for consistent meta data handling
- Strong emphasis on component decoupling and well defined communication/dependencies
- Transparent cross-file and cross-technology object navigation via C++ smart pointers
- Integration with Grid wide Data Catalogs (e.g. EDG-RLS)
- but preserving networked and grid-decoupled working modes
- Recently also the various ConditionsDB implementations have been moved into the scope of the LCG Persistency Project
- Work on software and release integration with POOL and other LCG Application area services will start soon
- POOL has been integrated into LHC experiment software frameworks and is used for the pre-production activities in CMS
- Selected as persistency mechanism for CMS PCP, ATLAS and LHCb data challenges
23How to find out more about POOL?
- POOL Home Page
- http://pool.cern.ch/
- POOL Workbook
- http://lcgapp.cern.ch/project/workbook/pool/current/pool.html
- POOL savannah portal (bug reports, cvs)
- http://savannah.cern.ch/projects/pool
- POOL binary distribution (provided by LCG-SPI)
- http://lcgapp.cern.ch/project/spi/lcgsoft
24Generic Persistent Model
Transient
25Cache Access Through References
- References know about the Data Cache
- 2 operation modes
- Clear at checkpoint
- Auto-clear with reference count
- References are implemented as smart pointers
- Use cache manager for load-on-demand
- Use the object key of the cache manager
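The two cache modes can be sketched in Python as follows. The reference count is kept explicitly so that the behaviour does not depend on the interpreter's garbage collector; `DataCache`, `CachedRef` and `loader` are hypothetical names, and the real POOL references are C++ smart pointers.

```python
class DataCache:
    def __init__(self, loader, auto_clear=False):
        self._loader = loader        # key -> object, stands in for storage
        self._objects = {}           # key -> cached object
        self._counts = {}            # key -> number of live references
        self.auto_clear = auto_clear

    def acquire(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1

    def release(self, key):
        self._counts[key] -= 1
        if self.auto_clear and self._counts[key] == 0:
            self._objects.pop(key, None)   # last reference gone: evict

    def get(self, key):
        if key not in self._objects:       # load-on-demand
            self._objects[key] = self._loader(key)
        return self._objects[key]

    def checkpoint(self):
        self._objects.clear()              # mode 1: clear at checkpoint

    def __contains__(self, key):
        return key in self._objects


class CachedRef:
    """Reference that knows its cache and the object's key."""

    def __init__(self, cache, key):
        self._cache, self._key = cache, key
        self._released = False
        cache.acquire(key)

    def get(self):
        return self._cache.get(self._key)

    def release(self):
        if not self._released:
            self._released = True
            self._cache.release(self._key)


loads = []

def loader(key):
    loads.append(key)
    return {"key": key}

# Auto-clear with reference count.
auto = DataCache(loader, auto_clear=True)
a, b = CachedRef(auto, "evt-1"), CachedRef(auto, "evt-1")
a.get(); b.get()                                # object is loaded once
a.release()
cached_after_first_release = "evt-1" in auto    # True: b still holds it
b.release()
cached_after_last_release = "evt-1" in auto     # False: evicted

# Clear at checkpoint.
manual = DataCache(loader)
c = CachedRef(manual, "evt-2")
c.get()
c.release()
cached_before_checkpoint = "evt-2" in manual    # True: kept until checkpoint
manual.checkpoint()
cached_after_checkpoint = "evt-2" in manual     # False
```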
26POOL Cache Access
[Diagram: cache table with columns Key, Object and Token (example row: 2, <pointer>, <pointer>); a Token holds Storage Technology, Object Type and Persistent Location.]
27Follow Object Associations
Entry ID
Link ID
28The Link Table
- Contains all information to resurrect an object
- Storage type
- Database name
- Container name
- Object type (class name)
- Cache hints
- E.g. other possible transient conversions
- Size O(Associations in class model)
- Local to every database
- Size is limited
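A Python sketch of the idea behind the link table: a persistent reference stores only a (Link ID, Entry ID) pair, and the per-database table expands the Link ID into the full location. Field names follow the list on this slide, cache hints are omitted, and the classes `Link` and `LinkTable` as well as the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    storage_type: str
    database: str
    container: str
    object_type: str


class LinkTable:
    """Per-database table: one row per distinct link, not per object."""

    def __init__(self):
        self._links = []   # link ID -> Link
        self._ids = {}     # Link -> link ID

    def link_id(self, link):
        """Return the ID of a link, adding a row only if it is new."""
        if link not in self._ids:
            self._ids[link] = len(self._links)
            self._links.append(link)
        return self._ids[link]

    def resolve(self, link_id, entry_id):
        """Expand a compact (link ID, entry ID) pair into a full address."""
        return self._links[link_id], entry_id

    def __len__(self):
        return len(self._links)


table = LinkTable()
tracks = Link("ROOT_Tree", "tracks.root", "Tracks", "Track")

# 1000 persistent references into the same container share one table row.
refs = [(table.link_id(tracks), entry) for entry in range(1000)]
link, entry = table.resolve(*refs[42])
```

The table grows with the number of distinct associations rather than with the number of objects, which matches the "Size O(Associations in class model)" bullet.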
29File Catalog functionality
- Connection and transaction control functions
- Catalog insertion and update functions on logical and physical filenames
- Catalog lookup functions (by filename, FileID or query)
- Clean-up after an unsuccessful job
- Catalog entries iterator
- File meta data operations (e.g. define or insert file meta data)
- Cross catalog operations (e.g. extract an XML fragment and append it to the MySQL catalog)
- Python based graphical user interface for catalog browsing
30File Catalog Scaling Tests
- update and lookup performance looks fine
- scalability shown up to a few hundred boxes
- need to understand availability and backup in the
production context
31File Catalog Browser Prototype
32 Pros and cons of POOL vs Vanilla Root
- More services
- File catalog (disconnected/Grid aware)
- Object cache manager
- Functionalities
- Object storage operations more explicit: Write/Read/Update/Delete
- Object navigation transparent: linked objects are resolved automatically inside the store
- Simple transaction handling
- Root functionalities preserved
- Trees and Keys format available (simple to switch)
33 Pros and cons of POOL vs Vanilla Root
- Integration in the software
- Dictionary Generation easier with LCG Dict
- No changes are needed on class headers
- no instrumentation, no inheritance from external classes
- Architecture
- Technology independent interfaces
- Modular component design
- Component can be used separately
- Layered structure
34 Pros and cons of POOL vs Vanilla Root
- And also
- Easy to replace with Vanilla Root
- Allows for alternatives to RootI/O!
- Concerns (POOL disadvantages)
- Performance
- Multiple software layers on top of Root
- Reliability
- POOL is very young