1
The POOL Persistency Framework

Dirk Duellmann, IT-DB LCG-POOL
Opera Computing Meeting
23rd September 2003

2
What is POOL?

Project Goal: Develop a common Persistency
Framework for physics applications at the LHC
Pool Of persistent Objects for LHC
Common effort between the LHC experiments and the
CERN IT-DB group
for defining its scope and architecture
for the development of its components
Being integrated by CMS, ATLAS and soon LHCb
First test production now: CMS processed 700k
events

3
POOL Timeline and Statistics

POOL project started April 2002
Ramping up from 1.6 to 10 FTE
Persistency Workshop in June 2002
First internal release POOL V0.1 in October 2002
In one year of active development since then
10 public releases
POOL V1.3.1 last Friday
Some 50 internal releases
Often picked up by experiments to confirm
fixes/new functionality
Very useful to ensure releases meet experiment
expectations beforehand
Handled some 150 bug reports
Savannah web portal has proven helpful
POOL followed from the beginning a rather
aggressive schedule to meet the first production
needs of the experiments.

4
POOL Persistency

To allow the multi-PB of experiment data and
associated meta data to be stored in a
distributed and Grid enabled fashion
various types of data of different volumes (event
data, physics and detector simulation, detector
data and bookkeeping data)
Hybrid technology approach, combining
C++ object streaming technology, such as Root
I/O, for the bulk data
transactionally safe Relational Database (RDBMS)
services, such as MySQL, for catalogs,
collections and meta data
In particular, it provides
Persistency for C++ transient objects
Transparent navigation from one object to another
across file and technology boundaries
Integration with an external File Catalog to keep
track of the physical file location, allowing
files to be moved or replicated

5
POOL Architecture

POOL is a component based system
follows the LCG Architecture Blueprint
Provides a technology neutral API
Abstract component C++ interfaces
Insulates the experiment framework user code from
several concrete implementation details and
technologies used today
POOL user code is not dependent on implementation
libraries
No link time dependency on implementation
packages (e.g. MySQL, Root, Xerces-C..)
Backend component implementations are loaded at
runtime via the SEAL plug-in infrastructure
Three major domains, weakly coupled, interacting
via abstract interfaces
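
The decoupling described on this slide can be illustrated with a minimal sketch. It is not the POOL or SEAL API: the names `FileCatalog`, `register_plugin` and `load_catalog` are hypothetical, and in-memory dictionaries stand in for the real XML and MySQL backends. User code sees only the abstract interface, and the concrete backend is chosen by name at run time.

```python
from abc import ABC, abstractmethod


class FileCatalog(ABC):
    """Abstract interface seen by user code (hypothetical, not the POOL API)."""

    @abstractmethod
    def lookup(self, file_id: str) -> str:
        """Return the physical file name registered for file_id."""


# Maps a backend name to its class; stands in for a plug-in manager.
_PLUGINS = {}


def register_plugin(name):
    """Class decorator that makes a backend available under the given name."""
    def decorator(cls):
        _PLUGINS[name] = cls
        return cls
    return decorator


def load_catalog(name: str) -> FileCatalog:
    """Pick the concrete backend at run time, by name."""
    if name not in _PLUGINS:
        raise LookupError(f"no catalog plug-in named {name!r}")
    return _PLUGINS[name]()


@register_plugin("xml")
class XmlCatalog(FileCatalog):
    def __init__(self):
        # Assumption: a dict replaces the real XML file.
        self._entries = {"guid-1": "file:/data/run1.root"}

    def lookup(self, file_id):
        return self._entries[file_id]


@register_plugin("mysql")
class MySqlCatalog(FileCatalog):
    def __init__(self):
        # Assumption: a dict replaces the real database connection.
        self._entries = {"guid-1": "rfio:/castor/run1.root"}

    def lookup(self, file_id):
        return self._entries[file_id]


catalog = load_catalog("xml")   # user code never names XmlCatalog directly
```

Swapping `"xml"` for `"mysql"` changes the backend without touching the calling code, which is the property the slide attributes to the abstract interfaces.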

6
POOL Work Package breakdown

Storage Manager
Streams transient C++ objects into/from disk
storage
Resolves a logical object reference into a
physical object
Uses Root I/O. A proof of concept with an RDBMS
storage manager prototype is underway
File Catalog
Maintains consistent lists of accessible files
(physical and logical names) together with their
unique identifiers (FileID), which appear in the
object representation in the persistent space
Resolves a logical file reference (FileID) into a
physical file
Collections
Provides the tools to manage (potentially large)
ensembles of objects stored via POOL persistence
services
Explicit: server-side selection of objects from
queryable collections
Implicit: defined by physical containment of the
objects

7
POOL Component Breakdown
8
POOL on the Grid
[Diagram: a User Application uses the LCG POOL
components (Collections, File Catalog, RootI/O,
Meta Data), which connect to Grid Middleware
(Grid Dataset Registry, Replica Location Service,
Replica Manager, Meta Data Catalog) and Grid
Resources]
9
POOL off the Grid
[Diagram: on a Disconnected Laptop, a User
Application uses the LCG POOL components:
Collections (MySQL or RootIO Collection), File
Catalog (XML / MySQL Catalog), RootI/O, Meta Data
(MySQL)]
10
POOL Client <-> Server

POOL will be mainly used by experiment frameworks,
as a client library loaded by user applications
POOL applications are Grid aware via the File
Catalog component based on the EDG Replica
Location Service (RLS)
File resolution and meta data queries are
forwarded to Grid middleware requests
The POOL storage manager provides remote file
access via Root I/O (e.g. RFIO/dCache),
possibly to be replaced later by the Grid File
Access Library (GFAL) once it is available

11
POOL File Catalog

Files are referred to inside POOL via a unique
and immutable file identifier (FileID), generated
at creation time
POOL added the system generated FileID to the
standard Grid m-n mapping
Stable inter-file reference
Global Unique Identifier (GUID) implementation
for FileID
allows the production of consistent sets of
files with internal references without requiring
a central ID allocation service
catalog fragments created independently can later
be merged without modification to the
corresponding data files
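
A short sketch of why GUID-based FileIDs make independent catalog fragments mergeable. It assumes a catalog is a plain dictionary keyed by FileID, with sets of physical (pfn) and logical (lfn) names; the helper names and the example file names are hypothetical, and Python's `uuid` module stands in for whatever GUID generator POOL uses.

```python
import uuid


def new_file_id() -> str:
    """Generate a FileID locally; no central allocation service is contacted."""
    return str(uuid.uuid4())


def register(catalog: dict, pfn: str, lfn: str) -> str:
    """Add a file to a catalog fragment and return its new FileID."""
    file_id = new_file_id()
    catalog[file_id] = {"pfns": {pfn}, "lfns": {lfn}}
    return file_id


def merge(target: dict, fragment: dict) -> None:
    """Merge a fragment into target; FileIDs are kept unchanged."""
    for file_id, entry in fragment.items():
        merged = target.setdefault(file_id, {"pfns": set(), "lfns": set()})
        merged["pfns"] |= entry["pfns"]
        merged["lfns"] |= entry["lfns"]


# Two sites fill their own fragments without talking to each other.
site_a, site_b = {}, {}
id_a = register(site_a, "file:/a/hits.root", "lfn:hits")
id_b = register(site_b, "file:/b/digis.root", "lfn:digis")

# Later both fragments go into one catalog; no FileID needs rewriting,
# so references stored inside the data files remain valid.
central = {}
merge(central, site_a)
merge(central, site_b)
```

Because the FileID sits between the logical and the physical names, registering a replica only adds a physical name to an existing entry, and the identifier seen by stored object references does not change.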

12
Concrete implementations

XML Catalog
typically used as local file by a single
user/process at a time
no need for network or centralised server
supports R/O operations via http
tested up to 50K entries
Native MySQL Catalog
handles multiple users and jobs (multi-threaded)
tested up to 1M entries
EDG-RLS Catalog
Grid aware applications
Oracle iAS or Tomcat + Oracle / MySQL backend
pre-production service based on Oracle (from
IT/DB), RLSTEST, already in use for POOL V1.0

13
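Use case: isolated system

[Diagram: jobs on a machine with no network look
up input files and register output files in a
local XML catalog; Import and Publish steps
connect that catalog to the EDG catalog]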

The user extracts a set of interesting files and
a catalog fragment describing them from a
(central) Grid based catalog into a local XML
catalog
Selection is performed based on file or
collection meta data
After disconnecting from the Grid the user
executes some standard jobs navigating through
the extracted data
New output files are registered into the local
XML catalog
Once the new data is ready for publishing and the
user is connected the new catalog fragment is
submitted to the Grid based catalog
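
The extract, work offline, publish cycle above can be sketched as follows. This is an illustration only: catalogs are modelled as dictionaries keyed by FileID, the functions `extract`, `register_output` and `publish` are hypothetical names, and the sketch covers the catalog entries but not the copying of the data files themselves.

```python
import uuid


def extract(central: dict, predicate) -> dict:
    """Copy the entries whose meta data match predicate into a local fragment."""
    return {fid: dict(entry) for fid, entry in central.items()
            if predicate(entry["meta"])}


def register_output(local: dict, pfn: str, meta: dict) -> str:
    """Register a newly written file in the local catalog only."""
    fid = str(uuid.uuid4())
    local[fid] = {"pfn": pfn, "meta": meta}
    return fid


def publish(central: dict, local: dict) -> list:
    """Submit entries unknown to the central catalog; return their FileIDs."""
    new_ids = [fid for fid in local if fid not in central]
    for fid in new_ids:
        central[fid] = dict(local[fid])
    return new_ids


central = {
    "id-1": {"pfn": "rfio:/grid/run1.root", "meta": {"run": 1, "type": "raw"}},
    "id-2": {"pfn": "rfio:/grid/run2.root", "meta": {"run": 2, "type": "raw"}},
}

# 1. Connected: select by meta data and extract a fragment.
local = extract(central, lambda meta: meta["run"] == 2)

# 2. Disconnected: jobs register their output in the local catalog.
out_id = register_output(local, "file:/laptop/run2_reco.root",
                         {"run": 2, "type": "reco"})

# 3. Connected again: only the new fragment is submitted.
published = publish(central, local)
```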

14
Use case: farm production

[Diagram: jobs on lx1.cern.ch and pc3.cern.ch look
up and register files in XML catalogs; after a
quality check the catalogs are published to a
MySQL catalog and then to the EDG catalog]

A production job runs and creates files and their
catalog entries in a local XML file
During the production the catalog can be used to
clean up files
Once the data quality checks have been passed the
production manager decides to publish the
production XML catalog fragment to the site
database catalog and eventually to the Grid based
catalog

15
POOL Storage Hierarchy

An application may access databases (e.g. ROOT
files) from a set of catalogs
Each database has containers of one specific
technology (e.g. ROOT trees)
Smart Pointers are used
to transparently load objects into a client side
cache
define object associations across file or
technology boundaries

16
Storage Hierarchy
[Diagram: Data Cache, StorageMgr and DiskStorage
layers, with Object Pointer and Persistent
Address as the two forms of object reference]
17
Database Technologies

Identify commonalities and differences between
technologies
Necessary knowledge when reading/writing

Model adapts to any technology with direct record
access
Need to know record identifier in advance
RDBMS: more or less traditional
Primary key must be uniquely determined before
writing
Probably two round-trips

18
Dictionary Population/Conversion
DictionaryGeneration
19
Client Data Access
20
POOL Milestones

First Public Release - V0.3 December 02
Navigation between files supported, catalog
components integrated
LCG Dictionary moved to SEAL and picked up from
there
Basic dictionary integration for elementary types
First Functionally Complete Release - V1.0 June
03
LCG dictionary integration for most requested
language features including STL containers
Consistent meta data support for file catalog and
event collections (aka tag collections)
Integration with EDG-RLS pre-production service
(rlstest.cern.ch)
First Production Release - V1.1 July 03
Added bare C++ pointer support, transient data
members, update of streaming layer data,
simplified (user) transaction model
Due to the large number of requests from
integration activities, still more of a
functionality release than the planned
consolidation release.
EDG-RLS production service (one catalog server
per experiment)
Project stayed close to release date estimates
Maximum variance 2 weeks
Usually released within a few days around the
predicted target date

21
POOL Experiment Integration

Experiment Integration started after the POOL
V1.1 release
CMS
LCG AA s/w environment was easy to pick up:
scram, SEAL plugin loading
(optional) Object cache provided by POOL is used
directly
Many functional requests have been formulated and
implemented during their first POOL integration
ATLAS
More complicated framework integration, e.g. due
to interference between the SEAL and GAUDI plugin
loading facilities
Investigating several integration options: with
or w/o POOL object cache, role of LCG object
whiteboard
More development work required on the experiment
side and more people involved
LHCb
Share most integration issues with ATLAS
Expect no significant problems as key POOL
developer and framework integrator is the same
person

22
Summary

The LCG POOL project provides a hybrid store
integrating object streaming (e.g. Root I/O) with
RDBMS technology (e.g. MySQL) for consistent meta
data handling
Strong emphasis on component decoupling and well
defined communication/dependencies
Transparent cross-file and cross-technology
object navigation via C++ smart pointers
Integration with Grid wide Data Catalogs (e.g.
EDG-RLS)
but preserving networked and grid-decoupled
working modes
Recently also the various ConditionsDB
implementations have been moved into the scope of
the LCG Persistency Project
Work on software and release integration with
POOL and other LCG Application area services will
start soon
POOL has been integrated into the LHC experiments'
software frameworks and is in use for the
pre-production activities in CMS
Selected as persistency mechanism for CMS PCP,
ATLAS and LHCb data challenges

23
How to find out more about POOL?

POOL Home Page
http://pool.cern.ch/
POOL Workbook
http://lcgapp.cern.ch/project/workbook/pool/current/pool.html
POOL savannah portal (bug reports, cvs)
http://savannah.cern.ch/projects/pool
POOL binary distribution (provided by LCG-SPI)
http://lcgapp.cern.ch/project/spi/lcgsoft

24
Generic Persistent Model
Transient
25
Cache Access Through References

References know about the Data Cache
2 operation modes
- Clear at checkpoint
- Auto-clear with reference count

References are implemented as smart pointers
Use cache manager for load-on-demand
Use the object key of the cache manager

26
POOL Cache Access
Key Object Token

2 <pointer> <pointer>

Token
Storage Technology
Object Type
Persistent Location
27
Follow Object Associations
Entry ID
Link ID
28
The Link Table

Contains all information to resurrect an object
Storage type
Database name
Container name
Object type (class name)
Cache hints
E.g. other possible transient conversions
Size O(Associations in class model)
Local to every database
Size is limited

29
File Catalog functionality

Connection and transaction control functions
Catalog insertion and update functions on logical
and physical filenames
Catalog lookup functions (by filename, FileID or
query)
Clean-up after an unsuccessful job
Catalog entries iterator
File Meta data operations (e.g. define or insert
file meta data)
Cross catalog operations (e.g. extract an XML
fragment and append it to the MySQL catalog)
Python based graphical user interface for
catalog browsing

30
File Catalog Scaling Tests

update and lookup performances look fine
scalability shown up to a few hundred boxes
need to understand availability and backup in the
production context

31
File Catalog Browser Prototype
32
Pros & cons of POOL vs Vanilla Root

More services
File catalog (disconnected/Grid aware)
Object cache manager
Functionalities
- Object storage operations more explicit:
Write/Read/Update/Delete
Object navigation transparent: linked objects
are resolved automatically inside the store
Simple transaction handling
Root functionalities preserved
Trees and Keys format available (simple to
switch)

33
Pros & cons of POOL vs Vanilla Root