1
The POOL Persistency Framework
  • Dirk Duellmann, IT-DB LCG-POOL
  • Opera Computing Meeting
  • 23rd September 2003

2
What is POOL?
  • Project Goal: Develop a common Persistency
    Framework for physics applications at the LHC
  • Pool Of persistent Objects for LHC
  • Common effort between the LHC experiments and the
    CERN IT-DB group
  • for defining its scope and architecture
  • for the development of its components
  • Being integrated by CMS, ATLAS and soon LHCb
  • First test production now: CMS processed 700k
    events

3
POOL Timeline and Statistics
  • POOL project started April 2002
  • Ramping up from 1.6 to 10 FTE
  • Persistency Workshop in June 2002
  • First internal release POOL V0.1 in October 2002
  • In one year of active development since then
  • 10 public releases
  • POOL V1.3.1 last Friday
  • Some 50 internal releases
  • Often picked up by experiments to confirm
    fixes/new functionality
  • Very useful to ensure releases meet experiment
    expectations beforehand
  • Handled some 150 bug reports
  • Savannah web portal proven helpful
  • POOL followed from the beginning a rather
    aggressive schedule to meet the first production
    needs of the experiments.

4
POOL Persistency
  • To allow the multi-PB of experiment data and
    associated meta data to be stored in a
    distributed and Grid enabled fashion
  • various types of data of different volumes (event
    data, physics and detector simulation, detector
    data and bookkeeping data)
  • Hybrid technology approach, combining
  • C++ object streaming technology, such as Root
    I/O, for the bulk data
  • transactionally safe Relational Database (RDBMS)
    services, such as MySQL, for catalogs,
    collections and meta data
  • In particular, it provides
  • Persistency for C++ transient objects
  • Transparent navigation from one object to another
    across file and technology boundaries
  • Integrated with an external File Catalog to keep
    track of the file physical location, allowing
    files to be moved or replicated

5
POOL Architecture
  • POOL is a component based system
  • follows the LCG Architecture Blueprint
  • Provides a technology neutral API
  • Abstract component C++ interfaces
  • Insulates the experiment framework user code from
    several concrete implementation details and
    technologies used today
  • POOL user code is not dependent on implementation
    libraries
  • No link time dependency on implementation
    packages (e.g. MySQL, Root, Xerces-C..)
  • Backend component implementations are loaded at
    runtime via the SEAL plug-in infrastructure
  • Three major domains, weakly coupled, interacting
    via abstract interfaces
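To illustrate the component/plug-in idea, here is a minimal C++ sketch assuming a simplified, hypothetical IFileCatalog interface and a loadCatalog factory standing in for the SEAL plug-in manager (the real POOL and SEAL interfaces differ): user code compiles only against the abstract interface, and the concrete backend is chosen by name at run time.

```cpp
// Minimal sketch of the component/plug-in idea (illustrative only).
#include <iostream>
#include <memory>
#include <string>

// Technology-neutral interface the experiment framework codes against.
class IFileCatalog {
public:
  virtual ~IFileCatalog() = default;
  virtual void registerFile(const std::string& pfn) = 0;
};

// One possible backend; others (MySQL, EDG-RLS) would implement the same API.
class XMLFileCatalog : public IFileCatalog {
public:
  void registerFile(const std::string& pfn) override {
    std::cout << "XML catalog: registered " << pfn << "\n";
  }
};

// Stand-in for the plug-in manager: map a component name to an implementation.
std::unique_ptr<IFileCatalog> loadCatalog(const std::string& backend) {
  if (backend == "xml") return std::make_unique<XMLFileCatalog>();
  return nullptr;  // unknown backend: nothing linked in at compile time
}

int main() {
  auto catalog = loadCatalog("xml");   // chosen at run time, not link time
  if (catalog) catalog->registerFile("rfio:/castor/cern.ch/user/file1.root");
}
```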

6
POOL Work Package breakdown
  • Storage Manager
  • Streams transient C++ objects into/from disk
    storage
  • Resolves a logical object reference into a
    physical object
  • Uses Root I/O. A proof of concept with an RDBMS
    storage manager prototype is underway
  • File Catalog
  • Maintains consistent lists of accessible files
    (physical and logical names) together with their
    unique identifiers (FileID), which appear in the
    object representation in the persistent space
  • Resolves a logical file reference (FileID) into a
    physical file
  • Collections
  • Provides the tools to manage (potentially large)
    ensembles of objects stored via POOL persistence
    services
  • Explicit: server-side selection of objects from
    queryable collections
  • Implicit: defined by physical containment of the
    objects

7
POOL Component Breakdown
8
POOL on the Grid
[Diagram: a User Application uses the LCG POOL components (Collections, File Catalog, Meta Data, Root I/O), which map onto Grid middleware services (Grid Dataset Registry, Replica Location Service, Replica Manager, Meta Data Catalog) and Grid resources.]
9
POOL off the Grid
[Diagram: the same User Application on a disconnected laptop; Collections use MySQL or Root I/O collections, the File Catalog uses an XML / MySQL catalog, meta data is kept in MySQL, with no Grid middleware involved.]
10
POOL Client <-> Server
  • POOL will mainly be used by experiment frameworks,
    as a client library loaded by user applications
  • POOL applications are Grid aware via the File
    Catalog component based on the EDG Replica
    Location Service (RLS)
  • File resolution and meta data queries are
    forwarded to the Grid middleware
  • The POOL storage manager handles remote file
    access via Root I/O (such as RFIO/dCache),
    possibly later replaced by the Grid File Access
    Library (GFAL), once it becomes available

11
POOL File Catalog
  • Files are referred to inside POOL via a unique
    and immutable file identifier (FileID), generated
    at creation time
  • POOL added the system generated FileID to the
    standard Grid m-n mapping
  • Stable inter-file reference
  • Global Unique Identifier (GUID) implementation
    for FileID
  • allows the production of consistent sets of
    files with internal references without requiring
    a central ID allocation service
  • catalog fragments created independently can later
    be merged without modification to the
    corresponding data files
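A minimal sketch of why a GUID-style FileID needs no central allocation service: each producer draws 128 random bits, so independently written catalog fragments can later be merged without renumbering. The newFileID helper below is illustrative and does not reproduce POOL's actual GUID generator.

```cpp
// Illustrative GUID-style FileID generation (not POOL's real generator).
#include <cstdint>
#include <cstdio>
#include <random>
#include <string>

std::string newFileID() {
  std::random_device rd;                         // independent entropy per producer
  std::mt19937_64 gen(rd());
  std::uniform_int_distribution<std::uint64_t> dist;
  std::uint64_t hi = dist(gen), lo = dist(gen);  // 128 bits total
  char buf[37];
  std::snprintf(buf, sizeof(buf), "%08X-%04X-%04X-%04X-%012llX",
                static_cast<unsigned>(hi >> 32),
                static_cast<unsigned>((hi >> 16) & 0xFFFF),
                static_cast<unsigned>(hi & 0xFFFF),
                static_cast<unsigned>(lo >> 48),
                static_cast<unsigned long long>(lo & 0xFFFFFFFFFFFFULL));
  return buf;
}

int main() {
  // A persistent reference stores this FileID, not a physical file name,
  // so files can be moved or replicated without touching the data.
  std::printf("FileID: %s\n", newFileID().c_str());
}
```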

12
Concrete implementations
  • XML Catalog
  • typically used as local file by a single
    user/process at a time
  • no need for network or centralised server
  • supports R/O operations via http
  • tested up to 50K entries
  • Native MySQL Catalog
  • handles multiple users and jobs (multi-threaded)
  • tested up to 1M entries
  • EDG-RLS Catalog
  • Grid aware applications
  • Oracle iAS or Tomcat server with an Oracle / MySQL
    backend
  • pre-production service based on Oracle (from
    IT/DB), RLSTEST, already in use for POOL V1.0

13
Use case: isolated system
[Diagram: jobs on a machine with no network look up input files and register output files in a local XML catalog; catalog fragments are imported from and published back to the EDG catalog.]
  • The user extracts a set of interesting files and
    a catalog fragment describing them from a
    (central) Grid based catalog into a local XML
    catalog
  • Selection is performed based on file or
    collection meta data
  • After disconnecting from the Grid the user
    executes some standard jobs navigating through
    the extracted data
  • New output files are registered into the local
    XML catalog
  • Once the new data is ready for publishing and the
    user is connected, the new catalog fragment is
    submitted to the Grid based catalog (see the
    sketch below)
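A compact sketch of the disconnected workflow above, using a hypothetical in-memory catalog (a plain map from FileID to physical name) rather than the real XML and Grid catalog APIs: extract a fragment, run jobs offline, register outputs, then publish only the new entries back.

```cpp
// Illustrative disconnected-catalog workflow (hypothetical, simplified).
#include <iostream>
#include <map>
#include <string>

using Catalog = std::map<std::string, std::string>;  // FileID -> physical file name

int main() {
  Catalog grid = {{"guid-1", "rfio:/castor/run1.root"},
                  {"guid-2", "rfio:/castor/run2.root"}};

  // 1. Extract the files of interest into a local (XML-like) catalog fragment.
  Catalog local = {{"guid-1", grid["guid-1"]}};

  // 2. Offline: jobs navigate the extracted data and register new output files.
  local["guid-3"] = "file:/home/user/selected.root";

  // 3. Back online: publish only the new entries to the Grid based catalog.
  for (const auto& entry : grid) local.erase(entry.first);
  grid.insert(local.begin(), local.end());

  for (const auto& entry : grid)
    std::cout << entry.first << " -> " << entry.second << "\n";
}
```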

14
Use case: farm production
[Diagram: production jobs on farm nodes (lx1.cern.ch, pc3.cern.ch) look up and register files in local XML catalogs; after quality checks the fragments are published to a site MySQL catalog and then to the EDG catalog.]
  • A production job runs and creates files and their
    catalog entries in a local XML file
  • During the production the catalog can be used to
    clean up files
  • Once the data quality checks have been passed, the
    production manager decides to publish the
    production XML catalog fragment to the site
    database and eventually to the Grid based
    catalog

15
POOL Storage Hierarchy
  • An application may access databases (e.g. ROOT
    files) from a set of catalogs
  • Each database has containers of one specific
    technology (e.g. ROOT trees)
  • Smart Pointers are used
  • to transparently load objects into a client side
    cache
  • to define object associations across file or
    technology boundaries (see the sketch below)
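A minimal sketch of the smart-pointer idea, with hypothetical Ref and Token types (the real POOL classes differ): the reference carries only a persistent address and fetches the object from the storage manager on first dereference, so associations can cross file or technology boundaries.

```cpp
// Illustrative smart-pointer / persistent-address sketch (hypothetical names).
#include <iostream>
#include <memory>
#include <string>
#include <utility>

struct Token {                     // persistent address of one object
  std::string fileID;              // which database (file)
  std::string container;           // which container inside it
  long entry = -1;                 // which entry
};

template <typename T>
class Ref {
public:
  explicit Ref(Token t) : token_(std::move(t)) {}
  T& operator*() {
    if (!object_) {                // load-on-demand into the client side cache
      std::cout << "loading " << token_.fileID << "/" << token_.container
                << " [" << token_.entry << "]\n";
      object_ = std::make_shared<T>();   // stand-in for the storage manager read
    }
    return *object_;
  }
private:
  Token token_;
  std::shared_ptr<T> object_;
};

struct Track { double pt = 0; };

int main() {
  Ref<Track> track(Token{"guid-42", "Tracks", 7});
  (*track).pt = 3.5;               // first access triggers the load
  std::cout << (*track).pt << "\n";
}
```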

16
Storage Hierarchy
[Diagram: storage hierarchy; references in the Data Cache hold either an Object Pointer or a Persistent Address, which the Storage Manager resolves against Disk Storage.]
17
Database Technologies
  • Identify commonalities and differences between
    technologies: necessary knowledge when
    reading/writing
  • Model adapts to any technology with direct record
    access
  • Need to know record identifier in advance
  • RDBMS: more or less traditional
  • Primary key must be uniquely determined before
    writing
  • Probably two round-trips
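A sketch of the "two round-trips" point above for an RDBMS backend, with hypothetical SQL and a stand-in execute call: the primary key that ends up inside object references must be fixed before the row is written, so it is first obtained from the server and only then used in the INSERT.

```cpp
// Illustrative two-round-trip write for an RDBMS backend (hypothetical SQL/schema).
#include <iostream>
#include <string>

// Stand-in for sending one statement to the database server.
long execute(const std::string& sql) {
  std::cout << "SQL> " << sql << "\n";
  return 101;  // pretend the server returned the next free key
}

int main() {
  // Round-trip 1: reserve a unique primary key (e.g. from a key table).
  long key = execute("UPDATE keys SET last = last + 1; SELECT last FROM keys;");

  // Round-trip 2: write the object row under that pre-determined key,
  // so references written elsewhere can already point at it.
  execute("INSERT INTO Tracks (id, pt) VALUES (" + std::to_string(key) + ", 3.5);");
}
```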

18
Dictionary Population/Conversion
[Diagram: dictionary generation.]
19
Client Data Access
20
POOL Milestones
  • First Public Release - V0.3 December 02
  • Navigation between files supported, catalog
    components integrated
  • LCG Dictionary moved to SEAL and picked up from
    there
  • Basic dictionary integration for elementary types
  • First Functionally Complete Release - V1.0 June
    03
  • LCG dictionary integration for most requested
    language features including STL containers
  • Consistent meta data support for file catalog and
    event collections (aka tag collections)
  • Integration with EDG-RLS pre-production service
    (rlstest.cern.ch)
  • First Production Release - V1.1 July 03
  • Added bare C++ pointer support, transient data
    members, update of streaming layer data,
    simplified (user) transaction model
  • Due to the large number of requests from
    integration activities, this was still rather a
    functionality release than the planned
    consolidation release
  • EDG-RLS production service (one catalog server
    per experiment)
  • Project stayed close to release date estimates
  • Maximum variance 2 weeks
  • Usually released within a few days of the
    predicted target date

21
POOL Experiment Integration
  • Experiment Integration started after the POOL
    V1.1 release
  • CMS
  • LCG AA s/w environment was easy to pick up: scram,
    SEAL plugin loading
  • (optional) Object cache provided by POOL is used
    directly
  • Many functional requests have been formulated and
    implemented during their first POOL integration
  • ATLAS
  • More complicated framework integration, e.g. due to
    interference between the SEAL and GAUDI plugin
    loading facilities
  • Investigating several integration options with
    or w/o the POOL object cache; role of the LCG
    object white board
  • More development work required on the experiment
    side and more people involved
  • LHCb
  • Shares most integration issues with ATLAS
  • Expects no significant problems as the key POOL
    developer and framework integrator is the same
    person

22
Summary
  • The LCG POOL project provides a hybrid store
    integrating object streaming (e.g. Root I/O) with
    RDBMS technology (e.g. MySQL) for consistent meta
    data handling
  • Strong emphasis on component decoupling and well
    defined communication/dependencies
  • Transparent cross-file and cross-technology
    object navigation via C++ smart pointers
  • Integration with Grid wide Data Catalogs (e.g.
    EDG-RLS)
  • but preserving networked and grid-decoupled
    working modes
  • Recently also the various ConditionsDB
    implementations have been moved into the scope of
    the LCG Persistency Project
  • Work on software and release integration with
    POOL and other LCG Application area services will
    start soon
  • POOL has been integrated into the LHC experiments'
    software frameworks and is in use for the
    pre-production activities in CMS
  • Selected as persistency mechanism for CMS PCP,
    ATLAS and LHCb data challenges

23
How to find out more about POOL?
  • POOL Home Page
  • http://pool.cern.ch/
  • POOL Workbook
  • http://lcgapp.cern.ch/project/workbook/pool/current/pool.html
  • POOL savannah portal (bug reports, cvs)
  • http://savannah.cern.ch/projects/pool
  • POOL binary distribution (provided by LCG-SPI)
  • http://lcgapp.cern.ch/project/spi/lcgsoft

24
Generic Persistent Model
[Diagram: transient vs. persistent object model.]
25
Cache Access Through References
  • References know about the Data Cache
  • Two operation modes: clear at checkpoint, or
    auto-clear with reference count
  • References are implemented as smart pointers
  • Use cache manager for load-on-demand
  • Use the object key of the cache manager

26
POOL Cache Access
[Diagram: the data cache maps a Key to an Object pointer and a Token; the Token encodes the Storage Technology, the Object Type and the Persistent Location.]
27
Follow Object Associations
[Diagram: following an object association via its Link ID and Entry ID.]
28
The Link Table
  • Contains all information to resurrect an object
  • Storage type
  • Database name
  • Container name
  • Object type (class name)
  • Cache hints
  • E.g. other possible transient conversions
  • Size: O(associations in the class model)
  • Local to every database
  • Size is limited
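A sketch of the link table idea with a hypothetical layout (the real POOL record format is not reproduced): each stored reference carries only a small link ID plus an entry number, and the per-database link table expands the link ID into everything needed to resurrect the object.

```cpp
// Illustrative link-table layout (hypothetical fields).
#include <iostream>
#include <string>
#include <vector>

struct LinkEntry {
  std::string storageType;   // e.g. "ROOT tree"
  std::string databaseName;  // file (FileID / name)
  std::string containerName; // container inside the file
  std::string className;     // object type to instantiate
};

int main() {
  // One table per database file; its size grows with the number of distinct
  // associations in the class model, not with the number of objects.
  std::vector<LinkEntry> linkTable = {
      {"ROOT tree", "guid-42", "Tracks", "Track"},
      {"ROOT tree", "guid-43", "Vertices", "Vertex"}};

  int linkID = 1, entryID = 7;      // what a stored reference carries
  const LinkEntry& l = linkTable[linkID];
  std::cout << "resurrect " << l.className << " #" << entryID << " from "
            << l.databaseName << "/" << l.containerName << "\n";
}
```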

29
File Catalog functionality
  • Connection and transaction control functions
  • Catalog insertion and update functions on logical
    and physical filenames
  • Catalog lookup functions (by filename, FileID or
    query)
  • Clean-up after an unsuccessful job
  • Catalog entries iterator
  • File Meta data operations (e.g. define or insert
    file meta data)
  • Cross catalog operations (e.g. extract an XML
    fragment and append it to the MySQL catalog)
  • Python based graphical user interface for catalog
    browsing

30
File Catalog Scaling Tests
  • update and lookup performance looks fine
  • scalability shown up to a few hundred boxes
  • need to understand availability and backup in the
    production context

31
File Catalog Browser Prototype
32
Pros and cons of POOL vs Vanilla Root
  • More services
  • File catalog (disconnected/Grid aware)
  • Object cache manager
  • Functionalities
  • - Object storage operations more explicit:
    Write/Read/Update/Delete
  • Object navigation transparent: linked objects
    are resolved automatically inside the store
  • Simple transaction handling
  • Root functionalities preserved
  • Trees and Keys format available (simple to
    switch)

33
Pros and cons of POOL vs Vanilla Root
  • Integration in the software
  • Dictionary Generation easier with LCG Dict
  • No changes are needed on class headers
  • no instrumentation, no inheritance from
    external classes
  • Architecture
  • Technology independent interfaces
  • Modular component design
  • Components can be used separately
  • Layered structure

34
Pros and cons of POOL vs Vanilla Root
  • And also
  • Easy to replace with Vanilla Root
  • Allows for alternatives to Root I/O!
  • Concerns (POOL disadvantages)
  • Performance
  • Multiple software layers on top of Root
  • - Reliability
  • POOL is very young