EventStore: Managing Event Versioning and Data Partitioning using Legacy Data Formats

Provided by: lnsCo

Transcript and Presenter's Notes


1
EventStore: Managing Event Versioning and Data
Partitioning using Legacy Data Formats
  • Chris Jones
  • Valentin Kuznetsov
  • Dan Riley
  • Greg Sharp
  • CLEO Collaboration
  • Cornell University

2
Goals
  • Fast and scalable
  • e.g. run through events with data in memory at
    >10,000 events/s
  • Event data stays in original file format
  • e.g. CLEO's object format, ROOT, etc.
  • Can manage data and MC
  • Data will be versioned
  • can always get back the version of data you used
    before
  • Can choose runs based on run conditions
  • e.g. run energy, status of RICH subdetector,
    etc.
  • Handles overlapping skims in same job
  • Easy to add/supersede data to an event
  • e.g. can add post-reconstruction info such as π0s
    and reconstructed Ds
  • No dependence on proprietary software
  • e.g. drop our use of Objectivity

3
EventStore Sizes
  • Personal
  • For individual physicists (e.g. laptops)
  • Holds personal skims
  • No separate processes (e.g. databases) needed to
    run
  • Group
  • For large offsite collaborators or on-site groups
  • Holds a large subset of our data
  • All data on disk
  • Requires running a 3rd party database
  • Collaboration
  • Cornell Site
  • Holds all of our data with replication for
    improved performance
  • Interacts with HSM
  • Requires running a 3rd party database
  • All three sizes share everything except the choice
    of file meta-data DB

4
Data Organization
  • Data is organized into grades
  • raw
  • data directly from detector
  • physics
  • Reconstruction output approved for analysis
  • Skims are defined within a grade
  • physics: all, qcd, tau, ...
  • skims are just indices, independent of data
    clustering
  • different skims can reference the same event
  • extremely easy to add additional skims

5
Adding Data
  • Easy to add new objects to events in an existing
    grade
  • e.g. could add π0s to physics grade after the
    post-reconstruction calibration has been done
  • Can avoid common run time calculations
  • e.g. shower energy, π0 finding, ...
  • save CPU time
  • guarantee consistency when reprocessing

6
File Meta Data
  • Meta data about all files are stored in a
    relational DB
  • System independent of choice of DB
  • Presently using
  • SQLite for personal
  • MySQL for group and collaboration
  • Meta data stored
  • Logical File ID (64-bit number) to file path
  • What files belong with which grade, skim and run
  • Versioning information
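
As a rough illustration of the meta data listed above, the following sketch builds a small catalogue with Python's standard sqlite3 module. The table names, column names and the file path are assumptions made for illustration only; the slides do not show the actual EventStore schema.

```python
import sqlite3

def build_catalogue(conn):
    """Create hypothetical tables; the real EventStore schema is not given in the slides."""
    conn.executescript("""
        CREATE TABLE file_location (
            file_id INTEGER PRIMARY KEY,   -- logical file ID (SQLite integers hold 64 bits)
            path    TEXT NOT NULL
        );
        CREATE TABLE file_usage (
            grade   TEXT NOT NULL,
            skim    TEXT NOT NULL,
            run     INTEGER NOT NULL,
            version TEXT NOT NULL,         -- versioning information, e.g. a date stamp
            file_id INTEGER NOT NULL REFERENCES file_location(file_id)
        );
    """)

def files_for(conn, grade, skim, run):
    """Return the paths of all files registered for one grade, skim and run."""
    rows = conn.execute(
        """SELECT l.path FROM file_usage u
           JOIN file_location l ON l.file_id = u.file_id
           WHERE u.grade = ? AND u.skim = ? AND u.run = ?
           ORDER BY l.path""",
        (grade, skim, run),
    )
    return [path for (path,) in rows]

# Hypothetical example entry: one index file used by the qcd skim of run 202000.
conn = sqlite3.connect(":memory:")
build_catalogue(conn)
conn.execute("INSERT INTO file_location VALUES (?, ?)",
             (2**40 + 1, "/data/run202000.idx"))
conn.execute("INSERT INTO file_usage VALUES (?, ?, ?, ?, ?)",
             ("physics", "qcd", 202000, "20040101", 2**40 + 1))
paths = files_for(conn, "physics", "qcd", 202000)
```

Because only standard SQL is used, the same statements should carry over to a server database with minor changes, which is in the spirit of keeping the system independent of the choice of DB.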

7
Indexing Data
  • Three types of files are used when reading data
  • Index
  • translates (run, event, MC ID) to an index into the
    location records
  • index file has fixed record length
  • fast random access
  • Location
  • records where each data unit can be found in a set
    of data files
  • gives us random access
  • versioning implemented using different location
    files
  • specialized for each data storage format
  • location file has fixed record length
  • Data store
  • any way to store data should work
  • implemented for two file formats with variable
    record sizes
  • Why not use a relational database for event
    indexing?
  • Indexing is read only
  • Info is accessed every event
  • must be very fast
  • Index is traversed on the client
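
The benefit of a fixed record length can be sketched as follows: record number n always starts at byte n times the record size, so a lookup can seek straight to it. The record layout (four unsigned 32-bit integers) and the use of binary search over sorted keys are assumptions for illustration; the slides do not describe the actual file layout or traversal strategy.

```python
import io
import struct

# Hypothetical record layout: run, event, MC ID, location record index,
# each stored as an unsigned 32-bit little-endian integer.
RECORD = struct.Struct("<IIII")

def write_index(stream, entries):
    """Write (run, event, mc_id, location_index) tuples in sorted key order."""
    for entry in sorted(entries):
        stream.write(RECORD.pack(*entry))

def find_location(stream, n_records, key):
    """Binary-search the fixed-length records for key = (run, event, mc_id)."""
    lo, hi = 0, n_records
    while lo < hi:
        mid = (lo + hi) // 2
        stream.seek(mid * RECORD.size)   # fixed length makes this a direct seek
        run, event, mc_id, location = RECORD.unpack(stream.read(RECORD.size))
        if (run, event, mc_id) == key:
            return location
        if (run, event, mc_id) < key:
            lo = mid + 1
        else:
            hi = mid
    return None                          # key is not in the index

# An in-memory stand-in for an index file with three events.
index_file = io.BytesIO()
entries = [(1, 1, 0, 0), (1, 3, 0, 1), (1, 7, 0, 2)]
write_index(index_file, entries)
```

Calling `find_location(index_file, len(entries), (1, 3, 0))` returns the location record index stored for run 1, event 3; a location file with fixed-length records could then be read with the same seek arithmetic.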

8
Reading Data
[Diagram: index file records point to location file records, which hold offsets into the data files (Raw, QCD)]

9
Performance: Sequential Access
  • Compare reading the same data files sequentially
    as a chain of files versus using the EventStore
  • EventStore is a constant 15% slower

10
Performance: Event List Access
  • Compare using an event list to access the same
    data files as a chain of files versus using the
    EventStore
  • The EventStore scales better the more events
    that are skipped

11
Versioning
  • When starting a new analysis, usually want the
    most recent reconstruction
  • When adding new data to an existing analysis,
    want to go back over the same data
  • Specify version by giving a date
  • notation yyyymmdd
  • e.g. eventstore in 20040501
  • Do not have to specify the exact date of a version
    change; EventStore will find the closest version
    just before the given date
  • In analysis, physicists use the date they first
    processed the data
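
The date lookup described above can be sketched in a few lines. This is an illustration only, not the EventStore implementation: the function name is hypothetical, and it assumes that a version stamped with exactly the requested date also qualifies. Stamps in yyyymmdd notation sort chronologically when compared as strings, so a sorted list and a bisection are enough.

```python
from bisect import bisect_right

def resolve_version(available, requested):
    """Return the latest version stamp that is not after the requested yyyymmdd date."""
    stamps = sorted(available)
    pos = bisect_right(stamps, requested)   # number of stamps <= requested
    if pos == 0:
        raise LookupError(f"no version on or before {requested}")
    return stamps[pos - 1]

# The two physics date stamps used as examples in these slides.
versions = ["20040101", "20040402"]
```

With these stamps, a request for 20040501 resolves to 20040402, while a request for 20040315 resolves to 20040101, so an analysis that keeps using its original date keeps seeing the same data.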

12
Versioning Evolution
  • If data is reprocessed, a new date stamp must be
    used to distinguish the data
  • e.g. if CLEO reprocesses data31, a new date must be
    created for the physics grade
  • physics 20040101: has first processing of data31
  • physics 20040402: has more recent processing of
    data31
  • When new datasets are added, CLEO officers can
    append them to any date stamp
  • e.g. newly reconstructed data35 can be added to
    physics 20040101 and 20040402
  • When new data types are added, they can be placed
    in corresponding date stamp
  • e.g. π0s derived from 20040402 may be stored
    there
  • If data type is replaced, need new date stamp

13
Versioning Evolution
[Figure: version versus run number]
14
Versioning Information
  • Each chunk of data added will have its own
    specific versioning information
  • e.g. Recon-20040312-Feb13_04_P2 for data32
  • i.e. the data was reconstructed with software
    release Feb13_04_P2, and the last change to any item
    that affects reconstruction was no later than 2004/03/12
  • The version date-stamp is a logical version
    made up of individual specific versions, each of
    which describes a non-overlapping run range
  • CLEO officers decide what specific versions
    should be used together to form a logical version
  • Want to make a tool that, given a date stamp and a
    run range, tells you how to generate your
    MC
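
The relation between a logical version and its specific versions can be sketched as a list of run ranges that must not overlap. The class and function names, the run numbers and the second version name below are hypothetical; only the name Recon-20040312-Feb13_04_P2 comes from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecificVersion:
    name: str        # e.g. "Recon-20040312-Feb13_04_P2"
    first_run: int   # first run covered (inclusive)
    last_run: int    # last run covered (inclusive)

def check_non_overlapping(parts):
    """Return the parts sorted by run; raise ValueError if two run ranges overlap."""
    ordered = sorted(parts, key=lambda p: p.first_run)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.first_run <= prev.last_run:
            raise ValueError(f"{prev.name} and {cur.name} overlap")
    return ordered

def specific_version_for(parts, run):
    """Return the name of the specific version covering a run, or None."""
    for part in parts:
        if part.first_run <= run <= part.last_run:
            return part.name
    return None

# A logical version built from two specific versions (run ranges are made up).
logical_version = check_non_overlapping([
    SpecificVersion("Recon-20040312-Feb13_04_P2", 203000, 203999),
    SpecificVersion("Recon-hypothetical-B", 204000, 204999),
])
```

A lookup such as `specific_version_for(logical_version, 203500)` then names the specific version whose software release would have to be matched, for instance when generating MC for that run.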

15
Run Selection
  • Support multiple ways of specifying runs
  • run range
  • runs 202000 203000
  • datasets
  • datasets 31-34
  • energy
  • energy 1.89
  • runs whose energies cluster around this beam
    energy
  • energy psi(3770)
  • energy psi(3770)-off
  • detector state
  • detector mu unused
  • mu is not used in this analysis, so runs where mu
    was bad can be used
  • detector rich used
  • Data obtained by querying a Web Service
  • Run meta-data is centralized
  • Can also be accessed via a web browser
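
How the selection criteria above combine can be sketched as a filter over run records. Everything here is assumed for illustration: the record fields, the energy tolerance and the sample runs are made up, and the records are held in memory rather than fetched from the web service, whose interface the slides do not describe.

```python
def select_runs(records, run_range=None, energy=None, tolerance=0.005, used=()):
    """Return run numbers passing the run-range, energy and detector-state cuts."""
    selected = []
    for rec in records:
        if run_range and not (run_range[0] <= rec["run"] <= run_range[1]):
            continue                      # outside the requested run range
        if energy is not None and abs(rec["energy"] - energy) > tolerance:
            continue                      # beam energy too far from the request
        if any(not rec["good"].get(det, False) for det in used):
            continue                      # a detector the analysis uses was bad
        selected.append(rec["run"])
    return selected

# Hypothetical run meta-data: beam energy plus per-subdetector status.
records = [
    {"run": 202000, "energy": 1.89,   "good": {"rich": True,  "mu": False}},
    {"run": 202500, "energy": 1.8905, "good": {"rich": False, "mu": True}},
    {"run": 202700, "energy": 2.10,   "good": {"rich": True,  "mu": True}},
    {"run": 203500, "energy": 1.89,   "good": {"rich": True,  "mu": True}},
]
rich_runs = select_runs(records, run_range=(202000, 203000),
                        energy=1.89, used=("rich",))
```

Leaving a subdetector out of `used` corresponds to declaring it unused: run 202000 is accepted here even though its mu status is bad, because only rich is required.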

16
Conclusion
  • We have deployed the EventStore
  • Features
  • Adaptable to any legacy data formats and
    relational databases
  • Provides random access to events
  • Allows incremental addition of data
  • Enables independence of event indexing and data
    clustering
  • Reuses data files for different versions
  • Users' reactions have been very favorable