EventStore: Managing Event Versioning and Data Partitioning using Legacy Data Formats

Provided by: lnsCo

Learn more at: http://www.lns.cornell.edu

1
EventStore: Managing Event Versioning and Data
Partitioning using Legacy Data Formats

Chris Jones
Valentin Kuznetsov
Dan Riley
Greg Sharp
CLEO Collaboration
Cornell University

2
Goals

Fast and scalable
e.g. run through events with data in memory at
>10,000 events/s
Event data stays in original file format
e.g. CLEO's object format, ROOT, etc.
Can manage data and MC
Data will be versioned
can always get back the version of data you used
before
Can choose runs based on run conditions
e.g. run energy, status of RICH subdetector,
etc.
Handles overlapping skims in same job
Easy to add/supersede data to an event
e.g. can add post-reconstruction info such as π0s
and reconstructed Ds
No dependence on proprietary software
e.g. drop our use of Objectivity

3
EventStore Sizes

Personal
For individual physicists (e.g. laptops)
Holds personal skims
No separate processes (e.g. databases) needed to
run
Group
For large offsite collaborators or on-site groups
Holds a large subset of our data
All data on disk
Requires running a 3rd party database
Collaboration
Cornell Site
Holds all of our data with replication for
improved performance
Interacts with HSM
Requires running a 3rd party database
All three sizes share everything except the choice
of file meta-data DB

4
Data Organization

Data is organized into grades
raw
data directly from detector
physics
Reconstruction output approved for analysis
Skims are defined within a grade
e.g. physics grade: all, qcd, tau, ...
skims are just indices, independent of data
clustering
different skims can reference the same event
extremely easy to add additional skims
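
The idea that a skim is only an index over shared event data can be sketched as follows. This is a minimal illustration, not the EventStore implementation: the in-memory dictionary, the function names add_skim and read_skim, and the selection criterion in the usage note are all assumptions made for the example.

```python
def add_skim(skims, name, events, predicate):
    """Define a skim as a sorted list of (run, event) keys.

    The event data itself is not copied or reorganized, which is why
    adding another skim is cheap.
    """
    skims[name] = sorted(key for key, data in events.items() if predicate(data))


def read_skim(skims, name, events):
    """Return the event records a skim refers to, in key order."""
    return [events[key] for key in skims[name]]
```

For instance, with events keyed by (run, event), calling add_skim(skims, "all", events, lambda d: True) and then add_skim with a tighter predicate yields two skims that reference the same stored events.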

5
Adding Data

Easy to add new objects to events in an existing
grade
e.g. could add π0s to physics grade after the
post-reconstruction calibration has been done
Can avoid common run time calculations
e.g. shower energy, π0 finding, ...
save CPU time
guarantee consistency when reprocessing

6
File Meta Data

Meta data about all files are stored in a
relational DB
System independent of choice of DB
Presently using
SQLite for personal
MySQL for group and collaboration
Meta data stored
Logical File ID (64-bit number) to file path
What files belong with which grade, skim and run
Versioning information
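
The kind of meta data listed above could be held in a schema like the one below. This is a hedged sketch using Python's built-in sqlite3 module (matching the personal-size SQLite choice); the table names, column names, and helper functions are hypothetical and are not taken from the EventStore schema. Note that SQLite integers are signed 64-bit, so this sketch assumes file IDs below 2**63.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_path (
    file_id INTEGER PRIMARY KEY,   -- logical file ID
    path    TEXT NOT NULL
);
CREATE TABLE file_use (
    grade   TEXT NOT NULL,
    skim    TEXT NOT NULL,
    run     INTEGER NOT NULL,
    version TEXT NOT NULL,         -- date stamp, yyyymmdd
    file_id INTEGER NOT NULL REFERENCES file_path(file_id)
);
""")


def register(file_id, path, grade, skim, run, version):
    """Record a file and the (grade, skim, run, version) it belongs to."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO file_path VALUES (?, ?)",
                     (file_id, path))
        conn.execute("INSERT INTO file_use VALUES (?, ?, ?, ?, ?)",
                     (grade, skim, run, version, file_id))


def files_for(grade, skim, run, version):
    """Return the paths of all files registered for the given selection."""
    rows = conn.execute(
        "SELECT p.path FROM file_use u JOIN file_path p USING (file_id) "
        "WHERE u.grade = ? AND u.skim = ? AND u.run = ? AND u.version = ? "
        "ORDER BY p.path",
        (grade, skim, run, version))
    return [row[0] for row in rows]
```

Because only standard SQL is used, the same statements should carry over to another relational database with little change, in the spirit of the "independent of choice of DB" point.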

7
Indexing Data

Three types of files are used when reading data
Index
translates (run, event, MC ID) to location record
index
index file has fixed record length
fast random access
Location
records where each data unit can be found in a
set of data files
gives us random access
versioning implemented using different location
files
specialized for each data storage format
location file has fixed record length
Data store
any way to store data should work
implemented for two file formats with variable
record sizes
Why not use a relational database for event
indexing?
Indexing is read only
Info is accessed every event
must be very fast
Index is traversed on the client

8
Reading Data

[Figure: an Index file, a Location file, and Data files (Raw, QCD); index records point to location records, which hold offsets into the data files.]

9
Performance: Sequential Access

Compare reading sequentially the same data files
as a chain of files versus using the EventStore
EventStore is a constant 15% slower

10
Performance: Event List Access
Compare using an event list to access the
same data files using a chain of files versus
using the EventStore
The EventStore scales better the more events
that are skipped
11
Versioning

When starting a new analysis, usually want the
most recent reconstruction
When adding new data to an existing analysis,
want to go back over the same data
Specify version by giving a date
notation yyyymmdd
e.g. eventstore in 20040501
Do not have to specify date of a version change,
EventStore will find the closest version just
before that date
In analysis, physicists use the date they first
processed the data
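
The date lookup described above can be sketched in a few lines. This assumes that "closest version just before that date" includes a version created on the requested date itself; the function name resolve_version is hypothetical.

```python
from bisect import bisect_right


def resolve_version(available, requested):
    """Return the latest date stamp in `available` that is <= `requested`.

    Date stamps are yyyymmdd strings, so lexicographic order is also
    chronological order. Returns None if every version is later.
    """
    stamps = sorted(available)
    i = bisect_right(stamps, requested)
    return stamps[i - 1] if i else None
```

With versions 20040101 and 20040402 available, a request for 20040501 resolves to 20040402, while a request for 20040201 resolves to 20040101, so an analysis that keeps using its original date keeps seeing the same data.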

12
Versioning Evolution

If data is reprocessed, a new date stamp must be
used to distinguish the data
e.g. if CLEO reprocesses data31, a new date must
be created for the physics grade
physics 20040101: has first processing of data31
physics 20040402: has more recent processing of
data31
When new datasets are added, CLEO officers can
append them to any date stamp
e.g. newly reconstructed data35 can be added to
physics 20040101 and 20040402
When new data types are added, they can be placed
in corresponding date stamp
e.g. π0s derived from 20040402 may be stored
there
If data type is replaced, need new date stamp

13
Versioning Evolution

[Figure: diagram with axes labeled "run number" and "Version"]

14
Versioning Information

Each chunk of data added will have its own
specific versioning information
e.g. Recon-20040312-Feb13_04_P2 for data32 means
the reconstruction used software release
Feb13_04_P2, with the last change to any item that
affects reconstruction made no later than 2004/03/12
The version date-stamp is a logical version
made up of individual specific versions, each of
which describes a non-overlapping run range
CLEO officers decide what specific versions
should be used together to form a logical version
Want to make a tool that, given a date stamp and a
run range, tells you how to generate your
MC
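
One way to picture a logical version built from specific versions is sketched below. The class and function names are hypothetical, and the run ranges used in any example are made up; only the specific-version naming pattern comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecificVersion:
    first_run: int
    last_run: int
    name: str        # e.g. "Recon-20040312-Feb13_04_P2"


def build_logical_version(parts):
    """Order specific versions by run and reject overlapping run ranges."""
    ordered = sorted(parts, key=lambda p: p.first_run)
    for earlier, later in zip(ordered, ordered[1:]):
        if later.first_run <= earlier.last_run:
            raise ValueError(
                f"run ranges overlap: {earlier.name} and {later.name}")
    return ordered


def specific_version_for(logical, run):
    """Return the specific version name covering `run`, or None."""
    for part in logical:
        if part.first_run <= run <= part.last_run:
            return part.name
    return None
```

A lookup of this kind is roughly what the proposed MC tool would need: for each run in the requested range, it identifies which software release and constants date produced the data.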

15
Run Selection

Support multiple ways of specifying runs
run range
runs 202000 203000
datasets
datasets 31-34
energy
energy 1.89
runs whose energies cluster around this beam
energy
energy psi(3770)
energy psi(3770)-off
detector state
detector mu unused
mu is not used in this analysis, so runs where
mu is bad can be used
detector rich used
Data obtained by querying a Web Service
Run meta-data is centralized
Can also be accessed via a web browser
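
The selection criteria above combine naturally as filters over run meta-data. The sketch below is illustrative only: the Run record, its field names, the energy tolerance, and the select_runs function are assumptions, and it filters a local list rather than querying the actual web service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    number: int
    dataset: int
    energy: float                          # beam energy, GeV (assumed unit)
    bad_detectors: frozenset = frozenset() # subdetectors flagged bad


def select_runs(runs, run_range=None, datasets=None, energy=None,
                tolerance=0.005, used_detectors=()):
    """Return run numbers passing every criterion that was specified.

    A detector marked as used must be good; detectors not listed are
    treated as unused, so runs where they are bad are still accepted.
    """
    selected = []
    for run in runs:
        if run_range and not (run_range[0] <= run.number <= run_range[1]):
            continue
        if datasets and run.dataset not in datasets:
            continue
        if energy is not None and abs(run.energy - energy) > tolerance:
            continue
        if any(d in run.bad_detectors for d in used_detectors):
            continue
        selected.append(run.number)
    return selected
```

Leaving "mu" out of used_detectors corresponds to "detector mu unused": runs with a bad muon system are kept, which increases the usable data for analyses that do not need it.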

16
Conclusion

We have deployed the EventStore
Features
Adaptable to any legacy data formats and
relational databases
Provides random access to events
Allows incremental addition of data
Enables independence of event indexing and data
clustering
Reuses data files for different versions
User reactions have been very favorable
