Title: Data Model
1. Data Model Analysis Task
P. Sphicas, Computing Model Workshop, CPT Week, November 3, 2004
- Outline
  - Introduction
  - Data Model
  - Data Production
  - Plus pending issues
  - Next issues
  - Summary
2. A bit of history, the charge and the actors
3. Data Model/Analysis
- Data Management RTAG
  - Concentrated period of work, weekly meetings, plenty of discussion - most of it convergent
  - Delivered an unprecedented object in CMS: a document with an analysis/justification for things, along with an explicit set of recommendations
- The current Data Model Analysis (DMA) task is
  - the natural continuation of the DM RTAG
  - its completion, via the inclusion of analysis issues
- Core discussion group consists of two people from the DM RTAG (L. Bauerdick and A. Yagil), Ms. CMS-Analysis (Lucia), Mr. Co-Reconstruction (N. Neumeister), Mr. Online-framework (Emilio) and one person from PRS (me).
- A couple of volunteers have stepped forward recently
4. Data Model/Analysis
- Obvious: on numerous issues we cannot afford to actually test out completely, to the extent that we would like, all alternatives
  - Implication is that we have to discuss things, think them out, make a number of educated guesses and then decide.
- Scope of DMA covers much of what the Physics TDR will reveal
  - From chapter 1 of the CM document, the scope for DMA includes
    - Event sizes, formats, streaming, DST/AOD/etc, calibration/conditions data, work flow, update frequency, simulation samples (sizes, distribution), group needs for analysis, data and job movement, priority management, interactive analysis.
  - With the exception of a couple of items, the Physics TDR is expected to deliver answers on the numbers/choices listed above
- Put otherwise, this exercise is taking place roughly a year before it should. But it cannot wait, due to the LCG schedule.
- Thus we are charged with analyzing all alternatives, providing an ordered list and an explicit recommendation on each.
- And eventually we, the CM group, have to decide.
5. Data Model
6. Recommendation 1
- Need streams out of the HLT farm
  - About 10 of them
  - Roughly equal size across them; factors of 2-3 can be found/expected
  - One stream is an express line (illustrated in the sketch below)
    - In complete overlap with all the other streams
    - Every event in the express-line stream is an event found in another stream as well
- Justification
  - Facilitates prioritization of reconstruction
  - Facilitates application of re-calibration, re-reconstruction
  - Cost is tunable: any overlap of events across streams is a function of cuts. No reason why this overlap cannot stay at the 10% level
  - Depending on Grid economics, it MAY map more easily onto a set of distributed Tier-1 centers (if re-processings are done at Tier-1s)
  - No (obvious) advantage to having a single big stream
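As a purely illustrative sketch (the stream names, trigger quantities and cut values below are invented for this example, not actual CMS trigger logic), the proposal amounts to routing each HLT-accepted event to one or more streams, with the express line drawing a small, fully overlapping subset whose size is controlled by the cuts:

```python
# Hypothetical illustration of HLT output streaming with an express line.

def assign_streams(event, stream_selectors, express_selector):
    """Return the list of streams an HLT-accepted event is written to."""
    streams = [name for name, passes in stream_selectors.items() if passes(event)]
    # The express line is in complete overlap: every express event is
    # also present in at least one ordinary stream.
    if streams and express_selector(event):
        streams.append("express")
    return streams

# ~10 nominal streams of roughly comparable size (only three shown);
# the express selection is tuned via cuts to keep its overlap near the 10% level.
stream_selectors = {
    "muons":     lambda e: e.get("mu_pt", 0) > 20,    # invented cuts
    "electrons": lambda e: e.get("ele_pt", 0) > 25,
    "jets":      lambda e: e.get("jet_et", 0) > 150,
    # ... ~7 more streams
}
express_selector = lambda e: e.get("mu_pt", 0) > 40 or e.get("ele_pt", 0) > 50

print(assign_streams({"mu_pt": 45.0}, stream_selectors, express_selector))
# -> ['muons', 'express']
```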
7. Streams at the Tevatron (backup)
- D0: all events in one stream (physics)
  - Plus one monitor stream
  - Production outputs two files: a DST and a thumbnail (TMB)
    - TMB: 20-60 kB/event compressed version of the data, meant for most analyses
  - Reconstruction yields 25 skims
- CDF: multiple streams out of HLT (10)
  - Streams get reconstructed and produce physics lists/paths
  - Outputs one file (DST + raw data), called datasets
    - For SOME datasets, a compressed version of DST+RawData is written out. Independent dataset.
  - About 40-50 of them; production finishes here
  - Datasets are the smallest sample quantum seen (and managed) by physics groups
  - A physics group can be using one or many datasets
8. Issue 2
- What is the output of production?
  - Model: Tier-0 processes raw data from the on-line system and outputs CMS Reconstructed Events.
  - Define the content of these RECEVT files.
- Here the issue is more complicated, but before looking at options, here are some suggested guidelines
  - The CMS system must be flexible. Put otherwise, re-reconstruction should be straightforward and easy
    - Expect that we will be redoing things often
  - The CMS system should not be designed with the outside boundary conditions (e.g. available funding today) built into things.
    - Should have an iterative process of design/costing/comparing/re-designing/re-costing/re-comparing/re-re-designing/re-re-costing/...
9. Recommendation 2
- We MUST set the following two strategic goals
  - Our reconstruction program should allow frequent re-processing of our data
    - It should be FAST: watch out for fancy features and certainly for fancy code
    - At the LHC it is far better to be first with the correct result than to be second with pretty code with polymorphism, multi-range applicability and redundancy-checking, etc. etc. (Nostradamus)
  - Our data and analysis model should allow frequent re-processing of our data
    - Raw data must be readily available to the reconstructing sites
    - Investigate what this implies
10. Raw data readily available
- Three options for this
  - Keep the raw data in RECEVT: wherever RECEVT lies, re-reconstruction can occur instantly
    - Need an active storage management system
    - Implies multiple copies of RAW data, so need to watch cost
  - Keep RECEVT split in two files, one for the RAW information and one for the RECOnstructed information
    - Need a data model that links the two (metadata), as in the sketch below
    - Implies (again) distributing RAW data to multiple places
  - Keep the Reconstructed information completely separate from the RAW information
    - Implies re-reconstruction site options are very limited. At the limit, we could decide that CMS has one and only one copy of the raw data, and that copy is split across the potential reconstructors, i.e. the Tier-1 centers
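A minimal sketch of what such a RAW/RECO link could look like, purely for illustration; the record fields and file names are invented here and are not taken from any CMS design:

```python
# Hypothetical metadata record linking the RAW and RECO halves of a split RECEVT.
from dataclasses import dataclass

@dataclass
class RecEvtLink:
    run: int
    event: int
    raw_file: str    # logical file name of the RAW part
    raw_offset: int  # location of this event inside the RAW file
    reco_file: str   # logical file name of the RECO part from a given pass
    reco_pass: str   # which (re-)processing produced the RECO part

# A re-processing would rewrite only the RECO side and update the link,
# while the RAW side (and its placement at the Tier-1s) stays untouched.
link = RecEvtLink(run=1234, event=56789,
                  raw_file="raw/run1234_part03.dat", raw_offset=1048576,
                  reco_file="reco/run1234_part03_pass2.dat", reco_pass="pass2")
print(link.reco_pass)
```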
11. Raw Data in the RECEVT
- To mitigate the cost of multiple copies of the raw data, one can consider compressing heavily
  - Clearly this costs in CPU and complexity
  - At startup, few events, so no need.
  - Afterwards: if indeed one reprocesses only once or twice per year, decompressing once or twice per year is really not a significant overhead
  - Complexity not significant
- So, conclusion: yes, packing of the data is an element of our data model (see the sketch below)
- To proceed we need to know how often we will be accessing what information
  - Not knowing this, we must look at the Tevatron
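As an illustration only (generic zlib compression on a made-up payload, not the actual CMS packing scheme), the trade-off is simply CPU spent packing/unpacking against bytes stored:

```python
# Illustrative only: compress a fake raw-data payload to show the pack/unpack
# trade-off; real CMS packing would use its own detector-aware format.
import os
import zlib

raw_payload = os.urandom(100_000) + b"\x00" * 200_000  # made-up mix of noisy and sparse data

packed = zlib.compress(raw_payload, level=6)
print(f"packing factor ~ {len(raw_payload) / len(packed):.1f}")

# Unpacking is cheap; done only once or twice per year per event for
# re-processing, it is not a significant overhead.
assert zlib.decompress(packed) == raw_payload
```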
12. Event files content
- Two very different models
- CDF is close to the RAW data
  - About 100 kB, kept in the same physical file as the full event
  - There is about 50-60 kB of reconstructed objects added to this
  - This file re-emerges with every re-processing
    - There are, at any point in time, N copies of the raw data, where N is the number of re-processings active
  - Note: most users do NOT run on these files. They run analyses on group ROOT-tuples
- D0 works on thumbnails
  - About 20-60 kB/event. Number going up (to 100 kB?)
  - Some limited re-reconstruction capability with TMBs
  - Most users run their analysis on skimmed TMBs
    - Skim production handled by the D0 Common Samples Group
    - Usually in conjunction with TMB fixing
13. Reprocessing and TMB fixing
- CDF reprocesses (nowadays) once per year
  - Their entire Run II sample
  - Major pluses: small event size, short processing time
  - But clearly, production is still a big job
- D0 has re-processed their data once so far
  - All other reprocessing takes the form of corrections/fixes applied to the TMBs
    - This is a major effort, lasts a few weeks
    - So far occurs about twice per year
  - Files in skims combined to make larger files
  - Negatives: rather slow reconstruction program, control over corrections that go into TMB-fixing
14. Recommendation 2
- Premises
  - We will reprocess data quite often, at least in the beginning
  - Flexibility (aka re-running to fix bugs/features/calibration) is of higher importance than storage savings
  - All such statements should be understood to be true to a finite extent. When costs become prohibitive, obviously we take the other route
- We must, therefore, have a model where the raw data is nearby/readily available
- We should make every possible effort to keep the raw data as part of the RECEVT distribution.
- Next decision: one or two files?
15. Issue 3: data size
- Recipe for calculating the CMS event size
  - Size = SIM_size × F_unpack × F_sim × F_debug × F_hlt / F_pack
  - SIM_size: our best estimate, using OSCAR and DAQ headers
  - F_unpack: factor describing going to the reconstruction representation
  - F_sim: factor describing how realistic the simulation is
  - F_debug: factor describing the increased amount of data we take in the beginning (because of looser readout/zero-suppression/etc)
  - F_hlt: same as F_debug but for the HLT
  - F_pack: what we can gain by packing the data at the end of processing
- These factors are not present throughout the lifetime of the experiment. Clearly, F_debug at some point becomes 1
16. Data size
- We do not have final numbers for all these factors
  - Only first estimates. Given here for illustration
  - F_sim: 1.6-2.0
    - CDF: roughly twice as many hits in real events as in simulated events.
  - F_debug: 2.5
    - Depends very much on the phase of the experiment. These numbers are preliminary figures from CDF
  - F_hlt: 1.25
  - F_pack: 2.0
  - F_unpack: TBM
- New estimate of CMS event size is 300 kB (at low lumi)
  - Low lumi defined as 2×10^33 cm^-2 s^-1. From DM RTAG. Suggestion: we adopt it.
  - Lumi-dependence for L > L_low: linear, slope 1 (number of interactions rises).
  - Looks like DAQ event at startup 1.5 MB. Asymptotically at high lumi 1 MB. But this drops with time (see the illustrative calculation below)
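A minimal sketch that evaluates the recipe from the previous slide with the illustrative factors quoted above; the SIM_size and F_unpack inputs are pure placeholders (both are still "TBM"), so the output is only an example of how the factors combine, not a new estimate:

```python
# Sketch of the event-size recipe from Issue 3. All inputs are the preliminary,
# illustrative figures quoted above, except where marked as an assumption.

def event_size(sim_size_kb, f_unpack, f_sim, f_debug, f_hlt, f_pack):
    """Size = SIM_size x F_unpack x F_sim x F_debug x F_hlt / F_pack"""
    return sim_size_kb * f_unpack * f_sim * f_debug * f_hlt / f_pack

startup_kb = event_size(
    sim_size_kb=300,  # ASSUMED placeholder; the real OSCAR-based number is still to come
    f_unpack=1.0,     # ASSUMED placeholder ("TBM" above)
    f_sim=1.8,        # mid-point of the 1.6-2.0 range quoted from CDF
    f_debug=2.5,
    f_hlt=1.25,
    f_pack=2.0,
)
print(f"startup event size ~ {startup_kb:.0f} kB")  # ~844 kB with these placeholders

# As the experiment matures, F_debug and F_hlt tend toward 1, so with the same
# placeholders the size drops toward SIM_size x F_sim / F_pack.
mature_kb = event_size(300, 1.0, 1.8, 1.0, 1.0, 2.0)
print(f"mature event size ~ {mature_kb:.0f} kB")
```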
17. Recommendation 3
- At least in the early phase of the experiment, keep RECEVT as a single file, carrying around the RAW data
  - What we lose in storage we gain back in flexibility and immediacy of re-processing
- As the experiment matures, if economics allow, continue
  - Otherwise split into two files
18. Issue 4, Recommendation 4
- Offline streaming
  - Clear need for further partitioning of the data, beyond the HLT streams
  - Both CDF and D0 have this
    - Inclusive leptons → muons, electrons, taus (!), etc
  - We do not know how many are needed, but no reason to believe we need a number vastly different from CDF/D0
    - They have 50/25 different groups
- Suggestion: we split the offline output into 50 streams (take the upper value for safety)
  - These offline streams are the smallest data sample quanta handled by the physics groups
19. Issue 5
- Why are we doing all this? Who will read/use the RECEVT files?
- Let us for a moment (temporarily) define the RECOnstructed part of the event as the information needed by 90% of the analyses
  - Closest object to today's DST
- Unfortunately, neither CDF nor D0 use these files/formats to do analysis
  - CDF uses ROOT-tuples
  - D0 the TMBs
- Moral of the story: for all the previous stuff to have high value (beyond the massive re-production issues) we must have an analysis model that utilizes the CMS Event Data Model. Directly. As much as possible.
20. Data production
21. Issue 1: where/how it all happens
- We speak of having a copy of the RAW data at CERN (Tier-0) and shipping the Reconstructed data to the Tier-1 centers
  - Since the RAW data is part of the distribution, we now have two copies of the raw data
- Any reprocessing will result in another RAW data copy. Unless reprocessing overwrites/uses the resources of the previous processing
  - Implies that the Tier-1 center where re-processing is possible is fixed. (Since CERN cannot reprocess everything at the Tier-0 again)
  - Implies that the CERN copy is an idle one, a disaster-avoidance copy
- This is the least attractive scenario
22. Issue 1 continued
- In fact, the best possible scenario (for CMS) would be to treat these 1/5 chunks of the RAW data as coming in two copies which are distributed among all the Tier-1 centers.
  - i.e. we do not have an idle copy at CERN, but a second active copy at a second Tier-1 center
  - And for every 1/5 of the data, we have two sites to choose from where to re-do something.
- Of course there are costs associated with this choice
  - Active data storage managers. Who pays? etc.
- For completeness, the full list of options to be investigated:
23. Issue 1 continued
- Option 1: vaulted copy + distributed active copy
  - Tier-0: "dead" tape copy; emergency usage only; does not participate in re-processing
  - Tier-1: share of ESD; custodial ownership; re-processes its part
  - Physics/Analysis Coordination: distribute into T1s; re-processing pre-determined
- Option 2: active copy + distributed active copy
  - Tier-0: active copy at CERN; participates in re-processing
  - Tier-1: same
  - Physics/Analysis Coordination: more coordination required
- Option 3: two distributed active copies
  - Tier-0: basically no or little role
  - Tier-1: most serious role; data is stored in T1s!
  - Physics/Analysis Coordination: most difficult role here
24. Issue 1, Pseudo-Recommendation 1
- Avoid option 1, i.e. vaulted copy + distributed active copy
  - Little motivation for a 2nd copy in this case.
  - The CERN copy is, for all intents and purposes, only an insurance policy
    - Many current experiments do not have such insurance
  - The cost of a second copy should have significant benefits to the experiment
    - For example, more flexibility.
- Note that option 2 (active copy + distributed active copy) implies that the CERN Tier-1 throughput is 5 times larger than that of the other Tier-1 centers
- Looks like option 3 is the one we should think about as the most likely to survive financial and political requirements
25. Fill in details
- Reprocessing at Tier-X centers
  - Work out details (from the user point of view) of what happens when
  - Inputs for reconstruction
  - Calibration tasks
- Observations
  - CDF and D0 are moving towards the same reality: it takes about 100 kB of data to do physics at the Tevatron.
  - But they have taken two different initial positions
    - CDF: ALL data, keep decreasing
    - D0: the smallest possible data, keep increasing
  - Reminds me of adding variables in ntuples/objects in ROOT trees
26. Offline processing
27. Ill-defined issues
28. Next issues
29. Following Issues
- Content of DST
  - Usage pattern
- Content of mini-DST
  - Usage pattern
  - Number of copies, co-ordination of re-making
- Social contract: who runs what. What is forbidden, what is discouraged, what is available depending on resources, who keeps track
- Calibration/Conditions data: who creates it. Where it gets stored. How often is it used. Types of calibration data (low-level, high-level)
- Usage of simulated data. How much? Where? Why?
- Data and Job movement (!...)
- Interactive Analysis
30. Interactive analysis
- We must avoid the ROOT-tuple trap
- We must ensure that a (willing) user can, if (s)he wants, do ALL that (s)he wishes by running a single program, ORCA
- Huge advantages to this scheme
  - global debugging of code
  - huge gains in development: avoid re-inventing the car
  - single training objective
  - clear turn-on path onto CMS, from professor to student
  - avoid hand-corrections that are difficult to reproduce
  - minimize the dichotomy between developer and user communities. In a sense, every user will be a mini-developer. The question will be developer grade, not type
  - clear, straightforward feedback of analysis code into reconstruction code
  - what can be done is determined by what is at the input. Only.
31. Recommendation (final)
- We take it for granted that we will make a maximal investment in the EDM, COBRA, ORCA, OSCAR, FAMOS, and stick to it. And provide direct access to a plotting/analysis package that provides interactive capability
- Need an easy-to-learn, easy-to-use system for plot-making
  - Huge returns
  - ROOT is currently the most popular one in our field
  - Do this ROOT binding in the most careful and optimal way possible
32. Summary
33. Summary
- Overriding goal: to measure the right number first.
  - Irrespective of the elegance of the method, tactics, etc.
- We must learn from other people's mistakes
  - Plan for ours, so plan to re-process, and plan for flexibility
  - Plan to simplify things at the expense of beauty
  - Plan to duplicate/triplicate things where the simplicity of the rules/systems exceeds the anticipated costs
- Every experiment I have seen to this date ended up using orders of magnitude more computing than anticipated at inception time. CMS will be no exception.
- Things we need: people should think about the issues listed as pending and/or ill-defined and provide feedback