Title: Data Model
1. Data Model Analysis Task
P. Sphicas, Computing Model Workshop, CPT Week, November 3, 2004
- Outline
  - Introduction
  - Data Model
  - Data Production
  - Plus pending issues
  - Next issues
  - Summary
2. A bit of history, the charge and the actors
3. Data Model/Analysis
- Data Management RTAG
  - Concentrated period of work, weekly meetings, plenty of discussion - most of it convergent
  - Delivered an unprecedented object in CMS: a document with an analysis/justification for things, along with an explicit set of recommendations
- The current Data Model Analysis (DMA) task is
  - the natural continuation of the DM RTAG
  - its completion, via the inclusion of analysis issues
- Core discussion group consists of two people from the DM RTAG (L. Bauerdick and A. Yagil), Ms. CMS-Analysis (Lucia), Mr. Co-Reconstruction (N. Neumeister), Mr. Online-framework (Emilio) and one person from PRS (me).
- A couple of volunteers have stepped forward recently
4. Data Model/Analysis
- Obvious: on numerous issues we cannot afford to actually test out completely, to the extent that we would like, all alternatives
  - Implication is that we have to discuss things, think them out, make a number of educated guesses and then decide.
- Scope of DMA covers much of what the Physics TDR will reveal
  - From chapter 1 of the CM document, the scope for DMA includes
    - Event sizes, formats, streaming, DST/AOD/etc, calibration/conditions data, work flow, update frequency, simulation samples (sizes, distribution), group needs for analysis, data and job movement, priority management, interactive analysis.
  - With the exception of a couple of items, the Physics TDR is expected to deliver answers on the numbers/choices listed above
- Put otherwise, this exercise is taking place roughly a year before it should. But it cannot wait, due to the LCG schedule.
- Thus we are charged with analyzing all alternatives, providing an ordered list and an explicit recommendation on each.
- And eventually we, the CM group, have to decide.
5. Data Model
6. Recommendation 1
- Need streams out of the HLT farm
  - About 10 of them
  - Roughly equal size across them; factors of 2-3 can be found/expected
  - One stream is an express line (illustrated in the sketch below)
    - In complete overlap with all the other streams
    - Every event in the express-line stream is an event found in another stream as well
- Justification
  - Facilitates prioritization of reconstruction
  - Facilitates application of re-calibration, re-reconstruction
  - Cost is tunable: any overlap of events across streams is a function of cuts. No reason why this overlap cannot stay at the 10% level
  - Depending on Grid economics, it MAY map more easily onto a set of distributed Tier-1 centers (if re-processings are done at Tier-1s)
  - No (obvious) advantage to having a single big stream
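As a purely illustrative sketch (the stream names, trigger quantities and cut values below are invented for this example, not actual CMS trigger logic), the proposal amounts to routing each HLT-accepted event to one or more streams, with the express line drawing a small, fully overlapping subset whose size is controlled by the cuts:

```python
# Hypothetical illustration of HLT output streaming with an express line.

def assign_streams(event, stream_selectors, express_selector):
    """Return the list of streams an HLT-accepted event is written to."""
    streams = [name for name, passes in stream_selectors.items() if passes(event)]
    # The express line is in complete overlap: every express event is
    # also present in at least one ordinary stream.
    if streams and express_selector(event):
        streams.append("express")
    return streams

# ~10 nominal streams of roughly comparable size (only three shown);
# the express selection is tuned via cuts to keep its overlap near the 10% level.
stream_selectors = {
    "muons":     lambda e: e.get("mu_pt", 0) > 20,    # invented cuts
    "electrons": lambda e: e.get("ele_pt", 0) > 25,
    "jets":      lambda e: e.get("jet_et", 0) > 150,
    # ... ~7 more streams
}
express_selector = lambda e: e.get("mu_pt", 0) > 40 or e.get("ele_pt", 0) > 50

print(assign_streams({"mu_pt": 45.0}, stream_selectors, express_selector))
# -> ['muons', 'express']
```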
7. Streams at the Tevatron (backup)
- D0: all events in one stream (physics)
  - Plus one monitor stream
  - Production outputs two files: a DST and a thumbnail (TMB)
    - TMB: 20-60 kB/event compressed version of the data, meant for most analyses
  - Reconstruction yields 25 skims
- CDF: multiple streams out of HLT (10)
  - Streams get reconstructed and produce physics lists/paths
  - Outputs one file (DST + raw data), called datasets
    - For SOME datasets, a compressed version of DST+RawData is written out. Independent dataset.
  - About 40-50 of them; production finishes here
  - Datasets are the smallest sample quantum seen (and managed) by physics groups
  - A physics group can be using one or many datasets
8. Issue 2
- What is the output of production?
  - Model: Tier-0 processes raw data from the on-line system and outputs CMS Reconstructed Events.
  - Define the content of these RECEVT files.
- Here the issue is more complicated, but before looking at options, here are some suggested guidelines
  - The CMS system must be flexible. Put otherwise, re-reconstruction should be straightforward and easy
    - Expect that we will be redoing things often
  - The CMS system should not be designed with the outside boundary conditions (e.g. available funding today) built into things.
    - Should have an iterative process of design/costing/comparing/re-designing/re-costing/re-comparing/re-re-designing/re-re-costing/...
9. Recommendation 2
- We MUST set the following two strategic goals
  - Our reconstruction program should allow frequent re-processing of our data
    - It should be FAST: watch out for fancy features and certainly for fancy code
    - At the LHC it is far better to be first with the correct result than to be second with pretty code with polymorphism, multi-range applicability and redundancy-checking, etc. etc. (Nostradamus)
  - Our data and analysis model should allow frequent re-processing of our data
    - Raw data must be readily available to the reconstructing sites
    - Investigate what this implies
10. Raw data readily available
- Three options for this
  - Keep the raw data in RECEVT: wherever RECEVT lies, re-reconstruction can occur instantly
    - Need an active storage management system
    - Implies multiple copies of RAW data, so need to watch cost
  - Keep RECEVT split in two files, one for the RAW information and one for the RECOnstructed information
    - Need a data model that links the two (metadata), as in the sketch below
    - Implies (again) distributing RAW data to multiple places
  - Keep the Reconstructed information completely separate from the RAW information
    - Implies re-reconstruction site options are very limited. At the limit, we could decide that CMS has one and only one copy of the raw data, and that copy is split across the potential reconstructors, i.e. the Tier-1 centers
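A minimal sketch of what such a RAW/RECO link could look like, purely for illustration; the record fields and file names are invented here and are not taken from any CMS design:

```python
# Hypothetical metadata record linking the RAW and RECO halves of a split RECEVT.
from dataclasses import dataclass

@dataclass
class RecEvtLink:
    run: int
    event: int
    raw_file: str    # logical file name of the RAW part
    raw_offset: int  # location of this event inside the RAW file
    reco_file: str   # logical file name of the RECO part from a given pass
    reco_pass: str   # which (re-)processing produced the RECO part

# A re-processing would rewrite only the RECO side and update the link,
# while the RAW side (and its placement at the Tier-1s) stays untouched.
link = RecEvtLink(run=1234, event=56789,
                  raw_file="raw/run1234_part03.dat", raw_offset=1048576,
                  reco_file="reco/run1234_part03_pass2.dat", reco_pass="pass2")
print(link.reco_pass)
```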
11. Raw Data in the RECEVT
- To mitigate the cost of multiple copies of the raw data, one can consider compressing heavily
  - Clearly this costs in CPU and complexity
  - At startup, few events, so no need.
  - Afterwards: if indeed one reprocesses only once or twice per year, decompressing once or twice per year is really not a significant overhead
  - Complexity not significant
- So, conclusion: yes, packing of the data is an element of our data model (see the sketch below)
- To proceed we need to know how often we will be accessing what information
  - Not knowing this, we must look at the Tevatron
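As an illustration only (generic zlib compression on a made-up payload, not the actual CMS packing scheme), the trade-off is simply CPU spent packing/unpacking against bytes stored:

```python
# Illustrative only: compress a fake raw-data payload to show the pack/unpack
# trade-off; real CMS packing would use its own detector-aware format.
import os
import zlib

raw_payload = os.urandom(100_000) + b"\x00" * 200_000  # made-up mix of noisy and sparse data

packed = zlib.compress(raw_payload, level=6)
print(f"packing factor ~ {len(raw_payload) / len(packed):.1f}")

# Unpacking is cheap; done only once or twice per year per event for
# re-processing, it is not a significant overhead.
assert zlib.decompress(packed) == raw_payload
```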
12. Event files content
- Two very different models
- CDF is close to the RAW data
  - About 100 kB, kept in the same physical file as the full event
  - There is about 50-60 kB of reconstructed objects added to this
  - This file re-emerges with every re-processing
    - There are, at any point in time, N copies of the raw data, where N is the number of re-processings active
  - Note: most users do NOT run on these files. They run analyses on group ROOT-tuples
- D0 works on thumbnails
  - About 20-60 kB/event. Number going up (to 100 kB?)
  - Some limited re-reconstruction capability with TMBs
  - Most users run their analysis on skimmed TMBs
    - Skim production handled by the D0 Common Samples Group
    - Usually in conjunction with TMB fixing
13. Reprocessing and TMB fixing
- CDF reprocesses (nowadays) once per year
  - Their entire Run II sample
  - Major pluses: small event size, short processing time
  - But clearly, production is still a big job
- D0 has re-processed their data once so far
  - All other reprocessing takes the form of corrections/fixes applied to the TMBs
    - This is a major effort, lasts a few weeks
    - So far occurs about twice per year
  - Files in skims combined to make larger files
  - Negatives: rather slow reconstruction program, control over corrections that go into TMB-fixing
14. Recommendation 2
- Premises
  - We will reprocess data quite often, at least in the beginning
  - Flexibility (aka re-running to fix bugs/features/calibration) is of higher importance than storage savings
  - All such statements should be understood to be true to a finite extent. When costs become prohibitive, obviously we take the other route
- We must, therefore, have a model where the raw data is nearby/readily available
- We should make every possible effort to keep the raw data as part of the RECEVT distribution.
- Next decision: one or two files?
15. Issue 3: data size
- Recipe for calculating the CMS event size
  - Size = SIM_size × F_unpack × F_sim × F_debug × F_hlt / F_pack
  - SIM_size: our best estimate, using OSCAR and DAQ headers
  - F_unpack: factor describing going to the reconstruction representation
  - F_sim: factor describing how realistic the simulation is
  - F_debug: factor describing the increased amount of data we take in the beginning (because of looser readout/zero-suppression/etc)
  - F_hlt: same as F_debug but for the HLT
  - F_pack: what we can gain by packing the data at the end of processing
- These factors are not present throughout the lifetime of the experiment. Clearly, F_debug at some point becomes 1
16. Data size
- We do not have final numbers for all these factors
  - Only first estimates. Given here for illustration
  - F_sim: 1.6-2.0
    - CDF: roughly twice as many hits in real events as in simulated events.
  - F_debug: 2.5
    - Depends very much on the phase of the experiment. These numbers are preliminary figures from CDF
  - F_hlt: 1.25
  - F_pack: 2.0
  - F_unpack: TBM
- New estimate of CMS event size is 300 kB (at low lumi)
  - Low lumi defined as 2×10^33 cm^-2 s^-1. From DM RTAG. Suggestion: we adopt it.
  - Lumi-dependence for L > L_low: linear, slope 1 (number of interactions rises).
  - Looks like DAQ event at startup 1.5 MB. Asymptotically at high lumi 1 MB. But this drops with time (see the illustrative calculation below)
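A minimal sketch that evaluates the recipe from the previous slide with the illustrative factors quoted above; the SIM_size and F_unpack inputs are pure placeholders (both are still "TBM"), so the output is only an example of how the factors combine, not a new estimate:

```python
# Sketch of the event-size recipe from Issue 3. All inputs are the preliminary,
# illustrative figures quoted above, except where marked as an assumption.

def event_size(sim_size_kb, f_unpack, f_sim, f_debug, f_hlt, f_pack):
    """Size = SIM_size x F_unpack x F_sim x F_debug x F_hlt / F_pack"""
    return sim_size_kb * f_unpack * f_sim * f_debug * f_hlt / f_pack

startup_kb = event_size(
    sim_size_kb=300,  # ASSUMED placeholder; the real OSCAR-based number is still to come
    f_unpack=1.0,     # ASSUMED placeholder ("TBM" above)
    f_sim=1.8,        # mid-point of the 1.6-2.0 range quoted from CDF
    f_debug=2.5,
    f_hlt=1.25,
    f_pack=2.0,
)
print(f"startup event size ~ {startup_kb:.0f} kB")  # ~844 kB with these placeholders

# As the experiment matures, F_debug and F_hlt tend toward 1, so with the same
# placeholders the size drops toward SIM_size x F_sim / F_pack.
mature_kb = event_size(300, 1.0, 1.8, 1.0, 1.0, 2.0)
print(f"mature event size ~ {mature_kb:.0f} kB")
```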
17. Recommendation 3
- At least in the early phase of the experiment, keep RECEVT as a single file, carrying around the RAW data
  - What we lose in storage we gain back in flexibility and immediacy of re-processing
- As the experiment matures, if economics allow, continue
  - Otherwise split into two files
18. Issue 4, Recommendation 4
- Offline streaming
  - Clear need for further partitioning of the data, beyond the HLT streams
  - Both CDF and D0 have this
    - Inclusive leptons → muons, electrons, taus (!), etc
  - We do not know how many are needed, but no reason to believe we need a number vastly different from CDF/D0
    - They have 50/25 different groups
- Suggestion: we split the offline output into 50 streams (take the upper value for safety)
  - These offline streams are the smallest data sample quanta handled by the physics groups
19. Issue 5
- Why are we doing all this? Who will read/use the RECEVT files?
- Let us for a moment (temporarily) define the RECOnstructed part of the event as the information needed by 90% of the analyses
  - Closest object to today's DST
- Unfortunately, neither CDF nor D0 use these files/formats to do analysis
  - CDF uses ROOT-tuples
  - D0 the TMBs
- Moral of the story: for all the previous stuff to have high value (beyond the massive re-production issues) we must have an analysis model that utilizes the CMS Event Data Model. Directly. As much as possible.
20. Data production
21. Issue 1: where/how it all happens
- We speak of having a copy of the RAW data at CERN (Tier-0) and shipping the Reconstructed data to the Tier-1 centers
  - Since the RAW data is part of the distribution, we now have two copies of the raw data
- Any reprocessing will result in another RAW data copy. Unless reprocessing overwrites/uses the resources of the previous processing
  - Implies that the Tier-1 center where re-processing is possible is fixed. (Since CERN cannot reprocess everything at the Tier-0 again)
  - Implies that the CERN copy is an idle one, a disaster-avoidance copy
- This is the least attractive scenario
22. Issue 1 continued
- In fact, the best possible scenario (for CMS) would be to treat these 1/5 chunks of the RAW data as coming in two copies which are distributed among all the Tier-1 centers.
  - i.e. we do not have an idle copy at CERN, but a second active copy at a second Tier-1 center
  - And for every 1/5 of the data, we have two sites to choose from where to re-do something.
- Of course there are costs associated with this choice
  - Active data storage managers. Who pays? etc.
- For completeness, the full list of options to be investigated:
23. Issue 1 continued
- Option 1: vaulted copy + distributed active copy
  - Tier-0: "dead" tape copy; emergency usage only; does not participate in re-processing
  - Tier-1: share of ESD; custodial ownership; re-processes its part
  - Physics/Analysis Coordination: distribute into T1s; re-processing pre-determined
- Option 2: active copy + distributed active copy
  - Tier-0: active copy at CERN; participates in re-processing
  - Tier-1: same
  - Physics/Analysis Coordination: more coordination required
- Option 3: two distributed active copies
  - Tier-0: basically no or little role
  - Tier-1: most serious role; data is stored in T1s!
  - Physics/Analysis Coordination: most difficult role here
24. Issue 1, Pseudo-Recommendation 1
- Avoid option 1, i.e. vaulted copy + distributed active copy
  - Little motivation for a 2nd copy in this case.
  - The CERN copy is, for all intents and purposes, only an insurance policy
    - Many current experiments do not have such insurance
  - The cost of a second copy should have significant benefits to the experiment
    - For example, more flexibility.
- Note that option 2 (active copy + distributed active copy) implies that the CERN Tier-1 throughput is 5 times larger than that of the other Tier-1 centers
- Looks like option 3 is the one we should think about as the most likely to survive financial and political requirements
25. Fill in details
- Reprocessing at Tier-X centers
  - Work out details (from the user point of view) of what happens when
  - Inputs for reconstruction
  - Calibration tasks
- Observations
  - CDF and D0 are moving towards the same reality: it takes about 100 kB of data to do physics at the Tevatron.
  - But they have taken two different initial positions
    - CDF: ALL data, keep decreasing
    - D0: the smallest possible data, keep increasing
  - Reminds me of adding variables in ntuples/objects in ROOT trees
26. Offline processing
27. Ill-defined issues
28. Next issues
29. Following Issues
- Content of DST
  - Usage pattern
- Content of mini-DST
  - Usage pattern
  - Number of copies, co-ordination of re-making
- Social contract: who runs what. What is forbidden, what is discouraged, what is available depending on resources, who keeps track
- Calibration/Conditions data: who creates it. Where it gets stored. How often is it used. Types of calibration data (low-level, high-level)
- Usage of simulated data. How much? Where? Why?
- Data and Job movement (!...)
- Interactive Analysis
30. Interactive analysis
- We must avoid the ROOT-tuple trap
- We must ensure that a (willing) user can, if (s)he wants, do ALL that (s)he wishes by running a single program, ORCA
- Huge advantages to this scheme
  - global debugging of code
  - huge gains in development: avoid re-inventing the car
  - single training objective
  - clear turn-on path onto CMS, from professor to student
  - avoid hand-corrections that are difficult to reproduce
  - minimize the dichotomy between developer and user communities. In a sense, every user will be a mini-developer. The question will be developer grade, not type
  - clear, straightforward feedback of analysis code into reconstruction code
  - what can be done is determined by what is at the input. Only.
31. Recommendation (final)
- We take it for granted that we will make a maximal investment in the EDM, COBRA, ORCA, OSCAR, FAMOS, and stick to it. And provide direct access to a plotting/analysis package that provides interactive capability
- Need an easy-to-learn, easy-to-use system for plot-making
  - Huge returns
  - ROOT is currently the most popular one in our field
  - Do this ROOT binding in the most careful and optimal way possible
32. Summary
33. Summary
- Overriding goal: to measure the right number first.
  - Irrespective of the elegance of the method, tactics, etc.
- We must learn from other people's mistakes
  - Plan for ours, so plan to re-process, and plan for flexibility
  - Plan to simplify things at the expense of beauty
  - Plan to duplicate/triplicate things where the simplicity of the rules/systems exceeds the anticipated costs
- Every experiment I have seen to this date ended up using orders of magnitude more computing than anticipated at inception time. CMS will be no exception.
- Things we need: people should think about the issues listed as pending and/or ill-defined and provide feedback