1
Data Model & Analysis Task
P. Sphicas
Computing Model Workshop, CPT Week
November 3, 2004
  • Outline
  • Introduction
  • Data Model
  • Data Production
  • Plus pending issues
  • Next issues
  • Summary

2
A bit of history, the charge and the actors
3
Data Model/Analysis
  • Data Management RTAG
  • Concentrated period of work, weekly meetings,
    plenty of discussion
  • most of it convergent
  • Delivered an unprecedented object in CMS: a
    document with an analysis/justification for
    things, along with an explicit set of
    recommendations
  • The current Data Model & Analysis (DMA) task is
  • the natural continuation of the DM RTAG
  • its completion, via the inclusion of analysis
    issues
  • Core discussion group consists of two people
    from the DM RTAG (L. Bauerdick and A. Yagil), Ms.
    CMSAnalysis (Lucia), Mr. Co-Reconstruction (N.
    Neumeister), Mr. Online-framework (Emilio) and
    one person from PRS (me).
  • A couple volunteers have stepped forward recently

4
Data Model/Analysis
  • Obviously, on numerous issues we cannot afford to
    actually test out completely, to the extent that
    we would like, all the alternatives
  • The implication is that we have to discuss
    things, think them out, make a number of educated
    guesses and then decide.
  • Scope of DMA covers much of what the Physics TDR
    will reveal
  • From chapter 1 of the CM document, the scope for
    DMA includes:
  • Event sizes, formats, streaming, DST/AOD/etc,
    calibration/conditions data, work flow, update
    freq, simulation samples (sizes, distribution),
    group needs for analysis, data and job movement,
    priority management, interactive analysis.
  • With the exception of a couple items, the Physics
    TDR is expected to deliver answers on the
    numbers/choices listed above
  • Put otherwise, this exercise is taking place
    roughly a year before it should. But it cannot
    wait, due to the LCG schedule.
  • Thus we are charged with analyzing all
    alternatives, providing an ordered list and an
    explicit recommendation on each.
  • And eventually we, the CM group, have to decide.

5
Data Model
6
Recommendation 1
  • Need streams out of the HLT farm
  • About 10 of them
  • Roughly equal size across them; factors of 2-3
    can be found/expected
  • One stream is an express line
  • In complete overlap with all the other streams
  • Every event in the express line stream is an
    event found in another stream as well
  • Justification
  • Facilitates prioritization of reconstruction
  • Facilitates application of re-calibration,
    re-reconstruction
  • Cost is tunable: any overlap of events across
    streams is a function of cuts. No reason why
    this overlap cannot stay at the 10% level (see
    the sketch below)
  • Depending on Grid economics, it MAY map more
    easily to a set of distributed Tier-1 centers (if
    re-processings done at Tier-1s)
  • No (obvious) advantage to having a single big
    stream
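  • A minimal sketch (not from the original slides)
    of such a streaming scheme, in Python, with
    hypothetical trigger cuts and stream names; it
    only illustrates that the express line duplicates
    events already routed to a primary stream, so the
    overlap cost is bounded by the express selection:

      # Illustrative sketch only: hypothetical trigger cuts and stream names.
      # Every accepted event lands in at least one primary stream; the express
      # line holds a small, fully overlapping subset for prompt use.

      PRIMARY_STREAMS = {              # assumed ~10 streams out of the HLT farm
          "muons":     lambda ev: ev["mu_pt"] > 20,
          "electrons": lambda ev: ev["el_pt"] > 25,
          "jets":      lambda ev: ev["jet_et"] > 100,
          # ... further streams, up to about 10
      }

      def route(event):
          """Return the set of streams an accepted event is written to."""
          streams = {name for name, accept in PRIMARY_STREAMS.items()
                     if accept(event)}
          if streams and event.get("express", False):
              streams.add("express")   # express events always exist in another stream too
          return streams

      ev = {"mu_pt": 35.0, "el_pt": 5.0, "jet_et": 40.0, "express": True}
      print(route(ev))                 # -> {'muons', 'express'} (set order may vary)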

7
Streams at the Tevatron (backup)
  • D0: all events in one stream (physics)
  • Plus one monitor stream
  • Production outputs two files: DST and a thumbnail
    (TMB)
  • TMB: 20-60 kB/event, a compressed version of the
    data meant for most analyses
  • Reconstruction yields 25 skims
  • CDF: multiple streams out of HLT (10)
  • Streams get reconstructed and produce physics
    lists/paths
  • Outputs one file (DST + raw data) called
    datasets
  • For SOME datasets, a compressed version of
    DST+RawData is written out. Independent dataset.
  • About 40-50 of them; production finishes here
  • Datasets are the smallest sample quantum seen
    (and managed) by physics groups
  • A physics group can be using one or many datasets

8
Issue 2
  • What is the output of production?
  • Model: Tier-0 processes raw data from the on-line
    system and outputs CMS Reconstructed Events
    (RECEVT).
  • Define the content of these RECEVT files.
  • Here the issue is more complicated, but before
    looking at options, here are some suggested
    guidelines:
  • The CMS system must be flexible. Put otherwise
    re-reconstruction should be straightforward and
    easy
  • Expect that we will be redoing things often
  • The CMS system should not be designed with the
    outside boundary conditions (e.g. available
    funding today) built into things.
  • Should have an iterative process of design/
    costing/comparing/re-designing/re-costing/
    re-comparing/re-re-designing/re-re-costing/...

9
Recommendation 2
  • We MUST set the following two strategic goals
  • Our reconstruction program should allow frequent
    re-processing of our data
  • It should be FAST: watch out for fancy features
    and certainly for fancy code
  • At the LHC it is far better to be first with the
    correct result, than to be second with a pretty
    code with polymorphism, multi-range applicability
    and redundancy-checking, etc etc (Nostradamus)
  • Our data and analysis model should allow frequent
    re-processing of our data
  • Raw data must be readily available to the
    reconstructing sites
  • Investigate what this implies

10
Raw data readily available
  • Three options for this:
  • keep raw data in RECEVT: wherever RECEVT lies,
    re-reconstruction can occur instantly
  • Need active storage management system
  • Implies multiple copies of RAW data, so need to
    watch cost
  • keep RECEVT split in two files, one for the RAW
    information and one for the RECOnstructed
    information
  • Need a data model that links the two (metadata);
    see the sketch below
  • Implies (again) distributing RAW data to multiple
    places
  • keep Reconstructed information completely
    separate from RAW information
  • Implies that re-reconstruction site options are
    very limited. At the limit, we could decide that
    CMS has one and only one copy of the raw data,
    and that copy is split across the potential
    reconstructors, i.e. the Tier-1 centers
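  • As a rough, hypothetical illustration of the
    second option (the class and field names below
    are invented, not from the slides), the RECO file
    can carry a lightweight metadata reference back
    to the RAW file, so re-reconstruction knows where
    its input lives:

      from dataclasses import dataclass

      # Hypothetical sketch of option 2: RAW and RECO kept in separate files,
      # tied together by metadata rather than by physical co-location.

      @dataclass
      class RawFileRef:
          lfn: str              # logical file name of the RAW file
          run: int
          first_event: int
          last_event: int

      @dataclass
      class RecoFile:
          lfn: str
          raw_input: RawFileRef   # the metadata link needed for re-reconstruction
          reco_version: str

      reco = RecoFile(
          lfn="reco/run1234_0001.root",
          raw_input=RawFileRef(lfn="raw/run1234_0001.dat", run=1234,
                               first_event=1, last_event=50000),
          reco_version="pass-1",
      )
      print(reco.raw_input.lfn)   # where to fetch the RAW data for a re-pass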

11
Raw Data in the RECEVT
  • To mitigate the cost of multiple copies of the
    raw data, one can consider compressing heavily
  • Clearly this costs in CPU and complexity
  • At startup, few events, so no need.
  • Afterwards, if indeed one reprocesses only once
    or twice per year, decompressing once or twice
    per year is really not a significant overhead
  • Complexity not significant
  • So, conclusion: yes, packing of the data is an
    element of our data model (see the sketch below)
  • To proceed we need to know how often we will be
    accessing what information
  • Not knowing this, we must look at the Tevatron
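  • A toy sketch of what packing buys and costs,
    using generic zlib compression on fake
    low-entropy detector data (real CMS packing would
    be detector-specific bit-packing, not a
    general-purpose compressor):

      import random
      import zlib

      # Toy illustration of the packing trade-off: CPU spent (de)compressing
      # versus storage saved.  The payload is fake, low-entropy detector data.
      random.seed(1)
      raw = bytes(random.getrandbits(4) for _ in range(300_000))  # ~300 kB

      packed = zlib.compress(raw, 6)
      gain = len(raw) / len(packed)
      print(f"packed {len(raw)/1e3:.0f} kB -> {len(packed)/1e3:.0f} kB "
            f"(gain ~ {gain:.1f}x)")

      assert zlib.decompress(packed) == raw   # unpacking is lossless and cheap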

12
Event files content
  • Two very different models
  • CDF is close to the RAW data
  • About 100 kB, kept in same physical file as the
    full event
  • There is about 50-60 kB of reconstructed objects
    added to this
  • This file re-emerges with every re-processing
  • There are, at any point in time, N copies of the
    raw data, where N is the number of active
    re-processings
  • Note: most users do NOT run on these files. They
    run analyses on group ROOT-tuples
  • D0 works on thumbnails
  • About 20-60kB/event. Number going up (to 100
    kB?)
  • Some limited re-reconstruction capability with
    TMBs
  • Most users run their analysis on skimmed TMBs
  • Skim production handled by D0 Common Samples
    Group
  • Usually in conjunction with TMB fixing

13
Reprocessing and TMB fixing
  • CDF reprocesses (nowadays) once per year
  • Their entire Run II sample
  • Major pluses: small event size, short processing
    time
  • But clearly, production still a big job
  • D0 has re-processed their data once so far
  • All other reprocessing takes the form of
    corrections/fixes applied to the TMBs
  • This is a major effort, lasts a few weeks
  • So far occurs about twice per year
  • Files in skims combined to make larger files
  • Negatives: rather slow reconstruction program,
    control over corrections that go into TMB-fixing

14
Recommendation 2
  • Premises:
  • we will reprocess data quite often, at least in
    the beginning
  • flexibility (aka re-running to fix
    bugs/features/calibration) is of higher
    importance than storage savings
  • All such statements should be understood to be
    true to a finite extent. When costs become
    prohibitive, obviously we take the other route
  • We must, therefore, have a model where the raw
    data is nearby/readily available
  • We should make every possible effort to keep the
    raw data as part of the RECEVT distribution.
  • Next decision: one or two files?

15
Issue 3: data size
  • Recipe for calculating CMS event size
  • Event size = SIM_size × F_unpack × F_sim ×
    F_debug × F_hlt / F_pack
  • SIM_size: our best estimate, using OSCAR and DAQ
    headers
  • F_unpack: factor describing going to the
    reconstruction representation
  • F_sim: factor describing how realistic the
    simulation is
  • F_debug: factor describing the increased amount
    of data we take in the beginning (because of
    looser readout/zero-suppression/etc)
  • F_hlt: same as F_debug, but for the HLT
  • F_pack: what we can gain by packing the data
    at the end of processing
  • These factors are not present throughout the
    lifetime of the experiment. Clearly, F_debug at
    some point becomes 1

16
Data size
  • We do not have final numbers for all these
    factors
  • Only first estimates. Given here for
    illustration (a worked sketch follows this list)
  • F_sim: 1.6-2.0
  • CDF sees roughly twice as many hits in real
    events as in simulated events
  • F_debug: 2.5
  • Depends very much on the phase of the experiment.
    These numbers are preliminary figures from CDF
  • F_hlt: 1.25
  • F_pack: 2.0
  • F_unpack: TBM
  • The new estimate of the CMS event size is 300 kB
    (at low lumi)
  • Low lumi is defined as 2×10^33 cm^-2 s^-1. From
    the DM RTAG. Suggestion: we adopt it.
  • Lumi-dependence for L > L_low: linear, slope 1
    (the number of interactions rises).
  • Looks like the DAQ event at startup is 1.5 MB.
    Asymptotically at high lumi, 1 MB. But this
    drops with time
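  • Purely for illustration, the recipe from the
    previous slide with these preliminary factors
    plugged in; SIM_size and F_unpack are NOT given
    on the slides, so the values used below are
    placeholders, not CMS estimates:

      # Sketch of the event-size recipe:
      #   size = SIM_size * F_unpack * F_sim * F_debug * F_hlt / F_pack
      # All sizes in kB.  SIM_size and F_unpack are HYPOTHETICAL placeholders.

      def event_size_kb(sim_size, f_unpack, f_sim, f_debug, f_hlt, f_pack):
          return sim_size * f_unpack * f_sim * f_debug * f_hlt / f_pack

      startup = event_size_kb(sim_size=100.0,   # placeholder, not a CMS number
                              f_unpack=1.5,     # placeholder ("TBM" on the slide)
                              f_sim=1.8,        # 1.6-2.0 quoted above
                              f_debug=2.5,
                              f_hlt=1.25,
                              f_pack=2.0)
      print(f"startup event size ~ {startup:.0f} kB")   # ~420 kB with these inputs

      # In mature running, F_debug and F_hlt fall towards 1 as the detector
      # is understood, so the stored event shrinks accordingly.
      mature = event_size_kb(100.0, 1.5, 1.8, 1.0, 1.0, 2.0)
      print(f"mature event size  ~ {mature:.0f} kB")    # ~135 kB with these inputs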

17
Recommendation 3
  • At least in the early phase of the experiment,
    keep RECEVT as a single file, carrying around the
    RAW data
  • What we lose in storage we gain back in
    flexibility and immediacy in re-processing
  • As the experiment matures, if economics allow,
    continue
  • Otherwise split into two files

18
Issue 4, Recommendation 4
  • Offline streaming
  • Clear need for further partitioning of the data,
    beyond the HLT streams
  • Both CDF and D0 have this
  • Inclusive leptons → muons, electrons, taus (!),
    etc
  • We do not know how many are needed, but no reason
    to believe we need a number vastly different from
    CDF/D0
  • They have 50/25 different groups
  • Suggestion: we split the offline output into 50
    streams (take the upper value for safety)
  • These offline streams are the smallest data
    sample quanta handled by the physics groups

19
Issue 5
  • Why are we doing all this? Who will read/use the
    RECEVT files?
  • Let us for a moment (temporarily) define the
    RECOnstructed part of the event as the
    information needed by 90% of the analyses
  • Closest object to today's DST
  • Unfortunately, neither CDF nor D0 use these
    files/format to do analysis
  • CDF uses ROOT-tuples
  • D0 the TMBs
  • Moral of the story: for all the previous stuff
    to have high value (beyond the massive
    re-production issues) we must have an analysis
    model that utilizes the CMS Event Data Model.
    Directly. As much as possible.

20
Data production
21
Issue 1: where/how it all happens
  • We speak of having a copy of the RAW data at CERN
    (Tier-0) and shipping the Reconstructed data to
    the Tier-1 centers
  • Since the RAW data is part of the distribution,
    we now have two copies of the raw data
  • Any reprocessing will result in another RAW data
    copy, unless the reprocessing overwrites/uses the
    resources of the previous processing
  • Implies that the Tier-1 center where
    re-processing is possible is fixed (since CERN
    cannot reprocess everything at the Tier-0 again)
  • Implies that the CERN copy is an idle one, a
    disaster-avoidance copy
  • This is the least attractive scenario

22
Issue 1 continued
  • In fact, best possible scenario (for CMS) would
    be to treat these 1/5 chunks of the RAW data as
    coming in two copies which are distributed among
    all the Tier-1 centers.
  • i.e. we do not have an idle copy at CERN, but a
    second active copy at a second Tier-1 center
  • And for every 1/5 of the data, we have two sites
    to choose from where to re-do something (see the
    sketch below).
  • Of course there are costs associated with this
    choice
  • Active data storage managers. Who pays? etc
  • For completeness, the full list of options to be
    investigated:
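  • A rough sketch of the "two distributed active
    copies" idea (Tier-1 site names and the chunk
    count of five are assumed for illustration): each
    1/5 chunk of the RAW data is hosted at two
    different Tier-1 centers, so every chunk has two
    candidate sites for any re-processing:

      # Hypothetical Tier-1 names; the slides only speak of "the Tier-1 centers".
      TIER1 = ["T1_A", "T1_B", "T1_C", "T1_D", "T1_E"]
      N_CHUNKS = 5   # the "1/5 chunks" of the RAW data mentioned above

      def place_two_copies(n_chunks, sites):
          """Give each chunk two distinct host sites (round-robin, shifted by one)."""
          placement = {}
          for i in range(n_chunks):
              first = sites[i % len(sites)]
              second = sites[(i + 1) % len(sites)]   # always a different site
              placement[f"raw_chunk_{i}"] = (first, second)
          return placement

      for chunk, hosts in place_two_copies(N_CHUNKS, TIER1).items():
          print(chunk, "->", hosts)   # two candidate sites for re-processing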

23
Issue 1 continued
Option vs. actor (Tier-0 / Tier-1 / Physics-Analysis Coordination):
  • Option 1: vaulted copy + distributed active copy
    Tier-0: "dead" tape copy, emergency usage only,
      does not participate in re-processing
    Tier-1: share of ESD, custodial ownership,
      re-processes its part
    Physics/Analysis Coordination: distribute into
      T1s, re-processing pre-determined
  • Option 2: active copy + distributed active copy
    Tier-0: active copy at CERN, participates in
      re-processing
    Tier-1: same as in option 1
    Physics/Analysis Coordination: more coordination
      required
  • Option 3: two distributed active copies
    Tier-0: basically no or little role
    Tier-1: most serious role, data is stored in the
      T1s!
    Physics/Analysis Coordination: most difficult
      role here
24
Issue 1, Pseudo-Recommendation 1
  • Avoid option 1, i.e. vaulted copy + distributed
    active copy
  • Little motivation for 2nd copy in this case.
  • CERN copy is, for all intents and purposes, only
    an insurance policy
  • Many current experiments do not have such
    insurance
  • The cost of a second copy should have significant
    benefits to the experiment
  • For example more flexibility.
  • Note that option 2 (active copy + distributed
    active copy) implies that the CERN Tier-1
    throughput is 5 times larger than that of other
    Tier-1 centers
  • Looks like option 3 is the one we should think
    about as the most likely to survive financial and
    political requirements

25
Fill in details
  • Reprocessing at Tier-X centers
  • Work out details (from the user point of view) of
    what happens when
  • Inputs for reconstruction
  • Calibration tasks
  • Observations:
  • CDF and D0 are moving towards the same reality:
    it takes about 100 kB of data to do physics at
    the Tevatron.
  • But they have taken two different initial
    positions
  • CDF: ALL data, keep decreasing
  • D0: littlest possible data, keep increasing
  • Reminds me of adding variables in ntuples/objects
    in root-trees

26
Offline processing
27
Ill-defined issues
28
Next issues
29
Following Issues
  • Content of DST
  • Usage pattern
  • Content of mini-DST
  • Usage pattern
  • Number of copies, co-ordination of re-making
  • Social contract: who runs what, what is
    forbidden, what is discouraged, what is available
    depending on resources, who keeps track
  • Calibration/Conditions data: who creates it,
    where it gets stored, how often it is used,
    types of calibration data (low-level,
    high-level)
  • Usage of simulated data. How much? Where? Why?
  • Data and Job movement (!...)
  • Interactive Analysis

30
Interactive analysis
  • We must avoid the ROOT-tuple trap
  • We must ensure that a (willing) user can, if
    (s)he wants, do ALL that (s)he wishes by running
    a single program, ORCA
  • Huge advantages to this scheme
  • global debugging of code
  • huge gains in development: avoid re-inventing the
    car
  • single-training objective
  • clear turn-on path onto CMS, from professor to
    student
  • avoid hand-corrections that are difficult to
    reproduce
  • minimize dichotomy between the developer and user
    communities. In a sense, every user will be a
    mini-developer. The question will be developer
    grade, not type
  • clear, straightforward feedback of analysis
    code into reconstruction code
  • what can be done is determined by what is at the
    input. Only.

31
Recommendation (final)
  • We take it for granted that we will make a
    maximal investment in the EDM, COBRA, ORCA,
    OSCAR, FAMOS, and stick to it. And provide
    direct access to a plotting/analysis package that
    provides interactive capability
  • Need an easy-to-learn, easy-to-use system for
    plot-making
  • Huge returns
  • ROOT is currently the most popular one in our
    field
  • Do this ROOT-binding in the most careful and
    optimal way possible

32
Summary
33
Summary
  • Overriding goal: to measure the right number
    first.
  • Irrespective of elegance of the method, tactics,
    etc
  • We must learn from other people's mistakes
  • Plan for ours, so plan to re-process, and plan
    for flexibility
  • Plan to simplify things at the expense of
    beauty
  • Plan to duplicate/triplicate things where the
    simplicity of the rules/systems exceeds the
    anticipated costs
  • Every experiment I have seen to this date, ended
    up using orders of magnitude more computing than
    anticipated at inception time. CMS will be no
    exception.
  • Things we need: people should think about the
    issues listed as pending and/or ill-defined and
    provide feedback