1
Status and future of Validation Tools for CMS
analysis
Nicola De Filippis
Dipartimento di Fisica dell'Università degli
Studi e del Politecnico di Bari e INFN
2
Overall scenarios
  • Present scenario of data management
  • 80% of data not usable at CERN because not
    validated correctly
  • anarchy in publishing
  • no complete dataset published
  • no reproducibility of the analysis (CRAB)
  • no mapping run/event in PubDB at each site
  • Future scenario of data management
  • complete datasets (MC data or real data):
    different concept of completeness?
  • mapping run/event in the DBS?
  • multiple validation flags in DBS?
  • data validated in producer centres: no further
    validation needed?

3
Current solution
  • ValidationTools_0_2_7
  • Data are validated at the end of the transfer
    to a remote site.
  • Missing or corrupted runs are optionally skipped
  • Data catalogs are published in PubDB at the end
    of the procedure
  • Data are ensured to be usable for analysis
  • New functionalities
  • Validation flag inserted in order to mark data
    samples as validated or not. A dataset that
    fails validation is published in PubDB with
    ValidationStatus=NOT_VALIDATED
  • "Technical" validation supported, based on the
    use of the batchCOBRA executable for DIGI
  • Solved the problem of the selection of
    run/events on SLC3 (by Vincenzo)
  • Improved interface to PhEDEx: the check on file
    existence was removed because it is ensured by
    PhEDEx
  • Stable validation monitoring performed by JAM
  • Added the "mmencode" executable to decompress
    the POOL fragment, because it is missing in the
    SLC3 distribution
  • CMS note updated with recent implementations,
    almost ready to be submitted.
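The validation-flag idea above can be sketched as follows. This is a minimal illustration: the record layout and the function name are assumptions, not the actual ValidationTools/PubDB interface.

```python
# Hypothetical sketch of the validation flag: a dataset whose expected runs
# are not all validated is still published, but carries
# ValidationStatus=NOT_VALIDATED so analyses can skip it.
def publish_record(dataset, validated_runs, expected_runs):
    """Build a PubDB-style record carrying a validation flag (illustrative)."""
    complete = bool(expected_runs) and set(expected_runs) <= set(validated_runs)
    status = "VALID" if complete else "NOT_VALIDATED"
    return {"Dataset": dataset, "ValidationStatus": status}
```

With this shape, an incomplete sample is never silently hidden: it appears in the catalog with the NOT_VALIDATED flag instead of being dropped.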

4
Proposal to DM team (solution 1)
  • Validation hierarchy
  • technical validation: file integrity guarantees
    (checksum, size) and matching of the information
    stored in the central database for Monte Carlo
    production, DBS (actually RefDB), against that
    extracted from local files. The data quality is
    not verified at this step, so it is not
    guaranteed that the end-user analysis can access
    the data correctly
  • transfer validation: ensures file integrity.
    Up to now it is covered by PhEDEx, the official
    CMS tool for data movement.
  • validation for analysis: should ensure full
    data (and metadata) consistency and readiness
    for analysis at the remote site. At this step
    data should be published in local file catalogs
    or databases.
  • physics validation: to be covered using
    specific ORCA executables for defined physics
    analyses, and to be addressed by people of the
    physics groups. Physics validation should also
    cover calibration validation.
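The technical-validation step above amounts to comparing a local file's size and checksum with the catalog values. A minimal sketch follows; MD5 is assumed here for illustration, since the actual checksum algorithm used in production is not specified on the slide.

```python
import hashlib
import os

def technical_check(path, expected_size, expected_md5):
    """Technical validation of one file: its size and checksum must both
    match the values recorded in the central catalog (MD5 assumed)."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large event files do not fill memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

As the slide notes, passing this check says nothing about data quality; it only guards against corrupted or truncated files.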

5
Where should validation run? (solution 1)
  • Technical validation should be performed at
    Monte Carlo producer centers or at the CERN
    Tier 0.
  • Validation for analysis has to be triggered at
    Tier 1s as soon as new data are transferred.
  • In principle no further validation, nor any of
    the previous ones, should be performed at
    Tier 2/3s.
  • A strategy has to be defined for handling
    metadata. Problems can arise with the validation
    of samples including just a subset of runs,
    because of the metadata content: in that case
    the validation for analysis has to be executed
    at the site receiving the data, and metadata
    should be filled at this stage.
  • Redundancy of the operations should be avoided.

6
Proposal to DM team (solution 2)
  • Validation hierarchy
  • technical validation and validation for
    analysis at producer centers or at CERN
  • transfer validation via PhEDEx
  • physics validation: to be covered using
    specific ORCA executables for defined physics
    analyses, and to be addressed by people of the
    physics groups. Physics validation should also
    cover calibration validation.
  • Pro
  • -- no validation work needed after the transfer
  • Contra
  • -- deep validation performed at producer
    centers or at CERN is not fast and could involve
    staging from tape

7
Ideas for technical validation (1)
Requirements
  -- Automatic and not too time-consuming
  -- To be performed daily
  -- Some light and fast executable
  -- Automatic publishing in PubDB
  -- Incremental validation as soon as new runs are
     produced
Tasks to be included
  -- Include file names and checksums of produced
     data
  -- Check summary files against RefDB/DBS
     information
  -- A way to flag conditions in RefDB/DBS if a
     file is lost between summary-file sending and
     publication, is lost later, or a run is just
     bad
  -- A tool to synchronize at will
Results
  -- Results of validation on production should go
     into the RefDB/DBS
  -- Information on analysis validation should go
     into the PubDB equivalent
  -- Several types of validation flags are needed
     to describe the different states of validation
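The incremental requirement above ("validate new runs as soon as they are produced") reduces, in essence, to a set difference between what is available and what is already handled. A minimal sketch, with illustrative names:

```python
def runs_to_validate(available_runs, already_validated, flagged_bad):
    """Incremental validation: select only the runs that are newly produced,
    not yet validated, and not already flagged as bad in RefDB/DBS, so the
    daily validation pass stays cheap."""
    return sorted(set(available_runs) - set(already_validated) - set(flagged_bad))
```

Each daily pass then processes only the returned runs and appends their results to the catalog, instead of revalidating the whole dataset.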
8
Ideas for technical validation (2)
  • A lighter-weight tool that does a more complete
    technical validation than batchCOBRA could be
    commissioned from the EDM group.
  • Contra
  • -- the problem with validation is usually not
    the length of time it takes to run the jobs,
    but rather the cost of staging data out from
    tape.
  • -- it could work if it is synchronized with the
    transfer, or with the staging out for the first
    analysis of the data.
  • Pro
  • -- the proposed incremental-validation
    functionality can be useful when transferring
    large datasets
  • -- it should be possible to validate data as
    they are being transferred and to allow access
    to the data before the whole dataset arrives.
    This is also important in environments where
    data will be streamed.
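The "validate while transferring" point can be sketched as a generator that releases each file to analysis as soon as it passes its check, instead of waiting for the full dataset. Names and the check callback are illustrative assumptions:

```python
def validate_stream(incoming_files, passes_check):
    """Yield each transferred file as soon as it passes validation, so
    analysis can start before the whole dataset has arrived. passes_check
    is any per-file validation callable (illustrative)."""
    for name in incoming_files:
        if passes_check(name):
            yield name
```

A consumer iterating over this generator sees validated files in arrival order, which is the behaviour needed for streamed data.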

9
Subsequent validations
Transfer validation
  -- check the list of files transferred and, for
     each file, check the size and checksum.
  -- additional lightweight/heavyweight validation
     steps as per site policy.
Validation for analysis
  -- the validation-for-analysis procedure has the
     purpose of ensuring data integrity and
     readiness for end-user analysis.
  -- COBRA and ORCA executables should mainly be
     used to perform the validation, with different
     levels of checks of data consistency.
  -- validated catalogs and information are
     published in the local database for
     publishing, currently PubDB.
  -- different use cases have to be supported, such
     as the validation of Monte Carlo datasets and
     of real data.
  -- a layer of validation monitoring also has to
     be provided, in order to check in real time
     the multiple steps of the validation in a
     given site.
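The transfer-validation step above (check the file list, then size and checksum per file) can be sketched as a comparison of the expected catalog against what arrived. The mapping layout is an assumption for illustration, not the PhEDEx schema:

```python
def transfer_check(catalog, arrived):
    """Transfer validation sketch. catalog and arrived both map
    file name -> (size, checksum); catalog holds the expected values,
    arrived holds what was measured after the transfer.
    Returns (missing, corrupted) file-name lists."""
    missing = sorted(name for name in catalog if name not in arrived)
    corrupted = sorted(name for name in catalog
                       if name in arrived and arrived[name] != catalog[name])
    return missing, corrupted
```

An empty pair of lists means the transfer is complete and intact; non-empty lists identify exactly which files to re-transfer or flag.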
10
Questions about future
  1. Should we go on with development and
     reengineering?
  2. Is an official commitment on validation by CMS
     data management reasonable?
  3. Could we ask the EDM group to develop a
     specific executable for validation?
  4. Will the validation be performed in the
     framework?
  5. Is there a place for our involvement in this
     task?
  6. What about publishing data in DBS? Does the
     information have to be validated?
  7. What about the evolution of the validation
     with PhEDEx?
  8. Do we have to support the mapping of
     run/events or event collections/events
     locally?