1. Status and future of Validation Tools for CMS analysis
Nicola De Filippis
Dipartimento di Fisica dell'Università degli Studi e del Politecnico di Bari e INFN
2. Overall scenarios
- Present scenario of data management
  - 80% of data not usable at CERN because not validated correctly; anarchy in publishing
  - no complete dataset published
  - no reproducibility of the analysis (CRAB)
  - no mapping run/event in PubDB at each site
- Future scenario of data management
  - complete datasets (MC data or real data): a different concept of completeness?
  - mapping run/event in the DBS?
  - multiple validation flags in DBS?
  - data validated in producer centres: no further validation needed?
3. Current solution
- ValidationTools_0_2_7
  - Data are validated at the end of the transfer to a remote site.
  - Missing or corrupted runs are optionally skipped.
  - Data catalogs are published in PubDB at the end of the procedure.
  - Data are ensured to be usable for analysis.
- New functionalities
  - Validation flag inserted in order to mark data samples as validated or not; non-validated datasets will be published in PubDB with ValidationStatus = NOT_VALIDATED.
  - "Technical" validation supported, based on the use of the batchcobra executable for DIGI.
  - Solved the problem of the selection of run/events on SLC3 (by Vincenzo!).
  - Improved interface to Phedex; the check on file existence was removed because it is ensured by Phedex.
  - Stable validation monitoring performed by JAM.
  - Added the "mmencode" executable to decompress pool fragments, because it is missing in the SLC3 distribution.
  - CMS note updated with the recent implementations, almost ready to be submitted.
4. Proposal to DM team (solution 1)
- Validation hierarchy
  - technical validation: file integrity guarantees (checksum, size) and matching of the information stored in the central database for Monte Carlo production, DBS (currently RefDB), against that extracted from the local files. The data quality is not verified at this step, so it is not guaranteed that the end-user analysis can access the data correctly.
  - transfer validation: to ensure file integrity. It is covered up to now by the CMS official tool for data movement, Phedex.
  - validation for analysis: should ensure full data (and metadata) consistency and readiness for analysis at the remote site. At this step data should be published in local file catalogs or databases.
  - physics validation: to be covered using specific ORCA executables for defined physics analyses, and should be addressed by people of the physics groups. Physics validation should also cover calibration validation.
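The technical-validation step described above (checksum and size matching against the central bookkeeping) can be sketched as follows. This is a minimal illustration only: the function names and the record layout are assumptions, not the actual RefDB/DBS interface.

```python
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a local file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def technical_validation(local_files, central_records):
    """Compare size and checksum of local files against hypothetical
    central-database records (standing in for RefDB/DBS).

    local_files:     dict mapping file name -> local path
    central_records: dict mapping file name -> {"size": ..., "checksum": ...}
    Returns (good, bad), where bad pairs each file with a failure reason.
    """
    good, bad = [], []
    for name, record in central_records.items():
        path = local_files.get(name)
        if path is None or not os.path.exists(path):
            bad.append((name, "missing"))
        elif os.path.getsize(path) != record["size"]:
            bad.append((name, "size mismatch"))
        elif md5sum(path) != record["checksum"]:
            bad.append((name, "checksum mismatch"))
        else:
            good.append(name)
    return good, bad
```

Note that the data quality is deliberately not inspected here, matching the slide: a file can pass this check and still be unreadable by an analysis job.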
5. Where validation? (solution 1)
- Technical validation should be performed at Monte Carlo producer centers or at the CERN Tier 0.
- The validation for analysis has to be triggered at Tier 1s as soon as new data are transferred.
- In principle, no further validation, nor any repetition of a previous one, should be performed at Tier 2/3s.
- A strategy has to be defined for handling metadata. Problems can arise with the validation of samples including just a subset of runs, because of the metadata content; in that case the validation for analysis has to be executed at the site receiving the data, and the metadata should be filled at this stage.
- Redundancy of the operations should be avoided.
6. Proposal to DM team (solution 2)
- Validation hierarchy
  - technical validation and validation for analysis at producer centers or at CERN
  - transfer validation via Phedex
  - physics validation: to be covered using specific ORCA executables for defined physics analyses, and should be addressed by people of the physics groups. Physics validation should also cover calibration validation.
- Pro
  - no validation work needed after the transfer
- Contra
  - deep validation performed at producer centers or at CERN is not fast and could involve staging from tape
7. Ideas for technical validation (1)
- Requirements
  - automatic and not too time-consuming
  - to be performed daily
  - some light and fast executable
  - automatic publishing in PubDB
  - incremental validation as soon as new runs are produced
- Tasks to be included
  - include file names and checksums of produced data
  - check summary files against RefDB/DBS information
  - a way to flag conditions in RefDB/DBS if a file is lost between summary-file sending and publication, is lost later, or a run is just bad
  - a tool to synchronize at will
- Results
  - results of validation on production should go into RefDB/DBS
  - information on analysis validation should go into the PubDB equivalent
  - several types of validation flags are needed to describe the different states of validation
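The "several types of validation flags" could be modeled, for illustration only, as a small enumeration. Only NOT_VALIDATED appears in these slides; the other flag names, and the `flag_dataset` helper, are hypothetical and do not reflect the actual DBS schema.

```python
from enum import Enum


class ValidationStatus(Enum):
    """Possible validation states of a dataset. Only NOT_VALIDATED is
    taken from the slides; the remaining names are hypothetical."""
    NOT_VALIDATED = "NOT_VALIDATED"  # published but never validated
    TECHNICAL_OK = "TECHNICAL_OK"    # bookkeeping matches local files
    TRANSFER_OK = "TRANSFER_OK"      # integrity verified after transfer
    ANALYSIS_OK = "ANALYSIS_OK"      # readable by analysis executables
    PHYSICS_OK = "PHYSICS_OK"        # physics-level checks passed
    BAD_RUN = "BAD_RUN"              # run flagged as lost or just bad


def flag_dataset(catalog, dataset, status):
    """Append a validation flag to a dataset's history in a plain dict
    standing in for the RefDB/DBS bookkeeping; returns the new flag."""
    catalog.setdefault(dataset, []).append(status)
    return catalog[dataset][-1]
```

Keeping the full history of flags, rather than one mutable field, would let a site see that a dataset regressed (e.g. from TRANSFER_OK to BAD_RUN) after a later loss.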
8. Ideas for technical validation (2)
- A lighter-weight tool that does a more complete technical validation than batchCOBRA could be commissioned from the EDM group.
- Contra
  - the problem with validation is not usually the length of time it takes to run the jobs, but rather the cost of staging data out from tape
  - it could work if it is synchronized with the transfer, or with the staging out for the first analysis of the data
- Pro
  - the proposed incremental-validation functionality can be useful when transferring large datasets
  - it should be possible to validate data as they are being transferred and allow access to the data before the whole dataset arrives; this is also important in environments where data will be streamed
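The incremental idea above can be sketched as a pass that validates only the files not yet seen, so access can be opened before the whole dataset arrives. The function names and the pluggable per-file check are assumptions for illustration.

```python
def incremental_validate(transferred, already_valid, validate_file):
    """Validate only the newly arrived files of a partially transferred
    dataset.

    transferred:   iterable of file names that have arrived so far
    already_valid: set of file names validated in earlier passes
                   (updated in place with the newly validated files)
    validate_file: callable name -> bool doing the actual per-file check
    Returns (newly_valid, failed) for this pass.
    """
    newly_valid, failed = [], []
    for name in transferred:
        if name in already_valid:
            continue  # validated in an earlier pass; skip it
        if validate_file(name):
            newly_valid.append(name)
        else:
            failed.append(name)
    already_valid.update(newly_valid)
    return newly_valid, failed
```

Run daily (or on each transfer notification), each pass costs only as much as the new files, which is what makes the scheme compatible with streaming transfers.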
9. Subsequent validations
- Transfer validation
  - check the list of files transferred and, for each file, check the size and checksum
  - additional lightweight/heavyweight validation steps as per site policy
- Validation for analysis
  - the validation-for-analysis procedure has the purpose of ensuring data integrity and readiness for end-user analysis
  - COBRA and ORCA executables should mainly be used to perform the validation, with different levels of checks of data consistency
  - validated catalogs and information are published in the local database for publishing, currently PubDB
  - different use cases have to be supported, such as the validation of Monte Carlo datasets and of real data
  - a layer of validation monitoring also has to be provided in order to check in real time the multiple steps of the validation in a given site
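The transfer-validation check above (file list plus per-file size) might look like this sketch, where both catalogs are plain dicts mapping file name to size in bytes; the checksum comparison mentioned in the slide would follow the same pattern with a second field per file. The function name is an assumption.

```python
def transfer_check(expected, transferred):
    """Compare the expected file catalog against what actually arrived.

    expected:    dict mapping file name -> expected size in bytes
    transferred: dict mapping file name -> size found on disk
    Returns three sorted lists: files that arrived with the right size,
    files that arrived with a wrong size, and files still missing.
    """
    ok = sorted(name for name, size in expected.items()
                if transferred.get(name) == size)
    wrong_size = sorted(name for name in expected
                        if name in transferred
                        and transferred[name] != expected[name])
    missing = sorted(set(expected) - set(transferred))
    return ok, wrong_size, missing
```

A site policy layer would then decide what to do with the `wrong_size` and `missing` lists: re-trigger the transfer, skip the runs, or flag the dataset.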
10. Questions about the future
- Should we go on with development and re-engineering?
- Is an official commitment on validation by CMS data management reasonable?
- Could we ask the EDM group to develop a specific executable for validation?
- Will the validation be performed in the framework?
- Is there a place for our involvement in this task?
- What about publishing data in DBS? Does the information have to be validated?
- What about the evolution of the validation with Phedex?
- Do we have to support the mapping run/events, or event collections/events, locally?