1
Status and future of Validation Tools for CMS
analysis
Nicola De Filippis
Dipartimento di Fisica dell'Università degli
Studi e del Politecnico di Bari e INFN
2
Overall scenarios
  • Present scenario of data management
  • 80% of data not usable at CERN because not
    validated correctly
  • anarchy in publishing
  • no complete dataset published
  • no reproducibility of the analysis (CRAB)
  • no mapping run/event in PubDB at each site
  • Future scenario of data management
  • complete datasets (MC data or real data):
    different concept of completeness?
  • mapping run/event in the DBS?
  • multiple validation flags in DBS?
  • data validated in producer centres: no further
    validation needed?

3
Current solution
  • ValidationTools_0_2_7
  • Data are validated at the end of the transfer
    to a remote site.
  • Missing or corrupted runs are optionally skipped
  • Data catalogs are published in PubDB at the end
    of the procedure
  • Data are ensured to be usable for analysis
  • New functionalities
  • Validation flag inserted in order to mark data
    samples as validated or not. A dataset that
    fails validation is published in PubDB with
    ValidationStatus=NOT_VALIDATED
  • "Technical" validation supported, based on the
    use of the batchCOBRA executable for DIGI
  • Solved the problem of the selection of
    run/events on SLC3 (by Vincenzo)
  • Improved interface to PhEDEx: the check on file
    existence was removed because it is ensured by
    PhEDEx
  • Stable validation monitoring performed by JAM
  • Added the "mmencode" executable to decompress
    the POOL fragment, because it is missing in the
    SLC3 distribution
  • CMS note updated with recent implementations,
    almost ready to be submitted.
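The validation-flag idea above can be sketched as follows. This is a minimal illustration: the record layout and the function name are assumptions, not the actual ValidationTools/PubDB interface.

```python
# Hypothetical sketch of the validation flag: a dataset whose expected runs
# are not all validated is still published, but carries
# ValidationStatus=NOT_VALIDATED so analyses can skip it.
def publish_record(dataset, validated_runs, expected_runs):
    """Build a PubDB-style record carrying a validation flag (illustrative)."""
    complete = bool(expected_runs) and set(expected_runs) <= set(validated_runs)
    status = "VALID" if complete else "NOT_VALIDATED"
    return {"Dataset": dataset, "ValidationStatus": status}
```

With this shape, an incomplete sample is never silently hidden: it appears in the catalog with the NOT_VALIDATED flag instead of being dropped.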

4
Proposal to DM team (solution 1)
  • Validation hierarchy
  • technical validation: file integrity guarantees
    (checksum, size) and matching of the information
    stored in the central database for Monte Carlo
    production, DBS (actually RefDB), against that
    extracted from local files. The data quality is
    not verified at this step, so it is not
    guaranteed that the end-user analysis can access
    the data correctly
  • transfer validation: ensures file integrity.
    Up to now it is covered by PhEDEx, the official
    CMS tool for data movement.
  • validation for analysis: should ensure full
    data (and metadata) consistency and readiness
    for analysis at the remote site. At this step
    data should be published in local file catalogs
    or databases.
  • physics validation: to be covered using
    specific ORCA executables for defined physics
    analyses, and to be addressed by people of the
    physics groups. Physics validation should also
    cover calibration validation.
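The technical-validation step above amounts to comparing a local file's size and checksum with the catalog values. A minimal sketch follows; MD5 is assumed here for illustration, since the actual checksum algorithm used in production is not specified on the slide.

```python
import hashlib
import os

def technical_check(path, expected_size, expected_md5):
    """Technical validation of one file: its size and checksum must both
    match the values recorded in the central catalog (MD5 assumed)."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large event files do not fill memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

As the slide notes, passing this check says nothing about data quality; it only guards against corrupted or truncated files.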

5
Where should validation run? (solution 1)
  • Technical validation should be performed at
    Monte Carlo producer centers or at the CERN
    Tier 0.
  • Validation for analysis has to be triggered at
    Tier 1s as soon as new data are transferred.
  • In principle no further validation, nor any of
    the previous ones, should be performed at
    Tier 2/3s.
  • A strategy has to be defined for handling
    metadata. Problems can arise with the validation
    of samples including just a subset of runs,
    because of the metadata content: in that case
    the validation for analysis has to be executed
    at the site receiving the data, and metadata
    should be filled at this stage.
  • Redundancy of the operations should be avoided.

6
Proposal to DM team (solution 2)
  • Validation hierarchy
  • technical validation and validation for
    analysis at producer centers or at CERN
  • transfer validation via PhEDEx
  • physics validation: to be covered using
    specific ORCA executables for defined physics
    analyses, and to be addressed by people of the
    physics groups. Physics validation should also
    cover calibration validation.
  • Pro
  • -- no validation work needed after the transfer
  • Contra
  • -- deep validation performed at producer
    centers or at CERN is not fast and could involve
    staging from tape

7
Ideas for technical validation (1)
Requirements
  -- Automatic and not too time-consuming
  -- To be performed daily
  -- Some light and fast executable
  -- Automatic publishing in PubDB
  -- Incremental validation as soon as new runs are
     produced
Tasks to be included
  -- Include file names and checksums of produced
     data
  -- Check summary files against RefDB/DBS
     information
  -- A way to flag conditions in RefDB/DBS if a
     file is lost between summary-file sending and
     publication, is lost later, or a run is just
     bad
  -- A tool to synchronize at will
Results
  -- Results of validation on production should go
     into the RefDB/DBS
  -- Information on analysis validation should go
     into the PubDB equivalent
  -- Several types of validation flags are needed
     to describe the different states of validation
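The incremental requirement above ("validate new runs as soon as they are produced") reduces, in essence, to a set difference between what is available and what is already handled. A minimal sketch, with illustrative names:

```python
def runs_to_validate(available_runs, already_validated, flagged_bad):
    """Incremental validation: select only the runs that are newly produced,
    not yet validated, and not already flagged as bad in RefDB/DBS, so the
    daily validation pass stays cheap."""
    return sorted(set(available_runs) - set(already_validated) - set(flagged_bad))
```

Each daily pass then processes only the returned runs and appends their results to the catalog, instead of revalidating the whole dataset.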
8
Ideas for technical validation (2)
  • A lighter-weight tool that does a more complete
    technical validation than batchCOBRA could be
    commissioned from the EDM group.
  • Contra
  • -- the problem with validation is usually not
    the length of time it takes to run the jobs,
    but rather the cost of staging data out from
    tape.
  • -- it could work if it is synchronized with the
    transfer, or with the staging out for the first
    analysis of the data.
  • Pro
  • -- the proposed incremental-validation
    functionality can be useful when transferring
    large datasets
  • -- it should be possible to validate data as
    they are being transferred and to allow access
    to the data before the whole dataset arrives.
    This is also important in environments where
    data will be streamed.
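The "validate while transferring" point can be sketched as a generator that releases each file to analysis as soon as it passes its check, instead of waiting for the full dataset. Names and the check callback are illustrative assumptions:

```python
def validate_stream(incoming_files, passes_check):
    """Yield each transferred file as soon as it passes validation, so
    analysis can start before the whole dataset has arrived. passes_check
    is any per-file validation callable (illustrative)."""
    for name in incoming_files:
        if passes_check(name):
            yield name
```

A consumer iterating over this generator sees validated files in arrival order, which is the behaviour needed for streamed data.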

9
Subsequent validations
Transfer validation
  -- check the list of files transferred and, for
     each file, check the size and checksum.
  -- additional lightweight/heavyweight validation
     steps as per site policy.
Validation for analysis
  -- the validation-for-analysis procedure has the
     purpose of ensuring data integrity and
     readiness for end-user analysis.
  -- COBRA and ORCA executables should mainly be
     used to perform the validation, with different
     levels of checks of data consistency.
  -- validated catalogs and information are
     published in the local database for
     publishing, currently PubDB.
  -- different use cases have to be supported, such
     as the validation of Monte Carlo datasets and
     of real data.
  -- a layer of validation monitoring also has to
     be provided, in order to check in real time
     the multiple steps of the validation in a
     given site.
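The transfer-validation step above (check the file list, then size and checksum per file) can be sketched as a comparison of the expected catalog against what arrived. The mapping layout is an assumption for illustration, not the PhEDEx schema:

```python
def transfer_check(catalog, arrived):
    """Transfer validation sketch. catalog and arrived both map
    file name -> (size, checksum); catalog holds the expected values,
    arrived holds what was measured after the transfer.
    Returns (missing, corrupted) file-name lists."""
    missing = sorted(name for name in catalog if name not in arrived)
    corrupted = sorted(name for name in catalog
                       if name in arrived and arrived[name] != catalog[name])
    return missing, corrupted
```

An empty pair of lists means the transfer is complete and intact; non-empty lists identify exactly which files to re-transfer or flag.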
10
Questions about future
  1. Should we go on with development and
     reengineering?
  2. Is an official commitment on validation by CMS
     data management reasonable?
  3. Could we ask the EDM group to develop a
     specific executable for validation?
  4. Will the validation be performed in the
     framework?
  5. Is there a place for our involvement in this
     task?
  6. What about publishing data in DBS? Does the
     information have to be validated?
  7. What about the evolution of the validation
     with PhEDEx?
  8. Do we have to support the mapping of
     run/events or event collections/events
     locally?