Title: Implementing Metadata Using RLS/LCG
1 Implementing Metadata Using RLS/LCG
- James Cunha Werner
- University of Manchester
http://www.hep.man.ac.uk/u/jamwer/
2 BaBar Experiment
- The BaBar experiment studies the differences between matter and antimatter, to throw light on the problem, posed by Sakharov, of how the matter-antimatter symmetric Big Bang can have given rise to today's matter-dominated universe.
- High-energy collisions between electrons and positrons produce other elementary particles, giving tracks and clusters which are recorded by several high-granularity detectors and from which the properties of the short-lived particles can be deduced.
3 - Each recorded collision, called an event, comprises a large volume of data, and thousands of millions of events are recorded, giving a total dataset size of hundreds of thousands of Gigabytes (or hundreds of Terabytes).
4 Sources of Data in BaBar
5 Amount of data
                      Files     Size (TB)   Events (Million)
  Run1                 6,972       2.0            593
  Run2                11,527       6.3          1,925
  Run3                 7,383       3.2            951
  Run4                16,671      12.2          3,999
  Run5 (2xRun4) ???   32,000      24            8,000
  Run6 (2xRun5) ???   64,000      48           16,000
  Run7 (2xRun6) ???  128,000     100           32,000
SuperBaBar!
Systematic errors >>> statistical errors
The same amount of Monte Carlo generated data!
6 Data Structure
- The user interface to the eventstore is the event "collection". Each collection represents an ordered series of N events, and a user can choose to read the events from the first one in the sequence or from any given offset into the sequence.
- Data components:
- hdr - event header
- usr - user data
- tag - tag information
- cnd - candidate information
- aod - "analysis object data"
- tru - MC truth data (only in MC data)
- esd - "event summary data"
- sim - "sim" data from BgsApp or MooseApp, like GHits/GVertices (only in MC data)
- raw - subset of raw data from xtc persisted in the Kanga eventstore
7 Data organisation
- How data are stored (level of detail)
- micro = hdr + usr + tag + cnd + aod (+ tru)
- mini = micro + esd
- Data access
- collections - these are "logical" names that users use to configure their jobs. These are site-independent, so (assuming the site has imported the data) the same collection name should work at any site.
- logical file names (LFNs) - these are site-independent names given to all files in the eventstore. Any references within the event data itself _must_ use LFNs so that they remain valid when files are moved from site to site.
- physical file names (PFNs) - these are file names that will vary from site to site. In practice they are usually derived from the LFNs by adding a prefix that encapsulates how the data is accessed at that site (see the sketch below).
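A minimal sketch of that LFN-to-PFN mapping; the site prefix, access protocol, and file name here are hypothetical examples, not actual BaBar site configuration:

#!/bin/bash
# Hypothetical site-local prefix; each site defines its own access method.
SITE_PREFIX="root://se.example-site.ac.uk//babar/eventstore"

# Derive a site-specific PFN from a site-independent LFN by prepending the prefix.
lfn_to_pfn() {
    lfn="$1"                                  # e.g. lfn:AllEventsSkim-Run4.01.root (invented)
    echo "${SITE_PREFIX}/${lfn#lfn:}"
}

lfn_to_pfn "lfn:AllEventsSkim-Run4.01.root"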
9 Feeding RLS with metadata
- Generation of the basic metadata files with file selection:
  #!/bin/bash
  BbkDatasetTcl --dbsite=local > MetaLista.txt
  cat MetaLista.txt | awk '{print "BbkDatasetTcl --site local --nolocal \""$1"\""}' >> geratcl
  chmod 700 geratcl
  ./geratcl
- Feeding RLS with the basic files:
  #!/bin/bash
  ls *.tcl | awk '{split($1,a,"."); print "edg-rm --vo babar cr file:///home/jamwer/PgmCM2/MetaData/"$1" -l lfn:"a[1]" > "a[1]".rlstok"}' >> alimrls
  chmod 700 alimrls
  ./alimrls
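For illustration, one line of the generated alimrls script above expands to something like the following (the dataset name is invented); each registration stores the returned GUID in a .rlstok token file:

# Hypothetical expansion of one generated registration line:
edg-rm --vo babar cr file:///home/jamwer/PgmCM2/MetaData/AllEventsSkim-Run4.tcl \
    -l lfn:AllEventsSkim-Run4 > AllEventsSkim-Run4.rlstok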
10 Conformity CE catalogue
- Run evaluation software to establish CE conformity and perform the catalogue update:
  #!/bin/bash
  ldapsearch -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -b 'Mds-vo-name=local,o=Grid' \
    '(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:babar))' | grep "GlueCEUniqueID" > cenames.txt
  cat cenames.txt | awk '{print "./catal "$2}' > subload.sh
  chmod 700 subload.sh
  ./subload.sh
  cat loadrlssubm >> 1.histo
  cat 1.histo | awk '/Sub/ {FileName=$2} /https/ {HandleName=$2; print "echo " HandleName " > " FileName".tok"}' >> gridtok
  chmod 700 gridtok
  ./gridtok
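Each GlueCEUniqueID returned above has the form host:port/jobmanager-type-queue. A small helper like the following (hypothetical, not part of the original scripts) splits it into the pieces later used in the per-CE aliases:

#!/bin/bash
# Hypothetical helper: split a GlueCEUniqueID such as
#   ce01.example.ac.uk:2119/jobmanager-lcgpbs-babar   (invented value)
# into its host, port and queue components.
ce_id="ce01.example.ac.uk:2119/jobmanager-lcgpbs-babar"
ce_host="${ce_id%%:*}"                           # ce01.example.ac.uk
ce_port="${ce_id#*:}"; ce_port="${ce_port%%/*}"  # 2119
ce_queue="${ce_id##*-}"                          # babar
echo "$ce_host $ce_port $ce_queue"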
11 Conformity validation
- Verify whether the site follows the experiment's standards:
  #!/bin/bash
  echo Hostname
  /bin/hostname
  echo Start time
  /bin/date
  echo
  local=`pwd`
  echo Babar initialisation
  . $VO_BABAR_SW_DIR/babar-grid-setup-env.sh
  echo
  echo "Environment variables"
  printenv
  echo
  cd $local
  echo "Files available locally"
  ls
  echo
  echo " - - - - - - - - - - - - - - - - - - - - "
  echo
  cd $BFDIST/releases/14.5.2
  srtpath 14.5.2 Linux24RH72_i386_gcc2953
  cd $local
  BbkDatasetTcl --dbsite=local > MetaLista.txt
  cat MetaLista.txt | awk '{print "BbkDatasetTcl --site local \""$1"\""}' >> geratcl
  chmod 700 geratcl
  ./geratcl
  export CE_NAME=$1
  ls *.tcl | awk -v site=$CE_NAME '{split($1,a,"."); print "edg-rm --vo babar addAlias `cat "$1"` lfn:"a[1]"."site}' >> alimrls
  chmod 700 alimrls
  ./alimrls
  echo
  echo " - - - - - - - - - - - - - - - - - - - - "
  echo
  echo End time
  /bin/date
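For illustration, one line of the alimrls script generated above would expand to something like this (dataset and CE names are invented); the backquoted cat supplies the identifier to which the site-specific alias is attached:

# Hypothetical expansion of one generated addAlias line:
edg-rm --vo babar addAlias `cat AllEventsSkim-Run4.tcl` \
    lfn:AllEventsSkim-Run4.ce01.example.ac.uk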
12 Analysis Submission to Grid (Prototype)
- Single command: ./easygrid dataset_name
- Performs handler management and submission
- Configurable to meet users' requirements
- Software based on a state machine (see the sketch after this list)
- Verifies whether skimData is available
- If not available, runs BbkDatasetTcl to generate the skimData. Each file will be a job.
- Verifies whether there are handlers pending
- If not, generates scripts (gera.c) with edg-job-submit and ClassAds, and executes them. This is the place where submission policy and optimisation are applied.
- If yes, verifies the job status. When all the jobs have ended, recovers the results into the user's folder.
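A minimal sketch of that state machine, assuming one skimData .tcl file per job and one .tok file per submitted job handle; gera, the submit script name, and the file layout are assumptions, not the actual easygrid internals:

#!/bin/bash
# Hypothetical sketch of the easygrid state machine (not the original source).
dataset=$1

if ! ls ${dataset}*.tcl >/dev/null 2>&1; then
    # State 1: no skimData yet - generate it (each resulting file becomes one job),
    # using the BbkDatasetTcl invocation form shown earlier in this talk.
    BbkDatasetTcl --site local --nolocal "${dataset}"
elif ! ls ${dataset}*.tok >/dev/null 2>&1; then
    # State 2: skimData present, no handlers pending - generate JDL/scripts and submit.
    ./gera "${dataset}"              # hypothetical generator, cf. slide 16
    ./submit_${dataset}.sh           # hypothetical submission script produced by gera
else
    # State 3: handlers pending - poll job status; when done, retrieve the output.
    for tok in ${dataset}*.tok; do
        handle=`cat "$tok"`
        if edg-job-status "$handle" | grep -q "Done"; then
            edg-job-get-output --dir "$HOME/${dataset}" "$handle"
            rm "$tok"
        fi
    done
fi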
13 Job Submission system, metadata and data
14 Metadata/Event files and Computing Elements
For each dataset there is a metadata file containing the names of the event files. These physical files are registered with the RLS, with several logical file names in the format datasetname_CEJobQueue assigned to them as aliases, showing the CEs which hold copies of that dataset. Searching all the aliases for a dataset name provides the list of CEs to which jobs can be submitted (see the sketch below).
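A sketch of that alias search in bash, assuming the aliases for the dataset have already been dumped to a plain file aliases.txt (one alias per line); the file, its contents, and the dataset name are invented for the example:

#!/bin/bash
# Hypothetical: aliases.txt contains one alias per line, in the
# datasetname_CEJobQueue format described above, e.g.
#   AllEventsSkim-Run4_ce01.example.ac.uk_2119_jobmanager-lcgpbs-babar
dataset=$1

# Keep only this dataset's aliases and strip the dataset prefix,
# leaving the CE/job-queue identifiers where copies of the dataset exist.
grep "^${dataset}_" aliases.txt | sed "s/^${dataset}_//" | sort -u > ce_list.txt
cat ce_list.txt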
15 Managing large files on the Grid
- The analysis executable is stored on the SE and its logical file name (LFN) is also catalogued in the RLS, so any WN needs to download it only once (see the sketch below).
- Metadata is used not only for data, but to support other files as well.
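A minimal sketch of that pattern; the executable name and LFN are invented, the registration uses the edg-rm cr command shown earlier, and the WN-side copyFile step is an assumption about the replica manager client:

# On the submission host: copy the analysis executable to an SE and register it
# under a logical file name (names invented for the example).
edg-rm --vo babar cr file:///home/jamwer/analysis/MyAnalysisApp \
    -l lfn:MyAnalysisApp-14.5.2

# On a WN: fetch it once by LFN (assumes the replica manager's copyFile command).
edg-rm --vo babar copyFile lfn:MyAnalysisApp-14.5.2 file://`pwd`/MyAnalysisApp
chmod +x MyAnalysisApp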
16 Gera
- Generates all the information necessary to submit the jobs to the Grid:
- Job Description Language (JDL) files
- the script with all the tasks needed to run the analysis remotely on a WN
- some grid-dependent analysis parameters
- The JDL files define the input sandbox with all the files to be transferred (see the sketch below).
- The WN load-balancing algorithm matches requirements so that the task is performed optimally.
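A hedged sketch of the kind of output a gera-style generator might produce for one skimData file; the executable, file names, Requirements and Rank expressions are illustrative assumptions, not the actual gera.c output:

#!/bin/bash
# Hypothetical gera-style generation of one JDL per skimData file (names invented).
tcl=AllEventsSkim-Run4.tcl
job=${tcl%.tcl}
cat > ${job}.jdl <<EOF
Executable    = "runBabarAnalysis.sh";
Arguments     = "${tcl}";
StdOutput     = "${job}.out";
StdError      = "${job}.err";
InputSandbox  = {"runBabarAnalysis.sh", "${tcl}"};
OutputSandbox = {"${job}.out", "${job}.err"};
Requirements  = Member("VO-babar", other.GlueHostApplicationSoftwareRunTimeEnvironment);
Rank          = other.GlueCEStateFreeCPUs;
EOF
edg-job-submit -o ${job}.handle ${job}.jdl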
17 Running analysis programs
When the task is delivered to the WN, scripts start running to initialise the specific BaBar environment, and the analysis software is downloaded (a sketch follows).
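A minimal sketch of such a WN wrapper, based on the environment-setup steps of the conformity-validation script; the executable name is invented and its download uses the LFN mechanism of slide 15:

#!/bin/bash
# Hypothetical WN wrapper (executable name invented); the setup steps mirror
# the conformity-validation script shown earlier.
echo "Start time:"; /bin/date
. $VO_BABAR_SW_DIR/babar-grid-setup-env.sh        # initialise the BaBar grid environment
workdir=`pwd`
cd $BFDIST/releases/14.5.2
srtpath 14.5.2 Linux24RH72_i386_gcc2953           # select release and platform
cd $workdir
# MyAnalysisApp has been downloaded by LFN as in slide 15.
chmod +x MyAnalysisApp
./MyAnalysisApp "$1"                              # run over the skimData file passed as argument
echo "End time:"; /bin/date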
18 Benchmarks
Behavior of particles in the BaBar Electromagnetic Calorimeter (EMC)
- The different behavior of electrons, hadrons, and muons can be distinguished.
- Performing this analysis takes 7 days using one computer 24 hours a day.
- Using 10 CPUs in parallel, accessed via the Grid, it took only 8 hours.
19 - Pi- N Pi0 decays, with N = 1, 2, 3 and 4
- Invariant masses of pairs of gammas from Pi0 decay, as measured by the EMC, produce a mass peak at 135 MeV (the peak in the plot). All other combinations are spread randomly over all energies (background).
- There were 81,700,000 events in the dataset and it took 4 days to run in production with 26 jobs in parallel; to run it on one single computer would take more than 3 months.
20 Summary
- Easygrid is working and provides the whole job-submission structure using the LCG grid, RLS and metadata management.
- Provides handler management transparent to the user.
- Easy to use!!!
- Configurable to meet users' requirements, and possibly for other experiments as well.
- See the homepage http://www.hep.man.ac.uk/u/jamwer/ for more details.
- Thanks for the opportunity!