Title: A CMS computing project
1. A CMS computing project: BOSS (Batch Object Submission System)
- Zhang YongJun
- (Imperial College London)
- Background: the Grid and the LHC
- The CMS computing project BOSS
2. LHC (Large Hadron Collider)
- The LHC is a particle accelerator at CERN, near Geneva on the border between Switzerland and France. It is scheduled to start operation in 2007.
- The LHC will collide protons at a collision energy of 14 TeV and will also collide heavy ions such as lead (Pb).
3. Detector and trigger
- 75 million electronics channels from the various subdetectors.
- The data from the detector are electrical signals.
- By applying calibrations, physical quantities (momentum, energy) can be derived from the strength of the electrical signals.
- The trigger system selects interesting events.
- The reconstruction procedure builds physics objects with their properties from raw events.
- Data analysis applies a set of cuts to select a specific set of events corresponding to a specific physics channel (see the sketch below).
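As an illustration of applying a set of cuts, here is a minimal Python sketch; the event fields and thresholds are invented for the example and this is not CMS analysis code:

    # Minimal sketch: select events passing a simple set of cuts.
    # The event fields and thresholds below are illustrative only.
    events = [
        {"pt": 32.5, "eta": 1.1, "n_muons": 2},
        {"pt": 8.0,  "eta": 2.9, "n_muons": 0},
        {"pt": 51.2, "eta": 0.3, "n_muons": 2},
    ]

    def passes_cuts(event):
        """Cuts corresponding to a hypothetical physics channel."""
        return event["pt"] > 20.0 and abs(event["eta"]) < 2.5 and event["n_muons"] >= 2

    selected = [e for e in events if passes_cuts(e)]
    print(f"{len(selected)} of {len(events)} events pass the selection")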
4. Software
- Simulation is essential for the detector/software design as well as for data analysis.
- Fast simulation is fast compared to full simulation, but it depends on parameters extracted from the full simulation.
[Diagram: simulation and data chains feeding data analysis (ROOT): generator, full/fast simulation, LHC, detector, digitization, trigger, reconstruction]
5. LHC computing model
- 225 MB/s for CMS from online to offline. A lot of data will arrive, and it is beyond the ability of any single site to process it all, so a tiered data distribution structure is proposed: CERN is Tier 0, and every country has one Tier 1 and several Tier 2 sites.
- Tier 1 sites reconstruct events and host data; Tier 2 sites run physicists' analysis jobs.
- This tier structure is built upon Grid software.
6. Computing before the Grid
[Diagram: one user with a different account at each site, e.g. yjzhang at CERN, yzhang at Imperial College, and yet another account at RAL]
- An account is needed at every site a job is submitted to.
- To submit jobs to a newly joined site, a new account has to be created.
- Although these sites take part in the same project, such as CMS, it is difficult to share CPU and data.
7. Computing on the Grid
[Diagram: one certificate used to submit jobs to CERN, Imperial College and RAL]
- Instead of per-site accounts, the user holds a certificate to submit jobs.
- The sites that accept this certificate form a Virtual Organization (VO). All sites that have joined the CMS experiment can join the CMS VO.
- The certificate is issued by a Certificate Authority and is based on the RSA algorithm.
- On top of a VO, more services can be added to help users submit jobs, for example scheduling and monitoring.
Example certificate (truncated):

    Bag Attributes
        friendlyName: yongjun zhang's eScience ID
        localKeyID: 65 AB 3E 55 38 77 49 B3 3A 93 26 B5 08 68 D1 8C A9 CD 6A D8
    subject=/C=UK/O=eScience/OU=Imperial/L=Physics/CN=yongjun zhang
    issuer=/C=UK/O=eScience/OU=Authority/CN=CA/emailAddress=ca-operator@grid-support.ac.uk
    -----BEGIN CERTIFICATE-----
    MIIFbzCCBFegAwIBAgICFHowDQYJKoZIhvcNAQEFBQAwcDELMAkGA1UEBhMCVUsx
    ETAPBgNVBAoTCGVTY2llbmNlMRIwEAYDVQQLEwlBdXRob3JpdHkxCzAJBgNVBAMT
    ...
    -----END CERTIFICATE-----
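For illustration, a short Python sketch that reads the subject and issuer from such a PEM certificate; it assumes the third-party cryptography package and a hypothetical file name usercert.pem:

    # Minimal sketch (not a Grid tool): inspect a user certificate like the one above.
    from cryptography import x509

    with open("usercert.pem", "rb") as f:          # file name is an assumption
        cert = x509.load_pem_x509_certificate(f.read())

    print("subject:", cert.subject.rfc4514_string())
    print("issuer: ", cert.issuer.rfc4514_string())
    print("expires:", cert.not_valid_after)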
8. Workflow management on the Grid
[Diagram: a Resource Broker (CMS) dispatching certificate-authenticated jobs to CERN, Imperial College and RAL, all in the CMS VO]
- To make job submission even easier, a job submission service, the Resource Broker (RB), can be set up on top of the VO. The user delegates to the RB, which submits the job to a non-busy site.
- To accept jobs submitted from anywhere in the VO, a dedicated cluster can be set up as a Computing Element (CE).
- Similarly, many VO-based services, such as monitoring and logging, have been developed.
9. Workflow management on the Grid
[Diagram: as on the previous slide, plus a catalogue database mapping Logical File Names (LFN) to Physical File Names (PFN)]
- On the Grid, the user specifies a file by its Logical File Name (LFN). A Grid service looks up a catalogue database to find all the corresponding Physical File Names (PFN) and selects one of them to do the real work. A UUID sits between the LFN and the PFNs and links the two (see the sketch below).
- A dedicated site can be set up as a Storage Element (SE) to host a large amount of data, for example gfe02.hep.ph.ic.ac.uk, which uses the dCache tool.
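A minimal Python sketch of the LFN to UUID to PFN lookup; the catalogue here is just an in-memory dictionary with invented entries, not the real Grid catalogue service:

    # Minimal sketch of LFN -> UUID -> PFN resolution (illustrative data only).
    import random

    lfn_to_uuid = {
        "lfn:/cms/example/events.root": "01D4DF4E-A4EB-4047-A94A-1A550265872F",
    }
    uuid_to_pfns = {
        "01D4DF4E-A4EB-4047-A94A-1A550265872F": [
            "dcap://gfe02.hep.ph.ic.ac.uk:22128/pnfs/.../events.root",
            "gsiftp://se.example.org/cms/.../events.root",
        ],
    }

    def resolve(lfn):
        """Look up the UUID for an LFN and pick one of its physical replicas."""
        uuid = lfn_to_uuid[lfn]
        return random.choice(uuid_to_pfns[uuid])

    print(resolve("lfn:/cms/example/events.root"))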
10. BOSS - Batch Object Submission System
[Diagram: CRAB on top of BOSS, with BOSS logging and monitoring attached, and the scheduler/Grid underneath]
- BOSS is part of the CMS workload management system.
- BOSS provides logging, bookkeeping and monitoring.
- BOSS sits between the user (CRAB) and the scheduler/Grid.
- BOSS is a generic submission tool and will provide Python/C APIs to be used by CRAB; CRAB plus BOSS then form the complete submission tool.
11. Sample task specification

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <task>
      <iterator name="ITR" start="0" end="100" step="1">
        <chain scheduler="glite" rtupdater="mysql" ch_tool_name="jobExecutor">
          <program exec="test.pl"
                   args="ITR"
                   stderr="err_ITR"
                   program_type="test"
                   stdin="in"
                   stdout="out_ITR"
                   infiles="Examples/test.pl,Examples/in"
                   outfiles="out_ITR,err_ITR"
                   outtopdir="" />
        </chain>
      </iterator>
    </task>

- Example of a task containing 100 chains, each consisting of one program.
- Program-specific monitoring is activated; results are returned via a MySQL connection.
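A minimal Python sketch (not BOSS itself) showing how such a task description could be expanded: it parses the XML with the standard library and substitutes the iterator name into each program's attributes:

    # Minimal sketch: expand the iterator of a task XML like the one above.
    import xml.etree.ElementTree as ET

    tree = ET.parse("task.xml")                  # file name as used on the data-flow slides
    for it in tree.getroot().findall("iterator"):
        name = it.get("name")
        start, end, step = (int(it.get(k)) for k in ("start", "end", "step"))
        for value in range(start, end, step):
            for chain in it.findall("chain"):
                for program in chain.findall("program"):
                    # substitute the iterator placeholder in every attribute
                    attrs = {k: v.replace(name, str(value)) for k, v in program.attrib.items()}
                    print(attrs["exec"], attrs["args"], attrs["stdout"])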
12. BOSS components overview
[Diagram: user CLI, administrator CLI, Python interface and user GUI (plus a possible pro-active UI) on top of the BOSS kernel (APIs, kernel objects such as BossTask, database, scheduler); the kernel talks to the Grid or local scheduler, to BOSS logging and monitoring, and to BOSS on the WN (jobExecutor, tar ball with configuration file and executables)]
- BOSS has two parts: (1) BOSS on the UI and (2) BOSS on the WN.
- BOSS on the UI has two further sub-layers: (a) the user interface and (b) the BOSS kernel.
- The BOSS kernel further includes the APIs (BossUserSession, BossAdministratorSession) and the kernel objects (BossConfiguration, BossTask, BossDataBase and BossScheduler); see the sketch after this list.
- BOSS on the WN has the level structure Task, Chain, Program, userExecutable.
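The API/kernel layering above can be pictured with a few illustrative Python stubs; these are not the real BOSS classes or method signatures, just a sketch of a user-facing session object delegating to kernel objects:

    # Illustrative stubs only: a user-facing API layer delegating to kernel objects.
    class BossDatabase:
        def insert_task(self, xml_file):
            print(f"bookkeeping: registered {xml_file}")
            return 1                              # pretend task id

    class BossScheduler:
        def submit(self, task_id):
            print(f"scheduler: submitted task {task_id}")

    class BossTask:
        def __init__(self, db, scheduler):
            self.db, self.scheduler = db, scheduler
        def declare_and_submit(self, xml_file):
            task_id = self.db.insert_task(xml_file)
            self.scheduler.submit(task_id)
            return task_id

    class BossUserSession:                        # API layer used by CRAB or the CLI
        def __init__(self):
            self.task = BossTask(BossDatabase(), BossScheduler())
        def submit(self, xml_file):
            return self.task.declare_and_submit(xml_file)

    print(BossUserSession().submit("task.xml"))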
13. BOSS internal data flow
[Diagram: the user/CRAB and the administrator supply task.xml and schema.xml through the user and administrator APIs; a Job.tar (job.xml, monitoring, ORCA, input, ...) is handed to the scheduler via JDL; on the WN the wrapper (Shreek) writes a journal file and sends monitoring data back to the BOSS logging database (tables such as JOB_ID / START_TIME / STATUS and JOB_ID / TYPE / INPUT, e.g. rows "1 ORCA FILE1", "2 ORCA FILE2")]
14. BOSS internal workflow
15. BOSS WN and UI reorganization proposal
[Diagram: job.tar carries the job components; the JobExecuter reads the file of job configuration and drives core services (blackboard, JobMonitor, programChaining, monitor interface) plus plug-ins (e.g. a pro-active plug-in/service)]
- Everything variable goes into the configuration file, which keeps the remaining components simple; no recompilation is needed when new components are added.
- The configuration file is created during the job preparation stage and owns all the information needed.
- The JobExecuter only has to interpret the configuration file (see the sketch after this list).
- Core services can talk to each other, so they may depend on each other.
- A plug-in only talks to services, which keeps it independent enough to be a plug-in.
- The tar ball job.tar is created during the job preparation stage, synchronized with the creation of the configuration file. Any service or plug-in referenced by the configuration files (logically, or even physically, there may be more than one configuration file) should be added to the tar ball as well.
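A minimal Python sketch of the idea that the JobExecuter only interprets a configuration file and loads whatever components it names; the JSON config format, the module names and the common Component interface are assumptions, not part of the proposal:

    # Minimal sketch: a config-driven executer; adding a component means editing the
    # configuration, not recompiling the executer.
    import importlib
    import json

    class Blackboard(dict):
        """Shared state that core services and plug-ins read and write."""

    with open("job_config.json") as f:   # hypothetical config built at job preparation time
        config = json.load(f)            # e.g. {"services": ["services.monitor"], "plugins": [...]}

    blackboard = Blackboard()
    components = []
    for kind in ("services", "plugins"):
        for module_name in config.get(kind, []):
            module = importlib.import_module(module_name)    # assumed to ship inside job.tar
            components.append(module.Component(blackboard))  # assumed common constructor

    for component in components:
        component.run()                                      # assumed common entry point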
16. Structure of levels 2, 3 and final
[Diagram: level 1, level 2, level 3, level final]
- The chaining configuration file owns all the information needed to chain programs together; this leaves the programChaining program clean and stable.
- The chaining configuration file is created during the chain preparation stage (a step of the job preparation stage).
- programChaining interprets the chaining configuration file and executes its commands (see the sketch after this list).
- The job configuration file, the chaining configuration file and the program configuration file have a similar (or the same) structure and functionality. They can even share the same physical file, but logically they should remain distinct to keep things flexible.
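A minimal Python sketch of linear chaining driven by a chaining configuration; the configuration format and program names here are invented for illustration:

    # Minimal sketch: run programs in order as described by a chaining configuration.
    import subprocess

    chain_config = [
        {"exec": "./pre.sh",  "stdout": "pre.out",  "stderr": "pre.err"},
        {"exec": "./test.pl", "stdout": "out_0",    "stderr": "err_0"},
        {"exec": "./post.sh", "stdout": "post.out", "stderr": "post.err"},
    ]

    for step in chain_config:
        with open(step["stdout"], "w") as out, open(step["stderr"], "w") as err:
            result = subprocess.run([step["exec"]], stdout=out, stderr=err)
        if result.returncode != 0:   # simple linear chaining: stop at the first failure
            break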
17. BOSS status and plans
- New functionality has been implemented or is being written:
  - Tasks, jobs and executables.
  - XML task description.
  - C and Python APIs.
  - Basic executable chaining; currently only the default chainer with linear chaining.
  - Separate logging and monitoring DBs.
  - DBs implemented in either MySQL or SQLite (more to come).
  - Optional RT monitoring with multiple implementations, currently only MonALISA and direct MySQL connections (to be deprecated).
- To be done in the near future:
  - Allow chainer plug-ins.
  - Implement more RT monitoring solutions, e.g. R-GMA.
  - Look at writing the wrapper in a scripting language, e.g. Perl/Python.
  - Optimize the architecture and separate data from functionality.
18. Grid organizations
- Globus Toolkit services: resource management (Grid Resource Allocation Management Protocol, GRAM); information services (Monitoring and Discovery Service, MDS); security services (Grid Security Infrastructure, GSI); data movement and management (Global Access to Secondary Storage, GASS, and GridFTP).
- LHC experiments: ALICE, ATLAS, CMS, LHCb. Common projects: PI, POOL/CondDB, SEAL, ROOT, Simulation, SPI, 3D (GDA).
- Goals: to build a consistent, robust and secure Grid network that will attract additional computing resources; to continuously improve and maintain the middleware in order to deliver a reliable service to users; and to attract new users from industry as well as science and ensure they receive the high standard of training and support they need.
- There are many national-scale Grid collaborations. For example, GridPP is a UK national collaboration funded by the UK government through PPARC as part of its e-Science Programme; it collaborates with CERN and EGEE.
19. Backup slides
20. BOSS key components
[Diagram: the same data flow as slide 13 (task.xml, schema.xml, user/administrator APIs, Job.tar, scheduler/JDL, wrapper/Shreek, journal file, monitoring, BOSS logging), with the kernel objects BossTask, BossScheduler and BossDB shown at the centre]
21. BOSS level structure on the WN
[Diagram: levels 0 through final on the WN: the JobExecuter (wrapper) with Blackboard, JobMonitor and a possible pro-active interface; JobChaining; programExecutor1/programExecutor2 with pre-filter, runtime-filter and post-filter; and the user executable at the final level]
- At least level 0, level 1 and level final have to be there.
- Level 2 and level 3 can be omitted; this is easily achieved by rewriting the configuration file.
- A new level can easily be inserted between level 1 and level final by rewriting the configuration file.
- Every level may or may not have its own configuration file.
- The JobExecutor controls all processes on the worker node.
- A pro-active process is not planned for the first release.
- JobChaining: simple linear program execution in the first release, allowing the possibility of plug-ins (e.g. Shreek) in the future.
- Simple monitoring via output-stream filters is planned for the first release; more extensive options will be available later (see the sketch after this list).
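A minimal Python sketch of monitoring via an output-stream filter; the executable name and the line format being matched are invented for illustration:

    # Minimal sketch: a runtime filter scans the user executable's stdout line by line
    # and records values of interest in a journal file while the job runs.
    import re
    import subprocess

    pattern = re.compile(r"EVENT\s+(\d+)\s+PROCESSED")   # hypothetical output format

    proc = subprocess.Popen(["./userExecutable"], stdout=subprocess.PIPE, text=True)
    with open("journal.txt", "w") as journal:
        for line in proc.stdout:                 # runtime filter: runs while the job runs
            match = pattern.search(line)
            if match:
                journal.write(f"last_event {match.group(1)}\n")
                journal.flush()                  # keep the journal usable for live monitoring
    proc.wait()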
22. BOSS history
- People: W. Bacchi, G. Codispoti, C. Grandi (INFN Bologna); D. Colling, B. MacEvoy, S. Wakefield, Y. Zhang (Imperial College London).
- Old BOSS (Italian group: Claudio, 2001-): logging, bookkeeping, scheduler, monitoring.
- GROSS (Imperial group: Hugh, Stuart, Dave, Barry, Yong, 2003-2005): CMS-specific functionality, groups of jobs.
- Bologna-Imperial joint meeting (Stuart, Dave, Barry, Yong, Claudio and all of the Bologna group), 17/12/2004, Bologna.
- Joint meeting (Stuart, Dave, Yong, Henry, Claudio), 02-03/02/2005, Imperial: new BOSS with the task/job/program structure.
- CMS WM workshop, 14-15/07/2005, Padova: adopted the XML structure, defined the framework and priorities.
- BOSS group meeting, 12-14/10/2005, Bologna.
23. Schema configuration file proposal

    <TABLE NAME="TASK">
      <ELEMENT NAME="TASK_ID"      TYPE="INTEGER PRIMARY KEY" DAUGHTER="CHAIN" />
      <ELEMENT NAME="ITERATORS"    TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="TASK_INFILES" TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="DECL_USER"    TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="DECL_PATH"    TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="DECL_TIME"    TYPE="INTEGER NOT NULL DEFAULT 0" />
    </TABLE>

    <TABLE NAME="CHAIN">
      <ELEMENT NAME="CHAIN_ID"        TYPE="INTEGER PRIMARY KEY" DAUGHTER="PROGRAM" MOTHER="TASK" />
      <ELEMENT NAME="TASK_ID"         TYPE="INTEGER NOT NULL DEFAULT 0" TAG4DB="MOTHER_ID" />
      <ELEMENT NAME="SCHEDULER"       TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="RTUPDATER"       TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="SCHED_ID"        TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="CHAIN_CLAD_FILE" TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="LOG_FILE"        TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="SUB_USER"        TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="SUB_PATH"        TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="SUB_TIME"        TYPE="INTEGER NOT NULL DEFAULT 0" />
    </TABLE>

    <TABLE NAME="PROGRAM">
      <ELEMENT NAME="PROGRAM_ID"    TYPE="INTEGER PRIMARY KEY" MOTHER="CHAIN" />
      <ELEMENT NAME="CHAIN_ID"      TYPE="INTEGER NOT NULL DEFAULT 0" TAG4DB="MOTHER_ID" />
      <ELEMENT NAME="TYPE"          TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="EXEC"          TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="ARGS"          TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="STDIN"         TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="STDOUT"        TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="STDERR"        TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="PROGRAM_TIMES" TYPE="TEXT NOT NULL DEFAULT ''" />
      <ELEMENT NAME="INFILES"       TYPE="TEXT NOT NULL DEFAULT ''" TAG4SCHED="IN_FILES" />
      <ELEMENT NAME="OUTFILES"      TYPE="TEXT NOT NULL DEFAULT ''" TAG4SCHED="OUT_FILES" />
      <ELEMENT NAME="OUTTOPDIR"     TYPE="TEXT NOT NULL DEFAULT ''" />
    </TABLE>

    <TABLE NAME="PROGRAMTYPE">
      <ELEMENT NAME="NAME"           TYPE="CHAR(30) NOT NULL PRIMARY KEY" TAG4DB="UPDATE_KEY" TAG4SCHED="META_DATA" />
      <ELEMENT NAME="PROGRAM_SCHEMA" TYPE="TEXT NOT NULL DEFAULT ''" TAG4DB="INSERT_FILE_CONTENT,CREATE_TABLE_CONTENT" TAG4SCHED="PROGRAMTYPE_CONTENT" />
      <ELEMENT NAME="COMMENT"        TYPE="VARCHAR(100) NOT NULL DEFAULT ''" TAG4SCHED="META_DATA" />
      <ELEMENT NAME="PRE_BIN"        TYPE="TEXT NOT NULL DEFAULT ''" TAG4DB="INSERT_FILE_CONTENT" TAG4SCHED="PROGRAMTYPE_CONTENT" />
      <ELEMENT NAME="RUN_BIN"        TYPE="TEXT NOT NULL DEFAULT ''" TAG4DB="INSERT_FILE_CONTENT" TAG4SCHED="PROGRAMTYPE_CONTENT" />
      <ELEMENT NAME="POST_BIN"       TYPE="TEXT NOT NULL DEFAULT ''" TAG4DB="INSERT_FILE_CONTENT" TAG4SCHED="PROGRAMTYPE_CONTENT" />
    </TABLE>
24. Dataset and PhEDEx
How to understand PhEDEx?
Example dataset jm_Hit245_2_g133/jm03b_qcd_120_170:

    01D4DF4E-A4EB-4047-A94A-1A550265872F.zip
    866822E9-244B-4C1D-BF1D-080E71D343F0.zip
    021C736B-A2A4-43E2-9F25-829F9E7E8F35.zip
    8B3ED1AD-14AB-4696-BADD-71119EA7652A.zip
    ... 135 files in total, 200 GB

- Manual dataset transfer:
  - find out where to copy the dataset from
  - copy the files one by one
  - publish the files into the catalogue one by one
  - write private scripts to do the transfer
- Using PhEDEx:
  - PhEDEx has a collection of scripts or script templates
  - PhEDEx provides a framework (a set of agents) to support the scripts
  - PhEDEx has a central database (TMDB) to coordinate every step in the transfer process
  - PhEDEx has a website to monitor transfer status and handle dataset requests
  - ...
File catalogue example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <POOLFILECATALOG>
      <File ID="01D4DF4E-A4EB-4047-A94A-1A550265872F">
        <physical>
          <pfn filetype="" name="dcap://gfe02.hep.ph.ic.ac.uk:22128/pnfs/hep.ph.ic.ac.uk/data/cms/phedex/jm03b_qcd_120_170/Hit/01D4DF4E-A4EB-4047-A94A-1A550265872F.zip"/>
        </physical>
        <logical>
          <lfn name="ZippedEVD.121000153.121000154.jm_Hit245_2_g133.jm03b_qcd_120_170.zip"/>
        </logical>
        <metadata att_name="dataset" att_value="jm03b_qcd_120_170"/>
        <metadata att_name="jobid" att_value="1126203628"/>
        <metadata att_name="owner" att_value="jm_Hit245_2_g133"/>
      </File>
      <File ID="866822E9-244B-4C1D-BF1D-080E71D343F0">
      </File>
    </POOLFILECATALOG>
25. Developer's point of view of PhEDEx
26. User's point of view of PhEDEx
[Diagram: a node (IC) with its configuration runs agents such as FileDownloadDestination, FileDownload, FileDownloadVerify, FileDownloadPublish, FileDownloadDelete, FilePFNExport, PFNLookup, NodeRouter and FileRouter; the agents are coordinated through the central TMDB and a WWW interface, and other nodes (e.g. RAL) attach in the same way]
- The user needs to write glue scripts, which are driven by the agents (see the sketch below).
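A purely illustrative glue-script sketch in Python: the calling convention (source PFN and destination as command-line arguments) and the use of globus-url-copy as the copy tool are assumptions, not the real PhEDEx contract:

    # Hypothetical glue script: the agent is assumed to pass a source PFN and a destination.
    import subprocess
    import sys

    source_pfn, destination = sys.argv[1], sys.argv[2]

    # delegate the actual transfer to a site-specific copy tool (tool choice is an assumption)
    result = subprocess.run(["globus-url-copy", source_pfn, destination])
    sys.exit(result.returncode)      # the agent decides success or failure from the exit code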
27. Event data model
- Data are defined by an event data model (EDM).