Title: CMS on the Grid
1. CMS on the Grid
Toward a fully distributed Physics Analysis
- Vincenzo Innocente
- CERN/EP
2. Challenges: Complexity
- Detector
- 2 orders of magnitude more channels than today
- Triggers must correctly choose only 1 event in every 400,000
- Level 2/3 triggers are software-based (so they must be of the highest quality)
- Computing resources will not be available in a single location
3. Challenges: Geographical Spread
- 1700 physicists
- 150 institutes
- 32 countries
- CERN Member States: 55%
- Non-Member States: 45%
- Major challenges associated with
- Communication and collaboration at a distance
- Distributed computing resources
- Remote software development and physics analysis
4. Challenges: b Physics
- Already today, typically the subject of theses and the work of small university groups
- 150 physicists in the CMS heavy-flavour group
- > 40 institutions involved
- Often requires precise, specialized algorithms for vertex reconstruction and particle identification
- Most CMS triggered events include B particles
- High-level software triggers select exclusive channels in events triggered in hardware with inclusive conditions
- Objectives:
- Allow remote physicists to access detailed event information
- Migrate reconstruction and selection algorithms effectively to the HLT
5. HEP Experiment-Data Analysis
[Flow diagram: quasi-online reconstruction is fed by environmental data (detector control) and online monitoring; the event filter / object formatter stores data in the persistent object store manager (database management system); data quality, calibrations, group analysis, simulation and on-demand user analysis each request parts of events and store reconstructed objects and calibrations back, leading eventually to the physics paper.]
6. Analysis Model
- Hierarchy of processes (experiment, analysis groups, individuals):
- Reconstruction: experiment-wide activity (10^9 events); ~3000 SI95 s/event; ~1 job per year plus re-processing ~3 times per year, driven by new detector calibrations or improved understanding
- Monte Carlo production: ~5000 SI95 s/event
- Selection: activity of ~20 groups (10^9 -> 10^7 events); ~25 SI95 s/event; ~20 jobs per month; trigger-based and physics-based refinements; iterative selection roughly once per month
- Analysis (algorithms applied to the data to get results): ~25 individuals per group (10^6-10^8 events); ~10 SI95 s/event; ~500 jobs per day; different physics cuts and MC comparison roughly once per day
7. Data Handling Baseline
- CMS computing in year 2007
- Data model: typical objects 1 kB-1 MB
- 3 PB of storage space
- 10,000 CPUs
- 31 sites (1 Tier-0, 5 Tier-1, 25 Tier-2) all over the world
- I/O rates disk -> CPU: 10,000 MB/s, average 1 MB/s per CPU
- RAW -> ESD generation: 0.2 MB/s of I/O per CPU
- ESD -> AOD generation: 5 MB/s of I/O per CPU
- AOD analysis into histograms: 0.2 MB/s of I/O per CPU
- DPD generation from AOD and ESD: 10 MB/s of I/O per CPU
- Wide-area I/O capacity: of order 700 MB/s aggregate over all payload intercontinental TCP/IP streams
- This implies a system with heavy reliance on access to site-local (cached) data (see the sketch below)
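A quick back-of-envelope check of the baseline numbers above makes the last point concrete; this is only an illustrative sketch using the figures quoted on this slide.

```python
# Back-of-envelope check of the 2007 baseline I/O numbers quoted above.

n_cpus = 10_000                 # CPUs in the 2007 baseline
local_io_per_cpu = 1.0          # average MB/s of disk -> CPU I/O needed per CPU
wan_capacity = 700.0            # MB/s aggregate intercontinental TCP/IP capacity

local_io_total = n_cpus * local_io_per_cpu   # 10,000 MB/s of local demand
wan_share = wan_capacity / local_io_total    # fraction servable over the WAN

print(f"local disk->CPU demand : {local_io_total:,.0f} MB/s")
print(f"wide-area capacity     : {wan_capacity:,.0f} MB/s")
print(f"=> only ~{wan_share:.0%} of the I/O could come from remote sites,")
print("   hence the heavy reliance on site-local (cached) data.")
```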
8. Prototype Computing Installation (T0/T1)
9. Scalability, Regional Centres
- CMS computing in year 2007
- Object data model, typical objects 1 kB-1 MB
- 3 PB of storage space
- 10,000 CPUs
- Regional centres: 31 sites (1 Tier-0, 5 Tier-1, 25 Tier-2) all over the world
- I/O rates disk -> CPU: 10,000 MB/s, average 1 MB/s per CPU just to keep the CPUs busy
- Wide-area I/O capacity: of order 700 MB/s aggregate over all payload intercontinental TCP/IP streams
- This implies a distributed system with heavy reliance on access to site-local (cached) data
- A natural match for Grid technology
10. Analysis Environments
- Real-Time Event Filtering and Monitoring
- Data-driven pipeline
- High reliability
- Pre-emptive Simulation, Reconstruction and Event Classification
- Massively parallel batch-sequential processing
- Excellent error recovery and rollback mechanisms
- Excellent scheduling and bookkeeping systems
- Interactive Statistical Analysis
- Rapid application development environment
- Excellent visualization and browsing tools
- Human-readable navigation
11. Different Challenges
- Centralized quasi-online processing
- Keep up with the rate
- Validate and distribute data efficiently
- Distributed organized processing
- Automation
- Interactive chaotic analysis
- Efficient access to data and metadata
- Management of private data
12. Migration
- Today's Nobel prize becomes tomorrow's trigger (and the day after's background)
- Boundaries between running environments are fuzzy
- Physics analysis algorithms should migrate up to the online to make the trigger more selective
- Robust batch systems should be made available for physics analysis of large data samples
- The results of offline calibrations should be fed back to the online to make the trigger more efficient
13. The Final Challenge
- Beyond the interactive analysis tool (user point of view)
- Data analysis and presentation: n-tuples, histograms, fitting, plotting, ...
- A great range of other activities with fuzzy boundaries (developer point of view):
- Batch
- Interactive work, from point-and-click to Emacs-like power tools to scripting
- Setting up configuration management tools, application frameworks and reconstruction packages
- Data store operations: replicating entire data stores; copying runs, events and event parts between stores; not just copying but also doing something more complicated (filtering, reconstruction, analysis, ...)
- Browsing data stores down to object detail level
- 2D and 3D visualisation
- Moving code across final analysis, reconstruction and triggers
- Today this involves (too) many tools
14. Architecture Overview
[Diagram: a coherent set of basic tools and mechanisms behind a consistent user interface. Generic analysis tools, a data browser, analysis job wizards and a detector/event display sit on top of the CMS tools (ORCA, COBRA, OSCAR, FAMOS, federation wizards, Objectivity tools, software development and installation), all layered over the Grid-based distributed data store and computing infrastructure.]
15. Offline Architecture Requirements at LHC
- Bigger experiment, higher rate, more data
- A larger and more dispersed user community performing non-trivial queries against a large event store
- Make best use of new IT technologies
- Increased demand for both flexibility and coherence:
- the ability to plug in new algorithms
- the ability to run the same algorithms in multiple environments
- guarantees of quality and reproducibility
- high-performance user-friendliness
16. Requirements on Data Processing
- High efficiency
- Processing-site hardware optimization
- Processing-site software optimization
- The job structure depends very much on the hardware setup
- Data quality assurance
- Data validation
- Data history (job bookkeeping)
- Automate
- Input data discovery
- Crash recovery
- Resource monitoring
- Identify bottlenecks and fragile components
17. Analysis Part
- Physics data analysis will be done by hundreds of users
- The analysis part is connected to the same catalogs
- Maintain a global view of all data
- Big analysis jobs can use the production job-handling mechanisms
- Analysis services based on tags
18. [Screenshots of the interactive analysis tools]
- Emacs used to edit a CMS C++ plugin to create and fill histograms
- OpenInventor-based display of a selected event
- Lizard Qt plotter
- ANAPHE histogram extended with pointers to CMS events
- Python shell with Lizard and CMS modules
19. Varied Components and Data Flows
[Diagram: production data flows between Tier 0/1/2 centres; TAG/AOD data flows down to Tier 1/2; physics queries flow from users at Tier 3/4/5.]
20. TODAY
- Data production and analysis exercises
- Granularity (data product): the data-set
- Development and deployment of a distributed data processing system (hardware and software)
- Test and integration of Grid middleware prototypes
- R&D on distributed interactive analysis
21. CMS Production 2000-2002
[Flow diagram: MC production generates HEPEVT ntuples for signal and minimum bias (MB); CMSIM simulation writes Zebra files with hits; the ORCA ooHit formatter loads them into an Objectivity database; ORCA digitization merges signal and MB; in ORCA production the HLT algorithms write new reconstructed objects into the HLT group databases, which are mirrored to regional centres (US, Russia, Italy, ...); catalog imports link the production steps.]
22. Current CMS Production
23. CMS Production Stream
| # | Task           | Application | Input   | Output         | Non-standard requirements               | Resource requirements |
|---|----------------|-------------|---------|----------------|-----------------------------------------|-----------------------|
| 1 | Generation     | Pythia      | None    | Ntuple         | Statically linked exe, geometry files   | Storage               |
| 2 | Simulation     | CMSIM       | Ntuple  | FZ file        | Statically linked exe, geometry files   | Storage               |
| 3 | Hit formatting | ORCA H.F.   | FZ file | DB             | Shared libs, full CMS environment       | Storage               |
| 4 | Digitization   | ORCA Digi.  | DB      | DB             | Shared libs, full CMS environment       | Storage               |
| 5 | User analysis  | ORCA User   | DB      | Ntuple or ROOT | Shared libs, full CMS environment       | Distributed input     |
24. Production 2002: Complexity
- Number of regional centers: 11
- Number of computing centers: 21
- Number of CPUs: 1000
- Largest local center: 176 CPUs
- Number of production passes for each dataset (including analysis-group processing done by production): 6-8
- Number of files: 11,000
- Data size (not including FZ files from simulation): 17 TB
- File transfer by GDMP and by Perl scripts over scp/bbcp: 7 TB toward Tier-1, 4 TB toward Tier-2
25. Spring 2002 CPU Resources
[Chart, 4.4.02: ~700 active CPUs plus 400 CPUs to come, spread over the participating sites: INFN 18, CERN 15, IN2P3 10, FNAL 8, RAL 6, IC 6, UFL 5, Caltech 4, Bristol 3, UCSD 3, HIP 1, plus Wisconsin and Moscow.]
26. Current Data Processing
27. ORCA Db Structure
- One CMSIM job is oo-formatted into multiple databases. For example, one FZ file yields an MC-info container (a few kB/event) and ooHit databases totalling about 300 kB/event: calorimeter/muon hits (~100 kB/event) and tracker hits (~200 kB/event).
- Multiple sets of ooHits are concatenated into a single 2 GB database file, e.g. the MC info from run 1, run 2, run 3, ... of N runs.
- The physical and logical database structures therefore diverge.
28. Production Center Setup
- The most critical task is digitization:
- 300 kB per pile-up event
- 200 pile-up events per signal event -> 60 MB
- 10 s to digitize one full event on a 1 GHz CPU
- 6 MB/s per CPU (12 MB/s per dual-processor client)
- Up to 5 clients per pile-up server (60 MB/s on its network card: Gigabit Ethernet)
- Fast disk access, 5 clients per server (see the sketch below)
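The bandwidth budget above follows directly from the quoted numbers; the short sketch below just reproduces the arithmetic (values taken from this slide, nothing else assumed).

```python
# Rough check of the pile-up bandwidth figures quoted on this slide.

pileup_event_size_mb = 0.3      # 300 kB per pile-up event
pileup_per_signal = 200         # pile-up events mixed into each signal event
digi_time_s = 10.0              # seconds to digitize one full event on a 1 GHz CPU
clients_per_server = 5          # dual-CPU clients fed by one pile-up server

mb_per_signal_event = pileup_event_size_mb * pileup_per_signal   # ~60 MB
mb_per_s_per_cpu = mb_per_signal_event / digi_time_s             # ~6 MB/s
mb_per_s_per_client = 2 * mb_per_s_per_cpu                       # dual processor: ~12 MB/s
server_load = clients_per_server * mb_per_s_per_client           # ~60 MB/s per server

print(f"{mb_per_signal_event:.0f} MB of pile-up per signal event")
print(f"{mb_per_s_per_cpu:.0f} MB/s per CPU, {mb_per_s_per_client:.0f} MB/s per client")
print(f"{server_load:.0f} MB/s on the pile-up server's (Gigabit) network card")
```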
29. INFN-Legnaro Tier-2 Prototype
[Diagram of the farm layout: computational nodes N1..N24 per rack on Fast Ethernet switches, uplinked with 32 Gigabit Ethernet (1000BaseT) links; disk servers S1..S11.]
- 2001: 35 nodes (70 CPUs, 3500 SI95), 8 TB; growing to up to 190 nodes over 2001-2003
- 2001: 11 disk servers (1100 SI95), 2.5 TB
- WAN connection: 34 Mbps in 2001, 155 Mbps in 2002
- Sx disk server node: dual PIII 1 GHz, dual PCI (33 MHz/32-bit and 66 MHz/64-bit), 512 MB RAM, 3x75 GB EIDE RAID 0-5 disks (expandable up to 10), 1x20 GB disk for the OS
- Nx computational node: dual PIII 1 GHz, 512 MB RAM, 3x75 GB EIDE disks, 1x20 GB disk for the OS
30. IMPALA
- Each step in the production chain is split into 3 sub-steps
- Each sub-step is factorized into customizable functions (see the sketch below):
- JobDeclaration: search for something to do
- JobCreation: generate jobs from templates
- JobSubmission: submit jobs to the scheduler
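A minimal sketch of the declaration/creation/submission factorization described above. The function names, the template format and the hooks are hypothetical illustrations of the idea, not actual IMPALA code.

```python
# Illustrative sketch of the three-sub-step factorization: names and hooks
# are hypothetical, not the real IMPALA implementation.

def declare_jobs(todo_source):
    """JobDeclaration: search for something to do."""
    return list(todo_source())

def create_jobs(todo_list, template):
    """JobCreation: generate concrete job commands from a template."""
    return [template.format(**work) for work in todo_list]

def submit_jobs(jobs, submit_hook):
    """JobSubmission: hand each job to the site-specific scheduler hook."""
    for job in jobs:
        submit_hook(job)

if __name__ == "__main__":
    # A site manager plugs in the customizable pieces, e.g. a directory scan
    # or a DB query as the to-do source and the local submit command.
    todo = lambda: [{"run": 1}, {"run": 2}]
    template = "orca_digitize --run {run}"
    submit_jobs(create_jobs(declare_jobs(todo), template),
                submit_hook=lambda cmd: print("submitting:", cmd))
```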
31. Job Declaration and Creation
- Jobs to do are discovered automatically:
- by looking at predefined directory contents for the Fortran steps
- by querying the Objectivity/DB federation for digitization, event selection and analysis
- Once the to-do list is ready, the site manager can generate instances of jobs starting from a template
- Job execution includes validation of the produced data
32. Job Submission
- Thanks to the decomposition of sub-steps into customizable functions, site managers can:
- define the local actions taken to submit the job (is there a job scheduler? Which one? How are the queues organized?)
- define the local actions taken before and after the start of the job (is there a tape library? Do tapes need to be staged before the run?)
- Auto-recovery of crashed jobs:
- when a job is started for the first time, its startup cards are automatically modified so that if the job is re-started it continues from the last analyzed event (see the sketch below)
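A sketch of the auto-recovery idea: before a re-start, the job's startup cards are rewritten so processing resumes after the last analyzed event. The card name (FirstEvent) and file layout here are hypothetical, chosen only to illustrate the mechanism.

```python
# Sketch of startup-card rewriting for crash recovery; the card name and
# file format are hypothetical.

def rewrite_startup_cards(card_file, last_analyzed_event):
    """Set the first event to process to the one after the last completed event."""
    with open(card_file) as f:
        cards = f.readlines()
    with open(card_file, "w") as f:
        for line in cards:
            if line.startswith("FirstEvent"):
                line = f"FirstEvent {last_analyzed_event + 1}\n"
            f.write(line)

# Usage: rewrite_startup_cards("job.cards", last_analyzed_event=41250)
```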
33. BOSS
- Submission of batch jobs to a computing farm
- Independence from the local scheduler (PBS, LSF, Condor, etc.)
- Persistent storage of job information (in a relational DB)
- Job-dependent bookkeeping: monitor different information for different job types
- (e.g. number of events in input, number of events in output, version of software used, internal production software errors, etc.)
34. BOSS Job Submission and Running
[Diagram: boss submit / boss query / boss kill commands go to BOSS, which talks to the BOSS DB and to the local scheduler.]
- Accepts job submissions from users
- Stores info about the job in a DB
- Builds a wrapper around the job (BossExecuter)
- Sends the wrapper to the local scheduler
- The wrapper sends info about the job to the DB (see the sketch below)
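A minimal sketch of what a BOSS-style wrapper does: run the real job and record bookkeeping information in a relational database. sqlite3 is used here as a stand-in for the real MySQL back end, and the table schema is illustrative only, not the actual BOSS schema.

```python
# Sketch of a BossExecuter-like wrapper: run the user's job, then record
# start/stop times and exit status in a DB (sqlite3 as a stand-in for MySQL).
import sqlite3
import subprocess
import time

def run_wrapped(job_id, command, db_path="boss.db"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS job (id TEXT, start REAL, stop REAL, status INTEGER)")
    start = time.time()
    status = subprocess.call(command, shell=True)      # the user's executable
    db.execute("INSERT INTO job VALUES (?, ?, ?, ?)", (job_id, start, time.time(), status))
    db.commit()
    db.close()
    return status

# Usage: run_wrapped("prod-0042", "orca_digitize --run 42")
```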
35. Storing Info About a Job
- A registered job has a schema associated with it, listing the relevant information to be stored
- A table is created in the DB to keep this information
36. Getting Info From the Job
- A registered job has scripts associated with it which are able to understand the output of the user's executable (see the sketch below)
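A sketch of such a job-type-specific filter: it scans the job's output for the quantities registered in the job schema (event counts, software version, ...). The log format shown is hypothetical; real jobs would need filters matched to their actual output.

```python
# Sketch of an output filter that extracts bookkeeping quantities from a
# job's log; the log format is hypothetical.
import re

def parse_job_log(lines):
    info = {}
    for line in lines:
        if m := re.match(r"Events read\s*:\s*(\d+)", line):
            info["events_in"] = int(m.group(1))
        elif m := re.match(r"Events written\s*:\s*(\d+)", line):
            info["events_out"] = int(m.group(1))
        elif m := re.match(r"ORCA version\s*:\s*(\S+)", line):
            info["software_version"] = m.group(1)
    return info   # the wrapper would push this dict into the bookkeeping DB

print(parse_job_log(["Events read : 5000", "Events written : 123", "ORCA version : 6_2_0"]))
```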
37. BOSS Logical Diagram
[Diagram: a job specification (executable plus bookkeeping definition) is submitted; BOSS instruments the job for bookkeeping and submits it through a filter interface to the scheduler (Condor vanilla, LSF, FBSNG, Grid scheduler); the executing job updates the bookkeeping DB (MySQL) via SQL UPDATE, while queries, kills and bookkeeping retrieval go through SQL SELECT/UPDATE on the same DB.]
38. TOMORROW
- Map data-sets to Grid data products
- Use the Grid security infrastructure and workload manager
- Deploy a Grid-enabled portal to interactive analysis
- Global monitoring of Grid performance and quality of service
39. Computing
- Ramp up production systems in 2005-2007 (30%, 30%, 40% of the cost in each year)
- Match the computing power available with the LHC luminosity:
- 2006: 200M reconstructed events/month, 100M re-reconstructed events/month, 30k events/s analysis
- 2007: 300M reconstructed events/month, 200M re-reconstructed events/month, 50k events/s analysis
40. Toward ONE Grid
- Build a unique CMS-Grid framework (EU + US)
- EU and US grids are not interoperable today; waiting for help from DataTAG, iVDGL and GLUE
- Work in parallel in EU and US
- Main US activities:
- MOP
- Virtual Data System
- Interactive analysis
- Main EU activities:
- Integration of IMPALA with EDG WP1/WP2 software
- Batch analysis: user job submission, analysis farm
41. PPDG MOP System
- PPDG has developed the MOP system
- It allows CMS production jobs to be submitted from a central location, run at remote locations, and return their results
- It relies on:
- GDMP for replication
- Globus GRAM
- Condor-G and local queuing systems for job scheduling
- IMPALA for job specification
- Being deployed on the USCMS testbed
- Proposed as the basis for the next CMS-wide production infrastructure
42. (No transcript)
43. Prototype VDG System (Production)
[Diagram legend: components marked as existing, implemented using MOP, or not yet coded.]
44. Globally Scalable Monitoring Service
[Diagram: push and pull transports (rsh/ssh, existing scripts, SNMP) feeding the monitoring service.]
45. Optimisation of Tag Databases
- Tags (n-tuples) are small (0.2-1 kB) summary objects for each event
- They are crucial for fast selection of interesting event subsets; this will be an intensive activity (see the sketch below)
- Past work concentrated on three main areas:
- development of Objectivity-based tags integrated with the CMS COBRA framework and Lizard
- investigations of tag bitmap indexing to speed up queries
- comparisons of OO and traditional databases (SQL Server, Oracle 9i, PostgreSQL) as efficient stores for tags
- New work concentrates on tag-based analysis services
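A toy illustration of why small per-event tags pay off: cuts run over the compact tag records, and only the events that pass are fetched from the full event store. The tag attributes used here are invented for the example; they are not the CMS tag schema.

```python
# Toy tag-based selection: filter on small summary records, fetch full
# event data only for the survivors.  Tag attributes are invented.

tags = [
    {"event": 1, "n_muons": 2, "met": 35.0},
    {"event": 2, "n_muons": 0, "met": 80.0},
    {"event": 3, "n_muons": 1, "met": 55.0},
]

def select(tags, cut):
    """Return the event ids passing the cut, without touching the event store."""
    return [t["event"] for t in tags if cut(t)]

interesting = select(tags, lambda t: t["n_muons"] >= 1 and t["met"] > 40.0)
print(interesting)   # only these events would be read back in full
```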
46. CLARENS: a Portal to the Grid
- Grid-enables the working environment for physicists' data analysis
- Clarens consists of a server communicating with various clients via the commodity XML-RPC protocol; this ensures implementation independence (see the client sketch below)
- The server is implemented in C++ to give access to the CMS OO analysis toolkit
- The server will provide a remote API to Grid tools:
- security services provided by the Grid (GSI)
- the Virtual Data Toolkit: object collection access
- data movement between Tier centres using GSI-FTP
- CMS analysis software (ORCA/COBRA)
- The current prototype is running on the Caltech proto-Tier2
- More information, along with a web-based demo, at http://clarens.sourceforge.net
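Because the protocol is commodity XML-RPC over HTTP(S), a thin client needs nothing beyond a standard library. The sketch below uses Python's built-in xmlrpc module; the server URL is a placeholder and only the standard XML-RPC introspection call is used, not the actual Clarens service API.

```python
# Sketch of a thin XML-RPC client talking to a Clarens-style server.
# The URL is a placeholder; only standard XML-RPC introspection is called.
import xmlrpc.client

server = xmlrpc.client.ServerProxy("https://clarens.example.org:8080/")
try:
    methods = server.system.listMethods()   # standard XML-RPC introspection
    print("remote services:", methods)
except (OSError, xmlrpc.client.Error) as err:
    print("could not reach the server:", err)
```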
47. Clarens Architecture
- A common protocol is spoken by all types of clients to all types of services:
- implement each service once, for all clients
- implement client access to a service once per client type, using a common protocol already implemented for all languages (C, Java, Fortran, etc.)
- The common protocol is XML-RPC, with SOAP close to working; CORBA is doable but would require a different server above Clarens (it uses IIOP, not HTTP)
- Handles authentication using Grid certificates, connection management, data serialization and, optionally, encryption
- The implementation uses a stable, well-known server infrastructure (Apache) that has been debugged and audited over a long period by many
- The Clarens layer itself is implemented in Python, but can be reimplemented in C should performance be inadequate
48. Clarens Architecture II
[Diagram: a client talks RPC over http/https to the Clarens layer inside the web server, which forwards requests to the service.]
49. Clarens Architecture (request life cycle)
- Client side: authentication; session initialization; request marshalling, serializing and transmission; result deserializing; session termination
- Server side: authentication; session initialization; request deserializing; worker code invocation; result serializing; session termination (see the sketch below)
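The server-side steps (deserialize the request, invoke the worker code, serialize the result) are what any generic XML-RPC server performs. The minimal sketch below uses Python's standard library server rather than the real Apache-based Clarens stack, and omits the GSI authentication and session handling.

```python
# Minimal server-side sketch of the request cycle listed above, using the
# standard-library XML-RPC server instead of the Apache-based Clarens stack.
from xmlrpc.server import SimpleXMLRPCServer

def echo(payload):
    """A trivial 'worker code' method exposed to clients."""
    return {"you_sent": payload}

if __name__ == "__main__":
    srv = SimpleXMLRPCServer(("localhost", 8080), allow_none=True)
    srv.register_introspection_functions()   # enables system.listMethods
    srv.register_function(echo, "echo")      # (de)serializing handled by the library
    srv.serve_forever()                      # authentication/GSI omitted in this sketch
```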
50.
- Clarens is a simple way to implement web services on the server
- It provides some basic connectivity functionality common to all services
- Uses commodity protocols
- No Globus is needed on the client side, only a certificate
- Simple to implement clients in scripts and compiled code
51. 2007
- Sub-event components map to Grid data products
- Balance of load between network and CPU
- The complete data and software base virtually available at the physicist's desktop
52. Simulation, Reconstruction and Analysis Software System
[Layered diagram, uploadable on the Grid: physics modules (reconstruction algorithms, event filter, physics analysis, data monitoring) plug into a specific framework on top of a Grid-enabled application framework; calibration, event and configuration objects live as Grid-aware data products; below sit the generic application framework, adapters and extensions, and basic services (C++ standard library and extension toolkit, ODBMS, Geant3/4, CLHEP, PAW replacement).]
53. Reconstruction on Demand
- Example: compare the results of two different track reconstruction algorithms.
[Diagram: the event holds hits per detector element; RecHits, calorimeter clusters (CaloCl) and tracks (T1 from Rec T1, T2 from Rec T2) are reconstructed only when the analysis requests them. A minimal sketch of the on-demand mechanism follows.]
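A minimal sketch of the on-demand idea: a product is computed from the hits only when first requested, then cached, so two track-reconstruction algorithms can be compared on the same event. Class and algorithm names are illustrative, not CMS framework code.

```python
# Sketch of reconstruction on demand: compute a product lazily, cache it,
# and let the analysis compare two algorithms on the same hits.
# Names and "algorithms" are illustrative stand-ins, not CMS code.

class OnDemand:
    def __init__(self, event, algorithm):
        self._event, self._algorithm, self._cache = event, algorithm, None

    def get(self):
        if self._cache is None:                  # reconstruct only on first access
            self._cache = self._algorithm(self._event["hits"])
        return self._cache

def track_algo_1(hits):
    return [h for h in hits if h > 0.5]          # stand-in for a real algorithm

def track_algo_2(hits):
    return sorted(hits, reverse=True)[:2]        # stand-in for another algorithm

event = {"hits": [0.2, 0.7, 0.9, 0.4]}
t1, t2 = OnDemand(event, track_algo_1), OnDemand(event, track_algo_2)
print("algo 1 tracks:", t1.get())
print("algo 2 tracks:", t2.get())                # the analysis compares the two results
```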
54. Conclusions
- The Grid is the enabling technology for the effective deployment of a coherent and consistent data processing environment
- This is the only basis for an efficient physics analysis program at the LHC
- CMS is engaged in an active development, test and deployment program for all the software and hardware components that will constitute the future LHC grid