Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Transcript and Presenter's Notes
1
Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge
  • D. Bonacorsi
  • (on behalf of INFN-CNAF Tier-1 staff and the CMS experiment)

ACAT 2005 - X Int. Workshop on Advanced Computing and Analysis Techniques in Physics Research, May 22nd-27th, 2005 - DESY, Zeuthen, Germany
2
Outline
  • The past
  • CMS operational environment during the Data
    Challenge
  • focus on INFN-CNAF Tier-1 resources and set-up
  • The present
  • lessons learned from the challenge
  • The future
  • try to apply what we (think we) learned

3
The INFN-CNAF Tier-1
  • Located at the INFN-CNAF centre, in Bologna (Italy)
  • computing facility for the INFN HENP community
  • one of the main nodes of the GARR network
  • Multi-experiment Tier-1
  • LHC experiments + AMS, Argo, BaBar, CDF, Magic, Virgo, ...
  • evolution: dynamic sharing of resources among the involved exps
  • CNAF is a relevant Italian site from a Grid perspective
  • participating in the LCG, EGEE and INFN-GRID projects
  • support to R&D activities, developing/testing prototypes/components
  • traditional (non-Grid) access to resources is also granted, but is more manpower-consuming

4
Tier-1 resources and services
  • computing power
  • CPU farms for 1300 kSI2k, a few dozen servers
  • bi-processor boxes: 320 @ 0.8-2.4 GHz, 350 @ 3 GHz, hyper-threading activated
  • storage
  • on-line data access (disks)
  • IDE, SCSI, FC: 4 NAS systems (60 TB), 2 SAN systems (225 TB)
  • custodial task on MSS (tapes in the Castor HSM system)
  • STK L180 lib - overall 18 TB
  • STK 5500 lib - 6 LTO-2 (240 TB) + 2 9940B (136 TB, more to be installed)
  • networking
  • T1 LAN
  • rack FE switches with 2x Gbps uplinks to the core switch (ds → via GE to core)
  • upgrade foreseen → rack Gb switches
  • 1 Gbps T1 link to WAN (+1 Gbps for the Service Challenge)
  • will be 10 Gbps in Q3 2005
  • More
  • infrastructure (electric power, UPS, etc.)
  • system administration, database services administration, etc.
  • support to experiment-specific activities

5
The CMS Data Challenge: what and how
  • Validate the CMS computing model on a sufficient
    number of Tier-0/1/2s
  • → large-scale test of the computing/analysis models

Generation / Simulation
  • CMS Pre-Challenge Production (PCP)
  • up to digitization (needed as input for DC)
  • mainly non-grid productions
  • but also grid prototypes (CMS/LCG-0, LCG-1,
    Grid3)

Digitization
70M Monte Carlo events (20M with Geant-4)
produced, 750K jobs ran, 3500 KSI2000 months, 80
TB of data
  • CMS Data Challenge (DC04)
  • Reconstruction and analysis on CMS data sustained
    over 2 months
  • at 5% of the LHC rate at full luminosity → 25% of the start-up luminosity
  • sustain a 25 Hz reconstruction rate in the Tier-0
    farm
  • register data and metadata to a world-readable
    catalogue
  • distribute reconstructed data from Tier-0 to
    Tier-1/2s
  • analyze reconstructed data at the Tier-1/2s as
    they arrive
  • monitor/archive information on resources and
    processes

Reconstruction
Analysis
  • not a CPU challenge → aimed at demonstrating the feasibility of the full chain (a quick rate check follows below)
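A quick back-of-the-envelope check of what the 25 Hz target implies; a minimal sketch using only numbers quoted in this deck (25 Hz sustained, ~50-day challenge window):

```python
# What a sustained 25 Hz reconstruction rate implies over the DC04 window
# (inputs are only numbers quoted elsewhere in this deck).
RECO_RATE_HZ = 25              # target Tier-0 reconstruction rate
SECONDS_PER_DAY = 86_400

events_per_day = RECO_RATE_HZ * SECONDS_PER_DAY
print(f"{events_per_day / 1e6:.2f} Mevents/day")          # ~2.16 Mevents/day

days = 50                      # roughly the 51-day data time window quoted later
print(f"~{events_per_day * days / 1e6:.0f} Mevents in {days} days")   # ~108 Mevents
```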

6
PCP set-up: a hybrid model
[Workflow diagram: a Physics Group asks for a new dataset; the Production Manager defines assignments in RefDB; a Site Manager starts an assignment; McRunjob prepares the jobs (data-level queries to RefDB) and submits them via shell scripts to the Local Batch Manager, with job-level queries tracked in the BOSS DB.]
7
PCP grid-based prototypes
INFN share of the CMS production steps (INFN/CMS): Generation 13%, Simulation 14%, ooHitformatting 21%, Digitisation 18%
Strong INFN contribution to crucial PCP production, in both
  • traditional production
  • constant work of integration in CMS between CMS software and production tools and the evolving EDG-X → LCG-Y middleware, in several phases:
  • CMS Stress Test with EDG < 1.4, then
  • PCP on the CMS/LCG-0 testbed
  • PCP on LCG-1, towards DC04 with LCG-2
CMS-LCG virtual Regional Center (EU-CMS: submit to the LCG scheduler)
  • 0.5 Mevts Generation, heavy pythia (2000 jobs, 8 hours each, 10 KSI2000 months)
  • 2.1 Mevts Simulation, CMSIM+OSCAR (8500 jobs, 10 hours each, 130 KSI2000 months), 2 TB of data
  • OSCAR: 0.6 Mevts on LCG-1; CMSIM: 1.5 Mevts on CMS/LCG-0 (PIII 1 GHz reference)
(a rough consistency check of these CPU figures is sketched below)
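As a sanity check, the quoted job counts, wall times and KSI2000-months can be combined to see what average CPU power they imply; a minimal sketch using only the numbers above plus the assumption of 30-day months:

```python
# Rough consistency check of the CPU accounting quoted on this slide
# (assumption: 1 month = 30 days of wall-clock time).
HOURS_PER_MONTH = 24 * 30

def implied_ksi2k_per_cpu(n_jobs, hours_per_job, ksi2k_months):
    """Average CPU power (in kSI2000) implied by jobs x walltime vs. quoted kSI2000-months."""
    cpu_months = n_jobs * hours_per_job / HOURS_PER_MONTH
    return ksi2k_months / cpu_months

# Generation, heavy pythia: 2000 jobs x 8 h, quoted as 10 KSI2000 months
print(round(implied_ksi2k_per_cpu(2000, 8, 10), 2))    # ~0.45 kSI2k, roughly PIII 1 GHz class

# Simulation, CMSIM+OSCAR: 8500 jobs x 10 h, quoted as 130 KSI2000 months
print(round(implied_ksi2k_per_cpu(8500, 10, 130), 2))  # ~1.1 kSI2k, i.e. faster worker nodes
```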
8
Global DC04 layout and workflow
Hierarchy of Regional Centres and data distribution chains: 3 distinct scenarios deployed and tested
9
INFN-specific DC04 workflow
[Diagram: the Tier-0 disk-SE Export Buffer and the Transfer Management DB (TMDB) feed the CNAF T1, where the TRA-Agent, SAFE-Agent and REP-Agent query/update the TMDB and a local MySQL DB; data flow to the T1 Castor SE (LTO-2 tape library), to the T1 disk-SE and to the Legnaro T2 disk-SE.]
  • data movement T0 → T1
  • data custodial task, interface to MSS
  • data movement T1 → T2 for real-time analysis

Basic issues addressed at T1
10
An example: data flow during just 1 day of DC04
[Monitoring plots for just one day (Apr 19th): CNAF T1 Castor SE - eth I/O (input from the SE-EB), TCP connections, RAM memory; CNAF T1 disk-SE - eth I/O (input from the Castor SE, in green); Legnaro T2 disk-SE - eth I/O (input from the Castor SE).]
11
DC04 outcome (grand summary, focus on INFN T1)
  • reconstruction/data-transfer/analysis may run at 25 Hz
  • automatic registration and distribution of data, key role of the TMDB
  • it was the embryonic PhEDEx!
  • support a (reasonable) variety of different data transfer tools and set-ups
  • Tier-1s: different performances, related to operational choices
  • SRB, LCG Replica Manager and SRM investigated (see CHEP04 talk)
  • INFN T1: good performance of the LCG-2 chain (PIC T1 also)
  • register all data and metadata (POOL) to a world-readable catalogue
  • RLS: good as a global file catalogue, bad as a global metadata catalogue
  • analyze the reconstructed data at the Tier-1s as data arrive
  • LCG components: dedicated bdII+RB, UIs, CEs+WNs at CNAF and PIC
  • real-time analysis at Tier-2s was demonstrated to be possible
  • 15k jobs submitted
  • time window between reco data availability and start of analysis jobs can be reasonably low (i.e. 20 mins)
  • reduce the number of files (i.e. increase <events>/<file>), as illustrated by the toy model below
  • more efficient use of bandwidth
  • reduce overhead of commands
  • address scalability of MSS systems (!)
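A toy model of the bandwidth point above, assuming a fixed per-file overhead; the 10 s value and the functional form are illustrative assumptions, not DC04 measurements:

```python
# Toy model of why fewer, larger files use the bandwidth better: each transfer
# pays a fixed per-file overhead (authentication, catalogue lookups, tape
# positioning, ...) before the data flow. The 10 s overhead is an illustrative
# assumption, not a DC04 measurement.
LINK_MBPS = 1000            # nominal 1 Gbps WAN link (from the resources slide)
PER_FILE_OVERHEAD_S = 10.0  # assumed fixed cost per file

def effective_mbps(file_size_mb):
    transfer_s = file_size_mb * 8 / LINK_MBPS
    return file_size_mb * 8 / (transfer_s + PER_FILE_OVERHEAD_S)

for size_mb in (1, 10, 100, 1000):
    print(f"{size_mb:5d} MB/file -> {effective_mbps(size_mb):6.1f} Mbps effective")
# Tiny files waste almost the whole link; GB-sized files recover most of the rate.
```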

12
Learn from DC04 lessons
  • Some general considerations may apply
  • although a DC is experiment-specific, maybe its
    conclusions are not
  • an experiment-specific problem is better
    addressed if conceived as a shared one in a
    shared Tier-1
  • an experiment DC just provides hints, real work
    gives insight
  • → crucial role of the experiments at the Tier-1
  • find weaknesses of the CASTOR MSS system in particular operating conditions
  • stress-test the new LSF farm with official CMS production jobs
  • test DNS-based load-balancing by serving data for production and/or analysis from the CMS disk-servers (a minimal sketch of the idea follows after this list)
  • test new components, newly installed/upgraded Grid tools, etc.
  • find bottlenecks and scalability problems in DB services
  • give feedback on monitoring and accounting activities
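The DNS-based load-balancing mentioned above can be illustrated with a minimal sketch: a single alias resolves to the pool of disk servers and each client picks one of the returned addresses. The alias name below is hypothetical, used only as an example:

```python
# Minimal sketch of DNS-based load balancing: a single alias resolves to the
# pool of disk servers and each client simply picks one of the returned
# addresses. The alias name below is a hypothetical example, not a real host.
import random
import socket

def pick_disk_server(alias="diskserv-cms.example.infn.it"):
    """Return one IP address from the alias' pool (spread clients over servers)."""
    _, _, addresses = socket.gethostbyname_ex(alias)
    return random.choice(addresses)

# A client would open its data connection to pick_disk_server() instead of a
# fixed host, so the load spreads without any client-side configuration.
```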

13
T1 today, farming: what changed since DC04?
  • Migration in progress
  • OS
  • RH v.7.3 → SLC v.3.0.4
  • middleware
  • upgrade to LCG v.2.4.0
  • install/manage WNs/servers
  • lcfgng → Quattor
  • integration LCG-Quattor
  • batch scheduler
  • Torque+Maui → LSF v.6.0
  • queues for prod/anal
  • manage Grid interfacing

[Plot: jobs on the farm over time - running, pending, total nb. of jobs, max nb. of slots.]
  • Analysis
  • controlled and fake (DC04) vs.
    unpredictable and real (now)
  • T1 provides one full LCG site, 2 dedicated RBs/bdII, support to CRABers
  • Interoperability always an issue, even harder in
    a transition period
  • dealing with 2-3 sub-farms in use by 10 exps
    (in prod)
  • resource use optimization still to be achieved

→ see N. De Filippis, session II, day 3
14
T1 today, storage: what changed since DC04?
  • Storage issues (1/2): disks
  • driven by the requirements of LHC data processing at the Tier-1
  • i.e. simultaneous access to PBs of data from 1000 nodes at high rate
  • main focus is on robust, load-balanced, redundant solutions to grant proficient and stable data access to distributed users
  • namely, make both sw and data accessible from jobs running on WNs
  • remote access (gridftp) and local access (rfiod, xrootd, GPFS) services, afs/nfs to share the exps' sw on WNs, filesystem tests, specific problem solving in analysts' daily operations, CNAF participation in SC2/3, etc.
  • a SAN approach with a parallel filesystem on top looks promising
  • Storage issues (2/2): tapes
  • CMS DC04 helped to focus some problems
  • LTO-2 drives not efficiently used by the exps in production at the T1
  • performance degradation increases as file size decreases
  • hangs on locate/fskip after 100 non-sequential reads
  • not-full tapes are labelled RDONLY after only 50-100 GB written
  • CASTOR performance increases with clever pre-staging of files (a pre-staging sketch follows after this slide)
  • some reliability achieved only on sequential/pre-staged reading
  • solutions?
  • from the HSM sw side: fix coming with CASTOR v.2 (Q2 2005)?
  • from the HSM hw side: test 9940B drives in prod (see PIC T1)

see P.P. Ricci, session II, day 3
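The "clever pre-staging" idea can be sketched as follows; `get_tape_location()` and `stage()` are placeholders for the site's actual CASTOR query/stage commands, which the slide does not specify:

```python
# Sketch of the "clever pre-staging" idea: ask the mass-storage system for the
# files in on-tape order, tape by tape, before the jobs that read them start.
# get_tape_location() and stage() are placeholders for the site's actual
# CASTOR query/stage commands, which are not specified on the slide.
from collections import defaultdict

def plan_prestage(files, get_tape_location):
    """Group files by tape and sort each group by position on tape."""
    by_tape = defaultdict(list)
    for f in files:
        tape, position = get_tape_location(f)   # e.g. (volume id, file sequence number)
        by_tape[tape].append((position, f))
    return {tape: [f for _, f in sorted(entries)] for tape, entries in by_tape.items()}

def prestage(files, get_tape_location, stage):
    # One sequential pass per tape avoids the costly locate/fskip seeks that
    # the slide reports as failing after ~100 non-sequential reads.
    for tape, ordered in plan_prestage(files, get_tape_location).items():
        for f in ordered:
            stage(f)   # bring the file onto the disk buffer ahead of the jobs
```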
15
Current CMS set-up at the Tier-1
[Diagram: CMS activities at the Tier-1 - the Castor MSS with its disk buffer; an Import-Export Buffer and analysis/production disks exposed as SEs for remote access (PhEDEx agents); the CPU farm of WNs (core + overflow, logical grouping) managed by LSF, reached both by local production and, through a CE on the Grid.it/LCG layer, by Grid production/analysis; operations control and resource management span local, shared and CMS-dedicated parts.]
16
PhEDEx in CMS
  • PhEDEx (Physics Experiment Data Export) used by
    CMS
  • overall infrastructure for data transfer
    management in CMS
  • allocation and transfers of CMS physics data
    among Tier-0/1/2s
  • different datasets move on bidirectional routes
    among Regional Centers
  • data should reside on SEs (e.g. gsiftp or srm
    protocols)
  • components
  • TMDB (inherited from DC04)
  • files, topology, subscriptions...
  • a coherent set of sw agents, loosely coupled, inter-operating and communicating through the TMDB blackboard (a minimal sketch of the pattern follows below)
  • e.g. agents for data allocation (based on site data subscriptions), file import/export, migration to MSS, routing (based on the implemented topologies), monitoring, etc.

INFN T1 mainly on data transfer
INFN T1 mainly on prod/anal
  • born, and growing fast
  • >70 TB known to PhEDEx, >150 TB total replicated
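A minimal sketch of the agents-plus-blackboard pattern described above; the table name, states and polling interval are illustrative only, not the real TMDB schema:

```python
# Minimal sketch of the "agents + TMDB blackboard" pattern: each agent polls a
# shared database for files in the state it owns, does its work, and advances
# the state so the next agent picks the file up. Table, states and interval
# are illustrative only, not the real TMDB schema.
import sqlite3
import time

def run_agent(db_path, claim_state, done_state, work):
    """Loop: process every file found in `claim_state`, then mark it `done_state`."""
    db = sqlite3.connect(db_path)
    while True:
        rows = db.execute("SELECT guid FROM transfer_state WHERE state = ?",
                          (claim_state,)).fetchall()
        for (guid,) in rows:
            work(guid)                                   # e.g. copy a file, migrate to MSS
            db.execute("UPDATE transfer_state SET state = ? WHERE guid = ?",
                       (done_state, guid))
        db.commit()
        time.sleep(30)                                   # loosely coupled: just poll again

# e.g. an export agent: run_agent("tmdb.db", "at_tier0", "in_export_buffer", copy_to_eb)
```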

17
PhEDEx transfer rates T0 → INFN T1
[Plots: CNAF T1 disk-server I/O (daily and weekly); rate out of the CERN Tier-0.]
18
PhEDEx at INFN
  • INFN-CNAF is a T1 node in PhEDEx
  • CMS DC04 experience was crucial to start up PhEDEx in INFN
  • CNAF node operational since the beginning
  • First phase (Q3/4 2004)
  • agent code development, focus on operations, T0 → T1 transfers
  • >1 TB/day T0 → T1 demonstrated feasible
  • but the aim is not to achieve peaks, but to sustain them in normal operations (see the rate conversion sketched at the end of this slide)
  • Second phase (Q1 2005)
  • PhEDEx deployment in INFN to Tier-n, n>1
  • distributed topology scenario
  • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
  • already operational at Legnaro, Pisa, Bari,
    Bologna

An example: data flow to T2s in daily operations (here a test with 2000 files, 90 GB, with no optimization)
  • 450 Mbps CNAF T1 → LNL-T2
  • 205 Mbps CNAF T1 → Pisa-T2
  • Third phase (Q>1 2005)
  • many issues, e.g. stability of the service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.
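For reference, a quick conversion between the quantities quoted on this slide (sustained TB/day versus instantaneous Mbps); pure arithmetic on the quoted numbers:

```python
# Quick conversion between the quantities quoted on this slide.
def tb_per_day_to_mbps(tb_per_day):
    return tb_per_day * 1e12 * 8 / 86_400 / 1e6

def mbps_to_tb_per_day(mbps):
    return mbps * 1e6 / 8 * 86_400 / 1e12

print(f"{tb_per_day_to_mbps(1):.0f} Mbps")        # sustaining 1 TB/day needs only ~93 Mbps
print(f"{mbps_to_tb_per_day(450):.1f} TB/day")    # the 450 Mbps burst would give ~4.9 TB/day
```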

19
CMS Monte Carlo productions
  • The CMS production system is evolving into a permanent effort
  • strong contribution of INFN T1 to CMS productions
  • 252 assignments in PCP-DC04, for all production steps, both local and Grid
  • plenty of assignments (simulation only) now running on LCG (Italy+Spain)
  • CNAF support for direct submitters, backup SEs provided for Spain
  • currently, digitization/DST production runs efficiently locally (mostly at the T1)
  • the produced data are then injected into the CMS data distribution infrastructure
  • future of T1 productions: rounds of scheduled reprocessing

DST production at INFN T1: 12.9 Mevts assigned, 11.8 Mevts produced
20
Coming next: Service Challenge (SC3)
  • data transfer and data serving in real use-cases
  • review the existing infrastructure/tools and give them a boost
  • details of the challenge are currently under
    definition
  • Two phases
  • Jul05: SC3 throughput phase
  • Tier-0/1/2 simultaneous import/export, MSS
    involved
  • move real files, store on real hw
  • >Sep05: SC3 service phase
  • small-scale replica of the overall system
  • modest throughput, main focus is on testing in a
    quite complete environment, with all the crucial
    components
  • space for experiment-specific tests and inputs
  • Goals
  • test crucial components, push to prod-quality,
    and measure.
  • towards the next production service
  • INFN T1 participated in SC2, and is joining SC3

21
Conclusions
  • The INFN-CNAF T1 is quite young but ramping up towards stable, production-quality services
  • optimized use of resources, interfaces to the Grid
  • policy/HR to support the experiments at the Tier-1
  • the Tier-1 actively participated in CMS DC04
  • good hints: identified bottlenecks in managing resources, scalability, ...
  • Learn the lessons: overall revision of the CMS set-up at the T1
  • involves both Grid and non-Grid access
  • first results are encouraging: success of daily operations
  • local/Grid productions and distributed analysis are running
  • Go ahead
  • long path
  • next step on it: preparation for SC3, also with CMS applications

22
Back-up slides
23
PhEDEx transfer rates T0 → INFN T1 (back-up)
[Plots: CNAF T1 disk-server I/O (daily and weekly); rate out of the CERN Tier-0.]
24
PhEDEx transfer rates T0 → INFN T1 (back-up)
[Plots: CNAF T1 disk-server I/O (daily and weekly); rate out of the CERN Tier-0.]
25
CNAF autopsy of DC04: lethal injuries only
  • Agents drain data from the SE-EB down to the CNAF/PIC T1s, and files land directly on a Castor SE buffer
  • → it occurred that in DC04 these files were many and small
  • so for any file on the Castor SE filesystem a tape migration is foreseen, with a given policy, regardless of file size/number
  • → this strongly affected data transfer at the CNAF T1 (the MSS below is an STK tape library with LTO-2 tapes)
  • Castor stager scalability issues
  • many small files (mostly 500 B - 50 kB) → bad performance of the stager db for >300-400k entries (may need more RAM?); a quick size estimate is sketched below
  • CNAF: fast set-up of an additional stager during DC04 basically worked
  • REP-Agent cloned to transparently continue replication to the disk-SEs
  • tape library LTO-2 issues
  • high nb. of segments on tape → bad tape read/write performance, LTO-2 SCSI errors, repositioning failures, slow migration to tape and delays in the TMDB SAFE-labelling, ...
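A quick estimate of why the small files hurt the stager catalogue, using only the sizes and entry counts quoted above:

```python
# Why "many small files" hits the stager database long before the disk fills:
# with the file sizes quoted above, a few hundred thousand catalogue entries
# correspond to very little data.
for size_bytes in (500, 50_000):                 # 500 B and 50 kB, as on the slide
    for entries in (300_000, 400_000):
        total_gb = size_bytes * entries / 1e9
        print(f"{entries} files of {size_bytes} B -> {total_gb:5.1f} GB")
# Even 400k files of 50 kB are only ~20 GB, yet already stress the stager db.
```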

see next slide
26
CNAF autopsy of DC04: non-lethal injuries
  • minor (?) Castor/tape-library issues
  • Castor filename length (more info: Castor ticket CT196717)
  • ext3 file-system corruption on a partition of the old stager
  • tapes blocked in the library
  • several crashes/hangs of the TRA-Agent (rate: 3 times per week)
  • → created from time to time some backlogs, nevertheless quickly recovered
  • → post-mortem analysis in progress
  • experience with the Replica Manager interface
  • e.g. files of size 0 created at destination when trying to replicate from the Castor SE data which are temporarily not accessible because of stager (or other) problems on the Castor side
  • → needs further tests to achieve reproducibility, and then Savannah reports
  • Globus-MDS Information System instabilities (rate: once per week)
  • → some temporary stops of data transfer (i.e. no SE found means no replicas)
  • RLS instabilities (rate: once per week)
constant and painful debugging
27
CMS DC04: number and sizes of files
DC04 data time window: 51 (3) days, March 11th - May 3rd
[Plot: >3k files for >750 GB around May 1st-2nd.]
Global CNAF network activity: 340 Mbps (>42 MB/s) sustained for 5 hours (max was 383.8 Mbps); cross-checked below
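Cross-checking the network figures above (pure arithmetic on the quoted numbers):

```python
# Cross-check of the network figures quoted above (pure arithmetic).
mbps = 340
mb_per_s = mbps / 8                         # 42.5 MB/s, matching the ">42 MB/s" quoted
hours = 5
total_gb = mb_per_s * hours * 3600 / 1000
print(f"{mb_per_s:.1f} MB/s sustained for {hours} h -> ~{total_gb:.0f} GB moved")   # ~765 GB
```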
28
Description of RLS usage
[Diagram of RLS usage in DC04 (CNAF RLS replica kept via ORACLE mirroring; components: local POOL catalogue, TMDB, Tier-1 Transfer agent, SRB GMCAT, Replica Manager, RM/SRM/SRB EB agents, Resource Broker, Configuration agent, XML Publication Agent, LCG ORCA Analysis Job). A toy sketch of steps 1-2 follows below.]
  1. Register files
  2. Find the Tier-1 location (based on metadata)
  3. Copy/delete files to/from the export buffers
  4. Copy files to the Tier-1s
  5. Submit analysis job
  6. Process DST and register private data
Specific client tools: POOL CLI, Replica Manager CLI, C++ LRC API based programs, LRC java API tools (SRB/GMCAT), Resource Broker
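A toy stand-in for the two catalogue operations behind steps 1-2 (register a file with some metadata, then pick the destination Tier-1 from a subscription map); a plain dictionary sketch with illustrative names, not the RLS/POOL API:

```python
# Toy stand-in for the two catalogue operations behind steps 1-2: register a
# file with its replica and a bit of metadata, then pick the destination
# Tier-1 from a dataset-subscription map. A plain dictionary sketch with
# illustrative names, not the RLS/POOL API.
catalog = {}   # GUID -> {"pfns": [...], "meta": {...}}

def register_file(guid, pfn, **metadata):
    entry = catalog.setdefault(guid, {"pfns": [], "meta": {}})
    entry["pfns"].append(pfn)
    entry["meta"].update(metadata)

def destination_tier1(guid, subscriptions):
    """Step 2: find the Tier-1 subscribed to this file's dataset (metadata lookup)."""
    dataset = catalog[guid]["meta"].get("dataset")
    return subscriptions.get(dataset)

register_file("guid-001", "srm://example-se/some/path/file1.root", dataset="dataset_A")
print(destination_tier1("guid-001", {"dataset_A": "CNAF-T1"}))   # -> CNAF-T1
```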
29
Tier-0 in DC04
Tier-0 architecture built on:
  • Systems
  • LSF batch system
  • 3 racks, 44 nodes each, dedicated: 264 CPUs in total
  • Dual P-IV Xeon 2.4 GHz, 1 GB mem, 100baseT
  • Dedicated cmsdc04 batch queue, 500 RUN-slots
  • Disk servers
  • DC04 dedicated stager, with 2 pools
  • 2 pools IB and GDB, 10 4 TB
  • Export Buffers
  • EB-SRM ( 4 servers, 4.2 TB total )
  • EB-SRB ( 4 servers, 4.2 TB total )
  • EB-SE ( 3 servers, 3.1 TB total )
  • Databases
  • RLS (Replica Location Service)
  • TMDB (Transfer Management DB)
  • Transfer steering
  • Agents steering data transfers

[Diagram: Tier-0 components - fake on-line process, ORCA RECO Job, RefDB, IB and GDB pools, TMDB, POOL RLS catalogue, Castor.]
30
CMS Production tools
  • CMS production tools (OCTOPUS)
  • RefDB
  • Contains production requests with all needed
    parameters to produce the dataset and the details
    about the production process
  • MCRunJob
  • Evolution of IMPALA, more modular (plug-in approach)
  • Tool/framework for job preparation and job
    submission
  • BOSS
  • Real-time job-dependent parameter tracking: the running job's standard output/error are intercepted and the filtered information is stored in the BOSS database. The remote updator is based on MySQL, but a remote updator based on R-GMA is being developed (a minimal sketch of the idea follows below).
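A minimal sketch of the BOSS idea described above (run the job, intercept its stdout, filter a few parameters into a database); the regular expression and schema are illustrative, not the real BOSS filters:

```python
# Minimal sketch of the BOSS idea: run the job, intercept its standard output,
# filter a few job-specific parameters and store them in a database as they
# appear. The regular expression and schema are illustrative, not BOSS's own.
import re
import sqlite3
import subprocess

def run_and_track(cmd, job_id, db_path="job_tracking.db"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS job_info (job_id TEXT, key TEXT, value TEXT)")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:                        # intercept the job's standard output
        m = re.match(r"^(\w+)\s*=\s*(.+)$", line)   # e.g. "events_processed = 1000"
        if m:
            db.execute("INSERT INTO job_info VALUES (?, ?, ?)", (job_id, *m.groups()))
            db.commit()                             # updated in real time, as in BOSS
    return proc.wait()
```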