Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Transcript and Presenter's Notes
1
Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge
  • D. Bonacorsi
  • (on behalf of INFN-CNAF Tier-1 staff and the CMS experiment)

ACAT 2005 - X Int. Workshop on Advanced Computing and Analysis Techniques in Physics Research, May 22nd-27th, 2005 - DESY, Zeuthen, Germany
2
Outline
  • The past
  • CMS operational environment during the Data
    Challenge
  • focus on INFN-CNAF Tier-1 resources and set-up
  • The present
  • lessons learned from the challenge
  • The future
  • try to apply what we (think we) learned

3
The INFN-CNAF Tier-1
  • Located at the INFN-CNAF centre, in Bologna (Italy)
  • computing facility for the INFN HENP community
  • one of the main nodes of the GARR network
  • Multi-experiment Tier-1
  • LHC experiments + AMS, Argo, BaBar, CDF, Magic, Virgo, ...
  • evolution: dynamic sharing of resources among the involved exps
  • CNAF is a relevant Italian site from a Grid perspective
  • participating in the LCG, EGEE and INFN-GRID projects
  • support to R&D activities, developing/testing prototypes/components
  • traditional (non-Grid) access to resources is also granted, but is more manpower-consuming

4
Tier-1 resources and services
  • computing power
  • CPU farms for 1300 kSI2k, a few dozen servers
  • bi-processor boxes: 320 @ 0.8-2.4 GHz, 350 @ 3 GHz, hyper-threading activated
  • storage
  • on-line data access (disks)
  • IDE, SCSI, FC: 4 NAS systems (60 TB), 2 SAN systems (225 TB)
  • custodial task on MSS (tapes in the Castor HSM system)
  • STK L180 lib - overall 18 TB
  • STK 5500 lib - 6 LTO-2 (240 TB) + 2 9940B (136 TB, more to be installed)
  • networking
  • T1 LAN
  • rack FE switches with 2x Gbps uplinks to the core switch (ds → via GE to core)
  • upgrade foreseen → rack Gb switches
  • 1 Gbps T1 link to WAN (+1 Gbps for the Service Challenge)
  • will be 10 Gbps in Q3 2005
  • More
  • infrastructure (electric power, UPS, etc.)
  • system administration, database services administration, etc.
  • support to experiment-specific activities

5
The CMS Data Challenge: what and how
  • Validate the CMS computing model on a sufficient
    number of Tier-0/1/2s
  • → large-scale test of the computing/analysis models

Generation / Simulation
  • CMS Pre-Challenge Production (PCP)
  • up to digitization (needed as input for DC)
  • mainly non-grid productions
  • but also grid prototypes (CMS/LCG-0, LCG-1,
    Grid3)

Digitization
70M Monte Carlo events (20M with Geant-4)
produced, 750K jobs ran, 3500 KSI2000 months, 80
TB of data
  • CMS Data Challenge (DC04)
  • Reconstruction and analysis on CMS data sustained
    over 2 months
  • at 5% of the LHC rate at full luminosity → 25% of the start-up luminosity
  • sustain a 25 Hz reconstruction rate in the Tier-0
    farm
  • register data and metadata to a world-readable
    catalogue
  • distribute reconstructed data from Tier-0 to
    Tier-1/2s
  • analyze reconstructed data at the Tier-1/2s as
    they arrive
  • monitor/archive information on resources and
    processes

Reconstruction
Analysis
  • not a CPU challenge → aimed at demonstrating the feasibility of the full chain (a quick rate check follows below)
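A quick back-of-the-envelope check of what the 25 Hz target implies; a minimal sketch using only numbers quoted in this deck (25 Hz sustained, ~50-day challenge window):

```python
# What a sustained 25 Hz reconstruction rate implies over the DC04 window
# (inputs are only numbers quoted elsewhere in this deck).
RECO_RATE_HZ = 25              # target Tier-0 reconstruction rate
SECONDS_PER_DAY = 86_400

events_per_day = RECO_RATE_HZ * SECONDS_PER_DAY
print(f"{events_per_day / 1e6:.2f} Mevents/day")          # ~2.16 Mevents/day

days = 50                      # roughly the 51-day data time window quoted later
print(f"~{events_per_day * days / 1e6:.0f} Mevents in {days} days")   # ~108 Mevents
```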

6
PCP set-up: a hybrid model
[Workflow diagram: a Physics Group asks for a new dataset; the Production Manager defines assignments in RefDB; a Site Manager starts an assignment; McRunjob prepares the jobs (data-level queries to RefDB) and submits them via shell scripts to the Local Batch Manager, with job-level queries tracked in the BOSS DB.]
7
PCP grid-based prototypes
INFN share of the CMS production steps (INFN/CMS): Generation 13%, Simulation 14%, ooHitformatting 21%, Digitisation 18%
Strong INFN contribution to crucial PCP production, in both
  • traditional production
  • constant work of integration in CMS between CMS software and production tools and the evolving EDG-X → LCG-Y middleware, in several phases:
  • CMS Stress Test with EDG < 1.4, then
  • PCP on the CMS/LCG-0 testbed
  • PCP on LCG-1, towards DC04 with LCG-2
CMS-LCG virtual Regional Center (EU-CMS: submit to the LCG scheduler)
  • 0.5 Mevts Generation, heavy pythia (2000 jobs, 8 hours each, 10 KSI2000 months)
  • 2.1 Mevts Simulation, CMSIM+OSCAR (8500 jobs, 10 hours each, 130 KSI2000 months), 2 TB of data
  • OSCAR: 0.6 Mevts on LCG-1; CMSIM: 1.5 Mevts on CMS/LCG-0 (PIII 1 GHz reference)
(a rough consistency check of these CPU figures is sketched below)
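As a sanity check, the quoted job counts, wall times and KSI2000-months can be combined to see what average CPU power they imply; a minimal sketch using only the numbers above plus the assumption of 30-day months:

```python
# Rough consistency check of the CPU accounting quoted on this slide
# (assumption: 1 month = 30 days of wall-clock time).
HOURS_PER_MONTH = 24 * 30

def implied_ksi2k_per_cpu(n_jobs, hours_per_job, ksi2k_months):
    """Average CPU power (in kSI2000) implied by jobs x walltime vs. quoted kSI2000-months."""
    cpu_months = n_jobs * hours_per_job / HOURS_PER_MONTH
    return ksi2k_months / cpu_months

# Generation, heavy pythia: 2000 jobs x 8 h, quoted as 10 KSI2000 months
print(round(implied_ksi2k_per_cpu(2000, 8, 10), 2))    # ~0.45 kSI2k, roughly PIII 1 GHz class

# Simulation, CMSIM+OSCAR: 8500 jobs x 10 h, quoted as 130 KSI2000 months
print(round(implied_ksi2k_per_cpu(8500, 10, 130), 2))  # ~1.1 kSI2k, i.e. faster worker nodes
```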
8
Global DC04 layout and workflow
Hierarchy of Regional Centres and data distribution chains: 3 distinct scenarios deployed and tested
9
INFN-specific DC04 workflow
[Diagram: the Tier-0 disk-SE Export Buffer and the Transfer Management DB (TMDB) feed the CNAF T1, where the TRA-Agent, SAFE-Agent and REP-Agent query/update the TMDB and a local MySQL DB; data flow to the T1 Castor SE (LTO-2 tape library), to the T1 disk-SE and to the Legnaro T2 disk-SE.]
  • data movement T0 → T1
  • data custodial task, interface to MSS
  • data movement T1 → T2 for real-time analysis

Basic issues addressed at T1
10
An example: data flow during just 1 day of DC04
[Monitoring plots for just one day (Apr 19th): CNAF T1 Castor SE - eth I/O (input from the SE-EB), TCP connections, RAM memory; CNAF T1 disk-SE - eth I/O (input from the Castor SE, in green); Legnaro T2 disk-SE - eth I/O (input from the Castor SE).]
11
DC04 outcome (grand summary, focus on INFN T1)
  • reconstruction/data-transfer/analysis may run at 25 Hz
  • automatic registration and distribution of data, key role of the TMDB
  • it was the embryonic PhEDEx!
  • support a (reasonable) variety of different data transfer tools and set-ups
  • Tier-1s: different performances, related to operational choices
  • SRB, LCG Replica Manager and SRM investigated (see CHEP04 talk)
  • INFN T1: good performance of the LCG-2 chain (PIC T1 also)
  • register all data and metadata (POOL) to a world-readable catalogue
  • RLS: good as a global file catalogue, bad as a global metadata catalogue
  • analyze the reconstructed data at the Tier-1s as data arrive
  • LCG components: dedicated bdII+RB, UIs, CEs+WNs at CNAF and PIC
  • real-time analysis at Tier-2s was demonstrated to be possible
  • 15k jobs submitted
  • time window between reco data availability and start of analysis jobs can be reasonably low (i.e. 20 mins)
  • reduce the number of files (i.e. increase <events>/<file>), as illustrated by the toy model below
  • more efficient use of bandwidth
  • reduce overhead of commands
  • address scalability of MSS systems (!)
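A toy model of the bandwidth point above, assuming a fixed per-file overhead; the 10 s value and the functional form are illustrative assumptions, not DC04 measurements:

```python
# Toy model of why fewer, larger files use the bandwidth better: each transfer
# pays a fixed per-file overhead (authentication, catalogue lookups, tape
# positioning, ...) before the data flow. The 10 s overhead is an illustrative
# assumption, not a DC04 measurement.
LINK_MBPS = 1000            # nominal 1 Gbps WAN link (from the resources slide)
PER_FILE_OVERHEAD_S = 10.0  # assumed fixed cost per file

def effective_mbps(file_size_mb):
    transfer_s = file_size_mb * 8 / LINK_MBPS
    return file_size_mb * 8 / (transfer_s + PER_FILE_OVERHEAD_S)

for size_mb in (1, 10, 100, 1000):
    print(f"{size_mb:5d} MB/file -> {effective_mbps(size_mb):6.1f} Mbps effective")
# Tiny files waste almost the whole link; GB-sized files recover most of the rate.
```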

12
Learn from DC04 lessons
  • Some general considerations may apply
  • although a DC is experiment-specific, maybe its
    conclusions are not
  • an experiment-specific problem is better
    addressed if conceived as a shared one in a
    shared Tier-1
  • an experiment DC just provides hints, real work
    gives insight
  • → crucial role of the experiments at the Tier-1
  • find weaknesses of the CASTOR MSS system in particular operating conditions
  • stress-test the new LSF farm with official CMS production jobs
  • test DNS-based load-balancing by serving data for production and/or analysis from the CMS disk-servers (a minimal sketch of the idea follows after this list)
  • test new components, newly installed/upgraded Grid tools, etc.
  • find bottlenecks and scalability problems in DB services
  • give feedback on monitoring and accounting activities
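The DNS-based load-balancing mentioned above can be illustrated with a minimal sketch: a single alias resolves to the pool of disk servers and each client picks one of the returned addresses. The alias name below is hypothetical, used only as an example:

```python
# Minimal sketch of DNS-based load balancing: a single alias resolves to the
# pool of disk servers and each client simply picks one of the returned
# addresses. The alias name below is a hypothetical example, not a real host.
import random
import socket

def pick_disk_server(alias="diskserv-cms.example.infn.it"):
    """Return one IP address from the alias' pool (spread clients over servers)."""
    _, _, addresses = socket.gethostbyname_ex(alias)
    return random.choice(addresses)

# A client would open its data connection to pick_disk_server() instead of a
# fixed host, so the load spreads without any client-side configuration.
```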

13
T1 today, farming: what changed since DC04?
  • Migration in progress
  • OS
  • RH v.7.3 → SLC v.3.0.4
  • middleware
  • upgrade to LCG v.2.4.0
  • install/manage WNs/servers
  • lcfgng → Quattor
  • integration LCG-Quattor
  • batch scheduler
  • Torque+Maui → LSF v.6.0
  • queues for prod/anal
  • manage Grid interfacing

[Plot: jobs on the farm over time - running, pending, total nb. of jobs, max nb. of slots.]
  • Analysis
  • controlled and fake (DC04) vs.
    unpredictable and real (now)
  • T1 provides one full LCG site, 2 dedicated RBs/bdII, support to CRABers
  • Interoperability always an issue, even harder in
    a transition period
  • dealing with 2-3 sub-farms in use by 10 exps
    (in prod)
  • resource use optimization still to be achieved

→ see N. De Filippis, session II, day 3
14
T1 today, storage: what changed since DC04?
  • Storage issues (1/2): disks
  • driven by the requirements of LHC data processing at the Tier-1
  • i.e. simultaneous access to PBs of data from 1000 nodes at high rate
  • main focus is on robust, load-balanced, redundant solutions to grant proficient and stable data access to distributed users
  • namely, make both sw and data accessible from jobs running on WNs
  • remote access (gridftp) and local access (rfiod, xrootd, GPFS) services, afs/nfs to share the exps' sw on WNs, filesystem tests, specific problem solving in analysts' daily operations, CNAF participation in SC2/3, etc.
  • a SAN approach with a parallel filesystem on top looks promising
  • Storage issues (2/2): tapes
  • CMS DC04 helped to focus some problems
  • LTO-2 drives not efficiently used by the exps in production at the T1
  • performance degradation increases as file size decreases
  • hangs on locate/fskip after 100 non-sequential reads
  • not-full tapes are labelled RDONLY after only 50-100 GB written
  • CASTOR performance increases with clever pre-staging of files (a pre-staging sketch follows after this slide)
  • some reliability achieved only on sequential/pre-staged reading
  • solutions?
  • from the HSM sw side: fix coming with CASTOR v.2 (Q2 2005)?
  • from the HSM hw side: test 9940B drives in prod (see PIC T1)

see P.P. Ricci, session II, day 3
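The "clever pre-staging" idea can be sketched as follows; `get_tape_location()` and `stage()` are placeholders for the site's actual CASTOR query/stage commands, which the slide does not specify:

```python
# Sketch of the "clever pre-staging" idea: ask the mass-storage system for the
# files in on-tape order, tape by tape, before the jobs that read them start.
# get_tape_location() and stage() are placeholders for the site's actual
# CASTOR query/stage commands, which are not specified on the slide.
from collections import defaultdict

def plan_prestage(files, get_tape_location):
    """Group files by tape and sort each group by position on tape."""
    by_tape = defaultdict(list)
    for f in files:
        tape, position = get_tape_location(f)   # e.g. (volume id, file sequence number)
        by_tape[tape].append((position, f))
    return {tape: [f for _, f in sorted(entries)] for tape, entries in by_tape.items()}

def prestage(files, get_tape_location, stage):
    # One sequential pass per tape avoids the costly locate/fskip seeks that
    # the slide reports as failing after ~100 non-sequential reads.
    for tape, ordered in plan_prestage(files, get_tape_location).items():
        for f in ordered:
            stage(f)   # bring the file onto the disk buffer ahead of the jobs
```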
15
Current CMS set-up at the Tier-1
[Diagram: CMS activities at the Tier-1 - the Castor MSS with its disk buffer; an Import-Export Buffer and analysis/production disks exposed as SEs for remote access (PhEDEx agents); the CPU farm of WNs (core + overflow, logical grouping) managed by LSF, reached both by local production and, through a CE on the Grid.it/LCG layer, by Grid production/analysis; operations control and resource management span local, shared and CMS-dedicated parts.]
16
PhEDEx in CMS
  • PhEDEx (Physics Experiment Data Export) used by
    CMS
  • overall infrastructure for data transfer
    management in CMS
  • allocation and transfers of CMS physics data
    among Tier-0/1/2s
  • different datasets move on bidirectional routes
    among Regional Centers
  • data should reside on SEs (e.g. gsiftp or srm
    protocols)
  • components
  • TMDB (inherited from DC04)
  • files, topology, subscriptions...
  • a coherent set of sw agents, loosely coupled, inter-operating and communicating through the TMDB blackboard (a minimal sketch of the pattern follows below)
  • e.g. agents for data allocation (based on site data subscriptions), file import/export, migration to MSS, routing (based on the implemented topologies), monitoring, etc.

INFN T1 mainly on data transfer
INFN T1 mainly on prod/anal
  • born, and growing fast
  • >70 TB known to PhEDEx, >150 TB total replicated
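A minimal sketch of the agents-plus-blackboard pattern described above; the table name, states and polling interval are illustrative only, not the real TMDB schema:

```python
# Minimal sketch of the "agents + TMDB blackboard" pattern: each agent polls a
# shared database for files in the state it owns, does its work, and advances
# the state so the next agent picks the file up. Table, states and interval
# are illustrative only, not the real TMDB schema.
import sqlite3
import time

def run_agent(db_path, claim_state, done_state, work):
    """Loop: process every file found in `claim_state`, then mark it `done_state`."""
    db = sqlite3.connect(db_path)
    while True:
        rows = db.execute("SELECT guid FROM transfer_state WHERE state = ?",
                          (claim_state,)).fetchall()
        for (guid,) in rows:
            work(guid)                                   # e.g. copy a file, migrate to MSS
            db.execute("UPDATE transfer_state SET state = ? WHERE guid = ?",
                       (done_state, guid))
        db.commit()
        time.sleep(30)                                   # loosely coupled: just poll again

# e.g. an export agent: run_agent("tmdb.db", "at_tier0", "in_export_buffer", copy_to_eb)
```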

17
PhEDEx transfer rates T0 → INFN T1
[Plots: CNAF T1 disk-server I/O (daily and weekly); rate out of the CERN Tier-0.]
18
PhEDEx at INFN
  • INFN-CNAF is a T1 node in PhEDEx
  • CMS DC04 experience was crucial to start up PhEDEx in INFN
  • CNAF node operational since the beginning
  • First phase (Q3/4 2004)
  • agent code development, focus on operations, T0 → T1 transfers
  • >1 TB/day T0 → T1 demonstrated feasible
  • but the aim is not to achieve peaks, but to sustain them in normal operations (see the rate conversion sketched at the end of this slide)
  • Second phase (Q1 2005)
  • PhEDEx deployment in INFN to Tier-n, n>1
  • distributed topology scenario
  • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
  • already operational at Legnaro, Pisa, Bari,
    Bologna

An example: data flow to T2s in daily operations (here a test with 2000 files, 90 GB, with no optimization)
  • 450 Mbps CNAF T1 → LNL-T2
  • 205 Mbps CNAF T1 → Pisa-T2
  • Third phase (Q>1 2005)
  • many issues, e.g. stability of the service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.
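For reference, a quick conversion between the quantities quoted on this slide (sustained TB/day versus instantaneous Mbps); pure arithmetic on the quoted numbers:

```python
# Quick conversion between the quantities quoted on this slide.
def tb_per_day_to_mbps(tb_per_day):
    return tb_per_day * 1e12 * 8 / 86_400 / 1e6

def mbps_to_tb_per_day(mbps):
    return mbps * 1e6 / 8 * 86_400 / 1e12

print(f"{tb_per_day_to_mbps(1):.0f} Mbps")        # sustaining 1 TB/day needs only ~93 Mbps
print(f"{mbps_to_tb_per_day(450):.1f} TB/day")    # the 450 Mbps burst would give ~4.9 TB/day
```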

19
CMS Monte Carlo productions
  • The CMS production system is evolving into a permanent effort
  • strong contribution of INFN T1 to CMS productions
  • 252 assignments in PCP-DC04, for all production steps, both local and Grid
  • plenty of assignments (simulation only) now running on LCG (Italy+Spain)
  • CNAF support for direct submitters, backup SEs provided for Spain
  • currently, digitization/DST production runs efficiently locally (mostly at the T1)
  • the produced data are then injected into the CMS data distribution infrastructure
  • future of T1 productions: rounds of scheduled reprocessing

DST production at INFN T1: 12.9 Mevts assigned, 11.8 Mevts produced
20
Coming next: Service Challenge (SC3)
  • data transfer and data serving in real use-cases
  • review the existing infrastructure/tools and give them a boost
  • details of the challenge are currently under
    definition
  • Two phases
  • Jul05: SC3 throughput phase
  • Tier-0/1/2 simultaneous import/export, MSS
    involved
  • move real files, store on real hw
  • >Sep05: SC3 service phase
  • small-scale replica of the overall system
  • modest throughput, main focus is on testing in a
    quite complete environment, with all the crucial
    components
  • space for experiment-specific tests and inputs
  • Goals
  • test crucial components, push to prod-quality,
    and measure.
  • towards the next production service
  • INFN T1 participated in SC2, and is joining SC3

21
Conclusions
  • The INFN-CNAF T1 is quite young but ramping up towards stable, production-quality services
  • optimized use of resources, interfaces to the Grid
  • policy/HR to support the experiments at the Tier-1
  • the Tier-1 actively participated in CMS DC04
  • good hints: identified bottlenecks in managing resources, scalability, ...
  • Learn the lessons: overall revision of the CMS set-up at the T1
  • involves both Grid and non-Grid access
  • first results are encouraging: success of daily operations
  • local/Grid productions and distributed analysis are running
  • Go ahead
  • long path
  • next step on it: preparation for SC3, also with CMS applications

22
Back-up slides
23
PhEDEx transfer rates T0 → INFN T1 (back-up)
[Plots: CNAF T1 disk-server I/O (daily and weekly); rate out of the CERN Tier-0.]
24
PhEDEx transfer rates T0 → INFN T1 (back-up)
[Plots: CNAF T1 disk-server I/O (daily and weekly); rate out of the CERN Tier-0.]
25
CNAF autopsy of DC04: lethal injuries only
  • Agents drain data from the SE-EB down to the CNAF/PIC T1s, and files land directly on a Castor SE buffer
  • → it occurred that in DC04 these files were many and small
  • so for any file on the Castor SE filesystem a tape migration is foreseen, with a given policy, regardless of file size/number
  • → this strongly affected data transfer at the CNAF T1 (the MSS below is an STK tape library with LTO-2 tapes)
  • Castor stager scalability issues
  • many small files (mostly 500 B - 50 kB) → bad performance of the stager db for >300-400k entries (may need more RAM?); a quick size estimate is sketched below
  • CNAF: fast set-up of an additional stager during DC04 basically worked
  • REP-Agent cloned to transparently continue replication to the disk-SEs
  • tape library LTO-2 issues
  • high nb. of segments on tape → bad tape read/write performance, LTO-2 SCSI errors, repositioning failures, slow migration to tape and delays in the TMDB SAFE-labelling, ...
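A quick estimate of why the small files hurt the stager catalogue, using only the sizes and entry counts quoted above:

```python
# Why "many small files" hits the stager database long before the disk fills:
# with the file sizes quoted above, a few hundred thousand catalogue entries
# correspond to very little data.
for size_bytes in (500, 50_000):                 # 500 B and 50 kB, as on the slide
    for entries in (300_000, 400_000):
        total_gb = size_bytes * entries / 1e9
        print(f"{entries} files of {size_bytes} B -> {total_gb:5.1f} GB")
# Even 400k files of 50 kB are only ~20 GB, yet already stress the stager db.
```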

see next slide
26
CNAF autopsy of DC04: non-lethal injuries
  • minor (?) Castor/tape-library issues
  • Castor filename length (more info: Castor ticket CT196717)
  • ext3 file-system corruption on a partition of the old stager
  • tapes blocked in the library
  • several crashes/hangs of the TRA-Agent (rate: 3 times per week)
  • → created from time to time some backlogs, nevertheless quickly recovered
  • → post-mortem analysis in progress
  • experience with the Replica Manager interface
  • e.g. files of size 0 created at destination when trying to replicate from the Castor SE data which are temporarily not accessible because of stager (or other) problems on the Castor side
  • → needs further tests to achieve reproducibility, and then Savannah reports
  • Globus-MDS Information System instabilities (rate: once per week)
  • → some temporary stops of data transfer (i.e. no SE found means no replicas)
  • RLS instabilities (rate: once per week)
constant and painful debugging
27
CMS DC04: number and sizes of files
DC04 data time window: 51 (3) days, March 11th - May 3rd
[Plot: >3k files for >750 GB around May 1st-2nd.]
Global CNAF network activity: 340 Mbps (>42 MB/s) sustained for 5 hours (max was 383.8 Mbps); cross-checked below
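Cross-checking the network figures above (pure arithmetic on the quoted numbers):

```python
# Cross-check of the network figures quoted above (pure arithmetic).
mbps = 340
mb_per_s = mbps / 8                         # 42.5 MB/s, matching the ">42 MB/s" quoted
hours = 5
total_gb = mb_per_s * hours * 3600 / 1000
print(f"{mb_per_s:.1f} MB/s sustained for {hours} h -> ~{total_gb:.0f} GB moved")   # ~765 GB
```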
28
Description of RLS usage
[Diagram of RLS usage in DC04 (CNAF RLS replica kept via ORACLE mirroring; components: local POOL catalogue, TMDB, Tier-1 Transfer agent, SRB GMCAT, Replica Manager, RM/SRM/SRB EB agents, Resource Broker, Configuration agent, XML Publication Agent, LCG ORCA Analysis Job). A toy sketch of steps 1-2 follows below.]
  1. Register files
  2. Find the Tier-1 location (based on metadata)
  3. Copy/delete files to/from the export buffers
  4. Copy files to the Tier-1s
  5. Submit analysis job
  6. Process DST and register private data
Specific client tools: POOL CLI, Replica Manager CLI, C++ LRC API based programs, LRC java API tools (SRB/GMCAT), Resource Broker
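A toy stand-in for the two catalogue operations behind steps 1-2 (register a file with some metadata, then pick the destination Tier-1 from a subscription map); a plain dictionary sketch with illustrative names, not the RLS/POOL API:

```python
# Toy stand-in for the two catalogue operations behind steps 1-2: register a
# file with its replica and a bit of metadata, then pick the destination
# Tier-1 from a dataset-subscription map. A plain dictionary sketch with
# illustrative names, not the RLS/POOL API.
catalog = {}   # GUID -> {"pfns": [...], "meta": {...}}

def register_file(guid, pfn, **metadata):
    entry = catalog.setdefault(guid, {"pfns": [], "meta": {}})
    entry["pfns"].append(pfn)
    entry["meta"].update(metadata)

def destination_tier1(guid, subscriptions):
    """Step 2: find the Tier-1 subscribed to this file's dataset (metadata lookup)."""
    dataset = catalog[guid]["meta"].get("dataset")
    return subscriptions.get(dataset)

register_file("guid-001", "srm://example-se/some/path/file1.root", dataset="dataset_A")
print(destination_tier1("guid-001", {"dataset_A": "CNAF-T1"}))   # -> CNAF-T1
```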
29
Tier-0 in DC04
Tier-0 architecture built on:
  • Systems
  • LSF batch system
  • 3 racks, 44 nodes each, dedicated: 264 CPUs in total
  • Dual P-IV Xeon 2.4 GHz, 1 GB mem, 100baseT
  • Dedicated cmsdc04 batch queue, 500 RUN-slots
  • Disk servers
  • DC04 dedicated stager, with 2 pools
  • 2 pools IB and GDB, 10 4 TB
  • Export Buffers
  • EB-SRM ( 4 servers, 4.2 TB total )
  • EB-SRB ( 4 servers, 4.2 TB total )
  • EB-SE ( 3 servers, 3.1 TB total )
  • Databases
  • RLS (Replica Location Service)
  • TMDB (Transfer Management DB)
  • Transfer steering
  • Agents steering data transfers

[Diagram: Tier-0 components - fake on-line process, ORCA RECO Job, RefDB, IB and GDB pools, TMDB, POOL RLS catalogue, Castor.]
30
CMS Production tools
  • CMS production tools (OCTOPUS)
  • RefDB
  • Contains production requests with all needed
    parameters to produce the dataset and the details
    about the production process
  • MCRunJob
  • Evolution of IMPALA, more modular (plug-in approach)
  • Tool/framework for job preparation and job
    submission
  • BOSS
  • Real-time job-dependent parameter tracking: the running job's standard output/error are intercepted and the filtered information is stored in the BOSS database. The remote updator is based on MySQL, but a remote updator based on R-GMA is being developed (a minimal sketch of the idea follows below).
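A minimal sketch of the BOSS idea described above (run the job, intercept its stdout, filter a few parameters into a database); the regular expression and schema are illustrative, not the real BOSS filters:

```python
# Minimal sketch of the BOSS idea: run the job, intercept its standard output,
# filter a few job-specific parameters and store them in a database as they
# appear. The regular expression and schema are illustrative, not BOSS's own.
import re
import sqlite3
import subprocess

def run_and_track(cmd, job_id, db_path="job_tracking.db"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS job_info (job_id TEXT, key TEXT, value TEXT)")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:                        # intercept the job's standard output
        m = re.match(r"^(\w+)\s*=\s*(.+)$", line)   # e.g. "events_processed = 1000"
        if m:
            db.execute("INSERT INTO job_info VALUES (?, ?, ?)", (job_id, *m.groups()))
            db.commit()                             # updated in real time, as in BOSS
    return proc.wait()
```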