Title: LCG Service Challenge Report
1. LCG Service Challenge Report
- LHCC closed session, 17/11/2005
- Sijbrand de Jong, also for Dominique Boutigny, Rainer Mankel, Davide Salomoni, Junji Haba, Rik Yoshida
- SC time path and goals
- SC3 results and plans
- SC4 plans
- Conclusion
2. Service Challenges (SC): time path and goals
- SC1 (Nov 04 - Jan 05): data transfer between CERN and three Tier-1s (FNAL, NIKHEF, FZK)
- SC2 (Apr 05): data distribution from CERN to 7 Tier-1s, 600 MB/s sustained for 10 days (one third of the final nominal rate)
- LHC timeline: 2006 cosmics, 2007 first beams and first physics, 2008 full physics run
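As a rough cross-check of the SC2 figures above (a sketch with my own arithmetic; decimal units assumed, since the slide does not state a convention), 600 MB/s sustained over 10 days corresponds to roughly 500 TB moved, and "one third of the final nominal rate" implies a nominal rate of about 1.8 GB/s out of CERN:

    # Back-of-the-envelope check of the SC2 numbers quoted above.
    sustained_rate_mb_s = 600        # MB/s sustained, from the slide
    duration_days = 10               # length of the sustained run
    nominal_fraction = 1 / 3         # "one third of final nominal rate"

    seconds = duration_days * 24 * 3600
    total_tb = sustained_rate_mb_s * seconds / 1e6       # MB -> TB (decimal)
    nominal_rate_mb_s = sustained_rate_mb_s / nominal_fraction

    print(f"Total volume over {duration_days} days: ~{total_tb:.0f} TB")
    print(f"Implied nominal rate: ~{nominal_rate_mb_s:.0f} MB/s")
    # prints ~518 TB and ~1800 MB/s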
3. SC3 results: data throughput
The throughput target was missed by a factor of 2; the test needs to be repeated.
4. SC3 results: data throughput
- Sources of problems
- Service instability
- CASTOR
- dCache (lots of help from DESY)
- Storage Resource Management (SRM) in general
- gLite File Transfer Service (FTS)
- Much debugging and fixing of various components
5. SC3 results: data throughput
Rerun scheduled for January 2006. [Throughput targets table]
6. SC3 results: LCG team view
Jamie Shiers
- Considered much more than just a throughput test
- SRM required on all sites
- LCG File Catalogue (LFC) deployed as required: global catalogue for LHCb, site-local for ALICE and ATLAS; CMS will be essentially as LHCb
- Many of the baseline services (BSWG) deployed
- (Nearly) all T1s took part, better than foreseen!
- >20 T2s included. Good!
- All experiments participated, with clear goals and metrics established
- Underlined the complexity of the enterprise
- Excellent collaboration between LCG, sites and experiments
- Many problems resolved; continue to improve
- Need a re-run of the throughput test to confirm everything is fixed
7. SC3 results: what the experiments tested
Nick Brook
Database service needed, but not yet tested
8. SC3 results: where the experiments tested
9. SC3 results: experiments' findings (1)
- Gained lots of experience; did much debugging
- Many file transfers did not work on the first attempt
Metrics:
- Quality: successful transfers vs. those started
- Hours: number of hours with successful transfers
- Rate: volume / hours
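To make these metric definitions concrete, here is a minimal sketch of how Quality, Hours and Rate could be computed from a transfer log; the record layout and numbers are invented for illustration and are not any experiment's actual monitoring format:

    # Hypothetical transfer log: (hour, bytes transferred, succeeded?)
    transfers = [
        (0, 2e9, True), (0, 2e9, False),
        (1, 2e9, True), (2, 0, False),
        (3, 2e9, True), (3, 2e9, True),
    ]

    started = len(transfers)
    quality = sum(1 for _, _, ok in transfers if ok) / started       # successful vs. started
    hours = len({hour for hour, _, ok in transfers if ok})           # hours with >= 1 successful transfer
    volume_gb = sum(size for _, size, ok in transfers if ok) / 1e9
    rate_gb_per_hour = volume_gb / hours                             # Rate = Volume / Hours

    print(f"Quality: {quality:.0%}, Hours: {hours}, Rate: {rate_gb_per_hour:.1f} GB/h")
    # Quality: 67%, Hours: 3, Rate: 2.7 GB/h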
10. SC3 results: experiments' findings (2)
- CASTOR-2 has not always been a delight (stronger wording was used)
- Still need to establish many of the original goals
- Stability is key
- When the service was stable, LHCb's SC3 needs were surpassed
11. SC3 results: experiments' findings (3)
Much data has been transferred:
- ALICE: 200,000 files, 20 TB
- CMS: 145 TB, sustained rate 20-90 MB/s
- LHCb: 75,000 files, 10 TB, sustained rate 10-55 MB/s
- ATLAS: 20 TB, sustained rate 20-30 MB/s
Many CPU resources have been used.
Reviewers' remarks:
- The experiments' results are not cast in the same metrics
- Much quantitative information was lacking in the presentations
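One way to address the reviewers' point about common metrics is to back out the implied "Hours" from each experiment's quoted volume and sustained rate; the sketch below uses only the figures listed above, assumes decimal units, and ignores parallel streams to multiple sites, so the results are purely indicative:

    # Volume (TB) and sustained-rate range (MB/s) as quoted on the slide.
    # ALICE is omitted because no sustained rate was quoted for it.
    reported = {
        "CMS":   (145, (20, 90)),
        "LHCb":  (10,  (10, 55)),
        "ATLAS": (20,  (20, 30)),
    }

    for expt, (volume_tb, (lo, hi)) in reported.items():
        # Hours = Volume / Rate, with 1 TB = 1e6 MB and 1 h = 3600 s
        hours_at_hi = volume_tb * 1e6 / (hi * 3600)
        hours_at_lo = volume_tb * 1e6 / (lo * 3600)
        print(f"{expt}: {volume_tb} TB -> roughly "
              f"{hours_at_hi:.0f}-{hours_at_lo:.0f} effective transfer hours")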
12. SC3 results: Tier-1 and Tier-2 experience (1)
M. Mazzucato (CNAF), I. Fisk (FNAL), G. Stewart (Glasgow)
- Remember that HEP is not the only user; other communities (biology, astrophysics, ...) have different requirements
- Operating system issues (Red Hat vs. Scientific Linux)
- CNAF: WAN -> disk 175 MB/s, WAN -> tape 50 MB/s
- 1200-1550 kSI2k of CPU power
- Active T2s in the test: Torino, Milano, Pisa, Legnaro, Bari, Catania
13. SC3 results: Tier-1 and Tier-2 experience (2)
FNAL:
- No major hiccups; much functionality had been tested earlier
- Had their share of small problems
- Helped T2 sites
- Supports both OSG-0.2 and LCG-2 concurrently, while running CDF and DØ
Glasgow (T2):
- Much British CPU will be at T2s (many sites, too)
- Software installation: Python versus bash
- SFT/GGUS messages should be clearer
- Quality of release availability
- Operator availability: 95% uptime is ambitious (see the arithmetic after this list)
- Real support from T1 people
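For scale on the 95% uptime remark above, a quick calculation (my own arithmetic, not from the talk) of how much downtime such a target actually allows:

    uptime_target = 0.95                      # operator-availability target discussed above
    hours_per_month = 30 * 24
    hours_per_year = 365.25 * 24

    downtime_month_h = (1 - uptime_target) * hours_per_month
    downtime_year_h = (1 - uptime_target) * hours_per_year

    print(f"95% uptime allows ~{downtime_month_h:.0f} h of downtime per 30-day month")
    print(f"and ~{downtime_year_h:.0f} h (~{downtime_year_h / 24:.0f} days) per year")
    # ~36 h/month, ~438 h (~18 days) per year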
14. SC4 plans
- Full throughput test from the start: disk -> disk / disk -> tape
- Full baseline services, including database service and VOMS
- T0 recording to tape and T1 reprocessing
- Site Functionality Test (SFT) as performance metric (one possible roll-up is sketched after this list)
- Many SFT sub-tests still have to be defined/implemented
- T1 sites all seem to be ready to start
- Many T2 sites interested; no problem to reach 20-40 sites
- Deploy COOL, 3D and AA services (PROOF, xrootd)
- Next-generation tape drives
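Since many SFT sub-tests are still to be defined, the following is only a hypothetical sketch of how hourly sub-test results for a site could be rolled up into an availability-style performance metric; the test names and the "all critical tests must pass" rule are assumptions, not the agreed SFT definition:

    # Hypothetical hourly SFT results for one site (test name -> pass/fail).
    hourly_results = [
        {"job-submit": True,  "replica-copy": True,  "catalogue": True},
        {"job-submit": True,  "replica-copy": False, "catalogue": True},
        {"job-submit": False, "replica-copy": True,  "catalogue": True},
        {"job-submit": True,  "replica-copy": True,  "catalogue": True},
    ]

    critical_tests = {"job-submit", "replica-copy"}   # assumed set of critical sub-tests

    # An hour counts as "available" only if every critical sub-test passed.
    available_hours = sum(
        all(results[test] for test in critical_tests) for results in hourly_results
    )
    availability = available_hours / len(hourly_results)

    print(f"Site availability over {len(hourly_results)} h: {availability:.0%}")
    # 50% for this toy input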
15. SC4: preparation for data taking?
YES, I think so, but we still have to pass SC3, and we need to decide how to measure success.
16. Conclusion
- Service Challenges are a sensible approach to prepare for the data-taking phase
- SC3 spawned a lot of effort, also outside LCG
- Many T1 and T2 sites participate enthusiastically
- Major improvements in many baseline services
- Still to demonstrate the required SC3 throughput
- Need database services for SC4
- Quantitative information is needed for tractability