Title: CERN Fabric
1. CERN Fabric
T. Adye, J. Hada, R. Mankel, R. Yoshida, K. Woller
LCG Comprehensive Review Referee Report
LHCC Meeting, CERN, 27-Sep-2006
2. Input
- CERN T0 and CAF Status (B. Panzer-Steindel)
- Castor Progress Report (O. Bärring)
- dCache (P. Fuhrmann)
- DPM Status and Plans (J.-P. Baud)
- SRM v2.2 Status (M. Litmaath)
- Networking Status (D. Foster)
Supplemented by:
- Individual discussions
- Plenary and Stream B presentations
- Discussions with referees
- Plus my personal grain of salt
Materials available on http://agenda.cern.ch/fullAgenda.php?ida=a057132
3. Overall Impression
- T0 and CAF are well on track
  - Still slightly underfunded despite recent improvements
  - Impressive empty space in the computer center
  - Building, cooling and power upgrades planned as required
- T0 well understood
  - Demonstrated capabilities in full-scale ATLAS test
- CAF requirements still not well defined
  - Experiments need to deliver well in advance
  - Keep in mind purchasing cycles of 6 months
- Storage systems have improved performance
  - Still adding features, need ongoing attention
4. Technology
- Tapes ok
  - Tape libraries installed, technologies work and are scalable
- CPU ok
  - New Intel chips provide a leap in price/performance
  - Moderate clock-frequency scaling expected
  - More cores per chip, reduced power consumption
- Disk mostly ok for now
  - Scalability issues: capacity vs. number of spindles
  - Application footprints not fully understood (CAF)
- OS will be SL4 for LHC startup
  - Migration needed in the next months
  - Will be a test case for later operations, esp. applications
- Networking ok, scalable
5. Funding
- Improved, now less underfunded
  - Preliminary 7-10% shortage
  - Due to reduced 2007 requirements
  - Due to leap in CPU price/performance
- Installed resources below MOU (except tape)
  - Buying late makes sense for most IT hardware
- Used resources still well below installed resources
  - So this is not the reason for performance problems in SCs
Concern: Both are one-off effects. The reduced requirements are a chance to catch up. 2008 requirements may be higher than before. Budget cuts by funding agencies could have adverse effects.
6. Staffing
- T0 and CAF manpower is tight
  - Could be a problem when scaling services
  - Mostly permanent staff, highly motivated
- Project manpower needs perspective
  - Without an EGEE successor, a manpower shortage at LHC startup is envisaged
  - A longer-term EU project would be beneficial
  - T1/T2 centers are still doing operations with project staff
Concerns: T0/CAF infrastructure staffing may be short for the 2008 ramp-up. The end of EGEE II and the LHC physics start now almost coincide. Need to avoid brain drain in a critical phase.
7. T0 Performance
- T0 design performance met in ATLAS-only test
  - Aggregate performance for four experiments not demonstrated yet
  - DAQ not yet included in testing
  - T1 transfer not included
  - Need to demonstrate this in unattended running
- Data buffer at CERN planned for one week
  - Problem for T0 when data do not get out in time
  - T0-T1 link needs attention
Concern: T0/T1 data transfer did not yet reach design goals. A little more of the impressive progress is needed here.
8. Scalability
- We still have an order of magnitude to go
- Scaling problems in underlying services observed
  - E.g. LSF-Oracle coupling in Castor2
  - E.g. directory services (lookups)
- Conventional mechanisms of load balancing may help
  - And might improve redundancy and stability
Minor concerns: More scalability issues will show up. Reason to grow a lot in 2007 despite reduced requirements?
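"Conventional load balancing" here means spreading requests over redundant service replicas so that one failed node does not take the service down. A minimal round-robin sketch of the idea (all class and server names are hypothetical, not part of the Castor/dCache stack):

```python
from itertools import cycle

# Hypothetical sketch of round-robin load balancing over redundant
# replicas; skipping unhealthy nodes gives the redundancy benefit
# mentioned above.
class RoundRobinBalancer:
    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._rr = cycle(self._replicas)

    def pick(self, healthy=lambda r: True):
        # Try each replica at most once per call, skipping unhealthy
        # ones, so a single dead node only costs one extra hop.
        for _ in range(len(self._replicas)):
            replica = next(self._rr)
            if healthy(replica):
                return replica
        raise RuntimeError("no healthy replica available")

lb = RoundRobinBalancer(["srv1", "srv2", "srv3"])
print([lb.pick() for _ in range(4)])  # -> ['srv1', 'srv2', 'srv3', 'srv1']
```

Commercial appliances add connection tracking and health probes on top, but the failover behaviour is the same: a trivial node failure is absorbed instead of propagating to the service level.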
9. 7x24 Operations (my view)
What we have: 1 FTE = 230 days x 8 hours = 1840 h/year
What people suggest: 24/7 expert service = 8760 h/year = 4.8 FTE
- There's no way to have experts 24x7 (x52)
- Need to design services to survive trivial failures
  - Commercially available load balancers may help
- Need to design for increased reaction times
  - By building service-level redundancy where possible
- For rare complex problems, an on-duty coordinator may help getting the required experts together fast.
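The coverage arithmetic above can be reproduced directly; the 230-day, 8-hour working year is the slide's own assumption:

```python
# Back-of-the-envelope 24x7 staffing arithmetic from the slide above.
# Assumption (from the slide): 230 working days/year, 8-hour days.
HOURS_PER_FTE_YEAR = 230 * 8   # 1840 h/year delivered by one FTE
HOURS_24X7_YEAR = 24 * 365     # 8760 h/year of continuous coverage

ftes_needed = HOURS_24X7_YEAR / HOURS_PER_FTE_YEAR
print(f"24x7 coverage needs {ftes_needed:.1f} FTEs per expert role")
# -> 24x7 coverage needs 4.8 FTEs per expert role
```

And 4.8 FTEs per expert role, for every critical service, is exactly why the slide argues for redundancy and relaxed reaction times rather than round-the-clock experts.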
10. Castor2
- Deployment at T0 successful, well integrated
- Inherent performance problems hit ATLAS and CMS, fix underway
- Tier sites had problems, high support load for CERN staff
- Project focus clearly on LHC
- Castor review in June outlined risks
  - Generally positive towards the project
  - Quote: "Many years of periods of operational distress"
- Late feature requests lead to delays
  - Durable storage requested May 2006, Xrootd interface
Concerns: Use of durable storage can block the system. Adding features should step behind stable operation. Is manpower sufficient / sufficiently experienced? External site support.
11. dCache
- Project manpower has improved
  - 1 FTE for dCache user support now
- Looks more complete than Castor2
  - It has a smaller scope, though
- No clear deadline for implementing SRM v2.2
  - But seems to be on track
- Community support: OSG fund their own requests
  - Would that be a model for LCG as well?
Concerns: Older dCache installations impaired SC4 performance. Sites need to keep up to date (scheduled downtimes).
12. DPM
- In widespread use at 50 smaller sites
- Will be late in implementing SRM v2.2
- Serious manpower troubles
  - Open position, but no candidates
  - In terms of staffing, DPM is a very small project
- Not an issue for T0 and CAF
  - Indirect issue for T1s (transfer to/from T2s with DPM)
Concerns: DPM visibility will rise with SCs reaching smaller T2s. What is CERN's commitment towards the project?
13. SRM v2.2
- WLCG optimized
- Contains options which are hard to implement
  - Like moving from no-tape to tape-backed pools
- Implemented late (May 2006), causing troubles
  - Castor2 and dCache will have it in time, DPM will not
- SRM 3 still required by the broader community
  - How will this be adopted by WLCG?
Concerns: Take care: some features, when used, endanger infrastructure. Adoption of SRM v3.0 during LHC startup might create havoc. May be safer to stick with v2.2 during ramp-up.
14. Networking
- Technologically uncritical
  - Wire rates are beyond requirements
  - Watch interference between security and bandwidth
- Scales well to current requirements
  - And will scale beyond if required, provided funding
  - Wide-area bandwidth is commercially available
- Local-area application footprint in CAF not yet clear
Minor concern: Given the long purchasing cycles, the CAF requirements for both networking and disk should be clarified by the experiments in due time.
15. Summary
- Impressive progress has been made in 2006
  - Encourage and enable people to keep up their efforts
- Both schedule (fully) and funding (almost) now on LHC timescale
  - It might be wise to over-fulfill the new 2007 requirements wrt. 2008
- Technology sufficiently advanced to handle LHC requirements
  - Many performance issues have been addressed
  - Scaling issues still remain (more will show up)
- Storage systems implement SRM v2.2 and xrootd interfaces
  - Stability still needs improvement
  - DPM needs special attention
- Focus now shifts from performance to stability
  - Which should also be true for software development!
- The major challenges now are operational
  - Need to work out the required level of 24x7 services