1
CERN Fabric
T. Adye, J. Hada, R. Mankel, R. Yoshida, K. Woller
LCG Comprehensive Review Referee Report
LHCC Meeting, CERN, 27-Sep-2006
2
Input

  • CERN T0 and CAF Status (B. Panzer-Steindel)
  • Castor Progress Report (O. Bärring)
  • dCache (P. Fuhrmann)
  • DPM Status and Plans (J.-P. Baud)
  • SRM v2.2 Status (M. Litmaath)
  • Networking Status (D. Foster)
  • Supplemented by individual discussions
  • Plenary and Stream B presentations
  • Discussions with referees
  • Plus my personal grain of salt
  • Materials available on http://agenda.cern.ch/fullAgenda.php?ida=a057132
3
Overall Impression
  • T0 and CAF are well on track
  • Still slightly underfunded despite recent
    improvements
  • Impressive empty space in computer center
  • Building, cooling and power upgrades planned as
    required
  • T0 well understood
  • Demonstrated capabilities in full scale ATLAS
    test
  • CAF requirements still not well defined
  • Experiments need to deliver well in advance
  • Keep in mind purchasing cycles of 6 months
  • Storage systems have improved performance
  • Still adding features, need ongoing attention

4
Technology
  • Tapes ok
  • Tape libraries installed, technologies work and
    are scalable
  • CPU ok
  • New Intel chips provide a leap in
    price/performance
  • Moderate clock frequency scaling expected
  • More cores per chip, reduced power consumption
  • Disk mostly ok for now
  • Scalability issues, capacity vs. number of
    spindles (see the sketch at the end of this list)
  • Application footprints not fully understood (CAF)
  • OS will be SL4 for LHC startup
  • Migration needed in the coming months
  • Will be a test case for later operations, esp.
    applications
  • Networking ok, scalable
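
To make the capacity vs. spindles point concrete, a minimal back-of-the-envelope sketch; the pool size, per-disk capacities and per-spindle IOPS below are purely hypothetical numbers, not figures from the presentations:

# For a fixed total capacity, larger disks mean fewer spindles and therefore
# less aggregate random-I/O capability, which is what can hurt the CAF.
def aggregate_iops(total_tb, disk_size_tb, iops_per_spindle=100):
    spindles = total_tb / disk_size_tb
    return spindles, spindles * iops_per_spindle

total_tb = 1000  # hypothetical disk pool size in TB
for disk_size_tb in (0.25, 0.5, 0.75):  # assumed per-disk capacities in TB
    spindles, iops = aggregate_iops(total_tb, disk_size_tb)
    print(f"{disk_size_tb} TB disks -> {spindles:.0f} spindles, "
          f"~{iops:.0f} aggregate IOPS")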

5
Funding
  • Improved, now less underfunded
  • Preliminary 7-10% shortage
  • Due to reduced 2007 requirements
  • Due to leap in CPU price/performance
  • Installed resources below MOU (except tape)
  • Buying late makes sense for most IT hardware
  • Used resources still well below installed
    resources
  • ⇒ not the reason for performance problems in SCs

Concern: Both are one-off effects. The reduced
requirements are a chance to catch up. 2008
requirements may be higher than before. Budget
cuts by funding agencies could have adverse
effects.
6
Staffing
  • T0 and CAF manpower is tight
  • Could be a problem when scaling services
  • Mostly permanent staff, highly motivated
  • Project manpower needs a perspective
  • Without EGEE successor, manpower shortage at LHC
    startup envisaged
  • Longer term EU project would be beneficial
  • T1/T2 centers are still doing operations with
    project staff

Concerns: T0/CAF infrastructure staffing may be
short for the 2008 ramp-up. The end of EGEE II and
the LHC physics start now almost coincide. Need to
avoid a brain drain in this critical phase.
7
T0 Performance
  • T0 design performance met in ATLAS-only test
  • Aggregate performance for all four experiments
    not demonstrated yet
  • DAQ not yet included in testing
  • T1 transfer not included
  • Need to demonstrate this in unattended running
  • Data buffer at CERN planned for one week
  • Problem for T0 when data do not get out in time
  • T0-T1 link needs attention

Concern: T0-T1 data transfer has not yet reached
its design goals. A little more of the impressive
progress is needed here.
8
Scalability
  • We still have an order of magnitude to go
  • Scaling problems in underlying services observed
  • E.g. LSF-Oracle coupling in Castor2
  • E.g. Directory services (Lookups)
  • Conventional mechanisms of load balancing may
    help (a minimal sketch follows this list)
  • And might improve redundancy and stability
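
As an illustration of the conventional load-balancing mechanisms meant above, a minimal client-side round-robin sketch with failover; the replica hostnames and port are hypothetical placeholders, not actual LCG services:

# Minimal sketch: rotate requests over service replicas and skip dead ones,
# so that a trivial failure of a single node is survived by the clients.
import itertools
import socket

REPLICAS = ["srv1.example.cern.ch", "srv2.example.cern.ch",
            "srv3.example.cern.ch"]
_rotation = itertools.cycle(REPLICAS)

def connect(port=8080, attempts=len(REPLICAS)):
    """Try the replicas in round-robin order until one answers."""
    for _ in range(attempts):
        host = next(_rotation)
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            continue  # this replica is down; move on to the next one
    raise RuntimeError("all replicas unreachable")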

Minor concerns: More scalability issues will show
up. A reason to grow a lot in 2007 despite the
reduced requirements?
9
7x24 Operations (my view)
What we have: 1 FTE = 230 days x 8 hours = 1840 h/year
What people suggest: 24/7 expert service = 8760 h/year ≈ 4.8 FTE
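
The same arithmetic, worked out explicitly (a minimal sketch using only the figures from this slide):

# Worked version of the staffing arithmetic on this slide.
hours_per_fte = 230 * 8        # 230 working days x 8 hours = 1840 h/year
hours_around_clock = 365 * 24  # continuous 24/7 coverage = 8760 h/year
ftes_per_post = hours_around_clock / hours_per_fte
print(f"{hours_per_fte} h per FTE, {hours_around_clock} h to cover "
      f"-> {ftes_per_post:.1f} FTEs for one 24/7 expert post")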
  • There's no way to have experts 24x7 (x52)
  • Need to design services to survive trivial
    failures
  • Commercially available load balancers may help
  • Need to design for increased reaction times
  • By building service level redundancy where
    possible
  • For rare complex problems, an on-duty
    coordinator may help get the required experts
    together fast.

10
Castor2
  • Deployment at T0 successful, well integrated
  • Inherent performance problems hit ATLAS and CMS,
    fix underway
  • Tier sites had problems, high support load for
    CERN staff
  • Project focus clearly on LHC
  • Castor review in June outlined risks
  • Generally positive towards the project
  • Quote: "Many years of periods of operational
    distress"
  • Late feature requests lead to delays
  • Durable storage requested May 2006, Xrootd
    interface

Concerns: Use of Durable Storage can block the
system. Adding features should take second place
to stable operation. Is manpower sufficient /
sufficiently experienced? External site support.
11
dCache
  • Project manpower has improved
  • 1 FTE for dCache user support now
  • Looks more complete than Castor2
  • It has a smaller scope, though
  • No clear deadline for implementing SRM v2.2
  • But seems to be on track
  • Community support: OSG funds its own requests
  • Would that be a model for LCG as well?

Concerns: Older dCache installations impaired SC4
performance. Sites need to keep up to date
(scheduled downtimes).
12
DPM
  • In widespread use at 50 smaller sites
  • Will be late in implementing SRM v 2.2
  • Serious manpower troubles
  • Open position, but no candidates
  • In terms of staffing, DPM is a very small project
  • Not an issue for T0 and CAF
  • Indirect issue for T1s (transfer to/from T2s with
    DPM)

Concerns: DPM visibility will rise with SCs
reaching smaller T2s. What is CERN's commitment
towards the project?
13
SRM v 2.2
  • WLCG optimized
  • Contains options which are hard to implement
  • Like moving from no-tape to tape-backed pools
    (see the sketch after this list)
  • Implemented late (May 2006), causing troubles
  • Castor2 and dCache will have it in time, DPM not
  • SRM 3 still required by broader community
  • How will this be adopted by WLCG?
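
For reference, the distinction behind "no-tape" vs. "tape-backed" pools, sketched along the WLCG storage classes as I understand the SRM v2.2 convention (retention policy plus access latency); the exact naming here is my assumption, not a quote from the talks:

# Sketch of the WLCG storage classes (assumed naming, see the caveat above).
STORAGE_CLASSES = {
    "Tape0Disk1": ("REPLICA",   "ONLINE"),    # disk-only, no tape copy
    "Tape1Disk0": ("CUSTODIAL", "NEARLINE"),  # tape-backed, disk acts as cache
    "Tape1Disk1": ("CUSTODIAL", "ONLINE"),    # tape copy plus permanent disk copy
}

def is_tape_backed(storage_class):
    retention, _latency = STORAGE_CLASSES[storage_class]
    return retention == "CUSTODIAL"

# Moving a file from Tape0Disk1 to Tape1Disk0 is the "no-tape to tape-backed"
# transition that is hard for the implementations to support.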

Concerns: Take care: some features, when used,
endanger the infrastructure. Adoption of SRM v3.0
during LHC startup might create havoc. It may be
safer to stick with v2.2 during the ramp-up.
14
Networking
  • Technologically uncritical
  • Wire rates are beyond requirements
  • Watch interference between security and bandwidth
  • Scales well to current requirements
  • And will scale beyond if required, provided
    funding is available
  • Wide area bandwidth is commercially available
  • Local area application footprint in CAF not yet
    clear

Minor concern: Given the long purchasing cycles,
the CAF requirements for both networking and
disk should be clarified by the experiments in
due time.
15
Summary
  • Impressive progress has been made in 2006
  • Encourage and enable people to keep up their
    efforts
  • Both schedule (fully) and funding (almost) now on
    LHC timescale
  • It might be wise to over-fulfill the new 2007
    requirements wrt. 2008
  • Technology sufficiently advanced to handle LHC
    requirements
  • Many performance issues have been addressed
  • Scaling issues still remain (more will show up)
  • Storage systems implement SRM v2.2 and xrootd
    interfaces
  • Stability still needs improvement
  • DPM needs special attention
  • Focus now shifts from performance to stability
  • Which should also be true for software
    development!
  • The major challenges now are operational
  • Need to work out the required level of 24x7
    services