Title: CERN Fabric
1. CERN Fabric
T. Adye, J. Hada, R. Mankel, R. Yoshida, K. Woller
LCG Comprehensive Review Referee Report
LHCC Meeting, CERN, 27-Sep-2006
2. Input
- CERN T0 and CAF Status (B. Panzer-Steindel)
- Castor Progress Report (O. Bärring)
- dCache (P. Fuhrmann)
- DPM Status and Plans (J.-P. Baud)
- SRM v2.2 Status (M. Litmaath)
- Networking Status (D. Foster)
Supplemented by:
- Individual discussions
- Plenary and Stream B presentations
- Discussions with referees
- Plus my personal grain of salt
Materials available on http://agenda.cern.ch/fullAgenda.php?ida=a057132
3. Overall Impression
- T0 and CAF are well on track
  - Still slightly underfunded despite recent improvements
  - Impressive empty space in the computer center
  - Building, cooling and power upgrades planned as required
- T0 well understood
  - Demonstrated capabilities in full-scale ATLAS test
- CAF requirements still not well defined
  - Experiments need to deliver well in advance
  - Keep in mind purchasing cycles of 6 months
- Storage systems have improved performance
  - Still adding features, need ongoing attention
4. Technology
- Tapes ok
  - Tape libraries installed, technologies work and are scalable
- CPU ok
  - New Intel chips provide a leap in price/performance
  - Moderate clock-frequency scaling expected
  - More cores per chip, reduced power consumption
- Disk mostly ok for now
  - Scalability issues: capacity vs. number of spindles
  - Application footprints not fully understood (CAF)
- OS will be SL4 for LHC startup
  - Migration needed in the next months
  - Will be a test case for later operations, esp. applications
- Networking ok, scalable
5. Funding
- Improved, now less underfunded
  - Preliminary 7-10% shortage
  - Due to reduced 2007 requirements
  - Due to leap in CPU price/performance
- Installed resources below MOU (except tape)
  - Buying late makes sense for most IT hardware
- Used resources still well below installed resources
  - So this is not the reason for performance problems in SCs
Concern: Both are one-off effects. The reduced requirements are a chance to catch up. 2008 requirements may be higher than before. Budget cuts by funding agencies could have adverse effects.
6. Staffing
- T0 and CAF manpower is tight
  - Could be a problem when scaling services
  - Mostly permanent staff, highly motivated
- Project manpower needs perspective
  - Without an EGEE successor, a manpower shortage at LHC startup is envisaged
  - A longer-term EU project would be beneficial
  - T1/T2 centers are still doing operations with project staff
Concerns: T0/CAF infrastructure staffing may be short for the 2008 ramp-up. The end of EGEE II and the LHC physics start now almost coincide. Need to avoid brain drain in a critical phase.
7. T0 Performance
- T0 design performance met in ATLAS-only test
  - Aggregate performance for four experiments not demonstrated yet
  - DAQ not yet included in testing
  - T1 transfer not included
  - Need to demonstrate this in unattended running
- Data buffer at CERN planned for one week
  - Problem for T0 when data do not get out in time
  - T0-T1 link needs attention
Concern: T0/T1 data transfer did not yet reach design goals. A little more of the impressive progress is needed here.
8. Scalability
- We still have an order of magnitude to go
- Scaling problems in underlying services observed
  - E.g. LSF-Oracle coupling in Castor2
  - E.g. directory services (lookups)
- Conventional mechanisms of load balancing may help
  - And might improve redundancy and stability
Minor concerns: More scalability issues will show up. Reason to grow a lot in 2007 despite reduced requirements?
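"Conventional load balancing" here means spreading requests over redundant service replicas so that one failed node does not take the service down. A minimal round-robin sketch of the idea (all class and server names are hypothetical, not part of the Castor/dCache stack):

```python
from itertools import cycle

# Hypothetical sketch of round-robin load balancing over redundant
# replicas; skipping unhealthy nodes gives the redundancy benefit
# mentioned above.
class RoundRobinBalancer:
    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._rr = cycle(self._replicas)

    def pick(self, healthy=lambda r: True):
        # Try each replica at most once per call, skipping unhealthy
        # ones, so a single dead node only costs one extra hop.
        for _ in range(len(self._replicas)):
            replica = next(self._rr)
            if healthy(replica):
                return replica
        raise RuntimeError("no healthy replica available")

lb = RoundRobinBalancer(["srv1", "srv2", "srv3"])
print([lb.pick() for _ in range(4)])  # -> ['srv1', 'srv2', 'srv3', 'srv1']
```

Commercial appliances add connection tracking and health probes on top, but the failover behaviour is the same: a trivial node failure is absorbed instead of propagating to the service level.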
9. 7x24 Operations (my view)
What we have: 1 FTE = 230 days x 8 hours = 1840 h/year
What people suggest: 24/7 expert service = 8760 h/year = 4.8 FTE
- There's no way to have experts 24x7 (x52)
- Need to design services to survive trivial failures
  - Commercially available load balancers may help
- Need to design for increased reaction times
  - By building service-level redundancy where possible
- For rare complex problems, an on-duty coordinator may help getting the required experts together fast.
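The coverage arithmetic above can be reproduced directly; the 230-day, 8-hour working year is the slide's own assumption:

```python
# Back-of-the-envelope 24x7 staffing arithmetic from the slide above.
# Assumption (from the slide): 230 working days/year, 8-hour days.
HOURS_PER_FTE_YEAR = 230 * 8   # 1840 h/year delivered by one FTE
HOURS_24X7_YEAR = 24 * 365     # 8760 h/year of continuous coverage

ftes_needed = HOURS_24X7_YEAR / HOURS_PER_FTE_YEAR
print(f"24x7 coverage needs {ftes_needed:.1f} FTEs per expert role")
# -> 24x7 coverage needs 4.8 FTEs per expert role
```

And 4.8 FTEs per expert role, for every critical service, is exactly why the slide argues for redundancy and relaxed reaction times rather than round-the-clock experts.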
10. Castor2
- Deployment at T0 successful, well integrated
- Inherent performance problems hit ATLAS and CMS, fix underway
- Tier sites had problems, high support load for CERN staff
- Project focus clearly on LHC
- Castor review in June outlined risks
  - Generally positive towards the project
  - Quote: "Many years of periods of operational distress"
- Late feature requests lead to delays
  - Durable storage requested May 2006, Xrootd interface
Concerns: Use of durable storage can block the system. Adding features should step behind stable operation. Is manpower sufficient / sufficiently experienced? External site support.
11. dCache
- Project manpower has improved
  - 1 FTE for dCache user support now
- Looks more complete than Castor2
  - It has a smaller scope, though
- No clear deadline for implementing SRM v2.2
  - But seems to be on track
- Community support: OSG fund their own requests
  - Would that be a model for LCG as well?
Concerns: Older dCache installations impaired SC4 performance. Sites need to keep up to date (scheduled downtimes).
12. DPM
- In widespread use at 50 smaller sites
- Will be late in implementing SRM v2.2
- Serious manpower troubles
  - Open position, but no candidates
  - In terms of staffing, DPM is a very small project
- Not an issue for T0 and CAF
  - Indirect issue for T1s (transfer to/from T2s with DPM)
Concerns: DPM visibility will rise with SCs reaching smaller T2s. What is CERN's commitment towards the project?
13. SRM v2.2
- WLCG optimized
- Contains options which are hard to implement
  - Like moving from no-tape to tape-backed pools
- Implemented late (May 2006), causing troubles
  - Castor2 and dCache will have it in time, DPM will not
- SRM 3 still required by the broader community
  - How will this be adopted by WLCG?
Concerns: Take care: some features, when used, endanger infrastructure. Adoption of SRM v3.0 during LHC startup might create havoc. May be safer to stick with v2.2 during ramp-up.
14. Networking
- Technologically uncritical
  - Wire rates are beyond requirements
  - Watch interference between security and bandwidth
- Scales well to current requirements
  - And will scale beyond if required, provided funding
  - Wide-area bandwidth is commercially available
- Local-area application footprint in CAF not yet clear
Minor concern: Given the long purchasing cycles, the CAF requirements for both networking and disk should be clarified by the experiments in due time.
15. Summary
- Impressive progress has been made in 2006
  - Encourage and enable people to keep up their efforts
- Both schedule (fully) and funding (almost) now on LHC timescale
  - It might be wise to over-fulfill the new 2007 requirements wrt. 2008
- Technology sufficiently advanced to handle LHC requirements
  - Many performance issues have been addressed
  - Scaling issues still remain (more will show up)
- Storage systems implement SRM v2.2 and xrootd interfaces
  - Stability still needs improvement
  - DPM needs special attention
- Focus now shifts from performance to stability
  - Which should also be true for software development!
- The major challenges now are operational
  - Need to work out the required level of 24x7 services