Title:
1 LCG Operation During the Data Challenges
Markus Schulz, IT-GD, CERN (markus.schulz_at_cern.ch)
Discussion on Operation Models
EGEE is a project funded by the European Union
under contract IST-2003-508833
2 Outline
- Building LCG-2
- Data Challenges (very brief)
- Problems (not so brief)
- Operating LCG
- how it was planned
- how it happened to be done
- how it felt
- What's next?
- I will skip many slides to leave room for
discussions
Comment / shout in REAL TIME!
3 History
- December 2003: LCG-2
- Full set of functionality for DCs, first MSS integration
- Deployed in January to 8 core sites (fewer sites, less trouble)
- DCs started in February -> testing in production
- Large sites integrate resources into LCG (MSS and farms)
- Introduced a pre-production service for the experiments
- Alternative packaging (tool based and generic installation guides)
- May 2004 -> now: monthly incremental releases
- Not all releases are distributed to external sites
- Improved services, functionality, stability and packaging step by step
- Timely response to experiences from the data challenges
4 LCG-2 Status 22 10 2004
[Map of LCG-2 sites; callouts: "new interested sites should look here" -> release page, Cyprus]
- Total
- 82 sites
- 9400 CPUs
- 6.5 PByte
5 Integrating Sites
- Sites contact GD Group or Regional Center
- Sites go to the release page
- Sites decide on manual or tool based installation (LCFGng)
- documentation for both available
- WN and UI from next release on tar-ball based release
- almost trivial install of WNs and UIs
- Sites provide security and contact information
- Sites install and use provided tests for debugging
- support from regional centers or CERN
- CERN GD certifies site and adds it to the monitoring and information system
- sites are daily re-certified and problems traced in SAVANNAH (see the connectivity sketch after this list)
- Large sites have integrated their local batch systems into LCG-2
- Adding new sites is now quite smooth
- problem is keeping a large number of sites correctly configured
- worked ~80 times, failed 3-5 times
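The daily re-certification mentioned above lends itself to simple automation. The sketch below is a minimal illustration, not the actual LCG test suite: the site names and hosts are invented, and the only checks shown are TCP probes of the standard gatekeeper and GRIS ports; real certification ran functional tests and recorded failures in SAVANNAH.

```python
#!/usr/bin/env python
"""Minimal sketch of a daily site re-certification pass (illustrative only)."""
import socket

# Hypothetical site list: (site name, CE host). Ports are the standard
# Globus gatekeeper (2119) and GRIS information port (2135).
SITES = [
    ("ExampleSite-1", "ce01.example-site.org"),
    ("ExampleSite-2", "ce.another-site.example"),
]

CHECK_PORTS = {"gatekeeper": 2119, "gris": 2135}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except (socket.error, OSError):
        return False


def certify(sites):
    """Run the basic checks and return a {site: [failed checks]} map."""
    report = {}
    for name, host in sites:
        failed = [label for label, port in CHECK_PORTS.items()
                  if not port_open(host, port)]
        report[name] = failed
    return report


if __name__ == "__main__":
    for site, failed in sorted(certify(SITES).items()):
        status = "OK" if not failed else "FAILED: " + ", ".join(failed)
        print("%-20s %s" % (site, status))
```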
6 Data Challenges
- Large scale production effort of the LHC experiments
- test and validate the computing models
- produce needed simulated data
- test the experiments' production frameworks and software
- test the provided grid middleware
- test the services provided by LCG-2
- All experiments used LCG-2 for part of their production
7 Data Challenges
- Phase I
- 7.7 million events fully simulated (Geant 4) in 95,000 jobs
- 22 TByte
- Total CPU: 972 MSI2k hours
- > 40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)
8 Data Challenges
[Plot slide; no text recovered]
9 Data Challenges
[Production-rate plot: 3-5 x 10^6/day with "LCG in action" vs 1.8 x 10^6/day with "DIRAC alone"; markers for "LCG paused" and "LCG restarted"]
10 Problems during the data challenges
- All experiments encountered similar problems on LCG-2
- LCG sites suffering from configuration and operational problems
- inadequate resources on some sites (hardware, human, ...)
- this is now the main source of failures
- Load balancing between different sites is problematic (see the sketch after this list)
- jobs can be attracted to sites that have no adequate resources
- modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS
- Identification and location of problems in LCG-2 is difficult
- distributed environment, access to many logfiles needed, ...
- status of monitoring tools
- Handling thousands of jobs is time consuming and tedious
- Support for bulk operation is not adequate
- Performance and scalability of services
- storage (access and number of files)
- job submission
- information system
- file catalogues
- Services suffered from hardware problems (no fail-over services)
DC summary
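To make the load-balancing point concrete, here is a toy illustration (not the actual resource broker logic) of how a rank built from a couple of published numbers can prefer a site whose information is stale. The attribute names are from the Glue schema; the values and the ranking formula are invented.

```python
"""Illustrative only: why a rank built from a few information-system
values can send jobs to the wrong site."""

# Snapshot of what two sites might publish (values are invented).
sites = {
    "site-A": {"GlueCEStateFreeCPUs": 200, "GlueCEStateWaitingJobs": 0},
    # site-B's information is stale: it still advertises free CPUs
    # although its batch system is actually full.
    "site-B": {"GlueCEStateFreeCPUs": 500, "GlueCEStateWaitingJobs": 0},
}


def rank(info):
    """Toy rank: prefer many free CPUs and few waiting jobs."""
    return info["GlueCEStateFreeCPUs"] - 10 * info["GlueCEStateWaitingJobs"]


best = max(sites, key=lambda name: rank(sites[name]))
print("broker would pick:", best)  # picks site-B even if it is full
```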
11 Outstanding Middleware Issues
- Collection: Outstanding Middleware Issues
- Important: 1st systematic confrontation of required functionalities with the capabilities of the existing middleware
- Some can be patched or worked around
- Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE)
- Middleware is now not perfect but quite stable
- Much has been improved during the DCs
- A lot of effort still going into improvements and fixes
- Big hole is missing space management on SEs
- especially for Tier 2 sites
12 Operational issues (selection)
- Slow response from sites
- Upgrades, response to problems, etc.
- Problems reported daily; some problems last for weeks
- Lack of staff available to fix problems
- Vacation period, other high priority tasks
- Various mis-configurations (see next slide)
- Lack of configuration management: problems that are fixed reappear
- Lack of fabric management (mostly smaller sites)
- scratch space, single nodes drain queues, incomplete upgrades, ...
- Lack of understanding
- Admins reformat disks of SE
- Provided documentation often not read (carefully)
- new activity started to develop hierarchical adaptive documentation
- simpler way to install middleware on farm nodes (even remotely in user space)
- Firewall issues
- often less than optimal coordination between grid admins and firewall maintainers
- PBS problems
- Scalability, robustness (switching to Torque helps)
13 Site (mis)configurations
- Site mis-configuration was responsible for most of the problems that occurred during the experiments' Data Challenges. Here is a non-complete list of problems:
- The variable VO_<VO>_SW_DIR points to a non-existent area on WNs
- The ESM is not allowed to write in the area dedicated to the software installation
- Only one certificate allowed to be mapped to the ESM local account
- Wrong information published in the information system (Glue Object Classes not linked)
- Queue time limits published in minutes instead of seconds and not normalized
- /etc/ld.so.conf not properly configured; shared libraries not found
- Machines not synchronized in time
- Grid-mapfiles not properly built
- Pool accounts not created but the rest of the tools configured with pool accounts
- Firewall issues
- CA files not properly installed
- NFS problems for home directories or ESM areas
- Services configured to use the wrong/no Information Index (BDII)
- Wrong user profiles
- Default user shell environment too big
- ...
- Only partly related to middleware complexity
- Integrated, all these common small problems become 1 BIG PROBLEM (the sketch below checks a few of them)
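Several of the misconfigurations above can be caught with a trivial local check before a site joins. The sketch below is illustrative only: the VO name, paths and selection of checks are assumptions, and it covers just a handful of the items in the list.

```python
#!/usr/bin/env python
"""Sketch of a worker-node sanity check for a few common misconfigurations."""
import os
import pwd

VO = "atlas"  # example VO; a real check would loop over all supported VOs


def check(description, ok):
    print("[%s] %s" % ("OK" if ok else "FAIL", description))
    return ok


def main():
    # VO_<VO>_SW_DIR must point to an existing software area.
    var = "VO_%s_SW_DIR" % VO.upper()
    sw_dir = os.environ.get(var, "")
    check("%s set and points to an existing directory" % var,
          bool(sw_dir) and os.path.isdir(sw_dir))

    # CA certificates must be installed.
    ca_dir = "/etc/grid-security/certificates"
    check("CA files present in %s" % ca_dir,
          os.path.isdir(ca_dir) and len(os.listdir(ca_dir)) > 0)

    # grid-mapfile must exist where grid users are mapped locally.
    check("grid-mapfile present",
          os.path.isfile("/etc/grid-security/grid-mapfile"))

    # Pool accounts (e.g. atlas001, atlas002, ...) must really exist.
    pool = [u.pw_name for u in pwd.getpwall()
            if u.pw_name.startswith(VO) and u.pw_name[len(VO):].isdigit()]
    check("pool accounts for VO '%s' exist" % VO, len(pool) > 0)

    # Shared library configuration should not be empty.
    check("/etc/ld.so.conf is non-empty",
          os.path.isfile("/etc/ld.so.conf")
          and os.path.getsize("/etc/ld.so.conf") > 0)


if __name__ == "__main__":
    main()
```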
14 Running Services
- Multiple instances of core services for each of the experiments
- separates problems, avoids interference between experiments
- improves availability
- allows experiments to maintain individual configuration (information system), illustrated below
- addresses scalability to some degree
- Monitoring tools for services currently not adequate
- tools under development to implement a control system
- Access to storage via load balanced interfaces
- CASTOR
- dCache
- Services that carry state are problematic to restart on new nodes
- needed after hardware problems or security problems
- State transition between partial usage and full usage of resources
- required change in queue configuration (fair share, individual queues/VO)
- next release will come with a description for fair share configuration (smaller sites)
DC summary
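As an illustration of the per-experiment configuration point, the sketch below generates a separate, minimal site list for each VO's information-system instance. The file format, hostnames and site lists are invented for the example; the real per-VO configuration differed in detail.

```python
"""Illustration of per-experiment information-system configuration:
each VO gets its own instance built from its own site list, so one
experiment's choices do not interfere with another's."""

# Which sites each experiment wants in "its" information system
# (hypothetical lists).
VO_SITES = {
    "alice": ["cern.ch", "cnaf.infn.it", "gridka.de"],
    "lhcb":  ["cern.ch", "ral.ac.uk", "in2p3.fr"],
}

GIIS_PORT = 2135  # standard site GIIS port, used when composing the URLs


def write_site_list(vo, sites):
    """Write a minimal, invented-format site list for this VO's instance."""
    path = "infosys-%s.conf" % vo
    with open(path, "w") as out:
        for site in sites:
            # One line per site GIIS this instance should aggregate.
            out.write("%s ldap://giis.%s:%d/mds-vo-name=local,o=grid\n"
                      % (site.split(".")[0].upper(), site, GIIS_PORT))
    return path


if __name__ == "__main__":
    for vo, sites in sorted(VO_SITES.items()):
        print("wrote", write_site_list(vo, sites))
```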
15 Support during the DCs
- User (Experiment) Support
- GD at CERN worked very closely with the experiments' production managers
- Informal exchange (e-mail, meetings, phone)
- "No Secrets" approach, GD people on experiments' mail lists and vice versa
- ensured fast response
- tracking of problems tedious, but both sides have been patient
- clear learning curve on BOTH sides
- LCG GGUS (grid user support) at FZK became operational after the start of the DCs
- due to the importance of the DCs the experiments switch only slowly to the new service
- Very good end user documentation by GD-EIS
- Dedicated testbed for experiments with the next LCG-2 release
- rapid feedback, influenced what made it into the next release
- Installation (Site) Support
- GD prepared releases and supported sites (certification, re-certification)
- Regional centres supported their local sites (some more, some less)
- Community style help via mailing list (high traffic!!)
- FAQ lists for trouble shooting and configuration issues (Taipei, RAL)
16 Support during the DCs
- Operations Service
- RAL (UK) is leading the sub-project on developing operations services
- Initial prototype: http://www.grid-support.ac.uk/GOC/
- Basic monitoring tools
- Mail lists for problem resolution
- Working on defining policies for operation and responsibilities (draft document)
- Working on grid wide accounting
- Monitoring
- GridICE (development of DataTag Nagios-based tools)
- GridPP job submission monitoring
- Information system monitoring and consistency check: http://goc.grid.sinica.edu.tw/gstat/ (see the sketch after this list)
- CERN GD daily re-certification of sites (including history)
- escalation procedure under development
- tracing of site specific problems via problem tracking tool
- tests core services and configuration
17 Screen Shots
18 Screen Shots
19 Problem Handling: PLAN
[Diagram with boxes: Triage VO / GRID, GGUS (Remedy), GOC, GD CERN, Monitoring/Follow-up, Escalation]
20 Problem Handling: Operation (most cases)
[Diagram with boxes: Community, VO A/B/C, Rollout Mailing List, GGUS, Triage, GOC, GD CERN, S-Site-1/2/3, Monitoring / Certification / Follow-Up / FAQs]
21 Problem Tracking
- GGUS: REMEDY
- Middleware problems: SAVANNAH LCG-OPERATION
- Re-certification: SAVANNAH LCG-SITES
- Many (MOST) problems only tracked by e-mail
- Much confusion on where to put problems
- Training needed to get reasonable 1st level user support
- canned answers
- experts need to focus on more complex tasks
- Unification of FAQs (RAL, Taipei, Italy, ...)
22 EGEE Impact on Operations
- The available effort for operations from EGEE is now ramping up
- LCG GOC (RAL) -> EGEE CICs and ROCs, Taipei
- Hierarchical support structure
- Regional Operations Centres (ROC)
- One per region (9)
- Front-line support for deployment, installation, users
- Core Infrastructure Centres (CIC)
- Four (+ Russia next year)
- Evolve from GOC: monitoring, troubleshooting, operational control
- 24x7 in an 8x5 world????
- Also providing VO-specific and general services
- EGEE NA3 organizes training for users and site admins
- NOW at HEPiX
- Address common issues, experiences
- Operations and Fabric Workshop
- CERN 1-3 Nov
23 PART II
- Operation models
- How much can be delegated to whom?
- autonomy / availability
- What are the consequences?
- cost for 24/7 with 8x5 staff
- One/multiple models for all sites/regions?
- One model for site integration, update, user support, security, operation?
- latency, efficiency, distribution of workload ...
- One size fits all?
- Next slides are meant to stimulate discussions, not give answers
24 CICs and ROCs and Operations
- Core Infrastructure Centers (CICs)
- run services like RBs, Information Indices, VO/VOMS, Catalogues
- are the distributed Grid Operation Center (GOC)
- and more ...
- Regional Operation Centers (ROCs)
- coordinate activities in their region
- give support to regional RCs
- coordinate setup/upgrades
- and more ...
- Resource Centers (RC)
- computing and storage
- Operation Management Center (OMC)
- coordination
25 Model I: Strict Hierarchy
- A CIC locates a problem with an RC or CIC in a region
- triggered by monitoring / user alert
- The CIC enters the problem into the problem tracking tool and assigns it to a ROC
- The ROC receives a notification and works on solving the problem
- The region decides locally what the ROC can do on the RCs
- This can include restarting services etc.
- The main emphasis is that the region decides on the depth of the interaction
- -> different regions, different procedures
- CICs NEVER contact a site
- -> ROCs need to be staffed all the time
- The ROC does it and is fully responsible for ALL the sites in the region
26 Model I: Strict Hierarchy
- Pro
- Best model to transfer knowledge to the ROCs
- all information flows through them
- Different regions can have their own policies
- this can reflect different administrative relations of sites in a region
- Clear responsibility
- until it is discovered to be the CIC's fault, it is always the ROC's fault
- Cons
- High latency
- even for trivial operations we have to pass through the ROCs
- ROCs have to be staffed (reachable) all the time
- Regions will develop their own tools
- parallel strands, less quality
- Excluded for handling security
27 Model II: Direct Communication, Local Control
- ROCs are active in
- the follow-up of problems that take longer to handle
- the setup of sites
- CICs are active in
- handling problems that can be solved by simple interactions
- communicated directly between CICs and RCs
- ROCs are informed of all interactions between CICs and RCs
- all problems are entered into the problem tracking tool
- restarting of services, etc. is handled by the RCs
28 Model II: Direct Communication, Local Control
- Pros
- Resources are not lost for trivial reasons
- Principle of local control is maintained
- ROCs are in the loop,
- but weak ROCs can't create too severe delays
- No complex tools for communication management needed
- mail / IRC sufficient
- Cons
- RCs need to be reachable at all times
- not realistic, and very expensive
- CICs have to be aware of the level of maturity of O(100) RCs
- ROCs have to monitor what is going on to learn the trade
- Language problems between the CICs and sysadmins
- Unclear responsibility
- "This was reported" / "Why didn't the CICs fix it themselves?"
29 Model III: Direct Communication, Direct Control
- Like Model II with some modifications
- CICs have access to the services on the RCs
- can, if the RC is not staffed, manage some of the services
- the site publishes at any time
- whether the local support is reachable or not
- what actions are permitted to the CICs
- all interactions are logged and reported to RC and ROC (sketched below)
- Some tools that allow very controlled (limited) access like this are under development (GSI-enabled remote SUDO)
- Variation with ROC-only interaction (IIIa)
30 Model III: Direct Communication, Direct Control
- Pros
- Resources are not lost for trivial reasons
- ROCs are in the loop,
- but weak ROCs can't create too severe delays
- One set of tools for remote operation
- some uniformity -> chance for better quality
- Site decides at any time on the balance between local/remote operation
- RCs can be run for a (short) time unattended
- Cons
- A set of tools for secure, limited remote operation respecting the sites' policies has to be put in place
- ROCs have to monitor what is going on to learn the trade
- Unclear responsibility
- "This was reported" / "Why didn't the CICs fix it themselves?"
31 Sample Use Cases
- User reports jobs failing on one site
- User reports jobs failing on some/all sites
- Monitoring shows a site dropping in and out of the IS
- An acute security incident
- Upgrading to a new version
- Post mortem after the security incidents
- ...
- Good preparation for the Operations Workshop
32 Summary
- LCG-2 services have been supporting the data challenges
- Many middleware problems have been found; many addressed
- Middleware itself is reasonably stable
- Biggest outstanding issues are related to providing and maintaining stable operations
- Future middleware has to take this into account
- Must be more manageable, trivial to configure and install
- Management and monitoring must be built into services from the start
- Outcome of the workshop in November is crucial for EGEE operation