Title:
1 LCG Operation During the Data Challenges
Markus Schulz, IT-GD, CERN (markus.schulz_at_cern.ch)
Discussion on Operation Models
EGEE is a project funded by the European Union
under contract IST-2003-508833
2 Outline
- Building LCG-2
- Data Challenges (very brief)
- Problems (not so brief)
- Operating LCG
- how it was planned
- how it happened to be done
- how it felt
- What's next?
- I will skip many slides to leave room for
discussions
Comment / shout in REAL TIME!
3 History
- December 2003: LCG-2
- Full set of functionality for DCs, first MSS integration
- Deployed in January to 8 core sites (fewer sites, less trouble)
- DCs started in February -> testing in production
- Large sites integrate resources into LCG (MSS and farms)
- Introduced a pre-production service for the experiments
- Alternative packaging (tool based and generic installation guides)
- May 2004 -> now: monthly incremental releases
- Not all releases are distributed to external sites
- Improved services, functionality, stability and packaging step by step
- Timely response to experiences from the data challenges
4 LCG-2 Status 22 10 2004
[Map of LCG-2 sites; callouts: "new interested sites should look here" -> release page, Cyprus]
- Total
- 82 sites
- 9400 CPUs
- 6.5 PByte
5 Integrating Sites
- Sites contact GD Group or Regional Center
- Sites go to the release page
- Sites decide on manual or tool based installation (LCFGng)
- documentation for both available
- WN and UI from next release on tar-ball based release
- almost trivial install of WNs and UIs
- Sites provide security and contact information
- Sites install and use provided tests for debugging
- support from regional centers or CERN
- CERN GD certifies site and adds it to the monitoring and information system
- sites are daily re-certified and problems traced in SAVANNAH (see the connectivity sketch after this list)
- Large sites have integrated their local batch systems into LCG-2
- Adding new sites is now quite smooth
- problem is keeping a large number of sites correctly configured
- worked ~80 times, failed 3-5 times
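The daily re-certification mentioned above lends itself to simple automation. The sketch below is a minimal illustration, not the actual LCG test suite: the site names and hosts are invented, and the only checks shown are TCP probes of the standard gatekeeper and GRIS ports; real certification ran functional tests and recorded failures in SAVANNAH.

```python
#!/usr/bin/env python
"""Minimal sketch of a daily site re-certification pass (illustrative only)."""
import socket

# Hypothetical site list: (site name, CE host). Ports are the standard
# Globus gatekeeper (2119) and GRIS information port (2135).
SITES = [
    ("ExampleSite-1", "ce01.example-site.org"),
    ("ExampleSite-2", "ce.another-site.example"),
]

CHECK_PORTS = {"gatekeeper": 2119, "gris": 2135}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except (socket.error, OSError):
        return False


def certify(sites):
    """Run the basic checks and return a {site: [failed checks]} map."""
    report = {}
    for name, host in sites:
        failed = [label for label, port in CHECK_PORTS.items()
                  if not port_open(host, port)]
        report[name] = failed
    return report


if __name__ == "__main__":
    for site, failed in sorted(certify(SITES).items()):
        status = "OK" if not failed else "FAILED: " + ", ".join(failed)
        print("%-20s %s" % (site, status))
```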
6 Data Challenges
- Large scale production effort of the LHC experiments
- test and validate the computing models
- produce needed simulated data
- test the experiments' production frameworks and software
- test the provided grid middleware
- test the services provided by LCG-2
- All experiments used LCG-2 for part of their production
7 Data Challenges
- Phase I
- 7.7 million events fully simulated (Geant 4) in 95,000 jobs
- 22 TByte
- Total CPU: 972 MSI2k hours
- > 40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)
8 Data Challenges
[Plot slide; no text recovered]
9 Data Challenges
[Production-rate plot: 3-5 x 10^6/day with "LCG in action" vs 1.8 x 10^6/day with "DIRAC alone"; markers for "LCG paused" and "LCG restarted"]
10 Problems during the data challenges
- All experiments encountered similar problems on LCG-2
- LCG sites suffering from configuration and operational problems
- inadequate resources on some sites (hardware, human, ...)
- this is now the main source of failures
- Load balancing between different sites is problematic (see the sketch after this list)
- jobs can be attracted to sites that have no adequate resources
- modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS
- Identification and location of problems in LCG-2 is difficult
- distributed environment, access to many logfiles needed, ...
- status of monitoring tools
- Handling thousands of jobs is time consuming and tedious
- Support for bulk operation is not adequate
- Performance and scalability of services
- storage (access and number of files)
- job submission
- information system
- file catalogues
- Services suffered from hardware problems (no fail-over services)
DC summary
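To make the load-balancing point concrete, here is a toy illustration (not the actual resource broker logic) of how a rank built from a couple of published numbers can prefer a site whose information is stale. The attribute names are from the Glue schema; the values and the ranking formula are invented.

```python
"""Illustrative only: why a rank built from a few information-system
values can send jobs to the wrong site."""

# Snapshot of what two sites might publish (values are invented).
sites = {
    "site-A": {"GlueCEStateFreeCPUs": 200, "GlueCEStateWaitingJobs": 0},
    # site-B's information is stale: it still advertises free CPUs
    # although its batch system is actually full.
    "site-B": {"GlueCEStateFreeCPUs": 500, "GlueCEStateWaitingJobs": 0},
}


def rank(info):
    """Toy rank: prefer many free CPUs and few waiting jobs."""
    return info["GlueCEStateFreeCPUs"] - 10 * info["GlueCEStateWaitingJobs"]


best = max(sites, key=lambda name: rank(sites[name]))
print("broker would pick:", best)  # picks site-B even if it is full
```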
11 Outstanding Middleware Issues
- Collection: Outstanding Middleware Issues
- Important: 1st systematic confrontation of required functionalities with the capabilities of the existing middleware
- Some can be patched or worked around
- Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE)
- Middleware is now not perfect but quite stable
- Much has been improved during the DCs
- A lot of effort still going into improvements and fixes
- Big hole is missing space management on SEs
- especially for Tier 2 sites
12 Operational issues (selection)
- Slow response from sites
- Upgrades, response to problems, etc.
- Problems reported daily; some problems last for weeks
- Lack of staff available to fix problems
- Vacation period, other high priority tasks
- Various mis-configurations (see next slide)
- Lack of configuration management: problems that are fixed reappear
- Lack of fabric management (mostly smaller sites)
- scratch space, single nodes drain queues, incomplete upgrades, ...
- Lack of understanding
- Admins reformat disks of SE
- Provided documentation often not read (carefully)
- new activity started to develop hierarchical adaptive documentation
- simpler way to install middleware on farm nodes (even remotely in user space)
- Firewall issues
- often less than optimal coordination between grid admins and firewall maintainers
- PBS problems
- Scalability, robustness (switching to Torque helps)
13 Site (mis)configurations
- Site mis-configuration was responsible for most of the problems that occurred during the experiments' Data Challenges. Here is a non-complete list of problems:
- The variable VO_<VO>_SW_DIR points to a non-existent area on WNs
- The ESM is not allowed to write in the area dedicated to the software installation
- Only one certificate allowed to be mapped to the ESM local account
- Wrong information published in the information system (Glue Object Classes not linked)
- Queue time limits published in minutes instead of seconds and not normalized
- /etc/ld.so.conf not properly configured; shared libraries not found
- Machines not synchronized in time
- Grid-mapfiles not properly built
- Pool accounts not created but the rest of the tools configured with pool accounts
- Firewall issues
- CA files not properly installed
- NFS problems for home directories or ESM areas
- Services configured to use the wrong/no Information Index (BDII)
- Wrong user profiles
- Default user shell environment too big
- ...
- Only partly related to middleware complexity
- Integrated, all these common small problems become 1 BIG PROBLEM (the sketch below checks a few of them)
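Several of the misconfigurations above can be caught with a trivial local check before a site joins. The sketch below is illustrative only: the VO name, paths and selection of checks are assumptions, and it covers just a handful of the items in the list.

```python
#!/usr/bin/env python
"""Sketch of a worker-node sanity check for a few common misconfigurations."""
import os
import pwd

VO = "atlas"  # example VO; a real check would loop over all supported VOs


def check(description, ok):
    print("[%s] %s" % ("OK" if ok else "FAIL", description))
    return ok


def main():
    # VO_<VO>_SW_DIR must point to an existing software area.
    var = "VO_%s_SW_DIR" % VO.upper()
    sw_dir = os.environ.get(var, "")
    check("%s set and points to an existing directory" % var,
          bool(sw_dir) and os.path.isdir(sw_dir))

    # CA certificates must be installed.
    ca_dir = "/etc/grid-security/certificates"
    check("CA files present in %s" % ca_dir,
          os.path.isdir(ca_dir) and len(os.listdir(ca_dir)) > 0)

    # grid-mapfile must exist where grid users are mapped locally.
    check("grid-mapfile present",
          os.path.isfile("/etc/grid-security/grid-mapfile"))

    # Pool accounts (e.g. atlas001, atlas002, ...) must really exist.
    pool = [u.pw_name for u in pwd.getpwall()
            if u.pw_name.startswith(VO) and u.pw_name[len(VO):].isdigit()]
    check("pool accounts for VO '%s' exist" % VO, len(pool) > 0)

    # Shared library configuration should not be empty.
    check("/etc/ld.so.conf is non-empty",
          os.path.isfile("/etc/ld.so.conf")
          and os.path.getsize("/etc/ld.so.conf") > 0)


if __name__ == "__main__":
    main()
```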
14 Running Services
- Multiple instances of core services for each of the experiments
- separates problems, avoids interference between experiments
- improves availability
- allows experiments to maintain individual configuration (information system), illustrated below
- addresses scalability to some degree
- Monitoring tools for services currently not adequate
- tools under development to implement a control system
- Access to storage via load balanced interfaces
- CASTOR
- dCache
- Services that carry state are problematic to restart on new nodes
- needed after hardware problems or security problems
- State transition between partial usage and full usage of resources
- required change in queue configuration (fair share, individual queues/VO)
- next release will come with a description for fair share configuration (smaller sites)
DC summary
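As an illustration of the per-experiment configuration point, the sketch below generates a separate, minimal site list for each VO's information-system instance. The file format, hostnames and site lists are invented for the example; the real per-VO configuration differed in detail.

```python
"""Illustration of per-experiment information-system configuration:
each VO gets its own instance built from its own site list, so one
experiment's choices do not interfere with another's."""

# Which sites each experiment wants in "its" information system
# (hypothetical lists).
VO_SITES = {
    "alice": ["cern.ch", "cnaf.infn.it", "gridka.de"],
    "lhcb":  ["cern.ch", "ral.ac.uk", "in2p3.fr"],
}

GIIS_PORT = 2135  # standard site GIIS port, used when composing the URLs


def write_site_list(vo, sites):
    """Write a minimal, invented-format site list for this VO's instance."""
    path = "infosys-%s.conf" % vo
    with open(path, "w") as out:
        for site in sites:
            # One line per site GIIS this instance should aggregate.
            out.write("%s ldap://giis.%s:%d/mds-vo-name=local,o=grid\n"
                      % (site.split(".")[0].upper(), site, GIIS_PORT))
    return path


if __name__ == "__main__":
    for vo, sites in sorted(VO_SITES.items()):
        print("wrote", write_site_list(vo, sites))
```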
15 Support during the DCs
- User (Experiment) Support
- GD at CERN worked very closely with the experiments' production managers
- Informal exchange (e-mail, meetings, phone)
- "No Secrets" approach, GD people on experiments' mail lists and vice versa
- ensured fast response
- tracking of problems tedious, but both sides have been patient
- clear learning curve on BOTH sides
- LCG GGUS (grid user support) at FZK became operational after the start of the DCs
- due to the importance of the DCs the experiments switch only slowly to the new service
- Very good end user documentation by GD-EIS
- Dedicated testbed for experiments with the next LCG-2 release
- rapid feedback, influenced what made it into the next release
- Installation (Site) Support
- GD prepared releases and supported sites (certification, re-certification)
- Regional centres supported their local sites (some more, some less)
- Community style help via mailing list (high traffic!!)
- FAQ lists for trouble shooting and configuration issues (Taipei, RAL)
16 Support during the DCs
- Operations Service
- RAL (UK) is leading the sub-project on developing operations services
- Initial prototype: http://www.grid-support.ac.uk/GOC/
- Basic monitoring tools
- Mail lists for problem resolution
- Working on defining policies for operation and responsibilities (draft document)
- Working on grid wide accounting
- Monitoring
- GridICE (development of DataTag Nagios-based tools)
- GridPP job submission monitoring
- Information system monitoring and consistency check: http://goc.grid.sinica.edu.tw/gstat/ (see the sketch after this list)
- CERN GD daily re-certification of sites (including history)
- escalation procedure under development
- tracing of site specific problems via problem tracking tool
- tests core services and configuration
17 Screen Shots
18 Screen Shots
19 Problem Handling: PLAN
[Diagram with boxes: Triage VO / GRID, GGUS (Remedy), GOC, GD CERN, Monitoring/Follow-up, Escalation]
20 Problem Handling: Operation (most cases)
[Diagram with boxes: Community, VO A/B/C, Rollout Mailing List, GGUS, Triage, GOC, GD CERN, S-Site-1/2/3, Monitoring / Certification / Follow-Up / FAQs]
21 Problem Tracking
- GGUS: REMEDY
- Middleware problems: SAVANNAH LCG-OPERATION
- Re-certification: SAVANNAH LCG-SITES
- Many (MOST) problems only tracked by e-mail
- Much confusion on where to put problems
- Training needed to get reasonable 1st level user support
- canned answers
- experts need to focus on more complex tasks
- Unification of FAQs (RAL, Taipei, Italy, ...)
22 EGEE Impact on Operations
- The available effort for operations from EGEE is now ramping up
- LCG GOC (RAL) -> EGEE CICs and ROCs, Taipei
- Hierarchical support structure
- Regional Operations Centres (ROC)
- One per region (9)
- Front-line support for deployment, installation, users
- Core Infrastructure Centres (CIC)
- Four (+ Russia next year)
- Evolve from GOC: monitoring, troubleshooting, operational control
- 24x7 in an 8x5 world????
- Also providing VO-specific and general services
- EGEE NA3 organizes training for users and site admins
- NOW at HEPiX
- Address common issues, experiences
- Operations and Fabric Workshop
- CERN 1-3 Nov
23 PART II
- Operation models
- How much can be delegated to whom?
- autonomy / availability
- What are the consequences?
- cost for 24/7 with 8x5 staff
- One/multiple models for all sites/regions?
- One model for site integration, update, user support, security, operation?
- latency, efficiency, distribution of workload ...
- One size fits all?
- Next slides are meant to stimulate discussions, not give answers
24 CICs and ROCs and Operations
- Core Infrastructure Centers (CICs)
- run services like RBs, Information Indices, VO/VOMS, Catalogues
- are the distributed Grid Operation Center (GOC)
- and more ...
- Regional Operation Centers (ROCs)
- coordinate activities in their region
- give support to regional RCs
- coordinate setup/upgrades
- and more ...
- Resource Centers (RC)
- computing and storage
- Operation Management Center (OMC)
- coordination
25 Model I: Strict Hierarchy
- A CIC locates a problem with an RC or CIC in a region
- triggered by monitoring / user alert
- The CIC enters the problem into the problem tracking tool and assigns it to a ROC
- The ROC receives a notification and works on solving the problem
- The region decides locally what the ROC can do on the RCs
- This can include restarting services etc.
- The main emphasis is that the region decides on the depth of the interaction
- -> different regions, different procedures
- CICs NEVER contact a site
- -> ROCs need to be staffed all the time
- The ROC does it and is fully responsible for ALL the sites in the region
26 Model I: Strict Hierarchy
- Pro
- Best model to transfer knowledge to the ROCs
- all information flows through them
- Different regions can have their own policies
- this can reflect different administrative relations of sites in a region
- Clear responsibility
- until it is discovered to be the CIC's fault, it is always the ROC's fault
- Cons
- High latency
- even for trivial operations we have to pass through the ROCs
- ROCs have to be staffed (reachable) all the time
- Regions will develop their own tools
- parallel strands, less quality
- Excluded for handling security
27 Model II: Direct Communication, Local Control
- ROCs are active in
- the follow-up of problems that take longer to handle
- the setup of sites
- CICs are active in
- handling problems that can be solved by simple interactions
- communicated directly between CICs and RCs
- ROCs are informed of all interactions between CICs and RCs
- all problems are entered into the problem tracking tool
- restarting of services, etc. is handled by the RCs
28 Model II: Direct Communication, Local Control
- Pros
- Resources are not lost for trivial reasons
- Principle of local control is maintained
- ROCs are in the loop,
- but weak ROCs can't create too severe delays
- No complex tools for communication management needed
- mail / IRC sufficient
- Cons
- RCs need to be reachable at all times
- not realistic, and very expensive
- CICs have to be aware of the level of maturity of O(100) RCs
- ROCs have to monitor what is going on to learn the trade
- Language problems between the CICs and sysadmins
- Unclear responsibility
- "This was reported" / "Why didn't the CICs fix it themselves?"
29 Model III: Direct Communication, Direct Control
- Like Model II with some modifications
- CICs have access to the services on the RCs
- can, if the RC is not staffed, manage some of the services
- the site publishes at any time
- whether the local support is reachable or not
- what actions are permitted to the CICs
- all interactions are logged and reported to RC and ROC (sketched below)
- Some tools that allow very controlled (limited) access like this are under development (GSI-enabled remote SUDO)
- Variation with ROC-only interaction (IIIa)
30 Model III: Direct Communication, Direct Control
- Pros
- Resources are not lost for trivial reasons
- ROCs are in the loop,
- but weak ROCs can't create too severe delays
- One set of tools for remote operation
- some uniformity -> chance for better quality
- Site decides at any time on the balance between local/remote operation
- RCs can be run for a (short) time unattended
- Cons
- A set of tools for secure, limited remote operation respecting the sites' policies has to be put in place
- ROCs have to monitor what is going on to learn the trade
- Unclear responsibility
- "This was reported" / "Why didn't the CICs fix it themselves?"
31 Sample Use Cases
- User reports jobs failing on one site
- User reports jobs failing on some/all sites
- Monitoring shows a site dropping in and out of the IS
- An acute security incident
- Upgrading to a new version
- Post mortem after the security incidents
- ...
- Good preparation for the Operations Workshop
32 Summary
- LCG-2 services have been supporting the data challenges
- Many middleware problems have been found; many addressed
- Middleware itself is reasonably stable
- Biggest outstanding issues are related to providing and maintaining stable operations
- Future middleware has to take this into account
- Must be more manageable, trivial to configure and install
- Management and monitoring must be built into services from the start
- Outcome of the workshop in November is crucial for EGEE operation