1
LCG and EGEE Operations
Markus Schulz, IT-GD, CERN (markus.schulz@cern.ch)
EGEE is a project funded by the European Union under contract IST-2003-508833
2
Outline
  • LCG
  • software
  • EGEE
  • History of LCG production service
  • Impact of Data Challenges on operations
  • Problems
  • Operating LCG
  • Preparing Releases
  • Support
  • how it was planned
  • how it was done
  • Summary of the operations workshop at CERN
  • New Structure
  • Summary
  • Interoperation (status by L. Field)

3
EGEE in a nutshell
  • Goal
  • Create a Europe-wide, production-quality grid infrastructure on top of existing regional grid programs
  • despite its name, the project has a worldwide scope
  • multi-science project
  • Scale
  • 70 leading institutes in 27 countries
  • 300 FTEs
  • Aim: 20,000 CPUs
  • Initially a 2-year project
  • Activities
  • 48% service activities (operation, support)
  • 24% middleware re-engineering
  • 28% management, training, dissemination, international cooperation
  • Builds on
  • LCG to establish a grid operations service
  • joint team for deployment and operations
  • Experience gained from running services for the
    LHC experiments
  • HEP experiments are the pilot application for
    EGEE

4
EGEE Middleware
  • New design driven by the requirements of the experiments, biomedical applications and operations (strong multi-science aspect)
  • Process includes partners from EU and USA
  • Involves experienced Middleware providers from
    AliEn, EDG, VDT
  • Monthly meetings in EU and USA
  • Prototyping approach as required by ARDA
  • Allowing for rapid release cycles and fast
    feedback from early adopters
  • Formal Integration Testing mechanisms driven
    from CERN
  • Should ensure quality and coherence among the developments coming from distributed teams
  • Includes formal defect tracking system
  • First stabilized version to be available by the
    end of the year
  • An initial prototype was, however, made available in May 2004, currently with 2 releases/month addressing user and testing feedback
  • Target is to deploy components onto the LCG
    preproduction service asap.

5
The LCG Project (and what it isn't)
  • Mission
  • To prepare, deploy and operate the computing
    environment for the experiments to analyze the
    data from the LHC detectors
  • Two phases
  • Phase 1: 2002-2005
  • Build a prototype, based on existing grid
    middleware
  • Deploy and run a production service
  • Produce the Technical Design Report for the final
    system
  • Phase 2: 2006-2008
  • Build and commission the initial LHC computing
    environment
  • LCG is NOT a development project for middleware
  • but problem fixing is permitted (even if writing
    code is required)
  • LCG-2 is the first production service for EGEE
  • Ian Bird is Operations Officer for both
    projects

6
LCG-2 Software
  • LCG-2 core packages
  • VDT (Globus 2, Condor)
  • EDG
  • Resource Broker, job submission tools
  • Replica Management tools, lcg tools
  • One central RMC and LRC for each VO, located at CERN, with an Oracle backend
  • SRM/GridFTP-based access to MSS (Castor, dCache)
  • Several bits from other WPs (config objects, info providers, packaging)
  • GLUE 1.1 information schema, with a few essential LCG extensions
  • MDS-based information system with significant LCG enhancements (replacements, simplification, scalability, LCG-BDII); a query sketch follows this slide
  • Mechanism for application (experiment) software distribution
  • VOMS (in preparation)
  • Almost all components have gone through some re-engineering
  • robustness, scalability, efficiency
  • adaptation to local fabrics
  • The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture)
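Since the information system is plain LDAP, its content can be inspected with standard tools. A minimal sketch, assuming a BDII endpoint on the conventional port 2170 with the conventional base DN o=grid; the host name and the chosen attributes are illustrative, not a real service:

```python
# Minimal sketch: query a BDII (an LDAP server) for GLUE CE records using
# the stock OpenLDAP command-line client. The endpoint is hypothetical;
# 2170 was the conventional BDII port and "o=grid" the usual base DN.
import subprocess

BDII = "ldap://lcg-bdii.example.org:2170"   # hypothetical endpoint
BASE = "o=grid"

result = subprocess.run(
    ["ldapsearch", "-x", "-LLL",            # simple auth, terse LDIF output
     "-H", BDII, "-b", BASE,
     "(objectClass=GlueCE)",                # all computing elements
     "GlueCEUniqueID", "GlueCEStateFreeCPUs"],
    capture_output=True, text=True)

for line in result.stdout.splitlines():
    if line.startswith("GlueCE"):
        print(line)
```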

7
Experience
  • Jan 2003: GDB agreed to take VDT and EDG components
  • September 2003: LCG-1
  • Extensive certification process
  • Integrated 32 sites, 300 CPUs; first use for production
  • December 2003: LCG-2
  • Deployed in January to 8 core sites
  • Introduced a pre-production service for the experiments
  • Alternative packaging (tool-based and generic installation guides)
  • May 2004 -> now: monthly incremental releases (not all distributed)
  • Driven by the experiences from the data challenges
  • Balance between stable operation and improved versions (driven by users)
  • 2-1-0, 2-1-1, 2-2-0, 2-3-0
  • Production services (RBs, BDIIs) patched on demand
  • >90 sites, 9300 CPUs (3-5 failed to come online)

8
Adding Sites
  • Sites contact GD Group or Regional Operation
    Center
  • Sites go to the release page
  • Sites decide on manual or tool based installation
  • Sites provide security and contact information
  • GD forwards this to GOC and security officer
  • >200 pages of documentation and FAQs are available
  • Sites install and use provided tests for
    debugging
  • large sites integrate their local batch system
  • support from ROCs or CERN
  • CERN GD certifies sites
  • adds them to the monitoring and information
    system
  • sites are re-certified daily, and problems are traced in SAVANNAH (a sketch follows this slide)
  • Experiments install their software and add the
    site to their IS
  • Adding new sites is now quite a smooth process
  • this takes between a few days and a few weeks

worked 90 times
failed 3-5 times
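A minimal sketch of the daily re-certification pass, under stated assumptions: the site list and test names are illustrative, and run_test / open_ticket are hypothetical stand-ins for the real certification tests and the SAVANNAH tracker.

```python
# Sketch of a daily re-certification pass. run_test() and open_ticket()
# are hypothetical placeholders for the real certification tests and the
# SAVANNAH problem tracker; sites and test names are illustrative.
from datetime import date

SITES = ["site-a.example.org", "site-b.example.org"]
TESTS = ["job-submission", "replica-management", "info-system"]

def run_test(site: str, test: str) -> bool:
    """Stand-in for a real certification test; always passes here."""
    return True

def open_ticket(site: str, test: str) -> None:
    """Stand-in for filing a problem in the tracking system."""
    print(f"[{date.today()}] ticket opened: {site} failed {test}")

def recertify(sites):
    """Re-run the test suite; failing sites get tickets, passing sites
    stay published in the monitoring and information system."""
    certified = []
    for site in sites:
        failed = [t for t in TESTS if not run_test(site, t)]
        for t in failed:
            open_ticket(site, t)
        if not failed:
            certified.append(site)
    return certified

print(recertify(SITES))
```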
9
Adding a Site
10
LCG-2 Status 18/11/2004
[Map of LCG-2 sites as of the release date, including Cyprus; new interested sites should consult the release page]
  • Total
  • 90 Sites
  • 9500 CPUs
  • 6.5 PByte

11
Preparing a Release
CT = Certification and Testing
GDB = Grid Deployment Board
  • Monthly process (sketched as a simple pipeline after this slide)
  • Gathering of new material
  • Prioritization
  • Integration of items on list
  • Deployment on testbeds
  • First tests
  • feedback
  • Release to EIS testbed for experiment validation
  • Full testing (functional and stress)
  • feedback to patch/component providers
  • final list of new components
  • Internal release (LCFGng)
  • On demand
  • Preparation/Update of release notes for LCFGng
  • Preparation/Update of manual install
    documentation
  • Test installations on GIS testbeds
  • Announcement on the LCG-Rollout list

EIS = Experiment Integration Support (Applications)
GIS = Grid Infrastructure Support (Sites)
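The monthly cycle above is essentially a sequence of gated stages with feedback on failure (its sequential nature is one of the criticisms recorded on the "Process Experience" slide below). A toy sketch with stage names taken from the list; the gate function is a hypothetical stand-in for the human and test decisions:

```python
# Toy model of the monthly release cycle as a gated pipeline; a failing
# gate sends feedback to the providers and halts the release.
STAGES = [
    "gather new material", "prioritize", "integrate items",
    "deploy on testbeds", "first tests",
    "release to EIS testbed (experiment validation)",
    "full functional and stress testing",
    "internal release (LCFGng)",
    "release notes and manual-install documentation",
    "test installations on GIS testbeds",
    "announce on the LCG-Rollout list",
]

def run_release(gate) -> bool:
    """gate(stage) -> bool stands in for the real pass/fail decision."""
    for stage in STAGES:
        if not gate(stage):
            print(f"feedback to providers; halted at: {stage}")
            return False
    print("release announced")
    return True

run_release(lambda stage: True)   # a cycle where every stage passes
```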
12
Preparing a Release: Initial List, Prioritization, Integration, EIS, Stress Test
(CT; LCFGng change record)
13
Preparing a Release: Preparations for Distribution, Upgrading
Sites upgrade at own pace
Certification is run daily
14
Process Experience
  • The process was decisive in improving the quality of the middleware
  • The process is time consuming
  • There are many sequential operations
  • The format of the internal and external release
    will be unified
  • Multiple packaging formats slow down release
    preparation
  • tool based (LCFGng)
  • manual (tar ball based)
  • All components are treated equally
  • same level of testing for core components and non-vital tools
  • a special process is needed for accepting tools already in use by other projects
  • The process of including new components is not sufficiently transparent
  • Picking a good time for a new release is
    difficult
  • conflict between users (NOW) and sites (planned)
  • Upgrading has proven to be a high-risk operation
  • some sites suffered from acute configuration
    amnesia
  • Process was one of the topics in the LCG
    Operations Workshop

15
Impact of Data Challenges
  • Large scale production effort of the LHC
    experiments
  • test and validate the computing models
  • produce needed simulated data
  • test the experiments' production frameworks and software
  • test the provided grid middleware
  • test the services provided by LCG-2
  • All experiments used LCG-2 for part of their
    production

16
Data Challenges
  • Phase I
  • 7.7 million events fully simulated (Geant4) in 95,000 jobs
  • 22 TByte
  • Total CPU: 972 MSI2k hours
  • >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)

17
Data Challenges
18
Data Challenges
[Chart: production rate over time; 3-5 × 10^6 events/day with LCG in action, 1.8 × 10^6 events/day with DIRAC alone; annotations mark where LCG paused and restarted]
19
Problems during the data challenges
  • All experiments encountered similar problems on LCG-2
  • LCG sites suffering from configuration and
    operational problems
  • inadequate resources at some sites (hardware, human...)
  • this is now the main source of failures
  • Load balancing between different sites is
    problematic
  • jobs can be attracted to sites that have no adequate resources
  • modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS (a sketch of the effect follows this slide)
  • Identification and location of problems in LCG-2
    is difficult
  • distributed environment; access to many logfiles needed (but hard)
  • status of monitoring tools
  • Handling thousands of jobs is time consuming and
    tedious
  • Support for bulk operation is not adequate
  • Performance and scalability of services
  • storage (access and number of files)
  • job submission
  • information system
  • file catalogues
  • Services suffered from hardware problems
  • no failover (a design problem)

DC summary
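A toy illustration of the load-balancing problem noted above: between information-system refreshes, every broker decision sees the same published snapshot, so a naive rank herds an entire burst of jobs onto one site. The attribute name follows the GLUE schema; hosts and numbers are invented.

```python
# Why ranking on a stale IS snapshot herds jobs: all decisions taken
# between two refreshes see identical "free CPUs" values. Hosts and
# numbers are illustrative.
snapshot = {                       # stale GlueCEStateFreeCPUs per CE
    "ce.big.example.org":   120,
    "ce.small.example.org":  10,
}

def rank(snap: dict) -> str:
    """Naive broker rank: the CE publishing the most free CPUs wins."""
    return max(snap, key=snap.get)

# 50 jobs arrive before the next IS refresh: every one of them lands on
# the same CE, regardless of its real remaining capacity.
placements = [rank(snapshot) for _ in range(50)]
assert set(placements) == {"ce.big.example.org"}
print("all 50 jobs sent to", placements[0])
```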
20
Operational issues (selection)
  • Slow response from sites
  • Upgrades, response to problems, etc.
  • Problems are reported daily; some problems last for weeks
  • Lack of staff available to fix problems
  • Vacation period, other high priority tasks
  • Various mis-configurations (see next slide)
  • Lack of configuration management: problems that were fixed reappear
  • Lack of fabric management (mostly smaller sites)
  • scratch space, single nodes draining queues, incomplete upgrades, ...
  • Lack of understanding
  • Admins reformat disks of SE
  • Provided documentation often not (carefully) read
  • new activity to develop adaptive documentation
  • simpler way to install middleware (YAIM)
  • opens ways to maintain middleware remotely in
    user space
  • Firewall issues
  • often less-than-optimal coordination between grid admins and firewall maintainers
  • openPBS problems
  • scalability, robustness (switching to Torque helps)

21
Site (mis)configurations
  • Site misconfiguration was responsible for most of the problems that occurred during the experiments' Data Challenges. Here is a non-exhaustive list of problems (a sanity-check sketch follows this slide):
  • The variable VO_<VO>_SW_DIR points to a non-existent area on WNs
  • The ESM is not allowed to write to the area dedicated to software installation
  • Only one certificate allowed to be mapped to the ESM local account
  • Wrong information published in the information system (GLUE object classes not linked)
  • Queue time limits published in minutes instead
    of seconds and not normalized
  • /etc/ld.so.conf not properly configured. Shared
    libraries not found.
  • Machines not synchronized in time
  • Grid-mapfiles not properly built
  • Pool accounts not created but the rest of the
    tools configured with pool accounts
  • Firewall issues
  • CA files not properly installed
  • NFS problems for home directories or ESM areas
  • Services configured to use the wrong Information Index (BDII), or none at all
  • Wrong user profiles
  • Default user shell environment too big
  • Only partly related to middleware complexity

integrated all common small problems into
ONE BIG PROBLEM
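A sketch of a worker-node sanity check covering three of the items above (software area, /etc/ld.so.conf, time synchronization). The VO name, the required library path and the use of ntpdate are assumptions for illustration; the real certification tests were more thorough.

```python
# Sketch of a WN sanity check for a few of the misconfigurations listed
# above. VO name, paths and the ntpdate probe are illustrative.
import os
import subprocess

def check_sw_dir(vo: str) -> bool:
    """VO_<VO>_SW_DIR must point at an existing, writable area."""
    path = os.environ.get(f"VO_{vo.upper()}_SW_DIR", "")
    return os.path.isdir(path) and os.access(path, os.W_OK)

def check_ld_so_conf(required: str = "/opt/globus/lib") -> bool:
    """A needed shared-library path must appear in /etc/ld.so.conf."""
    try:
        with open("/etc/ld.so.conf") as f:
            return required in f.read()
    except OSError:
        return False

def check_clock() -> bool:
    """Machine must be time-synchronized (crude query-only ntpdate probe)."""
    try:
        out = subprocess.run(["ntpdate", "-q", "pool.ntp.org"],
                             capture_output=True, text=True, timeout=10)
        return out.returncode == 0
    except (OSError, subprocess.SubprocessError):
        return False

for name, ok in [("software dir", check_sw_dir("atlas")),
                 ("ld.so.conf", check_ld_so_conf()),
                 ("clock sync", check_clock())]:
    print(f"{name}: {'OK' if ok else 'FAIL'}")
```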
22
Operating Services for DCs
  • Multiple instances of core services for each of
    the experiments
  • separates problems, avoids interference between
    experiments
  • improves availability
  • allows experiments to maintain individual
    configuration
  • addresses scalability to some degree
  • Monitoring tools for services are currently not adequate
  • tools under development to implement a control system
  • moving tools to a common transport and storage format (R-GMA)
  • Access to storage via load balanced interfaces
  • CASTOR
  • dCache
  • Load-balancing service for the information system index service
  • load-balanced BDII deployed at CERN (a failover sketch follows this slide)

DC summary
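A sketch of the effect of replicating the index service: a client (or a load-balancing front end) can fail over to another BDII instance when one is unreachable. Host names are invented, and the client-side logic is an assumption; a real deployment would more likely hide the replicas behind a load-balanced alias.

```python
# Client-side failover across replicated BDII endpoints: take the first
# replica that accepts a TCP connection. Hosts are illustrative.
import socket

BDII_REPLICAS = [
    ("bdii1.example.org", 2170),
    ("bdii2.example.org", 2170),
]

def pick_bdii(replicas, timeout: float = 2.0):
    """Return the first reachable (host, port); raise if none answer."""
    for host, port in replicas:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue                 # dead replica: try the next one
    raise RuntimeError("no BDII replica reachable")
```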
23
Support during the DCs
  • User (Experiment) Support
  • GD at CERN worked very closely with the experiments' production managers
  • Informal exchange (e-mail, meetings, phone)
  • "No secrets" approach: GD people on the experiments' mailing lists and vice versa
  • ensured fast response
  • tracking of problems was tedious, but both sides have been patient
  • clear learning curve on BOTH sides
  • LCG GGUS (Grid User Support) at FZK became operational after the start of the DCs
  • due to the importance of the DCs, the experiments are switching slowly to the new service
  • Very good end user documentation by GD-EIS
  • Dedicated testbed for experiments with next LCG-2
    release
  • rapid feedback, influenced what made it into the
    next release
  • Installation and site operations support
  • GD prepared releases and supported sites
    (certification, re-certification)
  • Regional centres supported their local sites
    (some more, some less)
  • Community-style help via mailing list (high traffic!)
  • FAQ lists for troubleshooting and configuration issues (Taipei, RAL)

24
Support during the DCs
  • Operations Service
  • RAL (UK) is leading the sub-project on developing operations services
  • Initial prototype: http://www.grid-support.ac.uk/GOC/
  • Basic monitoring tools
  • Mail lists for problem resolution
  • Working on defining policies for operation,
    responsibilities (draft document)
  • Working on grid-wide accounting (APEL)
  • Monitoring
  • GridICE (a development of the DataTAG Nagios-based tools)
  • GridPP job-submission monitoring: every few hours, all RBs, all sites
  • Information system monitoring and consistency checks every 5 minutes: http://goc.grid.sinica.edu.tw/gstat/
  • CERN GD daily re-certification of sites
    (including history)
  • escalation procedure
  • tracing of site specific problems via problem
    tracking tool
  • tests core services and configuration

25
Screen Shots
26
Screen Shots
27
Some More Monitoring
28
Monitoring and Controls
  • Many monitoring tools and sources of information
    available
  • Hard to combine information to spot problems
    early
  • Split of monitoring into three parts
  • sensors
  • transport and storage
  • display
  • Transport and storage based on R-GMA monitoring
    bus
  • Already ported: GIIS monitor, re-certification, job submission, (GridICE, (L&B of the RB))
  • general display based on R-GMA
  • Building complex alarms via SQL queries (see the sketch below)
  • Controls
  • Taipei is building a message system that can be
    used for interaction with sites
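A sketch of the "complex alarms via SQL queries" idea: once monitoring results sit in a common store, an alarm is just a query. Here sqlite3 stands in for the R-GMA transport/storage bus, and the table layout is invented for illustration.

```python
# Alarm-by-query sketch: sqlite3 stands in for the R-GMA store; the
# site_tests table layout is invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE site_tests (site TEXT, test TEXT, ok INTEGER, ts TEXT)")
db.executemany("INSERT INTO site_tests VALUES (?,?,?,?)", [
    ("site-a", "job-submission", 0, "2004-11-18T09:00"),
    ("site-a", "job-submission", 0, "2004-11-18T10:00"),
    ("site-a", "job-submission", 0, "2004-11-18T11:00"),
    ("site-b", "job-submission", 1, "2004-11-18T11:00"),
])

# Alarm: a site failing the same test three or more times today.
alarms = db.execute("""
    SELECT site, test, COUNT(*) FROM site_tests
    WHERE ok = 0 AND ts LIKE '2004-11-18%'
    GROUP BY site, test
    HAVING COUNT(*) >= 3""").fetchall()

for site, test, n in alarms:
    print(f"ALARM: {site} failed {test} {n} times")
```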

29
Problem Handling: PLAN for LCG
[Diagram: Monitoring/Follow-up; Triage (VO / GRID); GOC; GGUS (Remedy); GD CERN; Escalation]
30
Problem Handling: Operation (most cases)
[Diagram: Community; VOs A, B, C; Rollout mailing list; GGUS triage; GOC; GD CERN; sites S-Site-1, S-Site-2, S-Site-3; Monitoring, Certification, Follow-up, FAQs]
31
LCG Workshop on Operational Issues
  • Motivation
  • LCG -> (LCG+EGEE) transition requires changes
  • Lessons learned need to be implemented
  • Many different activities need to be coordinated
  • 02 - 04 November at CERN
  • gt80 participants including from GRID3 and
    NorduGrid
  • Agenda Here
  • 1.5 days of plenary sessions
  • describe status and stimulate discussion
  • 1 day parallel/joint working groups
  • very concrete work
  • resulted in task lists with names attached to items
  • 0.5 days of reports of the WG

32
LCG Workshop on Operational Issues: WGs I
  • Operational Security
  • Incident Handling Process
  • Variance in site support availability
  • Reporting Channels
  • Service Challenges
  • Operational Support
  • Workflow for operations security actions
  • What tools are needed to implement the model
  • 24x7 global support
  • sharing operational load (taking turns)
  • Communication
  • Problem Tracking System
  • Defining Responsibilities
  • problem follow-up
  • deployment of new releases
  • Interface to User Support

33
LCG Workshop on Operational Issues: WGs II
  • Fabric Management
  • System installations
  • Batch/scheduling Systems
  • Fabric monitoring
  • Software installation
  • Representation of site status (load) in the
    Information System
  • Software Management
  • Operations on and for VOs (add/remove/service
    discovery)
  • Fault tolerance, operations on running services (stops, upgrades, restarts)
  • Link to developers
  • What level of intrusion can be tolerated on the
    WNs (farm nodes)
  • application (experiment) software installation
  • Removing/(re-adding) sites with (fixed) troubles
  • Multiple views in the information system
    (maintenance)

34
LCG Workshop on Operational Issues: WGs III
  • User Support
  • Defining what User Support means
  • Models for implementing a working user support
  • need for a Central User Support Coordination Team
    (CUSC)
  • mandate and tasks
  • distributed/central (CUSC/RUSC)
  • workflow
  • VO-support
  • continuous support on integrating the VOs
    software with the middleware
  • end user documentation
  • FAQs

35
LCG Workshop on Operational Issues: Summary
  • Very productive workshop
  • Partners (sites) assumed responsibility for tasks
  • Discussions very much focused on practical
    matters
  • Some problems call for architectural changes
  • gLite has to address these
  • It became clear that not all sites are created
    equal
  • Removing troubled sites is inherently problematic
  • removing storage can have grid wide impact
  • A key issue in all aspects is to define the split between
  • local, regional and central control and responsibility
  • All WGs discussed communication

36
New Operations Model
  • EGEE Structure
  • OMC: Operations Management Center
  • CICs: Core Infrastructure Centers
  • services like file catalogues, RBs, central
    infrastructure
  • operation support
  • CERN, France, Italy, UK, (Russia, Taipei)
  • ROCs: Regional Operation Centers
  • regional support
  • France, Italy, UK/Ireland, Germany/Switzerland, N-Europe, SW-Europe, Central Europe, Russia
  • RCs: Resource Centers
  • data and CPUs

37
New Operations Model
  • The Operations Center role rotates through the CICs
  • CIC on duty for one week (a rotation sketch follows this slide)
  • Procedures and tasks are currently being defined
  • first operations manual is available (living
    document)
  • tools, frequency of checks, escalation procedures, hand-over procedures
  • CIC on duty website
  • Problems are tracked with a tracking tool
  • currently centrally in Savannah
  • migration to GGUS (Remedy) with links to the ROCs' problem-tracking tools
  • problems can be added at GGUS or ROC level
  • CICs monitor service, spot and track problems
  • interact with sites on short-term problems (service restarts, etc.)
  • interact with ROCs on longer, non-trivial problems
  • all communication with a site is visible to the ROC
  • build FAQs
  • ROCs support
  • installation, first certification
  • resolving complex problems
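A minimal sketch of the weekly CIC-on-duty rotation; the CIC list follows the previous slide, and the start date is an arbitrary assumption.

```python
# Weekly CIC-on-duty rotation. The CIC list follows the slides; the
# rotation start date is an arbitrary illustrative anchor.
from datetime import date

CICS = ["CERN", "France", "Italy", "UK"]   # Russia and Taipei to join later
ROTATION_START = date(2004, 11, 1)

def cic_on_duty(day: date) -> str:
    week = (day - ROTATION_START).days // 7
    return CICS[week % len(CICS)]

print(cic_on_duty(date(2004, 11, 18)))     # -> Italy (third week of rotation)
```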

38
New Operations Model
[Diagram: OMC linked to the ROCs and to other grids; each ROC serves several RCs]
39
Summary
  • LCG-2 services have been supporting the data
    challenges
  • Many middleware problems have been found; many addressed
  • Middleware itself is reasonably stable
  • Biggest outstanding issues are related to
    providing and maintaining stable operations
  • Future middleware has to take this into account
  • Must be more manageable, and trivial to configure and install
  • Management and monitoring must be built into services from the start
  • Operational Workshop has started many activities
  • Follow-up and keeping up the momentum is now
    essential
  • Indicates a clear shift away from the
    CERNtralized operation
  • CIC on duty is a first step to distribute
    operational load