1
LCG and EGEE Operations
Markus Schulz, IT-GD, CERN (markus.schulz@cern.ch)
EGEE is a project funded by the European Union under contract IST-2003-508833
2
Outline
  • LCG
  • software
  • EGEE
  • History of LCG production service
  • Impact of Data Challenges on operations
  • Problems
  • Operating LCG
  • Preparing Releases
  • Support
  • how it was planned
  • how it was done
  • Summary of the operations workshop at CERN
  • New Structure
  • Summary
  • Interoperation (status by L. Field)

3
EGEE in a nutshell
  • Goal
  • Create a Europe-wide, production-quality grid infrastructure on top of existing regional grid programs
  • despite its name, the project has a worldwide scope
  • multi-science project
  • Scale
  • 70 leading institutes in 27 countries
  • 300 FTEs
  • Aim: 20,000 CPUs
  • Initially a 2-year project
  • Activities
  • 48% service activities (operation, support)
  • 24% middleware re-engineering
  • 28% management, training, dissemination, international cooperation
  • Builds on
  • LCG to establish a grid operations service
  • joint team for deployment and operations
  • Experience gained from running services for the
    LHC experiments
  • HEP experiments are the pilot application for
    EGEE

4
EGEE Middleware
  • New design driven by the requirements of the experiments, biomedical applications and operations (strong multi-science aspect)
  • Process includes partners from EU and USA
  • Involves experienced Middleware providers from
    AliEn, EDG, VDT
  • Monthly meetings in EU and USA
  • Prototyping approach as required by ARDA
  • Allowing for rapid release cycles and fast
    feedback from early adopters
  • Formal Integration Testing mechanisms driven
    from CERN
  • Should ensure quality and coherence among the developments coming from distributed teams
  • Includes formal defect tracking system
  • First stabilized version to be available by the
    end of the year
  • An initial prototype was, however, made available in May 2004, currently with 2 releases/month addressing user and testing feedback
  • Target is to deploy components onto the LCG
    preproduction service asap.

5
The LCG Project (and what it isn't)
  • Mission
  • To prepare, deploy and operate the computing
    environment for the experiments to analyze the
    data from the LHC detectors
  • Two phases
  • Phase 1: 2002-2005
  • Build a prototype, based on existing grid
    middleware
  • Deploy and run a production service
  • Produce the Technical Design Report for the final
    system
  • Phase 2: 2006-2008
  • Build and commission the initial LHC computing
    environment
  • LCG is NOT a development project for middleware
  • but problem fixing is permitted (even if writing
    code is required)
  • LCG-2 is the first production service for EGEE
  • Ian Bird is Operations Officer for both
    projects

6
LCG-2 Software
  • LCG-2 core packages
  • VDT (Globus 2, Condor)
  • EDG
  • Resource Broker, job submission tools
  • Replica Management tools, lcg tools
  • One central RMC and LRC for each VO, located at CERN, with an Oracle backend
  • SRM/GridFTP-based access to MSS (Castor, dCache)
  • Several bits from other WPs (config objects, info providers, packaging)
  • GLUE 1.1 information schema, with a few essential LCG extensions
  • MDS-based information system with significant LCG enhancements (replacements, simplification, scalability, LCG-BDII); a query sketch follows this slide
  • Mechanism for application (experiment) software distribution
  • VOMS (in preparation)
  • Almost all components have gone through some re-engineering
  • robustness, scalability, efficiency
  • adaptation to local fabrics
  • The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture)
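Since the information system is plain LDAP, its content can be inspected with standard tools. A minimal sketch, assuming a BDII endpoint on the conventional port 2170 with the conventional base DN o=grid; the host name and the chosen attributes are illustrative, not a real service:

```python
# Minimal sketch: query a BDII (an LDAP server) for GLUE CE records using
# the stock OpenLDAP command-line client. The endpoint is hypothetical;
# 2170 was the conventional BDII port and "o=grid" the usual base DN.
import subprocess

BDII = "ldap://lcg-bdii.example.org:2170"   # hypothetical endpoint
BASE = "o=grid"

result = subprocess.run(
    ["ldapsearch", "-x", "-LLL",            # simple auth, terse LDIF output
     "-H", BDII, "-b", BASE,
     "(objectClass=GlueCE)",                # all computing elements
     "GlueCEUniqueID", "GlueCEStateFreeCPUs"],
    capture_output=True, text=True)

for line in result.stdout.splitlines():
    if line.startswith("GlueCE"):
        print(line)
```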

7
Experience
  • Jan 2003: GDB agreed to take VDT and EDG components
  • September 2003: LCG-1
  • Extensive certification process
  • Integrated 32 sites, 300 CPUs; first use for production
  • December 2003: LCG-2
  • Deployed in January to 8 core sites
  • Introduced a pre-production service for the experiments
  • Alternative packaging (tool-based and generic installation guides)
  • May 2004 -> now: monthly incremental releases (not all distributed)
  • Driven by the experiences from the data challenges
  • Balance between stable operation and improved versions (driven by users)
  • 2-1-0, 2-1-1, 2-2-0, 2-3-0
  • Production services (RBs, BDIIs) patched on demand
  • >90 sites, 9300 CPUs (3-5 failed to come online)

8
Adding Sites
  • Sites contact GD Group or Regional Operation
    Center
  • Sites go to the release page
  • Sites decide on manual or tool based installation
  • Sites provide security and contact information
  • GD forwards this to GOC and security officer
  • >200 pages of documentation and FAQs are available
  • Sites install and use provided tests for
    debugging
  • large sites integrate their local batch system
  • support from ROCs or CERN
  • CERN GD certifies sites
  • adds them to the monitoring and information
    system
  • sites are re-certified daily, and problems are traced in SAVANNAH (a sketch follows this slide)
  • Experiments install their software and add the
    site to their IS
  • Adding new sites is now quite a smooth process
  • this takes between a few days and a few weeks

worked 90 times
failed 3-5 times
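A minimal sketch of the daily re-certification pass, under stated assumptions: the site list and test names are illustrative, and run_test / open_ticket are hypothetical stand-ins for the real certification tests and the SAVANNAH tracker.

```python
# Sketch of a daily re-certification pass. run_test() and open_ticket()
# are hypothetical placeholders for the real certification tests and the
# SAVANNAH problem tracker; sites and test names are illustrative.
from datetime import date

SITES = ["site-a.example.org", "site-b.example.org"]
TESTS = ["job-submission", "replica-management", "info-system"]

def run_test(site: str, test: str) -> bool:
    """Stand-in for a real certification test; always passes here."""
    return True

def open_ticket(site: str, test: str) -> None:
    """Stand-in for filing a problem in the tracking system."""
    print(f"[{date.today()}] ticket opened: {site} failed {test}")

def recertify(sites):
    """Re-run the test suite; failing sites get tickets, passing sites
    stay published in the monitoring and information system."""
    certified = []
    for site in sites:
        failed = [t for t in TESTS if not run_test(site, t)]
        for t in failed:
            open_ticket(site, t)
        if not failed:
            certified.append(site)
    return certified

print(recertify(SITES))
```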
9
Adding a Site
10
LCG-2 Status 18/11/2004
[Map of LCG-2 sites as of the release date, including Cyprus; new interested sites should consult the release page]
  • Total
  • 90 Sites
  • 9500 CPUs
  • 6.5 PByte

11
Preparing a Release
CT = Certification and Testing
GDB = Grid Deployment Board
  • Monthly process (sketched as a simple pipeline after this slide)
  • Gathering of new material
  • Prioritization
  • Integration of items on list
  • Deployment on testbeds
  • First tests
  • feedback
  • Release to EIS testbed for experiment validation
  • Full testing (functional and stress)
  • feedback to patch/component providers
  • final list of new components
  • Internal release (LCFGng)
  • On demand
  • Preparation/Update of release notes for LCFGng
  • Preparation/Update of manual install
    documentation
  • Test installations on GIS testbeds
  • Announcement on the LCG-Rollout list

EIS = Experiment Integration Support (Applications)
GIS = Grid Infrastructure Support (Sites)
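The monthly cycle above is essentially a sequence of gated stages with feedback on failure (its sequential nature is one of the criticisms recorded on the "Process Experience" slide below). A toy sketch with stage names taken from the list; the gate function is a hypothetical stand-in for the human and test decisions:

```python
# Toy model of the monthly release cycle as a gated pipeline; a failing
# gate sends feedback to the providers and halts the release.
STAGES = [
    "gather new material", "prioritize", "integrate items",
    "deploy on testbeds", "first tests",
    "release to EIS testbed (experiment validation)",
    "full functional and stress testing",
    "internal release (LCFGng)",
    "release notes and manual-install documentation",
    "test installations on GIS testbeds",
    "announce on the LCG-Rollout list",
]

def run_release(gate) -> bool:
    """gate(stage) -> bool stands in for the real pass/fail decision."""
    for stage in STAGES:
        if not gate(stage):
            print(f"feedback to providers; halted at: {stage}")
            return False
    print("release announced")
    return True

run_release(lambda stage: True)   # a cycle where every stage passes
```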
12
Preparing a Release: Initial List, Prioritization, Integration, EIS, Stress Test
(CT; LCFGng change record)
13
Preparing a Release: Preparations for Distribution, Upgrading
Sites upgrade at own pace
Certification is run daily
14
Process Experience
  • The process was decisive in improving the quality of the middleware
  • The process is time consuming
  • There are many sequential operations
  • The format of the internal and external release
    will be unified
  • Multiple packaging formats slow down release
    preparation
  • tool based (LCFGng)
  • manual (tar ball based)
  • All components are treated equally
  • same level of testing for core components and non-vital tools
  • a special process is needed for accepting tools already in use by other projects
  • The process of including new components is not sufficiently transparent
  • Picking a good time for a new release is
    difficult
  • conflict between users (NOW) and sites (planned)
  • Upgrading has proven to be a high-risk operation
  • some sites suffered from acute configuration
    amnesia
  • Process was one of the topics in the LCG
    Operations Workshop

15
Impact of Data Challenges
  • Large scale production effort of the LHC
    experiments
  • test and validate the computing models
  • produce needed simulated data
  • test the experiments' production frameworks and software
  • test the provided grid middleware
  • test the services provided by LCG-2
  • All experiments used LCG-2 for part of their
    production

16
Data Challenges
  • Phase I
  • 7.7 million events fully simulated (Geant4) in 95,000 jobs
  • 22 TByte
  • Total CPU: 972 MSI2k hours
  • >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)

17
Data Challenges
18
Data Challenges
[Chart: production rate over time; 3-5 × 10^6 events/day with LCG in action, 1.8 × 10^6 events/day with DIRAC alone; annotations mark where LCG paused and restarted]
19
Problems during the data challenges
  • All experiments encountered similar problems on LCG-2
  • LCG sites suffering from configuration and
    operational problems
  • inadequate resources at some sites (hardware, human...)
  • this is now the main source of failures
  • Load balancing between different sites is
    problematic
  • jobs can be attracted to sites that have no adequate resources
  • modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS (a sketch of the effect follows this slide)
  • Identification and location of problems in LCG-2
    is difficult
  • distributed environment; access to many logfiles needed (but hard)
  • status of monitoring tools
  • Handling thousands of jobs is time consuming and
    tedious
  • Support for bulk operation is not adequate
  • Performance and scalability of services
  • storage (access and number of files)
  • job submission
  • information system
  • file catalogues
  • Services suffered from hardware problems
  • no failover (a design problem)

DC summary
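A toy illustration of the load-balancing problem noted above: between information-system refreshes, every broker decision sees the same published snapshot, so a naive rank herds an entire burst of jobs onto one site. The attribute name follows the GLUE schema; hosts and numbers are invented.

```python
# Why ranking on a stale IS snapshot herds jobs: all decisions taken
# between two refreshes see identical "free CPUs" values. Hosts and
# numbers are illustrative.
snapshot = {                       # stale GlueCEStateFreeCPUs per CE
    "ce.big.example.org":   120,
    "ce.small.example.org":  10,
}

def rank(snap: dict) -> str:
    """Naive broker rank: the CE publishing the most free CPUs wins."""
    return max(snap, key=snap.get)

# 50 jobs arrive before the next IS refresh: every one of them lands on
# the same CE, regardless of its real remaining capacity.
placements = [rank(snapshot) for _ in range(50)]
assert set(placements) == {"ce.big.example.org"}
print("all 50 jobs sent to", placements[0])
```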
20
Operational issues (selection)
  • Slow response from sites
  • Upgrades, response to problems, etc.
  • Problems are reported daily; some problems last for weeks
  • Lack of staff available to fix problems
  • Vacation period, other high priority tasks
  • Various mis-configurations (see next slide)
  • Lack of configuration management: problems that were fixed reappear
  • Lack of fabric management (mostly smaller sites)
  • scratch space, single nodes draining queues, incomplete upgrades, ...
  • Lack of understanding
  • Admins reformat disks of SE
  • Provided documentation often not (carefully) read
  • new activity to develop adaptive documentation
  • simpler way to install middleware (YAIM)
  • opens ways to maintain middleware remotely in
    user space
  • Firewall issues
  • often less-than-optimal coordination between grid admins and firewall maintainers
  • openPBS problems
  • scalability, robustness (switching to Torque helps)

21
Site (mis)configurations
  • Site misconfiguration was responsible for most of the problems that occurred during the experiments' Data Challenges. Here is a non-exhaustive list of problems (a sanity-check sketch follows this slide):
  • The variable VO_<VO>_SW_DIR points to a non-existent area on WNs
  • The ESM is not allowed to write to the area dedicated to software installation
  • Only one certificate allowed to be mapped to the ESM local account
  • Wrong information published in the information system (GLUE object classes not linked)
  • Queue time limits published in minutes instead
    of seconds and not normalized
  • /etc/ld.so.conf not properly configured. Shared
    libraries not found.
  • Machines not synchronized in time
  • Grid-mapfiles not properly built
  • Pool accounts not created but the rest of the
    tools configured with pool accounts
  • Firewall issues
  • CA files not properly installed
  • NFS problems for home directories or ESM areas
  • Services configured to use the wrong Information Index (BDII), or none at all
  • Wrong user profiles
  • Default user shell environment too big
  • Only partly related to middleware complexity

integrated all common small problems into
ONE BIG PROBLEM
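A sketch of a worker-node sanity check covering three of the items above (software area, /etc/ld.so.conf, time synchronization). The VO name, the required library path and the use of ntpdate are assumptions for illustration; the real certification tests were more thorough.

```python
# Sketch of a WN sanity check for a few of the misconfigurations listed
# above. VO name, paths and the ntpdate probe are illustrative.
import os
import subprocess

def check_sw_dir(vo: str) -> bool:
    """VO_<VO>_SW_DIR must point at an existing, writable area."""
    path = os.environ.get(f"VO_{vo.upper()}_SW_DIR", "")
    return os.path.isdir(path) and os.access(path, os.W_OK)

def check_ld_so_conf(required: str = "/opt/globus/lib") -> bool:
    """A needed shared-library path must appear in /etc/ld.so.conf."""
    try:
        with open("/etc/ld.so.conf") as f:
            return required in f.read()
    except OSError:
        return False

def check_clock() -> bool:
    """Machine must be time-synchronized (crude query-only ntpdate probe)."""
    try:
        out = subprocess.run(["ntpdate", "-q", "pool.ntp.org"],
                             capture_output=True, text=True, timeout=10)
        return out.returncode == 0
    except (OSError, subprocess.SubprocessError):
        return False

for name, ok in [("software dir", check_sw_dir("atlas")),
                 ("ld.so.conf", check_ld_so_conf()),
                 ("clock sync", check_clock())]:
    print(f"{name}: {'OK' if ok else 'FAIL'}")
```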
22
Operating Services for DCs
  • Multiple instances of core services for each of
    the experiments
  • separates problems, avoids interference between
    experiments
  • improves availability
  • allows experiments to maintain individual
    configuration
  • addresses scalability to some degree
  • Monitoring tools for services are currently not adequate
  • tools under development to implement a control system
  • moving tools to a common transport and storage format (R-GMA)
  • Access to storage via load balanced interfaces
  • CASTOR
  • dCache
  • Load-balancing service for the information system index service
  • load-balanced BDII deployed at CERN (a failover sketch follows this slide)

DC summary
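A sketch of the effect of replicating the index service: a client (or a load-balancing front end) can fail over to another BDII instance when one is unreachable. Host names are invented, and the client-side logic is an assumption; a real deployment would more likely hide the replicas behind a load-balanced alias.

```python
# Client-side failover across replicated BDII endpoints: take the first
# replica that accepts a TCP connection. Hosts are illustrative.
import socket

BDII_REPLICAS = [
    ("bdii1.example.org", 2170),
    ("bdii2.example.org", 2170),
]

def pick_bdii(replicas, timeout: float = 2.0):
    """Return the first reachable (host, port); raise if none answer."""
    for host, port in replicas:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue                 # dead replica: try the next one
    raise RuntimeError("no BDII replica reachable")
```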
23
Support during the DCs
  • User (Experiment) Support
  • GD at CERN worked very closely with the experiments' production managers
  • Informal exchange (e-mail, meetings, phone)
  • "No secrets" approach: GD people on the experiments' mailing lists and vice versa
  • ensured fast response
  • tracking of problems was tedious, but both sides have been patient
  • clear learning curve on BOTH sides
  • LCG GGUS (Grid User Support) at FZK became operational after the start of the DCs
  • due to the importance of the DCs, the experiments are switching slowly to the new service
  • Very good end user documentation by GD-EIS
  • Dedicated testbed for experiments with next LCG-2
    release
  • rapid feedback, influenced what made it into the
    next release
  • Installation and site operations support
  • GD prepared releases and supported sites
    (certification, re-certification)
  • Regional centres supported their local sites
    (some more, some less)
  • Community-style help via mailing list (high traffic!)
  • FAQ lists for troubleshooting and configuration issues (Taipei, RAL)

24
Support during the DCs
  • Operations Service
  • RAL (UK) is leading the sub-project on developing operations services
  • Initial prototype: http://www.grid-support.ac.uk/GOC/
  • Basic monitoring tools
  • Mail lists for problem resolution
  • Working on defining policies for operation,
    responsibilities (draft document)
  • Working on grid-wide accounting (APEL)
  • Monitoring
  • GridICE (a development of the DataTAG Nagios-based tools)
  • GridPP job-submission monitoring: every few hours, all RBs, all sites
  • Information system monitoring and consistency checks every 5 minutes: http://goc.grid.sinica.edu.tw/gstat/
  • CERN GD daily re-certification of sites
    (including history)
  • escalation procedure
  • tracing of site specific problems via problem
    tracking tool
  • tests core services and configuration

25
Screen Shots
26
Screen Shots
27
Some More Monitoring
28
Monitoring and Controls
  • Many monitoring tools and sources of information
    available
  • Hard to combine information to spot problems
    early
  • Split of monitoring into three parts
  • sensors
  • transport and storage
  • display
  • Transport and storage based on R-GMA monitoring
    bus
  • Already ported: GIIS monitor, re-certification, job submission, (GridICE, (L&B of the RB))
  • general display based on R-GMA
  • Building complex alarms via SQL queries (see the sketch below)
  • Controls
  • Taipei is building a message system that can be
    used for interaction with sites
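A sketch of the "complex alarms via SQL queries" idea: once monitoring results sit in a common store, an alarm is just a query. Here sqlite3 stands in for the R-GMA transport/storage bus, and the table layout is invented for illustration.

```python
# Alarm-by-query sketch: sqlite3 stands in for the R-GMA store; the
# site_tests table layout is invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE site_tests (site TEXT, test TEXT, ok INTEGER, ts TEXT)")
db.executemany("INSERT INTO site_tests VALUES (?,?,?,?)", [
    ("site-a", "job-submission", 0, "2004-11-18T09:00"),
    ("site-a", "job-submission", 0, "2004-11-18T10:00"),
    ("site-a", "job-submission", 0, "2004-11-18T11:00"),
    ("site-b", "job-submission", 1, "2004-11-18T11:00"),
])

# Alarm: a site failing the same test three or more times today.
alarms = db.execute("""
    SELECT site, test, COUNT(*) FROM site_tests
    WHERE ok = 0 AND ts LIKE '2004-11-18%'
    GROUP BY site, test
    HAVING COUNT(*) >= 3""").fetchall()

for site, test, n in alarms:
    print(f"ALARM: {site} failed {test} {n} times")
```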

29
Problem Handling: PLAN for LCG
[Diagram: Monitoring/Follow-up; Triage (VO / GRID); GOC; GGUS (Remedy); GD CERN; Escalation]
30
Problem Handling: Operation (most cases)
[Diagram: Community; VOs A, B, C; Rollout mailing list; GGUS triage; GOC; GD CERN; sites S-Site-1, S-Site-2, S-Site-3; Monitoring, Certification, Follow-up, FAQs]
31
LCG Workshop on Operational Issues
  • Motivation
  • LCG -> (LCG+EGEE) transition requires changes
  • Lessons learned need to be implemented
  • Many different activities need to be coordinated
  • 02 - 04 November at CERN
  • gt80 participants including from GRID3 and
    NorduGrid
  • Agenda Here
  • 1.5 days of plenary sessions
  • describe status and stimulate discussion
  • 1 day parallel/joint working groups
  • very concrete work
  • resulted in task lists with names attached to items
  • 0.5 days of reports of the WG

32
LCG Workshop on Operational Issues: WGs I
  • Operational Security
  • Incident Handling Process
  • Variance in site support availability
  • Reporting Channels
  • Service Challenges
  • Operational Support
  • Workflow for operations security actions
  • What tools are needed to implement the model
  • 24x7 global support
  • sharing operational load (taking turns)
  • Communication
  • Problem Tracking System
  • Defining Responsibilities
  • problem follow-up
  • deployment of new releases
  • Interface to User Support

33
LCG Workshop on Operational Issues: WGs II
  • Fabric Management
  • System installations
  • Batch/scheduling Systems
  • Fabric monitoring
  • Software installation
  • Representation of site status (load) in the
    Information System
  • Software Management
  • Operations on and for VOs (add/remove/service
    discovery)
  • Fault tolerance, operations on running services (stops, upgrades, restarts)
  • Link to developers
  • What level of intrusion can be tolerated on the
    WNs (farm nodes)
  • application (experiment) software installation
  • Removing/(re-adding) sites with (fixed) troubles
  • Multiple views in the information system
    (maintenance)

34
LCG Workshop on Operational Issues: WGs III
  • User Support
  • Defining what User Support means
  • Models for implementing a working user support
  • need for a Central User Support Coordination Team
    (CUSC)
  • mandate and tasks
  • distributed/central (CUSC/RUSC)
  • workflow
  • VO-support
  • continuous support on integrating the VOs
    software with the middleware
  • end user documentation
  • FAQs

35
LCG Workshop on Operational Issues: Summary
  • Very productive workshop
  • Partners (sites) assumed responsibility for tasks
  • Discussions very much focused on practical
    matters
  • Some problems call for architectural changes
  • gLite has to address these
  • It became clear that not all sites are created
    equal
  • Removing troubled sites is inherently problematic
  • removing storage can have grid wide impact
  • A key issue in all aspects is to define the split between
  • local, regional and central control and responsibility
  • All WGs discussed communication

36
New Operations Model
  • EGEE Structure
  • OMC: Operations Management Center
  • CICs: Core Infrastructure Centers
  • services like file catalogues, RBs, central
    infrastructure
  • operation support
  • CERN, France, Italy, UK, (Russia, Taipei)
  • ROCs: Regional Operation Centers
  • regional support
  • France, Italy, UK/Ireland, Germany/Switzerland, N-Europe, SW-Europe, Central Europe, Russia
  • RCs: Resource Centers
  • data and CPUs

37
New Operations Model
  • The Operations Center role rotates through the CICs
  • CIC on duty for one week (a rotation sketch follows this slide)
  • Procedures and tasks are currently being defined
  • first operations manual is available (living
    document)
  • tools, frequency of checks, escalation procedures, hand-over procedures
  • CIC on duty website
  • Problems are tracked with a tracking tool
  • currently centrally in Savannah
  • migration to GGUS (Remedy) with links to the ROCs' problem-tracking tools
  • problems can be added at GGUS or ROC level
  • CICs monitor service, spot and track problems
  • interact with sites on short-term problems (service restarts, etc.)
  • interact with ROCs on longer, non-trivial problems
  • all communication with a site is visible to the ROC
  • build FAQs
  • ROCs support
  • installation, first certification
  • resolving complex problems
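A minimal sketch of the weekly CIC-on-duty rotation; the CIC list follows the previous slide, and the start date is an arbitrary assumption.

```python
# Weekly CIC-on-duty rotation. The CIC list follows the slides; the
# rotation start date is an arbitrary illustrative anchor.
from datetime import date

CICS = ["CERN", "France", "Italy", "UK"]   # Russia and Taipei to join later
ROTATION_START = date(2004, 11, 1)

def cic_on_duty(day: date) -> str:
    week = (day - ROTATION_START).days // 7
    return CICS[week % len(CICS)]

print(cic_on_duty(date(2004, 11, 18)))     # -> Italy (third week of rotation)
```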

38
New Operations Model
[Diagram: OMC linked to the ROCs and to other grids; each ROC serves several RCs]
39
Summary
  • LCG-2 services have been supporting the data
    challenges
  • Many middleware problems have been found; many addressed
  • Middleware itself is reasonably stable
  • Biggest outstanding issues are related to
    providing and maintaining stable operations
  • Future middleware has to take this into account
  • Must be more manageable, and trivial to configure and install
  • Management and monitoring must be built into services from the start
  • Operational Workshop has started many activities
  • Follow-up and keeping up the momentum is now
    essential
  • Indicates a clear shift away from the
    CERNtralized operation
  • CIC on duty is a first step to distribute
    operational load