1
LCG Operation During the Data Challenges
Markus Schulz, IT-GD, CERNmarkus.schulz_at_cern.ch
Discussion on Operation Models
EGEE is a project funded by the European Union
under contract IST-2003-508833
2
Outline
  • Building LCG-2
  • Data Challenges (very brief)
  • Problems (not so brief)
  • Operating LCG
  • how it was planned
  • how it happened to be done
  • how it felt
  • What's next?
  • I will skip many slides to leave room for
    discussions

Comment / Shout in REALTIME!!!!!
3
History
  • December 2003 LCG-2
  • Full set of functionality for DCs, first MSS
    integration
  • Deployed in January to 8 core sites (fewer sites,
    less trouble)
  • DCs started in February -> testing in production
  • Large sites integrate resources into LCG (MSS and
    farms)
  • Introduced a pre-production service for the
    experiments
  • Alternative packaging (tool based and generic
    installation guides)
  • May 2004 -> now: monthly incremental releases
  • Not all releases are distributed to external
    sites
  • Improved services, functionality, stability and
    packaging step by step
  • Timely response to experiences from the data
    challenges

4
LCG-2 Status 22.10.2004
[Map of LCG-2 sites; annotations: "new interested sites should look here" (release page), Cyprus]
  • Total
  • 82 Sites
  • 9400 CPUs
  • 6.5 PByte

5
Integrating Sites
  • Sites contact GD Group or Regional Center
  • Sites go to the release page
  • Sites decide on manual or tool based installation
    (LCFGng)
  • documentation for both available
  • from the next release on, WN and UI available as a
    tar-ball based install
  • almost trivial install of WNs and UIs
  • Sites provide security and contact information
  • Sites install and use provided tests for
    debugging
  • support from regional centers or CERN
  • CERN GD certifies site and adds it to the
    monitoring and information system
  • sites are re-certified daily and problems traced
    in SAVANNAH (see the sketch below)
  • Large sites have integrated their local batch
    systems in LCG-2
  • Adding new sites is now quite smooth
  • problem is keeping large number of sites
    correctly configured

worked 80 times
failed 3-5 times
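As an illustration of the daily re-certification step mentioned above, here is a minimal sketch of such a pass in Python. The site list, test commands and the SAVANNAH hand-off are hypothetical placeholders, not the actual CERN GD certification tooling.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a daily site re-certification pass.
# Site list, test commands and the SAVANNAH hand-off are illustrative
# placeholders, not the actual CERN GD tools.
import datetime
import subprocess

SITES = ["ce.site1.example.org", "ce.site2.example.org"]  # assumed site CEs

# Each test is a (name, command) pair; the real certification used the
# provided site tests (job submission, replica management, ...).
TESTS = [
    ("job-submission", "run-test-job --ce {ce}"),
    ("info-system", "check-site-giis {ce}"),
]

def certify(ce):
    """Run all tests against one CE and return the list of failed tests."""
    failures = []
    for name, cmd in TESTS:
        try:
            result = subprocess.run(cmd.format(ce=ce).split(),
                                    capture_output=True, text=True)
            ok = result.returncode == 0
        except FileNotFoundError:      # test command not installed
            ok = False
        if not ok:
            failures.append(name)
    return failures

if __name__ == "__main__":
    today = datetime.date.today().isoformat()
    for ce in SITES:
        failed = certify(ce)
        status = "OK" if not failed else "FAILED: " + ", ".join(failed)
        # failing sites would then be traced as tickets (e.g. in SAVANNAH)
        print(today, ce, status)
```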
6
Data Challenges
  • Large scale production effort of the LHC
    experiments
  • test and validate the computing models
  • produce needed simulated data
  • test the experiments' production frameworks and
    software
  • test the provided grid middleware
  • test the services provided by LCG-2
  • All experiments used LCG-2 for part of their
    production

7
Data Challenges
  • Phase I
  • 7.7 million events fully simulated (Geant 4) in
    95,000 jobs
  • 22 TByte
  • Total CPU 972 MSI-2k hours
  • >40% produced on LCG-2 (used LCG-2, GRID3,
    NorduGrid)

8
Data Challenges
9
Data Challenges
[Plot annotations: 3-5 x 10^6/day, LCG restarted, LCG paused, LCG in action, 1.8 x 10^6/day, DIRAC alone]
10
Problems during the data challenges
  • All experiments encountered similar problems on
    LCG-2
  • LCG sites suffering from configuration and
    operational problems
  • inadequate resources at some sites (hardware,
    human ...)
  • this is now the main source of failures
  • Load balancing between different sites is
    problematic
  • jobs can be attracted to sites that have
    inadequate resources
  • modern batch systems are too complex and dynamic
    to summarize their behavior in a few values in
    the IS
  • Identification and location of problems in LCG-2
    is difficult
  • distributed environment, access to many logfiles
    needed ...
  • status of monitoring tools
  • Handling thousands of jobs is time consuming and
    tedious
  • Support for bulk operation is not adequate
  • Performance and scalability of services
  • storage (access and number of files)
  • job submission
  • information system
  • file catalogues
  • Services suffered from hardware problems (no fail
    over services)

DC summary
11
Outstanding Middleware Issues
  • Collection: Outstanding Middleware Issues
  • Important: 1st systematic confrontation of the
    required functionality with the capabilities of
    the existing middleware
  • Some can be patched or worked around;
  • Those related to fundamental problems with
    underlying models and architectures have to be
    input as essential requirements to future
    developments (EGEE)
  • Middleware is now not perfect but quite stable
  • Much has been improved during DCs
  • A lot of effort still going into improvements and
    fixes
  • Big hole is missing space management on SEs
  • especially for Tier 2 sites

12
Operational issues (selection)
  • Slow response from sites
  • Upgrades, response to problems, etc.
  • Problems reported daily; some problems last for
    weeks
  • Lack of staff available to fix problems
  • Vacation period, other high priority tasks
  • Various mis-configurations (see next slide)
  • Lack of configuration management: problems that
    are fixed reappear
  • Lack of fabric management (mostly smaller sites)
  • scratch space, single nodes drain queues,
    incomplete upgrades, ...
  • Lack of understanding
  • Admins reformat disks of SE
  • Provided documentation often not read (carefully)
  • new activity started to develop hierarchical
    adaptive documentation
  • simpler way to install middleware on farm nodes
    (even remotely in user space)
  • Firewall issues
  • often less than optimal coordination between grid
    admins and firewall maintainers
  • PBS problems
  • Scalability, robustness (switching to torque
    helps)

13
Site (mis-)configurations
  • Site mis-configuration was responsible for most
    of the problems that occurred during the
    experiments' Data Challenges. Here is an
    incomplete list of problems (a sanity-check
    sketch follows the list)
  • The variable VO_<VO>_SW_DIR points to a
    non-existent area on the WNs.
  • The ESM is not allowed to write in the area
    dedicated to the software installation
  • Only one certificate allowed to be mapped to
    the ESM local account
  • Wrong information published in the information
    system (Glue Object Classes not linked)
  • Queue time limits published in minutes instead
    of seconds and not normalized
  • /etc/ld.so.conf not properly configured. Shared
    libraries not found.
  • Machines not synchronized in time
  • Grid-mapfiles not properly built
  • Pool accounts not created but the rest of the
    tools configured with pool accounts
  • Firewall issues
  • CA files not properly installed
  • NFS problems for home directories or ESM areas
  • Services configured to use the wrong/no
    Information Index (BDII)
  • Wrong user profiles
  • Default user shell environment too big
  • Only partly related to middleware complexity

integrated all common small problems into 1 BIG
PROBLEM
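Many of the items above are mechanically checkable on a worker node. Below is an illustrative sanity-check sketch in Python; the paths, the VO name and the library path it looks for are assumptions, not the official LCG-2 site tests.

```python
#!/usr/bin/env python3
# Illustrative worker-node sanity check covering a few of the
# mis-configurations listed above. Paths and values are assumptions,
# not the official LCG-2 site tests.
import os

def check_vo_sw_dir(vo):
    """VO_<VO>_SW_DIR must point to an existing area the ESM can write to."""
    path = os.environ.get("VO_%s_SW_DIR" % vo.upper(), "")
    return bool(path) and os.path.isdir(path) and os.access(path, os.W_OK)

def check_ld_so_conf(required_path="/opt/globus/lib"):
    """Shared library dirs (here an assumed Globus path) must be configured."""
    try:
        with open("/etc/ld.so.conf") as conf:
            return required_path in conf.read()
    except OSError:
        return False

def check_ca_files(ca_dir="/etc/grid-security/certificates"):
    """CA certificates and signing policies must be installed."""
    return os.path.isdir(ca_dir) and any(
        name.endswith(".signing_policy") for name in os.listdir(ca_dir))

if __name__ == "__main__":
    print("VO sw dir  :", check_vo_sw_dir("atlas"))
    print("ld.so.conf :", check_ld_so_conf())
    print("CA files   :", check_ca_files())
```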
14
Running Services
  • Multiple instances of core services for each of
    the experiments
  • separates problems, avoids interference between
    experiments
  • improves availability
  • allows experiments to maintain individual
    configuration (information system)
  • addresses scalability to some degree
  • Monitoring tools for services currently not
    adequate
  • tools under development to implement control
    system
  • Access to storage via load balanced interfaces
  • CASTOR
  • dCache
  • Services that carry state are problematic to
    restart on new nodes
  • needed after hardware problems, or security
    problems
  • State transition between partial usage and full
    usage of resources
  • required change in queue configuration (fair
    share, individual queues per VO)
  • the next release will come with a description of
    fair-share configuration (for smaller sites)

DC summary
15
Support during the DCs
  • User (Experiment) Support
  • GD at CERN worked very closely with the
    experiments' production managers
  • Informal exchange (e-mail, meetings, phone)
  • "No secrets" approach: GD people on the
    experiments' mailing lists and vice versa
  • ensured fast response
  • tracking of problems tedious, but both sides have
    been patient
  • clear learning curve on BOTH sides
  • LCG GGUS (grid user support) at FZK became
    operational after start of the DCs
  • due to the importance of the DCs the experiments
    switched only slowly to the new service
  • Very good end user documentation by GD-EIS
  • Dedicated testbed for experiments with next LCG-2
    release
  • rapid feedback, influenced what made it into the
    next release
  • Installation (Site) Support
  • GD prepared releases and supported sites
    (certification, re-certification)
  • Regional centres supported their local sites
    (some more, some less)
  • Community style help via mailing list (high
    traffic!!)
  • FAQ lists for troubleshooting and configuration
    issues (Taipei, RAL)

16
Support during the DCs
  • Operations Service
  • RAL (UK) is leading sub-project on developing
    operations services
  • Initial prototype: http://www.grid-support.ac.uk/GOC/
  • Basic monitoring tools
  • Mail lists for problem resolution
  • Working on defining policies for operation,
    responsibilities (draft document)
  • Working on grid wide accounting
  • Monitoring
  • GridICE (development of DataTag Nagios-based
    tools)
  • GridPP job submission monitoring
  • Information system monitoring and consistency
    check: http://goc.grid.sinica.edu.tw/gstat/
    (sketch below)
  • CERN GD daily re-certification of sites
    (including history)
  • escalation procedure under development
  • tracing of site specific problems via problem
    tracking tool
  • tests core services and configuration
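A rough sketch of the kind of consistency check gstat performs: query a site's information system over LDAP and flag implausible published values. The host, port, base DN and the sentinel value follow common Glue-schema conventions but are assumptions here, not the actual gstat code.

```python
#!/usr/bin/env python3
# Rough sketch of an information-system consistency check in the spirit
# of gstat: query a site BDII via ldapsearch and flag implausible values.
# Host, port, base DN and the sentinel value are assumptions, not the
# actual gstat implementation.
import subprocess

def query_bdii(host, port=2170, base="o=grid"):
    """Return LDIF with the waiting-job counts published by one site."""
    cmd = ["ldapsearch", "-x", "-LLL", "-H", "ldap://%s:%d" % (host, port),
           "-b", base, "(GlueCEStateWaitingJobs=*)", "GlueCEStateWaitingJobs"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def waiting_jobs(ldif):
    """Extract the published waiting-job counts from the LDIF output."""
    return [int(line.split(":", 1)[1]) for line in ldif.splitlines()
            if line.startswith("GlueCEStateWaitingJobs:")]

if __name__ == "__main__":
    out = query_bdii("bdii.example.org")  # assumed site information index
    for value in waiting_jobs(out):
        # very large values are often published when an info provider breaks
        if value >= 444444:
            print("suspicious waiting-jobs value published:", value)
```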

17
Screen Shots
18
Screen Shots
19
Problem Handling: PLAN
[Flow diagram, elements: Monitoring / Follow-up, Triage VO / GRID, GOC, GGUS (Remedy), GD CERN, Escalation]
20
Problem Handling: Operation (most cases)
[Flow diagram, elements: Community, VO A / VO B / VO C, Rollout Mailing List, GGUS, Triage, GOC, GD CERN, S-Site-1 / S-Site-2 / S-Site-3, Monitoring / Certification / Follow-Up / FAQs]
21
Problem Tracking
  • GGUS: REMEDY
  • Middleware problems: SAVANNAH LCG-OPERATION
  • Re-certification: SAVANNAH LCG-SITES
  • Many (MOST) problems only tracked by e-mail
  • Much confusion on where to put problems
  • Training needed to get reasonable 1st level user
    support
  • canned answers
  • experts need to focus on more complex tasks
  • Unification of FAQs (RAL, Taipei, Italy, ...)

22
EGEE Impact on Operations
  • The available effort for operations from EGEE is
    now ramping up
  • LCG GOC (RAL) -> EGEE CICs and ROCs, Taipei
  • Hierarchical support structure
  • Regional Operations Centres (ROC)
  • One per region (9)
  • Front-line support for deployment, installation,
    users
  • Core Infrastructure Centres (CIC)
  • Four (+ Russia next year)
  • Evolve from GOC: monitoring, troubleshooting,
    operational control
  • 24x7 in an 8x5 world????
  • Also providing VO-specific and general services
  • EGEE NA3 organizes training for users and site
    admins
  • NOW at HEPiX
  • Address common issues, experiences
  • Operations and Fabric Workshop
  • CERN 1-3 Nov

23
PART II
  • Operation models
  • How much can be delegated to whom?
  • autonomy / availability
  • What are the consequences?
  • cost for 24/7 with 8x5 staff
  • One/multiple models for all sites/regions?
  • One model for site integration, update, user
    support, security, operation?
  • latency, efficiency, distribution of workload ...
  • One size fits all?
  • Next slides are meant to stimulate discussion,
    not give answers

24
CICs and ROCs and Operations
  • Core Infrastructure Centers (CICs)
  • run services like RBs, Information Indices,
    VO/VOMS, Catalogues
  • are the distributed Grid Operation Center (GOC)
  • and more ...
  • Regional Operation Centers (ROCs)
  • coordinate activities in their region
  • give support to regional RCs
  • coordinate setup/upgrades
  • and more ...
  • Resource Centers (RC)
  • computing and storage
  • Operation Management Center (OMC)
  • coordination

25
Model I: Strict Hierarchy
  • A CIC locates a problem with an RC or CIC in a
    region
  • triggered by monitoring / user alert
  • CIC enters the problem into the problem tracking
    tool and assigns it to a ROC
  • ROC receives a notification and works on solving
    the problem
  • the region decides locally what the ROC can do on
    the RCs.
  • This can include restarting services etc.
  • The main emphasis is that the region decides on
    the depth of the interaction.
  • -> different regions, different procedures
  • CICs NEVER contact a site
  • -> ROCs need to be staffed all the time
  • The ROC "does it": it is fully responsible for ALL
    the sites in the region

26
Model I: Strict Hierarchy
  • Pros
  • Best model to transfer knowledge to the ROCs
  • all information flows through them
  • Different regions can have their own policies
  • this can reflect the different administrative
    relations of sites in a region
  • Clear responsibility
  • until it is discovered to be the CIC's fault, it
    is always the ROC's fault
  • Cons
  • High latency
  • even for trivial operations we have to pass
    through the ROCs
  • ROCs have to be staffed (reachable) all the time.
  • Regions will develop their own tools
  • parallel strands, lower quality
  • Excluded for handling security

27
Model II: Direct Communication, Local Control
  • ROCs are active in
  • the follow-up of problems that take longer to
    handle
  • setup of sites
  • CICs are active in
  • handling problems that can be solved by simple
    interactions
  • communicated directly between CICs and RCs
  • ROCs are informed on all interactions between
    CICs and RCs
  • all problems are entered into the problem
    tracking tool.
  • restarting of services, etc. are handled by the
    RCs

28
Model II: Direct Communication, Local Control
  • Pros
  • Resources are not lost for trivial reasons
  • Principle of local control is maintained
  • ROCs are in the loop,
  • but weak ROCs can't create too severe delays
  • No complex tools for communication management
    needed
  • mail + IRC sufficient
  • Cons
  • RCs need to be reachable at all times
  • not realistic, and very expensive
  • CICs have to be aware of the level of maturity of
    O(100) RCs
  • ROCs have to monitor what is going on to learn
    the trade
  • Language problems between the CICs and sysadmins
  • Unclear responsibility
  • "This was reported" / "Why didn't the CICs fix it
    them self"

29
Model III: Direct Communication, Direct Control
  • Like Model II with some modifications
  • CICs have access to the services on the RCs
  • can, if the RC is not staffed, manage some of the
    services
  • the site publishes at any time
  • whether local support is reachable or not
  • which actions the CICs are permitted to perform
  • all interactions are logged and reported to RC
    and ROC
  • Some tools that allow very controlled (limited)
    access like this are under development
    (GSI-enabled remote sudo); see the sketch below
  • Variation with ROC-only interaction (IIIa)
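To make the idea concrete, here is a purely hypothetical sketch of what such a site-published permission descriptor and the corresponding check could look like. The descriptor format, action names and site/operator names are all assumptions, not the GSI-enabled remote sudo tools mentioned above.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the Model III idea: a site publishes whether local
# support is reachable and which remote actions the CICs may currently
# perform; every requested action is checked against this and logged.
# The descriptor format and action names are illustrative assumptions.
import json
import logging

logging.basicConfig(level=logging.INFO)

# What a site might publish (e.g. as a small JSON document).
SITE_POLICY = json.loads("""
{
  "site": "rc-example",
  "local_support_reachable": false,
  "permitted_actions": ["restart-gatekeeper", "restart-rgma"]
}
""")

def remote_action(policy, action, operator):
    """Allow a remote action only if the site currently permits it."""
    allowed = (not policy["local_support_reachable"]
               and action in policy["permitted_actions"])
    # every interaction is logged and reported back to the RC and ROC
    logging.info("site=%s action=%s operator=%s allowed=%s",
                 policy["site"], action, operator, allowed)
    return allowed

if __name__ == "__main__":
    remote_action(SITE_POLICY, "restart-gatekeeper", "cic-cern")   # allowed
    remote_action(SITE_POLICY, "reformat-se-disks", "cic-cern")    # denied
```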

30
Model III: Direct Communication, Direct Control
  • Pros
  • Resources are not lost for trivial reasons
  • ROCs are in the loop,
  • but weak ROCs can't create too severe delays
  • One set of tools for remote operation
  • some uniformity -> chance for better quality
  • Site decides at any time on the balance between
    local and remote operation
  • RCs can be run unattended for a (short) time
  • Cons
  • A set of tools for secure, limited remote
    operation respecting the sites' policies has to be
    put in place
  • ROCs have to monitor what is going on to learn the
    trade
  • Unclear responsibility
  • "This was reported" / "Why didn't the CICs fix it
    them self"

31
Sample UseCases
  • User reports jobs failing on one site
  • User reports jobs failing on some/all sites
  • Monitoring shows site dropping in and out of the
    IS
  • An acute security incident
  • Upgrading to a new version
  • Post mortem after the security incidents
  • ...
  • Good preparation for the Operations Workshop

32
Summary
  • LCG-2 services have been supporting the data
    challenges
  • Many middleware problems have been found; many
    addressed
  • Middleware itself is reasonably stable
  • Biggest outstanding issues are related to
    providing and maintaining stable operations
  • Future middleware has to take this into account
  • Must be more manageable, trivial to configure and
    install
  • Management and monitoring must be built into
    services from the start
  • Outcome of the workshop in November is crucial
    for EGEE operation