1
Grid Support and Operations
  • John Gordon
  • CCLRC
  • GridPP9 - Edinburgh

2
What is support?
  • Not well defined
  • ...or rather defined differently in many places
  • End users, sysadmins, deployers and developers
    all need support
  • Some examples

3
Grid Support Centre
  • 14 named staff at Rutherford, Daresbury,
    Manchester and Edinburgh.
  • Operates the UK e-Science Certification
    Authority.
  • http://ca.grid-support.ac.uk
  • Provides a helpdesk for first point of call
    queries.
  • Website for advertising services provided.
  • http://www.grid-support.ac.uk
  • Provides technical training and evaluations of
    middleware.
  • Supports the Level-2 Grid project.
  • National Information Server for Core programme.
  • Publishing of site monitoring information in
    XML.
  • Core support for the OGSA-DAI project.

4
European Grid Support Centre
  • Collaboration between CCLRC, CERN and KTH Sweden
    each providing 1 FTE
  • Point of trusted reliability between major
    projects and middleware producers.
  • Communicates directly with staff from the Globus
    Alliance to ensure that European issues are
    addressed, having assisted with the release.
  • Website up and running though currently a
    skeleton of the final content.
  • Attended the EDG meeting in Barcelona to publicise
    the centre, and GGF-8 to guide User Services R.G. work.

5
Global Grid User Support (GGUS): The Model
  • Started 1st of October at GridKa,
    Forschungszentrum Karlsruhe (Germany)
  • Already supports 41 user groups of GridKa
  • Website: http://www.ggus.org
  • E-Mail: support@ggus.org

6
Information flow
  • Diagram labels: Grid User, Service Request, GGUS,
    GOC, ESUS, Local operations; links marked
    Interaction and Data flow
  • First line of support: experiment-specific problems
    will be solved by ESUS (with Savannah) or sent to
    GGUS using an agreed interface
  • Grid-related problems will be solved by GGUS or
    sent to GOC using the GGUS system
  • GGUS: Global Grid User Support; ESUS: Experiment
    Specific User Support; GOC: Grid Operations Centre
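
The escalation path described on this slide can be sketched as a small routing function. This is only an illustration of the flow: the `Ticket` fields and the `solved_by_*` flags are hypothetical names, not part of GGUS, ESUS or Savannah.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Desk(Enum):
    ESUS = "Experiment Specific User Support"
    GGUS = "Global Grid User Support"
    GOC = "Grid Operations Centre"


@dataclass
class Ticket:
    summary: str
    experiment_specific: bool      # hypothetical flag set when the request is filed
    solved_by_esus: bool = False   # hypothetical: ESUS could resolve it
    solved_by_ggus: bool = False   # hypothetical: GGUS could resolve it


def route(ticket: Ticket) -> List[Desk]:
    """Return the desks a ticket passes through, in order."""
    path = []
    if ticket.experiment_specific:
        # First line of support: ESUS solves it or hands it to GGUS.
        path.append(Desk.ESUS)
        if ticket.solved_by_esus:
            return path
    # Grid-related problems: GGUS solves them or sends them on to the GOC.
    path.append(Desk.GGUS)
    if ticket.solved_by_ggus:
        return path
    path.append(Desk.GOC)
    return path
```

A ticket that ESUS cannot solve and GGUS cannot solve ends up with the path ESUS, GGUS, GOC, which is the longest chain the slide describes.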
7
GridPP TB-Support
  • Support Team
  • Built from sysadmins: 4 funded by GridPP to work
    on EDG WP6; the rest are the usual site
    sysadmins.
  • Methods
  • Email list, phone meetings, personal visits, job
    submission monitoring
  • RB, VO, RC for UK use to support non-EDG use
  • Planned to verify EDG releases but they have been
    too infrequent to test procedures
  • Rollout
  • Experience from RAL in EDG dev testbeds and IC
    and Bristol in CMS testbeds
  • >10 sites have been part of EDG app testbed at
    one time
  • 3 in LCG1

8
Savannah
9
EGEE Operations
  • Resource Centres: all sites
  • Regional Operations Centres (ROC)
  • At least one per region!
  • RAL in UK/Ireland
  • Core Infrastructure Centres (CIC)
  • CERN, RAL, CNAF, CC-IN2P3

10
Others
  • Tier1 Support
  • Role to support UK Tier2s in LCG
  • Deployment role in GridPP2
  • Tier2 Specialist Posts
  • Support for various middleware areas
  • Middleware Developers

11
Where do you go for support?
  • Users go to experiment support
  • Experiment support diagnoses and forwards as
    necessary to Grid user support or middleware or
    operations or applications
  • Resource Centres look to their Regional
    Operations Centre (Tier2s to their Tier1)
  • ROCs will also push problems to their RCs.
  • But we know that users will go to their local
    sysadmin or direct to their Tier2 or Tier1 too.
  • And some sysadmins will go to their favourite
    experiment expert
  • And Tier1s will go direct to middleware experts.
  • In short, chaos.
  • Strategy for now is to have a UK Plan that is
    self-contained and can deliver support in the UK
    when and where required.
  • Interface this to the various outside bodies
  • Don't duplicate for the sake of it, but be ready
    to.
  • Or be prepared to roll our work into wider
    provision when it is proven.

12
Grid Operations Centre
13
What is Operations?
  • RAL leading development of LCG GOC
  • The Vision
  • GOC Processes and Activities
  • Coordinating Grid Operations
  • Defining Service Level Parameters
  • Monitoring Service Performance Levels
  • First-Level Fault Analysis
  • Interacting with Local Support Groups
  • Coordinating Security Activities
  • Operations Development
  • Recent developments -

14
GOC - Monitoring
  • Who is Involved?
  • 3.0 FTE (Trevor Daniels, Dave Kant,
    Matt Thorpe, Jason Leake)
  • What are we Doing?
  • Monitor Grid Services, Manage Site
    Information, Accounting
  • Developed Tools to Configure/Integrate Monitoring
    to make the job easier
  • GPPMon
  • Nagios
  • Mapcentre
  • Example: Mapcentre, 30 sites, 500 lines in config
    file
  • Example: Nagios, 30 sites, 12 individual
    config files with dependencies

Both are tedious to configure and not practical by
hand with large numbers of nodes.
15
GOC - Database
  • Develop/maintain a database to hold site
    information
  • Site Information (contact lists, resources, site
    information, URLs)
  • Secure access through GridSite (X509
    certificates) via PHP web interface
  • RC managers should maintain their own pages as
    part of the site certification process.
  • Monitoring scripts read information in database
    and run a set of customised tools to monitor the
    infrastructure.
  • To be included in the monitoring a site must
    register its resources (CE, SE, RB, RC, RLS, MDS,
    RGMA, BDII, ..)
  • BDII can be queried to check GOC database is
    up-to-date.
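
The last bullet amounts to a set comparison between what a site has registered and what the BDII publishes. A minimal sketch follows, assuming both hostname lists have already been fetched; the function name and the result keys are hypothetical, and no real BDII or database query is shown.

```python
def compare_registrations(goc_db_hosts, bdii_hosts):
    """Report hosts that appear in only one of the two sources."""
    # Hostnames are compared case-insensitively.
    goc = {h.strip().lower() for h in goc_db_hosts}
    bdii = {h.strip().lower() for h in bdii_hosts}
    return {
        "missing_from_goc_db": sorted(bdii - goc),  # published but not registered
        "missing_from_bdii": sorted(goc - bdii),    # registered but not published
    }
```

An empty list under both keys would suggest the GOC database entry for the site is up to date with what the BDII reports.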

16
GOC Monitoring Today
  • Remote UI queries the database to build a list of
    resources
  • Submits monitoring jobs to those resources
  • Publishes results on the WWW
  • Diagram labels: GOC DB (GridSite, MySQL); EDG UI and
    EDG resources; LCG-1 UI and LCG-1 resources; LCG-2 UI
    and LCG-2 resources
17
New GPPMon Features
  • Download host certificates daily and monitor
    lifetimes for CEs and SEs for LCG and EDG
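
A lifetime check of this kind can be sketched with the Python standard library. The slides do not say how GPPMon does it; the sketch below assumes the certificate's expiry is already available as a `notAfter` string in the format Python's `ssl` module uses, and the 30-day warning threshold is an arbitrary, hypothetical choice.

```python
import ssl
from datetime import datetime, timezone


def days_remaining(not_after: str, now: datetime) -> float:
    """Days until expiry; `now` must be a timezone-aware datetime."""
    expiry_seconds = ssl.cert_time_to_seconds(not_after)
    return (expiry_seconds - now.timestamp()) / 86400.0


def classify(not_after: str, now: datetime, warn_days: float = 30.0) -> str:
    """Label a certificate as EXPIRED, WARNING (expiring soon) or OK."""
    days = days_remaining(not_after, now)
    if days < 0:
        return "EXPIRED"
    if days < warn_days:
        return "WARNING"
    return "OK"


# Example call with a made-up expiry date, evaluated at a fixed moment.
status = classify("Feb 10 00:00:00 2004 GMT",
                  datetime(2004, 2, 3, tzinfo=timezone.utc))
```

Passing `now` explicitly keeps the check reproducible, which is convenient for a job that runs once a day and publishes its results.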

18
New GPPMon Features
  • Reliability of service provided using RRDTool to
    show Globus and RB stats

19
New GPPMon Features
  • Moving toward LCG-1, LCG-2 and EDG monitoring

Tuesday 3/2/04, 14:10: only RAL and FZK have
updated their LCG-2 information in the GOC
database.
gridkap01.fzk.de
20
Nagios
  • Customised plugins for monitoring
  • Focus: service behaviour and data consistency
  • Do RBs find resources?
  • Do site GIISs publish the correct hostname?
  • Is the site running the latest stable software
    release?
  • Does the Gatekeeper authenticate?
  • Are the host certificates valid?
  • Are essential services running?
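
One of these questions, whether a site GIIS publishes the correct hostname, can be sketched as the core of a Nagios-style plugin. Nagios plugins report status through the exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) plus a one-line message. The function name and its arguments are hypothetical, and the lookup of the published hostname is assumed to happen elsewhere; this is not the GOC's actual plugin.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_published_hostname(expected, published):
    """Return (exit_code, message) for a hostname consistency check."""
    if not published:
        return UNKNOWN, "UNKNOWN - no hostname published"
    if published.strip().lower() == expected.strip().lower():
        return OK, "OK - GIIS publishes " + published
    return CRITICAL, "CRITICAL - GIIS publishes %s, expected %s" % (published, expected)
```

A wrapper script would print the message and call `sys.exit` with the code, which is how Nagios picks up the result.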
21
Nagios Screen Shots LCG-1
22
Nagios Screen Shots LCG-1
Service Summary for Gatekeeper Nodes
23
Nagios Screen Shots LCG-1
Host and Service Summary tables for BDII nodes
24
GOC Configuration
  • Example: manage a Grid-wide database
  • Provides access to site information via
    trusted certificate
  • Scripts to automatically configure Nagios
    from the GOC database
  • Provides plugins to monitor services for
    Nagios
  • Creates configuration file for Mapcentre
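
Generating Nagios configuration from database records can be sketched as follows. The input is assumed to be (hostname, IP address, group) tuples, matching the node information the database is said to hold. The function names and the `generic-host` template name are assumptions, and the GOC's real scripts are not shown in the slides.

```python
def nagios_host(hostname, address, template="generic-host"):
    """Render one Nagios host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {hostname}\n"
        f"    alias      {hostname}\n"
        f"    address    {address}\n"
        "}\n"
    )


def nagios_config(nodes):
    """Render host and hostgroup definitions from (hostname, ip, group) tuples."""
    nodes = list(nodes)
    blocks = [nagios_host(hostname, ip) for hostname, ip, _ in nodes]
    groups = {}
    for hostname, _, group in nodes:
        groups.setdefault(group, []).append(hostname)
    for group in sorted(groups):
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name  {group}\n"
            f"    alias           {group}\n"
            f"    members         {','.join(groups[group])}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

Because the output is derived entirely from the records, adding a site to the database and re-running the script is enough to bring it into monitoring, which is the point of avoiding hand-edited files.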

25
GOC
  • Secure database management via HTTPS / X.509
  • GOC: GridSite, MySQL
  • Monitoring
  • Resource Centre resources, site information: EDG,
    LCG-1, LCG-2, ...
  • RC nodes shown: bdii, ce, se, rb
26
GOC Server
http://goc.grid-support.ac.uk
27
What's in the Database?
People: who do we notify when there are problems?
28
What's in the Database?
Node information (hostname, IP address, group)
29
What's in the Database?
Scheduled downtimes: advance warning of site
maintenance resulting in reduced service
availability
30
LCG Accounting Overview
  • PBS log processed daily on site CE to extract
    required data; filter acts as R-GMA DBProducer ->
    PbsRecords table
  • Gatekeeper log processed daily on site CE to
    extract required data; filter acts as R-GMA
    DBProducer -> GkRecords table
  • Site GIIS interrogated daily on site CE to obtain
    SpecInt and SpecFloat values for CE; acts as
    DBProducer -> SpecRecords table, one dated record
    per day
  • These three tables are joined daily on MON to
    produce the LcgRecords table. As each record is
    produced, the program acts as StreamProducer to send
    the entries to the LcgRecords table on the GOC site.
  • Site now has a table containing its own accounting
    data; GOC has an aggregated table over the whole of
    LCG.
  • Interactive and regular reports produced by site
    or at GOC site as required.
  • Note: this is an improved design over that
    presented at the Jan GDB. The SOAP transport has
    been replaced by R-GMA.
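
The daily join can be illustrated with an in-memory SQLite database standing in for R-GMA. Only the four table names come from the slides. Every column name, the join keys and the sample rows are hypothetical, chosen just to make the sketch run.

```python
import sqlite3


def build_lcg_records(conn):
    """Join the three per-site tables into LcgRecords (hypothetical schema)."""
    conn.executescript("""
        DROP TABLE IF EXISTS LcgRecords;
        CREATE TABLE LcgRecords AS
        SELECT p.local_job_id, g.user_dn, p.group_id AS vo,
               p.run_date, p.cpu_seconds, s.spec_int, s.spec_float
        FROM PbsRecords AS p
        JOIN GkRecords AS g ON g.local_job_id = p.local_job_id
        JOIN SpecRecords AS s ON s.run_date = p.run_date;
    """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PbsRecords (local_job_id TEXT, run_date TEXT,
                             group_id TEXT, cpu_seconds INTEGER);
    CREATE TABLE GkRecords (local_job_id TEXT, user_dn TEXT);
    CREATE TABLE SpecRecords (run_date TEXT, spec_int INTEGER, spec_float INTEGER);
    -- Made-up sample rows, for illustration only.
    INSERT INTO PbsRecords VALUES ('101', '2004-02-03', 'vo_a', 3600),
                                  ('102', '2004-02-03', 'vo_b', 120);
    INSERT INTO GkRecords VALUES ('101', '/CN=example user one'),
                                 ('102', '/CN=example user two');
    INSERT INTO SpecRecords VALUES ('2004-02-03', 400, 380);
""")
build_lcg_records(conn)
rows = conn.execute(
    "SELECT local_job_id, vo, cpu_seconds, spec_int "
    "FROM LcgRecords ORDER BY local_job_id"
).fetchall()
```

Here the batch record and the gatekeeper record are matched on a local job identifier, and the one-per-day SpecRecords row is attached by date, so every accounting record carries the CPU power values for the day it ran.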

31
(No Transcript)
32
Progress
  • Status on 3 Feb 2004
  • The code which will run on the CE to parse and
    process the PBS and Gatekeeper logs is written.
    The PbsRecords and GkRecords tables are created
    and are being populated.
  • The code to join these two tables and publish the
    new joined table (LcgRecords) is also written and
    working.
  • Work is in progress to write the archiver at the
    GOC to receive the aggregated LcgRecords table:
    2 days' work.
  • To do:
  • Write the code to interrogate the site GIIS to
    extract the CPU power values and populate these
    fields in the tables: 2 days' work
  • Integration testing and debugging: 5 days
  • Packaging for deployment: 3 days
  • Write the report generators: 30 days (estimate;
    not yet designed)

33
Accounting Issues
  1. There is no R-GMA infrastructure LCG-wide, so
    most sites are not able to install and run the
    accounting suite at present. It is expected that
    R-GMA and the MON boxes will be rolled out in
    LCG2 soon after the storage problems are
    resolved. Until this happens the complete batch
    and gatekeeper logs will have to be copied to the
    GOC site for processing.
  2. The VO associated with a user's DN is not
    available in the batch or gatekeeper logs. It
    will be assumed that the group ID used to execute
    user jobs, which is available, is the same as the
    VO name. This needs to be acknowledged as an LCG
    requirement.
  3. The global jobID assigned by the Resource Broker
    is not available in the batch or gatekeeper logs.
    This global jobID cannot therefore appear in the
    accounting reports. The RB Events Database
    contains this, but that is not accessible nor is
    it designed to be easily processed.
  4. At present the logs provide no means of
    distinguishing sub-clusters of a CE which have
    nodes of differing processing power. Changes to
    the information logged by the batch system will
    be required before such heterogeneous sites can
    be accounted properly. At present it is believed
    all sites are homogeneous.

34
Future Direction: Towards EGEE
  • Distribute tools to help the ROCs monitor their
    RCs (database, monitoring packages)
  • Distribute tools to help CICs monitor core services
  • Grid-wide monitoring: ideas on how this would work
  • CIC monitoring tools query ROC databases
  • Select core services
  • Run a standard set of checks on those services
  • Display information / notifications
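
The grid-wide monitoring idea (query the ROC databases, select core services, run a standard set of checks, notify) can be sketched as a three-step pipeline. The record layout, the set of service types treated as "core" and all function names are hypothetical. The slides do not define them.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical choice of "core" service types; the slides do not list them.
CORE_SERVICE_TYPES = {"RB", "BDII", "RLS"}

Record = Dict[str, str]


def select_core_services(roc_records: List[Record]) -> List[Record]:
    """Keep only the records whose service type counts as core."""
    return [r for r in roc_records if r["type"] in CORE_SERVICE_TYPES]


def run_checks(services: List[Record],
               checks: Dict[str, Callable[[Record], bool]]) -> List[Tuple[str, str, bool]]:
    """Run every named check against every service."""
    results = []
    for service in services:
        for name, check in checks.items():
            results.append((service["host"], name, bool(check(service))))
    return results


def notifications(results: List[Tuple[str, str, bool]]) -> List[str]:
    """Turn failed checks into one-line notification messages."""
    return [f"{host}: {name} FAILED" for host, name, ok in results if not ok]
```

Keeping the checks as a name-to-function mapping means the same standard set can be applied to records from any ROC database that uses the same layout.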
35
UK Deployment, Support and Operations
36
Deployment Team
  • Proposal for a UK-wide team to provide and run a
    UK-wide Grid: the GridPP view. There are
    alternative views for other stakeholders.
  • Organisation chart, stakeholders: EGEE, GridPP,
    JISC, Core UK
  • Organisation chart, management: Production Manager,
    Core Grid Coordinator
  • Organisation chart, teams: Deployment Team, Grid
    Support Centre, Middleware Specialist Support, Grid
    Operations Centre
  • Organisation chart, staffing labels: 5 FTE, 1 FTE,
    6 FTE, 8 FTE
  • Organisation chart, other labels: Security Officer;
    Helpdesk; 2 Tier1 Deployment; Manager; RAL; Data and
    Storage Management; 4 Tier 2 UK Coordinators;
    Operations (2); Glasgow, Bristol, Edinburgh;
    LondonGrid, NorthGrid; Network Monitoring; ScotGrid,
    SouthGrid; VO Management and Services; Technical
    Writer; 1 Tier2 Coordinator; North (0.5 FTE);
    Ireland; Workload Management Services; Applications
    Expert; London; Network Management; Network Support;
    London (0.5 FTE)
37
Resource Centres
  • Tier1: Rutherford Appleton Laboratory
  • Tier-2 centres are distributed over many sites.
  • Sites which have signed up to LCG and deployed
    software
  • (RAL,IC,Cambridge) expect to join EGEE (PM1)

London Grid: IC, QMUL, RHUL, UCL, Brunel
North Grid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
Scot Grid: Durham, Edinburgh, Glasgow
South Grid: Birmingham, Bristol, Cambridge, Oxford, RAL-PPD
38
Tier-2 Centre Resources (Projected 2004)
Projected resources available in September 2004
to be applied to large-scale production Grid
deployment. The total CPU at each institute is
proportional to the size of the green circles.
The disk storage at each site is proportional to
the height of the grey vertical bars.

Tier 2       Number of CPUs   Total CPU (KSI2000)   Total Disk (TB)   Total Tape (TB)
London       2454             1996                  99                20
North Grid   2718             2801                  209               332
South Grid   918              930                   67                8
Scot Grid    368              318                   79                0
Total        6458             6045                  455               360
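
The Total row can be cross-checked by summing the four Tier-2 rows. The sketch below uses only the figures from the table; the variable names are illustrative.

```python
# Columns: number of CPUs, total CPU (KSI2000), total disk (TB), total tape (TB)
tier2 = {
    "London":     (2454, 1996, 99, 20),
    "North Grid": (2718, 2801, 209, 332),
    "South Grid": (918, 930, 67, 8),
    "Scot Grid":  (368, 318, 79, 0),
}
stated_total = (6458, 6045, 455, 360)

# Sum each column across the four Tier-2 centres.
computed_total = tuple(sum(column) for column in zip(*tier2.values()))

# Per-column difference between the stated and the computed totals.
difference = tuple(s - c for s, c in zip(stated_total, computed_total))
```

The CPU, KSI2000 and tape columns add up exactly. The disk column sums to 454 TB against a stated 455 TB, a 1 TB gap that may simply reflect rounding of the per-centre figures.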
39
Roles (1)
  • Production Manager
  • Overall manager to oversee operations and
    report to other groups (ROC Coordinator, OMC)
  • Core Grid Coordinator
  • Bring UK non-Particle Physics projects
    (applications and resources) into EGEE

40
Roles (2)
  • Deployment Team
  • Consists of about 7 people to spearhead the
    rollout and certification of Grid software to the
    Resource Centres (Tier1 and Tier2)
  • Grid Operations Centre
  • Similar role to the proposed CIC in EGEE.
  • Monitor health of services and provide
    toolkits
  • Operate Core Grid Services
  • Database of RCs managed by RC site
    administrators

41
Roles (3)
  • Middleware Specialist Support
  • Body of experts to provide specialist support
    to Resource Centres in key areas: security, data
    management, network, VO management and workflow
    management.
  • Grid Support Centre
  • Helpdesk facility, CA
  • Broker requests to middleware specialists

42
Team UK
  • A large team in the UK (GridPP, EU, and other)
  • GridPP Production Manager should orchestrate this
    team to deliver a production grid for GridPP
  • But interwork with as many other UK grids and
    projects as possible
  • Meet our EGEE ROC and CIC deliverables for
    support and operations
  • A big challenge