Title: Grid Support and Operations
1 Grid Support and Operations
- John Gordon
- CCLRC
- GridPP9 - Edinburgh
2 What is support?
- Not well defined
- ..or rather defined differently in many places
- End users, sysadmins, deployers, developers
- all need support
- Some examples
3 Grid Support Centre
- 14 named staff at Rutherford, Daresbury, Manchester and Edinburgh.
- Operates the UK e-Science Certification Authority.
- http://ca.grid-support.ac.uk
- Provides a helpdesk for first point of call queries.
- Website for advertising services provided.
- http://www.grid-support.ac.uk
- Provides technical training and evaluations of middleware.
- Supports the Level-2 Grid project.
- National Information Server for Core programme.
- Publishing of site monitoring information in XML.
- Core support for the OGSA-DAI project.
4 European Grid Support Centre
- Collaboration between CCLRC, CERN and KTH Sweden, each providing 1 FTE
- Point of trusted reliability between major projects and middleware producers.
- Communicates directly with Globus Alliance staff so that European issues are addressed; has assisted with a release.
- Website up and running, though currently a skeleton of the final content.
- Attended the EDG meeting in Barcelona to publicise the centre, and GGF-8 to guide the work of the User Services R.G.
5 Global Grid User Support (GGUS): The Model
- Started 1st of October at GridKa, Forschungszentrum Karlsruhe (Germany)
- Already supports 41 user groups of GridKa
- Website: http://www.ggus.org
- E-Mail: support_at_ggus.org
6 Information flow
[Diagram labels: Grid User, Service Request, GGUS, ESUS, GOC, Local operations; arrows marked "Interaction" and "Data flow"]
- First line of support: experiment-specific problems will be solved by ESUS (with Savannah) or sent to GGUS using an agreed interface
- Grid-related problems will be solved by GGUS or sent to GOC using the GGUS system
- Key: GGUS = Global Grid User Support; ESUS = Experiment Specific User Support; GOC = Grid Operations Centre
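The forwarding rules on this slide can be sketched as a small dispatch function. This is an illustration only: the `Request` type, the three category names and the `route` function are hypothetical, and the sketch assumes that every request enters through ESUS as the first line of support. Neither GGUS nor Savannah is being modelled here.

```python
from dataclasses import dataclass

# Hypothetical categories. The slide only distinguishes experiment-specific
# problems, Grid-related problems, and problems passed on to the GOC.
EXPERIMENT = "experiment"
GRID = "grid"
OPERATIONS = "operations"


@dataclass
class Request:
    summary: str
    category: str


def route(request):
    """Return the chain of support bodies a request passes through.

    Assumption: requests start at ESUS. ESUS keeps experiment-specific
    problems and forwards the rest to GGUS; GGUS keeps Grid-related
    problems and forwards the rest to the GOC.
    """
    chain = ["ESUS"]
    if request.category == EXPERIMENT:
        return chain
    chain.append("GGUS")
    if request.category == GRID:
        return chain
    chain.append("GOC")
    return chain


for req in (
    Request("analysis job fails inside experiment software", EXPERIMENT),
    Request("proxy rejected by resource broker", GRID),
    Request("site gatekeeper unreachable", OPERATIONS),
):
    print(req.summary, "->", " -> ".join(route(req)))
```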
7 GridPP TB-Support
- Support Team
- built from sysadmins. 4 funded by GridPP to work on EDG WP6, the rest are the usual site sysadmins.
- Methods
- Email list, phone meetings, personal visits, job submission monitoring
- RB, VO, RC for UK use to support non-EDG use
- Planned to verify EDG releases but they have been too infrequent to test procedures
- Rollout
- Experience from RAL in EDG dev testbeds and IC and Bristol in CMS testbeds
- >10 sites have been part of EDG app testbed at one time
- 3 in LCG1
8 Savannah
9 EGEE Operations
- Resource Centres: all sites
- Regional Operations Centres (ROC)
- At least one per region!
- RAL in UK/Ireland
- Core Infrastructure Centres (CIC)
- CERN, RAL, CNAF, CC-IN2P3
10 Others
- Tier1 Support
- Role to support UK Tier2s in LCG
- Deployment role in GridPP2
- Tier2 Specialist Posts
- Support for various middleware areas
- Middleware Developers
11 Where do you go for support?
- Users go to experiment support
- Experiment support diagnoses and forwards as necessary to Grid user support or middleware or operations or applications
- Resource Centres look to their Regional Operations Centre (Tier2s to their Tier1)
- ROCs will also push problems to their RCs.
- But we know that users will go to their local sysadmin or direct to their Tier2 or Tier1 too.
- And some sysadmins will go to their favourite experiment expert
- And Tier1s will go direct to middleware experts.
- In short, chaos.
- Strategy for now is to have a UK Plan that is self-contained and can deliver support in the UK when and where required.
- Interface this to the various outside bodies
- Don't duplicate for the sake of it, but be ready to.
- Or be prepared to roll our work into wider provision when it is proven.
12 Grid Operations Centre
13 What is Operations?
- RAL leading development of LCG GOC
- The Vision
- GOC Processes and Activities
- Coordinating Grid Operations
- Defining Service Level Parameters
- Monitoring Service Performance Levels
- First-Level Fault Analysis
- Interacting with Local Support Groups
- Coordinating Security Activities
- Operations Development
- Recent developments
14 GOC - Monitoring
- Who is Involved?
- 3.0 FTE (Trevor Daniels, Dave Kant, Matt Thorpe, Jason Leake)
- What are we Doing?
- Monitor Grid Services, Manage Site Information, Accounting
- Developed Tools to Configure/Integrate Monitoring to make the job easier
- GPPMon
- Nagios
- Mapcentre
- Example: Mapcentre, 30 sites, 500 lines in config file
- Example: Nagios, 30 sites, 12 individual config files with dependencies
- Both are tedious to configure and not practical by hand with large numbers of nodes
15 GOC - Database
- Develop/maintain a database to hold site information
- Site Information (contact lists, resources, site information, URLs)
- Secure access through GridSite (X509 certificates) via PHP web interface
- RC managers should maintain their own pages as part of the site certification process.
- Monitoring scripts read information in database and run a set of customised tools to monitor the infrastructure.
- To be included in the monitoring a site must register its resources (CE, SE, RB, RC, RLS, MDS, RGMA, BDII, ..)
- BDII can be queried to check GOC database is up-to-date.
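A minimal sketch of the last point, comparing what a site has registered in the GOC database with what a BDII publishes. The function name and the example hostnames are hypothetical, and both inputs are plain lists of hostnames: the MySQL query and the LDAP query against the BDII that would produce them are not shown.

```python
def compare_registrations(goc_db_hosts, bdii_hosts):
    """Report hostnames that appear in only one of the two sources."""
    # Normalise case and whitespace so trivial differences are not reported.
    goc = {h.strip().lower() for h in goc_db_hosts}
    bdii = {h.strip().lower() for h in bdii_hosts}
    return {
        # Published by the BDII but never registered by the site.
        "missing_from_goc_db": sorted(bdii - goc),
        # Registered by the site but not (or no longer) published.
        "not_published_in_bdii": sorted(goc - bdii),
    }


report = compare_registrations(
    ["ce.example.org", "se.example.org"],
    ["ce.example.org", "rb.example.org"],
)
for problem, hosts in report.items():
    print(problem, hosts)
```

An empty list under both keys would suggest the GOC database entry for that site is up to date.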
16 GOC Monitoring Today
- Remote UI queries the database to build a list of resources
- Submits monitoring jobs to those resources
- Publishes results on WWW
[Diagram labels: GOC DB (GridSite, MySQL); EDG UI and EDG resources; LCG1 UI and LCG-1 resources; LCG-2 UI and LCG-2 resources]
17 New GPPMon Features
- Download host certificates daily and monitor lifetimes for CEs and SEs for LCG and EDG
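The lifetime check can be sketched as follows. This is not GPPMon code: the function names, the 30-day warning threshold and the example hosts are assumptions, and the expiry dates are passed in directly rather than read from downloaded certificates.

```python
from datetime import datetime, timedelta, timezone


def classify_lifetime(not_after, now, warn_days=30):
    """Classify a certificate by the time left before it expires.

    warn_days is an assumed threshold; the slides do not state one.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=warn_days):
        return "expiring"
    return "ok"


def lifetime_report(expiry_by_host, now):
    """Map each host to (whole days remaining, status)."""
    return {
        host: ((not_after - now).days, classify_lifetime(not_after, now))
        for host, not_after in expiry_by_host.items()
    }


now = datetime(2004, 2, 3, tzinfo=timezone.utc)
# Hypothetical hosts and expiry dates.
expiries = {
    "ce.example.org": now + timedelta(days=200),
    "se.example.org": now + timedelta(days=12),
}
for host, (days, status) in lifetime_report(expiries, now).items():
    print(host, days, status)
```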
18 New GPPMon Features
- Reliability of service provided, using RRDTool to show Globus and RB stats
19 New GPPMon Features
- Moving toward LCG-1, LCG-2 and EDG monitoring
- Tuesday 3/2/04 14:10: Only RAL and FZK have updated their LCG-2 information in the GOC database.
- gridkap01.fzk.de
20 Nagios
- Customised plugins for monitoring
- Focus: service behaviour and data consistency
- Do RBs find resources?
- Do site GIISs publish correct hostname?
- Is the site running the latest stable software release?
- Does the Gatekeeper authenticate?
- Are the host certificates valid?
- Are essential services running?
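As an illustration of what such a plugin might look like, here is a sketch of the "correct hostname" check. Nagios plugins report their result through the exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line message; everything else here is an assumption, including the function name and the idea that the published hostname has already been fetched from the GIIS by other code.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_published_hostname(expected, published):
    """Return (status code, one-line message) for a hostname check.

    `published` is the hostname the site GIIS advertises, or None when
    nothing could be retrieved. Fetching it is outside this sketch.
    """
    if published is None:
        return UNKNOWN, "UNKNOWN - no hostname retrieved from GIIS"
    if published.strip().lower() == expected.strip().lower():
        return OK, "OK - GIIS publishes " + published.strip()
    return (
        CRITICAL,
        "CRITICAL - GIIS publishes %s, expected %s" % (published.strip(), expected),
    )


code, message = check_published_hostname("ce.example.org", "ce.example.org")
print(code, message)
```

A real plugin would print the message and exit with the status code so that Nagios can pick up both.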
21 Nagios Screen Shots LCG-1
22 Nagios Screen Shots LCG-1
Service Summary for Gatekeeper Nodes
23 Nagios Screen Shots LCG-1
Host and Service Summary tables for BDII nodes
24 GOC Configuration
- Example: Manage a Grid-Wide Database
- provides access to site information via trusted certificate
- scripts to automatically configure Nagios from the GOC database
- provide plugins to monitor services for Nagios
- create configuration files for Mapcentre
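Generating Nagios configuration from database rows might look roughly like this. The sketch assumes each node row carries a hostname, an IP address and a group, that a `generic-host` template is defined elsewhere in the Nagios configuration, and that the named host groups exist. The function names and sample nodes are hypothetical; the GOC scripts themselves are not shown in the slides.

```python
def nagios_host_block(hostname, address, group):
    """Render one Nagios host object definition."""
    return "\n".join(
        [
            "define host {",
            "    use         generic-host",  # assumed template name
            "    host_name   " + hostname,
            "    address     " + address,
            "    hostgroups  " + group,
            "}",
        ]
    )


def nagios_config(nodes):
    """Render host definitions for (hostname, address, group) rows."""
    return "\n\n".join(nagios_host_block(*node) for node in nodes) + "\n"


# Hypothetical rows, standing in for a query against the site database.
nodes = [
    ("ce.example.org", "192.0.2.10", "lcg2-ce"),
    ("se.example.org", "192.0.2.11", "lcg2-se"),
]
print(nagios_config(nodes))
```

Writing the output to a file that Nagios includes would replace the hand-edited per-site files described earlier.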
25 GOC
[Diagram labels: Secure Database Management via HTTPS / X.509; GOC (GridSite, MySQL); Monitoring; Resource Centre Resources and Site Information (EDG, LCG-1, LCG-2, ...); RC nodes: bdii, ce, se, rb]
26 GOC Server
http://goc.grid-support.ac.uk
27 What's in the Database?
People: Who do we notify when there are problems?
28 What's in the Database?
Node Information (Hostname, IP Address, Group)
29 What's in the Database?
Scheduled Downtimes: Advance warning of site maintenance resulting in reduced service availability
30 LCG Accounting Overview
- PBS log processed daily on site CE to extract required data; filter acts as R-GMA DBProducer -> PbsRecords table
- Gatekeeper log processed daily on site CE to extract required data; filter acts as R-GMA DBProducer -> GkRecords table
- Site GIIS interrogated daily on site CE to obtain SpecInt and SpecFloat values for CE; acts as DBProducer -> SpecRecords table, one dated record per day
- These three tables joined daily on MON to produce LcgRecords table. As each record is produced, the program acts as StreamProducer to send the entries to the LcgRecords table on the GOC site.
- Site now has table containing its own accounting data; GOC has aggregated table over whole of LCG.
- Interactive and regular reports produced by site or at GOC site as required.
- Note: This is an improved design over that presented at the Jan GDB. The SOAP transport has been replaced by R-GMA.
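The daily join can be illustrated with an in-memory SQLite database standing in for R-GMA. Only the four table names come from the slides. The column names, the join keys (site plus local job ID, and site plus date) and the sample rows are all assumptions made for the sake of a runnable sketch.

```python
import sqlite3

# Assumed schemas; the real table layouts are not given in the slides.
SCHEMA = """
CREATE TABLE PbsRecords (site TEXT, local_job_id TEXT, unix_group TEXT,
                         cpu_seconds INTEGER, end_date TEXT);
CREATE TABLE GkRecords (site TEXT, local_job_id TEXT, user_dn TEXT);
CREATE TABLE SpecRecords (site TEXT, record_date TEXT,
                          spec_int INTEGER, spec_float INTEGER);
CREATE TABLE LcgRecords (site TEXT, local_job_id TEXT, user_dn TEXT,
                         unix_group TEXT, cpu_seconds INTEGER, end_date TEXT,
                         spec_int INTEGER, spec_float INTEGER);
"""

JOIN = """
INSERT INTO LcgRecords
SELECT p.site, p.local_job_id, g.user_dn, p.unix_group,
       p.cpu_seconds, p.end_date, s.spec_int, s.spec_float
FROM PbsRecords AS p
JOIN GkRecords AS g
  ON g.site = p.site AND g.local_job_id = p.local_job_id
JOIN SpecRecords AS s
  ON s.site = p.site AND s.record_date = p.end_date
"""


def build_lcg_records(conn):
    """Rebuild LcgRecords from the three source tables and return its rows."""
    conn.execute("DELETE FROM LcgRecords")
    conn.execute(JOIN)
    conn.commit()
    return conn.execute(
        "SELECT * FROM LcgRecords ORDER BY local_job_id"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Made-up sample rows. Job 1002 has no gatekeeper record, so it is
    # dropped by the inner join.
    conn.executemany(
        "INSERT INTO PbsRecords VALUES (?, ?, ?, ?, ?)",
        [("SITE-A", "1001", "atlas", 3600, "2004-02-03"),
         ("SITE-A", "1002", "cms", 1800, "2004-02-03")],
    )
    conn.execute(
        "INSERT INTO GkRecords VALUES ('SITE-A', '1001', '/C=UK/CN=example user')"
    )
    conn.execute(
        "INSERT INTO SpecRecords VALUES ('SITE-A', '2004-02-03', 400, 380)"
    )
    return build_lcg_records(conn)


for row in demo():
    print(row)
```

In the design described above, each joined record would then be streamed to the GOC rather than kept only locally.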
32 Progress
- Status on 3 Feb 2004
- The code which will run on the CE to parse and process the PBS and Gatekeeper logs is written. The PbsRecords and GkRecords tables are created and are being populated.
- The code to join these two tables and publish the new joined table (LcgRecords) is also written and working.
- Work is in progress to write the archiver at the GOC to receive the aggregated LcgRecords table: 2 days' work.
- To do
- Write the code to interrogate the site GIIS to extract the CPU power values and populate these fields in the tables: 2 days' work
- Integration testing and debugging: 5 days
- Packaging for deployment: 3 days
- Write the report generators: 30 days (estimate; not yet designed)
33 Accounting Issues
- There is no R-GMA infrastructure LCG-wide, so most sites are not able to install and run the accounting suite at present. It is expected that R-GMA and the MON boxes will be rolled out in LCG2 soon after the storage problems are resolved. Until this happens the complete batch and gatekeeper logs will have to be copied to the GOC site for processing.
- The VO associated with a user's DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name. This needs to be acknowledged as an LCG requirement.
- The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs. This global jobID cannot therefore appear in the accounting reports. The RB Events Database contains this, but that is not accessible nor is it designed to be easily processed.
- At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is believed all sites are homogeneous.
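The group-ID assumption in the second point can be made concrete with a short sketch. The function names and the sample records are hypothetical; the only idea taken from the slide is that the Unix group a job ran under is treated as the VO name. Groups that do not match a known VO are collected under "unknown" here rather than guessed at.

```python
def vo_for_group(unix_group, known_vos):
    """Treat the Unix group name as the VO name, if it is a known VO."""
    name = unix_group.strip().lower()
    return name if name in known_vos else None


def cpu_seconds_by_vo(records, known_vos):
    """Sum CPU seconds per VO from (unix_group, cpu_seconds) pairs."""
    totals = {}
    for unix_group, cpu_seconds in records:
        vo = vo_for_group(unix_group, known_vos) or "unknown"
        totals[vo] = totals.get(vo, 0) + cpu_seconds
    return totals


# Made-up records; "localusers" stands for a site group that is not a VO.
records = [("atlas", 3600), ("cms", 1800), ("atlas", 600), ("localusers", 120)]
print(cpu_seconds_by_vo(records, {"atlas", "cms"}))
```

Any usage that lands under "unknown" marks a site where the group-equals-VO assumption does not hold.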
34 Future Direction: Towards EGEE
- Distribute tools to help the ROCs monitor their RCs (database and monitoring packages)
- Distribute tools to help CICs monitor core services
- Grid-wide monitoring: ideas on how this would work
- CIC monitoring tools query ROC databases
- Select core services
- Run a standard set of checks on those services
- Display information / notifications
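The four steps of the proposed grid-wide monitoring flow can be sketched as a single loop. All names here are hypothetical: ROC databases are represented as plain dictionaries, services as small records with a `type` and a `host`, and checks as ordinary functions. None of this reflects an existing EGEE tool.

```python
def monitor_core_services(roc_databases, core_types, checks):
    """Query each ROC database, keep core services, run every check.

    roc_databases: {roc_name: [service dict, ...]}
    core_types:    set of service types treated as core services
    checks:        {check_name: function(service) -> bool}
    Returns a list of (roc, host, check_name, passed) tuples.
    """
    results = []
    for roc, services in roc_databases.items():
        for service in services:
            if service["type"] not in core_types:
                continue  # not a core service, left to the ROC itself
            for check_name, check in checks.items():
                results.append((roc, service["host"], check_name, check(service)))
    return results


# Made-up data standing in for one ROC's database.
roc_databases = {
    "UK/Ireland": [
        {"type": "RB", "host": "rb.example.org", "up": True},
        {"type": "CE", "host": "ce.example.org", "up": False},
    ]
}
checks = {"is_up": lambda service: service["up"]}
for result in monitor_core_services(roc_databases, {"RB", "BDII"}, checks):
    print(result)
```

The returned tuples are the input for the final step, whether that is a status page or a notification.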
35 UK Deployment, Support and Operations
36 Deployment Team
Proposal for a UK-wide team to provide and run a UK-wide Grid: the GridPP view. There are alternative views for other stakeholders.
[Organisation chart labels. Stakeholders: EGEE, GridPP, JISC, Core UK. Main boxes: Production Manager; Core Grid Coordinator; Deployment Team; Grid Support Centre; Middleware Specialist Support; Grid Operations Centre. FTE figures shown: 5 FTE, 1 FTE, 6 FTE, 8 FTE. Other labels: Security Officer; Helpdesk; 2 Tier1 Deployment; Manager; RAL; Data and Storage Management; 4 Tier 2 UK Coordinators; Operations (2); Glasgow, Bristol, Edinburgh; LondonGrid, NorthGrid; Network Monitoring; ScotGrid, SouthGrid; VO Management and Services; Technical Writer; 1 Tier2 Coordinator; North (0.5 FTE); Ireland; Workload Management Services; Applications Expert; London; Network Management; Network Support; London (0.5 FTE)]
37 Resource Centres
- Tier1: Rutherford Appleton Laboratory
- Tier-2 centres are distributed over many sites.
- Sites which have signed up to LCG and deployed software
- (RAL, IC, Cambridge) expect to join EGEE (PM1)
- London Grid: IC, QMUL, RHUL, UCL, Brunel
- North Grid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
- Scot Grid: Durham, Edinburgh, Glasgow
- South Grid: Birmingham, Bristol, Cambridge, Oxford, RAL-PPD
38 Tier-2 Centre Resources (Projected 2004)
Projected resources available in September 2004 to be applied to large-scale production Grid deployment. The total CPU at each institute is proportional to the size of the green circles. The disk storage at each site is proportional to the height of the grey vertical bars.
Tier 2       Number of CPUs   Total CPU (KSI2000)   Total Disk (TB)   Total Tape (TB)
London       2454             1996                  99                20
North Grid   2718             2801                  209               332
South Grid   918              930                   67                8
Scot Grid    368              318                   79                0
Total        6458             6045                  455               360
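As a quick arithmetic check, the per-Tier-2 figures can be summed and compared with the Total row. The numbers below are copied from the table; the function name is hypothetical. Three columns agree exactly, while the disk column sums to 454 TB against the stated 455 TB. The slides do not explain the difference; rounding of the per-site figures is one possible cause.

```python
# (number of CPUs, total CPU in KSI2000, disk in TB, tape in TB)
ROWS = {
    "London": (2454, 1996, 99, 20),
    "North Grid": (2718, 2801, 209, 332),
    "South Grid": (918, 930, 67, 8),
    "Scot Grid": (368, 318, 79, 0),
}
STATED_TOTAL = (6458, 6045, 455, 360)


def column_sums(rows):
    """Sum each column across all Tier-2 rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


sums = column_sums(ROWS)
differences = tuple(s - t for s, t in zip(sums, STATED_TOTAL))
print("computed:", sums)
print("computed minus stated:", differences)
```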
39 Roles (1)
- Production Manager
- Overall Manager to oversee operations and report to other groups (ROC Coordinator, OMC)
- Core Grid Coordinator
- Bring UK non-Particle Physics projects (applications and resources) into EGEE
40 Roles (2)
- Deployment Team
- Consists of about 7 people to spearhead the rollout and certification of Grid software to the Resource Centres (Tier1 and Tier2)
- Grid Operations Centre
- Similar role to the proposed CIC in EGEE.
- Monitor health of services and provide toolkits
- Operate Core Grid Services
- Database of RCs managed by RC site administrators
41 Roles (3)
- Middleware Specialist Support
- Body of experts to provide specialist support to Resource Centres in key areas: security, data management, network, VO management and workflow management.
- Grid Support Centre
- Helpdesk facility, CA
- Broker requests to middleware specialists
42 Team UK
- A large team in the UK (GridPP, EU, and other)
- GridPP Production Manager should orchestrate this team to deliver a production grid for GridPP
- But interwork with as many other UK grids and projects as possible
- Meet our EGEE ROC and CIC deliverables for support and operations
- A big challenge