1
SEE-GRID-SCI Operations Procedures and Tools
  • Regional SEE-GRID-SCI Training for Site
    Administrators
  • Institute of Physics Belgrade
  • March 5-6, 2009

Antun Balaz, Institute of Physics Belgrade,
Serbia, antun_at_scl.rs
The SEE-GRID-SCI initiative is co-funded by the
European Commission under the FP7 Research
Infrastructures contract no. 211338
2
Overview
  • SEE-GRID operational and monitoring tools (and
    their relation to EGEE tools)
  • HGSM/GOCDB
  • Helpdesk/GGUS
  • BBmSAM/SAM
  • GStat
  • Nagios/CIC portal
  • Accounting portal
  • Downtime procedures
  • Upgrade procedures
  • Grid-Operator-On-Duty (GOOD)
  • Service Level Agreement (SLA)

3
Operational monitoring tools
GSTAT (Taiwan)
HELPDESK
VOMS
BDII
SAM
Accounting
NAGIOS
BBmSAM
R-GMA
HGSM
4
HGSM/GOCDB (1)
5
HGSM/GOCDB (2)
6
HGSM/GOCDB (3)
  • Static database containing all relevant data
    about all SEE-GRID and AEGIS sites
  • Must be kept synchronized with the real situation
  • All sheets must be properly updated
  • Site Info
  • Contacts
  • Site Nodes
  • Downtimes
  • XML dumps: the easiest way to apply changes is
    to download an XML dump of the data, edit it
    appropriately, and then upload the new XML file;
    this also allows backups to be kept
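
The XML-dump workflow above could be scripted locally along the following lines. This is only a sketch: the slides do not show the HGSM dump schema, so the element layout used here (`<site name="...">` with one child element per field) and the function name `update_site_field` are assumptions made for illustration. Downloading and uploading the file are left to the HGSM web interface.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def update_site_field(dump_path, site_name, field, new_value):
    """Edit one field of one site in a local XML dump, keeping a backup copy.

    Assumed (hypothetical) layout:
    <sites><site name="..."><field>value</field></site></sites>
    """
    dump_path = Path(dump_path)
    backup_path = dump_path.with_name(dump_path.name + ".bak")
    # Keep the unmodified dump as a backup before touching anything.
    shutil.copy2(dump_path, backup_path)

    tree = ET.parse(str(dump_path))
    for site in tree.getroot().iter("site"):
        if site.get("name") == site_name:
            element = site.find(field)
            if element is None:
                # The field does not exist yet, so create it.
                element = ET.SubElement(site, field)
            element.text = new_value
            tree.write(str(dump_path), encoding="utf-8", xml_declaration=True)
            return backup_path
    raise KeyError(f"site {site_name!r} not found in {dump_path}")
```

The edited file is what would be uploaded back to HGSM; the `.bak` copy preserves the previous state.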

7
HGSM/GOCDB (4)
  • The essential fields in HGSM
  • GIIS URL
  • Monitoring: Yes
  • Status: certified
  • Type: seegrid_production, seegrid_certified,
    egee_production
  • Site Commitments
  • Contacts and administrators
  • All fields have to have correct values!
  • URL: https://hgsm.grid.org.tr/
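
A quick consistency check of a site record against the essential fields listed above might look like the sketch below. The dictionary keys (`giis_url`, `monitoring`, and so on) are hypothetical names chosen for this example, not HGSM's own identifiers; only the three allowed type values come from the slide.

```python
# Hypothetical key names for the essential HGSM fields.
REQUIRED_FIELDS = (
    "giis_url",
    "monitoring",
    "status",
    "type",
    "site_commitments",
    "contacts",
)

# Type values named on the slide.
ALLOWED_TYPES = {"seegrid_production", "seegrid_certified", "egee_production"}


def find_field_problems(site):
    """Return a list of human-readable problems found in a site record (dict)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not site.get(name):
            problems.append(f"missing or empty field: {name}")
    site_type = site.get("type")
    if site_type and site_type not in ALLOWED_TYPES:
        problems.append(f"unexpected type: {site_type}")
    return problems
```

An empty result means every essential field has some value and the type is one of the listed ones; it does not prove that the values are correct.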

8
Helpdesk/GGUS (1)
9
Helpdesk/GGUS (2)
10
Helpdesk/GGUS (3)
  • Central reference point for tracking of all
    operational and user problems
  • Identified problems are reported through the
    Helpdesk and assigned to the appropriate
    supporters
  • If problems cannot be solved within the SEE-GRID
    community, they are propagated to other
    projects/initiatives/support systems (e.g. GGUS)
  • URL: https://helpdesk.see-grid.eu/

11
BBmSAM/SAM
12
BBmSAM History
13
BBmSAM
  • Portal that provides access to the database of
    SAM test results
  • Central tool for identification of operational
    problems
  • Should be checked by each site admin on a daily
    basis
  • Should be used to troubleshoot problems
  • Also provides SLA figures
  • URL: https://c01.grid.etfbl.net/

14
GStat (1)
15
GStat (2)
  • Central tool for monitoring of the information
    system of SEE-GRID infrastructure
  • Provides useful data
  • Identifies problems with sites
  • Should be checked by each site admin on a daily
    basis and used for troubleshooting
  • Useful ldapsearch commands can be found on GStat
    pages!
  • URL: http://goc.grid.sinica.edu.tw/gstat/seegrid/
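
As an illustration of the kind of ldapsearch command used against the information system, the sketch below builds a query for a BDII. It assumes the usual BDII conventions (LDAP port 2170, base `mds-vo-name=local,o=grid` for a top-level BDII and `mds-vo-name=<site name>,o=grid` for a site BDII); the host name in the usage note is a placeholder, and the helper `bdii_query_command` is not part of any Grid tool.

```python
import shlex

BDII_PORT = 2170  # conventional BDII LDAP port


def bdii_query_command(host, site_name=None, ldap_filter="(objectClass=*)"):
    """Build an ldapsearch command line for a site or top-level BDII.

    With site_name set, the site BDII base is used; otherwise the
    top-level BDII base is used.
    """
    if site_name:
        base = f"mds-vo-name={site_name},o=grid"
    else:
        base = "mds-vo-name=local,o=grid"
    argv = [
        "ldapsearch",
        "-x",      # simple (anonymous) bind
        "-LLL",    # plain LDIF output without comments
        "-H", f"ldap://{host}:{BDII_PORT}",
        "-b", base,
        ldap_filter,
    ]
    # shlex.join quotes the filter so the command can be pasted into a shell.
    return shlex.join(argv)
```

For example, `bdii_query_command("bdii.example.org")` yields a command that lists everything published by a top-level BDII on that (placeholder) host.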

16
Nagios/CIC portal (1)
17
Nagios/CIC portal (2)
18
Nagios/CIC portal (3)
  • Collection of alarms raised by various tools
  • The aim is to integrate all the tools and make
    the life of site admins and infrastructure
    managers easier
  • In the future, automatic creation of Helpdesk
    tickets will be implemented
  • URL: https://portal.ipp.acad.bg:7443/seegridnagios/

19
Accounting portal (1)
  • Accounting by site
  • Accounting by countries and institutions
  • Accounting by applications

20
EGEE Accounting portal
21
Accounting portal (2)
  • Collects the accounting data from all SEE-GRID
    and AEGIS sites through the APEL accounting
    publisher developed by the project
  • Provides aggregated accounting data by site,
    country, institution, application
  • Each site must publish the accounting data
    properly
  • URL: https://gserv1.ipp.acad.bg:8443/Welcome/

22
Downtime procedures
  • Downtimes must be announced well in advance (1
    week is a reasonable time)
  • There are always downtimes due to hardware and
    other failures that cannot be anticipated
  • All downtimes must be entered properly in HGSM
  • That way they are not counted against the
    site's availability
  • In addition, all downtimes must be broadcast by
    e-mail to the GIM, APP and relevant VO mailing
    lists
  • Downtime should not exceed 10% of the total time
    (monthly, quarterly)
  • If it does, an explanation must be provided
  • If the explanation is not accepted by the project
    management, SA1 claims will be rejected
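
The downtime limit can be checked with simple interval arithmetic, as in the sketch below. It assumes downtimes are given as non-overlapping (start, end) pairs; how HGSM actually stores or sums them is not described in the slides, and the function name is made up for this example.

```python
from datetime import datetime

MAX_DOWNTIME_FRACTION = 0.10  # limit stated for a month or a quarter


def downtime_fraction(downtimes, period_start, period_end):
    """Fraction of the period covered by downtimes.

    downtimes: iterable of (start, end) datetime pairs, assumed not to
    overlap each other. Parts outside the period are ignored.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in downtimes:
        overlap_start = max(start, period_start)
        overlap_end = min(end, period_end)
        if overlap_end > overlap_start:
            down += (overlap_end - overlap_start).total_seconds()
    return down / total
```

For March 2009 (31 days), three full days of downtime give 3/31, just under the limit, while four days give 4/31 and would require an explanation.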

23
Upgrade procedures
  • All upgrades/updates are announced over the GIM
    list
  • The broadcasts contain links to further
    instructions for upgrades for each Grid service
  • Site admins should carefully examine them before
    performing the update!
  • In addition, possible SEE-GRID-specific
    instructions are given in the e-mail
  • For especially important updates/changes, tickets
    are created for each site
  • For some upgrades/updates to be performed,
    downtimes may be required
  • OS updates must be regularly installed, to
    minimize security risks

24
Grid-Operator-On-Duty (GOOD)
  • Rotating shifts on a weekly basis
  • Each country's GIM is responsible for monitoring
    sites during his/her shift
  • Tickets are submitted to sites with problems,
    according to the status of sites in various
    monitoring tools (BBmSAM, GStat, Nagios,
    Accounting portal, etc.)
  • Older tickets that are not resolved are escalated
  • Support is given to sites that cannot resolve
    earlier identified operational problems
  • User tickets are assigned to the appropriate
    supporters
  • Wiki documentation is updated, or new wiki pages
    created if necessary
  • URLs:
  • http://wiki.egee-see.org/index.php/SG_GOOD
  • http://wiki.egee-see.org/index.php/SG_Helpdesk_tickets

25
Usual problems and links to (possible) solutions
  • BDII
  • siteBDII (GIIS) or top-level BDII is unreachable:
    http://faq.twgrid.org/faq/index.php?action=artikel&cat=14&id=11&artlang=en
  • No info published:
    http://goc.grid.sinica.edu.tw/gocwiki/No_data_published_by_top_level_BDII
  • CA
  • CA version test failed with error message "This
    CA is an old one and time allowed to upgrade is
    over":
    http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html
  • CE (Computing Element)
  • Job submission failed with error message
    "Brokerhelper: Cannot plan. No compatible
    resources":
    http://goc.grid.sinica.edu.tw/gocwiki/Brokerhelper%3A_Cannot_plan._No_compatible_resources
  • Job submission failed with error message "Got a
    job held event, reason: Unspecified gridmanager
    error":
    http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error
  • Job submission failed with error message "Cannot
    read JobWrapper output, both from Condor and from
    Maradona":
    http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output%2e%2e%2e
  • Job submission failed with error message "7
    authentication failed":
    http://goc.grid.sinica.edu.tw/gocwiki/7_authentication_failed
  • Job submission failed with error message "10 data
    transfer to the server failed":
    http://goc.grid.sinica.edu.tw/gocwiki/10_data_transfer_to_the_server_failed
  • 4444 waiting jobs in the GRIS:
    http://goc.grid.sinica.edu.tw/gocwiki/4444_Waiting_jobs_in_the_GRIS
  • SE (Storage Element)
  • File copy and registration failed with error
    message "535 535-FTPD GSSAPI error: GSS Major
    Status: General failure":
    http://goc.grid.sinica.edu.tw/gocwiki/535_535-FTPD_GSSAPI_error%3A_GSS_Major_Status%3A_General_failure
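
A list like the one above lends itself to a small lookup that maps error text found in a log or test output to the corresponding documentation link. The sketch below covers only a few of the CE entries and matches on plain substrings; the function name and the matching strategy are illustrative choices, not part of BBmSAM or any other tool mentioned here.

```python
GOCWIKI = "http://goc.grid.sinica.edu.tw/gocwiki/"

# A subset of the problems listed above: message fragment -> wiki page.
KNOWN_PROBLEMS = {
    "Cannot plan. No compatible resources":
        GOCWIKI + "Brokerhelper%3A_Cannot_plan._No_compatible_resources",
    "Unspecified gridmanager error":
        GOCWIKI + "Unspecified_gridmanager_error",
    "Cannot read JobWrapper output":
        GOCWIKI + "Cannot_read_JobWrapper_output%2e%2e%2e",
    "7 authentication failed":
        GOCWIKI + "7_authentication_failed",
    "10 data transfer to the server failed":
        GOCWIKI + "10_data_transfer_to_the_server_failed",
}


def find_solution_links(log_text):
    """Return the wiki links whose message fragment occurs in log_text."""
    lowered = log_text.lower()
    return [
        url
        for message, url in KNOWN_PROBLEMS.items()
        if message.lower() in lowered
    ]
```

Matching is case-insensitive and returns every entry that applies, so a log containing several different errors yields several links.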

26
Service Level Agreement (SLA)
  • Old URL: http://wiki.egee-see.org/index.php/SG_SLA
  • The changes in the current SLA are that the
    required availability is 80%, and that the
    availability is calculated on a 3h basis, not on
    a daily basis
  • BBmSAM portal provides SLA figures
  • Sites not fully conforming to the SLA will have
    reduced funding
  • Sites with availability <50% will be
    uncertified
  • Sites fully conforming to the SLA will be put
    into seegrid_certified status and become visible
    to the whole SEE region (i.e. not only SEE-GRID,
    but also EGEE-SEE etc.)
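
The SLA thresholds can be illustrated with a short calculation. The sketch assumes that each 3-hour slot is simply marked as passed or failed and that availability is the fraction of passed slots; the exact formula used by the BBmSAM portal is not given in the slides, so this is an approximation for illustration, and the status labels are made up.

```python
REQUIRED_AVAILABILITY = 0.80  # SLA requirement
UNCERTIFY_BELOW = 0.50        # below this, a site is uncertified


def availability(slot_results):
    """Fraction of 3-hour slots in which the site passed (True = passed)."""
    if not slot_results:
        raise ValueError("no slot results given")
    return sum(1 for ok in slot_results if ok) / len(slot_results)


def sla_status(value):
    """Classify an availability value against the two thresholds."""
    if value < UNCERTIFY_BELOW:
        return "uncertified"
    if value < REQUIRED_AVAILABILITY:
        return "not conforming"
    return "conforming"
```

A 30-day month has 240 three-hour slots, so a site needs at least 192 passed slots to reach 80% under this assumption.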