Operations - PowerPoint PPT Presentation

Provided by: judo1
Transcript and Presenter's Notes

Title: Operations


1
Operations
2
Something wrong?
Jobs can fail for many different reasons. People
running production are asked to check the failure
reasons and try to fix them. Problems can occur
in different parts of the system: there may be
bugs in the Athena software, inefficiencies in
some grid service, system commands hanging,
and so on. Each of these problems is handled in a
different way.
3
Errors of a kind
The most time-consuming (but not the most
annoying) activity is understanding what kind of
problem a failed job ran into. Some
information can be found in the ProdDB, by checking
the fields
  • ERRORACRONYM
  • ERRORTEXT

in the EJOBEXE table. These two fields may
be enough to understand the reason for the failure.
  • The database can be queried in
    different ways
    • scripts using the cx_Oracle library
    • phpOrAdmin web interface
    • job monitoring web page
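
As a rough illustration of the script route, the sketch below reads the two error fields for one job through the Python DB-API. The table and field names EJOBEXE, ERRORACRONYM and ERRORTEXT come from the slide. The JOBDEFID column, the helper names and the sample row are assumptions made up for illustration, not the real ProdDB schema. An in-memory SQLite table stands in for Oracle so the sketch can run without database access. With cx_Oracle the connection would come from cx_Oracle.connect(...) instead, and the same named bind variable should work.

```python
import sqlite3

# ERRORACRONYM, ERRORTEXT and EJOBEXE come from the slides; the JOBDEFID
# column used to select a job is a hypothetical name for illustration.
ERROR_QUERY = (
    "SELECT ERRORACRONYM, ERRORTEXT "
    "FROM EJOBEXE "
    "WHERE JOBDEFID = :jobdefid"
)


def fetch_error_fields(conn, jobdefid):
    """Return (ERRORACRONYM, ERRORTEXT) rows for one job as a list of tuples."""
    cursor = conn.cursor()
    try:
        # Named bind variables (":jobdefid") are accepted by both
        # cx_Oracle and sqlite3 when the parameters are passed as a dict.
        cursor.execute(ERROR_QUERY, {"jobdefid": jobdefid})
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


def make_demo_connection():
    """Build an in-memory stand-in for the ProdDB with one made-up row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EJOBEXE "
        "(JOBDEFID INTEGER, ERRORACRONYM TEXT, ERRORTEXT TEXT)"
    )
    conn.execute(
        "INSERT INTO EJOBEXE VALUES (1, 'DEMO_ACRONYM', 'made-up error text')"
    )
    return conn


if __name__ == "__main__":
    # With Oracle one would use cx_Oracle.connect(user, password, dsn) here.
    for acronym, text in fetch_error_fields(make_demo_connection(), 1):
        print(acronym, "-", text)
```

Passing the job identifier as a bind variable rather than formatting it into the SQL string avoids quoting problems and lets the database reuse the parsed statement.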

4
Errors of a kind
Sometimes, the error fields in the EJOBEXE table
are not enough to fully understand what went
wrong. Suppose, for example, that the job died
because of an Athena crash: you would need the Athena
log file. Or the required input file could not be
retrieved by the job, so you need to find out
where this file is and why it was not transferred.
  • Every job has 2 different log files
    • GridWrapper.log
    • <jobName>.log

The former is the log of the grid job, so it
contains all the information about the setup of the
environment on the WN, the launch of the
ATLAS software and the registration of the output
files. It is retrieved by Lexor on your UI under
/tmp/<userName>/JOBDIR/<jobdefid>/<attemptnr>
The latter is the log of the Athena executable and
can be found on a web server:
http://voatlas01.cern.ch/atlas/
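
The directory layout described above can be turned into a small path helper. This is only a sketch. The function names are hypothetical, and it assumes that GridWrapper.log sits directly inside the per-attempt directory, which the slide does not state explicitly.

```python
from pathlib import PurePosixPath


def gridwrapper_log_dir(user_name, jobdefid, attempt_nr):
    """Directory on the UI where Lexor stores the retrieved grid job output."""
    # Layout from the slide: /tmp/<userName>/JOBDIR/<jobdefid>/<attemptnr>
    return (
        PurePosixPath("/tmp")
        / user_name
        / "JOBDIR"
        / str(jobdefid)
        / str(attempt_nr)
    )


def gridwrapper_log_path(user_name, jobdefid, attempt_nr):
    """Expected location of GridWrapper.log for one job attempt."""
    # Assumption: the log file sits directly inside the attempt directory.
    return gridwrapper_log_dir(user_name, jobdefid, attempt_nr) / "GridWrapper.log"


def athena_log_name(job_name):
    """File name of the Athena log, <jobName>.log."""
    return f"{job_name}.log"
```

For example, gridwrapper_log_path("alice", 12345, 2) yields /tmp/alice/JOBDIR/12345/2/GridWrapper.log, where the user name and numbers are made-up values.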
5
Operators' duties
Operators have a quite powerful script to help
them manage failed jobs: it is called pdbAdmin.py
and contains a set of useful queries and update
statements to be performed on the ProdDB. You can
download it from CVS. Its usage is described on
the wiki page
https://twiki.cern.ch/twiki/bin/view/Atlas/PdbAdmin
Operators' duties are outlined at
https://uimon.cern.ch/twiki/bin/view/Atlas/OnShiftInstructions
  • Production is organized in weekly shifts of 3
    operators
    • Production coordinator
    • Data management operator
    • Workload management operator

Instructions on what an operator has to do are
also given in a document by Xavi
Espinal: ProdSysOnShift.pdf
6
What to do in case of
  • Site misconfiguration
    • try to identify the site and, possibly, the name
      of the WN
    • notify the site by either
      • opening a ticket on GGUS
      • opening a ticket on the corresponding ROC
      • contacting the site
  • Athena crash
    • check the <jobName>.log file on the web server
    • check the transformation exit code in the
      GridWrapper.log file
    • verify whether the error is already known and
      unrecoverable
    • open a ticket on the validation Savannah portal

7
What to do in case of
  • Can't copy input file
    • try to identify the LFN of the input file
    • search for it in the LFC catalog
    • try to understand whether it was just a temporary
      failure
    • in case of site problems, contact the site
      (through GGUS, the ROC or the site managers
      directly)
    • in case of DDM problems or errors, contact the
      DDM support mailing list
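
The checklists on the two "What to do" slides can be kept as a small lookup table, so that a shift script can print the next steps for a given kind of failure. This is a sketch only: the dictionary keys are hypothetical labels chosen for illustration, not ProdDB error acronyms, and the step texts are transcribed from the slides.

```python
# Checklists transcribed from the two "What to do" slides.
# The dictionary keys are hypothetical labels, not ProdDB error acronyms.
NEXT_STEPS = {
    "site_misconfiguration": [
        "try to identify the site and, possibly, the name of the WN",
        "notify the site: open a ticket on GGUS, open a ticket on the "
        "corresponding ROC, or contact the site",
    ],
    "athena_crash": [
        "check the <jobName>.log file on the web server",
        "check the transformation exit code in the GridWrapper.log file",
        "verify whether the error is already known and unrecoverable",
        "open a ticket on the validation Savannah portal",
    ],
    "cannot_copy_input": [
        "try to identify the LFN of the input file",
        "search for it in the LFC catalog",
        "try to understand whether it was just a temporary failure",
        "in case of site problems, contact the site "
        "(GGUS, the ROC or the site managers directly)",
        "in case of DDM problems or errors, contact the DDM support "
        "mailing list",
    ],
}


def next_steps(failure_kind):
    """Return the checklist for one failure kind, or raise ValueError."""
    try:
        return list(NEXT_STEPS[failure_kind])
    except KeyError:
        known = ", ".join(sorted(NEXT_STEPS))
        raise ValueError(
            f"unknown failure kind {failure_kind!r}; known kinds: {known}"
        ) from None


if __name__ == "__main__":
    for number, step in enumerate(next_steps("athena_crash"), start=1):
        print(f"{number}. {step}")
```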

8
Useful links
http://www.ggus.org
You can submit bugs to GGUS through its web portal
or simply by writing a mail to support@ggus.org or
atlas-user-support@ggus.org
ATLAS Savannah page for software validation:
https://savannah.cern.ch/bugs/?groupvalidationation
Some other useful links
  • generic ATLAS wiki pages:
    https://uimon.cern.ch/twiki/bin/view/Atlas/WebHome
  • ATLAS Offline Computing web page:
    http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/
  • gLite documentation web page:
    http://glite.web.cern.ch/glite/documentation/