Title: Operations
1Operations
2Something wrong?
Jobs can fail for many different reasons. People
running production are asked to check the failure
reasons and try to fix them. Problems can occur
in different parts of the system: there may be
bugs in the Athena software, inefficiencies in
some grid service, or system commands that hang.
Each of these problems is treated in a
different way.
3Errors of a kind
The most time-consuming (but not the most
annoying) activity is understanding the kind of
problem that a failed job ran into. Some
information can be found in the ProdDB by checking
the error fields in the EJOBEXE table. These two
pieces of information may be enough to understand
why the job failed.
- The database can be queried in different ways
  - scripts using the cx_Oracle library
  - phpOrAdmin web interface
  - job monitoring web page
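
As an illustration of the script-based option, here is a minimal sketch of a helper that reads the error fields of one job from the EJOBEXE table. The names of the error columns are not given in these slides, so the caller has to pass them in; the id column name `JOBDEFID` and the function name are assumptions made for this example only. The helper relies only on the Python DB-API, which a cx_Oracle connection implements.

```python
import re

# Column names are placed in the SQL text, so only plain identifiers are accepted.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def fetch_error_fields(conn, jobdefid, error_columns, id_column="JOBDEFID"):
    """Return the requested error columns of EJOBEXE rows for one job.

    conn          -- an open DB-API connection (for example from cx_Oracle)
    jobdefid      -- value to match against id_column
    error_columns -- names of the error columns to read (supplied by the caller)
    id_column     -- ASSUMED name of the job identifier column
    """
    if not error_columns:
        raise ValueError("at least one error column is required")
    for name in [id_column, *error_columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError("not a plain SQL identifier: %r" % name)

    sql = "SELECT {cols} FROM EJOBEXE WHERE {idcol} = :jobdefid".format(
        cols=", ".join(error_columns), idcol=id_column
    )
    cursor = conn.cursor()
    try:
        # The job id is passed as a named bind variable, never formatted into the SQL.
        cursor.execute(sql, {"jobdefid": jobdefid})
        return cursor.fetchall()
    finally:
        cursor.close()
```

With cx_Oracle the connection would typically come from `cx_Oracle.connect(user, password, dsn)`, using credentials and a DSN provided by the production team. Each returned row is a tuple with one entry per requested column.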
4Errors of a kind
Sometimes the error fields in the EJOBEXE table
are not enough to fully understand what went
wrong. Suppose, for example, that the job died
because of an Athena crash: you would need the
Athena log file. Or the required input file could
not be retrieved by the job, so you need to find
out where this file is and why it was not
transferred.
- Every job has 2 different log files
  - GridWrapper.log
  - <jobName>.log
The former is the log of the grid job, so it
contains all the information about the setup of
the environment on the WN, the launch of the
ATLAS software and the registration of the output
files. It is retrieved by Lexor on your UI under
/tmp/<userName>/JOBDIR/<jobdefid>/<attemptnr>
The latter is the log of the Athena executable and
can be found on a web server:
http://voatlas01.cern.ch/atlas/
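
To make the directory layout above concrete, here is a small sketch that builds the retrieval directory for one job attempt and looks for GridWrapper.log beneath it. The function name and the `tmp_root` parameter are assumptions introduced for this example. The search is recursive because the slides only say the log is retrieved "under" that directory.

```python
from pathlib import Path


def find_gridwrapper_logs(user_name, jobdefid, attemptnr, tmp_root="/tmp"):
    """Return (job_dir, logs) for one job attempt.

    job_dir follows the layout <tmp_root>/<userName>/JOBDIR/<jobdefid>/<attemptnr>;
    logs is a sorted list of GridWrapper.log files found below it
    (empty if the directory does not exist or holds no such file).
    """
    job_dir = Path(tmp_root) / user_name / "JOBDIR" / str(jobdefid) / str(attemptnr)
    if not job_dir.is_dir():
        return job_dir, []
    return job_dir, sorted(job_dir.rglob("GridWrapper.log"))
```

An empty list usually just means that the output of that attempt has not been retrieved to this UI.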
5Operators' duties
Operators have a quite powerful script to help
them manage failed jobs: it is called pdbAdmin.py
and contains a set of useful queries and update
statements to be run on the ProdDB. You can
download it from CVS. Its usage is described on
the wiki page
https://twiki.cern.ch/twiki/bin/view/Atlas/PdbAdmin
Operators' duties are outlined at
https://uimon.cern.ch/twiki/bin/view/Atlas/OnShiftInstructions
- Production is organized in weekly shifts of 3
  operators
  - Production coordinator
  - Data management operator
  - Workload management operator
Instructions on what an operator has to do are
also given in a document by Xavi Espinal:
ProdSysOnShift.pdf
6What to do, in case
- Site misconfiguration
  - try to identify the site and, possibly, the name
    of the WN
  - notify the site by either
    - opening a ticket on GGUS
    - opening a ticket on the corresponding ROC
    - contacting the site
- Athena crash
  - check the <jobName>.log file on the web server
  - check the transformation exit code in the
    GridWrapper.log file
  - verify whether the error is already known and
    unrecoverable
  - open a ticket on the validation Savannah portal
7What to do, in case
- Can't copy input file
  - try to identify the LFN of the input file
  - search for it in the LFC catalog
  - try to understand whether it was just a temporary
    failure
  - in case of site problems, contact the site
    (through GGUS, the ROC or the site managers
    directly)
  - in case of DDM problems or errors, contact the
    DDM support mailing list
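
The catalog lookup in the checklist above can be sketched as follows, assuming the lcg_util client tools (in particular `lcg-lr`, which lists the replicas registered for an LFN) are installed on the UI, that `LFC_HOST` points to the right catalog and that a valid grid proxy exists. The function names, the default VO and the timeout value are choices made for this example, not part of the production tools.

```python
import subprocess


def replica_lookup_command(lfn, vo="atlas"):
    """Build the lcg-lr command line that lists the replicas of one LFN."""
    if not lfn.startswith("/"):
        raise ValueError("expected an absolute LFN path, got %r" % lfn)
    return ["lcg-lr", "--vo", vo, "lfn:" + lfn]


def list_replicas(lfn, vo="atlas", timeout=60):
    """Run lcg-lr and return the replica locations it prints, one per entry.

    The timeout guards against a catalog command that hangs; a non-zero
    exit status is reported as an error together with the tool's stderr.
    """
    result = subprocess.run(
        replica_lookup_command(lfn, vo),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError("lcg-lr failed: " + result.stderr.strip())
    return result.stdout.split()
```

If the lookup returns no replicas, or only replicas at a site with known problems, that points to the site or DDM branch of the checklist rather than to a temporary failure.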
8Useful links
- http://www.ggus.org
  You can submit bugs to GGUS through its web portal
  or simply by writing a mail to
  support_at_ggus.org, atlas-user-support_at_ggus.org
- ATLAS Savannah page for software validation
  https://savannah.cern.ch/bugs/?groupvalidationation
- Some other useful links
  - generic ATLAS wiki pages
    https://uimon.cern.ch/twiki/bin/view/Atlas/WebHome
  - ATLAS Offline Computing web page
    http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/
  - gLite documentation web page
    http://glite.web.cern.ch/glite/documentation/