Operations

Slides: 9

Provided by: judo1

Transcript and Presenter's Notes

1
Operations
2
Something wrong?
Jobs can fail for many different reasons. People
running production are asked to check the failure
reasons and try to fix them. Problems can occur
in different parts of the system: there may be
bugs in the Athena software, inefficiencies in
some grid service, system commands hanging
up... Each of these problems is handled in a
different way.
3
Errors of a kind
The most time-consuming (but not the most
annoying) activity is understanding what kind of
problem a failed job ran into. Some
information can be found in the ProdDB, by checking
the fields

ERRORACRONYM
ERRORTEXT

in the EJOBEXE table. These two fields may
be enough to understand the reason for the failure.

The database can be queried in
different ways:
scripts using the cx_Oracle libraries
phpOrAdmin web interface
job monitoring web page
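
As a rough illustration of the scripted route, the sketch below counts failed-job rows per ERRORACRONYM. It is written against the generic Python DB-API, so the connection could come from cx_Oracle. The table and column names are the ones quoted above; the function name count_errors_by_acronym and the way the connection is obtained are assumptions made for this example only.

```python
from collections import Counter

# EJOBEXE, ERRORACRONYM and ERRORTEXT are the names quoted on the slide.
# Everything else here (function name, how the connection is obtained) is
# a hypothetical illustration, not part of the production tools.
ERROR_QUERY = (
    "SELECT ERRORACRONYM, ERRORTEXT "
    "FROM EJOBEXE "
    "WHERE ERRORACRONYM IS NOT NULL"
)


def count_errors_by_acronym(connection):
    """Return a Counter mapping each error acronym to its number of rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(ERROR_QUERY)
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return Counter(acronym for acronym, _text in rows)


# Possible usage with cx_Oracle (credentials and DSN are placeholders):
#   import cx_Oracle
#   connection = cx_Oracle.connect(user="...", password="...", dsn="...")
#   for acronym, n in count_errors_by_acronym(connection).most_common():
#       print(acronym, n)
```

Sorting the counts with most_common() puts the dominant failure class first, which is usually the one worth investigating before anything else.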

4
Errors of a kind
Sometimes the error fields in the EJOBEXE table
are not enough to fully understand what went
wrong. Suppose, for example, that the job died
because of an Athena crash: you would need the Athena
log file. Or the required input file could not be
retrieved by the job, so you need to find out
where this file is and why it was not transferred.

Every job has 2 different log files:
GridWrapper.log
<jobName>.log

The former is the log of the grid job, so it
contains all the information about the setup of the
environment on the WN, the launching of the
ATLAS software and the registration of the output
files. It is retrieved by Lexor on your UI under
/tmp/<userName>/JOBDIR/<jobdefid>/<attemptnr>
The latter is the log of the Athena executable and
can be found on a web server:
http://voatlas01.cern.ch/atlas/
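
To make the path layout concrete, here is a minimal sketch that builds the location of the grid log from the three pieces named in the path above. It assumes that JOBDIR is a literal directory name and that GridWrapper.log sits directly inside the per-attempt directory; neither is stated explicitly on the slide, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def grid_log_dir(user_name, jobdefid, attemptnr):
    """Directory on the UI where Lexor stores the output of one job attempt.

    Layout taken from the slide: /tmp/<userName>/JOBDIR/<jobdefid>/<attemptnr>
    (treating JOBDIR as a literal name is an assumption).
    """
    return (
        PurePosixPath("/tmp")
        / user_name
        / "JOBDIR"
        / str(jobdefid)
        / str(attemptnr)
    )


def grid_wrapper_log(user_name, jobdefid, attemptnr):
    """Expected path of GridWrapper.log for one job attempt.

    Assumption: the log file sits directly inside the attempt directory.
    """
    return grid_log_dir(user_name, jobdefid, attemptnr) / "GridWrapper.log"
```

For example, grid_wrapper_log("someuser", 1234, 2) gives /tmp/someuser/JOBDIR/1234/2/GridWrapper.log, where the user name and numbers are made-up values.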
5
Operators' duties
Operators have a quite powerful script to help
them manage failed jobs: it's called pdbAdmin.py
and contains a set of useful queries and update
statements to be run on the ProdDB. You can
download it from CVS. Its usage is described on
the wiki page
https://twiki.cern.ch/twiki/bin/view/Atlas/PdbAdmin
Operators' duties are outlined at
https://uimon.cern.ch/twiki/bin/view/Atlas/OnShiftInstructions

Production is organized in weekly shifts of 3
operators:
Production coordinator
Data management operator
Workload management operator

Instructions on what an operator has to do are
also given in a document by Xavi
Espinal, ProdSysOnShift.pdf
6
What to do, in case

Site misconfiguration
try to identify the site and, if possible, the name
of the WN
notify the site in one of these ways:
  open a ticket on GGUS
  open a ticket on the corresponding ROC
  contact the site directly

Athena crash
check the <jobName>.log file on the web server
check the transformation exit code in the
GridWrapper.log file
verify whether the error is already known and
unrecoverable
open a ticket on the validation Savannah portal
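
Checking the exit code by hand gets tedious with many failed jobs, so the step can be roughly automated. The sketch below scans the text of a log for lines that mention an exit code. The format of GridWrapper.log is not described here, so the pattern (the words "exit code" followed by an integer) is purely an assumption and would need adjusting to the real log; the function name is hypothetical as well.

```python
import re

# Assumption: interesting lines contain "exit code" followed, possibly after
# some punctuation, by an integer. The real GridWrapper.log may differ.
EXIT_CODE_RE = re.compile(r"exit\s*code\D*?(-?\d+)", re.IGNORECASE)


def find_exit_codes(log_text):
    """Return (line_number, exit_code, line) for every matching log line."""
    hits = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        match = EXIT_CODE_RE.search(line)
        if match:
            hits.append((line_number, int(match.group(1)), line.strip()))
    return hits
```

A non-zero code found this way only tells you where to start reading; the Athena log still has to be checked to see whether the error is already known.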

7
What to do, in case

Can't copy input file
try to identify the LFN of the input file
search for it in the LFC catalog
try to understand whether it was just a temporary
failure
in case of site problems, contact the site
(through GGUS, the ROC or the site managers
directly)
in case of DDM problems or errors, contact the
DDM support mailing list
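
The three cases covered in "What to do, in case" can be kept as a small lookup table, which is handy if a shift script should print the checklist next to each failed job. This is only a sketch: the step texts are taken from the slides, while the dictionary keys and the function name next_steps are hypothetical labels invented for the example.

```python
# Keys are made-up labels; the step texts follow the slides.
RUNBOOK = {
    "site_misconfiguration": [
        "identify the site and, if possible, the name of the WN",
        "notify the site: GGUS ticket, ROC ticket, or direct contact",
    ],
    "athena_crash": [
        "check the <jobName>.log file on the web server",
        "check the transformation exit code in GridWrapper.log",
        "verify whether the error is already known and unrecoverable",
        "open a ticket on the validation Savannah portal",
    ],
    "input_copy_failure": [
        "identify the LFN of the input file",
        "search for it in the LFC catalog",
        "work out whether it was just a temporary failure",
        "site problem: contact the site (GGUS, ROC or site managers)",
        "DDM problem: contact the DDM support mailing list",
    ],
}


def next_steps(failure_kind):
    """Return a copy of the checklist for one kind of failure."""
    try:
        return list(RUNBOOK[failure_kind])
    except KeyError:
        raise ValueError(f"unknown failure kind: {failure_kind!r}") from None
```

Returning a copy keeps callers from accidentally editing the shared table, and an unknown label fails loudly instead of silently printing nothing.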

8
Useful links
http://www.ggus.org
You can submit bugs to GGUS through its web portal
or simply by writing a mail to
support@ggus.org
atlas-user-support@ggus.org
ATLAS Savannah page for software validation
https://savannah.cern.ch/bugs/?groupvalidationation
Some other useful links:
generic ATLAS wiki pages
https://uimon.cern.ch/twiki/bin/view/Atlas/WebHome
ATLAS Offline Computing web page
http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/
gLite documentation web page
http://glite.web.cern.ch/glite/documentation/
