Title: Grid Troubleshooting
1 Grid Troubleshooting
- Brian Tierney, Dan Gunter
- Lawrence Berkeley National Laboratory
- Laura Pearlman
- Information Sciences Institute
- http://www.cedps-scidac.org/
2 CEDPS in a Nutshell
- Center for Enabling Distributed Petascale Science
- CEDPS is pronounced "seds" (silent P)
- DOE SciDAC Center for Enabling Technology
- July 1, 2006 to June 30, 2011, $2.4M/yr
- Collaboration between 5 sites
- Argonne National Laboratory
- Fermi National Laboratory
- Lawrence Berkeley National Laboratory
- USC Information Sciences Institute
- University of Wisconsin Madison
- Three focus areas
- Moving data to compute resources
- Moving compute services to data sites
- Troubleshooting and diagnosis tools
3 The Troubleshooting Problem
- Large production Grids (OSG, TeraGrid, etc.) report a high failure rate
- 20-30% of jobs submitted to the Grid fail
- mostly authentication errors and disk space problems
- Users don't always notice, as jobs may be automatically resubmitted and may succeed the next time
- Troubleshooting in this environment is very difficult
- Current Approach
- Log into all hosts used (if possible)
- grep various log files looking for problems
- Inconsistent logging levels
- Multiple file formats
- Often a tedious and time-consuming process
4 CEDPS Troubleshooting Goals
- Be useful to all of the following
- Grid Operation Center folks
- Site Administrators
- Grid Users
- Grid Developers
5 Use Case 1: Troubleshooting
- Allow GOC personnel to
- find log messages for jobs from VO=Atlas running at site=FNAL
- find log messages related to service=condor, user=Joe, site=Indiana
- find log messages for user=Joe
- find log messages with status=error
- find all logs where job manager status=killed (i.e., jobs that were killed for running too long)
- find log messages with start events with no matching end event
6 Use Case 2: Monitoring / Performance Analysis
- Allow GOC/site admins to determine
- what sites had connection attempts for a given user DN
- what data files were accessed most often
- which user moved the largest amount of data
- find log messages where the time between start/end events is more than 3X the baseline
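The last query above can be sketched in a few lines. This is a minimal illustration, not CEDPS code: it assumes records have already been parsed into dictionaries with `ts`, `event`, and `guid` fields, and the function name `slow_operations`, the `factor` parameter, and the 60-second baseline are hypothetical. How the baseline itself is measured is left open.

```python
from datetime import datetime

def parse_ts(ts):
    """Parse an ISO-format UTC timestamp such as 2006-12-08T18:39:25.864369Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def slow_operations(events, baseline_seconds, factor=3.0):
    """Pair start/end events by (guid, operation) and return the operations
    that ran longer than factor * baseline_seconds as (guid, operation, seconds)."""
    starts = {}
    slow = []
    for ev in events:
        # "a.b.transfer.start" -> operation "a.b.transfer", phase "start"
        op, _, phase = ev["event"].rpartition(".")
        key = (ev["guid"], op)
        if phase == "start":
            starts[key] = parse_ts(ev["ts"])
        elif phase == "end" and key in starts:
            elapsed = (parse_ts(ev["ts"]) - starts.pop(key)).total_seconds()
            if elapsed > factor * baseline_seconds:
                slow.append((ev["guid"], op, elapsed))
    return slow

# Illustrative records; guids "A" and "B" are made up.
events = [
    {"ts": "2006-12-08T18:39:25.864369Z", "event": "org.globus.gridFTP.transfer.start", "guid": "A"},
    {"ts": "2006-12-08T18:45:02.214369Z", "event": "org.globus.gridFTP.transfer.end", "guid": "A"},
    {"ts": "2006-12-08T18:50:00.000000Z", "event": "org.globus.gridFTP.transfer.start", "guid": "B"},
    {"ts": "2006-12-08T18:50:30.000000Z", "event": "org.globus.gridFTP.transfer.end", "guid": "B"},
]
slow = slow_operations(events, baseline_seconds=60.0)
```

With a 60-second baseline, only the transfer tagged "A" (about 336 seconds) exceeds the 3X threshold.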
7 Use Case 3: User Debugging / Provenance
- Allow a user to query for their own logs
- find log messages for all my jobs
- find log messages with status=error
- Allow a user to determine all hosts/services that their job used
- find log messages related to Job X
- use this to determine what hosts were actually used
8 Overview of our approach
[Diagram: Grid stack]
- Current focus
- Monitoring of Grid middleware
- Normalize information being logged
- Collect logs at each site (syslog-ng)
- Load logs at each site into a relational database (MySQL)
- Query cross-site with a distributed database layer (OGSA-DAI)
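The load-and-query step can be illustrated with a small sketch. It swaps in Python's built-in `sqlite3` (in memory) for the per-site MySQL database, and the table name `log`, its columns, and the sample records are assumptions made for the example. The cross-site OGSA-DAI layer is not shown.

```python
import sqlite3

# Already-parsed log records; the values are illustrative only.
records = [
    {"ts": "2006-12-08T18:39:25.864369Z", "event": "org.globus.gridFTP.transfer.start", "guid": "G1"},
    {"ts": "2006-12-08T18:45:02.214369Z", "event": "org.globus.gridFTP.transfer.end", "guid": "G1", "status": "0"},
    {"ts": "2006-12-08T18:46:00.000000Z", "event": "org.globus.gridFTP.authn.end", "guid": "G2", "status": "-1"},
]

db = sqlite3.connect(":memory:")  # stand-in for the site's MySQL database
db.execute("CREATE TABLE log (ts TEXT, event TEXT, guid TEXT, status INTEGER)")
for rec in records:
    # Start events carry no status, so store NULL for them.
    status = int(rec["status"]) if "status" in rec else None
    db.execute(
        "INSERT INTO log (ts, event, guid, status) VALUES (?, ?, ?, ?)",
        (rec["ts"], rec["event"], rec["guid"], status),
    )
db.commit()

# Example query: which operations ended with a negative (failure) status?
failed = db.execute("SELECT guid, event FROM log WHERE status < 0").fetchall()
```

Once the records are in a relational table, the use-case queries become ordinary SQL `WHERE` clauses on the normalized fields.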
9 Log Normalization
- Troubleshooting effectively requires correct, precise, and understandable logs
- Time-synchronized hosts (using NTP)
- Information logged at the
- Start of all important operations
- End of all important operations
- Unique identifier in space/time for the operation
- Basic attributes describing the operation
- who (user DN)
- where (host IP addresses)
- what (operation arguments)
- A simple, parseable log format
10 Social aspects of logging
- Developers add log messages for their own benefit
- Their test environment is far simpler and more reliable than the actual deployment
- maybe even their own laptop!
- Necessary to convince developers that
- this will not be a performance bottleneck
- there is a benefit for them
- namely, better ways to find bugs before deploying the software and recreate them afterwards
11 Correlating log messages
- Grid workflows are highly parallel
- multiple sites
- data resources, instruments, compute resources
- multiple components in each place
- all this being used by multiple users and jobs at the same time!
- In order to separate out which component activities were associated with your job, local and global identifiers need to be recorded together whenever possible
12 Logging Best Practices Recommendations
- Practices
- All logs should contain a unique event name and an ISO-format timestamp
- All system operations that might fail or experience performance variations should be wrapped with start and end events.
- All logs from a given execution thread should be tagged with a globally unique ID (or GUID), such as a Universally Unique Identifier (UUID)
- Log format
- Logs should be composed of lines of ASCII name=value pairs
- Example
- ts=2006-12-08T18:48:27.598448Z event=org.globus.gridFTP.transfer.start prog=GridFTP-v4.2 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2
- http://www.cedps.net/wiki/index.php/LoggingBestPractice
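A writer for this format might look like the sketch below. It is an illustration under stated assumptions rather than the CEDPS library: the function name `format_event`, the choice to double-quote values that contain spaces, and the sample field values are all hypothetical.

```python
import uuid
from datetime import datetime, timezone

def format_event(event, guid, fields=None):
    """Build one log line of name=value pairs with an ISO-format UTC timestamp."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    pairs = [("ts", ts), ("event", event), ("guid", guid)]
    pairs += list((fields or {}).items())
    out = []
    for name, value in pairs:
        text = str(value)
        if " " in text:
            # Assumption of this sketch: quote values that contain spaces.
            text = '"' + text.replace('"', '\\"') + '"'
        out.append(name + "=" + text)
    return " ".join(out)

guid = str(uuid.uuid4()).upper()
line = format_event(
    "org.globus.gridFTP.transfer.start",
    guid,
    {"file": "/tmp/myfile", "src.host": "H1", "dst.host": "H2"},
)
```

Keeping `ts`, `event`, and `guid` first on every line makes the logs easy to scan by eye as well as by parser.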
13 Event Names
- Use a '.' as a separator and go from general to specific
- Same as Java class names
- First part of name should be used as a unique namespace (e.g. org.globus)
- Use start/end suffixes whenever possible
- Helps immensely with troubleshooting
- Examples
- org.globus.gridFTP.start
- org.globus.gridFTP.authn.start
- org.globus.gridFTP.authn.end
- org.globus.gridFTP.transfer.start
- org.globus.gridFTP.transfer.end
- org.globus.gridFTP.end
- org.globus.MDS.response.start
- org.globus.MDS.query.start
- org.globus.MDS.query.end
- org.globus.MDS.write.net.start
- org.globus.MDS.write.net.end
- org.globus.MDS.response.end
14 Reporting Errors
- Errors should be reported as part of the end event if possible
- Use status=N (>= 0 means success)
- Not attempting to define other status codes
- too hard to get agreement on these
- Example
- ts=2006-12-08T18:39:23.114369Z event=org.globus.authz.gridmap.end status=-1 DN="/O=CEDS/CN=Some User" msg="Cannot open gridmap file /etc/grid-security/grid-mapfile for reading" guid=F7D64975-069A-4152-A21F-57109AA46DFA level=ERROR
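One way to make start/end wrapping and error reporting automatic is a context manager that always emits the end event, with the failure attached to it. The sketch below is hypothetical: the name `logged`, the `emit` callback, the `org.example` namespace, and the use of -1 for any exception are assumptions for illustration.

```python
import contextlib

@contextlib.contextmanager
def logged(event_base, emit, **fields):
    """Emit <event_base>.start, run the wrapped block, then emit <event_base>.end
    with status=0 on success or status=-1 plus msg=... on failure."""
    emit(dict(event=event_base + ".start", **fields))
    try:
        yield
    except Exception as exc:
        # Report the error as part of the end event, then re-raise it.
        emit(dict(event=event_base + ".end", status=-1, msg=str(exc), **fields))
        raise
    else:
        emit(dict(event=event_base + ".end", status=0, **fields))

captured = []  # stands in for a real log sink

with logged("org.example.copy", captured.append, guid="27023"):
    pass  # the operation being wrapped would go here

try:
    with logged("org.example.copy", captured.append, guid="27024"):
        raise IOError("disk full")
except IOError:
    pass
```

Because the end event is emitted on both paths, a start with no matching end then points to a hang or crash rather than to a forgotten log call.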
15 Globally Unique IDs
- Use the guid reserved name to allow correlation of a set of events together
- event=org.globus.gridFTP.authn.start guid=27023
- event=org.globus.gridFTP.authn.end guid=27023
- event=org.globus.gridFTP.transfer.start guid=27023
- event=org.globus.gridFTP.transfer.end guid=27023
- Recommend use of standard program uuidgen to generate globally unique ID
- e.g. A5A563CD-D80C-4E58-9ECD-79C6B611E122
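Correlation by guid amounts to grouping records on that one field. A minimal sketch follows, using Python's standard `uuid` module in place of the `uuidgen` program; the helper names and the second guid (27024) are made up for the example.

```python
import uuid
from collections import defaultdict

def new_guid():
    """Generate a random UUID, upper-cased to match the example above."""
    return str(uuid.uuid4()).upper()

def group_by_guid(records):
    """Collect event names per guid, preserving log order."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["guid"]].append(rec["event"])
    return dict(groups)

# Interleaved records from two sessions.
records = [
    {"event": "org.globus.gridFTP.authn.start", "guid": "27023"},
    {"event": "org.globus.gridFTP.authn.start", "guid": "27024"},
    {"event": "org.globus.gridFTP.authn.end", "guid": "27023"},
    {"event": "org.globus.gridFTP.transfer.start", "guid": "27023"},
]
sessions = group_by_guid(records)
```

Even when events from many sessions are interleaved in one file, each guid yields its own ordered event sequence.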
16 Logging Example
Log file
time=T1 event=job.create.start
time=T2 event=job.create.end job.id=J1 status=0
time=T3 event=job.run.start job.id=J1
time=T4 event=job.run.end job.id=J1 status=-1
time=T5 event=job.copy.start job.id=J1
time=T6 event=job.copy.end job.id=J1 status=0
Logical flow
create job
run job
copy results
17 Example Log: GridFTP
- ts=2006-12-08T18:39:23.114369Z event=org.globus.gridFTP.start prog=GridFTP-4.0.3 localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
- ts=2006-12-08T18:39:23.114567Z event=org.globus.gridFTP.authn.start DN=/DC=org/DC=doegrids/OU=People/CN=Somebody guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
- ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end DN=/DC=org/DC=doegrids/OU=People/CN=Somebody msg="123456 successfully authorized" localUser=uscmspool381 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
- ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize=262144 numStreams=1 numStripes=1 destHost=129.79.4.64 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
- ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred=678433 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
- ts=2006-12-08T18:45:02.214386Z event=org.globus.gridFTP.end guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=226
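Lines like these are straightforward to parse. The sketch below assumes name=value pairs separated by spaces, with double quotes around values that contain spaces; `parse_line` is a hypothetical helper, not one of the CEDPS parsers, and it leans on the standard `shlex` module for the quoting.

```python
import shlex

def parse_line(line):
    """Parse one log line of name=value pairs into a dict.
    shlex keeps a double-quoted value with spaces together as one token."""
    record = {}
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if sep:  # skip tokens that are not name=value
            record[name] = value
    return record

line = (
    'ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end '
    'DN=/DC=org/DC=doegrids/OU=People/CN=Somebody '
    'msg="123456 successfully authorized" localUser=uscmspool381 status=0'
)
rec = parse_line(line)
```

Splitting each token on the first "=" only is what lets a DN, which itself contains "=" characters, survive as a single value.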
18 Example Scenario: Firewall blocking GridFTP
- globus-url-copy from server A to server B (3rd party transfer) hangs: why?
- gridftp server logs contain the following
- ts=2008-02-27... id=30922 event=globus-gridftp-server.start mode=inetd
- ts=2008-02-27... id=30922 event=globus-gridftp-server.session.start
- ts=2008-02-27... id=30922 event=globus-gridftp-server.session.authn.start DN=/DC...
- ts=2008-02-27... id=30922 event=globus-gridftp-server.session.authn.end status=0
- ts=2008-02-27... id=30922 event=globus-gridftp-server.session.authz.start
- ts=2008-02-27... id=30922 event=globus-gridftp-server.session.authz.end status=0
- ts=2008-02-27... id=30922 event=globus-gridftp-server.transfer.start
- But the logs are missing the remaining events
- globus-gridftp-server.transfer.end
- globus-gridftp-server.session.end
- globus-gridftp-server.end
- Since both authn and authz succeeded, a firewall is likely the problem
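Spotting the missing end events can be automated. A minimal sketch, assuming each record has been parsed into a dictionary with `id` and `event` fields; the function name `unmatched_starts` is hypothetical.

```python
def unmatched_starts(records):
    """Return (id, operation) pairs that logged a .start but no matching .end."""
    open_ops = []
    for rec in records:
        # "x.session.authn.start" -> operation "x.session.authn", phase "start"
        op, _, phase = rec["event"].rpartition(".")
        key = (rec["id"], op)
        if phase == "start":
            open_ops.append(key)
        elif phase == "end" and key in open_ops:
            open_ops.remove(key)
    return open_ops

# Event names taken from the scenario above, in log order.
events = [
    "globus-gridftp-server.start",
    "globus-gridftp-server.session.start",
    "globus-gridftp-server.session.authn.start",
    "globus-gridftp-server.session.authn.end",
    "globus-gridftp-server.session.authz.start",
    "globus-gridftp-server.session.authz.end",
    "globus-gridftp-server.transfer.start",
]
records = [{"id": "30922", "event": name} for name in events]
missing = unmatched_starts(records)
```

For this log the result names the server, session, and transfer operations as still open, which matches the three missing end events listed above.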
19 WSRF Logging
- Goal is to include enough information in log messages to
- correlate log messages involved in servicing a single request
- correlate log messages associated with the same resource
- Example
- GRAM invokes an RFT service within the same container when servicing a GRAM request
- need to be able to correlate the GRAM logs with the RFT logs for this user job
- For WSRF services, we suggest
- use the guid as the session ID
- add a resource ID to uniquely identify a resource
- for each unique guid (session ID), there should be at least one log message linking that guid with a resource.id
- a hash of the EPR can be used for the resource ID
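The linking message could be produced as in the sketch below. Everything specific here is an assumption: SHA-1 is an arbitrary choice of hash, the EPR is shown as a plain string (a real EPR is an XML document and would need a canonical serialization first), and the event name, guids, and URL are invented for illustration.

```python
import hashlib

def resource_id(epr_text):
    """Derive a stable resource ID by hashing the serialized EPR."""
    return hashlib.sha1(epr_text.encode("utf-8")).hexdigest()

def link_record(guid, epr_text):
    """Build the log record that ties a session guid to a resource.id."""
    return {
        "event": "org.example.resource.link",  # hypothetical event name
        "guid": guid,
        "resource.id": resource_id(epr_text),
    }

# Hypothetical EPR string shared by two services handling the same job.
epr = "https://host.example.org:8443/wsrf/services/ExampleService?key=42"
gram_link = link_record("guid-gram-1", epr)
rft_link = link_record("guid-rft-1", epr)
```

Two sessions with different guids that touch the same resource end up with the same `resource.id`, which is what lets their logs be joined.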
20 Log collection architecture
[Diagram: at each site, groups of nodes send logs through syslog-ng to a log parser / DB loader, which feeds MySQL]
21 syslog-ng
- No need to invent something new for this: the syslog-ng tool fills all requirements
- Open source, runs on all major OSes
- Fault tolerant, secure (via stunnel), scalable, easy to configure, etc.
- Large user base
- Can filter logs based on level and content
- Any number of sources and destinations
- Can act as a proxy, tunnel through firewalls
- Execute programs: send email, load DB, etc.
- Built-in log rotation
- Timezone support and fully qualified host names
- http://www.balabit.com/products/syslog-ng/
[Diagram: node software, logs in /opt/log/, syslog-ng sender, syslog-ng receiver]
22 Logging Architecture for the Open Science Grid
23 OGSA-DAI Integration Goals
- Provide access to log data over the grid
- Support flexible authorization policies
- E.g., site admins can see local site data, VO admins can see data related to their VO, and users can see data related to jobs running under their DN
- Facilitate queries across multiple sites
- Possibly add support for joins with other existing databases
- E.g., GRAM audit db
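The example policy can be expressed as a simple filter over log records. This sketch only illustrates the policy logic; it is not how OGSA-DAI enforces authorization, and the role names, field names, and sample records are hypothetical.

```python
def visible_records(records, role, site=None, vo=None, dn=None):
    """Filter records by role: site admins see their site's data,
    VO admins their VO's data, and users only jobs run under their own DN."""
    if role == "site_admin":
        return [r for r in records if r.get("site") == site]
    if role == "vo_admin":
        return [r for r in records if r.get("vo") == vo]
    if role == "user":
        return [r for r in records if r.get("dn") == dn]
    return []  # unknown roles see nothing

# Illustrative records only.
records = [
    {"site": "FNAL", "vo": "Atlas", "dn": "/CN=Joe", "event": "job.run.end"},
    {"site": "FNAL", "vo": "CMS", "dn": "/CN=Ann", "event": "job.run.end"},
    {"site": "Indiana", "vo": "Atlas", "dn": "/CN=Joe", "event": "job.run.start"},
]
fnal_view = visible_records(records, "site_admin", site="FNAL")
atlas_view = visible_records(records, "vo_admin", vo="Atlas")
joe_view = visible_records(records, "user", dn="/CN=Joe")
```

Defaulting to an empty result for unrecognized roles keeps the sketch fail-closed.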
24 OGSA-DAI Deployment
[Diagram: site DBs with views (View1, View2), per-site OGSA-DAI services, PDPs/PIPs, a top-level OGSA-DAI service with Resource group 1 and Resource group 2, and users 1 and 2]
25 Current Status
- syslog-ng -> parser -> loader -> MySQL pipeline now running on LBL OSG cluster and at NERSC
- log parsers for Globus 4.2 logs, Globus 4.1 gatekeeper, SGE, Pegasus
- working on DB query tools
- working on OGSA-DAI integration
- working on Condor log parsers
26 More Information
- CEDPS Troubleshooting
- http://www.cedps.net/index.php/Troubleshooting
- Contact us if you need troubleshooting help!
- email: BLTierney_at_lbl.gov, DKGunter_at_lbl.gov