Title: Grid Troubleshooting and Performance Analysis with the NetLogger Toolkit
1. Grid Troubleshooting and Performance Analysis with the NetLogger Toolkit
- Dan Gunter, Brian Tierney (LBNL)
- Laura Pearlman (ISI)
2. Overview
- The Problem
  - When building distributed systems, we often observe unexpectedly low performance, and the reasons are usually not obvious
  - The bottleneck can be in any of the following components:
    - the applications
    - the middleware
    - the operating systems
    - the disks, network adapters, bus, memory, etc. on either the sending or receiving host
    - the network switches and routers, and so on
- The Solution
  - Highly instrumented systems with precision timing information and targeted analysis tools
3. Instrumentation
- Instrumentation should:
  - be non-intrusive
  - be easy to use
  - provide real-time monitoring ability
  - use standard format(s)
  - use precision timestamps
- Sometimes adding new instrumentation isn't an option - learn to parse existing logs
4. The NetLogger Toolkit
- The NetLogger Toolkit is...
  - a methodology
  - a unified format/API for source-code instrumentation:
    - C
    - Java
    - Python
    - Perl
    - Shell / command-line (!)
  - an implementation of CEDPS monitoring ideas
  - a platform for experimentation with research ideas like on-the-fly summarization and anomaly detection
  - a Globus Incubator project
5. Getting the software
- Main download page:
  - http://acs.lbl.gov/NetLoggerWiki/index.php/Downloads
- Also on the Globus Incubator page under, surprisingly, NetLogger
- NetLogger is now bundled with Globus 4.2 (rc2)
  - ./configure --enable-netlogger
  - adds detailed NetLogger instrumentation of the GridFTP client and server
6. Methodology: tracing with start/end calipers
- Log an event named foo.start before foo(), and one named foo.end after foo()
- Timestamps will be automatically added to both events
- The developer must add identifiers to link the events together (and with other related events)
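The caliper pattern can be sketched in a few lines of Python. This is an illustrative sketch using only the standard library, not the actual NetLogger API; `log_event` is a hypothetical helper that emits the name=value line format shown later in these slides.

```python
import datetime
import uuid

def log_event(event, **attrs):
    # Emit one NetLogger-style line: ts=... event=... key=value ...
    ts = datetime.datetime.now(datetime.timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%S.%fZ")
    fields = " ".join("%s=%s" % (k, v) for k, v in sorted(attrs.items()))
    print("ts=%s event=%s %s" % (ts, event, fields))

def foo():
    pass  # the operation being timed

guid = uuid.uuid4()  # identifier linking the .start event to its .end
log_event("foo.start", guid=guid)
foo()
log_event("foo.end", guid=guid, status=0)
```

In the real toolkit, the language bindings handle the formatting, timestamps, and transport.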
7. Analyzing start/end events
- By using start/end calipers around arbitrary user-defined functions, we can always tell:
  - if the call failed: negative status code in the .end event
  - if something never finished: missing .end event
  - if the start message was lost: missing .start event
  - duration within a call: the developer doesn't need to explicitly provide this; it works across execution contexts
  - time between calls: from .end to the next .start
- Drawbacks of this approach:
  - for calls in the same context which completed, 2 calls instead of 1 is wasteful
  - depends on timestamps for duration
  - .start needs enough info to tell what something was
8. Visualization with NetLogger Lifelines
- If applications specify an identifier for related events, the NetLogger visualization tools can generate an object lifeline
- Current tool: the nl_lifeline command-line program (gnuplot or R)
- Historical tool: nlv (no longer supported)
9. Instrumentation example
- Instrumenting a for-loop (pseudo-code):

    DO for i = 0 to 9
        LOG(event=foo.start, n=i, guid=xyz)
        DO for j = 0 to 99
            foo(j,i); bar(i,j)
        DONE
        LOG(event=foo.end, n=i, guid=xyz, status=0)
    DONE
10. NetLogger Logs
- Sample output:

    ts=2008-05-12T23:04:44.401309Z event=foo.start level=Info guid=xyz n=0
    ts=2008-05-12T23:04:44.637781Z event=foo.end level=Info guid=xyz n=0 status=0
    ...
    ts=2008-05-12T23:04:48.666704Z event=foo.start level=Info guid=xyz n=9
    ts=2008-05-12T23:04:48.905270Z event=foo.end level=Info guid=xyz n=9 status=0

- Verbose, but each line of output is portable and self-describing
- Automatic:
  - precision timestamp
  - logging level
- See http://www.cedps.net/index.php/LoggingBestPractices for details
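Because every line is made of self-describing name=value pairs, parsing is nearly trivial. A minimal sketch (not the toolkit's nl_parser; `parse_nl` is a hypothetical helper):

```python
def parse_nl(line):
    # Split each whitespace-separated field on the first '=' only,
    # so values may themselves contain '=' characters.
    return dict(field.split("=", 1) for field in line.split())

rec = parse_nl("ts=2008-05-12T23:04:44.401309Z event=foo.start "
               "level=Info guid=xyz n=0")
# rec["event"] == "foo.start" and rec["n"] == "0"
```

This sketch assumes simple values; quoted values with embedded spaces would need a real tokenizer.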
11. Language APIs
- Perl, Java: just formatters
  - ensures compatibility with log4j, log4perl
- Python is either a formatter or a writer (like C)
  - hooks for the Python logging module
- Shell is, well, a shell script
  - nl_write event=foo.start n=i guid=123
- C has the most features
  - can write to TCP, UDP, file, or syslog
  - built-in statistical log summarization for high-volume logging
12. Log summarization
- And now for something completely somewhat different...
13. Time-based summarization
Imagine you want detailed instrumentation of a simple loop that reads data, processes it, then writes out some result.

    DO for i = 1, N
        log(read.start, ...)
        read_data()
        log(read.end, ...)
        process_data()
        log(write.start, ...)
        write_data()
        log(write.end, ...)
    DONE
14. Time-based summarization
Report only a periodic summary (mean, standard deviation) of the times and values between start and end events (of a given type and ID).
- Orders of magnitude data reduction
- Often increases understanding of the data
- Since the original instrumentation is still there, you can easily turn the full detail on and off in a running program
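The idea can be sketched as follows; this is an assumed re-implementation in Python for illustration, not the C library's actual API. Each interval, one summary event replaces many start/end pairs.

```python
import math
import time

class Summarizer:
    """Accumulate durations; emit one summary event per interval."""
    def __init__(self, event, interval=10.0):
        self.event, self.interval = event, interval
        self.t0 = time.time()
        self.n, self.total, self.total_sq = 0, 0.0, 0.0

    def add(self, duration):
        self.n += 1
        self.total += duration
        self.total_sq += duration * duration
        if time.time() - self.t0 >= self.interval:
            self.flush()

    def flush(self):
        if self.n:
            mean = self.total / self.n
            # population variance from running sums; clamp rounding error
            var = max(self.total_sq / self.n - mean * mean, 0.0)
            print("event=%s.summary n=%d mean=%.6f sd=%.6f"
                  % (self.event, self.n, mean, math.sqrt(var)))
        self.n, self.total, self.total_sq = 0, 0.0, 0.0
        self.t0 = time.time()

s = Summarizer("read")
for d in (0.010, 0.012, 0.009):  # three read durations
    s.add(d)
s.flush()  # one event=read.summary line instead of three start/end pairs
```

Keeping only running sums (count, total, total of squares) is what makes the data reduction cheap: memory use is constant no matter how many events occur per interval.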
15. Example: GridFTP Bottleneck Detector
- GridFTP, previously:
  - by default, logs only a single throughput number
  - GridFTP throughput might be limited by:
    - the sending/receiving disk
    - an NFS/AFS-mounted partition?
    - the network
  - currently no way to tell
- GridFTP 4.2:
  - uses the NetLogger summarization library to monitor all I/O streams
  - reports disk and network numbers separately
  - use the -nl-bottleneck option
16. GridFTP with Bottleneck Detection
17. Sample output (1)

    globus-url-copy -len 100000000 -vb -nl-bottleneck bmask \
        ftp://host1:port/tmp/bigfile ftp://host2:port/tmp/file123

    Source: ftp://pabst:45000/tmp/
    Dest:   ftp://grolsch:45000/tmp/
    bigfile -> file123
        100000000 bytes  119.21 MB/sec avg  119.21 MB/sec inst
    Total instantaneous throughput:
        disk read  = 3608.2 Mbits/s
        disk write = 1386.1 Mbits/s
        net read   = 1723.9 Mbits/s
        net write  = 1047.7 Mbits/s
    Bottleneck: maybe disk write

- Disk write could be pushing back on net write (something we see in GridFTP, not nlioperf)
18. Sample output (2)

    globus-url-copy -len 100000000 -vb -nl-bottleneck bmask \
        ftp://host1:port/tmp/bigfile ftp://host2:port/tmp/file123

    Source: ftp://pabst:45000/tmp/
    Dest:   ftp://grolsch:45000/tmp/
    bigfile -> file123
        100000000 bytes  119.21 MB/sec avg  119.21 MB/sec inst
    Total instantaneous throughput:
        disk read  = 1235.7 Mbits/s
        disk write = 2773.0 Mbits/s
        net read   = 836.3 Mbits/s
        net write  = 1011.7 Mbits/s
    Bottleneck: network
19. Log Generation and Collection for GridFTP
20. Anomaly detection example
- Plot shows 10-second summaries of instantaneous throughput vs. time
- 3 different algorithms detecting anomalies:
  - static: fixed bounds
  - Mean/SD: +/- 2 standard deviations
  - ECDF: tail of the data histogram
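As a concrete illustration of the Mean/SD variant (a plain-Python sketch, not the actual detector code): flag any summary value more than 2 standard deviations from the series mean.

```python
import math

def anomalies(values, k=2.0):
    """Return the values lying more than k standard deviations
    from the mean of the series (population standard deviation)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [v for v in values if abs(v - mean) > k * sd]

# Twenty steady 10 MB/s summaries and one 100 MB/s spike:
print(anomalies([10.0] * 20 + [100.0]))  # only the spike is flagged
```

The static and ECDF detectors differ only in how the bounds are derived: fixed thresholds versus the tails of the empirical distribution.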
21. Log analysis pipeline
- We now return to our regularly scheduled programming...
- But it's good to remember that the goal is to do something interesting with the data
- The data collection I'm about to describe is just the necessary plumbing
22. Analyzing the Logs
- So what do we do now?
  - grep
  - vi
  - less
  - convert to CSV and load into R?
- Hmm... we need queries to join together various related data items
  - maybe use SQL!
- OK -- in reality, all of the above are sometimes useful; that's why we have ASCII logs instead of binary ones
23. Logs -> RDBMS: nl_pipeline
- Assume that logs are somehow arriving in a central location
  - e.g., a syslog-ng receiver writing to /var/log/grid/
  - they may be in different files / directories
- nl_pipeline continuously parses and loads the logs into a relational DB (MySQL or sqlite)
- nl_parser: a configurable log parser
  - starts a new output file every <time-period>
- nl_loader:
  - reads output files, and can delete them when done
- All components save their current state, so they can be restarted without loss of data
- All are part of the NetLogger Toolkit
24. nl_pipeline design
- Simple naming convention between parser and loader
25. Database Schema
- Entity-Attribute-Value (EAV)
- Idea: build views on top
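A minimal sqlite3 sketch of the EAV idea with a view on top. The table, view, and column names here are illustrative assumptions modeled on the queries later in these slides, not the actual nl_loader schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- One narrow row per event, one name/value row per attribute
CREATE TABLE event (id INTEGER PRIMARY KEY, ts TEXT, name TEXT);
CREATE TABLE attr  (e_id INTEGER, name TEXT, value TEXT);
-- A view pivots the attributes back into a familiar wide table
CREATE VIEW invocation AS
  SELECT e.id, e.ts,
         MAX(CASE WHEN a.name='host'     THEN a.value END) AS host,
         MAX(CASE WHEN a.name='duration' THEN a.value END) AS duration
  FROM event e JOIN attr a ON a.e_id = e.id
  WHERE e.name = 'pegasus.invocation'
  GROUP BY e.id;
""")
db.execute("INSERT INTO event VALUES (1, '2008-05-12T23:04:44Z', "
           "'pegasus.invocation')")
db.executemany("INSERT INTO attr VALUES (1, ?, ?)",
               [("host", "node42"), ("duration", "3.5")])
row = db.execute("SELECT host, duration FROM invocation").fetchone()
# row == ('node42', '3.5')
```

The narrow event/attr tables absorb any event type without schema changes, while views give analysts conventional wide tables to query.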
26. Schema rationale
- Monitoring data requirements:
  - high update rate
    - tens or hundreds of events per second
  - low query rate
    - only a handful of admins
    - relatively few user-specific queries
  - need recent data (1 min. lag OK, 1 hour not)
- Troubleshooting data requirements:
  - heterogeneous: GRAM, WS-GRAM, GridFTP, RFT, Condor, GUMS, etc.
- We don't know any better
27. Preliminary work with Pegasus/Kickstart logs
- Pegasus: a workflow manager from ISI
- Kickstart records an XML document for each job (invocation)
- Sample dataset: a 20-hour CyberShake (seismological simulation) run on 361 nodes
  - 884,547 jobs
  - > 40,000 jobs per hour, > 10 per second
- Serial parse/load: 80 min (40 min each)
  - hack alert: not overlapped / ad-hoc file batching
  - but still roughly 20x real-time
28. Example query on Pegasus log
- How many jobs ran per host?

    SELECT count(id), value
    FROM event JOIN attr ON e_id = id
    WHERE event.name = 'pegasus.invocation'
      AND attr.name = 'host'
    GROUP BY value

- (think: smoothed histogram)
29. Query performance
- On the Pegasus data, it takes O(20 sec) to get the sum of job times (even though we have a duration attribute)
- Why?
  - Answer: the join with the event table (800K x 800K rows)
- Can see this by breaking this query:

    select sum(value) from attr join event on e_id = id
    where attr.name = 'duration'
      and event.name = 'pegasus.invocation'   -- (21.37 sec)

  into:

    create table duration
    select e_id, value from attr join event on e_id = id
    where attr.name = 'duration'
      and event.name = 'pegasus.invocation'   -- (23.79 sec)

    select sum(value) from duration           -- (0.27 sec)
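The same materialize-then-aggregate split can be reproduced on toy data with sqlite3. An illustrative sketch: the table names mirror the slide, but the MySQL timings above obviously won't reproduce on three rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attr  (e_id INTEGER, name TEXT, value REAL);
""")
for i in range(1, 4):
    db.execute("INSERT INTO event VALUES (?, 'pegasus.invocation')", (i,))
    db.execute("INSERT INTO attr VALUES (?, 'duration', ?)", (i, 10.0 * i))

# One-shot query: join and aggregate in a single statement
(total,) = db.execute(
    "SELECT sum(value) FROM attr JOIN event ON e_id = id "
    "WHERE attr.name = 'duration' "
    "AND event.name = 'pegasus.invocation'").fetchone()

# Split version: materialize the join once, then aggregate cheaply
db.execute("CREATE TABLE duration AS "
           "SELECT e_id, value FROM attr JOIN event ON e_id = id "
           "WHERE attr.name = 'duration' "
           "AND event.name = 'pegasus.invocation'")
(total2,) = db.execute("SELECT sum(value) FROM duration").fetchone()
assert total == total2 == 60.0  # both forms give the same answer
```

The split pays off when the join is expensive and the materialized table is reused for many aggregations.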
30. Open schema issues
- Two obvious (to me) paths forward:
  - insert data into specific tables
    - globus...
    - condor
    - pegasus
    - gums
    - etc.
  - periodically, or on demand, query the EAV schema to populate specific tables
- Not sure which is best
- Need to balance flexibility, load performance, and query performance
31. Unifying schemas: OGSA-DAI
- And now for something not quite completely different...
- If someone else hasn't already covered it...
32. Planned OGSA-DAI use
- OGSA-DAI is a web-services interface to databases
- Two planned uses:
  - Single database
    - map users to database users using grid certs
    - users can only see their own logs
    - admins can only see logs for their site
  - Multiple databases
    - send a query to multiple sites, merge the results
    - good for troubleshooting jobs that cross site database boundaries
33. OGSA-DAI Deployment
[Diagram: users 1 and 2 send queries through OGSA-DAI resource groups 1 and 2; each site DB is exposed through views (View1, View2), with access mediated by PDPs/PIPs]
34. Summary
- Overview of the NetLogger Toolkit
  - unified format/API for instrumentation (traces)
  - methodology and analysis
  - log summarization
- The NetLogger Toolkit's log parsing pipeline
  - nl_parser, nl_loader, nl_pipeline
- Database schema issues
  - Pegasus results
- Future work: OGSA-DAI
  - authorization using X.509
  - distributed queries
35. Next steps for CEDPS
- Collecting Grid logs at NERSC
- Also collecting logs on our own small OSG ITB/VTB cluster
- Gathering requirements
- Coming up with lists of useful queries
- Feel free to drop us an email; point us at your data, related publications, blog, etc.
36. References
- NetLogger Toolkit
  - http://acs.lbl.gov/NetLoggerWiki/
- Globus Incubator
  - http://dev.globus.org/wiki/Incubator/NetLogger
- CEDPS home page
  - http://cedps-scidac.org
- OGSA-DAI home
  - http://www.ogsadai.org.uk/