Title: Logfile Analysis
1 Log File Analysis
Jens-S. Vöckler
Desire II Cache Workshop, Budapest
02.03.2000
2 Overview
- Log file processing: which log files exist?
- Log file analyser programs: an overview.
- Do-it-yourself: analysing log files yourself.
- How to create long-term observations.
3 Log file processing vs. monitoring
Log file analysis:
- look at things after they happened,
- find out about trends,
- long-term conception.
Monitoring:
- show behaviour instantly,
- ring alarm bells before a catastrophe,
- short-term conception.
⇒ The aims are different!
4 Log file processing: online vs. offline
- Online processing degrades cache performance:
  - Processing on the cache host causes a high performance loss.
  - Continuous transfers elsewhere ⇒ IPC overhead.
  - The log file would have to be polled constantly.
- Offline processing allows more options:
  - Processing granularities are user-defined and repeatable.
  - Transfer of log files during idle intervals.
⇒ Offline processing is preferable.
5 Log files kept by Squid
- Squid keeps several log files:
  - Explicit compile-time activation for some, e.g. useragent.log.
  - Explicit run-time deactivation for others, e.g. store.log.
- Basic assumption: Squid 2.3.STABLE1.
  - Squid 1.0 had a separate hierarchy.log; it is part of access.log since Squid 1.1.
- UTC time stamps are common to all log files:
  - Sometimes with a millisecond extension.
  - Influence of the --enable-time-hack compile-time option:
    - Use of alarm().
    - Coarseness of stamps.
6 Log file squid.out
- Created by the RunCache script.
- Contains Squid start-up times.
- Also collects assert() failures.
⇒ Not useful for statistics.
7 Log file cache.log
- Created by Squid.
- Collects error messages.
- Debug messages go here as well.
⇒ Of very limited use for generating automated error reports.
8 Log file useragent.log
- Needs compile-time configuration:
  - Compile with the --enable-useragent-log flag.
- Needs run-time configuration:
  - Point the useragent_log option to a file.
⇒ Useful for generating browser distribution reports (see the sketch below).
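A minimal Perl sketch of such a browser distribution report. It assumes the usual useragent.log layout with the agent string as the first double-quoted field of each line; verify this against your own log files before relying on it.

    #!/usr/bin/perl -w
    use strict;

    # count user agents; assumes the agent string is the first
    # double-quoted field of each useragent.log line
    my %count;
    while ( <> ) {
        $count{$1}++ if /"([^"]*)"/;
    }

    # report agents by descending frequency
    foreach my $agent ( sort { $count{$b} <=> $count{$a} } keys %count ) {
        printf "%8d %s\n", $count{$agent}, $agent;
    }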
9 Log file store.log
- Basically a debug file ⇒ a transaction log.
- Contains objects that are or were on your disks:
  - Process it completely to find the objects that still reside there.
- Allows for URL ⇒ store file name mapping:
  - Needs to know the cache_dir configuration option.
⇒ Useful for object-based statistics.
10 Store.log file format
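As a hedged sketch of working with the store.log format, the following tallies the action column (SWAPOUT, RELEASE, ...). That the action is the second whitespace-separated field is an assumption based on Squid-2 logs; check your own files.

    #!/usr/bin/perl -w
    use strict;

    # tally store.log actions, e.g. SWAPOUT vs. RELEASE; the column
    # position of the action field is an assumption -- check your logs
    my %action;
    while ( <> ) {
        my @field = split;
        $action{ $field[1] }++ if @field > 1;
    }
    printf "%8d %s\n", $action{$_}, $_
        foreach sort { $action{$b} <=> $action{$a} } keys %action;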
11 Log file swap.state
- Special tool dump_swap_states by Henrik Nordström shows:
  - MD5 key,
  - file number,
  - file size,
  - timestamp, last reference, expiry, last modification,
  - reference count,
  - flags.
- Association between stored file and object key.
⇒ Useful for debugging.
12 Log file access.log
- Used by most analysers.
- Two kinds of log file formats are supported:
  - The field sets do not overlap.
  - Conversion between the formats is not lossless.
  - The emulate_http_log option switches between formats.
  - Optional additional columns are possible.
⇒ Useful for most statistics.
13 Common vs. native format
Common log file format:
- Standard web server format.
- Parseable by a variety of tools.
- Less/other information than native:
  - HTTP protocol version.
Native log file format:
- Squid's own special format.
- Needs specialised tools.
- More/other information than common:
  - Request duration.
  - Timeout and hierarchy information.
  - Upstream server address.
  - Content type.
⇒ Use of the native log file format is recommended!
14 Native log file format
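For orientation, a sketch of parsing one native-format line in Perl. The field list follows the commonly documented Squid native format; the sample line in the comment is fabricated for illustration.

    #!/usr/bin/perl -w
    use strict;

    # native access.log fields, separated by whitespace:
    #   time elapsed client action/code size method URI ident hierarchy/from type
    # fabricated sample line:
    #   952084276.386 1324 192.168.1.17 TCP_MISS/200 8452 GET
    #       http://www.example.com/ - DIRECT/192.0.32.10 text/html
    while ( my $line = <> ) {
        my ( $time, $elapsed, $client, $code, $size, $method,
             $url, $ident, $hierarchy, $type ) = split ' ', $line;
        print "$client $code $url\n";    # do something with the fields
    }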
15 Influence of the log_icp_queries option
- Logs all ICP traffic.
- Logged request method is ICP_QUERY.
- Enlarges log files by huge amounts.
- Needed for many kinds of analysis.
- Enabled by default.
16 Managing log files
http://www.uninett.no/prosjekt/desire/arneberg/statistics.html
- Rotate and process log files at regular intervals:
  - Set up the logfile_rotate option appropriately.
  - Use the squid -k rotate API.
- Automatically rotate at least once a day, e.g. twice a day:

    # rotate log files at midnight ...
    0 0 * * * /usr/local/squid/bin/squid -k rotate
    # ... and again at high noon
    0 12 * * * /usr/local/squid/bin/squid -k rotate

- Transfer contents during a period of idleness.
- Reserve enough disk space for log files.
17 Managing log files II
- Keep the size in mind:
  - Storage will take space, transfer will take time.
  - Analysis will take even more time and CPU power.
- Limit your interest:
  - Even processed data is abundant.
- Respect the privacy of your clients:
  - Privacy protection legislation may force you to erase log files.
  - Keep logs unavailable to others.
  - Anonymize log files before publication (see the sketch below).
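A minimal anonymisation sketch, assuming native-format input with the client address in the third column; it replaces each address with a truncated one-way MD5 hash. Both the column position and the choice of Digest::MD5 are assumptions for illustration.

    #!/usr/bin/perl -w
    use strict;
    use Digest::MD5 qw(md5_hex);

    # replace the client address (third column of a native access.log)
    # with a truncated one-way hash before publishing the file;
    # note that the original column spacing is normalised
    while ( my $line = <> ) {
        my @field = split ' ', $line;
        $field[2] = substr( md5_hex( $field[2] ), 0, 16 ) if @field > 2;
        print join( ' ', @field ), "\n";
    }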
18 Intro: analyser programs
- Focus on access.log analysers.
- Python scripts are not mentioned.
- This is a starting point!
  - Use meta search engines for more links,
  - http://www.squid-cache.org/Scripts/
- Tested by feeding a log file from our university cache:
  - 388918 lines.
  - 53 MB for 24 hours.
  - Squid 1.1 log file format.
  - PentiumPro 233 MHz, 128 MB, Linux 2.0.36, glibc2.
19 Overview: analyser programs
20 Squidclients
http://www.cineca.it/nico/squidclients.html
- Version 1.6 by Nico Tranquilli.
- Readily compiled binary distribution.
- Textual and HTML output.
- Features:
  - Quick overview of peers and clients.
  - Needs ICP information.
21 Squidtimes
http://www.cineca.it/nico/squidtimes.html
- Version 1.12 by Nico Tranquilli.
- Readily compiled binary distribution.
- HTML output with pictures.
- Features:
  - elapsed time per TCP request and TCP count,
  - elapsed time for TCP HIT and TCP MISS,
  - elapsed time for UDP with MISS, HIT and count,
  - counts of TCP, UDP and errors,
  - object size,
  - shows different metrics in the same diagram.
22 Desire Project Statistics - DePStat
http://www.uninett.no/prosjekt/desire/DePStat/
- Version 1.67 by Lars Slettjord et al.
- Perl script collection.
- Text and HTML output, with emphasis on PostScript.
- Features:
  - HTTP traffic by requests, split by parents, HIT, MISS, etc.
  - HTTP traffic in bytes, split as above.
  - The same for ICP traffic.
  - HITs per second.
- Based on discussions in TF-CACHE.
- Modular concept in 2 stages with a database.
- Little documentation.
23 NLANR scripts
http://squid.nlanr.net/Squid/Scripts/
- By the authors of Squid.
- Perl script collection.
- Textual output with ASCII graphics.
- Features:
  - summaries over requests and protocols,
  - with UDP, TCP and TCP counts,
  - client and origin site usage,
  - URL types and top-level domains,
  - client utilization and hit ratios,
  - histogram of hot objects and top 25 objects.
24 Access-times
http://squid.nlanr.net/Squid/Scripts/access_times.awk
- By Andrew Richards.
- One awk script.
- Text table output.
- Yields a concise extract of the delays encountered.
25 Jesalfa
http://ivs.cs.uni-magdeburg.de/elkner/webtools/jesalfa/
- Version 1.0 by Jens Elkner.
- C source distribution.
- HTML tables as output.
- Features (always by HIT and by byte):
  - filename extensions and content types,
  - request schemes,
  - top-level domains,
  - top 20 second-level domains.
26 Calamaris
http://calamaris.cord.de/
- Version 2.29 by Cord Beermann.
- Perl script with man page.
- Text, HTML and mailto output.
- Features (volume sorting is optional):
  - request methods and peak times,
  - in- and outgoing TCP and UDP by status and destination,
  - 2nd-level domains, request schemes, content types, suffixes,
  - incoming TCP and UDP clients.
- Several passes with partial log files are possible.
27 Seafood a5
http://statistics.www-cache.dfn.de/Projects/seafood/
- Based on Calamaris 1; version a5 by Jens-S. Vöckler et al.
- C++ source distribution, uses the STL.
- Text output.
- Features:
  - peak times,
  - UDP and TCP requests, external fetches,
  - content types, 2nd- and top-level domains by count and volume,
  - UDP clients by HIT, with MISS and speed,
  - TCP clients by volume and HIT, with MISS and speed.
- Handles only Squid 1.1 log files.
28 Seafood 2
http://www.cache.dfn.de/DFN-Cache/Development/Seafood/
- TERENA Extended Cache Statistics project.
- Improved version of Seafood a5:
  - Optimised parser using tries.
  - C++/Perl source distribution, uses the STL.
  - SQL connectivity via an intermediate file.
- Additional features:
  - AS lookups for DIRECT calls (time-intensive).
  - Both by-volume and by-request tables for the textual output.
  - Squid-style configuration file.
- Basically for Squid-2 log files.
29 DIY analysis
- Available analysis programs may not cover your needs.
- Any analysis is CPU-intensive and time-consuming:
  - Do not use your cache host for running analysers.
- You need to optimise the parse loop:
  - Every millisecond saved here saves minutes in execution time.
- There is no need to optimise output.
30 Analysis using any programming language
- Leave out anything unnecessary:
  - With Perl, e.g. chomp vs. taking the last column's data.
  - Leave out reverse IP address lookups.
  - Only parse as much of the file as you actually need.
- Do not print or write anything within the parsing loop:
  - A kind of "I am alive" flag might be helpful, though.
- Put conditions first that will skip input lines:
  - Sort the conditions with the most likely first (see the sketch below).
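A minimal sketch of such a loop for native-format input, with the skip conditions first and the most frequent one leading. Treating ICP_QUERY as the most common case follows from the earlier log_icp_queries discussion; which condition actually dominates depends on your site.

    #!/usr/bin/perl -w
    use strict;

    my ( $lines, $bytes ) = ( 0, 0 );
    while ( <> ) {
        # most likely skip first: ICP queries dominate busy logs
        next if /ICP_QUERY/;
        # then the rarer skips, e.g. client and server errors
        next if m{/[45]\d\d\s};
        # accumulate only what is needed; no output inside the loop
        my @field = split;
        next unless @field > 4;
        $lines++;
        $bytes += $field[4];
    }
    printf "%d requests, %d bytes\n", $lines, $bytes;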
31 Analysis using Perl
- Process one line at a time:

    open( IN, $logfile ) || die "open $logfile: $!\n";
    while ( <IN> ) {
        ...
    }
    close(IN);

- Rescue partial results:

    $SIG{INT} = sub { exit(1) };
    ...                 # while loop accumulating
    exit(0);
    END {
        ...             # process results
    }
32 Analysis using Perl - regular expressions
- Avoid the match variables $&, $` and $'.
- Try memory-free parentheses (?:...).
- Greedy matches work faster in Perl.
- Prefer character classes over alternations.
- Only match or split as much as necessary.
- Benchmarked with the 388918-line local log file on the Linux host; a sketch of such a benchmark follows.
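A hypothetical micro-benchmark in the spirit of these hints, using the standard Benchmark module to compare a capturing pattern against a capture-free one. The sample line and the iteration count are made up.

    #!/usr/bin/perl -w
    use strict;
    use Benchmark qw(timethese);

    my $line = '952084276.386 1324 192.168.1.17 TCP_MISS/200 8452 ' .
               'GET http://www.example.com/ - DIRECT/192.0.32.10 text/html';

    timethese( 100000, {
        # capturing parentheses pay for back-reference bookkeeping
        capture   => sub { $line =~ m/^(\d+\.\d+) (\d+) / },
        # match the same text without keeping the captures
        nocapture => sub { $line =~ m/^\d+\.\d+ \d+ / },
    } );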
33 Analysing with C/C++
- Provide sufficiently large input buffers.
- Learn about the optimisation options of your compilers:
  - E.g. gcc 2.7.x -msupersparc vs. gcc 2.8.x -mcpu=ultrasparc.
- Program in ways your optimizer can work with:
  - Avoid global variables and use const.
- Avoid the locale features of the system libraries.
- Benchmark different variants at different optimisation levels:
  - E.g. ai versus s.
34 Showing results
- Textual output is easily generated:
  - It can be used for further processing.
- HTML tables rely on access to the net.
- Excel can draw impressive diagrams:
  - Import filters for tab-separated text or fixed-width columns.
  - Not good with large amounts of data.
- Try gnuplot:
  - Easy-to-use API.
  - Textual control files.
  - Multiple data files.
  - All major output devices supported.
35 Showing results II
- Use graphics interfaces directly:
  - Not an option for beginners.
  - You must auto-scale yourself.
- Generating GIFs (see the sketch below):
  - libgd for C and the GD module for Perl.
- PostScript as a vector format:
  - Generates small files which scale very well.
  - A complete programming language, allows for post-processing.
  - Definitely not for beginners.
- Other interfaces, e.g. TeX, Tgif, etc.
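A minimal GD sketch, assuming the Perl GD module is installed. Whether the output method is gif() or png() depends on the GD build, so adjust accordingly; the bar-chart data are fabricated.

    #!/usr/bin/perl -w
    use strict;
    use GD;

    # draw a trivial bar chart into an in-memory image
    my $img   = GD::Image->new( 400, 200 );
    my $white = $img->colorAllocate( 255, 255, 255 );  # first colour = background
    my $blue  = $img->colorAllocate( 0, 0, 255 );

    my @value = ( 120, 80, 190, 60 );                  # fabricated sample data
    for my $i ( 0 .. $#value ) {
        $img->filledRectangle( 20 + $i * 90, 190 - $value[$i],
                               80 + $i * 90, 190, $blue );
    }

    open( OUT, '>chart.gif' ) || die "open chart.gif: $!\n";
    binmode OUT;
    print OUT $img->gif;    # use $img->png with PNG-only GD builds
    close(OUT);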
36 Long-term observations
- The old DFN way:
  - Several steps involved.
  - Several scripts involved.
  - Manual intervention necessary.
  - Backup?
- The new Seafood style:
  - Two steps.
  - Storage in a database.
  - Automatic, interactive results.
37 The old DFN way
- Condense your log files.
- Extract items of interest.
- Combine items.
- Plot the resulting data.
38 Condensation
- At regular intervals, condense log files into numbers.
  - This also adds privacy protection.
- Use powerful analysers for the condensing step.

    TLD              request      %     kByte      %   hit-%
    --------------- -------- ------ --------- ------ ------
    .com              140808  39.22   1222155  48.56  42.17
    .de               145012  40.39    785763  31.22  44.81
    .net               21120   5.88    146648   5.83  37.67
    unresolved         20001   5.57    121435   4.82  37.99
    .nu                 3112   0.87     47118   1.87  26.51
    .org                3614   1.01     33015   1.31  28.94
    .nl                 1173   0.33     20658   0.82  37.94
    .ch                 1649   0.46     17145   0.68   8.19
    .edu                1503   0.42     15164   0.60  14.77
    .uk                 1766   0.49     12402   0.49  22.25
    ...
    --------------- -------- ------ --------- ------ ------
    Sum               359008 100.00   2516820 100.00  40.50

- Example with local server log files from 4 January 1999.
- The table shows an excerpt from Calamaris output.
- Patched for an unlimited number of top-level domains.
39 Extraction
- Calamaris output is still large:
  - Generate an extract from the numbers.
  - Concentrate on a few trends.
- Parse the Calamaris output:
  - Combine numbers.
  - Generate extracts.
  - Easy-to-parse output (see the sketch after the listing).

    toplevel_domain_request_com           140808
    toplevel_domain_request_de            145012
    toplevel_domain_request_unresolved     20001
    toplevel_domain_request_edu             1503
    toplevel_domain_request_org             3614
    toplevel_domain_request_gov              171
    toplevel_domain_request_nl              1173
    toplevel_domain_request_jp               368
    toplevel_domain_request_uk              1766
    toplevel_domain_request_net            21120
    ...
    toplevel_domain_request_sum           359008
    toplevel_domain_kbyte_sum            2516820
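A minimal sketch of such an extraction step, assuming the six-column Calamaris table layout shown on the previous slide. The row-matching pattern is an assumption, since Calamaris's exact report layout varies between versions.

    #!/usr/bin/perl -w
    use strict;

    # turn the top-level domain table of a Calamaris report into
    # easy-to-parse key-value lines; assumes the column layout
    #   TLD  requests  %  kByte  %  hit-%
    while ( <> ) {
        next unless /^\s*(\.\w+|unresolved|Sum)\s+(\d+)\s+[\d.]+\s+(\d+)\b/;
        my ( $tld, $req, $kb ) = ( $1, $2, $3 );
        if ( $tld eq 'Sum' ) {
            print "toplevel_domain_request_sum $req\n";
            print "toplevel_domain_kbyte_sum $kb\n";
        } else {
            $tld =~ s/^\.//;    # strip the leading dot
            print "toplevel_domain_request_$tld $req\n";
        }
    }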
40 Combination
- Further limit the view, similar to extraction.
- Extract and combine, generating gnuplot data files (see the sketch below).
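A possible combination step, again only a sketch: it appends one value per day to a data file per key, matching the two-column files and the 'using 2' addressing of the gnuplot control file on the next slide. The file and key names follow the surrounding examples; everything else is an assumption, and the 9901/ directory must already exist.

    #!/usr/bin/perl -w
    use strict;

    # usage: combine.pl <day-number> <extract-file>...
    # appends "day value" lines to one gnuplot data file per key
    my $day = shift;
    while ( <> ) {
        my ( $key, $value ) = split;
        next unless defined $value and $key =~ /^toplevel_domain_kbyte_/;
        open( OUT, ">>9901/9901.$key" ) || die "append 9901/9901.$key: $!\n";
        print OUT "$day $value\n";
        close(OUT);
    }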
41 Plot
- Gnuplot can combine data:
  - One control file per diagram.
  - The example needs manual work for the time stamps.

    set terminal postscript eps color solid
    set xlabel 'Day'
    set ylabel 'GByte per day'
    set output 'rrzn.toplevel_domain.9901.eps'
    set title 'www-cache.rrzn.uni-hannover.de'
    set size 1.4, 1.0
    set grid
    set yrange [0:3500000]
    set xtics ( "31.12.98" 0, "4.1.99" 4, "11.1.99" 11, "18.1.99" 18, \
                "25.1.99" 25, "30.1.99" 30 )
    set ytics ( "0.5" 500000, "1" 1000000, "1.5" 1500000, "2" 2000000, \
                "2.5" 2500000, "3" 3000000, "3.5" 3500000 )
    plot \
      '9901/9901.toplevel_domain_kbyte_com' using 2 title 'com' with steps, \
      '9901/9901.toplevel_domain_kbyte_de' using 2 title 'de' with steps, \
      '9901/9901.toplevel_domain_kbyte_edu' using 2 title 'edu' with steps, \
      '9901/9901.toplevel_domain_kbyte_gov' using 2 title 'gov' with steps, \
      '9901/9901.toplevel_domain_kbyte_jp' using 2 title 'jp' with steps, \
      '9901/9901.toplevel_domain_kbyte_nl' using 2 title 'nl' with steps, \
      '9901/9901.toplevel_domain_kbyte_org' using 2 title 'org' with steps, \
      '9901/9901.toplevel_domain_kbyte_uk' using 2 title 'uk' with steps, \
      '9901/9901.toplevel_domain_kbyte_unresolved' using 2 title 'unresolved' with steps
42 Result
43 The New Seafood
- In-depth configuration file:
  - Easily incorporate Squid changes.
- Calamaris-like textual output file:
  - Limited backward compatibility.
- Use of an SQL database:
  - Vendor independence through DIF.
  - All data ever gathered remains accessible.
  - Combined view of complex issues.
- Web interface:
  - Interactive selection of interest.
44 Seafood Design Aims
- Creating sums for different tables:
  - TCP/UDP peaks ⇒ hourly interval I1.
  - All other data ⇒ daily interval I2.
- Uniform data:
  - Number of requests.
  - Volume in bytes.
  - Duration in seconds.
  - Hit percentage, for volume and requests separately.
45 Seafood Design Aims II
- Distinction between client-side and server-side traffic.
- Further distinction of server-side traffic:
  - going directly to the origin,
  - travelling via a parent cache,
  - fetched from a sibling cache.
46 Seafood Data Collections
- Performance-related (peak) data.
- Destination domain data:
  - Limit top-level domains to configured ones.
  - Limit 2nd-level domains to the top N.
- Protocols.
- Methods.
- MIME types:
  - Limit to configured types.
- Client-side traffic ⇒ DNS lookups!
  - Limit clients to the top N; see the sketch after this list.
- Server-side traffic:
  - AS numbers of direct destinations ⇒ AS lookups!
- Time-volume distribution of requests.
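A sketch of the top-N limiting idea as it could be done in Perl; the hash layout, the helper name top_n and the example data are illustrative assumptions, not Seafood's actual implementation.

    #!/usr/bin/perl -w
    use strict;

    # keep only the N busiest clients from a client => requests hash
    sub top_n {
        my ( $n, %count ) = @_;
        my @top = ( sort { $count{$b} <=> $count{$a} } keys %count )[ 0 .. $n - 1 ];
        return grep { defined } @top;    # fewer than N clients is fine
    }

    # fabricated example data
    my %requests = ( pc1 => 500, pc2 => 120, pc3 => 77, pc4 => 3 );
    print join( "\n", top_n( 3, %requests ) ), "\n";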
47 Seafood: what is a HIT?
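As a rough approximation only: Squid result codes containing the string HIT (TCP_HIT, TCP_MEM_HIT, TCP_IMS_HIT, UDP_HIT, ...) are commonly counted as hits, while whether refresh codes count is a policy decision. The test below is therefore an assumption, not Seafood's actual definition.

    #!/usr/bin/perl -w
    use strict;

    # crude hit test on the action/code field of a native log line;
    # treating every *_HIT code as a hit is an assumption
    sub is_hit {
        my ($action) = @_;
        return $action =~ /HIT/;
    }

    my ( $hits, $total ) = ( 0, 0 );
    while ( <> ) {
        my @field = split;
        next unless @field > 3;
        $total++;
        $hits++ if is_hit( $field[3] );
    }
    printf "%.2f%% hits of %d requests\n", 100 * $hits / $total, $total if $total;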
48 Seafood DB Design
49 Seafood Web Interface
- Interactive selections:
  - Interval of interest.
  - Cache host or hosts.
  - Type of result.
  - Requests vs. volume.
- Set of predefined queries.
http://www.rvs.uni-hannover.de/people/voeckler/sel.cgi
50 Seafood: example output
51 Seafood: Where to Get It
- Sources, documentation and updates at
  http://www.cache.dfn.de/DFN-Cache/Development/Seafood/