Logfile Analysis

About This Presentation

Title:

Logfile Analysis

Description:

http://statistics.www-cache.dfn.de/Projects/seafood/ 02.03.2000. Log File Analysis ... DIY analysis. Available analysis programs may not cover your needs ... – PowerPoint PPT presentation

Number of Views:54

Avg rating:3.0/5.0

Slides: 52

Provided by: jenssv

Category:

more less

Transcript and Presenter's Notes

Title: Logfile Analysis

1
Log File Analysis
Jens-S. Vöckler
Desire II Cache Workshop, Budapest
02.03.2000
2
Overview

Log file processing What log files do exist?
Log file analyser programs An overview
Do-it-yourself Analysing log files yourself
How to create long term observations

3
Log file processing vs. monitoring
Log file analysis
Monitoring

look at things after they happened
find out about trends
long term conception

show behaviour instantly
ring alarm bells before catastrophe
short term conception

? The aims are different!
4
Log file processing Online vs. Offline

Online processing degrades cache performance
Processing on-host at high performance loss,
Continuous transfers elsewhere ? IPC overhead.
Log file would have to be constantly polled.
Offline processing allows more options
Processing granularities are user defined and
repeatable.
Transfer of Log files during idle intervals.

? Offline processing preferable.
5
Log files kept by Squid

Several log files kept by Squid
Explicit compile-time activation for some, e.g.
useragent.log
Explicit run-time deactivations for others, e.g.
store.log
Basic assumption Squid 2.3.STABLE1
Squid 1.0 hierarchy.log in access.log since Squid
1.1
UTC time stamps common to all log files
Sometimes millisecond extension.
Influence of --enable-time-hack compile time
option
Use of alarm().
Coarseness of stamps.

6
Log file squid.out

Created by RunCache script.
Contains Squid start up times.
Also gets assert() failures.

? Not useful for statistics.
7
Log file cache.log

Created by Squid.
Gets error messages.
Debug messages go here.

? Very limited use for generating automated error
reports.
8
Log file useragent.log

Needs compile-time configuration.
Compile with --enable-useragent-log flag.
Needs run-time configuration
Point useragent_log option to a file.

? Useful for generating browser distribution
reports.
9
Log file store.log

Basically a debug file ? transaction log.
Contains objects that are or were on your disks
Process completely for objects that do reside.
Allows for URL ? store file name mapping.
Needs to know cache_dir configuration option.

? Useful for object based statistics.
10
Store.log file format
11
Log file swap.state

Special tool dump_swap_states by Henrik
Nordström
MD5 key,
File number,
File size,
Timestamp, last reference, expires, last
modification,
Reference count,
Flags.
Association between stored file and object key.

? Useful for debugging.
12
Log file access.log

Used by most analysers.
Two kind of log file formats supported.
Sets do not overlap.
Conversion between formats is not without loss.
Option emulate_http_log switches between formats.
Optional additional columns possible.

? Useful for most statistics.
13
Common vs. native format
Common log file format
Native log file format

Standard web server format.
Parseable by a variety of tools.
Less/ other information than native.
HTTP protocol version

Special format of Squid.
Needs specialised tools.
More/ other information than common.
Request duration
Timeout hierarchy information
Upstream server address
Content type

? Use of native log file format is recommended!
14
Native log file format
15
Influence of log_icp_queries option

Logs all ICP traffic.
Logged request method is ICP_QUERY.
Enlarges log files by huge amounts.
Needed for many kinds of analysis.
Enabled by default.

16
Managing log files
http//www.uninett.no/prosjekt/desire/arneberg/sta
tistics.html

Rotate and process log files at regular
intervals
Set up the logfile_rotate option appropriately.
Use the squid -k rotate API.
Automatically rotate at least once a day, e.g.
twice a day
rotate log files at midnight...
0 0 /usr/local/squid/bin/squid -k rotate
... And again at high noon
0 12 /usr/local/squid/bin/squid -k rotate
Transfer contents during a period of idleness.
Reserve enough disk space for log files.

17
Managing log files II

Keep the size in mind.
Storage will take space, transfer will take time.
Analysis will take even more time and CPU power.
Limit your interest.
Even processed data is abundant.
Respect the privacy of your clients.
Privacy protection legislation may force you to
erase log files.
Keep logs unavailable.
Anonymize log files before publication.

18
Intro Analyser programs

Focus on access.log analysers.
Python scripts are not mentioned.
This is a starting point!
Use meta search engines for more links,
http//www.squid-cache.org/Scripts/
Test by feeding a log file from our university
cache
388918 lines.
53 MB for 24 hours.
Squid 1.1 log file format.
PentiumPro 233 MHz, 128 MB, Linux 2.0.36, glibc2.

19
Overview Analyser programs
20
Squidclients
http//www.cineca.it/nico/squidclients.html

Version 1.6 by Nico Tranquilli.
Readily compiled binary distribution.
Textual and HTML output.
Features
Quick overview over peers and clients.
Needs ICP information.

21
Squidtimes
http//www.cineca.it/nico/squidtimes.html

Version 1.12 by Nico Tranquilli.
Readily compiled binary distribution.
HTML output with pictures.
Features
elapsed time per TCP, TCP count
elapsed time for TCP HIT, TCP MISS
elapsed time for UDP, MISS, HIT, count
count of TCP, UDP and errors
object size
shows different things in the same diagram.

22
Desire Project Statistics - DePStat
http//www.uninett.no/prosject/desire/DePStat/

Version 1.67 by Lars Slettjord et al.
Perl script collection.
Text and HTML with emphasise on PostScript.
Features
HTTP traffic by requests, split by parents, HIT,
MISS, etc.
HTTP traffic in bytes split as above.
Same for ICP traffic
HITs per second
Based on discussions in TF-CACHE.
Modular concept in 2 stages with database
Little documentation.

23
NLANR scripts
http//squid.nlanr.net/Squid/Scripts/

By the authors of Squid.
Perl script collection.
Textual output with ASCII graphics.
Features
summaries over requests, protocols.
With UDP, TCP and TCP count
client and origin site usage.
URL types and top level domains.
Client utilization and hit ratios.
Histogramm of hot objects and top 25 objects.

24
Access-times
http//squid.nlanr.net/Squid/Scripts/access_times.
awk

By Andrew Richards.
One awk script.
Text table output.
Yields a consise extracts of delays encountered.

25
Jesalfa
http//ivs.cs.uni-magdeburg.de/elkner/webtools/je
salfa/

Version 1.0 by Jens Elkner.
C source distribution.
HTML tables as output.
Features - always by HIT and by Byte
filename extension and content types.
Request schemes.
Toplevel domains.
Top 20 second level domains.

26
Calamaris
http//calamaris.cord.de/

Version 2.29 by Cord Beermann.
Perl script with man page.
Text, HTML and mailto output.
Feature - volume sorting is optional
request methods and peak times.
In- and outgoing TCP and UDP by status and
destination.
2nd level domains, request schemes, content
types, suffices.
Incoming TCP and UDP clients.
Several passes with partial log files possible.

27
Seafood a5
http//statistics.www-cache.dfn.de/Projects/seafoo
d/

Based on Calamaris 1. Version ?5 by Jens-S.
Vöckler et al.
C source distribution, uses STL.
Text output.
Features
peak times.
UDP and TCP requests, external fetches.
Content type, 2nd and toplevel domains by count
and volume.
UDP clients by HIT, with MISS and speed.
TCP clients by volume and HIT, with MISS and
speed.
Only Squid 1.1 log files.

28
Seafood 2
http//www.cache.dfn.de/DFN-Cache/Development/Seaf
ood/

TERENA Extended Cache Statistics project
Improved version of Seafood a5.
Optimised parser using tries.
C/Perl source distribution, uses STL.
SQL connectivity via intermediate file.
Additional features
AS lookups for DIRECT calls (time-intensive).
Both, by volume and by request tables for textual
output.
Squid-style configuration file.
Basically for Squid-2 log files.

29
DIY analysis

Available analysis programs may not cover your
needs
Any analysis is CPU intensive and time consuming.
Do not use your cache host for running analysers.
Need to optimise parse loop
every millisecond saved here saves minutes in
execution time.
No need to optimise output

30
Analysis using any programming language

Leave out anything unnecessary.
With Perl, chomp vs. last column data.
Leave out reverse IP address lookups.
Only parse as much of the file as you actually
need.
Do not print or write anything within the parsing
loop
A kind of I am alive flag might be helpful,
though.
Put conditions first, that will skip input lines.
Sort the conditions with the most likely first.

31
Analysis using Perl

Process one line at a time
open( IN, logfile ) die open logfile
!\n
while ( ltINgt )
...
close(IN)
Rescue partial results
SIGINT sub exit(1)
... while loop accumulating
exit(0)
END
... process results

32
Analysis using Perl - Regular expressions

Avoid match variables ,,
Try memory free parenthesis
Greedy matches work faster in Perl.
Prefer character classes over alternations.
Only match or split as much as necessary.
With the 388918 lines local log file on the Linux
host

33
Analysing with C/C

Provide sufficiently large input buffers.
Learn about the optimisation of your compilers.
E.g gcc 2.7.x msupersparc vs. gcc 2.8.x
-mcpuultrasparc
Program in ways your optimizer can work.
avoid global variables and use const.
Avoid the locale feature of system libraries.
Benchmark different variants at different
optimisation levels
E.g. ai versus s

34
Showing results

Textual output is easily generated.
can be used for further processing.
HTML tables rely on access to the net.
Excel can draw impressive diagrams.
Import filter for tab separated text or fixed
width columns.
Not good with large amounts of data.
Try gnuplot.
Easy to use API.
Textual control files.
Multiple data files.
All major output devices supported.

35
Showing results II

Use graphic interfaces directly.
Not an option for beginners.
Must auto scale yourself.
Generating GIFs
libgd for C and the GD module for Perl.
PostScript as vector format.
Generates small files which scale very well.
Complete programming language, allows for post
processing.
Definitely not for beginners.
Other interfaces, e.g. TeX, Tgif, etc.

36
Long term observations

The old DFN way
Several steps involved.
Several scripts involved.
Manual intervention necessary.
Backup?
The new Seafood style
Two steps.
Storage in database.
Automatic, interactive results.

37
The old DFN way

Condense your log files.
Extract items of interest.
Combine items.
Plot the data represented.

38
Condensation

At regular interval, condense log files into
numbers.
Also adds privacy protection.
Use powerful analysers for condensing numbers.

Request-Peak Request kByte
Hit- --------------- ------- ------ --------
------ ------ .com 140808 39.22
1222155 48.56 42.17 .de 145012
40.39 785763 31.22 44.81 .net
21120 5.88 146648 5.83 37.67 unresolved
20001 5.57 121435 4.82 37.99 .nu
3112 0.87 47118 1.87 26.51 .org
3614 1.01 33015 1.31
28.94 .nl 1173 0.33 20658
0.82 37.94 .ch 1649 0.46
17145 0.68 8.19 .edu 1503
0.42 15164 0.60 14.77 .uk
1766 0.49 12402 0.49 22.25
... --------------- ------- ------ --------
------ ------ Sum 359008 100.00
2516820 100.00 40.50

Example with local server log files from 4th of
January 1999.
Table shows excerpt from Calamaris output.
Patched for unlimited number of top level domains.

39
Extraction

Calamaris output is still large.
Generate extract from numbers.
Concentrate on a few trends.
Parse Calamaris output.
Combines numbers.
Generates extracts.
Easy to parse output.

toplevel_domain_request_com 140808 toplevel_domain
_request_de 145012 toplevel_domain_request_unresol
ved 20001 toplevel_domain_request_edu
1503 toplevel_domain_request_org
3614 toplevel_domain_request_gov
171 toplevel_domain_request_nl 1173 toplevel_domai
n_request_jp 368 toplevel_domain_request_uk
1766 toplevel_domain_request_net 21120
... toplevel_domain_request_sum
359008 toplevel_domain_kbyte_sum 2516820
40
Combination

Further limit view, similar to extraction.
Extract and combine, generate gnuplot data files.

41
Plot

Gnuplot can combine data.
One control file per diagram.
Example needs manual work for the time stamps.

set terminal postscript eps color solid set
xlabel 'Day' set ylabel 'GByte per day' set
output 'rrzn.toplevel_domain.9901.eps' set title
'www-cache.rrzn.uni-hannover.de' set size 1.4,
1.0 set grid set yrange 03500000 set xtics (
"31.12.98" 0, "4.1.99" 4, "11.1.99" 11, "18.1.99"
18, "25.1.99" 25, "30.1.99" 30 ) set ytics (
"0.5" 500000, "1" 1000000, "1.5" 1500000, "2"
2000000,\ "2.5" 2500000, "3" 3000000, "3.5"
3500000 ) plot \ '9901/9901.toplevel_domain_kbyte
_com' using 2 title 'com' with steps, \
'9901/9901.toplevel_domain_kbyte_de' using 2
title 'de' with steps, \ '9901/9901.toplevel_dom
ain_kbyte_edu' using 2 title 'edu' with steps, \
'9901/9901.toplevel_domain_kbyte_gov' using 2
title 'gov' with steps, \ '9901/9901.toplevel_dom
ain_kbyte_jp' using 2 title 'jp' with steps, \
'9901/9901.toplevel_domain_kbyte_nl' using 2
title 'nl' with steps, \ '9901/9901.toplevel_dom
ain_kbyte_org' using 2 title 'org' with steps, \
'9901/9901.toplevel_domain_kbyte_uk' using 2
title 'uk' with steps, \ '9901/9901.toplevel_dom
ain_kbyte_unresolved' using 2 title 'unresolved'
with steps
42
Result
43
The New Seafood