Detecting problems in Internet services by mining text logs

1
Detecting problems in Internet services by mining text logs

Wei Xu, Armando Fox and Dave Patterson
RAD Lab
June 2008

2
RAD Lab Overview
[Architecture diagram. Labels: Low-level spec; Compiler; High-level spec; Instrumentation Backplane; New apps, equipment, global policies (e.g. SLA); Offered load, resource utilization, etc.; Director; Policy-aware switching; Training data; performance cost models; Log Mining]

3
Console Logs - The forgotten treasure

Most widely deployed program tracing tool
From printf to various loggers, > 50 years
Includes developer-selected information
Very expressive, contains multi-dimensional data
No need for operators to modify anything
Although developers use console logs heavily, operators and users generally ignore them
Because console logs are hard to read!

4
Console logs are ignored because they are hard to read

For humans:
Verbose
Different levels of implementation details
Awkward language

For machines:
Highly unstructured, almost no schema

Problem: Don't know what to look for!

5
Goal: Make console log analysis easier

Provide a systematic method that
Routinely analyzes logs and suggests potentially abnormal behavior
Requires NO query (as in search) from users
A query is required by all commonly used analysis methods:
grep
Splunk
Log filters / viewers

6
Three distinct case studies

Cover multiple styles of coding and logging
All features apply to every system

7
Outline

Approach overview
Log analysis for system operator
Project Darkstar
Hadoop file system
Log analysis for developers
SysX: a distributed storage system
Summary and future work

8
Text Log Analysis Overview

[Pipeline diagram]
Free text log: ... TransactionImpltid468, , statePREPARING ...
Extract structured data: tid 468, state PREPARING
Generate features: execution sequence, value percentage, label lifecycle
Monitoring and detection: clustering, K-gram (output: Problem!)

9
Extract Structured Data

By matching log lines to source code

Log line:
070505 122335 checking memory usage 348MB

Log template (ID 15):
checking memory usage (.*)MB

Source code:
class Server {
  void periodicTask(Task task) {
    LOG.info("checking memory usage " + checkmemory() + "MB");
  }
}

Extraction yields:
Where the log line comes from (a template for each log line)
The data contained in the log

10
Log extraction is not trivial in object-oriented languages

Parse the source code to the abstract syntax tree (AST) level
Resolve the toString() method of each object used
Resolve dynamic binding (subclassing)
Index log templates to improve matching speed
Use auto-generated regular expressions to extract variable values
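The matching step can be illustrated with a minimal Python sketch. It assumes templates have already been recovered from the source code and are written with a hypothetical `{}` placeholder for each variable; the `TEMPLATES` table and the function names are illustrative only and are not taken from the authors' tool.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Build a regex from a template; each "{}" hole becomes a capture group.

    The "{}" placeholder syntax is an assumption made for this sketch.
    """
    literals = template.split("{}")
    pattern = "(.*?)".join(re.escape(part) for part in literals)
    return re.compile(pattern + "$")

# Hypothetical template table: template ID -> template text.
TEMPLATES = {15: "checking memory usage {}MB"}
COMPILED = {tid: template_to_regex(text) for tid, text in TEMPLATES.items()}

def extract(line: str):
    """Return (template_id, [values]) for the first matching template, else None."""
    for tid, regex in COMPILED.items():
        match = regex.search(line)
        if match:
            return tid, list(match.groups())
    return None

result = extract("070505 122335 checking memory usage 348MB")
```

This sketch scans every template linearly; the indexing mentioned on the slide would replace that loop with a lookup keyed on the literal parts of the message.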

11
Log Extraction Example: Complex log structures

Source:
logger.log(Level.FINER, "prepareAndCommit txn {0}", txn)

Log:
080525 224606,080 prepareAndCommit txn TxnTrampolineoriginalTxnTransactionImpltid468, creationTime1211780766069, timeout100, statePREPARING

Extracted:
tid 468
creationTime 1211780766069
state PREPARING
timeout 100
LOGID DataStoreImpl-1207

12
Fundamental Difference from Search-Based Log Analysis

Log itself as a language vs. log as documents with English words
Analysis on log sequences vs. analysis on keywords
Clustering => reveals the program structure
=> reveals the runtime state of the program
Scoring => finds the most important sequences in the log

14
Case Study I: Project Darkstar

Experiment to test Darkstar under CPU disturbance
Turned on CPU power save as a CPU cap (common in shared VM hosting)
Observed significant performance degradation

[Plot markers: CPU power save, CPU recovered]

15
Feature I: Value Distribution of a Single Label

Find label (non-numerical) variables that
Have only a small set of distinct values
Appear in the log many times
In Darkstar, only 2 variables were chosen
Periodically calculate the percentage of each distinct value
Calculate χ² to detect abnormal percentages

state: ACTIVE, PREPARING, COMMITTING, darkmud.mapChannel, ABORTING, ABORTED
Util.formatOps: R, RW, W

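The χ² step can be sketched as follows. This is one plausible reading, not the authors' exact procedure: it compares the value counts of a label in the current window against a baseline window using Pearson's χ² statistic, with a small smoothing constant `alpha` (an assumption of this sketch) so that values rare in the baseline do not cause division by zero. The two count vectors are the state counts that accompany the detection results in this deck; treating the second one as the normal baseline is also an assumption.

```python
from collections import Counter

def chi_square(observed: Counter, baseline: Counter, alpha: float = 0.5) -> float:
    """Pearson chi-square of observed counts against baseline proportions.

    alpha is an additive smoothing constant (an assumption of this sketch).
    """
    categories = set(observed) | set(baseline)
    n_obs = sum(observed.values())
    if n_obs == 0:
        return 0.0
    n_base = sum(baseline.values()) + alpha * len(categories)
    stat = 0.0
    for value in categories:
        # Expected count if the window followed the baseline distribution.
        expected = (baseline[value] + alpha) / n_base * n_obs
        stat += (observed[value] - expected) ** 2 / expected
    return stat

# State counts taken from the deck; which window is "normal" is assumed here.
normal = Counter({"ACTIVE": 1001, "PREPARING": 809, "COMMITTING": 269,
                  "darkmud.mapChannel": 18, "ABORTING": 3, "ABORTED": 3})
disturbed = Counter({"ACTIVE": 1598, "PREPARING": 795, "COMMITTING": 217,
                     "darkmud.mapChannel": 26, "ABORTING": 793, "ABORTED": 234})

score_normal = chi_square(normal, normal)
score_disturbed = chi_square(disturbed, normal)
```

A monitoring loop would compute this score per time window and raise an alert when it exceeds a chosen threshold; the threshold itself is not given in the slides.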
16
Detection Results
[Plot: response time (ms), χ² of Ops, and χ² of state versus time (sec); markers show CPU power save and CPU recovered]
State value counts shown on the plot:
ACTIVE 1598, PREPARING 795, COMMITTING 217, darkmud.mapChannel 26, ABORTING 793, ABORTED 234
ACTIVE 1001, PREPARING 809, COMMITTING 269, darkmud.mapChannel 18, ABORTING 3, ABORTED 3

17
Case Study II: Hadoop File System
Feature II: Label Lifecycle

Problem:
Why do some blocks take longer than others?
Find labels that
Have many distinct values
Appear on multiple nodes (master, slave, etc.)
Are reported many times
Log lines reporting a label show the lifecycle of the label
Similar to an execution trace without ordering
Treat as (unordered) sets of events

18
Lifecycle of a label is similar to an execution path
The lifecycle of blk_8490295198170274781

datanode_r16 Receiving block blk_8490295198170274781 src dest...
namenode_r10 BLOCK NameSystem.allocateBlock blk_8490295198170274781
datanode_r14 Receiving block blk_8490295198170274781 src dest
datanode_r16 Received block blk_8490295198170274781 of size 49486737 from
datanode_r14 Received block blk_8490295198170274781 of size 49486737 from
datanode_r14 PacketResponder 0 for block blk_8490295198170274781 terminating
datanode_r16 PacketResponder 1 for block blk_8490295198170274781 terminating
namenode_r10 BLOCK NameSystem.addStoredBlock blockMap updated 169.229.48.82:50010 is added to blk_8490295198170274781 size 49486737
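Collecting a lifecycle can be sketched in a few lines of Python. The sketch assumes each line starts with a node name followed by the message, and it approximates the event type by masking the block id and any numbers in the message; a real implementation would use the template ID recovered from source code instead. The helper names are hypothetical.

```python
import re
from collections import defaultdict

BLOCK_ID = re.compile(r"blk_-?\d+")
NUMBER = re.compile(r"\d+(?:\.\d+)*")

def event_type(line: str):
    """Approximate the event type as (node role, message with values masked)."""
    node, _, message = line.partition(" ")
    role = node.split("_")[0]              # "datanode_r16" -> "datanode"
    message = BLOCK_ID.sub("<blk>", message)
    message = NUMBER.sub("<n>", message)
    return role, message

def lifecycles(log_lines):
    """Map each block id to the unordered set of event types that mention it."""
    events = defaultdict(set)
    for line in log_lines:
        match = BLOCK_ID.search(line)
        if match:
            events[match.group()].add(event_type(line))
    return events

LINES = [
    "datanode_r16 Receiving block blk_8490295198170274781 src dest",
    "namenode_r10 BLOCK NameSystem.allocateBlock blk_8490295198170274781",
    "datanode_r14 Receiving block blk_8490295198170274781 src dest",
    "datanode_r16 Received block blk_8490295198170274781 of size 49486737 from",
    "datanode_r14 PacketResponder 0 for block blk_8490295198170274781 terminating",
    "datanode_r16 PacketResponder 1 for block blk_8490295198170274781 terminating",
]
result = lifecycles(LINES)
```

Because the events are stored in a set, the two "Receiving block" lines from different datanodes collapse into one event type, which matches the unordered-set view of a lifecycle.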

19
Cluster labels with similar lifecycles

Unusual clusters might indicate problems
Hadoop FS result
Only one variable (block id) was chosen
11 clusters: 3 common, 8 uncommon
Examined the 8 uncommon clusters
4 (runtime) errors identified
1 false positive due to an insufficient training set
3 not understood
May reuse algorithms from path-based analysis
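The slides do not say which clustering algorithm was used. As a deliberately simple stand-in, the sketch below groups labels whose event sets are exactly identical and flags clusters that hold only a small share of all labels. The `min_share` threshold, the event names, and the toy data are all hypothetical.

```python
def cluster_lifecycles(lifecycles):
    """Group labels with identical event sets; returns {frozenset(events): [labels]}."""
    clusters = {}
    for label, events in lifecycles.items():
        clusters.setdefault(frozenset(events), []).append(label)
    return clusters

def uncommon_clusters(clusters, min_share=0.05):
    """Return clusters containing less than min_share of all labels."""
    total = sum(len(labels) for labels in clusters.values())
    return {events: labels for events, labels in clusters.items()
            if len(labels) / total < min_share}

# Toy data: 40 blocks with a complete lifecycle, one block that was never received.
normal_events = {"allocate", "receiving", "received", "terminating"}
toy = {f"blk_{i}": set(normal_events) for i in range(40)}
toy["blk_bad"] = {"allocate", "receiving"}

clusters = cluster_lifecycles(toy)
suspicious = uncommon_clusters(clusters)
```

Exact-match grouping is brittle when lifecycles differ in harmless ways; a distance-based method over the event sets would merge near-identical lifecycles, at the cost of choosing a distance threshold.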

21
Case Study III: SysX, a real distributed storage system
[Architecture diagram. Labels: Master, Heartbeats, Storage Server, Storage Server, Client, Requests, Background tasks]

Problem:
In-memory buffers are periodically checkpointed to a shared FS
When a checkpoint fails, the storage server crashes with OutOfMemoryError
Took several days to debug, > 100 debugging lines

22
Why so hard to debug?
[Log excerpt with annotations at line 858 and line 2094]
No message showing checkpoint done! No error message, nothing to grep

23
Feature III: Single Thread Event Sequence

Generate training sets
Logs from normal executions, which are not hard to obtain in deployed systems
Capture the normal sequences of each thread
Find abnormal sequences (never seen in training data)
Detected the failed checkpoint problem without false positives
Results improve with a larger training set

24
K-gram model details
[Figure: K-grams (K = 3) from the training set (normal) compared with the testing set (checkpoint hang)]

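A minimal sketch of K-gram detection with K = 3: collect every window of three consecutive events seen in a thread during normal runs, then report any window in a test run that was never seen. The event names below are hypothetical; in practice they would be log template IDs.

```python
def kgrams(sequence, k=3):
    """Yield every window of k consecutive events, in order."""
    for i in range(len(sequence) - k + 1):
        yield tuple(sequence[i:i + k])

def train(normal_sequences, k=3):
    """Collect the set of all k-grams seen in normal per-thread sequences."""
    model = set()
    for sequence in normal_sequences:
        model.update(kgrams(sequence, k))
    return model

def abnormal_kgrams(sequence, model, k=3):
    """Return the k-grams of a test sequence that never appear in training."""
    return [gram for gram in kgrams(sequence, k) if gram not in model]

# Hypothetical event names for a checkpointing thread.
normal_run = ["start", "write", "done"] * 3
model = train([normal_run])

# Test run in which the second checkpoint never reports "done".
test_run = ["start", "write", "done", "start", "write", "write"]
alerts = abnormal_kgrams(test_run, model)
```

A larger training set adds more legitimate K-grams to the model, which is consistent with the observation that results improve as the training set grows.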
26
Features discussed are general across systems

Although each feature is presented in a specific case, it is applicable to all cases
Feature I is a global property of the system
Features II and III are local properties
Can I always get what I want from console logs?
Of course not, but where can you?
Data in logs is carefully chosen by developers
Likely to contain useful information

27
Future Work

(Need your help) Analyze more production system logs
More comparisons with other tracing systems (e.g. X-Trace)
Look at multiple metrics combined
Online analysis of log streams
Extract log structure from binary data (e.g. Java bytecode) instead of source code
(If you are interested) Make a distributable tool set