Title: Detecting problems in Internet services by mining text logs
1Detecting problems in Internet services by mining
text logs
- Wei Xu, Armando Fox and Dave Patterson
- RAD Lab
- June 2008
2RAD Lab Overview
Low level spec
Com- piler
High level spec
Instrumentation Backplane
New apps, equipment, global policies (eg SLA)
Offered load, resource utilization, etc.
Director
Policy-awareswitching
Training data
performance cost models
Log Mining
3Console Logs - The forgotten treasure
- Most widely deployed program tracing tool
- From printf to various loggers, gt 50 years
- Includes developer selected information
- Very expressive, contains multi-dimentional data
- No need for operator to modify anything
- Although developers use console log heavily,
operators/users generally ignore them - Because console logs are hard to read!
4Console logs are ignored because they are hard
to read
- Verbose
- Different levels of implementation details
- Awkward language
Human
- Highly unstructured, almost no schema
Machine
Problem Dont know what to look for!
5Goal Make console log analysis easier
- Provide a systematic method that
- Routinely analyze logs, suggest potential
abnormal behavior - NO query (as in search) from users required
- Query is required by all commonly used analysis
methods - grep
- Splunk
- Log filters / viewers
6Three distinct case studies
- Covers multiple styles of coding and logging
- All features applies to every system
7Outline
- Approach overview
- Log analysis for system operator
- Project Darkstar
- Hadoop file system
- Log analysis for developers
- SysX A distributed storage system
- Summary and future work
8Text Log Analysis Overview
- TransactionImpltid468, , statePREPARING
Free text log
...
tid 468
state PREPARING
Extract structured data
Generate Features
Execution Sequence
Value percentage
Label Lifecycle
Problem!
Problem!
Problem!
Monitoring Detection
Clustering
K-Gram
9Extract Structured Data
- By match log lines to source code
070505 122335 checking memory usage 348MB
Log Template (ID15)Checking memory usage
(.)MB
Class Server void periodicTask(Task
task) LOG.info(checking
memory usage checkmemory() MB)
Where does the log comes from template for each
log line
Data contained in the log
10Log extraction is not trivial in object oriented
languages
- Parsing to abstract syntax tree (AST) level
- Resolve toString() method of object used
- Resolve dynamic binding (subclassing)
- Index log templates to improve matching speed
- Use auto-generated regular expression to extract
variable values
11Log Extraction ExampleComplex log structures
- logger.log(Level.FINER, "prepareAndCommit txn
0", txn)
Source
080525 224606,080 prepareAndCommit txn
TxnTrampolineoriginalTxnTransactionImpltid468,
creationTime1211780766069, timeout100,
statePREPARING
Log
Extracted
tid468
creationTime 1211780766069
state PREPARING
timeout 100
LOGID DataStoreImpl-1207
12Fundamental Difference from Search Based Log
Analysis
- Log itself as a language vs. log as documents
with English words - Analysis on log sequences vs. analysis on key
words - Clustering gtReveals the program structure
- gtReveals runtime state of program
- Scoringgt Find most important sequence in log
13Outline
- Approach overview
- Log analysis for system operator
- Project Darkstar
- Hadoop file system
- Log analysis for developers
- SysX A distributed storage system
- Summary and future work
14Case Study I Project Darkstar
- Experiment to test Darkstar under CPU disturbance
- Turned on CPU power save as a CPU cap (common in
shared VM hosting) - Observed significant performance degradation
CPU power save
CPU Recovered
15Feature I Value Distribution of a Single Label
- Find labels (non-numerical) variables
- Only have a small set of distinct values
- Appear in log lots of times
- In Darkstar, only 2 variables chosen
- Periodically calculate percentage of each
distinct value - Calculate ?2 to detect abnormal percentage
state -gtACTIVE -gtPREPARING
-gtCOMMITTING -gt darkmud.mapChannel
-gtABORTING -gtABORTED
Util.formatOps -gtR -gtRW -gtW
16Detection Results
CPU power save
CPU Recovered
Response Time (ms)
?2 of Ops
ACTIVE 1598 PREPARING 795 COMMITTING
217 darkmud.mapChannel 26 ABORTING 793 ABORTED
234
?2 of state
ACTIVE 1001 PREPARING 809 COMMITTING
269 darkmud.mapChannel 18 ABORTING 3 ABORTED 3
Time (sec)
17Case II Hadoop File SystemFeature II Label
Lifecycle
- Problem
- Why some blocks takes longer than others?
- Find labels that
- Have many distinct values
- Appear in multiple nodes (master, slave etc.)
- Reported lots of times
- Log lines reporting a label shows the lifecycle
of the label - Similar to execution trace without ordering
- Treat as (unordered) sets of events
18Lifecycle of a label is similar to execution path
The Lifecycle of blk_8490295198170274781
- datanode_r16 Receiving block blk_849029519817027
4781 src dest... - namenode_r10 BLOCK NameSystem.allocateBlock
blk_8490295198170274781 - datanode_r14 Receiving block blk_849029519817027
4781 src dest - datanode_r16 Received block blk_8490295198170274
781 of size 49486737 from - datanode_r14 Received block blk_8490295198170274
781 of size 49486737 from - datanode_r14 PacketResponder 0 for block
blk_8490295198170274781 terminating - datanode_r16 PacketResponder 1 for block
blk_8490295198170274781 terminating - namenode_r10 BLOCK NameSystem.addStoredBlock
blockMap updated 169.229.48.8250010 is added to
blk_8490295198170274781 size 49486737
19Cluster labels with similar lifecycles
- Unusual clusters might indicate problem
- Hadoop FS Result
- Only one variable (block id) chosen
- 11 clusters, 3 commons, 8 uncommon
- Examine 8 uncommon clusters
- 4 (runtime) error identified
- 1 false positive due to insufficient training set
- 3 not understood
- May reuse algorithms from path-based analysis
20Outline
- Approach overview
- Log analysis for system operator
- Project Darkstar
- Hadoop file system
- Log analysis for developers
- SysX A distributed storage system
- Summary and future work
21Case Study III SysX A real dist. storage system
Master
Heartbeats
Storage Server
Storage Server
Client
Requests
Background tasks
- Problem
- In memory buffers are checkpointed to a shared FS
periodically - When Checkpoint fails, Storage server crashes
with OutOfMemoryError - Several days to debug, gt100 debugging lines
22Why so hard to debug?
Line 858
No message showing checkpoint done! No error
message, nothing to grep
Line 2094
23Feature III Single Thread Event Sequence
- Generates training sets
- Logs from normal executions, not hard in deployed
systems - Capture normal sequence of each thread
- Find abnormal sequence ( never seen in
training data) - Detected the failed checkpoint problem w/o false
positives - Results improves with larger training set
24K-gram model details
Training Set (Normal)
Testing Set (Checkpoint Hang)
K3
25Outline
- Approach overview
- Log analysis for system operator
- Project Darkstar
- Hadoop file system
- Log analysis for developers
- SysX A distributed storage system
- Summary and future work
26Features discussed are general in systems
- Although each feature is presented in a specific
case, it is applicable to all cases - Feature I is a global property of the system
- Feature II and III are local properties
- Can I always get what I want from console log?
- Of course not, but where can you?
- Data in logs are carefully chosen by developers
- Likely to contain useful information
27Future Work
- (Need your help) analyze more production system
logs - More comparison with other tracing systems (e.g.
Xtrace) - Looking at multiple metrics combined
- Online analysis of log streams
- Extract log structure from binary data (e.g. Java
byte code) instead of source code - (If you are interested) make a distributable tool
set