Detecting problems in Internet services by mining text logs - PowerPoint PPT Presentation

1 / 27
About This Presentation
Title:

Detecting problems in Internet services by mining text logs

Description:

Most widely deployed program tracing tool. From printf to various 'loggers', 50 years ... Summary and future work. 13. Case Study I: Project Darkstar ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 28
Provided by: xuw
Category:

less

Transcript and Presenter's Notes

Title: Detecting problems in Internet services by mining text logs


1
Detecting problems in Internet services by mining
text logs
  • Wei Xu, Armando Fox and Dave Patterson
  • RAD Lab
  • June 2008

2
RAD Lab Overview
Low level spec
Com- piler
High level spec
Instrumentation Backplane
New apps, equipment, global policies (eg SLA)
Offered load, resource utilization, etc.
Director
Policy-awareswitching
Training data
performance cost models
Log Mining
3
Console Logs - The forgotten treasure
  • Most widely deployed program tracing tool
  • From printf to various loggers, gt 50 years
  • Includes developer selected information
  • Very expressive, contains multi-dimentional data
  • No need for operator to modify anything
  • Although developers use console log heavily,
    operators/users generally ignore them
  • Because console logs are hard to read!

4
Console logs are ignored because they are hard
to read
  • Verbose
  • Different levels of implementation details
  • Awkward language

Human
  • Highly unstructured, almost no schema

Machine
Problem Dont know what to look for!
5
Goal Make console log analysis easier
  • Provide a systematic method that
  • Routinely analyze logs, suggest potential
    abnormal behavior
  • NO query (as in search) from users required
  • Query is required by all commonly used analysis
    methods
  • grep
  • Splunk
  • Log filters / viewers

6
Three distinct case studies
  • Covers multiple styles of coding and logging
  • All features applies to every system

7
Outline
  • Approach overview
  • Log analysis for system operator
  • Project Darkstar
  • Hadoop file system
  • Log analysis for developers
  • SysX A distributed storage system
  • Summary and future work

8
Text Log Analysis Overview
  • TransactionImpltid468, , statePREPARING

Free text log
...
tid 468
state PREPARING
Extract structured data
Generate Features
Execution Sequence
Value percentage
Label Lifecycle
Problem!
Problem!
Problem!
Monitoring Detection
Clustering
K-Gram
9
Extract Structured Data
  • By match log lines to source code

070505 122335 checking memory usage 348MB
Log Template (ID15)Checking memory usage
(.)MB
Class Server void periodicTask(Task
task) LOG.info(checking
memory usage checkmemory() MB)

Where does the log comes from template for each
log line
Data contained in the log
10
Log extraction is not trivial in object oriented
languages
  • Parsing to abstract syntax tree (AST) level
  • Resolve toString() method of object used
  • Resolve dynamic binding (subclassing)
  • Index log templates to improve matching speed
  • Use auto-generated regular expression to extract
    variable values

11
Log Extraction ExampleComplex log structures
  • logger.log(Level.FINER, "prepareAndCommit txn
    0", txn)

Source
080525 224606,080 prepareAndCommit txn
TxnTrampolineoriginalTxnTransactionImpltid468,
creationTime1211780766069, timeout100,
statePREPARING
Log
Extracted
tid468
creationTime 1211780766069
state PREPARING
timeout 100
LOGID DataStoreImpl-1207
12
Fundamental Difference from Search Based Log
Analysis
  • Log itself as a language vs. log as documents
    with English words
  • Analysis on log sequences vs. analysis on key
    words
  • Clustering gtReveals the program structure
  • gtReveals runtime state of program
  • Scoringgt Find most important sequence in log

13
Outline
  • Approach overview
  • Log analysis for system operator
  • Project Darkstar
  • Hadoop file system
  • Log analysis for developers
  • SysX A distributed storage system
  • Summary and future work

14
Case Study I Project Darkstar
  • Experiment to test Darkstar under CPU disturbance
  • Turned on CPU power save as a CPU cap (common in
    shared VM hosting)
  • Observed significant performance degradation

CPU power save
CPU Recovered
15
Feature I Value Distribution of a Single Label
  • Find labels (non-numerical) variables
  • Only have a small set of distinct values
  • Appear in log lots of times
  • In Darkstar, only 2 variables chosen
  • Periodically calculate percentage of each
    distinct value
  • Calculate ?2 to detect abnormal percentage

state -gtACTIVE -gtPREPARING
-gtCOMMITTING -gt darkmud.mapChannel
-gtABORTING -gtABORTED
Util.formatOps -gtR -gtRW -gtW
16
Detection Results
CPU power save
CPU Recovered
Response Time (ms)
?2 of Ops

ACTIVE 1598 PREPARING 795 COMMITTING
217 darkmud.mapChannel 26 ABORTING 793 ABORTED
234
?2 of state
ACTIVE 1001 PREPARING 809 COMMITTING
269 darkmud.mapChannel 18 ABORTING 3 ABORTED 3
Time (sec)
17
Case II Hadoop File SystemFeature II Label
Lifecycle
  • Problem
  • Why some blocks takes longer than others?
  • Find labels that
  • Have many distinct values
  • Appear in multiple nodes (master, slave etc.)
  • Reported lots of times
  • Log lines reporting a label shows the lifecycle
    of the label
  • Similar to execution trace without ordering
  • Treat as (unordered) sets of events

18
Lifecycle of a label is similar to execution path
The Lifecycle of blk_8490295198170274781
  • datanode_r16 Receiving block blk_849029519817027
    4781 src dest...
  • namenode_r10 BLOCK NameSystem.allocateBlock
    blk_8490295198170274781
  • datanode_r14 Receiving block blk_849029519817027
    4781 src dest
  • datanode_r16 Received block blk_8490295198170274
    781 of size 49486737 from
  • datanode_r14 Received block blk_8490295198170274
    781 of size 49486737 from
  • datanode_r14 PacketResponder 0 for block
    blk_8490295198170274781 terminating
  • datanode_r16 PacketResponder 1 for block
    blk_8490295198170274781 terminating
  • namenode_r10 BLOCK NameSystem.addStoredBlock
    blockMap updated 169.229.48.8250010 is added to
    blk_8490295198170274781 size 49486737

19
Cluster labels with similar lifecycles
  • Unusual clusters might indicate problem
  • Hadoop FS Result
  • Only one variable (block id) chosen
  • 11 clusters, 3 commons, 8 uncommon
  • Examine 8 uncommon clusters
  • 4 (runtime) error identified
  • 1 false positive due to insufficient training set
  • 3 not understood
  • May reuse algorithms from path-based analysis

20
Outline
  • Approach overview
  • Log analysis for system operator
  • Project Darkstar
  • Hadoop file system
  • Log analysis for developers
  • SysX A distributed storage system
  • Summary and future work

21
Case Study III SysX A real dist. storage system
Master
Heartbeats
Storage Server
Storage Server
Client
Requests
Background tasks
  • Problem
  • In memory buffers are checkpointed to a shared FS
    periodically
  • When Checkpoint fails, Storage server crashes
    with OutOfMemoryError
  • Several days to debug, gt100 debugging lines

22
Why so hard to debug?
Line 858
No message showing checkpoint done! No error
message, nothing to grep
Line 2094
23
Feature III Single Thread Event Sequence
  • Generates training sets
  • Logs from normal executions, not hard in deployed
    systems
  • Capture normal sequence of each thread
  • Find abnormal sequence ( never seen in
    training data)
  • Detected the failed checkpoint problem w/o false
    positives
  • Results improves with larger training set

24
K-gram model details
Training Set (Normal)
Testing Set (Checkpoint Hang)
K3
25
Outline
  • Approach overview
  • Log analysis for system operator
  • Project Darkstar
  • Hadoop file system
  • Log analysis for developers
  • SysX A distributed storage system
  • Summary and future work

26
Features discussed are general in systems
  • Although each feature is presented in a specific
    case, it is applicable to all cases
  • Feature I is a global property of the system
  • Feature II and III are local properties
  • Can I always get what I want from console log?
  • Of course not, but where can you?
  • Data in logs are carefully chosen by developers
  • Likely to contain useful information

27
Future Work
  • (Need your help) analyze more production system
    logs
  • More comparison with other tracing systems (e.g.
    Xtrace)
  • Looking at multiple metrics combined
  • Online analysis of log streams
  • Extract log structure from binary data (e.g. Java
    byte code) instead of source code
  • (If you are interested) make a distributable tool
    set
Write a Comment
User Comments (0)
About PowerShow.com