Title: Root Cause Analysis of Failures in Large-Scale Computing Environments
1. Root Cause Analysis of Failures in Large-Scale Computing Environments
- Naoya Maruyama (Tokyo Tech)
- Alexander V. Mirgorodskiy (UW-Madison)
- Barton P. Miller (UW-Madison)
- Satoshi Matsuoka (Tokyo Tech / NII)
2. Background
- Diagnosing faults in large-scale systems is hard.
- Use of diverse software/hardware components
- Often observed only on a production system
- Non-deterministic behaviors caused by different orders of operations
- Despite these complexities, traditional approaches are insufficient.
- An interactive debugger does not scale with the number of processes/hosts
- printf debugging is too ad hoc to be used in production systems
3. Assumptions and Objectives
- Systematic, well-focused fault diagnosis for large-scale computing environments
- Automate the process as much as possible
- Low false positive/negative rates
- We assume:
- SPMD-style distributed systems
- A failure is observed by other components and reported to the diagnosis engine
4. Our Idea
- Narrow down the diagnosis by identifying behavioral differences between correct and incorrect executions, e.g.,
- A function was called only when the program crashed.
- A certain packet was never delivered to the destination when the program hung.
- How?
- Collect execution data of programs at run time
- Identify anomalies in the data
- Normal behavior ≈ correct behavior
- Anomalous behavior ≈ incorrect behavior
- Compare the normal and anomalous behaviors
5. Current Achievements
- Prototype implementation for SPMD-style distributed systems
- Demonstration of systematic diagnosis of faults in a real production system
6. Overview of the Diagnosis Steps
- Data Collection
- Monitors and traces the execution data of a target system
- Data Analysis
- Identifies anomalies inside the trace
- Reports the results to the analyst for further
investigation
7. Data Collection
- Collects function call traces
- Captures control-flow behaviors
- Can be extended to incorporate other types of behaviors, such as memory management operations, concurrency, and communication
- How to collect the data?
- Use spTracer, a lightweight dynamic instrumentation technique [Mirgorodskiy et al., 04]
- Injects a tracing agent into a process of interest
- The agent inserts trace statements at all function call sites
- The statements generate log records with timestamps
- Keep the trace in a shared-memory segment to retain the data even if the process crashes.
- Manage the trace in a circular buffer (see the sketch below)
- Keeps only the most recent trace of a fixed length
- Avoids unlimited growth of the trace size
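The circular-buffer policy above can be illustrated with a minimal Python sketch. This is a simplification: the real spTracer agent keeps its records in a shared-memory segment so they survive a crash of the traced process, whereas the deque below only shows how a fixed-length buffer retains the most recent records; all names are illustrative.

from collections import deque

class TraceBuffer:
    """Fixed-length record store: once full, the oldest records are
    dropped, so the trace never grows without bound."""

    def __init__(self, max_records=100_000):
        # deque(maxlen=...) behaves as a circular buffer.
        self.records = deque(maxlen=max_records)

    def append(self, kind, func_addr, pid, tid, timestamp):
        # kind is "ENTER" or "LEAVE"; timestamp is the cycle counter.
        self.records.append((kind, func_addr, pid, tid, timestamp))

    def dump(self):
        # Emit records in the textual format shown on the next slide.
        return "\n".join(
            f"{k} func_addr {f} pid {p} tid {t} timestamp {ts}"
            for k, f, p, t, ts in self.records)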
8. Textual Representation of Call Traces
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163258
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163936
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746164571
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746165197
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746165828
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746166395
LEAVE func_addr 0x80de590 pid 5095 tid 4 timestamp 12131002746166938
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746167573
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746179202
ENTER func_addr 0x80de750 pid 5095 tid 4 timestamp 12131002746180027
ENTER func_addr 0x811b070 pid 5095 tid 4 timestamp 12131002746180691
ENTER func_addr 0x8138710 pid 5095 tid 4 timestamp 12131002746181359
LEAVE func_addr 0x8138710 pid 5095 tid 4 timestamp 12131002746185934

The timestamp field is the system cycle counter.
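As a hedged illustration, records in this format can be reduced to a per-function time profile, the input to the analysis that follows. The regular expression and the attribution of inclusive time per function are assumptions of this sketch, not a description of the actual tool.

import re
from collections import defaultdict

RECORD = re.compile(
    r"(ENTER|LEAVE) func_addr (0x[0-9a-f]+) pid (\d+) tid (\d+) timestamp (\d+)")

def time_profile(trace_text):
    """Accumulate inclusive time (in cycles) per function from a textual trace."""
    stacks = defaultdict(list)    # (pid, tid) -> stack of (func_addr, enter_ts)
    profile = defaultdict(int)    # func_addr -> accumulated cycles
    for kind, func, pid, tid, ts in RECORD.findall(trace_text):
        key, ts = (pid, tid), int(ts)
        if kind == "ENTER":
            stacks[key].append((func, ts))
        elif stacks[key] and stacks[key][-1][0] == func:
            _, enter_ts = stacks[key].pop()
            profile[func] += ts - enter_ts
        # LEAVE records without a matching ENTER (truncated buffer) are skipped.
    return dict(profile)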
9. Visualizing Traces with Jumpshot
- Traces can be exported to the SLOG format for visualization with Jumpshot
- Multiple rows represent multiple threads/processes
- Each rectangle represents a function invocation
- Nested rectangles represent nested function calls
10. Data Analysis
- Two-step analysis
- Finding the most anomalous process (trace)
- Finding the most anomalous function
- Presents two techniques
- Identifying fail-stop anomalies
- Identifying non-fail-stop anomalies
11. Data Analysis: Identifying Fail-Stop Anomalies
- Finds the process that stopped generating trace records first (see the sketch below)
- If it ended substantially earlier than the others → a fail-stop anomaly
- If the traces ended at similar times → move on to identifying non-fail-stop anomalies
(Figure: trace end times across processes, contrasting the fail-stop case with the non-fail-stop case.)
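A sketch of the fail-stop check, assuming a hypothetical gap_threshold parameter for "substantially earlier"; the system's actual criterion may differ.

def find_fail_stop(last_timestamps, gap_threshold):
    """last_timestamps: host -> timestamp of its last trace record
    (at least two hosts assumed).  Returns the host whose trace ended
    substantially earlier than all others, or None (non-fail-stop case)."""
    ordered = sorted(last_timestamps.items(), key=lambda kv: kv[1])
    (first_host, first_end), (_, second_end) = ordered[0], ordered[1]
    if second_end - first_end > gap_threshold:
        return first_host      # fail-stop anomaly: inspect its last trace entries
    return None                # fall through to the distance-based analysis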
12. Data Analysis: Identifying Non-Fail-Stop Anomalies
- Apply a distance-based anomaly detection technique
- Define a distance metric between each pair of traces
- Define a trace suspect score
- Report the traces with the highest suspect scores
13. Defining the Distance Metric
- Say there are only two functions, func_A and func_B, and three traces, trace_X, trace_Y, trace_Z (see the sketch below)
(Figure: each trace plotted by the normalized time spent in func_A and func_B; trace_X at about (0.5, 0.5) and trace_Y at about (0.6, 0.4) lie close together, while trace_Z at about (0, 1.0) is far from both.)
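One plausible way to make this geometric picture concrete, sketched in Python: normalize each profile to fractions of total time and take the Euclidean distance over the union of observed functions. The exact metric used by the tool may differ.

import math

def normalize(profile):
    """Turn absolute per-function times into fractions of the total,
    so traces of different lengths become comparable."""
    total = sum(profile.values()) or 1
    return {f: t / total for f, t in profile.items()}

def distance(x, y):
    """Euclidean distance between two normalized time profiles."""
    funcs = set(x) | set(y)
    return math.sqrt(sum((x.get(f, 0.0) - y.get(f, 0.0)) ** 2 for f in funcs))

Under the example coordinates above, trace_X and trace_Y end up much closer to each other than either is to trace_Z.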
14. Defining the Suspect Score
(Figure: traces plotted as points; h lies inside the big mass, g is an isolated outlier, and s(h) and s(g) are the distances to their nearest neighbors.)
- Common behavior is normal
- Suspect score s(h) = distance to the nearest neighbor
- Report the process with the highest s to the analyst
- h is in the big mass, s(h) is low, h is normal
- g is a single outlier, s(g) is high, g is an anomaly
- What if there is more than one anomaly?
15. Defining the Suspect Score
(Figure: the same traces, computing the score with k = 2; sk(g) is the distance from g to its 2nd-nearest neighbor.)
- Suspect score sk(h) = distance to the k-th nearest neighbor (see the ranking sketch below)
- Exclude the (k-1) closest neighbors
- Sensitivity study: k = NumProcesses/4 works well
- Represents the distance to the big mass
- h is in the big mass, its k-th neighbor is close, sk(h) is low
- g is an outlier, its k-th neighbor is far, sk(g) is high
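A sketch of the unsupervised ranking based on this score; the distance() helper repeats the Euclidean assumption from the earlier sketch, and the fallback when fewer than k neighbors exist is also an assumption.

from math import sqrt

def distance(x, y):
    # Euclidean distance over normalized profiles, as in the earlier sketch.
    return sqrt(sum((x.get(f, 0.0) - y.get(f, 0.0)) ** 2 for f in set(x) | set(y)))

def rank_by_suspect_score(profiles, k):
    """profiles: host -> normalized time profile.
    sk(h) = distance from h to its k-th nearest neighbor, i.e. the (k-1)
    closest traces are excluded.  Returns hosts ranked most suspect first."""
    scores = {}
    for host, prof in profiles.items():
        dists = sorted(distance(prof, other)
                       for h, other in profiles.items() if h != host)
        scores[host] = dists[min(k, len(dists)) - 1] if dists else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

Per the sensitivity result quoted above, k = len(profiles) // 4 is a reasonable default.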
16. Defining the Suspect Score
(Figure: the same traces; g is far from the mass around h, so sk(g) is high.)
- Anomalous means unusual, but unusual does not always mean anomalous!
- E.g., an MPI master is different from all the workers
- It would be reported as an anomaly (a false positive)
- Distinguish false positives from true anomalies
- With knowledge of system internals: manual effort
- With previous execution history: can be automated
17. Defining the Suspect Score
(Figure: trial traces with outlier g and in-mass h, plus a known-normal trace n from a previous run lying close to g.)
- Add traces from a known-normal previous run
- One-class classification
- Suspect score sk(h) = distance to the k-th trial neighbor or to the 1st known-normal neighbor, whichever is smaller (see the sketch below)
- Distance to the big mass or to known-normal behavior
- h is in the big mass, its k-th neighbor is close, sk(h) is low
- g is an outlier, but the normal node n is close, so sk(g) is low
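A sketch of the supervised variant: the score becomes the smaller of the distance to the k-th trial neighbor and the distance to the nearest known-normal trace, so a process that matches a previous normal run is not flagged. The distance() helper from the earlier sketch is assumed.

def supervised_score(target, other_trial_profiles, known_normal_profiles, k):
    """target: one normalized profile from the current (trial) run.
    other_trial_profiles: the remaining trial profiles.
    known_normal_profiles: profiles from known-normal previous runs."""
    trial_d = sorted(distance(target, p) for p in other_trial_profiles)
    kth = trial_d[min(k, len(trial_d)) - 1] if trial_d else float("inf")
    nearest_normal = min((distance(target, p) for p in known_normal_profiles),
                         default=float("inf"))
    # Low if the trace is close to the big mass OR to known-normal behavior.
    return min(kth, nearest_normal)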
18. Finding the Anomalous Function
- Fail-stop problems
- The failure is in the last function invoked
- Non-fail-stop problems
- Find why process h was marked as an anomaly
- Report the function with the highest contribution to s(h)
- s(h) = d(h, g), where g is the chosen neighbor
- anomFn = arg max_i δ_i, where δ_i is the contribution of function i to d(h, g), i.e., the maximum component of the delta vector (see the sketch below)
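A sketch of the non-fail-stop case, consistent with the Euclidean distance assumed earlier: decompose d(h, g) into per-function contributions (the delta vector) and report the largest component.

def most_anomalous_function(h_profile, g_profile):
    """h_profile: the anomalous trace; g_profile: its chosen neighbor.
    Returns (func_addr, contribution) for the function contributing most
    to the distance between the two profiles."""
    funcs = set(h_profile) | set(g_profile)
    delta = {f: (h_profile.get(f, 0.0) - g_profile.get(f, 0.0)) ** 2
             for f in funcs}
    top = max(delta, key=delta.get)
    return top, delta[top]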
19. Experimental Study: SCore on TitechGrid
- TitechGrid
- A 129-node PC cluster at the Tokyo Institute of Technology
- Serving as a production system for over a year
- SCore v5.4 operated in multi-user mode
- The scored daemons are connected in a ring with the sc_watch process
- Each process in the ring sends patrol messages to the next daemon
- If sc_watch receives no patrol messages for 10 minutes, it kills and restarts all the daemons
20. Applying the Diagnosis System to SCore
- Injects the tracing agents into all scoreds
- Instruments sc_watch to save the in-memory traces when the daemons are being killed
- Identifies the anomalous trace
- Identifies the anomalous function/call path
21. Finding the Host
(Figure: suspect score per host; n129 stands out.)
- Host n129 is unusual: different from the others
- Host n129 is anomalous: not present in previous known-normal runs
- Host n129 is a new anomaly: not present in previous known-faulty runs
22. Finding the Cause
(Figure: the anomalous call chain output_job_status → score_write_short → score_write → __libc_write.)
- Call chain with the highest contribution to the suspect score: output_job_status → score_write_short → score_write → __libc_write
- It tries to output a log message to the scbcast process
- The writes to the scbcast process kept blocking for 10 minutes
- scbcast had stopped reading data from its socket: bug!
- scored did not handle it well (spun in an infinite loop): bug!
23. Related Work
- Magpie [Barham et al., 04]
- Diagnoses distributed systems that serve HTTP requests
- Components include web front-ends, backend databases, etc.
- For errors in incoming HTTP requests, locates where the errors happened by examining the request log on each component
- Applies cluster analysis to the request logs
- PinPoint [Chen et al., 02]
- Modified JBoss to record some of the events associated with a request
- Events such as send/recv and exception handling
- Finds correlations between request failures and the recorded events through cluster analysis
Both techniques are well suited to collections of small request paths, but not to daemon-type processes like SCore.
24. Conclusion
- Summary
- Proposed systematic diagnosis methods for distributed systems
- Demonstrated the efficacy of the methods on a real production system
- Next Steps
- Application study with a wider range of systems
- Richer execution data for smarter analysis
25. Discussion
26. References
- Omitted; refer to the paper
27. Related Work
- Similar approaches to similar problems
- Different approaches to similar problems
- Similar approaches to different problems
28. Overview of the Diagnosis Steps
(Flow diagram: Step 1, tracing with spTracer, produces the trace set. Step 2 finds the anomalous trace: in the fail-stop case via the earliest last timestamp, in the non-fail-stop case via supervised trace ranking based on the suspect score if reference traces are available, or unsupervised trace ranking otherwise. Step 3 finds the anomalous trace function via the last trace entry (fail-stop) or the max component of the delta vector (non-fail-stop), supported by trace visualization with Jumpshot, and reports to the analyst.)
29. Overview of the Diagnosis Steps
(Diagram: a set of processes, each producing execution data as logs; one process crashes.)