Title: Root Cause Analysis of Failures in Large-Scale Computing Environments
1. Root Cause Analysis of Failures in Large-Scale Computing Environments
- Naoya Maruyama (Tokyo Tech)
- Alexander V. Mirgorodskiy (UW-Madison)
- Barton P. Miller (UW-Madison)
- Satoshi Matsuoka (Tokyo Tech / NII)
2. Background
- Diagnosing faults in large-scale systems is hard.
- Use of diverse software/hardware components
- Often observed only on a production system
- Non-deterministic behaviors caused by different orders of operations
- Despite these complexities, traditional approaches are insufficient.
- An interactive debugger does not scale with the number of processes/hosts
- printf debugging is too ad hoc to be used in production systems
3. Assumptions and Objectives
- Systematic, well-focused fault diagnosis for large-scale computing environments
- Automate the process as much as possible
- Low false positive/negative rates
- We assume:
- SPMD-style distributed systems
- A failure is observed by other components and reported to the diagnosis engine
4. Our Idea
- Narrow down the diagnosis by identifying behavioral differences between correct and incorrect executions, e.g.,
- A function was called only when the program crashed.
- A certain packet was never delivered to the destination when the program hung.
- How?
- Collect execution data of programs at run time
- Identify anomalies in the data
- Normal behavior ≈ correct behavior
- Anomalous behavior ≈ incorrect behavior
- Compare the normal and anomalous behaviors
5. Current Achievements
- Prototype implementation for SPMD-style distributed systems
- Demonstration of systematic diagnosis of faults in a real production system
6. Overview of the Diagnosis Steps
- Data Collection
- Monitors and traces the execution data of a target system
- Data Analysis
- Identifies anomalies inside the trace
- Reports the results to the analyst for further
investigation
7. Data Collection
- Collects function call traces
- Captures control-flow behaviors
- Can be extended to incorporate other types of behaviors, such as memory management operations, concurrency, and communication
- How to collect the data?
- Use spTracer, a lightweight dynamic instrumentation technique [Mirgorodskiy et al., 04]
- Injects a tracing agent into a process of interest
- The agent inserts trace statements at all function call sites
- The statements generate log records with timestamps
- Keep the trace in a shared-memory segment to retain the data even if the process crashes.
- Manage the trace in a circular buffer (see the sketch below)
- Keeps only the most recent trace of a fixed length
- Avoids unlimited growth of the trace size
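The circular-buffer policy above can be illustrated with a minimal Python sketch. This is a simplification: the real spTracer agent keeps its records in a shared-memory segment so they survive a crash of the traced process, whereas the deque below only shows how a fixed-length buffer retains the most recent records; all names are illustrative.

from collections import deque

class TraceBuffer:
    """Fixed-length record store: once full, the oldest records are
    dropped, so the trace never grows without bound."""

    def __init__(self, max_records=100_000):
        # deque(maxlen=...) behaves as a circular buffer.
        self.records = deque(maxlen=max_records)

    def append(self, kind, func_addr, pid, tid, timestamp):
        # kind is "ENTER" or "LEAVE"; timestamp is the cycle counter.
        self.records.append((kind, func_addr, pid, tid, timestamp))

    def dump(self):
        # Emit records in the textual format shown on the next slide.
        return "\n".join(
            f"{k} func_addr {f} pid {p} tid {t} timestamp {ts}"
            for k, f, p, t, ts in self.records)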
8. Textual Representation of Call Traces
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163258
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163936
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746164571
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746165197
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746165828
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746166395
LEAVE func_addr 0x80de590 pid 5095 tid 4 timestamp 12131002746166938
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746167573
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746179202
ENTER func_addr 0x80de750 pid 5095 tid 4 timestamp 12131002746180027
ENTER func_addr 0x811b070 pid 5095 tid 4 timestamp 12131002746180691
ENTER func_addr 0x8138710 pid 5095 tid 4 timestamp 12131002746181359
LEAVE func_addr 0x8138710 pid 5095 tid 4 timestamp 12131002746185934

The timestamp field is the system cycle counter.
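As a hedged illustration, records in this format can be reduced to a per-function time profile, the input to the analysis that follows. The regular expression and the attribution of inclusive time per function are assumptions of this sketch, not a description of the actual tool.

import re
from collections import defaultdict

RECORD = re.compile(
    r"(ENTER|LEAVE) func_addr (0x[0-9a-f]+) pid (\d+) tid (\d+) timestamp (\d+)")

def time_profile(trace_text):
    """Accumulate inclusive time (in cycles) per function from a textual trace."""
    stacks = defaultdict(list)    # (pid, tid) -> stack of (func_addr, enter_ts)
    profile = defaultdict(int)    # func_addr -> accumulated cycles
    for kind, func, pid, tid, ts in RECORD.findall(trace_text):
        key, ts = (pid, tid), int(ts)
        if kind == "ENTER":
            stacks[key].append((func, ts))
        elif stacks[key] and stacks[key][-1][0] == func:
            _, enter_ts = stacks[key].pop()
            profile[func] += ts - enter_ts
        # LEAVE records without a matching ENTER (truncated buffer) are skipped.
    return dict(profile)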
9. Visualizing Traces with Jumpshot
- Traces can be exported to the SLOG format for visualization with Jumpshot
- Multiple rows represent multiple threads/processes
- Each rectangle represents a function invocation
- Nested rectangles represent nested function calls
10. Data Analysis
- Two-step analysis
- Finding the most anomalous process (trace)
- Finding the most anomalous function
- Presents two techniques
- Identifying fail-stop anomalies
- Identifying non-fail-stop anomalies
11. Data Analysis: Identifying Fail-Stop Anomalies
- Finds the process that stopped generating trace records first (see the sketch below)
- If it ended substantially earlier than the others → a fail-stop anomaly
- If the traces ended at similar times → move on to identifying non-fail-stop anomalies
(Figure: trace end times across processes, contrasting the fail-stop case with the non-fail-stop case.)
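A sketch of the fail-stop check, assuming a hypothetical gap_threshold parameter for "substantially earlier"; the system's actual criterion may differ.

def find_fail_stop(last_timestamps, gap_threshold):
    """last_timestamps: host -> timestamp of its last trace record
    (at least two hosts assumed).  Returns the host whose trace ended
    substantially earlier than all others, or None (non-fail-stop case)."""
    ordered = sorted(last_timestamps.items(), key=lambda kv: kv[1])
    (first_host, first_end), (_, second_end) = ordered[0], ordered[1]
    if second_end - first_end > gap_threshold:
        return first_host      # fail-stop anomaly: inspect its last trace entries
    return None                # fall through to the distance-based analysis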
12. Data Analysis: Identifying Non-Fail-Stop Anomalies
- Apply a distance-based anomaly detection technique
- Define a distance metric between each pair of traces
- Define a trace suspect score
- Report the traces with the highest suspect scores
13. Defining the Distance Metric
- Say there are only two functions, func_A and func_B, and three traces, trace_X, trace_Y, trace_Z (see the sketch below)
(Figure: each trace plotted by the normalized time spent in func_A and func_B; trace_X at about (0.5, 0.5) and trace_Y at about (0.6, 0.4) lie close together, while trace_Z at about (0, 1.0) is far from both.)
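One plausible way to make this geometric picture concrete, sketched in Python: normalize each profile to fractions of total time and take the Euclidean distance over the union of observed functions. The exact metric used by the tool may differ.

import math

def normalize(profile):
    """Turn absolute per-function times into fractions of the total,
    so traces of different lengths become comparable."""
    total = sum(profile.values()) or 1
    return {f: t / total for f, t in profile.items()}

def distance(x, y):
    """Euclidean distance between two normalized time profiles."""
    funcs = set(x) | set(y)
    return math.sqrt(sum((x.get(f, 0.0) - y.get(f, 0.0)) ** 2 for f in funcs))

Under the example coordinates above, trace_X and trace_Y end up much closer to each other than either is to trace_Z.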
14. Defining the Suspect Score
(Figure: traces plotted as points; h lies inside the big mass, g is an isolated outlier, and s(h) and s(g) are the distances to their nearest neighbors.)
- Common behavior is normal
- Suspect score s(h) = distance to the nearest neighbor
- Report the process with the highest s to the analyst
- h is in the big mass, s(h) is low, h is normal
- g is a single outlier, s(g) is high, g is an anomaly
- What if there is more than one anomaly?
15. Defining the Suspect Score
(Figure: the same traces, computing the score with k = 2; sk(g) is the distance from g to its 2nd-nearest neighbor.)
- Suspect score sk(h) = distance to the k-th nearest neighbor (see the ranking sketch below)
- Exclude the (k-1) closest neighbors
- Sensitivity study: k = NumProcesses/4 works well
- Represents the distance to the big mass
- h is in the big mass, its k-th neighbor is close, sk(h) is low
- g is an outlier, its k-th neighbor is far, sk(g) is high
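A sketch of the unsupervised ranking based on this score; the distance() helper repeats the Euclidean assumption from the earlier sketch, and the fallback when fewer than k neighbors exist is also an assumption.

from math import sqrt

def distance(x, y):
    # Euclidean distance over normalized profiles, as in the earlier sketch.
    return sqrt(sum((x.get(f, 0.0) - y.get(f, 0.0)) ** 2 for f in set(x) | set(y)))

def rank_by_suspect_score(profiles, k):
    """profiles: host -> normalized time profile.
    sk(h) = distance from h to its k-th nearest neighbor, i.e. the (k-1)
    closest traces are excluded.  Returns hosts ranked most suspect first."""
    scores = {}
    for host, prof in profiles.items():
        dists = sorted(distance(prof, other)
                       for h, other in profiles.items() if h != host)
        scores[host] = dists[min(k, len(dists)) - 1] if dists else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

Per the sensitivity result quoted above, k = len(profiles) // 4 is a reasonable default.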
16. Defining the Suspect Score
(Figure: the same traces; g is far from the mass around h, so sk(g) is high.)
- Anomalous means unusual, but unusual does not always mean anomalous!
- E.g., an MPI master is different from all the workers
- It would be reported as an anomaly (a false positive)
- Distinguish false positives from true anomalies
- With knowledge of system internals: manual effort
- With previous execution history: can be automated
17. Defining the Suspect Score
(Figure: trial traces with outlier g and in-mass h, plus a known-normal trace n from a previous run lying close to g.)
- Add traces from a known-normal previous run
- One-class classification
- Suspect score sk(h) = distance to the k-th trial neighbor or to the 1st known-normal neighbor, whichever is smaller (see the sketch below)
- Distance to the big mass or to known-normal behavior
- h is in the big mass, its k-th neighbor is close, sk(h) is low
- g is an outlier, but the normal node n is close, so sk(g) is low
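A sketch of the supervised variant: the score becomes the smaller of the distance to the k-th trial neighbor and the distance to the nearest known-normal trace, so a process that matches a previous normal run is not flagged. The distance() helper from the earlier sketch is assumed.

def supervised_score(target, other_trial_profiles, known_normal_profiles, k):
    """target: one normalized profile from the current (trial) run.
    other_trial_profiles: the remaining trial profiles.
    known_normal_profiles: profiles from known-normal previous runs."""
    trial_d = sorted(distance(target, p) for p in other_trial_profiles)
    kth = trial_d[min(k, len(trial_d)) - 1] if trial_d else float("inf")
    nearest_normal = min((distance(target, p) for p in known_normal_profiles),
                         default=float("inf"))
    # Low if the trace is close to the big mass OR to known-normal behavior.
    return min(kth, nearest_normal)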
18. Finding the Anomalous Function
- Fail-stop problems
- The failure is in the last function invoked
- Non-fail-stop problems
- Find why process h was marked as an anomaly
- Report the function with the highest contribution to s(h)
- s(h) = d(h, g), where g is the chosen neighbor
- anomFn = arg max_i δ_i, where δ_i is the contribution of function i to d(h, g), i.e., the maximum component of the delta vector (see the sketch below)
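A sketch of the non-fail-stop case, consistent with the Euclidean distance assumed earlier: decompose d(h, g) into per-function contributions (the delta vector) and report the largest component.

def most_anomalous_function(h_profile, g_profile):
    """h_profile: the anomalous trace; g_profile: its chosen neighbor.
    Returns (func_addr, contribution) for the function contributing most
    to the distance between the two profiles."""
    funcs = set(h_profile) | set(g_profile)
    delta = {f: (h_profile.get(f, 0.0) - g_profile.get(f, 0.0)) ** 2
             for f in funcs}
    top = max(delta, key=delta.get)
    return top, delta[top]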
19. Experimental Study: SCore on TitechGrid
- TitechGrid
- A 129-node PC cluster at the Tokyo Institute of Technology
- Serving as a production system for over a year
- SCore v5.4 operated in multi-user mode
- The scored daemons are connected in a ring with the sc_watch process
- Each process in the ring sends patrol messages to the next daemon
- If sc_watch receives no patrol messages for 10 minutes, it kills and restarts all the daemons
20. Applying the Diagnosis System to SCore
- Injects the tracing agents into all scoreds
- Instruments sc_watch to save the in-memory traces when the daemons are being killed
- Identifies the anomalous trace
- Identifies the anomalous function/call path
21. Finding the Host
(Figure: suspect score per host; n129 stands out.)
- Host n129 is unusual: different from the others
- Host n129 is anomalous: not present in previous known-normal runs
- Host n129 is a new anomaly: not present in previous known-faulty runs
22. Finding the Cause
(Figure: the anomalous call chain output_job_status → score_write_short → score_write → __libc_write.)
- Call chain with the highest contribution to the suspect score: output_job_status → score_write_short → score_write → __libc_write
- It tries to output a log message to the scbcast process
- The writes to the scbcast process kept blocking for 10 minutes
- scbcast had stopped reading data from its socket: bug!
- scored did not handle it well (spun in an infinite loop): bug!
23. Related Work
- Magpie [Barham et al., 04]
- Diagnoses distributed systems that serve HTTP requests
- Components include web front-ends, backend databases, etc.
- For errors in incoming HTTP requests, locates where the errors happened by examining the request log on each component
- Applies cluster analysis to the request logs
- PinPoint [Chen et al., 02]
- Modified JBoss to record some of the events associated with a request
- Events such as send/recv and exception handling
- Finds correlations between request failures and the recorded events through cluster analysis
Both techniques are well suited to collections of small request paths, but not to daemon-type processes like SCore.
24. Conclusion
- Summary
- Proposed systematic diagnosis methods for distributed systems
- Demonstrated the efficacy of the methods on a real production system
- Next Steps
- Application study with a wider range of systems
- Richer execution data for smarter analysis
25. Discussion
26. References
- Omitted; refer to the paper
27. Related Work
- Similar approaches to similar problems
- Different approaches to similar problems
- Similar approaches to different problems
28. Overview of the Diagnosis Steps
(Flow diagram: Step 1, tracing with spTracer, produces the trace set. Step 2 finds the anomalous trace: in the fail-stop case via the earliest last timestamp, in the non-fail-stop case via supervised trace ranking based on the suspect score if reference traces are available, or unsupervised trace ranking otherwise. Step 3 finds the anomalous trace function via the last trace entry (fail-stop) or the max component of the delta vector (non-fail-stop), supported by trace visualization with Jumpshot, and reports to the analyst.)
29. Overview of the Diagnosis Steps
(Diagram: a set of processes, each producing execution data as logs; one process crashes.)