Stanford ROC Updates - PowerPoint PPT Presentation

Title: Stanford ROC Updates

Description: ROC Retreat, June 16-18, 2004. Emre Kiciman.
Recovery-Oriented Computing.

Number of Views:43
Avg rating:3.0/5.0
Slides: 55
Provided by: rocCsBe
Category:
Tags: roc | stanford | updates

less

Transcript and Presenter's Notes

Title: Stanford ROC Updates


1
Stanford ROC Updates
  • Armando Fox

2
Progress
  • Graduations
  • Ben Ling (SSM, cheap-recovery session state
    manager)
  • Jamie Cutler (refactoring satellite groundstation
    software architecture to apply ROC techniques)
  • Andy Huang: DStore, a persistent cluster-based
    hash table (CHT)
  • Consistency model concretized
  • Cheap recovery exploited for fast recovery
    triggered by statistical monitoring
  • Cheap recovery exploited for online repartitioning

3
More progress
  • George Candea: Microreboots at the EJB level in
    J2EE apps
  • Shown to recover from variety of injected faults
  • J2EE app session state factored out into SSM,
    making the J2EE app crash-only
  • Demo during poster session
  • Emre Kiciman: Pinpoint, further exploration of
    anomaly-based failure detection (more in a minute)

4
Fast Recovery meets Anomaly Detection
  • Use anomaly detection techniques to infer
    (possible) failures
  • Act on alarms using low-overhead micro-recovery
    mechanisms
  • Microreboots in EJB apps
  • Node- or process-level reboot in DStore or SSM
  • Occasional false positives OK since recovery is
    so cheap
  • These ideas will be developed at Panel tonight,
    and form topics for Breakouts tomorrow

5
Updates on PinPoint
  • Emre Kiciman and Armando Fox
  • {emrek, fox}@cs.stanford.edu

6
What Is This Talk About?
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

7
Pinpoint Overview
  • Goal: App-generic, high-level failure detection
  • For app-level faults, detection is a significant
    fraction of MTTR (75%!)
  • Existing monitors hard to build/maintain or miss
    high-level faults
  • Approach: Monitor, aggregate, and analyze
    low-level behaviors that correspond to high-level
    semantics
  • Component interactions
  • Structure of runtime paths
  • Analysis of per-node statistics (req/sec, mem
    usage, ...), without a priori thresholds
  • Assumption: Anomalies are likely to be faults
  • Look for anomalies over time, or across peers in
    the cluster.

8
Recap: 3 Steps to Pinpoint
  • Observe low-level behaviors that reflect
    app-level behavior
  • Likely to change iff application behavior changes
  • App-transparent instrumentation!
  • Model normal behavior and look for anomalies
  • Assume most of system working most of the time
  • Look for anomalies over time and across peers
  • No a priori app-specific info!
  • Correlate anomalous behavior to likely causes
  • Assume observed connection between anomaly and
    cause
  • Finally, notify admin or reboot component

9
An Internet Service...
10
A Failure...
  • Failures behave differently than normal
  • Look for anomalies in patterns of internal
    behavior

11
Patterns: Path-shapes
12
Patterns: Component Interactions
13
Outline
  • Overview of recent Pinpoint experiments
  • Observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

14
Compared to other anomaly-detection...
  • Labeled and Unlabeled training sets
  • If we know the end user saw a failure, Pinpoint
    can help with localization
  • But often we're trying to catch failures that
    end-user-level detectors would miss
  • Ground truth for the latter is HTML page
    checksums and database table snapshots
  • Current analyses are done offline
  • Eventual goal is to move to online, with new
    models being trained and rotated in periodically
  • Alarms must be actionable
  • Microreboots (tomorrow) allow acting on alarms
    even when there are false positives

15
Fault and Error Injection Behavior
  • Injected 4 types of faults and errors
  • Declared and runtime exceptions
  • Method call omissions
  • Source code bug injections (details on next page)
  • Results ranged in severity (% of requests
    affected)
  • 60% of faults caused cascades, affecting
    secondary requests
  • We fared most poorly on the minor bugs

Fault type      Num   Severe (>90%)   Major (>1%)   Minor (<1%)
Declared ex      41        20%            56%           24%
Runtime ex       41        17%            59%           24%
Call omission    41         5%            73%           22%
Src code bug     47        13%            76%           11%
16
Experience w/Bug Injection
  • Wrote a Java code modifier to inject bugs
  • Injects 6 kinds of bugs into code in Petstore 1.3
  • Limited to bugs that would not be caught by the
    compiler, and are easy to inject -> no major
    structural bugs
  • Double-check fault existence by checksumming HTML
    output
  • Not trivial to inject bugs that turn into
    failures!
  • 1st try: inject 5-10 bugs into random spots in
    each component.
  • Ran 100 experiments, only 4 caused any changes!
  • 2nd try: exhaustive enumeration of potential bug
    spots
  • Found total of 41 active spots out of 1000s.
  • Rest is straight-line code w/no trivial bug
    spots, or dead code.

17
Source Code Bugs (Detail)
  • Loop Errors: Inverts loop conditions; injected 15
  • while(b) stmt -> while(!b) stmt
  • Misassignment: Replaces the LHS of an assignment;
    injected 1
  • i = f(a) -> j = f(a)
  • Misinitialization: Clears a variable
    initialization; injected 2
  • int i = 20 -> int i = 0
  • Misreference: Replaces a variable reference;
    injected 6
  • avail = onStock - Ordered -> avail = onStock - onOrder
  • Off-by-one: Replaces a comparison operator;
    injected 17
  • if(a > b) ... -> if(a >= b) ...
  • Synchronization: Removes synchronization code;
    injected 0
  • synchronized stmt -> stmt
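As an illustration of the kind of source-to-source rewriting described
above, here is a toy sketch of the loop-inversion mutation applied to raw
source text. This is not the actual injector: the real tool operates on the
Petstore 1.3 Java code and presumably on parsed structure, while this
regex-based version only handles simple loop conditions; the class and
method names are made up for the example.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Toy sketch of one bug-injection transform: invert every while-loop condition.
  // A real injector would work on parsed code; this regex only handles simple
  // conditions with no nested parentheses.
  public class LoopInverter {
      private static final Pattern WHILE_COND = Pattern.compile("while\\s*\\(([^)]*)\\)");

      public static String invertLoopConditions(String source) {
          Matcher m = WHILE_COND.matcher(source);
          StringBuffer out = new StringBuffer();
          while (m.find()) {
              // while (cond)  ->  while (!(cond))
              m.appendReplacement(out, "while (!(" + Matcher.quoteReplacement(m.group(1)) + "))");
          }
          m.appendTail(out);
          return out.toString();
      }

      public static void main(String[] args) {
          System.out.println(invertLoopConditions("while (i < n) { i++; }"));
          // prints: while (!(i < n)) { i++; }
      }
  }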

18
Outline
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

19
Metrics Recall and Precision
  • Recall = C/T: how much of the target set was
    identified (worked example below)
  • Precision = C/R: how much of the returned results
    were correct
  • Also, precision = 1 - (false positive rate)
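A minimal worked example of these two definitions, with hypothetical
numbers (not taken from the experiments): C = correctly returned items,
T = total items in the target set, R = total items returned.

  // Hypothetical run: 50 requests were actually faulty (T), the detector
  // returned 40 requests (R), and 30 of those were truly faulty (C).
  public class RecallPrecision {
      public static void main(String[] args) {
          double T = 50, R = 40, C = 30;
          double recall = C / T;     // 0.60 -> found 60% of the faulty requests
          double precision = C / R;  // 0.75 -> 75% of the returned requests were faulty
          System.out.printf("recall=%.2f precision=%.2f%n", recall, precision);
      }
  }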

20
Metrics Applying Recall and Precision
  • Detection
  • Do failures in the system cause detectable
    anomalies?
  • Recall = % of failures actually detected as
    anomalies
  • Precision = 1 - (false positive rate) = 1.0 in
    our expts
  • Identification (given a failure is detected)
  • recall = how many actually-faulty requests are
    returned
  • precision = what % of the returned requests are
    faulty = 1 - (false positive rate)
  • using HTML page checksums as ground truth
  • Workload: PetStore 1.1 and 1.3 (significantly
    different versions), plus RUBiS

21
Fault Detection Recall (All fault types)
  • Minor faults were hardest to detect
  • especially for Component Interaction

22
FD Recall (Severe Major Faults only)
  • Major faults are those that affect >1% of
    requests
  • For these faults, Pinpoint has significantly
    higher recall than other low-level detectors

23
Detecting Source Code Bugs
  • Source code bugs were hardest to detect
  • PS-analysis and CI-analysis individually detected
    7-12% of all faults, 37% of major faults
  • HTTP detected 10% of all faults
  • We did better than HTTP logs, but that's no
    excuse
  • Other faults: PP strictly better than HTTP and
    HTML detection
  • Src code bugs: detectors are complementary;
    together they detected 15%

24
Faulty Request Identification
  • HTTP monitoring has perfect precision since it's
    a ground-truth indicator of a server fault
  • Path-shape analysis pulls more points out of the
    bottom left corner

25
Faulty Request Identification
  • HTTP monitoring has perfect precision since it's
    a ground-truth indicator of a server fault
  • Path-shape analysis pulls more points out of the
    bottom left corner

26
Adjusting Precision
  • α = 1: recall = 68%, precision = 14%
  • α = 4: recall = 34%, precision = 93%
  • Low recall for faulty request identification
    still detects 83% of fault experiments

27
Outline
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

28
Outline
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

29
Status of Real-World Deployment
  • Deploying parts of Pinpoint at 2 large sites
  • Site 1
  • Instrumenting middleware to collect request paths
    for path-shape and component interaction analysis
  • Feasibility completed, instrumentation in
    progress...
  • Site 2
  • Applying peer-analysis techniques developed for
    SSM and D-Store
  • Metrics (e.g., req/sec, memory usage, ...)
    already being collected.
  • Beginning analysis and testing...

30
Summary
  • Fault injection experiments showed range of
    behavior
  • Cascading faults to other requests; range of
    severity
  • Pinpoint performed better than existing low-level
    monitors
  • Detected 90% of major component-level errors
    (exceptions, etc.)
  • Even in worst-case expts (src code bugs) PP
    provided a complementary improvement to existing
    low-level monitors
  • Currently, validating Pinpoint in two real-world
    services

31
Detail Slides
32
Limitations Independent Requests
  • PP assumes request-reply w/independent requests
  • Monitored RMI-based J2EE system (ECPerf 1.1)
  • ... is request-reply, but requests are not
    independent, nor is the unit of work (UoW) well
    defined
  • Assume UoW = 1 RMI call
  • Most RMI calls resulted in short paths (1 comp)
  • Injected faults do not change these short paths
  • When anomalies occurred, rarely in faulty path...
  • Solution? Redefine UoW as multiple RMI calls
  • -> paths capture more behavioral changes
  • -> redefined UoW is likely app-specific

33
Limitations Well-defined Peers
  • PP assumes component peer groups well-defined
  • But behavior can depend on context
  • Ex. Naming server in a cluster
  • Front-end servers mostly send lookup requests
  • Back-end servers mostly respond to lookups.
  • Result No component matches average behavior
  • Both front-end and back-end naming servers
    anomalous!
  • Solution? Extend component-IDs to include logical
    location...

34
Bonus Slides
35
Ex. Application-Level Failure
No itinerary is actually available on this page.
The ticket was bought in March for travel in April.
But the website (superficially) appears to be
working. Heartbeat, pings, and HTTP-GET tests are
not likely to detect the problem.
36
Application-level Failures
  • Application-level failures are common
  • >60% of sites have user-visible (incl. app-level)
    failures [BIG-SF]
  • Detection is a major portion of recovery time
  • TellMe: detecting app-level failures is 75% of
    recovery time [CAK04]
  • 65% of user-visible failures mitigable by earlier
    detection [OGP03]
  • Existing monitoring techniques aren't good enough
  • Low-level monitors: pings, heartbeats, HTTP error
    monitoring
  • + app-generic/low maintenance, - miss high-level
    failures
  • High-level, app-specific tests
  • - app-specific/hard to maintain, + can catch many
    app-level failures
  • - test coverage problem

37
Testbed and Faultload
  • Instrumented JBoss/J2EE middleware
  • J2EE state mgt, naming, etc. -> good layer of
    indirection
  • JBoss: open-source, millions of downloads, real
    deployments
  • Track EJBs, JSPs, HTTP, RMI, JDBC, JNDI
  • w/synchronous reporting: 2-40ms latency hit, 17%
    throughput decrease
  • Testbed applications
  • Petstore 1.3, Petstore 1.1, RUBiS, ECPerf
  • Test strategy: inject faults, measure detection
    rate
  • Declared and undeclared exceptions
  • Omitted calls the app is not likely to handle at all
  • Source code bugs (e.g., off-by-one errors, etc)

38
PCFGs Model Normal Path Shapes
  • Probabilistic Context Free Grammar (PCFG)
  • Represents likely calls made by each component
  • Learn probabilities of rules based on observed
    paths
  • Anomalous path shapes
  • Score a path by summing the deviations of
    P(observed calls) from average.
  • Detected 90% of faults in our experiments

39
Use PCFG to Score Paths
  • Measure the difference between an observed path
    and the average
  • Score(path) = Σ_i (1/n_i - P(r_i)), summed over the
    rules r_i in the path, where P(r_i) is the learned
    probability of rule r_i and 1/n_i is the average
    probability over its n_i alternatives (sketch below)
  • Higher scores are anomalous
  • Detected 90% of faults in our experiments
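A minimal sketch of this scoring, under the assumption that the learned
PCFG can be represented as a map from observed call transitions (rules)
to their probability and the number of alternative rules sharing the
same parent; the data structure, component names, and numbers are
illustrative, not Pinpoint's actual implementation.

  import java.util.List;
  import java.util.Map;

  // Illustrative PCFG-based path scoring: each observed rule r_i contributes its
  // deviation (1/n_i - P(r_i)) from the "average" probability, where n_i is the
  // number of alternative rules with the same parent component.
  public class PathScorer {
      record Rule(double probability, int alternatives) {}

      static double score(List<String> pathRules, Map<String, Rule> pcfg) {
          double score = 0.0;
          for (String r : pathRules) {
              Rule rule = pcfg.get(r);
              if (rule == null) {       // never-seen transition: maximally surprising
                  score += 1.0;
                  continue;
              }
              score += (1.0 / rule.alternatives()) - rule.probability();
          }
          return score;                 // higher = more anomalous
      }

      public static void main(String[] args) {
          Map<String, Rule> pcfg = Map.of(
              "JSP->CartEJB",   new Rule(0.7, 3),
              "CartEJB->JDBC",  new Rule(0.9, 2),
              "JSP->SignOnEJB", new Rule(0.05, 3));
          System.out.println(score(List.of("JSP->CartEJB", "CartEJB->JDBC"), pcfg));
          System.out.println(score(List.of("JSP->SignOnEJB", "SignOnEJB->LDAP"), pcfg));
      }
  }

The second path scores much higher because it uses a rare transition plus
one the grammar has never seen, while the first path follows common rules.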

40
Separating Good from Bad Paths
  • Use a dynamic threshold to detect anomalies
  • Alarm when unexpectedly many paths score above the
    Nth percentile of scores seen during normal
    operation (sketch below)
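One possible reading of this, sketched below: learn the Nth-percentile
path score from a known-good training window, then alarm when the fraction
of recent paths scoring above it is much larger than the expected
(100-N)% tail. The tolerance factor and the exact percentile computation
are illustrative assumptions, not values from the slides.

  import java.util.Arrays;

  // Illustrative dynamic threshold: learn the Nth-percentile score from normal
  // traffic, then alarm when far more than (100-N)% of recent paths exceed it.
  public class DynamicThreshold {
      static double percentile(double[] normalScores, double n) {
          double[] sorted = normalScores.clone();
          Arrays.sort(sorted);
          int idx = (int) Math.ceil(n / 100.0 * sorted.length) - 1;
          return sorted[Math.max(idx, 0)];
      }

      static boolean alarm(double[] recentScores, double threshold, double n, double tolerance) {
          long above = Arrays.stream(recentScores).filter(s -> s > threshold).count();
          double observedFraction = (double) above / recentScores.length;
          double expectedFraction = (100.0 - n) / 100.0;
          return observedFraction > tolerance * expectedFraction;
      }

      public static void main(String[] args) {
          double[] normal = {0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.05, 0.4};
          double threshold = percentile(normal, 95);
          double[] recent = {0.6, 0.7, 0.2, 0.65, 0.1};   // unusually many high-scoring paths
          System.out.println(alarm(recent, threshold, 95, 3.0));  // true -> raise an alarm
      }
  }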

41
Anomalies in Component Interaction
  • Weighted links model component interaction

42
Scoring CI Models
  • Score with a χ² (chi-square) goodness-of-fit test
    (sketch below)
  • Probability that the same process generated both
    the normal and the observed interaction patterns
  • Makes no assumptions about the shape of the
    distribution
(Figure: weighted-links diagram of the normal interaction pattern,
with link weights w0=.4, w1=.3, w2=.2, w3=.1 and observed counts
n0=30, n1=10, n2=40, n3=20)
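A minimal sketch of this χ² goodness-of-fit check using the figure's
numbers, assuming the normal model supplies the expected link weights
(w0..w3) and the current window supplies observed counts (n0..n3). The
0.05 critical value for 3 degrees of freedom is used as an illustrative
cutoff.

  // Chi-square goodness-of-fit of observed component-interaction counts against
  // the learned ("normal") link weights from the figure above.
  public class ChiSquareCheck {
      static double chiSquare(double[] expectedWeights, long[] observedCounts) {
          long total = 0;
          for (long c : observedCounts) total += c;
          double stat = 0.0;
          for (int i = 0; i < observedCounts.length; i++) {
              double expected = expectedWeights[i] * total;
              double diff = observedCounts[i] - expected;
              stat += diff * diff / expected;
          }
          return stat;
      }

      public static void main(String[] args) {
          double[] w = {0.4, 0.3, 0.2, 0.1};   // normal link weights w0..w3
          long[] n = {30, 10, 40, 20};         // observed counts n0..n3
          double stat = chiSquare(w, n);
          // Critical value for df = 4 - 1 = 3 at alpha = 0.05 is about 7.815.
          System.out.printf("chi2 = %.2f, anomalous = %b%n", stat, stat > 7.815);
      }
  }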
43
Two Kinds of False Positives
  • Algorithmic false positives
  • No anomaly exists
  • But statistical technique made a mistake...
  • Semantic false positives
  • Correctly found an anomaly
  • But anomaly is not a failure

44
Resilient Against Semantic FP
  • Test against normal changes
  • 1. Vary workload from browse + purchase to
    browse only
  • 2. Minor upgrade from Petstore 1.3.1 to 1.3.2
  • Path-shape analysis found NO differences
  • Component interaction changes below threshold
  • For predictable, major changes
  • Consider lowering Pinpoint sensitivity until
    retraining complete
  • -> Window of vulnerability, but better than
    false positives
  • Q: Rate of normal changes? How quickly can we
    retrain?
  • Minor changes every day, but only to parts of
    site.
  • Training speed -> how quickly is the service
    exercised?

45
Related Work
  • Detection and Localization
  • Richardson: performance failure detection
  • Infospect: search for logical inconsistencies in
    observed configuration
  • Event/alarm correlation systems: use dependency
    models to quiesce/collapse correlated alarms
  • Request Tracing
  • Magpie: tracing for performance
    modeling/characterization
  • Mogul: discovering majority behavior in black-box
    distributed systems
  • Compilers & PL
  • DIDUCE: hypothesize invariants, report when
    they're broken
  • Bug Isolation Project: correlate crashes w/state,
    across real runs
  • Engler: analyze static code for patterns and
    anomalies -> bugs

46
Conclusions
  • Monitoring path shapes and component
    interactions...
  • ... are easy to instrument, app-generic
  • ... are likely to change when the application fails
  • Model the normal pattern of behavior, look for
    anomalies
  • Key assumption: most of the system is working most
    of the time
  • Anomaly detection detects high-level failures,
    and is deployable
  • Resilient to (at least some) normal changes to
    system
  • Current status
  • Deploying in real, large Internet service.
  • Anomaly detection techniques for structure-less
    systems

47
More Information
  • http://www.stanford.edu/emrek/
  • Detecting Application-Level Failures in
    Component-Based Internet Services.
  • Emre Kiciman, Armando Fox. In submission
  • Session State Beyond Soft State.
  • Benjamin Ling, Emre Kiciman, Armando Fox. NSDI'04
  • Path-based Failure and Evolution Management
  • Chen, Accardi, Kiciman, Lloyd, Patterson, Fox,
    Brewer. NSDI'04

48
Localize Failures with Decision Tree
  • Search for features that occur with bad items,
    but not with good ones
  • Decision trees
  • Classification function
  • Each branch in the tree tests a feature
  • Leaves of the tree give a classification
  • Learn a decision tree to classify good/bad examples
  • But we won't use it for classification
  • Just look at the learned classifier and extract
    its questions as features (sketch below)
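A heavily simplified sketch of this idea. Each request is labeled good/bad
and described by a set of binary features (e.g., "request used component X");
instead of growing a full decision tree, the sketch only finds the single
most discriminative feature by information gain, which is the kind of
question the learned tree would expose as a likely cause. The feature
names and data are illustrative.

  import java.util.*;

  // Simplified illustration of decision-tree-style localization: find the binary
  // feature (e.g., "request used component X") whose split best separates good
  // from bad requests, measured by information gain at the root.
  public class BestSplit {
      static double entropy(long bad, long total) {
          if (total == 0) return 0.0;
          double p = (double) bad / total;
          if (p == 0.0 || p == 1.0) return 0.0;
          return -p * Math.log(p) / Math.log(2) - (1 - p) * Math.log(1 - p) / Math.log(2);
      }

      static String mostSuspectFeature(List<Set<String>> requests, List<Boolean> isBad) {
          Set<String> features = new HashSet<>();
          requests.forEach(features::addAll);
          long totalBad = isBad.stream().filter(b -> b).count();
          double baseEntropy = entropy(totalBad, requests.size());
          String best = null;
          double bestGain = -1.0;
          for (String f : features) {
              long withF = 0, badWithF = 0, badWithoutF = 0;
              for (int i = 0; i < requests.size(); i++) {
                  boolean has = requests.get(i).contains(f);
                  if (has) withF++;
                  if (isBad.get(i)) { if (has) badWithF++; else badWithoutF++; }
              }
              long withoutF = requests.size() - withF;
              double gain = baseEntropy
                      - ((double) withF / requests.size()) * entropy(badWithF, withF)
                      - ((double) withoutF / requests.size()) * entropy(badWithoutF, withoutF);
              if (gain > bestGain) { bestGain = gain; best = f; }
          }
          return best;
      }

      public static void main(String[] args) {
          List<Set<String>> requests = List.of(
                  Set.of("CartEJB", "JDBC"), Set.of("CartEJB", "JDBC"),
                  Set.of("SignOnEJB", "JDBC"), Set.of("CatalogEJB", "JDBC"));
          List<Boolean> isBad = List.of(true, true, false, false);
          // Prints CartEJB: the feature most correlated with the bad requests.
          System.out.println(mostSuspectFeature(requests, isBad));
      }
  }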

49
Illustrative Decision Tree
50
Results Comparing Localization Rate
51
Monitoring Structure-less Systems
  • N replicated storage bricks handle read/write
    requests
  • No complicated interactions or requests
  • -> Cannot do structural anomaly detection!
  • Alternative features (performance, mem usage,
    etc.)
  • Activity statistics: How often did a brick do
    something?
  • Msgs received/sec, dropped/sec, etc.
  • Same across all peers, assuming balanced workload
  • Use anomalies as likely failures (sketch below)
  • State statistics: What is the current state of the
    system?
  • Memory usage, queue length, etc.
  • Similar pattern across peers, but may not be in
    phase
  • Look for patterns in the time-series; differences
    in patterns indicate a failure at a node.
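A minimal sketch of the peer-comparison idea for activity statistics,
assuming each brick reports a current rate (e.g., messages received/sec)
and that a brick far from the peer median, measured in
median-absolute-deviation units, is flagged as anomalous. The deviation
measure and threshold are illustrative choices, not the deck's algorithm.

  import java.util.Arrays;

  // Illustrative peer comparison for activity statistics: flag bricks whose rate
  // deviates far from the peer median, measured in median-absolute-deviation units.
  public class PeerAnomaly {
      static double median(double[] xs) {
          double[] s = xs.clone();
          Arrays.sort(s);
          int n = s.length;
          return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
      }

      static boolean[] anomalousBricks(double[] ratesPerBrick, double threshold) {
          double med = median(ratesPerBrick);
          double[] absDev = new double[ratesPerBrick.length];
          for (int i = 0; i < ratesPerBrick.length; i++) absDev[i] = Math.abs(ratesPerBrick[i] - med);
          double mad = median(absDev);
          boolean[] anomalous = new boolean[ratesPerBrick.length];
          for (int i = 0; i < ratesPerBrick.length; i++) {
              anomalous[i] = absDev[i] > threshold * Math.max(mad, 1e-9);  // guard tiny MADs
          }
          return anomalous;
      }

      public static void main(String[] args) {
          double[] msgsPerSec = {1020, 990, 1005, 1010, 310};  // brick 4 is lagging
          System.out.println(Arrays.toString(anomalousBricks(msgsPerSec, 5.0)));
          // [false, false, false, false, true]
      }
  }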

52
Surprising Patterns in Time-Series
  • 1. Discretize the time-series into a string [Keogh]
  • 0.2, 0.3, 0.4, 0.6, 0.8, 0.2 -> "aaabba"
  • 2. Calculate the frequencies of short substrings
    in the string
  • "aa" occurs twice; "ab", "bb", "ba" each occur once
  • 3. Compare frequencies to normal; look for
    substrings that occur much less or much more often
    than normal (sketch below)
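A minimal sketch of the discretize-and-count step, assuming a simple
two-symbol discretization at a fixed cutoff of 0.5; the cited technique
[Keogh] uses a more careful SAX-style discretization, and the cutoff,
substring length, and "surprising" ratio here are illustrative.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative sketch: discretize a metric time-series into symbols, count
  // 2-character substrings, and compare their frequencies against a normal model.
  public class SurprisingPatterns {
      static String discretize(double[] series, double cutoff) {
          StringBuilder sb = new StringBuilder();
          for (double v : series) sb.append(v < cutoff ? 'a' : 'b');
          return sb.toString();
      }

      static Map<String, Integer> substringCounts(String s, int len) {
          Map<String, Integer> counts = new HashMap<>();
          for (int i = 0; i + len <= s.length(); i++) {
              counts.merge(s.substring(i, i + len), 1, Integer::sum);
          }
          return counts;
      }

      public static void main(String[] args) {
          double[] normal = {0.2, 0.3, 0.4, 0.6, 0.8, 0.2};   // -> "aaabba"
          double[] current = {0.2, 0.9, 0.2, 0.9, 0.2, 0.9};  // oscillating -> "ababab"
          Map<String, Integer> normalCounts = substringCounts(discretize(normal, 0.5), 2);
          Map<String, Integer> currentCounts = substringCounts(discretize(current, 0.5), 2);
          currentCounts.forEach((sub, c) -> {
              int expected = normalCounts.getOrDefault(sub, 0);
              // Flag substrings occurring much more often than normal
              // (the "much less often" direction is omitted for brevity).
              if (c > 2 * expected) System.out.println("surprising: " + sub + " x" + c);
          });
      }
  }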

53
Inject Failures into Storage System
  • Inject a performance failure every 60s in one brick
  • Slow all requests by 1ms
  • Pinpoint detects failures in 1-2 periods
  • Does not detect anomalies during normal behavior
    (including workload changes and GC)
  • Current issues: too many magic numbers
  • Working on improving these techniques to remove
    or automatically choose magic numbers

54
Responding to Anomalies
  • Want a policy for responding to anomalies (sketch
    below)
  • Cross-check for failure
  • 1. If no cause is correlated with the anomaly ->
    not a failure
  • 2. Check user behavior for excessive reloads
  • 3. Persistent anomaly? Check for recent state
    changes
  • Recovery actions
  • 1. Reboot the component or app
  • 2. Roll back the failed request, try again
  • 3. Roll back the software to the last known good
    state
  • 4. Notify the administrator
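A minimal sketch of such a policy as straight-line code, assuming the
cross-checks and recovery actions are available from the monitoring and
recovery infrastructure; the interface, method names, and escalation
order are illustrative, not a specification from the slides.

  // Illustrative escalation policy for responding to an anomaly alarm.
  // All predicates and actions are assumed to be provided by the monitoring
  // and recovery infrastructure; the names here are hypothetical.
  public class AnomalyPolicy {
      interface Target {
          boolean hasCorrelatedCause();      // cross-check 1
          boolean excessiveUserReloads();    // cross-check 2
          boolean recentStateChange();       // cross-check 3
          boolean rebootComponent();         // recovery 1 (true if anomaly cleared)
          boolean retryFailedRequest();      // recovery 2
          boolean rollbackSoftware();        // recovery 3
          void notifyAdministrator();        // recovery 4
      }

      static void respond(Target t) {
          // Cross-check for failure before acting.
          if (!t.hasCorrelatedCause()) {
              return;  // no correlated cause -> treat as "not a failure"
          }
          boolean suspicious = t.excessiveUserReloads()   // users hammering reload
                            || t.recentStateChange();     // persistent anomaly after a state change
          if (!suspicious) {
              return;  // keep watching; don't recover yet
          }
          // Escalating recovery actions: cheapest first, administrator as last resort.
          if (t.rebootComponent()) return;
          if (t.retryFailedRequest()) return;
          if (t.rollbackSoftware()) return;
          t.notifyAdministrator();
      }
  }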