Title: Stanford ROC Updates
1 Stanford ROC Updates
2 Progress
- Graduations
  - Ben Ling (SSM, a cheap-recovery session state manager)
  - Jamie Cutler (refactoring satellite ground-station software architecture to apply ROC techniques)
- Andy Huang: DStore, a persistent cluster-based hash table (CHT)
  - Consistency model concretized
  - Cheap recovery exploited for fast recovery, triggered by statistical monitoring
  - Cheap recovery exploited for online repartitioning
3 More progress
- George Candea: microreboots at the EJB level in J2EE apps
  - Shown to recover from a variety of injected faults
  - J2EE app session state factored out into SSM, making the J2EE app crash-only
  - Demo during poster session
- Emre Kiciman: Pinpoint; further exploration of anomaly-based failure detection (in a minute)
4 Fast Recovery meets Anomaly Detection
- Use anomaly-detection techniques to infer (possible) failures
- Act on alarms using low-overhead micro-recovery mechanisms
  - Microreboots in EJB apps
  - Node- or process-level reboot in DStore or SSM
- Occasional false positives are OK since recovery is so cheap
- These ideas will be developed at the Panel tonight, and form topics for Breakouts tomorrow
5 Updates on Pinpoint
- Emre Kiciman and Armando Fox
- {emrek, fox}@cs.stanford.edu
6 What Is This Talk About?
- Overview of recent Pinpoint experiments
- Including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites
7 Pinpoint Overview
- Goal: app-generic, high-level failure detection
  - For app-level faults, detection is a significant fraction of MTTR (75%!)
  - Existing monitors are hard to build/maintain, or miss high-level faults
- Approach: monitor, aggregate, and analyze low-level behaviors that correspond to high-level semantics
  - Component interactions
  - Structure of runtime paths
  - Analysis of per-node statistics (req/sec, mem usage, ...), without a priori thresholds
- Assumption: anomalies are likely to be faults
  - Look for anomalies over time, or across peers in the cluster
8 Recap: 3 Steps to Pinpoint
- Observe low-level behaviors that reflect app-level behavior
  - Likely to change iff application behavior changes
  - App-transparent instrumentation!
- Model normal behavior and look for anomalies
  - Assume most of the system is working most of the time
  - Look for anomalies over time and across peers
  - No a priori app-specific info!
- Correlate anomalous behavior to likely causes
  - Assume an observed connection between anomaly and cause
- Finally, notify the admin or reboot the component (a skeleton of these steps is sketched below)
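To make the three steps concrete, a minimal skeleton of how they might be wired together is sketched below. The interface and class names are illustrative placeholders, not Pinpoint's actual API.

// Illustrative sketch of the three-step structure above; names are hypothetical.
import java.util.List;

interface Observer {                    // Step 1: app-transparent observation
    List<String> observeRequestPath();  // e.g., the sequence of components a request touched
}

interface BehaviorModel {               // Step 2: model normal behavior, flag anomalies
    void train(List<List<String>> normalPaths);
    double anomalyScore(List<String> path);
}

interface Correlator {                  // Step 3: correlate anomalies with likely causes
    String likelyFaultyComponent(List<List<String>> anomalousPaths);
}

public class MonitoringLoop {
    public static void handle(Observer obs, BehaviorModel model, Correlator corr,
                              double threshold) {
        List<String> path = obs.observeRequestPath();
        if (model.anomalyScore(path) > threshold) {
            String suspect = corr.likelyFaultyComponent(List.of(path));
            // Final step: notify the admin or trigger a microreboot of the suspect.
            System.out.println("Anomaly detected; suspect component: " + suspect);
        }
    }
}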
9 An Internet Service...
10 A Failure...
- Failures behave differently than normal
- Look for anomalies in patterns of internal behavior
11 Patterns: Path Shapes
12 Patterns: Component Interactions
13 Outline
- Overview of recent Pinpoint experiments
- Observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites
14 Compared to other anomaly-detection...
- Labeled and unlabeled training sets
  - If we know the end user saw a failure, Pinpoint can help with localization
  - But often we're trying to catch failures that end-user-level detectors would miss
  - Ground truth for the latter is HTML page checksums and database table snapshots
- Current analyses are done offline
  - Eventual goal is to move online, with new models being trained and rotated in periodically
- Alarms must be actionable
  - Microreboots (tomorrow) allow acting on alarms even when there are false positives
15 Fault and Error Injection Behavior
- Injected 4 types of faults and errors
  - Declared and runtime exceptions
  - Method call omissions
  - Source code bug injections (details on next page)
- Results ranged in severity (% of requests affected)
- 60% of faults caused cascades, affecting secondary requests
- We fared most poorly on the minor bugs

  Fault type      Num   Severe (>90%)   Major (>1%)   Minor (<1%)
  Declared ex      41       20%             56%           24%
  Runtime ex       41       17%             59%           24%
  Call omission    41        5%             73%           22%
  Src code bug     47       13%             76%           11%
16 Experience w/Bug Injection
- Wrote a Java code modifier to inject bugs (a rough sketch of the idea follows this slide)
  - Injects 6 kinds of bugs into code in Petstore 1.3
  - Limited to bugs that would not be caught by the compiler and are easy to inject → no major structural bugs
  - Double-check fault existence by checksumming HTML output
- Not trivial to inject bugs that turn into failures!
  - 1st try: inject 5-10 bugs into random spots in each component
    - Ran 100 experiments; only 4 caused any changes!
  - 2nd try: exhaustive enumeration of potential bug spots
    - Found a total of 41 active spots out of 1000s
    - The rest is straight-line code with no trivial bug spots, or dead code
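The slides don't show the injector's code; purely as an illustration of the idea, a naive text-level transform for one of the bug types (loop-condition inversion) might look like the sketch below. The class name and example string are hypothetical, and the real tool operates on Petstore source, not toy snippets.

// Hypothetical sketch of one bug-injection transform (loop-condition inversion).
// The real modifier injected 6 bug types into Petstore 1.3; this is only illustrative.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LoopInverter {
    // Matches "while (<cond>)" and wraps the condition in a negation.
    private static final Pattern WHILE = Pattern.compile("while\\s*\\(([^)]+)\\)");

    public static String invertFirstWhile(String source) {
        Matcher m = WHILE.matcher(source);
        if (!m.find()) {
            return source;  // no injectable spot in this code
        }
        String inverted = "while (!(" + m.group(1) + "))";
        return new StringBuilder(source)
                .replace(m.start(), m.end(), inverted)
                .toString();
    }

    public static void main(String[] args) {
        String code = "while (itemsRemaining > 0) { process(); }";
        System.out.println(invertFirstWhile(code));
        // -> while (!(itemsRemaining > 0)) { process(); }
    }
}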
17 Source Code Bugs (Detail)
- Loop errors: inverts loop conditions; injected 15
  - while(b) stmt → while(!b) stmt
- Misassignment: replaces the LHS of an assignment; injected 1
  - i = f(a) → j = f(a)
- Misinitialization: clears a variable initialization; injected 2
  - int i = 20 → int i = 0
- Misreference: replaces a variable reference; injected 6
  - avail = onStock - Ordered → avail = onStock - onOrder
- Off-by-one: replaces a comparison operator; injected 17
  - if(a > b) ... → if(a >= b) ...
- Synchronization: removes synchronization code; injected 0
  - synchronized stmt → stmt
18 Outline
- Overview of recent Pinpoint experiments
- Including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites
19 Metrics: Recall and Precision
- Recall = C/T: how much of the target was identified
- Precision = C/R: how much of the results were correct
- Also, precision = 1 - false positive rate (a worked example follows below)
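A worked example of these definitions, with made-up numbers for T, R, and C as defined above:

// Worked example of the recall/precision definitions above (made-up numbers).
public class RecallPrecision {
    public static void main(String[] args) {
        int T = 50;   // total actually-faulty requests (the target set)
        int R = 40;   // requests the detector returned/flagged
        int C = 34;   // flagged requests that really were faulty (correct results)

        double recall = (double) C / T;             // 0.68: how much of the target we found
        double precision = (double) C / R;          // 0.85: how much of our output was right
        double falsePositiveRate = 1.0 - precision; // 0.15, per the slide's definition

        System.out.printf("recall=%.2f precision=%.2f fp-rate=%.2f%n",
                          recall, precision, falsePositiveRate);
    }
}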
20 Metrics: Applying Recall and Precision
- Detection
  - Do failures in the system cause detectable anomalies?
  - Recall: % of failures actually detected as anomalies
  - Precision: 1 - (false positive rate); 1.0 in our experiments
- Identification (given a failure is detected)
  - Recall: how many actually-faulty requests are returned
  - Precision: what % of the returned requests are faulty; = 1 - (false positive rate)
  - Using HTML page checksums as ground truth
- Workload: PetStore 1.1 and 1.3 (significantly different versions), plus RUBiS
21 Fault Detection Recall (All fault types)
- Minor faults were hardest to detect
  - Especially for component-interaction analysis
22 FD Recall (Severe & Major Faults only)
- Major faults are those that affect > 1% of requests
- For these faults, Pinpoint has significantly higher recall than other low-level detectors
23 Detecting Source Code Bugs
- Source code bugs were the hardest to detect
  - PS-analysis and CI-analysis individually detected 7-12% of all faults, 37% of major faults
  - HTTP detected 10% of all faults
- We did better than HTTP logs, but that's no excuse
  - For other faults, Pinpoint is strictly better than HTTP and HTML detection
  - For source code bugs, the detectors are complementary: together, all detected 15%
24 Faulty Request Identification
- HTTP monitoring has perfect precision, since it's a ground-truth indicator of a server fault
- Path-shape analysis pulls more points out of the bottom-left corner
25 Faulty Request Identification
- HTTP monitoring has perfect precision, since it's a ground-truth indicator of a server fault
- Path-shape analysis pulls more points out of the bottom-left corner
26 Adjusting Precision
- α = 1: recall 68%, precision 14%
- α = 4: recall 34%, precision 93%
- Low recall for faulty-request identification still detects 83% of fault experiments
27 Outline
- Overview of recent Pinpoint experiments
- Including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites
28 Outline
- Overview of recent Pinpoint experiments
- Including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites
29 Status of Real-World Deployment
- Deploying parts of Pinpoint at 2 large sites
- Site 1
  - Instrumenting middleware to collect request paths for path-shape and component-interaction analysis
  - Feasibility study completed; instrumentation in progress...
- Site 2
  - Applying peer-analysis techniques developed for SSM and DStore
  - Metrics (e.g., req/sec, memory usage, ...) already being collected
  - Beginning analysis and testing...
30 Summary
- Fault injection experiments showed a range of behavior
  - Faults cascading to other requests; a range of severity
- Pinpoint performed better than existing low-level monitors
  - Detected 90% of major component-level errors (exceptions, etc.)
  - Even in the worst-case experiments (source code bugs), Pinpoint provided a complementary improvement over existing low-level monitors
- Currently validating Pinpoint in two real-world services
31 Detail Slides
32 Limitations: Independent Requests
- Pinpoint assumes request-reply with independent requests
- Monitored an RMI-based J2EE system (ECPerf 1.1)
  - ... it is request-reply, but requests are not independent, nor is the unit of work (UoW) well defined
  - Assume UoW = 1 RMI call
- Most RMI calls resulted in short paths (1 component)
  - Injected faults do not change these short paths
  - When anomalies occurred, they were rarely in the faulty path...
- Solution? Redefine UoW as multiple RMI calls
  - → paths capture more behavioral changes
  - → but the redefined UoW is likely app-specific
33 Limitations: Well-defined Peers
- Pinpoint assumes component peer groups are well-defined
- But behavior can depend on context
- Ex.: naming server in a cluster
  - Front-end servers mostly send lookup requests
  - Back-end servers mostly respond to lookups
  - Result: no component matches the average behavior
  - Both front-end and back-end naming servers look anomalous!
- Solution? Extend component IDs to include logical location...
34 Bonus Slides
35 Ex.: Application-Level Failure
No itinerary is actually available on this page.
Ticket was bought in March for travel in April.
But the website (superficially) appears to be working. Heartbeats, pings, and HTTP-GET tests are not likely to detect the problem.
36 Application-level Failures
- Application-level failures are common
  - >60% of sites have user-visible (incl. app-level) failures [BIG-SF]
- Detection is a major portion of recovery time
  - TellMe: detecting app-level failures is 75% of recovery time [CAK04]
  - 65% of user-visible failures mitigable by earlier detection [OGP03]
- Existing monitoring techniques aren't good enough
  - Low-level monitors (pings, heartbeats, HTTP error monitoring): app-generic and low-maintenance, but miss high-level failures
  - High-level, app-specific tests: can catch many app-level failures, but are app-specific/hard to maintain and have a test-coverage problem
37 Testbed and Faultload
- Instrumented JBoss/J2EE middleware
  - J2EE state mgmt, naming, etc. → good layer of indirection
  - JBoss: open-source, millions of downloads, real deployments
  - Track EJBs, JSPs, HTTP, RMI, JDBC, JNDI
  - With synchronous reporting: 2-40 ms latency hit, 17% throughput decrease
- Testbed applications
  - Petstore 1.3, Petstore 1.1, RUBiS, ECPerf
- Test strategy: inject faults, measure detection rate
  - Declared and undeclared exceptions
  - Omitted calls the app is not likely to handle at all
  - Source code bugs (e.g., off-by-one errors, etc.)
38 PCFGs Model Normal Path Shapes
- Probabilistic Context-Free Grammar (PCFG)
  - Represents likely calls made by each component
  - Learn probabilities of rules based on observed paths
- Anomalous path shapes
  - Score a path by summing the deviations of P(observed calls) from average
- Detected 90% of faults in our experiments
39 Use PCFG to Score Paths
- Measure the difference between an observed path and the average
- Score(path) = Σ_i (1/n_i)(1 - P(r_i))
- Higher scores are anomalous
- Detected 90% of faults in our experiments (a simplified sketch of this scoring appears below)
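A simplified sketch of this kind of path scoring, assuming a path is just the sequence of component names a request touched and P(r_i) is the learned probability of the i-th caller-to-callee transition. The data structures and smoothing are simplifications for illustration, not Pinpoint's implementation.

// Simplified sketch of PCFG-style path scoring: learn transition probabilities
// from normal paths, then score a new path by its average deviation from them.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathScorer {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Train on an observed normal path: a sequence of component names.
    public void observe(List<String> path) {
        for (int i = 0; i + 1 < path.size(); i++) {
            counts.computeIfAbsent(path.get(i), k -> new HashMap<>())
                  .merge(path.get(i + 1), 1, Integer::sum);
        }
    }

    // P(callee | caller), with a tiny floor so unseen transitions score as anomalous.
    private double prob(String caller, String callee) {
        Map<String, Integer> out = counts.getOrDefault(caller, Map.of());
        int total = out.values().stream().mapToInt(Integer::intValue).sum();
        if (total == 0) return 1e-6;
        return Math.max(out.getOrDefault(callee, 0) / (double) total, 1e-6);
    }

    // Average deviation of transition probabilities from 1; higher = more anomalous.
    public double score(List<String> path) {
        int n = path.size() - 1;
        if (n <= 0) return 0.0;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 1.0 - prob(path.get(i), path.get(i + 1));
        }
        return sum / n;
    }
}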
40 Separating Good from Bad Paths
- Use a dynamic threshold to detect anomalies
  - Alarm when unexpectedly many paths fall above the Nth percentile (see the sketch below)
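One possible realization of such a dynamic threshold: take the Nth percentile of path scores seen during normal operation, and alarm when the fraction of recent paths above it is much larger than expected. The percentile, window, and slack factor below are illustrative choices, not the deployed settings.

// Sketch of dynamic-threshold detection: alarm when an unexpectedly large
// fraction of recent path scores exceeds the historical Nth percentile.
import java.util.Arrays;

public class DynamicThreshold {
    // Value below which `pct` percent of the historical (normal) scores fall.
    public static double percentile(double[] historicalScores, double pct) {
        double[] sorted = historicalScores.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    // Alarm if far more than the expected fraction of recent scores is above the cutoff.
    public static boolean alarm(double[] recentScores, double cutoff,
                                double expectedFractionAbove, double slack) {
        long above = Arrays.stream(recentScores).filter(s -> s > cutoff).count();
        double fraction = above / (double) recentScores.length;
        return fraction > expectedFractionAbove * slack;
    }

    public static void main(String[] args) {
        double[] history = {0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.22, 0.18};
        double cutoff = percentile(history, 90);            // 90th percentile of normal scores
        double[] recent = {0.4, 0.5, 0.1, 0.45, 0.2};       // 60% of recent scores above cutoff
        System.out.println(alarm(recent, cutoff, 0.10, 2.0)); // true: well above 10% * slack
    }
}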
41 Anomalies in Component Interaction
- Weighted links model component interaction
42 Scoring CI Models
- Score with a χ² test of goodness-of-fit (a sketch of this test follows the figure below)
  - Probability that the same process generated both
  - Makes no assumptions about the shape of the distribution
[Figure: "Normal Pattern" of weighted component-interaction links, with example link weights w0 = 0.4, w1 = 0.3, w2 = 0.2, w3 = 0.1 and node counts n0 = 30, n1 = 10, n2 = 40, n3 = 20]
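A sketch of how a χ² goodness-of-fit check over a component's interaction counts might look. The weights, counts, and critical value are illustrative (7.81 is the standard χ² critical value for 3 degrees of freedom at p = 0.05); this is not Pinpoint's exact scoring code.

// Sketch of a chi-squared goodness-of-fit test between a component's observed
// interaction counts and the counts expected from the learned normal link weights.
public class ChiSquaredCheck {
    // chi2 = sum over links of (observed - expected)^2 / expected
    public static double statistic(double[] observedCounts, double[] normalWeights) {
        double total = 0.0;
        for (double c : observedCounts) total += c;
        double chi2 = 0.0;
        for (int i = 0; i < observedCounts.length; i++) {
            double expected = normalWeights[i] * total;  // weights sum to 1
            chi2 += Math.pow(observedCounts[i] - expected, 2) / expected;
        }
        return chi2;
    }

    public static void main(String[] args) {
        double[] normalWeights = {0.4, 0.3, 0.2, 0.1};   // learned link weights
        double[] observed = {10, 32, 18, 40};            // counts in the current window
        double chi2 = statistic(observed, normalWeights);
        // Compare against a critical value for 3 degrees of freedom (7.81 at p = 0.05).
        System.out.println(chi2 > 7.81 ? "anomalous interactions" : "looks normal");
    }
}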
43 Two Kinds of False Positives
- Algorithmic false positives
  - No anomaly exists
  - But the statistical technique made a mistake...
- Semantic false positives
  - Correctly found an anomaly
  - But the anomaly is not a failure
44 Resilient Against Semantic FP
- Test against normal changes
  - 1. Vary workload from browse + purchase to browse-only
  - 2. Minor upgrade from Petstore 1.3.1 to 1.3.2
  - Path-shape analysis found NO differences
  - Component-interaction changes were below threshold
- For predictable, major changes
  - Consider lowering Pinpoint sensitivity until retraining is complete
  - → a window of vulnerability, but better than false positives
- Q: Rate of normal changes? How quickly can we retrain?
  - Minor changes every day, but only to parts of the site
  - Training speed → how quickly is the service exercised?
45 Related Work
- Detection and localization
  - Richardson: performance failure detection
  - InfoSpect: search for logical inconsistencies in observed configuration
  - Event/alarm correlation systems use dependency models to quiesce/collapse correlated alarms
- Request tracing
  - Magpie: tracing for performance modeling/characterization
  - Mogul: discovering majority behavior in black-box distributed systems
- Compilers & PL
  - DIDUCE: hypothesize invariants, report when they're broken
  - Bug Isolation Project: correlate crashes with state, across real runs
  - Engler: analyze static code for patterns and anomalies → bugs
46 Conclusions
- Monitoring path shapes and component interactions...
  - ... easy to instrument, app-generic
  - ... likely to change when the application fails
- Model the normal pattern of behavior, look for anomalies
  - Key assumption: most of the system is working most of the time
- Anomaly detection detects high-level failures, and is deployable
  - Resilient to (at least some) normal changes to the system
- Current status
  - Deploying in a real, large Internet service
  - Anomaly-detection techniques for structure-less systems
47 More Information
- http://www.stanford.edu/~emrek/
- Detecting Application-Level Failures in Component-Based Internet Services
  - Emre Kiciman, Armando Fox. In submission
- Session State: Beyond Soft State
  - Benjamin Ling, Emre Kiciman, Armando Fox. NSDI '04
- Path-Based Failure and Evolution Management
  - Chen, Accardi, Kiciman, Lloyd, Patterson, Fox, Brewer. NSDI '04
48 Localize Failures with a Decision Tree
- Search for features that occur with bad items, but not good ones
- Decision trees
  - A classification function
  - Each branch in the tree tests a feature
  - Leaves of the tree give the classification
- Learn a decision tree to classify good/bad examples
  - But we won't use it for classification
  - Just look at the learned classifier and extract its questions as features (a minimal sketch follows)
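A minimal sketch of the idea: treat each request as a set of features (e.g., the components it touched) and pick the feature whose single decision-tree split best separates failed from successful requests by information gain. The example data and class names are made up; a real tree would split recursively.

// Sketch of the localization idea: find the feature whose presence best separates
// failed requests from successful ones (a single decision-tree split by information gain).
import java.util.List;
import java.util.Set;

public class BestSplit {
    record Example(Set<String> features, boolean failed) {}  // features: e.g., components touched

    static double entropy(long bad, long total) {
        if (total == 0 || bad == 0 || bad == total) return 0.0;
        double p = bad / (double) total;
        return -p * Math.log(p) - (1 - p) * Math.log(1 - p);
    }

    // Return the feature (e.g., a component name) giving the highest information gain.
    static String mostSuspiciousFeature(List<Example> examples, Set<String> allFeatures) {
        long total = examples.size();
        long bad = examples.stream().filter(Example::failed).count();
        double base = entropy(bad, total);
        String best = null;
        double bestGain = -1;
        for (String f : allFeatures) {
            long with = examples.stream().filter(e -> e.features().contains(f)).count();
            long badWith = examples.stream()
                    .filter(e -> e.features().contains(f) && e.failed()).count();
            long without = total - with;
            long badWithout = bad - badWith;
            double remainder = (with / (double) total) * entropy(badWith, with)
                             + (without / (double) total) * entropy(badWithout, without);
            double gain = base - remainder;
            if (gain > bestGain) { bestGain = gain; best = f; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Example> ex = List.of(
            new Example(Set.of("CartEJB", "CatalogEJB"), true),
            new Example(Set.of("CartEJB", "OrderEJB"), true),
            new Example(Set.of("CatalogEJB", "OrderEJB"), false),
            new Example(Set.of("CatalogEJB"), false));
        System.out.println(mostSuspiciousFeature(ex, Set.of("CartEJB", "CatalogEJB", "OrderEJB")));
        // -> CartEJB (present in all failed requests, absent from all good ones)
    }
}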
49 Illustrative Decision Tree
50 Results: Comparing Localization Rate
51 Monitoring Structure-less Systems
- N replicated storage bricks handle read/write requests
  - No complicated interactions or requests → cannot do structural anomaly detection!
- Alternative features (performance, memory usage, etc.)
- Activity statistics: how often did a brick do something?
  - Msgs received/sec, dropped/sec, etc.
  - Same across all peers, assuming a balanced workload
  - Use anomalies as likely failures
- State statistics: what is the current state of the system?
  - Memory usage, queue length, etc.
  - Similar pattern across peers, but may not be in phase
  - Look for patterns in time-series; differences in patterns indicate failure at a node (a simple peer-comparison sketch follows)
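A sketch of one simple peer comparison for the activity statistics above: flag any brick whose statistic sits many median-absolute-deviations away from the median of its peers. The statistic, cutoff k, and numbers are illustrative, not the deployed analysis.

// Sketch of peer-based anomaly detection for structure-less systems: flag the brick
// whose activity statistic deviates far from the median of its peers.
import java.util.Arrays;

public class PeerComparison {
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Flags bricks whose stat is more than k MADs away from the peer median.
    public static boolean[] anomalousBricks(double[] statPerBrick, double k) {
        double med = median(statPerBrick);
        double[] devs = new double[statPerBrick.length];
        for (int i = 0; i < devs.length; i++) devs[i] = Math.abs(statPerBrick[i] - med);
        double mad = median(devs);
        boolean[] flagged = new boolean[statPerBrick.length];
        for (int i = 0; i < flagged.length; i++) {
            flagged[i] = devs[i] > k * Math.max(mad, 1e-9);
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[] msgsPerSec = {510, 495, 502, 120, 505};   // brick 3 has slowed down
        System.out.println(Arrays.toString(anomalousBricks(msgsPerSec, 5.0)));
        // -> [false, false, false, true, false]
    }
}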
52 Surprising Patterns in Time-Series
- 1. Discretize the time-series into a string [Keogh]
  - 0.2, 0.3, 0.4, 0.6, 0.8, 0.2 → aaabba
- 2. Calculate the frequencies of short substrings in the string
  - aa occurs twice; ab, bb, ba each occur once
- 3. Compare frequencies to normal; look for substrings that occur much less or much more than normal (sketched below)
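A rough sketch of steps 1 and 2, using equal-width binning into letters and length-2 substring counts; the actual technique follows [Keogh] and uses a more careful discretization. The example reproduces the "aaabba" string from above.

// Sketch of "surprising patterns": discretize a time-series into letters, then
// compare short-substring frequencies against those seen under normal operation.
import java.util.HashMap;
import java.util.Map;

public class SurprisingPatterns {
    // Equal-width binning of values into letters 'a', 'b', ... (numBins <= 26).
    public static String discretize(double[] series, int numBins, double min, double max) {
        StringBuilder sb = new StringBuilder();
        for (double v : series) {
            int bin = (int) ((v - min) / (max - min) * numBins);
            bin = Math.min(Math.max(bin, 0), numBins - 1);
            sb.append((char) ('a' + bin));
        }
        return sb.toString();
    }

    public static Map<String, Integer> substringCounts(String s, int len) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + len <= s.length(); i++) {
            counts.merge(s.substring(i, i + len), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String normal = discretize(new double[]{0.2, 0.3, 0.4, 0.6, 0.8, 0.2}, 2, 0.0, 1.0);
        // Prints "aaabba" and its counts: aa=2; ab, bb, ba each 1.
        System.out.println(normal + " " + substringCounts(normal, 2));
        // A failure would show substrings occurring much more or less often than in `normal`.
    }
}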
53 Inject Failures into Storage System
- Inject a performance failure every 60s in one brick
  - Slow all requests by 1ms
- Pinpoint detects failures within 1-2 periods
  - Does not detect anomalies during normal behavior (including workload changes and GC)
- Current issue: too many magic numbers
  - Working on improving these techniques to remove or automatically choose the magic numbers
54 Responding to Anomalies
- Want a policy for responding to anomalies (roughly sketched below)
- Cross-check for failure
  - 1. If no cause is correlated with the anomaly → not a failure
  - 2. Check user behavior for excessive reloads
  - 3. Persistent anomaly? Check for recent state changes
- Recovery actions
  - 1. Reboot the component or app
  - 2. Roll back the failed request, try again
  - 3. Roll back the software to the last known good state
  - 4. Notify the administrator
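Purely as a sketch, the cross-checks and recovery actions above could be composed into a policy like the following. Every predicate and action here is a placeholder, and the ordering (cheapest recovery first) is an assumption, not a prescribed design.

// Rough, hypothetical sketch of the response policy above; the predicate and
// action methods are placeholders, not an existing API.
public class ResponsePolicy {
    interface CrossChecks {
        boolean causeCorrelatedWithAnomaly();  // 1. is some cause correlated with the anomaly?
        boolean excessiveUserReloads();        // 2. do users show excessive reloads?
        boolean persistentAnomaly();           // 3a. does the anomaly persist?
        boolean recentStateChanges();          // 3b. were there recent state changes?
    }

    interface RecoveryActions {                // each returns true if it cleared the anomaly
        boolean rebootComponentOrApp();        // 1.
        boolean rollbackRequestAndRetry();     // 2.
        boolean rollbackToLastKnownGood();     // 3.
        void notifyAdministrator();            // 4.
    }

    public static void respond(CrossChecks check, RecoveryActions act) {
        // Cross-check for failure before recovering.
        if (!check.causeCorrelatedWithAnomaly()) {
            return;  // no correlated cause -> treat as not a failure
        }
        boolean likelyFailure = check.excessiveUserReloads()
                || (check.persistentAnomaly() && !check.recentStateChanges());
        if (!likelyFailure) {
            return;  // e.g., a persistent anomaly explained by a recent state change
        }
        // Escalating recovery actions, cheapest first.
        if (act.rebootComponentOrApp()) return;
        if (act.rollbackRequestAndRetry()) return;
        if (act.rollbackToLastKnownGood()) return;
        act.notifyAdministrator();
    }
}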