Stanford ROC Updates
ROC Retreat, June 16-18, 2004. Emre Kiciman. Recovery-Oriented Computing.


1
Stanford ROC Updates
  • Armando Fox

2
Progress
  • Graduations
  • Ben Ling (SSM, cheap-recovery session state
    manager)
  • Jamie Cutler (refactoring satellite groundstation
    software architecture to apply ROC techniques)
  • Andy Huang: DStore, a persistent cluster-based
    hash table (CHT)
  • Consistency model concretized
  • Cheap recovery exploited for fast recovery
    triggered by statistical monitoring
  • Cheap recovery exploited for online repartitioning

3
More progress
  • George Candea: Microreboots at the EJB level in
    J2EE apps
  • Shown to recover from a variety of injected faults
  • J2EE app session state factored out into SSM,
    making the J2EE app crash-only
  • Demo during poster session
  • Emre Kiciman: Pinpoint, further exploration of
    anomaly-based failure detection (more in a minute)

4
Fast Recovery meets Anomaly Detection
  • Use anomaly detection techniques to infer
    (possible) failures
  • Act on alarms using low-overhead micro-recovery
    mechanisms
  • Microreboots in EJB apps
  • Node- or process-level reboot in DStore or SSM
  • Occasional false positives OK since recovery is
    so cheap
  • These ideas will be developed at Panel tonight,
    and form topics for Breakouts tomorrow

5
Updates on Pinpoint
  • Emre Kiciman and Armando Fox
  • {emrek, fox}@cs.stanford.edu

6
What Is This Talk About?
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

7
Pinpoint Overview
  • Goal: App-generic, high-level failure detection
  • For app-level faults, detection is a significant
    portion of MTTR (75%!)
  • Existing monitors are hard to build/maintain or
    miss high-level faults
  • Approach: Monitor, aggregate, and analyze
    low-level behaviors that correspond to high-level
    semantics
  • Component interactions
  • Structure of runtime paths
  • Analysis of per-node statistics (req/sec, mem
    usage, ...), without a priori thresholds
  • Assumption: Anomalies are likely to be faults
  • Look for anomalies over time, or across peers in
    the cluster.

8
Recap: 3 Steps to Pinpoint
  • Observe low-level behaviors that reflect
    app-level behavior
  • Likely to change iff application-behavior changes
  • App-transparent instrumentation!
  • Model normal behavior and look for anomalies
  • Assume most of system working most of the time
  • Look for anomalies over time and across peers
  • No a priori app-specific info!
  • Correlate anomalous behavior to likely causes
  • Assume observed connection between anomaly and
    cause
  • Finally, notify admin or reboot component
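A minimal sketch of the observe and model/detect steps above, using caller-to-callee edge frequencies as the monitored low-level behavior. All class and method names here are hypothetical, not Pinpoint's actual API, and the correlation step is omitted.

```java
import java.util.*;

// Illustrative skeleton of the Pinpoint loop; names are hypothetical.
public class PinpointSketch {
    // Step 1: observe low-level behavior (component calls per request).
    record Observation(String requestId, String caller, String callee) {}

    // Step 2: model "normal" as the frequency of each caller->callee edge.
    static Map<String, Double> trainModel(List<Observation> history) {
        Map<String, Integer> counts = new HashMap<>();
        for (Observation o : history)
            counts.merge(o.caller() + "->" + o.callee(), 1, Integer::sum);
        Map<String, Double> freq = new HashMap<>();
        for (var e : counts.entrySet())
            freq.put(e.getKey(), e.getValue() / (double) history.size());
        return freq;
    }

    // Flag observations whose edge was rarely (or never) seen in training.
    static boolean isAnomalous(Observation o, Map<String, Double> model, double minFreq) {
        return model.getOrDefault(o.caller() + "->" + o.callee(), 0.0) < minFreq;
    }

    public static void main(String[] args) {
        List<Observation> history = List.of(
            new Observation("r1", "JSP", "CartEJB"),
            new Observation("r2", "JSP", "CartEJB"),
            new Observation("r3", "JSP", "CatalogEJB"));
        Map<String, Double> model = trainModel(history);
        Observation live = new Observation("r4", "JSP", "SignOnEJB"); // unseen edge
        System.out.println("anomalous? " + isAnomalous(live, model, 0.05));
    }
}
```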

9
An Internet Service...
10
A Failure...
  • Failures behave differently than normal
  • Look for anomalies in patterns of internal
    behavior

11
Patterns: Path Shapes
12
Patterns: Component Interactions
13
Outline
  • Overview of recent Pinpoint experiments
  • Observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

14
Compared to other anomaly-detection...
  • Labeled and Unlabeled training sets
  • If we know the end user saw a failure, Pinpoint
    can help with localization
  • But often we're trying to catch failures that
    end-user-level detectors would miss
  • Ground truth for the latter is HTML page
    checksums and database table snapshots
  • Current analyses are done offline
  • Eventual goal is to move to online, with new
    models being trained and rotated in periodically
  • Alarms must be actionable
  • Microreboots (tomorrow) allow acting on alarms
    even when there are false positives

15
Fault and Error Injection Behavior
  • Injected 4 types of faults and errors
  • Declared and runtime exceptions
  • Method call omissions
  • Source code bug injections (details on next page)
  • Results ranged in severity (% of requests
    affected)
  • 60% of faults caused cascades, affecting
    secondary requests
  • We fared most poorly on the minor bugs

  Fault type      Num   Severe (>90%)   Major (>1%)   Minor (<1%)
  Declared ex     41    20%             56%           24%
  Runtime ex      41    17%             59%           24%
  Call omission   41     5%             73%           22%
  Src code bug    47    13%             76%           11%
16
Experience w/Bug Injection
  • Wrote a Java code modifier to inject bugs
  • Injects 6 kinds of bugs into code in Petstore 1.3
  • Limited to bugs that would not be caught by the
    compiler and are easy to inject -> no major
    structural bugs
  • Double-check fault existence by checksumming HTML
    output
  • Not trivial to inject bugs that turn into
    failures!
  • 1st try: inject 5-10 bugs into random spots in
    each component.
  • Ran 100 experiments, only 4 caused any changes!
  • 2nd try: exhaustive enumeration of potential bug
    spots
  • Found a total of 41 active spots out of 1000s.
  • Rest is straight-line code w/no trivial bug
    spots, or dead code.

17
Source Code Bugs (Detail)
  • Loop Errors: Inverts loop conditions, injected 15.
  • while(b) stmt -> while(!b) stmt
  • Misassignment: Replaces LHS of assignment,
    injected 1
  • i=f(a) -> j=f(a)
  • Misinitialization: Clears a variable
    initialization, injected 2
  • int i=20 -> int i=0
  • Misreference: Replaces a var reference, injected 6
  • avail=onStock-Ordered -> avail=onStock-onOrder
  • Off-by-one: Replaces comparison op, injected 17
  • if(a > b) ... -> if(a >= b) ...
  • Synchronization: Removes synchronization code,
    injected 0
  • synchronized stmt -> stmt
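As a rough illustration of how such mutations can be injected, here is a toy injector for the off-by-one case only (replacing ">" with ">=") done by naive text substitution. The real injector worked on parsed Java code; this sketch, including its class and method names, is hypothetical.

```java
// Toy source-level bug injector: applies the off-by-one mutation from the
// slide (swap a comparison operator) by naive text substitution.
public class OffByOneInjector {
    // Replace the nth ">" that is not already part of ">=" with ">=".
    static String inject(String source, int occurrence) {
        StringBuilder out = new StringBuilder(source);
        int seen = 0;
        for (int i = 0; i < out.length(); i++) {
            if (out.charAt(i) == '>' && (i + 1 >= out.length() || out.charAt(i + 1) != '=')) {
                if (++seen == occurrence) {
                    out.insert(i + 1, '=');   // "a > b" becomes "a >= b"
                    return out.toString();
                }
            }
        }
        return source; // no eligible spot found
    }

    public static void main(String[] args) {
        String code = "if (qty > limit) { reject(); }";
        System.out.println(inject(code, 1)); // if (qty >= limit) { reject(); }
    }
}
```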

18
Outline
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

19
Metrics: Recall and Precision
  • Recall = C/T: how much of the target was identified
  • Precision = C/R: how much of the results were
    correct
  • Also, precision = 1 - (false positive rate)
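Stated directly in code (a trivial sketch; the example numbers are invented), with correct = C, target = T, and returned = R as defined above:

```java
// Recall and precision exactly as defined on the slide:
//   recall    = C / T  (fraction of the target that was identified)
//   precision = C / R  (fraction of returned results that were correct)
public class Metrics {
    static double recall(int correct, int target)      { return correct / (double) target; }
    static double precision(int correct, int returned) { return correct / (double) returned; }

    public static void main(String[] args) {
        int correct = 34, target = 50, returned = 40;   // made-up example counts
        System.out.printf("recall=%.2f precision=%.2f (false positive rate=%.2f)%n",
                recall(correct, target),
                precision(correct, returned),
                1.0 - precision(correct, returned));
    }
}
```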

20
Metrics: Applying Recall and Precision
  • Detection
  • Do failures in the system cause detectable
    anomalies?
  • Recall: % of failures actually detected as
    anomalies
  • Precision = 1 - (false positive rate) = 1.0 in
    our experiments
  • Identification (given a failure is detected)
  • Recall: how many actually-faulty requests are
    returned
  • Precision: what % of requests returned are
    faulty = 1 - (false positive rate)
  • Using HTML page checksums as ground truth
  • Workload: PetStore 1.1 and 1.3 (significantly
    different versions), plus RUBiS

21
Fault Detection Recall (All fault types)
  • Minor faults were hardest to detect
  • especially for Component Interaction

22
FD Recall (Severe + Major Faults only)
  • Major faults are those that affect > 1% of
    requests
  • For these faults, Pinpoint has significantly
    higher recall than other low-level detectors

23
Detecting Source Code Bugs
  • Source code bugs were hardest to detect
  • PS-analysis and CI-analysis individually detected
    7-12% of all faults, 37% of major faults
  • HTTP detected 10% of all faults
  • We did better than HTTP logs, but that's no
    excuse
  • Other faults: PP strictly better than HTTP and
    HTML detection
  • Src code bugs: detectors are complementary;
    together all detected 15%

24
Faulty Request Identification
  • HTTP monitoring has perfect precision since it's
    a ground truth indicator of a server fault
  • Path-shape analysis pulls more points out of the
    bottom left corner

25
Faulty Request Identification
  • HTTP monitoring has perfect precision since it's
    a ground truth indicator of a server fault
  • Path-shape analysis pulls more points out of the
    bottom left corner

26
Adjusting Precision
  • Threshold = 1: recall = 68%, precision = 14%
  • Threshold = 4: recall = 34%, precision = 93%
  • Low recall for faulty request identification
    still detects 83% of fault experiments

27
Outline
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

28
Outline
  • Overview of recent Pinpoint experiments
  • Including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

29
Status of Real-World Deployment
  • Deploying parts of Pinpoint at 2 large sites
  • Site 1
  • Instrumenting middleware to collect request paths
    for path-shape and component interaction analysis
  • Feasibility study completed, instrumentation in
    progress...
  • Site 2
  • Applying peer-analysis techniques developed for
    SSM and D-Store
  • Metrics (e.g., req/sec, memory usage, ...)
    already being collected.
  • Beginning analysis and testing...

30
Summary
  • Fault injection experiments showed range of
    behavior
  • Cascading faults to other requests; range of
    severity
  • Pinpoint performed better than existing low-level
    monitors
  • Detected 90% of major component-level errors
    (exceptions, etc.)
  • Even in worst-case expts (src code bugs) PP
    provided a complementary improvement to existing
    low-level monitors
  • Currently, validating Pinpoint in two real-world
    services

31
Detail Slides
32
Limitations: Independent Requests
  • PP assumes request-reply w/independent requests
  • Monitored RMI-based J2EE system (ECPerf 1.1)
  • ... is request-reply, but requests are not
    independent, nor is the unit of work (UoW) well
    defined
  • Assume UoW = 1 RMI call
  • Most RMI calls resulted in short paths (1 comp)
  • Injected faults do not change these short paths
  • When anomalies occurred, rarely in faulty path...
  • Solution? Redefine UoW as multiple RMI calls
  • => paths capture more behavioral changes
  • => redefined UoW is likely app-specific

33
Limitations: Well-defined Peers
  • PP assumes component peer groups well-defined
  • But behavior can depend on context
  • Ex. Naming server in a cluster
  • Front-end servers mostly send lookup requests
  • Back-end servers mostly respond to lookups.
  • Result: No component matches average behavior
  • Both front-end and back-end naming servers
    anomalous!
  • Solution? Extend component-IDs to include logical
    location...

34
Bonus Slides
35
Ex. Application-Level Failure
No itinerary is actually available on this page
Ticket was bought in March for travel in April
But, website (superficially) appears to be
working. Heartbeat, pings, HTTP-GET tests are
not likely to detect the problem
36
Application-level Failures
  • Application-level failures are common
  • >60% of sites have user-visible (incl. app-level)
    failures [BIG-SF]
  • Detection is major portion of recovery time
  • TellMe: detecting app-level failures is 75% of
    recovery time [CAK04]
  • 65% of user-visible failures mitigable by earlier
    detection [OGP03]
  • Existing monitoring techniques aren't good enough
  • Low-level monitors: pings, heartbeats, HTTP error
    monitoring
  • + app-generic/low maintenance, - miss high-level
    failures
  • High-level, app-specific tests
  • + can catch many app-level failures,
    - app-specific/hard to maintain
  • - test coverage problem

37
Testbed and Faultload
  • Instrumented JBoss/J2EE middleware
  • J2EE state mgt, naming, etc. -> good layer of
    indirection
  • JBoss: open source, millions of downloads, real
    deployments
  • Track EJBs, JSPs, HTTP, RMI, JDBC, JNDI
  • With synchronous reporting: 2-40ms latency hit,
    17% throughput decrease
  • Testbed applications
  • Petstore 1.3, Petstore 1.1, RUBiS, ECPerf
  • Test strategy: inject faults, measure detection
    rate
  • Declared and undeclared exceptions
  • Omitted calls that the app is not likely to handle
    at all
  • Source code bugs (e.g., off-by-one errors, etc.)

38
PCFGs Model Normal Path Shapes
  • Probabilistic Context Free Grammar (PCFG)
  • Represents likely calls made by each component
  • Learn probabilities of rules based on observed
    paths
  • Anomalous path shapes
  • Score a path by summing the deviations of
    P(observed calls) from average.
  • Detected 90% of faults in our experiments

39
Use PCFG to Score Paths
  • Measure difference between observed path and avg
  • Score(path) = Σi (1/ni - P(ri)), summed over the
    grammar rules ri used by the path
  • Higher scores are anomalous
  • Detected 90% of faults in our experiments
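A sketch of this scoring rule, under the assumption (mine) that ni is the number of grammar rules sharing rule ri's left-hand side, so 1/ni is the average rule probability the slide compares against. The component names and probabilities in main() are invented for illustration.

```java
import java.util.*;

// Sketch of PCFG-based path scoring:  Score(path) = sum_i (1/n_i - P(r_i))
// r_i   : i-th grammar rule used by the path (caller -> callee set)
// P(r_i): its learned probability; n_i: rules sharing r_i's left-hand side.
public class PathScore {
    static double score(List<String[]> path, Map<String, Map<String, Double>> ruleProbs) {
        double s = 0.0;
        for (String[] step : path) {                       // step = {lhs, rule}
            Map<String, Double> rules = ruleProbs.getOrDefault(step[0], Map.of());
            double p = rules.getOrDefault(step[1], 0.0);   // unseen rule => P = 0
            double avg = rules.isEmpty() ? 1.0 : 1.0 / rules.size();
            s += avg - p;
        }
        return s;   // higher = more anomalous
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> probs = Map.of(
            "JSP",     Map.of("CartEJB", 0.7, "CatalogEJB", 0.3),
            "CartEJB", Map.of("InventoryEJB", 1.0));
        List<String[]> normal  = List.of(new String[]{"JSP", "CartEJB"},
                                         new String[]{"CartEJB", "InventoryEJB"});
        List<String[]> anomaly = List.of(new String[]{"JSP", "SignOnEJB"}); // unseen call
        System.out.println("normal  score = " + score(normal, probs));
        System.out.println("anomaly score = " + score(anomaly, probs));
    }
}
```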

40
Separating Good from Bad Paths
  • Use dynamic threshold to detect anomalies
  • When unexpectedly many paths fall above Nth
    percentile
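One way to realize this dynamic threshold, purely as a sketch: learn the Nth-percentile score from recent normal traffic and raise an alarm when the fraction of new paths above it is much larger than expected. The 99th percentile and the 3x factor below are illustrative choices, not values from the talk.

```java
import java.util.*;

// Sketch: learn a percentile threshold from normal scores, then alarm when
// "unexpectedly many" live paths score above it.
public class DynamicThreshold {
    static double percentile(double[] scores, double pct) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    static boolean alarm(double[] recentScores, double threshold, double expectedFraction) {
        long above = Arrays.stream(recentScores).filter(s -> s > threshold).count();
        return above > 3 * expectedFraction * recentScores.length; // 3x more than expected
    }

    public static void main(String[] args) {
        double[] training = {0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.4, 0.22};
        double t = percentile(training, 99);            // learn threshold offline
        double[] live = {0.5, 0.6, 0.55, 0.2, 0.7};     // many high-scoring paths
        System.out.println("threshold=" + t + " alarm=" + alarm(live, t, 0.01));
    }
}
```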

41
Anomalies in Component Interaction
  • Weighted links model component interaction

42
Scoring CI Models
  • Score with χ² (chi-square) test of goodness-of-fit
  • Probability that same process generated both
  • Makes no assumptions about shape of distribution

[Figure: weighted-link model of component interactions, normal pattern with
link weights w0=.4, w1=.3, w2=.2, w3=.1 and observed link counts n0=30,
n1=10, n2=40, n3=20]
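A sketch of chi-square goodness-of-fit scoring, reusing the weights and counts from the figure above as example data. The 7.815 critical value (chi-square, 3 degrees of freedom, alpha = 0.05) is a textbook constant chosen for this example, not a threshold from the talk.

```java
// Score a component's observed interaction counts against its "normal"
// link weights with a chi-square goodness-of-fit statistic.
public class ChiSquareScore {
    static double chiSquare(double[] normalWeights, int[] observedCounts) {
        int total = 0;
        for (int c : observedCounts) total += c;
        double stat = 0.0;
        for (int i = 0; i < observedCounts.length; i++) {
            double expected = normalWeights[i] * total;
            double diff = observedCounts[i] - expected;
            stat += diff * diff / expected;
        }
        return stat;
    }

    public static void main(String[] args) {
        double[] normal = {0.4, 0.3, 0.2, 0.1};   // learned link weights w0..w3
        int[] observed  = {30, 10, 40, 20};       // counts seen in current window
        double stat = chiSquare(normal, observed);
        System.out.println("chi-square = " + stat + ", anomalous = " + (stat > 7.815));
    }
}
```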
43
Two Kinds of False Positives
  • Algorithmic false positives
  • No anomaly exists
  • But statistical technique made a mistake...
  • Semantic false positives
  • Correctly found an anomaly
  • But anomaly is not a failure

44
Resilient Against Semantic FP
  • Test against normal changes
  • 1. Vary workload from browse + purchase to
    only browse
  • 2. Minor upgrade from Petstore 1.3.1 to 1.3.2
  • Path-shape analysis found NO differences
  • Component interaction changes below threshold
  • For predictable, major changes
  • Consider lowering Pinpoint sensitivity until
    retraining complete
  • -> Window of vulnerability, but better than
    false positives
  • Q: Rate of normal changes? How quickly can we
    retrain?
  • Minor changes every day, but only to parts of
    site.
  • Training speed -> how quickly is the service
    exercised?

45
Related Work
  • Detection and Localization
  • Richardson: Performance failure detection
  • Infospect: search for logical inconsistencies in
    observed configuration
  • Event/alarm correlation systems use dependency
    models to quiesce/collapse correlated alarms
  • Request Tracing
  • Magpie: tracing for performance
    modeling/characterization
  • Mogul: discovering majority behavior in black-box
    distrib. systems
  • Compilers / PL
  • DIDUCE: hypothesize invariants, report when
    they're broken
  • Bug Isolation Proj.: correlate crashes w/state,
    across real runs
  • Engler: analyze static code for patterns and
    anomalies -> bugs

46
Conclusions
  • Monitoring path shapes and component
    interactions...
  • ... easy to instrument, app-generic
  • ... are likely to change when application fails
  • Model normal pattern of behavior, look for
    anomalies
  • Key assumption most of system working most of
    time
  • Anomaly detection detects high-level failures,
    and is deployable
  • Resilient to (at least some) normal changes to
    system
  • Current status
  • Deploying in real, large Internet service.
  • Anomaly detection techniques for structure-less
    systems

47
More Information
  • http://www.stanford.edu/emrek/
  • Detecting Application-Level Failures in
    Component-Based Internet Services.
  • Emre Kiciman, Armando Fox. In submission
  • Session State Beyond Soft State.
  • Benjamin Ling, Emre Kiciman, Armando Fox. NSDI'04
  • Path-based Failure and Evolution Management
  • Chen, Accardi, Kiciman, Lloyd, Patterson, Fox,
    Brewer. NSDI'04

48
Localize Failures with Decision Tree
  • Search for features that occur with bad items,
    but not good
  • Decision trees
  • Classification function
  • Each branch in tree tests a feature
  • Leaves of tree give classification
  • Learn decision tree to classify good/bad examples
  • But we won't use it for classification
  • Just look at the learned classifier and extract
    its questions as features (sketch below)
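The sketch below keeps only the intuition of that step: instead of learning a full tree and reading off its branch tests, it picks the single "request touched component X" feature that best separates failed from successful requests. It is a simplification of the decision-tree approach, not the actual algorithm, and all names and data are invented.

```java
import java.util.*;

// Simplified stand-in for decision-tree localization: find the component
// whose presence in a request best predicts failure.
public class LocalizeByFeature {
    // features: per request, the set of components it touched.
    // failed:   whether that request failed (ground truth).
    static String mostSuspicious(List<Set<String>> features, List<Boolean> failed) {
        Set<String> all = new HashSet<>();
        features.forEach(all::addAll);
        String best = null;
        double bestScore = -1;
        for (String f : all) {
            int withF = 0, failedWithF = 0, failedTotal = 0;
            for (int i = 0; i < features.size(); i++) {
                if (failed.get(i)) failedTotal++;
                if (features.get(i).contains(f)) {
                    withF++;
                    if (failed.get(i)) failedWithF++;
                }
            }
            // score = precision * recall of "failed" among requests touching f
            double precision = withF == 0 ? 0 : failedWithF / (double) withF;
            double recall = failedTotal == 0 ? 0 : failedWithF / (double) failedTotal;
            double score = precision * recall;
            if (score > bestScore) { bestScore = score; best = f; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Set<String>> reqs = List.of(
            Set.of("JSP", "CartEJB"), Set.of("JSP", "SignOnEJB"),
            Set.of("JSP", "CartEJB"), Set.of("JSP", "SignOnEJB"));
        List<Boolean> failed = List.of(false, true, false, true);
        System.out.println("most suspicious component: " + mostSuspicious(reqs, failed));
    }
}
```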

49
Illustrative Decision Tree
50
Results: Comparing Localization Rate
51
Monitoring Structure-less Systems
  • N replicated storage bricks handle read/write
    requests
  • No complicated interactions or requests
  • -> Cannot do structural anomaly detection!
  • Alternative features (performance, mem usage,
    etc)
  • Activity statistics: How often did a brick do
    something?
  • Msgs received/sec, dropped/sec, etc.
  • Same across all peers, assuming balanced workload
  • Use anomalies as likely failures
  • State statistics: What is the current state of
    the system?
  • Memory usage, queue length, etc.
  • Similar pattern across peers, but may not be in
    phase
  • Look for patterns in time series; differences in
    patterns indicate failure at a node
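A sketch of the peer-comparison idea for activity statistics: flag a brick whose metric deviates from the peer median by much more than the peers' typical deviation. The median-absolute-deviation test and the 5x factor are illustrative choices of this example, not the talk's exact method.

```java
import java.util.*;

// Peer-based anomaly detection for one activity statistic (e.g., msgs/sec):
// a brick far from the peer median, relative to the peers' own spread,
// is flagged as anomalous.
public class PeerComparison {
    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    static List<Integer> anomalousBricks(double[] metricPerBrick) {
        double med = median(metricPerBrick);
        double[] dev = new double[metricPerBrick.length];
        for (int i = 0; i < dev.length; i++) dev[i] = Math.abs(metricPerBrick[i] - med);
        double mad = median(dev);                       // median absolute deviation
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < dev.length; i++)
            if (dev[i] > 5 * mad) flagged.add(i);       // 5x is an illustrative factor
        return flagged;
    }

    public static void main(String[] args) {
        double[] msgsPerSec = {101, 98, 103, 99, 12};   // brick 4 is lagging
        System.out.println("anomalous bricks: " + anomalousBricks(msgsPerSec));
    }
}
```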

52
Surprising Patterns in Time-Series
  • 1. Discretize the time series into a string
    [Keogh]
  • 0.2, 0.3, 0.4, 0.6, 0.8, 0.2 -> aaabba
  • 2. Calculate the frequencies of short substrings
    in the string.
  • 'aa' occurs twice; 'ab', 'bb', 'ba' each occur
    once.
  • 3. Compare frequencies to normal, look for
    substrings that occur much less or much more than
    normal.
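A sketch of these three steps, reusing the slide's example series. The 0.5 discretization cutoff, the two-letter alphabet, and the "differs by 2 or more" rule are illustrative assumptions, not parameters from the talk.

```java
import java.util.*;

// Discretize a time series into a string, count 2-character substrings,
// and report substrings whose frequency differs sharply from normal.
public class SurprisingSubstrings {
    static String discretize(double[] series) {
        StringBuilder sb = new StringBuilder();
        for (double v : series) sb.append(v < 0.5 ? 'a' : 'b');  // assumed cutoff
        return sb.toString();
    }

    static Map<String, Integer> substringCounts(String s, int len) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + len <= s.length(); i++)
            counts.merge(s.substring(i, i + len), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // "0.2, 0.3, 0.4, 0.6, 0.8, 0.2 -> aaabba" from the slide
        String normal  = discretize(new double[]{0.2, 0.3, 0.4, 0.6, 0.8, 0.2});
        String current = discretize(new double[]{0.2, 0.9, 0.2, 0.9, 0.2, 0.9});
        Map<String, Integer> normalCounts = substringCounts(normal, 2);
        substringCounts(current, 2).forEach((sub, n) -> {
            int expected = normalCounts.getOrDefault(sub, 0);
            if (Math.abs(n - expected) >= 2)
                System.out.println("surprising substring '" + sub + "': " + n + " vs " + expected);
        });
    }
}
```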

53
Inject Failures into Storage System
  • Inject performance failure every 60s in one brick
  • Slow all requests by 1ms
  • Pinpoint detects failures in 1-2 periods
  • Does not detect anomalies during normal behavior
    (including workload changes and GC)
  • Current issues: Too many magic numbers
  • Working on improving these techniques to remove
    or automatically choose magic numbers

54
Responding to Anomalies
  • Want a policy for responding to anomalies
  • Cross-check for failure
  • 1. If no cause is correlated with the anomaly ->
    not a failure
  • 2. Check user behavior for excessive reloads
  • 3. Persistent anomaly? Check for recent state
    changes
  • Recovery actions
  • 1. Reboot component or app
  • 2. Rollback failed request, try again.
  • 3. Rollback software to last known good state.
  • 4. Notify administrator
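One possible way to wire these cross-checks and recovery actions into a single decision procedure, purely as a sketch: the predicates, the escalation order beyond "cheapest recovery first", and all names are assumptions of this example, not a policy stated in the talk.

```java
// Hypothetical policy: cross-check the anomaly, then escalate from the
// cheapest recovery action (microreboot) toward notifying an administrator.
public class AnomalyPolicy {
    enum Action { IGNORE, MICROREBOOT_COMPONENT, RETRY_REQUEST, ROLLBACK_SOFTWARE, NOTIFY_ADMIN }

    static Action respond(boolean causeCorrelated, boolean excessiveUserReloads,
                          boolean persistent, boolean recentStateChange, int rebootsTried) {
        if (!causeCorrelated) return Action.IGNORE;                 // no correlated cause -> not a failure
        if (excessiveUserReloads && !persistent) return Action.RETRY_REQUEST;
        if (persistent && recentStateChange) return Action.ROLLBACK_SOFTWARE;
        if (rebootsTried == 0) return Action.MICROREBOOT_COMPONENT; // cheap recovery first
        return Action.NOTIFY_ADMIN;                                 // escalate if reboot didn't help
    }

    public static void main(String[] args) {
        System.out.println(respond(true, false, true, false, 0));   // MICROREBOOT_COMPONENT
        System.out.println(respond(true, false, true, false, 1));   // NOTIFY_ADMIN
    }
}
```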