Fast Recovery + Statistical Anomaly Detection = Self-*

1
Fast Recovery + Statistical Anomaly Detection = Self-*
  • RADS/KATZ CATS Panel
  • June 2004 ROC Retreat

2
Outline
  • Motivation & approach: complex systems of black boxes
  • Measurements that respect black boxes
  • Box-level Micro-recovery cheap enough to survive
    false positives
  • Differences from related efforts
  • Early case studies
  • Research agenda

3
Complex Systems of Black Boxes
  • "...our ability to analyze and predict the performance
    of the enormously complex software systems that lie at
    the core of our economy is painfully inadequate."
    (Choudhury & Weikum, 2000 PITAC Report)
  • Build model of "acceptable" operating envelope by
    measurement & analysis
  • Control theory, statistical correlation, anomaly
    detection...
  • Rely on external control, using inexpensive and
    simple mechanisms that respect the black box, to
    keep system in its acceptable operating envelope
  • Increase the size of the DB connection pool
    (Hellerstein et al.)
  • Reallocate one or more whole machines (Lassettre et al.)
  • Rejuvenate/reboot one or more machines (Trivedi, Fox,
    others)
  • Shoot one of the blocked txns (everyone)
  • Induce memory pressure on other apps (Waldspurger et al.)

4
Differences from some existing problems
  • intrusion detection (Hofmeyr et al 98, others)
  • Detections must be actionable in a way that is
    likely to improve system (sacrificing
    availability for safety is unacceptable)
  • bug finding via anomaly detection (Engler,
    others)
  • Human-level monitoring/verification of detections
    not feasible, due to number of observations and
    short timescales for reaction
  • Can separate recovery from diagnosis/repair (don't
    always need to know the root cause to recover)
  • modeling/predicting SLO violations (Hellerstein,
    Goldszmidt, others)
  • Labeled training set not necessarily available

5
Many other examples, but the point is...
Statistical techniques identify interesting features and
relationships in large datasets, but there is a frequent
tradeoff between detection rate (or detection time) and
false positives.
So: make micro-recovery so inexpensive that occasional
false positives don't matter.
  • Granularity of black box should match
    granularity of available external control
    mechanisms

6
Micro-recovery to survive false positives
  • Goal: provide recovery-management invariants
  • Salubrious: returns some part of the system to a known
    state
  • Reclaim resources (memory, DB conns, sockets, DHCP
    lease...)
  • Throw away corrupt transient state
  • Possibly set up to retry the operation, if appropriate
  • Safe: affects only performance, not correctness
  • Non-disruptive: performance impact is small
  • Predictable: impact and time-to-complete are stable
    (a control-point sketch honoring these invariants
    follows below)

Observe, Analyze, Act: not recovery, but continuous
adaptation
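
As a concrete illustration of these invariants, here is a minimal
sketch of what such a control point might look like in code. All of
this (the Python framing, the MicroRecoveryAction and RebootBrick
names, the numbers) is assumed for illustration; it is not the
project's actual API.

    # Hypothetical sketch only: names and values are illustrative,
    # not from the RADS codebase.
    import abc

    class MicroRecoveryAction(abc.ABC):
        """An externally invokable, cheap recovery action on one black box."""

        @abc.abstractmethod
        def invoke(self, target: str) -> None:
            """Salubrious: return `target` to a known state; reclaim resources
            (memory, DB conns, sockets, leases), discard corrupt transient
            state, possibly set up to retry the operation."""

        # Safe / non-disruptive / predictable: an action advertises a stable
        # bound on its duration and a small, known performance cost, and is
        # never allowed to affect correctness, so false positives stay cheap.
        max_duration_s: float = float("inf")
        throughput_cost: float = 1.0

    class RebootBrick(MicroRecoveryAction):
        """Whole-node fast reboot (kill -9 style), as used for SSM/DStore bricks."""
        max_duration_s = 5.0      # seconds-or-less recovery
        throughput_cost = 0.05    # small, thanks to replication slack

        def invoke(self, target: str) -> None:
            print(f"power-cycling brick {target}")  # stand-in for the real mechanism
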
7
Crash-Only Building Blocks
  Subsystem | Control point | How realized | Statistical monitoring
  SSM (diskless session state store) [NSDI '04] | Whole-node fast reboot (doesn't preserve state) | Quorum-like redundancy; relaxed consistency; repair cost spread over many operations | Time series of state metrics (Tarzan)
  DStore (persistent hashtable) [in preparation] | Whole-node reboot (preserves state) | Quorum-like redundancy; relaxed consistency; repair cost spread over many operations | Time series of state metrics (Tarzan)
  JAGR (J2EE application server) [AMS 2003; in prep.] | Microreboots of EJBs | Modify appserver to undeploy/redeploy EJBs and stall pending reqs | Anomalous code paths and component interactions (probabilistic context-free grammar)
  • Control points are safe, predictable, non-disruptive
  • Crash-only design: shutdown = crash, recover = restart
  • Makes state-management subsystems as easy to manage as
    stateless web servers

8
Example: Managing DStore and SSM
  • Rebooting is the only control mechanism
  • Has a predictable effect and takes predictable time,
    regardless of what the process is doing
  • Like kill -9, turning off a VM, or pulling the power
    cord
  • Intuition: the infrastructure supporting the power
    switch is simpler than the applications using it
  • Due to the slight overprovisioning inherent in
    replication, rebooting can have minimal effect on
    throughput and latency
  • Relaxed consistency guarantees allow this to work
  • Activity and state statistics collected per brick every
    second; any deviation ⇒ reboot the brick (a toy
    monitoring loop is sketched below)
  • Makes it as easy as managing a stateless server farm
  • Backpressure at many design points prevents saturation
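
A toy version of the observe/analyze/act loop described on this
slide, under assumptions of my own: one scalar metric per brick per
second, deviation measured against the peer median, and a
hypothetical reboot callback standing in for the real control point.

    # Toy sketch: per-second stats per brick, peer comparison, reboot deviants.
    from statistics import median
    from typing import Callable, Dict

    def monitor_step(stats: Dict[str, float],
                     reboot: Callable[[str], None],
                     threshold: float = 3.0) -> None:
        """stats maps brick id -> one activity/state metric for this 1 s window."""
        med = median(stats.values())
        # median absolute deviation: a crude, robust yardstick for "deviation"
        mad = median(abs(v - med) for v in stats.values()) or 1e-9
        for brick, value in stats.items():
            if abs(value - med) / mad > threshold:
                # Rebooting is safe and predictable, so a false positive is cheap.
                reboot(brick)

    # Example: brick "b3" lags its peers badly in this one-second window.
    monitor_step({"b1": 101.0, "b2": 99.0, "b3": 420.0, "b4": 103.0},
                 reboot=lambda b: print("rebooting", b))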

9
Design Lessons Learned So Far
  • A spectrum of cleaning operations (Eric Anderson, HP
    Labs)
  • Consequence: as t → ∞, all problems converge to repair
    of corrupted persistent data
  • Trade unnecessary consistency for faster recovery
  • Spread recovery actions out incrementally/lazily (read
    repair) rather than doing them all at once (log replay);
    see the sketch after this list
  • Gives predictable return-to-service time and acceptable
    variation in performance after recovery
  • Keeps data available for reads and writes throughout
    recovery
  • Use single-phase ops to avoid coupling/locking and the
    issues they raise, and justify the cost in consistency
  • It's OK to say no (backpressure)
  • Several places where our design got it wrong in SSM
  • But even those mistakes could have been worked around
    by guard timers
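
To make the read-repair point concrete, here is a toy sketch. The
replica layout, timestamps, and function names are my own
simplifications, not SSM/DStore code: each read finds the freshest
copy among the replicas it touches and lazily overwrites stale ones,
so repair cost is amortized over normal operations instead of a
recovery-time log replay.

    # Toy read-repair sketch; data layout and names are illustrative only.
    from typing import Dict, List, Tuple

    Replica = Dict[str, Tuple[int, str]]   # key -> (timestamp, value)

    def read_with_repair(key: str, replicas: List[Replica]) -> str:
        versions = [(r.get(key, (0, None)), r) for r in replicas]
        (best_ts, best_val), _ = max(versions, key=lambda v: v[0][0])
        for (ts, _), rep in versions:
            if ts < best_ts:
                rep[key] = (best_ts, best_val)   # incremental, lazy repair
        return best_val

    r1, r2, r3 = {"k": (2, "new")}, {"k": (1, "old")}, {}
    print(read_with_repair("k", [r1, r2, r3]))   # "new"; r2, r3 repaired as a side effect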

10
Potential Limitations and Challenges
  • Hard failures
  • Configuration failures
  • Although a similar approach has been used to
    troubleshoot those
  • Corruption of persistent state
  • Data structure repair work (Rinard et al.) may be
    combinable with automatic inference (Lam et al.)
  • Challenges
  • Stability and the autopilot problem
  • The base-rate fallacy
  • Multilevel learning
  • Online implementations of SLT techniques
  • Nonintrusive data collection and storage

11
An Architecture for Observe, Analyze, Act
  • Separates systems concerns from algorithm
    development
  • Programmable network elements provide extension
    of approach to other layers
  • Consistent with technology trends
  • Explicit parallelism in CPU usage
  • Lots of disk storage with limited bandwidth

12
Conclusion
The real reason to reduce MTTR is to tolerate false
positives: recovery ≈ adaptation
  • "...Ultimately, these aspects of autonomic systems will
    be emergent properties of a general architecture, and
    distinctions will blur into a more general notion of
    self-maintenance." ("The Vision of Autonomic Computing")

13
Breakout sessions?
  1. James H: Reserve some resources to deal with problems
    (by filtering or pre-reservation)
  2. Joe H: How black is the black box? What "gray box"
    prior knowledge can you exploit (so you don't ignore the
    obvious)?
  3. Joe H: Human role - humans can make statements about
    how the system should act, so it doesn't have to be
    completely hands-off training. Similarly, during
    training, a human can give feedback about which
    anomalies are actually relevant (labeling).
  4. Lakshmi: What kinds of apps is this intended to apply
    to? Where do ROC-like and OASIS-like apps differ?
  5. Mary Baker: People can learn to game the system, so
    randomness can be your friend. If behaviors have a small
    number of modes, you just have to look for behaviors in
    the valleys.

14
Breakouts
  1. 19 - Golden nuggets to guide architecture, e.g.,
    persistent identifiers for path-based analysis...what
    else?
  2. 8 - Act: what safe, fast, predictable behaviors of the
    system should we expose (other than, e.g., rebooting)?
    Esp. those that contribute to security as well as
    dependability?
  3. 11 - Architectures for different types of stateful
    systems: what kinds of persistent/semi-persistent state
    need to be factored out of apps, and how to store it
    (interfaces, etc.)
  4. 20 - Given your goal of generic techniques for
    distributed systems, how will you know when you've
    succeeded / how do you validate the techniques? (What
    are the "proof points" you can hand to others to
    convince them you've succeeded, including but not
    limited to metrics?) Aaron/Dave: Metrics - how do you
    know you're observing the right things? What benchmarks
    will be needed?

15
Open Mic
  • James Hamilton - The Security Economy

16
Conclusion
The real reason to reduce MTTR is to tolerate false
positives: recovery ≈ adaptation
  • Toward new science in autonomic computing
  • "...Ultimately, these aspects of autonomic systems will
    be emergent properties of a general architecture, and
    distinctions will blur into a more general notion of
    self-maintenance." ("The Vision of Autonomic Computing")

17
Autonomic Technology Trends
  • CPU speed increases are slowing down; need more explicit
    parallelism
  • Use extra CPU to collect and locally analyze data;
    exploit temporal locality
  • Disk space is free (though bandwidth and
    disaster-recovery aren't)
  • Can keep history: parallel as well as historical models
    for regression analysis, trending, etc.
  • VMs being used as the unit of software distribution
  • Fault isolation
  • Opportunity for nonintrusive observation
  • Action that is independent of the hosted app

18
Data collection & monitoring
  • Component frameworks allow non-intrusive data collection
    without modifying the applications
  • Inter-EJB calls go through a runtime-managed level of
    indirection (a toy interposition wrapper is sketched
    after this list)
  • Slightly coarser grain of analysis; restrictions on
    legal paths make it more likely we can spot anomalies
  • Aspect-oriented programming allows further monitoring
    without perturbing application logic
  • Virtual machine monitors provide additional observation
    points
  • Already used by ASPs for load balancing, app migration,
    etc.
  • Transparent to applications and hosted OSs
  • Likely to become the unit of software distribution
    (intra- and inter-cluster)
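
A toy illustration of the interposition idea. The decorator and the
component name are hypothetical; the real systems interpose in the
EJB runtime or via AOP, not like this. The point is only that when
every inter-component call passes through a framework-managed layer,
that layer can log call paths and latencies with no change to
application code.

    # Illustrative interposition layer; not the JAGR/Pinpoint instrumentation API.
    import functools, time

    CALL_LOG = []   # (timestamp, component, latency_s)

    def observed(component: str):
        """Framework-managed indirection: record every call to the wrapped component."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                start = time.time()
                try:
                    return fn(*args, **kwargs)
                finally:
                    CALL_LOG.append((start, component, time.time() - start))
            return inner
        return wrap

    @observed("CartEJB.addItem")        # hypothetical component name
    def add_item(cart, item):
        cart.append(item)

    add_item([], "book")
    print(CALL_LOG[-1][1])              # "CartEJB.addItem"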

19
Optimizing for Specialized State Types
  • Two single-key ("Berkeley DB") get/set state stores
  • Used for user session state, application workflow state,
    persistent user profiles, merchandise catalogs, ...
  • Replication to a set of N bricks provides durability
  • Write to a subset, wait for the subset, remember the
    subset (a toy quorum-write sketch follows this list)
  • DStore: state persists forever as long as ⌈N/2⌉ bricks
    survive
  • SSM: if the client loses its cookie, state is lost;
    otherwise it persists for time t with probability p,
    where (t, p) = F(N, node MTBF)
  • Recovery = restart; takes seconds or less
  • Efficacy doesn't depend on whether the replica is
    behaving correctly
  • SSM: node state not preserved (in-memory only)
  • DStore: node state preserved; read-repair fixes
    inconsistencies
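
A toy sketch of the "write to a subset, wait for the subset,
remember the subset" rule. In-memory dicts stand in for bricks and
the function names are mine; the real SSM/DStore protocols differ in
detail. The returned write-set is what the client remembers, and any
single surviving member of it can serve the value.

    # Toy quorum-write sketch; bricks are plain dicts, not real network nodes.
    import random
    from typing import Dict, List

    Brick = Dict[str, str]

    def quorum_write(key: str, value: str, bricks: Dict[str, Brick], w: int) -> List[str]:
        """Write to a random subset, stop after w acks, return ("remember") the subset."""
        write_set: List[str] = []
        for name in random.sample(sorted(bricks), k=len(bricks)):
            bricks[name][key] = value       # stand-in for an RPC that gets acked
            write_set.append(name)
            if len(write_set) >= w:         # don't wait for stragglers
                break
        return write_set                    # e.g. carried in the user's cookie (SSM)

    def read_any(key: str, write_set: List[str], bricks: Dict[str, Brick]) -> str:
        """Any replica in the write-set is as good as any other; one survivor suffices."""
        for name in write_set:
            if name in bricks and key in bricks[name]:
                return bricks[name][key]
        raise KeyError(key)

    cluster = {f"b{i}": {} for i in range(5)}
    ws = quorum_write("session:42", "cart=[book]", cluster, w=3)
    del cluster[ws[0]]                      # lose one brick from the write-set
    print(read_any("session:42", ws, cluster))   # still served by a survivor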

20
Detection & recovery in SSM
  • 9 state statistics collected once per second from each
    brick
  • Tarzan time-series analysis: keep an N-length time
    series, discretize each data point
  • Count relative frequencies of all substrings of length k
    or shorter
  • Compare against peer bricks; reboot if at least 6 stats
    are anomalous; works for aperiodic or irregular-period
    signals (a simplified sketch follows this list)
  • Remember! We are not SLT/ML researchers!
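
A rough sketch of the Tarzan-style check described on this slide.
The discretization, window handling, and scoring below are
simplified stand-ins chosen for illustration; the published Tarzan
algorithm and SSM's "6 anomalous stats" rule are more involved.
Discretize each metric's time series into symbols, count relative
frequencies of substrings up to length k, and compare a brick's
frequencies against its peers'.

    # Simplified Tarzan-style sketch: discretize, count substring frequencies,
    # compare a brick against its peers. Details are illustrative only.
    from collections import Counter

    def discretize(series, n_bins=4):
        lo, hi = min(series), max(series)
        step = (hi - lo) / n_bins or 1.0
        return "".join(chr(ord("a") + min(n_bins - 1, int((x - lo) / step)))
                       for x in series)

    def substring_freqs(symbols, k=3):
        counts = Counter(symbols[i:i + length]
                         for length in range(1, k + 1)
                         for i in range(len(symbols) - length + 1))
        total = sum(counts.values())
        return {s: n / total for s, n in counts.items()}

    def anomaly_score(series, peer_series, k=3):
        mine = substring_freqs(discretize(series), k)
        peers = substring_freqs(discretize([x for p in peer_series for x in p]), k)
        return sum(abs(mine.get(s, 0.0) - peers.get(s, 0.0))
                   for s in set(mine) | set(peers))

    normal = [10, 11, 10, 12, 11, 10, 11, 12]
    stuck  = [10, 11, 30, 30, 30, 30, 30, 30]      # e.g. a fail-stuttering brick
    print(anomaly_score(stuck,  [normal, normal]) >
          anomaly_score(normal, [normal, normal]))  # True: the stuck brick stands out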

21
Detection & recovery in DStore
  • Metrics and algorithm comparable to those used in SSM
  • We inject fail-stutter behavior by increasing request
    latency
  • Bottom case: more aggressive detection also results in 2
    unnecessary reboots
  • But they don't matter much
  • Currently some voodoo constants for thresholds in
    both SSM and DStore
  • Trade-off of fast detection vs. false positives

22
What faults does this handle?
  • Substantially all non-Byzantine faults we
    injected
  • Node crash, hang/timeout/freeze
  • Fail-stutter; network loss (drop up to 70% of packets
    randomly)
  • Periodic slowdown (e.g., from garbage collection)
  • Persistent slowdown (one node lags the others)
  • Underlying (weak) assumption: most bricks are doing
    mostly the right thing most of the time
  • All anomalies can be safely coerced to crash faults
  • If that turned out to be the wrong thing, it didn't cost
    you much to try it
  • Human notified after a threshold number of restarts
  • These systems are always recovering

23
Path-based analysis & Microreboots
  • Pinpoint captures execution paths through EJBs as
    dynamic call trees (intra-method calls hidden)
  • Build a probabilistic context-free grammar from these
  • Detect trees that correspond to very low probability
    parses (a toy scoring sketch follows this list)
  • Respond by micro-rebooting (uRB) suspected-faulty EJBs
  • uRB takes 100s of msec, vs. whole-app restart (8-10 sec)
  • Component interaction analysis currently finds 55-75% of
    failures
  • Path-shape analysis detects >90% of failures but
    correctly localizes fewer
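
A toy version of the path-shape scoring idea. The component names
are invented, and scoring parent-to-children "productions" this way
is a deliberately simplified stand-in for Pinpoint's PCFG: learn how
often each component calls each set of children, score a new call
tree by those production probabilities, and nominate the components
of very improbable trees for a micro-reboot.

    # Toy stand-in for PCFG-based path-shape analysis; names are illustrative.
    import math
    from collections import Counter, defaultdict

    def productions(tree):
        """tree = (component, [subtrees]); yield (parent, tuple of child names)."""
        name, children = tree
        yield name, tuple(c[0] for c in children)
        for child in children:
            yield from productions(child)

    def train(trees):
        counts = defaultdict(Counter)
        for t in trees:
            for parent, kids in productions(t):
                counts[parent][kids] += 1
        return counts

    def log_prob(tree, counts, floor=1e-4):
        lp = 0.0
        for parent, kids in productions(tree):
            total = sum(counts[parent].values()) or 1
            lp += math.log(max(counts[parent][kids] / total, floor))
        return lp

    normal = ("FrontEnd", [("CartEJB", [("DB", [])])])
    weird  = ("FrontEnd", [("CartEJB", []), ("CartEJB", [])])   # anomalous shape
    model = train([normal] * 50)
    for tree in (normal, weird):
        if log_prob(tree, model) < math.log(1e-3):      # very low probability parse
            print("micro-reboot candidates:",
                  sorted({p for p, _ in productions(tree)}))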

24
Crash-Only Design Lessons from SSM
  • Eliminate coupling
  • No dependence on any specific brick, just on a
    subset of minimum size -- even at the granularity
    of individual requests
  • Not even across phases of an operation: single-phase
    nonblocking ops only ⇒ predictable amount of work per
    request
  • Use randomness to avoid deterministic worst cases and
    hotspots
  • We initially violated this guideline by using an
    off-the-shelf JMS implementation that was centralized
  • Make parts interchangeable
  • Any replica in a write-set is as good as any other
  • Unlike erasure coding, only need 1 replica to survive
  • Cost is higher storage overhead, but we're willing to
    pay that to get the self-* properties

25
Enterprise Service Workloads
  Observation | Consequence
  1. Internet service workloads consist of large numbers of independent users | Large number of independent samples gives a basis for the success of statistical techniques
  2. Even a flaky service is doing mostly the right thing most of the time | Steady-state behavior can be extracted from normal operation
  3. Heavy traffic volume means most of the service is exercised in a relatively short time | Baseline model can be learned rapidly and updated in place periodically
  ⇒ We can continuously extract models from the production
  system, orthogonally to the application
26
Building models through measurement
  • Finding bugs using distributed assertion sampling
    (Liblit et al., 2003)
  • Instrument source code with assertions on pairs of
    variables (features)
  • Use sampling so that any given run of the program
    exercises only a few assertions (to limit performance
    impact)
  • Use a classification algorithm to identify which
    features are most predictive of faults (observed program
    crashes); a toy sketch follows this list
  • Goal: bug finding
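
A toy illustration of the sampling-plus-classification idea. This is
not Liblit et al.'s actual instrumentation or statistical model; the
predicates, crash rule, and scoring are invented for the example.
Each run records only a sampled subset of cheap predicates over
program variables, and a simple score then ranks which predicates
best predict the observed crashes.

    # Toy sketch of distributed assertion sampling; predicates and scoring invented.
    import random
    from collections import Counter

    random.seed(0)

    PREDICATES = {                  # hypothetical instrumented predicates
        "idx >= length": lambda s: s["idx"] >= s["length"],  # tracks the injected bug
        "x > y":         lambda s: s["x"] > s["y"],          # incidental (always true here)
    }

    def sample_run(state, crashed, rate=0.5):
        """One program run: record a sampled subset of predicate outcomes only."""
        obs = {name: pred(state)
               for name, pred in PREDICATES.items() if random.random() < rate}
        return obs, crashed

    def rank(runs):
        """Score each predicate by P(crash | predicate observed true)."""
        seen, crashed_when_true = Counter(), Counter()
        for obs, crashed in runs:
            for name, value in obs.items():
                if value:
                    seen[name] += 1
                    crashed_when_true[name] += crashed
        return sorted(((crashed_when_true[n] / seen[n], n) for n in seen), reverse=True)

    runs = [sample_run({"idx": i, "length": 10, "x": 1, "y": 0}, crashed=(i >= 10))
            for i in range(20)]
    print(rank(runs))   # the bug-tracking predicate should score highest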

27
JAGR: JBoss with Micro-reboots
  • Performability of RUBiS (goodput/sec vs. time)
  • Vanilla JBoss with manual restarting of the app server,
    vs. JAGR with automatic recovery and micro-rebooting
  • JAGR/RUBiS does 78% better than JBoss/RUBiS
  • Maintains 20 req/sec, even in the face of faults
  • Lower steady state after recovery in the first graph:
    class reloading, recompiling, etc., which is not
    necessary with micro-reboots
  • Also used to fix memory leaks without rebooting the
    whole appserver

28
Fast Recovery + Statistical Anomaly Detection = Self-*
  • Armando Fox and Emre Kiciman, Stanford University;
    Michael Jordan, Randy Katz, David Patterson, Ion Stoica,
    University of California, Berkeley
  • SoS Workshop, Bertinoro, Italy