1 Fast Recovery + Statistical Anomaly Detection = Self-*
- RADS/KATZ CATS Panel
- June 2004 ROC Retreat
2 Outline
- Motivation: approach complex systems of black boxes
- Measurements that respect black boxes
- Box-level micro-recovery cheap enough to survive false positives
- Differences from related efforts
- Early case studies
- Research agenda
3 Complex Systems of Black Boxes
- "...our ability to analyze and predict the performance of the enormously complex software systems that lie at the core of our economy is painfully inadequate." (Choudhury & Weikum, 2000 PITAC Report)
- Build model of acceptable operating envelope by measurement and analysis
  - Control theory, statistical correlation, anomaly detection...
- Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep the system in its acceptable operating envelope
  - Increase the size of the DB connection pool (Hellerstein et al.)
  - Reallocate one or more whole machines (Lassettre et al.)
  - Rejuvenate/reboot one or more machines (Trivedi, Fox, others)
  - Shoot one of the blocked txns (everyone)
  - Induce memory pressure on other apps (Waldspurger et al.)
4 Differences from some existing problems
- Intrusion detection (Hofmeyr et al. 98, others)
  - Detections must be actionable in a way that is likely to improve the system (sacrificing availability for safety is unacceptable)
- Bug finding via anomaly detection (Engler, others)
  - Human-level monitoring/verification of detections is not feasible, due to the number of observations and short timescales for reaction
  - Can separate recovery from diagnosis/repair (don't always need to know root cause to recover)
- Modeling/predicting SLO violations (Hellerstein, Goldszmidt, others)
  - Labeled training set not necessarily available
5 Many other examples, but the point is...
- Statistical techniques identify interesting features and relationships from large datasets, but there is a frequent tradeoff between detection rate (or detection time) and false positives
- Make micro-recovery so inexpensive that occasional false positives don't matter
- Granularity of the black box should match the granularity of available external control mechanisms
6 Micro-recovery to survive false positives
- Goal: provide recovery management invariants
- Salubrious: returns some part of the system to a known state
  - Reclaim resources (memory, DB conns, sockets, DHCP lease...)
  - Throw away corrupt transient state
  - Possibly set up to retry the operation, if appropriate
- Safe: affects only performance, not correctness
- Non-disruptive: performance impact is small
- Predictable: impact and time-to-complete are stable
- Observe, Analyze, Act: not recovery, but continuous adaptation
7 Crash-Only Building Blocks
Subsystem | Control point | How realized | Statistical monitoring:
- SSM (diskless session state store) [NSDI 04] | whole-node fast reboot (doesn't preserve state) | quorum-like redundancy, relaxed consistency, repair cost spread over many operations | time series of state metrics (Tarzan)
- DStore (persistent hashtable) [in preparation] | whole-node reboot (preserves state) | quorum-like redundancy, relaxed consistency, repair cost spread over many operations | time series of state metrics (Tarzan)
- JAGR (J2EE application server) [AMS 2003, in prep.] | microreboots of EJBs | modify appserver to undeploy/redeploy EJBs and stall pending reqs | anomalous code paths and component interactions (probabilistic context-free grammar)
- Control points are safe, predictable, non-disruptive
- Crash-only design: shutdown = crash, recover = restart
- Makes state-management subsystems as easy to manage as stateless Web servers
8 Example: Managing DStore and SSM
- Rebooting is the only control mechanism
  - Has predictable effect and takes predictable time, regardless of what the process is doing
  - Like kill -9, turning off a VM, or pulling the power cord
  - Intuition: the infrastructure supporting the power switch is simpler than the applications using it
- Due to the slight overprovisioning inherent in replication, rebooting can have minimal effect on throughput and latency
  - Relaxed consistency guarantees allow this to work
- Activity and state statistics collected per brick every second; any deviation → reboot the brick
  - Makes it as easy as managing a stateless server farm
- Backpressure at many design points prevents saturation
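The per-brick peer comparison above can be sketched roughly as follows (a hypothetical illustration, not the authors' implementation): each brick reports a statistic, and any brick whose value deviates far from its peers' median is selected for reboot.

```python
# Sketch: flag bricks whose activity metric deviates from the peer median
# by more than `threshold` times the median absolute deviation (MAD).
from statistics import median

def bricks_to_reboot(stats, threshold=3.0):
    """stats: dict of brick id -> latest metric value."""
    values = list(stats.values())
    med = median(values)
    # MAD gives a robust scale estimate; avoid division by zero
    mad = median(abs(v - med) for v in values) or 1e-9
    return [b for b, v in stats.items() if abs(v - med) / mad > threshold]

# One brick lagging far behind its peers is flagged; the rest are left alone.
print(bricks_to_reboot({"b1": 10.1, "b2": 9.8, "b3": 10.0, "b4": 42.0}))  # ['b4']
```

Because reboots are cheap and safe here, an occasional false flag from this kind of crude test costs little.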
9 Design Lessons Learned So Far
- A spectrum of cleaning operations (Eric Anderson, HP Labs)
  - Consequence: as t → ∞, all problems will converge to repair of corrupted persistent data
- Trade unnecessary consistency for faster recovery
  - Spread recovery actions out incrementally/lazily (read repair) rather than doing it all at once (log replay)
  - Gives predictable return-to-service time and acceptable variation in performance after recovery
  - Keeps data available for reads and writes throughout recovery
- Use single-phase ops to avoid coupling/locking and the issues they raise, and justify the cost in consistency
- It's OK to say no (backpressure)
  - Several places our design got it wrong in SSM
  - But even those mistakes could have been worked around by guard timers
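The read-repair idea mentioned above can be sketched in a few lines (an illustrative toy, assuming timestamped replicas): each read returns the newest version and lazily pushes it to stale or empty replicas, so repair cost is spread over many operations instead of one log replay.

```python
# Sketch: lazy read repair over a list of replica dicts,
# each mapping key -> (timestamp, value).
def read_repair(replicas, key):
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None
    ts, value = max(versions, key=lambda v: v[0])   # newest version wins
    for r in replicas:
        if r.get(key, (None,))[0] != ts:            # stale or missing copy
            r[key] = (ts, value)                    # repair it in place
    return value

r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}
r3 = {}                                # freshly rebooted brick, empty state
print(read_repair([r1, r2, r3], "k"))  # "new"; all replicas now agree
```

Note how the rebooted replica r3 is repopulated as a side effect of ordinary reads, keeping the data available throughout.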
10 Potential Limitations and Challenges
- Hard failures
- Configuration failures
  - Although a similar approach has been used to troubleshoot those
- Corruption of persistent state
  - Data structure repair work (Rinard et al.) may be combinable with automatic inference (Lam et al.)
- Challenges:
  - Stability and the autopilot problem
  - The base-rate fallacy
  - Multilevel learning
  - Online implementations of SLT techniques
  - Nonintrusive data collection and storage
11 An Architecture for Observe, Analyze, Act
- Separates systems concerns from algorithm development
- Programmable network elements extend the approach to other layers
- Consistent with technology trends:
  - Explicit parallelism in CPU usage
  - Lots of disk storage with limited bandwidth
12 Conclusion
The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation
- "...Ultimately, these aspects of autonomic systems will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance." (The Vision of Autonomic Computing)
13 Breakout sessions?
- James H: Reserve some resources to deal with problems (by filtering or pre-reservation)
- Joe H: How black is the black box? What "gray box" prior knowledge can you exploit (so you don't ignore the obvious)?
- Joe H: Human role - a human can make statements about how the system should act, so training doesn't have to be completely hands-off. Similarly, during training, a human can give feedback about which anomalies are actually relevant (labeling).
- Lakshmi: What kinds of apps is this intended to apply to? Where do ROC-like and OASIS-like apps differ?
- Mary Baker: People can learn to game the system → randomness can be your friend. If behaviors have a small number of modes, just have to look for behaviors in the valleys
14 Breakouts
- 19 - Golden nuggets to guide architecture, e.g., persistent identifiers for path-based analysis... what else?
- 8 - Act: what safe, fast, predictable behaviors of the system should we expose (other than, e.g., rebooting)? Esp. those that contribute to security as well as dependability?
- 11 - Architectures for different types of stateful systems: what kinds of persistent/semi-persistent state need to be factored out of apps, and how to store it (interfaces, etc.)?
- 20 - Given your goal of generic techniques for distributed systems, how will you know when you've succeeded / how do you validate the techniques? (What are the "proof points" you can hand to others to convince them you've succeeded, including but not limited to metrics?)
- Aaron/Dave: Metrics - how do you know you're observing the right things? What benchmarks will be needed?
15 Open Mic
- James Hamilton - The Security Economy
16 Conclusion
The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation
- Toward new science in autonomic computing
- "...Ultimately, these aspects of autonomic systems will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance." (The Vision of Autonomic Computing)
17 Autonomic Technology Trends
- CPU speed increases are slowing down; need more explicit parallelism
  - Use extra CPU to collect and locally analyze data; exploit temporal locality
- Disk space is free (though bandwidth and disaster-recovery aren't)
  - Can keep history of parallel as well as historical models for regression analysis, trending, etc.
- VMs being used as the unit of software distribution
  - Fault isolation
  - Opportunity for nonintrusive observation
  - Action that is independent of the hosted app
18 Data collection and monitoring
- Component frameworks allow non-intrusive data collection without modifying the applications
  - Inter-EJB calls go through a runtime-managed level of indirection
  - Slightly coarser grain of analysis; restrictions on legal paths make it more likely we can spot anomalies
- Aspect-oriented programming allows further monitoring without perturbing application logic
- Virtual machine monitors provide additional observation points
  - Already used by ASPs for load balancing, app migration, etc.
  - Transparent to applications and hosted OSs
  - Likely to become the unit of software distribution (intra- and inter-cluster)
19 Optimizing for Specialized State Types
- Two single-key (Berkeley DB) get/set state stores
  - Used for user session state, application workflow state, persistent user profiles, merchandise catalogs, ...
- Replication to a set of N bricks provides durability
  - Write to a subset, wait for the subset, remember the subset
  - DStore: state persists forever as long as ⌈N/2⌉ bricks survive
  - SSM: if the client loses its cookie, state is lost; otherwise, it persists for time t with probability p, where (t, p) = F(N, node MTBF)
- Recovery = restart, takes seconds or less
  - Efficacy doesn't depend on whether the replica is behaving correctly
  - SSM: node state not preserved (in-memory only)
  - DStore: node state preserved; read-repair fixes it
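The "write to a subset, remember the subset" scheme above can be illustrated with a toy store (hypothetical names; not the SSM/DStore code): a put writes a timestamped value to a random write-set of bricks, and a get needs only one surviving replica from that set.

```python
# Sketch: quorum-like replication where any one replica in the
# remembered write-set suffices to recover the value.
import random

class ReplicatedStore:
    def __init__(self, n_bricks=5, write_set_size=3):
        self.bricks = [dict() for _ in range(n_bricks)]
        self.write_set_size = write_set_size

    def put(self, key, value):
        write_set = random.sample(range(len(self.bricks)), self.write_set_size)
        for b in write_set:
            self.bricks[b][key] = value
        return write_set               # e.g. carried in the client's cookie

    def get(self, key, write_set):
        # Any replica in the write-set is as good as any other.
        for b in write_set:
            if key in self.bricks[b]:
                return self.bricks[b][key]
        return None

store = ReplicatedStore()
ws = store.put("session42", {"cart": ["book"]})
store.bricks[ws[0]].clear()            # crash-reboot one replica (kill -9)
print(store.get("session42", ws))      # value survives via another replica
```

This is why rebooting a brick is safe and non-disruptive: losing one write-set member costs nothing as long as the others survive.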
20 Detection and recovery in SSM
- 9 state statistics collected once per second from each brick
- Tarzan time series analysis: keep an N-length time series, discretize each data point
  - Count relative frequencies of all substrings of length k or shorter
  - Compare against peer bricks; reboot if at least 6 stats are anomalous
  - Works for aperiodic or irregular-period signals
- Remember! We are not SLT/ML researchers!
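A stripped-down version of that substring-frequency idea (a sketch of the flavor of Tarzan-style analysis, not the actual algorithm) looks like this: discretize each brick's series into symbols, count all substrings up to length k, and score a brick by how much its substring profile diverges from a peer's.

```python
# Sketch: discretize, count short substrings, compare profiles.
from collections import Counter

def discretize(series):
    """Map each value to 'a'/'b'/'c' by thirds of the series' range."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    def sym(x):
        f = (x - lo) / span
        return "a" if f < 1/3 else ("b" if f < 2/3 else "c")
    return "".join(sym(x) for x in series)

def substring_counts(s, k=3):
    c = Counter()
    for n in range(1, k + 1):
        for i in range(len(s) - n + 1):
            c[s[i:i + n]] += 1
    return c

def divergence(series_a, series_b, k=3):
    ca = substring_counts(discretize(series_a), k)
    cb = substring_counts(discretize(series_b), k)
    return sum(abs(ca[w] - cb[w]) for w in set(ca) | set(cb))

normal = [10, 11, 10, 12, 11, 10, 11, 12]
stuck  = [10, 11, 30, 30, 30, 30, 11, 10]
print(divergence(normal, normal))               # 0: identical profiles
print(divergence(normal, stuck))                # large: peer looks anomalous
```

Because it compares symbol patterns rather than raw values, this kind of test tolerates aperiodic or irregular-period signals, as the slide notes.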
21 Detection and recovery in DStore
- Metrics and algorithm comparable to those used in SSM
- We inject fail-stutter behavior by increasing request latency
- Bottom case: more aggressive detection also results in 2 unnecessary reboots
  - But they don't matter much
- Currently some "voodoo constants" for thresholds in both SSM and DStore
  - Trade-off of fast detection vs. false positives
22 What faults does this handle?
- Substantially all non-Byzantine faults we injected
  - Node crash, hang/timeout/freeze
  - Fail-stutter: network loss (drop up to 70% of packets randomly)
  - Periodic slowdown (e.g., from garbage collection)
  - Persistent slowdown (one node lags the others)
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- All anomalies can be safely coerced to crash faults
  - If that turned out to be the wrong thing, it didn't cost you much to try it
  - Human notified after a threshold number of restarts
- These systems are always recovering
23 Path-based analysis and microreboots
- Pinpoint captures execution paths through EJBs as dynamic call trees (intra-method calls hidden)
- Build a probabilistic context-free grammar from these
- Detect trees that correspond to very-low-probability parses
- Respond by micro-rebooting (uRB) the suspected-faulty EJBs
  - uRB takes 100s of msecs, vs. whole-app restart (8-10 sec)
- Component interaction analysis currently finds 55-75% of failures
- Path shape analysis detects >90% of failures but correctly localizes fewer
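A toy version of scoring call trees (illustrative only; Pinpoint's grammar model is richer than this) learns the probability of each caller→callee "production" from observed trees, then flags trees whose joint probability is anomalously low.

```python
# Sketch: learn caller->callee production probabilities, score new trees.
from collections import Counter

def edges(tree):
    """tree: (component, [child trees]); yield caller->callee pairs."""
    name, children = tree
    for child in children:
        yield (name, child[0])
        yield from edges(child)

def train(trees):
    counts = Counter(e for t in trees for e in edges(t))
    total_by_parent = Counter()
    for (parent, _), n in counts.items():
        total_by_parent[parent] += n
    return {e: n / total_by_parent[e[0]] for e, n in counts.items()}

def score(tree, model, floor=1e-6):
    prob = 1.0
    for e in edges(tree):
        prob *= model.get(e, floor)   # unseen production -> very unlikely
    return prob

normal = ("Web", [("Cart", [("DB", [])])])
weird  = ("Web", [("Cart", [("Mailer", [])])])  # anomalous interaction
model = train([normal] * 100)
print(score(normal, model) > score(weird, model))  # True
```

A low-scoring tree would then trigger a micro-reboot of the components on the anomalous path; since uRBs cost hundreds of milliseconds, an occasional false flag is cheap.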
24 Crash-Only Design: Lessons from SSM
- Eliminate coupling
  - No dependence on any specific brick, just on a subset of minimum size -- even at the granularity of individual requests
  - Not even across phases of an operation: single-phase nonblocking ops only → predictable amount of work per request
- Use randomness to avoid deterministic worst cases and hotspots
  - We initially violated this guideline by using an off-the-shelf JMS implementation that was centralized
- Make parts interchangeable
  - Any replica in a write-set is as good as any other
  - Unlike erasure coding, only 1 replica needs to survive
  - Cost is higher storage overhead, but we're willing to pay that to get the self-* properties
25 Enterprise Service Workloads
Observation → Consequence:
- Internet service workloads consist of large numbers of independent users → large number of independent samples gives a basis for the success of statistical techniques
- Even a flaky service is doing mostly the right thing most of the time → steady-state behavior can be extracted from normal operation
- Heavy traffic volume means most of the service is exercised in a relatively short time → baseline model can be learned rapidly and updated in place periodically
- We can continuously extract models from the production system, orthogonally to the application
26 Building models through measurement
- Finding bugs using distributed assertion sampling (Liblit et al., 2003)
  - Instrument source code with assertions on pairs of variables ("features")
  - Use sampling so that any given run of the program exercises only a few assertions (to limit performance impact)
  - Use a classification algorithm to identify which features are most predictive of faults (observed program crashes)
  - Goal: bug finding
27 JAGR: JBoss with Micro-reboots
- Performability of RUBiS (goodput/sec vs. time)
  - Vanilla JBoss w/ manual restarting of the app-server, vs. JAGR w/ automatic recovery and micro-rebooting
- JAGR/RUBiS does 78% better than JBoss/RUBiS
  - Maintains 20 req/sec, even in the face of faults
- Lower steady-state after recovery in the first graph: class reloading, recompiling, etc., which is not necessary with micro-reboots
- Also used to fix memory leaks without rebooting the whole appserver
28 Fast Recovery + Statistical Anomaly Detection = Self-*
- Armando Fox and Emre Kiciman, Stanford University; Michael Jordan, Randy Katz, David Patterson, Ion Stoica, University of California, Berkeley
- SoS Workshop, Bertinoro, Italy