Title: Intro
1. Intro: Overview of RADS goals
- Armando Fox & Dave Patterson
- CS 444A/CS 294-6, Stanford/UC Berkeley
- Fall 2004
2.
- Administrivia
- Course logistics, registration
- Project expectations and other deliverables
- Background and motivation for RADS
- ROC and its relationship to RADS
- Early case studies
- Discussion: projects, research directions, etc.
3. Administrivia/goals
- Stanford enrollment vs. Axess
- SLT and CT tutorial VHS/DVDs available to view
- SLT and CT Lab/assignments grading policy
- Stanford and Berkeley meeting/transportation logistics
- Format of course
4. Background and motivation for RADS
5. RADS in One Slide
- Philosophy of ROC: focus on lowering MTTR to improve overall availability
- ROC achievements: two levels of lowering MTTR
- Microrecovery: fine-grained, generic recovery techniques that recover only the failed part(s) of the system, at much lower cost than whole-system recovery
- Undo: sophisticated tools to help human operators selectively back out destructive actions/changes to a system
- General approach: use microrecovery as the first line of defense; when it fails, provide support to human operators to avoid having to "reinstall the world"
- RADS insight: we can combine cheap recovery with statistical anomaly detection techniques
6. Hence, (at least) two parts to RADS
- Investigating other microrecovery methods
- Investigating analysis techniques
- What to capture/represent in a model
- Addressing fundamental open challenges
- stability
- systematic misdiagnosis
- subversion by attackers
- etc.
- General insight: "different is bad"
- law-of-large-numbers arguments support this for large services
7. Why RADS?
- Motivation
- Five 9s of availability ⇒ about 5 down-minutes/year ⇒ must recover from (or mask) most failures without human intervention
- a principled way to design "self-*" systems
- Technology
- High-traffic, large-scale distributed/replicated services ⇒ large datasets
- Analysis is CPU-intensive ⇒ a way to trade extra CPU cycles for dependability
- Large logs/datasets for models ⇒ storage is cheap and getting cheaper
- RADS addresses a clear need while exploiting demonstrated technology trends
8. Cheap Recovery
9. Complex systems of black boxes
- "...our ability to analyze and predict the performance of the enormously complex software systems that lie at the core of our economy is painfully inadequate." (Choudhury & Weikum, 2000; PITAC Report)
- Networked services are too complex and rapidly changing to test exhaustively: collections of black boxes
- Weekly or biweekly code drops are not uncommon
- Market activities lead to integration of whole systems
- Need to get humans out of the loop for at least some monitoring/recovery loops
- hence the interest in "autonomic" approaches
- fast detection is often at odds with false alarms
10. Consequences
- Complexity breeds increased bug counts and bug impact
- Heisenbugs, race conditions, and environment-dependent, hard-to-reproduce bugs still account for the majority of SW bugs in live systems
- up to 80% of bugs found in production are those for which a fix is not yet available
- some application-level failures result in user-visible bad behavior before they are detected by site monitors
- Tellme Networks: up to 75% of downtime is detection (sometimes by user complaints), followed by localization
- Amazon, Yahoo: gross metrics track second-order effects of bugs, but lag the actual bug by minutes or tens of minutes
- Result: downtime and increased management costs
- A.P. Wood, "Software reliability from the customer view," IEEE Computer, Aug. 2003
11. Always adapting, always recovering
- Build statistical models of the acceptable operating envelope by measurement and analysis on the live system
- Control theory, statistical correlation, anomaly detection...
- Detect runtime deviations from the model
- typical tradeoff is between detection rate and false-positive rate
- Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep the system within its acceptable operating envelope
- invariant: attempting recovery won't make things worse
- makes inevitable false positives tolerable
- can then reduce false negatives by tuning algorithms to be more aggressive and/or deploying multiple detectors
- Systems that are "always adapting, always recovering"
12. Toward recovery management invariants
- Observation: instrumentation and analysis
- collect and analyze data from running systems
- rely on "most systems work most of the time" to automatically derive baseline models
- Analysis: detect and localize anomalous behavior
- Action: close the loop automatically with micro-recovery (a sketch of this loop follows below)
- Salubrious: returns some part of the system to a known state
- Reclaim resources (memory, DB connections, sockets, DHCP leases...), throw away corrupt transient state, set up to retry the operation if appropriate
- Safe: no effect on correctness, minimal effect on performance
- Localized: parts not being microrecovered aren't affected
- Fast recovery simplifies failure detection and recovery management.
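The observe/analyze/act cycle above is, at heart, a small control loop. Below is a minimal sketch in Python; collect_metrics, is_anomalous, and microreboot are hypothetical hooks standing in for the instrumentation, the automatically derived baseline model, and the recovery mechanism, not the actual RADS implementation.

```python
import time

def recovery_loop(components, collect_metrics, is_anomalous, microreboot,
                  period_s=1.0):
    """Observation -> analysis -> action loop for microrecovery.

    collect_metrics(c) gathers data from the running system (observation),
    is_anomalous(c, m) compares it against a baseline model (analysis),
    and microreboot(c) is the safe, localized recovery action.
    All three callables are hypothetical placeholders.
    """
    while True:
        for c in components:
            metrics = collect_metrics(c)   # observation: instrument the live system
            if is_anomalous(c, metrics):   # analysis: detect/localize anomalous behavior
                microreboot(c)             # action: cheap, localized recovery
        time.sleep(period_s)
```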
13. Non-goals/complementary work
- All of the following are being capably studied by others, and directly compose with our own efforts...
- Byzantine fault tolerance
- In-place repair of persistent data structures
- Hard real-time response guarantees
- Adding checkpointing to legacy, non-componentized applications
- Source-code bug finding
- Advancing the state of the art in SLT (analysis algorithms)
14. Outline
- Micro-recoverable systems
- Concept of microrecovery
- A microrecoverable application server and session state store
- Application-generic SLT-based failure detection
- Path and component analysis and localization for the appserver
- Simple time-series analyses for the purpose-built state store
- Combining SLT detection with microrecoverable systems
- Discussion, related work, implications, conclusions
15. Microrebooting: one kind of microrecovery
- 60% of software failures in the field are reboot-curable, even if the root cause is unknown... why?
- Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources
- re-establishes control flow in a predictable way (breaks deadlocks/livelocks, returns a thread or process to its start state)
- To avoid imperiling correctness, we must...
- Separate data recovery from process recovery
- Safeguard the data
- Reclaim resources with high confidence
- Goal: get the same benefits of rebooting, but at much finer grain (hence faster and less disruptive): microrebooting
- D. Oppenheimer et al., "Why do Internet services fail, and what can be done about it?", USITS 2003
16-20. Write example: Write to Many, Wait for Few
- Try to write to W random bricks (W = 4); must wait for WQ bricks to reply (WQ = 2)
- [Animation over five slides: the browser issues the write to bricks; once WQ have replied, it proceeds, and a browser cookie holds metadata recording which bricks (here 1 and 4) stored the value. A sketch of the write path follows below.]
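A minimal Python sketch of the write path, under stated assumptions: each brick object exposes a hypothetical put(key, value) -> bool, W and WQ match the slide (4 and 2), and the returned brick list is what the browser cookie would record. This is an illustration of the idea, not SSM's actual protocol code.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FutureTimeout

W, WQ = 4, 2   # write to W random bricks, wait for only WQ acknowledgements

def write_many_wait_few(bricks, key, value, timeout_s=1.0):
    """Issue the write to W random bricks and return as soon as WQ have acked.

    Slow or crashed bricks are simply not waited for; the caller's cookie
    records which bricks acked, so later reads know where to look.
    """
    targets = random.sample(bricks, W)
    pool = ThreadPoolExecutor(max_workers=W)
    futures = {pool.submit(b.put, key, value): b for b in targets}
    acked = []
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            if fut.exception() is None and fut.result():
                acked.append(futures[fut])
                if len(acked) >= WQ:
                    break                  # enough replies; ignore the stragglers
    except FutureTimeout:
        pass                               # remaining bricks were too slow
    pool.shutdown(wait=False)              # do not block on slow/crashed bricks
    if len(acked) < WQ:
        raise IOError("fewer than WQ bricks acknowledged the write")
    return acked                           # brick list becomes the cookie metadata
```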
21-24. Read example
- Try to read from Bricks 1 and 4 (the bricks recorded in the cookie)
- [Animation over four slides: Brick 1 crashes; the read is still satisfied by Brick 4.]
25. SSM Failure and Recovery
- Failure of a single node
- No data loss (WQ-1 copies remain)
- State is available for R/W during the failure
- Recovery
- Restart: no special-case recovery code
- State is available for R/W during brick restart
- Session state is self-recovering
- The user's access pattern causes the data to be rewritten
26. Backpressure and Admission Control
- [Figure: browser and Bricks 1-5; heavy flow to Brick 3 causes it to drop requests.]
27. Statistical Monitoring
- Per-brick statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
- [Figure: Bricks 1-5, each exporting these statistics.]
28. SSM Monitoring
- N replicated bricks handle read/write requests
- Cannot do structural anomaly detection!
- Alternative: features (performance, memory usage, etc.)
- Activity statistics: how often did a brick do something?
- Msgs received/sec, dropped/sec, etc.
- Same across all peers, assuming a balanced workload
- Use anomalies as likely failures
- State statistics: current state of the system
- Memory usage, queue length, etc.
- Similar pattern across peers, but may not be in phase
- Look for patterns in the time series; differences in patterns indicate failure at a node.
29. Detecting Anomalous Conditions
- Metrics compared against those of peer bricks
- Basic idea: changes in workload tend to affect all bricks equally
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- Anomaly in 6 or more (out of 9) metrics ⇒ reboot the brick
- Use different techniques for different stats
- Activity: absolute median deviation (sketch below)
- State: Tarzan time-series analysis
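A minimal sketch of the peer-comparison step for the activity statistics, under the slides' assumption that most bricks behave alike; the deviation threshold and the 6-of-9 quorum below are placeholders, not SSM's tuned constants.

```python
from statistics import median

def anomalous_bricks(metric_by_brick, threshold=3.0):
    """Flag bricks whose value for one activity statistic deviates from the
    peer median by more than `threshold` median absolute deviations."""
    values = list(metric_by_brick.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9   # avoid divide-by-zero
    return {brick for brick, v in metric_by_brick.items()
            if abs(v - med) / mad > threshold}

def should_reboot(brick, per_metric_anomalies, quorum=6):
    """Slide's rule: reboot a brick if it is anomalous in >= 6 of the 9 metrics."""
    return sum(brick in flagged for flagged in per_metric_anomalies) >= quorum
```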
30. Network fault: 70% packet loss in the SAN
- [Timeline: network fault injected → fault detected and brick killed → brick restarts.]
31. J2EE as a platform for uRB-based recovery
- Java 2 Enterprise Edition: a component framework for Internet request-reply style apps
- An app is a collection of components (EJBs) created by subclassing a managed container class
- the application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc.
- Web pages with embedded servlets and Java Server Pages invoke EJB methods
- potential to improve all apps by modifying the appserver
- J2EE has a strong following, encourages modular programming, and there are open-source appservers
32. Separating data recovery from process recovery
- For HTTP workloads, session state ≈ app checkpoint
- Store session state in a microrebootable session-state subsystem (NSDI '04)
- Recovery = non-state-preserving process restart; redundancy gives probabilistic durability
- Response-time cost of externalizing session state: 25%
- SSM, an N-way RAM-based state replication system [NSDI '04], sits behind the existing J2EE API
- Microreboot of an EJB:
- destroys all instances of the EJB and associated threads
- releases appserver-level resources (DB connections, etc.)
- discards appserver metadata about the EJB
- session state is preserved across the uRB
33. JBoss + uRBs + SSM: fault injection
- Fault injection: null references, deadlocks/infinite loops, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exceptions
- RUBiS online auction app (132K items, 1.5M bids, 100K subscribers); 150 simulated users/node, 35-45 req/sec/node; workload mix based on a commercial auction site
- Client-based failure detection
34. uRB vs. full RB: action-weighted goodput
- Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets
- Localization is crude: static analysis to associate a failed URL with a set of EJBs, incrementing an EJB's score whenever it is implicated
- With uRBs, 89% fewer failed requests and 9% more successful requests compared to full RB, despite 6 false positives
35. Performance overhead of JAGR
- 150 clients/node: latency 38 msec (3 → 7 nodes)
- Human-perceptible delay: 100-200 msec
- Real auction site: 41 req/sec, 33-300 msec latency
36. Improving availability from the user's point of view
- uRB improves user-perceived availability vs. full reboot
- uRB complements failover
- (a) Initially, excess load on the 2nd node brought it down immediately after failover
- (b) uRB results in some failed requests (96% fewer) from temporary overload
- (c, d) Full reboot vs. uRB without failover
- For small clusters, should always try uRB first
37. uRB Tolerates Lax Failure Detection
- Tolerates lag in detection latency (up to 53 s in our microbenchmark) and high false-positive rates
- Our naive detection algorithm had up to a 60% false-positive rate in terms of what to uRB
- we injected 97 false positives before the reduction in overall availability equaled the cost of a full RB
- Always safe to use as the first line of defense, even when failover is possible
- cost(uRB + other recovery) ≈ cost(other recovery)
- success rate of uRB on reboot-curable failures is comparable to a whole-appserver reboot
38. Performance penalties
- Baseline: workload mix modeled on a commercial site
- 150 simulated clients per node, 40-45 reqs/sec per node
- system at 70% utilization
- Throughput 1% worse due to instrumentation
- worst-case response latency increases from 800 to 1200 ms
- Average case: 45 ms to 80 ms; compare to 35-300 ms for a commercial service
- Well within human tolerance thresholds
- Entirely due to factoring out session state
- The performance penalty is tolerable and worth it
39. Recovery and maintenance
40. Microrecovery for Maintenance Operations
- Capacity discovery in SSM
- TCP-inspired flow control keeps the system from falling off a cliff (sketch below)
- "OK to say no" is essential for this backpressure to work
- Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks)
- Splitting/coalescing in Dstore
- Split handled as failure plus reappearance of the failed node
- The same safe, non-disruptive recovery mechanisms are used to lazily repair inconsistencies after the new node appears
- Consequently, the performance impact is small enough to do this as an online operation
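The "TCP-inspired flow control" mentioned above is essentially additive-increase/multiplicative-decrease on an admission window. A minimal sketch, with hypothetical constants rather than SSM's actual controller:

```python
class AdmissionWindow:
    """AIMD admission control: grow the number of admitted in-flight requests
    slowly, shrink it sharply whenever a brick says "no" or drops a request."""

    def __init__(self, start=4.0, minimum=1.0, add=1.0, mult=0.5):
        self.window = start     # how many requests may be in flight
        self.minimum = minimum
        self.add = add          # additive increase per accepted request
        self.mult = mult        # multiplicative decrease on backpressure
        self.in_flight = 0

    def can_send(self):
        return self.in_flight < self.window

    def on_send(self):
        self.in_flight += 1

    def on_accept(self):
        self.in_flight -= 1
        self.window += self.add

    def on_refusal(self):       # backpressure: the brick said "no"
        self.in_flight -= 1
        self.window = max(self.minimum, self.window * self.mult)
```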
41. Using microrecovery for maintenance
- Capacity discovery in SSM
- the redundancy mechanism used for recovery (write many, wait few) is also used to "say no" while gracefully degrading performance
42. Full rejuvenation vs. microrejuvenation
43. Splitting/coalescing in Dstore
- Splitting/coalescing in Dstore
- Split handled as failure plus reappearance of the failed node
- The same mechanisms are used to lazily repair inconsistencies
44. Summary: microrecoverable systems
- Separation of data recovery from process recovery
- Special-purpose data stores can be made microrecoverable
- OK to initiate microrecovery anytime, for any reason
- no loss of correctness, tolerable loss of performance
- likely (but not guaranteed) to fix an important class of transient failures
- won't make things worse; can always try full recovery afterward
- inexpensive enough to tolerate sloppy fault detection
- low-cost first line of defense
- some maintenance ops can be cast as microrecovery
- due to low cost, proactive maintenance can be done online
- can often convert unplanned long downtime into a planned, shorter performance hit
45. Anomaly detection as failure detection
46. Example Anomaly-Finding Techniques
- Question: does anomaly = bug?
- Includes design-time and build-time techniques
- Includes both offline (invasive) and online detection techniques
47. Examples of Badness Inference
- Sometimes we can detect badness by looking for inconsistencies in runtime behavior
- We can observe program-specific properties (though using automated methods) as well as program-generic properties
- Often, we must first be able to observe the program operating normally
- Eraser: detecting data races [Savage et al. 2000]
- Observe lock/unlock patterns around shared variables
- If a variable usually protected by a lock/unlock or mutex is observed to have interleaved reads, report a violation
- DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002] (sketch below)
- Start with a strict invariant ("x is always 3")
- Relax it as other values are seen ("x is in [0,10]")
- Increase confidence in the invariant as more observations are seen
- Report violations of invariants that have above-threshold confidence
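A minimal sketch of the DIDUCE-style loop for one numeric program point: start strict, relax as new values arrive, and report a violation only once the invariant has earned enough confidence. The range representation and confidence rule are simplified placeholders, not DIDUCE's actual hypothesis encoding.

```python
class RelaxingInvariant:
    """Tracks an invariant of the form 'x is always in [lo, hi]' for one
    observation point, in the spirit of DIDUCE."""

    def __init__(self, min_confidence=1000):
        self.lo = self.hi = None
        self.consistent = 0                   # observations since last relaxation
        self.min_confidence = min_confidence  # samples needed before we trust it

    def observe(self, x):
        """Returns True iff x violates a sufficiently confident invariant."""
        if self.lo is None:                   # first value: "x is always this"
            self.lo = self.hi = x
            self.consistent = 1
            return False
        if self.lo <= x <= self.hi:
            self.consistent += 1              # consistent observation: confidence grows
            return False
        violation = self.consistent >= self.min_confidence
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)   # relax the invariant
        self.consistent = 0
        return violation
```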
48. Generic runtime monitoring techniques
- What conditions are we monitoring for?
- Fail-stop vs. fail-silent vs. fail-stutter
- Byzantine failures
- Generic methods
- Heartbeats (what does loss of a heartbeat mean? Who monitors them?)
- Resource monitoring (what is "abnormal"?)
- Application-specific monitoring: ask a question you know the answer to
- Fault-model enforcement
- coerce all observed faults to an expected-faults subset
- if necessary, take additional actions to completely induce the fault
- Simplifies recovery, since there are fewer distinct cases
- Avoids potential misdiagnosis of faults that have common symptoms
- Note: may sometimes appear to make things worse (coerce a less-severe fault to a more-severe one)
- Doesn't exercise all parts of the system
49. Internet performance failure detection
- Various approaches, all of which exploit the law of large numbers and (sort of) the Central Limit Theorem (which is?)
- Establish a baseline of the quantity to be monitored
- Take observations; factor out data from known failures
- Normalize to workload?
- Look for significant deviations from the baseline
- What to measure?
- Coarse grain: number of reqs/sec
- Finer grain: number of TCP connections in Established, Syn_sent, Syn_rcvd states
- Even finer: additional internal request milestones
- Hard to do in an application-generic way... but frameworks can save us
50. Example 1: Detection and recovery in SSM
- 9 state statistics collected per second from each replica
- Tarzan time-series analysis compares relative frequencies of substrings of the discretized time series (sketch below)
- a brick is flagged as anomalous when at least 6 stats are anomalous
- works for aperiodic or irregular-period signals
- robust against workload changes that affect all replicas equally, and against highly correlated metrics
- Keogh et al., "Finding surprising patterns in a time series database in linear time and space," SIGKDD 2002
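A minimal sketch of the Tarzan-style comparison for one statistic: discretize the time series into a symbol string and compare relative substring frequencies against a reference string (history, or a well-behaved peer). The binning, substring length, and scoring below are simplified placeholders, not the algorithm of Keogh et al.

```python
from collections import Counter

def discretize(series, n_bins=4):
    """Map a numeric time series to symbols 'a', 'b', ... using equal-width bins."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0
    return "".join(chr(ord('a') + min(n_bins - 1, int((x - lo) / width)))
                   for x in series)

def substring_freqs(symbols, k=3):
    counts = Counter(symbols[i:i + k] for i in range(len(symbols) - k + 1))
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def surprise(observed, reference, k=3):
    """Total difference in relative substring frequencies; large = anomalous."""
    obs, ref = substring_freqs(observed, k), substring_freqs(reference, k)
    return sum(abs(obs.get(w, 0.0) - ref.get(w, 0.0)) for w in set(obs) | set(ref))
```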
51. What faults does this handle?
- Essentially 100% availability vs. injected faults
- Node crash/hang/timeout/freeze
- Fail-stutter: network loss (drop up to 70% of packets randomly)
- Periodic slowdown (e.g., from garbage collection)
- Persistent slowdown (one node lags the others)
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- All anomalies can be safely coerced to crash faults
- If a reboot doesn't fix it, it didn't cost much to try
- Human notified after a threshold number of restarts; the system has no concept of "recovery"
- Allows SSM to be managed like a farm of stateless servers
52. Detecting anomalies in application logic
- Goal: detect failures whose only obvious symptom is a change in the semantics of the application
- Example: wrong item data displayed; wouldn't be caught by HTML scraping or HTTP logs
- Typically, the site responds to HTTP pings, etc. under such failures
- These commonly result from exceptions of the form we injected into RUBiS
- Insight: manifestation of bugs is the rare case, so capture normal behavior of the system under no fault injection
- Then detect threshold deviations from this baseline
- Periodically move the baseline to allow for workload evolution
53. Patterns: Path-shape analysis
- Model paths as parse trees in a probabilistic CFG
- Build the grammar under believed-normal conditions, then mark very unlikely paths as anomalous (simplified sketch below)
- after classification, build a decision tree to correlate path features (components touched) with anomalous paths
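Pinpoint's model is a probabilistic context-free grammar over path parse trees; the sketch below substitutes a simpler first-order transition model to show the same learn-then-score idea (train on believed-normal paths, flag very unlikely ones). The decision-tree localization step is omitted.

```python
import math
from collections import defaultdict, Counter

class PathShapeModel:
    """Learn component-to-component transition frequencies from paths observed
    under believed-normal conditions, then score new paths; very unlikely
    paths are marked anomalous. (Simplified stand-in for the PCFG model.)"""

    def __init__(self):
        self.counts = defaultdict(Counter)    # component -> Counter of successors

    def train(self, path):
        # path: list of component names touched by one request, in order
        for a, b in zip(["<start>"] + path, path + ["<end>"]):
            self.counts[a][b] += 1

    def avg_log_likelihood(self, path, smoothing=1e-6):
        transitions = list(zip(["<start>"] + path, path + ["<end>"]))
        score = 0.0
        for a, b in transitions:
            row = self.counts.get(a, Counter())
            # add-one denominator so transitions never seen in training score very low
            score += math.log((row.get(b, 0) + smoothing) / (sum(row.values()) + 1.0))
        return score / len(transitions)

    def is_anomalous(self, path, threshold=-5.0):
        return self.avg_log_likelihood(path) < threshold
```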
54. Patterns: Component interaction analysis
- Model interactions between a component and its n neighbors in the dynamic call graph as a weighted DAG
- compare to the observed call graph using chi-squared goodness-of-fit (sketch below)
- can compare either across peers or against historical data
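A minimal sketch of the goodness-of-fit test for one component: compare the observed distribution of its calls to each neighbor against the expected weights (historical, or averaged across peers). The cutoff would come from the chi-squared distribution with (number of neighbors - 1) degrees of freedom; the constant below is a placeholder.

```python
def interaction_chi_squared(observed_counts, expected_fractions):
    """Chi-squared statistic between a component's observed interactions with
    its neighbors and the expected interaction weights.

    observed_counts:    {neighbor: calls observed in this window}
    expected_fractions: {neighbor: expected fraction of this component's calls}
    """
    n = sum(observed_counts.values())
    stat = 0.0
    for neighbor, frac in expected_fractions.items():
        expected = n * frac
        if expected > 0:
            diff = observed_counts.get(neighbor, 0) - expected
            stat += diff * diff / expected
    return stat

def component_is_anomalous(observed_counts, expected_fractions, cutoff=20.0):
    return interaction_chi_squared(observed_counts, expected_fractions) > cutoff
```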
55. Precision and recall (example)
- Detection: recall = fraction of failures actually detected as anomalies
- Strictly better than HTTP/HTML monitoring (detection recall, for faults affecting >1% of the workload)
- Localization (formulas below)
- recall: fraction of actually-faulty requests returned
- precision: fraction of returned requests that are faulty = 1 - (FP rate)
- Tradeoff between recall and precision (false-positive rate)
- Even the low-recall case corresponds to high detection recall (0.83)
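In symbols, using the slide's definitions (where the false-positive rate is the fraction of returned requests that are not actually faulty):

```latex
\text{recall} = \frac{|\text{returned} \cap \text{faulty}|}{|\text{faulty}|},
\qquad
\text{precision} = \frac{|\text{returned} \cap \text{faulty}|}{|\text{returned}|}
                 = 1 - \text{FP rate}
```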
56. Pinpoint: key results
- Detects 89-96% of injected failures, compared to 20-79% for HTML scraping and HTTP log monitoring
- Limited success in detecting injected source bugs
- Example success: caught a bug that prevented the shopping cart from iterating over its contents to display them, and correctly identified the at-fault component (where the bug was injected)
- Resilient to normal workload changes
- Because we bin the analysis by request category
- Resilient to bug-fix-release code changes
- Currently slow: analysis lags 20 s behind the application
57. Combining uRBs and Pinpoint
- Simple recovery policy (sketch below)
- uRB all components whose normalized anomaly score is > 1.0
- if we've already done that, reboot the whole application
- More sophisticated policies are certainly possible
58. Combining uRBs and Pinpoint
- Example: data-structure corruption in the SB_viewItem EJB
- 350 simulated clients
- 18.5 s to detect/localize
- <1 s to repair
- Note: the returned Web page would be valid but incorrect
- Robust to typical workload changes and bug patches
- More comprehensive deployment in progress
59-60. Faulty Request Identification
- HTTP monitoring has perfect precision, since it is a ground-truth indicator of a server fault
- Path-shape analysis pulls more points out of the bottom-left corner
61. Tolerating false positives in DStore
- Metrics and algorithm comparable to those used in SSM
- We inject fail-stutter behavior by increasing request latency
- Bottom case: more aggressive detection also results in 2 unnecessary reboots
- But these don't matter much if there is modest replication
- Currently some voodoo constants for thresholds in both SSM and DStore
- Recall that these are off-the-shelf algorithms; we should be able to do better
- Trade-off: earlier detection vs. false positives
62. Summary of case studies
- Detection and localization are good even with simple algorithms, and fit well with localized recovery
- Performance penalty is tolerable and worth it
- Note: microrecovery can also be used for microrejuvenation
63. Discussion
64. Discussion: What makes this work?
- What made it work in our examples specifically?
- Recovery speed: weaker consistency in SSM and DStore in exchange for fast recovery and predictable work done per request
- Recovery correctness: J2EE apps are constrained to checkpoint by manipulating session state, and this is brought out in the app-writer-visible APIs; good isolation between components and relative lack of shared state
- Anomaly detection: app behavior alternates short sequences of EJB calls with updates to persistent state, so it can be characterized in terms of those calls
- Observations
- Neither diagnosis ⇒ recovery nor recovery ⇒ diagnosis
- Localization ≠ diagnosis, but it is an important optimization
65. Why are statistical methods appealing?
- Large, complex systems tend to exercise a lot of their functionality in a fairly short amount of time
- Especially Internet services, with high-volume workloads of largely independent requests
- Even if we don't know what to measure, statistical and data-mining techniques can help figure it out
- Performance problems are often linked with dependability problems (fail-stutter behavior), for either HW or SW reasons
- Most systems work well most of the time
- Corollary: in a replicated system, replicas should behave the same most of the time
66. When does it not work?
- When SLT-based monitoring does not apply
- Base-rate fallacy: monitored events so rare that the false-positive rate dominates
- Gaming the system (deliberately or inadvertently)
- When failures can't be cured by any kind of micro-recovery
- Persistent-state corruption (or hardware failure)
- Corrupted configuration data
- a spectrum of "undo"
- When you can't "say no"
- Backpressure and the possibility of caller retry are used to improve predictability
- Promising you will "say yes" may be difficult... the question may be whether end-to-end guarantees are needed at lower layers
67. SSM/DStore as extreme design points
- The goal was to investigate the extreme of "no special recovery code"
- Could explore erasure coding (RepStore does this dynamically)
- Weakened consistency model of DStore vs. 2PC
- Spread the cost of repair lazily across many operations (rather than bulk recovery)
- Spread some 2PC state maintenance to the client in the form of a "write in progress" cookie
- 2PC may well be affordable, but we were interested in the extreme design point of no special restart code
68. Role of the 3-tier architecture
- Separation of concerns; really, separation of process recovery (control flow) from data recovery
- uRB and reboots recover processes; SSM, DStore, and traditional relational databases recover data
- Not addressed: repair of data
69. Shouldn't we just make software better?
- Yes, we should (and many people are), but...
- We use commodity HW and SW despite the fact that they are imperfect, less reliable than hardened or purpose-built components, etc. Why?
- Price/performance follows volume
- Allows specialization of effort and composition of reusable building blocks (vs. building a stovepipe system)
- In short, it allows a much faster overall pace of innovation and deployment, for both technical and economic reasons, even though the components themselves are imperfect
- We should assume "commodity programmers" too (observation from Brewster Kahle)
- Give as much generic support to the application as we can
70. Challenges and open issues
- Algorithm issues that impinge on systems work
- Hand-tuned constants/thresholds in the algorithms (seems to be an issue in other applications of SLT as well)
- Online vs. offline algorithms
- Stability of the closed loop
- Systems issues
- How do you know you've checkpointed all the important state, or that something is safe to retry?
- How do you debug a moving target? Traditional methods/tools are confounded by code obfuscation, sudden loss of transient program state (stack and heap), etc. (a great PhD thesis...)
- debugging today's real systems is already hard for these reasons
- Real apps, faultloads, best practices, etc. are hard to get!
71. RADS message in a nutshell
Statistical techniques can identify interesting features and relationships in large datasets, but there is a frequent tradeoff between detection rate (or detection time) and false positives.
Make micro-recovery so inexpensive that occasional false positives don't matter.
- Achievable now on realistic applications and workloads
- Synergistic with componentized apps and frameworks
- Specific point of leverage for collaboration with machine learning research; lots of headroom for improvement
- Even simple algorithms show encouraging initial results
72. Project possibilities
73. BACKUP SLIDES