Title: Intro
1. Intro: Overview of RADS goals
- Armando Fox & Dave Patterson
- CS 444A/CS 294-6, Stanford/UC Berkeley
- Fall 2004
2.
- Administrivia
- Course logistics, registration
- Project expectations and other deliverables
- Background and motivation for RADS
- ROC and its relationship to RADS
- Early case studies
- Discussion: projects, research directions, etc.
3. Administrivia/goals
- Stanford enrollment vs. Axess
- SLT and CT tutorial VHS/DVDs available to view
- SLT and CT Lab/assignments grading policy
- Stanford and Berkeley meeting/transportation logistics
- Format of course
4. Background and motivation for RADS
5. RADS in One Slide
- Philosophy of ROC: focus on lowering MTTR to improve overall availability
- ROC achievements: two levels of lowering MTTR
- Microrecovery: fine-grained, generic recovery techniques that recover only the failed part(s) of the system, at much lower cost than whole-system recovery
- Undo: sophisticated tools to help human operators selectively back out destructive actions/changes to a system
- General approach: use microrecovery as the first line of defense; when it fails, provide support to human operators to avoid having to "reinstall the world"
- RADS insight: we can combine cheap recovery with statistical anomaly detection techniques
6. Hence, (at least) two parts to RADS
- Investigating other microrecovery methods
- Investigating analysis techniques
- What to capture/represent in a model
- Addressing fundamental open challenges
- stability
- systematic misdiagnosis
- subversion by attackers
- etc.
- General insight: "different is bad"
- law-of-large-numbers arguments support this for large services
7. Why RADS?
- Motivation
- Five 9s of availability ⇒ about 5 down-minutes/year ⇒ must recover from (or mask) most failures without human intervention
- a principled way to design "self-*" systems
- Technology
- High-traffic, large-scale distributed/replicated services ⇒ large datasets
- Analysis is CPU-intensive ⇒ a way to trade extra CPU cycles for dependability
- Large logs/datasets for models ⇒ storage is cheap and getting cheaper
- RADS addresses a clear need while exploiting demonstrated technology trends
8. Cheap Recovery
9. Complex systems of black boxes
- "...our ability to analyze and predict the performance of the enormously complex software systems that lie at the core of our economy is painfully inadequate." (Choudhury & Weikum, 2000; PITAC Report)
- Networked services are too complex and rapidly changing to test exhaustively: collections of black boxes
- Weekly or biweekly code drops are not uncommon
- Market activities lead to integration of whole systems
- Need to get humans out of the loop for at least some monitoring/recovery loops
- hence the interest in "autonomic" approaches
- fast detection is often at odds with false alarms
10. Consequences
- Complexity breeds increased bug counts and bug impact
- Heisenbugs, race conditions, and environment-dependent, hard-to-reproduce bugs still account for the majority of SW bugs in live systems
- up to 80% of bugs found in production are those for which a fix is not yet available
- some application-level failures result in user-visible bad behavior before they are detected by site monitors
- Tellme Networks: up to 75% of downtime is detection (sometimes by user complaints), followed by localization
- Amazon, Yahoo: gross metrics track second-order effects of bugs, but lag the actual bug by minutes or tens of minutes
- Result: downtime and increased management costs
- A.P. Wood, "Software reliability from the customer view," IEEE Computer, Aug. 2003
11. Always adapting, always recovering
- Build statistical models of the acceptable operating envelope by measurement and analysis on the live system
- Control theory, statistical correlation, anomaly detection...
- Detect runtime deviations from the model
- typical tradeoff is between detection rate and false-positive rate
- Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep the system within its acceptable operating envelope
- invariant: attempting recovery won't make things worse
- makes inevitable false positives tolerable
- can then reduce false negatives by tuning algorithms to be more aggressive and/or deploying multiple detectors
- Systems that are "always adapting, always recovering"
12. Toward recovery management invariants
- Observation: instrumentation and analysis
- collect and analyze data from running systems
- rely on "most systems work most of the time" to automatically derive baseline models
- Analysis: detect and localize anomalous behavior
- Action: close the loop automatically with micro-recovery (a sketch of this loop follows below)
- Salubrious: returns some part of the system to a known state
- Reclaim resources (memory, DB connections, sockets, DHCP leases...), throw away corrupt transient state, set up to retry the operation if appropriate
- Safe: no effect on correctness, minimal effect on performance
- Localized: parts not being microrecovered aren't affected
- Fast recovery simplifies failure detection and recovery management.
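The observe/analyze/act cycle above is, at heart, a small control loop. Below is a minimal sketch in Python; collect_metrics, is_anomalous, and microreboot are hypothetical hooks standing in for the instrumentation, the automatically derived baseline model, and the recovery mechanism, not the actual RADS implementation.

```python
import time

def recovery_loop(components, collect_metrics, is_anomalous, microreboot,
                  period_s=1.0):
    """Observation -> analysis -> action loop for microrecovery.

    collect_metrics(c) gathers data from the running system (observation),
    is_anomalous(c, m) compares it against a baseline model (analysis),
    and microreboot(c) is the safe, localized recovery action.
    All three callables are hypothetical placeholders.
    """
    while True:
        for c in components:
            metrics = collect_metrics(c)   # observation: instrument the live system
            if is_anomalous(c, metrics):   # analysis: detect/localize anomalous behavior
                microreboot(c)             # action: cheap, localized recovery
        time.sleep(period_s)
```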
13. Non-goals/complementary work
- All of the following are being capably studied by others, and directly compose with our own efforts...
- Byzantine fault tolerance
- In-place repair of persistent data structures
- Hard real-time response guarantees
- Adding checkpointing to legacy, non-componentized applications
- Source-code bug finding
- Advancing the state of the art in SLT (analysis algorithms)
14. Outline
- Micro-recoverable systems
- Concept of microrecovery
- A microrecoverable application server and session state store
- Application-generic SLT-based failure detection
- Path and component analysis and localization for the appserver
- Simple time-series analyses for the purpose-built state store
- Combining SLT detection with microrecoverable systems
- Discussion, related work, implications, conclusions
15. Microrebooting: one kind of microrecovery
- 60% of software failures in the field are reboot-curable, even if the root cause is unknown... why?
- Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources
- re-establishes control flow in a predictable way (breaks deadlocks/livelocks, returns a thread or process to its start state)
- To avoid imperiling correctness, we must...
- Separate data recovery from process recovery
- Safeguard the data
- Reclaim resources with high confidence
- Goal: get the same benefits of rebooting, but at much finer grain (hence faster and less disruptive): microrebooting
- D. Oppenheimer et al., "Why do Internet services fail, and what can be done about it?", USITS 2003
16-20. Write example: Write to Many, Wait for Few
- Try to write to W random bricks (W = 4); must wait for WQ bricks to reply (WQ = 2)
- [Animation over five slides: the browser issues the write to bricks; once WQ have replied, it proceeds, and a browser cookie holds metadata recording which bricks (here 1 and 4) stored the value. A sketch of the write path follows below.]
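A minimal Python sketch of the write path, under stated assumptions: each brick object exposes a hypothetical put(key, value) -> bool, W and WQ match the slide (4 and 2), and the returned brick list is what the browser cookie would record. This is an illustration of the idea, not SSM's actual protocol code.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FutureTimeout

W, WQ = 4, 2   # write to W random bricks, wait for only WQ acknowledgements

def write_many_wait_few(bricks, key, value, timeout_s=1.0):
    """Issue the write to W random bricks and return as soon as WQ have acked.

    Slow or crashed bricks are simply not waited for; the caller's cookie
    records which bricks acked, so later reads know where to look.
    """
    targets = random.sample(bricks, W)
    pool = ThreadPoolExecutor(max_workers=W)
    futures = {pool.submit(b.put, key, value): b for b in targets}
    acked = []
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            if fut.exception() is None and fut.result():
                acked.append(futures[fut])
                if len(acked) >= WQ:
                    break                  # enough replies; ignore the stragglers
    except FutureTimeout:
        pass                               # remaining bricks were too slow
    pool.shutdown(wait=False)              # do not block on slow/crashed bricks
    if len(acked) < WQ:
        raise IOError("fewer than WQ bricks acknowledged the write")
    return acked                           # brick list becomes the cookie metadata
```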
21-24. Read example
- Try to read from Bricks 1 and 4 (the bricks recorded in the cookie)
- [Animation over four slides: Brick 1 crashes; the read is still satisfied by Brick 4.]
25. SSM Failure and Recovery
- Failure of a single node
- No data loss (WQ-1 copies remain)
- State is available for R/W during the failure
- Recovery
- Restart: no special-case recovery code
- State is available for R/W during brick restart
- Session state is self-recovering
- The user's access pattern causes the data to be rewritten
26. Backpressure and Admission Control
- [Figure: browser and Bricks 1-5; heavy flow to Brick 3 causes it to drop requests.]
27. Statistical Monitoring
- Per-brick statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
- [Figure: Bricks 1-5, each exporting these statistics.]
28. SSM Monitoring
- N replicated bricks handle read/write requests
- Cannot do structural anomaly detection!
- Alternative: features (performance, memory usage, etc.)
- Activity statistics: how often did a brick do something?
- Msgs received/sec, dropped/sec, etc.
- Same across all peers, assuming a balanced workload
- Use anomalies as likely failures
- State statistics: current state of the system
- Memory usage, queue length, etc.
- Similar pattern across peers, but may not be in phase
- Look for patterns in the time series; differences in patterns indicate failure at a node.
29. Detecting Anomalous Conditions
- Metrics compared against those of peer bricks
- Basic idea: changes in workload tend to affect all bricks equally
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- Anomaly in 6 or more (out of 9) metrics ⇒ reboot the brick
- Use different techniques for different stats
- Activity: absolute median deviation (sketch below)
- State: Tarzan time-series analysis
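A minimal sketch of the peer-comparison step for the activity statistics, under the slides' assumption that most bricks behave alike; the deviation threshold and the 6-of-9 quorum below are placeholders, not SSM's tuned constants.

```python
from statistics import median

def anomalous_bricks(metric_by_brick, threshold=3.0):
    """Flag bricks whose value for one activity statistic deviates from the
    peer median by more than `threshold` median absolute deviations."""
    values = list(metric_by_brick.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9   # avoid divide-by-zero
    return {brick for brick, v in metric_by_brick.items()
            if abs(v - med) / mad > threshold}

def should_reboot(brick, per_metric_anomalies, quorum=6):
    """Slide's rule: reboot a brick if it is anomalous in >= 6 of the 9 metrics."""
    return sum(brick in flagged for flagged in per_metric_anomalies) >= quorum
```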
30. Network fault: 70% packet loss in the SAN
- [Timeline: network fault injected → fault detected and brick killed → brick restarts.]
31. J2EE as a platform for uRB-based recovery
- Java 2 Enterprise Edition: a component framework for Internet request-reply style apps
- An app is a collection of components (EJBs) created by subclassing a managed container class
- the application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc.
- Web pages with embedded servlets and Java Server Pages invoke EJB methods
- potential to improve all apps by modifying the appserver
- J2EE has a strong following, encourages modular programming, and there are open-source appservers
32. Separating data recovery from process recovery
- For HTTP workloads, session state ≈ app checkpoint
- Store session state in a microrebootable session-state subsystem (NSDI '04)
- Recovery = non-state-preserving process restart; redundancy gives probabilistic durability
- Response-time cost of externalizing session state: 25%
- SSM, an N-way RAM-based state replication system [NSDI '04], sits behind the existing J2EE API
- Microreboot of an EJB:
- destroys all instances of the EJB and associated threads
- releases appserver-level resources (DB connections, etc.)
- discards appserver metadata about the EJB
- session state is preserved across the uRB
33. JBoss + uRBs + SSM: fault injection
- Fault injection: null references, deadlocks/infinite loops, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exceptions
- RUBiS online auction app (132K items, 1.5M bids, 100K subscribers); 150 simulated users/node, 35-45 req/sec/node; workload mix based on a commercial auction site
- Client-based failure detection
34. uRB vs. full RB: action-weighted goodput
- Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets
- Localization is crude: static analysis to associate a failed URL with a set of EJBs, incrementing an EJB's score whenever it is implicated
- With uRBs, 89% fewer failed requests and 9% more successful requests compared to full RB, despite 6 false positives
35. Performance overhead of JAGR
- 150 clients/node: latency 38 msec (3 → 7 nodes)
- Human-perceptible delay: 100-200 msec
- Real auction site: 41 req/sec, 33-300 msec latency
36. Improving availability from the user's point of view
- uRB improves user-perceived availability vs. full reboot
- uRB complements failover
- (a) Initially, excess load on the 2nd node brought it down immediately after failover
- (b) uRB results in some failed requests (96% fewer) from temporary overload
- (c, d) Full reboot vs. uRB without failover
- For small clusters, should always try uRB first
37. uRB Tolerates Lax Failure Detection
- Tolerates lag in detection latency (up to 53 s in our microbenchmark) and high false-positive rates
- Our naive detection algorithm had up to a 60% false-positive rate in terms of what to uRB
- we injected 97 false positives before the reduction in overall availability equaled the cost of a full RB
- Always safe to use as the first line of defense, even when failover is possible
- cost(uRB + other recovery) ≈ cost(other recovery)
- success rate of uRB on reboot-curable failures is comparable to a whole-appserver reboot
38. Performance penalties
- Baseline: workload mix modeled on a commercial site
- 150 simulated clients per node, 40-45 reqs/sec per node
- system at 70% utilization
- Throughput 1% worse due to instrumentation
- worst-case response latency increases from 800 to 1200 ms
- Average case: 45 ms to 80 ms; compare to 35-300 ms for a commercial service
- Well within human tolerance thresholds
- Entirely due to factoring out session state
- The performance penalty is tolerable and worth it
39. Recovery and maintenance
40. Microrecovery for Maintenance Operations
- Capacity discovery in SSM
- TCP-inspired flow control keeps the system from falling off a cliff (sketch below)
- "OK to say no" is essential for this backpressure to work
- Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks)
- Splitting/coalescing in Dstore
- Split handled as failure plus reappearance of the failed node
- The same safe, non-disruptive recovery mechanisms are used to lazily repair inconsistencies after the new node appears
- Consequently, the performance impact is small enough to do this as an online operation
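The "TCP-inspired flow control" mentioned above is essentially additive-increase/multiplicative-decrease on an admission window. A minimal sketch, with hypothetical constants rather than SSM's actual controller:

```python
class AdmissionWindow:
    """AIMD admission control: grow the number of admitted in-flight requests
    slowly, shrink it sharply whenever a brick says "no" or drops a request."""

    def __init__(self, start=4.0, minimum=1.0, add=1.0, mult=0.5):
        self.window = start     # how many requests may be in flight
        self.minimum = minimum
        self.add = add          # additive increase per accepted request
        self.mult = mult        # multiplicative decrease on backpressure
        self.in_flight = 0

    def can_send(self):
        return self.in_flight < self.window

    def on_send(self):
        self.in_flight += 1

    def on_accept(self):
        self.in_flight -= 1
        self.window += self.add

    def on_refusal(self):       # backpressure: the brick said "no"
        self.in_flight -= 1
        self.window = max(self.minimum, self.window * self.mult)
```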
41. Using microrecovery for maintenance
- Capacity discovery in SSM
- the redundancy mechanism used for recovery (write many, wait few) is also used to "say no" while gracefully degrading performance
42. Full rejuvenation vs. microrejuvenation
43. Splitting/coalescing in Dstore
- Splitting/coalescing in Dstore
- Split handled as failure plus reappearance of the failed node
- The same mechanisms are used to lazily repair inconsistencies
44. Summary: microrecoverable systems
- Separation of data recovery from process recovery
- Special-purpose data stores can be made microrecoverable
- OK to initiate microrecovery anytime, for any reason
- no loss of correctness, tolerable loss of performance
- likely (but not guaranteed) to fix an important class of transient failures
- won't make things worse; can always try full recovery afterward
- inexpensive enough to tolerate sloppy fault detection
- low-cost first line of defense
- some maintenance ops can be cast as microrecovery
- due to low cost, proactive maintenance can be done online
- can often convert unplanned long downtime into a planned, shorter performance hit
45. Anomaly detection as failure detection
46. Example Anomaly-Finding Techniques
- Question: does anomaly = bug?
- Includes design-time and build-time techniques
- Includes both offline (invasive) and online detection techniques
47. Examples of Badness Inference
- Sometimes we can detect badness by looking for inconsistencies in runtime behavior
- We can observe program-specific properties (though using automated methods) as well as program-generic properties
- Often, we must first be able to observe the program operating normally
- Eraser: detecting data races [Savage et al. 2000]
- Observe lock/unlock patterns around shared variables
- If a variable usually protected by a lock/unlock or mutex is observed to have interleaved reads, report a violation
- DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002] (sketch below)
- Start with a strict invariant ("x is always 3")
- Relax it as other values are seen ("x is in [0,10]")
- Increase confidence in the invariant as more observations are seen
- Report violations of invariants that have above-threshold confidence
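A minimal sketch of the DIDUCE-style loop for one numeric program point: start strict, relax as new values arrive, and report a violation only once the invariant has earned enough confidence. The range representation and confidence rule are simplified placeholders, not DIDUCE's actual hypothesis encoding.

```python
class RelaxingInvariant:
    """Tracks an invariant of the form 'x is always in [lo, hi]' for one
    observation point, in the spirit of DIDUCE."""

    def __init__(self, min_confidence=1000):
        self.lo = self.hi = None
        self.consistent = 0                   # observations since last relaxation
        self.min_confidence = min_confidence  # samples needed before we trust it

    def observe(self, x):
        """Returns True iff x violates a sufficiently confident invariant."""
        if self.lo is None:                   # first value: "x is always this"
            self.lo = self.hi = x
            self.consistent = 1
            return False
        if self.lo <= x <= self.hi:
            self.consistent += 1              # consistent observation: confidence grows
            return False
        violation = self.consistent >= self.min_confidence
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)   # relax the invariant
        self.consistent = 0
        return violation
```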
48. Generic runtime monitoring techniques
- What conditions are we monitoring for?
- Fail-stop vs. fail-silent vs. fail-stutter
- Byzantine failures
- Generic methods
- Heartbeats (what does loss of a heartbeat mean? Who monitors them?)
- Resource monitoring (what is "abnormal"?)
- Application-specific monitoring: ask a question you know the answer to
- Fault-model enforcement
- coerce all observed faults to an expected-faults subset
- if necessary, take additional actions to completely induce the fault
- Simplifies recovery, since there are fewer distinct cases
- Avoids potential misdiagnosis of faults that have common symptoms
- Note: may sometimes appear to make things worse (coerce a less-severe fault to a more-severe one)
- Doesn't exercise all parts of the system
49. Internet performance failure detection
- Various approaches, all of which exploit the law of large numbers and (sort of) the Central Limit Theorem (which is?)
- Establish a baseline of the quantity to be monitored
- Take observations; factor out data from known failures
- Normalize to workload?
- Look for significant deviations from the baseline
- What to measure?
- Coarse grain: number of reqs/sec
- Finer grain: number of TCP connections in Established, Syn_sent, Syn_rcvd states
- Even finer: additional internal request milestones
- Hard to do in an application-generic way... but frameworks can save us
50. Example 1: Detection and recovery in SSM
- 9 state statistics collected per second from each replica
- Tarzan time-series analysis compares relative frequencies of substrings of the discretized time series (sketch below)
- a brick is flagged as anomalous when at least 6 stats are anomalous
- works for aperiodic or irregular-period signals
- robust against workload changes that affect all replicas equally, and against highly correlated metrics
- Keogh et al., "Finding surprising patterns in a time series database in linear time and space," SIGKDD 2002
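A minimal sketch of the Tarzan-style comparison for one statistic: discretize the time series into a symbol string and compare relative substring frequencies against a reference string (history, or a well-behaved peer). The binning, substring length, and scoring below are simplified placeholders, not the algorithm of Keogh et al.

```python
from collections import Counter

def discretize(series, n_bins=4):
    """Map a numeric time series to symbols 'a', 'b', ... using equal-width bins."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0
    return "".join(chr(ord('a') + min(n_bins - 1, int((x - lo) / width)))
                   for x in series)

def substring_freqs(symbols, k=3):
    counts = Counter(symbols[i:i + k] for i in range(len(symbols) - k + 1))
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def surprise(observed, reference, k=3):
    """Total difference in relative substring frequencies; large = anomalous."""
    obs, ref = substring_freqs(observed, k), substring_freqs(reference, k)
    return sum(abs(obs.get(w, 0.0) - ref.get(w, 0.0)) for w in set(obs) | set(ref))
```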
51. What faults does this handle?
- Essentially 100% availability vs. injected faults
- Node crash/hang/timeout/freeze
- Fail-stutter: network loss (drop up to 70% of packets randomly)
- Periodic slowdown (e.g., from garbage collection)
- Persistent slowdown (one node lags the others)
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- All anomalies can be safely coerced to crash faults
- If a reboot doesn't fix it, it didn't cost much to try
- Human notified after a threshold number of restarts; the system has no concept of "recovery"
- Allows SSM to be managed like a farm of stateless servers
52. Detecting anomalies in application logic
- Goal: detect failures whose only obvious symptom is a change in the semantics of the application
- Example: wrong item data displayed; wouldn't be caught by HTML scraping or HTTP logs
- Typically, the site responds to HTTP pings, etc. under such failures
- These commonly result from exceptions of the form we injected into RUBiS
- Insight: manifestation of bugs is the rare case, so capture normal behavior of the system under no fault injection
- Then detect threshold deviations from this baseline
- Periodically move the baseline to allow for workload evolution
53. Patterns: Path-shape analysis
- Model paths as parse trees in a probabilistic CFG
- Build the grammar under believed-normal conditions, then mark very unlikely paths as anomalous (simplified sketch below)
- after classification, build a decision tree to correlate path features (components touched) with anomalous paths
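Pinpoint's model is a probabilistic context-free grammar over path parse trees; the sketch below substitutes a simpler first-order transition model to show the same learn-then-score idea (train on believed-normal paths, flag very unlikely ones). The decision-tree localization step is omitted.

```python
import math
from collections import defaultdict, Counter

class PathShapeModel:
    """Learn component-to-component transition frequencies from paths observed
    under believed-normal conditions, then score new paths; very unlikely
    paths are marked anomalous. (Simplified stand-in for the PCFG model.)"""

    def __init__(self):
        self.counts = defaultdict(Counter)    # component -> Counter of successors

    def train(self, path):
        # path: list of component names touched by one request, in order
        for a, b in zip(["<start>"] + path, path + ["<end>"]):
            self.counts[a][b] += 1

    def avg_log_likelihood(self, path, smoothing=1e-6):
        transitions = list(zip(["<start>"] + path, path + ["<end>"]))
        score = 0.0
        for a, b in transitions:
            row = self.counts.get(a, Counter())
            # add-one denominator so transitions never seen in training score very low
            score += math.log((row.get(b, 0) + smoothing) / (sum(row.values()) + 1.0))
        return score / len(transitions)

    def is_anomalous(self, path, threshold=-5.0):
        return self.avg_log_likelihood(path) < threshold
```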
54. Patterns: Component interaction analysis
- Model interactions between a component and its n neighbors in the dynamic call graph as a weighted DAG
- compare to the observed call graph using chi-squared goodness-of-fit (sketch below)
- can compare either across peers or against historical data
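A minimal sketch of the goodness-of-fit test for one component: compare the observed distribution of its calls to each neighbor against the expected weights (historical, or averaged across peers). The cutoff would come from the chi-squared distribution with (number of neighbors - 1) degrees of freedom; the constant below is a placeholder.

```python
def interaction_chi_squared(observed_counts, expected_fractions):
    """Chi-squared statistic between a component's observed interactions with
    its neighbors and the expected interaction weights.

    observed_counts:    {neighbor: calls observed in this window}
    expected_fractions: {neighbor: expected fraction of this component's calls}
    """
    n = sum(observed_counts.values())
    stat = 0.0
    for neighbor, frac in expected_fractions.items():
        expected = n * frac
        if expected > 0:
            diff = observed_counts.get(neighbor, 0) - expected
            stat += diff * diff / expected
    return stat

def component_is_anomalous(observed_counts, expected_fractions, cutoff=20.0):
    return interaction_chi_squared(observed_counts, expected_fractions) > cutoff
```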
55. Precision and recall (example)
- Detection: recall = fraction of failures actually detected as anomalies
- Strictly better than HTTP/HTML monitoring (detection recall, for faults affecting >1% of the workload)
- Localization (formulas below)
- recall: fraction of actually-faulty requests returned
- precision: fraction of returned requests that are faulty = 1 - (FP rate)
- Tradeoff between recall and precision (false-positive rate)
- Even the low-recall case corresponds to high detection recall (0.83)
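In symbols, using the slide's definitions (where the false-positive rate is the fraction of returned requests that are not actually faulty):

```latex
\text{recall} = \frac{|\text{returned} \cap \text{faulty}|}{|\text{faulty}|},
\qquad
\text{precision} = \frac{|\text{returned} \cap \text{faulty}|}{|\text{returned}|}
                 = 1 - \text{FP rate}
```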
56. Pinpoint: key results
- Detects 89-96% of injected failures, compared to 20-79% for HTML scraping and HTTP log monitoring
- Limited success in detecting injected source bugs
- Example success: caught a bug that prevented the shopping cart from iterating over its contents to display them, and correctly identified the at-fault component (where the bug was injected)
- Resilient to normal workload changes
- Because we bin the analysis by request category
- Resilient to bug-fix-release code changes
- Currently slow: analysis lags 20 s behind the application
57. Combining uRBs and Pinpoint
- Simple recovery policy (sketch below)
- uRB all components whose normalized anomaly score is > 1.0
- if we've already done that, reboot the whole application
- More sophisticated policies are certainly possible
58. Combining uRBs and Pinpoint
- Example: data-structure corruption in the SB_viewItem EJB
- 350 simulated clients
- 18.5 s to detect/localize
- <1 s to repair
- Note: the returned Web page would be valid but incorrect
- Robust to typical workload changes and bug patches
- More comprehensive deployment in progress
59-60. Faulty Request Identification
- HTTP monitoring has perfect precision, since it is a ground-truth indicator of a server fault
- Path-shape analysis pulls more points out of the bottom-left corner
61. Tolerating false positives in DStore
- Metrics and algorithm comparable to those used in SSM
- We inject fail-stutter behavior by increasing request latency
- Bottom case: more aggressive detection also results in 2 unnecessary reboots
- But these don't matter much if there is modest replication
- Currently some voodoo constants for thresholds in both SSM and DStore
- Recall that these are off-the-shelf algorithms; we should be able to do better
- Trade-off: earlier detection vs. false positives
62. Summary of case studies
- Detection and localization are good even with simple algorithms, and fit well with localized recovery
- Performance penalty is tolerable and worth it
- Note: microrecovery can also be used for microrejuvenation
63. Discussion
64. Discussion: What makes this work?
- What made it work in our examples specifically?
- Recovery speed: weaker consistency in SSM and DStore in exchange for fast recovery and predictable work done per request
- Recovery correctness: J2EE apps are constrained to checkpoint by manipulating session state, and this is brought out in the app-writer-visible APIs; good isolation between components and relative lack of shared state
- Anomaly detection: app behavior alternates short sequences of EJB calls with updates to persistent state, so it can be characterized in terms of those calls
- Observations
- Neither diagnosis ⇒ recovery nor recovery ⇒ diagnosis
- Localization ≠ diagnosis, but it is an important optimization
65. Why are statistical methods appealing?
- Large, complex systems tend to exercise a lot of their functionality in a fairly short amount of time
- Especially Internet services, with high-volume workloads of largely independent requests
- Even if we don't know what to measure, statistical and data-mining techniques can help figure it out
- Performance problems are often linked with dependability problems (fail-stutter behavior), for either HW or SW reasons
- Most systems work well most of the time
- Corollary: in a replicated system, replicas should behave the same most of the time
66. When does it not work?
- When SLT-based monitoring does not apply
- Base-rate fallacy: monitored events so rare that the false-positive rate dominates
- Gaming the system (deliberately or inadvertently)
- When failures can't be cured by any kind of micro-recovery
- Persistent-state corruption (or hardware failure)
- Corrupted configuration data
- a spectrum of "undo"
- When you can't "say no"
- Backpressure and the possibility of caller retry are used to improve predictability
- Promising you will "say yes" may be difficult... the question may be whether end-to-end guarantees are needed at lower layers
67. SSM/DStore as extreme design points
- The goal was to investigate the extreme of "no special recovery code"
- Could explore erasure coding (RepStore does this dynamically)
- Weakened consistency model of DStore vs. 2PC
- Spread the cost of repair lazily across many operations (rather than bulk recovery)
- Spread some 2PC state maintenance to the client in the form of a "write in progress" cookie
- 2PC may well be affordable, but we were interested in the extreme design point of no special restart code
68. Role of the 3-tier architecture
- Separation of concerns; really, separation of process recovery (control flow) from data recovery
- uRB and reboots recover processes; SSM, DStore, and traditional relational databases recover data
- Not addressed: repair of data
69. Shouldn't we just make software better?
- Yes, we should (and many people are), but...
- We use commodity HW and SW despite the fact that they are imperfect, less reliable than hardened or purpose-built components, etc. Why?
- Price/performance follows volume
- Allows specialization of effort and composition of reusable building blocks (vs. building a stovepipe system)
- In short, it allows a much faster overall pace of innovation and deployment, for both technical and economic reasons, even though the components themselves are imperfect
- We should assume "commodity programmers" too (observation from Brewster Kahle)
- Give as much generic support to the application as we can
70. Challenges and open issues
- Algorithm issues that impinge on systems work
- Hand-tuned constants/thresholds in the algorithms (seems to be an issue in other applications of SLT as well)
- Online vs. offline algorithms
- Stability of the closed loop
- Systems issues
- How do you know you've checkpointed all the important state, or that something is safe to retry?
- How do you debug a moving target? Traditional methods/tools are confounded by code obfuscation, sudden loss of transient program state (stack and heap), etc. (a great PhD thesis...)
- debugging today's real systems is already hard for these reasons
- Real apps, faultloads, best practices, etc. are hard to get!
71. RADS message in a nutshell
Statistical techniques can identify interesting features and relationships in large datasets, but there is a frequent tradeoff between detection rate (or detection time) and false positives.
Make micro-recovery so inexpensive that occasional false positives don't matter.
- Achievable now on realistic applications and workloads
- Synergistic with componentized apps and frameworks
- Specific point of leverage for collaboration with machine learning research; lots of headroom for improvement
- Even simple algorithms show encouraging initial results
72. Project possibilities
73. BACKUP SLIDES