1
Intro: Overview of RADS Goals
  • Armando Fox, Dave Patterson
  • CS 444A/CS 294-6, Stanford/UC Berkeley
  • Fall 2004

2
  • Administrivia
  • Course logistics and registration
  • Project expectations and other deliverables
  • Background and motivation for RADS
  • ROC and its relationship to RADS
  • Early case studies
  • Discussion: projects, research directions, etc.

3
Administrivia/goals
  • Stanford enrollment vs. Axess
  • SLT and CT tutorial VHS/DVDs available to view
  • SLT and CT Lab/assignments grading policy
  • Stanford and Berkeley meeting/transportation
    logistics
  • Format of course

4
Background and motivation for RADS
5
RADS in One Slide
  • Philosophy of ROC: focus on lowering MTTR to
    improve overall availability
  • ROC achievements: two levels of lowering MTTR
  • Microrecovery: fine-grained, generic recovery
    techniques that recover only the failed part(s) of
    the system, at much lower cost than whole-system
    recovery
  • Undo: sophisticated tools to help human operators
    selectively back out destructive actions/changes
    to a system
  • General approach: use microrecovery as the first
    line of defense; when it fails, provide support
    to human operators to avoid having to reinstall
    the world
  • RADS insight: cheap recovery can be combined with
    statistical anomaly detection techniques

6
Hence, (at least) 2 parts to RADS
  • Investigating other microrecovery methods
  • Investigating analysis techniques
  • What to capture/represent in a model
  • Addressing fundamental open challenges
  • stability
  • systematic misdiagnosis
  • subversion by attackers
  • etc.
  • General insight: different is bad
  • law of large numbers arguments support this for
    large services

7
Why RADS?
  • Motivation
  • Five nines (99.999%) availability ⇒ about 5
    down-minutes/year ⇒ must recover from (or mask)
    most failures without human intervention
  • a principled way to design self-* systems
  • Technology
  • High-traffic, large-scale distributed/replicated
    services ⇒ large datasets
  • Analysis is CPU-intensive ⇒ a way to trade extra
    CPU cycles for dependability
  • Large logs/datasets for models ⇒ storage is
    cheap and getting cheaper
  • RADS addresses a clear need while exploiting
    demonstrated technology trends

8
Cheap Recovery
9
Complex systems of black boxes
  • "...our ability to analyze and predict the
    performance of the enormously complex software
    systems that lie at the core of our economy is
    painfully inadequate." (Choudhury & Weikum; 2000
    PITAC Report)
  • Networked services: too complex and rapidly
    changing to test exhaustively; collections of
    black boxes
  • Weekly or biweekly code drops not uncommon
  • Market activities lead to integration of whole
    systems
  • Need to get humans out of loop for at least some
    monitoring/recovery loops
  • hence interest in autonomic approaches
  • fast detection is often at odds with avoiding
    false alarms

10
Consequences
  • Complexity breeds increased bug counts and bug
    impact
  • Heisenbugs, race conditions, environment-dependent
    and hard-to-reproduce bugs still account for
    majority of SW bugs in live systems
  • up to 80% of bugs found in production are those
    for which a fix is not yet available
  • some application-level failures result in
    user-visible bad behavior before they are
    detected by site monitors
  • Tellme Networks: up to 75% of downtime is
    detection (sometimes by user complaints),
    followed by localization
  • Amazon, Yahoo: gross metrics track second-order
    effects of bugs, but lag the actual bug by minutes
    or tens of minutes
  • Result: downtime and increased management costs
  • A.P. Wood, "Software reliability from the
    customer view," IEEE Computer, Aug. 2003

11
Always adapting, always recovering
  • Build statistical models of the acceptable
    operating envelope by measurement and analysis on
    the live system
  • Control theory, statistical correlation, anomaly
    detection...
  • Detect runtime deviations from the model
  • typical tradeoff is between detection rate and
    false positive rate
  • Rely on external control, using inexpensive and
    simple mechanisms that respect the black box, to
    keep the system within its acceptable operating
    envelope
  • invariant: attempting recovery won't make things
    worse
  • makes inevitable false positives tolerable
  • can then reduce false negatives by tuning
    algorithms to be more aggressive and/or deploying
    multiple detectors
  • Systems that are always adapting, always
    recovering

12
Toward recovery management invariants
  • Observation: instrumentation and analysis
  • collect and analyze data from running systems
  • rely on "most systems work most of the time" to
    automatically derive baseline models
  • Analysis: detect and localize anomalous behavior
  • Action: close the loop automatically with
    micro-recovery (see the sketch after this list)
  • Salubrious: returns some part of the system to a
    known state
  • Reclaim resources (memory, DB conns, sockets,
    DHCP lease...), throw away corrupt transient
    state, set up to retry the operation if appropriate
  • Safe: no effect on correctness, minimal effect on
    performance
  • Localized: parts not being microrecovered aren't
    affected
  • Fast recovery simplifies failure detection and
    recovery management.
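
A minimal sketch of this observe/analyze/act loop in Java, assuming
hypothetical Monitor, AnomalyDetector, and Microrecovery interfaces (the
names are illustrative, not taken from the RADS prototypes):

    // Illustrative closed-loop recovery manager: observe metrics, detect
    // deviations from a learned baseline, and trigger localized recovery.
    import java.util.List;
    import java.util.Map;

    interface Monitor {                      // observation: collect stats from the live system
        Map<String, double[]> collect();     // component name -> recent metric samples
    }
    interface AnomalyDetector {              // analysis: compare against the baseline model
        List<String> anomalousComponents(Map<String, double[]> observations);
    }
    interface Microrecovery {                // action: cheap, localized, safe recovery
        void recover(String component);      // e.g., microreboot just this component
    }

    class RecoveryLoop implements Runnable {
        private final Monitor monitor;
        private final AnomalyDetector detector;
        private final Microrecovery recovery;

        RecoveryLoop(Monitor m, AnomalyDetector d, Microrecovery r) {
            monitor = m; detector = d; recovery = r;
        }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                Map<String, double[]> obs = monitor.collect();
                // Because recovery is cheap and safe, false positives are tolerable,
                // so the detector can afford to be aggressive.
                for (String component : detector.anomalousComponents(obs)) {
                    recovery.recover(component);
                }
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
            }
        }
    }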

13
Non-goals/complementary work
  • All of the following are being capably studied by
    others, and directly compose with our own
    efforts...
  • Byzantine fault tolerance
  • In-place repair of persistent data structures
  • Hard-real-time response guarantees
  • Adding checkpointing to legacy non-componentized
    applications
  • Source code bug finding
  • Advancing the state of the art in SLT (analysis
    algorithms)

14
Outline
  • Micro-recoverable systems
  • Concept of microrecovery
  • A microrecoverable application server session
    state store
  • Application-generic SLT-based failure detection
  • Path and component analysis and localization for
    appserver
  • Simple time series analyses for purpose-built
    state store
  • Combining SLT detection with microrecoverable
    systems
  • Discussion, related work, implications,
    conclusions

15
Microrebooting: one kind of microrecovery
  • 60% of software failures in the field are
    reboot-curable, even if the root cause is unknown...
    why?
  • Rebooting discards bad temporary data (corrupted
    data structures that can be rebuilt) and
    (usually) reclaims used resources
  • reestablishes control flow in a predictable way
    (breaks deadlocks/livelocks, returns thread or
    process to its start state)
  • To avoid imperiling correctness, we must...
  • Separate data recovery from process recovery
  • Safeguard the data
  • Reclaim resources with high confidence
  • Goal: get the same benefits of rebooting but at much
    finer grain (hence faster and less disruptive):
    microrebooting
  • D. Oppenheimer et al., "Why do Internet services
    fail and what can be done about it?", USITS 2003

16
Write example: Write to Many, Wait for Few
Try to write to W random bricks (W = 4); must wait
for WQ bricks to reply (WQ = 2).
(Slides 16-20 animate the same diagram: the browser
issues the write to four randomly chosen bricks out of
Bricks 1-5, returns as soon as the first WQ acknowledge,
and keeps a cookie recording which bricks hold the
value, here Bricks 1 and 4. A simplified code sketch
follows.)
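
A simplified sketch of "write to many, wait for few" in Java, assuming a
hypothetical Brick.write() RPC; W and WQ match the slide, and the returned
cookie records which bricks acknowledged:

    // Simplified quorum-style write: send to W random bricks, return as soon
    // as WQ of them acknowledge. The cookie (list of acking brick ids) is
    // kept by the client/browser, not by the bricks.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.*;

    class SessionWriter {
        static final int W = 4, WQ = 2;

        interface Brick { int id(); boolean write(String key, byte[] value); } // hypothetical RPC

        private final List<Brick> bricks;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        SessionWriter(List<Brick> bricks) { this.bricks = bricks; }

        /** Returns the "cookie": ids of the first WQ bricks that acknowledged. */
        List<Integer> write(String key, byte[] value) throws Exception {
            List<Brick> targets = new ArrayList<>(bricks);
            Collections.shuffle(targets);                       // W random bricks
            targets = targets.subList(0, Math.min(W, targets.size()));

            CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
            for (Brick b : targets) {
                cs.submit(() -> b.write(key, value) ? b.id() : null);
            }
            List<Integer> cookie = new ArrayList<>();
            for (int i = 0; i < targets.size() && cookie.size() < WQ; i++) {
                Integer id = cs.take().get();                   // wait only for the next reply
                if (id != null) cookie.add(id);
            }
            if (cookie.size() < WQ) throw new Exception("write failed: fewer than WQ acks");
            return cookie;                                      // metadata lives in the browser cookie
        }
    }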
21
Read example
(Slides 21-24 animate the read path: using the cookie,
the browser tries to read from Bricks 1 and 4; Brick 1
crashes, but the read still completes from Brick 4, the
surviving replica.)
25
SSM Failure and Recovery
  • Failure of a single node
  • No data loss: WQ-1 replicas remain
  • State is available for R/W during the failure
  • Recovery
  • Restart: no special-case recovery code
  • State is available for R/W during brick restart
  • Session state is self-recovering
  • User's access pattern causes data to be rewritten

26
Backpressure and Admission Control
(Diagram: heavy flow to Brick 3 causes it to drop
requests; admission control under backpressure. A
simplified sketch follows.)
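
A minimal sketch of per-brick admission control via a bounded inbox: when a
brick's queue is full it drops (says "no" to) new requests rather than falling
off a cliff. The class and queue bound are illustrative, not SSM's actual
implementation:

    // Backpressure sketch: a bounded inbox per brick. offer() fails instead of
    // blocking when the brick is overloaded, so callers get an immediate "no"
    // and can go elsewhere (write-to-many makes that safe).
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    class BrickInbox<R> {
        private final BlockingQueue<R> inbox;
        private final AtomicLong dropped = new AtomicLong();    // exported as NumDropped

        BrickInbox(int capacity) { inbox = new ArrayBlockingQueue<>(capacity); }

        /** Admission control: accept only if there is room; otherwise drop. */
        boolean admit(R request) {
            boolean accepted = inbox.offer(request);
            if (!accepted) dropped.incrementAndGet();
            return accepted;
        }

        R nextRequest() throws InterruptedException { return inbox.take(); }

        long droppedCount() { return dropped.get(); }           // fed to the statistical monitor
    }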
27
Statistical Monitoring
(Diagram: each brick reports per-second statistics to
the monitor: NumElements, MemoryUsed, InboxSize,
NumDropped, NumReads, NumWrites.)
28
SSM Monitoring
  • N replicated bricks handle read/write requests
  • Cannot do structural anomaly detection!
  • Alternative: features (performance, memory usage,
    etc.)
  • Activity statistics: how often did a brick do
    something?
  • Msgs received/sec, dropped/sec, etc.
  • Same across all peers, assuming balanced workload
  • Use anomalies as likely failures
  • State statistics: current state of the system
  • Memory usage, queue length, etc.
  • Similar pattern across peers, but may not be in
    phase
  • Look for patterns in the time series; differences in
    patterns indicate failure at a node.

29
Detecting Anomalous Conditions
  • Metrics compared against those of peer bricks
  • Basic idea: changes in workload tend to affect
    all bricks equally
  • Underlying (weak) assumption: most bricks are
    doing mostly the right thing most of the time
  • Anomaly in 6 or more (out of 9) metrics ⇒ reboot
    brick
  • Use different techniques for different stats
  • Activity: absolute median deviation (see the
    sketch after this list)
  • State: Tarzan time-series analysis
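
A sketch of the peer-comparison idea for activity statistics, using a median
absolute deviation test (one common form of the deviation test named above);
the threshold k is illustrative:

    // Peer comparison for one activity statistic (e.g., msgs received/sec):
    // a brick is anomalous on this metric if it deviates from the median of
    // its peers by more than k times the median absolute deviation (MAD).
    import java.util.Arrays;

    class PeerDeviationDetector {
        private final double k;                       // illustrative threshold, e.g. 3.0

        PeerDeviationDetector(double k) { this.k = k; }

        private static double median(double[] xs) {
            double[] s = xs.clone();
            Arrays.sort(s);
            int n = s.length;
            return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
        }

        /** For each brick, is its value for this metric anomalous relative to its peers? */
        boolean[] anomalous(double[] valuePerBrick) {
            double med = median(valuePerBrick);
            double[] absDev = new double[valuePerBrick.length];
            for (int i = 0; i < valuePerBrick.length; i++) {
                absDev[i] = Math.abs(valuePerBrick[i] - med);
            }
            double mad = median(absDev);
            boolean[] result = new boolean[valuePerBrick.length];
            for (int i = 0; i < valuePerBrick.length; i++) {
                result[i] = absDev[i] > k * Math.max(mad, 1e-9);  // guard against MAD == 0
            }
            return result;
        }
        // A brick is rebooted only if it is anomalous in 6 or more of the 9 metrics.
    }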

30
Network fault: 70% packet loss in SAN
(Graph annotations: network fault injected; fault
detected and brick killed; brick restarts.)
31
J2EE as a platform for uRB-based recovery
  • Java 2 Enterprise Edition, a component framework
    for Internet request-reply style apps
  • App is a collection of components (EJBs)
    created by subclassing a managed container class
  • application server provides component creation,
    thread management, naming/directory services,
    abstractions for database and HTTP sessions, etc.
  • Web pages with embedded servlets and Java Server
    Pages invoke EJB methods
  • potential to improve all apps by modifying the
    appserver
  • J2EE has a strong following, encourages modular
    programming, and there are open source appservers

32
Separating data recovery from process recovery
  • For HTTP workloads, session state ≈ app
    checkpoint
  • Store session state in a microrebootable session
    state subsystem (NSDI '04)
  • Recovery: non-state-preserving process restart;
    redundancy gives probabilistic durability
  • Response time cost of externalizing session
    state: ~25%
  • SSM, an N-way RAM-based state replication system
    (NSDI '04), behind the existing J2EE API
  • Microreboot EJBs (see the sketch after this list)
  • destroy all instances of an EJB and associated
    threads
  • releases appserver-level resources (DB
    connections, etc.)
  • discards appserver metadata about EJBs
  • session state preserved across uRB
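
A hedged sketch of what a microreboot does at the appserver level, using
hypothetical container hooks (not JBoss's real internal API): destroy the
component's instances and threads, release container-held resources, discard
per-component metadata, and leave externalized session state untouched.

    // Illustrative microreboot of a single EJB. Everything touched here is
    // container state; session state lives in SSM and is not affected, which
    // is what makes the operation safe even on a false positive.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    interface ComponentContainer {                 // hypothetical appserver hooks
        void destroyInstancesAndThreads(String ejbName);
        void releaseResources(String ejbName);     // DB connections, pooled objects, ...
        void discardMetadata(String ejbName);
        void redeploy(String ejbName);             // recreate from the deployed classes
    }

    class Microrebooter {
        private final ComponentContainer container;
        private final Map<String, Integer> rebootCounts = new ConcurrentHashMap<>();

        Microrebooter(ComponentContainer container) { this.container = container; }

        void microreboot(String ejbName) {
            container.destroyInstancesAndThreads(ejbName);
            container.releaseResources(ejbName);
            container.discardMetadata(ejbName);
            container.redeploy(ejbName);
            // Track restarts so a human can be notified past some threshold.
            rebootCounts.merge(ejbName, 1, Integer::sum);
        }
    }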

33
JBoss + uRBs + SSM: fault injection
Fault injection: null refs, deadlocks/infinite
loops, corruption of volatile EJB metadata,
resource leaks, Java runtime errors/exceptions
RUBiS online auction app (132K items, 1.5M bids,
100K subscribers); 150 simulated users/node, 35-45
req/sec/node; workload mix based on a commercial
auction site
Client-based failure detection
34
uRB vs. full RB: action-weighted goodput
  • Example: corrupt JNDI database entry,
    RuntimeException, Java Error; measure G_aw in
    1-second buckets
  • Localization is crude: static analysis to
    associate a failed URL with a set of EJBs,
    incrementing an EJB's score whenever it is
    implicated (see the sketch after this list)
  • With uRBs, 89% reduction in failed requests and
    9% more successful requests compared to full RB,
    despite 6 false positives
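
A sketch of that crude localization, assuming a statically derived map from
URL to the set of EJBs it may touch (the map itself is an input, not shown):
every failed request increments the score of each implicated EJB.

    // Crude failure localization: charge each EJB statically associated with a
    // failed URL, then microreboot the highest-scoring (most implicated) EJBs.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    class CrudeLocalizer {
        private final Map<String, Set<String>> urlToEjbs;   // from static analysis (input)
        private final Map<String, Double> scores = new HashMap<>();

        CrudeLocalizer(Map<String, Set<String>> urlToEjbs) { this.urlToEjbs = urlToEjbs; }

        void recordFailedRequest(String url) {
            for (String ejb : urlToEjbs.getOrDefault(url, Set.of())) {
                scores.merge(ejb, 1.0, Double::sum);
            }
        }

        /** EJBs whose score exceeds a caller-chosen threshold become uRB candidates. */
        Map<String, Double> currentScores() { return scores; }
    }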

35
Performance overhead of JAGR
  • 150 clients/node: latency 38 msec (3 -> 7 nodes)
  • Human-perceptible delay: 100-200 msec
  • Real auction site: 41 req/sec, 33-300 msec latency

36
Improving availability from the user's point of view
  • uRB improves user-perceived availability vs. full
    reboot
  • uRB complements failover
  • (a) Initially, excess load on 2nd node brought it
    down immediately after failover
  • (b) uRB results in some failed requests (96%
    fewer) from temporary overload
  • (c,d) Full reboot vs. uRB without failover
  • For small clusters, should always try uRB first

37
uRB Tolerates Lax Failure Detection
  • Tolerates lag in detection latency (up to 53s in
    our microbenchmark) and high false positive rates
  • Our naive detection algorithm had up to a 60%
    false positive rate in terms of what to uRB
  • we injected 97 false positives before the reduction
    in overall availability equaled the cost of full RB
  • Always safe to use as a first line of defense,
    even when failover is possible
  • cost(uRB + other recovery) ≈ cost(other recovery)
  • success rate of uRB on reboot-curable failures is
    comparable to whole-appserver reboot

38
Performance penalties
  • Baseline workload mix modeled on commercial site
  • 150 simulated clients per node, 40-45 reqs/sec
    per node
  • system at 70% utilization
  • Throughput 1% worse due to instrumentation
  • worst-case response latency increases from 800 to
    1200 ms
  • Average case: 45 ms to 80 ms; compare to 35-300 ms
    for a commercial service
  • Well within human tolerance thresholds
  • Entirely due to factoring out of session state
  • Performance penalty is tolerable and worth it

39
Recovery and maintenance
40
Microrecovery for Maintenance Operations
  • Capacity discovery in SSM
  • TCP-inspired flow control keeps system from
    falling off a cliff
  • "OK to say no" is essential for this backpressure
    to work
  • Microrejuvenation in JAGR (proactively
    microreboot to fix localized memory leaks)
  • Splitting/coalescing in Dstore
  • Split/coalesce treated like failure and
    reappearance of a failed node
  • Same safe/non-disruptive recovery mechanisms are
    used to lazily repair inconsistencies after new
    node appears
  • Consequently, performance impact small enough to
    do this as an online operation

41
Using microrecovery for maintenance
  • Capacity discovery in SSM
  • redundancy mechanism used for recovery (write
    many, wait few) also used to say no while
    gracefully degrading performance

42
Full rejuvenation vs. microrejuvenation
43
Splitting/coalescing in Dstore
  • Splitting/coalescing in Dstore
  • Split/coalesce treated like failure and
    reappearance of a failed node
  • Same mechanisms used to lazily repair
    inconsistencies

44
Summary: microrecoverable systems
  • Separation of data from process recovery
  • Special-purpose data stores can be made
    microrecoverable
  • OK to initiate microrecovery anytime for any
    reason
  • no loss of correctness, tolerable loss of
    performance
  • likely (but not guaranteed) to fix an important
    class of transients
  • won't make things worse; can always try full
    recovery afterward
  • inexpensive enough to tolerate sloppy fault
    detection
  • low-cost first line of defense
  • some maintenance ops can be cast as
    microrecovery
  • due to low cost, proactive maintenance can be
    done online
  • can often convert long unplanned downtime into a
    shorter planned performance hit

45
Anomaly detection as failure detection
46
Example Anomaly-Finding Techniques
Question: does anomaly = bug?
(Chart: the surveyed techniques span design time and
build time, and include both offline (invasive) and
online detection techniques.)
47
Examples of Badness Inference
  • Sometimes can detect badness by looking for
    inconsistencies in runtime behavior
  • We can observe program-specific properties
    (through automated methods) as well as
    program-generic properties
  • Often, we must be able to first observe the program
    operating normally
  • Eraser: detecting data races [Savage et al. 1997]
  • Observe lock/unlock patterns around shared
    variables
  • If a variable usually protected by a lock/unlock or
    mutex is observed to have interleaved reads,
    report a violation
  • DIDUCE: inferring invariants, then detecting
    violations [Hangal & Lam 2002]
  • Start with a strict invariant (x is always 3)
  • Relax it as other values are seen (x is in [0,10])
  • Increase confidence in the invariant as more
    observations are seen
  • Report violations of invariants that have
    above-threshold confidence (see the sketch after
    this list)
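
A much-simplified sketch of the DIDUCE-style idea for a single numeric
variable: start with the strictest invariant (the first value seen), widen the
allowed range as new values arrive, and report a violation only once the
invariant has accumulated enough confidence. The confidence measure and
threshold here are illustrative, not DIDUCE's actual formulas.

    // Simplified invariant tracker for one numeric program variable.
    class RangeInvariant {
        private double lo = Double.NaN, hi = Double.NaN;  // strict at first: a single value
        private long confidence = 0;                      // consistent observations so far
        private final long reportThreshold;               // illustrative, e.g. 1000

        RangeInvariant(long reportThreshold) { this.reportThreshold = reportThreshold; }

        /** Returns true if this observation should be reported as an anomaly. */
        boolean observe(double x) {
            if (Double.isNaN(lo)) {                       // first observation: "x is always this value"
                lo = hi = x;
                confidence = 1;
                return false;
            }
            if (x >= lo && x <= hi) {                     // consistent: invariant gains confidence
                confidence++;
                return false;
            }
            // Violation: report only if the invariant was already well established;
            // either way, relax the invariant so the new value is allowed from now on.
            boolean report = confidence >= reportThreshold;
            lo = Math.min(lo, x);
            hi = Math.max(hi, x);
            confidence = 1;                               // illustrative: restart confidence after relaxing
            return report;
        }
    }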

48
Generic runtime monitoring techniques
  • What conditions are we monitoring for?
  • Fail-stop vs. Fail-silent vs. Fail-stutter
  • Byzantine failures
  • Generic methods
  • Heartbeats (what does loss of heartbeat mean?
    Who monitors them?)
  • Resource monitoring (what is abnormal?)
  • Application-specific monitoring: ask a question
    you know the answer to
  • Fault model enforcement
  • coerce all observed faults to an expected
    fault subset
  • if necessary, take additional actions to
    completely induce the fault
  • Simplifies recovery since there are fewer distinct
    cases
  • Avoids potential misdiagnosis of faults that have
    common symptoms
  • Note: may sometimes appear to make things worse
    (coerce a less-severe fault to a more-severe
    fault)
  • Doesn't exercise all parts of the system

49
Internet performance failure detection
  • Various approaches, all of which exploit the law
    of large numbers and (sort of) Central Limit
    Theorem (which is?)
  • Establish baseline of quantity to be monitored
  • Take observations, factor out data from known
    failures
  • Normalize to workload?
  • Look for significant deviations from baseline
  • What to measure?
  • Coarse-grain: number of reqs/sec
  • Finer-grain: number of TCP connections in
    ESTABLISHED, SYN_SENT, SYN_RCVD states
  • Even finer: additional internal request
    milestones
  • Hard to do in an application-generic way...but
    frameworks can save us

50
Example 1: Detection and recovery in SSM
  • 9 state statistics collected per second from
    each replica
  • Tarzan time-series analysis compares relative
    frequencies of substrings corresponding to the
    discretized time series (simplified sketch below,
    after the reference)
  • brick flagged anomalous when at least 6 stats are
    anomalous; works for aperiodic or irregular-period
    signals
  • robust against workload changes that affect all
    replicas equally and against highly-correlated
    metrics

Keogh et al., "Finding surprising patterns in a
time series database in linear time and space,"
SIGKDD 2002
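
A rough sketch of the flavor of this analysis (not Keogh et al.'s actual
algorithm, which uses suffix trees to achieve linear time): discretize each
per-second statistic into a symbol string, count short substrings, and flag a
replica whose substring frequencies differ sharply from a reference string
built from its peers. The discretization and scoring below are illustrative.

    // Flavor of discretized time-series comparison: discretize to symbols,
    // compare relative substring (k-gram) frequencies against a reference.
    import java.util.HashMap;
    import java.util.Map;

    class SubstringFrequencyComparator {
        private final int k;                              // substring length, e.g. 3

        SubstringFrequencyComparator(int k) { this.k = k; }

        /** Discretize a raw series into symbols a/b/c relative to its own mean. */
        static String discretize(double[] series) {
            double mean = 0;
            for (double v : series) mean += v;
            mean /= series.length;
            StringBuilder sb = new StringBuilder();
            for (double v : series) {
                sb.append(v < 0.5 * mean ? 'a' : (v < 1.5 * mean ? 'b' : 'c'));
            }
            return sb.toString();
        }

        private Map<String, Double> kgramFrequencies(String s) {
            Map<String, Double> freqs = new HashMap<>();
            int total = Math.max(1, s.length() - k + 1);
            for (int i = 0; i + k <= s.length(); i++) {
                freqs.merge(s.substring(i, i + k), 1.0 / total, Double::sum);
            }
            return freqs;
        }

        /** Larger score = this replica's pattern deviates more from the reference. */
        double surpriseScore(String replica, String reference) {
            Map<String, Double> f1 = kgramFrequencies(replica);
            Map<String, Double> f2 = kgramFrequencies(reference);
            double score = 0;
            for (Map.Entry<String, Double> e : f1.entrySet()) {
                score += Math.abs(e.getValue() - f2.getOrDefault(e.getKey(), 0.0));
            }
            for (Map.Entry<String, Double> e : f2.entrySet()) {
                if (!f1.containsKey(e.getKey())) score += e.getValue();
            }
            return score;   // a per-metric threshold, then "6 of 9 metrics", drives the reboot decision
        }
    }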
51
What faults does this handle?
  • Essentially 100% availability vs. injected
    faults
  • Node crash/hang/timeout/freeze
  • Fail-stutter: network loss (drop up to 70% of
    packets randomly)
  • Periodic slowdown (e.g., from garbage collection)
  • Persistent slowdown (one node lags the others)
  • Underlying (weak) assumption: most bricks are
    doing mostly the right thing most of the time
  • All anomalies can be safely coerced to crash
    faults
  • If reboot doesn't fix it, it didn't cost much to
    try
  • Human notified after a threshold number of
    restarts; the system has no concept of recovery
  • Allows SSM to be managed like a farm of stateless
    servers

52
Detecting anomalies in application logic
  • Goal: detect failures whose only obvious symptom
    is a change in the semantics of the application
  • Example: wrong item data displayed; wouldn't be
    caught by HTML scraping or HTTP logs
  • Typically, site responds to HTTP pings, etc.
    under such failures
  • These commonly result from exceptions of the form
    we injected into RUBiS
  • Insight: manifestation of bugs is the rare case,
    so capture the normal behavior of the system under
    no fault injection
  • Then detect threshold deviations from this
    baseline
  • Periodically move the baseline to allow for
    workload evolution

53
Patterns: Path shape analysis
  • Model paths as parse trees in a probabilistic CFG
  • Build the grammar under believed-normal conditions,
    then mark very unlikely paths as anomalous
  • after classification, build a decision tree to
    correlate path features (components touched) with
    anomalous paths (simplified sketch after this
    list)
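
A heavily simplified sketch of the path-scoring idea, using a first-order
Markov model over component-to-component calls rather than a full
probabilistic context-free grammar over path parse trees: learn transition
probabilities from believed-normal paths, then flag paths whose average
log-likelihood per transition is very low.

    // Simplified path-shape anomaly scoring (Markov chain over component calls;
    // the real approach fits a probabilistic CFG over path parse trees).
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class PathShapeModel {
        private final Map<String, Map<String, Integer>> transitionCounts = new HashMap<>();
        private final Map<String, Integer> totals = new HashMap<>();

        /** Training: observe a believed-normal path (sequence of component names). */
        void observeNormalPath(List<String> path) {
            for (int i = 0; i + 1 < path.size(); i++) {
                String from = path.get(i), to = path.get(i + 1);
                transitionCounts.computeIfAbsent(from, f -> new HashMap<>())
                                .merge(to, 1, Integer::sum);
                totals.merge(from, 1, Integer::sum);
            }
        }

        /** Average log-probability per transition; very negative = unlikely = anomalous. */
        double score(List<String> path) {
            double logProb = 0;
            int transitions = Math.max(1, path.size() - 1);
            for (int i = 0; i + 1 < path.size(); i++) {
                int from = totals.getOrDefault(path.get(i), 0);
                int edge = transitionCounts
                        .getOrDefault(path.get(i), Map.of())
                        .getOrDefault(path.get(i + 1), 0);
                // Add-one smoothing so unseen transitions are unlikely but not impossible.
                logProb += Math.log((edge + 1.0) / (from + totals.size() + 1.0));
            }
            return logProb / transitions;
        }
    }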

54
Patterns: Component interaction analysis
  • Model interactions between a component and its n
    neighbors in the dynamic call graph as a weighted
    DAG
  • compare to the observed call graph using a
    chi-squared goodness-of-fit test (sketch after
    this list)
  • can compare either across peers or against
    historical data
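
A small sketch of the chi-squared comparison for one component: compare how
the component's calls are distributed across its neighbors in the current
window against the historical (or peer-averaged) proportions. The critical
value would come from the chi-squared distribution with (neighbors - 1)
degrees of freedom; computing that p-value is omitted here.

    // Chi-squared goodness-of-fit between a component's observed interaction
    // counts (calls to each neighbor) and its expected proportions.
    import java.util.Map;

    class ComponentInteractionCheck {
        /**
         * observedCounts: neighbor -> number of calls in the current window.
         * expectedProportions: neighbor -> fraction of calls seen historically (sums to 1).
         */
        static double chiSquaredStatistic(Map<String, Long> observedCounts,
                                          Map<String, Double> expectedProportions) {
            long total = observedCounts.values().stream().mapToLong(Long::longValue).sum();
            double chi2 = 0;
            for (Map.Entry<String, Double> e : expectedProportions.entrySet()) {
                double expected = e.getValue() * total;
                double observed = observedCounts.getOrDefault(e.getKey(), 0L);
                if (expected > 0) {
                    chi2 += (observed - expected) * (observed - expected) / expected;
                }
            }
            return chi2;   // compare against a chi-squared critical value to flag the component
        }
    }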

55
Precision and recall (example)
  • Detection: recall = fraction of failures actually
    detected as anomalies
  • Strictly better than HTTP/HTML monitoring

(Chart: detection recall for faults affecting >1% of
the workload.)
  • Localization
  • recall = fraction of actually-faulty requests
    returned
  • precision = fraction of returned requests that are
    faulty = 1 - (FP rate)
  • Tradeoff between recall and precision (false
    positive rate)
  • Even the low-recall case corresponds to high
    detection recall (0.83)

56
Pinpoint: key results
  • Detect 89-96% of injected failures, compared to
    20-79% for HTML scraping and HTTP log monitoring
  • Limited success in detecting injected source bugs
  • Example success: caught a bug that prevented the
    shopping cart from iterating over its contents to
    display them, and correctly identified the at-fault
    component (where the bug was injected)
  • Resilient to normal workload changes
  • Because we bin the analysis by request category
  • Resilient to bug-fix-release code changes
  • Currently slow: analysis lags 20 s behind the
    application

57
Combining uRBs and Pinpoint
  • Simple recovery policy (sketch after this list)
  • uRB all components whose normalized anomaly score
    is >1.0
  • if we've already done that, reboot the whole
    application
  • More sophisticated policies are certainly possible
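
The simple policy above, as a sketch; the component-rebooter hook and the
escalate-to-full-reboot action are placeholders, not the actual JAGR code.

    // Simple escalation policy: microreboot everything that looks anomalous;
    // if that has already been tried for a component and it is still anomalous,
    // reboot the whole application.
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    interface ComponentRebooter { void microreboot(String componentName); }   // hypothetical hook

    class RecoveryPolicy {
        private final ComponentRebooter rebooter;
        private final Runnable fullApplicationReboot;     // placeholder escalation action
        private final Set<String> alreadyMicrorebooted = new HashSet<>();

        RecoveryPolicy(ComponentRebooter rebooter, Runnable fullApplicationReboot) {
            this.rebooter = rebooter;
            this.fullApplicationReboot = fullApplicationReboot;
        }

        void apply(Map<String, Double> normalizedAnomalyScores) {
            boolean escalate = false;
            for (Map.Entry<String, Double> e : normalizedAnomalyScores.entrySet()) {
                if (e.getValue() > 1.0) {
                    if (alreadyMicrorebooted.add(e.getKey())) {
                        rebooter.microreboot(e.getKey());     // cheap first line of defense
                    } else {
                        escalate = true;                      // uRB already tried for this one
                    }
                }
            }
            if (escalate) {
                fullApplicationReboot.run();
                alreadyMicrorebooted.clear();
            }
        }
    }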

58
Combining uRBs and Pinpoint
  • Example: data structure corruption in the
    SB_viewItem EJB
  • 350 simulated clients
  • 18.5 s to detect/localize
  • <1 s to repair
  • Note: the returned Web page would be valid but
    incorrect
  • Robust to typical workload changes and bug patches
  • More comprehensive deployment in progress

59
Faulty Request Identification
  • HTTP monitoring has perfect precision since it's
    a ground-truth indicator of a server fault
  • Path-shape analysis pulls more points out of the
    bottom left corner


61
Tolerating false positives in DStore
  • Metrics and algorithm comparable to those used in
    SSM
  • We inject fail-stutter behavior by increasing
    request latency
  • Bottom case: more aggressive detection also
    results in 2 unnecessary reboots
  • But they don't matter much if there is modest
    replication
  • Currently some voodoo constants for thresholds in
    both SSM and DStore
  • Recall that these are off-the-shelf algorithms;
    should be able to do better
  • Trade-off: earlier detection vs. false positives

62
Summary of case studies
  • Detection and localization: good even with
    simple algorithms; fits well with localized
    recovery
  • Performance penalty is tolerable and worth it
  • Note, microrecovery can also be used for
    microrejuvenation

63
Discussion
64
Discussion: What makes this work?
  • What made it work in our examples specifically?
  • Recovery speed: weaker consistency in SSM and
    DStore in exchange for fast recovery and
    predictable work done per request
  • Recovery correctness: J2EE apps are constrained to
    checkpoint by manipulating session state, and
    this is brought out in the app-writer-visible
    APIs; good isolation between components and
    relative lack of shared state
  • Anomaly detection: app behavior alternates short
    sequences of EJB calls with updates to persistent
    state, so it can be characterized in terms of those
    calls
  • Observations
  • Neither diagnosis ⇒ recovery nor recovery ⇒
    diagnosis
  • Localization ≠ diagnosis, but it's an important
    optimization

65
Why are statistical methods appealing?
  • Large complex systems tend to exercise a lot of
    their functionality in a fairly short amount of
    time
  • Especially Internet services, with high-volume
    workloads of largely independent requests
  • Even if we don't know what to measure,
    statistical and data mining techniques can help
    figure it out
  • Performance problems are often linked with
    dependability problems (fail-stutter behavior),
    for either HW or SW reasons
  • Most systems work well most of the time
  • Corollary: in a replicated system, replicas should
    behave the same way most of the time

66
When does it not work?
  • When SLT-based monitoring does not apply
  • Base-rate fallacy: monitored events so rare that
    the FP rate dominates
  • Gaming the system (deliberately or inadvertently)
  • When failures can't be cured by any kind of
    micro-recovery
  • Persistent-state corruption (or hardware failure)
  • Corrupted configuration data
  • a spectrum of "undo"
  • When you can't say no
  • Backpressure and the possibility of caller retry
    are used to improve predictability
  • Promising you will say yes may be
    difficult... the question may be whether end-to-end
    guarantees are needed at lower layers

67
SSM/DStore as extreme design points
  • Goal was to investigate extremes of no special
    recovery
  • Could explore erasure coding (RepStore does this
    dynamically)
  • Weakened consistency model of DStore vs. 2PC
  • Spread cost of repair lazily across many
    operations (rather than bulk recovery)
  • Spread some 2PC state maintenance to the client in
    the form of a "write in progress" cookie
  • May be that 2PC would be affordable, but we were
    interested in extreme design point of no special
    restart code

68
Role of 3-tier architecture
  • Separation of concerns: really, separation of
    process recovery (control flow) from data
    recovery
  • uRB and reboots recover processes; SSM, DStore,
    and traditional relational databases recover data
  • Not addressed is repair of data

69
Shouldn't we just make software better?
  • Yes we should (and many people are), but...
  • We use commodity HW/SW, despite the fact that
    they are imperfect, less reliable than hardened
    or purpose-built components, etc. Why?
  • Price/performance follows volume
  • Allows specialization of efforts and composition
    of reusable building blocks (vs. building
    stovepipe system)
  • In short, it allows much faster overall pace of
    innovation and deployment, for both technical and
    economic reasons, even though the components
    themselves are imperfect
  • We should assume "commodity programmers" too
    (an observation from Brewster Kahle)
  • Give as much generic support to the application as
    we can

70
Challenges and open issues
  • Algorithm issues that impinge on systems work
  • Hand-tuned constants/thresholds in algorithms;
    this seems to be an issue in other applications of
    SLT as well
  • Online vs. offline algorithms
  • Stability of closed loop
  • Systems issues
  • How do you know you've checkpointed all
    important state, or that something is safe to
    retry?
  • How do you debug a moving target? Traditional
    methods/tools are confounded by code obfuscation,
    sudden loss of transient program state (stack and
    heap), etc. (a great PhD thesis...)
  • debugging today's real systems is already hard
    for these reasons
  • Real apps, faultloads, best practices, etc. are
    hard to get!

71
RADS message in a nutshell
Statistical techniques can identify interesting
features and relationships in large datasets, but
there is a frequent tradeoff between detection rate
(or detection time) and false positives.
Make micro-recovery so inexpensive that
occasional false positives don't matter.
  • Achievable now on realistic applications and
    workloads
  • Synergistic with componentized apps and frameworks
  • Specific point of leverage for collaboration with
    machine learning research; lots of headroom for
    improvement
  • Even simple algorithms show encouraging initial
    results

72
Project possibilities
73
BACKUP SLIDES