Detecting, Managing, and Diagnosing Failures with FUSE


1
Detecting, Managing, and Diagnosing Failures with FUSE
  • John Dunagan, Juhan Lee (MSN), Alec Wolman
  • WIP

2
Goals & Target Environment
  • Improve the ability of large internet portals to
    gain insight into failures
  • Non-goals
    • masking failures
    • using machine learning to infer abnormal
      behavior

3
MSN Background
  • Messenger, www.msn.com, Hotmail, Search, many
    other properties
  • Large (> 100 million users)
  • Sources of complexity
    • multiple data-centers
    • large number of machines
    • complex internal network topology
    • diversity of applications and software
      infrastructure

4
The Plan
  • Detecting, managing, and diagnosing failures
  • Review MSN's current approaches
  • Describe our solution at a high level

5
Detecting Failures
  • Monitor system availability with heartbeats
  • Monitor application availability & quality of
    service using synthetic requests
  • Customer complaints
    • Telephone, email
  • Problems
    • These approaches provide limited coverage: it
      is harder to catch failures that don't affect
      every request
    • Data on detected failures often lacks the
      detail needed to suggest a remedy
      • which front end is flaky?
      • which app component caused the end-user
        failure?

6
Managing Failures
  • Definition
  • Ability to prioritize failures
  • Detect component service degradation
  • Characterizing app-stability
  • Capacity planning
  • When server x fails, what is the impact of this
    failure?
  • Better use of ops and engineering resources
  • Current approach: no systematic attempt to
    provide this functionality

7
Our solution (in 2 steps)
  • Detecting and Managing Failures
  • Step 1: Instrument applications to track user
    requests across the service chain
  • Each request is tagged with a unique id
  • Service chain is composed on-the-fly with help of
    app instrumentation
  • For each request
    • Collect per-hop performance information
    • Collect per-request failure status
  • Centralized data collection
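
Step 1 can be pictured with a small sketch. Everything named here (COLLECTOR, new_request_id, record_hop, front_end, back_end) is hypothetical and not taken from the MSN implementation; an in-process dictionary stands in for the centralized collector, and Python is used only for illustration.

```python
import time
import uuid
from collections import defaultdict

# Hypothetical stand-in for centralized data collection:
# request id -> list of per-hop records.
COLLECTOR = defaultdict(list)


def new_request_id():
    """Tag each incoming request with a unique id."""
    return uuid.uuid4().hex


def record_hop(request_id, hop, func, *args):
    """Run one hop of the service chain and report its timing and status."""
    start = time.perf_counter()
    failed = False
    try:
        return func(*args)
    except Exception:
        failed = True
        raise
    finally:
        COLLECTOR[request_id].append({
            "hop": hop,
            "elapsed_s": time.perf_counter() - start,
            "failed": failed,
        })


def back_end(query):
    return query.upper()


def front_end(request_id, query):
    # The id travels with the request, so the chain is composed on the fly.
    return record_hop(request_id, "back_end", back_end, query)


rid = new_request_id()
result = record_hop(rid, "front_end", front_end, rid, "hello")
```

After the call, COLLECTOR[rid] holds one record per hop, innermost hop first, which is the raw material for per-hop performance and per-request failure reports.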

8
What kinds of failures?
  • We can handle
    • Machine failures
    • Network connectivity problems
  • Most
    • Misconfiguration
    • Application bugs
  • But not all
    • Application errors where the app itself doesn't
      detect that there is a problem

9
Diagnosing Failures
  • Assigning responsibility to a specific hw or sw
    component
  • Insight into internals of a component
  • Cross component interactions
  • Current approach: instrument applications
    • App-specific log messages
  • Problems
    • High request rates => log rollover
    • Perceived overhead => detailed logging enabled
      during testing, disabled in production

10
FUSE Background
  • FUSE (OSDI 2004): lightweight agreement on only
    one thing: whether or not a failure has occurred
  • Lack of a positive ack => failure
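
The "lack of a positive ack => failure" rule can be illustrated with a toy model. This is only a sketch: the FuseGroup class, its method names, and the explicit timeout are assumptions made for illustration, not the FUSE API, and the real protocol runs across machines rather than in one process.

```python
class FuseGroup:
    """Toy model: the group fails if any member's last positive ack is too old."""

    def __init__(self, members, timeout_s, now):
        self.timeout_s = timeout_s
        # Treat group creation time as every member's first ack.
        self.last_ack = {member: now for member in members}
        self.failed = False

    def ack(self, member, now):
        """Record a positive acknowledgement from one member."""
        self.last_ack[member] = now

    def check(self, now):
        """Return the group's failure status at time `now`.

        Assumption of this sketch: once failure is declared, it is
        never revoked by later acks.
        """
        if any(now - t > self.timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed


group = FuseGroup(["front_end", "back_end"], timeout_s=2.0, now=0.0)
group.ack("front_end", 1.0)
group.ack("back_end", 1.0)
ok_at_2_5 = group.check(2.5)      # both acks are 1.5 s old: no failure
group.ack("front_end", 3.0)
failed_at_3_5 = group.check(3.5)  # back_end ack is 2.5 s old: failure
group.ack("back_end", 4.0)
still_failed = group.check(4.1)   # a late ack does not undo the failure
```

Time is passed in explicitly so the behavior is deterministic; all members only ever need to agree on the single boolean that check() returns.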

11
Step 2: Conditional Logging
  • Step 2: Implement conditional logging to
    significantly reduce the overhead of collecting
    detailed logs across different machines in the
    service chain
  • Step 1 provides the ability to identify a request
    across all participants in the service chain;
    FUSE provides agreement on failure status across
    that chain
  • While fate is undecided: detailed log messages
    are stored in main memory
    • Common-case overhead of logging is vastly
      reduced
  • Once the fate of the service chain is decided, we
    discard app logs for successful requests and save
    logs for failures
    • Quantity of data generated is manageable when
      most requests are successful
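
The buffer-then-decide behavior described on this slide might look like the following minimal sketch. The ConditionalLogger class and its method names are hypothetical, and a plain dictionary stands in for whatever durable store the failure logs would be saved to.

```python
class ConditionalLogger:
    """Buffer detailed logs in memory until a request's fate is decided."""

    def __init__(self):
        self.pending = {}  # request id -> messages buffered in main memory
        self.saved = {}    # request id -> messages kept for failed requests

    def log(self, request_id, message):
        """Cheap common case: append to an in-memory buffer only."""
        self.pending.setdefault(request_id, []).append(message)

    def decide(self, request_id, failed):
        """Apply the agreed fate: keep logs on failure, drop them on success."""
        messages = self.pending.pop(request_id, [])
        if failed:
            self.saved[request_id] = messages


logger = ConditionalLogger()
logger.log("req-1", "front end: received request")
logger.log("req-1", "back end: query ok")
logger.log("req-2", "front end: received request")
logger.log("req-2", "back end: timeout")

logger.decide("req-1", failed=False)  # successful request: logs discarded
logger.decide("req-2", failed=True)   # failed request: logs saved
```

Only req-2 leaves anything behind, so the volume of saved data tracks the failure rate rather than the request rate.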

12
Example
  • Benefits
  • FUSE allows monitoring of real transactions
    • All transactions, or a sampled subset to
      control overhead
  • When a request fails, FUSE provides an audit
    trail
    • How far did it get?
    • How long did each step take?
    • Any additional application-specific context
  • FUSE can be deployed incrementally
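
One way to monitor "a sampled subset" is to derive the sampling decision from the request id, so every machine in the service chain reaches the same answer without coordinating. The slides do not say how sampling is done; this hash-based scheme and the name is_sampled are assumptions for illustration.

```python
import hashlib


def is_sampled(request_id: str, rate: float) -> bool:
    """Return True for roughly `rate` (0.0 to 1.0) of all request ids.

    The decision depends only on the id, so it is the same on every hop.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


ids = [f"req-{i}" for i in range(10000)]
sampled_fraction = sum(is_sampled(i, 0.1) for i in ids) / len(ids)
```

With a rate of 0.1, sampled_fraction comes out close to one in ten, and raising or lowering the rate is a single configuration change.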

13
Issues
  • Overload policy: need to handle bursts of
    failures without inducing more failures
  • How much effort to make apps FUSE enabled?
  • Are the right components FUSE enabled?
  • Identifying and filtering false positives
  • Tracking request flow is non-trivial with network
    load balancers

14
Status
  • We've implemented FUSE for MSN, integrated with
    the ASP.NET rendering engine
  • Testing in progress
  • Roll-out at end of summer

15
Backups

16
FUSE is Easy to Integrate
  • Example: current code on Front End
      ReceiveRequestFromClient()
      SendRequestToBackEnd()
  • Example: code on Front End using FUSE
      ReceiveRequestFromClient(..., FUSEinfo f) // default value of f is null
      if ( f != null ) JoinFUSEGroup( f )
      SendRequestToBackEnd(..., f )
  • Current implementation is in C, and consists of
    2400 LOC