Title: Detecting, Managing, and Diagnosing Failures with FUSE
1. Detecting, Managing, and Diagnosing Failures with FUSE
- John Dunagan, Juhan Lee (MSN), Alec Wolman
- WIP
2. Goals and Target Environment
- Improve the ability of large internet portals to
gain insight into failures
- Non-goals:
- masking failures
- using machine learning to infer abnormal behavior
3. MSN Background
- Messenger, www.msn.com, Hotmail, Search, many
other properties
- Large (> 100 million users)
- Sources of Complexity
- multiple data-centers
- large number of machines
- complex internal network topology
- diversity of applications and software
infrastructure
4. The Plan
- Detecting, managing, and diagnosing failures
- Review MSN's current approaches
- Describe our solution at a high level
5. Detecting Failures
- Monitor system availability with heartbeats
- Monitor application availability and quality of
service using synthetic requests
- Customer complaints
- Telephone, email
- Problems
- These approaches provide limited coverage: it is
harder to catch failures that don't affect every
request
- Data on detected failures often lacks the detail
necessary to suggest a remedy
- Which front end is flaky?
- Which app component caused the end-user failure?
6. Managing Failures
- Definition
- Ability to prioritize failures
- Detect component service degradation
- Characterizing app-stability
- Capacity planning
- When server x fails, what is the impact of this
failure?
- Better use of ops and engineering resources
- Current approach: no systematic attempt to
provide this functionality
7. Our solution (in 2 steps)
- Detecting and Managing Failures
- Step 1: Instrument applications to track user
requests across the service chain
- Each request is tagged with a unique id
- Service chain is composed on-the-fly with help of
app instrumentation
- For each request:
- Collect per-hop performance information
- Collect per-request failure status
- Centralized data collection
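
Step 1 can be illustrated with a minimal sketch. This is not the MSN implementation: `Collector`, `HopRecord`, `new_request_id`, and `run_hop` are hypothetical names, and an in-memory dictionary stands in for the centralized data collection point. It only shows the idea of tagging a request with a unique id and recording per-hop timing and failure status under that id.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    hop: str          # name of the component that handled the request
    elapsed_s: float  # time spent inside this hop, in seconds
    ok: bool          # whether this hop completed without raising


@dataclass
class Collector:
    """Hypothetical stand-in for the centralized data collection point."""
    records: dict = field(default_factory=dict)

    def report(self, request_id: str, record: HopRecord) -> None:
        self.records.setdefault(request_id, []).append(record)

    def failed(self, request_id: str) -> bool:
        # A request counts as failed if any hop reported a failure.
        return any(not r.ok for r in self.records.get(request_id, []))


def new_request_id() -> str:
    """Create the unique id that travels with the request along the chain."""
    return uuid.uuid4().hex


def run_hop(collector: Collector, request_id: str, hop: str, work):
    """Run one hop's work and report its timing and status for this request."""
    start = time.perf_counter()
    ok = True
    try:
        return work()
    except Exception:
        ok = False
        raise
    finally:
        elapsed = time.perf_counter() - start
        collector.report(request_id, HopRecord(hop, elapsed, ok))
```

Because every record is keyed by the request id, the records for one request can be assembled into its service chain after the fact, without any hop knowing the full chain in advance.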
8. What kinds of failures?
- We can handle
- Machine failures
- Network connectivity problems
- Most
- Misconfiguration
- Application bugs
- But not all
- Application errors where the app itself doesn't
detect that there is a problem
9. Diagnosing Failures
- Assigning responsibility to a specific hw or sw
component
- Insight into internals of a component
- Cross component interactions
- Current approach: instrument applications
- App-specific log messages
- Problems
- High request rates => log rollover
- Perceived overhead => detailed logging enabled
during testing, disabled in production
10. FUSE Background
- FUSE (OSDI 2004): lightweight agreement on only
one thing: whether or not a failure has occurred
- Lack of a positive ack => failure
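
The rule "lack of a positive ack => failure" can be shown with a toy model. This is not the FUSE protocol from the OSDI 2004 paper: `FailureGroup` and its methods are hypothetical, time is passed in explicitly so the behavior is deterministic, and the sketch only captures the single property named on the slide, that silence past a deadline is treated the same as an explicit failure.

```python
class FailureGroup:
    """Toy model: every member must positively ack before the deadline."""

    def __init__(self, members, timeout_s, now):
        self.pending = set(members)      # members that have not acked yet
        self.deadline = now + timeout_s  # after this, silence means failure
        self.failed = False

    def ack(self, member, now):
        # A late ack does not count; the group has already failed by then.
        if now <= self.deadline:
            self.pending.discard(member)

    def signal_failure(self):
        # Any participant may declare the failure explicitly.
        self.failed = True

    def status(self, now):
        if self.failed:
            return "failed"
        if not self.pending:
            return "ok"
        return "failed" if now > self.deadline else "undecided"
```

The useful consequence is that a crashed machine or a broken network link needs no special handling: either one simply prevents the ack from arriving, and the group is reported as failed.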
11. Step 2: Conditional Logging
- Step 2: Implement conditional logging to
significantly reduce the overhead of collecting
detailed logs across different machines in the
service chain
- Step 1 provides the ability to identify a request
across all participants in the service chain;
FUSE provides agreement on failure status across
that chain
- While fate is undecided: detailed log messages
are stored in main memory
- Common case: overhead of logging is vastly reduced
- Once the fate of the service chain is decided, we
discard app logs for successful requests and save
logs for failures
- Quantity of data generated is manageable when
most requests are successful
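
Conditional logging on a single machine can be sketched as follows. The class and method names are hypothetical, and `sink` is an assumed callable that stands in for wherever saved logs are written; the slides do not describe the actual interface. Detailed messages are buffered in memory per request id and are only written out once the request is known to have failed.

```python
from collections import defaultdict


class ConditionalLogger:
    """Buffer detailed logs per request; persist them only on failure."""

    def __init__(self, sink):
        self._buffers = defaultdict(list)  # request id -> buffered messages
        self._sink = sink                  # called as sink(request_id, messages)

    def log(self, request_id, message):
        # Fate still undecided: keep the message in main memory only.
        self._buffers[request_id].append(message)

    def decide(self, request_id, failed):
        # Fate decided: save the buffer for a failure, discard it for a success.
        messages = self._buffers.pop(request_id, [])
        if failed and messages:
            self._sink(request_id, messages)

    def pending(self):
        # Number of requests whose fate is still undecided.
        return len(self._buffers)
```

In the common case, where the request succeeds, the only cost is appending to an in-memory list and dropping it afterwards, which is why detailed logging can stay enabled in production.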
12. Example
- Benefits
- FUSE allows monitoring of real transactions.
- All transactions, or a sampled subset to control
overhead.
- When a request fails, FUSE provides an audit
trail
- How far did it get?
- How long did each step take?
- Any additional application specific context.
- FUSE can be deployed incrementally.
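
The slides do not say how the sampled subset is chosen. One possible approach, shown here purely as an assumption, is to derive the sampling decision from a hash of the request id, so that every machine in the service chain reaches the same decision for a given request without coordinating. `is_sampled` is a hypothetical helper.

```python
import hashlib


def is_sampled(request_id: str, rate: float) -> bool:
    """Return True if this request falls in the monitored subset.

    The decision depends only on the request id and the rate, so all hops
    agree on it. `rate` is the fraction of requests to monitor (0.0 to 1.0).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    # Integer comparison avoids float rounding at the extremes.
    return bucket < int(rate * 2**64)
```

A further property of this scheme is that raising the rate only adds requests to the monitored subset; any request sampled at a lower rate is still sampled at a higher one.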
13. Issues
- Overload policy: need to handle bursts of
failures without inducing more failures
- How much effort to make apps FUSE enabled?
- Are the right components FUSE enabled?
- Identifying and filtering false positives
- Tracking request flow is non-trivial with network
load balancers
14. Status
- We've implemented FUSE for MSN, integrated with
the ASP.NET rendering engine
- Testing in progress
- Roll-out at end of summer
15. Backups
16. FUSE is Easy to Integrate
Example current code on Front End:
    ReceiveRequestFromClient()
    SendRequestToBackEnd()
Example code on Front End using FUSE:
    ReceiveRequestFromClient(..., FUSEinfo f) // default value of f = null
    if ( f != null ) JoinFUSEGroup( f )
    SendRequestToBackEnd(..., f )
Current implementation is in C, and consists of 2400 LOC