Title: Microreboot: A Technique for Cheap Recovery
- George Candea, ..., Armando Fox
1. Introduction
- Software bugs
- Pros and cons of reboot
- Microreboot
- General conditions for microreboot
- Gains from microreboot
2. Designing Microrebootable Software
- The character of workloads faced by Internet services
- Three design goals
  - fast and correct component recovery
  - strongly-localized recovery
  - fast and correct reintegration of recovered components
- Crash-only design approach
- The complete separation of data recovery from application recovery
3. A Microrebootable Prototype
- Prototype based on the J2EE application server JBoss and on RUBiS
- Microreboot Machinery
  - kills the EJB component and its associated threads, resources, and metadata
  - preserves the EJB's classloader
- A Crash-Only Application
  - State segregation
  - Isolation and decoupling
4. Evaluation Framework
- a client emulator
- a fault injector
- a system for automated failure detection,
diagnosis, and recovery
4.1 Client emulator
4.2 Fault injector
- J2EE systems suffer from the following categories of software-related failures
  - accidental use of null references (e.g., during exception handling) that result in NullPointerException
  - hung threads due to deadlocks, interminable waits, etc.
  - bug-induced corruption of volatile metadata
  - leak-induced resource exhaustion
  - various other Java exceptions and errors that are not handled correctly
- used both FIG and FAUmachine (under the JVM)
  - memory and register bit flips
  - disk block errors
  - network packet drops
  - erroneous returns from system calls for memory allocation and input/output
4.3 Failure detection, diagnosis, and recovery
- failure detection in the client emulator
- recovery manager (RM)
- action-weighted throughput (Taw)
5. Evaluation Results
- Are microreboots effective in recovering from failures?
- Are microreboots any better than JVM restarts?
- Are microreboots useful in clusters?
- Do microreboot-friendly architectures incur a performance overhead?
5.1 Effective in recovering from failures
5.2 Better than JVM restarts
- At t = 10 min, corrupt the transaction method map for EntityGroup, the EJB recovery group that takes the longest to recover.
- At t = 20 min, corrupt the JNDI entry for RegisterNewUser, the next-slowest in recovery.
- At t = 30 min, inject a transient exception in BrowseCategories, the entry point for all browsing (thus, the most-frequently called EJB in our workload).
- Overall, 11,752 requests (3,101 actions) failed when recovering with a process restart (shown in the top graph), whereas 233 requests (34 actions) failed when recovering by microrebooting one or more EJBs. Thus, the average is 3,917 failed requests (1,034 actions) per process restart, and 78 failed requests (11 actions) per microreboot of one or more EJBs.
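The per-recovery averages quoted above can be checked by dividing each total by the number of recoveries. The sketch below assumes one recovery per injected fault (three in total, at t = 10, 20, and 30 min); the variable and function names are illustrative only.

```python
# Totals reported on the slide for the three-fault experiment.
totals = {
    "process restart": {"requests": 11_752, "actions": 3_101},
    "microreboot": {"requests": 233, "actions": 34},
}
RECOVERIES = 3  # assumption: one recovery per injected fault


def per_recovery(total: int, recoveries: int = RECOVERIES) -> int:
    """Average failures per recovery, rounded to the nearest integer."""
    return round(total / recoveries)


averages = {
    scheme: {unit: per_recovery(count) for unit, count in counts.items()}
    for scheme, counts in totals.items()
}

for scheme, avg in averages.items():
    print(f"{scheme}: {avg['requests']} requests, {avg['actions']} actions per recovery")
```

Under that assumption the rounded results match the slide: 3,917 requests (1,034 actions) per process restart and 78 requests (11 actions) per microreboot.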
- Microreboots recover faster
- recovery time distribution
- Microreboots reduce functional disruption
- Microreboots reduce lost work
  - with FastS, session state is lost during a JVM restart
  - with SSM, session state survives, but overall good Taw is lower
  - microreboots (with FastS) allowed the system to both preserve session state across recovery and avoid cross-JVM access penalties
5.3 Useful in Clusters
- a cluster of 8 independent application server nodes
- using a client-side load balancer (LB)
- failover under normal load
- microreboots preserve cluster load dynamics
5.4 Performance Impact
6. A New Approach to Failure Management
- Alternative Failover Schemes
  - microreboot without failover improves user-perceived availability over failover combined with microreboot
- User-Transparent Recovery
- Tolerating Lax Failure Detection
- Averting Failure with Microrejuvenation
  - resource leaks are a major problem for many large-scale Java applications
7. Limitations of Recovery by Microreboot
- Impact on shared state
- Interaction with external resources
- Delaying a full reboot
8. Generalizing beyond the Prototype
- Biggest challenges
  - extricating session state handling from application logic
  - ensuring that persistent state is updated with transactions
- microreboot systems design aspects
  - Isolation
  - Workload
  - Resources
Three-Tiered Architecture
EJB Container
Software bugs
- Bugs are hard to eradicate, track down, resolve, and fix at the time of failure.
- It is mostly application-level failures that bring down enterprise-scale software.
- Many failures can be successfully recovered by rebooting, even when the failure's root cause is unknown.
Pros and cons of reboot
- high-confidence way to reclaim stale or leaked resources
- does not rely on the correct functioning of the rebooted system
- easy to implement and automate
- returns the software to its start state
- Unexpected reboots can result in data loss and unpredictable recovery times
Microreboot
- Individual rebooting of fine-grain application components
- The same benefits as whole-process restarts
- An order of magnitude faster, with less lost work
- Data recovery is completely separated from (reboot-based) application recovery
General conditions for microreboot
- well-isolated
- stateless components
- keep all important application state in specialized state stores
Gains from microreboot
- Can be attempted first
- In multi-node clusters, a microreboot may be preferable even over node failover
- Can rejuvenate a system by parts without shutting it down
- Transparent call-level retries can mask a microreboot from end users
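A call-level retry of the kind mentioned above might look like the following minimal sketch. It assumes the call is idempotent and that a component under recovery signals callers with an exception carrying a suggested wait; `ComponentRebooting`, `call_with_retry`, and `browse_categories` are hypothetical names, not part of the prototype.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ComponentRebooting(Exception):
    """Hypothetical signal that the callee is in the middle of a microreboot."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


def call_with_retry(call: Callable[[], T], max_attempts: int = 3) -> T:
    """Retry an idempotent call while the target component recovers."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ComponentRebooting as exc:
            if attempt == max_attempts:
                raise                      # give up and surface the failure
            time.sleep(exc.retry_after_s)  # wait out the microreboot


# Usage: the first call hits a rebooting component, the second succeeds.
attempts = []


def browse_categories() -> str:
    attempts.append(1)
    if len(attempts) == 1:
        raise ComponentRebooting(retry_after_s=0.01)
    return "categories page"


page = call_with_retry(browse_categories)
print(page, len(attempts))
```

Because the retry happens below the user-facing layer, the end user sees only a slightly slower response rather than an error.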
Crash-only design approach
- programs that can be safely crashed in whole or by parts and recover quickly every time
- main points of our crash-only design approach
  - Fine-grain components
  - State segregation
complete separation
- shifts the burden of data management from the often-inexperienced application writers to the specialists who develop state stores.
- conditions
  - Decoupling
  - Retryable requests
  - Leases
State segregation
- Persistent state
  - MySQL (132K items, 1.5M bids, 10K users)
- Session state
  - FastS: in-memory repository inside JBoss
  - SSM: maintains state on separate machines
- Static presentation data
  - GIFs, HTML, JSPs, etc.
  - Ext3FS filesystem
failure detector in the client emulator
- detect a service's user-visible failures
- detector to check if a client encounters a network-level error
- detector to flag errors by comparing results
recovery manager (RM)
- performs simple failure diagnosis and recovers
  - microrebooting EJBs → the WAR → all of eBid → the JVM → rebooting the operating system
- simple recursive recovery policy trying the cheapest recovery first
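The cheapest-first policy can be sketched as an escalation ladder, written here as a loop rather than recursion. The `reboot` and `is_healthy` hooks are hypothetical stand-ins for the real recovery actions and failure detector, and the level names simply mirror the slide.

```python
from typing import Callable, Sequence

# Recovery levels ordered from cheapest to most expensive, as on the slide.
LEVELS = ["EJB", "WAR", "eBid application", "JVM", "operating system"]


def recover(reboot: Callable[[str], None],
            is_healthy: Callable[[], bool],
            levels: Sequence[str] = LEVELS) -> str:
    """Try the cheapest reboot first and escalate until the failure clears.

    `reboot` and `is_healthy` are hypothetical hooks supplied by the caller.
    Returns the level whose reboot cured the failure.
    """
    for level in levels:
        reboot(level)
        if is_healthy():
            return level
    raise RuntimeError("failure persists after rebooting the operating system")


# Usage: simulate a fault that only a JVM restart (or anything larger) cures.
attempted = []
cured_by = recover(
    reboot=attempted.append,
    is_healthy=lambda: attempted[-1] in ("JVM", "operating system"),
)
print(attempted)
print(cured_by)
```

Ordering the levels by cost means a wrong diagnosis wastes only a cheap recovery attempt before the next, more drastic one is tried.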
action-weighted throughput (Taw)
- a session consists of actions; an action consists of operations
- an action succeeds or fails atomically
  - if all operations succeed, they count toward good Taw
  - if any operation fails, all of the action's operations count toward bad Taw
- both long-running and short-running operations must succeed for a user to be happy with the service
- when an action with many operations succeeds, it generally means the user did more work than in a short action
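The atomic success rule for actions can be illustrated with a small counting sketch. It tallies operations over a whole observation window; dividing by the window length would turn the counts into a rate. The function name and the sample session are made up for illustration.

```python
from typing import Dict, List


def action_weighted_throughput(actions: List[List[bool]]) -> Dict[str, int]:
    """Count operations toward good or bad Taw, one whole action at a time.

    Each action is a list of per-operation outcomes (True = succeeded).
    """
    good = bad = 0
    for operations in actions:
        if all(operations):
            good += len(operations)   # every operation succeeded
        else:
            bad += len(operations)    # one failure taints the whole action
    return {"good": good, "bad": bad}


session = [
    [True, True, True],          # 3-operation action, all succeed
    [True, False, True, True],   # 4-operation action, one failure
    [True],                      # single-operation action
]
result = action_weighted_throughput(session)
print(result)  # {'good': 4, 'bad': 4}
```

Note how the single failed operation in the second action moves all four of its operations into the bad count, which is what makes the metric weight long actions more heavily.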
Microreboots recover faster
recovery time distribution
Microreboots reduce functional disruption
Failover under normal load
When recovering with a JVM restart, 2,280 requests failed on average; when microrebooting, 162 requests failed.
microreboots preserve cluster load dynamics
- requests whose response times exceed 8 seconds
User-Transparent Recovery
Tolerating Lax Failure Detection
- Tdet: the time to detect the failure
- FPdet: false positive rate
- FNdet: false negative rate
- Cheap recovery relaxes the task of failure detection
  - allows for longer Tdet
  - reduces the cost of a false positive
Averting Failure with Microrejuvenation
- Available memory during microrejuvenation. Inject a 2 KB/invocation leak in Item and a 250 KB/invocation leak in ViewItem. Malarm is set to 35% of the 1-GByte heap (thus 350 MB) and Msufficient to 80% (800 MB).
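One way to read the two thresholds is as a trigger and a stopping condition: rejuvenation starts when available memory falls below Malarm (350 MB) and continues until it climbs back above Msufficient (800 MB). The sketch below is only an illustration under that reading. The per-component reclaim figures and the biggest-first ordering are assumptions, and a real service would observe reclaimed memory rather than know it in advance.

```python
from typing import Dict, List

HEAP_MB = 1000                          # the slide treats the 1-GByte heap as 1,000 MB
M_ALARM_MB = HEAP_MB * 35 // 100        # 350 MB: start rejuvenating below this
M_SUFFICIENT_MB = HEAP_MB * 80 // 100   # 800 MB: stop once this much is available


def microrejuvenate(available_mb: int, leaked_mb: Dict[str, int]) -> List[str]:
    """Return the components to microreboot, in order.

    `leaked_mb` maps a component name to the memory its microreboot is
    expected to release (hypothetical figures supplied by the caller).
    """
    if available_mb >= M_ALARM_MB:
        return []                       # no alarm, nothing to do
    rebooted = []
    # Reboot the biggest expected reclaimers first (an assumed ordering).
    for name in sorted(leaked_mb, key=leaked_mb.get, reverse=True):
        if available_mb >= M_SUFFICIENT_MB:
            break
        available_mb += leaked_mb[name]
        rebooted.append(name)
    return rebooted


plan = microrejuvenate(340, {"Item": 60, "ViewItem": 450, "BrowseCategories": 5})
print(plan)
```

With these made-up numbers, rebooting ViewItem alone brings available memory to 790 MB, still short of 800 MB, so Item is rebooted as well and the loop stops before touching BrowseCategories.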