Microreboot: A Technique for Cheap Recovery

Provided by: sunj2

1
Microreboot: A Technique for Cheap Recovery

George Candea, ..., Armando Fox

2
1. Introduction

Software bugs
Pros and Cons of Reboot
Microreboot
General conditions for microreboot
Gains from microreboot

3
2. Designing Microrebootable Software

The character of workloads faced by Internet
services
Three Design goals
fast and correct component recovery
strongly-localized recovery
fast and correct reintegration of recovered
components
Crash-only design approach
The complete separation of data recovery from
application recovery

4
3. A Microrebootable Prototype

Prototype based on the JBoss J2EE application server and RUBiS
Microreboot Machinery
destroy the EJB component and its associated threads, resources,
and metadata
the EJB's classloader is preserved
A Crash-Only Application
State segregation
Isolation and decoupling

5
4. Evaluation Framework

a client emulator
a fault injector
a system for automated failure detection,
diagnosis, and recovery

6
4.1 client emulator
7
4.2 fault injector

J2EE systems suffer from the following categories
of software-related failures
accidental use of null references (e.g., during
exception handling) that result in
NullPointerException
hung threads due to deadlocks, interminable
waits, etc.
bug-induced corruption of volatile metadata
leak-induced resource exhaustion
various other Java exceptions and errors that are
not handled correctly
low-level faults injected underneath the JVM using both FIG and FAUmachine
memory and register bit flips
disk block errors
network packet drops
erroneous returns from system calls for memory
allocation and input/output

8
4.3 failure detection, diagnosis, and recovery

failure detection in the client emulator
recovery manager (RM)
action-weighted throughput (Taw)

9
5. Evaluation Results

Are microreboots effective in recovering from
failures?
Are microreboots any better than JVM restarts?
Are microreboots useful in clusters?
Do microreboot-friendly architectures incur a
performance overhead?

10
5.1 Effective in recovering from failures
11
Table continued
12
Table continued
13
5.2 better than JVM restarts
14
5.2 Continued

At t = 10 min, corrupt the transaction method map
for EntityGroup, the EJB recovery group that
takes the longest to recover.
At t = 20 min, corrupt the JNDI entry for
RegisterNewUser, the next-slowest to recover.
At t = 30 min, inject a transient exception in
BrowseCategories, the entry point for all
browsing (thus, the most frequently called EJB in
our workload).
Overall, 11,752 requests (3,101 actions) failed
when recovering with a process restart, shown in
the top graph; 233 requests (34 actions) failed
when recovering by microrebooting one or more
EJBs. Thus, the average is 3,917 failed requests
(1,034 actions) per process restart, and 78
failed requests (11 actions) per microreboot of
one or more EJBs.
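
The per-recovery averages quoted on this slide can be reproduced by dividing each total by the three injected faults. The following is a small arithmetic check, not code from the presentation; the variable and function names are illustrative.

```python
# Totals reported on the slide, accumulated over the three injected faults
# (at 10, 20, and 30 minutes).
NUM_RECOVERIES = 3

totals = {
    "process restart": {"requests": 11_752, "actions": 3_101},
    "microreboot": {"requests": 233, "actions": 34},
}


def per_recovery_average(total, recoveries=NUM_RECOVERIES):
    """Average number of failures per recovery, rounded to the nearest integer."""
    return round(total / recoveries)


averages = {
    scheme: {kind: per_recovery_average(count) for kind, count in counts.items()}
    for scheme, counts in totals.items()
}

for scheme, avg in averages.items():
    print(f"{scheme}: {avg['requests']} requests, {avg['actions']} actions")
```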

15
5.2 Continued

Microreboots recover faster
recovery time distribution
Microreboots reduce functional disruption
Microreboots reduce lost work
with FastS, a JVM restart loses session state during recovery
with SSM, session state survives, but overall good Taw is lower
microreboots (with FastS) allowed the system to both
preserve session state across recovery and avoid
cross-JVM access penalties

16
5.3 Useful in Clusters

a cluster of 8 independent application server
nodes
using a client-side load balancer (LB)
failover under normal load
microreboots preserve cluster load dynamics

17
5.4 Performance Impact
18
6. A New Approach to Failure Management

Alternative Failover Schemes
microreboot without failover improves
user-perceived availability over failover and
microreboot
User-Transparent Recovery
Tolerating Lax Failure Detection
Averting Failure with Microrejuvenation
resource leaks are a major problem for many
large-scale Java applications

19
7. Limitations of Recovery by Microreboot

Impact on shared state
Interaction with external resources
Delaying a full reboot

20
8. Generalizing beyond Prototype

Biggest challenges
extricating session state handling from
application logic
ensuring that persistent state is updated with
transactions
microreboot systems design aspects
Isolation
Workload
Resources

21
Three-Tiered Architecture
22
EJB Container
23
Software bugs

Bugs are hard to eradicate, and hard to track down,
resolve, and fix at the time of failure.
It is mostly application-level failures that
bring down enterprise-scale software.
Many failures can be successfully recovered by
rebooting, even when the failure's root cause is
unknown.
Back

24
Pros and Cons of Reboot

a high-confidence way to reclaim stale or leaked
resources
does not rely on the correct functioning of the
rebooted system
easy to implement and automate
returns the software to its start state
Unexpected reboots can result in data loss and
unpredictable recovery times
Back

25
Microreboot

Individual rebooting of fine-grain application
components
The same benefits as whole-process restarts
An order of magnitude faster, with less lost work
Data recovery is completely separated from
(reboot-based) application recovery
Back

26
General conditions for microreboot

well-isolated
stateless components
keep all important application state in
specialized state stores
Back

27
Gains from microreboot

Can be attempted first
In multi-node clusters, a microreboot may be
preferable even over node failover
To rejuvenate a system by parts without shutting
down
Transparent call-level retries to mask a
microreboot from end users
Back
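
A call-level retry of the kind listed on this slide could look roughly like the sketch below. It is only an illustration: `ComponentUnavailable`, `call_with_retry`, and the `retry_after_s` hint are hypothetical names, and the presentation does not say how the prototype signals that a component is being microrebooted.

```python
import time


class ComponentUnavailable(Exception):
    """Hypothetical signal that the callee is currently being microrebooted."""

    def __init__(self, retry_after_s):
        super().__init__(f"component unavailable, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


def call_with_retry(call, max_attempts=3, sleep=time.sleep):
    """Invoke `call`, retrying while the target component is recovering.

    The caller's own client only sees the final result, so a short
    microreboot is masked unless it outlasts all attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ComponentUnavailable as exc:
            if attempt == max_attempts:
                raise  # recovery took too long; surface the failure
            sleep(exc.retry_after_s)


# Usage: a simulated component that is unavailable for its first two calls.
calls = {"count": 0}


def view_item():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ComponentUnavailable(retry_after_s=0.5)
    return "item page"


waits = []  # record the waits instead of actually sleeping
result = call_with_retry(view_item, sleep=waits.append)
print(result, waits)
```

Retrying is only safe for requests that can be reissued without side effects, which is why the deck lists retryable requests among the conditions for separating data recovery from application recovery.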

28
Crash-only design approach

programs that can be safely crashed in whole or
by parts and recover quickly every time
main points of our crash-only design approach
Fine-grain components
State segregation
Back

29
complete separation

shifts the burden of data management from the
often-inexperienced application writers to the
specialists who develop state stores.
conditions
Decoupling
Retryable requests
Leases
Back

30
State segregation

Persistent state
MySQL (132K items, 1.5M bids, 10K users)
Session state
FastS in-memory repository inside JBoss
SSM maintains state on separate machines
Static presentation data
GIFs, HTML, JSPs, etc.
Ext3FS filesystem
Back

31
failure detector in the client emulator

detect a service's user-visible failures
detector to check if a client encounters a
network-level error
detector to flag errors by comparing results
Back

32
recovery manager (RM)

performs simple failure diagnosis and recovers by
microrebooting EJBs -> the WAR -> all of eBid -> the JVM
-> rebooting the operating system
simple recursive recovery policy: try the
cheapest recovery first
Back
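
The recursive policy on this slide (try the cheapest recovery first, then escalate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the recovery manager's actual code: `recover`, `SimulatedSystem`, and the `reboot` / `is_healthy` callbacks are hypothetical, and only the ordering of recovery levels comes from the slide.

```python
# Recovery levels named on the slide, ordered from cheapest to most expensive.
RECOVERY_LEVELS = ["EJB", "WAR", "eBid", "JVM", "OS"]


def recover(levels, reboot, is_healthy):
    """Try the cheapest level first; recurse to costlier levels if the failure persists.

    Returns the level that cured the failure, or None if every level was tried.
    """
    if not levels:
        return None
    cheapest, *costlier = levels
    reboot(cheapest)
    if is_healthy():
        return cheapest
    return recover(costlier, reboot, is_healthy)


class SimulatedSystem:
    """Toy system whose fault is cured only by a reboot at `cured_by` or above."""

    def __init__(self, cured_by):
        self.cured_by = cured_by
        self.healthy = False
        self.attempts = []

    def reboot(self, level):
        self.attempts.append(level)
        if RECOVERY_LEVELS.index(level) >= RECOVERY_LEVELS.index(self.cured_by):
            self.healthy = True

    def is_healthy(self):
        return self.healthy


system = SimulatedSystem(cured_by="eBid")
resolved_at = recover(RECOVERY_LEVELS, system.reboot, system.is_healthy)
print(resolved_at, system.attempts)
```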

33
action-weighted throughput (Taw)

a session consists of actions; an action consists of operations
an action succeeds or fails atomically
if all operations succeed, they all count toward good Taw
if any operation fails, all of the action's operations count toward bad Taw
both long-running and short-running operations
must succeed for a user to be happy with the
service
when an action with many operations succeeds, it
generally means the user did more work than in a
short action
Back
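
The accounting rule for Taw can be made concrete with a short sketch. It assumes each action is represented as a list of per-operation success flags observed in one measurement interval; that representation and the function name are illustrative choices, not something the slides specify.

```python
from typing import List, Tuple


def action_weighted_counts(actions: List[List[bool]]) -> Tuple[int, int]:
    """Return (good, bad) operation counts for one measurement interval.

    Each action is a list of booleans, one per operation (True = succeeded).
    An action is atomic: a single failed operation makes every operation in
    that action count as bad.
    """
    good = bad = 0
    for operations in actions:
        if all(operations):
            good += len(operations)
        else:
            bad += len(operations)
    return good, bad


# Three actions: a 3-operation success, a 2-operation action with one
# failed operation, and a 1-operation success.
interval = [[True, True, True], [True, False], [True]]
good_ops, bad_ops = action_weighted_counts(interval)
print(good_ops, bad_ops)
```

Weighting by the number of operations is what makes a long successful action count for more than a short one, as the slide's last point notes.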

34
Microreboots recover faster
35
Continued
Back
36
recovery time distribution
Back
37
Microreboots reduce functional disruption
Back
38
Failover under normal load
when recovering with a JVM restart, on average 2,280
requests failed; in the case of microrebooting,
162 requests failed
Back
39
microreboots preserve cluster load dynamics
40
Continued

requests whose response times exceed 8 seconds

Back
41
User-Transparent Recovery
Back
42
Tolerating Lax Failure Detection

Tdet: the time to detect the failure
FPdet: the false positive rate
FNdet: the false negative rate
Cheap recovery relaxes the task of failure
detection
allows for longer Tdet
reduces the cost of a false positive

43
continued
Back
44
Averting Failure with Microrejuvenation

Available memory during microrejuvenation. Inject
a 2 KB/invocation leak in Item and a 250
KB/invocation leak in ViewItem. Malarm is set to
35% of the 1-GByte heap (thus 350 MB) and
Msufficient to 80% (800 MB).
Back
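
The two thresholds suggest a simple control loop, sketched below. The sketch assumes that Malarm is the low-memory level that triggers microrejuvenation and Msufficient the level at which it stops; the slide gives only the threshold values. The ordering of components by reclaimable memory, the function name, and the sample figures are illustrative assumptions as well.

```python
HEAP_MB = 1000                           # the slide treats the 1-GByte heap as 1000 MB
M_ALARM_MB = 35 * HEAP_MB // 100         # 350 MB: assumed trigger level
M_SUFFICIENT_MB = 80 * HEAP_MB // 100    # 800 MB: assumed stop level


def microrejuvenate(available_mb, reclaimable_mb):
    """Microreboot components until available memory reaches M_SUFFICIENT_MB.

    `reclaimable_mb` maps a component name to the memory (in MB) that
    microrebooting it is expected to free (hypothetical input).
    Returns (available memory afterwards, list of components rebooted).
    """
    rebooted = []
    if available_mb >= M_ALARM_MB:
        return available_mb, rebooted    # no alarm, nothing to do
    # Assumption: reboot the components expected to free the most memory first.
    by_gain = sorted(reclaimable_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, freed_mb in by_gain:
        if available_mb >= M_SUFFICIENT_MB:
            break
        available_mb += freed_mb
        rebooted.append(name)
    return available_mb, rebooted


# Usage with made-up figures: ViewItem leaks far faster than Item,
# so rebooting it alone is enough to get back above Msufficient.
after_mb, rebooted = microrejuvenate(300, {"Item": 40, "ViewItem": 520})
print(after_mb, rebooted)
```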
