1
A Planning-Based Approach to Failure Recovery in
Distributed Systems

Naveed Arshad
Advisors
Alexander L. Wolf
Dennis M. Heimbigner

2
Overview

Problem
High availability of computer systems is
difficult
Goal
A fast, automated way to recover systems from
failures
Solution
AI planning-based automated failure recovery
mechanism

3
Problem of High Availability

Availability and reliability are important for
dependability in computer systems
Failures disrupt the availability and reliability
of computer systems.
A single hour of downtime causes revenue loss that
ranges from $200,000 for e-commerce sites to
$6,000,000 for brokerage firms.
Eight hours to eighty hours per year of downtime
in systems publicized as highly available.
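
As a rough worked example of the figures above, the sketch below converts yearly downtime into an availability percentage and a yearly revenue loss. It assumes a 365-day year and treats the quoted per-hour loss as constant; the function names are illustrative and do not come from the presentation.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year during which the system is up."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def yearly_loss(downtime_hours_per_year: float, loss_per_hour: float) -> float:
    """Revenue lost per year, assuming a constant loss per hour of downtime."""
    return downtime_hours_per_year * loss_per_hour


for hours in (8, 80):
    print(f"{hours:>2} h/year of downtime: {availability(hours):.2%} available")
```

Under these assumptions, eight hours of downtime per year at the lower per-hour figure already costs 1,600,000 per year, which is why the deck targets recovery time.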

4
Enemies of High Availability

Primary cause is operator errors
Study by Gray in 1986
Cause of Downtime
Operator Errors: 42%
Study by Oppenheimer and colleagues in 2003
Causes of Downtime
Hardware: 15%
Software: 34%
Operator Errors: 51%
50% of operator errors are configuration errors
Most of the configuration errors are due to
enormous levels of complexity in computer systems
Systems need automation

5
Common Approaches to High Availability

Redundancy
A good alternative because of cheap hardware.
But is it really cost effective?
Total Cost of Ownership (TCO)
Administrative costs, software cost,
configuration costs, etc.
In reality TCO is 5-10 times higher than cost of
hardware
Failure Recovery Scripts
Manually written
Not practical to write scripts for each failure
scenario
Time, cost, resource optimizations not possible
because of unforeseen failures

6
Contributions

AI Planning-based Dynamic Reconfiguration
High-level representation of configuration
objectives using AI planning
System automatically performs low-level
operations needed to satisfy the high-level goals
Optimization of time, cost and resource usage
Automated Failure Recovery
System detects and corrects anomalies by itself
Development of an end-to-end automated failure
recovery technique using dynamic reconfiguration

7
Related Work Classification
8
Related Work (Control Theory)

Monitor
Sensors for system state
Analyze
System state is measured against a reference
model
If the state is not within limits,
planning is initiated
Plan
A plan is calculated to bring the system within
reference model bounds
Execute
The plan is applied on the system to bring the
desired change
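
The four phases above can be sketched as a single control step. This is a minimal illustration, not the mechanism of any cited system: the `ReferenceModel` class, the scalar sensor reading, and the action names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReferenceModel:
    """Acceptable range for one measured quantity (hypothetical)."""
    low: float
    high: float

    def within_limits(self, value: float) -> bool:
        return self.low <= value <= self.high


def control_step(
    read_sensor: Callable[[], float],
    reference: ReferenceModel,
    make_plan: Callable[[float, ReferenceModel], List[str]],
    execute: Callable[[str], None],
) -> bool:
    """Run one Monitor-Analyze-Plan-Execute cycle; return True if a plan ran."""
    state = read_sensor()                        # Monitor
    if reference.within_limits(state):           # Analyze
        return False
    for action in make_plan(state, reference):   # Plan
        execute(action)                          # Execute
    return True


# Usage: a load reading of 0.95 against a 0.0-0.8 reference triggers a plan.
executed: List[str] = []
ran = control_step(
    read_sensor=lambda: 0.95,
    reference=ReferenceModel(low=0.0, high=0.8),
    make_plan=lambda state, ref: ["start_replica", "rebalance_load"],
    execute=executed.append,
)
```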

9
Related Work (Recovery-Oriented Computing)

Fault tolerance alone cannot achieve high availability,
so let the system fail and recover quickly afterwards
Focus is to improve Mean Time To Repair (MTTR)
instead of Mean Time To Fail (MTTF).
Microreboot: reboot at finer levels
Undo/redo: an administrator can undo a
configuration action

10
Related Work (Architecture-Based Recovery)

A system is described in terms of an
architectural model
Changes are expressed in terms of an
architectural language
Changes in the architecture are analyzed for any
possible mismatches
Change plans are developed to change the
architecture of the system

11
Related Work (Miscellaneous Approaches)

Biologically inspired self-healing system
Environmental Awareness
Adaptation
Decentralization
Redundancy
Task-based
Change based on user preferences

12
Scope and Assumptions

Stateless components
Fail-stop behavior
Reliable network
Application-level recovery
Permanent failed state of hardware
Failure recovery does not introduce worse
failures
Failure recovery system does not fail
Configurations are correctly specified
Database failures are not handled

13
Approach to Automated Failure Recovery

Systems with three types of artifacts
Applications
Components
Machines

14
Approach to Automated Failure Recovery

Application-oriented recovery
High-level recovery specifications using AI
planning
Automated script generation for recovery

15
Representing the System

Application Model
Configuration Model
Dependencies and requirements
Various topologies of an application
Dynamic Model
Runtime information
Component Model
Interfaces
Control
Configuration
Properties
Runtime information
Machine Model
Runtime information

16
An Overview of Planning

Domain
Contains the semantics of the system
Represented in a first-order-logic-like language,
e.g. PDDL
Initial State
Represents the current state of the system
Goal State
Represents the desired state of the system
Two ways to represent it
Implicit
Explicit
Plan
Steps to move from the current state to the goal
state
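
To make the four ingredients concrete, here is a minimal STRIPS-style sketch: states are sets of facts, actions have preconditions plus add and delete lists, and a breadth-first search returns the shortest sequence of actions that reaches the goal. It is a toy stand-in for a real PDDL planner, and the fact and action names are invented for the example.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]      # facts that must hold before the action
    add: FrozenSet[str]      # facts the action makes true
    delete: FrozenSet[str]   # facts the action makes false


def find_plan(initial: Iterable[str], goal: Iterable[str],
              actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first forward search from the initial state to the goal."""
    start, target = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if target <= state:
            return steps
        for action in actions:
            if action.pre <= state:
                successor = (state - action.delete) | action.add
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, steps + [action.name]))
    return None  # no plan exists for this goal


# Domain: hypothetical actions. Initial state: Tomcat is down, machine2 is up.
domain = [
    Action("start-tomcat-on-machine2", frozenset({"machine2-up"}),
           frozenset({"tomcat-running"}), frozenset()),
    Action("connect-apache-to-tomcat",
           frozenset({"apache-running", "tomcat-running"}),
           frozenset({"application-ready"}), frozenset()),
]
steps = find_plan({"machine2-up", "apache-running"}, {"application-ready"}, domain)
```

Here `steps` is the two-action sequence that starts Tomcat and then reconnects Apache; a goal that no action can establish yields `None`.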

17
Automating Failure Recovery

Sense
Remote agents to monitor behavior of monitorable
objects i.e. components and machines
A central monitor receives events from agents
Monitor updates the system models
If a failure is detected the failure recovery
process is invoked
Analyze
Check the dependencies of the system
Develop an exact picture of the system state
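
The dependency check in the Analyze phase can be illustrated with a small traversal that, given a failed object, collects everything that depends on it directly or transitively. The dependency map and the machine and component names are assumptions for this sketch, not data from the experiments.

```python
from typing import Dict, Set


def affected_by(failed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every node that directly or transitively depends on `failed`."""
    affected: Set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for node, dependencies in depends_on.items():
            if current in dependencies and node not in affected:
                affected.add(node)
                frontier.append(node)
    return affected


# Hypothetical deployment: each component depends on its machine,
# and the application depends on all three components.
depends_on = {
    "apache": {"machine1"},
    "tomcat": {"machine2"},
    "mysql": {"machine3"},
    "rubbos": {"apache", "tomcat", "mysql"},
}
impact = affected_by("machine2", depends_on)
```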

18
Automating Failure Recovery

Plan
Using the system model, develop an
Initial State: the failed state of the system
Goal State: the recovered state of the system
Give the initial and goal states to a planner
Planner outputs a plan
Plan is interpreted and translated
Actions of the plan map one-to-one to
component interfaces
Plan dispatch
Execute
Plan scripts are executed on the system to bring
the desired change
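
The one-to-one mapping between plan actions and component interfaces suggests a simple table-driven translation step. The sketch below assumes plan steps arrive as tuples of an action name and its arguments, and it emits abstract script lines rather than real commands; the action names and the output format are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

PlanStep = Tuple[str, ...]  # (action name, argument, argument, ...)

# Hypothetical action-to-interface table. Real components would expose
# control operations (Start, Stop) and configuration operations.
INTERFACES: Dict[str, Callable[..., str]] = {
    "start-component": lambda comp, machine: f"START {comp} ON {machine}",
    "stop-component": lambda comp, machine: f"STOP {comp} ON {machine}",
    "connect": lambda src, dst: f"CONNECT {src} TO {dst}",
}


def translate(plan: List[PlanStep]) -> List[str]:
    """Turn each plan action into exactly one script line."""
    script = []
    for name, *args in plan:
        if name not in INTERFACES:
            raise ValueError(f"no component interface for action {name!r}")
        script.append(INTERFACES[name](*args))
    return script


recovery_plan = [
    ("start-component", "tomcat", "machine2"),
    ("connect", "apache", "tomcat"),
]
script = translate(recovery_plan)
```

Rejecting unknown actions up front keeps a malformed plan from being partially executed on the system.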

19
Exceptions in Recovery Process

No Plan Found
A new goal state is selected from the Application
Configuration Model
Planner is invoked again
More Failures are Reported
Recovery process reverts based on the current
phase
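
The "No Plan Found" case can be sketched as a loop over alternative configurations. The example uses the four RuBBoS configurations listed later in the deck and assumes they are tried in listed order, which the slides do not state; `toy_planner` is a stand-in that only checks component availability.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Configuration = Tuple[str, ...]


def recover(
    configurations: Sequence[Configuration],
    try_plan: Callable[[Configuration], Optional[List[str]]],
) -> Optional[Tuple[Configuration, List[str]]]:
    """Return the first configuration for which the planner finds a plan."""
    for config in configurations:
        steps = try_plan(config)
        if steps is not None:
            return config, steps
    return None  # no configuration is reachable


RUBBOS_CONFIGURATIONS: List[Configuration] = [
    ("apache", "tomcat", "mysql"),
    ("apache-php", "mysql"),
    ("tomcat", "mysql"),
    ("apache",),
]

# Assumed situation: Tomcat cannot be started on any remaining machine.
available: Set[str] = {"apache", "apache-php", "mysql"}


def toy_planner(config: Configuration) -> Optional[List[str]]:
    """Stand-in planner: succeeds only if every component is available."""
    if all(component in available for component in config):
        return [f"start {component}" for component in config]
    return None


chosen = recover(RUBBOS_CONFIGURATIONS, toy_planner)
```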

20
Flow of Automated Failure Recovery
21
Failure Recovery Sequence
22
Architecture of Recover
23
Evaluation

Basic Experiments
Goal: To measure how long the
recovery process takes under realistic failure
scenarios
Synthetic Experiments
Goal: To test the limits of the system under
synthetic loads not possible in a laboratory
setting, i.e. a large number of machines, components
and applications
Intensive Experiments
Goal: To test the system under a sequence of
failures and to see how the system behaves

24
Applications

SMS (Strand Map Service)
A service designed to provide strand map
functionality to the user
RuBBoS (Rice University Bulletin Board System)
A bulletin board benchmark site designed like
Slashdot with huge amounts of data.
WebCal
An online calendar for individuals and/or groups
for scheduling meetings, tasks etc.

25
Approach to Automated Failure Recovery

Systems with three types of artifacts
Applications
Components
Machines

26
Basic Experiments (Experimental Setup)
System Before a Failure
27
Basic Experiments
System After a Failure
28
Basic Experiments
System After Recovery
29
Average Recovery Time 5.55 seconds
30
Synthetic Experiments

To test the time required for planning in
large-scale systems
Developed a simulator to test large-scale systems.
Experimental Setup
20 machines
20 components
10 applications
Applications have various high-level
configurations

31
Synthetic Experiments
32
Synthetic Experiment: Recovery from Machine
Failures (Time to First Plan)
33
Synthetic Experiment: Recovery from Component
Failures (Time to First Plan)
34
Intensive Experiments

To test the systems under a sequence of failures
To test the behavior of the system if the
original configuration cannot be recovered
Same setup as basic experiments
Only one application, i.e. RuBBoS

35
Intensive Experiments (System before Failure)
36
Intensive Experiments (Recovery after First
Failure)
37
Intensive Experiments (Recovery after Second
Failure)
38
Contributions

AI Planning-based Dynamic Reconfiguration
AI planning to get recovery configurations
Ability to handle unforeseen failures
A way to represent high-level configuration
objectives using AI planning
Low-level operations are performed automatically
to achieve high-level objectives
Optimization of time, cost and resource usage
Automated Failure Recovery
A fast automated way to recover distributed
systems from failures

39
Future Work

Availability of more resources
Recovery of state
Failures during failure recovery
Script generation framework
Distributed systems simulator

40
Backup Slides
41
Sensing Models
42
Push Model (Felber et al.)
43
Pull Model (Felber et al.)
44
Dual Model (Felber et al.)
45
Basic Experiments: Detailed Results
46
Average Recovery Time 3.54 seconds
47
Average Recovery Time 4.77 seconds
48
Synthetic Experiments: Time to Find Best Plan
49
Recovery from Machine Failures: Time to Best Plan
50
Recovery from Component Failures: Time to Best
Plan
51
Introductory Slides w/References
52
Overview

Availability and reliability are important for
dependability in computer systems
Failures disrupt the availability and reliability
of computer systems.
A single hour of downtime causes revenue loss that
ranges from $200,000 for e-commerce sites to
$6,000,000 for brokerage firms [1].
Eight hours to eighty hours per year of downtime in
systems publicized as highly available [2].

[1] Donn DiNunno. Quantifying performance loss: IT
performance engineering and measurement
strategies. Meta Group, October 2000.
[2] Patterson et al. Recovery oriented computing
(ROC): Motivation, definition, techniques, and
case studies. UC Berkeley Computer Science
Technical Report UCB/CSD-02-1175, Berkeley, CA,
March 2002.
53
Common Approaches to High Availability

Redundancy
A good alternative because of cheap hardware.
But is it really cost effective?
Total Cost of Ownership (TCO)
Administrative Costs
Software Costs
Configuration Costs
etc..
In reality TCO is 5-10 times higher than cost of
hardware [1]

[1] Patterson et al. Recovery oriented computing
(ROC): Motivation, definition, techniques, and
case studies. UC Berkeley Computer Science
Technical Report UCB/CSD-02-1175, Berkeley, CA,
March 2002.
54
Enemies of High Availability

Study by Gray in 1986 [1]
Cause of Downtime
Operator Errors: 42%
Study by Oppenheimer et al. in 2003 [2]
Causes of Downtime
Hardware: 15%
Software: 34%
Operator Errors: 51%
50% of operator errors are configuration errors
Most of the configuration errors are due to
enormous levels of complexity in computer systems

[1] Jim Gray. Why do computers stop and what can be
done about it? In Symposium on Reliability in
Distributed Software and Database Systems, pages
3-12, 1986.
[2] Oppenheimer et al. Why do internet
services fail, and what can be done about it? In
USENIX Symposium on Internet Technologies and
Systems, 2003.
55
Planning
56
Domain (Graphical Representation)
57
A Sample Plan
58
PDDL Example

(define (domain metricVehicle)
  (:requirements :strips :typing :fluents)
  (:types vehicle location)
  (:predicates
    (at ?v - vehicle ?p - location)
    (accessible ?v - vehicle ?p1 ?p2 - location))
  (:functions
    (fuel-level ?v - vehicle)
    (fuel-required ?p1 ?p2 - location)
    (total-fuel-used))
  (:action drive
    :parameters (?v - vehicle ?from ?to - location)
    :precondition (and (at ?v ?from)
                       (accessible ?v ?from ?to)
                       (>= (fuel-level ?v) (fuel-required ?from ?to)))
    ; the slide is cut off after the second effect; any numeric
    ; effects on the fuel functions are not shown in the transcript
    :effect (and (not (at ?v ?from))
                 (at ?v ?to))))

59
Implicit and Explicit Goal State

Explicit Goal State
(application-ready-5 ?app - application ?ap -
apache ?s - service ?t - tomcat ?con - connector)
Implicit Goal State
(application-ready-1a ?app - application)

60
Models
61
Application Configuration Model (Not in PDDL
Format)

Application Name
Configuration Number
Component A
Component A Included
Component A Intercomponent Dependencies
Component A Intracomponent Dependencies
Application Import Time to Component A
Component B
Component B Included
Component B Intercomponent Dependencies
Component B Intracomponent Dependencies
Application Import Time to Component B
Component C
Component C Included
Component C Intercomponent Dependencies
Component C Intracomponent Dependencies
Application Import Time to Component C
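
One way to hold the fields above in code is a pair of small record types. This is only a sketch: the field types, the reading of "Application Import Time" as a duration in seconds, and the sample values are assumptions, since the slide lists field names only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComponentEntry:
    included: bool
    intercomponent_deps: List[str] = field(default_factory=list)
    intracomponent_deps: List[str] = field(default_factory=list)
    import_time_s: float = 0.0  # assumed unit: seconds


@dataclass
class ApplicationConfiguration:
    application: str
    number: int
    components: Dict[str, ComponentEntry]

    def required_components(self) -> List[str]:
        """Names of the components this configuration actually uses."""
        return [name for name, c in self.components.items() if c.included]

    def total_import_time(self) -> float:
        """Sum of import times over the included components."""
        return sum(c.import_time_s for c in self.components.values() if c.included)


# Hypothetical entry: a Tomcat + MySQL configuration that leaves Apache out.
config = ApplicationConfiguration(
    application="RuBBoS",
    number=3,
    components={
        "apache": ComponentEntry(included=False),
        "tomcat": ComponentEntry(True, intercomponent_deps=["mysql"], import_time_s=2.5),
        "mysql": ComponentEntry(True, import_time_s=4.0),
    },
)
```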

62
Application Configurations

SMS (Strand Map Service)
A service designed to provide strand map
functionality to the user
Possible Configurations
1. Apache + Tomcat + MySQL
2. Tomcat + MySQL
3. Apache
RuBBoS
A bulletin board benchmark site designed like
Slashdot with huge amounts of data.
Possible Configurations
1. Apache + Tomcat + MySQL
2. Apache/PHP + MySQL
3. Tomcat + MySQL
4. Apache
WebCal
An online calendar for individuals and/or groups
for scheduling meetings, tasks etc.
Possible Configurations
1. Apache/PHP + MySQL
2. Apache/PHP

63
Application Dynamic Model

Application Name
Current Configuration
Component A
Component A Machine
Component A State
Component B
Component B Machine
Component B State
Component C
Component C Machine
Component C State
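
The dynamic model is what the monitor updates when events arrive. A minimal sketch of that update follows, assuming string state names such as "running" and "failed" and a machine-failure event; neither is specified on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ComponentStatus:
    machine: str
    state: str  # assumed values: "running" or "failed"


@dataclass
class ApplicationDynamicModel:
    application: str
    current_configuration: int
    components: Dict[str, ComponentStatus]

    def on_machine_failure(self, machine: str) -> List[str]:
        """Mark components hosted on `machine` as failed; return their names."""
        newly_failed = []
        for name, status in self.components.items():
            if status.machine == machine and status.state != "failed":
                status.state = "failed"
                newly_failed.append(name)
        return newly_failed

    def needs_recovery(self) -> bool:
        return any(s.state == "failed" for s in self.components.values())


model = ApplicationDynamicModel(
    application="RuBBoS",
    current_configuration=1,
    components={
        "apache": ComponentStatus("machine1", "running"),
        "tomcat": ComponentStatus("machine2", "running"),
        "mysql": ComponentStatus("machine3", "running"),
    },
)
healthy_before = model.needs_recovery()
failed_components = model.on_machine_failure("machine2")
```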

64
Component Model

Interfaces
Control Interface
Start
Stop
Configuration Interface
AddVirtualHost
AddModule

65
Component Model

Properties
Component Type
Component Machine
Component State
Start Time
Restart Time
Accessible Port
Module Currently Installed
Module Available to be Installed
Current Load
Component Instance Number
Connectors Available
Applications Installed

66
Machine Model

Machine Name
Machine IP Address
Application Installations Available
Maximum Load
Current Load
Total Memory
Available Memory
Total Harddisk Space
Available Harddisk Space
Operating System
Machine Architecture
Components Installed
Software Available
Machine State

67
Foundational Work
68
Underlying Technologies (Planning)

Planners
Progression Planners
GraphPlan, IPP
Regression Planners
Unpop, HSPr
Causal-link Planners
UCPOP
Compilation Planners
SATPLAN, Blackbox
Planning Systems
Sekitei
ASPEN
Prodigy

69
Underlying Technologies (Dependency Management)

Various classes of dependencies
Classification by Alda et al.
Syntactical Dependencies (Direct Communication)
Implicit Dependencies (Indirect Communication)
Classification by Keller and Kar
Multidimensional Space of Dependencies

70
Underlying Technologies (Dependency Management)
Keller and Kar Dependency Classification
71
Dynamic Reconfiguration Approaches

Agent-based approaches
Redundancy-based Approaches
OS-level approaches
Platform specific approaches
Workflow-based approaches
Middleware-based approaches

72
Fault Tolerance

Goal: Increase Mean Time To Fail (MTTF)
Single version fault tolerance
N-version fault tolerance
Fault tolerance in distributed systems

73
Failures during Failure Recovery
74
Failures during Failure Recovery (Failure
Possibilities)
75
(No Transcript)
76
Failures during Failure Recovery (Handling in
Each Phase)
77
Failures during Failure Recovery (Dependency
Graph)
