Title: A Planning-Based Approach to Failure Recovery in Distributed Systems
1. A Planning-Based Approach to Failure Recovery in Distributed Systems
- Naveed Arshad
- Advisors
  - Alexander L. Wolf
  - Dennis M. Heimbigner
2. Overview
- Problem
  - High availability of computer systems is difficult to achieve
- Goal
  - A fast, automated way to recover systems from failures
- Solution
  - An AI planning-based automated failure recovery mechanism
3. Problem of High Availability
- Availability and reliability are important for dependability in computer systems
- Failures disrupt the availability and reliability of computer systems
- A single hour of downtime causes revenue loss ranging from $200,000 for e-commerce sites to $6,000,000 for brokerage firms
- Eight to eighty hours of downtime per year occur in systems publicized as highly available
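The downtime range quoted above (eight to eighty hours per year) can be converted into an availability percentage with simple arithmetic. The following is a minimal sketch, assuming a 365-day year; the function names `availability` and `revenue_loss` are illustrative and do not come from the talk.

```python
# Hours in a non-leap year (assumption: leap years are ignored).
HOURS_PER_YEAR = 365 * 24


def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year during which the system is up."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def revenue_loss(downtime_hours: float, loss_per_hour: float) -> float:
    """Total loss for a given amount of downtime at a fixed hourly loss."""
    return downtime_hours * loss_per_hour


for hours in (8, 80):
    print(f"{hours} h/year of downtime -> {availability(hours) * 100:.3f}% available")
```

Under these assumptions, even the best case in the quoted range falls short of "four nines" (99.99%), which would allow less than one hour of downtime per year.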
4. Enemies of High Availability
- Primary cause is operator errors
- Study by Gray in 1986
  - Cause of downtime
    - Operator errors: 42%
- Study by Oppenheimer and colleagues in 2003
  - Causes of downtime
    - Hardware: 15%
    - Software: 34%
    - Operator errors: 51%
- 50% of operator errors are configuration errors
- Most configuration errors are due to the enormous complexity of computer systems
- Systems need automation
5. Common Approaches to High Availability
- Redundancy
  - A good alternative because of cheap hardware
  - But is it really cost-effective?
  - Total Cost of Ownership (TCO)
    - Administrative costs, software costs, configuration costs, etc.
  - In reality, TCO is 5-10 times higher than the cost of hardware
- Failure Recovery Scripts
  - Manually written
  - Not practical to write scripts for each failure scenario
  - Time, cost, and resource optimizations are not possible because of unforeseen failures
6. Contributions
- AI Planning-Based Dynamic Reconfiguration
  - High-level representation of configuration objectives using AI planning
  - System automatically performs the low-level operations needed to satisfy the high-level goals
  - Optimization of time, cost, and resource usage
- Automated Failure Recovery
  - System detects and corrects anomalies by itself
  - Development of an end-to-end automated failure recovery technique using dynamic reconfiguration
7. Related Work Classification
8. Related Work (Control Theory)
- Monitor
  - Sensors for system state
- Analyze
  - System state is measured against a reference model
  - If the state is not within limits, planning is initiated
- Plan
  - A plan is calculated to bring the system within the reference model bounds
- Execute
  - The plan is applied to the system to bring about the desired change
9. Related Work (Recovery-Oriented Computing)
- Fault tolerance cannot achieve high availability, so let the system fail and recover quickly afterward
- Focus is on improving Mean Time To Repair (MTTR) instead of Mean Time To Fail (MTTF)
- Microreboot: reboot at finer levels
- Undo/redo: an administrator can undo a configuration action
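The emphasis on MTTR over MTTF can be illustrated with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below uses made-up numbers (1000 hours between failures, 10 hours to repair) purely for illustration; they are not figures from the talk.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Standard steady-state availability: uptime / (uptime + repair time)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1000 h, takes 10 h to repair.
baseline = steady_state_availability(1000.0, 10.0)

# Doubling MTTF and halving MTTR give the same availability ...
doubled_mttf = steady_state_availability(2000.0, 10.0)
halved_mttr = steady_state_availability(1000.0, 5.0)

# ... so recovering ten times faster matches a tenfold gain in MTTF.
tenth_mttr = steady_state_availability(1000.0, 1.0)

print(f"baseline     {baseline:.6f}")
print(f"2x MTTF      {doubled_mttf:.6f}")
print(f"MTTR / 2     {halved_mttr:.6f}")
print(f"MTTR / 10    {tenth_mttr:.6f}")
```

Because only the ratio of the two quantities matters in this formula, shortening repair time by a factor k is as effective as lengthening time-to-failure by the same factor, which is the argument for making recovery fast and automated.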
10. Related Work (Architecture-Based Recovery)
- A system is described in terms of an architectural model
- Changes are expressed in terms of an architectural language
- Changes in the architecture are analyzed for possible mismatches
- Change plans are developed to change the architecture of the system
11. Related Work (Miscellaneous Approaches)
- Biologically inspired self-healing systems
  - Environmental awareness
  - Adaptation
  - Decentralization
  - Redundancy
- Task-based
  - Change based on user preferences
12. Scope and Assumptions
- Stateless components
- Fail-stop behavior
- Reliable network
- Application-level recovery
- Permanent failed state of hardware
- Failure recovery does not introduce worse failures
- Failure recovery system does not fail
- Configurations are correctly specified
- Database failures are not handled
13. Approach to Automated Failure Recovery
- Systems with three types of artifacts
  - Applications
  - Components
  - Machines
14. Approach to Automated Failure Recovery
- Application-oriented recovery
- High-level recovery specifications using AI planning
- Automated script generation for recovery
15. Representing the System
- Application Model
  - Configuration Model
    - Dependencies and requirements
    - Various topologies of an application
  - Dynamic Model
    - Runtime information
- Component Model
  - Interfaces
    - Control
    - Configuration
  - Properties
    - Runtime information
- Machine Model
  - Runtime information
16. An Overview of Planning
- Domain
  - Contains the semantics of the system
  - Represented in a first-order-logic-like language, e.g. PDDL
- Initial State
  - Represents the current state of the system
- Goal State
  - Represents the desired state of the system
  - Two ways to represent it
    - Implicit
    - Explicit
- Plan
  - Steps to move from the current state to the goal state
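To make the four ingredients (domain, initial state, goal state, plan) concrete, here is a minimal sketch of a STRIPS-style planner using breadth-first forward search. It is not the planner used in this work; the `Action` type, the `find_plan` function, and the three actions in the toy domain are all hypothetical and chosen only to show how the pieces fit together.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]   # facts that must hold before the action
    add_effects: FrozenSet[str]     # facts made true by the action
    delete_effects: FrozenSet[str]  # facts made false by the action


def act(name: str, pre: Iterable[str], add: Iterable[str],
        delete: Iterable[str] = ()) -> Action:
    return Action(name, frozenset(pre), frozenset(add), frozenset(delete))


def find_plan(initial: FrozenSet[str], goal: FrozenSet[str],
              actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first forward search; returns a shortest plan or None."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return plan
        for action in actions:
            if action.preconditions <= state:
                successor = (state - action.delete_effects) | action.add_effects
                if successor not in visited:
                    visited.add(successor)
                    frontier.append((successor, plan + [action.name]))
    return None


# Hypothetical domain: bring a web server up on a spare machine m2.
ACTIONS = [
    act("install-apache-on-m2", ["machine-up m2"], ["installed apache m2"]),
    act("start-apache-on-m2", ["installed apache m2"], ["running apache m2"]),
    act("connect-app-to-m2", ["running apache m2"], ["application-ready"]),
]
initial_state = frozenset(["machine-up m2"])
goal_state = frozenset(["application-ready"])

plan = find_plan(initial_state, goal_state, ACTIONS)
print(plan)
```

Real planners that accept PDDL use far more efficient search and support typed objects and numeric fluents; the sketch only shows that a plan is a sequence of actions whose preconditions and effects lead from the initial state to a state satisfying the goal.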
17. Automating Failure Recovery
- Sense
  - Remote agents monitor the behavior of monitorable objects, i.e. components and machines
  - A central monitor receives events from the agents
  - The monitor updates the system models
  - If a failure is detected, the failure recovery process is invoked
- Analyze
  - Check the dependencies of the system
  - Develop an exact picture of the system state
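The dependency check in the Analyze phase can be sketched as a transitive closure over a "depends on" relation: starting from the failed objects, collect everything that directly or indirectly relies on them. This is a minimal sketch under assumed data; the `DEPENDS_ON` table, the `name@machine` naming, and the `affected_by` function are hypothetical.

```python
from typing import Dict, List, Set


def affected_by(failed: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Everything that transitively depends on a failed item.

    The failed items themselves are not included in the result.
    """
    affected: Set[str] = set()
    changed = True
    while changed:  # repeat until no new item is marked as affected
        changed = False
        for item, dependencies in depends_on.items():
            if item in failed or item in affected:
                continue
            if any(d in failed or d in affected for d in dependencies):
                affected.add(item)
                changed = True
    return affected


# Hypothetical system: components depend on machines, applications on components.
DEPENDS_ON = {
    "apache@m1": ["m1"],
    "tomcat@m2": ["m2"],
    "mysql@m3": ["m3"],
    "rubbos": ["apache@m1", "tomcat@m2", "mysql@m3"],
    "webcal": ["apache@m1", "mysql@m3"],
}

print(sorted(affected_by({"m2"}, DEPENDS_ON)))
```

In this example a failure of machine `m2` affects the component hosted on it and the one application that uses that component, which is the "exact picture of the system state" the planner needs as its initial state.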
18. Automating Failure Recovery
- Plan
  - Using the system model, develop an
    - Initial State: the failed state of the system
    - Goal State: the recovered state of the system
  - Give the initial and goal states to a planner
  - The planner outputs a plan
  - The plan is interpreted and translated
    - Actions of the plan have a one-to-one matching with component interfaces
  - Plan dispatch
- Execute
  - Plan scripts are executed on the system to bring about the desired change
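The one-to-one matching between plan actions and component interfaces suggests a simple translation step: parse each plan step and invoke the interface method of the same name. The sketch below assumes Lisp-style plan steps such as `(start apache m2)` and a hypothetical `Component` class; the method names and the returned strings are placeholders, not the actual scripts generated by the system.

```python
from typing import Dict, List, Tuple


class Component:
    """Hypothetical component with a control and a configuration interface."""

    def __init__(self, name: str) -> None:
        self.name = name

    # Control interface
    def start(self, machine: str) -> str:
        return f"start {self.name} on {machine}"

    def stop(self, machine: str) -> str:
        return f"stop {self.name} on {machine}"

    # Configuration interface
    def add_module(self, module: str) -> str:
        return f"add module {module} to {self.name}"


def parse_step(step: str) -> Tuple[str, List[str]]:
    """Split '(action arg1 arg2)' into the action name and its arguments."""
    tokens = step.strip().strip("()").split()
    return tokens[0].lower(), tokens[1:]


def dispatch(plan: List[str], components: Dict[str, Component]) -> List[str]:
    """Invoke, for each plan step, the interface method with the same name."""
    trace = []
    for step in plan:
        action, args = parse_step(step)
        component = components[args[0]]  # first argument names the component
        handler = getattr(component, action.replace("-", "_"))
        trace.append(handler(*args[1:]))
    return trace


components = {"apache": Component("apache")}
plan = ["(stop apache m1)", "(add-module apache mod_jk)", "(start apache m2)"]
for line in dispatch(plan, components):
    print(line)
```

In a real deployment each handler would run a script on the target machine; here the handlers only return a description so that the mapping itself is easy to follow.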
19. Exceptions in Recovery Process
- No Plan Found
  - A new goal state is selected from the Application Configuration Model
  - The planner is invoked again
- More Failures Are Reported
  - The recovery process reverts based on the current phase
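The "No Plan Found" case amounts to a fallback loop: try the goal states in order of preference and stop at the first one for which the planner returns a plan. The sketch below uses a stub planner that succeeds only when every component type required by a configuration is available; the stub, the `recover` function, and the example configurations are assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

Plan = List[str]
Planner = Callable[[FrozenSet[str], FrozenSet[str]], Optional[Plan]]


def recover(available: FrozenSet[str],
            candidate_goals: Sequence[FrozenSet[str]],
            planner: Planner) -> Optional[Tuple[FrozenSet[str], Plan]]:
    """Try each goal in preference order; return the first goal with a plan."""
    for goal in candidate_goals:
        plan = planner(available, goal)
        if plan is not None:
            return goal, plan
    return None  # no configuration of the application can be reached


def stub_planner(available: FrozenSet[str],
                 goal: FrozenSet[str]) -> Optional[Plan]:
    """Stand-in for a real planner: succeeds if all required parts exist."""
    if goal <= available:
        return [f"start {component}" for component in sorted(goal)]
    return None


# Hypothetical configurations of one application, most preferred first.
CONFIGURATIONS = [
    frozenset({"apache", "tomcat", "mysql"}),
    frozenset({"apache", "php", "mysql"}),
    frozenset({"apache"}),
]

# Tomcat cannot be brought back, so the first configuration is unreachable.
available = frozenset({"apache", "php", "mysql"})
result = recover(available, CONFIGURATIONS, stub_planner)
print(result)
```

With these inputs the first configuration yields no plan, so the second one is selected and planned for instead, which mirrors the re-invocation of the planner described above.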
20. Flow of Automated Failure Recovery
21. Failure Recovery Sequence
22. Architecture of Recover
23. Evaluation
- Basic Experiments
  - Goal: to measure how long the recovery process takes under realistic failure scenarios
- Synthetic Experiments
  - Goal: to test the limits of the system under synthetic loads that are not possible in a laboratory setting, i.e. a large number of machines, components, and applications
- Intensive Experiments
  - Goal: to test the system under a sequence of failures and to see how it behaves
24. Applications
- SMS (Strand Map Service)
  - A service designed to provide strand map functionality to the user
- RuBBoS (Rice University Bulletin Board)
  - A bulletin board benchmark site designed like Slashdot, with large amounts of data
- WebCal
  - An online calendar for individuals and/or groups for scheduling meetings, tasks, etc.
25. Approach to Automated Failure Recovery
- Systems with three types of artifacts
  - Applications
  - Components
  - Machines
26. Basic Experiments (Experimental Setup): System Before a Failure
27. Basic Experiments: System After a Failure
28. Basic Experiments: System After Recovery
29. Average Recovery Time: 5.55 seconds
30. Synthetic Experiments
- To test the time required for planning in large-scale systems
- Developed a simulator to test large-scale systems
- Experimental Setup
  - 20 machines
  - 20 components
  - 10 applications
  - Applications have various high-level configurations
31. Synthetic Experiments
32. Synthetic Experiment: Recovery from Machine Failures, Time to First Plan
33. Synthetic Experiment: Recovery from Component Failures, Time to First Plan
34. Intensive Experiments
- To test the system under a sequence of failures
- To test the behavior of the system if the original configuration cannot be recovered
- Same setup as the basic experiments
- Only one application, i.e. RuBBoS
35. Intensive Experiments (System Before Failure)
36. Intensive Experiment (Recovery After First Failure)
37. Intensive Experiment (Recovery After Second Failure)
38. Contributions
- AI Planning-Based Dynamic Reconfiguration
  - AI planning to get recovery configurations
  - Ability to handle unforeseen failures
  - A way to represent high-level configuration objectives using AI planning
  - Low-level operations are performed automatically to achieve high-level objectives
  - Optimization of time, cost, and resource usage
- Automated Failure Recovery
  - A fast, automated way to recover distributed systems from failures
39. Future Work
- Availability of more resources
- Recovery of state
- Failures during failure recovery
- Script generation framework
- Distributed systems simulator
40. Backup Slides
41. Sensing Models
42. Push Model (Felber et al.)
43. Pull Model (Felber et al.)
44. Dual Model (Felber et al.)
45. Basic Experiments: Detailed Results
46. Average Recovery Time: 3.54 seconds
47. Average Recovery Time: 4.77 seconds
48. Synthetic Experiments: Time to Find Best Plan
49. Recovery from Machine Failures: Time to Best Plan
50. Recovery from Component Failures: Time to Best Plan
51. Introductory Slides w/References
52. Overview
- Availability and reliability are important for dependability in computer systems
- Failures disrupt the availability and reliability of computer systems
- A single hour of downtime causes revenue loss ranging from $200,000 for e-commerce sites to $6,000,000 for brokerage firms [1]
- Eight to eighty hours of downtime per year occur in systems publicized as highly available [2]
[1] Donn DiNunno. Quantifying performance loss. IT performance engineering measurement strategies. Meta Group, October 2000.
[2] Patterson et al. Recovery oriented computing (ROC): Motivation, definition, techniques, and case studies. UC Berkeley Computer Science Technical Report UCB/CSD-02-1175, Berkeley, CA, March 2002.
53. Common Approaches to High Availability
- Redundancy
  - A good alternative because of cheap hardware
  - But is it really cost-effective?
  - Total Cost of Ownership (TCO)
    - Administrative costs
    - Software costs
    - Configuration costs
    - etc.
  - In reality, TCO is 5-10 times higher than the cost of hardware [1]
[1] Patterson et al. Recovery oriented computing (ROC): Motivation, definition, techniques, and case studies. UC Berkeley Computer Science Technical Report UCB/CSD-02-1175, Berkeley, CA, March 2002.
54. Enemies of High Availability
- Study by Gray in 1986 [1]
  - Cause of downtime
    - Operator errors: 42%
- Study by Oppenheimer et al. in 2003 [2]
  - Causes of downtime
    - Hardware: 15%
    - Software: 34%
    - Operator errors: 51%
- 50% of operator errors are configuration errors
- Most configuration errors are due to the enormous complexity of computer systems
[1] Jim Gray. Why do computers stop and what can be done about it? In Symposium on Reliability in Distributed Software and Database Systems, pages 3-12, 1986.
[2] Oppenheimer et al. Why do Internet services fail, and what can be done about it? In USENIX Symposium on Internet Technologies and Systems, 2003.
55. Planning
56. Domain (Graphical Representation)
57. A Sample Plan
58. PDDL Example
(define (domain metricVehicle)
  (:requirements :strips :typing :fluents)
  (:types vehicle location)
  (:predicates
    (at ?v - vehicle ?p - location)
    (accessible ?v - vehicle ?p1 ?p2 - location))
  (:functions
    (fuel-level ?v - vehicle)
    (fuel-required ?p1 ?p2 - location)
    (total-fuel-used))

  (:action drive
    :parameters (?v - vehicle ?from ?to - location)
    :precondition (and (at ?v ?from)
                       (accessible ?v ?from ?to)
                       (> (fuel-level ?v) (fuel-required ?from ?to)))
    :effect (and (not (at ?v ?from))
                 (at ?v ?to))))
59. Implicit and Explicit Goal State
- Explicit Goal State
  - (application-ready-5 ?app - application ?ap - apache ?s - service ?t - tomcat ?con - connector)
- Implicit Goal State
  - (application-ready-1a ?app - application)
60. Models
61. Application Configuration Model (Not in PDDL Format)
- Application Name
- Configuration Number
- Component A
  - Component A Included
  - Component A Intercomponent Dependencies
  - Component A Intracomponent Dependencies
  - Application Import Time to Component A
- Component B
  - Component B Included
  - Component B Intercomponent Dependencies
  - Component B Intracomponent Dependencies
  - Application Import Time to Component B
- Component C
  - Component C Included
  - Component C Intercomponent Dependencies
  - Component C Intracomponent Dependencies
  - Application Import Time to Component C
62. Application Configurations
- SMS (Strand Map Service)
  - A service designed to provide strand map functionality to the user
  - Possible configurations
    - 1. Apache + Tomcat + MySQL
    - 2. Tomcat + MySQL
    - 3. Apache
- RuBBoS
  - A bulletin board benchmark site designed like Slashdot, with large amounts of data
  - Possible configurations
    - 1. Apache + Tomcat + MySQL
    - 2. Apache/PHP + MySQL
    - 3. Tomcat + MySQL
    - 4. Apache
- WebCal
  - An online calendar for individuals and/or groups for scheduling meetings, tasks, etc.
  - Possible configurations
    - 1. Apache/PHP + MySQL
    - 2. Apache/PHP
63. Application Dynamic Model
- Application Name
- Current Configuration
- Component A
  - Component A Machine
  - Component A State
- Component B
  - Component B Machine
  - Component B State
- Component C
  - Component C Machine
  - Component C State
64. Component Model
- Interfaces
  - Control Interface
    - Start
    - Stop
  - Configuration Interface
    - AddVirtualHost
    - AddModule
65. Component Model
- Properties
  - Component Type
  - Component Machine
  - Component State
  - Start Time
  - Restart Time
  - Accessible Port
  - Module Currently Installed
  - Module Available to be Installed
  - Current Load
  - Component Instance Number
  - Connectors Available
  - Applications Installed
66. Machine Model
- Machine Name
- Machine IP Address
- Application Installations Available
- Maximum Load
- Current Load
- Total Memory
- Available Memory
- Total Hard Disk Space
- Available Hard Disk Space
- Operating System
- Machine Architecture
- Components Installed
- Software Available
- Machine State
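The models listed on the preceding slides map naturally onto plain record types. The following is a minimal sketch, assuming Python dataclasses and only a subset of the listed properties; the field names, the state strings (`"up"`, `"running"`, `"failed"`), and the helper methods are illustrative assumptions rather than the representation used in the system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Machine:
    name: str
    ip_address: str
    maximum_load: float
    current_load: float = 0.0
    state: str = "up"  # assumed values: "up" or "failed"
    components_installed: List[str] = field(default_factory=list)

    def has_capacity_for(self, extra_load: float) -> bool:
        """True if the machine is up and can take the additional load."""
        return self.state == "up" and self.current_load + extra_load <= self.maximum_load


@dataclass
class Component:
    component_type: str
    machine: str
    state: str = "stopped"  # assumed values: "stopped", "running", "failed"
    accessible_port: int = 0
    current_load: float = 0.0


@dataclass
class ApplicationDynamicModel:
    application_name: str
    current_configuration: int
    components: Dict[str, Component] = field(default_factory=dict)

    def failed_components(self) -> List[str]:
        """Names of components whose runtime state is 'failed'."""
        return sorted(name for name, c in self.components.items() if c.state == "failed")


# Hypothetical snapshot after a Tomcat failure.
m2 = Machine(name="m2", ip_address="10.0.0.2", maximum_load=1.0, current_load=0.4)
app = ApplicationDynamicModel(
    application_name="RuBBoS",
    current_configuration=1,
    components={
        "apache": Component("apache", machine="m1", state="running", accessible_port=80),
        "tomcat": Component("tomcat", machine="m2", state="failed", accessible_port=8080),
    },
)
print(app.failed_components(), m2.has_capacity_for(0.5))
```

Keeping runtime information (state, load) in the same records as static properties lets the monitor update a single model, from which the initial state for the planner can then be derived.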
67. Foundational Work
68. Underlying Technologies (Planning)
- Planners
  - Progression Planners
    - GraphPlan, IPP
  - Regression Planners
    - Unpop, HSPr
  - Causal-Link Planners
    - UCPOP
  - Compilation Planners
    - SATPLAN, Blackbox
- Planning Systems
  - Sekitei
  - ASPEN
  - Prodigy
69. Underlying Technologies (Dependency Management)
- Various classes of dependencies
  - Classification by Alda et al.
    - Syntactical dependencies (direct communication)
    - Implicit dependencies (indirect communication)
  - Classification by Keller and Kar
    - Multidimensional space of dependencies
70. Underlying Technologies (Dependency Management): Keller and Kar Dependency Classification
71. Dynamic Reconfiguration Approaches
- Agent-based approaches
- Redundancy-based approaches
- OS-level approaches
- Platform-specific approaches
- Workflow-based approaches
- Middleware-based approaches
72. Fault Tolerance
- Goal: increase Mean Time To Fail (MTTF)
- Single-version fault tolerance
- N-version fault tolerance
- Fault tolerance in distributed systems
73. Failures during Failure Recovery
74. Failures during Failure Recovery (Failure Possibilities)
76. Failures during Failure Recovery (Handling in Each Phase)
77. Failures during Failure Recovery (Dependency Graph)