1
A Planning-Based Approach to Failure Recovery in
Distributed Systems
  • Naveed Arshad
  • Advisors
  • Alexander L. Wolf
  • Dennis M. Heimbigner

2
Overview
  • Problem
  • High availability of computer systems is
    difficult to achieve
  • Goal
  • A fast, automated way to recover systems from
    failures
  • Solution
  • AI planning-based automated failure recovery
    mechanism

3
Problem of High Availability
  • Availability and reliability are important for
    dependability in computer systems
  • Failures disrupt the availability and reliability
    of computer systems.
  • A single hour of downtime causes revenue loss that
    ranges from $200,000 for e-commerce sites to
    $6,000,000 for brokerage firms.
  • Systems publicized as highly available still see
    eight to eighty hours of downtime per year.
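
To put the downtime figures above in perspective, the following sketch converts yearly downtime into an availability percentage and a rough revenue loss. The function names are illustrative, the per-hour loss is the figure quoted on the slide, and a 365-day year and linear loss are assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year the system is up."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def revenue_loss(downtime_hours: float, loss_per_hour: float) -> float:
    """Estimated loss, assuming loss scales linearly with downtime."""
    return downtime_hours * loss_per_hour


if __name__ == "__main__":
    for hours in (8, 80):
        print(f"{hours:>2} h/year down -> {availability(hours):.4%} available")
    print(f"8 h at $200,000/h -> ${revenue_loss(8, 200_000):,.0f}")
```

Even the best case on the slide (eight hours per year) falls short of "four nines" availability, which is why faster recovery matters.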

4
Enemies of High Availability
  • Primary cause is operator errors
  • Study by Gray in 1986
  • Cause of Downtime
  • Operator Errors 42%
  • Study by Oppenheimer and colleagues in 2003
  • Causes of Downtime
  • Hardware 15%
  • Software 34%
  • Operator Errors 51%
  • 50% of operator errors are configuration errors
  • Most of the configuration errors are due to
    enormous levels of complexity in computer systems
  • Systems need automation

5
Common Approaches to High Availability
  • Redundancy
  • A good alternative because of cheap hardware.
  • But is it really cost effective?
  • Total Cost of Ownership (TCO)
  • Administrative costs, software cost,
    configuration costs, etc.
  • In reality TCO is 5-10 times higher than cost of
    hardware
  • Failure Recovery Scripts
  • Manually written
  • Not practical to write scripts for each failure
    scenario
  • Time, cost, resource optimizations not possible
    because of unforeseen failures

6
Contributions
  • AI Planning-based Dynamic Reconfiguration
  • High-level representation of configuration
    objectives using AI planning
  • System automatically performs low-level
    operations needed to satisfy the high-level goals
  • Optimization of time, cost and resource usage
  • Automated Failure Recovery
  • System detects and corrects anomalies by itself
  • Development of an end-to-end automated failure
    recovery technique using dynamic reconfiguration

7
Related Work Classification
8
Related Work (Control Theory)
  • Monitor
  • Sensors for system state
  • Analyze
  • System state is measured against a reference
    model
  • If the state is outside the allowed limits,
    planning is initiated
  • Plan
  • A plan is computed to bring the system within
    reference model bounds
  • Execute
  • The plan is applied to the system to bring about
    the desired change
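
The four phases on this slide can be sketched as one pass of a control loop. This is a minimal illustration, not code from the presentation: `ReferenceModel`, `control_step`, and the load-shedding planner are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReferenceModel:
    """Acceptable bounds for one measured value (hypothetical example)."""
    low: float
    high: float

    def within_limits(self, value: float) -> bool:
        return self.low <= value <= self.high


def control_step(
    read_sensor: Callable[[], float],
    model: ReferenceModel,
    make_plan: Callable[[float, ReferenceModel], List[str]],
    execute: Callable[[str], None],
) -> List[str]:
    """Run one monitor-analyze-plan-execute pass; return the plan applied."""
    value = read_sensor()               # Monitor
    if model.within_limits(value):      # Analyze
        return []                       # state is fine, nothing to do
    plan = make_plan(value, model)      # Plan
    for action in plan:                 # Execute
        execute(action)
    return plan


def load_plan(value: float, model: ReferenceModel) -> List[str]:
    # Hypothetical planner: a single corrective action.
    return ["shed-load"] if value > model.high else ["add-load"]


applied: List[str] = []
plan = control_step(lambda: 0.95, ReferenceModel(0.0, 0.8), load_plan, applied.append)
```

Passing the sensor, planner, and executor as callables keeps the loop itself independent of any particular system.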

9
Related Work (Recovery-Oriented Computing)
  • Fault tolerance alone cannot achieve high availability,
    so let the system fail and then recover quickly
  • Focus is on improving Mean Time To Repair (MTTR)
    instead of Mean Time To Fail (MTTF).
  • Microreboot: reboot at a finer granularity
  • Undo/redo: an administrator can undo a
    configuration action
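
The emphasis on MTTR can be motivated with the standard steady-state availability formula. It is textbook material added here as a reminder, not something stated on the slide:

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}},
\qquad
1 - A = \frac{\mathrm{MTTR}}{\mathrm{MTTF} + \mathrm{MTTR}} \approx \frac{\mathrm{MTTR}}{\mathrm{MTTF}}
\quad \text{when } \mathrm{MTTR} \ll \mathrm{MTTF}
```

Under that approximation, halving MTTR reduces unavailability about as much as doubling MTTF does.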

10
Related Work (Architecture-Based Recovery)
  • A system is described in terms of an
    architectural model
  • Changes are expressed in terms of an
    architectural language
  • Changes in the architecture are analyzed for any
    possible mismatches
  • Change plans are developed to change the
    architecture of the system

11
Related Work (Miscellaneous Approaches)
  • Biologically inspired self-healing system
  • Environmental Awareness
  • Adaptation
  • Decentralization
  • Redundancy
  • Task-based
  • Change based on user preferences

12
Scope and Assumptions
  • Stateless components
  • Fail-stop behavior
  • Reliable network
  • Application-level recovery
  • Permanent failed state of hardware
  • Failure recovery does not introduce worse
    failures
  • Failure recovery system does not fail
  • Configurations are correctly specified
  • Database failures are not handled

13
Approach to Automated Failure Recovery
  • Systems with three types of artifacts
  • Applications
  • Components
  • Machines

14
Approach to Automated Failure Recovery
  • Application-oriented recovery
  • High-level recovery specifications using AI
    planning
  • Automated script generation for recovery

15
Representing the System
  • Application Model
  • Configuration Model
  • Dependencies and requirements
  • Various topologies of an application
  • Dynamic Model
  • Runtime information
  • Component Model
  • Interfaces
  • Control
  • Configuration
  • Properties
  • Runtime information
  • Machine Model
  • Runtime information
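
One possible way to hold the three kinds of models in memory is sketched below. The field names and the `is_ready` check are assumptions made for illustration; the presentation does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Machine:
    name: str
    state: str = "up"            # runtime information


@dataclass
class Component:
    name: str
    machine: Machine
    state: str = "running"       # runtime information
    depends_on: List[str] = field(default_factory=list)


@dataclass
class Application:
    name: str
    configuration: List[str]     # component names required by the current configuration
    components: List[Component] = field(default_factory=list)

    def is_ready(self) -> bool:
        """Ready when every configured component runs on a machine that is up."""
        running = {
            c.name for c in self.components
            if c.state == "running" and c.machine.state == "up"
        }
        return set(self.configuration) <= running


m1 = Machine("m1")
webcal = Application(
    "webcal",
    ["apache", "mysql"],
    [Component("apache", m1), Component("mysql", m1, depends_on=[])],
)
```

Because components reference their machine object, marking a machine as down is immediately visible to every application that uses it.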

16
An Overview of Planning
  • Domain
  • Contains the semantics of the system
  • Represented in a first-order-logic-like language,
    e.g. PDDL
  • Initial State
  • Represents the current state of the system
  • Goal State
  • Represents the desired state of the system
  • Two ways to represent it
  • Implicit
  • Explicit
  • Plan
  • Steps to move from the current state to the goal
    state
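
The relationship between domain, initial state, goal state, and plan can be illustrated with a tiny STRIPS-style planner. This is a sketch using breadth-first forward search, not the planner used in the presentation; the action and fact names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str] = frozenset()

    def applicable(self, state: FrozenSet[str]) -> bool:
        return self.preconditions <= state

    def apply(self, state: FrozenSet[str]) -> FrozenSet[str]:
        return (state - self.del_effects) | self.add_effects


def find_plan(initial: FrozenSet[str], goal: FrozenSet[str],
              domain: List[Action]) -> Optional[List[str]]:
    """Breadth-first forward search; returns a shortest plan or None."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                      # all goal facts hold
            return steps
        for action in domain:
            if action.applicable(state):
                nxt = action.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [action.name]))
    return None


# Hypothetical domain: a three-tier application started bottom-up.
domain = [
    Action("start-mysql", frozenset({"machine-up"}), frozenset({"mysql-running"})),
    Action("start-tomcat", frozenset({"mysql-running"}), frozenset({"tomcat-running"})),
    Action("start-apache", frozenset({"tomcat-running"}), frozenset({"app-ready"})),
]
plan = find_plan(frozenset({"machine-up"}), frozenset({"app-ready"}), domain)
```

Here the goal is given as a single high-level fact (`app-ready`), and the planner works out which lower-level steps are needed to reach it.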

17
Automating Failure Recovery
  • Sense
  • Remote agents monitor the behavior of monitorable
    objects, i.e. components and machines
  • A central monitor receives events from agents
  • Monitor updates the system models
  • If a failure is detected, the failure recovery
    process is invoked
  • Analyze
  • Check the dependencies of the system
  • Develop an exact picture of the system state
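
The analyze step can be pictured as a lookup over dependency information: given a failed machine, find the components it hosted and the applications that need them. The deployment below is a hypothetical example, and the dictionary layout is an assumption made for this sketch.

```python
from typing import Dict, List, Set

# Hypothetical deployment: which machine hosts each component ...
hosted_on: Dict[str, str] = {
    "apache": "m1",
    "tomcat": "m2",
    "mysql": "m3",
}
# ... and which components each application depends on.
app_components: Dict[str, List[str]] = {
    "rubbos": ["apache", "tomcat", "mysql"],
    "webcal": ["apache", "mysql"],
}


def analyze_failure(failed_machine: str) -> Dict[str, Set[str]]:
    """Return the components and applications affected by a machine failure."""
    failed_components = {c for c, m in hosted_on.items() if m == failed_machine}
    affected_apps = {
        app for app, comps in app_components.items()
        if failed_components.intersection(comps)
    }
    return {"components": failed_components, "applications": affected_apps}
```

For example, `analyze_failure("m2")` reports that only `tomcat`, and therefore only `rubbos`, is affected, which narrows the initial state handed to the planner.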

18
Automating Failure Recovery
  • Plan
  • Using the system model, develop
  • Initial State: the failed state of the system
  • Goal State: the recovered state of the system
  • Give the initial and goal states to a planner
  • Planner outputs a plan
  • Plan is interpreted and translated
  • Plan actions match component interfaces
    one-to-one
  • Plan dispatch
  • Execute
  • Plan scripts are executed on the system to bring
    about the desired change
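
Because each plan action corresponds to exactly one component interface, translating a plan into a script can be a simple table lookup. The sketch below assumes plan steps of the form `(action component machine)` and uses made-up command templates; the real step syntax and commands depend on the planner and on how each component exposes its control interface.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from plan action names to shell command templates.
INTERFACE_COMMANDS: Dict[str, str] = {
    "start": "ssh {machine} service {component} start",
    "stop": "ssh {machine} service {component} stop",
}


def parse_step(step: str) -> Tuple[str, str, str]:
    """Split a plan step such as '(start tomcat m2)' into its three parts."""
    action, component, machine = step.strip("() ").split()
    return action, component, machine


def plan_to_script(plan: List[str]) -> List[str]:
    """Translate each plan step into one command line, preserving order."""
    script = []
    for step in plan:
        action, component, machine = parse_step(step)
        template = INTERFACE_COMMANDS[action]  # KeyError if no interface matches
        script.append(template.format(machine=machine, component=component))
    return script


script = plan_to_script(["(stop tomcat m1)", "(start tomcat m2)"])
```

The function only builds command strings; dispatching and running them on the target machines is a separate step.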

19
Exceptions in Recovery Process
  • No Plan Found
  • A new goal state is selected from the Application
    Configuration Model
  • Planner is invoked again
  • More Failures are Reported
  • Recovery process reverts based on the current
    phase
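
The "No Plan Found" case can be sketched as a loop over alternative configurations. This illustration assumes the configurations are tried in a fixed preference order and uses a stub in place of a real planner; both are assumptions, since the slide only says that a new goal state is selected.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Plan = List[str]


def recover(
    configurations: Sequence[str],
    plan_for: Callable[[str], Optional[Plan]],
) -> Optional[Tuple[str, Plan]]:
    """Try each configuration in order until the planner returns a plan."""
    for config in configurations:
        plan = plan_for(config)
        if plan is not None:
            return config, plan
    return None  # no configuration is reachable


def stub_planner(config: str) -> Optional[Plan]:
    # Hypothetical planner: pretend Tomcat cannot be started anywhere.
    if "tomcat" in config:
        return None
    return [f"deploy {part}" for part in config.split("+")]


configs = ["apache+tomcat+mysql", "apache/php+mysql", "tomcat+mysql", "apache"]
result = recover(configs, stub_planner)
```

With the stub above, the first configuration fails and recovery falls back to the second, degraded configuration.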

20
Flow of Automated Failure Recovery
21
Failure Recovery Sequence
22
Architecture of Recover
23
Evaluation
  • Basic Experiments
  • Goal: to measure how long the recovery
    process takes under realistic failure
    scenarios
  • Synthetic Experiments
  • Goal: to test the limits of the system under
    synthetic loads that are not possible in a
    laboratory setting, i.e. a large number of
    machines, components and applications
  • Intensive Experiments
  • Goal: to test how the system behaves under a
    sequence of failures

24
Applications
  • SMS (Strand Map Service)
  • A service designed to provide strand map
    functionality to the user
  • RuBBoS (Rice University Bulletin Board System)
  • A bulletin board benchmark site designed like
    Slashdot, with large amounts of data.
  • WebCal
  • An online calendar for individuals and/or groups
    for scheduling meetings, tasks etc.

25
Approach to Automated Failure Recovery
  • Systems with three types of artifacts
  • Applications
  • Components
  • Machines

26
Basic Experiments (Experimental Setup)
System Before a Failure
27
Basic Experiments
System After a Failure
28
Basic Experiments
System After Recovery
29
Average Recovery Time: 5.55 seconds
30
Synthetic Experiments
  • To test the time required for planning in
    large-scale systems
  • Developed a simulator to test large-scale systems.
  • Experimental Setup
  • 20 machines
  • 20 components
  • 10 applications
  • Applications have various high-level
    configurations

31
Synthetic Experiments
32
Synthetic Experiment: Recovery from Machine
Failures, Time to First Plan
33
Synthetic Experiment: Recovery from Component
Failures, Time to First Plan
34
Intensive Experiments
  • To test the system under a sequence of failures
  • To test the behavior of the system if the
    original configuration cannot be recovered
  • Same setup as basic experiments
  • Only one application, i.e. RuBBoS

35
Intensive Experiments (System before Failure)
36
Intensive Experiment (Recovery after First
Failure)
37
Intensive Experiment (Recovery after Second
Failure)
38
Contributions
  • AI Planning-based Dynamic Reconfiguration
  • AI planning to get recovery configurations
  • Ability to handle unforeseen failures
  • A way to represent high-level configuration
    objectives using AI planning
  • Low-level operations are performed automatically
    to achieve high-level objectives
  • Optimization of time, cost and resource usage
  • Automated Failure Recovery
  • A fast automated way to recover distributed
    systems from failures

39
Future Work
  • Availability of more resources
  • Recovery of state
  • Failures during failure recovery
  • Script generation framework
  • Distributed systems simulator

40
Backup Slides
41
Sensing Models
42
Push Model (Felber et al.)
43
Pull Model (Felber et al.)
44
Dual Model (Felber et al.)
45
Basic Experiments: Detailed Results
46
Average Recovery Time: 3.54 seconds
47
Average Recovery Time: 4.77 seconds
48
Synthetic Experiments: Time to Find Best Plan
49
Recovery from Machine Failures: Time to Best Plan
50
Recovery from Component Failures: Time to Best
Plan
51
Introductory Slides w/References
52
Overview
  • Availability and reliability are important for
    dependability in computer systems
  • Failures disrupt the availability and reliability
    of computer systems.
  • A single hour of downtime causes revenue loss that
    ranges from $200,000 for e-commerce sites to
    $6,000,000 for brokerage firms [1].
  • Eight hours to eighty hours of downtime per year
    in systems publicized as highly available [2].

[1] Donn DiNunno. Quantifying performance loss. IT
performance engineering measurement
strategies. Meta Group, October 2000.
[2] Patterson et al. Recovery oriented computing
(ROC): Motivation, definition, techniques, and
case studies. UC Berkeley Computer Science
Technical Report UCB/CSD-02-1175, Berkeley, CA,
March 2002.
53
Common Approaches to High Availability
  • Redundancy
  • A good alternative because of cheap hardware.
  • But is it really cost effective?
  • Total Cost of Ownership (TCO)
  • Administrative Costs
  • Software Costs
  • Configuration Costs
  • etc..
  • In reality TCO is 5-10 times higher than cost of
    hardware [1]

[1] Patterson et al. Recovery oriented computing
(ROC): Motivation, definition, techniques, and
case studies. UC Berkeley Computer Science
Technical Report UCB/CSD-02-1175, Berkeley, CA,
March 2002.
54
Enemies of High Availability
  • Study by Gray in 1986 [1]
  • Cause of Downtime
  • Operator Errors 42%
  • Study by Oppenheimer et al. in 2003 [2]
  • Causes of Downtime
  • Hardware 15%
  • Software 34%
  • Operator Errors 51%
  • 50% of operator errors are configuration errors
  • Most of the configuration errors are due to
    enormous levels of complexity in computer systems

[1] Jim Gray. Why do computers stop and what can be
done about it? In Symposium on Reliability in
Distributed Software and Database Systems, pages
3-12, 1986.
[2] Oppenheimer et al. Why do internet
services fail, and what can be done about it? In
USENIX Symposium on Internet Technologies and
Systems, 2003.
55
Planning
56
Domain (Graphical Representation)
57
A Sample Plan
58
PDDL Example
  • (define (domain metricVehicle)
  • (:requirements :strips :typing :fluents)
  • (:types vehicle location)
  • (:predicates
  • (at ?v - vehicle ?p - location)
  • (accessible ?v - vehicle ?p1 ?p2 - location))
  • (:functions
  • (fuel-level ?v - vehicle)
  • (fuel-required ?p1 ?p2 - location)
  • (total-fuel-used))
  • (:action drive
  • :parameters (?v - vehicle ?from ?to - location)
  • :precondition (and (at ?v ?from)
  • (accessible ?v ?from ?to)
  • (> (fuel-level ?v) (fuel-required ?from ?to)))
  • :effect (and (not (at ?v ?from))
  • (at ?v ?to)

59
Implicit and Explicit Goal State
  • Explicit Goal State
  • (application-ready-5 ?app - application ?ap -
    apache ?s - service ?t - tomcat ?con - connector)
  • Implicit Goal State
  • (application-ready-1a ?app - application)

60
Models
61
Application Configuration Model (Not in PDDL
Format)
  • Application Name
  • Configuration Number
  • Component A
  • Component A Included
  • Component A Intercomponent Dependencies
  • Component A Intracomponent Dependencies
  • Application Import Time to Component A
  • Component B
  • Component B Included
  • Component B Intercomponent Dependencies
  • Component B Intracomponent Dependencies
  • Application Import Time to Component B
  • Component C
  • Component C Included
  • Component C Intercomponent Dependencies
  • Component C Intracomponent Dependencies
  • Application Import Time to Component C

62
Application Configurations
  • SMS (Strand Map Service)
  • A service designed to provide strand map
    functionality to the user
  • Possible configurations
  • 1. Apache + Tomcat + MySQL
  • 2. Tomcat + MySQL
  • 3. Apache
  • RuBBoS
  • A bulletin board benchmark site designed like
    Slashdot, with large amounts of data.
  • Possible configurations
  • 1. Apache + Tomcat + MySQL
  • 2. Apache/PHP + MySQL
  • 3. Tomcat + MySQL
  • 4. Apache
  • WebCal
  • An online calendar for individuals and/or groups
    for scheduling meetings, tasks etc.
  • Possible configurations
  • 1. Apache/PHP + MySQL
  • 2. Apache/PHP

63
Application Dynamic Model
  • Application Name
  • Current Configuration
  • Component A
  • Component A Machine
  • Component A State
  • Component B
  • Component B Machine
  • Component B State
  • Component C
  • Component C Machine
  • Component C State

64
Component Model
  • Interfaces
  • Control Interface
  • Start
  • Stop
  • Configuration Interface
  • AddVirtualHost
  • AddModule

65
Component Model
  • Properties
  • Component Type
  • Component Machine
  • Component State
  • Start Time
  • Restart Time
  • Accessible Port
  • Module Currently Installed
  • Module Available to be Installed
  • Current Load
  • Component Instance Number
  • Connectors Available
  • Applications Installed

66
Machine Model
  • Machine Name
  • Machine IP Address
  • Application Installations Available
  • Maximum Load
  • Current Load
  • Total Memory
  • Available Memory
  • Total Harddisk Space
  • Available Harddisk Space
  • Operating System
  • Machine Architecture
  • Components Installed
  • Software Available
  • Machine State

67
Foundational Work
68
Underlying Technologies (Planning)
  • Planners
  • Progression Planners
  • GraphPlan, IPP
  • Regression Planners
  • Unpop, HSPr
  • Causal-link Planners
  • UCPOP
  • Compilation Planners
  • SATPLAN, Blackbox
  • Planning Systems
  • Sekitei
  • ASPEN
  • Prodigy

69
Underlying Technologies (Dependency Management)
  • Various classes of dependencies
  • Classification by Alda et al.
  • Syntactical Dependencies (Direct Communication)
  • Implicit Dependencies (Indirect Communication)
  • Classification by Keller and Kar
  • Multidimensional Space of Dependencies

70
Underlying Technologies (Dependency Management)
Keller and Kar Dependency Classification
71
Dynamic Reconfiguration Approaches
  • Agent-based approaches
  • Redundancy-based approaches
  • OS-level approaches
  • Platform-specific approaches
  • Workflow-based approaches
  • Middleware-based approaches

72
Fault Tolerance
  • Goal: increase Mean Time To Fail (MTTF)
  • Single version fault tolerance
  • N-version fault tolerance
  • Fault tolerance in distributed systems

73
Failures during Failure Recovery
74
Failures during Failure Recovery (Failure
Possibilities)
76
Failures during Failure Recovery (Handling in
Each Phase)
77
Failures during Failure Recovery (Dependency
Graph)