Z. Kalbarczyk, K. Whisnant, Q. Liu, R.K. Iyer

1
ARMOR-based Hierarchical Fault/Error Management
  • Z. Kalbarczyk, K. Whisnant, Q. Liu, R.K. Iyer
  • Center for Reliable and High-Performance
    Computing
  • Coordinated Science Laboratory
  • University of Illinois at Urbana-Champaign
  • 1308 W. Main St. Urbana, IL 61801

2
Networked/Distributed Systems: Key Questions
  • How do we integrate components with varying
    fault tolerance (detection and recovery)
    characteristics into a coherent high-availability
    networked system?
  • How do you guarantee reliable communications?
  • How do you synchronize actions of dispersed
    processors and processes?
  • How do you contain errors (or achieve fail-silent
    behavior of components) to prevent error
    propagation?
  • How do you reconfigure the system in response to
    failures?

3
Failure Categories
  • Necessity to cope with machine (node), process,
    and network failures
  • A component specification defines what output
    should be produced in response to any sequence of
    inputs as well as the real-time interval within
    which this output should occur

4
Failure Categories (cont.)
  • Crash failures are a proper subclass of omission
    failures
      • a crash failure occurs when, after a first
        omission to send or receive a message, a
        process systematically omits to send or
        receive messages
  • Omission failures are a proper subclass of timing
    failures
      • a process that suffers an omission failure can
        be understood as having an infinite response
        time
  • Timing failures are a proper subclass of
    incorrect computation failures
      • a timing failure occurs when a process takes
        some action too soon or too late
  • Incorrect computation failures are a proper
    subclass of the class of all possible failures,
    the Byzantine or malicious failures
      • a faulty process may send spurious messages to
        other processes, may lie, or may not respond
        correctly to received messages
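
The nesting of failure classes described on this slide can be sketched as an ordered enumeration. This is only an illustration: the names `FailureClass`, `is_nested_in`, and `tolerates` are hypothetical and do not come from the presentation.

```python
from enum import IntEnum


class FailureClass(IntEnum):
    """Failure classes ordered from the most restrictive to the most general."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    INCORRECT_COMPUTATION = 4
    BYZANTINE = 5


def is_nested_in(inner: FailureClass, outer: FailureClass) -> bool:
    """True if `inner` is the same class as `outer` or is contained in it."""
    return inner <= outer


def tolerates(designed_for: FailureClass, observed: FailureClass) -> bool:
    """A mechanism designed for one class also covers the classes nested in it."""
    return is_nested_in(observed, designed_for)
```

Under this ordering, a mechanism built for timing failures also covers crash and omission failures, while one built only for omission failures gives no guarantee against Byzantine behavior.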

5
What Do We Propose in Approaching the Problems?
  • ARMOR-based programming environment that provides
  • A process architecture
      • offers flexibility in assigning functionality to
        specific processes, including error detection
        and recovery techniques that can be
        reconfigured according to dependability and
        application requirements
      • scales according to the number of nodes
        available and to the number of applications
        simultaneously executing in the system.
  • A runtime environment
      • provides external process management to
        applications
      • allows fine-tuning of fault tolerance services
        provided to and embedded in the application.
  • Hierarchy of error detection and recovery
      • to avoid single point(s) of failure
      • to provide protection not only to the
        applications, but also to the entities
        supporting detection and recovery services

6
What are ARMORs?
  • Adaptive Reconfigurable Mobile Objects of
    Reliability
      • Multithreaded processes composed of replaceable
        building blocks.
      • Provide error detection and recovery services
        to user applications via three levels of
        interaction.
  • Hierarchy of ARMOR processes forms the runtime
    environment
      • System management, error detection, and error
        recovery services distributed across ARMOR
        processes.
      • ARMOR runtime environment is self-checking.
  • ARMOR support for the application
      • Completely transparent and external support.
      • Transparent extension of standard libraries.
      • Instrumentation with ARMOR API.

7
ARMOR Configuration
Repository of Elements
  • HB element
  • Data dependency checking element
  • Progress indicator element
  • Checksum element
  • Assertion check element
  • Text-segment signature element
  • Control flow signature element
  • Range-check element
  • Checkpoint element

8
ARMOR Computation Model
[Figure: an element handling the operations opAction1, opAction2, and opAction3]
  • Elements invoked through events called
    operations.
  • A thread consists of a sequence of operations
    that execute.
  • In response to an operation, an element can
      • Read/write thread variables that serve as
        input/output for the operation.
      • Read/write element state.
      • Generate additional operations to be processed
        within the thread.
  • Element-based detection and recovery
      • Monitor generates an operation when it detects
        an error.
      • Policy elements subscribe to the notification
        operation and generate a sequence of
        operations to effect recovery.
      • Service elements carry out individual recovery
        steps.
      • Response to errors can be reconfigured by
        changing policy elements in ARMORs.
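
A minimal sketch of this computation model, assuming a simple queue-based dispatcher. The class names, the `handle` method, and the operation names are hypothetical and are not the ARMOR API; the sketch only shows how a policy element can turn one error notification into a sequence of recovery operations carried out by a service element.

```python
from collections import defaultdict, deque


class Microkernel:
    """Routes operations to the elements subscribed to them (hypothetical API)."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # operation name -> elements

    def subscribe(self, op_name, element):
        self.subscribers[op_name].append(element)

    def run_thread(self, first_op, variables):
        """Process one thread: a queue of operations sharing `variables`."""
        pending = deque([first_op])
        trace = []
        while pending:
            op = pending.popleft()
            trace.append(op)
            for element in self.subscribers[op]:
                # An element may generate further operations for this thread.
                pending.extend(element.handle(op, variables))
        return trace


class RestartPolicy:
    """Policy element: maps an error notification to recovery steps."""

    def handle(self, op, variables):
        return ["stop_process", "start_process"]


class ProcessService:
    """Service element: carries out individual recovery steps."""

    def __init__(self):
        self.log = []   # element state

    def handle(self, op, variables):
        self.log.append((op, variables["pid"]))   # reads a thread variable
        return []


kernel = Microkernel()
service = ProcessService()
kernel.subscribe("process_failed", RestartPolicy())
kernel.subscribe("stop_process", service)
kernel.subscribe("start_process", service)

# A monitor would generate "process_failed" when it detects an error.
trace = kernel.run_thread("process_failed", {"pid": 42})
```

Swapping `RestartPolicy` for a different policy element changes the response to the same error without touching the monitor or the service element.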

9
ARMOR Runtime Environment
[Figure: a multi-node solution (nodes on a network running Daemon ARMORs, a Manager ARMOR, a Heartbeat ARMOR, a Primary ARMOR, a Backup ARMOR, and applications) and a single-node solution (Manager ARMOR, Exec. ARMOR, and application)]
  • Various kinds of ARMORs execute in environment
    depending upon requirements.
  • Distribution of detection and recovery
    responsibilities makes environment resilient to
    ARMOR failures.
  • Solutions scalable to one-node configuration.

10
Daemons
  • Each node in runtime environment executes a
    daemon.
  • Provide services to local ARMORs
      • Install ARMORs on local node.
      • Detect ARMOR process crash/hang failures.
      • Channel for ARMOR-to-ARMOR communications.
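
The slide does not say how a daemon detects crash or hang failures. One common approach, shown here purely as an assumption, is a heartbeat timeout: each local ARMOR reports in periodically, and the daemon suspects any ARMOR that stays silent too long. `HeartbeatMonitor` and the ARMOR identifiers are hypothetical names, and timestamps are passed in explicitly to keep the sketch deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each local ARMOR (hypothetical API)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated
        self.last_seen = {}         # ARMOR id -> time of last heartbeat

    def heartbeat(self, armor_id, now):
        """Record that `armor_id` reported in at time `now`."""
        self.last_seen[armor_id] = now

    def suspected(self, now):
        """Return the ARMORs that have been silent for longer than the timeout."""
        return sorted(
            armor_id
            for armor_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("exec_armor", now=0.0)
monitor.heartbeat("manager_armor", now=0.0)
monitor.heartbeat("exec_armor", now=2.5)   # only exec_armor keeps reporting
hung = monitor.suspected(now=4.0)
```

A timeout of this kind cannot tell a crash from a hang; a real daemon would likely combine it with operating-system notification of process exit.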

[Figure: daemons on Node 1, Node 2, and Node 3 connected through the network. A daemon is built on the ARMOR microkernel with TCP connection management, named pipe management, process management, and detection policy elements, and it talks to local ARMORs and to remote daemons.]

11
Managers
  • Manage a group of ARMORs.
  • Responsible for recovering failed ARMORs.
  • Contain information about each ARMOR
      • Location in the network.
      • Current configuration.
      • Recovery policy.
      • Associated application.
  • Detect and recover from node failures.
  • Allocate nodes (including spares) for application
    and for ARMOR processes.
  • Interface with user.
  • Manager functionality can be consolidated into
    one Manager ARMOR or distributed across hierarchy
    of Manager ARMORs.
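
The per-ARMOR information a Manager keeps, and its use when a node fails, might look like the following sketch. The record fields mirror the bullets above; everything else (`ArmorRecord`, `Manager`, `handle_node_failure`, the node names) is a hypothetical illustration rather than the actual Manager ARMOR design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArmorRecord:
    name: str
    node: str                          # location in the network
    elements: List[str]                # current configuration
    recovery_policy: str               # e.g. "restart" or "migrate"
    application: Optional[str] = None  # associated application, if any


class Manager:
    """Keeps one record per managed ARMOR and a pool of spare nodes."""

    def __init__(self, spares):
        self.records = {}
        self.spares = list(spares)

    def register(self, record):
        self.records[record.name] = record

    def handle_node_failure(self, failed_node):
        """Reassign every ARMOR on the failed node to one spare node."""
        if not self.spares:
            raise RuntimeError("no spare node available")
        spare = self.spares.pop(0)
        moved = []
        for record in self.records.values():
            if record.node == failed_node:
                record.node = spare
                moved.append(record.name)
        return spare, moved


manager = Manager(spares=["node3"])
manager.register(ArmorRecord("exec1", "node1", ["heartbeat"], "migrate", "app1"))
manager.register(ArmorRecord("exec2", "node2", ["checkpoint"], "restart", "app2"))
spare, moved = manager.handle_node_failure("node1")
```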

12
Hierarchy of Error Detection and Recovery:
Attributes (1)
  • Adaptivity and composability of individual
    levels.
  • Detection and recovery composition and invocation
    at the individual levels should be customizable
    to meet
      • the application's needs,
      • the types of faults being experienced in the
        system,
      • the reliability characteristics desired.
  • Applications with varying availability
    requirements should coexist in the same
    environment.
  • Detection levels should be
      • selectively turned on or off,
      • independent, so that they can be composed in
        various ways.
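
As a rough sketch of detection levels that can be switched on or off and composed independently, each level below is a named check with an `enabled` flag. The names (`DetectionLevel`, `detect`) and the two example checks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DetectionLevel:
    name: str
    check: Callable[[dict], bool]   # returns True when an error is detected
    enabled: bool = True


def detect(levels, state):
    """Run the enabled levels in order; return the name of the first that fires."""
    for level in levels:
        if level.enabled and level.check(state):
            return level.name
    return None


levels = [
    DetectionLevel("assertion", lambda s: s["value"] < 0),
    DetectionLevel("heartbeat", lambda s: s["missed_beats"] > 2),
]
state = {"value": -1, "missed_beats": 5}
first = detect(levels, state)

levels[0].enabled = False           # turn the assertion level off
second = detect(levels, state)
```

Because each level only inspects the shared state, levels can be added, removed, or reordered without changing the others.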

13
Hierarchy of Error Detection and Recovery:
Attributes (2)
  • Intra-level interactions
      • interactions between techniques placed within
        each level should be evaluated taking into
        account cost, coverage, and intrusiveness
        factors
      • e.g., when assertion checks are placed at
        certain points of the application code, it
        may not be necessary to generate control flow
        signatures for that portion of the code.
  • Inter-level interactions
      • interactions between error detection and
        recovery levels should be carefully defined to
        eliminate redundant invocation of multiple
        detection mechanisms.
      • errors that escape a given level should be
        detectable by higher levels
  • Recovery responsibilities
      • an appropriate recovery strategy should be
        selected based upon the failure and
        circumstances of the failure event
      • avoid competition during error recovery: make
        sure that one and only one entity is
        responsible for recovery of a failed process
        or node
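
The "one and only one entity" rule can be illustrated with a claim table in which the first claimant becomes the owner of a recovery. This is a single-process sketch under stated assumptions; in a distributed setting the claim itself would have to be agreed on reliably, which is not shown. `RecoveryCoordinator` and the entity names are hypothetical.

```python
class RecoveryCoordinator:
    """Records which entity owns the recovery of each failed component."""

    def __init__(self):
        self.owner = {}   # failed component -> entity recovering it

    def claim(self, failed, recoverer):
        """Return True if `recoverer` is (or already was) the sole owner."""
        return self.owner.setdefault(failed, recoverer) == recoverer

    def release(self, failed, recoverer):
        """Give up ownership once recovery has finished or been abandoned."""
        if self.owner.get(failed) == recoverer:
            del self.owner[failed]


coordinator = RecoveryCoordinator()
daemon_owns = coordinator.claim("exec_armor", "daemon")
manager_owns = coordinator.claim("exec_armor", "manager")   # loses the race
```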

14
Hierarchy of Error Detection and Recovery
  • Detection
      • Watchdog timer (livelock detection)
      • Built-in assertion checks
      • Control and data flow check
  • Recovery
      • Restart a process/thread
      • Hardware reset
  • Techniques encapsulated in separate elements
      • Can be selectively turned on or off,
        inserted or removed
      • Arranged in a hierarchy of layers

Layer 1
Process: Inside ARMOR process
  • Detection
      • Progress indicator
      • Smart heartbeats
      • Data audits, OS detection
  • Recovery
      • Checkpointing/Rollback
      • Process restart on the same node

Layer 2
Increasing overhead
Layer 3
Network: Between ARMORs
  • Detection
      • Signature exchange between processes for
        consistency check
      • Global heartbeats
  • Recovery
      • Checkpointing/Rollback
      • Process migration/restart
      • Masking
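
Checkpointing/rollback, listed above as a recovery technique, can be sketched in a few lines. This in-memory version is an assumption-laden illustration (a real checkpoint would go to stable storage); `Checkpointed`, `run_step`, and the two step functions are hypothetical names.

```python
import copy


class Checkpointed:
    """Holds process state together with its last saved checkpoint."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def run_step(proc, step):
    """Apply `step`; checkpoint on success, roll back if it raises."""
    try:
        step(proc.state)
    except Exception:
        proc.rollback()
        return False
    proc.checkpoint()
    return True


def increment(state):
    state["count"] += 1


def corrupt_then_fail(state):
    state["count"] += 10            # partial update before the error
    raise ValueError("detected error")


proc = Checkpointed({"count": 0})
ok = run_step(proc, increment)
failed = run_step(proc, corrupt_then_fail)
```

After the failed step the partial update is discarded and the state returns to the value saved after the last successful step.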

15
ARMOR Applications
  • Base station controller: protecting the
    call-processing application and database of a
    digital mobile telephone network controller.
  • Embedded wireless applications
      • protecting the wireless communication channel
        through ARMOR-based proxies.
      • providing automated detection and recovery to
        wireless telephones and servers.
  • Network services: DHCP (Dynamic Host
    Configuration Protocol).
  • Spaceborne applications: runtime environment for
    protecting distributed spaceborne applications.

16
ARMOR-Based Fault Management in RTES Environment:
Design Options

17
Manager on DSP
  • Local Managers
      • Execute on dedicated DSP per board.
      • Detect and recover from errors localized to
        board.
  • Regional Managers
      • Execute on Linux clusters.
      • Handle recovery that spans multiple boards.

[Figure: Level 1 DSP farm of boards, each with a Mgr, Apps, and a com link, connected to the Level 2/3 Linux farm of nodes running Daemons, Apps with Exec ARMORs, and a Region Mgr.]

18
Manager on PC
  • Local Managers execute on PC assigned to board or
    group of boards.

[Figure: boards with Apps and a com link, each paired with a Mgr on a Linux/Win32 PC, connected to nodes running Daemons, Apps with Exec ARMORs, and a Region Mgr.]

19
ARMOR-based Manager Design Details (1)
  • Local Managers are ARMOR processes
      • Reconfigurable monitoring functionality,
        detection policy, recovery policy.
      • Communicate with Linux farm through common
        ARMOR infrastructure.

[Figure: a Local Manager ARMOR on a Linux/Win32 PC, built on the ARMOR microkernel with ARMOR Interface, Recovery Policy, and DSP Interface elements, connected to a board (Apps, com) and to nodes running Daemons, an App with an Exec ARMOR, and a Region Mgr.]

20
ARMOR-based Manager Design Details (2)
  • All functionality found in replaceable elements.
  • Individual ARMORs can be customized based upon
    the role they play in the system
      • Local Manager ARMORs include an element to
        interface with the DSP.
      • Daemon ARMORs contain elements to communicate
        with local ARMORs.
      • Execution ARMORs contain elements to oversee
        the user application.
  • All ARMORs consist of a microkernel used to add
    elements, remove elements, and communicate among
    elements.
  • Each element found in separate shared library
      • Elements are explicitly loaded by the
        microkernel through dlopen() and dlsym().
      • Dynamic reconfiguration can be done on demand.
  • Elements subscribe to event messages that they
    are designed to process.
  • Tcl interface used to construct messages that are
    sent to ARMORs.
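
The slides describe elements as shared libraries loaded at run time. As a loose analogy only, the sketch below loads an element from a file with Python's `importlib` (standing in for opening the library) and looks up its handler by name with `getattr` (standing in for symbol lookup). The element source, file name, and `load_element` helper are hypothetical; the real ARMOR microkernel loads native shared libraries.

```python
import importlib.util
import pathlib
import tempfile

# Source of a hypothetical element, written to disk to play the role of
# a separately built shared library.
ELEMENT_SOURCE = '''
SUBSCRIPTIONS = ["heartbeat"]

def handle(op, payload):
    return f"{op} handled for {payload}"
'''


def load_element(path):
    """Load an element module from a file at run time."""
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    element_path = pathlib.Path(tmp) / "heartbeat_element.py"
    element_path.write_text(ELEMENT_SOURCE)
    element = load_element(element_path)
    handler = getattr(element, "handle")       # look up the entry point by name
    subscriptions = list(element.SUBSCRIPTIONS)
    result = handler("heartbeat", "exec_armor_1")
```

Because the element is resolved by file and by name at run time, a microkernel built this way can add or replace elements on demand without being rebuilt.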