Design of High Availability Systems and Networks: Validation


1
Design of High Availability Systems and Networks: Validation
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2
Outline
  • Introduction
  • Validation methods
  • Design phase
  • Fault simulation
  • Prototype phase
  • HW or SW implemented fault injection
  • Operational phase
  • Measurement and analysis of field systems

3
Why Validation/Benchmarking?
  • Characterize different detection and recovery
    mechanisms
  • Coverage
  • Performance overhead
  • Determine system/application sensitivity to
    errors
  • Single points of failure (dependability
    bottlenecks)
  • Error propagation patterns
  • Placement of detection and recovery mechanisms
  • Error susceptibility of runtime fault management
    infrastructure
  • Analyze cost/reliability/performance tradeoffs
  • Need benchmarks to facilitate meaningful
    comparison of designs and systems

4
Experimental Validation
  • Early Design Phase
  • Approach and Goals
  • CAD environments used to evaluate design via
    simulation
  • Simulated fault injection experiments
  • Evaluate effectiveness of fault-tolerant
    mechanisms
  • Provide timely feedback to system designers
  • Information produced
  • error latency, error detection coverage,
    recovery time distribution
  • Limitation/issues
  • Simulations need accurate inputs, fault models,
    and validation of results; simulation time
  • Prototype Phase
  • Approach and Goals
  • System runs under controlled workload
    conditions
  • Controlled fault injections used to evaluate the
    system in the presence of faults
  • Information produced
  • error latency, propagation, detection
    distributions, availability
  • Limitation/issues
  • Injected faults should create/induce failure
    scenarios representative of actual system
    operation
  • Operational Phase
  • Approach and Goal
  • Study naturally occurring errors
  • Study systems in the field, under real
    workloads
  • Analyze collected error and
    performance data
  • Information produced
  • actual failure characteristics, failure
    rates, time to failure distribution
  • Limitation/issues
  • HW/SW instrumentation, analysis tools

5
Design Phase: Fault Injection
Hybrid Simulation
6
Simulation at Different Levels
  • Electrical level
  • transistor, circuit, chip
  • Logic level
  • circuit, VLSI systems
  • Function level
  • VLSI system, computer and network systems

Levels of Simulated Fault Injection:

Level            | Fault injection
-----------------|-----------------------------------------
Electrical level | Change current, change voltage
Logic level      | Stuck-at 0 or 1, inverted fault
Function level   | Change CPU registers, flip memory bit,
                 | corrupt network messages

[Figure: electrical circuits model the physical process, logic gates
model the logic operation, and functional units model the system.]
7
Issues in Simulated Fault Injection
  • Fault models
  • Fault conditions, fault types
  • Number of faults
  • Fault times
  • Fault locations
  • Workload
  • Real applications
  • Benchmarks
  • Synthetic programs
  • Simulation time explosion
  • Mix-mode simulation
  • Importance sampling
  • Concurrent simulation
  • Accelerated fault simulation
  • Hierarchical simulation

8
Fault Injection at Electric Level
  • Why is it needed?
  • Study the impact of physical causes
  • Simple stuck-at models do not represent many real
    types of faults

[Figure: transistor-level and device-physics-level simulation.]
9
Simulated Fault Injection at Logic Level
  • Fault Model
  • Basic models
  • stuck-at (permanent) - forcing logic value for
    entire simulation duration
  • inverted fault (transient) - altering logic value
    momentarily
  • Fault dictionary approach
  • Use electrical level simulation to derive
    logic-level fault models
  • dictionary entry - input vector, injection time,
    fault location
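The two basic logic-level models can be sketched in a few lines of code. Everything below (the one-bit adder netlist, the net names, the `faults` mapping) is illustrative only, not taken from any particular fault simulator:

```python
# Sketch of logic-level fault injection on a tiny gate-level netlist.

def full_adder(a, b, cin, faults=None):
    """One-bit full adder; `faults` maps an internal net name to a
    forced value (a stuck-at fault) applied for this evaluation."""
    faults = faults or {}

    def net(name, value):
        # A stuck-at fault forces the net for the whole simulation;
        # an inverted (transient) fault would flip it for a single
        # evaluation instead.
        return faults.get(name, value)

    x = net("xor1", a ^ b)
    s = net("sum", x ^ cin)
    c = net("carry", (a & b) | (x & cin))
    return s, c

# Fault-free vs. stuck-at-0 on the first XOR output:
golden = full_adder(1, 0, 1)
faulty = full_adder(1, 0, 1, faults={"xor1": 0})
print(golden, faulty)  # differing outputs reveal the fault
```

Comparing faulty runs against the fault-free ("golden") run for every input vector is how a fault-dictionary entry would be derived.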

[Figure: transistor-level description of a 4-bit adder (inputs A, B,
Cin; outputs S, Cout) injected with a current-burst fault model, and
the resulting logic-level fault dictionary built for all nodes and
all input combinations.]
10
Fault Injection at Function Level
  • Diversity of Components
  • Object-oriented approach
  • Fault Models
  • Various types - depending on the type of
    components
  • Examples
  • Single bit-flip for a memory or register fault
  • Message corruption for communication channel
    fault
  • Service interrupt for a node fault
  • More detailed fault models derived from
    lower-level simulation
  • Impact of Software
  • Impact of faults is application dependent
  • Software effect can be studied at this level
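The first two example models can be made concrete with a short, hedged sketch: a single bit-flip applied to a memory or register word, and a one-bit corruption of a message on a communication channel. The function names and the 32-bit width are invented for this illustration:

```python
import random

def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Function-level fault model: single bit-flip in a register or
    memory word (illustrative sketch, not from a specific tool)."""
    assert 0 <= bit < width
    return word ^ (1 << bit)

def corrupt_message(msg: bytes, rng: random.Random) -> bytes:
    """Communication-channel fault model: flip one random bit of a
    message in transit."""
    i = rng.randrange(len(msg) * 8)
    out = bytearray(msg)
    out[i // 8] ^= 1 << (i % 8)
    return bytes(out)

print(hex(flip_bit(0x0000_00FF, 8)))  # 0x1ff
```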

11
Hierarchical Fault Simulation
[Figure: hierarchical fault simulation. At the device-physics level,
ionizing particles produce an electrical transient; the effect
propagates upward through the transistor, electrical, logic, and chip
levels of a processor, and at the system level through hosts, network
interfaces, and a switch on a local network, into software modules.
Lower-level fault effects propagate to the higher levels.]
12
Prototype Phase: Hardware-Implemented Fault
Injection
[Figure: MESSALINE architecture. Input files drive the generation of
system activity on the system under study; a fault injector and a
monitor/controller manage the experiment.]
  • Developed at LAAS-CNRS, France
  • Both probes and socket insertion are used
  • Can inject up to 32 injection points
  • Applications
  • A subsystem of a railway interlocking control
    system
  • A distributed communication system

13
Prototype Phase: Software-Implemented Fault
Injection
  • Advantages: flexibility, low cost
  • Disadvantages: perturbation to workload, low time
    resolution
  • Targets for software fault injection
  • Software faults and errors
  • modify the text/data segment of the program
  • Memory faults
  • flip memory bits
  • CPU faults
  • modify CPU registers, cache, buffers
  • Bus faults
  • use traps before and after an instruction to
    change the code or data used by the instruction;
    restore the data after the instruction is
    executed
  • Network faults
  • modify or delete transmitted messages
  • introduce faults in network controllers, drivers,
    buffers
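The trap-based technique described for bus faults (corrupt the data before the instruction executes, restore it afterwards) can be mimicked in miniature. The context manager below is only an analogy for the trap pair, operating on a plain byte array rather than a real bus:

```python
from contextlib import contextmanager

@contextmanager
def bus_fault(memory: bytearray, addr: int, bit: int):
    """Sketch of the trap-based bus-fault technique: corrupt the data
    an instruction will read, then restore it afterwards (here the
    'traps' are just the entry and exit of a context manager)."""
    memory[addr] ^= 1 << bit      # trap before the instruction
    try:
        yield
    finally:
        memory[addr] ^= 1 << bit  # trap after: restore original data

mem = bytearray(b"\x2a\x00")
with bus_fault(mem, 0, 0):
    observed = mem[0]    # the "instruction" reads corrupted data
print(observed, mem[0])  # 43 42 (value restored after the injection)
```

A real injector would do this with hardware breakpoints or debugger traps on the target process rather than a Python object.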

14
Prototype Phase: Fault Injection Requirements
  • Distributed test and evaluation environment
  • Support for the architecture independent approach
  • Evaluate hardware and software implemented fault
    tolerance of single node architectures,
    distributed systems and embedded applications
  • Support fault injection to a variety of targets
    including CPU registers, cache, memory, I/O,
    network, applications, and OS functions
  • Examples of fault injection strategies include
  • random components and locations
  • selected hardware and software components (can be
    the predefined or random locations within a
    component)
  • application data and control flow
  • triggered by high stress conditions
  • impact the system timing (e.g., to mimic omission
    failures)
  • Allow collection and analysis of results to
    derive measures for characterizing the system
    (e.g., coverage, fault severity, propagation,
    latency, availability, etc.)

15
Fault/Error Injection
Fault Injection Specs:
  • Injection strategy: stress-based, path-based, random
  • Injection method: by hardware, by software
  • Fault location: CPU, memory, disk I/O, network I/O, other I/Os
  • Injection time: load threshold, program execution path, fault
    arrival rate

Workload Specs:
  • Rates and mixes
  • Interaction intensity

[Figure: the fault injector and workload drive the system under test
(CPU, I/O) at a controlled load level.]
16
NFTAPE
  • NFTAPE is a tool for conducting automated
    fault/error injection-based dependability
    characterization
  • Tool, which enables a user (1) to specify a
    fault/error injection plan, (2) to carry out
    injection experiments, and (3) to collect the
    experimental results for analysis.
  • Facilitates automated execution of fault/error
    injection experiments.
  • Targets assessment of a broad set of
    dependability metrics, e.g., availability,
    reliability, coverage, mean time to failure.
  • Operates in a distributed environment.
  • Imposes minimal disturbance of target systems

17
NFTAPE Architecture
[Figure: NFTAPE architecture. A Control Host executes a Campaign
Script and maintains a log; over the LAN it coordinates Process
Managers on the target nodes, each of which runs injector processes
and application processes (the error injection targets).]
18
Control Host Process Manager
  • Control Host
  • Processes a Campaign Script, a file that
    specifies a state machine or control flow
    followed by the control host during the fault
    injection campaign
  • Simple yet general way to customize a fault
    injection experiment
  • Experiments controlled by the Common Control
    Mechanism
  • Implemented in Java to ensure portability
  • Process Manager
  • Daemon on each target node to manage processes on
    the target node(s) including process execution
    and termination
  • processes include injectors, workloads,
    applications, monitors
  • all processes are treated the same as an abstract
    process object rather than a process of some
    specific type
  • Facilitates communication between the Control
    Host and Target Nodes.

19
State Machine of an Example Campaign Script
[Figure: state machine of an example campaign script. ST_init
(initialization of variables and events, start a logfile) leads to
ST_run_app (start the application, activate the trigger), then to
ST_start_fi_trig (start the Fault Injector and Trigger processes),
then to ST_Trigger_ON (inject errors, deactivate the trigger after a
specified time), and finally to ST_finish (terminate processes, exit
the Control Host). Error states ST_error1 and ST_error2 are entered
on ER_condition_1 and ER_condition_2; normal transitions are guarded
by Condition_1, Condition_2, and Condition_3.]
20
Fault Injectors and Fault Models
  • Debugger-based fault injector: injection into the
    target process's memory and registers
  • Driver-based fault injector: injection into
    memory, registers, OS functions, I/O devices
  • Network injector: injection into network
    cards/controllers, corrupting messages
  • Use of performance monitors (built into the CPU)
    to trigger fault injection
  • Fault injection targets
  • CPU registers, memory, network, application,
    specific OS functions
  • Fault injection triggers
  • random (based on time), application-supplied
    breakpoint, externally supplied breakpoint

21
Applications of NFTAPE
  • Motorola iDEN MicroLite: critical base station
    controller (call-processing application and
    database) in a digital mobile telephone network
  • DHCP (Dynamic Host Configuration Protocol) server:
    evaluation of application control flow checking
  • Software implemented fault tolerance (SIFT)
    environment on the REE testbed: evaluation of
    recovery coverage and performance overhead of the
    SIFT environment
  • Internet server applications, ftp and ssh (secure
    shell): evaluation of error-induced security
    vulnerabilities in the ftp and ssh applications
  • Voltan and Chameleon ARMORs software middleware:
    evaluation of fail-silence provided by process
    duplication (Voltan) versus internal error
    detection (Chameleon)
  • Linux kernel: characterization of the Linux kernel
    under errors
  • Myrinet-based network: failure analysis of a
    high-speed network

22
Group Communication Protocols under Errors
23
Observations
  • Group Communication Systems (GCSs) provide basic
    services for building dependable distributed
    applications
  • Only a few studies have experimentally assessed
    the dependability of GCS implementations
  • Often under simple failure models (e.g., killing
    the target process)
  • We use error injection to study the impact of
    memory and network errors on Ensemble
  • Focus on fail silence violations and error
    propagation
  • Understanding the error-propagation patterns is
    vital in maintaining system integrity

24
Experimental Setup
  • Testbed consisting of three machines (Pentium III
    500 MHz) interconnected by an Ethernet 100 Mbps
    LAN
  • Operating system Linux 2.4
  • Group communication system Ensemble 1.40
  • Error injection experiments
  • memory injections to assess the impact of
    errors in a process's text and heap memory
    segments,
  • network injections to analyze the impact of
    corruption of messages exchanged in support of
    the communication protocols.

25
Benchmarks
  • Use synthetic benchmarks to exercise the
    different group communication protocols
  • Group: exercises the group membership service
  • Fifo: exercises the fifo-ordered reliable
    multicast
  • Atomic: exercises the totally ordered reliable
    multicast
  • Sequencer-based Ensemble implementation
  • Three processes join a multicast group and
    (possibly) exchange messages in rounds

26
Profiling Ensemble Function Invocations
  • Ensemble is a 2.5 MB static library containing
    6000 functions (only 1000 are actually used).
  • About 5% of function invocations are for the
    Ensemble micro-protocols
  • the part of a GCS that is usually formally
    specified and verified
  • 20% of function invocations are for utility
    functions belonging to the Ensemble source code
  • About 50% of run-time function invocations are
    for the run-time support of OCAML.

27
Memory Injections
  • Error Models
  • TEXT: bit errors in the text segment of the target
    process
  • HEAP: bit errors in the allocated heap memory of
    the target process
  • Outcome categories
  • Manifested errors are divided into
  • Crash failure: the application stops executing,
    e.g., termination by the OS (e.g., SIGSEGV), HANG,
    ASSERT (target process shuts itself down)
  • Fail silence violation: the application performs
    an invalid computation, e.g., sends a corrupted
    message to other processes causing them to fail

28
TEXT/HEAP Injection Results
  • Over 90,000 bit errors injected in Ensemble's
    text/heap memory
  • For the manifested errors
  • 95% result in clean crash failures (5-10% of
    these detected by Ensemble assertions)
  • 5% result in fail silence violations

Fail Silence Violations
  • Text
  • Fail silence violations are rare for group
    (0.5%) but not absent
  • The reason is the underlying heartbeat
    between group members
  • The addition of communication among
    application processes significantly increases
    the chances of fail silence violations
  • 4% (fifo) and 36% (atomic)
  • Heap
  • No occurrence of fail silence violations for the
    group benchmark
  • In the presence of application-level
    communication (fifo and atomic benchmarks),
    fail silence violations account for 5% of the
    manifested errors

29
Fail Silence Violations
  • A majority of the fail silence violations due to
    heap errors are due to a corrupted application
    message being sent/received
  • About 40-80% of the fail silence violations due
    to text errors (atomic and fifo benchmarks) are
    caused by application-level omission failures
  • not the same as the omission failures of the
    underlying GCS, which are detected and recovered
    transparently to the application by means of
    sequence numbers and retransmissions.
  • Crash of non-injected processes (15 cases due
    to text errors)

30
Application-Level Omission Failure (example from
the layer implementing flow control for multicast
messages)
  • Given two processes p and q
  • p can send to q only if send_credit[p][q] > 0. At
    that time, send_credit[p][q] is decremented.
  • For every 50 KB of data q receives from p, q sends
    an ack-credit to p
  • On p receiving an ack-credit from q,
    send_credit[p][q] is incremented and p's buffered
    messages are sent based on the new credit.
  • Due to an injected error, q skips sending an
    ack-credit to p
  • No process can ever send to q
  • Ensemble does not detect it because processes can
    still heart-beat each other
  • Heart-beats are not subject to MFLOW
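The stall can be reproduced with a toy model of the credit protocol described above. The class below is an illustrative sketch, not Ensemble's MFLOW implementation; the credit amounts and message sizes are made up:

```python
class CreditFlow:
    """Minimal sketch of byte-granular, credit-based flow control
    between a sender p and a receiver q."""
    ACK_EVERY = 50_000            # q sends an ack-credit per 50 KB

    def __init__(self, credit=100_000, drop_ack=False):
        self.credit = credit      # p's remaining send credit (bytes)
        self.received = 0
        self.drop_ack = drop_ack  # models the injected error: q skips acks

    def send(self, nbytes) -> bool:
        if self.credit < nbytes:
            return False          # p can no longer send to q
        self.credit -= nbytes
        self.received += nbytes
        while self.received >= self.ACK_EVERY:
            self.received -= self.ACK_EVERY
            if not self.drop_ack:
                self.credit += self.ACK_EVERY  # ack-credit reaches p
        return True

ok, bad = CreditFlow(), CreditFlow(drop_ack=True)
sent_ok  = sum(ok.send(25_000)  for _ in range(20))
sent_bad = sum(bad.send(25_000) for _ in range(20))
print(sent_ok, sent_bad)  # 20 4: without ack-credits, sends to q stall
```

Heartbeats would continue to flow in the faulty run, since they bypass the flow-control layer, which is why the failure goes undetected.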

31
Fail Silence Violations (Cont.)
  • Fail silence violations are due to error
    propagation

[Figure: error propagation between processes P1 and P2 leading to a
fail silence violation.]
32
Network Injections
  • Error Model
  • Single bit errors injected in Ensemble messages
  • Errors occur before/after any encoding (e.g.,
    checksum) is applied/removed
  • Purpose
  • Test Ensemble robustness to invalid inputs
  • Investigate error propagation

33
Network Injections Major Results
  • Ensemble does not check the validity of certain
    message fields
  • e.g., the sender id used to index arrays in
    micro-protocols
  • Solution: add a range check for the message
    sender field
  • OCAML's marshal/unmarshal mechanism is highly
    error sensitive
  • Errors can lead to invalid objects being
    reconstructed and thus to heap corruption
  • Solution: use a more robust encoding for marshaled
    messages, e.g., by means of object delimiters
  • The majority of crashes occur in a small subset of
    Ensemble functions
  • Solution: harden the implementation of the most
    error-prone functions
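The first suggested fix, a range check on the sender field, amounts to validating the index before use. A minimal sketch (the function and the array are hypothetical, not Ensemble code):

```python
def handle_message(members_faulty: list, sender: int) -> bool:
    """Validate the sender field before using it as an array index;
    a corrupted (out-of-range) sender id is rejected instead of
    causing an invalid memory access."""
    if not 0 <= sender < len(members_faulty):
        return False              # drop the corrupted message
    members_faulty[sender] = False
    return True

faulty = [False, False, False]
print(handle_message(faulty, 1))     # True: valid sender id
print(handle_message(faulty, 9999))  # False: corrupted id rejected
```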

34
Network-Level Error Propagation (example from
the group benchmark)
  • An error is injected in the sender field of a
    message sent by the targeted group member.
  • The corrupted message received by another group
    member is used to derive an index into the array
    indicating whether group members are faulty
  • A segmentation violation occurs due to an invalid
    access to the array
  • All group members except the injected member
    crash.

35
Summary
  • Presented an experimental study of the Ensemble
    GCS under memory/network errors, with a focus on
    FSVs
  • 5-6% of manifested errors result in FSVs
  • In contrast with the crash/omission assumption
  • FSVs are an impediment to high dependability
  • Recovery from such failures can be costly
  • Using protocols capable of handling application
    value errors (e.g., Byzantine agreement) will not
    suffice
  • FSVs can affect the mechanism for communication
  • A fault tolerance middleware must tolerate its
    own errors

36
Steps in Measurement-Based Analysis
  • Step 1: data processing
  • Step 2: model identification and measure
    acquisition
  • Step 3: model solution, if necessary
  • Step 4: analysis of models and measures

37
Measurement Issues
  • Deciding what and how to measure is difficult.
  • A combination of installed and custom
    instrumentation is used in most studies.
  • Sound evaluations require a considerable amount
    of data.
  • Failures are infrequent and measurements must be
    taken over a long period of time.
  • Systems must be exposed to a wide range of usage
    conditions
  • Only detected errors can be measured.

38
Goals
  • Understand the nature of failures in computer
    systems
  • essential in improving system availability
    and reliability
  • Characterize failure behavior
  • Provide insight into
  • error propagation (in particular between nodes in
    a network)
  • impact of correlated errors
  • system availability
  • Identify deficiencies and suggest improvements in
    the error logging mechanism

39
Correlated Failures
  • Significantly degrade availability, reliability,
    and performance
  • Single failure tolerance is not enough
  • Models assuming failure independence are not
    appropriate
  • Partial coverage models need to be modified
  • Example: analysis of DEC and Tandem systems
    indicates that 10-30% of reported problems
    involve correlated failures

40
Correlated Failures (cont.)
41
Correlated Failures (cont.)
42
Failure Data Analysis of a LAN of Windows NT
Computers
43
Data Used
  • Failures found in a network of about 70 Windows
    NT based mail servers (running Microsoft Exchange
    software).
  • Event logs collected over a six-month period from
    the mail routing network of a commercial
    organization.
  • Analysis of machine reboots
  • a major portion of all logged failure data and
  • the most severe type of failure.

44
Classification of Data Collected from a LAN of
Windows NT-based Servers
  • The breakup of system reboot data is based on the
    events that preceded the current reboot by no
    more than one hour
  • The reboot is categorized based on the source and
    the id of the most frequently occurring events
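That categorization rule can be expressed directly. The sketch below uses an invented event-tuple format; the one-hour window and the most-frequent (source, id) rule are taken from the text:

```python
from collections import Counter
from datetime import datetime, timedelta

def categorize_reboot(events, reboot_time):
    """Label a reboot with the most frequent (source, id) pair among
    events in the hour preceding it (event format is illustrative)."""
    window = [(src, eid) for t, src, eid in events
              if timedelta(0) <= reboot_time - t <= timedelta(hours=1)]
    if not window:
        return "unknown"
    (src, eid), _count = Counter(window).most_common(1)[0]
    return f"{src}/{eid}"

t0 = datetime(2001, 3, 1, 12, 0)
log = [(t0 - timedelta(minutes=50), "netlogon", 5719),
       (t0 - timedelta(minutes=20), "netlogon", 5719),
       (t0 - timedelta(minutes=5),  "disk",     7),
       (t0 - timedelta(hours=3),    "app",      1000)]  # outside window
print(categorize_reboot(log, t0))  # netlogon/5719
```

Reboots with no preceding events in the window end up uncategorized, matching the "cannot be categorized" bucket in the results that follow.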

45
Classification of Data Collected from a LAN of
Windows NT-based Servers (cont.)
  • 29% of the reboots cannot be categorized
  • A significant percentage (22%) of the reboots
    have reported connectivity problems.
  • an indication of possible error propagation
  • Only a small percentage (10%) of the reboots can
    be traced to a system hardware component. Most
    of the identifiable problems are software
    related.
  • Nearly 50% of the reboots are abnormal reboots
    (i.e., the reboots were due to a problem with the
    machine rather than due to a normal shutdown).
  • In nearly 15% of the cases, server problems with
    a crucial mail server application force a reboot
    of the machine.

46
Machine Uptime Downtime Statistics
  • 50% of the downtimes last about 12 minutes.
  • Too short a period to replace hardware
    components and reconfigure the machine.
  • The majority of the problems are software related
    (memory leaks, misloaded drivers,
    application errors, etc.).

47
Availability
  • Availability from the system perspective
  • Availability from the application/user
    perspective
  • A typical machine provides acceptable service
    only about 92% of the time, on average.

48
Modeling Machine Behavior: Machine States
49
Modeling Machine Behavior: State Transitions of a
Typical Machine
  • 92% of all transitions are into the Functional
    state
  • this figure is a measure of the average
    availability of a typical machine, i.e., the
    ability of the machine to provide service, not
    just to stay alive.
  • Only about 40% of the transitions out of the
    Reboot state are to the Functional state.
  • More than half of the transitions out of the
    Startup problems state are to the Connectivity
    problems state.
  • More than 50% of the transitions out of the Disk
    problems state are to the Functional state.

50
Modeling Domain Behavior
  • Nearly 77% (excluding self-loops) of transitions
    from the F state are to the BDC state.
  • Transitions from the F state to the MBDC state
    indicate correlated failures and recovery among
    BDCs.
  • The majority of transitions from state PDC are to
    state F.
  • most of the problems with the PDC are not
    propagated to the BDCs; the PDC recovers before
    any such propagation takes effect on the BDCs
  • problems on the PDC do not bring the machine
    down.

51
Error Propagation: Test Results Example
  • Using event-specific tests to implement automated
    detection of error propagation
  • Most of the identifiable problems are local
    machine related.
  • Good news:
  • Error propagation of failures is not observed on
    a regular basis.
  • Bad news:
  • in a number of cases the tests were not able to
    classify the event
  • it is quite possible that some of these unknowns
    represent propagated failures.

52
Lessons Learned
  • Most of the problems that lead to reboots are
    software related. Only 10% are attributable to
    specific hardware components.
  • Connectivity problems contribute the most
    reboots. A significant percentage of these
    problems are persistent.
  • Rebooting the machine does not appear to solve
    the problem in many cases.
  • Average availability evaluates to over 99%;
    however, a typical machine in the domain, on
    average, provides acceptable service only about
    92% of the time.
  • There are indications of propagated or correlated
    failures.

53
Lessons Learned: Insight Into the Logging
Mechanism
  • The presence of a Windows NT shutdown event will
    improve the accuracy in identifying the causes of
    reboots. It will also lead to better estimates of
    machine availability.
  • Improved event logging by the lower-level system
    components (protocol drivers, memory managers)
    can significantly enhance the value of event logs
    in diagnosis.
  • The Primary Domain Controller logs error events
    in bursts
  • periodic logging of a healthy event by the PDC
    would help to increase our understanding of PDC
    behavior

54
Concluding Remarks: System Evaluation/Validation

[Figure: system evaluation/validation across the life cycle. The
design phase uses models (analytical, simulation) and formal methods;
the prototype phase uses fault injection (HW-implemented and
SW-implemented); the operational phase uses analysis of field failure
data. Results feed back: fault injection yields coverage and error
latency, field-failure analysis yields failure rates and fault
models, and both drive corrections of assumptions in the models.]
55
Concluding Remarks (cont.)
  • Design/Simulation Phase
  • Fault tolerance issues
  • need well established system level fault models
  • impact of software faults
  • effect of failures on robustness and system
    integrity
  • Simulation issues
  • simulation time explosion
  • validation of the simulation methodology
  • Prototype (Fault Injection) Phase
  • Fault models and their validity
  • hardware
  • - permanent
  • - transient
  • software
  • - errors
  • - faults/defects
  • Comparison (validation) of various fault
    injection tools
  • claims, portability, coverage
  • Operational Measurement Phase
  • What to measure
  • When to measure
  • From case studies to fundamental results
  • Isolation of machine-specific vs. general system
    software dependability characteristics
  • On-line diagnosis
  • Prediction of impact of configuration,
    technology and workload changes based on
    field measurements