1
Fail-Stop Processors
UNIVERSITY of WISCONSIN-MADISON
Computer Sciences Department
CS 739: Distributed Systems
Andrea C. Arpaci-Dusseau
  • "Byzantine Generals in Action: Implementing
    Fail-Stop Processors", Fred Schneider, TOCS, May
    1984
  • Example usage of Byzantine agreement
  • Why fail-stop processors can simplify
    replicated services
  • Why fail-stop processors are expensive
    (impractical?) to build
  • Remaining time: Byzantine Werewolves (improved?)

2
Motivation
  • Goal: Build systems that continue to work in
    the presence of component failure
  • Difficulty/cost of building those systems depends
    upon how components can fail
  • Fail-stop components make building reliable
    systems easier than components with Byzantine
    failures

3
Fail-Stop Processors
  • What is a failure?
  • Output (or behavior) that is inconsistent with
    specification
  • What is a Byzantine failure?
  • Arbitrary, even malicious, behavior
  • Components may collude with each other
  • Cannot necessarily detect that output is faulty
  • What is a fail-stop processor?
  • Halts instead of performing erroneous
    transformations
  • Others can detect halted state
  • Others can access uncorrupted stable storage even
    after failure

4
Questions to Answer
  • 1) What are the advantages of fail-stop processors?
  • 2) Real processors are not fail-stop
  • Can we build one?
  • How can we build an approximation of one?
  • 3) Approximations of fail-stop processors are
    expensive to build
  • Under what circumstances is replicated service
    with fail-stop processors better?

5
1) Distributed State Machine
  • Common approach for building a reliable system
  • Idea: Replicate faulty servers, coordinate client
    interactions with replicas

[Figure: client, input sequence, Byzantine agreement, state machine replicas (R, R, R), combine outputs, output]
  • t-fault tolerant: Satisfies specification as long
    as no more than t components fail
  • Failure model of components determines how many
    replicas, R, are needed and their interactions
6
How to build t-fault tolerant state machine?
  • Inputs
  • Key: All replicas receive and process same
    sequence of inputs
  • 1) Agreement: Every nonfaulty replica receives
    same request (interactive consistency or
    Byzantine agreement)
  • 2) Ordering: Every nonfaulty replica processes
    requests in same order (logical clocks)
  • Outputs

                      Byzantine   Fail-Stop
Combine output?       majority    any
Number of replicas?   2t+1        t+1
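
A minimal sketch of the table above, in Python. The names `replicas_needed` and `combine_outputs` are illustrative and do not come from the slides or the paper; `None` is an assumed marker for a replica that has halted.

```python
from collections import Counter

def replicas_needed(t, failure_model):
    """Replicas for a t-fault-tolerant state machine (per the table above)."""
    if failure_model == "byzantine":
        return 2 * t + 1
    if failure_model == "fail-stop":
        return t + 1
    raise ValueError(f"unknown failure model: {failure_model}")

def combine_outputs(outputs, failure_model):
    """Combine replica outputs; None marks a halted (fail-stop) replica."""
    if failure_model == "byzantine":
        # A faulty replica may answer anything, so only a majority is trusted.
        value, count = Counter(outputs).most_common(1)[0]
        if count <= len(outputs) // 2:
            raise RuntimeError("no majority: more than t replicas are faulty")
        return value
    # Fail-stop: a replica either halts or answers correctly, so any answer works.
    for out in outputs:
        if out is not None:
            return out
    raise RuntimeError("all replicas halted")
```

For t = 1, this gives 3 Byzantine replicas whose outputs are voted on, versus 2 fail-stop replicas where the first available output is used.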
7
2) Building a Fail-Stop Processor
  • Must provide stable storage
  • Volatile: Lost on failure
  • Stable
  • Not affected (lost or corrupted) by failure
  • Can be read by any processor
  • Benefit: Recover work of failed process
  • Drawback: Slow, so minimize interactions
  • Can only build approximation of fail-stop
    processor
  • Finite hardware -> a finite number of failures
    could disable all error detection hardware
  • k-fail-stop processor: behaves fail-stop unless
    k+1 or more failures occur

8
Implementation of k-FSP: Overview
  • Two components
  • k+1 p-processes (program)
  • 2k+1 s-processes (storage)
  • Each process runs on own processor, all connected
    with network
  • P-Processes (k+1)
  • Each runs program for state machine
  • Interacts with s-processes to read and write data
  • If any fail (if any disagreement), then all STOP
  • Cannot necessarily detect k+1 failures
  • S-Processes (2k+1)
  • Each replicates contents of stable storage for
    this FSP
  • Provides reliable data with k failures (cannot
    just stop)
  • Detects disagreements/failures across p-processes
  • How???
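
The process counts on this slide can be written down as a small helper. This is only a sketch: the function name `kfsp_layout` and the process names `p0`, `s0`, ... are made up for illustration.

```python
def kfsp_layout(k):
    """Return the process names of one k-fail-stop processor (k-FSP)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    p_processes = [f"p{i}" for i in range(k + 1)]      # run the program
    s_processes = [f"s{i}" for i in range(2 * k + 1)]  # replicate stable storage
    return p_processes, s_processes

p, s = kfsp_layout(2)
print(len(p), len(s), len(p) + len(s))  # 3 5 8
```

With one process per processor, a single k-FSP therefore occupies (k+1) + (2k+1) = 3k+2 processors.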

9
Interactive Consistency Requirements
  • IC1. If p-process is nonfaulty, then every
    nonfaulty s-process receives its request within δ
    seconds (as measured on s-process clock)
  • IC2. Nonfaulty s-processes in same k-FSP agree
    on every request from p-process j
  • S-processes must agree even when p-process is
    faulty
  • To provide IC1 and IC2
  • Assuming sender of messages can be authenticated,
    use signed message (SM) protocol for Byzantine
    agreement
  • Need just k+1 processes for agreement
  • IC3. For each k-FSP, clocks of all p-processes
    are synchronized
  • All nonfaulty p-processes must send requests at
    same time to s-processes

10
FSP Algorithm Details: Writes
  • Each p-process, on a write
  • Broadcast write to all s-processes
  • Byzantine agreement across all s-processes (all
    s-processes must agree on same input value from
    particular p-process)
  • Each s-process, on a write (Fig 1)
  • Ensure each p-process writes same value and
    request is received within time bound
  • Initial code: Handle messages after at least time
    δ has transpired since receipt (every s-process
    should have received them by then)
  • If write request received from all k+1
    p-processes (|M| = k+1), then update value in
    stable storage
  • If not, then halt all p-processes
  • Set failed variable to true
  • Do not allow future writes
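
The s-process side of a write can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the code from Fig 1 of the paper: the class `SProcess`, its method `handle_writes`, and the shape of `requests` are assumptions, and Byzantine agreement and the time-bound wait are assumed to have already happened before the method is called.

```python
from dataclasses import dataclass, field

@dataclass
class SProcess:
    """One storage process of a k-FSP (simplified sketch)."""
    k: int
    store: dict = field(default_factory=dict)
    failed: bool = False

    def handle_writes(self, requests):
        """Process the write requests agreed on for one step.

        `requests` maps p-process id -> (variable, value). It is assumed to be
        the outcome of Byzantine agreement, examined after the time bound.
        Returns True if the write was applied.
        """
        if self.failed:
            return False  # no writes are accepted after a failure
        distinct = set(requests.values())
        if len(requests) == self.k + 1 and len(distinct) == 1:
            variable, value = distinct.pop()
            self.store[variable] = value
            return True
        # A request is missing or the requests disagree: a p-process is faulty.
        self.failed = True
        return False
```

For k = 2, three identical requests such as `{0: ("b", 7), 1: ("b", 7), 2: ("b", 7)}` update `b`; a missing or differing request sets `failed`, and every later write is refused.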

11
FSP Algorithm Details: Reads
  • Each p-process, on a read
  • Broadcast request to all s-processes
  • Use result from majority (k+1 out of 2k+1)
  • Can read from other FSPs as well
  • Useful if FSP failed and re-balancing work
  • Each p-process, determine if halted/failed
  • Read failed variable from s-process (use
    majority)
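
A minimal sketch of the majority read described above. The function name `majority_read` is illustrative; `responses` is assumed to hold exactly one reply per s-process, already collected by the p-process.

```python
from collections import Counter

def majority_read(responses, k):
    """Return the value reported by at least k+1 of the 2k+1 s-processes.

    With at most k faulty s-processes, the k+1 correct ones always
    form a majority, so their value wins.
    """
    if len(responses) != 2 * k + 1:
        raise ValueError("expected one response from each of the 2k+1 s-processes")
    value, count = Counter(responses).most_common(1)[0]
    if count < k + 1:
        raise RuntimeError("no majority: more than k s-processes are faulty")
    return value
```

The same helper works for reading the failed variable. Note that if k+1 or more s-processes report the same wrong value, the majority itself is wrong and this check cannot notice it.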

12
FSP Example
  • k = 2, SM code: b = a + 1. How many p- and
    s-processes?

[Figure: p-processes and s-processes; stable storage holds a = 6, b, failed = 0]
  • How do p-processes read a?
  • 1) Broadcast request to each s-process
  • 2) Each s-process responds to read request
  • 3) Each p-process uses majority of responses
    from s-processes

13
FSP Example
  • k = 2, SM code: b = a + 1

[Figure: p-processes and s-processes; stable storage holds a, b, failed]
  • How do p-processes read a?
  • What if 2 s-processes fail?
  • E.g., think a = 5?
  • What if 3 s-processes fail?

14
FSP Example
  • k = 2, SM code: b = a + 1

[Figure: p-processes and s-processes; stable storage holds a, b, failed]
  • How do p-processes write b?
  • Each p-process j performs Byzantine agreement
    using signed message protocol SM(2) across
    s-processes
  • Each s-process must agree on what p-process j is
    doing, even if j is faulty
  • Each s-process looks at requests after time δ
    has elapsed
  • If it sees same write from all k+1 p-processes,
    perform write
  • Otherwise, halt all p-processes; forbid future
    writes

15
FSP Example
  • k = 2, SM code: b = a + 1

[Figure: p-processes and s-processes; stable storage holds a, b, failed]
  • How do p-processes write b?
  • What if 1 p-process (or network) is very slow?
  • What if 1 p-process gives incorrect request to
    all s-processes?
  • What if 1 p-process gives incorrect request to
    some?
  • Byzantine agreement catches this: all s-processes
    agree that p-process is faulty (giving different
    requests) and agree to treat it similarly
  • When s-processes see that it doesn't agree with
    other p-processes, they will halt the p-processes
  • What if 3 p-processes give bad result?

16
3) Higher-Level Example
  • Goal: Service handling k faults; N nodes for
    performance
  • Solution: Use N+k k-fail-stop processors
  • Example: N = 2, k = 3
  • What happens if:
  • 3 p-processes in FSP0 fail? 4 p-processes in FSP0
    fail?
  • 1 p-process in FSP0, FSP1, and FSP2 fail? also in
    FSP3?
  • 2 p-processes in FSP0, FSP1, and FSP2 fail?
  • 1 s-process in SS0 fails? also in SS1, SS2, and
    SS3?
  • 4 s-processes in SS0 fail?

17
Should we use Fail Stop Processors?
  • Metric: Hardware cost for state machines
  • Fail-stop components
  • Worst-case (assuming 1 process per processor)
  • (N+k) × ((2k+1) + (k+1)) = (N+k)(3k+2) processors
  • Best-case (assuming s-processes from different
    FSPs share same processor)
  • (N+k)(k+1) + (2k+1) processors
  • Byzantine components
  • N(2k+1) processors
  • Fail-stop can be better if s-processes share and
    N > k
  • Metric: Frequency of Byzantine agreement protocol
  • Fail-Stop: On every access to stable storage
  • Byzantine: On every input read
  • Probably fewer input reads
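
The hardware-cost metric on this slide can be checked with a few lines of arithmetic. The sketch below assumes the slide's formulas read as (N+k)((2k+1) + (k+1)) for the fail-stop worst case, (N+k)(k+1) + (2k+1) for the fail-stop best case, and N(2k+1) for Byzantine components; the function names are made up for illustration.

```python
def fsp_processors_worst(n, k):
    """N+k k-FSPs, each with its own 2k+1 s-processes and k+1 p-processes."""
    return (n + k) * ((2 * k + 1) + (k + 1))

def fsp_processors_best(n, k):
    """N+k k-FSPs with k+1 p-processes each, sharing one set of 2k+1 storage processors."""
    return (n + k) * (k + 1) + (2 * k + 1)

def byzantine_processors(n, k):
    """N state machines, each replicated 2k+1 times."""
    return n * (2 * k + 1)

for n, k in [(2, 3), (10, 1)]:
    print(n, k, fsp_processors_worst(n, k),
          fsp_processors_best(n, k), byzantine_processors(n, k))
# 2 3 55 27 14
# 10 1 55 25 30
```

Under these formulas the shared-storage fail-stop design only wins once N is comfortably larger than k (25 versus 30 processors for N = 10, k = 1), while for the slide's example of N = 2, k = 3 the Byzantine design needs fewer processors.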

18
Summary
  • Why build fail-stop components?
  • Easier for higher layers to model and deal with
  • Matches assumptions of many distributed protocols
  • Why not?
  • Usually more hardware
  • Usually more agreements needed
  • Higher levels may be able to cope with slightly
    faulty components
  • Violates end-to-end argument
  • Conclusion: Probably shouldn't assume fail-stop
    components

19
Byzantine Werewolves
  • Previous: Too easy for villagers to identify
    werewolves
  • Villager A had reliable information that Z was
    werewolf
  • Villager B could validate that A was villager
  • Hard for Z to lie that C was werewolf, because D
    could have checked C too
  • Signed Protocol: Many could hear what one said
  • Difficult for werewolves to tell different lies
    to others
  • Have to tell everyone same thing
  • New: Changes to give more advantage to werewolves
  • Unknown number of werewolves (1 < w < N/2)
  • Night: Werewolves convert multiple villagers to
    wolves (1 < v < w)
  • Key: Info told by moderator will then be stale
    and wrong!
  • Day: Villagers can vote to lynch multiple victims

20
Byzantine-Werewolf Game Rules
  • Everyone secretly assigned as werewolf or
    villager
  • W werewolves, rest are seeing villagers
  • I am moderator
  • Night round (changed order)
  • "Close your eyes"; make noises with one hand to
    hide activity
  • For all: "NAME, open your eyes"; "Pick someone to
    ask about"
  • Useless for werewolves, but hides their identity
  • Point to another player
  • Moderator signs thumbs up for werewolf, down for
    villager
  • "NAME, close your eyes"
  • "Werewolves, open your eyes"; W can see who is
    who
  • "Werewolves, pick villagers to convert"
  • Moderator picks secret number between 1 and W
  • Silently agree on villagers by pointing
  • Moderator taps converts on shoulder; they should
    open eyes to see other werewolves
  • "Werewolves, close your eyes"

21
Rules: Day Time
  • Day Time: "Everyone open your eyes, it's daytime"
  • Agreement time: Everyone talks and votes on who
    should be decommissioned
  • Villagers try to decommission werewolves
  • Werewolves try to trick villagers with bad info
  • Someone must propose who should be killed
  • Vote until kill villager or no more proposals or
    no majority
  • Werewolves really spread at night, so large
    incentive to kill as many as possible now
  • Moderator: Uses majority voting to determine who
    is decommissioned; "Okay, NAME is dead"
  • Person is out of game (can't talk anymore) and
    shows card
  • Repeat cycle until: All werewolves dead OR
    werewolves > villagers