Lampson Sturgis Fault Model - PowerPoint PPT Presentation

Slides: 29
Provided by: ResearchM53
1
Lampson Sturgis Fault Model
  • Jim Gray
  • Microsoft, Gray _at_ Microsoft.com
  • Andreas Reuter
  • International University, Andreas.Reuter_at_i-u.de


        Mon        Tue           Wed          Thur             Fri
 9:00   Overview   TP mons       Log          Files & Buffers  B-tree
11:00   Faults     Lock Theory   ResMgr       COM              Access Paths
 1:30   Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
 3:30   T Models   Queues        Adv TM       Replication      Benchmark
 7:00   Party      Workflow      Cyberbrick   Party
2
Rationale: Fault Tolerance Needs a Fault Model.
What do you tolerate?
  • Fault tolerance needs a fault model.
  • Model needs to be simple enough to understand.
  • With a model,
  • can design hardware/software to tolerate the
    faults.
  • can make statements about the system behavior.

3
Byzantine Fault Model
  • Some modules are fault free (during the period of
    interest). Other modules may fail (in the worst
    way). Make statements about the behavior of the
    fault-free modules.
  • Synchronous: all operations happen within a time
    limit.
  • Asynchronous: no time limit on anything, no
    lost messages.
  • Timed (used here): notion of timeout and retry.
  • Key result: N modules can tolerate fewer than N/3
    faulty modules (N >= 3f + 1 for f faults).
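The key result can be turned into a small calculation. The sketch below assumes the classic bound N >= 3f + 1 (agreement without signed messages); the function names are hypothetical and do not come from the slides.

```c
/* Largest number of faulty modules f that n modules can tolerate,
 * assuming the classic Byzantine bound n >= 3f + 1. */
int max_byzantine_faults(int n)
{
    if (n < 1) return 0;
    return (n - 1) / 3;
}

/* Smallest number of modules needed to tolerate f faulty ones
 * under the same assumption. */
int min_modules_for_faults(int f)
{
    return 3 * f + 1;
}
```

Under that assumption, three modules tolerate no faulty module at all, and four are needed to tolerate one.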

4
Lampson Sturgis Model
  • Processes
    Correct: execute a program at a finite rate.
    Fault: reset to null state and "stop" for a finite time.
  • Messages
    Correct: eventually arrives and is correct.
    Fault: lost, duplicated, or corrupted.
  • Storage
    Correct: Read(x) returns the most recent value of x.
    Write(x, v) sets the value of x to v.
    Fault: all pages reset to null; a page resets to null;
    Read or Write operates on the wrong page.
  • Other faults (called disasters) are not dealt with.
  • Assumption: disasters are rare.

5
Byzantine vs. Lampson-Sturgis Fault Models
  • Connections unclear.
  • Byzantine focuses on bounded-time bounded-faults
    (real-time systems)
  • asynchronous (mostly) or
  • synchronous (real time)
  • Lampson/Sturgis focuses on long-term behavior
  • no time or fault limits
  • time and timeout heavily used to detect faults

6
Roadmap of What's Coming
  • Lampson-Sturgis Fault Model
  • Building highly available processes,
    messages, storage from faulty components.
  • Process pairs give quick repair
  • Kinds of process pairs
  • Checkpoint / Restart based on storage
  • Checkpoint / Restart based on messages
  • Restart based on transactions (easy to program).

7
Model of Storage and its Faults
  • System has several stores (discs).
  • Each has a set of pages.
  • Stores fail independently.
  • Probability that a write has no effect: 1 in a million.
  • Mean time to a page failure: a few days.
  • Mean time to a disc failure: a few years.
  • A wild read/write is modeled as a page failure.

[Figure: a store has a status and a set of pages; each page has a status and a value. The operations are store_write(store, address, value) and store_read(store, address, value).]
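The store model can be sketched in a few lines of C. This is a minimal illustration, not the authors' code: the sizes, the type names `page_t` and `store_t`, and `store_init` are assumptions, and the one-in-a-million null write is left out so the behavior stays deterministic.

```c
#include <stdbool.h>
#include <string.h>

#define MAXSTORE 8    /* pages per store (hypothetical, tiny for illustration) */
#define VSIZE    16   /* bytes per page value (hypothetical) */

typedef struct { bool status; char value[VSIZE]; } page_t;      /* a page  */
typedef struct { bool status; page_t page[MAXSTORE]; } store_t; /* a store */

/* Bring a store to a healthy state with all pages valid and zeroed. */
void store_init(store_t *s)
{
    memset(s, 0, sizeof *s);
    s->status = true;
    for (unsigned i = 0; i < MAXSTORE; i++) s->page[i].status = true;
}

/* Write VSIZE bytes to a page; fails on a bad address or a broken store.
 * A successful write makes the page valid again. */
bool store_write(store_t *s, unsigned addr, const char *value)
{
    if (addr >= MAXSTORE || !s->status) return false;
    memcpy(s->page[addr].value, value, VSIZE);
    s->page[addr].status = true;
    return true;
}

/* Read VSIZE bytes from a page; fails if the address, the store,
 * or the page itself is bad. */
bool store_read(const store_t *s, unsigned addr, char *value)
{
    if (addr >= MAXSTORE || !s->status || !s->page[addr].status) return false;
    memcpy(value, s->page[addr].value, VSIZE);
    return true;
}
```

A decayed page is modeled by clearing its status flag, after which `store_read` reports failure until the page is rewritten.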
8
Storage Decay (the demon)
/* There is one store_decay process for each store in the system */
#define mttvf 7E5   /* mean time (sec) to a page fail, a few days */
#define mttsf 1E8   /* mean time (sec) to disc fail, a few years */
void store_decay(astore store)
{
    Ulong addr;                                          /* the random places that will decay */
    Ulong page_fail  = time() + mttvf * randf();         /* time to next page decay */
    Ulong store_fail = time() + mttsf * randf();         /* time to next store decay */
    while (TRUE) {                                       /* repeat this loop forever */
        wait(min(page_fail, store_fail) - time());       /* wait for next event */
        if (time() >= page_fail) {                       /* if the event is a page decay */
            addr = randf() * MAXSTORE;                   /* pick a random address */
            store.page[addr].status = FALSE;             /* set it invalid */
            page_fail = time() - log(randf()) * mttvf;   /* pick next fault time */
        }                                                /* negative exp distributed, mean mttvf */
        if (time() >= store_fail) {                      /* if the event is a storage fault */
            store.status = FALSE;                        /* mark the store as broken */
            for (addr = 0; addr < MAXSTORE; addr++)      /* invalidate all pages */
                store.page[addr].status = FALSE;
            store_fail = time() - log(randf()) * mttsf;  /* pick next fault time */
        }
    }
}
Simulates (specifies) system behavior.
9
Reliable Write: write all members of an N-plex set.
#define nplex 2   /* code works for n >= 2, but do duplex */
Boolean reliable_write(Ulong group, address addr, avalue value)
{
    Ulong i;                      /* index on elements of store group */
    Boolean status = FALSE;       /* true if any write worked */
                                  /* each group uses nplex stores */
    for (i = 0; i < nplex; i++)   /* write each store in the group */
        status = store_write(stores[group*nplex + i], addr, value)
                 || status;       /* write first, so || cannot skip a store */
                                  /* loop to write all stores of group */
    return status;                /* return indicates if ANY write worked */
}
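The n-plex write can be tried out against a small in-memory model. The sketch below is an illustration under stated assumptions: the store layout, the sizes, and `stores_init` are hypothetical, and `store_write` here never performs a null write. It stores the result of each write before combining it, so that C's short-circuit `||` cannot skip the remaining stores once one write has succeeded.

```c
#include <stdbool.h>
#include <string.h>

#define NPLEX   2   /* duplex, as on the slide */
#define NSTORES 4   /* hypothetical: two duplex groups */
#define NPAGES  4   /* hypothetical: pages per store */
#define VSIZE   8   /* hypothetical: bytes per page value */

typedef struct { bool ok; char value[VSIZE]; } page_t;
typedef struct { bool ok; page_t page[NPAGES]; } store_t;

store_t stores[NSTORES];

/* Mark every store and page healthy and empty. */
void stores_init(void)
{
    memset(stores, 0, sizeof stores);
    for (int s = 0; s < NSTORES; s++) {
        stores[s].ok = true;
        for (int p = 0; p < NPAGES; p++) stores[s].page[p].ok = true;
    }
}

/* Write a string value to one page of one store. */
bool store_write(store_t *s, unsigned addr, const char *v)
{
    if (!s->ok || addr >= NPAGES) return false;
    strncpy(s->page[addr].value, v, VSIZE - 1);
    s->page[addr].value[VSIZE - 1] = '\0';
    s->page[addr].ok = true;
    return true;
}

/* Write all members of the group; true if ANY write worked. */
bool reliable_write(unsigned group, unsigned addr, const char *v)
{
    bool status = false;
    if (group >= NSTORES / NPLEX) return false;
    for (unsigned i = 0; i < NPLEX; i++) {
        bool ok = store_write(&stores[group * NPLEX + i], addr, v);
        status = status || ok;   /* every store is attempted */
    }
    return status;
}
```

With one store of a group broken, the write still succeeds through the surviving member; it fails only when every member of the group is broken.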

10
Reliable Read: read all members of an N-plex set.
Problems:
  • All fail: disaster.
  • Ambiguity (N different answers): take the majority, or take the "newest".
Ulong version(avalue);   /* returns version of a value */
/* read an n-plex group to find the most recent version of a page */
Boolean reliable_read(Ulong group, address addr, avalue value)
{
    Ulong i = 0;               /* index on store group */
    Boolean gotone = FALSE;    /* flag says had a good read */
    Boolean bad = FALSE;       /* bad says group needs repair */
    avalue next;               /* next value that is read */
    Boolean status;            /* read ok */
    for (i = 0; i < nplex; i++) {                      /* for each page in the nplex set */
        status = store_read(stores[group*nplex + i], addr, next);  /* read value */
        if (!status) bad = TRUE;                       /* if status bad, ignore value */
        else if (!gotone) {                            /* have a good read; if it is first good value */
            copy(value, next, VSIZE); gotone = TRUE;   /* make it best value */
        } else if (version(next) != version(value)) {  /* if new val, compare */
            bad = TRUE;                                /* if different, repair needed */
            if (version(next) > version(value))        /* if new is best version */
                copy(value, next, VSIZE);              /* copy it to best value */
        }
    }                                                  /* end of read all copies */
    if (gotone && bad)
        reliable_write(group, addr, value);            /* on bad read, rewrite with best value */
    return gotone;                                     /* true if any copy could be read */
}
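The "take newest, then rewrite" rule can be illustrated on a simplified group in which each copy carries an explicit version number. This is a sketch under assumptions: `copy_t`, its fields, and the array-based group are hypothetical stand-ins for the pages of an n-plex store group.

```c
#include <stdbool.h>

#define NPLEX 2

/* One copy of a page: readable flag, version number, payload. */
typedef struct { bool ok; unsigned version; int data; } copy_t;

/* Read every copy, keep the newest readable one in *out, and rewrite all
 * copies with it if any copy was unreadable or stale.
 * Returns false if no copy is readable (a disaster in the slide's terms). */
bool reliable_read(copy_t copies[NPLEX], copy_t *out)
{
    bool gotone = false;   /* had at least one good read */
    bool bad = false;      /* group needs repair */
    for (int i = 0; i < NPLEX; i++) {
        if (!copies[i].ok) { bad = true; continue; }
        if (!gotone) {
            *out = copies[i];
            gotone = true;
        } else if (copies[i].version != out->version) {
            bad = true;
            if (copies[i].version > out->version) *out = copies[i];
        }
    }
    if (gotone && bad)
        for (int i = 0; i < NPLEX; i++) copies[i] = *out;   /* repair */
    return gotone;
}
```

After a successful call every copy in the group holds the newest value, which is why the repair process on the next slide can simply call the reliable read on each page.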
11
Background Store Repair Process
/* repair the broken pages in an n-plex group. */
/* Group is in 0,...,(MAXSTORES/nplex)-1 */
void store_repair(Ulong group)
{
    int i;                                   /* next address to be repaired */
    avalue value;                            /* buffer holds value to be read */
    while (TRUE)                             /* do forever */
        for (i = 0; i < MAXSTORE; i++) {     /* for each page in the store */
            wait(1);                         /* wait a second */
            reliable_read(group, i, value);  /* a reliable read repairs the page */
        }                                    /* if the copies do not match */
}
  • Needed to minimize chances of N-failures.
  • Repair is important.

12
Optimistic Reads
  • Most implementations do optimistic reads: read only one value.
Boolean optimistic_read(Ulong group, address addr, avalue value)
{
    if (group >= MAXSTORES/nplex) return FALSE;        /* return false if bad addr */
    if (store_read(stores[nplex*group], addr, value))  /* read one value */
        return TRUE;                                   /* and if that is ok return it as the true value */
    else                                               /* if reading one value returned bad then */
        return reliable_read(group, addr, value);      /* n-plex read repair */
}
  • This is dangerous (especially without repair).
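One way to see the danger: if a write silently had no effect on the first copy, that copy still reads as valid but holds an old version, and an optimistic read returns it without ever consulting the second copy. The sketch below illustrates this on a simplified group; `copy_t` and `read_newest` are hypothetical, and `read_newest` stands in for the full reliable read without its repair step.

```c
#include <stdbool.h>

#define NPLEX 2

/* One copy of a page: readable flag, version number, payload. */
typedef struct { bool ok; unsigned version; int data; } copy_t;

/* Read every copy and return the newest readable one (no repair here). */
bool read_newest(const copy_t copies[NPLEX], copy_t *out)
{
    bool gotone = false;
    for (int i = 0; i < NPLEX; i++) {
        if (!copies[i].ok) continue;
        if (!gotone || copies[i].version > out->version) {
            *out = copies[i];
            gotone = true;
        }
    }
    return gotone;
}

/* Read only the first copy; fall back to the full read if it is bad. */
bool optimistic_read(const copy_t copies[NPLEX], copy_t *out)
{
    if (copies[0].ok) { *out = copies[0]; return true; }
    return read_newest(copies, out);
}
```

With copies at versions 1 and 2, `optimistic_read` reports success with the version 1 value, while `read_newest` returns version 2. A background repair process narrows the window in which this can happen.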

13
Storage Fault Summary
  • Simple fault model.
  • Allows discussion/specification of fault
    tolerance.
  • Uncovers some problems in many implementations
  • Ambiguous reads
  • Repair process.
  • Optimistic reads.

14
Process Fault Model
  • Process executes a program and has state.
  • Program causes state change plus send/get
    message.
  • Process fails by stopping (for a while) and then
    resetting its data and message state.

[Figure: a sender process (program and data) sends a new message to a receiver process (program and data). The message joins the receiver's queue of input messages; each message has a status, a value, and a next pointer.]
15
Process Fault Model: The Break/Fix loop
#define MAXPROCESS MANY       /* the system will have many processes */
typedef Ulong processid;      /* process id is an integer index into array */
typedef struct {char program[MANY/2]; char data[MANY/2];} state;  /* program + data */
struct {
    state initial;            /* process initial state */
    state current;            /* value of the process state */
    amessagep messages;       /* queue of messages waiting for process */
} process[MAXPROCESS];
/* Process Decay: execute a process and occasionally inject faults into it */
#define mttpf 1E7   /* mean time to process failure, about 4 months */
#define mttpr 1E4   /* mean time to repair is 3 hours */
void process_execution(processid pid)
{
    Ulong proc_fail;          /* time of next process fault */
    Ulong proc_repair;        /* time to repair process */
    amessagep msg, next;      /* pointers to process messages */
    while (TRUE) {            /* global execution loop */
        proc_fail = time() - log(randf()) * mttpf;   /* the time of next fail */
        proc_repair = -log(randf()) * mttpr;         /* delay in next process repair */
        while (time() < proc_fail)
            execute(process[pid].current);           /* execute for about 4 months (work) */
        /* the process then stops for proc_repair and resets its data and message state */
    }
}
16
Checkpoint/Restart Process (Storage based)
/* A checkpoint-restart process server generating unique sequence numbers */
checkpoint_restart_process()
{
    Ulong disc = 0;                /* a reliable storage group with state */
    Ulong address[2] = {0, 1};     /* page address of two states on disc */
    Ulong old;                     /* index of the disc with the old state */
    struct { Ulong ticketno;       /* process reads its state from disc. */
             char filler[VSIZE];   /* newest state has max ticket number */
    } value[2];                    /* current state kept in value[0] */
    struct msg {                   /* buffer to hold input message */
        processid him;             /* contains requesting process id */
        char filler[VSIZE];        /* reply (ticket num) sent to process */
    } msg;
    /* Restart logic: recover ticket number from persistent storage */
    for (old = 0; old <= 1; old++)                            /* read the two states from disc */
        if (!reliable_read(disc, address[old], value[old]))   /* if read fails */
            panic();                                          /* then failfast */
    if (value[1].ticketno < value[0].ticketno) old = 1;       /* pick max seq no */
    else { old = 0; copy(value[0], value[1], VSIZE); }        /* which is old val */
    /* Processing logic: generate next number, checkpoint, and reply */
}
17
Process Pairs (message-based checkpoints)
[Figure: client processes send "Give me a ticket" requests to the primary server process and receive ticket numbers. The primary sends "I'm Alive" and state checkpoint messages to the backup server process. Each server holds the next ticket number.]
Problems and solutions:
  • Detect failure: "I'm Alive" message timeout (no "real" solution).
  • Continuation: checkpoint messages.
  • Startup: backup waits for primary.

18
Process Pairs (message-based checkpoints)
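[Flowchart: process-pair logic.
Restart: if I am the default primary, enter the primary loop; otherwise wait a second and enter the backup loop.
Backup loop: wait a second. If any input arrived in the last second, read it and, if it carries a newer state, set my state to the new state. If no input arrived, broadcast "I'm Primary", reply to the last request, and enter the primary loop.
Primary loop: if any input (requests) is waiting, read it, compute the new state, send the new state to the backup, and reply. Otherwise send the state to the backup as an "I'm alive" message.]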
  • Primary in tight loop sending "I'm alive" or
    state change messages to backup
  • Backup thinks primary dead if no messages in
    previous second.
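The backup's side of this protocol can be sketched as a function called once per second with whatever arrived from the primary. This is an illustration under assumptions: the types, the explicit checkpoint version counter, and the name `backup_tick` are hypothetical, and the broadcast and reply at takeover are left out.

```c
#include <stdbool.h>

typedef enum { BACKUP, PRIMARY } role_t;

/* One member of the pair. */
typedef struct {
    role_t role;
    unsigned long state;     /* e.g. the next ticket number */
    unsigned long version;   /* hypothetical: sequence number of last checkpoint */
} pair_member_t;

/* What arrived from the primary during the last second, if anything.
 * An "I'm alive" message is modeled as a checkpoint with an unchanged version. */
typedef struct { bool present; unsigned long state; unsigned long version; } checkpoint_t;

/* One pass of the backup loop. Silence for a whole second means the
 * primary is presumed dead and the backup takes over with its last state. */
role_t backup_tick(pair_member_t *me, checkpoint_t in)
{
    if (me->role == PRIMARY) return PRIMARY;   /* already took over */
    if (!in.present) {                         /* no input in last second */
        me->role = PRIMARY;
        return PRIMARY;
    }
    if (in.version > me->version) {            /* newer state? */
        me->state = in.state;                  /* set my state to new state */
        me->version = in.version;
    }
    return BACKUP;
}
```

The timeout is only a guess about the primary's health: a slow primary looks the same as a dead one, which is why the earlier slide says failure detection has no "real" solution.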

19
What We Have Done So Far
  • Converted "faulty" processes to reliable ones.
  • Tolerate hardware and some software faults
  • Can repair in seconds or milliseconds.
  • Unlike checkpoint/restart: no process
    creation/setup time, no client reconnect
    time.
  • Operating systems are beginning to provide
    process pairs.
  • Stateless process pairs can use transactional
    servers to
  • Store their state
  • Cleanup the mess at takeover.
  • Like storage-based checkpoint/restart
  • except process setup/connection is instant.

20
Persistent process pairs
persistent_process()           /* prototypical persistent process */
{
    wait_to_be_primary();      /* wait to be told you are primary */
    while (TRUE) {             /* when primary, do forever */
        begin_work();          /* start transaction or subtransaction */
        read_request();        /* read a request */
        doit();                /* perform the desired function */
        reply();               /* reply */
        commit_work();         /* finish transaction or subtransaction */
    }                          /* did a step, now get next request */
}

21
Persistent Process Pairs: The ticket server
redone as a transactional server.
/* A transactional persistent server process generating unique tickets */
persistent_ticket_server()          /* current state kept in sql database */
{
    int ticketno;                   /* next ticket # (from DB) */
    struct msg {                    /* buffer to hold input message */
        processid him;              /* contains requesting process id */
        char filler[VSIZE];         /* reply (ticket num) sent to that addr */
    } msg;
    /* Restart logic: recover ticket number from persistent storage */
    wait_to_be_primary();           /* wait to be told you are primary */
    /* Processing logic: generate next number, checkpoint, and reply */
    while (TRUE) {                  /* do forever */
        begin_work();               /* begin a transaction */
        while (!get_msg(msg)) {}    /* get next request for a ticket */
        exec sql update ticket      /* increment the next ticket number */
            set ticketno = ticketno + 1;
        exec sql select max(ticketno)   /* fetch current ticket number */
            into :ticketno              /* into program local variable */
            from ticket;                /* from SQL database */
        commit_work();              /* commit transaction */
    }
}
22
Messages Fault Model
  • Each process has a queue of incoming messages.
  • Messages can be
  • corrupted: a checksum detects it.
  • duplicated: a sequence number detects it.
  • delayed arbitrarily long (ack + retransmit).
  • lost (ack + retransmit + sequence number).
  • Techniques here give messages fail-fast
    semantics.

23
Message Verbs: SEND
/* send a message to a process: returns true if the process exists */
Boolean message_send(processid him, avalue value)
{
    amessagep it;       /* pointer to message created by this call */
    amessagep queue;    /* pointer to process message queue */
    if (him >= MAXPROCESS) return FALSE;              /* test for valid process */
loop:
    it = malloc(sizeof(amessage));                    /* allocate space to hold message */
    it->status = TRUE; it->next = NULL;               /* and fill in the fields */
    copy(it->value, value, VSIZE);                    /* copy msg data to message body */
    queue = process[him].messages;                    /* look at process message queue */
    if (queue == NULL) process[him].messages = it;    /* if empty, place this message at queue head */
    else {                                            /* else place the message at queue end */
        while (queue->next != NULL) queue = queue->next;
        queue->next = it;
    }
    if (randf() < pmf) it->status = FALSE;            /* sometimes message corrupted */
    if (randf() < pmd) goto loop;                     /* sometimes the message duplicated */
    return TRUE;
}

24
Message Verbs: GET
/* get the next input message of this process: returns true if a message */
Boolean message_get(avalue *valuep, Boolean *msg_status)
{
    processid me = MyPID();             /* caller's process number */
    amessagep it;                       /* pointer to input message */
    it = process[me].messages;          /* find caller's message input queue */
    if (it == NULL) return FALSE;       /* return false if queue is empty */
    process[me].messages = it->next;    /* take first message off the queue */
    *msg_status = it->status;           /* record its status */
    copy(valuep, it->value, VSIZE);     /* value = it->value */
    free(it);                           /* deallocate its space */
    return TRUE;                        /* return status to caller */
}

25
Sessions Make Messages FailFast
[Figure: a session carrying message 7 and its acknowledgment, "Ack 7".]
  • CRC makes a corrupt message look like a lost message.
  • Sequence numbers detect duplicates => lost
    message.
  • So, the only failure is a lost message.
  • Timeout/retransmit masks lost messages => the only
    failure is delay.
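The receiving end of such a session can be sketched as a filter that turns every fault into a dropped message. This is an illustration under assumptions: the packet layout, the names, and the toy additive checksum (standing in for a real CRC) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t seq;        /* sequence number within the session */
    uint32_t checksum;   /* covers seq and body */
    const char *body;
    size_t len;
} packet_t;

typedef struct { uint32_t next_seq; } session_t;   /* next number expected */

typedef enum { DELIVER, DROP_CORRUPT, DROP_DUPLICATE, DROP_GAP } verdict_t;

/* Toy additive checksum, a stand-in for a CRC. */
uint32_t simple_checksum(uint32_t seq, const char *body, size_t len)
{
    uint32_t sum = seq * 31u;
    for (size_t i = 0; i < len; i++) sum = sum * 31u + (unsigned char)body[i];
    return sum;
}

/* Decide what to do with an incoming packet. The caller acknowledges
 * p->seq on DELIVER and DROP_DUPLICATE, and stays silent otherwise so
 * that the sender's timeout triggers a retransmission. */
verdict_t session_receive(session_t *s, const packet_t *p)
{
    if (simple_checksum(p->seq, p->body, p->len) != p->checksum)
        return DROP_CORRUPT;                  /* corrupt looks like lost */
    if (p->seq < s->next_seq)
        return DROP_DUPLICATE;                /* already delivered */
    if (p->seq > s->next_seq)
        return DROP_GAP;                      /* an earlier message is missing */
    s->next_seq++;
    return DELIVER;
}
```

Only in-order, intact messages reach the application, so from its point of view the sole remaining failure is delay. Sequence number wraparound is ignored in this sketch.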

26
Sessions Plus Process Pairs Give Highly Available Messages
[Figure: message 7 and "ack 7" are exchanged on the session and again between the members of the process pair.]
  • Checkpoint messages and sequence numbers to
    backup
  • Backup resumes session if primary fails.
  • Backup broadcasts new identity at takeover (see
    book for code)

27
Highly Available Message Verbs
[Figure: application programs call reliable_send_msg() on the output message session and reliable_get_msg() on the input message session; the Listener process serves the input message session.]
  • Hide under reliable get/send msg:
  • sequence numbers,
  • ack/retransmit logic,
  • checkpoint,
  • process pair takeover,
  • resend of most recent reply.
  • Uses a Listener process (thread) to do all this
    async work
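The ack/retransmit part of what a reliable send hides can be sketched as a bounded retry loop. This is not the book's reliable_send_msg(): the callback type `send_once_fn`, the function `send_with_retransmit`, and the test transport `lossy_send` are all hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transport hook: transmits one copy of message `seq` and
 * reports whether its ack arrived before the timeout expired. */
typedef bool (*send_once_fn)(uint32_t seq, void *ctx);

/* Retransmit until acknowledged or until max_tries copies were sent.
 * Returns the number of transmissions used, or -1 if none was acked. */
int send_with_retransmit(send_once_fn send_once, void *ctx,
                         uint32_t seq, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++)
        if (send_once(seq, ctx)) return attempt;
    return -1;
}

/* Example transport for experiments: loses the first *ctx transmissions. */
bool lossy_send(uint32_t seq, void *ctx)
{
    int *losses_left = ctx;
    (void)seq;
    if (*losses_left > 0) {
        (*losses_left)--;
        return false;      /* message or ack lost: timeout */
    }
    return true;           /* ack received */
}
```

Because the same sequence number is reused on every retry, the receiver can discard extra copies as duplicates, so retransmission adds delay but no new kind of fault.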

28
Summary
  • Went from faulty storage, processes, and messages
    to fault-tolerant versions of each.
  • A simple fault model explains many techniques used
    (and misused) in FT systems.