1
Lampson Sturgis Fault Model

Jim Gray
Microsoft, Gray _at_ Microsoft.com
Andreas Reuter
International University, Andreas.Reuter_at_i-u.de

Time    Mon / Tue / Wed / Thur / Fri
9:00    Overview / TP mons / Log / Files Buffers / B-tree
11:00   Faults / Lock Theory / ResMgr / COM / Access Paths
1:30    Tolerance / Lock Techniq / CICS Inet / Corba / Groupware
3:30    T Models / Queues / Adv TM / Replication / Benchmark
7:00    Party / Workflow / Cyberbrick / Party
2
Rationale: Fault Tolerance Needs a Fault Model
What do you tolerate?

Fault tolerance needs a fault model.
The model needs to be simple enough to understand.
With a model, we
  can design hardware/software to tolerate the faults.
  can make statements about the system behavior.

3
Byzantine Fault Model

Some modules are fault free (during the period of interest).
Other modules may fail (in the worst way).
Make statements about the behavior of the fault-free modules.
Synchronous: All operations happen within a time limit.
Asynchronous: No time limit on anything, no lost messages.
Timed (used here): Notion of timeout and retry.
Key result: N modules can tolerate fewer than N/3 faults (N >= 3f + 1 modules for f faults).
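
The key result can be read as a sizing rule. The following is a minimal sketch in C, assuming the classic bound N >= 3f + 1 for agreement without authenticated messages; the function names are illustrative and do not appear in the slides.

```c
#include <assert.h>

/* Largest number of Byzantine-faulty modules that n modules can tolerate,
   assuming the bound n >= 3f + 1 (hypothetical helper, not from the slides). */
unsigned max_byzantine_faults(unsigned n)
{
    assert(n > 0);
    return (n - 1) / 3;
}

/* Smallest number of modules needed to tolerate f Byzantine faults
   under the same assumption. */
unsigned min_modules_for_faults(unsigned f)
{
    return 3 * f + 1;
}
```

Under this assumption, three modules tolerate no Byzantine fault and four modules tolerate one.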

4
Lampson Sturgis Model

Processes
  Correct: Execute a program at a finite rate.
  Fault: Reset to null state and "stop" for a finite time.
Message
  Correct: Eventually arrives and is correct.
  Fault: Lost, duplicated, or corrupted.
Storage
  Correct: Read(x) returns the most recent value of x. Write(x, v) sets the value of x to v.
  Fault: All pages reset to null. A page resets to null. Read or Write operates on the wrong page.
Other faults (called disasters) are not dealt with.
Assumption: Disasters are rare.

5
Byzantine vs. Lampson-Sturgis Fault Models

Connections unclear.
Byzantine focuses on bounded-time, bounded-fault behavior (real-time systems)
  asynchronous (mostly) or
  synchronous (real time)
Lampson/Sturgis focuses on long-term behavior
  no time or fault limits
  time and timeout heavily used to detect faults

6
Roadmap of What's Coming

Lampson-Sturgis Fault Model
Building highly available processes,
messages, storage from faulty components.
Process pairs give quick repair
Kinds of process pairs
Checkpoint / Restart based on storage
Checkpoint / Restart based on messages
Restart based on transactions (easy to program).

7
Model of Storage and its Faults

System has several stores (discs).
Each has a set of pages.
Stores fail independently.
  probability that a write has no effect: 1 in a million
  mean time to a page fail: a few days
  mean time to a disc fail: a few years
  a wild read/write is modeled as a page fail.

[Figure: a store has a status and a set of pages; a page has a status and a value. Operations: store_write(store, address, value) and store_read(store, address, value).]
8
Storage Decay (the demon)

/* There is one store_decay process for each store in the system */
#define mttvf 7E5 /* mean time (sec) to a page fail, a few days */
#define mttsf 1E8 /* mean time (sec) to disc fail is a few years */
void store_decay(astore store) /* */
{ Ulong addr; /* the random places that will decay */
  Ulong page_fail = time() + mttvf*randf(); /* time to next page decay */
  Ulong store_fail = time() + mttsf*randf(); /* time to next store decay */
  while (TRUE) /* repeat this loop forever */
  { wait(min(page_fail,store_fail) - time()); /* wait for next event */
    if (time() >= page_fail) /* if the event is a page decay */
    { addr = randf()*MAXSTORE; /* pick a random address */
      store.page[addr].status = FALSE; /* set it invalid */
      page_fail = time() - log(randf())*mttvf; /* pick next fault time */
    } /* negative exp distributed, mean mttvf */
    if (time() >= store_fail) /* if the event is a storage fault */
    { store.status = FALSE; /* mark the store as broken */
      for (addr = 0; addr < MAXSTORE; addr++) /* invalidate all pages */
        store.page[addr].status = FALSE; /* */
      store_fail = time() - log(randf())*mttsf; /* pick next fault time */
    }
  }
}

Simulates (specifies) system behavior.
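
The two constants are easier to judge in everyday units. The following sketch converts them and compares the two fault rates; the helper names and the 365.25-day year are assumptions made for illustration, only the values 7E5 and 1E8 come from the slide.

```c
#include <assert.h>

#define MTTVF 7e5   /* mean time (sec) to a page fail, value from the slide */
#define MTTSF 1e8   /* mean time (sec) to a disc fail, value from the slide */

/* Hypothetical unit-conversion helpers. */
double seconds_to_days(double s)
{
    assert(s >= 0.0);
    return s / 86400.0;
}

double seconds_to_years(double s)
{
    assert(s >= 0.0);
    return s / (86400.0 * 365.25);
}

/* Expected number of single-page decays between two whole-store failures. */
double page_faults_per_store_fault(void)
{
    return MTTSF / MTTVF;
}
```

With these values a page decays about every 8 days and a store fails about every 3.2 years, so roughly 143 page decays are expected per store failure.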
9
Reliable Write: Write all members of an N-plex set.

#define nplex 2 /* code works for n>2, but do duplex */
Boolean reliable_write(Ulong group, address addr, avalue value) /* */
{ Ulong i; /* index on elements of store group */
  Boolean status = FALSE; /* true if any write worked */
  /* each group uses nplex stores */
  for (i = 0; i < nplex; i++) /* write each store in the group */
  { /* do the write first so that || cannot short-circuit past it */
    status = store_write(stores[group*nplex+i],addr,value) || status;
  } /* loop to write all stores of group */
  return status; /* return indicates if ANY write worked */
} /* */

10
Reliable Read: read all members of N-plex set
Problems:
  All fail: Disaster
  Ambiguity (N different answers): Take majority, or take "newest"

Ulong version(avalue); /* returns version of a value */
/* read an n-plex group to find the most recent version of a page */
Boolean reliable_read(Ulong group, address addr, avalue value) /* */
{ Ulong i = 0; /* index on store group */
  Boolean gotone = FALSE; /* flag says had a good read */
  Boolean bad = FALSE; /* bad says group needs repair */
  avalue next; /* next value that is read */
  Boolean status; /* read ok */
  for (i = 0; i < nplex; i++) /* for each page in the nplex set */
  { status = store_read(stores[group*nplex+i],addr,next); /* read value */
    if (! status) bad = TRUE; /* if status bad, ignore value */
    else /* have a good read */
      if (! gotone) /* if it is first good value */
        { copy(value,next,VSIZE); gotone = TRUE; } /* make it best value */
      else if (version(next) != version(value)) /* if new val, compare */
        { bad = TRUE; /* if different, repair needed */
          if (version(next) > version(value)) /* if new is best version */
            copy(value, next, VSIZE); /* copy it to best value */
        }
  } /* end of read all copies */

On a bad read, rewrite the group with the best value.
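
The slide stops before the repair step. The following self-contained sketch shows one way the whole read-newest-then-repair cycle could look for a single duplexed group. The page layout (a validity flag, a version, and an int payload), the in-memory `stores` array, and `stores_init` are assumptions for illustration, not the authors' code.

```c
#include <assert.h>
#include <stdbool.h>

#define NPLEX 2    /* copies per group, as on the slides */
#define NPAGES 4   /* pages per store (illustrative) */

typedef struct { bool ok; unsigned version; int data; } page_t;  /* assumed layout */
typedef struct { bool ok; page_t page[NPAGES]; } store_t;

static store_t stores[NPLEX];   /* one n-plex group, for brevity */

/* Mark every store healthy and every page invalid (never written). */
void stores_init(void)
{
    for (int i = 0; i < NPLEX; i++) {
        stores[i].ok = true;
        for (int a = 0; a < NPAGES; a++)
            stores[i].page[a].ok = false;
    }
}

bool store_write(store_t *s, unsigned addr, page_t v)
{
    if (!s->ok || addr >= NPAGES) return false;
    s->page[addr] = v;
    s->page[addr].ok = true;
    return true;
}

bool store_read(const store_t *s, unsigned addr, page_t *out)
{
    if (!s->ok || addr >= NPAGES || !s->page[addr].ok) return false;
    *out = s->page[addr];
    return true;
}

/* Write every copy; report success if any copy was written. */
bool reliable_write(unsigned addr, page_t v)
{
    bool any = false;
    for (int i = 0; i < NPLEX; i++)
        any = store_write(&stores[i], addr, v) || any;  /* write is always attempted */
    return any;
}

/* Read every copy, keep the newest, and rewrite the group if copies disagree. */
bool reliable_read(unsigned addr, page_t *best)
{
    bool gotone = false, bad = false;
    page_t next;
    for (int i = 0; i < NPLEX; i++) {
        if (!store_read(&stores[i], addr, &next)) { bad = true; continue; }
        if (!gotone) { *best = next; gotone = true; }
        else if (next.version != best->version) {
            bad = true;
            if (next.version > best->version) *best = next;
        }
    }
    if (!gotone) return false;              /* all copies failed: a disaster */
    if (bad) reliable_write(addr, *best);   /* repair step described on the slide */
    return true;
}
```

After one copy of a page is invalidated, a single `reliable_read` of that address both returns the surviving value and restores the damaged copy.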
11
Background Store Repair Process

/* repair the broken pages in an n-plex group. */
/* Group is in 0,...,(MAXSTORE/nplex)-1 */
void store_repair(Ulong group) /* */
{ int i; /* next address to be repaired */
  avalue value; /* buffer holds value to be read */
  while (TRUE) /* do forever */
  { for (i = 0; i < MAXSTORE; i++) /* for each page in the store */
    { wait(1); /* wait a second */
      reliable_read(group,i,value); /* a reliable read repairs pages */
    } /* if they do not match */
  }
}
Needed to minimize the chance of N failures.
Repair is important.

On a bad read, rewrite the group with the best value.
12
Optimistic Reads

Most implementations do optimistic reads: read only one value.
Boolean optimistic_read(Ulong group, address addr, avalue value) /* */
{ if (group >= MAXSTORES/nplex) return FALSE; /* return false if bad addr */
  if (store_read(stores[nplex*group],addr,value)) /* read one value */
    return TRUE; /* and if that is ok return it as the true value */
  else /* if reading one value returned bad then */
    return (reliable_read(group,addr,value)); /* n-plex read repair. */
} /* */
This is dangerous (especially without repair): a single good read can return a stale value that another copy has already superseded.
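
The danger can be shown with a tiny scenario. The sketch below assumes a page carries a version number and models only the n copies of one logical page; `copy_of_page`, `newest_read`, and `lose_write_to_copy0` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define NPLEX 2

typedef struct { bool ok; unsigned version; int data; } page_t;  /* assumed layout */

static page_t copy_of_page[NPLEX];   /* the n copies of one logical page */

/* Read all copies and return the one with the highest version. */
bool newest_read(page_t *out)
{
    bool gotone = false;
    for (int i = 0; i < NPLEX; i++) {
        if (!copy_of_page[i].ok) continue;
        if (!gotone || copy_of_page[i].version > out->version) {
            *out = copy_of_page[i];
            gotone = true;
        }
    }
    return gotone;
}

/* Read only the first copy; fall back to the full read if it is invalid. */
bool optimistic_read(page_t *out)
{
    if (copy_of_page[0].ok) { *out = copy_of_page[0]; return true; }
    return newest_read(out);
}

/* Scenario: version 2 reaches copy 1, but the write to copy 0 has no effect. */
void lose_write_to_copy0(void)
{
    copy_of_page[0] = (page_t){ true, 1, 100 };  /* stale but still valid */
    copy_of_page[1] = (page_t){ true, 2, 200 };  /* current */
}
```

In this scenario `optimistic_read` succeeds and returns version 1, while `newest_read` returns version 2. Nothing tells the optimistic reader that its value is stale, which is why a background repair process matters.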

13
Storage Fault Summary

Simple fault model.
Allows discussion/specification of fault
tolerance.
Uncovers some problems in many implementations
Ambiguous reads
Repair process.
Optimistic reads.

14
Process Fault Model

Process executes a program and has state.
Program causes state change plus send/get
message.
Process fails by stopping (for a while) and then
resetting its data and message state.

[Figure: a sender process (program and data) sends a new message to a receiver process (program and data). The receiver has a queue of input messages; each message has status, value, and next fields.]
15
Process Fault Model: The Break/Fix Loop

#define MAXPROCESS MANY /* the system will have many processes */
typedef Ulong processid; /* process id is an integer index into array */
typedef struct { char program[MANY/2]; char data[MANY/2]; } state; /* program + data */
struct { state initial; /* process initial state */
         state current; /* value of the process state */
         amessagep messages; /* queue of messages waiting for process */
       } process[MAXPROCESS]; /* */
/* Process Decay: execute a process and occasionally inject faults into it */
#define mttpf 1E7 /* mean time to process failure, about 4 months */
#define mttpr 1E4 /* mean time to repair is 3 hours */
void process_execution(processid pid) /* */
{ Ulong proc_fail; /* time of next process fault */
  Ulong proc_repair; /* time to repair process */
  amessagep msg, next; /* pointers to process messages */
  while (TRUE) /* global execution loop */
  { proc_fail = time() - log(randf())*mttpf; /* the time of next fail */
    proc_repair = -log(randf())*mttpr; /* delay in next process repair */
    while (time() < proc_fail) /* */
      { execute(process[pid].current); } /* execute for about 4 months (work) */
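
The two constants determine how often a single, unpaired process is unavailable. The sketch below applies the standard steady-state formula availability = MTTF / (MTTF + MTTR), which is an assumption brought in for illustration and is not stated on the slide; the function names are hypothetical.

```c
#include <assert.h>

/* Steady-state availability of a module that alternates between running
   (mean mttf seconds) and being repaired (mean mttr seconds). */
double availability(double mttf, double mttr)
{
    assert(mttf > 0.0 && mttr >= 0.0);
    return mttf / (mttf + mttr);
}

/* Expected hours of downtime per year (365.25 days) at a given availability. */
double downtime_hours_per_year(double avail)
{
    assert(avail >= 0.0 && avail <= 1.0);
    return (1.0 - avail) * 365.25 * 24.0;
}
```

With mttpf = 1E7 and mttpr = 1E4 the availability is about 0.999, or close to 9 hours of downtime per year.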

16
Checkpoint/Restart Process (Storage based)

/* A checkpoint-restart process server generating unique sequence numbers */
checkpoint_restart_process() /* */
{ Ulong disc = 0; /* a reliable storage group with state */
  Ulong address[2] = {0,1}; /* page address of two states on disc */
  Ulong old; /* index of the disc with the old state */
  struct { Ulong ticketno; /* process reads its state from disc. */
           char filler[VSIZE]; /* newest state has max ticket number */
         } value[2]; /* current state kept in value[0] */
  struct msg { /* buffer to hold input message */
           processid him; /* contains requesting process id */
           char filler[VSIZE]; /* reply (ticket num) sent to process */
         } msg; /* */
  /* Restart logic: recover ticket number from persistent storage */
  for (old = 0; old <= 1; old++) /* read the two states from disc */
  { if (!reliable_read(disc,address[old],value[old])) /* if read fails */
      panic(); /* then failfast */
  }
  if (value[1].ticketno < value[0].ticketno) old = 1; /* pick max seq no */
  else { old = 0; copy(value[0], value[1], VSIZE); } /* which is old val */
  /* Processing logic: generate next number, checkpoint, and reply */
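
The processing logic is not shown on the slide. The following sketch illustrates the two-slot (ping-pong) checkpoint idea the restart logic implies: always overwrite the older slot, so the newer state survives a fault during the write. The in-memory `disc` array stands in for the two reliable pages, and `format_disc`, `crash`, `restart`, and `next_ticket` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool ok; unsigned long ticketno; } slot_t;  /* one checkpointed state */

static slot_t disc[2];         /* stand-in for the two reliable pages */
static unsigned long ticket;   /* volatile copy of the newest state */
static int old;                /* index of the slot holding the older state */

/* Initialize both slots with ticket number 0. */
void format_disc(void)
{
    disc[0] = (slot_t){ true, 0 };
    disc[1] = (slot_t){ true, 0 };
}

/* Simulate a process fault: all volatile state is lost. */
void crash(void)
{
    ticket = 0;
    old = 0;
}

/* Restart logic: read both slots and resume from the larger ticket number. */
bool restart(void)
{
    if (!disc[0].ok || !disc[1].ok) return false;   /* the slide calls panic() here */
    old = (disc[1].ticketno < disc[0].ticketno) ? 1 : 0;
    ticket = disc[1 - old].ticketno;
    return true;
}

/* Generate the next number, checkpoint it over the OLD slot, then reply. */
unsigned long next_ticket(void)
{
    ticket = ticket + 1;
    disc[old].ticketno = ticket;   /* newer slot stays intact during this write */
    disc[old].ok = true;
    old = 1 - old;                 /* the other slot is now the older one */
    return ticket;                 /* reply only after the checkpoint exists */
}
```

Because the reply is produced only after the checkpoint, a restarted server never hands out a ticket number that an earlier incarnation already issued.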

17
Process Pairs (message-based checkpoints)
[Figure: client processes send "Give me a ticket" requests to the primary server process and receive ticket numbers. The primary sends "I'm Alive" and state checkpoint messages to the backup server process. Each server holds the next ticket number.]

Problem: Solution
Detect failure: "I'm Alive" message timeout. No "real" solution.
Continuation: Checkpoint messages.
Startup: Backup waits for primary.

18
Process Pairs (message-based checkpoints)
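[Flowchart:
Restart: am I the default primary? If so, enter the primary loop; otherwise wait a second and enter the backup loop.
Backup loop: any input in the last second? If not, broadcast "I'm Primary", reply to the last request, and enter the primary loop. If so, read it; if it carries a newer state, set my state to the new state; then wait a second.
Primary loop: any input? If so, read it (requests), compute the new state, send the new state to the backup, and reply (replies). If not, send the state to the backup ("I'm alive").]

The primary sits in a tight loop sending "I'm alive" or state change messages to the backup.
The backup thinks the primary is dead if it received no messages in the previous second.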
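
The backup's side of this protocol can be sketched as two small functions. This is an illustration under stated assumptions: checkpoints carry a version number, time is passed in explicitly as seconds, and the type and function names (`pair_member_t`, `backup_receive`, `backup_tick`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TAKEOVER_TIMEOUT 1.0   /* seconds of silence before takeover, per the slide */

typedef enum { BACKUP, PRIMARY } role_t;

typedef struct {
    role_t role;
    unsigned long state_version;   /* assumed: each checkpoint carries a version */
    unsigned long next_ticket;     /* the checkpointed application state */
    double last_heard;             /* time of the last message from the primary */
} pair_member_t;

/* The backup receives an "I'm alive" or checkpoint message at time `now`. */
void backup_receive(pair_member_t *b, double now,
                    unsigned long version, unsigned long next_ticket)
{
    b->last_heard = now;
    if (version > b->state_version) {   /* adopt only a newer state */
        b->state_version = version;
        b->next_ticket = next_ticket;
    }
}

/* Called periodically; returns true if this call performed a takeover. */
bool backup_tick(pair_member_t *b, double now)
{
    if (b->role == BACKUP && now - b->last_heard > TAKEOVER_TIMEOUT) {
        b->role = PRIMARY;   /* a real system would broadcast "I'm Primary" here */
        return true;
    }
    return false;
}
```

A timeout cannot distinguish a dead primary from a slow one, which is the sense in which the previous slide says failure detection has no "real" solution.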

19
What We Have Done So Far

Converted "faulty" processes to reliable ones.
Tolerate hardware and some software faults
Can repair in seconds or milli-seconds.
Unlike checkpoint restart:
  No process creation/setup time
  No client reconnect time.
Operating systems are beginning to provide process pairs.
Stateless process pairs can use transactional servers to
  Store their state
  Clean up the mess at takeover.
Like storage-based checkpoint/restart, except process setup/connection is instant.

20
Persistent process pairs

persistent_process() /* prototypical persistent process */
{ wait_to_be_primary(); /* wait to be told you are primary */
  while (TRUE) /* when primary, do forever */
  { begin_work(); /* start transaction or subtransaction */
    read_request(); /* read a request */
    doit(); /* perform the desired function */
    reply(); /* reply */
    commit_work(); /* finish transaction or subtransaction */
  } /* did a step, now get next request */
} /* */

21
Persistent Process Pairs: The ticket server redone as a transactional server.

/* A transactional persistent server process generating unique tickets */
persistent_ticket_server() /* current state kept in sql database */
{ int ticketno; /* next ticket # (from DB) */
  struct msg { /* buffer to hold input message */
           processid him; /* contains requesting process id */
           char filler[VSIZE]; /* reply (ticket num) sent to that addr */
         } msg; /* */
  /* Restart logic: recover ticket number from persistent storage */
  wait_to_be_primary(); /* wait to be told you are primary */
  /* Processing logic: generate next number, checkpoint, and reply */
  while (TRUE) /* do forever */
  { begin_work(); /* begin a transaction */
    while (! get_msg(msg)); /* get next request for a ticket */
    exec sql update ticket /* increment the next ticket number */
      set ticketno = ticketno + 1; /* */
    exec sql select max(ticketno) /* fetch current ticket number */
      into :ticketno /* into program local variable */
      from ticket; /* from SQL database */
    commit_work(); /* commit transaction */

22
Messages Fault Model

Each process has a queue of incoming messages.
Messages can be
  corrupted: checksum detects it.
  duplicated: sequence number detects it.
  delayed arbitrarily long (ack + retransmit).
  lost (ack + retransmit + sequence number).
Techniques here give messages fail-fast semantics.

23
Message Verbs SEND

/* send a message to a process: returns true if the process exists */
Boolean message_send(processid him, avalue value) /* */
{ amessagep it; /* pointer to message created by this call */
  amessagep queue; /* pointer to process message queue */
  if (him >= MAXPROCESS) return FALSE; /* test for valid process */
loop: it = malloc(sizeof(amessage)); /* allocate space to hold message */
  it->status = TRUE; it->next = NULL; /* and fill in the fields */
  copy(it->value,value,VSIZE); /* copy msg data to message body */
  queue = process[him].messages; /* look at process message queue */
  if (queue == NULL) process[him].messages = it; /* if the queue is empty, */
                                       /* this message becomes its head */
  else
  { while (queue->next != NULL) queue = queue->next; /* else place */
    queue->next = it; /* the message at queue end. */
  }
  if (randf() < pmf) it->status = FALSE; /* sometimes message corrupted */
  if (randf() < pmd) goto loop; /* sometimes the message duplicated */
  return TRUE; /* */
} /* */

24
Message Verbs GET

/* get the next input message of this process: returns true if a message */
Boolean message_get(avalue * valuep, Boolean * msg_status) /* */
{ processid me = MyPID(); /* caller's process number */
  amessagep it; /* pointer to input message */
  it = process[me].messages; /* find caller's message input queue */
  if (it == NULL) return FALSE; /* return false if queue is empty */
  process[me].messages = it->next; /* take first message off the queue */
  *msg_status = it->status; /* record its status */
  copy(valuep,it->value,VSIZE); /* value = it->value */
  free(it); /* deallocate its space */
  return TRUE; /* return status to caller */
} /* */

25
Sessions Make Messages FailFast
[Figure: a session carrying message 7 and its acknowledgment, Ack 7.]

CRC makes a corrupt message look like a lost message.
Sequence numbers detect duplicates => lost message.
So, the only failure is a lost message.
Timeout/retransmit masks lost messages => the only failure is delay.
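
The receiving end of such a session can be sketched as a filter that drops anything corrupt or out of sequence and acknowledges what it has accepted. This is a minimal sketch under assumptions: a simple additive checksum stands in for a real CRC, sequence numbers start at 1, and `packet_t`, `session_t`, `make_packet`, and `session_receive` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned seq;           /* sequence number assigned by the sender */
    unsigned checksum;      /* additive checksum, a stand-in for a CRC */
    unsigned char body[8];
} packet_t;

unsigned checksum_of(const packet_t *p)
{
    unsigned sum = p->seq;
    for (size_t i = 0; i < sizeof p->body; i++)
        sum += p->body[i];
    return sum;
}

/* Build a packet whose body is filled with one byte value. */
packet_t make_packet(unsigned seq, unsigned char fill)
{
    packet_t p;
    p.seq = seq;
    for (size_t i = 0; i < sizeof p.body; i++)
        p.body[i] = fill;
    p.checksum = checksum_of(&p);
    return p;
}

typedef struct { unsigned next_expected; } session_t;   /* starts at 1 */

/* Returns true if the packet should be delivered to the application.
   *ack is set to the highest in-order sequence number accepted so far
   (0 means nothing accepted yet). */
bool session_receive(session_t *s, const packet_t *p, unsigned *ack)
{
    bool deliver = false;
    if (checksum_of(p) == p->checksum && p->seq == s->next_expected) {
        s->next_expected++;
        deliver = true;
    }
    /* corrupt, duplicate, and out-of-order packets are silently dropped */
    *ack = s->next_expected - 1;
    return deliver;
}
```

A sender that retransmits until it sees the matching ack turns every remaining fault into a delay, which is the fail-fast behavior the slide describes.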

26
Sessions Plus Process Pairs Give Highly Available Messages
[Figure: two sessions, each carrying message 7 and its acknowledgment, ack 7.]

Checkpoint messages and sequence numbers to the backup.
The backup resumes the session if the primary fails.
The backup broadcasts its new identity at takeover (see book for code).

27
Highly Available Message Verbs
[Figure: application programs call reliable_send_msg() on an output message session and reliable_get_msg() on an input message session; the listener process serves the input message session.]