Title: Lampson Sturgis Fault Model
1 Lampson Sturgis Fault Model
- Jim Gray
- Microsoft, Gray _at_ Microsoft.com
- Andreas Reuter
- International University, Andreas.Reuter_at_i-u.de
Course schedule (days: Mon, Tue, Wed, Thur, Fri):
- 9:00: Overview, TP mons, Log, Files Buffers, B-tree
- 11:00: Faults, Lock Theory, ResMgr, COM, Access Paths
- 1:30: Tolerance, Lock Techniq, CICS Inet, Corba, Groupware
- 3:30: T Models, Queues, Adv TM, Replication, Benchmark
- 7:00: Party, Workflow, Cyberbrick, Party
2 Rationale: Fault Tolerance Needs a Fault Model. What do you tolerate?
- Fault tolerance needs a fault model.
- Model needs to be simple enough to understand.
- With a model,
  - can design hardware/software to tolerate the faults.
  - can make statements about the system behavior.
3 Byzantine Fault Model
- Some modules are fault free (during the period of interest). Other modules may fail (in the worst way). Make statements about the behavior of the fault-free modules.
- Synchronous: all operations happen within a time limit.
- Asynchronous: no time limit on anything, no lost messages.
- Timed (used here): notion of timeout and retry.
- Key result: N modules can tolerate fewer than N/3 faults (tolerating f faults needs N >= 3f + 1 modules).
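The key result can be turned into a two-line calculation. The sketch below assumes the classical bound N >= 3f + 1 (agreement without authenticated messages); the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Largest number of Byzantine-faulty modules that n modules can
 * tolerate, assuming the bound n >= 3f + 1. */
unsigned max_byzantine_faults(unsigned n)
{
    if (n == 0) return 0;          /* no modules, nothing to tolerate */
    return (n - 1) / 3;            /* integer division rounds down */
}

/* Smallest number of modules needed to tolerate f Byzantine faults. */
unsigned min_modules_for(unsigned f)
{
    assert(f <= (UINT_MAX - 1) / 3);   /* guard against overflow */
    return 3 * f + 1;
}
```

For example, three modules tolerate no Byzantine fault at all, four modules tolerate one, and seven tolerate two.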
4 Lampson Sturgis Model
- Processes
  - Correct: execute a program at a finite rate.
  - Fault: reset to null state and "stop" for a finite time.
- Messages
  - Correct: eventually arrives and is correct.
  - Fault: lost, duplicated, or corrupted.
- Storage
  - Correct: Read(x) returns the most recent value of x. Write(x, v) sets the value of x to v.
  - Fault: all pages reset to null; a page resets to null; Read or Write operates on the wrong page.
- Other faults (called disasters) are not dealt with.
- Assumption: disasters are rare.
5 Byzantine vs. Lampson-Sturgis Fault Models
- Connections unclear.
- Byzantine focuses on bounded-time, bounded-faults (real-time systems)
  - asynchronous (mostly) or
  - synchronous (real time)
- Lampson/Sturgis focuses on long-term behavior
  - no time or fault limits
  - time and timeout heavily used to detect faults
6 Roadmap of What's Coming
- Lampson-Sturgis Fault Model
- Building highly available processes, messages, storage from faulty components.
- Process pairs give quick repair
- Kinds of process pairs
  - Checkpoint / Restart based on storage
  - Checkpoint / Restart based on messages
  - Restart based on transactions (easy to program).
7 Model of Storage and its Faults
- System has several stores (discs).
- Each has a set of pages.
- Stores fail independently.
- Probability that a write has no effect: 1 in a million.
- Mean time to a page fail: a few days.
- Mean time to disc fail: a few years.
- Wild read/write is modeled as a page fail.
[Figure: a store has a status and a set of pages; each page has a status and a value. The operations are store_write(store, address, value) and store_read(store, address, value).]
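The store and page structure described above can be sketched in C as follows. This is a minimal illustration, not the deck's own listing: the tiny sizes (MAXSTORE pages of VSIZE bytes), the pointer-based signatures, and the type names apage and astore are assumptions, and the one-in-a-million null write is left out so the behavior stays deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAXSTORE 8       /* pages per store (tiny, for illustration) */
#define VSIZE    16      /* bytes per page value */

typedef struct {
    bool status;         /* false: page is invalid (it has decayed) */
    char value[VSIZE];
} apage;

typedef struct {
    bool  status;        /* false: the whole store has failed */
    apage page[MAXSTORE];
} astore;

/* Write a page. Fails on a bad address or a broken store. */
bool store_write(astore *store, unsigned addr, const char *value)
{
    assert(store != NULL && value != NULL);
    if (addr >= MAXSTORE || !store->status) return false;
    store->page[addr].status = true;             /* a good write validates the page */
    memcpy(store->page[addr].value, value, VSIZE);
    return true;
}

/* Read a page. Fails on a bad address, a broken store, or an invalid page. */
bool store_read(const astore *store, unsigned addr, char *value)
{
    assert(store != NULL && value != NULL);
    if (addr >= MAXSTORE || !store->status) return false;
    if (!store->page[addr].status) return false;
    memcpy(value, store->page[addr].value, VSIZE);
    return true;
}
```

A caller learns about a fault only through the Boolean result, which is what lets the later slides layer n-plex reads and writes on top of these two verbs.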
8 Storage Decay (the demon)
/* There is one store_decay process for each store in the system */
#define mttvf 7E5    /* mean time (sec) to a page fail, a few days */
#define mttsf 1E8    /* mean time (sec) to disc fail is a few years */
void store_decay(astore store)
{   Ulong addr;                                       /* the random places that will decay */
    Ulong page_fail  = time() + mttvf*randf();        /* time to next page decay */
    Ulong store_fail = time() + mttsf*randf();        /* time to next store decay */
    while (TRUE)                                      /* repeat this loop forever */
    {   wait(min(page_fail,store_fail) - time());     /* wait for next event */
        if (time() >= page_fail)                      /* if the event is a page decay */
        {   addr = randf()*MAXSTORE;                  /* pick a random address */
            store.page[addr].status = FALSE;          /* set it invalid */
            page_fail = time() - log(randf())*mttvf;  /* pick next fault time */
        };                                            /* negative exp distributed, mean mttvf */
        if (time() >= store_fail)                     /* if the event is a storage fault */
        {   store.status = FALSE;                     /* mark the store as broken */
            for (addr = 0; addr < MAXSTORE; addr++)   /* invalidate all pages */
                store.page[addr].status = FALSE;
            store_fail = time() - log(randf())*mttsf; /* pick next fault time */
        };
    };
};
Simulates (specifies) system behavior.
9 Reliable Write: write all members of an N-plex set.
#define nplex 2    /* code works for n>2, but do duplex */
Boolean reliable_write(Ulong group, address addr, avalue value)
{   Ulong i;                         /* index on elements of store group */
    Boolean status = FALSE;          /* true if any write worked */
                                     /* each group uses nplex stores */
    for (i = 0; i < nplex; i++)      /* write each store in the group */
    {   status =                     /* status indicates if any write worked */
            store_write(stores[group*nplex+i],addr,value) || status;
                                     /* write comes first so that || never skips it */
    };                               /* loop to write all stores of group */
    return status;                   /* return indicates if ANY write worked */
};
10 Reliable Read: read all members of an N-plex set
Problems:
- All fail: disaster.
- Ambiguity (N different answers): take majority, or take "newest".
Ulong version(avalue);               /* returns version of a value */
/* read an n-plex group to find the most recent version of a page */
Boolean reliable_read(Ulong group, address addr, avalue value)
{   Ulong i = 0;                     /* index on store group */
    Boolean gotone = FALSE;          /* flag says had a good read */
    Boolean bad = FALSE;             /* bad says group needs repair */
    avalue next;                     /* next value that is read */
    Boolean status;                  /* read ok */
    for (i = 0; i < nplex; i++)      /* for each page in the nplex set */
    {   status = store_read(stores[group*nplex+i],addr,next);  /* read value */
        if (!status) bad = TRUE;     /* if status bad, ignore value */
        else                         /* have a good read */
            if (!gotone)             /* if it is first good value */
            {   copy(value,next,VSIZE); gotone = TRUE; }   /* make it best value */
            else if (version(next) != version(value))      /* if new val, compare */
            {   bad = TRUE;          /* if different, repair needed */
                if (version(next) > version(value))        /* if new is best version */
                    copy(value,next,VSIZE);                /* copy it to best value */
            };
    };                               /* end of read all copies */
    if (!gotone) return FALSE;       /* all reads failed: disaster */
    if (bad) reliable_write(group,addr,value);  /* on bad read, rewrite with best value */
    return TRUE;
};
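The "take newest, then repair" rule can be tried out on an in-memory group. The sketch below is a simplified stand-in for the listing above: each copy carries an explicit version field, a page is unreadable when its valid flag is false, and writes always succeed. The struct layout and signatures are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPLEX 2                      /* copies per group (duplex) */

typedef struct {
    bool     valid;                  /* false models a decayed page */
    unsigned version;                /* higher means newer */
    int      data;
} apage;

/* Write the same version and data to every copy of the group. */
void reliable_write(apage copy[NPLEX], unsigned version, int data)
{
    for (int i = 0; i < NPLEX; i++) {
        copy[i].valid = true;
        copy[i].version = version;
        copy[i].data = data;
    }
}

/* Read every copy, keep the newest readable one, repair on any mismatch. */
bool reliable_read(apage copy[NPLEX], apage *best)
{
    bool gotone = false;             /* had at least one good read */
    bool bad = false;                /* group needs repair */
    assert(best != NULL);
    for (int i = 0; i < NPLEX; i++) {
        if (!copy[i].valid) { bad = true; continue; }
        if (!gotone) { *best = copy[i]; gotone = true; }
        else if (copy[i].version != best->version) {
            bad = true;
            if (copy[i].version > best->version) *best = copy[i];
        }
    }
    if (!gotone) return false;       /* every copy failed: a disaster */
    if (bad) reliable_write(copy, best->version, best->data);
    return true;
}
```

After a read that finds one stale or invalid copy, both copies hold the newest version again, which is why the repair process on the next slide only needs to call the reliable read.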
11 Background Store Repair Process
/* repair the broken pages in an n-plex group. */
/* Group is in 0,...,(MAXSTORES/nplex)-1 */
void store_repair(Ulong group)
{   int i;                               /* next address to be repaired */
    avalue value;                        /* buffer holds value to be read */
    while (TRUE)                         /* do forever */
    {   for (i = 0; i < MAXSTORE; i++)   /* for each page in the store */
        {   wait(1);                     /* wait a second */
            reliable_read(group,i,value);   /* a reliable read repairs page */
        };                               /* if they do not match */
    };
};
- Needed to minimize chances of N-failures.
- Repair is important.
- On a bad read, the reliable read rewrites the group with the best value.
12 Optimistic Reads
- Most implementations do optimistic reads
- read only one value.
Boolean optimistic_read(Ulong group, address addr, avalue value)
{   if (group >= MAXSTORES/nplex) return FALSE;      /* return false if bad addr */
    if (store_read(stores[nplex*group],addr,value))  /* read one value */
        return TRUE;             /* and if that is ok return it as the true value */
    else                         /* if reading one value returned bad then */
        return (reliable_read(group,addr,value));    /* n-plex read and repair */
};
- This is dangerous (especially without repair).
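Why this is dangerous shows up in a small in-memory experiment: if the first copy is readable but stale, an optimistic read returns the old value without ever noticing the newer one. The sketch assumes a hypothetical page layout with an explicit version field; it is an illustration, not the listing above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPLEX 2

typedef struct {
    bool     valid;                  /* false models a decayed page */
    unsigned version;                /* higher means newer */
    int      data;
} apage;

/* Optimistic read: trust the first copy whenever it is readable. */
bool optimistic_read(const apage copy[NPLEX], apage *out)
{
    bool gotone = false;
    assert(out != NULL);
    if (copy[0].valid) { *out = copy[0]; return true; }
    /* First copy unreadable: fall back to the newest remaining copy. */
    for (int i = 1; i < NPLEX; i++)
        if (copy[i].valid && (!gotone || copy[i].version > out->version)) {
            *out = copy[i];
            gotone = true;
        }
    return gotone;
}
```

With copy 0 at version 1 and copy 1 at version 2, the call succeeds and returns version 1. Only a full n-plex read, or a background repair that runs one, would detect the mismatch.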
13 Storage Fault Summary
- Simple fault model.
- Allows discussion/specification of fault tolerance.
- Uncovers some problems in many implementations:
  - Ambiguous reads.
  - Repair process.
  - Optimistic reads.
14 Process Fault Model
- Process executes a program and has state.
- Program causes state change plus send/get message.
- Process fails by stopping (for a while) and then resetting its data and message state.
[Figure: a sender process and a receiver process, each with a program and data. The receiver has a queue of input messages; the sender adds a new message to it. Each message has a status, a value, and a next pointer.]
15 Process Fault Model: The Break/Fix loop
#define MAXPROCESS MANY          /* the system will have many processes */
typedef Ulong processid;         /* process id is an integer index into array */
typedef struct {char program[MANY/2]; char data[MANY/2];} state;  /* program + data */
struct { state initial;          /* process initial state */
         state current;          /* value of the process state */
         amessagep messages;     /* queue of messages waiting for process */
       } process[MAXPROCESS];
/* Process Decay: execute a process and occasionally inject faults into it */
#define mttpf 1E7                /* mean time to process failure, about 4 months */
#define mttpr 1E4                /* mean time to repair is 3 hours */
void process_execution(processid pid)
{   Ulong proc_fail;             /* time of next process fault */
    Ulong proc_repair;           /* time to repair process */
    amessagep msg, next;         /* pointers to process messages */
    while (TRUE)                 /* global execution loop */
    {   proc_fail = time() - log(randf())*mttpf;   /* the time of next fail */
        proc_repair = -log(randf())*mttpr;         /* delay in next process repair */
        while (time() < proc_fail)
        {   execute(process[pid].current); };      /* execute for about 4 months (work) */
        /* on failure the process stops for the repair time, then resets its data and message state */
16 Checkpoint/Restart Process (Storage based)
/* A checkpoint-restart process server generating unique sequence numbers */
checkpoint_restart_process()
{   Ulong disc = 0;                  /* a reliable storage group with state */
    Ulong address[2] = {0,1};        /* page address of two states on disc */
    Ulong old;                       /* index of the disc with the old state */
    struct { Ulong ticketno;         /* process reads its state from disc. */
             char filler[VSIZE];     /* newest state has max ticket number */
           } value[2];               /* current state kept in value[0] */
    struct msg {                     /* buffer to hold input message */
             processid him;          /* contains requesting process id */
             char filler[VSIZE];     /* reply (ticket num) sent to process */
           } msg;
    /* Restart logic: recover ticket number from persistent storage */
    for (old = 0; old <= 1; old++)   /* read the two states from disc */
    {   if (!reliable_read(disc,address[old],value[old]))   /* if read fails */
            panic(); };              /* then failfast */
    if (value[1].ticketno < value[0].ticketno) old = 1;     /* pick max seq no */
    else { old = 0; copy(value[0],value[1],VSIZE); };       /* which is old val */
    /* Processing logic: generate next number, checkpoint, and reply */
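The restart rule (two state pages, the newest one carries the largest ticket number) can be exercised in isolation. The sketch below keeps the two pages in memory and assumes that each checkpoint overwrites the older page before the ticket is handed out; that ordering, the struct, and the function names are assumptions for illustration, since the slide's processing logic is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool          valid;         /* false: page could not be read */
    unsigned long ticketno;
} state_page;

/* Restart: recover the newest ticket number and note which page is old.
 * Returns false if either page is unreadable (the slide panics there). */
bool restart(const state_page disc[2], unsigned long *ticketno, int *old)
{
    assert(ticketno != NULL && old != NULL);
    if (!disc[0].valid || !disc[1].valid) return false;
    *old = (disc[1].ticketno < disc[0].ticketno) ? 1 : 0;
    *ticketno = disc[1 - *old].ticketno;     /* the other page is the newest */
    return true;
}

/* Issue a ticket: checkpoint to the old page, then flip which page is old. */
unsigned long next_ticket(state_page disc[2], unsigned long *ticketno, int *old)
{
    *ticketno += 1;
    disc[*old].ticketno = *ticketno;         /* overwrite the older state */
    disc[*old].valid = true;
    *old = 1 - *old;
    return *ticketno;
}
```

Because only the older page is ever overwritten, a crash in the middle of a checkpoint still leaves the previous state intact on the other page.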
17 Process Pairs (message-based checkpoints)
[Figure: client processes send "Give me a ticket" requests to the primary server process and receive ticket numbers. The primary sends "I'm Alive" and state checkpoint messages to the backup server process. Each server holds the next ticket number.]
Problems and solutions:
- Detect failure: "I'm Alive" message timeout. No "real" solution.
- Continuation: checkpoint messages.
- Startup: backup waits for primary.
18 Process Pairs (message-based checkpoints)
[Figure: flowchart of the process-pair logic. On restart, a process asks "am I the default primary?" and enters either the primary loop or the backup loop. Backup loop: wait a second; if there is input, read it, and if it carries a newer state, set my state to the new state; if no new state arrived in the last second, broadcast "I'm Primary", reply to the last request, and enter the primary loop. Primary loop: if there is input, read the request, compute the new state, send the new state to the backup, and reply; otherwise send the state to the backup as an "I'm alive" message.]
- Primary in tight loop sending "I'm alive" or state change messages to backup.
- Backup thinks primary dead if no messages in previous second.
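The backup's failure detector amounts to a timestamp comparison. A minimal sketch, assuming time is passed in explicitly as seconds and that takeover happens once the silence exceeds the timeout; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double last_heard;       /* time (sec) of last message from the primary */
    double timeout;          /* silence longer than this means "presumed dead" */
    bool   is_primary;       /* set once this process has taken over */
} backup_state;

/* Record an "I'm alive" or state-checkpoint message from the primary. */
void on_message(backup_state *b, double now)
{
    assert(b != NULL);
    b->last_heard = now;
}

/* Called periodically by the backup; returns true at the moment of takeover. */
bool check_takeover(backup_state *b, double now)
{
    assert(b != NULL);
    if (b->is_primary) return false;                     /* already took over */
    if (now - b->last_heard <= b->timeout) return false; /* primary still heard */
    b->is_primary = true;     /* here the backup would broadcast "I'm Primary" */
    return true;
}
```

A timeout cannot tell a dead primary from a slow one, which is the sense in which the previous slide says there is no "real" solution to failure detection.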
19 What We Have Done So Far
- Converted "faulty" processes to reliable ones.
- Tolerate hardware and some software faults.
- Can repair in seconds or milliseconds.
- Unlike checkpoint/restart: no process creation/setup time, no client reconnect time.
- Operating systems are beginning to provide process pairs.
- Stateless process pairs can use transactional servers to
  - store their state
  - clean up the mess at takeover.
- Like storage-based checkpoint/restart
  - except process setup/connection is instant.
20 Persistent process pairs
persistent_process()             /* prototypical persistent process */
{   wait_to_be_primary();        /* wait to be told you are primary */
    while (TRUE)                 /* when primary, do forever */
    {   begin_work();            /* start transaction or subtransaction */
        read_request();          /* read a request */
        doit();                  /* perform the desired function */
        reply();                 /* reply */
        commit_work();           /* finish transaction or subtransaction */
    };                           /* did a step, now get next request */
};
21 Persistent Process Pairs: the ticket server redone as a transactional server.
/* A transactional persistent server process generating unique tickets */
persistent_ticket_server()           /* current state kept in sql database */
{   int ticketno;                    /* next ticket (from DB) */
    struct msg {                     /* buffer to hold input message */
        processid him;               /* contains requesting process id */
        char filler[VSIZE];          /* reply (ticket num) sent to that addr */
    } msg;
    /* Restart logic: recover ticket number from persistent storage */
    wait_to_be_primary();            /* wait to be told you are primary */
    /* Processing logic: generate next number, checkpoint, and reply */
    while (TRUE)                     /* do forever */
    {   begin_work();                /* begin a transaction */
        while (!get_msg(msg));       /* get next request for a ticket */
        exec sql update ticket       /* increment the next ticket number */
            set ticketno = ticketno + 1;
        exec sql select max(ticketno)    /* fetch current ticket number */
            into :ticketno           /* into program local variable */
            from ticket;             /* from SQL database */
        commit_work();               /* commit transaction */
22 Messages Fault Model
- Each process has a queue of incoming messages.
- Messages can be
  - corrupted: checksum detects it.
  - duplicated: sequence number detects it.
  - delayed arbitrarily long (ack + retransmit).
  - lost (ack + retransmit + sequence number).
- Techniques here give messages fail-fast semantics.
23 Message Verbs: SEND
/* send a message to a process: returns true if the process exists */
Boolean message_send(processid him, avalue value)
{   amessagep it;                        /* pointer to message created by this call */
    amessagep queue;                     /* pointer to process message queue */
    if (him >= MAXPROCESS) return FALSE; /* test for valid process */
loop: it = malloc(sizeof(amessage));     /* allocate space to hold message */
    it->status = TRUE; it->next = NULL;  /* and fill in the fields */
    copy(it->value,value,VSIZE);         /* copy msg data to message body */
    queue = process[him].messages;       /* look at process message queue */
    if (queue == NULL) process[him].messages = it;   /* if empty, place this message at queue head */
    else
    {   while (queue->next != NULL) queue = queue->next;   /* else place */
        queue->next = it; };             /* the message at queue end. */
    if (randf() < pmf) it->status = FALSE;   /* sometimes message corrupted */
    if (randf() < pmd) goto loop;        /* sometimes the message duplicated */
    return TRUE;
};
24 Message Verbs: GET
/* get the next input message of this process: returns true if a message */
Boolean message_get(avalue *valuep, Boolean *msg_status)
{   processid me = MyPID();              /* caller's process number */
    amessagep it;                        /* pointer to input message */
    it = process[me].messages;           /* find caller's message input queue */
    if (it == NULL) return FALSE;        /* return false if queue is empty */
    process[me].messages = it->next;     /* take first message off the queue */
    *msg_status = it->status;            /* record its status */
    copy(valuep,it->value,VSIZE);        /* value = it->value */
    free(it);                            /* deallocate its space */
    return TRUE;                         /* return status to caller */
};
25 Sessions Make Messages FailFast
[Figure: a session in which message 7 is sent and acknowledged with "Ack 7".]
- CRC makes a corrupt message look like a lost message.
- Sequence numbers detect duplicates => lost message.
- So, only failure is lost message.
- Timeout/retransmit masks lost messages. => Only failure is delay.
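The receiving side of such a session can be sketched in a few lines. The example assumes a toy multiplicative checksum in place of a real CRC and a receiver that accepts only the next expected sequence number; all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned seq;                /* session sequence number */
    unsigned checksum;           /* filled in by the sender */
    char     body[16];
} amessage;

/* Toy checksum over the sequence number and body (stand-in for a CRC). */
unsigned checksum_of(const amessage *m)
{
    unsigned sum = m->seq;
    for (size_t i = 0; i < sizeof m->body; i++)
        sum = sum * 31u + (unsigned char)m->body[i];
    return sum;
}

typedef struct {
    unsigned next_expected;      /* next sequence number to accept */
} session;

/* Returns true if the message should be delivered and acknowledged. */
bool session_accept(session *s, const amessage *m)
{
    assert(s != NULL && m != NULL);
    if (checksum_of(m) != m->checksum) return false;  /* corrupt: treat as lost */
    if (m->seq != s->next_expected) return false;     /* duplicate or out of order */
    s->next_expected++;
    return true;
}
```

A rejected message is simply never acknowledged, so the sender's timeout and retransmit logic eventually resends it and the only fault left visible to the application is delay. A fuller receiver would also re-acknowledge duplicates so the sender stops retransmitting.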
26 Sessions Plus Process Pairs Give Highly Available Messages
[Figure: two sessions, each sending message 7 and receiving "ack 7".]
- Checkpoint messages and sequence numbers to backup.
- Backup resumes session if primary fails.
- Backup broadcasts new identity at takeover (see book for code).
27 Highly Available Message Verbs
[Figure: application programs call reliable_send_msg() on an output message session and reliable_get_msg() on an input message session; the Listener process sits between the sessions and the application.]
- Hide under reliable get/send msg:
  - sequence number
  - ack retransmit logic
  - checkpoint
  - process pair takeover
  - resend of most recent reply.
- Uses a Listener process (thread) to do all this async work.
28 Summary
- Went from faulty storage, processes, messages to fault tolerant versions of each.
- Simple fault model explains many techniques used (and mis-used) in FT systems.