FT 101. Jim Gray, Microsoft Research. http://research.microsoft.com/~gray/Talks/ (80% of the slides are hidden; view with PPT to see them all.)



1
FT 101
Jim Gray, Microsoft Research
http://research.microsoft.com/~gray/Talks/
(80% of slides are not shown (are hidden), so view with PPT to see them all)
Outline
  • Terminology and empirical measures
  • General methods to mask faults.
  • Software-fault tolerance
  • Summary

2
Dependability: The 3 ITIES
  • Reliability / Integrity: does the right thing.
    (Also large MTTF)
  • Availability: does it now. (Also small MTTR.)
    Availability = MTTF / (MTTF + MTTR)
    System Availability: if 90% of terminals up & 99% of DB
    up? (=> 89% of transactions are serviced on time).
  • Holistic vs. Reductionist view

[Diagram: Security, Integrity, Reliability, Availability]
3
High Availability System Classes. Goal: Build Class 6 Systems

  Availability:  90%  99%  99.9%  99.99%  99.999%  99.9999%  99.99999%
  Class:          1    2     3      4       5        6         7

UnAvailability = MTTR / MTBF
can cut it in half by cutting MTTR or MTBF
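The class numbers above are just the count of leading nines. A minimal sketch of the arithmetic (the MTTF/MTTR numbers below are illustrative, not from the talk):

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def unavailability_minutes_per_year(avail):
    # Unavailability expressed as expected downtime per year.
    return (1 - avail) * 365 * 24 * 60

a = availability(4000, 4)   # illustrative: 4000-hr MTTF, 4-hr MTTR
print(round(a, 6))          # 0.999001 -> roughly "three nines" (Class 3)
print(round(unavailability_minutes_per_year(a)))  # 525 minutes/year
```

Halving MTTR (or doubling MTBF) halves the downtime, which is the point of the slide.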
4
Demo: looking at some nodes
  • Look at http://uptime.netcraft.com/
  • Internet Node availability: 92% mean, 97% median
    Darrell Long (UCSC) ftp://ftp.cse.ucsc.edu/pub/tr/
  • ucsc-crl-90-46.ps.Z "A Study of the Reliability
    of Internet Sites"
  • ucsc-crl-91-06.ps.Z "Estimating the Reliability
    of Hosts Using the Internet"
  • ucsc-crl-93-40.ps.Z "A Study of the Reliability
    of Hosts on the Internet"
  • ucsc-crl-95-16.ps.Z "A Longitudinal Survey of
    Internet Host Reliability"

5
Sources of Failures
                          MTTF          MTTR
  • Power Failure:        2,000 hr      1 hr
  • Phone Lines:
      Soft                >0.1 hr       0.1 hr
      Hard                4,000 hr      10 hr
  • Hardware Modules:     100,000 hr    10 hr (many are transient)
  • Software:
      1 Bug / 1,000 Lines Of Code (after vendor-user testing)
      => Thousands of bugs in System!
      Most software failures are transient: dump & restart system.
  • Useful fact: 8,760 hrs/year ~ 10k hr/year

6
Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe)

[Pie chart of outage causes: Vendor 42%, Tele Comm lines 12%,
 Operations 11.2%, Environment 25%, Application Software 9.3%]

  MTTF by cause:
  • Vendor (hardware and software)   5 Months
  • Application software             9 Months
  • Communications lines             1.5 Years
  • Operations                       2 Years
  • Environment                      2 Years
  • Overall                          10 Weeks
  • 1,383 institutions reported (6/84 - 7/85)
  • 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
  • To Get 10-Year MTTF, Must Attack All These Areas

7
Case Studies - Tandem Trends: Reported MTTF by Component
                  1985   1987   1990
  SOFTWARE           2     53     33   Years
  HARDWARE          29     91    310   Years
  MAINTENANCE       45    162    409   Years
  OPERATIONS        99    171    136   Years
  ENVIRONMENT      142    214    346   Years
  SYSTEM             8     20     21   Years
  Problem: Systematic Under-reporting

8
Many Software Faults are Soft
  • After Design Review
  • Code Inspection
  • Alpha Test
  • Beta Test
  • 10k Hrs Of Gamma Test (Production)
  • Most Software Faults Are Transient:
      MVS Functional Recovery Routines    5:1
      Tandem Spooler                    100:1
      Adams                            >100:1
  • Terminology:
      Heisenbug: Works On Retry
      Bohrbug: Faults Again On Retry
  • Adams: "Optimizing Preventative Service of
    Software Products", IBM J R&D, 28.1, 1984
  • Gray: "Why Do Computers Stop", Tandem TR 85.7, 1985
  • Mourad: "The Reliability of the IBM/XA Operating
    System", 15th ISFTCS, 1985.

9
Summary of FT Studies
  • Current Situation: ~4-year MTTF => Fault Tolerance Works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations:
      New Software.
      Utilities.
  • Must make all software ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable Goal: 100-year MTTF.
    class 4 today => class 6 tomorrow.

10
Fault Tolerance vs Disaster Tolerance
  • Fault-Tolerance: masks local faults
      RAID disks
      Uninterruptible Power Supplies
      Cluster Failover
  • Disaster Tolerance: masks site failures
      Protects against fire, flood, sabotage, ...
      Redundant system and service at remote site.
      Use design diversity

11
Outline
  • Terminology and empirical measures
  • General methods to mask faults.
  • Software-fault tolerance
  • Summary

12
Fault Model
  • Failures are independent.
    So, single-fault tolerance is a big win.
  • Hardware fails fast (blue-screen)
  • Software fails fast (or goes to sleep)
  • Software often repaired by reboot:
      Heisenbugs
  • Operations tasks: major source of outage
      Utility operations
      Software upgrades

13
Fault Tolerance Techniques
  • Fail-fast modules: work or stop.
  • Spare modules: instant repair time.
  • Independent modules fail by design:
    MTTF_pair ~ MTTF² / MTTR (so want tiny MTTR)
  • Message-based OS: Fault Isolation; software has
    no shared memory.
  • Session-oriented comm: Reliable messages detect
    lost/duplicate messages; coordinate messages
    with commit.
  • Process pairs: Mask Hardware & Software Faults.
  • Transactions: give A.C.I.D. (simple fault model).
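The pair-MTTF approximation above can be checked with a one-liner; the numbers below are illustrative, not from the talk:

```python
def mttf_pair(mttf, mttr):
    """The slide's approximation MTTF_pair ~ MTTF^2 / MTTR:
    fast repair makes a pair enormously more reliable than one module."""
    return mttf ** 2 / mttr

# Illustrative: 10,000-hr modules with 10-hr repair.
print(mttf_pair(10_000, 10))  # 10000000.0 hours: 1000x one module's MTTF
```

This is why the slide stresses a tiny MTTR: the pair's MTTF scales inversely with repair time.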

14
Example: the FT Bank
  • Modularity & Repair are KEY.
  • von Neumann needed 20,000x redundancy in
    wires and switches.
  • We use 2x redundancy.
  • Redundant hardware can support peak loads (so
    not redundant).

15
Fail-Fast is Good, Repair is Needed
Lifecycle of a module: fail-fast gives short
fault latency. High Availability is
low UN-Availability: Unavailability ~ MTTR / MTTF
  • Improving either MTTR or MTTF gives benefit.
  • Simple redundancy does not help much.

16
Hardware Reliability/Availability (how to make
it fail fast)
  • Comparator Strategies:
      Duplex Fail-Fast: fail if either fails (e.g.
      duplexed cpus)
      vs Fail-Soft: fail if both fail (e.g. disc,
      atm, ...)
  • Note: in recursive pairs, the parent knows which is
    bad.
  • Triplex Fail-Fast: fail if 2 fail (triplexed cpus)
  • Fail-Soft: fail if 3 fail (triplexed Fail-Fast cpus)

17
Redundant Designs have Worse MTTF!
The Airplane Rule: A two-engine airplane has
twice as many engine problems as a one-engine
plane.
  • THIS IS NOT GOOD: Variance is lower but MTTF is
    worse.
  • Simple redundancy does not improve MTTF
    (sometimes hurts).
  • This is just an example of the airplane rule.
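The airplane rule falls straight out of adding failure rates. A sketch, assuming independent exponential engine failures (illustrative MTTF):

```python
def mttf_first_failure(mttf_single, n_modules):
    """With independent exponential failures, rates add:
    expected time until ANY of n modules fails is MTTF/n."""
    return mttf_single / n_modules

print(mttf_first_failure(2000.0, 1))  # 2000.0 hr with one engine
print(mttf_first_failure(2000.0, 2))  # 1000.0 hr: twice the engine problems
```

Without repair, the redundant design reaches its first fault sooner; repair is what converts redundancy into a win.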

18
Add Repair: Get 10⁴ Improvement
19
When To Repair?
  • Chances Of Tolerating A Fault are 1000:1 (class 3)
  • A 1995 study: Processor & Disc Rated At 10k hr MTTF
      Computed Single Fails   Observed Double Fails   Ratio
      10k Processor Fails     14 Double               1000:1
      40k Disc Fails          26 Double               1000:1
  • Hardware Maintenance:
      On-Line Maintenance "Works" 999 Times Out Of 1000.
      The chance a duplexed disc will fail during
      maintenance? 1:1000
      Risk Is 30x Higher During Maintenance
      => Do It Off Peak Hour
  • Software Maintenance:
      Repair Only Virulent Bugs
      Wait For Next Release To Fix Benign Bugs

20
OK: So Far
  • Hardware fail-fast is easy.
  • Redundancy plus Repair is great (Class 7
    availability).
  • Hardware redundancy & repair is via modules.
  • How can we get instant software repair?
  • We Know How To Get Reliable Storage:
      RAID Or Dumps And Transaction Logs.
  • We Know How To Get Available Storage:
      Fail-Soft Duplexed Discs (RAID 1...N).
  • ? How do we get reliable execution?
  • ? How do we get available execution?

21
Outline
  • Terminology and empirical measures
  • General methods to mask faults.
  • Software-fault tolerance
  • Summary

22
Key Idea
  • Architecture masks Hardware Faults.
  • Software masks Environmental Faults.
  • Distribution masks Maintenance.
  • Software automates / eliminates operators.
  • So, in the limit there are only software design
    faults. Software-fault tolerance is the key to
    dependability.
    INVENT IT!

23
Software Techniques: Learning from Hardware
  • Recall that most outages are not hardware.
  • Most outages in Fault-Tolerant Systems are
    SOFTWARE.
  • Fault Avoidance Techniques: Good & Correct design.
  • After that, Software Fault Tolerance Techniques:
      Modularity (isolation, fault containment)
      Design diversity
      N-Version Programming: N different implementations
      Defensive Programming: Check parameters and data
      Auditors: Check data structures in background
      Transactions: clean up state after a failure
  • Paradox: Need Fail-Fast Software
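Two of the techniques above fit in a few lines. A hypothetical sketch (the function names and account model are mine, not from the talk):

```python
def debit(balance, amount):
    # Defensive programming: check parameters and fail fast,
    # rather than silently corrupting state.
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

def audit(accounts):
    # Auditor: background check of a data-structure invariant
    # (here: no account ever goes negative).
    return all(bal >= 0 for bal in accounts.values())

print(debit(100, 30))            # 70
print(audit({"a": 70, "b": 0}))  # True
```

Note the paradox the slide names: the defensive checks make the module *more* likely to stop, and that fail-fast behavior is exactly what higher-level recovery needs.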

24
Fail-Fast and High-Availability Execution
  • Software N-Plexing: Design Diversity
      N-Version Programming:
      Write the same program N times (N >= 3)
      Compare outputs of all programs and take
      majority vote
  • Process Pairs: Instant restart (repair)
      Use Defensive programming to make a process
      fail-fast
      Have restarted process ready in separate
      environment
      Second process takes over if primary faults
      Transaction mechanism can clean up distributed
      state if takeover in middle of computation.

25
What Is the MTTF of an N-Version Program?
  • First fails after MTTF/N
  • Second fails after MTTF/(N-1), ...
  • so MTTF_total = MTTF x (1/N + 1/(N-1) + ... + 1/2)
  • harmonic series goes to infinity, but VERY slowly
  • for example, 100-version programming gives
    ~4x the MTTF of 1-version programming
  • Reduces variance
  • N-Version Programming Needs REPAIR:
      If a program fails, must reset its state from
      other programs.
      => programs have common data/state
      representation.
  • How does this work for Database Systems?
    Operating Systems? Network Systems?
  • Answer: I don't know.
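The harmonic estimate above is easy to check numerically:

```python
def n_version_mttf_factor(n):
    """Time until all but one of n versions have failed,
    in units of a single version's MTTF: 1/n + ... + 1/2."""
    return sum(1.0 / k for k in range(2, n + 1))

print(round(n_version_mttf_factor(100), 2))  # 4.19: 100 versions buy only ~4x
```

So even a hundred-fold development cost buys only about a fourfold MTTF improvement, which is the slide's point.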

26
Why Process Pairs Mask Faults:
Many Software Faults are Soft
  • After Design Review
  • Code Inspection
  • Alpha Test
  • Beta Test
  • 10k Hrs Of Gamma Test (Production)
  • Most Software Faults Are Transient:
      MVS Functional Recovery Routines    5:1
      Tandem Spooler                    100:1
      Adams                            >100:1
  • Terminology:
      Heisenbug: Works On Retry
      Bohrbug: Faults Again On Retry
  • Adams: "Optimizing Preventative Service of
    Software Products", IBM J R&D, 28.1, 1984
  • Gray: "Why Do Computers Stop", Tandem TR 85.7, 1985
  • Mourad: "The Reliability of the IBM/XA Operating
    System", 15th ISFTCS, 1985.

27
Heisenbugs: A Probabilistic Approach to
Availability
  • There is considerable evidence that (1)
    production systems have about one bug per
    thousand lines of code; (2) these bugs manifest
    themselves stochastically: failures are due
    to a confluence of rare events; (3) system
    mean-time-to-failure has a lower bound of a
    decade or so. To make highly available
    systems, architects must tolerate these failures
    by providing instant repair (un-availability is
    approximated by repair_time/time_to_fail, so
    cutting the repair time in half makes things
    twice as good). Ultimately, one builds a set of
    standby servers which have both design diversity
    and geographic diversity. This minimizes
    common-mode failures.

28
Process Pair Repair Strategy
  • If the software fault (bug) is a Bohrbug, then there
    is no repair:
      wait for the next release, or
      get an emergency bug fix, or
      get a new vendor.
  • If the software fault is a Heisenbug, then repair
    is:
      reboot and retry, or
      switch to backup process (instant restart).
  • PROCESS PAIRS Tolerate Hardware Faults &
    Heisenbugs
  • Repair time is seconds, could be milliseconds if
    time is critical
  • Flavors Of Process Pair:
      Lockstep
      Automatic
      State Checkpointing
      Delta Checkpointing
      Persistent
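Why takeover-and-retry repairs Heisenbugs can be seen in a toy model. This is a sketch, not Tandem's implementation: the "backup" here is just a second attempt at the same request, and the fault probability is invented for illustration:

```python
import random

def flaky_request(fail_prob, rng):
    # A Heisenbug: the fault is transient, triggered by rare timing.
    if rng.random() < fail_prob:
        raise RuntimeError("Heisenbug: transient fault")
    return "ok"

def process_pair(fail_prob, rng):
    try:
        return flaky_request(fail_prob, rng)  # primary attempt
    except RuntimeError:
        return flaky_request(fail_prob, rng)  # backup takes over, retries

rng = random.Random(0)
ok = 0
for _ in range(10_000):
    try:
        process_pair(0.1, rng)
        ok += 1
    except RuntimeError:
        pass  # double fault: even the backup hit the bug

print(ok)  # ~9900: one takeover cuts a 10% fault rate to ~1%
```

A Bohrbug would fail on the retry too, which is why the slide says there is no repair for it.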

29
How Takeover Masks Failures
  • Server Resets At Takeover. But What About:
      Application State?
      Database State?
      Network State?
  • Answer: Use Transactions To Reset State!
  • Abort Transaction If Process Fails.
  • Keeps Network "Up"
  • Keeps System "Up"
  • Reprocesses Some Transactions On Failure

30
PROCESS PAIRS - SUMMARY
  • Transactions Give Reliability
  • Process Pairs Give Availability
  • Process Pairs Are Expensive & Hard To Program
  • Transactions + Persistent Process Pairs
    => Fault-Tolerant Sessions & Execution
  • When Tandem Converted To This Style:
      Saved 3x Messages
      Saved 5x Message Bytes
      Made Programming Easier

31
SYSTEM PAIRS FOR HIGH AVAILABILITY

[Diagram: Primary <-> Backup]

  • Programs, Data, Processes Replicated at two
    sites.
  • Pair looks like a single system.
  • System becomes a logical concept.
  • Like Process Pairs: System Pairs.
  • Backup receives transaction log (spooled if
    backup down).
  • If primary fails or operator switches, backup
    offers service.

32
SYSTEM PAIR CONFIGURATION OPTIONS
  • Mutual Backup:
      each has 1/2 of Database & Application
  • Hub:
      One site acts as backup for many others
  • In general, can be any directed graph
  • Stale replicas: Lazy replication

[Diagrams: mutual-backup pair (Primary/Backup each way);
 hub site backing up several primaries; chain of copies]
33
SYSTEM PAIRS FOR SOFTWARE MAINTENANCE
  • Step 1: Both systems are running V1.
  • Step 2: Backup is cold-loaded as V2.
  • Step 3: SWITCH to Backup.
  • Step 4: The new Backup (old Primary) is cold-loaded as V2.
  • Similar ideas apply to
  • Database Reorganization
  • Hardware modification (e.g. add discs,
    processors,...)
  • Hardware maintenance
  • Environmental changes (rewire, new air
    conditioning)
  • Move primary or backup to new location.

34
SYSTEM PAIR BENEFITS
  • Protects against ENVIRONMENT:
      weather
      utilities
      sabotage
  • Protects against OPERATOR FAILURE:
      two sites, two sets of operators
  • Protects against MAINTENANCE OUTAGES:
      work on backup
      software/hardware install/upgrade/move...
  • Protects against HARDWARE FAILURES:
      backup takes over
  • Protects against TRANSIENT SOFTWARE ERRORS
  • Allows design diversity
    (different sites have different software/hardware)

35
Key Idea
  • Architecture masks Hardware Faults.
  • Software masks Environmental Faults.
  • Distribution masks Maintenance.
  • Software automates / eliminates operators.
  • So, in the limit there are only software design
    faults. Many are Heisenbugs.
    Software-fault tolerance is the key to
    dependability.
    INVENT IT!

36
References
  • Adams, E. (1984). "Optimizing Preventative
    Service of Software Products." IBM Journal of
    Research and Development. 28(1): 2-14.
  • Anderson, T. and B. Randell. (1979). Computing
    Systems Reliability.
  • Garcia-Molina, H. and C. A. Polyzois. (1990).
    "Issues in Disaster Recovery." 35th IEEE Compcon
    90. 573-577.
  • Gray, J. (1986). "Why Do Computers Stop and What
    Can We Do About It." 5th Symposium on Reliability
    in Distributed Software and Database Systems.
    3-12.
  • Gray, J. (1990). "A Census of Tandem System
    Availability between 1985 and 1990." IEEE
    Transactions on Reliability. 39(4): 409-418.
  • Gray, J. N., Reuter, A. (1993). Transaction
    Processing: Concepts and Techniques. San Mateo,
    CA: Morgan Kaufmann.
  • Lampson, B. W. (1981). "Atomic Transactions."
    Distributed Systems -- Architecture and
    Implementation: An Advanced Course. ACM,
    Springer-Verlag.
  • Laprie, J. C. (1985). "Dependable Computing and
    Fault Tolerance: Concepts and Terminology." 15th
    FTCS. 2-11.
  • Long, D. D., J. L. Carroll, and C. J. Park (1991).
    "A study of the reliability of Internet sites."
    Proc 10th Symposium on Reliable Distributed
    Systems, pp. 177-186, Pisa, September 1991.
  • Long, D., A. Muir, and R. Golding (1995). "A
    Longitudinal Study of Internet Host Reliability."
    Proceedings of the Symposium on Reliable
    Distributed Systems, Bad Neuenahr, Germany: IEEE,
    September 1995, pp. 2-9.

38
Scaleable Replicated Databases
  • Jim Gray (Microsoft)
  • Pat Helland (Microsoft)
  • Dennis Shasha (Columbia)
  • Pat O'Neil (U. Mass)

39
Outline
  • Replication strategies:
      Lazy and Eager
      Master and Group
  • How centralized databases scale:
      deadlocks rise non-linearly with
      transaction size and concurrency
  • Replication systems are unstable on scaleup
  • A possible solution

40
Scaleup, Replication, Partition
  • N² more work

41
Why Replicate Databases?
  • Give users a local copy for
  • Performance
  • Availability
  • Mobility (they are disconnected)
  • But... What if they update it?
  • Must propagate updates to other copies

42
Propagation Strategies
  • Eager: Send update right away
      (part of same transaction)
      N times larger transactions
  • Lazy: Send update asynchronously
      separate transaction
      N times more transactions
  • Either way:
      N times more updates per second per node
      N² times more work overall
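The N² claim is simple counting. A back-of-envelope sketch (the numbers are illustrative): with N nodes each running TPS transactions of `Actions` updates, every update must eventually be applied at every node, whether it travels eagerly or lazily:

```python
def total_update_work(tps_per_node, actions, nodes):
    # Updates generated per second across the whole system...
    updates_generated = tps_per_node * actions * nodes
    # ...each of which must be applied at all N nodes.
    return updates_generated * nodes

print(total_update_work(100, 10, 1))  # 1000
print(total_update_work(100, 10, 5))  # 25000: 5x nodes -> 25x total work
```

Eager replication pays this as N-times-larger transactions; lazy pays it as N-times-more transactions.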

43
Update Control Strategies
  • Master:
      Each object has a master node
      All updates start with the master
      Broadcast to the subscribers
  • Group:
      Object can be updated by anyone
      Update broadcast to all others
  • Everyone wants Lazy Group:
      update anywhere, anytime, anyway

44
Quiz Questions: Name One
  • Eager:
      Master: N-Plexed disks
      Group: ?
  • Lazy:
      Master: Bibles, Bank accounts, SQL Server
      Group: Name servers, Oracle, Access...
  • Note: Lazy contradicts Serializable.
    If two lazy updates collide, then ... reconcile:
      discard one transaction (or use some other rule)
      Ask for human advice
  • Meanwhile, nodes disagree =>
    Network & DB state diverges: System Delusion

45
Anecdotal Evidence
  • Update-Anywhere systems are attractive:
      Products offer the feature
      It demos well
  • But when it scales up:
      Reconciliations start to cascade
      Database drifts out of sync (System Delusion)
  • What's going on?

46
Outline
  • Replication strategies
  • Lazy and Eager
  • Master and Group
  • How centralized databases scale
  • deadlocks rise non-linearly
  • Replication is unstable on scaleup
  • A possible solution

47
Simple Model of Waits
  • DB_size records; TPS transactions per second
  • Each transaction:
      Picks Actions records uniformly from the set of
      DB_size records
      Then commits
  • About (Transactions x Actions) / 2 resources locked
  • Chance a request waits is
    (Transactions x Actions) / (2 x DB_size)
  • Action rate is TPS x Actions
  • Active Transactions = TPS x Actions x Action_Time
  • Wait Rate = Action rate x Chance a request waits
    = TPS² x Actions³ x Action_Time / (2 x DB_size)
  • 10x more transactions => 100x more waits
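The quadratic-in-TPS behavior of the wait-rate formula can be checked numerically (the parameter values are illustrative):

```python
def wait_rate(tps, actions, action_time, db_size):
    # Wait Rate = TPS^2 x Actions^3 x Action_Time / (2 x DB_size)
    return tps**2 * actions**3 * action_time / (2 * db_size)

base = wait_rate(100, 10, 0.01, 1_000_000)
print(wait_rate(1000, 10, 0.01, 1_000_000) / base)  # 100.0: 10x TPS -> 100x waits
```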
48
Simple Model of Deadlocks
  • A deadlock is a wait cycle.
  • Cycle of length 2:
    Deadlock rate = Wait rate x Chance Waitee waits for waiter
    = Wait rate x (P(wait) / Transactions)
    = [TPS² x Actions³ x Action_Time / (2 x DB_size)]
      x [Actions² / (2 x DB_size)]
    = TPS² x Actions⁵ x Action_Time / (4 x DB_size²)
  • Cycles of length 3 are ~P(wait)³, so ignored.
  • 10x bigger transactions => 100,000x more deadlocks
49
Summary So Far
  • Even centralized systems are unstable:
  • Waits:
      Square of concurrency
      3rd power of transaction size
  • Deadlock rate:
      Square of concurrency
      5th power of transaction size

[Graph: deadlock rate vs. transaction size and concurrency]
50
Outline
  • Replication strategies
  • How centralized databases scale
  • Replication is unstable on scaleup
  • Eager (master & group)
  • Lazy (master & group & disconnected)
  • A possible solution

51
Eager Transactions are FAT
  • If N nodes, an eager transaction is Nx bigger
  • Takes Nx longer
  • 10x nodes => 1,000x deadlocks
    (derivation in paper)
  • Master is slightly better than group
  • Good news:
      Eager transactions only deadlock;
      No need for reconciliation

52
Lazy Master & Group

[Diagram: lazy-master and lazy-group timelines: Write A,
 Write B, Write C, Commit, with a new timestamp taken and
 the writes propagated to replicas after commit]

  • Use optimistic concurrency control:
      Keep transaction timestamp with record
      Updates carry old + new timestamp
      If record has old timestamp:
        set value to new value
        set timestamp to new timestamp
      If record does not match old timestamp:
        reject lazy transaction
  • Not SNAPSHOT isolation (stale reads)
  • Reconciliation:
      Some nodes are updated
      Some nodes are being reconciled
53
Reconciliation
  • Reconciliation means System Delusion:
      Data inconsistent with itself and reality
  • How frequent is it?
  • Lazy transactions are not fat,
    but there are N times as many
  • Eager waits become Lazy reconciliations
  • Rate is
    TPS² x (Actions x Nodes)³ x Action_Time / (2 x DB_size)
  • Assuming everyone is connected
54
Eager & Lazy: Disconnected
  • Suppose mobile nodes are disconnected for a day
  • When they reconnect:
      get all incoming updates
      send all delayed updates
  • Incoming is Nodes x TPS x Actions x Disconnect_Time
  • Outgoing is TPS x Actions x Disconnect_Time
  • Conflicts are the intersection of these two sets:
    ~ Disconnect_Time x (TPS x Actions x Nodes)² / DB_size
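The disconnect-conflict estimate can be plugged with numbers; the values below are illustrative, not from the talk:

```python
def reconnect_conflicts(disconnect_time, tps, actions, nodes, db_size):
    # Conflicts ~ Disconnect_Time x (TPS x Actions x Nodes)^2 / DB_size
    return disconnect_time * (tps * actions * nodes) ** 2 / db_size

# Illustrative: a 24-hour disconnect, 1 TPS of 10-action transactions,
# 10 nodes, a 1M-record database.
print(reconnect_conflicts(24, 1, 10, 10, 1_000_000))  # 0.24
```

Note the quadratic terms: doubling the node count or the transaction rate quadruples the expected conflicts at reconnect.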
55
Outline
  • Replication strategies (lazy & eager, master & group)
  • How centralized databases scale
  • Replication is unstable on scaleup
  • A possible solution: Two-tier architecture with
    Mobile & Base nodes
      Base nodes master objects
      Tentative transactions at mobile nodes
      Transactions must be commutative
      Re-apply transactions on reconnect
      Transactions may be rejected

56
Safe Approach
  • Each object mastered at a node
  • Update Transactions only read and write master
    items
  • Lazy replication to other nodes
  • Allow reads of stale data (on user request)
  • PROBLEMS:
      doesn't support mobile users
      deadlocks explode with scaleup
      ?? How do banks work ??

57
Two-Tier Replication
  • Two kinds of nodes:
      Base nodes: always connected, always up
      Mobile nodes: occasionally connected
  • Data mastered at base nodes
  • Mobile nodes:
      have stale copies
      make tentative updates

58
Mobile Node Makes Tentative Updates
  • Updates local database while disconnected
  • Saves transactions
  • When Mobile node reconnects: Tentative
    transactions are re-done as Eager-Master (at
    original time??)
  • Some may be rejected
    (this replaces reconciliation)
  • No System Delusion.

59
Tentative Transactions
  • Must be commutative with others:
      Debit $50, rather than Change $150 to $100.
  • Must have acceptance criteria:
      Account balance is positive
      Ship date no later than quoted
      Price is no greater than quoted

[Diagram: mobile node sends Tentative Xacts to the base
 node; the base returns Updates & Rejects; transactions
 from others merge with tentative transactions at the
 local DB]
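The commutative-debit-with-acceptance-criterion idea fits in a few lines. A hypothetical sketch (function name and data shapes are mine, not from the talk):

```python
def reapply_tentative(balance, debits):
    """Re-run tentative debits at reconnect; reject any debit
    that would violate the acceptance criterion (balance must
    stay non-negative). Rejected, not reconciled."""
    accepted, rejected = [], []
    for amount in debits:
        if balance - amount >= 0:   # acceptance criterion
            balance -= amount
            accepted.append(amount)
        else:
            rejected.append(amount)
    return balance, accepted, rejected

print(reapply_tentative(100, [50, 40, 30]))  # (10, [50, 40], [30])
```

Because "Debit 50" commutes with other debits, the base node can interleave tentative transactions from many mobile nodes in any order and still get a consistent, delusion-free state.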
60
Refinement Mobile Node Can Master Some Data
  • Mobile node can master private data
  • Only mobile node updates this data
  • Others only read that data
  • Examples
  • Orders generated by salesman
  • Mail generated by user
  • Documents generated by Notes user.

61
Virtue of the 2-Tier Approach
  • Allows mobile operation
  • No system delusion
  • Rejects detected at reconnect (know right away)
  • If commutativity works:
      No reconciliations
  • Even though work rises as (Mobile + Base)²

62
Outline
  • Replication strategies (lazy & eager, master & group)
  • How centralized databases scale
  • Replication is unstable on scaleup
  • A possible solution (two-tier architecture):
      Tentative transactions at mobile nodes
      Re-apply transactions on reconnect
      Transactions may be rejected & reconciled
  • Avoids system delusion