Title: CS 603 Failure Recovery
1. CS 603: Failure Recovery
2. Failure Recovery
- Assumption: the system is designed for normal operation
- Failure is an exception
- How do we handle the exception?
- Must maintain correctness
- Can compromise performance
- Fault models provide mechanisms to describe failure and recovery
- But how do we implement them?
3. Site Failure
- Problem: complete failure at a single site
- Must have multiple sites
- Thus a distributed problem
- Two examples
- Distributed storage: Palladio
- Think wide-area RAID
- Distributed transactions: the Epoch algorithm
4. Recovery Example: Palladio Storage System
- Work in HP Labs Storage Systems
- Richard Golding
- Elizabeth Borowsky (now at Boston College)
- Some slides taken from their talks
- Goals
- Disaster-resistant storage
- Must store at multiple (widely distributed) sites
- High availability
- Can't wait for restoration after a disaster
- High performance
- Use the replication productively under normal operation
5. Introduction
- Palladio is a solution for detecting, handling, and recovering from both small- and large-scale failures in a distributed storage system.
- Palladio provides virtualized data storage services to applications via a set of virtual stores, structured as logical arrays of bytes into which applications can write and read data. The store's layout maps each byte in its address space to an address on one or more devices (see the layout sketch below).
- Palladio storage devices take an active role in the recovery of the stores they are part of. Managers keep track of the virtual stores in the system, coordinating changes to their layout and handling recovery from failure.
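The layout mapping can be pictured as a lookup from store byte addresses to device addresses. Below is a minimal sketch of such a map, assuming a store built from contiguous, replicated extents; the Extent and StoreLayout names and the resolve method are illustrative stand-ins, not Palladio's actual data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extent:
    """One contiguous piece of a store's address space mapped onto devices."""
    start: int          # first byte of the extent in the store's address space
    length: int         # number of bytes covered
    devices: tuple      # device ids holding replicas of this extent
    device_offset: int  # starting byte address on each device

class StoreLayout:
    """Illustrative layout map: store byte address -> (device, device address)."""
    def __init__(self, extents):
        self.extents = sorted(extents, key=lambda e: e.start)

    def resolve(self, byte_addr):
        """Return the (device_id, device_address) pairs holding one store byte."""
        for e in self.extents:
            if e.start <= byte_addr < e.start + e.length:
                off = byte_addr - e.start
                return [(dev, e.device_offset + off) for dev in e.devices]
        raise KeyError(f"address {byte_addr} not mapped in this layout")

# Example: a 2 MB store, each 1 MB extent replicated on two devices.
layout = StoreLayout([
    Extent(0,       1 << 20, ("devA", "devB"), 0),
    Extent(1 << 20, 1 << 20, ("devC", "devD"), 0),
])
print(layout.resolve(5))              # replicas of byte 5 -> devA and devB
print(layout.resolve((1 << 20) + 7))  # byte in second extent -> devC and devD
```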
6. Palladio Overview
- Provide robust read and write access to data in virtual stores.
- Atomic and serialized read and write access.
- Detect and recover from failure.
- Accommodate layout changes.
- Entities: hosts, stores, managers, management policies
- Protocols: layout retrieval protocol, data access protocol, reconciliation protocol, layout control protocol
7. Protocols
- Access protocol: allows hosts to read and write data on a storage device as long as there are no failures or layout changes for the virtual store. It must provide serialized, atomic writes that can span multiple devices.
- Layout retrieval protocol: allows hosts to obtain the current layout of a virtual store, i.e., the mapping from the virtual store's address space onto the devices that store parts of it.
- Reconciliation protocol: runs between pairs of devices to bring them back to consistency after a failure.
- Layout control protocol: runs between managers and devices; maintains consensus about the layout and failure status of the devices, and in doing so coordinates the other three protocols.
8. Layout Control Protocol
- The layout control protocol tries to maintain agreement between a store's manager and the storage devices that hold the store on:
- The layout of data onto storage devices
- The identity of the store's active manager
- The notion of epochs
- The layout and manager are fixed during each epoch
- Epochs are numbered
- Epoch transitions
- Device leases: acquisition and renewal
- Device leases are used to detect possible failure (see the lease sketch below)
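A minimal sketch of lease-based failure detection, assuming the manager grants fixed-length leases that devices must renew and treats an expired lease as a possible failure. The LeaseTable class, the lease length, and the injected clock are illustrative choices, not Palladio's protocol.

```python
import time

LEASE_SECONDS = 5.0  # illustrative lease length

class LeaseTable:
    """Manager-side view of device leases for one virtual store."""
    def __init__(self, devices, clock=time.monotonic):
        self.clock = clock
        # device id -> lease expiry time
        self.expiry = {d: clock() + LEASE_SECONDS for d in devices}

    def renew(self, device):
        """Called when a device reports in and renews its lease."""
        self.expiry[device] = self.clock() + LEASE_SECONDS

    def failed_devices(self):
        """Devices whose lease has lapsed are considered (possibly) failed."""
        now = self.clock()
        return [d for d, exp in self.expiry.items() if exp < now]

# Example: devB never renews, so after the lease length it is suspected failed.
t = [0.0]
leases = LeaseTable(["devA", "devB"], clock=lambda: t[0])
t[0] = 3.0
leases.renew("devA")
t[0] = 6.0
print(leases.failed_devices())  # -> ['devB']
```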
9. Operation during an epoch
- The manager has quorum and coverage of the devices.
- Periodic lease renewal
- If a device fails to report and renew its lease, the manager considers it failed
- If the manager fails to renew the lease, the device considers the manager failed and starts a manager recovery sequence
- When the manager loses quorum or coverage, the epoch ends and a state of epoch transition is entered (see the quorum/coverage sketch below)
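A minimal sketch of the two conditions, assuming quorum means a majority of the store's devices hold valid leases and coverage means the leased devices together hold a replica of every extent of the store; both readings are this note's interpretation of the slide, and the function names are illustrative.

```python
def has_quorum(live, all_devices):
    """Quorum: a majority of the store's devices still hold valid leases."""
    return len(set(live) & set(all_devices)) > len(all_devices) // 2

def has_coverage(live, extents):
    """Coverage: every extent of the store is reachable on some live replica.

    `extents` maps an extent id to the set of devices holding a replica of it.
    """
    live = set(live)
    return all(replicas & live for replicas in extents.values())

# Example: devB has failed; quorum still holds, and coverage holds only if
# every extent keeps at least one live replica.
all_devices = {"devA", "devB", "devC"}
extents = {"e0": {"devA", "devB"}, "e1": {"devB", "devC"}}
live = {"devA", "devC"}
print(has_quorum(live, all_devices))    # True  (2 of 3)
print(has_coverage(live, extents))      # True  (e0 via devA, e1 via devC)
print(has_coverage({"devA"}, extents))  # False (e1 has no live replica)
```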
10. Epoch transition
- Transaction initiation
- Reconciliation
- Transaction commitment
- Garbage collection
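A minimal sketch that drives the four steps above in order, with stub devices that merely record the messages they receive. How each step is realized (what is prepared, which replica pairs reconcile, what gets garbage-collected) is an assumption made for illustration, not Palladio's actual transition protocol.

```python
class Device:
    """Stub storage device that records the protocol messages it receives."""
    def __init__(self, name):
        self.name, self.log = name, []

    def prepare_layout(self, epoch, layout):
        self.log.append(("prepare", epoch))

    def reconcile_with(self, other):
        self.log.append(("reconcile", other.name))

    def commit_layout(self, epoch):
        self.log.append(("commit", epoch))

    def discard_obsolete(self):
        self.log.append(("gc",))


def epoch_transition(current_epoch, devices, replica_pairs, new_layout=None):
    """Drive the four epoch-transition steps from the slide, in order."""
    epoch = current_epoch + 1
    for d in devices:                        # 1. transaction initiation:
        d.prepare_layout(epoch, new_layout)  #    propose new layout and epoch
    for a, b in replica_pairs:               # 2. reconciliation of replica pairs
        a.reconcile_with(b)
    for d in devices:                        # 3. transaction commitment
        d.commit_layout(epoch)
    for d in devices:                        # 4. garbage-collect state needed
        d.discard_obsolete()                 #    only by the old layout
    return epoch


devA, devB = Device("devA"), Device("devB")
print(epoch_transition(7, [devA, devB], [(devA, devB)]))  # -> 8
print(devA.log)  # [('prepare', 8), ('reconcile', 'devB'), ('commit', 8), ('gc',)]
```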
11. The recovery sequence
- Initiation: querying a recovery manager with the current layout and epoch number
12. The recovery sequence (continued)
- Contention: managers compete to obtain quorum and coverage and to become the active manager for the store (recovery leases, acks, and rejections; a sketch follows below)
13. The recovery sequence (continued)
- Completion: setting correct recovery leases and starting an epoch transition
- Failure: failure of devices and managers during recovery
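A minimal sketch of the contention step, assuming each candidate manager asks every device for a recovery lease, devices grant their lease to the first candidate that asks and reject later ones, and a candidate becomes active only with a majority (quorum) of grants. The coverage check is omitted for brevity, and the class and function names are illustrative, not Palladio's.

```python
class RecoveryDevice:
    """Stub device: grants its recovery lease to the first candidate that asks."""
    def __init__(self, name):
        self.name, self.holder = name, None

    def request_recovery_lease(self, candidate):
        if self.holder is None:
            self.holder = candidate
            return True      # ack
        return False         # rejection: another candidate holds the lease

def contend(candidate, devices):
    """Try to become the active manager: collect recovery leases, then check
    quorum (a majority of the store's devices) over the granted set."""
    granted = [d for d in devices if d.request_recovery_lease(candidate)]
    return len(granted) > len(devices) // 2, granted

devices = [RecoveryDevice(n) for n in ("devA", "devB", "devC")]
won_m1, _ = contend("manager-1", devices)
won_m2, _ = contend("manager-2", devices)   # arrives second, gets rejections
print(won_m1, won_m2)                       # -> True False
```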
14. Extensions
- Single manager vs. multiple managers
- Whole devices vs. device parts (chunks)
- Reintegrating devices
- Synchrony model (future)
- Failure suspectors (future)
15. Application example
16. Application example: benefits
- Popularity is hard to fake
- Could be applied recursively (?)
17. Conclusions: recap
- Palladio: a replication management system featuring
- Modular protocol design
- Active device participation
- Distributed management function
- Coverage and quorum condition
18. Transaction Systems that Handle Disaster
- Goal: safety of transactions
- Database consistent even if disaster strikes
- 2-safe backup: commit survives disaster
- Run two-phase commit between sites
- Introduces wide-area transmission latency into the commit
- 1-safe backup: may lose transactions
- Propagate results to the backup (see the sketch below)
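A minimal sketch of the contrast, assuming a 2-safe primary waits for the backup's acknowledgement before reporting commit, while a 1-safe primary commits immediately and ships the log record in the background. The Backup class and the thread-pool shipper are illustrative stand-ins, not the mechanism of any particular system.

```python
from concurrent.futures import ThreadPoolExecutor
import time

class Backup:
    """Stub backup site that applies shipped log records after a WAN delay."""
    def __init__(self, wan_delay=0.05):
        self.wan_delay, self.applied = wan_delay, []

    def ship(self, record):
        time.sleep(self.wan_delay)   # simulated wide-area round trip
        self.applied.append(record)
        return "ack"

def commit_2safe(txn, backup):
    """2-safe: do not report the commit until the backup acknowledges the
    record, so every reported commit survives a primary disaster."""
    assert backup.ship(("commit", txn)) == "ack"
    return f"{txn} committed 2-safe"

def commit_1safe(txn, backup, shipper):
    """1-safe: report the commit immediately and ship the record in the
    background; a disaster may lose the most recently committed transactions."""
    shipper.submit(backup.ship, ("commit", txn))
    return f"{txn} committed 1-safe"

backup = Backup()
print(commit_2safe("T1", backup))            # pays WAN latency before returning
with ThreadPoolExecutor(max_workers=1) as pool:
    print(commit_1safe("T2", backup, pool))  # returns before the record arrives
print(backup.applied)                        # both records eventually present
```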
19. Epoch Algorithm (Garcia-Molina, Polyzois, and Hagmann 1990)
- 1-Safe backup
- No performance penalty
- Multiple transaction streams
- Use distribution to improve performance
- Multiple Logs
- Avoid single bottleneck
20. Problem with Multiple Logs: Consistency
- Assume transactions may span sites
- Can't just send logs
- What if only part of a transaction is sent?
- Solution: commit protocol at the backup
- Expensive
- Commit in batches
Example logs at backup sites BPi, BPj, BPk (one column per site):

BPi        BPj        BPk
write T1   write T2   write T3
write T2   P(T2)
P(T1)      C(T2)
C(T1)      write T3
P(T2)      P(T3)      P(T3)
C(T2)      C(T3)      C(T3)
21. Correctness Criteria
- Atomicity: if any write of a transaction appears at the backup, all must appear
- If some W(Tx, d) is at the backup, then every write W(Tx, d') of Tx exists at the backup
- Consistency: if Ti → Tj at the primary, then
- Local: Tj installed at the backup ⇒ Ti installed at the backup
- Mutual: if W(Ti, d) and W(Tj, d) both appear, then W(Ti, d) precedes W(Tj, d)
- Minimum divergence: if Tj is at the backup and does not depend on a missing transaction, then it should be installed at the backup
22. Algorithm Overview
- Idea: transactions that can be committed together are grouped into epochs
- Primaries write a marker in the log
- Must agree on when it is safe to write the marker
- Keep track of the current epoch number
- Master broadcasts when to end the epoch
- Backups commit an epoch when all backups have received the marker (see the sketch below)
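A minimal sketch of the marker mechanism, assuming each primary appends records to a local log, writes an epoch marker εn when told to end epoch n, and the backups install epoch n only once every shipped log contains that marker. The PrimaryLog class and record tuples are illustrative, not the paper's data structures.

```python
class PrimaryLog:
    """One primary site's log: transaction records plus epoch markers."""
    def __init__(self, site):
        self.site, self.entries, self.epoch = site, [], 0

    def append(self, record):
        self.entries.append(record)              # e.g. ("C", "T1") for a commit

    def end_epoch(self, n):
        """Master broadcast 'end epoch n': write the marker, bump the counter."""
        if n >= self.epoch:
            self.entries.append(("epsilon", n))  # the epoch marker εn
            self.epoch = n + 1

def backup_can_install(logs, n):
    """Backups install epoch n only once every site's log contains marker εn."""
    return all(("epsilon", n) in log.entries for log in logs)

p1, p2 = PrimaryLog("P1"), PrimaryLog("P2")
p1.append(("C", "T1"))
p2.append(("C", "T2"))
p1.end_epoch(0)
print(backup_can_install([p1, p2], 0))  # False: P2 has not written ε0 yet
p2.end_epoch(0)
print(backup_can_install([p1, p2], 0))  # True: all markers present
```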
23. CS 603: Failure Recovery
24. Single-Mark Algorithm
- Problem: is it locally safe to write the marker when the broadcast is received?
- The site might be in the middle of a transaction
- Solution: share the epoch number at commit
- The prepare-to-commit message includes the local epoch number
- If the received number is greater than the local one, end the epoch
- At the backup: when all sites have epoch ≥ n, commit the transactions Ti for which (see the sketch below)
- C(Ti) < εn (the commit record precedes the epoch-n marker), or
- P(Ti) < εn, the local site is not the coordinator, and the coordinator has C(Ti) < εn
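A minimal sketch of the backup-side commit rule, assuming each site's shipped log is a list of ("C", T), ("P", T), and ("epsilon", n) records and that a record "precedes εn" when it appears before the marker in that list. The helper names and the coordinator_logs lookup are illustrative simplifications of the paper's algorithm.

```python
def committable_at_backup(log, n, coordinator_logs):
    """Which transactions may this site's backup install for epoch n?

    `log` is the site's shipped log; `coordinator_logs` maps a transaction id
    to its coordinator's shipped log.
    """
    def precedes_marker(entries, record):
        for e in entries:
            if e == record:
                return True
            if e == ("epsilon", n):
                return False
        return False

    commit = set()
    for kind, txn in (e for e in log if e[0] in ("C", "P")):
        if kind == "C" and precedes_marker(log, ("C", txn)):
            commit.add(txn)                       # rule 1: C(Ti) < εn locally
        elif kind == "P" and precedes_marker(log, ("P", txn)):
            coord = coordinator_logs.get(txn)
            if coord is not None and coord is not log and \
               precedes_marker(coord, ("C", txn)):
                commit.add(txn)                   # rule 2: ask the coordinator
    return commit

# Example: T2 prepared here before ε1, and its coordinator committed before ε1.
local = [("C", "T1"), ("P", "T2"), ("epsilon", 1), ("C", "T3")]
coord = {"T1": local, "T2": [("C", "T2"), ("epsilon", 1)], "T3": local}
print(committable_at_backup(local, 1, coord))   # -> {'T1', 'T2'}
```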
25. Correctness: Atomicity
- Lemma 1: if C(T) < εn at Pi, then CC(T) < εn at the coordinator Pc of T.
- Proof: if Pi = Pc, trivial. Suppose Pi ≠ Pc, CP(T) < εn at Pi, and εn < CC(T) at Pc. The commit message from Pc to Pi includes the epoch of Pc, so Pi would write εn first. Thus εn < CP(T), a contradiction.
- Lemma 2: if CC(T) < εn at the coordinator for T, then P(T) < εn at the participants.
- Proof: suppose εn < P(T) at some participant. When the coordinator received the acknowledgement (along with the epoch) from that participant, it bumped its epoch (if necessary) and then wrote the CC(T) entry. In either case, εn < CC(T), a contradiction.
- Atomicity: suppose the changes of T are installed at BPi after εn. If C(T) < εn at BPi and Pc was the coordinator, then by Lemma 1 CC(T) < εn at BPc. If BPi does not encounter a C(T) entry before εn, it must have committed because the coordinator told it to do so, which implies that in the coordinator's log CC(T) < εn. Thus, in either case, CC(T) < εn in the coordinator's log. By Lemma 2, P(T) < εn in the logs of all participants. The participants for which CP(T) < εn will commit T anyway; the rest of the participants will ask the coordinator's backup and will be informed that T can commit.
26. Correctness: Consistency
- If Tx → Ty and Ty is installed at the backup during epoch n, then Tx is also installed
- Suppose the dependency Tx → Ty is induced by conflicting accesses to a data item d at a processor Pd
- By property 1, C(Tx, Pd) < P(Ty, Pd). Since Ty committed at the backup during epoch n, P(Ty, Pd) < εn(Pd), which implies C(Tx, Pd) < εn(Pd)
- Thus, Tx must commit during epoch n or earlier (see Lemmas 1 and 2)
- Mutual consistency: suppose Tx → Ty and both write data item d
- If Tx → Ty at the primary, Tx commits at the backup during the same epoch as Ty or earlier
- If Tx is installed earlier, W(Tx, d) precedes W(Ty, d)
- If both are installed during the same epoch, the writes are executed in the order in which they appear in the log. Since Tx → Ty at the primary, the order must be W(Tx, d) before W(Ty, d).
27. Double-Mark Algorithm
- The single-mark algorithm requires a modification to the commit protocol
- Hard to add to an existing (closed) system
- Solution: two marks
- First mark, as before
- Quiesce commits
- When all sites acknowledge having marked their logs, send the second mark
- After writing the second mark, resume commits (see the sketch below)
- At the backup: when all sites have epoch ≥ n, commit the transactions Ti for which
- C(Ti) < εn, or
- P(Ti) < εn, the local site is not the coordinator, and the coordinator has C(Ti) < εn
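A minimal sketch of the master-driven double mark, assuming the master broadcasts the first mark, every primary writes it and quiesces commits, and only after all acknowledgements does the master broadcast the second mark, after which commits resume. The backup side is unchanged from the single-mark rule above; the Primary class and message names are illustrative.

```python
class Primary:
    """Stub primary site for the double-mark protocol."""
    def __init__(self, name):
        self.name, self.log, self.commits_allowed = name, [], True

    def first_mark(self, n):
        """Write the first mark and quiesce commits until the second mark."""
        self.log.append(("mark1", n))
        self.commits_allowed = False
        return "ack"

    def second_mark(self, n):
        """Write the second mark and resume committing transactions."""
        self.log.append(("mark2", n))
        self.commits_allowed = True

def end_epoch(master_epoch, primaries):
    """Master side: broadcast mark 1, wait for every ack, then broadcast mark 2.
    Commits are quiesced between the two marks at every site, which is what
    lets the epoch boundary be drawn without changing the commit protocol."""
    acks = [p.first_mark(master_epoch) for p in primaries]   # quiesce everywhere
    assert all(a == "ack" for a in acks)
    for p in primaries:
        p.second_mark(master_epoch)                          # resume commits
    return master_epoch + 1

sites = [Primary("P1"), Primary("P2")]
print(end_epoch(0, sites))   # -> 1
print(sites[0].log)          # [('mark1', 0), ('mark2', 0)]
```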
28. Performance
29. Communication