Title: Consistent and Automatic Replica Regeneration
1Consistent and Automatic Replica Regeneration
- Symposium on Networked Systems Design and Implementation (NSDI), 2004
- Haifeng Yu
- Amin Vahdat
2Outline
- Introduction
- System Architecture Overview
- Normal Case Operations
- Reconfiguration
- Single Replica Regeneration
- Experimental Evaluation
- Conclusions
3Introduction
- This paper presents Om, which is based on PAST
- Challenge
- Maintaining consistency when the composition of
the replica group changes
4PAST
PAST example: object key 100
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
5PAST
PAST example: object key 100, replication
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
6PAST
PAST example: object key 100, replication, replica crash
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
7PAST
PAST example: object key 100, replication, replica crash, regeneration
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
9Inconsistency
Node 101 is overloaded
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
10Inconsistency
Node 101 is overloaded. Nodes 100 and 99 detect node 101 as failed. A new replica is created on node 98.
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
11Inconsistency
Nodes 100, 99, and 98 are overloaded too
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
12Inconsistency
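Nodes 100, 99, and 98 are overloaded too and are considered dead by node 101. New replicas are created on nodes 103 and 104. Result: inconsistency, because two replica groups now exist for the same object.
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]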
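The divergence on the Inconsistency slides can be reproduced with a small sketch. It assumes the replica group is the k = 3 live nodes numerically closest to the object key, which matches the node IDs shown on the slides; the function name `replica_group` is hypothetical and ring wraparound is ignored.

```python
def replica_group(key, live_nodes, k=3):
    """Return the k live node IDs numerically closest to key.

    Ties are broken toward the lower ID; ring wraparound is ignored
    to keep the sketch short.
    """
    return sorted(live_nodes, key=lambda n: (abs(n - key), n))[:k]


NODES = {80, 90, 98, 99, 100, 101, 103, 104, 120}
KEY = 100

# Initial placement: every node is considered alive.
initial = replica_group(KEY, NODES)

# View of nodes 99 and 100: node 101 looks dead (it is only overloaded).
view_a = replica_group(KEY, NODES - {101})

# View of node 101: nodes 98, 99 and 100 look dead.
view_b = replica_group(KEY, NODES - {98, 99, 100})

# The two groups share no member, so a write accepted by one group
# is never seen by the other.
disjoint = not set(view_a) & set(view_b)
```

Each side regenerates using only its own view of which nodes are alive, and the two resulting groups do not overlap. This is the situation Om's reconfiguration protocol has to rule out.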
13Introduction
- Three novel techniques in Om
- Single-replica regeneration instead of majority-based regeneration
- Distinguishing between failure-free and failure-induced reconfiguration
- A lease graph among all replicas and a two-phase write protocol, which avoid executing a consensus protocol for normal writes
14System Architecture Overview
15Normal Case Operation
Read-one / write-all approach. Writes are serialized via the primary.
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120; labels: write, read, primary]
16Normal Case Operation
- Two major anomalies
- The first anomaly arises when replicas from old configurations are slow to detect failures and continue serving stale data after reconfiguration
- The second anomaly arises when a read sees a write that has not been applied to all replicas, and that write may then be lost in reconfiguration. In other words, the read observes temporary, inconsistent state.
17Normal Case Operation
- Solution to the first anomaly: leases
- In traditional client-server architectures, each client holds a lease from the server. However, since Om can regenerate from any replica, a replica needs to hold valid leases from all other replicas
- Solution to the second anomaly: a two-phase write protocol
- First, a prepare round: the primary propagates the write to the replicas
- Second, a commit round: the primary sends commits to all replicas
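A minimal sketch of the two mechanisms above, under simplifying assumptions: time is a plain number passed in by the caller, messages never fail, and the names `Replica`, `holds_all_leases`, and `two_phase_write` are hypothetical rather than taken from Om.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.leases = {}     # grantor name -> lease expiry time
        self.pending = {}    # write id -> value, prepared but not yet readable
        self.committed = {}  # write id -> value, readable

    def holds_all_leases(self, group, now):
        """True only if this replica holds an unexpired lease from every other member."""
        return all(self.leases.get(peer.name, 0) > now
                   for peer in group if peer is not self)

    def read(self, wid, group, now):
        # A replica without a full set of valid leases must not serve data.
        if not self.holds_all_leases(group, now):
            raise RuntimeError(f"{self.name}: lease expired, refusing to serve")
        return self.committed.get(wid)


def two_phase_write(primary, group, wid, value):
    """Primary-driven write: prepare on every replica, then commit on every replica."""
    assert primary in group
    for r in group:           # round 1: prepare
        r.pending[wid] = value
    for r in group:           # round 2: commit
        r.committed[wid] = r.pending.pop(wid)


group = [Replica("r99"), Replica("r100"), Replica("r101")]
for r in group:
    for peer in group:
        if peer is not r:
            r.leases[peer.name] = 10   # every lease expires at time 10

two_phase_write(group[1], group, wid=1, value="v1")
```

Reads return only committed values, so a write that reached only some replicas in the prepare round is never observed. A replica whose leases have lapsed refuses to serve, which prevents stale reads from an old configuration.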
18Failure Detection and Regeneration
- Failures are detected in Om via timeouts on messages
- A replica proposes a new configuration that excludes the failed replicas
- The new configuration must be unique
19A Simple Design that Needs Majority
Acquire votes from a majority of replicas before regeneration
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
20A Simple Design that Needs Majority
Acquire votes from a majority of replicas before regeneration, then create the new replica
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
21A Simple Design that Needs Majority
Acquire votes from a majority of replicas before regeneration. Problem: deadlock
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
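One way such a deadlock can arise, sketched under the assumption of a three-replica group: once two replicas are unreachable, the survivor can never collect a majority. The helper `has_majority` is hypothetical.

```python
def has_majority(group_size, votes):
    """True if `votes` is a strict majority of a replica group of `group_size`."""
    return votes > group_size // 2


# Three replicas, one failure: the two survivors can still regenerate.
one_failure = has_majority(3, 2)

# Three replicas, two failures: the single survivor is stuck.
two_failures = has_majority(3, 1)
```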
22Voting with witness
Use other random nodes (witnesses) for the quorum system. But we still need a majority of witnesses.
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
23Witness Model
- The witness model utilizes the following limited
view divergence property
- Intuitively, the property says that two replicas
are unlikely to have a completely different view
regarding the reachability of a set of
randomly-placed witnesses.
24Witness Model
- To utilize the limited view divergence property, all replicas logically organize the witnesses into an m × t matrix
- The number of rows, m, determines the probability of intersection
- The number of columns, t, protects against the failure of individual witnesses, so that each row has at least one functioning witness with high probability
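A rough sketch of how the matrix could be used. It assumes each replica scans every row from left to right and picks the first witness it can reach, and that two replicas intersect when they pick the same witness in at least one row. That probing rule and the names below are assumptions for illustration, not details taken from the slides.

```python
def first_reachable_per_row(matrix, reachable):
    """For each row, pick the first witness (left to right) this replica can reach."""
    return [next((w for w in row if w in reachable), None) for row in matrix]


def views_intersect(matrix, reach_a, reach_b):
    """True if both replicas picked the same witness in at least one row."""
    picks_a = first_reachable_per_row(matrix, reach_a)
    picks_b = first_reachable_per_row(matrix, reach_b)
    return any(a is not None and a == b for a, b in zip(picks_a, picks_b))


# m = 3 rows, t = 2 columns of hypothetical witness names.
matrix = [["w1", "w2"],
          ["w3", "w4"],
          ["w5", "w6"]]

# Partially different views: both replicas still pick w3 in the second row.
overlapping = views_intersect(matrix, {"w1", "w3", "w6"}, {"w2", "w3", "w5"})

# Completely different views: no row yields a common witness.
divergent = views_intersect(matrix, {"w1", "w3", "w5"}, {"w2", "w4", "w6"})
```

Under this rule, adding rows gives more chances for a common pick, and adding columns makes it less likely that a row has no reachable witness at all.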
25Witness Model
26Witness Model
Limited view divergence: replicas reach one common witness with good probability
[Figure: nodes with IDs 80, 90, 98, 99, 100, 101, 103, 104, 120]
27Reconfiguration
- Public class Configuration
- Fields: valid, sequenceNum, primary, secondary, consensusID
- Failure-free reconfiguration
- Only the primary does this, because the other replicas are passive
- Failure-induced reconfiguration
- All replicas transmit configuration notices to aid in completing reconfiguration earlier
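The configuration record listed above might look like the following. The field names come from the slide; the field types, the default values, and the `members` helper are guesses added for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Configuration:
    valid: bool                      # assumed: False once the configuration is disabled
    sequenceNum: int                 # assumed: orders successive configurations
    primary: str                     # node that serializes writes
    secondary: List[str] = field(default_factory=list)
    consensusID: Optional[int] = None  # assumed: set only for failure-induced changes

    def members(self) -> List[str]:
        """All replicas in this configuration, primary first (hypothetical helper)."""
        return [self.primary, *self.secondary]


conf = Configuration(valid=True, sequenceNum=1,
                     primary="n100", secondary=["n99", "n101"])
```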
28Failure-free Reconfiguration
- Only the primary may initiate failure-free reconfiguration
- After transferring data to the new replicas in two stages (snapshot followed by logged writes), the primary constructs a configuration for the new desired membership
- The primary then informs the other replicas of the new configuration and waits for acks
- If a timeout occurs, a failure-induced reconfiguration will follow
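The final announce-and-ack step can be sketched as follows. The `send` callable stands in for a transport that returns True on an ack and False on a timeout; it and the return strings are hypothetical, and the earlier data-transfer stages are left out.

```python
def failure_free_reconfig(replicas, new_config, send):
    """Primary-side step: announce new_config to every replica and collect acks.

    `send(replica, config)` is a hypothetical transport call that returns
    True on ack and False on timeout.
    """
    acked = [r for r in replicas if send(r, new_config)]
    if len(acked) == len(replicas):
        return "adopted"
    # Any missing ack is treated as a suspected failure.
    return "fall back to failure-induced reconfiguration"


replicas = ["n99", "n101", "n98"]
all_ack = failure_free_reconfig(replicas, {"sequenceNum": 2}, lambda r, c: True)
one_timeout = failure_free_reconfig(replicas, {"sequenceNum": 2},
                                    lambda r, c: r != "n101")
```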
29Failure-induced Reconfiguration
A replica initiates the reconfiguration and first disables the current configuration
It then performs another round of failure detection for all members of the configuration
The result (the replicas currently alive) is used as a proposal for the new configuration
The replica then invokes a consensus protocol
Before adopting a decision, each replica needs to wait for all leases to expire with respect to the old configuration
Finally, the primary of the new configuration collects and re-applies any pending writes
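The steps above can be sketched as a single function. Failure detection (`is_alive`) and consensus (`decide`) are passed in as placeholders, configurations are plain dictionaries, and incrementing `sequenceNum` is an assumption. The last step, re-applying pending writes, is omitted.

```python
def failure_induced_reconfig(config, is_alive, decide, lease_expiry, now):
    """Return (new_config, adopt_at) for a failure-induced reconfiguration.

    is_alive(replica) -> bool   hypothetical failure detector
    decide(proposal) -> list    stand-in for the consensus protocol
    lease_expiry                lease holder -> expiry time under the old config
    """
    config["valid"] = False                                   # disable current config
    proposal = [r for r in config["members"] if is_alive(r)]  # re-run failure detection
    decided = decide(proposal)                                # agree on one membership
    adopt_at = max([now, *lease_expiry.values()])             # wait out old leases
    new_config = {
        "valid": True,
        "sequenceNum": config["sequenceNum"] + 1,  # assumed numbering scheme
        "members": decided,
    }
    return new_config, adopt_at


old = {"valid": True, "sequenceNum": 3, "members": ["n99", "n100", "n101"]}
new, adopt_at = failure_induced_reconfig(
    old,
    is_alive=lambda r: r != "n101",
    decide=lambda proposal: proposal,   # trivial consensus: accept the proposal
    lease_expiry={"n99": 12, "n100": 15},
    now=10,
)
```

Returning `adopt_at` rather than sleeping keeps the sketch deterministic. A real replica would delay adoption until that time so that no old-configuration member can still serve reads.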
30Performance Evaluation
31Conclusions
- Single-replica regeneration, which enables Om to achieve high availability with a small number of replicas
- Failure-free reconfiguration, which allows common-case reconfigurations to proceed within a single round of communication
- A lease graph and two-phase write protocol, which avoid expensive consensus for normal writes and also allow reads to be processed by any replica