Title: Replication: Synchronous and Asynchronous
1Replication: Synchronous and Asynchronous
- Amr El Abbadi
- Department of Computer Science
- University of California
- Santa Barbara, CA 93106
2Organization
- The basic replication model [BHG87]
- Serializability theory for replicated databases
- Replica control protocols
  - quorums
  - available copies
  - view-based replication
- Asynchronous replication
  - Wuu and Bernstein: the epidemic model
3Why Replicate Data?
- Application semantics (domain servers, routing info, etc.)
- Fault tolerance (banks, information, etc.)
- Performance (search engines, parallel applications, etc.)
4The Synchronous approach
- Correctness: a replicated database should behave like a one-copy database insofar as the users can tell.
- Model: each object x is implemented by a set of copies x1, x2, x3, ... that reside on different sites s1, s2, s3, ...
5Simple Approach
- Read-one/write-all protocol.
- read(x) is translated to a read of any copy xa.
- write(x) is translated to a write of all copies xa, xb, ...
- Use any correct concurrency control protocol.
- What if failures happen? No write operations can proceed!
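The translation rules above can be sketched in a few lines of Python. This is a minimal illustration only: the class name `ReadOneWriteAll`, the in-memory stores, and the `fail` helper are assumptions made for the example, and concurrency control is left out entirely.

```python
class ReadOneWriteAll:
    """Toy read-one/write-all replica control over in-memory copies."""

    def __init__(self, sites):
        # One store per site; `up` tracks which sites are currently reachable.
        self.copies = {s: {} for s in sites}
        self.up = set(sites)

    def read(self, x):
        # read(x) is translated to a read of any one available copy.
        if not self.up:
            raise RuntimeError("no copy available")
        return self.copies[min(self.up)].get(x)

    def write(self, x, value):
        # write(x) is translated to a write of ALL copies,
        # so a single failed site blocks every write.
        if self.up != set(self.copies):
            raise RuntimeError("write blocked: some copy is unavailable")
        for store in self.copies.values():
            store[x] = value

    def fail(self, site):
        self.up.discard(site)


db = ReadOneWriteAll(["s1", "s2", "s3"])
db.write("x", 1)
db.fail("s2")
value_after_failure = db.read("x")  # reads still succeed on a surviving copy
try:
    db.write("x", 2)
    write_blocked = False
except RuntimeError:
    write_blocked = True
```

After one site fails, reads continue but the write is refused, which is the availability problem the last bullet points at.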
6Write all available copies
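- Consider the following history (one line per copy):
  - w0[xa] w1[xa]
  - w0[xb] r2[xb] Fail(b)
  - w0[yc] r1[yc] w2[yc]
- Since t2 reads-x-from t0 (and t1 also writes x), the order must be t0, t2, t1.
- But t1 reads-y-from t0 (and t2 also writes y), so the order must be t0, t1, t2: a contradiction!
- Yet the SG over the copies is acyclic.
- [Figure: serialization graph over t0, t1, t2]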
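The anomaly can be checked mechanically. The sketch below assumes one particular interleaving of the copy-level operations (the slide only gives the per-copy order) and uses hypothetical helper names. It builds the serialization graph over copies, then brute-forces every serial one-copy order looking for one with the same reads-from relation, which is a necessary condition for one-copy serializability.

```python
from itertools import permutations

# Copy-level operations in one assumed interleaving: (kind, transaction, copy).
history = [
    ("w", 0, "xa"), ("w", 0, "xb"), ("w", 0, "yc"),
    ("r", 2, "xb"),   # t2 reads x from t0; site b fails afterwards
    ("w", 1, "xa"),   # t1 writes only the copy of x that is still available
    ("r", 1, "yc"), ("w", 2, "yc"),
]

def serialization_graph(ops):
    """Edges (ti, tj) for conflicting operations on the same copy, ti first."""
    edges = set()
    for i, (kind1, txn1, copy1) in enumerate(ops):
        for kind2, txn2, copy2 in ops[i + 1:]:
            if copy1 == copy2 and txn1 != txn2 and "w" in (kind1, kind2):
                edges.add((txn1, txn2))
    return edges

def is_acyclic(edges, nodes):
    # Acyclic iff some ordering of the nodes respects every edge.
    return any(all(p.index(a) < p.index(b) for a, b in edges)
               for p in permutations(nodes))

# Logical (one-copy) view of the same transactions.
reads = {0: [], 1: ["y"], 2: ["x"]}
writes = {0: ["x", "y"], 1: ["x"], 2: ["y"]}
reads_from = {(1, "y"): 0, (2, "x"): 0}  # as observed in the history

def one_copy_serial_orders():
    """Serial one-copy orders that preserve the observed reads-from relation."""
    matching = []
    for order in permutations([0, 1, 2]):
        last_writer, same = {}, True
        for t in order:
            for obj in reads[t]:
                same = same and last_writer.get(obj) == reads_from[(t, obj)]
            for obj in writes[t]:
                last_writer[obj] = t
        if same:
            matching.append(order)
    return matching

sg = serialization_graph(history)
```

Under these assumptions `sg` is acyclic, yet `one_copy_serial_orders()` finds no equivalent serial one-copy order, so an acyclic SG over copies is not enough.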
7Correctness of replicated objects
- One-copy equivalence: the different copies of the object must appear as a single copy.
- Serializability: the concurrent execution of a set of transactions must be equivalent to a serial execution.
- One-copy serializability: the concurrent execution of a set of transactions must be equivalent to a serial history on single-copy objects.
8One-Copy Serialization Graph
- Given a history H, a 1-SG(H) is SG(H) with enough edges added such that:
  - For all objects x, 1-SG(H) embodies a total order (<<) on all transactions that write x.
  - If tj reads-x-from ti, and ti << tk, then 1-SG(H) contains a path from tj to tk.
- [Figure: ti precedes tk in the write order, tj reads from ti, and a path leads from tj to tk]
9Back to example
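- Recall the history (one line per copy):
  - w0[xa] w1[xa]
  - w0[xb] r2[xb] Fail(b)
  - w0[yc] r1[yc] w2[yc]
- [Figure: the SG over t0, t1, t2]
- Since t1 reads-y-from t0, and t0 << t2, then t1 → t2.
- But t2 reads-x-from t0, and t0 << t1, then t2 → t1.
- [Figure: the 1-SG over t0, t1, t2, now with a cycle between t1 and t2]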
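The same argument can be run as code. This sketch assumes the SG edges derived from the per-copy history (t0 → t1, t0 → t2, t1 → t2) and satisfies the "path from tj to tk" requirement by adding a direct edge, which is one way to meet the definition. The function names are illustrative.

```python
def one_sg_edges(sg_edges, reads_from, write_order):
    """Add tj -> tk whenever tj reads-x-from ti and ti << tk in x's write order.

    reads_from:  list of (tj, x, ti) triples
    write_order: {x: [transactions that write x, in << order]}
    """
    edges = set(sg_edges)
    for tj, x, ti in reads_from:
        order = write_order[x]
        for tk in order[order.index(ti) + 1:]:
            if tk != tj:
                edges.add((tj, tk))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "active" while on the DFS stack, "done" afterwards

    def visit(n):
        state[n] = "active"
        for m in graph[n]:
            if state.get(m) == "active" or (m not in state and visit(m)):
                return True
        state[n] = "done"
        return False

    return any(n not in state and visit(n) for n in graph)

# Assumed SG edges for the example history.
sg = {("t0", "t1"), ("t0", "t2"), ("t1", "t2")}
reads_from = [("t2", "x", "t0"), ("t1", "y", "t0")]
write_order = {"x": ["t0", "t1"], "y": ["t0", "t2"]}
one_sg = one_sg_edges(sg, reads_from, write_order)
```

The plain SG has no cycle, while the 1-SG gains the edge t2 → t1 and therefore a cycle with t1 → t2, so the history is rejected.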
10Available Copies Protocol [BG 83]
- Recall the history (one line per copy):
  - w0[xa] w1[xa]
  - w0[xb] r2[xb] Fail(b)
  - w0[yc] r1[yc] w2[yc]
- Introduce the failure of a site as an atomic transaction OUT_b (similarly IN_b for recovery), which causes transactions to change their write sets (change directory info).
- We explicitly force a path.
- [Figure: graph over t0, t2, OUT_b, and t1]
11Available copies protocol
- Inexpensive read operations
- Tolerates site failures
- Does NOT tolerate partitioning failures!
- [Figure: network split into partitions P1 and P2]
12Quorum Consensus Protocol [Gifford 79]
- Extend the idea of quorums for mutual exclusion to read and write operations, i.e., read and write quorums.
- [Figure: a read quorum overlapping a write quorum, and two write quorums overlapping each other]
13Quorum Consensus Protocol
- Associate with each copy a version number.
- Write operation:
  - Determine max-version-no of a write quorum.
  - Update the write quorum with the new value and set the version numbers to max-version-no + 1.
- Read operation:
  - Read the value of the copy with max-version-no in a read quorum.
- Use a correct concurrency control protocol.
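A minimal sketch of the read and write rules, assuming size-based quorums with the usual intersection conditions r + w > n and 2w > n (the slides do not fix quorum sizes). The class name, the `available` parameter, and the way a quorum is picked are all illustrative, and concurrency control is omitted.

```python
class QuorumReplica:
    """Toy quorum consensus over n in-memory copies of one object."""

    def __init__(self, n, read_quorum, write_quorum):
        # Assumed intersection conditions: every read quorum meets every
        # write quorum, and any two write quorums meet.
        assert read_quorum + write_quorum > n and 2 * write_quorum > n
        self.copies = [(0, None)] * n  # (version number, value) per copy
        self.r, self.w = read_quorum, write_quorum

    def _pick(self, size, available):
        if len(available) < size:
            raise RuntimeError("not enough copies for a quorum")
        return sorted(available)[:size]

    def write(self, value, available):
        quorum = self._pick(self.w, available)
        max_version = max(self.copies[i][0] for i in quorum)
        for i in quorum:
            self.copies[i] = (max_version + 1, value)

    def read(self, available):
        quorum = self._pick(self.r, available)
        # Return the value of the copy with the highest version number.
        return max((self.copies[i] for i in quorum), key=lambda c: c[0])[1]


db = QuorumReplica(n=3, read_quorum=2, write_quorum=2)
db.write("v1", available={0, 1, 2})  # quorum {0, 1} moves to version 1
db.write("v2", available={1, 2})     # quorum {1, 2} moves to version 2
result = db.read(available={0, 2})   # sees versions 1 and 2, returns "v2"
```

Copy 0 is stale after the second write, but any read quorum overlaps the last write quorum, so the version number identifies the current value.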
14Correctness
- The SG(h) for any execution created by the quorum consensus protocol is:
  - Acyclic: guaranteed by the correct concurrency control protocol.
  - A 1-SG(h): all conflicting operations conflict on a copy, so
    - (1) SG(h) has a total order on all write operations, and
    - (2) SG(h) orders all read and write conflicts.
15Quorum Consensus Protocol
- No special treatment for failures and recovery.
- Tolerates both site and partitioning failures.
- Expensive read operations.
- Large number of copies to tolerate a given number of failures, e.g., 3 copies to tolerate 1 failure, 5 copies to tolerate 2 failures, etc.
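The copy counts in the last bullet follow from a short calculation, assuming both read and write quorums are simple majorities (an assumption for this sketch; other quorum assignments are possible). With n copies, a majority quorum has size q, and an operation can proceed as long as the surviving copies still form a quorum:

```latex
q = \left\lfloor \tfrac{n}{2} \right\rfloor + 1,
\qquad
n - f \ge q
\iff
f \le \left\lceil \tfrac{n}{2} \right\rceil - 1
\quad\Longrightarrow\quad
n = 2f + 1 \text{ copies suffice to tolerate } f \text{ failures.}
```

For n = 3 this gives q = 2 and f = 1; for n = 5 it gives q = 3 and f = 2, matching the figures above.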
16Virtual Partitions Protocol [El Abbadi et al. 85, 86]
- Quorums can tolerate partitions.
- Available copies allows read-one.
- We want to combine the best of both worlds!
  - Use quorums to decide when to execute an operation.
  - Use read-one write-all-available for actual execution.
17Views
- We associate with each site s a view, view(s), which is the set of sites s assumes it can communicate with.
- Ideally: [Figure: sites a, b, and c annotated with their views, such as {b} and {a,c}]
18Virtual Partitions Rule
- Accessibility Rule: a transaction executes only if a majority of sites are in its view.
- Read/Write Rule: read one copy, write all copies in view.
- [Figure: sites a, b, and c annotated with example views such as {b}, {b,c}, {a,b}, and {a,b,c}]
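The two rules can be sketched as follows, assuming three sites that each hold a copy of every object. `ALL_SITES`, `accessible`, `vp_read`, and `vp_write` are names invented for this example, and view management and concurrency control are not modeled.

```python
ALL_SITES = frozenset({"a", "b", "c"})

def accessible(view):
    """Accessibility rule: a strict majority of all sites must be in the view."""
    return 2 * len(view) > len(ALL_SITES)

def vp_read(copies, view, x):
    # Read rule: read any one copy in the view.
    if not accessible(view):
        raise RuntimeError("view lacks a majority")
    return copies[min(view)][x]

def vp_write(copies, view, x, value):
    # Write rule: write all copies in the view.
    if not accessible(view):
        raise RuntimeError("view lacks a majority")
    for site in view:
        copies[site][x] = value

copies = {s: {"x": 0} for s in ALL_SITES}
vp_write(copies, {"a", "c"}, "x", 1)           # majority view: allowed
majority_read = vp_read(copies, {"a", "c"}, "x")
try:
    vp_write(copies, {"b"}, "x", 2)            # minority view: refused
    minority_blocked = False
except RuntimeError:
    minority_blocked = True
```

The minority side of a partition can neither read nor write, so only one partition makes progress, while the majority side keeps cheap read-one access.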
19Virtual Partitions Protocol
- Communication Rule: only sites with the same view are allowed to communicate.
- Each new view has associated with it a view-id.
- View changes:
  - The initiating site s decides on the members of the new view, and picks a view-id greater than any previous one.
  - s then executes an update transaction to update all copies in the view with the most up-to-date value for each object.
  - The update transaction accesses all copies of every object that has a majority of its sites in the new view.
  - A site participates in the new update transaction only if its local view-id is less than the proposed view-id.
20Correctness idea
- Global correctness
  - majority rule
- Local correctness
  - read-one write-all
  - correct concurrency control protocol
21Virtual partitions Protocol
- Tolerates partitions and site failures.
- Allows the read-one rule.
- Costly update transaction.
22Asynchronous or Lazy replication
- In large Internet-type settings, transaction-based replication is:
  - too expensive (remember 2PC),
  - unrealistic (not all sites are up all the time),
  - not scalable (large number of sites).
- Epidemic approach (Bayou project at Xerox):
  - Information is changed locally, and then propagated in a lazy manner to all other replicas.
  - Correctness is based on causality.
23Replicated dictionary problem
- "Efficient solutions to the replicated log and dictionary problems", Wuu and Bernstein, PODC 84.
- Basic assumptions:
  - Sites may crash, links may fail, and the network may partition.
  - Each site maintains a local clock (a counter).
  - Local events are atomic.
  - Use Lamport's event execution model and happens-before relation.
24The log problem
- Each site maintains a copy of the log.
- The log contains local events, i.e.,
  - insert
  - delete
- The goal of the algorithm is to keep all copies of the log up to date.
- Li is the copy of the log at site i.
- L(e) is the contents of the log at node(e), the site where e occurs, immediately after event e is executed.
25The log problem
- Log Problem: find an algorithm that maintains the log such that, given an execution <E, →>,
  - for all events e, f: if f → e, then f is in L(e).
- General approach:
  - For each local event, insert a record in the local log.
  - Exchange logs to update other sites.
- Main question: when to exchange logs? With application communication, to capture the happens-before relation.
26Solutions to the log problem
- A solution:
  - Site i sends to site j all records in the log that were inserted since i last sent a message to j.
  - WHY INCORRECT?
- Another solution:
  - Each site i includes Li with each message.
  - On receiving a message, a site j incorporates all new event records.
  - BAD:
    - Entire log sent with each message.
    - Entire log kept at each node.
27Efficient solution for log problem
- Observation 1: once i knows that j knows of an event e (which may have occurred on site k), then i does not need to include event e in messages sent to j.
- Observation 2: once i knows that all sites know about an event e, then i does not need to keep a record of e in its local log.
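28Two-Dimensional Time-Table
- TTi[n, n]: each site i keeps an n × n time-table.
- If TTi[j, k] = t, then site i knows that site j has learned of all events that occurred at site k up to time t.
- [Figure: time-table with row j, column k, and entry t]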
29The 2 dimensional timetable
- Notes:
  - Site j might actually know about more events, but site i may not be aware of it.
  - TTi[i, i] is the value of the clock at site i.
  - TTi[i, k] is the value of the clock at site k of the most recent event at site k that site i is aware of.
30Two dimensional timetable
- Let hasrec(TTi, e, k) be true iff
  - TTi[k, node(e)] ≥ time(e)
- The algorithm must guarantee that if hasrec(TTi, e, k) is true, then site k has learned of event e.
- Note: site i need not send a record of event e to site k if hasrec(TTi, e, k) is true.
31Log maintenance
- Initialize all entries in TT to 0.
- For each local operation, insert a copy in the local log.
- With each send operation from site i to site k, piggyback TTi and the following subset of the local log Li: all records e such that hasrec(TTi, e, k) is not true.
- On receipt of a message from site k by site i:
  - Incorporate all new events into the local log.
  - Update TT:
    - Row i: max of the times in the local ith row and the remote kth row.
    - All entries: element-wise max of the local and remote tables.
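The steps above can be sketched as a small Python class. This is an illustration under stated assumptions, not the authors' code: the `Site` class and its method names are invented here, the clock is advanced by one on every local event, and messages are plain Python tuples.

```python
from collections import namedtuple

Event = namedtuple("Event", "node time op")

class Site:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.tt = [[0] * n for _ in range(n)]  # all entries start at 0
        self.log = set()

    def hasrec(self, e, k):
        return self.tt[k][e.node] >= e.time

    def local_event(self, op):
        # Assumption: the local clock TTi[i, i] ticks once per local event.
        self.tt[self.i][self.i] += 1
        self.log.add(Event(self.i, self.tt[self.i][self.i], op))

    def send(self, k):
        # Piggyback TTi and every record that k is not known to have.
        records = {e for e in self.log if not self.hasrec(e, k)}
        return records, [row[:] for row in self.tt]

    def receive(self, k, message):
        records, tt_k = message
        self.log |= records
        # Row i: max of the local ith row and the remote kth row.
        for c in range(self.n):
            self.tt[self.i][c] = max(self.tt[self.i][c], tt_k[k][c])
        # Every entry: element-wise max of the local and remote tables.
        for r in range(self.n):
            for c in range(self.n):
                self.tt[r][c] = max(self.tt[r][c], tt_k[r][c])


a, b = Site(0, 2), Site(1, 2)
a.local_event("insert x")
b.receive(0, a.send(1))
resent = a.send(1)[0]      # a cannot yet tell that b has the record
a.receive(1, b.send(0))
after_ack = a.send(1)[0]   # now a knows that b knows, so nothing is sent
```

A record keeps being resent until the sender learns, through a returning time-table, that the receiver has it. That is why lost messages are tolerated.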
32Dictionary problem
- Assume we want to maintain a replicated dictionary with insert, delete and lookup operations.
- On receipt of a message with a partial log and TT:
  - Update the local copy of the dictionary.
  - Update the local copy of TT as before.
  - Garbage collect from the local log any records that correspond to events e such that
    - there is no site j for which hasrec(TTi, e, j) is not true (every site is known to have learned of e).
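A sketch of the dictionary layer with garbage collection, under simplifying assumptions made for this example: each key is inserted at most once, a key is deleted only where its insert is already known, and incoming records the site already has (by its own row of TT) are ignored so that collected records are not applied twice. The `DictSite` class and its method names are hypothetical.

```python
from collections import namedtuple

Event = namedtuple("Event", "node time op key")

class DictSite:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.tt = [[0] * n for _ in range(n)]
        self.log, self.keys = set(), set()

    def hasrec(self, e, k):
        return self.tt[k][e.node] >= e.time

    def _record(self, op, key):
        self.tt[self.i][self.i] += 1  # assumed: clock ticks per local event
        self.log.add(Event(self.i, self.tt[self.i][self.i], op, key))

    def insert(self, key):
        self.keys.add(key)
        self._record("insert", key)

    def delete(self, key):
        self.keys.discard(key)
        self._record("delete", key)

    def lookup(self, key):
        return key in self.keys

    def send(self, k):
        records = {e for e in self.log if not self.hasrec(e, k)}
        return records, [row[:] for row in self.tt]

    def receive(self, k, message):
        records, tt_k = message
        # Only records this site has not already learned of are new.
        new = {e for e in records if not self.hasrec(e, self.i)}
        inserted = {e.key for e in new if e.op == "insert"}
        deleted = {e.key for e in new if e.op == "delete"}
        self.keys = (self.keys | inserted) - deleted
        # Update TT as in log maintenance.
        for c in range(self.n):
            self.tt[self.i][c] = max(self.tt[self.i][c], tt_k[k][c])
        for r in range(self.n):
            for c in range(self.n):
                self.tt[r][c] = max(self.tt[r][c], tt_k[r][c])
        # Garbage collect: keep a record only while some site may lack it.
        self.log = {e for e in self.log | new
                    if any(not self.hasrec(e, j) for j in range(self.n))}


a, b = DictSite(0, 2), DictSite(1, 2)
a.insert("k")
b.receive(0, a.send(1))
pending_at_a = len(a.log)  # a still holds the record: b is not known to have it
a.receive(1, b.send(0))
```

With two sites, b can drop the record as soon as it arrives, since both sites are then known to have it, and a drops it once b's time-table comes back.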
33Asynchronous replication
- Tolerates message loss, failures and partitioning.
- Maintains causality as the correctness criterion:
  - if e → f and a site is aware of f, then it is aware of e.
- Extensions for transaction semantics [SAE97].
- Various proposals to expand semantics to other applications, e.g., the Bayou project.
34Where is the future?
- Does it belong to the strict atomic approach, which does ensure secure and predictable behavior?
- Or does it belong to the lazy propagation approach, which is more scalable and flexible?
- A hybrid approach?