Title: Optimistic replication for Internet data services
1. Optimistic replication for Internet data services
http://porcupine.cs.washington.edu/
University of Washington, Department of Computer Science and Engineering, Seattle, WA, U.S.A.
2. Overview
- Simple and lightweight algorithm suitable for cluster-based Internet data services
- Dynamic replica addition/deletion
- Ensures eventual consistency of replicas
- Completely decentralized
- Tolerates multiple node failures, partitions, etc.
- Space- and cost-efficient
- Implemented on the Porcupine scalable email server
3. Outline
- Motivation
- Examples
- Correctness
- Practical Issues
- Performance
- Conclusion
4. Motivation
- Porcupine cluster-based mail server
- Manageability, availability, and performance via homogeneous architecture and dynamic data distribution
- Other applications: BBS, Web, calendar, naming/load-balancing services, ...
5. Goals and Non-goals
- Goals
  - Dynamic addition/removal of replicas
  - Space- and computational efficiency
  - Fault tolerance
  - Simplicity
- Non-goals
  - Single-copy consistency (it's the Internet, anyway)
6. Why a new algorithm?
- PC-based clusters present a new environment.
- Prior art focused on two extreme environments: mainframe+LAN and laptop+modem.
- Single-copy algorithms are not available enough.
- Mobile replication algorithms are not optimized for mostly-connected environments.
- Very few algorithms allow addition/deletion of replicas.
7. Algorithm Overview
- Contents-pushing (cf. Usenet, MS Active Directory)
  - ⇒ Computational efficiency
- Two-phase protocol (Apply, Retire)
  - ⇒ Space efficiency
- Unified treatment of contents updates and replica addition/deletion
- Thomas write rule + node discovery to resolve conflicting updates
  - ⇒ Simplicity, fault tolerance
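As a rough sketch of the per-update state the two-phase protocol keeps (field and class names here are illustrative, not taken from the Porcupine code):

```python
# Hypothetical model of an update record in the two-phase (Apply, Retire)
# protocol; names are my own, not from the Porcupine implementation.
from dataclasses import dataclass, field

@dataclass
class UpdateRecord:
    timestamp: str            # issue time from loosely synchronized clocks
    contents: bytes           # new object contents (or a replica-set change)
    targets: set              # replicas this update must reach
    acks: set = field(default_factory=set)  # replicas known to have applied it

    def fully_acked(self) -> bool:
        # Once every target has acknowledged, the record can be retired
        # (deleted everywhere); the contents then live only in the replicas.
        return self.targets <= self.acks

rec = UpdateRecord("3:10pm", b"new mail", targets={"A", "B", "C"})
rec.acks |= {"A", "B", "C"}
assert rec.fully_acked()
```

Because the record is deleted after retirement, its cost is paid only while an update is in flight.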
8. Outline
- Motivation
- Examples
  - Updating contents
  - Adding and deleting replicas
  - Resolving conflicting updates
- Correctness
- Practical Issues
- Performance
- Conclusion
9. Example: Updating contents
[Diagram: an object replicated on nodes A, B, C. Each replica stores the object contents and the replica set {A, B, C}. An update (timestamped 3:10pm) carries an ack set; the update record exists only during update propagation.]
10. Example: Update Propagation
[Diagram: node A pushes the 3:10pm update to B and C; each replica that applies it is added to the update's ack set.]
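The propagation step can be sketched as a toy in-memory simulation (a plain function call stands in for the real RPC; the Thomas-write-rule check at each replica is shown inline):

```python
# Toy simulation of contents pushing: the issuing node pushes the update to
# every target replica and records the acknowledgements.
def propagate(update, replicas):
    """replicas: dict mapping node name -> that node's local object store."""
    for node in sorted(update["targets"]):
        store = replicas[node]
        if store.get("ts") is None or update["ts"] > store["ts"]:
            store["contents"] = update["contents"]   # newer update wins
            store["ts"] = update["ts"]
        update["acks"].add(node)   # ack even a stale update, so it can retire

replicas = {n: {} for n in "ABC"}
upd = {"ts": 1, "contents": b"v1", "targets": set("ABC"), "acks": set()}
propagate(upd, replicas)
assert upd["acks"] == {"A", "B", "C"}
```

A real implementation would push asynchronously and retry until every target has acknowledged.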
11. Update Retirement
[Diagram: once all of A, B, C appear in the 3:10pm update's ack set, "Retire 3:10pm" messages are sent and every node deletes its copy of the update record.]
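Retirement, as illustrated above, amounts to: once the ack set covers the target set, every node discards its copy of the update record (a toy model; in practice the Retire messages are batched RPCs):

```python
# Toy retirement step: when all targets have acked, every replica deletes
# the update record; only the object contents remain on disk.
def retire(update, pending):
    """pending: dict mapping node name -> set of outstanding update ids."""
    if update["targets"] <= update["acks"]:
        for node in update["targets"]:
            pending[node].discard(update["id"])
        return True
    return False

pending = {n: {"u1"} for n in "ABC"}
upd = {"id": "u1", "targets": set("ABC"), "acks": set("ABC")}
assert retire(upd, pending)
assert all(not s for s in pending.values())
```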
12. Example: Final State
- Algorithm quiescent after update retirement
  - New contents absent from the update record
  - Contents are read from the replicas directly
  - Update stored only during propagation
  - ⇒ Computational & space efficiency
[Diagram: replicas A, B, C each hold the new contents and the replica set; no update records remain.]
13. Replica addition and removal
- A issues an update to delete C.
- Unified treatment of updates to contents and to the replica set.
[Diagram: the 3:10pm update carries the new replica set {A, B}, while its target set still includes C so that C learns to delete its copy; the ack set grows as the update propagates.]
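Because replica-set changes are just updates, deleting C can be sketched as an ordinary update whose payload is the new replica set; note the target set still contains the removed node so it learns to drop its copy (an illustrative model, not Porcupine's actual code):

```python
# Sketch: a replica-set change expressed as a normal update. The removed
# node stays in the target set so it receives the update and deletes its
# local copy; names and fields are illustrative.
def make_delete_replica_update(ts, old_set, node_to_remove):
    return {
        "ts": ts,
        "new_replica_set": old_set - {node_to_remove},  # the payload
        "targets": set(old_set),   # still includes the removed node
        "acks": set(),
    }

upd = make_delete_replica_update("3:10pm", {"A", "B", "C"}, "C")
assert upd["new_replica_set"] == {"A", "B"}
assert "C" in upd["targets"]
```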
14. What if updates conflict?
- Thomas write rule
  - Newest update always wins.
  - Older update is canceled by being overwritten by the newer one.
  - Same rule applied to replica addition/deletion.
- But some subtleties...
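A minimal sketch of the Thomas write rule as applied per object (timestamps are assumed totally ordered, e.g. synchronized-clock time plus a node-id tiebreaker):

```python
# Thomas write rule: an incoming update is applied only if its timestamp is
# newer than what the replica already stores; an older update is simply
# discarded, i.e. "canceled by overwriting".
def thomas_apply(store, new_contents, new_ts):
    if store.get("ts") is None or new_ts > store["ts"]:
        store["contents"], store["ts"] = new_contents, new_ts
        return True
    return False

store = {}
assert thomas_apply(store, b"v-new", 1)
assert not thomas_apply(store, b"v-stale", 0)   # stale update ignored
assert store["contents"] == b"v-new"
```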
15. Update conflict resolution
- A adds C (3:10pm update) while B adds D (3:20pm update) simultaneously.
- B must discover C and let C delete the replica contents.
[Diagram: A's 3:10pm update proposes new replica set {A, B, C}; B's newer 3:20pm update proposes {A, B, D} and wins, so C's copy must be removed.]
16. Node discovery protocol
[Diagram: while B propagates "Apply 3:20pm update" with new replica set {A, B, D}, it discovers that the 3:10pm update had added C and issues "Add targets: C"; C then applies the newer update and deletes its copy.]
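The discovery step can be sketched as: while an update propagates, any replica set a node learns about is merged into the update's target set, so a node named only by the losing update (C here) still receives the winning one (a toy model; names are mine):

```python
# Toy node discovery: merging every replica set observed during propagation
# into the update's target set guarantees that nodes added by an older,
# conflicting update still hear about the newer, winning update.
def discover(update, observed_replica_set):
    newly_found = observed_replica_set - update["targets"]
    update["targets"] |= newly_found
    return newly_found

upd = {"ts": "3:20pm", "new_replica_set": {"A", "B", "D"},
       "targets": {"A", "B", "D"}, "acks": set()}
assert discover(upd, {"A", "B", "C"}) == {"C"}   # C added as a target
assert "C" in upd["targets"]
```

Once C receives the 3:20pm update, it sees that the winning replica set excludes it and deletes its local copy.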
17. Proof of Correctness
- Claim: all live replicas will store the newest update, regardless of
  - the number of concurrent updates,
  - the number of replicas added or removed,
  - the number of node failures,
  provided that nodes can discover each other at least indirectly.
- E.g., when partitioned, each partition will become consistent.
18. Outline
- Motivation
- Examples
- Correctness
- Practical Issues
- Performance
- Conclusion
19. Practical Issues
- Handling long-dead nodes
  - The algorithm maintains consistency of the remaining replicas.
  - But updates will get stuck and clog nodes' disks.
  - Solution: erase dead nodes' names from replica sets & update records after the grace period.
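The grace-period cleanup might look like the following sketch (the grace-period length and liveness bookkeeping are stand-ins, not values from the talk):

```python
GRACE_PERIOD = 7 * 24 * 3600   # illustrative choice: one week, in seconds

# Prune nodes unreachable longer than the grace period from the replica set
# and from every outstanding update's target set, so stuck updates can
# finally retire instead of clogging disks.
def prune_dead_nodes(replica_set, updates, last_seen, now):
    dead = {n for n, t in last_seen.items() if now - t > GRACE_PERIOD}
    replica_set -= dead
    for u in updates:
        u["targets"] -= dead
    return dead

now = 10 * 24 * 3600
rs = {"A", "B", "C"}
ups = [{"targets": {"A", "B", "C"}, "acks": {"A", "B"}}]
seen = {"A": now - 60, "B": now - 60, "C": 0}   # C silent for 10 days
assert prune_dead_nodes(rs, ups, seen, now) == {"C"}
assert ups[0]["targets"] <= ups[0]["acks"]      # update can now retire
```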
20. Performance: Networking overhead
- Each update sends Apply and Retire msgs.
- Retire can be batched w/o affecting users.
- Actual # of msgs ≤ 2(N-1).
[Graph: measured networking overhead on a fully loaded Porcupine mail server.]
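The message bound stated above is just one Apply plus one Retire to each of the other N-1 replicas (sketch; batching Retire messages can only lower it):

```python
# Per-update message bound: Apply to the N-1 other replicas, then Retire to
# the same N-1; batched Retires reduce the actual count below this.
def max_messages(n_replicas):
    return 2 * (n_replicas - 1)

assert max_messages(3) == 4   # A, B, C: 2 Applies + 2 Retires
assert max_messages(5) == 8
```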
21. Performance: Space overhead
- Each update record is small (contents are read directly from the replicas).
- Updates are deleted quickly after retirement.
- # of outstanding updates is independent of # of objects on a node.
[Measured on a Porcupine node: 100K for update records vs. 2G for email messages.]
22. Conclusion
- Simple and lightweight algorithm suitable for cluster-based Internet data services
- Contributions
  - Simple dynamic replica addition protocol
  - Node discovery for resolving concurrent updates
  - Update retirement using synchronized clocks
- Code available at http://porcupine.cs.washington.edu/
23. Potential Applications
- This algorithm is not just for email...
- Imagine proxies for update-intensive web sites
  - Today, they use timeouts and polling
  - Dynamic replication improves availability
[Diagram: a master site replicating to proxies.]
25. Performance: Networking overhead (bytes)
- Each network message is mostly occupied by actual object contents.
- Overhead added by the replication service: ≤ 6%.