Title: Consistency without concurrency control in large, dynamic systems
1. Consistency without concurrency control in large, dynamic systems
- Marc Shapiro, INRIA & LIP6
- Nuno Preguiça, Universidade Nova de Lisboa
- Mihai Leția, ENS Lyon
2. Consistency without concurrency control
[Figure: replicated object x, with f(x1) and g(x2) applied at different replicas]
- Object x, operation f(x)
- propose f(x1)
- eventually replay f(x2), f(x3), ...
- If f and g commute: converges safely without concurrency control
- Commutative Replicated Data Type (CRDT): designed for commutative operations
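As a minimal sketch of the convergence claim above (not the authors' code), consider a grow-only set whose only operation, add, commutes with itself. The class GSetReplica and the replica names x1 and x2 are illustrative assumptions.

```python
class GSetReplica:
    """Hypothetical replica of a grow-only set; 'add' operations commute."""

    def __init__(self):
        self.elements = set()

    def apply(self, op):
        # op is a pair ("add", element); applying it twice is harmless.
        kind, element = op
        if kind != "add":
            raise ValueError("unsupported operation: " + kind)
        self.elements.add(element)


f = ("add", "a")  # proposed at replica x1
g = ("add", "b")  # proposed concurrently at replica x2

x1, x2 = GSetReplica(), GSetReplica()
x1.apply(f)  # x1 executes its own operation first...
x1.apply(g)  # ...then replays the remote one
x2.apply(g)  # x2 sees the two operations in the opposite order
x2.apply(f)

converged = x1.elements == x2.elements
print(converged)
```

Neither replica waited for the other or took a lock; both end in the same state because the order of f and g does not matter.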
3. A sequence CRDT
- Treedoc: sequence of elements
- insert-at-pos, delete
- Commutative when concurrent
- Minimise overhead
- Scalable
- "A commutative replicated data type for cooperative editing", ICDCS 2009
- Focus today:
- Garbage collection
- vs. scale
4. Commutative updates
[Figure: binary naming tree with root R and edges labelled 0 and 1; contents in infix order: I N R I A]
- Naming tree: minimal, self-adjusting, logarithmic
- TID: path of 0/1 bits
- Contents: infix order
- insert adds a leaf: non-destructive, TIDs don't change
- Delete: tombstone, TIDs don't change
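The naming scheme above can be sketched as follows. This is a simplified illustration rather than the Treedoc implementation: TIDs are plain bit strings, disambiguators are left out, and the class name MiniTreedoc, the helper infix_key, and the tree layout used for "INRIA" are assumptions chosen to be consistent with the operations shown on the later catch-up slides.

```python
def infix_key(tid):
    """Sort key giving infix order: left subtree < node < right subtree."""
    # '0' maps below the end-of-path marker (1), '1' maps above it.
    return tuple(0 if bit == "0" else 2 for bit in tid) + (1,)


class MiniTreedoc:
    """Hypothetical, simplified Treedoc-like sequence (no disambiguators)."""

    def __init__(self):
        self.nodes = {}  # TID (bit string) -> atom, or None for a tombstone

    def insert(self, tid, atom):
        # Adds a new node; existing TIDs are untouched. Replays are ignored.
        self.nodes.setdefault(tid, atom)

    def delete(self, tid):
        # Leaves a tombstone so the TIDs of other nodes stay valid.
        self.nodes[tid] = None

    def contents(self):
        ordered = sorted(self.nodes, key=infix_key)
        return "".join(self.nodes[t] for t in ordered if self.nodes[t] is not None)


base = [("", "R"), ("0", "I"), ("1", "A"), ("01", "N"), ("10", "I")]
doc_a, doc_b = MiniTreedoc(), MiniTreedoc()
for doc in (doc_a, doc_b):
    for tid, atom in base:
        doc.insert(tid, atom)
before = doc_a.contents()

# The same three updates, delivered in different orders at each replica.
doc_a.insert("00", "L")
doc_a.insert("001", "'")
doc_a.delete("1")

doc_b.delete("1")
doc_b.insert("001", "'")
doc_b.insert("00", "L")

print(before, doc_a.contents(), doc_b.contents())
```

Because an insert only adds a node and a delete only marks one, no update changes the TID of any other node, which is what lets the two replicas apply the updates in either order.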
5. Wikipedia GWB page: space overhead
[Plot: serialised size (kB), Treedoc vs. wikidoc, against revisions (axis unit: 10 revisions)]
6. Rebalance
[Figure: tree with contents L ' I N R I]
7. Rebalance
[Figure: rebalanced tree with contents L ' I N R I; one copy is flagged "!!!"]
- Invalidates TIDs
- Frame of reference = epoch
- Requires agreement
- Pervasive!
- e.g. Vector Clocks
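To make the "invalidates TIDs" point concrete, here is a hedged sketch of one possible rebalance: it drops tombstones, reassigns the live atoms to the TIDs of a balanced binary tree, and starts a new epoch. The functions balanced_tids and rebalance, and the starting tree, are illustrative assumptions and not the heuristics used in the experiments.

```python
def infix_key(tid):
    """Sort key giving infix order over bit-string TIDs."""
    return tuple(0 if bit == "0" else 2 for bit in tid) + (1,)


def balanced_tids(count, prefix=""):
    """TIDs of a balanced binary tree with `count` nodes, in infix order."""
    if count == 0:
        return []
    left = (count - 1) // 2
    right = count - 1 - left
    return (balanced_tids(left, prefix + "0")
            + [prefix]
            + balanced_tids(right, prefix + "1"))


def rebalance(nodes, epoch):
    """Discard tombstones, assign fresh TIDs, and move to the next epoch."""
    live = [nodes[t] for t in sorted(nodes, key=infix_key) if nodes[t] is not None]
    return dict(zip(balanced_tids(len(live)), live)), epoch + 1


def contents(nodes):
    return "".join(nodes[t] for t in sorted(nodes, key=infix_key) if nodes[t] is not None)


# Assumed starting tree: "L'INRI" plus a tombstone at TID "1".
old = {"": "R", "0": "I", "1": None, "01": "N", "10": "I", "00": "L", "001": "'"}
new, new_epoch = rebalance(old, epoch=0)

print(contents(old), contents(new), new_epoch)
print(old["01"], new["01"])  # the same TID now names a different atom
```

The visible contents are unchanged, but a TID computed in the old epoch no longer points at the same atom, so an update issued against the old tree cannot be replayed as-is on the rebalanced one.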
8. Rebalance in large, dynamic systems
- Rebalance requires consensus
- Consensus requires small, stable membership
- Large communities?!
- Dynamic scenarios?!
- Solution: two tiers
- Core: rebalancing (and updates)
- Nebula: updates (and rebalancing)
- Migration protocol
9. Core / Nebula
Core:
- Group membership
- Small, stable
- Rebalance
- Unanimous agreement (2-phase commit)
- All core sites in same epoch
Nebula:
- Arbitrary membership
- Large, dynamic
- Communicate with sites in same epoch only
- Catch-up to rebalance, join core epoch
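The "same epoch only" rule can be illustrated with a small sketch. The Site class, its fields, and the opaque update names are hypothetical; the catch-up protocol itself is not modelled and is reduced here to adopting the core's epoch number.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical site that tags its state with an epoch number."""
    name: str
    epoch: int = 0
    log: list = field(default_factory=list)

    def receive(self, sender_epoch, update):
        # Updates are only meaningful within one frame of reference.
        if sender_epoch != self.epoch:
            return False  # the sender has to catch up first
        self.log.append(update)
        return True


core = Site("core-1", epoch=1)      # the core has already rebalanced
nebula = Site("nebula-7", epoch=0)  # this nebula site is still in the old epoch

rejected = core.receive(nebula.epoch, "update-a")

# Stand-in for catch-up: the real steps are not shown in this sketch.
nebula.epoch = core.epoch
accepted = core.receive(nebula.epoch, "update-b")

print(rejected, accepted, core.log)
```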
11. Catch-up protocol summary
12. Catch-up protocol
[Figure: two trees; operations shown: ins(L,00), ins(',001); del(1)]
13. Catch-up protocol
[Figure: two trees; operations shown: ins(L,00), ins(',001); rebalance]
14. Catch-up protocol
[Figure: two trees]
15. Catch-up protocol
[Figure: tree; operations shown: del(1), rebalance]
16. Catch-up protocol
[Figure: tree]
17. Summary
- CRDT
- Convergence ensured
- Design for commutativity
- GC cannot be ignored
- Requires commitment
- Pervasive issue
- Large-scale commitment
- Core / Nebula
- To synchronise: catch-up, migration
18. Future work
- More CRDTs
- Understanding CRDTs: what invariants can be CRDTized
- Approximations of CRDTs
- Data types for consistent cloud computing without concurrency control
20. Replicated sequence
21. State of the art
- Serialising updates
- Single, global execution order
- Lock-step: poor user experience
- Doesn't scale
- Operational Transformation
- Local execution orders
- Modify arguments to take into account concurrent operations scheduled before
- Weak theory, hidden complexity
- Insight: design data type to be commutative
22. Commutative Replicated Data Type (CRDT)
- Assuming:
- All concurrent operations commute
- Non-concurrent operations execute in happens-before order
- All operations eventually execute at every replica
- Then replicas eventually converge to correct value
- Design data types to support commutative operations
23. Concurrent inserts
- Exceptions to binary tree: disambiguator
- Concurrent inserts ordered by disambiguator
- Path: site-ID? <0|1, disambiguator?>
- Alternatives:
- site identifier of initiator: short, but delete leaves a tombstone
- or Unique ID of operation: long, immediate delete
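A hedged sketch of how a disambiguator can order concurrent inserts: two sites insert at the same tree position, and the tie is broken by comparing site identifiers. Attaching a single disambiguator to a whole bit-string path is a simplification made for this example; the names position_key, siteA, and siteB are hypothetical.

```python
def infix_key(bits):
    """Sort key giving infix order over bit-string paths."""
    return tuple(0 if bit == "0" else 2 for bit in bits) + (1,)


def position_key(tid):
    """Order by tree position first, then by disambiguator (here a site ID)."""
    bits, site_id = tid
    return (infix_key(bits), site_id)


# Two sites concurrently insert at the same path "01".
insert_from_b = (("01", "siteB"), "y")
insert_from_a = (("01", "siteA"), "x")

# Each replica receives the inserts in a different order...
replica1 = [insert_from_b, insert_from_a]
replica2 = [insert_from_a, insert_from_b]

# ...but both sort them the same way, so both read the same sequence.
text1 = "".join(atom for _, atom in sorted(replica1, key=lambda e: position_key(e[0])))
text2 = "".join(atom for _, atom in sorted(replica2, key=lambda e: position_key(e[0])))

print(text1, text2)
```

Any total order on disambiguators works for convergence; the slide's trade-off is about size (a short site identifier versus a long unique operation ID) and whether deletes can be applied immediately.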
24. Causal ordering
- Vector clocks
- Number of messages received from each site
- Causal ordering
- Filter duplicate messages
- Efficient but grow indefinitely
- Treedoc
- TID encodes causal order
- Duplicates: idempotent
- Approximate Vector Clocks + Treedoc
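The vector-clock bullets above describe the standard causal-delivery test, sketched here for reference. The function names classify and deliver and the site names A and B are illustrative; each clock maps a site to the number of its messages received so far.

```python
def classify(local_clock, sender, msg_clock):
    """Decide what to do with a message stamped with the sender's vector clock."""
    seen = local_clock.get(sender, 0)
    if msg_clock.get(sender, 0) <= seen:
        return "duplicate"  # already delivered: filter it out
    next_from_sender = msg_clock[sender] == seen + 1
    deps_met = all(
        count <= local_clock.get(site, 0)
        for site, count in msg_clock.items()
        if site != sender
    )
    return "deliver" if next_from_sender and deps_met else "wait"


def deliver(local_clock, sender):
    """Record that one more message from `sender` has been delivered."""
    local_clock[sender] = local_clock.get(sender, 0) + 1


local = {"A": 1, "B": 0}

ready = classify(local, "B", {"A": 1, "B": 1})    # all dependencies seen
blocked = classify(local, "B", {"A": 2, "B": 1})  # depends on A's 2nd message

deliver(local, "B")
repeat = classify(local, "B", {"A": 1, "B": 1})   # same message again

print(ready, blocked, repeat)
```

The clock needs one entry per site that has ever sent a message, which is the "grow indefinitely" cost noted on the slide.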
25. Rebalance requires commitment
- Commutativity of update and rebalance
- Updates win
- Rebalance does not impact update performance
- Rebalance: unanimous agreement
- Standard 2- or 3-phase commit
- Initiator is coordinator
- Other site: if concurrent update, vote "Not OK"
- Off critical path!
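The voting rule above can be sketched as a bare-bones two-phase commit with no failure handling, timeouts, or persistence. The Replica class and its has_concurrent_update flag are assumptions made for this illustration.

```python
class Replica:
    """Hypothetical core site taking part in a rebalance vote."""

    def __init__(self, name):
        self.name = name
        self.epoch = 0
        self.has_concurrent_update = False  # assumed flag, set by the update path

    def vote(self, proposed_epoch):
        # Phase 1: answer "Not OK" if an update is concurrent with the proposal.
        return not self.has_concurrent_update and proposed_epoch == self.epoch + 1

    def commit(self, new_epoch):
        # Phase 2: perform the (deterministic) rebalance and switch epoch.
        self.epoch = new_epoch


def propose_rebalance(initiator, others):
    """The initiator acts as coordinator; one "Not OK" vote aborts the rebalance."""
    new_epoch = initiator.epoch + 1
    if not all(site.vote(new_epoch) for site in others):
        return False  # updates win: nothing changes and updates proceed
    for site in [initiator] + others:
        site.commit(new_epoch)
    return True


a, b, c = Replica("a"), Replica("b"), Replica("c")

c.has_concurrent_update = True
first_try = propose_rebalance(a, [b, c])   # aborted by c

c.has_concurrent_update = False
second_try = propose_rebalance(a, [b, c])  # unanimous

print(first_try, second_try, [site.epoch for site in (a, b, c)])
```

An aborted rebalance costs nothing on the update path: updates never wait for the vote, which is the sense in which commitment stays off the critical path.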
26. Experiments
- Estimate overheads, compare design alternatives
- Atom granularity: word vs. line
- Disambiguator: siteID + tombstone vs. unique ID + immediate delete
- Are trees unbalanced?
- Effect of rebalance, heuristics
- Replay update history of CVS, SVN, Wiki repositories
27. Implementation alternatives
- Disambiguator: next slide
- Atom: character, word, line, paragraph
- Fine-grain: structural overhead
- Coarse-grain: delete artefacts
- Data structure
- Tree: no TID storage, no GC of interior nodes
- vs. (TID, atom): flexible, GC
- Mixed
- Arity: binary vs. 256-ary
28. Disambiguator design
- Disambiguator options
- 1 byte, no GC until rebalance
- or 10 bytes, immediate GC (if leaf)
- Stored in every node
- Intuitively, which do you think is best?
- I thought no. 1... but I was wrong
29. LaTeX files
30. LaTeX / line
31. Summarize deleted nodes (mean)
32. Atom granularity and deletes (200 revisions only)
33. Wikipedia GWB benchmark
- en.wikipedia.org/George_W_Bush
- 150 kB text
- 42,000 revisions: most frequently revised
- Biggest revision: 100 kB
- Benchmark data
- Treedoc node = paragraph
- First 15,000 revisions = 350,000 updates
- Biggest revision < 2 s; average 0.24 s/revision
- Rebalance every 1,000 revisions
- 256-ary tree
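34. Wikipedia GWB page
[Plot: number of nodes (tombstones vs. live) against operations (axis unit: 1000 ops)]
35. Time per operation
[Plot: time per operation (µs), no rebalance vs. with rebalance, against operations (axis unit: 1000 ops)]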
36. Size, frequency of TIDs
37. flat vs. WOOT vs. Treedoc
38. Summary: garbage collection
- Efficiency, GC are important
- Tree re-balance
- Requires commitment (move off critical path)
- Pervasive issue
- Large-scale commitment
- Core: commitment
- Nebula: asynchronous updates
- Occasional synchronisation: migration
39. Summary: CRDT
- CRDT
- Convergence ensured
- Design for commutativity
- Techniques for commutativity
- Partial order
- Non-destructive updates
- Identifiers don't depend on concurrent activity
- Consensus off critical path
40. Commutativity: genuine vs. precedence
- Commutativity: ∀ S, u, v: S.u.v = S.v.u
- Genuine: both u and v take effect
- Addition, subtraction commutative
- Non-destructive updates
- Precedence: only one of u, v ultimately takes effect
- File writes commute under Last-Writer-Wins
- Destructive updates
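The two kinds of commutativity can be contrasted in a few lines. This is an illustrative sketch: the function names are hypothetical, and the Last-Writer-Wins register assumes that the timestamps of distinct writes never tie.

```python
def apply_add(state, amount):
    """Genuine commutativity: every addition or subtraction takes effect."""
    return state + amount


def apply_lww_write(state, write):
    """Precedence: state and write are (timestamp, value); the later one wins."""
    return write if write[0] > state[0] else state


# Genuine: S.u.v == S.v.u, and both u (+5) and v (-3) are reflected in the result.
counter_uv = apply_add(apply_add(10, 5), -3)
counter_vu = apply_add(apply_add(10, -3), 5)

# Precedence: S.u.v == S.v.u, but only the write with the later timestamp survives.
start = (0, "initial")
u = (2, "written by u")
v = (1, "written by v")
file_uv = apply_lww_write(apply_lww_write(start, u), v)
file_vu = apply_lww_write(apply_lww_write(start, v), u)

print(counter_uv, counter_vu)
print(file_uv, file_vu)
```

Both data types converge, but in the second case the effect of v is lost, which is why the slide labels such updates destructive.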
41. Future work
- Integrate with real editor
- (Single-document) Transactions
- Generalisation
- Characterise invariants
- More data types: set, multilog, others?
- Mixed data types improve
- cf. persistent data structures
- When consensus required
- Move off critical path
- Speculation + conflict resolution