Consistency without concurrency control in large, dynamic systems (slide transcript)
1
Consistency without concurrency control in
large, dynamic systems
  • Marc Shapiro, INRIA LIP6
  • Nuno Preguiça, Universidade Nova de Lisboa
  • Mihai Leția, ENS Lyon

2
Consistency without concurrency control
[Figure: replicas x1, x2 of object x; f applied at x1, g at x2]
  • Object x, operation f(x)
  • Propose f(x1) at one replica
  • Eventually replay f(x2), f(x3), ... at the others
  • If f and g commute, replicas converge safely without concurrency control
  • Commutative Replicated Data Type (CRDT): designed for commutative operations
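The convergence claim above can be checked on a toy example. This is a minimal illustrative sketch (not code from the talk): a counter whose increment operations commute, so every replay order of the same operation set yields the same state.

```python
# Illustrative sketch: increments commute, so replicas that replay the same
# set of operations in any order converge without concurrency control.
import itertools

def replay(ops, initial=0):
    """Apply a sequence of increment/decrement operations to a replica."""
    state = initial
    for delta in ops:
        state += delta
    return state

ops = [+3, -1, +5]  # operations proposed at different replicas

# Every delivery order yields the same final state.
results = {replay(perm) for perm in itertools.permutations(ops)}
assert results == {7}
```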

3
A sequence CRDT
  • Treedoc: a sequence of elements
  • Operations: insert-at-pos, delete
  • Commutative when concurrent
  • Minimises overhead
  • Scalable
  • "A commutative replicated data type for cooperative editing", ICDCS 2009
  • Focus today: garbage collection vs. scale

4
Commutative updates
[Figure: binary naming tree over the atoms of "INRIA"; inserting "L" adds a leaf]
Naming tree: minimal, self-adjusting; a TID is a logarithmic tree path (e.g. 01); contents are read in infix order
  • Insert adds a leaf: non-destructive, TIDs don't change
  • Delete leaves a tombstone: TIDs don't change
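The tree shape above can be sketched concretely. This is a simplified illustration of Treedoc-style TIDs (no disambiguators, paths chosen freely for the example, not the slides' exact layout): a TID is a binary tree path, atoms are read in infix order, insert adds a leaf, and delete leaves a tombstone so TIDs never change.

```python
# Simplified Treedoc sketch: TID = binary path string; infix order gives the
# document contents. Inserts are non-destructive; deletes leave tombstones.

def infix_key(path):
    # Map '0' -> 0, end-of-path -> 1, '1' -> 2, so sorting the keys
    # reproduces infix (left subtree, node, right subtree) order.
    return [2 * int(b) for b in path] + [1]

doc = {}  # TID path -> (atom, alive?)

def insert(path, atom):
    doc[path] = (atom, True)

def delete(path):
    atom, _ = doc[path]
    doc[path] = (atom, False)  # tombstone: the TID survives

def contents():
    ordered = sorted(doc.items(), key=lambda kv: infix_key(kv[0]))
    return "".join(atom for _, (atom, alive) in ordered if alive)

# Build a tree spelling "INRIA", then edit it.
for path, atom in [("", "R"), ("0", "N"), ("00", "I"), ("1", "I"), ("11", "A")]:
    insert(path, atom)
assert contents() == "INRIA"

insert("000", "L")    # new leaf before the first I; existing TIDs unchanged
insert("0001", "'")   # another leaf, between L and I
delete("11")          # tombstone for A
assert contents() == "L'INRI"
```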
5
Wikipedia GWB page: space overhead
[Figure: serialised size (kB) over 10 revisions, Treedoc vs. wikidoc]
6
Rebalance
[Figure: the unbalanced tree for "L'INRI" is rebalanced; contents unchanged]
7
Rebalance
[Figure: rebalancing reassigns tree paths, so old TIDs no longer name the same atoms]
  • Invalidates TIDs
  • Frame of reference: epoch
  • Requires agreement
  • A pervasive issue! e.g. Vector Clocks

8
Rebalance in large, dynamic systems
  • Rebalance requires consensus
  • Consensus requires small, stable membership
  • Large communities?! Dynamic scenarios?!
  • Solution: two tiers
  • Core: rebalancing (and updates)
  • Nebula: updates (and rebalancing)
  • Migration protocol between the two

9
Core / Nebula
  • Core: group membership small and stable
  • Rebalance by unanimous agreement (2-phase commit)
  • All core sites in the same epoch
  • Nebula: arbitrary membership; large, dynamic
  • Communicate only with sites in the same epoch
  • Catch up to the rebalance to join the core epoch
11
Catch-up protocol summary
12
Catch-up protocol
[Figure: two replicas of the "INRIA" tree; one applies ins(L,00) and ins(',001), the other del(1)]
13
Catch-up protocol
[Figure: one replica applies ins(L,00) and ins(',001) while the other rebalances]
14
Catch-up protocol
[Figure: the catching-up site replays its updates against the rebalanced tree]
15
Catch-up protocol
[Figure: del(1) followed by rebalance on the core replica]
16
Catch-up protocol
[Figure: both replicas converge on the rebalanced tree]
17
Summary
  • CRDT: convergence ensured; design for commutativity
  • GC cannot be ignored: it requires commitment, a pervasive issue
  • Large-scale commitment: Core / Nebula
  • To synchronise: catch-up and migration

18
Future work
  • More CRDTs
  • Understanding CRDTs: what invariants can be CRDTized?
  • Approximations of CRDTs
  • Data types for consistent cloud computing without concurrency control

20
Replicated sequence
21
State of the art
  • Serialising updates
  • Single, global execution order
  • Lock-step: poor user experience; doesn't scale
  • Operational Transformation
  • Local execution orders
  • Modify arguments to account for concurrent operations scheduled before
  • Weak theory, hidden complexity
  • Insight: design the data type to be commutative

22
Commutative Replicated Data Type (CRDT)
  • Assuming:
  • All concurrent operations commute
  • Non-concurrent operations execute in happens-before order
  • All operations eventually execute at every replica
  • Then replicas eventually converge to the correct value
  • Design data types to support commutative operations

23
Concurrent inserts
  • Exception to the binary tree: disambiguators
  • Concurrent inserts at the same position are ordered by disambiguator
  • Path + site-ID: ⟨01, disambiguator⟩
  • Alternatives:
  • Site identifier of the initiator: short, but delete leaves a tombstone
  • Unique ID of the operation: long, but allows immediate delete
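The sibling ordering above can be sketched in a few lines. This is an illustrative simplification (not the paper's exact encoding): concurrent inserts that pick the same tree path become siblings ordered by a site-ID disambiguator, so every replica sorts them identically.

```python
# Sketch: an extended TID is (path, site_id). Concurrent inserts at the same
# path are ordered by the site-ID disambiguator, identically at all replicas.

def order_key(tid):
    path, site_id = tid
    # Path determines infix position; the disambiguator breaks ties.
    return ([2 * int(b) for b in path] + [1], site_id)

a = (("01", 3), "x")   # site 3 inserts "x" at path 01
b = (("01", 7), "y")   # site 7 concurrently inserts "y" at the same path

# Both replicas sort the same way regardless of arrival order.
one = sorted([a, b], key=lambda e: order_key(e[0]))
two = sorted([b, a], key=lambda e: order_key(e[0]))
assert one == two == [a, b]
```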

24
Causal ordering
  • Vector clocks
  • Count messages received from each site
  • Provide causal ordering
  • Filter duplicate messages
  • Efficient, but grow indefinitely
  • Treedoc
  • TID encodes causal order
  • Duplicates are idempotent
  • Approximate Vector Clocks + Treedoc
25
Rebalance requires commitment
  • Commutativity of update vs. rebalance: updates win
  • Rebalance does not impact update performance
  • Rebalance: unanimous agreement
  • Standard 2- or 3-phase commit
  • The initiator is the coordinator
  • Other sites: if a concurrent update exists, vote Not OK
  • Off the critical path!
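The voting rule on this slide can be sketched as follows. This is an assumed shape of the protocol, inferred from the bullets (the names `vote` and `rebalance_commit` are illustrative, not from the paper): the initiator coordinates a 2-phase commit, and a site votes Not OK if it holds updates concurrent with the proposed rebalance.

```python
# Sketch of the unanimous rebalance vote (assumed protocol shape).

def vote(pending_updates):
    """A site votes Not OK if it has updates concurrent with the rebalance."""
    return "OK" if not pending_updates else "Not OK"

def rebalance_commit(sites):
    """2PC coordinator: commit only on unanimous OK from all core sites."""
    votes = [vote(pending) for pending in sites.values()]
    return "commit" if all(v == "OK" for v in votes) else "abort"

assert rebalance_commit({"s1": [], "s2": []}) == "commit"
assert rebalance_commit({"s1": [], "s2": ["ins(x, 01)"]}) == "abort"
```

Because the vote can be retried later, an abort only delays the rebalance; it never blocks ordinary updates, which is what keeps agreement off the critical path.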

26
Experiments
  • Estimate overheads, compare design alternatives
  • Atom granularity: word vs. line
  • Disambiguator: site-ID + tombstone vs. unique ID + immediate delete
  • Are trees unbalanced?
  • Effect of rebalance, heuristics
  • Replay update histories of CVS, SVN, Wiki repositories

27
Implementation alternatives
  • Disambiguator: next slide
  • Atom: character, word, line, paragraph
  • Fine grain: structural overhead
  • Coarse grain: delete artefacts
  • Data structure:
  • Tree: no TID storage, but no GC of interior nodes
  • vs. (TID, atom) pairs: flexible, GC possible
  • Mixed
  • Arity: binary vs. 256-ary

28
Disambiguator design
  • Disambiguator options:
  • 1 byte, no GC until rebalance
  • or 10 bytes, immediate GC (if a leaf)
  • Stored in every node
  • Intuitively, which do you think is best?

I thought option 1... but I was wrong
29
LaTeX files
30
LaTeX / line
31
Summarize deleted nodes (mean)
32
Atom granularity and deletes
(200 revisions only)
33
Wikipedia GWB benchmark
  • en.wikipedia.org/George_W_Bush
  • 150 kB of text
  • 42,000 revisions: the most frequently revised page
  • Biggest revision: 100 kB
  • Benchmark data:
  • Treedoc node = paragraph
  • First 15,000 revisions: 350,000 updates
  • Biggest revision < 2 s; average 0.24 s/revision
  • Rebalance every 1,000 revisions
  • 256-ary tree

34
Wikipedia GWB page
[Figure: node counts per 1,000 ops: total nodes, tombstones, live]
35
Time per operation
[Figure: µs per operation, per 1,000 ops, with vs. without rebalance]
36
Size, frequency of TIDs
37
flat vs. WOOT vs. Treedoc
38
Summary: garbage collection
  • Efficiency and GC are important
  • Tree rebalance requires commitment (move it off the critical path)
  • A pervasive issue
  • Large-scale commitment:
  • Core: commitment
  • Nebula: asynchronous updates
  • Occasional synchronisation: migration

39
Summary: CRDT
  • CRDT: convergence ensured; design for commutativity
  • Techniques for commutativity:
  • Partial order
  • Non-destructive updates
  • Identifiers don't depend on concurrent activity
  • Consensus off the critical path

40
Commutativity: genuine vs. precedence
  • Commutativity: ∀S, u, v: S.u.v = S.v.u
  • Genuine: both u and v take effect
  • Addition, subtraction are commutative
  • Non-destructive updates
  • Precedence: only one of u, v ultimately takes effect
  • File writes commute under Last-Writer-Wins
  • Destructive updates
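The precedence case can be illustrated with a Last-Writer-Wins register (a generic sketch, not from the talk): the two writes commute only because the one with the higher timestamp overrides the other, a destructive update.

```python
# Sketch of precedence commutativity: Last-Writer-Wins register.
# State and writes are (timestamp, value) pairs; higher timestamp wins.

def lww_merge(state, write):
    ts, _ = state
    w_ts, _ = write
    return write if w_ts > ts else state

u = (2, "contents from site A")
v = (5, "contents from site B")
s0 = (0, "")

# Either application order yields the same state, but only v takes effect.
assert lww_merge(lww_merge(s0, u), v) == lww_merge(lww_merge(s0, v), u) == v
```

Contrast with addition: both increments survive in a counter (genuine commutativity), whereas here u leaves no trace in the final state.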

41
Future work
  • Integrate with a real editor
  • (Single-document) transactions
  • Generalisation:
  • Characterise invariants
  • More data types: set, multilog, others?
  • Improve mixed data types
  • cf. persistent data structures
  • When consensus is required:
  • Move it off the critical path
  • Speculation + conflict resolution