Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables




1
Viceroy: Scalable Emulation of Butterfly Networks
For Distributed Hash Tables
  • By Dahlia Malkhi, Moni Naor, and David Ratajczak
  • Nov. 11, 2003
  • Presented by Zhenlei Jia
  • Nov. 11, 2004

2
Acknowledgments
  • Some of the following slides are adapted from
    the slides created by the authors of the paper

3
Outline
  • Outline
  • DHT Properties
  • Viceroy
  • Structure
  • Routing Algorithm
  • Join/Leave
  • Bounding In-degree: Bucket Solution
  • Fault Tolerance
  • Summary

4
DHT
  • What's a DHT?
  • Store (key, value) pairs
  • Lookup
  • Join/Leave
  • Examples
  • CAN, Pastry, Tapestry, Chord etc.

5
DHT Properties
  • Dilation
  • Efficient lookup, usually O(log(n))
  • Maintenance cost
  • Support dynamic environment
  • Control messages, affected servers
  • Degree
  • Number of open connections
  • Servers impacted by node join/leave
  • Heartbeat, graceful leave

6
DHT Properties (cont.)
  • Congestion
  • Peers should share the routing load evenly
  • Load (of a node): the probability that it is on a
    route with random source and destination.
  • If the path length is O(log(n)), then on average each
    node is on n^2 x O(log(n))/n = O(n log(n)) routes.
    Average load = O(n log(n))/n^2 = O(log(n))/n
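
The averaging argument can be checked numerically. The sketch below is an illustration only: it assumes a fully populated Chord-like ring of n = 2^b ids with power-of-two fingers (not Viceroy itself), counts the source and destination as being "on" the route, and uses the hypothetical names `route` and `loads`.

```python
def route(s, t, n):
    """Greedy route from s to t on a full ring of n = 2^b ids.

    Assumption: every id 0..n-1 is a node with fingers at distances
    1, 2, 4, ..., n/2 (a Chord-like layout used only for illustration).
    """
    path = [s]
    cur = s
    while cur != t:
        d = (t - cur) % n                  # remaining clockwise distance
        step = 1 << (d.bit_length() - 1)   # largest power of two <= d
        cur = (cur + step) % n
        path.append(cur)
    return path


def loads(n):
    """Fraction of the n^2 (source, destination) routes that visit each node."""
    on_route = [0] * n
    for s in range(n):
        for t in range(n):
            for v in route(s, t, n):
                on_route[v] += 1
    return [count / (n * n) for count in on_route]
```

For n = 64 (b = 6) the average route visits b/2 + 1 = 4 nodes, so every node carries load 4/64, in line with the O(log(n))/n estimate.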

7
Previous Works
8
Intuition
  • A route is a combination of links of appropriate
    sizes
  • Chord: each node has ALL log(n) links
  • Viceroy
  • Each node has ONE of the long-range links
  • A link of length 1/2^k points to a node that has a
    link of length 1/2^(k+1)

Chord

9
A Butterfly Network
  • Each node has ONE of the long-range links
  • A link of length 1/2^k points to a node that has a
    link of length 1/2^(k+1)
  • Nodes share each other's long links
  • Routing
  • Route to root
  • Route to right group
  • Route to right level
  • Path: O(log(n))
  • Degree: O(1)

10
A Viceroy network
  • Ideally, there should be log(n) levels
  • There is no global counter
  • Later, we will see how a node can estimate log(n)
    locally

11
Structure: Nodes
  • Node
  • Id: 128-bit binary string, u
  • Level: positive integer, u.level
  • Order of ids
  • b_1 b_2 ... b_k → Σ_{i=1..k} b_i / 2^i
  • Each node has a SUCCESSOR and a PREDECESSOR
  • SUCC(u), PRED(u)
  • Node u stores the keys k such that u ≤ k < SUCC(u)
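
The id ordering and the key-ownership rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `point` and `owner` are hypothetical, ids are short bit strings rather than 128 bits, and the wrap-around (keys below the smallest id go to the largest id) is an assumption that treats the id space as a ring.

```python
import bisect


def point(bits):
    """Map a bit string b_1 b_2 ... b_k to sum_i b_i / 2^i in [0, 1)."""
    return sum(int(b) / 2 ** i for i, b in enumerate(bits, start=1))


def owner(sorted_ids, key):
    """Return the id u with u <= key < SUCC(u).

    Assumption: the id space wraps around, so a key smaller than every
    id is stored at the largest id (index -1 after the subtraction).
    """
    i = bisect.bisect_right(sorted_ids, key) - 1
    return sorted_ids[i]
```

For example, `point("101")` is 1/2 + 1/8, and with ids 0.1, 0.4, 0.8 the key 0.5 is stored at the node with id 0.4.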

12
Structure: Nodes (cont.)
  • Lemma 2.1
  • Let n0 = 1/d(x, SUCC(x)); then w.h.p. (i.e. with
    probability p > 1 - 1/n^(1+ε)),
  • log(n) - log(log(n)) - O(1) < log(n0) ≤ 3 log(n)
  • Node x selects its level from 1..log(n0) uniformly
    at random
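
A node can apply this estimate with purely local information. The sketch below assumes ids are points in [0, 1), logarithms are base 2, and log(n0) is rounded down and clamped to at least 1; the names `estimate_log_n` and `choose_level` are hypothetical.

```python
import math
import random


def estimate_log_n(x_id, succ_id):
    """Return log2(n0), where n0 = 1 / d(x, SUCC(x)) on the unit ring."""
    d = (succ_id - x_id) % 1.0   # clockwise distance to the successor
    return math.log2(1.0 / d)


def choose_level(x_id, succ_id, rng=random):
    """Pick a level uniformly from 1..floor(log2(n0)).

    Assumption: the upper bound is clamped to 1 when the successor is
    more than half the ring away.
    """
    top = max(1, math.floor(estimate_log_n(x_id, succ_id)))
    return rng.randint(1, top)
```

With a successor 1/4 of the ring away, the estimate is log2(4) = 2, so the node picks level 1 or 2.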

13
Structure: Links
  • A node u in level k has six out-links
  • 2 x Short: SUCCESSOR, PREDECESSOR
  • 2 x Medium: (left) the closest level-(k+1) node whose
    id matches the first k bits of u.id and is smaller than u.id.
  • 1 x Long: the closest level-(k+1) node with
    prefix u_1 ... u_(k-1) (1-u_k) (?)
  • u_1 ... u_(k-1) (1-u_k) u_(k+1) ... u_w
  • where w = log(n0) - log(log(n0))
  • 1 x Parent: the closest level-(k-1) node
  • Also keeps track of in-bound links

14
Structure: Links

15
Routing Algorithm
  • LOOKUP(x, y)
  • Initialization: set cur to x
  • Proceed to root: while cur.level > 1
  • cur = cur.parent
  • Greedy search
  • If cur.id ≤ y < SUCC(cur).id, return cur.
  • Otherwise, choose m from the links of cur that
    minimizes d(m, y), move to m and repeat.
  • Demo: http://www.cs.huji.ac.il/labs/danss/anatt/viceroy.html
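
The two phases of LOOKUP can be sketched on a simplified overlay. Everything below is an assumption made for illustration: 32-bit integer ids, d(m, y) taken as clockwise ring distance, a single medium link (the closest level-(k+1) node, without prefix matching), a long link aimed at the point 1/2^k of the ring away, and the names `build`, `lookup`, and `true_owner`. It requires at least two nodes.

```python
import bisect
import random

RING = 1 << 32  # assumption: 32-bit ids instead of 128-bit ids


def cw(a, b):
    """Clockwise distance from id a to id b on the ring."""
    return (b - a) % RING


class Node:
    def __init__(self, ident, level):
        self.id, self.level = ident, level
        self.succ = self.parent = None
        self.links = []  # every outgoing link, used by the greedy phase


def closest(pool, target):
    """Node in pool nearest to target in either direction (None if empty)."""
    return min(pool, key=lambda v: min(cw(v.id, target), cw(target, v.id)),
               default=None)


def build(n, levels, seed=0):
    """Create n >= 2 nodes with random ids and levels, then wire the links."""
    rng = random.Random(seed)
    nodes = [Node(i, rng.randint(1, levels))
             for i in sorted(rng.sample(range(RING), n))]
    by_level = {k: [u for u in nodes if u.level == k]
                for k in range(levels + 2)}
    for i, u in enumerate(nodes):
        u.succ = nodes[(i + 1) % n]
        u.parent = closest(by_level[u.level - 1], u.id)
        down = by_level[u.level + 1]
        long_target = (u.id + (RING >> u.level)) % RING  # 1/2^k of the ring
        candidates = [u.succ, nodes[i - 1], u.parent,
                      closest(down, u.id), closest(down, long_target)]
        u.links = [v for v in candidates if v is not None]
    return nodes


def lookup(x, y):
    """Return (node storing key y, number of hops), starting at node x."""
    cur, hops = x, 0
    while cur.level > 1 and cur.parent is not None:   # proceed to root
        cur, hops = cur.parent, hops + 1
    while cw(cur.id, y) >= cw(cur.id, cur.succ.id):   # cur does not own y
        cur = min(cur.links, key=lambda m: cw(m.id, y))  # greedy step
        hops += 1
    return cur, hops


def true_owner(nodes, y):
    """Reference answer by binary search over the sorted node list."""
    ids = [u.id for u in nodes]
    return nodes[bisect.bisect_right(ids, y) - 1]
```

Because the successor is always among the links and always reduces the clockwise distance to y, the greedy loop in this sketch terminates. The hop counts it produces depend on the simplified links and should not be read as Viceroy's bounds.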

16
Routing Example
17
One Observation
18
Routing Analysis (1)

19
Routing Analysis (2)
  • Expected path length: O(log(n))
  • log(n) to reach a level-1 node
  • log(n) for traveling among clusters
  • log(n) for the final local search

20
Routing Theorems
  • Theorem 4.4
  • The path length from x to y is O(log(n)) w.h.p.
  • Proof is based on several lemmas
  • Lemma 4.1
  • For every node u with level u.level <
    log(n) - log(log(n)), the number of nodes between u
    and u.Medium-left (Medium-right), if it exists,
    is at most 6 log^2(n) w.h.p.

21
Routing Theorems (2)
  • Lemma 4.2
  • In the greedy search phase of a lookup of value
    Y from node x, let the jth greedy step v_j, for 1 ≤
    j ≤ m, be such that v_j is more than O(log^2(n))
    nodes away from Y. Then w.h.p. node v_j is reached
    over a Medium or Long link, and hence satisfies
    v_j.level = j and v_j[j] = Y[j].
  • m = log(n) - 2 log(log(n)) - log(3e)
  • W.h.p. within m steps, we are n/2^m 6 log^2(n)
    nodes away from the destination

22
Routing Theorems (3)
  • Lemma 4.3
  • Let v be a node that is O(log^2(n)) nodes away
    from the target y. Then w.h.p., within O(log(n))
    greedy steps the target y is reached from v.
  • Theorem 4.4
  • The total length of a route from x to y is
    O(log(n)) w.h.p.
  • Theorem 4.6
  • Expected load on every node is O(log(n)/n).
  • The load on every node is O(log^2(n)/n) w.h.p.
  • Theorem 4.7
  • Every node u has in-degree O(log(n)) w.h.p.

23
Join Algorithm
  • 1. Choose identifier: select a random 128-bit string
    x_1 x_2 ... x_128
  • 2. Set up short links: invoke LOOKUP(x), let x' be
    the result node. Insert x between x' and
    x'.SUCCESSOR.
  • 3. Choose level: let k be the maximal number of
    matching prefix bits between x and either SUCC(x)
    or PRED(x); choose the level from 1..k.
  • 4. Set parent link: if SUCC(x) has level x.level-1,
    set x.parent to it. Otherwise, move to SUCC(x)
    and repeat.
  • 5. Set long link: p = x_1 ... x_(k-1) (1-x_k) x_(k+1) ... x_w
  • Invoke LOOKUP(p); stop after a node at level
    x.level+1 that matches p is reached.

24
Join Algorithm (cont.)
  • 6. Set medium links: denote p = x_1 x_2 ... x_(x.level).
    If SUCC(x) has prefix p and level x.level+1, set
    x.Medium-right link to it. Otherwise, move to
    SUCC(x) and repeat.
  • 7. Set inbound links: denote p = x_1 x_2 ... x_(x.level).
  • Set inbound Medium links: following SUCC links,
    so long as successor y has prefix p and a level
    different from x.level, if y.level = x.level-1,
    set y.Medium-left to x.
  • Set inbound Long links: following SUCC links,
    find y that has a prefix matching p and has level
    x.level. Take any inbound links that are closer to
    x than to y.
  • Set inbound Parent links: following PRED links,
    find y such that y.level = x.level+1. Repeat
    until meeting a node in the same level as x.
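
Two of the bit-string computations in the join algorithm, the matching-prefix count used in the "choose level" step and the target prefix p used in the "set long link" step, can be sketched as follows. Ids are written as Python strings of '0' and '1' for readability; the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading bits on which bit strings a and b agree."""
    count = 0
    for bit_a, bit_b in zip(a, b):
        if bit_a != bit_b:
            break
        count += 1
    return count


def long_link_prefix(bits, k, w):
    """Return x_1 ... x_(k-1) (1 - x_k) x_(k+1) ... x_w.

    bits is the node id as a bit string, k is the (1-based) position of
    the bit to flip, and w is the number of bits kept in the prefix.
    """
    flipped = "1" if bits[k - 1] == "0" else "0"
    return bits[:k - 1] + flipped + bits[k:w]
```

For instance, ids 0110 and 0101 share a 2-bit prefix, and flipping bit 2 of 011010 with w = 4 gives the target prefix 0010.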

25
Join Example
Set long link: p = x_1 ... x_(k-1) (1-x_k) ... x_w; stop at level
k+1? In this case, find 00
Set Parent link: following SUCC links, find a node
that has level k-1.

0111

26
Join Analysis
  • LOOKUP takes O(log(n)) messages w.h.p.
  • Travel on short links during the link-setting phase
    is O(log^2(n)) w.h.p.
  • A Medium link is within 6 log^2(n) nodes of x
    w.h.p.
  • Similar for others
  • Theorem 5.1
  • A JOIN operation by a new node x incurs an expected
    O(log(n)) messages, and O(log^2(n))
    messages w.h.p.
  • The expected number of nodes that change their
    state as a result of x's join is constant, and
    w.h.p. is O(log(n)).
  • Because node x has O(log(n)) in-degree w.h.p.
  • Similar results hold for LEAVE.

27
Bounding In-degrees
  • Theorem 4.7
  • Every node has expected constant in-degree, and
    has O(log(n)) in-degree w.h.p.
  • In-degree of servers affected by join/leave
  • How to guarantee constant in-degree?
  • Bucket solution
  • A background process to balance the assignment of
    levels

28
Bucket Solution: Intuition
  • Node x has log(n) in-degree, assuming Medium-right
  • Too many nodes at level k-1
  • Too few nodes at level k
  • Improve the level selection procedure

29
Bucket Solution
  • The name space is divided into non-overlapping
    buckets.
  • A bucket contains m nodes, where log(n) ≤ m ≤
    c log(n), for c > 2.
  • In a bucket, levels are NOT assigned randomly
  • For each 1 ≤ j ≤ log(n), there are 1..c nodes at
    level j in each bucket
  • In(x) < 7c (?? 2c)

30
Maintaining Bucket Size
  • n can be accurately estimated
  • When the bucket size exceeds c log(n), the bucket is
    split into two equal-size buckets.
  • When the bucket size drops below log(n), it is merged
    with a neighbor bucket.
  • Furthermore, if the merged bucket is larger
    than log(n) x (2c+2)/3, the new bucket is split
    into two buckets.
  • (c+1)/3 > 1 since c > 2
  • Buckets are organized into a ring, which can be
    merged or split with O(1) messages.
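
The split and merge thresholds can be written out as a small sketch. It tracks bucket sizes only, assumes base-2 logarithms and an exactly known n, and uses the hypothetical function names `split_if_needed` and `merge_if_needed`.

```python
import math


def split_if_needed(size, n, c):
    """Split a bucket into two near-equal halves once it exceeds c*log(n)."""
    if size > c * math.log2(n):
        return [size // 2, size - size // 2]
    return [size]


def merge_if_needed(size, neighbor, n, c):
    """Merge an undersized bucket with its neighbor, re-splitting if large.

    Returns the resulting bucket sizes. The re-split threshold
    log(n) * (2c + 2) / 3 keeps each half above log(n) because
    (c + 1) / 3 > 1 for c > 2.
    """
    log_n = math.log2(n)
    if size >= log_n:
        return [size, neighbor]          # nothing to do
    merged = size + neighbor
    if merged > log_n * (2 * c + 2) / 3:
        return [merged // 2, merged - merged // 2]
    return [merged]
```

With n = 1024 and c = 3, buckets hold between 10 and 30 nodes: a bucket of 31 splits into 15 and 16, while a bucket of 9 merged with a neighbor of 28 gives 37 nodes and is re-split into 18 and 19.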

31
Maintain Level Property
  • Node join/leave without merging or splitting: O(1)
  • Join: size < c log(n), choose a level that has
    fewer than c nodes
  • Leave: if it is the only node in its level j, find
    another level that has two nodes, and reassign level
    j to one of them.
  • Bucket merge or split may result in a
    reassignment of the levels to all nodes in the
    bucket(s): O(log(n))
  • Merging/splitting are expensive, but they do not
    happen very often
  • After a merging or splitting of buckets, at least
    log(n) x (c-2)/3 JOIN/LEAVE operations must happen in this
    bucket before another merging or splitting of this
    bucket is performed
  • Amortized overhead: c/((c-2)/3) = O(1) for c > 2
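
The per-bucket level bookkeeping can be sketched with a plain dictionary from level to node count. This is a hedged illustration: choosing the least populated level on join is one possible policy (the slides only require a level with fewer than c nodes), and the names `join_level`, `leave`, and `amortized_overhead` are hypothetical.

```python
def join_level(counts, c):
    """Assign a level to a joining node; counts maps level -> node count.

    Assumption: pick the least populated level, which has fewer than c
    nodes whenever the bucket is not yet due for a split.
    """
    level = min(counts, key=counts.get)
    if counts[level] >= c:
        raise ValueError("bucket is full; split it first")
    counts[level] += 1
    return level


def leave(counts, level):
    """Remove a node at `level`; if the level empties, move a donor in.

    Returns the donor level, or None when no reassignment was needed.
    """
    counts[level] -= 1
    if counts[level] > 0:
        return None
    donor = next((j for j, k in counts.items() if k >= 2), None)
    if donor is None:
        raise ValueError("no level has two nodes; bucket must merge")
    counts[donor] -= 1
    counts[level] += 1
    return donor


def amortized_overhead(c):
    """c / ((c - 2) / 3), written as 3c / (c - 2); constant for fixed c > 2."""
    return 3 * c / (c - 2)
```

For c = 3 the amortized factor is 9, and for c = 5 it is 5, so a larger c trades bigger buckets for cheaper amortized maintenance.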

32
Amortized analysis
33
Fault Tolerance
  • Viceroy has no built-in support for fault
    tolerance
  • Viceroy requires graceful leave
  • Leaves are NOT the same as failures
  • Performance is sensitive to failure
  • External techniques
  • Thickening Edges
  • State Machine Replication

34
State Machine Replication
Super node
Viceroy nodes
35
Related Works
  • De Bruijn Graph Based Network
  • Distance halving
  • D2B
  • Koorde
  • Others
  • Symphony (Small world model)
  • Ulysses (ButterFly, log(n), log(n)/loglogn)

36
Summary
  • Constant out-degree
  • Expected constant in-degree
  • O(log(n)) w.h.p.
  • O(1) with bucket solution
  • O(log(n)) path length w.h.p.
  • Expected log(n)/n load
  • O(log^2(n)/n) w.h.p.
  • Weakness/improvements
  • Not Locality Aware
  • No Fault Tolerance Support
  • Due to the lack of flexibility of ButterFly
    network

37
Question
Photo by Peter J. Bryant