Title: Introduction to Structured Overlay Networks
1 Introduction to Structured Overlay Networks
2 Presentation Overview
- Gentle introduction to Structured Overlay Networks and Distributed Hash Tables
- Chord algorithms and others
3 What's a Distributed Hash Table (DHT)?
- An ordinary hash table, which is distributed
- Every node provides a lookup operation
- Provide the value associated with a key
- Nodes keep routing pointers
- If item not found, route to another node
11/20/2009
4 So what?
Time to find data is logarithmic
Size of routing tables is logarithmic
Example: log2(1000000) ≈ 20
EFFICIENT!
Store number of items proportional to number of nodes
Typically, with D items and n nodes:
Store D/n items per node
Move D/n items when nodes join/leave/fail
EFFICIENT!
- Self-management of routing info
- Ensure routing information is up-to-date
- Self-management of items
- Ensure that data is always replicated and
available
- Characteristic properties
- Scalability
- Number of nodes can be huge
- Number of items can be huge
- Self-management in the presence of joins/leaves/failures
- Routing information
- Data items
5 Traditional Motivation (1/2)
- Peer-to-Peer file sharing very popular
- Napster
- Completely centralized
- Central server knows who has what
- Judicial problems
- Gnutella
- Completely decentralized
- Ask everyone you know to find data
- Very inefficient
central index
decentralized index
6 Traditional Motivation (2/2)
- Grand vision of DHTs
- Provide efficient file sharing
- Quote from Chord: "In particular, Chord can help avoid single points of failure or control that systems like Napster possess, and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts." [Stoica et al. 2001]
- Hidden assumptions
- Millions of unreliable nodes
- User can switch off computer any time (leave = failure)
- Extreme dynamism (nodes joining/leaving/failing)
- Heterogeneity of computers and latencies
- Untrusted nodes
7 Motivation: DHT overlay as communication infrastructure
- Internet communication
- IP/port, TCP and UDP
- Not suited for 21st century computing
- Firewalls
- NATs
- Changing IP addresses
8 Name-based communication
- DHTs can overcome these problems
- How?
- Use the DHT
- Map names to locations
- Bypass firewalls and NATs by routing through
neighbors
9 Name-based communication
- What about group communication?
- IP Multicast is not enabled on the Internet
- Use the overlay to broadcast to all nodes
- Create multiple groups, broadcast within each
10 What's it good for?
- Let's look at 10 applications built using such systems
11 Distributed Backup
- Setup
- Clients installed the backup tool
- Decide on amount of space to share
- Choose files for backup
- Regular backup
- Data is encrypted
- Stored in the directory
12 Distributed File System
- Similar to AFS and NFS
- Files stored in directory
- What is new?
- Application logic self-managed
- Add/remove servers on the fly
- Automatically handles failures
- Automatically load-balances
- No manual configuration needed
13 P2P Cache
- A distributed cache
- Every node in an org. runs a client
- Want to browse a web page?
- If it exists locally -> download it from a peer
- Otherwise, fetch and cache
- No central proxy needed
14 P2P Web Servers
- Distributed Web Server
- Pages stored in the directory
- What is new?
- Application logic self-managed
- Automatically load-balances
- Add/remove servers on the fly
- Automatically handles failures
15 P2P SIP
- Session Initiation Protocol
- Used to initiate calls on the Internet
- Is being standardized
- Use the directory to find end-hosts
- Improving Skype
16 Host Identity Payload (HIP)
- Uses the directory to provide seamless mobility
- Unlike Mobile IP
- No home agent needed
- Self-managing
17 PIER (databases)
- A relational view of the directory
- Use SQL to fetch data
- Standard operations (projection, selection, equi-join)
18 Summary
- DHT is a useful data structure
- Assumptions mentioned might not be true
- Moderate amount of dynamism
- Leave not same thing as failure
- Dedicated servers
- Nodes can be trusted
- Less heterogeneity
19 Chord as Example of DHT
20 How to construct a DHT (Chord)?
- Use a logical name space, called the identifier space, consisting of identifiers {0, 1, 2, ..., N-1}
- Identifier space is a logical ring modulo N
- Every node picks a random identifier through a hash function H
- Example
- Space N = 16, {0, ..., 15}
- Five nodes a, b, c, d, e
- a picks 6
- b picks 5
- c picks 0
- d picks 11
- e picks 2
21 Definition of Successor
- The successor of an identifier is the
- first node met going in clockwise direction
- starting at the identifier
- Example
- succ(12) = 14
- succ(15) = 2
- succ(6) = 6
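The successor definition above can be sketched in a few lines of Python. This is only an illustration: the slide's figure is not available, so the node set {2, 6, 14} is an assumption chosen because it agrees with the three example values, and the 16-identifier space is taken from the earlier Chord example.

```python
from bisect import bisect_left

def succ(identifier, nodes, space=16):
    """Return the first node met going clockwise, starting at identifier."""
    ring = sorted(nodes)
    i = bisect_left(ring, identifier % space)
    # Past the largest node id, wrap around to the smallest one.
    return ring[i % len(ring)]

# Hypothetical node set, chosen only to agree with the example values.
nodes = [2, 6, 14]
results = {k: succ(k, nodes) for k in (12, 15, 6)}
print(results)
```

Note that an identifier that is itself a node id is its own successor, which is why the search uses bisect_left rather than bisect_right.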
22 Where to store data (Chord)?
- Use globally known hash function, H
- Each item <key,value> gets
- identifier H(key) = k
- Store each item at its successor
- Node n = succ(k) is responsible for item k
- Example
- H(Marina) = 12
- H(Peter) = 2
- H(Seif) = 9
- H(Stefan) = 14
Store number of items proportional to number of nodes
Typically (on average), with D items and n nodes:
Store D/n items per node
Move D/n items when nodes join/leave/fail
EFFICIENT!
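As a rough sketch of the placement rule, the following Python stores each example item at the successor of its identifier. It assumes the five node identifiers from the earlier example (0, 2, 5, 6, 11). The hash function H is not specified in the slides, so the identifiers are simply tabulated rather than computed.

```python
from bisect import bisect_left

SPACE = 16
NODES = [0, 2, 5, 6, 11]   # node identifiers from the earlier example
# H is not specified in the slides, so the example identifiers are tabulated.
ITEM_IDS = {"Marina": 12, "Peter": 2, "Seif": 9, "Stefan": 14}

def succ(identifier):
    """First node met going clockwise from identifier (inclusive)."""
    i = bisect_left(NODES, identifier % SPACE)
    return NODES[i % len(NODES)]

def place_items(item_ids):
    """Map every node to the list of item keys it is responsible for."""
    placement = {node: [] for node in NODES}
    for key, ident in item_ids.items():
        placement[succ(ident)].append(key)
    return placement

placement = place_items(ITEM_IDS)
print(placement)
```

With only four items the load is uneven; the D/n figure is an average that holds when identifiers are spread uniformly over many items and nodes.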
23 Where to point (Chord)?
- Each node points to its successor
- The successor of a node n is succ(n+1)
- Known as a node's succ pointer
- Each node points to its predecessor
- First node met in anti-clockwise direction starting at n-1
- Known as a node's pred pointer
- Example
- 0's successor is succ(1) = 2
- 2's successor is succ(3) = 5
- 5's successor is succ(6) = 6
- 6's successor is succ(7) = 11
- 11's successor is succ(12) = 0
24 DHT Lookup
- To look up a key k
- Calculate H(k)
- Follow succ pointers until item k is found
- Example
- Lookup Seif at node 2
- H(Seif) = 9
- Traverse nodes
- 2, 5, 6, 11 (BINGO)
- Return "Stockholm" (the value stored for Seif) to initiator
25 DHT Lookup
- (a, b] denotes the segment of the ring moving clockwise from but not including a until and including b
- n.foo(.) denotes an RPC of foo(.) to node n
- n.bar denotes an RPC to fetch the value of the variable bar in node n
- We call the process of finding the successor of an id a LOOKUP
- // ask node n to find the successor of id
- procedure n.findSuccessor(id)
-   if predecessor ≠ nil ∧ id ∈ (predecessor, n] then return n
-   else if id ∈ (n, successor] then
-     return successor
-   else // forward the query around the circle
-     return successor.findSuccessor(id)
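A minimal runnable sketch of the procedure above, assuming that plain method calls stand in for RPCs and that the ring is already correct. The node identifiers are those of the earlier example; the `path` argument and the `build_ring` helper are additions for illustration only.

```python
def in_half_open(x, a, b):
    """True if x lies in the ring segment (a, b], moving clockwise from a."""
    if a < b:
        return a < x <= b
    # The segment wraps past 0; a == b denotes the whole ring.
    return x > a or x <= b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def find_successor(self, ident, path):
        path.append(self.id)  # record every node the query visits
        if self.predecessor is not None and in_half_open(
                ident, self.predecessor.id, self.id):
            return self
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        # Forward the query around the circle.
        return self.successor.find_successor(ident, path)

def build_ring(ids):
    """Hypothetical helper: wire succ/pred pointers for a static ring."""
    nodes = [Node(i) for i in sorted(ids)]
    for k, node in enumerate(nodes):
        node.successor = nodes[(k + 1) % len(nodes)]
        node.predecessor = nodes[k - 1]
    return {n.id: n for n in nodes}

ring = build_ring([0, 2, 5, 6, 11])
path = []
owner = ring[2].find_successor(9, path)
print(owner.id, path)
```

Looking up identifier 9 from node 2 visits nodes 2, 5 and 6; node 6 sees that 9 lies in (6, 11] and answers with node 11, which matches the traversal in the Seif example.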
26 DHT Lookup and Update
- // ask node n to store or fetch the value for id
- procedure n.put(id, value)
-   s := findSuccessor(id)
-   s.store(id, value)
- procedure n.get(id)
-   s := findSuccessor(id)
-   return s.retrieve(id)
- PUT and GET are nothing but lookups!!
27 Speeding up lookups
- If only pointer to succ(n+1) is used
- Worst case lookup time is N, for N nodes
- Improving lookup time (finger/routing table)
- Point to succ(n+1)
- Point to succ(n+2)
- Point to succ(n+4)
- Point to succ(n+8)
- ...
- Point to succ(n+2^(M-1))
- Distance to the destination is always halved
Time to find data is logarithmic
Size of routing tables is logarithmic
Example: log2(1000000) ≈ 20
EFFICIENT!
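To make the pointer rule concrete, here is a small Python sketch that builds the finger table for a node. It assumes the 16-identifier space (M = 4) and the node set 0, 2, 5, 6, 11 from the earlier example, and that additions are taken modulo the size of the identifier space.

```python
import math
from bisect import bisect_left

M = 4                       # identifier bits, so the space has 2**M = 16 ids
SPACE = 2 ** M
NODES = [0, 2, 5, 6, 11]    # node identifiers from the earlier example

def succ(identifier):
    """First node met going clockwise from identifier (inclusive)."""
    i = bisect_left(NODES, identifier % SPACE)
    return NODES[i % len(NODES)]

def finger_table(n):
    """Entry i (1-based) points to succ(n + 2**(i-1)), modulo the id space."""
    return [succ(n + 2 ** (i - 1)) for i in range(1, M + 1)]

for n in NODES:
    print(n, finger_table(n))

# The slide's size estimate: about 20 pointers for a million identifiers.
print(round(math.log2(1_000_000), 2))
```

Several fingers may point to the same node when the ring is sparse, as happens for node 2, whose first two fingers both resolve to node 5.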
28 Chord Routing (1/7)
- Routing table size M, where N = 2^M
- Every node n knows successor(n + 2^(i-1)), for i = 1..M
- Routing entries = log2(N)
- log2(N) hops from any node to any other node
[Figure: identifier ring 0..15 with a Get(15) lookup in progress]
29 Chord Routing (2/7)
[Figure: identifier ring 0..15 with a Get(15) lookup in progress]
30 Chord Routing (3/7)
[Figure: identifier ring 0..15 with a Get(15) lookup in progress]
31 Chord Routing (4/7)
- From node 1, only 2 hops to node 0 where item 15 is stored
- For an id space of 16 ids, the maximum is log2(16) = 4 hops between any two nodes
- In fact, if nodes are uniformly distributed, the maximum is log2(# of nodes), i.e. log2(8) = 3 hops between any two nodes
- The average complexity is
- ½ log2(# of nodes)
[Figure: identifier ring 0..15 showing the Get(15) lookup path]
32 Chord Routing (5/7): Pseudo code findSuccessor(.)
- // ask node n to find the successor of id
- procedure n.findSuccessor(id)
-   if predecessor ≠ nil ∧ id ∈ (predecessor, n] then return n
-   if id ∈ (n, successor] then
-     return successor
-   else
-     n' := closestPrecedingNode(id)
-     return n'.findSuccessor(id)
- // search locally for the highest predecessor of id
- procedure closestPrecedingNode(id)
-   for i = m downto 1 do
-     if finger[i] ∈ (n, id) then return finger[i]
-   end
-   return n
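The pseudo code above can be exercised with a short Python sketch. It assumes a static, already-correct ring, uses method calls in place of RPCs, and reuses the node identifiers 0, 2, 5, 6, 11 with a 16-identifier space from the earlier example (the ring in the routing figure is not recoverable). The `build_ring` helper and the `path` argument are hypothetical additions for illustration.

```python
from bisect import bisect_left

M = 4
SPACE = 2 ** M

def in_open(x, a, b):
    """True if x lies in the ring segment (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b

def in_half_open(x, a, b):
    """True if x lies in the ring segment (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = None
        self.predecessor = None
        self.finger = []          # finger[i-1] holds succ(n + 2**(i-1))

    def closest_preceding_node(self, ident):
        # Scan fingers from the farthest to the nearest.
        for f in reversed(self.finger):
            if in_open(f.id, self.id, ident):
                return f
        return self

    def find_successor(self, ident, path):
        path.append(self.id)      # record every node the query visits
        if in_half_open(ident, self.predecessor.id, self.id):
            return self
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        return self.closest_preceding_node(ident).find_successor(ident, path)

def build_ring(ids):
    """Hypothetical helper: wire pointers and fingers for a static ring."""
    ids = sorted(ids)
    nodes = {i: Node(i) for i in ids}
    def succ(x):
        return nodes[ids[bisect_left(ids, x % SPACE) % len(ids)]]
    for k, i in enumerate(ids):
        node = nodes[i]
        node.successor = succ(i + 1)
        node.predecessor = nodes[ids[k - 1]]
        node.finger = [succ(i + 2 ** (j - 1)) for j in range(1, M + 1)]
    return nodes

ring = build_ring([0, 2, 5, 6, 11])
path = []
owner = ring[2].find_successor(9, path)
print(owner.id, path)
```

From node 2 the query for identifier 9 jumps directly to node 6 through a finger, instead of walking through node 5 as it would with successor pointers alone.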
33 Chord Discussion
- We are basically done
- But...
- What about joins and failures/leaves?
- Nodes come and go as they wish
- What about data?
- Should I lose my doc because some kid decided to shut down his machine and he happened to store my file?
- What about storing addresses of files instead of files?
- What did we gain compared to Gnutella? Increased guarantees and determinism?
- So actually we just started...
34 Agenda
- Handling successor pointers
- Joins, Leaves
- Scalability
- Routing table reducing the cost from O(N) to O(log N)
- Failures (for all the above)
35 Handling Successors: Ring maintenance
- Everything depends on successor pointers, so we better have them right all the time!!
- In Chord, in addition to the successor pointer, every node has a predecessor pointer as well for ring maintenance
36 Handling Dynamism
- Periodic stabilization is used to make pointers eventually correct
- Try pointing succ to closest alive successor
- Try pointing pred to closest alive predecessor
- When receiving notify(p) at n
- if pred = nil or p is in (pred, n]
- set pred := p
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v is in (n, succ]
- set succ := v
- send a notify(n) to succ
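The two rules above can be simulated in a single process. The sketch below is an illustration under stated assumptions: method calls stand in for messages, "periodically" is modelled as a fixed number of rounds over all nodes, joining nodes find their successor through a node they already know, and no node fails. The `run_rounds` and `walk` helpers are hypothetical.

```python
def in_half_open(x, a, b):
    """True if x lies in the ring segment (a, b]; a == b is the whole ring."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self
        self.pred = None

    def find_successor(self, ident):
        # Follow succ pointers until ident falls in (node, node.succ].
        node = self
        while not in_half_open(ident, node.id, node.succ.id):
            node = node.succ
        return node.succ

    def join(self, known):
        self.pred = None
        self.succ = known.find_successor(self.id)

    def notify(self, p):
        if self.pred is None or in_half_open(p.id, self.pred.id, self.id):
            self.pred = p

    def stabilize(self):
        v = self.succ.pred
        if v is not None and in_half_open(v.id, self.id, self.succ.id):
            self.succ = v
        self.succ.notify(self)

def run_rounds(nodes, rounds=5):
    """Stand-in for periodic execution: every node stabilizes each round."""
    for _ in range(rounds):
        for node in nodes:
            node.stabilize()

def walk(start):
    """Collect node ids by following succ pointers once around the ring."""
    ids, node = [], start
    while True:
        ids.append(node.id)
        node = node.succ
        if node is start:
            return ids

first = Node(0)
members = [first]
for ident in (11, 5, 2, 6):
    new = Node(ident)
    new.join(first)
    members.append(new)
    run_rounds(members)

print(walk(first))
```

A joining node only sets its own succ pointer; the existing nodes learn about it through notify and through reading their successor's pred pointer in later rounds.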
37 Handling joins
- When n joins
- Find n's successor with lookup(n)
- Set succ to n's successor
- Stabilization fixes the rest
[Figure: ring segment with nodes 11, 13, and 15]
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v is in (n, succ]
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p is in (pred, n]
- set pred := p
S. Haridi, ID2210, Lecture 02
38 Handling Successors - Chord Algorithm
39 Handling Join/Leaves For Fingers: Finger Stabilization (1/5)
- Periodically refresh finger table entries, and store the index of the next finger to fix
- This is also the initialization procedure for the finger table (copy the finger table of succ, then fix)
- Local variable next initially 0
- procedure n.fixFingers()
-   next := next + 1
-   if next > m then next := 1
-   finger[next] := findSuccessor(n ⊕ 2^(next-1)), where ⊕ is addition modulo the size of the identifier space
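A small sketch of fixFingers in Python. It rests on several assumptions: the identifier space has 64 ids (m = 6), which makes 53 = 21 + 2^5 the start of finger 6 of node N21 as in the example that follows; findSuccessor is replaced by a global `succ` function over the current node set, so the sketch models only the final outcome and not the intermediate stale answers returned while the ring is still stabilizing.

```python
from bisect import bisect_left

M = 6                 # assumed number of identifier bits
SPACE = 2 ** M        # 64 identifiers

def succ(identifier, nodes):
    """Oracle stand-in for findSuccessor on a fully stabilized ring."""
    ring = sorted(nodes)
    return ring[bisect_left(ring, identifier % SPACE) % len(ring)]

class FingerTable:
    def __init__(self, n):
        self.n = n
        self.finger = [None] * (M + 1)   # finger[1..M]; index 0 is unused
        self.next = 0

    def fix_fingers(self, nodes):
        # Refresh exactly one finger per call, cycling through 1..M.
        self.next += 1
        if self.next > M:
            self.next = 1
        start = (self.n + 2 ** (self.next - 1)) % SPACE
        self.finger[self.next] = succ(start, nodes)

nodes = [21, 26, 32, 48, 60]
table = FingerTable(21)
for _ in range(M):
    table.fix_fingers(nodes)
before = table.finger[1:]

nodes.append(56)          # N56 joins and the ring stabilizes
for _ in range(M):
    table.fix_fingers(nodes)
after = table.finger[1:]

print(before, after)
```

Refreshing one finger per call spreads the maintenance cost over time; a full pass over the table takes M calls.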
40 Example: finger stabilization (2/5)
- Current situation: succ(N48) is N60
- Succ(N21.Finger[j].start) = Succ(53) = N21.Finger[j].node = N60
[Figure: ring segment with nodes N21, N26, N32, N48, N60; N21.Finger[j].start is 53 and N21.Finger[j].node is N60]
41 Example: finger stabilization (3/5)
- New node N56 joins and stabilizes successor pointer
- Finger j of node N21 is wrong
- N21 eventually tries to fix finger j by looking up 53, which stops at N48; however, nothing changes
[Figure: ring segment with nodes N21, N26, N32, N48, N56, N60; N21.Finger[j].start is 53 and N21.Finger[j].node is still N60]
42 Example: finger stabilization (4/5)
- N48 will eventually stabilize its successor
- This means the ring is correct now.
[Figure: ring segment with nodes N21, N26, N32, N48, N56, N60; N21.Finger[j].start is 53]
43 Example: finger stabilization (5/5)
- When N21 tries to fix Finger j again, this time the response from N48 will be correct and N21 corrects the finger
[Figure: ring segment with nodes N21, N26, N32, N48, N56, N60; N21.Finger[j].start is 53]
44 Agenda
- Handling successor pointers
- Joins, Leaves
- Scalability
- Routing table reducing the cost from O(N) to O(log N)
- Failures (for all the above)
- Handling data
- Joins, Leaves
45 Handling Failures: Replication of Successors
- Evidently the failure of one successor pointer means total collapse
- Solution: a node has a successor list of size r containing the immediate r successors
- How big should r be? log(# of nodes) or a large constant should be ok
- Enhance periodic stabilization to handle failures
46 Dealing with failures
- Each node keeps a successor-list
- Pointer to r closest successors
- succ(n+1)
- succ(succ(n+1)+1)
- succ(succ(succ(n+1)+1)+1)
- ...
- If successor fails
- Replace with closest alive successor
- If predecessor fails
- Set pred to nil
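The successor-list rule can be sketched as follows, assuming the node set 0, 2, 5, 6, 11 in a 16-identifier space from the earlier example and a known set of failed nodes standing in for a failure detector. The function names are illustrative only.

```python
from bisect import bisect_left

SPACE = 16
NODES = [0, 2, 5, 6, 11]   # node identifiers from the earlier example

def succ(identifier, nodes):
    """First node met going clockwise from identifier (inclusive)."""
    ring = sorted(nodes)
    return ring[bisect_left(ring, identifier % SPACE) % len(ring)]

def successor_list(n, nodes, r):
    """succ(n+1), succ(succ(n+1)+1), ... for the r closest successors."""
    entries, current = [], n
    for _ in range(r):
        current = succ(current + 1, nodes)
        entries.append(current)
    return entries

def first_alive(entries, failed):
    """Closest successor that is not in the failed set, or None."""
    for node in entries:
        if node not in failed:
            return node
    return None

entries = successor_list(2, NODES, r=3)
print(entries, first_alive(entries, failed={5}))
```

If all r entries fail at once the node has no live successor left, which is why r has to grow with the expected failure rate.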
47 Handling leaves
- When n leaves
- Just disappear (like failure)
- When pred detected failed
- Set pred to nil
- When succ detected failed
- Set succ to closest alive in successor list
[Figure: ring segment with nodes 11, 13, and 15]
- Periodically at n
- set v := succ.pred
- if v ≠ nil and v is in (n, succ]
- set succ := v
- send a notify(n) to succ
- When receiving notify(p) at n
- if pred = nil or p is in (pred, n]
- set pred := p
48 Handling Failures - Ring (1/5)
- Maintaining the ring
- Each node maintains a successor list of length r
- If a node's immediate successor fails, it uses the second entry in its successor list
- updateSuccessorList copies a successor list from s, removing the last entry and prepending s
- Join a Chord ring containing node n'
- procedure n.join(n')
-   predecessor := nil
-   s := n'.findSuccessor(n)
-   updateSuccessorList(s.successorList)
49 Handling Failures - Ring (2/5)
- Check whether predecessor has failed (failure detector)
- procedure n.checkPredecessor()
-   if predecessor has failed then
-     predecessor := nil
50 Handling Failures - Ring (3/5)
- procedure n.stabilize()
-   s := first alive node in successorList
-   x := s.predecessor
-   if x not nil and x ∈ (n, s) then s := x end
-   updateSuccessorList(s.successorList)
-   s.notify(n)
- procedure n.notify(n')
-   if predecessor = nil or n' ∈ (predecessor, n) then
-     predecessor := n'
51 Failure - Ring (4/5): Example - Node failure (N26)
[Figure: initial state with nodes N21, N26, N32 and pointers suc(N21,1), suc(N21,2), suc(N26,1), pred(N32)]
- After N21 performed stabilize(), before N21.notify(N32)
[Figure: nodes N21, N26, N32 with pointers suc(N21,1) and pred(N32)]
52 Failure - Ring (5/5): Example - Node failure (N26)
- After N21 performed stabilize(), before N21.notify(N32)
- N21.notify(N32) has no effect
[Figure: nodes N21, N26, N32 with pointers suc(N21,1) and pred(N32)]
- After N32.checkPredecessor()
[Figure: nodes N21, N26, N32 with pointer suc(N21,1)]
- Next N21.stabilize() fixes N32's predecessor
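The N26 failure example can be replayed with a short Python sketch of stabilize, notify and checkPredecessor. Assumptions: the ring contains only N21, N26 and N32 with a successor list of length 2, method calls stand in for RPCs, and an `alive` flag stands in for the failure detector. The extra liveness check on x in `stabilize` goes beyond the slide's pseudo code; without it the sketch would adopt the failed N26 as successor.

```python
def in_open(x, a, b):
    """True if x lies in the ring segment (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.alive = True            # stands in for a failure detector
        self.successor_list = []
        self.predecessor = None

    def update_successor_list(self, s):
        # Copy s's list, drop its last entry, and prepend s itself.
        self.successor_list = [s] + s.successor_list[:-1]

    def check_predecessor(self):
        if self.predecessor is not None and not self.predecessor.alive:
            self.predecessor = None

    def notify(self, other):
        if self.predecessor is None or in_open(
                other.id, self.predecessor.id, self.id):
            self.predecessor = other

    def stabilize(self):
        s = next(n for n in self.successor_list if n.alive)
        x = s.predecessor
        # Assumption beyond the slide: skip x when it is known to have failed.
        if x is not None and x.alive and in_open(x.id, self.id, s.id):
            s = x
        self.update_successor_list(s)
        s.notify(self)

n21, n26, n32 = Node(21), Node(26), Node(32)
n21.successor_list, n21.predecessor = [n26, n32], n32
n26.successor_list, n26.predecessor = [n32, n21], n21
n32.successor_list, n32.predecessor = [n21, n26], n26

n26.alive = False                    # N26 fails
n21.stabilize()
step1 = (n21.successor_list[0].id, n32.predecessor.id)
n32.check_predecessor()
step2 = n32.predecessor
n21.stabilize()
step3 = n32.predecessor.id
print(step1, step2, step3)
```

After the first stabilize, N21 points to N32 but the notify has no effect because N32 still holds N26 as predecessor; only after checkPredecessor clears it does the next stabilize install N21.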
53 Failure - Lookups (1/5)
- // ask node n to find the successor of id
- procedure n.findSuccessor(id)
-   if id ∈ (n, successor] then
-     return successor
-   else
-     n' := closestPrecedingNode(id)
-     try
-       return n'.findSuccessor(id)
-     catch failure of n' then
-       mark n' in finger[.] as failed
-       return n.findSuccessor(id)
- // search locally for the highest predecessor of id
- procedure closestPrecedingNode(id)
-   for i = m downto 1 do
-     if finger[i].node is alive and finger[i] ∈ (n, id) then return finger[i]
-   end
-   return n
54 Some Chord Results: Load balancing of keys
- For any set of N nodes and K keys, with high probability
- Each node is responsible for at most (1 + ε)K/N keys
- When an (N + 1)st node joins or leaves the network, responsibility for O(K/N) keys changes hands (and only to or from the joining or leaving node)
- ε is bounded by (at most) O(log N)
55 Some Chord Results: Load balancing of keys
- ε is reduced to a small constant by running log N virtual nodes (each with its own identifier) on each physical node.
56 Some Chord Results: Lookup is logarithmic in number of Nodes
- With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
- This is only if node and key identifiers are random.
57 Some Chord Results: Successor List Failure
- If we use a successor list of length r = Ω(log N) in a network that is initially stable, and then every node fails with probability 1/2, then with high probability findSuccessor returns the closest living successor to the query key
- Notice that this requires the nodes in the successor list to be random
58 Variations of Chord
59 DKS Routing
- Generalization of Chord to provide arbitrary arity
- Provide logk(n) hops per lookup
- k being a configurable parameter
- n being the number of nodes
- Instead of only log2(n)
60 Achieving logk(n) lookup
- Each node has logk(N) = L levels, N = k^L
- Each level contains k intervals
- Example: k = 4, N = 64 (4^3), node 0
[Figure: identifier ring for node 0 with interval boundaries at 0, 4, 8, 12, 16, 32, 48]
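The level and interval structure can be worked out with a few lines of Python. This is a sketch under one assumption about how the levels nest: level 1 splits the whole identifier space into k equal intervals starting at the node, and each further level splits the node's first interval of the previous level into k again. That reading is consistent with the boundaries 0, 4, 8, 12, 16, 32, 48 in the example for node 0.

```python
def levels(n_ids, k):
    """Return L such that n_ids == k**L, using integer arithmetic only."""
    count, size = 0, 1
    while size < n_ids:
        size *= k
        count += 1
    if size != n_ids:
        raise ValueError("N must be a power of k")
    return count

def interval_starts(node, n_ids, k):
    """For each level, the start identifiers of the node's k intervals."""
    table = {}
    for level in range(1, levels(n_ids, k) + 1):
        width = n_ids // k ** level      # interval size at this level
        table[level] = [(node + j * width) % n_ids for j in range(k)]
    return table

table = interval_starts(0, 64, 4)
print(levels(64, 4), table)
```

With k = 2 the same computation gives the familiar Chord spacing, one doubling per level.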
61 Achieving logk(n) lookup
[Figure: identifier ring for node 0 with interval boundaries at 0, 4, 8, 12, 16, 32, 48]
62 Achieving logk(n) lookup
[Figure: identifier ring for node 0 with interval boundaries at 0, 4, 8, 12, 16, 32, 48]
63 Arity is Important
- Maximum number of hops can be configured
- Example, a 2-hop system
64 Chord#
- The routing table has exponentially increasing pointers on the ring (node space) and NOT the identifier space (skip-list like structure)
65 Routing Table of Chord
- Building the routing table
- log2N pointers
- exponentially spaced pointers
Chord
66 Chord# vs. Chord
Good for load balancing
67 Effect of virtual nodes
68 Stretch (proximity routing)
- Stretch is the ratio between the
- latency of a Chord lookup from the time the lookup is initiated to the time the result is returned to the initiator, and
- latency of an optimal lookup using the underlying network
- Network lookup
- is computed as the round-trip time between the initiator and the server responsible for the queried ID.
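As a worked example of the ratio, the sketch below computes stretch from made-up latencies. All numbers are hypothetical, and the way the overlay lookup latency is composed (forward hops plus one direct reply) is an assumption; the slides only define the two quantities being divided.

```python
def stretch(lookup_latency_ms, direct_rtt_ms):
    """Ratio of overlay lookup latency to the direct round-trip time."""
    if direct_rtt_ms <= 0:
        raise ValueError("direct round-trip time must be positive")
    return lookup_latency_ms / direct_rtt_ms

# Hypothetical measurements, in milliseconds.
hop_latencies_ms = [40, 30, 50]   # overlay hops towards the responsible node
reply_latency_ms = 60             # result returned to the initiator
direct_rtt_ms = 120               # initiator <-> responsible server, round trip

lookup_latency_ms = sum(hop_latencies_ms) + reply_latency_ms
value = stretch(lookup_latency_ms, direct_rtt_ms)
print(value)
```

A stretch of 1 would mean the overlay lookup is as fast as contacting the responsible server directly; proximity routing aims to bring the ratio closer to that.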
69 Stretch