Title: A Scalable Content-Addressable Network
1. A Scalable Content-Addressable Network
- Paper by S. Ratnasamy, P. Francis, M. Handley, R. Karp and S. Shenker
- Presentation by Nick Hurley for CS598RHK
- 15 Oct 2004
2. Introduction
- Hash tables work well in modern systems
- Why not for distributed systems?
- CAN: distributed hash functionality at Internet scale
- Scalable
- Fault-Tolerant
- Self-Organizing
3. Why a CAN?
- Peer-to-peer very popular
- Napster
- Gnutella
- New systems rising fast
- Current systems not scalable
- Napster not entirely decentralized
- Gnutella floods on file request
4. Can P2P be Scalable?
- Most scalability problems in indexing
- Hashing provides scalable indexing
- Such a hash-based system is called a Content-Addressable Network (CAN)
5. Not Only for P2P
- CAN also usable in other problem domains
- Storage Management (OceanStore, Farsite, Publius)
- Wide-Area Name Resolution
6. Basic Design of CANs
- Hash-Like
- Insertions/Deletions/Lookups on (key,value) pairs
- Many individual nodes
- Each node stores subset of hash table (a zone)
- Know adjacent zones for routing purposes
7. Basic Design (cont.)
- Based on Cartesian coordinate space
- d-torus wraps around
- Entirely logical space
- Dynamically partitioned
- Uniform hashing maps data to point
- To store (K, V): hash(K) → P (sketch below)
- P is point in d-torus
- V stored at node owning P
- Uniformity of hash function ensures scalability
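A minimal sketch of the key-to-point mapping, assuming a unit d-torus and SHA-1 as the uniform hash (the paper does not prescribe a particular hash function; all names here are illustrative):

```python
import hashlib

def key_to_point(key: str, d: int = 2) -> tuple:
    """Hash a key to a point in the unit d-torus [0,1)^d.

    Each coordinate comes from a distinct 4-byte slice of the
    SHA-1 digest (so d <= 5 here); a uniform hash spreads keys
    evenly over the coordinate space."""
    digest = hashlib.sha1(key.encode()).digest()
    coords = []
    for i in range(d):
        chunk = digest[4 * i: 4 * i + 4]          # 4 bytes per axis
        coords.append(int.from_bytes(chunk, "big") / 2**32)
    return tuple(coords)

print(key_to_point("some-file.mp3"))  # e.g. (0.83..., 0.24...)
```

The (K, V) pair is then stored at whichever node's zone contains the returned point.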
8. Example CAN
9. Routing in CANs
- Nodes know their neighbors' coordinate zones
- Neighbors: coordinate ranges overlap along d-1 dimensions and abut along the remaining dimension
- Greedy forwarding through neighbors (sketch below)
- For d dimensions and n equal zones:
- Average path length: (d/4)(n^(1/d)) hops
- Path length grows as O(n^(1/d))
- Routing can fail if a node loses all of its neighbors
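A sketch of one greedy forwarding step on the unit d-torus; the `neighbors` list of (node, zone-center) pairs and all names are illustrative, not the paper's API:

```python
def torus_dist(a, b):
    """Cartesian distance on the unit d-torus (coordinates wrap at 1)."""
    return sum(min(abs(x - y), 1 - abs(x - y)) ** 2
               for x, y in zip(a, b)) ** 0.5

def next_hop(target, my_point, neighbors):
    """Greedy step: forward to the neighbor whose zone center is
    closer to the target than we are; None means we own the point."""
    best, best_d = None, torus_dist(my_point, target)
    for node, center in neighbors:
        d = torus_dist(center, target)
        if d < best_d:
            best, best_d = node, d
    return best

# From (0.1, 0.1) toward (0.9, 0.9): "B" wins via the wrap-around.
nbrs = [("A", (0.3, 0.1)), ("B", (0.1, 0.9))]
print(next_hop((0.9, 0.9), (0.1, 0.1), nbrs))  # -> B
```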
10. Example Routing
11. Joining a CAN
- Find a current member M
- CAN functionality is independent of how M is found
- Authors use Yallcast/YOID mechanism
- Choose random point P, send JOIN to P through M
- Node O owning P gives half its zone to the new member N (split sketch below)
- Update neighbor lists
- O informs N of its neighbors
- O updates its own neighbor list
- O informs its neighbors of topology change
- Affects only O(d) existing nodes
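A sketch of the zone split on JOIN, assuming zones are axis-aligned boxes stored as (lo, hi) bound lists; cycling the split dimension by depth is a simplification of the paper's fixed dimension ordering:

```python
def split_zone(zone, depth):
    """Split a zone in half for a joining node.

    zone: (lo, hi) lists of per-dimension bounds.
    depth: number of splits so far; dimensions are cycled in a
    fixed order so zones can later re-merge cleanly.
    Returns (O's remaining half, N's new half)."""
    lo, hi = zone
    dim = depth % len(lo)                  # fixed dimension ordering
    mid = (lo[dim] + hi[dim]) / 2
    old_hi = hi[:]; old_hi[dim] = mid      # O keeps the lower half
    new_lo = lo[:]; new_lo[dim] = mid      # N takes the upper half
    return (lo, old_hi), (new_lo, hi)

# The whole 2-d space splits into left and right halves:
old, new = split_zone(([0.0, 0.0], [1.0, 1.0]), depth=0)
print(old)  # ([0.0, 0.0], [0.5, 1.0])
print(new)  # ([0.5, 0.0], [1.0, 1.0])
```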
12. Example Join
13. Node Removal
- Hand off (key,value) pairs to a neighbor N
- Inform all existing neighbors of topology change
- Inverse of JOIN
14. Node Failure
- Periodically check in with neighbors
- No check-in → failed node
- Each neighbor starts a random timer (proportional to its own zone volume)
- Timer expires → send TAKEOVER to all neighbors of the failed node
- On receiving a TAKEOVER:
- Cancel own timer if the sender's zone volume is smaller
- Reply with own TAKEOVER if the sender's volume is larger
- Result: the neighbor with the smallest volume takes over the failed node's zone (sketch below)
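A sketch of the timer-based TAKEOVER race; timers proportional to zone volume let the smallest-volume neighbor fire first, and the message shape and names are illustrative:

```python
import random

def takeover_timer(my_volume, base=1.0):
    """Timer proportional to zone volume, with jitter so that
    neighbors with equal volumes don't all fire at once."""
    return base * my_volume * random.uniform(1.0, 1.1)

def on_takeover(my_volume, my_timer_active, sender_volume):
    """React to a neighbor's TAKEOVER for the same failed node.

    'cancel': defer to a smaller-volume sender.
    'reply':  contest with our own TAKEOVER (we are smaller)."""
    if not my_timer_active:
        return "ignore"
    if sender_volume <= my_volume:
        return "cancel"
    return "reply"
```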
15. Design Improvements - Dimensions
- Dimensionality of coordinate space not restricted
- Increasing the number of dimensions decreases path length
- Path length scales as O(d·n^(1/d)) (worked numbers below)
- Also improves fault tolerance
- More neighbors → more possible next hops
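A quick arithmetic check of the (d/4)·n^(1/d) average path length at a fixed system size, using only the formula above:

```python
n = 2**17  # system size used in the paper's simulations
for d in (2, 3, 4, 5, 10):
    hops = (d / 4) * n ** (1 / d)
    print(f"d={d:2d}  avg path length ~ {hops:6.1f} hops")
# d=2 -> 181.0, d=3 -> 38.1, d=4 -> 19.0, d=5 -> 13.2, d=10 -> 8.1
```

State per node grows as O(d) (2d neighbors in an equally partitioned space), which is the price of the shorter paths.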
16. Graph Effects of Dimensions
17. Design Improvements - Realities
- Multiple, independent coordinate spaces
- Each node assigned a different zone in each space (a "reality")
- r realities → each node has r coordinate zones and r neighbor sets
- Contents replicated in each reality
- Can route through any reality → shorter routes
- Routing checks all neighbors across realities and forwards via the one closest to the data in its reality
- Improved fault tolerance: more copies of each (key,value) pair
18. Graph Effects of Realities
19. Dimensions vs. Realities
- Both reduce path length, increase state
- Dimensions reduce path length better
- Does not mean dimensions are strictly better: realities have advantages of their own (replication, fault tolerance)
20. Design Improvements - Routing Metrics
- Basic routing measures only progress in Cartesian space
- Ignores the underlying IP network
- Instead, forward via the neighbor with the highest ratio of Cartesian progress to RTT (sketch below)
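A sketch of the RTT-weighted forwarding choice, assuming each neighbor entry carries a measured round-trip time; all names are illustrative:

```python
def torus_dist(a, b):
    """Cartesian distance on the unit d-torus (coordinates wrap at 1)."""
    return sum(min(abs(x - y), 1 - abs(x - y)) ** 2
               for x, y in zip(a, b)) ** 0.5

def best_hop(target, my_point, neighbors):
    """Pick the neighbor maximizing Cartesian progress per unit RTT.

    neighbors: (node, zone_center, rtt_seconds) tuples; only
    neighbors that actually make progress are considered."""
    here = torus_dist(my_point, target)
    best, best_ratio = None, 0.0
    for node, center, rtt in neighbors:
        progress = here - torus_dist(center, target)
        if progress > 0 and progress / rtt > best_ratio:
            best, best_ratio = node, progress / rtt
    return best
```

This trades slightly longer Cartesian paths for lower per-hop latency on the real network.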
21. Design Improvements - Overloaded Zones
- Multiple nodes per zone
- Increased state: nodes know their zone peers as well as their neighbors
- Improves fault-tolerance, latency, path length
22. Design Improvements - Multiple Hashing
- Each key hashes to multiple points, one per hash function (sketch below)
- Similar to multiple realities
- Shorter paths
- Higher fault-tolerance
- Lower perceived latency
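A minimal sketch of multiple hashing; salting one base hash stands in for k independent hash functions (an assumption — the paper simply says k different hash functions):

```python
import hashlib

def key_to_points(key: str, k: int = 3, d: int = 2):
    """Map one key to k points in the unit d-torus.

    (K,V) is stored at the owner of every point; a lookup can
    query all k in parallel, or just the closest, for lower
    latency and higher fault tolerance."""
    points = []
    for salt in range(k):
        digest = hashlib.sha1(f"{salt}:{key}".encode()).digest()
        points.append(tuple(
            int.from_bytes(digest[4 * i: 4 * i + 4], "big") / 2**32
            for i in range(d)))
    return points

print(key_to_points("some-file.mp3"))  # three replica locations
```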
23. Design Improvements - Topological Sensitivity
- Place neighbors based on underlying network
- Don't choose the insertion point randomly
- Decreased latency
24. Design Improvements - Uniform Partitioning
- Basic partitioning can be unbalanced
- Lots of information at one node
- A joining node splits the largest zone in the target's neighborhood, not necessarily the occupant's (sketch below)
- Doesn't solve the hot spot problem
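A sketch of the volume comparison behind uniform partitioning, assuming the occupant checks itself and its neighbors and hands the split to the largest zone (a simplified reading; names illustrative):

```python
def zone_volume(zone):
    """Volume of an axis-aligned zone given as (lo, hi) bound lists."""
    lo, hi = zone
    v = 1.0
    for a, b in zip(lo, hi):
        v *= b - a
    return v

def choose_split_target(my_zone, neighbor_zones):
    """Split the largest zone among us and our neighbors, so
    repeated joins keep the partition roughly uniform."""
    return max([my_zone] + neighbor_zones, key=zone_volume)

big = ([0.0, 0.0], [0.5, 1.0])
small = ([0.5, 0.0], [1.0, 0.5])
print(choose_split_target(small, [big]))  # the larger zone splits
```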
25. Design Improvements - Caching and Replication
- Both techniques address the hot spot problem (toy sketch below)
- Caching
- Keep copies of (key,value) pairs recently requested through you
- Replication
- Overloaded node pushes copy to neighbors
- Continues until load is reasonable
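A toy sketch of both mechanisms, assuming a per-node cache consulted before forwarding and an invented load threshold that triggers pushing replicas to neighbors:

```python
class CanNode:
    LOAD_THRESHOLD = 100  # requests per interval before shedding load

    def __init__(self):
        self.store = {}   # pairs owned by this node's zone
        self.cache = {}   # pairs seen while forwarding requests
        self.hits = {}    # per-key request counts this interval

    def lookup(self, key):
        """Answer from our zone or our cache; otherwise the
        caller forwards the request onward."""
        self.hits[key] = self.hits.get(key, 0) + 1
        if key in self.store:
            return self.store[key]
        return self.cache.get(key)

    def maybe_replicate(self, neighbors):
        """Push hot keys to neighbors; they absorb some requests,
        and the push repeats until the load looks reasonable."""
        for key, count in self.hits.items():
            if count > self.LOAD_THRESHOLD and key in self.store:
                for n in neighbors:
                    n.cache[key] = self.store[key]
```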
26. Summary of Improvements
- Two runs of CAN
- System size: 2^17 nodes
27. Background Zone Reassignment
- Immediate takeover can leave one node holding multiple zones
- Prefer a one-to-one node-to-zone assignment
- DFS on the binary partition tree finds a better takeover (sketch below)
- Immediate takeover happens first
- The new holder then does the DFS
- Finds a pair of sibling leaf zones
- One sibling's node merges the two zones and takes them over
- The other node takes the released zone
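A sketch of the sibling-leaf search over the binary partition tree (leaves are currently assigned zones; internal nodes are past splits and always have two children). The tree class is invented for illustration:

```python
class TreeNode:
    """Binary partition tree node: a leaf holds a live zone,
    an internal node records a past split (two children)."""
    def __init__(self, zone=None, left=None, right=None):
        self.zone, self.left, self.right = zone, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def find_sibling_leaves(node):
    """DFS for the first internal node whose children are both
    leaves: merging those two zones frees one node to take the
    zone that needs reassignment."""
    if node is None or node.is_leaf():
        return None
    if node.left.is_leaf() and node.right.is_leaf():
        return node.left.zone, node.right.zone
    return (find_sibling_leaves(node.left)
            or find_sibling_leaves(node.right))
```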
28. Zone Reassignment Example
- 9 dies, 6 takes over immediately
- 6 does DFS, finds 10, 11
- 11 takes over 10/11, 10 takes over ex-9
29. Performance with Failures
- Assume no repair mechanism
- Increased node failure rate → increased routing failure
- Increased number of nodes → increased routing failure
- Failure rate hurts more than system size
- Use Expanding Ring Search (ERS) to find a good path (sketch below)
- Search radius increases with node failure rate and system size
- Path length using ERS grows exponentially with system size
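A sketch of expanding ring search, assuming a flooding primitive `flood(origin, ttl)` that yields nodes within `ttl` hops; the predicate and all names are illustrative:

```python
def expanding_ring_search(origin, is_good, flood, max_ttl=8):
    """Flood with increasing TTL until some reachable node
    satisfies is_good (e.g. has a live greedy route onward).

    flood(origin, ttl) -> iterable of nodes within ttl hops."""
    for ttl in range(1, max_ttl + 1):
        for node in flood(origin, ttl):
            if is_good(node):
                return node  # usable re-entry point for routing
    return None              # no luck within the search radius
```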
30. Related Work
- Algorithms
- Distance Vector Routing: requires topological knowledge
- Link State Routing: see DVR
- Plaxton's Algorithm: designed for web caching, not P2P
- Systems
- DNS: hash-like lookup, but less general than a CAN
- OceanStore: large-scale storage system
- Publius: web publishing system
- P2P filesharing: Napster, Gnutella, Freenet
31. Discussion
- System size 65k → latency < 2x underlying IP latency
- Have addressed scalability and indexing
- Security?
- DOS resistance?
- Searching?