Title: An Overlay Infrastructure for Decentralized Object Location and Routing
1. An Overlay Infrastructure for Decentralized Object Location and Routing
- Ben Y. Zhao (ravenben_at_eecs.berkeley.edu)
- Computer Science Division, University of California at Berkeley
2. Peer-based Distributed Computing
- Cooperative approach to large-scale applications
- peer-based: available resources scale w/ # of participants
- better than client/server: limited resources, limited scalability
- Large-scale, cooperative applications are coming
- content distribution networks (e.g. FastForward)
- large-scale backup / storage utilities
- leverage peer storage for higher resiliency / availability
- cooperative web caching
- application-level multicast
- video on-demand, streaming movies
3. What Are the Technical Challenges?
- File systems replicate files for resiliency/performance
- how do you find nearby replicas?
- how does this scale to millions of users? billions of files?
4. Node Membership Changes
- Nodes join and leave the overlay, or fail
- data or control state needs to know about available resources
- node membership management is a necessity
5. A Fickle Internet
- Internet disconnections are not rare (UMichTR98, IMC02)
- TCP retransmission is not enough; need to route around failures
- IP route repair takes too long: IS-IS ≈ 5s, BGP ≈ 3-15 mins
- good end-to-end performance requires fast response to faults
6. An Infrastructure Approach
- First generation of large-scale apps: vertical approach
- hard problems, difficult to get right
- instead, solve common challenges once
- build a single overlay infrastructure at the application layer
[Figure: the overlay sits at the application layer of the protocol stack (application, presentation, session, transport, network, link, physical), on top of the Internet]
7. Personal Research Roadmap
[Figure: roadmap from TSpaces to DOLR (SPAA 02 / TOCS, IPTPS 03, ICNP 03) and modeling of non-stationary datasets (JSAC 04)]
8. Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient routing
- Tapestry deployment performance
- Wrap-up
9. What Should This Infrastructure Look Like?
- here is one appealing direction
10. Structured Peer-to-Peer Overlays
- Node IDs and keys from a randomized namespace (SHA-1)
- incremental routing towards the destination ID
- each node has a small set of outgoing routes, e.g. prefix routing
- log(n) neighbors per node, log(n) hops between any node pair
[Figure: routing a message for ID ABCD through nodes with successively longer matching prefixes: A930, AB5F, ABC0, ABCE]
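As an illustration of the prefix routing described above (a minimal Python sketch, not the actual Tapestry code, which is in Java), each hop forwards to a neighbor matching one more digit of the destination ID. The routing-table layout here is an assumption for the example:

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading ID digits that two IDs share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(local_id: str, dest_id: str, routing_table):
    """routing_table[level][digit] -> a neighbor that matches local_id
    on `level` digits and has `digit` as its next digit."""
    level = shared_prefix_len(local_id, dest_id)
    if level == len(dest_id):
        return local_id  # we are the destination
    return routing_table[level].get(dest_id[level])

# Toy example: node A930 forwarding a message addressed to ABCD.
rt = {1: {"B": "AB5F"}}  # hypothetical routing-table slice
hop = next_hop("A930", "ABCD", rt)  # one more digit matched per hop
```

Because each hop fixes at least one more digit, any route completes in at most log(n) hops over an n-digit namespace.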
11. Related Work
- Unstructured peer-to-peer approaches
- Napster, Gnutella, KaZaa
- probabilistic search (optimized for the hay, not the needle)
- locality-agnostic routing (resulting in high network b/w costs)
- Structured peer-to-peer overlays
- the first protocols (2001): Tapestry, Pastry, Chord, CAN
- then: Kademlia, SkipNet, Viceroy, Symphony, Koorde, Ulysseus
- distinction: how to choose your neighbors
- Tapestry, Pastry: latency-optimized routing mesh
- distinction: application interface
- distributed hash table: put(key, data), data = get(key)
- Tapestry: decentralized object location and routing
12. Defining the Requirements
- efficient routing to nodes and data
- low routing stretch (ratio of overlay latency to shortest-path distance)
- flexible data location
- applications want/need to control data placement
- allows for application-specific performance optimizations
- directory interface: publish(ObjID), routeToObj(ObjID, msg)
- resilient and responsive to faults
- more than just retransmission: route around failures
- reduce negative impact (loss/jitter) on the application
13. Decentralized Object Location and Routing
[Figure: a server publishes key k; clients' routeObj(k) messages follow in-network pointers to the nearest copy of k]
- redirect data traffic using log(n) in-network redirection pointers
- average # of pointers/machine: log(n) × avg # of files/machine
- keys to performance: proximity-enabled routing mesh with routing convergence
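The publish/routeObj pointer scheme above can be sketched in miniature. This toy Python model (the node set, root hashing, and two-hop "path" are simplifying assumptions; real Tapestry takes log(n) prefix-matching hops) shows pointers deposited toward an object's root and followed by queries:

```python
import hashlib

class DolrOverlay:
    """Toy DOLR directory: each node caches object_key -> replica location."""
    def __init__(self, nodes):
        self.nodes = sorted(nodes)
        self.pointers = {n: {} for n in nodes}  # per-node pointer cache

    def root(self, key):
        # Deterministic root node for a key: a stand-in for routing to
        # the node whose ID best matches hash(key).
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def path(self, src, key):
        # Simplified "route toward the root": just [src, root].
        return [src, self.root(key)]

    def publish(self, server, key):
        # Deposit a redirection pointer at every hop toward the root.
        for hop in self.path(server, key):
            self.pointers[hop][key] = server

    def route_obj(self, client, key):
        # Walk toward the root; divert at the first pointer found.
        for hop in self.path(client, key):
            if key in self.pointers[hop]:
                return self.pointers[hop][key]
        return None
```

Because publish and query paths converge toward the key's root, a query is guaranteed to intersect a pointer by the root at the latest, and often much earlier on nearby nodes.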
14. Why Proximity Routing?
- Fewer/shorter IP hops: shorter e2e latency, less bandwidth/congestion, less likely to cross broken/lossy links
15. Performance Impact (Proximity)
- Simulated Tapestry w/ and w/o proximity on a 5000-node transit-stub network
- Measured pair-wise routing stretch between 200 random nodes
16. DOLR vs. Distributed Hash Table
- DHT: hash content → name → replica placement
- modifications → replicating a new version into the DHT
- DOLR: app places a copy near requests; overlay routes msgs to it
17. Performance Impact (DOLR)
- simulated Tapestry w/ DOLR and DHT interfaces on a 5000-node transit-stub network
- measured route-to-object latency from clients in 2 stub networks
- DHT: 5 object replicas; DOLR: 1 replica placed in each stub network
18. Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up
19. How Do You Get Fast Responses to Faults?
- Response time = fault-detection time + alternate-path discovery time + time to switch
20. Fast Response via Static Resiliency
- Reducing fault-detection time
- monitor paths to neighbors with periodic UDP probes
- O(log(n)) neighbors → higher probe frequency w/ low bandwidth
- exponentially weighted moving average for link quality estimation
- avoid route flapping due to short-term loss artifacts
- loss rate: Ln = (1 − α) · Ln−1 + α · p
- Eliminate synchronous backup path discovery
- actively maintain redundant paths, redirect traffic immediately
- repair redundancy asynchronously
- create and store backups at node insertion
- restore redundancy via random pair-wise queries after failures
- End result: fast detection + precomputed paths → increased responsiveness
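The EWMA loss estimator above is straightforward to state in code. A minimal Python sketch (the alpha and threshold values are illustrative assumptions, not the deployed parameters):

```python
class LinkQualityEstimator:
    """EWMA of per-probe loss: L_n = (1 - alpha) * L_{n-1} + alpha * p,
    where p is the instantaneous loss (1 if the probe was lost, else 0)
    and alpha is the filter constant."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.loss = 0.0  # smoothed loss-rate estimate

    def record_probe(self, lost: bool) -> float:
        p = 1.0 if lost else 0.0
        # Small alpha damps short-term loss artifacts, avoiding flapping.
        self.loss = (1 - self.alpha) * self.loss + self.alpha * p
        return self.loss

    def usable(self, threshold=0.2) -> bool:
        return self.loss < threshold

# Example: one lost probe followed by one successful probe.
est = LinkQualityEstimator(alpha=0.5)
est.record_probe(True)    # estimate rises toward 1
est.record_probe(False)   # estimate decays back toward 0
```

A single lost probe only nudges the estimate by alpha, so routes do not flap on transient loss; sustained loss pushes the estimate past the usability threshold.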
21. Routing Policies
- Use estimated overlay link quality to choose the shortest usable link
- use the shortest overlay link with quality > threshold T
- Alternative policies
- prioritize low loss over latency: use the least lossy overlay link
- use the path w/ minimal cost function: cf = x · latency + y · loss rate
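These policies compose naturally: prefer the shortest link whose loss is acceptable, and fall back to the weighted cost function when no link clears the threshold. A small Python sketch (the threshold T and weights x, y are illustrative assumptions):

```python
def pick_link(links, T=0.1, x=1.0, y=100.0):
    """links: list of (latency_ms, loss_rate) candidate overlay links.
    Primary policy: shortest link with loss rate below threshold T.
    Fallback: minimize the cost function cf = x*latency + y*loss."""
    usable = [l for l in links if l[1] < T]
    if usable:
        return min(usable, key=lambda l: l[0])       # shortest usable
    return min(links, key=lambda l: x * l[0] + y * l[1])  # min cost

# Example: three redundant links to the same next hop.
links = [(10, 0.05), (5, 0.5), (20, 0.0)]
best = pick_link(links)  # skips the 5 ms link: too lossy
```

Weighting y much higher than x encodes the "prioritize low loss over latency" alternative within the same cost function.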
22. Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up
23. Tapestry, a DOLR Protocol
- Routing based on incremental prefix matching
- Latency-optimized routing mesh
- nearest neighbor algorithm (HKRZ02)
- supports massive failures and large group joins
- Built-in redundant overlay links
- 2 backup links maintained w/ each primary
- Use objects as endpoints for rendezvous
- nodes publish names to announce their presence
- e.g. a wireless proxy publishes a nearby laptop's ID
- e.g. multicast listeners publish the multicast session name to self-organize
24. Weaving a Tapestry
- inserting node (0123) into the network
- route to own ID, find 012X nodes, fill the last column
- request backpointers to 01XX nodes
- measure distance, add to routing table
- prune to the nearest K nodes
- repeat for each prefix level
[Figure: new node joining an existing Tapestry]
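The insertion steps above can be condensed into a toy sketch. This Python model (slot layout, the RTT map, and K are assumptions for illustration; the real algorithm discovers candidates via routing and backpointers rather than a global node list) fills each routing-table slot with the K nearest matching neighbors:

```python
def insert_node(new_id, known_nodes, rtt, k=2):
    """Toy sketch of Tapestry-style insertion for `new_id`.
    At each prefix level, collect candidates sharing that prefix,
    bucket them by their next digit (the routing-table slot), and
    keep only the k nearest by measured distance (rtt)."""
    table = {}
    for level in range(len(new_id)):
        prefix = new_id[:level]
        cands = [n for n in known_nodes
                 if n.startswith(prefix) and n != new_id]
        for n in cands:
            slot = (level, n[level])       # (prefix level, next digit)
            table.setdefault(slot, [])
            if n not in table[slot]:
                table[slot].append(n)
        # Prune step: keep the k closest nodes per slot at this level.
        for slot in [s for s in table if s[0] == level]:
            table[slot] = sorted(table[slot], key=lambda n: rtt[n])[:k]
    return table

# Example: node 0123 joining with four known nodes and measured RTTs.
rtt = {"0120": 5, "012A": 1, "0456": 2, "0100": 4}
table = insert_node("0123", ["0120", "012A", "0456", "0100"], rtt, k=2)
```

The prune-to-nearest-K step is what makes the resulting mesh latency-optimized rather than merely correct.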
25. Implementation Performance
- Java implementation
- 35,000 lines in core Tapestry; 1500 downloads
- Micro-benchmarks
- per-msg overhead ≈ 50µs; most latency from byte copying
- performance scales w/ CPU speedup
- 5KB msgs on a P-IV 2.4GHz: throughput ≈ 10,000 msgs/sec
- Routing stretch
- route to node: < 2
- route to objects/endpoints: < 3 (higher stretch for close-by objects)
26. Responsiveness to Faults (PlanetLab)
- B/W grows w/ network size N: N = 300 → ≈7KB/s/node, N = 10^6 → ≈20KB/s
- simulation: if link failure rate < 10%, can route around 90% of survivable failures
27. Stability Under Membership Changes
- Routing operations on a 40-node Tapestry cluster
- Churn: nodes join/leave every 10 seconds, average lifetime 2 mins
28. Talk Outline
- Motivation
- Decentralized object location and routing
- Resilient and responsive routing
- Tapestry deployment performance
- Wrap-up
29. Lessons and Takeaways
- Consider system constraints in algorithm design
- limited by finite resources (e.g. file descriptors, bandwidth)
- simplicity wins over small performance gains
- easier adoption and faster time to implementation
- Wide-area state management (e.g. routing state)
- reactive algorithm for best-effort, fast response
- proactive periodic maintenance for correctness
- Naïve event programming model is too low-level
- much code complexity from managing stack state
- important for protocols with asynchronous control algorithms
- need explicit thread support for callbacks / stack management
30. Future Directions
- Ongoing work to explore the p2p application space
- resilient anonymous routing, attack resiliency
- Intelligent overlay construction
- router-level listeners allow application queries
- efficient meshes, fault-independent backup links, failure notification
- Deploying and measuring a lightweight peer-based application
- focus on usability and low overhead
- p2p incentives, security, deployment meet the real world
- A holistic approach to overlay security and control
- p2p good for self-organization, not for security/management
- decouple administration from normal operation
- explicit domains / hierarchy for configuration, analysis, control
31. Thanks!
Questions, comments? ravenben_at_eecs.berkeley.edu
32. Impact of Correlated Events
[Figure: event handler processing a chain of correlated requests]
- correlated requests: A → B → C → D
- e.g. online continuous queries, sensor aggregation, p2p control layer, streaming data mining
- contrast: web / application servers
- independent requests
- maximize individual throughput
33. Some Details
- Simple fault detection techniques
- periodically probe overlay links to neighbors
- exponentially weighted moving average for link quality estimation
- avoid route flapping due to short-term loss artifacts
- loss rate: Ln = (1 − α) · Ln−1 + α · p
- p: instantaneous loss rate; α: filter constant
- other techniques: topics of open research
- How do we get and repair the backup links?
- each hop has a flexible routing constraint
- e.g. in prefix routing, the 1st hop just requires 1 fixed digit
- backups always available until the last hop to the destination
- create and store backups at node insertion
- restore redundancy via random pair-wise queries after failures
- e.g. to replace a 123X neighbor, talk to local 12XX neighbors
34. Route Redundancy (Simulator)
- Simulation of Tapestry with 2 backup paths per routing entry
- 2 backups: low maintenance overhead, good resiliency
35. Another Perspective on Reachability
- portion of all pair-wise paths where no failure-free paths remain
- a path exists, but neither IP nor FRLS can locate it
- portion of all paths where IP and FRLS both route successfully
- FRLS finds a path where short-term IP routing fails
36. Single Node Software Architecture
37. Related Work
- Unstructured peer-to-peer applications
- Napster, Gnutella, KaZaa
- probabilistic search; difficult to scale; inefficient b/w
- Structured peer-to-peer overlays
- Chord, CAN, Pastry, Kademlia, SkipNet, Viceroy, Symphony, Koorde, Coral, Ulysseus
- routing efficiency
- application interface
- Resilient routing
- traffic redirection layers
- Detour, Resilient Overlay Networks (RON), Internet Indirection Infrastructure (I3)
- our goals: scalability, in-network traffic redirection
38. Node-to-Node Routing (PlanetLab)
- Median 31.5, 90th percentile 135
- Ratio of end-to-end latency to ping distance between nodes
- All node pairs measured, placed into buckets
39. Object Location (PlanetLab)
- 90th percentile 158
- Ratio of end-to-end latency to client-object ping distance
- Local-area stretch improved w/ additional location state
40. Micro-benchmark Results (LAN)
- Per-msg overhead ≈ 50µs; latency dominated by byte copying
- Performance scales with CPU speedup
- For 5KB messages, throughput ≈ 10,000 msgs/sec
41. Traffic Tunneling
[Figure: legacy nodes A and B (IP addresses) connect through proxies to the structured peer-to-peer overlay; the overlay stores the mapping B → P(B), the overlay ID of B's proxy]
- Store mapping from an end host's IP to its proxy's overlay ID
- Similar to the approach in Internet Indirection Infrastructure (I3)
42. Constrained Multicast
- Used only when all paths are below the quality threshold
- Send duplicate messages on multiple paths
- Leverage route convergence
- Assign unique message IDs
- Mark duplicates
- Keep a moving window of IDs
- Recognize and drop duplicates
- Limitations
- Assumes loss not from congestion
- Ideal for local-area routing
[Figure: duplicate messages routed toward node 1111 through nodes 2225, 2299, 2274, 2286, 2046, 2281, 2530; copies are dropped where the paths converge]
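The "moving window of IDs" duplicate suppression above can be sketched in a few lines. A minimal Python illustration (the window size is an assumption; real deployments would size it to path-delay spread):

```python
from collections import deque

class DuplicateFilter:
    """Moving window of recently seen message IDs. When duplicates sent
    over redundant paths converge at a node, only the first copy is
    delivered; later copies are recognized and dropped."""

    def __init__(self, window=1024):
        self.order = deque()   # IDs in arrival order, oldest first
        self.seen = set()      # fast membership check
        self.window = window

    def accept(self, msg_id) -> bool:
        if msg_id in self.seen:
            return False               # duplicate: drop
        self.seen.add(msg_id)
        self.order.append(msg_id)
        if len(self.order) > self.window:
            self.seen.discard(self.order.popleft())  # age out oldest
        return True                    # first copy: deliver

# Example: two copies of message "a" arrive via different paths.
f = DuplicateFilter()
f.accept("a")  # delivered
f.accept("a")  # dropped as a duplicate
```

Bounding the window keeps per-node state constant; an ID that ages out would be accepted again, which is why this scheme suits local-area paths with small delay spread.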
43. Link Probing Bandwidth (PlanetLab)
- Bandwidth increases logarithmically with overlay size
- Medium-sized routing overlays incur low probing bandwidth
44. Control Plane vs. Data Plane
- impact varies with application domain
- control plane
- use the overlay as a lookup service
- minimize performance impact
- requires more end-host intervention
- example: Internet Indirection Infrastructure
- do extra work to locate a nearby server, amortize cost over time
- data plane
- leverage the overlay for data traffic
- efficient overlay routing is critical
- build additional logic into overlay hops
- examples: routing for resilience, anonymity
- efficiency always desirable; the question is who provides it