Title: Resilient Overlay Network
1Resilient Overlay Network
- David Andersen, Hari Balakrishnan,
- Frans Kaashoek, Robert Morris
- MIT Laboratory for Computer Science
- http://nms.lcs.mit.edu/ron/
- 18th ACM Symp. on Operating Systems Principles (SOSP), October 2001, Banff, Canada.
2Outline
- Introduction
- Design Goal
- Design
- Implementation
- Evaluation
- Discussion
- Conclusion
3Fault-tolerant networking
[Figure: example network of nodes A, B, C, and D]
- Packet switching and routing around failures
4The Internet
Mom-and-pop ISP
Really-big ISP everyone's afraid of
Big ISP
Autonomous System (AS)
Peering
BGP4
Scalability via aggressive aggregation and information hiding
Commercial reality via peering and transit relationships
5How Robust is Internet Routing?
- Slow outage detection and recovery
- Inability to detect badly performing paths
- Inability to efficiently leverage redundant paths
- Inability to perform application-specific routing
- Inability to express sophisticated routing policy
6Introducing RON
- Resilient Overlay Networks (RONs)
- Remedy for some of these problems
- Rapid detection and recovery of Internet path outages and performance degradation
- Distributed application-layer overlay
- Nodes cooperate to forward data for each other
- Exploit redundancy in underlying Internet paths
7Routing Using Overlays
- Cooperating end-systems in different routing
domains can conspire to do better than scalable
wide-area protocols
Scalable BGP-based IP routing substrate
- Types of failures
- Outages: configuration/operational errors, backhoes, etc.
- Performance failures: severe congestion, denial-of-service attacks, etc.
8Design Goals
- RON nodes can communicate with each other in the face of problems with underlying Internet paths
- Each RON node obtains the path metrics
- active probing experiments
- passive observations
- exchange information about the quality of the paths via a routing protocol
- build forwarding tables based on path metrics
- Latency, packet loss rate, available throughput
- RONs are designed to be limited in size
9Design Goals (Cont.)
- Integrate routing and path selection with distributed applications more tightly
- The ability to incorporate application-specific notions of what network conditions constitute a fault
- The ability to consult application-specific metrics in selecting paths
- A variety of uses
- Provide a framework for the implementation of expressive routing policies
- BGP-4 is incapable of expressing fine-grained policies aimed at users or hosts
- This lack of precision
- reduces the set of paths available in the case of a failure
- inhibits innovation in the use of carefully targeted policies
11RON Design
Nodes in different routing domains (ASes)
Virtual link
RON library
Application-specific routing tables
Policy routing module
Performance Database
12Software Architecture
application
application
Application-specific routing tables
Policy routing module
13Software Architecture
- RON client
- Each program that communicates with the RON library on a node
- The overlay network is defined by a single group of clients
- Service-specific routing metrics
- Conduit
- An API across which a RON client interacts with RON
- Send (pkt, dst, via_ron)
- Recv (pkt, via_ron)
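The conduit calls listed above can be pictured with a small sketch. This is a hypothetical Python model, not the RON library's actual interface: the class name `Conduit`, the shared `network` dictionary that stands in for the overlay, and the reading of `via_ron` as a boolean flag are all assumptions made for illustration.

```python
from collections import deque


class Conduit:
    """Toy stand-in for the conduit between a RON client and RON (hypothetical)."""

    def __init__(self, node, network):
        self.node = node
        self.inbox = deque()
        self.network = network      # assumed: dict mapping node name -> Conduit
        network[node] = self

    def send(self, pkt, dst, via_ron):
        # A real forwarder would choose an overlay path here. This sketch
        # delivers straight to the destination and only records the flag.
        self.network[dst].inbox.append((pkt, via_ron))

    def recv(self):
        # Returns a (pkt, via_ron) pair, or None when nothing is waiting.
        return self.inbox.popleft() if self.inbox else None
```

A client would create one conduit per node, call `send(pkt, dst, via_ron)` to hand a packet to RON, and poll `recv()` on the receiving side.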
14Software Architecture
- Forwarder object
- Implements basic RON functionality
- RON router
- Implements a routing protocol
- Application can choose which router to use
- RON membership manager
- Maintains the list of members of a RON
15Routing and Path Selection
- The entry node
- Encapsulate packet with a RON packet header
- Path selection
- tags the packet's RON header with a flow ID
- support multi-hop routing
- tie a packet flow to a chosen path
- The small size of a RON allows it to
- maintain information for each virtual link
- (i) latency, (ii) packet loss rate, (iii) throughput
- select the path that best suits the RON client
16Routing and Path Selection
17Routing and Path Selection
- Link-state dissemination
- The default RON router uses a link-state routing protocol to disseminate topology information between routers
- Information is sent via the RON forwarding mesh itself
- Thus, the RON routing protocol is itself a RON client
18Routing and Path Selection
- Path Evaluation and Selection
- Every RON router implements outage detection
- uses an active probing mechanism for this
- By default, every RON router implements three different routing metrics
- latency-minimizer
- loss-minimizer
- TCP throughput-optimizer
19Routing and Path Selection
- Latency-minimizer
- For any link
- For a RON path, the overall latency is the sum of the individual virtual link latencies
- Loss-minimizer
- TCP throughput-optimizer
- Select when estimated throughput improves by 2x
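The latency rule above (path latency is the sum of the virtual link latencies) can be sketched in a few lines. The function names and the example topology are hypothetical. The loss combination assumes that losses on different virtual links are independent, which is an assumption of this sketch and is not stated on the slide.

```python
def path_latency(link_latencies_ms):
    # Overall latency of a RON path: sum of its virtual link latencies.
    return sum(link_latencies_ms)


def path_loss(link_loss_rates):
    # Assumption: independent losses, so delivery probabilities multiply.
    delivered = 1.0
    for p in link_loss_rates:
        delivered *= 1.0 - p
    return 1.0 - delivered


def best_latency_path(src, dst, latency):
    """Pick the lowest-latency path among the direct link and every
    single-intermediate path. `latency` maps (a, b) -> milliseconds.
    Raises ValueError if no candidate path exists."""
    nodes = {n for link in latency for n in link}
    candidates = []
    if (src, dst) in latency:
        candidates.append((latency[(src, dst)], [src, dst]))
    for mid in nodes - {src, dst}:
        if (src, mid) in latency and (mid, dst) in latency:
            total = path_latency([latency[(src, mid)], latency[(mid, dst)]])
            candidates.append((total, [src, mid, dst]))
    return min(candidates)
```

With a made-up table where the direct link A to B costs 100 ms and the detour through C costs 20 ms plus 30 ms, `best_latency_path("A", "B", ...)` prefers the detour.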
20Monitoring Virtual Links
- Both sides get an RTT sample without requiring synchronized clocks
- Parameters
- PROBE_INTERVAL + random(0, 1/3 PROBE_INTERVAL)
- PROBE_TIMEOUT
- OUTAGE_THRESHOLD
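A minimal sketch of how the three parameters above might fit together. The numeric values, the `LinkMonitor` class, and the rule "declare an outage after OUTAGE_THRESHOLD consecutive lost probes" are illustrative assumptions, not details taken from these slides.

```python
import random

# Illustrative values only; the slides name the parameters but not their settings.
PROBE_INTERVAL = 12.0     # seconds between probes, before jitter
PROBE_TIMEOUT = 3.0       # seconds to wait for a probe response
OUTAGE_THRESHOLD = 4      # consecutive lost probes that count as an outage


def next_probe_delay(rng=random):
    # Base interval plus a random offset of up to one third of the interval.
    return PROBE_INTERVAL + rng.uniform(0, PROBE_INTERVAL / 3)


class LinkMonitor:
    """Tracks consecutive probe losses on one virtual link (hypothetical helper)."""

    def __init__(self):
        self.consecutive_losses = 0

    def record(self, rtt):
        # rtt is the measured round-trip time in seconds, or None if the
        # probe got no response. Late responses are treated as losses.
        if rtt is None or rtt > PROBE_TIMEOUT:
            self.consecutive_losses += 1
        else:
            self.consecutive_losses = 0
        return self.is_down()

    def is_down(self):
        return self.consecutive_losses >= OUTAGE_THRESHOLD
```

The random offset keeps nodes from probing in lockstep, and a single successful probe resets the loss counter.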
21Routing and Path Selection
22Policy Routing
- RON allows users or administrators to define the types of traffic allowed on particular network links
- RON separates policy routing into two components
- classification
- Entry node classifies packet with a policy tag
- routing table formation
- Router computes a set of forwarding tables for each policy
- Two ways of describing policies
- Exclusive cliques
- E.g., only students in CoC are allowed to use GT's connection to Internet2
- General policies
- BPF-like packet matcher, which returns a policy
- A list of links that are denied by the policy
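The two components above, classification and routing table formation, can be sketched as follows. The function names, the packet representation as a dictionary, and the policy tag names are hypothetical; plain Python predicates stand in for the BPF-like matcher.

```python
def classify(packet, rules, default_tag="default"):
    """Return the policy tag of the first rule whose matcher accepts the packet.
    `rules` is a list of (matcher, tag) pairs; matchers are plain predicates here."""
    for matches, tag in rules:
        if matches(packet):
            return tag
    return default_tag


def links_for_policy(all_links, denied_links, tag):
    """Links a router may use when building the forwarding table for `tag`.
    `denied_links` maps a policy tag to the set of links that policy denies."""
    denied = denied_links.get(tag, set())
    return [link for link in all_links if link not in denied]
```

The entry node would call `classify` once per packet (or per flow), and the router would call `links_for_policy` once per policy before computing that policy's forwarding table.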
23Data Forwarding
24Data Forwarding
25Bootstrap and Membership Management
- RON provides two system membership managers
- static membership mechanism
- dynamic membership protocol
- The new node uses this neighbor to broadcast its existence using a flooder
- The main challenge in the dynamic membership protocol is to avoid confusing a path outage to a node with its having left the RON
- Each node periodically exchanges its peer list with others
- 1-hour timeout
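One way to read the peer-list exchange and the 1-hour timeout above is sketched below. The `Membership` class and its method names are hypothetical, and the interpretation that a peer is dropped only after it has gone unmentioned for a full hour is an assumption of this sketch.

```python
PEER_TIMEOUT = 3600.0  # seconds; the slide's 1-hour timeout


class Membership:
    """Tracks when each peer was last mentioned in any received peer list."""

    def __init__(self):
        self.last_heard = {}

    def merge_peer_list(self, peers, now):
        # Any peer named in a received list counts as recently alive,
        # even if the direct path to it is currently down.
        for peer in peers:
            self.last_heard[peer] = now

    def members(self, now):
        # A peer is dropped only after a full timeout without being mentioned.
        return sorted(p for p, t in self.last_heard.items()
                      if now - t < PEER_TIMEOUT)
```

Because other nodes keep mentioning a peer they can still reach, a short outage on one path does not remove that peer from the list.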
26Implementation
27Generating Routing Tables
Single-hop indirection
28Evaluation
- N(N-1) different paths in an N-site RON deployment
- RON1: N = 12, 132 distinct paths
- RON2: N = 16, 240 distinct paths
- Raw measurement datasets
- Probe packets
- Throughput samples
- Traceroute results
- Note: experiments done with a No-Internet2-for-commercial-use policy
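The path counts for the two deployments follow from N(N-1), which counts every ordered pair of distinct sites once. A short check, with a hypothetical function name:

```python
def directed_paths(n_sites):
    # Each site can send to every other site, so ordered pairs: N * (N - 1).
    return n_sites * (n_sites - 1)


ron1_paths = directed_paths(12)   # 12 * 11
ron2_paths = directed_paths(16)   # 16 * 15
```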
29RON deployment (19 sites)
To vu.nl lulea.se ucl.uk
To kaist.kr, .ve
.com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx), cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit, vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela
30AS view
31Major Results
- RON reduced outages by a factor of 5 to 10, and routed around all major outages
- RON takes 18 s (average) to route around a failure, and can do so in the face of flooding attacks
- RON successfully routed around bad throughput failures, doubling TCP throughput in 5% of all samples
- In 5% of the samples, RON reduced the loss probability by 0.05 or more
- Single-hop route indirection delivers the majority of RON's benefits
32Evaluation: Overcoming Path Outages
30-min average loss rate on Internet
RON loss rate never more than 30%
13,000 samples
30-min average loss rate with RON
33An order-of-magnitude fewer failures
30-minute average loss rates
6,825 path hours represented here
12 path hours of essentially complete outage
76 path hours of TCP outage
RON routed around all of these!
One indirection hop provides almost all the benefit!
34Evaluation: Overhead
- 50 nodes allow recovery times between 12 and 25 s
- Growth in total traffic: the cost of fault tolerance
35Evaluation: Handling Packet Floods
Flood attack
36Evaluation: Loss Rate
37Evaluation: Latency
38Evaluation: Latency
39Evaluation: TCP Throughput
40Evaluation: Why does one hop work?
[Figure: source, target, and R RON nodes; labels Pi and Ps]
- a single-intermediate RON path is optimal (for latency) given that the direct path is not optimal
- either the direct path, or a single-hop intermediate path is the optimal path
- if
41Discussion
- RONs relating to routing policy
- Possibility of policy misuse
- Preventing misuse needs authentication and access control
- For small RONs, these can be solved at the administrative level
- Scalability
- Active probes
- Operation across NATs
- Naming
- Two hosts behind NATs
- Application??
42Conclusion
- Improved availability of Internet communication paths using small overlays
- Layered above scalable IP substrate
- RON provides a set of libraries and programs to facilitate this application-specific routing
- Experimental data suggest that the approach works
- Over 10X availability
- Outage detection and recovery in about 15 seconds
- Able to route around certain denial-of-service attacks