RON: Resilient Overlay Networks

David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris, MIT Laboratory for Computer Science, http://nms.lcs.mit.edu/ron/

Learn more at: http://nms.csail.mit.edu



1
RON: Resilient Overlay Networks
  • David Andersen, Hari Balakrishnan,
  • Frans Kaashoek, and Robert Morris
  • MIT Laboratory for Computer Science
  • http://nms.lcs.mit.edu/ron/

2
Fault-tolerant networking
[Figure: example network with nodes A, B, C, and D]
  • Packet switching and routing around failures

3
Internet: a network of networks
[Figure: Sites 1 to 5 interconnected through ISP1, ISP2, and ISP3]
  • ISPs peer to forward packets
  • ISPs exchange routing information using BGP

4
The Internet is ill suited to mission-critical
applications
  • Commercial peer architecture
  • Performance bottlenecks at peering points
  • Ignores many existing alternate paths
  • Directly conflicts with robustness
  • Internet's global scale
  • Prevents sophisticated algorithms
  • Route selection uses fixed, simple metrics
  • Routing isn't sensitive to path quality

5
How robust is Internet routing?
Paxson (1995-97)
  • 3.3% of all routes had serious problems
Labovitz (1997-2000)
  • 10% of routes available < 95% of the time
  • 65% of routes available < 99.9% of the time
  • 3-minute minimum detection and recovery time;
    often 15 minutes
  • 40% of outages took 30 minutes to repair
Chandra (2001)
  • 5% of faults last more than 2.75 hours

6
Our goal
  • To improve communication availability for small
    groups by at least a factor of 10
  • Many applications
  • Collaboration and conferencing
  • Virtual Private Networks (VPNs) across public
    Internet
  • Overlay Internet Service
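A goal stated as "a factor of 10" is easiest to pin down in terms of downtime. The sketch below assumes, as our own reading rather than a definition from the talk, that a 10x availability improvement means 10x less unavailability, and converts availability figures like those on the previous slide into hours of downtime per year (365-day year assumed). The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h; leap years ignored


def downtime_hours_per_year(availability: float) -> float:
    """Hours per year a path is unusable, for availability in [0, 1]."""
    return (1.0 - availability) * HOURS_PER_YEAR


def improved_availability(availability: float, factor: float) -> float:
    """Availability after dividing unavailability (downtime) by `factor`.

    Reading "improve availability by a factor of 10" as "10x less downtime"
    is an assumption of this sketch.
    """
    return 1.0 - (1.0 - availability) / factor


for a in (0.95, 0.999):
    print(f"{a:.1%} available: {downtime_hours_per_year(a):.2f} h/year of downtime")

# A path that is 99% available, improved by a factor of 10:
print(f"{improved_availability(0.99, 10):.1%}")
```

Under this reading, a path that is 99% available (about 88 hours of downtime a year) would need to reach 99.9% (under 9 hours a year) to meet the goal.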

7
Overlay routes around Internet failures
[Figure: overlay routes among the MIT, Utah, Utah Company, and Cable Modem sites]
  • Failures
  • Outages: configuration/operational errors,
    backhoes, etc.
  • Performance failures: severe congestion,
    denial-of-service attacks, etc.

8
Scalability versus recovery
  • Internet scalability pays a price
  • Slow recovery
  • RON recovers fast by
  • Limiting size of overlay
  • Exploiting redundancy in underlying Internet

9
Redundant links
  • Multiple paths between all sites

[Figure: multiple paths among MIT, Utah, Internet 2, Utah Company, and Cable Modem]

10
Redundant links
  • But many of them are hidden

[Figure: MIT, Utah, Utah Company, and Cable Modem sites]

11
Resilient overlay networks
  • Measure all links between nodes
  • Compute path properties
  • Determine best route
  • Forward traffic over that path
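The four steps above can be illustrated with a small sketch that, like RON, considers only the direct path plus paths through one intermediate node. The link table, the node names (UtahCo and CableModem stand in for the Utah Company and Cable Modem sites), and the use of additive latency as the path property are illustrative assumptions; a real RON node would fill the table from its own probes.

```python
import math
from typing import Dict, List, Optional, Tuple

# Step 1 (measure): hypothetical latency samples in ms for each directed
# virtual link; None marks a link whose probes are currently failing.
Links = Dict[Tuple[str, str], Optional[float]]


def path_cost(links: Links, hops: List[str]) -> float:
    """Step 2 (compute path properties): latency adds along the path."""
    total = 0.0
    for a, b in zip(hops, hops[1:]):
        cost = links.get((a, b))
        if cost is None:  # missing or failed link makes the path unusable
            return math.inf
        total += cost
    return total


def best_route(links: Links, nodes: List[str], src: str, dst: str) -> Tuple[List[str], float]:
    """Step 3 (determine best route): the direct path or one intermediate."""
    candidates = [[src, dst]] + [[src, via, dst] for via in nodes if via not in (src, dst)]
    best = min(candidates, key=lambda hops: path_cost(links, hops))
    return best, path_cost(links, best)


nodes = ["MIT", "Utah", "UtahCo", "CableModem"]
links: Links = {
    ("MIT", "Utah"): None,  # direct Internet path is down
    ("MIT", "CableModem"): 10.0,
    ("CableModem", "Utah"): 40.0,
    ("MIT", "UtahCo"): 30.0,
    ("UtahCo", "Utah"): 5.0,
}

# Step 4 (forward): traffic from MIT to Utah would be sent along `route`.
route, cost = best_route(links, nodes, "MIT", "Utah")
print(route, cost)
```

Here the direct path is down, so traffic goes through UtahCo (35 ms); if the MIT to UtahCo link also failed, the same call would fall back to the path through CableModem (50 ms).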

12
RON routing using overlays
[Figure: RON overlay layered above a scalable BGP-based IP routing substrate]
  • Types of failures
  • Outages: configuration/operational errors,
    backhoes, etc.
  • Performance failures: severe congestion,
    denial-of-service attacks, etc.

13
RON design
[Figure: RON design, showing nodes in different routing domains (ASes), the RON library, the performance database, application-specific routing tables, and the policy routing module]

14
Routing and path selection
  • Path selection at the entry node
  • Specialized for routing through one intermediate
    node
  • Router computes the forwarding tables
  • Link-state dissemination through RON
  • Path evaluation and selection
  • Latency minimizer: EWMA of round-trip samples
  • Loss-rate minimizer: average of the last k
    samples
  • Throughput optimizer: TCP throughput equation
  • Select when estimated throughput improves by 2x
  • 5% hysteresis to avoid flapping
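The three path metrics and the hysteresis rule can be sketched as follows. The EWMA weight, the window size k, the MSS, and the use of the simple square-root TCP model (Mathis et al.) in place of whichever TCP throughput equation the authors used are all assumptions of this sketch; only the 5% hysteresis figure comes from the slide, and the 2x selection rule is not modeled.

```python
import math
from collections import deque
from typing import Optional


class LatencyEstimator:
    """Latency minimizer input: EWMA of round-trip samples."""

    def __init__(self, alpha: float = 0.9) -> None:  # alpha is an assumed weight
        self.alpha = alpha
        self.estimate: Optional[float] = None

    def update(self, rtt_ms: float) -> float:
        if self.estimate is None:
            self.estimate = rtt_ms  # first sample seeds the average
        else:
            self.estimate = self.alpha * self.estimate + (1 - self.alpha) * rtt_ms
        return self.estimate


class LossEstimator:
    """Loss-rate minimizer input: fraction lost among the last k probes."""

    def __init__(self, k: int = 100) -> None:  # k is an assumed window size
        self.window = deque(maxlen=k)  # oldest sample drops out automatically

    def update(self, lost: bool) -> float:
        self.window.append(lost)
        return sum(self.window) / len(self.window)


def tcp_throughput_bytes_per_s(rtt_s: float, loss: float, mss_bytes: int = 1460) -> float:
    """Throughput optimizer input: MSS * sqrt(3/2) / (RTT * sqrt(p)).

    Square-root TCP model used as a stand-in for the slide's equation.
    """
    if loss <= 0.0:
        return math.inf  # model is undefined without loss; treat as unbounded
    return mss_bytes * math.sqrt(1.5) / (rtt_s * math.sqrt(loss))


def should_switch(current: float, candidate: float, hysteresis: float = 0.05) -> bool:
    """Switch only if the candidate's score beats the current path's by more
    than `hysteresis` (5% by default). Higher scores are assumed better."""
    return candidate > current * (1.0 + hysteresis)
```

For metrics where lower is better, such as latency or loss, the same hysteresis check can be applied to a reciprocal or negated score; the threshold keeps the router from flapping between two paths whose estimates differ only by noise.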

15
Policy routing
  • Router computes a forwarding table for each
    policy
  • Two ways of describing policies
  • Exclusive cliques (e.g., educational only)
  • General policies
  • BPF-like packet matcher, which returns a policy
  • Links that are denied by a policy
  • Entry node classifies packet with a policy tag
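The policy mechanism described above (an entry-node classifier that tags each packet, and a separate forwarding table per policy that excludes the links the policy denies) can be sketched as below. The Packet fields, the classifier rule, the policy names, and the latencies are hypothetical; the rule loosely mirrors the No-Internet2-for-commercial-use policy used in the experiments, and a plain Python function stands in for the BPF-like matcher.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, str]


@dataclass
class Packet:
    src: str
    dst: str
    commercial: bool  # hypothetical field inspected by the classifier


def classify(pkt: Packet) -> str:
    """Entry-node classifier: returns a policy tag for the packet."""
    return "commercial" if pkt.commercial else "research"


def next_hop(lat: Dict[Link, float], nodes: List[str], src: str, dst: str,
             denied: Set[Link]) -> Optional[str]:
    """Next hop toward dst using only allowed links (direct or one intermediate)."""
    def cost(a: str, b: str) -> float:
        return math.inf if (a, b) in denied else lat.get((a, b), math.inf)

    best, best_cost = None, cost(src, dst)
    if best_cost < math.inf:
        best = dst  # direct virtual link is allowed and up
    for via in nodes:
        if via in (src, dst):
            continue
        c = cost(src, via) + cost(via, dst)
        if c < best_cost:
            best, best_cost = via, c
    return best


nodes = ["A", "B", "C"]
lat = {("A", "B"): 10.0, ("B", "A"): 10.0,  # hypothetical Internet2 link
       ("A", "C"): 15.0, ("C", "A"): 15.0,
       ("C", "B"): 15.0, ("B", "C"): 15.0}
denied_by_policy = {"research": set(), "commercial": {("A", "B"), ("B", "A")}}

# One forwarding table per policy: (src, dst) -> next hop.
tables = {
    policy: {(s, d): next_hop(lat, nodes, s, d, denied)
             for s in nodes for d in nodes if s != d}
    for policy, denied in denied_by_policy.items()
}


def forward(pkt: Packet) -> Optional[str]:
    return tables[classify(pkt)][(pkt.src, pkt.dst)]
```

A research packet from A to B uses the direct link, while a commercial packet is steered through C because its policy denies the A to B link.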

16
Responding to failure
  • Probe interval 12 seconds
  • Probe timeout 3 seconds
  • Routing update interval 14 seconds

17
RON overhead
RON size   10 nodes   20 nodes   30 nodes   40 nodes   50 nodes
Overhead   1.8 Kbps   5.9 Kbps   12 Kbps    21 Kbps    32 Kbps
  • Probe overhead: 69 bytes
  • RON routing overhead: 60 + 20(N-1) bytes
  • With 50 nodes, RON allows recovery times between 12 and 25 s
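The overhead figures can be roughly reproduced with a back-of-the-envelope model. The model is our assumption, not the authors' accounting: each node sends one 69-byte probe to each of its N-1 peers per 12 s probe interval, and one routing update of 60 + 20(N-1) bytes to each peer per 14 s routing interval (both intervals taken from the "Responding to failure" slide). Under it, the per-node results land within about 10% of the table.

```python
PROBE_BYTES = 69            # probe overhead from the slide
PROBE_INTERVAL_S = 12.0     # probe interval
ROUTING_INTERVAL_S = 14.0   # routing update interval


def routing_update_bytes(n: int) -> int:
    """Routing update size for an n-node RON: 60 + 20(n-1) bytes."""
    return 60 + 20 * (n - 1)


def per_node_overhead_kbps(n: int) -> float:
    """Assumed per-node send rate: probes plus routing updates to n-1 peers."""
    peers = n - 1
    probe_bps = peers * PROBE_BYTES * 8 / PROBE_INTERVAL_S
    routing_bps = peers * routing_update_bytes(n) * 8 / ROUTING_INTERVAL_S
    return (probe_bps + routing_bps) / 1000.0


SLIDE_KBPS = {10: 1.8, 20: 5.9, 30: 12.0, 40: 21.0, 50: 32.0}
for n, reported in SLIDE_KBPS.items():
    print(f"{n} nodes: model {per_node_overhead_kbps(n):.1f} Kbps, slide {reported} Kbps")
```

Because the routing term grows as (N-1)(60 + 20(N-1)), overhead in this model is roughly quadratic in N, which is consistent with the design choice of keeping each RON small.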

18
Many research questions
  • Does the RON approach work at all?
  • Each RON is small in size, no more than 50 or 100
    nodes
  • How fast can failure detection and recovery happen?
  • Policy routing
  • Doesn't RON violate AUPs and other policies?
  • Routing behavior
  • Can stable routing be achieved?
  • Implementing efficient multi-criteria routing
  • Is it safe to deploy a large number of (small)
    interacting RONs on the Internet?

19
IP forwarder
  • A RON application
  • Transparently forwards IP traffic over RON
  • Allows comparisons of IP traffic over RON versus
    over direct Internet

20
RON deployment (19 sites)
[Map: US sites, with links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]
Sites: .com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx), cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit, vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela

21
AS view
22
Experiments
  • Measure loss, latency, and throughput with and
    without RON
  • RON1: 12 hosts in the US and Europe
  • 64 hours of measurements in March 2001
  • RON2: 16 hosts
  • 85 hours of measurements in May 2001
  • 30-minute average loss rates
  • A 30-minute outage is very serious!
  • Note: experiments done with a No-Internet2-for-commercial-use
    policy

23
Take home messages
  1. RON reduced outages by a factor of 5 to 10, and
    routed around all major outages
  2. RON takes 18 s (average) to route around a
    failure, and can do so in the face of flooding
    attacks
  3. Single-hop indirection delivers the majority of
    RON's benefits

24
RON improves loss-rate
[Figure: 30-minute average loss rate on the Internet vs. 30-minute average loss rate with RON, 13,000 samples]
  • RON loss rate never more than 30%

25
An order-of-magnitude fewer failures
Loss rate   RON Better   No Change   RON Worse
10%         526  517     58  51      47  45
20%         142  140     4   3       15  15
30%         32   32      0           0
50%         20   20      0           0
80%         14   14      0           0
100%        10           0           0
(30-minute average loss rates)
  • 6,825 path hours represented here
  • 12 path hours of essentially complete outage
  • 72 path hours of TCP outage
  • RON routed around all of these!
  • One indirection hop provides almost all the benefit!

26
Why does one hop work?
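[Figure: source and target connected directly and through R RON nodes; each virtual link is good with probability p and bad with probability 1 - p]
$P(\text{good path}) \approx 1 - (1 - p^2)^{R-1}$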
  • In the RON testbed
  • P(direct path is good) is 48.8%
  • P(intermediate path is good) is 51%
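Under the idealized model this slide sketches (R nodes, each virtual link independently good with probability p), the chance that source and target have at least one usable path can be computed directly. The sketch below gives the exact value for a direct path plus R-2 one-hop paths, and the simpler form that treats all R-1 candidate paths as two-link paths. Independence is an assumption here; real Internet failures need not be independent.

```python
def p_good_exact(p: float, r: int) -> float:
    """P(source can reach target) with a direct path plus r - 2 one-hop paths.

    Assumes each virtual link is independently good with probability p, so a
    path through one intermediate (two links) is good with probability p**2.
    """
    return 1.0 - (1.0 - p) * (1.0 - p * p) ** (r - 2)


def p_good_two_link_bound(p: float, r: int) -> float:
    """Simpler form 1 - (1 - p^2)^(r-1): treats the direct path like a two-link
    path, so it never exceeds p_good_exact for 0 <= p <= 1."""
    return 1.0 - (1.0 - p * p) ** (r - 1)


for r in (2, 5, 10, 20):
    print(r, round(p_good_exact(0.9, r), 6), round(p_good_two_link_bound(0.9, r), 6))
```

Even with a handful of intermediates, the probability that every candidate path is bad shrinks geometrically, which is the intuition behind one indirection hop capturing most of the benefit.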

27
Resilience Against DoS Attacks
28
Latency using RON
29
What's next for RON?
  • Data mining of collected samples
  • Applications
  • Routing policies (e.g., rate control)

30
Other progress: Chord
  • Chord: a peer-to-peer lookup system
  • CFS: a peer-to-peer file sharing application
  • www.pdos.lcs.mit.edu/chord

31
Conclusion
  • Improved availability of Internet communication
    paths using small overlays
  • Layered above scalable IP substrate
  • RON provides a set of libraries and programs to
    facilitate this application-specific routing
  • Experimental data suggest that the approach works
  • Over 10x improvement in availability
  • Outage detection and recovery in about 15 seconds
  • Able to route around certain denial-of-service
    attacks
  • Many interesting questions remain

http://nms.lcs.mit.edu/ron/

32
Policy Routing
  • Today, wide-area policy expression is a
    sledgehammer
  • Policy control is important
  • From talking to some providers
  • E.g., rate control policy, Internet2, etc.
  • True, RONs could violate AUPs
  • But, the RON approach enables more flexible
    policies
  • More complex routing decisions, including rate-based ones
  • Multiple routing tables
  • Deeper packet inspection, etc.

33
Example
34
Throughput Improvement