Towards an Internet that Never Fails - PowerPoint PPT Presentation

1
Towards an Internet that Never Fails
  • Hari Balakrishnan, MIT
  • Joint work with Nick Feamster, Scott Shenker,
    Mythili Vutukuru

2
What We Should Aim Toward
  • Carrier airlines (2002 FAA Fact Book)
    • 41 accidents, 6.7 million flights (five nines
      availability)
  • 911 phone service (1993 NRIC report)
    • 29 minutes downtime per year per line (four
      nines availability)
  • Standard phone service (various sources)
    • 53 minutes downtime per year per line (four
      nines availability)
  • The Internet?
    • One to two nines
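
The availability figures above can be checked with a few lines of arithmetic. The sketch below is not from the talk; it assumes a 365-day year and counts "nines" by rounding to the nearest whole number, which appears to be how the slide counts them (53 minutes of downtime per year works out to about 99.9899%, marginally short of a strict four nines). The function name `nines` is only illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year (assumption)


def nines(unavailability):
    """Availability expressed in 'nines', rounded to the nearest whole number."""
    return round(-math.log10(unavailability))


# Figures quoted on the slide, converted to a fraction of failures or downtime.
airline = 41 / 6.7e6                # accidents per flight
phone_911 = 29 / MINUTES_PER_YEAR   # fraction of the year a line is down
phone_std = 53 / MINUTES_PER_YEAR

for label, u in [("airlines", airline),
                 ("911 service", phone_911),
                 ("standard phone", phone_std)]:
    print(f"{label}: {100 * (1 - u):.4f}% available, about {nines(u)} nines")
```

On the same scale, moving from one or two nines to four or five nines is a reduction in unavailability of two to three orders of magnitude.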

3
Example Catastrophic Failures
"...a glitch at a small ISP triggered a major
outage in Internet access across the country.
The problem started when MAI Network
Services...passed bad router information from one
of its customers onto Sprint."
-- news.com, April 25, 1997
"Microsoft's websites were offline for up to 23
hours...because of a router misconfiguration...it
took nearly a day to determine what was wrong and
undo the changes."
-- wired.com, January 25, 2001
"WorldCom Inc...suffered a widespread outage on its
Internet backbone that affected roughly 20
percent of its U.S. customer base. The network
problems...affected millions of computer users
worldwide. A spokeswoman attributed the outage to
'a route table issue.'"
-- cnn.com, October 3, 2002
"A number of Covad customers went out from 5pm
today due to, supposedly, a DDOS (distributed
denial of service attack) on a key Level3 data
center, which later was described as a route leak
(misconfiguration)."
-- dslreports.com, February 23, 2004
4
NANOG List Failure Analysis
More than 70% of threads discussing failures
related to router configuration or route
announcement problems
Note: Only includes problems openly discussed on
this list.
5
Faults and Failures
  • Fault: Underlying defect in a component that
    causes it to violate a specification
  • Latent or Active (i.e., cause errors)
  • Unmasked faults (errors) cause failures
  • Failure of subsystem (spec violation) causes
    fault in system
  • Internet faults occur for complex reasons
  • Hardware, software, protocol, design,
    implementation, operational faults could be
    triggered by malice
  • Internet failure: A cannot communicate with B

6
Three Directions
  • Configuration as programming
    • Defines BGP behavior
    • Tools to cope with routing complexity
  • Coping with protocol faults: failure-atomic
    interdomain routing
    • Prefix-based routing considered harmful
  • End-to-end routing
    • Exposing multiple paths to end systems (and stubs)

7
Configuration Defines BGP Behavior
Flexibility for realizing goals in complex
business landscape
  • Which neighboring networks can send traffic
  • Where traffic enters and leaves the network
  • How routers within the network learn routes to
    external destinations

[Diagram labels: Traffic, Route, No Route, Flexibility, Complexity]
8
Today: Reactive Operation
What happens if I tweak this policy?
  • Problems cause downtime
  • Problems often not immediately apparent

9
Coping with Complexity
  • View configuration as (distributed) programming
  • Large-scale: over 1M lines of code in some
    networks
  • Programming tools to reduce fault frequency
  • Static analysis can detect many faults (rcc)
  • Sandboxing to overcome current stimulus-response
    reasoning [FR03]
  • Centralize configuration platform
  • More intentional config specs
  • Push configs to routers
  • Push routes to routers [RCPF04]
  • Use static analysis and sandboxing tools

10
Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
[Diagram labels: distributed router configurations (single AS), normalized representation, correctness specification, constraints, rcc, faults]
  • Represent complex, distributed configuration
  • Define a correctness specification
  • Map specification to constraints

11
Factoring Routing Configuration
Hundreds of thousands of lines of configuration
in hundreds of routers.
Ranking: route selection
[Diagram labels: Customer, Competitor, Primary, Backup]
12
Correctness Specification
Path Visibility: Every destination with a usable
path has a route advertisement
If there exists a path, then there exists a route
Example violation: Signaling partition
Route Validity: Every route advertisement
corresponds to a usable path
If there exists a route, then there exists a path
Example violation: Routing loop
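
The two properties above can be illustrated on a toy network. The sketch below is not how rcc works (rcc statically analyzes router configurations); it only restates the two implications over an assumed model in which `links` maps each router to its usable neighbors and `routes` maps a (router, destination) pair to a next hop. All names and the data layout are hypothetical.

```python
from collections import deque


def reachable(links, src):
    """Return the set of nodes reachable from src over usable links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def path_visibility_violations(links, routes, dests):
    """(router, dest) pairs where a path exists but no route is known."""
    nodes = set(links) | {n for ns in links.values() for n in ns}
    return sorted(
        (r, d) for r in nodes for d in dests
        if r != d and d in reachable(links, r) and (r, d) not in routes
    )


def route_validity_violations(links, routes):
    """(router, dest) pairs whose chain of next hops never reaches dest."""
    bad = []
    for (router, dest) in routes:
        node, visited = router, set()
        while node != dest:
            nxt = routes.get((node, dest))
            # A revisited node is a loop; a missing or unlinked next hop is a dead end.
            if node in visited or nxt is None or nxt not in links.get(node, ()):
                bad.append((router, dest))
                break
            visited.add(node)
            node = nxt
    return sorted(bad)


links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

# A can reach C through B, but only B has learned a route: path visibility fails at A.
print(path_visibility_violations(links, {("B", "C"): "C"}, ["C"]))

# A and B point at each other for C: a routing loop, so route validity fails.
print(route_validity_violations(links, {("A", "C"): "B", ("B", "C"): "A"}))
```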
13
Results: Faults across 17 ASes
Every AS had faults, regardless of network size
Most faults can be attributed to distributed
configuration
[Chart categories: Route Validity, Path Visibility]
14
Web-based Command Line Interfaces
http://nms.csail.mit.edu/rcc
15
Three Directions
  • Configuration as programming
    • Tools to cope with routing complexity
  • Coping with protocol faults: failure-atomic
    interdomain routing
    • Prefix-based routing considered harmful
  • End-to-end routing
    • Exposing multiple paths to end systems

16
Prefixes are too coarse-grained
  • Validity: If a failure occurs that makes a
    network unreachable via a given path, then the
    route corresponding to that path must be withdrawn

70% of intra-AS failures not visible in BGP
[FABK03]
17
but they are also too fine-grained!
  • 70% of discontiguous prefix pairs from the same
    AS are announced from the same location
  • Allocation explains about 60% of these cases
  • Registries often allocate discontiguous address
    blocks to a single AS on the same day
  • Routes for these prefixes will flap together.
  • 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)

Route objects should correspond to an atom of
hosts that share fate
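
One way to picture an "atom" is to group announced prefixes by whatever determines their shared fate. The sketch below is only an illustration under assumptions: it treats the pair (origin AS, announcement location) as the fate-sharing key, and it uses documentation prefixes, a documentation AS number, and made-up location labels rather than real measurements.

```python
from collections import defaultdict
from ipaddress import ip_network


def group_into_atoms(announcements):
    """Group prefixes by (origin AS, announcement location).

    The grouping key is an assumption made for illustration; prefixes in
    the same group are expected to appear and disappear together.
    """
    atoms = defaultdict(list)
    for prefix, origin_as, location in announcements:
        atoms[(origin_as, location)].append(ip_network(prefix))
    return {key: sorted(nets) for key, nets in atoms.items()}


# Hypothetical announcements: three discontiguous prefixes from one AS.
announcements = [
    ("192.0.2.0/24", 64500, "pop-east"),
    ("203.0.113.0/24", 64500, "pop-east"),
    ("198.51.100.0/24", 64500, "pop-west"),
]

for (origin_as, location), prefixes in group_into_atoms(announcements).items():
    print(origin_as, location, [str(p) for p in prefixes])
```

Here the two prefixes announced from the same location fall into one atom even though they are not contiguous, so a single route object could cover both.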
18
Proposal: Atomic Interdomain Protocol (AIP)
  • Exterminate prefixes
  • Name atomic domains (AD) directly
  • Addressing, forwarding and routing on ADs
  • Like current AS numbers, but finer-grained
  • Example: MIT, Microsoft Redmond, one PoP of a
    large ISP, ...
  • Flat AD IDs can carry cryptographic meaning
  • Self-certifying (hash of public key)
  • End-system addresses have the form AD:LocalID
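
A self-certifying identifier can be sketched in a few lines: the AD's ID is derived from its public key, so anyone holding the key can check the binding without a registry. The slide does not specify a hash function, an ID length, or an address syntax; SHA-256, the 40-hex-character truncation, the colon separator, and all function names below are assumptions for illustration, and the key is placeholder bytes rather than a real public key.

```python
import hashlib
import hmac


def ad_id_from_public_key(public_key: bytes) -> str:
    """Derive a flat AD ID as a truncated hash of the public key (assumed scheme)."""
    return hashlib.sha256(public_key).hexdigest()[:40]


def verify_ad_id(ad_id: str, presented_key: bytes) -> bool:
    """Check that the presented key really hashes to the claimed AD ID."""
    return hmac.compare_digest(ad_id, ad_id_from_public_key(presented_key))


def end_system_address(ad_id: str, local_id: str) -> str:
    """Combine the AD ID and a local identifier into one address string."""
    return f"{ad_id}:{local_id}"


key = b"placeholder public key bytes for one atomic domain"
ad_id = ad_id_from_public_key(key)

print(end_system_address(ad_id, "host-17"))
print(verify_ad_id(ad_id, key))              # the genuine key matches
print(verify_ad_id(ad_id, b"another key"))   # a different key does not
```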

19
Exposing Paths to End Systems
  • Ultimately, failure recovery is an end-to-end
    function
  • Current architecture doesn't expose multiple
    paths to end systems and stubs
  • Result: Various hacks to discover distinct
    paths across overlays and underlays

20
Summary
  • It's worth shooting for a two or three
    order-of-magnitude improvement in Internet
    availability
  • It's possible to get four or five nines of
    Internet availability, if we:
    • Develop tools to cope with configuration
      complexity
    • Develop a failure-atomic routing system
    • Expose multiple IP-layer paths to higher layers