Title: Towards an Internet that Never Fails
1 Towards an Internet that Never Fails
- Hari Balakrishnan, MIT
- Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru
2 What We Should Aim Toward
- Carrier airlines (2002 FAA Fact Book)
- 41 accidents, 6.7 million flights (five nines availability)
- 911 phone service (1993 NRIC report)
- 29 minutes downtime per year per line (four nines availability)
- Standard phone service (various sources)
- 53 minutes downtime per year per line (four nines availability)
- The Internet?
- One to two nines
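The "nines" figures above follow from the fraction of time (or of flights) lost. A minimal Python sketch of that arithmetic, assuming a 365-day year; the function name `nines` is illustrative and does not come from the talk:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def nines(unavailability: float) -> float:
    """Return availability expressed as a (fractional) number of nines.

    An unavailability of 1e-4 (availability 99.99%) corresponds to 4 nines.
    """
    if not 0 < unavailability < 1:
        raise ValueError("unavailability must be strictly between 0 and 1")
    return -math.log10(unavailability)


# Figures quoted on the slide.
airline = nines(41 / 6.7e6)               # accidents per flight
phone_911 = nines(29 / MINUTES_PER_YEAR)  # downtime minutes per year
phone_std = nines(53 / MINUTES_PER_YEAR)

print(f"airlines: {airline:.2f} nines")
print(f"911:      {phone_911:.2f} nines")
print(f"phone:    {phone_std:.2f} nines")
```

By this measure, 53 minutes per year sits just under four nines (the exact four-nines budget is about 52.6 minutes per year), so the slide's figures are rounded.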
3 Example Catastrophic Failures
"a glitch at a small ISP triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint." -- news.com, April 25, 1997
"Microsoft's websites were offline for up to 23 hours...because of a router misconfiguration...it took nearly a day to determine what was wrong and undo the changes." -- wired.com, January 25, 2001
"WorldCom Inc...suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems...affected millions of computer users worldwide. A spokeswoman attributed the outage to 'a route table issue.'" -- cnn.com, October 3, 2002
"A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration)." -- dslreports.com, February 23, 2004
4 NANOG List Failure Analysis
More than 70% of threads discussing failures are related to router configuration or route announcement problems
Note: Only includes problems openly discussed on this list.
5 Faults and Failures
- Fault: Underlying defect in a component that causes it to violate a specification
- Latent or Active (i.e., cause errors)
- Unmasked faults (errors) cause failures
- Failure of subsystem (spec violation) causes fault in system
- Internet faults occur for complex reasons
- Hardware, software, protocol, design, implementation, operational faults; could be triggered by malice
- Internet failure: A cannot communicate with B
6 Three Directions
- Configuration as programming
- Defines BGP behavior
- Tools to cope with routing complexity
- Coping with protocol faults: failure-atomic interdomain routing
- Prefix-based routing considered harmful
- End-to-end routing
- Exposing multiple paths to end systems (and stubs)
7 Configuration Defines BGP Behavior
Flexibility for realizing goals in complex
business landscape
- Which neighboring networks can send traffic
- Where traffic enters and leaves the network
- How routers within the network learn routes to
external destinations
[Figure labels: Traffic, Route, No Route; Flexibility, Complexity]
8 Today: Reactive Operation
What happens if I tweak this policy?
- Problems cause downtime
- Problems often not immediately apparent
9 Coping with Complexity
- View configuration as (distributed) programming
- Large-scale: over 1M lines of code in some networks
- Programming tools to reduce fault frequency
- Static analysis can detect many faults [rcc]
- Sandboxing to overcome current stimulus-response reasoning [FR03]
- Centralize configuration platform
- More intentional config specs
- Push configs to routers
- Push routes to routers [RCPF04]
- Use static analysis and sandboxing tools
10 Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
[Figure: rcc takes distributed router configurations (single AS) and a correctness specification, builds a normalized representation, applies constraints, and reports faults]
- Represent complex, distributed configuration
- Define a correctness specification
- Map specification to constraints
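The three steps above can be sketched in Python on a toy scale. This is not rcc's implementation: the configuration syntax, the `Session` record, and the single constraint checked (every referenced filter must be defined on the same router) are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-router configuration snippets (not real router syntax).
CONFIGS = {
    "r1": ["neighbor r2 filter-in CUST", "define filter CUST"],
    "r2": ["neighbor r1 filter-in PEER"],  # PEER is never defined on r2
}


@dataclass(frozen=True)
class Session:
    router: str
    neighbor: str
    filter_name: str


def normalize(configs):
    """Step 1: flatten distributed configs into one normalized representation."""
    sessions, defined = [], set()
    for router, lines in configs.items():
        for line in lines:
            words = line.split()
            if words[0] == "neighbor":
                sessions.append(Session(router, words[1], words[3]))
            elif words[:2] == ["define", "filter"]:
                defined.add((router, words[2]))
    return sessions, defined


def check_filters_defined(sessions, defined):
    """Steps 2-3: a correctness rule expressed as a constraint over the
    normalized representation; each violation is reported as a fault."""
    return [
        f"{s.router}: session to {s.neighbor} uses undefined filter {s.filter_name}"
        for s in sessions
        if (s.router, s.filter_name) not in defined
    ]


sessions, defined = normalize(CONFIGS)
faults = check_filters_defined(sessions, defined)
for fault in faults:
    print(fault)
```

The point of the normalized form is that a constraint is written once and evaluated over every router's configuration, before anything is deployed.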
11 Factoring Routing Configuration
Hundreds of thousands of lines of configuration
in hundreds of routers.
[Figure: Ranking (route selection). Labels: Customer, Primary, Competitor, Backup]
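Rankings like those labeled on this slide (customer over competitor, primary over backup) are commonly expressed in BGP through local preference, where a higher value wins. A minimal sketch, assuming hypothetical preference values and a simplified two-step decision (local preference, then AS-path length); the `Route` record and `best_route` helper are illustrative names:

```python
from dataclasses import dataclass

# Hypothetical local-preference values; real networks choose their own.
LOCAL_PREF = {"customer": 200, "primary": 150, "backup": 100, "competitor": 50}


@dataclass(frozen=True)
class Route:
    prefix: str
    learned_from: str  # key into LOCAL_PREF
    as_path: tuple     # sequence of AS numbers


def best_route(routes):
    """Pick the route with the highest local preference, breaking ties
    by the shortest AS path (a simplified decision process)."""
    return max(routes, key=lambda r: (LOCAL_PREF[r.learned_from], -len(r.as_path)))


candidates = [
    Route("192.0.2.0/24", "competitor", (64500,)),
    Route("192.0.2.0/24", "customer", (64501, 64502)),
    Route("192.0.2.0/24", "backup", (64503,)),
]
print(best_route(candidates).learned_from)
```

Note that the customer route wins despite its longer AS path, because local preference is compared first.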
12 Correctness Specification
Path Visibility: Every destination with a usable path has a route advertisement
If there exists a path, then there exists a route
Example violation: Signaling partition
Route Validity: Every route advertisement corresponds to a usable path
If there exists a route, then there exists a path
Example violation: Routing loop
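One way a signaling partition can arise is a missing iBGP session. A minimal sketch of a path-visibility check, assuming a full-mesh iBGP design with no route reflectors; the router names and session list are hypothetical, and this illustrates the idea rather than rcc's actual algorithm:

```python
from itertools import combinations


def missing_ibgp_sessions(routers, sessions):
    """Return router pairs that lack an iBGP session.

    In a full mesh (no route reflectors), a route learned over iBGP is not
    re-advertised to other iBGP peers, so every pair needs a direct session.
    """
    configured = {frozenset(pair) for pair in sessions}
    return [
        (a, b)
        for a, b in combinations(sorted(routers), 2)
        if frozenset((a, b)) not in configured
    ]


routers = {"r1", "r2", "r3"}
sessions = [("r1", "r2"), ("r2", "r3")]  # the r1-r3 session is missing

for a, b in missing_ibgp_sessions(routers, sessions):
    print(f"possible signaling partition: no iBGP session between {a} and {b}")
```

Here r1 and r3 may each hold a usable path that the other never hears about, which is exactly a path existing without a route.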
13 Results: Faults across 17 ASes
Every AS had faults, regardless of network size
Most faults can be attributed to distributed
configuration
[Figure: fault categories grouped under Route Validity and Path Visibility]
14 Web-based Command Line Interfaces
http://nms.csail.mit.edu/rcc
15 Three Directions
- Configuration as programming
- Tools to cope with routing complexity
- Coping with protocol faults: failure-atomic interdomain routing
- Prefix-based routing considered harmful
- End-to-end routing
- Exposing multiple paths to end systems
16 Prefixes are too coarse-grained
- Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
70% of intra-AS failures not visible in BGP [FABK03]
17 ...but they are also too fine-grained!
- 70% of discontiguous prefix pairs from the same AS are announced from the same location
- Allocation explains about 60% of these cases
- Registries often allocate discontiguous address blocks to a single AS on the same day
- Routes for these prefixes will flap together.
- 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
Route objects should correspond to an atom of hosts that share fate
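The two prefixes in the example cannot be merged into a single CIDR block, so prefix-based routing carries them as separate routes even though, per the slide, they flap together. A minimal sketch using Python's standard `ipaddress` module; the helper name `aggregatable` is illustrative:

```python
import ipaddress


def aggregatable(a: str, b: str) -> bool:
    """True if the two prefixes collapse into a single CIDR block."""
    nets = [ipaddress.ip_network(a), ipaddress.ip_network(b)]
    return len(list(ipaddress.collapse_addresses(nets))) == 1


# Prefixes from the slide: discontiguous, so they remain two routes.
print(aggregatable("135.36.0.0/16", "135.12.0.0/14"))

# For contrast, two adjacent /25s collapse into one /24.
print(aggregatable("10.0.0.0/25", "10.0.0.128/25"))
```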
18 Proposal: Atomic Interdomain Protocol (AIP)
- Exterminate prefixes
- Name atomic domains (AD) directly
- Addressing, forwarding and routing on ADs
- Like current AS numbers, but finer-grained
- Example: MIT, Microsoft Redmond, one PoP of a large ISP, ...
- Flat AD IDs can carry cryptographic meaning
- Self-certifying (hash of public key)
- End-system addresses have the form AD:LocalID
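A self-certifying identifier lets anyone check that a public key belongs to an AD by rehashing the key. A minimal sketch, assuming SHA-256 as the hash and a hex-encoded `AD:LocalID` text form; neither choice is specified on the slide, and the key bytes below are a placeholder rather than a real public key:

```python
import hashlib


def ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier as the hash of the AD's public key."""
    return hashlib.sha256(public_key).hexdigest()


def address(public_key: bytes, local_id: str) -> str:
    """Form an end-system address as AD:LocalID (textual form is assumed)."""
    return f"{ad_id(public_key)}:{local_id}"


def key_matches_ad(claimed_ad: str, public_key: bytes) -> bool:
    """Self-certification check: the presented key must hash to the AD ID."""
    return ad_id(public_key) == claimed_ad


placeholder_key = b"example-public-key-bytes"  # stand-in, not a real key
addr = address(placeholder_key, "host42")
claimed_ad, local_id = addr.split(":", 1)

print(key_matches_ad(claimed_ad, placeholder_key))    # matching key
print(key_matches_ad(claimed_ad, b"some-other-key"))  # different key
```

Because the identifier is derived from the key, no external registry is needed to bind the two; a full design would also have the AD sign its announcements with the matching private key.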
19 Exposing Paths to End Systems
- Ultimately, failure recovery is an end-to-end function
- Current architecture doesn't expose multiple paths to end systems and stubs
- Result: Various hacks to discover distinct paths across overlays and underlays
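Why multiple paths help can be seen with simple arithmetic: if an end system can fail over among k paths that fail independently, communication is lost only when all k are down at once. A minimal sketch; independent failures and instant failover are idealized assumptions, and real paths often share links and so do worse:

```python
def combined_availability(path_availability: float, k: int) -> float:
    """Availability of k independent paths, each with the given availability,
    when any one working path suffices."""
    if not 0.0 <= path_availability <= 1.0 or k < 1:
        raise ValueError("need 0 <= availability <= 1 and k >= 1")
    return 1.0 - (1.0 - path_availability) ** k


for k in (1, 2, 3):
    print(k, combined_availability(0.99, k))
```

Under these assumptions, two paths at two nines each give roughly four nines (about 0.9999).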
20 Summary
- It's worth shooting for a two or three order-of-magnitude improvement in Internet availability
- It's possible to get four or five nines of Internet availability, if we:
- Develop tools to cope with configuration complexity
- Develop a failure-atomic routing system
- Expose multiple IP-layer paths to higher layers