Title: Routing Convergence
1Routing Convergence
2Internet Routing Convergence
- An Experimental Study of Delayed Internet
Routing Convergence
- Craig Labovitz, Abha Ahuja, Farnam Jahanian, Abhijit Bose
- ACM SIGCOMM, September 2000
3Hierarchical Routing -- Review
- Untruths about Internet Routing
- all routers identical
- network flat
- not true in practice
- administrative autonomy
- internet = network of networks
- each network admin may want to control routing in its own network
- scale: with 50 million destinations
- can't store all destinations in routing tables!
- routing table exchange would swamp links!
4Hierarchical Routing
- aggregate routers into regions, autonomous systems (AS)
- routers in same AS run same routing protocol
- intra-AS routing protocol
- routers in different AS can run different intra-AS routing protocol
- special routers in AS (gateway routers)
- run intra-AS routing protocol with all other routers in AS
- also responsible for routing to destinations outside AS
- run inter-AS routing protocol with other gateway routers
5Intra-AS and Inter-AS routing
- Gateways
- perform inter-AS routing amongst themselves
- perform intra-AS routing with other routers in their AS
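(Figure: autonomous systems A, B, and C with their routers; inset shows the protocol stack in gateway A.c, with inter-AS and intra-AS routing at the network layer, above the link layer and physical layer)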
6Intra-AS and Inter-AS routing
(Figure: route to Host h2, using intra-AS routing within AS A and intra-AS routing within AS B)
8AS graphs obscure topology!
The AS graph may look like this.
Tim Griffin, Leiden 2000
9Inter-AS routing (cont)
- BGP (Border Gateway Protocol): the de facto standard
- Path Vector protocol: an extension of Distance Vector
- Each Border Gateway broadcasts to neighbors (peers) the entire path (i.e., sequence of ASs) to destination
- For example, Gateway X may store the following path to destination Z:
- Path (X,Z) = X,Y1,Y2,Y3,...,Z
10Inter-AS routing (cont)
- Now, suppose Gwy X sends its path to peer Gwy W
- Gwy W may or may not select the path offered by Gwy X, because of cost, policy, or loop prevention reasons.
- If Gwy W selects the path advertised by Gwy X, then:
- Path (W,Z) = W, Path (X,Z)
- Note: path selection is based not so much on cost (e.g., number of AS hops), but mostly on administrative and policy issues (e.g., do not route packets through a competitor's AS)
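The selection step above can be sketched in a few lines. This is a minimal illustration, not BGP's actual decision process: the function name `accept_path` and the `banned_ases` parameter are hypothetical, and policy is reduced to a simple set of ASs to avoid.

```python
def accept_path(my_as, advertised_path, banned_ases=frozenset()):
    """Return the path this gateway would store, or None if it rejects the offer.

    advertised_path is the AS sequence received from a peer, e.g. Path(X,Z).
    banned_ases is a hypothetical stand-in for administrative policy.
    """
    # Loop prevention: a path that already contains our own AS would loop.
    if my_as in advertised_path:
        return None
    # Policy: refuse any path that crosses an AS we do not want to use.
    if banned_ases & set(advertised_path):
        return None
    # Path(W,Z) = W, Path(X,Z)
    return [my_as] + list(advertised_path)


path_xz = ["X", "Y1", "Y2", "Y3", "Z"]
path_wz = accept_path("W", path_xz)
```

Here `path_wz` is the path W would advertise onward; an offer containing "W" itself, or one crossing a banned AS, is dropped before cost is ever considered.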
11Inter-AS routing (cont)
- Peers exchange BGP messages using TCP.
- OPEN msg: opens TCP connection to peer and authenticates sender
- UPDATE msg: advertises new path (or withdraws old)
- KEEPALIVE msg: keeps connection alive in absence of UPDATEs; it also serves as ACK to an OPEN request
- NOTIFICATION msg: reports errors in previous msg; also used to close a connection
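For reference, the four message types share a fixed header. The sketch below assumes the layout and type codes of the BGP-4 specification (RFC 4271): a 16-byte marker, a 2-byte length, and a 1-byte type. The names `BgpMessageType`, `parse_header`, and `keepalive` are illustrative, not part of any library.

```python
import struct
from enum import IntEnum


class BgpMessageType(IntEnum):
    OPEN = 1
    UPDATE = 2
    NOTIFICATION = 3
    KEEPALIVE = 4


# Network byte order: 16-byte marker, 16-bit length, 8-bit type (19 bytes total).
HEADER = struct.Struct("!16sHB")


def parse_header(data: bytes):
    """Decode the fixed BGP header and return (message type, total length)."""
    _marker, length, msg_type = HEADER.unpack_from(data)
    if not 19 <= length <= 4096:
        raise ValueError(f"invalid BGP message length: {length}")
    return BgpMessageType(msg_type), length


def keepalive() -> bytes:
    """A KEEPALIVE is just the header: marker of all ones, length 19, type 4."""
    return HEADER.pack(b"\xff" * 16, HEADER.size, BgpMessageType.KEEPALIVE)
```

A KEEPALIVE carries no body, which is why it is cheap enough to send periodically when there are no UPDATEs.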
12Why different Intra- and Inter-AS routing ?
- Policy: Inter is concerned with policies (which provider we must select/avoid, etc.). Intra is contained in a single organization, so no policy decisions are necessary
- Scale: Inter provides an extra level of routing table size and routing update traffic reduction above the Intra layer
- Performance: Intra is focused on performance metrics; it needs to keep costs low. In Inter it is difficult to propagate performance metrics efficiently (latency, privacy, etc.). Besides, policy-related information is more meaningful.
- We need BOTH!
13What is Routing Policy?
- Description of the routing relationship between autonomous systems
- Who are the peers?
- What routes are
- Originated by a peer?
- Imported from each peer?
- Exported to each peer?
- Preferred when multiple routes exist?
- What to do if no route exists?
14The example I mentioned earlier
Date: Fri, 25 Apr 1997 20:16:47 -0500 (CDT)
Subject: ALERT: Massive Routing Failures
At about 10:30 AM today, one of Sprint's
customers (AS7007, Florida Internet Exchange)
began announcing a /24 route for every CIDR block
in the core routing table. This was due to a
configuration problem in that they imported all
their routing into a classful interior routing
protocol and then redistributed the routes back
into BGP, becoming a source for the first class C
network in every CIDR block. Sprint does no
border routing filters, so they happily accepted
these routes and gave them away to all
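Why the bogus /24 announcements attracted traffic: routers forward on the longest matching prefix, so a more-specific route beats the legitimate aggregate. The sketch below illustrates this with Python's standard `ipaddress` module. The 10.0.0.0/8 block and the origin labels are assumptions chosen for illustration, not taken from the alert.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the origin of the most specific prefix covering address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [(net.prefixlen, origin) for net, origin in table.items() if addr in net]
    # The longest prefix (largest prefixlen) wins.
    return max(matches)[1] if matches else None


# Hypothetical routing table: a legitimate aggregate plus a leaked /24
# covering the first class C network inside it.
table = {
    ipaddress.ip_network("10.0.0.0/8"): "legitimate origin",
    ipaddress.ip_network("10.0.0.0/24"): "AS7007",
}

hijacked = longest_prefix_match(table, "10.0.0.5")
unaffected = longest_prefix_match(table, "10.1.2.3")
```

Addresses inside the leaked /24 follow the AS7007 route, while the rest of the aggregate is unaffected.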
15Motivation
- Why should we care about convergence?
- Routing reliability/fault-tolerance on small time scales (minutes) was not previously a priority
- Emerging transaction-oriented and interactive applications (e.g. Internet Telephony) will require higher levels of end-to-end network reliability
- How well does the Internet routing infrastructure tolerate faults?
16Conventional Routing Wisdom
- The Internet is designed to survive a nuclear cataclysm. Internet routing is robust under faults
- Supports path re-routing and restoral on the order of seconds
- The Internet supports fast path rerouting and restoral. BGP has good convergence properties
- Does not exhibit looping/bouncing problems of RIP
- Internet fail-over will improve with faster routers and faster links
- More redundant connections (multi-homing) to Internet will always improve site fault-tolerance
17Contribution
- Labovitz et al. show that most of the conventional wisdom about routing convergence is not accurate
- Measurement of BGP convergence in the Internet
- Analysis/intuition behind delayed BGP routing convergence
- Modifications to BGP implementations which would improve convergence times
18Motivation
- Why have fail-over and fault-tolerance not previously been a priority?
- Applications like email are not delay sensitive and possess fault-tolerance
- TCP/IP fault-tolerance (resend)
- Content replication helps improve reliability for static content
- Network support is required for emerging transaction-oriented and interactive applications (e.g. Internet Telephony, QoS)
19Building a Reliable Internet
- What network support has been proposed already?
- Significant recent improvement on data-link fail-over (e.g. SRP, SONET). Solves some enterprise, intra-domain reliability problems
- Also significant research on QoS and resource reservation protocols for the Internet
- But, all of these protocols assume a stable underlying IP forwarding path
20Background
- Internet sites multi-home, or purchase connectivity from multiple Internet providers, to improve fault tolerance
- Goal: tolerate a single link, router or ISP failure
- 35 Internet end-sites currently multi-homed
21Background Multi-homing
22PSTN versus Internet
- Public Switched Telephone Network (PSTN) is the other network in place.
- Trade-off between
- scalability/extensibility/low cost and
- fault-tolerance/service guarantees/high cost
- PSTN retains significant intermediate state (i.e. circuit setup) and services on relatively few nodes. A Smart Network
- Internet places all intelligence on end-nodes. A Stupid Network
23Trade-Offs
(Figure: trade-off chart for the PSTN on High/Low scales. One group of properties: State, Reliability, Service Guarantees, Development Time, Switch Cost, Coordination. Other group: Scalability, Flexibility, Distributed Operation.)
24Routing
- Unlike circuit-switched PSTN, packet-switched Internet uses hop-by-hop forwarding and next-hop selection
- Global state and circuit-setup used in PSTN
- this is like owning an atlas and planning route
- Internet routers only keep local knowledge and routes learned from neighbors
- like asking directions at each stop
25Internet Routing
- Inter-domain Internet routing protocols are distance vector (i.e. Bellman-Ford) algorithms. Unlike PSTN, no pre-computed backup paths!
- Distance vector protocols are problematic
- Require time to converge
- Suffer from counting to infinity
26Problems with Distance Vector Protocols: Counting to Infinity
(Figure: nodes A and B with routes to destination R; route metrics shown as R 5 and R 7)
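Counting to infinity can be reproduced with a toy model. The sketch below assumes a line topology A - B - R with unit link costs, no split horizon, and RIP's convention that 16 means unreachable. These are assumptions chosen for illustration and do not reproduce the metrics in the figure.

```python
INFINITY = 16  # RIP's definition of "unreachable"


def count_to_infinity(link_cost=1):
    """Trace (dist_a, dist_b) to R after the B-R link fails.

    Before the failure B reaches R directly (cost 1) and A reaches R via B (cost 2).
    After the failure each node keeps trusting the other's stale advertisement.
    """
    dist_a, dist_b = 2, 1
    history = []
    b_turn = True  # B notices the failure first and falls back on A's route
    while dist_a < INFINITY or dist_b < INFINITY:
        if b_turn:
            dist_b = min(INFINITY, dist_a + link_cost)
        else:
            dist_a = min(INFINITY, dist_b + link_cost)
        history.append((dist_a, dist_b))
        b_turn = not b_turn
    return history


trace = count_to_infinity()
```

The metric creeps upward one exchange at a time: `trace` starts at (2, 3) and needs 15 exchanges before both nodes agree that R is unreachable.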
27Internet Routing
- The Internet inter-domain routing protocol, BGP, solves the count-to-infinity problem by keeping a record of the path the route announcement has traveled through the network
- Internet routing commonly (and incorrectly) believed to converge within 30 seconds
28BGP Routing
29Open Question
- After a fault in a path to a multi-homed site, how long does it take for the majority of Internet routers to fail-over to the secondary path?
- Routing table convergence (backbone routers reach steady-state) after a fault
- End-to-end paths stable (normal levels of loss and latency)
(Figure: Customer connected via BGP to a Primary ISP and via BGP to a Backup ISP)
30Internet Fail-Over Experiments
- Instrument the Internet
- Inject routes into geographically and topologically diverse provider BGP peering sessions (Mae-West, Japan, Michigan, London)
- Periodically fail and change these routes (i.e. send withdraws or new attributes)
- Monitor impact of faults through 1) recordings of BGP peering sessions with 20 tier1/tier2 ISPs and 2) active ICMP ECHO measurements (512 byte/second to 100 random web sites)
- Write lots of Perl scripts
- Wait two years (125,000 routing events)
31Experiment (For the Last Two Years)
32Fault Scenarios
- Tup -- A new route is advertised
- Tdown -- A route is withdrawn (i.e. single-homed failure)
- Tshort -- Advertise a shorter/better ASPath (i.e. primary path repaired)
- Tlong -- Advertise a longer/worse ASPath (i.e. primary path fails)
33Major Convergence Results
- Routing convergence requires an order of magnitude longer than expected (10s of minutes)
- Routes converge more quickly following Tup/Repair than Tdown/Failure events (bad news travels more slowly)
- Curiously, withdrawals (Tdown) generate several times as many announcements as announcements (Tup)
34Example of BGP Convergence
TIME      BGP Message/Event
10:40:30  Route Fails/Withdrawn by AS2129
10:41:08  2117 announce 5696 2129
10:41:32  2117 announce 1 5696 2129
10:41:50  2117 announce 2041 3508 3508 4540 7037 1239 5696 2129
10:42:17  2117 announce 1 2041 3508 3508 4540 7037 1239 5696 2129
10:43:05  2117 announce 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:35  2117 announce 1 2041 3508 3508 4540 7037 1239 6113 5696 2129
10:43:59  2117 sends withdraw
- BGP log of updates from AS2117 for route via AS2129
- One BGP withdrawal triggers 6 announcements and one withdrawal from 2117
- Increasing ASPath length until final withdraw
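The growth in ASPath length can be checked directly from the log. The snippet below is a small sketch that counts AS hops in each of the six announcements; the list is copied from the log above, and the helper name `as_path_lengths` is illustrative.

```python
# ASPaths announced by AS2117 after the withdrawal, in order of arrival.
ANNOUNCEMENTS = [
    "5696 2129",
    "1 5696 2129",
    "2041 3508 3508 4540 7037 1239 5696 2129",
    "1 2041 3508 3508 4540 7037 1239 5696 2129",
    "2041 3508 3508 4540 7037 1239 6113 5696 2129",
    "1 2041 3508 3508 4540 7037 1239 6113 5696 2129",
]


def as_path_lengths(paths):
    """Number of AS entries in each announced path (prepended ASs count each time)."""
    return [len(path.split()) for path in paths]


lengths = as_path_lengths(ANNOUNCEMENTS)
never_shrinks = all(a <= b for a, b in zip(lengths, lengths[1:]))
```

The lengths come out as 2, 3, 8, 9, 9, 10: each replacement path is at least as long as the one before it, until no candidate remains and the route is withdrawn.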
35CDF of BGP Routing Table Convergence Times
(Figure: CDF curves labeled New Route, Long->Short Fail-over, Short->Long Fail-over, and Failure)
- Less than half of Tdown events converge within two minutes
- Tup/Tshort and Tdown/Tlong form equivalence classes
- Long tailed distribution (up to 15 minutes)
36Impact of Delayed Convergence
- Why do we care about routing table convergence? It deleteriously impacts end-to-end Internet paths
- ICMP experiment results
- Loss of connectivity, packet loss, latency, and packet re-ordering for an average of 3-5 minutes after a fault
- Why? Routers drop packets for which they do not have a valid next hop. Also problems with cache flushing in some older routers.
37End-to-End Impact Failover
- ICMP loss to 100 randomly chosen web sites with VIF source address of our probe
- Tlong/Tshort exhibit similar relationship as before
38Delayed Convergence Background
- Well known that distance vector protocols exhibit poor convergence behaviors
- Counting to infinity, looping, bouncing problem
- RIP redefines infinity and adds split-horizon, poison reverse, etc.
- Still, slow convergence and not scalable
- BGP advertises ASPaths instead of distance
- Solves counting to infinity and RIP looping problem, but
- BGP can still explore invalid paths during convergence (i.e. the bouncing problem)
39BGP Convergence Example
40N > 4?
(Figure: AS topology; each AS is labeled with its ASPath to AS237)
AS6453: 6453 1239 5696 237
AS2497: 2497 5696 237
AS6113: 6113 2914 237
AS6461: 6461 5696 237
AS1239: 1239 5696 237
AS5696: 5696 237
AS2914: 2914 237
AS237: 237
AS701: 701 6461 5696 237
AS5000: 5000 237
AS1: 1 5696 237
AS1673: 1673 5696 237
41MinRouteAdver Rounds
- Implementation of MinRouteAdver timer and receiver-side loop detection leads to 30-second rounds: time complexity is O(n-3) * 30 seconds
42An Experiment with SSF.OS.BGP4
- The Model
- Topology: full mesh of N ASes, each with just 1 router
- No route filtering
- Shortest path is best
- Advertise, Withdraw, Wait and Watch
- Wait for system to reach stable state, then
- AS 1 advertises a bogus destination to everyone else
- Wait for system to reach a stable state again, then
- AS 1 tells everyone that the bogus route is not reachable through it any more
- Wait for system to reach a stable state again
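The withdrawal phase of this model can be approximated with a much simpler, synchronous toy simulation. This is not SSF.OS.BGP4 and does not model timers or message delays: it assumes every AS processes all pending updates once per round, prefers the shortest loop-free path, and breaks ties by lowest neighbor number. AS 0 plays the role of the originating AS, and the names `select_best` and `simulate_withdrawal` are illustrative.

```python
def select_best(me, rib_in):
    """Shortest path learned from any neighbor that does not already contain me."""
    candidates = [(len(path), peer, path) for peer, path in rib_in.items()
                  if path is not None and me not in path]
    return min(candidates)[2] if candidates else None


def simulate_withdrawal(n):
    """Rounds until a full mesh of n ASes settles after AS 0 withdraws its route."""
    nodes = range(1, n)
    # rib_in[i][j] is the last path AS j announced to AS i (None means withdrawn).
    rib_in = {i: {j: ([0] if j == 0 else [j, 0]) for j in range(n) if j != i}
              for i in nodes}
    best = {i: [0] for i in nodes}          # everyone starts on the direct route
    inbox = {i: {0: None} for i in nodes}   # AS 0 withdraws from every peer
    rounds = 0
    while True:
        outbox = {i: {} for i in nodes}
        changed = False
        for i in nodes:
            rib_in[i].update(inbox[i])
            new_best = select_best(i, rib_in[i])
            if new_best != best[i]:
                best[i] = new_best
                changed = True
                # Announce the new path with our own AS prepended, or withdraw.
                advert = None if new_best is None else [i] + new_best
                for j in nodes:
                    if j != i:
                        outbox[j][i] = advert
        if not changed:
            return rounds
        rounds += 1
        inbox = outbox
```

In this toy model `simulate_withdrawal(4)` settles after 3 rounds and `simulate_withdrawal(5)` after 4: in each round the surviving ASes switch to a longer stale path learned from a peer, so a single withdrawal keeps generating announcements until every loop-free alternative has been exhausted.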
43
(Figure: full mesh of ASes 1 to 5 and the bogus destination)
N    longest path   convergence time after withdrawal (sec)   avg updates due to withdrawal (range)
10   9              150                                        59.50 (35-84)
20   20             480                                        269.55 (58-397)
30   28             720                                        539.10 (118-892)
40   40             1080                                       945.20 (160-1647)
50   46             1260                                       1423.66 (196-2377)
44
. . .
1610.040778415 bgp_at_381 snd update to bgp_at_21 wdsbogus
1610.040778415 bgp_at_381 snd update to bgp_at_201 wdsbogus
1610.040778415 bgp_at_381 snd update to bgp_at_321 wdsbogus
1610.040778415 bgp_at_381 snd update to bgp_at_441 wdsbogus
1610.040890567 bgp_at_321 snd update to bgp_at_381 nlribogus,asp32 44 34 38 4 22 2 20 48 10 26 12 6 16 36 8 14 24 28 41 18 51 21 33 45 43 35 3 5 47 23 31 37 49 25 46 39 7 27 13 9 29 11 15 17 50 19 42 40 30 1
1610.040890567 bgp_at_321 snd update to bgp_at_441 wdsbogus
1610.040907352 bgp_at_441 snd update to bgp_at_381 wdsbogus
1610.040907352 bgp_at_441 snd update to bgp_at_341 nlribogus,asp44 38 34 32 4 22 2 20 48 10 26 12 6 16 36 8 14 24 28 41 18 51 21 33 45 43 35 3 5 47 23 31 37 49 25 46 39 7 27 13 9 29 11 15 17 50 19 42 40 30 1
1610.050930294 bgp_at_441 snd update to bgp_at_321 wdsbogus
. . .
45The Problem with BGP
- If we assume
- unbounded delay on BGP processing and propagation
- full mesh of BGP peers
- Constrained shortest path first selection algorithm
- BGP is O(N!), where N = number of default-free BGP speakers
There exists a possible ordering of messages such that BGP will explore all possible ASPaths of all possible lengths
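To get a feel for how quickly the search space grows, the sketch below counts the loop-free AS paths from one AS to a fixed destination AS in a full mesh of n ASes by brute-force enumeration. It is a counting illustration only, not the proof behind the O(N!) bound, and `count_simple_paths` is an illustrative name.

```python
from itertools import permutations


def count_simple_paths(n):
    """Loop-free paths between two fixed ASes in a full mesh of n ASes.

    Any ordered selection of the other n - 2 ASes forms a valid chain of
    intermediate hops, because every pair of ASes is directly connected.
    """
    intermediates = range(n - 2)
    return sum(1
               for hops in range(n - 1)                     # 0 .. n-2 intermediates
               for _ in permutations(intermediates, hops))


counts = [count_simple_paths(n) for n in (4, 5, 6)]
```

The count goes 5, 16, 65 for n = 4, 5, 6, growing factorially with n, which is why an adversarial message ordering can keep a full mesh exploring paths for so long.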
46BGP and RIP
- RIP: precisely monotonically increasing. Can explore metrics (1 to N)
- BGP: monotonically increasing. Multiple (N!) ways to represent a path metric of N.
- BGP solved RIP routing table loop problem by making it exponentially worse
2117 5696 2129
2117 1 5696 2129
2117 2041 3508 3508 4540 7037 1239 5696 2129
2117 1 2041 3508 3508 4540 7037 1239 5696 2129
2117 2041 3508 3508 4540 7037 1239 6113 5696 2129
2117 1 2041 3508 3508 4540 7037 1239 6113 5696 2129
47BGP Best Case
- What is the best we can expect from BGP?
- Implementation of MinRouteAdver timer leads to 30-second rounds
- Time complexity is O(n-3) * 30 seconds
- State/Computational complexity O(n)
- At its best, BGP performs as well as RIP2 (but uses exponentially more memory in the process)
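As a worked example of the best-case bound, the sketch below evaluates (n - 3) rounds of 30 seconds each. It assumes exactly one round per MinRouteAdver interval and a full mesh of n ASes; the function name is illustrative.

```python
MIN_ROUTE_ADVER_SECONDS = 30  # length of one MinRouteAdver round


def best_case_convergence_seconds(n):
    """Best-case bound: (n - 3) MinRouteAdver rounds for a full mesh of n ASes."""
    return max(0, n - 3) * MIN_ROUTE_ADVER_SECONDS


bound_for_10 = best_case_convergence_seconds(10)
```

For n = 10 the bound is 7 rounds, or 210 seconds, so even the best case grows linearly with the number of ASes in the mesh.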
48MinRouteAdver
- Minimum interval between successive updates sent to a peer for a given prefix
- Allow for greater efficiency/packing of updates
- Rate throttle
- Applied only to announcements (at least according to BGP RFC)
- Applied on (prefix destination, peer) basis, but implemented on (peer) basis
49MinRouteAdver
- 30 * (N-3) seconds delay due to creation of mutual dependencies. Proof provided that N-3 rounds are necessarily created during bounded BGP MinRouteAdver convergence
- Rounds due to
- Ambiguity in the BGP RFC and lack of receiver loop detection
- Inclusion of BGP withdrawals with MinRouteAdver (in violation of RFC)
50Simulation Results
51Intuition for Delayed BGP Convergence
- There exists a possible ordering of messages such that BGP will explore ALL possible ASPaths of ALL possible lengths
- BGP is O(N!), where N = number of default-free BGP speakers in a complete graph with default policy
- Although seemingly very different protocols, BGP and RIP share very similar convergence behaviors. Major difference:
- RIP explores metrics (1 to N)
- BGP ASPath provides multiple ways to represent metric (path) of length N, or (N-1)!
52Lower Bound on BGP
- If we assume optimal ordering of messages, what is the best we can expect from BGP?
- In practice, BGP timers (MinRouteAdver) provide synchronization and limit possible orderings of messages
- MinRouteAdver timer specifies interval between successive updates sent to a peer for a given prefix
- Useful for bundling updates together
- According to RFC, MinRouteAdver applies only to announcements
- But, interaction of MinRouteAdver and vendor ASPath loop detection implementations introduces artificial delay
53Conclusions
- Internet does not possess effective inter-domain fail-over (15 minutes is a long time for a phone call)
- Majority of BGP convergence delay is due to vendor implementation decisions of MinRouteAdver and loop detection
- In practice, Internet is not a complete graph and the same degree of message re-ordering is unlikely. Our current work:
- What is the impact of ISP policy and topology on BGP convergence?
- Can we improve BGP convergence times?