Routing Behavior: Routing Instability

1
Routing Behavior: Routing Instability

Prof. Gao
ECE697A Fall 2003
Advanced Computer Networks

2
Outline

End-to-End measurement
Large-scale routing behavior in the Internet
Routing dynamics
Delayed Internet routing convergence

3
Motivation

Is the Internet in good shape?
You might seldom have problems sending or receiving email
You might occasionally be unable to access a web page
But what's going on inside the Internet?
Potential problems
Packet loss
Large delay
How do we measure and understand these problems?

4
End-to-End Measurement

What are the pathologies and failures in the Internet?
How stable are Internet routes from the data plane's view?

5
End-to-End Measurement

Methodology
Routing Pathologies
End-to-End Routing Stability
Summary

6
Methodology

Use traceroute to perform end-to-end measurements among 37 Internet sites
Measure the Internet path and round-trip time between these sites
Two data sets
First set (D1): Nov. 8 to Dec. 24, 1994, 27 sites
Mean interval between measurements is 1-2 days
Second set (D2): Nov. 3 to Dec. 21, 1995, 33 sites
60% with a mean interval of 2 hours, 40% with a mean interval of about 2.75 days

7
Traceroute

When a router forwards a packet, it decreases the TTL value by one
It drops the packet once the TTL expires (this avoids packet loops)
It generates an ICMP packet to the source IP to indicate the drop
Traceroute
Uses TTL and ICMP error messages to trace the series of routers a packet traverses from the source node to the destination node

8
How Traceroute Works

Source sends a packet to the destination with a TTL of 1
First router receives the packet, drops it, and returns an ICMP packet
Source receives the ICMP packet and records the RTT and IP address of the first router
Source sends a packet to the destination with a TTL of 2
Cycle continues until either the destination or the maximum number of routers is reached
By default, traceroute sends 3 probes for each TTL value
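
The probing cycle described above can be sketched as a small simulation. This is an illustrative model only, not the real traceroute tool: it sends no packets, the path is a hypothetical list of router names, and the function names forward and traceroute are made up for this sketch.

```python
from typing import List, Tuple


def forward(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe along a fixed path (path[-1] is the destination).

    Every router on the way decrements the TTL; the router where the TTL
    reaches zero drops the probe and answers with a time-exceeded error.
    """
    for router in path[:-1]:
        ttl -= 1
        if ttl == 0:
            return ("time-exceeded", router)
    return ("reached", path[-1])


def traceroute(path: List[str], max_hops: int = 30) -> List[str]:
    """Probe with TTL = 1, 2, ... and record who answers each probe."""
    answers = []
    for ttl in range(1, max_hops + 1):
        kind, address = forward(path, ttl)
        answers.append(address)
        if kind == "reached":
            break  # the destination itself replied, so the trace is complete
    return answers


print(traceroute(["r1", "r2", "r3", "dst"]))
```

Lowering max_hops below the path length makes the trace stop before the destination, which corresponds to the "maximum number of routers" stopping condition.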

9
Example of Traces

From one host in ECS department, UMASS, to yahoo.com
1 know-rt-04-1.gw.umass.edu (128.119.91.254) 0.843 ms 1.053 ms 0.656 ms
2 lgrc-rt-106-8.gw.umass.edu (128.119.2.238) 0.730 ms 0.725 ms 0.925 ms
3 border2-rt-gi6-0-0.gw.umass.edu (128.119.3.113) 63.890 ms 54.213 ms 50.997 ms
4 208.172.51.129 (208.172.51.129) 57.930 ms 56.354 ms 67.693 ms
5 agr4-loopback.NewYork.cw.net (206.24.194.104) 67.362 ms 65.551 ms 66.166 ms
6 acr2-loopback.NewYork.cw.net (206.24.194.62) 160.022 ms 73.918 ms 72.001 ms
7 pos10-2.core2.NewYork1.Level3.net (209.244.160.133) 80.572 ms 84.883 ms 80.389 ms
8 ae0-54.bbr2.NewYork1.level3.net (64.159.17.98) 80.332 ms 57.156 ms 63.283 ms
9 so-3-0-0.mp2.SanJose1.Level3.net (64.159.1.130) 151.922 ms 146.631 ms 156.939 ms
10 gige10-0.ipcolo3.SanJose1.Level3.net (64.159.2.41) 162.707 ms 168.029 ms 182.418 ms
11 unknown.Level3.net (64.152.69.30) 176.864 ms 174.564 ms 164.517 ms
12 alteon3.68.scd.yahoo.com (66.218.68.12) 159.403 ms 149.258 ms 152.188 ms
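
To work with a trace like this one programmatically, each hop can be turned into a record. The following is a minimal sketch that assumes every hop has been joined onto a single line in the form "hop host (ip) rtt ms rtt ms rtt ms". The Hop type and the parse_hop function are hypothetical names, and real traceroute output has more variants (timeouts shown as *, error flags) than this handles.

```python
import re
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    ttl: int
    host: str
    ip: str
    rtts_ms: List[float]


# hop number, host name, address in parentheses, then the RTT samples
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)(.*)$")


def parse_hop(line: str) -> Optional[Hop]:
    """Parse one hop line; return None if the line does not look like a hop."""
    match = HOP_RE.match(line)
    if match is None:
        return None
    rtts = [float(x) for x in re.findall(r"(\d+(?:\.\d+)?)\s*ms", match.group(4))]
    return Hop(int(match.group(1)), match.group(2), match.group(3), rtts)


hop = parse_hop("3 border2-rt-gi6-0-0.gw.umass.edu (128.119.3.113) "
                "63.890 ms 54.213 ms 50.997 ms")
print(hop.ttl, hop.ip, min(hop.rtts_ms))
```

Taking the minimum of the samples per hop is one common way to reduce the effect of queueing noise when comparing hops.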

10
Exceptions of traces

!N Network Unreachable
!H Host Unreachable
!P Protocol Unreachable
!F IP_DF caused drop
!S Source Fail
!X Filter/Net Prohibited
!C Host Prohibited/Prohibited Cutoff
!V Host Precedence
!U Host/Net Unknown
!I Isolated
!T TOS Unreachable
Timeout

11
Is It Good Observation?

As of July 1995, an estimated 6.6 million Internet hosts
As of April 1995, 50,000 networks known to the NSFNET
Not plausibly representative, but gives a considerably richer cross-section of Internet routing behavior

12
Participating Sites

13
Links Traversed

14
Routing Pathologies

Routing abnormality
Loops
Erroneous Routing
Fluttering (rapid-oscillating routing)
Unreachable due to too many hops
Failures
Connectivity altered
Infrastructure failures
Temporary outages

15
Routing Pathologies - Loops

10 loops in D1 (0.13%) and 50 loops in D2 (0.16%)
Duration
Short loops: under 3 hours
Long loops: more than 0.5 day
Two long-lived loops lasted 14-17 hr and 16-32 hr, which shows a lack of good tools for diagnosing network problems

16
Loops

Geographical and temporal correlation
Loops are clustered
Two AlterNet loops in DC and a separate Sprint loop at MAE-East
Suggests that loops may affect nearby routers

17
Erroneous routing

One route from connix to ucl
connix: Caravela Software, Middlefield, CT
ucl: University College London, U.K.
Route went not to London, but instead to Rehovot, Israel
Can't assume where a packet might travel

18
Fluttering

Rapidly oscillating routing
St. Louis has two routes to Amsterdam
Solid-line and dotted-line

19
Fluttering

Pro
Balance network load
Con
Unstable network path
If fluttering only happens in one direction, then the routes are asymmetric
Estimating path characteristics like RTT becomes difficult
If the two routes have different propagation times, then TCP performs worse

20
Connectivity altered

Cases
Routing connectivity was observed earlier
But was lost or altered later
0.16% in D1, 0.44% in D2
Some accompanied by outages
Recovery is bimodal
Some are very quick (100s of ms to seconds)
Maybe new routes are being announced
Some take minutes
Existing routes are lost

21
Other Problems

Infrastructure failure
Classified as such when traceroute gets host unreachable
Outages
Classified as such when traceroute gets a timeout
Too many hops
In some cases, the number of hops is greater than 30
Routing asymmetry
Routes in the two directions traverse different routers

22
Source of Routing Asymmetry

Asymmetric link cost along two directions
Configuration errors and inconsistency
Economics of commercial Internet
Hot potato, cold potato

23
Summary

Internet Routing is not as good as we expect
Observations
Loop
Unreachability
Fluttering or Oscillations
What causes these problems?

24
Potential Issues

End-to-end measurement does not uncover the reasons for routing difficulties
It is hard for end-to-end measurements to reveal what's happening inside the network
Could just ask the network administrators, but that may not scale well
Use batch measurements rather than single requests
Use more sophisticated tools than traceroute

25
Routing Dynamics

End-to-End measurement
Gives us an overview of routing behavior
BGP routing dynamics
BGP update messages
Measure the routing behavior in depth
Take a close look at routing changes
Routing convergence time
Overhead of update messages for convergence
Adaptation to topology changes

26
BGP update messages

OPEN msg
opens the BGP session with the peer (over an established TCP connection) and authenticates the sender
UPDATE msg
advertises new path (or withdraws old)
KEEPALIVE msg
keeps connection alive in absence of UPDATES
serves as ACK to an OPEN request
NOTIFICATION msg
reports errors in previous msg
used to close a connection
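
All four message types share a fixed 19-byte header in BGP-4: a 16-byte marker, a 2-byte length, and a 1-byte type code (1 OPEN, 2 UPDATE, 3 NOTIFICATION, 4 KEEPALIVE). The sketch below decodes only that header. The function name parse_bgp_header is hypothetical, and no validation beyond a length check is attempted.

```python
import struct
from typing import Tuple

BGP_HEADER_LEN = 19
BGP_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def parse_bgp_header(data: bytes) -> Tuple[int, str]:
    """Return (total message length, message type name) from a BGP header."""
    if len(data) < BGP_HEADER_LEN:
        raise ValueError("need at least 19 bytes for a BGP header")
    # Network byte order: 16-byte marker, 16-bit length, 8-bit type.
    _marker, length, type_code = struct.unpack("!16sHB", data[:BGP_HEADER_LEN])
    return length, BGP_TYPES.get(type_code, "UNKNOWN")


# A KEEPALIVE consists of the header alone: all-ones marker, length 19, type 4.
keepalive = b"\xff" * 16 + struct.pack("!HB", 19, 4)
print(parse_bgp_header(keepalive))
```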

27
Convergence time

When a node/link failure event or policy change happens
BGP router detects the change
It propagates update messages to its neighbors
Announcement
Withdrawal
This continues until all routers have selected their best paths and no update message is propagated any more
Convergence time
From the time the failure or change happens
Until all routers reach a stable state and no more update messages are propagated

28
Taxonomy

Use a route server to collect continuous update messages
Check each update by <prefix, peer> tuple
Only consider the AS path and next-hop
WADiff
A different advertisement following a withdrawal message
AADiff
A different advertisement following an advertisement message
WADup
The same advertisement following a withdrawal message
AADup
The same advertisement following an advertisement message
WWDup
A withdrawal following another withdrawal message
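
The taxonomy above can be sketched as a small classifier over the update stream of a single (prefix, peer) pair. This is an illustrative reading of the slide, not the measurement code from the study. The input encoding ("A" or "W" plus a route tuple of AS path and next hop) and the function name classify_updates are assumptions. An advertisement after a withdrawal is compared against the route that was announced before the withdrawal.

```python
from typing import List, Optional, Tuple

Route = Tuple[Tuple[str, ...], str]      # (AS path, next hop)
Update = Tuple[str, Optional[Route]]     # ("A", route) or ("W", None)


def classify_updates(updates: List[Update]) -> List[Optional[str]]:
    """Label each update for one (prefix, peer) pair; None means unlabeled."""
    labels: List[Optional[str]] = []
    prev_kind: Optional[str] = None
    last_route: Optional[Route] = None   # last announced route, kept across withdrawals
    for kind, route in updates:
        label = None
        if kind == "W":
            if prev_kind == "W":
                label = "WWDup"
        elif last_route is not None:
            same = route == last_route
            if prev_kind == "W":
                label = "WADup" if same else "WADiff"
            else:
                label = "AADup" if same else "AADiff"
        if kind == "A":
            last_route = route
        labels.append(label)
        prev_kind = kind
    return labels


r1 = (("AS1", "AS2"), "10.0.0.1")
r2 = (("AS1", "AS3"), "10.0.0.1")
stream = [("A", r1), ("A", r1), ("A", r2), ("W", None),
          ("W", None), ("A", r2), ("W", None), ("A", r1)]
print(classify_updates(stream))
```

The first announcement and the first withdrawal after an announcement are left unlabeled, since they are ordinary routing events rather than one of the five categories.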

29
Classifications

Instability
AADiff
WADiff
WADup
AADup, if other attributes (such as MED or community attributes) are not the same
Pathological instability
WWDup
AADup, if the two updates are exactly the same

30
Data Collection

Data collected: BGP routing messages
Time period: over the course of 9 months starting Jan 96
Where: five of the major U.S. network exchange points
Tool: Unix-based route servers, Multithreaded Routing Toolkit (MRTd)

31
Gross Observations

For 45,000 prefixes and 1500 paths
3 to 6 million updates per day

32
Pathological Behavior

Daily routing updates total on Feb. 1, 1997 at
AADS

33
Observations

Disturbing behaviors
Most of the BGP updates are entirely pathological (WWDup)
A single service provider can have a disproportionate effect on global routing
Causal relationship between the manufacturer of a router and the level of pathological behavior
Routing updates have a regular, specific periodicity of either 30 or 60 seconds
Pathological behavior persists for under five minutes

34
Origins of Pathologies

Stateless BGP
Withdrawals are sent for every explicitly and
implicitly withdrawn prefix
No state is kept on information advertised to peers
Plausible Explanations
Unjittered 30 second interval timer,
self-synchronization
Misconfigured interaction of IGP/BGP protocols
Router vendor software bugs
Unconstrained routing policies
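
Jitter is the usual remedy for the unjittered-timer explanation: each router scales its interval by a random factor so that timers on different routers drift apart instead of firing together. The sketch below uses a factor drawn uniformly from 0.75 to 1.0, which is the default range suggested in the BGP-4 specification. Treat that range, the 30-second base, and the function name jittered_interval as assumptions of this example.

```python
import random


def jittered_interval(base_s: float = 30.0,
                      low: float = 0.75,
                      high: float = 1.0,
                      rng: random.Random = None) -> float:
    """Return one timer interval: the base value times a random factor."""
    rng = rng or random.Random()
    return base_s * rng.uniform(low, high)


# Two routers with independent random streams no longer fire in lockstep.
router_a = random.Random(1)
router_b = random.Random(2)
for _ in range(3):
    print(round(jittered_interval(rng=router_a), 2),
          round(jittered_interval(rng=router_b), 2))
```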

35
Analysis of Instability

Instability as the sum of AADiff, WADiff and
WADup updates

36
Fine-grained Instability Statistics

There is no correlation
between the size of an AS and its proportion of
the instability statistics.
No single AS or prefix consistently dominates the
instability statistics
Instability is evenly distributed across routes

37
Temporal Properties of Instability

Plausible causes for the periodicity
Routing software timers
Self synchronization
Routing loops
CSU handshaking timeouts
Flaw in routing protocol

38
Events

AADup
AADiff
Tup and Tdown
Fluctuation in the reachability of a given prefix
Tup
A currently unreachable prefix is announced as reachable (transitions up)
Tdown
An announced route is withdrawn (transitions down)

39
Analysis of Update Categories

AADup Behavior stems from
Non-transitive attribute filtering
Combination of BGP minimum advertising timer with
stateless BGP

40
Analysis of AADiffs

Note
Low percentage of ASPath AADiffs
Growth in number of origin AADiffs related to
architecture and policy issues
Growth in number of community AADiffs reflects
its recent adoption by many ISPs
Oscillations in MED due to the IBGP mapped MED
policy at two service providers

41
Intuition for Delayed BGP Convergence

There exists a possible ordering of messages such that BGP will explore ALL possible ASPaths of ALL possible lengths
BGP is O(N!), where N is the number of default-free BGP speakers in a complete graph with default policy
Although seemingly very different protocols, BGP and RIP share very similar convergence behaviors
Major difference
RIP explores metrics (1..N)
BGP ASPath provides multiple ways to represent a metric (path) of length N, or (N-1)!

42
Analysis

Labovitz et al. run through a series of observations in order to claim upper and lower bounds on BGP convergence
Assumptions
Full topology of n nodes
Each AS is represented by one router
No processing delays, propagation delays, etc.
Serialized processing

43
Upper Bound Observation 1

For a complete graph of n nodes, there exist O((n-1)!) distinct paths to reach a particular destination
There are (n-1) paths of length 1
There are (n-1)(n-2) paths of length 2
In total there are P_n = (n-1) + (n-1)(n-2) + ... + (n-1)! paths, the last term counting paths of length n-1
This expression is approximated by O((n-1)!)
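
The path count can be checked by brute force for small n. The sketch below takes one reading of the slide's count, namely loop-free paths that start at one fixed node of a complete graph and visit k of the other n-1 nodes, and compares enumeration against the closed-form sum. Both function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def paths_formula(n: int) -> int:
    """(n-1) + (n-1)(n-2) + ... + (n-1)!, one term per path length k."""
    return sum(factorial(n - 1) // factorial(n - 1 - k) for k in range(1, n))


def paths_enumerated(n: int) -> int:
    """Count loop-free paths from node 0 by listing them explicitly."""
    others = range(1, n)
    # A path of length k is an ordered choice of k distinct other nodes.
    return sum(1 for k in range(1, n) for _ in permutations(others, k))


for n in range(2, 7):
    print(n, paths_formula(n), paths_enumerated(n), factorial(n - 1))
```

For n = 4 the sum is 3 + 6 + 6 = 15, against (n-1)! = 6. The total stays within a small constant factor of (n-1)!, which is why the slide approximates it by O((n-1)!).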

44
Upper Bound Observations

Upon any k-th iteration of the algorithm, withdrawal of the current path will result in exploration of all possible O((n-1)!) paths
The number of messages generated is based on the number of neighbors each node has, (n-1)
I.e., (n-1) * O((n-1)!)

45
Lower Bound Observation

In the lower bound case, the MinRouteAdver timer will help us
Each node can only send one message per 30 seconds
In effect, nodes spend some time synchronizing before spewing out lots of useless paths
The result is that only one withdrawal can occur in each timer period, due to loop detection
Only receivers can withdraw routes due to loops

46
Topology Impact

Topology Impact on Convergence
Assume BGP selects the shortest path as the best path
w = Minimum Route Advertisement Interval
Tup = O(d*w), where d is the length of the shortest path in the network
Tdown = O(D*w), where D is the length of the longest loop-free path in the network
Implication
Good news spreads fast
Bad news spreads very slowly
47
Summary

Internet does not possess effective inter-domain fail-over (15 minutes is a long time for a phone call)
Majority of BGP convergence delay is due to MinRouteAdver and loop detection
What is the impact of ISP policy and topology on BGP convergence?
Can we improve BGP convergence times?
