1
Routing Behavior--- Routing Instability
  • Prof. Gao
  • ECE697A Fall 2003
  • Advanced Computer Networks

2
Outline
  • End-to-End measurement
  • Large-scale routing behavior in the Internet
  • Routing dynamics
  • Delayed Internet routing convergence

3
Motivation
  • Is the Internet in good shape?
  • You seldom have problems sending/receiving
    email
  • Yet sometimes web pages are unreachable
  • But what's going on inside the Internet?
  • Potential problems
  • Packet loss
  • Large delay
  • How to measure and understand these problems?

4
End-to-End Measurement
  • What pathologies and failures occur in the
    Internet?
  • How stable are Internet routes from the data
    plane's view?

5
End-to-End Measurement
  • Methodology
  • Routing Pathologies
  • End-to-End Routing Stability
  • Summary

6
Methodology
  • Use traceroute to perform end-to-end measurement
    among 37 Internet sites
  • Measure Internet path and round trip time between
    these sites
  • Two data sets
  • First set, D1: Nov. 8 to Dec. 24, 1994, 27 sites
  • Mean interval between measurements is 1-2 days
  • Second set, D2: Nov. 3 to Dec. 21, 1995, 33 sites
  • 60% with a mean interval of 2 hours, 40% with a
    mean interval of about 2.75 days

7
Traceroute
  • When a router forwards a packet, it decrements
    the TTL value by one
  • The packet is dropped once its TTL expires
    (prevents packets from looping forever)
  • The router sends an ICMP Time Exceeded message to
    the source IP to report the drop
  • Traceroute
  • Uses TTL and ICMP error messages to trace the
    series of routers a packet traverses from the
    source node to the destination node

8
How Traceroute Works
  • Source sends a packet to the destination with a TTL of 1
  • First router receives the packet, drops it and
    returns an ICMP Time Exceeded message
  • Source receives the ICMP message and records the RTT
    and IP address of the first router
  • Source sends a packet to the destination with a TTL of 2
  • Cycle continues until either the destination or the
    maximum number of hops is reached
  • By default, traceroute sends 3 probes for each
    TTL value
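The probe loop described above can be mimicked in a toy model. This is a minimal sketch, not how the real traceroute tool is implemented: it sends no packets, represents the path as a hypothetical list of node names ending at the destination, sends one probe per TTL instead of three, and records no RTTs. Each router decrements the TTL and "answers" when it reaches zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reply:
    node: str                  # who answered the probe
    reached_destination: bool  # True if the destination itself answered


def send_probe(path: List[str], ttl: int) -> Reply:
    """Walk one probe with the given TTL along `path` (routers..., destination).

    Hypothetical toy model: no real packets or ICMP messages are involved."""
    for index, node in enumerate(path):
        if index == len(path) - 1:
            return Reply(node, True)   # destination answers (e.g. port unreachable)
        ttl -= 1                       # each router decrements the TTL
        if ttl == 0:
            return Reply(node, False)  # TTL expired: "ICMP Time Exceeded" from this router
    raise ValueError("path must not be empty")


def traceroute(path: List[str], max_hops: int = 30) -> List[str]:
    """Return the responding node for TTL = 1, 2, ... (one probe per TTL)."""
    hops = []
    for ttl in range(1, max_hops + 1):
        reply = send_probe(path, ttl)
        hops.append(reply.node)
        if reply.reached_destination:
            break
    return hops


print(traceroute(["r1", "r2", "r3", "yahoo.com"]))  # ['r1', 'r2', 'r3', 'yahoo.com']
```

Lowering `max_hops` below the path length reproduces the "maximum number of hops reached" case: the trace stops before the destination answers.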

9
Example of Traces
  • From one host in ECS department, UMASS, to
    yahoo.com
  • 1 know-rt-04-1.gw.umass.edu (128.119.91.254) 0.843 ms 1.053 ms 0.656 ms
  • 2 lgrc-rt-106-8.gw.umass.edu (128.119.2.238) 0.730 ms 0.725 ms 0.925 ms
  • 3 border2-rt-gi6-0-0.gw.umass.edu (128.119.3.113) 63.890 ms 54.213 ms 50.997 ms
  • 4 208.172.51.129 (208.172.51.129) 57.930 ms 56.354 ms 67.693 ms
  • 5 agr4-loopback.NewYork.cw.net (206.24.194.104) 67.362 ms 65.551 ms 66.166 ms
  • 6 acr2-loopback.NewYork.cw.net (206.24.194.62) 160.022 ms 73.918 ms 72.001 ms
  • 7 pos10-2.core2.NewYork1.Level3.net (209.244.160.133) 80.572 ms 84.883 ms 80.389 ms
  • 8 ae0-54.bbr2.NewYork1.level3.net (64.159.17.98) 80.332 ms 57.156 ms 63.283 ms
  • 9 so-3-0-0.mp2.SanJose1.Level3.net (64.159.1.130) 151.922 ms 146.631 ms 156.939 ms
  • 10 gige10-0.ipcolo3.SanJose1.Level3.net (64.159.2.41) 162.707 ms 168.029 ms 182.418 ms
  • 11 unknown.Level3.net (64.152.69.30) 176.864 ms 174.564 ms 164.517 ms
  • 12 alteon3.68.scd.yahoo.com (66.218.68.12) 159.403 ms 149.258 ms 152.188 ms
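Each hop in the trace above follows the pattern: hop number, name, (IPv4 address), then one RTT per probe. A minimal parsing sketch, assuming exactly that layout; the pattern and the `Hop`/`parse_hop` names are hypothetical, and it does not handle `*` timeouts or `!` annotations. Taking the minimum of the probes is one common way to discount transient queueing, such as the 160.022 ms outlier at hop 6.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern for one hop line: "<hop> <name> (<IPv4>) <rtt> ms <rtt> ms ..."
HOP_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)\s+(.*)$")
RTT = re.compile(r"(\d+(?:\.\d+)?) ms")


@dataclass
class Hop:
    hop: int
    name: str
    ip: str
    rtts_ms: List[float]


def parse_hop(line: str) -> Hop:
    """Split one traceroute hop line into its fields."""
    match = HOP_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized hop line: {line!r}")
    hop, name, ip, rest = match.groups()
    return Hop(int(hop), name, ip, [float(x) for x in RTT.findall(rest)])


hop6 = parse_hop("6 acr2-loopback.NewYork.cw.net (206.24.194.62) 160.022 ms 73.918 ms 72.001 ms")
print(hop6.ip, min(hop6.rtts_ms))  # 206.24.194.62 72.001
```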

10
Exceptions in traces
  • !N Network Unreachable
  • !H Host Unreachable
  • !P Protocol Unreachable
  • !F IP_DF caused drop
  • !S Source Fail
  • !X Filter/Net Prohibited
  • !C Host Prohibited/Prohibited Cutoff
  • !V Host Precedence violation
  • !U Host/Net Unknown
  • !I Isolated
  • !T TOS Unreachable
  • * Timeout (no reply)

11
Is It a Representative Sample?
  • As of July 1995, an estimated 6.6 million Internet
    hosts
  • As of April 1995, 50,000 networks known to the
    NSFNET
  • Not plausibly representative, but it gives a
    considerably richer cross-section of Internet
    routing behavior

12
Participating Sites
13
Links Traversed
14
Routing Pathologies
  • Routing abnormality
  • Loops
  • Erroneous Routing
  • Fluttering (rapidly oscillating routes)
  • Unreachable due to too many hops
  • Failures
  • Connectivity altered
  • Infrastructure failures
  • Temporary outages

15
Routing Pathologies - Loops
  • 10 loops in D1 (0.13%), and 50 loops in D2
    (0.16%)
  • Duration
  • Short loops: under 3 hours
  • Long loops: more than half a day
  • Two long-lived loops lasted 14-17 hr and 16-32 hr
  • Shows the lack of good tools for diagnosing
    network problems
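One simple way to spot a loop in a single trace is to look for a router address that appears twice. This sketch assumes that heuristic via a hypothetical `find_loop` helper; it is not necessarily the exact criterion used in the study, and `None` stands for a timed-out hop.

```python
from typing import Dict, List, Optional


def find_loop(hops: List[Optional[str]]) -> Optional[List[str]]:
    """Return the first repeated cycle of router addresses, or None.

    Hypothetical heuristic: any address seen twice in one trace marks a
    forwarding loop. None entries (timeouts) are skipped."""
    first_seen: Dict[str, int] = {}
    for index, addr in enumerate(hops):
        if addr is None:
            continue
        if addr in first_seen:
            # Hops between the two sightings form the cycle (timeouts dropped).
            return [h for h in hops[first_seen[addr]:index] if h is not None]
        first_seen[addr] = index
    return None


print(find_loop(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.2", "10.0.0.3"]))
# ['10.0.0.2', '10.0.0.3']
```

Because a looping packet keeps cycling until its TTL expires, a real trace with a loop typically repeats the cycle until the hop limit, so the first repetition is enough to report it.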

16
Loops
  • Geographical and temporal correlation
  • Loops are clustered
  • Two AlterNet loops in DC and a separate Sprint
    loop at MAE-East
  • Suggests that loops may affect nearby routers

17
Erroneous routing
  • One route from connix to ucl
  • connix: Caravela Software, Middlefield, CT
  • ucl: University College London, U.K.
  • The route went not to London, but instead to
    Rehovot, Israel
  • Can't assume where a packet might travel

18
Fluttering
  • Rapidly oscillating routing
  • St. Louis has two routes to Amsterdam
  • Shown as solid and dotted lines
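Within a single trace, fluttering shows up when the probes sent with the same TTL are answered by different routers. A minimal sketch under that assumption, with a hypothetical `fluttering_hops` helper; note that per-packet load balancing produces the same signature, which is exactly the "balance network load" upside listed next.

```python
from typing import List, Optional


def fluttering_hops(probes_per_ttl: List[List[Optional[str]]]) -> List[int]:
    """Return the TTLs whose probes were answered by more than one router.

    probes_per_ttl[i] holds the responding addresses for TTL i+1
    (None = timed-out probe). Hypothetical helper for illustration."""
    flagged = []
    for ttl, responders in enumerate(probes_per_ttl, start=1):
        distinct = {r for r in responders if r is not None}
        if len(distinct) > 1:
            flagged.append(ttl)
    return flagged


trace = [["a", "a", "a"], ["b", "c", "b"], ["d", "d", None]]
print(fluttering_hops(trace))  # [2]
```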

19
Fluttering
  • Pro
  • Balance network load
  • Con
  • Unstable network path
  • If fluttering happens in only one direction, then
    the routes are asymmetric
  • Estimating path characteristics like RTT becomes
    difficult
  • If the two routes have different propagation times,
    then TCP performs worse

20
Connectivity altered
  • Cases
  • Routing connectivity observed earlier
  • Is lost or altered later
  • 0.16% in D1, 0.44% in D2
  • Some accompanied by outages
  • Recovery is bimodal
  • Some recover very quickly (hundreds of ms to seconds)
  • Possibly because new routes are being announced
  • Others take minutes
  • Existing routes are lost

21
Other Problems
  • Infrastructure Failure
  • Classified as such when traceroute reports host
    unreachable
  • Outages
  • Classified as such when traceroute probes time out
  • Too many hops
  • In some cases, the number of hops is greater than 30
  • Routing Asymmetry
  • Routes in the two directions traverse different routers
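The classification rules above can be written as a small decision function. This is a simplified sketch, not the study's actual procedure: it assumes a hypothetical summary of each trace (the probe results at the last hop tried, whether the destination answered, and how many hops were tried), checks the rules in an order chosen here for illustration, and uses traceroute's default 30-hop limit.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    REACHED = "reached destination"
    INFRASTRUCTURE_FAILURE = "infrastructure failure"
    OUTAGE = "outage"
    TOO_MANY_HOPS = "too many hops"
    UNCLASSIFIED = "unclassified"


def classify_trace(final_probes: List[str], reached_destination: bool,
                   hop_count: int, max_hops: int = 30) -> Outcome:
    """Hypothetical classifier. final_probes: results of the probes at the
    last hop tried, each "*" (timeout), an annotation such as "!H", or "ok"."""
    if reached_destination:
        return Outcome.REACHED
    if "!H" in final_probes:                  # host unreachable
        return Outcome.INFRASTRUCTURE_FAILURE
    if final_probes and all(p == "*" for p in final_probes):
        return Outcome.OUTAGE                 # every probe timed out
    if hop_count >= max_hops:                 # ran out of TTL before arriving
        return Outcome.TOO_MANY_HOPS
    return Outcome.UNCLASSIFIED


print(classify_trace(["*", "*", "*"], False, 9))  # Outcome.OUTAGE
```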

22
Source of Routing Asymmetry
  • Asymmetric link cost along two directions
  • Configuration errors and inconsistency
  • Economics of commercial Internet
  • Hot potato, cold potato

23
Summary
  • Internet Routing is not as good as we expect
  • Observations
  • Loop
  • Unreachability
  • Fluttering or Oscillations
  • What causes these problems?

24
Potential Issues
  • It does not uncover the reasons for routing
    difficulties
  • End-to-end measurements can hardly reveal
    what's happening inside the network
  • Could ask the network administrators, but that
    may not scale well
  • Use batch measurement rather than a single
    request
  • Use more sophisticated tools than traceroute

25
Routing Dynamics
  • End-to-End measurement
  • Gives us an overview of routing behavior
  • BGP routing dynamics
  • BGP update messages
  • Measure the routing behavior in depth
  • Take a close look at routing changes
  • Routing convergence time
  • Overhead of update messages for convergence
  • Adaptation to topology changes

26
BGP update messages
  • OPEN msg
  • opens TCP connection to peer and authenticates
    sender
  • UPDATE msg
  • advertises new path (or withdraws old)
  • KEEPALIVE msg
  • keeps connection alive in absence of UPDATES
  • serves as ACK to an OPEN request
  • NOTIFICATION msg
  • reports errors in previous msg
  • used to close a connection
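All four message types share the same 19-byte header defined in the BGP-4 specification (RFC 4271): a 16-byte marker of all ones, a 2-byte total length, and a 1-byte type code (1 = OPEN, 2 = UPDATE, 3 = NOTIFICATION, 4 = KEEPALIVE). A KEEPALIVE consists of the header alone. The sketch below encodes and checks headers only; it does not parse OPEN, UPDATE, or NOTIFICATION bodies, and the helper names are hypothetical.

```python
import struct
from enum import IntEnum
from typing import Tuple


class BgpType(IntEnum):
    OPEN = 1
    UPDATE = 2
    NOTIFICATION = 3
    KEEPALIVE = 4


MARKER = b"\xff" * 16  # all-ones marker
HEADER_LEN = 19        # 16-byte marker + 2-byte length + 1-byte type
MAX_LEN = 4096         # maximum BGP-4 message size


def encode_message(msg_type: BgpType, body: bytes = b"") -> bytes:
    """Prefix a (hypothetical, already-encoded) body with a BGP header."""
    length = HEADER_LEN + len(body)
    if length > MAX_LEN:
        raise ValueError("message too long")
    return MARKER + struct.pack("!HB", length, msg_type) + body


def decode_header(data: bytes) -> Tuple[BgpType, int]:
    """Validate the header and return (type, total length)."""
    if len(data) < HEADER_LEN:
        raise ValueError("truncated header")
    if data[:16] != MARKER:
        raise ValueError("bad marker")
    length, type_code = struct.unpack("!HB", data[16:HEADER_LEN])
    if not HEADER_LEN <= length <= MAX_LEN:
        raise ValueError("bad length")
    return BgpType(type_code), length  # unknown type codes raise ValueError


keepalive = encode_message(BgpType.KEEPALIVE)
print(len(keepalive), decode_header(keepalive))
```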

27
Convergence time
  • When a node/link failure event or policy change
    happens
  • BGP router detects the change
  • Propagates update messages to neighbors
  • Announcement
  • Withdrawal
  • Until all routers select their best paths and no
    update message is propagated any more
  • Convergence time
  • From the time of failure or change happened
  • To the time when all routers reach stable states and
    no more update messages are propagated
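Measured convergence time can be estimated from a timestamped update log as the gap between the triggering event and the last update of the burst it causes. A minimal sketch, assuming the burst ends once no update arrives for a quiet period; the 120-second `quiet_period` is a hypothetical threshold chosen for illustration, not a value from these slides.

```python
from typing import Iterable


def convergence_time(event_time: float, update_times: Iterable[float],
                     quiet_period: float = 120.0) -> float:
    """Seconds from event_time until the last update of the triggered burst.

    The burst is assumed to end at the first silence longer than
    quiet_period (hypothetical threshold)."""
    last = event_time
    for t in sorted(t for t in update_times if t >= event_time):
        if t - last > quiet_period:
            break  # silence long enough: the network has stabilized
        last = t
    return last - event_time


print(convergence_time(0.0, [1.0, 30.0, 61.0, 95.0, 500.0]))  # 95.0
```

The update at t = 500 s is treated as belonging to a later, unrelated event because it follows more than two minutes of silence.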

28
Taxonomy
  • Use a route server to collect continuous update
    messages
  • Check each update by its <prefix, peer> pair
  • Only consider the AS path and next hop
  • WADiff
  • A withdrawal followed by an advertisement of a
    different path
  • AADiff
  • An advertisement followed by an advertisement of a
    different path
  • WADup
  • A withdrawal followed by re-advertisement of the
    same path
  • AADup
  • An advertisement followed by an identical
    advertisement
  • WWDup
  • A withdrawal followed by an identical withdrawal
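The taxonomy can be applied mechanically by remembering the last update seen for each <prefix, peer> pair. A minimal sketch, assuming a route is compared only by its AS path and next hop (as stated above), a withdrawal is passed as `None`, and a first announcement or a plain withdrawal of a reachable route falls outside the five categories and yields `None`. The class and method names, and the documentation-range prefixes and AS numbers in the usage, are hypothetical.

```python
from typing import Dict, Optional, Tuple

Route = Tuple[Tuple[int, ...], str]  # (AS path, next hop): the only fields compared
Key = Tuple[str, str]                # (prefix, peer)


class UpdateClassifier:
    """Labels each update relative to the previous update for the same key."""

    def __init__(self) -> None:
        self.last_kind: Dict[Key, str] = {}     # "A" (announce) or "W" (withdraw)
        self.last_route: Dict[Key, Route] = {}  # most recently announced route

    def observe(self, prefix: str, peer: str, route: Optional[Route]) -> Optional[str]:
        key = (prefix, peer)
        prev_kind = self.last_kind.get(key)
        prev_route = self.last_route.get(key)
        kind = "W" if route is None else "A"
        label: Optional[str] = None
        if prev_kind == "W" and kind == "W":
            label = "WWDup"
        elif prev_kind == "W" and kind == "A" and prev_route is not None:
            label = "WADup" if route == prev_route else "WADiff"
        elif prev_kind == "A" and kind == "A":
            label = "AADup" if route == prev_route else "AADiff"
        self.last_kind[key] = kind
        if route is not None:
            self.last_route[key] = route
        return label


c = UpdateClassifier()
r1 = ((64496, 64497), "192.0.2.1")
r2 = ((64496, 64498), "192.0.2.1")
stream = [r1, r1, r2, None, None, r2, None, r1]
print([c.observe("198.51.100.0/24", "peer-1", r) for r in stream])
# [None, 'AADup', 'AADiff', None, 'WWDup', 'WADup', None, 'WADiff']
```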

29
Classifications
  • Instability
  • AADiff
  • WADiff
  • WADup
  • AADup, if other attributes (such as MED or
    community attributes) differ
  • Pathological instability
  • WWDup
  • AADup, if the two updates are identical
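The grouping above turns a taxonomy label into a severity class; only AADup needs the full attribute sets of both updates to decide. A minimal sketch with a hypothetical `severity` helper, assuming each update's path attributes are available as a dict.

```python
from typing import Optional

INSTABILITY = {"AADiff", "WADiff", "WADup"}  # always counted as instability


def severity(label: str, prev_attrs: Optional[dict] = None,
             new_attrs: Optional[dict] = None) -> str:
    """Map a taxonomy label to "instability" or "pathological".

    prev_attrs/new_attrs hold every path attribute of the two updates
    (AS path, next hop, MED, communities, ...); only AADup uses them."""
    if label in INSTABILITY:
        return "instability"
    if label == "WWDup":
        return "pathological"
    if label == "AADup":
        # AS path and next hop already match; does anything else differ?
        return "instability" if prev_attrs != new_attrs else "pathological"
    raise ValueError(f"unknown label: {label!r}")


print(severity("AADup", {"med": 10}, {"med": 20}))  # instability
print(severity("AADup", {"med": 10}, {"med": 10}))  # pathological
```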

30
Data Collection
  • Data collected: BGP routing messages
  • Time period: over the course of 9 months starting
    Jan. 1996
  • Where: five of the major U.S. network exchange
    points
  • Tools: Unix-based route servers, Multithreaded
    Routing Toolkit (MRTd)

31
Gross Observations
  • For 45,000 prefixes and 1500 paths
  • 3 to 6 million updates per day

32
Pathological Behavior
  • Daily routing updates total on Feb. 1, 1997 at
    AADS

33
Observations
  • Disturbing behaviors
  • Most BGP updates are entirely pathological
    (WWDup)
  • Disproportionate effect that a single service
    provider can have on global routing
  • Causal relationship between manufacturer of a
    router and level of pathological behavior
  • Routing updates have a regular, specific
    periodicity of either 30 or 60 seconds
  • Pathological behavior persists for under
    five minutes

34
Origins of Pathologies
  • Stateless BGP
  • Withdrawals are sent for every explicitly and
    implicitly withdrawn prefix
  • No state kept on information advertised to peers
  • Plausible Explanations
  • Unjittered 30-second interval timer, causing
    self-synchronization
  • Misconfigured interaction of IGP/BGP protocols
  • Router vendor software bugs
  • Unconstrained routing policies
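One standard remedy for an unjittered timer is to randomize it; the BGP-4 specification (RFC 4271) suggests multiplying timer values by a random factor uniformly distributed between 0.75 and 1.0. The sketch below only shows that jitter keeps routers that start in lockstep from firing at identical instants; it does not model the coupling between routers that causes self-synchronization in the first place. The function and parameter names are hypothetical.

```python
import random
from typing import List, Optional


def fire_times(start_times: List[float], base: float = 30.0, rounds: int = 10,
               jitter: bool = True, rng: Optional[random.Random] = None) -> List[float]:
    """Time of each router's `rounds`-th timer firing (hypothetical model).

    With jitter, every interval is base * U(0.75, 1.0); without it, exactly base."""
    rng = rng or random.Random()
    times = list(start_times)
    for _ in range(rounds):
        times = [t + (base * rng.uniform(0.75, 1.0) if jitter else base) for t in times]
    return times


print(fire_times([0.0, 0.0, 0.0], jitter=False))         # [300.0, 300.0, 300.0]
print(fire_times([0.0, 0.0, 0.0], rng=random.Random(1)))  # three distinct times in [225, 300]
```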

35
Analysis of Instability
  • Instability measured as the sum of AADiff, WADiff
    and WADup updates

36
Fine-grained Instability Statistics
  • There is no correlation between the size of an AS
    and its proportion of the instability statistics.
  • No single AS or prefix consistently dominates the
    instability statistics
  • Instability is evenly distributed across routes

37
Temporal Properties of Instability
  • Plausible causes for the periodicity
  • Routing software timers
  • Self-synchronization
  • Routing loops
  • CSU handshaking timeouts
  • Flaw in routing protocol

38
Events
  • AADup
  • AADiff
  • Tup and Tdown
  • Fluctuation in the reachability for a given
    prefix
  • Tup
  • A currently unreachable prefix is announced as
    reachable (transitions up)
  • Tdown
  • A previously announced route is withdrawn
    (transitions down)
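Tup and Tdown events can be extracted from the update stream of a single prefix by tracking whether it is currently reachable. A minimal sketch, assuming a hypothetical `(timestamp, kind)` input where kind is "A" (announcement) or "W" (withdrawal); announcements while reachable and withdrawals while unreachable leave reachability unchanged and produce no event.

```python
from typing import Iterable, List, Tuple


def reachability_events(updates: Iterable[Tuple[float, str]],
                        initially_reachable: bool = False) -> List[Tuple[float, str]]:
    """Return (timestamp, "Tup" | "Tdown") for each reachability transition."""
    reachable = initially_reachable
    events = []
    for ts, kind in updates:
        if kind not in ("A", "W"):
            raise ValueError(f"unknown update kind: {kind!r}")
        if kind == "A" and not reachable:
            events.append((ts, "Tup"))
            reachable = True
        elif kind == "W" and reachable:
            events.append((ts, "Tdown"))
            reachable = False
        # "A" while reachable (AADiff/AADup) and "W" while unreachable (WWDup)
        # do not change reachability, so they produce no event.
    return events


log = [(0, "A"), (5, "A"), (10, "W"), (12, "W"), (20, "A")]
print(reachability_events(log))  # [(0, 'Tup'), (10, 'Tdown'), (20, 'Tup')]
```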

39
Analysis of Update Categories
  • AADup behavior stems from
  • Non-transitive attribute filtering
  • Combination of BGP minimum advertising timer with
    stateless BGP

40
Analysis of AADiffs
  • Note
  • Low percentage of AS-path AADiffs
  • Growth in number of origin AADiffs related to
    architecture and policy issues
  • Growth in number of community AADiffs reflects
    its recent adoption by many ISPs
  • Oscillations in MED due to the IBGP mapped MED
    policy at two service providers

41
Intuition for Delayed BGP Convergence
  • There exists possible ordering of messages such
    that BGP will explore ALL possible ASPaths of ALL
    possible lengths
  • BGP is O(N!), where N is the number of default-free
    BGP speakers in a complete graph with default policy.
  • Although seemingly very different protocols, BGP
    and RIP share very similar convergence behaviors.
    Major difference:
  • RIP explores metrics (1 to N)
  • BGP ASPath provides multiple ways to represent a
    metric (path) of length N, or (N-1)!

42
Analysis
  • Labovitz et al. work through a series of
    observations in order to derive upper and lower
    bounds on BGP convergence.
  • Assumptions
  • Fully connected topology of n nodes
  • Each AS is represented by one router
  • No processing delays, propagation delays, etc.
  • Serialized processing

43
Upper Bound Observation 1
  • For a complete graph of n nodes, there exist
    O((n-1)!) distinct paths to reach a particular
    destination.
  • There are (n-1) paths of length 1
  • There are (n-1)(n-2) paths of length 2
  • Summing over all path lengths gives
    $P_n = (n-1) + (n-1)(n-2) + \cdots + (n-1)!$ paths
  • This expression is approximated by O((n-1)!)

44
Upper Bound Observations
  • Upon any k-th iteration of the algorithm,
    withdrawal of the current path will result in
    exploration of all possible O((n-1)!) paths.
  • The number of messages generated also scales with
    the number of neighbors each node has, (n-1).
  • I.e., (n-1) × O((n-1)!) messages
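The counting on the two upper-bound slides can be checked numerically. A minimal sketch, assuming (consistent with the counts listed) that a path of length k is an ordered choice of k distinct nodes out of the n-1 nodes other than the destination, for k = 1 to n-1. The ratio P_n/(n-1)! equals the sum of 1/j! for j = 0 to n-2, so it stays below e ≈ 2.718, which is why the total is O((n-1)!). The helper names are hypothetical.

```python
from math import factorial, perm


def path_count(n: int) -> int:
    """P_n = (n-1) + (n-1)(n-2) + ... + (n-1)! for a complete graph of n nodes."""
    return sum(perm(n - 1, k) for k in range(1, n))


def message_bound(n: int) -> int:
    """Rough upper-bound message count: (n-1) neighbors times P_n paths."""
    return (n - 1) * path_count(n)


for n in range(3, 8):
    print(n, path_count(n), round(path_count(n) / factorial(n - 1), 3), message_bound(n))
```

For n = 4 this gives P_4 = 3 + 6 + 6 = 15, a ratio of 2.5 against 3! = 6, and 45 messages.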

45
Lower Bound Observation
  • In the lower bound case, the MinRouteAdvert
    timer will help us.
  • Each node can only send one message per 30
    seconds.
  • In effect, nodes spend some time synchronizing
    before spewing out lots of useless paths.
  • As a result, only one withdrawal can happen per
    timer period due to loop detection
  • Only receivers can withdraw paths due to loops.

46
Topology Impact
  • Topology Impact on Convergence
  • Assume BGP selects the shortest path as the best
    path.
  • w: Minimum Route Advertisement Interval (MRAI)
  • Tup = O(d·w)
  • where d is the length of the shortest path in the
    network.
  • Tdown = O(D·w)
  • where D is the length of the longest loop-free
    path in the network.
  • Implication
  • Good news spreads fast
  • Bad news spreads very slowly
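To make the asymptotic bounds concrete, the sketch below plugs a small topology into them. It assumes w = 30 s (the per-node advertisement interval from the lower-bound slide), takes d as the largest shortest-path hop count from any node to the destination and D as the longest loop-free path ending at the destination, and treats the constants hidden by O(·) as 1, so the numbers are illustrations rather than predictions. Function names are hypothetical, and the longest-path search is exponential, so it only suits toy graphs.

```python
from collections import deque
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]  # undirected adjacency lists


def shortest_hops(graph: Graph, dest: int) -> Dict[int, int]:
    """BFS hop distance from every reachable node to dest."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def longest_loop_free(graph: Graph, dest: int) -> int:
    """Length of the longest simple path ending at dest (exhaustive DFS)."""
    best = 0

    def dfs(node: int, visited: set, length: int) -> None:
        nonlocal best
        best = max(best, length)
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                dfs(nxt, visited, length + 1)
                visited.remove(nxt)

    dfs(dest, {dest}, 0)
    return best


def convergence_estimates(graph: Graph, dest: int, w: float = 30.0) -> Tuple[float, float]:
    """(Tup, Tdown) estimates with all hidden constants assumed to be 1."""
    d = max(shortest_hops(graph, dest).values())
    D = longest_loop_free(graph, dest)
    return d * w, D * w


clique = {i: [j for j in range(4) if j != i] for i in range(4)}
print(convergence_estimates(clique, dest=0))  # (30.0, 90.0)
```

In the 4-node clique every node is one hop from the destination, yet a loop-free path of 3 hops exists, so the estimate for bad news is three times that for good news.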

47
Summary
  • The Internet does not possess effective
    inter-domain fail-over (15 minutes is a long time
    for a phone call)
  • The majority of BGP convergence delay is due to
    MinRouteAdvert and loop detection
  • What is the impact of ISP policy and topology on
    BGP convergence?
  • Can we improve BGP convergence times?