Monitoring Traffic Exits in a Multihomed I2 Environment - PowerPoint PPT Presentation

1 / 47
About This Presentation
Title:

Monitoring Traffic Exits in a Multihomed I2 Environment

Description:

7 pos1-0.inr-000-eva.Berkeley.EDU (128.32.0.89) 14.907 ms 14.662 ms 14.483 ms ... 'Positive' changes are green, 'negative' changes are red, neutral changes are black ... – PowerPoint PPT presentation

Number of Views:66
Avg rating:3.0/5.0
Slides: 48
Provided by: academics4
Learn more at: http://pages.uoregon.edu
Category:

less

Transcript and Presenter's Notes

Title: Monitoring Traffic Exits in a Multihomed I2 Environment


1
Monitoring Traffic Exits in aMultihomed I2
Environment
  • Joe St Sauver joe_at_oregon.uoregon.edu
  • UO Computing Center
  • NLANR/I2 Joint Techs, MinneapolisMay 18th, 2000

2
Today's Message in a Nutshell...
  • What You should pay attention to how application
    traffic is exiting your campus.
  • Why In an I2 multihomed environment there are
    important differences between the network exits
    you've got available.
  • How We have a simple and freely available tool
    written in perl which you can use to monitor your
    application exits

3
Definition Multihomed
  • By saying that a site is "multihomed" we mean
    that the site is"connected to two or more wide
    area networks, such as the commodity Internet and
    a high performance limited access network such as
    Abilene"

4
ALL I2 Schools Are Multihomed
  • ALL I2 schools are multihomed, since all are
    required to have both commodity AND high
    performance connectivity
  • In fact, many I2 schools have-- multiple
    commodity transit providers, and-- multiple high
    performance exits, and-- local or national
    peerage, and-- local intraconsortia traffic

5
Many people don't know/care how their traffic
flows
  • "It just works"
  • "I couldn't change how my traffic is routed even
    if I didn't like how it is exiting"
  • "I wouldn't do anything different even if I could
    get my traffic did go a different way"
  • "I can't tell the difference between ltXgt and ltYgt
    connectivity, anyhow."

6
Should They Care? Yes!
  • Users need to be helped to understand that they
    should care about how traffic exits. (1) In a
    multihomed environment, all exits are not
    alike. (2) Users can adjust their applications
    and their collaborations to make
    efficient use of available exits, if
    they can determine what's going on, and
    if they are motivated.

7
Some crucial differences among various exits you
may have.
  • Different network exits may have (1) different
    available capacity (a primary motivator in
    our case) (2) different cost structures apply
    (also important to us, as at most
    places) (3) different latency and loss
    properties (4) different wide area reachability

8
(1) Available capacity...
  • Classic Example I2 school with congested
    fractional DS3 commodity Internet transit
    connectivity, but lightly loaded OC3
    connectivity via Abilene. A huge difference in
    available capacity exists between those two
    exits.

9
(2) Different Costs Structures for Different
Connections
  • Oregon, like most places, has different cost
    structures for different sorts of connectivity.
  • For example, we pay 1K/Mbps/month for contracted
    inbound commodity Internet transit capacity
    (excluding some categories of traffic)

10
Different Costs Structures for Different
Connections (cont.)
  • While in the case of peerage or HPC connectivity,
    usage may have no direct incremental cost (except
    at step boundries where capacity would need to be
    added), and in fact, there may be an effective
    "negative incremental cost" associated with
    shifting inbound traffic off of commodity transit
    links and over to HPC connectivity.

11
Inbound Traffic? I Thought This Was About
Monitoring Exits?
  • We can't easily/systematically do reverse
    traceroutes to watch our commodity traffic
    inbound (even though that's what we pay for)
  • We'll assume that if we've got outbound traffic
    going to you, you've probably got inbound traffic
    going to us
  • We'll watch what we can and hope that the world
    is symmetric (hah hah, I know, I know)

12
(3) Latency and Loss
  • Different exits may have dramatically different
    latency and loss characteristics.
  • Sustained throughput at a level which might be
    perfectly reasonable over a low latency and low
    loss circuit might be completely unrealistic to
    attempt in the face of higher latency or higher
    loss

13
(4) Reachability
  • In some cases, a change of exit may result in a
    material change in reachability. For example,
    consider a block of addresses that are advertised
    ONLY to high performance connectivity partners
    (the intent being that if the HPC connectivity
    falls over, the load will not flop over onto
    commodity transit circuits and swamp those exits).

14
How Does All This Relate to Network Engineering?
  • "Okay, sure it is true that there's a difference
    between the various exits that might be
    available. "But how does this relate to network
    engineering tasks? What knobs do I need to tweak
    on my Cisco?"

15
Focus is on the user/applications, not on
tweaking the network
  • We assume that the network configuration is "a
    given" and is NOT changeable -- e.g., network
    engineers are notcannotwill notshould not
    tweak routes for load balancing or other traffic
    management.
  • We DO assume that end users can make application
    level decisions about what they do, or who they
    collaborate with.

16
plus user network monitoring
  • Assuming we really need and are really using our
    I2 connectivity, users who are running crucial
    applications over I2 need to know when it isn't
    there for a given collaboration, if only so they
    can call up and complain. -)

17
Why not just look at periodic route snapshots?
  • For example, why not just checkhttp//www.vbns.n
    et8080/route/Allsnap.rt.html
  • Answer too much dataAnswer HPC connectivity
    onlyAnswer only updates every half hourAnswer
    too hard to see what's changedAnswer ...

18
Oregon's Particular Application
  • We run news servers connecting with hundreds of
    Usenet News peers located all around the world,
    some normally I2 connected, some connected via
    commodity transit, some connected via peerage.
  • News traffic must not be allowed to interfere
    with other network traffic gtwe need to closely
    monitor and manage traffic flowing out these
    multiple exits

19
Sample Decisions Made In Part Based on How
Traffic Exits
  • Should we peer with this site at all?
  • What should we feed them?
  • What should we have them feed us?
  • If we will be feeding this site over a high
    performance connection, do we want to insure that
    that HPC traffic doesn't flop onto our commodity
    transit links in case HPC connectivity falls over?

20
Usenet is NOT a unique case...
  • Usenet News is not unique when it comes to
    distributing data to many different sites over
    potentially diverse exits. Examples include --
    Unidata's LDM weather data network -- web cache
    hierarchies-- IRC server traffic-- any
    coordinated server-to-server traffic

21
Stage 1 Exit Selection Analysis "I know, let's
use traceroute..."
  • Inititially, we did what everyone else does when
    we wanted to figure our where traffic was
    exiting, and just said traceroute
    foo.bar.eduwith our traffic involving five main
    exits...

22
Sample traceroute 1
  • traceroute www.berkeley.edutraceroute to
    amber.Berkeley.EDU (128.32.25.12), 30 hops max,
    40 byte packets 1 cisco3-gw.uoregon.edu
    (128.223.142.1) 0.919 ms 0.740 ms 0.432 ms 2
    cisco7-gw.uoregon.edu (128.223.2.7) 0.491 ms
    0.421 ms 0.414 ms 3 ogig-uo.oregon-gigapop.net
    (198.32.163.1) 0.511 ms 0.578 ms 0.574 ms 4
    sac.oregon-gigapop.net (198.32.163.10) 10.123 ms
    10.207 ms 10.148 ms 5 198.32.249.61
    (198.32.249.61) 13.597 ms 13.307 ms 13.551 ms
    6 BERK--SUNV.POS.calren2.net (198.32.249.13)
    15.053 ms 14.485 ms 14.511 ms 7
    pos1-0.inr-000-eva.Berkeley.EDU (128.32.0.89)
    14.907 ms 14.662 ms 14.483 ms 8
    pos5-0-0.inr-001-eva.Berkeley.EDU (128.32.0.66)
    44.428 ms 14.964 ms 15.362 ms 9
    fast1-0-0.inr-007-eva.Berkeley.EDU (128.32.0.7)
    15.910 ms 16.113 ms 16.807 ms10
    f8-0.inr-100-eva.Berkeley.EDU (128.32.235.100)
    19.274 ms 20.805 ms 15.958 ms11
    amber.Berkeley.EDU (128.32.25.12) 17.755 ms
    16.076 ms 16.479 ms
  • But wait, there's more we also have HPC
    connectivity via Denver...

23
Sample traceroute 2
  • traceroute www.bc.nettraceroute to www.bc.net
    (142.231.112.3), 30 hops max, 40 byte packets 1
    cisco3-gw.uoregon.edu (128.223.142.1) 0.590 ms
    0.644 ms 0.572 ms 2 cisco7-gw.uoregon.edu
    (128.223.2.7) 0.526 ms 0.579 ms 0.448 ms 3
    ogig-uo.oregon-gigapop.net (198.32.163.1) 0.850
    ms 0.520 ms 0.509 ms 4 den.oregon-gigapop.net
    (198.32.163.14) 32.563 ms 32.279 ms 32.403 ms
    5 kscy-denv.abilene.ucaid.edu (198.32.8.14)
    42.759 ms 42.956 ms 43.214 ms 6
    ipls-kscy.abilene.ucaid.edu (198.32.8.6) 51.965
    ms 51.841 ms 52.306 ms 7 205.189.32.98
    (205.189.32.98) 56.422 ms 56.242 ms 57.130 ms
    8 win1.canet3.net (205.189.32.141) 73.884 ms
    75.850 ms 73.620 ms 9 reg1.canet3.net
    (205.189.32.137) 80.920 ms 80.823 ms 80.505
    ms10 cal1.canet3.net (205.189.32.133) 89.457
    ms 89.726 ms 89.757 ms11 van1.canet3.net
    (205.189.32.129) 104.529 ms 102.115 ms 102.287
    ms12 c3-bcngig01.canet3.net (205.189.32.193)
    102.085 ms 102.504 ms 102.459 ms13
    policy-canet2.hc.BC.net (207.23.240.241) 102.771
    ms 103.833 ms 103.055 ms14 cyclone.BC.net
    (142.231.112.2) 104.819 ms 104.113 ms 106.695
    ms
  • Oregon to BC, via Indianapolis Guess we still
    need a west coast HPC peering point -

24
Sample traceroute 3
  • traceroute leao.ebonet.nettraceroute to
    leoa.ebonet.net (194.133.121.8), 30 hops max, 40
    byte packets 1 cisco3-gw.uoregon.edu
    (128.223.142.1) 0.757 ms 0.802 ms 0.592 ms 2
    cisco7-gw.uoregon.edu (128.223.2.7) 0.813 ms
    0.626 ms 0.521 ms 3 eugene-hub.nero.net
    (207.98.66.11) 3.230 ms 1.832 ms 2.019 ms 4
    eugene-isp.nero.net (207.98.64.41) 2.981 ms
    2.588 ms 2.044 ms 5 ptld-isp.nero.net
    (207.98.64.2) 6.059 ms 6.347 ms 6.340 ms 6
    906.Hssi8-0-0.GW1.POR2.ALTER.NET (157.130.176.33)
    6.780 ms 18.495 ms Serial4-0.GW1.POR2.ALTE
    R.NET (157.130.179.77) 6.313 ms 7
    121.ATM3-0.XR2.SEA4.ALTER.NET (146.188.200.190)
    18.591 ms 21.871 ms 19.717 ms 8
    292.ATM2-0.TR2.SEA1.ALTER.NET (146.188.200.218)
    18.572 ms 12.785 ms 12.776 ms 9
    110.ATM7-0.TR2.EWR1.ALTER.NET (146.188.137.77)
    68.878 ms 69.940 ms 81.317 ms10
    196.ATM7-0.XR2.EWR1.ALTER.NET (146.188.176.85)
    86.459 ms 71.227 ms 100.778 ms11
    192.ATM9-0-0.GW1.NYC2.ALTER.NET (146.188.177.61)
    93.644 ms 73.741 ms 74.459 ms12
    AngolaTel-gw.customer.ALTER.NET (157.130.4.242)
    618.036 ms 641.660 ms 636.724 ms13
    194.133.121.1 (194.133.121.1) 628.084 ms
    626.335 ms 638.627 ms14 194.133.121.8
    (194.133.121.8) 640.332 ms 637.330 ms 667.900
    ms
  • Dual fractional DS3smultiple gateway addresses
    we need to watch out for...

25
Sample traceroute 4
  • traceroute www.rcn.comtraceroute to
    www.rcn.com (207.172.3.77), 30 hops max, 40 byte
    packets 1 cisco3-gw.uoregon.edu (128.223.142.1)
    0.609 ms 0.473 ms 0.480 ms 2
    cisco7-gw.uoregon.edu (128.223.2.7) 0.595 ms
    0.498 ms 0.544 ms 3 eugene-hub.nero.net
    (207.98.66.11) 1.643 ms 1.390 ms 1.671 ms 4
    eugene-isp.nero.net (207.98.64.41) 1.915 ms
    1.827 ms 1.721 ms 5 xcore2-serial0-1-0-0.SanFra
    ncisco.cw.net (204.70.32.5) 12.065 ms 12.759 ms
    12.053 ms 6 corerouter2.SanFrancisco.cw.net
    (204.70.9.132) 12.410 ms 17.748 ms 19.462 ms
    7 acr2-loopback.SanFranciscosfd.cw.net
    (206.24.210.62) 14.585 ms 19.097 ms 14.700 ms
    8 aar1-loopback.SanFranciscosfd.cw.net
    (206.24.210.2) 18.101 ms 14.707 ms 15.468 ms
    9 rcn.SanFranciscosfd.cw.net (206.24.216.190)
    15.914 ms 17.202 ms 17.063 ms10 10.65.101.3
    (10.65.101.3) 16.201 ms 15.708 ms 17.053 ms11
    pos5-0-0.core1.lnh.md.rcn.net (207.172.19.253)
    97.268 ms 95.429 ms 94.204 ms12
    poet6-0-1.core1.spg.va.rcn.net (207.172.19.234)
    96.311 ms poet5-0-0.core1.spg.va.rcn.net
    (207.172.19.246) 102.148 ms
    poet6-0-1.core1.spg.va.rcn.net (207.172.19.234)
    100.850 ms13 fe11-1-0.core4.spg.va.rcn.net
    (207.172.0.135) 96.572 ms 98.770 ms 96.906
    ms14 corporate-2.web.rcn.net (207.172.3.77)
    98.016 ms 97.085 ms 98.480 ms

26
Sample traceroute 5
  • traceroute www.digilink.nettraceroute to
    www.digilink.net (205.147.0.100), 30 hops max, 40
    byte packets 1 cisco3-gw.uoregon.edu
    (128.223.142.1) 0.777 ms 0.850 ms 0.398 ms 2
    cisco7-gw.uoregon.edu (128.223.2.7) 0.515 ms
    0.536 ms 0.563 ms 3 verio-gw.oregon-ix.net
    (198.32.162.6) 1.373 ms 1.395 ms 1.063 ms 4
    d3-0-1-0.r01.ptldor01.us.ra.verio.net
    (206.163.3.133) 4.023 ms 4.379 ms 4.037 ms 5
    ge-5-0.r00.ptldor01.us.bb.verio.net
    (129.250.31.222) 12.074 ms 24.473 ms 25.085
    ms 6 p1-1-1-3.r01.ptldor01.us.bb.verio.net
    (129.250.2.38) 4.977 ms 18.582 ms 5.035 ms 7
    p4-4-2.r02.sttlwa01.us.bb.verio.net
    (129.250.3.37) 8.746 ms 9.627 ms 9.983 ms 8
    p4-1-0-0.r03.sttlwa01.us.bb.verio.net
    (129.250.2.230) 8.715 ms 9.062 ms 10.871 ms 9
    sea3.pao6.verio.net (129.250.3.89) 27.449 ms
    27.315 ms 26.929 ms10 p4-1-0-0.r00.lsanca01.us.
    bb.verio.net (129.250.2.114) 35.008 ms 35.567
    ms 34.807 ms11 p4-0-0.r01.lsanca01.us.bb.verio.
    net (129.250.2.206) 34.857 ms 34.972 ms 34.764
    ms12 fa-6-0-0.a01.lsanca01.us.ra.verio.net
    (129.250.29.141) 124.100 ms 36.806 ms 34.957
    ms13 uscisi-pl.customer.ni.net (209.189.66.66)
    37.148 ms 36.604 ms 35.889 ms14 130.152.128.1
    (130.152.128.1) 37.993 ms 37.343 ms 37.400
    ms15 border1-100m-ln.digilink.net
    (207.151.118.22) 39.490 ms 38.334 ms 37.764
    ms16 la1.digilink.net (205.147.0.100) 38.755
    ms 37.997 ms 39.237 ms

27
Couple of Slight Problem(s)...
  • Manually tracerouting to several hundred hosts
    gets to be really tedious, really fast
  • Traffic could (and did) shift w/o notice
    (sometimes dramatically) based on MRTG graphs,
    yet we'd only end up doing traceroutes on rare
    occaisions, often missing interesting phenomena

28
And once we were done tracerouting all over the
place...
  • We still needed to consolidate that information
    into a useful format (other than jotted notes on
    the back of recycled memos)
  • We learned that we did badly when it came to
    noticing changes in exit behavior for one or two
    particular hosts out of hundreds

29
Stage 2 Exit Selection Analysis "Let's write a
filter!!!"
  • Simple repetitive task gt create a filter
  • Input list of FQDNs or dotted quads whose exit
    selection we want to monitor
  • Output same as input list, but with exit info
    prepended to each entry
  • Approach do a traceroute, grep for the gateways,
    tag and print the exits accordingly. Write in
    perl. Sort by exit.

30
Stage 2 Philosophy
  • Like Unix itself, build small simple tools that
    can be chained together to dolarger complex tasks
  • "If we can just see where the traffic is going,
    that will be good enough."

31
Stage 2 code...
  • !/usr/local/bin/perlCmd "/usr/etc/traceroute"
    open (SAVERR,"gtSTDERR")open
    (STDERR,"gttemp.tmp")while (Host ltgt) chop
    Host open(IN,"Cmd Host ") while(line
    ltINgt) if (line m/198.32.163.10/)
    print "OGIG-S Host\n" elsif (line
    m/198.32.163.14/) print "OGIG-D Host\n"

32
Stage 2 code (cont.)
  • continuedelsif (line m/157.130.176.33/)
    print "UUNet Host\n" elsif (line
    m/157.130.179.7/) print "UUNet Host\n"
    elsif (line m/204.70.32.5/ print
    "CWIX Host\n" elsif (line
    m/198.32.162.6/ print "Verio Host\n"

33
Sample Stage 2 Run...
  • cat hosts.foo ./chexitVerio
    newsfeed.digilink.netCWIX
    nwes.ciateq.mxetc.and/or pipe it through
    sort.

34
Lessons from Stage 2...
  • Even automated, it can take a long time to
    traceroute all the way to several hundred sites,
    particularly if you go a full thirty hops (e.g.,
    hit a firewall and loop) added -m 6 option to
    the traceroute (max hops of 6)
  • Default time-to-wait is too long (use -w 2)
  • If doing this on an automated basis, there's no
    need to resolve addresses on the traces(add -n
    option)

35
Lessons from Stage 2 (cont.)
  • No need to send three packets one will usually
    be enough (and will be faster than sending three)
    -- add the option -q 1
  • Still can't spot what's changed between runs
  • Other folks want to see the output need some way
    to share this data with others
  • Phenomena existed that we hadn't expected

36
Unexpected phenomena included...
  • We learned that some peers oscillated routinely
    between commodity providers
  • Needed to think about what to do when sites
    became completely unreachable
  • Decided that maybe we should draw an inference
    when ALL hosts which had been using a given exit
    would suddenly stopped doing so

37
Stage 3 Exit Selection AnalysisThe Great
Webification...
  • Abiel Reinhart joined us as an intern, and we
    needed a project for him to work on.

38
Goals of Stage 3.
  • We wanted the perl filter converted into a web
    cgi-bin with two main deliverables (1)
    snapshot web page showing current mapping of
    specified peers to exits, and (2) change file
    web page showing what peers had changed exits

39
Sample snapshot.html output(see
http//darkwing.uoregon.edu/joe/snapshot.html)
  • This report generated at 163000 on
    4/9/100CWIX (68)
    feederbox.skynet.be
    news-in.tvd.be su-news-hub1.bbnplanet
    .com etc.
    newnews.mikom.csir.co.zaOGIG-D (62)
    poseidon.acadiau.ca
    newsflash.concordia.ca
    cyclone-i2.mbnet.mb.ca
    etc.
    news01.sunet.seetc.

40
Interperting snapshot.html
  • Each exit has its own section
  • Total number of peers using that exit isreported
    in parentheses
  • Peers are alphabetized by reversed FQDNwithin
    section
  • Updated every fifteen minutes (via cron)

41
Sample changes.html output(see
http//darkwing.uoregon.edu/joe/changes.html)
  • 151500 sonofmaze.dpo.uab.edu UUNet --gt
    OGIG-D green 144500 feeder.nmix.net
    CWIX --gt UUNet black etc. 124500
    news.maxwell.syr.edu CWIX --gt OGIG-D
    green etc. 063000 Note OGIG-D is now
    being used. green 063000 Note
    OGIG-S is now being used. green
    etc 063000 hardy.tc.umn.edu UUNet
    --gt OGIG-D green 063000 news-ext.gatech.edu
    CWIX --gt OGIG-D green 063000
    news-feeder.sdsmt.edu Verio --gt OGIG-D
    black 063000 news.alaska.edu Verio
    --gt OGIG-S black etc 061500 Warning
    OGIG-D is not being used. red
    061500 Warning OGIG-S is not being used.
    red 061500 hardy.tc.umn.edu
    OGIG-D --gt UUNet red 061500
    news-ext.gatech.edu OGIG-D --gt CWIX
    red 061500 news-feeder.sdsmt.edu OGIG-D
    --gt Verio black 061500 news.alaska.edu
    OGIG-S --gt Verio black

42
Interpreting changes.html
  • Entries are in reverse chronological order
  • "Positive" changes are green, "negative" changes
    are red, neutral changes are black
  • Spaces separate clumps of entries (by time)
  • Example shows examples of actual routing changes
    from 4/9 to actual news peers
  • Also captured a 615-630 OGIG maintenance window
    (Greg loading 120-10.3.S1 on the GSR...)

43
Lessons from Stage 3
  • Abiel did a great job. -)
  • Need to set up a policy to handle growth of the
    changes page (we set a max file size of 1000
    lines)
  • Running the code from a subnet other than the one
    that you're really interested in is okay, but not
    ideal (misses local breakage, and any applicable
    locally applied static routes for hosts with
    multiple interfaces)

44
Lessons from Stage 3 (cont.)
  • Selection of hosts to trace to may be crucial
    (and should be the actual host(s) you really care
    about, not "straw dogs" such as www.ltdomaingt.edu)
    since at least some I2 schools are selectively
    advertising just portions of their network blocks

45
Lessons from Stage 3 (cont.)
  • Assumption of symmetric routes is a bad one to
    make. Practically speaking, this means that if
    you nail up I2-only routes (based on exit data or
    other info), you do so at your peril. This
    problem may be relatively widespread see, for
    example, Hank Nussbacher's paperhttp//www.inte
    rnet-2.org.il/i2-asymmetry/index.html

46
Lessons from Stage 3 (cont.)
  • General case you can build useful network
    monitoring tools by automating simple
    interactively accessible building blocks
  • Next project likely to be focused on generating
    daily SNMP-ish input octets/output octet
    summaries for all connectors of interest via the
    Abilene core node router proxy web interface

47
Want to try monitoring your exits?
  • Send email to joe_at_oregon.uoregon.edu and we'll be
    glad to share our code with you.
  • Code includes some UO specific stuff that you'll
    need to manually tweak (it isn't really
    productized for automatic/"zombie" installation)
    you'll also see it also does some extra stuff we
    haven't talked about (like generating I2-only
    static route entries), but you can simply comment
    that out.
Write a Comment
User Comments (0)
About PowerShow.com