Diagnostic Steps

Description:

Presented at the Optimization Technologies for Low-Bandwidth Networks, ICTP ... Look for connectivity, loss, RTT, jitter, dups ...

1
Diagnostic Steps
  • Les Cottrell SLAC
  • Presented at the Optimization Technologies for
    Low-Bandwidth Networks, ICTP Workshop, Trieste,
    Italy, 9-20 October 2006
  • http://www.slac.stanford.edu/grp/scs/net/talk06/diagnostics.ppt

Partially funded by DOE/MICS Field Work Proposal
on Internet End-to-end Performance Monitoring
(IEPM), also supported by IUPAP
2
Get ready
  • Bring up a terminal window so you can try some
    commands
  • Bring up the presentation so you can click on links
  • www.slac.stanford.edu/grp/scs/net/talk06/diagnostics.ppt

3
Aim
  • Goal: provide a practical guide to debugging
    common problems
  • Why is diagnosis difficult yet important?
  • Local host
  • Ping, Traceroute, PingRoute
  • Looking at time series
  • Locating bottlenecks
  • Correlation of problems with routes
  • More tools and problems
  • Where is a node
  • Who do you tell, what do you say?
  • Case studies and More Information

4
Why is diagnosis difficult?
  • Internet's evolution as a composition of
    independently developed and deployed protocols,
    technologies, and core applications
  • Diversity, highly unpredictable, hard to find
    invariants
  • Rapid evolution and change, no equilibrium so far
  • Findings may be out of date
  • Measurement/diagnosis not high on vendors' list of
    priorities
  • Resources/skill focus on more interesting and
    profitable issues
  • Tools lacking or inadequate
  • Implementations are flaky, not fully tested with
    new releases

5
Add to that
  • Distributed systems are very hard
  • "A distributed system is one in which I can't get
    my work done because a computer I've never heard
    of has failed." (Butler Lampson)
  • Network is deliberately transparent
  • The bottlenecks can be in any of the following
    components
  • the applications
  • the OS
  • the disks, NICs, bus, memory, etc. on sender or
    receiver
  • the network switches and routers, and so on
  • Problems may not be logical
  • Most problems are operator errors,
    configurations, bugs
  • When building distributed systems, we often
    observe unexpectedly low performance
  • the reasons for which are usually not obvious
  • Just when you think you've cracked it, in steps
    security
  • Firewall, NAT boxes etc.
  • Block pings, traceroute looks like port scan,
    diagnostic tool ports are blocked
  • ISPs worried about providing access to core,
    making results public, privacy issues

6
Sources of problems
  • Host errors
  • TCP buffers, heavy utilization
  • Duplex mismatch (Ethernet)
  • Misconfigured router/switches
  • Including routing errors, especially for backup
    paths
  • Bad equipment, wiring/fiber problem
  • Congestion

7
First: Local Host
  • Usual Unix tools (uname -a, top, vmstat, iostat)
  • Is the host overloaded? Do you have a gateway
    (route) and a name server (nslookup/dig)? Which
    interface are you using? (mii-tool needs root and
    gives duplex and speed, a common error source)
  • 21cottrell@pinger> sudo mii-tool eth0
  • eth0: 100 Mbit, full duplex, link ok
  • Net: ifconfig -a (look at errors), netstat -a |
    more
  • Is server running (if you know port)?
  • > telnet localhost 2811
  • Trying 127.0.0.1
  • 220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI
    type Globus/GSI wu-2.6.2 (gcc32dbg,
    1069715860-42) ready.
  • telnet> quit

8
Ping
  • Ping
  • to localhost,
  • ping to gateway (use route or traceroute to find
    gateway),
  • ping to well known host
  • to relevant remote host
  • Use IP address to avoid nameserver problems
  • Look for connectivity, loss, RTT, jitter, dups
  • May need to run for a long time to see some
    pathologies (e.g. bursty loss due to DSL loss of
    sync)
  • Try flood pings if you suspect rate limiting
  • Use synack or sting if ICMP blocked
  • www-iepm.slac.stanford.edu/tools/synack/
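When ICMP is blocked, the time taken to complete a TCP connection gives a rough stand-in for a ping RTT. The sketch below is not the synack tool itself; it simply times `connect()` with the Python standard library, and the function names `tcp_connect_rtt` and `probe` are made up for illustration.

```python
import socket
import time


def tcp_connect_rtt(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        # connect() returns once the SYN / SYN-ACK exchange has completed,
        # so the elapsed time approximates one round trip.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None  # refused, filtered, timed out or unresolvable


def probe(host, port, count=5):
    """Probe `count` times; return (sent, received, list of RTTs in ms)."""
    rtts = []
    for _ in range(count):
        rtt = tcp_connect_rtt(host, port)
        if rtt is not None:
            rtts.append(rtt)
    return count, len(rtts), rtts
```

For example, `probe("www.ug.edu.gh", 80)` would exercise the same port as the synack example on the slides, provided a web server is listening there.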

9
Ping example
  • syrup/home ping -c 6 -s 64 thumper.bellcore.com
  • PING thumper.bellcore.com (128.96.41.1): 64 data
    bytes
  • 72 bytes from 128.96.41.1: icmp_seq=0 ttl=240
    time=641.8 ms
  • 72 bytes from 128.96.41.1: icmp_seq=2 ttl=240
    time=1072.7 ms
  • 72 bytes from 128.96.41.1: icmp_seq=3 ttl=240
    time=1447.4 ms
  • 72 bytes from 128.96.41.1: icmp_seq=4 ttl=240
    time=758.5 ms
  • 72 bytes from 128.96.41.1: icmp_seq=5 ttl=240
    time=482.1 ms
  • --- thumper.bellcore.com ping statistics ---
  • 6 packets transmitted, 5 packets received, 16%
    packet loss
  • round-trip min/avg/max = 482.1/880.5/1447.4 ms

[Slide callouts: packet size (-s 64), repeat count (-c 6), remote host, RTT (time=), missing seq (icmp_seq=1), summary (the statistics lines)]
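The quantities to look for in ping output (loss, RTT, jitter, dups, missing sequence numbers) can be pulled out of the reply lines mechanically. The sketch below uses the reply lines from this slide, with the ':' and '=' characters restored on the assumption that they were lost in transcription. Jitter is taken here as the mean absolute difference between consecutive RTTs, which is only one of several definitions in use.

```python
import re

# Reply lines as on the slide, with ':' and '=' restored (an assumption
# about characters lost in transcription).
SAMPLE = """\
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
"""

REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")


def summarize_ping(text, sent):
    """Summarize ping reply lines for a run that sent `sent` (> 0) probes."""
    rtt_by_seq = {}
    dups = 0
    for match in REPLY.finditer(text):
        seq, rtt = int(match.group(1)), float(match.group(2))
        if seq in rtt_by_seq:
            dups += 1  # the same sequence number was answered twice
        else:
            rtt_by_seq[seq] = rtt
    rtts = [rtt_by_seq[s] for s in sorted(rtt_by_seq)]
    steps = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return {
        "received": len(rtts),
        "missing": [s for s in range(sent) if s not in rtt_by_seq],
        "dups": dups,
        "loss_pct": 100.0 * (sent - len(rtts)) / sent,
        "min": min(rtts) if rtts else None,
        "avg": sum(rtts) / len(rtts) if rtts else None,
        "max": max(rtts) if rtts else None,
        "jitter": sum(steps) / len(steps) if steps else 0.0,
    }
```

Calling `summarize_ping(SAMPLE, 6)` reports sequence number 1 as missing and reproduces the min/avg/max figures in the summary line above.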
10
Try the following Ping Examples
  • ping cepheid.physics.utoronto.ca
  • From mcl-gpb.gw.utoronto.ca Destination Host
    Unreachable
  • ping rolandlap.ph.unimelb.edu.au
  • From rtr4-000037.unimelb.edu.au Packet
    filtered
  • ping www.ncit.edu.np
  • ping: unknown host www.ncit.edu.np
  • ping inpe-gw-sp.cptec.inpe.br
  • From 150.163.200.100 icmp_seq=0 Time to live
    exceeded
  • ping www.ug.edu.gh
  • 34 packets transmitted, 0 received, 100% packet
    loss, time 33068ms
  • synack -p 80 -k 5 www.ug.edu.gh
  • 5 packets transmitted, 5 packets received, 0.00
    percent packet loss
  • round-trip (ms) min/avg/max = 182.052/182.701/183.151
    (std = 0.578)
  • (median = 183.095) (interquartile range =
    1.039)
  • (25 percentile = 182.085) (75 percentile =
    183.124)

11
3rd party ping
  • Find servers
  • http://www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
  • Glasgow University Scotland.
  • ICTP , Trieste, Italy.
  • IHEP Beijing, China.
  • Modify URL to request a ping for hosts with
  • pinger.ictp.it/cgi-bin/traceroute.pl?function=ping&target=brunsvigia.tenet.ac.za
  • ping from 134.79.18.163 (www.slac.stanford.edu)
    to 196.21.99.222 (brunsvigia.tenet.ac.za) for
    140.105.16.64
  • PING 196.21.99.222: 56 data bytes
  • 64 bytes from brunsvigia.tenet.ac.za
    (196.21.99.222): icmp_seq=0. time=370. ms
  • 64 bytes from brunsvigia.tenet.ac.za
    (196.21.99.222): icmp_seq=1. time=1911. ms
  • 64 bytes from brunsvigia.tenet.ac.za
    (196.21.99.222): icmp_seq=2. time=911. ms
  • 64 bytes from brunsvigia.tenet.ac.za
    (196.21.99.222): icmp_seq=3. time=385. ms
  • 64 bytes from brunsvigia.tenet.ac.za
    (196.21.99.222): icmp_seq=4. time=366. ms
  • ----196.21.99.222 PING Statistics----
  • 5 packets transmitted, 5 packets received, 0%
    packet loss
  • round-trip (ms) min/avg/max = 366/788/1911
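Modifying the traceroute server URL to request a ping can be scripted. The sketch below assumes the query parameters are named `function` and `target`, which is a reading of the transcribed URL on this slide rather than a documented interface, and it assumes an `http://` scheme. The helper name `third_party_ping_url` is made up for illustration.

```python
from urllib.parse import urlencode


def third_party_ping_url(server, target):
    """Build a ping request URL for a traceroute.pl style server.

    The 'function' and 'target' parameter names are assumptions taken
    from the example URL on the slide.
    """
    query = urlencode({"function": "ping", "target": target})
    return f"http://{server}/cgi-bin/traceroute.pl?{query}"


url = third_party_ping_url("pinger.ictp.it", "brunsvigia.tenet.ac.za")
```

The resulting URL can then be opened in a browser or fetched with any HTTP client.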

12
RTT from California to world
[Figure: RTT (ms) versus longitude (degrees), and frequency distribution of RTT (ms), source Palo Alto CA (W. Coast). Labelled regions: W. Coast US, E. Coast US, Brazil, Europe, S. America; 300 ms marked; annotation "0.30.6c". Data from CAIDA Skitter project]
13
Traceroute
  • Traceroute to remote host
  • Is the route direct, or over congested commercial
    nets?
  • Reverse traceroute from remote host to you or 3rd
    party
  • www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
  • www.tracert.com/

CAIDA Mouse sensitive map
14
Traceroute
[Slide callouts on the command: remote host, max hops (-m 20), probes/hop (-q 1)]
  • UDP/ICMP tool to show route packets take from
    local to remote host
  • 17cottrell@flora06> traceroute -q 1 -m 20
    lhr.comsats.net.pk
  • traceroute to lhr.comsats.net.pk (210.56.16.10),
    20 hops max, 40 byte packets
  • 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2)
    0.642 ms
  • 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU
    (134.79.135.21) 0.616 ms
  • 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU
    (192.68.191.66) 0.716 ms
  • 4 snv-slac.es.net (134.55.208.30) 1.377 ms
  • 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
  • 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
  • 7 gin-nyy-bbl.teleglobe.net (192.157.69.33)
    154.742 ms
  • 8 if-1-0-1.bb5.NewYork.Teleglobe.net
    (207.45.223.5) 137.403 ms
  • 9 if-12-0-0.bb6.NewYork.Teleglobe.net
    (207.45.221.72) 135.850 ms
  • 10 207.45.205.18 (207.45.205.18) 128.648 ms
  • 11 210.56.31.94 (210.56.31.94) 762.150 ms
  • 12 islamabad-gw2.comsats.net.pk (210.56.8.4)
    751.851 ms
  • 13 *
  • 14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms

[Slide callouts on the output: location; long delay: satellite (hop 11); no response: lost packet or router ignores (hop 13)]
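Spotting the long-delay hop by eye works for one trace; for many traces it helps to parse the output and report the largest RTT increase between consecutive responding hops. The sketch below uses the hop lines from this slide and assumes one probe per hop (as with -q 1). The names `parse_hops` and `largest_jump` are illustrative only.

```python
import re

TRACE = """\
 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
 4 snv-slac.es.net (134.55.208.30) 1.377 ms
 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
"""

# hop number, name, (address), RTT in ms; hops that print '*' do not match
HOP = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+([\d.]+)\s*ms", re.M)


def parse_hops(text):
    """Return [(hop, name, address, rtt_ms)] for hops that responded."""
    return [(int(n), name, ip, float(rtt))
            for n, name, ip, rtt in HOP.findall(text)]


def largest_jump(hops):
    """Return (from_hop, to_hop, delta_ms) for the biggest RTT increase."""
    best = None
    for prev, cur in zip(hops, hops[1:]):
        delta = cur[3] - prev[3]
        if best is None or delta > best[2]:
            best = (prev[0], cur[0], delta)
    return best


hops = parse_hops(TRACE)
jump = largest_jump(hops)  # the step from hop 10 to hop 11 in this trace
```

Because each RTT is a single sample, a large jump is a hint rather than proof; routers may also answer probes slowly without forwarding slowly.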
15
Traceroute server results
  • Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl

[Screenshot callouts: related info; security warning; traceroute; enter IP address or name]
16
Graphical Traceroute
  • http://visualroute.visualware.com/

17
Pingroute
  • Ping the routers along the route; a tool you can
    install to help:
  • www.slac.stanford.edu/comp/net/fpingroute.pl
  • or www.slac.stanford.edu/comp/net/pingroute.pl if
    fping is not available

15cottrell@noric04> fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each
of the hops along the route it then uses fping to ping each
node (in parallel) 'count' times. Output includes traceroute
information, RTTs, losses for 100 and 'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl Opts host
  where host is the remote host's IP address or name,
  e.g. www.slac.stanford.edu
Opts:
  -c count    default=10
  -s size     default=1400
  -i initial  default=1
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
18
Pingroute example
  • May help tell where losses start
  • Will need many pings if losses small

[Slide callouts on the pingroute output: start of losses?; but?; start of sustained losses; routers may not respond]
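How many pings "many" means can be estimated. If losses are independent with probability p per packet, the chance of seeing at least one loss in n pings is 1 - (1 - p)^n, so n must be at least ln(1 - confidence) / ln(1 - p). The sketch below applies that formula; the independence assumption is a simplification, since real losses are often bursty, and the function name is illustrative.

```python
import math


def pings_needed(loss_rate, confidence=0.95):
    """Pings needed to see at least one loss with the given confidence.

    Assumes independent losses at a fixed rate, which bursty loss violates.
    """
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))
```

Under these assumptions a 1% loss rate needs roughly 300 pings per hop before a single loss is likely to show up, and measuring the rate with any precision needs considerably more.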
19
Look at time series
  • Look at history plots (PingER, IEPM-BW, ISPs, own
    border router etc.), when did problem start, how
    big an effect is it?
  • Assumes you know proximity of paths for which
    there are archived active measurements to the
    path that you are interested in
  • Also that relevant measurements exist
  • www-iepm.slac.stanford.edu/pinger/
  • amp.nlanr.net/ unfortunately no longer funded
  • ISPs' plots (www.slac.stanford.edu/comp/net/wan-mon/netmon.html
    for a place to start looking)
  • Abilene: http://stryper.uits.iu.edu/abilene/
  • GEANT: http://stats.geant.net/usagemap/usagemap
  • RIPE: http://www.ripe.net/projects/ttm/Plots/
  • ESnet: http://measurement.es.net/ (OWAMP)
  • Collaboration between Internet2/ESnet/Geant to
    provide access to router measurements holds
    promise
  • Look at traceroute histories (see later)

20
Example time series
  • Look for change in measured value
  • Note time
  • Correlate

Italy disconnected
21
Find location of a bottleneck
  • Look at hops along the path
  • Pingroute (see earlier)
  • If possible look at utilizations or active probes
    launched from there
  • Pathneck: http://www.cs.cmu.edu/~hnn/pathneck/
  • Uses trains of packets to probe hops along route,
    looking at dispersion induced by queuing
  • Pipechar (son of pathchar, pchar):
    http://www.dsd.lbl.gov/OldProjects/NCS
  • Send packets of varying sizes to each router
    along path
  • Look at RTT as a function of packet size
  • From slope deduce bandwidth
  • Differentiate to find capacity at each hop
  • However pipechar has uncertain support
  • Packet size variation limited to 1-MTU (1500)
    Bytes, so on fast links timing is difficult, with
    the result that estimates may not be reliable (OK
    for slow links)
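The slope-and-difference idea can be sketched as follows. This is a pathchar-style model under stated assumptions, not pipechar's actual algorithm: the RTT to hop n for a probe of s bytes is taken as a fixed latency plus 8·s·(1/C_1 + ... + 1/C_n), the replies are assumed small and of fixed size, and the minimum RTT per size is assumed to be free of queuing. The function names are illustrative.

```python
def slope(points):
    """Least-squares slope of RTT (seconds) versus packet size (bytes)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


def hop_capacities(per_hop_points):
    """Estimate each link's capacity in bits/s.

    per_hop_points[i] holds (size_bytes, min_rtt_seconds) pairs measured
    to router i+1. The slope to hop n is the sum of 8/C over links 1..n,
    so the difference between successive slopes isolates one link.
    Returns None for a link when noise makes the difference non-positive.
    """
    capacities = []
    previous = 0.0
    for points in per_hop_points:
        current = slope(points)
        extra = current - previous  # seconds per byte added by this link
        capacities.append(8.0 / extra if extra > 0 else None)
        previous = current
    return capacities


# Synthetic two-hop path: a 10 Mbit/s link followed by a 2 Mbit/s link.
sizes = [100, 500, 1000, 1500]
hop1 = [(s, 0.001 + 8 * s / 10e6) for s in sizes]
hop2 = [(s, 0.005 + 8 * s / 10e6 + 8 * s / 2e6) for s in sizes]
estimates = hop_capacities([hop1, hop2])
```

On a fast link the per-byte slope is tiny compared with timing noise, which is the reliability problem the slide describes.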

22
Divide & Conquer
  • Abilene has hosts at major PoPs running bwctl
  • So make measurements from end to middle to
    identify where performance is lost
  • http://e2epi.internet2.edu/pipes/ami/bwctl/

23
Correlate with routes (traceanal)
24
Visualizing traceroutes
  • www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html,
    then select traceroutes
  • One compact page per day
  • One row per host, one column per hour
  • One character per traceroute to indicate
    pathology or change (usually period(.) no
    change)
  • Identify unique routes with a number
  • Be able to inspect the route associated with a
    route number
  • Provide for analysis of long term route
    evolutions

Route at start of day, gives idea of route
stability
Multiple route changes (due to GEANT), later
restored to original route
Period (.) means no change
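The table layout described above (a number for each unique route, a period when nothing changed) can be sketched in a few lines. This is an illustration of the idea, not the traceanal code; routes are assumed to be given as sequences of hop addresses in time order, and the addresses in the example are made up.

```python
def summarize_routes(traceroutes):
    """Return (cells, route_numbers) for one host's traceroutes.

    cells has one entry per traceroute: "." if the route is unchanged
    from the previous measurement, otherwise the route's number.
    route_numbers maps each unique route to its number so that the
    route behind a number can be inspected later.
    """
    route_numbers = {}
    cells = []
    previous = None
    for route in traceroutes:
        route = tuple(route)
        number = route_numbers.setdefault(route, len(route_numbers))
        cells.append("." if route == previous else str(number))
        previous = route
    return cells, route_numbers


# Hypothetical routes: A is the usual path, B a temporary detour.
A = ("10.0.0.1", "192.0.2.1", "198.51.100.7")
B = ("10.0.0.1", "203.0.113.9", "198.51.100.7")
cells, numbers = summarize_routes([A, A, B, B, A])
```

The first cell always shows a route number, which matches the table's convention of giving the route at the start of the day.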
25
Changes in network topology (BGP) can result in
dramatic changes in performance
[Figure: snapshot of the traceroute summary table (rows: remote host, columns: hour), with samples of traceroute trees generated from the table, including a Los-Nettos (100 Mbps) branch]
Notes:
  • 1. Caltech misrouted via Los-Nettos 100 Mbps
    commercial net, 14:00-17:00
  • 2. ESnet/GEANT working on routes from 2:00 to 14:00
  • 3. A previous occurrence went unnoticed for 2
    months
  • 4. Next step is to auto-detect and notify
[Figure: ABwE measurements, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am; y axis Mbits/s; curves: dynamic bandwidth capacity (DBC), cross-traffic (XT), available bandwidth (DBC-XT). Callouts: changes detected by IEPM-Iperf and ABwE; drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech; ESnet-LosNettos segment in the path (100 Mbits/s); back to original path]
26
Moving towards application
  • Try user application (mem to mem & disk to disk)
  • GridFTP, bbcp, bbftp
  • Iperf or thrulay (also provides RTT) to test TCP
    or UDP throughput (injects traffic, server)
  • dast.nlanr.net/Projects/Iperf/
  • www.internet2.edu/~shalunov/thrulay/
  • Available bandwidth
  • Pathload: www-static.cc.gatech.edu/fac/Constantinos.Dovrolis/pathload.html
  • Pathchirp www.spin.rice.edu/Software/pathChirp/
  • bing
  • NDT
  • What are the interface speeds?
  • What is the bottleneck?
  • Is there a duplex mismatch?
  • Are buffers set right (both ends)?
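Whether buffers are "set right" is usually judged against the bandwidth-delay product: to keep a path full, TCP needs a window of at least bandwidth × RTT at both ends. The sketch below does that arithmetic; the function names and the example figures are illustrative and not taken from the talk.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bits_per_s * (rtt_ms / 1000.0) / 8.0


def max_throughput_bits_per_s(window_bytes, rtt_ms):
    """Upper bound on single-stream TCP throughput for a given window."""
    return window_bytes * 8.0 / (rtt_ms / 1000.0)


# Example figures (assumed): a 100 Mbit/s path with a 160 ms RTT.
needed = bdp_bytes(100e6, 160)                     # about 2 MB of buffer
limit = max_throughput_bits_per_s(64 * 1024, 160)  # about 3.3 Mbit/s
```

With a 64 KiB window the same path cannot carry much more than 3 Mbit/s per stream, however fast the links are.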

27
NDT example (Rich Carlson)
  • http://e2epi.internet2.edu/ndt/

28
Other tools
  • Ntop
  • Summarizes libpcap (sniffer) information
  • Internet2 Detective
  • Tests connectivity to I2, bandwidth, multicast,
    IPv6
  • Can run as Java applet
  • http://detective.internet2.edu/
  • NLANR Internet Advisor
  • Ethereal, tcpdump, snoop for masochists
  • Passive tools
  • Netflow for characterizing network, spotting
    abnormalities, e.g.
  • www.itec.oar.net/abilene-netflow
  • www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
  • SNMP based tools

29
And then
  • Wireless
  • Avoid peer-to-peer/ad-hoc connections
  • Disable connecting to ad-hoc (set infrastructure
    only)
  • Disable bridging
  • How to do it varies by OS (XP, OSX, Linux)
  • Ad hoc can still interfere if on same channel
  • Tools to locate an access point (e.g.
    Yellow-Jacket)
  • Vendors have management tools to enable APs to
    detect rogue APs
  • NAT boxes may block or not support application
  • Private addresses
  • 10.0.0.0 - 10.255.255.255 a single class A net
  • 172.16.0.0 - 172.31.255.255 16 contiguous class
    Bs
  • 192.168.0.0 - 192.168.255.255 256 contiguous
    class Cs
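A quick way to tell whether an address seen in a traceroute or log falls in one of these private ranges is to test it against the three blocks in CIDR form (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The sketch below uses Python's standard `ipaddress` module; the function name is illustrative.

```python
import ipaddress

# The three private ranges listed above, written as CIDR blocks.
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_v4(addr):
    """True if addr lies in one of the three private IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)
```

For instance, `is_private_v4("172.31.255.255")` is true while `is_private_v4("172.32.0.0")` is false, which marks the upper edge of the 16 contiguous class Bs.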

30
Where is a host?
  • Beware: some of the following information is
    ephemeral; in general use heuristics with Google
  • Google Internet country codes for TLDs
  • Host may not be in TLD country, especially
    developing regions often use proxies elsewhere
  • Location may be encoded in router name
  • ipls=Indianapolis, snv=Sunnyvale
  • Name server lookup to find hostname given IP
    address
  • 47cottrell@netflow> nslookup 210.56.16.10
  • Server: localhost
  • Address: 127.0.0.1
  • Name: lhr.comsats.net.pk
  • Address: 210.56.16.10
  • Use a whois server, e.g.
  • www.networksolutions.com/cgi-bin/whois/whois
    (Americas & Africa)
  • www.ripe.net/cgi-bin/whois (Europe)
  • www.apnic.net/ (Asia)
  • May identify site name, address, contact, etc,
    not all domains are in databases (e.g. will not
    find comsats.net.pk)
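The reverse lookup that nslookup performs can also be done from a script. The sketch below shows the PTR name that is queried for an address and a lookup through the system resolver using the Python standard library; the function names are illustrative, and the lookup only returns a name if the address owner has published a PTR record.

```python
import ipaddress
import socket


def ptr_name(addr):
    """Return the DNS name queried for a reverse (PTR) lookup."""
    return ipaddress.ip_address(addr).reverse_pointer


def reverse_lookup(addr):
    """Return the hostname for addr, or None if there is no PTR record."""
    try:
        return socket.gethostbyaddr(addr)[0]
    except OSError:
        return None


name = ptr_name("210.56.16.10")
```

Here `name` is `10.16.56.210.in-addr.arpa`, the record that nslookup resolved in the example above.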

31
Where is a host cont.
  • Find the Autonomous System (AS) administering
  • Form giving AS for domain name
  • http://www.fixedorbit.com/search.htm
  • Gives AS number, name, adjacent ASs and web page
    for AS
  • Given an AS, find out more about it
  • Use http://bgp.potaroo.net/cidr/, go to bottom and
    enter AS into form
  • Gives ISP name, web page, phone number, email,
    hours etc.
  • Review list of AS's ordered by Upstream AS
    Adjacency
  • www.telstra.net/ops/bgp/bgp-as-upsstm.txt
  • Tells what AS is upstream of an ISP

32
Where is a host - cont.
  • May be able to get latitude and longitude
  • http://www.hostip.info/index.html
  • http://www.ip2location.com/
  • But it is a subscriber service (, but ),
    however it is probably best for developing
    regions
  • Google
  • http://www.geoiptool.com/
  • Triangulate pings from landmarks (in development)
  • http://www.slac.stanford.edu/comp/net/wan-mon/tulip/
  • Need more landmarks, send email to
    cottrell@slac.stanford.edu
  • http://www.cs.cornell.edu/~bwong/octant/ for US
    only
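The reasoning behind triangulating from landmarks is that a minimum RTT puts an upper bound on distance: the signal cannot have travelled further than half the RTT times the propagation speed. The sketch below is not the TULIP or Octant algorithm; it assumes a propagation speed of about two thirds of c (typical for optical fibre) and a spherical Earth, and the function names are illustrative.

```python
import math

SPEED_KM_PER_S = 299_792.458 * 2.0 / 3.0  # assumed: about 2/3 c in fibre
EARTH_RADIUS_KM = 6371.0


def max_distance_km(min_rtt_ms):
    """Upper bound on the distance to a host given the minimum RTT."""
    one_way_s = (min_rtt_ms / 1000.0) / 2.0
    return one_way_s * SPEED_KM_PER_S


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def consistent(candidate, landmarks):
    """True if candidate (lat, lon) is within range of every landmark.

    landmarks is a list of (lat, lon, min_rtt_ms) tuples.
    """
    return all(
        haversine_km(candidate[0], candidate[1], lat, lon)
        <= max_distance_km(rtt)
        for lat, lon, rtt in landmarks
    )
```

Real paths are longer than great circles and add queuing and router delay, so the true distance is usually well inside this bound; more landmarks tighten the feasible region.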

33
Who you gonna tell?
  • Local network support people
  • Internet Service Provider (ISP) usually done by
    local networker
  • Usually will know immediate one, e.g.
    trouble@es.net
  • Use puck.nether.net/netops/nocs.cgi to find ISP
  • Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to
    find upstream ISPs
  • Well managed sites and ISPs maintain a list of
    email addresses, such as abuse@ or postmaster@,
    that one can send email to, for example to
    complain about spam
  • This follows an Internet recommendation (RFC
    2142).
  • Some less helpful sites do not provide such
    services, for more on these, see RFC-ignorant.org

34
What ya gonna tell 'em?
  • Describe problem with details
  • What is affected?
  • Application, host OS (uname -a), NIC (ifconfig,
    route)
  • How is it affected?
  • Non responsiveness, unable to contact remote host
  • Slow performance (see Brian's talk), packet loss
  • When did it start?
  • Send ping output between hosts
  • Send traceroute, forward & reverse if possible
  • Maybe use -I (ICMP option)
  • NDT
  • Identify when it started
  • If complex think about creating web page with
    details
  • Top, vmstat, pingroute, pipechar, application
    output (GridFTP, iperf)

35
Web page examples & Case studies
  • http://www.slac.stanford.edu/grp/scs/net/case/html/
  • http://e2epi.internet2.edu/case-studies/

36
More Information
  • Tutorial on monitoring
  • www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
  • RFC 2151 on Internet tools
  • www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
  • Network monitoring tools
  • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  • www.caida.org/tools/taxonomy/
  • Network Performance Tools: an I2 Cookbook
  • e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
  • Network Monitoring sites
  • www.slac.stanford.edu/comp/net/wan-mon/netmon.html
  • How to Accelerate Your Internet, ISBN
    0-9778093-1-5, Ed. Flickenger R.

37
Local Host - LISA
  • Localhost Information Service Agent (LISA) is a
    Java Web Start application which provides:
  • Integration with MonALISA
  • Complete Monitoring of the System (Load, CPU,
    Memory, Disk, Disk IO, Paging, Processes, Network
    Traffic and Connectivity...).
  • History and instantaneous
  • Filters to trigger actions when predefined
    conditions are detected.
  • A user Friendly GUI to present the monitoring
    information.
  • Optimization modules for distributed
    applications.
  • It is a lightweight application that can be
    easily deployed on any system.
  • Modules for End to End network measurements (
    e.g. IPERF).
  • See monalisa.caltech.edu/dev_lisa.html