Title: Diagnostic Steps
1. Diagnostic Steps
- Les Cottrell, SLAC
- Presented at the Optimization Technologies for Low-Bandwidth Networks, ICTP Workshop, Trieste, Italy, 9-20 October 2006
- http://www.slac.stanford.edu/grp/scs/net/talk06/diagnostics.ppt
- Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2. Get ready
- Bring up a terminal window so you can try some commands
- Bring up the presentation so you can click on links
- www.slac.stanford.edu/grp/scs/net/talk06/diagnostics.ppt
3. Aim
- Goal: provide a practical guide to debugging common problems
- Why is diagnosis difficult yet important?
- Local host
- Ping, Traceroute, PingRoute
- Looking at time series
- Locating bottlenecks
- Correlation of problems with routes
- More tools and problems
- Where is a node
- Who do you tell, what do you say?
- Case studies and More Information
4. Why is diagnosis difficult?
- Internet's evolution as a composition of independently developed and deployed protocols, technologies, and core applications
- Diversity, highly unpredictable, hard to find invariants
- Rapid evolution and change, no equilibrium so far
- Findings may be out of date
- Measurement/diagnosis not high on vendors' list of priorities
- Resources/skill focus on more interesting and profitable issues
- Tools lacking or inadequate
- Implementations are flaky and not fully tested with new releases
5. Add to that
- Distributed systems are very hard
- "A distributed system is one in which I can't get my work done because a computer I've never heard of has failed." Butler Lampson
- Network is deliberately transparent
- The bottlenecks can be in any of the following components:
- the applications
- the OS
- the disks, NICs, bus, memory, etc. on sender or receiver
- the network switches and routers, and so on
- Problems may not be logical
- Most problems are operator errors, configurations, bugs
- When building distributed systems, we often observe unexpectedly low performance
- the reasons for which are usually not obvious
- Just when you think you've cracked it, in steps security
- Firewalls, NAT boxes, etc.
- Pings are blocked, traceroute looks like a port scan, diagnostic tool ports are blocked
- ISPs worried about providing access to core, making results public, privacy issues
6. Sources of problems
- Host errors
- TCP buffers, heavy utilization
- Duplex mismatch (Ethernet)
- Misconfigured router/switches
- Including routing errors, especially for backup paths
- Bad equipment, wiring/fiber problem
- Congestion
7. Fire Local Host
- Usual Unix tools (uname -a, top, vmstat, iostat)
- Is the host overloaded? Do you have a gateway (route) and a name server (nslookup/dig)? Which interface are you using? (mii-tool, which needs root, gives duplex and speed, a common error source)
- 21cottrell@pinger> sudo mii-tool eth0
- eth0: 100 Mbit, full duplex, link ok
- Net: ifconfig -a (look at errors), netstat -a | more
- Is the server running (if you know the port)?
- > telnet localhost 2811
- Trying 127.0.0.1
- 220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
- telnet> quit
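The "is the server running?" check can also be scripted rather than typed into telnet. The following is a minimal Python sketch using only the standard library; the function name `read_banner`, the 1024-byte read and the timeout values are illustrative assumptions, not something from the talk.

```python
import socket


def read_banner(host, port, timeout=5.0):
    """Connect to host:port and return the first text the server sends.

    Returns None if the connection is refused or times out, and an empty
    string if the server accepts the connection but sends nothing in time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(1024)
            except socket.timeout:
                return ""
            return data.decode("ascii", errors="replace").strip()
    except OSError:
        # Refused, unreachable, reset, or timed out while connecting.
        return None
```

For instance, `read_banner("localhost", 2811)` would be expected to return the "220 ... ready." greeting if a GridFTP server is listening on that port, and `None` if nothing is listening.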
8. Ping
- Ping
- to localhost,
- to gateway (use route or traceroute to find gateway),
- to well known host
- to relevant remote host
- Use IP address to avoid nameserver problems
- Look for connectivity, loss, RTT, jitter, dups
- May need to run for a long time to see some pathologies (e.g. bursty loss due to DSL loss of sync)
- Try flood pings if you suspect rate limiting
- Use synack or sting if ICMP is blocked
- www-iepm.slac.stanford.edu/tools/synack/
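When ICMP is blocked, the time to complete a TCP connection gives a rough stand-in for a ping. The sketch below is only in the spirit of synack, not a description of how synack is implemented: it times a full `connect()` from user space, so it includes some local overhead. The names `tcp_connect_rtt` and `probe` are hypothetical.

```python
import socket
import time


def tcp_connect_rtt(host, port, timeout=3.0):
    """Return the time in ms to complete a TCP connect, or None on failure.

    Pass an IP address rather than a name so that DNS lookup time is not
    included. A refused connection is counted as a failure here, although
    it does show that the host is reachable.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def probe(host, port, count=5):
    """Run `count` connect probes; return (sent, received, rtts_ms)."""
    results = (tcp_connect_rtt(host, port) for _ in range(count))
    rtts = [r for r in results if r is not None]
    return count, len(rtts), rtts
```

A call such as `probe("192.0.2.10", 80)` (a placeholder address) returns the counts and the list of RTTs, from which loss and min/avg/max can be derived.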
9. Ping example
(Slide callouts: packet size, remote host, repeat count, RTT, missing seq, summary)
- syrup/home ping -c 6 -s 64 thumper.bellcore.com
- PING thumper.bellcore.com (128.96.41.1): 64 data bytes
- 72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
- 72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
- 72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
- 72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
- 72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
- --- thumper.bellcore.com ping statistics ---
- 6 packets transmitted, 5 packets received, 16% packet loss
- round-trip min/avg/max = 482.1/880.5/1447.4 ms
- Note the missing icmp_seq 1 (the lost packet) and the summary lines at the end
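The summary that ping prints can be reproduced from the individual replies, which is handy when post-processing saved output. This is a small Python sketch using the sequence numbers and RTTs from the thumper.bellcore.com example; the function name `summarize_ping` and the dictionary layout are illustrative assumptions.

```python
def summarize_ping(sent, replies):
    """Summarize a ping run.

    sent    -- number of packets transmitted
    replies -- list of (icmp_seq, rtt_ms) for the packets that came back
               (duplicates are assumed to have been removed)
    """
    rtts = [rtt for _, rtt in replies]
    seen = {seq for seq, _ in replies}
    return {
        "missing_seq": [seq for seq in range(sent) if seq not in seen],
        "loss_pct": 100.0 * (sent - len(replies)) / sent,
        "min": min(rtts) if rtts else None,
        "avg": sum(rtts) / len(rtts) if rtts else None,
        "max": max(rtts) if rtts else None,
    }


if __name__ == "__main__":
    # Values taken from the ping example on the slide.
    replies = [(0, 641.8), (2, 1072.7), (3, 1447.4), (4, 758.5), (5, 482.1)]
    print(summarize_ping(6, replies))
```

With the slide's data this reports sequence number 1 as missing, a loss of one packet in six (which ping displays as 16%), and the same 482.1/880.5/1447.4 ms min/avg/max.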
10. Try the following Ping Examples
- ping cepheid.physics.utoronto.ca
- From mcl-gpb.gw.utoronto.ca: Destination Host Unreachable
- ping rolandlap.ph.unimelb.edu.au
- From rtr4-000037.unimelb.edu.au: Packet filtered
- ping www.ncit.edu.np
- ping: unknown host www.ncit.edu.np
- ping inpe-gw-sp.cptec.inpe.br
- From 150.163.200.100 icmp_seq=0 Time to live exceeded
- ping www.ug.edu.gh
- 34 packets transmitted, 0 received, 100% packet loss, time 33068ms
- synack -p 80 -k 5 www.ug.edu.gh
- 5 packets transmitted, 5 packets received, 0.00 percent packet loss
- round-trip (ms) min/avg/max = 182.052/182.701/183.151 (std = 0.578)
- (median = 183.095) (interquartile range = 1.039)
- (25 percentile = 182.085) (75 percentile = 183.124)
11. 3rd party ping
- Find servers
- http://www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
- Glasgow University, Scotland.
- ICTP, Trieste, Italy.
- IHEP, Beijing, China.
- Modify URL to request a ping for hosts with
- pinger.ictp.it/cgi-bin/traceroute.pl?function=ping&target=brunsvigia.tenet.ac.za
- ping from 134.79.18.163 (www.slac.stanford.edu) to 196.21.99.222 (brunsvigia.tenet.ac.za) for 140.105.16.64
- PING 196.21.99.222: 56 data bytes
- 64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=0. time=370. ms
- 64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=1. time=1911. ms
- 64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=2. time=911. ms
- 64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=3. time=385. ms
- 64 bytes from brunsvigia.tenet.ac.za (196.21.99.222): icmp_seq=4. time=366. ms
- ----196.21.99.222 PING Statistics----
- 5 packets transmitted, 5 packets received, 0% packet loss
- round-trip (ms) min/avg/max = 366/788/1911
12. RTT from California to world
[Figure: RTT (ms) versus longitude (degrees), and frequency distribution of RTT (ms), measured from Palo Alto, CA (W. Coast US). Labeled regions: W. Coast US, E. Coast US, Brazil, S. America, Europe. The 300 ms level is marked; the propagation speed label reads 0.3-0.6c. Data from CAIDA Skitter project.]
13. Traceroute
- Traceroute to remote host
- Is the route direct, or over congested commercial nets?
- Reverse traceroute from remote host to you or 3rd party
- www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
- www.tracert.com/
- CAIDA mouse-sensitive map
14. Traceroute
(Slide callouts: remote host, max hops, probes/hop)
- UDP/ICMP tool to show route packets take from local to remote host
- 17cottrell@flora06> traceroute -q 1 -m 20 lhr.comsats.net.pk
- traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
- 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
- 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
- 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
- 4 snv-slac.es.net (134.55.208.30) 1.377 ms
- 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
- 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
- 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
- 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
- 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
- 10 207.45.205.18 (207.45.205.18) 128.648 ms
- 11 210.56.31.94 (210.56.31.94) 762.150 ms
- 12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
- 13 *
- 14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
(Callouts on the output: location; long delay: satellite; no response: lost packet or router ignores)
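One quick way to spot a long-delay link such as the satellite hop is to look for the largest RTT increase between consecutive responding hops. The Python sketch below uses the hop numbers and RTTs from the lhr.comsats.net.pk traceroute; the function name `largest_rtt_jump` is hypothetical. With one probe per hop (`-q 1`) a single noisy sample can mislead, and routers may answer probes slowly without forwarding slowly, so treat the result as a hint only.

```python
# (hop number, RTT in ms) from the traceroute example; None = no response.
HOPS = [
    (1, 0.642), (2, 0.616), (3, 0.716), (4, 1.377), (5, 75.536),
    (6, 80.629), (7, 154.742), (8, 137.403), (9, 135.850), (10, 128.648),
    (11, 762.150), (12, 751.851), (13, None), (14, 827.301),
]


def largest_rtt_jump(hops):
    """Return (previous_hop, hop, increase_ms) for the largest RTT increase
    between consecutive responding hops, or None if fewer than two respond."""
    responding = [(n, rtt) for n, rtt in hops if rtt is not None]
    best = None
    for (prev_n, prev_rtt), (n, rtt) in zip(responding, responding[1:]):
        delta = rtt - prev_rtt
        if best is None or delta > best[2]:
            best = (prev_n, n, delta)
    return best


if __name__ == "__main__":
    prev_hop, hop, delta = largest_rtt_jump(HOPS)
    print(f"largest increase: hop {prev_hop} -> {hop}, +{delta:.1f} ms")
```

For this trace the largest increase is between hops 10 and 11, about 633.5 ms.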
15. Traceroute server results
- Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
[Screenshot callouts: related info, security warning, traceroute, enter IP address or name]
16. Graphical Traceroute
- http://visualroute.visualware.com/
17. Pingroute
- Ping routers along route, e.g. a tool to install that helps:
- www.slac.stanford.edu/comp/net/fpingroute.pl
- or www.slac.stanford.edu/comp/net/pingroute.pl if fping N/A
15cottrell@noric04> fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each of the hops along the route it then uses fping to ping each node (in parallel) 'count' times. Output includes traceroute information, RTTs, losses for 100 and 'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl Opts host
where host is the remote host's IP address or name, e.g. www.slac.stanford.edu
Opts:
-c count, default=10
-s size, default=1400
-i initial, default=1
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
18. Pingroute example
- May help tell where losses start
- Will need many pings if losses small
[Figure callouts: start of losses? But? Start of sustained losses. Routers may not respond.]
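To put a number on "many pings": if each packet is lost independently with probability p, the chance of seeing at least one loss in n pings is 1 - (1 - p)^n. The Python sketch below solves that for n. Independence is an assumption made here for simplicity; real losses are often bursty, so more pings (or a longer run) may be needed in practice. The name `pings_needed` and the 95% default are illustrative.

```python
import math


def pings_needed(loss_rate, confidence=0.95):
    """Smallest n such that at least one loss is seen in n pings with the
    given confidence, assuming independent losses at rate `loss_rate`."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be between 0 and 1 (exclusive)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


if __name__ == "__main__":
    for rate in (0.1, 0.01):
        print(f"loss {rate:.0%}: {pings_needed(rate)} pings")
```

Under these assumptions a 1% loss rate needs about 299 pings just to see one lost packet with 95% confidence, and measuring the rate with any accuracy needs considerably more.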
19. Look at time series
- Look at history plots (PingER, IEPM-BW, ISPs, own border router, etc.): when did the problem start, and how big an effect is it?
- Assumes you know proximity of paths for which there are archived active measurements to the path that you are interested in
- Also that relevant measurements exist
- www-iepm.slac.stanford.edu/pinger/
- amp.nlanr.net/ (unfortunately no longer funded)
- ISPs' plots (www.slac.stanford.edu/comp/net/wan-mon/netmon.html for a place to start looking)
- Abilene: http://stryper.uits.iu.edu/abilene/
- GEANT: http://stats.geant.net/usagemap/usagemap
- RIPE: http://www.ripe.net/projects/ttm/Plots/
- ESnet: http://measurement.es.net/ (OWAMP)
- Collaboration between Internet2/ESnet/GEANT to provide access to router measurements holds promise
- Look at traceroute histories (see later)
20. Example time series
- Look for change in measured value
- Note time
- Correlate
[Figure callout: Italy disconnected]
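Looking for a change in a measured value can be partly automated. The following Python sketch compares the mean of a few samples before each point with the mean of a few samples after it and reports the largest step. It is a deliberately simple illustration, not the method used by PingER or IEPM-BW; the function name, window size, threshold and sample values are all assumptions.

```python
from statistics import mean


def find_step_change(samples, window=3, min_relative_change=0.3):
    """Return (index, mean_before, mean_after) for the largest step in
    `samples`, or None if no step exceeds `min_relative_change`."""
    best = None
    for i in range(window, len(samples) - window + 1):
        before = mean(samples[i - window:i])
        after = mean(samples[i:i + window])
        change = abs(after - before)
        if best is None or change > best[3]:
            best = (i, before, after, change)
    if best is None:
        return None
    i, before, after, change = best
    if before == 0 or change / abs(before) < min_relative_change:
        return None
    return i, before, after


if __name__ == "__main__":
    # Hypothetical hourly throughput samples in Mbits/s.
    throughput = [100, 102, 98, 101, 99, 40, 42, 41, 39, 40]
    print(find_step_change(throughput))
```

The returned index identifies the sample at which the change starts; that is the time to note and to correlate with route changes or other events.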
21. Find location of a bottleneck
- Look at hops along the path
- Pingroute (see earlier)
- If possible look at utilizations or active probes launched from there
- Pathneck: http://www.cs.cmu.edu/hnn/pathneck/
- Uses trains of packets to probe hops along route, looking at dispersion induced by queuing
- Pipechar (son of pathchar, pchar): http://www.dsd.lbl.gov/OldProjects/NCS
- Send packets of varying sizes to each router along path
- Look at RTT as a function of packet size
- From slope deduce bandwidth
- Differentiate to find capacity at each hop
- However pipechar has uncertain support
- Packet size variation is limited to between 1 byte and the MTU (1500 bytes), so on fast links timing is difficult, with the result that estimates may not be reliable (OK for slow links)
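The slope-and-differentiate idea can be written down in a few lines. The Python sketch below follows the general pathchar-style model, not pipechar's actual code: the minimum RTT to hop k is assumed to grow linearly with probe size, with a slope equal to the sum of the per-byte serialization times of the links up to hop k (the reply is assumed small and of fixed size). Differencing the slopes of successive hops then isolates each link. Function names and the synthetic link speeds are assumptions.

```python
def fit_slope(sizes, rtts):
    """Least-squares slope of rtts (seconds) against sizes (bytes)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(rtts) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rtts))
    return sxy / sxx


def per_hop_capacity(sizes_bytes, min_rtts_by_hop):
    """Estimate link capacities in bits/s.

    min_rtts_by_hop[k] holds the minimum RTT (seconds) seen to hop k+1 for
    each probe size. Returns one estimate per link, or None where the
    slope difference is not positive (too noisy to estimate).
    """
    capacities = []
    prev_slope = 0.0
    for rtts in min_rtts_by_hop:
        slope = fit_slope(sizes_bytes, rtts)  # cumulative seconds per byte
        link_slope = slope - prev_slope       # this link only
        capacities.append(8.0 / link_slope if link_slope > 0 else None)
        prev_slope = slope
    return capacities


if __name__ == "__main__":
    # Synthetic, noise-free data: a 100 Mbit/s link, then a 10 Mbit/s link.
    sizes = [100, 500, 1000, 1500]
    hop1 = [0.001 + s * 8 / 100e6 for s in sizes]
    hop2 = [0.020 + s * 8 / 100e6 + s * 8 / 10e6 for s in sizes]
    print(per_hop_capacity(sizes, [hop1, hop2]))
```

On a fast link the per-byte time is tiny compared with timing noise, which is why the estimates become unreliable there: over a 1500-byte size range, a 1 Gbit/s link changes the RTT by only about 12 microseconds.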
22. Divide & Conquer
- Abilene has hosts at major PoPs running bwctl
- So make measurements from end to middle to identify where performance is lost
- http://e2epi.internet2.edu/pipes/ami/bwctl/
23. Correlate with routes (traceanal)
24. Visualizing traceroutes
- www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html, > traceroutes
- One compact page per day
- One row per host, one column per hour
- One character per traceroute to indicate pathology or change (usually a period (.), meaning no change)
- Identify unique routes with a number
- Be able to inspect the route associated with a route number
- Provide for analysis of long term route evolutions
[Figure callouts: route at start of day gives an idea of route stability; multiple route changes (due to GEANT), later restored to original route; period (.) means no change]
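The compact table format can be sketched in a few lines of Python. This is only an illustration of the idea described on the slide (one character per traceroute, a period for no change, a number identifying each unique route); it is not the traceanal code, it ignores the other pathology characters, and the function name and sample router names are hypothetical.

```python
def summarize_routes(traceroutes):
    """Build one table row for one remote host.

    traceroutes -- routes in time order, each a tuple of hop addresses.
    Returns (row, route_numbers): `row` has one entry per traceroute,
    '.' when the route is unchanged from the previous measurement and the
    route number otherwise; `route_numbers` maps each unique route to its
    number so the route behind a number can be inspected.
    """
    route_numbers = {}
    row = []
    previous = None
    for route in traceroutes:
        number = route_numbers.setdefault(route, len(route_numbers))
        row.append(str(number) if route != previous else ".")
        previous = route
    return "".join(row), route_numbers


if __name__ == "__main__":
    a = ("rtr-core", "rtr-dmz", "remote-host")     # hypothetical route 0
    b = ("rtr-core", "rtr-backup", "remote-host")  # hypothetical route 1
    row, numbers = summarize_routes([a, a, b, b, a])
    print(row)  # prints "0.1.0"
```

The first character is always a route number, so each row shows the route in use at the start of the day; a row that is otherwise all periods indicates a stable route.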
25. Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of traceroute summary table (remote host by hour), with samples of traceroute trees generated from the table, showing Los-Nettos (100 Mbps)]
Notes:
1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto detect and notify
[Figure: dynamic bandwidth capacity (DBC), cross-traffic (XT) and available bandwidth (DBC-XT) in Mbits/s; ABwE measurement one/minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am. Callouts: drop in performance (from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech); ESnet-LosNettos segment in the path (100 Mbits/s); back to original path; changes detected by IEPM-Iperf and ABwE]
26. Moving towards application
- Try user application (mem to mem, disk to disk)
- GridFTP, bbcp, bbftp
- Iperf or thrulay (also provides RTT) to test TCP or UDP throughput (injects traffic, needs a server at the remote end)
- dast.nlanr.net/Projects/Iperf/
- www.internet2.edu/shalunov/thrulay/
- Available bandwidth
- Pathload: www-static.cc.gatech.edu/fac/Constantinos.Dovrolis/pathload.html
- Pathchirp: www.spin.rice.edu/Software/pathChirp/
- bing
- NDT
- What are the interface speeds?
- What is the bottleneck?
- Is there a duplex mismatch?
- Are buffers set right (both ends)?
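Whether buffers are "set right" is usually judged against the bandwidth-delay product (BDP): to keep a path full, a single TCP connection needs a window of roughly bandwidth times RTT at both ends. The talk does not spell this out, so the Python sketch below is an added illustration; the function names and the 100 Mbits/s, 80 ms and 64 KB figures are example values only.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bits_per_s * (rtt_ms / 1000.0) / 8.0


def window_limited_throughput(window_bytes, rtt_ms):
    """Upper bound on single-stream TCP throughput (bits/s) for a window."""
    return window_bytes * 8.0 / (rtt_ms / 1000.0)


if __name__ == "__main__":
    # Example: a 100 Mbits/s path with an 80 ms round-trip time.
    print(f"buffer needed: {bdp_bytes(100e6, 80):,.0f} bytes")
    # Example: the same path with only a 64 KB window.
    limit = window_limited_throughput(65535, 80)
    print(f"64 KB window limit: {limit / 1e6:.2f} Mbits/s")
```

In this example the path needs about 1 MB of buffer, while a 64 KB window caps a single stream at roughly 6.5 Mbits/s regardless of the link speed.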
27. NDT example (Rich Carlson)
- http://e2epi.internet2.edu/ndt/
28. Other tools
- Ntop
- Summarizes libpcap (sniffer) information
- Internet2 Detective
- Tests connectivity to I2, bandwidth, multicast, IPv6
- Can run as Java applet
- http://detective.internet2.edu/
- NLANR Internet Advisor
- Ethereal, tcpdump, snoop for masochists
- Passive tools
- Netflow for characterizing network, spotting abnormalities, e.g.
- www.itec.oar.net/abilene-netflow
- www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
- SNMP based tools
29. And then
- Wireless
- Avoid peer-to-peer/ad-hoc connections
- Disable connecting to ad-hoc (set infrastructure only)
- Disable bridging
- How to do it varies by OS (XP, OSX, Linux)
- Ad hoc can still interfere if on same channel
- Tools to locate an access point (e.g. Yellow-Jacket)
- Vendors have management tools to enable APs to detect rogue APs
- NAT boxes may block or not support application
- Private addresses
- 10.0.0.0 - 10.255.255.255: a single class A net
- 172.16.0.0 - 172.31.255.255: 16 contiguous class Bs
- 192.168.0.0 - 192.168.255.255: 256 contiguous class Cs
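An address in one of these private ranges (defined in RFC 1918) in a traceroute or in an application's view of its own address is a strong hint that a NAT box is involved. The check is easy to script with Python's standard `ipaddress` module; the function name `is_rfc1918` is an illustrative choice.

```python
import ipaddress

# The three private IPv4 ranges listed above, in CIDR notation.
RFC1918_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),      # one class A
    ipaddress.ip_network("172.16.0.0/12"),   # 16 contiguous class Bs
    ipaddress.ip_network("192.168.0.0/16"),  # 256 contiguous class Cs
]


def is_rfc1918(addr):
    """Return True if the IPv4 address string is in a private range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918_NETS)


if __name__ == "__main__":
    for addr in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "134.79.18.163"):
        print(addr, is_rfc1918(addr))
```

Note that 172.32.0.1 falls just outside the second range, which ends at 172.31.255.255.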
30. Where is a host?
- Beware: some of the following information is ephemeral; in general use heuristics with Google
- Google Internet country codes for TLDs
- Host may not be in TLD country; developing regions especially often use proxies elsewhere
- Location may be encoded in router name
- ipls=Indianapolis, snv=Sunnyvale
- Name server lookup to find hostname given IP address
- 47cottrell@netflow> nslookup 210.56.16.10
- Server: localhost
- Address: 127.0.0.1
- Name: lhr.comsats.net.pk
- Address: 210.56.16.10
- Use a whois server, e.g.
- www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa)
- www.ripe.net/cgi-bin/whois (Europe)
- www.apnic.net/ (Asia)
- May identify site name, address, contact, etc.; not all domains are in databases (e.g. will not find comsats.net.pk)
31. Where is a host - cont.
- Find the Autonomous System (AS) administering the host
- Form giving AS for domain name
- http://www.fixedorbit.com/search.htm
- Gives AS number, name, adjacent ASs, web page for AS
- Given an AS, find out more about it
- Use http://bgp.potaroo.net/cidr/, go to bottom and enter AS into form
- Gives ISP name, web page, phone number, email, hours, etc.
- Review list of ASs ordered by Upstream AS Adjacency
- www.telstra.net/ops/bgp/bgp-as-upsstm.txt
- Tells what AS is upstream of an ISP
32. Where is a host - cont.
- May be able to get latitude & longitude
- http://www.hostip.info/index.html
- http://www.ip2location.com/
- But it is a subscriber service; however, it is probably best for developing regions
- Google
- www.geoiptool.com/
- Triangulate pings from landmarks (in development)
- http://www.slac.stanford.edu/comp/net/wan-mon/tulip/
- Need more landmarks, send email to cottrell@slac.stanford.edu
- http://www.cs.cornell.edu/bwong/octant/ for US only
33. Who you gonna tell?
- Local network support people
- Internet Service Provider (ISP), usually done by local networker
- Usually will know immediate one, e.g. trouble@es.net
- Use puck.nether.net/netops/nocs.cgi to find ISP
- Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs
- Well managed sites and ISPs maintain a list of email addresses, such as abuse@ or postmaster@, that one can send email to, for example to complain about spam etc.
- This follows an Internet recommendation (RFC 2142).
- Some less helpful sites do not provide such services; for more on these, see RFC-ignorant.org
34. What ya gonna tell 'em?
- Describe problem with details
- What is affected?
- Application, host OS (uname -a), NIC (ifconfig, route)
- How is it affected?
- Non-responsiveness, unable to contact remote host
- Slow performance (see Brian's talk), packet loss
- When did it start?
- Send ping output between hosts
- Send traceroute, forward & reverse if possible
- Maybe use -I (ICMP option)
- NDT
- Identify when it started
- If complex, think about creating a web page with details
- Top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)
35. Web page examples & Case studies
- http://www.slac.stanford.edu/grp/scs/net/case/html/
- http://e2epi.internet2.edu/case-studies/
36. More Information
- Tutorial on monitoring
- www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
- RFC 2151 on Internet tools
- www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
- Network monitoring tools
- www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
- www.caida.org/tools/taxonomy/
- Network Performance Tools: an I2 Cookbook
- e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
- Network Monitoring sites
- www.slac.stanford.edu/comp/net/wan-mon/netmon.html
- How to Accelerate Your Internet, ISBN 0-9778093-1-5, Ed. Flickenger R.
37. Local Host - LISA
- Localhost Information Service Agent (LISA) is a Java Web Start application which provides:
- Integration with MonALISA
- Complete monitoring of the system (load, CPU, memory, disk, disk IO, paging, processes, network traffic and connectivity...).
- History and instantaneous
- Filters to trigger actions when predefined conditions are detected.
- A user-friendly GUI to present the monitoring information.
- Optimization modules for distributed applications.
- It is a lightweight application that can be easily deployed on any system.
- Modules for end-to-end network measurements (e.g. IPERF).
- See monalisa.caltech.edu/dev_lisa.html