Internet Monitoring

About This Presentation

Title:

Internet Monitoring

Description:

Title: Quality of Service Author: cottrell Last modified by: cottrell Created Date: 10/17/1999 7:36:36 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 61

Provided by: cott57

Learn more at: https://www.slac.stanford.edu

Transcript and Presenter's Notes

Title: Internet Monitoring

1
Internet Monitoring

Les Cottrell SLAC
Presented at NUST Institute of Information
Technology (NIIT) Rawalpindi, Pakistan, March 15,
2005

Partially funded by DOE/MICS Field Work Proposal
on Internet End-to-end Performance Monitoring
(IEPM), also supported by IUPAP
2
Overview

Why is measurement difficult yet important?
LAN vs WAN
SNMP
Effects of measurement interval
Passive
Active
Tools including some results on Digital Divide
Trouble shooting
Tools, how to find things, who to tell
New challenges

3
Why is measurement difficult?

Internet's evolution as a composition of
independently developed and deployed protocols,
technologies, and core applications
Diversity, highly unpredictable, hard to find
invariants
Rapid evolution & change, no equilibrium so far
Findings may be out of date
Measurement not high on vendors' list of
priorities
Resources/skill focus on more interesting and
profitable issues
Tools lacking or inadequate
Implementations poor & not fully tested with new
releases
ISPs worried about providing access to core,
making results public, privacy issues
The phone connection oriented model (Poisson
distributions of session length etc.) does not
work for Internet traffic (heavy tails, self
similar behavior, multi-fractals etc.)

4
Add to that

Distributed systems are very hard
A distributed system is one in which I can't get
my work done because a computer I've never heard
of has failed. Butler Lampson
Network is deliberately transparent
The bottlenecks can be in any of the following
components
the applications
the OS
the disks, NICs, bus, memory, etc. on sender or
receiver
the network switches and routers, and so on
Problems may not be logical
Most problems are operator errors,
configurations, bugs
When building distributed systems, we often
observe unexpectedly low performance
the reasons for which are usually not obvious
Just when you think you've cracked it, in steps
security

5
Why is measurement important?

End users & network managers need to be able to
identify & track problems
Choosing an ISP, setting a realistic service
level agreement, and verifying it is being met
Choosing routes when more than one is available
Setting expectations
Deciding which links need upgrading
Deciding where to place collaboration components
such as a regional computing center, software
development
How well will an application work (e.g. VoIP)
Application steering (e.g. forecasting)
Grid middleware, e.g. replication manager

6
LAN vs WAN

Measuring the LAN
Network admin has control so
Can read MIBs from devices
Can, within limits, passively sniff traffic
Know the routes between devices
Manually for small networks
Automated for large networks
Measuring the WAN
No admin control, unless you are an ISP
Can't read information out of routers
May not be able to sniff/trace traffic due to
privacy/security concerns
Don't know route details between points; they may
change, are not under your control, but you may be
able to deduce some of them
So typically have to make do with what can be
measured from end to end with very limited
information from intermediate equipment hops.

7
SNMP (Simple Network Management Protocol)

Example of an Application, usually built on UDP
De facto standard for network management
Created by IETF to address short term needs of
TCP/IP
Consists of
Management Information Bases (MIBs)
Store information about managed objects (host,
router, switch etc.): system & status info,
performance & configuration data
Remote Network Monitoring (RMON) is a management
tool for passively watching line traffic
SNMP communication protocol to read out data and
set parameters
Polling protocol: manager asks questions & agent
responds

8
SNMP Model

NMS contains manager software to send & receive
SNMP messages to/from Agents
Agent is a software component residing on a
managed node; it responds to SNMP queries, performs
updates & reports problems
MIB resides on nodes and at NMS and is a logical
description of all network management data.

[Figure: a Network Management Station (NMS) connected over a TCP/IP net to several managed nodes, each with an Agent and MIB]
9
SNMP version 1 limitations

Authentication is inadequate
Password (community string) placed in clear in
SNMP messages
MIB variables must be polled separately, i.e.
entire MIB cannot be fetched with single command
SNMPv2 and v3 attempt to address these and other
limitations
Despite limitations, SNMP has been a huge success
Provides device and link utilization (bytes,
packets) and errors
Lots of facilities/tools built around SNMP to
provide reports for sites
Security concerns limit access, typically to a very
limited set of owners/admins
E.g. ISPs won't let you poll their devices
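
Utilization plots of the MRTG kind come from polling an interface byte counter twice and dividing the difference by the polling interval. A minimal sketch, assuming a 32-bit octet counter that wraps back to zero; the function name and the sample values are illustrative and not taken from the slides.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to 0 here


def utilization(prev_octets, curr_octets, interval_s, link_bps):
    """Return the fraction of link capacity used between two counter polls."""
    # Modular subtraction copes with a single counter wrap between polls.
    delta_octets = (curr_octets - prev_octets) % COUNTER32_MODULUS
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps


# Hypothetical example: two polls 300 s apart on a 155 Mbits/s (OC3) link.
print(f"{utilization(1_000_000, 2_326_250_000, 300, 155e6):.1%}")
```

On fast links a 32-bit counter can wrap more than once within a polling interval, in which case this sketch undercounts; shorter intervals or 64-bit counters avoid that.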

10
SNMP Examples

Using MRTG to display Router bits/s MIB variable

CERN trans-Atlantic traffic
11
Averaging/Sampling intervals

Typical measurements of utilization are made for
5 minute intervals or longer in order not to
create much impact.
Interactive human interactions require second or
sub-second response
So it is interesting to see the difference
between measurements made with different time
frames.

12
Utilization with different averaging times
Same data, measured Mbits/s every 5 secs
Average over different time intervals
Does not get a lot smoother
May indicate multi-fractal behavior

[Plots: utilization averaged over 5 secs, 5 mins and 1 hour]
13
Averages vs maxima

Maximum of all 5 sec samples can be a factor of 2
or more greater than the average over 5 minute
intervals
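
The gap between a 5 minute average and the 5 sec maximum inside it can be sketched as below. The sample values are hypothetical (a quiet link with one short burst), chosen only to show how a burst disappears in the average.

```python
def window_stats(samples, window):
    """Split per-5-second samples into windows; return (average, maximum) per window."""
    stats = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        stats.append((sum(chunk) / len(chunk), max(chunk)))
    return stats


# Hypothetical Mbits/s readings taken every 5 s; 60 readings make one 5-minute window.
samples = [10.0] * 50 + [60.0] * 10
average, maximum = window_stats(samples, 60)[0]
print(f"5-min average {average:.1f} Mbits/s, 5-s maximum {maximum:.1f} Mbits/s")
```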

14
Lot of heavy FTP activity

The difference depends on traffic type
Only 20% difference in max & average

15
Passive vs. Active Monitoring

Active injects traffic on demand
Passive watches things as they happen
Network device records information
Packets, bytes, errors kept in MIBs, retrieved
by SNMP
Devices (e.g. probe) capture/watch packets as
they pass
Router, switch, sniffer, host in promiscuous
mode (tcpdump)
Complementary to one another
Passive
does not inject extra traffic, measures real
traffic
Polling to gather data generates traffic, also
gathers large amounts of data
Active
provides explicit control on the generation of
packets for measurement scenarios
testing what you want, when you need it.
Injects extra artificial traffic
Can do both, e.g. start active measurement and
look at it passively

16
Passive tools

SNMP
Hardware probes: e.g. Sniffer, NetScout, can be
stand-alone or remotely accessed from a central
management station
Software probes: snoop, tcpdump, require
promiscuous access to NIC, i.e. root/sudo
access
Flow measurement: netramet, OCxMon/CoralReef,
Netflow

17
Example: Passive site border monitoring

Use Cisco Netflow in Catalyst 6509 with MSFC, on
SLAC border
Gather about 200MBytes/day of flow data
The raw data records include source and
destination addresses and ports, the protocol,
packet, octet and flow counts, and start and end
times of the flows
Much less detailed than saving headers of all
packets, but good compromise
Top talkers history and daily (from & to), tlds,
vlans, protocol and application utilization
Use for network security
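
A top talkers report boils down to summing octets per address over the flow records. A minimal sketch, assuming each record has already been parsed into a dictionary; the field names and the sample records are hypothetical, not the Netflow export format.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum octets per source address and return the n largest senders."""
    octets_by_source = Counter()
    for flow in flows:
        octets_by_source[flow["src"]] += flow["octets"]
    return octets_by_source.most_common(n)


# Hypothetical flow records using documentation addresses (RFC 5737).
flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": "tcp", "octets": 5_000_000},
    {"src": "192.0.2.11", "dst": "198.51.100.5", "proto": "udp", "octets": 600},
    {"src": "192.0.2.10", "dst": "203.0.113.7", "proto": "tcp", "octets": 1_500},
]
print(top_talkers(flows))
```

Keying the counter on the protocol or the destination port instead of the source address gives the per-protocol and per-application views.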

18
SLAC Traffic profile
SLAC offsite links: OC3 to ESnet, 1Gbps to
Stanford U & thence OC12 to I2, OC48 to NTON
Profile: bulk-data xfer dominates
[Plots: Mbps in and Mbps out by application (HTTP, iperf, SSH, FTP, bbftp), over 2 days and over the last 6 months]
19
Top talkers by protocol
Volume dominated by single application - bbcp
[Plot: MBytes/day (log scale, 1 to 10000) by hostname]
20
Flow sizes
[Plot labels: SNMP, Real A/V, AFS file server]
Heavy tailed, in & out, UDP flows shorter than
TCP, packets & bytes
75% TCP-in < 5kBytes, 75% TCP-out < 1.5kBytes (< 10 pkts)
UDP 80% < 600Bytes (75% < 3 pkts), 10 more TCP than UDP
Top UDP: AFS (> 55%), Real (25%), SNMP (1.4%)
21
Flow lengths

60% of TCP flows less than 1 second
Would expect TCP streams longer lived
But 60% of UDP flows over 10 seconds, maybe due
to heavy use of AFS

22
Some Active Measurement Tools

Ping: connectivity, RTT & loss
flavors of ping, fping, Linux vs Solaris ping
but blocking & rate limiting
Alternative: synack, but can look like DoS attack
Sting: measures one way loss
Traceroute
How it works, what it provides
Reverse traceroute servers
Traceroute archives
Combining ping & traceroute:
traceping, pingroute
Pathchar, pchar, pipechar, bprobe, abing etc.
Iperf, netperf, ttcp, FTP

23
Ping

ICMP client/server application built on IP
Client sends ICMP echo request, server sends reply
Server usually in kernel, so reliable & fast
User can specify number of data bytes. Client
puts timestamp in data bytes. Compares timestamp
with time when echo comes back to get RTT
Many flavors (e.g. fping) and options
packet length, number of tries, timeout,
separation
Ping localhost (127.0.0.1) first, then gateway IP
address etc.
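
The timestamp-in-the-data trick can be sketched without sending anything. This is an illustration only: it assumes the timestamp is stored as an 8-byte double at the start of the data (real ping implementations use their own layout), and the function names are hypothetical. Sending the packet would need a raw socket and usually root access.

```python
import struct
import time


def internet_checksum(data):
    """16-bit one's-complement checksum used by ICMP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_echo_request(ident, seq, send_time, payload_len=56):
    """Build an ICMP echo request whose data starts with the send timestamp."""
    payload = struct.pack("!d", send_time).ljust(payload_len, b"\x00")
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)  # type 8 = echo request
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload


def rtt_ms(echoed_packet, receive_time):
    """Recover the send timestamp from the echoed data and return the RTT in ms."""
    (send_time,) = struct.unpack("!d", echoed_packet[8:16])
    return (receive_time - send_time) * 1000.0


packet = build_echo_request(0x1234, 1, time.time())
print(len(packet), "bytes, checksum verifies:", internet_checksum(packet) == 0)
```

Because the server echoes the data bytes unchanged, the client needs no per-packet state: the send time travels with the packet.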

24
Ping example

syrup/home ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms

Callouts on the slide: packet size (-s 64), remote host, repeat count (-c 6), RTT (time=), missing seq (icmp_seq=1), summary
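
The summary line of the example can be recomputed from the individual replies. A small sketch using the five RTTs shown above; the function name is illustrative and it assumes at least one reply was received.

```python
def summarize(sent, rtts_ms):
    """Return (loss percentage, min, average, max) from the replies received."""
    loss_pct = 100.0 * (sent - len(rtts_ms)) / sent
    return loss_pct, min(rtts_ms), sum(rtts_ms) / len(rtts_ms), max(rtts_ms)


# RTTs of the five replies in the example; the reply for icmp_seq=1 never came back.
rtts = [641.8, 1072.7, 1447.4, 758.5, 482.1]
loss, lo, avg, hi = summarize(6, rtts)
print(f"{loss:.1f}% loss, min/avg/max = {lo:.1f}/{avg:.1f}/{hi:.1f} ms")
```

One lost packet out of six is 16.7%, which ping reports as 16%.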
25
Traceroute

UDP/ICMP tool to show route packets take from
local to remote host
17cottrell@flora06>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
 4 snv-slac.es.net (134.55.208.30) 1.377 ms
 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms

Callouts on the slide: max hops (-m 20), remote host, probes/hop (-q 1), no response at hop 13 (lost packet or router ignores)
26
Reverse traceroute servers

Reverse traceroute server runs as CGI script in
web server
Allow measurement of route from other end.
Important for asymmetric routes. See e.g.
www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
CAIDA map of reverse traceroute servers
www.caida.org/analysis/routing/reversetrace/

27
Pingroute

Run traceroute, then ping each router n times
helps identify where in route the problems start
to occur
Routers may not respond to pings, or may treat
pings directed at them differently from other
packets

28
Path characterization

Pathchar
sends multiple packets of varying sizes to each
router along route
measures minimum response time
plot min RTT vs packet size to get bandwidth
calculate differences to get individual hop
characteristics
measures for each hop BW, queuing, delay/hop
can take a long time
Pipechar/abing
Also sends back-to-back packets and measures
separation on return
Much faster
Finds bottleneck
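
The "plot min RTT vs packet size" step can be sketched as a straight-line fit. This assumes the reply packets have a fixed size, so that the slope to a given hop is the sum of the per-byte serialization times of the links up to it; the data and function names are hypothetical.

```python
def fit_slope(sizes_bytes, min_rtts_s):
    """Least-squares slope (seconds per byte) of minimum RTT against packet size."""
    n = len(sizes_bytes)
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(min_rtts_s) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_bytes, min_rtts_s))
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    return sxy / sxx


def hop_bandwidth_bps(slope_to_hop, slope_to_previous_hop=0.0):
    """Bandwidth of one hop from the difference between consecutive slopes."""
    return 8.0 / (slope_to_hop - slope_to_previous_hop)


# Hypothetical minimum RTTs: 1 ms fixed latency plus serialization on a 10 Mbits/s link.
sizes = [64, 256, 512, 1024, 1500]
min_rtts = [0.001 + size * 8 / 10e6 for size in sizes]
print(f"{hop_bandwidth_bps(fit_slope(sizes, min_rtts)) / 1e6:.2f} Mbits/s")
```

Taking the minimum RTT per size filters out queuing delay, which is why many probes per size are needed and why the method can take a long time.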

[Figure: back-to-back packets passing through a bottleneck; the minimum spacing is set at the bottleneck and preserved on higher speed links]
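
The packet pair idea reduces to one division: the bottleneck cannot emit two packets closer together than the time it takes to serialize one of them. A minimal sketch with hypothetical spacings, assuming later links do not squeeze the pair back together.

```python
def bottleneck_bps(packet_bytes, spacings_s):
    """Estimate bottleneck bandwidth from back-to-back packet spacings.

    Cross traffic can only stretch the gap set at the bottleneck, so the
    smallest observed spacing is taken as the least disturbed sample.
    """
    return packet_bytes * 8 / min(spacings_s)


# Hypothetical spacings of 1500-byte pairs; cross traffic stretched some of them.
spacings = [0.00125, 0.00120, 0.00180, 0.00121]
print(f"{bottleneck_bps(1500, spacings) / 1e6:.0f} Mbits/s")
```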
29
Network throughput

Iperf
Client generates & sends UDP or TCP packets
Server receives packets
Can select port, maximum window size,
duration, Mbytes to send etc.
Client/server communicate packets seen etc.
Reports on throughput
Requires server to be installed at remote site,
i.e. friendly administrators or logon account and
password

30
Iperf example

25cottrell@flora06>iperf -p 5008 -w 512K -P 3 -c sunstats.cern.ch
------------------------------------------------------------
Client connecting to sunstats.cern.ch, TCP port 5008
TCP window size: 512 KByte
------------------------------------------------------------
[ 6] local 134.79.16.101 port 57582 connected with 192.65.185.20 port 5008
[ 5] local 134.79.16.101 port 57581 connected with 192.65.185.20 port 5008
[ 4] local 134.79.16.101 port 57580 connected with 192.65.185.20 port 5008
[ID] Interval       Transfer     Bandwidth
[ 4] 0.0-10.3 sec   19.6 MBytes  15.3 Mbits/sec
[ 5] 0.0-10.3 sec   19.6 MBytes  15.3 Mbits/sec
[ 6] 0.0-10.3 sec   19.7 MBytes  15.3 Mbits/sec
Total throughput = 3 * 15.3 Mbits/s = 45.9 Mbits/s

Callouts on the slide: 3 parallel streams (-P 3), TCP port (-p 5008), max window size (-w 512K), remote host (-c)
31
Active Measurement Projects

PingER running at NIIT
AMP coming soon to NIIT
One way delay
Surveyor (now defunct), RIPE (mainly Europe),
owamp
IEPM-BW running at NIIT
NIMI (mainly a design infrastructure)
NWS (mainly for forecasting)
Skitter
All projects measure routes
For a detailed comparison see
www.slac.stanford.edu/comp/net/wan-mon/iepm-cf.html
www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html

32
AMP

http://amp.nlanr.net/AMP/
AMP uses dedicated PCs as monitors, 150 (June,
2005)
Today mainly does pings
Oriented to Internet 2, 10 countries
Does mainly full mesh pinging
Being re-written to provide support for more
probes

33
PingER

Measure the network performance for developing
regions
From developed to developing & vice versa
Between developing regions & within developing
regions
Use simple tool (PingER/ping)
Ping installed on all modern hosts, low traffic
interference
21 pings each 30 mins to remote hosts (< 100bits/s
average)
Provides very useful measures
Originated in High Energy Physics, now focused on
Digital Divide (DD)
Persistent (data goes back to 1995), interesting
history
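
The low load quoted for the 21 pings can be checked with simple arithmetic. The packet sizes below are an assumption (the slide gives only the count and the interval), and IP/ICMP headers and the echoed replies are ignored.

```python
def average_bits_per_second(packet_sizes_bytes, interval_s):
    """Average one-way load of a probe set that repeats every interval_s seconds."""
    return sum(packet_sizes_bytes) * 8 / interval_s


# Assumed mix (not stated on the slide): 11 pings of 100 bytes and 10 of 1000 bytes.
probe_set = [100] * 11 + [1000] * 10
print(f"{average_bits_per_second(probe_set, 30 * 60):.1f} bits/s per remote host")
```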

[Map: PingER coverage Feb 2005, showing monitoring sites and remote sites]
34
Examples: World View
C. Asia, Russia, S.E. Europe, L. America, M.
East, China: 4-5 yrs behind
India, Africa: 7 yrs behind
S.E. Europe, Russia: catching up
Latin Am., Mid East, China: keeping up
India, Africa: falling behind
Important for policy makers
Many institutes in developing world have less
performance than a household in N. America or
Europe
35
Losses

US residential Broadband users have better access
than sites in many regions

36
Loss to Africa (example of variability)
From PingER project
37
Compare with TAI

UN Technology Achievement Index (TAI)
Measures creation & diffusion of technology and
building human skills

Note how bad Africa is
38
E2E Troubleshooting

Solving the E2E performance problem is the
critical problem for the user
Improve e2e throughput for data intensive apps in
high-speed WANs
Provide ability to do performance analysis &
fault detection in Grid computing environment
Provide accurate, detailed, adaptive monitoring
of all distributed components including the
network

39
Anatomy of a Problem
Hey, this is not working right!
Others are getting in ok
Not our problem
Applications Developer
Applications Developer
The computer is working OK
Looks fine
All the lights are green
How do you solve a problem along a path?
We don't see anything wrong
The network is lightly loaded
From an Internet2 E2E presentation by Russ Hobby
40
Needs

Measurement tools to quickly, accurately and
automatically identify problems
Automatically take action to investigate and
gather information, on-demand measurements
Standard ways to discover, request and report
results of measurements, for applications
GGF/NMWG schemas
Share information with people and apps across a
federation of measurement infrastructures

41
Trouble shooting

Ping to localhost, ping to gateway & to remote
host
Use IP address to avoid nameserver problems
Look for connectivity, loss & RTT
May need to run for a long time to see some
pathologies (e.g. bursty loss due to DSL loss of
sync)
Use synack or sting if ICMP blocked
Traceroute to remote host
Reverse traceroute from remote host to you
Ping routers along route
Look at history plots (PingER, AMP), when did
problem start, how big an effect is it?

42
Trouble shooting

Try user application
Iperf to test throughput

43
Where is a host?

Name server lookup to find hostname given IP
address
47cottrell@netflow>nslookup 210.56.16.10
Server: localhost
Address: 127.0.0.1
Name: lhr.comsats.net.pk
Address: 210.56.16.10
Triangulate position based on RTT measurements
made to unknown host from several hosts at known
locations.
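
Each RTT measurement puts an upper bound on how far away the unknown host can be, so every landmark draws a circle and the host lies where the circles overlap. A rough sketch, assuming signals travel at about two thirds of the speed of light in fibre (roughly 200 km per ms); the landmark names and RTTs are hypothetical.

```python
FIBRE_KM_PER_MS = 200.0  # assumed signal speed in fibre, roughly 2/3 of c


def max_distance_km(min_rtt_ms):
    """Upper bound on the distance to a host, given the minimum RTT to it."""
    one_way_ms = min_rtt_ms / 2.0
    return one_way_ms * FIBRE_KM_PER_MS


# Hypothetical minimum RTTs (ms) from three landmark hosts to the unknown host.
for landmark, rtt in {"landmark-a": 10.0, "landmark-b": 40.0, "landmark-c": 75.0}.items():
    print(f"within {max_distance_km(rtt):.0f} km of {landmark}")
```

Real paths are longer than the great-circle distance and include queuing and router delays, so the true distance is usually well inside the bound.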

44
Where is a host

Do a Google search on IP address to location,
e.g.
http://www.geobytes.com/IpLocator.htm

45
Hi-perf Challenges

Packet loss hard to measure by ping
For 10% accuracy on BER 1/10^8, 1 day at 1/sec
Ping loss ≠ TCP loss
Iperf/GridFTP throughput at 10Gbits/s
To measure stable (congestion avoidance) state
for 90% of test takes 60 secs, about 75GBytes
Requires scheduling, which implies authentication etc.
Using packet pair dispersion can use only few
tens or hundreds of packets, however
Timing granularity in host is hard (sub µsec)
NICs may buffer (e.g. coalesce interrupts, or TCP
offload) so need info from NIC or before
Security: blocked ports, firewalls, keys vs. one
time passwords, varying policies etc.
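
The data volume quoted for a 10Gbits/s throughput test follows directly from rate times duration. A one-function check; the function name is illustrative.

```python
def transfer_volume_gbytes(rate_gbits_per_s, duration_s):
    """Data moved by a test that runs at the given rate for the given duration."""
    return rate_gbits_per_s * duration_s / 8  # 8 bits per byte


print(transfer_volume_gbytes(10, 60), "GBytes in a 60 sec test at 10 Gbits/s")
```

Moving that much data per measurement is why such tests have to be scheduled rather than run continuously.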

46
Dedicated Optical Circuits

Could be whole new playing field, today's tools
no longer applicable
No jitter (so packet pair dispersion no use)
Instrumented TCP stacks a la Web100 may not be
relevant
Layer 1 & 2 switches make traceroute less useful
Losses so low, ping not viable to measure
High speeds make some current techniques fail or
more difficult (timing, amounts of data etc.)

47
More Information

Tutorial on monitoring
www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
RFC 2151 on Internet tools
www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
Network monitoring tools
www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
Ping
http://www.ping127001.com/pingpage.htm
IEPM/PingER home site
www-iepm.slac.stanford.edu/
IEEE Communications, May 2000, Vol 38, No 5, pp
130-136

48
Simplified SLAC DMZ Network, 2001

[Diagram labels: Dial up, ISDN; 2.4Gbps OC48 link (**); NTON; rtr-msfc-dmz; 155Mbps OC3 link (*); Stanford; Swh-dmz; ESnet; Internet2; slac-rt1.es.net; OC12 link 622Mbps; swh-root; Etherchannel 4 gbps; SLAC Internal Network. Legend: 1Gbps Ethernet, 100Mbps Ethernet, 10Mbps Ethernet]
(*) Upgrade to OC12 has been requested
(**) This link will be replaced with a OC48 POS card for the 6500 when available
49
Flow lengths

Distribution of netflow lengths for SLAC border
Log-log plots, linear trendline means power law
Netflow ties off flows after 30 minutes
TCP, UDP & ICMP flows are log-log linear for
longer (hundreds to 1500 seconds) flows
(heavy-tails)
There are some peaks in TCP distributions,
timeouts?
Web server CGI script timeouts (300s), TCP
connection establishment (default 75s), TIME_WAIT
(default 240s), tcp_fin_wait (default 675s)
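
A straight line on a log-log plot corresponds to a power law, count = a * length^(-b), so the fit needs only two parameters. A minimal sketch of such a fit by least squares on the logarithms; the histogram values are synthetic, not SLAC data.

```python
import math


def fit_power_law(lengths_s, counts):
    """Fit counts = a * length**(-b) by least squares in log-log space; return (a, b)."""
    xs = [math.log10(v) for v in lengths_s]
    ys = [math.log10(v) for v in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, -slope


# Synthetic histogram: number of flows observed at each flow length (seconds).
lengths = [100, 200, 400, 800, 1600]
counts = [5000 * length ** -1.5 for length in lengths]
a, b = fit_power_law(lengths, counts)
print(f"count ~ {a:.0f} * length^-{b:.2f}")
```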

[Plots: flow length distributions for ICMP, TCP and UDP]
50
Traceroute technical details

Rough traceroute algorithm
ttl = 1          (To 1st router)
port = 33434     (Starting UDP port)
while we haven't got UDP port unreachable
    send UDP packet to host:port with ttl
    get response
    if time exceeded, note roundtrip time
    else if UDP port unreachable
        quit
    print output
    ttl++, port++
Can appear as a port scan
SLAC gets about one complaint every 2 weeks.
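
The rough algorithm above can be exercised without a network by passing in a function that plays the part of the routers. This is a simulation sketch only: the probe function, the reply names and the three-hop path are hypothetical, and a real implementation would send UDP packets and read ICMP replies from a raw socket.

```python
def traceroute(probe, max_hops=30, first_port=33434):
    """Walk the TTL upward until the destination answers 'port unreachable'.

    probe(ttl, port) stands in for sending one UDP packet and waiting for the
    ICMP reply; it returns (reply_kind, router_address, rtt_ms).
    """
    hops = []
    port = first_port
    for ttl in range(1, max_hops + 1):
        kind, address, rtt_ms = probe(ttl, port)
        hops.append((ttl, address, rtt_ms))
        if kind == "port_unreachable":  # the probe reached the destination host
            break
        port += 1  # a new port per probe, which is why it can resemble a port scan
    return hops


# Hypothetical three-hop path used in place of a real network.
PATH = [("192.0.2.1", 0.6), ("198.51.100.1", 1.4), ("203.0.113.9", 75.5)]


def fake_probe(ttl, port):
    address, rtt_ms = PATH[ttl - 1]
    kind = "port_unreachable" if ttl == len(PATH) else "time_exceeded"
    return kind, address, rtt_ms


for ttl, address, rtt_ms in traceroute(fake_probe):
    print(ttl, address, f"{rtt_ms} ms")
```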

51
Time series
[Plots: incoming and outgoing time series for UDP and TCP; annotation: Cat 4000 802.1q vs. ISL]
52
Power law fit parameters by time
Just 2 parameters provide a reasonable
description of the flow size distributions
53
Not your normal Internet site
Ames IXP: approximately 60-65% was HTTP, about
13% was NNTP
Uwisc: 34% HTTP, 24% FTP, 13% Napster
54
PingER cont.

Monitor timestamps and sends ping to remote site
at regular intervals (typically about every 30
minutes)
Remote site echoes the ping back
Monitor notes current and send time and gets RTT
Discussing installing monitor site in Pakistan
provide real experience of using techniques
get real measurements to set expectations,
identify problem areas, make recommendations
provide access to data for developing new
analysis techniques, for statisticians etc.

55
PingER

Measurements from
38 monitors in 14 countries
Over 600 remote hosts
Over 120 countries
Over 3300 monitor-remote site pairs
Measurements go back to Jan-95
Reports on RTT, loss, reachability, jitter,
reorders, duplicates
Uses ubiquitous ping facility of TCP/IP
Countries monitored
Contain over 80% of world population
99% of online users of Internet

56
Surveyor, RIPE, NIMI

Surveyor & RIPE use dedicated PCs with GPS clocks
for synchronization
Measure 1 way delays and losses
Surveyor mainly for Internet 2
RIPE mainly for European ISPs
NIMI (National Internet Measurement
Infrastructure) more of an infrastructure for
measurements and some tools (i.e. currently does
not have publicly available data, regularly updated)
Mainly full mesh measurements on demand

57
Skitter

Makes ping & route measurements to tens of
thousands of sites around the world. Site
selection varies based on web site hits.
Provides loss & RTTs
Skitter & PingER are main 2 sites to monitor
developing world.

58
Where is a host cont.

Find the Autonomous System (AS) administering
Use reverse traceroute server with AS
identification, e.g.
www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
14 lhr.comsats.net.pk (210.56.16.10) AS7590 -
COMSATS 711 ms (ttl=242)
Get contacts for ISPs (if know ISP or AS)
http://puck.nether.net/netops/nocs.cgi
Gives ISP name, web page, phone number, email,
hours etc.
Review list of AS's ordered by Upstream AS
Adjacency
www.telstra.net/ops/bgp/bgp-as-upsstm.txt
Tells what AS is upstream of an ISP
Look at real-time information about the global
routing system from the perspectives of several
different locations around the Internet
Use route views at www.antc.uoregon.edu/route-views/
Triangulate RTT measurements to unknown host from
multiple places