Title: Internet Monitoring - Results
1Internet Monitoring - Results
- Les Cottrell SLAC
- <cottrell@slac.stanford.edu>
- Presented at the ICFA Meeting, CERN, Mar 1998
- Partially funded by MICS joint SLAC/LBL proposal
on Internet End-to-end Performance Monitoring
(IEPM)
2Outline of Talk
- What, why and how are we (ESnet/HENP community) measuring?
- What PingER measurement reports are available and what do they show?
- (short), intermediate and long term
- grouping and multi-site visualization
- Traffic volume and Traceroute measurements
- Summary
- Deployment/development, Internet Performance, Next Steps
- Collaborations
- NIMI/IPWT
3Why go to the effort?
- Apparent quality of Internet getting worse as size and demands increase
- Internet woefully under-measured and under-instrumented
- Internet very diverse - no single path typical
- Users need
- realistic expectations, planning information
- guidelines for setting and validating SLAs
- information to help in identifying problems
- help to decide where to apply resources
4Importance of Response Time
- Time is scarcest and most valuable commodity
- Studies in late 70s and early 80s showed the economic value of Rapid Response Time
- 0-0.4s High productivity interactive response
- 0.4-2s Fully interactive regime
- 2-12s Sporadically interactive regime
- 12s-600s Break in contact regime
- >600s Batch regime
- Threshold around 4-5s, complaints increase rapidly.
- Voice has threshold around 100ms
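The response time regimes listed on this slide can be expressed as a small lookup. This is only a sketch: the function name and the treatment of values that fall exactly on a boundary are assumptions, not something the slide specifies.

```python
# Upper bounds in seconds for each regime, taken from the slide.
# Treating each upper bound as inclusive is an assumption.
REGIMES = [
    (0.4, "high productivity interactive response"),
    (2.0, "fully interactive"),
    (12.0, "sporadically interactive"),
    (600.0, "break in contact"),
]


def response_regime(seconds: float) -> str:
    """Return the regime name for a response time given in seconds."""
    if seconds < 0:
        raise ValueError("response time cannot be negative")
    for upper_bound, name in REGIMES:
        if seconds <= upper_bound:
            return name
    return "batch"
```

For example, a 5 second response falls in the sporadically interactive regime, which is also where the slide places the 4-5s complaint threshold.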
5Perception of Poor Packet Loss
- Above 4-6% packet loss video conferencing becomes irritating, and non-native language speakers become unable to communicate.
- The occurrence of long delays of 4 seconds or more at a frequency of 4-5% or more is also irritating for interactive activities such as telnet and X windows.
- Above 10-12% packet loss there is an unacceptable level of back-to-back loss of packets and extremely long timeouts, connections start to get broken, and video conferencing is unusable.
6Our Main Metric is Ping
- Universally available, easy to understand
- no software for clients to install
- Low network impact
- Provides useful real world measures of loss,
response time, reachability, unpredictability
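As an illustration of how loss, response time and reachability can be derived from one batch of pings, here is a minimal sketch. The input format (a list of round trip times in ms, with `None` for a lost packet) and the function name are assumptions made for this example; they are not the PingER data format. Unpredictability is left out because the slide does not define how it is computed.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of pings; None marks a packet with no reply."""
    if not rtts_ms:
        raise ValueError("need at least one ping sample")
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    return {
        # Percentage of packets that got no reply.
        "loss_percent": 100.0 * (sent - len(replies)) / sent,
        # Median is used so a single slow reply does not dominate.
        "median_rtt_ms": median(replies) if replies else None,
        # Reachable if at least one packet was answered.
        "reachable": bool(replies),
    }
```

A batch such as `[100.0, None, 120.0, 110.0]` gives 25% loss, a median response of 110 ms, and a reachable host.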
7Ping Response vs Web Response 1/2
8Ping Response vs Web Response 2/2
9Ranked packet loss for 3 months
Sites shown: Stanford, Rome, UK, Cincinnati
10Sawtooth Effect
2 capacity ( 2Mbps)
Added 45 Mbps (quadrupled capacity)
3 capacity 9 Mbps
Holidays
11RAL Last 180 Days plot
Lines are simply cubic spline fits to aid the eye.
Upper green and black points are response time in ms.
Red and blue are weekday loss.
Cyan are weekend loss.
Note weekend/weekday differences (cyan vs blue).
Note Xmas/New Year lull.
Also note quick onset of saturation at end of August/September.
12Italian sites look similar to each other
13Representative International HENP Site Loss
Jan-95 thru Nov-97
- Note RL (UK) saw-tooths as UK-US bandwidth is added (Apr-96, Feb-97, Aug-97)
14Aggregation
- Group measurements, for example
- by area (e.g. N. America E, N. America E, W. Europe/Japan, others, by country)
- trans-oceanic links, intercontinental links
- separation e.g. number of hops, time zones crossed, IXPs crossed
- ISP (ESnet, vBNS/I2, ...)
- by monitoring site
- one site seen from multiple sites
- common interest/affiliation (XIWT, HENP)
- user selectable
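The grouping idea can be sketched as a "group by" over per-link measurements. The record fields (`monitor`, `remote_tld`, `loss_percent`) and the sample values below are hypothetical and chosen only for illustration; the choice of the median as the group summary is also an assumption.

```python
from collections import defaultdict
from statistics import median


def median_loss_by_group(measurements, key):
    """Group per-link records by the given field and return median loss per group."""
    groups = defaultdict(list)
    for record in measurements:
        groups[record[key]].append(record["loss_percent"])
    return {group: median(losses) for group, losses in groups.items()}


# Hypothetical sample records, not measured data.
sample = [
    {"monitor": "SLAC", "remote_tld": "ch", "loss_percent": 2.0},
    {"monitor": "SLAC", "remote_tld": "uk", "loss_percent": 6.0},
    {"monitor": "CERN", "remote_tld": "uk", "loss_percent": 4.0},
    {"monitor": "CERN", "remote_tld": "ch", "loss_percent": 0.0},
]

by_tld = median_loss_by_group(sample, "remote_tld")
by_monitor = median_loss_by_group(sample, "monitor")
```

Passing a different `key` switches between grouping by monitoring site and grouping by remote domain without changing the aggregation itself.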
15Group Selection (all sites monitoring CERN)
Select one of these groups
Sites shown: CMU, CNAF, RL, FNAL, SLAC, DESY, Carleton, RMKI, CERN, KEK
16Group Response Time Jan-95 Nov-97
- Improved between 1% and 2.5% / month
- Response and Loss similar improvements
- care with new sites
17Network Quiescence
- Frequency of zero packet loss (for all time - not
cut on prime time)
18Ping Loss Quality
- Want quick to grasp indicator of link quality
- Loss is the most sensitive indicator
- loss of packet requires 4 sec TCP retry timeout
- Studies on economic value of response time by IBM showed there is a threshold around 4-5 secs where complaints increase.
- 0-1% Good, 1-2.5% Acceptable
- 2.5-5% Poor, 5-12% Very Poor
- > 12% Bad
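The loss quality bands on this slide map directly onto a small classifier. This sketch assumes the thresholds are percentages of packets lost and that each upper bound is inclusive; the function name is illustrative.

```python
def loss_quality(loss_percent: float) -> str:
    """Map a packet loss percentage to the quality label used on the slide."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss must be a percentage between 0 and 100")
    if loss_percent <= 1:
        return "Good"
    if loss_percent <= 2.5:
        return "Acceptable"
    if loss_percent <= 5:
        return "Poor"
    if loss_percent <= 12:
        return "Very Poor"
    return "Bad"
```

For example, `loss_quality(3.0)` returns `"Poor"` and `loss_quality(15.0)` returns `"Bad"`.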
19Quality Distributions
- ESnet median good quality
- All other groups poor or very poor
- Critical to have good peering
20Multi Collection Site Visualization
Collection Sites
Remote Sites
21Intercontinental Grouping (Loss)
- Move mouse over ? to see links
Looks pretty bad for intercontinental use
22Top Level Domain Grouping (Loss)
Mouseover red dots gives more information on TLD (e.g. ch = Switzerland)
Diagonals are within TLD
23TLD (Response Time)
24Grouping Details
Select metric
Select group
Sort
Color for quality
Also provides Excel for DIY at bottom
25Recent Transoceanic trends
26By Monitoring Site
27CERN Monitoring TLDs
28ESnet bytes accepted by site for Jan 98
Exchanges
LBL/ESnet
29US HENP Traffic Growth
Exponential growth of 3-6% per month
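To put the growth figure in perspective, a compound monthly growth rate can be converted to a doubling time. The sketch below assumes the 3-6 figure is a percentage per month, as the later summary slide states; the function name is illustrative.

```python
import math


def doubling_time_months(growth_percent_per_month: float) -> float:
    """Months needed for traffic to double at a fixed compound monthly growth rate."""
    if growth_percent_per_month <= 0:
        raise ValueError("growth rate must be positive")
    rate = growth_percent_per_month / 100.0
    # Solve (1 + rate) ** months == 2 for months.
    return math.log(2) / math.log(1 + rate)


slow = doubling_time_months(3)  # roughly 23.4 months
fast = doubling_time_months(6)  # roughly 11.9 months
```

Under that assumption, traffic doubles roughly every one to two years.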
30Multi Router Traffic Grapher (MRTG)
CERN-US E1(2Mbps) link
Added 2nd 2Mbps link
31Traffic Volume for Germany (DFN)
DFN T1 Utilization for 15 Jan 98 (5 min averages): green is to US, blue is from US
Samples of 2 min periods in Dec-96 with peak utilization > y, shown for To US and From US
32Capacity/Load Ratios
- Looking at the link capacity/average load
- Most ESnet links show ratios of a few to several tens
- The international links (CERN-Perryman (4), DFN (5), Italy (4), KEK (10), Canada (15)) show ratios of 4-15
- The worst link appears to be the MAE-W-ESnet link at about 1.5 ratio
- However this may not be the bottleneck link
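The capacity/load ratio used on this slide is simply link capacity divided by average load. A minimal sketch follows; the function name and the example load value are assumptions made for illustration, not measured figures.

```python
def capacity_load_ratio(capacity_mbps: float, average_load_mbps: float) -> float:
    """Return link capacity divided by average load (higher means more headroom)."""
    if capacity_mbps <= 0 or average_load_mbps <= 0:
        raise ValueError("capacity and load must both be positive")
    return capacity_mbps / average_load_mbps


# Hypothetical example: a 2 Mbps link carrying 0.5 Mbps on average.
example_ratio = capacity_load_ratio(2.0, 0.5)
```

A ratio near 1 means the link is close to saturated on average, so short peaks will exceed capacity well before the average does.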
33Bottlenecks
- Identification
- Traceroute
- from/to multiple sites can identify common path segments in the maps
- Can see onset of losses with traceping
- Pathchar can identify bottlenecks
- Then need to work on
- avoiding bottlenecks (new peering)
- getting bottleneck owners to improve
- this is difficult, lots of potential bottlenecks,
bottlenecks move, not under our control
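The idea of finding common path segments from traceroutes taken at multiple sites can be sketched as counting how many routes use each hop-to-hop link. The route names and hop labels below are hypothetical, and this is an illustration of the idea rather than the method used by the tools named on the slide.

```python
from collections import Counter


def shared_links(routes):
    """Count, for each hop-to-hop link, how many routes traverse it."""
    counts = Counter()
    for hops in routes.values():
        # A set is used so a link is counted once per route even if it repeats.
        counts.update(set(zip(hops, hops[1:])))
    return counts


def common_segments(routes, min_routes=2):
    """Return links used by at least min_routes of the given routes, sorted."""
    return sorted(link for link, n in shared_links(routes).items() if n >= min_routes)


# Hypothetical traceroute results from three monitoring sites to one target.
routes = {
    "siteA": ["a1", "ix1", "r1", "target"],
    "siteB": ["b1", "ix1", "r1", "target"],
    "siteC": ["c1", "c2", "target"],
}

common = common_segments(routes)
```

Links shared by many routes are the natural candidates to check first when several sites see losses toward the same destination.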
34TracePing (Oxford)
Multiple routes seen
35Traceroute
- Reverse traceroute servers
- Traceping
- TopologyMap
- Ellipses show node on route
- Open ellipse is measurement node
- Blue ellipse is not reachable
- Keeping history
From TRIUMF
36GUI Traceroute (e.g. VisualRoute)
37Summary
- Deployment and Development
- ESnet/HENP has 14 Collection sites in 8 countries collecting data on > 500 links involving 22 countries
- XIWT/IPWT deployed 10 collection sites using PingER tools
- 600MB/month/link, 6 bps/link, .25 FTE @ analysis site, 1.5-2.5 FTE on analysis
- HEPNRC gathering, archiving
- Long term reports being ported to HEPNRC from SLAC
- Long term analysis today usually requires a tool like SAS
38Summary
- Deployment and Development
- Internet Performance
- Performance within ESnet is good
- Performance between ESnet and other sites is poor to very poor on average
- one of the main causes is congestion points, so peering is critical
- Intercontinental performance is very poor to bad
- ESnet traffic accepted from major HENP labs growing by 3-6% per month
- Response time improving by 1-2% / month
- Packet loss improving between SLAC and other sites by 3% / month
39Summary
- Deployment and Development
- Internet Performance (continued)
- Links to sites outside N. America vary from good (KEK) to bad
- Some of the bad sites are to be expected, e.g. FSU, China, Czech Republic, some surprises such as UK
- CERN, France, Germany acceptable to poor
40Summary
- Deployment and Development
- Internet Performance
- Next Steps
- Improve tools
- Make long term reports at Analysis site available and understandable
- Look into prediction (extrapolations, develop models, configure and validate with data)
- Pursue IETF Surveyor and NIMI deployment
41National Internet Measurement Infrastructure
(NIMI)
- Secure, scalable infrastructure for scheduling monitoring, gathering data
- Minimal amount of human intervention
- Inexpensive probe built on PC FreeBSD platform
- Dynamic - can add/modify measurement suites, initially includes
- Traceroute
- TReno - measures bulk transfer throughput
- Poip - one way ping
42Asymmetric One-way Delays
Plot panels: U Chicago to Advanced, and Advanced to U Chicago
Axes: Loss (0 to 20), Delay (0ms to 300ms), horizontal axis 0 to 24
43NIMI
- Deployed at PSC, LBL, FNAL, platforms being configured at SLAC and CERN
- As NIMI becomes more real will start to use as infrastructure for IPPM Surveyors
- Security
- allows full policy control over any box you own or delegation of all or subsets
- uses ACLs with authentication for requests, and encryption to prevent sniffing
44Summary
- Deployment and Development
- Internet Performance
- Next Steps
- Lots of collaboration
- SLAC and HEPNRC
- 14 collection sites, 400 remote sites
- Collection site tools: CERN and CNAF/ICFA
- Oxford/TracePing
- MapPing/MAPNet/NLANR
- TRIUMF Traceroute topology Map
- NIMI/LBNL and Surveyor/IETF
- XIWT/IPWT
- Talks at IETF, XIWT, ICFA, ESCC ...
45More Information
- ICFA Monitoring WG home page (links to status report, meeting notes, how to access data, and code)
- http://www.slac.stanford.edu/xorg/icfa/ntf/home.html
- WAN Monitoring at SLAC has lots of links
- http://www.slac.stanford.edu/comp/net/wan-mon.html
- Tutorial on WAN Monitoring
- http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
- MapPing Tool
- http://www.slac.stanford.edu/warrenm/work/java/newjava/mapping.html
- NIMI http://www.psc.edu/mahdavi/nimi_paper/NIMI.html