Title: A fine-grained view of high-performance networking
1. A fine-grained view of high-performance networking
Stephen Casner, Cengiz Alaettinoglu, Chia-Chee Kuan
NANOG 22, May 2001
2. What this talk is about
- Measurements on a tier 1 US backbone
- jitter on test traffic
- routing protocol packet traces
- Analysis of anomalies we found
- Claim: the backbone can support delay-critical services
- jitter determines volume and latency
- some problems need to be fixed
3. What this talk is not about
- Which vendor has more or fewer bugs
- Which ISP provides better or worse service
This is a collaboration. We appreciate the assistance of the ISP and the vendor in investigating the unusual events we found.
4. State of the net
- Backbones perform very well
- For several weeks, we found 99.99% availability and jitter < 1 ms for 99.99% of packets sent
- TCP tolerates the occasional delays
- Routing strained but has adapted to growth
- Operators have a good macro view of this state
- link uptime
- ping latency
- router CPU utilization
5. Going from four 9s to five 9s
- Want tighter SLAs for VoIP, Virtual Wire, VPNs
- Need to understand what really happens
- on a fine timescale
- over long periods for rare events
6. Jitter Measurement
- Installed test hosts in POPs at SF and DC
- All services except ssh disabled for security
- Connected directly to core routers
- OC-48 links between POPs
- Continuous 1 Mb/s test traffic (pattern sketched below)
- Uniform random length over [64, 1500] bytes
- Exponential random interval (6 ms mean)
- Data collected for 15 periods of 5-7 days each
- Data retrieved over the net (takes 24 hours!)
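A minimal sketch of the test-traffic pattern described above, assuming a hypothetical receiver address. The real generator used fine-grained kernel timestamping to reach the 20 µs accuracy shown on the next slide, which a plain sleep loop cannot; this only illustrates the statistics.

```python
# Sketch of the test-traffic pattern: uniform random payload length over
# [64, 1500] bytes and exponentially distributed inter-packet gaps with a
# 6 ms mean. Mean size ~782 B * 8 bits / 0.006 s ~= 1.04 Mb/s, matching the
# 1 Mb/s average on the slide. Destination address is a placeholder.
import random
import socket
import time

DEST = ("192.0.2.1", 12345)   # hypothetical receiver address

def send_test_traffic(duration_s: float = 10.0) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    end = time.monotonic() + duration_s
    seq = 0
    while time.monotonic() < end:
        length = random.randint(64, 1500)            # uniform packet length
        payload = seq.to_bytes(4, "big").ljust(length, b"\x00")
        sock.sendto(payload, DEST)
        seq += 1
        time.sleep(random.expovariate(1.0 / 0.006))  # exponential gap, 6 ms mean

if __name__ == "__main__":
    send_test_traffic()
```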
7. 20 µs accuracy timestamping
Packet format: [ IP | UDP | seqnum | Tx stamp | Rx stamp | data ] (packing sketch below)
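One plausible way to pack the measurement payload shown above. The field widths are assumptions; the slide only names the fields (sequence number, transmit stamp, receive stamp, data).

```python
# Assumed layout: 32-bit sequence number plus 64-bit transmit and receive
# timestamps, carried after the IP/UDP headers. The receiver overwrites the
# Rx stamp field when it logs the packet.
import struct

HEADER_FMT = "!Iqq"   # seqnum, tx timestamp (ns), rx timestamp (ns)

def pack_probe(seqnum: int, tx_ns: int, rx_ns: int = 0, data: bytes = b"") -> bytes:
    """Build the UDP payload; the rx timestamp is filled in by the receiver."""
    return struct.pack(HEADER_FMT, seqnum, tx_ns, rx_ns) + data

def unpack_probe(payload: bytes):
    seqnum, tx_ns, rx_ns = struct.unpack_from(HEADER_FMT, payload)
    return seqnum, tx_ns, rx_ns, payload[struct.calcsize(HEADER_FMT):]
```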
8. Offline Jitter Analysis
- Threshold filter on interarrival (relative) jitter for a quick full-week overview
- Scan each hour for packet loss and delay shifts
- For interesting hours, graph absolute jitter and zoom in
- NTP not used because adjustments glitch the clock
- Jitter analysis tool removes effects of clock skew and length variation (see the sketch below)
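A rough sketch of the analysis steps just listed, not the authors' tool. It assumes clock skew can be modeled as a linear trend and that the length effect scales with an assumed OC-48 serialization rate.

```python
# Compute per-packet (rx - tx) deltas, strip a linear clock-skew trend with a
# least-squares fit, and subtract a per-byte serialization term so length
# variation does not masquerade as jitter. Field names and the link-rate
# constant are assumptions.
import numpy as np

LINK_BPS = 2.488e9   # assumed OC-48 bottleneck rate for the length correction

def absolute_jitter(tx_s, rx_s, length_bytes):
    """All inputs are 1-D arrays indexed by sequence number; times in seconds."""
    tx = np.asarray(tx_s, dtype=float)
    rx = np.asarray(rx_s, dtype=float)
    delta = rx - tx                                   # offset + skew + queueing
    skew = np.polyfit(tx, delta, 1)                   # linear clock-skew model
    detrended = delta - np.polyval(skew, tx)
    serialization = 8.0 * np.asarray(length_bytes) / LINK_BPS
    return detrended - serialization                  # jitter relative to trend

def interarrival_jitter(jitter):
    """Relative (packet-to-packet) jitter used for the quick full-week scan."""
    return np.abs(np.diff(jitter))
```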
9. 99.99% clean
10. 99.99% clean
11. Better ARP implementation
- Do not flush the ARP cache entry when its timer expires (refresh logic sketched below)
- Send ARP request
- Continue using the ARP cache entry
- If no ARP response after N retries, flush entry
- Workaround: permanent ARP entries, or gratuitous ARP responses from the host, if accepted
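An illustrative sketch of the refresh behavior argued for above, not any vendor's implementation; the retry count N is an assumed value.

```python
# Keep resolving with the existing entry while re-ARPing in the background,
# and only flush after N unanswered retries.
MAX_RETRIES = 3   # assumed value of N

class ArpEntry:
    def __init__(self, mac):
        self.mac = mac
        self.retries = 0

class ArpCache:
    def __init__(self):
        self.entries = {}            # ip -> ArpEntry

    def on_timer_expired(self, ip, send_arp_request):
        entry = self.entries.get(ip)
        if entry is None:
            return
        if entry.retries >= MAX_RETRIES:
            del self.entries[ip]     # give up only after N silent retries
        else:
            entry.retries += 1
            send_arp_request(ip)     # refresh in the background;
                                     # the entry stays usable meanwhile

    def on_arp_reply(self, ip, mac):
        self.entries[ip] = ArpEntry(mac)   # a reply resets the retry count
```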
12. Packets with negative delay?
13. Jitter shift due to rerouting (figure annotations: 7.6 ms, 40 sec)
14. Constant baseline sawtooth (figure annotations: 500 µs, 2.3 seconds)
15. Mostly smooth, except...
16. A very large delay (figure annotations: 7 seconds!, 9 hours)
17. Rare but significant events
18. Outage followed by flood
19. Severe jitter and misordering
20. Transmit view of blender event
21. Data rate of blender event (figure annotations: 1172 packets lost, 1 Mb/s avg rate, 25 Mb/s, 14 seconds)
22. Slope shows deceleration
23. Slope shows deceleration
24. Monitor routing along with jitter
Diagram: test host (labels: tg, sk) attached via gig-ether to IP backbone routers (R); tcpdump captures IS-IS hellos into a packet trace file; the test host is a passive IS-IS peer and sends no routes; traceroute runs every 5 s. (Monitoring loop sketched below.)
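A rough sketch of the passive monitoring loop from the diagram above, assuming placeholder interface, target, and file names: capture IS-IS packets with tcpdump while running traceroute toward the far-end test host every 5 seconds.

```python
# Requires root for tcpdump. The "isis" pcap filter selects IS-IS packets;
# interface name, target address, and file names are placeholders, not values
# from the experiment.
import subprocess
import time

IFACE = "eth0"
TARGET = "192.0.2.2"

def monitor(duration_s: float = 3600.0) -> None:
    capture = subprocess.Popen(
        ["tcpdump", "-i", IFACE, "-w", "isis-trace.pcap", "isis"]
    )
    try:
        end = time.monotonic() + duration_s
        with open("traceroute.log", "a") as log:
            while time.monotonic() < end:
                out = subprocess.run(
                    ["traceroute", "-n", TARGET],
                    capture_output=True, text=True
                ).stdout
                log.write(f"{time.time():.3f}\n{out}\n")
                time.sleep(5)        # traceroute every 5 s, as on the slide
    finally:
        capture.terminate()

if __name__ == "__main__":
    monitor()
```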
25. A recent micro-blender
26. Routing loops cause blenders (figure annotations: TTL 16, TTL 30, TTL 60)
27. Why do loops happen?
- Link-State Routing Protocols 101
- Detect topology changes
- Flood link-state packets
- SPF algorithm to compute routes (sketched below)
- Route databases are consistent only within the propagation time
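To make the loop argument concrete, here is a minimal SPF (Dijkstra) sketch over a link-state database, with a hypothetical four-router topology showing how two routers holding different database versions can compute next hops that point at each other until flooding catches up.

```python
# Every router runs SPF on its own copy of the link-state database; divergent
# copies can yield mutually pointing next hops, i.e. a transient forwarding loop.
import heapq

def spf(lsdb, source):
    """lsdb: {router: {neighbor: cost}}. Returns {dest: next_hop from source}."""
    dist = {source: 0}
    prev = {source: None}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist[node]:
            continue                     # stale heap entry
        for nbr, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    next_hop = {}
    for dest in dist:                    # walk back to find the first hop
        if dest == source:
            continue
        hop = dest
        while prev[hop] != source:
            hop = prev[hop]
        next_hop[dest] = hop
    return next_hop

# Hypothetical topology: before B's "D link down" LSP reaches A, A still sends
# D-bound traffic to B, while B already routes D-bound traffic back via A.
old = {"A": {"B": 1, "C": 5}, "B": {"A": 1, "D": 1}, "C": {"A": 5, "D": 1}, "D": {"B": 1, "C": 1}}
new = {"A": {"B": 1, "C": 5}, "B": {"A": 1},         "C": {"A": 5, "D": 1}, "D": {"C": 1}}
print(spf(old, "A")["D"])   # A (stale view) forwards toward D via B
print(spf(new, "B")["D"])   # B (fresh view) forwards toward D via A -> loop
```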
28. Excess churn on lifetime 0 LSPs (graph: observed churn vs. genuine churn, averaged over 100 sec)
29. Long LSP propagation times
30. Explanation
- Route databases are not in sync because:
- Churn rate is high → many LSPs to flood
- Average rate 6.6 / second (as seen at test host)
- Peak rate 10 / second (as seen at test host)
- LSP rate control limits flooding
- 4 LSPs / second on each backbone link (see the arithmetic sketch after this list)
- SPF updates may also be delayed by rate limits
- Any topology change can result in a loop
- DC host link appears down due to LSP switching
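A back-of-the-envelope sketch of the mismatch described above, assuming a single paced link and sustained average churn: when LSPs arrive faster than the pacing rate, the flooding backlog and the extra propagation delay grow with time. It ignores bursts, duplicate suppression, and multi-path flooding.

```python
# Rates taken from the slide: ~6.6 LSPs/s average churn vs. a 4 LSPs/s
# per-link pacing limit.
ARRIVAL_RATE = 6.6    # LSPs/s observed at the test host (average)
PACING_RATE = 4.0     # LSPs/s allowed per backbone link

def backlog_after(seconds: float) -> float:
    """LSPs queued behind the pacing limit under sustained average churn."""
    return max(0.0, (ARRIVAL_RATE - PACING_RATE) * seconds)

def extra_propagation_delay(seconds_of_churn: float) -> float:
    """Added flooding delay (s) for an LSP arriving after that much churn."""
    return backlog_after(seconds_of_churn) / PACING_RATE

# e.g. after one minute of average churn: 156 LSPs queued, ~39 s of extra delay
print(backlog_after(60), extra_propagation_delay(60))
```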
31. Routing loop on another path?
32. A recent week (very boring)
Jitter measurement summary for the week: 69 million packets transmitted, zero packets lost, 100% of packets with jitter < 700 µs
33. Experiment conclusions
casner@packetdesign.com
- Backbone baseline jitter is < 1 ms
- Congestion is not the problem we need to solve!
- Many events > 1 ms can be eliminated
- ISPs building with 1/10G Ethernet should be concerned about ARP cache timeout
- ISPs need to revisit routing timer settings
- Operational emergencies led to high timer settings
- Software changes may have eliminated the need
- Protocol designers: more robust timers
- See talk from NANOG 20; next talk at a future NANOG