A fine-grained view of high-performance networking

Transcript and Presenter's Notes

1
A fine-grained view of high-performance networking
Stephen Casner, Cengiz Alaettinoglu, Chia-Chee Kuan
NANOG 22, May 2001
2
What this talk is about
  • Measurements on a tier 1 US backbone
  • jitter on test traffic
  • routing protocol packet traces
  • Analysis of anomalies we found
  • Claim: backbones can support delay-critical services
  • jitter determines volume and latency
  • some problems need to be fixed

3
What this talk is not about
  • Which vendor has more or fewer bugs
  • Which ISP provides better or worse service

This is a collaboration. We appreciate the
assistance of the ISP and vendor in investigating
the unusual events we found.
4
State of the net
  • Backbones perform very well
  • For several weeks, we found 99.99% availability
    and jitter < 1 ms for 99.99% of packets sent
  • TCP tolerates the occasional delays
  • Routing strained but has adapted to growth
  • Operators have a good macro view of this state
  • link uptime
  • ping latency
  • router CPU utilization

5
Going from four 9s to five 9s
  • Want tighter SLAs for VoIP, Virtual Wire, VPNs
  • Need to understand what really happens
  • on a fine timescale
  • over long periods for rare events

6
Jitter Measurement
  • Installed test hosts in POPs at SF and DC
  • All services except ssh disabled for security
  • Connected directly to core routers
  • OC-48 links between POPs
  • Continuous 1 Mb/s test traffic (sketched below)
  • Uniform random length over [64, 1500] bytes
  • Exponential random interval (6 ms mean)
  • Data collected for 15 periods of 5-7 days each
  • Data retrieved over the net (takes 24 hours!)
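A minimal sketch of a test traffic generator matching the list above, assuming a UDP sender; the destination address, port, and payload layout are illustrative, not details of the actual tool.

import random
import socket
import struct
import time

DEST = ("198.51.100.10", 9000)   # hypothetical far-end test host and port
MEAN_INTERVAL = 0.006            # exponential inter-packet gap, 6 ms mean
MIN_LEN, MAX_LEN = 64, 1500      # uniform random packet length in bytes

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
seqnum = 0
while True:
    length = random.randint(MIN_LEN, MAX_LEN)
    header = struct.pack("!Id", seqnum, time.time())  # seqnum + Tx timestamp
    payload = header + bytes(length - len(header))    # zero-pad to chosen length
    sock.sendto(payload, DEST)
    seqnum += 1
    time.sleep(random.expovariate(1.0 / MEAN_INTERVAL))

At a mean size of roughly 780 bytes every 6 ms this averages out to about 1 Mb/s, consistent with the rate quoted above.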

7
20 µs accuracy timestamping
[Packet format: IP | UDP | seqnum | Tx stamp | Rx stamp | data]
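A receiver-side sketch matching the probe format above, assuming the same hypothetical port and field layout as the sender sketch. Note that a plain userspace timestamp will not reach 20 µs accuracy; kernel or hardware timestamping (e.g. SO_TIMESTAMP) would be needed in practice.

import socket
import struct
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))      # same hypothetical port as the sender sketch
records = []
while True:
    payload, addr = sock.recvfrom(2048)
    rx_stamp = time.time()        # userspace stamp; see accuracy note above
    seqnum, tx_stamp = struct.unpack_from("!Id", payload)
    records.append((seqnum, tx_stamp, rx_stamp))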
8
Offline Jitter Analysis
  • Threshold filter on interarrival (relative) jitter
    for a quick full-week overview
  • Scan each hour for packet loss and delay shifts
  • For interesting hours, graph absolute jitter and
    zoom in
  • NTP not used because adjustments glitch the clock
  • Jitter analysis tool removes effects of clock
    skew and length variation (sketched below)
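A minimal sketch of this kind of offline analysis, assuming (seqnum, Tx stamp, Rx stamp) records like those above. The clock-skew removal here is a simple least-squares line fit to one-way delay, which is only one plausible method; compensation for length-dependent serialization delay is omitted.

def analyze(records, threshold=0.001):
    """records: iterable of (seqnum, tx_stamp, rx_stamp); threshold in seconds."""
    records = sorted(records)                      # order by sequence number
    txs = [tx for _, tx, _ in records]
    delays = [rx - tx for _, tx, rx in records]

    # Remove clock skew between the unsynchronized hosts by fitting a straight
    # line to one-way delay vs. transmit time and subtracting the trend.
    n = len(delays)
    mean_t = sum(txs) / n
    mean_d = sum(delays) / n
    slope = sum((t - mean_t) * (d - mean_d) for t, d in zip(txs, delays)) / \
            sum((t - mean_t) ** 2 for t in txs)
    intercept = mean_d - slope * mean_t
    absolute_jitter = [d - (intercept + slope * t) for t, d in zip(txs, delays)]

    # Interarrival (relative) jitter: change in one-way delay between
    # consecutive packets; flag values over the threshold for a closer look.
    relative_jitter = [delays[i] - delays[i - 1] for i in range(1, n)]
    flagged = [i for i, j in enumerate(relative_jitter, start=1)
               if abs(j) > threshold]
    return absolute_jitter, relative_jitter, flagged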

9
99.99% clean
10
99.99% clean
11
Better ARP implementation
  • Do not flush ARP cache entry when timer expires
  • Send ARP request
  • Continue using the ARP cache entry
  • If no ARP response after N retries, flush entry (sketched below)
  • Workaround is permanent ARP entries, or
    gratuitous ARP responses from host if accepted
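A sketch of the refresh-before-flush behavior recommended above; the timer values, retry count, and send/receive hooks are assumptions for illustration, not any vendor's implementation.

import time

REFRESH_TIMEOUT = 1200   # seconds before an entry needs re-validation (assumed)
RETRY_INTERVAL = 1       # seconds between ARP retries (assumed)
MAX_RETRIES = 3          # N retries before the entry is finally flushed

class ArpEntry:
    def __init__(self, mac):
        self.mac = mac
        self.expires = time.time() + REFRESH_TIMEOUT
        self.retries = 0

class ArpCache:
    def __init__(self, send_arp_request):
        self.entries = {}                 # ip -> ArpEntry
        self.send_arp_request = send_arp_request

    def lookup(self, ip):
        entry = self.entries.get(ip)
        if entry is None:
            self.send_arp_request(ip)     # no entry yet: must resolve first
            return None
        if time.time() > entry.expires:   # timer expired: re-ARP, don't flush
            if entry.retries >= MAX_RETRIES:
                del self.entries[ip]      # give up only after N failed retries
                return None
            self.send_arp_request(ip)
            entry.retries += 1
            entry.expires = time.time() + RETRY_INTERVAL
        return entry.mac                  # keep using the entry meanwhile

    def on_arp_reply(self, ip, mac):
        self.entries[ip] = ArpEntry(mac)  # a reply fully refreshes the entry

For simplicity the flush here is driven by lookups rather than a background timer; either structure gives the behavior described on the slide.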

12
Packets with negative delay?
13
Jitter shift due to rerouting
[Graph: delay shifts by 7.6 ms; 40-second span]
14
Constant baseline sawtooth
[Graph: sawtooth of 500 µs amplitude, 2.3-second period]
15
Mostly smooth, except...
16
A very large delay
[Graph: 7-second delay spike; 9-hour span]
17
Rare but significant events
18
Outage followed by flood
19
Severe jitter and misordering
20
Transmit view of blender event
21
Data rate of blender event
[Graph annotations: 1172 packets lost; 1 Mb/s avg rate; 25 Mb/s burst; 14 seconds]
22
Slope shows deceleration
23
Slope shows deceleration
24
Monitor routing along with jitter
[Diagram: test host (running tg and sk) connected by gig-ether to core routers (R) in the IP backbone; tcpdump captures IS-IS hellos to a packet trace file; test host is a passive peer, sends no routes; traceroute every 5 s]
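A sketch of the capture loop in the diagram above, assuming standard tcpdump and traceroute binaries on the test host; the interface name, output files, and traceroute target are illustrative. Both commands typically require root privileges.

import subprocess
import time

IFACE = "eth1"                    # hypothetical gig-ether interface to the router
TRACE_TARGET = "198.51.100.10"    # hypothetical far-end test host

# Background capture of IS-IS packets to a trace file.
capture = subprocess.Popen(
    ["tcpdump", "-i", IFACE, "-w", "isis-trace.pcap", "isis"])

try:
    with open("traceroute.log", "a") as log:
        while True:
            result = subprocess.run(
                ["traceroute", "-n", TRACE_TARGET],
                capture_output=True, text=True)
            log.write(f"{time.time()}\n{result.stdout}\n")
            log.flush()
            time.sleep(5)          # traceroute every 5 s, as in the diagram
finally:
    capture.terminate()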
25
A recent micro-blender
26
Routing loops cause blenders
[Graph annotations: TTL 16, TTL 30, TTL 60]
27
Why do loops happen?
  • Link-State Routing Protocols 101
  • Detect topology changes
  • Flood link-state packets
  • SPF algorithm to compute routes (sketched below)
  • Route databases consistent within propagation time
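As a reminder of the SPF step named in the list above, a textbook Dijkstra sketch over a link-state database represented as a plain adjacency map; this is the generic algorithm, not a real LSDB encoding or any vendor's implementation.

import heapq

def spf(lsdb, source):
    """lsdb: {router: {neighbor: cost}}; returns shortest-path cost per router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue                          # stale heap entry
        for neighbor, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist

# Example: spf({"A": {"B": 10, "C": 5}, "B": {"A": 10, "C": 2},
#               "C": {"A": 5, "B": 2}}, "A") -> {"A": 0, "C": 5, "B": 7}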

28
Excess churn on lifetime 0 LSPs
Observed Churn
Genuine Churn
Averaged over 100 sec
29
Long LSP propagation times
30
Explanation
  • Route databases are not in sync because
  • Churn rate is high → many LSPs to flood
  • Average rate 6.6 / second (as seen at test host)
  • Peak rate 10 / second (as seen at test host)
  • LSP rate control limits flooding
  • 4 LSPs / second on each backbone link (see the arithmetic below)
  • SPF updates may also be delayed by rate limits
  • Any topology change can result in a loop
  • DC host link appears down due to LSP switching
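A back-of-the-envelope illustration of why the databases fall out of sync, using only the rates quoted above; the 60-second burst length is an assumption chosen for illustration.

ARRIVAL_PEAK = 10.0   # LSPs/second seen at the test host during churn peaks
FLOOD_LIMIT = 4.0     # LSPs/second forwarded on each backbone link
BURST_SECONDS = 60    # assumed length of a churn peak, for illustration only

backlog = (ARRIVAL_PEAK - FLOOD_LIMIT) * BURST_SECONDS
catch_up = backlog / FLOOD_LIMIT
print(f"backlog after burst: {backlog:.0f} LSPs")          # 360 LSPs queued
print(f"extra propagation delay: {catch_up:.0f} seconds")  # ~90 s to drain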

31
Routing loop on another path?
32
A recent week (very boring)
Jitter Measurement Summary for the Week
69 million packets transmitted, zero packets lost,
100% of packets with jitter < 700 µs
33
Experiment conclusions
casner@packetdesign.com
  • Backbone baseline jitter is < 1 ms
  • Congestion is not the problem we need to solve!
  • Many events > 1 ms can be eliminated
  • ISPs building with 1/10G ethernet should be
    concerned about ARP cache timeout
  • ISPs need to revisit routing timer settings
  • Operational emergencies led to high timer
    settings
  • Software changes may have eliminated the need
  • Protocol designers: more robust timers
  • See talk from NANOG 20; next talk at a future NANOG