1
Network diagnostics made easy
  • Matt Mathis
  • 3/17/2005

2
The Wizard Gap
3
The non-experts are falling behind
      Year   Experts    Non-experts   Ratio
      1988   1 Mb/s     300 kb/s      3:1
      1991   10 Mb/s
      1995   100 Mb/s
      1999   1 Gb/s
      2003   10 Gb/s    3 Mb/s        3000:1
      2004   40 Gb/s
  • Why?

4
TCP tuning requires expert knowledge
  • By design TCP/IP hides the net from upper layers
  • TCP/IP provides basic reliable data delivery
  • The hourglass between applications and networks
  • This is a good thing, because it allows
  • Old applications to use new networks
  • New applications to use old networks
  • Invisible recovery from data loss, etc.
  • But then (nearly) all problems have the same
    symptom
  • Less than expected performance
  • The details are hidden from nearly everyone

5
TCP tuning is really debugging
  • Six classes of bugs limit performance
  • Too small TCP retransmission or reassembly
    buffers
  • Packet losses, congestion, etc
  • Packets arriving out of order or even duplicated
  • Scenic IP routing or excessive round trip
    times
  • Improper packet sizes (MTU/MSS)
  • Inefficient or inappropriate application
    designs

6
TCP tuning is painful debugging
  • All problems reduce performance
  • But the specific symptoms are hidden
  • But any one problem can prevent good performance
  • Completely masking all other problems
  • Trying to fix the weakest link of an invisible
    chain
  • General tendency is to guess and fix random
    parts
  • Repairs are sometimes random walks
  • Repair one problem at a time, at best

7
The Web100 project
  • When there is a problem, just ask TCP
  • TCP has the ideal vantage point
  • In between the application and the network
  • TCP already measures key network parameters
  • Round Trip Time (RTT) and available data
    capacity
  • Can add more
  • TCP can identify the bottleneck
  • Why did it stop sending data?
  • TCP can even adjust itself
  • Autotuning eliminates one of the six classes of bugs
  • See www.web100.org

8
Key Web100 components
  • Better instrumentation within TCP
  • 120 internal performance monitors (a sketch of reading similar counters on stock Linux follows below)
  • Poised to become an Internet-standard MIB
  • TCP Autotuning
  • Selects the ideal buffer sizes for TCP
  • Eliminates the need for user expertise
  • Basic network diagnostic tools
  • Requires less expertise than prior tools
  • Excellent for network admins
  • But still not useful for end users
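
Web100 exposes its roughly 120 per-connection variables through a kernel patch, but the basic idea of "just asking TCP" can be tried on a stock Linux host with the standard TCP_INFO socket option. A minimal sketch (this is not Web100 itself, and it shows only a handful of the kinds of counters Web100 records):

    /* Minimal sketch (not Web100): read a few of the kernel's per-connection
     * TCP counters for an already-connected socket via the standard Linux
     * TCP_INFO socket option. */
    #include <stdio.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>     /* struct tcp_info, TCP_INFO */
    #include <sys/socket.h>

    static void report_tcp_state(int fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        memset(&ti, 0, sizeof(ti));
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) != 0) {
            perror("getsockopt(TCP_INFO)");
            return;
        }
        /* A few of the quantities the slides talk about: round trip time,
         * retransmissions, and the current congestion window. */
        printf("smoothed RTT:       %u us (var %u us)\n", ti.tcpi_rtt, ti.tcpi_rttvar);
        printf("total retransmits:  %u\n", ti.tcpi_total_retrans);
        printf("congestion window:  %u segments\n", ti.tcpi_snd_cwnd);
        printf("sender MSS:         %u bytes\n", ti.tcpi_snd_mss);
    }

The same counters can also be inspected without writing code (for example with ss -i on Linux); Web100's own variables are richer and are read through its /proc interface.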

9
Web100 Status
  • Two-year no-cost extension
  • Standardization can only be pushed after most of the work is done
  • Ongoing support of research users
  • Partial adoption
  • Current Linux includes (most of) autotuning
  • John Heffner is maintaining patches for the rest of Web100
  • Microsoft
  • Experimental TCP instrumentation
  • Working on autotuning (to support FTTH)
  • IBM z/OS Communications Server
  • Experimental TCP instrumentation

10
The next step
  • Web100 tools still require too much expertise
  • They are not really end user tools
  • Too easy to overlook problems
  • Current diagnostic procedures are still
    cumbersome
  • New insight from Web100 experience
  • Nearly all symptoms scale with round trip time
  • New NSF funding
  • Network Path and Application Diagnosis
  • 3 years; we are at the midpoint

11
Nearly all symptoms scale with RTT
  • For example
  • TCP Buffer Space, Network loss and reordering,
    etc
  • On a short path TCP can compensate for the flaw
  • Local client to server: all applications work
  • Including all standard diagnostics
  • Remote client to server: all applications fail
  • Leading to faulty implication of other components

12
Examples of flaws that scale
  • Chatty application (e.g., 50 transactions per request)
  • On a 1ms LAN, this adds 50ms to user response time
  • On a 100ms WAN, this adds 5s to user response time
  • Fixed TCP socket buffer space (e.g., 32 kBytes)
  • On a 1ms LAN, limits throughput to 200 Mb/s
  • On a 100ms WAN, limits throughput to 2 Mb/s
  • Packet loss (e.g., 1% loss with 9kB packets; the arithmetic is sketched below)
  • On a 1ms LAN, models predict 500 Mb/s
  • On a 100ms WAN, models predict 5 Mb/s
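
The last two examples are just arithmetic. A small sketch (not from the presentation) that reproduces them, assuming the window-limited bound rate = window / RTT and the well-known Mathis-model estimate rate = (MSS / RTT) * C / sqrt(p) with C of roughly 0.7, and reading the packet-loss example as a 1% loss rate (the slide quotes rounded figures):

    /* Back-of-the-envelope arithmetic behind the scaling examples above.
     * Assumes rate = window / RTT for the buffer-limited case and the
     * Mathis-model estimate rate = (MSS / RTT) * C / sqrt(p), C ~ 0.7,
     * for the loss-limited case (loss read as a 1% rate). */
    #include <stdio.h>
    #include <math.h>

    static double window_limit_mbps(double window_bytes, double rtt_s)
    {
        return window_bytes * 8.0 / rtt_s / 1e6;
    }

    static double mathis_rate_mbps(double mss_bytes, double rtt_s, double loss)
    {
        return (mss_bytes * 8.0 / rtt_s) * 0.7 / sqrt(loss) / 1e6;
    }

    int main(void)
    {
        printf("32kB buffer, 1ms RTT:    %6.1f Mb/s\n", window_limit_mbps(32768, 0.001));
        printf("32kB buffer, 100ms RTT:  %6.1f Mb/s\n", window_limit_mbps(32768, 0.100));
        printf("9kB MSS, 1%% loss, 1ms RTT:    %6.1f Mb/s\n", mathis_rate_mbps(9000, 0.001, 0.01));
        printf("9kB MSS, 1%% loss, 100ms RTT:  %6.1f Mb/s\n", mathis_rate_mbps(9000, 0.100, 0.01));
        return 0;
    }

Built with cc scaling.c -o scaling -lm, this prints roughly 262 and 2.6 Mb/s for the buffer-limited cases and roughly 504 and 5 Mb/s for the loss-limited ones, matching the rounded numbers above.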

13
Review
  • For nearly all network flaws
  • The only symptom is reduced performance
  • But the reduction is scaled by RTT
  • On short paths many flaws are undetectable
  • False pass for even the best conventional
    diagnostics
  • Leads to faulty inductive reasoning about flaw
    locations
  • This is the essence of the end-to-end problem
  • Current state-of-the-art relies on tomography and
    complicated inference techniques

14
Our new technique
  • Specify target performance for S (server) to RC (remote client)
  • Measure the performance from S to LC (local client)
  • Use Web100 to collect detailed statistics
  • Loss, delay, queuing properties, etc
  • Use models to extrapolate results to RC
  • Assume that the rest of the path is ideal
  • Pass/Fail on the basis of extrapolated performance (sketched below)
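
A minimal sketch of that extrapolation, under the same Mathis-model assumption as the earlier sketch (C of roughly 0.7). The measured loss rate here is illustrative, and the real diagnostic also accounts for buffers, queuing, and other statistics:

    /* Illustrative pass/fail extrapolation (not the actual Web100/NPAD tool):
     * take the loss rate measured on the short section of the path, assume
     * the rest of the path is ideal, and ask whether the target rate is
     * achievable at the full target RTT. */
    #include <stdio.h>
    #include <math.h>

    /* Mathis-model estimate in bits/s, C ~ 0.7 */
    static double predicted_rate(double mss_bytes, double rtt_s, double loss)
    {
        return (mss_bytes * 8.0 / rtt_s) * 0.7 / sqrt(loss);
    }

    int main(void)
    {
        double target_rate   = 4e6;     /* end-to-end goal: 4 Mb/s             */
        double target_rtt    = 0.200;   /* over a 200 ms path                  */
        double mss           = 1448;    /* bytes, typical Ethernet-path MSS    */
        double measured_loss = 2.5e-4;  /* illustrative short-path measurement */

        double rate = predicted_rate(mss, target_rtt, measured_loss);
        printf("extrapolated rate at %.0f ms: %.2f Mb/s\n", target_rtt * 1e3, rate / 1e6);
        printf("%s\n", rate >= target_rate ? "Pass" : "Fail: too much packet loss");
        return 0;
    }

With this illustrative loss rate (0.025%) the extrapolated rate is about 2.6 Mb/s, so the section fails the 4 Mb/s goal even though a nearby client would see no problem.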

15
Example diagnostic output
    End-to-end goal: 4 Mb/s over a 200 ms path including this section
    Tester at IP address xxx.xxx.115.170
    Target at IP address xxx.xxx.247.109
    Warning: TCP connection is not using SACK
    Fail: Received window scale is 0, it should be 2.
    Diagnosis: TCP on the test target is not properly configured for this path.
    > See TCP tuning instructions at http://www.psc.edu/networking/perf_tune.html
    Pass: data rate check: maximum data rate was 4.784178 Mb/s
    Fail: loss event rate 0.025248% (3960 pkts between loss events)
    Diagnosis: there is too much background (non-congested) packet loss.
       The events averaged 1.750000 losses each, for a total loss rate of 0.0441836%
    FYI: To get 4 Mb/s with a 1448 byte MSS on a 200 ms path the total
       end-to-end loss budget is 0.010274% (9733 pkts between losses).
    Warning: could not measure queue length due to previously reported bottlenecks
    Diagnosis: there is a bottleneck in the tester itself or the test target
       (e.g. insufficient buffer space or too much CPU load)
    > Correct previously identified TCP configuration problems
    > Localize all path problems by testing progressively smaller sections
      of the full path.
    FYI: This path may pass with a less strenuous application
       Try rate=4 Mb/s, rtt=106 ms
       Or if you can raise the MTU
       Try rate=4 Mb/s, rtt=662 ms, mtu=9000
    Some events in this run were not completely diagnosed.
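
The loss budget in the FYI line can be reproduced by inverting the same model used in the earlier sketches: solve rate = (MSS / RTT) * C / sqrt(p) for p, again assuming C of roughly 0.7:

    /* Sketch: invert rate = (MSS / RTT) * C / sqrt(p) to get the loss rate
     * that just allows 4 Mb/s with a 1448-byte MSS on a 200 ms path, i.e.
     * the "loss budget" quoted in the FYI line above. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double rate = 4e6;        /* target: 4 Mb/s     */
        double rtt  = 0.200;      /* 200 ms path        */
        double mss  = 1448 * 8;   /* MSS in bits        */
        double C    = 0.7;

        double p = pow(C * mss / (rtt * rate), 2);   /* loss budget, as a fraction */
        printf("loss budget: %f%% (%.0f pkts between losses)\n", p * 100.0, 1.0 / p);
        return 0;
    }

This prints a budget of about 0.010274% (9733 packets between losses), matching the FYI line; raising the MTU or shortening the RTT relaxes the budget in the same way.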

16
Key features
  • Results are specific and less technical
  • Provides a list of action items to be corrected
  • Provides enough detail for escalation
  • Eliminates false pass test results
  • Test becomes more sensitive on shorter paths
  • Conventional diagnostics become less sensitive
  • Depending on models, perhaps too sensitive
  • New problem is false fail
  • Flaws no longer mask other flaws
  • A single test often detects several flaws
  • They can be repaired in parallel

17
Some demos
  • wget http://www.psc.edu/mathis/src/diagnostic-client.c
  • cc diagnostic-client.c -o diagnostic-client
  • ./diagnostic-client kirana.psc.edu 70 90

18
Local server information
  • Current servers are single-threaded
  • Silent wait if busy
  • Kirana.psc.edu
  • GigE attached directly to 3ROX
  • Outside the PSC firewall
  • Optimistic results to .61., .58. and .59. subnets
  • Scrubber.psc.edu
  • GigE attached in WEC
  • Interfaces on .65. and .66. subnets
  • Can be run on other Web100 systems
  • E.g. Application Gateways

19
The future
  • Collect (local) network pathologies
  • Raghu Reddy is coordinating
  • Keep archived data to improve the tool
  • Harden the diagnostic server
  • Widen testers to include attached campuses
  • 3ROX (3 Rivers Exchange) customers
  • CMU, Pitt, PSU, etc
  • Expect to find many more interesting pathologies
  • Replicate server at NCAR (FRGP) for their campuses

20
Related work
  • Also looking at finding flaws in applications
  • An entirely different set of techniques
  • But symptom scaling still applies
  • Provide LAN tools to emulate ideal long paths
  • Support local bench testing
  • For example, classic ssh
  • Long known performance problems
  • Recently diagnosed to be due to internal flow
    control
  • Chris Rapier developed a patch
  • Already running on many PSC systems
  • See http://www.psc.edu/networking/projects/hpn-ssh/