State of Network Monitoring and Analysis in the US PowerPoint PPT Presentation

Title: State of Network Monitoring and Analysis in the US


1
State of Network Monitoring and Analysis in the US
  • Les Cottrell, KC Claffy, Brian Tierney, Ronn
    Ritke, Hans-Werner Braun
  • Prepared for the LSN meeting at NSF Washington
    6/10/03

Partially funded by DOE/MICS Field Work Proposal
on Internet End-to-end Performance Monitoring
(IEPM), by the SciDAC base program
2
Outline
  • Goal for network monitoring & analysis talk
  • identify the R&D gaps and large-scale deployment
    issues for DOE, NSF, DARPA, NASA, NSA, NIST, etc.,
    the federal agencies that fund network
    research in the US
  • Two complementary presentations
  • High performance networking measurement needs for
    Science (E2E): Les
  • Consumer grade net-centric measurement needs:
    kc
  • Science network measurement needs
  • The end-to-end challenge, illustrations
  • Solution
  • End to end Monitoring Goals
  • Current issues
  • Problem analysis, measurement infrastructure,
    analysis tools, standards, collaborations
  • Benefits to Science
  • Consequences of not addressing issues
  • Why not leave it to industry
  • Appendix
  • What is being done today
  • Who is measuring?
  • Who is using the measurements?
  • What is being measured?

3
The Problem
  • Distributed systems are very hard
  • "A distributed system is one in which I can't get
    my work done because a computer I've never heard
    of has failed." (Butler Lampson)
  • When building distributed systems, we often
    observe unexpectedly low performance
  • the reasons for which are usually not obvious
  • The bottlenecks can be in any of the following
    components
  • the applications
  • the operating systems
  • the disks, network adapters, bus, memory, etc. on
    either the sending or receiving host
  • the network switches and routers, and so on
  • Problems may not be logical
  • Most problems are operator errors,
    configurations, bugs

4
Anatomy of a Problem
[Figure: speech bubbles from the parties along the path]
  • "Hey, this is not working right!"
  • "Others are getting in OK"
  • "Not our problem"
  • Applications developer
  • "The computer is working OK"
  • "Looks fine"
  • "All the lights are green"
  • "We don't see anything wrong"
  • "The network is lightly loaded"
How do you solve a problem along a path?
5
Problem examples: "Help, it's not working"
  • I've lost my connection
  • Despite over-provisioned networks, users cannot get
    the throughput they expect
  • Wizard gap
  • What should I expect the performance to be?
  • It sometimes works
  • What am I, as a scientist, supposed to do?
  • Need tools/measurements to detect problems,
    identify location, cause and time of occurrence

[Figure: the wizard gap, throughput (Mbits/s) achieved by a wizard vs. a typical user]
Is the Grid server down, is the network partitioned,
is there heavy congestion, did DNS fail, is a
firewall preventing access?
6
The Solution
  • A complete End-to-End monitoring framework that
    includes the following components
  • instrumentation tools (application, middleware,
    and OS monitoring)
  • host and network sensors (host and network
    monitoring)
  • sensor management / activation tools
  • event publication service
  • event archive service
  • event analysis and visualization tools
  • a common set of protocols for describing,
    exchanging, and locating monitoring data
  • Needed for applications (e.g. Grid middleware),
    diagnosis, performance analysis
  • toolkit for streamlined problem diagnosis:
    detection, location, isolation, reporting
  • glue to multiple sources of information:
    traceroute archives, router info, delay/loss
    archives, on-demand tests, baselines
  • analysis and heuristics
  • E2Epi is working on a solution, but is funded only
    for coordination, not for the underlying work
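
As a rough illustration of how the event publication and event archive components listed above might fit together, here is a minimal in-process sketch. All class names, field names, and sample values are hypothetical; a real framework would publish over the network and add sensor management and security.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MonitoringEvent:
    """One measurement emitted by a host or network sensor (hypothetical schema)."""
    source: str       # sensor that produced the event, e.g. "host-a"
    metric: str       # what was measured, e.g. "rtt_ms"
    value: float
    timestamp: float  # seconds, on whatever clock the sensor uses


class EventBus:
    """Toy event publication service: sensors publish, consumers subscribe by metric."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[MonitoringEvent], None]]] = {}

    def subscribe(self, metric: str, handler: Callable[[MonitoringEvent], None]) -> None:
        self._subscribers.setdefault(metric, []).append(handler)

    def publish(self, event: MonitoringEvent) -> None:
        # Deliver the event to every handler registered for its metric.
        for handler in self._subscribers.get(event.metric, []):
            handler(event)


class EventArchive:
    """Toy event archive service: keeps every event it receives, queryable by source."""

    def __init__(self) -> None:
        self._events: List[MonitoringEvent] = []

    def store(self, event: MonitoringEvent) -> None:
        self._events.append(event)

    def query(self, source: str) -> List[MonitoringEvent]:
        return [e for e in self._events if e.source == source]


bus = EventBus()
archive = EventArchive()
bus.subscribe("rtt_ms", archive.store)  # the archive is just another subscriber
bus.publish(MonitoringEvent("host-a", "rtt_ms", 42.0, 0.0))
bus.publish(MonitoringEvent("host-b", "rtt_ms", 180.5, 1.0))
```

Treating the archive as one subscriber among many means analysis and visualization tools could attach to the same stream without the sensors knowing about them.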

7
End-2-End Monitoring Goals
  • Have to solve E2E performance: it is THE
    critical metric for the user, not just a backbone
    bandwidth problem
  • Improve end-to-end data throughput for data
    intensive applications in high-speed WAN
    environments
  • Provide the ability to do performance analysis
    and fault detection in a Grid computing
    environment
  • Provide accurate, detailed, and adaptive
    monitoring of all distributed computing
    components, including the network
  • "Unfortunately, network management research has
    historically been very under-funded, because it
    is difficult to get funding bodies to recognize
    this as legitimate networking research" (IAB
    Concerns & Recommendations Regarding Internet
    Research & Evolution)

8
Current Issues 1: Problem Analysis
  • Cultivate systematic studies of problems, causes,
    how to discover, how to report, how to by-pass
  • Analysis to help in deciding what are the most
    important problems, see how they are tackled
    manually today
  • Decide which problems are most cost-effective
    to address by developing tools to assist in
    diagnosis

9
Current Issues 2: Measurement Infrastructures
  • Need to build infrastructure to support
    troubleshooting
  • Requires repetitive and on-demand measurements
    with appropriate security model.
  • Provide recommended/accepted set of tools for
    delay, RTT, loss, route tracking, "bandwidth"
    estimation.
  • Include archiving and access to data, analysis
    and reporting of repetitive data.
  • Allow for evaluation, validation and comparison
    of new measurement tools, TCP stacks,
    applications (e.g. file transfer).
  • Reverse traceroute, looking glass, remote tcpdump
    (e.g. SCNM), remote testing of connection (ANL
    NDT)
  • Traceroute archives
  • Make tools easier to comprehend and use by
    scientists
  • Encourage efforts such as Internet2 E2Epi efforts
    to provide measurements inside the cloud
  • Extend to ESnet, other NRNs, and beyond
  • Fund collaboration across boundaries
  • Ubiquitous coverage (requires multiple toolkits):
    inter-agency, international, hi-speed, digital
    divide, long term and current
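
To make the repetitive delay, RTT, and loss measurements above concrete, the following sketch summarizes one batch of probe results. It assumes the RTT samples have already been collected by some tool such as ping; the function name, the result keys, and the jitter definition (mean absolute difference between successive replies) are illustrative choices, not a standard.

```python
from statistics import median
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summarize one batch of probes; None marks a probe that got no reply."""
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    if not received:
        return {"sent": sent, "loss_pct": loss_pct}
    # Jitter here is the mean absolute difference between successive replies.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "sent": sent,
        "loss_pct": loss_pct,
        "min_ms": min(received),
        "median_ms": median(received),
        "max_ms": max(received),
        "jitter_ms": sum(diffs) / len(diffs) if diffs else 0.0,
    }


# Five probes, one of which was lost.
summary = summarize_probes([20.0, 22.0, None, 21.0, 25.0])
```

Archiving such summaries at regular intervals gives the baselines that on-demand tests can later be compared against.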

10
Current Issues 3: Analysis Tools
  • Provide measurement tools to accurately and quickly
    identify performance problems,
  • to automatically take action to investigate and
    provide information for
  • Scientist
  • Grid support NOC
  • Network administrator or network person
  • Promote well understood, accepted metrics for
    customers for realistic, enforceable SLAs,
  • provide acceptable limits,
  • provide tools to track
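
One way to picture "acceptable limits" and "tools to track" them is a check of measured values against per-path limits. This is only a sketch: the metric names and the numeric limits below are made up for illustration and are not taken from any real SLA.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SlaLimits:
    """Acceptable limits for one path (illustrative values only)."""
    max_rtt_ms: float
    max_loss_pct: float
    min_throughput_mbps: float


def sla_violations(measured: Dict[str, float], limits: SlaLimits) -> List[str]:
    """Return one human-readable line for every limit the measurement breaks."""
    problems = []
    if measured["rtt_ms"] > limits.max_rtt_ms:
        problems.append(f"RTT {measured['rtt_ms']} ms exceeds {limits.max_rtt_ms} ms")
    if measured["loss_pct"] > limits.max_loss_pct:
        problems.append(f"loss {measured['loss_pct']}% exceeds {limits.max_loss_pct}%")
    if measured["throughput_mbps"] < limits.min_throughput_mbps:
        problems.append(
            f"throughput {measured['throughput_mbps']} Mbits/s is below "
            f"{limits.min_throughput_mbps} Mbits/s"
        )
    return problems


limits = SlaLimits(max_rtt_ms=100.0, max_loss_pct=1.0, min_throughput_mbps=400.0)
report = sla_violations(
    {"rtt_ms": 80.0, "loss_pct": 2.5, "throughput_mbps": 120.0}, limits
)
```

Here `report` holds two lines, one for loss and one for throughput, which is the kind of output a scientist or NOC could act on directly.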

11
Current Issues 4: Standards
  • All the above requires
  • easy to use standard ways (e.g. web services) for
    applications to access data from existing and new
    monitoring projects.
  • standard naming conventions and schemas.
  • This will provide the ability to share
    information from multiple measurement
    infrastructure projects
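
The value of standard naming conventions can be sketched with two projects that report the same measurement under different field names. The alias table and record layouts below are hypothetical and do not reproduce any existing schema.

```python
import json
from typing import Dict

# Hypothetical mapping from project-specific field names to one shared vocabulary.
FIELD_ALIASES = {
    "src": "source",
    "from_host": "source",
    "dst": "destination",
    "to_host": "destination",
    "rtt": "rtt_ms",
    "round_trip_ms": "rtt_ms",
}


def to_common_schema(record: Dict[str, object]) -> Dict[str, object]:
    """Rename known aliases; pass through fields already in the shared vocabulary."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


# The same measurement as two different (made-up) projects might report it.
project_a = {"src": "host-a", "dst": "host-b", "rtt": 42.0}
project_b = {"from_host": "host-a", "to_host": "host-b", "round_trip_ms": 42.0}

common_a = to_common_schema(project_a)
common_b = to_common_schema(project_b)

# Once normalized, records can be exchanged in a neutral format such as JSON.
wire_format = json.dumps(common_a, sort_keys=True)
```

After normalization the two records are identical, so an application can query both infrastructures with one set of field names.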

12
Current Issues 5: Collaboration
  • Need to build multi-disciplinary teams (incent
    orthogonal groups to work with one another)
  • include people close to eventual customers
    (scientists, operational folks)
  • to ensure what is developed is useful, tested out
    in realistic environments
  • include vendors and providers in funded projects
    to bridge the gaps
  • E2Epi is funded to provide coordination
  • Multi agency funding!
  • This is not a problem a single agency can address
  • Science applications cross multi-agency networks,
    but there are barriers to inter-agency network
    monitoring collaborations

13
Benefits to Science
  • Network reaches its potential
  • enable new ways of doing science
  • data intensive science (astrophysics, global
    weather, seismology, medicine),
  • remote instrument control (SNS, fusion (ITER),
    surgery),
  • remote visualization/insight (Terascale
    supernova, climate modeling),
  • world-wide collaboration enabling (LHC, ITER)
  • enables scientists to do science
  • Wizard gap closure, not fighting the network,
    network becomes a catalyst
  • Without good troubleshooting capabilities, the
    Grid vision will fail
  • Predictability, planning, expectations, raising
    the bar

14
What happens if we do not address the issues
  • Data continues to ship inefficiently by
    truck/plane (FedEx)
  • Long delays (2 weeks), degraded collaboration, US
    scientists continue to lose leadership
  • Increased costs (manpower costs, lack of
    automation)
  • Inadequate reliability or performance for new
    applications, (e.g. Grid fails to reach its
    potential)
  • New capabilities do not emerge in US
  • remote instrument control, real-time video, media
    distribution
  • US science loses leadership to Japan, Europe,
    Canada

15
Why not leave it to industry?
  • Industry won't do it ("it's not my problem")
  • Has its interest and hands full elsewhere
  • It's hard, does not sell products, little Return
    on Investment
  • Historically poor record, competitive concerns
  • Management features are late in product
    development cycle
  • Early success with SNMP and Netflow
  • Commercial Network Management Platforms (e.g.
    OpenView, Tivoli) limited success (network
    oriented, not user), not cost effective
  • ISPs only measure own nets, not E2E, SLA
    guarantees are not cross-provider

16
More Information
  • Some Measurement Infrastructures
  • CAIDA list www.caida.org/analysis/performance/measinfra/
  • AMP amp.nlanr.net/, PMA http://pma.nlanr.net
  • IEPM/PingER home site www-iepm.slac.stanford.edu/
  • IEPM-BW site www-iepm.slac.stanford.edu/bw
  • NIMI ncne.nlanr.net/nimi/
  • RIPE www.ripe.net/test-traffic/
  • NWS nws.cs.ucsb.edu/
  • Internet2 PiPES e2epi.internet2.edu/
  • Tools
  • CAIDA measurement taxonomy www.caida.org/tools/
  • SLAC Network Tools www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  • Internet research needs
  • www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt

17
Appendix: Current Practices
18
Who is Measuring?
  • CAIDA (skitter, macroscopic)
  • NLANR (e.g. AMP active, PMA passive)
  • LBL (e.g. netest)
  • SLAC/FNAL (e.g. PingER, IEPM-BW)
  • PSC (NIMI)
  • RICE (INCITE)
  • Europe: RIPE (European ISPs), PPMCG
  • NWS
  • Internet2 (PiPES, IETF/IPPM, Netflow)
  • Sprint, AT&T Research
  • Commercial (Keynote, Matrix, internetweather)
  • For more see www.caida.org/analysis/performance/measinfra

19
Who is using the measurements (customers)?
  • Users
  • Why is the performance not what I would like or
    expect?
  • Set expectations, build case to complain to ISP
  • What should I expect, what applications are
    likely to work?
  • Planners: observe growth, decide when upgrades
    are needed, make cases for upgrades
  • Network engineers: pin-point problem, provide
    information to providers
  • Providers: where is the problem and what is it,
    best bang for the buck
  • Grid applications: users/developers look forward
    to using the measurements,
  • e.g. Grid Resource Broker data placement
  • Requires APIs (e.g. web services), common naming
    conventions (e.g. NMWG, GLUE schema) etc.
  • Security: anomalies
  • Researchers: modeling, theory testing, scaling
    laws

20
What is being Measured 1/2
Purpose          | Operations                                          | Research
End-to-end       | Ping/traceroute, bandwidth, application performance | Bandwidth estimation
Network centric  | SNMP, MRTG, Netflow                                 | Topology / tomography, mapping, security
Other taxonomies: active vs passive
21
What is being Measured 2/2
  • Delays, RTT, loss, jitter, availability
  • Bandwidth estimation
  • TCP & UDP throughputs
  • Packet pair techniques
  • Packet length techniques (pchar)
  • Topology / tomography, routing
  • Utilization, errors
  • Security
  • Evaluation of new protocols
  • Applications (many commercial packages)
  • Email, DB, www
  • One off traffic characterization at borders and
    IXPs
  • Exception: providers do not make information
    public
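
As an illustration of the packet pair techniques listed above: if two packets of size L are sent back to back and arrive separated by a gap Δ, the bottleneck capacity is roughly L/Δ. The sketch below assumes the gaps have already been measured at the receiver; the function name is hypothetical, and taking the median is just one simple way to damp cross-traffic noise, not what any particular tool does.

```python
from statistics import median
from typing import Sequence


def packet_pair_capacity_mbps(packet_size_bytes: int, dispersions_s: Sequence[float]) -> float:
    """Estimate bottleneck capacity from the spacing of back-to-back packet pairs."""
    if not dispersions_s:
        raise ValueError("need at least one dispersion sample")
    # Cross traffic stretches or compresses individual gaps, so use a typical one.
    typical_gap_s = median(dispersions_s)
    bits_per_packet = packet_size_bytes * 8
    return bits_per_packet / typical_gap_s / 1e6


# 1500-byte packets with gaps clustered near 120 microseconds, plus one outlier.
estimate = packet_pair_capacity_mbps(1500, [118e-6, 120e-6, 120e-6, 250e-6, 121e-6])
```

With these made-up samples the estimate comes out at about 100 Mbits/s, since 12000 bits arriving every 120 microseconds is 100 million bits per second.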

22
What tools are being used?
  • Delays etc.: ping, OWAMP, GPS
  • Bandwidth: iperf, pathload, pipechar, netest,
    ABwE
  • Utilization: SNMP
  • Topology/tomography: traceroute, skitter, INCITE
  • Routing: RIPE, routeviews
  • Traffic characterization: netflow, NeTraMet,
    tcpdump, coralreef
  • Visualization: MRTG, RRD, netgeo, geoplot,
    tcptrace, xplot