Campus-Wide Network Performance Monitoring and Recovery (CPR)


1
Campus-Wide Network Performance Monitoring and
Recovery (CPR)
  • Warren Matthews, Chris Kelly, Russ Clark, Terry
    Turner.

2
Network Services.
  • GT Campus
  • Backbone group maintains 197 buildings, 2,027
    switches, and 62,244 ports.
  • Southern Crossroads gigapop (SOX)
  • Provides connectivity for many universities
    throughout the Southeast (including Peachnet)
  • 10 Gbps link to the Abilene backbone.

3
Motivation.
  • Network Operations
  • Need a campus-wide view.
  • Catastrophic failure is easier to detect than
    degraded performance
  • A power or fiber cut vs. a slow network.
  • Little quantitative data to troubleshoot
    performance problems.

4
Measurement.
  • Build an Infrastructure
  • Deploy and maintain hardware
  • Tool management
  • Data management
  • Software development
  • Analysis, visualization
  • Establish a baseline, generate alarms
  • Troubleshooting

5
CPR.
  • Campus-wide Network Performance Monitoring and
    Recovery
  • Emphasis on Recovery
  • Regular tests across campus network
  • Active and passive monitoring
  • Shared access to test results
  • Comprehensive analysis and visualization
  • Also State-wide and International.

6
Key Enablers.
  • Control of network
  • Firewalls (departmental and host-based)
  • Switches/Routers
  • DNS (forward and reverse)
  • Physical Access
  • Available hardware
  • People and time

7
Hardware
  • CPR measurement machines
  • Original donated hardware (cheap)
  • Underpowered (P2-3/128MB/10GB) and unreliable
  • New donation of much better machines
  • Dell Optiplex GX260s (P4/512MB/30GB)
  • A few Dell PowerEdge servers
  • Analysis servers
  • Sun Fire X2100s and Sun Fire V210s

8
Virtual Machines.
  • The new machines are powerful and reliable enough
    that deployment can be significantly increased
    using virtual machines.
  • Many buildings have multiple VLANs. With virtual
    machines, multiple VLANs can be monitored from
    the same hardware.

[Figure: diagram with labels Building Switch, Router, Local Switches, CPR Hosts]

9
Deployment.
  • 87 hosts on Campus
  • Collocated with switches in data closets
  • Multiple views of the network
  • Especially the user's view

10
Maintenance.
  • Red Hat Enterprise Linux (RHEL)
  • RHEL4
  • State-wide license.
  • RHEL tools such as up2date, plus home-grown tools
  • Cprsetup
  • Set up new hosts
  • Cpradmin
  • Easily deploy new tools to all hosts

11
Management.
  • Use Nagios to monitor the monitoring hosts.
  • Also disk space, load average, NTP
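
A check of this kind can be sketched as a small Nagios-style plugin. This is a minimal illustration, not the project's actual configuration: the 80% and 90% thresholds and the function names are assumptions. The exit codes 0, 1, and 2 follow the standard Nagios plugin convention for OK, WARNING, and CRITICAL.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_disk(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction to a status code (thresholds are assumed)."""
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def format_status(status, used_fraction, path):
    """Build the single-line message a Nagios plugin prints."""
    return f"DISK {LABELS[status]} - {used_fraction:.0%} used on {path}"


def check_disk(path="/"):
    """Return (status, message) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    status = classify_disk(used_fraction)
    return status, format_status(status, used_fraction, path)
```

A wrapper script would print the message and exit with the status code, which is how Nagios reads the result of a plugin.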

12
Measurements.
  • No in-house development of measurement tools.
  • Currently (run regularly)
  • Smokeping - roundtrip time and graphs.
  • Nagios - Services.
  • Arpwatch
  • Iptables logs (darknet)
  • Also available
  • Iperf, Pathrate/Pathload
  • Nmap, traceroute, tcpdump

13
Lots of Data.
[Figure: data growth chart with phases labeled Pilot Phase, Phase I, Phase II]
  • First 100 million rows in mid-May
  • Second 100 million rows by early September
  • Estimated 350 million rows by the end of the year

14
More Measurements.
  • Additional tools
  • OWAMP and BWCTL
  • Pathrate/Pathload
  • Coming soon
  • NDT/NPAD (central, distributed)
  • Test bed for tools under development
  • Wishlist
  • GOAT/Netflow
  • Syslog
  • Integrate with SPAM/SWARM

15
Analysis.
  • Analysis
  • Create baselines for historical comparison.
  • Use multiple views to locate the problem.
  • Middleware.
  • Alarm system
  • Plateau detector (AMP), RIPE-TT.
  • How should we react to alarms?
  • Troubleshooting guide.
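
The baseline-and-alarm idea can be sketched as follows. This is a simplified illustration in the spirit of a plateau detector, not the AMP or RIPE-TT algorithm: the median/MAD baseline, the multiplier `k`, the run length `min_run`, and the 0.1 ms floor are all assumptions chosen for the example.

```python
from statistics import median


def baseline(history):
    """Return (median, MAD) of historical round-trip times in milliseconds."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    return med, mad


def plateau_alarm(history, recent, k=5.0, min_run=3):
    """Alarm only when the last `min_run` samples all exceed the baseline band.

    Requiring a sustained run means a single spike does not raise an alarm.
    """
    med, mad = baseline(history)
    threshold = med + k * max(mad, 0.1)  # floor avoids a zero-width band
    if len(recent) < min_run:
        return False
    return all(x > threshold for x in recent[-min_run:])
```

With a history centered near 11 ms, three consecutive samples near 30 ms would raise an alarm, while one 30 ms spike followed by normal samples would not.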

16
Research.
  • Real-time analysis and automated
    trouble detection are tough problems.
  • Working with GT researchers
  • Statistics
  • Binary Tomography
  • Reducing Data for real-time fault detection
  • Spatial and Temporal correlations
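
Binary tomography can be illustrated with a small sketch: each monitored path is marked good or bad, every link on a good path is assumed good, and the fewest remaining links that explain all the bad paths become the suspects. The greedy selection below is one common heuristic, not necessarily the approach used by the GT researchers, and the host and link names in the test data are hypothetical.

```python
def suspect_links(paths, path_ok):
    """Greedy binary tomography.

    paths:   dict mapping a path name to the list of links it crosses
    path_ok: dict mapping a path name to True (good) or False (bad)
    Returns a list of links that together explain every explainable bad path.
    """
    # Any link that carries a good path is assumed to be working.
    good_links = set()
    for name, links in paths.items():
        if path_ok[name]:
            good_links.update(links)

    # For each bad path, keep only the links that could be at fault.
    unexplained = {
        name: set(links) - good_links
        for name, links in paths.items()
        if not path_ok[name]
    }

    suspects = []
    while unexplained:
        counts = {}
        for candidates in unexplained.values():
            for link in candidates:
                counts[link] = counts.get(link, 0) + 1
        if not counts:
            break  # remaining bad paths have no candidate link left
        # Pick the link shared by the most bad paths (ties broken by name).
        best = max(sorted(counts), key=counts.get)
        suspects.append(best)
        unexplained = {n: c for n, c in unexplained.items() if best not in c}
    return suspects
```

The more vantage points there are, the more links are cleared by good paths, which is why multiple views of the network help narrow down the location of a fault.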

17
Visualization.
  • Visualization is also a research issue
  • Beyond eye-candy
  • Graphing and tables are useful
  • Smokeping, Nagios (Built-in graphs and tables).
  • Phplot (Home-grown powerful graphing tool).
  • myCPR (Configurable user-friendly interface)

18
Smokeping.
  • Full mesh monitoring between hosts
  • Data extracted before it is summarized
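
To give a sense of scale, a full mesh can be enumerated as every ordered (source, target) pair of hosts. The sketch below is illustrative only: the function names are assumptions, and it does not generate real Smokeping configuration. If all 87 campus hosts took part, the mesh would contain 87 × 86 = 7,482 directed pairs.

```python
from itertools import permutations


def full_mesh(hosts):
    """Return every ordered (source, target) pair of distinct hosts."""
    return list(permutations(hosts, 2))


def targets_by_source(hosts):
    """Group the mesh so each source host gets its own list of targets."""
    mesh = {}
    for source, target in full_mesh(hosts):
        mesh.setdefault(source, []).append(target)
    return mesh
```

Because the pair count grows with the square of the host count, each added host increases the data volume noticeably.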

19
Nagios.
  • Central servers are monitored by default
  • Web server
  • Mail servers
  • DNS
  • Additional services added for certain subnets

20
Phplot.
  • Student extended existing package to create a
    very flexible graphing tool
  • Scroll, zoom, error bars, etc.

21
myCPR.
  • Web front end
  • User-focused diagnostic environment
  • Add graphs, alarms, etc. for a local view
  • Highly configurable
  • Change parameters

22
Iperf.
  • Simple color-coded image for a quick summary of
    network health
  • Arranging hosts by upstream router reveals common problems
  • Most problems resolved after upgrade
  • Cpr-savant44 and cpr-savant45 (virtual machines)
    remain poor
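
The color-coded summary can be sketched as a mapping from each host's measured throughput to a color, grouped by upstream router so that a shared problem shows up as a run of the same color. The thresholds (90 and 50 Mbps), host names, and router names below are assumptions for illustration, not values from the CPR deployment.

```python
from collections import defaultdict


def color_for(mbps, good=90.0, poor=50.0):
    """Classify a throughput measurement (thresholds are assumed)."""
    if mbps >= good:
        return "green"
    if mbps >= poor:
        return "yellow"
    return "red"


def summarize(results):
    """Group (host, upstream_router, mbps) results into colored rows per router."""
    grid = defaultdict(list)
    for host, router, mbps in results:
        grid[router].append((host, color_for(mbps)))
    return {router: sorted(cells) for router, cells in sorted(grid.items())}
```

A router whose hosts are all red points to a problem at or above that router, while a single red host under a green router points to the host or its local switch port.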

23
Case Studies.
  • CPR has helped solve numerous issues
  • Slow file transfer due to duplex mismatch
  • Slow file sharing due to infected server
  • Email problem due to rogue ACL
  • Not everything is a network issue
  • Application problems are often reported as network
    issues (to a network administrator, at least, these
    are separate).

24
GAMMON.
  • Expand CPR monitoring to wide-area.
  • Georgia Measurement and Monitoring
  • Valdosta State University
  • Armstrong Atlantic State University
  • Barrow County School System.
  • Distance Learning and Professional Education
    (DLPE).

25
GAMMON Deployment.
[Figure: GAMMON deployment diagram. Labels: Barrow, Bellsouth, Level3, UUNET, Qwest, SOX, GLC, GT, Peachnet, Savannah, Armstrong, Valdosta]

26
Throughput
27
Other Deployments.
  • Local ISPs
  • Major providers (Level3, Qwest, Cogent)
  • Residential (SpeedFactory, BellSouth, Charter,
    Cox, Earthlink)
  • International
  • Georgia Tech has adopted an international focus

28
Global Monitoring.
  • International focus in strategic plan.
  • GT presence at many international sites
  • Research and education often involves global
    collaboration
  • CPR hosts deployed in Metz (France) and Shanghai
    (China)
  • Coming soon: Korea, Singapore, London
  • However, numerous other networks must be crossed
    to reach a remote site
  • One group cannot monitor them all.

29
Routing.
[Figure: routing diagram. Labels: SJTU, CERNet, KREOnet, APAN/JP, Transpac, PacificWave, Abilene, SOX, GT]

30
Latency.
  • Data from January and February 2006.
  • Very heavy tail
  • Interactive applications unusable
  • Working to bypass problems
  • Best guess
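
A heavy tail is easiest to see by comparing the median with the high percentiles. The sketch below assumes a list of round-trip times in milliseconds; the 300 ms cutoff for interactive use is an assumption for illustration, not a figure from the measurements.

```python
from statistics import median, quantiles


def tail_summary(rtts_ms):
    """Summarize a latency distribution by its median and upper percentiles."""
    cuts = quantiles(rtts_ms, n=100, method="inclusive")  # 99 cut points
    return {
        "median": median(rtts_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(rtts_ms),
    }


def fraction_over(rtts_ms, limit_ms=300.0):
    """Fraction of samples slower than `limit_ms` (an assumed usability cutoff)."""
    return sum(1 for x in rtts_ms if x > limit_ms) / len(rtts_ms)
```

When the 99th percentile is many times the median, an average hides the problem: most probes look fine, but the slow ones are slow enough to break interactive sessions.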

31
PerfSONAR.
  • International measurement infrastructure.
  • Read performance data (measurement archive, MA)
  • Perform on-demand test (measurement point, MP)
  • Communication using emerging standards from
    GGF-NMWG.
  • Reference implementation developed by Internet2,
    Géant, RNP.
  • Still developing AAA.

32
Empower the end-user.
  • Measurement Infrastructure typically means WAN
    monitoring
  • But problems are LAN- and host-based.
  • Individual networks, communities and virtual
    organizations take their own measurements.
  • perfSONAR allows them to access each other's measurements
  • End-users can pin down a problem before calling
    you.

33
Summary.
  • CPR and GAMMON
  • Management, analysis, visualization
  • If you don't measure, you don't know.
  • End-users can verify performance with perfSONAR.

34
You can get involved.
  • Deploy a measurement host
  • GAMMON or your own host.
  • Lots of Projects (put your faculty and students
    to work!)
  • Work with perfSONAR.
  • Analysis.
  • Visualization.

35
This is the End.
  • Contact
  • Warren.Matthews_at_oit.gatech.edu
  • Chris.Kelly_at_oit.gatech.edu
  • Project WebSite
  • http://www.rnoc.gatech.edu/cpr
  • We welcome your input and collaboration.