Title: Enterprise Network Troubleshooting
1. Enterprise Network Troubleshooting
- Nick Feamster, Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)
2. Three Disjoint Views of the Network
[Figure: three views of the network (Policy, Static, Dynamic). Diagram labels: Generation; Error Checking and Deployment.]
- Tools: rancid/rcc, FIREMAN/Lumeta
Independent analyses!
- Policy: The operator's wish list
- Static: What the configurations say
- Dynamic: The behavior that users witness
3. A Closer Look
- Proactive analysis
- Fault avoidance
- Policy conformance
- Reactive diagnosis
- Correcting network faults
- Detection
- Localization
- Active and passive measurements
- Need the users' perspective
- Two studies
- Routing
- Firewalls
Idea: These analyses should inform each other
4. Catastrophic Configuration Faults
"...a glitch at a small ISP triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint."
-- news.com, April 25, 1997
"Microsoft's websites were offline for up to 23 hours...because of a router misconfiguration...it took nearly a day to determine what was wrong and undo the changes."
-- wired.com, January 25, 2001
"WorldCom Inc...suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems...affected millions of computer users worldwide. A spokeswoman attributed the outage to 'a route table issue.'"
-- cnn.com, October 3, 2002
"A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration)."
-- dslreports.com, February 23, 2004
5. Case 1: Network-Wide Routing Analysis
- Proactive routing configuration analysis
- Idea: Analyze configuration before deployment
Many faults can be detected with static analysis.
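As a rough illustration of what a pre-deployment static check can look like, the sketch below scans one router's configuration for route maps that are applied to a BGP neighbor but never defined. It is not taken from the talk or from any particular tool: the configuration text is a simplified, hypothetical Cisco-style fragment, and the function name `undefined_route_maps` is invented for this example.

```python
import re

# Hypothetical, simplified Cisco-style configuration for a single router.
CONFIG = """\
router bgp 65000
 neighbor 10.0.0.2 remote-as 65001
 neighbor 10.0.0.2 route-map IMPORT-CUST in
 neighbor 10.0.0.3 remote-as 65002
 neighbor 10.0.0.3 route-map EXPORT-PEER out
route-map IMPORT-CUST permit 10
 set local-preference 120
"""


def undefined_route_maps(config: str) -> list[tuple[str, str]]:
    """Return (neighbor, route_map) pairs whose route map is never defined."""
    # Route-map definitions start at the beginning of a line.
    defined = set(re.findall(r"^route-map (\S+)", config, flags=re.MULTILINE))
    # Route-map references appear on indented neighbor statements.
    used = re.findall(
        r"^[ \t]*neighbor (\S+) route-map (\S+) (?:in|out)[ \t]*$",
        config,
        flags=re.MULTILINE,
    )
    return [(nbr, name) for nbr, name in used if name not in defined]


print(undefined_route_maps(CONFIG))  # [('10.0.0.3', 'EXPORT-PEER')]
```

A check like this needs only the configuration files, so it can run before a change is pushed to a router. A network-wide analysis would apply the same idea across all routers and to more properties than dangling references.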
6. Operators Find Static Analysis Useful
"That's wicked!"
-- Nicolas Strina, ip-man.net
"Thanks again for a great tool."
-- Paul Piecuch, IT Manager
"...good to finally see more coverage of routing as distributed programming. From my experience, the principles of software engineering eliminate a vast majority of errors."
-- Joe Provo, rcn.com
"I find your approach useful, it is really not fun (but critical for the health of the network) to keep track of the inconsistencies among different routers...a configuration verifier like yours can give the operator a degree of confidence that the sky won't fall on his head real soon now."
-- Arnaud Le Tallanter, clara.net
7. Yes, but Surprises Happen!
- Link failures
- Node failures
- Traffic volumes shift
- Network devices wedged
- Two problems
- Detection
- Localization
8. Detection: Analyze Routing Dynamics
- Idea: Routers exhibit correlated behavior
Blips across signals may be more operationally
interesting than any spike in one.
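To make the contrast between a single spike and correlated blips concrete, here is a minimal sketch. It assumes each router reports a count of routing updates per time bin, and it uses a deliberately simple rule (ratio to the router's own median) that is not the detection method from the talk. The router names, counts, and thresholds are all hypothetical.

```python
from statistics import median


def ratios(series):
    """Scale each count by the router's own median (its baseline)."""
    base = median(series)
    if base == 0:
        return [0.0 for _ in series]
    return [x / base for x in series]


def detect(counts, spike=3.0, blip=1.5, min_routers=3):
    """Return (spike_bins, correlated_bins) for per-router update counts.

    spike_bins: bins where some single router exceeds the spike threshold.
    correlated_bins: bins where at least min_routers routers show a modest
    blip at the same time, even if none of them would trip the spike rule.
    """
    scaled = {router: ratios(series) for router, series in counts.items()}
    n_bins = len(next(iter(counts.values())))
    spike_bins, correlated_bins = [], []
    for t in range(n_bins):
        if any(scaled[r][t] >= spike for r in scaled):
            spike_bins.append(t)
        if sum(1 for r in scaled if scaled[r][t] >= blip) >= min_routers:
            correlated_bins.append(t)
    return spike_bins, correlated_bins


# Hypothetical update counts per time bin for four routers.
COUNTS = {
    "r1": [10, 11, 9, 10, 17, 10, 10, 40, 10, 9],
    "r2": [20, 19, 21, 20, 33, 20, 22, 20, 19, 20],
    "r3": [5, 5, 6, 5, 9, 5, 4, 5, 5, 6],
    "r4": [8, 8, 7, 8, 8, 9, 8, 8, 8, 7],
}

print(detect(COUNTS))  # ([7], [4])
```

In this toy data, a per-router threshold catches only the large spike on r1 in bin 7. Bin 4, where three routers rise modestly together, is found only by looking across signals.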
9. Detection: Three Types of Events
- Single-router bursts
- Correlated bursts
- Multi-router bursts
- Common
- Commonly missed using thresholds
10. Localization: Joint Dynamic/Static
- Which routers are border routers for that burst
- Topological properties of routers in the burst
[Figure labels: Proactive Analysis, Deployment, Static, Dynamic, Reactive Detection, Diagnosis/Correction]
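One way to picture the joint dynamic/static step: take the set of routers seen in a burst (dynamic view) and look them up in facts derived from the configurations (static view). The sketch below is an assumption-laden illustration, not the talk's algorithm. The router names, the `EBGP_SESSIONS` and `LINKS` tables, and the choice of "common neighbors" as the topological property are all hypothetical.

```python
# Static view (hypothetical): facts one might extract from router configs.
EBGP_SESSIONS = {
    "atl-1": ["AS65001"],   # has an external BGP session, so a border router
    "atl-2": [],
    "nyc-1": ["AS65002"],
    "core-1": [],
}
LINKS = [
    ("atl-1", "core-1"),
    ("atl-2", "core-1"),
    ("nyc-1", "core-1"),
    ("atl-1", "atl-2"),
]


def adjacency(links):
    """Build an undirected adjacency map from a list of links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def localize(burst_routers, ebgp_sessions, links):
    """Combine a burst (dynamic) with config-derived facts (static)."""
    burst = set(burst_routers)
    adj = adjacency(links)
    # Which routers in the burst are border routers?
    border = sorted(r for r in burst if ebgp_sessions.get(r))
    # One simple topological property: neighbors shared by every burst router.
    common = set()
    if burst:
        common = set.intersection(*[adj.get(r, set()) for r in burst]) - burst
    return {"border_routers": border, "common_neighbors": sorted(common)}


print(localize({"atl-1", "atl-2"}, EBGP_SESSIONS, LINKS))
# {'border_routers': ['atl-1'], 'common_neighbors': ['core-1']}
```

Here the burst involves one border router and two routers that share a single upstream neighbor, which narrows where an operator might look first.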
11. Case 2: Firewalls
- Georgia Tech Campus Network
- Research and Administrative Network
- 180 buildings
- 130 firewalls
- 1700 switches
- 55000 ports
- Problem: Availability/Reachability
- Flux in firewall, router, switch configurations
- No common authority over changes made
12. Specific Focus: Firewall Configuration
- Difficult to understand and audit configs
- Subject to continual modifications
- Roughly 1-2 touches per day
- Federated policy, distributed dependencies
- Each department has independent policies
- Local changes may affect global behavior
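The point that local changes may affect global behavior can be sketched with a toy model in which a flow is reachable only if every firewall on its path permits it. The rule format, addresses, and the first-match-with-implicit-deny semantics below are assumptions made for illustration and do not describe the Georgia Tech deployment.

```python
from ipaddress import ip_address, ip_network

# Hypothetical rule format: (action, source network, destination network, port)
# where port None means "any port". Rules are evaluated first-match.


def allows(rules, src, dst, port):
    """Return True if this firewall permits the flow (implicit deny at end)."""
    for action, src_net, dst_net, rule_port in rules:
        if (
            ip_address(src) in ip_network(src_net)
            and ip_address(dst) in ip_network(dst_net)
            and rule_port in (None, port)
        ):
            return action == "permit"
    return False


def reachable(path, src, dst, port):
    """A flow gets through only if every firewall on the path permits it."""
    return all(allows(rules, src, dst, port) for rules in path)


# Campus-level firewall and one department's firewall (both hypothetical).
campus = [("permit", "0.0.0.0/0", "10.2.0.0/16", 443)]
dept_before = [("permit", "10.1.0.0/16", "10.2.3.0/24", 443)]
# A local edit: the department blocks one subnet ahead of its permit rule.
dept_after = [("deny", "10.1.5.0/24", "10.2.3.0/24", None)] + dept_before

print(reachable([campus, dept_before], "10.1.5.7", "10.2.3.10", 443))  # True
print(reachable([campus, dept_after], "10.1.5.7", "10.2.3.10", 443))   # False
print(reachable([campus, dept_after], "10.1.6.7", "10.2.3.10", 443))   # True
```

The campus policy never changed, yet end-to-end reachability for one subnet did. With many independently managed firewalls, auditing any single configuration in isolation cannot reveal this kind of effect.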
13. (Immediate) Open Issues
- Reachability and reliability of controller
- Service-level probes
- Diagnostic tools -> Service-level Happiness
- Policy conformance