Title: BGP Anomaly Detection in an ISP
1. BGP Anomaly Detection in an ISP
- Jian Wu (U. Michigan)
- Z. Morley Mao (U. Michigan)
- Jennifer Rexford (Princeton)
- Jia Wang (AT&T Labs)
http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
2. Goal
- Identify important anomalies
- Lost reachability
- Persistent flapping
- Large traffic shifts
- Contributions
- Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time
- Use the tool to characterize routing disruptions in an operational network
3. Capturing Routing Changes
Large operational network (8/16/2004 to 10/10/2004)
[Figure: network diagram. Labels: eBGP updates, iBGP best routes, BGP Monitor, CPE]
4. Challenges
- Large volume of BGP updates
- Millions daily, very bursty
- Too much for an operator to manage
- Different from root-cause analysis
- Identify changes and their effects
- Focus on actionable events
- Diagnose causes only in/near the AS
5. System Architecture
6. Grouping BGP Updates into Events
- Challenge: A single routing change
- leads to multiple update messages
- affects routing decisions at multiple routers
- Solution
- Group all updates for a prefix with inter-arrival time < 70 seconds
- Flag prefixes with changes lasting > 10 minutes as persistent flapping prefixes
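The grouping rule can be sketched in a few lines of Python. This is a minimal offline sketch, not the authors' tool: it assumes updates arrive as hypothetical `(timestamp_seconds, prefix)` pairs, and the function names are made up for illustration. Only the two thresholds (70 seconds, 10 minutes) come from the slides.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70          # seconds; threshold from the slides
CONVERGENCE_TIMEOUT = 600   # seconds (10 minutes); threshold from the slides


def group_updates(updates):
    """Group (timestamp, prefix) updates into per-prefix events.

    Consecutive updates for the same prefix belong to one event when
    their inter-arrival time is below EVENT_TIMEOUT.
    Returns a list of (prefix, start, end, n_updates) tuples.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < EVENT_TIMEOUT:
                count += 1                      # same event continues
            else:
                events.append((prefix, start, prev, count))
                start, count = ts, 1            # a new event begins
            prev = ts
        events.append((prefix, start, prev, count))
    return events


def persistent_flappers(events):
    """Prefixes with at least one event lasting longer than 10 minutes."""
    return sorted({p for p, start, end, _ in events
                   if end - start > CONVERGENCE_TIMEOUT})
```

A prefix updated every 60 seconds for 11 minutes forms a single event and is flagged; an online version would flag it as soon as the event passes the 10-minute mark rather than after the fact.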
7. Grouping Thresholds
- Based on data analysis and our understanding of BGP
- Event timeout: 70 seconds
- 2 × MRAI timer + 10 seconds
- 98% of inter-arrival times < 70 seconds
- Convergence timeout: 10 minutes
- BGP usually converges within minutes
- 99.9% of events last < 10 minutes
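The event timeout can be checked with one line of arithmetic. The slide does not state the MRAI value; the sketch below assumes the 30-second interval that RFC 4271 suggests for eBGP sessions, and reads the threshold as twice the MRAI timer plus a 10-second margin.

```latex
T_{\text{event}} = 2 \times T_{\text{MRAI}} + 10\,\text{s}
                 = 2 \times 30\,\text{s} + 10\,\text{s}
                 = 70\,\text{s}
```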
8. Persistent Flapping Prefixes
- Causes of persistent flapping
- Conservative damping parameters (78.6%)
- Protocol oscillations due to MED (18.3%)
- Unstable interface or BGP session (3.0%)
- Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
9. Example: Unstable eBGP Session
[Figure: labels Peer, ISP, p, Customer]
- Flap damping parameters are session-based
- Damping not implemented for iBGP sessions
10. Event Classification
- Challenge: Major concerns in network management
- Changes in reachability
- Heavy load of routing messages on the routers
- Change of flow of traffic through the network
[Figure: Events -> Event Classification -> Typed Events]
- Solution: Classify events by the severity of their impact
11. Event Category: No Disruption
[Figure: labels p, AS2, AS1, ISP; No Traffic Shift]
No Disruption: no border router has a traffic shift (50.3%)
12. Event Category: Internal Disruption
[Figure: labels p, AS2, AS1, ISP; Internal Traffic Shift]
Internal Disruption: all of the traffic shifts are internal traffic shifts (15.6%)
13. Event Category: Single External Disruption
[Figure: labels p, AS2, AS1, ISP; External Traffic Shift]
Single External Disruption: only one of the traffic shifts is an external traffic shift (20.7%)
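The three categories above can be written as a small decision function. This is a hedged sketch: the input format (a mapping from border router to `'none'`, `'internal'`, or `'external'`) and the function name are assumptions for illustration, and any event that fits none of the three categories shown on these slides is simply labeled `'other'`.

```python
def classify_event(shifts):
    """Classify one event from its per-router traffic shifts.

    shifts: dict mapping border router name (hypothetical) to one of
            'none', 'internal', 'external'.
    """
    kinds = [s for s in shifts.values() if s != "none"]
    if not kinds:
        return "no disruption"               # no router shifts traffic
    external = kinds.count("external")
    if external == 0:
        return "internal disruption"         # every shift is internal
    if external == 1:
        return "single external disruption"  # exactly one external shift
    return "other"                           # not covered by these slides
```

For example, `classify_event({"r1": "internal", "r2": "none"})` falls into the internal disruption category.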
14. Statistics on Event Classification
- First 3 categories have significant variations from day to day
- Updates per event depend on the event type and the number of affected routers
15. Event Correlation
- Challenge: A single routing change
- affects multiple destination prefixes
[Figure: Typed Events -> Event Correlation -> Clusters]
- Solution: Group events of the same type that occur close in time
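One way to read "same type, close in time" is a per-type sliding window. The sketch below is an illustration under assumptions: the `(timestamp, event_type, prefix)` tuple format is hypothetical, and the 60-second default window is a placeholder, since the slides do not give the value used.

```python
def correlate(events, window=60.0):
    """Cluster events of the same type that occur close in time.

    events: iterable of (timestamp, event_type, prefix) tuples.
    window: maximum gap in seconds between consecutive events of one
            cluster (assumed value, not taken from the slides).
    Returns a list of dicts with keys type, start, end, prefixes.
    """
    clusters = []
    latest = {}  # event type -> most recent cluster of that type
    for ts, etype, prefix in sorted(events):
        cluster = latest.get(etype)
        if cluster is not None and ts - cluster["end"] <= window:
            cluster["end"] = ts                  # extend the open cluster
            cluster["prefixes"].append(prefix)
        else:
            cluster = {"type": etype, "start": ts, "end": ts,
                       "prefixes": [prefix]}
            clusters.append(cluster)
            latest[etype] = cluster
    return clusters
```

Events of different types never merge, even when they overlap in time, which keeps a session reset from being lumped together with an unrelated internal change.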
16. eBGP Session Reset
- Caused most single external disruption events
- Check if the number of prefixes using that session as the best route changes dramatically
- Validation with Syslog router report (95%)
[Figure: number of prefixes vs. time, with session failure and session recovery marked]
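The session-reset check can be sketched as a scan over the per-session prefix count. This is an assumption-laden illustration: the slides say only that the count "changes dramatically", so the 50% threshold, the sampled `(timestamp, n_prefixes)` input, and the function name are all hypothetical.

```python
def detect_resets(counts, change_fraction=0.5):
    """Flag dramatic changes in how many prefixes use a session as best route.

    counts: time-ordered list of (timestamp, n_prefixes) samples for one
            eBGP session.
    change_fraction: relative change treated as "dramatic" (assumed value).
    Returns a list of (timestamp, 'failure' | 'recovery') marks.
    """
    marks = []
    for (_, n0), (t1, n1) in zip(counts, counts[1:]):
        base = max(n0, n1)
        if base == 0:
            continue                      # session carried nothing before or after
        change = (n1 - n0) / base
        if change <= -change_fraction:
            marks.append((t1, "failure"))   # prefix count collapsed
        elif change >= change_fraction:
            marks.append((t1, "recovery"))  # prefix count jumped back
    return marks
```

A series that drops from about 1000 prefixes to a handful and later returns yields one failure mark followed by one recovery mark, matching the shape of the figure on this slide.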
17. Hot-Potato Changes
- Hot-Potato Changes
- Caused internal disruption events
- Validation with OSPF measurement (95%) [Teixeira et al., SIGMETRICS '04]
Hot-potato routing: route to the closest egress point
[Figure: ISP network with destination P; numeric labels 10, 11, 9]
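Hot-potato routing reduces to picking the egress point with the smallest intra-domain (IGP) distance, so a change in IGP cost alone can move traffic. The sketch below assumes a hypothetical mapping from egress router to IGP cost; the router names, the costs, and the name-based tie-break are illustrative choices, not details from the slides.

```python
def closest_egress(igp_cost):
    """Return the egress router with the lowest IGP cost.

    igp_cost: dict mapping egress router name to the IGP distance from
              the ingress router. Ties are broken by router name
              (an assumption made here for determinism).
    """
    return min(sorted(igp_cost), key=igp_cost.get)


# Before an internal change, r1 is the closer egress point.
before = closest_egress({"r1": 10, "r2": 11})
# After r2's cost drops, traffic shifts to r2 without any eBGP change.
after = closest_egress({"r1": 10, "r2": 9})
```

This is why such events are classified as internal disruptions: the set of external routes is unchanged, yet the chosen egress point differs.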
18. Traffic Impact Prediction
- Challenge: Routing changes have different impacts on the network, depending on the popularity of the destinations
[Figure: Clusters + Netflow Data -> Traffic Impact Prediction -> Large Disruptions]
- Solution: Weigh each cluster by traffic volume
19. Traffic Impact Prediction
- Traffic weight
- Per-prefix measurement from Netflow
- 10% of prefixes account for 90% of traffic
- Traffic weight of a cluster
- Sum of the traffic weights of its prefixes
- A few clusters have large traffic weight
- Mostly session resets and hot-potato changes
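The cluster weight is a plain sum over per-prefix weights, which a short sketch makes concrete. The data layout is assumed for illustration: `prefix_weight` is a hypothetical dict of per-prefix traffic fractions (as might be derived from Netflow), and prefixes with no measurement count as zero.

```python
def cluster_weight(prefixes, prefix_weight):
    """Sum the traffic weight of the distinct prefixes in a cluster."""
    return sum(prefix_weight.get(p, 0.0) for p in set(prefixes))


def rank_clusters(clusters, prefix_weight):
    """Return (weight, cluster_index) pairs, heaviest cluster first.

    clusters: list of prefix lists, one per cluster.
    """
    scored = [(cluster_weight(c, prefix_weight), i)
              for i, c in enumerate(clusters)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Ranking by weight is what lets an operator look at only the few clusters at the top of the list, since most clusters touch prefixes that carry little traffic.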
20. Performance Evaluation
- Memory
- Static memory: current routes, 600 MB
- Dynamic memory: clusters, 300 MB
- Speed
- 99% of 1-second intervals of updates can be processed within 1 second
- Occasional execution lag
- Every 70-second interval of updates can be processed within 70 seconds
Measurements were taken on a 900 MHz CPU
21. Conclusion
- BGP anomaly detection
- Fast, online fashion
- Operator concerns (reachability, flapping, traffic)
- Significant information reduction
- Uncovered important network behaviors
- Persistent flapping prefixes
- Hot-potato changes
- Session resets and interface failures
22. Detecting Peering Violations
- Consistent export requirement
- Peer should advertise prefixes at all peering points, with the same AS path length
- Allows the AS to do hot-potato routing
- Detecting violations
- Using iBGP feeds from the border routers
- Some inference tricks to identify inconsistencies
- Results of the study
- http://www.nanog.org/mtg-0410/feamster.html
- http://www.cs.princeton.edu/~jrex/papers/imc04.pdf