Title: Virtually Eliminating Router Bugs
1. Virtually Eliminating Router Bugs
- Eric Keller, Minlan Yu, Matt Caesar, Jennifer Rexford
- NANOG 46, Philadelphia, PA
2. Dealing with router bugs
- The Internet's complexity is implemented in software running on routers
- Complexity leads to bugs
- String of high-profile vulnerabilities and outages
3. Feb. 16, 2009: SuproNet
- Announced a single prefix to a single provider
- Huge increase in the global rate of updates
- 10x increase in global instability for an hour
- 1 misconfiguration tickled 2 bugs (2 vendors)
Source: Renesys
[Figure: AS-path prepending incident. Labels: misconfiguration "as-path prepend 47868"; AS47878; prepended 252 times; MikroTik bug: no range check; AS29113: did not filter; after length > 255, Cisco bug: long AS paths; Notification]
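The "prepended 252 times" label is consistent with the missing range check. A minimal sketch of the arithmetic, assuming (the slide does not say this) that the prepend count was truncated to an 8-bit field, so the value 47868 entered as a count wraps around; the function name is hypothetical:

```python
# Assumption: the prepend count lives in an 8-bit field with no range check,
# so any configured value above 255 silently wraps modulo 256.
def truncated_prepend_count(configured_value: int, field_bits: int = 8) -> int:
    """Return the count that survives truncation to `field_bits` bits."""
    return configured_value % (1 << field_bits)


count = truncated_prepend_count(47868)
print(count)  # 47868 = 186 * 256 + 252
```

Under that assumption, a range check rejecting values above 255 would have stopped the misconfiguration at the source.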
4. Challenges of router bugs
- Bugs are different from traditional failures
  - They violate the protocol, cause cascading outages, require waiting for a vendor repair, and some are exploitable by attackers
- The problem is getting worse
  - Increasing demands on routing, vendors allowing third-party development, and other sources of outage becoming less common
5. Building bug-tolerant routers
- Our approach: run multiple, diverse instances of router software in parallel
- Instances vote on FIB contents and on update messages sent to peers
- Old idea applied to routing
  - New opportunities, e.g. little dependence on past state
  - New challenges, e.g. transient behavior
- Not just N-version programming
  - Can also diversify the execution environment
- Working prototype
  - Based on Linux, tested with XORP and Quagga
6. Categorizing Faults
Source: XORP and Quagga Bugzilla databases
7. Diverse Replication
- Is it necessary?
  - Only 41% of bugs cause a crash/hang
  - The rest cause Byzantine faults
- Is it possible?
  - Multi-core CPUs and cheap DRAM
  - Operators already use redundancy
- Is it effective?
  - Both software and data diversity are effective (discussed later)
8. Traditional router
[Figure: traditional router; routing entry 12.0.0.0/8 → IF 2 shown twice]
9. Bug-tolerant router: Architecture
- Requirements for making it transparent
  - Sharing network state
  - Sending single update (to FIB or to peer)
  - Maintaining replicas (hiding churn)
10. Bug-tolerant router (Making Transparent)
Sharing network state: send received updates to each instance
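Sharing network state amounts to fanning out every update received from a peer to all running instances. A minimal sketch of that idea; the class and method names are hypothetical and not taken from the prototype:

```python
from typing import Callable, Dict


class UpdateFanout:
    """Hypothetical helper: deliver each received update to every instance."""

    def __init__(self) -> None:
        self._sinks: Dict[str, Callable[[bytes], None]] = {}

    def register(self, name: str, deliver: Callable[[bytes], None]) -> None:
        self._sinks[name] = deliver

    def unregister(self, name: str) -> None:
        self._sinks.pop(name, None)

    def on_update_from_peer(self, update: bytes) -> int:
        # Copy the sink list so an instance can be removed while iterating.
        for deliver in list(self._sinks.values()):
            deliver(update)
        return len(self._sinks)


# Usage: two in-memory inboxes stand in for two router instances.
inbox_a: list = []
inbox_b: list = []
fanout = UpdateFanout()
fanout.register("instance-1", inbox_a.append)
fanout.register("instance-2", inbox_b.append)
delivered = fanout.on_update_from_peer(b"UPDATE 12.0.0.0/8")
```

Each instance sees the same input stream, so any disagreement in their outputs comes from the instances themselves rather than from what they were told.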
11. Bug-tolerant router (Making Transparent)
Send single update per prefix: vote on FIB entries
Note: forwarding is unaffected; there is still a single table
Example FIB entry: 12.0.0.0/8 → IF 2
12. Bug-tolerant router (Making Transparent)
Send single update per prefix: vote on update messages
Example FIB entry: 12.0.0.0/8 → IF 2
13. Bug-tolerant router (Making Transparent)
Maintaining replicas: remove buggy/crashed instances
Hiding churn from neighboring routers
Example FIB entry: 12.0.0.0/8 → IF 2
14. Bug-tolerant router (Making Transparent)
Maintaining replicas: start/bootstrap new diverse instance
Hiding churn from neighboring routers
Example FIB entry: 12.0.0.0/8 → IF 2
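The replica-maintenance steps (remove a faulty instance, bootstrap a new diverse one, keep neighbors unaware) can be sketched as a small pool. This is only an illustration: the names are hypothetical, and bootstrapping by replaying a log of received updates is an assumption, since the slides do not say how a new instance obtains its state.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Dict, Iterator, List


@dataclass
class Instance:
    name: str
    variant: str                              # e.g. implementation or version
    rib_in: List[str] = field(default_factory=list)


class ReplicaPool:
    """Hypothetical pool that replaces instances without touching peers."""

    def __init__(self, variants: List[str], size: int) -> None:
        self._variants: Iterator[str] = cycle(variants)
        self._log: List[str] = []             # updates received from peers
        self._next_id = 0
        self.instances: Dict[str, Instance] = {}
        for _ in range(size):
            self._start()

    def _start(self) -> Instance:
        # Assumption: a new instance is bootstrapped by replaying the log.
        inst = Instance(f"inst{self._next_id}", next(self._variants), list(self._log))
        self._next_id += 1
        self.instances[inst.name] = inst
        return inst

    def receive(self, update: str) -> None:
        self._log.append(update)
        for inst in self.instances.values():
            inst.rib_in.append(update)

    def replace(self, faulty: str) -> Instance:
        # The peer-facing session is owned by the pool, not the instance,
        # so neighbors never observe this restart.
        del self.instances[faulty]
        return self._start()


pool = ReplicaPool(["quagga", "xorp"], size=2)
pool.receive("announce 12.0.0.0/8")
fresh = pool.replace("inst0")
```

A real system would need to bound the replay log, for example by keeping only the latest update per prefix.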
15. Voting Algorithms
- Wait-for-consensus: handling transience
  - Output when a majority of instances agree
- Master-Slave: speeding reaction time
  - Output the master's answer
  - Slaves used for detection
  - Switch to a slave on buggy behavior
- Continuous Majority: hybrid
  - Voting rerun when any instance sends an update
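The three voting strategies can be sketched as follows. This is an illustrative reading of the bullets, not the prototype's code: all names are hypothetical, a vote is modeled as a proposed next hop per instance, "buggy behavior" of the master is approximated as disagreeing with a majority, and Continuous Majority is assumed to output the current plurality without waiting for a strict majority.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Votes = Dict[str, Optional[str]]  # instance name -> proposed next hop (None = no answer yet)


def wait_for_consensus(votes: Votes) -> Optional[str]:
    """Output a value only once a strict majority of all instances agree on it."""
    tally = Counter(v for v in votes.values() if v is not None)
    if not tally:
        return None
    value, count = tally.most_common(1)[0]
    return value if count > len(votes) // 2 else None


def master_slave(votes: Votes, master: str) -> Tuple[Optional[str], bool]:
    """Output the master's answer at once; flag the master as suspect
    when a majority of instances agree on a different answer."""
    answer = votes[master]
    majority = wait_for_consensus(votes)
    suspect = majority is not None and majority != answer
    return answer, suspect


class ContinuousMajority:
    """Rerun the vote on every update and track the most common answer."""

    def __init__(self, instances: List[str]) -> None:
        self.votes: Votes = {name: None for name in instances}
        self.current: Optional[str] = None

    def on_update(self, instance: str, value: Optional[str]) -> Tuple[Optional[str], bool]:
        self.votes[instance] = value
        tally = Counter(v for v in self.votes.values() if v is not None)
        leader = tally.most_common(1)[0][0] if tally else None
        changed = leader != self.current
        self.current = leader
        return leader, changed  # emit a new output only when `changed` is True
```

The trade-off the slide names shows up directly: `wait_for_consensus` stays silent until enough instances have answered, `master_slave` answers immediately but may have to correct itself, and `ContinuousMajority` answers early and revises as more votes arrive.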
16. Prototype
- Based on Linux with open source routing software (XORP and Quagga)
- No router software modification
- Detect and recover from faults
- Low complexity
17. Prototype: Wrapping Software
[Figure: prototype architecture. Labels: Router Instance 1 and Router Instance 2 (both unmodified), each with hv-libc; virtd; hypervisor; outputs "To Peer" and "To FIB"]
- Intercept socket-based communication
- Bootstrap new instances
- Intercept interactions with FIB
- Tested with Quagga (v0.98.6-0.99.10) and XORP (v1.5-v1.6) on Linux
18. Prototype: Dealing with Faults
- Detection: generalized to the following
  - Instance sending an update when it should not
  - Instance not sending an update when it should
  - Instance sending the wrong update
  - Instance causing a detectable system event (e.g. crash)
- Recovery
  - Kill the faulty instance, start a new (diverse) instance
  - For master-slave: correct any previously advertised update that the previous master had wrong
Callout: detected with voting
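The four detection cases can be expressed as a comparison between each instance's output and the voted result. A minimal sketch with hypothetical names; it assumes the vote has already settled and ignores the timeouts a real detector would need to separate a slow instance from a silent one.

```python
from enum import Enum
from typing import Dict, List, Optional


class Fault(Enum):
    NONE = "ok"
    SPURIOUS = "sent an update when it should not"
    MISSING = "did not send an update when it should"
    WRONG = "sent the wrong update"
    CRASH = "detectable system event"


def classify(output: Optional[str], agreed: Optional[str], alive: bool = True) -> Fault:
    """Compare one instance's output with the voted result (None = no update)."""
    if not alive:
        return Fault.CRASH      # seen as a system event, no vote needed
    if output == agreed:
        return Fault.NONE
    if agreed is None:
        return Fault.SPURIOUS
    if output is None:
        return Fault.MISSING
    return Fault.WRONG


def faulty_instances(outputs: Dict[str, Optional[str]],
                     agreed: Optional[str],
                     alive: Dict[str, bool]) -> List[str]:
    """Names of instances that should be killed and replaced."""
    return [name for name, out in outputs.items()
            if classify(out, agreed, alive.get(name, True)) is not Fault.NONE]
```

For example, `faulty_instances({"a": "IF2", "b": "IF1", "c": None}, "IF2", {})` flags `b` (wrong update) and `c` (missing update) for replacement.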
19. Evaluating Key Assumptions
- It is possible to achieve diversity across routing instances
- It is possible for routers to handle the additional overhead of running multiple instances
- Running multiple router replicas does not substantially worsen convergence
20. Achieving Diversity
- General diversity (e.g. memory space layout)
  - Not studied here
- Data diversity
  - Taxonomized XORP and Quagga bug databases
  - Selected two from each to reproduce and avoid
21. Achieving Diversity
- Software diversity
  - Version: static analysis
    - Overlap decreases quickly between versions
    - Only 25% overlap in Quagga 0.99.9 and 0.99.1
    - 30% of bugs in Quagga 0.99.9 not in 0.99.1
  - Implementation: small verification
    - Picked 10 from XORP, 10 from Quagga
    - Set up test to trigger each bug
    - None were present in the other implementation
22. Feb. 16, 2009: SuproNet
- Recall: 1 misconfiguration tickled 2 bugs
- Bug 1: MikroTik range-check bug
  - Version diversity (fixed in latest version)
- Bug 2: Cisco long AS path bug
  - Configuration diversity (an alternate configuration avoids the bug)
23. Processing Overhead
- Replayed RouteViews traces on a 3 GHz Xeon
- Measured pass-through time (PTT)
- At normal rates
  - Overhead of just the hypervisor is 0.1%
  - 5 instances increase PTT by 4.6%
- At 3000x rates
  - 5 instances increase PTT by 23%
- With MRAI, overhead is dwarfed by MRAI time
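To see why the overhead is dwarfed by MRAI, the figures can be put on a common scale. A rough sketch, assuming the slide's numbers are percentage increases in PTT; the baseline PTT below is a made-up placeholder, not a measurement from the talk, and the 30 s MRAI is the eBGP value suggested in RFC 4271.

```python
# Assumptions (not from the slides): BASE_PTT_MS is a hypothetical placeholder,
# and MRAI_MS uses the 30 s eBGP value suggested in RFC 4271.
BASE_PTT_MS = 1.0
MRAI_MS = 30_000.0


def added_delay_ms(base_ptt_ms: float, overhead_pct: float) -> float:
    """Extra pass-through time implied by a percentage overhead."""
    return base_ptt_ms * overhead_pct / 100.0


for label, pct in [("hypervisor only", 0.1),
                   ("5 instances, normal rate", 4.6),
                   ("5 instances, 3000x rate", 23.0)]:
    extra = added_delay_ms(BASE_PTT_MS, pct)
    print(f"{label}: +{extra:.4f} ms ({extra / MRAI_MS:.1e} of one MRAI interval)")
```

Whatever the true baseline, a percentage of a per-update processing time stays orders of magnitude below a timer measured in tens of seconds.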
24. Effect on Convergence
- Simulated network of BTRs (8 instances)
  - Entire AS-level topology
  - AS 3967 internal network
  - Cliques (various sizes)
  - No significant change in convergence (beyond the pass-through time)
- Simulated 8 virtual networks
  - Slightly shorter convergence time
  - Much greater number of update messages
25. Discussion
- Server-based read-only operation
  - Instances run on a server to cross-check
  - Migrate instance upon fault
- Network-wide deployment
  - Parallel networks instead of parallel instances (enables protocol diversity)
- Process-level deployment
  - Reduce overhead by sharing RIB
- Leveraging existing redundancy
26. Conclusions
- Our design has several benefits
  - First step in building bug-tolerant networks
  - Diverse replication is both viable and effective
  - Prototype shows improved robustness to bugs with tolerable additional delay
- Next step?
  - Looking for a place to deploy: anyone?
  - Automate diversity mechanisms