1
Virtually Eliminating Router Bugs
  • Eric Keller, Minlan Yu, Matt Caesar, Jennifer Rexford
  • NANOG 46, Philadelphia, PA

2
Dealing with router bugs
  • The Internet's complexity is implemented in
    software running on routers
  • Complexity leads to bugs
  • A string of high-profile vulnerabilities and
    outages

3
Feb. 16, 2009: SuproNet
  • Announced a single prefix to a single provider
  • Huge increase in the global rate of updates
  • 10x increase in global instability for an hour
  • 1 misconfiguration tickled 2 bugs (2 vendors)

(Diagram, source: Renesys. SuproNet (AS47868) misconfigured
"as-path prepend 47868", typing its AS number where the prepend
count belonged; a MikroTik bug with no range check prepended the AS
252 times. Upstream AS29113 did not filter the announcement, and
once the AS path grew past length 255, a Cisco bug handling long AS
paths triggered BGP NOTIFICATIONs.)
4
Challenges of router bugs
  • Bugs are different from traditional failures
  • They violate the protocol, cause cascading
    outages, require waiting for the vendor to
    repair them, and some are exploitable by
    attackers
  • The problem is getting worse
  • Increasing demands on routing, vendors allowing
    3rd-party development, other sources of outage
    becoming less common

5
Building bug-tolerant routers
  • Our approach: run multiple, diverse instances of
    router software in parallel
  • Instances vote on FIB contents and on update
    messages sent to peers
  • An old idea applied to routing
  • New opportunities, e.g., small dependence on
    past state
  • New challenges, e.g., transient behavior
  • Not just N-version programming
  • Can also diversify the execution environment
  • Working prototype
  • Based on Linux, tested with XORP and Quagga

6
Categorizing Faults
(Chart: taxonomy of bugs from the XORP and Quagga Bugzilla databases)
7
Diverse Replication
  • Is it necessary?
  • Only 41% of bugs cause a crash/hang
  • The rest cause Byzantine faults
  • Is it possible?
  • Multi-core CPUs and cheap DRAM
  • Operators already use redundancy
  • Is it effective?
  • Both software and data diversity are effective
    (discussed later)

8
Traditional router
(Diagram: a single routing process computes the FIB entry
12.0.0.0/8 → IF 2 and installs it in the forwarding table)
9
Bug-tolerant router: Architecture
  • Requirements for making it transparent
  • Sharing network state
  • Sending a single update (to the FIB or to a peer)
  • Maintaining replicas (hiding churn)

10
Bug-tolerant router (making it transparent)
Sharing network state: send received updates to
each instance
11
Bug-tolerant router (making it transparent)
Send a single update per prefix: vote on FIB
entries. Note: forwarding is unaffected; there
is still a single table
(Diagram: voted FIB entry 12.0.0.0/8 → IF 2)
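As a concrete illustration of the per-prefix FIB vote, here is a
minimal Python sketch (the function name and the simple-majority rule
are our assumptions; the prototype votes inside the hypervisor):

```python
from collections import Counter

def vote_fib_entry(prefix, proposals):
    """Pick a single FIB entry from the next hops proposed by each
    router instance. Returns the majority next hop, or None if no
    strict majority exists (entry left unchanged until one does)."""
    winner, votes = Counter(proposals).most_common(1)[0]
    return winner if votes > len(proposals) // 2 else None

# Two of three instances agree, so one entry reaches the single FIB.
assert vote_fib_entry("12.0.0.0/8", ["IF2", "IF2", "IF1"]) == "IF2"
```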
12
Bug-tolerant router (making it transparent)
Send a single update per prefix: vote on update
messages
(Diagram: 12.0.0.0/8 → IF 2)
13
Bug-tolerant router (making it transparent)
Maintaining replicas: remove buggy/crashed
instances, hiding churn from neighboring routers
(Diagram: 12.0.0.0/8 → IF 2)
14
Bug-tolerant router (making it transparent)
Maintaining replicas: start/bootstrap a new diverse
instance, hiding churn from neighboring routers
(Diagram: 12.0.0.0/8 → IF 2)
15
Voting Algorithms
  • Wait-for-consensus: handles transience
  • Output when a majority of instances agree
  • Master-slave: speeds reaction time
  • Output the master's answer
  • Slaves are used for detection
  • Switch to a slave on buggy behavior
  • Continuous majority: a hybrid
  • Voting is rerun whenever any instance sends an
    update
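A hedged sketch of the first two schemes in Python (names and the
detection rule are illustrative assumptions, not the prototype's code):

```python
from collections import Counter

def wait_for_consensus(answers, num_instances):
    """Publish an update only once a strict majority of ALL instances
    agree; instances that have not answered yet simply delay output.
    answers: dict of instance_id -> proposed update."""
    if not answers:
        return None                      # nothing to vote on yet
    update, votes = Counter(answers.values()).most_common(1)[0]
    return update if votes > num_instances // 2 else None

def master_slave(answers, master_id):
    """Publish the master's answer immediately; slaves serve only to
    detect buggy behavior. Returns (output, master_suspect): when the
    flag is set, the caller switches to a slave."""
    master_answer = answers.get(master_id)
    slaves = [a for i, a in answers.items() if i != master_id]
    master_suspect = bool(slaves) and all(a != master_answer for a in slaves)
    return master_answer, master_suspect
```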

16
Prototype
  • Based on Linux with open source routing software
    (XORP and Quagga)
  • No router software modification
  • Detect and recover from faults
  • Low complexity

17
Prototype: Wrapping Software
(Diagram: two unmodified router instances, each wrapped by hv-libc;
a virtd process and hypervisor layer sit between the instances and
the peers and the FIB)
  • Intercept socket-based communication
  • Bootstrap new instances
  • Intercept interactions with FIB
  • Tested with Quagga (v0.98.6-0.99.10) and XORP
    (v1.5-v1.6) on Linux
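The fan-out of received updates can be pictured with a short Python
sketch (the real prototype does this transparently by intercepting
socket calls in hv-libc/virtd; this standalone loop is only an
illustration of the idea):

```python
def fan_out(peer_sock, instance_socks):
    """Copy update bytes from the single peer-facing TCP session to
    every router instance, so all instances see identical input and
    the peer sees only one BGP session."""
    while True:
        data = peer_sock.recv(4096)
        if not data:                 # peer closed the session
            break
        for sock in instance_socks:
            sock.sendall(data)       # each unmodified instance
                                     # believes it talks to the peer
```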

18
Prototype: Dealing with Faults
  • Detection generalized to the following cases
  • An instance sends an update when it should not
  • An instance does not send an update when it
    should
  • An instance sends the wrong update
  • An instance causes a detectable system event
    (e.g., a crash)
  • The first three cases are detected with voting
  • Recovery
  • Kill the faulty instance, start a new (diverse)
    instance
  • For master-slave, also correct any previously
    advertised update that the previous master got
    wrong
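A sketch of how these detection cases and the recovery step might look
in code (helper names such as start_diverse_instance are hypothetical):

```python
def classify(instance_update, voted_update):
    """Compare one instance's output against the voted consensus.
    Returns 'ok', 'missing' (did not send when it should), 'spurious'
    (sent when it should not), or 'wrong' (sent a different update)."""
    if instance_update == voted_update:
        return "ok"
    if instance_update is None:
        return "missing"
    if voted_update is None:
        return "spurious"
    return "wrong"

def recover(instances, faulty_id, start_diverse_instance):
    """Kill the faulty instance and bootstrap a new, diverse replica
    (e.g., a different software version or execution environment)."""
    instances.pop(faulty_id).kill()
    instances[faulty_id] = start_diverse_instance()
```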
19
Evaluating Key Assumptions
  • It is possible to achieve diversity across
    routing instances
  • It is possible for routers to handle the
    additional overhead of running multiple instances
  • Running multiple router replicas does not
    substantially worsen convergence

20
Achieving Diversity
  • General diversity (e.g., memory-space layout)
  • Not studied here
  • Data diversity
  • Taxonomized the XORP and Quagga bug databases
  • Selected two bugs from each to reproduce and
    avoid

21
Achieving Diversity
  • Software diversity
  • Versions: static analysis
  • Overlap decreases quickly between versions
  • Only 25% overlap between Quagga 0.99.9 and
    0.99.1
  • 30% of bugs in Quagga 0.99.9 are not in 0.99.1
  • Implementations: small-scale verification
  • Picked 10 bugs from XORP, 10 from Quagga
  • Set up a test to trigger each bug
  • None were present in the other implementation

22
Feb. 16, 2009: SuproNet
  • Recall: 1 misconfiguration tickled 2 bugs
  • Bug 1: MikroTik range-check bug
  • Version diversity (fixed in the latest version)
  • Bug 2: Cisco long-AS-path bug
  • Configuration diversity (an alternate
    configuration avoids the bug)

23
Processing Overhead
  • Replayed RouteViews traces on a 3 GHz Xeon
  • Measured pass-through time (PTT)
  • At normal rates
  • Overhead of the hypervisor alone is 0.1%
  • 5 instances increase PTT by 4.6%
  • At 3000x rates
  • 5 instances increase PTT by 23%
  • With MRAI, the overhead is dwarfed by the MRAI
    timer
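A pass-through-time measurement could be as simple as the following
sketch (inject/collect are hypothetical hooks around the BTR; the
actual experimental harness is not described at this level):

```python
import time

def pass_through_time(inject, collect, update):
    """Timestamp a replayed update entering the bug-tolerant router
    and the voted update leaving it; the difference is the PTT."""
    t0 = time.monotonic()
    inject(update)       # replay one RouteViews update into the BTR
    voted = collect()    # block until the voted output appears
    return voted, time.monotonic() - t0
```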

24
Effect on Convergence
  • Simulated a network of BTRs (8 instances each)
  • Entire AS-level topology
  • AS 3967's internal network
  • Cliques (various sizes)
  • No significant change in convergence (beyond the
    pass-through time)
  • Also simulated 8 virtual networks
  • Slightly shorter convergence time
  • Much greater number of update messages

25
Discussion
  • Server-based read-only operation
  • Instances run on a server to cross-check
  • Migrate to an instance upon a fault
  • Network-wide deployment
  • Parallel networks instead of parallel instances
    (enables protocol diversity)
  • Process-level deployment
  • Reduce overhead by sharing the RIB
  • Leveraging existing redundancy

26
Conclusions
  • Our design has several benefits
  • First step in building bug-tolerant networks
  • Diverse replication is both viable and effective
  • Prototype shows improved robustness to bugs with
    tolerable additional delay
  • Next steps?
  • Looking for a place to deploy. Anyone?
  • Automate diversity mechanisms