Title: Scott Poretsky
1 Core Router Testing for High Availability
- Scott Poretsky
- Avici Systems, Inc.
- June 3, 2002
2 Outline
- IP Network Availability
- Test Coverage for 99.999% Availability
- Commercial Test Equipment Requirements
3 IP Network Availability
4 High Reliability = More Revenue
- Reliability is the single biggest criterion in
selecting an ISP, according to Interactive
Week/Telechoice
[Chart: ISP Customer Survey. Relative Importance (axis from 4 to 4.8) of Reliability, Value, Performance, Customer Service, and Provisioning Speed]
New IP services demand higher levels of network
reliability
5 High Reliability = More Profit
- Compensation for poor router reliability through
redundancy and interconnects can increase network
cost by up to 50%
[Diagram labels: IP Backbone; Service Provider; Peering; Peer; Core Layer (Backbone Router); Aggregation Layer (Hub Router); Edge Layer; Access Devices: VOIP, DSLAM, L3/4 Switch, CMTS, GGSN, Direct Connects]
6 Definitions
- Reliable
- Capable of being dependable (Webster)
- Availability
- Measure of Reliability using router/switch Uptime
- Mission Reliability
- Mean Time Between Critical Failures (MTBCF), or
the average time between hardware or software
failures that interrupt service (the mission)
- Maintenance Reliability
- Mean Time Between Failures (MTBF), or the average
time between hardware failures that require
corrective maintenance actions
- Defects Per Million (DPM)
- Measure of downtime equal to (1 - Availability) x 10^6
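The DPM definition can be made concrete with a little arithmetic. The following is a minimal sketch, assuming availability is expressed as a fraction (0.99999 for 99.999%) and a 365-day year; the function names are illustrative and not from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def defects_per_million(availability: float) -> float:
    """DPM = (1 - Availability) x 10^6, with availability given as a fraction."""
    return (1.0 - availability) * 1e6


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in [("99.9%", 0.999), ("99.999%", 0.99999), ("99.9999%", 0.999999)]:
    dpm = defects_per_million(a)
    minutes = downtime_minutes_per_year(a)
    print(f"{label}: DPM = {dpm:.0f}, downtime = {minutes:.2f} min/year")
```

Under these assumptions, 99.9% availability corresponds to 1,000 DPM (about 8.8 hours of downtime per year), while 99.999% corresponds to 10 DPM (about 5.3 minutes per year).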
7 Contributing Factors for Availability
[Timeline, not to scale: Mission Reliability. Total Time to Restore Router/Switch After a Software Failure, from "Software Failure Occurs" to "Full Operation Restored"]
- CrashDump Time
- Boot Time
- Protocol Convergence Time
- Image Upgrade Time
[Timeline, not to scale: Maintenance Reliability. Total Time to Restore a Module After a Hardware Failure, from "Hardware Failure Occurs" to "Full Operation Restored"]
- Maintainer Response Time
- Boot Time
- Protocol Convergence Time
- Removal and Replacement Time
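To see how restore time and time between failures combine into an availability figure, here is a rough sketch. It assumes the standard steady-state relation Availability = MTBCF / (MTBCF + restore time), which the slides do not state explicitly, and the component durations are hypothetical values chosen only for illustration.

```python
def total_restore_seconds(components: dict) -> float:
    """Sum the restore-time components (all in seconds)."""
    return float(sum(components.values()))


def availability(mtbcf_hours: float, restore_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbcf_hours / (mtbcf_hours + restore_hours)


def required_mtbcf_hours(target_availability: float, restore_hours: float) -> float:
    """MTBCF needed to reach a target availability for a given restore time."""
    return target_availability * restore_hours / (1.0 - target_availability)


# Hypothetical software-failure restore components in seconds (not from the slides).
software_restore = {"crash_dump": 60, "boot": 180, "protocol_convergence": 60}
restore_hours = total_restore_seconds(software_restore) / 3600.0

for target in (0.999, 0.99999):
    needed = required_mtbcf_hours(target, restore_hours)
    print(f"{target:.3%}: MTBCF >= {needed:.1f} hours")
```

With this hypothetical 5-minute restore time, 99.9% needs a critical failure no more often than roughly every 83 hours, while 99.999% needs roughly 8,300 hours (about 347 days) between critical failures. Shortening any restore component lowers the required MTBCF proportionally.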
8 The Availability Goal
- The Goal: 99.999% Router Availability
- The Reality: 99.9% Router Availability
- Features to achieve 99.999% availability
- Non-Stop Routing
- Graceful Restart
- What if testing could improve Mission
Reliability to achieve 99.999% Availability in
the absence of new features?
- What if the addition of these new features would
then achieve 99.9999% Availability?
9 Test Coverage
10 Traditional Test Coverage
- Isolated testing of protocols
- Functionality
- Conformance
- Interoperability
- Scaling
- Forwarding Performance in the absence of
protocols
- Disadvantages
- Operational environment is not tested
- Operational conditions are not tested
- The router under test is not completely stressed.
- Deployed routers run multiple protocols
simultaneously.
11 Test Program for 99.999% Availability
- Stress Testing
- Longevity Testing
- Convergence Testing
- Network-Specific Topology Testing
- Automated Regression Testing
12 Stress Testing
- Simultaneous configuration and scaling of
multiple protocols
- BGP, IGP
- MPLS-TE, LDP (optional)
- MBGP, PIM-SM, MSDP (optional)
- Traffic Forwarding
- Line Rate Traffic Forwarding
- Overutilize links
- Enable QoS
- Network Instability
- Repeated Route Flaps
- Link Loss
- Tunnel Reroutes (optional)
- Serviceability
- Repeated SNMP Gets
- Logging Enabled
- Debug Enabled
- Telnet with SHOW commands (stressful and invalid)
13 Stress Configuration
[Diagram labels: Router Under Test; Neighbor Router (x2); Optional Neighbor Router for Tunnel Reroutes; Test Equipment (x3)]
14 Stress Execution Guidelines
- Configure ECMP, Parallel Paths, and Composite
Links between routers
- Use Live BGP Feed for Route Table
- Mix traffic types across links (IP Unicast, IP
Multicast, MPLS)
- One neighbor router should be from a different vendor
to show interoperability under stress
- Run Stress for many days (if the router lasts
that long)
- The router should experience more in a couple of days
than it likely would in its operational lifetime.
15 Typical Stress Metrics
- Flap 1 million BGP routes per hour
- Forward 10 Terabits of data per hour
- Perform 100,000 SNMP Gets per hour
- Simulate 100 fiber cuts per hour (use every
remote interface)
- Along with
- Full BGP Table
- Full IGP Table
- Full Multicast Cache
- Required MPLS-TE Tunnels (protection optional)
- Required LDP FECs
- Enable Logging and Protocol Debug
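When scripting a stress run, the hourly targets above usually have to be turned into per-second rates or intervals for the test equipment. The sketch below does that conversion; it assumes "Terabits" means 10^12 bits and that events are spread evenly over the hour, and the dictionary keys are illustrative names only.

```python
SECONDS_PER_HOUR = 3600.0

# Hourly targets taken from the metrics list; key names are illustrative.
HOURLY_TARGETS = {
    "bgp_route_flaps": 1_000_000,
    "snmp_gets": 100_000,
    "fiber_cuts": 100,
}
BITS_FORWARDED_PER_HOUR = 10e12  # 10 Terabits, assuming 1 Tb = 10^12 bits


def per_second(per_hour: float) -> float:
    """Average rate per second for an hourly target."""
    return per_hour / SECONDS_PER_HOUR


def seconds_between_events(per_hour: float) -> float:
    """Average spacing between events for an hourly target."""
    return SECONDS_PER_HOUR / per_hour


for name, target in HOURLY_TARGETS.items():
    print(f"{name}: {per_second(target):.2f}/s, one every {seconds_between_events(target):.3f} s")
print(f"traffic: {per_second(BITS_FORWARDED_PER_HOUR) / 1e9:.2f} Gb/s sustained average")
```

Spread evenly, these targets work out to roughly 278 route flaps per second, about 28 SNMP Gets per second, one simulated fiber cut every 36 seconds, and an average of about 2.8 Gb/s of forwarded traffic.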
16 Longevity Testing
- Similar to Stress Testing, but with more operational
(less stressful) conditions injected over many
weeks
- Simultaneous configuration and scaling of
multiple protocols
- Traffic Forwarding
- More realistic Network Instability
- More typical Serviceability actions
- Use Live Internet feed.
17 Convergence Terms
- Network Convergence
- The point in time at which all nodes in a network
have updated their routing tables for a route
entry change (new, withdrawal, or modification)
- Protocol Convergence
- The point in time at which a single node updates
its routing table and advertises the route table
change to its peer in a routing protocol
advertisement (or update) message
- Route Convergence
- The point in time at which a single node updates
its routing table and reroutes traffic out the
new interface
- Route Convergence is the common Router Benchmark.
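The slides do not say how Route Convergence is measured. One common approach, shown here only as an assumption, is to offer traffic at a constant rate through the router under test and derive the convergence time either from the packets lost during the reroute or from timestamps captured by the test equipment. Both helper functions are hypothetical.

```python
def convergence_from_loss(packets_lost: int, offered_pps: float) -> float:
    """Convergence time in seconds, derived from loss at a constant offered rate."""
    if offered_pps <= 0:
        raise ValueError("offered_pps must be positive")
    return packets_lost / offered_pps


def convergence_from_timestamps(failure_time_s: float, first_rx_on_new_interface_s: float) -> float:
    """Convergence time in seconds, from the failure event to the first packet on the new interface."""
    return first_rx_on_new_interface_s - failure_time_s


# Example: 50,000 packets lost while offering 100,000 packets per second.
print(convergence_from_loss(50_000, 100_000), "seconds")
```

The loss-derived method only holds if the offered rate stays constant and all loss during the measurement window is caused by the reroute, which is one reason instability in another protocol complicates convergence testing.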
18 Convergence Test Issues
- Large number of protocols for which convergence is
important
- Number of conditions that can impact results
- Technical difficulty in testing convergence of
one protocol due to flap or instability of
another protocol
19 Convergence Test Conditions
- Interface shutdown
- on Local Interface
- on Remote Interface
- Fiber Pull
- on Local Interface
- on Remote Interface
- Peer removal via CLI
- on Local router
- on Peer router
- Peer node failure
- Route Table changes
- Route Withdrawal
- Route Flap
- Next-Hop Change
- Metric Change
- Dynamic Constraint Change
- Policy Change
All conditions must be tested because different
results can be produced.
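Because every condition has to be covered for every protocol of interest, it can help to generate the test matrix rather than maintain it by hand. This is a minimal sketch: the condition names follow the list above, while the protocol selection and the `build_test_matrix` helper are assumptions made for illustration.

```python
from itertools import product

# Conditions from the list above, flattened into individual test conditions.
CONDITIONS = [
    "Interface shutdown on local interface",
    "Interface shutdown on remote interface",
    "Fiber pull on local interface",
    "Fiber pull on remote interface",
    "Peer removal via CLI on local router",
    "Peer removal via CLI on peer router",
    "Peer node failure",
    "Route withdrawal",
    "Route flap",
    "Next-hop change",
    "Metric change",
    "Dynamic constraint change",
    "Policy change",
]

# Hypothetical protocol selection; a real plan would list every protocol deployed.
PROTOCOLS = ["BGP", "OSPF", "ISIS"]


def build_test_matrix(conditions, protocols):
    """Return one test case per (protocol, condition) pair."""
    return [{"protocol": p, "condition": c} for p, c in product(protocols, conditions)]


matrix = build_test_matrix(CONDITIONS, PROTOCOLS)
print(len(matrix), "convergence test cases")
```

Not every pairing is meaningful for every protocol (a dynamic constraint change, for example, applies to constraint-based routing), so a real plan would filter the generated matrix.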
20 Network-Specific Topology Testing
- Large network with many routers (e.g. 10)
- Use multiple vendors for interoperability/functionality
testing
- Multiple protocols configured in deployment
scenario
- Run test cases to match deployment scenario
21 Automated Regression Testing
- Addition of bug fixes/new features puts previously
working features at risk.
- Regression testing ensures that the previously
working features still work.
- As the number of releases with new features grows,
it is more difficult to provide complete
regression coverage through manual testing
(increasingly labor intensive).
- Automated regression testing enables more
coverage in less time.
- Automation is typically achieved using TCL
scripts.
- Configuration
[Diagram labels: Router Under Test; Test Equipment]
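The core check in regression testing is a comparison of the current results against a baseline of previously working test cases. The slides note that automation is typically written in TCL; the sketch below uses Python purely for illustration, and the test-case names and the `find_regressions` helper are hypothetical.

```python
def find_regressions(baseline: dict, current: dict) -> list:
    """Return test cases that passed in the baseline but do not pass now.

    A test case missing from the current run is treated as a regression,
    since lost coverage should not go unnoticed.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


# Hypothetical results from the previous and the current release.
baseline = {"bgp_route_flap": True, "ospf_adjacency": True, "pim_sm_join": False}
current = {"bgp_route_flap": True, "ospf_adjacency": False, "pim_sm_join": False}

print(find_regressions(baseline, current))
```

Tests that already failed in the baseline are deliberately excluded, so the report contains only features that worked before and are now broken.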
22 Commercial Test Equipment Requirements
23 The State of the Union
- Test Equipment fails to meet today's requirements
for testing 99.999% Availability.
- Router vendors have been forced to develop their
own specialized test tools.
- Carriers have been forced to use the router
vendor test tools.
- Test Equipment vendors must respond to the
challenge today.
24 Stress Testing Requirements
- Maintain BGP Sessions and IGP Adjacencies
- Flap BGP Routes
- Signal and maintain RSVP-TE tunnels
- Distribute LDP FECs
- Signal and maintain Multicast Groups
- Perform SNMP GETs and check validity
- Forward Traffic (IP Unicast, IP Multicast, and
MPLS)
- Make the network seem much bigger than it really
is without having to obtain hundreds of routers.
25 Required Protocol Emulation/Conformance Suite Coverage
- Routing Protocols
- BGP
- OSPF, ISIS
- OSPF-TE, ISIS-TE
- RSVP-TE
- Fast Reroute
- Standby Tunnels
- Ingress, Mid-Point, Egress
- LDP
- RFC 2547 Layer 3 VPNs
- Martini Layer 2 VPNs
- P and PE
- LDP over RSVP
- Multicast
- MBGP
- PIM-SM
- MSDP
26 Protocol Emulation Requirements
- Run any protocols in combination on the same
interface
- Forward traffic for emulated protocols
- Protocol Emulation on any interface type: GigE,
10GigE, and POS (including 192c)
- Scaling
- BGP Sessions: >500/system, >100/interface
- BGP Routes: >3M/system, >500K/session
- MPLS-TE Tunnels: >10K
- Ingress, Mid-Point, Egress
- FECs: >10K
- Load external BGP table for advertisement
- Controlled BGP Route Flapping
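The scaling thresholds above lend themselves to a simple checklist when evaluating a candidate test system. The following sketch encodes them as strict lower bounds (each requirement reads "greater than"); the key names and the example capability figures are hypothetical and do not describe any real product.

```python
# Scaling thresholds from the list above; a capability must exceed each value.
REQUIRED_MINIMUMS = {
    "bgp_sessions_per_system": 500,
    "bgp_sessions_per_interface": 100,
    "bgp_routes_per_system": 3_000_000,
    "bgp_routes_per_session": 500_000,
    "mpls_te_tunnels": 10_000,
    "fecs": 10_000,
}


def unmet_requirements(capabilities: dict) -> list:
    """Return requirement names whose capability is missing or not above the threshold."""
    return [
        name
        for name, minimum in REQUIRED_MINIMUMS.items()
        if capabilities.get(name, 0) <= minimum
    ]


# Hypothetical data-sheet figures for a candidate tester.
example_tester = {
    "bgp_sessions_per_system": 1_000,
    "bgp_sessions_per_interface": 100,
    "bgp_routes_per_system": 5_000_000,
    "bgp_routes_per_session": 1_000_000,
    "mpls_te_tunnels": 20_000,
}

print(unmet_requirements(example_tester))
```

In this example the tester falls short on BGP sessions per interface (exactly 100 is not more than 100) and on FECs (no figure given), so both are reported.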
27 Automated Regression Requirements
- Commercial test equipment vendors offer protocol
conformance TCL suites.
- Test Case coverage must be improved within each
suite
- Interaction between protocols must be tested
- Need each script to test multiple interfaces (4
or more)
- Full Protocol Coverage
- Multicast protocols have been the forgotten son
28 System Requirements
- Multiple ports per chassis (>32)
- Automated Convergence measurement
- Automated reroute/failover measurement
- Support for ECMP and Composite Links
- System/Protocol Stability For Many Days
- Ability to store GUI configuration for
repeatability
- Ability to TCL script any GUI test case