Title: Rethinking Network Control
1Rethinking Network Control & Management: The Case for a New 4D Architecture
- David A. Maltz
- Carnegie Mellon University/Microsoft Research
- Joint work with
- Albert Greenberg, Gisli Hjalmtysson
- Andy Myers, Jennifer Rexford, Geoffrey Xie,
- Hong Yan, Jibin Zhan, Hui Zhang
2The Role of Network Control and Management
- Many different network environments
- Access, backbone networks
- Data-center networks, enterprise/campus
- Sizes 10-10,000 routers/switches
- Many different technologies
- Longest-prefix routing (IP), fixed-width routing (Ethernet), label switching (MPLS, ATM), circuit switching (optical, TDM)
- Many different policies
- Routing, reachability, transit, traffic engineering, robustness
- The control plane software binds these elements together and defines the network
3We Can Change the Control Plane!
- Pre-existing industry trend towards separating router hardware from software
- IETF ForCES, GSMP, GMPLS
- SoftRouter [Lakshman, HotNets '04]
- Incremental deployment path exists
- Individual networks can upgrade their control planes and gain benefits
- Small enterprise networks have most to gain
- No changes to end-systems required
4A Clean-slate Design
- What are the fundamental causes of network problems?
- How to secure the network and protect the infrastructure?
- How to provide flexibility in defining management logic?
- What functionality needs to be distributed, and what can be centralized?
- How to reduce/simplify the software in networks?
- What would a RISC router look like?
- How to leverage technology trends?
- CPU and link-speed growing faster than number of switches
5Three Principles for Network Control & Management
- Network-level Objectives
- Express goals explicitly
- Security policies, QoS, egress point selection
- Do not bury goals in box-specific configuration
[Figure labels: Reachability matrix; Traffic engineering rules; Management Logic]
6Three Principles for Network Control & Management
- Network-wide Views
- Design network to provide timely, accurate info
- Topology, traffic, resource limitations
- Give logic the inputs it needs
[Figure labels: Reachability matrix; Traffic engineering rules; Management Logic; Read state info]
7Three Principles for Network Control & Management
- Direct Control
- Allow logic to directly set forwarding state
- FIB entries, packet filters, queuing parameters
- Logic computes the desired network state; let it implement it
[Figure labels: Reachability matrix; Traffic engineering rules; Management Logic; Read state info; Write state]
8Overview of the 4D Architecture
[Figure: the 4D planes (Decision, Dissemination, Discovery, Data), annotated with network-level objectives, network-wide views, and direct control]
- Decision Plane
- All management logic implemented on centralized servers making all decisions
- Decision Elements use views to compute data plane state that meets objectives, then directly write this state to routers
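A minimal sketch of what a Decision Element might do with a network-wide view and a reachability matrix. The data model (link pairs, named subnets, a set of denied source/destination pairs), the hop-count metric, and the function names are all assumptions made for illustration; the slides do not specify the prototype's algorithm.

```python
from collections import deque

def compute_fibs(links, subnets):
    """Build per-router FIBs from a network-wide view.

    links:   iterable of (router, router) pairs (bidirectional)
    subnets: {subnet name: router the subnet attaches to}
    Returns {router: {subnet: next hop}} using hop-count shortest paths.
    """
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    fibs = {router: {} for router in adj}
    for subnet, egress in subnets.items():
        fibs.setdefault(egress, {})[subnet] = "local"
        seen = {egress}
        queue = deque([egress])
        while queue:  # BFS outward from the egress router
            node = queue.popleft()
            for nbr in sorted(adj.get(node, ())):
                if nbr not in seen:
                    seen.add(nbr)
                    fibs[nbr][subnet] = node  # nbr reaches subnet via node
                    queue.append(nbr)
    return fibs

def compute_filters(subnets, denied):
    """Turn denied (src, dst) pairs into drop rules at the source's router."""
    filters = {}
    for src, dst in denied:
        filters.setdefault(subnets[src], set()).add((src, dst))
    return filters

# Hypothetical three-router view and a one-entry reachability matrix.
links = [("R1", "R2"), ("R2", "R3")]
subnets = {"A": "R1", "B": "R3"}
denied = {("A", "B")}

fibs = compute_fibs(links, subnets)
filters = compute_filters(subnets, denied)
print(fibs["R2"])   # {'A': 'R1', 'B': 'R3'}
print(filters)      # {'R1': {('A', 'B')}}
```

Because routes and filters are derived together from the same view, a topology change simply triggers a recomputation of both, rather than leaving one stale.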
9Overview of the 4D Architecture
- Dissemination Plane
- Provides a robust communication channel to each router, and robustness is the only goal!
- May run over same links as user data, but logically separate and independently controlled
10Overview of the 4D Architecture
- Discovery Plane
- Each router discovers its own resources and its local environment
- E.g., the identity of its immediate neighbors
11Overview of the 4D Architecture
- Data Plane
- Spatially distributed routers/switches
- Can deploy with today's technology
- Looking at ways to unify forwarding paradigms across technologies
12Concerns and Challenges
- Distributed Systems issues
- How will communication between routers and DEs survive failures in the network?
- Latency means the DEs' view of the network is behind reality. Will the control loop be stable?
- What is the overhead to/from the DEs?
- What happens in a network partition?
- Networking issues
- Does the 4D simplify control and management?
- Can we create logic to meet multiple objectives?
13The Feasibility of the 4D Architecture
- We designed and built a prototype of the 4D Architecture
- The 4D Architecture permits many designs; the prototype is a single, simple design point
- Decision plane
- Contains logic to simultaneously compute routes and enforce the reachability matrix
- Multiple Decision Elements per network, using a simple election protocol to pick the master
- Dissemination plane
- Uses source routes to direct control messages
- Extremely simple, but can route around failed data links
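The two mechanisms named above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: the lowest-ID election rule, the BFS route computation, and the names `elect_master` and `source_route` are hypothetical, since the slides only say "simple election protocol" and "source routes".

```python
from collections import deque

def elect_master(alive_des):
    """Assumed rule: the alive Decision Element with the lowest ID is master."""
    return min(alive_des) if alive_des else None

def source_route(links, failed, src, dst):
    """Return the full hop list src..dst that avoids failed links, or None."""
    down = {frozenset(link) for link in failed}
    adj = {}
    for a, b in links:
        if frozenset((a, b)) in down:
            continue  # skip links known to be down
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            route = []
            while node is not None:  # walk predecessors back to src
                route.append(node)
                node = prev[node]
            return route[::-1]
        for nbr in adj.get(node, []):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None

# Hypothetical topology: one DE with two disjoint paths to router R2.
links = [("DE", "R1"), ("R1", "R2"), ("DE", "R3"), ("R3", "R2")]
print(source_route(links, [], "DE", "R2"))               # ['DE', 'R1', 'R2']
print(source_route(links, [("R1", "R2")], "DE", "R2"))   # ['DE', 'R3', 'R2']
print(elect_master({"de2", "de5"}))                      # de2
```

The hop list travels in the control message itself, so intermediate routers need no routing state of their own to carry control traffic.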
14Evaluation of the 4D Prototype
- Evaluated using Emulab (www.emulab.net)
- Linux PCs used as routers (650-800 MHz)
- Tested on 9 enterprise network topologies (10-100 routers each)
[Figure: Example network with 49 switches and 5 DEs]
15Performance of the 4D Prototype
- Trivial prototype has performance comparable to well-tuned production networks
- Recovers from single link failure in < 300 ms
- < 1 s response considered excellent
- Faster forwarding reconvergence possible
- Survives failure of master Decision Element
- New DE takes control within 1 s
- No disruption unless second fault occurs
- Gracefully handles complete network partitions
- Less than 1.5 s of outage
16Fundamental Problem: Wrong Abstractions
- Management Plane
- Figure out what is happening in network
- Decide how to change it
- Control Plane
- Multiple routing processes on each router
- Each router with different configuration program
- Huge number of control knobs: metrics, ACLs, policy
- Data Plane
- Distributed routers
- Forwarding, filtering, queueing
- Based on FIB or labels
[Figure labels: Shell scripts, Traffic Eng, Planning tools, Databases, Configs, SNMP, netflow, modems, OSPF, Link metrics, Routing policies, FIB, Packet filters]
17Good Abstractions Reduce Complexity
[Figure labels: Management Plane, Configs, Decision Plane, Control Plane, FIBs, ACLs, Dissemination, Data Plane]
- All decision-making logic lifted out of control plane
- Eliminates duplicate logic in management plane
- Dissemination plane provides robust communication to/from data plane switches
18Today, Simple Things are Hard to Do
[Figure labels: D, Inter-POP Links, Access Networks]
19Fundamental Problem: Configurations Allow Too Many Degrees of Freedom
- Computing configuration files that cause control plane to compute desired forwarding states is intractable
- NP-hard in many cases
- Requires predictive model of control plane behavior
- Configuration files form a program that defines a set of forwarding states
- Very hard to create program that permits only desired states, and doesn't transit through bad ones
[Figure labels: Forwarding states allowed by configs; Auto-adaptation leads to/thru bad states; Direct Control avoids bad states]
20Fundamental Problem: Conflation of Issues
- Ideal case: all routing information flooded to all routers inside network
- Robustness achieved via flooding
- Reality: routing information filtered and aggregated extensively
- Route filtering used to implement security and resource policies
- Route aggregation used to achieve scalability
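One way to see the conflation is to model a route export step that both filters and aggregates. The sketch below uses Python's standard `ipaddress` module; the `export_routes` function, the prefixes, and the static-summary behavior are assumptions chosen for illustration, not taken from the talk.

```python
import ipaddress

def export_routes(routes, denied, summary):
    """Apply a route filter, then replace everything inside `summary` with it."""
    # Policy: suppress any route that falls inside a denied prefix.
    kept = [r for r in routes if not any(r.subnet_of(d) for d in denied)]
    # Scalability: advertise one summary if any kept route falls inside it.
    inside = [r for r in kept if r.subnet_of(summary)]
    outside = [r for r in kept if not r.subnet_of(summary)]
    return sorted(outside + ([summary] if inside else []))

net = ipaddress.ip_network
routes = [net("10.1.0.0/24"), net("10.1.1.0/24"), net("192.168.5.0/24")]
denied = [net("192.168.0.0/16")]
summary = net("10.1.0.0/16")

before = export_routes(routes, denied, summary)
# Same export after 10.1.1.0/24 becomes unreachable inside the network.
after = export_routes(routes[:1] + routes[2:], denied, summary)
print(before == after)  # True
```

The neighbor receives the same advertisement before and after the loss of 10.1.1.0/24, so the mechanism added for scalability also hides information that flooding would have used for robustness.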
21 4D Separates Distributed Computing Issues from Networking Issues
- Distributed computing issues -> protocols and network architecture
- Overhead
- Resiliency
- Scalability
- Networking issues -> management logic
- Traffic engineering and service provisioning
- Egress point selection
- Reachability control (VPNs)
- Precomputation of backup paths
22Future Work
- Scalability
- Evaluate over 1-10K switches, 10-100K routes
- Networks with backbone-like propagation delays
- Structuring decision logic
- Arbitrate among multiple, potentially competing objectives
- Unify control when some logic takes longer than others
- Protocol improvements
- Better dissemination and discovery planes
- Deployment in today's networks
- Data center, enterprise, campus, backbone (RCP)
23Future Work
- Experiment with network appliances
- Traffic shapers, traffic scrubbers
- Expand relationships with security
- Using 4D as mechanism for monitoring/quarantine
- Formulate models that establish bounds of 4D
- Scale, latency, stability, failure models, objectives
- Generate evidence to support/refute principles
24Questions?
25Direct Control Provides Complete Control
- Zero device-specific configuration
- Supports many models for pushing routes
- Trivial push: convergence requires time for all updates to be received and applied (same as today)
- Synchronized update: updates propagated, but not applied till agreed time in the future; clock skew defines convergence time
- Controlled state trajectory: DE serializes updates to avoid all incorrect transient states
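As an illustration of a controlled state trajectory, the sketch below serializes next-hop changes for a single destination so that no intermediate state contains a forwarding loop. It uses a well-known ordering (update routers nearest the destination in the new state first) and assumes both the old and new states are loop-free; the slides do not say which ordering the DE uses, and loops are only one kind of incorrect transient state. All names are hypothetical.

```python
def hops_to_dest(next_hop, node, dest):
    """Hops from node to dest following next_hop; None on a loop or dead end."""
    seen = set()
    hops = 0
    while node != dest:
        if node in seen or node not in next_hop:
            return None
        seen.add(node)
        node = next_hop[node]
        hops += 1
    return hops

def update_order(old, new, dest):
    """Routers whose next hop changes, nearest-to-dest (in the new state) first."""
    changed = [r for r in new if old.get(r) != new[r]]
    # Assumes `new` is loop-free, so every hop count is an integer.
    return sorted(changed, key=lambda r: (hops_to_dest(new, r, dest), r))

def all_states_loop_free(old, new, dest, order):
    """Apply updates one at a time and check every intermediate state."""
    state = dict(old)
    for router in order:
        state[router] = new[router]
        if any(hops_to_dest(state, r, dest) is None for r in state):
            return False
    return True

# Hypothetical ring A-B-C-D-A; traffic to D is moved off link C-D.
old = {"A": "B", "B": "C", "C": "D"}
new = {"A": "D", "B": "A", "C": "B"}

order = update_order(old, new, "D")
print(order)                                                    # ['A', 'B', 'C']
print(all_states_loop_free(old, new, "D", order))               # True
print(all_states_loop_free(old, new, "D", ["C", "B", "A"]))     # False
```

The reversed order fails because updating C first makes B and C point at each other until B is also updated.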
26Fundamental Problem: Wrong Abstractions
- interface Ethernet0
- ip address 6.2.5.14 255.255.255.128
- interface Serial1/0.5 point-to-point
- ip address 6.2.2.85 255.255.255.252
- ip access-group 143 in
- frame-relay interface-dlci 28
- router ospf 64
- redistribute connected subnets
- redistribute bgp 64780 metric 1 subnets
- network 66.251.75.128 0.0.0.127 area 0
- router bgp 64780
- redistribute ospf 64 match route-map 8aTzlvBrbaW
- neighbor 66.253.160.68 remote-as 12762
- neighbor 66.253.160.68 distribute-list 4 in
access-list 143 deny 1.1.0.0/16
access-list 143 permit any
route-map 8aTzlvBrbaW deny 10
match ip address 4
route-map 8aTzlvBrbaW permit 20
match ip address 7
ip route 10.2.2.1/16 10.2.1.7
27Fundamental Problem: Wrong Abstractions
[Figure: Size of configuration files in a single enterprise network (881 routers). x-axis: Router ID (sorted by file size), 0 to 881; y-axis: Lines in config file, 0 to 2000]
30Fundamental Problem: Conflating Distributed Systems Issues with Networking Issues
[Figure labels: Routing Process (three instances); D; D left]
- Distributed Systems Concern: resiliency to link failures
- Solution: multiple paths through routing process graph
31Fundamental Problem: Conflating Distributed Systems Issues with Networking Issues
[Figure labels: Routing Process (three instances); D; D left; D right]
- Distributed Systems Concern: resiliency to link failures
- Solution: multiple paths through routing process graph
32Fundamental Problem: Conflating Distributed Systems Issues with Networking Issues
[Figure labels: Routing Process (three instances); Filter routes to D; D; D left]
- Networking Concern: implement resource or security policy
- Solution: restrict flow of routing information, filter routes, summarize/aggregate routes
33 4D Supports Network Evolution & Expansion
- Decision logic can be upgraded as needed
- No need for update of distributed protocols implemented in software distributed on every switch
- Decision Elements can be upgraded as needed
- Network expansion requires upgrades only to DEs, not every switch
34Reachability Example
R1
R2
Chicago (chi)
New York (nyc)
Data Center
Front Office
R5
R4
R3
- Two locations, each with data center and front office
- All routers exchange routes over all links
35Reachability Example
R1
R2
Chicago (chi)
New York (nyc)
Data Center
Front Office
R5
R4
R3
chi-DC
chi-FO
nyc-DC
nyc-FO
chi-DC
chi-FO
nyc-DC
nyc-FO
36Reachability Example
Packet filter: Drop nyc-FO ->; Permit
R1
R2
chi
Data Center
Front Office
Packet filter: Drop chi-FO ->; Permit
R5
nyc
R4
R3
37Reachability Example
Packet filter: Drop nyc-FO ->; Permit
R1
R2
chi
Data Center
Front Office
Packet filter: Drop chi-FO ->; Permit
R5
nyc
R4
R3
- A new short-cut link added between data centers
- Intended for backup traffic between centers
38Reachability Example
Packet filter: Drop nyc-FO ->; Permit
R1
R2
chi
Data Center
Front Office
Packet filter: Drop chi-FO ->; Permit
R5
nyc
R4
R3
- Oops: new link lets packets violate security policy!
- Routing changed, but
- Packet filters don't update automatically
39Prohibiting Packets from chi-FO to nyc-DC
40Reachability Example
Packet filter: Drop nyc-FO ->; Permit
R2
R1
chi
Data Center
Front Office
Packet filter: Drop chi-FO ->; Permit
R5
nyc
R4
R3
- Typical response: add more packet filters to plug the holes in security policy
41Reachability Example
Drop nyc-FO ->
R2
R1
chi
Data Center
Front Office
R5
nyc
Drop chi-FO ->
R4
R3
- Packet filters have surprising consequences
- Consider a link failure
- chi-FO and nyc-FO still connected
42Reachability Example
Drop nyc-FO ->
R2
R1
chi
Data Center
Front Office
R5
nyc
Drop chi-FO ->
R4
R3
- Network has less survivability than topology suggests
- chi-FO and nyc-FO still connected
- But packet filter means no data can flow!
- Probing the network won't predict this problem
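The reachability example on the preceding slides can be replayed in a few lines. The transcript does not preserve the figure, so the link layout, the router each subnet attaches to, and the filter placement below are assumptions picked to reproduce the story; filters are modeled as per-interface drop lists keyed by (router, previous hop).

```python
from collections import deque

def shortest_path(links, src, dst):
    """Hop-count shortest path via BFS; returns a list of routers or None."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

def delivered(links, filters, attach, src_net, dst_net):
    """True if routing finds a path and no ingress filter on it drops src_net."""
    path = shortest_path(links, attach[src_net], attach[dst_net])
    if path is None:
        return False
    for prev_hop, router in zip(path, path[1:]):
        if src_net in filters.get((router, prev_hop), set()):
            return False
    return True

# Assumed layout: R1/R2 in chi, R3/R4/R5 in nyc.
attach = {"chi-DC": "R1", "chi-FO": "R2", "nyc-FO": "R3", "nyc-DC": "R5"}
links = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R4", "R5")]
filters = {("R1", "R2"): {"nyc-FO"}, ("R5", "R4"): {"chi-FO"}}

# 1. Original network: chi-FO cannot reach nyc-DC, as the policy intends.
blocked_before = delivered(links, filters, attach, "chi-FO", "nyc-DC")

# 2. Add the DC-to-DC shortcut: routing changes, the filters do not.
links_new = links + [("R1", "R5")]
leak = delivered(links_new, filters, attach, "chi-FO", "nyc-DC")

# 3. Plug the hole with more filters, then fail the FO-to-FO link.
patched = dict(filters)
patched[("R5", "R1")] = {"chi-FO"}
patched[("R1", "R5")] = {"nyc-FO"}
links_failed = [l for l in links_new if l != ("R2", "R3")]
connected = shortest_path(links_failed, "R2", "R3") is not None
flows = delivered(links_failed, patched, attach, "chi-FO", "nyc-FO")

print(blocked_before, leak, connected, flows)  # False True True False
```

In step 3 the topology still connects the two front offices, yet the patched filter on the shortcut drops their traffic, which matches the point that survivability is lower than the topology suggests.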
43Allowing Packets from chi-FO to nyc-FO
44Multiple Interacting Routing Processes
Client
Server
45The Routing Instance Graph of an 881-Router Network
46Reconvergence Time Under Single Link Failure
47Reconvergence Time When Master DE Crashes
48Reconvergence Time When Network Partitions
49Reconvergence Time When Network Partitions
50Many Implementations Possible
Single redundant decision engine
- Multiple decision engines
- Hot stand-by
- Divide network and load share
- Distributed decision engines
- Up to one per router
- Choice can be based on reliability requirements
- Dissem. plane can be in-band, or leverage OOB links
- Less need for distributed solutions (harder to reason about)
- More focus on network issues, less on distributed protocols
51Direct Expression Enables New Algorithms
- OSPF normally calculates a single path to each destination D
- OSPF allows load-balancing only for equal-cost paths to avoid loops
- Using ECMP requires careful engineering of link weights
- Decision Plane with network-wide view can compute multiple paths
- Backup paths installed for free!
- Bounded stretch, bounded fan-in
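A small sketch of how a decision plane with the full weighted topology could pick several loop-free next hops per router, including unequal-cost ones. It uses the standard "downhill" rule (forward only to neighbors strictly closer to D), which is an assumption here rather than the algorithm from the talk, and it does not attempt the stretch or fan-in bounds mentioned above. Link weights must be positive.

```python
import heapq

def distances(links, dest):
    """Dijkstra from dest over weighted, bidirectional links."""
    adj = {}
    for a, b, w in links:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist = {dest: 0}
    heap = [(0, dest)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return adj, dist

def loop_free_next_hops(links, dest):
    """Every neighbor strictly closer to dest is a safe next hop."""
    adj, dist = distances(links, dest)
    return {r: sorted(n for n, _ in adj[r] if dist[n] < dist[r])
            for r in dist if r != dest}

# Hypothetical weighted topology with destination D.
links = [("S", "A", 1), ("A", "D", 1), ("S", "D", 5),
         ("S", "B", 1), ("B", "D", 2)]
hops = loop_free_next_hops(links, "D")
print(hops["S"])  # ['A', 'D']
```

Equal-cost multipath would give S only the next hop A (cost 2); the downhill rule also admits the direct cost-5 link as a backup, and no loop can form because distance to D strictly decreases at every hop.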
52Systems of Systems
- Systems are designed as components to be used in larger systems in different contexts, for different purposes, interacting with different components
- Example: OSPF and BGP are complex systems in their own right; they are components in a routing system of a network, interacting with each other and packet filters, interacting with management tools
- Complex configuration to enable flexibility
- The glue has tremendous impact on network performance
- State of the art: multiple interactive distributed programs written in assembly language
- Lack of intellectual framework to understand global behavior
53Supporting Network Evolution
- Logic for controlling the network needs to change over time
- Traffic engineering rules
- Interactions with other networks
- Service characteristics
- Upgrades to field-deployed network equipment must be avoided
- Very high cost
- Software upgrades often require hardware upgrades (more CPU or memory)
54Supporting Network Evolution Today
- Today's Solution
- Vendors stuff their routers with software implementing all possible features
- Multiple routing protocols
- Multiple signaling protocols (RSVP, CR-LDP)
- Each feature controlled by parameters set at configuration time to achieve late binding
- Feature-creep creates configuration nightmare
- Tremendous complexity for syntax and semantics
- Mis-interactions between features are common
- Our Goal: Separate decision-making logic from the field-deployed devices
55Supporting Network Expansion
- Networks are constantly growing
- New routers/switches/links added
- Old equipment rarely removed
- Adding a new switch can cause old equipment to become overloaded
- CPU/Memory demands on each device should not scale up with network size
56Supporting Network Expansion Today
- Routers run a link-state routing protocol
- Size of link-state database scales with number of routers
- Expanding network can exceed memory limits of old routers
- Today's Solution
- Monitor resources on all routers
- Predict approach of exhaustion and then
- Global upgrade
- Rearchitecture of routing design to add summarization, route aggregation, information hiding
- Our Goal: make demands scale with hardware (e.g., number of interfaces)
57Supporting Remote Devices
- Maintaining communication with all network devices is critical for network management
- Diagnosis of problems
- Monitoring status and network health
- Updating configuration or software
- The chicken or the egg
- Cannot send device configuration/management information until it can communicate
- Device cannot communicate until it is correctly configured
58Supporting Remote Devices Today
- Today's Solution
- Use PSTN as management network of last resort
- Connect console of remote routers to phone modem
- Can't be used for customer premise equipment (CPE): DSL/cable modems, integrated access devices (IADs)
- In a converged network, PSTN is decommissioned
- Our Goal: Preserve management communication to any device that is not physically partitioned, regardless of configuration state
59Recent Publications
- G. Xie, J. Zhan, D. A. Maltz, H. Zhang, A. Greenberg, G. Hjalmtysson, J. Rexford, "On Static Reachability Analysis of IP Networks," IEEE INFOCOM 2005, Orlando, FL, March 2005.
- J. Rexford, A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, G. Xie, J. Zhan, H. Zhang, "Network-Wide Decision Making: Toward a Wafer-Thin Control Plane," Proceedings of ACM HotNets-III, San Diego, CA, November 2004.
- D. A. Maltz, J. Zhan, G. Xie, G. Hjalmtysson, A. Greenberg, H. Zhang, "Routing Design in Operational Networks: A Look from the Inside," Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (ACM SIGCOMM 2004), Portland, Oregon, 2004.
- D. A. Maltz, J. Zhan, G. Xie, H. Zhang, G. Hjalmtysson, A. Greenberg, J. Rexford, "Structure Preserving Anonymization of Router Configuration Data," Proceedings of ACM/Usenix Internet Measurement Conference (IMC 2004), Sicily, Italy, 2004.