Title: Scalability Considerations for Programmable Networks
1Scalability Considerations for Programmable Networks
- Jim Griffioen
- Laboratory for Advanced Networking
- University of Kentucky, USA
- Collaborator: Prof. Ken Calvert
- Students: Su Wen, Andy Martin, Najati Imam, Muthulakshmi Muthukumaraswamy
Research supported by DARPA and Intel
2Programmable Services (Outline)
- Background and problem description
- Design goals and consequences
- Overview of ESP, a lightweight network service
- A simple example
- Engineering considerations
- Other uses
- Status
3The Internet
- The Internet has proven itself to be a robust, scalable, and flexible communication infrastructure
- Simple service abstraction
- Basic service is useful to a wide range of applications
- Additional services must be implemented at end-systems (end-to-end principle)
- Best-effort service
- Service can be implemented with low cost
- Processing cycles: can route packets quickly
- Router state: aggregates router state
- Result: can support hundreds of thousands of flows and millions of end-systems
4The Evolving Internet
- New and emerging applications require new network-level services not supported by the Internet Protocol
- Examples include
- Flow prioritization (forwarding/dropping/routing)
- Congestion management (RED, ECN)
- Reliable dissemination (PGM)
- Specification-based anycasting
- Layered multicasting (RLM)
- Scalable aggregation services (Concast)
- Single source multicast (Express)
- ... and many more application-specific services
5New Service Approaches
- Closed Approach: Rely on router vendors to add new application-specific services to their equipment
- Extremely Open Approach: Allow end-systems to dynamically load arbitrary code into network routers
- Application Layer Approach: Implement new services solely at the application layer without involving the network layer
6Basic Router Functions
[Figure: basic router functions: receive, lookup, enqueue, transmit; also labeled: Routing Table, congestion control]
7New Service Approaches
- Closed Approach: Rely on router vendors to add new application-specific services to their equipment
- Extremely Open Approach: Allow end-systems to dynamically load arbitrary code into network routers
- Application Layer Approach: Implement new services solely at the application layer without involving the network layer
8Programmable Router I
[Figure: programmable router: receive, execute, transmit; a Virtual Machine with program, input, and state store]
9Programmable Router II
[Figure: programmable router: execute, transmit; an Active Application on a Virtual Machine with a state store]
10New Service Approaches
- Closed Approach: Rely on router vendors to add new application-specific services to their equipment
- Extremely Open Approach: Allow end-systems to dynamically load arbitrary code into network routers
- Application Layer Approach: Implement new services only at the application layer without involving the network layer
11A New Approach (ESP/LWP)
- We need a new approach, a middle ground, that opens the traditionally opaque network layer just enough.
- We want to allow end-systems to extract information about the network or control the way the network processes/handles packets without exposing too much or creating scalability or security problems
12Our Design Goals
- A programmable network service that has IP-like characteristics
- Flexible
- Applicable to more than one kind of problem, including presently unknown problems
- Useful
- Deployable today
- Solves (or assists in solving) one or more real problems
- Scalable
- Can potentially be used by every end system
- Accommodates 100,000 simultaneous flows
- Robust
- Best effort
13Requirements
- Allow user-specified information to be stored, modified, retrieved inside the network
- Necessary to solve interesting problems
- Too cheap to meter
- Service must be accessible without special authorization
- Crypto authentication/access policies don't scale to 100K flows
- Negligible management overhead
- DoS-resistance
- Leverage existing IP forwarding infrastructure
- Don't re-invent this wheel
- Necessary for deployability
14User-controlled State
- Conventional wisdom: not scalable
- Too expensive to provide for 100K flows
- Overhead of managing setup, soft-state refresh, garbage collection
- Limiting factors
- Time-space product of memory usage (per flow)
- Signaling overhead and robustness against errors
- Goal
- Bound the time-space product
- Reduce or avoid signaling overhead
- Expect and tolerate errors
15Per-Packet Processing
- Centralized computation at routers does not scale
- Cannot assume all packets are processed by a single processor
- Need per-port processing
- Should be comparable to IP forwarding
- Approximately constant cost per packet (i.e., active network capsules are not reasonable)
- Hardware-friendly
- Goal
- Bounded per-packet processing costs
- Processing may not modify the IP header!
- Instructions may modify packet payload or state store
16Ephemeral State Processing
- ESP Solution
- fixed-lifetime state store
- no management overhead
- fixed-length computations
- ESP Components
- 1) Ephemeral State Store
- Associative memory: set of (tag, value) pairs
- Fixed-size tags and values (e.g., 64- or 128-bit unstructured bit strings)
- Tags act as names of variables
- Tags are randomly selected, which yields private stores
- Bindings persist for a (short) fixed time τ, then vanish
- Bindings cannot be refreshed
- No management overhead
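The store semantics listed above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name, the `get`/`put` methods, and the injectable clock are hypothetical and do not come from the talk. What the sketch tries to show is the rule that a binding's expiry is fixed at creation, so updating a value never extends its lifetime.

```python
import time


class EphemeralStateStore:
    """Toy (tag, value) store: bindings vanish a fixed time after creation."""

    def __init__(self, lifetime=10.0, clock=time.monotonic):
        self.lifetime = lifetime   # tau in seconds (hypothetical default)
        self.clock = clock         # injectable so the sketch can be tested
        self._bindings = {}        # tag -> (value, expiry time)

    def get(self, tag):
        entry = self._bindings.get(tag)
        if entry is None:
            return None
        value, expires = entry
        if self.clock() >= expires:
            del self._bindings[tag]   # lazy removal, no explicit teardown
            return None
        return value

    def put(self, tag, value):
        now = self.clock()
        entry = self._bindings.get(tag)
        if entry is not None and now < entry[1]:
            # Live binding: change the value, keep the original expiry.
            self._bindings[tag] = (value, entry[1])
        else:
            self._bindings[tag] = (value, now + self.lifetime)


# Usage with a fake clock.
now = [0.0]
store = EphemeralStateStore(lifetime=10.0, clock=lambda: now[0])
store.put(0x1234, 1)
now[0] = 6.0
store.put(0x1234, 2)        # update only; expiry stays at t = 10
now[0] = 11.0
print(store.get(0x1234))    # None
```

Because nothing ever refreshes a binding, the store needs no setup, refresh, or teardown messages, which is the "no management overhead" point above.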
17Ephemeral State Processing
- 2) Set of packet-borne instructions
- Fixed-length computations (one per packet)
- create/update bindings
- update fields in packet payload (only)
- operands control behavior
- On termination, forward or discard packet
- 3) Wire protocol
- Instructions are carried in specially marked ESP packets recognized and executed hop-by-hop by routers on the way to the destination
- Contain the instruction to execute and its parameters
- Piggy-backed computations are possible
18ESP Probes
- Two Common Types of ESP Probes
- Setup Probe
- create or modify value bindings for the given ESP tags
- Collect Probe
- retrieve values associated with the given ESP tags
19Example Nack Suppression
1. Sender multicasts packet w/ seq N
20Example Nack Suppression
21Example Nack Suppression
Count-filter (CF) instruction
Operands: counter tag c, threshold value thr
if (c is undefined) { c = 1; fwd } else if (c < pkt.thr) { c++; fwd } else discard
[Figure labels: (N,1), (N,1), (N,1)]
2. Receivers who did not receive pkt N send a Nack with a piggybacked ESP CF instruction
22Example Nack Suppression
Count-filter (CF) instruction
Operands: counter tag c, threshold value thr
if (c is undefined) { c = 1; fwd } else if (c < pkt.thr) { c++; fwd } else discard
[Figure labels: (N,1), (N,1), (N,1), (N,2), (N,2)]
23Example Nack Suppression
Count-filter (CF) instruction
Operands: counter tag c, threshold value thr
if (c is undefined) { c = 1; fwd } else if (c < pkt.thr) { c++; fwd } else discard
[Figure labels: (N,1), (N,1), (N,1), (N,2), (N,2), (N,2)]
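The count-filter behaviour at a single router can be sketched as follows. This is a sketch under assumptions: a plain dict stands in for the ephemeral state store (no expiry), and the tag name "N", the threshold 2, and the five Nacks are illustrative values, not taken from the slides.

```python
def count_filter(store, tag, thr):
    """Return True to forward the packet, False to discard it."""
    c = store.get(tag)
    if c is None:          # first packet carrying this tag: create the counter
        store[tag] = 1
        return True
    if c < thr:            # still below the threshold: count and forward
        store[tag] = c + 1
        return True
    return False           # threshold reached: suppress the packet


store = {}                 # stand-in for one router's state store
forwarded = [count_filter(store, "N", thr=2) for _ in range(5)]
print(forwarded)           # [True, True, False, False, False]
```

With this threshold the router passes the first two Nacks upstream and drops the rest, so the sender sees a bounded number of Nacks regardless of how many receivers lost the packet.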
24Another Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; label: Stimulus; no bindings shown]
Time = 0
25Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; no bindings shown]
Time = 1
26Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; no bindings shown]
Time = 2
27Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,1)]
Time = 3
28Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,1), (c,1), (c,1)]
Time = 4
29Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,1), (c,1), (c,2), (c,1)]
Time = 5
30Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,2), (c,1), (c,2), (c,2)]
Time = 6
31Example: Finding Slowest Receiver, Phase 1
COUNTCH instruction: if (c is defined) { c++; discard } else { c = 1; forward }
Send COUNTCH, use tag c
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,3), (c,1), (c,2), (c,2)]
Time = 7
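Phase 1 can be simulated on a small tree to see how each router ends up holding its number of children. The topology below is hypothetical (three receivers A, B, C behind routers r1, r2, r3); the extracted slides do not preserve the actual tree, and dicts stand in for the per-router state stores.

```python
# Hypothetical topology: each node maps to its next hop toward the sender S.
PARENT = {"A": "r2", "B": "r2", "C": "r3", "r2": "r1", "r3": "r1", "r1": "S"}
ROUTERS = {"r1", "r2", "r3"}


def countch(store, tag):
    """First packet creates the counter and is forwarded; later ones
    increment it and are discarded. Returns True to forward."""
    if tag in store:
        store[tag] += 1
        return False
    store[tag] = 1
    return True


def send_toward_sender(src, stores, tag="c"):
    """Carry one COUNTCH packet hop by hop; True if it reaches S."""
    node = PARENT[src]
    while node in ROUTERS:
        if not countch(stores[node], tag):
            return False
        node = PARENT[node]
    return True


stores = {r: {} for r in ROUTERS}
reached = [send_toward_sender(r, stores) for r in ("A", "B", "C")]
counts = {r: stores[r]["c"] for r in sorted(ROUTERS)}
print(counts)    # {'r1': 2, 'r2': 2, 'r3': 1}
print(reached)   # [True, False, False]
```

Only one packet per subtree survives each hop, so the sender receives a single packet while every router learns how many downstream branches reported.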
32Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
Send COLLECT, use tags c, v
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,3), (c,1), (c,2), (c,2)]
Time = 8
33Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
Send COLLECT, use tags c, v
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,3), (c,1), (c,2), (c,2)]
Time = 9
34Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
Send COLLECT, use tags c, v
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,3), (c,1), (c,2), (c,2)]
Time = 10
35Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
Send COLLECT, use tags c, v
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,3), (c,1), (c,1) (v,3), (c,2)]
Time = 11
36Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; value in flight: 5; bindings shown: (c,3), (c,1), (c,1) (v,3), (c,1) (v,2)]
Time = 12
37Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; value in flight: 2; bindings shown: (c,2) (v,5), (c,1), (c,1) (v,3), (c,1) (v,2)]
Time = 13
38Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; value in flight: 4; bindings shown: (c,1) (v,2), (c,1), (c,1) (v,3), (c,1) (v,2)]
Time = 14
39Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,0) (v,2), (c,1), (c,1) (v,3), (c,1) (v,2)]
Time = 15
40Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,0) (v,2), (c,1), (c,1) (v,3), (c,0) (v,2)]
Time = 16
41Example: Finding Slowest Receiver, Phase 2
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; bindings shown: (c,0) (v,2), (c,1), (c,0) (v,2), (c,0) (v,2)]
Time = 17
42Example: Finding Slowest Receiver, Result
COLLECT instruction: v = min(v, pkt.v); if (--c == 0) { pkt.v = v; forward }
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; result delivered: 2; bindings shown: (c,0) (v,2), (c,0) (v,2), (c,0) (v,2), (c,0) (v,2)]
Time = 18
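Phase 2 can be sketched the same way. The code below assumes a hypothetical tree (receivers A, B, C behind routers r1, r2, r3), child counts already left in each router by a COUNTCH pass, and made-up receiver values 5, 3, and 4. The slide does not say what happens when the counter is still positive; discarding the packet in that case is an assumption here.

```python
# Hypothetical topology: each node maps to its next hop toward the sender S.
PARENT = {"A": "r2", "B": "r2", "C": "r3", "r2": "r1", "r3": "r1", "r1": "S"}
ROUTERS = {"r1", "r2", "r3"}


def collect(store, pkt):
    """Fold pkt['v'] into the stored minimum; forward only once the last
    expected child has reported. Returns True to forward."""
    v = store.get("v")
    store["v"] = pkt["v"] if v is None else min(v, pkt["v"])
    store["c"] -= 1                 # assumes phase 1 created the counter
    if store["c"] == 0:
        pkt["v"] = store["v"]       # carry the subtree minimum upstream
        return True
    return False                    # assumption: otherwise discard


def send_collect(src, value, stores):
    """Carry one COLLECT packet toward S; return the value S receives,
    or None if a router absorbed the packet."""
    pkt = {"v": value}
    node = PARENT[src]
    while node in ROUTERS:
        if not collect(stores[node], pkt):
            return None
        node = PARENT[node]
    return pkt["v"]


# Child counts as a COUNTCH pass over this tree would leave them.
stores = {"r1": {"c": 2}, "r2": {"c": 2}, "r3": {"c": 1}}
results = [send_collect(r, v, stores) for r, v in (("A", 5), ("B", 3), ("C", 4))]
print(results)   # [None, None, 3]
```

The sender receives exactly one packet, carrying the minimum over all receivers, and each router does a constant amount of work per packet.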
43Wire Protocol
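[Figure: ESP packet formats. Standalone: IP header (RA, Proto = ESP) followed by the ESP fields Flow ID, Op Code, Loc, Err, Operands. Piggyback: ESP fields carried with another protocol, e.g. RTP]
- Location field specifies where processing occurs (and where state is stored)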
- Input port
- Output port
- Both
- Neither (aborted instruction)
- Centralized context
- Flow ID sorts packets for serial execution
- Error field carries error code from exceptions
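The header fields and the location rule above can be modelled roughly as follows. The field names follow the slide, but the types, the string encoding of the location values, and the `executes_in` helper are hypothetical; the talk does not give field widths or numeric codes.

```python
from dataclasses import dataclass
from enum import Enum


class Loc(Enum):
    INPUT = "input"
    OUTPUT = "output"
    BOTH = "both"
    NEITHER = "neither"     # aborted instruction
    CENTRAL = "central"


@dataclass
class EspHeader:
    opcode: str             # instruction to execute, e.g. "CF"
    flow_id: int            # packets of one flow are executed serially
    loc: Loc                # where processing occurs and state is stored
    err: int = 0            # error code from exceptions
    operands: tuple = ()


def executes_in(header, context):
    """True if a router should run the instruction in the given context
    ('input', 'output', or 'central')."""
    if header.loc is Loc.BOTH:
        return context in ("input", "output")
    if header.loc is Loc.NEITHER:
        return False
    return header.loc.value == context


hdr = EspHeader(opcode="CF", flow_id=7, loc=Loc.BOTH, operands=("c", 2))
print([ctx for ctx in ("input", "output", "central") if executes_in(hdr, ctx)])
# ['input', 'output']
```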
44Input/Output/Central Processing Contexts
[Figure: router with Normal IP Input Processing, Switch Fabric, and Normal IP Output Processing; ESP shown in the Output Context]
45Input/Output/Central Processing Contexts
[Figure: router with Normal IP Input Processing, Switch Fabric, and Normal IP Output Processing; ESP shown in the Output Context]
46Input/Output/Central Processing Contexts
[Figure: router with Normal IP Input Processing, Switch Fabric, and Normal IP Output Processing; ESP shown in the Output Context and the Input Context]
47Input/Output/Central Processing Contexts
[Figure: router with Normal IP Input Processing, Switch Fabric, and Normal IP Output Processing; ESP shown in the Output Context and the Input Context]
48Input/Output/Central Processing Contexts
[Figure: router with Normal IP Input Processing, Switch Fabric, and Normal IP Output Processing; ESP shown in the Output Context, the Input Context, and Both Contexts]
49Input/Output/Central Processing Contexts
[Figure: router with Normal IP Input Processing, Switch Fabric, and Normal IP Output Processing; ESP shown in the Output Context, the Input Context, Both Contexts, and the Central Context]
50Engineering ESP
- Tag, value sizes
- 64 bits yields acceptably low collision probabilities
- Store capacity is independent of tag size
- Setting store lifetime τ
- Store capacity maximized by minimizing τ
- For scalability: minimize τ
- But need to be able to complete useful computations
- For robustness: maximize τ
- 10 seconds (enough for 2-3 round-trips through the network)
- Challenges
- Wire-speed implementation
- Minimize store access time
- Minimize store cost
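One way to read "store capacity maximized by minimizing τ": with a fixed number of slots, each binding occupies its slot for τ seconds, so the sustainable rate of new bindings is slots divided by τ. The arithmetic can be sketched as below; the slot count is a hypothetical figure, not one from the talk.

```python
def sustainable_rate(slots, tau_s):
    """New bindings per second a store with `slots` entries can accept
    when every binding lives exactly tau_s seconds."""
    return slots / tau_s


slots = 1_000_000                      # hypothetical store size
for tau in (10, 5, 1):
    print(tau, sustainable_rate(slots, tau))
```

Halving τ doubles the rate the same memory can support, which is why scalability pushes τ down while the need to finish multi-round-trip computations pushes it up.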
51Probability of Tag Collision (64 bits)
[Figure: probability of tag collision for 64-bit tags; axis ticks 18, 20, 22, 24, 26, 28, 30]
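The plotted values are not recoverable from the extracted text, but the order of magnitude can be estimated with the standard birthday approximation. The sketch below assumes the axis ticks are log2 of the number of simultaneously live tags, which the extraction does not confirm, and it computes the chance that any two of n uniformly random tags coincide, which may differ from the quantity plotted in the talk.

```python
import math


def collision_probability(n_tags, tag_bits=64):
    """Birthday approximation: probability that at least two of n_tags
    uniformly random tags are equal."""
    pairs = n_tags * (n_tags - 1) / 2
    return -math.expm1(-pairs / 2 ** tag_bits)   # 1 - exp(-pairs / 2^bits)


for k in range(18, 31, 2):
    print(f"2^{k} tags: {collision_probability(2 ** k):.3e}")
```

Under these assumptions the probability is on the order of 1e-9 for 2^18 live tags and about 3% for 2^30.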
52Prototype Status
- FPGA
- Microcoded processor, small ESS
- Extendable ESP instruction set
- Non-pipelined proof-of-concept
- Design runs full speed on 100 MHz Virtex chip
- ESS: 6-cycle access time
- Network Processor
- ESP/ESS running on StrongARM
- Moving to µEngines
- Software ESS
- 2 µsec access time
53H/W Ephemeral State Processor
[Figure: block diagram with Input Packet, Output Packet, Packet Register, Input Control, Output Control, Macro Controller, µInstruction Store, Ephemeral State Store, µController, µInstruction Reg, Tag Registers, Value Registers, Location Registers, ALU]
54Network Processor Implementation
Router
- Per-port ESP facility
- Transparent to router
- Intel BridalVeil IXP1200 board
- 8 × 100M Ethernet ports
- ESP/ESS running on StrongARM core
- 6 microengines, four threads each
- SRAM, DRAM
- Moving to µEngines
- Software ESS
- 2 µsec access time
55Leveraging ESP
- Observation
- Application-specific processing often only needed at a few nodes
- Idea
- Use ESP to identify where processing needs to occur
- Deploy functionality directly from the end-systems
56Lightweight Packet Processing Modules
- Simple, pre-defined functions in routers
- Enabled by end systems via signaling
- Signaling protocol identifies
- the functionality to be enabled
- the parameters to be used by the function
- the packets to apply the processing to (classifier)
- a timeout value
- Signaling independent of forwarding (not hop-by-hop)
- Note: can use direct point-to-point authentication
57Dup(): An Example LWP Module
- dup() - a simple duplication function
- snoop specially marked packets (the signaling protocol identifies which packets)
- duplicate the packet
- change source and destination in the new packet's IP header (destination specified by the signaling protocol)
- forward the new packet
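The dup() steps above can be sketched as a small function factory. This is an illustration under assumptions: the `Packet` fields, the `marked` flag standing in for the classifier, and the choice of the router's own address as the new source are hypothetical, and the sketch assumes the original packet continues on its way unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    marked: bool        # stand-in for "matches the signaled classifier"
    payload: bytes


def make_dup(router_addr, new_dst):
    """Build a dup() handler from (hypothetical) signaling parameters."""
    def dup(pkt):
        out = [pkt]                         # original keeps going (assumed)
        if pkt.marked:                      # snoop only the marked packets
            out.append(replace(pkt, src=router_addr, dst=new_dst))
        return out
    return dup


dup_to_b = make_dup(router_addr="r1", new_dst="B")
for p in dup_to_b(Packet("S", "r3", True, b"data")):
    print(p.src, "->", p.dst)
```

Because the copy is an ordinary unicast packet with a new destination, everything past the branch point is handled by normal IP forwarding.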
58Existing Multicast Services
- IP multicast
- Transmit a single packet that is delivered to multiple destinations
- Advantages
- Single address abstraction
- Bandwidth savings
- Scalability (anonymity/best-effort)
- Drawbacks
- Network defines the abstraction (group membership and topology)
- Protocol heterogeneity
59Existing Multicast Services
- Application-Level Multicast
- Requires no network-level support
- End-systems construct overlay networks to connect group members
- Advantages
- Membership and topology controlled by app
- Provides multicast service everywhere there is unicast service
- Drawbacks
- Not particularly efficient
- Not scalable
60New Building Block Services
- Our goal: achieve the best of both
- Greater flexibility and control in developing network services for applications
- Performance and scalability similar to that of the network-based solutions
- Use simple building block services that give applications very limited control over the network to
- invoke lightweight packet processing modules at routers
- initiate ephemeral state probes that compute or gather information about the network
61An Example Multicast Implementation
- Sender maintains the tree topology
- For multicast delivery
- Sender activates dup()s at each branch point
- Data transmitted hop-by-hop
dup→X(): a dup() that copies data to X
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C() and dup→B(), dup→r3()]
BP (branch point) and children: r1: r3, B; r3: A, C, D
62An Example Multicast Implementation
- Q: How does the sender know where to insert the dup() function?
- Caveat: Don't want to know network topology
- A: Through Ephemeral State Processing (ESP)
dup→X(): a dup() that copies data to X
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C() and dup→B(), dup→r3()]
BP (branch point) and children: r1: r3, B; r3: A, C, D
63Tree Construction
- Finding the new Branch Point (BP)
- The sender multicasts a Setup ESP Probe
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C() and dup→B(), dup→r3(); binding shown: (next, r3)]
Setup Probe: if no dup(), next = pkt.dst
64Tree Construction
- Pinpoint the new BP
- unicast to the new receiver (E)
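[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C() and dup→B(), dup→r3(); binding shown: (next, r3); packet fields shown: (pkt.best, r1) (pkt.next, null) and (pkt.best, null) (pkt.next, null)]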
65Tree Construction
- Pinpoint the new BP
- unicast to the new receiver (E)
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C() and dup→B(), dup→r3(); binding shown: (next, r3)]
66Tree Construction
- Pinpoint the new BP
- unicast to the new receiver (E)
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C() and dup→B(), dup→r3(); binding shown: (next, r3)]
67A Sender-Managed Multicast Tree
[Figure: tree with nodes S, A, B, C, D, E, r1, r2, r3; installed modules: dup→A(), dup→D(), dup→C(); dup→r3(), dup→E(); dup→B(), dup→r2()]