Title: Substrate Control: Overview
1Substrate Control Overview
- Fred Kuhns
- fredk_at_arl.wustl.edu
- Applied Research Laboratory
- Washington University in St. Louis
2Defining Terms and Models
3The SPP Node
- Slice instantiation
- Allocate virtual machine (VM) instance on a GPE
- may request code option instance, NPE resources and bandwidth
- Share a common set of (global) IP addresses
- UDP/TCP port space shared across GPE/NPEs
- Line card TCAM filters direct traffic
- unregistered traffic originating outside the node is sent to the CP.
- unregistered traffic originating within the node uses NAT (on line card)
- application may register server ports. This causes a filter to be inserted in the line card directing traffic to a specific GPE
- application must register ports (or tunnels) associated with fast path instances
- It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes.
- Currently we only support UDP tunnels but will extend to include GRE and possibly others.
[Figure: SPP node data path. Labels: GPE, NPE, LC (Ingress, Egress), SCD (ARP, NAT), Internet. Notes: local delivery/exceptions use an internal UDP tunnel; map flow to internal destination.]
4Meta-Interfaces and Tunnels
- Slice fast paths (code option instance, allocated resources) are assumed to sit at one end of a tunnel
- currently only UDP tunnels are supported.
- A UDP tunnel is defined by the 4-tuple: UDP tunnel = {peer ipaddr, peer port, local ipaddr, local port}
- Meta-interface or MI: Represents a tunnel endpoint as viewed by a slice's fast path router. A meta-interface is defined by the local endpoint's address: Meta-Interface = {local ipaddr, local UDP port}
- The encapsulated packet is processed by the fast path.
- packet is always encapsulated within a tunnel by the substrate
- code option instance processes the encapsulated frame
- In the SPP context, the slice registers the MI and the substrate manages encapsulation headers
- Guard against forging source address
- A filter is installed in the corresponding line card's TCAM to send matching packets to the correct NPE
- The NPE's decap module verifies the encapsulation header and provides isolation between slices (based on local IP and port number values in the tunnel header)
- Fabric VLANs are used to provide link-level isolation between slice instances. The VLAN label is also used by the substrate to associate packets with slice fast paths.
MI IP Address UDP Port
0 192.168.1.2 6060
1 192.168.1.3 6060
2 192.168.1.2 6061
3 192.168.1.2 6062
4 192.168.1.3 6061
5 192.168.1.3 6062
6 192.168.1.3 6063
[Figure: fast path (FPx) with meta-interfaces 0 through 6. Caption: MI = local tunnel endpoint (UDP), external {ipaddr, udp_port}.]
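The meta-interface table above can be sketched as a small lookup structure. This is only an illustration: the names `MI_TABLE`, `ENDPOINT_TO_MI` and `rx_meta_interface` are hypothetical and do not come from the SPP software. It shows how a local tunnel endpoint taken from an encapsulation header could be mapped back to an MI, with unregistered endpoints rejected.

```python
import ipaddress

# Meta-interface table from the slide: MI -> (local IP address, local UDP port).
MI_TABLE = {
    0: ("192.168.1.2", 6060),
    1: ("192.168.1.3", 6060),
    2: ("192.168.1.2", 6061),
    3: ("192.168.1.2", 6062),
    4: ("192.168.1.3", 6061),
    5: ("192.168.1.3", 6062),
    6: ("192.168.1.3", 6063),
}

# Reverse index: local tunnel endpoint -> MI (hypothetical helper structure).
ENDPOINT_TO_MI = {
    (int(ipaddress.IPv4Address(ip)), port): mi
    for mi, (ip, port) in MI_TABLE.items()
}


def rx_meta_interface(local_ip, local_port):
    """Return the MI registered for a local endpoint, or None if unregistered."""
    key = (int(ipaddress.IPv4Address(local_ip)), local_port)
    return ENDPOINT_TO_MI.get(key)
```

A packet whose tunnel header carries a local endpoint that is not in the table maps to `None`, which is where a decap stage would refuse it.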
5Lookup Table, TCAM, Use
6Lookup filters: Key, Action and Result
- A lookup key is then created from the packet's header fields and the receiving meta-interface
- code option extracts fields from the encapsulated packet
- substrate adds the receiving meta-interface identifier
- If no entry is found then the packet's no_route exception attribute is set; otherwise a result is returned containing an action field and forwarding information (output meta-interface and next hop address)
- a code option may define additional exception attributes
- The complete filter specification: {lookup_key, result_vector}
- lookup_key = {RxMI, copt_key}
- RxMI: Meta-interface ID on which the packet was received.
- copt_key: Lookup key defined by the code option. The IPv4 key = {daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)}
- result_vector = {sindx, action, qid, TxMI, nexthop}
- sindx: stats index
- action: Packet disposition, one of {drop, fwd, ld}
- drop: drop packet
- fwd: forward packet using next hop value (fwdkey)
- ld: local delivery, code option instance has local address information??
- qid: packet queue
- TxMI: Meta-interface used for sending the packet; corresponds to a previously registered local tunnel endpoint. Used to fill in the local address of the outgoing packet's tunnel header.
- nexthop: Tunnel endpoint for the next hop. For UDP tunnels, this is the IP address and UDP port number of the next hop device.
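As a rough sketch of the filter specification above, the IPv4 key can be packed into 112 bits and matched against value/mask entries in priority order. The function names, the `Result` tuple and the value/mask representation are assumptions made for illustration; the slides do not define a software API for this.

```python
import struct
from collections import namedtuple

# result_vector fields as listed on the slide.
Result = namedtuple("Result", "sindx action qid txmi nexthop")


def ipv4_copt_key(daddr, saddr, sport, dport, tcp_flgs, proto):
    """Pack daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)."""
    return struct.pack("!IIHHBB", daddr, saddr, sport, dport, tcp_flgs, proto)


def lookup(filters, rxmi, copt_key):
    """Match against (rxmi, value, mask, Result) entries in priority order.

    Returns (result, exceptions). With no match, result is None and the
    no_route exception attribute is set.
    """
    key = int.from_bytes(copt_key, "big")
    for f_rxmi, value, mask, result in filters:
        if f_rxmi == rxmi and (key & mask) == (value & mask):
            return result, set()
    return None, {"no_route"}
```

In this layout daddr occupies the top 32 bits of the 112-bit key, so a /24 destination filter uses the mask `0xFFFFFF00 << 80`.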
7Slice view of the Lookup Key
[Figure: slice view of the lookup key. Labels: user specified lookup key (4 x 32-bit words), slice defined fields, xmi, xsid. Width labels: 128-N, N, 12.]
- When a packet is received the substrate creates a lookup key using the target slice's xsid and the receiving meta-interface. The remaining bits are defined by the code option.
- xsid: represents the internal slice ID and may differ from the slice's external ID. For implementation efficiency, this is the VLAN identifier assigned to the slice.
- xmi: Internal representation of the meta-interface (MI), an encoding of the received tunnel endpoint.
- For UDP tunnels this field includes a 4-bit interface id and the 16-bit local UDP port number. The 4-bit id is used as an index into a table of local IP addresses.
- The IPv4 code option defined fields are shown below, where pr is the IP protocol field and tcp is the TCP header flags.
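The xmi encoding for UDP tunnels can be sketched as follows. The bit order (interface id in the high 4 bits, UDP port in the low 16) and the contents of `LOCAL_IPS` are assumptions for illustration; the slides only state that the field holds a 4-bit interface id and the 16-bit local UDP port.

```python
# Hypothetical table of local IP addresses, indexed by the 4-bit interface id.
LOCAL_IPS = ["192.168.1.2", "192.168.1.3"]


def encode_xmi(if_id, udp_port):
    """Combine a 4-bit interface id and a 16-bit UDP port into a 20-bit value."""
    if not 0 <= if_id < 16:
        raise ValueError("interface id must fit in 4 bits")
    if not 0 <= udp_port < (1 << 16):
        raise ValueError("UDP port must fit in 16 bits")
    return (if_id << 16) | udp_port


def decode_xmi(xmi):
    """Recover (local IP address, local UDP port) from an encoded xmi."""
    if_id = (xmi >> 16) & 0xF
    return LOCAL_IPS[if_id], xmi & 0xFFFF
```

Indexing a small address table with the 4-bit id keeps the key narrow compared with carrying a full 32-bit local address.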
8IPv4 TCAM Filter Formats (on NPE)
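[Figure: IPv4 TCAM lookup key. Fields defined by the IPv4 code option (112 bits): tcp/proto, daddr, saddr, sport, dport. Substrate-defined fields: if, vlan, T, RX port. One label reads "Represents input meta-interface". Width labels in the figure: 11, 1, 16, 4, 32, 32, 16, 16, 16, 6, 8, 2.]
T = 0: normal lookup. T = 1: substrate-only lookup.
[Figure: lookup result, 64 bits. Fields: L, D, rsv, TX IP daddr, TX dport, TX sport, rsv, QM, rsv, sindx, qid, Sch. Width labels in the figure: 32, 16, 12, 15, 16, 3, 11, 3, 1, 1, 2, 16.]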
- TX IP address and sport represent the output meta-interface. The dport is provided by the slice. (RMP maps miid to tx tunnel params and uses the dport provided by the slice.)
- sindx: global stats index (SCD maps the slice's sindx to the global value)
- qid: 20-bit internal qid (SCD maps the slice's miid to QM and Sch. SCD also maps the slice's qid to the global qid value)
- D: Drop packet. L: Local delivery.
Slice parameters
- Key: Input miid, IPv4 fltr {daddr, saddr, sport, dport, tcp/proto}
- Result: Flags {Drop, GPE}, sindx, Output miid, QID
9Lookup
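- The parse block makes the copt_key.
- The substrate adds the xsid and xmi fields.
- The substrate uses the TxMI and nexthop fields to construct the encapsulation header.
[Figure: lookup path. Blocks: decap, parse block, Lookup A. Key fields: xsid, RxMI, copt_key. Result fields: sindx, action, qid, TxMI, nexthop. TxMI and nexthop are also labeled separately.]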
10Version 2 and Multicast
- In version 2 there will be 2 stages to the lookup
- add fanout (count) to lookup B.
- if fanout gt 1 then address of fanout else result vector
- Chain fanout blocks
- TxMI includes an interface vector: a 4-bit field that is used to look up the interface IP address and MAC address.
[Figure: two-stage lookup. Blocks: decap, parse block, LookupA, LookupB, fanout table. Labels: lookup_key; {action, sindx, rindx}; rindx (overloaded with fanout address); result_index; {sindx, qid, TxMI, nexthop}; fanout table entries {qid, TxMI, nexthop}; sindex passed from side A; VLAN table in header format and VLAN table in Decap/Parse.]
11Lookup Example
- When a code option is requested the slice is allocated the requested number of TCAM entries: fid in {0, ..., Nf-1}
- all TCAM operations accept a TCAM entry ID (fid)
- Entries are listed in priority order, with fid 0 the highest priority and entry Nf-1 the lowest.
- It is up to the slice control path to order the lookup entries.
- For example, if we have the simple routing database
- 10.10.2.1/32 Local delivery (GPE)
- 10.5.2.0/24 NH A
- 10.5.1.0/24 NH B
- 10.5.0.0/16 NH C
Slice Meta-Interfaces
MI IP Address UDP Port
0 192.168.1.2 6060
1 10.50.10.2 6061
2 10.50.10.2 6062
3 10.1.1.1 6060
Slice BW Allocations
Interface BW ipAddr
0 BE 192.168.1.2
1 100Mbps 10.50.10.2
2 10Mbps 10.1.1.1
Slice Queue Bindings
QID Interface BW max Bytes
0 0 - Local
1 1 40 1024
2 1 60 1024
3 2 100 1024
Desired Route Table (LPM)
prefix TxMI nexthop
10.10.2.1/32 0 Local
10.5.2.0/24 1 NH A
10.5.1.0/24 2 NH B
10.5.0.0/16 3 NH C
- Then the control software could use the following
- write_fltr(fid, rxmi, prefix, width, action, qid, TxMI, nexthop)
- write_fltr(0, , 10.10.2.1, 0xFFFFFFFF, LD)
- write_fltr(1, , 10.5.2.0, 0xFFFFFF00, fwd, 1, 1, NHA)
- write_fltr(2, , 10.5.1.0, 0xFFFFFF00, fwd, 2, 2, NHB)
- write_fltr(3, , 10.5.0.0, 0xFFFF0000, fwd, 3, 3, NHC)
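To show why the ordering of the four entries matters, here is a small software stand-in for the slice's TCAM entries. It is a sketch only: `FilterTable` is a hypothetical class, and since the rxmi argument is blank in the calls above, this sketch passes `None` and treats it as a wildcard, which is an assumption.

```python
import ipaddress


def ip(s):
    """Convert a dotted-quad string to a 32-bit integer."""
    return int(ipaddress.IPv4Address(s))


class FilterTable:
    """Software stand-in for a slice's TCAM entries; the lowest fid wins."""

    def __init__(self, nf):
        self.entries = [None] * nf

    def write_fltr(self, fid, rxmi, prefix, mask, action,
                   qid=None, txmi=None, nexthop=None):
        # rxmi=None acts as a wildcard (an assumption of this sketch).
        self.entries[fid] = (rxmi, ip(prefix) & mask, mask,
                             (action, qid, txmi, nexthop))

    def lookup(self, rxmi, daddr):
        """Return the result of the first matching entry, or None."""
        d = ip(daddr)
        for entry in self.entries:
            if entry is None:
                continue
            f_rxmi, value, mask, result = entry
            if f_rxmi in (None, rxmi) and (d & mask) == value:
                return result
        return None


table = FilterTable(4)
table.write_fltr(0, None, "10.10.2.1", 0xFFFFFFFF, "ld")
table.write_fltr(1, None, "10.5.2.0", 0xFFFFFF00, "fwd", 1, 1, "NHA")
table.write_fltr(2, None, "10.5.1.0", 0xFFFFFF00, "fwd", 2, 2, "NHB")
table.write_fltr(3, None, "10.5.0.0", 0xFFFF0000, "fwd", 3, 3, "NHC")
```

Because the /24 entries sit at lower fids than the /16 entry, an address such as 10.5.2.9 reaches NHA rather than NHC. Swapping fid 1 and fid 3 would send it to NHC instead.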
12Example IPv4 LPM
- In general, for longest prefix match a good strategy is to divide the allocated filters into 32 sets
- For example, assume 1024 TCAM entries have been allocated and we are using LPM.
- Divide the filters into 32 sets of 32 filters each and associate a prefix length with each
- Then for a particular prefix width add it to the appropriate set.
- Entries within a set are non-overlapping so their order doesn't matter.
- This is the scheme used by software written by IDT, the manufacturer of the TCAM we currently use.
Prefix Width Filter ID Range
32 0 - 31
31 32 - 63
w (32-w)*32 + (0...31)
1 992 - 1023
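The table's general row can be written as a short function. The name `lpm_fid_range` and the `total` parameter are illustrative; the only thing taken from the slide is the rule that prefix width w maps to filter IDs (32-w)*32 + (0...31) when 1024 entries are split into 32 equal sets.

```python
def lpm_fid_range(width, total=1024):
    """Return the range of filter IDs reserved for a given prefix width.

    Longer prefixes get lower (higher priority) filter IDs.
    """
    if not 1 <= width <= 32:
        raise ValueError("prefix width must be between 1 and 32")
    set_size = total // 32
    first = (32 - width) * set_size
    return range(first, first + set_size)
```

For example, `lpm_fid_range(24)` gives filter IDs 256 through 287, so every /24 route sits above any /16 route (IDs 512 through 543) in priority.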
13Keeping track of TCAM entries
- Substrate will have to manage the mapping of VM TCAM filter IDs to the actual filter ID.
- VM control software will use a normalized filter index list (starts at 0 and has the requested number of filter entries).
- The SCD (xscale daemon) must map the per-VM index into the actual TCAM index.
- Source for managing TCAM entries.
- NPU A and B share a common TCAM and index range so this must be managed across the two xscales.
- See C implementation of the RangeMap class in WUSRC/range
- Class will also be used for managing the QID name space.
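The per-VM to actual index mapping can be sketched as below. This is not the RangeMap class referenced above, whose interface is not shown in the slides; `SliceIndexMap` is a hypothetical stand-in that assumes a slice owns one or more blocks of actual indices, which need not be contiguous.

```python
import bisect


class SliceIndexMap:
    """Map a slice's normalized index (0..size-1) to an actual TCAM index."""

    def __init__(self, ranges):
        # ranges: list of (first_actual_index, count) blocks owned by the slice.
        self.ranges = list(ranges)
        self.starts = []  # normalized start index of each block
        total = 0
        for _, count in self.ranges:
            self.starts.append(total)
            total += count
        self.size = total

    def to_actual(self, vm_index):
        """Translate a per-VM index into the actual index."""
        if not 0 <= vm_index < self.size:
            raise IndexError("index outside the slice's allocation")
        # Find the last block whose normalized start is <= vm_index.
        i = bisect.bisect_right(self.starts, vm_index) - 1
        first, _ = self.ranges[i]
        return first + (vm_index - self.starts[i])
```

The bounds check doubles as an isolation check: a slice cannot name an entry outside the blocks it was allocated. The same structure would apply to QIDs.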
14Control Software: Resource Management
15System Resource Manager
[Figure: System Resource Manager architecture. Labels: Resource DB, SRM, CP, GPE, vmx, NMP, NPE, SCD, RMP, LC, SRAM, FPk, FPx, root context, planetlab OS, vnet. Notes: support fast path configuration via the PLC; exception and local delivery traffic includes a shim header with RxMI.]
16Partitioning of (substrate) Responsibilities
- Virtual Machine (Slice control SW): Application logic, code option specific control and data operations.
- traditional PlanetLab slice operations
- manage code option specific lookup tables, stats, memory and configuration blocks
- implements interface with fast path for exception and local delivery traffic
- vnet
- flow isolation: filtering traffic through the Linux kernel
- add support for VLAN-based filtering and port reservation
- Resource Manager Proxy (aka Local Resource Manager)
- all VM commands are issued to the RMP
- the RMP is able to validate command sender (authenticate)
- enforce access restrictions (authorize)
- decouples VMs from substrate control entities. That is, maps exported abstractions and interfaces to specific hardware and software interfaces.
- verifies (or inserts) substrate message header slice IDs to prevent deliberate or accidental masquerading; part of ensuring isolation and security.
- in tandem with SRM implements device independent logic
- System Resource Manager
- device independent logic
- responsible for implementing and enforcing
- system resource abstractions
- resource isolation and allocation policies
17Responsibilities
- RMP Responsibilities
- Translate slice MI to local endpoint. Either call SRM or cache mappings.
- Add xsid to subMsg header
- Pass through identifiers mapped by SCD: qid, fid and stats.
- Pass through relative queue weights; SCD maps to global weight.
- SCD Responsibilities
- Translate slice specific indices to global indices: qid, fid and stats.
- Knows the location of all tables
- Interprets commands to add, remove and modify entries to data path tables.
- Knows per slice interface BW allocation and maps relative queue weight to global weight.
- Each interface schedule is assigned (by SRM) max rate.
[Figure: control entities and their tables. SRM (the Decider): system tables, Interfaces {ifn, type, ipaddr, linkBW, availBW}, endpoint (port) maps (resvMap, availMap, usedMaps), xsidMap, per slice tables (xsid, vlan). RMP. SCD (NPE): tables in data path (SRAM), VLAN table, fid, Queue Params (sid, qid), HF Control Block?, code option control blocks?, per interface scheduler and rate limits, per slice data, slice maps xsid {qidMap, FidMap, statsMap}, Interface BW. Other labels: request allocation, make allocation, base, real indx, xsid/offset, xsid/size, xsid/range. Note: ranges are not required to be contiguous.]
18Queuing and allocating Interface Bandwidth
19Simple Queuing Example
Slice Interface and Queue Allocations: {Port, BW, QList}, QList = {qid, weight, threshold, ...}
Physical Port (Interface) Attributes: {ifn, type, ipaddr, linkBW, availBW}
- ifn: Interface number
- type: {Internet, Peering}
Operations
- get_interfaces()
- get_ifattrs(ifn)
- get_ifpeer(ifn)
- alloc_ifbw(ifn, xsid, bw)
[Figure: simple queuing example. Labels: NPE with FP slice1 (queues q10, q11, ..., q1n; qid in 0...n-1) and FP slice2 (queues q20, q21, ..., q2m; qid in 0...m-1), each with a wrr scheduler; LC with FP1, FP2, ipAddr, linkBW; GPE; bandwidth labels BW11, BW21, BW1.]
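The slides do not give the rule that turns a slice's relative queue weights into actual shares of its interface bandwidth. As a purely illustrative sketch, assuming each queue receives the slice's interface allocation scaled by its weight over the sum of that slice's weights on the interface (a common weighted round robin interpretation, not confirmed by the source):

```python
from fractions import Fraction


def queue_bandwidths(slice_bw_mbps, weights):
    """Split a slice's interface allocation across its queues by weight.

    weights: dict mapping qid -> relative weight.
    Returns a dict mapping qid -> bandwidth share in Mbps (exact fractions).
    ASSUMPTION: share = slice_bw * weight / sum(weights); the real mapping
    performed by the SCD is not specified in the slides.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {qid: Fraction(slice_bw_mbps * w, total) for qid, w in weights.items()}
```

Under this assumption a slice holding 100 Mbps on an interface with two queues weighted 40 and 60 would see them served at 40 Mbps and 60 Mbps.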
20Substrate Message Format
21Substrate Message
- Assume a simple command-response (two-way) messaging framework, but will also support one-way schemes.
- Supports asynchronous communications using a message ID.
- The command field is overloaded for the return code.
- Every server is expected to implement a simple Version command (cmd 0) which returns the server's ID and version number as two 32-bit fields.
- primary use is for monitoring health of servers and debugging.
- All other command values are unique only to a particular server.
- Uses UDP as the transport protocol.
- All commands are expected to be idempotent
[Figure: msg header layout, with bit positions 0 and 15 marked on each of two columns.]
- mlen: Total message length, including the header.
- mid: Message ID, used to support asynchronous message processing.
- cid: Context identifier. Specifies the context within which the message is processed. A value of 0 indicates substrate context.
- cmd: Command to execute or a return code.
- The 4 header fields are each 16 bits.
- body: 0 or more bytes of command data.
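A minimal sketch of the header described above, assuming the four 16-bit fields appear in the order mlen, mid, cid, cmd and are sent in network byte order. The function names are hypothetical; the slides define the fields but not a library.

```python
import struct

# mlen, mid, cid, cmd: four 16-bit fields in network byte order (assumed order).
HEADER = struct.Struct("!HHHH")


def pack_msg(mid, cid, cmd, body=b""):
    """Build a substrate message; mlen covers the header plus the body."""
    return HEADER.pack(HEADER.size + len(body), mid, cid, cmd) + body


def unpack_msg(data):
    """Split a received message into (mid, cid, cmd, body)."""
    mlen, mid, cid, cmd = HEADER.unpack_from(data)
    if mlen != len(data):
        raise ValueError("mlen does not match the received length")
    return mid, cid, cmd, data[HEADER.size:]


def unpack_version_reply(body):
    """Decode a Version (cmd 0) reply body: server ID and version, 32 bits each."""
    return struct.unpack("!II", body)
```

Since the transport is UDP and commands are expected to be idempotent, a client can simply resend a request carrying the same mid when no reply arrives.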
22Overview
- In the interface specifications I provide a C-like description of the operations and results.
- The descriptions are only intended to describe the actual message format, data fields and returned results. They are not meant to specify an application-level library.
- The arguments are to be encoded into the message body in the order they are given, using network byte order (big endian) and without padding.
- All commands result in one of the following
- No return response: one-way call semantics
- An error occurs processing the message, or the command encounters an unexpected condition or error. In this case the return message will have the error return code in the cmd field.
- The command completes and does not indicate an error to the message framework; then the message result code indicates success. The message body contains any result data.
23Example Message
- Slice with xsid of 0x10 requests the allocation of a global UDP (protocol number 17) port for the local IP address 128.252.130.34 (hex 0x80FC8222).
- Assume the alloc_port command ID is 4.
- port alloc_port(0x80FC8222, 0, 17)
- Allocate a global UDP (protocol number 17) port for the local IP address 128.252.130.34 (hex 0x80FC8222), and let the system assign the next available port number.
- The resource manager allocates port 5050 (0x13BA); the return code of 0 indicates success.
[Figure: Command Message and Reply Message layouts.]
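The exchange above could be encoded roughly as follows. Several details are assumptions, because the message diagrams did not survive: the argument widths (32-bit address, 16-bit port, 8-bit protocol), the 16-bit port in the reply body, the use of the cid field to carry the slice's xsid, and the message ID value. Only the command ID 4, the argument values and the result 5050 come from the slide.

```python
import struct

ALLOC_PORT = 4  # command ID assumed on the slide


def alloc_port_request(mid, xsid, ipaddr, port, proto):
    """Encode alloc_port(ipaddr, port, proto); port 0 asks for the next free port."""
    # ASSUMED widths: ipaddr 32 bits, port 16 bits, proto 8 bits, no padding.
    body = struct.pack("!IHB", ipaddr, port, proto)
    return struct.pack("!HHHH", 8 + len(body), mid, xsid, ALLOC_PORT) + body


def alloc_port_reply(mid, xsid, port):
    """Encode a successful reply: cmd carries return code 0, body carries the port."""
    body = struct.pack("!H", port)  # ASSUMED 16-bit result
    return struct.pack("!HHHH", 8 + len(body), mid, xsid, 0) + body


request = alloc_port_request(1, 0x10, 0x80FC8222, 0, 17)
reply = alloc_port_reply(1, 0x10, 5050)
```

The reply reuses the request's mid so the sender can match it to the outstanding command.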
24NAT
25- Problem
- UDP, TCP: 2 or more GPEs attempt to use the same global IP, port and proto
- ICMP ???