Title: Data plane algorithms in routers
 1Data plane algorithms in routers
- From prefix lookup to deep packet inspection 
 - Cristian Estan, University of Wisconsin-Madison
 
  2What is the data plane?
- The part of the router handling the traffic 
 - Data plane algorithms applied to every packet 
 - Successive packets typically treated 
independently  - Example deciding on which link to send a packet 
 - Throughput defined as number of packets or bytes 
handled per second is very important  - Line speed  keeping up with the rate at which 
traffic can be transmitted over the wire or fiber  - Example 10Gbps router has 32 ns to handle 40 
byte packet  - Memory usage limited by technology and costs 
 - Can afford at most tens of megabits of fast 
on-chip memory 
  3A generic data plane problem
- Router has many directives composed of a guard, 
and an associated action (all guards distinct)  - There is a simple procedure for testing how well 
a guard matches a packet  - For each packet, find the guard that matches 
best and take the associated action  - Example  routing table lookup 
 - Each guard is an IP prefix (between 0 and 32 
bits)  - Matching procedure is the guard a prefix of the 
32 bit destination IP address  - Best defined as longest matching prefix
 
  4The rules of the game
- Matching against all guards in sequence is too 
slow  - We build a data structure that captures the 
semantics of all guards and use it for matching  - Primary metrics 
 - How fast the matching algorithm is 
 - How much memory the data structure needs 
 - Time to build data structure also has some 
importance  - We can cheat (but we wont today) by 
 - Using binary or ternary content-addressable 
memories  - Using other forms of hardware support
 
  5Measuring algorithm complexity
- Execution cost measured in number of memory 
accesses to read data structure  - Actual data manipulation operations typically 
very simple  - On some platforms we can read wide words 
 - Worst case performance most important 
 - Worst case defined with respect to input, not 
guards  - Caching has been proven ineffective for many 
settings  - Using algorithms with good amortized complexity, 
but bad worst case requires large buffers  
  6Overview
- Longest matching prefix 
 - Trie-based algorithms 
 - Uni-bit and multi-bit tries (fixed stride and 
variable stride)  - Leaf pushing 
 - Bitmap compression of multi-bit trie nodes 
 - Tree bitmap representation for multi-bit trie 
nodes  - Binary search on ranges 
 - Binary search on prefix lengths 
 - Classification on multiple fields 
 - Signature matching
 
  7Longest matching prefix
- Used in routing table lookup (a.k.a. forwarding) 
for finding the link on which to send a packet  - Guard a bit string of 0 to w bits called IP 
prefix  - Action a single byte interface identifier 
 - Input a w-bit string representing the 
destination IP address of the packet (w is 32 for 
IPv4,128 for IPv6)  - Output the interface associated with the longest 
guard matching the input  - Size of problem hundreds of thousands of prefixes
 
  8Controlled prefix expansion with stride 3
Leaf pushing
Multi-bit trie with fixed stride
Routing table
Multi-bit trie with variable stride
Leaf pushing reduces memory usage but increases 
update time
Uni-bit trie
Given a maximum trie height h and a routing table 
of size n dynamic programming algorithm computes 
optimal variable stride trie in O(nw2h) 
 9Controlled prefix expansion with stride 3
Leaf pushing
Multi-bit trie with fixed stride
Routing table
Input
Multi-bit trie with variable stride
Leaf pushing reduces memory usage but increases 
update time
Longest matching prefix
Uni-bit trie
Given a maximum trie height h and a routing table 
of size n dynamic programming algorithm computes 
optimal variable stride trie in O(nw2h) 
 10Controlled prefix expansion with stride 3
Leaf pushing
Multi-bit trie with fixed stride
Routing table
Input
Multi-bit trie with variable stride
Leaf pushing reduces memory usage but increases 
update time
Longest matching prefix
Uni-bit trie
Given a maximum trie height h and a routing table 
of size n dynamic programming algorithm computes 
optimal variable stride trie in O(nw2h) 
 11Lulea bitmap compression
Bitmap supporting fast counting
Compressed node
When the compression bitmaps are large it is 
expensive to count bits during lookup. The bitmap 
is divided into chunks and a pre-computed 
auxiliary array stores the number of bits set 
before each chunk. The lookup algorithm needs to 
count only bits set within one chunk.
Repeating entries are stored only once in the 
compressed array. An auxiliary bitmap is needed 
to find the right entry in the compressed node. 
It stores a 0 for positions that do not differ 
from the previous one.
Input
Representing node as tree bitmap
Pointers to children and prefixes are stored in 
separate structures. Prefixes of all lengths are 
stored, thus leaf pushing is not needed and 
update is fast. Bitmaps have 1s corresponding to 
entries that are not empty.
Longest matching prefix
13013 
 12Binary search on ranges
- Divide w-bit address space into maximal 
continuous ranges covered by same prefix  - Build array or balanced (binary) search tree with 
boundaries of ranges  - At lookup time perform O(log(n)) search 
 - Not better than multi-bit tries with compression, 
but it is not covered by patents 
  13Binary search on prefix lengths
- Core idea for each prefix length represented in 
the routing table, have a hash table with the 
prefixes  - Can find longest matching prefix after looking up 
in each hash table the prefix of the address with 
corresponding length  - Binary search on prefix lengths is faster 
 - Simple but wrong algorithm if you find prefix at 
length x store it as best match and look for 
longer matching prefixes, otherwise look for 
shorter prefixes  - Problem what if there is both a shorter and a 
longer prefix, but no prefix at length x?  - Solution insert marker at length x when there 
are longer prefixes. Must store with marker 
longest matching shorter prefix. Markers lead to 
moderate increase in memory usage.  - Promising algorithm for IPv6 (w128)
 
  14Papers on longest matching prefix
- G. Varghese Network algorithmics an 
interdisciplinary approach to designing fast 
networked devices, chapter 11, Morgan Kaufmann 
2005  - V. Srinivasan, G. Varghese Faster IP lookups 
using controlled prefix expansion, ACM Trans. on 
Comp. Sys., Feb. 1999  - M. Degermark, A. Brodnik, S. Carlsson, S. Pink 
Small forwarding tables for fast routing 
lookups, ACM SIGCOMM, 1997  - W. Eatherton, Z. Dittia, G. Varghese Tree Bitmap 
 Hardware / Software IP Lookups with Incremental 
Updates, http//www-cse.ucsd.edu/varghese/PAPERS
/willpaper.pdf  - B. Lampson, V. Srinivasan, G. Varghese IP 
lookups using multiway and multicolumn search, 
IEEE Infocom, 1998  - M. Waldvogel, G. Varghese, J. Turner, B. 
Plattner, Scalable high-speed IP lookups, ACM 
Trans. on Comp. Sys., Nov. 2001 
  15Overview
- Longest matching prefix 
 - Classification on multiple fields 
 - Solution for two-dimensional case grid of tries 
 - Bit vector linear search 
 - Cross-producting 
 - Decision tree approaches 
 - Signature matching
 
  16Packet classification problem
- Required for security, recognizing packets with 
quality of service requirements  - Guard prefixes or ranges for k header fields 
 - Typically source and destination prefix, source 
and destination port range, and exact value or  
for protocol  - All fields must match for rule to apply 
 - Action drop, forward, map to a certain traffic 
class  - Input a tuple with the values of the k header 
fields  - Output the action associated with the first rule 
that matches the packet (rules are strictly 
ordered)  - Size of problem thousands of classification rules
 
  17Example of classification rule set
External time server TO
Router that filters traffic
Mail gateway M
Internet
Net
Internal time server TI
Secondary name server S 
 18A geometric view of packet classification
R1
R1
R3
R3
Destination address space
R2
R2
Source address space
Source address space
- In theory number of regions defined can be much 
larger than number of rules  - Any algorithm that guarantees O(n) space for all 
rule sets of size n needs O(log(n)k-1) time for 
classification 
  19The two dimensional case source and destination 
IP addresses
- For each destination prefix in rule set, link to 
corresponding node in destination IP trie a trie 
with source prefixes of rules using this 
destination prefix  - Matching algorithm must use backtracking to visit 
all source tries  - Grid of tries by pre-computing switch pointers 
in destination tries and propagating some 
information about more general rules, matching 
may proceed without backtracking  - Memory used proportional to number of rules 
 - Matching time O(w) with constant depending on 
stride  - Extended grid of tries handles 5 fields and has 
good run time and memory in practice 
  20R8
- Bit vector approaches do linear search through 
rule set  - For each field we pre-compute a structure (e.g. 
trie) to find most specific prefix or range 
distinguished by rule set  - For each rule, a single bit represents whether a 
given most specific prefix matches rule or not  - We associate with each range a bitmap of size n 
encoding which of the rules may match a packet in 
that prefix or range  - Classification algorithm first computes for each 
field of the packet the most specific 
prefix/range it belongs to  - By then AND-ing together the k bitmaps of size n 
we find matching rules  - Works well for hardware solutions that allow wide 
memory reads  - Scales poorly to large rule sets
 
  21312033046131478
Cross-producting performs longest prefix matching 
separately for all fields and combines the 
results in a single step by looking up the 
matching rule in a pre-computed table explicitly 
listing the first matching rule for each element 
of the cross-product. The size of this table is 
the product of the numbers of recognized 
prefixes/ranges for the individual fields. Due to 
its memory requirements this method is not 
feasible.
44523480
Equivalenced cross-producting (a.k.a. recursive 
flow classification or RFC) combines the results 
of the per-field longest matching prefix 
operations two by two. The pairs of values are 
grouped in equivalence classes and in general 
there are much fewer equivalence classes than 
pairs of values. This leads to significant memory 
savings as compared to simple cross-producting. 
This algorithm provides fast packet 
classification, but compared to other algorithms, 
the memory requirements are relatively large (but 
feasible in some settings).
16 entries, 8 distinct classes
Src IP
Dest IP
Src Port
Dest Port
Proto
Final result 
 22Decision tree approaches
- At each node of the tree test a bit in a field or 
perform a range test  - Large fan-out leads to shallow trees and fast 
classification  - Leaves contain a few rules traversed linearly 
 - Interior nodes may contain rules that match also 
 - Tests may look at bits from multiple fields 
 - A rule may appear in multiple nodes of the 
decision tree  this can lead to increased memory 
usage  - Tree built using heuristics that pick fields to 
compare on that divide remaining rules relatively 
evenly among descendants  - Fast and compact on rule sets used today
 
  23Papers on packet classification
- G. Varghese Network algorithmics , chapter 12 
 - V. Srinivasan, G. Varghese, S. Suri, M. 
Waldvogel, Fast and Scalable Layer Four 
Switching, ACM SIGCOMM, Sep. 1998  - F. Baboescu, S. Singh, G. Varghese, Packet 
classification for core routers Is there an 
alternative to CAMs?, IEEE Infocom, 2003  - P. Gupta, N. McKeown, Packet classification on 
multiple fields, ACM SIGCOMM 1999  - T. Woo, A modular approach to packet 
classification Algorithms and results, IEEE 
Infocom, 2000  - S. Singh, F. Baboescu, G. Varghese, Packet 
classification using multidimensional cutting, 
SIGCOMM, 2003 
  24Overview
- Longest matching prefix 
 - Classification on multiple fields 
 - Signature matching 
 - String matching 
 - Regular expression matching w/ DFAs and D2FAs
 
  25Signature matching
- Used in intrusion prevention/detection, 
application classification, load balancing  - Guard a byte string or a regular expression 
 - Action drop packet, log alert, set priority, 
direct to specific server  - Input byte string from the payload of packet(s) 
 - Hence the name deep packet inspection 
 - Output the positions at which various signatures 
match or the identifier of the highest priority 
signature that matches  - Size of problem hundreds of signatures per 
protocol 
  26String matching
- Most widely used early form of deep packet 
inspection, but the more expressive regular 
expressions have superceded strings by now  - Still used as pre-filter to more expensive 
matching operations by popular open source 
IDS/IPS Snort  - Matching multiple strings a well-studied problem 
 - A. Aho, M. Corasick. Efficient string matching 
An aid to bib- liographic search, Communications 
of the ACM, June 1975  - Many hardware-based solutions published in last 
decade  - Matching time independent of number of strings, 
memory requirements proportional to sum of their 
sizes 
  27Regular expression matching
- Deterministic and non-deterministic finite 
automata (DFAs and NFAs) can match regular 
expressions  - NFAs more compact but require backtracking or 
keeping track of sets of states during matching  - Both representations used in hardware and 
software solutions, but only DFA based solutions 
can guarantee throughput in software  - DFAs have a state space explosion problem 
 - From DFAs recognizing individual signatures we 
can build a DFA that recognizes entire signature 
set in a single pass  - Size of combined DFA much larger than sum of 
sizes for DFAs recognizing individual signatures  - Multiple combined DFAs are used to match 
signature set  
  28S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. 
Turner, Algorithms to Accelerate Multiple 
Regular Expressions Matching for Deep Packet 
Inspection, ACM SIGCOMM, September 2006
Delayed Input DFA (D2FA)
Deterministic finite automaton (DFA)
State 0
State 1
State 2
State 3
State 0
State 1
State 2
State 3
Default transitions
Input
If the current state variable meets an 
acceptance condition (e.g. whether the state 
identifier is larger than a given threshold), the 
automaton raises an alert.
Crt. state
D2FAs build on the observation that for many 
pairs of states, the transition tables are very 
similar and it is enough to store the 
differences. The lookup algorithm may need to 
follow multiple default transitions until it 
finds a state that explicitly stores a pointer to 
the next state it needs to transition to. Since 
this is a throughput concern, the algorithm for 
constructing D2FAs allows the user to set a limit 
on the length of the maximum default path. 
The memory columns report the ratio between the 
number of transitions used by the D2FA and the 
corresponding DFA. 
 29Conclusions
- Networking devices implement more and more 
complex data plane processing to better control 
traffic  - The algorithms and data structures used have big 
performance impact  - Often set of rules to be matched against has 
specific structure  - Algorithms exploiting this structure may give 
good performance even if it is impossible to find 
an algorithm that gives good performance on all 
possible rule sets 
  30Thats all folks!