Title: Computer Science is no more about computers
1- Computer Science is no more about computers
than astronomy is about telescopes.
- Edsger W. Dijkstra
2Models, Algorithms, and Architectures for Scalable
Packet Classification
- David E. Taylor
- Dissertation Defense
- Department of Computer Science and Engineering
- 22 July 2004
3Internet Protocol (IP)
4IP Route Lookup
Longest Prefix Match (LPM) using destination IP
address in packet header
5IP Route Lookup
Longest Prefix Match (LPM) using destination IP
address in packet header
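A minimal sketch of Longest Prefix Match, assuming a small hypothetical route table (the prefixes and port names below are illustrative and do not come from the slides). It uses a plain linear scan to show what LPM computes; it is not the Tree Bitmap algorithm that FIPL implements.

```python
import ipaddress

def longest_prefix_match(table, address):
    """Return (prefix, next_hop) for the longest prefix containing address, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        # Keep the matching prefix with the greatest length seen so far.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else (str(best[0]), best[1])

# Hypothetical route table (prefix -> next hop), for illustration only.
ROUTES = {
    "0.0.0.0/0": "port0",
    "10.0.0.0/8": "port1",
    "10.1.0.0/16": "port2",
    "10.1.2.0/24": "port3",
}

match = longest_prefix_match(ROUTES, "10.1.2.77")
```

All four prefixes contain 10.1.2.77, so the lookup returns the /24 entry because it is the most specific.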
6Packet Classification
Packet Filter Set
7Packet Classification Problem
- Given a packet P containing fields Pj and a set
of filters F with each filter Fi containing
fields Fij, select the highest priority exclusive
filter and r highest priority non-exclusive
filters such that for each matching filter i,
∀ j, Fij matches Pj
- Performance tradeoffs commonly characterized by
point location problem in computational geometry
- For n regions defined in j dimensions, for j > 3,
a point may be located in multi-dimensional space
in O(log n) time with O(n^j) space or
O(log^(j-1) n) time with O(n) space
- Constraints
- 31 million lookups per second (10 Gb/s link)
- Memory and power efficiency
- Support for fast incremental updates
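The lookup-rate constraint can be reproduced with a short calculation. The sketch below assumes minimum-size 40-byte packets arriving back to back; the slide states the link rate but not the packet size, so the 40-byte figure is an assumption.

```python
LINK_RATE_BPS = 10e9    # 10 Gb/s link, as stated on the slide
MIN_PACKET_BYTES = 40   # assumption: minimum-size packets, not stated on the slide

def lookups_per_second(link_rate_bps, packet_bytes):
    """Packets (and therefore lookups) per second at full link utilization."""
    return link_rate_bps / (packet_bytes * 8)

def time_budget_ns(link_rate_bps, packet_bytes):
    """Time available per lookup, in nanoseconds."""
    return 1e9 / lookups_per_second(link_rate_bps, packet_bytes)

rate = lookups_per_second(LINK_RATE_BPS, MIN_PACKET_BYTES)
budget = time_budget_ns(LINK_RATE_BPS, MIN_PACKET_BYTES)
print(f"{rate / 1e6:.2f} million lookups/s, {budget:.0f} ns per packet")
```

Under that assumption the rate is 31.25 million lookups per second, which leaves about 32 ns per packet.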
8Dissertation Overview
- Fast Internet Protocol Lookup (FIPL) search
engine
- Scalable hardware implementation of a Longest
Prefix Matching algorithm
- Survey and taxonomy of packet classification
techniques
- Frame the body of work according to high-level
approach
- Analysis of the structure of real filter sets
- Identify opportunities for better search
performance
- ClassBench tool suite for packet classification
benchmarking
- Promote standardized performance evaluation
- Eliminate access barriers to realistic test
vectors
- Distributed Crossproducting of Field Labels
(DCFL)
- Leverage structure of real filter sets and
capabilities of current hardware
- Achieve comparable search performance to TCAM
- Scale to support large filter sets and filters
classifying on additional packet fields
9Fast IP Lookup (FIPL) Engine
- High-performance implementation of Eatherton
and Dittia's Tree Bitmap algorithm
- Compressed multi-bit trie requires 6 to 8 bytes
of memory per stored prefix
- Scalable architecture leverages memory
interleaving to allow multiple search engines to
share a memory interface
- Each FIPL engine consumes less than 1% of logic
resources in a commodity FPGA
- Evaluated performance using backbone route tables
and open research systems
- Robust lookup performance under update load
10ClassBench
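[Diagram: ClassBench tool flow. The Filter Set Analyzer reads a Seed Filter Set and produces a Set of Benchmark Parameter Files. The Filter Set Generator (inputs: size, smoothing, scope) produces a Synthetic Filter Set. The Trace Generator (inputs: scale, locality) produces an Input Header Trace.]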
- Filter Set Analyzer extracts relevant statistics
and probability distributions, generates
parameter file
- Guided by high-level models developed from filter
set analysis
- Parameter files provide complete anonymity of
addresses
- Filter Set Generator produces a synthetic filter
set retaining characteristics specified by input
parameter file
- High-level adjustments provide control over
filter set size and composition
- Trace Generator creates a sequence of packet
headers to exercise the input filter set
- High-level adjustments provide control over trace
size and locality of reference
11Packet Classification Taxonomy
[Diagram: taxonomy of packet classification techniques]
Categories: Exhaustive Search, Decomposition, Decision Tree, Tuple Space
Techniques shown: Linear Search, TCAM, E-TCAM, Parallel BV, ABV, Crossproducting, RFC, P2C, DCFL, Grid-of-Tries, FIS Trees, Modular P. Class, HiCuts, HyperCuts, EGT, Tuple Space, Pruned Tuple Space, Rectangle Search, Conflict-Free Rectangle Search
12Distributed Crossproducting of Field Labels
- Motivated by observed structure of real filter
sets
- Number of unique field values specified by
filters in the filter set is small relative to
the number of filters in the filter set
- Number of unique field values matched by any
packet is small and remains relatively constant
for filter sets of various sizes
- Leverage capabilities of current generation of
ASICs and FPGAs
- Hundreds of embedded multi-port memory blocks
(> 1MB total)
- Millions of logic gates and high clock speeds
(> 200 MHz)
- Transform multi-field searching problem into a
distributed set membership query (set
intersection)
- Parallel field-specific search engines
(decomposition)
- Aggregation pipeline allows a new search to start
on each pipeline cycle
- Scales to large filter sets and filters
classifying on additional packet fields
- Enabling technology for next-generation services
13Field Labeling
- Form sets of unique filter fields
- Label each unique filter field with a locally
unique label
- Count values support dynamic updates
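A minimal sketch of field labeling with count values, assuming a hypothetical two-field filter set (protocol, destination port). The class name `FieldLabelTable` and the integer labels are illustrative choices, not names from the dissertation. Each field has its own table, so labels are unique only within a field; the count records how many filters use a value, so a label can be retired when the last such filter is removed.

```python
class FieldLabelTable:
    """Maps each unique field value to a locally unique label with a reference count."""

    def __init__(self):
        self.labels = {}     # field value -> label
        self.counts = {}     # label -> number of filters specifying that value
        self.next_label = 0  # labels are not reused in this sketch

    def add(self, value):
        """Register one filter's use of value and return its label."""
        if value not in self.labels:
            self.labels[value] = self.next_label
            self.counts[self.next_label] = 0
            self.next_label += 1
        label = self.labels[value]
        self.counts[label] += 1
        return label

    def remove(self, value):
        """Drop one filter's use of value; retire the label when the count reaches zero."""
        label = self.labels[value]
        self.counts[label] -= 1
        if self.counts[label] == 0:
            del self.counts[label]
            del self.labels[value]
        return label

# Hypothetical filters: (protocol, destination port).
FILTERS = [("TCP", "80"), ("UDP", "53"), ("TCP", "53"), ("TCP", "80")]
tables = [FieldLabelTable(), FieldLabelTable()]
labeled = [tuple(t.add(v) for t, v in zip(tables, f)) for f in FILTERS]
```

Four filters yield only two labels per field here, which is the redundancy the labeling step is meant to exploit.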
14Field Combinations Meta-Labeling
Generalizes to any combination of d filter fields
15DCFL Preliminaries
- Partition the filters in the filter set into
fields
- Partition each packet header into corresponding
fields
- Let Fi be the set of unique field values for
filter field i that appear in one or more filters
in the filter set
- Let Fi(x) ⊆ Fi be the subset of filter field
values in Fi matched by a packet with the value x
in header field i
- Let Fi,j be the set of unique filter field value
pairs for fields i and j in the filter set, i.e.
if (u,v) ∈ Fi,j there is some filter or filters
in the set with u in field i and v in field j
- Let Fi,j(x,y) ⊆ Fi,j be the subset of filter
field value pairs in Fi,j matched by a packet
with the value x in header field i and y in
header field j
- This can be extended to higher-order
combinations, such as set Fi,j,k and subset
Fi,j,k(x,y,z), etc.
16DCFL Search
- In parallel, find subsets F1(w), F2(x), F3(y),
and F4(z)
- In parallel, find subsets F1,2(w,x) and F3,4(y,z)
as follows
- Let Fquery(w,x) be the set of possible field
value pairs formed from the crossproduct of F1(w)
and F2(x)
- For each field value pair in Fquery(w,x), query
for set membership in F1,2; if the field value
pair is in set F1,2, add it to set F1,2(w,x)
- Perform the symmetric operations to find subset
F3,4(y,z)
- Find subset F1,2,3,4(w,x,y,z) by querying set
F1,2,3,4 with the field value combinations formed
from the crossproduct of F1,2(w,x) and F3,4(y,z)
- Select the highest priority exclusive filter and
r highest priority non-exclusive filters in
F1,2,3,4(w,x,y,z)
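The search steps above can be sketched in a few lines, using the label data from the example slides in this deck (field order SA, DA, DP, PR). Two things are assumptions of the sketch: the per-field search results are taken as given rather than computed, and the final priority selection is omitted because the slides list no priorities. The helper names (`project`, `aggregate`, `dcfl_search`) are illustrative, and Python sets stand in for the hardware set membership structures.

```python
from itertools import product

# Filter set from the example slides, as labels in field order (SA, DA, DP, PR).
FILTERS = [
    ("a", "a", "a", "a"), ("b", "a", "b", "b"), ("c", "b", "b", "b"),
    ("d", "c", "a", "c"), ("e", "d", "a", "c"), ("f", "c", "c", "a"),
    ("g", "c", "c", "a"), ("g", "e", "a", "c"), ("a", "e", "a", "c"),
]

def project(filters, fields):
    """Set of unique label combinations for the given field indices."""
    return {tuple(f[i] for i in fields) for f in filters}

def aggregate(left, right, members):
    """Crossproduct two sets of label tuples; keep combinations found in members."""
    query = {l + r for l, r in product(left, right)}
    return {q for q in query if q in members}

def dcfl_search(filters, matches):
    """matches[i] is the set of labels in field i matched by the packet."""
    singles = [{(label,) for label in m} for m in matches]
    f12 = aggregate(singles[0], singles[1], project(filters, (0, 1)))
    f34 = aggregate(singles[2], singles[3], project(filters, (2, 3)))
    return aggregate(f12, f34, project(filters, (0, 1, 2, 3)))

# Field search results for the example packet, taken as given from the slides.
MATCHES = [{"a", "c", "d", "f"}, {"b", "c", "e"}, {"a", "b"}, {"b", "c"}]
result = dcfl_search(FILTERS, MATCHES)
```

For the example packet, three label combinations survive the final aggregation node: (a,e,a,c), (c,b,b,b), and (d,c,a,c).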
17Example DCFL Search
Protocol UDP
Source Address 1101 0100
Destination Address 1010 1101
Destination Port 3
18Example DCFL Search
Aggregation node sets:
FDP,PR = {(a,a), (b,b), (a,c), (c,a)}
FSA,DA = {(a,a), (b,a), (c,b), (d,c), (e,d), (f,c), (g,c), (g,e), (a,e)}
19Example DCFL Search
FDP,PR(3, UDP) = {(a,c), (b,b)}
FSA,DA(1101 0100, 1010 1101) = {(a,e), (c,b), (d,c), (f,c)}
Final aggregation node set:
FSA,DA,DP,PR = {(a,a,a,a), (b,a,b,b), (c,b,b,b), (d,c,a,c), (e,d,a,c), (f,c,c,a), (g,c,c,a), (g,e,a,c), (a,e,a,c)}
20Optimizing the Aggregation Network
- Pipeline of aggregation nodes ⇒ performance
bottleneck is node with largest query set size,
|Fquery|
- Query set size, |Fquery|, determines the number
of sequential memory accesses (SMA) performed at
the node
- Freedom to aggregate fields in any order allows
various network constructions
- |Fquery| varies with network construction
- Cost of aggregation network, Gi, is the largest
worst-case query set size for all nodes in the
aggregation network
- cost(Gi) = max |Fquery| ∀ F1, …, F1,…,d ∈ Gi
- Select the minimum cost aggregation network
- Gmin = G : cost(G) = min cost(Gi) ∀ i
- Gmin determined by the structure of the filter set
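A small sketch of the cost comparison, using the label data from the cost example slides in this deck (field order SA, DA, DP, PR). One simplification is an assumption of the sketch: it measures query set sizes for the single example packet, whereas the cost defined above is the worst case over all packets. The function name `network_cost` is illustrative.

```python
from itertools import product

# Filter set from the example slides, as labels in field order (SA, DA, DP, PR).
FILTERS = [
    ("a", "a", "a", "a"), ("b", "a", "b", "b"), ("c", "b", "b", "b"),
    ("d", "c", "a", "c"), ("e", "d", "a", "c"), ("f", "c", "c", "a"),
    ("g", "c", "c", "a"), ("g", "e", "a", "c"), ("a", "e", "a", "c"),
]
# Labels matched by the example packet in each field, taken from the slides.
MATCHES = [{"a", "c", "d", "f"}, {"b", "c", "e"}, {"a", "b"}, {"b", "c"}]

def network_cost(filters, matches, pairing):
    """Largest query set size across two first-level nodes and the final node."""
    sizes, partial = [], []
    for i, j in pairing:
        query = set(product(matches[i], matches[j]))
        members = {(f[i], f[j]) for f in filters}
        sizes.append(len(query))
        partial.append(query & members)
    # The final node queries the crossproduct of the two partial results.
    sizes.append(len(partial[0]) * len(partial[1]))
    return max(sizes)

cost_a = network_cost(FILTERS, MATCHES, [(0, 1), (2, 3)])  # (SA,DA) and (DP,PR)
cost_b = network_cost(FILTERS, MATCHES, [(0, 3), (1, 2)])  # (SA,PR) and (DA,DP)
```

For this packet the first construction has a bottleneck of 12 queries at the (SA, DA) node, while the second peaks at 9 queries at the final node, so the second would be preferred under this measure.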
21Aggregation Network Cost Example
FSA(1101 0100) = {a, c, d, f}
FDA(1010 1101) = {b, c, e}
FDP(3) = {a, b}
FPR(UDP) = {b, c}
Aggregation node (SA, DA):
FSA,DA = {(a,a), (b,a), (c,b), (d,c), (e,d), (f,c), (g,c), (g,e), (a,e)}
Fquery = {(a,b), (a,c), (a,e), (c,b), (c,c), (c,e), (d,b), (d,c), (d,e), (f,b), (f,c), (f,e)}
Result: {(a,e), (c,b), (d,c), (f,c)}
Aggregation node (DP, PR):
FDP,PR = {(a,a), (b,b), (a,c), (c,a)}
Fquery = {(a,b), (a,c), (b,b), (b,c)}
Result: {(a,c), (b,b)}
Aggregation node (SA, DA, DP, PR):
FSA,DA,DP,PR = {(a,a,a,a), (b,a,b,b), (c,b,b,b), (d,c,a,c), (e,d,a,c), (f,c,c,a), (g,c,c,a), (g,e,a,c), (a,e,a,c)}
Fquery = {(a,e,a,c), (a,e,b,b), (c,b,a,c), (c,b,b,b), (d,c,a,c), (d,c,b,b), (f,c,a,c), (f,c,b,b)}
22Aggregation Network Cost Example
FSA(1101 0100) = {a, c, d, f}
FDA(1010 1101) = {b, c, e}
FDP(3) = {a, b}
FPR(UDP) = {b, c}
Aggregation node (SA, PR):
FSA,PR = {(a,a), (d,c), (g,a), (b,b), (e,c), (g,c), (c,b), (f,a), (a,c)}
Fquery = {(a,b), (a,c), (c,b), (c,c), (d,b), (d,c), (f,b), (f,c)}
Result: {(a,c), (c,b), (d,c)}
Aggregation node (DA, DP):
FDA,DP = {(a,a), (a,b), (b,b), (c,a), (d,a), (c,c), (e,a)}
Fquery = {(b,a), (b,b), (c,a), (c,b), (e,a), (e,b)}
Result: {(b,b), (c,a), (e,a)}
Aggregation node (SA, DA, DP, PR):
FSA,DA,DP,PR = {(a,a,a,a), (b,a,b,b), (c,b,b,b), (d,c,a,c), (e,d,a,c), (f,c,c,a), (g,c,c,a), (g,e,a,c), (a,e,a,c)}
Fquery = {(a,b,b,c), (a,c,a,c), (a,e,a,c), (c,b,b,b), (c,c,a,b), (c,e,a,b), (d,b,b,c), (d,c,a,c), (d,e,a,c)}
23DCFL Optimizations
- Field Splitting
- Partition filter fields in order to limit the
maximum number of matching field labels for each
packet field
- Limit specified by a threshold, t
- Address prefixes: O(N log W) algorithm finds
sub-prefix lengths to limit prefix nesting in
each sub-tree to t
- Port ranges: O(N log N) algorithm sorts port
ranges into subsets to limit range overlap in
each subset to t
- Each split adds an aggregation node to the
network
- Set membership data structures
- Minimize the number of sequential memory accesses
(SMA) per query
- Explored three types of data structures: Bloom
Filter Arrays, Field Label Indexing, Meta-Label
Indexing
- Each aggregation node may employ the data
structure that minimizes the worst-case SMA
24Real Filter Sets Results
- 12 real filter sets collected from ISPs,
researchers, and a network equipment vendor
- Size range: 68 to 4557 filters
- Optimized ASIC implementation could achieve 100M
searches/sec with storage for over 200k filters
- Assuming 500MHz, dual-port embedded memory,
288-bit word
25Field Splitting Results
- Primary benefit: achieve higher performance with
better memory efficiency
- Reduces required memory word size for a given
performance target
- Point of diminishing returns for small thresholds
and large word sizes
26ClassBench Results
- Generated large synthetic filter sets using
parameter files from real filter sets
- SMA performance is less sensitive to memory word
size
- Achieve better memory efficiency due to smaller
required word size
- Provides for easier implementation/management as
tuning W is less critical
27ClassBench Results
- Generated synthetic filter sets (16k filters) to
examine scalability to additional filter fields
- Half of all filters specifying TCP or UDP specify
a non-wildcard extra field
- Negligible effect on SMA performance
- Each additional field increases memory
requirements by 10B/filter
28Contributions
- Design, hardware implementation, and evaluation
of a high-performance Longest Prefix Matching
(LPM) search engine
- Made VHDL code for Fast Internet Protocol Lookup
(FIPL) search engine publicly available
- Integrated FIPL into the Packet Processor of the
Network Services Platform
- Survey and taxonomy of packet classification
techniques
- Identified opportunities for contributions
- Analysis of real filter sets
- Provided insight into the impetuses governing
filter composition
- Identified various properties that can be
leveraged for faster searches
- ClassBench packet classification benchmarking
tool suite
- Eliminated significant access barrier to
realistic test vectors for the research community
- Tools and parameter files are publicly available
and in use by several researchers
- Distributed Crossproducting of Field Labels
(DCFL)
- Novel packet classification algorithm that scales
to support larger filter sets and filters
classifying on additional packet fields
29Future Directions
- Promotion and refinement of ClassBench
- Develop formal benchmarking methodology
- Consensus building and standardizing efforts
could be taken up by the Internet Engineering
Task Force (IETF)
- Further optimization of DCFL
- New set membership (set intersection) data
structures for aggregation nodes
- Hardware implementation of DCFL
- Architecture for dynamic aggregation network
restructuring
- Application of DCFL to deep packet inspection
and other hybrid searching problems
- Longest Prefix Matching, Range Matching,
String Matching
30Acknowledgements
- Advisors and committee
- Jon Turner (research advisor)
- William D. Richard (academic advisor)
- Dan Fuhrmann
- John Lockwood
- Fred Rosenberger
- FIPL contributors
- Will Eatherton and Zubin Dittia (Tree Bitmap)
- John DeHart (testing)
- Tucker Williams, Ed Spitznagel, Todd Sproull
(software)
- ClassBench testers
- Ed Spitznagel and Sailesh Kumar
- ARL faculty and staff
- CSE faculty and staff
- IBM Zurich Research Lab
- Marcel Waldvogel
- Andreas Herkersdorf
- Fellow ARL students
- Lunchtime debate club
- Parents
- Friends
- Sara Taylor (lovely wife)
31Thank you. Questions?
- Only the curious will learn and only the resolute
overcome the obstacle to learning. The quest
quotient has always excited me more than the
intelligence quotient.
- Eugene S. Wilson, Dean of Admissions, Amherst
32Supplementary Slides
33Internet Architecture
34Reducing Query Set Size
- Apply Field Splitting to reduce field overlap
(field label set size) to a given threshold, t
- Prefixes → choose sub-prefix lengths such that
prefix nesting remains below threshold
- Port ranges → distribute ranges into minimum
number of bins such that range overlap remains
below threshold
- Restrict aggregation network construction by
requiring that each aggregation node operate on
at least one field label set
- Provides control over query set size at
each aggregation node
- Utilize delay buffers for pipelined implementation
35DCFL Architecture w/ Field Splitting
36Bloom Filter Array Aggregation Node
- Bloom filters use an m-bit vector to efficiently
represent a set
- Each element sets k bits in the vector
- False positive probability is tunable
- Bloom Filter Array utilizes a pre-filter hash
function to distribute elements over W Bloom
filters
- Minimizes SMA per query
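A minimal software sketch of the two ideas on this slide, under stated assumptions: SHA-256 stands in for the hardware hash functions, a Python list stands in for the m-bit vector, and the class names are illustrative. How the W filters map onto memory words in the actual design is not shown on the slide, so the array below only demonstrates the pre-filter hash selecting one of W filters.

```python
import hashlib

def _hash(item, salt, modulus):
    """Deterministic hash of item under a salt, reduced to range(modulus)."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m  # the m-bit vector

    def add(self, item):
        # Each element sets k bits in the vector.
        for i in range(self.k):
            self.bits[_hash(item, i, self.m)] = True

    def __contains__(self, item):
        # All k bits set: item is probably present (false positives are possible).
        return all(self.bits[_hash(item, i, self.m)] for i in range(self.k))

class BloomFilterArray:
    def __init__(self, w, m, k):
        self.filters = [BloomFilter(m, k) for _ in range(w)]

    def _select(self, item):
        # Pre-filter hash picks which of the W Bloom filters holds the item.
        return self.filters[_hash(item, "pre", len(self.filters))]

    def add(self, item):
        self._select(item).add(item)

    def __contains__(self, item):
        return item in self._select(item)

array = BloomFilterArray(w=4, m=64, k=3)
STORED = [("a", "e"), ("c", "b"), ("d", "c"), ("f", "c")]
for pair in STORED:
    array.add(pair)
```

Every stored pair is reported as present; a pair that was never stored is usually rejected but may occasionally pass, and the standard approximation for that false positive rate with n stored elements is (1 - e^(-kn/m))^k.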
37Meta-Label Indexing Aggregation Node
- Field value combinations in F1,…,i can be
identified by combination of meta-label for
fields (1,…,i-1) and the field label for field i
- Sort label combinations into bins using
meta-label
- For each bin, construct list of field labels and
new meta-label
- Store lists in array Ai indexed by meta-label
- Multi-way match logic compares N label pairs per
memory access
- Length() is an array storing the lengths of the
lists in Ai in decreasing order
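The indexing scheme can be sketched as follows. This is an illustration under assumptions, not the dissertation's implementation: a dictionary stands in for the array Ai, new meta-labels are assigned in sorted order, the Length() array and the N-way match logic are omitted, and the sample combinations are hypothetical.

```python
def build_index(combos):
    """Bin (meta_label, field_label) pairs by meta-label.

    Returns a mapping meta_label -> list of (field_label, new_meta_label).
    """
    index = {}
    for new_meta, (meta, label) in enumerate(sorted(set(combos))):
        index.setdefault(meta, []).append((label, new_meta))
    return index

def query(index, meta_labels, field_labels):
    """New meta-labels for incoming meta-labels combined with matching field labels."""
    out = []
    for meta in meta_labels:
        # One bin lookup per incoming meta-label, then compare the bin's field labels.
        for label, new_meta in index.get(meta, []):
            if label in field_labels:
                out.append(new_meta)
    return sorted(out)

# Hypothetical combinations of a meta-label for fields (1,...,i-1) and a label for field i.
COMBOS = [(0, "a"), (0, "b"), (1, "a"), (2, "c")]
index = build_index(COMBOS)
found = query(index, meta_labels=[0, 2], field_labels={"a", "c"})
```

Only the bins for the incoming meta-labels are read, so the work per query depends on the lengths of those lists rather than on the full crossproduct.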
38DCFL Implementation Architecture
39Related Work
- Ternary Content Addressable Memory (TCAM)
- 100 to 200 million searches per second
- Requires range to prefix conversion
- For IP 5-tuple, each filter may require 900
entries (typically 2 to 7)
- Consumes 3 µW per bit (150x more than SRAM)
- Extended TCAM (E-TCAM) and Partitioned Encoded
Search of TCAM (PEST) utilize partitioning
algorithms to reduce power consumption by over
95% (Spitznagel, Taylor, Turner)
- Does not address scalability to classify on
additional packet fields
- Recursive Flow Classification (RFC) (Gupta,
McKeown)
- Performs independent searches on chunks of the
packet header
- Performs a multi-stage aggregation utilizing
equivalence classes
- HyperCuts (Singh, Baboescu, Varghese, Wang)
- Builds a decision tree by partitioning the filter
set
- Utilizes uniform partitions and indexing to allow
each decision tree node to make partitions in
multiple dimensions
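The figure of 900 TCAM entries per filter is consistent with the worst case of range to prefix conversion for two 16-bit port fields. The sketch below uses the standard greedy range-splitting method (not code from the dissertation) to show that a range such as [1, 65534] needs 30 prefixes, so two such ranges in one filter need 30 × 30 = 900 entries. The function name and the (value, prefix length) output format are illustrative choices.

```python
def range_to_prefixes(lo, hi, width=16):
    """Split the inclusive range [lo, hi] into (value, prefix_length) prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo (the whole space when lo is 0).
        size = lo & -lo if lo else 1 << width
        # Shrink the block until it no longer runs past hi.
        while size > hi - lo + 1:
            size >>= 1
        prefixes.append((lo, width - size.bit_length() + 1))
        lo += size
    return prefixes

worst_case = range_to_prefixes(1, 65534)
entries_for_two_ports = len(worst_case) ** 2
common_case = range_to_prefixes(1024, 65535)
```

The common range [1024, 65535] needs only 6 prefixes, which suggests why the expansion seen in practice is far below the worst case.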