Title: Line Rate Packet Classification and Scheduling
1. Line Rate Packet Classification and Scheduling
- Michael Kounavis (Intel)
- Alok Kumar (Intel)
- Raj Yavatkar (Intel)
- Harrick Vin (U. Texas, Austin)
- October 26, 2005
2. Packet Classification
3. Tutorial Summary
- PART I
- Understanding the problem
- PART II
- State-of-the-art
- PART III
- Observations on real world classifiers
- PART IV
- Two stage packet classification using Most
Specific Filter Matching and Transport Level
Sharing
4. PART I: Understanding the Problem
5. Problem Statement
- Packet classifiers
- Lists of rules
- Rules are <priority, predicate, action> triplets
- Single Match Problem
- Find the highest-priority rule that matches a packet
- Multiple Match Problem
- Find all rules that match a packet
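Both match problems can be stated concretely with a toy linear-scan classifier. This is a minimal sketch for illustration only (the field names and range-based predicates are our assumptions), not one of the line-rate schemes this tutorial surveys:

```python
# Rules as <priority, predicate, action> triplets; lower priority number = higher priority.

def matches(predicate, packet):
    """A predicate maps field names to (lo, hi) ranges; missing fields are wildcards."""
    return all(lo <= packet[f] <= hi for f, (lo, hi) in predicate.items())

def single_match(rules, packet):
    """Single Match Problem: the highest-priority matching rule, or None."""
    best = None
    for prio, pred, action in rules:
        if matches(pred, packet) and (best is None or prio < best[0]):
            best = (prio, pred, action)
    return best

def multiple_match(rules, packet):
    """Multiple Match Problem: every rule that matches the packet."""
    return [r for r in rules if matches(r[1], packet)]

rules = [
    (1, {"dst_port": (1040, 1070)}, "PERMIT"),
    (2, {"dst_port": (1000, 2000)}, "DENY"),
]
pkt = {"dst_port": 1050}
```

A linear scan is O(n) per packet; the rest of the tutorial is about doing better.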
6. A Rule Database
[Figure: a table of rules. Each row pairs a priority level with a predicate over the Src. IP, Dst. IP, Src. Port, and Dst. Port fields, and an action. Example rows: priority 1, 147.101.*, ports 1040-1070, PERMIT; priority 2, 128.151.*, ports 2110-2150, DENY; priority 3, 132.* / 153.*, ftp, DENY.]
7. A Rule
[Figure: a single rule. Predicate fields: Src. IP 147.101.*, ports 2140-2170, Protocol TCP; Action: PERMIT.]
8. An IP Prefix
A range of values in a single dimension
[Figure: the prefix 128.67.* shown as the range 128.67.0.0-128.67.255.255 on the Src. IP address axis, which starts at 0.0.0.0.]
9. An Arbitrary Range
A range of values in a single dimension
[Figure: the range 2140-3140 on the Dst. Port axis, which starts at 0.]
10. An Exact Value
A specific number
[Figure: exact values 0, 1, 6, 17 marked on the Protocol field axis.]
11. A Source-Destination IP Prefix Pair
[Figure: in the (Src. IP, Dst. IP) plane, the pair (128.67.*, 132.59.*) is a rectangle, (128.67.208.1, 132.59.64.10) is a point, and (128.67.*, 132.59.64.10) is a line segment.]
12. Relationship Between IP Prefix Pairs
[Figure: rectangles in the (Src. IP, Dst. IP) plane. Source prefixes 128.67.* and 128.67.32.*; destination prefixes 121.45.5.*, 128.44.32.*, 145.39.* (containing 145.39.3.*), and 167.7.* (containing 167.7.4.*). Nested prefixes yield nested rectangles.]
13. Partial Overlaps: IP Prefix Pairs vs. Arbitrary Ranges
- Partially overlapping IP prefix pairs always form the shape of a cross
- Partially overlapping pairs of arbitrary ranges may form any shape
14. Packet Classification as a Point Location Problem
[Figure: Rules 1-5 drawn as overlapping rectangles in a two-dimensional space; classifying a packet amounts to locating its point among them.]
15. PART II: State-of-the-Art
16. Packet Classification: An Open Problem
A timeline of proposed schemes (years shown: 1987, 1994, 1998, 1999, 2000, 2003, 2004):
- Mogul et al.: Packet Filter Concept
- Chazelle et al.: Point Location Among Hyperplanes
- Lakshman and Stiliadis: Bit Vector
- Srinivasan, Suri, Varghese: Grid of Tries, Cross Producting
- Srinivasan, Suri, Varghese: Tuple-Space Search
- Gupta, McKeown: Recursive Flow Classification
- Gupta, McKeown: HiCuts
- Baboescu et al.: Aggregate Bit Vector
- Baboescu et al.: Extended Grid of Tries
- Singh et al.: HyperCuts
- Taylor, Turner: Distributed Cross Producting
- Kounavis et al.: Most Specific Filter Matching
17. Multi-dimensional Tries
Rule Database:
- Rule 1: (*, 1)
- Rule 2: (0, 0)
- Rule 3: (1, 00)
- Rule 4: (1, 0)
Example packet: (1101, 0011)
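A toy sketch of the two-dimensional trie lookup for this rule database may help. It is our illustration, not the tutorial's code: prefixes are bit strings, `''` stands for `*`, and the helper names are invented. It walks the source trie, and at every matching source prefix searches that node's destination trie, letting more specific matches win (real schemes such as Grid of Tries add switch pointers and rule priorities on top of this idea):

```python
def make_trie(pairs):
    """Source trie of nested dicts; a node's 'dst' key holds its destination trie,
    and a destination node's 'rule' key holds the stored rule. The special keys
    'dst' and 'rule' cannot collide with the bit keys '0' and '1'."""
    root = {}
    for (src, dst), rule in pairs:
        node = root
        for b in src:
            node = node.setdefault(b, {})
        d = node.setdefault('dst', {})
        for b in dst:
            d = d.setdefault(b, {})
        d['rule'] = rule
    return root

def lookup(root, src_bits, dst_bits):
    """Search the destination trie of every matching source prefix."""
    best = None
    node, i = root, 0
    while node is not None:
        d, j = node.get('dst'), 0
        while d is not None:
            if 'rule' in d:
                best = d['rule']   # deeper (more specific) matches overwrite shallower ones
            d = d.get(dst_bits[j]) if j < len(dst_bits) else None
            j += 1
        node = node.get(src_bits[i]) if i < len(src_bits) else None
        i += 1
    return best

rules = [(("", "1"), "Rule 1"), (("0", "0"), "Rule 2"),
         (("1", "00"), "Rule 3"), (("1", "0"), "Rule 4")]
trie = make_trie(rules)
```

For the packet (1101, 0011), the search visits the destination tries of `*` and `1` and returns Rule 3, the most specific match.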
18. Grid of Tries
Rule Database:
- Rule 1: (*, 1)
- Rule 2: (0, 0)
- Rule 3: (1, 00)
- Rule 4: (1, 0)
Example packet: (1101, 0011)
19. Bit Vector Schemes
20. Cross Producting
21. Tuple-Space Search
Key idea: the number of distinct combinations of prefix lengths (called tuples) is small
Rule Database:
- Rule 1: (*, 1)
- Rule 2: (0, 0)
- Rule 3: (1, 0)
- Rule 4: (1, 00)
Tuple Space (source prefix length, destination prefix length):
- Tuple 1 (0, 1): Rule 1
- Tuple 2 (1, 1): Rules 2 and 3
- Tuple 3 (1, 2): Rule 4
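The tuple-space idea can be sketched in a few lines. This is a hedged illustration (function names and the bit-string encoding are ours): rules are grouped by their prefix-length tuple, and each tuple is probed with one exact-match lookup on the packet's corresponding prefix bits:

```python
from collections import defaultdict

def build_tuple_space(rules):
    """rules: list of ((src_prefix, dst_prefix), name) with prefixes as bit strings."""
    space = defaultdict(dict)
    for (src, dst), name in rules:
        # the tuple is the pair of prefix lengths; each tuple gets a hash table
        space[(len(src), len(dst))][(src, dst)] = name
    return space

def classify(space, src_bits, dst_bits):
    """One exact-match probe per tuple; returns every matching rule."""
    found = []
    for (ls, ld), table in space.items():
        key = (src_bits[:ls], dst_bits[:ld])
        if key in table:
            found.append(table[key])
    return found

rules = [(("", "1"), "Rule 1"), (("0", "0"), "Rule 2"),
         (("1", "0"), "Rule 3"), (("1", "00"), "Rule 4")]
space = build_tuple_space(rules)
```

The number of probes equals the number of tuples, not the number of rules, which is what makes the scheme attractive when few tuples exist.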
22. Recursive Flow Classification
[Figure: packet header fields index a first stage of lookup tables producing indices 1-4; these are combined into indices 5 and 6, which a final stage maps to an action.]
23. HiCuts and HyperCuts
[Figure: a decision tree whose leaves hold small rule lists (Rule 1, Rule 2, ..., Rule m).]
- HiCuts: cuts one dimension at a time
- HyperCuts: cuts multiple dimensions at a time
24. TCAM
[Figure: field values from the packet header are presented to a TCAM memory array; each entry outputs a match bit (0/1); a priority encoder selects the memory location of the highest-priority match, which indexes a RAM-based action memory.]
25. Comparison

Algorithm | Worst-Case Lookup Time | Worst-Case Storage
Multi-dimensional Tries | w^d | ndw
Grid of Tries | w^(d-1) | ndw
Bit Vector | dw + dn/a | dn^2/a
Tuple-Space Search | n | n
Cross Producting | dw | n^d
Recursive Flow Classification | d | n^d
HiCuts | d | n^d
TCAM | 1 | n

(n: number of rules, d: number of fields, w: field size)
26. PART III: Observations on Real-World Classifiers
27. What We Observed
- IP prefix pairs
- create partial overlaps which are significantly fewer than the theoretical worst case
- Transport level fields
- form sets which are shared by many different source-destination IP prefix pairs
- sets usually contain a small number of entries
28. Toward Two-Stage Packet Classification
29. Observations on IP Prefix Pairs
- IP prefix pairs are of 2 types
- partially-specified filters (i.e., (*, X) or (Y, *))
- fully-specified filters
- Partially-specified filters are a small fraction (< 25%) of IP prefix pairs
- Most fully-specified filters (> 80%) are represented by
- segments of straight lines
- points
30. IP Prefix Pair Overlaps
[Figure: partially-specified filters of the form (Y, *) and (*, X), the wildcard filter (*, *), and clusters 1 through m of fully-specified filters; the theoretical worst case is n^2/4 overlaps.]
- What is the realistic amount of overlaps?
31. Visualizing ACLs with Our FilterViewer Tool
32. Partial Filter Overlaps in the Realistic Filter Structure

Breakdown of overlaps:
ACL | observed overlaps | theoretical worst case | % partially-specified only | % fully-specified only | % between partially- and fully-specified
ACL1 | 4 | 90,525 | 100 | 0 | 0
ACL2 | 2,249 | 138,601 | 45 | 4 | 51
ACL3 | 6,138 | 1,260,078 | 88 | 1 | 11
ACL4 | 852 | 12,246 | 100 | 0 | 0
33. Why So Few Filter Overlaps?
- Partially-specified filters represent a small fraction of the total number of filters in databases
- Fully-specified filters create an insignificant amount of overlaps
- There is a bounded number of important servers per IP address domain
34. Observations on Transport Level Fields
35. Transport Level Sharing

ACL | number of rules | unique sets of transport level fields | entries in unique sets of transport level fields
ACL1 | 754 | 102 | 316
ACL2 | 607 | 35 | 68
ACL3 | 2,399 | 186 | 437
ACL4 | 157 | 8 | 47
36. Implications
- Classification can be split into 2 stages
- Stage 1: IP address fields
- Stage 2: transport level fields
- A design that returns the smallest filter intersection is viable
- since the amount of overlaps between IP prefix pairs is small
- Searching through transport level fields can be accelerated by hardware
37. PART IV: Two-Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing
38. Cross Producting Revisited
[Figure: source IP prefixes 125.12.12.*, 128.67.*, and 128.67.32.* and destination IP prefixes 121.45.5.*, 132.59.*, and 132.59.10.* drawn as ranges on the source and destination IP address axes.]
Cross Producting may return a non-existent filter. To address this issue, Cross Producting adds all possible filters returned from LPM searches into its lookup table.
39. Definition of a Cross Product
- A filter with
- a source IP prefix equal to the prefix of a filter F1 from a database, and
- a destination IP prefix equal to the prefix of another filter F2 from the same database
- F2 is not necessarily equal to F1
40. Improving Cross Producting
- Cross Producting is fast, but
- Memory explosion
- For ACL3 there are 431 src. IP prefixes and 516 dst. IP prefixes, hence 222,396 cross products
- Solution
- We can remove 70-80% of the cross products with little penalty to the performance of the classifier
- Not covered cross products
- Partially covered cross products
41. Motivating Example
[Figure: source and destination IP address axes (0 up to 255.255.255.255) with prefixes 125.12.12.*, 128.67.*, 128.67.32.*, 121.45.5.*, 132.59.*, 132.59.10.*, and 147.101.10.*, defining regions A (*, *), B, C, D, E and cross products R1-R7.]
Do we need to store all cross products R1-R7?
42. Not Covered Cross Products
Not covered cross products are only covered by (*, *). Hence they can be removed from the lookup table. If no match is found, the algorithm returns (*, *).
43. Partially Covered Cross Products
[Figure: source prefixes 125.12.12.* and 128.67.32.* and destination prefixes 121.45.5.* and 132.59.10.* on the source and destination IP address axes; the cross products (125.12.12.*, 132.59.10.*), (125.12.12.*, 121.45.5.*), and (128.67.32.*, 121.45.5.*) are marked, with (125.12.12.*, 121.45.5.*) labeled as a partially covered cross product, covered only by the partially-specified filter (125.12.12.*, *).]
Partially covered cross products are only covered by filters of the form (X, *) or (*, Y). Hence they can be removed from the lookup table. If no match is found, the algorithm checks a database of partially-specified filters. If no match is found again, the algorithm returns (*, *).
44. Fully Covered Cross Products
- Those cross products which are neither not covered nor partially covered
- fully-specified filters
- filter intersections that are fully specified
- filters which are
- formed by combining the source and destination IP prefixes of different IP prefix pairs, and
- contained in fully-specified filters or fully-specified filter intersections
- the latter are called indicator filters
45. Most Specific Filter Matching
[Figure: an LPM lookup on the source IP address and an LPM lookup on the destination IP address each return an index; the index pair selects an entry in a primary table of fully covered cross products; on a miss, secondary tables of filters of the form (X, *) and (*, Y) are consulted.]
46. Back to the Example
Cross products R1 and R3-R7 can be removed from the primary table.
47. Transport Level Sharing
- Rules that specify the same source-destination IP prefix pair are consecutive
- These rules share sets of transport level fields
- Most specific filter matching returns a list of pointers to shared sets of transport level fields
- Packet classification in the transport level dimensions is done in hardware
48. Hardware Acceleration
- In our approach we do not need to expand a range into prefixes
- We use a pair of comparators
- We compare a key with upper and lower bounds
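The comparator pair reduces a range match to two comparisons. A trivial sketch of what each (lower, upper) comparator pair computes, using the example range from earlier slides:

```python
def range_match(key, lo, hi):
    # what a (lower-bound, upper-bound) comparator pair computes in hardware:
    # two comparisons, with no expansion of the range into prefixes
    return lo <= key <= hi

# the arbitrary port range 1040-1070 needs no prefix expansion:
assert range_match(1050, 1040, 1070)
assert not range_match(1071, 1040, 1070)
```

Avoiding prefix expansion matters because an arbitrary w-bit range can expand into as many as 2w - 2 prefixes.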
49. Performance
- Lookup time
- Small and predictable number of steps, independent of the number of rules
- 11 memory accesses
- Memory space
- Reasonable: 19-446 KB (for ACLs 1-4, without HW acceleration)
- Memory access BW
- 64 words/access (without HW acceleration)
- 4 words/access (with HW acceleration)
- Update time
- Approximately 197,000 memory accesses
50. Tutorial Summary and Conclusion
51. Tutorial Summary (I)
- Packet Classification
- A complex, open problem
- State of the art
- Existing schemes trade off lookup time against memory requirements
52. Tutorial Summary (II)
- Observations on real-world classifiers
- Few partial overlaps between IP prefix pairs
- Shared sets of transport level fields
- Proposed a new scheme
- Exploits classifier properties
- Predictable lookup time with reasonable memory requirement
- Requires HW acceleration
53. Future Work
- Verify the scheme using more data sets
- Simplify the update process
- Apply other fast solutions to the first stage
- HyperCuts
54. Packet Scheduling
55. Tutorial Summary
- PART I
- Understanding the problem
- PART II
- State-of-the-art
- PART III
- Sorting packets by packet schedulers using the Connected Trie data structure
- PART IV
- Building a four-level, OC-48, programmable hierarchical packet scheduler
56. PART I: Understanding the Problem
57. The Concept of QoS
- Packet networks
- Usually provide best-effort services
- Can we make packet networks capable of delivering continuous media?
- Key concept: make packet networks flow-aware
- Mechanisms
- Scheduling
- Shaping
- Resource reservation
- Admission control
58. Packet Scheduling
[Figure: sessions 1 through N feed a traffic shaper and a scheduler; bursty traffic is shaped, head-of-line packets are selected for transmission, and delay jitter is controlled.]
59. Generalized Processor Sharing (GPS)
- Ideal scheduling discipline
- Visits each nonempty queue and serves an infinitesimally small amount of data
- Supports exact max-min fair share allocation
- Resources are allocated in order of increasing demand
- No source gets a resource share larger than its demand
- Sources with unsatisfied demands get an equal share of the resource
60. Weighted Fair Queuing (WFQ)
- Key idea: if you can't implement GPS, simulate it on the side
- The algorithm
- Tag packets with numbers denoting the order of completion of service according to the simulated GPS discipline
- Transmit the packets in ascending order of their tags
61. Time Stamp Calculation
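This slide's formula did not survive the conversion; the standard WFQ time-stamp (virtual finish time) calculation, which is presumably what was shown, is:

```latex
\[
F_i^k \;=\; \max\!\left(F_i^{k-1},\, V(a_i^k)\right) \;+\; \frac{L_i^k}{w_i}
\]
```

Here F_i^k is the tag of the k-th packet of session i, V(t) is the round number (virtual time) of the simulated GPS service at the packet's arrival time a_i^k, L_i^k is the packet length, and w_i is the session's weight. Packets are transmitted in ascending order of F_i^k.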
62. Example
63. Relative Fairness Bound
Relative Fairness Bound = MAX | (service received by connection A during an interval) / (rate allocated to A) - (service received by connection B during this interval) / (rate allocated to B) |
where A and B are backlogged connections
64. Absolute Fairness Bound
Absolute Fairness Bound = MAX | (service received by connection A during an interval) / (rate allocated to A) - (service received by connection A during this interval if serviced by GPS) / (rate allocated to A) |
where A is a backlogged connection
65. Hierarchical Packet Scheduling
- Single-level scheduling
- Transmission order does not depend on future arrivals
- Hierarchical scheduling
- Transmission order depends on future arrivals
- To implement hierarchical GPS we need to build a hierarchy of single-level fair queuing disciplines
66. Schedulable Region
- Set of all possible combinations of performance bounds a scheduler can simultaneously meet
[Figure: a region drawn over the performance bounds of Class I, Class II, and Class III traffic.]
67. PART II: State-of-the-Art
68. Tagging and Sorting Schemes
A timeline of proposed schemes (years shown: 1989, 1991, 1993, 1994, 1995, 1996, 2003, 2004):
- Demers, Keshav, Shenker: WFQ
- Parekh, Gallager: Generalized Processor Sharing
- Lazar, Hyman, Pacifici: Schedulable Region
- Lazar, Hyman, Pacifici: MARS
- Golestani: Self-Clocked Fair Queuing
- Shreedhar, Varghese: Deficit Round Robin
- Goyal, Vin, Cheng: Start Time Fair Queuing
- Bennett, Stephens, Zhang: CMU sorting scheme
- Rexford, Bonomi, Greenberg: AT&T sorting scheme
- Ramabhadran, Pasquale: Stratified Round Robin
- Valente: Exact GPS simulation with logarithmic complexity
- Kounavis, Kumar, Yavatkar: Connected Trie data structure
69. Self-Clocked Fair Queuing (SCFQ)
Same as WFQ, apart from:
- the round number of the simulated GPS service is approximated by the finish time of the packet currently in service
Main disadvantage: large end-to-end delay
70. Start Time Fair Queuing (SFQ)
Same as WFQ, apart from:
- the round number of the simulated GPS service is approximated by the start time of the packet currently in service
Transmission order: ascending order of start times
Same end-to-end delay as WFQ
71. WF²Q
Same as WFQ, apart from the transmission order:
- select the packet with the minimum tag from among those that have already started service in the corresponding GPS simulation
Smaller Absolute Fairness Bound
72. Round Robin Scheduling
- Deficit Round Robin
- Each connection has a deficit counter
- Every round, the deficit counter is incremented by a quantum
- If the packet size < counter, then the packet is transmitted and the counter is reduced by the packet size
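The Deficit Round Robin steps above can be sketched directly. This is a minimal illustration with invented names (packet sizes as integers, one deque per connection), not production scheduler code:

```python
from collections import deque

def drr_round(queues, deficits, quantum, transmit):
    """One DRR round over a dict of per-connection packet-size deques."""
    for conn, q in queues.items():
        if not q:
            continue
        deficits[conn] += quantum          # each round adds a quantum
        while q and q[0] <= deficits[conn]:
            size = q.popleft()             # head-of-line packet fits: send it
            deficits[conn] -= size         # and pay for it from the deficit
            transmit(conn, size)

queues = {"A": deque([300, 300]), "B": deque([700])}
deficits = {"A": 0, "B": 0}
sent = []
drr_round(queues, deficits, 500, lambda c, s: sent.append((c, s)))
# round 1: A sends one 300-byte packet (deficit 500 -> 200);
# B's 700-byte packet must wait with deficit 500
```

A second round gives B a deficit of 1000, so its 700-byte packet is then transmitted; over time each connection's throughput tracks its quantum.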
73. Exact GPS Simulation with Logarithmic Complexity
- L-GPS simulates GPS with a minimum deviation of one packet size at O(log N) complexity
- All other well-known schedulers that accomplish the same deviation (e.g., WF²Q) have O(N) complexity
- Key idea
- L-GPS pre-computes the evolution of the round number function of the simulated GPS service using a tree structure
74. Some Sorting Data Structures
- Heaps
- Binomial heaps
- Calendar queues
- Van Emde Boas trees
- Trees of comparators
- CMU sorting scheme
- AT&T sorting scheme
- Polytechnic Institute sorting scheme
75. The Tree of Comparators
- Divide packets into groups
- Send each group to a stage of comparators
- Obtain a minimum from each comparator
- Pass the minima into a second stage of comparators
- Repeat the process until one packet remains
76. The Sorting Scheme from AT&T
[Figure: per-connection FIFOs (FIFO 1 through FIFO k) feed sorting bins (Bin 1 through Bin m) that partition the scheduling horizon.]
77. The Sorting Scheme from Polytechnic Inst. Brooklyn
- Sorting over the range of time stamp values is supported by a hierarchy of bit vectors
78. PART III: Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure
79. Contribution
- We propose a sorting algorithm and data structure that reduces the latency of making scheduling decisions to a single memory access time
- The solution is applicable to SCFQ and SFQ
- Key observation
- Increments on packet time stamps are bounded by (maximum packet size) / (minimum weight)
- This bound is called the scheduling horizon
- Approach
- We represent the scheduling horizon as a trie
- We put state into the nodes of the trie to allow the leaves to be connected into a linked list
80. Trie-based Ordering
[Figure: a trie structure of height h = log(scheduling horizon / region width).]
81. Van Emde Boas Trees and the Connected Trie
[Figure: comparison of traversals; the Connected Trie achieves the optimal traversal, a Van Emde Boas tree uses binary traversal, and a plain trie uses linear traversal.]
82. Main Concepts
- Each node stores
- a pointer to the rightmost leaf, with the highest value from among those found by traversing the left child of the node
- a pointer to the leftmost leaf, with the lowest value from among those found by traversing the right child of the node
- When a new packet is added into the trie
- the algorithm discovers the rightmost and leftmost leaves the new packet should be connected to
- the new packet is inserted into a linked list of leaves
- Hence the next packet for transmission into the network is found in a single memory access time
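A simplified sketch of these concepts may help before the worked example that follows. This is our illustration under stated assumptions (distinct integer keys, no deletion or trie cleanup), not the paper's implementation; it reproduces the running example of inserting 13, 5, 10, 8 into a trie of height 4:

```python
class Leaf:
    def __init__(self, key):
        self.key = key
        self.prev = self.next = None

class Node:
    def __init__(self):
        self.children = [None, None]
        self.max_left = None    # rightmost (highest) leaf under the left child
        self.min_right = None   # leftmost (lowest) leaf under the right child

class ConnectedTrie:
    def __init__(self, height):
        self.height = height
        self.root = Node()
        self.head = None        # leaf holding the smallest key

    def insert(self, key):
        leaf = Leaf(key)
        pred = succ = None
        node = self.root
        for depth in range(self.height - 1, -1, -1):
            bit = (key >> depth) & 1
            if bit:  # going right: the left subtree holds only smaller keys
                if node.max_left is not None:
                    pred = node.max_left            # closest predecessor so far
                if node.min_right is None or key < node.min_right.key:
                    node.min_right = leaf
            else:    # going left: the right subtree holds only larger keys
                if node.min_right is not None:
                    succ = node.min_right           # closest successor so far
                if node.max_left is None or key > node.max_left.key:
                    node.max_left = leaf
            if depth == 0:
                node.children[bit] = leaf
            else:
                if node.children[bit] is None:
                    node.children[bit] = Node()
                node = node.children[bit]
        # splice the new leaf into the sorted doubly linked list of leaves
        leaf.prev, leaf.next = pred, succ
        if pred is not None:
            pred.next = leaf
        else:
            self.head = leaf
        if succ is not None:
            succ.prev = leaf

    def pop_min(self):
        # the next packet for transmission: a single pointer dereference
        leaf = self.head
        self.head = leaf.next
        if self.head is not None:
            self.head.prev = None
        return leaf.key
```

Inserting 13, 5, 10, 8 and then repeatedly popping the head yields 5, 8, 10, 13 without any per-pop search.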
83. Example: Root Only
[Figure: the root R covering (-inf, +inf).]
84. Example: Adding 13
85. Example: Adding 5
86. Example: Adding 10
87. Example: Adding 8
[Figure: after inserting 13, 5, 10, and 8 into a trie of height 4, the internal nodes A through I carry (leftmost, rightmost) leaf-pointer annotations, e.g., R: (5, 8), A: (-inf, 5), D: (10, 13); the leaves 5, 8, 10, 13 are connected into a sorted linked list terminated by NULL at both ends.]
88. The Node Traversal Algorithm
[Figure: pseudocode for visiting a node during insertion.]
89. Using Two Tries at a Time
- Why two tries at a time?
- Let's assume D is the scheduling horizon
- During the transmission of a packet, new packets will be associated with time stamp increments of at most D
- During the transmission of these packets, new packets will be associated with time stamp increments of at most 2D
90. Optimal Height of the Connected Trie
optimal height = log( (maximum packet size x least common multiple of weights) / (minimum weight x greatest common divisor of packet sizes) )
91. Performance
- Scheduling decision time
- Exactly 1 memory access, independent of the number of flows in the scheduler
- Insertion time
- 6 read accesses, 7 write accesses for a trie of height 12
- Memory access bandwidth
- 10 words per read access, 6 words per write access for a trie of height 12
- Memory requirement
- 34 KB for 256 connections, 213 KB for 64K connections, for a trie of height 12
92. PART IV: Building a Four-Level, OC-48, Programmable Hierarchical Packet Scheduler
93. Contribution
- Efficient implementation of hierarchical scheduling on the IXP2xxx series and next-generation processors
- Support for OC-48 on IXP28xx
- Budget of 228 cycles/packet for scheduling only!
- Support for multiple levels of hierarchy, up to 5 levels
- Support for a total of < 256K input queues with arbitrary weights at each level
- Any possible configuration of the hierarchies should be supported
94. Examples of Schedulers
95. Approach
- Supporting many single-level schedulers (e.g., 64K) using a limited number of microengines and threads
- We assign a small number of threads to serve each entire level of the hierarchy, as opposed to a single scheduler only!
- Packets are exchanged between levels at line rate
- Bandwidth sharing takes place at different levels in parallel!
- Sorting of packets
- We use the connected trie and tree of comparators structures
96. Parallelizing the Hierarchical Scheduler
- We assign a small number of threads to serve each level of the hierarchy
- We can do that because packets are exchanged between levels at line rate
- We insert pre-sorted packets at each level
- We can do that because hierarchical schedulers consist of independent single-level schedulers
- We parallelize the dequeue processing at each of the levels
- A dequeue thread creates a hole (i.e., an empty packet space)
- A hole-filling thread at the next level fills the hole by inserting a new packet
97. High-Level Design from First-Order Principles
- Fact: a hierarchical scheduler consists of single-level schedulers. Consequence: we can insert presorted packets at each level. Guideline: address the sorting problem locally at each single-level scheduler.
- Fact: packets need to be exchanged between levels at line rate. Consequence: we can assign a small number of threads to serve each entire level of the hierarchy. Guideline: the levels of the hierarchy can operate in parallel, independent of each other.
- Fact: each packet transmitted at a level creates an empty packet space, which we call a hole. Consequence: to fill a hole you may need to perform at least one SRAM access, which can be as large as 300 compute cycles. Guideline: buffer more than one packet at each level of the scheduler.
- Fact: the enqueuing process may insert packets into single-level schedulers at any level. Consequence: enqueuing and hole-filling threads may need to access the same state information concurrently. Guideline: mutual exclusion techniques are required.
- Fact: calculating a virtual time function is complex. Consequence: virtual time can be approximated by the finish time of the packet currently in service (SCFQ). Guideline: a packet entering a scheduler does not need to preempt the packet currently in service.
98. Parallelized Hierarchical Scheduler Illustrated
99. Meeting the OC-48 Line Rate with Buffering
- Why buffering?
- To cope with the fact that a single SRAM access may take more than the 228-cycle budget
- Enqueue process
- We maintain the two minimum-tag packets at the output of each scheduler
- Dequeue process
- While a hole is being filled, the next packet is ready to be serviced
100. Hierarchical Scheduler Prototyped
101. Use of the IXP Microengines
102. Remarks
- Four-level OC-48 line rate forwarding (2.5 Gbps)
- Data structures fit into the local memory and SRAM of the IXP2400
- We keep the memory access bandwidth consumption at a reasonable level
- We fetch either the heads or the tails of queues, but not both at the same time
- We employ novel inter-thread synchronization algorithms
103. Tutorial Summary and Conclusion
104. Tutorial Summary (I)
- Packet scheduling
- Critical component of router datapaths
- Generalized Processor Sharing
- Ideal service (non-implementable)
- Real implementations
- Annotate packets with time stamps
- Sort packets according to their time stamp values
105. Tutorial Summary (II)
- We propose a new algorithm for sorting packets
- Reduces the latency of making scheduling decisions to a single memory access time
- Parallelized processor architectures like the IXP2xxx are suitable for implementing packet scheduling in software
106. Future Work
- Use the Connected Trie for implementing disciplines other than SFQ and SCFQ
- Example: L-GPS?
- Use the Connected Trie in multi-level hierarchical scheduler configurations
107. Thanks for Listening
108. References
- Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin, "Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing," Technical Report, Communications Technology Lab, Intel Corporation; in submission to Computer Networks
- Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, "Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure," Technical Report, Communications Technology Lab, Intel Corporation; in submission to Software: Practice and Experience
- Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, "A Four Level OC-48 Programmable Hierarchical Packet Scheduler," Technical Report, Communications Technology Lab, Intel Corporation