Title: Line Rate Packet Classification and Scheduling
1. Line Rate Packet Classification and Scheduling
- Michael Kounavis (Intel)
- Alok Kumar (Intel)
- Raj Yavatkar (Intel)
- Harrick Vin (U. Texas, Austin)
- October 26, 2005
2. Packet Classification
3. Tutorial Summary
- PART I
- Understanding the problem
- PART II
- State-of-the-art
- PART III
- Observations on real world classifiers
- PART IV
- Two stage packet classification using Most
Specific Filter Matching and Transport Level
Sharing
4. PART I: Understanding the Problem
5. Problem Statement
- Packet classifiers
- Lists of rules
- Rules are <priority, predicate, action> triplets
- Single Match Problem
- Find the highest-priority rule that matches a packet
- Multiple Match Problem
- Find all rules that match a packet
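Both match problems can be stated concretely with a toy linear-scan classifier. This is a minimal sketch for illustration only (the field names and range-based predicates are our assumptions), not one of the line-rate schemes this tutorial surveys:

```python
# Rules as <priority, predicate, action> triplets; lower priority number = higher priority.

def matches(predicate, packet):
    """A predicate maps field names to (lo, hi) ranges; missing fields are wildcards."""
    return all(lo <= packet[f] <= hi for f, (lo, hi) in predicate.items())

def single_match(rules, packet):
    """Single Match Problem: the highest-priority matching rule, or None."""
    best = None
    for prio, pred, action in rules:
        if matches(pred, packet) and (best is None or prio < best[0]):
            best = (prio, pred, action)
    return best

def multiple_match(rules, packet):
    """Multiple Match Problem: every rule that matches the packet."""
    return [r for r in rules if matches(r[1], packet)]

rules = [
    (1, {"dst_port": (1040, 1070)}, "PERMIT"),
    (2, {"dst_port": (1000, 2000)}, "DENY"),
]
pkt = {"dst_port": 1050}
```

A linear scan is O(n) per packet; the rest of the tutorial is about doing better.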
6. A Rule Database
[Figure: a table of rules. Each row pairs a priority level with a predicate over the Src. IP, Dst. IP, Src. Port, and Dst. Port fields, and an action. Example rows: priority 1, 147.101.*, ports 1040-1070, PERMIT; priority 2, 128.151.*, ports 2110-2150, DENY; priority 3, 132.* / 153.*, ftp, DENY.]
7. A Rule
[Figure: a single rule. Predicate fields: Src. IP 147.101.*, ports 2140-2170, Protocol TCP; Action: PERMIT.]
8. An IP Prefix
A range of values in a single dimension
[Figure: the prefix 128.67.* shown as the range 128.67.0.0-128.67.255.255 on the Src. IP address axis, which starts at 0.0.0.0.]
9. An Arbitrary Range
A range of values in a single dimension
[Figure: the range 2140-3140 on the Dst. Port axis, which starts at 0.]
10. An Exact Value
A specific number
[Figure: exact values 0, 1, 6, 17 marked on the Protocol field axis.]
11. A Source-Destination IP Prefix Pair
[Figure: in the (Src. IP, Dst. IP) plane, the pair (128.67.*, 132.59.*) is a rectangle, (128.67.208.1, 132.59.64.10) is a point, and (128.67.*, 132.59.64.10) is a line segment.]
12. Relationship Between IP Prefix Pairs
[Figure: rectangles in the (Src. IP, Dst. IP) plane. Source prefixes 128.67.* and 128.67.32.*; destination prefixes 121.45.5.*, 128.44.32.*, 145.39.* (containing 145.39.3.*), and 167.7.* (containing 167.7.4.*). Nested prefixes yield nested rectangles.]
13. Partial Overlaps: IP Prefix Pairs vs. Arbitrary Ranges
- Partially overlapping IP prefix pairs always form the shape of a cross
- Partially overlapping pairs of arbitrary ranges may form any shape
14. Packet Classification as a Point Location Problem
[Figure: Rules 1-5 drawn as overlapping rectangles in a two-dimensional space; classifying a packet amounts to locating its point among them.]
15. PART II: State-of-the-Art
16. Packet Classification: An Open Problem
A timeline of proposed schemes (years shown: 1987, 1994, 1998, 1999, 2000, 2003, 2004):
- Mogul et al.: Packet Filter Concept
- Chazelle et al.: Point Location Among Hyperplanes
- Lakshman and Stiliadis: Bit Vector
- Srinivasan, Suri, Varghese: Grid of Tries, Cross Producting
- Srinivasan, Suri, Varghese: Tuple-Space Search
- Gupta, McKeown: Recursive Flow Classification
- Gupta, McKeown: HiCuts
- Baboescu et al.: Aggregate Bit Vector
- Baboescu et al.: Extended Grid of Tries
- Singh et al.: HyperCuts
- Taylor, Turner: Distributed Cross Producting
- Kounavis et al.: Most Specific Filter Matching
17. Multi-dimensional Tries
Rule Database:
- Rule 1: (*, 1)
- Rule 2: (0, 0)
- Rule 3: (1, 00)
- Rule 4: (1, 0)
Example packet: (1101, 0011)
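A toy sketch of the two-dimensional trie lookup for this rule database may help. It is our illustration, not the tutorial's code: prefixes are bit strings, `''` stands for `*`, and the helper names are invented. It walks the source trie, and at every matching source prefix searches that node's destination trie, letting more specific matches win (real schemes such as Grid of Tries add switch pointers and rule priorities on top of this idea):

```python
def make_trie(pairs):
    """Source trie of nested dicts; a node's 'dst' key holds its destination trie,
    and a destination node's 'rule' key holds the stored rule. The special keys
    'dst' and 'rule' cannot collide with the bit keys '0' and '1'."""
    root = {}
    for (src, dst), rule in pairs:
        node = root
        for b in src:
            node = node.setdefault(b, {})
        d = node.setdefault('dst', {})
        for b in dst:
            d = d.setdefault(b, {})
        d['rule'] = rule
    return root

def lookup(root, src_bits, dst_bits):
    """Search the destination trie of every matching source prefix."""
    best = None
    node, i = root, 0
    while node is not None:
        d, j = node.get('dst'), 0
        while d is not None:
            if 'rule' in d:
                best = d['rule']   # deeper (more specific) matches overwrite shallower ones
            d = d.get(dst_bits[j]) if j < len(dst_bits) else None
            j += 1
        node = node.get(src_bits[i]) if i < len(src_bits) else None
        i += 1
    return best

rules = [(("", "1"), "Rule 1"), (("0", "0"), "Rule 2"),
         (("1", "00"), "Rule 3"), (("1", "0"), "Rule 4")]
trie = make_trie(rules)
```

For the packet (1101, 0011), the search visits the destination tries of `*` and `1` and returns Rule 3, the most specific match.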
18. Grid of Tries
Rule Database:
- Rule 1: (*, 1)
- Rule 2: (0, 0)
- Rule 3: (1, 00)
- Rule 4: (1, 0)
Example packet: (1101, 0011)
19. Bit Vector Schemes
20. Cross Producting
21. Tuple-Space Search
Key idea: the number of distinct combinations of prefix lengths (called tuples) is small
Rule Database:
- Rule 1: (*, 1)
- Rule 2: (0, 0)
- Rule 3: (1, 0)
- Rule 4: (1, 00)
Tuple Space (source prefix length, destination prefix length):
- Tuple 1 (0, 1): Rule 1
- Tuple 2 (1, 1): Rules 2 and 3
- Tuple 3 (1, 2): Rule 4
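The tuple-space idea can be sketched in a few lines. This is a hedged illustration (function names and the bit-string encoding are ours): rules are grouped by their prefix-length tuple, and each tuple is probed with one exact-match lookup on the packet's corresponding prefix bits:

```python
from collections import defaultdict

def build_tuple_space(rules):
    """rules: list of ((src_prefix, dst_prefix), name) with prefixes as bit strings."""
    space = defaultdict(dict)
    for (src, dst), name in rules:
        # the tuple is the pair of prefix lengths; each tuple gets a hash table
        space[(len(src), len(dst))][(src, dst)] = name
    return space

def classify(space, src_bits, dst_bits):
    """One exact-match probe per tuple; returns every matching rule."""
    found = []
    for (ls, ld), table in space.items():
        key = (src_bits[:ls], dst_bits[:ld])
        if key in table:
            found.append(table[key])
    return found

rules = [(("", "1"), "Rule 1"), (("0", "0"), "Rule 2"),
         (("1", "0"), "Rule 3"), (("1", "00"), "Rule 4")]
space = build_tuple_space(rules)
```

The number of probes equals the number of tuples, not the number of rules, which is what makes the scheme attractive when few tuples exist.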
22. Recursive Flow Classification
[Figure: packet header fields index a first stage of lookup tables producing indices 1-4; these are combined into indices 5 and 6, which a final stage maps to an action.]
23. HiCuts and HyperCuts
[Figure: a decision tree whose leaves hold small rule lists (Rule 1, Rule 2, ..., Rule m).]
- HiCuts: cuts one dimension at a time
- HyperCuts: cuts multiple dimensions at a time
24. TCAM
[Figure: field values from the packet header are presented to a TCAM memory array; each entry outputs a match bit (0/1); a priority encoder selects the memory location of the highest-priority match, which indexes a RAM-based action memory.]
25. Comparison

Algorithm | Worst-Case Lookup Time | Worst-Case Storage
Multi-dimensional Tries | w^d | ndw
Grid of Tries | w^(d-1) | ndw
Bit Vector | dw + dn/a | dn^2/a
Tuple-Space Search | n | n
Cross Producting | dw | n^d
Recursive Flow Classification | d | n^d
HiCuts | d | n^d
TCAM | 1 | n

(n: number of rules, d: number of fields, w: field size)
26. PART III: Observations on Real-World Classifiers
27. What We Observed
- IP prefix pairs
- create partial overlaps which are significantly fewer than the theoretical worst case
- Transport level fields
- form sets which are shared by many different source-destination IP prefix pairs
- sets usually contain a small number of entries
28. Toward Two-Stage Packet Classification
29. Observations on IP Prefix Pairs
- IP prefix pairs are of 2 types
- partially-specified filters (i.e., (*, X) or (Y, *))
- fully-specified filters
- Partially-specified filters are a small fraction (< 25%) of IP prefix pairs
- Most fully-specified filters (> 80%) are represented by
- segments of straight lines
- points
30. IP Prefix Pair Overlaps
[Figure: partially-specified filters of the form (Y, *) and (*, X), the wildcard filter (*, *), and clusters 1 through m of fully-specified filters; the theoretical worst case is n^2/4 overlaps.]
- What is the realistic amount of overlaps?
31. Visualizing ACLs with Our FilterViewer Tool
32. Partial Filter Overlaps in the Realistic Filter Structure

Breakdown of overlaps:
ACL | observed overlaps | theoretical worst case | % partially-specified only | % fully-specified only | % between partially- and fully-specified
ACL1 | 4 | 90,525 | 100 | 0 | 0
ACL2 | 2,249 | 138,601 | 45 | 4 | 51
ACL3 | 6,138 | 1,260,078 | 88 | 1 | 11
ACL4 | 852 | 12,246 | 100 | 0 | 0
33. Why So Few Filter Overlaps?
- Partially-specified filters represent a small fraction of the total number of filters in databases
- Fully-specified filters create an insignificant amount of overlaps
- There is a bounded number of important servers per IP address domain
34. Observations on Transport Level Fields
35. Transport Level Sharing

ACL | number of rules | unique sets of transport level fields | entries in unique sets of transport level fields
ACL1 | 754 | 102 | 316
ACL2 | 607 | 35 | 68
ACL3 | 2,399 | 186 | 437
ACL4 | 157 | 8 | 47
36. Implications
- Classification can be split into 2 stages
- Stage 1: IP address fields
- Stage 2: transport level fields
- A design that returns the smallest filter intersection is viable
- since the amount of overlaps between IP prefix pairs is small
- Searching through transport level fields can be accelerated by hardware
37. PART IV: Two-Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing
38. Cross Producting Revisited
[Figure: source IP prefixes 125.12.12.*, 128.67.*, and 128.67.32.* and destination IP prefixes 121.45.5.*, 132.59.*, and 132.59.10.* drawn as ranges on the source and destination IP address axes.]
Cross Producting may return a non-existent filter. To address this issue, Cross Producting adds all possible filters returned from LPM searches into its lookup table.
39. Definition of a Cross Product
- A filter with
- a source IP prefix equal to the prefix of a filter F1 from a database, and
- a destination IP prefix equal to the prefix of another filter F2 from the same database
- F2 is not necessarily equal to F1
40. Improving Cross Producting
- Cross Producting is fast, but
- Memory explosion
- For ACL3 there are 431 src. IP prefixes and 516 dst. IP prefixes, hence 222,396 cross products
- Solution
- We can remove 70-80% of the cross products with little penalty to the performance of the classifier
- Not covered cross products
- Partially covered cross products
41. Motivating Example
[Figure: source and destination IP address axes (0 up to 255.255.255.255) with prefixes 125.12.12.*, 128.67.*, 128.67.32.*, 121.45.5.*, 132.59.*, 132.59.10.*, and 147.101.10.*, defining regions A (*, *), B, C, D, E and cross products R1-R7.]
Do we need to store all cross products R1-R7?
42. Not Covered Cross Products
Not covered cross products are only covered by (*, *). Hence they can be removed from the lookup table. If no match is found, the algorithm returns (*, *).
43. Partially Covered Cross Products
[Figure: source prefixes 125.12.12.* and 128.67.32.* and destination prefixes 121.45.5.* and 132.59.10.* on the source and destination IP address axes; the cross products (125.12.12.*, 132.59.10.*), (125.12.12.*, 121.45.5.*), and (128.67.32.*, 121.45.5.*) are marked, with (125.12.12.*, 121.45.5.*) labeled as a partially covered cross product, covered only by the partially-specified filter (125.12.12.*, *).]
Partially covered cross products are only covered by filters of the form (X, *) or (*, Y). Hence they can be removed from the lookup table. If no match is found, the algorithm checks a database of partially-specified filters. If no match is found again, the algorithm returns (*, *).
44. Fully Covered Cross Products
- Those cross products which are neither not covered nor partially covered
- fully-specified filters
- filter intersections that are fully specified
- filters which are
- formed by combining the source and destination IP prefixes of different IP prefix pairs, and
- contained in fully-specified filters or fully-specified filter intersections
- the latter are called indicator filters
45. Most Specific Filter Matching
[Figure: an LPM lookup on the source IP address and an LPM lookup on the destination IP address each return an index; the index pair selects an entry in a primary table of fully covered cross products; on a miss, secondary tables of filters of the form (X, *) and (*, Y) are consulted.]
46. Back to the Example
Cross products R1 and R3-R7 can be removed from the primary table.
47. Transport Level Sharing
- Rules that specify the same source-destination IP prefix pair are consecutive
- These rules share sets of transport level fields
- Most specific filter matching returns a list of pointers to shared sets of transport level fields
- Packet classification in the transport level dimensions is done in hardware
48. Hardware Acceleration
- In our approach we do not need to expand a range into prefixes
- We use a pair of comparators
- We compare a key with upper and lower bounds
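The comparator pair reduces a range match to two comparisons. A trivial sketch of what each (lower, upper) comparator pair computes, using the example range from earlier slides:

```python
def range_match(key, lo, hi):
    # what a (lower-bound, upper-bound) comparator pair computes in hardware:
    # two comparisons, with no expansion of the range into prefixes
    return lo <= key <= hi

# the arbitrary port range 1040-1070 needs no prefix expansion:
assert range_match(1050, 1040, 1070)
assert not range_match(1071, 1040, 1070)
```

Avoiding prefix expansion matters because an arbitrary w-bit range can expand into as many as 2w - 2 prefixes.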
49. Performance
- Lookup time
- Small and predictable number of steps, independent of the number of rules
- 11 memory accesses
- Memory space
- Reasonable: 19-446 KB (for ACLs 1-4, without HW acceleration)
- Memory access BW
- 64 words/access (without HW acceleration)
- 4 words/access (with HW acceleration)
- Update time
- Approximately 197,000 memory accesses
50. Tutorial Summary and Conclusion
51. Tutorial Summary (I)
- Packet Classification
- A complex, open problem
- State of the art
- Existing schemes trade off lookup time against memory requirements
52. Tutorial Summary (II)
- Observations on real-world classifiers
- Few partial overlaps between IP prefix pairs
- Shared sets of transport level fields
- Proposed a new scheme
- Exploits classifier properties
- Predictable lookup time with reasonable memory requirement
- Requires HW acceleration
53. Future Work
- Verify the scheme using more data sets
- Simplify the update process
- Apply other fast solutions to the first stage
- HyperCuts
54. Packet Scheduling
55. Tutorial Summary
- PART I
- Understanding the problem
- PART II
- State-of-the-art
- PART III
- Sorting packets by packet schedulers using the Connected Trie data structure
- PART IV
- Building a four-level, OC-48, programmable hierarchical packet scheduler
56. PART I: Understanding the Problem
57. The Concept of QoS
- Packet networks
- Usually provide best-effort services
- Can we make packet networks capable of delivering continuous media?
- Key concept: make packet networks flow-aware
- Mechanisms
- Scheduling
- Shaping
- Resource reservation
- Admission control
58. Packet Scheduling
[Figure: sessions 1 through N feed a traffic shaper and a scheduler; bursty traffic is shaped, head-of-line packets are selected for transmission, and delay jitter is controlled.]
59. Generalized Processor Sharing (GPS)
- Ideal scheduling discipline
- Visits each nonempty queue and serves an infinitesimally small amount of data
- Supports exact max-min fair share allocation
- Resources are allocated in order of increasing demand
- No source gets a resource share larger than its demand
- Sources with unsatisfied demands get an equal share of the resource
60. Weighted Fair Queuing (WFQ)
- Key idea: if you can't implement GPS, simulate it on the side
- The algorithm
- Tag packets with numbers denoting the order of completion of service according to the simulated GPS discipline
- Transmit the packets in ascending order of their tags
61. Time Stamp Calculation
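This slide's formula did not survive the conversion; the standard WFQ time-stamp (virtual finish time) calculation, which is presumably what was shown, is:

```latex
\[
F_i^k \;=\; \max\!\left(F_i^{k-1},\, V(a_i^k)\right) \;+\; \frac{L_i^k}{w_i}
\]
```

Here F_i^k is the tag of the k-th packet of session i, V(t) is the round number (virtual time) of the simulated GPS service at the packet's arrival time a_i^k, L_i^k is the packet length, and w_i is the session's weight. Packets are transmitted in ascending order of F_i^k.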
62. Example
63. Relative Fairness Bound
Relative Fairness Bound = MAX | (service received by connection A during an interval) / (rate allocated to A) - (service received by connection B during this interval) / (rate allocated to B) |
where A and B are backlogged connections
64. Absolute Fairness Bound
Absolute Fairness Bound = MAX | (service received by connection A during an interval) / (rate allocated to A) - (service received by connection A during this interval if serviced by GPS) / (rate allocated to A) |
where A is a backlogged connection
65. Hierarchical Packet Scheduling
- Single-level scheduling
- Transmission order does not depend on future arrivals
- Hierarchical scheduling
- Transmission order depends on future arrivals
- To implement hierarchical GPS we need to build a hierarchy of single-level fair queuing disciplines
66. Schedulable Region
- Set of all possible combinations of performance bounds a scheduler can simultaneously meet
[Figure: a region drawn over the performance bounds of Class I, Class II, and Class III traffic.]
67. PART II: State-of-the-Art
68. Tagging and Sorting Schemes
A timeline of proposed schemes (years shown: 1989, 1991, 1993, 1994, 1995, 1996, 2003, 2004):
- Demers, Keshav, Shenker: WFQ
- Parekh, Gallager: Generalized Processor Sharing
- Lazar, Hyman, Pacifici: Schedulable Region
- Lazar, Hyman, Pacifici: MARS
- Golestani: Self-Clocked Fair Queuing
- Shreedhar, Varghese: Deficit Round Robin
- Goyal, Vin, Cheng: Start Time Fair Queuing
- Bennett, Stephens, Zhang: CMU sorting scheme
- Rexford, Bonomi, Greenberg: AT&T sorting scheme
- Ramabhadran, Pasquale: Stratified Round Robin
- Valente: Exact GPS simulation with logarithmic complexity
- Kounavis, Kumar, Yavatkar: Connected Trie data structure
69. Self-Clocked Fair Queuing (SCFQ)
Same as WFQ, apart from:
- the round number of the simulated GPS service is approximated by the finish time of the packet currently in service
Main disadvantage: large end-to-end delay
70. Start Time Fair Queuing (SFQ)
Same as WFQ, apart from:
- the round number of the simulated GPS service is approximated by the start time of the packet currently in service
Transmission order: ascending order of start times
Same end-to-end delay as WFQ
71. WF²Q
Same as WFQ, apart from the transmission order:
- select the packet with the minimum tag from among those that have already started service in the corresponding GPS simulation
Smaller Absolute Fairness Bound
72. Round Robin Scheduling
- Deficit Round Robin
- Each connection has a deficit counter
- Every round, the deficit counter is incremented by a quantum
- If the packet size < counter, then the packet is transmitted and the counter is reduced by the packet size
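The Deficit Round Robin steps above can be sketched directly. This is a minimal illustration with invented names (packet sizes as integers, one deque per connection), not production scheduler code:

```python
from collections import deque

def drr_round(queues, deficits, quantum, transmit):
    """One DRR round over a dict of per-connection packet-size deques."""
    for conn, q in queues.items():
        if not q:
            continue
        deficits[conn] += quantum          # each round adds a quantum
        while q and q[0] <= deficits[conn]:
            size = q.popleft()             # head-of-line packet fits: send it
            deficits[conn] -= size         # and pay for it from the deficit
            transmit(conn, size)

queues = {"A": deque([300, 300]), "B": deque([700])}
deficits = {"A": 0, "B": 0}
sent = []
drr_round(queues, deficits, 500, lambda c, s: sent.append((c, s)))
# round 1: A sends one 300-byte packet (deficit 500 -> 200);
# B's 700-byte packet must wait with deficit 500
```

A second round gives B a deficit of 1000, so its 700-byte packet is then transmitted; over time each connection's throughput tracks its quantum.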
73. Exact GPS Simulation with Logarithmic Complexity
- L-GPS simulates GPS with a minimum deviation of one packet size at O(log N) complexity
- All other well-known schedulers that accomplish the same deviation (e.g., WF²Q) have O(N) complexity
- Key idea
- L-GPS pre-computes the evolution of the round number function of the simulated GPS service using a tree structure
74. Some Sorting Data Structures
- Heaps
- Binomial heaps
- Calendar queues
- Van Emde Boas trees
- Trees of comparators
- CMU sorting scheme
- AT&T sorting scheme
- Polytechnic Institute sorting scheme
75. The Tree of Comparators
- Divide packets into groups
- Send each group to a stage of comparators
- Obtain a minimum from each comparator
- Pass the minima into a second stage of comparators
- Repeat the process until one packet remains
76. The Sorting Scheme from AT&T
[Figure: per-connection FIFOs (FIFO 1 through FIFO k) feed sorting bins (Bin 1 through Bin m) that partition the scheduling horizon.]
77. The Sorting Scheme from Polytechnic Inst. Brooklyn
- Sorting over the range of time stamp values is supported by a hierarchy of bit vectors
78. PART III: Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure
79. Contribution
- We propose a sorting algorithm and data structure that reduces the latency of making scheduling decisions to a single memory access time
- The solution is applicable to SCFQ and SFQ
- Key observation
- Increments on packet time stamps are bounded by (maximum packet size) / (minimum weight)
- This bound is called the scheduling horizon
- Approach
- We represent the scheduling horizon as a trie
- We put state into the nodes of the trie to allow the leaves to be connected into a linked list
80. Trie-based Ordering
[Figure: a trie structure of height h = log(scheduling horizon / region width).]
81. Van Emde Boas Trees and the Connected Trie
[Figure: comparison of traversals; the Connected Trie achieves the optimal traversal, a Van Emde Boas tree uses binary traversal, and a plain trie uses linear traversal.]
82. Main Concepts
- Each node stores
- a pointer to the rightmost leaf, with the highest value from among those found by traversing the left child of the node
- a pointer to the leftmost leaf, with the lowest value from among those found by traversing the right child of the node
- When a new packet is added into the trie
- the algorithm discovers the rightmost and leftmost leaves the new packet should be connected to
- the new packet is inserted into a linked list of leaves
- Hence the next packet for transmission into the network is found in a single memory access time
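A simplified sketch of these concepts may help before the worked example that follows. This is our illustration under stated assumptions (distinct integer keys, no deletion or trie cleanup), not the paper's implementation; it reproduces the running example of inserting 13, 5, 10, 8 into a trie of height 4:

```python
class Leaf:
    def __init__(self, key):
        self.key = key
        self.prev = self.next = None

class Node:
    def __init__(self):
        self.children = [None, None]
        self.max_left = None    # rightmost (highest) leaf under the left child
        self.min_right = None   # leftmost (lowest) leaf under the right child

class ConnectedTrie:
    def __init__(self, height):
        self.height = height
        self.root = Node()
        self.head = None        # leaf holding the smallest key

    def insert(self, key):
        leaf = Leaf(key)
        pred = succ = None
        node = self.root
        for depth in range(self.height - 1, -1, -1):
            bit = (key >> depth) & 1
            if bit:  # going right: the left subtree holds only smaller keys
                if node.max_left is not None:
                    pred = node.max_left            # closest predecessor so far
                if node.min_right is None or key < node.min_right.key:
                    node.min_right = leaf
            else:    # going left: the right subtree holds only larger keys
                if node.min_right is not None:
                    succ = node.min_right           # closest successor so far
                if node.max_left is None or key > node.max_left.key:
                    node.max_left = leaf
            if depth == 0:
                node.children[bit] = leaf
            else:
                if node.children[bit] is None:
                    node.children[bit] = Node()
                node = node.children[bit]
        # splice the new leaf into the sorted doubly linked list of leaves
        leaf.prev, leaf.next = pred, succ
        if pred is not None:
            pred.next = leaf
        else:
            self.head = leaf
        if succ is not None:
            succ.prev = leaf

    def pop_min(self):
        # the next packet for transmission: a single pointer dereference
        leaf = self.head
        self.head = leaf.next
        if self.head is not None:
            self.head.prev = None
        return leaf.key
```

Inserting 13, 5, 10, 8 and then repeatedly popping the head yields 5, 8, 10, 13 without any per-pop search.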
83. Example: Root Only
[Figure: the root R covering (-inf, +inf).]
84. Example: Adding 13
85. Example: Adding 5
86. Example: Adding 10
87. Example: Adding 8
[Figure: after inserting 13, 5, 10, and 8 into a trie of height 4, the internal nodes A through I carry (leftmost, rightmost) leaf-pointer annotations, e.g., R: (5, 8), A: (-inf, 5), D: (10, 13); the leaves 5, 8, 10, 13 are connected into a sorted linked list terminated by NULL at both ends.]
88. The Node Traversal Algorithm
[Figure: pseudocode for visiting a node during insertion.]
89. Using Two Tries at a Time
- Why two tries at a time?
- Let's assume D is the scheduling horizon
- During the transmission of a packet, new packets will be associated with time stamp increments of at most D
- During the transmission of these packets, new packets will be associated with time stamp increments of at most 2D
90. Optimal Height of the Connected Trie
optimal height = log( (maximum packet size x least common multiple of weights) / (minimum weight x greatest common divisor of packet sizes) )
91. Performance
- Scheduling decision time
- Exactly 1 memory access, independent of the number of flows in the scheduler
- Insertion time
- 6 read accesses, 7 write accesses for a trie of height 12
- Memory access bandwidth
- 10 words per read access, 6 words per write access for a trie of height 12
- Memory requirement
- 34 KB for 256 connections, 213 KB for 64K connections, for a trie of height 12
92. PART IV: Building a Four-Level, OC-48, Programmable Hierarchical Packet Scheduler
93. Contribution
- Efficient implementation of hierarchical scheduling on the IXP2xxx series and next-generation processors
- Support for OC-48 on IXP28xx
- Budget of 228 cycles/packet for scheduling only!
- Support for multiple levels of hierarchy, up to 5 levels
- Support for a total of < 256K input queues with arbitrary weights at each level
- Any possible configuration of the hierarchies should be supported
94. Examples of Schedulers
95. Approach
- Supporting many single-level schedulers (e.g., 64K) using a limited number of microengines and threads
- We assign a small number of threads to serve each entire level of the hierarchy, as opposed to a single scheduler only!
- Packets are exchanged between levels at line rate
- Bandwidth sharing takes place at different levels in parallel!
- Sorting of packets
- We use the connected trie and tree of comparators structures
96. Parallelizing the Hierarchical Scheduler
- We assign a small number of threads to serve each level of the hierarchy
- We can do that because packets are exchanged between levels at line rate
- We insert pre-sorted packets at each level
- We can do that because hierarchical schedulers consist of independent single-level schedulers
- We parallelize the dequeue processing at each of the levels
- A dequeue thread creates a hole (i.e., an empty packet space)
- A hole-filling thread at the next level fills the hole by inserting a new packet
97. High-Level Design from First-Order Principles
- Fact: a hierarchical scheduler consists of single-level schedulers. Consequence: we can insert presorted packets at each level. Guideline: address the sorting problem locally at each single-level scheduler.
- Fact: packets need to be exchanged between levels at line rate. Consequence: we can assign a small number of threads to serve each entire level of the hierarchy. Guideline: the levels of the hierarchy can operate in parallel, independent of each other.
- Fact: each packet transmitted at a level creates an empty packet space, which we call a hole. Consequence: to fill a hole you may need to perform at least one SRAM access, which can be as large as 300 compute cycles. Guideline: buffer more than one packet at each level of the scheduler.
- Fact: the enqueuing process may insert packets into single-level schedulers at any level. Consequence: enqueuing and hole-filling threads may need to access the same state information concurrently. Guideline: mutual exclusion techniques are required.
- Fact: calculating a virtual time function is complex. Consequence: virtual time can be approximated by the finish time of the packet currently in service (SCFQ). Guideline: a packet entering a scheduler does not need to preempt the packet currently in service.
98. Parallelized Hierarchical Scheduler Illustrated
99. Meeting the OC-48 Line Rate with Buffering
- Why buffering?
- To cope with the fact that a single SRAM access may take more than the 228-cycle budget
- Enqueue process
- We maintain the two minimum-tag packets at the output of each scheduler
- Dequeue process
- While a hole is being filled, the next packet is ready to be serviced
100. Hierarchical Scheduler Prototyped
101. Use of the IXP Microengines
102. Remarks
- Four-level OC-48 line rate forwarding (2.5 Gbps)
- Data structures fit into the local memory and SRAM of the IXP2400
- We keep the memory access bandwidth consumption at a reasonable level
- We fetch either the heads or the tails of queues, but not both at the same time
- We employ novel inter-thread synchronization algorithms
103. Tutorial Summary and Conclusion
104. Tutorial Summary (I)
- Packet scheduling
- Critical component of router datapaths
- Generalized Processor Sharing
- Ideal service (non-implementable)
- Real implementations
- Annotate packets with time stamps
- Sort packets according to their time stamp values
105. Tutorial Summary (II)
- We propose a new algorithm for sorting packets
- Reduces the latency of making scheduling decisions to a single memory access time
- Parallelized processor architectures like the IXP2xxx are suitable for implementing packet scheduling in software
106. Future Work
- Use the Connected Trie for implementing disciplines other than SFQ and SCFQ
- Example: L-GPS?
- Use the Connected Trie in multi-level hierarchical scheduler configurations
107. Thanks for Listening
108. References
- Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin, "Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing," Technical Report, Communications Technology Lab, Intel Corporation; in submission to Computer Networks
- Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, "Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure," Technical Report, Communications Technology Lab, Intel Corporation; in submission to Software: Practice and Experience
- Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, "A Four Level OC-48 Programmable Hierarchical Packet Scheduler," Technical Report, Communications Technology Lab, Intel Corporation