1
Line Rate Packet Classification and Scheduling
  • Michael Kounavis (Intel)
  • Alok Kumar (Intel)
  • Raj Yavatkar (Intel)
  • Harrick Vin (U. Texas, Austin)
  • October 26, 2005

2
Packet Classification
3
Tutorial Summary
  • PART I
  • Understanding the problem
  • PART II
  • State-of-the-art
  • PART III
  • Observations on real world classifiers
  • PART IV
  • Two stage packet classification using Most
    Specific Filter Matching and Transport Level
    Sharing

4
PART I Understanding the Problem
5
Problem Statement
  • Packet classifiers
  • Lists of rules
  • Rules: <priority, predicate, action> triplets
  • Single Match Problem
  • Find the highest priority rule that matches a packet
  • Multiple Match Problem
  • Find all rules that match a packet (a minimal sketch of both problems follows below)
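A minimal sketch of the two match problems, assuming a simple rule layout (source/destination prefixes, a destination port range, an action, and a priority); the field names and the linear scan are illustrative only, not the tutorial's implementation.

    from dataclasses import dataclass
    from ipaddress import ip_address, ip_network

    @dataclass
    class Rule:
        priority: int          # lower number = higher priority
        src_prefix: str        # e.g. "147.101.0.0/16"
        dst_prefix: str        # e.g. "0.0.0.0/0" (wildcard)
        dst_ports: range       # e.g. range(1040, 1071)
        action: str            # "PERMIT" or "DENY"

        def matches(self, src, dst, dport):
            return (ip_address(src) in ip_network(self.src_prefix)
                    and ip_address(dst) in ip_network(self.dst_prefix)
                    and dport in self.dst_ports)

    def multiple_match(rules, src, dst, dport):
        # Multiple Match Problem: all rules that match the packet
        return [r for r in rules if r.matches(src, dst, dport)]

    def single_match(rules, src, dst, dport):
        # Single Match Problem: the highest priority matching rule
        hits = multiple_match(rules, src, dst, dport)
        return min(hits, key=lambda r: r.priority) if hits else None

A real classifier cannot afford the linear scan; the schemes surveyed in Part II exist precisely to avoid it.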

6
A Rule Database
(Figure: an example rule database. Each rule's predicate covers the source IP, destination IP, source port, and destination port fields; each rule also carries an action and a priority level. For example: a PERMIT rule at priority 1 matching the prefix 147.101.* and the port range 1040-1070; a DENY rule at priority 2 matching the prefix 128.151.* and the port range 2110-2150; and a DENY rule at priority 3 matching the prefix pair (132.*, 153.*) and the ftp port.)
7
A Rule
(Figure: a single rule: a predicate over the source IP, destination IP, source port, destination port, and protocol fields, e.g. the prefix 147.101.*, the port range 2140-2170, and protocol TCP, with action PERMIT.)
8
An IP Prefix
A range of values in a single dimension: the prefix 128.67.* covers the source IP addresses 128.67.0.0 through 128.67.255.255 (on an axis that starts at 0.0.0.0).
9
An Arbitrary Range
A range of values in a single dimension: the destination port range 2140 through 3140.
10
An Exact Value
A specific number in a single dimension, e.g. the protocol field taking values such as 0, 1, 6 (TCP), or 17 (UDP).
11
A Source-Destination IP Prefix Pair
In the source-destination IP address plane, the prefix pair (128.67.*, 132.59.*) forms a rectangle, the address pair (128.67.208.1, 132.59.64.10) forms a point, and the pair (128.67.*, 132.59.64.10) forms a line segment.
12
Relationship Between IP Prefix Pairs
(Figure: several source-destination IP prefix pairs, built from source prefixes such as 128.67.* and 128.67.32.* and destination prefixes such as 121.45.5.*, 128.44.32.*, 145.39.*, 145.39.3.*, 167.7.*, and 167.7.4.*, plotted in the plane defined by the source and destination IP addresses.)
13
Partial Overlaps: IP Prefix Pairs vs. Arbitrary Ranges
Partially overlapping IP prefix pairs always form the shape of a cross; partially overlapping pairs of arbitrary ranges may form any shape.
14
Packet Classification as a Point Location Problem
(Figure: Rules 1 through 5 drawn as overlapping regions in the two-dimensional field space; classifying a packet amounts to locating the point defined by its header fields.)
15
PART II State-of-the-Art
16
Packet Classification: An Open Problem
(Timeline of schemes, 1987-2004)
  • Mogul et al.: Packet Filter Concept
  • Chazelle et al.: Point Location Among Hyperplanes
  • Lakshman and Stiliadis: Bit Vector
  • Srinivasan, Suri, Varghese: Grid of Tries, Cross Producting
  • Srinivasan, Suri, Varghese: Tuple-Space Search
  • Gupta, McKeown: Recursive Flow Classification
  • Gupta, McKeown: HiCuts
  • Baboescu et al.: Aggregate Bit Vector
  • Baboescu et al.: Extended Grid of Tries
  • Singh et al.: HyperCuts
  • Taylor, Turner: Distributed Cross Producting
  • Kounavis et al.: Most Specific Filter Matching
17
Multi-dimensional Tries
Rule Database:
  • Rule 1: (*, 1)
  • Rule 2: (0, 0)
  • Rule 3: (1, 00)
  • Rule 4: (1, 0)
Packet: (1101, 0011)
18
Grid of Tries
Rule Database:
  • Rule 1: (*, 1)
  • Rule 2: (0, 0)
  • Rule 3: (1, 00)
  • Rule 4: (1, 0)
Packet: (1101, 0011)
19
Bit Vector Schemes
20
Cross Producting
21
Tuple-space Search
Key idea: the number of different combinations of prefix lengths (called tuples) is small, so the classifier needs only one exact-match probe per tuple (see the sketch below).
Rule Database:
  • Rule 1: (*, 1)
  • Rule 2: (0, 0)
  • Rule 3: (1, 0)
  • Rule 4: (1, 00)
Tuple Space:
  • Tuple 1: prefix lengths (0, 1), holding Rule 1
  • Tuple 2: prefix lengths (1, 1), holding Rules 2 and 3
  • Tuple 3: prefix lengths (1, 2), holding Rule 4
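A minimal sketch of tuple-space search over two prefix fields, assuming binary-string prefixes ("" standing for the wildcard *) and one hash table per tuple of prefix lengths; names and layout are illustrative, not the tutorial's code.

    from collections import defaultdict

    def build_tuple_space(rules):
        # rules: list of (name, src_prefix, dst_prefix) with prefixes as bit strings
        tuples = defaultdict(dict)
        for name, src, dst in rules:
            tuples[(len(src), len(dst))][(src, dst)] = name
        return tuples

    def classify(tuples, src_bits, dst_bits):
        matches = []
        for (ls, ld), table in tuples.items():          # one exact-match probe per tuple
            key = (src_bits[:ls], dst_bits[:ld])
            if key in table:
                matches.append(table[key])
        return matches

    rules = [("Rule 1", "", "1"), ("Rule 2", "0", "0"),
             ("Rule 3", "1", "0"), ("Rule 4", "1", "00")]
    tuple_space = build_tuple_space(rules)
    print(classify(tuple_space, "1101", "0011"))        # ['Rule 3', 'Rule 4']

The lookup cost grows with the number of tuples rather than the number of rules, which is what makes the small tuple count valuable.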
22
Recursive Flow Classification
(Figure: the Recursive Flow Classification pipeline: indices 1 through 6 derived from chunks of the packet header are combined through successive table lookups until the action is produced.)
23
HiCuts and HyperCuts
  • HiCuts: cuts one dimension at a time
  • HyperCuts: cuts multiple dimensions at a time
(Figure: a decision tree whose leaves hold small rule lists, Rule 1, Rule 2, ..., Rule m.)
24
TCAM
(Figure: the field values in the packet header are compared in parallel against all entries of the TCAM memory array; the resulting match bit vector feeds a priority encoder, whose output memory location indexes a RAM holding the action memory.)
25
Comparison
Algorithm                        Worst-Case Lookup Time    Worst-Case Storage
Multidimensional Tries           w^d                       n·d·w
Grid of Tries                    w^(d-1)                   n·d·w
Bit Vector                       d·w + d·n/a               d·n²/a
Tuple Space Search               n                         n
Cross Producting                 d·w                       n^d
Recursive Flow Classification    d                         n^d
HiCuts                           d                         n^d
TCAM                             1                         n

n: number of rules, d: number of fields, w: field size
26
PART III Observations on Real World Classifiers
27
What we Observed
  • IP prefix pairs
  • create partial overlaps which are significantly fewer than the theoretical worst case
  • transport level fields
  • form sets that are shared by many different source-destination IP prefix pairs
  • sets usually contain a small number of entries

28
Toward Two Stage Packet Classification

29
Observations on IP Prefix Pairs
  • IP prefix pairs are of 2 types
  • partially-specified filters (i.e., (*, X) or (Y, *))
  • fully-specified filters
  • partially specified filters
  • are a small fraction (< 25%) of IP prefix pairs
  • most fully-specified filters (> 80%) are represented by
  • segments of straight lines
  • points

30
IP Prefix Pair Overlaps
(Figure: partially specified filters of the form (*, X) and (Y, *) cross clusters 1 through m of fully specified filters. The theoretical worst case is on the order of n²/4 overlaps; what is the realistic amount of overlaps?)
31
Visualizing ACLs with our FilterViewer Tool
32
Partial Filter Overlaps in the Realistic Filter
Structure
Breakdown of Overlaps
        observed    theoretical    % partially      % fully          % between partially
        overlaps    worst case     specified only   specified only   and fully specified
ACL1    4           90,525         100              0                0
ACL2    2249        138,601        45               4                51
ACL3    6138        1,260,078      88               1                11
ACL4    852         12,246         100              0                0
33
Why Few Filter Overlaps
  • Partially specified filters represent a small
    fraction of the total number of filters in
    databases
  • Fully specified filters create an insignificant
    amount of overlaps
  • There is a bounded number of important servers
    per IP address domain

34
Observations on Transport Level Fields
35
Transport Level Sharing
        number of rules    unique sets of transport    entries in the unique sets
                           level fields                of transport level fields
ACL1    754                102                         316
ACL2    607                35                          68
ACL3    2399               186                         437
ACL4    157                8                           47
36
Implications
  • Classification can be split into 2 stages
  • Stage 1: IP address fields
  • Stage 2: transport level fields
  • A design that returns the smallest filter intersection is viable, since the amount of overlap between IP prefix pairs is small
  • Searching through transport level fields can be accelerated by hardware

37
PART IV Two Stage Packet Classification Using
Most Specific Filter Matching and Transport Level
Sharing
38
Cross Producting Revisited
(Figure: the source IP prefixes 125.12.12.*, 128.67.*, and 128.67.32.* and the destination IP prefixes 121.45.5.*, 132.59.*, and 132.59.10.* of an example database plotted along the two LPM axes.)
Cross Producting may return a non-existent filter. To address this issue, Cross Producting adds all possible filters returned from LPM searches into its lookup table.
39
Definition of a Cross Product
  • A filter with
  • a source IP prefix equal to the source prefix of a filter F1 from a database, and
  • a destination IP prefix equal to the destination prefix of another filter F2 from the same database
  • F2 is not necessarily equal to F1 (a construction sketch follows below)
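A minimal sketch of classic Cross Producting built from this definition, assuming binary-string prefixes ("" standing for *) and a table that stores, for every cross product, the rules covering it; the rule representation, the covering test, and the helper names are illustrative assumptions.

    from itertools import product

    def longest_prefix_match(prefixes, addr_bits):
        best = ""
        for p in prefixes:                       # linear LPM, fine for a sketch
            if addr_bits.startswith(p) and len(p) >= len(best):
                best = p
        return best

    def build_cross_products(rules):
        # rules: list of (name, src_prefix, dst_prefix)
        srcs = {s for _, s, _ in rules}
        dsts = {d for _, _, d in rules}
        table = {}
        for s, d in product(srcs, dsts):         # every source/destination combination
            covering = [name for name, rs, rd in rules
                        if s.startswith(rs) and d.startswith(rd)]
            table[(s, d)] = covering             # rules covering this cross product
        return srcs, dsts, table

    def classify(srcs, dsts, table, src_bits, dst_bits):
        s = longest_prefix_match(srcs, src_bits)
        d = longest_prefix_match(dsts, dst_bits)
        return table[(s, d)]

The table has one entry per source/destination prefix combination, which is exactly the memory explosion discussed on the next slide.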

40
Improving Cross Producting
  • Cross Producting is fast, but
  • Memory explosion
  • For ACL3 there are 431 src. IP prefixes and 517 dst. IP prefixes, hence 222,396 cross products
  • Solution
  • We can remove 70-80% of the cross products with little penalty to the performance of the classifier
  • Not covered cross products
  • Partially covered cross products

41
Motivating Example
(Figure: an example database in the source-destination IP plane with filters A = (*, *) through E and the cross products R1 through R7 that they induce.)
Do we need to store all cross products R1-R7?
42
Not Covered Cross Products
Not covered cross products are covered only by (*, *). Hence they can be removed from the lookup table; if no match is found, the algorithm returns (*, *).
43
Partially Covered Cross Products
(Figure: example partially covered cross products in the source-destination IP plane, such as (125.12.12.*, 132.59.10.*), (125.12.12.*, 121.45.5.*), and (128.67.32.*, 121.45.5.*), shown together with the partially specified filter (125.12.12.*, *).)
Partially covered cross products are covered only by filters of the form (X, *) or (*, Y). Hence they can be removed from the lookup table. If no match is found, the algorithm checks a database of partially specified filters; if no match is found there either, the algorithm returns (*, *).
44
Fully Covered Cross Products
  • Those cross products which are neither not covered nor partially covered
  • fully specified filters
  • filter intersections that are fully specified
  • filters which are
  • formed by combining the source and destination IP prefixes of different IP prefix pairs, and
  • contained in fully-specified filters or fully-specified filter intersections
  • these are called indicator filters

45
Most Specific Filter Matching
(Figure: an LPM lookup on the source IP address and an LPM lookup on the destination IP address each return an index; the index pair drives a lookup on the primary table of fully covered cross products, backed by secondary tables of filters of the form (X, *) and of the form (*, Y). A sketch of this lookup flow follows below.)
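A minimal sketch of the two-stage lookup flow described above, reusing longest_prefix_match() from the Cross Producting sketch; the table layouts, the fallback order between the two secondary tables, and the string-based keys are illustrative assumptions (the real scheme works on indices and indicator filters).

    def most_specific_filter_match(pkt_src, pkt_dst,
                                   src_prefixes, dst_prefixes,
                                   primary, src_only, dst_only):
        # Independent LPM lookups on the two address fields
        s = longest_prefix_match(src_prefixes, pkt_src)
        d = longest_prefix_match(dst_prefixes, pkt_dst)

        # Primary table of fully covered cross products
        if (s, d) in primary:
            return primary[(s, d)]

        # Secondary tables of partially specified filters (X, *) and (*, Y)
        if s in src_only:
            return src_only[s]
        if d in dst_only:
            return dst_only[d]

        # Default wildcard filter
        return ("*", "*")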
46
Back to the Example
Cross products R1, R3-R7 can be removed from the primary table.
47
Transport Level Sharing
  • Rules that specify the same source-destination IP prefix pair are consecutive
  • These rules share sets of transport level fields
  • Most specific filter matching returns a list of pointers to shared sets of transport level fields
  • Packet classification in the transport level dimensions is done in hardware

48
Hardware Acceleration
  • In our approach we do not need to expand a range into prefixes
  • We use a pair of comparators
  • We compare a key with upper and lower bounds (see the sketch below)
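A software analogue of the comparator pair: matching a key against an arbitrary range takes two comparisons, so ranges such as 2140-3140 never need to be expanded into prefixes. The entry layout below is an illustrative assumption, not the tutorial's format.

    def in_range(key, lower, upper):
        # two comparators: one for the lower bound, one for the upper bound
        return lower <= key <= upper

    def match_transport(entry, src_port, dst_port, protocol):
        # entry: ((src_lo, src_hi), (dst_lo, dst_hi), protocol or None for *)
        (slo, shi), (dlo, dhi), proto = entry
        return (in_range(src_port, slo, shi)
                and in_range(dst_port, dlo, dhi)
                and (proto is None or proto == protocol))

    print(match_transport(((0, 65535), (2140, 3140), 6), 1234, 2500, 6))  # True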

49
Performance
  • Lookup time
  • Small and predictable number of steps independent
    of the number of rules
  • 11 memory accesses
  • Memory Space
  • Reasonable 19-446 KB (for ACLs 1-4 without HW
    acceleration)
  • Memory Access BW
  • 64 words/access (without HW acceleration)
  • 4 words/access (with HW acceleration)
  • Update Time
  • Approximately 197,000 memory accesses

50
Tutorial Summary and Conclusion
51
Tutorial Summary (I)
  • Packet Classification
  • A complex open problem
  • State of the art
  • Existing schemes trade off lookup time against memory requirements

52
Tutorial Summary (II)
  • Observations on real world classifiers
  • Few partial overlaps between IP prefix pairs
  • Shared sets of transport level fields
  • Proposed a new scheme
  • Exploits classifier properties
  • Predictable lookup time with reasonable memory
    requirement
  • Requires HW acceleration

53
Future Work
  • Verify the scheme using more data sets
  • Simplify the update process
  • Apply other fast solutions to the first stage
  • HyperCuts

54
Packet Scheduling
55
Tutorial Summary
  • PART I
  • Understanding the problem
  • PART II
  • State-of-the-art
  • PART III
  • Sorting packets by packet schedulers using the
    Connected Trie data structure
  • PART IV
  • Building a four level, OC-48, programmable
    hierarchical packet scheduler

56
PART I Understanding the Problem
57
The Concept of QoS
  • Packet networks
  • Usually provide best effort services
  • Can we make packet networks capable of delivering
    continuous media?
  • Key concept: make packet networks flow-aware
  • Mechanisms
  • Scheduling
  • Shaping
  • Resource Reservation
  • Admission Control

58
Packet Scheduling
(Figure: sessions 1 through N, whose traffic arrives with bursts and delay jitter, pass through a traffic shaper; the scheduler then serves the head-of-line packets of the shaped traffic.)
59
Generalized Processor Sharing (GPS)
  • Ideal scheduling discipline
  • Visits each nonempty queue and serves an infinitesimally small amount of data
  • Supports exact max-min fair share allocation (see the sketch below)
  • Resources are allocated in order of increasing demand
  • No source gets a resource share larger than its demand
  • Sources with unsatisfied demands get an equal share of the resource
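The max-min fair share mentioned above can be computed by the usual water-filling procedure; the function below is an illustrative sketch, not part of the tutorial.

    def max_min_fair(capacity, demands):
        # Repeatedly give every unsatisfied source an equal share of what is left;
        # sources whose demand is below the share keep only their demand.
        allocation = [0.0] * len(demands)
        active = list(range(len(demands)))
        remaining = float(capacity)
        while active and remaining > 1e-12:
            share = remaining / len(active)
            satisfied = [i for i in active if demands[i] - allocation[i] <= share]
            if not satisfied:
                for i in active:                 # nobody can be satisfied: split equally
                    allocation[i] += share
                break
            for i in satisfied:                  # cap satisfied sources at their demand
                remaining -= demands[i] - allocation[i]
                allocation[i] = demands[i]
                active.remove(i)
        return allocation

    print(max_min_fair(10, [2, 2.6, 4, 5]))      # [2, 2.6, 2.7, 2.7]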

60
Weighted Fair Queuing (WFQ)
  • Key idea: if you can't implement GPS, simulate it on the side
  • The algorithm
  • Tag packets with numbers denoting the order of completion of service according to the simulated GPS discipline
  • Transmit the packets in the ascending order of their tags

61
Time Stamp Calculation
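For reference, a common formulation of the WFQ time stamp (the slide's own notation may differ): the finish tag of the k-th packet of session i is

    F_i^k = \max\left( F_i^{k-1}, \; V(a_i^k) \right) + \frac{L_i^k}{\phi_i}

where V(t) is the virtual time (round number) of the simulated GPS system, a_i^k the packet's arrival time, L_i^k its length, and \phi_i the weight of session i; packets are transmitted in ascending order of their finish tags.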
62
Example
63
Relative Fairness Bound
Relative Fairness Bound = MAX over backlogged connections A and B, and over any interval, of
  | (service received by connection A during the interval) / (rate allocated to A)
    - (service received by connection B during the same interval) / (rate allocated to B) |
64
Absolute Fairness Bound
Absolute Fairness Bound = MAX over a backlogged connection A, and over any interval, of
  | (service received by connection A during the interval) / (rate allocated to A)
    - (service connection A would have received during the same interval under GPS) / (rate allocated to A) |
65
Hierarchical Packet Scheduling
  • Single level scheduling
  • Transmission order does not depend on future
    arrivals
  • Hierarchical scheduling
  • Transmission order depends on future arrivals
  • To implement hierarchical GPS we need to build a hierarchy of single level fair queuing disciplines

66
Schedulable Region
  • Set of all possible combinations of performance
    bounds a scheduler can simultaneously meet

(Figure: a schedulable region plotted over the performance bounds of Class I, Class II, and Class III traffic.)
67
PART II State-of-the-Art
68
Tagging and Sorting Schemes
(Timeline of schemes, 1989-2004)
  • Demers, Keshav, Shenker: WFQ
  • Parekh, Gallager: Generalized Processor Sharing
  • Lazar, Hyman, Pacifici: Schedulable Region
  • Lazar, Hyman, Pacifici: MARS
  • Golestani: Self-Clocked Fair Queuing
  • Shreedhar, Varghese: Deficit Round Robin
  • Goyal, Vin, Cheng: Start Time Fair Queuing
  • Bennett, Stephens, Zhang: CMU Sorting Scheme
  • Rexford, Bonomi, Greenberg: AT&T Sorting Scheme
  • Ramabhadran, Pasquale: Stratified Round Robin
  • Valente: Exact GPS Simulation with Logarithmic Complexity
  • Kounavis, Kumar, Yavatkar: Connected Trie Data Structure
69
Self-Clocked Fair Queuing (SCFQ)
Same as WFQ, except that the round number of the simulated GPS service is approximated by the finish time of the packet currently in service.
Main disadvantage: large end-to-end delay.
70
Start Time Fair Queuing (SFQ)
Same as WFQ, except that the round number of the simulated GPS service is approximated by the start time of the packet currently in service.
Transmission order: ascending order of start times. Same end-to-end delay as WFQ.
71
WF2Q
Same as WFQ, except for the transmission order: select the packet with the minimum tag from among those that have already started service in the corresponding GPS simulation.
Smaller Absolute Fairness Bound.
72
Round Robin Scheduling
  • Deficit Round Robin
  • Each connection has a deficit counter
  • Every round the deficit counter is incremented by a quantum
  • If the packet size < counter, the packet is transmitted and the counter is reduced by the packet size (see the sketch below)
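A minimal sketch of the Deficit Round Robin loop described above; the queue and quantum representations are illustrative assumptions.

    from collections import deque

    def drr(queues, quanta, rounds):
        # queues: per-connection FIFOs of packet sizes; quanta: per-connection quantum
        queues = [deque(q) for q in queues]
        deficit = [0] * len(queues)
        sent = []
        for _ in range(rounds):
            for i, q in enumerate(queues):
                if not q:
                    deficit[i] = 0               # an empty queue keeps no credit
                    continue
                deficit[i] += quanta[i]          # one quantum per round
                while q and q[0] <= deficit[i]:  # head-of-line packet fits: send it
                    size = q.popleft()
                    deficit[i] -= size
                    sent.append((i, size))
        return sent

    # Two connections with a 500-byte quantum each
    print(drr([[300, 800, 200], [1200]], [500, 500], rounds=3))
    # [(0, 300), (0, 800), (0, 200), (1, 1200)]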

73
Exact GPS Simulation with Logarithmic Complexity
  • L-GPS simulates GPS with minimum deviation of one
    packet size at O(logN) complexity
  • All other well known schedulers that accomplish the same deviation (e.g., WF2Q) have O(N) complexity
  • Key idea
  • L-GPS pre-computes the evolution of the round
    number function of the simulated GPS service
    using a tree structure

74
Some Sorting Data Structures
  • Heaps
  • Binomial heaps
  • Calendar Queues
  • Van Emde Boas Trees
  • Trees of Comparators
  • CMU Sorting
  • AT&T Sorting
  • Polytechnic Institute Sorting

75
The Tree of Comparators
  • You divide packets into groups
  • You send each group to a stage of comparators
  • You obtain a minimum from each comparator
  • You pass the minima into a second stage of comparators
  • You repeat the process until one packet remains (see the sketch below)
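A software analogue of the tree of comparators: a tournament that repeatedly reduces pairs of time stamps to their minimum until one packet remains. Purely illustrative; hardware would evaluate each stage in parallel.

    def tree_of_comparators(tags):
        level = list(tags)
        while len(level) > 1:
            nxt = []
            for j in range(0, len(level) - 1, 2):
                nxt.append(min(level[j], level[j + 1]))   # one comparator per pair
            if len(level) % 2:
                nxt.append(level[-1])                     # odd element passes through
            level = nxt
        return level[0] if level else None

    print(tree_of_comparators([17, 4, 9, 23, 11]))        # 4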
76
The Sorting Scheme from AT&T
(Figure: connection FIFOs 1 through k feed sorting bins 1 through m, which partition the scheduling horizon.)
77
The Sorting Scheme from Polytechnic Inst.
Brooklyn
Range of time stamp values
Sorting is supported by a hierarchy of bit vectors
78
PART III Sorting Packets by Packet Schedulers
Using the Connected Trie Data Structure
79
Contribution
  • We propose a sorting algorithm and data structure that reduces the latency of making scheduling decisions to a single memory access time
  • Solution is applicable to SCFQ, SFQ
  • Key Observation
  • Increments on packet time stamps are bounded by (maximum packet size)/(minimum weight)
  • This bound is called the scheduling horizon
  • Approach
  • We represent the scheduling horizon as a trie
  • We put state into the nodes of the trie to allow the leaves to be connected into a linked list

80
Trie-based Ordering
trie structure of height h = log(scheduling horizon / region width)
81
Van Emde Boas Trees and the Connected Trie
(Figure: comparison of traversal strategies: linear traversal, binary traversal (Van Emde Boas Tree), and optimal traversal (Connected Trie).)
82
Main Concepts
  • Each node stores
  • a pointer to the rightmost leaf, i.e. the leaf with the highest value from among those found by traversing the left child of the node
  • a pointer to the leftmost leaf, i.e. the leaf with the lowest value from among those found by traversing the right child of the node
  • When a new packet is added into the trie
  • The algorithm discovers the rightmost and leftmost leaves the new packet should be connected to
  • The new packet is inserted into a linked list of leaves
  • Hence the next packet for transmission into the network is found in a single memory access time (a simplified sketch follows below)
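A simplified illustration of the idea, not the authors' exact data structure: a fixed-height binary trie over time stamp values whose internal nodes remember the largest leaf under their left child and the smallest leaf under their right child, so a new leaf can be spliced directly into a sorted doubly linked list of pending packets. All names below are assumptions made for this sketch.

    class Leaf:
        def __init__(self, key):
            self.key, self.prev, self.next = key, None, None

    class ConnectedTrie:
        def __init__(self, height):
            self.h = height
            self.max_left = {}     # node -> largest leaf key under its 0-child
            self.min_right = {}    # node -> smallest leaf key under its 1-child
            self.leaves = {}       # key -> Leaf
            self.head = None       # leaf with the smallest key

        def insert(self, key):
            if key in self.leaves:
                return             # duplicate handling omitted in this sketch
            pred = succ = None
            node = (0, 0)          # (depth, path prefix)
            for depth in range(self.h):
                bit = (key >> (self.h - 1 - depth)) & 1
                if bit:            # going right: every leaf on the left is smaller
                    ml = self.max_left.get(node)
                    if ml is not None:
                        pred = ml if pred is None else max(pred, ml)
                    self.min_right[node] = min(key, self.min_right.get(node, key))
                else:              # going left: every leaf on the right is larger
                    mr = self.min_right.get(node)
                    if mr is not None:
                        succ = mr if succ is None else min(succ, mr)
                    self.max_left[node] = max(key, self.max_left.get(node, key))
                node = (depth + 1, (node[1] << 1) | bit)
            leaf = Leaf(key)
            self.leaves[key] = leaf
            leaf.prev = self.leaves.get(pred)    # splice between predecessor
            leaf.next = self.leaves.get(succ)    # and successor leaves
            if leaf.prev: leaf.prev.next = leaf
            if leaf.next: leaf.next.prev = leaf
            if pred is None:
                self.head = leaf                 # new minimum

        def pop_min(self):
            # Scheduling decision: one access to the head of the linked list.
            # (Clearing the trie's per-node state on removal is omitted here.)
            leaf, self.head = self.head, self.head.next if self.head else None
            if self.head: self.head.prev = None
            return leaf.key if leaf else None

    t = ConnectedTrie(height=4)
    for ts in [13, 5, 10, 8]:      # same insertion order as the example slides
        t.insert(ts)
    print([t.pop_min() for _ in range(4)])       # [5, 8, 10, 13]

With these insertions the root ends up with the pointer pair (5, 8) and its children with (-∞, 5) and (10, 13), matching the figures on the following slides.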

83
Example Root only
(Figure: the trie initially contains only the root R, whose pointer pair is (-∞, +∞).)
84
Example Adding 13
85
Example Adding 5
86
Example Adding 10
87
Example Adding 8
(Figure: the trie after inserting 13, 5, 10, and 8, drawn over 4-bit keys. Each internal node stores a pointer pair (largest leaf under its left child, smallest leaf under its right child): the root R holds (5, 8) and its children hold (-∞, 5) and (10, 13). The leaves 5, 8, 10, and 13 are connected into a sorted linked list terminated by NULL at both ends.)
88
The Node Traversal Algorithm
(Figure: flowchart of the steps performed when visiting a node.)
89
Using Two Tries at a Time
  • Why two tries at a time?
  • Let's assume D is the scheduling horizon
  • During the transmission of a packet, new packets will be associated with time stamp increments of at most D.
  • During the transmission of these packets, new packets will be associated with time stamp increments of at most 2D.

90
Optimal Height of the Connected Trie
optimal height = log( (maximum packet size / minimum weight) × (least common multiple of weights / greatest common divisor of packet sizes) )
91
Performance
  • Scheduling decision time
  • Exactly 1 memory access independent of the number
    of flows in the scheduler
  • Insertion time
  • 6 read accesses, 7 write accesses for a trie of
    height 12.
  • Memory access bandwidth
  • 10 words per read access, 6 words per write
    access for a trie of height 12
  • Memory requirement
  • 34KB for 256 connections, 213 KB for 64K
    connections, for a trie of height 12

92
PART IV Building a Four Level, OC-48,
Programmable Hierarchical Packet Scheduler
93
Contribution
  • Efficient implementation of hierarchical scheduling on the IXP2xxx series and the next generation processors
  • Support for OC-48 on IXP28xx
  • Budget of 228 cycles/packet for scheduling only!
  • Support for multiple levels of hierarchy, up to 5 levels
  • Support for a total of < 256K input queues with arbitrary weights at each level
  • Any possible configuration of the hierarchies should be supported

94
Examples of Schedulers
95
Approach
  • Supporting many single level schedulers (e.g.,
    64K) using a limited number of microengines and
    threads
  • We assign a small number of threads to serve each
    entire level of the hierarchy as opposed to a
    single scheduler only!
  • packets are exchanged between levels at line rate
  • Bandwidth sharing takes place at different levels
    in parallel!
  • Sorting of Packets
  • We use the connected trie and tree of comparators
    structures

96
Parallelizing the Hierarchical Scheduler
  • We assign a small number of threads to serve each
    level of the hierarchy
  • We can do that because packets are exchanged
    between levels at line rate
  • We insert pre-sorted packets at each level
  • We can do that because hierarchical schedulers
    consist of independent single level schedulers
  • We parallelize the dequeue processing at each of
    the levels
  • A dequeue thread creates a hole (i.e., empty
    packet space)
  • A hole filling thread at the next level fills the
    hole by inserting a new packet

97
High Level Design from First Order Principles
Fact, Assumption, or Design Principle → Consequence → High-Level Design Guideline
  • A hierarchical scheduler consists of single-level schedulers → we can insert presorted packets at each level → we address the sorting problem locally at each single-level scheduler
  • Packets need to be exchanged between levels at line rate → we can assign a small number of threads to serve each entire level of the hierarchy → the levels of the hierarchy can operate in parallel, independently of each other
  • Each packet transmitted at a level creates an empty packet space, which we call a hole → to fill a hole you may need to perform at least one SRAM access, which can be as large as 300 compute cycles → we need to buffer more than one packet at each level of the scheduler
  • The enqueuing process may insert packets into single-level schedulers at any level → enqueuing and hole-filling threads may need to access the same state information concurrently → mutual exclusion techniques are required
  • Calculating a virtual time function is complex → virtual time can be approximated by the finish time of the packet currently in service (SCFQ) → a packet entering a scheduler does not need to preempt the packet currently in service
98
Parallelized Hierarchical Scheduler Illustrated

99
Meeting the OC-48 Line Rate with Buffering
Final Output
  • Why buffering?
  • To cope with the fact that a single SRAM access may take more than 228 cycles
  • Enqueue Process
  • We maintain two minimum tag packets at the output
    of each scheduler
  • Dequeue Process
  • While a hole is filled, the next packet is ready
    to be serviced

100
Hierarchical Scheduler Prototyped
101
Use of the IXP Microengines
102
Remarks
  • Four level OC-48 line rate forwarding (2.5 Gbps)
  • Data structures fit into the local memory and
    SRAM of IXP2400
  • We keep the memory access bandwidth consumption at a reasonable level
  • We fetch either the heads or tails of queues but
    not both at the same time
  • We employ novel inter-thread synchronization
    algorithms

103
Tutorial Summary and Conclusion
104
Tutorial Summary (I)
  • Packet scheduling
  • Critical component of router datapaths
  • Generalized Processor Sharing
  • Ideal service (non-implementable)
  • Real implementations
  • Annotate packets with time stamps
  • Sort packets according to their time stamp values

105
Tutorial Summary (II)
  • We propose a new algorithm for sorting packets
  • Reduces the latency of making scheduling
    decisions to a single memory access time
  • Parallelized Processor Architectures like IXP2xxx
    are suitable for implementing packet scheduling
    in software

106
Future Work
  • Use the Connected Trie for implementing
    disciplines other than SFQ, SCFQ
  • Example L-GPS?
  • Use the Connected Trie in multi-level
    hierarchical scheduler configurations

107
Thanks for Listening
108
References
  • Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin, "Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing," Technical Report, Communications Technology Lab, Intel Corporation. In submission to Computer Networks.
  • Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, "Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure," Technical Report, Communications Technology Lab, Intel Corporation. In submission to Software Practice and Experience.
  • Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, "A Four Level OC-48 Programmable Hierarchical Packet Scheduler," Technical Report, Communications Technology Lab, Intel Corporation.