Title: Scheduling algorithms for input-queued IP routers
1Scheduling algorithms for input-queued IP routers
Emilio Leonardi, in collaboration with P. Giaccone, M. Ajmone Marsan, A. Bianco, M. Mellia, F. Neri
- Dipartimento di Elettronica
- Telecommunication Network Group
- http://www.tlc-networks.polito.it
- Politecnico di Torino (Italy)
Budapest, March 2006
2Outline
- IP routers
- OQ routers
- IQ routers
- Scheduling
- Optimal algorithms
- Heuristic algorithms
- Packet-mode algorithms
- Networks of routers
- CIOQ routers
- Multicast traffic
- Conclusions
3Note
- The slides marked RWP are reproduced with permission of Prof. Nick McKeown from the Electrical Engineering and Computer Science Dept. of Stanford University (CA, USA)
5The Internet is a mesh of routers
core router
access router
enterprise router
6The Internet is a mesh of routers
- Access router
- high number of ports at low speed (kbps/Mbps)
- several access protocols (modem, ADSL, cable)
- Enterprise router
- medium number of ports at high speed (Mbps)
- several services (IP classification, filtering)
- Core router
- moderate number of ports at very high speed (Mbps/Gbps)
- very high throughput
7Basic functions
- Routing
- computation of the output port of an incoming packet
- uses the routing tables computed by the routing protocols
- can be a complex procedure
- very large routing tables
- dynamic variation of routes in the Internet
8Basic functions
- Switching
- transfer of packets from input ports to output ports
- resolution of the contentions for output ports
- queueing
- where to store
- scheduling
- what to transfer
9Faster and faster
- Need for high performance routers
- to accommodate the bandwidth demands of new users and new services
- to support QoS
- to reduce costs
10Packet processing and link speed
- Increase of electronic packet processing power cannot accommodate the increase in link speed
[Figure: packet processing power (Moore's law, 2x / 18 months) versus link speed (fiber capacity in Gbit/s, 2x / 7 months), 1985-2000, log scale from 0.1 to 10000, with the TDM and DWDM eras marked. Source: SPEC95Int, David Miller, Stanford.]
RWP
11Memory access time
RWP
12Moore's law
- It's hard to keep up with Moore's law
- the bottleneck is memory speed
- Moore's law is too slow
- routers need to improve faster than Moore's law
RWP
13Router capacity exceeds Moore's law
- Growth in capacity of commercial routers
- 1992: 2 Gb/s
- 1995: 10 Gb/s
- 1998: 40 Gb/s
- 2001: 160 Gb/s
- 2003: 640 Gb/s
- Average growth rate: 2.2x / 18 months
RWP
14Single packet processing
- The time to process one packet is becoming shorter and shorter
- worst case: 40-byte packets (ACKs) travelling over the Internet
- 3.2 µs at 100 Mbps
- 320 ns at 1 Gbps
- 32 ns at 10 Gbps
- 3.2 ns at 100 Gbps
- 320 ps at 1 Tbps
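The figures on this slide follow from dividing the packet size in bits by the link rate. A minimal sketch of that arithmetic, assuming back-to-back 40-byte packets and ignoring framing overhead (the function name `packet_time_s` is only illustrative):

```python
# Minimum-size packet assumed on the slide: 40 bytes (e.g. a TCP ACK).
PACKET_BITS = 40 * 8

def packet_time_s(link_bps, bits=PACKET_BITS):
    """Seconds available to process one packet when packets arrive back to back."""
    return bits / link_bps

for label, bps in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9),
                   ("100 Gbps", 100e9), ("1 Tbps", 1e12)]:
    # Reported in nanoseconds for easy comparison across rates.
    print(f"{label:>8}: {packet_time_s(bps) * 1e9:g} ns")
```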
15Hardware architecture
physical structure
logical structure
16Hardware architecture
- Main elements
- line cards
- support input/output transmissions
- store packets
- adapt packets to the internal format of the switching fabric
- support data link protocols
- classify packets
- schedule packets
- support security
- switching fabric
- transfers packets from input ports to output ports
17Hardware architecture
- Main elements
- control processor/network processor
- runs routing protocols
- computes routing tables
- manages the overall system
- forwarding engines
- compute the packet destination (lookup)
- inspect packet headers
- rewrite packet headers
18Interconnections among main elements - I
19Interconnections among main elements - II
20Cell-based routers
[Figure: at ports 1..N, packets are segmented into cells by the ISMs, cross the cell switch (fabric), and are reassembled into packets by the ORMs]
- ISM: Input-Segmentation Module
- ORM: Output-Reassembly Module
- packet: variable-size data unit
- cell: fixed-size data unit
21Switching fabric
- Our assumptions
- bufferless
- to reduce internal hardware complexity
- non-blocking
- it is always possible to transfer in parallel
from input to output ports any non-conflicting
set of cells
22Switching fabric
- Examples
- crossbar
- rearrangeable Clos network
- Benes network
- Batcher-Banyan network (self-routing)
- Switching constraints
- at most one cell for each input and for each
output can be transferred
23Switching fabric
- We do not discuss switching fabrics with internal buffers
- e.g. crossbars with a buffer at each crosspoint
24Generic switching architecture
output queues
input queues
25Speedup
- The speedup determines the switch performance
- $S_{in}$: reading speed from input queues
- $S_{out}$: writing speed to output queues
- maximum speedup factor
- $S = \max(S_{in}, S_{out})$
26Performance comparison
- The performance of different switching systems can be studied
- with analytical models
- introducing simplifying assumptions, but obtaining general results
- with simulation models
- obtaining more detailed results
27Traffic description
- $A_{ij}(n) = 1$ if a packet arrives at time $n$ at input $i$, with destination reachable through output $j$
- $\lambda_{ij} = E[A_{ij}(n)]$
- An arrival process is admissible if
- $\sum_i \lambda_{ij} \le 1$
- $\sum_j \lambda_{ij} \le 1$
- that is, no input and no output are overloaded on average
- note that OQ switches exhibit finite delays only for admissible traffic
- traffic matrix: $\Lambda = [\lambda_{ij}]$
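The admissibility condition amounts to checking every row sum and every column sum of the traffic matrix. A minimal sketch, assuming the matrix is a list of lists with `rates[i][j]` the arrival rate from input i to output j; the helper `uniform_traffic` (equal rates, load/N per queue) is an illustrative assumption, not taken from the slides:

```python
def is_admissible(rates, tol=1e-12):
    """True if no input (row) and no output (column) is overloaded on average."""
    n = len(rates)
    rows_ok = all(sum(rates[i][j] for j in range(n)) <= 1 + tol for i in range(n))
    cols_ok = all(sum(rates[i][j] for i in range(n)) <= 1 + tol for j in range(n))
    return rows_ok and cols_ok

def uniform_traffic(n, load):
    """Hypothetical helper: every input spreads its load evenly over all outputs."""
    return [[load / n] * n for _ in range(n)]

print(is_admissible(uniform_traffic(4, 0.9)))      # each row and column sums to 0.9
print(is_admissible([[0.6, 0.3], [0.6, 0.3]]))     # output 0 receives 1.2 on average
```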
28Traffic scenarios
- Uniform traffic
- Bernoulli i.i.d. arrivals
- usual testbed in the literature
- easy to schedule
- Diagonal traffic
- Bernoulli i.i.d. arrivals
- critical to schedule, since only two matchings are good
29Traffic scenarios
- LogDiagonal traffic
- Bernoulli i.i.d. arrivals
- more critical than uniform, less than diagonal traffic
31Output Queued (OQ) switches
- $S_{in} = 1$, $S_{out} = N$
- used for low bandwidth routers
- no coordination among ports
- work-conserving
- best average delays
- complete control of delays
- support of QoS scheduling
32Output Queued (OQ) switch
33OQ performance
[Figure: performance of the OQ switch under uniform traffic]
Note: OQ is optimal from the point of view of average delay and throughput
35Simple Input Queued (IQ) switches
- $S_{in} = 1$, $S_{out} = 1$
- 1 FIFO queue for each input port
- throughput limitations
- due to head of the line (HOL) blocking
- scheduling
- to solve contentions for the same output
36Head of the Line (HOL) Blocking
RWP
37Simple IQ switch performance
[Figure: simple IQ switch compared with OQ under uniform traffic]
38Improving simple IQ switches
- Window/bypass schedulers
- the first w cells of each queue contend for outputs
- HOL blocking is reduced, not eliminated
- w = 1 means FIFO at each input
- higher complexity
- the scheduler deals with wN cells
- non-FIFO queues
39Improving IQ switches
- Virtual output queueing (VOQ)
- one queue for each input/output pair
- N queues at each input
- $N^2$ queues in the whole switch
- eliminates HOL blocking
- used in high-bandwidth routers
- scheduling implemented in hardware at very high
speed
40IQ switches with VOQ
Note: from now on, we always assume VOQ at the switch inputs
42Scheduling in IQ switches
- Scheduling can be modeled as a matching problem in a bipartite graph
- the edge from node i to node j refers to packets at input i and directed to output j
- the weight of the edge can be
- binary (not empty/empty queue)
- queue length
- HOL cell waiting time, or cell age
- some other metric indicating the priority of the HOL cell to be served
43Scheduling in IQ switches
Request Graph
Matching (or Permutation)
inputs
outputs
scheduler
44Scheduling in IQ switches
- Request matrix (rows: inputs, columns: outputs)
3 5 0 0
2 0 0 4
4 5 0 0
0 0 8 2
- Permutation
0 1 0 0
0 0 0 1
1 0 0 0
0 0 1 0
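As a sanity check on this slide, the 4x4 request matrix is small enough to search all N! permutations exhaustively. This is a sketch only: brute force is an illustrative choice, not how a hardware scheduler would work, and `mwm_bruteforce` is a hypothetical name.

```python
from itertools import permutations

# Request matrix from the slide: X[i][j] is the weight of the edge input i -> output j.
X = [[3, 5, 0, 0],
     [2, 0, 0, 4],
     [4, 5, 0, 0],
     [0, 0, 8, 2]]

def mwm_bruteforce(weights):
    """Return (perm, weight), where perm[i] is the output matched to input i."""
    n = len(weights)
    def total(p):
        return sum(weights[i][p[i]] for i in range(n))
    best = max(permutations(range(n)), key=total)
    return best, total(best)

perm, w = mwm_bruteforce(X)
print(perm, w)  # 0-based outputs for inputs 0..3, and the total weight
```

With 0-based indices, the heaviest permutation is (1, 3, 0, 2) with weight 21, which corresponds to the permutation matrix shown on the slide.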
45Implementing schedulers
- Scheduling is a complex task
- a scheduling algorithm can be implemented in hardware if
- it shows good performance for a wide range of traffic patterns
- it can be efficiently parallelized
- it can be efficiently pipelined
- it requires few iterations (or clock cycles)
- it requires limited control information
46Scheduling uniform traffic
- A number of algorithms give 100% throughput when traffic is uniform
- For example
- TDM and a few variants
- iSLIP (see later)
[Figure: example of TDM for a 4x4 switch]
RWP
47Birkhoff - von Neumann theorem
- Any doubly stochastic matrix $\Lambda$ can be expressed as a convex combination of permutation matrices $\pi_n$
- $\Lambda = \sum_n a_n \pi_n$
- with
- $a_n \ge 0$
- $\sum_n a_n = 1$
48Scheduling non-uniform traffic
- Thanks to the Birkhoff - von Neumann theorem
- If the traffic is known and admissible, 100% throughput can be achieved by a TDM using
- for a fraction of time $a_1$ matching $M_1$ ($\pi_1$)
- for a fraction of time $a_2$ matching $M_2$ ($\pi_2$)
- for a fraction of time $a_k$ matching $M_k$ ($\pi_k$)
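The decomposition behind this TDM schedule can be computed greedily: find a perfect matching on the positive entries, subtract the smallest entry along it, and repeat. The sketch below assumes an exactly doubly stochastic input and uses exact fractions; the function names are illustrative, and the augmenting-path matching is one standard choice, not necessarily the one the authors had in mind.

```python
from fractions import Fraction

def perfect_matching(support):
    """Augmenting-path bipartite matching; support[i][j] is True where an edge exists."""
    n = len(support)
    input_of_output = [-1] * n

    def try_input(i, seen):
        for j in range(n):
            if support[i][j] and not seen[j]:
                seen[j] = True
                if input_of_output[j] == -1 or try_input(input_of_output[j], seen):
                    input_of_output[j] = i
                    return True
        return False

    for i in range(n):
        if not try_input(i, [False] * n):
            raise ValueError("no perfect matching: input is not doubly stochastic")
    perm = [0] * n
    for j, i in enumerate(input_of_output):
        perm[i] = j
    return perm

def birkhoff_von_neumann(matrix):
    """Return a list of (a_n, perm_n) with sum a_n = 1 and sum a_n * perm_n = matrix."""
    n = len(matrix)
    m = [[Fraction(x) for x in row] for row in matrix]
    terms = []
    while any(x > 0 for row in m for x in row):
        perm = perfect_matching([[x > 0 for x in row] for row in m])
        a = min(m[i][perm[i]] for i in range(n))
        for i in range(n):
            m[i][perm[i]] -= a      # at least one entry drops to zero each round
        terms.append((a, perm))
    return terms

half, quarter = Fraction(1, 2), Fraction(1, 4)
example = [[half, half, 0], [quarter, quarter, half], [quarter, quarter, half]]
for a, perm in birkhoff_von_neumann(example):
    print(a, perm)  # use matching perm for a fraction a of the time slots
```

Each returned pair gives a matching and the fraction of the TDM frame during which it should be used.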
50Maximum Size Matching
- Maximum Size Matching (MSM)
- among all the possible matchings, selects the one with the highest number of edges
- MSM is generally not unique
- the best MSM algorithm requires $O(N^{2.5})$ iterations, and cannot be implemented efficiently, since it is based on a flow augmentation path algorithm
51Instability of MSM
- Assume
- P(arrival at $Q_{12}$) = $\lambda$
- P(arrival at $Q_{11}$) = P(arrival at $Q_{22}$) = $1-\lambda-\delta$
- initially $Q_{12} = B > 0$ and $Q_{11} = Q_{22} = 0$
- in case of parity, serve $Q_{11}$ and/or $Q_{22}$ instead of $Q_{12}$
- Observe
- $Q_{12}$ is served only when $A_{11} = 0$ and $A_{22} = 0$, i.e. with probability
- P(serve $Q_{12}$) = P(no arrivals at both $Q_{11}$ and $Q_{22}$) = $[1-(1-\lambda-\delta)]^2 = (\lambda+\delta)^2$
- P(serve $Q_{12}$) < P(arrival at $Q_{12}$) if $\delta$ is small enough
- Example: $\lambda = 0.5$, $\delta = 0.1$ gives P(serve $Q_{12}$) = 0.36 < 0.5
Note: this proof is due to I. Keslassy, Stanford Univ.
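The argument on this slide can be checked with a short simulation of the 2x2 switch. This is a sketch under stated assumptions: arrivals occur before service in each slot, Q21 never receives traffic, all queues start empty (the slide starts Q12 at a backlog B), and the names `lam`, `delta` and `simulate_msm_q12` are illustrative.

```python
import random

def simulate_msm_q12(lam=0.5, delta=0.1, slots=10_000, seed=1):
    """Return the final length of Q12 under MSM with ties broken against Q12."""
    rng = random.Random(seed)
    p_diag = 1 - lam - delta          # arrival probability at Q11 and at Q22
    q11 = q12 = q22 = 0
    for _ in range(slots):
        q11 += rng.random() < p_diag
        q22 += rng.random() < p_diag
        q12 += rng.random() < lam
        # Q21 is always empty, so the matching containing Q12 has size 1 at most.
        # It wins only when both Q11 and Q22 are empty; ties go to Q11/Q22.
        if q11 > 0 or q22 > 0:
            q11 = max(q11 - 1, 0)
            q22 = max(q22 - 1, 0)
        elif q12 > 0:
            q12 -= 1
    return q12

print(simulate_msm_q12())                      # backlog keeps growing: 0.36 < 0.5
print(simulate_msm_q12(lam=0.2, delta=0.5))    # stays small: 0.49 > 0.2
```

With the slide's example values, Q12 is served in roughly 36% of the slots while cells arrive in 50% of them, so its backlog grows roughly linearly with time even though the traffic is admissible.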
52Maximum Size Matching
- MSM maximizes the instantaneous throughput
- MSM may not yield 100% throughput
- short-term decisions can be inefficient in the long term
- non-binary edge weights allow MWM to maximize the long-term throughput
53Maximum Weight Matching
- Maximum Weight Matching (MWM)
- among all the possible N! matchings, selects the one with the highest weight (sum of the edge metrics)
- MWM is generally not unique
- MWM is too complex to be implemented in hardware at high speed
- the best MWM algorithm requires $O(N^3)$ iterations, and cannot be implemented efficiently, since it is based on a flow augmentation path algorithm
- cannot be parallelized and pipelined efficiently
- MWM has never been implemented in a commercial chipset
54Maximum Weight Matching
- In case of unknown traffic, MWM is the optimal solution of the scheduling problem when the weight is either the queue length or the cell age
- achieves 100% throughput under any admissible traffic
- also under non-Bernoulli arrival processes satisfying the law of large numbers
- achieves low average delays, very close to those of OQ switches
- possible starvation for lightly loaded packet flows
56MWM with pipeline and latency
- Let T and P be fixed
- $D_t$ denotes the matching used at time $t$
- The following variations of MWM also achieve 100% throughput
- $D_t = \mathrm{MWM}(t-P)$: MWM with pipeline degree P
- $D_t = \mathrm{MWM}(\lceil t/T \rceil T)$: MWM with latency T
- combinations of both
- thus, it seems easy to achieve 100% throughput!
57MWM with pipeline and latency
- But
- What about throughput?
- 100% throughput
- but needs the computation of a MWM
- What about delays?
- delays can be really bad!
58General consideration
- When scheduling in IQ switches, it is very difficult to achieve simultaneously
- high throughput
- low delay
- limited implementation complexity
59Uniform traffic
- MWM and MSM behave almost identically
[Figure: mean delay (log scale) versus normalized load for MWM and MSM under uniform traffic]
60LogDiagonal traffic
- MSM is somewhat inferior to MWM
[Figure: mean delay (log scale) versus normalized load for MWM and MSM under LogDiagonal traffic]
61Diagonal traffic
- MSM yields much longer delays than MWM at medium/high loads
[Figure: mean delay (log scale) versus normalized load for MWM and MSM under diagonal traffic]
63Approximations of MSM and MWM
- Motivation
- strong interest in scheduling algorithms with
- very low complexity
- high performance
- Usually
- implementable schedulers (low complexity)
- → low throughput, long delays
- theoretical schedulers (high complexity)
- → high throughput, short delays
64Some implementable algorithms
- Approximate MSM
- WFA, iSLIP, 2DRR, RC, FIRM and many others
- Approximate MWM with $w_{ij} = X_{ij}$ (queue length)
- iLQF, RPA, learning algorithms
- Approximate MWM with $w_{ij}$ = cell age
- iOCF
- Approximate MWM with $w_{ij} = \sum_i X_{ij} + \sum_j X_{ij}$
- iLPF, MUCS
65APPROXIMATIONS OF MAXIMUM SIZE MATCHING
66Wave Front Arbiter
[Figure: 4x4 example of requests and the resulting match]
RWP
67Wave Front Arbiter
- 2N-1 steps
[Figure: requests and resulting match]
RWP
68Wrapped Wave Front Arbiter
- N steps instead of 2N-1
[Figure: requests and resulting match]
RWP
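The plain (unwrapped) wave front arbiter can be sketched in a few lines: the request matrix is scanned one anti-diagonal at a time, and a request is granted if its row and column are still free. The code below is an illustrative software model with a fixed priority starting at the top-left cell, which is an assumption; real arbiters rotate the priority for fairness.

```python
def wave_front_arbiter(requests):
    """requests[i][j] is truthy if input i has a cell for output j.
    Returns the list of granted (input, output) pairs."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    match = []
    for d in range(2 * n - 1):                      # 2N-1 waves (anti-diagonals)
        # Cells on one anti-diagonal share no row or column,
        # so hardware can evaluate them in parallel.
        for i in range(max(0, d - n + 1), min(n, d + 1)):
            j = d - i
            if requests[i][j] and row_free[i] and col_free[j]:
                row_free[i] = col_free[j] = False
                match.append((i, j))
    return match

requests = [[1, 1, 0, 0],
            [1, 0, 0, 1],
            [1, 1, 0, 0],
            [0, 0, 1, 1]]
print(wave_front_arbiter(requests))
```

The wrapped variant covers the same cells with N wrapped diagonals instead of 2N-1 plain ones, which is where the saving in steps comes from.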
69iSLIP
- iSLIP means iterative SLIP
- iterates among the following 3 phases
- Request
- Grant
- Accept
70iSLIP
- 3 phases
- Request (from inputs to outputs)
- each unmatched input sends a request to every output for which it has a cell
- Grant (from outputs to inputs)
- if an unmatched output receives requests, it sends a grant to one of the inputs
- contentions solved by a round-robin mechanism
- Accept (from inputs to outputs)
- if an unmatched input receives grants, it selects a single output and becomes matched to it
- contentions solved by a round-robin mechanism
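The three phases can be sketched as follows. The slides do not spell out how the round-robin pointers move; the sketch assumes the rule from the published iSLIP description (a pointer advances to one position past the matched port, and only for matches made in the first iteration). Function and variable names are illustrative.

```python
def islip(requests, grant_ptr, accept_ptr, iterations=1):
    """One time slot of iSLIP. requests[i][j] is truthy if VOQ (i, j) is non-empty.
    grant_ptr[j] and accept_ptr[i] are round-robin pointers, updated in place.
    Returns in_match, where in_match[i] is the output matched to input i (or None)."""
    n = len(requests)
    in_match = [None] * n
    out_match = [None] * n
    for it in range(iterations):
        # Request + Grant: each unmatched output grants the requesting
        # unmatched input closest to its pointer in round-robin order.
        grants = {}
        for j in range(n):
            if out_match[j] is not None:
                continue
            reqs = [i for i in range(n) if in_match[i] is None and requests[i][j]]
            if reqs:
                i = min(reqs, key=lambda x: (x - grant_ptr[j]) % n)
                grants.setdefault(i, []).append(j)
        # Accept: each input accepts the granting output closest to its pointer.
        for i, outs in grants.items():
            j = min(outs, key=lambda y: (y - accept_ptr[i]) % n)
            in_match[i], out_match[j] = j, i
            if it == 0:                       # assumed pointer-update rule
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
    return in_match

g, a = [0, 0], [0, 0]
reqs = [[1, 1], [1, 0]]
print(islip(reqs, g, a))   # first slot: both outputs grant input 0
print(islip(reqs, g, a))   # second slot: pointers have desynchronized
```

In the two-slot usage above, the pointer update moves output 0 away from input 0, so in the second slot both inputs are matched; this desynchronization is what lets iSLIP behave like TDM under uniform load.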
71iSLIP
- The round robin mechanism in iSLIP is designed so
that, under uniform traffic, iSLIP emulates a
dynamic TDM scheduler synchronized on the arrival
pattern
72iSLIP
- iSLIP is maximal
- often, with log N iterations
- always, with N iterations
- iSLIP was implemented on one chip in the Cisco 12000 router
- http://www.cisco.com/warp/public/cc/pd/rt/12000/tech/fasts_wp.pdf
73iSLIP
iSLIP demo from http://tiny-tera.stanford.edu/tiny-tera/demos/index.html
74APPROXIMATIONS OF MAXIMUM WEIGHT MATCHING
75iLQF
- iLQF means iterative Longest Queue First
- iterates among the following 3 phases
- Request
- Grant
- Accept
76iLQF
- 3 phases
- Request (from inputs to outputs)
- each unmatched input sends all its queue lengths as requests to the corresponding outputs
- Grant (from outputs to inputs)
- if an unmatched output receives requests, it sends a grant to the input corresponding to the longest queue
- contentions solved by random choice
- Accept (from inputs to outputs)
- if an unmatched input receives grants, it selects the output with the longest queue
- contentions solved by random choice
77iLQF
- iLQF is maximal
- often, with log N iterations
- always, with N iterations
- iLQF is robust to non-uniform traffic
78iLQF
iLQF demo from http://tiny-tera.stanford.edu/tiny-tera/demos/index.html
79RPA
- RPA means Reservation with Preemption and Acknowledgment
- Two phases
- Reservation (possibly preemptive)
- Acknowledgement
- Sequential accesses to a reservation vector
- $Urg_j$ (if set) is the urgency of the transfer from input $In_j$ to output $j$
[Figure: vector Res]
80RPA
- Vector Res is sequentially accessed by all inputs
[Figure: inputs 1 to 4 accessing vector Res one after the other]
81RPA
- Initially, at each round $Urg_j = 0$ for all $j$
- Reservation phase
- when input $i$ accesses Res
- it computes $W_j = X_{ij} - Urg_j$ for all $j$
- finds $j^*$ such that $W_{j^*} = \max_j W_j$
- if $W_{j^*} > 0$,
- → reserve output $j^*$ and set $Urg_{j^*} = X_{ij^*}$, possibly overwriting the previous reservation
- otherwise,
- → leave the current reservation
82RPA
- Acknowledgement phase
- if input i still finds its reservation at output j,
- → books output j
- otherwise,
- → chooses an unreserved output j and books output j
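The two phases of RPA can be modeled sequentially in software. This is a sketch under several assumptions not stated on the slides: the urgency is the queue length $X_{ij}$, the gain is the difference $X_{ij} - Urg_j$, and in the acknowledgement phase a preempted input falls back to its longest non-empty queue among the unreserved outputs. The function name `rpa` is illustrative.

```python
def rpa(X):
    """X[i][j] is the queue length of VOQ (i, j).
    Returns match, where match[i] is the output booked by input i (or None)."""
    n = len(X)
    urg = [0] * n          # Urg_j: urgency of the current reservation on output j
    res = [None] * n       # Res[j]: input currently holding output j
    wanted = [None] * n    # output reserved by input i (it may be preempted later)

    # Reservation phase: inputs access Res one after the other.
    for i in range(n):
        w = [X[i][j] - urg[j] for j in range(n)]
        j = max(range(n), key=lambda k: w[k])
        if w[j] > 0:                       # preempt only with a strictly higher urgency
            res[j], urg[j], wanted[i] = i, X[i][j], j

    # Acknowledgement phase: confirm surviving reservations, otherwise fall back.
    match = [None] * n
    for i in range(n):
        j = wanted[i]
        if j is not None and res[j] == i:
            match[i] = j
            continue
        free = [k for k in range(n) if res[k] is None and X[i][k] > 0]
        if free:                           # assumed fallback: longest non-empty queue
            j = max(free, key=lambda k: X[i][k])
            res[j] = i
            match[i] = j
    return match

print(rpa([[5, 2], [9, 1]]))  # input 1 preempts output 0, input 0 falls back to output 1
```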
83Uniform traffic
- comparison between MWM, iSLIP, iLQF, and RPA
[Figure: mean delay (log scale) versus normalized load for MWM, iSLIP, iLQF and RPA under uniform traffic]
84LogDiagonal traffic
- iSLIP saturates close to 84% throughput
[Figure: mean delay (log scale) versus normalized load for MWM, iSLIP, iLQF and RPA under LogDiagonal traffic]
85Diagonal traffic
- RPA achieves 98% throughput, iLQF 87%, iSLIP 83%
[Figure: mean delay (log scale) versus normalized load for MWM, iSLIP, iLQF and RPA under diagonal traffic]
86LEARNING ALGORITHMS
87Learning algorithms
- Goal
- find a good compromise among
- throughput, delay and complexity
88Learning algorithms
- Key observation
- the matchings generated by MWM show limited changes from one time to another
- remembering the matching from the past simplifies the computation of the new matching
- the search implemented by MWM can be enhanced
- with a randomized approach
- by observing arrivals
- by searching in parallel
- based on an extension of randomized scheduling algorithms
89Simple Randomized Schemes
- Choose a matching at random and use it as the schedule
- doesn't yield 100% throughput
- Choose 2 matchings at random and use the heavier one as the schedule
- Choose N matchings at random and use the heaviest one as the schedule
- → None of these can give 100% throughput!
90Simple randomized algorithms
32x32
91Bounds on Maximum Throughput
92Tassiulas scheme
- Consider the following policy
- $R_t$: matching picked at random (uniformly) among all the possible N! matchings
- $D_t = \arg\max_{M \in \{D_{t-1}, R_t\}} W(M)$
- Complexity is very low
- O(1) iterations
- easy to pipeline
- Yields 100% throughput!
- note: the boost in throughput is due to memory of the past matching $D_{t-1}$
- However, delays are very large
32x32
94Learning approach
- Properties of COMP1
- $W(D_t) \ge W(D_{t-1})$
- $W(D_t) \ge W(M_t)$
- Examples
- COMP1 is the MAX among $D_{t-1}$ and $M_t$
- COMP1 is the MERGE among $D_{t-1}$ and $M_t$
95MERGE procedure
[Figure: example of the MERGE procedure, producing the matching M with W(M) = 13]
Merging
Emulating MWM is O(N)
96The learning approach
- Properties of $M_t$
- informally, $M_t$ should be a good sample in the space of all possible matchings
- Examples
- $M_t$ is a matching picked uniformly at random
- $M_t$ is a matching picked non-uniformly at random, with a high probability of being heavy
- $M_t$ is derived from the arrival vector $A_t$
- $M_t$ is a good neighbor of $D_{t-1}$
97Theoretical properties
- Stability
- 100% throughput under any admissible Bernoulli traffic pattern
- Delay
- the higher the weight of $M_t$, the shorter the queues, and hence the smaller the delays
98Example of practical implementation
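- Exploiting parallel search
[Figure: the neighbors of $D_{t-1}$ (up to the K-th), $D_{t-1}$ itself, and a matching $M_t$ derived from the arrivals $A_t$ are compared through MAX blocks to produce $D_t$]
- This scheme is called APSARA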
99What is a neighbor of a matching?
[Figure: a matching $D_{t-1}$ and its 3 neighbors N1, N2, N3]
- Each neighbor
- differs from $D_{t-1}$ in ONLY TWO edges
- can be generated very easily in hardware
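A neighbor in this sense is obtained by letting two inputs swap their outputs, which changes exactly two edges. The sketch below enumerates all such neighbors and picks the heaviest candidate; it leaves out the arrival-derived matching $M_t$ of the full scheme, and the names `neighbors` and `apsara_step` are illustrative.

```python
from itertools import combinations

def neighbors(matching):
    """All matchings obtained by swapping the outputs of two inputs."""
    result = []
    for a, b in combinations(range(len(matching)), 2):
        nb = list(matching)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(nb)
    return result

def apsara_step(prev, X):
    """Heaviest matching among the previous one and its neighbors (simplified)."""
    def w(m):
        return sum(X[i][j] for i, j in enumerate(m))
    return max([list(prev)] + neighbors(prev), key=w)

X = [[3, 5, 0, 0], [2, 0, 0, 4], [4, 5, 0, 0], [0, 0, 8, 2]]
print(len(neighbors([0, 1, 2])))        # a 3x3 matching has 3 neighbors
print(apsara_step([0, 1, 2, 3], X))
```

A hardware implementation would evaluate the neighbors in parallel, which is the point of the scheme; the loop here is only a sequential model of that search.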
100Max-APSARA
- APSARA, as described before, is not maximal
- Max-APSARA is a modified version of APSARA where a maximal size matching algorithm runs on the remaining unmatched inputs/outputs
- e.g., if k inputs/outputs are unmatched,
- run iSLIP with k iterations
- select k random edges among the unmatched inputs/outputs
101APSARA performance
103Routers and switches
- IP routers deal with variable-size packets
- Hardware switching fabrics often deal with fixed-size cells
- Question
- how to integrate a hardware switching fabric within an IP router?
104Router based on an IQ cell switch cell-mode
105Cell-mode scheduling
- Scheduling algorithms work at cell level
- pros
- 100% throughput achievable
- cons
- interleaving of packets at the outputs of the switching fabric
106Router based on an IQ cell switch packet-mode
- NO packet interleaving if packet-mode
[Figure: IQ cell switch with switching fabric and ORMs at outputs 1..N]
107Router based on an IQ cell switch packet-mode
- NO packet interleaving if packet-mode
- ORMs can be removed
[Figure: IQ cell switch with switching fabric and ORMs at outputs 1..N]
108Packet-mode scheduling
- Rule: packets are transferred as trains of cells
- when an input starts transferring the first cell of a packet comprising k cells, it continues to transfer in the following k-1 time slots
- Pros
- no interleaving of packets at the outputs
- easy extension of traditional schedulers
- Cons
- starvation due to long packets
- inherent in packet systems without preemption
- negligible at high line rates
109Packet-mode scheduling
- Questions
- can packet mode provide high throughput? YES!
- what about delays? It depends
110Packet-mode properties
- Main theoretical results
- MWM in packet-mode yields 100% throughput
- Packet mode can provide shorter delays than cell mode, depending on the packet length distribution
111Simulation scenario
- Router with ISMs and ORMs
- Uniform packet traffic
- uniform packet load
- uniform (1,192) packet size distribution
- Spotted packet traffic
- non uniform packet load
- bimodal (3,100) packet size distribution
112Uniform packet traffic
- Packet mode and cell mode reach the same throughput
[Figure: mean packet delay (log scale) versus normalized load for MWM, MSM, iSLIP and iLQF under uniform packet traffic; two panels, cell mode and packet mode]
113Spotted packet traffic
- Packet mode reaches higher throughput than cell mode
[Figure: mean packet delay (log scale) versus normalized load for MWM, MSM, iSLIP and iLQF under spotted packet traffic; two panels, cell mode and packet mode]
114Effect of packet size distribution
- iSLIP delayCM/delayPM for different packet size distributions
[Figure: packet mode gain for iSLIP (delayCM/delayPM) versus normalized load for uniform, exponential, trimodal and bimodal packet size distributions; values above 1 mean packet mode is better, values below 1 mean cell mode is better]
115Packet mode features
- Packet mode scheduling
- is a feasible modification of schedulers
- improves throughput
- but it can generate some unfairness between long and short packets
- inherent to all variable-size packet networks without preemption
- may give better packet delays than cell mode
- depends on the packet size distribution
117Network of IQ routers
- Question
- given a network of IQ switches and an admissible
input traffic, is the network always stable?
118Networks of IQ routers
- Consider the acyclic network of IQ routers in the following slide
- derived from well established results from adversarial queueing theory
- a very specific scenario, but comprises only a few switches
- this situation may not be common, but cannot be excluded in real networks
119Pathological network of IQ switches
Network with 8 switches and 4 flows
120Instability of MWM
- If MWM is adopted at each IQ router, and the
traffic is admissible, the system can be unstable
under Bernoulli i.i.d. arrivals
121Instability of MWM
- MWM is too greedy, in the sense that it can create traffic bursts that are amplified by each scheduler
- A server can be idling when large bursts (directed to it) are blocked because of the contentions upstream
- the problem arises when a packet flow is subject to priority changes along its path through the network
- it is dangerous to increase priority along the path
122Stability in networks of routers
- Global policies
- Oldest in the network and many others
- problem: requires global information about the network, and perfectly synchronized clocks at the ingress of the network
- Local policies
- until now, nothing really satisfying is known (work in progress)
123Stability in networks of routers
- Semi-local policies
- MWM with local information about the router neighbors can achieve 100% throughput under i.i.d. Bernoulli arrivals
- Virtual network queue
- the weights used by MWM are
- $w_{ij} = \max\{0, X_{ij} - H(X_{ij})\}$
- where $H(X_{ij})$ is the size of the upstream queue which is sending packets to $X_{ij}$
125CIOQ routers
VOQ
126CIOQ routers
- Question
- if a low speedup S is allowed (and queues are available at both inputs and outputs), is it possible to design simple scheduling algorithms, capable of achieving high throughput and low delay?
- YES!
127CIOQ routers with S = 2
- If S = 2
- it is easy to obtain 100% throughput
- all maximal matchings work
- based on stable marriage algorithms
- it is less easy to obtain work conservation
- output never idling whenever a packet destined to it is present
- same average delays as OQ
- very good delay performance
- e.g. LOOFA
- it is difficult to perfectly emulate OQ
128LOOFA
- The occupancy $C_j$
- is the number of cells currently residing at the j-th output queue
- at each time slot, it is decremented by one because of departures
- Basic idea of LOOFA
- give priority to output channels with low occupancy, thereby attempting to maintain work-conservation for all outputs
129LOOFA
- If S = 2, during each of the two phases
- each unmatched input selects a non-empty VOQ directed to the unmatched output with the lowest occupancy, and sends a request to that output
- each unmatched output selects one of the requests, and sends a grant to that input
- repeat until the matching is maximal
- the selection at the outputs can be round robin, random, ...
130CIOQ routers with S = 2
- If S = 2
- it is difficult (but possible) to perfectly emulate an OQ router in terms of packet departures
- it is impossible to distinguish, by observing arrivals and departures, if the switching architecture is CIOQ or OQ
- delays are perfectly controlled
- easy to implement scheduling algorithms born for OQ (e.g. WFQ)
131CIOQ routers
- CIOQ are very promising architectures
- many degrees of freedom in design
- how to balance input/output buffers
- how the buffers interact
- e.g., by backpressure mechanisms
- Several currently designed architectures are supposed to be CIOQ
- The speedup S is becoming closer and closer to 1 in practical implementations of new switching architectures (CIOQ → IQ)
133Multicast traffic
- Misleading (but common) idea
- observe
- OQ can achieve 100% throughput under any admissible unicast and multicast traffic
- OQ can be perfectly emulated by CIOQ with S = 2
- then, with S = 2 it is possible to achieve 100% throughput for multicast traffic
134Multicast traffic
- Question
- what is the minimum speedup required to achieve 100% throughput?
- unknown!
135Multicast traffic
- Possible implementations
- copy network before the switching fabric
- a multicast cell with f destinations is treated as f cells
- possible bandwidth inefficiency
- dedicated queue
- multicast packets are treated in some specific way
136Multicast traffic optimal queueing
- MC-VOQ queueing
- best throughput performance
- avoids HOL blocking
- $2^N - 1$ queues for each input, one for each fanout set
- re-enqueuing process → out-of-sequence problem
- no re-enqueuing → some throughput degradation
137Multicast traffic optimal scheduling
- The optimal scheduling for multicast traffic can be defined similarly to unicast traffic
- it is a sort of max flow algorithm on all $N(2^N - 1)$ queues
- Many heuristics can be envisaged to approximate it
138Summary
- 3 main ingredients for IQ scheduling algorithms
- Weight computation
- Matching computation
- Contention resolution
139Summary
- Weight computation
- obtains the priority of each input queue
- the metric can be related to queue length, waiting time of the cell at the HOL, etc.
- Contention resolution
- whenever the selection is among situations with equal weights
- can be round robin, or random
140Summary
- Matching computation
- computes the matching, trying to maximize its total weight
- can be based on
- an iterative search, like in iSLIP, iOCF, iLQF
- a matrix greedy approach, like in MUCS, WFA
- a reservation vector, like in RPA
- a learning approach, like in APSARA
141Summary
- Good IQ scheduling algorithms exist
- 100% throughput
- short delay
- limited complexity
- Performance differences are significant only
close to saturation
142Summary
- Open questions concerning IQ schedulers
- QoS guarantees
- stability of networks of switches
- multicast traffic
143References
- Router functions and architectures
- Keshav S., Sharma R., Issues and trends in
router design'', IEEE Communications Magazine,
vol.36, n.5, May 1998, p.144-151 - Bux W., Denzel W.E., Engbersen T., Herkersdorf
A., Luijten R.P.,Technologies and building
blocks for fast packet forwarding'', IEEE
Communications Magazine, Jan.2001, pp.70-77 - Newman P., Minshall G., Lyon T., Huston L.,IP
switching and gigabit routers'', IEEE
Communications Magazine, Jan.1997, pp.64-69 - Wolf T., Turner J.S., Design issues for
high-performance active routers'', IEEE Journal
on Selected Areas in Communications, vol.19, n.3,
Mar.2001, pp.404-409 - Scheduling in IQ switches
- Karol M., Hluchyj M., Morgan S., Input versus
output queueing on a space division switch'',
IEEE Transactions on Communications, vol.35,
n.12, Dec.1987 - McKeown N., Anantharam V., Walrand J.,Achieving
100\ throughput in an input-queued switch'',IEEE
INFOCOM'96, vol.1, San Francisco, CA, Mar.1996,
pp.296-302 - McKeown N.,iSLIP a scheduling algorithm for
input-queued switches'', IEEE Transactions on
Networking, vol.7, n.2, Apr.1999, pp.188-201 - McKeown N., Mekkittikul A.,A practical
scheduling algorithm to achieve 100\ throughput
in input-queued switches'', IEEE INFOCOM'98,
vol.2, 1998, pp.792-9, New York, NY - Tamir Y., Chi H.-C., Symmetric crossbar
arbiters for VLSI communication switches'', IEEE
Transaction on Parallel and Distributed Systems,
vol.4, no.1, Jan.1993, pp.13 27 - Chen H., Lambert J., Pitsilledes A.,RC-BB
switch. A high performance switching network for
B-ISDN'', GLOBECOM 95
144References
- Scheduling in IQ switches
- Anderson T., Owicki S., Saxe J., Thacker
C., "High speed switch scheduling for local area
networks", ACM Transactions on Computer Systems,
vol.11, n.4, Nov.1993
- LaMaire R.O., Serpanos D.N., "Two dimensional
round-robin schedulers for packet switches with
multiple input queues", IEEE/ACM Transactions on
Networking, vol.2, n.5, Oct.1994, pp.471-482
- Chen H., Lambert J., Pitsillides A., "RC-BB
switch. A high performance switching network for
B-ISDN", IEEE GLOBECOM 95, 1995
- Duan H., Lockwood J.W., Kang S.M., Will J.D., "A
high performance OC12/OC48 queue design prototype
for input buffered ATM switches", IEEE
INFOCOM'97, vol.1, 1997, pp.20-8, Los Alamitos,
CA
- Partridge C., et al., "A 50-Gb/s IP router",
IEEE/ACM Transactions on Networking, vol.6, n.3,
June 1998, pp.237-248
- Ajmone Marsan M., Bianco A., Leonardi E., Milia
L., "RPA: a flexible scheduling algorithm for
input buffered switches", IEEE Transactions on
Communications, vol.47, n.12, Dec.1999,
pp.1921-1933
- Ajmone Marsan M., Bianco A., Filippi E., Giaccone
P., Leonardi E., Neri F., "On the behavior of
input queueing switch architectures", European
Transactions on Telecommunications, vol.10, n.2,
Mar.1999, pp.111-124
- Christensen K.J., "Design and evaluation of a
parallel-polled virtual output queued switch",
IEEE ICC 2001, vol.1, pp.112-116, 2001
- Serpanos D.N., Antoniadis P.I., "FIRM: a class
of distributed scheduling algorithms for
high-speed ATM switches with multiple input
queues", IEEE INFOCOM 2000, vol.2, pp.548-555,
2000
- Jiang Y., Hamdi M., "A 2-stage matching
scheduler for a VOQ packet switch architecture",
IEEE ICC 2002, vol.4, pp.2105-2110, 2002
- Tassiulas L., "Linear complexity algorithms for
maximum throughput in radio networks and input
queued switches", IEEE INFOCOM'98, vol.2, New
York, NY, 1998, pp.533-539
- Giaccone P., Prabhakar B., Shah D., "Towards
simple, high-performance schedulers for
high-aggregate bandwidth switches", IEEE
INFOCOM'02, New York, Jun.2002
145References
- Packet scheduling in IQ switches
- Ajmone Marsan M., Bianco A., Giaccone P.,
Leonardi E., Neri F., "Packet scheduling in
input-queued cell-based switches", IEEE
INFOCOM'01, Anchorage, Alaska, Apr.2001 (extended
version to appear in IEEE Trans. on Networking,
about Oct.2002)
- Moon S.H., Sung D.K., "High-performance
variable-length packet scheduling algorithm for
IP traffic", IEEE GLOBECOM'01, Dec.2001
- Scheduling multicast traffic in IQ switches
- Hayes J.F., Breault R., Mehmet-Ali M.K.,
"Performance analysis of a multicast switch",
IEEE Transactions on Communications, vol.39, n.4,
Apr.1991, pp.581-587
- Kim C.K., Lee T.T., "Call scheduling algorithm
in multicast switching systems", IEEE
Transactions on Communications, vol.40, n.3,
Mar.1992, pp.625-635
- McKeown N., Prabhakar B., "Scheduling multicast
cells in an input-queued switch", IEEE INFOCOM'96,
vol.1, San Francisco, CA, Mar.1996, pp.261-278
- Prabhakar B., McKeown N., Ahuja R., "Multicast
scheduling for input-queued switches", IEEE
Journal on Selected Areas in Communications,
vol.15, n.5, Jun.1997, pp.855-866
- Chen W., Chang Y., Hwang W., "A high performance
cell scheduling algorithm in broadband multicast
switching systems", IEEE GLOBECOM'97, vol.1, New
York, NY, 1997, pp.170-174
- Guo M., Chang R., "Multicast ATM switches:
survey and performance evaluation", Computer
Communication Review, vol.28, n.2, Apr.1998,
pp.98-131
- Andrews M., Khanna S., Kumaran K., "Integrated
scheduling of unicast and multicast traffic in an
input-queued switch", IEEE INFOCOM'99, vol.3,
New York, NY, 1999, pp.1144-1151
- Liu Z., Righter R., "Scheduling multicast
input-queued switches", Journal of Scheduling,
John Wiley & Sons, May 1999
146References
- Scheduling multicast traffic in IQ switches
- Nong G., Hamdi M., "On the provision of
integrated QoS guarantees of unicast and
multicast traffic in input-queued switches",
IEEE GLOBECOM'99, vol.3, 1999
- Ajmone Marsan M., Bianco A., Giaccone P.,
Leonardi E., Neri F., "On the throughput of
input-queued cell-based switches with multicast
traffic", IEEE INFOCOM'01, Anchorage, Alaska,
Apr.2001
- Nong G., Hamdi M., "Providing QoS guarantees for
unicast/multicast traffic with fixed/variable-length
packets in multiple input-queued switches",
IEEE Symposium on Computers and Communications,
pp.166-171, 2001
- Smiljanic A., "Flexible bandwidth allocation in
high-capacity packet switches", IEEE/ACM
Transactions on Networking, vol.10, n.2,
pp.287-293, Apr.2002
- QoS support in IQ switches
- Tabatabaee V., Georgiadis L., Tassiulas L.,
"QoS provisioning and tracking fluid policies in
input queueing switches", IEEE INFOCOM'00, New
York, Mar.2000
- Chang C.S., Lee D.S., Jou Y.S., "Load balanced
Birkhoff-von Neumann switches", 2001 IEEE
Workshop on High Performance Switching and
Routing, 2001, pp.276-280
- Hung A., Kesidis G., McKeown N., "ATM
input-buffered switches with guaranteed-rate
property", IEEE ISCC'98, July 1998, pp.331-335,
Athens, Greece
- Advanced architectures derived from pure IQ
- Iyer S., McKeown N., "Making parallel packet
switches practical", IEEE INFOCOM'01, Alaska,
Mar.2001
- Chang C.S., Lee D.S., Jou Y.S., "Load balanced
Birkhoff-von Neumann switches", 2001 IEEE
Workshop on High Performance Switching and
Routing, 2001, pp.276-280
- Sivaram R., Stunkel C.B., Panda D.K., "HIPIQS: a
high-performance switch architecture using input
queuing", IEEE Transactions on Parallel and
Distributed Systems, vol.13, n.3, pp.275-289,
Mar.2002