Title: Scheduling Algorithms for CIOQ Switches
1Scheduling Algorithms for CIOQ Switches
- Prashanth Pappu
- Applied Research Laboratory
- Washington University in St Louis
2Anatomy of a Router
- Transmission interfaces: terminate physical links; conversion and encoding functions for the target physical layer.
- Port processors: queue packets and perform all packet processing.
- Switching fabric: single-stage (crossbar) or multi-stage.
- Control processor: routing and network management protocols.
3Output Queuing
- Queuing is done only at output ports.
- Maximizes throughput.
- Contention between packets occurs only at output ports.
- Requires speedup = N: impractical, but the ideal model.
4Combined Input Output Queuing (CIOQ)
- Use of virtual output queues (VOQs); speedup typically < 2.
- The scheduling problem: emulate ideal OQ behavior.
- Focus of thesis: scheduling algorithms for CIOQ switches.
5Crossbar Switches
- Bipartite graph matching problem.
- Stability results
  - Maximum size matching (MSM): stable for i.i.d., uniform admissible traffic.
  - Maximum weight matching (MWM): stable for independent admissible traffic.
  - Too complex: O(N^{5/2}) and O(N^3 log N), respectively.
  - A router with 10 Gb/s links has < 40 ns to make a scheduling decision.
- Maximal size matching (PIM, iSLIP)
  - Simple to implement but do not perform well under extreme traffic conditions.
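To make the maximal-matching idea concrete, here is a minimal Python sketch of PIM-style iterative matching (request, random grant, random accept). It is an illustration only: the function name `pim_match`, the `requests` representation and the default iteration count are assumptions made for this sketch, not details taken from the slides.

```python
import random


def pim_match(requests, iterations=4, rng=None):
    """Sketch of PIM-style matching.

    requests[i] is the set of outputs for which input i has queued cells.
    Returns a dict mapping each matched input to its matched output.
    """
    rng = rng or random.Random()
    match = {}                 # input -> output
    matched_outputs = set()
    for _ in range(iterations):
        # Request: every unmatched input requests each unmatched output
        # for which it has a cell.
        requests_at_output = {}
        for i, outs in enumerate(requests):
            if i in match:
                continue
            for j in outs:
                if j not in matched_outputs:
                    requests_at_output.setdefault(j, []).append(i)
        if not requests_at_output:
            break              # nothing left to add: the matching is maximal
        # Grant: each requested output grants to one requesting input at random.
        grants_at_input = {}
        for j, ins in requests_at_output.items():
            grants_at_input.setdefault(rng.choice(ins), []).append(j)
        # Accept: each input that received grants accepts one at random.
        for i, outs in grants_at_input.items():
            j = rng.choice(outs)
            match[i] = j
            matched_outputs.add(j)
    return match
```

Every iteration that still has requests adds at least one pair, so running N iterations on an N-port switch always yields a maximal (not necessarily maximum) matching.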
6Crossbar Switches (contd.)
- Worst case results
  - Critical Cell First (CCF) and Lowest Output Occupancy First Algorithm (LOOFA), O(N): work conserving with a speedup of 2.
  - Significant results but not practical.
- Traffic in IP networks is unregulated and can cause sustained overloads (inadmissible traffic).
  - Slow congestion control mechanisms.
  - Rapid traffic shifts due to route changes.
  - Route selection mechanisms not guided by session needs.
  - Presence of malicious users, etc.
- How do practical scheduling algorithms perform under these conditions?
7Multi-stage switches
- Scalable: no scheduling algorithm.
- Use buffered switch elements, dynamic routing and modest speedup.
- Loss of performance in extreme traffic conditions.
- How do we regulate traffic in such multi-stage switches using buffered switch elements?
8Proposal Overview
- Stress resistant scheduling algorithms.
  - Simple, practical algorithms for crossbar switches.
  - Maintain throughput under uniform traffic and extreme traffic conditions (stress tests).
  - Stress tests: adversarial traffic patterns to simulate extreme conditions.
  - "Stress resistant scheduling algorithms for CIOQ switches", Prashanth Pappu and Jon Turner. In ICNP 2003.
- Distributed scheduling (DS)
  - A novel, scalable (coarse-grained and distributed) mechanism for regulating traffic in multi-stage switches.
  - Comprehensive study covering work conserving, backlog based, time-sliced and fair DS algorithms.
  - "Distributed queuing for scalable, high performance routers", Prashanth Pappu, Jyoti Parwatikar, Jon Turner and Ken Wong. In Infocom 2003.
9Stress Tests
- How do we simulate extreme traffic conditions?
- Adversarial approach in overloading outputs.
- Stress (overload) various outputs with the objective of increasing misses.
- New metric: miss fraction, 1 - N_A/N_I (instead of delay).
- Tests can be varied by changing the number of participating inputs and phases.
10Stress Test (Example)
- PIM (speedup 1.5). Stress test with 3
participating inputs, 4 phases.
11Stress Tests
- Parallel Iterative Matching (PIM) and iterative SLIP (iSLIP): iterative, simple to implement.
- Lowest Output Occupancy First Algorithm (LOOFA): provably good worst case performance but too complex to implement.
12Approximate Output Ordering
- Ordering outputs is the key.
  - Complete ordering can be too complex.
  - Under persistent and slowly changing traffic conditions, approximate ordering can be good enough.
- Lowest Layer Selection
  - Bigger layers for larger queue lengths.
  - Beyond a queue limit all outputs are treated equally.
  - Number of layers is independent of N.
  - Algorithms give priority to outputs in the lowest layer in the accept phase.
  - A priority encoder or N-way minimum finding circuit can be used on a grant vector.
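The layering idea can be sketched in Python as follows. This is only one possible reading of the bullets above: the logarithmic layer boundaries, the lowest-index tie-break and the names `layer_of` and `accept` are assumptions made for illustration, not the definitions used for LLS-R or LLS-S.

```python
def layer_of(queue_len, num_layers=16):
    """Map an output queue length to a layer index.

    Assumed layering: layer k covers queue lengths 2**k - 1 .. 2**(k+1) - 2,
    so layers get wider as queues grow. Everything beyond the last boundary
    falls into the top layer, so long queues are treated as equal.
    """
    return min((queue_len + 1).bit_length() - 1, num_layers - 1)


def accept(grant_vector, output_queue_lens, num_layers=16):
    """Accept phase for one input: pick a granting output in the lowest layer.

    grant_vector[j] is True if output j granted to this input.
    Ties are broken by lowest output index (an arbitrary choice here).
    Returns the chosen output index, or None if there were no grants.
    """
    best = None
    for j, granted in enumerate(grant_vector):
        if not granted:
            continue
        layer = layer_of(output_queue_lens[j], num_layers)
        if best is None or layer < best[0]:
            best = (layer, j)
    return None if best is None else best[1]
```

Because the number of layers is fixed, the comparison in `accept` works on a few bits per output regardless of N, which is what makes a priority-encoder style circuit plausible.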
13Stress Test
[Figure: Miss fractions for LLS-R, LLS-S (using 16 layers) and LOOFA. Panels: Test A, Test B.]
14Approximate LOOFA (A-LOOFA)
- LOOFA can be used as the basis for a simpler algorithm.
- Matching in A-LOOFA is accomplished using a simple combinational circuit (O(N) complexity but constant factor gate delays).
- In a 0.13 µm ASIC process, gate delays are 25-50 ps. A match can be completed in 3.2-6.4 ns for N = 32.
- Outputs (columns) are sorted using odd-even sort.
- Inputs (rows) are ordered using a permutation based on perfect shuffle (for fairness).
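The timing figures and the two orderings can be illustrated with a short Python sketch. Two assumptions are made here and are not confirmed by the slides: that "odd-even sort" means odd-even transposition sort, and that the quoted match time corresponds to 128 gate delays (3.2 ns / 25 ps = 6.4 ns / 50 ps = 128, which equals 4N for N = 32). All function names are hypothetical.

```python
def odd_even_transposition_sort(keys):
    """Return output indices ordered by nondecreasing key (e.g. occupancy).

    Uses N rounds of neighbor compare-exchange; even rounds compare
    positions (0,1), (2,3), ... and odd rounds compare (1,2), (3,4), ...
    """
    order = list(range(len(keys)))
    n = len(order)
    for rnd in range(n):
        for k in range(rnd % 2, n - 1, 2):
            if keys[order[k]] > keys[order[k + 1]]:
                order[k], order[k + 1] = order[k + 1], order[k]
    return order


def perfect_shuffle(order):
    """Interleave the first and second halves of an even-length list."""
    half = len(order) // 2
    shuffled = []
    for a, b in zip(order[:half], order[half:]):
        shuffled += [a, b]
    return shuffled


def match_time_ns(gate_levels, gate_delay_ps):
    """Total delay in ns of a combinational path of gate_levels gates."""
    return gate_levels * gate_delay_ps / 1000.0
```

For example, `match_time_ns(128, 25)` and `match_time_ns(128, 50)` reproduce the 3.2 ns and 6.4 ns figures, and repeatedly applying `perfect_shuffle` to the row order changes which input is considered first from one scheduling cycle to the next.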
15Stress Resistant Scheduling Algorithms
- Done
  - Delay vs. miss fraction as a metric.
  - Performance of LLS-R and LLS-S with varying number of layers.
  - Performance of A-LOOFA under stress test.
- To do
  - More stress tests.
  - Detailed notes on implementation.
16Distributed Scheduling
[Figure: block diagram showing the Control Processor, the Switch Fabric, and six ports, each labeled I, O, Sched., Routing and TI.]
17Distributed Scheduling
- Paradigm shift from scheduling cells to regulating rates.
- Very scalable: a router with 1000 links (10 Gb/s each) needs T < 100 µs for 5% overhead.
- Simple to implement: VOQ status in control cells.
- Constraints on allocated rates: input and output constraints.
- Throughput and delay properties depend on the rate allocation algorithm used.
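The relationship between the update period T and control overhead can be explored with a rough Python calculation. The model is an assumption, not the slides' own accounting: each port is taken to send one status word to every port once per period, and the 32-bit status word size used in the example is hypothetical.

```python
def control_overhead(num_ports, status_bits, link_rate_bps, period_s):
    """Fraction of one link's capacity spent on status reports.

    Assumes a port sends num_ports status words of status_bits each
    during every update period of length period_s.
    """
    return num_ports * status_bits / (link_rate_bps * period_s)


def min_period_s(num_ports, status_bits, link_rate_bps, max_overhead):
    """Shortest update period that keeps overhead at or below max_overhead."""
    return num_ports * status_bits / (link_rate_bps * max_overhead)
```

Under these assumptions, 1000 ports at 10 Gb/s with 32-bit status words and T = 100 µs give an overhead of about 3.2%, and a 5% budget is met by any period of roughly 64 µs or longer.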
18Questions
- Can we design work conserving algorithms which make scheduling decisions only every T cycles?
  - Work conserving scheduling algorithms.
- Can we make these work conserving algorithms non-iterative?
  - No, work conserving algorithms need multiple iterations.
- How do we design simple non-iterative algorithms that have good throughput even under extreme traffic patterns?
  - Backlog based DS algorithms.
- How do we design simple algorithms that have both good throughput and delay properties? (Approximate OQ emulation)
  - Time sliced DS algorithms.
- How do we use the DS mechanism to provide per flow guarantees?
  - Fair DS algorithms.
19Work conserving scheduling algorithms.
- Can a scheduling algorithm that makes a scheduling decision every T cycles be work conserving?
- Maximal: A scheduling algorithm is said to be maximal if, for any cell c that is not transferred in the transfer phase, either S × T cells at c's input are transferred or S × T cells are transferred to c's output.
- VOQs are ordered based on output queue length. Ordering can be extended to cells in VOQs too.
- Ordered: A scheduling algorithm is said to be ordered if, for any cell c that is not transferred, no cell preceded by c at the same output gets transferred unless c's output gets S × T cells.
- Claim: A scheduling algorithm that is both ordered and maximal is work conserving (with a speedup of 2).
20Backlog based DS algorithms
- Work conserving scheduling algorithms require multiple iterations: impractical.
- Practical rate allocation algorithms have to be non-iterative too.
- DSCs periodically exchange B(i,j) and B(j).
- Non-iterative DS algorithms that use only these backlogs carried in control cells: backlog based DS algorithms.
- Objective: To maintain throughput under both admissible and inadmissible traffic conditions.
- We design algorithms based on two heuristics.
21Backlog based DS algorithms
- Output constraint: backlog proportional allocation
  - avoids switch congestion
  - ensures fairness for uniform traffic
- Input constraint: urgency proportional allocation
  - priority for outputs with smaller queue lengths.
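A minimal Python sketch of the output-side heuristic follows. It assumes B(i,j) denotes the backlog at input i destined for output j, which the slides do not spell out. The input-side step shown here is a plain proportional scale-down used as a stand-in; it is not the urgency proportional rule, whose details are not given on the slide. The function name and parameters are hypothetical.

```python
def backlog_proportional_rates(B, link_rate=1.0, speedup=1.0):
    """Allocate rates[i][j] from a backlog matrix B[i][j].

    Output constraint: each output's bandwidth (speedup * link_rate) is
    split among inputs in proportion to their backlog for that output.
    Input constraint (stand-in): if an input's total allocation exceeds
    its own bandwidth, its rates are scaled down uniformly.
    """
    n_in, n_out = len(B), len(B[0])
    cap = speedup * link_rate
    rates = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        col_total = sum(B[i][j] for i in range(n_in))
        if col_total > 0:
            for i in range(n_in):
                rates[i][j] = cap * B[i][j] / col_total
    for i in range(n_in):
        row_total = sum(rates[i])
        if row_total > cap:
            scale = cap / row_total
            rates[i] = [r * scale for r in rates[i]]
    return rates
```

With `B = [[2, 0], [2, 4]]`, both inputs first receive half of output 0 and input 1 receives all of output 1; input 1 is then oversubscribed, so its two rates are scaled back to fit its own link.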
22Time-sliced DS algorithms
- Backlog based DS algorithms can badly mis-order cells when arrival rates change abruptly.
- Information carried in cells has no timing information.
- Can be overcome by using queue length slices instead of total queue lengths in the rate calculation.
- In the worst case, creates an overhead of one word per packet.
23Time sliced DS algorithms
24Fair DS algorithms
- To support fair queuing, inputs and outputs maintain per flow queues.
- Switch fabric bandwidth is allocated among inputs in proportion to the sum of weights of backlogged flows.
- Excess bandwidth is reallocated by inputs among local backlogged flows.
- Can render output non-work conserving, but reallocation mechanisms require multiple iterations.
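The weight-proportional split of fabric bandwidth can be sketched in Python as below. The flow representation (tuples of input port, weight and a backlogged flag) and the function name are assumptions for illustration, and the excess-bandwidth reallocation step is deliberately left out.

```python
def fair_fabric_shares(flows, capacity=1.0):
    """Split one output's fabric bandwidth among inputs.

    flows is a list of (input_port, weight, backlogged) tuples for flows
    headed to this output. Each input's share is proportional to the sum
    of the weights of its backlogged flows; idle flows are ignored.
    Returns a dict mapping input_port -> allocated bandwidth.
    """
    weight_sum = {}
    for port, weight, backlogged in flows:
        if backlogged:
            weight_sum[port] = weight_sum.get(port, 0.0) + weight
    total = sum(weight_sum.values())
    if total == 0:
        return {}
    return {port: capacity * w / total for port, w in weight_sum.items()}
```

For instance, if input 0 has backlogged flows of weight 1 and 2 while input 1 has one backlogged flow of weight 1 and an idle flow of weight 5, input 0 receives three quarters of the bandwidth and input 1 one quarter.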
25Distributed Scheduling
- Done
  - DS mechanism implementation.
  - Work conserving scheduling algorithms.
  - Performance study of backlog based non-iterative DS algorithms.
- To do
  - Performance study of time-sliced DS algorithms.
  - Performance study of Fair DS algorithms.