Title: Scheduling Algorithms for CIOQ Switches
1 Scheduling Algorithms for CIOQ Switches
Prashanth Pappu (Advisor Dr. Jon Turner)
2 The Scheduling Problem
- Need for Combined Input and Output Queuing (CIOQ).
- Speedup (S) is the ratio of the speed of the switch fabric to that of the external links.
- Need for making a scheduling decision.
- Objective: closely approximate an ideal output-queued switch.
  - Work conservation: no loss of output link capacity.
  - Traffic isolation: traffic to different outputs does not interfere.
  - Minimize the required speedup.
3 Problem Context
- Worst-case results: scheduling algorithms proven to be work conserving (with speedup 2).
- Simulation results: heuristic algorithms with performance analysis using simulated traffic conditions.
- Stability results: scheduling algorithms theoretically proven to be stable.
[Figure: implementation complexity (low to high) against performance under various traffic conditions (inadmissible traffic; admissible traffic, uniform and non-uniform). Regions shown: worst-case results, stability results, simulation results, and an open region marked "?".]
- Methods to evaluate scheduling algorithms in inadmissible traffic conditions.
- Design low-complexity scheduling algorithms which have good performance under all traffic conditions.
4 Summary of Contributions
- New method for evaluating scheduling algorithms in inadmissible traffic conditions
  - Studying performance under targeted stress tests
  - Metric that measures lost link capacity: miss fraction
- For crossbar-based switches: stress-resistant scheduling algorithms
  - Evaluation of well-known crossbar scheduling algorithms
    - PIM (Anderson et al.), iSLIP (McKeown), APSARA (Giaccone et al.), LOOFA (Krishna et al.)
  - Design and analysis of improved algorithms
    - Lowest Layer Selection (LLS-R and LLS-S): adding bias to PIM and iSLIP
    - SOLIF-A: improved weight metrics for APSARA
5 Summary of Contributions
- A-LOOFA: practical approximate version of LOOFA
  - Fast maximal matching in hardware
  - Approximate sorting
  - Ensuring input fairness
- For buffered multistage switches
  - Distributed Scheduling (DS): a novel scalable mechanism for regulating the flow of traffic
  - First provably work-conserving DS algorithms: BCCF, BLOOFA and OLA
  - Design of practical variants: single-iteration, distributed scheduling algorithms (DBL and Distributed OLA)
  - Performance analysis of DS algorithms
6 Stress Test
- How do we simulate extreme traffic conditions?
  - Adversarial approach to overloading outputs.
  - Stress (overload) various outputs with the objective of increasing misses.
- New metric: miss fraction = 1 - NA/NI (instead of delay).
- Tests can be varied by changing the number of participating inputs and phases.
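As a small illustration of the metric, the sketch below reads the miss fraction as 1 - NA/NI. The slide does not define NA and NI, so their interpretation here is an assumption: NA is taken as the number of cells actually forwarded on the stressed output links, and NI as the number an ideal output-queued switch would have forwarded over the same interval. The function and parameter names are hypothetical.

```python
def miss_fraction(cells_forwarded: int, cells_ideal: int) -> float:
    """Return 1 - NA/NI.

    Assumption (not stated on the slide): cells_forwarded plays the role
    of NA and cells_ideal the role of NI.
    """
    if cells_ideal <= 0:
        raise ValueError("cells_ideal must be positive")
    return 1.0 - cells_forwarded / cells_ideal


if __name__ == "__main__":
    # A switch that forwards 900 cells where the ideal switch forwards 1000
    # has lost roughly one tenth of the output link capacity.
    print(miss_fraction(900, 1000))
```

Under this reading, a miss fraction of 0 means no output link capacity was lost during the test.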
7 Example of Stress Test
[Figure: queue length versus time for a stress test with 3 inputs and 4 phases, using PIM at speedup 1.5. Curves: VOQ_0, VOQ_1, VOQ_2, VOQ_3, OQ_0 and OQ_1.]
- Ideally, there should be no queuing for VOQ_3.
- Phase transition times determined to equal VOQ lengths.
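8 Least Occupied Output First (LOOFA)
[Figure: LOOFA matching example between inputs and outputs, annotated with output occupancies.]
- Inputs make a single request for the least occupied output.
- Outputs issue grants.
- Unmatched inputs and outputs repeat until no new matches can be added.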
- LOOFA is work-conserving with a speedup of 2.
  - Implies an output link is idle only if no cell for that output is present in any queue (input or output).
  - Holds for any input selection rule.
- Often requires many iterations to complete: little practical use!
- Can we design a practical variant which, if not provably work-conserving, will perform well under stress tests?
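The request/grant iteration described on this slide can be sketched as follows. This is a minimal software illustration, not the thesis implementation: the data layout (`voq`, `occupancy`), the tie-breaking by index, and the choice of "lowest-numbered input" as the grant rule are all assumptions made for the example. The slide notes that work conservation holds for any input selection rule.

```python
def loofa_match(voq, occupancy):
    """Iterative LOOFA-style matching (illustrative sketch).

    voq[i][j] > 0 means input i holds at least one cell for output j.
    occupancy[j] is the current length of output queue j.
    Returns a dict mapping matched input -> output.
    """
    n_in, n_out = len(voq), len(occupancy)
    free_in, free_out = set(range(n_in)), set(range(n_out))
    match = {}
    while True:
        # Request phase: each unmatched input requests the least occupied
        # unmatched output for which it has a cell.
        requests = {}
        for i in sorted(free_in):
            eligible = [j for j in free_out if voq[i][j] > 0]
            if eligible:
                best = min(eligible, key=lambda k: (occupancy[k], k))
                requests.setdefault(best, []).append(i)
        if not requests:
            return match  # no new matches can be added
        # Grant phase: each requested output grants one input
        # (lowest index here; an arbitrary rule chosen for the sketch).
        for j, inputs in requests.items():
            i = min(inputs)
            match[i] = j
            free_in.discard(i)
            free_out.discard(j)


if __name__ == "__main__":
    voq = [[1, 1, 0],
           [1, 0, 0],
           [0, 1, 1]]
    print(loofa_match(voq, occupancy=[0, 2, 1]))
```

Each pass adds at least one match or stops, so the loop runs at most min(inputs, outputs) times, which is the "many iterations" cost the slide points to.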
9 Approximate LOOFA
[Figure: Approximate LOOFA example with inputs and outputs labeled a to f.]
- Hardware implementation of maximal matching.
  - Allows n iterations to complete quickly.
  - 4n gate delays.
  - 6.4 ns for a 32-port switch at 50 ps per gate (less than 20 ns for a router with 10 Gb/s links).
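The timing figure quoted above follows directly from the 4n gate-delay count. The short sketch below reproduces the arithmetic; the function name and its default argument are illustrative only.

```python
def matching_delay_ns(ports: int, gate_delay_ps: float = 50.0) -> float:
    """Delay of the matching circuit, assuming 4n gate delays for n ports."""
    return 4 * ports * gate_delay_ps / 1000.0  # picoseconds -> nanoseconds


if __name__ == "__main__":
    # 4 * 32 gates * 50 ps = 6400 ps
    print(matching_delay_ns(32))
```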
10 Maintaining Output Ordering
- Not enough time to sort outputs by queue length.
- But not essential, since queue lengths change slowly: use Odd-Even Sort.
  - Queue lengths change by at most 1.
  - Compare and swap adjacent elements.
  - Note that the entire column is swapped.
[Figure: initial state and final state of the output ordering.]
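One compare-and-swap pass of odd-even (transposition) sort can be sketched as below. Each column is modeled as a `(queue_length, output_id)` pair so that swapping a pair moves the whole column, as the slide notes. The representation and names are assumptions for illustration; the hardware would perform all comparisons of a pass in parallel.

```python
def odd_even_step(columns, phase):
    """Apply one odd-even transposition pass.

    columns: list of (queue_length, output_id) pairs.
    phase:   0 compares pairs (0,1), (2,3), ...; 1 compares (1,2), (3,4), ...
    Returns a new list; the input list is not modified.
    """
    cols = list(columns)
    for k in range(phase % 2, len(cols) - 1, 2):
        if cols[k][0] > cols[k + 1][0]:
            # Swap the entire column, not just the queue length.
            cols[k], cols[k + 1] = cols[k + 1], cols[k]
    return cols


if __name__ == "__main__":
    state = [(1, "a"), (0, "b"), (3, "c"), (2, "d")]
    state = odd_even_step(state, phase=0)
    state = odd_even_step(state, phase=1)
    print(state)
```

Because queue lengths change by at most 1 between scheduling decisions, alternating the two phases keeps the ordering close to sorted without ever running a full sort.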
11 Fair Treatment of Inputs
- For speedups < 2, unfair treatment of inputs becomes an issue.
- Resolve by performing random row permutations.
[Figure: perfect shuffle with random settings.]
12 Performance of Approx. LOOFA
- Summary
  - Use of stress test
    - Insight to develop practical variant.
    - Make qualitative claims about A-LOOFA performance.
  - In high-speed algorithms where sorting is the bottleneck step
    - Approximate sorting is a good variation.
    - Choice of approximate sorting technique depends on context.
    - Odd-Even sort is suited to slowly changing traffic conditions.
  - O(N) algorithms can still have efficient hardware implementation!
    - The constants can be greatly reduced.
13 Distributed Scheduling
[Figure: switch fabric connecting input (I) and output (O) ports; each port is shown with Sched., Routing and TI blocks.]
Scheduler uses state information to pace VOQs to avoid congesting the switch fabric.
Queue state information is sent out every update period (T).
- Highly scalable systems with inter-stage flow control.
- Lack mechanisms to ensure high throughput in extreme traffic conditions.
- Distributed Scheduling (DS): a scalable mechanism to maintain throughput in extreme traffic conditions.
14 Distributed Scheduling
- Three important features of the mechanism:
  - It is coarse-grained: scheduling decisions are made at pre-determined update periods (T).
  - It is distributed: ports make scheduling decisions independently and asynchronously.
  - It is non-iterative: low overhead due to exchange of information per update period.
  - Also, low hardware complexity and execution time of algorithms at ports.
- Question: What kind of performance guarantees (if any) can we provide with this mechanism?
- Approach: Introduce each feature incrementally and evaluate the achievable throughput.
- Question 1: What is the effect (on achievable throughput) of making a scheduling decision only at fixed time periods (T) in the switch?
15 T-CIOQ Switch
[Figure: timeline of the ARRIVAL, TRANSFER and DEPARTURE phases, annotated with T, T and ST.]
- Three phases: arrival, transfer and departure.
- Up to T cells can arrive (depart) in the arrival (departure) phase.
- A scheduling decision is made every T time units to transfer a maximum of ST cells from an input or to an output (during the transfer phase).
- Implications of these assumptions for real systems are discussed in the thesis.
- T-work conserving: at the beginning of the departure phase, every output which has cells queued in the system has at least T cells in its output queue.
- Question 1 (rephrased): Is there a scheduling algorithm that can keep a T-CIOQ switch T-work conserving?
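The T-work conserving condition can be written as a simple predicate over the queue state at the start of a departure phase. The sketch below uses the reading given later in the talk (no output with cells still queued at inputs may have fewer than T cells in its output queue); the list-based state and the names are assumptions for illustration.

```python
def is_t_work_conserving(output_queue, input_backlog, T):
    """Check the T-work conserving condition at the start of departure.

    output_queue[j]:  number of cells in output queue j.
    input_backlog[j]: number of cells for output j still queued at inputs.
    Reading assumed here: every output with cells left at the inputs
    must hold at least T cells in its output queue.
    """
    return all(
        oq >= T
        for oq, backlog in zip(output_queue, input_backlog)
        if backlog > 0
    )


if __name__ == "__main__":
    # Output 0 still has cells at the inputs and holds T = 4 cells: fine.
    print(is_t_work_conserving([4, 0, 2], [3, 0, 0], T=4))
    # Output 0 holds only 3 cells while 3 more wait at the inputs: violated.
    print(is_t_work_conserving([3, 0, 2], [3, 0, 0], T=4))
```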
16 VOQ Ordering
- All VOQs at an input are ordered.
- Cells at the input are ordered according to the ordering of VOQs.
- Two different ordering criteria:
  - BLOOFA
  - BCCF
- Example shows BLOOFA ordering.
- Given a particular ordering, we want to construct maximal and ordered schedules.
[Figure: example VOQ ordering at the inputs, with output queues.]
17 Maximal Ordered Schedule
[Figure: example of a maximal ordered schedule.]
18 Work Conservation
- Theorem 1: The BLOOFA scheduling algorithm is T-work conserving for a speedup of 2.
  - No output with cells queued at inputs can have fewer than T cells in its output queue at the beginning of the departure phase.
- Theorem 2: The BCCF scheduling algorithm is T-work conserving for a speedup of 2.
  - Proof construction similar to that of BLOOFA.
- Question 1 (contd.): How do we find a maximal ordered schedule?
19 Maximal Schedule as a Blocking Flow
[Figure: flow network from a source through the inputs and outputs to a target, with edge capacities, illustrating a maximal schedule as a blocking flow.]
- Dinic's algorithm
  - Repeatedly search for s-t paths with no saturated edges and add as much flow as possible.
  - Modification: preferentially select edges between inputs and outputs according to the VOQ ordering at the input.
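The blocking-flow view can be illustrated with a greedy sketch on the source/inputs/outputs/target network. This is a simplification, not Dinic's algorithm itself and not the thesis procedure: it pushes flow along source-input-output-target paths in each input's VOQ order until every such path has a saturated edge. The capacity `cap` stands in for the ST cells per port, and all names are hypothetical. It makes no claim to satisfy the formal definition of an ordered schedule.

```python
def greedy_blocking_flow(voq, order, cap):
    """Greedy blocking flow on a bipartite input/output network.

    voq[i][j]: cells queued at input i for output j (edge capacity).
    order[i]:  outputs listed in input i's VOQ preference order.
    cap:       cells each input may send and each output may receive.
    Returns flow[i][j], the number of cells scheduled from i to j.
    """
    n_in, n_out = len(voq), len(voq[0])
    in_left = [cap] * n_in     # residual source -> input capacity
    out_left = [cap] * n_out   # residual output -> target capacity
    flow = [[0] * n_out for _ in range(n_in)]
    for i in range(n_in):
        for j in order[i]:
            # Push as much as the three edges on this path allow.
            f = min(voq[i][j], in_left[i], out_left[j])
            flow[i][j] = f
            in_left[i] -= f
            out_left[j] -= f
    return flow


if __name__ == "__main__":
    voq = [[3, 1],
           [2, 2]]
    print(greedy_blocking_flow(voq, order=[[0, 1], [0, 1]], cap=3))
```

After the loop, every input/output pair with cells left over has either a saturated input or a saturated output, so no further cell can be added without rerouting: the schedule is maximal in that sense.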
20 BLOOFA Example
[Figure: BLOOFA example showing queue states in the arrival, transfer and departure phases.]
21 Distributed, Iterative Schedulers
- Question 2: Can we make these T-work conserving algorithms distributed?
  - Answer: Yes.
  - Outputs (inputs) send (receive) a maximum of n messages.
  - Inputs (outputs) send (receive) a maximum of 2n messages.
  - Algorithm not guaranteed to run in O(n) time.
- Question 3: Can we make these T-work conserving distributed scheduling algorithms non-iterative?
  - Answer: No.
  - But distributed, non-iterative schedulers can approximate the performance of T-work conserving schedulers.
  - Distributed BLOOFA.
22 Distributed BLOOFA
Backlog Proportional Allocation: hi(i,j) = S·T·B(i,j) / B(·,j)
[Figure: queue states for the Distributed BLOOFA example.]
- hi(2,0) = 30/8 → 4
- hi(2,1) = 0
- hi(2,2) = 12/7 → 1
- hi(2,3) = 0
- hi(3,0) = 0
- hi(3,1) = 6/7 → 1
- hi(3,2) = 12/7 → 2
- hi(3,3) = 4
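The backlog-proportional formula can be sketched as below. The interpretation is an assumption: B(i,j) is taken as the backlog at input i for output j, and B(·,j) as the total backlog for output j across all inputs. The product S·T is passed as a single integer `st`. The slide's examples round the fractions to integers, but the rounding rule is not stated, so this sketch returns exact fractions and leaves rounding out.

```python
from fractions import Fraction


def backlog_proportional(B, st):
    """Compute hi(i,j) = st * B[i][j] / sum_i B[i][j] as exact fractions.

    B[i][j]: backlog at input i for output j (assumed meaning).
    st:      the product S*T, given as an integer.
    Outputs with no backlog get an allocation of 0.
    """
    n_in, n_out = len(B), len(B[0])
    total = [sum(B[i][j] for i in range(n_in)) for j in range(n_out)]
    return [
        [Fraction(st * B[i][j], total[j]) if total[j] else Fraction(0)
         for j in range(n_out)]
        for i in range(n_in)
    ]


if __name__ == "__main__":
    # Hypothetical backlog matrix: input 0 holds 5 of the 8 cells for output 0.
    B = [[5, 0],
         [3, 7]]
    for row in backlog_proportional(B, st=6):
        print([str(x) for x in row])
```

With `st = 6`, an input holding 5 of the 8 cells queued for an output receives 30/8 of that output's capacity, which matches the form of the first example on the slide.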
23 Performance on stress tests
[Figure: stress-test results for DBL and BLOOFA.]
- 90 sets of stress tests (up to 15 inputs and 15 phases).
- Worst-case results for 2, 3, 4 and 5 inputs plotted.
- Performance of DBL is comparable to BLOOFA.
  - Though DBL is not known to be provably T-work conserving.
24 Output Leveling Algorithm (OLA)
- Question: Can we improve the performance of BLOOFA at smaller speedups?
- Idea: Instead of giving shorter output queues greater priority, level the output queues!
- Comprehensive study
  - Formulation: OLA produces schedules which are maximal and level.
  - Theorem 3: OLA is T-work conserving for a speedup of 2.
  - How do we find schedules which are maximal and level?
    - Using convex edge costs in a minimum cost maximum flow problem.
    - Less complex approximation: OLA.
  - Single iteration, distributed version of OLA: distributed OLA.
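To give a feel for what "leveling" the output queues means, the sketch below distributes a budget of cells one at a time to whichever output queue is currently shortest. It is only an illustration of the leveling idea: it ignores which inputs hold the cells and the per-port transfer limits, so it is neither OLA nor the min-cost flow formulation. The function name and the tie-break by output index are assumptions.

```python
import heapq


def level_outputs(queue_lengths, budget):
    """Add `budget` cells, always to the currently shortest output queue.

    Ties are broken by lower output index. Returns the new queue lengths.
    """
    result = list(queue_lengths)
    heap = [(length, j) for j, length in enumerate(result)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        length, j = heapq.heappop(heap)
        result[j] = length + 1
        heapq.heappush(heap, (length + 1, j))
    return result


if __name__ == "__main__":
    print(level_outputs([5, 0, 3], budget=4))
```

Always feeding the shortest queue is the greedy counterpart of a convex cost on queue length: each extra cell goes where it raises the cost least.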
25 OLA Example
[Figure: OLA example showing queue states in the arrival, transfer and departure phases.]
26 Performance of Distributed OLA
[Figure: performance of Distributed OLA for Delta = 0.2 and Delta = 0.02.]
- Distributed OLA requires the same speedup as DBL to reduce the miss fraction to 0.
- Shows great improvement over DBL for smaller speedups.
27 Summary
- LLS-R, LLS-S
- SOLIF-A
- A-LOOFA
- DBL
- Distributed OLA
[Figure: implementation complexity (low to high) against performance under various traffic conditions (inadmissible traffic; admissible traffic, uniform and non-uniform). Regions shown: worst-case results, stability results, simulation results, and stress resistant algorithms.]
- Proposed algorithms are of immediate utility to routers like Cisco's 12000 series (1.28 Tbps capacity, crossbar-based router) and CRS-1 (90 Tbps capacity, buffered multi-stage router, released May 2004).
28 Acknowledgements
- Dr Jon Turner
- Members of the Committee
- Dr John Lockwood
- Dr Roger Chamberlain
- Dr Dan Fuhrmann
- Dr Sergey Gorinsky
- ARLites.
- Parents.
29 Thank you (The Halle Berry version)
Kayaks coffee shop (where I wrote most of my
thesis)
David for his politically correct humor
Ed for bringing his GPS device on our last float
trip. Which way Ed? <Pause> Downstream!
My younger brother (for not expecting me to gift
him an iPod)
Praveen for converting my pdf plots to eps on his
unix machine.
Tilman for convening the FOOLISH
symposium: Fragmentation Of Organic Life-forms In
Simulated Hallways
Anshul for adding to my vocabulary: Highly
Useless
Samphel for his spontaneity. (Movie in fifteen
minutes? You guys are nuts!)
Sumi for readily parting with her machine for
more important research.
Ralph for his politically incorrect humor
My elder brother (for gifting me an iPod without
me expecting it)
Praveen for converting my excel plots to pdf on
his mac.
Jai for buying headphones along with his music
composition wares.
Saigon Restaurant (for No. 34 with Tofu)
(Background music peppered with sounds of mike
being wrenched away) <Inaudible>
(While I'm at it) Sean @ food court (The wrap
guy)
Kurt Elling, Jamiroquai, OutKast
The CD burner and the Internet