Title: Fab 22 Culture Course
1 A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
Shubu Mukherjee, Federico Silla*, Peter Bannon, Joel Emer, Steve Lang, Dave Webb (ack Richard Kessler)
Intel, UPV*, HP
2 Alpha 21364 Network
[Figure: network of interconnected 21364 chips (each including a router); every chip is attached to Rambus memory (M) and I/O (IO)]
3 The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports]
Distributed Arbitration Algorithm Controls the Crossbar
- 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
- 7 output ports: 4 network, 2 memory/cache, 1 I/O
- Router pipeline length: 13/14 cycles
- Virtual cut-through
4 Problem: Maximize Matches
[Table: packets queued at each input port; numbers in table cells are destination output ports. Figure annotations: "older packet at input port" marker; shaded boxes mark the smarter algorithm's choices.]
Input Port 0: 2, 3
Input Port 1: 1, 3
Input Port 2: 1, 2
Input Port 3: 1, 2, 3
Input Port 4: 1, 3
Input Port 5: 0, 2
Input Port 6: 4, 2
Input Port 7: 5, 2
- Oldest Packet First: one match
- Smarter algorithm (shaded boxes): 7 matches (perfect)
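The gap between the two policies can be illustrated with a small sketch. It does not use the slide's table; the `queues` data, the function names, and the choice of an augmenting-path matcher as the "smarter algorithm" are all assumptions made for illustration.

```python
def oldest_first_matches(queues):
    """Each input port requests only the output port of its oldest packet.

    queues[i] lists destination output ports at input port i, oldest first.
    An output port can serve one input port, so the number of matches is
    the number of distinct output ports requested.
    """
    return len({queue[0] for queue in queues if queue})


def max_matches(queues):
    """Maximum input/output matching via augmenting paths (Kuhn's algorithm)."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its owner if the owner can move.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(queues)))


# Hypothetical data: the oldest packet at every input port targets output 0.
queues = [[0, 1], [0, 2], [0, 3], [0]]
print(oldest_first_matches(queues))  # 1
print(max_matches(queues))           # 4
```

With every oldest packet aimed at the same output port, oldest-first produces a single match, while considering younger packets allows all four input ports to be matched.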
5 Simpler Algorithms Have Fewer Matches
[Figure: chart with axis labeled "complexity"]
Assumes all output ports are free
6 Complexity may not pay off
[Figure: chart with axis labeled "complexity"]
at 30% input buffer occupancy
7 Key Results
- Arbitration Algorithms
- WFA: Wave Front Arbitration Algorithm (Tamir and Chi, 1993, SGI Spider)
- PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
- SPAA: Simple, Pipelined Arbitration Algorithm (21364)
- SPAA outperforms WFA and PIM1
- SPAA's matching power is similar to WFA and PIM1 (when many output ports are busy)
- SPAA minimizes interactions between ports
- SPAA can be pipelined more effectively
- Rotary Rule
- avoids network saturation under very heavy load
8 Wave Front Arbiter (WFA)
- Proposed by Tamir and Chi, 1993
- used in the SGI Spider/Origin switch
- Implemented via a connection matrix
[Figure: connection matrix with input ports 0-3 and output ports]
Cell logic:
Grant = Request AND N AND W
S = N AND NOT(Grant)
E = W AND NOT(Grant)
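The connection-matrix cell logic (a cell grants when it has a request and both its N and W inputs are asserted, and otherwise passes them on as S and E) can be modeled in software. This is a minimal sketch, not the hardware design: the function name, the `first_row`/`first_col` parameters that model a rotating starting cell, and the request matrix are assumptions for illustration.

```python
def wave_front_arbitrate(request, first_row=0, first_col=0):
    """Software model of a wave front arbiter.

    request[i][j] is truthy when input port i wants output port j.
    Returns a grant matrix of booleans with at most one grant per row
    and per column. Cells are visited starting from (first_row, first_col).
    """
    n_in, n_out = len(request), len(request[0])
    rows = [(first_row + k) % n_in for k in range(n_in)]
    cols = [(first_col + k) % n_out for k in range(n_out)]
    north = [True] * n_out  # N input of the next cell in each column
    grant = [[False] * n_out for _ in range(n_in)]
    for i in rows:
        west = True  # W input of the first cell visited in this row
        for j in cols:
            g = bool(request[i][j]) and north[j] and west
            grant[i][j] = g
            north[j] = north[j] and not g  # S = N AND NOT(Grant)
            west = west and not g          # E = W AND NOT(Grant)
    return grant


# Hypothetical 4x4 request matrix.
request = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
print(sum(map(sum, wave_front_arbitrate(request))))               # 3
print(sum(map(sum, wave_front_arbitrate(request, first_col=1))))  # 4
```

In this example the number of matches depends on which cell the wave starts from, which is one reason the starting cell is rotated between arbitration rounds.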
9 WFA: Advantage and Pipeline
- High degree of interaction among output ports
- reduces arbitration collisions and improves number of matches
- Algorithm (implemented via a connection matrix)
- (1) Select packet at input port and load matrix (1.5 cycles)
- (2) Run through matrix and inform input ports (1.5 cycles)
- (3) Forward arbitration to output ports (1 cycle)
10 WFA Limitations
- Higher number of estimated cycles
- 4 cycles in 0.18 micron
- Harder to pipeline effectively
- micropipelining waves (2) is difficult because initial cell changes every cycle
- restarting (1) before (2) completes is complex
- large in-flight packet table due to large number of nominations (up to 54)
- may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
11 Parallel Iterative Matching (PIM)
- Steps in One Iteration (PIM1)
- Nominate: each input port nominates packets for every output port (same packet nominated multiple times)
- Grant: unmatched output port selects an input port packet randomly
- Accept: unselected input port selects a grant randomly
[Figure: input ports 0-1 and output ports 0-1 shown in three panels: Nominate, Grant, Accept]
Output Port 0 unused in this arbitration round
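The Nominate, Grant, and Accept steps of a single PIM iteration can be sketched as follows. This is an illustrative model only: the function name `pim1`, the `requests` representation (one set of wanted output ports per input port), and the injectable `rng` are assumptions, and busy output ports are not modeled.

```python
import random


def pim1(requests, rng=random):
    """One iteration of parallel iterative matching.

    requests[i] is the set of output ports that input port i has packets for.
    Returns a dict mapping each matched input port to its output port.
    """
    # Nominate: each input port nominates to every output port it wants.
    nominations = {}
    for inp, outs in enumerate(requests):
        for out in sorted(outs):
            nominations.setdefault(out, []).append(inp)

    # Grant: each nominated output port picks one nominating input at random.
    grants = {}
    for out, inps in nominations.items():
        grants.setdefault(rng.choice(inps), []).append(out)

    # Accept: an input port holding several grants accepts one at random;
    # the output ports it turns down stay unused in this round.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Both input ports want both output ports; if both outputs grant the same
# input port, one output port goes unused in this round.
print(pim1([{0, 1}, {0, 1}], rng=random.Random(0)))
```

Because the grant and accept choices are random, the result varies with the seed; the only guarantees are that each input port and each output port appear in at most one match.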
12 PIM1: Advantage and Pipeline
- High interaction between input and output ports
- reduces arbitration collisions and improves number of matches
- Algorithm (implemented via connection matrix)
- (1) Select packet at input port and load matrix (1.5 cycles)
- (2) Run through matrix and inform input ports (1.5 cycles)
- (3) Forward arbitration to output ports (1 cycle)
13 PIM1 Limitations
- Higher number of estimated cycles
- 4 cycles in 0.18 micron
- Harder to pipeline effectively
- restarting (1) before (2) completes is complex
- same packet can be nominated multiple times, requiring the Accept step (part of stage 2)
- large in-flight packet table due to large number of nominations (up to 54)
- may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
14 Simple, Pipelined Arbitration Algorithm (SPAA): used in the Alpha 21364 Router
- Algorithm
- Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
- Grant: each output port selects an input port packet based on the least-recently selected one
- Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: input ports 0-1 and output ports 0-1, with panels labeled Nominate, Grant, Accept, and Reset]
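The Nominate, Grant, and Reset steps can be sketched in a few lines. This is a simplified model under stated assumptions, not the 21364 implementation: each input port nominates its oldest packet, least-recently-selected state is kept as a per-output table of timestamps, and the names `spaa_round`, `queues`, and `last_selected` are hypothetical.

```python
def spaa_round(queues, last_selected, now):
    """One SPAA-style arbitration round.

    queues[i]: destination output ports queued at input port i, oldest first.
    last_selected[out][inp]: round in which output `out` last picked `inp`
    (-1 if never). Returns a dict of matched input port -> output port.
    """
    # Nominate: each input port nominates one packet to exactly one output.
    nominations = {}
    for inp, queue in enumerate(queues):
        if queue:
            nominations.setdefault(queue[0], []).append(inp)

    # Grant: each output picks its least-recently selected nominator.
    matches = {}
    for out, inps in nominations.items():
        winner = min(inps, key=lambda inp: last_selected[out][inp])
        last_selected[out][winner] = now
        queues[winner].pop(0)  # the granted packet leaves the input buffer
        matches[winner] = out

    # Reset: losing input ports keep their packets queued, so they simply
    # renominate them in a later round.
    return matches


# Hypothetical example: two input ports competing for a single output port.
queues = [[0, 0], [0]]
last_selected = [[-1, -1]]
rounds = [spaa_round(queues, last_selected, t) for t in range(3)]
print(rounds)  # [{0: 0}, {1: 0}, {0: 0}]
```

Since every packet is nominated to only one output port, no Accept step is needed: a grant is final as soon as the output port issues it.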
15 SPAA's Simplicity
- Low degree of interaction among ports
- downside: increases arbitration collisions
- upside: reduces complexity
- Algorithm (no centralized matrix)
- (1) Select packet at input port and load matrix (1 cycle)
- (2) Forward packets to output ports (1 cycle)
- (3) Output ports select packets and return feedback to input ports (1 cycle)
16 SPAA's Advantages
- Fewer cycles
- 3 cycles in 0.18 micron
- Speculatively read out input buffer
- prior to output port arbitration
- because only one packet is nominated to one output port
- Easier to pipeline
- restart (1) for free input ports before (2) completes
- only one packet nominated to one output port
- small number (16) of in-flight packets
- avoids any centralized matrix
- speculative read allows data flits to follow header flits
17 Summary: Simpler is Better
18 Saturation Behavior
- Reasons: hot spots, tree saturation
- 21364's router shows cyclic pattern (link utilization with time)
- Ideally, operate at saturation bandwidth
- Solution: throttle input load
19 Rotary Rule
- 21364's in-built throttling
- maximum outstanding cache miss requests per processor: 16
- Rotary Rule: more throttling
- 21364 is a direct network
- Rotary Rule prioritizes traffic in network ports over local ports
- also, clears network congestion
- relies on anti-starvation mechanism
- WFA-Rotary: change first cell
- SPAA-Rotary: change output port priority to the Rotary Rule
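One way to picture the SPAA-Rotary change is as a different sort key in the output port's grant step. The sketch below is an assumption-laden illustration: the port numbering in `NETWORK_INPUTS`, the function name `pick_winner`, and the timestamp table are hypothetical, and the anti-starvation mechanism that the Rotary Rule relies on is not modeled.

```python
# Hypothetical numbering: input ports 0-3 are the four network ports,
# the remaining ports (memory, cache, I/O) are local.
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def pick_winner(nominating_inputs, last_selected, rotary=False):
    """Choose which nominating input port an output port grants.

    Without the Rotary Rule the least-recently selected input port wins.
    With it, any network input port beats any local input port, and
    least-recently selected only breaks ties within each group.
    """
    def key(inp):
        is_local = inp not in NETWORK_INPUTS
        return (is_local if rotary else False, last_selected[inp])

    return min(nominating_inputs, key=key)


# Network port 0 was served recently (round 5); local port 6 long ago (round 1).
last_selected = {0: 5, 6: 1}
print(pick_winner([0, 6], last_selected))               # 6
print(pick_winner([0, 6], last_selected, rotary=True))  # 0
```

Favoring packets already in the network over newly injected local traffic throttles injection under load, which is the effect the slide attributes to the Rotary Rule.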
20 Simulation Methodology
- Asim
- modeling infrastructure
- detailed timing model of 21364 network
- selected design points validated against RTL
- Traffic Patterns
- 70% three coherence hops, 30% two coherence hops
- random destinations
- other traffic combinations in paper and simulated internally
21 64-Node Network: Base Case
- SPAA outperforms WFA and PIM1
- 24% higher throughput at knee
22 64-Node Network: With Rotary Rule
- Rotary Rule helps both SPAA and WFA
23 Summary and Conclusions
- Arbitration Algorithms
- WFA: Wave Front Arbitration Algorithm (Tamir and Chi, 1993, SGI Spider)
- PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
- SPAA: Simple, Pipelined Arbitration Algorithm (21364)
- SPAA outperforms WFA and PIM1
- SPAA's matching power is similar to WFA and PIM1 (when many output ports are busy)
- SPAA minimizes interactions between ports
- SPAA can be pipelined more effectively
- Rotary Rule
- avoids network saturation under heavy load