1

A Comparative Study of Arbitration Algorithms for
the Alpha 21364 Pipelined Router

Tenth International Conference on Architectural
Support for Programming Languages and Operating
Systems (ASPLOS), October 2002
Shubu Mukherjee, Federico Silla*, Peter Bannon,
Joel Emer, Steve Lang, Dave Webb
(acknowledgment: Richard Kessler)
Intel, UPV*, HP
2
Alpha 21364 Network
(Figure: network of 21364 chips, each including a router, each with
attached Rambus memory (M) and I/O (IO).)
3
The Alpha 21364 8x7 Router
Crossbar between Input Ports and Output Ports
Distributed Arbitration Algorithm Controls the
Crossbar
  • 8 input ports: 4 network, 2 memory, 1 cache, 1
    I/O
  • 7 output ports: 4 network, 2 memory/cache, 1 I/O
  • Router pipeline length: 13/14 cycles
  • Virtual Cut-Through

4
Problem: Maximize Matches
(Table: packets waiting at each input port. Numbers are the destination
output ports. Axis label: older packet at input port.)
Input Port 0: 2, 3
Input Port 1: 1, 3
Input Port 2: 1, 2
Input Port 3: 1, 2, 3
Input Port 4: 1, 3
Input Port 5: 0, 2
Input Port 6: 4, 2
Input Port 7: 5, 2
  • Oldest Packet First: one match
  • Smarter algorithm (shaded boxes): 7 matches
    (perfect)
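
The gap between "oldest packet first" and a smarter matcher can be sketched in a few lines. The example below is a minimal illustration, not the router's logic: the request table `QUEUES` is hypothetical (it is not the slide's table), and the function names are made up for this sketch. It compares a policy in which every input port offers only its oldest packet against a standard augmenting-path bipartite matching that may pick any queued packet.

```python
from typing import Dict, List

# Hypothetical request table (NOT the slide's data): for each input port,
# the destination output ports of its queued packets, oldest first.
QUEUES: Dict[int, List[int]] = {
    0: [2, 0],
    1: [2, 1],
    2: [2, 3],
    3: [2],
}


def oldest_first(queues: Dict[int, List[int]]) -> Dict[int, int]:
    """Each input offers only its oldest packet; an output keeps the first offer."""
    match: Dict[int, int] = {}  # output port -> input port
    for inp in sorted(queues):
        if queues[inp]:
            match.setdefault(queues[inp][0], inp)
    return match


def maximum_matching(queues: Dict[int, List[int]]) -> Dict[int, int]:
    """Augmenting-path bipartite matching over all queued packets."""
    match: Dict[int, int] = {}  # output port -> input port

    def try_assign(inp: int, seen: set) -> bool:
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or re-route the input currently holding it.
            if out not in match or try_assign(match[out], seen):
                match[out] = inp
                return True
        return False

    for inp in sorted(queues):
        try_assign(inp, set())
    return match


print(len(oldest_first(QUEUES)), "match(es) with oldest packet first")
print(len(maximum_matching(QUEUES)), "match(es) with maximum matching")
```

In this hypothetical table every oldest packet targets output port 2, so the simple policy matches one pair while the maximum matching serves all four input ports.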

5
Simpler Algorithms Have Fewer Matches
(Chart; axis label: complexity. Assumes all output ports are free.)
6
Complexity may not pay off
(Chart; axis label: complexity. Measured at 30% input buffer occupancy.)
7
Key Results
  • Arbitration Algorithms
  • WFA: Wave Front Arbitration algorithm (Tamir &
    Chi, 1993; SGI Spider)
  • PIM1: Parallel Iterative Matching with one
    iteration (Anderson et al., ASPLOS 1992)
  • SPAA: Simple, Pipelined Arbitration Algorithm
    (21364)
  • SPAA outperforms WFA and PIM1
  • SPAA's matching power is similar to WFA and PIM1
    (when many output ports are busy)
  • SPAA minimizes interactions between ports
  • SPAA can be pipelined more effectively
  • Rotary Rule
  • avoids network saturation under very heavy load

8
Wave Front Arbiter (WFA)
  • Proposed by Tamir & Chi, 1993
  • used in the SGI Spider/Origin switch
  • Implement via connection matrix

(Figure: connection matrix with one row per input port, 0 to 3, and one
column per output port. Each cell computes:)
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
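
A minimal software sketch of the connection-matrix idea follows. It assumes the usual wavefront cell logic (Grant = Request AND N AND W, S = N AND NOT Grant, E = W AND NOT Grant) and walks the cells in row-major order from a rotating start cell, which respects the same north/west dependencies that hardware resolves along diagonal waves. The function name and the `start_row`/`start_col` parameters are illustrative assumptions, not taken from the slides.

```python
from typing import List, Tuple


def wavefront_arbitrate(
    request: List[List[bool]], start_row: int = 0, start_col: int = 0
) -> List[Tuple[int, int]]:
    """Return granted (input, output) pairs for one arbitration round.

    request[i][j] is True when input port i wants output port j.
    The start cell has top priority; rotating it changes who wins ties.
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols  # N token per column: output port still free
    grants: List[Tuple[int, int]] = []
    for k in range(rows):
        i = (start_row + k) % rows
        west = True  # W token per row: input port still unmatched
        for l in range(cols):
            j = (start_col + l) % cols
            grant = request[i][j] and north[j] and west
            if grant:
                grants.append((i, j))
            north[j] = north[j] and not grant  # S = N & NOT(Grant)
            west = west and not grant          # E = W & NOT(Grant)
    return sorted(grants)


requests = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
]
print(wavefront_arbitrate(requests))               # start cell (0, 0)
print(wavefront_arbitrate(requests, start_row=1))  # start cell (1, 0)
```

Moving the start cell changes which input port wins a contested output, which is the fairness knob the matrix offers.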
9
WFA: Advantage and Pipeline
  • High degree of interaction among output ports
  • reduces arbitration collisions, improves number of
    matches
  • Algorithm (implemented via a connection matrix)
  • (1) Select packet at input port and load matrix
    (1.5 cycles)
  • (2) Run through matrix and inform input ports
    (1.5 cycles)
  • (3) Forward arbitration to output ports (1 cycle)

10
WFA Limitations
  • - Higher number of estimated cycles
  • 4 cycles in 0.18 micron
  • - Harder to pipeline effectively
  • micropipelining waves (2) is difficult because
    initial cell changes every cycle
  • restarting (1) before (2) completes is complex
  • large in-flight packet table due to large number
    of nominations (up to 54)
  • may require multiple copies of matrix to buffer
    pipeline stages (these must avoid stale
    nominations)

11
Parallel Iterative Matching (PIM)
  • Steps in One Iteration (PIM1)
  • Nominate: each input port nominates packets for
    every output port (same packet nominated multiple
    times)
  • Grant: each unmatched output port selects an input
    port packet randomly
  • Accept: each unselected input port selects a grant
    randomly

(Figure: Nominate, Grant, and Accept steps between input ports 0 and 1
and output ports 0 and 1. Output Port 0 is unused in this arbitration
round.)
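
The three PIM1 steps can be sketched as follows. This is a minimal illustration under stated assumptions: requests are given as a map from input port to the set of output ports it has packets for, all outputs start unmatched, and the function name and `rng` parameter are hypothetical.

```python
import random
from typing import Dict, List, Set


def pim_one_iteration(
    requests: Dict[int, Set[int]], num_outputs: int, rng: random.Random
) -> Dict[int, int]:
    """One Nominate/Grant/Accept iteration; returns input port -> output port."""
    # Nominate: every input nominates to every output it has a packet for.
    nominations: Dict[int, List[int]] = {out: [] for out in range(num_outputs)}
    for inp in sorted(requests):
        for out in sorted(requests[inp]):
            nominations[out].append(inp)

    # Grant: each output with nominations picks one nominating input at random.
    grants: Dict[int, List[int]] = {}
    for out, inputs in nominations.items():
        if inputs:
            grants.setdefault(rng.choice(inputs), []).append(out)

    # Accept: each input that holds grants picks one of them at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Both inputs want both outputs. When both outputs grant the same input,
# that input accepts only one grant and the other output stays idle.
reqs = {0: {0, 1}, 1: {0, 1}}
print(pim_one_iteration(reqs, num_outputs=2, rng=random.Random(1)))
```

The Accept step exists only because an input may receive several grants, which is the extra interaction the slides count against PIM1.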
12
PIM1: Advantage and Pipeline
  • High interaction between input and output ports
  • reduces arbitration collisions, improves number of
    matches
  • Algorithm (implemented via connection matrix)
  • (1) Select packet at input port and load matrix
    (1.5 cycles)
  • (2) Run through matrix and inform input ports
    (1.5 cycles)
  • (3) Forward arbitration to output ports (1 cycle)

13
PIM1 Limitations
  • - Higher number of estimated cycles
  • 4 cycles in 0.18 micron
  • - Harder to pipeline effectively
  • restarting (1) before (2) completes is complex
  • same packet can be nominated multiple times
    requiring the Accept step (part of stage 2)
  • large in-flight packet table due to large number
    of nominations (up to 54)
  • may require multiple copies of matrix to buffer
    pipeline stages (these must avoid stale
    nominations)

14
Simple, Pipelined Arbitration Algorithm
(SPAA), used in the Alpha 21364 Router
  • Algorithm
  • Nominate: each input port nominates packets for
    exactly one output port (one packet nominated
    only once)
  • Grant: each output port selects an input port
    packet based on the least-recently selected one
  • Reset: input ports reset state of all unselected
    packets and renominate them in subsequent cycles

(Figure: arbitration steps between input ports 0 and 1 and output ports
0 and 1. Step labels: Nominate, Grant, Accept, Reset.)
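
One SPAA round can be sketched as below. This is a minimal illustration, not the 21364 implementation: it assumes each input port has already picked the single packet it nominates, and it tracks "least recently selected" with a per-output list ordered from least to most recently selected. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def spaa_round(
    nominations: Dict[int, int], lrs: Dict[int, List[int]]
) -> Tuple[Dict[int, int], List[int]]:
    """One Nominate/Grant/Reset round.

    nominations: input port -> the one output port it nominates to.
    lrs: output port -> input ports ordered least recently selected first
         (updated in place when an input wins).
    Returns (granted input -> output, inputs that must reset and renominate).
    """
    # Nominate: group the single nomination of each input by output port.
    by_output: Dict[int, List[int]] = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    granted: Dict[int, int] = {}
    reset: List[int] = []
    for out, inputs in by_output.items():
        # Grant: the output picks the least recently selected nominator.
        winner = min(inputs, key=lrs[out].index)
        lrs[out].remove(winner)
        lrs[out].append(winner)  # the winner becomes most recently selected
        granted[winner] = out
        # Reset: losers clear their state and renominate in a later cycle.
        reset.extend(i for i in inputs if i != winner)
    return granted, sorted(reset)


lrs = {0: [0, 1, 2], 1: [0, 1, 2]}
print(spaa_round({0: 0, 1: 0, 2: 1}, lrs))  # inputs 0 and 1 collide on output 0
print(spaa_round({0: 0, 1: 0}, lrs))        # input 1 wins the rematch
```

Because an input nominates to exactly one output, no Accept step is needed: a grant is final as soon as the output port decides.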
15
SPAA's Simplicity
  • Low degree of interaction among ports
  • - increases arbitration collisions
  • reduces complexity
  • Algorithm (no centralized matrix)
  • (1) Select packet at input port and load matrix (1
    cycle)
  • (2) Forward packets to output ports (1 cycle)
  • (3) Output ports select packets and return
    feedback to input ports (1 cycle)

16
SPAA's Advantages
  • Fewer cycles
  • 3 cycles in 0.18 micron
  • Speculatively read out input buffer
  • prior to output port arbitration
  • because only one packet is nominated to one
    output port
  • Easier to pipeline
  • restart (1) for free input ports before (2)
    completes
  • only one packet nominated to one output port
  • small number (16) of in-flight packets
  • avoids any centralized matrix
  • speculative read allows data flits to follow
    header flits

17
Summary: Simpler is Better
18
Saturation Behavior
  • Reasons: hot spots, tree saturation
  • 21364's router shows a cyclic pattern (link
    utilization with time)
  • Ideally, operate at saturation bandwidth
  • Solution: throttle input load

19
Rotary Rule
  • 21364's built-in throttling
  • maximum outstanding cache miss requests per
    processor: 16
  • Rotary Rule: more throttling
  • 21364 is a direct network
  • Rotary Rule prioritizes traffic in network
    ports over local ports
  • also clears network congestion
  • relies on anti-starvation mechanism
  • WFA with Rotary Rule: change first cell
  • SPAA with Rotary Rule: change output port priority
    to the Rotary Rule
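
For SPAA, the Rotary Rule amounts to a different selection policy at each output port. The sketch below is an assumption-laden illustration: the port numbering (inputs 0 to 3 as network ports, 4 to 7 as local ports) is hypothetical, least-recently-selected order is used only as a tie-breaker, and the anti-starvation mechanism the rule relies on is omitted.

```python
from typing import List, Sequence

# Hypothetical numbering: inputs 0-3 are network ports; inputs 4-7 are the
# local ports (memory, cache, I/O).
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_select(nominators: Sequence[int], lrs_order: List[int]) -> int:
    """Pick a winner at one output port under the Rotary Rule.

    Traffic already in the network (arriving on network input ports) beats
    traffic from local ports; ties go to the least recently selected input.
    Anti-starvation for local ports is NOT modeled here.
    """
    network = [i for i in nominators if i in NETWORK_INPUTS]
    candidates = network or list(nominators)
    return min(candidates, key=lrs_order.index)


lrs_order = [4, 0, 1, 2, 3, 5, 6, 7]  # least recently selected first
print(rotary_select([4, 2, 1], lrs_order))  # a network port wins over local port 4
print(rotary_select([4, 5], lrs_order))     # only local ports compete
```

Preferring in-network traffic drains packets that already hold router buffers before new ones are injected, which is how the rule acts as extra throttling.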

20
Simulation Methodology
  • Asim
  • modeling infrastructure
  • detailed timing model of 21364 network
  • selected design points validated against RTL
  • Traffic Patterns
  • 70% three coherence hops, 30% two coherence hops
  • random destinations
  • other traffic combinations in paper and simulated
    internally

21
64 Node Network Base Case
  • SPAA outperforms WFA and PIM1
  • 24% higher throughput at knee

22
64 Node Network With Rotary Rule
  • Rotary Rule helps both SPAA and WFA

23
Summary and Conclusions
  • Arbitration Algorithms
  • WFA: Wave Front Arbitration algorithm (Tamir &
    Chi, 1993; SGI Spider)
  • PIM1: Parallel Iterative Matching with one
    iteration (Anderson et al., ASPLOS 1992)
  • SPAA: Simple, Pipelined Arbitration Algorithm
    (21364)
  • SPAA outperforms WFA and PIM1
  • SPAA's matching power is similar to WFA and PIM1
    (when many output ports are busy)
  • SPAA minimizes interactions between ports
  • SPAA can be pipelined more effectively
  • Rotary Rule
  • avoids network saturation under heavy load