Title: O1TURN : Near-Optimal Worst-Case Throughput Routing for 2D-Mesh Networks
1. O1TURN: Near-Optimal Worst-Case Throughput Routing for 2D-Mesh Networks
- DaeHo Seo, Akif Ali, WonTaek Lim
- Nauman Rafique, Mithuna Thottethodi
- School of Electrical and Computer Engineering
- Purdue University
2. Motivation
- New routing algorithm for 2D mesh networks: O1TURN
- Why 2D mesh networks?
- Important class of interconnection network
- Natural topology for on-chip networks
- Many applications
- Yet another routing algorithm?
3. Routing Algorithm Objectives
- Maximize throughput and minimize latency
- O1TURN satisfies all design goals
IDEAL DOR ROMM VALIANT MIN-ADAPTIVE
Average case throughput X X X
Worst case Throughput X X ?
Minimal # of network hops X X X X
Low complexity router X X X
4. Challenges
- Intuition: path flexibility, load balancing, and throughput are correlated
- Prior results
- Throughput: increasing path flexibility [SPAA 2002]
- May not improve worst-case throughput, and may even decrease it
- Likely to improve average-case throughput
- Latency: increasing path flexibility may increase router complexity
IDEAL DOR ROMM VALIANT MIN-ADAPTIVE
Average case throughput X X X
Worst case Throughput X X ?
Minimal # of network hops X X X X
Low complexity router X X X
# of Paths ? 1 $\Theta(K^2)$ $\Theta(K^2)$ $\Theta(2^K)$
5. Contributions
- Develop a new routing algorithm: O1TURN
- Throughput
- Better than DOR / ROMM for worst-case throughput
- Near-optimal worst-case throughput for 2D mesh
- Captures most of the opportunity for average-case throughput with limited path flexibility
- O1TURN (with 2 paths) is as good as ROMM (with $\Theta(K^2)$ paths)
- Latency
- Router implementation for O1TURN
- Comparable complexity to a simple DOR router
- Key point: partition the delay-critical circuitry
- O1TURN is minimal: one goal trivially satisfied
6. Outline
- Background of interconnection network
- O1TURN routing algorithm
- O1TURN router implementation
- Simulation Results
- Conclusion and Q&A
8. Background
- Packet-switched 2D mesh network
- Each packet is routed independently
- Terminology
- Network radix: k in a k x k network (not the degree)
- Simplifying assumptions for this talk
- One packet crosses a link in one cycle
- Square mesh networks (K x K)
- K is even ($K = 2p$)
- Analytical method for throughput analysis
- TD method [Towles and Dally, SPAA 2002]
- Worst-case throughput = (maximum channel load)$^{-1}$
- Given a permutation and an (oblivious) routing algorithm: find the maximum channel load
- Given only an (oblivious) routing algorithm: find the permutation that causes the maximum channel load
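The two uses of the TD method listed above can be written compactly. The symbols here are assumptions introduced only for illustration and do not come from the slides: $R$ is an oblivious routing algorithm, $\pi$ a permutation, $C$ the set of channels, and $\gamma_c(R,\pi)$ the expected load on channel $c$.

```latex
\gamma_{\max}(R,\pi) = \max_{c \in C} \gamma_c(R,\pi), \qquad
\Theta(R,\pi) = \frac{1}{\gamma_{\max}(R,\pi)}, \qquad
\Theta_{\mathrm{wc}}(R) = \min_{\pi} \Theta(R,\pi)
                       = \frac{1}{\max_{\pi} \gamma_{\max}(R,\pi)}
```

The first two expressions correspond to a given permutation and routing algorithm; the last corresponds to searching for the permutation that maximizes the channel load.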
9. TD-Method Example
Unit of worst-case throughput: packets / node / cycle
Traffic (Src -> Dst): A -> D, D -> A
- Single path per pair: A -> B -> D, D -> C -> A
- Max channel load = 1
- Worst-case throughput = 1 / 1 = 1
- Two paths per pair: A -> B -> D or A -> C -> D; D -> B -> A or D -> C -> A
- Max channel load = 0.5
- Worst-case throughput = 1 / 0.5 = 2
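The arithmetic of this four-node example can be checked with a few lines of Python. This is a minimal sketch under assumptions that are not stated on the slide: each source injects one packet per cycle, links are full duplex (so A -> B and B -> A are separate channels), and a flow is split evenly across its paths. The function name `max_channel_load` is hypothetical.

```python
from collections import defaultdict

def max_channel_load(flows):
    """flows: list of (rate, paths). Each path is a list of node names and
    the flow's rate is split evenly across its paths (an assumption)."""
    load = defaultdict(float)
    for rate, paths in flows:
        share = rate / len(paths)
        for path in paths:
            # Each consecutive node pair (u, v) is one directed channel.
            for channel in zip(path, path[1:]):
                load[channel] += share
    return max(load.values())

# Traffic A -> D and D -> A, one path per pair.
single_path = [
    (1.0, [["A", "B", "D"]]),
    (1.0, [["D", "C", "A"]]),
]
# Same traffic, each pair split over two paths.
two_paths = [
    (1.0, [["A", "B", "D"], ["A", "C", "D"]]),
    (1.0, [["D", "B", "A"], ["D", "C", "A"]]),
]

for name, flows in [("single path", single_path), ("two paths", two_paths)]:
    load = max_channel_load(flows)
    print(f"{name}: max channel load = {load}, throughput = {1 / load}")
```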
11. O1TURN routing algorithm
- Orthogonal 1-TURN routing
- There is no U-turn => Orthogonal
- At most 1 turn => 1TURN
- Use 2 routes
- At most 2 minimal, 1-turn routes in a 2D mesh (XY, YX)
- Two routing algorithms (XY routing, YX routing)
- Each chosen with the same probability
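The route selection described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names (`dor_route`, `o1turn_route`, `count_turns`) and the `(x, y)` coordinate convention are assumptions, and the choice is made once per packet at the source.

```python
import random

def dor_route(src, dst, order):
    """Nodes visited by dimension-order routing; order is "xy" or "yx"."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    for dim in order:
        if dim == "x":
            while x != dx:
                x += 1 if dx > x else -1
                path.append((x, y))
        else:
            while y != dy:
                y += 1 if dy > y else -1
                path.append((x, y))
    return path

def o1turn_route(src, dst, rng=random):
    """Pick XY or YX routing with equal probability for this packet."""
    order = rng.choice(("xy", "yx"))
    return order, dor_route(src, dst, order)

def count_turns(path):
    """Number of direction changes along a path of mesh nodes."""
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for s, t in zip(steps, steps[1:]) if s != t)

order, path = o1turn_route((0, 3), (2, 0), random.Random(0))
print(order, path, "turns:", count_turns(path))
```

Both candidate routes are minimal (their length equals the Manhattan distance) and contain at most one turn, which is what the name O1TURN refers to.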
12. O1TURN routing algorithm
- Claim: the maximum channel load of O1TURN is $K / 2$
- Proof: two sources of load contribute to a channel
- XY routing: # of nodes on the left side of the channel, $N \times 0.5$
- YX routing: # of nodes on the right side of the channel, $(K - N) \times 0.5$
- Total: $0.5\,N + 0.5\,(K - N) = K / 2$
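The $K / 2$ bound can be checked numerically on small meshes. The sketch below is an assumption-laden illustration rather than the TD method itself: it evaluates only two fixed permutations (matrix transpose and bit complement) instead of searching for the worst one, models DOR as XY routing, and uses hypothetical helper names.

```python
from collections import defaultdict

def dor_channels(src, dst, order):
    """Directed channels used by dimension-order routing ("xy" or "yx")."""
    x, y = src
    dx, dy = dst
    channels = []
    for dim in order:
        if dim == "x":
            while x != dx:
                nx = x + (1 if dx > x else -1)
                channels.append(((x, y), (nx, y)))
                x = nx
        else:
            while y != dy:
                ny = y + (1 if dy > y else -1)
                channels.append(((x, y), (x, ny)))
                y = ny
    return channels

# A routing algorithm is a list of (probability, dimension order) pairs.
DOR = [(1.0, "xy")]
O1TURN = [(0.5, "xy"), (0.5, "yx")]

def max_channel_load(k, permutation, routing):
    """Expected load on the busiest channel, one packet per node per cycle."""
    load = defaultdict(float)
    for x in range(k):
        for y in range(k):
            src = (x, y)
            dst = permutation(src, k)
            for prob, order in routing:
                for channel in dor_channels(src, dst, order):
                    load[channel] += prob
    return max(load.values(), default=0.0)

def transpose(src, k):
    return (src[1], src[0])

def bit_complement(src, k):
    return (k - 1 - src[0], k - 1 - src[1])

K = 4
for pname, perm in [("transpose", transpose), ("bit complement", bit_complement)]:
    for rname, routing in [("DOR", DOR), ("O1TURN", O1TURN)]:
        print(pname, rname, max_channel_load(K, perm, routing))
```

On a 4 x 4 mesh, transpose loads the busiest DOR channel with 3 packets per cycle but only 1.5 under O1TURN, while bit complement reaches the $K / 2 = 2$ bound for O1TURN.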
13. Optimal Worst Case Throughput
- Maximum channel load: $K / 2$
- Worst-case throughput: $2 / K$ by the TD method
- Consider a permutation where 100% of packets cross the bisection of a K x K mesh
- Throughput ($X$) is bounded when the bisection links are saturated
- $X \cdot (K^2 / 2) \le K$
- $X \le 2 / K$ packets / node / cycle
- When K is odd, O1TURN is within $(1 / K^2)$ of the optimal worst-case throughput
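The bisection argument can be spelled out as a short worked derivation. It assumes an even $K$, so that each half of the mesh holds $K^2 / 2$ nodes, and it counts the $K$ channels that cross the bisection in each direction; the numeric case $K = 8$ is chosen only as an illustration.

```latex
\underbrace{\frac{K^2}{2}}_{\text{nodes on one side}} \cdot X
  \;\le\; \underbrace{K}_{\text{bisection channels per direction}}
\quad\Longrightarrow\quad
X \le \frac{2}{K},
\qquad \text{e.g. } K = 8:\; X \le \frac{2}{8} = 0.25 \ \text{packets / node / cycle}
```

Since the maximum channel load of O1TURN is $K / 2$, its worst-case throughput $2 / K$ meets this bound for even $K$.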
14. Worst-case Throughput Trends
- Worst-case channel load as network size changes
- Normalized to optimal worst-case throughput
- Worst-case throughput of DOR and ROMM degrades with K
- Recall: even radix: Opt $\times\, 1$; odd radix: Opt $\times\, (1 - 1 / K^2)$
15. Average Case Analysis
- Extension of the TD method [B. Towles et al., SPAA 2003]
- Examine randomly chosen permutations
- Harmonic mean of the worst-case throughput over the permutations
- 1 M random permutations
- O1TURN shows better or equal average-case throughput
Average case throughput   DOR   ROMM    O1TURN
4 x 4 2D MESH             1     1.113   1.136
8 x 8 2D MESH             1     1.180   1.188
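The averaging procedure on this slide can be approximated with a small Monte Carlo sketch. Several things here are assumptions rather than the authors' setup: the helper names, the fixed seed, the much smaller sample count than the 1 M permutations used in the talk, and the omission of ROMM. The sketch makes no attempt to reproduce the numbers in the table.

```python
import random
from collections import defaultdict

def dor_channels(src, dst, order):
    """Directed channels used by dimension-order routing ("xy" or "yx")."""
    x, y = src
    dx, dy = dst
    channels = []
    for dim in order:
        if dim == "x":
            while x != dx:
                nx = x + (1 if dx > x else -1)
                channels.append(((x, y), (nx, y)))
                x = nx
        else:
            while y != dy:
                ny = y + (1 if dy > y else -1)
                channels.append(((x, y), (x, ny)))
                y = ny
    return channels

DOR = [(1.0, "xy")]
O1TURN = [(0.5, "xy"), (0.5, "yx")]

def max_channel_load(dest_of, routing):
    """Busiest-channel load for one permutation given as a src -> dst dict."""
    load = defaultdict(float)
    for src, dst in dest_of.items():
        for prob, order in routing:
            for channel in dor_channels(src, dst, order):
                load[channel] += prob
    return max(load.values(), default=0.0)

def average_case_throughput(k, routing, samples, seed=0):
    """Harmonic mean of 1 / max_channel_load over random permutations."""
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(k) for y in range(k)]
    total_load = 0.0
    for _ in range(samples):
        dests = nodes[:]
        rng.shuffle(dests)
        total_load += max_channel_load(dict(zip(nodes, dests)), routing)
    # Harmonic mean of t_i = 1 / load_i equals samples / sum(load_i).
    return samples / total_load

dor = average_case_throughput(4, DOR, samples=1000)
o1turn = average_case_throughput(4, O1TURN, samples=1000)
print("O1TURN relative to DOR:", o1turn / dor)
```

Writing the harmonic mean as `samples / total_load` avoids dividing by the load of any single permutation, which matters only in the unlikely event that a sampled permutation is the identity.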
16. O1TURN Summary
- Near-optimal worst-case throughput
- By the TD method
- Optimal for even K
- Approaches optimal for large, odd K
- Average-case throughput
- Better than DOR and comparable to ROMM
- Minimal # of network hops
- O1TURN is minimal routing
18. Base Router Implementation
- Base router: pipelined virtual channel router
- 4 stages: routing, virtual channel allocation, switch allocation, crossbar & physical channel transfer
- One control block controls all virtual channels
- Critical stage: virtual channel allocation
19. O1TURN Router Implementation
- O1TURN router
- Separate the virtual channels into two virtual networks (VNs)
- One VN for XY routing, the other for YX routing
- Each independent VN is deadlock-free because it uses DOR
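The virtual-network partition can be illustrated with a short sketch. The split shown here (lower half of the VC indices for XY, upper half for YX) and the function name `vcs_for_order` are assumptions made for illustration; the slides state only that the VCs are divided into two virtual networks.

```python
def vcs_for_order(order, num_vcs):
    """VC indices a packet may request on every hop, given its routing order.

    Assumed split: lower half of the VCs form the XY virtual network,
    upper half form the YX virtual network.
    """
    if num_vcs < 2 or num_vcs % 2:
        raise ValueError("need an even number of VCs (at least 2)")
    if order not in ("xy", "yx"):
        raise ValueError("order must be 'xy' or 'yx'")
    half = num_vcs // 2
    return range(0, half) if order == "xy" else range(half, num_vcs)

# With 8 VCs per physical channel, each virtual network gets 4 VCs.
print(list(vcs_for_order("xy", 8)))
print(list(vcs_for_order("yx", 8)))
```

Because a packet never leaves its virtual network, each VC allocator only arbitrates among the VCs of one network, which is one way to read the "partition the delay-critical circuitry" point made earlier in the talk.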
20. Delay Analysis
- Existing router delay model for pipelined routers [Peh and Dally, HPCA 2001]
- Based on the logical effort method [I. Sutherland, B. Sproull, 1999]
- FO4 unit
- Comparable complexity to the DOR router
VCs / PC   DOR VC allocation   DOR SW allocation   O1TURN VC allocation   O1TURN SW allocation
4          17                  14                  14                     14
8          20                  16                  17                     16
21. O1TURN Summary
- Near-optimal worst-case throughput
- Good average-case throughput
- Minimal network hops
- Low-complexity router implementation
- Comparable complexity to the DOR router
                            IDEAL   O1TURN
Average case throughput     X       X
Worst case throughput       X       X
Minimal # of network hops   X       X
Low complexity router       X       X
23. Evaluation Method
- Modified Popnet network simulator [L. Shang, 2003]
- 4x4 2D mesh (8x8 in paper)
- Full-duplex, bidirectional links
- 8 VCs per PC
- 5 flits per packet
- 500 K cycles
- Synthetic traffic: uniform random, bit complement (BC), matrix transpose (MT), hot spot
- Compared with existing routing algorithms
- Oblivious routing algorithms (DOR, ROMM)
- Adaptive routing algorithm (DUATO)
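The four synthetic traffic patterns can be sketched as destination generators. These are common textbook definitions, not code from the modified Popnet simulator: the function names, the `(x, y)` coordinates, the exclusion of self-traffic in the uniform pattern, and the hot-spot interpretation (a fixed fraction of all packets is redirected to a small set of hot nodes) are all assumptions.

```python
import random

def uniform_random(src, k, rng):
    """Destination drawn uniformly from all nodes other than src."""
    while True:
        dst = (rng.randrange(k), rng.randrange(k))
        if dst != src:
            return dst

def bit_complement(src, k):
    """Complement each coordinate; equals bitwise NOT when k is a power of 2."""
    return (k - 1 - src[0], k - 1 - src[1])

def matrix_transpose(src, k):
    """Node (x, y) sends to node (y, x)."""
    return (src[1], src[0])

def hot_spot(src, k, rng, hot_nodes, hot_fraction=0.2):
    """With probability hot_fraction send to a hot node, else uniform random.

    A hot node may occasionally pick itself; a simulator would need to
    decide how to treat that case.
    """
    if rng.random() < hot_fraction:
        return rng.choice(hot_nodes)
    return uniform_random(src, k, rng)

rng = random.Random(0)
print(bit_complement((0, 1), 4), matrix_transpose((1, 3), 4))
print(hot_spot((0, 0), 4, rng, hot_nodes=[(1, 1), (2, 2)]))
```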
24. Simulation Results
- 4 x 4 2D mesh, uniform random traffic pattern
25. Simulation Results
- 4 x 4 2D mesh, matrix transpose traffic pattern
- One of the worst-case traffic patterns for DOR
26. Simulation Results
- 4 x 4 2D mesh, bit complement traffic pattern
- Already balanced traffic pattern
27. Simulation Results
- 4 x 4 2D mesh, hot spot traffic pattern
- 2 nodes have 20% of traffic
28. Simulation Results
- Delay penalty of adaptive routing
- How the complexity of the router implementation affects latency
- Hot spot traffic pattern
30. Related Work
- Routing algorithms
- Valiant [L. G. Valiant et al., ACM 1981]
- ROMM [T. Nesson et al., ACM 1995]
- DUATO [J. Duato et al., 1993]
- Partitioned router implementation
- Mad Postman [Jesshope et al., ISCA 1989]
- PFNF [Upadhyay et al., 1997]
- Analysis methods
- Worst-case [B. Towles et al., 2002]
- Throughput-centric [B. Towles et al., 2003]
- Delay model [L. S. Peh et al., HPCA 2001]
31. Conclusion
- Goals
- Good average-case throughput
- Good or optimal worst-case throughput
- Minimal # of network hops
- Low-complexity router implementation
- O1TURN
- Provides near-optimal worst-case throughput
- Provides better or equal average-case throughput compared with existing routing algorithms
- Minimal # of network hops
- Simple router implementation, comparable with the DOR router
- Satisfies all performance aspects
32. Q&A