Title: Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures
1. Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures
- Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
2. Coarse-Grained Reconfigurable Architecture (CGRA)
[Figure: a processing element (PE) with Config, FU, and LRF blocks]
- Array of PEs connected in a mesh-like interconnect
- Characterized by array size, node functionalities, interconnect, register file configurations
- Execute compute-intensive kernels in multimedia applications
3. CGRA: Attractive Alternative to ASICs
- Suitable for running multimedia applications on embedded systems
  - High computation throughput
  - Low power consumption and scalability
  - High flexibility with fast configuration
- MorphoSys: 8x8 array with RISC processor
  - SIMD-style execution of loops
- PipeRench: 1-D reconfigurable hardware
  - Virtualize hardware pipeline
- ADRES: 8x8 array with tightly coupled VLIW
  - Modulo scheduling with simulated annealing
4. Scheduling in CGRA
- Different from conventional VLIW
  - Sparse interconnect and distributed register files
  - No dedicated routing resources
- Need a good compiler to exploit the abundance of computing resources
[Figure: CGRA (FU0-FU3, each with its own LRF) vs. conventional VLIW (FU0-FU3 sharing a central RF)]
5. Objectives of This Work
- Modulo scheduling technique for CGRAs
  - Exploit loop-level parallelism by overlapping execution of iterations
- Targeting low-cost CGRAs
  - Achieve quality schedule under restriction of hardware
- Fast compilation time
6. Modulo Scheduling Basics
- Expose loop-level parallelism by overlapping execution of iterations
- Initiation interval (II)
  - A new iteration is started every II cycles
[Figure: overlapped execution of loop iterations, with successive iterations offset by II]
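The effect of the initiation interval can be checked with a little arithmetic. The sketch below is only an illustration: the function names and the numbers (100 iterations, a 6-cycle schedule, II = 2) are assumptions, not figures from the talk. It uses the standard relation that iteration i starts at cycle i * II, so the loop finishes after (N - 1) * II + schedule length cycles.

```python
def start_cycles(num_iterations: int, ii: int) -> list[int]:
    """Cycle at which each iteration starts when one is issued every II cycles."""
    return [i * ii for i in range(num_iterations)]


def total_cycles(num_iterations: int, schedule_length: int, ii: int) -> int:
    """Cycles needed to finish all iterations under overlapped execution."""
    if num_iterations == 0:
        return 0
    # The last iteration starts at (N - 1) * II and runs for schedule_length cycles.
    return (num_iterations - 1) * ii + schedule_length


# Hypothetical loop: 100 iterations, each taking 6 cycles, II = 2.
print(start_cycles(4, 2))        # [0, 2, 4, 6]
print(100 * 6)                   # 600 cycles without overlap
print(total_cycles(100, 6, 2))   # 204 cycles with overlap
```

A smaller II means more overlap, which is why the scheduler tries to find a valid schedule at the lowest II the hardware allows.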
7. Modulo Scheduling for CGRA
- Mapping DFG onto 3-D scheduling space
- Limited number of scheduling slots: (number of PEs) x II
- Minimize routing cost (number of slots used for routing)
- Sparse interconnect and distributed register files
- Ensure routability of operands
[Figure: a 4x4 CGRA extended along the time axis into a scheduling space of depth II]
8. Our Approach
- Systematic approach to generate good schedule in reasonable time
- Minimize routing cost
  - Convert scheduling problem into graph embedding
  - Leverage graph embedding algorithm
- Ensure routability of operands
  - Skewed scheduling space
  - Create a narrow, but tall scheduling space
9. (1) Minimize Routing Cost
- Routing cost: number of PEs used for routing
  - Determined by positions of producer and consumer
  - Minimize distance between producers and consumers
- Height-based list scheduling
  - Schedule operations in the order of dependence height
  - Place consumers close to producers
  - Need to carefully place operations at the same height
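The ordering step of height-based list scheduling can be sketched as follows. This is a minimal illustration, not the authors' implementation: the DFG, the `consumers` dictionary layout, and the function names are assumptions. Height is taken here as the longest path to a sink, counting the sink as height 1, which matches the height labels used in the later affinity graph example.

```python
from functools import lru_cache


def dependence_heights(consumers: dict[int, list[int]]) -> dict[int, int]:
    """Height of each op: 1 for a sink, otherwise 1 + the tallest consumer."""

    @lru_cache(maxsize=None)
    def height(op: int) -> int:
        return 1 + max((height(c) for c in consumers[op]), default=0)

    return {op: height(op) for op in consumers}


def schedule_order(consumers: dict[int, list[int]]) -> list[int]:
    """Ops sorted by decreasing height, so producers come before consumers."""
    heights = dependence_heights(consumers)
    return sorted(consumers, key=lambda op: (-heights[op], op))


# Hypothetical DFG: ops 0-3 feed ops 4 and 5, which feed op 6.
dfg = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
print(dependence_heights(dfg))  # {0: 3, 1: 3, 2: 3, 3: 3, 4: 2, 5: 2, 6: 1}
print(schedule_order(dfg))      # [0, 1, 2, 3, 4, 5, 6]
```

The tie-break by op number within a height is arbitrary, which is exactly the weakness the last bullet points at: ops of equal height need a better placement criterion.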
10. Scheduling Example: Routing Cost
[Figure: a DFG with operations 0-6 scheduled onto a 1x4 CGRA (PE 0-PE 3, time 0-3) in two ways; one placement has routing cost 2, the other routing cost 0]
Common consumer information is important!
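A simplified routing cost model for a 1-D array makes the point concrete. Everything here is an assumption for illustration: the DFG, the two placements, and the cost rule (a PE can pass an operand only to an adjacent PE, so an edge spanning d columns needs max(0, d - 1) intermediate routing PEs, with timing ignored). The numbers do not reproduce the slide's figure.

```python
def routing_cost(edges: list[tuple[int, int]], pe_of: dict[int, int]) -> int:
    """Total intermediate PEs needed to carry operands along all DFG edges."""
    return sum(max(0, abs(pe_of[src] - pe_of[dst]) - 1) for src, dst in edges)


# Hypothetical DFG: ops 0 and 1 feed op 4, ops 2 and 3 feed op 5, ops 4 and 5 feed op 6.
edges = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 6), (5, 6)]

# Ops that share a consumer sit next to each other.
clustered = {0: 0, 1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 1}
# Ops 0 and 1 are placed at opposite ends although they share consumer 4.
scattered = {0: 0, 1: 3, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1}

print(routing_cost(edges, clustered))  # 0
print(routing_cost(edges, scattered))  # 1
```

Both placements schedule the top row in a valid height order; only the one that looks ahead to common consumers avoids spending a PE on routing.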
11. Affinity Graph Heuristic
- Consider placement of operations with same height together
  - Use common consumer information
- Affinity value between operations
  - Measured by the distance of common consumers in DFG
- Construct affinity graph
  - Nodes: operations, edges: affinity values
  - Place operations with affinity edges close to each other
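One way to build such an affinity graph is sketched below. The slides only say that affinity is measured by the distance of common consumers, so the weighting used here (each common consumer at distance d contributes 2^(max_dist - d)) is an assumption, as are the function names and the example DFG.

```python
from collections import deque
from itertools import combinations


def descendant_distances(consumers: dict[int, list[int]], op: int) -> dict[int, int]:
    """Shortest distance, in DFG edges, from op to every op that depends on it."""
    dist: dict[int, int] = {}
    queue = deque((c, 1) for c in consumers[op])
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue  # already reached by a path that is at least as short
        dist[node] = d
        queue.extend((c, d + 1) for c in consumers[node])
    return dist


def affinity_graph(consumers, same_height_ops, max_dist=3):
    """Weighted edges between same-height ops that share a consumer nearby."""
    reach = {op: descendant_distances(consumers, op) for op in same_height_ops}
    graph = {}
    for a, b in combinations(sorted(same_height_ops), 2):
        weight = 0
        for common in reach[a].keys() & reach[b].keys():
            d = max(reach[a][common], reach[b][common])
            if d <= max_dist:
                weight += 2 ** (max_dist - d)  # assumed rule: closer consumer, larger weight
        if weight:
            graph[(a, b)] = weight
    return graph


# Hypothetical DFG: ops 0 and 1 feed op 4, ops 2 and 3 feed op 5, ops 4 and 5 feed op 6.
dfg = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
print(affinity_graph(dfg, [0, 1, 2, 3]))
# {(0, 1): 6, (0, 2): 2, (0, 3): 2, (1, 2): 2, (1, 3): 2, (2, 3): 6}
```

Pairs with a direct common consumer get the heaviest edges, so a placement that keeps heavy edges short also keeps those consumers easy to reach.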
12. Affinity Graph Example
[Figure: a DFG with operations 0-5 at heights 3, 2, and 1; the affinity graph derived from it; a bad mapping and a good mapping onto a 2x4 CGRA]
Drawing affinity graph onto scheduling space
13. Leveraging Graph Embedding
- Graph embedding
  - Drawing a graph onto a target space
- Grid layout algorithm by Li and Kurata
  - Embed complicated biochemical networks onto 2-D grid space
  - Simulated annealing
- Our scheduling problem is a graph embedding problem
  - Draw affinity graph onto scheduling space, minimizing edge length
[Figure: Process flow of grid layout (Li 2005)]
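The idea of drawing an affinity graph onto a grid while minimizing edge length can be sketched with a generic simulated annealing loop. This is not the Li and Kurata algorithm and not the scheduler from the talk: the cost function (affinity-weighted Manhattan distance), the swap move, the cooling schedule, and all names are assumptions chosen to keep the example short.

```python
import math
import random


def edge_length_cost(affinity, pos):
    """Sum of affinity-weighted Manhattan distances between placed ops."""
    return sum(w * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
               for (a, b), w in affinity.items())


def anneal_embedding(affinity, ops, rows, cols, steps=2000, seed=0):
    """Place ops on a rows x cols grid (len(ops) <= rows * cols) by annealing."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    # Random initial placement; cells beyond len(ops) start empty (None).
    occupant = {cell: (ops[i] if i < len(ops) else None)
                for i, cell in enumerate(cells)}
    pos = {op: cell for cell, op in occupant.items() if op is not None}
    cost = edge_length_cost(affinity, pos)
    best_pos, best_cost = dict(pos), cost
    temp = 2.0
    for _ in range(steps):
        c1, c2 = rng.sample(cells, 2)
        o1, o2 = occupant[c1], occupant[c2]
        if o1 is None and o2 is None:
            continue  # swapping two empty cells changes nothing
        # Tentatively swap the contents of the two cells.
        occupant[c1], occupant[c2] = o2, o1
        if o1 is not None:
            pos[o1] = c2
        if o2 is not None:
            pos[o2] = c1
        new_cost = edge_length_cost(affinity, pos)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best_pos, best_cost = dict(pos), cost
        else:
            # Reject the move: restore the previous placement.
            occupant[c1], occupant[c2] = o1, o2
            if o1 is not None:
                pos[o1] = c1
            if o2 is not None:
                pos[o2] = c2
        temp = max(0.01, temp * 0.995)
    return best_pos, best_cost


# Hypothetical affinity graph for four same-height ops, embedded on a 1x4 row.
affinity = {(0, 1): 6, (0, 2): 2, (0, 3): 2, (1, 2): 2, (1, 3): 2, (2, 3): 6}
placement, cost = anneal_embedding(affinity, [0, 1, 2, 3], rows=1, cols=4)
print(placement, cost)
```

On this small input the lowest-cost layouts keep ops 0 and 1 adjacent and ops 2 and 3 adjacent, which is the behavior the affinity edges are meant to encourage.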
14. (2) Ensure Routability of Operands
- Resources are repeatedly used every II cycles
  - Routing can fail due to previously scheduled operations
- Backtracking: hard to make forward progress for CGRA
  - Take preventative approach
[Figure: a DFG with operations 0-7 modulo-scheduled onto a 1x3 CGRA (PE 0-PE 2, time 0-5, wrapping every II cycles)]
Routing failed for Op 7!
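Why earlier placements can block later routing follows from how slots wrap around: a slot at time t on a PE is the same physical resource as the slot at time t + II. A modulo reservation table captures this. The class below is a minimal sketch with assumed names and an assumed 1x3 array with II = 2; it is not the data structure from the talk.

```python
class ModuloReservationTable:
    """Tracks which (PE, time mod II) slots are already taken."""

    def __init__(self, num_pes: int, ii: int):
        self.ii = ii
        self.slots = [[None] * num_pes for _ in range(ii)]

    def is_free(self, pe: int, time: int) -> bool:
        return self.slots[time % self.ii][pe] is None

    def reserve(self, pe: int, time: int, label: str) -> bool:
        """Claim the slot; return False if it is taken in any iteration."""
        if not self.is_free(pe, time):
            return False
        self.slots[time % self.ii][pe] = label
        return True


# Hypothetical 1x3 CGRA with II = 2.
mrt = ModuloReservationTable(num_pes=3, ii=2)
print(mrt.reserve(1, 0, "op0"))          # True: slot (PE 1, time 0) was free
print(mrt.reserve(1, 2, "route op7"))    # False: time 2 wraps onto time 0
print(mrt.reserve(1, 3, "route op7"))    # True: time 3 wraps onto time 1, still free
```

An operation scheduled early can therefore occupy exactly the slot a later operand would need to pass through, and undoing that choice means backtracking.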
15. Skewed Scheduling Space
- Should prevent routing failures in advance
- Skew scheduling space
  - Staggering down to the right
  - Create a narrow, but tall scheduling space
  - Operations can be routed to the right
- Dynamically adjust scheduling space
[Figure: skewed scheduling space over PE 0-PE 2, time 0-5]
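One possible reading of "staggering down to the right" is that each PE column becomes usable one cycle later than its left neighbor. The sketch below encodes that reading; the window rule, the `skew` parameter, and the function names are assumptions, and the dynamic adjustment mentioned above is not modeled.

```python
def skewed_space(num_pes: int, height: int, skew: int = 1) -> set[tuple[int, int]]:
    """Slots (pe, time) where PE p is usable from time p * skew for `height` cycles."""
    return {(pe, t)
            for pe in range(num_pes)
            for t in range(pe * skew, pe * skew + height)}


def can_route_right(space: set[tuple[int, int]], pe: int, time: int) -> bool:
    """True if a value at (pe, time) can be handed to the right neighbor next cycle."""
    return (pe + 1, time + 1) in space


rectangular = {(pe, t) for pe in range(3) for t in range(4)}
skewed = skewed_space(num_pes=3, height=4)

# In the rectangular space the last row has nowhere to go.
print(can_route_right(rectangular, 0, 3))  # False
# In the skewed space every slot left of the last PE has a right-hand successor.
print(all(can_route_right(skewed, pe, t) for pe, t in skewed if pe < 2))  # True
```

Under this assumption a value produced anywhere except the rightmost column always has a legal next hop inside the space, which is the preventative property the slide describes.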
16. System Flow
17. Experimental Setup
- Twelve innermost loop kernels from various domains
- Three designs with different RF configurations
  - Evaluate the impact of register file sharing
[Figure: the three register file configurations: Dedicated RF, Shared RF, Central RF]
18. Evaluation of Affinity Heuristic
- Results of acyclic scheduling
- Average of 59% reduction in routing cost
19. Modulo Graph Embedding vs. Simulated Annealing
- Utilization: (slots used for computation) / (total slots)
- Time: (under 5 sec) vs. (5 min to 3 hours)
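The utilization metric is a plain ratio over the modulo scheduling space, where the total number of slots is (number of PEs) x II. The numbers in this sketch (a 4x4 array, II = 3, 30 operations) are made up for illustration and are not results from the talk.

```python
def utilization(compute_slots: int, num_pes: int, ii: int) -> float:
    """Fraction of the (number of PEs) x II slots that perform computation."""
    total_slots = num_pes * ii
    return compute_slots / total_slots


# Hypothetical schedule: 30 operations on a 4x4 CGRA at II = 3.
print(utilization(30, num_pes=16, ii=3))  # 0.625
```

Slots spent on routing count toward the total but not toward computation, so a lower routing cost translates directly into higher utilization at a given II.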
20. Impact of Register File Configurations
21. Conclusions
- Modulo scheduler targeting low-cost CGRAs
  - Provide high computation throughput, scalability, power efficiency
- Two heuristics to generate a good schedule
  - Affinity graph heuristic
  - Skewed scheduling space
- Average utilizations of 56-68% for three designs
- Systematic approach allows fast compilation time
  - All benchmarks finished within 5s
22. Questions?