Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures

1
Modulo Graph Embedding: Mapping Applications onto
Coarse-Grained Reconfigurable Architectures
  • Hyunchul Park, Kevin Fan,
  • Manjunath Kudlur, Scott Mahlke

Advanced Computer Architecture Lab, University of Michigan
2
Coarse-Grained Reconfigurable Architecture (CGRA)
[Figure: a processing element (PE), labeled Config, FU, LRF]
  • Array of PEs connected in a mesh-like
    interconnect
  • Characterized by array size, node
    functionalities, interconnect, register file
    configurations
  • Execute compute-intensive kernels in multimedia
    applications

3
CGRA: Attractive Alternative to ASICs
  • Suitable for running multimedia applications on
    embedded systems
  • High computation throughput
  • Low power consumption and scalability
  • High flexibility with fast configuration
  • MorphoSys: 8x8 array with RISC processor
  • SIMD style execution of loops
  • PipeRench: 1-D reconfigurable hardware
  • Virtualize hardware pipeline
  • ADRES: 8x8 array with tightly coupled VLIW
  • Modulo scheduling with simulated annealing

4
Scheduling in CGRA
  • Different from conventional VLIW
  • Sparse interconnect and distributed register
    files
  • No dedicated routing resources
  • Need a good compiler to exploit the abundance of
    computing resources

[Figure: CGRA (FU0-FU3, each with its own LRF) vs. conventional VLIW (FU0-FU3 with a central RF)]
5
Objectives of This Work
  • Modulo scheduling technique for CGRAs
  • Exploit loop-level parallelism by overlapping
    execution of iterations
  • Targeting low-cost CGRAs
  • Achieve quality schedule under restriction of
    hardware
  • Fast compilation time

6
Modulo Scheduling Basics
  • Expose loop-level parallelism by overlapping
    execution of iterations
  • Initiation interval (II)
  • A new iteration is initiated every II cycles
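
A minimal sketch of the overlap, assuming a hypothetical three-operation loop body in which every operation has a fixed cycle offset inside its iteration. The function names and the example loop body are illustrative assumptions, not taken from the presentation.

```python
def start_cycle(iteration: int, offset: int, ii: int) -> int:
    """Cycle at which an operation runs: iteration i begins at cycle i * II."""
    return iteration * ii + offset


def ops_in_cycle(cycle: int, offsets: dict[str, int], ii: int,
                 iterations: int) -> list[tuple[int, str]]:
    """All (iteration, op) pairs that execute in the given cycle."""
    return [
        (i, op)
        for i in range(iterations)
        for op, off in offsets.items()
        if start_cycle(i, off, ii) == cycle
    ]


# Hypothetical loop body: operation name -> cycle offset within one iteration.
offsets = {"load": 0, "mul": 1, "store": 2}

# With II = 1, cycle 2 holds work from three different iterations.
print(ops_in_cycle(2, offsets, ii=1, iterations=4))
# -> [(0, 'store'), (1, 'mul'), (2, 'load')]
```

A smaller II means more iterations are in flight at once, which is where the loop-level parallelism comes from.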

[Figure: overlapped execution of loop iterations, spaced II cycles apart]
7
Modulo Scheduling for CGRA
  • Mapping DFG onto 3-D scheduling space
  • Limited number of scheduling slots: (number of
    PEs) x II
  • Minimize routing cost (number of slots used for
    routing)
  • Sparse interconnect and distributed register
    files
  • Ensure routability of operands
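
The slot count above can be worked through as a small sketch. The lower bound on II shown here is the standard resource-constrained minimum from the modulo scheduling literature, added as background rather than taken from the slides; the function names and the 40-operation loop are hypothetical.

```python
import math


def scheduling_slots(rows: int, cols: int, ii: int) -> int:
    """Size of the scheduling space: (number of PEs) x II."""
    return rows * cols * ii


def resource_min_ii(num_ops: int, rows: int, cols: int) -> int:
    """Smallest II whose scheduling space has at least one slot per operation.

    Assumption: every PE can execute every operation. Slots consumed by
    routing are ignored, so a real schedule may need a larger II.
    """
    return math.ceil(num_ops / (rows * cols))


print(scheduling_slots(4, 4, ii=2))   # 4x4 CGRA, II = 2 -> 32 slots
print(resource_min_ii(40, 4, 4))      # 40 operations on 16 PEs -> II >= 3
```

Because routing also occupies slots, every slot spent on routing is a slot that cannot hold a computation.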

[Figure: 4x4 CGRA and its scheduling space (time axis, II)]
8
Our Approach
  • Systematic approach to generate good schedule in
    reasonable time
  • Minimize routing cost
  • Convert scheduling problem into graph embedding
  • Leverage graph embedding algorithm
  • Ensure routability of operands
  • Skewed scheduling space
  • Create a narrow, but tall scheduling space

9
1. Minimize Routing Cost
  • Routing cost: number of PEs used for routing
  • Determined by positions of producer and consumer
  • Minimize distance between producers and consumers
  • Height-based list scheduling
  • Schedule operations in the order of dependence
    height
  • Place consumers close to producers
  • Need to carefully place operations at the same
    height
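
A sketch of ordering operations by dependence height, assuming height is defined as the length of the longest path from an operation to a sink of the DFG (sinks have height 1). That definition, the function names, and the example DFG are assumptions for illustration; the slides do not spell them out.

```python
from functools import lru_cache


def dependence_heights(consumers: dict[int, list[int]]) -> dict[int, int]:
    """Height of each op: 1 for sinks, else 1 + the tallest consumer."""

    @lru_cache(maxsize=None)
    def height(op: int) -> int:
        return 1 + max((height(c) for c in consumers[op]), default=0)

    return {op: height(op) for op in consumers}


def schedule_order(consumers: dict[int, list[int]]) -> list[int]:
    """Operations sorted by decreasing height (ties broken by op id)."""
    heights = dependence_heights(consumers)
    return sorted(consumers, key=lambda op: (-heights[op], op))


# Hypothetical DFG: op -> list of ops that consume its result.
dfg = {0: [3], 1: [3], 2: [4], 3: [5], 4: [5], 5: []}
print(dependence_heights(dfg))  # {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 1}
print(schedule_order(dfg))      # [0, 1, 2, 3, 4, 5]
```

Operations 0, 1, and 2 share a height, so the order alone does not say where to place them relative to each other. That is the case the last bullet warns about.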

10
Scheduling Example: Routing Cost
[Figure: a DFG with operations 0-6 scheduled two ways on a 1x4 CGRA (time x PE 0-PE 3): one schedule with routing cost 2, one with routing cost 0]
Common consumer information is important!
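
To make the idea of routing cost concrete, here is a simplified sketch for a 1xN array. It assumes each PE talks only to its immediate neighbors and that a value crossing an intermediate PE consumes one routing slot there; timing is ignored. The cost model, names, and placements are illustrative assumptions and do not reproduce the figures from the slide.

```python
def routing_cost(edges: list[tuple[int, int]], column: dict[int, int]) -> int:
    """Number of intermediate PEs crossed, summed over all producer-consumer edges."""
    return sum(max(0, abs(column[p] - column[c]) - 1) for p, c in edges)


# Hypothetical DFG fragment: ops 0 and 1 both feed op 2.
edges = [(0, 2), (1, 2)]

# Producers placed far apart: op 1 must route through PE 2 to reach op 2.
spread = {0: 0, 1: 3, 2: 1}
# Producers placed on either side of their common consumer.
close = {0: 0, 1: 2, 2: 1}

print(routing_cost(edges, spread))  # 1
print(routing_cost(edges, close))   # 0
```

Knowing that two producers share a consumer is what lets the scheduler choose the second placement.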
11
Affinity Graph Heuristic
  • Consider placement of operations with the same
    height together
  • Use common consumer information
  • Affinity value between operations
  • Measured by the distance of common consumers in
    DFG
  • Construct affinity graph
  • Nodes: operations; edges: affinity values
  • Place operations with affinity edges close to
    each other
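
One way to turn "distance of common consumers" into a number is sketched below. The weighting (closer common consumers count exponentially more, with a cutoff `max_dist`) is a hypothetical choice for illustration and should not be read as the authors' exact formula; the example DFG is also made up.

```python
from collections import deque


def descendant_distances(consumers: dict[int, list[int]], op: int) -> dict[int, int]:
    """Shortest distance in DFG edges from op to every operation it reaches."""
    dist = {op: 0}
    queue = deque([op])
    while queue:
        cur = queue.popleft()
        for nxt in consumers[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    del dist[op]
    return dist


def affinity(consumers: dict[int, list[int]], a: int, b: int, max_dist: int = 3) -> int:
    """Hypothetical affinity: each common consumer at distance d adds 2**(max_dist - d)."""
    da = descendant_distances(consumers, a)
    db = descendant_distances(consumers, b)
    total = 0
    for op in da.keys() & db.keys():
        d = max(da[op], db[op])  # distance from the farther of the two producers
        if d <= max_dist:
            total += 2 ** (max_dist - d)
    return total


# Hypothetical DFG: op -> list of ops that consume its result.
dfg = {0: [3], 1: [3], 2: [4], 3: [5], 4: [5], 5: []}
print(affinity(dfg, 0, 1))  # common consumers 3 (d=1) and 5 (d=2) -> 4 + 2 = 6
print(affinity(dfg, 0, 2))  # common consumer 5 (d=2) only -> 2
```

Under this weighting, ops 0 and 1 get a heavier affinity edge than ops 0 and 2, so a placer would try harder to keep 0 and 1 adjacent.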

12
Affinity Graph Example
[Figure: a DFG with operations 0-5 at heights 3, 2, and 1; its affinity graph; a bad and a good mapping onto a 2x4 CGRA]
Drawing affinity graph onto scheduling space
13
Leveraging Graph Embedding
  • Graph embedding
  • Drawing a graph onto a target space
  • Grid layout algorithm by Li and Kurata
  • Embed complicated biochemical networks onto 2-D
    grid space
  • Simulated annealing
  • Our scheduling problem is a graph embedding
    problem
  • Draw affinity graph onto scheduling space
    minimizing edge length
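
The embedding step can be illustrated with a generic simulated-annealing placer that puts graph nodes on a 2-D grid and tries to shorten affinity-weighted edges. This is a textbook-style sketch under assumed parameters (move set, temperature schedule, Manhattan distance), not the cited grid layout algorithm and not the authors' scheduler.

```python
import math
import random


def edge_length(pos, edges):
    """Total affinity-weighted Manhattan length of all edges."""
    return sum(w * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
               for a, b, w in edges)


def embed(nodes, edges, rows, cols, steps=2000, temp=2.0, cooling=0.995, seed=0):
    """Place nodes on distinct grid cells; return the best placement seen and its cost."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    pos = dict(zip(nodes, rng.sample(cells, len(nodes))))
    cost = edge_length(pos, edges)
    best, best_cost = dict(pos), cost
    for _ in range(steps):
        node, target = rng.choice(nodes), rng.choice(cells)
        occupant = next((n for n, p in pos.items() if p == target), None)
        old = pos[node]
        pos[node] = target                      # move, swapping if the cell is taken
        if occupant is not None and occupant != node:
            pos[occupant] = old
        new_cost = edge_length(pos, edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                     # accept (sometimes even a worse move)
            if cost < best_cost:
                best, best_cost = dict(pos), cost
        else:
            pos[node] = old                     # reject: undo the move and the swap
            if occupant is not None and occupant != node:
                pos[occupant] = target
        temp *= cooling
    return best, best_cost


# Hypothetical affinity graph: (node, node, affinity value).
nodes = [0, 1, 2, 3]
edges = [(0, 1, 6), (1, 2, 2), (2, 3, 4)]
placement, cost = embed(nodes, edges, rows=2, cols=4)
print(placement, cost)
```

Weighting each edge by its affinity value makes the annealer pull high-affinity pairs together first, which matches the goal of placing such operations close to each other.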

[Figure: process flow of grid layout (Li 2005)]
14
2. Ensure Routability of Operands
  • Resources are repeatedly used every II cycles
  • Routing can fail due to previously scheduled
    operations
  • Backtracking: hard to make forward progress for
    CGRA
  • Take preventative approach
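
The reuse of resources every II cycles is commonly tracked with a modulo reservation table, a standard structure in modulo scheduling that the slides do not name explicitly. The sketch below is a minimal version with hypothetical class and method names; it shows how an earlier placement can block a later routing attempt.

```python
class ModuloReservationTable:
    """Tracks which (PE, time mod II) slots are already taken."""

    def __init__(self, num_pes: int, ii: int):
        self.ii = ii
        self.table = [[None] * num_pes for _ in range(ii)]

    def is_free(self, pe: int, time: int) -> bool:
        return self.table[time % self.ii][pe] is None

    def reserve(self, pe: int, time: int, op: str) -> bool:
        """Claim the slot if it is free; return False on a conflict."""
        if not self.is_free(pe, time):
            return False
        self.table[time % self.ii][pe] = op
        return True


# Hypothetical 1x3 CGRA with II = 2.
mrt = ModuloReservationTable(num_pes=3, ii=2)
print(mrt.reserve(pe=1, time=0, op="op0"))          # True
# Time 2 maps to the same row as time 0 (2 % 2 == 0), so PE 1 is busy.
print(mrt.reserve(pe=1, time=2, op="route value"))  # False
```

A routing step that needs PE 1 at time 2 fails here even though nothing was scheduled at time 2 itself, which is why failures are hard to foresee once earlier operations are fixed.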

[Figure: a DFG with operations 0-7 modulo scheduled on a 1x3 CGRA (time x PE 0-PE 2, II marked)]
Routing failed for Op 7!
15
Skewed Scheduling Space
  • Should prevent routing failures in advance

[Figure: scheduling space, time 0-5 x PE 0-PE 2]
  • Skew scheduling space
  • Staggering down to the right
  • Create a narrow, but tall scheduling space
  • Operations can be routed to the right
  • Dynamically adjust scheduling space
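
A sketch of the staggering only, assuming the usable region for each PE starts a fixed number of cycles later than for the PE to its left. The one-cycle skew, the function names, and the text rendering are illustrative assumptions; the dynamic adjustment mentioned above is not modeled.

```python
def earliest_time(pe: int, skew: int = 1, base: int = 0) -> int:
    """First usable time slot for a PE; PEs further right start later."""
    return base + pe * skew


def in_skewed_space(pe: int, time: int, skew: int = 1, base: int = 0) -> bool:
    return time >= earliest_time(pe, skew, base)


def render(num_pes: int, height: int, skew: int = 1) -> list[str]:
    """Rows of the space from time 0 down; '.' is usable, 'x' is excluded."""
    return [
        "".join("." if in_skewed_space(pe, t, skew) else "x" for pe in range(num_pes))
        for t in range(height)
    ]


for row in render(num_pes=3, height=4):
    print(row)
# .xx
# ..x
# ...
# ...
```

With this shape, a value produced on PE p at time t always finds the slot on PE p+1 at time t+1 inside the space, which is the sense in which operations can be routed to the right.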

16
System Flow
17
Experimental Setup
  • Twelve innermost loop kernels from various
    domains
  • Three designs with different RF configurations
  • Evaluate the impact of register file sharing

[Figure: the three designs: dedicated RF, shared RF, central RF]
18
Evaluation of Affinity Heuristic
  • Results of acyclic scheduling
  • Average of 59% reduction in routing cost

19
Modulo Graph Embedding vs. Simulated Annealing
  • Utilization: (slots used for computation) /
    (total slots)
  • Time: (under 5 sec) vs. (5 min to 3 hours)
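
The utilization metric can be computed as below, assuming total slots means (number of PEs) x II as in the scheduling-space definition earlier in the deck. The function name and the example numbers are hypothetical and are not results from the presentation.

```python
def utilization(compute_slots: int, num_pes: int, ii: int) -> float:
    """Fraction of scheduling slots that hold computation rather than routing or nothing."""
    total_slots = num_pes * ii
    if total_slots <= 0:
        raise ValueError("num_pes and ii must be positive")
    return compute_slots / total_slots


# Hypothetical loop: 18 operations on a 4x4 CGRA scheduled at II = 2.
print(utilization(18, num_pes=16, ii=2))  # 0.5625
```

Slots spent on routing lower this ratio, so a lower routing cost tends to show up as higher utilization at the same II.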

20
Impact of Register File Configurations
21
Conclusions
  • Modulo scheduler targeting low-cost CGRAs
  • Provide high computation throughput, scalability,
    power efficiency
  • Two heuristics to generate a good schedule
  • Affinity graph heuristic
  • Skewed scheduling space
  • Average utilizations of 56-68% for three designs
  • Systematic approach allows fast compilation time
  • All benchmarks finished within 5s

22
Questions?