Compiler-directed Synthesis of Programmable Loop Accelerators - PowerPoint PPT Presentation

1
Compiler-directed Synthesis of Programmable Loop
Accelerators
  • Kevin Fan, Hyunchul Park, Scott Mahlke
  • September 25, 2004
  • EDCEP Workshop

2
Loop Accelerators
  • Hardware implementation of a critical loop nest
  • Hardwired state machine
  • Digital camera application: ~1000x speedup vs. Pentium III
  • Multiple accelerators hooked up in a pipeline
  • Loop accelerator vs. customized processor
  • 1 block of code vs. multiple blocks
  • Trivial control flow vs. handling generic
    branches
  • Traditionally state-machine driven vs. instruction driven

3
Programmable Loop Accelerators
  • Goals
  • Multifunction accelerators: accelerator hardware
    can handle multiple loops (re-use)
  • Post-programmable: to a degree, allow changes to
    the application
  • Use compiler as architecture synthesis tool
  • But
  • Don't build a customized processor
  • Maintain ASIC-level efficiency

4
NPA (Nonprogrammable Accelerator) Synthesis in
PICO
5
PICO Frontend
  • Goals
  • Exploit loop-level parallelism
  • Map loop to abstract hardware
  • Manage global memory BW
  • Steps
  • Tiling
  • Load/store elimination
  • Iteration mapping
  • Iteration scheduling
  • Virtual processor clustering

for i = 1 to ni
  for j = 1 to nj
    y[i] += w[j] * x[i+j]

for jt = 1 to 100 step 10
  for t = 0 to 502
    for p = 0 to 1
      (i,j) = function of (t,p)
      if (i > 1) W[t][p] = W[t-5][p]
      else       W[t][p] = w[jt+j]
      if (i > 1 && j < bj) X[t][p] = X[t-4][p+1]
      else                 X[t][p] = x[i+jt+j]
      Y[t][p] += W[t][p] * X[t][p]
6
PICO Backend
  • Resource allocation (II, operation graph)
  • Synthesize machine description for fake fully
    connected processor with allocated resources

7
Reduced VLIW Processor after Modulo Scheduling
8
Data/control-path Synthesis ? NPA
9
PICO Methodology: Why it Works
  • Systematic design methodology
  • 1. Parameterized meta-architecture: all NPAs
    have the same general organization
  • 2. Performance/throughput is an input
  • 3. Abstract architecture: we know how to build
    compilers for this
  • 4. Mapping mechanism: determine architecture
    specifics from the schedule for the abstract
    architecture

10
Direct Generalization of PICO?
  • Programmability would require full interconnect
    between elements
  • Back to the meta architecture!
  • Generalize connectivity to enable
    post-programmability
  • But stylize it

11
Programmable Loop Accelerator Design Strategy
  • Compile for partially defined architecture
  • Build long distance communication into schedule
  • Limit global communication bandwidth
  • Proposed meta-architecture
  • Multi-cluster VLIW
  • Explicit inter-cluster transfers (varying
    latency/BW)
  • Intra-cluster communication is complete
  • Hardware partially defined: expensive units

12
Programmable Loop Accelerator Schema
  [Schema diagram: accelerator datapath with FUs and shift registers,
  complete intra-cluster communication, and a control unit; stream units
  and stream buffers connect the datapath to SRAM and DRAM; an
  inter-cluster register file links accelerator datapaths in a pipeline
  of tiled or clustered accelerators]
13
Flow Diagram
  Assembly code, II
    → FU Alloc (# clusters, expensive FUs)
    → Partition (cheap FUs, FUs assigned to clusters)
    → Modulo Schedule (shift register depth, width, porting;
      intercluster bandwidth)
    → Loop Accelerator
14
Sobel Kernel
for (i = 0; i < N1; i++) {
  for (j = 0; j < N2; j++) {
    int t00, t01, t02, t10, t12, t20, t21, t22;
    int e, e1, e2, e12, e22, tmp;
    t00 = x[i][j];   t01 = x[i][j+1];   t02 = x[i][j+2];
    t10 = x[i+1][j];                    t12 = x[i+1][j+2];
    t20 = x[i+2][j]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];
    e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
    e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
    e12 = e1 * e1;
    e22 = e2 * e2;
    e = e12 + e22;
    if (e > threshold) tmp = 1; else tmp = 0;
    edge[i][j] = tmp;
  }
}
15
FU Allocation
  • Determine number of clusters
  • Determine number of expensive FUs
  • MPY, DIV, memory
  • Sobel with II=4
  • 41 ops
  • → 3 clusters
  • 2 MPY ops
  • → 1 multiplier
  • 9 memory ops
  • → 3 memory units

16
Partitioning
  • Multi-level approach consists of two phases
  • Coarsening
  • Refinement
  • Minimize inter-cluster communication
  • Load balance
  • Max of 4 × II operations per cluster
  • Take FU allocation into account
  • Restricted # of expensive units
  • # of cheap units (ADD, logic) determined from
    partition

17
Coarsening
  • Group highly related operations together
  • Pair operations together at each step
  • Forces partitioner to consider several operations
    as a single unit
  • Example: coarsening a Sobel subgraph into 2 groups

18
Refinement
  • Move operations between clusters
  • Good moves
  • Reduce inter-cluster communication
  • Improve load balance
  • Reduce hardware cost
  • Reduce number of expensive units to meet limit
  • Collect similar bitwidth operations together

19
Partitioning Example
  • From sobel, II=4
  • Place MPYs together
  • Place each tree of ADD-LOAD-ADDs together
  • Cuts 6 edges

20
Modulo Scheduling
  • Determines shift register width, depth, and
    number of read ports
  • Sobel, II=4

FU   Cycle   Max result lifetime   Req'd depth   Req'd ports
0    2       4                     4             1
1    1       2                     4             2
1    3       4                     4             2
2    4       1                     1             1
3    0       -                     1             1
3    3       1                     1             1
[Diagram: modulo schedule placing ADD and LD operations on FU0-FU3
over cycles 0-3]
21
Test Cases
  • Sobel and fsed kernels, II=4 designs
  • Each machine has 4 clusters with 4 FUs per cluster

[Diagram: per-cluster unit mixes for the sobel and fsed machines,
showing the placement of M, B, and << units in each cluster]
22
Cross Compile Results
  • Computation is localized
  • sobel: 1.5 moves/cycle
  • fsed: 1 move/cycle
  • Cross compile
  • Can still achieve II=4
  • More inter-cluster communication
  • May require more units
  • sobel on fsed machine: 2 moves/cycle
  • fsed on sobel machine: 3 moves/cycle

23
Concluding Remarks
  • Programmable loop accelerator design strategy
  • Meta-architecture with stylized interconnect
  • Systematic compiler-directed design flow
  • Costs of programmability
  • Interconnect, inter-cluster communication
  • Control micro-instructions are necessary
  • Just scratching the surface of this work
  • For more, see the CCCP group webpage
    http://cccp.eecs.umich.edu