Architecture and Compilation for Reconfigurable Processors

1
Architecture and Compilation for Reconfigurable
Processors
  • Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang
  • Computer Science Department
  • UCLA
  • Nov 22, 2004

2
Outline
  • Motivation
  • Application-specific instruction set compilation
  • Register file data bandwidth problem
  • Architecture extension: shadow registers
  • Shadow register binding
  • Conclusions

3
Reconfigurable Processor Platform
  • Reconfigurable processor (RP) = core + programmable
    fabric
  • RP core supports the basic instruction set +
    customized instructions
  • Programmable fabric implements the customized
    instructions
  • Either runtime reconfigurable or pre-synthesized
  • Example: Nios / Nios II from Altera
  • Stratix version supported by Nios 3.0 system
  • 5 extended instruction formats
  • Up to 2048 instructions for each format

4
Motivational Example
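Basic instruction set I:
    t1 = a op b     t2 = b op 2     t3 = c op 5
    t4 = t1 op t2   t5 = t2 op t3   t6 = t5 op t4
    Execution time: 9 clock cycles
Extended instruction set I ∪ {extop1, extop2}:
    t1 = extop1(a, b, 2)
    t2 = extop2(b, c, 2, 5)
    t3 = t1 op t2
    Execution time: 5 clock cycles (speedup 1.8)
[Figure: data flow graph with the extop1 and extop2 patterns marked; operation latencies of 2 clock cycles and 1 clock cycle]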
5
Problem Statement
  • Given
  • Application program in CDFG G(V, E)
  • A processor with basic instruction set I
  • Pattern constraints
  • Number of inputs less than Nin
  • 1 output
  • Total area no more than A
  • Objective
  • Generate a pattern library P
  • Map G to the extended instruction set I ∪ P, so
    that the total execution time is minimized.

6
Proposed ASIP Compilation Flow
  • Extended Instruction Candidates Generation
  • Satisfying I/O constraints
  • Extended Instruction Selection
  • Select a subset to maximize the potential speedup
    while satisfying the resource constraint
  • Code Generation
  • Graph covering
  • Minimize the total execution time

7
Step 1. Pattern Enumeration
  • Each pattern is an Nin-feasible cone
  • Cut enumeration is used to enumerate all the
    Nin-feasible cones [Cong et al., FPGA'99]
  • Basic idea: in topological order, merge the cuts
    of the fan-ins and discard those cuts that are not
    Nin-feasible

3-feasible cones
n1: {a, b}
n2: {b, 2}
n3: {c, 5}
n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
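
The cut-merging idea can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `enumerate_cuts`, the dictionary-based graph encoding, and the choice to treat "Nin-feasible" as "at most Nin leaves" are assumptions made here. Constants such as 2 and 5 are counted as inputs, as in the 3-feasible cone listing.

```python
from itertools import product

def enumerate_cuts(fanins, order, n_in):
    """Return {node: set of cuts}; each cut is a frozenset of at most n_in leaves."""
    cuts = {}
    for node in order:                       # nodes visited in topological order
        preds = fanins.get(node, [])
        if not preds:                        # primary input or constant
            cuts[node] = {frozenset([node])}
            continue
        merged = set()
        # Choose one cut from every fan-in and merge them by set union.
        for combo in product(*(cuts[p] for p in preds)):
            cut = frozenset().union(*combo)
            if len(cut) <= n_in:             # discard cuts that are not feasible
                merged.add(cut)
        merged.add(frozenset([node]))        # trivial cut: the node itself
        cuts[node] = merged
    return cuts

# Graph encoding of the example: n1(a, b), n2(b, 2), n3(c, 5), n4(n1, n2).
fanins = {"n1": ["a", "b"], "n2": ["b", "2"], "n3": ["c", "5"], "n4": ["n1", "n2"]}
order = ["a", "b", "c", "2", "5", "n1", "n2", "n3", "n4"]
cuts = enumerate_cuts(fanins, order, n_in=3)
n4_cones = cuts["n4"] - {frozenset(["n4"])}  # drop the trivial cut
```

Under these assumptions, `n4_cones` holds the four cuts listed for n4 above. The number of cuts can grow quickly with Nin, which is why infeasible cuts are pruned at every node rather than at the end.
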
8
Step 2. Pattern Selection
  • Basic idea: simultaneously consider speedup,
    occurrence frequency and area.
  • Speedup
  • Tsw(p): total execution time with basic
    instructions
  • Thw(p): length of the critical path of
    scheduled p
  • Speedup(p) = Tsw(p) / Thw(p)
  • Occurrence
  • Some pattern instances may be isomorphic
  • Graph isomorphism test [Nauty package]
  • Subgraphs are small, so the isomorphism test is very fast
  • Gain(p) = Speedup(p) × Occurrence(p)
  • Selection under area constraint can be formulated
    as a 0-1 knapsack problem

[Figure: data flow graph with nodes n1 to n6]
Pattern: Tsw = 3, Thw = 2, Speedup = 1.5
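
The selection step can be illustrated with a textbook 0-1 knapsack dynamic program. This is a sketch under stated assumptions, not the authors' selector: gain is taken to be speedup times occurrence, areas are assumed to be integers, and the pattern names, occurrence counts, and areas in the usage lines are made up for illustration.

```python
def gain(t_sw, t_hw, occurrence):
    """Gain of a pattern, assumed here to be speedup times occurrence count."""
    return (t_sw / t_hw) * occurrence

def select_patterns(patterns, area_budget):
    """patterns: list of (name, gain, integer area). Returns (best gain, chosen names)."""
    # best[a] = (total gain, chosen names) achievable with total area at most a
    best = [(0.0, ()) for _ in range(area_budget + 1)]
    for name, g, area in patterns:
        # Descending area order ensures each pattern is selected at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = best[a - area][0] + g
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - area][1] + (name,))
    return best[area_budget]

# Hypothetical candidates: (name, gain, area)
patterns = [
    ("p1", gain(3, 2, 4), 50),   # Tsw = 3, Thw = 2 as in the example pattern
    ("p2", gain(5, 2, 2), 40),
    ("p3", gain(4, 1, 1), 30),
]
total_gain, chosen = select_patterns(patterns, area_budget=80)
```

With an area budget of 80, p1 and p2 together do not fit, so the program picks p1 and p3. The table has one entry per unit of area, so a coarse area unit keeps it small.
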
9
Step 3. Application Mapping
  • Assume execution on an in-order, single-issue
    processor
  • Cover each node in G(V, E) with the extended
    instruction set to minimize the execution time.
  • Trivial pattern: software execution time
  • Nontrivial pattern: hardware execution time
  • Total execution time = sum of the execution times
    of the pattern instances after application mapping
  • Theorem: The application mapping problem is
    equivalent to the library-based minimum-area
    technology mapping problem.
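
To make the covering objective concrete, here is a minimal sketch for the special case where the data flow graph is a tree, so that a simple dynamic program applies. It is not the mapping algorithm from the talk: the function `min_cover_time`, the `matches` encoding, and all latencies below are hypothetical. On a general graph with shared nodes this recursion would count a shared node once per consumer.

```python
from functools import lru_cache

def min_cover_time(matches, root):
    """matches[node] = list of (latency, leaves) for patterns rooted at node.

    Nodes without an entry are primary inputs and cost nothing. On a
    single-issue, in-order processor the total time is the sum of the
    latencies of the selected pattern instances.
    """
    @lru_cache(maxsize=None)
    def cost(node):
        if node not in matches:
            return 0
        return min(latency + sum(cost(leaf) for leaf in leaves)
                   for latency, leaves in matches[node])
    return cost(root)

# Hypothetical tree: t4 is computed from t1 and t2.
basic_only = {
    "t1": [(2, ("a", "b"))],
    "t2": [(2, ("b", "2"))],
    "t4": [(1, ("t1", "t2"))],                       # trivial patterns only
}
extended = dict(basic_only)
extended["t4"] = [(1, ("t1", "t2")), (2, ("a", "b", "2"))]  # plus one 3-input pattern

time_basic = min_cover_time(basic_only, "t4")
time_extended = min_cover_time(extended, "t4")
```

Here the extended pattern covers t1, t2 and t4 in one instruction, so the covering cost drops from 5 cycles to 2 in this made-up example.
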

10
Speedup and Resource Overhead on NIOS
         Extended      Speedup             Resource Overhead
         Instruction   Estimation  Nios    LE    LE (%)   Memory   Memory (%)   DSP Block
fft_br   9             3.28        2.65    408   6.06     65,536   9.79         16
iir      7             3.18        3.73    255   3.79     4,736    0.71         40
fir      2             2.40        2.14    51    0.76     1,024    0.15         8
pr       2             1.57        1.75    71    1.05     0        0.00         14
dir      2             3.28        3.02    54    0.80     0        0.00         16
mcm      4             4.75        3.22    186   2.76     0        0.00         56
Average                3.08        2.75    -     2.54     -        1.77         -
11
Simulation Environment
  • SimpleScalar v3.0
  • Benchmarks
  • From MediaBench suite
  • Machine configuration
  • Single-issue in-order processor (ARM-like)
  • DL1: 8KB, 4-way, 1 cycle
  • IL1: 8KB, direct mapped, 1 cycle
  • Unified L2: 256KB, 4-way, 8 cycles
  • Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1
    FPMult
  • Reconfigurable units
  • critical path latency of the collapsed
    instructions

12
Pattern Distribution
Most of the patterns have fewer than 7 nodes
13
Ideal Speedup under Different Input Size
Constraints
14
Outline
  • Motivation
  • Application-specific instruction set compilation
  • Register file data bandwidth problem
  • Architecture extension: shadow registers
  • Shadow register binding
  • Conclusions

15
Register File Bandwidth Problem
  • Most of the speedup comes from clusters with more
    than two inputs
  • 2-port register file in embedded processors
  • Need extra cycles to transfer data for extended
    instructions with more than 2 inputs
  • Speedup drop due to communication overhead

16
Speedup Drop with Different Input Constraints
  • Move operation takes one cycle
  • 46% speedup drop on average
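
A back-of-the-envelope model shows why the extra moves hurt. This sketch assumes that every input beyond the register file's read ports costs one move of one cycle before the extended instruction can issue; the function name and the numbers in the usage lines are hypothetical and are not taken from the experiments.

```python
def effective_speedup(t_sw, t_hw, n_inputs, rf_read_ports=2, move_cycles=1):
    """Speedup of one extended instruction once operand moves are counted."""
    extra_moves = max(0, n_inputs - rf_read_ports)   # operands the register file cannot supply
    return t_sw / (t_hw + extra_moves * move_cycles)

# Hypothetical pattern: 6 cycles in software, 2 cycles in hardware.
ideal = effective_speedup(6, 2, n_inputs=2)        # fits the 2-port register file
with_moves = effective_speedup(6, 2, n_inputs=4)   # needs 2 extra moves
```

In this made-up case the speedup falls from 3.0 to 1.5 once two one-cycle moves are added, which is the kind of loss the slide attributes to communication overhead.
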

17
Outline
  • Motivation
  • Application-specific instruction set compilation
  • Register file data bandwidth problem
  • Architecture extension: shadow registers
  • Shadow register binding
  • Conclusions

18
Architecture Extensions
  • Existing Solutions
  • Dedicated Data Link
  • Avoid potential resource contention through bus
  • Need extra cycles for communication
  • Employed in MicroBlaze from Xilinx
  • Multiport Register File
  • Low utilization when executing basic instructions
  • Area and power grow cubically
  • Register File Replication
  • Predetermined one-to-one correspondence
  • Resource waste in terms of area and power
  • Limits compiler optimization

19
Our Approach: Shadow Registers
  • Core registers are augmented by an extra set of
    shadow registers
  • Conditionally written
  • Used only by the custom logic

20
Shadow Registers
  • Controlling the shadow register
  • Advantages and limitations
  • Cost-efficient for small number of shadow
    registers
  • Only need a few control signals to be added
  • Opportunity for compiler optimization
  • Require extra control bits

Operation             Forward the result   Forward the result   Forward the result   Skip
Instruction subword   00                   01                   10                   11
Shadow-reg ID         0                    1                    2                    -
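
The control encoding in the table can be modelled as a small decode step. This is a behavioural sketch only: the function `forward_result`, the list used to model the shadow registers, and the idea that the subword is decoded at write-back are assumptions for illustration.

```python
SKIP = 0b11  # subword value that leaves every shadow register untouched

def forward_result(subword, result, shadow_regs):
    """Conditionally copy an instruction's result into a shadow register.

    Subwords 00, 01 and 10 select shadow register 0, 1 or 2; 11 means skip.
    """
    if not 0 <= subword <= 0b11:
        raise ValueError("subword must be a 2-bit value")
    if subword != SKIP:
        shadow_regs[subword] = result
    return shadow_regs

shadow = [0, 0, 0]
forward_result(0b01, 42, shadow)   # result is also kept in shadow register 1
forward_result(0b11, 7, shadow)    # skip: shadow registers keep their values
```

Two control bits are enough for three shadow registers plus the skip case, which matches the remark that only a few control signals need to be added.
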
21
Outline
  • Motivation
  • Application-specific instruction set compilation
  • Register file data bandwidth problem
  • Architecture extension: shadow registers
  • Shadow register binding
  • Conclusions

22
Internal Representation
  • 2-level CDFG representation
  • 1st level: control flow graph
  • 2nd level: data flow graph
  • Computation node
  • Latency, scheduled time slot
  • Data edge
  • Lifetime
  • Variable lifetime

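[Figure: scheduled data flow graph over time slots 1 to 6, with instructions i1 to i6, extended instructions ext1(…, i1, …), ext2(…, i1, …), ext3(…, i3, …), ext4(…, i3, …), and data edges e1 to e4]
Lifetime of e1: [2, 2]; lifetime of e2: [2, 4]; lifetime of i1: [2, 4]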
23
Observation
  • 2-port register file
  • 3-input extended instruction
  • Without shadow register
  • 4 additional moves
  • Binding for 1 register

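[Figure: scheduled data flow graph over time slots 1 to 6, with instructions i1 to i6, extended instructions ext1(…, i1, …), ext2(…, i1, …), ext3(…, i3, …), ext4(…, i3, …), and data edges e1 to e4]
Binding 1: either i1 or i3 in the shadow register, saves 2 moves
Binding 2: saves 3 moves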
24
Register Binding
  • Which operands should be bound?
  • Each input could be a candidate
  • Binding different candidates leads to different
    savings
  • Unaffordable to try all the combinations

25
One Shadow Register Binding Problem
  • Problem formulation
  • Given
  • A scheduled DFG and one shadow register
  • Objective
  • Bind variables to shadow register
  • Minimize the number of moves

26
Algorithm for Binding One Shadow Register
  • Weighted compatibility graph
  • Vertex ↔ data edge in the DFG
  • Weight ↔ number of moves saved if the value is kept in
    the register
  • Edge ↔ lifetimes don't overlap
  • Theorem
  • The binding problem is equivalent to finding a maximum
    weighted chain in the compatibility graph
  • Can be optimally solved in O(V + E) time
  • Extension to K-shadow registers
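
A maximum weighted chain can be found with a longest-path style dynamic program over variables sorted by lifetime start. The sketch below is an illustration under stated assumptions, not the authors' code: lifetimes are closed intervals, two variables are compatible only if one ends strictly before the other starts, and the variable names, lifetimes, and weights are invented. For brevity it compares all pairs instead of walking an explicit edge list, so it runs in quadratic time.

```python
def max_weight_chain(variables):
    """variables: {name: (start, end, weight)}. Returns (total weight, chain).

    A chain is a sequence of variables with pairwise non-overlapping
    lifetimes, so all of them can share a single shadow register.
    """
    order = sorted(variables, key=lambda v: variables[v][0])  # topological order
    best = {}  # name -> (weight of the best chain ending here, predecessor)
    for v in order:
        start_v, _, weight_v = variables[v]
        pred, pred_weight = None, 0
        for u in best:                                   # every u starts no later than v
            if variables[u][1] < start_v and best[u][0] > pred_weight:
                pred, pred_weight = u, best[u][0]
        best[v] = (pred_weight + weight_v, pred)
    if not best:
        return 0, []
    last = max(best, key=lambda v: best[v][0])
    total = best[last][0]
    chain = []
    while last is not None:                              # walk predecessors back to the start
        chain.append(last)
        last = best[last][1]
    return total, chain[::-1]

# Hypothetical variables: (lifetime start, lifetime end, moves saved)
variables = {"x": (1, 2, 2), "y": (2, 4, 3), "z": (3, 5, 1), "w": (5, 6, 2)}
total_saved, chain = max_weight_chain(variables)
```

In this example y and w do not overlap and together save the most moves, so they are bound to the shadow register. Sorting by lifetime start gives a valid topological order because a compatible predecessor must end, and therefore start, before its successor begins.
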

27
Experimental Results (1)
Speedup under different number of shadow
registers for 3-input extended instructions
28
Experimental Results (2)
Speedup under different number of shadow
registers for 4-input extended instructions
29
Conclusions
  • Proposed and developed a complete compilation flow
  • Observed and quantitatively analyzed the data
    bandwidth problem
  • Proposed a novel architecture extension and an
    efficient register binding algorithm

30
Thank You