Architecture and Compilation for Reconfigurable Processors

Learn more at: http://cadlab.cs.ucla.edu

Transcript and Presenter's Notes

1
Architecture and Compilation for Reconfigurable
Processors

Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang
Computer Science Department
UCLA
Nov 22, 2004

2
Outline

Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension: shadow registers
Shadow register binding
Conclusions

3
Reconfigurable Processor Platform

Reconfigurable processor (RP) = core + programmable fabric
RP core supports the basic instruction set plus customized instructions
Programmable fabric implements the customized instructions
Either runtime reconfigurable or pre-synthesized
Example: Nios / Nios II from Altera
Stratix version supported by the Nios 3.0 system
5 extended instruction formats
Up to 2048 instructions for each format

4
Motivational Example
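Original code:
t1 = a <op> b
t2 = b <op> 2
t3 = c <op> 5
t4 = t1 <op> t2
t5 = t2 <op> t3
t6 = t5 <op> t4
Execution time: 9 clock cycles

Code with extended instructions:
t1 = extop1(a, b, 2)
t2 = extop2(b, c, 2, 5)
t3 = t1 <op> t2
Execution time: 5 clock cycles
Speedup: 9 / 5 = 1.8

Extended instruction set: I ∪ {extop1, extop2}
Figure legend: each operation takes either 2 clock cycles or 1 clock cycle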
5
Problem Statement

Given
Application program in CDFG G(V, E)
A processor with basic instruction set I
Pattern constraints
Number of inputs less than Nin
1 output
Total area no more than A
Objective
Generate a pattern library P
Map G to the extended instruction set I ∪ P so
that the total execution time is minimized.

6
Proposed ASIP Compilation Flow

Extended Instruction Candidates Generation
Satisfying I/O constraints
Extended Instruction Selection
Select a subset to maximize the potential speedup
while satisfying the resource constraint
Code Generation
Graph covering
Minimize the total execution time

7
Step 1. Pattern Enumeration

Each pattern is an Nin-feasible cone
Cut enumeration is used to enumerate all the
Nin-feasible cones [Cong et al., FPGA'99]
Basic idea: in topological order, merge the cuts
of the fan-ins and discard the cuts that are not
Nin-feasible

3-feasible cones:
n1: {a, b}
n2: {b, 2}
n3: {c, 5}
n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
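
The merge-and-discard step can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `enumerate_cuts`, the dictionary encoding of the graph, and the choice to keep a trivial cut `{node}` at every node are assumptions made for the example. The graph is the one from the 3-feasible cone listing on this slide.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Return {node: set of cuts}; each cut is a frozenset of at most k leaves.

    fanins maps a node to its list of fan-in nodes (hypothetical encoding).
    order must be topological: every fan-in appears before its fan-out.
    """
    cuts = {}
    for node in order:
        ins = fanins.get(node, [])
        if not ins:
            # Primary input or constant: only the trivial cut exists.
            cuts[node] = {frozenset([node])}
            continue
        merged = set()
        # Pick one cut from each fan-in and merge them by set union.
        for combo in product(*(cuts[i] for i in ins)):
            leaves = frozenset().union(*combo)
            if len(leaves) <= k:  # discard cuts that are not k-feasible
                merged.add(leaves)
        # Keep the trivial cut so that fan-outs may stop at this node.
        merged.add(frozenset([node]))
        cuts[node] = merged
    return cuts

# Graph from the slide: n1(a, b), n2(b, 2), n3(c, 5), n4(n1, n2).
fanins = {"n1": ["a", "b"], "n2": ["b", "2"], "n3": ["c", "5"], "n4": ["n1", "n2"]}
order = ["a", "b", "c", "2", "5", "n1", "n2", "n3", "n4"]
cuts = enumerate_cuts(fanins, order, k=3)
n4_cones = cuts["n4"] - {frozenset(["n4"])}
```

With k = 3, `n4_cones` holds the four cones listed for n4 above. Lowering k to 2 would leave only {n1, n2}.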
8
Step 2. Pattern Selection

Basic idea: simultaneously consider speedup,
occurrence frequency, and area.
Speedup
Tsw(p): total execution time with basic
instructions
Thw(p): length of the critical path of
scheduled p
Speedup(p) = Tsw(p) / Thw(p)
Occurrence
Some pattern instances may be isomorphic
Graph isomorphism test [Nauty package]
For small subgraphs, the isomorphism test is very fast
Gain(p) = Speedup(p) × Occurrence(p)
Selection under the area constraint can be formulated
as a 0-1 knapsack problem

Figure: example graph with nodes n1 to n6; highlighted pattern with Tsw = 3, Thw = 2, Speedup = 1.5
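
As a rough illustration of the selection step, the sketch below computes Gain(p) = Speedup(p) × Occurrence(p) and solves the 0-1 knapsack by dynamic programming over integer area. The pattern data, the area units, and the name `select_patterns` are hypothetical; only the gain formula and the knapsack formulation come from the slide.

```python
def select_patterns(patterns, area_budget):
    """0-1 knapsack: maximize total gain with total area <= area_budget.

    patterns: list of (name, t_sw, t_hw, occurrence, area), integer area
    (hypothetical units). Returns (total_gain, chosen_names).
    """
    # best[a] = (gain, chosen names) achievable with area at most a
    best = [(0.0, [])] * (area_budget + 1)
    for name, t_sw, t_hw, occurrence, area in patterns:
        gain = (t_sw / t_hw) * occurrence  # Gain(p) = Speedup(p) * Occurrence(p)
        new_best = list(best)
        for a in range(area, area_budget + 1):
            candidate = best[a - area][0] + gain
            if candidate > new_best[a][0]:
                new_best[a] = (candidate, best[a - area][1] + [name])
        best = new_best
    return best[area_budget]

# Hypothetical candidates: (name, Tsw, Thw, occurrence, area)
candidates = [
    ("p1", 3, 2, 4, 5),  # speedup 1.5, gain 6.0
    ("p2", 5, 2, 2, 4),  # speedup 2.5, gain 5.0
    ("p3", 4, 1, 1, 3),  # speedup 4.0, gain 4.0
]
total_gain, chosen = select_patterns(candidates, area_budget=8)
```

Here p1 and p2 together exceed the budget, so the sketch picks p1 and p3. Note that p1 wins a slot despite having the lowest speedup, because it occurs most often.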
9
Step 3. Application Mapping

Assume execution on an in-order, single-issue
processor
Cover each node in G(V, E) with the extended
instruction set to minimize the execution time.
Trivial pattern: software execution time
Nontrivial pattern: hardware execution time
Total execution time: sum of the execution times of
the pattern instances after application mapping
Theorem: the application mapping problem is
equivalent to the library-based minimum-area
technology mapping problem.

10
Speedup and Resource Overhead on NIOS
Resource overhead columns: LE, LE (%), Memory, Memory (%), DSP block

Benchmark  Ext. instr.  Speedup (est.)  Speedup (Nios)  LE   LE (%)  Memory  Memory (%)  DSP block
fft_br     9            3.28            2.65            408  6.06    65,536  9.79        16
iir        7            3.18            3.73            255  3.79    4,736   0.71        40
fir        2            2.40            2.14            51   0.76    1,024   0.15        8
pr         2            1.57            1.75            71   1.05    0       0.00        14
dir        2            3.28            3.02            54   0.80    0       0.00        16
mcm        4            4.75            3.22            186  2.76    0       0.00        56
Average    -            3.08            2.75            -    2.54    -       1.77        -
11
Simulation Environment

SimpleScalar v3.0
Benchmarks
From the MediaBench suite
Machine configuration
Single-issue in-order processor (ARM-like)
DL1: 8KB, 4-way, 1 cycle
IL1: 8KB, direct-mapped, 1 cycle
Unified L2: 256KB, 4-way, 8 cycles
Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1
FPMult
Reconfigurable units
Critical path latency of the collapsed
instructions

12
Pattern Distribution
Most of the patterns have fewer than 7 nodes
13
Ideal Speedup under Different Input Size
Constraints
14
Outline

Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension: shadow registers
Shadow register binding
Conclusions

15
Register File Bandwidth Problem

Most of the speedup comes from clusters with more
than two inputs
2-port register file in embedded processors
Need extra cycles to transfer data for extended
instructions with more than 2 inputs
Speedup drop due to communication overhead

16
Speedup Drop with Different Input Constraints

Move operation takes one cycle
46% speedup drop on average
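
A back-of-the-envelope model of this drop is sketched below. It assumes that every operand beyond the register file's read ports costs one extra move, and that a move takes one cycle as stated on this slide. The function name and the one-move-per-extra-operand model are assumptions for illustration, not the simulator's cost model.

```python
def effective_speedup(t_sw, t_hw, num_inputs, rf_read_ports=2, move_cycles=1):
    """Speedup of one extended instruction once operand moves are charged.

    Assumed model: each input beyond the register file read ports needs
    one move operation before the extended instruction can issue.
    """
    extra_moves = max(0, num_inputs - rf_read_ports)
    return t_sw / (t_hw + extra_moves * move_cycles)

# Pattern with Tsw = 3 and Thw = 2 (ideal speedup 1.5), given 3 inputs.
ideal = effective_speedup(3, 2, num_inputs=2)
with_moves = effective_speedup(3, 2, num_inputs=3)
```

Under this model a single extra move is enough to cancel the whole gain of a short pattern (1.5 falls to 1.0), which is why wide-input patterns are the most exposed.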

17
Outline

Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension: shadow registers
Shadow register binding
Conclusions

18
Architecture Extensions

Existing Solutions
Dedicated Data Link
Avoids potential resource contention on the bus
Needs extra cycles for communication
Employed in MicroBlaze from Xilinx
Multiport Register File
Low utilization when executing basic instructions
Area and power grow cubically
Register File Replication
Predetermined one-to-one correspondence
Wastes resources in terms of area and power
Limits compiler optimization

19
Our Approach: Shadow Registers

Core registers are augmented by an extra set of
shadow registers
Conditionally written
Used only by the custom logic

20
Shadow Registers

Controlling the shadow register
Advantages and limitations
Cost-efficient for a small number of shadow
registers
Only a few control signals need to be added
Opportunity for compiler optimization
Requires extra control bits

Instruction subword  Operation           Shadow-reg ID
00                   Forward the result  0
01                   Forward the result  1
10                   Forward the result  2
11                   Skip                -
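
The control table can be read as a small decoder. The sketch below is a software model only: the function names and the list-based register files are hypothetical, and it assumes the result is still written to its core destination register while the subword decides whether a copy is also forwarded to a shadow register.

```python
def decode_shadow_subword(subword):
    """Map the 2-bit instruction subword to a shadow-register ID.

    00, 01, 10 -> forward the result to shadow register 0, 1, 2
    11         -> skip (returns None)
    """
    if not 0 <= subword <= 0b11:
        raise ValueError("subword must fit in 2 bits")
    return None if subword == 0b11 else subword

def write_back(core_regs, shadow_regs, dest, result, subword):
    """Write the result to the core register (assumed) and, unless the
    subword says skip, forward a copy to the selected shadow register."""
    core_regs[dest] = result
    shadow_id = decode_shadow_subword(subword)
    if shadow_id is not None:
        shadow_regs[shadow_id] = result

core = [0] * 4
shadow = [0] * 3
write_back(core, shadow, dest=1, result=42, subword=0b10)  # forward to shadow 2
write_back(core, shadow, dest=2, result=7, subword=0b11)   # skip forwarding
```

With two control bits, three shadow registers can be addressed and one encoding is left for "skip", which matches the table.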
21
Outline

Motivation
Application-specific instruction set compilation
Register file data bandwidth problem
Architecture extension: shadow registers
Shadow register binding
Conclusions

22
Internal Representation

2-level CDFG representation
1st level: control flow graph
2nd level: data flow graph
Computation node
latency, scheduled time slot
Data edge
lifetime
Variable lifetime

Figure: scheduled data flow graph over time slots 1 to 6, with data edges e1 to e4
i1
i2: ext1(..., i1, ...)
i3
i4: ext2(..., i1, ...)
i5: ext3(..., i3, ...)
i6: ext4(..., i3, ...)
Lifetime of e1: [2, 2]
Lifetime of e2: [2, 4]
Lifetime of i1: [2, 4]
23
Observation

2-port register file
3-input extended instruction
Without shadow register: 4 additional moves
Binding for 1 register
Binding 1: either i1 or i3 in the shadow register, saves 2 moves
Binding 2: saves 3 moves

Figure: scheduled data flow graph over time slots 1 to 6 (instructions i1 to i6, data edges e1 to e4), annotated with the two bindings
24
Register Binding

Which operands should be bound?
Each input could be a candidate
Binding different candidates leads to different
savings
Unaffordable to try all the combinations

25
One Shadow Register Binding Problem

Problem formulation
Given
A scheduled DFG and one shadow register
Objective
Bind variables to shadow register
Minimize the number of moves

26
Algorithm for Binding One Shadow Register

Weighted compatibility graph
Vertex <-> data edge in the DFG
Weight <-> moves saved if the value is kept in
the register
Edge <-> lifetimes don't overlap
Theorem
The binding problem is equivalent to finding a maximum-weight
chain in the compatibility graph
Can be solved optimally in O(V + E) time
Extension to K shadow registers
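
A minimal sketch of the one-register case follows. Each vertex carries a lifetime and a weight (moves saved), and two vertices are compatible when one lifetime ends strictly before the other starts. The lifetimes and weights are hypothetical, and the dynamic program is a simple quadratic version written for readability, not the linear-time algorithm claimed on the slide.

```python
def max_weight_chain(vertices):
    """Return (total_weight, chain) of a maximum-weight chain.

    vertices: list of (name, start, end, weight), with lifetimes
    [start, end] inclusive. u may precede v in a chain when u's
    lifetime ends strictly before v's starts (no overlap).
    """
    order = sorted(vertices, key=lambda v: v[1])  # by lifetime start
    best = {}  # name -> (weight, chain) of the best chain ending at name
    for name, start, end, weight in order:
        prev = (0, [])
        for p_name, p_start, p_end, p_weight in order:
            # A compatible predecessor starts earlier, so it is already in best.
            if p_end < start and p_name in best and best[p_name][0] > prev[0]:
                prev = best[p_name]
        best[name] = (prev[0] + weight, prev[1] + [name])
    return max(best.values(), key=lambda b: b[0], default=(0, []))

# Hypothetical values: (name, lifetime start, lifetime end, moves saved)
values = [
    ("x", 1, 3, 2),
    ("y", 2, 5, 3),
    ("z", 4, 6, 2),
    ("w", 6, 8, 2),
]
saved, chain = max_weight_chain(values)
```

In this example x then z saves 4 moves, while y then w saves 5, so the shadow register would hold y and afterwards w. Sorting by start time gives a topological order of the compatibility graph, which is what lets the chain be built in one pass.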

27
Experimental Results (1)
Speedup under different number of shadow
registers for 3-input extended instructions
28
Experimental Results (2)
Speedup under different number of shadow
registers for 4-input extended instructions
29
Conclusions
