Title: High-Level Synthesis Challenges and Solutions for a Dynamically Reconfigurable Processor
1. High-Level Synthesis Challenges and Solutions for a Dynamically Reconfigurable Processor
- Takao Toi, Noritsugu Nakamura, Yoshinosuke Kato, Toru Awashima, and Kazutoshi Wakabayashi (System Devices Research Laboratories, NEC Corporation)
- Li Jing (NEC Informatec Systems, Ltd.)
2. Contents
- Dynamically reconfigurable processor (DRP) architecture
- High-level synthesis (HLS) challenges and solutions for DRP
  - How to achieve high area efficiency
  - How to control delay precisely
- Evaluations
3. DRP architecture
DRP core (variable array size):
- Array of byte-oriented, chainable processing elements (PEs)
- Fully programmable inter-PE wiring resources
- State transition controller (STC): a simple sequencer that works as an FSM
- Array of configurable data memories
Dynamic reconfiguration was originally intended to achieve higher area efficiency than other reconfigurable devices (FPGA/CPLD); the coarse-grained PE array architecture reduces the number of configuration bits.
4. How the context transition works
Reconfigurable on every cycle.
[Figure: the state transition controller switches the active context each cycle, driven by event flags]
5. Unique points of DRP compared to ASIC/FPGA
- Multi-context datapath: can be switched on every cycle
- Coarse-grained PEs: unit-based operators and registers
- Longer wire delay: same as FPGA
[Figure: PE block diagram with ALU, DMU, register file, and instructions]
6. Compilation flow and design environment for DRP
- Flow: C source code → high-level synthesizer → contexts, FSM, datapath, and scheduling result → technology mapping tool → place & route tool → STC code and PE code → DRP chip
- Editor with on-chip debugger: FSM with C source correspondence
- Tool launcher with iterative optimizer: feeds resource usage back into synthesis
7. The first challenge of HLS for DRP: High area efficiency

  Context              Without equalization   With equalization
  1st                  20 PEs                 16 PEs
  2nd                   2 PEs                 16 PEs
  3rd                  10 PEs                 16 PEs
  4th                  16 PEs                 (unused)
  Total                48 PEs                 48 PEs
  Max capacity         80 PEs                 48 PEs
  Filling rate         60%                    100%
  Context usage rate   4/4                    3/4

The number of PEs should be equalized among contexts.
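The table's arithmetic can be sketched in Python; the helper name and the capacity formula (number of contexts times the widest context used) are our assumptions, inferred from the slide's numbers:

```python
# Hypothetical helper reproducing the slide's filling-rate arithmetic.
def mapping_stats(pes_per_context):
    """Return (total PEs, max capacity, filling rate) for one mapping."""
    total = sum(pes_per_context)
    # Capacity assumed to be: number of contexts x the widest context used.
    capacity = len(pes_per_context) * max(pes_per_context)
    return total, capacity, total / capacity

# Without equalization: 20 + 2 + 10 + 16 PEs over 4 contexts
print(mapping_stats([20, 2, 10, 16]))  # -> (48, 80, 0.6)
# With equalization: 16 PEs in each of 3 contexts
print(mapping_stats([16, 16, 16]))     # -> (48, 48, 1.0)
```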
8. Direct mapping from c-step to context
[Figure: each c-step of the CDFG (1: IN, 2: subtraction, 3: less-than comparison, 4: OUT, over shared registers a, b, c, x, t) is mapped directly to one context of the DRP; the state transition controller steps through contexts 1-4]
9. Equalize PE usage (1): Area constraint
- Constrain the maximum number of PEs first; excess PEs are forwarded to the next step
- But many c-steps still do not hit the upper limit
10. Causes of low filling rate
A context may be switched without using all of its PEs because of:
- Loops and goto statements
- Data dependencies with synchronous resource accesses (memory, port, etc.)
- Exceeding the timing constraint
11. Equalize PE usage (2): Multiple-step-allocation
Combine consecutive c-steps into one context to raise the minimum number of PEs per context:
- c-steps 1 and 2 → 1st context
- c-steps 3 and 4 → 2nd context
- c-steps 5 and 6 → 3rd context
12. Synthesis flow and additional features in HLS for DRP
Flow: transformation/optimization → scheduling → data-path binding, register binding, control synthesis, module allocation → optimization
For high area efficiency:
- Unit-based area constraint (scheduling)
- No register or operator sharing (binding)
For precise delay controllability:
- Selector delay-aware scheduling
- Wire delay consideration
13. Multiple-step-allocation algorithm

Given: P(s) = # of PEs in c-step s; D(s) = delay of c-step s
Constraints: DC = delay constraint; PC = area (max # of PEs) constraint

  s_list = c-steps sorted by # of PEs
  while (s_list != empty):
      c_list = empty
      foreach s in s_list:
          add s to c_list
          PS = 0
          foreach shared resource in c_list:
              DS = estimated selector delay
              if (D(s) + DS > DC):
                  delete s from c_list; break
              PS += estimated # of PEs for the selector
          if (sum of P(c-step) over c_list + PS > PC):
              delete s from c_list
      delete all c-steps in c_list from s_list
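The pseudocode above can be rendered as a runnable Python sketch. The selector estimates (SEL_DELAY, SEL_PES) and the assumption that each additional grouped step shares one resource are stand-ins for what the real tool derives from the CDFG and P&R measurements:

```python
# Sketch of the multiple-step-allocation loop from this slide.
SEL_DELAY = 2   # assumed selector delay per shared resource
SEL_PES = 1     # assumed PE cost per selector

def allocate_contexts(steps, DC, PC):
    """steps: list of (P, D) pairs, i.e. (# of PEs, delay) per c-step.
    DC: delay constraint; PC: area (max # of PEs) constraint.
    Returns a list of contexts, each a list of c-step indices."""
    # Sort c-steps by # of PEs (largest first), as in the pseudocode.
    s_list = sorted(range(len(steps)), key=lambda s: steps[s][0], reverse=True)
    contexts = []
    while s_list:
        c_list = []
        for s in list(s_list):
            c_list.append(s)
            shared = len(c_list) - 1        # crude shared-resource count
            DS = shared * SEL_DELAY         # estimated selector delay
            PS = shared * SEL_PES           # estimated selector PEs
            if steps[s][1] + DS > DC:       # delay constraint violated
                c_list.pop()
                break
            if sum(steps[i][0] for i in c_list) + PS > PC:  # area violated
                c_list.pop()
                break
        if not c_list:                      # a lone step violating constraints
            c_list.append(s_list[0])        # place it alone to guarantee progress
        for s in c_list:                    # remove grouped steps
            s_list.remove(s)
        contexts.append(c_list)
    return contexts

# Six c-steps as (PEs, delay), with DC = 10 and PC = 16:
print(allocate_contexts([(6, 5), (4, 5), (6, 6), (3, 6), (8, 5), (4, 7)], 10, 16))
```

With these toy numbers the six c-steps pack into three contexts, each bounded by both the delay and the area constraints.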
14. Evaluation with Viterbi decoder
- Peaks are diminished by limiting the number of operational units to 128
- The total number of contexts is reduced by half
[Chart: number of operational units assigned to each context]
15. Area efficiency
[Chart: context usage rate (lower is better) reduced by factors of 1.8 and 2.8; operational unit filling rate (higher is better) improved by factors of 2.2 and 3.4]
A high filling rate is achieved by applying both the # of PEs constraint and multiple-step-allocation.
16. The second challenge of HLS for DRP: Precise delay controllability
- Coarse-grained PE architecture
  - Fewer logic optimization possibilities
  - Limited locations for unit placement
- Limited wire resources
  - Longer wire delay
The DRP's regularity offers opportunities to estimate delay more precisely.
17. Delay model
[Figure: a path from Reg1 through a MUX and an ALU to Reg2 and Mem1; every hop adds wire delay; register delay includes the context switching delay from the STC (FSM); the memory path adds setup time]
Wire delay accounts for up to 75% of the overall design delay (including metal wire, buffer, and routing-switch delays).
18. Precise delay control (1): Wire-delay-aware scheduling
- Typical wire delay between PEs is added to each operational delay
- Typical wire delay between a PE and a memory is added to both the setup time and the delay
- These delays are estimated from previous experimental measurements using the P&R tool
[Figure: the Reg1 → MUX → ALU → Reg2/Mem1 path annotated with inter-PE wire delays, memory wire delay, and memory setup time]
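As a toy illustration of folding typical wire delays into the numbers the scheduler sees (all constants below are invented, not measured DRP figures):

```python
# Invented figures, for illustration only: fold typical wire delays
# into the operational delays and memory setup used by the scheduler.
INTER_PE_WIRE = 0.8   # assumed typical PE-to-PE wire delay (ns)
MEM_WIRE = 1.2        # assumed typical PE-to-memory wire delay (ns)

def effective_op_delay(op_delay):
    """Operational delay as the scheduler sees it: unit + inter-PE wire."""
    return op_delay + INTER_PE_WIRE

def effective_mem_setup(setup):
    """Memory setup as the scheduler sees it: setup + PE-to-memory wire."""
    return setup + MEM_WIRE

# Reg1 -> MUX -> ALU -> Mem1 path, as in the slide's delay model:
path = effective_op_delay(0.5) + effective_op_delay(1.5) + effective_mem_setup(0.7)
print(round(path, 2))  # -> 5.5
```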
19. Operational delays
- ASIC/FPGA: delays of primitive logic and selectors may be neglected
- DRP: primitive logic must be scheduled, and the level of selectors must be handled precisely
[Figure: a chain of an adder, a 2-to-1 selector, and an AND gate; on the DRP each occupies a PE with a comparable delay]
20. Resource sharing for ASIC/FPGA
[Figure: CDFG with operations A-J scheduled over steps 1-4; operators A and C share one unit (A/C)]
- Selectors are needed if registers are not shared
- Registers must be optimally bound during the scheduling stage
21. Precise delay control (2): Resource sharing rules for DRP
1. Selectors for resource sharing
   - Time-exclusive resources: no sharing between contexts, because resource locations are restricted
   - Conditionally exclusive resources: no sharing between conditionally exclusive adders, because PEs are wasted (a selector costs as much as the operator it would replace)
2. Selectors for conditional branching
[Figure: CDFG fragments with operators A, B, D, F, G, H, I illustrating both rules]
22. Precise delay control (3): Selector scheduling for DRP
- Treat a selector node as an operator
- The selector node is moved to the next step if the total delay of the selector and the adder exceeds the designed delay
- A selector node with one input can be ignored for a register or operator, but is necessary for a port or memory
[Figure: CDFG with operations A-J over steps 1-4, showing a selector rescheduled into the following step]
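The move-to-next-step rule amounts to a one-line check; the function name and the numbers below are ours, for illustration:

```python
# Hypothetical check for the selector-scheduling rule: keep the selector
# in the adder's c-step only if their combined delay fits the clock.
def selector_step_offset(adder_delay, selector_delay, clock):
    """0 = selector stays in the same step; 1 = moved to the next step."""
    return 0 if adder_delay + selector_delay <= clock else 1

print(selector_step_offset(4.0, 1.5, 6.0))  # -> 0 (fits: 5.5 <= 6.0)
print(selector_step_offset(4.0, 2.5, 6.0))  # -> 1 (6.5 > 6.0, push to next step)
```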
23. Precise delay control (4): PE-level bit folding
Example: int a, b, c; c = a + b;
A 32-bit adder is decomposed into PE-sized (byte-wide) adders before scheduling so that each piece fits the timing constraint; adders and barrel shifters are decomposed this way.
[Figure: a 32-bit adder split across c-steps under the timing constraint]
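The decomposition can be illustrated with a byte-wise ripple-carry addition in Python; the split into four 8-bit lanes mirrors the byte-oriented PEs, but the code itself is our sketch, not the synthesizer's output:

```python
# Sketch of PE-level bit folding: a 32-bit add realized as four byte-wide
# (PE-sized) adds chained through a ripple carry.
def add32_bytewise(a, b):
    result, carry = 0, 0
    for i in range(4):                     # one byte lane per PE
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        s = x + y + carry
        result |= (s & 0xFF) << (8 * i)    # keep the low byte in this lane
        carry = s >> 8                     # ripple the carry to the next PE
    return result                          # bits beyond 32 are dropped

print(hex(add32_bytewise(0x12345678, 0x0F0F0F0F)))  # -> 0x21436587
```

On the DRP, each lane can then be scheduled independently, so a wide operator no longer forces a long combinational path into a single c-step.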
24. Synthesis flow and additional features in HLS for DRP
Flow: transformation/optimization → scheduling → data-path binding, register binding, control synthesis, module allocation → optimization
For high area efficiency:
- Unit-based area constraint (scheduling)
- No register or operator sharing (binding)
- Step grouping into contexts
For precise delay controllability:
- Selector delay-aware scheduling
- Wire delay consideration
25. Delay prediction accuracy (Viterbi decoder case)
26. Conclusion
- Datapaths are highly parallelized and well equalized among contexts
  - A new technique, multiple-step-allocation, is introduced to achieve high area efficiency
- Delay is successfully controlled at the PE level
  - Wire delay prediction works well thanks to the architectural regularity
  - Both operator and register sharing rules are completely changed
  - Operators are decomposed to the PE level
  - Selectors and primitive logic are scheduled precisely
27. Supplement
28. Processing Element (PE)
- ALU: ordinary byte arithmetic/logic operations
- DMU (data manipulation unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
- An instruction dictates the ALU/DMU operations and the inter-PE connections
- Source/destination operands can come from, or go to, either:
  - its own register file
  - other PEs (i.e., flow-through)
- The instruction pointer (IP) is provided by the STC (state transition controller)
[Figure: PE block diagram with data and flag wires (Data_in 8b×2, Data_out 8b, Flag_in, Flag_out), instructions, register file, ALU, DMU, and the IP from the STC]
29. Balancing strategy
Combine as many c-steps as possible into one context to maximize area efficiency.
[Figure: c-steps 1-6 grouped into a 1st, 2nd, and 3rd context]
30. Throughput
Multiple-step-allocation doesn't affect the throughput.