Title: High-Level Synthesis Challenges and Solutions for a Dynamically Reconfigurable Processor
1. High-Level Synthesis Challenges and Solutions for a Dynamically Reconfigurable Processor
- Takao Toi, Noritsugu Nakamura, Yoshinosuke Kato, Toru Awashima, and Kazutoshi Wakabayashi (System Devices Research Laboratories, NEC Corporation)
- Li Jing (NEC Informatec Systems, Ltd.)
2. Contents
- Dynamically reconfigurable processor (DRP) architecture
- High-level synthesis (HLS) challenges and solutions for DRP
  - How to achieve high area efficiency
  - How to control delay precisely
- Evaluations
3. DRP architecture
DRP core (variable array size):
- Array of byte-oriented, chainable processing elements (PEs)
- Fully programmable inter-PE wiring resources
- State transition controller (STC): a simple sequencer that works as an FSM
- Array of configurable data memories
Dynamic reconfiguration was originally intended to achieve higher area efficiency than other reconfigurable devices (FPGA/CPLD); the coarse-grained PE array architecture reduces the number of configuration bits.
4. How the context transition works
Reconfigurable on every cycle.
[Figure: the state transition controller switches the active context each cycle, driven by event flags]
5. Unique points of DRP compared to ASIC/FPGA
- Multi-context datapath: can be switched on every cycle
- Coarse-grained PEs: unit-based operators and registers
- Longer wire delay: same as FPGA
[Figure: PE block diagram with ALU, DMU, register file, and instructions]
6. Compilation flow and design environment for DRP
- Flow: C source code → high-level synthesizer → contexts, FSM, datapath, and scheduling result → technology mapping tool → place & route tool → STC code and PE code → DRP chip
- Editor with on-chip debugger: FSM with C source correspondence
- Tool launcher with iterative optimizer: feeds resource usage back into synthesis
7. The first challenge of HLS for DRP: High area efficiency

  Context              Without equalization   With equalization
  1st                  20 PEs                 16 PEs
  2nd                   2 PEs                 16 PEs
  3rd                  10 PEs                 16 PEs
  4th                  16 PEs                 (unused)
  Total                48 PEs                 48 PEs
  Max capacity         80 PEs                 48 PEs
  Filling rate         60%                    100%
  Context usage rate   4/4                    3/4

The number of PEs should be equalized among contexts.
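The table's arithmetic can be sketched in Python; the helper name and the capacity formula (number of contexts times the widest context used) are our assumptions, inferred from the slide's numbers:

```python
# Hypothetical helper reproducing the slide's filling-rate arithmetic.
def mapping_stats(pes_per_context):
    """Return (total PEs, max capacity, filling rate) for one mapping."""
    total = sum(pes_per_context)
    # Capacity assumed to be: number of contexts x the widest context used.
    capacity = len(pes_per_context) * max(pes_per_context)
    return total, capacity, total / capacity

# Without equalization: 20 + 2 + 10 + 16 PEs over 4 contexts
print(mapping_stats([20, 2, 10, 16]))  # -> (48, 80, 0.6)
# With equalization: 16 PEs in each of 3 contexts
print(mapping_stats([16, 16, 16]))     # -> (48, 48, 1.0)
```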
8. Direct mapping from c-step to context
[Figure: each c-step of the CDFG (1: IN, 2: subtraction, 3: less-than comparison, 4: OUT, over shared registers a, b, c, x, t) is mapped directly to one context of the DRP; the state transition controller steps through contexts 1-4]
9. Equalize PE usage (1): Area constraint
- Constrain the maximum number of PEs first; excess PEs are forwarded to the next step
- But many c-steps still do not hit the upper limit
10. Causes of low filling rate
A context may be switched without using all of its PEs because of:
- Loops and goto statements
- Data dependencies with synchronous resource accesses (memory, port, etc.)
- Exceeding the timing constraint
11. Equalize PE usage (2): Multiple-step-allocation
Combine consecutive c-steps into one context to raise the minimum number of PEs per context:
- c-steps 1 and 2 → 1st context
- c-steps 3 and 4 → 2nd context
- c-steps 5 and 6 → 3rd context
12. Synthesis flow and additional features in HLS for DRP
Flow: transformation/optimization → scheduling → data-path binding, register binding, control synthesis, module allocation → optimization
For high area efficiency:
- Unit-based area constraint (scheduling)
- No register or operator sharing (binding)
For precise delay controllability:
- Selector delay-aware scheduling
- Wire delay consideration
13. Multiple-step-allocation algorithm

Given: P(s) = # of PEs in c-step s; D(s) = delay of c-step s
Constraints: DC = delay constraint; PC = area (max # of PEs) constraint

  s_list = c-steps sorted by # of PEs
  while (s_list != empty):
      c_list = empty
      foreach s in s_list:
          add s to c_list
          PS = 0
          foreach shared resource in c_list:
              DS = estimated selector delay
              if (D(s) + DS > DC):
                  delete s from c_list; break
              PS += estimated # of PEs for the selector
          if (sum of P(c-step) over c_list + PS > PC):
              delete s from c_list
      delete all c-steps in c_list from s_list
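The pseudocode above can be rendered as a runnable Python sketch. The selector estimates (SEL_DELAY, SEL_PES) and the assumption that each additional grouped step shares one resource are stand-ins for what the real tool derives from the CDFG and P&R measurements:

```python
# Sketch of the multiple-step-allocation loop from this slide.
SEL_DELAY = 2   # assumed selector delay per shared resource
SEL_PES = 1     # assumed PE cost per selector

def allocate_contexts(steps, DC, PC):
    """steps: list of (P, D) pairs, i.e. (# of PEs, delay) per c-step.
    DC: delay constraint; PC: area (max # of PEs) constraint.
    Returns a list of contexts, each a list of c-step indices."""
    # Sort c-steps by # of PEs (largest first), as in the pseudocode.
    s_list = sorted(range(len(steps)), key=lambda s: steps[s][0], reverse=True)
    contexts = []
    while s_list:
        c_list = []
        for s in list(s_list):
            c_list.append(s)
            shared = len(c_list) - 1        # crude shared-resource count
            DS = shared * SEL_DELAY         # estimated selector delay
            PS = shared * SEL_PES           # estimated selector PEs
            if steps[s][1] + DS > DC:       # delay constraint violated
                c_list.pop()
                break
            if sum(steps[i][0] for i in c_list) + PS > PC:  # area violated
                c_list.pop()
                break
        if not c_list:                      # a lone step violating constraints
            c_list.append(s_list[0])        # place it alone to guarantee progress
        for s in c_list:                    # remove grouped steps
            s_list.remove(s)
        contexts.append(c_list)
    return contexts

# Six c-steps as (PEs, delay), with DC = 10 and PC = 16:
print(allocate_contexts([(6, 5), (4, 5), (6, 6), (3, 6), (8, 5), (4, 7)], 10, 16))
```

With these toy numbers the six c-steps pack into three contexts, each bounded by both the delay and the area constraints.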
14. Evaluation with Viterbi decoder
- Peaks are diminished by limiting the number of operational units to 128
- The total number of contexts is reduced by half
[Chart: number of operational units assigned to each context]
15. Area efficiency
[Chart: context usage rate (lower is better) reduced by factors of 1.8 and 2.8; operational unit filling rate (higher is better) improved by factors of 2.2 and 3.4]
A high filling rate is achieved by applying both the # of PEs constraint and multiple-step-allocation.
16. The second challenge of HLS for DRP: Precise delay controllability
- Coarse-grained PE architecture
  - Fewer logic optimization possibilities
  - Limited locations for unit placement
- Limited wire resources
  - Longer wire delay
The DRP's regularity offers opportunities to estimate delay more precisely.
17. Delay model
[Figure: a path from Reg1 through a MUX and an ALU to Reg2 and Mem1; every hop adds wire delay; register delay includes the context switching delay from the STC (FSM); the memory path adds setup time]
Wire delay accounts for up to 75% of the overall design delay (including metal wire, buffer, and routing-switch delays).
18. Precise delay control (1): Wire-delay-aware scheduling
- Typical wire delay between PEs is added to each operational delay
- Typical wire delay between a PE and a memory is added to both the setup time and the delay
- These delays are estimated from previous experimental measurements using the P&R tool
[Figure: the Reg1 → MUX → ALU → Reg2/Mem1 path annotated with inter-PE wire delays, memory wire delay, and memory setup time]
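As a toy illustration of folding typical wire delays into the numbers the scheduler sees (all constants below are invented, not measured DRP figures):

```python
# Invented figures, for illustration only: fold typical wire delays
# into the operational delays and memory setup used by the scheduler.
INTER_PE_WIRE = 0.8   # assumed typical PE-to-PE wire delay (ns)
MEM_WIRE = 1.2        # assumed typical PE-to-memory wire delay (ns)

def effective_op_delay(op_delay):
    """Operational delay as the scheduler sees it: unit + inter-PE wire."""
    return op_delay + INTER_PE_WIRE

def effective_mem_setup(setup):
    """Memory setup as the scheduler sees it: setup + PE-to-memory wire."""
    return setup + MEM_WIRE

# Reg1 -> MUX -> ALU -> Mem1 path, as in the slide's delay model:
path = effective_op_delay(0.5) + effective_op_delay(1.5) + effective_mem_setup(0.7)
print(round(path, 2))  # -> 5.5
```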
19. Operational delays
- ASIC/FPGA: delays of primitive logic and selectors may be neglected
- DRP: primitive logic must be scheduled, and the level of selectors must be handled precisely
[Figure: a chain of an adder, a 2-to-1 selector, and an AND gate; on the DRP each occupies a PE with a comparable delay]
20. Resource sharing for ASIC/FPGA
[Figure: CDFG with operations A-J scheduled over steps 1-4; operators A and C share one unit (A/C)]
- Selectors are needed if registers are not shared
- Registers must be optimally bound during the scheduling stage
21. Precise delay control (2): Resource sharing rules for DRP
1. Selectors for resource sharing
   - Time-exclusive resources: no sharing between contexts, because resource locations are restricted
   - Conditionally exclusive resources: no sharing between conditionally exclusive adders, because PEs are wasted (a selector costs as much as the operator it would replace)
2. Selectors for conditional branching
[Figure: CDFG fragments with operators A, B, D, F, G, H, I illustrating both rules]
22. Precise delay control (3): Selector scheduling for DRP
- Treat a selector node as an operator
- The selector node is moved to the next step if the total delay of the selector and the adder exceeds the designed delay
- A selector node with one input can be ignored for a register or operator, but is necessary for a port or memory
[Figure: CDFG with operations A-J over steps 1-4, showing a selector rescheduled into the following step]
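The move-to-next-step rule amounts to a one-line check; the function name and the numbers below are ours, for illustration:

```python
# Hypothetical check for the selector-scheduling rule: keep the selector
# in the adder's c-step only if their combined delay fits the clock.
def selector_step_offset(adder_delay, selector_delay, clock):
    """0 = selector stays in the same step; 1 = moved to the next step."""
    return 0 if adder_delay + selector_delay <= clock else 1

print(selector_step_offset(4.0, 1.5, 6.0))  # -> 0 (fits: 5.5 <= 6.0)
print(selector_step_offset(4.0, 2.5, 6.0))  # -> 1 (6.5 > 6.0, push to next step)
```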
23. Precise delay control (4): PE-level bit folding
Example: int a, b, c; c = a + b;
A 32-bit adder is decomposed into PE-sized (byte-wide) adders before scheduling so that each piece fits the timing constraint; adders and barrel shifters are decomposed this way.
[Figure: a 32-bit adder split across c-steps under the timing constraint]
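The decomposition can be illustrated with a byte-wise ripple-carry addition in Python; the split into four 8-bit lanes mirrors the byte-oriented PEs, but the code itself is our sketch, not the synthesizer's output:

```python
# Sketch of PE-level bit folding: a 32-bit add realized as four byte-wide
# (PE-sized) adds chained through a ripple carry.
def add32_bytewise(a, b):
    result, carry = 0, 0
    for i in range(4):                     # one byte lane per PE
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        s = x + y + carry
        result |= (s & 0xFF) << (8 * i)    # keep the low byte in this lane
        carry = s >> 8                     # ripple the carry to the next PE
    return result                          # bits beyond 32 are dropped

print(hex(add32_bytewise(0x12345678, 0x0F0F0F0F)))  # -> 0x21436587
```

On the DRP, each lane can then be scheduled independently, so a wide operator no longer forces a long combinational path into a single c-step.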
24. Synthesis flow and additional features in HLS for DRP
Flow: transformation/optimization → scheduling → data-path binding, register binding, control synthesis, module allocation → optimization
For high area efficiency:
- Unit-based area constraint (scheduling)
- No register or operator sharing (binding)
- Step grouping into contexts
For precise delay controllability:
- Selector delay-aware scheduling
- Wire delay consideration
25. Delay prediction accuracy (Viterbi decoder case)
26. Conclusion
- Datapaths are highly parallelized and well equalized among contexts
  - A new technique, multiple-step-allocation, is introduced to achieve high area efficiency
- Delay is successfully controlled at the PE level
  - Wire delay prediction works well thanks to the architectural regularity
  - Both operator and register sharing rules are completely changed
  - Operators are decomposed to the PE level
  - Selectors and primitive logic are scheduled precisely
27. Supplement
28. Processing Element (PE)
- ALU: ordinary byte arithmetic/logic operations
- DMU (data manipulation unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
- An instruction dictates the ALU/DMU operations and the inter-PE connections
- Source/destination operands can come from, or go to, either:
  - its own register file
  - other PEs (i.e., flow-through)
- The instruction pointer (IP) is provided by the STC (state transition controller)
[Figure: PE block diagram with data and flag wires (Data_in 8b×2, Data_out 8b, Flag_in, Flag_out), instructions, register file, ALU, DMU, and the IP from the STC]
29. Balancing strategy
Combine as many c-steps as possible into one context to maximize area efficiency.
[Figure: c-steps 1-6 grouped into a 1st, 2nd, and 3rd context]
30. Throughput
Multiple-step-allocation doesn't affect the throughput.