High-Level Synthesis Challenges and Solutions for a Dynamically Reconfigurable Processor

1
High-Level Synthesis Challenges and Solutions for
a Dynamically Reconfigurable Processor
  • Takao Toi, Noritsugu Nakamura, Yoshinosuke Kato,
    Toru Awashima, and Kazutoshi Wakabayashi
  • System Devices Research Laboratories, NEC
    Corporation
  • Li Jing
  • NEC Informatec Systems, Ltd.

2
Contents
  • Dynamically reconfigurable processor (DRP)
    architecture
  • High-level synthesis (HLS) challenges and
    solutions for DRP
  • How to achieve high area efficiency
  • How to control delay precisely
  • Evaluations

3
DRP architecture
DRP core (variable array size)
  • Array of byte-oriented, chainable processing
    elements
  • Fully programmable inter-PE wiring resources

State Transition Controller
  • A simple sequencer that works as an FSM
  • Array of configurable data memories
  • Dynamic reconfiguration was originally intended
    to achieve higher area efficiency than other
    reconfigurable devices (FPGA/CPLD)
  • Coarse-grained PE array architecture to reduce
    configuration bits

4
How the context transition works
Reconfigurable on every cycle
[Figure: context transition; labels: Context, Event flag, State transition controller]
5
Unique points of DRP compared to ASIC/FPGA
  • Multi-context datapath
    • Can be switched on every cycle
  • Coarse-grained PE
    • Unit-based operator and register
  • Longer wire delay
    • Same as FPGA

[Figure: PE block diagram; labels: DMU, Register File, Instructions, ALU]
6
Compilation flow and design environment for DRP
[Figure: compilation flow and design environment. Flow labels: C source code, high-level synthesizer, FSM, contexts (datapath), technology mapping tool, place and route tool, STC code, PE code, DRP chip. Environment labels: editor with on-chip debugger, FSM with C source correspondence, tool launcher with iterative optimizer, resource usage, scheduling result.]
7
The first challenge of HLS for DRP: High area efficiency
                     Unequalized   Equalized
1st context          20 PEs        16 PEs
2nd context          2 PEs         16 PEs
3rd context          10 PEs        16 PEs
4th context          16 PEs        -
Total                48 PEs        48 PEs
Max capacity         80 PEs        48 PEs
Filling rate         60%           100%
Context usage rate   4/4           3/4
The number of PEs should be equalized among contexts
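
The figures on this slide can be reproduced with a small calculation. The sketch below assumes that the maximum capacity is the largest per-context PE count multiplied by the number of contexts in use, which is consistent with the 80-PE and 48-PE values on the slide; the function name `area_efficiency` is illustrative and not part of the DRP tools.

```python
def area_efficiency(pes_per_context, max_contexts):
    """Return (max_capacity, filling_rate, context_usage_rate).

    Assumption: the PE array must be sized for the largest context,
    so capacity = max PEs in any context * number of contexts used.
    """
    used_contexts = len(pes_per_context)
    max_capacity = max(pes_per_context) * used_contexts
    filling_rate = sum(pes_per_context) / max_capacity
    context_usage_rate = used_contexts / max_contexts
    return max_capacity, filling_rate, context_usage_rate


unequalized = area_efficiency([20, 2, 10, 16], max_contexts=4)
equalized = area_efficiency([16, 16, 16], max_contexts=4)
print(unequalized)  # (80, 0.6, 1.0)
print(equalized)    # (48, 1.0, 0.75)
```

Both layouts use 48 PEs in total, but the equalized one needs a smaller array and one fewer context.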
8
Direct mapping from c-step to context
[Figure: each c-step (1 to 4) in the CDFG maps directly to one DRP context (1 to 4). Labels: IN, OUT, shared registers (a, b, c, x, t), subtract (-) and compare (<) operations on PEs, state transition controller.]
9
Equalize PE usage (1): Area constraint
Excess PEs are forwarded to the next step
Constrain upper limit
Constrain the maximum number of PEs first
But many c-steps do not reach the upper limit
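
The effect of an upper limit can be illustrated with a deliberately simplified sketch. It assumes that any excess PE demand can be deferred to the following step, ignoring data dependencies and timing, which a real scheduler cannot do; `apply_pe_limit` is a hypothetical helper, not the tool's algorithm.

```python
def apply_pe_limit(pes_per_step, limit):
    """Cap each c-step at `limit` PEs, forwarding the excess to the next step.

    Simplification: PE demand is treated as freely deferrable.
    """
    result = []
    carry = 0
    for demand in pes_per_step:
        total = demand + carry
        used = min(total, limit)
        carry = total - used
        result.append(used)
    # Any excess left after the last step opens additional steps.
    while carry > 0:
        used = min(carry, limit)
        result.append(used)
        carry -= used
    return result


print(apply_pe_limit([20, 2, 10, 16], limit=16))  # [16, 6, 10, 16]
```

The peak is removed, but the second and third steps still sit well below the limit, which is the gap the next technique addresses.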
10
Causes of low filling rate
  • The context may be changed without using all of
    the PEs in a context, because of:
    • Loops and goto statements
    • Data dependencies with synchronous resource access
      (memory, port, etc.)
    • Operations that would exceed the timing constraint

11
Equalize PE usage (2): Multiple-step-allocation
Combine steps as one context
[Figure: c-steps 1 to 6 grouped into the 1st, 2nd, and 3rd contexts]
Raise the lower bound on the number of PEs
12
Synthesis flow and additional features in HLS for
DRP
For high area efficiency
For precise delay controllability
Transformation Optimization
  • PE-level bit folding
  • Selector delay-aware scheduling
  • Wire delay consideration

Scheduling
  • Unit-base area constraint

Data-path binding, register binding, control
synthesis, module allocation
  • No register or operator sharing
  • Multiple-step-allocation

Optimization
13
Multiple-step-allocation algorithm
s_list = Sort c-steps by # of PEs
while (s_list != ∅)
  c_list = ∅
  foreach s in s_list
    Add s to c_list
    PS = 0
    foreach shared resource in c_list
      DS = Estimate selector delay
      if (D(s) + DS > DC)
        Delete s from c_list
        break
      PS += Estimate # of selectors
    if (ΣP(c-step in c_list) + PS > PC)
      Delete s from c_list
  Delete all c-steps in c_list from s_list

Given numbers
  P(s): # of PEs in each c-step
  D(s): Delay in each c-step
Constraints
  DC: Delay constraint
  PC: Area (max # of PEs) constraint
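
The multiple-step-allocation pseudocode on this slide can be sketched in runnable form as follows. Several details are assumptions, since the slide does not state them: c-steps are sorted largest first, the first c-step of each pass is always accepted (so the loop terminates), the selector PE estimate accumulates over shared resources, and each finished `c_list` becomes one context. The callbacks `shared`, `sel_delay`, and `sel_pes` are hypothetical stand-ins for the tool's estimators.

```python
def multiple_step_allocation(P, D, DC, PC, shared, sel_delay, sel_pes):
    """Group c-steps into contexts under delay (DC) and area (PC) constraints.

    P, D      : dicts mapping c-step -> number of PEs / delay
    shared    : function(c_list) -> resources shared within c_list
    sel_delay : function(resource) -> estimated selector delay
    sel_pes   : function(resource) -> estimated PEs for its selector
    """
    s_list = sorted(P, key=P.get, reverse=True)  # assumption: largest first
    contexts = []
    while s_list:
        c_list = []
        for s in s_list:
            c_list.append(s)
            if len(c_list) == 1:
                continue  # assumption: a single c-step always fits
            ps = 0
            ok = True
            for r in shared(c_list):
                if D[s] + sel_delay(r) > DC:
                    ok = False  # selector would break the delay constraint
                    break
                ps += sel_pes(r)
            if ok and sum(P[c] for c in c_list) + ps > PC:
                ok = False  # combined context would exceed the PE limit
            if not ok:
                c_list.pop()  # delete s from c_list
        s_list = [s for s in s_list if s not in c_list]
        contexts.append(c_list)
    return contexts


# Hypothetical example: four c-steps, no shared resources, at most 24 PEs.
P = {1: 20, 2: 2, 3: 10, 4: 16}
D = {1: 5, 2: 5, 3: 5, 4: 5}
print(multiple_step_allocation(P, D, DC=10, PC=24,
                               shared=lambda c: [],
                               sel_delay=lambda r: 0,
                               sel_pes=lambda r: 0))
# [[1, 2], [4], [3]]
```

In this example the small c-step 2 is folded into the context of c-step 1, reducing four contexts to three.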
14
Evaluation with Viterbi decoder
Peaks are diminished by limiting the number of units to 128
The total number of contexts is reduced by half
[Figure: number of operational units assigned in each context]
15
Area efficiency
[Figure: context usage rate (lower is better) and operational unit filling rate (higher is better); chart annotations: ×1/1.8, ×2.2, ×3.4, ×1/2.8]
A high filling rate is achieved by applying both the
constraint on the number of PEs and multiple-step-allocation
16
The second challenge of HLS for DRP: Precise delay
controllability
  • Coarse-grained PE architecture
  • Fewer logic optimization possibilities
  • Limited locations for unit placement
  • Limited wire resources
  • Longer wire delay

DRP's regularity offers opportunities to estimate
delay more precisely
17
Delay model
[Figure: delay path; labels: Reg1, Reg2, STC (FSM), register delay with context switching delay, wire delay, MUX delay, wire delay, FF, ALU delay, wire delay, memory setup time, Mem1]
Wire delay accounts for up to 75% of the overall
design delay (including metal wire, buffer, and
routing-switch delays)
18
Precise delay control (1): Wire-delay-aware
scheduling
Typical wire delay between PEs is added to each
operational delay
Typical wire delay between a PE and memory is added
to both setup time and delay
These delays are estimated from previous
experimental measurements using the place-and-route (PR) tool
[Figure: delay path; labels: Reg1, Reg2, register delay, inter-PE wire delay, MUX delay, inter-PE wire delay, ALU delay, memory wire delay, memory setup time, Mem1]
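
The way typical wire delays enter the schedule can be sketched as a simple sum along a register-to-memory path. All delay values below are hypothetical placeholders in arbitrary integer units, not measured DRP numbers, and the function names are illustrative only.

```python
# Hypothetical typical wire delays (arbitrary units), standing in for
# values that would come from place-and-route measurements.
INTER_PE_WIRE = 10
MEMORY_WIRE = 15


def op_delay(intrinsic):
    """Operational delay with the typical inter-PE wire delay added."""
    return intrinsic + INTER_PE_WIRE


def path_delay(register_delay, op_delays, memory_setup=None):
    """Total delay of a path: register, chained operators, optional memory."""
    total = register_delay + sum(op_delay(d) for d in op_delays)
    if memory_setup is not None:
        # The PE-to-memory wire delay is added to the memory setup time.
        total += memory_setup + MEMORY_WIRE
    return total


# Register (5), then MUX (8) and ALU (20), then memory setup (7).
total = path_delay(5, [8, 20], memory_setup=7)
print(total)        # 75
print(total <= 80)  # True: fits a hypothetical delay constraint of 80
```

With these placeholder numbers, wires contribute 35 of the 75 units, which shows why ignoring them would make the schedule optimistic.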
19
Operational delays
  • ASIC/FPGA
    • Delay of primitive logic and selectors may be
      neglected
  • DRP
    • Primitive logic should be scheduled
    • Selector levels must be handled precisely

[Figure: two circuits, each with a comparator (>), adder, 2-to-1 selector, and AND]
20
Resource sharing for ASIC/FPGA
[Figure: operations A to J scheduled over steps 1 to 4, with A and C sharing a resource (A/C)]
Selectors are needed if registers are not shared
Registers must be optimally bound during the
scheduling stage
21
Precise delay control (2): Resource sharing rule
for DRP
  • 1. Selector for resource sharing
    • For time-exclusive resources
      • Resource locations are restricted
    • For conditionally exclusive resources
      • PEs are wasted because a selector and the other
        operators are equivalent
  • 2. Selector for conditional branching

[Figure: operations A, B, D, F, G, H, I; annotations: "No sharing between contexts", "No sharing between conditionally exclusive adders"]
22
Precise delay control (3): Selector scheduling for
DRP
  • Treats a selector node as an operator
  • This node is moved to the next step if the total
    delay of both selector and adder exceeds the
    designed delay

[Figure: operations A to J scheduled over steps 1 to 4]
A selector node with one input can be ignored for a
register or operator, but is necessary for a port
or memory
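
Treating a selector as an ordinary operator means its delay counts against the step's budget like any other node. The sketch below shows this for a simple dependency chain; it assumes that whichever node would push the accumulated delay past the designed delay is the one moved to the next step, and all delay values are hypothetical.

```python
def schedule_chain(nodes, designed_delay):
    """Pack a chain of (name, delay) nodes into steps.

    A node, selector or operator alike, starts a new step when adding it
    would make the step's total delay exceed `designed_delay`.
    """
    steps = [[]]
    used = 0
    for name, delay in nodes:
        if steps[-1] and used + delay > designed_delay:
            steps.append([])  # move this node to the next step
            used = 0
        steps[-1].append(name)
        used += delay
    return steps


chain = [("sel1", 3), ("add1", 6), ("sel2", 3), ("add2", 6)]
print(schedule_chain(chain, designed_delay=10))
# [['sel1', 'add1'], ['sel2', 'add2']]
print(schedule_chain(chain, designed_delay=8))
# [['sel1'], ['add1'], ['sel2'], ['add2']]
```

With the tighter budget, a selector and an adder no longer fit in one step, so each is scheduled on its own.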
23
Precise delay control (4): PE-level bit folding
int a, b; c = a + b;
[Figure: a 32-bit adder computing c from a and b, shown against the timing constraint]
Decompose adder and barrel shifter before
scheduling
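
Because the PEs are byte-oriented and chainable, a 32-bit addition can be viewed as four byte-wide additions linked by a carry. The sketch below models only that decomposition; how the pieces are distributed across steps under the timing constraint is left to the scheduler and is not shown. The function name is illustrative.

```python
def add32_bytewise(a, b):
    """Add two 32-bit values using four 8-bit additions linked by a carry."""
    result = 0
    carry = 0
    for i in range(4):
        byte_a = (a >> (8 * i)) & 0xFF
        byte_b = (b >> (8 * i)) & 0xFF
        s = byte_a + byte_b + carry        # one byte-wide ALU operation
        result |= (s & 0xFF) << (8 * i)
        carry = s >> 8                     # carry passed to the next byte
    return result  # wraps modulo 2**32; the final carry is dropped


a, b = 0x12345678, 0x0FEDCBA9
print(add32_bytewise(a, b) == (a + b) & 0xFFFFFFFF)  # True
```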
24
Synthesis flow and additional features in HLS for
DRP
For high area efficiency
For precise delay controllability
Transformation Optimization
  • PE-level bit folding
  • Selector delay-aware scheduling
  • Wire delay consideration

Scheduling
  • Unit-base area constraint

Data-path binding, register binding, control
synthesis, module allocation
  • No register or operator sharing
  • Step grouping into context

Optimization
25
Delay prediction accuracy (Viterbi decoder case)
26
Conclusion
  • Datapaths are highly parallelized and well
    equalized among contexts.
  • A new technique, multiple-step-allocation, is
    introduced to achieve high area efficiency.
  • Delay is successfully controlled at the PE level.
  • Wire delay prediction works well thanks to the
    architectural regularity.
  • Both operator and register sharing rules are
    completely changed.
  • Operators are decomposed to the PE level.
  • Selectors and primitive logic are scheduled
    precisely.

27
Supplement
28
Processing Element (PE)
  • ALU: ordinary byte arithmetic/logic operations
  • DMU (data manipulation unit): handles byte
    select, shift, mask, constant generation, etc.,
    as well as bit manipulations
  • An instruction dictates ALU/DMU operations and
    inter-PE connections
  • Source/destination operands can be either from/to
    • its own register file
    • other PEs (i.e., flow through)
  • Instruction pointer (IP) is provided from the STC
    (state transition controller)

[Figure: PE block diagram; labels: IP, Data Wire, Flag Wire, Flag_in, Flag_out, Data_in (8b x 2), Data_out (8b), DMU, ALU, Register File, Instructions]
29
Balancing strategy
Combine as many steps as possible as one context
Maximize the area efficiency
[Figure: c-steps 1 to 6 grouped into the 1st, 2nd, and 3rd contexts]
30
Throughput
Multiple-step-allocation does not affect
throughput