Title: HLS-l: High-Level Synthesis of High Performance Latch-Based Circuits
1HLS-l: High-Level Synthesis of High Performance Latch-Based Circuits
- Seungwhun Paik, Insup Shin, and Youngsoo Shin
- Dept. of Electrical Engineering, KAIST, KOREA
2Outline
- Motivation and main idea
- Latch-based high-level synthesis: HLS-l
- Scheduling
- Register allocation
- Control synthesis
- Optimize duty cycle
- Experimental results
- Conclusion
3Motivation
- Large performance gap between custom designs and ASICs
  - Microarchitecture, circuit style, cell design, coping with process variation, sequencing overhead, etc.
- Latch-based designs
  - Pros: lower sequencing overhead; transparency offers time borrowing
  - Cons: complicated timing analysis, more glitches
D. Chinnery et al., "Closing the gap between ASIC & custom," Kluwer Academic Publishers, 2002.
4Main Idea of HLS-l
- Schedule operations at both edges of clock
- Scheduling is done in a finer granularity
- Control signals are generated on a per-phase-step (p-step) basis
[Figure: control-step]
5Main Idea of HLS-l
[Figure: conventional scheduling vs. proposed scheduling]
6Operation Delay (c-step)
- Conventional c-step based scheduling
  - Execution delay of operation i is given as a number of c-steps
  - Tclk: clock period
  - DFU(i): max. delay of the FU that computes OP i
  - Dmargin: extra delay through the data-path
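The c-step count can be sketched as follows. This assumes the usual convention that the combined delay DFU(i) + Dmargin is rounded up to a whole number of clock periods; the rounding rule and the function and parameter names are assumptions for illustration, not taken from the slide.

```python
import math


def cstep_delay(d_fu: float, d_margin: float, t_clk: float) -> int:
    """Number of c-steps occupied by an operation.

    Assumption: the total delay (FU delay plus data-path margin) is
    rounded up to whole clock periods.
    """
    return math.ceil((d_fu + d_margin) / t_clk)


# With Tclk = 10, delays of 10, 13 and 17 (margin already included)
# occupy 1, 2 and 2 c-steps respectively.
delays_in_csteps = [cstep_delay(d, 0, 10) for d in (10, 13, 17)]
```

Under this model, any operation whose delay slightly exceeds a multiple of Tclk pays for a whole extra clock period, which is the overhead that p-step scheduling targets.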
7Operation Delay (p-step)
- P-step based scheduling
  - Execution delay of operation i is given as a number of p-steps
  - ri: residual delay (ri = Di mod Tclk)
  - Ttr: the period of time during which latches are transparent
  - Pi: p-step where operation i is scheduled
8Operation Delay (p-step)
- ri ≠ 0 ⇒ p-step based OP delay may vary
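One reading of the p-step delay that is consistent with the Ttr ranges and the experimental numbers on the later slides can be sketched as below. It assumes that p-steps alternate between a transparent phase of length Ttr and an opaque phase of length Tclk - Ttr, that every full clock period costs two p-steps, and that the residual ri costs one more p-step if it fits in the phase where the operation starts and two otherwise. This is a hypothetical model for illustration, not the authors' exact formula.

```python
def pstep_delay(d: int, t_clk: int, t_tr: int, starts_transparent: bool) -> int:
    """Number of p-steps occupied by an operation of delay d.

    Assumed model: each full clock period takes 2 p-steps; the residual
    r = d mod t_clk takes 1 extra p-step if it fits in the phase in which
    the operation starts, and 2 extra p-steps otherwise.
    """
    full_cycles, r = divmod(d, t_clk)
    steps = 2 * full_cycles
    if r == 0:
        return steps
    first_phase = t_tr if starts_transparent else t_clk - t_tr
    return steps + (1 if r <= first_phase else 2)


# Tclk = 10, Ttr = 3, delay 17 (r = 7): the delay depends on the start phase.
from_transparent = pstep_delay(17, 10, 3, starts_transparent=True)
from_opaque = pstep_delay(17, 10, 3, starts_transparent=False)
```

With ri = 0 the result is the same for both start phases, which matches the statement that the delay may vary only when ri ≠ 0.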
9P-step Based Scheduling
- Most conventional scheduling algorithms can be easily extended to p-step based scheduling
- No need to postpone scheduling of an operation to the next p-step (even though the delay gets smaller)
- Concurrent read/write operations must be handled with care
[Figure: schedules taking 4 p-steps and 3 p-steps]
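As an illustration of how a conventional scheduler might be extended, here is a minimal resource-constrained list-scheduling sketch that advances in p-steps instead of c-steps. Everything in it is an assumption made for the example: the delay model in `pstep_delay`, the convention that even p-steps are transparent, the priority order (insertion order of `ops`), and all names. Concurrent read/write operations are not modeled.

```python
def pstep_delay(d, t_clk, t_tr, starts_transparent):
    """Assumed p-step delay model: 2 p-steps per full clock period plus
    1 or 2 p-steps for the residual, depending on the start phase."""
    full_cycles, r = divmod(d, t_clk)
    if r == 0:
        return 2 * full_cycles
    first_phase = t_tr if starts_transparent else t_clk - t_tr
    return 2 * full_cycles + (1 if r <= first_phase else 2)


def list_schedule(ops, deps, n_units, t_clk, t_tr):
    """ops: {name: (fu_type, delay)}, deps: {name: [predecessors]},
    n_units: {fu_type: number of units}.
    Returns {name: (start_pstep, end_pstep)}."""
    free_at = {fu: [0] * n for fu, n in n_units.items()}  # per-unit free time
    schedule = {}
    remaining = list(ops)
    p = 0
    while remaining:
        for name in list(remaining):
            fu, delay = ops[name]
            preds = deps.get(name, [])
            # Ready only when every predecessor has finished by p-step p.
            if any(q not in schedule or schedule[q][1] > p for q in preds):
                continue
            idle = [i for i, t in enumerate(free_at[fu]) if t <= p]
            if not idle:
                continue
            # Assumption: even p-steps are transparent, odd ones opaque.
            dur = pstep_delay(delay, t_clk, t_tr, starts_transparent=(p % 2 == 0))
            schedule[name] = (p, p + dur)
            free_at[fu][idle[0]] = p + dur
            remaining.remove(name)
        p += 1
    return schedule


# A multiplication (delay 13) feeding an addition (delay 10), Tclk = 10, Ttr = 5.
example = list_schedule(
    ops={"m1": ("mul", 13), "a1": ("alu", 10)},
    deps={"a1": ["m1"]},
    n_units={"mul": 1, "alu": 1},
    t_clk=10,
    t_tr=5,
)
```

In this example the addition starts at p-step 3, in the middle of a clock period, so the chain ends after 5 p-steps; a c-step schedule of the same chain would need 3 c-steps, that is, 6 p-steps.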
10Register Allocation
- Coloring of register conflict graph
- Use of latch-based registers incurs extra conflicts
  - Condition 1: Input and output operands of the same OP that completes at a transparent p-step (e.g., a and b)
  - Condition 2: Input and output operands of two different OPs that complete at the same transparent p-step (e.g., a and c)
[Figure: register conflict graph over operands a, b, and c]
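The effect of the extra conflict edges can be illustrated with a small greedy graph-coloring sketch. The greedy largest-degree-first heuristic is a stand-in chosen for brevity, not necessarily the coloring method used in HLS-l, and the edge lists in the example are hypothetical.

```python
def color_registers(variables, conflicts):
    """Assign a register index to each variable so that no two
    conflicting variables share a register (greedy heuristic)."""
    adj = {v: set() for v in variables}
    for u, v in conflicts:
        adj[u].add(v)
        adj[v].add(u)
    assignment = {}
    # Color high-degree variables first; ties broken by name.
    for v in sorted(variables, key=lambda x: (-len(adj[x]), x)):
        used = {assignment[n] for n in adj[v] if n in assignment}
        reg = 0
        while reg in used:
            reg += 1
        assignment[v] = reg
    return assignment


variables = ["a", "b", "c"]
lifetime_conflicts = []                      # hypothetical: lifetimes do not overlap
latch_conflicts = [("a", "b"), ("a", "c")]   # Condition 1 and Condition 2 edges

without_extra = color_registers(variables, lifetime_conflicts)
with_extra = color_registers(variables, lifetime_conflicts + latch_conflicts)
```

In this hypothetical case all three operands could share one register if only lifetimes were considered, while the latch-specific edges force a into a different register from b and c.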
11Concurrent Read/Write Operation
- Concurrent read/write operation (CRWO)
  - Operation w/ one of its input operands being the same as its output operand
  - Handled during operation scheduling
[Figure: example CRWO that reads and writes operand a]
12Control Synthesis
- Generate control signals at both edges of clock
  - 1. Use a separate clock w/ twice the frequency of the data-path clock
    - Duty cycle of the data-path clock has to be fixed at 50% (i.e., Ttr is fixed at 0.5 Tclk)
    - Clock network power is roughly doubled
  - 2. Use dual-edge triggered flip-flops (DETFFs)
13Dual-Edge Triggered Flip-Flop
- A latch-mux implementation of D-type DETFF
[Figure: latch-mux DETFF schematic with inputs D and clk and output Q]
R.P. Llopis et al., "Low power, testable dual edge triggered flip-flops," ISLPED, 1996.
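The latch-mux structure can be illustrated with an idealized behavioral model: two level-sensitive latches with opposite transparency, and a multiplexer that always selects the latch that is currently opaque. This is a zero-delay sketch with hypothetical class and method names, not the transistor-level circuit from the cited paper.

```python
class LatchMuxDETFF:
    """Behavioral model of a latch-mux dual-edge triggered flip-flop."""

    def __init__(self):
        self.latch_hi = 0  # transparent while clk == 1, holds while clk == 0
        self.latch_lo = 0  # transparent while clk == 0, holds while clk == 1

    def step(self, clk: int, d: int) -> int:
        """Apply one (clk, D) sample and return Q."""
        if clk == 1:
            self.latch_hi = d
        else:
            self.latch_lo = d
        # The mux selects the opaque latch, which holds the value of D
        # seen just before the most recent clock edge.
        return self.latch_lo if clk == 1 else self.latch_hi


ff = LatchMuxDETFF()
samples = [(0, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
outputs = [ff.step(clk, d) for clk, d in samples]
```

Because each latch captures D on a different clock edge, Q updates on both the rising and the falling edge, which is what lets a controller built from DETFFs issue control signals once per p-step.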
14Control Synthesis Flow
- Commercial tools do not support synthesis w/ DETFFs
- Control synthesis flow
  - Initially, synthesize w/ single-edge triggered FFs (SETFFs)
  - Substitute DETFFs for SETFFs after the synthesis
  - Check the timing of the controller at both edges of the clock
  - Timing failure ⇒ increase the timing guardband and re-synthesize
15Optimize Duty Cycle
- Latency is affected by the selection of Ttr
- Either too small or too large Ttr increases
latency
16A Heuristic Approach
- Derive Ttr that minimizes the delay of each OP type k
  - rk < Tclk/2: rk ≤ Ttr ≤ Tclk - rk
  - rk > Tclk/2: rk ≤ Ttr or Ttr ≤ Tclk - rk
- Find the intersection of the Ttr ranges that minimize the delay of each OP type (favor OP types with higher cost)
  - Cost of OP type k: costk = wk × occurk
  - Perform initial scheduling to find the number of critical OPs for each OP type (occurk)
  - Weight of OP type k (wk): rk < Tclk/2 ⇒ wk = 2; rk > Tclk/2 ⇒ wk = 1
17Example of Ttr Selection
- Assume Tclk = 10
- Perform initial scheduling with Ttr = 5
  - OPs on the critical path: one for each OP type
- Ttr that minimizes the delay of each OP type
  - Addition (Di = 10)
    - rk = 0, no need to consider Ttr
  - Fast multiplication (Di = 13)
    - rk = 3 ⇒ 3 ≤ Ttr ≤ 7
    - costk = 2 × 1 = 2
  - Slow multiplication (Di = 17)
    - rk = 7 ⇒ Ttr ≥ 7 or Ttr ≤ 3
    - costk = 1 × 1 = 1
⇒ Ttr that minimizes the latency is either 3 or 7
18Example of Ttr Selection
- Try scheduling with both Ttr = 3 and Ttr = 7
- Select Ttr = 3 as it results in smaller latency
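The candidate-selection step of this example can be reproduced with a short sketch. It assumes integer candidate values of Ttr, interprets "favor OP types with higher cost" as maximizing the total cost of the OP types whose preferred range contains Ttr, and treats rk = Tclk/2 like the rk > Tclk/2 case; these choices and all names are assumptions for illustration. The final pick between candidates still requires trial scheduling, which is not modeled here.

```python
def in_preferred_range(r: int, t_clk: int, t_tr: int) -> bool:
    """True if t_tr lies in the delay-minimizing range for residual r."""
    if r == 0:
        return True                      # Ttr does not matter
    if 2 * r < t_clk:
        return r <= t_tr <= t_clk - r
    return t_tr >= r or t_tr <= t_clk - r


def candidate_ttr(op_types: dict, t_clk: int) -> list:
    """op_types: {name: (delay, occur)} where occur is the number of
    critical OPs of that type. Returns the integer Ttr values that
    satisfy the largest total cost."""
    best, best_score = [], -1
    for t_tr in range(1, t_clk):
        score = 0
        for delay, occur in op_types.values():
            r = delay % t_clk
            if r == 0:
                continue
            weight = 2 if 2 * r < t_clk else 1
            if in_preferred_range(r, t_clk, t_tr):
                score += weight * occur
        if score > best_score:
            best, best_score = [t_tr], score
        elif score == best_score:
            best.append(t_tr)
    return best


candidates = candidate_ttr(
    {"add": (10, 1), "fast_mul": (13, 1), "slow_mul": (17, 1)},
    t_clk=10,
)
```

For the three OP types above, only Ttr = 3 and Ttr = 7 fall inside both multiplication ranges, so those are the two values passed on to trial scheduling.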
19Overall Design Flow
[Figure: design flow. Behavior description → VHDL analysis and DFG generation → DFG → HLS-l (using FU IPs) → RTL → logic synthesis → gate-level netlists of data-path and controller → substitute DETFFs for SETFFs → check timing of controller → on success, physical design; on failure, increase timing guardband and repeat logic synthesis]
20Experimental Setting
- Resource-constrained list scheduling
- 10 behavioral benchmark designs (23 resource constraints)
- Tclk = 8.2 ns, Ttr = 2.5 ns
  - Di of addition/subtraction = 8.2 ns ⇒ 1 c-step (2 p-steps)
  - Di of multiplication = 10.7 ns ⇒ 2 c-steps (3 p-steps)
- Logic synthesis with DC @ 1.2 V, 65-nm industrial standard library
- Use DesignWare FUs (32-bit)
21Latency Comparison
22Area Comparison
- Resource constraint of 1 ALU, 1
- Average area reduction of 13% (9.5% for all benchmarks)
- Mainly due to smaller area of latch-registers (24.6% less)
23Conclusion
- Proposed a complete framework of high-level synthesis for latch-based circuits
  - Scheduling based on p-steps
  - Register allocation w/ extra conflict edges
  - Control synthesis using DETFFs
  - A method to optimize duty cycle
- Results (compared with conventional HLS)
  - Latency is reduced by 3.8 c-steps (16.6%)
  - Area is reduced by 9.5%
24Q&A
Thank you for your attention
Design Technology Lab., KAIST Seungwhun Paik
(swpaik_at_dtlab.kaist.ac.kr)