Title: 4 Synthesis for Lower Power
14 Synthesis for Lower Power
2 Improvements at various levels of design abstraction
3 - Two algorithm-level techniques for improving power dissipation, targeted at digital filters; both techniques try to reduce computation to achieve low power.
- The first technique uses a differential coefficient representation to reduce the dynamic range of computation (communication vs. computation).
- The second optimizes the number of 1s in the coefficient representation to reduce the number of additions.
4 - Other techniques use multiplier-less implementations for low power and high performance.
- Since digital signal processing techniques are well characterized mathematically, algorithm-level techniques can be applied easily.
5 Linear time-invariant Finite Impulse Response (FIR) filters
6 - The algorithms for the differential coefficients method (DCM) use various orders of differences between the coefficients, in conjunction with stored precomputed results, rather than the coefficients themselves, to compute the canonical-form convolution.
- These algorithms result in fewer computations per convolution than direct-form computation.
- However, they require more storage and more storage accesses, and hence more energy for storage operations.
7 4.1.1.2 Algorithm Using First-Order Differences
- Yj = C0Xj + C1Xj-1 + C2Xj-2 + ... + CN-1Xj-N+1 (4.2)
- Yj+1 = C0Xj+1 + C1Xj + C2Xj-1 + ... + CN-1Xj-N+2 (4.3)
- Yj+2 = C0Xj+2 + C1Xj+1 + C2Xj + ... + CN-1Xj-N+3 (4.4)
- Yj+1 = C0Xj+1 + Yj + (C1 - C0)Xj + (C2 - C1)Xj-1 + ... + (CN-1 - CN-2)Xj-N+2 - CN-1Xj-N+1; better if (Ck - Ck-1) is zero
8 - CkXj-k+1 = Ck-1Xj-k+1 + (Ck - Ck-1)Xj-k+1, for k = 1, ..., N-1 (4.5)
- Since the Ck-1Xj-k+1 terms in identity (4.5) above have already been computed for the previous output Yj, one only needs to compute the (Ck - Ck-1)Xj-k+1 terms and add them to the already computed Ck-1Xj-k+1 terms.
- The first product term in the sum for Yj+1, which is C0Xj+1, has to be computed without recourse to this scheme.
9 - Ck - Ck-1 = d1(k-1,k), for k = 1, ..., N-1 (4.6)
- where d1(k-1,k) is termed the first-order difference between coefficients Ck and Ck-1.
- Product terms, excepting the first, for computing Yj+1:
- Pk,j+1 = CkXj-k+1 = Ck-1Xj-k+1 + d1(k-1,k)Xj-k+1, for k = 1, ..., N-1 (4.7)
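A minimal Python sketch of the update in (4.7) (illustrative only; function and variable names are assumptions, not from the text): the products computed for the previous output are stored, and each new product is formed from the stored one plus a difference term, with only C0Xj+1 computed directly.

def dcm_fir(coeffs, x):
    # FIR convolution using first-order coefficient differences (DCM sketch).
    # Products computed for the previous output are stored and reused;
    # only C0*Xj+1 and the difference terms are computed per new output.
    N = len(coeffs)
    d1 = [coeffs[k] - coeffs[k - 1] for k in range(1, N)]      # first-order differences (4.6)
    y, prev_products = [], None
    for j in range(len(x)):
        window = [x[j - k] if j - k >= 0 else 0 for k in range(N)]
        if prev_products is None:
            products = [coeffs[k] * window[k] for k in range(N)]   # first output: direct form
        else:
            products = [coeffs[0] * window[0]]                     # C0*Xj+1: computed directly
            for k in range(1, N):
                # identity (4.7): Ck*Xj-k+1 = Ck-1*Xj-k+1 + d1(k-1,k)*Xj-k+1
                products.append(prev_products[k - 1] + d1[k - 1] * window[k])
        y.append(sum(products))
        prev_products = products
    return y

# dcm_fir([2, 5, 3], [1, 2, 3, 4]) == [2, 9, 19, 29], matching direct-form convolution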
11 Y = X^2 + 5X + 3
- X:    0   1   2   3   4
- Y:    3   9   17  27  39
- DY:   6   8   10  12
- D2Y:  2   2   2
- D3Y:  0   0
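A short sketch (illustrative, not from the text) that reproduces the table above: for Y = X^2 + 5X + 3 the second differences are constant and the third differences vanish, so successive Y values can be generated with additions only.

def difference_table(values, depth=3):
    # Successive forward-difference rows of a sequence.
    rows = [list(values)]
    for _ in range(depth):
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows

ys = [x * x + 5 * x + 3 for x in range(5)]   # Y = X^2 + 5X + 3 at X = 0..4
for row in difference_table(ys):
    print(row)
# [3, 9, 17, 27, 39]
# [6, 8, 10, 12]
# [2, 2, 2]
# [0, 0]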
12 4.1.4 Architecture-Driven Voltage Scaling
- One simple way to maintain throughput while reducing the supply voltage is to use a parallel or pipelined architecture.
13 A + B > C ?
15 Parallel implementation
- Ppar = (2.15C)(0.58V)²(0.5f) ≈ 0.36CV²f
16 1. A + B, 2. A + B > C ?
17 Pipelined implementation
- With the additional pipeline latch (Figure 4.12), the critical path becomes max(Tadder, Tcomparator), allowing the adder and the comparator to operate at a slower speed.
- Ppipe = (1.15C)(0.58V)²(f) ≈ 0.39CV²f
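A quick arithmetic check of the P = CV²f estimates above (a sketch; the capacitance, voltage, and frequency factors are the ones quoted on the slides, normalized to the reference adder/comparator design):

def switching_power(cap_factor, volt_factor, freq_factor):
    # Relative dynamic power P = C * V^2 * f, normalized to the reference design.
    return cap_factor * volt_factor ** 2 * freq_factor

p_ref  = switching_power(1.0, 1.0, 1.0)      # original datapath
p_par  = switching_power(2.15, 0.58, 0.5)    # parallel: ~2.15x capacitance, half the rate
p_pipe = switching_power(1.15, 0.58, 1.0)    # pipelined: ~1.15x capacitance, same rate
print(round(p_par, 2), round(p_pipe, 2))     # 0.36 and 0.39 of the reference power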
18 4.1.5 Power Optimization Using Operation Reduction
- The simplest way to reduce the weighted
capacitance being switched is to reduce the
number of operations performed in the data flow
graph.
19 Common subexpression
21 4.1.7 Precomputation-Based Optimization for Low Power
- Alidina et al. presented an optimization technique that precomputes the output logic values of the circuit one clock cycle before they are required.
- The precomputed logic values are used in the following clock cycle to reduce the switching activity at the internal nodes of the circuit.
- f1 = 1 ⇒ Z = 1; f2 = 1 ⇒ Z = 0
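The slide only gives the two conditions (f1 = 1 forces Z = 1, f2 = 1 forces Z = 0). As a hedged illustration of how they are used, the sketch below models a hypothetical n-bit comparator Z = (A > B) whose MSBs play the role of f1 and f2: when the MSBs differ, the result is known early and the low-order input registers need not be clocked (the comparator example is an assumption for illustration, in the spirit of the Alidina et al. scheme).

import random

def precomputed_compare(a, b, n=8):
    # Z = (A > B) with MSB-based precomputation.
    # f1 = A[n-1] AND NOT B[n-1] forces Z = 1; f2 = NOT A[n-1] AND B[n-1] forces Z = 0.
    # Returns (Z, loaded): 'loaded' is True when the low-order bits had to be clocked in.
    a_msb, b_msb = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    if a_msb and not b_msb:
        return True, False           # f1 = 1: answer known, rest of the circuit stays idle
    if b_msb and not a_msb:
        return False, False          # f2 = 1: answer known, rest of the circuit stays idle
    return a > b, True               # MSBs equal: full comparison needed

random.seed(0)
pairs = [(random.randrange(256), random.randrange(256)) for _ in range(10000)]
loads = sum(loaded for _, loaded in (precomputed_compare(a, b) for a, b in pairs))
print(loads / len(pairs))            # ~0.5: the low-order logic switches only about half the time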
24 - It is possible to precompute output values that are not required in the succeeding clock cycle but are required two or more cycles later. The function Z is computed in two clock cycles. In this case, two-cycle precomputation can reduce the switching activity by close to 12.5% (P(f1 + f2) = 0.125).
- f1 = A(n-1)B(n-1)C(n-1)D(n-1)
- f2 = A(n-1)'B(n-1)'C(n-1)'D(n-1)',
- where f1 and f2 satisfy the precomputation constraints.
26 - A precomputation scheme can be used for all logic circuits using Shannon's expansion.
- A logic function Z with input set X = {x1, ..., xn} can be expressed as follows using Shannon's expansion:
- Z = xj·Zxj + xj'·Zxj', where Zxj and Zxj' are the cofactors of Z with respect to xj.
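A small functional sketch of Shannon-expansion-based precomputation (illustrative only; it models the cofactor selection, not the actual register gating): the registered value of xj selects which cofactor is evaluated, so the other cofactor's logic and input registers can stay idle.

def shannon_cofactors(z, split):
    # Split Z about xj: Z = xj*Z_xj + xj'*Z_xj'.
    z_pos = lambda v: z(dict(v, **{split: 1}))   # cofactor Z_xj  (xj = 1)
    z_neg = lambda v: z(dict(v, **{split: 0}))   # cofactor Z_xj' (xj = 0)
    return z_pos, z_neg

def precomputed_eval(z, inputs, split):
    # xj is registered one cycle early; only the selected cofactor block is enabled.
    z_pos, z_neg = shannon_cofactors(z, split)
    return z_pos(inputs) if inputs[split] else z_neg(inputs)

# Example: Z = a*b + a'*c, split on 'a' (it appears in both polarities, hence very binate).
Z = lambda v: (v["a"] and v["b"]) or ((not v["a"]) and v["c"])
print(precomputed_eval(Z, {"a": 1, "b": 0, "c": 1}, "a"))    # -> False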
27 - One of the disadvantages of the architecture is that the input registers are duplicated.
- The selection of the splitting variable for Shannon's expansion is important for minimizing area and power. One heuristic for selecting the splitting variable is to use the most binate variable present in the function Z.
- The most binate variable of a function Z is the one that appears most often in complemented and uncomplemented form.
29 4.2 Logic Level Optimization for Low Power
- The synthesis process consists of two parts:
- state assignment, which determines the combinational logic function, and
- multilevel optimization of the combinational logic, which tries to minimize area while reducing the sum, over all circuit nodes, of the product of the switching activity and the capacitance at each node.
30 - The state assignment scheme considers the likelihood of state transitions, i.e., the probability of each state transition when the primary input signal probabilities are given.
33 Levels of Design Abstraction

function B (a, b, c, d : in integer) return integer is
  variable t1, t2, t3 : integer := 0;
begin
  t1 := a + b;
  t2 := c + d;
  t3 := t1 - t2;
  if (t3 > 25) then
    return (t3 + 1);
  else
    return (t3);
  end if;
end B;
[Figure: the same design at successive levels of abstraction: behavioral spec. (function B above), block diagram (inputs a, b, c, d, clk, en1, In1, In2; outputs out1, out2), RTL design (Add1, Add2, Sub1, Inc1, Cmp1, controller), and gate-level design]
34 High-level synthesis tasks
Allocation: decide the numbers/types of modules available
Clock selection: decide the clock cycle time (clock period) for the design
Memory binding: decide the memory module that stores each array
  Array:   a, d   b    c    e
  Memory:  M1     M2   M3   M4
35 - rcadd1: ripple-carry adder 1
- clsub1: carry-lookahead subtracter 1
36 High-level synthesis tasks (Contd.)
Functional unit binding: decide the FU instance that performs each operation
Scheduling: decide the cycle-by-cycle behavior of the design, captured as a state transition graph (STG)
Register binding: decide the register that stores each variable
  Variable:  a   b   c   d   t1  t2  t3  out1
  Register:  R1  R2  R3  R4  R1  R2  R1  R2
37 Input Domains
Data-dominated
Control-flow intensive (CFI)
done = 0; wait until go = 1;
x0 = input; x1 = s1; x2 = s2; x3 = s3; x4 = s4; x5 = s5; x6 = s6;
v1 = x4 + x2 - (x0 - x1)*c(7);
v2 = x5*c(5) + x6 - x4;
s1 = v1*c(9) + v2*c(9) + x1;
s2 = x2 + x5*c(4) - x5*c(2);
s3 = x4*c(8) + x3;
s4 = (x1 - x3 - x5)*c(6) + x4;
s5 = v1*c(7) + v2*c(10) + x5;
s6 = x1*c(3) - (x5*c(1) + x6);
output = s5; done = 1; wait until go = 0;
6th-order elliptic wave filter
while (X < Y) loop
  wait until (busy = 0);
  (XoI, YoI) = (X0, Y0); (XoR, YoR) = (X, Y);
  Draw = 1; wait until (busy = 1); Draw = 0;
  if (D < 0) then
    D = D + 4X + 6;
  else
    D = D + 4(X - Y) + 10; Y = Y - 1;
  end if;
  X = X + 1;
end loop
circle generator
38 Control Data Flow Graph (CDFG)
- Data dependencies: red arcs
- Control dependencies: blue arcs
- then branch: +, else branch: --
- loops end with endloop operations
while (i < 10) { a = a + i; if (a < 7) a = a - 3; i = i + 1; }
39 Synthesis System Overview
Behavioral Description
Memory Modules
Memory Optimization
40 Synthesis Algorithm
- Variable-depth search strategy
- Sequences of moves are applied to an initial solution
- The sequence that produces the highest cumulative gain is chosen
- Examine the trade-off between voltage scaling and switched capacitance
41 IMPACT Algorithm
- Map CDFG to Parallel RTL
- Schedule
- Obtain Initial Power Estimate
- [Flow: apply moves iteratively ("more moves") until no improvement]
42 Scheduling Problem Statement
- Given:
- Behavioral description
- Module selection information
- Clock cycle constraints
- Resource constraints
- Derive a schedule with minimum expected length
- Control-flow intensive → different threads of execution
- Expected length = average length over a large number of user-specified input traces
43 Wavesched: Summary of Features
- Just enough loop unrolling and pipelining
- Parallelizing independent loops
- Across nested loops and conditionals
- Flexible: trades off STG size with the performance of the schedule
44 Iterative Improvement Moves
- Multiplexers consume > 40% of power in CFI circuits
- Restructure the tree
- Reduce switching capacitance in high-activity paths
[Figure: two MUX trees, (a) and (b), over inputs a, b, c, d, annotated with signal activities 0.6/0.7, 0.1/0.2, 0.2/0.05, 0.1/0.05]
Switch-level simulations show that (b) consumes 10% less power than (a).
Use a Huffman-type algorithm to construct MUX
trees
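A sketch of the Huffman-type construction suggested above (the interface and the exact activity numbers are assumptions): signals are merged lowest-activity first, so the most active input ends up closest to the tree output and toggles the fewest multiplexer stages.

import heapq, itertools

def build_mux_tree(signals):
    # signals: list of (activity, name). Huffman-style merge: the two
    # least-active subtrees are combined first, pushing high-activity
    # signals toward the root of the MUX tree.
    counter = itertools.count()      # tie-breaker so equal activities never compare names/trees
    heap = [(act, next(counter), name) for act, name in signals]
    heapq.heapify(heap)
    while len(heap) > 1:
        act_a, _, a = heapq.heappop(heap)
        act_b, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (act_a + act_b, next(counter), ("mux", a, b)))
    return heap[0][2]

print(build_mux_tree([(0.6, "a"), (0.1, "b"), (0.2, "c"), (0.1, "d")]))
# ('mux', ('mux', 'c', ('mux', 'b', 'd')), 'a')  -- the most active input 'a' sits nearest the output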
45 Power Model
- Black-box model
- Power is estimated as a linear weighted sum of the input switching activities (e.g., activities S1 and S2 of inputs u and v)
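A tiny sketch of such a black-box macromodel (all coefficient and activity values are made-up illustrations): module power is a constant plus a weighted sum of the switching activities seen at its inputs, with the weights fitted offline against lower-level simulation.

def macromodel_power(weights, activities):
    # Black-box RTL power macromodel: constant term plus a linear
    # weighted sum of the input switching activities.
    return weights[0] + sum(w * s for w, s in zip(weights[1:], activities))

weights = [0.8, 2.4, 1.1]                        # hypothetical fitted coefficients (mW)
print(macromodel_power(weights, [0.3, 0.5]))     # 0.8 + 2.4*0.3 + 1.1*0.5 = 2.07 mW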
46 Power Estimation Algorithm
[Flow: RTL circuit and input traces → simulate and extract the switching activity matrix (expensive); CDFG → partition the STG into loop-induced regions → annotate the STG with state transition probabilities → evaluate power (cheap) → EXIT]
47 Common Case Computation
Behavioral Description
Memory Modules
Memory Optimization
Scheduling (Wavesched)
48 What? Why?
- The circuit evaluates F(i1, i2, ...)
- Input sub-spaces occur with very different frequencies
- The most frequent sub-space I is the common case; over I, F can be simplified to a common-case functionality
49 CCC
- Circuit structure that can execute the common case efficiently
- CCC-based design methodology
- Algorithm to identify good common cases
- Methods to simplify common-case functionality
- Power savings: 12.4% to 59.8% (average 43.7%)
- Performance improvement: 17.9% to 76.6% (average 41.5%)
50 Example: GCD(x, y), x, y ∈ N
- Let z = (x > 4y)
- c = (x > y); when x = y, GCD = x
- S0: compute c; c = 1 → S1, c = 0 → S2
- S1: x = x - y
- S2: y = y - x
- Input space partitioned into z and z'
51 Example: GCD circuit
[Figure: original circuit (subtractor, > comparator, controller, control signals) vs. CCC-optimized circuit, which adds common-case circuitry: y << 2, x - 4y, and a > 0 comparison, with its own controller]
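A behavioral sketch of the CCC idea for this example (Python, illustrative only): when the common case z = (x > 4y) holds, four x = x - y iterations are known to follow, so they are collapsed into a single x = x - 4y step, which is what the y << 2 / x - 4y common-case circuitry implements.

def gcd_ccc(x, y):
    # Subtractive GCD (x, y > 0) with a common-case fast path.
    while x != y:
        if x > (y << 2):      # common-case detection: x > 4y
            x -= (y << 2)     # common-case execution: x - 4y in one step
        elif x > y:
            x -= y            # original datapath (state S1)
        else:
            y -= x            # original datapath (state S2)
    return x

assert gcd_ccc(1071, 462) == 21 and gcd_ccc(40, 6) == 2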
52 [Figure: timeline t1-t4 comparing the original circuit with the common-case detection circuit, the common-case execution circuit, and the CCC-optimized original]
- CCC detection and execution circuits are composed with the original
- Redundancy removal on the original circuit
- Shut off components when not used
53 CCC-based design: Issues
- Finding the right common cases
- How frequently does it occur?
- What fraction of the total computation does it represent?
- How much can it be simplified?
- Simplifying common-case detection
- Simplifying common-case execution
54 Finding the right common-cases
- Coverage of a common case C: fraction of computation time spent in C
[Figure: example state transition graph with states S0-S10, branch conditions (i < 500, m < 25, branches taken with probability p = 0.5), and arithmetic assignments to x, y, and z in each state]
- Coverage is a complex function of the state transition graph and the input trace
55 Finding the right common cases
- Parameterized study of CCC-based design with common case x > k·y for various k
[Figure: improvement vs. value of parameter k]
56 CCC-based design: Methodology
Inputs: 1. RTL design, 2. Schedule (STG), 3. Typical input traces
Identify promising state sequence patterns
Use the best pattern to derive the CCC-optimized circuit
Extract the corresponding behavior
Optimize the common-case computation
Derive a compact behavior to detect the pattern
Evaluate energy/performance improvement
57 Identifying promising common-cases
- Coverage: fraction of computation spent in the candidate common-case
- Coverage = (number of occurrences) × (length of the common case)
  Case   Occurrences   Length   Coverage
  1      100           1        100
  2      20            5        100
- Case 2 is clearly better than case 1: a common case of length 1 is hard to simplify!
- Revised metric: Potential of a common case = Coverage × Length
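A small sketch of the metrics above (names assumed): coverage counts the cycles spent in the candidate common case, and the revised potential metric multiplies coverage by length again, so the longer case 2 wins even though both cover the same number of cycles.

def common_case_metrics(occurrences, length):
    # Coverage = cycles spent in the common case; potential favors longer cases.
    coverage = occurrences * length
    potential = coverage * length
    return coverage, potential

for occ, length in [(100, 1), (20, 5)]:
    print((occ, length), common_case_metrics(occ, length))
# prints (100, 1) (100, 100) and (20, 5) (100, 500): case 2 has 5x the potential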
58 Detecting/Simplifying Common-case
[Figure: common-case state sequence S0 → S1 → S2 → ... → Sn, taken under transition conditions c0, c1, ..., cn-1]
- To simplify: find implications between the ci
- For example, cn-1 ⇒ c2 ⇒ c1
59 Common-case Optimization Example
[Figure: the common case is four successive iterations of a = (x > y); b = (x ≠ y); x = x - y. Expressed in terms of the values at entry, the four guards become (x > y), (x > 2y), (x > 3y), (x > 4y) and (x ≠ y), (x ≠ 2y), (x ≠ 3y), (x ≠ 4y). Since (x > 4y) implies all the others, the common-case detection condition reduces to (x > 4y), and the common-case execution reduces to the four x = x - y subtractions]
60 Experimental Methodology
- Two experiments:
- Energy optimization (performance also improved as a side-effect)
- Power optimization with supply voltage scaling (performance kept the same as the original)
- Performance = expected number of clock cycles × clock period
[Flow: RTL designs and traces → CCC-based design → CCC-optimized RTL → Synopsys DC → NEC OpenCAD → area, delay, power]
61 Experimental Results
[Figure: bar chart of % improvement]
- With supply voltage scaling (green bars)
- Area overhead: 13.7% to 29% (avg. 23.5%)
62 Conclusions
- Low-power system implementation
- Scheduling, allocation, assignment, and binding for data-dominated and CFI behaviors
- Memory binding and clock selection
- Key features
- Preserve parallelism in the behavior
- Explore memory architectures in conjunction with allocation
- Amortize time-consuming synthesis steps over design iterations
- Minimize clock-period slack on dynamic critical paths
63 Conclusions
- Low-power methodology
- Common-case computation
- Key features
- Find and optimize the common case of a behavior without redesigning the core
- Exploit common state sequences to optimize performance and power