Title: 4 Synthesis for Lower Power
14 Synthesis for Lower Power
2 Improvements at various levels of design abstraction
3 - Two algorithm-level techniques for improving power dissipation, targeted at digital filters; both techniques try to reduce computation to achieve low power.
- The first technique uses a differential coefficient representation to reduce the dynamic range of computation (communication vs. computation).
- The second optimizes the number of 1s in the coefficient representation to reduce the number of additions.
4 - Other techniques use multiplier-less implementations for low power and high performance.
- Since digital signal processing techniques are well characterized mathematically, algorithm-level techniques can be applied easily.
5 Linear time-invariant Finite Impulse Response (FIR) filters
6 - The algorithms for the differential coefficients method (DCM) use various orders of differences between the coefficients, in conjunction with stored precomputed results, rather than the coefficients themselves, to compute the canonical-form convolution.
- These algorithms result in fewer computations per convolution than direct-form computation.
- However, they require more storage and more storage accesses, and hence more energy for storage operations.
7 4.1.1.2 Algorithm Using First-Order Differences
- Yj = C0Xj + C1Xj-1 + C2Xj-2 + ... + CN-1Xj-N+1 (4.2)
- Yj+1 = C0Xj+1 + C1Xj + C2Xj-1 + ... + CN-1Xj-N+2 (4.3)
- Yj+2 = C0Xj+2 + C1Xj+1 + C2Xj + ... + CN-1Xj-N+3 (4.4)
- Yj+1 = C0Xj+1 + Yj + (C1 - C0)Xj + (C2 - C1)Xj-1 + ... + (CN-1 - CN-2)Xj-N+2 - CN-1Xj-N+1; better if (Ck - Ck-1) is zero
8 - CkXj-k+1 = Ck-1Xj-k+1 + (Ck - Ck-1)Xj-k+1, for k = 1, ..., N-1 (4.5)
- Since the Ck-1Xj-k+1 terms in identity (4.5) above have already been computed for the previous output Yj, one only needs to compute the (Ck - Ck-1)Xj-k+1 terms and add them to the already computed Ck-1Xj-k+1 terms.
- The first product term in the sum for Yj+1, which is C0Xj+1, has to be computed without recourse to this scheme.
9 - Ck - Ck-1 = d1(k-1,k), for k = 1, ..., N-1 (4.6)
- where d1(k-1,k) is termed the first-order difference between coefficients Ck and Ck-1.
- Product terms, excepting the first, for computing Yj+1:
- Pk,j+1 = CkXj-k+1 = Ck-1Xj-k+1 + d1(k-1,k)Xj-k+1, for k = 1, ..., N-1 (4.7)
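A minimal Python sketch of the update in (4.7) (illustrative only; function and variable names are assumptions, not from the text): the products computed for the previous output are stored, and each new product is formed from the stored one plus a difference term, with only C0Xj+1 computed directly.

def dcm_fir(coeffs, x):
    # FIR convolution using first-order coefficient differences (DCM sketch).
    # Products computed for the previous output are stored and reused;
    # only C0*Xj+1 and the difference terms are computed per new output.
    N = len(coeffs)
    d1 = [coeffs[k] - coeffs[k - 1] for k in range(1, N)]      # first-order differences (4.6)
    y, prev_products = [], None
    for j in range(len(x)):
        window = [x[j - k] if j - k >= 0 else 0 for k in range(N)]
        if prev_products is None:
            products = [coeffs[k] * window[k] for k in range(N)]   # first output: direct form
        else:
            products = [coeffs[0] * window[0]]                     # C0*Xj+1: computed directly
            for k in range(1, N):
                # identity (4.7): Ck*Xj-k+1 = Ck-1*Xj-k+1 + d1(k-1,k)*Xj-k+1
                products.append(prev_products[k - 1] + d1[k - 1] * window[k])
        y.append(sum(products))
        prev_products = products
    return y

# dcm_fir([2, 5, 3], [1, 2, 3, 4]) == [2, 9, 19, 29], matching direct-form convolution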
11 Y = X^2 + 5X + 3
- X:    0   1   2   3   4
- Y:    3   9   17  27  39
- DY:   6   8   10  12
- D2Y:  2   2   2
- D3Y:  0   0
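A short sketch (illustrative, not from the text) that reproduces the table above: for Y = X^2 + 5X + 3 the second differences are constant and the third differences vanish, so successive Y values can be generated with additions only.

def difference_table(values, depth=3):
    # Successive forward-difference rows of a sequence.
    rows = [list(values)]
    for _ in range(depth):
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows

ys = [x * x + 5 * x + 3 for x in range(5)]   # Y = X^2 + 5X + 3 at X = 0..4
for row in difference_table(ys):
    print(row)
# [3, 9, 17, 27, 39]
# [6, 8, 10, 12]
# [2, 2, 2]
# [0, 0]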
12 4.1.4 Architecture-Driven Voltage Scaling
- One simple way to maintain throughput while reducing the supply voltage is to use a parallel or pipelined architecture.
13 A + B > C ?
15 Parallel implementation
- Ppar = (2.15C)(0.58V)²(0.5f) ≈ 0.36CV²f
16 1. A + B, 2. A + B > C ?
17 Pipelined implementation
- With the additional pipeline latch (Figure 4.12), the critical path becomes max(Tadder, Tcomparator), allowing the adder and the comparator to operate at a slower speed.
- Ppipe = (1.15C)(0.58V)²(f) ≈ 0.39CV²f
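A quick arithmetic check of the P = CV²f estimates above (a sketch; the capacitance, voltage, and frequency factors are the ones quoted on the slides, normalized to the reference adder/comparator design):

def switching_power(cap_factor, volt_factor, freq_factor):
    # Relative dynamic power P = C * V^2 * f, normalized to the reference design.
    return cap_factor * volt_factor ** 2 * freq_factor

p_ref  = switching_power(1.0, 1.0, 1.0)      # original datapath
p_par  = switching_power(2.15, 0.58, 0.5)    # parallel: ~2.15x capacitance, half the rate
p_pipe = switching_power(1.15, 0.58, 1.0)    # pipelined: ~1.15x capacitance, same rate
print(round(p_par, 2), round(p_pipe, 2))     # 0.36 and 0.39 of the reference power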
18 4.1.5 Power Optimization Using Operation Reduction
- The simplest way to reduce the weighted
capacitance being switched is to reduce the
number of operations performed in the data flow
graph.
19 Common subexpression
21 4.1.7 Precomputation-Based Optimization for Low Power
- Alidina et al. presented an optimization technique that precomputes the output logic values of the circuit one clock cycle before they are required.
- The precomputed logic values are used in the following clock cycle to reduce the switching activity at the internal nodes of the circuit.
- f1 = 1 ⇒ Z = 1; f2 = 1 ⇒ Z = 0
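The slide only gives the two conditions (f1 = 1 forces Z = 1, f2 = 1 forces Z = 0). As a hedged illustration of how they are used, the sketch below models a hypothetical n-bit comparator Z = (A > B) whose MSBs play the role of f1 and f2: when the MSBs differ, the result is known early and the low-order input registers need not be clocked (the comparator example is an assumption for illustration, in the spirit of the Alidina et al. scheme).

import random

def precomputed_compare(a, b, n=8):
    # Z = (A > B) with MSB-based precomputation.
    # f1 = A[n-1] AND NOT B[n-1] forces Z = 1; f2 = NOT A[n-1] AND B[n-1] forces Z = 0.
    # Returns (Z, loaded): 'loaded' is True when the low-order bits had to be clocked in.
    a_msb, b_msb = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    if a_msb and not b_msb:
        return True, False           # f1 = 1: answer known, rest of the circuit stays idle
    if b_msb and not a_msb:
        return False, False          # f2 = 1: answer known, rest of the circuit stays idle
    return a > b, True               # MSBs equal: full comparison needed

random.seed(0)
pairs = [(random.randrange(256), random.randrange(256)) for _ in range(10000)]
loads = sum(loaded for _, loaded in (precomputed_compare(a, b) for a, b in pairs))
print(loads / len(pairs))            # ~0.5: the low-order logic switches only about half the time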
24 - It is possible to precompute output values that are not required in the succeeding clock cycle but are required two or more cycles later. The function Z is computed in two clock cycles. In this case, two-cycle precomputation can reduce the switching activity by close to 12.5% (P(f1 + f2) = 0.125).
- f1 = A(n-1)B(n-1)C(n-1)D(n-1)
- f2 = A(n-1)'B(n-1)'C(n-1)'D(n-1)',
- where f1 and f2 satisfy the precomputation constraints.
26 - A precomputation scheme can be used for all logic circuits using Shannon's expansion.
- A logic function Z with input set X = {x1, ..., xn} can be expressed as follows using Shannon's expansion:
- Z = xj·Zxj + xj'·Zxj', where Zxj and Zxj' are the cofactors of Z with respect to xj.
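A small functional sketch of Shannon-expansion-based precomputation (illustrative only; it models the cofactor selection, not the actual register gating): the registered value of xj selects which cofactor is evaluated, so the other cofactor's logic and input registers can stay idle.

def shannon_cofactors(z, split):
    # Split Z about xj: Z = xj*Z_xj + xj'*Z_xj'.
    z_pos = lambda v: z(dict(v, **{split: 1}))   # cofactor Z_xj  (xj = 1)
    z_neg = lambda v: z(dict(v, **{split: 0}))   # cofactor Z_xj' (xj = 0)
    return z_pos, z_neg

def precomputed_eval(z, inputs, split):
    # xj is registered one cycle early; only the selected cofactor block is enabled.
    z_pos, z_neg = shannon_cofactors(z, split)
    return z_pos(inputs) if inputs[split] else z_neg(inputs)

# Example: Z = a*b + a'*c, split on 'a' (it appears in both polarities, hence very binate).
Z = lambda v: (v["a"] and v["b"]) or ((not v["a"]) and v["c"])
print(precomputed_eval(Z, {"a": 1, "b": 0, "c": 1}, "a"))    # -> False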
27 - One of the disadvantages of the architecture is that the input registers are duplicated.
- The selection of the splitting variable for Shannon's expansion is important for minimizing area and power. One heuristic for selecting the splitting variable is to use the most binate variable present in the function Z.
- The most binate variable of a function Z is the one that appears most often in complemented and uncomplemented form.
29 4.2 Logic Level Optimization for Low Power
- The synthesis process consists of two parts:
- state assignment, which determines the combinational logic function, and
- multilevel optimization of the combinational logic, which tries to minimize area while reducing the sum, over all circuit nodes, of the product of the switching activity and the capacitance at each node.
30 - The state assignment scheme considers the likelihood of state transitions, i.e., the probability of each state transition when the primary input signal probabilities are given.
33 Levels of Design Abstraction

function B (a, b, c, d : in integer) return integer is
  variable t1, t2, t3 : integer := 0;
begin
  t1 := a + b;
  t2 := c + d;
  t3 := t1 - t2;
  if (t3 > 25) then
    return (t3 + 1);
  else
    return (t3);
  end if;
end B;
[Figure: the same design at successive levels of abstraction: behavioral spec. (function B above), block diagram (inputs a, b, c, d, clk, en1, In1, In2; outputs out1, out2), RTL design (Add1, Add2, Sub1, Inc1, Cmp1, controller), and gate-level design]
34 High-level synthesis tasks
Allocation: decide the numbers/types of modules available
Clock selection: decide the clock cycle time (clock period) for the design
Memory binding: decide the memory module that stores each array
  Array:   a, d   b    c    e
  Memory:  M1     M2   M3   M4
35 - rcadd1: ripple-carry adder 1
- clsub1: carry-lookahead subtracter 1
36 High-level synthesis tasks (Contd.)
Functional unit binding: decide the FU instance that performs each operation
Scheduling: decide the cycle-by-cycle behavior of the design, captured as a state transition graph (STG)
Register binding: decide the register that stores each variable
  Variable:  a   b   c   d   t1  t2  t3  out1
  Register:  R1  R2  R3  R4  R1  R2  R1  R2
37 Input Domains
Data-dominated
Control-flow intensive (CFI)
done = 0; wait until go = 1;
x0 = input; x1 = s1; x2 = s2; x3 = s3; x4 = s4; x5 = s5; x6 = s6;
v1 = x4 + x2 - (x0 - x1)*c(7);
v2 = x5*c(5) + x6 - x4;
s1 = v1*c(9) + v2*c(9) + x1;
s2 = x2 + x5*c(4) - x5*c(2);
s3 = x4*c(8) + x3;
s4 = (x1 - x3 - x5)*c(6) + x4;
s5 = v1*c(7) + v2*c(10) + x5;
s6 = x1*c(3) - (x5*c(1) + x6);
output = s5; done = 1; wait until go = 0;
6th-order elliptic wave filter
while (X < Y) loop
  wait until (busy = 0);
  (XoI, YoI) = (X0, Y0); (XoR, YoR) = (X, Y);
  Draw = 1; wait until (busy = 1); Draw = 0;
  if (D < 0) then
    D = D + 4X + 6;
  else
    D = D + 4(X - Y) + 10; Y = Y - 1;
  end if;
  X = X + 1;
end loop
circle generator
38 Control Data Flow Graph (CDFG)
- Data dependencies: red arcs
- Control dependencies: blue arcs
- then branch: +, else branch: --
- loops end with endloop operations
while (i < 10) { a = a + i; if (a < 7) a = a - 3; i = i + 1; }
39 Synthesis System Overview
Behavioral Description
Memory Modules
Memory Optimization
40 Synthesis Algorithm
- Variable-depth search strategy
- Sequences of moves are applied to an initial solution
- The sequence that produces the highest cumulative gain is chosen
- Examine the trade-off between voltage scaling and switched capacitance
41 IMPACT Algorithm
- Map CDFG to Parallel RTL
- Schedule
- Obtain Initial Power Estimate
- [Flow: apply moves iteratively ("more moves") until no improvement]
42 Scheduling Problem Statement
- Given:
- Behavioral description
- Module selection information
- Clock cycle constraints
- Resource constraints
- Derive a schedule with minimum expected length
- Control-flow intensive → different threads of execution
- Expected length = average length over a large number of user-specified input traces
43 Wavesched: Summary of Features
- Just enough loop unrolling and pipelining
- Parallelizing independent loops
- Across nested loops and conditionals
- Flexible: trades off STG size with the performance of the schedule
44 Iterative Improvement Moves
- Multiplexers consume > 40% of power in CFI circuits
- Restructure the tree
- Reduce switching capacitance in high-activity paths
[Figure: two MUX trees, (a) and (b), over inputs a, b, c, d, annotated with signal activities 0.6/0.7, 0.1/0.2, 0.2/0.05, 0.1/0.05]
Switch-level simulations show that (b) consumes 10% less power than (a).
Use a Huffman-type algorithm to construct MUX
trees
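A sketch of the Huffman-type construction suggested above (the interface and the exact activity numbers are assumptions): signals are merged lowest-activity first, so the most active input ends up closest to the tree output and toggles the fewest multiplexer stages.

import heapq, itertools

def build_mux_tree(signals):
    # signals: list of (activity, name). Huffman-style merge: the two
    # least-active subtrees are combined first, pushing high-activity
    # signals toward the root of the MUX tree.
    counter = itertools.count()      # tie-breaker so equal activities never compare names/trees
    heap = [(act, next(counter), name) for act, name in signals]
    heapq.heapify(heap)
    while len(heap) > 1:
        act_a, _, a = heapq.heappop(heap)
        act_b, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (act_a + act_b, next(counter), ("mux", a, b)))
    return heap[0][2]

print(build_mux_tree([(0.6, "a"), (0.1, "b"), (0.2, "c"), (0.1, "d")]))
# ('mux', ('mux', 'c', ('mux', 'b', 'd')), 'a')  -- the most active input 'a' sits nearest the output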
45 Power Model
- Black-box model
- Power is estimated as a linear weighted sum of the input switching activities (e.g., activities S1 and S2 of inputs u and v)
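A tiny sketch of such a black-box macromodel (all coefficient and activity values are made-up illustrations): module power is a constant plus a weighted sum of the switching activities seen at its inputs, with the weights fitted offline against lower-level simulation.

def macromodel_power(weights, activities):
    # Black-box RTL power macromodel: constant term plus a linear
    # weighted sum of the input switching activities.
    return weights[0] + sum(w * s for w, s in zip(weights[1:], activities))

weights = [0.8, 2.4, 1.1]                        # hypothetical fitted coefficients (mW)
print(macromodel_power(weights, [0.3, 0.5]))     # 0.8 + 2.4*0.3 + 1.1*0.5 = 2.07 mW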
46 Power Estimation Algorithm
[Flow: RTL circuit and input traces → simulate and extract the switching activity matrix (expensive); CDFG → partition the STG into loop-induced regions → annotate the STG with state transition probabilities → evaluate power (cheap) → EXIT]
47 Common Case Computation
Behavioral Description
Memory Modules
Memory Optimization
Scheduling (Wavesched)
48 What? Why?
- The circuit evaluates F(i1, i2, ...)
- Input sub-spaces occur with very different frequencies
- The most frequent sub-space I is the common case; over I, F can be simplified to a common-case functionality
49 CCC
- Circuit structure that can execute the common case efficiently
- CCC-based design methodology
- Algorithm to identify good common cases
- Methods to simplify common-case functionality
- Power savings: 12.4% to 59.8% (average 43.7%)
- Performance improvement: 17.9% to 76.6% (average 41.5%)
50 Example: GCD(x, y), x, y ∈ N
- Let z = (x > 4y)
- c = (x > y); when x = y, GCD = x
- S0: compute c; c = 1 → S1, c = 0 → S2
- S1: x = x - y
- S2: y = y - x
- Input space partitioned into z and z'
51 Example: GCD circuit
[Figure: original circuit (subtractor, > comparator, controller, control signals) vs. CCC-optimized circuit, which adds common-case circuitry: y << 2, x - 4y, and a > 0 comparison, with its own controller]
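A behavioral sketch of the CCC idea for this example (Python, illustrative only): when the common case z = (x > 4y) holds, four x = x - y iterations are known to follow, so they are collapsed into a single x = x - 4y step, which is what the y << 2 / x - 4y common-case circuitry implements.

def gcd_ccc(x, y):
    # Subtractive GCD (x, y > 0) with a common-case fast path.
    while x != y:
        if x > (y << 2):      # common-case detection: x > 4y
            x -= (y << 2)     # common-case execution: x - 4y in one step
        elif x > y:
            x -= y            # original datapath (state S1)
        else:
            y -= x            # original datapath (state S2)
    return x

assert gcd_ccc(1071, 462) == 21 and gcd_ccc(40, 6) == 2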
52 [Figure: timeline t1-t4 comparing the original circuit with the common-case detection circuit, the common-case execution circuit, and the CCC-optimized original]
- CCC detection and execution circuits are composed with the original
- Redundancy removal on the original circuit
- Shut off components when not used
53 CCC-based design: Issues
- Finding the right common cases
- How frequently does it occur?
- What fraction of the total computation does it represent?
- How much can it be simplified?
- Simplifying common-case detection
- Simplifying common-case execution
54 Finding the right common-cases
- Coverage of a common case C: fraction of computation time spent in C
[Figure: example state transition graph with states S0-S10, branch conditions (i < 500, m < 25, branches taken with probability p = 0.5), and arithmetic assignments to x, y, and z in each state]
- Coverage is a complex function of the state transition graph and the input trace
55 Finding the right common cases
- Parameterized study of CCC-based design with common case x > k·y for various k
[Figure: improvement vs. value of parameter k]
56 CCC-based design: Methodology
Inputs: 1. RTL design, 2. Schedule (STG), 3. Typical input traces
Identify promising state sequence patterns
Use the best pattern to derive the CCC-optimized circuit
Extract the corresponding behavior
Optimize the common-case computation
Derive a compact behavior to detect the pattern
Evaluate energy/performance improvement
57 Identifying promising common-cases
- Coverage: fraction of computation spent in the candidate common-case
- Coverage = (number of occurrences) × (length of the common case)
  Case   Occurrences   Length   Coverage
  1      100           1        100
  2      20            5        100
- Case 2 is clearly better than case 1: a common case of length 1 is hard to simplify!
- Revised metric: Potential of a common case = Coverage × Length
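A small sketch of the metrics above (names assumed): coverage counts the cycles spent in the candidate common case, and the revised potential metric multiplies coverage by length again, so the longer case 2 wins even though both cover the same number of cycles.

def common_case_metrics(occurrences, length):
    # Coverage = cycles spent in the common case; potential favors longer cases.
    coverage = occurrences * length
    potential = coverage * length
    return coverage, potential

for occ, length in [(100, 1), (20, 5)]:
    print((occ, length), common_case_metrics(occ, length))
# prints (100, 1) (100, 100) and (20, 5) (100, 500): case 2 has 5x the potential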
58 Detecting/Simplifying Common-case
[Figure: common-case state sequence S0 → S1 → S2 → ... → Sn, taken under transition conditions c0, c1, ..., cn-1]
- To simplify: find implications between the ci
- For example, cn-1 ⇒ c2 ⇒ c1
59 Common-case Optimization Example
[Figure: the common case is four successive iterations of a = (x > y); b = (x ≠ y); x = x - y. Expressed in terms of the values at entry, the four guards become (x > y), (x > 2y), (x > 3y), (x > 4y) and (x ≠ y), (x ≠ 2y), (x ≠ 3y), (x ≠ 4y). Since (x > 4y) implies all the others, the common-case detection condition reduces to (x > 4y), and the common-case execution reduces to the four x = x - y subtractions]
60 Experimental Methodology
- Two experiments:
- Energy optimization (performance also improved as a side-effect)
- Power optimization with supply voltage scaling (performance kept the same as the original)
- Performance = expected number of clock cycles × clock period
[Flow: RTL designs and traces → CCC-based design → CCC-optimized RTL → Synopsys DC → NEC OpenCAD → area, delay, power]
61 Experimental Results
[Figure: bar chart of % improvement]
- With supply voltage scaling (green bars)
- Area overhead: 13.7% to 29% (avg. 23.5%)
62 Conclusions
- Low-power system implementation
- Scheduling, allocation, assignment, and binding for data-dominated and CFI behaviors
- Memory binding and clock selection
- Key features
- Preserve parallelism in the behavior
- Explore memory architectures in conjunction with allocation
- Amortize time-consuming synthesis steps over design iterations
- Minimize clock-period slack on dynamic critical paths
63 Conclusions
- Low-power methodology
- Common-case computation
- Key features
- Find and optimize the common case of a behavior without redesigning the core
- Exploit common state sequences to optimize performance and power