4 Synthesis for Lower Power

Provided by: NTU
1
4 Synthesis for Lower Power
2
Improvements at various levels of design
abstraction
3
  • Two algorithm-level techniques reduce power
    dissipation in digital filters. Both techniques
    try to reduce computation to achieve low power.
  • The first technique uses a differential coefficient
    representation to reduce the dynamic range of
    computation (communication vs. computation).
  • The second optimizes the number of 1s in the
    coefficient representation to reduce the number
    of additions.

4
  • Other techniques use multiplierless
    implementations for low power and high
    performance.
  • Since digital signal processing techniques are
    very well represented mathematically, algorithm-
    level techniques can be easily applied.

5
Linear time-invariant Finite Impulse Response
  • $Y_j = \sum_{n=0}^{N-1} C_n X_{j-n}$

6
  • The algorithms for the differential coefficients
    method (DCM) use various orders of differences
    between the coefficients in conjunction with
    stored precomputed results rather than the
    coefficients themselves to compute the canonical
    form convolution.
  • These algorithms result in fewer computations per
    convolution than direct-form
    computation.
  • However, they require more storage and storage
    accesses and hence more energy for storage
    operations.

7
4.1.1.2 Algorithm Using First-Order Differences
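  • $Y_j = C_0 X_j + C_1 X_{j-1} + C_2 X_{j-2} + \cdots + C_{N-1} X_{j-N+1}$
    (4.2)
  • $Y_{j+1} = C_0 X_{j+1} + C_1 X_j + C_2 X_{j-1} + \cdots + C_{N-1} X_{j-N+2}$
    (4.3)
  • $Y_{j+2} = C_0 X_{j+2} + C_1 X_{j+1} + C_2 X_j + \cdots + C_{N-1} X_{j-N+3}$
    (4.4)
  • $Y_{j+1} = C_0 X_{j+1} + Y_j + (C_1 - C_0) X_j + (C_2 - C_1) X_{j-1} + \cdots + (C_{N-1} - C_{N-2}) X_{j-N+2} - C_{N-1} X_{j-N+1}$
    This form is better if $(C_k - C_{k-1})$ is zero.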

8
  • $C_k X_{j-k+1} = C_{k-1} X_{j-k+1} + (C_k - C_{k-1}) X_{j-k+1}$
    for $k = 1, \ldots, N-1$ (4.5)
  • Since the $C_{k-1} X_{j-k+1}$ terms in identity (4.5)
    above have already been computed for the previous
    output $Y_j$, one only needs to compute the
    $(C_k - C_{k-1}) X_{j-k+1}$ terms and add them to the
    already computed $C_{k-1} X_{j-k+1}$ terms.
  • The first product term in the sum for $Y_{j+1}$, which
    is $C_0 X_{j+1}$, has to be computed without recourse to
    this scheme.

9
  • $C_k = C_{k-1} + \delta^{1}_{k-1/k}$ for $k = 1, \ldots, N-1$ (4.6)
  • Here $\delta^{1}_{k-1/k}$ is termed the first-order
    difference between coefficients $C_k$ and $C_{k-1}$.
  • Product terms, except the first, for computing
    $Y_{j+1}$:
  • $P^{t=j+1}_k = C_k X_{j-k+1} = C_{k-1} X_{j-k+1} + \delta^{1}_{k-1/k} X_{j-k+1}$
    for $k = 1, \ldots, N-1$ (4.7)
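
The first-order scheme of (4.5)-(4.7) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the presenter's implementation: the function names `fir_direct` and `fir_first_order_dcm`, the zero input history before the first sample, and the list `prev` used as storage for the previous product terms are all hypothetical choices.

```python
def fir_direct(c, x):
    """Direct-form FIR: y[j] = sum_n c[n] * x[j-n], with x[i] = 0 for i < 0."""
    n = len(c)
    return [sum(c[k] * x[j - k] for k in range(n) if j - k >= 0)
            for j in range(len(x))]


def fir_first_order_dcm(c, x):
    """FIR using first-order coefficient differences and stored products."""
    n = len(c)
    # delta[k-1] = C_k - C_{k-1}, the first-order differences of (4.6)
    delta = [c[k] - c[k - 1] for k in range(1, n)]
    # prev[k] holds the product C_k * X_{j-1-k} stored from the previous output
    prev = [0] * n
    y = []
    for j in range(len(x)):
        cur = [0] * n
        cur[0] = c[0] * x[j]            # first term: computed directly
        for k in range(1, n):
            xk = x[j - k] if j - k >= 0 else 0
            # identity (4.5): C_k X = C_{k-1} X + (C_k - C_{k-1}) X
            cur[k] = prev[k - 1] + delta[k - 1] * xk
        y.append(sum(cur))
        prev = cur                      # extra storage traded for smaller multiplies
    return y


coeffs = [3, 4, 4, 5]                   # illustrative coefficients
samples = [1, 2, -1, 0, 7, 3]           # illustrative input
outputs = fir_first_order_dcm(coeffs, samples)
```

Each multiplication now involves a difference such as `4 - 3` or `4 - 4` rather than a full coefficient, at the cost of keeping `prev` between outputs, which mirrors the storage trade-off noted on the earlier slide.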

11
$Y = X^2 + 5X + 3$
  • X:    0   1   2   3   4
  • Y:    3   9   17  27  39
  • ΔY:   6   8   10  12
  • Δ²Y:  2   2   2
  • Δ³Y:  0   0
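
The difference table above can be reproduced with a few lines of Python. This is only a sketch; the helper name `difference_table` is an assumption made for illustration.

```python
def difference_table(values):
    """Return successive rows of forward differences, starting with the values."""
    rows = [list(values)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows


ys = [x * x + 5 * x + 3 for x in range(5)]   # Y = X^2 + 5X + 3 for X = 0..4
table = difference_table(ys)
for order, row in enumerate(table):
    print(order, row)
```

For a quadratic, the second-order differences are constant and the third-order differences vanish, which is why higher-order differences can shrink the operands that must be multiplied.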

12
4.1.4 Architecture-Driven Voltage Scaling
  • One simple way to maintain throughput while
    reducing the supply voltage is to utilize
    parallel or pipelined architecture.

13
(A + B) > C ?
15
Parallel implementation
  • $P_{par} = (2.15C)(0.58V)^2(0.5f) \approx 0.36\,CV^2 f$

16
1. A + B, 2. (A + B) > C ?
17
Pipelined implementation
  • With the additional pipeline latch (Figure 4.12),
    the critical path becomes max[T_adder,
    T_comparator], allowing the adder and the
    comparator to operate at a slower speed.
  • $P_{pipe} = (1.15C)(0.58V)^2(f) \approx 0.39\,CV^2 f$
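
The arithmetic behind the parallel and pipelined power figures can be checked with a short sketch. The function name `relative_power` is hypothetical; the capacitance, voltage, and frequency factors are the ones quoted on the slides, and the result is expressed relative to the reference design's C·V²·f.

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Dynamic power relative to the reference C * V^2 * f."""
    return cap_factor * volt_factor ** 2 * freq_factor


# Parallel: 2.15x capacitance, 0.58x supply voltage, half the clock frequency
p_par = relative_power(2.15, 0.58, 0.5)
# Pipelined: 1.15x capacitance, 0.58x supply voltage, same clock frequency
p_pipe = relative_power(1.15, 0.58, 1.0)

print(round(p_par, 2), round(p_pipe, 2))
```

Because the voltage enters quadratically, the supply reduction outweighs the extra capacitance in both architectures.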

18
4.1.5 Power Optimization Using Operation Reduction
  • The simplest way to reduce the weighted
    capacitance being switched is to reduce the
    number of operations performed in the data flow
    graph.

19
Common subexpression
21
4.1.7 Precomputation-Based Optimization for Low Power
  • Alidina et al. presented an optimization
    technique that precomputes the output logic
    values of the circuit one clock cycle before they
    are required.
  • The precomputed logic values are used in the
    following clock cycle to reduce the switching
    activity at the internal nodes of a circuit.
  • f1 = 1 ⇒ Z = 1; f2 = 1 ⇒ Z = 0

24
  • It is possible to precompute output values that
    are not required in the succeeding clock cycle
    but are required two or more cycles later. Here the
    function Z is computed in two clock cycles, and
    two-cycle precomputation can reduce
    the switching activity by close to 12.5% (P(f1 +
    f2) = 0.125).
  • f1 A(n-1)B(n-1)C(n-1)D(n-1)
  • f2 A(n-1)B(n-1)C(n-1)D(n-1),
  • where f1 and f2 satisfy the constraints.

26
  • A precomputation scheme can be used for all logic
    circuits using Shannon's expansion.
  • A logic function Z with input set X = {x1, ..., xn}
    can be expressed as follows using Shannon's
    expansion:
  • $Z = x_j Z_{x_j} + \bar{x}_j Z_{\bar{x}_j}$, where $Z_{x_j}$ and $Z_{\bar{x}_j}$ are the
    cofactors of Z with respect to $x_j$.
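
Shannon's expansion can be checked exhaustively on a small function. The sketch below is illustrative only: `cofactor`, `shannon`, and the example function `z` are hypothetical names, and Boolean values are modeled as the integers 0 and 1.

```python
from itertools import product


def cofactor(f, j, value):
    """Return f with input j fixed to `value` (cofactor w.r.t. x_j or its complement)."""
    def g(*xs):
        xs = list(xs)
        xs[j] = value
        return f(*xs)
    return g


def shannon(f, j):
    """Rebuild f as x_j * f_{x_j} + x_j' * f_{x_j'}."""
    f1, f0 = cofactor(f, j, 1), cofactor(f, j, 0)
    return lambda *xs: (xs[j] & f1(*xs)) | ((1 - xs[j]) & f0(*xs))


def z(a, b, c):
    """Hypothetical example function: Z = a b + a' c."""
    return (a & b) | ((1 - a) & c)


# The expansion must agree with Z on every input vector, for any splitting variable.
equivalent = all(
    z(*v) == shannon(z, j)(*v)
    for j in range(3)
    for v in product((0, 1), repeat=3)
)
```

In a precomputation architecture the two cofactors become separate logic blocks, and the splitting variable decides which block is active in a given cycle.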

27
  • One of the disadvantages of the architecture is
    that the number of registers for the inputs is
    duplicated.
  • The selection of the splitting variable for
    Shannon's expansion is important for minimizing
    area and power. One heuristic for selecting the
    splitting variable is to use the
    most binate variable present in function Z.
  • A variable is defined as the most binate in a
    function Z based on the number of times it
    appears in complemented and uncomplemented form.
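
One way to pick a most binate variable from a sum-of-products cover is sketched below. The scoring rule is an assumption: the slides only say the choice is based on how often a variable appears in complemented and uncomplemented form, so this sketch ranks binate variables by total occurrences and breaks ties in favor of the more balanced variable. The cube encoding (a dict mapping a variable name to 1 for uncomplemented, 0 for complemented) and the function name are also hypothetical.

```python
from collections import Counter


def most_binate_variable(cover):
    """Return the most binate variable of a cover, or None if the cover is unate."""
    pos, neg = Counter(), Counter()
    for cube in cover:
        for var, polarity in cube.items():
            (pos if polarity else neg)[var] += 1
    # A variable is binate if it appears in both polarities.
    binate = [v for v in pos if neg[v] > 0]
    if not binate:
        return None
    return max(sorted(binate),
               key=lambda v: (pos[v] + neg[v], -abs(pos[v] - neg[v])))


# Illustrative cover: Z = a b + a' c + a' b' d + c d'
cover = [
    {"a": 1, "b": 1},
    {"a": 0, "c": 1},
    {"a": 0, "b": 0, "d": 1},
    {"c": 1, "d": 0},
]
split_var = most_binate_variable(cover)
```

Here `a` appears three times across both polarities, more than `b` or `d`, while `c` is unate and is therefore never a candidate.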

29
4.2 Logic Level Optimization for Low Power
  • The synthesis process consists of two parts:
  • state assignment, which determines the
    combinational logic function, and
  • multilevel optimization of the combinational
    logic, which tries to minimize area while also
    reducing the sum, over all circuit nodes, of the
    product of the switching activity at a node and
    the capacitance at that node.

30
  • The state assignment scheme considers the
    likelihood of state transitions, that is, the
    probability of each state transition when the
    primary input signal probabilities are given.
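
As a rough illustration of how state transition probabilities follow from input signal probabilities, the sketch below handles a machine with a single binary primary input. Everything here is an assumption made for the example: the function name, the two-state machine, the input probability, and the use of simple power iteration to approximate steady-state state probabilities (which may not converge for periodic machines).

```python
def transition_probabilities(fsm, p_one, iterations=200):
    """fsm[state][input_bit] -> next state; p_one = P(primary input = 1)."""
    states = list(fsm)
    prob = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):          # power iteration toward steady state
        nxt = {s: 0.0 for s in states}
        for s in states:
            nxt[fsm[s][0]] += prob[s] * (1 - p_one)
            nxt[fsm[s][1]] += prob[s] * p_one
        prob = nxt
    edges = {}
    for s in states:
        for bit, p_in in ((0, 1 - p_one), (1, p_one)):
            key = (s, fsm[s][bit])
            # P(transition) = P(being in s) * P(input that triggers it)
            edges[key] = edges.get(key, 0.0) + prob[s] * p_in
    return edges


# Hypothetical two-state machine: the next state simply follows the input bit.
fsm = {"S0": {0: "S0", 1: "S1"}, "S1": {0: "S0", 1: "S1"}}
edges = transition_probabilities(fsm, p_one=0.25)
```

A state assignment procedure could then give state pairs with high transition probability codes that differ in few bits, so that frequent transitions toggle few state lines.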

33
Levels of Design Abstraction
Behavioral spec.:
  function B (a, b, c, d : in integer) return integer is
    variable t1, t2, t3 : integer := 0;
  begin
    t1 := a + b;
    t2 := c + d;
    t3 := t1 - t2;
    if (t3 > 25) then
      return (t3 + 1);
    else
      return (t3);
    end if;
  end B;
Block diagram: block B with inputs a, b, c, d, clk and outputs out1, out2
RTL design: Add1, Add2, Sub1, Cmp1, Inc1, and a controller (signals In1, In2, en1)
Gate-level design
34
High-level synthesis tasks
Allocation: decide the numbers and types of modules
available
Clock selection: decide the clock cycle time (clock
period) for the design
Memory binding: decide the memory module that
stores each array
Array:   a, d | b  | c  | e
Memory:  M1   | M2 | M3 | M4
35
  • rcadd1: ripple-carry adder 1
  • clsub1: carry-lookahead subtractor 1

36
High-level synthesis tasks (Contd.)
Functional unit binding: decide the FU
instance that performs each operation
Scheduling: decide the cycle-by-cycle behavior
of the design, expressed as a state transition graph (STG)
Register binding: decide the register that
stores each variable
Variable:  a  | b  | c  | d  | t1 | t2 | t3 | out1
Register:  R1 | R2 | R3 | R4 | R1 | R2 | R1 | R2
37
Input Domains
Data-dominated
Control-flow intensive (CFI)
6th-order elliptic wave filter (data-dominated):
  done := 0
  wait until go = 1
  x0 := input
  x1 := s1
  x2 := s2
  x3 := s3
  x4 := s4
  x5 := s5
  x6 := s6
  v1 := x4 x2 - (x0 - x1)c(7)
  v2 := x5 c(5) x6 - x4
  s1 := v1 c(9) v2 c(9) x1
  s2 := x2 x5c(4) - x(5) c(2)
  s3 := x4 c(8) x3
  s4 := (x1 - x3 - x5) c(6) x4
  s5 := v1 c(7) v2 c(10) x5
  s6 := x1 c(3) - ( x5 c(1) x6)
  output := s5
  done := 1
  wait until go = 0
Circle generator (control-flow intensive):
  while (X < Y) loop
    wait until (busy = 0);
    (XoI, YoI) := (X0, Y0);
    (XoR, YoR) := (X, Y);
    Draw := 1;
    wait until (busy = 1);
    Draw := 0;
    if (D < 0) then
      D := D + 4*X + 6;
    else
      D := D + 4*(X - Y) + 10;
      Y := Y - 1;
    end if;
    X := X + 1;
  end loop;
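
The circle generator's control flow can be exercised in software with the sketch below. The decision-variable updates follow the loop on the slide; the initial values (`x = 0`, `y = radius`, `d = 3 - 2 * radius`) are an assumption taken from the standard Bresenham circle algorithm, since the slide does not show them, and the `busy`/`Draw` handshake is replaced by appending to a list.

```python
def circle_octant(radius):
    """Points of one octant, following the loop structure of the circle generator."""
    x, y = 0, radius
    d = 3 - 2 * radius              # assumed initial value (standard Bresenham choice)
    points = []
    while x < y:
        points.append((x, y))       # stands in for the Draw handshake
        if d < 0:
            d = d + 4 * x + 6
        else:
            d = d + 4 * (x - y) + 10
            y = y - 1
        x = x + 1
    return points


pts = circle_octant(10)
```

The data-dependent branch on `d` is what makes this behavior control-flow intensive: the number of cycles spent in each branch depends on the input radius rather than on a fixed operation count.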
38
Control Data Flow Graph (CDFG)
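  • Data dependencies: red arcs
  • Control dependencies: blue arcs
  • +: then branch
  • -: else branch
  • Loops end with endloop operations

Example behavior:
  while (i < 10) {
    a = a + i;
    if (a < 7) a = a - 3;
    i = i + 1;
  }
CDFG nodes: constants 10, 1, 7, 3; initial values i(0), a(0);
comparisons <1, <2; additions +1, +2; subtraction -1; Sl1; output Out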
39
Synthesis System Overview
Behavioral Description
Memory Modules
Memory Optimization
40
Synthesis Algorithm
  • Variable depth search strategy
  • Sequences of moves are applied to an initial
    solution
  • Sequence that produces the highest cumulative
    gain is chosen
  • Examine trade-off between voltage scaling and
    switched capacitance
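
A generic variable-depth search pass can be sketched as follows. This is not the synthesis tool's algorithm: the cost function, the move set, the `visited` set that prevents undoing a move within a pass, and all names are assumptions chosen to show the key idea, namely that a sequence of moves is applied even when individual moves make the cost worse, and the prefix with the highest cumulative gain is kept.

```python
def variable_depth_pass(start, cost, moves, depth):
    """Apply up to `depth` moves; return the prefix with the best cumulative gain."""
    best, best_gain = start, 0
    current, gain = start, 0
    visited = {start}                       # solutions must be hashable
    for _ in range(depth):
        candidates = [s for s in (m(current) for m in moves)
                      if s is not None and s not in visited]
        if not candidates:
            break
        nxt = min(candidates, key=cost)     # best available move, even if gain < 0
        gain += cost(current) - cost(nxt)
        current = nxt
        visited.add(current)
        if gain > best_gain:
            best, best_gain = current, gain
    return best, best_gain


def variable_depth_search(solution, cost, moves, depth=5):
    """Repeat passes until no sequence yields a positive cumulative gain."""
    while True:
        candidate, gain = variable_depth_pass(solution, cost, moves, depth)
        if gain <= 0:
            return solution
        solution = candidate


# Toy problem: a solution is an index into COSTS; moves step left or right.
COSTS = [9, 4, 6, 3, 1, 2]


def cost(i):
    return COSTS[i]


def step_left(i):
    return i - 1 if i > 0 else None


def step_right(i):
    return i + 1 if i < len(COSTS) - 1 else None


result = variable_depth_search(1, cost, [step_left, step_right], depth=5)
```

Starting at index 1, a plain greedy search would stop immediately because both neighbors are worse; the variable-depth pass accepts the uphill step to index 2 and ends at index 4, the lowest-cost solution. In a power-driven setting the cost would be an estimated power figure, and the moves would be RTL transformations.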

41
IMPACT Algorithm
  • Map CDFG to Parallel RTL
  • Schedule
  • Obtain Initial Power Estimate

more moves
no improvement
42
Scheduling Problem Statement
  • Given
  • Behavioral description
  • Module selection information
  • Clock cycle constraints
  • Resource constraints
  • Derive a schedule with minimum expected length
  • Control-flow intensive ⇒ different threads of
    execution
  • Expected length: average length over a large
    number of user-specified input traces

43
Wavesched Summary of Features
  • Just enough loop unrolling, pipelining
  • Parallelizing independent loops
  • Across nested loops, conditionals.
  • Flexible: trades off STG size against performance
    of the schedule

44
Iterative Improvement Moves
  • Multiplexers consume >40% of power in CFI circuits
  • Restructure tree
  • Reduce switching capacitance in high activity
    paths

Figure: two MUX-tree structures, (a) and (b), over inputs a, b, c, d,
with input annotations 0.6/0.7, 0.1/0.2, 0.2/0.05, 0.1/0.05
Switch-level simulations show that (b) consumes
10% less power than (a)
Use a Huffman-type algorithm to construct MUX
trees
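
A Huffman-type construction of a tree of 2:1 multiplexers might look like the sketch below. The weights are assumptions: they stand in for whatever per-input activity or selection probability the tool uses, and the values assigned to `a`, `b`, `c`, `d` are illustrative rather than read from the figure. The function names are hypothetical.

```python
import heapq
from itertools import count


def huffman_mux_tree(weights):
    """Build a binary tree of 2:1 MUXes; `weights` maps input name -> weight."""
    tie = count()                           # tie-breaker so names are never compared
    heap = [(w, next(tie), name) for name, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees share a MUX
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))
    return heap[0][2]


def depths(tree, depth=0):
    """Number of MUX levels each input passes through."""
    if isinstance(tree, str):
        return {tree: depth}
    out = {}
    for child in tree:
        out.update(depths(child, depth + 1))
    return out


weights = {"a": 0.6, "b": 0.1, "c": 0.2, "d": 0.1}   # illustrative values
tree = huffman_mux_tree(weights)
levels = depths(tree)
```

The heaviest input ends up closest to the output, so the most active path switches the fewest multiplexers, which is the restructuring goal stated on the slide.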
45
Power Model
  • Black Box Model
  • Linear weighted sum of input switching activity

S1
u
S2
v
46
Power Estimation Algorithm
RTL Circuit
Expensive
Simulate Extract Switching Activity Matrix
Partition STG into Loop Induced Regions
Annotate STG with state transition probabilities

CDFG
Input Traces
Cheap
Evaluate Power
EXIT
47
Common Case Computation
Behavioral Description
Memory Modules
Memory Optimization
Scheduling Wavesched
48
What? Why?
Circuit evaluates F(i1, i2, ...)
Frequency
Input sub-spaces
I the common case F simplified
common-case functionality
I
49
CCC
  • Circuit structure that can execute common-case
    efficiently
  • CCC-based design methodology
  • Algorithm to identify good common-cases
  • Methods to simplify common-case functionality
  • Power savings (12.4% - 59.8%)
  • Average: 43.7%
  • Performance improvement (17.9% - 76.6%)
  • Average: 41.5%

50
Example: GCD(x, y) ∈ N
Let z = (x > 4y)
Input space: {z, z'}
STG: in S0, if (x = y) then GCD = x; otherwise c = (x > y).
If c = 1, S1 performs x = x - y; if c = 0, S2 performs y = y - x.
51
Example GCD circuit
Figure: original circuit (inputs x, y; comparator >; controller;
control signals) next to the CCC-optimized circuit, which adds
common-case circuitry (y << 2, x - 4y, > 0) and its own controller
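
The effect of the common case on the GCD example can be approximated in software. The sketch below is an interpretation, not the circuit itself: it assumes, based on the `y << 2` and `x - 4y` labels, that when x > 4y the common-case circuitry performs x = x - 4y in one step, and it uses the loop iteration count as a stand-in for clock cycles. Both function names are hypothetical, and inputs are assumed to be positive integers.

```python
def gcd_original(x, y):
    """Subtractive GCD; returns (gcd, number of loop iterations)."""
    steps = 0
    while x != y:
        if x > y:
            x = x - y
        else:
            y = y - x
        steps += 1
    return x, steps


def gcd_common_case(x, y):
    """Same result, but x > 4y is handled in a single step (x = x - 4y)."""
    steps = 0
    while x != y:
        if x > 4 * y:               # common-case detection: z = (x > 4y)
            x = x - (y << 2)        # four subtractions collapsed into one
        elif x > y:
            x = x - y
        else:
            y = y - x
        steps += 1
    return x, steps


slow = gcd_original(100, 3)
fast = gcd_common_case(100, 3)
```

Both versions return the same GCD; the second takes fewer iterations whenever one operand is much larger than the other, which is the kind of input sub-space the common-case approach targets.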
52
Original
CC detect. ckt.
CC exec. ckt.
CCC-opt. original
t1
t2
t3
t4
  • CCC detection and execution circuits composed
    with original
  • Redundancy removal on original circuit
  • Shut off components when not used

53
CCC-based design Issues
  • Finding the right common cases
  • How frequently does it occur
  • What fraction of total computation does it
    represent
  • How much can it be simplified
  • Simplifying Common-Case Detection
  • Simplifying Common-Case Execution

54
Finding the right common-cases
  • Coverage of a common case C Fraction of
    computation time spent in C

Figure: example STG with states S0-S10, a loop guarded by i < 500,
and a branch on m < 25 taken with p = 0.5 in each direction; the
states update x, y, and z
  • Coverage is a complex function of state
    transition graph and input trace

55
Finding the right common cases
  • GCD example re-visited

Parameterized study of CCC-based design: x > k·y
for various k
Figure x-axis: value of parameter k
56
CCC-based design Methodology
Inputs: 1. RTL design 2. Schedule (STG) 3.
Typical input traces
Identify promising state sequence patterns
Use best pattern to derive CCC optimized circuit
Extract corresponding behavior
Optimize common-case computation
Derive compact behavior to detect pattern
Evaluate energy/performance improvement
57
Identifying promising common-cases
Coverage: fraction of computation spent in the
candidate common case
Coverage = # occurrences × length of common case
Case | # occurrences | Length | Coverage
1.   | 100           | 1      | 100
2.   | 20            | 5      | 100
Case 2 is clearly better than case 1: a common case
of length 1 is hard to simplify!
Revised metric: Potential of common case =
Coverage × length
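
The two metrics can be compared on the slide's two cases with a small sketch. The function names and the dictionary of cases are hypothetical; the numbers are the ones in the table.

```python
def coverage(occurrences, length):
    """Coverage = number of occurrences x length of the common case."""
    return occurrences * length


def potential(occurrences, length):
    """Revised metric: coverage weighted again by the length of the common case."""
    return coverage(occurrences, length) * length


cases = {"case 1": (100, 1), "case 2": (20, 5)}   # (occurrences, length)
ranked = sorted(cases, key=lambda name: potential(*cases[name]), reverse=True)
```

Both cases have the same coverage, so coverage alone cannot separate them; the revised metric ranks the longer sequence first because it leaves more room for simplification.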
58
Detecting/Simplifying Common-case
Figure: state sequence S0, S1, S2, ..., Sn taken under conditions
c0, c1, ..., cn-1, and the detection logic combining c0, c1, c2, ..., cn-1
To simplify: find implications between the ci's
For example: cn-1 ⇒ c2 ⇒ c1
59
Common-case Optimization Example
a = (x > y), b = (x != y)
Figure: the common-case state sequence repeats x = x - y; its
accumulated conditions are (x > y), (x > 2y), (x > 3y), (x > 4y)
and (x != y), (x != 2y), (x != 3y), (x != 4y)
Common-case detection condition: (x > 4y)
60
Experimental Methodology
  • Two experiments
  • Energy optimization
  • Performance also improved as a side-effect
  • Power optimization with supply voltage scaling
  • Performance kept same as original
  • Performance = Expected # of clock cycles × clock
    period

Flow: RTL designs and traces → CCC-based design → CCC-optimized RTL →
Synopsys DC, NEC OpenCAD → area, delay, power
61
Experimental Results
Figure: bar chart of improvement
  • With supply voltage scaling (green bar)
  • Area overhead: 13.7% - 29% (avg. 23.5%)

62
Conclusions
  • Low Power System Implementation
  • Scheduling, allocation, assignment, and binding
    for data-dominated and CFI behaviors
  • Memory binding and Clock Selection
  • Key Features
  • Preserve parallelism in behavior
  • Explore memory architectures with considerations
    to allocation
  • Amortize time consuming synthesis steps over
    design iterations
  • Minimize clock period slack on dynamic critical
    paths

63
Conclusions
  • Low Power Methodology
  • Common case computation
  • Key features
  • Find and optimize common case for behavior
    without redesign of core
  • Exploit common state sequences to optimize
    performance and power