Title: A Fine-Grained Arithmetic Optimization Technique for High-Performance/Low-Power Data Path Synthesis
1 A Fine-Grained Arithmetic Optimization Technique for High-Performance/Low-Power Data Path Synthesis
- Junhyung Um, Taewhan Kim: Dept. of EECS, KAIST, Taejon, Korea
- C.L. Liu: Dept. of Computer Science, National Tsing Hua Univ., Hsinchu, Taiwan, R.O.C.
2 Motivation
3 Motivation (cont.)
[Figure: addend bits X0, X1, X2 placed in columns Col-1 to Col-5, shown in two arrangements]
4 Motivation (cont.)
[Figure: two reductions of the bit products of X0, X1, X2 over columns Col-1 to Col-5, each ending in a Final Adder. Labeled addend groups: (X2X2, X0X2, X0X1), (0, X0X2, X0X1), (X1X2, X1X1, X1X0), (X1X2, 0, X1X1), (0, X0X0), X2X2, X0X0]
5 Related Work
- 1. T. Kim, W. Jao, and S. Tjiang, Circuit Optimization using Carry-Save-Adder Cells, IEEE Transactions on CAD, 1998.
- 2. Hai Zhou and D.F. Wong, An Exact Gate Decomposition Algorithm for Low-Power Technology Mapping, Proc. of ICCAD, 1997.
- 3. Unni Narayanan and C.L. Liu, Low Power Logic Synthesis for XOR Based Circuits, Proc. of ICCAD, 1997.
6 Extending the Wallace Scheme
- Additions, Subtractions, Multiplications
1. Non-uniform arrival times -> Minimizing execution delay
2. Non-uniform switching activities -> Minimizing power consumption
7 FA-tree Allocation
F = X + Y + Z + W
[Figure: operand W shown with its bits W1, W0]
8 FA-tree Allocation (cont.)
[Figure: addend bits X1, Y1, W1 and X0, Y0, Z0, W0; one FA produces the carry C(x0,y0,z0) and the sum S(x0,y0,z0)]
9 FA-tree Allocation (cont.)
[Figure: addend bits X1, Y1, W1 and X0, Y0, Z0, W0 reduced by two FAs, followed by the Final Adder]
10 FA-tree Allocation
Minimize Timing
Examples
1. Based on the Wallace scheme
2. Based on column isolation
3. Based on column interaction
11 Example 1 (by Wallace's scheme)
X = X1X0, Y = Y1Y0, Z = Z0, W = W1W0
Dc = 1, Ds = 2 (arrival times in parentheses)
Col-1: X1 (7), Y1 (2), W1 (3)
Col-0: X0 (7), Y0 (5), Z0 (4), W0 (2)
FA in Col-1: S(x1,y1,z1) (9), C(x1,y1,z1) (8)
FA in Col-0: S(x0,y0,z0) (9), C(x0,y0,z0) (8)
12 Example 2 (by Column-Isolation)
Col-1: X1 (7), Y1 (2), W1 (3)
Col-0: W0 (2), Y0 (5), Z0 (4), X0 (7)
FA in Col-1: S(x1,y1,z1) (9), C(x1,y1,z1) (8)
FA in Col-0: S(x0,y0,z0) (7), C(x0,y0,z0) (6)
13 Example 3 (by Column-Interaction)
Col-1: X1 (7), Y1 (2), W1 (3)
Col-0: W0 (2), Y0 (5), Z0 (4), X0 (7)
FA in Col-0: S(x0,y0,z0) (7), C(x0,y0,z0) (6)
FA in Col-1 (takes the Col-0 carry as an input): S(x1,y1,z1) (8), C(x1,y1,z1) (7)
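The three examples can be re-derived with a few lines of arithmetic. The sketch below is only an illustration: it assumes an FA output is ready once its latest input has arrived plus the carry delay 1 or sum delay 2 (the Dc and Ds values from Example 1), and it assumes the mapping of arrival times to bits read off the slides. The helper name `fa` is hypothetical.

```python
D_C, D_S = 1, 2  # carry and sum delay of one FA (Dc, Ds on the Example 1 slide)

def fa(a, b, c):
    """Return (sum_time, carry_time) for an FA whose inputs arrive at a, b, c."""
    ready = max(a, b, c)          # assumed: outputs wait for the latest input
    return ready + D_S, ready + D_C

# Arrival times as read from the slides (assumed bit-to-time mapping).
x1, y1, w1 = 7, 2, 3
x0, y0, z0, w0 = 7, 5, 4, 2

# Example 1 (Wallace): the FA in Col-0 takes x0, y0, z0.
s0, c0 = fa(x0, y0, z0)
s1, c1 = fa(x1, y1, w1)
wallace = max(s0, w0, s1, c0, c1)

# Example 2 (column isolation): the late addend x0 bypasses the Col-0 FA.
s0, c0 = fa(w0, y0, z0)
s1, c1 = fa(x1, y1, w1)
isolation = max(s0, x0, s1, c0, c1)

# Example 3 (column interaction): the Col-0 carry replaces the late x1 in the Col-1 FA.
s0, c0 = fa(w0, y0, z0)
s1, c1 = fa(c0, y1, w1)
interaction = max(s0, x0, s1, x1, c1)

print(wallace, isolation, interaction)
```

Under these assumptions the latest signal reaching the final adder arrives at time 9 in the first two examples and at time 8 with column interaction.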
14 Observations for Minimum Delay
Observation 1: Process the rightmost (least significant) bit column first
[Figure: one FA in each of Col-0, Col-1, Col-2]
15 Observations for Minimum Delay
Observation 2: Assign the addends with the earliest arrival times to an FA
[Figure: addends a to g with arrival times (7) (5) (6) (2) (4) (1) (7) feeding an FA]
16 Algorithm SC_T (FA-tree Allocation for Single Column)
[Figure: repeated steps of sorting followed by FA allocation]
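A minimal sketch of a single-column pass in the spirit of SC_T is given here. It assumes the sorting and FA allocation steps mean: repeatedly take the three earliest-arriving signals, allocate one FA to them, and return the sum to the same column. The function name `sc_t`, the default delays (sum 2, carry 1) and the rule that an FA output waits for its latest input are assumptions made for illustration.

```python
import heapq

def sc_t(arrivals, d_sum=2, d_carry=1):
    """Reduce one column to fewer than three signals.

    Returns (remaining_arrival_times, carry_arrival_times).
    """
    heap = list(arrivals)
    heapq.heapify(heap)                  # "sorting": earliest arrival first
    carries = []
    while len(heap) > 2:                 # "FA allocation" while 3+ signals remain
        heapq.heappop(heap)              # earliest input
        heapq.heappop(heap)              # second earliest input
        latest = heapq.heappop(heap)     # third earliest input sets the FA timing
        heapq.heappush(heap, latest + d_sum)   # sum stays in this column
        carries.append(latest + d_carry)       # carry goes to the next column
    return sorted(heap), carries

# Arrival times from the Observation 2 slide.
print(sc_t([7, 5, 6, 2, 4, 1, 7]))
```

A heap keeps the newly produced sum in sorted position without re-sorting the whole column after every FA.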
17 Lemma 1
SC_T minimizes the arrival times of the carryout signals
[Figure: three FAs]
18 Lemma 2
Repeated application of SC_T produces a timing-optimal FA-tree
[Figure: FAs across Col-0, Col-1, Col-2]
19 Algorithm FA_AOT(F) (FA-tree Allocation for minimal delay)
Mj = set of addends in column j
- Repeat
- step 1. Call SC_T(Mj)
- step 2. Insert all carryouts of FAs into Mj+1
- step 3. j = j + 1
- until (|Ms| < 3 for all s)
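The column loop above can be sketched as follows. This is an illustrative reading, not the authors' implementation: `fa_aot` and `sc_t` are hypothetical names, the single-column routine is assumed to pick the three earliest signals each time, and the delays (sum 2, carry 1) are taken from the Example 1 slide.

```python
import heapq

D_S, D_C = 2, 1  # assumed FA sum and carry delays

def sc_t(arrivals):
    """Single-column pass: returns (remaining signals, carryout arrival times)."""
    heap = list(arrivals)
    heapq.heapify(heap)
    carries = []
    while len(heap) > 2:
        latest = [heapq.heappop(heap) for _ in range(3)][-1]  # three earliest signals
        heapq.heappush(heap, latest + D_S)
        carries.append(latest + D_C)
    return sorted(heap), carries

def fa_aot(columns):
    """columns[j] holds the arrival times of the addends in column j (j = 0 is the LSB)."""
    cols = [list(c) for c in columns]
    j = 0
    while j < len(cols):
        cols[j], carries = sc_t(cols[j])      # step 1
        if carries:                           # step 2: carryouts move to column j + 1
            if j + 1 == len(cols):
                cols.append([])
            cols[j + 1].extend(carries)
        j += 1                                # step 3
    return cols                               # every column now has fewer than 3 signals

# Columns of the worked examples: Col-0 = X0, Y0, Z0, W0 and Col-1 = X1, Y1, W1.
print(fa_aot([[7, 5, 4, 2], [7, 2, 3]]))
```

On the example columns this yields the same arrival times as the column-interaction allocation of Example 3, with the latest signal arriving at time 8.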
20 Theorem 1
FA_AOT produces an FA-tree with optimal execution delay
21 FA-tree Allocation
Minimize Power Consumption
22 Power Model
Simple Power Model for FA
- Stochastic Process Model (p(x) = probability of x being 1)
- Zero Gate-Delay Model
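To make the model concrete, the sketch below computes the signal probabilities of an FA's two outputs from its input probabilities. It assumes statistically independent inputs, and it assumes the switching activity of a signal with probability p is measured as 2p(1 - p); the slides do not state the exact activity formula, so that measure and the names `fa_probabilities`, `switching` and `fa_energy` are illustrative assumptions.

```python
def fa_probabilities(px, py, pz):
    """Return (p_sum, p_carry) for an FA with independent input probabilities."""
    pairs = px * py + py * pz + px * pz
    triple = px * py * pz
    p_carry = pairs - 2 * triple                   # at least two inputs are 1
    p_sum = px + py + pz - 2 * pairs + 4 * triple  # an odd number of inputs is 1
    return p_sum, p_carry

def switching(p):
    """Assumed activity measure: probability that the value differs between two cycles."""
    return 2 * p * (1 - p)

def fa_energy(px, py, pz, w_c=1.0, w_s=1.0):
    """Weighted switching of the carry and sum outputs (weights play the role of Wc, Ws)."""
    p_sum, p_carry = fa_probabilities(px, py, pz)
    return w_c * switching(p_carry) + w_s * switching(p_sum)

print(fa_probabilities(0.1, 0.2, 0.3), fa_energy(0.1, 0.2, 0.3))
```

A quick sanity check on the two expressions: px + py + pz always equals 2 * p_carry + p_sum, which mirrors the arithmetic identity x + y + z = 2c + s of a full adder.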
23 Weight Factors
Wc, Ws
24 FA decomposition for low power
- Not a simple gate
- Switching activity on two output signals
25 Power Consumption of FA-tree T
26 Motivational Example
Assume Wc = Ws = 1
[Figure: two FA-trees, T1 and T2, built over the addends X1, X2, X3, X4 in different orders]
Eswitching(T1) = 0.411, Eswitching(T2) = 0.400
27 Observation for Power
Minimizing the power consumed by the sum signals
Select the three addends with the largest value of |0.5 - p(x)|
[Figure: addends X1, X2, X3, X4 feeding an FA with sum output s]
28 Algorithm SC_LP (FA-tree Allocation for Single Column for Low Power)
[Figure: repeated steps of sorting by |0.5 - p(x)| followed by FA allocation]
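A single-column pass in the spirit of SC_LP might look like the sketch below. It assumes the sort key is the distance of p(x) from 0.5, that the three addends farthest from 0.5 go to the next FA, and that the sum returns to the same column. The activity measure 2p(1 - p), independent inputs, unit weights and all function names are assumptions for illustration only.

```python
def fa_probabilities(px, py, pz):
    """(p_sum, p_carry) of an FA with independent inputs."""
    pairs = px * py + py * pz + px * pz
    triple = px * py * pz
    return px + py + pz - 2 * pairs + 4 * triple, pairs - 2 * triple

def switching(p):
    return 2 * p * (1 - p)            # assumed switching-activity measure

def sc_lp(probs, w_c=1.0, w_s=1.0):
    """Reduce one column; returns (remaining probs, carryout probs, weighted switching)."""
    col, carries, energy = list(probs), [], 0.0
    while len(col) > 2:
        col.sort(key=lambda p: abs(0.5 - p), reverse=True)   # sort by |0.5 - p(x)|
        p_sum, p_carry = fa_probabilities(*col[:3])           # FA allocation
        energy += w_c * switching(p_carry) + w_s * switching(p_sum)
        col = col[3:] + [p_sum]       # sum stays in this column
        carries.append(p_carry)       # carry belongs to the next column
    return col, carries, energy

rest, carries, energy = sc_lp([0.1, 0.2, 0.3, 0.5])
print(rest, carries, energy)
```

For this four-addend input the pass groups 0.1, 0.2 and 0.3 into the FA and leaves 0.5 untouched; under the same assumed measure, each of the other three possible groupings gives a larger weighted switching value.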
29 Property 1
If 0 <= p(x) <= 0.5 for all addends x, or 0.5 <= p(x) <= 1 for all addends x,
SC_LP produces an FA-tree with minimal switching activity
30 Property 2
If ,
SC_LP produces an FA-tree with minimal
switching activity
31 Property 3
The sum of the signal probabilities of all carryout signals produced by SC_LP is constant
32 Algorithm FA_ALP(F) (FA-tree Allocation for Low Power for F)
Mj = set of addends in column j
- Repeat
- step 1. Call SC_LP(Mj)
- step 2. Insert all carryouts of FAs into Mj+1
- step 3. j = j + 1
- until (|Ms| < 3 for all s)
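The loop has the same shape as the timing version, with signal probabilities in place of arrival times. The sketch below is a hedged illustration: `fa_alp`, `sc_lp` and `fa_probs` are hypothetical names, inputs are assumed independent, the activity measure 2p(1 - p) is assumed, and Wc = Ws = 1.

```python
def fa_probs(px, py, pz):
    """(p_sum, p_carry) of an FA with independent inputs."""
    pairs = px * py + py * pz + px * pz
    triple = px * py * pz
    return px + py + pz - 2 * pairs + 4 * triple, pairs - 2 * triple

def switching(p):
    return 2 * p * (1 - p)                     # assumed activity measure

def sc_lp(probs):
    """Single-column pass with Wc = Ws = 1 (assumed)."""
    col, carries, energy = list(probs), [], 0.0
    while len(col) > 2:
        col.sort(key=lambda p: abs(0.5 - p), reverse=True)
        p_sum, p_carry = fa_probs(*col[:3])
        energy += switching(p_carry) + switching(p_sum)
        col = col[3:] + [p_sum]
        carries.append(p_carry)
    return col, carries, energy

def fa_alp(columns):
    """columns[j] holds p(x) for each addend in column j (j = 0 is the LSB)."""
    cols, total, j = [list(c) for c in columns], 0.0, 0
    while j < len(cols):
        cols[j], carries, energy = sc_lp(cols[j])   # step 1
        total += energy
        if carries:                                 # step 2: carryouts go to column j + 1
            if j + 1 == len(cols):
                cols.append([])
            cols[j + 1].extend(carries)
        j += 1                                      # step 3
    return cols, total

cols, total = fa_alp([[0.1, 0.2, 0.3, 0.5], [0.5, 0.5]])
print(cols, total)
```

Because column j is processed only after it has received every carryout from column j - 1, one left-to-right sweep is enough to leave fewer than three signals in every column.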
33 Experimental Results
- Timing Optimization
- Power Optimization
- We used Design Compiler for logic optimization.
- We used a 0.35 um technology.
34 Comparison of designs (optimize Timing)
Design | Convent. | CSA | FA_AOT | Imp. wrt Conv. (%) | Imp. wrt CSA (%)
X^2 | 1.3, 545 | 1.0, 275 | 0.3, 160 | 75.2, 80.7 | 69.0, 42.8
X^3 | 3.5, 2345 | 3.2, 1670 | 2.0, 825 | 43.2, 64.8 | 37.9, 50.6
X^2 X Y | 4.6, 5534 | 3.8, 3789 | 3.1, 3111 | 31.3, 43.7 | 17.2, 17.8
x^2 + 2xy + y^2 + 2x + 2y + 1 | 5.2, 9138 | 4.6, 8134 | 4.0, 6458 | 23.8, 29.3 | 13.4, 20.6
x y z xy yz 10 | 5.1, 7568 | 3.7, 6197 | 3.6, 5916 | 30.0, 21.8 | 4.2, 6.0
IIR | 6.5, 13362 | 4.7, 11202 | 3.6, 8349 | 43.9, 37.5 | 22.5, 25.5
Kalman | 6.0, 31073 | 4.5, 25713 | 3.6, 21542 | 39.4, 30.7 | 18.0, 16.2
IDCT | 11.5, 85364 | 6.3, 77052 | 4.4, 60307 | 61.3, 29.3 | 30.2, 21.7
Complex | 5.2, 53879 | 4.5, 50083 | 3.7, 38343 | 29.1, 28.8 | 17.9, 23.4
Serial-adapter | 6.4, 6593 | 6.0, 5608 | 5.7, 5631 | 11.5, 4.7 | 4.7, -0.4
35 Comparison of designs (optimize Power)
Designs | FArandom | FA_ALP | Impr. (%)
IIR | 257mW | 240mW | 6.6
Kalman | 316mW | 281mW | 11.0
IDCT | 1406mW | 1324mW | 5.8
Complex | 330mW | 399mW | 6.6
Serial-Adapter | 324mW | 240mW | 25.9
Average | | | 11.8
36 Conclusions
- Extended the Wallace-based arithmetic transformation
- Proposed a bit-reduction technique for timing and power
- FA-tree structure with optimal timing
- FA-tree structure for low power, with less switching activity