Title: Energy Efficient Hardware Synthesis of Polynomial Expressions
18th International Conference on VLSI Design
Anup Hosangadi, Ryan Kastner (ECE Department, UCSB)
Farzan Fallah (Advanced CAD Research, Fujitsu Labs of America)
2. Outline
- Introduction
- Related Work
- Problem formulation
- Algorithms for optimizing polynomials
- Experimental results
- Conclusions
3. Introduction
- Embedded system applications need to compute polynomial expressions
- Continuous functions can be approximated by Taylor series
- Adaptive (polynomial) filters
- Polynomial interpolation/extrapolation in computer graphics
- Encryption
4. Introduction
- Commonly occurring computations are implemented in hardware
- More flexibility than a processor architecture
- NPAs (hardware accelerators) in the PICO project
- Custom instructions (Tensilica)
- Up to 100 times improvement over a processor implementation (Kastner et al., TODAES02)
- Develop techniques for reducing power consumption
5. Related Work (Behavioral transforms)
- Power consumption depends on many factors
- Reducing number of operations
- Hardware (Nguyen and Chatterjee, TVLSI00)
- Software (I. Hong et al., TODAES99)
- Voltage reduction after speedup transformations
- Retiming, pipelining, algebraic restructuring (Chandrakasan et al., TCAD95)
6. Related Work
- Scheduling and resource allocation
- Shutting down unused resources (Monteiro et al., DAC 96)
- Allocation of registers, functional units and interconnects (A. Raghunathan et al., ICCD94)
- Multiple Vdd scheduling
- Assigning supply voltage to each operation in CDFG (M. Chang and M. Pedram, TVLSI97)
7. Related Work
- Switching power is proportional to the number of operations
- Multiplications are expensive in embedded systems
- On average 40 times more power than addition at 5 V (V. Krishna et al., VLSI Design 1999)
- Careful optimization of expressions is therefore necessary to save power
8. Reducing operations in polynomial expressions
- No good tool for polynomials
- Designers rely on hand-optimized libraries
- Conventional compiler techniques (CSE and value numbering) are not suited for polynomials
- Horner form is the most popular representation
- $a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0 = (((a_n x + a_{n-1})x + a_{n-2})x + \dots + a_1)x + a_0$
- Not good for multivariate polynomials
- Only a single polynomial expression at a time
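A minimal sketch of the Horner rule quoted above, in Python. The function names and the coefficient ordering (highest degree first) are illustrative assumptions, not taken from the slides.

```python
def horner(coeffs, x):
    """Evaluate a_n*x^n + ... + a_1*x + a_0 with Horner's rule.

    coeffs is ordered from a_n down to a_0 (an assumed convention).
    """
    result = 0
    for a in coeffs:
        # One multiplication and one addition per coefficient.
        result = result * x + a
    return result


def direct(coeffs, x):
    """Reference evaluation using explicit powers, for comparison."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))
```

For example, `horner([2, -3, 0, 5], 3)` evaluates $2x^3 - 3x^2 + 5$ at $x = 3$ and agrees with `direct` on the same inputs. The rule handles one variable and one polynomial at a time, which is the limitation the slide points out.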
9. Comparison with Horner form
- Quartic-spline polynomial (3-D graphics)
- $P = zu^4 + 4avu^3 + 6bu^2v^2 + 4uv^3w + qv^4$
- Horner form (from Maple™)
- $P = zu^4 + (4au^3 + (6bu^2 + (4uw + qv)v)v)v$
- (17 multiplications)
- Proposed algebraic method
- $d_1 = v^2$, $d_2 = 4v$
- $P = u^3(uz + ad_2) + d_1(qd_1 + u(wd_2 + 6bu))$
- (11 multiplications)
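The three forms of the quartic-spline polynomial can be checked against each other numerically. The sketch below is only a sanity check, not part of the authors' flow. It assumes $d_1 = v^2$ and $d_2 = 4v$, which are the values for which the factored form expands back to $P$; the function names are hypothetical.

```python
import random


def p_expanded(u, v, w, z, a, b, q):
    # Sum-of-products form of the quartic spline.
    return z*u**4 + 4*a*v*u**3 + 6*b*u**2*v**2 + 4*u*v**3*w + q*v**4


def p_horner(u, v, w, z, a, b, q):
    # Horner form with respect to v.
    return z*u**4 + (4*a*u**3 + (6*b*u**2 + (4*u*w + q*v)*v)*v)*v


def p_factored(u, v, w, z, a, b, q):
    d1 = v * v   # d1 = v^2
    d2 = 4 * v   # assumption: d2 = 4v
    return u**3 * (u*z + a*d2) + d1 * (q*d1 + u*(w*d2 + 6*b*u))


def forms_agree(trials=100, seed=0):
    """Compare the three forms on random integer inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = [rng.randint(-9, 9) for _ in range(7)]
        if not (p_expanded(*args) == p_horner(*args) == p_factored(*args)):
            return False
    return True
```

Integer inputs keep the comparison exact, so any mismatch would indicate an algebraic error and not a rounding effect.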
10. Related Work (Polynomial Expressions)
- Expression factorization (M. A. Breuer, JACM69)
- Allows only one kind of operator at a time
- Using symbolic algebra (M. A. Peymandoust, De Micheli)
- Mapping polynomial datapaths to libraries (DAC01)
- Low power embedded software (DATE02)
- Results depend heavily on the set of library elements
- e.g. $(a^2 - b^2) = (a+b)(a-b)$ iff $(a+b)$ or $(a-b)$ is a library element
- Manipulates only a single expression at a time
$F_1 = A + B + C + D$, $F_2 = A + P + D$ => extract $(A + D)$
11. Motivating Example
- Consider a set of expressions: 16 multiplications and 4 additions/subtractions
- Using CSE: 12 multiplications and 4 additions/subtractions
12. Motivating Example
- Using Horner transform: 12 multiplications and 4 additions/subtractions
- Using our algebraic technique: 7 multiplications and 3 additions/subtractions
13. Introduction to algebraic technique for redundancy elimination
- Algebraic techniques in multi-level logic synthesis (MLLS)
- Decomposition and factoring reduce the number of literals
- Distill and Condense use rectangle covering methods
- Polynomial expressions (our technique)
- Factoring and single-term common subexpressions reduce the number of multiplications
- Multiple-term common subexpressions reduce the number of additions and possibly multiplications
- Key differences (generalization to handle higher orders)
- Kernelling techniques
- Finding single cube intersections
14. Introduction to our technique (Outline)
- Find a subset of all possible subexpressions (kernel generation)
- Transformation of polynomial expressions
- Problem formulation
- Extract multiple-term common subexpressions and factors
- Extract single-term common factors
15. Introduction to our technique
- Terminology
- Literal: a variable or a constant, e.g. a, b, 2, 3.14
- Cube: product of literals, e.g. $3a^2b$, $-2a^3b^2c$
- SOP: sum of cubes, e.g. $3a^2b - 2a^3b^2c$
- Cube-free expression: no literal or cube can divide all the cubes of the expression
- Kernel: a cube-free sub-expression of an expression, e.g. $3 - 2abc$
- Co-kernel: a cube that is used to divide an expression to get a kernel, e.g. $a^2b$
16. Introduction to our technique
- Matrix representation of polynomial expressions
- $F = x^3y + xy^2z$ is represented by the rows (3 1 0) and (1 2 1) over the columns $x$, $y$, $z$
- Each row represents a product term
- Each column represents a variable/constant
- Each element $(i,j)$ represents the power of variable $j$ in term $i$
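The matrix representation described above can be sketched in a few lines of Python. The dictionary-based input format and the name `to_matrix` are assumptions made for illustration.

```python
def to_matrix(terms, variables):
    """Build the exponent matrix: one row per term, one column per variable.

    terms is a list of dicts mapping a variable name to its power;
    a variable missing from a term has power 0.
    """
    return [[term.get(v, 0) for v in variables] for term in terms]


# F = x^3*y + x*y^2*z over the column order (x, y, z)
F = [{"x": 3, "y": 1}, {"x": 1, "y": 2, "z": 1}]
matrix = to_matrix(F, ["x", "y", "z"])
# matrix == [[3, 1, 0], [1, 2, 1]]
```

Storing integer powers, where logic synthesis would store 0/1 entries, is what lets the later steps handle higher-order terms.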
17. Generation of Kernels (example)
- $P_1 = x^3y + x^2y^2z$, $L = \{x, y, z\}$
- Divide by $x$
- $F_t = P_1/x = x^2y + xy^2z$
18. Generation of Kernels (example)
- $F_t = P_1/x = x^2y + xy^2z$
- $C$ = biggest cube dividing all cubes of $F_t$
- $C = xy$, i.e. the matrix row (1 1 0)
19. Generation of Kernels (example)
- Obtain kernel
- $F_1 = F_t/C = (x^2y + xy^2z)/(xy) = (x + yz)$
- Obtain co-kernel
- $D_1 = x(xy) = x^2y$
- No kernels within $F_1$. Go back to $P_1$
- $P_1 = x^3y + x^2y^2z$
- Divide now by the next variable $y$
- $F_t = x^3 + x^2yz$
- $C = x^2$
- But $x \in C$, and $x$ precedes $y$ in the literal order ($x < y$)
- Stop here, to avoid repeating the same kernel $F_t/C = (x + yz)$
- No more kernels extracted
- Record kernel $F_1 = P_1$ with co-kernel 1
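The walk-through above can be sketched as a recursive procedure on exponent tuples. This is a simplified sketch: coefficients are ignored, a cube is a tuple of powers over a fixed literal order, and all function names are hypothetical. It does not add the final "expression itself with co-kernel 1" entry.

```python
def divide_by_literal(poly, j):
    """Keep the cubes that contain literal j and lower its power by one."""
    return [tuple(e - 1 if k == j else e for k, e in enumerate(cube))
            for cube in poly if cube[j] > 0]


def largest_common_cube(poly):
    """Biggest cube dividing every cube: the column-wise minimum power."""
    return tuple(min(col) for col in zip(*poly))


def kernels(poly, start=0, cokernel=None, found=None):
    """Return a list of (co-kernel, kernel) pairs for poly."""
    n = len(poly[0])
    cokernel = cokernel if cokernel is not None else (0,) * n
    found = found if found is not None else []
    for j in range(start, n):
        # A literal must appear in at least two cubes to give a kernel.
        if sum(1 for cube in poly if cube[j] > 0) < 2:
            continue
        ft = divide_by_literal(poly, j)
        c = largest_common_cube(ft)
        # An earlier literal in C means this kernel was already found.
        if any(c[k] > 0 for k in range(j)):
            continue
        kernel = [tuple(e - ce for e, ce in zip(cube, c)) for cube in ft]
        new_ck = tuple(d + ce + (1 if k == j else 0)
                       for k, (d, ce) in enumerate(zip(cokernel, c)))
        found.append((new_ck, kernel))
        # Recurse from the same literal so higher powers are handled.
        kernels(kernel, j, new_ck, found)
    return found


# P1 = x^3*y + x^2*y^2*z over the literal order (x, y, z)
P1 = [(3, 1, 0), (2, 2, 1)]
result = kernels(P1)
# result == [((2, 1, 0), [(1, 0, 0), (0, 1, 1)])]
```

The single pair returned for `P1` is the co-kernel $x^2y$ with kernel $x + yz$. The division by $y$ is skipped because its common cube $x^2$ contains the earlier literal $x$, as in the slides.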
20. Concept of kernels and co-kernels
- Theorem: two expressions $f$ and $g$ can have a multiple-term common subexpression iff there are two kernels $K_f$ and $K_g$ having a multiple-term intersection
- Detection of multiple-term common subexpressions by intersection of sets of kernels
- Each co-kernel/kernel pair represents a possible factorization
- e.g. $x^3y + x^2y^2z = x^2y(x + yz)$
- The set of kernels is a subset of all possible subexpressions
21. All Kernels and Co-Kernels
Which kernels to choose?
22. Kernel Cube Matrix (KCM)
- One row for each kernel generated
- One column for each distinct kernel cube
- Each non-zero element represents a term
23. Finding Kernel Intersections (Distill Algorithm)
- Each kernel intersection or factor appears as a rectangle
- Rectangle: set of rows and columns such that all elements are 1
- Value of a rectangle: weighted sum of the energy savings of the different operations
- Goal: maximum-valued rectangular covering of the KCM
- Greedy heuristic: covering by prime rectangles
24. Modeling value function of a rectangle
- Formula for the weighted sum of energy savings on selection of a rectangle
- $R$ = number of rows, $C$ = number of columns
- $M(R_i)$ = number of multiplications in row (co-kernel) $i$
- $M(C_i)$ = number of multiplications in column (kernel-cube) $i$
- $m$ = ratio of average energy consumption of multiplication to addition in the target library
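The symbols above are defined, but the value formula itself is not spelled out in the text. The sketch below uses one form that is consistent with those definitions. It is an assumption, not a quotation of the slide: multiplications saved are taken as $(C-1)(R + \sum M(R_i)) + (R-1)\sum M(C_i)$, additions saved as $(R-1)(C-1)$, and the value as $m$ times the former plus the latter. The function name is hypothetical.

```python
def rectangle_value(row_mults, col_mults, m):
    """Weighted energy saving of a KCM rectangle (assumed formula).

    row_mults: M(R_i), multiplications in each selected co-kernel (row)
    col_mults: M(C_i), multiplications in each selected kernel cube (column)
    m: energy ratio of a multiplication to an addition
    """
    R, C = len(row_mults), len(col_mults)
    # Assumed count of multiplications removed by extracting the rectangle.
    mults_saved = (C - 1) * (R + sum(row_mults)) + (R - 1) * sum(col_mults)
    # Assumed count of additions removed.
    adds_saved = (R - 1) * (C - 1)
    return m * mults_saved + adds_saved


# One row (co-kernel xy, 1 multiplication), two columns (4 and x, none each)
value = rectangle_value([1], [0, 0], 40)
# value == 80
```

With $m = 40$, the ratio quoted earlier from Krishna et al., this reproduces the "saves 2 multiplications, value 80" figure given for the $xy(4 - x)$ rectangle in the Distill example.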
25. Distill Algorithm
26. Distill Algorithm
$4xy - x^2y = xy\,d_2$ with $d_2 = 4 - x$: saves 2 multiplications (value 80)
Remove covered terms
27. Distill Algorithm
- Distill algorithm exits after no more kernel intersections can be found
$P_1 = x^2y\,d_1$, $d_1 = x + yz$
$P_2 = 4d_1 + xyz$, $d_2 = 4 - x$
$P_3 = xy\,d_2$
Can further optimize by finding single cube intersections
28. Finding single cube intersections (Condense algorithm)
- Form the Cube Literal Matrix (CLM)
- One row for each cube
- One column for each literal
- E.g. two cubes, $F_1 = a^4b^3c$ and $F_2 = a^2b^4c^2$
29. Finding single cube intersections (Condense algorithm)
- Each (single-term) common subexpression appears as a rectangle
- Rectangle: set of rows and columns where all elements are non-zero
- Value of a rectangle is the number of multiplications saved by selecting it
- $C$ = cube corresponding to the rectangle
- $\text{Value} = \text{Rows} \times ((\sum C_i) - 1)$
- Maximum-valued rectangular covering will give the minimum number of multiplications
- Use greedy iterative covering by prime rectangles
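A small sketch of the single-cube step, applied to the two-cube CLM example $F_1 = a^4b^3c$, $F_2 = a^2b^4c^2$. The function names are hypothetical. The split into a gross saving (rows times $\sum C_i - 1$) and a net saving (gross minus the multiplications needed to compute the extracted cube once) is an interpretation, labeled as such in the code.

```python
def common_cube(clm_rows):
    """Largest cube shared by all rows: column-wise minimum power."""
    return [min(col) for col in zip(*clm_rows)]


def condense_savings(clm_rows):
    """Return (cube, gross_saving, net_saving) for a CLM rectangle.

    gross: rows * (sum of cube powers - 1), each row reuses the cube.
    net: gross minus the cost of computing the cube once (assumption).
    """
    cube = common_cube(clm_rows)
    if sum(cube) < 2:
        # A single literal (or nothing) in common saves no multiplication.
        return cube, 0, 0
    per_row = sum(cube) - 1
    gross = len(clm_rows) * per_row
    net = gross - per_row
    return cube, gross, net


# Rows are cubes, columns are the literals (a, b, c)
clm = [[4, 3, 1], [2, 4, 2]]
cube, gross, net = condense_savings(clm)
# cube == [2, 3, 1]  (a^2 * b^3 * c), gross == 10, net == 5
```

As a check on the net figure: the two cubes need 7 multiplications each when computed separately. After extracting $d = a^2b^3c$ (5 multiplications) they become $a^2 d$ and $bcd$ (2 each), for 9 in total, a saving of 5.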
30. Cube Literal Matrix (Condense Algorithm)
CLM for our example after the Distill algorithm
Save 2 multiplications by extracting $C = xy$
31. Condense Algorithm
Extracting xy
No more favorable cube intersections found
32. Final Implementation
- Total 7 multiplications, 3 additions/subtractions
- Savings of 5 multiplications, 1 addition/subtraction compared to CSE
- Impossible to obtain such results using conventional techniques
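The operation counts can be reproduced with a small counting wrapper. In this sketch the expanded expressions are reconstructed by multiplying out the factored forms from the Distill and Condense steps. The sign of the $xyz$ term in $P_2$ and the name `d3` for the extracted cube $xy$ are assumptions, and the `Counted` class is purely illustrative.

```python
class Counted:
    """Number wrapper that tallies multiplications and additions/subtractions."""
    mults = 0
    adds = 0

    def __init__(self, value):
        self.value = value

    @staticmethod
    def _raw(other):
        return other.value if isinstance(other, Counted) else other

    def __mul__(self, other):
        Counted.mults += 1
        return Counted(self.value * Counted._raw(other))

    __rmul__ = __mul__

    def __add__(self, other):
        Counted.adds += 1
        return Counted(self.value + Counted._raw(other))

    __radd__ = __add__

    def __sub__(self, other):
        Counted.adds += 1
        return Counted(self.value - Counted._raw(other))

    def __rsub__(self, other):
        Counted.adds += 1
        return Counted(Counted._raw(other) - self.value)


def count_ops(fn, x, y, z):
    """Run fn on counted inputs; return (values, multiplications, add/subs)."""
    Counted.mults = Counted.adds = 0
    results = fn(Counted(x), Counted(y), Counted(z))
    return [r.value for r in results], Counted.mults, Counted.adds


def expanded(x, y, z):
    p1 = x*x*x*y + x*x*y*y*z
    p2 = 4*x + 4*y*z + x*y*z   # sign of the xyz term is an assumption
    p3 = 4*x*y - x*x*y
    return p1, p2, p3


def factored(x, y, z):
    d3 = x * y        # single cube extracted by Condense (name assumed)
    d1 = x + y*z      # kernel intersection extracted by Distill
    d2 = 4 - x
    return x*d3*d1, 4*d1 + d3*z, d3*d2
```

Calling `count_ops(expanded, 2, 3, 5)` and `count_ops(factored, 2, 3, 5)` returns the same three values with 16 multiplications and 4 additions/subtractions for the expanded form, against 7 and 3 for the factored form. These match the totals quoted in the slides.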
33. Experimental setup
- Polynomials used in computer graphics and signal processing
- 1.0 µ technology library, characterized for power consumption
- Synthesized using Synopsys Design Compiler™
- Min hardware constraints (1 adder, 1 multiplier)
- Med hardware constraints (max 4 multipliers)
34. Experimental setup
- Estimated power using Synopsys Power Compiler™ for random inputs, using an RTL simulator (VCS™)
- Compared energy consumption with CSE and Horner form
- Compared energy after voltage scaling
35. Results (Comparing operations)
36. Results (Min hardware constraints)
37. Results (Med hardware constraints)
38. Conclusions
- Technique to reduce the number of operations in polynomial expressions
- Large savings in energy consumption observed over CSE and Horner methods
- Need to consider scheduling and resource allocation to obtain further improvements
39. Conclusions
- Thank you!!
- Questions ???
41. Finding Kernel Intersections (Distill Algorithm)
- Worst-case scenario for the Distill algorithm
- Number of prime rectangles is exponential in the number of rows/columns
- Heuristic methods to find the best prime rectangle
- In practice polynomial expressions are not so large