Title: Ajay K. Verma and Paolo Ienne
1Towards the Automatic Exploration of
Arithmetic-Circuit Architectures
- Ajay K. Verma and Paolo Ienne
- Processor Architecture Laboratory (LAP)
- Centre for Advanced Digital Systems (CSDA)
- Ecole Polytechnique Fédérale de Lausanne (EPFL)
2Example: Plenty of Different Adders
[Figure: three adder implementations with delay and area of 0.49 ns / 691 µm², 0.41 ns / 385 µm², and 0.34 ns / 534 µm².]
Problem: How can we automatically obtain the best-fitting implementation in this space?
3Typical Synthesis Methods
- Write all the expressions in sum-of-products form.
- Find all the kernels and co-kernels of the expressions.
- Formulate the problem as a Rectangle Cover Problem.
- Use heuristics to solve the Rectangle Cover Problem.
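x = af + abc + def + bcde = (a + de) (bc + f)

Rectangle-cover matrix (entries are term indices in x: 1 = af, 2 = abc, 3 = def, 4 = bcde; 0 = no term):

        a    de   bc   f
  a     0    0    2    1
  de    0    0    4    3
  bc    2    4    0    0
  f     1    3    0    0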
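The kernel view of x = af + abc + def + bcde can be checked with a small sketch. It assumes single-letter variables and represents each cube as a set of literals; the helper names `parse`, `divide_by_cube`, and `multiply` are hypothetical and are not taken from any synthesis tool.

```python
from itertools import product

def parse(sop):
    """Turn 'af + abc' into a set of cubes; each cube is a frozenset of literals.
    Assumes single-letter variable names (an assumption of this sketch)."""
    return {frozenset(term.strip()) for term in sop.split("+")}

def divide_by_cube(expr, cube):
    """Algebraic quotient of a sum of products by a single cube."""
    return {term - cube for term in expr if cube <= term}

def multiply(p, q):
    """Algebraic product of two sums of products with disjoint supports."""
    return {a | b for a, b in product(p, q)}

x = parse("af + abc + def + bcde")

# Dividing by each co-kernel cube yields the matching kernel.
kernel_a = divide_by_cube(x, frozenset("a"))    # bc + f
kernel_de = divide_by_cube(x, frozenset("de"))  # bc + f
kernel_bc = divide_by_cube(x, frozenset("bc"))  # a + de
kernel_f = divide_by_cube(x, frozenset("f"))    # a + de

# Multiplying the two kernels back together recovers x.
factored = multiply(parse("a + de"), parse("bc + f"))
```

Each row of the rectangle-cover matrix corresponds to one of these divisions: the nonzero entries in a row are the terms of x that contain the row's cube.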
4Limitations of Typical Methods
- All expressions should be in sum-of-products form.
- Arithmetic expressions are XOR-intensive.
- Kernel extraction is based on algebraic factoring.
- Expressions and are considered independent (hence unable to explore all common subexpressions).
- The Rectangle Cover Problem is solved with heuristics, which therefore cannot guarantee optimal results.
5Related Work
- Multi-level optimization and Boolean division
- Classic problem [Brayton82, Brayton90, DeMicheli94, …]
- Boolean division improvements [Chang99, …]
- Optimization of specific arithmetic circuits
- Final adders for multipliers [Lee91]
- Column compressors [Stelling98]
- Carry-save addition [Verma04]
- Symbolic algebra
- Various applications to EDA [Peymandoust01]
6Outline
- Problem formulation.
- Introduction of a different division.
- Core problem: finding Common Sub-Expressions (CSEs).
- Enumeration of all possible CSEs.
- Pruning the search space.
- Results and analysis.
7Problem Statement
- Pareto-optimal implementation: an implementation such that every other implementation is worse in area or in critical-path delay.
Given a set of Boolean expressions, generate all
their Pareto-optimal implementations.
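As a minimal sketch of what "Pareto-optimal" means here, the filter below keeps only the non-dominated (delay, area) pairs. The function name `pareto_front` is hypothetical; the three data points are the adder figures quoted on the earlier example slide (delay in ns, area in µm²).

```python
def pareto_front(points):
    """Return the non-dominated (delay, area) pairs, sorted by increasing delay.

    A point survives only if its area is strictly smaller than the area of
    every point that is at least as fast."""
    front = []
    best_area = float("inf")
    for delay, area in sorted(set(points)):
        if area < best_area:
            front.append((delay, area))
            best_area = area
    return front

adders = [(0.49, 691), (0.41, 385), (0.34, 534)]
front = pareto_front(adders)
```

On these three points the 0.49 ns adder is dominated by the 0.41 ns one (slower and larger), so only two implementations remain on the front.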
8Gröbner Bases and Division
- A well-known method for multinomial division using the remainder theorem.
reduce (f, g)
Algebraic factoring
9Gröbner Bases for Boolean Algebra
- Boolean algebra does not form a ring under the operations AND and OR.
- Neither of the two operations is invertible.
- But Boolean algebra forms a ring over the field GF(2) under the operations AND and XOR.
- Operation XOR is self-invertible.
- Reed-Muller form has no NOT operation.
- Reed-Muller form of an expression is unique.
- Expected size of an expression in Reed-Muller form is smaller than the expected size in sum-of-products form.
- Issue: the AND operator is idempotent.
- Reduce the final expression with respect to (x^2 - x) for all underlying variables x.
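The following sketch illustrates both points under stated assumptions: the Reed-Muller form is computed from a truth table with the standard Möbius (butterfly) transform over GF(2), and polynomials are multiplied with the reduction x·x = x built in. The encoding (monomials as bitmasks, polynomials as sets of bitmasks) and the function names are choices made for this example, not the authors' implementation.

```python
def reed_muller(truth_table):
    """Reed-Muller (algebraic normal form) coefficients from a truth table.

    The table has length 2**n; bit k of index i is the value of variable x_k.
    In the result, entry i is 1 iff the monomial made of the variables whose
    bits are set in i appears in the XOR-of-products form."""
    coeffs = list(truth_table)
    n = len(coeffs).bit_length() - 1
    for k in range(n):
        step = 1 << k
        for i in range(len(coeffs)):
            if i & step:
                coeffs[i] ^= coeffs[i ^ step]
    return coeffs

def multiply_gf2(p, q):
    """Product of two GF(2) polynomials given as sets of monomial bitmasks.

    OR-ing two masks applies x*x = x (reduction modulo x^2 - x); the
    symmetric difference cancels monomials that appear an even number of times."""
    result = set()
    for a in p:
        for b in q:
            result ^= {a | b}
    return result

# Carry of a full adder (majority of x0, x1, x2) as a truth table.
majority = [0, 0, 0, 1, 0, 1, 1, 1]
carry_rm = reed_muller(majority)  # x0x1 XOR x0x2 XOR x1x2

# (x0 XOR x1) squared reduces back to (x0 XOR x1) because AND is idempotent.
square = multiply_gf2({0b01, 0b10}, {0b01, 0b10})
```

The carry example shows why the form suits arithmetic: the majority function needs no NOT and becomes a plain XOR of three products.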
10Two Theorems and Their Consequences
- Theorem 1: In any Pareto-optimal implementation of E1 and E2 that uses S as a Common Sub-Expression, the implementation of S must itself be Pareto-optimal.
The problem has a dynamic programming structure.
11Two Theorems and Their Consequences
- Theorem 2: If there are m Pareto-optimal implementations of E1 and n Pareto-optimal implementations of E2 which use Sk as the implementation of their CSE, then by considering only (m + n) combinations of these implementations we can find all Pareto-optimal implementations using Sk.
[Figure: Pareto points of E1 and E2 plotted as area versus delay, labelled 1 (20), 1 (30), 2 (16), 8 (38), 3 (25), 4 (15), 8 (22), 5 (14), 10 (20), 7 (13), 8 (35), 9 (12).]
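One way to see why far fewer than m × n combinations are needed is the merge sketched below. It assumes that the combined delay is the maximum of the two delays and the combined area is the sum of the two areas (minus a constant for the shared sub-expression); both assumptions, the function name `combine_fronts`, and the (delay, area) values are illustrative only. For every delay bound taken from either front, it pairs the cheapest implementation of each expression that meets the bound, so at most m + n candidates are examined.

```python
from bisect import bisect_right

def combine_fronts(front1, front2, shared_area=0):
    """Combine two Pareto fronts given as (delay, area) lists sorted by
    increasing delay (and therefore decreasing area)."""
    delays1 = [d for d, _ in front1]
    delays2 = [d for d, _ in front2]
    candidates = []
    for bound in sorted(set(delays1 + delays2)):  # at most m + n bounds
        i = bisect_right(delays1, bound) - 1      # cheapest E1 within bound
        j = bisect_right(delays2, bound) - 1      # cheapest E2 within bound
        if i < 0 or j < 0:
            continue                              # one side cannot meet the bound
        delay = max(front1[i][0], front2[j][0])
        area = front1[i][1] + front2[j][1] - shared_area
        candidates.append((delay, area))
    # Keep only the non-dominated candidates.
    result, best_area = [], float("inf")
    for delay, area in sorted(set(candidates)):
        if area < best_area:
            result.append((delay, area))
            best_area = area
    return result

# Illustrative fronts (hypothetical values): m = 6 and n = 4.
front_e1 = [(1, 20), (2, 16), (4, 15), (5, 14), (7, 13), (9, 12)]
front_e2 = [(1, 30), (3, 25), (8, 22), (10, 20)]
combined = combine_fronts(front_e1, front_e2)
```

With these inputs the merge returns nine points, which matches what an exhaustive pass over all 24 pairings would keep.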
12Hence, Two Independent Problems
- Problem 1: Given two Boolean expressions E1 and E2, find all possible Common Sub-Expressions between them.
- Problem 2: Find all Pareto-optimal implementations of a single Boolean expression E.
13Problem 1: Enumerating CSEs
The nodes of the DAG correspond to all partial
implementations of the two expressions with some
sharing between them.
14Replacing Partial Occurrences Can Be Useful
Partial occurrences can also be replaced by a new variable (e.g., s = x ⊕ y ⇒ x = (x ⊕ y) ⊕ y = s ⊕ y). Kernel extraction algorithms cannot be used.
Replacing partial occurrences too: 3 XOR gates, 2 AND gates
Without replacing partial occurrences: 3 XOR gates, 3 AND gates
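The identity behind replacing a partial occurrence, x = (x ⊕ y) ⊕ y, can be checked exhaustively. The sketch below does that and then applies the substitution s = x ⊕ y to a small expression; the expression `original` is a hypothetical example chosen for illustration and does not come from the slides.

```python
from itertools import product

def equivalent(f, g, n):
    """True if two n-input Boolean functions agree on every assignment."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=n))

# x can always be rebuilt from s = x XOR y and y, because XOR is self-inverse.
identity_holds = equivalent(lambda x, y: x, lambda x, y: (x ^ y) ^ y, 2)

# Hypothetical expression in which x occurs once outside the pattern x XOR y.
original = lambda x, y, z: (x & z) ^ x ^ y

def rewritten(x, y, z):
    s = x ^ y                 # the shared sub-expression
    return ((s ^ y) & z) ^ s  # the lone x is replaced by s XOR y

same_function = equivalent(original, rewritten, 3)
```

Whether such a rewrite pays off depends on the gate counts of the whole design; the check only confirms that the function is preserved.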
15t-Reductions Are Necessary
t = bd
t-reductions preserve the minimum delay in at least one path.
16Pruning the Enumeration DAG
- The size of the DAG can be as large as O ((n m) 2m), where n is the number of variables and m is the size of the Boolean expressions.
- Enumerating the whole DAG is computationally infeasible.
- Pruning criteria:
- Recognizing node equivalence (width reduction).
- Merging some reductions into a single one (height reduction).
- Delaying certain reductions (branch reduction).
17Pruning Based on Node Equivalence (Width
Reduction)
s5 ? abcd, s6 ? abcd, s7 ? s1cd s5 s6 ?
s7
18Neutral t-Reductions Should Be Applied
Immediately (Height Reduction)
- Neutral t-reductions
- A t-reduction which does not kill any s-reduction.
- Recognition of neutral t-reductions
- Find all reductions which are killed by this reduction.
- Check if any of them is an s-reduction.
[Figure labels: s-reduction, t-reduction, Normalization.]
19Nonneutral t-Reductions Should Be Delayed (Branch
Reduction)
- The only purpose of t-reductions is to preserve the minimum delay in at least one path.
- If there exists at least one s-reduction which preserves the minimum delay, any t-reduction at the current node can be avoided.
- Computing the minimum delay corresponding to a Boolean expression is NP-hard.
- The minimum delay is not hard to compute for all instances of Boolean expressions.
- E.g., the minimum delay of a Boolean expression A1 ⊕ A2 ⊕ … ⊕ An can be computed using a two-greedy approach, where the Ai are product terms with disjoint sets of variables.
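The slides do not spell out the two-greedy approach, so the sketch below is one plausible reading rather than the authors' algorithm: a first greedy pass builds each product term as a tree of 2-input AND gates, and a second greedy pass combines the terms with 2-input XOR gates, each time merging the two earliest-arriving signals. Unit gate delays, 2-input gates, and the names `tree_delay` and `min_delay` are assumptions of this example.

```python
import heapq

def tree_delay(arrivals, gate_delay=1):
    """Earliest output time of a tree of 2-input gates over the given
    (non-empty) arrival times, always merging the two earliest signals."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + gate_delay)
    return heap[0]

def min_delay(terms, arrival, and_delay=1, xor_delay=1):
    """Delay of an XOR of product terms with disjoint variables.

    terms: iterable of product terms, each an iterable of variable names.
    arrival: mapping from variable name to its arrival time."""
    term_delays = [
        tree_delay([arrival[v] for v in term], and_delay) for term in terms
    ]
    return tree_delay(term_delays, xor_delay)

# abcd XOR ef XOR g with all inputs arriving at time 0.
inputs = {v: 0 for v in "abcdefg"}
delay = min_delay(["abcd", "ef", "g"], inputs)
```

In the example the terms are ready at times 2, 1, and 0; merging the two early ones first gives a total delay of 3, whereas XOR-ing the two slow terms first would give 4.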
20Two Independent Problems
- Problem 1: Given two Boolean expressions E1 and E2, find all possible Common Sub-Expressions between them.
- Problem 2: Find all Pareto-optimal implementations of a single Boolean expression E.
21Problem 2: Special Case of the General Problem
- All the Pareto-optimal implementations of a single expression can be evaluated using DAG enumeration.
- s- and t-reductions can be defined in a similar way.
- If the corresponding expression occurs more than once then it is an s-reduction, otherwise a t-reduction.
- If there are no s-reductions in the DAG, then all implementations will have the same area.
- The Pareto-optimal implementation will correspond to the one with minimum delay and can be computed using a two-greedy strategy.
22Experimental Setup
E1 = f (x1, x2, …), E2 = g (x1, x2, …), E3 = h (x1, x2, …)
Conversion into Reed-Muller form
Logic synthesis
E1 = f1 (x1, x2, …), E2 = g1 (x1, x2, …), E3 = h1 (x1, x2, …)
CSE enumeration
(E11, E21, E31), (E12, E22, E32), …
Logic synthesis
Artisan Standard Cells, UMC CMOS Technology 0.13 µm
23Results
6-bit Adder
5-bit Adder
Multi-input Addition
4 X 3-bit Multiplier
24There Is Scope for Further Pruning
Area and Delay for all 6-bit adders generated by
our algorithm
Without any pruning it is impossible to handle
expressions with more than five variables.
25… but the Enumeration Algorithm Finds Interesting Non-trivial Relations!
4x4-bit multiplier better than our best manually-designed multiplier?!
Idea: Exploit complex dependencies among the partial-product bits of a multiplier
26Conclusions
- We have exploited a new form of division which is better than algebraic division and still less complex than Boolean division.
- Key to a better exploitation of Common Sub-Expressions (CSEs).
- We have introduced a CSE enumeration algorithm which discovers all architectures. Unfortunately, it is still very slow.
- More effective pruning strategies are required, especially based on the inferiority of some implementations still explored.
- Despite the runtime limitations, this exploration algorithm has already made it possible to study innovative architectures.
- Exploit dependency among input bits in the compressors of multipliers.
Many opportunities still lie untapped in the synthesis of arithmetic components.