Title: Projects
Projects Design
A decimal multiply-add primitive
Project 1
The multiplier-cell
proj1.mo
load (modl (lib/decimal/mul_cell,
lib/array/mul_add))
mul_cell.mo
declare (domain ((dec/, 0, 9), (hect/, 0, 99)),
  domain_fn (dec/, hect/))
MODEL mul_cell (X, Y, N, W, E, S),
  Z X dec/ Y hect/ N E,
  S remainder (Z, 10),
  W quotient (Z, 10),
ENDMOD
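Local: Z = X·Y + N + E (at most 9·9 + 9 + 9 = 99, hence the hect/ domain)
Inputs: X, Y, N, E
Outputs: S = remainder (Z, 10), W = quotient (Z, 10)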
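As a cross-check, here is a minimal Python sketch of the cell's arithmetic. It assumes the operators lost from the listing are a multiply and two adds (Z = X·Y + N + E); the function name and the (W, S) return order are illustrative choices, not part of the original modelling toolkit.

```python
def mul_cell(x, y, n, e):
    """Decimal multiply-add cell (sketch): returns (w, s) with 10*w + s == x*y + n + e."""
    for digit in (x, y, n, e):
        if not 0 <= digit <= 9:                  # dec/ domain: 0..9
            raise ValueError("inputs must be decimal digits")
    z = x * y + n + e                            # local Z: at most 9*9 + 9 + 9 = 99 (hect/ domain)
    return z // 10, z % 10                       # W = quotient (Z, 10), S = remainder (Z, 10)


# Exhaustive check: W and S always fit in one decimal digit each.
for x in range(10):
    for y in range(10):
        for n in range(10):
            for e in range(10):
                w, s = mul_cell(x, y, n, e)
                assert 0 <= w <= 9 and 0 <= s <= 9
                assert 10 * w + s == x * y + n + e
```

Because the worst case is exactly 99, two output digits always suffice, which is what allows these cells to be chained without overflow.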
The mul-add primitive
load (modl (lib/decimal/mul_cell,
lib/array/mul_add))
MODEL array/mul_add (N3, Y3, N2, Y2, N1, Y1,
  E1, X1, P1, E2, X2, P2, E3, X3, P3,
  P4, P5, P6),
  <model-body>,
ENDMOD
The mul-add primitive
MODEL array/mul_add (N3, Y3, N2, Y2, N1, Y1,
  E1, X1, P1, E2, X2, P2, E3, X3, P3,
  P4, P5, P6),
  mul_cell (X1, Y1, N1, W11, E1, P1),
  mul_cell (X1, Y2, N2, W12, W11, S12),
  mul_cell (X1, Y3, N3, W13, W12, S13),
  mul_cell (X2, Y1, S12, W21, E2, P2),
  mul_cell (X2, Y2, S13, W22, W21, S22),
  mul_cell (X2, Y3, W13, W23, W22, S23),
  mul_cell (X3, Y1, S22, W31, E3, P3),
  mul_cell (X3, Y2, S23, W32, W31, P4),
  mul_cell (X3, Y3, W23, P6, W32, P5),
ENDMOD
(argument order of each mul_cell: X, Y, N, W, E, S)
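The wiring above can be traced with a short Python sketch that calls one cell per mul_cell line, in the same order. The helper evaluate is hypothetical: it splits three-digit operands into digits and joins P6..P1 back into a number. Tracing the digit weights suggests the array computes X·Y + N + E, where N = N3N2N1 is aligned with Y and E = E3E2E1 with X; the top-model ties N and E to 0, so it yields the plain product.

```python
def mul_cell(x, y, n, e):
    """Decimal multiply-add cell: (W, S) = divmod(X*Y + N + E, 10)."""
    return divmod(x * y + n + e, 10)


def array_mul_add(n3, y3, n2, y2, n1, y1, e1, x1, e2, x2, e3, x3):
    """Transcription of MODEL array/mul_add; returns (P6, P5, P4, P3, P2, P1)."""
    # Row X1: each carry W leaves to the west and enters the next cell as E.
    w11, p1 = mul_cell(x1, y1, n1, e1)
    w12, s12 = mul_cell(x1, y2, n2, w11)
    w13, s13 = mul_cell(x1, y3, n3, w12)
    # Row X2: the north inputs are row X1's partial sums, shifted one digit.
    w21, p2 = mul_cell(x2, y1, s12, e2)
    w22, s22 = mul_cell(x2, y2, s13, w21)
    w23, s23 = mul_cell(x2, y3, w13, w22)
    # Row X3
    w31, p3 = mul_cell(x3, y1, s22, e3)
    w32, p4 = mul_cell(x3, y2, s23, w31)
    p6, p5 = mul_cell(x3, y3, w23, w32)
    return p6, p5, p4, p3, p2, p1


def evaluate(x, y, n=0, e=0):
    """Hypothetical helper: feed three-digit operands through the array."""
    def digits(v):                      # least significant digit first
        return v % 10, v // 10 % 10, v // 100 % 10
    x1, x2, x3 = digits(x)
    y1, y2, y3 = digits(y)
    n1, n2, n3 = digits(n)
    e1, e2, e3 = digits(e)
    p = array_mul_add(n3, y3, n2, y2, n1, y1, e1, x1, e2, x2, e3, x3)
    return int("".join(str(d) for d in p))


print(evaluate(123, 456))              # 56088, as in the top-model with N = E = 0
print(evaluate(999, 999, 999, 999))    # 999999: the largest result still fits in six digits
```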
The top-model
load (modl (lib/decimal/mul_cell,
  lib/array/mul_add))
MODEL top (),
  input (X3, X2, X1, '"\n", Y3, Y2, Y1),
  array/mul_add (0, Y3, 0, Y2, 0, Y1,
    0, X1, P1, 0, X2, P2, 0, X3, P3,
    P4, P5, P6),
  print ('"\n", X3, X2, X1, '" x ", Y3, Y2, Y1, '" ",
    P6, P5, P4, P3, P2, P1, '"\n\n"),
ENDMOD
FUNCTION proj1 (),
  sim (top ()),
ENDFUN
proj1.mo
proj1 is undefined, load (modl (proj1))
proj1 ()
simulation ()
Structured models
MODEL m (N1, N2, W1, W2, W3, E1, E2, E3, S1, S2),
  <model-body>,
ENDMOD
Note the grouping order of the arguments: north, west, east, south (N, W, E, S)
Feed-through wires
MODEL M (N1, N2, feed, W2, W3, feed, E2, feed, S1, S2),
  <model-body>,
ENDMOD
Model arguments are local variables
Horizontal abutment
abut (<new-model>, m m)
Internal nodes
The new MODEL created
MODEL M12 (N1, N2, N1, N2,
  W1, W2, W3, E1, E2, E3,
  S1, S2, S1, S2),
  <model-body m1>,
  <model-body m2>,
ENDMOD
S1 ? S1
Vertical abutment
abut (<new-model>, m
m)
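The bookkeeping behind abutment can be illustrated with a small Python sketch that works on port lists only. Everything here is an assumption made for illustration: the Interface class and the abut_horizontal / abut_vertical helpers are hypothetical, the first model is taken to be the west (or top) one, and ports on facing sides are assumed to be listed in matching order. Under those assumptions the horizontal case reproduces the argument list of M12, and the facing sides turn into internal nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical port list of a MODEL, grouped per side in N, W, E, S order."""
    north: tuple
    west: tuple
    east: tuple
    south: tuple


def abut_horizontal(left, right):
    """Place `left` west of `right` (assumption); left.east meets right.west."""
    if len(left.east) != len(right.west):
        raise ValueError("abutting sides must have the same number of ports")
    internal = list(zip(left.east, right.west))      # these wires become internal nodes
    new = Interface(left.north + right.north, left.west, right.east,
                    left.south + right.south)
    return new, internal


def abut_vertical(top, bottom):
    """Place `top` above `bottom` (assumption); top.south meets bottom.north."""
    if len(top.south) != len(bottom.north):
        raise ValueError("abutting sides must have the same number of ports")
    internal = list(zip(top.south, bottom.north))
    new = Interface(top.north, top.west + bottom.west, top.east + bottom.east,
                    bottom.south)
    return new, internal


m = Interface(("N1", "N2"), ("W1", "W2", "W3"), ("E1", "E2", "E3"), ("S1", "S2"))
m12, nodes = abut_horizontal(m, m)
print(m12)
print(nodes)     # [('E1', 'W1'), ('E2', 'W2'), ('E3', 'W3')]
```

A real tool would also have to rename the duplicated port names of the two copies; the sketch leaves them as they are.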
Reorientation of a MODEL
arrange (n, m)
3x3 u-binary multiplier-adder
Project 2
load (modl ('lib/array/abut/mul_cell,
  'lib/array/abut/mul_add))
MODEL top (),
  input (X3, X2, X1, '"\n", Y3, Y2, Y1, '"\n",
    N3, N2, N1, '"\n", E3, E2, E1, '"\n"),
  array/abut/mul_add (N3, Y3, N2, Y2, N1, Y1,
    E1, X1, P1, E2, X2, P2, E3, X3, P3,
    P6, P5, P4),
  output (P6, P5, P4, P3, P2, P1, '"\n"),
ENDMOD
FUNCTION proj2 (),
  sim (top ()),
ENDFUN
The multiplier-cell
MODEL full_add (a, b, cin, sum, cout), sum a
/bit/xor b xor cin, cout a /bit/and cin
/bit/or b /bit/and cin or a
/bit/and b, ENDMOD
full_add.mo
load (modl ('lib/full_add)) MODEL
array/abut/mul_cell (north, Y,
west, X, feed,
east, X, south,
feed, Y), prod X /bit/and Y, full_add
(north, prod, east, south, west), ENDMOD
mul_cell.mo
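A bit-level Python sketch of the two models above may help when checking them by hand. It assumes the lost assignment arrows mean sum = a xor b xor cin and cout = (a and cin) or (b and cin) or (a and b), and that full_add (north, prod, east, south, west) maps a = north, b = prod, cin = east, sum = south, cout = west, as the port order suggests. The Python names are illustrative.

```python
def full_add(a, b, cin):
    """Bit-level full adder as in full_add.mo: returns (sum, cout)."""
    s = a ^ b ^ cin
    cout = (a & cin) | (b & cin) | (a & b)
    return s, cout


def mul_cell(north, x, y, east):
    """Binary multiplier cell: prod = X and Y, then full_add (north, prod, east).

    Returns (south, west): the sum leaves to the south, the carry to the west.
    """
    prod = x & y
    south, west = full_add(north, prod, east)
    return south, west


# The binary cell satisfies S + 2*W == N + X*Y + E for every input combination.
for n in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            for e in (0, 1):
                south, west = mul_cell(n, x, y, e)
                assert south + 2 * west == n + x * y + e
```

This is the radix-2 counterpart of the decimal cell: the same relation W·r + S = X·Y + N + E holds with r = 2.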
Multiplier-adder abutment
MODEL west (, , feed, X, feed, ),
  false,
ENDMOD
MODEL east (, gnd, X, P, X, P, ),
  gnd 0,
ENDMOD
MODEL south (P, Y, , , P),
  false,
ENDMOD
abut (array/abut/mul_add, ((3 (west (3 array/abut/mul_cell)))
  (corner (3 south))))
The complete mul_add primitive
abut (array/abut/mul_add, ((3
(west (3 array/abut/mul_cell)))
(corner (3 south))))
Timing analysis array-multiplier
Project 3
The model mul_add works for any radix!
load (modl ('lib/timing/mul_cell,
'lib/array/mul_add)) FUNCTION proj3 (),
timing (array/mul ('X3, 'X2, 'X1,
'Y3, 'Y2, 'Y1, P6,
P5, P4, P3, P2, P1)), ENDFUN
proj3.mo
The file mul_add.mo contains a mul_add primitive as well as a multiplier!
Inputs are symbolic constants!
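The claim that mul_add works for any radix can be checked with a generalised Python sketch. It keeps the row-by-row wiring of array/mul_add (carries move along a row through the E inputs; the remaining sums plus the last carry become the N inputs of the next row) but takes the radix and the operand widths as parameters; mul_add, to_digits, from_digits and check are hypothetical names. The only property the cell needs is X·Y + N + E <= (r - 1)^2 + 2(r - 1) = r^2 - 1, so the carry W is always a single radix-r digit.

```python
import random


def mul_cell(x, y, n, e, radix):
    """Radix-r multiply-add cell: returns (W, S) = divmod(X*Y + N + E, r)."""
    return divmod(x * y + n + e, radix)


def mul_add(x, y, n, e, radix):
    """Array multiply-add with the array/mul_add wiring, generalised to any size.

    x, e: digit lists along the X (row) direction, least significant first
    y, n: digit lists along the Y (column) direction, least significant first
    Returns len(x) + len(y) product digits, least significant first.
    """
    north = list(n)
    product = []
    for xi, ei in zip(x, e):            # one row of cells per digit of X
        carry, south = ei, []
        for yj, nj in zip(y, north):    # the carry moves from cell to cell along the row
            carry, s = mul_cell(xi, yj, nj, carry, radix)
            south.append(s)
        product.append(south[0])        # the first cell of each row yields a product digit
        north = south[1:] + [carry]     # remaining sums and the last carry feed the next row
    return product + north


def to_digits(value, radix, width):
    return [value // radix**k % radix for k in range(width)]


def from_digits(digits, radix):
    return sum(d * radix**k for k, d in enumerate(digits))


def check(radix, rows, cols, trials=200, seed=1):
    """Compare the array against X*Y + N + E on random operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        X, E = rng.randrange(radix**rows), rng.randrange(radix**rows)
        Y, N = rng.randrange(radix**cols), rng.randrange(radix**cols)
        p = mul_add(to_digits(X, radix, rows), to_digits(Y, radix, cols),
                    to_digits(N, radix, cols), to_digits(E, radix, rows), radix)
        if from_digits(p, radix) != X * Y + N + E:
            return False
    return True


print(check(2, 4, 5), check(10, 3, 3), check(16, 2, 3))
```

The result always fits in rows + cols digits, since (r^a - 1)(r^b - 1) + (r^b - 1) + (r^a - 1) = r^(a+b) - 1.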
mul_cell and full-adder
load (modl ('lib/timing/full_add))
delay ('/bit/and) 0n5
MODEL mul_cell (X, Y, north, west, east, south),
  prod X /bit/and Y,
  full_add (north, prod, east, south, west),
ENDMOD
(port order of full_add: a, b, cin, sum, cout)
mul_cell.mo
Timing model for the full-adder
delay ('Fsum) 1n7
delay ('Fcarry) 1n07
MODEL full_add (a, b, cin, sum, cout),
  sum Fsum (a, b, cin),
  cout Fcarry (a, b, cin),
ENDMOD
full_add.mo
Dedicated delay model
Worst-case signal propagation
[Figure: worst-case signal path through the array, annotated with the tand, tcarry and tsum delays along the path]
delay = $t_{and} + t_{carry} + n\,t_{sum} + m\,t_{carry}$
Critical expressions generated
signal              expression                       time valid
mul/mul_cell/prod   'X1 bit/and 'Y1                  500p
mul/W11             Fcarry (, mul/mul_cell/prod)     1n57
mul/W12             Fcarry (, mul/W11)               2n64
mul/S13             Fsum (, mul/W12)                 4n34
mul/S12             Fsum (, mul/W11)                 3n27
mul/W21             Fcarry (, mul/S12)               4n34
mul/W22             Fcarry (, mul/S13, mul/W21)      5n41
mul/S23             Fsum (, mul/W22)                 7n11
mul/S22             Fsum (, mul/S13, mul/W21)        6n04
mul/W31             Fcarry (, mul/S22)               7n11
mul/W32             Fcarry (, mul/S23, mul/W31)      8n18
P5                  Fsum (, mul/W32)                 9n88
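The listed times can be reproduced with a simple static timing sketch in Python. It assumes each output becomes valid at the latest arrival among the cell's inputs plus the cell delay (0n5 for the AND, 1n7 for Fsum, 1n07 for Fcarry, as declared in the timing models), with all primary inputs valid at t = 0; times are kept in integer picoseconds to avoid rounding. The netlist is transcribed from array/mul_add; the Python names are illustrative.

```python
# Delays in picoseconds, from the timing models: /bit/and 0n5, Fsum 1n7, Fcarry 1n07.
T_AND, T_SUM, T_CARRY = 500, 1700, 1070

# One (X, Y, N, W, E, S) tuple per mul_cell, transcribed from MODEL array/mul_add.
# N and E inputs not driven by another cell count as primary inputs (valid at t = 0).
CELLS = [
    ("X1", "Y1", "N1", "W11", "E1", "P1"),
    ("X1", "Y2", "N2", "W12", "W11", "S12"),
    ("X1", "Y3", "N3", "W13", "W12", "S13"),
    ("X2", "Y1", "S12", "W21", "E2", "P2"),
    ("X2", "Y2", "S13", "W22", "W21", "S22"),
    ("X2", "Y3", "W13", "W23", "W22", "S23"),
    ("X3", "Y1", "S22", "W31", "E3", "P3"),
    ("X3", "Y2", "S23", "W32", "W31", "P4"),
    ("X3", "Y3", "W23", "P6", "W32", "P5"),
]


def arrival_times(cells):
    """Latest arrival time (ps) of every cell output; cells must be in dependency order."""
    t = {}
    for x, y, north, west, east, south in cells:
        prod = max(t.get(x, 0), t.get(y, 0)) + T_AND    # prod X /bit/and Y
        ready = max(prod, t.get(north, 0), t.get(east, 0))
        t[south] = ready + T_SUM                        # sum leaves to the south
        t[west] = ready + T_CARRY                       # carry leaves to the west
    return t


times = arrival_times(CELLS)
latest = max(times, key=times.get)
print(latest, times[latest])      # P5 9880, i.e. 9n88
```

The endpoint and time match the last line of the table; the underlying path (prod, W11, W12, S13, W22, S23, W32, P5) has one AND, four carry stages and three sum stages, i.e. 500 + 1070 + 3·1700 + 3·1070 = 9880 ps.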
The critical path
delay = $t_{and} + t_{carry} + n\,t_{sum} + m\,t_{carry}$
Timing modified array-multiplier
Project 4
East inputs are moved to the north side
Carry signals run diagonally
Timing modified array-multiplier
load (modl ('lib/modified/mul_add,
'lib/timing/mul_cell)) FUNCTION proj4 (),
timing (modified/mul ('X3,'X2,'X1,
'Y3,'Y2,'Y1,
P6, P5, P4, P3, P2, P1)), ENDFUN
The modified array-multiplier
MODEL modified/mul (X3, X2, X1,
  Y3, Y2, Y1, P6, P5, P4, P3, P2, P1),
  mul_cell (X1, Y1, 0, W11, 0, P1),
  mul_cell (X1, Y2, 0, W12, 0, S12),
  mul_cell (X1, Y3, 0, W13, 0, S13),
  mul_cell (X2, Y1, S12, W21, W11, P2),
  mul_cell (X2, Y2, S13, W22, W12, S22),
  mul_cell (X2, Y3, W13, W23, W22, S23),
  mul_cell (X3, Y1, S22, W31, W21, P3),
  mul_cell (X3, Y2, S23, W32, W31, P4),
  mul_cell (X3, Y3, W23, P6, W32, P5),
ENDMOD
(argument order of each mul_cell: X, Y, N, W, E, S)
Modified entries
Critical expressions generated
signal                expression                         time valid
mul/mul_cell_2/prod   'X1 bit/and 'Y3                    500p
mul/S13               Fsum (, mul/mul_cell_2/prod)       2n2
mul/W22               Fcarry (, mul/S13)                 3n27
mul/S23               Fsum (, mul/W22)                   4n97
mul/W32               Fcarry (, mul/S23)                 6n04
P5                    Fsum (, mul/W32)                   7n74
The critical path
delay = $t_{and} + n\,t_{sum} + (m - 1)\,t_{carry}$
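A static timing sketch in Python, applied to the cells of modified/mul, reproduces the 7n74 path and the formula above for n = m = 3. It assumes each output is valid at the latest input arrival plus the cell delay, with primary inputs and constants valid at t = 0 and delays in integer picoseconds; the Python names are illustrative.

```python
# Delays in picoseconds, from the timing models: /bit/and 0n5, Fsum 1n7, Fcarry 1n07.
T_AND, T_SUM, T_CARRY = 500, 1700, 1070

# One (X, Y, N, W, E, S) tuple per mul_cell, transcribed from MODEL modified/mul.
# "0" stands for the constant inputs; like all primary inputs it is valid at t = 0.
CELLS = [
    ("X1", "Y1", "0", "W11", "0", "P1"),
    ("X1", "Y2", "0", "W12", "0", "S12"),
    ("X1", "Y3", "0", "W13", "0", "S13"),
    ("X2", "Y1", "S12", "W21", "W11", "P2"),
    ("X2", "Y2", "S13", "W22", "W12", "S22"),
    ("X2", "Y3", "W13", "W23", "W22", "S23"),
    ("X3", "Y1", "S22", "W31", "W21", "P3"),
    ("X3", "Y2", "S23", "W32", "W31", "P4"),
    ("X3", "Y3", "W23", "P6", "W32", "P5"),
]


def arrival_times(cells):
    """Latest arrival time (ps) of every cell output; cells must be in dependency order."""
    t = {}
    for x, y, north, west, east, south in cells:
        prod = max(t.get(x, 0), t.get(y, 0)) + T_AND    # prod X /bit/and Y
        ready = max(prod, t.get(north, 0), t.get(east, 0))
        t[south] = ready + T_SUM
        t[west] = ready + T_CARRY
    return t


times = arrival_times(CELLS)
latest = max(times, key=times.get)
n, m = 3, 3
print(latest, times[latest], T_AND + n * T_SUM + (m - 1) * T_CARRY)   # P5 7740 7740
```

Here the path (prod, S13, W22, S23, W32, P5) has one AND, three sum stages and two carry stages, compared with four carry stages in the original array.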
Timing carry-save mul_add
Project 5
Extra north inputs on the west side
East inputs are moved to the north side
All carry signals run diagonally
Extra east input
Final adder
The carry_save-multiplier
MODEL carry_save/mul_add (N6, N5, N4, N3, Y3, N2, Y2, N1, Y1,
  E1, X1, P1, E2, X2, P2, E3, X3, P3, E4,
  P7, P6, P5, P4),
  mul_cell (X1, Y1, N1, W11, E1, P1),
  mul_cell (X1, Y2, N2, W12, E2, S12),
  mul_cell (X1, Y3, N3, W13, E3, S13),
  mul_cell (X2, Y1, S12, W21, W11, P2),
  mul_cell (X2, Y2, S13, W22, W12, S22),
  mul_cell (X2, Y3, N4, W23, W13, S23),
  mul_cell (X3, Y1, S22, W31, W21, P3),
  mul_cell (X3, Y2, S23, W32, W22, S32),
  mul_cell (X3, Y3, N5, W33, W23, S33),
  full_add (S32, E4, W31, P4, W41),
  full_add (S33, W41, W32, P5, W42),
  full_add (N6, W42, W33, P6, P7),
ENDMOD
(argument order of each mul_cell: X, Y, N, W, E, S; of each full_add: a, b, cin, sum, cout)
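A bit-level Python transcription of carry_save/mul_add makes it easy to confirm what the extra N and E inputs do. Following the bit weights through the wiring suggests that N6..N1 act as a 6-bit addend and E4..E1 as a 4-bit addend, so P7..P1 = X·Y + N + E; the sketch checks this exhaustively. It assumes the binary mul_cell of the earlier projects (prod = X and Y feeding full_add (north, prod, east, south, west)); the function names are hypothetical.

```python
def full_add(a, b, cin):
    """Bit-level full adder: returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def mul_cell(x, y, north, east):
    """Binary cell: prod = X and Y, full_add (north, prod, east) -> (south, west)."""
    return full_add(north, x & y, east)


def carry_save_mul_add(x, y, n, e):
    """Transcription of MODEL carry_save/mul_add; all bit lists are least significant first.

    x, y: X1..X3 and Y1..Y3; n: N1..N6; e: E1..E4. Returns P1..P7.
    """
    x1, x2, x3 = x
    y1, y2, y3 = y
    n1, n2, n3, n4, n5, n6 = n
    e1, e2, e3, e4 = e
    p1, w11 = mul_cell(x1, y1, n1, e1)
    s12, w12 = mul_cell(x1, y2, n2, e2)
    s13, w13 = mul_cell(x1, y3, n3, e3)
    # Carries are saved: they enter the next row diagonally instead of rippling along a row.
    p2, w21 = mul_cell(x2, y1, s12, w11)
    s22, w22 = mul_cell(x2, y2, s13, w12)
    s23, w23 = mul_cell(x2, y3, n4, w13)
    p3, w31 = mul_cell(x3, y1, s22, w21)
    s32, w32 = mul_cell(x3, y2, s23, w22)
    s33, w33 = mul_cell(x3, y3, n5, w23)
    # Final ripple-carry adder
    p4, w41 = full_add(s32, e4, w31)
    p5, w42 = full_add(s33, w41, w32)
    p6, p7 = full_add(n6, w42, w33)
    return [p1, p2, p3, p4, p5, p6, p7]


def bits(v, width):
    return [(v >> k) & 1 for k in range(width)]


def value(bit_list):
    return sum(b << k for k, b in enumerate(bit_list))


def verify():
    """Exhaustive check of P = X*Y + N + E over all 2**16 input combinations."""
    return all(
        value(carry_save_mul_add(bits(X, 3), bits(Y, 3), bits(N, 6), bits(E, 4)))
        == X * Y + N + E
        for X in range(8) for Y in range(8) for N in range(64) for E in range(16)
    )


print(verify())   # True
```

The largest possible result, 7·7 + 63 + 15 = 127, still fits in the seven outputs P1..P7.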
Critical expressions generated
signal                expression                         time valid
mul/mul_cell_2/prod   'X1 bit/and 'Y3                    500p
mul/W13               Fcarry (, mul/mul_cell_2/prod)     1n57
mul/S23               Fsum (, mul/W13)                   3n27
mul/S13               Fsum (, mul/mul_cell_2/prod)       2n2
mul/W22               Fcarry (, mul/S13)                 3n27
mul/S32               Fsum (, mul/S23, mul/W22)          4n97
mul/W41               Fcarry (, mul/S32)                 6n04
mul/W42               Fcarry (, mul/W41)                 7n11
P6                    Fsum (, mul/W42)                   8n81
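Assuming a simple static timing model (each output valid at the latest input arrival plus the cell delay, all primary inputs at t = 0, delays in integer picoseconds), a Python sketch of carry_save/mul_add that also covers the three full_add cells of the final adder reproduces the 8n81 path listed above. The netlist tables are transcribed from the model; the helper names are illustrative.

```python
T_AND, T_SUM, T_CARRY = 500, 1700, 1070   # ps: /bit/and 0n5, Fsum 1n7, Fcarry 1n07

# (X, Y, N, W, E, S) per mul_cell, transcribed from MODEL carry_save/mul_add
MUL_CELLS = [
    ("X1", "Y1", "N1", "W11", "E1", "P1"),
    ("X1", "Y2", "N2", "W12", "E2", "S12"),
    ("X1", "Y3", "N3", "W13", "E3", "S13"),
    ("X2", "Y1", "S12", "W21", "W11", "P2"),
    ("X2", "Y2", "S13", "W22", "W12", "S22"),
    ("X2", "Y3", "N4", "W23", "W13", "S23"),
    ("X3", "Y1", "S22", "W31", "W21", "P3"),
    ("X3", "Y2", "S23", "W32", "W22", "S32"),
    ("X3", "Y3", "N5", "W33", "W23", "S33"),
]
# (a, b, cin, sum, cout) per full_add of the final adder
FULL_ADDS = [
    ("S32", "E4", "W31", "P4", "W41"),
    ("S33", "W41", "W32", "P5", "W42"),
    ("N6", "W42", "W33", "P6", "P7"),
]


def arrival_times():
    """Worst-case arrival time (ps) of every signal; primary inputs are valid at t = 0."""
    t = {}

    def full_add(a, b, cin, s, cout):
        ready = max(t.get(a, 0), t.get(b, 0), t.get(cin, 0))
        t[s] = ready + T_SUM        # Fsum
        t[cout] = ready + T_CARRY   # Fcarry

    for x, y, north, west, east, south in MUL_CELLS:
        prod = "prod_" + x + y                      # internal AND output of this cell
        t[prod] = max(t.get(x, 0), t.get(y, 0)) + T_AND
        full_add(north, prod, east, south, west)
    for a, b, cin, s, cout in FULL_ADDS:
        full_add(a, b, cin, s, cout)
    return t


times = arrival_times()
latest = max(times, key=times.get)
print(latest, times[latest])          # P6 8810, i.e. 8n81
```

The path (prod, W13, S23, S32, W41, W42, P6) contains one AND, three carry stages and three sum stages.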
The critical path
delay tand (n 1) tsum tcarry (m -
1) tcarry
The carry-save multiplier
The top-row cells can be replaced by AND gates
delay tand n tsum tcarry (m - 1)
tcarry
Prove algebraic equivalence
Project 6
and
full_adder
half_adder
Load an algebraic package
ld (modl ('bdg)), algebraic (), input (X3, X2,
X1, '"\n", Y3, Y2, Y1, '"\n"),
Read algebraic input-values
Print algebraic results
Works on the /bit/ operators and, nand, or, nor, xor, etc.
output (P6, P5, P4, P3, P2, P1, '"\n", P6
/bit/ Q6, P5 /bit/ Q5, P4 /bit/ Q4,
P3 /bit/ Q3, P2 /bit/ Q2, P1 /bit/
Q1), symbolic (),
load (modl ('lib/array/abut/mul_cell,
  'lib/array/abut/mul_add,
  'lib/binary/mul_cell,
  'lib/carry_save/mul_add))
MODEL top (),
  ld (modl ('bdg)),
  algebraic (),
  input (X3, X2, X1, '"\n", Y3, Y2, Y1, '"\n"),
  array/abut/mul (Y3, Y2, Y1,
    X1, P1, X2, P2, X3, P3,
    P6, P5, P4),
  carry_save/mul (X3, X2, X1,
    Y3, Y2, Y1,
    Q6, Q5, Q4, Q3, Q2, Q1),
  output (P6, P5, P4, P3, P2, P1, '"\n",
    P6 /bit/ Q6, P5 /bit/ Q5, P4 /bit/ Q4,
    P3 /bit/ Q3, P2 /bit/ Q2, P1 /bit/ Q1),
  symbolic (),
ENDMOD
FUNCTION proj5_v (),
  sim (top ()),
ENDFUN
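The algebraic package proves the equivalence of P and Q symbolically. For operands this small the same equivalence can also be checked by brute force, which is what the Python sketch below does instead: it models the array multiplier (the array/mul_add wiring with binary cells) and the carry-save multiplier (the carry_save/mul_add wiring), both with all N and E inputs tied to 0, and compares P1..P6 with Q1..Q6 for all 64 operand pairs. This is exhaustive simulation, not the symbolic method of the package, and the names are hypothetical.

```python
def full_add(a, b, cin):
    """Returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def mul_cell(x, y, north, east):
    """Binary cell: prod = X and Y, full_add (north, prod, east) -> (south, west)."""
    return full_add(north, x & y, east)


def array_mul(x1, x2, x3, y1, y2, y3):
    """array/mul_add wiring with binary cells and N = E = 0; returns P1..P6."""
    p1, w11 = mul_cell(x1, y1, 0, 0)
    s12, w12 = mul_cell(x1, y2, 0, w11)
    s13, w13 = mul_cell(x1, y3, 0, w12)
    p2, w21 = mul_cell(x2, y1, s12, 0)
    s22, w22 = mul_cell(x2, y2, s13, w21)
    s23, w23 = mul_cell(x2, y3, w13, w22)
    p3, w31 = mul_cell(x3, y1, s22, 0)
    p4, w32 = mul_cell(x3, y2, s23, w31)
    p5, p6 = mul_cell(x3, y3, w23, w32)
    return (p1, p2, p3, p4, p5, p6)


def carry_save_mul(x1, x2, x3, y1, y2, y3):
    """carry_save/mul_add wiring with all N and E tied to 0; returns Q1..Q6."""
    q1, w11 = mul_cell(x1, y1, 0, 0)
    s12, w12 = mul_cell(x1, y2, 0, 0)
    s13, w13 = mul_cell(x1, y3, 0, 0)
    q2, w21 = mul_cell(x2, y1, s12, w11)
    s22, w22 = mul_cell(x2, y2, s13, w12)
    s23, w23 = mul_cell(x2, y3, 0, w13)
    q3, w31 = mul_cell(x3, y1, s22, w21)
    s32, w32 = mul_cell(x3, y2, s23, w22)
    s33, w33 = mul_cell(x3, y3, 0, w23)
    q4, w41 = full_add(s32, 0, w31)
    q5, w42 = full_add(s33, w41, w32)
    q6, _q7 = full_add(0, w42, w33)   # Q7 is always 0 here, since X*Y <= 49
    return (q1, q2, q3, q4, q5, q6)


def mismatches():
    """Operand pairs where P differs from Q or from X*Y (expected: none)."""
    bad = []
    for X in range(8):
        for Y in range(8):
            x = [(X >> k) & 1 for k in range(3)]   # x1, x2, x3
            y = [(Y >> k) & 1 for k in range(3)]   # y1, y2, y3
            p = array_mul(*x, *y)
            q = carry_save_mul(*x, *y)
            if p != q or sum(b << k for k, b in enumerate(p)) != X * Y:
                bad.append((X, Y))
    return bad


print(mismatches())   # []
```

Exhaustive simulation grows as 2^(2n) with the operand width n, which is why a symbolic package is the better tool for larger multipliers.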
Review carry-save multiplier
- The carry-save multiplier can be partitioned into an AND array, an n:2 reductor and a final adder
Projects Design
- The carry-save multiplier
Critical path carry-save mul
signal                expression                         time valid
mul/mul_cell_4/prod   y6 bit/and 'x1                     500p
mul/s15               Fsum (, mul/mul_cell_4/prod)       2n2
mul/s24               Fsum (, mul/s15)                   3n9
mul/s33               Fsum (, mul/s24)                   5n6
mul/s42               Fsum (, mul/s33)                   7n3
mul/w51               Fcarry (, mul/s42)                 8n37
mul/w52               Fcarry (, mul/w51)                 9n44
mul/w53               Fcarry (, mul/w52)                 10n51
mul/w54               Fcarry (, mul/w53)                 11n58
mul/w55               Fcarry (, mul/w54)                 12n65
mul/w56               Fcarry (, mul/w55)                 13n72
mul/w57               Fcarry (, mul/w56)                 14n79
mul/w58               Fcarry (, mul/w57)                 15n86
P13                   Fsum (, mul/w58)                   17n56
Critical path carry-save mul
- One and, n n - 1 sum stages, one sum stage,
and m - 1 carry stages in the final adder
A fast carry-path
A fast carry path ATLAS adder
- The ATLAS full-adder uses a switch to either propagate the cin signal to the cout output or to generate a 0 or a 1 value
First use: relay-based computers
Carry signal is propagated with the speed of light
ATLAS full_adder
A fast carry path ATLAS adder
Propagate Cin
Generate 1
Generate 0
tcy 0n2 @ 1µm 5V
p a /bit/xor b
sum p /bit/xor cin
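The switch behaviour can be modelled in a few lines of Python. This is a behavioural sketch only, assuming the switch passes cin when p = 1 and otherwise forwards a (which equals b when p = 0, so the stage generates 0 or 1); atlas_full_add and ripple_add are hypothetical names.

```python
def atlas_full_add(a, b, cin):
    """Full adder with a propagate switch (behavioural sketch): returns (sum, cout)."""
    p = a ^ b                  # p = a /bit/xor b
    s = p ^ cin                # sum = p /bit/xor cin
    cout = cin if p else a     # p = 1: propagate cin; p = 0: a = b, generate 0 or 1
    return s, cout


def ripple_add(x, y, width):
    """Ripple-carry adder built from the switched full adders above."""
    carry, total = 0, 0
    for k in range(width):
        s, carry = atlas_full_add((x >> k) & 1, (y >> k) & 1, carry)
        total |= s << k
    return total | (carry << width)


print(ripple_add(13, 11, 4))   # 24
```

In hardware the carry only passes through a chain of closed switches, which is why its delay per stage (tcy) can be much smaller than a gate-level carry delay.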
The /bit/xor /bit/nxor functions
Timing of the carry-path
Approximate formulas
Insertion of inverters
- Use inverters to avoid excessive delay
- Speed-up of the carry path by a factor of two
The inverted full-adder
- Application of negative logic to all I/O signals
of a full-adder gives a full-adder
- The xor circuits which calculate p and the sum
should be replaced by nxor circuits
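A quick exhaustive Python check of the first statement: inverting all three inputs and both outputs of a full adder gives back the same full adder, because three inversions flip the parity (sum) and the majority function (cout) is self-dual. The function names are illustrative.

```python
def full_add(a, b, cin):
    """Positive-logic full adder: returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def inverted_full_add(a, b, cin):
    """The same circuit used in negative logic: every input and output is inverted."""
    s, cout = full_add(1 - a, 1 - b, 1 - cin)
    return 1 - s, 1 - cout


# Self-duality: the negative-logic view is again a full adder.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            assert inverted_full_add(a, b, cin) == full_add(a, b, cin)
```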
The carry-select adder
- sum CASE cin 0 a /bits_5/
b, cin 1 a /bits_5/ b 1,
ENDCASE,
cin
cout
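The CASE above can be sketched in Python as a carry-select adder. Each section precomputes both a + b and a + b + 1, and the incoming carry only selects one of them. This assumes /bits_5/ denotes 5-bit sections; carry_select_add is a hypothetical name, and its `sections` list (least significant section first) is a hypothetical parameter that also allows unequal section lengths.

```python
def carry_select_add(a, b, sections):
    """Carry-select addition (sketch); operands must fit in sum(sections) bits.

    `sections` lists the section widths, least significant section first.
    """
    result, cin, lo = 0, 0, 0
    for width in sections:
        mask = (1 << width) - 1
        sa, sb = (a >> lo) & mask, (b >> lo) & mask
        sum0 = sa + sb                    # CASE cin = 0: a + b
        sum1 = sa + sb + 1                # CASE cin = 1: a + b + 1
        chosen = sum1 if cin else sum0    # cin only drives the selection
        result |= (chosen & mask) << lo
        cin = chosen >> width             # carry out of this section
        lo += width
    return result | (cin << lo)


print(carry_select_add(123456, 654321, [5, 5, 5, 5]))   # 777777
```

Because both candidates are ready early, the critical path through later sections is one multiplexer per section rather than a full ripple.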
Improved physical design
High level model
- Bit-level word-level description
Optimal partitioning
- An m-bit wide adder should be partitioned into sections of approximate length √m
- The length of the sections should be rebalanced such that all delays are equal
When tcin-cout ≈ 0n3
Example final-adder partitioning
- m = 32 bits @ 1µm 5V, √m ≈ 6
partitioning          bits   delay                   x 1/8
(6,6,5,5,4,4)         30     6x0n6 + 0n3 = 3n9       487p
(6,6,6,5,5,4)         32     7x0n6 = 4n2             525p
(7,6,6,5,5,4)         33     7x0n6 + 0n3 = 4n5       562p
- m = 64 bits, √m ≈ 8
partitioning          bits   delay                   x 1/8
(9,9,8,8,7,7,6,6)     60     9x0n6, 0n3 → 5n4        675p
(10,9,9,8,8,7,7,6)    64     10x0n6 = 6n0            750p
- x 1/8: delay scaled to 0.25µm 2.5V
Composition of larger multipliers
Composition of larger multipliers
- Mixed Architecture Wallace-Tree
Composition of larger multipliers
The Wallace-Tree multiplier
- The full-adder can be used as a 3:2 reductor
- A 9:2 reductor can be built using 5 full-adders
Building a Wallace tree
- An mxn multiplier uses 3:2 reductors in the carry-save tree
- The sum signals are kept at the current bit position
- The carry signals are propagated to the next bit position
- The tree is completed one level at a time
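The level-by-level procedure above can be sketched in Python by tracking only the number of bits (the height) of each column. This is a simplified Wallace-style reduction that uses full adders only (no half adders), assuming every column with three or more bits uses as many 3:2 reductors as it can at each level; wallace_levels is a hypothetical name. Multiplying the number of levels by tsum gives a rough estimate of the tree delay.

```python
def wallace_levels(n, m):
    """Reduce the column heights of an n x m AND array with 3:2 reductors.

    Returns (levels, final_heights); every final column holds at most two bits,
    which a final carry-propagate adder then adds.
    """
    heights = [0] * (n + m)
    for i in range(n):
        for j in range(m):
            heights[i + j] += 1               # partial product x_i AND y_j has weight 2**(i+j)
    levels = 0
    while max(heights) > 2:
        full_adders = [h // 3 for h in heights]          # 3:2 reductors per column
        # 3 bits in -> 1 sum kept in the column, 1 carry moved to the next column
        new = [h - 2 * f for h, f in zip(heights, full_adders)] + [0]
        for k, f in enumerate(full_adders):
            new[k + 1] += f
        if new[-1] == 0:
            new.pop()
        heights = new
        levels += 1
    return levels, heights


print(wallace_levels(3, 3)[0])    # 2
levels, heights = wallace_levels(16, 16)
print(levels, heights)
```

Every full adder turns three bits into two, so the total bit count drops at each level and the loop always terminates.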
A 16x16 Wallace-tree
Delay: 6 tsum = 10n2 @ 1µm 5V, 1n125 @ 0.25µm 2.5V
level7
Review
- It is almost impossible to make a layout of a Wallace tree multiplier using full-custom layout tools
- The wiring within the Wallace tree is long when compared to the wire length in an array or a carry-save multiplier
- There are marginally fewer ripples in a Wallace tree, provided that it is well balanced
Cascadable multipliers
- It has been shown that an unsigned multiplier can be constructed from smaller ones, provided that these are unsigned
- Each embedded multiplier that should also be able to operate as a stand-alone signed multiplier should be equipped with a circuit for sign extension, using either sign-magnitude or two's-complement arithmetic
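The unsigned construction rests on splitting each operand into a high and a low half: X·Y = XH·YH·2^(2k) + (XH·YL + XL·YH)·2^k + XL·YL. A Python sketch, with the k-bit multiplier passed in as a function (split_multiply and make_unsigned_multiplier are hypothetical names). The low halves XL and YL are always unsigned, which suggests why the embedded blocks must support unsigned operation.

```python
def make_unsigned_multiplier(k):
    """Hypothetical k x k unsigned multiplier block (rejects operands outside 0..2**k - 1)."""
    def mul(a, b):
        if not (0 <= a < (1 << k) and 0 <= b < (1 << k)):
            raise ValueError(f"operands must be unsigned {k}-bit values")
        return a * b
    return mul


def split_multiply(x, y, k, small_mul):
    """Compose a 2k x 2k unsigned product from four k x k unsigned products."""
    mask = (1 << k) - 1
    xh, xl = x >> k, x & mask
    yh, yl = y >> k, y & mask
    return ((small_mul(xh, yh) << (2 * k))
            + ((small_mul(xh, yl) + small_mul(xl, yh)) << k)
            + small_mul(xl, yl))


mul8 = make_unsigned_multiplier(8)


def mul16(a, b):
    """A 16 x 16 multiplier built from 8 x 8 blocks."""
    return split_multiply(a, b, 8, mul8)


print(split_multiply(0xFFFFFFFF, 0xFFFFFFFF, 16, mul16) == 0xFFFFFFFF ** 2)   # True
```

Applying the split recursively, as mul16 does, builds a 32 x 32 product from sixteen 8 x 8 blocks.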
Cascadable multipliers
The splitting interface
MMX (INTEL) VIS (SUN)
- SUN was the first to introduce a splittable multiplier, consisting of 8 16x16 multipliers, which can be used to perform 4 16x16 multiply-add operations or one 64x64 floating point multiply
- INTEL uses a splittable multiplier consisting of 4 16x16 multipliers, which can be used to perform 4 16x16 multiply-add operations or a 64x32 (64x64) floating point multiply in one (two) clock cycle(s)
Load Store operations
- Operands are normally de-normalized such that the mantissa is scaled in multiples of 2^16 instead of 2^1
- A FP-(MMX)-load performs (discards) the de-normalization
- The FP-(MMX)-store performs (discards) the FP-normalization
Sequential Circuits
The cut-theorem
Provided that the initial conditions are
compatible
The fully pipelined multiplier
Multiple D-FF in X,N
Multiple D-FF in P
Consider all iso-chronous wires
Folding
- Map data back from one line of iso-chronity to an
earlier one, use multiplexers for control
A multiplier can be realized with just n
full_adders without speed penalty
Folding
- Can be used to derive equivalent circuits
- E.g. parallel multiplier → serial-parallel multiplier, based on the shift-add algorithm
The serial-parallel multiplier can be realized with n full_adders. Its delay comes from the carry-path of an n-bit wide adder.
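The shift-add algorithm behind the serial-parallel multiplier can be sketched in Python at word level. It assumes b is fed in one bit per clock cycle while a is held in parallel, so the only arithmetic per cycle is one n-bit addition (the n full_adders); serial_parallel_multiply is a hypothetical name, and the loop models clock cycles, not gate timing.

```python
def serial_parallel_multiply(a, b, n):
    """Shift-add multiplication of two n-bit unsigned numbers.

    Each cycle adds (a AND b_i) to the accumulator; the accumulator then shifts
    right by one, and the bit shifted out is a product bit (least significant first).
    """
    acc, product = 0, 0
    for i in range(n):                    # one clock cycle per bit of b
        bi = (b >> i) & 1
        acc += a if bi else 0             # the n-bit adder: the only arithmetic hardware
        product |= (acc & 1) << i         # lowest bit leaves serially
        acc >>= 1
    return product | (acc << n)           # the accumulator now holds the high half


print(serial_parallel_multiply(13, 11, 4))   # 143
```

The accumulator never exceeds n bits after the shift, so an n-bit adder with a carry-out suffices, which matches the claim that n full_adders are enough.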