Title: Chapter 12 Arithmetic Building Blocks
1Chapter 12 Arithmetic Building Blocks
- Boonchuay Supmonchai
- Integrated Design Application Research (IDAR) Laboratory
- August 20, 2004; Revised July 5, 2005
2Goals of This Chapter
- Designing for Performance, area, or power
- Adders
- Multipliers
- Shifters
- Logic and System Optimizations for datapath modules
- Power-Delay trade-offs in datapaths
3Review A Generic Processor
4Bit-Sliced Architecture
[Figure: n-bit Data In feeds bit slices Bit 0, Bit 1, ..., Bit n-2, Bit n-1 under a common Control, producing n-bit Data Out]
- Modular
- Easy to design and verify
- Easy to expand
5Example Itanium Bit-Sliced Design
6Example Itanium Integer Datapath
Itanium has 6 integer execution units (ALU)
7One-Bit Binary Full Adder (FA)
$S = A \oplus B \oplus C_{in}$
$C_{out} = AB + A C_{in} + B C_{in}$
- A VERY common operation, so worth spending some time trying to optimize
- Often in the critical path, so need to look at both logic-level and circuit-level optimizations
8Propagate, Generate, and Delete (Kill)
- Define 3 new variables which depend ONLY on A and B
Generate: $G = AB$ (FA itself generates a carry)
Propagate: $P = A \oplus B$ (FA passes along carry)
Delete: $D = \overline{A}\,\overline{B}$ (FA stops propagation of carry)
- Then we can write S and Cout in terms of G, P, and Cin
$S(G,P,C_{in}) = P \oplus C_{in}$
$C_{out}(G,P,C_{in}) = G + P C_{in}$
- We can also write S and Cout in terms of D, P, and Cin
- Sometimes an alternative definition for P can be used (it leaves Cout unchanged, but S still needs the XOR form)
Propagate: $P = A + B$
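The definitions above can be checked exhaustively over the eight input combinations. The following is a minimal sketch, not part of the original slides; the function names `full_adder`, `full_adder_gpd`, and `check_all` are hypothetical helpers chosen for illustration.

```python
from itertools import product

def full_adder(a, b, cin):
    """Reference one-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def full_adder_gpd(a, b, cin):
    """Same adder written in terms of generate, propagate, and delete."""
    g = a & b                # generate: FA itself produces a carry
    p = a ^ b                # propagate (XOR form): FA passes Cin along
    d = (1 - a) & (1 - b)    # delete: FA stops the carry
    s = p ^ cin
    cout = g | (p & cin)
    return s, cout, (g, p, d)

def check_all():
    """True if the G/P/D form matches the reference for all 8 inputs."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout, (g, p, d) = full_adder_gpd(a, b, cin)
        if (s, cout) != full_adder(a, b, cin):
            return False
        if g + p + d != 1:               # exactly one of G, P, D is active
            return False
        if (g | ((a | b) & cin)) != cout:  # OR-form propagate gives same Cout
            return False
    return True
```

With the XOR form of P, exactly one of G, P, and D is active for any input pair, which is what the check `g + p + d != 1` relies on.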
9FA CMOS Implementation First Try
32 Transistors
10Improved CMOS Implementation
- A more compact design is based on the observation
that S can be factored to reuse the Cout term
11Improved CMOS Implementation II
28 Transistors
12Notes on Improved CMOS FA
- Note that the PMOS network is identical to the NMOS network rather than being its complement.
- This is possible because of the inversion property, which says that the function of complemented inputs is equal to the complement of the function.
- This simplification reduces the number of series transistors and makes the layout more uniform.
- This design has a greater delay to compute S than Cout.
- Most of the time the extra delay in computing S has little effect on the critical path because the carry is the signal that propagates.
- With proper sizing this delay on S can be minimized.
13Inversion Property
- The function must be symmetric
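The inversion property can be confirmed for the full adder by brute force. This is a small sketch under the assumption that the property meant here is "complementing all inputs complements both outputs"; `full_adder` and `inversion_property_holds` are hypothetical names.

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def inversion_property_holds():
    """Check S(!A,!B,!Cin) == !S(A,B,Cin) and the same for Cout."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if s_inv != 1 - s or cout_inv != 1 - cout:
            return False
    return True
```

This is the property that lets the pull-up network mirror the pull-down network instead of being its series/parallel complement.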
14TG-Based FA
Extra delay - slower
16 Transistors
15Complementary PT Logic (CPL) FA
28 transistors dual rail
Voltage drop Problems
Faster, lower power, and smaller area than full static CMOS
16Delay Balanced FA
Identical Delays for Carry and Sum
20 + 2 transistors
17Mirror Adder
PUN and PDN are symmetrical, not complemented
24 + 4 transistors
$C_{out} = AB + A C_{in} + B C_{in}$
18Mirror Adder Features
- The NMOS and PMOS chains are completely symmetrical with a maximum of two series transistors in the carry circuitry, guaranteeing identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
- When laying out the cell, the most critical issue is the minimization of the capacitances at node !Cout (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances).
- Shared diffusions can reduce the stack node capacitances.
- The transistors connected to Cin are placed closest to the output.
19Mirror Adder Sizing Issues
- Only the transistors in the carry stage have to be optimized for optimal speed. All transistors in the sum stage can be minimal size.
- Assume PMOS/NMOS ratio of 2. Each input in the carry circuit has a logical effort of 2, so the optimal fan-out for each is also 2.
- Since !Cout drives 2 internal and 2 inverter transistor gates (to form Cout for the bit adder), the carry circuit should be oversized.
20Mirror Adder Stick Diagram
21Ripple Carry Adder (RCA)
$t_{ripple} \approx t_{FA}(A,B \to C_{out}) + (N-2)\,t_{FA}(C_{in} \to C_{out}) + t_{FA}(C_{in} \to S)$
Worst-case delay: $t_{ripple} = O(N)$
Slow!
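A behavioral sketch of the ripple carry adder and its worst-case delay expression follows. It is an illustration only: the bit lists are LSB first, the helper names are hypothetical, and the unit delays passed to `ripple_delay` are assumed values, not figures from the slides.

```python
def to_bits(x, n):
    """Integer to an LSB-first list of n bits."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """LSB-first bit list back to an integer."""
    return sum(b << i for i, b in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain of full adders; the carry ripples from bit 0 upward."""
    sums = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return sums, carry

def ripple_delay(n, t_ab_cout=1.0, t_cin_cout=1.0, t_cin_s=1.0):
    """Worst-case delay: first carry, N-2 carry hops, final sum."""
    return t_ab_cout + (n - 2) * t_cin_cout + t_cin_s
```

For example, `ripple_carry_add(to_bits(11, 4), to_bits(6, 4))` produces the 4-bit sum of 11 + 6 along with a carry out of 1, and the delay grows linearly with `n`.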
22Exploiting the Inversion Property
regular cell
inverted cell
- Now need two flavors of FAs
- Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
- Need to increase the transistor sizes on the carry chain portion of the mirror adder.
23Fast Carry Chain Design
- The key to fast addition is a low latency carry network
- What matters is whether in a given position a carry is
- Generated: $G_i = A_i B_i$
- Propagated: $P_i = A_i \oplus B_i$ (sometimes use $A_i + B_i$)
- Annihilated (killed): $K_i = \overline{A_i}\,\overline{B_i}$
- Giving a carry recurrence of
$C_{i+1} = G_i + P_i C_i$
$C_1 = G_0 + P_0 C_0$
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$
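The recurrence and its flattened (expanded) form should always agree. The sketch below compares them for every 4-bit pattern of G, P, and C0; the function names are hypothetical, and the lists are indexed by bit position.

```python
from itertools import product

def carries_recurrence(g, p, c0):
    """C[i+1] = G[i] + P[i]*C[i], evaluated serially."""
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c

def carry_expanded(g, p, c0, k):
    """C[k] as a flat OR of AND terms:
    G[k-1] + P[k-1]G[k-2] + ... + P[k-1]...P[0]*C0."""
    result = 0
    prefix = 1                      # running product P[k-1]...P[j+1]
    for j in range(k - 1, -1, -1):
        result |= prefix & g[j]
        prefix &= p[j]
    return result | (prefix & c0)

def forms_agree(n=4):
    """True if both forms give identical carries for all inputs."""
    for bits in product((0, 1), repeat=2 * n + 1):
        g, p, c0 = list(bits[:n]), list(bits[n:2 * n]), bits[-1]
        serial = carries_recurrence(g, p, c0)
        if any(carry_expanded(g, p, c0, k) != serial[k] for k in range(1, n + 1)):
            return False
    return True
```

The expanded form trades the serial dependency for wider AND/OR terms, which is why the number of product terms grows with the bit position.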
24Manchester Carry Chain
- Switches controlled by Gi and Pi
- Components of total delay
- time to form the switch control signals Gi and Pi
- setup time for the switches
- signal propagation delay through N switches in
the worst case
254-bit Sliced MCC Adder
26Domino MCC Circuit
27MCC Stick Diagram
28Notes on MCC Adder
- When clock is low, the carry nodes precharge; when clock goes high, if $G_i$ is high, $C_{i+1}$ is asserted (goes low)
- To prevent $G_i$ from affecting $C_i$, the signal $P_i$ must be computed as the XOR (rather than the OR) of $A_i$ and $B_i$, which is not a problem since we need the XOR for computing the sum anyway
- Delay is roughly proportional to $n^2$ (as n pass transistors are connected in series)
- We usually limit each group to 4 stages, then buffer the carry chain with an inverter between each group
29Binary Adder Landscape
Synchronous Word Parallel Adders
$t = O(N)$, $A = O(N)$
$t = O(1)$, $A = O(N)$
$t = O(\sqrt{N})$, $A = O(N)$
30Carry-Skip (Carry-Bypass) Adder
If $P_0 P_1 P_2 P_3 = 1$, then $C_{o,3} = C_{i,0}$; otherwise the block itself kills or generates the carry internally
31Carry-Skip Chain Implementation
Only 10 to 20% area overhead
Only two gate delays to produce Cout if skip occurs
324-bit Block Carry-Skip Adder
Worst-case delay: carry from bit 0 to bit 15. The carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), and ripples in the last group from bit 12 to bit 15
$t_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$
33Optimal Block Size and Time
- Assuming one stage of ripple ($t_{carry}$) has the same delay as one skip logic stage ($t_{skip}$) and both are 1
- $t_{CSkA} = 1 + B + (N/B - 1) + B + 1$, where the five terms are, in order, $t_{setup}$, ripple in block 0, skips, ripple in last block, and $t_{sum}$
- $t_{CSkA} = 2B + N/B + 1$
- So the optimal block size, B, is
- $dt_{CSkA}/dB = 2 - N/B^2 = 0 \Rightarrow B_{opt} = \sqrt{N/2}$
- And the optimal time is
- Optimal $t_{CSkA} = 2\sqrt{2N} + 1$
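As a numeric cross-check of the derivation, the normalized delay can be evaluated for every block size that divides N and compared with the closed-form optimum. This is a sketch under the same unit-delay assumption as the slide; the function names are hypothetical.

```python
import math

def carry_skip_delay(n, b):
    """Normalized delay 2B + N/B + 1 (every unit delay taken as 1)."""
    return 2 * b + n / b + 1

def optimal_block(n):
    """Closed-form optimum: (B_opt, t_opt) = (sqrt(N/2), 2*sqrt(2N) + 1)."""
    return math.sqrt(n / 2), 2 * math.sqrt(2 * n) + 1

def best_integer_block(n):
    """Best block size among the divisors of N, found by brute force."""
    divisors = [b for b in range(1, n + 1) if n % b == 0]
    return min(divisors, key=lambda b: carry_skip_delay(n, b))
```

For N = 32 the closed form gives a block size of 4 and a normalized delay of 17, and the brute-force search over divisors lands on the same block size. For word lengths where the square root is not an integer, the real optimum has to be rounded to a usable block size.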
34Variations of Carry-Skip Adders I
- Variable block sized Carry-Skip Adders
- A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks
- Hence a carry-skip adder can have bigger blocks for the inner carries without increasing the overall delay
tCSkA 2B O(NB)
35Variations of Carry-Skip Adders II
- Multiple Levels of Skip Logic
- Carry-skip adders with a large number of bits suffer from linear carry propagation delay.
- With added higher levels of skip logic, a carry-skip adder can skip more blocks at a time.
[Figure: carry chain from Cin to Cout with skip level 1 and skip level 2; level 2 is the AND of the first level skip signals (BPs)]
$t_{CSkA} = 2B + O(\log_B N)$
36Carry-Skip Adder Comparisons
37Carry Select Adders
- Idea: Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel) and then select the correct one
- More cost effective than the ripple carry adder
38Carry Select Adder Critical Path
$t_{add} = t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$
39Square Root Carry Select Adders
Balance delay by making later blocks bigger
$t_{add} = t_{setup} + 2\,t_{carry} + \sqrt{N}\,t_{mux} + t_{sum}$
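The carry-select idea and the two delay expressions can be sketched behaviorally. This is an illustration under stated assumptions: each block's two candidate sums are computed with Python integer addition rather than gate-level adders, all unit delays default to 1, and the names `carry_select_add`, `linear_select_delay`, and `sqrt_select_delay` are hypothetical.

```python
import math

def carry_select_add(a, b, n=16, block=4, cin=0):
    """Add two n-bit integers block by block, selecting between the
    precomputed carry_in = 0 and carry_in = 1 results of each block."""
    result = 0
    carry = cin
    for lo in range(0, n, block):
        width = min(block, n - lo)
        mask = (1 << width) - 1
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        sum0 = a_blk + b_blk         # candidate for carry_in = 0
        sum1 = a_blk + b_blk + 1     # candidate for carry_in = 1
        chosen = sum1 if carry else sum0   # the multiplexer
        result |= (chosen & mask) << lo
        carry = chosen >> width
    return result, carry

def linear_select_delay(n, b, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Equal-size blocks: setup + B carries + N/B mux stages + sum."""
    return t_setup + b * t_carry + (n / b) * t_mux + t_sum

def sqrt_select_delay(n, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Growing blocks: setup + 2 carries + sqrt(N) mux stages + sum."""
    return t_setup + 2 * t_carry + math.sqrt(n) * t_mux + t_sum
```

In hardware the two candidates of every block are produced at the same time, so only the multiplexer chain remains serial; the loop above models that selection order, not the timing.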
40Adder Delays - Comparison
41LookAhead - Basic Idea
$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$
42Look-Ahead Topology
By expanding carry generation all the way:
$C_1 = G_0 + P_0 C_0$
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$
43Logarithmic Look-Ahead Adder
44Parallel Prefix Adders (PPAs)
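- Define carry operator $\circ$ on (G,P) signal pairs:
$(G'',P'') \circ (G',P') = (G,P)$, where $G = G'' + P''G'$ and $P = P''P'$
- $\circ$ is associative, i.e.,
$[(g''',p''') \circ (g'',p'')] \circ (g',p') = (g''',p''') \circ [(g'',p'') \circ (g',p')]$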
45PPA General Structure
- Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel
- $(G_0,P_0) \circ (G_1,P_1) \circ (G_2,P_2) \circ \cdots \circ (G_{N-2},P_{N-2}) \circ (G_{N-1},P_{N-1})$
- Since the carry operator $\circ$ is associative, we can group them in any order
- but note that it is not commutative
- Measures to consider
- number of cells
- tree cell depth (time)
- tree cell area
- cell fan-in and fan-out
- max wiring length
- wiring congestion
- delay path variation (glitching)
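A prefix computation over (G,P) pairs can be sketched as follows. The doubling-distance schedule used here is the Kogge-Stone pattern, picked only as one example of a parallel prefix network; the carry-in is folded in as an extra pair `(cin, 0)` below bit 0, which is a modeling assumption. The names `carry_op`, `kogge_stone_carries`, and `prefix_add` are hypothetical.

```python
def carry_op(hi, lo):
    """(G'',P'') o (G',P') = (G'' + P''G', P''P'); hi is the more
    significant pair, lo the less significant one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def kogge_stone_carries(a, b, n, cin=0):
    """Return c[0..n], where c[i] is the carry into bit i (c[n] = carry out)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    pairs = [(cin, 0)] + gp          # position 0 holds the carry-in
    dist = 1
    while dist < len(pairs):
        # each level combines every position with the one `dist` below it
        pairs = [carry_op(pairs[i], pairs[i - dist]) if i >= dist else pairs[i]
                 for i in range(len(pairs))]
        dist *= 2
    return [g for g, _ in pairs]

def prefix_add(a, b, n, cin=0):
    """n-bit addition using the prefix carries: S_i = P_i xor C_i."""
    c = kogge_stone_carries(a, b, n, cin)
    s = 0
    for i in range(n):
        p_i = ((a >> i) ^ (b >> i)) & 1
        s |= (p_i ^ c[i]) << i
    return s, c[n]
```

The number of levels in the `while` loop grows with the logarithm of the word length, which is the "tree cell depth" measure in the list above; swapping the operands of `carry_op` changes the result, reflecting that the operator is not commutative.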
46Brent-Kung PPA
47Kogge-Stone PPF Adder
48More Adder Comparisons
49Adder Speed Comparisons
50Adder Average Power Comparisons
51PDP of Adder Comparisons
52Binary Multiplication - Basics
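- Given two unsigned binary numbers X (M bits) and Y (N bits)
$X = \sum_{i=0}^{M-1} X_i 2^i$, $Y = \sum_{j=0}^{N-1} Y_j 2^j$
- where $X_i, Y_j \in \{0, 1\}$
- The multiplication operation $Z = X \times Y$ is
$Z = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} X_i Y_j 2^{i+j}$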
- Binary Multiplication as repeated additions
      1 0 1 0 1 0   multiplicand
    ×     1 0 1 1   multiplier
      1 0 1 0 1 0   partial product array
    1 0 1 0 1 0     (can be formed in parallel)
  0 0 0 0 0 0
1 0 1 0 1 0
1 1 1 0 0 1 1 1 0   double precision product
54Shift-and-Add Multiplication
- Right Shift and Add (N bits × N bits)
Left shift requires 2N-bit adder
$t_{shiftadd\_mult} = O(N \cdot t_{adder}) = O(N^2)$ for an RCA
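A behavioral sketch of right-shift-and-add multiplication is given below. It models the product register as a Python integer and assumes unsigned operands; `shift_add_multiply` is a hypothetical name, and the slide's actual datapath is not visible in this text.

```python
def shift_add_multiply(x, y, n):
    """Multiply n-bit unsigned x (multiplicand) by n-bit unsigned y
    (multiplier) using n add-then-shift-right steps."""
    acc = 0
    for i in range(n):
        if (y >> i) & 1:
            acc += x << n    # add the multiplicand into the upper half
        acc >>= 1            # shift the whole product register right
    return acc
```

Because the multiplicand is always added into the upper half of the register, the adder only needs to span N bits plus a carry; the shift moves finished low-order product bits out of its way. With `x = 0b101010`, `y = 0b1011`, and `n = 6`, this reproduces the 101010 × 1011 example.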
55Improving Multipliers
- Making them faster (therefore, bigger area)
- Use faster adders
- Use higher radix (e.g., base 4) multiplication
- Use multiplier recoding to simplify multiple formation
- Form partial product array in parallel and add it in parallel
- Making them smaller (i.e., slower)
- Use array multipliers
- Very regular structure with only short wires to nearest neighbor cells. Thus, very simple and efficient layout in VLSI
- Can be easily and efficiently pipelined
56Array (or Tree) Multiplier Structure
[Figure labels: multiple forming circuits; partial product array reduction tree; fast carry propagate adder (CPA)]
Delay: mux + reduction tree (log N) + CPA (log N)
57Partial Product (PP) Generation
- Each row in the partial-product array is either a copy of the multiplicand or a row of zeros
- Careful optimization of the PP generation can lead to some substantial delay and area reduction.
- Booth's and modified Booth's recoding
58Array Multiplier Implementation
Assume $t_{add} = t_{carry}$
$t_{array\_mult} = [(M-1) + (N-2)]\,t_{carry} + (N-1)\,t_{sum} + t_{and} = O(N)$
59Carry-Save Multiplier
- The idea is to save the (PP) carry and add it in the next adder stage
- In the final addition a fast carry-propagate (e.g., carry-lookahead) adder is used.
$t_{CSM} = (N-1)\,t_{carry} + t_{merge} + t_{and} = O(N)$
60CSM Floorplan
Regularity makes the generation of structure
amenable to automation
61Wallace-Tree Multiplier
GOAL: Minimize depth (number of stages) with the minimum number of adder elements
62Wallace-Tree Multiplier Implementation
3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
Any type of adder can be used for the final adder
63Notes on Wallace-Tree Multiplier
- Wallace tree substantially saves hardware for large multipliers
- The number of partial products is reduced to about two-thirds at each stage (each 3-2 compressor turns three rows into two)
- The propagation delay is found to be bounded: $t_{WTM} = O(\log_{3/2} N)$
- Although substantially faster than CSM, the WTM structure is very irregular
- Difficulty in finding an efficient VLSI layout
- Many of today's high-performance multipliers use higher order (e.g., 4-2) compressors instead of 3-2 compressors (FAs)
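The logarithmic depth can be illustrated by counting reduction stages. The sketch below uses a simplified row-count model, which is an assumption: every group of three rows is compressed to two by 3-2 compressors, leftover rows pass through unchanged, and reduction stops when two rows remain for the final adder. `wallace_stages` is a hypothetical name.

```python
def wallace_stages(n_rows):
    """Number of 3-2 reduction stages needed to bring n_rows partial
    products down to the two rows fed to the final adder."""
    stages = 0
    while n_rows > 2:
        n_rows = 2 * (n_rows // 3) + n_rows % 3   # 3 rows -> 2 rows
        stages += 1
    return stages
```

Under this model a 4-row array needs two stages, and the stage count grows roughly with the base-3/2 logarithm of the number of rows rather than linearly.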
64Parallel Programmable Shifters
- Shifting a data word left or right over a constant amount is a trivial hardware operation and is implemented by the appropriate signal wiring
- Shifters are used in multipliers and floating point units
- Consume lots of area if done in random logic gates
65A Programmable Binary Shifter
Exactly one signal is active
664-bit Barrel Shifter
Example:
Sh0 = 1: B3B2B1B0 = A3A2A1A0
Sh1 = 1: B3B2B1B0 = A3A3A2A1
Sh2 = 1: B3B2B1B0 = A3A3A3A2
Sh3 = 1: B3B2B1B0 = A3A3A3A3
Arithmetic shift
Area dominated by wiring
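The example table can be reproduced with a small model of the switch matrix. This is a sketch, assuming one-hot shift controls and sign extension by repeating the top bit, as in the table; the bit lists are LSB first (index i holds A_i), and `barrel_shift` is a hypothetical name.

```python
def barrel_shift(a, sh):
    """Arithmetic right shift through a one-hot controlled switch matrix.
    a[i] is input bit A_i; sh[k] = 1 selects a shift by k positions."""
    n = len(a)
    if sum(sh) != 1:
        raise ValueError("exactly one shift control must be active")
    b = [0] * n
    for i in range(n):
        for k in range(n):
            if sh[k]:                         # the one conducting switch
                b[i] = a[min(i + k, n - 1)]   # top bit repeats (sign fill)
    return b
```

Each output bit is connected to its source through a single switch regardless of the shift amount, which is the reason the propagation delay is nominally constant.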
68Notes on Barrel Shifter
- Note that the signal goes through at most one FET (so constant propagation delay, in theory)
- Also note that the FET diffusion capacitance on an output wire increases linearly with the shift width, but the FET diffusion capacitance on the input data lines increases quadratically (i.e., $N^2$ for a circular shifter)
- Size of cell is bounded by the pitch of the metal wires.
- A decoder is usually needed for the shift control signals since the amount of shift is normally given as an (encoded) binary number.
694-bit Barrel Shifter Layout
$Width_{barrel} = 2\,p_m N$, where N = max shift distance and $p_m$ = metal pitch
708-bit Logarithmic Shifter
718-bit Logarithmic Shifter
728-bit Logarithmic Shifter Layout Slice
$Width_{log} = p_m(2K + (1 + 2 + \cdots + 2^{K-1})) = p_m(2K + 2^K - 1)$, where $K = \log_2 N$
73Shifter Implementation Comparisons
- Barrel Shifter is better for small shifters (faster, not much bigger) while Log Shifter is preferred for larger shifters.
- Log Shifters are always smaller
- For large shifters we may have to start worrying about the number of pass transistors in series.
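The comparison can be made concrete with a behavioral logarithmic shifter and the two width expressions. This is a sketch under assumptions: the word length is a power of two, the shift is a right shift with zero fill, widths are in units of the metal pitch, and the names `log_shift_right`, `width_barrel`, and `width_log` are hypothetical.

```python
def log_shift_right(a, shift):
    """Right shift an LSB-first bit list in log2(n) stages; stage k
    shifts by 2**k when bit k of the binary shift amount is set."""
    n = len(a)
    k = 0
    while (1 << k) < n:
        if (shift >> k) & 1:
            step = 1 << k
            a = [a[i + step] if i + step < n else 0 for i in range(n)]
        k += 1
    return a

def width_barrel(n, pm=1):
    """Barrel shifter width: 2 * pm * N."""
    return 2 * pm * n

def width_log(n, pm=1):
    """Logarithmic shifter width: pm * (2K + 2**K - 1), K = log2(N)."""
    k = n.bit_length() - 1
    return pm * (2 * k + 2 ** k - 1)
```

The shift amount is consumed directly in binary, so no decoder is required, but a signal may pass through as many series stages as there are shift-amount bits. For N = 8 the widths come out as 16 and 13 metal pitches, and the gap widens as N grows.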
74Decoders
- Decodes inputs to activate one of many outputs
[Figure: decoder with inputs In0, In1]
- Cost of 2-to-4 Decoder
- two inverters, four 2-input NAND gates, four inverters plus enable logic
- how about cost for a 3-to-8, 4-to-16, etc. decoder?
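The gate inventory above can be modeled directly. The sketch below assumes active-high outputs and a simple AND with the enable signal, since the slide does not show the enable logic; `decoder_cost` extends the same single-level structure to n inputs, which is one possible answer to the question rather than the one intended by the slides. All names are hypothetical.

```python
def inv(x):
    """Inverter."""
    return 1 - x

def nand(*xs):
    """n-input NAND gate."""
    return 0 if all(xs) else 1

def decoder_2to4(in1, in0, enable=1):
    """2-to-4 decoder: two input inverters, four 2-input NANDs, four
    output inverters; outputs[k] is 1 only when (in1, in0) encodes k."""
    n1, n0 = inv(in1), inv(in0)
    nand_outs = [nand(n1, n0), nand(n1, in0), nand(in1, n0), nand(in1, in0)]
    return [inv(x) & enable for x in nand_outs]   # assumed enable gating

def decoder_cost(n):
    """Gate count if the same structure is scaled to an n-to-2**n decoder."""
    return {"inverters": n + 2 ** n, "nand_gates": 2 ** n, "nand_fan_in": n}
```

Scaling this way makes both the gate count and the NAND fan-in grow with n, which is one reason larger decoders are usually built hierarchically from smaller ones.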
75Dynamic NOR Decoder
Active HIGH Outputs
Capacitance of the output wires increases
linearly with the decoder size
77Dynamic NAND Decoder
78Dynamic NAND Decoder
Active LOW Outputs
79Notes on Dynamic Decoders
- In the dynamic NOR decoder the signal goes through at most one FET
- So constant propagation delay (in theory)
- However, some output wires may have two or more parallel paths to GND, effectively shortening the transition time
- In contrast, the signal in the dynamic NAND decoder passes through a series of FETs
- The number of FETs rises linearly with the decoder size
- Thus it will be slower than the NOR implementation if the gate capacitance dominates diffusion capacitance
- For the NAND decoder all the input signals must be low during precharge, else Vdd and GND will be connected!
80Building Bigger Decoders
Active low enable, Active low output
Need to catch the output that goes to zero before
it precharges again
81Layout of Bit-Sliced Datapaths
Must have enough drive capacity to handle large
fan-out
Horizontal gap for feeding signals to the cells
downstream
82Optimizing Bit-sliced Datapaths