Chapter 12 Arithmetic Building Blocks

Boonchuay Supmonchai, Integrated Design Application Research (IDAR) Laboratory. August 20, 2004; Revised July 5, 2005.


1
Chapter 12: Arithmetic Building Blocks

Boonchuay Supmonchai
Integrated Design Application Research (IDAR) Laboratory
August 20, 2004; Revised July 5, 2005

2
Goals of This Chapter

Designing for Performance, area, or power
Adders
Multipliers
Shifters
Logic and System Optimizations for datapath modules
Power-Delay trade-offs in datapaths

3
Review: A Generic Processor
4
Bit-Sliced Architecture
[Figure: n-bit Data In feeds bit slices Bit 0, Bit 1, ..., Bit n-2, Bit n-1 under shared Control, producing n-bit Data Out]

Modular
Easy to design and verify
Easy to expand

Potential to be fast

5
Example: Itanium Bit-Sliced Design
6
Example: Itanium Integer Datapath
Itanium has 6 integer execution units (ALUs)
7
One-Bit Binary Full Adder (FA)
$S = A \oplus B \oplus C_{in}$
$C_{out} = AB + AC_{in} + BC_{in}$

A VERY common operation, so it is worth spending some time trying to optimize it.
Often in the critical path, so we need to look at both logic-level and circuit-level optimizations.
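
The two full-adder equations can be checked against plain integer addition. The following is a minimal Python sketch; the function name `full_adder` is chosen here for illustration and does not come from the slides.

```python
from itertools import product


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (S, Cout) for one-bit inputs A, B, Cin."""
    s = a ^ b ^ cin                          # S = A xor B xor Cin
    cout = (a & b) | (a & cin) | (b & cin)   # Cout = AB + ACin + BCin
    return s, cout


# Exhaustive check: the pair (Cout, S) is the 2-bit binary value of A + B + Cin.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin
```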

8
Propagate, Generate, and Delete (Kill)

Define three new variables that depend ONLY on A and B:

Generate: $G = AB$ (the FA itself generates a carry)
Propagate: $P = A \oplus B$ (the FA passes along the carry)
Delete: $D = \bar{A}\,\bar{B}$ (the FA stops propagation of the carry)

Then we can write S and Cout in terms of G, P, and Cin:

$S(G, P, C_{in}) = P \oplus C_{in}$
$C_{out}(G, P, C_{in}) = G + P C_{in}$

We can also write S and Cout in terms of D, P, and Cin.
Sometimes an alternative definition for P is used:

Propagate: $P = A + B$
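
A small Python sketch can confirm these definitions exhaustively. The helper names are illustrative. The expression for Cout in terms of D is one possible formulation, written here as an assumption because the slide's own formula did not survive in the transcript.

```python
from itertools import product


def pgd(a: int, b: int) -> tuple[int, int, int]:
    """Generate, propagate (XOR form), and delete signals for one bit position."""
    g = a & b
    p = a ^ b
    d = (1 - a) & (1 - b)
    return g, p, d


def sum_and_carry(g: int, p: int, cin: int) -> tuple[int, int]:
    """S = P xor Cin, Cout = G + P*Cin."""
    return p ^ cin, g | (p & cin)


for a, b, cin in product((0, 1), repeat=3):
    g, p, d = pgd(a, b)
    assert g + p + d == 1                    # exactly one of G, P, D is active
    s, cout = sum_and_carry(g, p, cin)
    assert 2 * cout + s == a + b + cin       # matches integer addition
    # Alternative P = A OR B gives the same carry (it must not be used for S).
    assert g | ((a | b) & cin) == cout
    # Assumed D-based form: !Cout = D + P*!Cin.
    assert 1 - cout == d | (p & (1 - cin))
```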
9
FA CMOS Implementation: First Try
32 Transistors
10
Improved CMOS Implementation

A more compact design is based on the observation that S can be factored to reuse the Cout term:

$S = ABC_{in} + \overline{C_{out}}\,(A + B + C_{in})$

11
Improved CMOS Implementation II
28 Transistors
12
Notes on Improved CMOS FA

Note that the PMOS network is identical to the NMOS network rather than being its complement. This is possible because of the inversion property, which says that the function of complemented inputs is equal to the complement of the function.
This simplification reduces the number of series transistors and makes the layout more uniform.
This design has a greater delay to compute S than Cout.
Most of the time the extra delay in computing S has little effect on the critical path, because the carry is the signal that propagates.
With proper sizing, this delay on S can be minimized.

13
Inversion Property

The function must be symmetric
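
The inversion property (the function of complemented inputs equals the complement of the function) can be checked exhaustively for the full-adder outputs. This Python sketch uses illustrative names; the 3-input AND is included only as a contrasting function that lacks the property.

```python
from itertools import product


def sum_bit(a: int, b: int, c: int) -> int:
    return a ^ b ^ c


def carry_bit(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


def and3(a: int, b: int, c: int) -> int:
    return a & b & c


def has_inversion_property(f) -> bool:
    """True if !f(A, B, C) == f(!A, !B, !C) for every input combination."""
    return all(
        1 - f(a, b, c) == f(1 - a, 1 - b, 1 - c)
        for a, b, c in product((0, 1), repeat=3)
    )
```

Both `sum_bit` and `carry_bit` pass the check, which is what allows the pull-up network to mirror the pull-down network; `and3` does not.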

14
TG-Based FA
Extra delay - slower
16 Transistors
15
Complementary PT Logic (CPL) FA
28 transistors, dual rail
Voltage drop problems
Faster, lower power, and smaller area than full static CMOS
16
Delay Balanced FA
Identical Delays for Carry and Sum
20 + 2 transistors
17
Mirror Adder
PUN and PDN are symmetrical, not complemented
24 + 4 transistors
$C_{out} = AB + AC_{in} + BC_{in}$
18
Mirror Adder Features

The NMOS and PMOS chains are completely
symmetrical with a maximum of two series
transistors in the carry circuitry, guaranteeing
identical rise and fall transitions if the NMOS
and PMOS devices are properly sized.
When laying out the cell, the most critical issue
is the minimization of the capacitances at node
!Cout (four diffusion capacitances, two internal
gate capacitances, and two inverter gate
capacitances).
Shared diffusions can reduce the stack node
capacitances.
The transistors connected to Cin are placed
closest to the output.

19
Mirror Adder Sizing Issues

Only the transistors in the carry stage have to
be optimized for optimal speed. All transistors
in the sum stage can be minimal size.
Assume PMOS/NMOS ratio of 2. Each input in the
carry circuit has a logical effort of 2 so the
optimal fan-out for each is also 2.
Since !Cout drives 2 internal and 2 inverter transistor gates (to form Cout for the bit adder), the carry circuit should be oversized.

20
Mirror Adder Stick Diagram
21
Ripple Carry Adder (RCA)
$t_{ripple} \approx t_{FA}(A,B \to C_{out}) + (N - 2)\,t_{FA}(C_{in} \to C_{out}) + t_{FA}(C_{in} \to S)$
Worst-case delay: $t_{ripple} = O(N)$
Slow!
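
A behavioral Python sketch of an N-bit ripple carry adder makes the serial carry dependency explicit: each loop iteration needs the carry from the previous one, which is the source of the O(N) delay. Bit lists are LSB first; the helper names are illustrative.

```python
def to_bits(x: int, n: int) -> list[int]:
    """LSB-first list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits: list[int], b_bits: list[int], cin: int = 0):
    """Chain one full adder per bit; return (sum_bits, carry_out)."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)  # feeds the next stage
    return sum_bits, carry
```

For example, `ripple_carry_add(to_bits(7, 4), to_bits(9, 4))` gives sum bits for 0 and a carry out of 1, since 7 + 9 = 16.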
22
Exploiting the Inversion Property
[Figure: carry chain alternating regular cells and inverted cells]

Now need two flavors of FAs
Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
Requires increasing the transistor sizes on the carry chain portion of the mirror adder.

23
Fast Carry Chain Design

The key to fast addition is a low-latency carry network.
What matters is whether, in a given position, a carry is
Generated: $G_i = A_i B_i$
Propagated: $P_i = A_i \oplus B_i$ (sometimes $A_i + B_i$ is used)
Annihilated (killed): $K_i = \bar{A}_i \bar{B}_i$
Giving a carry recurrence of

$C_{i+1} = G_i + P_i C_i$
$C_1 = G_0 + P_0 C_0$
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$
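
The expanded carry equations can be checked against the recurrence with a short Python sketch. Both functions take LSB-first lists of G and P bits; the function names are illustrative only.

```python
def carries_recurrence(g: list[int], p: list[int], c0: int) -> list[int]:
    """C_{i+1} = G_i + P_i * C_i, evaluated serially. Returns [C0, C1, ..., CN]."""
    carries = [c0]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    return carries


def carries_lookahead(g: list[int], p: list[int], c0: int) -> list[int]:
    """Each C_{i+1} computed directly from G, P, and C0 (fully expanded form)."""
    carries = [c0]
    for i in range(len(g)):
        # A carry generated at bit j reaches C_{i+1} if bits j+1..i all propagate.
        terms = [g[j] & int(all(p[j + 1:i + 1])) for j in range(i + 1)]
        # C0 reaches C_{i+1} if bits 0..i all propagate.
        terms.append(c0 & int(all(p[:i + 1])))
        carries.append(int(any(terms)))
    return carries
```

The expanded form removes the serial dependency on the previous carry, at the cost of product terms whose fan-in grows with the bit position.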
24
Manchester Carry Chain

Switches controlled by Gi and Pi

Components of total delay
time to form the switch control signals Gi and Pi
setup time for the switches
signal propagation delay through N switches in
the worst case

25
4-bit Sliced MCC Adder
26
Domino MCC Circuit
27
MCC Stick Diagram
28
Notes on MCC Adder

When the clock is low, the carry nodes precharge.
When the clock goes high, if Gi is high, $C_{i+1}$ is asserted (goes low).
To prevent Gi from affecting Ci, the signal Pi must be computed as the XOR (rather than the OR), which is not a problem since we need the XOR of Ai and Bi for computing the sum anyway.
Delay is roughly proportional to $n^2$ (as n pass transistors are connected in series).
We usually limit each group to 4 stages, then buffer the carry chain with an inverter between each group.

29
Binary Adder Landscape
Synchronous word-parallel adders
$t = O(N)$, $A = O(N)$
$t = O(1)$, $A = O(N)$
$t = O(\sqrt{N})$, $A = O(N)$
30
Carry-Skip (Carry-Bypass) Adder
If $P_0 P_1 P_2 P_3 = 1$ then $C_{o,3} = C_{i,0}$;
otherwise the block itself kills or generates the carry internally.
31
Carry-Skip Chain Implementation
Only 10% to 20% area overhead
Only two gate delays to produce Cout if a skip occurs
32
4-bit Block Carry-Skip Adder
Worst-case delay: carry from bit 0 to bit 15.
The carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), and ripples in the last group from bit 12 to bit 15.
$t_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$
33
Optimal Block Size and Time

Assuming one stage of ripple ($t_{carry}$) has the same delay as one skip logic stage ($t_{skip}$), and both are 1:

$t_{CSkA} = 1 + B + (N/B - 1) + B + 1 = 2B + N/B + 1$

The five terms are, in order: $t_{setup}$, ripple in block 0, skips, ripple in last block, $t_{sum}$.

So the optimal block size, B, is

$dt_{CSkA}/dB = 2 - N/B^2 = 0 \Rightarrow B_{opt} = \sqrt{N/2}$

And the optimal time is

$t_{CSkA,opt} = 2\sqrt{2N} + 1$
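
As a numeric check of this result, the unit-delay model 2B + N/B + 1 can be evaluated for a 32-bit adder. This is a Python sketch with illustrative function names; restricting B to divisors of N is an assumption made here so that all blocks have equal size.

```python
import math


def carry_skip_delay(n: int, b: float) -> float:
    """Unit-delay model from the slide: t = 2B + N/B + 1."""
    return 2 * b + n / b + 1


def optimal_block_size(n: int) -> float:
    """Closed form B_opt = sqrt(N/2), from dt/dB = 2 - N/B^2 = 0."""
    return math.sqrt(n / 2)


def best_integer_block(n: int) -> int:
    """Best block size among the divisors of N (assumed equal-size blocks)."""
    divisors = [b for b in range(1, n + 1) if n % b == 0]
    return min(divisors, key=lambda b: carry_skip_delay(n, b))
```

For N = 32 the closed form gives B = 4 and a delay of 17 unit delays, which equals 2*sqrt(2*32) + 1.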

34
Variations of Carry-Skip Adders I

Variable block sized Carry-Skip Adders
A carry that is generated in, or absorbed by, one
of the inner blocks travels a shorter distance
through the skip blocks
Hence a carry-skip adder can have bigger inner blocks without increasing the overall delay

tCSkA 2B O(NB)
35
Variations of Carry-Skip Adders II

Multiple Levels of Skip Logic
CSAs with a large number of bits suffer from linear carry propagation delay.
With added higher levels of skip logic, a CSA can skip more blocks at a time.

[Figure: carry chain from Cin to Cout with skip level 1 and skip level 2; the level-2 skip signal is the AND of the first-level skip signals (BPs)]
$t_{CSkA} = 2B + O(\log_B N)$
36
Carry-Skip Adder Comparisons
37
Carry Select Adders

Idea: Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel), and then select the correct one.
More cost effective than the ripple carry adder
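
The idea can be sketched behaviorally in Python. Each block computes two candidate results, one per assumed carry-in, and the real carry only drives a selection. The function names, the 16-bit width, and the 4-bit block size are assumptions for illustration.

```python
def ripple_block(a_bits, b_bits, cin):
    """Ripple-add one block of LSB-first bits; return (sum_bits, carry_out)."""
    sum_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def carry_select_add(a: int, b: int, n: int = 16, block: int = 4, cin: int = 0):
    """Return (n-bit sum, carry_out) using per-block precomputation."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry = 0, cin
    for lo in range(0, n, block):
        blk_a, blk_b = a_bits[lo:lo + block], b_bits[lo:lo + block]
        # Neither candidate depends on the incoming carry, so in hardware
        # all blocks can evaluate both candidates at the same time.
        s0, c0 = ripple_block(blk_a, blk_b, 0)
        s1, c1 = ripple_block(blk_a, blk_b, 1)
        # The multiplexer: the incoming carry selects one candidate.
        s, carry = (s1, c1) if carry else (s0, c0)
        for i, bit in enumerate(s):
            result |= bit << (lo + i)
    return result, carry
```

In this model only the selection step is serial from block to block, which corresponds to the multiplexer term in the critical path.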

38
Carry Select Adder Critical Path
$t_{add} = t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$
39
Square Root Carry Select Adders
Balance delay by making later blocks bigger
$t_{add} = t_{setup} + 2\,t_{carry} + \sqrt{N}\,t_{mux} + t_{sum}$
40
Adder Delays - Comparison
41
LookAhead - Basic Idea
$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$
42
Look-Ahead Topology
By expanding carry generation all the way:
$C_1 = G_0 + P_0 C_0$
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$
43
Logarithmic Look-Ahead Adder
44
Parallel Prefix Adders (PPAs)

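Define the carry operator $\circ$ on (G, P) signal pairs:

$(G'', P'') \circ (G', P') = (G, P)$
where $G = G'' + P''G'$ and $P = P''P'$

$\circ$ is associative, i.e.,
$[(g''', p''') \circ (g'', p'')] \circ (g', p') = (g''', p''') \circ [(g'', p'') \circ (g', p')]$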

45
PPA General Structure

Given P and G terms for each bit position, computing all the carries is equivalent to finding all the prefixes in parallel:
$(G_0, P_0) \circ (G_1, P_1) \circ (G_2, P_2) \circ \cdots \circ (G_{N-2}, P_{N-2}) \circ (G_{N-1}, P_{N-1})$
Since $\circ$ is associative, we can group them in any order, but note that it is not commutative.

Measures to consider
number of cells
tree cell depth (time)
tree cell area
cell fan-in and fan-out
max wiring length
wiring congestion
delay path variation (glitching)
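
The prefix computation can be sketched in Python. Here the carry operator combines a more significant (G, P) pair on the left with a less significant pair on the right, and the combining schedule doubles the span at each stage, in the style of a Kogge-Stone tree. The function names and the C0 = 0 simplification are assumptions for illustration.

```python
def carry_op(left, right):
    """(G'', P'') o (G', P') = (G'' + P''G', P''P'); left is more significant."""
    g2, p2 = left
    g1, p1 = right
    return (g2 | (p2 & g1), p2 & p1)


def kogge_stone_carries(g: list[int], p: list[int]) -> list[int]:
    """Return carry-out of each bit position (LSB first), assuming C0 = 0."""
    pairs = list(zip(g, p))
    n = len(pairs)
    dist = 1
    while dist < n:
        # Every position combines with the prefix `dist` bits below it.
        # The comprehension reads only the previous stage's list.
        pairs = [
            carry_op(pairs[i], pairs[i - dist]) if i >= dist else pairs[i]
            for i in range(n)
        ]
        dist *= 2
    # After log2(n) stages, pairs[i] spans bits i..0; its G is the carry out.
    return [gp[0] for gp in pairs]
```

Swapping the operands can change the result: `carry_op((1, 0), (0, 0))` is `(1, 0)` while `carry_op((0, 0), (1, 0))` is `(0, 0)`, which illustrates that the operator is not commutative.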

46
Brent-Kung PPA
47
Kogge-Stone PPF Adder
48
More Adder Comparisons
49
Adder Speed Comparisons
50
Adder Average Power Comparisons
51
PDP of Adder Comparisons
52
Binary Multiplication - Basics

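Given two unsigned binary numbers X (M bits) and Y (N bits)

$X = \sum_{i=0}^{M-1} X_i 2^i$, $Y = \sum_{j=0}^{N-1} Y_j 2^j$

where $X_i, Y_j \in \{0, 1\}$
The multiplication operation $Z = X \times Y$ is

$Z = \sum_{k=0}^{M+N-1} Z_k 2^k = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} X_i Y_j 2^{i+j}$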

53
Binary Multiplication Operation

Binary multiplication as repeated additions

            1 0 1 0 1 0    multiplicand
          ×     1 0 1 1    multiplier
            -----------
            1 0 1 0 1 0
          1 0 1 0 1 0      partial product array
        0 0 0 0 0 0        (can be formed in parallel)
      1 0 1 0 1 0
      -----------------
      1 1 1 0 0 1 1 1 0    double precision product
54
Shift-and-Add Multiplication

Right shift and add (N bits × N bits)

Left shift requires a 2N-bit adder
$t_{shiftadd\_mult} = O(N \cdot t_{adder}) = O(N^2)$ for an RCA
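
A behavioral Python sketch of right-shift-and-add multiplication follows. It assumes a register pair (`high`, `low`) in which `low` initially holds the multiplier; these register names are illustrative and are not taken from the slide's figure.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Unsigned n-bit by n-bit multiply using n add-and-shift-right steps."""
    mask = (1 << n) - 1
    assert 0 <= x <= mask and 0 <= y <= mask
    high, low = 0, y  # low: multiplier, gradually replaced by low product bits
    for _ in range(n):
        if low & 1:
            high += x  # n-bit addition; the sum may carry into an extra bit
        # Shift the combined (high, low) register right by one position.
        low = (low >> 1) | ((high & 1) << (n - 1))
        high >>= 1
    return (high << n) | low
```

Only an n-bit adder is needed because the addition always targets the upper half. The loop runs N times with one addition per step, which is where the O(N * t_adder) delay comes from.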
55
Improving Multipliers

Making them faster (therefore, bigger area)
Use faster adders
Use higher radix (e.g., base 4) multiplication
Use multiplier recoding to simplify multiple
formation
Form partial product array in parallel and add it
in parallel
Making them smaller (i.e., slower)
Use array multipliers
Very regular structure with only short wires to
nearest neighbor cells. Thus, very simple and
efficient layout in VLSI
Can be easily and efficiently pipelined

56
Array (or Tree) Multiplier Structure
[Figure: multiplier structure: multiple-forming circuits (mux), partial product array, reduction tree (log N), and a fast carry-propagate adder (CPA) (log N)]
57
Partial Product (PP) Generation

Each row in the partial-product array is either a
copy of the multiplicand or a row of zeros

Careful optimization of the PP generation can
lead to some substantial delay and area
reduction.
Booth's and modified Booth's recoding
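
Without recoding, partial-product generation is just an AND of the multiplicand with each multiplier bit. A minimal Python sketch, with an illustrative function name, reproduces the rows of the 101010 × 1011 example:

```python
def partial_products(x: int, y: int, n: int) -> list[int]:
    """Row j is the multiplicand shifted left by j if bit j of y is 1, else 0."""
    return [(x << j) if (y >> j) & 1 else 0 for j in range(n)]


rows = partial_products(0b101010, 0b1011, 4)
# Summing the rows gives the product; the rows themselves are independent,
# which is why they can be formed in parallel.
assert sum(rows) == 0b101010 * 0b1011
```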

58
Array Multiplier Implementation
Assume $t_{add} = t_{carry}$
$t_{array\_mult} = [(M - 1) + (N - 2)]\,t_{carry} + (N - 1)\,t_{sum} + t_{and} = O(N)$
59
Carry-Save Multiplier

The idea is to save the (PP) carry and add it
in the next adder stage
In the final addition a fast carry-propagate
(e.g., carry-lookahead) adder is used.

$t_{CSM} = (N - 1)\,t_{carry} + t_{merge} + t_{and} = O(N)$
60
CSM Floorplan
Regularity makes the generation of structure
amenable to automation
61
Wallace-Tree Multiplier
Goal: Minimize depth (number of stages) with the minimum number of adder elements
62
Wallace-Tree Multiplier Implementation
3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
Any type of adder can be used for the final adder
63
Notes on Wallace-Tree Multiplier

The Wallace tree substantially saves hardware for large multipliers.
The number of partial products is reduced to two-thirds per stage.
The propagation delay is found to be bounded by

$t_{WTM} = O(\log_{3/2} N)$

Although substantially faster than the CSM, the WTM structure is very irregular.
Difficulty in finding an efficient VLSI layout
Many of today's high-performance multipliers use higher-order (e.g., 4-2) compressors instead of 3-2 compressors (FAs).
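
The logarithmic depth can be illustrated by counting reduction stages. This Python sketch uses a simplified row-level model, which is an assumption: every group of three rows is compressed to two by 3-2 compressors, and leftover rows pass through unchanged.

```python
def wallace_stages(n: int) -> int:
    """Stages of 3-2 compression needed to reduce n partial-product rows to 2."""
    rows, stages = n, 0
    while rows > 2:
        # Each full group of 3 rows becomes 2; the remainder is carried over.
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages
```

Under this model a 4-row array needs 2 stages, 8 rows need 4, and 32 rows need 8, so the stage count grows roughly as log base 3/2 of N.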

64
Parallel Programmable Shifters

Shifting a data word left or right over a
constant amount is a trivial hardware operation
and is implemented by the appropriate signal
wiring
Shifters are used in multipliers, floating point
units

Consume lots of area if done in random logic gates
65
A Programmable Binary Shifter
Exactly one signal is active
66
4-bit Barrel Shifter
Example:
Sh0 = 1: B3B2B1B0 = A3A2A1A0
Sh1 = 1: B3B2B1B0 = A3A3A2A1
Sh2 = 1: B3B2B1B0 = A3A3A3A2
Sh3 = 1: B3B2B1B0 = A3A3A3A3
Arithmetic shift
Area dominated by wiring
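
The shift table above can be modeled in a few lines of Python. The sketch takes a one-hot control list [Sh0, Sh1, Sh2, Sh3] and LSB-first data, and replicates the top bit (arithmetic right shift); the function name is illustrative.

```python
def barrel_shift(a_bits: list, sh: list[int]) -> list:
    """a_bits = [A0, A1, A2, A3]; sh is one-hot. Returns [B0, B1, B2, B3]."""
    assert sum(sh) == 1, "exactly one shift control signal must be active"
    k = sh.index(1)  # shift distance selected by the active control line
    n = len(a_bits)
    # B_i takes A_{i+k}; positions past the top are filled with the sign bit.
    return [a_bits[min(i + k, n - 1)] for i in range(n)]


a = ["A0", "A1", "A2", "A3"]
assert barrel_shift(a, [0, 1, 0, 0]) == ["A1", "A2", "A3", "A3"]
```

Each output picks exactly one input, mirroring the single pass transistor between any data input and any output in the barrel structure.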
68
Notes on Barrel Shifter

Note that a signal goes through at most one FET (so constant propagation delay, in theory).
Also note that the FET diffusion capacitance on an output wire increases linearly with the shift width, but the FET diffusion capacitance on the input data lines increases quadratically (i.e., $N^2$ for a circular shifter).
The size of the cell is bounded by the pitch of the metal wires.
A decoder is usually needed for the shift control signals, since the amount of shift is normally given as an (encoded) binary number.

69
4-bit Barrel Shifter Layout
$Width_{barrel} = 2\,p_m N$, where N = max shift distance and $p_m$ = metal pitch
70
8-bit Logarithmic Shifter
71
8-bit Logarithmic Shifter
72
8-bit Logarithmic Shifter Layout Slice
$Width_{log} = p_m(2K + (1 + 2 + \cdots + 2^{K-1})) = p_m(2K + 2^K - 1)$, where $K = \log_2 N$
73
Shifter Implementation Comparisons

The barrel shifter is better for small shifters (faster, not much bigger), while the log shifter is preferred for larger shifters.
Log shifters are always smaller.
For large shifters we may have to start worrying about the number of pass transistors in series.
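
The size claim can be checked numerically from the two layout width expressions, barrel width 2·pm·N and log width pm·(2K + 2^K − 1) with K = log2 N. In this Python sketch the function names are illustrative, widths are in units of the metal pitch, and N is assumed to be a power of two.

```python
def barrel_width(n: int, pm: float = 1.0) -> float:
    """Barrel shifter slice width: 2 * pm * N."""
    return 2 * pm * n


def log_width(n: int, pm: float = 1.0) -> float:
    """Logarithmic shifter slice width: pm * (2K + 2^K - 1), K = log2 N."""
    k = n.bit_length() - 1
    assert n == 1 << k, "N is assumed to be a power of two"
    return pm * (2 * k + 2 ** k - 1)


for k in range(1, 7):
    n = 2 ** k
    print(n, barrel_width(n), log_width(n))
```

For N = 8 the widths are 16 and 13 metal pitches. The gap widens as N grows, because the barrel width grows as 2N while the log width grows as roughly N.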

74
Decoders

Decodes inputs to activate one of many outputs

[Figure: 2-to-4 decoder with inputs In0 and In1]

Cost of a 2-to-4 decoder:
two inverters, four 2-input NAND gates, four inverters, plus enable logic
How about the cost for a 3-to-8, 4-to-16, etc. decoder?
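
A behavioral Python sketch of an n-to-2^n decoder follows, together with a rough gate count. The count extrapolates the 2-to-4 cost above to a flat, single-level structure, which is an assumption made here: real designs usually predecode to avoid wide NAND gates. Enable logic is not counted, and the function names are illustrative.

```python
def decode(inputs: list[int], enable: int = 1) -> list[int]:
    """inputs = [In0, In1, ...] LSB first; returns one-hot active-high outputs."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return [int(bool(enable) and i == index) for i in range(1 << len(inputs))]


def flat_decoder_gate_count(n: int) -> dict[str, int]:
    """Assumed flat structure: one n-input NAND plus one inverter per output."""
    return {
        "input_inverters": n,
        "nand_gates": 2 ** n,
        "nand_fan_in": n,
        "output_inverters": 2 ** n,
    }
```

Under this assumption a 4-to-16 decoder needs sixteen 4-input NAND gates, so both the gate count and the fan-in grow with n.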

75
Dynamic NOR Decoder
Active HIGH Outputs
Capacitance of the output wires increases
linearly with the decoder size
77
Dynamic NAND Decoder
78
Dynamic NAND Decoder
Active LOW Outputs
79
Notes on Dynamic Decoders

In the dynamic NOR decoder, a signal goes through at most one FET,
so the propagation delay is constant (in theory).
However, some output wires may have two or more parallel paths to GND, effectively shortening the transition time.
In contrast, a signal in the dynamic NAND decoder passes through a series of FETs.
The number of FETs rises linearly with the decoder size.
Thus it will be slower than the NOR implementation if the gate capacitance dominates the diffusion capacitance.
For the NAND decoder, all the input signals must be low during precharge, or else Vdd and GND will be connected!

80
Building Bigger Decoders
Active low enable, Active low output
Need to catch the output that goes to zero before
it precharges again
81
Layout of Bit-Sliced Datapaths
Must have enough drive capacity to handle large
fan-out
Horizontal gap for feeding signals to the cells
downstream
82
Optimizing Bit-sliced Datapaths
