1
Chapter 12Arithmetic Building Blocks
  • Boonchuay Supmonchai
  • Integrated Design Application Research (IDAR)
    Laboratory
  • August 20, 2004 Revised - July 5, 2005

2
Goals of This Chapter
  • Designing for performance, area, or power
  • Adders
  • Multipliers
  • Shifters
  • Logic and System Optimizations for datapath
    modules
  • Power-Delay trade-offs in datapaths

3
Review: A Generic Processor
4
Bit-Sliced Architecture
n-bit Data In
Bit 0
Bit 1
Bit n-2
Bit n-1

Control
n-bit Data Out
  • Modular
  • Easy to design and verify
  • Easy to expand
  • Potential to be fast

5
Example: Itanium Bit-Sliced Design
6
Example: Itanium Integer Datapath
Itanium has 6 integer execution units (ALUs)
7
One-Bit Binary Full Adder (FA)
S = A ⊕ B ⊕ Cin
Cout = A·B + A·Cin + B·Cin
  • A VERY common operation - so worth spending some
    time trying to optimize
  • Often in the critical path, so need to look at
    both logic level and circuit level optimizations
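As a quick sanity check, the FA equations above can be modeled at the bit level; this is a behavioral sketch, not the CMOS circuit itself:

```python
def full_adder(a, b, cin):
    """One-bit full adder: S = A xor B xor Cin, Cout = AB + A*Cin + B*Cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

# Exhaustive check: 2*Cout + S must equal the arithmetic sum A + B + Cin
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```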

8
Propagate, Generate, and Delete (Kill)
  • Define 3 new variables which ONLY depend on A and B

Generate (G) = A·B
(FA itself generates a carry)
Propagate (P) = A ⊕ B
(FA passes along the incoming carry)
Delete (D) = !A·!B
(FA stops propagation of the carry)
  • Then we can write S and Cout in terms of G, P,
    and Cin

S(G, P, Cin) = P ⊕ Cin
Cout(G, P, Cin) = G + P·Cin
  • We can also write S and Cout in terms of D, P,
    and Cin
  • Sometimes an alternative definition for P can be
    used

Propagate (P) = A + B
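The G/P formulation can be checked against the arithmetic definition with a small script (an illustrative sketch; the function names are not from the slides):

```python
def gpd(a, b):
    """Generate, propagate (XOR form), and delete: depend ONLY on A and B."""
    return a & b, a ^ b, (1 - a) & (1 - b)

def fa_gp(a, b, cin):
    """S = P xor Cin, Cout = G + P*Cin."""
    g, p, _ = gpd(a, b)
    return p ^ cin, g | (p & cin)

# Agreement with the arithmetic definition for all 8 input combinations
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = fa_gp(a, b, cin)
            assert 2 * cout + s == a + b + cin
```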
9
FA CMOS Implementation First Try
32 Transistors
10
Improved CMOS Implementation
  • A more compact design is based on the observation
    that S can be factored to reuse the !Cout term:

S = A·B·Cin + !Cout·(A + B + Cin)

11
Improved CMOS Implementation II
28 Transistors
12
Notes on Improved CMOS FA
  • Note that the PMOS network is identical to the
    NMOS network rather than being the complement.
  • This is possible because of the inversion
    property which says that the function of
    complemented inputs is equal to the complement of
    the function.
  • This simplification reduces the number of series
    transistors and makes the layout more uniform
  • This design has a greater delay to compute S than
    Cout
  • Most of the time the extra delay computing S has
    little effect on the critical path because carry
    is the signal that propagates
  • With proper sizing this delay on S can be
    minimized

13
Inversion Property
  • Holds for self-dual functions such as S and Cout:
    complementing all inputs complements the output,
    e.g., !S(A, B, Cin) = S(!A, !B, !Cin)

14
TG-Based FA
Extra delay - slower
16 Transistors
15
Complementary PT Logic (CPL) FA
28 transistors dual rail
Voltage drop Problems
Faster, Lower Power, and small area than full
static CMOS
16
Delay Balanced FA
Identical Delays for Carry and Sum
202 transistors
17
Mirror Adder
PUN and PDN are symmetrical, not complemented
24 transistors
Cout = A·B + A·Cin + B·Cin
18
Mirror Adder Features
  • The NMOS and PMOS chains are completely
    symmetrical with a maximum of two series
    transistors in the carry circuitry, guaranteeing
    identical rise and fall transitions if the NMOS
    and PMOS devices are properly sized.
  • When laying out the cell, the most critical issue
    is the minimization of the capacitances at node
    !Cout (four diffusion capacitances, two internal
    gate capacitances, and two inverter gate
    capacitances).
  • Shared diffusions can reduce the stack node
    capacitances.
  • The transistors connected to Cin are placed
    closest to the output.

19
Mirror Adder Sizing Issues
  • Only the transistors in the carry stage have to
    be optimized for optimal speed. All transistors
    in the sum stage can be minimal size.
  • Assume PMOS/NMOS ratio of 2. Each input in the
    carry circuit has a logical effort of 2 so the
    optimal fan-out for each is also 2.
  • Since !Cout drives 2 internal and 2 inverter
    transistor gates (to form Cout for the bit adder)
    the carry circuit should be oversized

20
Mirror Adder Stick Diagram
21
Ripple Carry Adder (RCA)
t_ripple ≈ t_FA(A,B→Cout) + (N - 2)·t_FA(Cin→Cout) + t_FA(Cin→S)
Worst-case delay: t_ripple = O(N)
Slow!
22
Exploiting the Inversion Property
regular cell
inverted cell
  • Now need two flavors of FAs
  • Minimizes the critical path (the carry chain) by
    eliminating inverters between the FAs
  • Requires increasing the transistor sizes on the
    carry chain portion of the mirror adder.

23
Fast Carry Chain Design
  • The key to fast addition is a low latency carry
    network
  • What matters is whether in a given position a
    carry is
  • Generated: Gi = Ai·Bi
  • Propagated: Pi = Ai ⊕ Bi (sometimes Ai + Bi is used)
  • Annihilated (killed): Ki = !Ai·!Bi
  • Giving a carry recurrence of

C_i+1 = Gi + Pi·Ci
C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·C0
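The recurrence can be unrolled mechanically; this sketch computes all carries LSB-first and checks them against ordinary integer addition:

```python
def carry_chain(A, B, c0=0):
    """All carries from C[i+1] = G[i] + P[i]*C[i]; A, B are bit lists, LSB first."""
    c = [c0]
    for a, b in zip(A, B):
        g, p = a & b, a ^ b
        c.append(g | (p & c[-1]))
    return c

# 1011 (11) + 0101 (5): sum bits are P[i] xor C[i], final carry is C[4]
A, B = [1, 1, 0, 1], [1, 0, 1, 0]
c = carry_chain(A, B)
total = sum(((a ^ b ^ ci) << i) for i, (a, b, ci) in enumerate(zip(A, B, c)))
assert total + (c[-1] << len(A)) == 11 + 5
```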
24
Manchester Carry Chain
  • Switches controlled by Gi and Pi
  • Components of total delay
  • time to form the switch control signals Gi and Pi
  • setup time for the switches
  • signal propagation delay through N switches in
    the worst case

25
4-bit Sliced MCC Adder
26
Domino MCC Circuit
[Figure: domino Manchester carry chain circuit with stage sizing annotations]
27
MCC Stick Diagram
28
Notes on MCC Adder
  • When the clock is low, the carry nodes precharge;
    when the clock goes high, if Gi is high, C_i+1 is
    asserted (goes low)
  • To prevent Gi from affecting Ci, the signal Pi
    must be computed as the XOR (rather than the OR),
    which is not a problem since we need the XOR of
    Ai and Bi for computing the sum anyway
  • Delay is roughly proportional to N² (as N pass
    transistors are connected in series)
  • We usually limit each group to 4 stages, then
    buffer the carry chain with an inverter between
    each group

29
Binary Adder Landscape
[Figure: binary adder landscape of synchronous word-parallel adders, spanning t = O(N), A = O(N); t = O(√N), A = O(N); and t = O(1), A = O(N) designs]
30
Carry-Skip (Carry-Bypass) Adder
If (P0·P1·P2·P3 = 1) then Co,3 = Ci,0;
otherwise the block itself kills or generates the
carry internally
31
Carry-Skip Chain Implementation
Only 10 to 20% area overhead
Only two gate delays to produce Cout if a skip occurs
32
4-bit Block Carry-Skip Adder
Worst-case delay: carry from bit 0 to bit 15;
carry generated in bit 0, ripples through bits 1,
2, and 3, skips the middle two groups (B is the
group size in bits), ripples in the last group
from bit 12 to bit 15
t_add = t_setup + B·t_carry + ((N/B) - 1)·t_skip + B·t_carry + t_sum
33
Optimal Block Size and Time
  • Assuming one stage of ripple (t_carry) has the
    same delay as one skip logic stage (t_skip) and
    both are 1:
  • t_CSkA = 1 + B + (N/B - 1) + B + 1
    (t_setup) (ripple in block 0) (skips) (ripple in
    last block) (t_sum)
  • = 2B + N/B + 1
  • So the optimal block size, B, is
  • dt_CSkA/dB = 0 → B_opt = √(N/2)
  • And the optimal time is
  • Optimal t_CSkA = 2·√(2N) + 1
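With unit ripple and skip delays, the derivation above can be checked numerically; for example, N = 32 gives B_opt = 4 and a delay of 17 (a sketch of the closed-form result):

```python
import math

def t_cskip(N, B):
    """t = tsetup + ripple in block 0 + skips + ripple in last block + tsum."""
    return 1 + B + (N // B - 1) + B + 1

N = 32
B_opt = math.sqrt(N / 2)               # sqrt(16) = 4
t_opt = 2 * math.sqrt(2 * N) + 1       # 2*8 + 1 = 17
assert B_opt == 4.0
assert t_cskip(N, 4) == t_opt == 17
```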

34
Variations of Carry-Skip Adders I
  • Variable block sized Carry-Skip Adders
  • A carry that is generated in, or absorbed by, one
    of the inner blocks travels a shorter distance
    through the skip blocks
  • Hence a CSA adder can have bigger blocks for the
    inner carries without increasing the overall delay

t_CSkA = 2B + O(√N)
35
Variations of Carry-Skip Adders II
  • Multiple levels of skip logic
  • CSkAs with a large number of bits suffer from
    linear carry-propagation delay.
  • By adding higher levels of skip logic, a CSkA can
    skip more blocks at a time.

[Figure: two-level carry-skip adder from Cin to Cout; skip level 2 is the AND of the first-level skip signals (BPs)]

t_CSkA = 2B + O(log_B N)
36
Carry-Skip Adder Comparisons
37
Carry Select Adders
  • Idea: Precompute the carry-out of each block for
    both carry_in = 0 and carry_in = 1 (can be done
    for all blocks in parallel) and then select the
    correct one
  • More cost effective than the ripple carry adder
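A behavioral sketch of one select block; in hardware the two ripple results are computed in parallel and a mux picks the correct one:

```python
def select_block(A, B, cin):
    """Carry-select block: ripple for cin=0 and cin=1, then pick with the real cin."""
    def ripple(c):
        s = []
        for a, b in zip(A, B):            # LSB first
            s.append(a ^ b ^ c)
            c = (a & b) | (c & (a ^ b))
        return s, c
    s0, c0 = ripple(0)                    # both precomputed...
    s1, c1 = ripple(1)
    return (s1, c1) if cin else (s0, c0)  # ...then a mux selects

s, cout = select_block([1, 1, 1, 1], [0, 0, 0, 0], cin=1)
assert (s, cout) == ([0, 0, 0, 0], 1)    # 1111 + 0000 + 1 = 10000
```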

38
Carry Select Adder Critical Path
t_add = t_setup + B·t_carry + (N/B)·t_mux + t_sum
39
Square Root Carry Select Adders
Balance delay by making later blocks bigger
t_add = t_setup + 2·t_carry + √N·t_mux + t_sum
40
Adder Delays - Comparison
41
LookAhead - Basic Idea
Co,k = f(Ak, Bk, Co,k-1) = Gk + Pk·Co,k-1
42
Look-Ahead Topology
By expanding carry generation all the way:
C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·C0
43
Logarithmic Look-Ahead Adder
44
Parallel Prefix Adders (PPAs)
  • Define carry operator • on (G, P) signal pairs:

(G'', P'') • (G', P') = (G'' + P''·G', P''·P')

  • • is associative, i.e.,
    ((g''', p''') • (g'', p'')) • (g', p') =
    (g''', p''') • ((g'', p'') • (g', p'))
45
PPA General Structure
  • Given P and G terms for each bit position,
    computing all the carries is equal to finding all
    the prefixes in parallel
  • (G0, P0) • (G1, P1) • (G2, P2) • … • (GN-2, PN-2) • (GN-1, PN-1)
  • Since • is associative, we can group them in any
    order
  • but note that it is not commutative
  • Measures to consider
  • number of cells
  • tree cell depth (time)
  • tree cell area
  • cell fan-in and fan-out
  • max wiring length
  • wiring congestion
  • delay path variation (glitching)
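A serial sketch of the prefix computation (a real PPA evaluates the • combinations as a parallel tree; the function names are illustrative):

```python
def dot(hi, lo):
    """(G'', P'') . (G', P') = (G'' + P''*G', P''*P'): associative, not commutative."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

def prefix_carries(A, B, c0=0):
    """Carries from the running prefix (Gi,Pi) . ... . (G0,P0); bits LSB first."""
    carries, acc = [c0], None
    for a, b in zip(A, B):
        gp = (a & b, a ^ b)
        acc = gp if acc is None else dot(gp, acc)
        carries.append(acc[0] | (acc[1] & c0))
    return carries

# Matches the ripple recurrence for 1011 (11) + 0101 (5)
assert prefix_carries([1, 1, 0, 1], [1, 0, 1, 0]) == [0, 1, 1, 1, 1]
```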

46
Brent-Kung PPA
47
Kogge-Stone PPA
48
More Adder Comparisons
49
Adder Speed Comparisons
50
Adder Average Power Comparisons
51
PDP of Adder Comparisons
52
Binary Multiplication - Basics
  • Given two unsigned binary numbers X (M bits) and
    Y (N bits)
  • where Xi, Yj ∈ {0, 1}
  • The multiplication operation Z = X × Y is

Z = Σ(i=0..M-1) Σ(j=0..N-1) Xi·Yj·2^(i+j)
53
Binary Multiplication Operation
  • Binary Multiplication as repeated additions

multiplicand          1 0 1 0 1 0
multiplier        ×         1 0 1 1
                  -----------------
                      1 0 1 0 1 0
                    1 0 1 0 1 0
                  0 0 0 0 0 0
                1 0 1 0 1 0
                -------------------
product         1 1 1 0 0 1 1 1 0

partial product array can be formed in parallel;
the result is a double-precision product
54
Shift-and-Add Multiplication
  • Right shift and add (N bits × N bits)

A left-shift implementation requires a 2N-bit adder
t_shiftadd_mult = O(N·t_adder) = O(N²) for an RCA
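A behavioral sketch of shift-and-add, using the 42 × 11 example from the partial-product slide:

```python
def shift_add_multiply(x, y, n):
    """N-step shift-and-add: scan multiplier bits LSB first, add shifted multiplicand."""
    acc = 0
    for i in range(n):
        if (y >> i) & 1:
            acc += x << i   # add the i-th partial product
    return acc

assert shift_add_multiply(0b101010, 0b1011, 4) == 0b111001110   # 42 * 11 = 462
```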
55
Improving Multipliers
  • Making them faster (therefore, bigger area)
  • Use faster adders
  • Use higher radix (e.g., base 4) multiplication
  • Use multiplier recoding to simplify multiple
    formation
  • Form partial product array in parallel and add it
    in parallel
  • Making them smaller (i.e., slower)
  • Use array multipliers
  • Very regular structure with only short wires to
    nearest neighbor cells. Thus, very simple and
    efficient layout in VLSI
  • Can be easily and efficiently pipelined

56
Array (or Tree) Multiplier Structure
multiple forming circuits
partial product array reduction tree
mux reduction tree (log N) CPA (log N)
fast carry propagate adder (CPA)
57
Partial Product (PP) Generation
  • Each row in the partial-product array is either a
    copy of the multiplicand or a row of zeros
  • Careful optimization of the PP generation can
    lead to substantial delay and area reduction.
  • Booth's and modified Booth's recoding

58
Array Multiplier Implementation
Assume t_add = t_carry
t_array_mult = [(M - 1) + (N - 2)]·t_carry + (N - 1)·t_sum + t_and = O(N)
59
Carry-Save Multiplier
  • The idea is to save the partial-product (PP)
    carries and add them in the next adder stage
  • In the final addition a fast carry-propagate
    (e.g., carry-lookahead) adder is used.

t_CSM = (N - 1)·t_carry + t_merge + t_and = O(N)
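The 3:2 carry-save step can be sketched on whole words: carries are simply shifted and deferred to the next stage, with one carry-propagate add at the very end.

```python
def carry_save(x, y, z):
    """3:2 compressor across a word: (x, y, z) -> (sum, carry) with no propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

# Reduce the three nonzero partial products of 42 * 11, then one final CPA
pps = [42 << i for i in range(4) if (11 >> i) & 1]   # [42, 84, 336]
s, c = carry_save(*pps)
assert s + c == 42 * 11   # the final fast carry-propagate add
```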
60
CSM Floorplan
Regularity makes the generation of structure
amenable to automation
61
Wallace-Tree Multiplier
GOAL: Minimize depth (# of stages) with a minimum
number of adder elements
62
Wallace-Tree Multiplier Implementation
3 HAs and 3 FAs for the reduction process (stage 1
+ stage 2); any type of adder can be used for
the final adder
63
Notes on Wallace-Tree Multiplier
  • Wallace tree substantially saves hardware for
    large multipliers
  • Number of partial products is reduced to
    two-thirds at each stage
  • The propagation delay is found to be bounded by

t_WTM = O(log_3/2(N))
  • Although substantially faster than CSM, the WTM
    structure is very irregular
  • Difficult to find an efficient VLSI layout
  • Many of today's high-performance multipliers use
    higher-order (e.g., 4-2) compressors instead of
    3-2 compressors (FAs)
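The two-thirds reduction per stage is what gives the O(log_3/2 N) depth; this small helper (illustrative, not from the slides) counts the 3:2 reduction stages:

```python
def reduction_stages(rows):
    """Stages of 3:2 compression needed to reduce `rows` partial products to 2."""
    stages = 0
    while rows > 2:
        rows -= rows // 3     # every full group of 3 rows becomes 2
        stages += 1
    return stages

assert reduction_stages(4) == 2    # matches the 4x4 Wallace tree: stage 1 + stage 2
assert reduction_stages(3) == 1
```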

64
Parallel Programmable Shifters
  • Shifting a data word left or right over a
    constant amount is a trivial hardware operation
    and is implemented by the appropriate signal
    wiring
  • Shifters are used in multipliers, floating point
    units

Shifters consume lots of area if implemented in random logic gates
65
A Programmable Binary Shifter
Exactly one signal is active
66
4-bit Barrel Shifter
Example:
Sh0 = 1: B3B2B1B0 = A3A2A1A0
Sh1 = 1: B3B2B1B0 = A3A3A2A1
Sh2 = 1: B3B2B1B0 = A3A3A3A2
Sh3 = 1: B3B2B1B0 = A3A3A3A3
(Arithmetic shift)
Area dominated by wiring
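The shift table behaves like this sketch, with the word as an MSB-first bit list and Sh_k replicating the sign bit A3 into the vacated positions:

```python
def arith_shift_right(A, k):
    """Arithmetic right shift of MSB-first bits: fill vacated slots with A[0] (= A3)."""
    return [A[0]] * k + A[:len(A) - k]

A = [1, 0, 1, 0]                                 # A3 A2 A1 A0
assert arith_shift_right(A, 0) == [1, 0, 1, 0]   # Sh0: A3A2A1A0
assert arith_shift_right(A, 2) == [1, 1, 1, 0]   # Sh2: A3A3A3A2
assert arith_shift_right(A, 3) == [1, 1, 1, 1]   # Sh3: A3A3A3A3
```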
68
Notes on Barrel Shifter
  • Note that a signal goes through at most one FET,
    so constant propagation delay (in theory)
  • Also note that the FET diffusion capacitance on
    an output wire increases linearly with the shift
    width, but the FET diffusion capacitance on the
    input data lines increases quadratically (i.e.,
    N² for a circular shifter)
  • Size of a cell is bounded by the pitch of the
    metal wires.
  • A decoder is usually needed for the shift control
    signals since the shift amount is normally given
    as an (encoded) binary number.

69
4-bit Barrel Shifter Layout
Width_barrel = 2·p_m·N (N = max shift distance, p_m = metal pitch)
70
8-bit Logarithmic Shifter
71
8-bit Logarithmic Shifter
72
8-bit Logarithmic Shifter Layout Slice
Width_log = p_m·(2K + (1 + 2 + … + 2^(K-1))) = p_m·(2K + 2^K - 1), K = log2 N
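A logarithmic shifter composes stages of 1, 2, 4, … positions, one stage per bit of the (encoded) shift amount; this logical-shift sketch mirrors that structure:

```python
def log_shift_right(A, sh):
    """Logical right shift built from log2(N) stages; A is an MSB-first bit list."""
    n, k = len(A), 0
    while (1 << k) < n:
        if (sh >> k) & 1:             # stage k shifts by 2**k when its bit is set
            d = 1 << k
            A = [0] * d + A[:n - d]
        k += 1
    return A

assert log_shift_right([1, 0, 1, 1, 0, 1, 0, 1], 3) == [0, 0, 0, 1, 0, 1, 1, 0]
```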
73
Shifter Implementation Comparisons
  • Barrel shifter is better for small shifters
    (faster, not much bigger), while the log shifter
    is preferred for larger shifters.
  • Log shifters are always smaller
  • For large shifters we may have to start worrying
    about the number of pass transistors in series.

74
Decoders
  • Decodes inputs to activate one of many outputs

In0 In1
  • Cost of 2-to-4 Decoder
  • two inverters, four 2-input NAND gates, four
    inverters plus enable logic
  • how about cost for a 3-to-8, 4-to-16, etc.
    decoder?
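Behaviorally, an n-to-2^n decoder with enable is a one-hot selector; this sketch (names are illustrative) shows why the gate cost grows with 2^n outputs:

```python
def decode(bits, enable=1):
    """n-to-2^n decoder: one active-HIGH output when enabled; bits are MSB first."""
    idx = int("".join(str(b) for b in bits), 2)
    return [int(enable and i == idx) for i in range(2 ** len(bits))]

assert decode([0, 1]) == [0, 1, 0, 0]            # In1=0, In0=1 selects output 1
assert decode([1, 1]) == [0, 0, 0, 1]
assert decode([1, 1], enable=0) == [0, 0, 0, 0]  # disabled: no output active
```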

75
Dynamic NOR Decoder
Active HIGH Outputs
Capacitance of the output wires increases
linearly with the decoder size
76
Dynamic NOR Decoder
Active-HIGH outputs
With inputs 0, 1, 0, 1 applied, the precharged
outputs (1 1 1 1) evaluate to 0, 0, 0, 1: only the
selected output stays high
Capacitance of the output wires increases
linearly with the decoder size
77
Dynamic NAND Decoder
78
Dynamic NAND Decoder
Active-LOW outputs
With inputs 0, 1, 0, 1 applied, the precharged
outputs (1 1 1 1) evaluate to 1, 1, 1, 0: only the
selected output goes low
79
Notes on Dynamic Decoders
  • In Dynamic NOR decoder signal goes through at
    most one FET
  • So constant propagation delay (in theory)
  • However, some output wires may have two or more
    parallel paths to GND - effectively shortening
    the transition time
  • On the contrary, signal in dynamic NAND decoder
    pass through a series of FET
  • The number of FETs rises linearly with the
    decoder size
  • Thus it will be slower than the NOR
    implementation if the gate capacitance dominates
    diffusion capacitance
  • For the NAND decoder all the input signals must
    be low during precharge else Vdd and GND will be
    connected!

80
Building Bigger Decoders
Active low enable, Active low output
Need to catch the output that goes to zero before
it precharges again
81
Layout of Bit-Sliced Datapaths
Must have enough drive capacity to handle large
fan-out
Horizontal gap for feeding signals to the cells
downstream
82
Optimizing Bit-sliced Datapaths