Title: Ch. 8: Design of Fast Arithmetic Units
- Fast addition
- Carry lookahead adder
- Carry save adder
- Fast multiplication
- Combinational and sequential multipliers
- Multiplier recoding (Booth, modified-Booth, etc.)
- Pipelining
- Basic concepts and methods for pipelining
- Pipelined multiply-accumulate unit example
- Verilog implementation
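Basic Adder Design
- Logic equations for a 1-bit full adder
- $\text{sum} = \bar{a}\,\bar{b}\,c_{in} + \bar{a}\,b\,\bar{c}_{in} + a\,\bar{b}\,\bar{c}_{in} + a\,b\,c_{in} = a \oplus b \oplus c_{in}$
- $c_{out} = a\,b + a\,c_{in} + b\,c_{in}$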
- Implementation
- Figure 8.1 (p. 301)
- (a) One 3-input EX-OR gate for the sum, and three AND gates and an OR gate for the carry-out (cout)
- (b) A ripple-carry adder design used to implement a multiple-bit adder
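The two structures above can be sketched at the bit level. This is a minimal Python model rather than a hardware description; the function names `full_adder` and `ripple_carry_add` and the default width `n=8` are illustrative assumptions.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, cout) for bits a, b, cin in {0, 1}."""
    s = a ^ b ^ cin                          # one 3-input EX-OR for the sum
    cout = (a & b) | (a & cin) | (b & cin)   # three ANDs feeding one OR
    return s, cout


def ripple_carry_add(a, b, cin=0, n=8):
    """Add two n-bit unsigned integers by chaining n full adders."""
    result = 0
    carry = cin
    for i in range(n):                       # bit 0 (LSB) first
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry                     # n-bit sum and final carry-out


print(ripple_carry_add(0b1011, 0b0110, n=4))  # (1, 1): 11 + 6 = 17 = 1_0001
```

Each carry must pass through every lower bit position before the top sum bit is valid, which is the delay that the faster adder designs try to avoid.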
Carry Lookahead Adder
- Carry Lookahead
- Carry chain of a ripple-carry adder is broken by computing the carry-in input of a full adder module directly from the primary inputs
- Uses the concepts of a generate term (g) and a propagate term (p)
- $g_i = a_i b_i$
- if $g_i = 1$, a carry is generated independent of cin
- $p_i = a_i + b_i$
- if $p_i = 1$, cin is propagated to the output carry
Carry-in input for the ith full adder computed directly from the primary inputs ($a_i$, $b_i$, cin) or, more accurately, from ($g_i$, $p_i$, cin)
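As a worked expansion, the first four carries can be written out by unrolling the carry recurrence. This sketch assumes $c_0 = c_{in}$, the recurrence $c_{i+1} = g_i + p_i c_i$, and the usual convention that $+$ is OR and juxtaposition is AND.

```latex
\begin{align*}
c_{i+1} &= g_i + p_i c_i \\
c_1 &= g_0 + p_0 c_0 \\
c_2 &= g_1 + p_1 g_0 + p_1 p_0 c_0 \\
c_3 &= g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0 \\
c_4 &= g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 c_0
\end{align*}
```

Every carry is a two-level AND-OR function of the $g$, $p$, and $c_0$ signals, so no carry waits on another carry.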
CLA Implementation
- A direct implementation of the carry lookahead logic equations requires AND and OR gates with up to (n+1) inputs for an n-bit CLA
- This type of hardware implementation is not practical (for large n) for most types of technologies used to create logic gate circuits
- Alternative designs
- A circuit with k m-bit CLA stages (k = n/m) with ripples (of the carry bit) between adjacent stages
- A hierarchical circuit with k m-bit CLA segments and a tree of carry lookahead generation logic blocks used to generate the carries into each CLA segment
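The first alternative (m-bit CLA stages with a carry rippling between them) can be sketched as a bit-level Python model. The names `cla_block` and `staged_cla_add`, the OR form of the propagate term, and the example sizes n = 16, m = 4 are assumptions made for illustration.

```python
def cla_block(a, b, cin, m):
    """m-bit CLA stage: every carry is formed from g, p and cin only."""
    g = [(a >> i) & (b >> i) & 1 for i in range(m)]      # generate terms
    p = [((a >> i) | (b >> i)) & 1 for i in range(m)]    # propagate terms (OR form)
    c = [cin]
    for i in range(m):
        term, prod = 0, 1          # prod is the running product p_i p_{i-1} ...
        for j in range(i, -1, -1):
            term |= prod & g[j]    # adds the term p_i ... p_{j+1} g_j
            prod &= p[j]
        c.append(term | (prod & cin))   # last term: p_i ... p_0 cin
    s = 0
    for i in range(m):
        s |= (((a >> i) ^ (b >> i) ^ c[i]) & 1) << i
    return s, c[m]


def staged_cla_add(a, b, n=16, m=4, cin=0):
    """n-bit adder built from k = n/m CLA stages; the carry ripples between stages."""
    assert n % m == 0
    mask = (1 << m) - 1
    total, carry = 0, cin
    for stage in range(n // m):
        shift = stage * m
        s, carry = cla_block((a >> shift) & mask, (b >> shift) & mask, carry, m)
        total |= s << shift
    return total, carry


print(staged_cla_add(40000, 30000))  # (4464, 1): 70000 = 65536 + 4464
```

Within a stage the widest gate has m + 1 inputs regardless of n, at the cost of a k-stage ripple between stages.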
Carry-Save Adder (CSA)
- Carry-save adder (CSA) structure is useful for applications requiring a large number of additions
- Three n-bit numbers can be added to produce two n-bit results
- $\text{sum}_i$ values result in one n-bit result
- $\text{cout}_i$ values result in a second n-bit result
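A minimal sketch of one carry-save step in Python, assuming unsigned n-bit operands; `carry_save_add` is an illustrative name. Each bit position is an independent full adder, so no carry propagates inside the step.

```python
def carry_save_add(x, y, z, n):
    """Reduce three n-bit numbers to an n-bit sum vector and an n-bit carry vector."""
    mask = (1 << n) - 1
    sum_vec = (x ^ y ^ z) & mask                        # sum_i of each full adder
    carry_vec = ((x & y) | (x & z) | (y & z)) & mask    # cout_i of each full adder
    return sum_vec, carry_vec


s, c = carry_save_add(5, 3, 6, n=4)
print(s, c)            # 0 7
print(s + (c << 1))    # 14: cout_i has weight 2^(i+1), so the carry vector is shifted
```

The two vectors still have to be combined by a conventional adder, but only once, after all the carry-save steps are done.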
Multiplication
- Multiplication of an n-bit multiplier A with an n-bit multiplicand B produces a 2n-bit product P
- $A = a_{n-1} a_{n-2} \ldots a_1 a_0$
- $B = b_{n-1} b_{n-2} \ldots b_1 b_0$
- $P = a_0 B + 2^1 a_1 B + \cdots + 2^{n-2} a_{n-2} B + 2^{n-1} a_{n-1} B$ (2n-bit product result)
- For multiplication of 2's complement signed numbers, the final operation (the $2^{n-1} a_{n-1} B$ term) is a subtraction instead of an addition. Why?
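In 2's complement, the most significant bit of the multiplier carries weight $-2^{n-1}$ rather than $+2^{n-1}$, which is why the last partial product is subtracted. A minimal Python sketch, assuming `a` is given as an n-bit 2's complement pattern and `b` as an ordinary signed integer (`multiply_signed` is an illustrative name):

```python
def multiply_signed(a, b, n):
    """Multiply n-bit 2's complement multiplier a by b, one partial product per bit."""
    a_bits = a & ((1 << n) - 1)        # n-bit pattern of the multiplier
    product = 0
    for i in range(n - 1):
        if (a_bits >> i) & 1:
            product += b << i          # + 2^i a_i B
    if (a_bits >> (n - 1)) & 1:
        product -= b << (n - 1)        # sign bit has weight -2^(n-1): subtract
    return product


print(multiply_signed(0b1101, 5, n=4))   # -15, since 1101 is -3 in 4-bit 2's complement
```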
Combinational Multiplier
- Also referred to as a parallel multiplier
- Implement multiplication using a 2-dimensional array of 1-bit full adders
- Basic structure is the same as the addition array shown on the previous page
- Can also be implemented using a linear array of carry-save adders as shown in Figure 8.3 (p. 308)
- Each partial product $P_i = 2^i a_i B$
- At first CSA level, add three partial products
- At all subsequent CSA levels, add one more partial product
- A CLA (or other type of 2-to-1, i.e., 2-input/1-output, adder) is required after the final CSA level
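The linear CSA array can be sketched in Python as follows. This is a behavioral model for unsigned operands, not the structure of Figure 8.3 itself; `csa` and `csa_array_multiply` are illustrative names, and Python's `+` stands in for the final CLA.

```python
def csa(x, y, z):
    """Carry-save step: returns (sum vector, carry vector already shifted left by 1)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def csa_array_multiply(a, b, n):
    """Unsigned n-bit multiply using a linear array of carry-save levels."""
    partials = [(b << i) if (a >> i) & 1 else 0 for i in range(n)]   # P_i = 2^i a_i B
    partials += [0] * max(0, 3 - n)          # pad so the first level has three inputs
    s, c = csa(partials[0], partials[1], partials[2])   # first level: three partial products
    for pp in partials[3:]:
        s, c = csa(s, c, pp)                 # each later level adds one more
    return s + c                             # final 2-input adder


print(csa_array_multiply(13, 11, n=4))   # 143
```

Only the last addition propagates carries across the full word; every CSA level has the delay of a single full adder.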
Sequential Multiplier
- Compute the sums as a sequence of separate steps
- Main benefit: only requires one 2-to-1 adder
- Much lower hardware cost for large n
- Use a register to store the partial products
- Use two registers to store the multiplier and multiplicand
- Requires a state machine (with the corresponding control logic) to control the sequence of additions used
- Block diagram shown in Figure 8.4 (p. 309)
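A behavioral sketch of the step-by-step scheme in Python, assuming unsigned operands and a shift-and-add control sequence; the register names are illustrative and are not taken from Figure 8.4. Each loop iteration models one state-machine step that uses the single adder at most once.

```python
def sequential_multiply(a, b, n):
    """Unsigned n-bit multiply, one conditional add per step."""
    mask = (1 << n) - 1
    multiplier = a & mask        # multiplier register (shifted right each step)
    multiplicand = b & mask      # multiplicand register (shifted left each step)
    product = 0                  # partial-product register
    for _ in range(n):           # n steps sequenced by the control logic
        if multiplier & 1:
            product = product + multiplicand   # the one shared adder
        multiplicand <<= 1
        multiplier >>= 1
    return product


print(sequential_multiply(13, 11, n=4))   # 143
```

The trade-off against the combinational design is n clock steps per product instead of one pass through an array of adders.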
Fast Multiplication
- Many different types of methods possible
- Most methods are based on recoding
- The multiplier bits are recoded (expressed using a different notational syntax) in order to reduce the total number of adds/subtracts required
- Note: addition of the partial product $P_i = 2^i a_i B$ is not required if $a_i = 0$
- Booth recoding
- Given a string of 1-bits in the multiplier A, addition of $P_i$ is not required for a bit i in the middle of the 1-bit string
Booth Recoding
- Ex: $A = 01110$ implies that $P_3$, $P_2$, and $P_1$ have to be added
- However, $P_4 - P_1 = P_3 + P_2 + P_1$ since $(2^4 - 2^1) B = (16 - 2) B = (2^3 + 2^2 + 2^1) B$
- $A' = 1\,0\,0\,\bar{1}\,0$ is the Booth-recoded multiplier
- 1 denotes an add and $\bar{1}$ denotes a subtract
- Also referred to as differential recoding since $A'$ is the 1-dimensional differentiation of A
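Booth recoding can be sketched in Python as a bitwise difference: each recoded digit is $a_{i-1} - a_i$ with $a_{-1} = 0$, giving digits in {-1, 0, 1}. The names `booth_recode` and `booth_multiply` are illustrative, and the multiplier is assumed to be an n-bit 2's complement pattern.

```python
def booth_recode(a, n):
    """Return the recoded digits, MSB first; -1 stands for the 'subtract' digit."""
    digits = []
    prev = 0                          # a_{-1} = 0
    for i in range(n):                # LSB first
        bit = (a >> i) & 1
        digits.append(prev - bit)     # differentiation of adjacent bits
        prev = bit
    return digits[::-1]


def booth_multiply(a, b, n):
    """Multiply using only the nonzero recoded digits (adds and subtracts)."""
    product = 0
    for i, d in enumerate(reversed(booth_recode(a, n))):
        if d == 1:
            product += b << i         # add 2^i B
        elif d == -1:
            product -= b << i         # subtract 2^i B
    return product


print(booth_recode(0b01110, 5))       # [1, 0, 0, -1, 0]
print(booth_multiply(0b01110, 7, 5))  # 98 = 14 * 7, using one add and one subtract
```

A side effect of the difference form is that the recoded digits sum to the 2's complement value of the multiplier, so negative multipliers need no separate correction step in this sketch.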
Modified-Booth Recoding
- Given $A = 01010101$, the Booth-recoded multiplier is $A' = 1\,\bar{1}\,1\,\bar{1}\,1\,\bar{1}\,1\,\bar{1}$
- More adds/subtracts for $A'$ than A!
- Identify isolated 1s and 0s as well as strings of 1s and strings of 0s
- Table 8.1 (p. 311) shows a modified-Booth recoding table and Table 8.2 shows the corresponding add/subtract operations
- Other recoding methods also possible
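The weakness of plain Booth recoding on alternating bit patterns can be checked by counting operations. The Python sketch below only compares operation counts; it does not implement the modified-Booth table, and the function names are illustrative.

```python
def booth_digits(a, n):
    """Plain Booth digits a_{i-1} - a_i (a_{-1} = 0), LSB first."""
    digits, prev = [], 0
    for i in range(n):
        bit = (a >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits


def ops_without_recoding(a, n):
    """One add per 1-bit in the multiplier."""
    return bin(a & ((1 << n) - 1)).count("1")


def ops_with_booth(a, n):
    """One add or subtract per nonzero recoded digit."""
    return sum(1 for d in booth_digits(a, n) if d != 0)


for a, n in [(0b01110, 5), (0b01010101, 8)]:
    print(format(a, "0{}b".format(n)), ops_without_recoding(a, n), ops_with_booth(a, n))
# 01110 3 2
# 01010101 4 8
```

Isolated 1s each become an add/subtract pair under plain Booth recoding, which is the case a modified scheme has to treat separately.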
Multiply-Accumulate
- Multiply-Accumulate (MAC) is a common operation in digital signal processing applications
- $D = A + B \times C$
- $A = A + B \times C$
- To implement a MAC, we simply have to perform one
more addition in addition to the addition of the
partial products
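The "one more addition" can be pictured as one extra row in the partial-product summation. A minimal Python sketch, assuming an unsigned n-bit multiplier `b`, a multiplicand `c`, and an accumulator `acc` that is added to the product (the operand roles and the name `mac` are assumptions):

```python
def mac(acc, b, c, n):
    """Return acc + b * c by summing the partial products plus one extra row."""
    rows = [(c << i) if (b >> i) & 1 else 0 for i in range(n)]   # partial products
    rows.append(acc)                     # the accumulate operand is just one more row
    total = 0
    for row in rows:
        total += row
    return total


# Accumulating a dot product, as in a DSP inner loop
acc = 0
for b, c in [(3, 5), (2, 7), (6, 6)]:
    acc = mac(acc, b, c, n=4)
print(acc)   # 65 = 15 + 14 + 36
```

In a carry-save array the extra row costs one additional CSA level rather than a second full carry-propagating adder.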
Pipelining
- A common method used to speed up digital logic hardware
- Works well if data can be worked on a little bit at a time, and there is a lot of data
- Types of pipelines
- Instruction Pipelines
- Used to speed up fetch-execute operation of CPUs
- Arithmetic Pipelines
- Used to speed up arithmetic operations
- E.g., the vector operation A B C
Structure of a Simple Pipeline
- Speedup
- Speedup(pipeline) = Time(no pipeline) / Time(pipeline)
- Space-Time Diagram
- Efficiency
- Ratio of numbered blocks to total number of space-time blocks
- What is the efficiency of an ideal m-stage pipeline operating on N data items?
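One way to work the efficiency question is to count space-time blocks. The Python sketch below assumes an ideal pipeline: every stage takes exactly one clock cycle, there are no stalls, and the non-pipelined unit needs m cycles per item. Under those assumptions the pipeline needs m + N - 1 cycles, so efficiency is N / (m + N - 1). The function names are illustrative.

```python
from fractions import Fraction


def pipeline_cycles(m, n_items):
    """Cycles to push n_items through an ideal m-stage pipeline."""
    return m + n_items - 1            # m cycles to fill, then one result per cycle


def speedup(m, n_items):
    """Time(no pipeline) / Time(pipeline), with m cycles per item unpipelined."""
    return Fraction(m * n_items, pipeline_cycles(m, n_items))


def efficiency(m, n_items):
    """Occupied space-time blocks divided by all space-time blocks."""
    busy = m * n_items                            # each item occupies each stage once
    total = m * pipeline_cycles(m, n_items)       # m stages over all cycles
    return Fraction(busy, total)


print(speedup(4, 100))      # 400/103
print(efficiency(4, 100))   # 100/103
print(efficiency(4, 1))     # 1/4
```

As N grows, efficiency approaches 1 and speedup approaches m, which matches the remark that pipelining pays off when there is a lot of data.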
Example Pipeline Structure
Linear Pipeline Structure for Floating-point Multiplication
Example Timing Diagram
After 4 clock cycles, there is one output result
every clock cycle
Pipelined MAC Unit
- Design code: refer to pp. 323-328
- Test bench: pp. 329-332
- Simulation output: Figure 8.9 (p. 329)
Verilog Code for a Pipeline
- Two major methods
- Write one large always block and describe each pipeline segment in a functional manner (perhaps using multiple calls to subroutines)
- Method used for pipelined MAC
- Write one always block for each segment
- Be careful of interactions between always blocks
- All always blocks execute concurrently
- In order to transfer data from one segment to the next, we MUST use pipeline registers
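The role of the pipeline registers can be illustrated with a small behavioral model. This is a Python sketch, not Verilog: `run_pipeline` and the four example stage functions are hypothetical. The key point it models is that every segment reads the register values from before the clock edge, and all registers are then updated together.

```python
def run_pipeline(stages, inputs):
    """Clock inputs through a linear pipeline with one register after each stage.

    Returns a list of (cycle, result) pairs; None marks an empty register.
    """
    regs = [None] * len(stages)                        # pipeline registers
    feed = list(inputs) + [None] * (len(stages) - 1)   # extra cycles to drain
    outputs = []
    for cycle, item in enumerate(feed, start=1):
        # Evaluate all segments from the CURRENT register contents...
        stage_in = [item] + regs[:-1]
        nxt = [f(x) if x is not None else None for f, x in zip(stages, stage_in)]
        # ...then update every register at once, as on a clock edge.
        regs = nxt
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
print(run_pipeline(stages, [1, 2, 3]))   # [(4, 1), (5, 9), (6, 25)]
```

With four segments the first result appears after cycle 4 and one more follows on each later cycle. If a segment read a value that another segment had already overwritten in the same cycle, data would skip a stage, which is the kind of interaction between concurrently executing blocks that the registers prevent.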