Title: CprE / ComS 583 Reconfigurable Computing
1CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture 5: FPGA Arithmetic
2Quick Points
- HW 1 due Thursday at 12:00pm
- Any comments?
3Recap
- Cluster size of N = 6-8 is good, K = 4-5
4LUT Mapping Techniques
5LUT Mapping Techniques (cont.)
6LUT Mapping Techniques (cont.)
7Outline
- Recap
- Motivation
- Carry / Cascade Logic
- Addition
- Ripple Carry
- Carry Bypass
- Carry Select
- Carry Lookahead
- Basic Multipliers
8Motivation
- Traditional microprocessors, DSPs, etc. don't use LUTs
- Instead use a w-bit Arithmetic and Logic Unit (ALU)
- Carry connections are hard-wired
- No switches, no stubs, short wires
[Figure: two bit-slice diagrams, each built from two 3-LUTs with inputs A, B (and Cin) and outputs Sum and Cout. Labels: (1) AND2, OR2, XOR2; (2) ADD, SUB, CMP.]
9Adder Delays
- Assuming a ripple-carry adder
- 32-bit ALU delay: 6 ns
- 32 cascaded 4-LUTs: 32 x 2.5 ns = 80 ns
- Motivates extra hardware to accelerate carry operations
Compare:
- 32-bit ALU (0.6 µm): 16 ns area optimized, 6 ns delay optimized
- 4-LUT delay: 2.0 ns logic delay + 0.5 ns single channel delay = 2.5 ns per bit
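The delay figures above can be reproduced with a small model. The Python sketch below builds a ripple-carry adder from single-bit full adders and estimates the LUT-based delay as bits x per-bit delay. The function names and the `per_bit_ns` parameter are illustrative assumptions, not part of the lecture.

```python
def full_adder(a, b, cin):
    """One bit slice: returns (sum, carry_out) for single-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, width, cin=0):
    """Add two width-bit integers by rippling the carry through each bit."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


def ripple_delay_ns(width, per_bit_ns=2.5):
    """Assumed worst case: the carry crosses one LUT plus one channel per bit."""
    return width * per_bit_ns


total, cout = ripple_carry_add(0xFFFFFFFF, 1, 32)
print(total, cout)            # 0 1
print(ripple_delay_ns(32))    # 80.0
```

The first call is the worst case for a ripple adder: the carry generated at bit 0 has to travel through all 32 positions.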
10Altera Flex 8000 Carry Chain
11Xilinx XC4000 Carry Chain
12Cascades
- Large fanin operations (reductions)
- Decoding
- Matching
- Completion detection
- Many-to-one reductions
- Combining logic is simple
- Makes use of dedicated paths
13Altera Cascade Logic
- LE delay: 2.4 ns
- Cascade delay: 0.6 ns
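To see why a dedicated cascade path helps with large fan-in reductions, here is a rough Python sketch of a wide AND built from 4-input LUTs chained through a cascade. The delay model (one full LE delay, then one cascade delay per additional LE) is an assumption made for illustration, as are the function names.

```python
def cascade_and(inputs, k=4):
    """Wide AND built from k-input LUTs chained through a cascade path.

    Returns (result, number_of_logic_elements).
    """
    result = 1
    groups = [inputs[i:i + k] for i in range(0, len(inputs), k)]
    for group in groups:
        lut_out = int(all(group))   # each LUT ANDs up to k inputs
        result &= lut_out           # cascade gate combines it with the chain
    return result, len(groups)


def cascade_delay_ns(num_les, le_ns=2.4, cascade_ns=0.6):
    """Assumed model: one LE delay plus one cascade delay per extra LE."""
    return round(le_ns + (num_les - 1) * cascade_ns, 3)


value, les = cascade_and([1] * 16)
print(value, les)               # 1 4
print(cascade_delay_ns(les))    # 4.2
```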
14Why Look at Arithmetic?
- Parallelization
- Specialization
- Architecture
- Size
- Inputs
- Adder problem: delay grows linearly with bit width
- Solutions for larger adders
- Pipelining
- Carry bypass
- Carry select
15Adder Pipelining
- Not as practical in ASIC world (registers are expensive)
- Registers essentially free in FPGA logic blocks
[Figure: pipelined adder with three stages. Inputs A1/B1, A2/B2, A3/B3; outputs S1, S2, S3; a Cout leaves each stage.]
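A minimal Python sketch of the idea, assuming three 8-bit stages and one new operand pair issued per clock. Each stage adds one chunk of bits and registers its carry for the next stage, so the clock period only has to cover one chunk. All names and parameters here are hypothetical.

```python
def pipelined_add(pairs, chunk_bits=8, stages=3):
    """Simulate a pipelined adder that handles one chunk of bits per stage.

    pairs: list of (a, b) operands, one issued per clock cycle.
    Returns (sums in issue order, total clock cycles).
    """
    mask = (1 << chunk_bits) - 1

    def stage(s, op):
        if op is None:
            return None                      # pipeline bubble
        a, b, partial, carry = op
        shift = s * chunk_bits
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        return (a, b, partial | ((total & mask) << shift), total >> chunk_bits)

    n = len(pairs)
    total_cycles = n + stages - 1 if n else 0
    regs = [None] * stages                   # pipeline registers after each stage
    sums = []
    for cycle in range(total_cycles):
        new_op = pairs[cycle] + (0, 0) if cycle < n else None
        inputs = [new_op] + regs[:-1]        # every stage reads the previous register
        regs = [stage(s, op) for s, op in enumerate(inputs)]
        if regs[-1] is not None:
            _, _, partial, carry = regs[-1]
            sums.append(partial | (carry << (stages * chunk_bits)))
    return sums, total_cycles


sums, cycles = pipelined_add([(0x00FFFF, 1), (0x123456, 0x0F0F0F)])
print([hex(s) for s in sums], cycles)   # ['0x10000', '0x214365'] 4
```

Latency is three cycles per addition, but a new result completes every cycle once the pipeline is full.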
16Carry Bypass Adders
- If all the propagates are 1 (P0P1P2P3 = 1) then Cout3 = Cin
- Pi = Ai xor Bi
- Skip all the carry logic
- Inexpensive
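[Figure: 4-bit carry-bypass block. Bit slices A0/B0 through A3/B3 ripple Cin through Cout0, Cout1, Cout2; a 2:1 mux controlled by P0P1P2P3 selects between the rippled carry and Cin to produce Cout3.]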
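The bypass condition can be checked with a short Python model. This is a functional sketch only (it does not model delay); `bypass_block` and `carry_bypass_add` are hypothetical names, and the 4-bit block size follows the slide.

```python
def bypass_block(a, b, cin, width=4):
    """One carry-bypass block: ripple the carry, but mux Cin straight to
    Cout when every bit position propagates (Pi = Ai xor Bi = 1)."""
    carry = cin
    total = 0
    all_propagate = 1
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p = ai ^ bi
        all_propagate &= p
        total |= (p ^ carry) << i
        carry = (ai & bi) | (p & carry)
    cout = cin if all_propagate else carry   # bypass mux
    return total, cout, all_propagate


def carry_bypass_add(a, b, width=16, block=4):
    """Chain bypass blocks to form a wider adder."""
    carry, result = 0, 0
    mask = (1 << block) - 1
    for lo in range(0, width, block):
        s, carry, _ = bypass_block((a >> lo) & mask, (b >> lo) & mask, carry, block)
        result |= s << lo
    return result, carry


print(bypass_block(0b1010, 0b0101, 1))   # (0, 1, 1)
print(carry_bypass_add(0x0FFF, 0x0001))  # (4096, 0)
```

The mux never changes the result, only how quickly Cout becomes valid: when all propagates are 1 the rippled carry equals Cin anyway.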
17Carry Bypass Performance
- Small hardware cost
- 16-bit add: 4 CLB overhead
- 32-bit add: 9 CLB overhead
- Delay growth still linear, smaller slope
[Plot: delay versus N-bit adder width, with one curve for ripple carry and one for carry bypass.]
18Carry Select Adders
- Precompute addition value for (Cin = 0) case and (Cin = 1) case
- Use mux to select between two with actual Cin value
- Cost of this approach?
[Figure: 8-bit carry-select adder. Bits 0-3 (A0-A3, B0-B3) add with Cin. Bits 4-7 (A4-A7, B4-B7) are added twice, once with carry-in 0 and once with carry-in 1, and a mux selects S4-7.]
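A functional Python sketch of the 8-bit arrangement in the figure, assuming a 4-bit low block and a duplicated 4-bit high block. The function name is hypothetical.

```python
def carry_select_add8(a, b, cin=0):
    """8-bit add: the low nibble adds normally; the high nibble is computed
    twice (carry-in 0 and carry-in 1) and a mux picks using the real carry."""
    lo = (a & 0xF) + (b & 0xF) + cin
    lo_sum, lo_carry = lo & 0xF, lo >> 4

    hi_a, hi_b = (a >> 4) & 0xF, (b >> 4) & 0xF
    hi0 = hi_a + hi_b        # speculative result for carry-in = 0
    hi1 = hi_a + hi_b + 1    # speculative result for carry-in = 1
    hi = hi1 if lo_carry else hi0   # mux driven by the actual carry

    return ((hi & 0xF) << 4) | lo_sum, hi >> 4


print(carry_select_add8(0x3F, 0x01))   # (64, 0)
print(carry_select_add8(0xFF, 0x01))   # (0, 1)
```

On the cost question: the model makes the overhead visible, since the upper block needs two adders plus a mux instead of one adder.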
19Linear Carry-Select
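- Adder delay = w (for a w-bit block), mux delay = 1
[Figure: 32-bit linear carry-select adder built from four 8-bit blocks (bits 7-0, 15-8, 23-16, 31-24 of A and B). Each block has one adder with carry-in 0 and one with carry-in 1, all finishing at t8. The block muxes settle at t9, t10, t11, t12, producing S7-0, S15-8, S23-16, S31-24.]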
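The timing labels in the figure follow from a simple recurrence, sketched here in Python. It assumes a w-bit adder takes w time units, every block starts at t = 0, and each mux waits for both its own adders and the carry from the previous mux. The function name is hypothetical.

```python
def carry_select_times(block_widths, mux_delay=1):
    """Return the time at which each block's mux output settles."""
    times = []
    carry_ready = 0                  # Cin is available at t = 0
    for w in block_widths:
        adders_ready = w             # both speculative adders finish at t = w
        out = max(adders_ready, carry_ready) + mux_delay
        times.append(out)
        carry_ready = out            # this mux output is the next block's select
    return times


print(carry_select_times([8, 8, 8, 8]))   # [9, 10, 11, 12]
```

With equal block sizes, every block after the first finishes its additions early and then sits idle waiting for the carry.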
20Square-Root Carry Select
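- Each carry arrives when the corresponding sum is ready
[Figure: 32-bit square-root carry-select adder with blocks of increasing width: bits 3-0, 8-4, 14-9, 21-15, 29-22, and 31-30 of A and B. Each block has one adder with carry-in 0 and one with carry-in 1; the adder pairs are labeled t4, t5, t6, t7, t8. The block muxes settle at t5, t6, t7, t8, t9, t10, producing S3-0, S8-4, S14-9, S21-15, S29-22, S31-30.]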
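The block sizes in the figure can be generated by growing each block by one bit, so that its adders finish just as the previous carry arrives. The Python sketch below assumes a w-bit adder takes w time units and a mux takes 1; the function names and the starting width of 4 are illustrative choices that match the figure.

```python
def sqrt_select_widths(n_bits, first=4):
    """Grow each block by one bit; the last block takes whatever is left."""
    widths, w = [], first
    remaining = n_bits
    while remaining > 0:
        take = min(w, remaining)
        widths.append(take)
        remaining -= take
        w += 1
    return widths


def final_time(widths, mux_delay=1):
    """Time at which the last mux settles, with all adders starting at t = 0."""
    t = 0
    for w in widths:
        t = max(w, t) + mux_delay
    return t


print(sqrt_select_widths(32))               # [4, 5, 6, 7, 8, 2]
print(final_time(sqrt_select_widths(32)))   # 10
print(final_time([8, 8, 8, 8]))             # 12
```

Under this model the graded blocks finish at t10, compared with t12 for four equal 8-bit blocks.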
21Constant Addition
- If one operand is constant
- More speed?
- Less hardware?
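[Figure: 4-bit adder chain HA, FA, FA, FA. Inputs A0-A3 are paired with constant bits (0, 1, 0, 1 in the figure); outputs S0-S3 and carry-out C3.]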
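One way to answer both questions: with a known constant bit, each full adder collapses to two-input logic. The Python sketch below specializes the per-bit equations on the constant bit (with B = 0 the cell is a half adder; with B = 1 the sum is an XNOR and the carry an OR). The function name is hypothetical.

```python
def add_constant(a, const, width=4):
    """Add a fixed constant using per-bit logic specialized to each constant bit."""
    result, carry = 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        if (const >> i) & 1:
            s = 1 ^ ai ^ carry        # XNOR of ai and carry
            carry = ai | carry        # carry-out simplifies to OR
        else:
            s = ai ^ carry            # half-adder sum
            carry = ai & carry        # half-adder carry
        result |= s << i
    return result, carry


print(add_constant(0b0011, 0b1010))   # (13, 0)
print(add_constant(0b0111, 0b1010))   # (1, 1)
```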
22Multiplication
- Shift and add operations
- Need N-bit adder, M cycles
      101010   Multiplicand (N bits)   42
  x     1010   Multiplier (M bits)     x 10
      000000   Partial products
     101010
    000000
   101010
   110100100   Product (N + M bits)    420
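The shift-and-add procedure can be sketched in a few lines of Python, with one loop iteration standing in for one cycle. The function name is hypothetical.

```python
def shift_add_multiply(multiplicand, multiplier, m_bits):
    """Multiply by examining one multiplier bit per cycle, LSB first."""
    product = 0
    for cycle in range(m_bits):
        if (multiplier >> cycle) & 1:
            product += multiplicand << cycle   # add the shifted partial product
    return product


print(shift_add_multiply(0b101010, 0b1010, 4))        # 420
print(bin(shift_add_multiply(0b101010, 0b1010, 4)))   # 0b110100100
```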
23Array Multiplier
- Area: N x M cells
- Delay: O(N + M)
[Figure: 4 x 4 array multiplier. Each row combines X0-X3 with one multiplier bit (Y0, Y1, Y2, Y3); product bits Z0, Z1, Z2, ... emerge from the array.]
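A bit-level Python sketch of the array structure, assuming each cell is an AND gate feeding a full adder and each row ripples its carry to the left. The N x M nested loop mirrors the N x M cell count. The function name is hypothetical.

```python
def array_multiply(x, y, n=4, m=4):
    """Bit-level array multiplier: one row of AND gates and adders per Y bit."""
    acc = [0] * (n + m)                       # running sum bits, LSB first
    for j in range(m):                        # one row per multiplier bit Yj
        yj = (y >> j) & 1
        carry = 0
        for i in range(n):                    # one cell per multiplicand bit Xi
            pp = ((x >> i) & 1) & yj          # AND gate forms the partial product bit
            total = acc[i + j] + pp + carry   # full adder inside the cell
            acc[i + j] = total & 1
            carry = total >> 1
        acc[j + n] = carry                    # row carry-out becomes the next sum bit
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(13, 11))   # 143
```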
24Carry-Save
[Figure: carry-save version of the 4 x 4 array multiplier, with inputs X0-X3 and Y0-Y3 and product bits Z0, Z1, Z2, ...]
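The carry-save idea is to keep the running total as two words (sum and carry) so that no carry propagates within a row, and to run a single carry-propagate adder at the end. A word-level Python sketch under that assumption, with hypothetical function names:

```python
def carry_save_add(a, b, c):
    """3:2 compressor on whole words: no carry crosses between bit positions."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def carry_save_multiply(x, y, m=4):
    """Accumulate partial products in carry-save form, then do one final add."""
    s, c = 0, 0
    for j in range(m):
        pp = (x << j) if (y >> j) & 1 else 0
        s, c = carry_save_add(s, c, pp)
    return s + c        # single carry-propagate adder at the bottom


print(carry_save_add(5, 3, 6))       # (0, 14)
print(carry_save_multiply(13, 11))   # 143
```

Each compressor step preserves the invariant sum_word + carry_word = a + b + c, which is why the final addition yields the product.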
25Multiplier Pipelining
- Register cost
- Multiplicand (N bits/stage x M stages)
- Multiplier (M^2 + M) / 2 bits
- Early output values (M^2 + M) / 2 bits
- Total M x (N + M + 1) bits
- Critical path is the max of
- DFF + FA + setup
- Bottom-level adder
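The register totals can be checked by counting stage by stage. The Python sketch below assumes that stage s still has to carry M - s multiplier bits forward and holds s + 1 finished output bits; that per-stage breakdown is an interpretation rather than something stated on the slide, and the function name is hypothetical.

```python
def pipeline_register_bits(n, m):
    """Count pipeline register bits for an N x M multiplier, stage by stage."""
    multiplicand = n * m                            # N bits held at each of M stages
    multiplier = sum(m - s for s in range(m))       # stage s still needs M - s bits
    early_outputs = sum(s + 1 for s in range(m))    # one more finished bit per stage
    return multiplicand, multiplier, early_outputs


parts = pipeline_register_bits(8, 8)
print(parts, sum(parts))   # (64, 36, 36) 136
```

For N = M = 8 the two triangular sums are each (M^2 + M) / 2 = 36, and the total matches M x (N + M + 1) = 136.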
26Constant Multiplier
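- Can greatly reduce the number of adders
- Removes all AND gates
[Figure: 4 x 4 array multiplier specialized for a constant Y, with Y0 = 0, Y1 = 1, Y2 = 0, Y3 = 1. Only the rows for Y1 and Y3 take X0-X3; product bits Z0, Z1, Z2, ...]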
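A short Python sketch of the specialization, assuming one shifted add is kept for each 1 bit of the constant and rows for 0 bits are dropped when the multiplier is built. The function names are hypothetical; the constant 1010 matches the Y bits in the figure.

```python
def constant_multiplier(constant):
    """Build a multiplier for a fixed constant: one shifted add per 1 bit."""
    shifts = [j for j in range(constant.bit_length()) if (constant >> j) & 1]

    def multiply(x):
        total = 0
        for j in shifts:          # rows for 0 bits were dropped at build time
            total += x << j
        return total

    return multiply, shifts


times_ten, shifts = constant_multiplier(0b1010)   # Y3..Y0 = 1, 0, 1, 0
print(shifts)           # [1, 3]
print(times_ten(13))    # 130
```

No AND with a Y bit remains in `multiply`, because the decision of which rows exist was made once, up front.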
27LUT-based Constant Multipliers
- k-LUT can perform constant multiply of k-bit number
- Break operand into k-bit quantities
- Example: 8-bit x 8-bit constant, k = 4
10101011 x NNNNNNNN
AAAAAAAAAAA (N x 1011 (LSN))
BBBBBBBBBBBBBBB (N x 1010 (MSN))
SSSSSSSSSSSSSSS Product
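The example can be sketched in Python by treating each bank of 4-LUTs as a 16-entry table indexed by one nibble of the operand. The constant value 200 and the function names are hypothetical, chosen only to make the example concrete.

```python
def build_table(constant, k=4):
    """Contents of a k-input LUT bank: constant x every possible k-bit value."""
    return [constant * v for v in range(1 << k)]


def lut_constant_multiply(operand, table, operand_bits=8, k=4):
    """Split the operand into k-bit pieces, look each one up, and add the
    shifted partial products."""
    mask = (1 << k) - 1
    product = 0
    for shift in range(0, operand_bits, k):
        piece = (operand >> shift) & mask
        product += table[piece] << shift
    return product


table = build_table(constant=200)                 # hypothetical 8-bit constant N
print(lut_constant_multiply(0b10101011, table))   # 34200
```

Here the low nibble 1011 looks up 200 x 11 = 2200 and the high nibble 1010 looks up 200 x 10 = 2000, which is shifted left by 4 bits before the final add.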
28Summary
- Latency overhead of programmable logic
- Several approaches to reducing design latency
- Fast carry
- Cascade
- Hardwired connections
- Multiplier optimization goals different from adder
- Other techniques
- Logarithmic v. linear (Wallace Tree multiplier)
- Data encoding (Booth's multiplier)