CprE / ComS 583 Reconfigurable Computing - PowerPoint PPT Presentation

About This Presentation

Title:

CprE / ComS 583 Reconfigurable Computing

Description:

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #5 FPGA Arithmetic – PowerPoint PPT presentation

Slides: 29

Learn more at: https://www.ece.iastate.edu

Transcript and Presenter's Notes

Title: CprE / ComS 583 Reconfigurable Computing

1
CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture 5: FPGA Arithmetic
2
Quick Points

HW 1 due Thursday at 12:00pm
Any comments?

3
Recap

Cluster size of N = 6-8 is good, K = 4-5

4
LUT Mapping Techniques
5
LUT Mapping Techniques (cont.)
6
LUT Mapping Techniques (cont.)
7
Outline

Recap
Motivation
Carry / Cascade Logic
Addition
Ripple Carry
Carry Bypass
Carry Select
Carry Lookahead
Basic Multipliers

8
Motivation

Traditional microprocessors, DSPs, etc. don't use LUTs
Instead use a w-bit Arithmetic and Logic Unit (ALU)
Carry connections are hard-wired
No switches, no stubs, short wires

[Figure: one bit slice built from two 3-LUTs with inputs A, B, Cin and outputs Sum and Cout (chained to the next Cin), shown for two function groups: (1) AND2, OR2, XOR2 and (2) ADD, SUB, CMP]
9
Adder Delays

Assuming a ripple-carry adder
32-bit ALU delay: 6 ns
32 cascaded 4-LUTs: 32 x 2.5 ns = 80 ns
Motivates extra hardware to accelerate carry operations

Compare:
32-bit ALU (0.6 µm)
  Area optimized: 16 ns
  Delay optimized: 6 ns
4-LUT delay
  Logic delay: 2.0 ns
  Single channel delay: 0.5 ns
  2.5 ns per bit
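
To make the ripple-carry numbers on this slide concrete, here is a minimal Python sketch of a bit-by-bit ripple-carry adder together with the linear delay estimate. The function names and the `per_bit_ns` parameter are illustrative assumptions, not part of the lecture; the 2.5 ns default is the per-bit figure quoted on the slide.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, width, cin=0):
    """Add two width-bit integers one bit at a time, rippling the carry."""
    carry = cin
    total = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def ripple_delay_ns(width, per_bit_ns=2.5):
    """Assumed delay model: the carry passes through one LUT stage per bit."""
    return width * per_bit_ns


print(ripple_carry_add(0xFFFFFFFF, 1, 32))  # (0, 1)
print(ripple_delay_ns(32))                  # 80.0
```

The worst case is the first call: the carry generated at bit 0 has to travel through all 32 positions, which is why the delay estimate scales with the bit width.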
10
Altera Flex 8000 Carry Chain
11
Xilinx XC4000 Carry Chain
12
Cascades

Large fanin operations (reductions)
Decoding
Matching
Completion detection
Many-to-one reductions
Combining logic is simple
Makes use of dedicated paths

13
Altera Cascade Logic

LE delay: 2.4 ns
Cascade delay: 0.6 ns
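
As a rough illustration of a cascaded many-to-one reduction, the Python sketch below ANDs a wide input vector in 4-input chunks, one chunk per logic element (LE), and chains the partial results. The delay formula is an assumption made for this sketch (one full LE delay, then one cascade delay per additional LE); it is not taken from a datasheet, and the function names are hypothetical.

```python
def cascade_and(bits, lut_inputs=4):
    """AND-reduce a list of bits in LUT-sized chunks chained together."""
    result = 1
    for start in range(0, len(bits), lut_inputs):
        chunk = bits[start:start + lut_inputs]
        lut_out = int(all(chunk))  # one LUT computes the AND of its chunk
        result &= lut_out          # cascade gate combines it with the chain so far
    return result


def cascade_delay_ns(num_inputs, lut_inputs=4, le_ns=2.4, cascade_ns=0.6):
    """Assumed model: first LE costs le_ns, each further LE adds cascade_ns."""
    num_les = -(-num_inputs // lut_inputs)  # ceiling division
    return round(le_ns + (num_les - 1) * cascade_ns, 3)


print(cascade_and([1] * 16))          # 1
print(cascade_and([1] * 15 + [0]))    # 0
print(cascade_delay_ns(16))
```

Under this assumed model a 16-input match uses four LEs, and the three extra LEs each add only the short cascade delay rather than a full LE-plus-routing delay.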

14
Why Look at Arithmetic?

Parallelization
Specialization
Architecture
Size
Inputs
Adder problem: delay grows linearly with bit width
Solutions for larger adders
Pipelining
Carry bypass
Carry select

15
Adder Pipelining

Not as practical in ASIC world (registers are expensive)
Registers essentially free in FPGA logic blocks

[Figure: pipelined adder built from three adder stages with inputs A1/B1, A2/B2, A3/B3, sum outputs S1, S2, S3, and a Cout from each stage]
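
A minimal Python sketch of the pipelining idea follows: the operands are split into fixed-width slices, each stage adds one slice per clock cycle, and the carry is held in a register between stages. The three stages mirror the figure; the 4-bit slice width and the function name are assumptions chosen for illustration.

```python
def pipelined_add(a, b, stage_bits=4, stages=3):
    """Follow one operand pair through a pipelined adder, one stage per cycle."""
    mask = (1 << stage_bits) - 1
    carry_reg = 0  # pipeline register holding the carry between stages
    result = 0
    for stage in range(stages):
        shift = stage * stage_bits
        partial = ((a >> shift) & mask) + ((b >> shift) & mask) + carry_reg
        result |= (partial & mask) << shift
        carry_reg = partial >> stage_bits  # registered for the next stage
    return result, carry_reg


print(pipelined_add(0x123, 0x456))  # (1401, 0), i.e. 0x579
print(pipelined_add(0xFFF, 0x001))  # (0, 1)
```

The result for one operand pair still takes `stages` cycles to emerge, but the carry only has to ripple through one slice per cycle, and a new pair could enter the first stage every cycle.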
16
Carry Bypass Adders

If all the propagates are 1 (P0·P1·P2·P3 = 1) then Cout3 = Cin
Pi = Ai xor Bi
Skip all the carry logic
Inexpensive

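[Figure: 4-bit carry bypass adder; full adders on A0/B0 through A3/B3 ripple Cout0, Cout1, Cout2, and a 2:1 mux (inputs 0 and 1) selected by P0P1P2P3 chooses between the rippled carry and Cin to drive Cout3]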
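
The bypass condition can be checked with a short Python model of one 4-bit block. This is a behavioural sketch only; the function name and the LSB-first list representation are assumptions made for the example.

```python
def carry_bypass_block(a_bits, b_bits, cin):
    """Model one bypass block. Bit lists are LSB first.

    Returns (sum_bits, cout, bypassed).
    """
    propagate = [a ^ b for a, b in zip(a_bits, b_bits)]  # Pi = Ai xor Bi
    sum_bits = []
    carry = cin
    for a, b, p in zip(a_bits, b_bits, propagate):
        sum_bits.append(p ^ carry)
        carry = (a & b) | (p & carry)  # ordinary ripple carry
    bypassed = all(propagate)          # P0*P1*P2*P3 == 1
    cout = cin if bypassed else carry  # mux: skip the ripple path when bypassed
    return sum_bits, cout, bypassed


# 5 + 10 + 1: every bit position propagates, so Cout3 is just Cin.
print(carry_bypass_block([1, 0, 1, 0], [0, 1, 0, 1], 1))
# ([0, 0, 0, 0], 1, True)
```

When every propagate is 1 the rippled carry and Cin are always equal, so the mux does not change the result; it only shortens the path the carry has to take through the block.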
17
Carry Bypass Performance

Small hardware cost
16-bit add: 4 CLB overhead
32-bit add: 9 CLB overhead
Delay growth still linear, smaller slope

[Figure: plot of delay versus N-bit adder width, with one curve for ripple carry and one for carry bypass]
18
Carry Select Adders

Precompute addition value for (Cin = 0) case and (Cin = 1) case
Use mux to select between the two with actual Cin value
Cost of this approach?

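[Figure: 8-bit carry select adder; a lower adder on A0/B0 through A3/B3 with Cin, and two upper adders on A4/B4 through A7/B7, one with carry-in 0 and one with carry-in 1, feeding a mux that produces S4-7]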
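
A behavioural Python sketch of the 8-bit arrangement in the figure is given below. The helper names are hypothetical, and each 4-bit block is modelled with ordinary integer addition rather than gate-level logic.

```python
def add_block(a, b, cin, width):
    """Add two width-bit values plus a carry-in; returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add8(a, b, cin=0):
    """8-bit carry-select adder: the upper nibble is precomputed for both carries."""
    lo_sum, lo_cout = add_block(a & 0xF, b & 0xF, cin, 4)
    hi_a, hi_b = (a >> 4) & 0xF, (b >> 4) & 0xF
    hi_if_0 = add_block(hi_a, hi_b, 0, 4)  # speculative result for Cin = 0
    hi_if_1 = add_block(hi_a, hi_b, 1, 4)  # speculative result for Cin = 1
    hi_sum, cout = hi_if_1 if lo_cout else hi_if_0  # mux on the actual carry
    return (hi_sum << 4) | lo_sum, cout


print(carry_select_add8(0x0F, 0x01))  # (16, 0)
print(carry_select_add8(0xFF, 0x01))  # (0, 1)
```

As for the cost question on the slide, the sketch makes one answer visible: the upper block is instantiated twice and a mux is added, so the speed-up is paid for with roughly one extra adder per selected block.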
19
Linear Carry-Select

Adder delay = w, mux delay = 1

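[Figure: 32-bit linear carry-select adder in four 8-bit blocks (bits 7-0, 15-8, 23-16, 31-24), each with a carry-in-0 adder and a carry-in-1 adder; all adder outputs are ready at t8, and the mux outputs for S7-0, S15-8, S23-16, S31-24 are ready at t9, t10, t11, t12]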
20
Square-Root Carry Select

Each carry arrives when the corresponding sum is ready

[Figure: 32-bit square-root carry-select adder with blocks of increasing width from the LSB end (bits 3-0, 8-4, 14-9, 21-15, 29-22) plus a final 2-bit block (bits 31-30); adder outputs are ready at t4, t5, t6, t7, t8, and the mux outputs are ready at t5, t6, t7, t8, t9, t10]
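
The time labels in the linear and square-root carry-select figures can be reproduced with a small timing model, sketched here in Python. It assumes, as on the linear carry-select slide, that a w-bit block adder takes w time units and a mux takes 1; the function name is hypothetical and the block widths are read off the bit ranges in the two figures.

```python
def carry_select_times(block_widths, mux_delay=1):
    """Return the time at which each block's selected output is ready.

    block_widths lists the blocks from least to most significant.
    """
    times = []
    carry_ready = 0  # the external carry-in is available at time 0
    for width in block_widths:
        adders_ready = width  # both speculative adders finish after `width` units
        carry_ready = max(adders_ready, carry_ready) + mux_delay
        times.append(carry_ready)
    return times


print(carry_select_times([8, 8, 8, 8]))        # [9, 10, 11, 12]
print(carry_select_times([4, 5, 6, 7, 8, 2]))  # [5, 6, 7, 8, 9, 10]
```

In the linear case every block after the first waits on the incoming carry, while in the square-root case each block grows by one bit so that its adders finish just as the carry arrives. Under this model the 32-bit result is ready at t10 instead of t12.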
21
Constant Addition

If one operand is constant
More speed?
Less hardware?

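[Figure: 4-bit addition of A0-A3 with a constant operand (bits labeled 0, 1, 0, 1), built from one half adder (HA) and three full adders (FA), producing S0-S3 and carry C3]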
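
One way to see why a constant operand can save hardware is that each full adder collapses to a two-input cell once its constant bit is known. The Python sketch below illustrates this under that assumption; the function name is hypothetical and the cell equations are derived from the standard full-adder equations, not taken from the slide.

```python
def add_constant(a, const, width):
    """Add a known constant using per-bit cells specialised on the constant bit."""
    carry = 0
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        if (const >> i) & 1:
            # Constant bit is 1: sum = XNOR(a, carry), carry out = a OR carry
            s = 1 ^ a_i ^ carry
            carry = a_i | carry
        else:
            # Constant bit is 0: the cell is a half adder on (a, carry)
            s = a_i ^ carry
            carry = a_i & carry
        total |= s << i
    return total, carry


print(add_constant(0b0110, 0b0101, 4))  # (11, 0)
print(add_constant(0b1111, 0b0001, 4))  # (0, 1)
```

Each specialised cell depends on only two signals instead of three, which suggests how both the speed and the hardware questions on the slide might be answered.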
22
Multiplication

Shift and add operations
Need an N-bit adder, M cycles

  42            101010  Multiplicand (N bits)
x 10         x    1010  Multiplier (M bits)
                000000
               101010   Partial products
              000000
             101010
 420         110100100  Product (N+M bits)
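
The shift-and-add procedure can be sketched in a few lines of Python. The function name is hypothetical, and Python's unbounded integers stand in for the N-bit adder and accumulator a hardware version would use; one loop iteration corresponds to one of the M cycles.

```python
def shift_add_multiply(multiplicand, multiplier, m_bits):
    """Multiply by examining one multiplier bit per cycle, LSB first."""
    product = 0
    for cycle in range(m_bits):
        if (multiplier >> cycle) & 1:
            # Add the multiplicand, shifted to this bit's weight
            product += multiplicand << cycle
        # A 0 bit contributes an all-zero partial product, so nothing is added
    return product


result = shift_add_multiply(0b101010, 0b1010, 4)
print(result)       # 420
print(bin(result))  # 0b110100100
```

For the slide's example, only multiplier bits 1 and 3 are set, so the product is (42 << 1) + (42 << 3) = 84 + 336 = 420.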
23
Array Multiplier
X0
X1
X2
X3
Y0