CprE / ComS 583 Reconfigurable Computing - PowerPoint PPT Presentation

About This Presentation
Title:

CprE / ComS 583 Reconfigurable Computing

Description:

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #5 FPGA Arithmetic – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: CprE / ComS 583 Reconfigurable Computing


1
CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #5: FPGA Arithmetic
2
Quick Points
  • HW 1 due Thursday at 12:00pm
  • Any comments?

3
Recap
  • Cluster size of N = 6-8 is good, K = 4-5

4
LUT Mapping Techniques
5
LUT Mapping Techniques (cont.)
6
LUT Mapping Techniques (cont.)
7
Outline
  • Recap
  • Motivation
  • Carry / Cascade Logic
  • Addition
  • Ripple Carry
  • Carry Bypass
  • Carry Select
  • Carry Lookahead
  • Basic Multipliers

8
Motivation
  • Traditional microprocessors, DSPs, etc. don't use
    LUTs
  • Instead use a w-bit Arithmetic and Logic Unit
    (ALU)
  • Carry connections are hard-wired
  • No switches, no stubs, short wires

[Figure: pairs of 3-LUTs with inputs A, B, Cin and outputs Sum and Cout (chained to the next Cin); mode (1) AND2, OR2, XOR2; mode (2) ADD, SUB, CMP]
9
Adder Delays
  • Assuming a ripple-carry adder
  • 32-bit ALU delay: 6 ns
  • 32 cascaded 4-LUTs: 32 x 2.5 ns = 80 ns
  • Motivates extra hardware to accelerate carry
    operations

Compare:
32-bit ALU (0.6µ)
  • Area optimized: 16 ns
  • Delay optimized: 6 ns
4-LUT delay
  • Logic delay: 2.0 ns
  • Single channel delay: 0.5 ns
  • Per bit: 2.5 ns
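
As a rough illustration of the numbers above, the sketch below builds a ripple-carry adder from two 3-input lookup tables per bit (one for Sum, one for Cout) and estimates its delay using the 2.5 ns per-bit figure. The function names and the table encoding are assumptions made for this example, not something given in the lecture.

```python
# Full-adder truth tables, indexed by (a << 2) | (b << 1) | cin.
SUM_LUT = [0, 1, 1, 0, 1, 0, 0, 1]   # a xor b xor cin
COUT_LUT = [0, 0, 0, 1, 0, 1, 1, 1]  # majority(a, b, cin)

PER_BIT_DELAY_NS = 2.5  # 2.0 ns logic + 0.5 ns single channel, from the slide


def ripple_carry_add(a, b, width, cin=0):
    """Add two width-bit integers one bit at a time; return (sum, carry_out)."""
    total = 0
    carry = cin
    for i in range(width):
        index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | carry
        total |= SUM_LUT[index] << i
        carry = COUT_LUT[index]
    return total, carry


def ripple_delay_ns(width):
    """Worst case: the carry ripples through every bit position."""
    return width * PER_BIT_DELAY_NS


print(ripple_carry_add(42, 10, 8))  # (52, 0)
print(ripple_delay_ns(32))          # 80.0
```

The 80 ns result against a 6 ns ALU is the gap that dedicated carry hardware is meant to close.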
10
Altera Flex 8000 Carry Chain
11
Xilinx XC4000 Carry Chain
12
Cascades
  • Large fanin operations (reductions)
  • Decoding
  • Matching
  • Completion detection
  • Many-to-one reductions
  • Combining logic is simple
  • Makes use of dedicated paths

13
Altera Cascade Logic
  • LE delay: 2.4 ns
  • Cascade delay: 0.6 ns
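
To make the cascade idea concrete, here is a minimal sketch of a wide AND reduction in which each logic element (LE) handles one group of inputs and a cascade path combines the groups. The delay model (one full LE delay, then one cascade hop per additional LE) is an assumption for illustration; only the two delay figures come from the slide.

```python
LE_DELAY_NS = 2.4       # from the slide
CASCADE_DELAY_NS = 0.6  # from the slide


def cascade_and(bits, lut_inputs=4):
    """AND a wide input by chaining LUT-sized groups through a cascade path.

    Returns (result, number_of_logic_elements_used).
    """
    result = 1
    stages = 0
    for start in range(0, len(bits), lut_inputs):
        group = bits[start:start + lut_inputs]
        result &= int(all(group))  # one LUT ANDs its group; the cascade ANDs it in
        stages += 1
    return result, stages


def cascade_delay_ns(stages):
    """Assumed model: one full LE delay plus one cascade hop per extra LE."""
    return LE_DELAY_NS + (stages - 1) * CASCADE_DELAY_NS


result, stages = cascade_and([1] * 16)
print(result, stages, round(cascade_delay_ns(stages), 1))  # 1 4 4.2
```

The same structure covers decoding and matching: each LUT compares its slice of the input against the expected pattern and the cascade ANDs the per-slice results.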

14
Why Look at Arithmetic?
  • Parallelization
  • Specialization
  • Architecture
  • Size
  • Inputs
  • Adder problem: delay grows linearly with bit
    width
  • Solutions for larger adders
  • Pipelining
  • Carry bypass
  • Carry select

15
Adder Pipelining
  • Not as practical in ASIC world (registers are
    expensive)
  • Registers essentially free in FPGA logic blocks

[Figure: pipelined adder with three stages: inputs A1/B1, A2/B2, A3/B3, sum outputs S1, S2, S3, and a Cout from each stage]
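
A behavioral sketch of a pipelined adder may help here. It assumes three stages of 4 bits each, with a register after every stage holding the partial sum and the carry; one new operand pair enters per clock. The chunk size, stage count and the choice to carry the operands along in the registers are assumptions for this example.

```python
def pipelined_add(pairs, chunk_bits=4, stages=3):
    """Simulate a pipelined adder clock by clock; returns sums in input order.

    Stage k adds chunk k of both operands plus the carry registered by
    stage k-1. The final carry-out appears as the top bit of each result.
    """
    mask = (1 << chunk_bits) - 1
    regs = [None] * stages  # regs[k] holds (a, b, partial_sum, carry) after stage k
    results = []
    # Extra empty clocks at the end drain the pipeline.
    for item in list(pairs) + [None] * stages:
        if regs[-1] is not None:
            _, _, partial, carry = regs[-1]
            results.append(partial | (carry << (chunk_bits * stages)))
        first = None if item is None else (item[0], item[1], 0, 0)
        inputs = [first] + regs[:-1]  # each stage reads the previous register
        new_regs = []
        for k, state in enumerate(inputs):
            if state is None:
                new_regs.append(None)
                continue
            a, b, partial, carry = state
            shift = k * chunk_bits
            s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
            new_regs.append((a, b, partial | ((s & mask) << shift), s >> chunk_bits))
        regs = new_regs
    return results


print([hex(v) for v in pipelined_add([(0xFFF, 0x001), (0x123, 0x456)])])
# ['0x1000', '0x579']
```

Each result takes three clocks to emerge, but the clock period only has to cover a 4-bit carry chain, and a new addition can start every cycle.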
16
Carry Bypass Adders
  • If all the propagates are 1 (P0P1P2P3 = 1) then
    Cout3 = Cin
  • Pi = Ai xor Bi
  • Skip all the carry logic
  • Inexpensive

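[Figure: 4-bit carry-bypass block: full adders on A0/B0 through A3/B3 with internal carries Cout0, Cout1, Cout2; a 2:1 mux controlled by P0P1P2P3 selects between the rippled carry (0) and Cin (1) to produce Cout3]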
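
The bypass rule can be sketched as a single 4-bit block. This is a behavioral model only; the function name and return shape are assumptions for this example.

```python
def carry_bypass_block(a, b, cin, width=4):
    """One carry-bypass block; returns (sum, cout, bypassed)."""
    mask = (1 << width) - 1
    propagate = (a ^ b) & mask         # Pi = Ai xor Bi
    all_propagate = propagate == mask  # P0 P1 P2 P3 = 1
    # Ripple path: always computed, since it produces the sum bits.
    total = 0
    carry = cin
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | ((ai ^ bi) & carry)
    # Bypass mux: when every bit propagates, Cout is simply Cin.
    cout = cin if all_propagate else carry
    return total, cout, all_propagate


print(carry_bypass_block(0b1010, 0b0101, 1))  # (0, 1, True)
```

The mux never changes the logical value of the carry; its benefit is timing, because the block's carry-out no longer has to wait for the ripple when all bits propagate.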
17
Carry Bypass Performance
  • Small hardware cost
  • 16-bit add: 4 CLB overhead
  • 32-bit add: 9 CLB overhead
  • Delay growth still linear, smaller slope

[Figure: plot of Delay versus N-bit Adder width, with one curve for Ripple Carry and one for Carry Bypass]
18
Carry Select Adders
  • Precompute addition value for (Cin = 0) case and
    (Cin = 1) case
  • Use mux to select between two with actual Cin
    value
  • Cost of this approach?

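[Figure: 8-bit carry-select adder: bits A0/B0 through A3/B3 are added with the real Cin; bits A4/B4 through A7/B7 are added twice, once with carry-in 0 and once with carry-in 1, and a mux driven by the low block's carry-out selects S4-7]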
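
A minimal behavioral sketch of the 8-bit case follows, assuming two 4-bit blocks. The helper `add_block` is a stand-in for whatever adder each block uses; both names are hypothetical.

```python
def add_block(a, b, cin, width):
    """Reference width-bit adder; returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add8(a, b, cin=0):
    """8-bit add: low nibble uses the real Cin, high nibble is precomputed twice."""
    lo_sum, lo_cout = add_block(a & 0xF, b & 0xF, cin, 4)
    a_hi, b_hi = (a >> 4) & 0xF, (b >> 4) & 0xF
    hi_if_0 = add_block(a_hi, b_hi, 0, 4)  # speculative result for Cin = 0
    hi_if_1 = add_block(a_hi, b_hi, 1, 4)  # speculative result for Cin = 1
    hi_sum, cout = hi_if_1 if lo_cout else hi_if_0  # the mux
    return (hi_sum << 4) | lo_sum, cout


print(carry_select_add8(0x9C, 0x77))  # (19, 1), i.e. 0x113
```

On the cost question: the upper block's adder is instantiated twice and a mux is added, so the speedup is paid for with roughly double the adder hardware in every block after the first.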
19
Linear Carry-Select
  • Adder delay = w, mux delay = 1

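[Figure: 32-bit linear carry-select adder built from four 8-bit blocks (A/B 7-0, 15-8, 23-16, 31-24); every block has a carry-in 0 adder and a carry-in 1 adder, all ready at t8; the mux outputs settle at t9, t10, t11 and t12, producing S7-0, S15-8, S23-16 and S31-24]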
20
Square-Root Carry Select
  • Each carry arrives when the corresponding sum is
    ready

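[Figure: 32-bit square-root carry-select adder with block widths growing from the LSB: A/B 3-0 (4 bits), 8-4 (5 bits), 14-9 (6 bits), 21-15 (7 bits), 29-22 (8 bits), 31-30 (2 bits); the paired carry-in 0 and carry-in 1 adders are ready at t4, t5, t6, t7 and t8; the mux outputs settle at t5, t6, t7, t8, t9 and t10, producing S3-0 through S31-30]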
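
The arrival times for the linear and square-root layouts can be reproduced with a small timing model. It assumes a block of width w finishes both speculative sums at time w and that each mux adds one unit of delay once both its data and its select are ready; the function name is hypothetical.

```python
def carry_select_times(block_widths, mux_delay=1):
    """Return the time at which each block's mux output settles.

    block_widths is listed from the least significant block upward.
    """
    carry_ready = 0  # the external Cin is available at time 0
    times = []
    for width in block_widths:
        adder_ready = width  # both speculative sums finish after `width` units
        out = max(adder_ready, carry_ready) + mux_delay
        times.append(out)
        carry_ready = out    # this mux output is the next block's select
    return times


print(carry_select_times([8, 8, 8, 8]))        # [9, 10, 11, 12]
print(carry_select_times([4, 5, 6, 7, 8, 2]))  # [5, 6, 7, 8, 9, 10]
```

With equal 8-bit blocks the later blocks sit idle waiting for the select. Growing each block by one bit keeps the adder and the incoming carry ready at the same moment, so the 32-bit result arrives at t10 instead of t12.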
21
Constant Addition
  • If one operand is constant
  • More speed?
  • Less hardware?

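[Figure: 4-bit constant adder: inputs A0 through A3 combined with fixed constant bits (0 and 1 values) through one half adder (HA) and three full adders (FA), producing S0 through S3 and carry C3]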
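
One way to see the hardware saving is to specialize each bit position to its constant bit, so every cell needs only two variable inputs (the operand bit and the incoming carry). The sketch below assumes the constant 0b1010 purely as an example.

```python
def add_constant(a, constant, width):
    """Add a fixed constant using per-bit logic specialized to each constant bit.

    Returns (sum, carry_out).
    """
    total = 0
    carry = 0
    for i in range(width):
        ai = (a >> i) & 1
        if (constant >> i) & 1:
            s = 1 ^ ai ^ carry  # constant bit 1: sum is XNOR(ai, carry)
            carry = ai | carry  # carry out is OR
        else:
            s = ai ^ carry      # constant bit 0: plain half adder
            carry = ai & carry  # carry out is AND
        total |= s << i
    return total, carry


print(add_constant(0b0111, 0b1010, 4))  # (1, 1), i.e. 7 + 10 = 17
```

Each cell is a function of two variables instead of three, which is where the smaller logic comes from; the carry chain itself is still as long as before.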
22
Multiplication
  • Shift and add operations
  • Need N-bit adder, M cycles

    42         101010   Multiplicand (N bits)
  x 10   x       1010   Multiplier (M bits)
               000000
              101010    Partial products
             000000
            101010
   420      110100100   Product (N+M bits)
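
The shift-and-add procedure can be sketched in a few lines, with one loop iteration standing in for one cycle. The function name and argument order are assumptions for this example.

```python
def shift_add_multiply(multiplicand, multiplier, m_bits):
    """Multiply by examining one multiplier bit per cycle, LSB first."""
    product = 0
    for cycle in range(m_bits):
        if (multiplier >> cycle) & 1:
            # Add the multiplicand, shifted to this bit's weight.
            product += multiplicand << cycle
        # A 0 bit contributes an all-zero partial product, so nothing is added.
    return product


print(shift_add_multiply(0b101010, 0b1010, 4))  # 420
```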
23
Array Multiplier
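  • Area: N x M cells
  • Delay: O(N + M)

[Figure: 4 x 4 array multiplier: multiplicand bits X0 through X3 feed one row of adder cells per multiplier bit Y0 through Y3; product bits Z0, Z1, Z2 are shown leaving the array]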
24
Carry-Save
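[Figure: carry-save array multiplier: multiplicand bits X0 through X3, one row of adder cells per multiplier bit Y0 through Y3, with carries saved and passed down to the next row; product bits Z0, Z1, Z2 are shown]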
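
A word-level sketch of the carry-save idea: each row reduces three values to a sum word and a carry word without propagating carries sideways, and only the final step performs a carry-propagating add. The names below are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compression on whole words: returns (sum_word, carry_word).

    The invariant is sum_word + carry_word == x + y + z.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def carry_save_multiply(x, y, m_bits):
    """Accumulate partial products in carry-save form, then add once."""
    sum_word, carry_word = 0, 0
    for i in range(m_bits):
        partial = (x << i) if (y >> i) & 1 else 0
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial)
    return sum_word + carry_word  # the single carry-propagating addition


print(carry_save_multiply(13, 11, 4))  # 143
```

Because no row waits on a horizontal carry, each row costs one adder-cell delay; the long carry chain appears only once, in the final adder.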
25
Multiplier Pipelining
  • Register cost
    • Multiplicand: N bits/stage x M stages
    • Multiplier: (M^2 + M) / 2 bits
    • Early output values: (M^2 + M) / 2 bits
    • Total: M x (N + M + 1) bits
  • Critical path: max of
    • DFF + FA + setup
    • Bottom-level adder
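
As a quick check on the register count, the sketch below adds up the three contributions, taking the multiplier and early-output terms as (M^2 + M) / 2 each, and compares the result with the closed form M x (N + M + 1). The function name is hypothetical.

```python
def pipeline_register_bits(n, m):
    """Register bits for a fully pipelined N x M array multiplier."""
    multiplicand = n * m                 # N bits per stage, M stages
    multiplier = (m * m + m) // 2        # triangular: fewer bits remain each stage
    early_outputs = (m * m + m) // 2     # product bits finished before the last row
    return multiplicand + multiplier + early_outputs


n, m = 8, 8
print(pipeline_register_bits(n, m), m * (n + m + 1))  # 136 136
```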

26
Constant Multiplier
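  • Can greatly reduce the number of adders
  • Removes all AND gates

[Figure: array multiplier specialized for the constant Y3Y2Y1Y0 = 1010 (Y0 = 0, Y1 = 1, Y2 = 0, Y3 = 1): only the rows for Y1 and Y3 keep their X0 through X3 inputs and adders; product bits Z0, Z1, Z2 are shown]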
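
A sketch of the specialization: when the multiplier is known in advance, only the rows whose constant bit is 1 survive, and each surviving row is a hard-wired shift feeding an adder. The factory function below is an assumption for illustration.

```python
def constant_multiplier(constant):
    """Build a multiplier specialized for one constant.

    Returns (multiply, shifts), where shifts lists the rows that remain.
    """
    shifts = [i for i in range(constant.bit_length()) if (constant >> i) & 1]

    def multiply(x):
        # One shifted copy of x per 1 bit; no AND gates and no rows for 0 bits.
        return sum(x << s for s in shifts)

    return multiply, shifts


times_ten, rows = constant_multiplier(0b1010)
print(rows, times_ten(42))  # [1, 3] 420
```

For the constant 1010, four rows collapse to two shifted copies of X and a single adder.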
27
LUT-based Constant Multipliers
  • k-LUT can perform constant multiply of k-bit
    number
  • Break operand into k-bit quantities
  • Example: 8-bit x 8-bit constant, k = 4

10101011 x NNNNNNNN
AAAAAAAAAAA (N * 1011 (LSN))
BBBBBBBBBBBBBBB (N * 1010 (MSN))
SSSSSSSSSSSSSSS Product
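
The nibble-wise scheme can be sketched as a table lookup per k-bit slice followed by a shifted add. The constant 0x5B is an arbitrary value chosen for this example; in hardware the table would be a bank of k-input LUTs, one per output bit.

```python
def build_constant_lut(constant, k=4):
    """Precompute constant * v for every k-bit value v (the LUT contents)."""
    return [constant * v for v in range(1 << k)]


def lut_constant_multiply(operand, lut, operand_bits=8, k=4):
    """Multiply by the LUT's constant, one k-bit slice of the operand at a time."""
    mask = (1 << k) - 1
    product = 0
    for shift in range(0, operand_bits, k):
        nibble = (operand >> shift) & mask   # LSN first, then MSN
        product += lut[nibble] << shift      # weight the partial product
    return product


lut = build_constant_lut(0x5B)
print(lut_constant_multiply(0b10101011, lut))  # 15561, i.e. 171 * 91
```

For an 8-bit operand and k = 4 this needs two lookups and one adder, regardless of how many 1 bits the constant contains.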
28
Summary
  • Latency overhead of programmable logic
  • Several approaches to reducing design latency
  • Fast carry
  • Cascade
  • Hardwired connections
  • Multiplier optimization goals different from
    adder
  • Other techniques
  • Logarithmic vs. linear (Wallace tree multiplier)
  • Data encoding (Booth's multiplier)