Title: CprE / ComS 583 Reconfigurable Computing
1. CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture 15: Midterm Review
2. Project Proposals
- Group 1: FPGA Implementation of Frequency-Domain Audio Effects Processor
  - Five-band equalizer
  - Frequency shifter
3. Project Proposals (cont.)
- Group 2: Transparent FPGA-Based Network Analyzer
  - Layer I pass-through
  - Layer II passive analyzer
4. Project Proposals (cont.)
- Group 3: FPGA-Based Library Design for Linear Algebra Applications
  - Floating-point sparse matrix-vector multiplication
  - Floating-point banded matrix-vector multiplication
  - Floating-point lower-upper matrix decomposition
5. Project Proposals (cont.)
- Group 4: An Improved Approach of Configuration Compression for FPGA-Based Embedded Systems
  - Improved compression algorithms
  - LUT-reordering techniques
6. Project Proposals (cont.)
- Other Projects
  - Group 5: FPGA Ternary Data Conversion
  - Group 6: Analysis of Sobel Edge Detection Implementations
  - Group 7: Design and Analysis of Artificial Neural Networks on FPGAs
- Reminders
  - 11/16: Project Updates (10 minutes)
  - 12/5-12/7: Final Presentations (25 minutes)
  - 12/15: Final Reports
7. Midterm Review
[Figure: diagram of processing elements (PE) with the labels MMX, SSE, FFT, AES, MPP, More Cache, CISC, Superscalar, Vector, Reconfigurable Fabric, and Reconfigurable Processor]
8. Computational Density (Qualitative)
[Figure: Actel ProASIC and Intel Pentium 4]
- FPGAs can complete more work per unit time than a processor or DSP
  - Less instruction overhead
  - More active computation packed onto the same silicon area (allows for more parallelism)
  - Can control operations at the bit level (as opposed to word level)
9. Coupling in a Reconfigurable System
- Many places to put reconfigurable computing components
- Most implementations involve multiple discrete devices
- How should these devices be connected together?
10. Generic FPGA Architecture
- FPGA: Field-Programmable Gate Array
  - Input/Output Buffers (IOBs)
  - Configurable Logic Blocks (CLBs)
  - Programmable interconnect mesh
[Figure: Island-style FPGA architecture]
11. FPGA Technology
- Various FPGA programming technologies (anti-fuse, (E)EPROM, Flash, SRAM)
- SRAM most popular
12. LUTs and Digital Logic
- k inputs → $2^k$ possible input values
- k-LUT corresponds to a $2^k \times 1$ bit memory
- Truth table is stored
- $2^{2^k}$ possible functions, $O(2^{2^k} / k!)$ unique
[Figure: example function F of inputs A0, A1, A2 stored in a 3-LUT]
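As a rough illustration of the point that a k-LUT is simply a small memory holding a truth table, the following Python sketch models one. The helper names (`make_lut`, `lut_read`, `num_functions`) and the 3-input majority example are assumptions chosen for illustration; they do not come from the lecture.

```python
from itertools import product


def make_lut(func, k):
    """Return the 2**k-entry truth table of a k-input Boolean function.

    Entry i holds the output for the input pattern whose binary
    encoding is i (first argument = most significant address bit).
    """
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]


def lut_read(table, *inputs):
    """Evaluate the LUT: the input bits form the memory address."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | (bit & 1)
    return table[addr]


def num_functions(k):
    """Number of distinct truth tables a k-LUT can store: 2**(2**k)."""
    return 2 ** (2 ** k)


# Hypothetical example function: 3-input majority.
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
```

Reprogramming the LUT amounts to storing a different table; `lut_read` itself never changes, which is the sense in which the logic is "configurable".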
13. Architectural Issues [AhmRos04A]
- What values of N (LUTs per cluster), I (inputs per cluster), and K (inputs per LUT) minimize the following parameters?
  - Area
  - Delay
  - Area-delay product
- Assumptions
  - All routing wires have length 4
  - Fully populated IMUX
  - Wiring is half pass transistor, half tri-state
14. FPGA Arithmetic
- Traditional microprocessors, DSPs, etc. don't use LUTs
- Instead use a w-bit Arithmetic and Logic Unit (ALU)
  - Carry connections are hard-wired
  - No switches, no stubs, short wires
[Figure: bit slices built from two 3-LUTs with inputs A, B, and Cin and outputs Sum and Cout (Cout feeds the next Cin); labeled operations are (1) AND2, OR2, XOR2 and (2) ADD, SUB, CMP]
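The two-LUT bit slice in the figure can be sketched in Python as follows. This is a minimal model, assuming one 3-LUT holds the sum function and the other the carry function, with slices chained through the carry; the names `SUM_LUT`, `COUT_LUT`, `lut3`, and `ripple_add` are illustrative only.

```python
# Truth tables indexed by the address (a << 2) | (b << 1) | cin.
SUM_LUT = [a ^ b ^ c for a in (0, 1) for b in (0, 1) for c in (0, 1)]
COUT_LUT = [(a & b) | (c & (a ^ b))
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]


def lut3(table, a, b, c):
    """Read a 3-input LUT."""
    return table[(a << 2) | (b << 1) | c]


def ripple_add(x, y, width, cin=0):
    """Add two width-bit values with one (Sum, Cout) LUT pair per bit.

    Returns (sum modulo 2**width, final carry out).
    """
    result = 0
    carry = cin
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = lut3(SUM_LUT, a, b, carry)
        carry = lut3(COUT_LUT, a, b, carry)  # uses the incoming carry
        result |= s << i
    return result, carry
```

Subtraction reuses the same slices: invert `y` and set `cin=1`, for example `ripple_add(x, ~y & 0xF, 4, cin=1)` for 4-bit operands. In a generic LUT fabric the carry would travel over programmable routing, which is the motivation for the hard-wired carry paths mentioned above.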
15. FPGA Arithmetic (cont.)
- Hard-wired carry logic support
[Figure: Altera FLEX 8000 and Xilinx XCV4000]
16. Arithmetic (cont.)
- Carry save multiplication
[Figure: carry-save multiplier array with inputs X0-X3 and Y0-Y3 and outputs Z0, Z1, Z2]
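Carry-save multiplication can be sketched behaviorally in Python. The sketch below assumes the usual formulation: partial products are accumulated as a (sum, carry) pair using 3:2 compression, so no carry ripples until a single final addition. It operates on whole integers rather than modeling the individual cells of the array, and the function names are illustrative.

```python
def carry_save_add(a, b, c):
    """3:2 compressor applied bitwise: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def carry_save_multiply(x, y, width=4):
    """Multiply unsigned x by a width-bit unsigned y.

    Each partial product is folded into the running (sum, carry)
    pair without propagating carries between bit positions.
    """
    s, c = 0, 0
    for i in range(width):
        partial = (x << i) if (y >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial)
    # One carry-propagate addition resolves the redundant form.
    return s + c
```

The design point is that every row of the array has constant delay regardless of operand width; only the last stage pays for carry propagation.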
17. LUT-Based Constant Multipliers
    10101011 x NNNNNNNN
        AAAAAAAAAAAA      (N x 1011 (LSN))
    BBBBBBBBBBBB          (N x 1010 (MSN))
    SSSSSSSSSSSSSSSS      Product
[Figure: two banks of twelve 4-LUTs with inputs N0-N7, producing partial products A0-A11 and B4-B15 and the sum S0-S15]
- Constants can be changed in the LUTs to program new multipliers
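A LUT-based constant multiplier can be modeled in a few lines of Python. This sketch assumes the 8-bit variable operand is split into two nibbles, each of which addresses a 16-entry table of precomputed 12-bit products; the two lookups are then combined with a 4-bit shift and an add. The function names and the example constants are illustrative assumptions.

```python
def build_kcm_table(constant):
    """Precompute constant * nibble for every 4-bit nibble (16 entries).

    For an 8-bit constant each entry fits in 12 bits, matching a bank
    of twelve 4-LUTs that share the same four address inputs.
    """
    return [constant * nibble for nibble in range(16)]


def kcm_multiply(table, n):
    """Multiply the stored constant by the 8-bit value n."""
    low = table[n & 0xF]           # partial product from the LSN
    high = table[(n >> 4) & 0xF]   # partial product from the MSN
    return low + (high << 4)       # MSN product is weighted by 16


# Example with an arbitrary 8-bit constant.
table_ab = build_kcm_table(0xAB)
```

Changing the multiplier only requires rebuilding the table, for example `build_kcm_table(0x37)`; the lookup-and-add structure in `kcm_multiply` stays the same.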
18. Capacity Trends
[Figure: Xilinx device complexity by year; year axis ticks at 1985, 1987, 1991, 1995, 1998, 1999, 2000, 2002, 2003, 2004, 2006. Devices listed below in order of increasing gate count]
- XC2000: 50 MHz, 1K gates
- XC3000: 85 MHz, 7.5K gates
- XC5200: 50 MHz, 23K gates
- Spartan: 80 MHz, 40K gates
- Spartan-II: 200 MHz, 200K gates
- XC4000: 100 MHz, 250K gates
- Virtex: 200 MHz, 1M gates
- Virtex-E: 240 MHz, 4M gates
- Spartan-3: 326 MHz, 5M gates
- Virtex-II: 450 MHz, 8M gates
- Virtex-II Pro: 450 MHz, 8M gates
- Virtex-4: 500 MHz, 16M gates
- Virtex-5: 550 MHz, 24M gates
19. Splash 1 Architecture
[Figure: Splash 1 board with VME bus and VSB bus interfaces, FIFO IN, Control, FIFO OUT, 32 FPGAs (F0-F31), and 32 memories (M0-M31)]
20. FPGA-based Router
- FPX module contains two FPGAs
  - NID: network interface device
    - Performs data queuing
  - RAD: reprogrammable application device
    - Specialized control sequences
21. Mesh Topology
- Chips are connected in a nearest-neighbor pattern
- Simplicity is key
- Linear array is essentially a 1-dimensional mesh
22. Other Topologies
- Crossbar topology
  - Devices A-D are routing only
  - Gives predictable performance
  - Potential waste of resources for near-neighbor connections
23. Logic Emulation
- Emulation takes a sizable amount of resources
- Compilation time can be large due to FPGA compiles
24. Systolic Architectures
- Goal: general methodology for mapping computations into hardware (spatial computing) structures
- Composition
  - Simple compute cells (e.g. add, sub, max, min)
  - Regular interconnect pattern
  - Pipelined communication between cells
  - I/O at boundaries
[Figure: example compute cells labeled x, min, and c]
25. Finite Impulse Response
- Sequential
  - Memory bandwidth per output: 2k + 1
  - O(k) cycles per output
  - O(1) hardware
- Systolic
  - Memory bandwidth per output: 2
  - O(1) cycles per output
  - O(k) hardware
[Figure: FIR array in which input xi passes four multiplier cells with weights w1-w4 to produce output yi]
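The sequential versus systolic trade-off can be made concrete with a small Python model. It assumes the filter y[i] = w[0]*x[i] + ... + w[k-1]*x[i+k-1] (0-indexed). The pipelined version keeps one weight and one partial-sum register per cell, and partial sums move one cell per cycle. As a simplification, each input sample is broadcast to all cells rather than pipelined between them. Function names are illustrative.

```python
def fir_sequential(x, w):
    """One multiply-accumulate unit: k steps per output, each output
    re-reads k samples and k weights (2k reads + 1 write)."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


def fir_systolic(x, w):
    """k cells, one new sample in and one result out per cycle."""
    k = len(w)
    regs = [0] * k  # partial-sum register of each cell
    out = []
    for t, sample in enumerate(x):
        # Update back to front so each cell reads its neighbor's
        # value from the previous cycle.
        for j in range(k - 1, 0, -1):
            regs[j] = regs[j - 1] + w[j] * sample
        regs[0] = w[0] * sample
        if t >= k - 1:  # pipeline is full after k - 1 cycles
            out.append(regs[k - 1])
    return out
```

Both functions return the same sequence; the pipelined form reads each sample from memory once and produces a result every cycle once the pipeline has filled, at the cost of k cells of hardware.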
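26. Matrix-Vector Product
[Figure: systolic array for the matrix-vector product with vector elements x1, x2, x3, x4, ..., xn]
- Matrix elements enter along skewed anti-diagonals
  - t = 1: a11
  - t = 2: a21, a12
  - t = 3: a31, a22, a13
  - t = 4: a41, a32, a23, a14
- Outputs emerge one per cycle
  - y1 at t = n
  - y2 at t = n + 1
  - y3 at t = n + 2
  - y4 at t = n + 3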
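The skewed schedule can be checked with a short Python simulation. It assumes a linear array in which cell j holds x[j], element a[i][j] arrives at cell j at cycle i + j (0-indexed), and the partial sum for row i moves one cell to the right each cycle. The function name and data layout are illustrative assumptions.

```python
def systolic_matvec(A, x):
    """Compute y = A @ x on a simulated linear systolic array."""
    m, n = len(A), len(x)
    pipe = [0] * n  # partial sum currently leaving each cell
    y = [0] * m
    for t in range(m + n - 1):
        # Visit cells right to left so each cell sees the value its
        # left neighbor produced in the previous cycle.
        for j in range(n - 1, -1, -1):
            i = t - j  # row whose element reaches cell j now
            if 0 <= i < m:
                incoming = pipe[j - 1] if j > 0 else 0
                pipe[j] = incoming + A[i][j] * x[j]
        i_out = t - (n - 1)  # row finishing at the last cell
        if 0 <= i_out < m:
            y[i_out] = pipe[n - 1]
    return y
```

In 1-indexed terms, row i completes at cycle i + n - 1, so the first result appears at t = n and one more follows every cycle after that.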
27. Circuit Netlist and Mapping
28. Placing and Routing
[Figure: FPGA with programmable connections]
29. Next Steps
    LIBRARY ieee;
    USE ieee.std_logic_1164.all;

    ENTITY implied IS
        PORT ( A, B : IN  STD_LOGIC;
               AeqB : OUT STD_LOGIC );
    END implied;

    ARCHITECTURE Behavior OF implied IS
    BEGIN
        PROCESS ( A, B )
        BEGIN
            -- No ELSE branch: AeqB holds its value when A /= B,
            -- so synthesis infers memory for it.
            IF A = B THEN
                AeqB <= '1';
            END IF;
        END PROCESS;
    END Behavior;
- VHDL / VHDL for Synthesis
30. HW/SW Co-Design
[Figure: HW/SW co-simulation setup pairing ARMulator (ARM core simulator, ARMulator API) with Modelsim (HDL simulator, Modelsim FLI) over SOCKET 1 and SOCKET 2. Labeled blocks: ARM Core, Cache, ARM Local Memory, AMBA, AHB Master I/F, AHB Slave I/F, Comm. Buffer, Comm. Buffer Socket Handler, Mem. Access Socket Handler, ASIC / FPGA, Shared Memory]
31. Multi-Context FPGAs
32. Function Unit Architectures
- RaPiD: Reconfigurable Pipelined Datapath
  - Linear array of function units
  - Function type determined by application
  - Function units are connected together as needed using segmented buses
  - Data enters the pipeline via input streams and exits via output streams
33. High-Level Compilation
[Figure: compilation flow. Labeled stages: C Program; C Libraries on various Targets; SUIF frontend; Directives and Automation; HW / SW Partitioner; C to RTL VHDL/Verilog; SUIF to GCC; VHDL to FPGA Synthesis; VHDL to ASIC Synthesis; GCC compiler for embedded. Labeled outputs: Object code for Embedded (SA); Binaries for FPGAs (Xilinx); Chip layouts (0.18 µm TSMC)]
34. Other Topics?
- Second course survey next week
- Provide general feedback, suggest additional topics
35. Midterm Exam
- Three questions
  - Review
  - Analysis
  - Extension
- Any paper mentioned in class is fair game
- Due in 48 hours (10/12, 2:00pm)
- No class on Thursday!
- Some restrictions
  - Work alone
  - Can ask if something is unclear ("what does this mean?" questions, not "how do I do this?" questions)
  - No late submissions: strict WebCT deadline