Title: Graduate Computer Architecture I
1. Graduate Computer Architecture I
2. Emergence of FPGA
- Great for Prototyping and Testing
- Enables logic verification without the high cost of fabrication
- Reprogrammable → great for research and education
- Meets most computational requirements
- Options for transferring the design to an ASIC
- Technology Advances
- Huge FPGAs are available
- Up to 200,000 logic units
- Clock rates above 500 MHz
- Competitive pricing
3. System on Chip (SoC)
- Large Embedded Memories
- Up to 10 megabits of on-chip memory (Virtex 4)
- High bandwidth and reconfigurable
- Processor IP Cores
- Plenty of soft processor cores (some open source)
- Embedded Processor Cores
- PowerPC, Nios RISC, etc., running at up to 450 MHz
- Simple Digital Signal Processing Cores
- Up to 512 DSP slices on a Virtex 4
- Interconnects
- High-speed network I/O (10 Gbps)
- Built-in Ethernet MACs (soft/hard core)
- Security
- Embedded 256-bit AES encryption
4. Potential Advantages of FPGAs
5. Designing with FPGAs
- Opportunities
- Hardware logic is programmable
- Immediate testing on the actual platform
- Challenges
- Programming Environment
- Think and design in 2-D instead of 1-D
- Consider hardware limitations
- Hardware Synthesis
- Smart language interpreter and translator
- Efficient HW resource utilization
6. Today
- Programming Environment
- Object Oriented Programming Model
- Template based language editors
- Hardware/Software Co-design
- Still a disconnect between SW/HW methods
- Lack of education to bring them together
- Hardware Synthesis
- Getting smarter but not smart enough
- Tuned specifically for each platform
- Not able to take full advantage of resources
- Manual tweaking and using templates
7. High Performance Design in FPGA
- Fine-Grain Pipelining
- Reducing the critical path
- One level of look-up table between D flip-flops
- Works best for streaming data with little or no data dependencies
- Logic Resources
- Smaller sizes often yield faster designs
- Use all available resources
- Fewer map-and-place resource conflicts
- Quicker compilation
- Parallel Engines
- Exploit parallelism in application
- Faster place and route
8. Pipelining
- DEFINITION
- A K-stage pipeline (K-pipeline) is an acyclic circuit having exactly K registers on every path from an input to an output.
- A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.
- CONVENTION
- Every pipeline stage, hence every K-stage pipeline, has a register on its OUTPUT (not on its input).
- ALWAYS
- The CLOCK common to all registers must have a period sufficient to cover the (input) register propagation delay + propagation over combinational paths + the (output) register setup time.
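The clock-period constraint above can be checked with a quick calculation; the delay numbers below are made-up placeholders, not figures from the slides.

```python
# Sketch of the clock-period constraint: the period must cover register
# clock-to-Q (propagation) delay + worst combinational path + setup time.
t_clk_to_q_ns = 0.4   # input register propagation delay (assumed)
t_comb_ns = 1.5       # worst combinational path between registers (assumed)
t_setup_ns = 0.3      # output register setup time (assumed)

min_period_ns = t_clk_to_q_ns + t_comb_ns + t_setup_ns
max_freq_mhz = 1e3 / min_period_ns
print(f"min period {min_period_ns:.1f} ns -> max clock {max_freq_mhz:.0f} MHz")
```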
9. Bad Pipelining
- You cannot just add registers at random
- Successive inputs get mixed, e.g., B(A(Xi+1), Yi)
- This happens because some paths from inputs to outputs have 2 registers, and some have only 1!
- Not a well-formed K-pipeline!
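The input-mixing failure can be seen in a toy clocked simulation; A and B below are hypothetical placeholder functions, not circuits from the slides.

```python
# A register sits on the x-path (after A) but not on the y-path, so the
# output pairs x and y values from DIFFERENT input cycles.
A = lambda x: x + 1          # placeholder combinational stage
B = lambda a, y: (a, y)      # placeholder combiner; returns the pair it sees

def run(xs, ys):
    reg = None               # the single register, on the x-path only
    out = []
    for x, y in zip(xs, ys):
        out.append(B(reg, y))  # B sees LAST cycle's A(x) with THIS cycle's y
        reg = A(x)             # register captures A(x) at the clock edge
    return out

print(run([0, 1, 2], [10, 11, 12]))
# → [(None, 10), (1, 11), (2, 12)]: x and y values are misaligned by a cycle
```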
10. Adding Pipelines
- Method
- Draw a line that crosses every output in the circuit and mark the endpoints as terminal points.
- Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction.
- These lines represent pipeline stages.
- Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline
- Focus on the slowest part of the circuit
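The well-formedness condition behind this method can be checked programmatically: every input-to-output path must cross the same number of registers. The graph encoding below is an assumption for illustration, not from the slides.

```python
# Sketch: compute the set of register counts over all input->output paths
# in a DAG; a well-formed K-pipeline yields the single value {K}.
def pipeline_depths(edges, inputs, outputs):
    """edges: dict node -> list of (successor, has_register) pairs."""
    def counts(node):
        # Set of register counts over all paths from node to any output.
        if node in outputs:
            return {0}
        result = set()
        for succ, has_reg in edges.get(node, []):
            for c in counts(succ):
                result.add(c + (1 if has_reg else 0))
        return result

    all_counts = set()
    for node in inputs:
        all_counts |= counts(node)
    return all_counts

# A -> B -> out with a register on each edge: every path sees 2 registers.
edges = {"A": [("B", True)], "B": [("out", True)]}
print(pipeline_depths(edges, ["A"], {"out"}))   # {2}: a valid 2-pipeline

# Add a bypass A -> out with only 1 register: paths now see 1 or 2 registers.
edges["A"].append(("out", True))
print(pipeline_depths(edges, ["A"], {"out"}))   # {1, 2}: not well formed
```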
11. Pipelining Example
- 8-bit to 256-bit decoder
- 256 different combinations

  library ieee;
  use ieee.std_logic_1164.all;

  entity DECODER is
    port( I : in  std_logic_vector(7 downto 0);
          O : out std_logic_vector(255 downto 0));
  end DECODER;

  architecture behavioral of DECODER is
  begin
    process (I)
    begin
      case I is
        when "00000000" => O <= "1000...0000";
        when "00000001" => O <= "0100...0000";
        when "00000010" => O <= "0010...0000";
        ...
        when "11111110" => O <= "0000...0010";
        when "11111111" => O <= "0000...0001";
      end case;
    end process;
  end behavioral;
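The decoder's behavior can be modeled in a few lines of Python (a sketch, not from the slides): each 8-bit input selects exactly one of 256 output bits, with bit 255 corresponding to input 0 as in the VHDL case statement.

```python
# One-hot 8-to-256 decoder model: the 256-bit output is held in an int.
def decode(i: int) -> int:
    """Return a 256-bit one-hot value for an 8-bit input."""
    assert 0 <= i < 256
    return 1 << (255 - i)   # bit 255 (leftmost in the VHDL) is input 0

assert decode(0) == 1 << 255    # "1000...0000"
assert decode(255) == 1         # "0000...0001"
assert all(bin(decode(i)).count("1") == 1 for i in range(256))  # one-hot
```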
12. Hardware Synthesis
- Synthesis
- Uses at least three 4-input look-up tables per output bit to decode the 256 combinations of I(7:0)
- Resource Usage
- 3 LUT4 × 256
- 768 LUT4
- Critical Path
- Input/output pin delays
- 2 levels of LUT4
- Sometimes 3 levels?!
- Virtex 4, speed grade -11
- 8.281 ns → 121 MHz
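A back-of-the-envelope check of the slide's numbers: with 4-input LUTs, each of the 256 output bits needs two first-level LUT4s (covering the 8 inputs) plus one combining LUT4, and an 8.281 ns period is about 121 MHz.

```python
import math

# LUT4 count for decoding 8 inputs into 256 one-hot outputs, 2 logic levels.
inputs, outputs, lut_inputs = 8, 256, 4
luts_per_output = math.ceil(inputs / lut_inputs) + 1  # 2 + 1 combining LUT
total_luts = luts_per_output * outputs
print(total_luts)                 # 768, as on the slide

period_ns = 8.281
print(round(1e3 / period_ns))     # ~121 MHz
```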
13. Pipelined Decoder
- Input/output pin DFFs
- Already in most FPGAs
- Minimize pin latencies
- DFF after every LUT4
- A LUT4 is always followed by a DFF (why not use it?)
- Only when possible
- Minimizes logic latency
- FPGA Resources
- 768 LUT4 as before
- Plus 768 DFFs and 264 pin DFFs
- But not really extra: each LUT4 slot already includes a DFF
- Critical Path
- 1 level of LUT4
- Plus a small DFF propagation delay and setup time
- Virtex 4, speed grade -11
- 2.198 ns → 455 MHz
- 3.76x speedup
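The speedup arithmetic checks out against the two cycle times quoted above:

```python
# 8.281 ns combinational vs 2.198 ns pipelined, from the slide's timing runs.
before_mhz = round(1e3 / 8.281)   # 121
after_mhz = round(1e3 / 2.198)    # 455
print(before_mhz, after_mhz, round(after_mhz / before_mhz, 2))  # 121 455 3.76
```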
14. Logic Resources
- Leveraging the FPGA Architecture
- Similarity with the architecture: a LUT and a few special logic elements followed by a DFF
- A Smaller Design is often Faster
- Easier for the tools to map, place, and route
- Optimize designs wherever possible
- In an FPGA, each wire has a large fanout limit
- Reuse logic and results
Fanout → the capacity of a wire to drive the inputs of other logic
15. Reusing Logic
- Synthesis Tools
- Obvious duplicate logic is combined automatically
- Most is not optimized
- Decoder Example
- Two 4-bit to 16-bit decoders
- Combining decoder outputs
- Two 16-bit outputs combined into 256 bits
- Critical Path
- 1 level of LUT4
- Approximately the same
- Differences in wire delay
- FPGA Resources
- I/O DFFs remain the same
- 2 × 16 LUT4 and DFF
- Plus 256 LUT4 and DFF
- Total 288 LUT4 and DFF!
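The logic-reuse trick can be sketched in Python (function names are my own): split the 8-bit input into two nibbles, decode each with a 4-to-16 decoder, and AND the one-hot outputs pairwise.

```python
# 2 x 16 LUTs for the small decoders + 256 two-input ANDs, instead of 768.
def decode4(n: int) -> list[int]:
    """4-to-16 one-hot decoder: decode4(n)[k] == 1 iff k == n."""
    return [1 if k == n else 0 for k in range(16)]

def decode8(i: int) -> list[int]:
    hi = decode4(i >> 4)    # upper nibble -> 16 one-hot bits (reused)
    lo = decode4(i & 0xF)   # lower nibble -> 16 one-hot bits (reused)
    # Output j = hi[j // 16] AND lo[j % 16]: 256 two-input AND gates.
    return [hi[j >> 4] & lo[j & 0xF] for j in range(256)]

# Every input still produces the correct one-hot output.
assert all(decode8(i)[j] == (1 if j == i else 0)
           for i in range(256) for j in range(256))
print("2*16 + 256 =", 2 * 16 + 256, "LUTs instead of 768")
```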
16. Virtex 4 Elementary Logic Block
- 2-to-1 multiplexers
- 4-input LUT
- 1-bit D flip-flops
17. Using MUXF as 2-Input Gates
[Figure: a 2-to-1 MUX with select a and data inputs 0 and b computes z = a AND b; with data inputs b and 1 it computes z = a OR b]
- Inverters can be pushed into the LUT4 or DFF (by using inverted Q)
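The MUX-as-gate idea can be verified exhaustively: tying one data input of a 2-to-1 multiplexer to a constant turns it into AND or OR.

```python
# Model of a 2-to-1 multiplexer used as a 2-input gate.
def mux(d0: int, d1: int, sel: int) -> int:
    return d1 if sel else d0

mux_and = lambda a, b: mux(0, b, sel=a)   # a ? b : 0  ==  a AND b
mux_or = lambda a, b: mux(b, 1, sel=a)    # a ? 1 : b  ==  a OR b

for a in (0, 1):
    for b in (0, 1):
        assert mux_and(a, b) == (a & b)
        assert mux_or(a, b) == (a | b)
print("MUXF implements AND/OR")
```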
18. Using Unused Multiplexers
- Decoder Example
- Replace all LUT4s in the 2nd decoder stage with MUX-based 2-input AND gates
- Critical Path
- Same
- 2.198 ns → 455 MHz
- FPGA Resources
- I/O DFFs remain the same
- 256 MUXF and DFF
- 32 LUT4 and DFF
19. Parallel Design
- Use Area to Increase Performance
- Increase the input bandwidth (input bus width)
- Process multiple data items at a time
- Duplicate engines to process independent data sets
- Thread/object-level parallelism
- Instruction-level parallelism
- Unroll loops to expose the parallelism
- Excellent for Streaming Data Applications
- Multimedia
- Network processing
- Performance Scalability
- Linear performance increase with size
- Achieved for many algorithms
- Sometimes exponential hardware size
- Try to scale using a higher level of parallelism
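Loop unrolling as a way to expose parallelism can be sketched in software; `engine` below is a made-up placeholder for a per-element computation, and in hardware the four unrolled calls would run as duplicated engines in the same cycle.

```python
# Unroll a loop by a factor of 4: the calls inside one iteration have no
# data dependencies, so a tool could map each to its own parallel engine.
def engine(x):
    return x * x            # placeholder per-element computation

def process_unrolled(data, factor=4):
    out = []
    tail = len(data) - len(data) % factor
    for i in range(0, tail, factor):
        # Independent computations -> candidates for duplicated hardware.
        out.extend(engine(data[i + k]) for k in range(factor))
    out.extend(engine(x) for x in data[tail:])   # leftover elements
    return out

print(process_unrolled([1, 2, 3, 4, 5]))   # → [1, 4, 9, 16, 25]
```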
20. Summary
- FPGA Design Methods
- Fine-grain pipelining to increase the clock rate
- If possible, 1 level of LUT followed by a DFF
- Parallel engines to increase bandwidth
- Duplicate logic to increase performance linearly
- Reducing Logic Resource Usage
- Reuse duplicate logic
- Use all available embedded logic
- There are other logic resources (e.g., embedded processors, large memories, optimized primitive gates, and IP cores)
- Best Methods Today
- Learn the internal architecture of the FPGA
- Make your own templates and use them
- Use IP cores
- Future Research Topics
- Integration of generalized pipelining algorithms (in the works)
- Smarter synthesis tools (understanding HDL)
- Automatic platform-specific optimization techniques