Graduate Computer Architecture I
Lecture 16: FPGA Design

Provided by: Youn60
Learn more at: https://www.isi.edu

1
Graduate Computer Architecture I
  • Lecture 16: FPGA Design

2
Emergence of FPGA
  • Great for Prototyping and Testing
  • Enables logic verification without the high cost of
    fabrication
  • Reprogrammable → research and education
  • Meets most computational requirements
  • Options for transferring design to ASIC
  • Technology Advances
  • Huge FPGAs are available
  • Up to 200,000 Logic Units
  • Clock rates above 500 MHz
  • Competitive Pricing

3
System on Chip (SoC)
  • Large Embedded Memories
  • Up to 10 megabits of on-chip memory (Virtex 4)
  • High bandwidth and reconfigurable
  • Processor IP Cores
  • Tons of Soft Processor Cores (some open source)
  • Embedded Processor Cores
  • PowerPC, Nios RISC, etc. (450 MHz)
  • Simple Digital Signal Processing Cores
  • Up to 512 DSPs on Virtex 4
  • Interconnects
  • High speed network I/O (10Gbps)
  • Built-in Ethernet MACs (Soft/Hard Core)
  • Security
  • Embedded 256-bit AES Encryption

4
Potential Advantages of FPGAs

5
Designing with FPGAs
  • Opportunities
  • Hardware logic is programmable
  • Immediate testing on the actual platform
  • Challenges
  • Programming Environment
  • Think and design in 2-D instead of 1-D
  • Consider hardware limitations
  • Hardware Synthesis
  • Smart language interpreter and translator
  • Efficient HW resource utilization

6
Today
  • Programming Environment
  • Object Oriented Programming Model
  • Template based language editors
  • Hardware/Software Co-design
  • Still a disconnect between SW/HW methods
  • Lack of education to bring them together
  • Hardware Synthesis
  • Getting smarter but not smart enough
  • Tuned specifically for each platform
  • Not able to take full advantage of resources
  • Manual tweaking and using templates

7
High Performance Design in FPGA
  • Fine Grain Pipelining
  • Reducing Critical Path
  • One level of look-up table between D flip-flops
  • Works best for streaming data with little or no
    data dependencies
  • Logic Resource
  • Smaller designs are often faster
  • Use all available resources
  • Fewer resource mapping and placement conflicts
  • Quicker compilation
  • Parallel Engines
  • Exploit parallelism in application
  • Faster place and route

8
Pipelining
  • DEFINITION
  • a K-Stage Pipeline (K-pipeline) is an acyclic
    circuit having exactly K registers on every path
    from an input to an output.
  • a COMBINATIONAL CIRCUIT is thus a 0-stage
    pipeline.
  • CONVENTION
  • Every pipeline stage, hence every K-Stage
    pipeline, has a register on its OUTPUT (not on
    its input).
  • ALWAYS
  • The CLOCK common to all registers must have a
    period sufficient to cover propagation over
    combinational paths + (input) register propagation
    delay + (output) register setup time.
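
The timing rule above can be written as an inequality. The symbols are not in the original slides and are introduced here only for illustration: t_CLK is the clock period, t_PD,reg the propagation delay of the input register, t_PD,logic the worst-case delay over the combinational paths, and t_SETUP the setup time of the output register.

```latex
\[
  t_{\mathrm{CLK}} \;\ge\; t_{\mathrm{PD,reg}} + t_{\mathrm{PD,logic}} + t_{\mathrm{SETUP}}
\]
```

Fine-grain pipelining shrinks the middle term, which is why one LUT level between flip-flops gives the shortest achievable period.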

9
Bad pipelining
  • You cannot just add registers at random
  • Successive inputs get mixed, e.g., B(A(X_{i+1}), Y_i)
  • This happened because some paths from inputs to
    outputs have 2 registers, and some have only 1!
  • Not a well-formed K pipeline!

10
Adding Pipelines
  • Method
  • Draw a line that crosses every output in the
    circuit and mark the endpoints as terminal
    points.
  • Continue to draw new lines between the terminal
    points across various circuit connections,
    ensuring that every connection crosses each line
    in the same direction.
  • These lines represent pipeline stages.
  • Adding a pipeline register at every point where a
    separating line crosses a connection will always
    generate a valid pipeline
  • Focus on the slowest part of the circuit
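
The method above can be sketched in VHDL. This is a minimal illustration, not code from the original slides: the entity name BALANCED_PIPE, its ports, and the placeholder functions standing in for blocks A (an inverter) and B (a bitwise AND) are all hypothetical. One separating line is drawn after A and one at the output, so the Y connection also receives a register and every input-to-output path carries exactly two registers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 2-stage pipeline computing Z = B(A(X), Y).
entity BALANCED_PIPE is
  port (
    CLK : in  std_logic;
    X   : in  std_logic_vector(7 downto 0);
    Y   : in  std_logic_vector(7 downto 0);
    Z   : out std_logic_vector(7 downto 0)
  );
end BALANCED_PIPE;

architecture rtl of BALANCED_PIPE is
  signal a_reg : std_logic_vector(7 downto 0);  -- stage 1: A(X)
  signal y_reg : std_logic_vector(7 downto 0);  -- stage 1: Y delayed to match
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      -- Stage 1: the first separating line crosses both connections,
      -- so both connections get a register.
      a_reg <= not X;        -- placeholder for block A
      y_reg <= Y;            -- matching register on the Y path
      -- Stage 2: output register (register on the output, by convention).
      Z <= a_reg and y_reg;  -- placeholder for block B
    end if;
  end process;
end rtl;
```

Dropping y_reg and using Y directly in the stage 2 expression would leave two registers on the X path and only one on the Y path, which is the ill-formed case where successive inputs get mixed.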

11
Pipelining Example
  • 8-bit to 256-bit decoder
  • 256 different combinations

library ieee;
use ieee.std_logic_1164.all;

entity DECODER is
  port (
    I : in  std_logic_vector(7 downto 0);
    O : out std_logic_vector(255 downto 0)  -- 256 bits
  );
end DECODER;

architecture behavioral of DECODER is
begin
  process (I)
  begin
    case I is
      when "00000000" => O <= "1000...0000";
      when "00000001" => O <= "0100...0000";
      when "00000010" => O <= "0010...0000";
      ...
      when "11111110" => O <= "0000...0010";
      when "11111111" => O <= "0000...0001";
      -- Catch-all for metavalues such as 'U' and 'X'; needed for a complete case.
      when others     => O <= (others => '0');
    end case;
  end process;
end behavioral;

12
Hardware Synthesis
  • Synthesis
  • Uses at least three 4-input look-up tables (LUT4)
    per output to decode the 256 combinations of
    I(7 downto 0)
  • Resource Usage
  • 3 LUT4 × 256
  • 768 LUT4
  • Critical Path
  • Input/Output pin delays
  • 2 levels of LUT4
  • Sometimes 3 levels?!
  • Virtex 4, speed grade -11
  • 8.281 ns → 121 MHz

13
Pipelined Decoder
  • Input/Output pin DFF
  • Already in most FPGAs
  • Minimizes pin latencies
  • DFF after every LUT4
  • LUT4 is always followed by a DFF (so why not use it?)
  • Only when possible
  • Minimizes logic latency
  • FPGA Resource
  • 768 LUT4 as before
  • Plus 768 DFF and 264 pin DFF
  • But not really
  • Critical Path
  • 1 Level of LUT4
  • Plus small DFF prop delay and setup
  • Virtex 4, speed grade -11
  • 2.198 ns → 455 MHz
  • 3.76x Speedup
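
The clock rates and the speedup follow directly from the two critical-path figures quoted in the slides (8.281 ns unpipelined, 2.198 ns pipelined). The arithmetic below is a worked check; the subscript names are introduced here for illustration.

```latex
\[
  f_{\text{unpipelined}} = \frac{1}{8.281\ \text{ns}} \approx 120.8\ \text{MHz},
  \qquad
  f_{\text{pipelined}} = \frac{1}{2.198\ \text{ns}} \approx 455.0\ \text{MHz}
\]
\[
  \text{speedup} = \frac{8.281\ \text{ns}}{2.198\ \text{ns}} \approx 3.77
  \quad\text{(or } 455 / 121 \approx 3.76 \text{ from the rounded frequencies)}
\]
```

The gain is in throughput: each result now takes several clock cycles to emerge, but a new input is accepted every 2.198 ns.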

14
Logic Resource
  • Leveraging the FPGA Architecture
  • Similarity with Architecture
  • LUT and a few special logic elements followed by DFF
  • Smaller Design is often Faster
  • Easier for tools to Map, Place, and Route
  • Optimize designs wherever possible
  • In an FPGA, each wire has a large fanout limit
  • Reuse logic and results

[Figure: a logic block with its input and output.]
Fanout: the capacity of a wire to drive the inputs
of other logic

15
Reusing Logic
  • Synthesis Tools
  • Obvious duplicate logic is automatically
    combined
  • Most are not optimized
  • Decoder Example
  • Two 4-bit to 16-bit decoders
  • Combining decoder outputs
  • Two 16-bit outputs to 256 bits
  • Critical Path
  • 1 Level of LUT4
  • Approximately the same
  • Differences in wire delay
  • FPGA Resources
  • I/O DFF remain same
  • 2 x 16 LUT4 and DFF
  • Plus 256 LUT4 and DFF
  • Total 288 LUT4 and DFF!

[Figure: two sets of 4-to-16 decoders feed one LUT4 of combinational logic per output bit, O(0) through O(255).]
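
The reuse idea on this slide can be sketched in VHDL as follows. This is an illustrative sketch, not the author's code: the entity name DECODER_REUSE, the signal names, and the clocked coding style are assumptions, and the input/output pin flip-flops mentioned in the slides are not modeled separately. Stage 1 holds the two 4-to-16 decoders (32 four-input functions), and stage 2 combines one bit from each with a 2-input AND for each of the 256 outputs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical pipelined 8-to-256 decoder built from two 4-to-16 decoders.
entity DECODER_REUSE is
  port (
    CLK : in  std_logic;
    I   : in  std_logic_vector(7 downto 0);
    O   : out std_logic_vector(255 downto 0)
  );
end DECODER_REUSE;

architecture rtl of DECODER_REUSE is
  signal hi_dec : std_logic_vector(15 downto 0);  -- one-hot decode of I(7 downto 4)
  signal lo_dec : std_logic_vector(15 downto 0);  -- one-hot decode of I(3 downto 0)
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      -- Stage 1: two 4-to-16 decoders; each bit is a function of 4 inputs.
      hi_dec <= (others => '0');
      lo_dec <= (others => '0');
      hi_dec(to_integer(unsigned(I(7 downto 4)))) <= '1';
      lo_dec(to_integer(unsigned(I(3 downto 0)))) <= '1';
      -- Stage 2: each output is a 2-input AND of one bit from each decoder.
      -- The index 255 - n follows the original listing, where input 0
      -- sets the leftmost output bit.
      for hi in 0 to 15 loop
        for lo in 0 to 15 loop
          O(255 - (16 * hi + lo)) <= hi_dec(hi) and lo_dec(lo);
        end loop;
      end loop;
    end if;
  end process;
end rtl;
```

Whether a synthesis tool keeps this two-level structure or flattens it is tool dependent, which is the slide's point about manual restructuring.
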
16
Virtex 4 Elementary Logic Block
  • 2-to-1 multiplexers
  • 4-input LUT
  • 1-bit D flip-flops

17
Using MUXF as 2-input Gates
[Figure: 2-to-1 multiplexers wired as 2-input gates, with data inputs a and b, constants 0 and 1, select input sel, and output z.]
Inverters can be pushed into the LUT4 or DFF (by
using inverted Q)
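
A 2-to-1 multiplexer with one data input tied to a constant behaves as a 2-input gate. The sketch below shows that equivalence in VHDL under stated assumptions: the entity name MUX_GATES and its port names are hypothetical, and the code describes behavior only. Whether a synthesis tool places these functions on the dedicated MUXF resources rather than in a LUT depends on the tool, and direct instantiation of a vendor primitive (not shown here) may be needed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical demonstration of 2-input gates built from 2-to-1 muxes.
entity MUX_GATES is
  port (
    A     : in  std_logic;
    B     : in  std_logic;
    Z_AND : out std_logic;
    Z_OR  : out std_logic
  );
end MUX_GATES;

architecture rtl of MUX_GATES is
begin
  -- Mux with B as select: passes A when B = '1', constant '0' otherwise.
  -- For inputs of '0' and '1' this matches the truth table of A and B.
  Z_AND <= A when B = '1' else '0';

  -- Mux with B as select: passes constant '1' when B = '1', A otherwise.
  -- For inputs of '0' and '1' this matches the truth table of A or B.
  Z_OR  <= '1' when B = '1' else A;
end rtl;
```
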
18
Using Unused Multiplexors
  • Decoder Example
  • Replace all LUT4s in the second decoder stage with
    MUX-based 2-input AND gates
  • Critical Path
  • Same
  • 2.198 ns → 455 MHz
  • FPGA Resources
  • I/O DFF remain same
  • 256 MUXF and DFF
  • 32 LUT4 and DFF

[Figure: two sets of 4-to-16 decoders feed the combinational logic for each output bit, O(0) through O(255).]

19
Parallel Design
  • Use Area to Increase Performance
  • Increase the Input bandwidth (Input Bus width)
  • Processing multiple data at a time
  • Duplicate engines to process independent data
    sets
  • Thread/Object level parallelism
  • Instruction-level parallelism
  • Unroll loops to expose the parallelism
  • Excellent for Streaming Data Applications
  • Multimedia
  • Network Processing
  • Performance Scalability
  • Linear Performance increase with Size
  • Achieved for many algorithms
  • Sometimes Exponential Hardware Size
  • Try to scale using higher level of parallelism

20
Summary
  • FPGA Designing Methods
  • Fine Grain Pipelining to Increase Clock Rate
  • If possible 1-level of LUT followed by DFF
  • Parallel Engines to Increase Bandwidth
  • Duplicate logic to linearly increase the
    performance
  • Reducing Logic Resource Usage
  • Reusing duplicate logic
  • Using all available embedded Logic
  • Other embedded logic is available (e.g., embedded
    processors, large memories, optimized primitive
    gates, and IP cores)
  • Best Methods Today
  • Learn about internal architecture of FPGA
  • Make your own templates and use them
  • Use IP Cores
  • Future Research Topics
  • Integration of Generalized Pipelining Algorithms
    (In the works)
  • Smarter Synthesis Tools (Understanding HDL)
  • Automatic Platform Specific Optimization
    Techniques