Title: Graduate Computer Architecture I
1. Graduate Computer Architecture I
2. Emergence of FPGA
- Great for Prototyping and Testing
- Enables logic verification without the high cost of fabrication
- Reprogrammable → great for research and education
- Meets most computational requirements
- Options for transferring the design to an ASIC
- Technology Advances
- Huge FPGAs are available
- Up to 200,000 logic units
- Clock rates above 500 MHz
- Competitive pricing
3. System on Chip (SoC)
- Large Embedded Memories
- Up to 10 megabits of on-chip memory (Virtex 4)
- High bandwidth and reconfigurable
- Processor IP Cores
- Plenty of soft processor cores (some open source)
- Embedded Processor Cores
- PowerPC, Nios RISC, etc., running at up to 450 MHz
- Simple Digital Signal Processing Cores
- Up to 512 DSP slices on a Virtex 4
- Interconnects
- High-speed network I/O (10 Gbps)
- Built-in Ethernet MACs (soft/hard core)
- Security
- Embedded 256-bit AES encryption
4. Potential Advantages of FPGAs
5. Designing with FPGAs
- Opportunities
- Hardware logic is programmable
- Immediate testing on the actual platform
- Challenges
- Programming Environment
- Think and design in 2-D instead of 1-D
- Consider hardware limitations
- Hardware Synthesis
- Smart language interpreter and translator
- Efficient HW resource utilization
6. Today
- Programming Environment
- Object Oriented Programming Model
- Template based language editors
- Hardware/Software Co-design
- Still a disconnect between SW/HW methods
- Lack of education to bring them together
- Hardware Synthesis
- Getting smarter but not smart enough
- Tuned specifically for each platform
- Not able to take full advantage of resources
- Manual tweaking and using templates
7. High Performance Design in FPGA
- Fine-Grain Pipelining
- Reducing the critical path
- One level of look-up table between D flip-flops
- Works best for streaming data with little or no data dependencies
- Logic Resources
- Smaller sizes often yield faster designs
- Use all available resources
- Fewer map-and-place resource conflicts
- Quicker compilation
- Parallel Engines
- Exploit parallelism in application
- Faster place and route
8. Pipelining
- DEFINITION
- A K-stage pipeline (K-pipeline) is an acyclic circuit having exactly K registers on every path from an input to an output.
- A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.
- CONVENTION
- Every pipeline stage, hence every K-stage pipeline, has a register on its OUTPUT (not on its input).
- ALWAYS
- The CLOCK common to all registers must have a period sufficient to cover the (input) register propagation delay + propagation over combinational paths + the (output) register setup time.
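The clock-period constraint above can be checked with a quick calculation; the delay numbers below are made-up placeholders, not figures from the slides.

```python
# Sketch of the clock-period constraint: the period must cover register
# clock-to-Q (propagation) delay + worst combinational path + setup time.
t_clk_to_q_ns = 0.4   # input register propagation delay (assumed)
t_comb_ns = 1.5       # worst combinational path between registers (assumed)
t_setup_ns = 0.3      # output register setup time (assumed)

min_period_ns = t_clk_to_q_ns + t_comb_ns + t_setup_ns
max_freq_mhz = 1e3 / min_period_ns
print(f"min period {min_period_ns:.1f} ns -> max clock {max_freq_mhz:.0f} MHz")
```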
9. Bad Pipelining
- You cannot just add registers at random
- Successive inputs get mixed, e.g., B(A(Xi+1), Yi)
- This happens because some paths from inputs to outputs have 2 registers, and some have only 1!
- Not a well-formed K-pipeline!
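The input-mixing failure can be seen in a toy clocked simulation; A and B below are hypothetical placeholder functions, not circuits from the slides.

```python
# A register sits on the x-path (after A) but not on the y-path, so the
# output pairs x and y values from DIFFERENT input cycles.
A = lambda x: x + 1          # placeholder combinational stage
B = lambda a, y: (a, y)      # placeholder combiner; returns the pair it sees

def run(xs, ys):
    reg = None               # the single register, on the x-path only
    out = []
    for x, y in zip(xs, ys):
        out.append(B(reg, y))  # B sees LAST cycle's A(x) with THIS cycle's y
        reg = A(x)             # register captures A(x) at the clock edge
    return out

print(run([0, 1, 2], [10, 11, 12]))
# → [(None, 10), (1, 11), (2, 12)]: x and y values are misaligned by a cycle
```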
10. Adding Pipelines
- Method
- Draw a line that crosses every output in the circuit and mark the endpoints as terminal points.
- Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction.
- These lines represent pipeline stages.
- Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline
- Focus on the slowest part of the circuit
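The well-formedness condition behind this method can be checked programmatically: every input-to-output path must cross the same number of registers. The graph encoding below is an assumption for illustration, not from the slides.

```python
# Sketch: compute the set of register counts over all input->output paths
# in a DAG; a well-formed K-pipeline yields the single value {K}.
def pipeline_depths(edges, inputs, outputs):
    """edges: dict node -> list of (successor, has_register) pairs."""
    def counts(node):
        # Set of register counts over all paths from node to any output.
        if node in outputs:
            return {0}
        result = set()
        for succ, has_reg in edges.get(node, []):
            for c in counts(succ):
                result.add(c + (1 if has_reg else 0))
        return result

    all_counts = set()
    for node in inputs:
        all_counts |= counts(node)
    return all_counts

# A -> B -> out with a register on each edge: every path sees 2 registers.
edges = {"A": [("B", True)], "B": [("out", True)]}
print(pipeline_depths(edges, ["A"], {"out"}))   # {2}: a valid 2-pipeline

# Add a bypass A -> out with only 1 register: paths now see 1 or 2 registers.
edges["A"].append(("out", True))
print(pipeline_depths(edges, ["A"], {"out"}))   # {1, 2}: not well formed
```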
11. Pipelining Example
- 8-bit to 256-bit decoder
- 256 different combinations

  library ieee;
  use ieee.std_logic_1164.all;

  entity DECODER is
    port( I : in  std_logic_vector(7 downto 0);
          O : out std_logic_vector(255 downto 0));
  end DECODER;

  architecture behavioral of DECODER is
  begin
    process (I)
    begin
      case I is
        when "00000000" => O <= "1000...0000";
        when "00000001" => O <= "0100...0000";
        when "00000010" => O <= "0010...0000";
        ...
        when "11111110" => O <= "0000...0010";
        when "11111111" => O <= "0000...0001";
      end case;
    end process;
  end behavioral;
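The decoder's behavior can be modeled in a few lines of Python (a sketch, not from the slides): each 8-bit input selects exactly one of 256 output bits, with bit 255 corresponding to input 0 as in the VHDL case statement.

```python
# One-hot 8-to-256 decoder model: the 256-bit output is held in an int.
def decode(i: int) -> int:
    """Return a 256-bit one-hot value for an 8-bit input."""
    assert 0 <= i < 256
    return 1 << (255 - i)   # bit 255 (leftmost in the VHDL) is input 0

assert decode(0) == 1 << 255    # "1000...0000"
assert decode(255) == 1         # "0000...0001"
assert all(bin(decode(i)).count("1") == 1 for i in range(256))  # one-hot
```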
12. Hardware Synthesis
- Synthesis
- Uses at least three 4-input look-up tables per output bit to decode the 256 combinations of I(7:0)
- Resource Usage
- 3 LUT4 × 256
- 768 LUT4
- Critical Path
- Input/output pin delays
- 2 levels of LUT4
- Sometimes 3 levels?!
- Virtex 4, speed grade -11
- 8.281 ns → 121 MHz
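A back-of-the-envelope check of the slide's numbers: with 4-input LUTs, each of the 256 output bits needs two first-level LUT4s (covering the 8 inputs) plus one combining LUT4, and an 8.281 ns period is about 121 MHz.

```python
import math

# LUT4 count for decoding 8 inputs into 256 one-hot outputs, 2 logic levels.
inputs, outputs, lut_inputs = 8, 256, 4
luts_per_output = math.ceil(inputs / lut_inputs) + 1  # 2 + 1 combining LUT
total_luts = luts_per_output * outputs
print(total_luts)                 # 768, as on the slide

period_ns = 8.281
print(round(1e3 / period_ns))     # ~121 MHz
```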
13. Pipelined Decoder
- Input/output pin DFFs
- Already in most FPGAs
- Minimize pin latencies
- DFF after every LUT4
- A LUT4 is always followed by a DFF (why not use it?)
- Only when possible
- Minimizes logic latency
- FPGA Resources
- 768 LUT4 as before
- Plus 768 DFFs and 264 pin DFFs
- But not really extra: each LUT4 slot already includes a DFF
- Critical Path
- 1 level of LUT4
- Plus a small DFF propagation delay and setup time
- Virtex 4, speed grade -11
- 2.198 ns → 455 MHz
- 3.76x speedup
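The speedup arithmetic checks out against the two cycle times quoted above:

```python
# 8.281 ns combinational vs 2.198 ns pipelined, from the slide's timing runs.
before_mhz = round(1e3 / 8.281)   # 121
after_mhz = round(1e3 / 2.198)    # 455
print(before_mhz, after_mhz, round(after_mhz / before_mhz, 2))  # 121 455 3.76
```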
14. Logic Resources
- Leveraging the FPGA Architecture
- Similarity with the architecture: a LUT and a few special logic elements followed by a DFF
- A Smaller Design is often Faster
- Easier for the tools to map, place, and route
- Optimize designs wherever possible
- In an FPGA, each wire has a large fanout limit
- Reuse logic and results
Fanout → the capacity of a wire to drive the inputs of other logic
15. Reusing Logic
- Synthesis Tools
- Obvious duplicate logic is combined automatically
- Most is not optimized
- Decoder Example
- Two 4-bit to 16-bit decoders
- Combining decoder outputs
- Two 16-bit outputs combined into 256 bits
- Critical Path
- 1 level of LUT4
- Approximately the same
- Differences in wire delay
- FPGA Resources
- I/O DFFs remain the same
- 2 × 16 LUT4 and DFF
- Plus 256 LUT4 and DFF
- Total 288 LUT4 and DFF!
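The logic-reuse trick can be sketched in Python (function names are my own): split the 8-bit input into two nibbles, decode each with a 4-to-16 decoder, and AND the one-hot outputs pairwise.

```python
# 2 x 16 LUTs for the small decoders + 256 two-input ANDs, instead of 768.
def decode4(n: int) -> list[int]:
    """4-to-16 one-hot decoder: decode4(n)[k] == 1 iff k == n."""
    return [1 if k == n else 0 for k in range(16)]

def decode8(i: int) -> list[int]:
    hi = decode4(i >> 4)    # upper nibble -> 16 one-hot bits (reused)
    lo = decode4(i & 0xF)   # lower nibble -> 16 one-hot bits (reused)
    # Output j = hi[j // 16] AND lo[j % 16]: 256 two-input AND gates.
    return [hi[j >> 4] & lo[j & 0xF] for j in range(256)]

# Every input still produces the correct one-hot output.
assert all(decode8(i)[j] == (1 if j == i else 0)
           for i in range(256) for j in range(256))
print("2*16 + 256 =", 2 * 16 + 256, "LUTs instead of 768")
```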
16. Virtex 4 Elementary Logic Block
- 2-to-1 multiplexers
- 4-input LUT
- 1-bit D flip-flops
17. Using MUXF as 2-Input Gates
[Figure: a 2-to-1 MUX with select a and data inputs 0 and b computes z = a AND b; with data inputs b and 1 it computes z = a OR b]
- Inverters can be pushed into the LUT4 or DFF (by using inverted Q)
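The MUX-as-gate idea can be verified exhaustively: tying one data input of a 2-to-1 multiplexer to a constant turns it into AND or OR.

```python
# Model of a 2-to-1 multiplexer used as a 2-input gate.
def mux(d0: int, d1: int, sel: int) -> int:
    return d1 if sel else d0

mux_and = lambda a, b: mux(0, b, sel=a)   # a ? b : 0  ==  a AND b
mux_or = lambda a, b: mux(b, 1, sel=a)    # a ? 1 : b  ==  a OR b

for a in (0, 1):
    for b in (0, 1):
        assert mux_and(a, b) == (a & b)
        assert mux_or(a, b) == (a | b)
print("MUXF implements AND/OR")
```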
18. Using Unused Multiplexers
- Decoder Example
- Replace all LUT4s in the 2nd decoder stage with MUX-based 2-input AND gates
- Critical Path
- Same
- 2.198 ns → 455 MHz
- FPGA Resources
- I/O DFFs remain the same
- 256 MUXF and DFF
- 32 LUT4 and DFF
19. Parallel Design
- Use Area to Increase Performance
- Increase the input bandwidth (input bus width)
- Process multiple data items at a time
- Duplicate engines to process independent data sets
- Thread/object-level parallelism
- Instruction-level parallelism
- Unroll loops to expose the parallelism
- Excellent for Streaming Data Applications
- Multimedia
- Network processing
- Performance Scalability
- Linear performance increase with size
- Achieved for many algorithms
- Sometimes exponential hardware size
- Try to scale using a higher level of parallelism
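Loop unrolling as a way to expose parallelism can be sketched in software; `engine` below is a made-up placeholder for a per-element computation, and in hardware the four unrolled calls would run as duplicated engines in the same cycle.

```python
# Unroll a loop by a factor of 4: the calls inside one iteration have no
# data dependencies, so a tool could map each to its own parallel engine.
def engine(x):
    return x * x            # placeholder per-element computation

def process_unrolled(data, factor=4):
    out = []
    tail = len(data) - len(data) % factor
    for i in range(0, tail, factor):
        # Independent computations -> candidates for duplicated hardware.
        out.extend(engine(data[i + k]) for k in range(factor))
    out.extend(engine(x) for x in data[tail:])   # leftover elements
    return out

print(process_unrolled([1, 2, 3, 4, 5]))   # → [1, 4, 9, 16, 25]
```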
20. Summary
- FPGA Design Methods
- Fine-grain pipelining to increase the clock rate
- If possible, 1 level of LUT followed by a DFF
- Parallel engines to increase bandwidth
- Duplicate logic to increase performance linearly
- Reducing Logic Resource Usage
- Reuse duplicate logic
- Use all available embedded logic
- There are other logic resources (e.g., embedded processors, large memories, optimized primitive gates, and IP cores)
- Best Methods Today
- Learn the internal architecture of the FPGA
- Make your own templates and use them
- Use IP cores
- Future Research Topics
- Integration of generalized pipelining algorithms (in the works)
- Smarter synthesis tools (understanding HDL)
- Automatic platform-specific optimization techniques