Title: Hardware Acceleration of Applications Using FPGAs
1 Hardware Acceleration of Applications Using FPGAs
- Andy Nisbet
- Andy.Nisbet@cs.tcd.ie
- http://www.cs.tcd.ie/Andy.Nisbet
- Phone: 353-(0)1-608-3682
- FAX: 353-(0)1-677-2204
2 Content
- FPGAs?
- High-level language hardware compilation.
- High Performance Computing Using FPGAs???
- HandelC.
- Research Directions at Trinity College using
FPGAs.
3 FPGAs
- FPGAs can be configured after the time of manufacture.
- Configurable logic blocks, input/output blocks for connecting to external microchip pins, and programmable interconnect.
- Logic blocks can be configured/interconnected to form simple combinatorial logic structures or complex functional units.
- FPGAs can be used to provide a standalone solution to a task, or they can be used in tandem with a conventional microprocessor.
4 FPGA Development Boards
5 How are FPGAs Programmed?
- Conventional techniques use VHDL or Verilog. These require many low-level hardware details.
- Synthesis tools can then convert the HDL into an EDIF netlist which can be placed and routed onto an FPGA device. A bit/configuration file is produced.
- High-level languages such as HandelC and SystemC. A hardware compiler translates the specification into VHDL/Verilog, or an EDIF netlist (a minimal sketch follows this list).
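For a flavour of the high-level route, here is a minimal Handel-C sketch of the kind of source such a hardware compiler consumes; the pin name "ClockPin" and the counter itself are purely illustrative.

  // Minimal Handel-C sketch: an 8-bit counter. The hardware compiler turns
  // this into VHDL/Verilog or an EDIF netlist for place and route.
  set clock = external "ClockPin";   // assumed pin name, illustrative only

  void main(void)
  {
      unsigned int 8 count;

      count = 0;                     // one clock cycle
      while (1)
          count++;                   // one clock cycle per increment
  }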
6 Why use an FPGA?
- Conventional microprocessors have a fixed architecture.
- FPGAs can generate application-specific logic where the balance and mix of functional units can be altered dynamically.
- The number and type of functional units that can be instantiated on an FPGA are limited only by the silicon real estate available.
- Potential to generate orders-of-magnitude speedups for computationally intensive algorithms, such as in signal/image processing.
- Maximum clock speed of an FPGA is < 400 MHz; designs often run at 50 MHz.
7 High Performance Computing using FPGAs
- Current FPGAs can instantiate multiple floating-point units. Applications work has focussed on using integer and fixed-point arithmetic.
- A Logarithmic Number System (LNS) ALU has single 20/32-bit precision with very small area in comparison to standard floating-point units (see the sketch after this list).
- Performance benefits for this system have already been demonstrated over Texas Instruments TMS320C3/4x 50 MHz DSP processors on a 2-million-gate Xilinx FPGA device.
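To see why an LNS ALU can be so small, the sketch below shows a hedged LNS multiply under an assumed sign-plus-fixed-point-log2 encoding; the 31-bit log field and the variable names are illustrative rather than the format of the ALU reported above, and zero/overflow handling is omitted.

  // With a value held as a sign bit plus log2(|value|) in fixed point,
  // a multiply reduces to one XOR and one fixed-point add.
  unsigned int 1  signA, signB, signP;
  unsigned int 31 logA, logB, logP;    // assumed fixed-point log2 magnitudes

  par
  {
      signP = signA ^ signB;           // sign of the product
      logP  = logA + logB;             // log2(|a*b|) = log2(|a|) + log2(|b|)
  }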
8 LNS on XC2V8000
- 14 independent 32-bit ALUs (2-3 GFlop peak estimate).
- 336 independent 20-bit ALUs (20-40 GFlop peak estimate).
- Clock 60 MHz (predicted); ADD and SUB latency 6 cycles, pipelined. MUL, DIV and SQRT depend just on the number of parallel 32-bit adders, subtractors or bit shifters.
9 HandelC
- Provided by Celoxica: http://www.celoxica.com/
- Based on CSP/OCCAM (we're not showing CSP aspects in this talk!)
- C with hardware extensions.
- Enables the compilation of programs into synchronous/clocked hardware.
- Many other similar systems: SystemC, Forge, JHDL, SA-C.
10 HandelC
- Fork-join model of parallel computation; parallel statements are placed in a par block.
- The CSP aspect uses channels, which are useful for multiple clock domains (a small sketch follows this list).
- Each statement takes ONE clock cycle to execute.
- The clock cycle is determined after place and route and is 1/(longest logic gate + routing delay).
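As a hedged illustration of the channel construct (within a single clock domain for simplicity; the names link and received are made up for this sketch):

  // Two parallel branches synchronising over a channel: the write (!) and
  // the read (?) both block until the other side is ready.
  chan unsigned int 8 link;
  unsigned int 8 received;

  par
  {
      link ! 42;          // producer: send the value 42
      link ? received;    // consumer: wait for a value and store it
  }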
11 HandelC Simple Example

// Original code takes one LONG clock cycle.
unsigned int 32 x, a, b, c, d, e, f, g, h;
unsigned int 32 temp1, temp2, temp3, temp4, sum1, sum2;

x = a + b + c + d + e + f + g + h;

// Parallelise and pipeline into something taking 3 SHORT cycles
par
{   // all statements inside the par block are executed in parallel
    temp1 = a + b;
    temp2 = c + d;
    temp3 = e + f;
    temp4 = g + h;
    // position 1 SYNCHRONISATION?????
    sum1 = temp1 + temp2;
    sum2 = temp3 + temp4;
    // position 2 SYNCHRONISATION?????
    x = sum1 + sum2;
}
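One hedged answer to the synchronisation questions marked above: compose the stages sequentially, one par block per stage. Each par block takes one clock cycle, so the dependencies between stages are respected and x is ready after 3 short cycles.

  // Sequential composition of par blocks: one clock cycle per stage.
  seq
  {
      par { temp1 = a + b; temp2 = c + d; temp3 = e + f; temp4 = g + h; }
      par { sum1 = temp1 + temp2; sum2 = temp3 + temp4; }
      x = sum1 + sum2;
  }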
12 Porting/Optimising Applications for HandelC
- Define variable storage: bit width, off-chip SRAM, on-chip FPGA synthesised registers/RAM/Block RAM.
- Replace floating-point with LNS or fixed-point arithmetic (a fixed-point sketch follows this list).
- Iterative optimisation process: apply high-level restructuring transformations -> see file.HTML
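As an illustration of the fixed-point option, a minimal 8.8 fixed-point sketch (values scaled by 256; the widths and constants are illustrative only):

  // 8.8 fixed point: the real value v is stored as v * 256 in 16 bits.
  unsigned int 16 a, b, sum;

  a   = 0x0180;     // 1.5  -> 1.5 * 256
  b   = 0x0040;     // 0.25 -> 0.25 * 256
  sum = a + b;      // 1.75 in 8.8: additions need no rescaling
  // A multiply would need widening to 32 bits and a >> 8 rescale afterwards.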
13 Efficient HandelC
- Replace parallel for loops with parallel while loops. The loop increment can then execute in parallel.
- Avoid the use of n-bit (n >> 1) comparators (<, <=, >, >=) and single-cycle multipliers.
- Parallelise and pipeline code as far as possible.
- Use dedicated on-chip resources such as multipliers (interface command/VHDL/Verilog).
- Sequential statements not on the critical path can share functional units in order to reduce area requirements.
- Optimise variable storage: registers, distributed RAM, block RAM, or off-chip SRAM.
14 For -> While

unsigned int 8 i;
for (i = 0; i < 255; i++)
    par { ... }

// becomes

unsigned int 1 terminate = 0;
i = 0;
while (!terminate)
    par
    {
        terminate = (i == 254);   // loop body statements also go in this par
        i++;                      // increment runs in parallel with the body
    }
15 Variable Storage

unsigned int 32 i;                      // REGISTER
unsigned int 23 j[40];                  // ARRAY of REGISTERs (fully associative)

ram unsigned 8 myRAM[16];               // single-port DISTRIBUTED RAM

mpram                                   // dual-port DISTRIBUTED RAM
{
    ram unsigned int 8 readWrite[16];   // R/W port
    rom unsigned int 8 readOnly[16];    // could be ram as well
} myMPRAM;

// To minimise logic for access to RAMs/ROMs, USE registers as indices
myRAM[aRegister] = aRegisterDataValue;

// Adding "with { block = 1 }" makes BLOCK RAM
ram unsigned int 8 myBlockRAM[16] with { block = 1 };

// A two-dimensional RAM; the par block writes two of its rows
ram unsigned int 21 twoDim[2][128];
par
{
    twoDim[0][aReg] = 0;
    twoDim[1][aReg] = 1;
}
16 FPGA Research at TCD
- Interactive/automatic iterative conversion from C to HandelC/SystemC/FORGE; prototype using SUIF/NCI (with David Gregg).
- Application studies using lattice QCD, image segmentation and image processing (with Jim Sexton, Simon Wilson and Fergal Shevlin). Collision detection and telecommunication applications.
- FPGA/SCI work (Michael Manzke).
- Exploitation of striped CPU + FPGAs.
- Numerical stability? Floating-point to 20/32-bit LNS and fixed-point.
- New work, no results (yet!), focussed on compute-bound applications, as PCI has poor IO.