Title: Hardware Acceleration of Applications Using FPGAs
1 Hardware Acceleration of Applications Using FPGAs
- Andy Nisbet
- Andy.Nisbet@cs.tcd.ie
- http://www.cs.tcd.ie/Andy.Nisbet
- Phone: 353-(0)1-608-3682
- FAX: 353-(0)1-677-2204
2 Content
- FPGAs?
- High-level language hardware compilation.
- High Performance Computing Using FPGAs???
- HandelC.
- Research Directions at Trinity College using
FPGAs.
3 FPGAs
- FPGAs can be configured after the time of manufacture.
- Configurable logic blocks, input/output blocks for connecting to external microchip pins, and programmable interconnect.
- Logic blocks can be configured/interconnected to form simple combinatorial logic structures or complex functional units.
- FPGAs can be used to provide a standalone solution to a task, or they can be used in tandem with a conventional microprocessor.
4 FPGA Development Boards
5 How are FPGAs Programmed?
- Conventional techniques use VHDL or Verilog. These require many low-level hardware details.
- Synthesis tools can then convert the HDL into an EDIF netlist which can be placed and routed onto an FPGA device. A bit/configuration file is produced.
- High-level languages such as HandelC and SystemC. A hardware compiler translates the specification into VHDL/Verilog, or an EDIF netlist (a minimal sketch follows this list).
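For a flavour of the high-level route, here is a minimal Handel-C sketch of the kind of source such a hardware compiler consumes; the pin name "ClockPin" and the counter itself are purely illustrative.

  // Minimal Handel-C sketch: an 8-bit counter. The hardware compiler turns
  // this into VHDL/Verilog or an EDIF netlist for place and route.
  set clock = external "ClockPin";   // assumed pin name, illustrative only

  void main(void)
  {
      unsigned int 8 count;

      count = 0;                     // one clock cycle
      while (1)
          count++;                   // one clock cycle per increment
  }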
6 Why use an FPGA?
- Conventional microprocessors have a fixed architecture.
- FPGAs can generate application-specific logic where the balance and mix of functional units can be altered dynamically.
- The number and type of functional units that can be instantiated on an FPGA are limited only by the silicon real estate available.
- Potential to generate orders-of-magnitude speedups for computationally intensive algorithms, such as in signal/image processing.
- Maximum clock speed of an FPGA is < 400 MHz; designs often run at 50 MHz.
7 High Performance Computing using FPGAs
- Current FPGAs can instantiate multiple floating-point units. Applications work has focussed on using integer and fixed-point arithmetic.
- A Logarithmic Number System (LNS) ALU has single 20/32-bit precision with very small area in comparison to standard floating-point units (see the sketch after this list).
- Performance benefits for this system have already been demonstrated over Texas Instruments TMS320C3/4x 50 MHz DSP processors on a 2-million-gate Xilinx FPGA device.
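To see why an LNS ALU can be so small, the sketch below shows a hedged LNS multiply under an assumed sign-plus-fixed-point-log2 encoding; the 31-bit log field and the variable names are illustrative rather than the format of the ALU reported above, and zero/overflow handling is omitted.

  // With a value held as a sign bit plus log2(|value|) in fixed point,
  // a multiply reduces to one XOR and one fixed-point add.
  unsigned int 1  signA, signB, signP;
  unsigned int 31 logA, logB, logP;    // assumed fixed-point log2 magnitudes

  par
  {
      signP = signA ^ signB;           // sign of the product
      logP  = logA + logB;             // log2(|a*b|) = log2(|a|) + log2(|b|)
  }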
8 LNS on XC2V8000
- 14 independent 32-bit ALUs (2-3 GFlop peak estimate).
- 336 independent 20-bit ALUs (20-40 GFlop peak estimate).
- Clock 60 MHz (predicted); ADD and SUB latency 6 cycles, pipelined. MUL, DIV and SQRT depend just on the number of parallel 32-bit adders, subtractors or bit shifters.
9 HandelC
- Provided by Celoxica: http://www.celoxica.com/
- Based on CSP/OCCAM (we're not showing CSP aspects in this talk!)
- C with hardware extensions.
- Enables the compilation of programs into synchronous/clocked hardware.
- Many other similar systems: SystemC, Forge, JHDL, SA-C.
10 HandelC
- Fork-join model of parallel computation; parallel statements are placed in a par block.
- The CSP aspect uses channels, which are useful for multiple clock domains (a small sketch follows this list).
- Each statement takes ONE clock cycle to execute.
- The clock cycle is determined after place and route and is 1/(longest logic gate + routing delay).
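As a hedged illustration of the channel construct (within a single clock domain for simplicity; the names link and received are made up for this sketch):

  // Two parallel branches synchronising over a channel: the write (!) and
  // the read (?) both block until the other side is ready.
  chan unsigned int 8 link;
  unsigned int 8 received;

  par
  {
      link ! 42;          // producer: send the value 42
      link ? received;    // consumer: wait for a value and store it
  }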
11 HandelC Simple Example

// Original code takes one LONG clock cycle.
unsigned int 32 x, a, b, c, d, e, f, g, h;
unsigned int 32 temp1, temp2, temp3, temp4, sum1, sum2;

x = a + b + c + d + e + f + g + h;

// Parallelise and pipeline into something taking 3 SHORT cycles
par
{   // all statements inside the par block are executed in parallel
    temp1 = a + b;
    temp2 = c + d;
    temp3 = e + f;
    temp4 = g + h;
    // position 1 SYNCHRONISATION?????
    sum1 = temp1 + temp2;
    sum2 = temp3 + temp4;
    // position 2 SYNCHRONISATION?????
    x = sum1 + sum2;
}
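One hedged answer to the synchronisation questions marked above: compose the stages sequentially, one par block per stage. Each par block takes one clock cycle, so the dependencies between stages are respected and x is ready after 3 short cycles.

  // Sequential composition of par blocks: one clock cycle per stage.
  seq
  {
      par { temp1 = a + b; temp2 = c + d; temp3 = e + f; temp4 = g + h; }
      par { sum1 = temp1 + temp2; sum2 = temp3 + temp4; }
      x = sum1 + sum2;
  }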
12 Porting/Optimising Applications for HandelC
- Define variable storage: bit width, off-chip SRAM, on-chip FPGA synthesised registers/RAM/Block RAM.
- Replace floating-point with LNS or fixed-point arithmetic (a fixed-point sketch follows this list).
- Iterative optimisation process: apply high-level restructuring transformations -> see file.HTML
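As an illustration of the fixed-point option, a minimal 8.8 fixed-point sketch (values scaled by 256; the widths and constants are illustrative only):

  // 8.8 fixed point: the real value v is stored as v * 256 in 16 bits.
  unsigned int 16 a, b, sum;

  a   = 0x0180;     // 1.5  -> 1.5 * 256
  b   = 0x0040;     // 0.25 -> 0.25 * 256
  sum = a + b;      // 1.75 in 8.8: additions need no rescaling
  // A multiply would need widening to 32 bits and a >> 8 rescale afterwards.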
13 Efficient HandelC
- Replace parallel for loops with parallel while loops. The loop increment can then execute in parallel.
- Avoid the use of n-bit (n >> 1) comparators (<, <=, >, >=) and single-cycle multipliers.
- Parallelise and pipeline code as far as possible.
- Use dedicated on-chip resources such as multipliers (interface command/VHDL/Verilog).
- Sequential statements not on the critical path can share functional units in order to reduce area requirements.
- Optimise variable storage: registers, distributed RAM, block RAM, or off-chip SRAM.
14 For -> While

unsigned int 8 i;
for (i = 0; i < 255; i++)
    par { ... }

// becomes

unsigned int 1 terminate = 0;
i = 0;
while (!terminate)
    par
    {
        terminate = (i == 254);   // loop body statements also go in this par
        i++;                      // increment runs in parallel with the body
    }
15 Variable Storage

unsigned int 32 i;                      // REGISTER
unsigned int 23 j[40];                  // ARRAY of REGISTERs (fully associative)

ram unsigned 8 myRAM[16];               // single-port DISTRIBUTED RAM

mpram                                   // dual-port DISTRIBUTED RAM
{
    ram unsigned int 8 readWrite[16];   // R/W port
    rom unsigned int 8 readOnly[16];    // could be ram as well
} myMPRAM;

// To minimise logic for access to RAMs/ROMs, USE registers as indices
myRAM[aRegister] = aRegisterDataValue;

// Adding "with { block = 1 }" makes BLOCK RAM
ram unsigned int 8 myBlockRAM[16] with { block = 1 };

// A two-dimensional RAM; the par block writes two of its rows
ram unsigned int 21 twoDim[2][128];
par
{
    twoDim[0][aReg] = 0;
    twoDim[1][aReg] = 1;
}
16 FPGA Research at TCD
- Interactive/automatic iterative conversion from C to HandelC/SystemC/FORGE; prototype using SUIF/NCI (with David Gregg).
- Application studies using lattice QCD, image segmentation and image processing (with Jim Sexton, Simon Wilson and Fergal Shevlin). Collision detection and telecommunication applications.
- FPGA/SCI work (Michael Manzke).
- Exploitation of striped CPU + FPGAs.
- Numerical stability? Floating-point to 20/32-bit LNS and fixed-point.
- New work, no results (yet!), focussed on compute-bound applications, as PCI has poor IO.