Transcript and Presenter's Notes

Title: Hardware Acceleration of Applications Using FPGAs


1
Hardware Acceleration of Applications Using FPGAs
  • Andy Nisbet
  • Andy.Nisbet@cs.tcd.ie
  • http://www.cs.tcd.ie/Andy.Nisbet
  • Phone: +353-(0)1-608-3682
  • FAX: +353-(0)1-677-2204

2
Content
  • FPGAs?
  • High-level language hardware compilation.
  • High Performance Computing Using FPGAs???
  • HandelC.
  • Research Directions at Trinity College using
    FPGAs.

3
FPGAs
  • FPGAs can be configured after the time of
    manufacture.
  • They contain configurable logic blocks, input/output
    blocks for connecting to external chip pins, and
    programmable interconnect.
  • Logic blocks can be configured and interconnected to
    form anything from simple combinatorial logic
    structures to complex functional units.
  • FPGAs can be used to provide a standalone
    solution to a task, or they can be used in tandem
    with a conventional microprocessor.

4
FPGA Development Boards
5
How are FPGAs Programmed?
  • Conventional techniques use VHDL or Verilog. These
    require many low-level hardware details.
  • Synthesis tools then convert the HDL into an EDIF
    netlist, which can be placed and routed onto an FPGA
    device. A bit/configuration file is produced.
  • High-level languages such as HandelC and SystemC can
    also be used. A hardware compiler translates the
    specification into VHDL/Verilog, or an EDIF netlist.

6
Why use an FPGA?
  • Conventional microprocessors have a fixed
    architecture.
  • FPGAs can generate application specific logic
    where the balance and mix of functional units can
    be altered dynamically.
  • The number and type of functional units that can
    be instantiated on an FPGA are only limited by
    the silicon real estate available.
  • Potential to generate orders of magnitude speedup
    for computationally intensive algorithms, such as
    in signal/image-processing.
  • Maximum clock speed of an FPGA is < 400 MHz; designs
    often run at 50 MHz.

7
High Performance Computing using FPGAs
  • Current FPGAs can instantiate multiple
    floating-point units, but applications work has
    focussed on using integer and fixed-point
    arithmetic.
  • Logarithmic Number System (LNS) ALU has single
    20/32 bit precision with very small area in
    comparison to standard floating-point units.
  • Performance benefits for this system have already
    been demonstrated over Texas Instruments
    TMS320C3x/4x 50 MHz DSP processors on a
    2-million-gate Xilinx FPGA device.

8
LNS on XC2V8000
  • 14 independent 32-bit ALUs (2-3 GFlop peak
    estimate).
  • 336 independent 20-bit ALUs (20-40 GFlop peak
    estimate).
  • Clock 60 MHz (predicted); ADD and SUB latency 6
    cycles, pipelined. MUL/DIV/SQRT throughput depends
    just on the number of parallel 32-bit adders,
    subtractors or bit shifters (see the note below).
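The reason MUL, DIV and SQRT need only adders, subtractors and shifters is the logarithmic representation itself. Below is a minimal HandelC-style sketch, not taken from the original slides: it assumes magnitudes are stored as signed 32-bit fixed-point base-2 logarithms and omits sign/zero handling; the variable names are illustrative.

// Assumed LNS encoding: magnitude stored as a signed fixed-point log2.
int 32 logA, logB;
int 32 logProd, logQuot, logRoot;

par
{
    logProd = logA + logB;   // a * b    ->  log(a) + log(b)
    logQuot = logA - logB;   // a / b    ->  log(a) - log(b)
    logRoot = logA >> 1;     // sqrt(a)  ->  log(a) / 2 (arithmetic shift)
}

ADD and SUB are the expensive LNS operations (hence the 6-cycle pipelined latency quoted above), since they require evaluating a non-linear function of the logarithms rather than simple integer arithmetic.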

9
HandelC
  • Provided by Celoxica: http://www.celoxica.com/
  • Based on CSP/OCCAM (we're not showing CSP aspects
    in this talk!)
  • C with hardware extensions.
  • Enables the compilation of programs into
    synchronous/clocked hardware (a minimal program is
    sketched below).
  • Many other similar systems: SystemC, Forge, JHDL,
    SA-C.
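Not from the original slides, but as a flavour of the language, a minimal HandelC program sketch: an 8-bit counter compiled to clocked hardware. The clock specification string "P1" is a placeholder.

set clock = external "P1";   // clock source specification (placeholder pin)

void main(void)
{
    unsigned int 8 count;    // 8-bit register

    count = 0;               // one clock cycle
    while (1)
    {
        count++;             // one clock cycle per iteration
    }
}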

10
HandelC
  • Fork-join model of parallel computation; parallel
    statements are placed in a par block.
  • The CSP aspect uses channels, useful for crossing
    multiple clock domains (a small sketch follows
    below).
  • Each statement takes ONE clock cycle to execute.
  • The clock period is fixed after place and route; the
    maximum clock rate is 1/(longest logic and routing
    delay).
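A small sketch of the channel syntax mentioned above (illustrative names, not from the original slides): chan declares a channel, ! sends, ? receives, and the two sides synchronise (rendezvous) before the transfer completes.

chan unsigned int 8 link;     // 8-bit channel

par
{
    // producer branch: send a value (blocks until the receiver is ready)
    link ! 42;

    // consumer branch: receive into a register
    {
        unsigned int 8 result;
        link ? result;
    }
}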

11
HandelC Simple Example
// Original code takes one LONG clock cycle.
unsigned int 32 x, a, b, c, d, e, f, g, h;

x = a + b + c + d + e + f + g + h;

// Parallelise and pipeline into something taking 3 SHORT cycles.
unsigned int 32 temp1, temp2, temp3, temp4, sum1, sum2;

par
{   // all statements inside the par block are executed in parallel
    temp1 = a + b;
    temp2 = c + d;
    temp3 = e + f;
    temp4 = g + h;
    // position 1 SYNCHRONISATION?????
    sum1 = temp1 + temp2;
    sum2 = temp3 + temp4;
    // position 2 SYNCHRONISATION?????
    x = sum1 + sum2;
}
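One way to read the SYNCHRONISATION????? markers (this reading is not spelled out on the slide): since every statement in the par block fires on every clock cycle, temp1..temp4, sum1/sum2 and x form a 3-stage pipeline. The marked positions are where operands and intermediate results must be kept in step, e.g. by registering the inputs that feed the later stages, so that each value of x is built from one consistent set of a..h; after a 3-cycle latency a new x can then emerge every cycle.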
12
Porting/Optimising Applications for HandelC
  • Define variable storage bit widths and placement:
    off-chip SRAM, or on-chip FPGA-synthesised
    registers/RAM/Block RAM.
  • Replace floating-point with LNS or fixed-point
    arithmetic (a fixed-point sketch follows below).
  • Iterative optimisation process: apply high-level
    restructuring transformations => see file.HTML
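As an illustration of the fixed-point option, a minimal sketch assuming an unsigned 8.8 fixed-point format; the format, the names and the 0 @ zero-extension idiom used for widening are illustrative assumptions, not taken from the application work.

// 8.8 unsigned fixed-point: top 8 bits integer part, bottom 8 bits
// fraction (value = raw / 256).
unsigned int 16 fxA, fxB, fxSum, fxProd;
unsigned int 32 prod;

fxSum  = fxA + fxB;               // add/sub need no rescaling
prod   = (0 @ fxA) * (0 @ fxB);   // zero-extend to 32 bits, 16.16 product
fxProd = prod[23:8];              // rescale back to 8.8 (overflow ignored)

Note that a single-cycle * is exactly what the next slide advises against; in practice the multiply would be mapped to a dedicated on-chip multiplier or pipelined.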

13
Efficient HandelC
  • Replace parallel for loops with parallel while
    loops; the loop increment can then execute in
    parallel (see the For -> While example).
  • Avoid the use of wide n-bit (n >> 1) comparators
    (<, <=, >, >=) and single-cycle multipliers.
  • Parallelise and pipeline code as far as possible.
  • Use dedicated on-chip resources such as
    multipliers (via the interface command/VHDL/Verilog).
  • Sequential statements not on the critical path
    can share functional units in order to reduce
    area requirements.
  • Optimise variable storage: registers, distributed
    RAM, block RAM, or off-chip SRAM.

14
For -> While
unsigned int 8 i;
for (i = 0; i < 255; i++)
    par { /* loop body */ }

// becomes

unsigned int 1 terminate;
par { i = 0; terminate = 0; }    // initialise in one cycle
while (!terminate)
    par
    {
        terminate = (i == 254);
        i++;
        /* loop body */
    }
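The gain, as stated on the previous slide: in the for version the increment i++ is an extra sequential statement costing one clock cycle per iteration, whereas in the while version the increment, the termination flag and the loop body all sit in one par block and execute in the same cycle. The magnitude comparison i < 255 also shrinks to an equality test against 254, which is cheaper logic.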
15
Variable Storage
unsigned int 32 i;                 // REGISTER
unsigned int 23 j[40];             // ARRAY of REGISTERs (fully associative)

ram unsigned 8 myRAM[16];          // single-port DISTRIBUTED RAM

mpram                              // dual-port DISTRIBUTED RAM
{
    ram unsigned int 8 readWrite[16];   // R/W port
    rom unsigned int 8 readOnly[16];    // could be ram as well
} myMPRAM;

// To minimise logic for access to RAMs/ROMs, USE registers for
// the address and data:
myRAM[aRegister] = aRegisterDataValue;

// Adding "with { block = 1 }" makes BLOCK RAM
ram unsigned int 8 myBlockRAM[16] with { block = 1 };

ram unsigned int 2 twoDim[2][128];
par
{
    twoDim[0][aReg] = 0;
    twoDim[1][aReg] = 1;
}
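A note on the trade-off (standard HandelC behaviour rather than anything stated on the slide): an array of registers lets any number of elements be read or written in the same cycle but costs one register per element, while ram/mpram storage maps to distributed or block RAM and is far cheaper in area but allows only one access per port per clock cycle. The two-dimensional twoDim declaration builds a separate RAM per row, which is presumably why the two writes in the par block above can proceed in parallel.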
16
FPGA Research at TCD.
  • Interactive/Automatic iterative conversion from C
    to HandelC/SystemC/FORGE, prototype using
    SUIF/NCI (with David Gregg).
  • Application studies using lattice QCD, image
    segmentation, image processing (with Jim Sexton,
    Simon Wilson and Fergal Shevlin). Collision
    detection and telecommunication applications.
  • FPGA/SCI work (Michael Manzke).
  • Exploitation of striped CPU + FPGAs.
  • Numerical stability? Floating-point to 20/32-bit LNS
    and fixed-point.
  • New work, no results (yet!), focussed on compute-bound
    applications, as PCI has poor I/O.