Towards Compilation of Streaming Programs into FPGA Hardware

1
Towards Compilation of Streaming Programs into
FPGA Hardware
  • Franjo Plavec, Zvonko Vranesic, Stephen Brown
  • University of Toronto

Forum on Specification and Design
Languages September 25, 2008
2
Outline
  • Motivation
  • Brook Streaming Language
  • Converting Brook programs to FPGA logic
  • Results
  • Improving Performance
  • Related Work
  • Concluding Remarks

3
Motivation
  • Increasing complexity and cost of manufacturing
    ASICs
  • Field Programmable Gate Arrays are an attractive
    alternative
  • FPGAs are becoming huge
  • Designing for them is challenging!
  • Need high-level design-entry methods to make
    FPGAs easier to use
  • Convert software into custom hardware

4
Streaming Paradigm
  • Streaming (a.k.a. Stream Computing)
  • The term is used for many different systems
  • Recently popularized by languages such as
    StreamIt and Brook
  • Expose data parallelism
  • Explicit communication

5
Brook Streaming Language
  • Data is organized into streams
  • Stream is a collection of independent elements
  • Elements of a stream can be operated on in
    parallel
  • Computation is expressed through kernels
  • Kernel is a function operating on streams
  • Implicitly applied to every element of the input
    stream

[Figure: stream graph in which the input stream SIN passes through kernels K1 to K4, linked by temporary streams S1 to S4, to produce the output stream SOUT]
6
Kernels, Streams and Parallelism
  • Streaming model exposes:
  • Data parallelism: a kernel can operate on
    multiple stream elements in parallel
  • Task parallelism: multiple kernels can operate in
    parallel in a pipelined fashion

[Figure: stream graph in which the input stream SIN passes through kernels K1 to K4, linked by temporary streams S1 to S4, to produce the output stream SOUT]
7
Simple Brook Program
    kernel void mul (int op1<>, int op2<>, out int result<>) {  // <> denotes a stream
        result = op1 * op2;  // Gets applied to all stream elements in parallel
    }
    void main () {
        int a[3] = {10, 20, 30};
        int b[3] = {100, 200, 300};
        int c[3]; int i;
        int strm_a<3>, strm_b<3>, strm_c<3>;
        streamRead (strm_a, a);        // Copy array into stream (gather)
        streamRead (strm_b, b);        // Copy array into stream (gather)
        mul (strm_a, strm_b, strm_c);  // Perform computation
        streamWrite (strm_c, c);       // Copy stream into array (scatter)
    }

[Figure: dataflow of the program: Read (strm_a) and Read (strm_b), mul, Write (strm_c)]
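For readers unfamiliar with Brook, the program above can be approximated in plain C. This is only a sketch: the explicit loop stands in for Brook's implicit application of the kernel to every stream element, ordinary arrays stand in for the streams, and the names mul_stream and run_simple_example are hypothetical, not part of Brook or of the authors' compiler.

```c
#include <stddef.h>

/* Plain-C stand-in for the Brook kernel `mul`. The loop plays the role of
   the implicit "apply to every stream element" semantics. */
void mul_stream(const int *op1, const int *op2, int *result, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        result[i] = op1[i] * op2[i];  /* kernel body, once per element */
    }
}

/* Mirrors main() from the slide. Arrays stand in for strm_a, strm_b and
   strm_c, so the streamRead/streamWrite copies are not modeled. */
void run_simple_example(int c[3])
{
    const int a[3] = {10, 20, 30};
    const int b[3] = {100, 200, 300};
    mul_stream(a, b, c, 3);
}
```

Calling run_simple_example fills c with the element-wise products of a and b.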
8
Autocorrelation in Brook
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
    void main () {
        for (instance = 0; instance < LAGS; instance++) {
            createStream1 (array_ptr, stream1, instance);
            createStream2 (array_ptr + instance, stream2, instance);
            mul (stream1, stream2, mul_result_stream);
            sum (mul_result_stream, reduce_result_stream);
            write_stream (reduce_result_stream, output_array + instance);
        }
    }
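As a point of comparison, the same computation can be written as a plain C reference routine. This is a sketch under stated assumptions: the slide does not give the stream lengths, so the code assumes each lag uses the n - lag overlapping samples, and the function name autocorr_ref is hypothetical.

```c
#include <stddef.h>

/* Reference autocorrelation in plain C. For each lag ("instance" on the
   slide), the signal is multiplied by a shifted copy of itself and the
   products are summed.
   ASSUMPTION: each lag uses the n - lag overlapping samples; the slide
   does not state the stream lengths. Lags >= n are skipped. */
void autocorr_ref(const int *x, size_t n, size_t lags, long *out)
{
    for (size_t lag = 0; lag < lags && lag < n; lag++) {
        long acc = 0;                        /* role of the `sum` reduction */
        for (size_t i = 0; i + lag < n; i++) {
            acc += (long)x[i] * x[i + lag];  /* role of the `mul` kernel */
        }
        out[lag] = acc;                      /* role of write_stream */
    }
}
```

In the Brook version the multiply and the reduction are separate kernels, which is what lets them run as concurrent pipeline stages in hardware; the nested loop here fuses them only for brevity.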

9
From Brook to FPGA Logic
  • The goal is to produce HDL that implements the
    application
  • Leverage existing tools
  • Altera's C2H compiler can convert simple C code
    into HDL
  • Existing Brook compiler can target GPUs
  • We modified it to convert kernel code into C,
    suitable for C2H compilation

10
Converting Brook to C
  • Target an SOPC (System on a Programmable Chip)
  • Kernel code mapped into hardware
  • Soft processor (Nios II) executes non-kernel code
  • Kernels implicitly operate on all input stream
    elements
  • Done by a loop in the C code
  • Streams are passed between kernels
  • Amount of on-chip memory in FPGAs is relatively
    low
  • Use FIFOs to pass streams
  • References to streams converted into pointers
  • Pointers are mapped to FIFOs using C2H pragma
    statements

11
Conversion Example: mul
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }
    void mul () {
        volatile int *a, *b, *c;
        int _iter;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            *c = (*a) * (*b);
        }
    }

FIFOs are implemented in hardware, so pointers do
not have to be modified in software
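To try the generated loop on a workstation, the hardware FIFOs can be replaced by a small software model. The sketch below is an assumption-laden illustration, not the authors' flow: fifo_t, fifo_push, fifo_pop, mul_kernel and the depth FIFO_CAPACITY are all hypothetical names, and each pointer dereference on the slide is modeled as an explicit pop or push.

```c
#include <stddef.h>

#define FIFO_CAPACITY 16  /* hypothetical depth, chosen only for this sketch */

/* Software stand-in for a hardware FIFO, implemented as a ring buffer. */
typedef struct {
    int data[FIFO_CAPACITY];
    size_t head;   /* index of the next element to pop */
    size_t count;  /* number of stored elements */
} fifo_t;

/* Returns 1 on success, 0 if the FIFO is full. */
int fifo_push(fifo_t *f, int value)
{
    if (f->count == FIFO_CAPACITY) return 0;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = value;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty. */
int fifo_pop(fifo_t *f, int *value)
{
    if (f->count == 0) return 0;
    *value = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Model of the generated mul loop: each read of *a or *b becomes a pop and
   each write to *c becomes a push. Returns the number of elements produced;
   stops early if an input is empty or the output is full, where real
   hardware would stall instead. */
size_t mul_kernel(fifo_t *a, fifo_t *b, fifo_t *c, size_t in_length)
{
    size_t produced = 0;
    for (size_t iter = 0; iter < in_length; iter++) {
        int x = 0, y = 0;
        if (a->count == 0 || b->count == 0 || c->count == FIFO_CAPACITY) break;
        fifo_pop(a, &x);
        fifo_pop(b, &y);
        fifo_push(c, x * y);
        produced++;
    }
    return produced;
}
```

The kernel never advances an index of its own: ordering is carried entirely by the FIFOs, which mirrors the note above that the pointers need no modification in software.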
12
Conversion Example: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
  • Sum is a reduction kernel
  • Input stream (a) has IN_LENGTH elements
  • Output stream (r) has one element
  • All elements of the input stream are combined
    (i.e. added) to produce the output

13
Conversion Example: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
    void sum() {
        volatile int *a, *r;
        int _temp_r, _iter;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            if (_iter == 0)
                _temp_r = *a;
            else
                _temp_r = _temp_r + (*a);
        }
        *r = _temp_r;
    }
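The reduction loop above can be checked in software with an array-based model. This is a minimal sketch, assuming the input stream is available as an ordinary array; the name sum_reduce is hypothetical, and the result is returned instead of being written through a FIFO pointer.

```c
#include <stddef.h>

/* Array-based model of the generated reduction loop. The first element
   initializes the accumulator (the `_iter == 0` branch on the slide) and
   every later element is added to it.
   ASSUMPTION: in_length >= 1; an empty stream has no meaningful result. */
int sum_reduce(const int *a, size_t in_length)
{
    int temp_r = 0;  /* only so the in_length == 0 case stays defined */
    for (size_t iter = 0; iter < in_length; iter++) {
        if (iter == 0)
            temp_r = a[iter];
        else
            temp_r = temp_r + a[iter];
    }
    return temp_r;  /* corresponds to the single write to *r */
}
```

Initializing from the first element, as the slide does, avoids needing an identity value for the reduction operator.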

14
General Case: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
  • Sum is a reduction kernel
  • Assume input stream (a) has IN_LENGTH elements
  • Assume output stream (r) has REDUCE_LENGTH
    elements
  • IN_LENGTH/REDUCE_LENGTH consecutive elements of
    the input stream are combined (i.e. added) to
    produce 1 element of the output stream

15
Conversion for General Case: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
    void sum() {
        volatile int *a, *r;
        int _temp_r, _iter, _mod_iter = 0;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            if ((_mod_iter == 0) && (_iter != 0))
                *r = _temp_r;
            if (_mod_iter == 0)
                _temp_r = *a;
            else
                _temp_r = _temp_r + (*a);
            if (_mod_iter == (IN_LENGTH/REDUCE_LENGTH-1))
                _mod_iter = 0;
            else
                _mod_iter = _mod_iter + 1;
        }
    }
  • Modulo operations are implemented inefficiently in
    hardware
  • _mod_iter is a compiler-generated variable that
    implements the modulo operation as a counter
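The counter-based grouping can be exercised in software with an array-based model. This is a sketch under stated assumptions: sum_reduce_general is a hypothetical name, in_length is assumed to be a multiple of reduce_length, and the write after the loop that emits the last group is an addition made for this model, since the transcribed slide shows no such statement.

```c
#include <stddef.h>

/* Array-based model of the general reduction. A wrap-around counter
   (mod_iter) replaces `iter % group`, as on the slide. Returns the number
   of outputs written to r.
   ASSUMPTIONS: reduce_length >= 1 and in_length is a multiple of it. */
size_t sum_reduce_general(const int *a, size_t in_length,
                          size_t reduce_length, int *r)
{
    if (reduce_length == 0 || in_length < reduce_length) return 0;

    size_t group = in_length / reduce_length;  /* inputs per output element */
    size_t mod_iter = 0;
    size_t out = 0;
    int temp_r = 0;

    for (size_t iter = 0; iter < in_length; iter++) {
        if (mod_iter == 0 && iter != 0)
            r[out++] = temp_r;          /* emit the group that just ended */
        if (mod_iter == 0)
            temp_r = a[iter];           /* first element of a new group */
        else
            temp_r = temp_r + a[iter];
        if (mod_iter == group - 1)
            mod_iter = 0;               /* counter wraps instead of using % */
        else
            mod_iter = mod_iter + 1;
    }
    r[out++] = temp_r;  /* ASSUMPTION: final flush, not shown on the slide */
    return out;
}
```

With six inputs and reduce_length 2, each output is the sum of three consecutive inputs.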

16
Autocorrelation in Brook
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
    void main () {
        for (instance = 0; instance < LAGS; instance++) {
            createStream1 (array_ptr, stream1, instance);
            createStream2 (array_ptr + instance, stream2, instance);
            mul (stream1, stream2, mul_result_stream);
            sum (mul_result_stream, reduce_result_stream);
            write_stream (reduce_result_stream, output_array + instance);
        }
    }

17
Autocorrelation Example
[Figure: block diagram with System Memory and the Nios II CPU. create1 and create2 produce stream1 and stream2 (FIFOs), which feed mul; mul_result_stream (FIFO) feeds sum; reduce_result_stream (FIFO) feeds write]
18
Conversion Example: C2H Pragmas
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }
[Figure: the stream1 and stream2 FIFOs feed mul, which writes to the mul_result_stream FIFO]
  • Generated C2H pragmas
    #pragma altera_accelerate connect_variable mul/a to stream1/out
    #pragma altera_accelerate connect_variable mul/b to stream2/out
    #pragma altera_accelerate connect_variable mul/c to mul_result_stream/in

19
Experimental Evaluation
  • Using Altera's Quartus II CAD software targeting
    the Cyclone II FPGA device
  • Compare the performance of our approach to two
    alternative implementations on the same FPGA
  • Software running on the Nios II soft-core
    processor
  • Software compiled to hardware using C2H (without
    first expressing the application in Brook)
  • Two applications
  • 8-tap FIR filter, process 100,000 samples
  • Autocorrelation, process 100,000 samples over 8
    different shift distances
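For reference, an 8-tap FIR filter of the kind used as a benchmark can be sketched in plain C as follows. This is not the benchmark code from the talk: the function name fir8, the integer data types and the zero initial history are assumptions made for illustration.

```c
#include <stddef.h>

#define NUM_TAPS 8

/* Direct-form FIR filter: y[i] = sum over k of h[k] * x[i - k].
   ASSUMPTION: samples before the start of x are treated as zero. */
void fir8(const int *x, size_t n, const int h[NUM_TAPS], long *y)
{
    for (size_t i = 0; i < n; i++) {
        long acc = 0;
        /* k <= i skips taps that would read before the first sample */
        for (size_t k = 0; k < NUM_TAPS && k <= i; k++) {
            acc += (long)h[k] * x[i - k];
        }
        y[i] = acc;
    }
}
```

Feeding a unit impulse through the filter returns the coefficient list, which is a quick sanity check for any implementation.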

20
Initial Results
  • Performance benefits from task parallelism
  • Kernels operate in parallel in a pipelined fashion

21
Replication Example
[Figure: block diagram with System Memory and the Nios II CPU. create1 and create2 produce stream1 and stream2 (FIFOs), which feed mul; mul_result_stream (FIFO) feeds sum; reduce_result_stream (FIFO) feeds write]
22
Exploiting Data Parallelism Through Replication
[Figure: the autocorrelation pipeline with mul and sum replicated twice (mul 1/2, mul 2/2, sum 1/2, sum 2/2). Each replica has its own stream1, stream2, mul_result and reduce_result FIFOs, alongside create1, create2, Write, the Nios II CPU and System Memory]
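The idea of replication can be illustrated in software by splitting the work of one mul and sum pipeline across two copies. This is a sketch only: the slides do not state how stream elements are distributed between replicas or how their results are merged, so the contiguous split and the final addition below are assumptions, and the function names are hypothetical.

```c
#include <stddef.h>

/* One replica of the mul + sum pipeline: multiply element-wise and
   accumulate over a slice of the two streams. */
long mul_sum_replica(const int *s1, const int *s2, size_t len)
{
    long acc = 0;
    for (size_t i = 0; i < len; i++) {
        acc += (long)s1[i] * s2[i];
    }
    return acc;
}

/* Two replicas, each given one contiguous half of the streams.
   ASSUMPTION: contiguous halves and a final add of the partial sums; the
   actual distribution scheme is not given in the slides. In hardware the
   two replicas would run concurrently; here they run one after the other. */
long mul_sum_replicated(const int *s1, const int *s2, size_t len)
{
    size_t half = len / 2;
    long part1 = mul_sum_replica(s1, s2, half);                      /* replica 1/2 */
    long part2 = mul_sum_replica(s1 + half, s2 + half, len - half);  /* replica 2/2 */
    return part1 + part2;
}
```

Because addition is associative, the replicated version returns the same value as a single replica over the whole stream, which is what makes this kind of split safe for a sum reduction.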
23
Improved Results
24
Related Work (I)
  • Compiling C streaming applications to FPGAs
    (Northwestern University and University of
    Pittsburgh)
  • Similar to our approach, but uses C as a source
    language
  • A Stream Compiler, ASC (Imperial College
    London)
  • C library with support for architectural
    optimizations
  • Can also target FPGAs
  • Streaming Accelerators (Motorola)
  • Application is defined through streaming dataflow
    graphs (sDFGs) and stream descriptors
  • Merrimac supercomputer and Brook streaming
    language (Stanford)
  • A multiprocessor consisting of streaming
    processors
  • RAW architecture and StreamIt streaming language
    (MIT)
  • Filters connected into predefined topologies

25
Related Work (II)
  • Brook optimizing compiler for shared memory
    multiprocessor (Intel)
  • Brook compiler for general-purpose CPU (Stanford)
  • General Purpose Computation on Graphics
    Processing Units (GPGPU)
  • GPU Brook (Stanford)
  • Brook modified for GPUs

26
Concluding Remarks
  • Streaming makes it easier to design hardware
  • Streaming is suitable for FPGA implementation
  • Hardware can be custom generated for a given
    application and throughput requirement
  • Good speedup can be achieved
  • Future work
  • Automate replication
  • Implement larger benchmarks