1
Towards Compilation of Streaming Programs into FPGA Hardware

Franjo Plavec, Zvonko Vranesic, Stephen Brown
University of Toronto

Forum on Specification and Design Languages, September 25, 2008
2
Outline

Motivation
Brook Streaming Language
Converting Brook programs to FPGA logic
Results
Improving Performance
Related Work
Concluding Remarks

3
Motivation

Increasing complexity and cost of manufacturing ASICs
Field Programmable Gate Arrays (FPGAs) are an attractive alternative
FPGAs are becoming huge
Designing for them is challenging!
Need high-level design-entry methods to make FPGAs easier to use
Convert software into custom hardware

4
Streaming Paradigm

Streaming (a.k.a. stream computing)
The term is used for many different systems
Recently popularized by languages such as StreamIt and Brook
Expose data parallelism
Explicit communication

5
Brook Streaming Language

Data is organized into streams
Stream is a collection of independent elements
Elements of a stream can be operated on in parallel
Computation is expressed through kernels
Kernel is a function operating on streams
Implicitly applied to every element of the input stream

[Figure: kernels K1 to K4 connected by temporary streams S1 to S4, with input stream SIN and output stream SOUT]
6
Kernels, Streams and Parallelism

Streaming model exposes
Data parallelism: a kernel can operate on multiple stream elements in parallel
Task parallelism: multiple kernels can operate in parallel in a pipelined fashion

[Figure: kernels K1 to K4 connected by temporary streams S1 to S4, with input stream SIN and output stream SOUT]
7
Simple Brook Program

kernel void mul (int op1<>, int op2<>, out int result<>) {   // <> denotes a stream
    result = op1 * op2;               // Gets applied to all stream elements in parallel
}
void main () {
    int a[3] = {10, 20, 30};
    int b[3] = {100, 200, 300};
    int c[3]; int i;
    int strm_a<3>, strm_b<3>, strm_c<3>;
    streamRead (strm_a, a);           // Copy array into stream (gather)
    streamRead (strm_b, b);           // Copy array into stream (gather)
    mul (strm_a, strm_b, strm_c);     // Perform computation
    streamWrite (strm_c, c);          // Copy stream into array (scatter)
}

[Figure: Read strm_a and Read strm_b feed mul, which feeds Write strm_c]
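
For comparison, here is a plain C sketch of what the simple Brook program computes. The name mul_elementwise is hypothetical and not part of Brook; the explicit loop stands in for the implicit per-element application of the kernel, and ordinary arrays stand in for the streams.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C stand-in for the Brook kernel mul: the explicit loop
   replaces Brook's implicit application to every stream element. */
void mul_elementwise(const int *op1, const int *op2, int *result, size_t n)
{
    assert(op1 != NULL && op2 != NULL && result != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Iterations are independent of each other, which is the
           data parallelism the streaming model exposes. */
        result[i] = op1[i] * op2[i];
    }
}
```

Called with the arrays a and b from the slide and n = 3, this fills the result array with the element-wise products.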
8
Autocorrelation in Brook

kernel void mul (int a<>, int b<>, out int c<>) {
    c = a * b;
}
reduce void sum (int a<>, reduce int r<>) {
    r = r + a;
}
void main () {
    for (instance = 0; instance < LAGS; instance++) {
        createStream1 (array_ptr, stream1, instance);
        createStream2 (array_ptr + instance, stream2, instance);
        mul (stream1, stream2, mul_result_stream);
        sum (mul_result_stream, reduce_result_stream);
        write_stream (reduce_result_stream, output_array + instance);
    }
}
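
The computation in main() can be sketched in plain C as follows. This is a reference model only, under assumptions: the function name autocorrelate and the window length n are hypothetical (the slide does not state the stream length), and the caller is assumed to supply at least n + lags - 1 samples.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the autocorrelation loop: for each lag, multiply the signal
   by a shifted copy of itself (mul) and add up the products (sum).
   x must hold at least n + lags - 1 samples (assumption of this sketch). */
void autocorrelate(const int *x, size_t n, size_t lags, long *out)
{
    assert(x != NULL && out != NULL);
    for (size_t lag = 0; lag < lags; lag++) {
        long acc = 0;                        /* plays the role of reduce_result_stream */
        for (size_t i = 0; i < n; i++) {
            acc += (long)x[i] * x[i + lag];  /* mul followed by sum */
        }
        out[lag] = acc;                      /* plays the role of write_stream */
    }
}
```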

9
From Brook to FPGA Logic

The goal is to produce HDL that implements the application
Leverage existing tools
Altera's C2H compiler can convert simple C code into HDL
Existing Brook compiler can target GPUs
We modified it to convert kernel code into C, suitable for C2H compilation

10
Converting Brook to C

Target an SOPC (System on a Programmable Chip)
Kernel code mapped into hardware
Soft processor (Nios II) executes non-kernel code
Kernels implicitly operate on all input stream elements
Done by a loop in the C code
Streams are passed between kernels
Amount of on-chip memory in FPGAs is relatively low
Use FIFOs to pass streams
References to streams converted into pointers
Pointers are mapped to FIFOs using C2H pragma statements

11
Conversion Example: mul

kernel void mul (int a<>, int b<>, out int c<>) {
    c = a * b;
}
void mul () {
    volatile int *a, *b, *c;
    int _iter;
    for (_iter = 0; _iter < IN_LENGTH; _iter++) {
        *c = (*a) * (*b);
    }
}

FIFOs are implemented in hardware, so pointers do not have to be modified in software
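
To illustrate why the generated loop never advances its pointers, the following C sketch models each FIFO in software. Everything here is hypothetical (the fifo_t type, FIFO_DEPTH and the function names are not part of C2H): each dereference of a, b or c in the generated code corresponds to one pop or push on a FIFO.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 8  /* hypothetical depth, chosen for illustration */

/* Minimal software model of a hardware FIFO (ring buffer). */
typedef struct {
    int data[FIFO_DEPTH];
    size_t head, tail, count;
} fifo_t;

void fifo_push(fifo_t *f, int v)
{
    assert(f->count < FIFO_DEPTH);  /* real hardware would stall instead */
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
}

int fifo_pop(fifo_t *f)
{
    assert(f->count > 0);           /* real hardware would stall instead */
    int v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Software analogue of the generated mul(): one element is taken from
   each input FIFO and one product is pushed per iteration. */
void mul_kernel(fifo_t *a, fifo_t *b, fifo_t *c, size_t in_length)
{
    for (size_t iter = 0; iter < in_length; iter++) {
        int lhs = fifo_pop(a);
        int rhs = fifo_pop(b);
        fifo_push(c, lhs * rhs);
    }
}
```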
12
Conversion Example: sum

reduce void sum (int a<>, reduce int r<>) {
    r = r + a;
}
Sum is a reduction kernel
Input stream (a) has IN_LENGTH elements
Output stream (r) has one element
All elements of the input stream are combined (i.e. added) to produce the output

13
Conversion Example: sum

reduce void sum (int a<>, reduce int r<>) {
    r = r + a;
}
void sum() {
    volatile int *a, *r;
    int _temp_r, _iter;
    for (_iter = 0; _iter < IN_LENGTH; _iter++) {
        if (_iter == 0)
            _temp_r = *a;
        else
            _temp_r = _temp_r + (*a);
    }
    *r = _temp_r;
}

14
General Case: sum

reduce void sum (int a<>, reduce int r<>) {
    r = r + a;
}
Sum is a reduction kernel
Assume input stream (a) has IN_LENGTH elements
Assume output stream (r) has REDUCE_LENGTH elements
IN_LENGTH/REDUCE_LENGTH consecutive elements of the input stream are combined (i.e. added) to produce 1 element of the output stream
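
The grouping rule can be made concrete with a small reference function in C. This is a sketch of the semantics only, not the generated hardware code; the name sum_reduce is hypothetical, and it assumes the input length is an exact multiple of the output length.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of the general sum reduction: every
   in_length / reduce_length consecutive inputs add up to one output. */
void sum_reduce(const int *in, size_t in_length, int *out, size_t reduce_length)
{
    assert(reduce_length > 0 && in_length % reduce_length == 0);
    size_t group = in_length / reduce_length;  /* inputs per output element */
    for (size_t g = 0; g < reduce_length; g++) {
        int acc = 0;
        for (size_t k = 0; k < group; k++) {
            acc += in[g * group + k];
        }
        out[g] = acc;
    }
}
```

For example, with six inputs 1 to 6 and two outputs, each output combines three consecutive inputs, giving 6 and 15.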

15
Conversion for General Case: sum

reduce void sum (int a<>, reduce int r<>) {
    r = r + a;
}
void sum() {
    volatile int *a, *r;
    int _temp_r, _iter, _mod_iter = 0;
    for (_iter = 0; _iter < IN_LENGTH; _iter++) {
        if ((_mod_iter == 0) && (_iter != 0))
            *r = _temp_r;
        if (_mod_iter == 0)
            _temp_r = *a;
        else
            _temp_r = _temp_r + (*a);
        if (_mod_iter == (IN_LENGTH/REDUCE_LENGTH-1))
            _mod_iter = 0;
        else
            _mod_iter = _mod_iter + 1;
    }
}

Modulo operations are inefficiently implemented in hardware
_mod_iter is a compiler-generated variable that implements the modulo operation as a counter
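
A software sketch of the counter idea, using arrays in place of FIFOs. The function name sum_counter is hypothetical. Two things differ from the slide and are assumptions of this sketch: it writes the last group's result after the loop, and it contains an assert (which does use %) purely to check that the counter tracks the modulo value.

```c
#include <assert.h>
#include <stddef.h>

/* Streaming form of the grouped sum: mod_iter counts 0 .. group-1 and
   wraps, so the datapath needs only a compare and an increment.
   Returns the number of outputs written. */
size_t sum_counter(const int *in, size_t in_length, size_t group, int *out)
{
    assert(group > 0 && in_length > 0);
    size_t written = 0, mod_iter = 0;
    int temp_r = 0;
    for (size_t iter = 0; iter < in_length; iter++) {
        if (mod_iter == 0 && iter != 0)
            out[written++] = temp_r;      /* previous group is complete */
        if (mod_iter == 0)
            temp_r = in[iter];            /* start a new group */
        else
            temp_r = temp_r + in[iter];
        assert(mod_iter == iter % group); /* check only: counter equals the modulo */
        mod_iter = (mod_iter == group - 1) ? 0 : mod_iter + 1;
    }
    out[written++] = temp_r;              /* flush the last group (added in this sketch) */
    return written;
}
```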

16
Autocorrelation in Brook

kernel void mul (int a<>, int b<>, out int c<>) {
    c = a * b;
}
reduce void sum (int a<>, reduce int r<>) {
    r = r + a;
}
void main () {
    for (instance = 0; instance < LAGS; instance++) {
        createStream1 (array_ptr, stream1, instance);
        createStream2 (array_ptr + instance, stream2, instance);
        mul (stream1, stream2, mul_result_stream);
        sum (mul_result_stream, reduce_result_stream);
        write_stream (reduce_result_stream, output_array + instance);
    }
}

17
Autocorrelation Example
[Figure: SOPC system with System Memory and the Nios II CPU; the create1, create2, mul, sum and write kernels are connected by FIFOs carrying stream1, stream2, mul_result_stream and reduce_result_stream]
18
Conversion Example: C2H Pragmas
kernel void mul (int a<>, int b<>, out int c<>) {
    c = a * b;
}
[Figure: mul reads from the stream1 and stream2 FIFOs and writes to the mul_result_stream FIFO]

Generated C2H pragmas
#pragma altera_accelerate connect_variable mul/a to stream1/out
#pragma altera_accelerate connect_variable mul/b to stream2/out
#pragma altera_accelerate connect_variable mul/c to mul_result_stream/in

19
Experimental Evaluation

Using Altera's Quartus II CAD software targeting the Cyclone II FPGA device
Compare the performance of our approach to two alternative implementations on the same FPGA
Software running on the Nios II soft-core processor
Software compiled to hardware using C2H (without first expressing the application in Brook)
Two applications
8-tap FIR filter, process 100,000 samples
Autocorrelation, process 100,000 samples over 8 different shift distances
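
For readers unfamiliar with the first benchmark, a direct-form 8-tap FIR filter can be sketched in C as below. The slides do not show the filter code or its coefficients, so the function name fir8, the coefficient array h and the zero-padding at the start of the input are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 8

/* Direct-form FIR: y[i] = sum over k of h[k] * x[i - k], with samples
   before the start of the input treated as zero (an assumption). */
void fir8(const int *x, size_t n, const int *h, int *y)
{
    assert(x != NULL && h != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t k = 0; k < TAPS && k <= i; k++) {
            acc += h[k] * x[i - k];
        }
        y[i] = acc;
    }
}
```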

20
Initial Results

Performance benefits from task parallelism
Kernels operate in parallel in a pipelined fashion

21
Replication Example
[Figure: SOPC system with System Memory and the Nios II CPU; the create1, create2, mul, sum and write kernels are connected by FIFOs carrying stream1, stream2, mul_result_stream and reduce_result_stream]
22
Exploiting Data Parallelism Through Replication
[Figure: SOPC system with System Memory and the Nios II CPU; the mul and sum kernels are replicated (mul 1/2, mul 2/2, sum 1/2, sum 2/2), each replica with its own FIFOs for stream1, stream2, mul_result and reduce_result, alongside create1, create2 and Write]
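
The idea behind replication can be sketched in C for one mul and sum pipeline. The slides do not say how elements are distributed between the replicas or how their results are merged, so the contiguous split and the final addition below are assumptions, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One replica of the mul -> sum pipeline working on its share of the streams. */
static long mul_sum(const int *s1, const int *s2, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)s1[i] * s2[i];
    return acc;
}

/* Two replicas, each given half of the elements. The two calls are
   independent of each other, which is what lets replicated hardware
   run them concurrently; the partial sums are combined at the end. */
long mul_sum_replicated(const int *s1, const int *s2, size_t n)
{
    assert(n % 2 == 0);  /* sketch assumes an even split */
    size_t half = n / 2;
    long part1 = mul_sum(s1, s2, half);               /* replica 1/2 */
    long part2 = mul_sum(s1 + half, s2 + half, half); /* replica 2/2 */
    return part1 + part2;
}
```

Because addition is associative, the replicated version returns the same value as a single pipeline over the whole stream.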
23
Improved Results
24
Related Work (I)

Compiling C streaming applications to FPGAs (Northwestern University and University of Pittsburgh)
Similar to our approach, but uses C as a source language
A Streaming Compiler ASC (Imperial College London)
C library with support for architectural optimizations
Can also target FPGAs
Streaming Accelerators (Motorola)
Application is defined through streaming dataflow graphs (sDFGs) and stream descriptors
Merrimac supercomputer and Brook streaming language (Stanford)
A multiprocessor consisting of streaming processors
RAW architecture and StreamIt streaming language (MIT)
Filters connected into predefined topologies

25
Related Work (II)

Brook optimizing compiler for shared memory multiprocessor (Intel)
Brook compiler for general-purpose CPU (Stanford)
General Purpose Computation on Graphics Processing Units (GPGPU)
GPU Brook (Stanford)
Brook modified for GPUs

26
Concluding Remarks

Streaming makes it easier to design hardware
Streaming is suitable for FPGA implementation
Hardware can be custom generated for a given application and throughput requirement
Good speedup can be achieved
Future work
Automate replication
Implement larger benchmarks
