Title: Towards Compilation of Streaming Programs into FPGA Hardware
1. Towards Compilation of Streaming Programs into FPGA Hardware
- Franjo Plavec, Zvonko Vranesic, Stephen Brown
- University of Toronto
- Forum on Specification and Design Languages, September 25, 2008
2. Outline
- Motivation
- Brook Streaming Language
- Converting Brook programs to FPGA logic
- Results
- Improving Performance
- Related Work
- Concluding Remarks
3. Motivation
- Increasing complexity and cost of manufacturing ASICs
- Field Programmable Gate Arrays (FPGAs) are an attractive alternative
- FPGAs are becoming huge
- Designing for them is challenging!
- Need high-level design-entry methods to make FPGAs easier to use
- Convert software into custom hardware
4. Streaming Paradigm
- Streaming (a.k.a. Stream Computing)
- The term is used for many different systems
- Recently popularized by languages such as StreamIt and Brook
- Expose data parallelism
- Explicit communication
5. Brook Streaming Language
- Data is organized into streams
- Stream is a collection of independent elements
- Elements of a stream can be operated on in parallel
- Computation is expressed through kernels
- Kernel is a function operating on streams
- Implicitly applied to every element of the input stream
[Figure: dataflow graph of kernels K1 to K4 connected by temporary streams S1 to S4, with input stream SIN and output stream SOUT]
6. Kernels, Streams and Parallelism
- Streaming model exposes:
- Data parallelism: a kernel can operate on multiple stream elements in parallel
- Task parallelism: multiple kernels can operate in parallel in a pipelined fashion
[Figure: dataflow graph of kernels K1 to K4 connected by temporary streams S1 to S4, with input stream SIN and output stream SOUT]
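The implicit per-element application of a kernel can be pictured in plain C as a loop whose iterations are independent. This is only an illustrative sketch: `mul_element` and `apply_mul` are hypothetical names, and arrays stand in for streams.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element kernel body: it reads only its own inputs,
 * so no state is shared between stream elements. */
static int mul_element(int op1, int op2)
{
    return op1 * op2;
}

/* Hypothetical driver that applies the kernel body to every element.
 * Iterations do not depend on each other, which is the property that
 * lets hardware process several elements at once. */
void apply_mul(const int *op1, const int *op2, int *result, size_t n)
{
    assert(op1 != NULL && op2 != NULL && result != NULL);
    for (size_t i = 0; i < n; i++)
        result[i] = mul_element(op1[i], op2[i]);
}
```

In Brook the loop is never written by the programmer; the sketch only makes the implied iteration visible.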
7. Simple Brook Program
    kernel void mul (int op1<>, int op2<>, out int result<>) {   // <> denotes a stream
        result = op1 * op2;              // Gets applied to all stream elements in parallel
    }

    void main () {
        int a[3] = {10, 20, 30};
        int b[3] = {100, 200, 300};
        int c[3]; int i;
        int strm_a<3>, strm_b<3>, strm_c<3>;
        streamRead (strm_a, a);          // Copy array into stream (gather)
        streamRead (strm_b, b);          // Copy array into stream (gather)

        mul (strm_a, strm_b, strm_c);    // Perform Computation

        streamWrite (strm_c, c);         // Copy stream into array (scatter)
    }
[Figure: Read strm_a and Read strm_b feed the mul kernel, which feeds Write strm_c]
8. Autocorrelation in Brook
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }

    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }

    void main () {
        for (instance = 0; instance < LAGS; instance++) {
            createStream1 (array_ptr, stream1, instance);
            createStream2 (array_ptr + instance, stream2, instance);
            mul (stream1, stream2, mul_result_stream);
            sum (mul_result_stream, reduce_result_stream);
            write_stream (reduce_result_stream, output_array + instance);
        }
    }
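For comparison, the computation that the Brook program expresses with `mul` and `sum` can be written as a conventional nested loop in C. This is a reference sketch, not the authors' code: the function name `autocorrelate` is hypothetical, and the assumption that each lag uses only the overlapping `n - lag` samples is mine, since the slides do not say how stream lengths are chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Reference autocorrelation: out[lag] = sum over i of x[i] * x[i + lag].
 * Assumption: each lag uses the n - lag overlapping samples. */
void autocorrelate(const int *x, size_t n, long *out, size_t lags)
{
    assert(x != NULL && out != NULL && lags <= n);
    for (size_t lag = 0; lag < lags; lag++) {
        long acc = 0;                        /* plays the role of the sum reduction */
        for (size_t i = 0; i + lag < n; i++)
            acc += (long)x[i] * x[i + lag];  /* plays the role of the mul kernel */
        out[lag] = acc;
    }
}
```

The outer loop corresponds to the `instance` loop in `main`; the inner loop is what the two kernels perform implicitly over their streams.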
9. From Brook to FPGA Logic
- The goal is to produce HDL that implements the application
- Leverage existing tools
- Altera's C2H compiler can convert simple C code into HDL
- Existing Brook compiler can target GPUs
- We modified it to convert kernel code into C, suitable for C2H compilation
10. Converting Brook to C
- Target an SOPC (System on a Programmable Chip)
- Kernel code mapped into hardware
- Soft processor (Nios II) executes non-kernel code
- Kernels implicitly operate on all input stream elements
- Done by a loop in the C code
- Streams are passed between kernels
- Amount of on-chip memory in FPGAs is relatively low
- Use FIFOs to pass streams
- References to streams converted into pointers
- Pointers are mapped to FIFOs using C2H pragma statements
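The role a FIFO plays between two kernels can be modelled in software with a small ring buffer. The sketch below is a behavioural illustration only; `fifo_t`, `fifo_push`, `fifo_pop` and the depth of 4 are hypothetical and say nothing about how the hardware FIFOs in the actual system are built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 4   /* hypothetical depth, chosen only for illustration */

typedef struct {
    int    data[FIFO_DEPTH];
    size_t head;    /* index of the oldest element */
    size_t count;   /* number of elements currently stored */
} fifo_t;

void fifo_init(fifo_t *f)
{
    f->head = 0;
    f->count = 0;
}

/* Returns false when the FIFO is full (a producer kernel would stall). */
bool fifo_push(fifo_t *f, int value)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return false;
    f->data[(f->head + f->count) % FIFO_DEPTH] = value;
    f->count++;
    return true;
}

/* Returns false when the FIFO is empty (a consumer kernel would stall). */
bool fifo_pop(fifo_t *f, int *value)
{
    assert(f != NULL && value != NULL);
    if (f->count == 0)
        return false;
    *value = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

Because only a few elements are buffered at a time, a stream never has to be stored in full, which matters when on-chip memory is scarce.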
11. Conversion Example: mul
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }

    void mul () {
        volatile int *a, *b, *c;
        int _iter;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            *c = (*a) * (*b);
        }
    }
FIFOs are implemented in hardware, so pointers do not have to be modified in software
12. Conversion Example: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
- Sum is a reduction kernel
- Input stream (a) has IN_LENGTH elements
- Output stream (r) has one element
- All elements of the input stream are combined (i.e. added) to produce the output
13. Conversion Example: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }

    void sum() {
        volatile int *a, *r;
        int _temp_r, _iter;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            if (_iter == 0)
                _temp_r = *a;
            else
                _temp_r = _temp_r + (*a);
        }
        *r = _temp_r;
    }
14. General Case: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }
- Sum is a reduction kernel
- Assume input stream (a) has IN_LENGTH elements
- Assume output stream (r) has REDUCE_LENGTH elements
- IN_LENGTH/REDUCE_LENGTH consecutive elements of the input stream are combined (i.e. added) to produce 1 element of the output stream
15. Conversion for General Case: sum
    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }

    void sum() {
        volatile int *a, *r;
        int _temp_r, _iter, _mod_iter = 0;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            if ((_mod_iter == 0) && (_iter != 0))
                *r = _temp_r;               // emit the result of the previous group
            if (_mod_iter == 0)
                _temp_r = *a;
            else
                _temp_r = _temp_r + (*a);
            if (_mod_iter == (IN_LENGTH/REDUCE_LENGTH-1))
                _mod_iter = 0;
            else
                _mod_iter = _mod_iter + 1;
        }
        *r = _temp_r;                       // emit the result of the last group
    }
- Modulo operations are implemented inefficiently in hardware
- _mod_iter is a compiler-generated variable that implements the modulo operation as a counter
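The counter-based reduction can be checked on an ordinary host by replacing the FIFO pointers with arrays. The following is a sketch under that assumption; `reduce_sum` and its parameters are hypothetical names, and it emits each result as soon as a group completes rather than at the start of the next group.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of a general sum reduction. Arrays stand in for the
 * hardware FIFOs; group_len plays the role of IN_LENGTH/REDUCE_LENGTH. */
void reduce_sum(const int *in, size_t in_length, int *out, size_t reduce_length)
{
    assert(reduce_length > 0 && in_length % reduce_length == 0);
    size_t group_len = in_length / reduce_length;
    size_t mod_iter = 0;   /* counter that replaces iter % group_len */
    size_t out_idx = 0;
    int temp = 0;

    for (size_t iter = 0; iter < in_length; iter++) {
        /* The first element of a group restarts the running sum. */
        temp = (mod_iter == 0) ? in[iter] : temp + in[iter];
        if (mod_iter == group_len - 1) {
            out[out_idx++] = temp;   /* group complete: emit one output */
            mod_iter = 0;
        } else {
            mod_iter++;
        }
    }
}
```

With six inputs and `reduce_length` of 2, each output is the sum of three consecutive inputs; with `reduce_length` of 1 the function degenerates to the single-output case.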
16. Autocorrelation in Brook
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }

    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }

    void main () {
        for (instance = 0; instance < LAGS; instance++) {
            createStream1 (array_ptr, stream1, instance);
            createStream2 (array_ptr + instance, stream2, instance);
            mul (stream1, stream2, mul_result_stream);
            sum (mul_result_stream, reduce_result_stream);
            write_stream (reduce_result_stream, output_array + instance);
        }
    }
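17. Autocorrelation Example
[Figure: system block diagram with System Memory, the Nios II CPU, and the hardware kernels create1, create2, mul, sum and write. The streams stream1, stream2, mul_result_stream and reduce_result_stream between the kernels are each implemented as a FIFO.]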
18. Conversion Example: C2H Pragmas
    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }
[Figure: the stream1 and stream2 FIFOs feed mul, whose output goes to the mul_result_stream FIFO]
- Generated C2H pragmas
    #pragma altera_accelerate connect_variable mul/a to stream1/out
    #pragma altera_accelerate connect_variable mul/b to stream2/out
    #pragma altera_accelerate connect_variable mul/c to mul_result_stream/in
19. Experimental Evaluation
- Using Altera's Quartus II CAD software targeting the Cyclone II FPGA device
- Compare the performance of our approach to two alternative implementations on the same FPGA
- Software running on the Nios II soft-core processor
- Software compiled to hardware using C2H (without first expressing the application in Brook)
- Two applications
- 8-tap FIR filter, process 100,000 samples
- Autocorrelation, process 100,000 samples over 8 different shift distances
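The slides do not show the FIR benchmark code or its coefficients. As a rough idea of what an 8-tap FIR filter computes, here is a textbook direct-form sketch in C; `fir8`, `NUM_TAPS`, and the choice to treat samples before the start of the input as zero are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 8

/* Direct-form FIR: y[i] = sum over k of h[k] * x[i - k], for k = 0..7.
 * Assumption: samples before the start of the input count as zero. */
void fir8(const int *x, size_t n, const int h[NUM_TAPS], int *y)
{
    assert(x != NULL && h != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t k = 0; k < NUM_TAPS && k <= i; k++)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```

Feeding a unit impulse through the filter returns the coefficients themselves, which is a convenient sanity check.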
20. Initial Results
- Performance benefits from task parallelism
- Kernels operate in parallel in a pipelined fashion
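21. Replication Example
[Figure: the unreplicated system, with System Memory, the Nios II CPU, and the hardware kernels create1, create2, mul, sum and write. The streams stream1, stream2, mul_result_stream and reduce_result_stream between the kernels are each implemented as a FIFO.]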
22. Exploiting Data Parallelism Through Replication
[Figure: the replicated system, with System Memory, the Nios II CPU, create1, create2, and two copies of the mul and sum kernels (mul 1/2, mul 2/2, sum 1/2, sum 2/2). Each stream is split into two FIFOs (stream1 1/2 and 2/2, stream2 1/2 and 2/2, mul_result 1/2 and 2/2, reduce_result 1/2 and 2/2), and a single Write kernel receives both reduce results.]
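The idea behind replication can be illustrated in software by splitting one multiply-and-sum over two independent workers and combining their partial results. This is a sketch only: the function names are hypothetical, and splitting the streams into two contiguous halves is an assumption, since the slides do not state how elements are distributed among the replicas.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one replica: multiply-accumulate over the slice [begin, end). */
static long mul_sum_slice(const int *s1, const int *s2, size_t begin, size_t end)
{
    long acc = 0;
    for (size_t i = begin; i < end; i++)
        acc += (long)s1[i] * s2[i];
    return acc;
}

/* Hypothetical 2-way replication. The two slices share no data, so each
 * could run on its own mul/sum pipeline; adding the partial sums stands
 * in for the stage that collects both reduce results. */
long mul_sum_replicated(const int *s1, const int *s2, size_t n)
{
    assert(s1 != NULL && s2 != NULL);
    size_t mid = n / 2;
    long part1 = mul_sum_slice(s1, s2, 0, mid);   /* replica 1/2 */
    long part2 = mul_sum_slice(s1, s2, mid, n);   /* replica 2/2 */
    return part1 + part2;
}
```

The split is valid because addition is associative, so partial sums can be formed in any grouping before being combined.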
23. Improved Results
24. Related Work (I)
- Compiling C streaming applications to FPGAs (Northwestern University and University of Pittsburgh)
- Similar to our approach, but uses C as a source language
- A Streaming Compiler, ASC (Imperial College London)
- C++ library with support for architectural optimizations
- Can also target FPGAs
- Streaming Accelerators (Motorola)
- Application is defined through streaming dataflow graphs (sDFGs) and stream descriptors
- Merrimac supercomputer and Brook streaming language (Stanford)
- A multiprocessor consisting of streaming processors
- RAW architecture and StreamIt streaming language (MIT)
- Filters connected into predefined topologies
25. Related Work (II)
- Brook optimizing compiler for shared memory multiprocessor (Intel)
- Brook compiler for general-purpose CPU (Stanford)
- General Purpose Computation on Graphics Processing Units (GPGPU)
- GPU Brook (Stanford)
- Brook modified for GPUs
26. Concluding Remarks
- Streaming makes it easier to design hardware
- Streaming is suitable for FPGA implementation
- Hardware can be custom generated for a given application and throughput requirement
- Good speedup can be achieved
- Future work
- Automate replication
- Implement larger benchmarks