Title: Jo
1A Data-Driven Approach for Pipelining Sequences
of Data-Dependent Loops
Portugal
ITIV, University of Karlsruhe, July 2, 2007
2Motivation
- Many applications have sequences of tasks
- E.g., in image and video processing algorithms
- Contemporary FPGAs
- Plenty of room to accommodate highly specialized, complex architectures
- Time to creatively use available resources rather than simply save them
3Motivation
- Computing Stages
- Sequentially
[Figure: Task A, Task B, and Task C executed one after another along the time axis]
4Motivation
- Computing Stages
- Concurrently
[Figure: Task A, Task B, and Task C overlapping along the time axis]
5Outline
- Objective
- Loop Pipelining
- Producer/Consumer Computing Stages
- Pipelining Sequences of Loops
- Inter-Stage Communication
- Experimental Setup and Results
- Related Work
- Conclusions
- Future Work
6Objectives
- To speed up applications with multiple data-dependent stages
- Each stage is seen as a set of nested loops
- How?
- Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
- Taking advantage of field-custom computing structures (FPGAs)
7Loop Pipelining
- Attempt to overlap loop iterations
- Significant speedups are achieved
- But how to pipeline sequences of loops?
[Figure: two series of loop iterations I1, I2, I3, I4, ... plotted against time]
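As a rough illustration of why overlapping iterations pays off, the sketch below compares the cycle count of a loop run one iteration at a time with the same loop pipelined at a fixed initiation interval. The function names, the 4-cycle iteration latency, and the initiation interval of 1 are assumptions chosen for illustration, not figures from this talk.

```python
def sequential_cycles(iterations: int, latency: int) -> int:
    """Cycles when each iteration starts only after the previous one ends."""
    return iterations * latency


def pipelined_cycles(iterations: int, latency: int, ii: int) -> int:
    """Cycles when a new iteration starts every `ii` cycles (initiation interval)."""
    if iterations == 0:
        return 0
    # The last iteration starts after (iterations - 1) intervals and then
    # needs its full latency to complete.
    return (iterations - 1) * ii + latency


seq = sequential_cycles(100, 4)
pipe = pipelined_cycles(100, 4, 1)
print(seq, pipe)
```

This covers a single loop only; it says nothing yet about overlapping two different loops, which is the question the slide raises.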
8Computing Stages
Producer ...A2A1A0
Consumer A0A1A2...
9Computing Stages
- Concurrently
- Ordered producer/consumer pairs
- Send/receive
Producer ...A2A1A0
Consumer A0A1A2...
FIFO with N stages
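A minimal software sketch of the ordered send/receive scheme follows. The class name `BoundedFifo` and the convention of returning `False` or `None` to signal a stall are assumptions made for illustration; a hardware FIFO would use handshake signals instead.

```python
from collections import deque


class BoundedFifo:
    """FIFO with a fixed number of stages between a producer and a consumer."""

    def __init__(self, stages: int):
        self.stages = stages
        self.items = deque()

    def send(self, value) -> bool:
        """Producer side: returns False (stall) when the FIFO is full."""
        if len(self.items) == self.stages:
            return False
        self.items.append(value)
        return True

    def receive(self):
        """Consumer side: returns None (stall) when the FIFO is empty."""
        if not self.items:
            return None
        return self.items.popleft()


fifo = BoundedFifo(stages=2)
sent = [fifo.send(v) for v in ("A0", "A1", "A2")]  # the third send stalls
received = [fifo.receive(), fifo.receive(), fifo.receive()]
print(sent, received)
```

A FIFO only works when the consumer needs the values in exactly the order the producer emits them.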
10Computing Stages
- Concurrently
- Unordered producer/consumer pairs
- Empty/Full table
Empty/full | Data
0 |
1 | A1
0 |
0 |
0 |
1 | A5
0 |
0 |
Producer ...A3A5A1
Consumer A3A1A5...
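The empty/full table for unordered pairs can be sketched as below, replaying the order shown on the slide (producer emits A1, A5, A3; consumer wants A3, A1, A5). The class name `EmptyFullTable` and the `(ok, value)` return convention are assumptions for illustration.

```python
class EmptyFullTable:
    """Data memory plus one empty/full bit per cell."""

    def __init__(self, size: int):
        self.full = [False] * size
        self.data = [None] * size

    def write(self, index: int, value) -> None:
        """Producer side: store the value and flag the cell as full."""
        self.data[index] = value
        self.full[index] = True

    def read(self, index: int):
        """Consumer side: returns (ok, value); ok is False while the cell is empty."""
        if not self.full[index]:
            return False, None
        return True, self.data[index]


table = EmptyFullTable(8)
table.write(1, "A1")
table.write(5, "A5")
first_try = table.read(3)   # A3 has not been produced yet, so the consumer waits
table.write(3, "A3")
consumed = [table.read(k)[1] for k in (3, 1, 5)]
print(first_try, consumed)
```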
11Main Idea
[Figure: data input, Loops 1 and 2, intermediate data array (indices 0 to 63, eight per row), Loop 3, and data output under a global FSM; a timeline shows the execution of Loops 1, 2 and the execution of Loop 3]
12Main Idea
- FDCT
- Out-of-order producer/consumer pairs
- How to overlap computing stages?
[Figure: two arrays with element indices 0 to 63, eight per row]
13Main Idea
[Figure: data input, Loops 1 and 2, intermediate data in a dual-port RAM with a dual-port 1-bit empty/full table, Loop 3, and data output, with FSM 1 and FSM 2; a timeline shows the execution of Loops 1, 2 and the execution of Loop 3]
14Main Idea
[Figure: Task A, Task B, and three memories]
15Possible Scenarios
- Single write, single read
- Accepted without code changes
- Single write, multiple reads
- Accepted without code changes (by using an N-bit table)
- Multiple writes, single read
- Need code transformations
- Multiple writes, multiple reads
- Need code transformations
16Inter-Stage Communication
- Responsible for
- Communicating data between pipelined stages
- Flagging data availability
- Solutions
- Perfect associative memory
- Cost too high
- Memory for data plus 1-bit table (each cell represents full/empty information)
- Size of the data set to communicate
- Decrease size using hash-based solution
Empty/full | Data
0 |
1 | A1
0 |
0 |
0 |
1 | A5
0 |
0 |
17Inter-Stage Communication
boolean tab[SIZE] = {0, 0, ..., 0};
for (i = 0; i < num_fdcts; i++) {      // Loop 1
  for (j = 0; j < N; j++) {            // Loop 2
    // loads
    // computations
    // stores
    tmp[48+i_1] = F6 >> 13; tab[48+i_1] = true;
    tmp[56+i_1] = F7 >> 13; tab[56+i_1] = true;
    i_1++;
  }
  i_1 += 56;
}
i_1 = 0;
for (i = 0; i < N*num_fdcts; i++) {    // Loop 3
  L1: f0 = tmp[i_1];   if (!tab[i_1])   goto L1;
  L2: f1 = tmp[1+i_1]; if (!tab[1+i_1]) goto L2;
  // remaining loads
  // computations
  // stores
  i_1 += 8;
}
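The effect of the spin-wait on `tab` in Loop 3 can be mimicked with a small cycle-stepped model. This is only a sketch: the function name `overlapped_cycles`, the one-write and one-read-per-cycle timing, and the one-cycle write-to-read delay are assumptions, not properties of the generated hardware.

```python
def overlapped_cycles(write_order, read_order):
    """Cycles until the consumer finishes when both stages start together.

    Assumptions: one write and one read attempt per cycle, a value written in
    one cycle is readable from the next cycle, and every index that is read
    is eventually written (otherwise this loop would not terminate).
    """
    full = set()          # indices whose empty/full bit is set
    w = r = cycle = 0
    while r < len(read_order):
        cycle += 1
        # Consumer: advance only if the needed cell is flagged full,
        # otherwise retry in the next cycle (the "goto L1" spin).
        if read_order[r] in full:
            r += 1
        # Producer: store one value and set its flag.
        if w < len(write_order):
            full.add(write_order[w])
            w += 1
    return cycle


N = 8
row_major = [row * N + col for row in range(N) for col in range(N)]
col_major = [row * N + col for col in range(N) for row in range(N)]

back_to_back = len(row_major) + len(row_major)        # consumer waits for the producer to finish
in_order = overlapped_cycles(row_major, row_major)    # same order on both sides
transposed = overlapped_cycles(col_major, row_major)  # column-wise writes, row-wise reads
print(back_to_back, in_order, transposed)
```

Under these assumptions the transposed order still finishes earlier than back-to-back execution but later than the in-order case, because the consumer cannot complete the first row until the last column has started to arrive.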
18Inter-Stage Communication
boolean tab[SIZE] = {0, 0, ..., 0};
for (i = 0; i < num_fdcts; i++) {      // Loop 1
  for (j = 0; j < N; j++) {            // Loop 2
    // loads
    // computations
    // stores
    tmp[H(48+i_1)] = F6 >> 13; tab[H(48+i_1)] = true;
    tmp[H(56+i_1)] = F7 >> 13; tab[H(56+i_1)] = true;
    i_1++;
  }
  i_1 += 56;
}
i_1 = 0;
for (i = 0; i < N*num_fdcts; i++) {    // Loop 3
  L1: f0 = tmp[H(i_1)];   if (!tab[H(i_1)])   goto L1;
  L2: f1 = tmp[H(1+i_1)]; if (!tab[H(1+i_1)]) goto L2;
  // remaining loads
  // computations
  // stores
  i_1 += 8;
}
19Inter-Stage Communication
- Hash-based solution
- We did not want to include additional delays in the load/store operations
- Use H(k) = k MOD m
- When m is a power of two (m = 2^N),
- H(k) can be implemented by just using the log2(m) least significant bits of k to address the cache (translates to simple interconnections)
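The equivalence between k MOD m and selecting low-order address bits can be checked with a short sketch. The function names are hypothetical, and the bit mask stands in for what would simply be wiring in hardware.

```python
def hash_mod(k: int, m: int) -> int:
    """Reference definition: H(k) = k MOD m."""
    return k % m


def hash_low_bits(k: int, m: int) -> int:
    """Same result as k % m when m is a power of two: keep the low log2(m) bits."""
    assert m > 0 and m & (m - 1) == 0, "m must be a power of two"
    return k & (m - 1)


m = 16
same = all(hash_mod(k, m) == hash_low_bits(k, m) for k in range(1024))
print(same, hash_low_bits(57, m))
```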
20Inter-Stage Communication
- Hash-based solution H(k) = k MOD m
- Single read
- (L = 1)
- R = 1
- ? 0
- a) write
- b) read
- c) empty/full update
21Inter-Stage Communication
- Hash-based solution H(k) = k MOD m
- Multiple reads (L > 1)
- R = 11...1 (L bits)
- ? >> R
- a) write
- b) read
- c) empty/full update
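The slides do not spell out the update protocol, so the following is only one plausible reading: a write loads the cell's flag field with R (L ones), each read shifts the field right by one, and the cell counts as empty once the field reaches zero. The class name `HashedBuffer`, the stall-by-return-value convention, and the rule that a write stalls on a non-empty slot are all assumptions.

```python
class HashedBuffer:
    """Hashed data buffer with an L-bit empty/full field per slot (assumed protocol)."""

    def __init__(self, m: int, reads_per_value: int):
        self.m = m
        self.R = (1 << reads_per_value) - 1   # 11...1 with L ones
        self.flags = [0] * m                  # 0 means the slot is empty
        self.data = [None] * m

    def write(self, k: int, value) -> bool:
        """a) write: returns False (stall) if the slot still holds unread data."""
        slot = k % self.m
        if self.flags[slot] != 0:
            return False
        self.data[slot] = value
        self.flags[slot] = self.R
        return True

    def read(self, k: int):
        """b) read and c) empty/full update: returns None (stall) if the slot is empty.

        Assumes m is large enough that no two live indices share a slot, so the
        value found in the slot is the one that index k refers to.
        """
        slot = k % self.m
        if self.flags[slot] == 0:
            return None
        self.flags[slot] >>= 1   # one pending read fewer
        return self.data[slot]


buf = HashedBuffer(m=4, reads_per_value=2)
first_write = buf.write(5, "A5")
blocked = buf.write(9, "A9")           # 9 MOD 4 == 5 MOD 4, slot still in use
reads = [buf.read(5), buf.read(5), buf.read(5)]
second_write = buf.write(9, "A9")      # slot is free after two reads
print(first_write, blocked, reads, second_write)
```

With `reads_per_value=1` the same sketch reduces to the single-read case, where the field is a single bit.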
22Buffer size calculation
- By monitoring the behavior of the communication component
- For each read and write, determine the size of the buffer needed to avoid collisions
- Done during RTL simulation
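The idea of sizing the buffer from observed reads and writes can be sketched as an offline replay of an access trace. The talk does this by monitoring during RTL simulation; the trace format `("w", k)` / `("r", k)`, the single-read assumption, and the restriction to power-of-two sizes are assumptions made here for illustration.

```python
def has_collision(trace, m: int) -> bool:
    """Replay a trace of ("w", k) / ("r", k) events against a buffer of m slots.

    Assumes single-read data: a read frees the slot. A collision is a write
    into a slot that still holds an unread value.
    """
    live = {}  # slot -> index currently stored there
    for op, k in trace:
        slot = k % m
        if op == "w":
            if slot in live:
                return True
            live[slot] = k
        else:
            if live.get(slot) != k:
                raise ValueError("trace reads a value that is not in the buffer")
            del live[slot]
    return False


def min_buffer_size(trace, max_m: int = 1 << 20) -> int:
    """Smallest power-of-two size that replays the trace without collisions."""
    m = 1
    while m <= max_m:
        if not has_collision(trace, m):
            return m
        m *= 2
    raise ValueError("no collision-free size up to max_m")


trace = [("w", 0), ("w", 1), ("w", 2), ("r", 0),
         ("w", 3), ("r", 1), ("r", 2), ("r", 3)]
size = min_buffer_size(trace)
print(size)
```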
23Experimental Setup
- Compilation flow
- Uses our previous work on compiling algorithms in
a Java subset to FPGAs
24Experimental Setup
[Figure: compilation flow. Boxes: Library of Operators (Java); fsm.xml, datapath.xml, rtg.xml; XSLTs (to dotty, to hds, to vhdl, to java); datapath.hds, fsm.java, rtg.java; fsm.class, rtg.class; HADES; ANT build file; I/O data (RAMs and stimulus)]
25Experimental Results
Algorithm | Stages | Loops | Description
fdct | 2 (s1, s2) | 3 | Fast DCT (Discrete Cosine Transform)
fwt2D | 4 (s1, s2, s3, s4) | 8 | Forward Haar Wavelet
RGB2gray histogram | 2 (s1, s2) | 2 | Transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image
Smooth sobel, 3 versions (a), (b), (c) | 2 (s1, s2) | 6 | Smooth image operation based on 3×3 windows, with the resulting image being the input to the sobel edge detector. (a) Original code. (b) Two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients). (c) The same as (b), plus elimination of redundant array references in the original code of sobel.
26Experimental Results
- FDCT (speed-up achieved by Pipelining Sequences of Loops)
27Experimental Results
Algorithm | Input data size | Stages | cc w/o PSL | Speed-up upper bound | cc w/ PSL | Speed-up
fdct | 800×600 | (s1,s2) / (s1) / (s2) | 3,930,005 / 1,950,003 / 1,920,003 | 2.02 | 1,830,215 | 2.02
fwt2D | 512×512 | (s1,s2,s3,s4) / (s1,s2) / (s3,s4) | 4,724,745 / 2,362,373 / 2,362,373 | 2.00 | 3,664,917 | 1.29
RGB2gray histogram | 800×600 | (s1,s2) / (s1) / (s2) | 6,720,025 / 2,880,015 / 3,840,015 | 1.75 | 3,840,007 | 1.75
Smooth sobel (a) | 800×600 | (s1,s2) / (s1) / (s2) | 49,634,009 / 32,929,473 / 16,606,951 | 1.51 | 32,929,489 | 1.51
Smooth sobel (b) | 800×600 | (s1,s2) / (s1) / (s2) | 30,068,645 / 13,364,109 / 16,606,951 | 1.81 | 16,640,509 | 1.81
Smooth sobel (c) | 800×600 | (s1,s2) / (s1) / (s2) | 25,773,809 / 13,364,109 / 11,862,791 | 1.92 | 13,364,117 | 1.92
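The two speed-up columns appear to follow from the cycle counts: the upper bound looks like the non-pipelined cycle count divided by that of the slowest stage group, and the achieved speed-up like the ratio of cycles without and with PSL. The sketch below applies that reading to the fwt2D row; the function names and the interpretation of the upper bound are assumptions inferred from the numbers, not definitions given in the talk.

```python
def speedup_upper_bound(total_cycles: int, stage_cycles) -> float:
    """Assumed bound: overlapped execution cannot beat the slowest stage group."""
    return total_cycles / max(stage_cycles)


def speedup(total_cycles: int, pipelined_cycles: int) -> float:
    """Ratio of clock cycles without PSL to clock cycles with PSL."""
    return total_cycles / pipelined_cycles


# fwt2D row of the table
bound = speedup_upper_bound(4_724_745, [2_362_373, 2_362_373])
achieved = speedup(4_724_745, 3_664_917)
print(f"{bound:.2f} {achieved:.2f}")
```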
28Experimental Results
- What happens to the buffer sizes?
29Experimental Results
- Adjust latency of tasks in order to balance pipeline stages
- Slowdown of tasks with higher latency
- Optimization of slower tasks in order to reduce their latency
- Slowdown of producer tasks usually reduces the size of the inter-stage buffers
30Experimental Results
[Figure: chart with labels "original", "optimizations in the consumer", "optimizations in the producer", "1 cycle per iteration of the producer", and "2 cycles per iteration of the producer"]
31Experimental Results
32Experimental Results
- Resources and Frequency (Spartan-3 400)
33Related Work
- Previous approach (Ziegler et al.)
- Coarse-grained communication and synchronization scheme
- FIFOs are used to communicate data between pipelining stages
- Width of FIFO stages dependent on producer/consumer ordering
- Less applicable
34Conclusions
- We presented a scheme to accelerate applications by pipelining sequences of loops
- I.e., before the end of a stage (set of nested loops), a subsequent stage (set of nested loops) can start executing based on data already produced
- A data-driven scheme is used, based on empty/full tables
- A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
- Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved
- As if the stages were executed concurrently and independently
35Future Work
- Research other hash functions
- Study slowdown effects
- Apply the technique in the context of Multi-Core
Systems
[Figure: Processor Core A, Processor Core B, and two memories]
36Acknowledgments
- Work partially funded by
- CHIADO: Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems
- Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
- Based on the work done by Rui Rodrigues
- In collaboration with Pedro C. Diniz
37A Data-Driven Approach for Pipelining Sequences
of Data-Dependent Loops
38Buffer Monitor
39Buffer Monitor
40Buffer Monitor
41Buffer Monitor
42Buffer Monitor
43Buffer Monitor