1
A Data-Driven Approach for Pipelining Sequences
of Data-Dependent Loops

João M. P. Cardoso

Portugal
ITIV, University of Karlsruhe, July 2, 2007
2
Motivation

Many applications consist of sequences of tasks
E.g., in image and video processing algorithms
Contemporary FPGAs
Plenty of room to accommodate highly specialized
complex architectures
Time to creatively use available resources rather
than simply save them

3
Motivation

Computing Stages
Sequentially

[Figure: Task A, Task B, and Task C executed one after another along the time axis]
4
Motivation

Computing Stages
Concurrently

[Figure: Task A, Task B, and Task C overlapping along the time axis]
5
Outline

Objective
Loop Pipelining
Producer/Consumer Computing Stages
Pipelining Sequences of Loops
Inter-Stage Communication
Experimental Setup and Results
Related Work
Conclusions
Future Work

6
Objectives

To speed up applications with multiple
data-dependent stages
each stage seen as a set of nested loops
How?
Pipelining those sequences of data-dependent
stages using fine-grain synchronization schemes
Taking advantage of field-custom computing
structures (FPGAs)

7
Loop Pipelining

Attempt to overlap loop iterations

Significant speedups are achieved
But how to pipeline sequences of loops?

[Figure: iterations I1, I2, I3, I4, ... of two loops laid out along the time axis]
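As a rough illustration of why overlapping iterations pays off, the sketch below compares cycle counts for a loop executed sequentially and with loop pipelining. The parameter names `latency` (cycles per iteration) and `ii` (initiation interval, the cycles between the starts of consecutive iterations) are assumptions introduced here, not values from the slides.

```python
def sequential_cycles(iterations: int, latency: int) -> int:
    """Each iteration starts only after the previous one has finished."""
    return iterations * latency


def pipelined_cycles(iterations: int, latency: int, ii: int) -> int:
    """A new iteration starts every `ii` cycles; the last one still needs `latency` cycles."""
    if iterations == 0:
        return 0
    return (iterations - 1) * ii + latency


def speedup(iterations: int, latency: int, ii: int) -> float:
    """Ratio of sequential to pipelined cycle counts."""
    return sequential_cycles(iterations, latency) / pipelined_cycles(iterations, latency, ii)
```

For long loops the ratio approaches `latency / ii`, which is why the remaining question on the slide is how to get a similar overlap across loops rather than within one.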
8
Computing Stages

Sequentially

Producer ...A2A1A0
Consumer A0A1A2...
9
Computing Stages

Concurrently
Ordered producer/consumer pairs
Send/receive

Producer ...A2A1A0
Consumer A0A1A2...
FIFO with N stages
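The ordered send/receive case can be sketched as a small cycle-by-cycle simulation. This is only a toy model: the function name, the one-item-per-cycle producer, and the `consumer_period` parameter are assumptions made for illustration.

```python
from collections import deque


def run_fifo_pipeline(items, fifo_stages, consumer_period=1):
    """Simulate an ordered producer/consumer pair linked by a bounded FIFO.

    Returns (received items, total cycles, cycles the producer stalled).
    """
    fifo = deque()
    pending = deque(items)
    received = []
    cycle = 0
    stalls = 0
    while len(received) < len(items):
        # Consumer side: receive one item every `consumer_period` cycles.
        if fifo and cycle % consumer_period == 0:
            received.append(fifo.popleft())
        # Producer side: a send blocks while the FIFO is full.
        if pending:
            if len(fifo) < fifo_stages:
                fifo.append(pending.popleft())
            else:
                stalls += 1
        cycle += 1
    return received, cycle, stalls
```

Because a FIFO only preserves order, this scheme works when the consumer needs the data in the order the producer generates it; a slow consumer simply back-pressures the producer.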
10
Computing Stages

Concurrently
Unordered producer/consumer pairs
Empty/Full table

Index  Empty/full  Data
0      0
1      1           A1
2      0
3      0
4      0
5      1           A5
6      0
7      0
Producer ...A3A5A1
Consumer A3A1A5...
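A minimal sketch of the empty/full table for the unordered case follows. It assumes one producer write per cycle and a consumer that re-checks the flag of the element it needs until the flag is set; the function and variable names are hypothetical.

```python
def run_empty_full(size, produce_order, consume_order):
    """Producer writes one element per cycle; consumer waits on the empty/full bit."""
    full = [0] * size          # 1-bit empty/full table
    data = [None] * size       # data memory
    to_write = list(produce_order)
    consumed = []
    stalls = 0
    next_read = 0
    while next_read < len(consume_order):
        # Consumer checks the flag of the element it needs next.
        idx = consume_order[next_read]
        if full[idx]:
            consumed.append(data[idx])
            next_read += 1
        else:
            stalls += 1
            if not to_write:
                raise RuntimeError(f"element {idx} is never produced")
        # Producer stores the data and marks the cell as full.
        if to_write:
            k = to_write.pop(0)
            data[k] = f"A{k}"
            full[k] = 1
    return consumed, stalls, full
```

With the producer order A1, A5, A3 and the consumer order A3, A1, A5 from the slide, the consumer stalls until A3 arrives and then reads the other two elements without waiting, which a FIFO could not provide.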
11
Main Idea

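FDCT

[Figure: Loops 1 and 2 read the input data and write the intermediate data array (an 8×8 block, elements 0 to 63); Loop 3 reads the intermediate data and writes the output data. A global FSM controls the loops. On the time axis, the execution of Loops 1 and 2 is followed by the execution of Loop 3.]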
12
Main Idea

FDCT
Out-of-order producer/consumer pairs
How to overlap computing stages?

[Figure: two views of the 8×8 intermediate data array (elements 0 to 63)]
13
Main Idea

Pipelined FDCT

[Figure: data input, Loops 1 and 2, intermediate data array (an 8×8 block, elements 0 to 63) held in a dual-port RAM, Loop 3, data output. Two controllers, FSM 1 and FSM 2, are shown, together with a dual-port 1-bit empty/full table beside the intermediate data. On the time axis, the execution of Loop 3 overlaps the execution of Loops 1 and 2.]
14
Main Idea
[Figure: block diagram with three memories and Tasks A and B]
15
Possible Scenarios

Single write, single read
Accepted without code changes
Single write, multiple reads
Accepted without code changes (by using an N-bit
table)
Multiple writes, single read
Need code transformations
Multiple writes, multiple reads
Need code transformations

16
Inter-Stage Communication

Responsible for
Communicating data between pipelined stages
Flagging data availability
Solutions
Perfect associative memory
Cost too high
Memory for data plus 1-bit table (each cell
holds full/empty information)
Size equal to that of the data set to communicate
Decrease the size using a hash-based solution

Index  Empty/full  Data
0      0
1      1           A1
2      0
3      0
4      0
5      1           A5
6      0
7      0
17
Inter-Stage Communication

Memory plus 1-bit table

  boolean tab[SIZE] = {0, 0, ..., 0};
  for (i = 0; i < num_fdcts; i++) {        // Loop 1
    for (j = 0; j < N; j++) {              // Loop 2
      // loads
      // computations
      // stores
      tmp[48 + i_1] = F6 >> 13; tab[48 + i_1] = true;
      tmp[56 + i_1] = F7 >> 13; tab[56 + i_1] = true;
      i_1++;
    }
    i_1 += 56;
  }
  i_1 = 0;
  for (i = 0; i < N * num_fdcts; i++) {    // Loop 3
    L1: f0 = tmp[i_1];     if (!tab[i_1])     goto L1;
    L2: f1 = tmp[1 + i_1]; if (!tab[1 + i_1]) goto L2;
    // remaining loads
    // computations
    // stores
    i_1 += 8;
  }
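To see how much overlap the access pattern above allows, here is a toy simulation of the flags only. It assumes N = 8, one cycle per loop iteration for both stages, and a producer iteration that writes one column of an 8×8 block while a consumer iteration reads one row; these timing assumptions and all names are illustrative, not measurements from the talk.

```python
def fdct_sequential_cycles(num_blocks, n=8):
    """Loops 1 and 2 run to completion, then Loop 3 runs."""
    return 2 * n * num_blocks


def fdct_pipelined_cycles(num_blocks, n=8):
    """Loop 3 starts a row as soon as every flag of that row is set."""
    tab = [False] * (n * n * num_blocks)   # 1-bit empty/full table
    total = n * num_blocks                 # iterations of each stage
    prod_iter = 0                          # columns written so far
    cons_iter = 0                          # rows read so far
    cycle = 0
    while cons_iter < total:
        # Consumer (Loop 3): needs tmp[i_1 .. i_1 + n - 1], with i_1 = n * cons_iter.
        base = n * cons_iter
        if all(tab[base + c] for c in range(n)):
            cons_iter += 1
        # Producer (Loops 1 and 2): writes tmp[i_1 + n * r] for every row r.
        if prod_iter < total:
            block, j = divmod(prod_iter, n)
            for r in range(n):
                tab[block * n * n + n * r + j] = True
            prod_iter += 1
        cycle += 1
    return cycle
```

Under these assumptions a single block gains nothing, because its first row is complete only after the last column has been written, but with many blocks Loop 3 works on one block while Loops 1 and 2 produce the next, and the cycle count approaches half of the sequential one.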

18
Inter-Stage Communication

Hash-based solution

  boolean tab[SIZE] = {0, 0, ..., 0};
  for (i = 0; i < num_fdcts; i++) {        // Loop 1
    for (j = 0; j < N; j++) {              // Loop 2
      // loads
      // computations
      // stores
      tmp[H(48 + i_1)] = F6 >> 13; tab[H(48 + i_1)] = true;
      tmp[H(56 + i_1)] = F7 >> 13; tab[H(56 + i_1)] = true;
      i_1++;
    }
    i_1 += 56;
  }
  i_1 = 0;
  for (i = 0; i < N * num_fdcts; i++) {    // Loop 3
    L1: f0 = tmp[H(i_1)];     if (!tab[H(i_1)])     goto L1;
    L2: f1 = tmp[H(1 + i_1)]; if (!tab[H(1 + i_1)]) goto L2;
    // remaining loads
    // computations
    // stores
    i_1 += 8;
  }
19
Inter-Stage Communication

Hash-based solution
We did not want to add delay to
the load/store operations
Use H(k) = k MOD m
When m is a power of two (m = 2^N),
H(k) can be implemented by just using the
log2(m) least significant bits of k to address the
cache (translates to simple interconnections)
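The equivalence between the modulo and the bit selection can be checked with a few lines. The helper name `make_hash` is hypothetical; the only fact used is that, for a power-of-two m, k MOD m keeps the log2(m) least significant bits of k.

```python
def make_hash(m: int):
    """Return H(k) = k mod m for a table size m that is a power of two."""
    if m <= 0 or m & (m - 1) != 0:
        raise ValueError("m must be a power of two")
    mask = m - 1              # selects the log2(m) least significant bits
    return lambda k: k & mask
```

In hardware no operator is needed at all: the low-order address wires are connected to the buffer and the remaining ones are left unconnected.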
20
Inter-Stage Communication

Hash-based solution: H(k) = k MOD m

Single read
(L = 1)
R = 1
? 0
a) write
b) read
c) empty/full update

21
Inter-Stage Communication

Hash-based solution: H(k) = k MOD m

Multiple reads (L > 1)
R = 11...1 (L bits)
? >> R
a) write
b) read
c) empty/full update
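One possible reading of the write, read, and empty/full update steps is sketched below. It is an interpretation, not the authors' circuit: it assumes that a write loads the entry with L ones, that every read shifts the entry right by one bit, and that an entry equal to zero means "empty". The class and method names are hypothetical.

```python
class MultiReadTable:
    """Hashed buffer whose entries must be read L times before they are free again."""

    def __init__(self, size, reads_per_element):
        self.size = size                   # m, the number of buffer entries
        self.reads = reads_per_element     # L
        self.flags = [0] * size            # L-bit empty/full entries
        self.data = [None] * size

    def _slot(self, k):
        return k % self.size               # H(k) = k MOD m

    def write(self, k, value):
        """Store a value; returns False if the slot still holds unread data."""
        slot = self._slot(k)
        if self.flags[slot] != 0:
            return False
        self.data[slot] = value
        self.flags[slot] = (1 << self.reads) - 1   # R = 11...1 (L bits)
        return True

    def read(self, k):
        """Return the value, or None if it is not available yet."""
        slot = self._slot(k)
        if self.flags[slot] == 0:
            return None
        self.flags[slot] >>= 1             # one fewer pending read
        return self.data[slot]
```

With L = 1 this reduces to the single-read case. The sketch does not detect a read that hits a slot holding a different element with the same hash value, so the buffer has to be large enough for such collisions never to occur.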

22
Buffer size calculation

By monitoring behavior
of communication component
For each read and write
determine the size of the buffer needed to avoid
collisions
Done during RTL simulation
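A software analogue of this monitoring step might look as follows. It is a sketch under assumptions: the trace format (a list of `("w", k)` and `("r", k)` events recorded in simulation order, each element read once, each read after its write) and the restriction to power-of-two sizes are choices made here for illustration.

```python
def collision_free(trace, m):
    """Check that H(k) = k mod m never maps a write onto a slot still in use."""
    occupied = {}                      # slot -> index currently stored there
    for op, k in trace:
        slot = k % m
        if op == "w":
            if slot in occupied:
                return False           # would overwrite unread data
            occupied[slot] = k
        elif op == "r":
            if occupied.get(slot) != k:
                return False           # slot is empty or holds another element
            del occupied[slot]
    return True


def min_buffer_size(trace, max_size):
    """Smallest power-of-two buffer size without collisions, or None."""
    m = 1
    while m <= max_size:
        if collision_free(trace, m):
            return m
        m *= 2
    return None
```

For a trace in which the consumer follows closely behind the producer, for example `[("w", 0), ("w", 1), ("r", 0), ("w", 2), ("r", 1), ("r", 2)]`, a buffer of two entries is enough even though three elements are communicated.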

23
Experimental Setup

Compilation flow
Uses our previous work on compiling algorithms in
a Java subset to FPGAs

24
Experimental Setup

Simulation back-end

[Figure: XSLT stylesheets translate datapath.xml, fsm.xml, and rtg.xml to dotty, hds, VHDL, and Java. The resulting datapath.hds, fsm.java, and rtg.java (compiled to fsm.class and rtg.class) are simulated in HADES together with a library of operators written in Java and the I/O data (RAMs and stimulus). An ANT build file drives the flow.]
25
Experimental Results

Benchmarks

Algorithm | Stages | Loops | Description
fdct | 2 (s1, s2) | 3 | Fast DCT (Discrete Cosine Transform)
fwt2D | 4 (s1, s2, s3, s4) | 8 | Forward Haar wavelet
RGB2gray + histogram | 2 (s1, s2) | 2 | Transforms an RGB image into a gray image with 256 levels and computes the histogram of the gray image
Smooth + sobel, 3 versions (a), (b), (c) | 2 (s1, s2) | 6 | Smooth image operation based on 3×3 windows, whose resulting image is the input to the sobel edge detector. (a) original code; (b) two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c) the same as (b) plus elimination of redundant array references in the original code of sobel.
26
Experimental Results

FDCT (speed-up achieved by pipelining sequences
of loops)

27
Experimental Results
Algorithm | Input data size | Stages | cc w/o PSL | Speed-up upper bound | cc w/ PSL | Speed-up
fdct | 800×600 | (s1,s2) | 3,930,005 | 2.02 | 1,830,215 | 2.02
 | | (s1) | 1,950,003 | | |
 | | (s2) | 1,920,003 | | |
fwt2D | 512×512 | (s1,s2,s3,s4) | 4,724,745 | 2.00 | 3,664,917 | 1.29
 | | (s1,s2) | 2,362,373 | | |
 | | (s3,s4) | 2,362,373 | | |
RGB2gray + histogram | 800×600 | (s1,s2) | 6,720,025 | 1.75 | 3,840,007 | 1.75
 | | (s1) | 2,880,015 | | |
 | | (s2) | 3,840,015 | | |
Smooth + sobel (a) | 800×600 | (s1,s2) | 49,634,009 | 1.51 | 32,929,489 | 1.51
 | | (s1) | 32,929,473 | | |
 | | (s2) | 16,606,951 | | |
Smooth + sobel (b) | 800×600 | (s1,s2) | 30,068,645 | 1.81 | 16,640,509 | 1.81
 | | (s1) | 13,364,109 | | |
 | | (s2) | 16,606,951 | | |
Smooth + sobel (c) | 800×600 | (s1,s2) | 25,773,809 | 1.92 | 13,364,117 | 1.92
 | | (s1) | 13,364,109 | | |
 | | (s2) | 11,862,791 | | |
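The speed-up columns appear to follow from the cycle counts (cc) in a simple way: the upper bound is consistent with dividing the cycles without PSL (pipelining sequences of loops) by the cycles of the slowest stage, and the achieved speed-up with dividing by the cycles with PSL. The sketch below reproduces two rows under that reading; the function names and dictionary keys are hypothetical, and the numbers are copied from the table.

```python
def upper_bound(total_cc, stage_cc):
    """Best case: the pipeline is limited only by its slowest stage."""
    return total_cc / max(stage_cc)


def achieved(total_cc, psl_cc):
    """Speed-up measured with pipelining sequences of loops (PSL)."""
    return total_cc / psl_cc


# (cc without PSL, cc of each stage alone, cc with PSL), from the results table
rows = {
    "fwt2D": (4_724_745, [2_362_373, 2_362_373], 3_664_917),
    "rgb2gray_histogram": (6_720_025, [2_880_015, 3_840_015], 3_840_007),
}

summary = {
    name: (round(upper_bound(total, stages), 2), round(achieved(total, psl), 2))
    for name, (total, stages, psl) in rows.items()
}
```

Here `summary` holds (2.0, 1.29) for fwt2D and (1.75, 1.75) for RGB2gray plus histogram, matching the table: the second benchmark reaches its bound, while the first loses part of the possible overlap.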
28
Experimental Results

What happens to the buffer sizes?

29
Experimental Results

Adjust the latency of tasks in order to balance
pipeline stages
Slow down tasks with higher latency
Optimize slower tasks in order to reduce
their latency
Slowing down producer tasks usually reduces the
size of the inter-stage buffers
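The effect of the producer rate on buffer size can be illustrated with a toy model. It assumes in-order consumption, a fixed number of cycles per iteration for each task, and data that becomes readable one cycle after it is written; none of these figures come from the experiments, and the function name is hypothetical.

```python
def max_buffer_occupancy(n_items, producer_period, consumer_period):
    """Peak number of elements written but not yet read."""
    write_time = [k * producer_period for k in range(n_items)]
    read_time = []
    ready = 0                                  # cycle at which the consumer is free
    for k in range(n_items):
        t = max(ready, write_time[k] + 1)      # wait for the data if necessary
        read_time.append(t)
        ready = t + consumer_period
    # Sweep all events in time order; at equal times reads (-1) sort before writes (+1).
    events = sorted([(t, 1) for t in write_time] + [(t, -1) for t in read_time])
    occupancy = peak = 0
    for _, delta in events:
        occupancy += delta
        peak = max(peak, occupancy)
    return peak
```

With 100 elements and a consumer that needs 2 cycles per iteration, a producer at 1 cycle per iteration builds up a peak of 50 buffered elements, whereas a producer slowed to 2 cycles per iteration never has more than 1 element waiting, at no cost in total time because the consumer was already the bottleneck.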

30
Experimental Results

Buffer sizes

[Figure: buffer-size plots. Labels: original; optimizations in the consumer; optimizations in the producer; 1 cycle per iteration of the producer; 2 cycles per iteration of the producer]
31
Experimental Results

Buffer sizes

32
Experimental Results

Resources and Frequency (Spartan-3 400)

33
Related Work

Previous approach (Ziegler et al.)
Coarse-grained communication and synchronization
scheme
FIFOs are used to communicate data between
pipelining stages
Width of FIFO stages dependent on
producer/consumer ordering
Less applicable

34
Conclusions

We presented a scheme to accelerate applications by
pipelining sequences of loops
I.e., before the end of a stage (set of nested
loops), a subsequent stage (set of nested loops)
can start executing based on data already
produced
A data-driven scheme based on empty/full
tables is used
A scheme to reduce the size of the memory buffers
for inter-stage pipelining (using a simple hash
function)
Depending on the consumer/producer ordering,
speedups close to the theoretical ones are achieved,
as if the stages were executed concurrently and
independently

35
Future Work

Research other hash functions
Study slowdown effects
Apply the technique in the context of Multi-Core
Systems

[Figure: Processor Core A, Processor Core B, and two memories]
36
Acknowledgments

Work partially funded by
CHIADO - Compilation of High-Level
Computationally Intensive Algorithms to
Dynamically Reconfigurable COmputing Systems
Portuguese Foundation for Science and Technology
(FCT), POSI and FEDER, POSI/CHS/48018/2002
Based on the work done by Rui Rodrigues
In collaboration with Pedro C. Diniz

37
A Data-Driven Approach for Pipelining Sequences
of Data-Dependent Loops
38
Buffer Monitor
