1
A Data-Driven Approach for Pipelining Sequences
of Data-Dependent Loops
  • João M. P. Cardoso

Portugal
ITIV, University of Karlsruhe, July 2, 2007
2
Motivation
  • Many applications have sequences of tasks
  • E.g., in image and video processing algorithms
  • Contemporary FPGAs
  • Plenty of room to accommodate highly specialized
    complex architectures
  • Time to creatively use available resources rather
    than simply save them

3
Motivation
  • Computing Stages
  • Sequentially

Task A
Task B
Task C
TIME
4
Motivation
  • Computing Stages
  • Concurrently

Task A
Task B
Task C
TIME
5
Outline
  • Objective
  • Loop Pipelining
  • Producer/Consumer Computing Stages
  • Pipelining Sequences of Loops
  • Inter-Stage Communication
  • Experimental Setup and Results
  • Related Work
  • Conclusions
  • Future Work

6
Objectives
  • To speed up applications with multiple
    data-dependent stages
  • each stage seen as a set of nested loops
  • How?
  • Pipelining those sequences of data-dependent
    stages using fine-grain synchronization schemes
  • Taking advantage of field-custom computing
    structures (FPGAs)

7
Loop Pipelining
  • Attempt to overlap loop iterations
  • Significant speedups are achieved
  • But how to pipeline sequences of loops?

[Figure: iterations I1, I2, I3, I4, ... of two loops laid out along a time axis]
8
Computing Stages
  • Sequentially

Producer ...A2A1A0
Consumer A0A1A2...
9
Computing Stages
  • Concurrently
  • Ordered producer/consumer pairs
  • Send/receive

Producer ...A2A1A0
Consumer A0A1A2...
FIFO with N stages
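
The ordered send/receive case can be illustrated with a small software model. This is only a sketch under assumptions not stated on the slide: one send and one receive attempt per cycle, and a hypothetical helper name `run_fifo_pipeline`.

```python
from collections import deque


def run_fifo_pipeline(items, depth):
    """Stream items from a producer to a consumer through a FIFO with `depth` stages.

    Assumed timing model: in each cycle the consumer first tries to receive,
    then the producer tries to send.
    """
    fifo = deque()
    to_send = list(items)
    sent = 0
    received = []
    cycles = 0
    while len(received) < len(to_send):
        cycles += 1
        # Consumer side: receive the oldest element, if there is one.
        if fifo:
            received.append(fifo.popleft())
        # Producer side: send the next element unless the FIFO is full.
        if sent < len(to_send) and len(fifo) < depth:
            fifo.append(to_send[sent])
            sent += 1
    return received, cycles


received, cycles = run_fifo_pipeline(["A0", "A1", "A2"], depth=2)
print(received, cycles)
```

Because a FIFO preserves order, this only works when the consumer needs the elements in exactly the order the producer emits them.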
10
Computing Stages
  • Concurrently
  • Unordered producer/consumer pairs
  • Empty/Full table

Empty/full  Data
0
1           A1
0
0
0
1           A5
0
0
Producer ...A3A5A1
Consumer A3A1A5...
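
A minimal sketch of the empty/full table idea, assuming (as the figure suggests) that the producer emits A1, A5, A3 while the consumer needs A3, A1, A5. The class name `EmptyFullBuffer` and its methods are hypothetical, chosen only for this illustration.

```python
class EmptyFullBuffer:
    """Data memory plus a 1-bit empty/full flag per cell."""

    def __init__(self, size):
        self.data = [None] * size
        self.full = [False] * size

    def write(self, index, value):
        # Producer: store the value, then flag the cell as full.
        self.data[index] = value
        self.full[index] = True

    def try_read(self, index):
        # Consumer: succeed only if the cell is flagged full; reading empties it.
        if not self.full[index]:
            return None
        self.full[index] = False
        return self.data[index]


buf = EmptyFullBuffer(8)
produce_order = [1, 5, 3]    # producer emits A1, A5, A3
pending = [3, 1, 5]          # consumer needs A3, A1, A5
consumed, stalls = [], 0
for k in produce_order:
    buf.write(k, f"A{k}")
    # After each write the consumer polls its next index and drains what it can.
    while pending:
        value = buf.try_read(pending[0])
        if value is None:
            stalls += 1
            break
        consumed.append(value)
        pending.pop(0)
print(consumed, stalls)
```

The consumer stalls while A3 is missing and then drains A3, A1, A5 in its own order, which a plain FIFO could not provide.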
11
Main Idea
  • FDCT

[Figure: FDCT. Blocks shown: data input, Loop 1 and Loop 2, intermediate data array (8×8, columns 0 to 7, row offsets 0, 8, ..., 56), Loop 3, data output, and a global FSM. Time axis: execution of Loops 1, 2; execution of Loop 3.]
12
Main Idea
  • FDCT
  • Out-of-order producer/consumer pairs
  • How to overlap computing stages?

[Figure: two 8×8 views of the intermediate data array (columns 0 to 7, row offsets 0, 8, ..., 56)]
13
Main Idea
  • Pipelined FDCT

[Figure: pipelined FDCT. Blocks shown: data input, Loop 1 and Loop 2, intermediate data array (8×8) held in a dual-port RAM, dual-port 1-bit empty/full table, Loop 3, data output, FSM 1 and FSM 2. Time axis: execution of Loops 1, 2; execution of Loop 3.]
14
Main Idea
[Figure: Task A, Task B, and three memory blocks]
15
Possible Scenarios
  • Single write, single read
  • Accepted without code changes
  • Single write, multiple reads
  • Accepted without code changes (by using an N-bit
    table)
  • Multiple writes, single read
  • Need code transformations
  • Multiple writes, multiple reads
  • Need code transformations
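
One way to read the four scenarios is per array element: how many times the producer writes it and how many times the consumer reads it. The sketch below classifies a pair of access traces under that assumed reading; `classify_scenario` is a hypothetical helper, not part of the presented tool flow.

```python
from collections import Counter


def classify_scenario(write_indices, read_indices):
    """Classify producer/consumer traces into one of the four scenarios.

    Returns the scenario name and whether code transformations are needed
    (according to the slide, only the multiple-write cases need them).
    """
    writes = Counter(write_indices)
    reads = Counter(read_indices)
    multiple_writes = any(n > 1 for n in writes.values())
    multiple_reads = any(n > 1 for n in reads.values())
    w = "multiple writes" if multiple_writes else "single write"
    r = "multiple reads" if multiple_reads else "single read"
    return f"{w}, {r}", multiple_writes


print(classify_scenario([0, 1, 2], [2, 0, 1]))
print(classify_scenario([0, 1], [0, 0, 1]))
print(classify_scenario([0, 0, 1], [0, 1]))
```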

16
Inter-Stage Communication
  • Responsible for
  • Communicating data between pipelined stages
  • Flagging data availability
  • Solutions
  • Perfect associative memory
  • Cost too high
  • Memory for data plus 1-bit table (each cell
    represents full/empty information)
  • Size of the data set to communicate
  • Decrease size using hash-based solution

Empty/full  Data
0
1           A1
0
0
0
1           A5
0
0
17
Inter-Stage Communication
  • Memory plus 1-bit table

boolean tab[SIZE] = {0, 0, ..., 0};

for (i = 0; i < num_fdcts; i++) {        // Loop 1
  for (j = 0; j < N; j++) {              // Loop 2
    // loads
    // computations
    // stores
    tmp[48 + i_1] = F6 >> 13; tab[48 + i_1] = true;
    tmp[56 + i_1] = F7 >> 13; tab[56 + i_1] = true;
    i_1++;
  }
  i_1 += 56;
}

i_1 = 0;
for (i = 0; i < N * num_fdcts; i++) {    // Loop 3
  L1: f0 = tmp[i_1];     if (!tab[i_1])     goto L1;
  L2: f1 = tmp[1 + i_1]; if (!tab[1 + i_1]) goto L2;
  // remaining loads
  // computations
  // stores
  i_1 += 8;
}
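
The effect of the empty/full table on the FDCT-style access pattern can be estimated with a small cycle-stepped model. This is a sketch under assumptions that are not in the slides: each producer iteration (one column of an 8×8 block) and each consumer iteration (one row) takes exactly one cycle, and the consumer only sees data written in earlier cycles. The function name `simulate` is hypothetical.

```python
N = 8  # the slides' FDCT works on 8x8 blocks


def simulate(num_blocks):
    """Return the cycle count of the two-stage pipeline for `num_blocks` blocks."""
    size = N * N * num_blocks
    tab = [False] * size      # 1-bit empty/full table
    tmp = [0] * size          # intermediate data array
    prod_iter = 0             # producer iterations done (one column each)
    cons_iter = 0             # consumer iterations done (one row each)
    total_iters = N * num_blocks
    cycles = 0
    while cons_iter < total_iters:
        cycles += 1
        # Consumer (Loop 3): needs a whole row; stalls until all 8 cells are full.
        base = (cons_iter // N) * N * N + (cons_iter % N) * N
        if all(tab[base + c] for c in range(N)):
            cons_iter += 1
        # Producer (Loops 1 and 2): writes one column of the current block.
        if prod_iter < total_iters:
            block, col = divmod(prod_iter, N)
            for row in range(N):
                k = block * N * N + row * N + col
                tmp[k] = k          # placeholder for the computed value
                tab[k] = True       # flag the cell as full
            prod_iter += 1
    return cycles


blocks = 4
sequential = 2 * N * blocks   # all producer iterations, then all consumer iterations
pipelined = simulate(blocks)
print(sequential, pipelined)
```

In this model the consumer trails the producer by one block, because the first row of a block is complete only after the block's last column has been written. The overlap therefore grows with the number of blocks.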

18
Inter-Stage Communication
  • Hash-based solution

boolean tab[SIZE] = {0, 0, ..., 0};

for (i = 0; i < num_fdcts; i++) {        // Loop 1
  for (j = 0; j < N; j++) {              // Loop 2
    // loads
    // computations
    // stores
    tmp[H(48 + i_1)] = F6 >> 13; tab[H(48 + i_1)] = true;
    tmp[H(56 + i_1)] = F7 >> 13; tab[H(56 + i_1)] = true;
    i_1++;
  }
  i_1 += 56;
}

i_1 = 0;
for (i = 0; i < N * num_fdcts; i++) {    // Loop 3
  L1: f0 = tmp[H(i_1)];     if (!tab[H(i_1)])     goto L1;
  L2: f1 = tmp[H(1 + i_1)]; if (!tab[H(1 + i_1)]) goto L2;
  // remaining loads
  // computations
  // stores
  i_1 += 8;
}
19
Inter-Stage Communication
  • Hash-based solution
  • We did not want to include additional delays in
    the load/store operations
  • Use H(k) = k MOD m
  • When m is a power of two (m = 2^N),
  • H(k) can be implemented by just using the
    log2(m) least significant bits of k to address the
    cache (translates to simple interconnections)
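
For a power-of-two table size, k MOD m equals the low-order bits of k, which is why the hash costs only wiring in hardware. A short check of that identity (the function names are illustrative only):

```python
def hash_mod(k, m):
    """H(k) = k MOD m, as on the slide."""
    return k % m


def hash_low_bits(k, m):
    """Same hash computed by keeping only the log2(m) least significant bits."""
    if m <= 0 or m & (m - 1) != 0:
        raise ValueError("m must be a power of two")
    return k & (m - 1)


m = 16
# Both forms agree for every non-negative index.
print(all(hash_mod(k, m) == hash_low_bits(k, m) for k in range(1024)))
```

If m is not a power of two, the mask no longer matches the modulo and a real divider or a different hash would be needed.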

20
Inter-Stage Communication
  • Hash-based solution H(k) = k MOD m
  • Single read
  • (L = 1)
  • R = 1
  • ? 0
  • a) write
  • b) read
  • c) empty/full update

21
Inter-Stage Communication
  • Hash-based solution H(k) = k MOD m
  • Multiple reads (L > 1)
  • R = 11...1 (L ones)
  • ? >> R
  • a) write
  • b) read
  • c) empty/full update
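
One plausible reading of the multiple-read case is that each table cell holds L bits, set to all ones on a write and shifted right on every read, so the cell becomes empty after the L-th read. The sketch below implements that reading; it is an interpretation of the slide, not the documented hardware, and `MultiReadTable` is a hypothetical name.

```python
class MultiReadTable:
    """Hashed data memory with an L-bit empty/full entry per cell (assumed scheme)."""

    def __init__(self, m, reads_per_value):
        self.m = m
        self.L = reads_per_value
        self.data = [None] * m
        self.flags = [0] * m          # 0 means empty

    def write(self, k, value):
        h = k % self.m                # H(k) = k MOD m
        if self.flags[h] != 0:
            return False              # reads still pending: the producer must wait
        self.data[h] = value
        self.flags[h] = (1 << self.L) - 1   # R = 11...1 (L ones)
        return True

    def read(self, k):
        h = k % self.m
        if self.flags[h] == 0:
            return None               # empty: the consumer must wait
        self.flags[h] >>= 1           # one pending read has been served
        return self.data[h]


table = MultiReadTable(m=4, reads_per_value=2)
print(table.write(5, "x"))   # cell 1 is empty, so the write succeeds
print(table.write(9, "y"))   # 9 maps to the same cell, which is still full
print(table.read(5), table.read(5), table.read(5))
```

The model does not store the key, so it relies on the buffer being large enough that two live indices never share a cell.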

22
Buffer size calculation
  • By monitoring the behavior of the communication
    component
  • For each read and write
  • determine the size of the buffer needed to avoid
    collisions
  • Done during RTL simulation
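
The slides do this during RTL simulation. A software analogue, assuming a recorded trace of writes and reads in time order, single-read cells, and H(k) = k MOD m with m a power of two, could look like the following (both function names are hypothetical):

```python
def has_collision(events, m):
    """True if some write lands on a cell that still holds unread data."""
    full = [False] * m
    for op, k in events:
        h = k % m
        if op == "w":
            if full[h]:
                return True
            full[h] = True
        else:                 # "r": the single read empties the cell
            full[h] = False
    return False


def min_buffer_size(events, max_size):
    """Smallest power-of-two size up to max_size that avoids collisions, else None."""
    m = 1
    while m <= max_size:
        if not has_collision(events, m):
            return m
        m *= 2
    return None


trace = [("w", 0), ("w", 1), ("r", 0), ("w", 2), ("r", 1), ("r", 2)]
print(min_buffer_size(trace, 64))

strided = [("w", 0), ("w", 4), ("r", 0), ("r", 4)]
print(min_buffer_size(strided, 64))
```

The second trace shows that the required size depends on the index pattern and not only on how many values are live at once: two live values with indices 0 and 4 need eight cells under this hash.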

23
Experimental Setup
  • Compilation flow
  • Uses our previous work on compiling algorithms in
    a Java subset to FPGAs

24
Experimental Setup
  • Simulation back-end

[Figure: simulation back-end. XSLTs translate datapath.xml, fsm.xml and rtg.xml to dotty, hds, vhdl and java. The generated datapath.hds, fsm.java and rtg.java (built into fsm.class and rtg.class with an ANT build file) are simulated in HADES together with a library of operators (Java) and I/O data (RAMs and stimulus).]
25
Experimental Results
  • Benchmarks

Algorithm | Stages | Loops | Description
fdct | 2 (s1, s2) | 3 | Fast DCT (Discrete Cosine Transform)
fwt2D | 4 (s1, s2, s3, s4) | 8 | Forward Haar Wavelet
RGB2gray histogram | 2 (s1, s2) | 2 | Transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image
Smooth sobel, 3 versions (a), (b), (c) | 2 (s1, s2) | 6 | Smooth image operation based on 3×3 windows, with the resulting image fed to the sobel edge detector. (a) original code; (b) two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c) the same as (b) plus elimination of redundant array references in the original code of sobel.
26
Experimental Results
  • FDCT (speed-up achieved by Pipelining Sequences
    of Loops)

27
Experimental Results
Algorithm          | Input data size | Stages        | cc w/o PSL | Speed-up upper bound | cc w/ PSL  | Speed-up
fdct               | 800×600         | (s1,s2)       | 3,930,005  | 2.02                 | 1,830,215  | 2.02
                   |                 | (s1)          | 1,950,003  |                      |            |
                   |                 | (s2)          | 1,920,003  |                      |            |
Fwt2D              | 512×512         | (s1,s2,s3,s4) | 4,724,745  | 2.00                 | 3,664,917  | 1.29
                   |                 | (s1,s2)       | 2,362,373  |                      |            |
                   |                 | (s3,s4)       | 2,362,373  |                      |            |
RGB2gray histogram | 800×600         | (s1,s2)       | 6,720,025  | 1.75                 | 3,840,007  | 1.75
                   |                 | (s1)          | 2,880,015  |                      |            |
                   |                 | (s2)          | 3,840,015  |                      |            |
Smooth sobel (a)   | 800×600         | (s1,s2)       | 49,634,009 | 1.51                 | 32,929,489 | 1.51
                   |                 | (s1)          | 32,929,473 |                      |            |
                   |                 | (s2)          | 16,606,951 |                      |            |
Smooth sobel (b)   | 800×600         | (s1,s2)       | 30,068,645 | 1.81                 | 16,640,509 | 1.81
                   |                 | (s1)          | 13,364,109 |                      |            |
                   |                 | (s2)          | 16,606,951 |                      |            |
Smooth sobel (c)   | 800×600         | (s1,s2)       | 25,773,809 | 1.92                 | 13,364,117 | 1.92
                   |                 | (s1)          | 11,862,791 |                      |            |
                   |                 | (s2)          | 13,364,109 |                      |            |
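
The two speed-up columns appear consistent with simple ratios: the upper bound divides the cycle count without PSL by the slowest stage, and the achieved speed-up divides it by the cycle count with PSL. The sketch below checks this for the RGB2gray histogram and Fwt2D rows, using the numbers from the table; the formulas themselves are an inference, not something the slides state.

```python
def upper_bound(cc_without, stage_cycles):
    """Best case: the pipeline is limited by its slowest stage."""
    return cc_without / max(stage_cycles)


def speedup(cc_without, cc_with):
    """Achieved speed-up with pipelined sequences of loops (PSL)."""
    return cc_without / cc_with


rows = {
    "RGB2gray histogram": (6_720_025, [2_880_015, 3_840_015], 3_840_007),
    "Fwt2D": (4_724_745, [2_362_373, 2_362_373], 3_664_917),
}
for name, (total, stages, with_psl) in rows.items():
    print(name,
          round(upper_bound(total, stages), 2),
          round(speedup(total, with_psl), 2))
```

For RGB2gray histogram both ratios round to 1.75, so the pipelined version runs at essentially the speed of its slowest stage. For Fwt2D the achieved 1.29 stays well below the bound of 2.00.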
28
Experimental Results
  • What happens to the buffer sizes?

29
Experimental Results
  • Adjust latency of tasks in order to balance
    pipeline stages
  • Slow down tasks with higher latency
  • Optimization of slower tasks in order to reduce
    their latency
  • Slowdown of producer tasks usually reduces the
    size of the inter-stage buffers
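
Why slowing the producer can shrink the buffer is easy to see in a toy model that counts values produced but not yet consumed. The model assumes in-order consumption and fixed iteration periods, which is a simplification of the out-of-order case in the talk; `peak_occupancy` is a hypothetical helper.

```python
def peak_occupancy(num_items, producer_period, consumer_period):
    """Largest number of produced-but-unconsumed items over the whole run."""
    produced = consumed = peak = 0
    cycle = 0
    while consumed < num_items:
        cycle += 1
        # Producer finishes one item every `producer_period` cycles.
        if produced < num_items and cycle % producer_period == 0:
            produced += 1
        peak = max(peak, produced - consumed)
        # Consumer finishes one item every `consumer_period` cycles, if data exists.
        if consumed < produced and cycle % consumer_period == 0:
            consumed += 1
    return peak


print(peak_occupancy(64, producer_period=1, consumer_period=2))
print(peak_occupancy(64, producer_period=2, consumer_period=2))
```

With the producer twice as fast as the consumer, the backlog keeps growing until the producer finishes. Once the two rates match, a single buffer cell is enough in this model, and the total run time is unchanged because the consumer was the bottleneck anyway.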

30
Experimental Results
  • Buffer sizes

[Figure: buffer sizes. Labels: original; 1 cycle per iteration of the producer; 2 cycles per iteration of the producer; optimizations in the consumer; optimizations in the producer.]
31
Experimental Results
  • Buffer sizes

32
Experimental Results
  • Resources and Frequency (Spartan-3 400)

33
Related Work
  • Previous approach (Ziegler et al.)
  • Coarse-grained communication and synchronization
    scheme
  • FIFOs are used to communicate data between
    pipelining stages
  • Width of FIFO stages dependent on
    producer/consumer ordering
  • Less applicable

34
Conclusions
  • We presented a scheme to accelerate applications,
    pipelining sequences of loops
  • I.e., before the end of a stage (set of nested
    loops), a subsequent stage (set of nested loops)
    can start executing based on data already
    produced
  • Data-driven scheme is used based on empty/full
    tables
  • A scheme to reduce the size of the memory buffers
    for inter-stage pipelining (using a simple hash
    function)
  • Depending on the consumer/producer ordering,
    speedups close to theoretical ones are achieved
  • as if stages are concurrently and independently
    executed

35
Future Work
  • Research other hash functions
  • Study slowdown effects
  • Apply the technique in the context of Multi-Core
    Systems

Processor Core A
Processor Core B
Memory
Memory
36
Acknowledgments
  • Work partially funded by
  • CHIADO - Compilation of High-Level
    Computationally Intensive Algorithms to
    Dynamically Reconfigurable COmputing Systems
  • Portuguese Foundation for Science and Technology
    (FCT), POSI and FEDER, POSI/CHS/48018/2002
  • Based on the work done by Rui Rodrigues
  • In collaboration with Pedro C. Diniz

37
A Data-Driven Approach for Pipelining Sequences
of Data-Dependent Loops
technology from seed
38
Buffer Monitor