Recursive Variable Expansion: A Loop Transformation for Reconfigurable systems presentation

1
Recursive Variable Expansion: A Loop Transformation for Reconfigurable systems

Zubair Nawaz

2
Outline

Introduction
General Loop Transformations
Motivational Example
Goals for the new Technique
Proposed Transformation
Suitable Benchmarks
Future work

3
Introduction

Loop Parallelization
Why?
Problem
Data dependencies
Loop Transformation

4
General Loop Transformations

Derived for Parallel Computers
Commonly used loop-level optimizations for reconfigurable computing:
Loop Unrolling
Loop Tiling
Loop Fusion
Loop Peeling
Loop Fission
Loop Reversal
Loop Interchanging
Loop Skewing
Software Pipelining

5
When to apply loop transformation?

The most difficult question to answer.
The selection and sequence of suitable techniques depend on the
type of loop nest and the dependency relationships.

6
Motivational Example

For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For
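
As a point of reference, the loop nest can be written as a short runnable sketch. The function names, the 1-indexed grid layout, and the sample input values are assumptions made for illustration; the slides do not specify how A is initialised, only that row 1 and column 1 act as inputs.

```python
def make_inputs(n, m):
    """Hypothetical inputs: only row 1 and column 1 carry known values."""
    return [[10 * i + j if (i == 1 or j == 1) else 0
             for j in range(m + 1)]
            for i in range(n + 1)]


def run_loop_nest(A, c, n, m):
    """Run S1 and S2 in place on a 1-indexed grid A and return B."""
    B = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(2, n + 1):
        for j in range(2, m + 1):
            A[i][j] = A[i - 1][j] + c        # S1: depends on the previous row
            B[i][j] = A[i][j] + A[i][j - 1]  # S2: depends on S1 and the previous column
    return B


B = run_loop_nest(make_inputs(3, 3), c=1, n=3, m=3)
```

S1 carries a dependence along i and S2 one along j, which is why neither loop can be run in parallel as written.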

7
Data Dependency

For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For

Figure: iteration space showing the dependencies among the iterations

8
Loop Skewing

For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For
The inner loop is parallelizable.
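
A small sketch can check that skewing preserves the result and that the inner loop really is order-independent. Everything here is illustrative: the function names and sample data are hypothetical, and the sketch assumes the tight bounds I = 4..m+n and J <= I-2, chosen so that every visited (I, J) maps back to a valid iteration i = I-J of the original nest.

```python
import copy


def sample_grid(n, m):
    """Hypothetical inputs: row 1 and column 1 hold known values."""
    return [[7 * i + 3 * j if (i == 1 or j == 1) else 0
             for j in range(m + 1)]
            for i in range(n + 1)]


def original_nest(A, c, n, m):
    A = copy.deepcopy(A)
    B = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(2, n + 1):
        for j in range(2, m + 1):
            A[i][j] = A[i - 1][j] + c
            B[i][j] = A[i][j] + A[i][j - 1]
    return B


def skewed_nest(A, c, n, m, reverse_inner=False):
    A = copy.deepcopy(A)
    B = [[0] * (m + 1) for _ in range(n + 1)]
    for I in range(4, m + n + 1):                 # wavefront I = i + j
        js = range(max(2, I - n), min(m, I - 2) + 1)
        # Iterations on one wavefront only read values from wavefront I-1,
        # so they can run in any order (or in parallel).
        for J in (reversed(js) if reverse_inner else js):
            i = I - J
            A[i][J] = A[i - 1][J] + c
            B[i][J] = A[i][J] + A[i][J - 1]
    return B
```

Running the inner loop forwards and backwards gives the same B as the original nest, which is the property that lets hardware execute a whole wavefront at once.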

9
Speed and Area estimation
For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For

Assumptions:
Each addition takes one cycle.
Area is directly proportional to the number of
terms to be added.
Time: If the whole inner loop is executed in
parallel, each iteration of the outer I loop
takes 2 cycles (S1, then S2).
Total time = 2(m+n-1) cycles

10
Speed and Area estimation
For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For

Area: the number of terms to be added in each inner
iteration is 4.
Total area = 4 x the maximum number of inner-loop
iterations over all outer iterations.
The maximum number of terms to be added is therefore
4(min(m, n+2) - 1).
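
The time and area estimates for the skewed loop can be cross-checked by brute force. This is a sketch under the slide's own assumptions (one cycle per addition, area proportional to the number of added terms, outer loop I = 2..m+n with inner bounds max(2, I-n)..min(m, I)); the function names are hypothetical.

```python
def inner_trip_count(I, n, m):
    """Number of inner-loop iterations for outer index I."""
    return max(0, min(m, I) - max(2, I - n) + 1)


def skewed_estimates(n, m):
    """Return (time in cycles, area in addition terms) for the skewed nest."""
    outer = range(2, m + n + 1)
    time_cycles = 2 * len(outer)   # S1 then S2, one cycle each, per outer iteration
    widest = max(inner_trip_count(I, n, m) for I in outer)
    area_terms = 4 * widest        # 4 terms per inner iteration, all in parallel
    return time_cycles, area_terms


print(skewed_estimates(n=6, m=4))
```

For every n, m >= 2 tried, the brute-force values agree with the closed forms 2(m+n-1) and 4(min(m, n+2) - 1).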

11
Goals for the new Technique

How can a part of the code be transformed to extract
the maximum amount of parallelism and speed?
What is the maximum speedup one can achieve
in a program, given a lot of resources?
How can the reconfigurable architecture be used more
efficiently?

12
Proposed Transformation

Assumption: plenty of resources are available.
Parallelism is constrained by data dependencies.
Is it possible to transform the program such that
the resulting program is free of data dependencies?
That would imply that we can get maximum parallelism.
Yes.

13
Recursive Variable Expansion

Basic idea:
If there are two statements S_i and T_j, for
some iterations i and j, such that S_i δ T_j
(T_j depends on S_i), then the computation done in
S_i can be replicated in T_j, which makes T_j
independent of S_i. Repeat this until T_j is a
function of only the inputs or known values.
For example:
A[4, 3] = A[3, 3] + c
        = A[2, 3] + c + c
        = A[1, 3] + c + c + c
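
The substitution in the example can be mimicked with a few lines of code. This is a minimal sketch, not the author's tool: the tuple encoding of terms and the function name are assumptions, and it handles only statement S1 of the motivational example.

```python
def expand_A(i, j):
    """Expand A[i][j] via S1 until only the input A[1][j] and constants remain.

    Returns the list of terms to be added, e.g. [("A", 1, j), "c", "c", ...].
    """
    terms = [("A", i, j)]
    while terms[0][1] > 1:                  # leading term is still a computed value
        _, row, col = terms[0]
        # Replicate the computation of S1: A[row][col] = A[row-1][col] + c
        terms = [("A", row - 1, col), "c"] + terms[1:]
    return terms


print(expand_A(4, 3))
```

Each substitution step removes one dependence on an earlier iteration and adds one term, so A[i][j] ends up as a sum of i terms that depends on inputs only.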

14
Speed and Area Estimate
For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For

Area:
Statement S1 depends only on i. Let T(i) denote
its number of addition terms:
T(i) = T(i-1) + 1
Solution: T(i) = i
Let S(i) denote the number of addition terms in S2:
S(i) = 2T(i) = 2i
The number of addition terms in one iteration is
t(i) = S(i) + T(i) = 3i
The total number of addition terms for all the
iterations is the sum of t(i) over all iterations (i, j).
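
The closed form did not survive in the transcript. One way to evaluate the total, assuming t(i) = 3i is charged for every iteration of the original loop bounds (i = 2..n, j = 2..m), is:

```latex
\sum_{i=2}^{n}\sum_{j=2}^{m} t(i)
  = 3(m-1)\sum_{i=2}^{n} i
  = 3(m-1)\left(\frac{n(n+1)}{2} - 1\right)
  = \frac{3}{2}\,(m-1)(n-1)(n+2)
```

Under that assumption the area grows quadratically in n and linearly in m.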

15
Speed and Area Computation
For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For

Is this area always going to be high?
No. If the only output of the loop nest, or the only
variable used later, is B[n, m],
then there is no need to expand all the
intermediate computations on the FPGA: B[n, m]
can be computed directly after the expansion, as
it is a function of the inputs only. In this case the
number of addition terms is S(n) = 2n.
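
To make the 2n figure concrete, the sketch below builds the fully expanded term list for B[n][m] and compares its sum with the value produced by the original loop nest. The function names and the sample inputs are hypothetical, and the expansion assumes m >= 3 so that A[n][m-1] is itself a computed value rather than an input.

```python
def expanded_b_terms(row1, c, n, m):
    """Terms of B[n][m] after full expansion; row1[j] is the input A[1][j]."""
    from_a_nm = [row1[m]] + [c] * (n - 1)        # A[n][m]   = A[1][m]   + (n-1) c
    from_a_nm1 = [row1[m - 1]] + [c] * (n - 1)   # A[n][m-1] = A[1][m-1] + (n-1) c
    return from_a_nm + from_a_nm1


def b_by_loop(row1, col1, c, n, m):
    """Reference: run the original loop nest and return B[n][m]."""
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        A[1][j] = row1[j]
    for i in range(1, n + 1):
        A[i][1] = col1[i]
    b = None
    for i in range(2, n + 1):
        for j in range(2, m + 1):
            A[i][j] = A[i - 1][j] + c
            b = A[i][j] + A[i][j - 1]            # last write is B[n][m]
    return b


row1 = [0, 7, 3, 5, 11]       # index 0 unused
col1 = [0, 7, 2, 4, 6, 8]     # index 0 unused
terms = expanded_b_terms(row1, c=2, n=5, m=4)
```

The term list has 2n entries and none of them refers to an intermediate A value, so all of them can be fed to a single multi-operand adder.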

16
Speed and Area Estimate

Speed: Suppose we want to add n terms. We can add
all pairs of consecutive terms in parallel, then
repeat the same on the partial sums until only one
value is left.
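
The pairwise scheme described above can be sketched as follows. The function name is hypothetical, the input is assumed to be non-empty, and each pass of the loop stands for one level of adders working in parallel.

```python
def pairwise_sum(values):
    """Add a non-empty sequence by repeated pairwise addition.

    Returns (total, number of levels).
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # All additions in this comprehension are independent of each other.
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


total, depth = pairwise_sum(range(1, 9))
```

Each level halves the number of operands, so n terms need about log2(n) levels of two-input adders.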

17
Speed and Area Estimate

Can we exploit the flexibility of the FPGA to improve
this design?
Wallace tree

Figure: carry-save adder (CSA) for 3 inputs of 4 bits each
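
The behaviour of a carry-save adder can be illustrated on plain integers. This is a bit-level sketch rather than a hardware description, and the function name is hypothetical: the adder reduces three operands to a sum word and a carry word without propagating carries between bit positions.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (sum_word, carry_word) with a + b + c == sum + carry."""
    sum_word = a ^ b ^ c                              # per-bit sum, carries ignored
    carry_word = ((a & b) | (b & c) | (a & c)) << 1   # per-bit majority, shifted left
    return sum_word, carry_word


s, cy = carry_save_add(5, 9, 12)   # three 4-bit inputs
```

Because every bit position is computed independently, the delay of one CSA level does not depend on the operand width; only the final addition of the two remaining words needs carry propagation.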
18
Speed and Area Estimation
Wallace Tree for nine inputs
19
Speed and Area Estimation

The number of levels a Wallace tree needs to
add n inputs is log_1.5(n/2).
We can use a fast carry-propagate adder (CPA), which
takes log k levels for a k-bit addition.
As all the computation is done in parallel, the
time taken is the time taken by the term with the
maximum number of additions, which is S(n) = 2n.
If we assume that one cycle = 8 gate delays,
then time = (log_1.5 n + log k)/8 cycles.
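
The level count can be checked by simulating the reduction: every level groups the operands in threes and replaces each group with two words. The sketch below is illustrative only; the function names are hypothetical, the logarithm for the CPA is assumed to be base 2, and the cycle estimate simply evaluates the formula above for a given number of terms.

```python
import math


def wallace_levels(n):
    """Count 3:2 compressor levels needed to reduce n operands to 2."""
    levels = 0
    while n > 2:
        n = 2 * (n // 3) + n % 3   # each full group of 3 becomes 2 words
        levels += 1
    return levels


def estimated_cycles(num_terms, k_bits, gate_delays_per_cycle=8):
    """Evaluate (log_1.5(num_terms / 2) + log2(k)) / gate delays per cycle."""
    tree_levels = math.log(num_terms / 2, 1.5)
    cpa_levels = math.log2(k_bits)          # assumption: base-2 logarithm
    return (tree_levels + cpa_levels) / gate_delays_per_cycle


print(wallace_levels(9), math.ceil(math.log(9 / 2, 1.5)))
```

For nine inputs both the simulation and the rounded-up formula give four levels. The formula is an approximation, so the two can differ by a level for some operand counts.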

20
Other operations efficiently handled by a Wallace
tree

Subtract
Multiply
Multiplying n terms of k bits each requires
O(log n log k) cycles.

21
Comparison
22
Benchmark

DCT
16-tap FIR filter on a 32-element array
Matrix multiply: 12x6 and 6x4 integer matrices
Sobel Edge detection

23
Software Implementation

GPP: IBM PowerPC 405 at 250 MHz, implemented on
FPGA fabric.
SW: compiled using GCC 4.2.0 at optimization level
-O3, with the inner loop completely unrolled.
Instructions counted using PSIM.

24
Hardware Specifications

FPGA: Virtex II Pro XC2Vp30-7ff896
VHDL generated using Xilinx ISE version 8.2.022
Synthesis tool: XST
Simulator: ModelSim-SE

25
Results
26
Area

Virtex 4 -XC4VLX200 contains 89,088

27
Benefits

It removes dependencies from the part of the
program to which it is applied.
On reconfigurable computing (RC) platforms, it yields
a more efficient implementation than other
transformations, e.g. for addition, multiplication,
and subtraction.
A single technique exploits more parallelism,
without the broad selection and scheduling of
other transformations.
It can work with non-perfectly nested and
un-normalized loops.

28
Limitations

Like complete loop unrolling, it needs the loop
bounds to be known.
Some statements may grow exponentially.
It is more suitable for code with no or limited
control dependency.

29
Future work

Limiting area
Pipelining
Partial expansion
Handling loops with more extensive control
dependency.

30
Questions
31
Problems?

A(n, m) = A(n-1, m) + A(n, m-1)
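
A recurrence of this shape shows how the expansion can blow up. The sketch below counts the terms left after fully expanding A(n, m), assuming (hypothetically) that row 1 and column 1 are inputs and therefore count as a single term each; the function name is an assumption as well.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def expanded_terms(n, m):
    """Number of input terms in A(n, m) after full expansion."""
    if n == 1 or m == 1:        # assumed inputs: not expanded further
        return 1
    # Both operands are expanded, so their term counts add up.
    return expanded_terms(n - 1, m) + expanded_terms(n, m - 1)


print(expanded_terms(10, 10), comb(18, 9))
```

The count equals the binomial coefficient C(n+m-2, n-1), which grows exponentially along the diagonal n = m, in contrast to the linear growth T(i) = i of the motivational example.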
